Duplicate Content in WordPress

I recently had to explain in depth to someone how WordPress generates duplicate content by “creating” many more pages on your web site than you do. This guy, a member of the Manchester WordPress User Group I run, was asking about rebuilding a specific page on his web site because the content no longer matched the title and the page wasn’t “ranking” in Google. He said:

On the question of page ranking, Google doesn’t seem to rank my existing page AT ALL – and it is certainly nowhere in the search results

I pointed out that his page was indexed by Google: just not listed anywhere in the first 1000 pages from his web site. To illustrate the pages that Google did have indexed on his site, I linked to the site-limited search results Google provides: http://www.google.com/search?q=site:example.com where you replace example.com with the domain of your site.
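
If you want to build that site-limited search link for several domains, a minimal sketch follows. The function name site_search_url is hypothetical; the only thing it relies on is Google's site: search operator and standard URL encoding.

```python
from urllib.parse import urlencode


def site_search_url(domain: str) -> str:
    """Build a Google search URL limited to one domain via the site: operator."""
    # urlencode percent-encodes the colon, which Google accepts just the same.
    return "https://www.google.com/search?" + urlencode({"q": f"site:{domain}"})


print(site_search_url("example.com"))
# prints: https://www.google.com/search?q=site%3Aexample.com
```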

Google normally indexes your archives

I surprised him by demonstrating that Google had almost 3,800 pages indexed:

But my site only has about 1,400 posts plus about 150 pages – so do I take it that the other stuff is ‘archives’? Are they of any use? And how would I get rid of them?

My response: “They are of great use, both to your readers and the search engines; however, read on.” So, if you are asking yourself “Why has my WordPress website got duplicated content on every page?” you too should read on.

WordPress provides many ways to browse the posts on your site: in chronological order from the front page, by date, by category, by tag, and by author. Each of those routes is an archive page that repeats your post content, and thus you have a lot of duplicate content in WordPress. The key is: you want those archive pages followed but not indexed.

Automatic Archives are to blame

Let’s say you have a post (“My Fantastic Pets”) created a few weeks ago, with a category (Cats) and three tags (Fluffy, Smoky, Killer) assigned. That post content could be found on the following URLs:

  • http://example.com/2012/09/28/my-fantastic-pets/ (its permalink or canonical URL)
  • http://example.com/page/2 (the second page of your blog if you’ve posted a lot since this one)
  • http://example.com/2012/09/28 (a date archive, only if you posted more than once on the same day)
  • http://example.com/2012/09/ (another date archive)
  • http://example.com/2012/page/11 (yet another one)
  • http://example.com/category/cats (a category archive – one of these for each category assigned)
  • http://example.com/tag/fluffy (a tag archive)
  • http://example.com/tag/smoky/page/2 (if you have a lot of posts about Smoky)
  • http://example.com/tag/killer
  • http://example.com/author/ron (an author archive)
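
To make the multiplication concrete, here is a rough sketch that lists the URLs on which a single post can appear. It assumes the day-and-name permalink structure used in the example above; the function archive_urls and its parameters are made up for illustration and are not part of WordPress. Which paginated page (such as /page/2) a post lands on depends on how many posts you have, so the sketch lists only the first page of each archive.

```python
from datetime import date


def archive_urls(base, slug, published, categories, tags, author):
    """List the URLs where one post's content can be found (first pages only)."""
    y, m, d = f"{published:%Y}", f"{published:%m}", f"{published:%d}"
    urls = [f"{base}/{y}/{m}/{d}/{slug}/"]          # the permalink (canonical URL)
    urls.append(f"{base}/")                         # the blog index
    urls += [f"{base}/{y}/{m}/{d}/",                # day archive
             f"{base}/{y}/{m}/",                    # month archive
             f"{base}/{y}/"]                        # year archive
    urls += [f"{base}/category/{c}/" for c in categories]
    urls += [f"{base}/tag/{t}/" for t in tags]
    urls.append(f"{base}/author/{author}/")         # author archive
    return urls


urls = archive_urls(
    "http://example.com", "my-fantastic-pets", date(2012, 9, 28),
    categories=["cats"], tags=["fluffy", "smoky", "killer"], author="ron",
)
for url in urls:
    print(url)
print(len(urls), "URLs for one post")  # 10 URLs for one post
```

Every extra category or tag you assign adds another archive URL carrying the same content.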

That first URL is the one true URL; all the rest are URLs with duplicate content. The same content can also be found in the feeds for some of those URLs (I suspect Google sometimes indexes feeds for blog searches), and even on your search result pages, such as http://example.com/?s=fantastic (though WordPress will never automatically generate such search links).

[Screenshot: Google search results showing the same content found on 7 pages.]

Now, all this is great. Honestly! For your readers and for Google. Your readers have many, many ways of discovering your content. So too do Google and all the other crawlers.

That’s how you get duplicate content in WordPress

The key difference between readers and Google is that Google will see lots of duplicate content (at least ten copies of a post with just one category and three tags). So the important thing is to tell Google to follow the links on all those pages (which will take it to the one true URL), but not to index the archive pages themselves. Hence the robots meta tag setting you may have heard of: “noindex, follow”.

The goal is to have Google index each post, each page, and the home page of your site (in my earlier example: 1400+150+1) and nothing else. But you still want Google to follow the links on all those archive pages to make sure it finds every bit of your content.
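
As an illustration of that rule, the sketch below decides which robots directive a URL should get and prints the matching meta tag. It is only a sketch under assumptions: the URL patterns match the example permalink structure used in this post, and the names ARCHIVE_PATTERNS, robots_directive and robots_meta_tag are invented here. An SEO plugin makes this decision inside WordPress from the type of page being rendered, not by pattern-matching URLs.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Assumed URL shapes for archive pages, based on the examples in this post.
ARCHIVE_PATTERNS = [
    r"^/category/",                           # category archives
    r"^/tag/",                                # tag archives
    r"^/author/",                             # author archives
    r"^/page/\d+/?$",                         # later pages of the blog index
    r"^/\d{4}(/\d{2}){0,2}(/page/\d+)?/?$",   # year, month and day archives
]


def robots_directive(url: str) -> str:
    """Return 'noindex, follow' for archive and search URLs, else 'index, follow'."""
    parts = urlsplit(url)
    if "s" in parse_qs(parts.query):          # WordPress search results (?s=...)
        return "noindex, follow"
    if any(re.search(p, parts.path) for p in ARCHIVE_PATTERNS):
        return "noindex, follow"
    return "index, follow"                    # posts, pages and the home page


def robots_meta_tag(url: str) -> str:
    """Render the robots meta tag that would go in the page's <head>."""
    return f'<meta name="robots" content="{robots_directive(url)}">'


print(robots_meta_tag("http://example.com/2012/09/28/my-fantastic-pets/"))
print(robots_meta_tag("http://example.com/tag/smoky/page/2"))
```

The first call yields an “index, follow” tag for the post itself; the second yields “noindex, follow” for the tag archive.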

The quick fix

The simplest way to fix this issue is to use a great SEO plugin like WordPress SEO by Yoast to automatically add those “noindex, follow” meta tags to all your archive pages. See the “SEO” > “Titles & Metas” settings page under the “Post Types” and “Taxonomies” tabs.
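
Once the plugin is configured, you can check that an archive page really carries the tag by viewing its source, or with a small script like the sketch below. It uses only Python's standard html.parser module; the class and function names are hypothetical, and the sample HTML strings stand in for page source you would fetch from your own site.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from every <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives += [
                d.strip().lower() for d in content.split(",") if d.strip()
            ]


def is_noindex_follow(html: str) -> bool:
    """True if the page asks crawlers not to index it but still follow its links."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives and "nofollow" not in parser.directives


# Sample page sources (made up for this example).
archive_html = '<html><head><meta name="robots" content="noindex,follow"/></head></html>'
post_html = "<html><head><title>My Fantastic Pets</title></head></html>"

print(is_noindex_follow(archive_html))  # True
print(is_noindex_follow(post_html))     # False
```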

I will show exactly how to do just that in a later article. Meanwhile, let me know what you think in the comments below.

Comments

  1. Hi Mike

    Great article mate, thanks. But where is the follow up article on how to correctly set up WP SEO?

  2. Hello Mike

    Thanks for the posting I was wondering how I could control the plethora of ‘search results’ from my sites.

    I assume, I need to check the ‘index, no follow’ for the items I don’t want to appear in the search results, and that the ‘WordPress SEO Meta Box:’ ‘Hide’ toggle is because there is no need to add meta data text to items that don’t appear as search snippets?

    Regards
    Doug

  3. mikelittle says:

    If you check “noindex, follow” in the SEO > Titles & Metas settings page of the plugin for a specific type of content (such as Attachments), Google will follow all the links on each attachment page it finds, but will not add the attachment pages to its index. So those pages will not show up in search engine result pages (SERPs).

    The “Hide” checkbox will hide the plugin’s Metabox or panel on WordPress’ edit pages for those content types. You do that when you aren’t interested in optimising those content types in any way.

    This is independent of whether they appear in search results: You could have them appear in search results but not be interested in optimising them, but you probably wouldn’t want to optimise them if you were excluding them from search results.

    However, the latter doesn’t necessarily follow. After all, some of the optimisation advice the plugin gives is just about writing well (albeit how Google thinks). For example, you might want to check the Flesch reading ease and other metrics regardless of whether the pages appear in SERPs.

    Hope that helps,
    Mike

  4. Eibhlin says:

    HUGE thanks! I didn’t realize the use of those boxes in the Post Types and Taxonomies sections. This has been a major SEO issue for one of my sites, and I’m very grateful for this post. Thank you!

  5. Thanks Mike. I always know where to come for clarity on confusing (for me) WP areas :)

    See you soon , Saz.

  6. Michael Hutton says:

    Thanks for this article Mike. I already had the WordPress SEO plugin installed and thought I had carefully gone through and set it all up correctly but had overlooked what you have highlighted. Glad I came across this as I have now managed to fix some duplicate content issues I was having :-)

  7. I have been searching for this answer for days, tried edit html script in ftp and cpanel keep changing back to default. Canonical in the script, robots txt exclusion, sitemap but Google still indexed the duplicate pages.
    This worked I checked all scripts of my custom name url posts.
    The latest post with the previous page from home page (page/2/) etc. And they all have the meta tag
    Home page will be indexed and custom pages posts but not the duplicate copies. I now have confidence search engines will not index those pages. Now the job of removing pages from index through remove urls Google index in WMT.

    THANK YOU

  8. Abhi says:

    Thanks mate, I was banging my head in all the wrong places and finally I know how to resolve the duplicate issue. Off to making the changes:)
