Duplicate Content in WordPress

I recently had to explain in-depth to someone how WordPress generates duplicate content by “creating” many more pages on your web site than you do. This guy, a member of the Manchester WordPress User Group I run, was asking about rebuilding a specific page on his web site because the content no longer matched the title and the page wasn’t “ranking” in Google. He said:

On the question of page ranking, Google doesn’t seem to rank my existing page AT ALL – and it is certainly nowhere in the search results

I pointed out that his page was indexed by Google, just not listed anywhere in the first 1000 pages from his web site. To illustrate the pages that Google did have indexed on his site, I linked to the site-limited search results Google provides: http://www.google.com/search?q=site:example.com where you replace example.com with the domain of your site.
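If you want to build that site-limited search link for several domains, here is a minimal Python sketch. The function name site_search_url is my own invention for illustration, and I am assuming the https form of the Google search address; the only real ingredient is the site: operator passed in the q query parameter.

```python
from urllib.parse import urlencode


def site_search_url(domain: str) -> str:
    """Return a Google search URL limited to pages from `domain`.

    Hypothetical helper: the name and the https scheme are assumptions,
    not something Google or WordPress provides.
    """
    # The site: operator goes in the q parameter; urlencode escapes the colon.
    return "https://www.google.com/search?" + urlencode({"q": f"site:{domain}"})


print(site_search_url("example.com"))
```

Paste the printed link into your browser to see roughly which pages Google has indexed for that domain.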

Google normally indexes your archives

I surprised him by demonstrating that Google had almost 3,800 pages indexed:

But my site only has about 1,400 posts plus about 150 pages – so do I take it that the other stuff is ‘archives’? Are they of any use? And how would I get rid of them?

My response: “They are of great use, both to your readers and the search engines; however, read on.” So, if you are asking yourself “Why has my WordPress website got duplicated content on every page?” you too should read on.

WordPress provides many ways to browse the posts on your site: in chronological order from the front page; by date; by category; by tag; and by author. Each of those views repeats the content of your posts, and thus you have a lot of duplicate content in WordPress. The key is: you want those archive pages followed but not indexed.

Automatic Archives are to blame

Let’s say you have a post (“My Fantastic Pets”) created a few weeks ago, with a category (Cats) and three tags (Fluffy, Smoky, Killer) assigned. That post content could be found on the following URLs:

  • http://example.com/2012/09/28/my-fantastic-pets/ (its permalink or canonical URL)
  • http://example.com/page/2 (the second page of your blog if you’ve posted a lot since this one)
  • http://example.com/2012/09/28 (a date archive, only if you posted more than once on the same day)
  • http://example.com/2012/09/ (another date archive)
  • http://example.com/2012/page/11 (yet another one)
  • http://example.com/category/cats (a category archive – one of these for each category assigned)
  • http://example.com/tag/fluffy (a tag archive)
  • http://example.com/tag/smoky/page/2 (if you have a lot of posts about Smoky)
  • http://example.com/tag/killer
  • http://example.com/author/ron (an author archive)
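To make the list above concrete, here is a rough Python sketch that enumerates the URLs for a single post. It assumes the day-and-name permalink structure used in the example and the default category, tag, and author bases; the function name archive_urls is hypothetical. It only lists the first page of each archive, since the paginated variants (/page/2 and so on) depend on how many posts you have.

```python
from datetime import date


def archive_urls(base, published, slug, categories, tags, author):
    """List the permalink plus the first page of each view that shows the post.

    Assumes day-and-name permalinks and default archive bases (an assumption,
    your site's settings may differ).
    """
    y, m, d = f"{published:%Y}", f"{published:%m}", f"{published:%d}"
    urls = [f"{base}/{y}/{m}/{d}/{slug}/"]          # the permalink
    urls.append(f"{base}/")                         # blog front page (or a later /page/N)
    urls += [f"{base}/{y}/{m}/{d}/",                # day archive
             f"{base}/{y}/{m}/",                    # month archive
             f"{base}/{y}/"]                        # year archive
    urls += [f"{base}/category/{c}/" for c in categories]
    urls += [f"{base}/tag/{t}/" for t in tags]
    urls.append(f"{base}/author/{author}/")
    return urls


urls = archive_urls("http://example.com", date(2012, 9, 28), "my-fantastic-pets",
                    ["cats"], ["fluffy", "smoky", "killer"], "ron")
for url in urls:
    print(url)
print(len(urls), "URLs for one post")
```

With one category and three tags that comes to ten URLs, only the first of which is the permalink.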

That first URL is the one true URL; all the rest are URLs with duplicate content. The same content can also be found on feeds for some of those URLs (I suspect Google sometimes indexes feeds for blog searches). It can even appear on your search result pages, such as http://example.com/?s=fantastic (though WordPress will never automatically generate such search links).

The same content found on 7 pages.

Now, all this is great. Honestly! For your readers and for Google. Your readers have many, many ways of discovering your content. So too does Google and all the other crawlers.

That’s how you get duplicate content in WordPress

The key difference is that Google will see lots of duplicate content (at least ten copies just for one category and three tags). So the important thing is to tell Google to follow the links on all those pages (which will take it to the one true URL), but not to index the archive pages themselves. Hence the robots meta tag value you may have heard of: “noindex, follow”.

The goal is to have Google index each post, each page, and the home page of your site (in my earlier example: 1400+150+1) and nothing else. But you still want Google to follow the links on all those archive pages to make sure it finds every bit of your content.
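The rule described above can be sketched in a few lines of Python: given a URL path, decide which robots directive the page should carry. This is only an illustration of the idea, not how WordPress or any plugin implements it. The patterns assume the day-and-name permalink structure and the default category, tag, and author bases from my example, and the names ARCHIVE_PATTERNS and robots_directive are made up. On a real page the result would appear as a tag such as <meta name="robots" content="noindex, follow"> in the head.

```python
import re

# Assumed URL shapes for archive pages (default WordPress bases).
ARCHIVE_PATTERNS = [
    r"^/page/\d+/?$",                                  # older-posts pages of the blog
    r"^/\d{4}(/\d{2}){0,2}(/page/\d+)?/?$",            # year, month and day archives
    r"^/(category|tag|author)/[^/]+(/page/\d+)?/?$",   # taxonomy and author archives
]


def robots_directive(path: str) -> str:
    """Return the robots meta value a page at `path` should carry."""
    if any(re.match(pattern, path) for pattern in ARCHIVE_PATTERNS):
        return "noindex, follow"   # crawl the links, keep the page out of the index
    return "index, follow"         # posts, pages and the home page


for path in ["/", "/2012/09/28/my-fantastic-pets/", "/2012/09/",
             "/tag/smoky/page/2", "/author/ron"]:
    print(path, "->", robots_directive(path))
```

The permalink and the home page come out as “index, follow”; every archive path comes out as “noindex, follow”.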

The quick fix

The simplest way to fix this issue is to use a great SEO plugin like WordPress SEO by Yoast to automatically add those “noindex, follow” meta tags to all your archive pages. See the “Post Types” and “Taxonomies” tabs on the “SEO” > “Titles & Metas” settings page.
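Once the plugin is configured, you can confirm the result by viewing the source of an archive page and looking for the robots meta tag. Here is a small Python sketch that does the same check on a piece of HTML, using only the standard library. The sample markup is made up for illustration (I am not reproducing the plugin's exact output), and the names RobotsMetaFinder and robots_directives are hypothetical.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collect the directives from a <meta name="robots"> tag, if present."""

    def __init__(self):
        super().__init__()
        self.directives = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            self.directives = {d.strip().lower() for d in content.split(",")}


def robots_directives(html):
    """Return the set of robots directives in `html`, or None if there is no tag."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    return finder.directives


# Made-up sample standing in for the source of an archive page.
sample = ('<html><head><meta name="robots" content="noindex, follow">'
          '</head><body></body></html>')
found = robots_directives(sample)
print(found is not None and "noindex" in found)
```

An archive page should report “noindex”, while a post's own permalink should not.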

I will show exactly how to do just that in a later article. Meanwhile, let me know what you think in the comments below.

Comments

  1. Hi Mike

    Great article mate, thanks. But where is the follow up article on how to correctly set up WP SEO?

  2. Hello Mike

    Thanks for the posting. I was wondering how I could control the plethora of ‘search results’ from my sites.

    I assume, I need to check the ‘index, no follow’ for the items I don’t want to appear in the search results, and that the ‘WordPress SEO Meta Box:’ ‘Hide’ toggle is because there is no need to add meta data text to items that don’t appear as search snippets?

    Regards
    Doug

  3. mikelittle says:

    If you check “noindex, follow” in the SEO > Titles and Meta settings page of the plugin for a specific type of content (such as, Attachments), Google will follow all the links on each attachment page it finds, but will not add the attachment pages to its indexes. So those pages will not show up in search engine result pages (SERPs).

    The “Hide” checkbox will hide the plugin’s Metabox or panel on WordPress’ edit pages for those content types. You do that when you aren’t interested in optimising those content types in any way.

    This is independent of whether they appear in search results: You could have them appear in search results but not be interested in optimising them, but you probably wouldn’t want to optimise them if you were excluding them from search results.

    However, the latter doesn’t necessarily follow. After all, some of the optimisation advice the plugin gives is just about writing well (albeit how Google thinks), for example you might want to check the Flesch reading ease and other metrics regardless of whether the pages appear in SERPs.

    Hope that helps,
    Mike

  4. Eibhlin says:

    HUGE thanks! I didn’t realize the use of those boxes in the Post Types and Taxonomies sections. This has been a major SEO issue for one of my sites, and I’m very grateful for this post. Thank you!

  5. Thanks Mike. I always know where to come for clarity on confusing (for me) WP areas :)

    See you soon , Saz.

  6. Michael Hutton says:

    Thanks for this article Mike. I already had the WordPress SEO plugin installed and thought I had carefully gone through and set it all up correctly but had overlooked what you have highlighted. Glad I came across this as I have now managed to fix some duplicate content issues I was having :-)

  7. I have been searching for this answer for days. I tried editing the HTML over FTP and in cPanel, but it kept changing back to the default. I tried a canonical tag in the script, a robots.txt exclusion, and a sitemap, but Google still indexed the duplicate pages.
    This worked. I checked the source of all my custom-URL posts,
    and of the older-posts pages linked from the home page (page/2/ etc.), and they all have the meta tag.
    The home page and the custom pages and posts will be indexed, but not the duplicate copies. I now have confidence that search engines will not index those pages. Now for the job of removing the pages from the index through the Remove URLs tool in Google WMT.

    THANK YOU

  8. Abhi says:

    Thanks mate, I was banging my head in all the wrong places and finally I know how to resolve the duplicate issue. Off to making the changes:)

  9. omri says:

    Hey and firstly thanks for this great article!

    I’m facing a duplicate content problem that you didn’t mention here, and it would be great to get some advice. I have a blog with about 15 posts. In the first months after I opened it, I used the “Your latest posts” option in the reading settings, set it to show 3 posts per page, and found that Google indexed these pages.

    About two months ago I created a static page for my home page and changed the reading settings to a “static page”. Since then, when I type site:mysite.com at Google, I see that in addition to my posts, Google also shows these pages: mysite.com/page/2 mysite.com/page/3 mysite.com/page/4…. and all of them contain the exact same content as my home page! These pages cause a big duplicate content problem although they do not exist anymore, and I really don’t know what to do to make them disappear from Google’s search results… I have lost my rankings since this happened and I believe this issue has a lot to do with it… Any help guys on how to solve it?

    Thanks

    • Hi Omri,
      Your site should have a canonical meta tag in the header of each page. So although WordPress will show the same content for your home page with /page/2, /page/3 etc. (as does mine), the canonical tag in the header points back to the main URL. Google will honour this tag (they invented it) and not treat the content as duplicate.
      If the tag is not there, it is likely your theme is not doing the right thing. Contact your theme author about it.

      • omri says:

        Hi Mike, I don’t know why but I saw your reply just now after getting an email alert so thanks :)

        I checked my site and other sites that I use with the same theme and noticed that the canonical meta tag exists but it is a bit different than in your site. For example, when I check the page here: mysite.com/page/2 the canonical meta tag for this page look like this:

        I don’t know if it is supposed to be like that (with the 2 at the end of the link) or like that (without the 2 at the end)

        If it is not OK this way can you please tell me how can I fix it or what should I do? I will really appreciate your advice… Thanks

        Omri

        • Unfortunately, your illustrative meta tags got stripped by the commenting system. You can enclose them in < code> tags or perhaps you could send me the link to the site in question (I won’t publish it here).

          • omri says:

            Hi Mike and thanks for the answer. I just checked and it seems that the canonical tag is the same in all my sites, not just the sites using this specific theme. Maybe it is due to one of my plugins? Here is an example of one of my sites that gets the tag for page 2:

            [url redacted]

            I really don’t know if it hurts my seo or not because some of my sites are ranked high in Google, however it will be great to know how to fix this problem quickly in my different sites if you think that it can improve my rankings…

            By the way, for the specific site that I mentioned in my earlier post a few months ago, I told Google several weeks ago to remove the indexed pages 2, 3 and 4 from their index via the webmaster tools. These pages have not shown up in their index since then, but do you think that removing these pages was a good idea or should I disable the removal request?

            Thanks a lot for your time and advice!

          • Hi Omri,
            It looks like All in One SEO Pack is adding in that meta tag for you and clearly has a bug in it. Best thing is to contact the plugin author to ask them to fix it.

            Yes, asking Google to remove them was the right thing to do, they would definitely see it as duplicate content. Once you have a fixed version of the plugin, you can disable the removal requests if you go back to putting posts on the front page (though the urls are invalid anyway, and shouldn’t recur)

  10. omri says:

    Hi Mike and thanks again. I think that I figured it out:

    In the All in One SEO plugin there is an option called “No Pagination for Canonical URLs:”. Once I chose it, all the pages now have this tag:

    (without the number after it.)

    Is it the solution? And should I choose this option now for all my wordpress blogs that have static home page? Let me know what you think… Thanks a lot!
