I recently had to explain in depth to someone how WordPress generates duplicate content by “creating” many more pages on your web site than you do. This guy, a member of the Manchester WordPress User Group I run, was asking about rebuilding a specific page on his web site because the content no longer matched the title and the page wasn’t “ranking” in Google. He said:
On the question of page ranking, Google doesn’t seem to rank my existing page AT ALL – and it is certainly nowhere in the search results
I pointed out that Google had indexed his page; it just wasn’t listed anywhere in the first 1000 pages from his web site. To illustrate the pages that Google did have indexed on his site, I linked to the site-limited search results Google provides: http://www.google.com/search?q=site:example.com where you replace example.com with the domain of your site.
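If you check several sites this way, the site-limited search URL can be built with a few lines of Python. This is only a minimal sketch: the function name `site_search_url` is made up for illustration, and it does nothing more than assemble the URL described above.

```python
from urllib.parse import urlencode


def site_search_url(domain: str) -> str:
    """Build a Google search URL limited to pages from `domain`.

    `site_search_url` is a hypothetical helper, not part of any library.
    """
    # Keep ":" unescaped so the URL reads like the one in the article.
    query = urlencode({"q": f"site:{domain}"}, safe=":")
    return f"http://www.google.com/search?{query}"


print(site_search_url("example.com"))
```

Swap in your own domain for example.com and open the printed URL in a browser.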
Google normally indexes your archives
I surprised him by demonstrating that Google had almost 3,800 pages indexed:
But my site only has about 1,400 posts plus about 150 pages – so do I take it that the other stuff is ‘archives’? Are they of any use? And how would I get rid of them?
My response: “They are of great use, both to your readers and the search engines; however, read on.”
WordPress provides many ways to browse the posts on your site: in chronological order from the front page, by date, by category, by tag, and by author. Each of those views repeats your post content, which is why you end up with a lot of duplicate content in WordPress. The key is that you want those archive pages followed but not indexed.
Automatic Archives are to blame
Let’s say you have a post (“My Fantastic Pets”) created a few weeks ago, with a category (Cats) and three tags (Fluffy, Smoky, Killer) assigned. That post content could be found on the following URLs:
- http://example.com/2012/09/28/my-fantastic-pets/ (its permalink or canonical URL)
- http://example.com/page/2 (the second page of your blog if you’ve posted a lot since this one)
- http://example.com/2012/09/28 (a date archive, only if you posted more than once on the same day)
- http://example.com/2012/09/ (another date archive)
- http://example.com/2012/page/11 (yet another one)
- http://example.com/category/cats (a category archive – one of these for each category assigned)
- http://example.com/tag/fluffy (a tag archive)
- http://example.com/tag/smoky/page/2 (if you have a lot of posts about Smoky)
- http://example.com/author/ron (an author archive)
That first URL is the one true URL; all the rest are URLs with duplicate content. The same content can also be found in the feeds for some of those URLs (I suspect Google sometimes indexes feeds for blog searches), and even on your search result pages, such as http://example.com/?s=fantastic (though WordPress will never automatically generate such search links).
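To see how quickly the copies pile up, here is a rough Python sketch that lists the first page of each archive a single post can appear on. It is not how WordPress builds its URLs internally: the function `archive_urls` and its parameters are invented for illustration, and it assumes the day-and-name permalink pattern used in the example above. Paginated URLs such as /page/2 are left out because they depend on how much you have posted.

```python
from datetime import date


def archive_urls(base, slug, published, categories, tags, author):
    """Return (permalink, archives) for one post.

    Hypothetical helper: assumes /YYYY/MM/DD/slug/ permalinks and the
    /category/, /tag/ and /author/ prefixes shown in the article.
    """
    y, m, d = published.year, published.month, published.day
    permalink = f"{base}/{y}/{m:02d}/{d:02d}/{slug}/"

    # Date archives: day, month and year.
    archives = [
        f"{base}/{y}/{m:02d}/{d:02d}",
        f"{base}/{y}/{m:02d}/",
        f"{base}/{y}/",
    ]
    # One archive per category, per tag, and one for the author.
    archives += [f"{base}/category/{c}" for c in categories]
    archives += [f"{base}/tag/{t}" for t in tags]
    archives.append(f"{base}/author/{author}")
    return permalink, archives


permalink, archives = archive_urls(
    "http://example.com",
    "my-fantastic-pets",
    date(2012, 9, 28),
    categories=["cats"],
    tags=["fluffy", "smoky", "killer"],
    author="ron",
)
print(permalink)
for url in archives:
    print(url)
```

Even without pagination, the blog front page, feeds, or search results, that is eight archive URLs repeating the content of one post.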
Now, all this is great. Honestly! For your readers and for Google. Your readers have many, many ways of discovering your content. So too do Google and all the other crawlers.
That’s how you get duplicate content in WordPress
The key difference between your readers and Google is that Google will see lots of duplicate content (at least ten copies just for one category and three tags), so the important thing is to tell Google to follow the links on all those pages (which will take it to the one true URL), but not to index the archive pages themselves. Hence the robots meta tag you may have heard of: “noindex, follow”.
The goal is to have Google index each post, each page, and the home page of your site (in my earlier example: 1400+150+1) and nothing else. But you still want Google to follow the links on all those archive pages to make sure it finds every bit of your content.
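The rule can be written down in a few lines. The Python sketch below is only an illustration of the decision, not what any SEO plugin actually runs: the page-type names and the function `robots_meta` are assumptions made up for this example. It also adds up how many pages the site in my earlier example should end up with in the index.

```python
# Page types that should be indexed; everything else is an archive.
# These labels are hypothetical, chosen for this example only.
INDEXABLE = {"post", "page", "home"}


def robots_meta(page_type: str) -> str:
    """Return the robots meta tag a page of this type should carry."""
    if page_type in INDEXABLE:
        directive = "index, follow"
    else:
        # Archives: follow the links, but keep the page out of the index.
        directive = "noindex, follow"
    return f'<meta name="robots" content="{directive}">'


for page_type in ("post", "home", "category", "tag", "author", "date"):
    print(page_type, robots_meta(page_type))

# Posts + pages + home page from the earlier example.
expected_indexed = 1400 + 150 + 1
print(expected_indexed)  # 1551
```

Compare that total with the almost 3,800 pages Google had indexed: the difference is roughly the number of archive pages that would drop out of the index.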
The quick fix
The simplest way to fix this issue is to use a great SEO plugin like WordPress SEO by Yoast to automatically add those “noindex, follow” meta tags to all your archive pages. See the “SEO” > “Titles & Metas” settings page under the “Post Types” and “Taxonomies” tabs.
I will show exactly how to do just that in a later article. Meanwhile, let me know what you think in the comments below.