When it comes to a site’s search index, more pages isn’t necessarily better. Content duplication, whether of a whole site or of individual pages, is a common issue that can clutter a site’s search index and hurt overall organic search performance.
The same content served up via different URLs can confuse search engines, which cannot tell which version of the page (URL) should rank for a given search term or phrase. That uncertainty can hurt organic visibility for all versions of the page.
Duplicate pages also split link equity across several URLs rather than consolidating it on one. If a page can be reached via several different URLs, other sites can also link to it in several different ways.
In extreme examples, enough duplicate content can trigger a search engine penalty, especially if the duplication is perceived to be deceptive or manipulative.
How Does Duplicate Content Happen?
From a technical perspective, there are five common causes of duplicate content:
1. WWW vs Non-WWW
Many people assume that www.yoursite.com and yoursite.com are the same page. In fact, search engines treat the two URLs as completely different pages. Allowing every page on a site to be served both ways duplicates the whole site.
2. HTTPS vs HTTP
Many sites have both secure (https) and non-secure (http) versions. As with www and non-www, https://www.yoursite.com/ and http://www.yoursite.com/ are not the same page to search engines. A site should be accessible via one or the other, not both. Ideally, the preferred version would be the secure one (https), as Google has indicated that having a secure site is a positive ranking factor and, more recently, announced that it would index the secure versions of pages first.
3. Parameters and Session IDs
Websites often use parameters for filtering or to track visitors. Similarly, Session IDs are used to track visitors, for example to keep track of the items they’ve placed in their shopping cart. These parameters or Session IDs are appended to the original URL without changing the content on the page. Again, https://www.yoursite.com/ is a different page from https://www.yoursite.com/?source=rss.
4. Trailing Slash vs Non-Trailing Slash
These two pages, https://www.yoursite.com/duplicate-content/ and https://www.yoursite.com/duplicate-content, are also two different pages in the eyes of search engines because one URL contains a trailing slash and the other doesn’t. Pages on a site should be retrievable either with the trailing slash or without it, but not both ways. Starting to sense a theme here?
5. Default Web Server Extensions
Finally, default web server extensions, like .html, can be a source of duplicate content. To drive the point home once more, https://www.yoursite.com/default.aspx is a different page from https://www.yoursite.com, even if both return the homepage.
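To make the five causes concrete, here is a minimal Python sketch that collapses each kind of variant into one preferred URL. Everything in it is an assumption for illustration, not a recommendation from any search engine: the preferred scheme and host, the choice to always use a trailing slash, the list of tracking parameters, and the list of default documents would all need to match the real site.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed site policy (hypothetical values; adjust to the real site).
PREFERRED_SCHEME = "https"
PREFERRED_HOST = "www.yoursite.com"
BARE_HOST = "yoursite.com"
TRACKING_PARAMS = {"source", "sessionid"}   # parameters assumed not to change content
DEFAULT_DOCUMENTS = {"default.aspx", "index.html"}


def canonicalize(url: str) -> str:
    """Map any variant of a page's URL to the single preferred URL."""
    parts = urlsplit(url)

    # Cause 1 (www vs non-www): send the bare host to the preferred host.
    host = parts.netloc.lower()
    if host == BARE_HOST:
        host = PREFERRED_HOST

    # Cause 5 (default documents): /default.aspx becomes /.
    path = parts.path or "/"
    head, _, last = path.rpartition("/")
    if last.lower() in DEFAULT_DOCUMENTS:
        path = head + "/"
    # Cause 4 (trailing slash): this sketch always adds one to
    # extension-less paths; the opposite policy is equally valid.
    elif not path.endswith("/") and "." not in last:
        path += "/"

    # Cause 3 (parameters and session IDs): drop tracking-only parameters.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS
    ]

    # Cause 2 (https vs http): always emit the preferred scheme.
    return urlunsplit((PREFERRED_SCHEME, host, path, urlencode(kept), ""))


variants = [
    "http://yoursite.com",
    "https://www.yoursite.com/?source=rss",
    "https://www.yoursite.com/default.aspx",
]
print({canonicalize(u) for u in variants})
```

All three variants in the usage example reduce to the same URL, which is the goal: one page, one address.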
Ensuring that only one version of each page on a site exists is essential to maintaining a healthy search index. Page duplication undermines this essential component of overall site health and positive organic search performance.
Basically, if the URL changes in any way and the on-page content doesn’t, there is the potential for duplicate content issues. This duplication needs to be addressed via 301 redirects or directives such as a canonical tag or a site’s robots.txt file. The method chosen depends on the situation.
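As a rough illustration of the first two remedies, the Python sketch below shows what a 301 redirect and a canonical tag might look like once the preferred URL for a page is known. The function names and the way the preferred URL is supplied are hypothetical, and a robots.txt rule is not shown; in practice these fixes usually live in the web server or CMS configuration rather than in application code.

```python
from html import escape
from typing import Dict, Optional, Tuple


def redirect_response(requested_url: str, preferred_url: str) -> Optional[Tuple[int, Dict[str, str]]]:
    """Return a 301 status and Location header when the request used a
    duplicate URL, or None when it already used the preferred one."""
    if requested_url == preferred_url:
        return None
    return 301, {"Location": preferred_url}


def canonical_link_tag(preferred_url: str) -> str:
    """Build the <link rel="canonical"> element for a page's <head>."""
    return f'<link rel="canonical" href="{escape(preferred_url, quote=True)}">'


# Hypothetical request for the non-secure, non-www homepage.
print(redirect_response("http://yoursite.com/", "https://www.yoursite.com/"))
print(canonical_link_tag("https://www.yoursite.com/duplicate-content/"))
```

A redirect suits duplicates that visitors never need to see, such as the http or non-www versions. A canonical tag suits variants that must stay reachable, such as URLs carrying tracking parameters.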