Upgrade Your Drupal Skills

We trained 1,000+ Drupal Developers over the last decade.

See Advanced Courses NAH, I know Enough

How Do You Deal with Duplicate Content in Drupal? 4 Modules to Get this Issue Fixed

Parent Feed: 

Accidentally creating duplicate content in Drupal is like... a cold: 

Catching it is as easy as falling off a log.

All it takes is to:
 

  • further submit your valuable content on other websites, as well, and thus challenging Google with 2 or more identical pieces of content
  • move your website from HTTP to HTTPs, but skip some key steps in the process, so that the HTTP version of your Drupal is still there, “lurking in the dark”
  • have printer-friendly versions of your Drupal site and thus dare Google to face another duplicate content “dilemma”
     

So, what are the “lifebelts” or prevention tools that Drupal “arms” you with for handling this thorny issue?

Here are the 4 modules to use for boosting your site's immunity system against duplicate content.

And for getting it fixed, once the harm has already been made:
 

1. But How Does It Crawl into Your Website? Main Sources of Duplicate Content 

Let's get down to the nitty-gritty of how Drupal 8 duplicate content “infiltrates” into your website.

But first, here are the 2 major categories that these sources fall into:
 

  • malicious
  • non-malicious
     

The first ones include all those scenarios where spammers post content from your website without your consent.

The non-malicious duplicate content can come from:
 

  • discussion forums that create both standard and stripped-down pages (for mobile devices)
  • printer-only web page versions, as already mentioned
  • items displayed on multiple pages of the same e-commerce site
     

Also, duplicate content in Drupal can be either:
 

  • identical
  • or similar

And since it comes in “many stripes and colors”, here are the 7 most common types of duplicate content:
 

1.1. Scraped Content

Has someone copied content from your website and further published it? Do not expect Google to distinguish the copy from its source.

That said, it's your job and yours only to stay diligent and protect the content on your Drupal site from scrapers.
 

1.2. WWW and non-WWW Versions of Your Website

Are there 2 identical version of your Drupal website available? A www and a non-www one?

Now, that's enough to ring Google's “duplicate content in Drupal” alarm.
 

1.3. Widely Syndicated Content 

So, you've painstakingly put together a list of article submission sites to give your valuable content (blog post, video, article etc.) more exposure.

And now what? Should you just cancel promoting it?

Not at all! Widely syndicated content risks to get on Google's “Drupal 8 duplicate content” radar only if you set no guidelines for those third-party websites.

That is when these publishers don't place any canonical tags in your submitted content pointing out to its original source.

What happens when you overlook such a content syndication agreement? You leave it entirely to Google to track down the source. To scan through all those websites and blogs that your piece of content gets republished on.

And often times it fails to tell the original from its copy.
 

1.4. Printed-Friendly Versions

This is probably one of the sources of duplicate content in Drupal that seems most... harmless to you, right?

And yet, for search engines multiple printer-friendly versions of the same content translates as: duplicate pages.
 

1.5. HTTP and HTTPs Pages

Have you made the switch from HTTP to HTTPs?

Entirely?

Or are there:
 

  • backlinks from other websites still leading to the HTTP version of your website?
  • internal links on your current HTTPs website still carrying the old protocol?
     

Make sure you detect all these less obvious sources of identical URLs on your Drupal website.
 

1.6. Appreciably Similar Content 

Your site's vulnerable to this type of duplicate content “threat” particularly if it's an e-commerce one.

Just think of all those too common scenarios where you display highly similar product descriptions on several different pages on your eStore. 
 

1.7. User Session IDs 

Users themselves can non-deliberately generate duplicate content on your Drupal site. 
How? They might have different session IDs that generate new and new URLs.


2. 4 Modules at Hand to Identify and Fix Duplicate Content in Drupal

What are the tools that Drupal puts at your disposal to detect and eliminate all duplicate content?
 

2.1. Redirect Module

Imagine all the functionality of the former Global Redirect module (Drupal 7) “injected” into this Drupal 8 module!

In fact, you can still define your Global Redirect features by just:
 

  1. accessing the Redirect module's configuration page
  2. clicking on “URL redirects” 
     

How to Deal with Duplicate Content in Drupal: Global Redirect features
Image Source: WEBWASH.net

What this SEO-friendly module does is provide you with a user-friendly interface for managing your URL path redirects:
 

  • create new redirects
  • identify broken URL paths (you'll need to enable the “Redirect 4040” sub-module for that)
  • set up domain level redirects (use the “Redirect Domain” sub-module)
  • import redirects
     

Summing up: when it comes to handling duplicate content in Drupal, this module helps you redirect all your URLs to the new paths that you will have set up.

This way, you avoid the risk of having the very same content displayed on multiple URL paths.
 

2.2. Taxonomy Unique Module  

How about “fighting” duplicate content on your website at a vocabulary level?

In this respect, this Drupal 8 module:
 

  • prevents you from saving a taxonomy term that already exists in that vocabulary
  • is configurable for every vocabulary on your Drupal site
  • allows you to set custom error messages that would pop up whenever a duplicate taxonomy term is detected in the same vocabulary
     

2.3. PathAuto Module  

Just admit it now:

How much do you hate the /node125 type of URL path aliases?

They're anything but user-friendly.

And this is precisely the role that Pathauto's been invested with:

To automatically generate content friendly path aliases (e.g. /blog/my-node-title) for a whole variety of content.

Let's say that you want to modify the current “path scheme” on your website with no impact on the URLs (you don't want the change to affect user's bookmarks or to “intrigue” the search engines).

The Pathauto module will automatically redirect those URLs to the new paths using any HTTP redirect status.
 

2.4. Intelligent Content Tools      

Personalization is key when you strive to prevent duplicate content in Drupal, right? 

And this is precisely what this module here does: it helps you personalize content on your website.

How? Through its 3 main functionalities delivered to you as sub-modules:
 

  • auto tagging
  • text summarizing 
  • detecting plagiarized content 
     

Leveraging Natural Language Processing, this last sub-module scans content on your website and alerts you of any signs of duplicity detected.

Word of caution: keep in mind that the module is not yet covered by Drupal's security advisory policy!
 

3. To Sum Up

Setting a goal to ensure 100% unique content on your website is as realistic as... learning a new language in a week. 

Instead, you should consider setting up a solid strategy ”fueled” by (at least) these 4 modules “exposed” here. One that would help you avoid specific scenarios where entire pages or clusters of pages get duplicated.

Now, that's a far less utopian goal to set, don't you think?

Original Post: 

About Drupal Sun

Drupal Sun is an Evolving Web project. It allows you to:

  • Do full-text search on all the articles in Drupal Planet (thanks to Apache Solr)
  • Facet based on tags, author, or feed
  • Flip through articles quickly (with j/k or arrow keys) to find what you're interested in
  • View the entire article text inline, or in the context of the site where it was created

See the blog post at Evolving Web

Evolving Web