Upgrade Your Drupal Skills

We trained 1,000+ Drupal Developers over the last decade.

See Advanced Courses NAH, I know Enough

Migrate Drupal WYSIWYG to Paragraphs

Parent Feed: 

So the Drupal 7 website you’ve been tasked with upgrading to Drupal 8 has WYSIWYG fields filled with various images, videos, iframes, and tables all with inconsistent formatting. You want to take advantage of this upgrade by switching to a more structured content editing system like Paragraphs, so all those special cases will have a consistent editor experience and display. There is too much content to do this manually, so you need an automated migration. But to impose all this structure, you need to intelligently divide ambiguous content into your specific destination paragraphs. This is a difficult migration task for a few reasons.

  • There are multiple paragraph types, so they can’t share one migration.
  • The original WYSIWYG content can be broken into several paragraphs of several of different types, but the Drupal migration API wants one source entity to become one destination entity as discussed in this blog post.
  • The destination node needs to reference the paragraphs in the exact order as the original content.

These are tricky situations without a widely agreed upon solution. There are some custom migration modules that could help like Migrate HTML to Paragraphs. However, they may not fit your exact guidelines, especially if your paragraphs are referencing further entities like media or ECK entities. So how do you handle this?

There is no perfect solution. That said, one recipe for success is to write a custom process plugin that breaks up the WYSIWYG content, creates the correct paragraphs on the fly (as well as any file/media/ECK entities), and returns the paragraph IDs in the correct order for the destination node to reference. But beware: these paragraphs can’t be rolled back or updated through the migration pipeline. This makes testing more difficult because running a migration update is no longer idempotent, which means running it multiple times in a row will stuff the database with orphaned paragraphs. This simplifies the whole problem to one custom plugin. Here, the credit belongs to Benji Fisher for the starting point code. Keep in mind there are two shortcomings with this code that we will resolve later on.

Let’s break it all down. The first step is to import the WYSIWYG data into a DOMDocument in order to programmatically analyze the HTML. The DOMDocument is a data tree where each HTML tag is represented as a DOMNode that references any tags inside of it as children. You want to split up this DOMDocument so that each piece of content gets neatly mapped into the best fitting paragraph type. Since most of the content is simple text, the text paragraph type will be the default. So you need to test each piece of content to see if it matches a different paragraph type like image, video, or table. If it doesn’t match any of them, then you can safely drop it into the default text paragraph. You should start this process by iterating through all the top-level DOMNodes. For example:

When you start with the wysiwyg-container div, you will see four direct children with the top-level class. Starting with the first child, test for each special case. One of the advantages of using a DOMDocument is that you can use the getElementsByTagName function to ask any DOMNode if it has children tags like img, table, or iframe. On a match, turn all the content in the branch to a new paragraph. Otherwise, put each chunk of consecutive text into a new text paragraph. Here lies the first problem with Benji’s template: branches with mixed content.

Most of the time this isn’t an issue. If there’s a table inside a div, you probably want the entire div for your table paragraph. However, if there’s an image inside an anchor tag, your image paragraph will grab the image data but ignore the link, losing data along the way.

{# Top level tag #}

This may be an acceptable loss, or an issue that can be flagged for manual intervention. Otherwise, you’ll need a method that can recursively traverse an entire branch of the DOMDocument, cherry-pick the desired HTML element, and put the rest into the default bucket.

For instance:

  /**
   * Recursively navigate DOM tree for a specific tag
   * 
   * @param $post
   *   The full page DOMDocument
   * @param $parent
   *   The parent DOM Node
   * @param $tag
   *   The tag name.
   * @param string $content
   *   The content string to append to
   *
   * @return []
   *   Return array of DOM nodes of type $tag
   */
static protected function recursiveTagFinder($post, $parent, $tag, &$current) {
  $tagChildren = [];
  // Iterate through direct children.
  foreach ($parent->childNodes as $child) {
    // DOMText objects represent leaves on the DOM tree
    // that can't be processed any further.
    if (get_class($child) == "DOMText") {
      $current .= $post->saveHTML($child);
      continue;
    }
    // If the child has descendents of $tag, recursively find them.
    if (!is_null($child->childNodes) 
      && $child->getElementsByTagName($tag)->length != 0) {
      $tagChildren += static::recursiveTagFinder($post, $child, $tag, $current);
    }
    // If the child is a desired tag, grab it.
    else if ($child->tagName == $tag) {
      $tagChildren[] = $child;
    }
    // Otherwise, convert the child to HTML and add it to the running text.
    else {
      $current .= $post->saveHTML($child);
    }
  }
  return $tagChildren;
}

If a top level DOMNode indicates that it has an img tag, then this method will search all the DOMNode children to find the img tag and push everything else into the default text paragraph, the $current variable. There will likely be some disjointed effects when pulling an element out of a nested situation: a table may lose special formatting or an image may not be clickable as a link anymore. Some issues can be fixed in the plugin. In the migration, I checked if any img tags had an adjacent div with the caption class and stored that in the caption field on the paragraph. Again, others may need to be manually adjusted, like reordering paragraphs or adjusting a table. Remember that it’s much faster to tweak flagged migration content than to find and fix missing data manually.

On to the next issue, let’s dig into embedded media in Drupal 7. The tricky aspect here is that the media embed is not stored in the database as an image tag, but as a JSON object inside the HTML. This requires a whole new set of tests to parse it out. To start, check each DOMNode for the opening brackets of the object [[{ and the closing brackets }]]. If it only contains one or the other, then there isn’t enough information to do anything. If it contains both, then get the substring from open to close and run json_decode. This will either return an array of data from the JSON object or null if it’s not valid JSON. The data in this array should contain an fid key that corresponds to the file ID of the embedded image. That file ID can then be used to grab the image from the migration database’s file_managed table and create an image paragraph.

Those are the main gaps in Benji Fisher’s custom source plugin. Of course each implementation requires even more tweaks and data manipulations to get the exact migration to work correctly. Some content simply will not transfer cleanly, so remember to test thoroughly and stay in tight communication with your client in order to turn WYSIWYG chaos into neat paragraphs.


If you found this migration series useful, share it with your friends, and don’t forget to tag us @redfinsolutions on Facebook, LinkedIn, and Twitter.

Original Post: 

About Drupal Sun

Drupal Sun is an Evolving Web project. It allows you to:

  • Do full-text search on all the articles in Drupal Planet (thanks to Apache Solr)
  • Facet based on tags, author, or feed
  • Flip through articles quickly (with j/k or arrow keys) to find what you're interested in
  • View the entire article text inline, or in the context of the site where it was created

See the blog post at Evolving Web

Evolving Web