Upgrade Your Drupal Skills

We trained 1,000+ Drupal Developers over the last decade.

See Advanced Courses NAH, I know Enough

Migrate Drupal WYSIWYG to Paragraphs

Parent Feed: 

When you start with the wysiwyg-container div, you will see four direct children with the top-level class. Starting with the first child, test for each special case. One of the advantages of using a DOMDocument is that you can use the getElementsByTagName function to ask any DOMNode if it has children tags like img, table, or iframe. On a match, turn all the content in the branch to a new paragraph. Otherwise, put each chunk of consecutive text into a new text paragraph. Here lies the first problem with Benji’s template: branches with mixed content.

Most of the time this isn’t an issue. If there’s a table inside a div, you probably want the entire div for your table paragraph. However, if there’s an image inside an anchor tag, your image paragraph will grab the image data but ignore the link, losing data along the way.

{# Top level tag #}

This may be an acceptable loss, or an issue that can be flagged for manual intervention. Otherwise, you’ll need a method that can recursively traverse an entire branch of the DOMDocument, cherry-pick the desired HTML element, and put the rest into the default bucket.

For instance:

/** * Recursively navigate DOM tree for a specific tag * * @param $post * The full page DOMDocument * @param $parent * The parent DOM Node * @param $tag * The tag name. * @param string $content * The content string to append to * * @return [] * Return array of DOM nodes of type $tag */ static protected function recursiveTagFinder($post, $parent, $tag, &$current) { $tagChildren = []; // Iterate through direct children. foreach ($parent->childNodes as $child) { // DOMText objects represent leaves on the DOM tree // that can't be processed any further. if (get_class($child) == "DOMText") { $current .= $post->saveHTML($child); continue; } // If the child has descendents of $tag, recursively find them. if (!is_null($child->childNodes) && $child->getElementsByTagName($tag)->length != 0) { $tagChildren += static::recursiveTagFinder($post, $child, $tag, $current); } // If the child is a desired tag, grab it. else if ($child->tagName == $tag) { $tagChildren[] = $child; } // Otherwise, convert the child to HTML and add it to the running text. else { $current .= $post->saveHTML($child); } } return $tagChildren; }

If a top level DOMNode indicates that it has an img tag, then this method will search all the DOMNode children to find the img tag and push everything else into the default text paragraph, the $current variable. There will likely be some disjointed effects when pulling an element out of a nested situation: a table may lose special formatting or an image may not be clickable as a link anymore. Some issues can be fixed in the plugin. In the migration, I checked if any img tags had an adjacent div with the caption class and stored that in the caption field on the paragraph. Again, others may need to be manually adjusted, like reordering paragraphs or adjusting a table. Remember that it’s much faster to tweak flagged migration content than to find and fix missing data manually.

On to the next issue, let’s dig into embedded media in Drupal 7. The tricky aspect here is that the media embed is not stored in the database as an image tag, but as a JSON object inside the HTML. This requires a whole new set of tests to parse it out. To start, check each DOMNode for the opening brackets of the object [[{ and the closing brackets }]]. If it only contains one or the other, then there isn’t enough information to do anything. If it contains both, then get the substring from open to close and run json_decode. This will either return an array of data from the JSON object or null if it’s not valid JSON. The data in this array should contain an fid key that corresponds to the file ID of the embedded image. That file ID can then be used to grab the image from the migration database’s file_managed table and create an image paragraph.

Those are the main gaps in Benji Fisher’s custom source plugin. Of course each implementation requires even more tweaks and data manipulations to get the exact migration to work correctly. Some content simply will not transfer cleanly, so remember to test thoroughly and stay in tight communication with your client in order to turn WYSIWYG chaos into neat paragraphs.


If you found this migration series useful, share it with your friends, and don’t forget to tag us @redfinsolutions on Facebook, LinkedIn, and Twitter.

Original Post: 

About Drupal Sun

Drupal Sun is an Evolving Web project. It allows you to:

  • Do full-text search on all the articles in Drupal Planet (thanks to Apache Solr)
  • Facet based on tags, author, or feed
  • Flip through articles quickly (with j/k or arrow keys) to find what you're interested in
  • View the entire article text inline, or in the context of the site where it was created

See the blog post at Evolving Web

Evolving Web