Jul 13 2017

When creating the Global Academy for continuing Medical Education (GAME) site for Frontline, we had to tackle several complex content migration problems. The previous site had a lot of legacy content that we had to bring over into the new system. By tackling each problem individually, we were able to migrate most of the content into the new Drupal 7 site.

Setting Up the New Site

Before the redesign, Frontline ran the site on Typo3, alongside a suite of individual, internally hosted ASP sites for conferences. Frontline had several kinds of content that displayed differently throughout the site. The main difficulty with the migration was that much of the content lived in WYSIWYG fields containing large amounts of custom HTML.

We decided to go with Drupal 7 for this project so we could more easily reuse code that was created for the MDEdge.com site.

The GAME website redesign greatly improved the flow of the content and how it was displayed on the frontend, and part of that improvement was displaying specific pieces of content in different sections of the page. The burning question that plagued us when tackling this problem was “How are we going to extract the specific pieces of data and get them inserted into the correct fields in Drupal?”

Before we could get deep into the code, we had to do some planning and setup to make sure we were clear on how best to handle the different types of content. This also included hammering out the content model. Once we got to a spot where we could start migrating content, we decided to use the Migrate module. We grabbed the current site files, images and database and put them into a central location outside of the current site that we could easily access. This would allow us to re-run these migrations even after the site launched (if we needed to)!

Migrating Articles

The article content on the new site is connected to MDEdge.com via a REST API. One complication was that the content on GAME had been added manually to Typo3 and wasn’t tagged for use with specific fields. The content type on the new Drupal site had a few fields for the data we were displaying, plus a field that stores the article ID from MDEdge.com. To get that ID for this migration, we matched the title of each news article in Typo3 to the title of the article on MDEdge.com. It wasn’t a perfect solution, but it allowed us to do an initial migration of the data.
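The title matching described above can be sketched roughly as follows. This is an illustration in Python rather than the Drupal code we ran; the function names, the normalization rules, and the sample titles and IDs are all assumptions made for the example.

```python
import re
import unicodedata


def normalize_title(title: str) -> str:
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return text.strip()


def match_article_ids(typo3_titles, mdedge_articles):
    """Map each Typo3 title to an MDEdge article ID, or None if no match.

    mdedge_articles is an iterable of (article_id, title) pairs.
    """
    by_title = {normalize_title(title): article_id
                for article_id, title in mdedge_articles}
    return {title: by_title.get(normalize_title(title))
            for title in typo3_titles}


# Hypothetical sample data, for illustration only.
mdedge = [(101, "New Guidance on Statin Use"), (102, "Café Study: Results")]
typo3 = ["New guidance on statin use ", "Cafe study - results", "Unlisted article"]
matches = match_article_ids(typo3, mdedge)
```

Titles that fail to match come back as None, which gives a list of rows that need a manual look instead of silently attaching the wrong article ID.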

Conferences Migration

For GAME’s conferences, since there were not too many on the site, we decided to import the main conference data via a Google spreadsheet. The Google doc was a fairly simple spreadsheet that contained a column we used to identify each row in the migration, plus a column for each field that is in that conference’s content type. This worked out well because most of the content in the redesign was new for this content type. This approach allowed the client to start adding content before the content types or migrations were fully built.
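As a rough sketch of that spreadsheet layout, the snippet below reads a CSV export with one identifier column and one column per field. It is written in Python for illustration only; the column names and sample rows are hypothetical, and the real import ran through the Migrate module.

```python
import csv
import io

# Hypothetical CSV export of the spreadsheet; column names are assumptions.
SHEET_CSV = """id,title,location,start_date
conf-a,Example Conference A,Example City,2017-09-14
conf-b,Example Conference B,Another City,2017-10-05
"""


def read_conference_rows(csv_text, key_column="id"):
    """Return {row_id: {field: value}} keyed by the identifier column."""
    rows = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = row[key_column].strip()
        if not key:
            continue  # skip rows the client has not finished filling in
        if key in rows:
            raise ValueError(f"duplicate row id: {key}")
        rows[key] = {name: value.strip()
                     for name, value in row.items() if name != key_column}
    return rows


conferences = read_conference_rows(SHEET_CSV)
```

Keeping a stable identifier per row is what lets a migration be re-run and update existing content rather than create duplicates.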

Our spreadsheet handled the top level conference data, but it did not handle the pages attached to each conference. Page content was either stored in the Typo3 data or we needed to extract the HTML from the ASP sites.

Typo3 Categories to Drupal Taxonomies

To make sure we mapped the content in the migrations properly, we created another Google doc mapping file that connected the Typo3 categories to Drupal taxonomies. We set it up to support multiple taxonomy terms that could be mapped to one Typo3 category.
[NB: Here is some code that we used to help with the conversion: https://pastebin.com/aeUV81UX.]

Our mapping system worked out well. The only problem we encountered was that, because we allowed three taxonomy terms to be mapped to one Typo3 category, the client noticed cases where content with more than one Typo3 category ended up with too many taxonomy terms. This was a content-related issue, though, and required them to revisit the mapping document and tweak it as necessary.
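To show how the terms can pile up, here is a minimal sketch of a one-to-many category mapping. It is not the code linked above; the category and term names are placeholders, and Python is used only for illustration.

```python
# Hypothetical mapping file contents: one Typo3 category to several terms.
CATEGORY_MAP = {
    "Category A": ["Term 1", "Term 2"],
    "Category B": ["Term 2", "Term 3", "Term 4"],
}


def terms_for(categories, mapping):
    """Collect the taxonomy terms for a set of Typo3 categories.

    Terms keep first-seen order and duplicates are dropped. Categories
    that are missing from the mapping contribute nothing.
    """
    terms = []
    for category in categories:
        for term in mapping.get(category, []):
            if term not in terms:
                terms.append(term)
    return terms


single = terms_for(["Category A"], CATEGORY_MAP)
combined = terms_for(["Category A", "Category B"], CATEGORY_MAP)
```

An item in one category gets two terms here, while an item in both categories gets four, which is the kind of growth the client had to rein in by editing the mapping document.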

Slaying the Beast: Extracting, Importing, and Redirecting

One of the larger problems we tackled was how to get the HTML from the Typo3 system and the ASP conference sites into the new Drupal 7 setup.

The ASP conference sites were handled by grabbing the HTML for each of those pages and extracting the page title, body, and photos. The migration of the conference sites was challenging because we were dealing with different HTML for different sites and trying to get all those differences matched up in Drupal.
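A simplified sketch of that extraction step is shown below, using Python's standard-library HTML parser. The class name and the sample markup are assumptions, and it collects plain body text rather than the cleaned-up HTML a real migration would need to keep, so treat it as an outline of the idea only.

```python
from html.parser import HTMLParser


class PageExtractor(HTMLParser):
    """Collect the page title, visible body text, and image sources."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.body_parts = []
        self.images = []
        self._in_title = False
        self._in_body = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "body":
            self._in_body = True
        elif tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.images.append(src)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "body":
            self._in_body = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_body and data.strip():
            self.body_parts.append(data.strip())


# Hypothetical page markup, for illustration only.
SAMPLE = """<html><head><title>Faculty</title></head>
<body><h1>Faculty</h1><p>Meet the <b>speakers</b>.</p>
<img src="/images/speaker.jpg" alt=""></body></html>"""

parser = PageExtractor()
parser.feed(SAMPLE)
parser.close()
body_text = " ".join(parser.body_parts)
```

Because each legacy site used different markup, a real version would need per-site rules for where the title and body live instead of relying on the title and body tags alone.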

Grabbing the data from the Typo3 sites presented another challenge because we had to figure out where the different data was stored in the database. This was a uniquely interesting process because we had to determine which tables were connected to which other tables in order to figure out the content relationships in the database.

A few things we learned in this process:

  • We found all of the content on the current site was in these tables (which are connected to each other): pages, tt_content, tt_news, tt_news_cat_mm and link_cache.
  • After talking with the client, we were able to grab content based on certain Typo3 categories or the pages hierarchy relationship. This helped fill in some of the gaps where a direct relationship could not be made by looking at the database.
  • It was clear that getting 100% of the legacy content wasn’t going to be realistic, mainly because of the loose content relationships in Typo3. After talking to the client we agreed to not migrate content older than a certain date.
  • It was also clear that—given how much HTML was in the content—some manual cleanup was going to be required.

Once we were able to get to the main HTML for the content, we had to figure out how to extract the specific pieces we needed from that HTML.

With the data in hand, it was a matter of getting it into Drupal. The Migrate module made a lot of this fairly easy with how much functionality it provides out of the box. We ended up using the prepareRow() method a lot to grab specific pieces of content and assign them to Drupal fields.

Handling Redirects

We wanted to handle as many of the redirects as we could automatically, so the client wouldn’t have to add thousands of redirects and to ensure existing links would continue to work after the new site launched. To do this we mapped the unique row in the Typo3 database to the unique ID we were storing in the custom migration.

As long as you are handling the unique IDs properly in your use of the Migration API, this is a great way to handle mapping what was migrated to the data in Drupal. You use the unique identifier stored for each migration row and grab the corresponding node ID to get the correct URL that should be loaded. Below are some sample queries we used to get access to the migrated nodes in the system. We used UNION queries because the content that was imported from the legacy system could be in any of these tables.

    SELECT destid1 FROM migrate_map_cmeactivitynode WHERE sourceid1 IN(:sourceid)
    UNION
    SELECT destid1 FROM migrate_map_cmeactivitycontentnode WHERE sourceid1 IN(:sourceid)
    UNION
    SELECT destid1 FROM migrate_map_conferencepagetypo3node WHERE sourceid1 IN(:sourceid)
    …
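The lookup can be tried out in isolation with the sketch below, which runs the same kind of UNION query against an in-memory SQLite database. The table names come from the queries above; the column types, the sample row, the helper names, and the use of = for a single ID instead of IN are assumptions made for the example.

```python
import sqlite3

# Map tables named in the queries above; the full list is longer.
MAP_TABLES = [
    "migrate_map_cmeactivitynode",
    "migrate_map_cmeactivitycontentnode",
    "migrate_map_conferencepagetypo3node",
]


def build_lookup_sql(tables):
    """Build one UNION query that checks every map table for a source ID."""
    return " UNION ".join(
        f"SELECT destid1 FROM {table} WHERE sourceid1 = :sourceid"
        for table in tables
    )


def find_node_path(conn, source_id):
    """Return the Drupal path for a legacy row, or None if it was not migrated."""
    sql = build_lookup_sql(MAP_TABLES)
    row = conn.execute(sql, {"sourceid": source_id}).fetchone()
    return f"node/{row[0]}" if row else None


# In-memory stand-in for the Drupal database, with one hypothetical mapping.
conn = sqlite3.connect(":memory:")
for table in MAP_TABLES:
    conn.execute(f"CREATE TABLE {table} (sourceid1 TEXT, destid1 INTEGER)")
conn.execute(
    "INSERT INTO migrate_map_conferencepagetypo3node VALUES ('1234', 56)"
)

redirect_target = find_node_path(conn, "1234")
```

A legacy URL whose source ID is missing from every map table returns None, which is the case where the redirect has to fall back to a manual entry or a generic landing page.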

Wrap Up

Migrating complex websites is rarely simple. One thing we learned on this project is that it is best to jump deep into migrations early in the project lifecycle, so the big roadblocks can be identified as early as possible. It is also best to give the client as much time as possible to work through any content cleanup that may be required.

We used a lot of Google spreadsheets to get needed information from the client. This made things much simpler on all fronts and allowed the client to start gathering needed content much sooner in the development process.

In a perfect world, all content would be easily migrated over without any problems, but this usually doesn’t happen. It can be difficult to know when you have taken a migration “far enough” and are better off moving on to other things. This is where early communication with the full team is vital to keeping migration issues from taking over a project.

Web Chef Chris Roane

When not breaking down and solving complex problems as quickly as possible, Chris volunteers for a local theater called Arthouse Cinema & Pub.
