Migration Best Practices
The wonderful Migrate module is used in every one of our projects, whether we have actual legacy content, or “just” want to create dummy and [XSS content]({{ site.url }}/content/xss-attack/).
So you've received a scary-looking SQL dump or Excel file of old website data from your client. Where should you start?
Here are some of the best practices we apply at Gizra to ensure a smooth migration project, with the fewest possible surprises or bugs.
Discovery phase
A good discovery phase before doing any actual coding is key to the success of the project. We believe it’s a good idea to do it as part of the [price estimation]({{ site.url }}/content/budget-goggles/).
Understand the data
The first step is to understand how to map the old data onto Drupal's entities: which data should become nodes, and which taxonomy terms. As always, we try to map the "real world" onto Drupal entities. An article, for example, is obviously a piece of content, so it should be a node and not a taxonomy term. However, there are more aspects to consider, like the number of fields, the semantic meaning of the content, and whether hierarchy is needed. It's also worth noting that this is exactly the time to think about improvements to the data structure.
There are cases where content types that were kept separate in the old website for historical reasons can be merged, and vice versa. Our rule of thumb is that a content type should map reality as closely as possible. That is, a "case study" content type might have fields very similar to those of an "article" content type, but semantically they are different, so they should be two different content types.
After you decide what fields your content type will have, pay careful attention to the data itself. If an article's "topic" field holds a small set of constantly repeating values, you would naturally prefer to convert it to a select list or even a proper taxonomy term. Watch for variants of the same value ("Politics" vs. "politics") and make the extra effort to sanitize and clean the data.
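As a minimal sketch of that cleanup step, a small hypothetical helper (the function names here are our own, not part of any Drupal tool) can collapse variants like "Politics" and " politics " into one canonical term before the migration runs:

```python
import re

def normalize_term(raw):
    """Trim whitespace, collapse inner spaces, and title-case the term."""
    cleaned = re.sub(r"\s+", " ", raw.strip())
    return cleaned.title()

def unique_terms(values):
    """Return the sorted set of canonical terms found in the raw data."""
    return sorted({normalize_term(v) for v in values if v.strip()})
```

Running `unique_terms` over the raw column also gives you a quick inventory of how many distinct values you actually have, which helps decide between a select list and a taxonomy vocabulary.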
Don’t migrate what you don’t need
Not all of the existing data really needs to be migrated. Past events, for example, may add little or no value to the new site. It's up to you to present the client with the possibility of saving some money on a low-impact migration and shifting resources to something more important.
Don’t go fully automatic for nothing
Doing a manual migration is fine. As part of the discovery phase, figure out how many items per content type need to be migrated. If it's fewer than about 50, offer the client the option of entering them manually. Yes, you may "lose" some billable hours, but you will gain the client's trust.
Development phase
Convert to SQL
If you received your data in CSV format, it is advisable to convert it to SQL. Your migration may work fine with a few hundred lines, but it will choke on more: CSV has no primary key column, so the migration basically loops over the same rows again and again.
SQL also provides another layer of safety, since it won't accept invalid data. For example, strings cannot be inserted into integer columns, so if you get an SQL error, you can easily find where the data is corrupted.
We've got you covered with our csv2sql.
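To illustrate the idea (csv2sql itself is a separate Gizra tool; this is only a sketch of the concept, using SQLite for brevity), the conversion boils down to loading each CSV into a table with an auto-increment primary key, so every row gets a stable ID the migration can track:

```python
import csv
import sqlite3

def csv_to_sql(csv_path, db_path, table="_raw_article"):
    """Load a CSV file into an SQL table with an auto-increment primary key.

    Illustrative sketch only; all columns are imported as TEXT.
    """
    conn = sqlite3.connect(db_path)
    with open(csv_path, newline="") as handle:
        reader = csv.reader(handle)
        headers = next(reader)
        cols = ", ".join(f'"{h}" TEXT' for h in headers)
        conn.execute(
            f'CREATE TABLE "{table}" '
            f"(__id INTEGER PRIMARY KEY AUTOINCREMENT, {cols})"
        )
        placeholders = ", ".join("?" for _ in headers)
        col_names = ", ".join(f'"{h}"' for h in headers)
        conn.executemany(
            f'INSERT INTO "{table}" ({col_names}) VALUES ({placeholders})',
            reader,
        )
    conn.commit()
    return conn
```

With the data in a real table, a strict schema (integer columns for numbers, etc.) would additionally reject corrupt rows at insert time, which is exactly the safety net described above.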
Content during development
While you are developing the platform and constantly rebuilding the site from scratch, you obviously don't want to wait hours for all the data to import. You could add some dummy data, but a better approach is to take about 50 rows from each table/content type. Don't take 50 consecutive rows, though; take random rows, to increase the chances of hitting data corruption or plain bugs in development instead of in production.
Migrations testing
Automatic tests are great, right? They catch bugs, and make you feel more confident about your code base, right?
So write automatic tests for your migration scripts, and wire them to Travis CI!
Obviously, when you have a huge amount of content you can't check every single piece. But even a little coverage is better than none.
Start writing tests during development against a smaller subset of content (e.g. those 50 rows mentioned earlier). There is no need to create complicated scenarios for testing migrated content; simply check that the fields contain the correct data (texts, images) and that references are set correctly.
The crucial part of the tests requires your QA person to visually compare the old data with the new. Once this is done, your automatic tests make sure there is no regression.
And please, don’t think writing those tests is a waste of time. On the contrary, it saves you so much effort chasing horrible regressions. Here’s an example of a [properly written]({{ site.url }}/content/behat-the-right-way/) Behat test.
```gherkin
@api
Scenario Outline: Login to site, and check an article's content.
  Given I login with user "test"
  When I visit "" node of type "article"
  Then I should see the title "<Title>"
  And I should see the "description" field "<Description>"
  And I should see the "tag" field "<Tag>"
  And I should see "<Number of pictures>" pictures in "images" section

  Examples:
  | Title       | Description              | Tag           | Number of pictures |
  | Curabitur   | Nam sed ex vitae arcu    | Education     | 1                  |
  | Quisque     | Praesent maximus a mi si | Science       | 0                  |
  | Lorem ipsum | Aenean sem lectus, porta | Entertainment | 2                  |
```
Of course, don't forget that when the migration is ready, you should run some final tests against the site with all the content.

![Migration best practices](/blog/images/migration-best-practices/image1.jpg)

Getting the data for testing
Since you will migrate data from SQL tables (thanks to csv2sql), it is easy to get data already prepared for tests using MySQL queries in phpMyAdmin.
Here is an example of an SQL query that takes the `_title` and `_description` from every 100th row of the `_raw_article` table. It selects the fields you want to check in your automatic test and adds empty placeholders (i.e. `leftPipe` and `rightPipe`) that will be converted to pipes at the beginning and end of each line during the export.

```sql
SELECT '' AS leftPipe, `_title`, `_description`, '' AS rightPipe
FROM `_raw_article`
WHERE (`_raw_article`.`__id` % 100) = 0;
```

Now you can use phpMyAdmin to export the result table to CSV format. You can download the data as a CSV file or directly copy it:

![Export result of the query.](/blog/images/migration-best-practices/image3.jpg)

![Use custom settings.](/blog/images/migration-best-practices/image4.jpg)

![Use CSV format, separate columns with pipes.](/blog/images/migration-best-practices/image5.jpg)

Here's the output you will get, ready to be added to your Behat test:

```txt
| Ludwig Blum    | The Israeli Ambassador |
| Julia Lagusker | Julia Lagusker         |
| Noah Heymann   | Interview the artist   |
```

Known pitfalls
- Some images may be missing. A [bash script](https://gist.github.com/HelenaEksler/e01a3572afc39f189ecc), for example, can help you identify which.
- The text of the content may be polluted with messy HTML. Worse, it may even have broken links. Don't leave this to the end; deal with it early on, as it can be a tough one.
- If your data needs to be translated, decide in advance what to do when some translated items are missing.

That's it. I'd love to hear other best practices devs are applying to their migration projects.

Helena Eksler, 01 April 2016