Upgrade Your Drupal Skills
We trained 1,000+ Drupal Developers over the last decade.
See Advanced Courses NAH, I know EnoughFast, Parallel MySQL Backups and Imports with Mydumper
We recently did a Drupal Performance Assessment project for a very large site, with a massive database. The database takes over 174GB, and houses a total of over 707 million rows.
Exporting and importing a database of that size is challenging in many ways.
We had to use different tools and techniques than the standard mysqldump utility, which was causing the import to exceed 14 days for this database. We were able to cut that down to less than 4 days.
While we plan to talk about this site in detail at Drupal Camp Toronto, and publish an article on it, this article is about a very useful byproduct.
Maatkit mk-parallel-dump
We first considered the use of Maatkit parallel dump. However, this is deprecated and is no longer supported in Percona Toolkit, the successor to Maatkit.
Mydumper and Myloader
We then found the right tool: Mydumper.
Mydumper is a very useful tool for faster backups for many sites with smaller database sizes. The tool is written by current and past MySQL/Oracle staff.
The trick for this tool is that it relies on doing dumps and imports in parallel using multiple threads, rather than using a single thread, which is slow and inefficient.
Compiling and Installing Mydumper from Source
Although Mydumper is available as a package in the repository for Ubuntu Server 12.04 LTS Precise Pangolin, it does not work, and casues segmentation fault error.
For this reason, we have to build it from source. However, a patch is also needed in order for it to compile correctly.
This is described in detail in the following steps:
Install the required packages for compiling from source
sudo aptitude install cmake g++ patch libglib2.0-dev \ libmysqlclient-dev zlib1g-dev libpcre3-dev patch
Download Mydumper
You need to download Mydumper 0.5.1:
wget http://goo.gl/dVhJD
Extract the archive
Then extract the archive in a directory:
mv dVhJD mydumper-0.5.1.tar.gz tar xzf mydumper-0.5.1.tar.gz
Change to the directory where you extracted the source:
cd mydumper-0.5.1/
Get and apply the patch
Then get the patch from Launchpad issue #1002291.
wget http://goo.gl/eUMx6
Now patch the source with what you just downloaded, as follows:
patch -p1 < eUMx6
Compile the source
Then you compile the source. Do not forget the "." after cmake!
cmake . make
Install
Finally, you need to install the binaries to /usr/local/bin:
sudo make install
Using Mydumper
As we said earlier, Mydumper works by having multiple threads dumping multiple tables, or parts of tables, concurrently.
The program has many options, but the important ones are:
--database
Specifies the name of the database to dump
--threads
The number of threads to use. Generally set this to the number of CPUs in the server, but it can be more in some cases.
--rows
The number of rows per "chunk", before creating anothe chunk for the same table. We used 500,000, and got good results, but it depends on how many rows your tables have.
You can use this script for backup.
#!/bin/sh DIR_BACKUP=/backup-directory DB_NAME=database_name DB_USER=user DB_PASS=password DOW=`date "+%a"` DB_DUMP=$DIR_BACKUP/dbdump.$DB_NAME.$DOW mydumper \ --database=$DB_NAME \ --host=$DB_HOST \ --user=$DB_USER \ --password=$DB_PASS \ --outputdir=$DB_DUMP \ --rows=500000 \ --compress \ --build-empty-files \ --threads=2 \ --compress-protocol
Save that file to /usr/local/bin/dbdump.sh
Make it executable using:
chmod +x /usr/local/bin/dbdump.sh
Create a crontab job to run it daily, say at 4:10 am local server time.
10 04 * * * /usr/local/bin/dbdump.sh
Description of Mydumper's output data
Mydumper does not output to files, but rather to files in a directory. The --outputdir option specifies the name of the directory to use.
The output is two parts:
Schema
For each table in the database, a file containing the CREATE TABLE statement will be created. It will be named:
dbname.tablename-schema.sql.gz
You could manipulate this schema and gain some advantages, for example uncompress the file, edit the table creation statement to remove all indexes, other than the primary index. This would make the loading of huge tables much faster than with indexes.
Remember to do a CREATE INDEX for each one after that.
Data
For each table with number of rows above the --rows parameter, you will have a file called:
dbname.tablename.0000n.sql.gz
Where "n" starts with 0 up to the number of.
Import/restore using Myloader
You don't need to know the internals of how Mydumper stores its data in order to do a restore/import. The Myloader tool does that for you.
If you are importing on a machine that has more CPUs than the machine you dumped on, you can use a higher number for the --threads parameter, for example:
myloader \ --database=$DB_NAME \ --directory=$DB_DUMP \ --queries-per-transaction=50000 \ --threads=6 \ --compress-protocol \ --verbose=3
So, what improvement can we see? Here are the details.
Before: Using mysqldump:
For a site that has 5.8 million rows in one table, and 1.39 million in another, there was a noticeable improvement of using Mydumper over mysqldump.
11:00:01 - Start dump of database to .../dbdump.live.Sat.sql.gz 11:08:33 - Successfully dumped database live
So, it takes 8.5 minutes for the dump to complete.
After: Using Mydumper:
After using Mydumper with --rows=500000, the backup was much faster:
11:00:01 - Start dump of database to .../dbdump.live.Mon 11:00:43 - Finished dumping the database live
Only 42 seconds!
Other approaches for faster backups
For sites that do not have large tables that warrant the use of Mydumper, there are other approaches that can help.
For example, you can exclude the data from certain tables with transient content, such as watchdog, sessions, and all the cache tables.
This partial code snippet uses some shell script wizardry to exclude such tables with transient data, and only include their table creation statements, but not their content:
#!/bin/sh # Tables to backup structure, but not data DB_NO_DATA="/^(watchdog|sessions|cache(_.+)?)$/" ... TABLES=`mysql -u... -p... -B -N -e 'SHOW TABLES;' $DB_NAME` for TABLE in do RESULT=`echo $TABLE | awk "$DB_NO_DATA"` if [ "$RESULT" != "" ]; then # Structure only OPTS=" --no-data " else # Default, structure AND data OPTS="" fi mysqldump $OPTS -u... -p... $DB_NAME $TABLE >> $DB_DUMP if [ $? != 0 ]; then echo "ERROR: Failed in dumping database $DB_NAME" fi done # Compress the backup bzip2 $DB_DUMP mv $DB_DUMP.bz2 $DB_DUMP echo "Successful dump for database $DB_NAME"
A similar approach for excluding contents of tables with transient content is outlined in this article, if you are using Capistrano.
A different way that relies, like Mydumper, on doing parallel dump and restore, is to use Lullabot's MySQL Parallel, which uses a shell script wrapper around GNU Parallel, and Parallel bzip2. This is of course much better than plain mysqldump, but will not parallelize portions of tables. So a site with a huge table will do other tables quickly, but will be slow for the huge table since all of it will be done in a single thread.
Other use cases for Mydumper/Myloader
More information on how we used MyDumper/MyLoader to cut down the export/import of a massive Drupal database, can be found in our presentation: Huge! Drupal site with 381 modules and 174GB MySQL database and 200 million row tables at Drupal Camp Toronto in November 2012..
Further Information
You can find more information about Mydumper in the following articles:
Update: Added link to presentation that contains MyDumper/MyLoader info.
About Drupal Sun
Drupal Sun is an Evolving Web project. It allows you to:
- Do full-text search on all the articles in Drupal Planet (thanks to Apache Solr)
- Facet based on tags, author, or feed
- Flip through articles quickly (with j/k or arrow keys) to find what you're interested in
- View the entire article text inline, or in the context of the site where it was created
See the blog post at Evolving Web