Tuesday, April 20, 2010

The mysqldump program can be used to make logical database backups. Although the vast majority of people use it to create SQL dumps, it is possible to dump both schema structure and data in XML format. There are a few bugs (#52792, #52793) in this feature, but these are not the topic of this post.

XML output from mysqldump

Dumping in XML format is done with the --xml or -X option. In addition, you should use the --hex-blob option otherwise the BLOB data will be dumped as raw binary data, which usually results in characters that are not valid, either according to the XML spec or according to the UTF-8 encoding. (Arguably, this is also a bug. I haven't filed it though.)

I don't want to spend too much time discussing why it would be useful to make backups in this way. There are definitely a few drawbacks - for example, for sakila, the plain SQL dump, even with --hex-blob is 3.26 MB (3.429.358 bytes), whereas the XML output is 13.7 MB (14,415,665 bytes). Even after zip compression, the XML formatted dump is still one third larger than the plain SQL dump: 936 kB versus 695 kB.

Restoring XML output from mysqldump

A more serious problem is that MySQL doesn't seem to offer any tool to restore XML formatted dumps. The LOAD XML feature, kindly contributed by Erik Wetterberg could be used to some extent for this purpose. However, this feature is not yet available (it will be available in the upcoming version MySQL 5.5), and from what I can tell, it can only load data - not restore tables or databases. I also believe that this feature does not (yet) provide any way to properly restore hex-dumped BLOB data, but I really should test it to know for sure.

Anyway.

In between sessions of the past MySQL users conference I cobbled up an XSLT stylesheet that can convert mysqldump's XML output back to SQL script output. It is available under the LGPL license, and it is hosted on google code as the mysqldump-x-restore project. To get started, you need to download the mysqldump-xml-to-sql.xslt XSLT stylesheet. You also need a command line XSLT processor, like xsltproc. This utility is part of the Gnome libxslt project, and is included in packages for most linux distributions. There is a windows port available for which you can download the binaries.

Assuming that xsltproc is in your path, and the XML dump and the mysqldump-xml-to-sql.xslt are in the current working directory, you can use this command to convert the XML dump to SQL:

xsltproc mysqldump-xml-to-sql.xslt sakila.xml > sakila.sql

On Unix-based systems you should be able to directly pipline the SQL into mysql using

The stylesheet comes with a number of options, which can be set through xsltproc's --stringparam option. For example, setting the schema parameter to N will result in an SQL script that only contains DML statements:

. There are additional options to control how often a COMMIT should be issued, whether to add DROP statements, whether to generate single row INSERT statements, and to set the max_allowed_packet size.

What's next?

Nothing much really. I don't really recommend people to use mysqldump's XML output. I wrote mysqldump-x-restore for those people that inherited a bunch of XML formatted dumps, and don't know what to do with them. I haven't thouroughly tested it - please file a bug if you find one. If you actually think it's useful and you want more features, please let me know, and I'll look into it. I don't have much use for this myself, so if you have great ideas to move this forward, I'll let you have commit access.

Hi Roland,very nice, very informative, very competent site (not only this article). In particular this XML-Dump_back_to_SQL-Dump-XSLT hint is right what I was looking for (could have found it using a search engine but sometimes you errm... "don't find a single tree within the woods" as we use to say in Germany). I fully agree with your advice not to use XML-dumps for 1:1 backups (should be obvious... but XML sounds "good" at first, and it surely is, but not for this kind of purpose). Also those other articles dealing with native implemented MySQL/XPath constructs are just gold! The point is that any of this may be found within the official MySQL docs - but it's nice to find it "rehashed" like this on a silver tablet.So thanks for sharing all this (it happened I bookmarked this site twice already and I do not regret this :) ) !Wolfram

The problem with using the XSLT file to transform the XML dump file to SQL is that it doesn't handle large files very well. I have an 877M XML dump file that I would like to convert. I ran out of memory on a linux box with 4G of physical memory + 6G of VM.

I was hoping the XSLT file would be better than using simplexml_load_file() in PHP (which has the same problem of opening the entire file in memory).

yes, most XSLT implementations first need to parse the XML and have some in-memory representation of the document to apply the stylesheet to. There is something called streaming XSLT, and I believe the latest release of Saxon implements it. However, I imagine not all types of documents and stylesheets lend themselves to this technique.

Personally, in your case, I would probably try to parse the XML using a SAX parser. To keep it simple I would probably try to use the SAX handlers to generate lines in CSV (or similar) format. You can then load the CSV with for example LOAD DATA INFILE.

If you don't want to create a custom program for this kind of thing, you could give Kettle (pentaho data integration) a go. It has a Load data from XML step, which may help here. Kettle doesn't like large XML files too (or actually, it handles them fine, but also requires the memory) however, it can use a file buffer to compensate (but of course, it costs performance). For more info on this approach, see:http://wiki.pentaho.com/display/EAI/Get+Data+From+XML extra info for handling large files:http://wiki.pentaho.com/display/EAI/Get+Data+from+XML+-+Handling+Large+Files

(next time, please use the issue tracker in the google code project to post issues. Thanks in advance)

I downloaded your file and I can confirm your result. But the structure of your xml is different from what I get when I use mysqldump with the -X option. What version of mysqldump did you create this dump with?

Mean while, I can cobble up something to make it work for your example, would you like that? But then it would be cool if you could test it and give me some feedback if it works. Ok?

Ok steve, I took a look and I think it can be done, but there's a huge challenge since the format you posted has zero information on data types.

Anyway, I just committed a new version (revision 4) which I hope will fix your problem. But please be aware that I don't see anyway to restore the table structures going by this format. I did a best guess to distinguish between numbers and strings, I hope this helps for now.

this is not helpful. If you think you found a bug, please use the tracker on the google code project.

Please provide a description of what you did, and what happened, and - very important - any error messages you received. Would also be great if you could attach the input file, or even better, a small sample from it that still reproduces that error.

My task was to develop a script that would scramble confidential information such as personal identity numbers, names, and the like from a MySQL dump taken from production. When scrambled, this data would then be transferred to test environment for use.

The data contained both simple values in table columns such as 'personalIdentityNumber', and (for the most part) lots of xml in clob columns. The confidential fields that had to be mangled were identified from the dump file using regular expressions. This would have been very difficult against the traditional MySQL dump file. The xml, however, provided enough boiler plate to get reliable regexp matches. Indeed it works fine against simple columns such as

But the problem was that we're running MySQL 5.1., which doesn't yet support xml import. And I couldn't figure out a schema-agnostic (or almost agnostic) way of doing this using MySQL's EXTRACTVALUE etc functions alone. So I was about to start writing the xslt, but luckily stumbled upon your blog first!

Yes, EXTRACTVALUE is available in 5.1. Our database schema is still evolving rapidly, and a db import solution based on EXTRACTVALUE and stored procs would require updates to follow db schema evolution. This is much easier when operating on MySQL xml dump and a set of regular expression templates for the mangling. All I have to do is to provide a list of column names (which happen to be the same across several tables in my case) and xml tag names used, and a simple python script will create and process regular expressions (and invoke mapping functions for substitutions) out of templates such as:

"()([^<]+)()" (for simple values)

and

"(<{tagname}>)([^&]+)(</{tagname}>)" (for xml content)

BTW: I also came across this issue: http://code.google.com/p/mysqldump-x-restore/issues/detail?id=1 when running the script with larger datasets. In our case it was ok to simply substitute single quotes with double quotes, which seemed to work.

Third thing encountered was that MySQL's xml export is not able to deal with bit(1) datatypes properly. XML contained strange bytes in place of the bits: 0x0a01 for 1 and 0x0a00 for 0. Substituting these with 0 (0x0a30) and 1 (0x0a31) did the trick.

In more detail, mysqldump seems to emit a byte 0x00 for "false" and 0x01 for "true" when column type is bit(1), at least when dumping in xml format. These are clearly nonprintable chars and the resulting file is no longer well-formed xml at least according to xmllint. Since MySQL allows use of 0 and 1 decimal digits in inserts to bit(1) columns, a simple tr replacement did the trick:

cat dump.xml | tr "\000" 0 | tr "\001" 1 > dump2.xml

bit(1) seems to be the default mapping for a JPA/Hibernate entities' boolean fields in MySQL.

Actually mysqldump flag --hex-blob also works for binary and varbinary datatypes (bit(1) for example), creating a hex-blob instead of breaking well-formedness of the xml. So I guess it's the same bug you mentioned in the post.

This was a great solution for me. I think the errors some received of "Unexpected data type" occur with errors parsing indexes and foreign keys. I received those errors which went away when my schema only included columns.

Search This Blog

About Me

I'm @rolandbouman, a Web- and BI Developer and Information Analyst. I have worked for MySQL AB, Sun Microsystems and I'm currently working as a software engineer for Pentaho (a Hitachi data systems company).

Together with Jos van Dongen I wrote a book called "Pentaho Solutions" (Wiley, ISBN: 978-0-470-48432-6, 630+ pages). This book is intended for people that want to get started with Business Intelligence and provides lots of practical examples to work with the open source Pentaho Business Intelligence Suite.

Together with Matt Casters and Jos van Dongen, I authored another book for Wiley called "Pentaho Kettle Solutions" (750+ pages, Wiley, ISBN: 978-0-470-63517-9). This book is more specialized and focuses on Pentaho data integration (Kettle) and ETL.