dtd-inf

dtd-inf is an XML schema inference tool that learns DTDs from positive examples. This is an implementation of the learning algorithms from the paper that Timo Kötzing and I published at ICDT 2013, containing bugfixes from the journal version (see there for links to both versions).

dtd-inf computes the most precise element type declarations that are possible for the input XML, with the restriction that every declaration may use each element name only once. While this might sound quite restrictive, it seems to be enough for most applications. If you do not care about DTDs, but do care about regular expressions, you can use the sore-inf tool (also, the flag -d and the hidden bonus flags that are mentioned in the README might be of interest to you).
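To make the "each element name only once" restriction concrete, here is a minimal sketch of a check for it. It is not part of dtd-inf; the helper name repeated_names and the simplified name pattern are assumptions made for illustration only.

```python
import re
from collections import Counter

# Simplified pattern for element names (an assumption for this sketch;
# real XML names allow more characters than this).
NAME = re.compile(r"[A-Za-z_:][\w:.\-]*")

def repeated_names(content_model):
    """Return the element names that occur more than once in a content model."""
    counts = Counter(NAME.findall(content_model))
    return sorted(name for name, count in counts.items() if count > 1)

# Every name occurs once, so this declaration satisfies the restriction.
print(repeated_names("(name, city*, (province | state)?)"))
# Here "a" occurs twice, so dtd-inf would not produce a declaration like this.
print(repeated_names("(a, b, a*)"))
```

An empty result means the content model uses every element name at most once.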

You need to install Python 3 on your computer (I do not know or care whether Python 2 will work). Download the package and unpack it. You can then run python3 dtd-inf.py --help. (Depending on your system, you can give dtd-inf.py executable rights and run it directly.)

My favorite example is the Mondial XML file. It is complicated enough to produce some interesting output, but not so large that parsing it takes forever. So it's perfect for toying around.

./dtd-inf.py mondial.xml

Computes a DTD for the file.

./dtd-inf.py mondial.xml -j

Omits the doctype stuff around the element type declarations.

./dtd-inf.py mondial.xml -js

Also omits all empty (boring) element type declarations.

./dtd-inf.py mondial.xml -j -e country city

Only learns the element type declarations for the elements country and city.

You can also read from multiple files, e.g.

./dtd-inf.py file1.xml file2.xml -e elt1 elt2

Here, the help that is automatically generated by argparse is a little bit misleading: First, specify the list of files, then use the flags, to avoid ambiguous statements like ./dtd-inf.py -e elt1 elt2 file1.xml file2.xml.
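The ordering problem can be reproduced with a small stand-alone parser. This is only a sketch: it assumes that the files are a positional argument with nargs="+" and that -e also takes one or more values. The actual parser in dtd-inf.py may be set up differently, and the names build_parser, try_parse and elements are made up for this example.

```python
import argparse

def build_parser():
    # Hypothetical stand-in for the real parser, not copied from dtd-inf.py.
    parser = argparse.ArgumentParser(prog="demo")
    parser.add_argument("files", nargs="+")
    parser.add_argument("-e", dest="elements", nargs="+")
    return parser

def try_parse(argv):
    """Return the parsed namespace, or None if argparse rejects the arguments."""
    try:
        return build_parser().parse_args(argv)
    except SystemExit:
        # argparse prints a usage message to stderr and exits on errors.
        return None

# Files first, then flags: both lists are filled as intended.
ok = try_parse(["file1.xml", "file2.xml", "-e", "elt1", "elt2"])
print(ok.files, ok.elements)

# Flags first: -e swallows the file names too, so no files are left over.
print(try_parse(["-e", "elt1", "elt2", "file1.xml", "file2.xml"]))
```

In this sketch, the second call fails because -e greedily consumes all four values and the required list of files stays empty.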

This tool does not learn attributes and does not handle #PCDATA. Perhaps at some point in the future. Turning this into an XSD inference tool is possible, but would take more work. (For now, you could always transform the DTD into an XSD. Perhaps it's good enough.)

The code has not been audited or debugged on a large scale (it should work, but don't bet the farm on it). It is also not particularly optimized, neither for efficiency nor for readability. (But the bottleneck seems to be the parsing of the XML files, so there should be no need to optimize the other parts for efficiency.) Also, some files use spaces for indentation, others tabs. Sorry.

The way the prettification algorithm is used is particularly ugly: Instead of prettifying the inferred regular expression in a tree-like data structure, the expression is inferred as a string, which is then parsed and prettified. A nice implementation would skip the intermediate string representation. This is one of the reasons why, if your element names contain any fancy Unicode stuff, something might break.

Apart from the algorithms in the paper, this implementation also uses an additional post-processing step that makes the expressions prettier (e.g., turns nested expressions like (a|(b|(c|d))) into (a|b|c|d), etc.). The language-theoretic details can be found in an eternally half-finished technical report that is available from me.
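The flattening of nested expressions mentioned above can be sketched on a small tree representation. This is not the code of the M.O.D.O.D. library, and it covers only this one rewriting rule. The tuple encoding ("or" or "seq" plus a list of children) and the function names are assumptions made for illustration.

```python
def flatten(expr):
    """Merge directly nested nodes with the same operator into one node.

    An expression is either an element name (str) or a pair (op, children),
    where op is "or" or "seq" and children is a list of expressions.
    """
    if isinstance(expr, str):
        return expr
    op, children = expr
    flat = []
    for child in children:
        child = flatten(child)
        if not isinstance(child, str) and child[0] == op:
            # Same operator as the parent: pull the grandchildren up.
            flat.extend(child[1])
        else:
            flat.append(child)
    return (op, flat)

def to_string(expr):
    """Render an expression with | for "or" and , for "seq"."""
    if isinstance(expr, str):
        return expr
    op, children = expr
    separator = "|" if op == "or" else ","
    return "(" + separator.join(to_string(c) for c in children) + ")"

nested = ("or", ["a", ("or", ["b", ("or", ["c", "d"])])])
print(to_string(nested))           # (a|(b|(c|d)))
print(to_string(flatten(nested)))  # (a|b|c|d)
```

Working directly on such a tree avoids the detour through a string that has to be parsed again.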

The core inference algorithm was implemented by Dominik D. Freydenberger and uses this implementation of Tarjan's Algorithm by Dries Verdegem (which, to our knowledge, is in the public domain). The prettification algorithm is a part of the M.O.D.O.D. library, which was designed (only for DREs) by Dominik D. Freydenberger and implemented by Christoph Burschka. The creation of the M.O.D.O.D. library was generously supported by the program "Nachwuchswissenschaftler/innen im Fokus" (Goethe University). We put this stuff under the MIT License, and the source code is already included.