One of the better-known bioinformatics databases, SWISS-PROT, is going to
be distributed in XML sometime soon. There has been a lot of demand,
mostly because the data is pretty complex and the current flat text format
is not quite trivial to parse, and of course because XML will solve
everyone's problems.
The programmers who process our data are often, but not always, entry-level
programmers, typically working with Perl, Java or C++. Tasks range from
trying to integrate our data (or more likely parts of it) with other
databases to simply reformatting and displaying single database entries in
detail.
We decided to provide an XML Schema because it can be used as a detailed
description of the data as well as for generating code for simple
applications, not to mention web services... Also, it provides me with a
pretext for asking people on this list for free XML advice :-)
There remain some open issues we haven't been able to decide on:
* Element naming: geneName vs. gene-name. Mixed-case names seem to be more
fashionable at the moment, but I tend to prefer the second, less
programming-language-like style. Certainly the Perl programmers wouldn't
approve of Java-style names. And then there are the Python programmers...
Similarly, should lists be explicitly named as such, e.g. <geneList> or
<gene-list> vs. <genes>? What do you prefer to work with?
* Should we avoid Schema features not supported by JAXB? Any other features
you would advise against using if you don't want to anger/frustrate anyone?
* Importance of tools. The general opinion is that it should be left up to
users to write their own parsers. My view here is that we should provide a
set of tools for reading, writing and representing our data right from the
beginning, sort of a reference implementation (in Java and possibly Perl,
both of which we are using internally anyway). From your point of view, how
much help would such tools be? Would you use them or, being an experienced
XML developer, simply write your own code anyway?
* Normalize data? There is lots of data that is repeated. My current
approach is to put these elements into separate files when distributing
large amounts of data, while allowing them to be inlined in situations such
as users downloading small sets from the web (e.g. query results). Is this
strategy clever, or simply confusing?
In swissprot.xml:
  <keyword id="25"/>
In keywords.xml OR swissprot.xml:
  <keyword id="25">
    <name>x</name>
    <category>y</category>
    ...
  </keyword>
* Go ahead vs. wait. Some people think: why hold back the first XML
release? Since it's XML, we can always change the format, right?
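To make the normalization question a bit more concrete, here is a minimal
Python sketch of what a consumer might have to do if keywords can appear
either as bare references or inlined. The <entry> and <keywords> wrapper
elements, the sample values and the function names are assumptions made up
for illustration; they are not taken from the actual Schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for keywords.xml (wrapper element is an assumption).
KEYWORDS_XML = """
<keywords>
  <keyword id="25">
    <name>x</name>
    <category>y</category>
  </keyword>
</keywords>
"""

# Hypothetical stand-in for one entry in swissprot.xml: one keyword is a
# bare reference, the other is inlined.
ENTRY_XML = """
<entry>
  <keyword id="25"/>
  <keyword id="26">
    <name>inlined</name>
    <category>z</category>
  </keyword>
</entry>
"""


def index_keywords(root):
    """Map keyword id -> element for every keyword that has child elements."""
    return {kw.get("id"): kw for kw in root.iter("keyword") if len(kw) > 0}


def resolve_keywords(entry, external):
    """Return (id, name) pairs for an entry's keywords.

    Inlined keywords are used directly; bare references are looked up in
    the external index. Unresolvable references yield a name of None.
    """
    resolved = []
    for kw in entry.iter("keyword"):
        full = kw if len(kw) > 0 else external.get(kw.get("id"))
        name = full.findtext("name") if full is not None else None
        resolved.append((kw.get("id"), name))
    return resolved


external = index_keywords(ET.fromstring(KEYWORDS_XML))
result = resolve_keywords(ET.fromstring(ENTRY_XML), external)
print(result)
```

The point of the sketch is that every consumer needs both code paths (inline
and lookup) plus a decision on what to do with a reference that cannot be
resolved, which is part of what makes me wonder whether the mixed approach
is more confusing than helpful.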
Any comments would be greatly appreciated!
(A Schema and some example data are available at
http://viralgenomics.org/xml/.)
--
Eric Jain