Lincoln Stein's Keynote: Building a Bioinformatics Nation

A Land of City States

Lincoln Stein is one of the favorite personalities of both the Perl and bioinformatics crowds, and his afternoon keynote at O'Reilly's Bioinformatics Technology Conference was predictably well attended and well received. Lincoln's talk was titled "Bioinformatics: Building a Nation from a Land of City States," and he started by comparing the Italian city states of the Middle Ages to today's bioinformatics data providers.

The Italian city states were a disparate group with different legal and political systems, dialects, cultures, weights and measures, taxation, and currencies. Even though Italy had brilliant thinkers and scientists, its technological and industrial development lagged because of the difficulty in overcoming these differences. Lincoln argued that today's bioinformatics data providers are also suffering from too many differences, and these differences are hindering the advancement of science. "We see a lot of fragmentation in the landscape of data providers," he said, "and each of these data sites has its own view of the world."

Bioinformatics databases like NCBI, Ensembl, FlyBase, SGD, WormBase, and UCSC all provide relevant data, but unfortunately they use a wide range of different systems and formats. Riffing on the popular Perl slogan that the language makes easy things easy and hard things possible, Lincoln noted that the current situation in bioinformatics "was making easy things hard."

As an example, Lincoln described the typical process involved in what should be an easy task: getting all the human sequences submitted to GenBank/EMBL in a given week. There are many ways to do this, but the typical approach would include writing one script to fetch the data from the provider's Web site, another to parse the file format, a third to move the data into a private database, and a fourth to repeat the process on a weekly basis. Because all researchers who want this data write similar but not identical programs, thousands of such scripts will be created, and none of them will work together. Besides the wasted effort, every time a data provider tweaks an interface or a format, thousands of programs break and need to be fixed.
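The four-script workflow can be sketched in a few lines of Python. This is only an illustration, not anything shown in the talk: the flat-file sample, the `seqs` table, and every function name are hypothetical, the record format is a drastically simplified stand-in for real GenBank/EMBL entries, and the fetch step returns a canned string rather than contacting a provider.

```python
import sqlite3

# Hypothetical, heavily simplified flat-file sample (assumption for illustration).
# Real GenBank/EMBL records carry many more fields.
SAMPLE = """\
ID   AB000001
OS   Homo sapiens
SQ   ACGTACGT
//
ID   AB000002
OS   Mus musculus
SQ   GGCCTTAA
//
"""


def fetch():
    """Step 1: stand-in for downloading the week's submissions."""
    return SAMPLE


def parse(text):
    """Step 2: turn the flat file into one dict per record."""
    records, current = [], {}
    for line in text.splitlines():
        if line.startswith("//"):  # record terminator
            if current:
                records.append(current)
            current = {}
        elif line.strip():
            tag, _, value = line.partition(" ")
            current[tag] = value.strip()
    return records


def load(records, conn):
    """Step 3: store only the human records in a private database."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS seqs "
        "(id TEXT PRIMARY KEY, organism TEXT, seq TEXT)"
    )
    human = [
        (r["ID"], r["OS"], r["SQ"])
        for r in records
        if r.get("OS") == "Homo sapiens"
    ]
    conn.executemany("INSERT OR REPLACE INTO seqs VALUES (?, ?, ?)", human)
    conn.commit()
    return len(human)


def weekly_update(conn):
    """Step 4: the job a scheduler such as cron would rerun each week."""
    return load(parse(fetch()), conn)


conn = sqlite3.connect(":memory:")
loaded = weekly_update(conn)
stored_ids = [row[0] for row in conn.execute("SELECT id FROM seqs")]
```

Every step here hard-codes an assumption about the provider's format, which is exactly why a small change on the provider's side breaks each of the thousands of private copies of such a script.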

Part of the solution to this problem, Lincoln argued, is to "see the open source light." He encouraged the audience to take advantage of open source libraries like BioPerl, BioJava, and BioPython; open source protocols like BioXML and DAS; and open source end-user applications like Genquire and PyMol. He also mentioned some of the groups promoting open source efforts in bioinformatics, such as the Open Bioinformatics Foundation, the GMOD project, and Bioinformatics.org.

Lincoln then summarized efforts to unify the bioinformatics data services. These efforts started 12 years ago with the Meetings of the Molecular Biology Databases (MMBD), which essentially ended in argument. Every member thought his or her way of doing things was the best way. Next came the federated models like Gaea and Kleisli, and then the data warehouses of Ensembl, UCSC, and others. This brings us to the ad hoc Web services that are currently in place. These allow programmatic access to data, as in the GenBank/EMBL example. To truly unify the services of bioinformatics data providers we need to move beyond this to a more formal Web services model.

In this Web services model, the data providers would register their services in a formalized service registry, and researchers' scripts would no longer need to be concerned with the interface details of the different databases. This model represents the unification that Lincoln and, judging by the response, apparently everyone in the audience hope to see in bioinformatics. Lincoln argued that the necessary infrastructure to support this model is already almost entirely in place.
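To make the registry idea concrete, here is a toy in-memory sketch. It is not the architecture Lincoln presented and it uses no real registry protocol: the `ServiceRegistry` class, the `gene_location` service name, both providers, and the gene data are all invented for illustration.

```python
class ServiceRegistry:
    """Toy in-memory registry mapping a service name to provider handlers."""

    def __init__(self):
        self._services = {}

    def register(self, service, provider, handler):
        """A provider announces that it implements a named service."""
        self._services.setdefault(service, {})[provider] = handler

    def call_all(self, service, *args):
        """Invoke every registered provider; return results keyed by provider."""
        handlers = self._services.get(service, {})
        return {name: handler(*args) for name, handler in handlers.items()}


# Two hypothetical providers whose internal storage differs.
def provider_one_lookup(gene):
    table = {"geneA": ("chr1", 100, 900)}  # tuple-based storage
    chrom, start, end = table[gene]
    return {"gene": gene, "chrom": chrom, "start": start, "end": end}


def provider_two_lookup(gene):
    rows = ["geneA|chr1:100-900"]  # string-based storage
    for row in rows:
        name, location = row.split("|")
        if name == gene:
            chrom, span = location.split(":")
            start, end = span.split("-")
            return {"gene": name, "chrom": chrom,
                    "start": int(start), "end": int(end)}
    return None


registry = ServiceRegistry()
registry.register("gene_location", "provider_one", provider_one_lookup)
registry.register("gene_location", "provider_two", provider_two_lookup)

# The researcher's script names the service, not the provider's interface.
results = registry.call_all("gene_location", "geneA")
```

The researcher's code depends only on the service name and the shape of the answer, so a provider can reorganize its internals without breaking anyone downstream.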

Since this formal Web services model has still not been realized, Lincoln offered a list of things data providers can do today to become good citizens in the bioinformatics nation. Dubbed the "Data Providers Code of Conduct," these ideas met with great audience approval, clearly striking a chord with many attendees.

First, he urged data providers to realize that their Web pages are really an interface for bioinformatics researchers. Not only do they need to work for human visitors, they must function successfully as an interface for the batch scripts that bioinformatics researchers will inevitably write. Second, he urged the providers to understand that this interface is essentially a contract with data consumers and it should be adequately documented. Changes to the interface shouldn't be made lightly, and when possible, legacy interfaces should be maintained.

Next in the code of conduct was the notion that choice is good and data providers should support as many different interfaces as possible, from HTML to SOAP-XML. Lincoln then urged providers to allow batch downloads and to make use of existing formats. If they absolutely have to create new data formats, they should use common sense in the design. Everyone knows how to deal with tab-delimited text, and XML is good for hierarchical data. Finally, Lincoln asked data providers to support ad hoc queries, noting that people will always end up using the data in unintended ways.
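The point about formats is easy to demonstrate with Python's standard library. The sample data below is invented for illustration and does not follow any provider's actual schema: flat records fit naturally into tab-delimited text, while a nested gene, transcript, and exon structure is more comfortable in XML.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Invented flat data: one gene per row, columns separated by tabs.
TSV = (
    "gene\tchrom\tstart\tend\n"
    "geneA\tchr1\t100\t900\n"
    "geneB\tchr2\t50\t400\n"
)

# Invented hierarchical data: a gene contains a transcript, which contains exons.
XML = """
<gene id="geneA">
  <transcript id="t1">
    <exon start="100" end="200"/>
    <exon start="300" end="900"/>
  </transcript>
</gene>
"""

# Tab-delimited text parses with the csv module; each row becomes a dict.
rows = list(csv.DictReader(io.StringIO(TSV), delimiter="\t"))

# XML keeps the nesting; iter() walks down to every exon element.
root = ET.fromstring(XML)
exons = [(int(e.get("start")), int(e.get("end"))) for e in root.iter("exon")]
```

Neither format requires a custom parser on the consumer's side, which is the practical benefit of reusing formats everyone already knows.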

Lincoln wrapped up on an optimistic note, pointing out that Italy managed to pull it together and so can we. "I'm looking forward to the day when bioinformatics is no longer a set of hostile city states," concluded Lincoln, "but instead a world where we can all work together."