A few days back I discussed some thoughts on cheminformatics vis a vis bioinformatics, inspired by a review by Aaron Sterling. In that thread, Steven Salzberg made a comment, stating

… In my opinion, it is not a “hot” field, though, in part for some of the reasons mentioned in the post – particularly the fact that the data in the field is mostly proprietary and/or secret. So they hurt themselves by that behavior. But the other reason I don’t think it is moving that fast is that, unlike bioinformatics, chemoinformatics is not being spurred by dramatic new technological advances. In bioinformatics, the amazing progress in automated DNA sequencing has driven the science forward at a tremendous pace …

I agree with Steven and others that cheminformatics is not as “hot” as bioinformatics, based on varying metrics of hotness (groups, publications, funding, etc.). However I think the perceived lack of popularity stems from a number of reasons and that technological pushes are a minor reason. (Andrew Dalke noted some of these in a comment).

1. Lack of publicaly accessible data – this has been mentioned in various places and I believe is a key reason that held back the development of cheminformatics outside industry. This is not to say that academic groups weren’t doing cheminformatics in 70′s and 80′s, but we could’ve had a much richer ecosystem.

In this vein, it’s also important to note that just public structure data, while necessary, would likely not have been sufficient for cheminformatics developemnt. Rather, structure and biological activity data are both requred for the development of novel cheminformatics methodologies. (Of course certain aspects of cheminformatics are are focused purely on chemical structure, and as such do fine in the abensce of publically accesssible activity data).

2. Small molecules can make money directly – this is a primary driver for the previous point. A small molecule with confirmed activity against a target of interest can make somebody a lot of money. It’s not just that one molecule – analogs could be even more potent. As a result, the incentive to hold back swathes of structure and activity data is the financially sensible approach. (Whether this is actually useful is open to debate). On the other hand, sequence data is rarely commercialiable (though use of the sequence could be) and hence much easier to release.

3. Burden of knowledge – as I mentioned in my previous post, I believe that to make headway in many areas of cheminformatics requires some background in chemistry, sincce mathematical abstractions (cf graph representations) only take you so far. As Andrew noted, “Bioinformatics has an “overarching mathematical theory” because it’s based very directly on evolution, encoded in linear sequences“. As a result the theoretical underpinnings of much of bioinformatics make it more accessible to the broader community of CS and mathematics. This is not to say that new mathematical developments are not possible in cheminformatics – it’s just a much more complex topic to tackle.

4. Lack of federal funding – this is really a function of the above three points. The idea that it’s all been done in industry is something I’ve heard before at meetings. Obviously, with poor or no federal funding opportunities, fewer groups see cheminformatics as a “rewarding” field. While I still think the NIH’s cancellation of the ECCR program was pretty dumb, this is not to say that there is no federal funding for cheminformatics. Applications just have to be appropriately spun.

To address Stevens’ point regarding technology driving the science – I disagree. While large scale synthesis is possible in some ways (such as combinatorial libraries, diversity oriented synthesis etc.), just making large numbers of molecules is not really a solution. If it were, we might as well generate them virtually and work from the SMILES.

Instead, what is required is large scale activity measurements. And there have been technology developments that allow one to generate large amounts of structure-actvity data – namely, High Throughput Screening (HTS) technologies. Admittedly, the data so generated is not near the scale of sequencing – but at the same time, compared to sequencing, every HTS project usually requires some form of unique optimization of assay conditions. Added to that, we’re usually looking at a complex system and not just a nucleotide sequence and it’s easy to see why HTS assays are not going to be at the scale of next gen sequencing.

But, much of this technology was relegated to industry. It’s only in the last few years that HTS technology has been accesible outside industry and efforts such as the Molecular Libraries Initiative have made great strides in getting HTS technologies to academics and more importantly, making the results of these screens publicaly available.

As a bioinformatics bystander, while I see reports of next gen sequencing pushing out GBs and TBs of data and hence the need for new bioinformatics methods – I don’t see a whole lot of “new” bioinformatics. To me it seems that its just variations of putting together sequences faster – which seems a rather narrow area, if that’s all that is being pushed by these technological developments. (I have my asbestos underwear on, so feel free to flame)

Certainly, bioinformatics is helped by high profile projects such as the Human Genome Project and the more recent 1000 Genomes project which certainly have great gee-whiz factors. What might be an equivalent for cheminformatics? I’m not sure – but I’d guess something on the lines of systems biology or systems chemical biology might be a possibility.

Or maybe cheminformatics just needs to become “small molecule bioinformatics”?