So I’ve been working a lot with NCBI GEO recently for a paper on the Gene Ontology. Along the way I wound up implementing about 70% of the well-known R package GEOquery in Python (I’m much more fluent in Python than R) and decided it might be worth submitting to the BioPython project. Their existing GEO parser is woefully inadequate and slightly buggy: I don’t believe it can handle the curated GEO DataSet (GDS) format, it has no programmatic access to NCBI GEO, and it offers no way to do any statistical analysis on the resulting microarray data.

My fork, which is available here, revamps the Geo package to provide the following features:

Automatic retrieval and parsing of GEO files, either from NCBI or from the local filesystem

Pretty-printing of metadata, column, and table information

Ability to convert GDS records into a form that provides a NumPy matrix representation of the sample/probe matrix
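To give a rough idea of what that last conversion involves, here is a minimal standard-library sketch. It is not the fork's actual API: the function names are made up for illustration, and it assumes the usual GDS SOFT layout, where the data table sits between the !dataset_table_begin and !dataset_table_end lines and the first two columns are ID_REF and IDENTIFIER.

```python
def to_float(value):
    """Convert a table cell to float; non-numeric cells (e.g. 'null') become NaN."""
    try:
        return float(value)
    except ValueError:
        return float("nan")


def gds_table_to_matrix(soft_lines):
    """Pull the probe-by-sample table out of GDS SOFT text.

    Returns (probe_ids, sample_ids, rows), where rows[i][j] is the value
    for probe i in sample j. Hypothetical helper, for illustration only.
    """
    probes, samples, rows = [], [], []
    in_table = False
    header_seen = False
    for raw in soft_lines:
        line = raw.rstrip("\r\n")
        if line.startswith("!dataset_table_begin"):
            in_table = True
        elif line.startswith("!dataset_table_end"):
            break
        elif in_table:
            cols = line.split("\t")
            if not header_seen:
                # Header row: ID_REF, IDENTIFIER, then one column per sample
                samples = cols[2:]
                header_seen = True
            else:
                probes.append(cols[0])
                rows.append([to_float(v) for v in cols[2:]])
    return probes, samples, rows
```

With NumPy installed, numpy.array(rows, dtype=float) turns the nested list into the probe/sample matrix, and the two ID lists serve as row and column labels.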

I haven’t written unit tests for it all yet (a persistent failing, one of many, I’ll admit), mostly because it was developed a bit on-the-fly during my work. However, I do know that it works for at least a subset of use cases, and it’s well-documented.

You have no idea the pain I feel when I sit down to program. I’m walking on razor blades and broken glass. You have no idea the contempt I feel for C++, for J2EE, for your favorite XML parser, for the pathetic junk we’re using to perform computations today. There are a few diamonds in the rough, a few glimmers of beauty here and there, but most of what I feel is simply indescribable nausea.

That’s quite cool if you want to do a historical analysis of how these annotations are changing with time (which I do). For instance, say you want to see how many terms have been marked as obsolete since 2004, and you’ve downloaded the current gene_ontology_ext.obo file and the goa file from ’04:
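A minimal sketch of that kind of count, using only the standard library rather than the package above. The function names are made up for illustration, and it assumes two things about the inputs: retired terms carry an "is_obsolete: true" line in their [Term] stanza of the OBO file, and the goa file is tab-separated with the GO ID in the fifth column and comment lines starting with "!".

```python
def obsolete_terms(obo_lines):
    """Return the IDs of all [Term] stanzas flagged 'is_obsolete: true'."""
    obsolete = set()
    in_term = False
    term_id = None
    for raw in obo_lines:
        line = raw.strip()
        if line.startswith("["):
            # A new stanza starts; only [Term] stanzas are of interest
            in_term = (line == "[Term]")
            term_id = None
        elif in_term and line.startswith("id:"):
            term_id = line.split(":", 1)[1].strip()
        elif in_term and term_id and line == "is_obsolete: true":
            obsolete.add(term_id)
    return obsolete


def annotated_terms(gaf_lines):
    """Return the set of GO IDs used in an annotation file (fifth column)."""
    used = set()
    for raw in gaf_lines:
        if raw.startswith("!") or not raw.strip():
            continue  # skip comment and blank lines
        cols = raw.rstrip("\r\n").split("\t")
        if len(cols) > 4:
            used.add(cols[4])
    return used


def count_retired(obo_lines, gaf_lines):
    """Return (retired, total): annotated terms now obsolete, and all annotated terms."""
    used = annotated_terms(gaf_lines)
    retired = used & obsolete_terms(obo_lines)
    return len(retired), len(used)


# Example usage (the annotation file name is a placeholder):
# with open("gene_ontology_ext.obo") as obo, open("goa_2004.gaf") as gaf:
#     retired, total = count_retired(obo, gaf)
#     print("%d/%d" % (retired, total))
```

Working with sets of IDs means each term is counted once no matter how many gene products it annotates, so the ratio is a fraction of distinct terms, not of annotation lines.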

So we’ve got 381/3989 (about 9.6%) that have been retired since ’04. That’s not too shabby, although I imagine the GO hierarchy and overall structure have changed more significantly since then. Still, it makes it plausible to track the gene annotations of the majority of terms over the last 8 years.