New software: pygsi

Whenever a paper involving sequencing the genome of bacteria (or other species for that matter) is published, the researcher is obliged to deposit the data (usually short reads) in either the European Nucleotide Archive (ENA) or the Short Read Archive (SRA), along with some metadata.

Sounds good, but until recently there was a flaw: whilst one could deposit the short-read files, one could only search the associated metadata. Say you wanted to search the ENA for samples containing MCR-1, an important, recently identified gene that confers colistin resistance. If it wasn’t explicitly mentioned in the metadata (and most of the time it wouldn’t have been, as it wouldn’t have been identified yet!), you’d have had to download all the possible short-read files and then trawl through them.

In other words, the ENA and SRA were archives: easy to put data into, difficult to search and interrogate.

Zam Iqbal and his group have developed a searchable index of all the bacterial and viral pathogen genetic data in the ENA/SRA as of late 2017. It is called BIGSI and you can try it here (the resemblance to an early Google is not, I suspect, a coincidence) and you can find the preprint here.

Doesn’t seem like much, but suddenly we can ask all sorts of interesting questions. Like: how many samples contain MCR-1? One problem is that when we are looking for a gene, we are usually looking for the reference sequence and its associated minor variants (e.g. a couple of SNP differences). With the current BIGSI interface this is hard, since you’d have to systematically give it all possible variants of your base k-mer.

Fortunately, systematically is something computers are good at, so as a hack (because ultimately I imagine something like BIGSI will become a service at the EBI and this sort of functionality will be included), I wrote a Python package that takes a gene, walks along the sequence and asks BIGSI how many times each minor variant occurs. Since each variant requires a web API call, it isn’t rapid, but you can work through a single gene overnight.
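To give a flavour of what "walking along the sequence" means, here is a minimal sketch. It is not pygsi's actual code: the function names are made up for illustration, it only considers single-base substitutions, and the `query` argument is a hypothetical stand-in for whatever function makes the web call to BIGSI and returns a count.

```python
from typing import Callable, Dict, Iterator, Tuple

BASES = "ACGT"


def single_base_variants(sequence: str) -> Iterator[Tuple[int, str, str, str]]:
    """Yield (position, ref_base, alt_base, variant_sequence) for every
    single-base substitution of the input sequence (0-based positions)."""
    sequence = sequence.upper()
    for pos, ref in enumerate(sequence):
        for alt in BASES:
            if alt != ref:
                yield pos, ref, alt, sequence[:pos] + alt + sequence[pos + 1:]


def count_variants(sequence: str, query: Callable[[str], int]) -> Dict[str, int]:
    """Ask `query` how often each variant occurs.

    `query` is an assumption: any callable that takes a sequence and
    returns the number of samples containing it. Keys are 1-based
    mutation labels such as "A1C".
    """
    counts = {}
    for pos, ref, alt, variant in single_base_variants(sequence):
        counts[f"{ref}{pos + 1}{alt}"] = query(variant)
    return counts
```

Each position has three alternative bases, so a gene of length n generates 3n queries from substitutions alone, which is why a single gene can take all night.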

The package, including a more detailed description and examples, can be downloaded from its GitHub repository.