Sequencing Quality Control, statistics and bioinformatics blog

The case for a dataset recommendation engine

Science is carried out by testing hypotheses through experimentation in the ‘wet lab’ and in the ‘dry lab’. It often starts with a hypothesis that we want to test; then, through a series of experiments and bioinformatic analyses, new hypotheses are generated, some of which are validated and others discarded. In the ‘dry lab’ this iterative process often relies on published datasets. When I am testing a new concept or a new algorithm, or just want to compare my results with other related data, I go to public databases to see what is out there. Besides the obvious large data portals from big consortia (TCGA, ICGC, ENCODE, etc.), there are many other databases, such as GEO and ArrayExpress, that host datasets from individual labs and publications.

Hackday with DNAdigest

So, faced with the increasing need for better tools to search large databases intelligently, we organised a hackathon to come up with ‘recommender systems for scientific datasets’.
The Hackday was organised by DNAdigest, a non-profit organisation that aims to tackle the challenges of genomic data sharing. You can read more about them here.

Recommender systems for scientific datasets

The idea was simple and perhaps inspired by the Amazon and Netflix recommender systems: help scientists find datasets from various sources (various experiments, studies, etc.).
Below is a good example of this concept. When I search for GSE1379 on Google Scholar, it finds 41 papers with “GSE1379” in the text. Alongside this, you can see how other related datasets are often co-cited:

A huge matrix representing the dataset citation network:
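
To make the co-citation idea concrete, here is a minimal sketch in Python. It is not the code from the hackday. It only illustrates one simple way such a recommender could work, by counting how often two datasets are mentioned in the same paper. The function names, the paper IDs and the accessions `GSE_X`, `GSE_Y` and `GSE_Z` are made-up placeholders for illustration; only GSE1379 comes from the example above.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cocitation(papers):
    """Count how often each pair of datasets is cited in the same paper.

    `papers` maps a paper ID to the dataset accessions mentioned in it.
    Returns a nested mapping: cocite[a][b] = number of papers citing both.
    """
    cocite = defaultdict(Counter)
    for accessions in papers.values():
        # De-duplicate so a paper counts at most once per dataset pair.
        for a, b in combinations(sorted(set(accessions)), 2):
            cocite[a][b] += 1
            cocite[b][a] += 1
    return cocite


def recommend(cocite, accession, top_n=3):
    """Return up to `top_n` datasets most often co-cited with `accession`."""
    neighbours = cocite.get(accession, {})
    # Sort by co-citation count (descending), then by name for stable output.
    ranked = sorted(neighbours.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]


# Toy input: placeholder paper IDs and accessions, not real citation data.
papers = {
    "paper_A": ["GSE1379", "GSE_X", "GSE_Y"],
    "paper_B": ["GSE1379", "GSE_X"],
    "paper_C": ["GSE_Y", "GSE_Z"],
}

cocite = build_cocitation(papers)
print(recommend(cocite, "GSE1379"))
```

With this toy input, `GSE_X` ranks first because it shares two papers with GSE1379, while `GSE_Y` shares only one. A real system would presumably need to normalise these counts, since heavily cited datasets would otherwise dominate every recommendation.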

You can read more about the outcomes of the hackday in the full blog post here. We’ve made all the code produced so far available in the DNAdigest Bitbucket repository.

We are open to more ideas and contributors to this project! If you are interested, have some ideas and would like to contribute, join the discussion at http://dnadigest.hackpad.com, or just add a comment below!