at OpenHelix

Tip of the Week: InterMine for mining “big data”

Integrating large data sets for queries within–and across–various collections is one of the arenas that has lately been pretty active in bioinformatics. As more and more “big data” projects yield huge numbers of data points and data types, this is only becoming more necessary. I love to browse data, but there are times when a large-scale customized query is what you’ll want to make some broader discoveries.

Right now there are a number of resources and interfaces that I turn to for structured and customized queries of data collections. The UCSC Table Browser, BioMart, Galaxy–these are the ones I have my hands on almost continuously. But there is another warehouse and interface system that we’re seeing more and more: InterMine.

My first real encounter with InterMine was for the modENCODE data. There’s some really terrific data flowing out of that project now (I talked a bit about that recently here), and the interface and storage system they are using is InterMine.

FlyMine was the initial impetus for the “Mine” system. Some years back FlyMine was created as a warehouse and query system for the increasing amounts of fly data coming from various projects. The goal was a system powerful enough for bioinformaticians and other super users that also gave bench biologists a friendly interface.

The initial paper described a user interface with three primary components: a Quick Search that’s great for browsing; a Template library that lets users access pre-defined standard or likely query types that they can tweak for their needs; and a fully customizable Query Builder for the most advanced access. Development has continued since that paper, and there are other new and cool features present as well.
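To make the Template idea a little more concrete, here is a minimal Python sketch of what "a pre-defined query you can tweak" might look like as a data structure. This is not the InterMine API or its internal model. The `Constraint`, `Query`, and `fill_template` names, the path strings, and the example values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Constraint:
    """One filter in a query (hypothetical structure, not InterMine's own)."""
    path: str               # illustrative dotted path, e.g. "Gene.symbol"
    op: str                 # comparison operator, e.g. "="
    value: str              # default value shipped with the template
    editable: bool = False  # only editable constraints may be tweaked


@dataclass(frozen=True)
class Query:
    view: tuple         # columns the query returns
    constraints: tuple  # Constraint objects, all applied together


def fill_template(template: Query, values: dict) -> Query:
    """Return a copy of `template` with editable constraints overridden.

    `values` maps a constraint path to its new value. The template itself
    is left unchanged, so it can be reused for the next tweak.
    """
    known_paths = {c.path for c in template.constraints}
    unknown = set(values) - known_paths
    if unknown:
        raise KeyError(f"no such constraint(s) in template: {sorted(unknown)}")

    new_constraints = []
    for c in template.constraints:
        if c.path in values:
            if not c.editable:
                raise ValueError(f"{c.path} is fixed in this template")
            c = replace(c, value=values[c.path])
        new_constraints.append(c)
    return Query(view=template.view, constraints=tuple(new_constraints))


# A made-up template: the gene symbol can be changed, the organism cannot.
GENE_TEMPLATE = Query(
    view=("Gene.symbol", "Gene.name"),
    constraints=(
        Constraint("Gene.symbol", "=", "zen", editable=True),
        Constraint("Gene.organism.name", "=", "Rattus norvegicus"),
    ),
)

# Tweak the template for a different gene symbol.
my_query = fill_template(GENE_TEMPLATE, {"Gene.symbol": "Lep"})
for constraint in my_query.constraints:
    print(constraint.path, constraint.op, constraint.value)
```

In this picture, the Query Builder would correspond to constructing a `Query` from scratch, while a Template hands you one that is mostly filled in and lets you change only the parts marked editable.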

Another big goal of the FlyMine effort was to be able to deal with lists. One of the most common questions we still get in workshops is: “I have a list of _____. What’s the best way to deal with that?” FlyMine–and the InterMines in general–help people to query and manage their explorations with lists of stuff.

The MyMine feature of the InterMines is also a nice component. You can create a login and store things you want to have repeated access to: queries, lists, etc.

Other groups are using InterMine for their systems too. A recent paper on TargetMine, for “Gene Prioritization and Target Discovery”, is available, and TargetMine might appear in an upcoming tip! Jennifer did a tip on YeastMine from SGD once as well.

But what triggered me to do this tip is that a letter came from the RGD mailing list last week that said this:

Effective Friday, May 20th, 2011 the MCW BioMart tool will be retired by RGD and the MCW Proteomics Center. For mining rat data, we have found that the RatMine tool is easier to use, more flexible and incorporates more types of data than BioMart. In addition, RatMine includes analysis tools not found in BioMart, giving RatMine users a single, intuitive interface for both obtaining and analyzing data.

So they are retiring the Rat BioMart and moving fully to InterMine, using RatMine exclusively at their installation. This tip of the week will therefore explore InterMine, RatMine, and some other Mines. That’s a lot of ground to cover, but it’s probably worth your time to know about InterMine as it becomes more broadly available. It’s also important to understand how to query with the Mines if you want to bring the data to Galaxy for further analysis. If you visit Galaxy you’ll see that their “Get Data” section lets you access Mine tools, but you still need to know how to do the basic queries at the host site first.

Although this tip will touch on RatMine, the focus is the more general InterMine suite. RGD also said this in their notice: