Computational methods of information retrieval have revolutionized librarianship. Developments in text mining and natural language processing are likely to bring equally profound change to how scientists and clinicians interact with the biomedical literature.

Dr. Lawrence Hunter is the Director of the University of Colorado’s Computational Bioscience Program and a Professor of Pharmacology (School of Medicine) and Computer Science (Boulder). He received a Ph.D. in computer science from Yale University in 1989, and then joined the National Institutes of Health as a
staff scientist, first at the National Library of Medicine and then at the National Cancer Institute, before coming to Colorado in 2000. Dr. Hunter is widely recognized as one of the founders of bioinformatics; he served as the first President of the International Society for Computational Biology (ISCB), and
created several of the most important conferences in the field, including ISMB, PSB and VIZBI. Dr. Hunter’s research interests span a wide range of areas, from cognitive science to rational drug design.

This bioinformatics bite is going to be a little bit more clinically oriented:

A patient is presenting with excess blood clotting, which she thinks might be related to “something that runs in her family”. How do I find known diseases and genes (if any) that are associated with that phenotype?

A good place to start to look for information about symptoms and diseases that are related to genetics is MedGen. This database organizes information related to human medical genetics, like symptoms (clinical features), related genes, diseases, or genomic loci.

There are 94 results, the first of which is a clotting disorder, but one that is associated with too little clotting rather than too much clotting. If you scroll down, you see records that are not actually diseases:

To find out what type of record you’re looking at, look at the text after the concept ID (blue boxes). The screen captures above show a Disease or Syndrome, a Finding, and a Pharmacologic substance. Notice that the disease has links to other databases (green circle) and the others do not.

So how do we specify that we’re looking for a patient symptom related to a genetic disease? Like all the other NCBI databases, MedGen has field tags.

Here are some useful ones:

Clinical Features: short stature[clinical features] – records for diseases that are associated with short stature

Now we’ve narrowed the MedGen results to those that have clotting listed as a clinical feature. If you read the description, you see that Factor V deficiency is the only one associated with excess clotting. The record also shows what gene is associated with this disorder (F5) and links to descriptions from other resources like GeneReviews and OMIM, as well as Professional guidelines and Recent clinical studies.
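If you would rather script this kind of search, here is a minimal Python sketch that builds the same style of field-tagged query as a URL for NCBI’s E-utilities `esearch` endpoint. The search phrase “clotting” is my assumption based on the description above, and the helper names are made up for illustration. The sketch only builds and prints the URL; it does not contact NCBI.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities search endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def tagged_term(text, field):
    """Attach an Entrez-style field tag, e.g. 'short stature[clinical features]'."""
    return f"{text}[{field}]"


def medgen_search_url(term, retmax=20):
    """Build (but do not send) an esearch URL against the MedGen database."""
    params = {"db": "medgen", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


# Assumed search phrase; swap in whatever symptom you are looking for.
term = tagged_term("clotting", "clinical features")
url = medgen_search_url(term)
print(term)
print(url)
```

Pasting the printed URL into a browser should return the matching record IDs, which is handy when you want to repeat the same search for a list of symptoms.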

So how do you find out if this is in fact what your patient has? Find out next week!

This blog series has covered how to use both GEO Datasets (which holds both curated and uncurated datasets) and GEO Profiles (which holds expression profiles for individual genes from curated data sets).

But what if you want to see expression profiles of a gene from an uncurated dataset? That’s where GEO2R comes in. Once you’ve identified a dataset by searching GEO Datasets, you can start using GEO2R in 5 easy steps:

This record contains accession numbers (boxed in red) for the series, samples, and platform. GEO2R is looking for the Series Accession number, GSE73177. Enter this number into the search field in GEO2R, or click the Analyze with GEO2R link (blue arrow):

Define sample groups:

This experiment is measuring expression levels in 3 groups (parent strain, knockout strain, and complemented knockout strain**). Thus, we need to create 3 sample groups. To do this, click the “Define Groups” link (green circle above). This action activates a popup that allows you to enter free text to name the groups (red box; wt, ko, and comp in the following example).

Assign samples to groups:

Now you need to tell the program which samples belong to each group by selecting the samples that you want to put into a group, then clicking on the group you want to add them to in the Define Groups popup. In the example below, I have selected the complemented knockout samples (highlighted yellow) and will click the “comp” group to add them. After they are added to a group, the corresponding colors change to that of the group and the group column is populated, as in the case of wt and ko in the example. Repeat this process for all of your groups.

Perform the test:

To get a list of the top 250 differentially expressed genes using the default settings*, scroll down and click the Top 250 button under the GEO2R tab.

A table appears containing the top 250 differentially expressed probes from the platform, listing the probe ID, p-value, adjusted p-value, F statistic and probe sequence.
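The adjusted p-value column is there because thousands of probes are tested at once, so some small p-values will turn up by chance. As far as I know, GEO2R’s default adjustment is the Benjamini & Hochberg false discovery rate. The sketch below is a plain-Python illustration of that procedure on made-up p-values, not the code GEO2R itself runs.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(pvalues)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, so each adjusted value is never
    # larger than the adjusted value of the next-largest p-value.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


# Made-up p-values for four probes, for illustration only.
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
for p, q in zip(raw, adj):
    print(f"p={p:.3f}  adjusted={q:.3f}")
```

An adjusted p-value is never smaller than the raw one, which is why a probe can look significant in the p-value column and not in the adjusted column.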

Interpret the results table:

Clicking on the probe ID will show you a graph of the gene expression among the groups you have specified. You can also click Sample Values to get the number values represented on this graph.

In this case, it looks like the gene that is probed by 55.m10280_at is highly expressed in the knockout relative to the wild type, but doesn’t revert to wild type levels in the complemented strain.
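To get a feel for what a between-group test measures for a single probe, here is a small Python sketch of an ordinary one-way ANOVA F statistic across three groups. The expression values are invented to mimic the pattern described above (high in ko, not reverting in comp). GEO2R runs its analysis in R with the limma package, which uses moderated statistics, so its numbers will not match this simplified version.

```python
def one_way_f(groups):
    """Ordinary one-way ANOVA F statistic for a list of groups of values.

    Assumes at least two groups and some variation within the groups;
    identical values inside every group would cause a division by zero.
    """
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    # Variation of the group means around the overall mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Variation of each sample around its own group mean.
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Invented values for one probe in the three sample groups.
wt = [1.0, 2.0, 3.0]
ko = [7.0, 8.0, 9.0]
comp = [7.0, 8.0, 9.0]
f_stat = one_way_f([wt, ko, comp])
print(f_stat)
```

A large F means the group means differ a lot compared to the scatter within each group. It does not tell you which group is different, which is why the expression graph is still worth a look.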

To determine the gene name of the probe used in this experiment, visit the corresponding platform record for this series. (You can find this by searching the Series accession in GEO Datasets, and using the platform filter.) Then, scroll to the bottom of the page to see the platform data table, which gives probe IDs, identifiers for genes in the Toxoplasma genome database, annotation, chromosome location and a description of the gene function if available. Search for the probe ID in the table to find the corresponding gene. In this case, it’s a thioredoxin domain-containing protein.
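If you have many probes to look up, searching the page by hand gets tedious. Here is a minimal Python sketch of the same lookup against a platform table saved as tab-delimited text. The column names and the placeholder gene identifiers are assumptions for illustration (a real platform table will have its own headers); only the probe ID and its description come from the example above.

```python
import csv
import io

# Stand-in for a downloaded platform table. Column names and gene IDs are
# hypothetical placeholders, not values from the real platform record.
SAMPLE_TABLE = (
    "ID\tGENE_ID\tDESCRIPTION\n"
    "55.m10280_at\tplaceholder_gene_id_1\tthioredoxin domain-containing protein\n"
    "example_probe_at\tplaceholder_gene_id_2\thypothetical protein\n"
)


def load_platform_table(handle):
    """Read a tab-delimited platform table into a dict keyed by probe ID."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["ID"]: row for row in reader}


table = load_platform_table(io.StringIO(SAMPLE_TABLE))
hit = table.get("55.m10280_at")
print(hit["DESCRIPTION"])
```

To use a real file, replace the `io.StringIO(...)` call with `open("your_platform_table.txt", newline="")` and adjust the column names to match.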

But what if your gene of interest is not in the top 250? You can use the Profile Graph tab to search by probe id.

GEO2R also has basic QC tools. You can see the value distributions across samples to identify large scale problems in the dataset using the value distribution tab:

Finally, you can retrieve the R script for the analyses run in the R script tab.

This week’s bioinformatics bite will answer a question from the end of last week’s post:

What is the expression of the ITGA2 gene in prostate cancer cells?

I’d check GEO profiles, because it is a gene-centric question. Let’s start by typing in prostate cancer in the GEO profiles search box.

Note the link back to the GEO data set that this gene profile was derived from (green circle). You can also see the platform and the specific probe that measures this gene (orange box). Finally, you can see a cartoon of the expression level between sample and control on the right side of the record.

After applying the filters (blue and red boxes), you can see the search strategy in the Search details (orange box).
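The Search details box is really just a Boolean query made of tagged clauses joined with AND. Here is a small Python sketch of how such a query could be put together for this question. The `[Gene Symbol]` and `[Organism]` tags and the organism clause are my assumptions for illustration; check the Search details box for the tags your own filters actually applied.

```python
def build_query(*clauses):
    """Join non-empty search clauses with AND, the way stacked filters combine."""
    return " AND ".join(c.strip() for c in clauses if c and c.strip())


# Assumed clauses: the gene, the free-text topic, and an organism restriction.
query = build_query(
    "ITGA2[Gene Symbol]",
    "prostate cancer",
    "Homo sapiens[Organism]",
)
print(query)
```

Writing the query out like this makes it easy to save your search strategy and rerun it later for a different gene.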

To get a close up of the expression graph, click the cartoon on the right (green circle).

At a glance, you can see that ITGA2 expression goes up when the microRNA miR-205 is expressed (red bars). The graph also indicates how highly this gene is expressed relative to other genes from the same sample by percentile (blue squares), and a table below lists the expression value from each sample along with its rank.
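The percentile idea is easy to reproduce yourself. Here is a minimal Python sketch that ranks one gene’s value against all the values measured in the same sample. The sample values are made up, and the “at or below” definition is an assumption; GEO’s exact calculation may differ.

```python
def percentile_rank(value, sample_values):
    """Percent of values in the sample that are at or below the given value."""
    if not sample_values:
        raise ValueError("sample_values must not be empty")
    at_or_below = sum(1 for v in sample_values if v <= value)
    return 100.0 * at_or_below / len(sample_values)


# Made-up expression values for ten genes measured in one sample.
sample = [float(v) for v in range(1, 11)]
rank = percentile_rank(8.0, sample)
print(rank)
```

A gene can have a modest raw value and still sit at a high percentile if most other genes in that sample are expressed at lower levels, which is why the record shows both.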

If you need more information about how the samples were prepared, you can click on the GSM number in the table. From there, you can access general information about the experiment by clicking on the Series ID (GSE) on any sample page, or the original GEO profile record.

But what do you do if the data you’re attempting to access are in an uncurated data set? NCBI has a tool for that: GEO2R. We will discuss how to use this tool next time.

Wow! It’s been a long time since I’ve been able to take the time to write a post. I apologize for the hiatus. I have been traveling, then playing catch-up and attending meetings post-travel. I hope to get back to my weekly posts now.

Background: The Gene Expression Omnibus (GEO) was created by the NCBI to store gene expression data from microarray experiments. Most of the content is still microarrays, but some NGS data is also present (with the raw reads in the SRA database). Additionally, it now contains other types of high-throughput genomics data, like ChIP-chip.

All 3 of these components are necessary to make use of the study. Both samples and series link to the CEL files, and platform gives you information about the chip the samples were run on. NCBI is working to assemble all of the components from each submitted study into a curated DataSet, but there is some lag in the process. NGS studies can’t be curated at this time.

After curation, the data sets are broken out by gene instead of experiment and the data are loaded into GEO profiles.
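One practical way to keep these record types straight is by their accession prefixes: GSE for series, GSM for samples, GPL for platforms, and GDS for curated DataSets. Here is a small Python sketch that sorts accessions by prefix; the function name is made up for illustration.

```python
import re

# GEO accession prefixes and the record type each one stands for.
RECORD_TYPES = {
    "GSE": "series",
    "GSM": "sample",
    "GPL": "platform",
    "GDS": "curated dataset",
}

ACCESSION_PATTERN = re.compile(r"^(GSE|GSM|GPL|GDS)(\d+)$")


def classify_accession(accession):
    """Return the GEO record type for an accession, or None if unrecognized."""
    match = ACCESSION_PATTERN.match(accession.strip().upper())
    if match is None:
        return None
    return RECORD_TYPES[match.group(1)]


kind = classify_accession("GSE73177")
print(kind)
```

This is useful when a paper’s methods section lists a mix of accession numbers and you need to know which one to hand to which tool.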

If you do a quick search for “cancer”, here is what the results look like.

GEO DataSets results

This looks a lot like the output of a Gene search, but with different filters. You can see in the red box that there are over 1000 cancer data sets in GEO, and that the top data set has 6 samples by following the red arrow. You can filter by study type (blue box), things like tissue or strain (purple box), or Organism (orange arrow). You also get a search details box, which shows that MeSH terms are applied to your text search, just like PubMed.

Now that we have a general idea of how the databases are structured, I will answer a practical question in next week’s post:

I’m going in a bit of a different direction with this bioinformatics bite segment. Instead of explicitly describing how to use a database or tool, I wanted to tell you all about a little project that I’m working on with a vet student from CSU.

Just because I don’t have a lab or large amounts of research funding doesn’t mean I can’t do science! There’s a wealth of bioinformatic data publicly available online and some user friendly tools that are available. We’re using these freely available resources to ask questions about how the distribution of microbes in the environment correlates with specific landmarks.

Our research question is as follows: Are there differences in microbial populations at sites that are close to zoos in the NYC area compared to sites farther away? To address this question, we are using data from the PathoMap project, which swabbed surfaces in transit stops all over New York City. (This project is also being expanded to the top 10 cities worldwide for public transit ridership by a project called MetaSub.) See this publication for more information.

Repurposing data forces you to think differently. Our research question was designed in the context of what data and tools were publicly available. This type of research will only get easier as the mindset of creating well-designed community resources expands. Initiatives like Big Data to Knowledge (BD2K) and the Center for Open Science are driving this new trend.