Mine of information

Connected: scientists around the world are sharing the results of research thanks to the internet.

Weekend looking a bit ho-hum? Why not do a Vamsi Mootha and
trawl the net for leftover research data? Mootha managed to make a
major medical breakthrough in one weekend in 2001, identifying a
defective gene that caused a fatal disorder.

Mootha's custom-built software program did some serious
number-crunching on data from an unresolved genetics study sitting
in a publicly available database. The Harvard biologist identified
the gene likely to be the basis of the childhood disease called
LSFC (Leigh syndrome, French-Canadian variant), which affected one
in 2000 inhabitants of a region in Quebec, Canada.

In September last year, Mootha was awarded a $US500,000
($660,000) unrestricted MacArthur Fellowship, known as a Genius
Grant, for pioneering "computational strategies for mining data
collected in laboratories throughout the world".

There are truckloads of unused data all over the net -
information gathered for long-completed research projects and left
on web-servers for other fact-fossickers to access.

Scientists are discovering all sorts of gems in these databases
using data mining techniques such as those implemented by Mootha.
Data mining uses statistics, classification and pattern recognition
to analyse large databases and find new relationships in the
information.
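
For the curious, the pattern-recognition idea can be sketched in a few lines of Python. This is only a toy illustration, not Mootha's actual method: the gene names and activity figures are invented, and the helper functions `pearson` and `rank_by_similarity` are hypothetical. It ranks candidate genes by how closely their activity across a handful of samples tracks a reference gene.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def rank_by_similarity(reference, candidates):
    """Sort candidate genes from most to least correlated with the reference."""
    scores = {name: pearson(reference, profile)
              for name, profile in candidates.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Invented activity levels across five samples (illustration only).
reference = [1.0, 2.0, 3.0, 4.0, 5.0]
candidates = {
    "gene_a": [2.1, 3.9, 6.2, 8.0, 9.9],   # rises with the reference
    "gene_b": [5.0, 4.0, 3.0, 2.0, 1.0],   # falls as the reference rises
    "gene_c": [3.0, 1.0, 4.0, 1.0, 5.0],   # only loosely related
}

ranking = rank_by_similarity(reference, candidates)
for name, score in ranking:
    print(f"{name}: {score:+.2f}")
```

Real studies work with thousands of genes and far messier measurements, so a high score here would be a lead to follow up in the lab, not a discovery in itself.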

Geneticist Peter Little, a professor of medical biochemistry
heading a team of researchers at the University of NSW, said that
about 80 per cent of their research involves computer-based data
analysis, rather than laboratory work.

Techniques such as microarrays allow biologists to examine every
one of the 23,000 genes in a cell to determine how active each gene
is - far more informative than examining a slice of diseased cell
tissue under a microscope. "There are lots of bioinformaticians
doing this sort of stuff - ourselves included," Little says.

He and his colleagues access such sites as the public GenBank,
which holds annotations of the results of the 13-year Human Genome
Project - comprising the full sequence of human DNA.

Little points out that scientists have always made their
research public, and piggy-backed on each other's work to make new
discoveries: "Scientists do not work in isolation. Of all of the
areas of intellectual activity, science is absolutely dependent on
the free exchange of information." The internet has merely made
this information sharing global.

And, while the idea of running numbers through a database and
coming up with a breakthrough discovery sounds like fun, Little
warns that there's a bit more to it.

"Life is unbelievably complicated and the number of variations
that we are dealing with becomes huge. The size of these databases
is enormous and discoveries are added daily," he says.

Little's colleagues include Professor Ian Dawes, a
microbiologist who this year received a research grant to work with
Carlton and United Breweries to identify the yeast genes that
influence flavour in beer.

Who could resist the opportunity to devise experiments in
beer-tasting? "It's a very small part of what I do," Dawes
protests, adding that he relies heavily on public research data on
yeast genomes available on the web.

Dawes notes that yeast genomes are useful for testing new drugs:
"Yeast cells have 6000 genes and we have access to a complete set
of information on the effects of all sorts of mutations."

The huge number of experiments with yeast cells and the publicly
available results have led to some exciting discoveries. Dawes's
team found a compound that may block the blood supply to tumour
cells.

Little acknowledges that the net has changed his profession:
"Science now could not survive without the internet. We cannot now
do biology without access to this data."

This makes Google look lightweight. Enter a search term in the
box to search all of PubMed (12 million scientific articles dating
back to the mid-1960s), the entire Gene Expression Omnibus (GEO)
and terabytes of information in a collection of databases with
obscure functions - for example, the database of all the genes
expressed in the mouse brain at various stages of development.
Scary.

The entire raw sequence of data for the genomes of more than a
dozen species - including humans. The raw data for the 3 billion or
so bases in the human genome consists of monotonous strings of four
letters - A, T, C and G - in varying order. Baffling.

Not sure what that gene is called? This is the equivalent of
Wikipedia for your average geneticist - the Gene Ontology
Consortium is a voluntary group that agrees on what naming
conventions should be used. Mind-numbing.
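
The raw sequence data mentioned above, being nothing more than long strings of A, T, C and G, is at least easy to handle in software. As a minimal sketch in Python, assuming the sequence has already been read into an ordinary text string, the following counts how often each letter appears. The fragment is invented for illustration and `base_counts` is a hypothetical helper, not part of any genome database's tools.

```python
from collections import Counter


def base_counts(sequence):
    """Count each of the four DNA letters in a sequence string."""
    counts = Counter(sequence.upper())
    unexpected = set(counts) - set("ATCG")
    if unexpected:
        # Real sequence files can contain other symbols; this sketch rejects them.
        raise ValueError(f"unexpected characters: {sorted(unexpected)}")
    return {base: counts[base] for base in "ATCG"}


# A made-up 15-letter fragment (illustration only).
fragment = "ATGCGATACGCTTGA"
counts = base_counts(fragment)
print(counts)
```

The human genome runs to about 3 billion letters, so in practice a file that size would be read in pieces, not loaded as one string.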