New Guide Helps Researchers Mine Genome Data

How-to manual spreads the word on freely available data

October 2002

BETHESDA, Md. - The Internet is teeming with user's guides for everything from cell phones to the space station. Now, to encourage greater scientific exploration of public databases containing the human genome sequence, the National Human Genome Research Institute (NHGRI) has created "A User's Guide to the Human Genome."

Published as a freely available supplement to Nature Genetics [nature.com], the peer-reviewed, how-to manual is aimed at spreading the word about how easy it is for researchers to mine the wealth of human genomic data that is freely available online.

"There is no point amassing all of this data in data warehouses if no one is able to use it," said Andreas D. Baxevanis, Ph.D., associate director of NHGRI's Division of Intramural Research (DIR) and co-author of the guide. "There is a barrier that people perceive about their ability to access genomic data and effectively use the data."

NHGRI Director Francis S. Collins, M.D., Ph.D., who was one of the guide's co-authors, added, "The average researcher in a biological or medical lab is somewhat overwhelmed by the avalanche of new, freely available data produced by the Human Genome Project. They know it's incredibly useful, but they're not quite sure how to use it. This guide was designed to overcome that barrier and give 'power to the people.'"

In addition to Drs. Collins and Baxevanis, the guide was co-authored by Tyra G. Wolfsberg, Ph.D., associate director of NHGRI's Bioinformatics and Scientific Programming Core; Mark Guyer, Ph.D., director of the NHGRI Division of Extramural Research (DER); and Kris Wetterstrand, M.S., a program analyst for DER.

To further underscore the need for the user's guide, Dr. Baxevanis pointed to a 2001 Wellcome Trust survey of nearly 800 biomedical scientists. The survey found that only half of the researchers who were already using genome databases were familiar with the free, public tools for accessing sequence-based data.

"Between this information and our own anecdotal information about how people were not availing themselves of the variety of freely-available genomic databases, it really became obvious to us that a user's guide was needed to fill the void. One of the main reasons for doing the Human Genome Project was to encourage researchers to use sequence data to guide their own research. So, this guide will hopefully allow our fellow scientists to better understand what types of data are out there and how to effectively browse and search these data," said Dr. Baxevanis, who heads NHGRI's Bioinformatics and Scientific Programming Core.

"We also saw that there was a growing stratification betwoeen 'power scientists' who could use the data and those who could not. We wanted to prevent another digital divide within the biomedical research community," Dr. Baxevanis added. "Basic bioinformatics tools should be in the arsenal of every single researcher who is doing biology in the 21st century."

In an invited accompanying article, Harold Varmus, M.D., president of the Memorial Sloan-Kettering Cancer Center in New York and former director of the National Institutes of Health (NIH), agrees that there is a pressing, unmet need for "genomic empowerment."

"Interested people who reside outside the centers for studying genomes need to be told where to best view the information in a form suitable for their purposes and how to take advantage of the software that has been provided for retrieval and analysis," Dr. Varmus writes. "The manual before us now offers such help to those who might otherwise have had trouble in attempting to use the products of genomics. Furthermore, the advice is offered in that spirit of altruism that has come to characterize the public world of genomics."

In the 79-page guide, the NHGRI team focuses mainly on the three major genome portals that contain freely available data produced by the International Human Genome Sequencing Consortium and other systematic sequencing efforts. Those Web-based portals are: the National Center for Biotechnology Information's Map Viewer; the University of California, Santa Cruz's Genome Browser; and the European Bioinformatics Institute's Ensembl system.

"The underlying data are fairly similar, but the individual tool sets available at the three sites are different," Dr. Baxevanis said. "In formulating the guide, we selected a representative cross-section of tools to best demonstrate the kinds of questions that could be answered using these three Web sites."

Arranged around a series of questions commonly encountered during the course of biomedical research, the guide provides users with practical, hands-on instructions for searching and analyzing genomic data contained in the major browsers. The NHGRI authors show users how to set about answering each question by choosing and using the appropriate tools in one or more of the main browsers.

For example, Question 2 of the guide asks: "How can sequence-tagged sites (STS's) within a DNA sequence be identified?" The NHGRI authors point users to the NCBI portal's UniSTS resource, which contains an electronic PCR (e-PCR) tool that can be used to find STS markers within a DNA fragment. Using instructive text and figures, the guide then walks users through the steps of identifying the STS markers contained in a sample sequence of interest, in this case a sequence with accession number AF288398. The e-PCR search reveals that the sample sequence contains only one STS, stSG47693. By clicking on the marker name, users can obtain more details about the STS from UniSTS, such as alternative names for the marker, primer information and PCR product size. In addition, the NHGRI authors steer users to several electronic cross-references to mapping information, as well as a link to NCBI's Map Viewer, which allows users to see the genomic context of the STS marker in every map on which it has been placed.
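
For readers curious about what an electronic PCR search does in principle, the following Python sketch illustrates the general idea: look for a marker's forward primer in the sequence, look downstream for the reverse complement of its reverse primer, and report a hit when the distance between them is close to the expected PCR product size. This is only a simplified illustration under stated assumptions, not NCBI's e-PCR implementation. It uses exact string matching, and the function names, the toy sequence and the toy primers are all hypothetical; they are not data from UniSTS or from the sequence AF288398.

```python
from typing import NamedTuple


class StsHit(NamedTuple):
    marker: str
    start: int  # 0-based index of the first base of the forward primer
    end: int    # index one past the last base of the reverse primer site
    size: int   # length of the predicted PCR product


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A, C, G, T, N only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def find_sts(
    sequence: str,
    markers: dict[str, tuple[str, str, int]],
    tolerance: int = 50,
) -> list[StsHit]:
    """Report markers whose primer pair brackets a product of plausible size.

    `markers` maps a marker name to (forward primer, reverse primer,
    expected product size). Matching is exact; this is an assumption of
    the sketch, not a description of any production tool.
    """
    sequence = sequence.upper()
    hits = []
    for name, (forward, reverse, expected_size) in markers.items():
        forward = forward.upper()
        # The reverse primer anneals to the opposite strand, so on the
        # given strand we search for its reverse complement.
        reverse_site = reverse_complement(reverse)
        start = sequence.find(forward)
        while start != -1:
            site = sequence.find(reverse_site, start + len(forward))
            while site != -1:
                end = site + len(reverse_site)
                size = end - start
                if abs(size - expected_size) <= tolerance:
                    hits.append(StsHit(name, start, end, size))
                site = sequence.find(reverse_site, site + 1)
            start = sequence.find(forward, start + 1)
    return hits


# Toy data, invented for illustration only.
toy_forward = "ACGTTGCA"
toy_reverse = "TTGACCGA"
toy_sequence = (
    "GG" + toy_forward + "GATTACA" * 3 + reverse_complement(toy_reverse) + "CC"
)
toy_markers = {
    "toy_marker_1": (toy_forward, toy_reverse, 37),
    "toy_marker_2": ("CCCCCCCC", "GGGGGGGG", 100),
}

hits = find_sts(toy_sequence, toy_markers, tolerance=0)
for hit in hits:
    print(hit.marker, hit.start, hit.end, hit.size)
```

In this toy run only the first marker is reported, because the second marker's primers do not occur in the sequence. A real search would also need to handle complications that this sketch ignores, such as imperfect primer matches and searching both strands.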

The NHGRI authors emphasize that the new user's guide is a dynamic manual that will be frequently revised to reflect the rapidly changing world of genomic data and technology. "This is not a cookbook because the databases are constantly changing," said Dr. Baxevanis, noting the guide will be updated at least once between now and the target date for finishing the human genome sequence in April 2003.

In addition to showcasing the tools available through the three main genome portals, the guide includes a convenient list of links to a wide range of additional resources: other genome browsers, genome annotation databases, public sequence databases, expressed sequence tag clustering databases, human genetic and physical maps, sequence-based search tools and model organism databases.

The guide also provides information on Human Genome Hub and Genome Central, Web sites that serve as jumping-off points to major genome-based Web sites. While the guide focuses mainly on the mechanics of accessing and using human genomic data, links are also included to key Web sites for information on genetic education and ethical, legal and social issues (ELSI) related to genetic and genomic research.

NHGRI and The Wellcome Trust provided funding for the special supplement of Nature Genetics in which the guide was published.

NHGRI is one of the 27 institutes and centers at the NIH, which is an agency of the Department of Health and Human Services (DHHS). The NHGRI Division of Intramural Research develops and implements technology to understand, diagnose and treat genomic and genetic diseases. Additional information about NHGRI can be found at its Web site, www.genome.gov.