Notes from DOE-NHGRI Informatics Workshop, April 2-3, 1998

By Dan Drell, DOE Office of Biological and Environmental Research

On April 2 and 3, 1998, DOE's OBER and NIH's NHGRI convened a workshop to identify informatics needs and goals that could form part of the next genome 5-year plan (currently being developed) and to begin crafting a vision for genome informatics over the next 5 years and beyond. In particular, following the announcement of the impending closure of the Genome Data Base (GDB) at Johns Hopkins, OBER wants to clarify its future role in genome informatics in support of the HGP. In attendance were 46 invited informatics and genomics experts and 6 DOE, 8 NHGRI, 2 NIGMS, and 1 NSF staffers. The meeting was held at the Dulles Hilton on Rte. 28 in Herndon, VA.

Since the beginning of the Human Genome Project, informatics has been widely regarded as one of the most important elements of the HGP. The overall quantity of information, the mass and varying types of experimental raw data being generated, the spectrum of data from ABI traces to DNA sequences, to map positions of markers, to identified genes, ultimately to intelligent predictions of future genes (open reading frames) and their hypothetical functions, all absolutely require computational collection, management, storage, organization, access, and analysis. Not surprisingly, given the wide diversity of sponsoring agencies, participating institutions, and scientists who are involved in genomics, the resulting data are highly heterogeneous in terms of format, organization, quality, and content. Furthermore, not all uses for these data can be anticipated today; this implies a need for structural flexibility in the database(s) that support the genome project. Additionally, knowledge improves over time which implies that curation of the data, i.e. correcting it, adding to the functional and useful links it has, annotating it, must be done on a continuous basis.

Although universally regarded as critical to the success of the HGP, informatics is done by computer scientists, not biologists. This has led to some communication difficulties that have not been fully resolved. By and large, those doing informatics have not had practical biology backgrounds (there are, of course, exceptions to this), and biologists, to a large extent, have used computers only for word processing and e-mail. This situation is changing rapidly but still has a way to go. Additionally, the expectations from genome informatics are not uniform; biologists have a set of expectations that can vary from those of the computational scientists. Importantly, computational analyses of genomic data are not meant to generate "revealed truth"; rather, they are best understood as serving to generate testable hypotheses that must then be taken to a lab bench somewhere for critical testing. Both NHGRI and OBER took the starting position that it is the needs of the users that matter the most and which must drive the goals of genome informatics over the next 5 years. To this end, most of the invitees were, broadly defined, "users" of informatics services, and only a minority of invitees were "producers."

Prior to the workshop, the ORISE contractor e-mailed all the invitees 4 broad questions to serve as a framework for the workshop. These four questions were:

Queries: What scientific questions will you want to answer? What types of data will you need to answer these questions? Which of these data types are permanent, which are temporary but important, and which will need to be regularly updated? What uses will you have for genomic sequence data in the next 5 years?

Tools: What protocols and tools for data submission, viewing, analysis, annotation, curation, comparison, and manipulation will you need to make maximal use of the data? What sorts of links among datasets will be useful?

Infrastructure: What critical infrastructures will be needed to support the queries you want to perform and what attributes should these infrastructures have? In what ways should they be flexible, and how should they stay current? How should they be maintained?

Standards: What kind of community-agreed standards are needed, e.g. controlled vocabularies, datatypes, annotations, and structures? How should these be defined and established?

The agenda consisted of 6 "user" talks the first morning, followed by breakout groups the first afternoon. The 4 breakout groups were (the bolded name was the breakout group chair):

The Informatics Workshop began with welcoming comments from John Wooley of DOE and Francis Collins of NHGRI. Both noted the importance of informatics to the success of the HGP, and both noted that now was the time to begin thinking about the biology that could follow the completion of the first human sequence. An acute question for today is what tools the post-HGP biologist will need to do the work s/he wants to do. Collins noted the initiative on Single Nucleotide Polymorphisms (SNPs) that 17 NIH institutes are joining. He also noted that with the Genome Data Base closing this summer, "re-parking" of that data was important so it wouldn't be lost. He closed by saying that the genome programs were listening, since the next HGP 5-Year plan was being developed and this workshop would be important in defining the informatics goals that would appear in it.

Aravinda Chakravarti (Case Western Reserve University) and David Thomassen (DOE OBER) discussed the planning process for the 5-year plan. Thomassen, speaking for Ray Gesteland (University of Utah) who could not be present, noted the priority areas for the DOE: high-throughput sequencing at the Joint Genome Institute (and its Production Sequencing Facility), technology development (including improvements in current technologies, "hardening" of developing technologies, and keeping an eye open to future technologies such as leveraged sequencing), informatics, functional genomics, and ELSI. Chakravarti noted that the priority areas for NHGRI included sequencing, genetic variation, functional genomics, informatics, and ELSI. At Airlie House in Warrenton, VA, on May 28-29, the principals of both the DOE and NIH genome programs, along with invited outside scientists, will review a joint 5-Year plan for the HGP. This plan should be ready for publication in an October 1998 issue of Science. Chakravarti closed by emphasizing the need for a concrete, tangible, implementable plan that focused heavily on the next 5 years.

The morning proceeded with talks from various genome project users, each representing a different perspective. LaDeana Hillier (Washington University in St. Louis) listed the informatics needs typical of a large sequencing center. Her comprehensive list included data tracking, physical mapping support (e.g. band calling, map assembly tools, and map publication and dissemination tools), data collection and analysis (e.g. lane tracking, image analysis, base calling, confidence estimations, and interfaces), data processing (e.g. vector clipping, sequence assembly, data management and reporting tools, QA/QC tools), finishing tools (editing, problem solving, etc.), technology development (rearraying, colony picking, data collection support), LIMS (lab information management systems), gene prediction aids, gene identification tools (naming conventions, map integration, graphical representation tools), annotation representation tools (data mining and analysis tools), and databases (for public tools, both phase 1 and phase 2 data). Additionally, the required fields that must be filled in for an entry to be accepted by a given database need to be standardized and agreed on. Stable identifiers are an absolute requirement so that data isn't lost when different things are done with it and the intellectual spoor trail can be followed back if desired. Hillier closed on data access, asserting that complete sharing between public databases using standardized, well-documented, and centralized formats should be enforced, with libraries of routines callable from Java.

Takashi Gojobori (National Institute of Genetics, Japan) discussed the DNA Database of Japan, DDBJ, and noted it is one of the three corners of the sequence database triangle (the others are NCBI and EMBL). DDBJ has a total staff of about 65 people (including post-docs and graduate students) and is roughly comparable to GenBank at NCBI. The four themes of his presentation were:

comparative genomics (including comparisons of whole genomes from different species, elucidation of the evolutionary process of genomic structure, and studies of biological relationships between different species);

genomic engineering (identification of essential genomic regions, identification of minimum genomic sequences for function, and the elucidation of the ancestral genome at the "origin of life"); and

Gojobori closed with an appeal for what he termed a "Humanity Genome Project" involving a search for genes for human psychology, emotion and behavior.

Anne Spence (University of California, Irvine) represented the perspective of the medical geneticist user. She gave a forthright, blunt talk about the need for data resources that a medical geneticist could use to answer, efficiently, questions about genes and their medical implications. A typical query might be "tell me everything about gene X." Today, this query involves interrogating several web sites, not always interlinked, and often with uncurated data of varying veracity and reliability. Spence gave a dramatic example involving a query about a gene implicated in attention deficit hyperactivity disorder (a gene associated with ADHD has been located to the same spot as DRD4, which has been linked to a plethora of syndromes such as schizophrenia, Alzheimer's, depression, novelty seeking, and obsessive compulsive disorder, and the list goes on). She noted 3 fundamental issues in informatics that complicated the medical geneticist's life: 1) genetics vs. computers, or the challenge of capturing all the data vs. intelligently using the data; 2) the data volume problem (which is getting worse as more sequencing is done); and 3) the issue of data accuracy vs. completeness. What the medical geneticist needs is a user-friendly disease/gene entry in a database with links to other resources, regular rapid updates, accurate curated and annotated information, and population data. GDB had been a bridge to much of this data, between OMIM and GenBank, but GDB had been hard to use. There is an acute need to capture discovered knowledge and make it easily available, and this is not being done now.

Debbie Nickerson (University of Washington, Seattle) talked about genetic variation. To maximally utilize the expected flood of human variation data that the SNP efforts will generate, SNP data needs to be integrated into existing maps. There are plenty of maps out there now, but they are on "boutique" web sites and are difficult to find, and virtually impossible to add to. DNA variation data could encompass type (e.g. substitutions, indels, repeats), location, discovery, mode (inherited or acquired), frequency (population, haplotype, linkages), method of genotyping, and phenotype/function. It is sobering to realize that the human genome might vary by as much as 6% in size (the genome could be 3 x 10^9 base pairs, plus or minus 9 x 10^7). Some mechanism for external annotation so that other biologists could (using a simple format) add to the value of the data is desirable.
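The arithmetic behind the 6% figure can be checked directly. The short Python sketch below assumes only the nominal size and the plus-or-minus bound quoted in the talk summary; the function and variable names are illustrative and do not come from the workshop.

```python
# Nominal genome size and uncertainty as quoted in the talk summary.
NOMINAL_BP = 3e9      # 3 x 10^9 base pairs
PLUS_MINUS_BP = 9e7   # plus or minus 9 x 10^7 base pairs


def size_range(nominal, plus_minus):
    """Return (smallest, largest, spread as a fraction of nominal)."""
    smallest = nominal - plus_minus
    largest = nominal + plus_minus
    spread = (largest - smallest) / nominal
    return smallest, largest, spread


smallest, largest, spread = size_range(NOMINAL_BP, PLUS_MINUS_BP)
print(f"{smallest:.2e} to {largest:.2e} bp, spread {spread:.1%} of nominal")
```

A bound of plus or minus 9 x 10^7 is 3% of the nominal size in each direction, so the smallest and largest genomes differ by 6% of the nominal size.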

Roger Brent (Molecular Sciences Institute, Berkeley, CA) gave a provocative talk comparing functional genomics (as a set of interpretations and derivations from sequence data) to the information one might need to "guess the plot" of Shakespeare's Othello. For biologists, this information might comprise measurements of protein concentrations, states, and subcellular locations, as well as temporal changes as a function of cellular conditions and activities; for Othello, one would want to know who is in the room, what else is in the room, what is in the room that a character can form a complex with (e.g. a knife or pillow), and what is in the room that a given character does form a complex with. Brent used this analogy to point out that biological informatics needs to deal with "fuzzier" data and more tentative inferences, that queries phrased closer to "natural language" are needed, as are "canned" queries (e.g. most plots are the same or similar.) Brent's talk provoked some discussion that ranged from the sharp ("complete crap") to the diplomatic ("lots of problems with communications" between biologists and computer scientists.)

Rainer Fuchs (Ariad Pharmaceuticals) talked from the perspective of the biotechnology industry user. He noted that industry wasn't monolithic and that it is wide ranging in character and needs. Common to many is the need for more potential targets for pharmaceutical development. This implies better ways of identifying those that are worth investing resources in developing. The hopes that industry has for genomics include better target identification, target validation, and target prioritization; the informatics challenges include data analysis (knowledge discovery), establishment of standards, and training new young scientists for the future. New data types can easily be expected, including gene expression (at both the nucleic acid and protein levels), molecular interactions, gene regulation, and genetic variation (including polymorphisms, post-translational modifications, and splice variants). "Tools for the rest of us" (as opposed to the high end, large scale sequencers) are also needed. This should involve tools that are easier to use, that are available, that are robust and of commercial quality, and that are supported. Fuchs noted that although no one had mentioned it explicitly, the idea of "federated database systems," in which a query could cross from one database to others and return relevant information obtained from several of them, was still highly sought after. Industry also (along with medical geneticists) wants to be able to ask "tell me everything about this gene." To do this, Fuchs passionately argued for standards across the bioinformatics landscape. Today, industry standards are worlds apart from those considered in the genomic bioinformatics field. A group exists (the OMG, Object Management Group) that currently is an industry group but which could involve academic and government representatives if they showed interest.

Fuchs noted that standards were critically important because in an era of industrial-scale sequencing, it made little sense to "let 1000 flowers bloom;" striving for perfection was laudable in principle, but not reasonable in practice. There was no need to reinvent the wheel. Core databases were important, with centralized data management, explicit object definitions and access methods, and better financial support not dependent on research grants (a bad mechanism for supporting infrastructure), but with rigorous review for both technical practice and continuing need and utility. Component-oriented software standards (e.g. CORBA) would promote systems integration, interoperability, flexibility, and responsiveness to change. Annotation was critically important so that the who, what, where, when, and why of genome sequence products could be built up. Automated analyses using clearly defined standard operating procedures, consistent application, and sufficient documentation would help a lot. Finally, Fuchs mentioned the acute need for training of additional scientists (not exclusively biologists) in these technologies.
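The "federated database" idea Fuchs described, in which one query crosses several databases and returns the combined answer, might be sketched as follows. This is a minimal illustration in Python, not a description of any system discussed at the workshop: the MemberDatabase class, the federated_query function, the database names, and every record are hypothetical. The one assumption that matters is that all member databases expose the same lookup interface, which is the kind of agreed standard Fuchs argued for.

```python
from dataclasses import dataclass, field


@dataclass
class MemberDatabase:
    """One hypothetical member database in the federation."""
    name: str
    records: dict = field(default_factory=dict)  # gene symbol -> list of facts

    def lookup(self, gene):
        """Return every fact this database holds about the gene."""
        return self.records.get(gene, [])


def federated_query(gene, databases):
    """Ask every member database about one gene and merge the answers.

    Each fact is tagged with the name of the database it came from,
    so the source of every item stays visible to the user.
    """
    results = []
    for db in databases:
        for fact in db.lookup(gene):
            results.append((db.name, fact))
    return results


# Invented toy content, for illustration only.
mapping_db = MemberDatabase("mapping", {"GENE_X": ["hypothetical map interval A"]})
sequence_db = MemberDatabase("sequence", {"GENE_X": ["hypothetical sequence entry 1"]})
variation_db = MemberDatabase("variation", {"GENE_Y": ["one hypothetical substitution"]})

answers = federated_query("GENE_X", [mapping_db, sequence_db, variation_db])
for source, fact in answers:
    print(f"[{source}] {fact}")
```

The user issues a single "tell me everything about this gene" query and sees one merged answer, while each database remains separately maintained.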

Bettie Graham of NHGRI concluded the morning with a short description of several NHGRI training programs that could help with the dearth of bioinformaticists in the public sector genome field.

The afternoon of the first day was devoted to 4 breakout groups; the results of those groups were presented the next morning. I visited each of the breakout groups to get a sense of how the discussions were going and took some notes while in each one, but the summaries below are based on the final products of each group.

PHYSICAL MAPS, GENE MAPS: develop integrated databases where identical sequence markers in different maps are held in a synonyms database; all markers should be located in a central database (e.g. NCBI); queries to maps: what are the markers, clones, and genes in a spatial interval; what are the locations and clones of the genes/ESTs?

SEQUENCE READY MAPS: all data (full contig depth) should be accessible; assembly criteria (e.g. STS, fingerprints) should be included; data must contain: interval anchored to best maps, all clone addresses for members of the contig, members of tiling path clones that are (or will be) sequenced, clone id coupled to library information, links to additional information; it would be desirable to have STS content and fingerprint of each clone. All the data must be prepared and presented in a standard fashion.

SEQUENCE: the sequence data must contain the following: source of the sequence (the clone id), the sequence anchored to clones, the STS location and confirmation by electronic PCR, quality scores for each base (probability of error) for large genomic sequence, biological attribution as annotation. Contiguous genomic sequences should be assembled.

TOOLS: support for distribution and maintenance of tools for general use should be promoted; this includes tools for map and sequence assembly, new tools for interoperable systems, and (especially) robust tools for sequence finishing.

LINKAGES AND STANDARDS: clear definition of objects in databases including their behavior and semantics; standard interfaces for WWW and for systems communications, standards for sequence accuracy and which data needs to be captured, international genomic standards for objects to be represented in databases, establishing (in one year) a working group to develop, periodically review, modify these standards and to so advise the funding agencies who should then enforce the resulting standards.
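The SEQUENCE item above calls for a quality score for each base, expressed as a probability of error. One widely used convention is the Phred scale, in which the quality Q and the error probability p are related by Q = -10 log10(p). The summary does not say which scale the breakout group had in mind, so the Python sketch below is an assumption offered for illustration; the function names are hypothetical.

```python
import math


def error_prob_to_quality(p_error):
    """Convert a per-base error probability to a Phred-scale quality."""
    if not 0 < p_error <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p_error)


def quality_to_error_prob(quality):
    """Convert a Phred-scale quality back to an error probability."""
    return 10 ** (-quality / 10)


def expected_errors(qualities):
    """Expected number of miscalled bases in a run of base calls."""
    return sum(quality_to_error_prob(q) for q in qualities)


# A quality of 20 means a 1-in-100 chance that the base call is wrong;
# a quality of 30 means 1 in 1000.
read_qualities = [20, 20, 30]
print(quality_to_error_prob(20))
print(expected_errors(read_qualities))
```

Summing the per-base error probabilities gives the expected number of errors in a stretch of sequence, which is one way a stated accuracy standard could be checked against submitted data.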

QUERIES: What is known about this gene? What is known about this region (marker delimited, cytogenetic location)? Where does this sequence go (what is its genomic context)? Does this gene vary? From where did this information come (cell, tissue, population, species, ethnicity, environment)? What are the genetic characteristics of this population (geography, origin, sample clinical diagnosis, phenotype)? What analysis reagents should be used?

STANDARDS: availability of raw data that support conclusions, tools used to generate data need to be well described and available, standard nomenclature/vocabulary (required use in public databases), standard formats for entry of data of the same type, data to support genetic conclusions submitted, methods used to generate data need to be specified.

Implementation thoughts: multiple databases (with different approaches/models) maintained by experts, need to conduct research in areas of integration tools, need to have data be open and accessible by many parties.

One issue that was mentioned was the incorporation of data and information generated by both large and small groups; this is based on a sense that the rules are different, e.g., large groups are expected to put sequence on their web sites each evening, while smaller labs can pretty much do what they want. Temple Smith suggested a Swiss-Prot Blocks-like data structure for genome sequence data that would not replace GenBank but improve on it (in a Lego block fashion). Ed Uberbacher noted the importance of comprehensive annotation based on assembled genomes (not the fragments often found in GenBank) and on comprehensive complete information. The database must be queryable in a reasonable way (another criticism of GenBank) and all data, whether primary or derived, needs to be sourced. It was noted that annotation is not gospel, only testable hypotheses. Curation remains a touchy issue: it needs to be done, but it isn't clear who should do it. Expert curation by selected editors is difficult and expensive, but would not be impossible if suitable incentives were used. It was noted that most of OMIM's budget goes to curation.

Recommendations:

Show us the money: robust software engineering for genomics is expensive and requires a stable infrastructure for R&D and deployment. Adequate funding must also be provided for innovative research in genome informatics to address grand challenge problems in data management analysis and visualization.

Implement fully automated genome annotation systems that keep pace with world-wide sequencing output. Must be capable of initial annotation and ongoing re-annotation. Must include visible policies, protocols, and evidence.

Explore new models for generation of publicly available data and informatics tools, including work done by private companies under contract to the government which then freely distributes the data and tools.

Establish 3-5 academic centers for genome informatics leading to the critical mass necessary for sustained R&D and deployment, and training programs. These will be the major centers for training genome informatics specialists.

Comparative genomics requires a high level of human intervention and curation to interpret and synthesize the data. Emphasis should be on increasing productivity not solely on scalability.

As the data increase in magnitude and complexity, more human resources will be required for curation.

You get what you pay for.

Without an effective bioinformatics infrastructure, the promise of the HGP will not be realized. Funding levels need to be consistent with the critical nature of informatics.

Summaries (4/3/98)

Each breakout group reported on its conclusions and recommendations.

Raju Kucherlapati (sequencing) noted the issues in the summary above. In the discussion, it came out that at Wash U (St. Louis), the Waterston group is currently sequencing about 100 Mb/yr (all sequences combined, e.g. human and C. elegans) but can finish only 60 Mb/yr, so finishing remains a major bottleneck in sequencing. It was suggested that the OMG standard-setting working group be supported and that academic/government participants be encouraged. David Lipman (NCBI) said that they were working to hire more staff expressly to work with genome centers on sequence data submission.

Ken Buetow (gene finding/OMIM/variation) noted (based on a sketch of Jim Ostell's) the various gaps in the flow of mapping information. There are numerous gaps in morbid maps, many gaps in the map positions of physical reagents, clones, many deficiencies in raw data (ABI traces, etc.), gaps in the annotation and description of complex traits, huge gaps in knowledge about gene interactions and modifier genes, and little in the way of repositories on DNA variation, linked phenotypes, methods and reagents, homologies and orthologies, and historical data and annotations. Better integration tools were desperately needed. David Lipman noted that there was an overarching need to understand the data from the perspective of its utility. Others wanted to make sure it was all captured first.

Chris Overton (annotation/function): There are several models for high quality, curated databases out there, e.g. FlyBase, AceDB, OMIM. 3rd party annotation was, by and large, not a successful approach. Whatever was done, working closely with GenBank was important. Functional genomics data was a research issue since it wasn't at all clear what data needed to be collected. "User-friendly tools" is easy to say, hard to accomplish, very expensive, and means different things to different communities. There is a need for 3-5 centers of excellence (including UPenn?) where a critical mass of informatics together with biology can be accumulated. One of those centers now is NCBI.

Carol Bult (Comparative Genomics): The need here is ways to traverse across many resources to answer complex queries. A major part of comparative genomics that cannot be readily automated is homology comparisons. Computerized annotation can only do so much and needs to be viewed as a tool for hypothesis generation. Controlled vocabularies need to be constructed, but it is recognized that the slope between controlled vocabularies and a comprehensive (and complex) knowledge base is a slippery one.

David Lipman noted that the meeting was a useful one for him. NCBI has more than 60,000 users per day (some 2 million per month). MedLine, which used to be available for a fee, now is free on the Web. PubMed is used by a wide audience, some 40% of whom are researchers, 10% are MDs, and the rest are the "public." NCBI is interested in expanding library functions and is talking with textbook publishers and hopes to strengthen connections to the literature and to tap into the education market. NCBI is growing 15% in usage every 45 days, but remains a small division within NLM. They are trying to make Entrez more robust and might find ways to export or disseminate it to others.

Ed Uberbacher gave a brief overview of the Annotation Consortium at ORNL and described its overall schema of Data Acquisition, Data Analysis, Data Storage, and Data Access (via the Genome Channel.) Several people (Jean-Francois Tomb [Dupont] and Carol Bult [U Maine]) expressed strong praise for Ed's efforts.

Wrap up (Eric Green and Elbert Branscomb): [This is the hardest part of the meeting to summarize; Francis Collins asked for priorities, estimated costs, and a timetable which did not map easily onto the earlier Queries, Tools, Infrastructure, and Standards pattern that the breakout groups had been asked to respond to.]

NSF S&T centers as model for needed genome informatics center perhaps on scale of $12 Million per year

Overall Policy Recommendations:

there should be open competition for supplying database/informatics needs

existing frameworks should be used where possible

standard data object definitions should be realized

continued support for model organism databases should be effected

raw data should be captured to the maximum extent possible

there should be investments made in hardening and exporting software tools from genome centers.

Afterword

This was a useful and rewarding meeting. While some consensus recommendations can be identified, there is still much vagueness among the informatics communities, mostly users, represented at this workshop. Those who generate the data have different concerns from those who want to use it. There is still some hesitation between the biologists who aren't conversant in the technical issues of informatics and the informatics scientists who aren't fully conversant in the biology. The presence of NCBI was a strong positive from this meeting. There was a general air of amity and agreement in the various breakout groups. It seems that the genome project still has many unmet informatics needs and there was, to my mind, remarkable concordance on what the "wish list" should still have on it. From a DOE-specific perspective, the importance of annotation efforts (highlighted by the work of Ed Uberbacher's group at ORNL) was underscored.

Infrastructure: This principally means databases and the workshop suggested a pile of them. These include:

curated structured reference genome (map and sequence) database,

integrated and linked databases,

variation database,

functional/expression database, and

an informatics tools and information database.

Standards: There was near unanimity on the need for intelligent standards that various constituencies of the genome project (academic, government, and industry) could join in defining and implementing. These include a variety of controlled vocabularies for various objects that would be entered into appropriate databases. Today, industry standards are very distinct from those (few) that exist in the HGP (e.g. Phred/Phrap for sequence QA/QC). A group exists (the OMG, Object Management Group) that currently is largely composed of industry representatives, but should involve academic and government representatives. Explicit object definitions and access methods are desperately needed. Component-oriented software standards (e.g. CORBA) would promote systems integration, interoperability, flexibility, and responsiveness to change. Automated analyses (annotation) using clearly defined standard operating procedures, consistent application, and sufficient documentation would help a lot.

The workshop closed with some policy recommendations (slightly expanded from above):

There should be open competition for supplying most database/informatics needs, but support for any large databases needs to be done outside of the regular grant mechanism (but NOT outside of periodic technical and mission relevance reviews.)

No one database can be expected to do everything for everybody; however, the user needs to "feel" that s/he is interacting with only one entity.

Existing frameworks (database schema, submission tools, etc.) should be used where possible to save money. Contracting out certain tasks to the private sector should be explored.

Standard data object definitions should be developed and promulgated in the near future and enforced by the agencies.

There should be continued support for model organism databases.

Raw data should be captured to the maximum extent possible before it is irretrievably lost.

There should be investments made in hardening and exporting software tools from genome centers.

The electronic form of the newsletter may be cited in the following style:
Human Genome Program, U.S. Department of Energy, Human Genome News (v9n3).

Published from 1989 until 2002, this newsletter facilitated HGP communication, helped prevent duplication of research effort, and informed persons interested in genome research.

Citation and Credit

Unless otherwise noted, publications and webpages on this site were created for the U.S. Department of Energy Human Genome Project program and are in the public domain. Permission to use these documents is not needed, but credit the U.S. Department of Energy Human Genome Project and provide the URL http://www.ornl.gov/hgmis when using them. Materials provided by third parties are identified as such and not available for free use.