GenBank & The Early Years of “Big Data”

In cooperation with our colleagues at the National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), the NLM’s History of Medicine Division recently acquired the archives of the early history of GenBank, the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences. Today Circulating Now welcomes guest blogger Bruno J. Strasser. Dr. Strasser is a professor at the University of Geneva, Switzerland, an adjunct professor at Yale University, and author of the book Collecting Experiments: The New Production of Biomedical Knowledge, forthcoming from University of Chicago Press.

“Almost the number of stars in the Milky Way.” Through this stellar comparison, the National Institutes of Health proudly announced in 2005 that the content of their computerized collection of DNA sequences, called GenBank, had reached 50 billion bases, or units of DNA. Today, it contains far more: over 200 billion bases from over 350,000 different species, making it one of the largest scientific databases in the world.

Detail from a GenBank brochure, ca. 1985. Courtesy National Library of Medicine Acc. 2015-045

The creation of GenBank, like that of the heavens, was no small achievement. This archival collection of hand-written, type-written, and printed documents deposited at the NLM reveals the first discussions among scientists and science administrators about this new infrastructure, created in 1982, and documents the first decade of its existence. These papers offer a unique window onto the coming of age of “big data,” onto how it is transforming scientific research, and onto how it led to the “open access” movement. Today, as “big data” is heralded as the “new oil” and as our daily online actions are increasingly stored in databases for marketing and other purposes, it is useful to begin reflecting on the history of our information age.

In the sciences, the challenge of “big data” arose particularly early and has transformed the way scientific research is done. GenBank has become an indispensable tool for biomedical researchers around the world. This encyclopedia of gene sequences is now a truly collaborative and worldwide effort. It includes the complete genomes of over 3,000 organisms, from humans to zebrafish, from rice to bacteria like E. coli.

Biomedical researchers go to GenBank to find the sequence for a given gene and its associated annotation, such as the organism from which the sequence was derived, its biological functions, and related scientific journal articles. More importantly, researchers search the database to find out whether it contains a sequence that closely resembles one they have determined in their laboratory from a specific organism. Often they do find a match, and the similarity tells them that both sequences, in different organisms, probably have a similar function, since they evolved from a common ancestor. This comparative approach is key to the success of contemporary biomedical research.
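For readers curious about what “closely resembles” can mean in practice, here is a minimal sketch in Python. It is not how GenBank or its search tools work: real searches align sequences and tolerate insertions and deletions, while this toy version only counts matching positions between sequences of equal length. The function names, the organism labels, and the sequences themselves are all made up for illustration.

```python
def percent_identity(a: str, b: str) -> float:
    """Return the fraction of positions at which two equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    if not a:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


def best_match(query: str, database: dict[str, str]) -> tuple[str, float]:
    """Return the (name, identity) pair of the database entry most similar to the query."""
    scored = ((name, percent_identity(query, seq)) for name, seq in database.items())
    return max(scored, key=lambda pair: pair[1])


# Hypothetical miniature "database": invented labels and invented sequences.
toy_db = {
    "organism_a": "ATGCCGTA",
    "organism_b": "ATCACGTT",
    "organism_c": "TTTTTTTT",
}

# A sequence a researcher might have determined in the laboratory (also invented).
query = "ATGACGTA"

name, identity = best_match(query, toy_db)
print(f"closest entry: {name} ({identity:.1%} identical)")
```

The closer the identity is to 100%, the stronger the hint that the two sequences share a common ancestor and, probably, a similar function.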

NCBI News (Volume 1, Issue 3) September 1992, featuring news about the move of GenBank to the National Center for Biotechnology Information at the National Library of Medicine. Courtesy National Library of Medicine Acc. 2015-045

But in the late 1970s, when this collection of data was first envisioned by scientists, it was far from clear that it would be worth the effort. First, there wasn’t that much data to be collected, and many biomedical researchers were still uncomfortable with computers. To some experimentalists, a data collection even sounded somewhat antiquated, like a natural history museum or a library collection, not like one of the cutting-edge instruments enabling experimental virtuosity. And some doubted that it was the NIH’s mission to fund such an infrastructure. But Dr. Elke Jordan, deputy director of the Genetics Program Branch at the National Institute of General Medical Sciences (NIGMS), and Ruth L. Kirschstein, director of NIGMS, with the help of a few key scientists, like Dr. Richard J. Roberts, eventually succeeded in opening a request for proposals and, finally, signing a contract with Los Alamos National Laboratory to host GenBank. Ten years later, the operation of the GenBank database was transferred from Los Alamos National Laboratory to the National Center for Biotechnology Information, where it is maintained today, regularly accessed by scientists around the world, who also contribute measurably to its growth by depositing their own data. These contributions are a fundamental part of the current era of “big data” that continues to inform scientific discovery.

Thanks to the staff of the NLM’s History of Medicine Division and NCBI, the archives of GenBank are now preserved and publicly available for use at the NLM in the History of Medicine Reading Room. Tomorrow’s researchers will find much of interest in the GenBank archives as they look back on the last quarter of the twentieth century to learn how scientists first began to conceptualize and envision a computerized collection of DNA sequences, which eventually became one of the largest scientific databases in the world.

Interested in learning more about the history of GenBank? See these articles: