The original data are sequencing chromatograms, gels, and comparable data traces that should be
archived in the originating laboratory

Important Molecular Biological Databases

NCBI, EMBL, DDBJ

Three interlinked database centers

Each sponsors several interlinked databases (e.g., GenBank, PubMed, RefSeq, Taxonomy; all at NCBI), and provides other tools and services.

Each accepts submissions independently, and the three share data daily.

We will generally just consider GenBank, and treat all of these
as equivalent. They were, however,
established independently, and each has its own peculiarities.

Strictly speaking, all of these are secondary databases. The data are
not raw data; they have been subject to interpretation. However,
the data are still fairly close to the original source, so in practice
these are treated as primary databases. Additional databases have been
developed by further reprocessing of GenBank, and it is these that are
usually called "secondary databases."

Swiss-Prot, PIR

TIGR

JGI

Celera Genomics - A private company involved in sequencing the human genome; its sequence database is one of several private ones.

NCBI

NCBI's databases are some of the most important databases in bioinformatics.

NCBI is the National Center for Biotechnology Information. It is part of the National Library of Medicine (NLM), which is itself a part of the National Institutes of Health (NIH), a U.S. government agency.

GenBank comprises assembled sequences from the literature, unpublished
submissions, and annotation provided by the original authors,
drawn from other commentary, or added by the curators.

GenBank sequences are typically the product of assembly of several
overlapping fragments, and have had coding regions, introns, and
other features identified by a variety of methods.
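To make the idea of an annotated record concrete, here is a minimal sketch of reading a record laid out like a GenBank flat file. The record itself (accession TOY00001, its features, and its sequence) is invented for illustration, and the parser handles only the few fields shown; it is not a complete reader for the format.

```python
import re

# Toy record in GenBank flat-file layout. Accession, features, and
# sequence are all invented for this example.
RECORD = """\
LOCUS       TOY00001                  30 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Invented example record.
ACCESSION   TOY00001
FEATURES             Location/Qualifiers
     source          1..30
     CDS             4..24
                     /translation="MKTAYI"
ORIGIN
        1 gcaatgaaaa ccgcttatat ttaagcagca
//
"""

def parse_genbank(text):
    """Return the accession, a list of (feature key, location) pairs, and the sequence."""
    accession, features, seq_parts = None, [], []
    section = None
    for line in text.splitlines():
        if line.startswith("ACCESSION"):
            accession = line.split()[1]
        elif line.startswith("FEATURES"):
            section = "features"
        elif line.startswith("ORIGIN"):
            section = "origin"
        elif line.startswith("//"):
            break  # end of record
        elif section == "features":
            # Feature keys start after five spaces; qualifier lines are
            # indented further and therefore do not match.
            m = re.match(r" {5}(\S+)\s+(\S+)", line)
            if m:
                features.append((m.group(1), m.group(2)))
        elif section == "origin":
            # Drop the leading base count and the spaces between blocks of ten.
            seq_parts.append("".join(line.split()[1:]))
    return accession, features, "".join(seq_parts).upper()

accession, features, sequence = parse_genbank(RECORD)
print(accession, features, len(sequence))
```

The feature locations are 1-based and inclusive, so the CDS at 4..24 corresponds to the Python slice sequence[3:24].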

As with any form of interpretation, these annotations can be
incorrect

Beware of plasmid sequences, primer sequences, and other
artifacts
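One rough way to look for this kind of artifact is to ask whether a sequence shares long exact substrings with a known vector or primer. The sketch below does this with k-mers. The vector and query sequences are invented, the choice of k=12 is arbitrary, and only the forward strand is checked; real screening uses alignment-based tools against curated vector collections.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def shared_kmer_positions(query, contaminant, k=12):
    """Return 0-based start positions in query of k-mers also present in contaminant."""
    known = kmers(contaminant, k)
    query = query.upper()
    return [i for i in range(len(query) - k + 1) if query[i:i + k] in known]

# Invented polylinker-like "vector" and an invented insert, for illustration only.
TOY_VECTOR = "GGATCCGAATTCAAGCTTGTCGACCTGCAG"
TOY_INSERT = "ATGAAAACCGCTTATATT"

# A submission in which 18 bases of vector were left attached to the insert.
contaminated = TOY_INSERT + TOY_VECTOR[6:24]

hits = shared_kmer_positions(contaminated, TOY_VECTOR)
print(hits)  # start positions of the suspicious 12-mers
```

A run of consecutive hit positions, as here, suggests a contiguous stretch of vector rather than a chance match.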

Growth of the sequence databases has been exponential (linear on a
logarithmic scale) since the mid-1980s, and shows no sign of slowing down.

Most sequences are submitted directly by the authors, typically
as a part of publication, and the submitting author continues to "own" the submission. This means that updates and changes are normally done with the permission of the author.

Data can be deposited, but held confidential until the article is
published

Most sequences are available for any use, although a few have been
patented, or have other legal restrictions.

GenBank data are also linked externally, to databases not maintained by GenBank

Amino Acid (Protein) Data

Originally easier to obtain, and consequently more common, than DNA
sequence data. Now mostly inferred sequences translated from DNA sequences.

GenBank has a parallel set of accession numbers for a protein database

Nucleotide Data

Now the most common original data type.

The amino acid data in a GenBank file are often inferred sequences
translated from the DNA sequence, but in some cases represent actual
polypeptide sequencing. The annotations should tell you how a sequence
was obtained.
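As an illustration of what "inferred by translation" means, the sketch below translates a coding sequence with the standard genetic code and compares the result with an annotated protein. The sequences are invented and the function names are hypothetical. Note that a match only shows the two are consistent; it cannot tell you whether the protein was sequenced directly, so the annotation remains the place to look.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def translate(cds):
    """Translate a coding sequence, stopping at the first stop codon."""
    cds = cds.upper()
    protein = []
    for i in range(0, len(cds) - 2, 3):
        # X marks any codon that is not in the table (e.g. it contains N).
        aa = CODON_TABLE.get(cds[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

def matches_translation(cds, annotated_protein):
    """True if the annotated protein equals the translation of the CDS."""
    return translate(cds) == annotated_protein

toy_cds = "ATGAAAACCGCTTATATTTAA"       # invented coding sequence
print(translate(toy_cds))               # MKTAYI
print(matches_translation(toy_cds, "MKTAYI"))
```

Organisms and organelles that use a different genetic code would need a different table; this sketch assumes the standard one throughout.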

Associated databases

Swiss-Prot

Protein Information Resource (PIR)

Swiss-Prot and PIR are derived databases in which data from
GenBank have been further analyzed and annotated.

Protein Data Bank (PDB)

The main database for protein structural data (primarily X-ray
crystallographic).

Protein Families (Pfam)

Profile HMM alignment database

Annotation

Seed alignment

Profile HMM

Full alignment (large, some with over 2500 sequences)

Assessing the reliability of data

Know the provenance of the data you are working with!

Because GenBank attempts to be comprehensive, it is a very large database

It is not possible to verify every sequence. Consequently some data in GenBank
will be erroneous

Some things to consider about your data:

Are they original data from a highly skilled and reputable lab?

Were they generated by an automated system without human intervention?

Are they preliminary data from an EST sequencing project, or a similar
technique expected to have a relatively high error rate?

Are gene-identity assignments made directly from biochemical data, or
are they second-generation (or worse) inferences made on the basis of
sequence similarity?

If you unwittingly work with bad data, your findings may prove to be invalid.

What can you do to ensure that you are working with "clean" data?

Work with a trusted source.

Check the data yourself

Use methods that can detect, correct, or accommodate invalid data.

In general your studies should always include an internal check
for the validity of the data.

Use Swiss-Prot, PIR, or another database where the identity of the
sequences has been more carefully checked. But be warned that this
means you have to trust the work of the curators.
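As one example of an internal check, the sketch below flags a few obvious problems in a nucleotide sequence before it is used. The function names and the 5% threshold for ambiguous bases are assumptions chosen for illustration, the standard genetic code is assumed, and IUPAC ambiguity codes other than N are simply reported as unexpected. Passing these checks does not show that a sequence is correct; it only rules out some easily detected errors.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code assumed

def check_nucleotides(seq, max_n_fraction=0.05):
    """Return a list of problems found in a nucleotide sequence (empty if none)."""
    seq = seq.upper()
    if not seq:
        return ["sequence is empty"]
    problems = []
    unexpected = sorted(set(seq) - set("ACGTN"))
    if unexpected:
        problems.append("unexpected characters: " + "".join(unexpected))
    n_fraction = seq.count("N") / len(seq)
    if n_fraction > max_n_fraction:  # threshold is an arbitrary illustration
        problems.append(f"{n_fraction:.0%} of bases are N")
    return problems

def check_coding_region(cds):
    """Return a list of problems found in an annotated coding region."""
    cds = cds.upper()
    problems = []
    if len(cds) % 3 != 0:
        problems.append("length is not a multiple of three")
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    # The final codon may legitimately be a stop; earlier ones should not be.
    stops = [n for n, codon in enumerate(codons[:-1], start=1) if codon in STOP_CODONS]
    if stops:
        problems.append("internal stop codon at codon " + ", ".join(map(str, stops)))
    return problems

print(check_nucleotides("ACGTRYNN"))
print(check_coding_region("ATGTAAAAATAA"))
```

Checks like these are cheap enough to run on every sequence in a data set, which makes it easy to record how many sequences were excluded and why.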