Boulder::Genbank provides retrieval and parsing services for NCBI Genbank-format records. It returns Genbank entries in Stone format, allowing easy access to the various fields and values. Boulder::Genbank is a descendant of Boulder::Stream, and provides a stream-like interface to a series of Stone objects.

>> IMPORTANT NOTE <<

As of January 2002, NCBI has changed their Batch Entrez interface. I have modified Boulder::Genbank to use a "demo" interface, which fixes things for now, but this isn't guaranteed to keep working in the long run.

I have written to NCBI, and they may fix this -- or they may not.

>> IMPORTANT NOTE <<

Access to Genbank is provided by three different accessors, which together give access to remote and local Genbank databases. When you create a new Boulder::Genbank stream, you provide one of the three accessors, along with accessor-specific parameters that control what entries to fetch. The three accessors are:

The Entrez accessor provides access to NetEntrez, retrieving the most recent Genbank information directly from NCBI's Web site. The parameters passed to this accessor are either a series of Genbank accession numbers or an Entrez query (see http://www.ncbi.nlm.nih.gov/Entrez/linking.html). If you provide a list of accession numbers, the stream will return a series of Stones corresponding to the numbers. If you provide an Entrez query instead, the entries will be returned in the order in which Entrez returns them.

The File accessor provides access to local Genbank entries by reading from a flat file (typically one of the .seq files downloadable from NCBI's Web site). The stream will return a Stone corresponding to each of the entries in the file, starting from the top of the file and working downward. The parameter in this case is the path to the local file.

The Yank accessor provides access to local Genbank entries using Will Fitzhugh's Yank program. Yank provides fast indexed access to a Genbank flat file using the accession number as the key. The parameter passed to the Yank accessor is a list of accession numbers. Stones will be returned in the requested order. By default the yank binary lives in /usr/local/bin/yank. To support other locations, you may define the environment variable YANK to contain the full path.
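The lookup rule for the yank binary can be sketched in a few lines of plain Perl. This is only an illustration of the documented behavior, not the module's own code; the helper name yank_path() is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the path of the yank binary, preferring the
# YANK environment variable over the documented default location.
sub yank_path {
    my $default = '/usr/local/bin/yank';
    return (defined $ENV{YANK} && length $ENV{YANK}) ? $ENV{YANK} : $default;
}

print "yank binary: ", yank_path(), "\n";
```

Setting YANK in the shell before running a script (for example to a copy of yank under your home directory) would therefore take precedence over the default path.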

It is also possible to parse a single Genbank entry from a text string stored in a scalar variable, returning a Stone object.

The new() method creates a new Boulder::Genbank stream on the accessor provided. The three possible accessors are Entrez, Yank and File. If successful, the method returns the stream object. Otherwise it returns undef.

new() takes the following arguments:

-accessor    Name of the accessor to use
-fetch       Parameters to pass to the accessor
-proxy       Path to an HTTP proxy, needed when the Entrez
             accessor has to go through a firewall

Specify the accessor to use with the -accessor argument. If not specified, it defaults to Entrez.

-fetch is an accessor-specific argument. The possibilities are:

For Entrez, the -fetch argument may point to a scalar, in which case it is interpreted as an Entrez query string. See http://www.ncbi.nlm.nih.gov/Entrez/linking.html for a description of the query syntax. Alternatively, -fetch may point to an array reference, in which case it is interpreted as a list of accession numbers to retrieve. If -fetch points to a hash, it is interpreted as extended information. See "Extended Entrez Parameters" below.

For Yank, the -fetch argument must point to an array reference containing the accession numbers to retrieve.

For File, the -fetch argument must point to a string-valued scalar, which will be interpreted as the path to the file to read Genbank entries from.
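The rules above amount to a dispatch on the accessor name and the type of the -fetch value. The following plain-Perl sketch restates them; describe_fetch() is a hypothetical helper written for illustration and is not part of Boulder::Genbank, and the accession numbers and file path are made-up examples.

```perl
use strict;
use warnings;

# Hypothetical helper: describes how a -fetch value would be interpreted
# for a given accessor, following the rules stated in the text.
sub describe_fetch {
    my ($accessor, $fetch) = @_;
    return unless defined $fetch;
    my $type = ref $fetch;    # '' for a plain scalar, else 'ARRAY', 'HASH', ...

    if ($accessor eq 'Entrez') {
        return 'Entrez query string'        if $type eq '';
        return 'list of accession numbers'  if $type eq 'ARRAY';
        return 'extended Entrez parameters' if $type eq 'HASH';
    }
    elsif ($accessor eq 'Yank') {
        return 'list of accession numbers'  if $type eq 'ARRAY';
    }
    elsif ($accessor eq 'File') {
        return 'path to a Genbank flat file' if $type eq '';
    }
    return;    # combination not described in the documentation
}

print describe_fetch('Entrez', ['M57939', 'M28274']), "\n";
print describe_fetch('File', '/data/genbank/gbpri1.seq'), "\n";
```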

For Entrez (and Entrez only) Boulder::Genbank allows you to use a shortcut syntax in which you provide new() with a plain list of accession numbers instead of named arguments.
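One way to picture the two calling conventions is as a normalization step that turns either form into the same set of settings. The sketch below is an assumption about how such a step could look, not the module's actual implementation: normalize_new_args() is a hypothetical helper, it assumes named arguments can be recognized by a leading "-", and the accession numbers are illustrative.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the documented calling conventions:
#   named form:    new(-accessor => 'Yank', -fetch => [...])
#   shortcut form: new('M57939', 'M28274')      (Entrez only)
sub normalize_new_args {
    my @params = @_;
    return unless @params;                      # nothing to fetch

    if ($params[0] =~ /^-/) {                   # assumed: named args start with '-'
        my %args = @params;
        return {
            accessor => $args{-accessor} // 'Entrez',   # documented default
            fetch    => $args{-fetch},
            proxy    => $args{-proxy},
        };
    }

    # Shortcut: every argument is taken as an accession number for Entrez.
    return { accessor => 'Entrez', fetch => [@params], proxy => undef };
}

my $args = normalize_new_args('M57939', 'M28274', 'L36028');
print "$args->{accessor}: @{ $args->{fetch} }\n";
```

Under this reading, the shortcut call is equivalent to passing -accessor => 'Entrez' together with -fetch set to an array reference holding the same accession numbers.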

The get() method is inherited from Boulder::Stream, and simply returns the next parsed Genbank Stone, or undef if there is nothing more to fetch. It has the same semantics as the parent class, including the ability to restrict access to certain top-level tags.
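A typical read loop calls get() until it returns undef. The sketch below wraps such a loop around the File accessor; it is a hedged example rather than a reference implementation. The helper accessions_in_file() and the file path are hypothetical, and the code simply returns nothing when Boulder::Genbank is not installed or the file cannot be read.

```perl
use strict;
use warnings;

# Hypothetical helper: returns an array reference of the accession numbers
# found in a local flat file, or nothing when Boulder::Genbank is not
# installed or the file is not readable.
sub accessions_in_file {
    my ($path) = @_;
    return unless defined $path && -r $path;
    return unless eval { require Boulder::Genbank; 1 };

    my $gb = Boulder::Genbank->new(-accessor => 'File', -fetch => $path)
        or return;

    my @accessions;
    while (my $stone = $gb->get) {              # undef ends the loop
        # Accession may be multi-valued, so collect every value.
        push @accessions, $stone->get('Accession');
    }
    return \@accessions;
}

my $found = accessions_in_file('/data/genbank/gbpri1.seq');   # illustrative path
print defined $found ? "@$found\n" : "no entries read\n";
```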

The put() method is inherited from the parent Boulder::Stream class, and will write the passed Stone to standard output in Boulder format. This means that it is currently not possible to write a Boulder::Genbank object back into Genbank flat file form.

The Entrez accessor recognizes extended parameters that let you customize the search. Instead of passing a query string scalar or a list of accession numbers as the -fetch argument, pass a hash reference. The hashref should contain one or more of the following keys:

Each record returned from the Boulder::Genbank stream defines a set of methods that correspond to features and other fields in the Genbank flat file record. Stone::GB_Sequence gives the full details, but they are listed for reference here:

The tags returned by the parsing operation are taken from the NCBI ASN.1 schema. For consistency, they are normalized so that the initial letter is capitalized and all subsequent letters are lowercase. This section contains an abbreviated list of the most useful and common tags. See "The NCBI Data Model", by James Ostell and Jonathan Kans in "Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins" (Eds. A. Baxevanis and F. Ouellette), pp. 121-144 for the full listing.
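The normalization rule is easy to reproduce in plain Perl, which can be handy when mapping a Genbank feature key to the tag name to look up. This is a minimal sketch of the stated rule only; normalize_tag() is a hypothetical helper and the input keys are examples.

```perl
use strict;
use warnings;

# Hypothetical helper: capitalize the first letter, lowercase the rest.
sub normalize_tag {
    my ($tag) = @_;
    return ucfirst lc $tag;
}

# Prints: Cds Mrna Polya_site Repeat_unit
print join(' ', map { normalize_tag($_) } qw(CDS mRNA polyA_site repeat_unit)), "\n";
```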

The Accession tag holds the accession number of this entry. Because of the vagaries of the Genbank data model, an entry may have multiple accession numbers (e.g. after a merging operation). Accession may therefore be a multi-valued tag.

The taxonomic name of the organism from which this entry was derived. This line is taken from the Genbank entry unmodified. See the NCBI data model documentation for an explanation of their taxonomic syntax.

The Features tag points to a Stone record that contains multiple subtags. Each subtag is the name of a feature which points, in turn, to a Stone that describes the feature's location and other attributes. The full list of features is beyond the scope of this document, but the following are the features that are most often seen:

Cds          a CDS
Intron       an intron
Exon         an exon
Gene         a gene
Mrna         an mRNA
Polya_site   a putative polyadenylation signal
Repeat_unit  a repetitive region
Source       more information about the organism and cell
             type the sequence was derived from
Satellite    a microsatellite (dinucleotide repeat)