One of the largest challenges in analyzing genomic data is associating
the experimental data with the available biological {\em metadata}, e.g.,
sequence, gene annotation, chromosomal maps, literature. Bioconductor
provides two main packages for this purpose: \texttt{annotate}
(end-user) and \texttt{AnnBuilder} (developer). These packages can be
used for querying databases such as GenBank, GO, LocusLink, and
PubMed, from R, and for processing the query results in R.\\

In this lab, we focus on annotation resources for Affymetrix
chips. Since Affymetrix chips have a standard layout, Bioconductor
can provide annotation data packages for the main types of Affymetrix
chips, e.g., hu6800, hgu33, hgu95, mgu74, and rgu34 series. These data packages are built using the
\texttt{AnnBuilder} package. By contrast, there is no standard
array design for two-color spotted microarray experiments; each lab tends
to have its own custom design. In this case, specific annotation data packages
will have to be created for each facility using
\texttt{AnnBuilder}. Once annotation data packages are constructed to provide
mappings between different sets of gene identifiers, the tools in
\texttt{annotate} can be used in a similar manner for both platforms.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Mappings between different gene identifiers}

A main task in annotation is to associate manufacturer (e.g., for
Affymetrix chips) or in-house (e.g., for custom cDNA spotted arrays)
probe identifiers to other available identifiers
(e.g., PMID, GenBank accession number).\\

In this lab, we will use the Bioconductor annotation data packages to map
between Affymetrix identifiers and identifiers for biological metadata
available on the WWW.
Because these data sources are very large, are constantly evolving, and
are similar across species, we have adopted the strategy of distributing
data in regularly updated R packages. Each mapping is contained in an {\em environment}.
For our purposes, an environment is simply a {\em hash table}.
It provides a very fast way of looking up {\em values} for specific
text {\em keys}.
This implementation is not the best and we hope to develop real hash tables
for R but that will take some time.\\

We will again use the Golub et al. (1999) dataset as a case study and
will explore annotation resources for the Affymetrix HU6800 chip used
in this study. The Bioconductor
annotation data package for the HU6800 chip is \texttt{hu6800} and can
be downloaded from the "Data packages" section of the Bioconductor
website. To load the packages needed for this lab
<>=
library(Biobase)
library(annotate)
library(tkWidgets)
library(geneplotter)
library(golubEsets)
library(hu6800)
library(GO)
@
Alternately, since all these packages are required by the
\texttt{EMBO03} course package, simply use
<>=
library(EMBO03)
@

After loading the \texttt{hu6800} package, we have access to a number of
different datasets (environments). These provide the many different
mappings from Affymetrix identifiers to other probe identifiers, such
as chromosomal location (in the \texttt{hu6800CHR} environment) and PMID
(in the \texttt{hu6800PMID} environment). Quality control data, counts,
etc., for the mappings are available by calling the function
\texttt{hu6800()} and by examining the help page, \texttt{?hu6800}.
<>=
hu6800()
@

The main R functions we will use for dealing with environments are \texttt{ls} (to list the names of objects in a
specified environment) and \texttt{get} (to search for an R object with
a given name and return its value, if found, in the specified environment).
We obtain the Affymetrix identifiers using \texttt{ls} and note that there are 7,129 identifiers for the
HU6800 chip (this matches the Golub \texttt{exprSets}).
<>=
affyID

We now arbitrarily select the probe set with Affymetrix identifier
\texttt{"U18237\_at"} and obtain a number of other IDs corresponding to
this probe set. Note that there may be several PMIDs, i.e., PubMed abstracts, associated
with a given gene.

In some cases, we need to obtain data on several genes at once. We
wrote a special function for this purpose: \texttt{multiget}.
<>=
fivegenes[6:10]
fivegenes
multiget(fivegenes, env = hu6800PMID)
@

Instead of relying on the general R functions for environments (e.g., \texttt{get}), the
development version of the \texttt{annotate} package also provides new and more user-friendly
functions for accessing specific identifiers.
<>=
getSYMBOL(fivegenes, data="hu6800")
getLL(fivegenes, data="hu6800")
getPMID(fivegenes, data="hu6800")
gg[2]], "MF")
@

A variety of information about the probes can now easily be obtained
by, for example, querying PubMed, GenBank, and LocusLink, as described below. In
addition, one can also use R to compute on these data and perform a
number of quality control and exploratory analyses.
<>=
whChrom

The National Library of Medicine (NLM) provides a great deal of information
on biological data. We have
only just started to explore these resources. We have already
developed tools for interacting with PubMed, but need to develop
further functionality on MeSH and other available resources.
A mapping has been done between LocusLink and PubMed. This associates specific
articles with specific genes. The articles can be queried interactively using
tools available in R and other systems.
Full text abstracts are readily available; in some cases full text
articles are also available (we don't have the tools to make use of
the later).
We next demonstrate how to interact with PubMed using functions
in \texttt{annotate}.\\

A \texttt{pubMedAbst} class structure was defined for handling
PubMed abstracts in R. The slots are
<>=
slotNames("pubMedAbst")
@

The basic engine for talking to PubMed is the function
\texttt{pubmed}. Given a vector of PMIDs, the function either has a
browser display a URL showing the results of the PubMed query for those
identifiers (\texttt{disp="browser"}) or creates an \texttt{XMLdoc}
object with the same data (\texttt{disp="data"}).
<>=
pmids

The function \texttt{pm.getabst} provides a simpler way to
download the specified PubMed abstracts (stored in XML) and create a
list of \texttt{pubMedAbst} objects. The following commands can be
used to store the PubMed abstracts for 5 genes in a list of objects of
class \texttt{pubMedAbst}, to compute the number of abstracts retrieved
for each gene, and to print the abstract(s) for the first gene. We can
see, for example, that one of the genes has 21 abstracts associated with it.
<>=
absts2[1]]
@

The functions
\texttt{pm.titles} and \texttt{pm.abstGrep} can then be used to
extract the titles from a set of PubMed abstracts and for regular
expression matching on these abstracts, respectively. For the genes with Affy IDs \texttt{fivegenes}, the following
commands extract the abstract titles and search the
abstracts for the word "protein", with lower case "p" or upper case "P".
<>=
pm.titles(absts2)
sapply(absts2, function(x) pm.abstGrep("[Pp]rotein", x))
@

The new function \texttt{pmAbst2html} from the development version of
\texttt{annotate} takes a list of
\texttt{pubMedAbst} objects and generates an HTML report with
the titles of the abstracts and links to their full page on
PubMed. The report will be stored in the file \texttt{pm.html} in the
working directory.

The function \texttt{ll.htlmpage} generates HTML reports, with one row per gene and a clickable entry which opens the LocusLink webpage for that gene. Additional information can be displayed using the
\texttt{othernames} argument as shown below. Such HTML reports provide an
easy way to communicate results with other researchers working on the same analysis.

The classes \texttt{chromLoc} and \texttt{chromLocation} are used to
keep track of location information for a single gene and a set of
genes (entire genome), respectively. The object
\texttt{hu6800ChrClass} from the \texttt{EMBO03} course package is an
instance of the class \texttt{chromLocation} for the Affymetrix HU6800
chip. It was constructed using the "byChroms" vignette in the \texttt{geneplotter} package.

The \texttt{geneplotter} package provides two main functions for
plotting genomic data: \texttt{cPlot} and \texttt{alongChrom}. These
functions operate on instances of the class \texttt{chromLocation}.
With \texttt{cPlot}, we can render all chromosomes the same length or
we can scale them by their relative lengths. Note that the \texttt{mva} package
in the next release of R (1.7.0) will have new functions for heat maps and dendrograms.