Scientific Objectives:
The long term goal of our collaboration is to develop an experimentally
tractable and closed model system to globally unravel and
understand the evolution, physiology and biochemistry of the
genus Oryza. The specific objectives of this proposal are
to: 1) Construct DNA fingerprint/BAC-end sequence physical
maps from 11 deep coverage BAC libraries that represent the
11 wild genomes of Oryza (830,000 fingerprints; 1,659,000
BAC ends);
2) align the 11 physical maps with the sequenced reference
subspecies japonica and indica;
3) construct high-resolution physical maps of rice chromosomes
1, 3 and 10 across the 11 wild genomes using a combination
of hybridization and in silico anchoring strategies, and;
4) provide convenient bioinformatics research and educational
tools (FPC and web-based) to rapidly access and understand
the collective Oryza genome.

Broader Impacts:
The research proposed will provide the first ever closed experimental
system to understand the evolution, physiology and biochemistry
of a single genus in plants or animals. We will align representatives
of eleven wild genomes of rice, including both diploids and
tetraploids, to the sequenced and finished O. sativa ssp.
japonica AA diploid genome. Such a system will empower the
scientific community to address complex scientific questions
on a whole genome scale. For example, one would be able to
determine the majority of genome rearrangements leading to
the present day wild species as compared with the sequenced
cultivated rice. Such data can be used to study the dynamics
of the evolution of a genus and the impacts of domestication.
Another example is that one could move vertically across genomes
to explore the diversity and evolution of disease resistance
gene clusters and their cis regulatory elements. Such data
could be used to rapidly identify new and useful disease resistance
genes as well as to define conserved regulatory sequences.
This research will not only impact rice genomics but will
be useful for understanding monocot biology in general and
can serve as a model to establish similar systems in both
plants and animals.

The PIs have received continuous
NSF support since 1994. Wing is PI/Co-PI on 11 (3 of which
will expire in September 2003; for 2 of which all experimental
objectives will have been completed); Jackson on 1 and Stein
on 1 active NSF Plant Genome grants. Here we will describe
our results from prior support of grants directly related
to the proposed work.

C.1.1) Title:
“Sequencing Rice Chromosomes 3 and 10”: PI R. Wing,
Co-PIs D. McCombie, R. Wilson, C. Soderlund, J. Jiang. NSF
Award# DBI-9982594 (09/15/99-09/14/2002); NSF Supplemental
# (09/15/02-09/14/03); USDA-CSREES award# 99-3517-8505 (09/15/99-09/14/2002);
USDA Supplemental award# (09/15/02-09/14/03); DOE Supplemental
award# (09/15/02-09/14/03). Award amount: $5,100,000 ($2,550,000
NSF & $2,550,000 USDA) Supplemental Award amount $1,100,000
($500k NSF, $500k USDA, $100k DOE).
The overall objective of this 4-year
project was to finish and annotate rice chromosomes 10 and
3. Our consortium, ACWW (Arizona
Genomics Institute [Wing, Yu, Soderlund], Cold Spring Harbor
Laboratory [McCombie, Stein], Washington University Genome
Sequencing Center [Wilson] and the University of Wisconsin
[Jiang]), was assigned the top half of chromosome 10 (Mb) and
the short arm of chromosome 3 (Mb), while TIGR (Buell) was
assigned the lower half of ch10 and the long arm of ch3. Ch10
is finished and was published in Science in June 2003 (The Rice
Chromosome 10 Sequencing Consortium: Yu et al. 2003). Although
ch10 is the smallest of the rice chromosomes (22.4 Mb), it is
also one of the most heterochromatic, especially the top 12
Mb, which made it a difficult chromosome to finish. In the end,
however, ACWW finished all but 1 of its BACs without any
sequencing gaps, to a standard of no more than 1 error in
10,000 bases.
We are now focusing our efforts on
finishing our region of chromosome 3. As an interim goal,
ACWW and TIGR finished a 10X BAC-by-BAC Phase 2 draft of ch3
in December 2002 (IRGSP 2002). For the ACWW region, all sequence
is contiguous, representing 16.7 Mb of the rice genome, except
for 3 physical gaps between FPC contigs. To date 112 BACs
are finished and are in annotation, 5 BACs are in the Phase
2 stage (3 in 1 contig) and 4 BACs are in phase 1 stage. We
therefore do not anticipate any delays in achieving the goal
of finishing and annotating rice chromosome 3 by September
15, 2003 if not earlier. This is especially true because the
remaining genomic sequence is from the more euchromatic region
of ch3. ACWW sequencing status can be viewed for all chromosomes
at www.genome.arizona.edu/shotgun/rice/status,
which links to the sequence location in the webFPC map.
In addition to the scientific contributions
made by this project, it also helped us to develop the talent,
infrastructure, methodology and management skills to tackle
large scale sequencing and physical mapping projects like
the one being proposed here.

C.1.2) Title: “The
Oryza BAC Library Project”: PI R. Wing, Co-PI C. Soderlund,
Tomkins (Clemson) and S. Jackson (Purdue). NSF Award# DBI-0208329
(09/15/02-09/14/04). Award amount: $600k.
The long term objective of this project
is to construct 11 deep-coverage large-insert BAC libraries
from representatives of all the known wild genomes of rice
(see Table 1 for species and genome sizes) and provide affordable
access to these resources (clones, filters and libraries)
through the Arizona and Clemson BAC/EST Resource Centers.
The grant was awarded this October and will provide the raw
material for our proposal to align the wild genomes with the
cultivated and sequenced A genome diploids.
Progress: In mid-November 2002, PIs
Wing and Jackson traveled to the International Rice Research
Institute (IRRI: Philippines) to prepare high molecular weight
DNA for BAC library construction. The 4-week trip was
productive: we produced 1-2 DNA preparations
from 8 of the 11 species. Unfortunately, the DNA samples were
not digestible with restriction enzymes, most likely due to
the poor condition of the plant material. In May 2003 Dr.
Meizhong Luo returned to IRRI to prepare megabase-size DNA
from much younger tissue and brought these samples back to
Arizona. Fortunately, this time the majority of DNA samples
were digestible with restriction enzymes and therefore can
be used for BAC library construction (see figure 1 below).
These samples are now being used to build BAC libraries in
Arizona and Clemson. The first library, from O. rufipogon,
has been cloned with an average insert size of 134 kb (see figure
2 below). We plan to construct the 11 BAC libraries over the
next 15 months. The Wing laboratory has over 10 years
of experience in construction of BAC libraries, including
over 10 rice libraries, and we do not anticipate any problems
in completing our objectives within the time frame of the
project now that we have suitable DNA preparations.

FIGURE 1. Enzyme digestions
of wild species HMW DNA.

Figure 2. NotI digestion
of Oryza rufipogon BAC clones

C.1.3) Title: "Maize
Mapping Project": Lead PI: E. Coe (Missouri); Arizona
PI: R. Wing, Co-PI C. Soderlund. NSF Award# DBI-9872655 (09/15/98-09/14/03).
Total award amount: $1,629,246.
The primary role of our group, in
collaboration with Missouri, was to construct a genetically
anchored phase I BAC physical map of the maize genome. Our
objectives were to: 1) construct a deep-coverage BAC library
of the inbred B73 using HindIII; 2) fingerprint the HindIII
library and an additional EcoRI BAC library; 3) assemble the
fingerprints into contigs using FPC; 4) order and merge as
many contigs as possible along the maize genetic map to create
a phase I physical map.
Objectives 1-3 are complete and we
are now in the final year of the project. We generated about
300,000 successful fingerprints from the two libraries which
assembled into about 4500 contigs using FPC. All fingerprint
and anchor data can be downloaded and viewed using WebFPC
and WebChrom at the www.genome.arizona.edu/fpc/maize
site, where WebFPC displays the FPC contigs, and WebChrom
shows the location of genetic markers and the FPC contigs,
with links to external web based databases. Integration of
the genetic map with the physical map can be viewed using
iMAP at Missouri. The contigs cover approximately 2036 Mb
of the 2500 Mb genome, and the longest contig is approximately
4 Mb. The libraries have been hybridized with 14,715
probes, 798 of which are genetically mapped.
Our final year goal is to order and
merge as many contigs as possible with a realistic goal of
achieving about 3000 contigs in the end, 1000 of which will
be genetically anchored. Although this is not a complete physical
map, we have developed an important resource for
maize genetics that is widely used by the community.

C.1.4) Title: "Comparative
genomics of rice: reconstructing rice chromosome 1 in related
species.": PI S. Jackson, Co-PI P. SanMiguel (Purdue).
NSF# DBI-0227414 (10/01/02-09/31/07). Award amount: $1,630,537.
The long term objective of this proposal
is to use BAC libraries and Overgo technology to reconstruct
rice chromosome 1 in 6 related Oryza species to examine chromosome
evolution in a group of closely related species and to develop
tools for comparative mapping in plant genomes.
Progress: We have developed a computational
pipeline to sift through genomic sequences to find overgos
that will be used for comparative mapping. We have designed
and begun testing the first 96 out of several thousand overgos
on rice and a wild rice BAC library. A description of these
resources and an online database are available at http://rice.genomics.purdue.edu.
Although this project has only just been funded, we have made substantial
progress, and one postdoc, one technician and two graduate
students are already on board working on this project.
We anticipate no problems attaining
the goals of this project within the timeline and are collaborating
with R. Wing to expand the scope of this project to include
the entire rice genome instead of just rice chromosome 1.

C.1.5) Title: "Gramene:
A resource for comparative grass genomics": PI: L. Stein; Co-PI:
S. McCouch (Cornell). USDA-IFAFS# 2000-04538 (09/01/00-08/31/04).
Award amount: $2,098,000.
Gramene (http://www.gramene.org)
is a comparative mapping resource for monocot genomes. It
combines the extensive colinearity between the genetic maps
of the cereals with the draft and finished genomes of rice
to create an environment in which researchers can move easily
from a genetically-defined region in one species to a physical
map in another species, and ultimately to an annotated region
of the rice genome.
Gramene then builds on top of this
comparative mapping framework by adding a knowledgebase of
rice functional genomics. From the Gramene web site, researchers
can browse and download an extensive collection of annotated
rice mutants and their phenotypes, the gene ontology annotations
of rice gene products, protein orthology relationships, and
an extensively annotated bibliography of rice biology. Researchers
also have access to essential resources for comparative genomics,
including an ontology of monocot developmental and phenotypic
terms (developed jointly with MaizeDB), assay information
for the markers used to develop monocot genetic maps, and
outgoing links to stock centers and genomics sites.

Fig. 1: Comparing maize map to genetic and physical maps
of rice

In a typical use case, a researcher
who is interested in identifying candidate genes in a genetically
defined region of maize can use the comparative map display
to find the corresponding region in the genetic and physical
maps of rice. From there, the researcher can navigate to a display of the
rice genome in the selected area (Figure 1) and find annotated
candidate genes.
Gramene currently provides the following
data sets to the research community. Typically these sets are
available both as browsable web pages and bulk downloads from
the Gramene FTP site. We produce a major new web site build
every three months, but some data sets are updated more rapidly
as circumstances warrant.

1. Rice genomic sequence
data: Draft and finished rice genomic sequence, including
both japonica and indica subspecies.
Fully attributed annotations from
the rice sequencing groups. These vary in type and protocol
from region to region.
Uniform annotations that Gramene
generates internally. Currently this consists of FGENESH gene
predictions (Salamov and Solovyev 2000), aligned ESTs and
EST clusters, BAC end sequences from rice and other monocots,
and genetic markers from rice and other monocots. In the near
future we will be adding BLASTX (Altschul et al. 1990) protein
to nucleotide alignments, followed by Genewise (Birney and
Durbin 2000) alignments and Ensembl pipeline consensus gene
models, as described in the Experimental Plan.

2. Rice protein/gene product
data: Gramene provides protein records for all published
rice protein sequences and hypothetical gene products that
have been submitted by the genomic sequencing groups to GenBank/EMBL.
To provide researchers with the most
up to date information on the putative functions of rice gene
products, we cross reference all confirmed and hypothetical
products with Interpro (Mulder et al. 2003) to provide electronic
Gene Ontology annotations. In addition, we manually inspect
all confirmed protein entries in order to produce hand-curated
GO annotations.
Phylogenetic relationship information
is provided by a table of precomputed BLAST scores, as well
as a link to the BLINK service (Wheeler et al. 2003).

3. Monocot genetic and
physical map data: We currently provide access to 22 genetic
and physical maps for rice, maize, barley, sorghum, oat and
the triticeae. Included in this set is the Oryza sativa
japonica BAC physical map developed by R. Wing (Chen et
al. 2002), and 11 widely used genetic maps of rice. We select
which maps to publish in consultation with our Scientific
Advisory Board, collaborators and other representatives of
the academic and commercial research communities.
For any map displayed on Gramene,
researchers can find the molecular data required to reproduce
the component marker assays. In the case of rice maps, the
marker molecular data (e.g. primer pairs) is retrieved directly
from the Gramene database. For other species, the molecular
data is retrieved indirectly via links to affiliated databases
such as MaizeDB (Coe et al. 2002) and GrainGenes [http://wheat.pw.usda.gov/index.shtml].

4. Ontology annotations
of rice mutants and strains: Researchers can access a
set of ontologies from the Gramene web site. In addition to
the standard GO, we have developed several plant-specific
ontologies in collaboration with MaizeDB and TAIR. The Plant
Ontology (PO) is a collection of concepts to describe plant
anatomic locations and developmental stages. The Trait Ontology
(TO) is a collection of concepts to describe abnormal traits
and phenotypes. All of these ontologies are preliminary, and
a major effort for the coming years is to generalize and refine
them as we apply them to monocot genome annotation.
We provide researchers with access
to 470 published rice mutants, each of which has been annotated
with PO and TO terms. Mutant records are often accompanied
by illustrations of the phenotype, and have links to the genome,
when known, to the OryzaBase strain database, and to literature
references.
The Gramene architecture is based
around the Oracle 9i database, accessed via a large open source
middleware layer. We use the Ensembl data model (Clamp et
al. 2003) for managing and displaying the rice genome and
its annotations, the Gene Ontology Consortium schema for ontologies,
and custom software for display of protein annotations, comparative
maps, rice mutants, and bibliographic references. With the
exception of the Oracle DBMS itself, all software used or
produced by Gramene is open source, allowing any group to
adapt and use our system.
The bulk of Gramene's analytic work
is producing alignments to the rice genome. We use a version
of the Ensembl pipeline called Biopipe, which has recently
been released by Elia Stupka's group in the Singapore Fugu
Sequencing Project. Biopipe is open source cluster management
software that allows us to distribute tasks among multiple
machines. The primary workhorse for our alignments is Blat
(Clamp et al. 2003), a fast sequence alignment algorithm developed
by Jim Kent for use in human genome assembly. The combination
of Biopipe and Blat allows us to align 300,000 single reads
(ESTs or BAC ends) to the rice genome per hour. We use stringent
criteria to ensure that alignments are good ones. These criteria
provide an overall alignment success rate of 77% for rice
genomic sequence sources, such as BAC ends, and 80% for rice
ESTs. ESTs derived from non-rice monocots are aligned unambiguously
roughly 70% of the time. These numbers should be viewed in
the context of the incomplete state of the rice genome. Extrapolating
from the current state of the rice genome to a 99% complete
state, we expect to see success rates of 88% and 91% for
rice genomic and EST sequences respectively by the time the
IRGSP sequencing project is complete in December 2004.
Since going online in January 2002,
the community usage of Gramene has increased rapidly and now
stands at over 50,000 meaningful hits per month, where a meaningful
hit is defined as one that causes a database access rather
than the retrieval of a static page or image.

The
Poaceae family of grasses is one of the most intensely studied
families in plant science and is thought to have originated
55-70 million years ago. The grass family includes about 10,000
species, which cover approximately 20% of the earth's land surface
(Shantz 1954). Poaceae includes all the major cereal species
such as corn, sorghum, sugarcane, millet, wheat, barley, rye,
oats and rice.
Conservation of gene order across large
sections of grass genomes has been documented for maize and
sorghum (Hulbert et al. 1990), rice and maize (Ahn and Tanksley
1993), wheat and rice (Kurata et al. 1994) and maize, wheat
and rice (Ahn et al. 1993) for evolutionary periods as long
as 65 million years. In 1994, Moore et al. (Moore et al. 1995)
designed an ingenious representation of Poaceae known as "The
Circle Diagram" which shows the relationships of the genomes
of several members of Poaceae drawn in concentric circles with
rice (Oryza sativa) forming the smallest circle. These studies
and others have been a major driving force to describe the grasses
as a "Collective Model Genetics System" - to study plant evolution,
development and genetics.
Rice is the most important food crop
in the world. Its compact genome, evolutionary relationship
with other cereals and sophisticated molecular genetic tools
have made sequencing the rice genome a top priority for plant
science (Sasaki and Burr 2000). To meet this priority, the International
Rice Genome Sequencing Project (IRGSP: http://rgp.dna.affrc.go.jp/Seqcollab.html)
was formed in 1998 with the goal of completely sequencing the
rice genome by the end of 2008. The IRGSP relied heavily on
the BAC physical map/BAC end sequence framework that the Wing
lab constructed to sequence the rice genome (Chen et al. 2002;
Mao et al. 2000; Wing et al. 2001). The 2008 goal was accelerated
by announcements from Monsanto (April 2000) (Barry 2001) and
Syngenta (January 2001) (Davenport 2001; Goff et al. 2002) and
with their help, a 10X draft of the rice genome was publicly
released on December 18th, 2002 (IRGSP 2002). The new goal set
by the IRGSP to generate a complete finished rice genome is
December 2004. As part of the IRGSP, the ACWW Rice Genome Sequencing
Consortium [Arizona Genomics Institute (AGI), Cold Spring
Harbor, Washington University - GSC, U. of Wisconsin],
was funded in October 1999 to sequence the short arms of rice
chromosomes 10 and 3. Together with The Institute for Genomics
Research (TIGR) and the Plant Genome Initiative at Rutgers (PGIR
~ 3 Mb of ch10) we finished chromosome 10 (Wing et al. 2003)
and are scheduled to finish and annotate chromosome 3 by October
of 2003. Japan and China recently published finished sequences
for chromosomes 1 (Sasaki et al. 2002) and 4 respectively (Feng
et al. 2002). With knowledge that the rice genome will soon
be finished it is critical that we have the tools in place to
properly annotate and functionally characterize the rice genome
and be able to apply this information to other grass genomes.
To this end we were recently awarded an NSF grant to construct
large-insert deep-coverage BAC libraries from representatives
of the 11 wild genomes of rice shown in Table 1 and described
in detail in the results from prior NSF Support section.
Table 1. Oryza BAC libraries under construction

These
BAC libraries will be the tools necessary to explore how plants
evolve and adapt to variability in genome size and structure,
ecological habitats and changes in development and physiology.
Within the genus Oryza, genome size varies 5-fold (Table 1),
polyploidy exists, there are structural chromosome changes
between species (Hass et al. 2003; Shishido et al. 2001) and
the habitat adaptation varies from forests to swamps and from
Himalayan foothills to Caribbean islands (Vaughan 1994). Developmentally
these species show differences in many aspects of plant growth
and response to environmental stimuli. Since the central species,
O. sativa, is a crop species, the ability to examine crop
evolution/domestication at the genic/genome level represents
an unprecedented opportunity. Several genes important to rice
domestication have been placed on the high-density O. sativa
linkage map (Cai and Morishima 2000; Zhou et al. 2001); therefore,
the opportunity to study specific genes involved in crop domestication
will be available. Because of the depth of genetic information
available for O. sativa, this Oryza BAC resource will form
a closed system for the study of evolution of specific physiological/developmental
genes and large-scale genome events.
Such work will also lead to rapid
isolation of developmentally, physiologically and agriculturally
important loci and regulatory regions for comparative functional
studies.

The
long-term focus of our research is to develop and exploit the
tools of genomics to make the Oryza genus the most advanced
and tractable model system to study plant evolution, development,
physiology and crop science in the world. The specific objectives
of this proposal are to: 1) Construct DNA fingerprint/BAC-end
sequence physical maps from 11 deep coverage BAC libraries that
represent the 11 wild genomes of Oryza (830,000 fingerprints;
1,659,000 BAC ends) (Table 1); 2) align the 11 physical maps
with the sequenced reference subspecies japonica and
indica; 3) construct high-resolution physical maps of
rice chromosomes 10 and 3 across the 11 wild genomes using a
combination of hybridization and in silico anchoring
strategies, and; 4) provide convenient bioinformatics research
and educational tools (FPC and web-based) to rapidly access
and understand the collective Oryza genome.

We
will fingerprint all BAC clones from each wild genome library
for a total of 830,000 clones using robust techniques established
in the AGI Physical Mapping Center (Chen et al. 2002; Marra
et al. 1997). Briefly, DNA from BAC clones will be prepared
from 1.2 ml cultures in 96 well format by a modified alkaline
lysis method. Our lab employs a Tomtec Quadra 320 liquid handling
robot to reduce manual pipetting errors. Typical yields from
1.2 ml cultures provide sufficient DNA for both a fingerprinting
digest and two end-sequencing reactions. DNA from each prep
will be divided into a fingerprinting sample and two end-sequencing
samples. The sample for fingerprinting will be digested with
HindIII and electrophoresed on high-resolution agarose gels.
Fingerprint data will be scored using Image3 software from the
Sanger Center (www.sanger.ac.uk/Software/Image/). The AGI fingerprinting
lab has an established throughput of 24 96-well plates per day,
allowing a production of 11,520 fingerprints/week. Bench work
should be completed in approximately 18 months, and band calling
by 24 months from project initiation. We plan to fingerprint
one library at a time to avoid possible contamination between
libraries. Band data will be uploaded to the assembly program
FPC (Soderlund et al. 1997). All available marker data will
be included with the assembly. Merging of contigs and additional
finishing of the physical maps will be done in collaboration
with CSHL and Purdue as the data set for each library is completed.
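As a rough check on the schedule above, the following sketch reproduces the throughput arithmetic from the quoted figures. The five-day working week is an assumption inferred from the stated weekly total, and the estimate makes no allowance for failed lanes or repeated plates, which is why it comes out somewhat below the 18 months budgeted.

```python
import math

PLATES_PER_DAY = 24        # from the text
WELLS_PER_PLATE = 96       # from the text
WORK_DAYS_PER_WEEK = 5     # assumption: implied by 11,520 fingerprints/week
TOTAL_CLONES = 830_000     # from the text
WEEKS_PER_MONTH = 52 / 12  # assumption: average calendar month


def weekly_fingerprints(plates_per_day=PLATES_PER_DAY,
                        wells_per_plate=WELLS_PER_PLATE,
                        work_days=WORK_DAYS_PER_WEEK):
    """Fingerprints attempted per week at the stated plate throughput."""
    return plates_per_day * wells_per_plate * work_days


def weeks_to_finish(total_clones=TOTAL_CLONES):
    """Whole weeks of bench work needed, ignoring reruns and downtime."""
    return math.ceil(total_clones / weekly_fingerprints())


if __name__ == "__main__":
    weeks = weeks_to_finish()
    print(weekly_fingerprints(), "fingerprints/week")
    print(weeks, "weeks, about", round(weeks / WEEKS_PER_MONTH, 1), "months")
```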
Although other methods are available
for BAC fingerprinting using capillary electrophoresis (e.g.
high information content fingerprinting (HICF; Ding et al. 2001)),
we selected the agarose method because it is much less expensive
(~$2,000,000 less expensive than HICF for this project) and
is robust in our laboratory. Further, with all BACs end-sequenced
(below) and the finished rice genome available, we believe the contigs
produced using the agarose method will be sufficient for
aligning them to the reference rice genome. Ed Butler will run
and coordinate the day-to-day operations of this component.

We
will sequence the ends of the 830,000 BAC clones (1,659,000
ends: 3/4 at AGI, 1/4 at Purdue) from the same template preparation
used for fingerprinting, following routine techniques
established in the AGI DNA Sequencing Center. Briefly, 5 ul of
DNA isolated as described above is reacted with 4 ul of sequencing
chemistry (ABI BigDye v.3.x) with T7 and M13 reverse primers
in a total reaction volume of 15 ul. Cycle sequencing
is performed on a Tetrad (M.J. Research) in 96-well format with the
following parameters: 96 C for 4 min, followed by 100 cycles of 95 C for
15 sec, 50 C for 10 sec and 60 C for 3 min. EtOH precipitation
is used to remove excess terminators from the sequencing
reactions, and purified reactions are resuspended in 5 ul of
HIDI (Applied Biosystems). Reactions are separated on an ABI 3730xl
DNA sequencer for 90 min with an injection time of 45 sec. The
programs Phred (Ewing and Green 1998; Ewing et al. 1998) and Cross_match
(Green 1999) are used for base-calling and for trimming vector sequences.
Successful sequences (those with more than 100 bp of Phred 20 quality)
are collected and submitted to GenBank with trace files using
our AGI automated sequence pipeline. At the same time, all BAC
end sequences will be displayed through the web at AGI. Under
these standard conditions, we routinely achieve a 90%
success rate with >650 bp of high-quality bases for BAC end sequencing.
We are considering using a liquid handling robot to set up and
clean up reactions to reduce human error and increase consistency
between plates. The DNA Sequencing Center at AGI routinely generates
over 3,000 reads/day (max. 3,456 reads) with two 3730xl DNA
sequencers. The availability of a third ABI 3730xl at AGI and a fourth
at Purdue will increase throughput to over 5,000 reads/day, and
we expect to complete 1.6M reactions within two years of project
initiation without conflict with currently funded projects.
Yeisoo Yu will run and coordinate the
day-to-day operations of this component at AGI. AGI will perform
DNA isolations and sequencing reactions for all samples. One
quarter of the reacted samples will be shipped to Purdue, where
they will be loaded onto automated DNA sequencers, base-called
and submitted to GenBank under the direction of Phillip SanMiguel.
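The success criterion for a BAC end read (more than 100 bp at Phred 20) can be illustrated with a minimal sketch. It assumes that the criterion counts all bases at or above the threshold rather than one contiguous run, and that quality values are already available as lists of integers. The function names are hypothetical and are not part of the AGI pipeline.

```python
Q_THRESHOLD = 20      # from the text: Phred 20
MIN_Q20_BASES = 100   # from the text: more than 100 bp


def count_q20(quals, threshold=Q_THRESHOLD):
    """Number of bases whose Phred quality is at or above the threshold."""
    return sum(1 for q in quals if q >= threshold)


def is_successful(quals, min_bases=MIN_Q20_BASES):
    """Assumed reading of the criterion: total Q20 bases must exceed min_bases."""
    return count_q20(quals) > min_bases


def success_rate(reads):
    """Fraction of reads that pass; reads maps read name -> quality list."""
    if not reads:
        return 0.0
    passed = sum(1 for quals in reads.values() if is_successful(quals))
    return passed / len(reads)


if __name__ == "__main__":
    demo = {
        "clone_a.f": [30] * 650,             # long, high-quality read
        "clone_a.r": [12] * 400,             # low quality throughout
        "clone_b.f": [25] * 90 + [8] * 300,  # too few Q20 bases
    }
    print(f"pass rate: {success_rate(demo):.2f}")
```

If the intended criterion is a contiguous run of Q20 bases, only count_q20 would need to change.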

Fingerprint
maps and end sequences produced by this project will be transferred
electronically to Cold Spring Harbor Laboratory, where
they will be aligned to the Oryza sativa japonica and indica
genomes, analyzed for SNPs and repeats, integrated into the
resources available through Gramene, and made available to the
research community.

Data Transfer between CSHL
and AGI/Purdue: We
will set up a secure incoming FTP site at CSHL that allows
AGI and Purdue to deposit fingerprint maps and BAC end sequences.
BAC end sequence data will include the quality scores produced
by Phred (Ewing and Green 1998; Ewing et al. 1998) in order
to facilitate SNP calling. We will transfer BAC contig data
using the .ace format output by the FPC program. Uploads will
be checksummed to ensure completeness and acknowledged by
the project data manager.
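The checksum step might look like the sketch below. The proposal does not name a checksum algorithm, so SHA-256 is an assumption here, and the function names are illustrative only. The sender would publish the digest alongside each upload, and the data manager would recompute it on receipt before acknowledging.

```python
import hashlib


def checksum_bytes(data, algorithm="sha256"):
    """Hex digest of an in-memory byte string."""
    return hashlib.new(algorithm, data).hexdigest()


def checksum_file(path, algorithm="sha256", chunk_size=1 << 20):
    """Hex digest of a file, read in chunks so large uploads fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def digests_match(computed_hex, expected_hex):
    """Compare two hex digests, ignoring case and surrounding whitespace."""
    return computed_hex.strip().lower() == expected_hex.strip().lower()


if __name__ == "__main__":
    payload = b">clone_a.f\nACGTACGT\n"  # stand-in for an uploaded file
    sent = checksum_bytes(payload)
    print("upload complete:", digests_match(checksum_bytes(payload), sent))
```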
Analysis results, such as mapping
positions and SNPs, will be returned to AGI via an outgoing
FTP directory. Both the incoming and outgoing FTP sites will
be password protected to avoid casual abuse.

Mapping of BAC End Sequences
to the Rice Genome: We will use the Biopipe/Blat mapping
pipeline to align BAC end sequences from the 11 wild rice
species to the reference japonica and indica genomes. Since
it is estimated that these genomes differ from each other
at one position every 200 bp, the polymorphism rate is well
below the single-pass sequencing error rate, and the presence
of SNPs will not present a significant obstacle to alignment.
Under our current mapping protocols,
80% of the BAC end sequences are expected to align uniquely
to the genome. This means that essentially all BAC contigs
will be anchorable to the reference genome except for those
locations where the genome sequence will contain gaps due
to centromeric repeats or unclonable regions.
An important internal quality control
metric will be contiguity. In the ideal case, we expect all
BAC end sequences in a contig to map to a contiguous region.
In the real case, we will have to deal with three confounding
factors:

1. Incorrect mapping of one
or more BAC ends. An error in mapping a BAC end sequence will
lead to outliers in the distribution of map positions in the
contig. This can be recognized easily because the mismapped
ends will be in the minority and their map positions will not
correlate with their positions within the contig.

2. An FPC contig misassembly.
Typically this will appear as a "chimeric" contig in which
the BAC ends in two or more segments of the contig map to
different areas in the genome. Within a segment, the map positions
will be correlated. Such cases will be reported back to AGI
for manual inspection.

3. A genomic rearrangement.
The contig is correct, but there has been a genomic rearrangement
between a wild species and the reference species since divergence
from their common ancestor. For large rearrangements that
are not spanned by the BAC contig, this case will often be
indistinguishable from (2). For small inversions and rearrangements
in which both endpoints of the rearrangement are captured
by the BAC contig, we will be able to distinguish and flag
this event. In any event, these cases will be reported back
to AGI for manual inspection. Cases that cannot be confirmed
to be the result of a BAC contig misassembly will be returned
to CSHL for entry into the database as a putative rearrangement
event.
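One way to screen contigs for the three cases above is sketched here; it is an illustration under stated assumptions, not the procedure we have implemented. BAC ends are walked in contig order and split into segments wherever the reference position changes chromosome or jumps by more than a threshold. Very small segments are treated as mismapped ends (case 1), and two or more remaining segments flag the contig for manual inspection (cases 2 and 3, which this sketch does not try to tell apart). The record layout, the 500 kb jump threshold and the minimum segment size of 3 ends are all hypothetical choices.

```python
from collections import namedtuple

# contig_pos: order of the BAC end within the FPC contig
# chrom, genome_pos: where the end aligned on the reference genome
EndHit = namedtuple("EndHit", "contig_pos chrom genome_pos")

MAX_JUMP_BP = 500_000  # assumption: largest plausible step between neighbours
MIN_SEGMENT_ENDS = 3   # assumption: smaller segments are treated as outliers


def split_segments(hits, max_jump=MAX_JUMP_BP):
    """Group hits, in contig order, into runs that map near one another."""
    segments = []
    for hit in sorted(hits, key=lambda h: h.contig_pos):
        if segments:
            last = segments[-1][-1]
            if (hit.chrom == last.chrom
                    and abs(hit.genome_pos - last.genome_pos) <= max_jump):
                segments[-1].append(hit)
                continue
        segments.append([hit])
    return segments


def classify_contig(hits, max_jump=MAX_JUMP_BP, min_ends=MIN_SEGMENT_ENDS):
    """Return (label, segments, outliers) for one contig."""
    first_pass = split_segments(hits, max_jump)
    outliers = [h for seg in first_pass if len(seg) < min_ends for h in seg]
    kept = [h for seg in first_pass if len(seg) >= min_ends for h in seg]
    # Re-segment without the outliers so one stray end cannot split a run.
    segments = split_segments(kept, max_jump)
    if not segments:
        label = "unresolved"
    elif len(segments) == 1:
        label = "contiguous"
    else:
        label = "chimeric_or_rearranged"
    return label, segments, outliers
```

A contig labelled chimeric_or_rearranged would be reported for manual inspection as described above; telling a misassembly from a true rearrangement still requires the fingerprint evidence.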

Identification of SNPs (Stein):
After aligning BAC ends to the genome, we will call SNPs using
the SSAHA-SNP program (Mullikin and Ning 2003). This software
uses neighborhood quality scores to identify high likelihood
SNPs in single genomic reads aligned to a reference genome.
SSAHA-SNP was used by The SNP Consortium (TSC) to call over
2 million SNPs in the human genome, and the Cold Spring Harbor
group has experience in using the software from its participation
in this project.
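The idea of a neighborhood quality filter can be shown with a short sketch. This is not the SSAHA-SNP implementation: the thresholds (Q20 at the variant base, Q15 across 5 flanking bases on each side) are illustrative assumptions, and the read is assumed to be aligned to the reference without gaps.

```python
SNP_MIN_Q = 20    # assumption: minimum quality at the candidate base
FLANK_MIN_Q = 15  # assumption: minimum quality of each flanking base
FLANK_BASES = 5   # assumption: flanking bases checked on each side


def passes_neighborhood_quality(quals, index, snp_min_q=SNP_MIN_Q,
                                flank_min_q=FLANK_MIN_Q, flank=FLANK_BASES):
    """True if the base at index and all of its flanks meet the thresholds."""
    start, end = index - flank, index + flank
    if start < 0 or end >= len(quals):
        return False  # too close to a read end to judge the neighborhood
    if quals[index] < snp_min_q:
        return False
    neighbours = list(quals[start:index]) + list(quals[index + 1:end + 1])
    return all(q >= flank_min_q for q in neighbours)


def candidate_snps(read, reference, quals):
    """Positions where read and reference differ and the quality filter passes.

    read and reference are equal-length, ungapped aligned strings.
    """
    if not (len(read) == len(reference) == len(quals)):
        raise ValueError("read, reference and quals must have equal length")
    return [i for i, (r, g) in enumerate(zip(read, reference))
            if r != g and passes_neighborhood_quality(quals, i)]
```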
For the purposes of OMAP, we will restrict
SNP calling to regions of the finished genome. This is due to
anticipated difficulties in obtaining Phrap quality scores for
phase II rice sequence data. At the beginning of the project,
we know that finished sequence will be available from at least
rice chromosomes 1, 3, 4 and 10. SNP calling on other chromosomes
will be performed as their finished sequences are released by
the IRGSP.
In addition to calling SNPs between
the wild strains and the reference genomes, we will call SNPs
that appear between the wild strains. Putative SNPs will be
correlated with genomic annotations and characterized according
to their presence within genes and exons, whether they introduce
a non-synonymous amino acid change, or affect a splice site.
This characterization will use software previously developed
for use with the TSC project.
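
The coding-effect part of this characterization can be sketched as below. It is a minimal example, not the TSC software: the function name and the assumption that the coding sequence is supplied in frame on the coding strand are introduced here, and splice-site effects are not handled.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_coding_snp(cds: str, offset: int, alt: str) -> str:
    """Classify a SNP at 0-based offset in an in-frame coding sequence.

    Returns "synonymous", "non-synonymous" or "nonsense".
    """
    frame = offset % 3
    start = offset - frame
    ref_codon = cds[start:start + 3].upper()
    alt_codon = ref_codon[:frame] + alt.upper() + ref_codon[frame + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"                 # new stop codon
    return "non-synonymous"
```

For example, in the coding sequence `ATGGCTTGG` a T to C change at offset 5 leaves alanine unchanged, whereas a G to A change at offset 8 turns the tryptophan codon into a stop.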

Evolution of repetitive DNA sequences in the wild Oryza genomes (SanMiguel and Jackson):
Repetitive DNA sequences constitute a majority of most plant
genomes. In rice, repetitive elements comprise nearly 40%
of the sequenced genome. Previous studies have shown that
there is some variation in copy numbers of certain repeats
in the Oryza genomes and that this may have contributed to
the variation in genome sizes (Uozu et al. 1997). However,
in comparison to other cereals (e.g. wheat or maize) there
is little variation in genome size among Oryza species, except
that due to polyploidy. Although there is little variation
in genome size, the repeats that constitute a major portion
of each Oryza genome do differ from each other (reviewed in
Uozu et al. 1997). Furthermore, we know that transposable
elements insert and mutate beyond recognition in less than
10 million years (SanMiguel et al. 2002) and that there is
approximately 15 million years of evolution represented in
the genus Oryza (E. Kellogg, personal communication). The
BAC end sequence data representing nearly 10% of each genome
presents an opportunity to sample the repetitive fraction
of each genome.
Studies of the transposable element
fraction of orthologous segments of grain genomes (for example,
in maize and sorghum (Tikhonov et al. 1999) and in wheat and
barley (SanMiguel et al. 2002)) found no conserved transposable
elements that could convincingly be shown to have inserted
before the divergence of their respective species. One class
of transposable elements, retrotransposons, can be dated as
to when they inserted by determining the sequence divergence
of their 5' and 3' long terminal repeats (LTRs), which are
identical at the time of insertion (SanMiguel
et al. 1998; SanMiguel et al. 2002). The maize and wheat sequences
compared were composed largely of this type of element.
Most of the retrotransposons that could be dated had inserted
in the last 3 million years. None that could be dated had
inserted more than 6 million years ago.
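
The LTR dating calculation can be sketched as follows. Because both LTRs accumulate substitutions independently after insertion, the insertion time is T = K / (2r), where K is the divergence between the two LTRs and r is the substitution rate per site per year. This is a minimal sketch under stated assumptions: the inputs are taken to be already aligned, the Jukes-Cantor correction is one possible choice of distance, and the default rate of 6.5e-9 is a placeholder assumption, not a value given in this proposal.

```python
import math


def ltr_divergence(ltr5: str, ltr3: str) -> float:
    """Jukes-Cantor corrected divergence between two pre-aligned LTRs.

    Columns containing gaps or ambiguous bases are ignored.
    """
    pairs = [(a, b) for a, b in zip(ltr5.upper(), ltr3.upper())
             if a in "ACGT" and b in "ACGT"]
    if not pairs:
        raise ValueError("no comparable aligned positions")
    p = sum(a != b for a, b in pairs) / len(pairs)
    if p == 0:
        return 0.0
    if p >= 0.75:
        raise ValueError("sequences too divergent for Jukes-Cantor")
    return -0.75 * math.log(1 - 4 * p / 3)


def insertion_time_years(divergence: float, rate: float = 6.5e-9) -> float:
    """T = K / (2r); rate is an assumed substitution rate per site per year."""
    return divergence / (2 * rate)
```

Under these assumptions, one difference per 100 aligned LTR bases corresponds to an insertion somewhat less than one million years ago. If retrotransposons mutate faster than the assumed rate, the true insertion times would be proportionally younger.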
Evidence of insertions that occurred
more than 6 million years ago appears to be lost through
deletion via an illegitimate recombination mechanism (Devos
et al. 2002). Hence, studying orthologous sequences in species
that diverged more than 10 million years ago does not allow
one to study orthologous transposable elements. Given the
more limited temporal scope of the species studied in this
proposal, orthologous transposable elements should be found.
A BAC-end sequence that is composed
entirely of a repetitive sequence will confound the anchoring
of that BAC-end on the Nipponbare genomic sequence on the
basis of sequence similarity. However, it might be anchored
as a member of a fingerprinted BAC contig. Further, a sequence
that is repeated many times in the entire genome may be present
only once in a segment of the Nipponbare genome to which a
contig is anchored. In this limited segment of the genome,
the sequence read would no longer be repetitive and in many
places could be placed unambiguously. The majority of such
repeats would likely be transposable elements.
The insertion times of numerous retrotransposons
have been estimated. But these times are based on the assumption
that retrotransposon sequence will mutate at a rate approximately
equal to that of other sequences not under selection, that is,
synonymous sites in codons, or introns. The mutation rate of retrotransposons
(and other transposable elements) may well be much higher.
In cases where BAC end sequences from several Oryza species
can be shown both to overlap a retrotransposon and to be positioned
orthologously, mutation rates for this type of DNA can be
surmised. As the BACs will be constructed using a few restriction
enzymes, the chance of any given read of the 0.1x coverage
of each species lining up with other non-Nipponbare BAC-end
reads is substantially increased.
A repeat database will be constructed
for each Oryza genome structured to map each repeat class
onto a complete transposable element, where possible. Further,
it will be noted for each repeat discovered whether it has
been reliably mapped to a certain position in the Nipponbare
genomic sequence. The repeats present in each genome will
be broken down into classifications by presumed transposition
mechanism (RNA or DNA) and further sub-classified where appropriate.
Each repeat database will be compared against the others in pairwise
comparisons focusing especially on sequences present at orthologous
locations. In addition, conserved classes of repeats will
be used to infer genome phylogeny.
To further understand how these repetitive
elements evolve and mold genomes, select repetitive elements
will be mapped to chromosomes and extended DNA fibers to get
a picture of their long-range (chromosomal) organization.
Together with the sequence data this will present a picture
of the copy numbers of repetitive DNA sequences, their sequence
arrangement and conservation as well as the chromosomal distribution
and evolution of these repeats. This approach will provide
a unique, almost complete, snapshot of how repetitive DNA
sequences evolve in a family structure and how they mold genomes/chromosomes.

Databasing of Results (Stein):
Data from FPC contig mapping and SNP calling activities will
be entered into the Gramene project database. Our schema does
not currently handle deep sets of variations among a collection
of genomes.
To accommodate this, we will modify
our schema so that it follows a data model originally developed
by the Neomorphic Corporation for use in the Celera annotation
of Drosophila, and since adapted and used for the Chado modular
sequence database schema [http://www.gmod.org/chado]. In this
data model, a region of the genome is represented as a set
of triplets S=[[r1,s1,e1],[r2,s2,e2]...]
where r is the ID of the reference coordinate system (e.g.
the contig), and s and e are the start and end of the sequence
using interbase coordinates. An additional pointer provides
access to the variant nucleotide sequence itself. This data
model provides sufficient flexibility to represent a deep
set of variations including single nucleotide polymorphisms,
indels, inversions, and rearrangements.
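
The triplet representation can be sketched as follows. This is an illustrative data structure only, not the Gramene or Chado schema: the class and field names are assumptions. In interbase coordinates the positions number the gaps between bases, so a single base occupies [s, s+1] and an insertion point has s equal to e.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Location:
    """One [r, s, e] triplet in interbase coordinates."""
    ref_id: str   # r: ID of the reference coordinate system, e.g. a contig
    start: int    # s: interbase start (0 = before the first base)
    end: int      # e: interbase end

    def __post_init__(self) -> None:
        if self.start < 0 or self.end < self.start:
            raise ValueError("require 0 <= start <= end")

    @property
    def length(self) -> int:
        return self.end - self.start


@dataclass
class Variation:
    """A region S = [[r1, s1, e1], [r2, s2, e2], ...] plus its variant sequence."""
    kind: str                  # e.g. "SNP", "insertion", "inversion"
    locations: List[Location]
    variant_seq: str           # the variant nucleotide sequence


def reference_bases(seq: str, loc: Location) -> str:
    """Interbase coordinates map directly onto Python slicing."""
    return seq[loc.start:loc.end]


# A SNP at the fifth base of a hypothetical contig, and an insertion after it.
snp = Variation("SNP", [Location("contig1", 4, 5)], "G")
insertion = Variation("insertion", [Location("contig1", 5, 5)], "TTA")
```

A zero-length location is what allows insertions to be stored without special cases, and a list of several triplets can describe the two ends of an inversion or rearrangement.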

Analysis of Conserved Regions (Stein):
The alignment of multiple wild rice species to
the reference genome will present us with an opportunity to
proofread and improve the rice genome annotations. For example,
an Oryza sativa pseudogene that was misannotated as a protein-coding
gene in the reference sequence can be caught if it
is missing entirely in the wild rice species or has accumulated
one or more frameshift or nonsense mutations. In other cases,
an unannotated region of the genome where the polymorphism
rate drops off significantly may indicate a missed functionally
important region, such as a protein-coding gene or RNA. We
will systematically examine the genome for such regions.
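
One simple way to look for regions where the polymorphism rate drops is a sliding-window scan, sketched below. The window size, step and cutoff fraction are assumptions chosen for illustration, and the function name is hypothetical. A real scan would also need to restrict attention to unannotated sequence and to positions actually covered by aligned wild-species reads, since a window with no coverage also shows no SNPs.

```python
from bisect import bisect_left
from typing import List, Sequence, Tuple


def low_polymorphism_windows(snp_positions: Sequence[int],
                             region_length: int,
                             window: int = 1000,
                             step: int = 500,
                             fraction: float = 0.25) -> List[Tuple[int, int, int]]:
    """Return (start, end, snp_count) for windows with unusually few SNPs.

    A window is reported when its SNP count is below `fraction` times the
    count expected from the region-wide average density.
    """
    if region_length < window:
        return []
    positions = sorted(snp_positions)
    expected = len(positions) * window / region_length
    hits = []
    for start in range(0, region_length - window + 1, step):
        end = start + window
        # Count SNPs with start <= position < end.
        count = bisect_left(positions, end) - bisect_left(positions, start)
        if count < fraction * expected:
            hits.append((start, end, count))
    return hits
```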
Although the evolutionary closeness
of the wild and domestic strains diminishes the information
available to such tools, we will experiment with using recently-published
software that uses sequence conservation to improve predictions
of protein-coding and RNA genes. We will scan for protein-coding
genes using the TwinScan software (Korf et al. 2001) package,
which was recently used to improve gene predictions in regions
of human/mouse orthology (Flicek et al. 2003). This software
performs ab initio gene prediction on two genomes simultaneously,
taking into account such factors as the increased likelihood
of seeing polymorphisms at the wobble base of coding sequence.
We will also scan the genome for putative functional RNA genes
using QRNA (Rivas and Eddy 2001). This software identifies
putative RNA genes by searching for coordinated mutations
across the axes of symmetry introduced by RNA stem-loop structures.

Data Presentation (Stein):
The results of these analyses will be presented on the Gramene
web site. The physical maps of the wild rice strains will
be displayed side-by-side with each other and with the reference
maps using the comparative map viewer CMap. This will provide
an intuitive visualization of rearrangements among the various
wild genomes of rice. The contig maps displayed on the Gramene
viewer will be linked to the WebFPC viewer at AGI for the
convenience of researchers wishing to examine the underlying
fingerprint data.
BAC end alignments that support the
alignment of FPC contigs to the reference genomes will be
displayed using the Gramene sequence viewer. Researchers will
be able to "drill down" into the data to the individual nucleotide
level, where they will be shown a multiple alignment of the
reference genome and one or more aligning wild strains.
We will provide specialized query
tools in addition to the standard search tools already on
the Gramene web site. One query tool will allow researchers
to search for and download sets of SNPs selected by map position
and type, such as coding region SNPs. Another tool will allow
researchers to retrieve data on all wild strain contigs that
overlap a particular region of the reference genome.
For the convenience of researchers
wishing to design assays for selected SNPs, we will provide
a primer-picking interface to the PRIMER3 (Rozen and Skaletsky
2000) program.
All data analysis results will be
downloadable from the Gramene web site as flat files and in
XML format. The data will also be available for remote query
and browsing using the Distributed Annotation System (Dowell
et al. 2001), an XML-based protocol for sharing genomic annotations.
Further, all fingerprint data will be downloadable from the
AGI web sites to construct and manipulate FPC maps locally.

In silico anchoring: Approximately 10% of each of the 11 genomes
will be represented at the sequence level as a result of the
BAC end sequencing included in this project. This will be a
rich resource with which to anchor these genomes to the sequenced
rice genome. CoPI Stein will map BAC end sequences to the rice
genome (see informatics section). If the wild rice genomes have
a repetitive DNA content similar to rice then approximately
50% of the BAC end sequences will be repetitive (this can vary
depending on restriction sites used to construct the BAC libraries).
Given that 10% of the genome (or a chromosome) will be represented
in ~500 bp sequence fragments, and that 1/2 of these sequences
will not unambiguously anchor to the rice genome due to repetitive
elements, there will then be a predicted sequence link about
every 10 kb.
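
The arithmetic behind this estimate can be checked with a short calculation. The function name is hypothetical; the default values are the figures stated above (10% sequence representation, ~500 bp reads, half of the reads anchoring unambiguously).

```python
def mean_link_spacing_bp(read_length_bp: float = 500,
                         coverage: float = 0.10,
                         anchorable_fraction: float = 0.5) -> float:
    """Average distance between usable sequence links along the genome.

    coverage / read_length_bp is the number of BAC-end reads per base pair;
    only anchorable_fraction of those reads yield an unambiguous link.
    """
    reads_per_bp = coverage / read_length_bp
    links_per_bp = reads_per_bp * anchorable_fraction
    return 1 / links_per_bp
```

With the defaults there is one read per 5 kb on average, and one anchored read, or sequence link, per 10 kb. A lower repeat content or a different choice of restriction sites would change `anchorable_fraction` and shift the spacing accordingly.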
There are, as noted previously (informatics
section), several issues that will confound proper alignment
of orthologous BAC contigs to the rice genome. From a computational
perspective, BAC end sequences may be incorrectly mapped to
the rice genome or the contigs may be misassembled. From a genome
evolution perspective there are several issues such as local
and/or long-range chromosomal rearrangements. All of these can
lead to chimeric contigs or BAC end sequences that do not collapse
properly to the rice genome. Gene/chromosome duplications will
also introduce error into the alignment of the genomes due to
BAC ends derived from different genomic regions that collapse
onto the same location in the rice genome. BAC end sequences
that contain sequences repetitive in their respective genomes but
not in rice will also cause errors in map alignments. Conversely,
elements repetitive in rice but not in related species will
introduce errors in aligning contigs to the rice genome. Another
problem is that there will likely be genomic regions, both in
rice and the related genomes, that are under-represented in
BAC end sequences and are therefore not easily aligned.
The Jackson lab will work with the
integrated FPC and sequence data at Gramene but will also assemble
- in house - the BAC end sequences and contigs onto the rice
sequence map using the Gramene CMap viewer and the GMOD project's
GBrowse: Generic Genome Browser (www.gmod.org/ggb).
The alignments will be scanned both computationally and visually
for errors in assembly resulting from the previously described
problems. Our goal is to reconstruct, in as contiguous a manner
as possible, rice chromosomes 1, 3 and 10 in the wild rice genomes.
In chromosomal regions where the BAC contigs from the wild rice
genomes cannot be collapsed completely on the rice framework,
we will use overgo technology to build a scaffold on which to
either collapse the contigs or to better understand what evolutionary
events have perturbed that region to the point where it is impossible
to align these chromosomes. We will also use fluorescence in
situ hybridization to probe rearrangements at the chromosome
level.

Rod Wing:
Wing will serve as overall project director for OMAP. Members
of his laboratory will serve as Project Manager (Dave Kudrna),
direct the fingerprinting and assembly (Ed Butler), and direct
BAC-end sequencing (Yeisoo Yu). In years 3-4, the Wing lab
will hire a Postdoc and Graduate Student to work on the global
alignment and chromosome reconstruction experiments in collaboration
with Lincoln Stein and Scott Jackson. The AGI will also serve
as a mentor for 2 high school teachers during each summer
and several UBRP students throughout the year.

Lincoln Stein: Stein
will oversee all aspects of integrating OMAP data into Gramene
and coordinating with the other project PIs and Senior Personnel
to update the global alignment and chromosome reconstruction
experiments during the course of the project. The Stein lab
will also conduct the SNP data mining analysis and present
the results through Gramene and in publications. The Stein
lab will also serve as a mentor for 2 high school teachers
and 1 UBRP intern during each summer.

Scott Jackson: Jackson
will oversee all aspects, bioinformatics and experimental,
of the chromosome reconstruction experiments in collaboration
with the Wing and Stein labs. The Jackson and SanMiguel labs
will also serve as mentors for 2 high school teachers, several
MARC/AIM students and 1 UBRP intern during each summer.

Phillip SanMiguel:
SanMiguel will oversee all aspects of BAC-end sequencing at
the Purdue sequencing center. SanMiguel will also lead the
repeat analysis efforts in collaboration with the Jackson
lab.

An active and experienced Advisory Committee (AC; 3-4 members) will
be essential for the success of OMAP. We have already received
emails from Drs. Ronald Phillips (University of Minnesota),
Susan Wessler (University of Georgia) and Jonathan Wendel (Iowa
State University) agreeing to serve on the committee if the
proposal is awarded.
We envision that the committee will meet once a year for a 1-day
AC meeting to review our progress and give advice, traveling to
AGI, CSHL, Purdue and AGI, respectively, in years 1 through 4
of the project. The committee will be expected to
write a formal report that we will use as a guide to help us
during the course of the year. This report and our response
will become part of our NSF annual progress report due in June
of each year.

Teacher interns will gain practical experience in physical mapping,
DNA sequencing, bioinformatics and comparative genomics to better
understand the importance of rice and of using wild genomes to
improve rice. We will host 6 Teacher Interns per summer, 2 at
each site (AGI, Purdue, CSHL). Weekly conference calls will
be organized so that the OMAP interns can discuss what they
are doing, share their experiences and coordinate activities
if needed. Development of a (web-based) lesson plan incorporating
the principles learned during the internship will be required
of each intern.
Underrepresented undergraduates at the University of Arizona will
participate in plant genomics research through the UBRP and
MARC/AIM programs. Students will work with faculty, postdoctoral
and graduate student mentors on specific projects in physical
mapping, DNA sequencing, bioinformatics and comparative genomics.
UBRP students will work throughout the year at AGI, and some (2-4/year)
will be given the opportunity to perform short (2 week) summer
research projects in our collaborators' labs at Purdue and CSHL.
The UBRP and MARC/AIM students will gain valuable experience
in learning how research is done, which will help prepare them
for careers at all levels of science.
At Purdue University, CoPIs Jackson and SanMiguel participate
in the Purdue MARC/AIM Summer research program which brings
undergraduate minority students to Purdue University for 8 weeks
during the summer session to participate in lab research. The
students are expected to be actively involved in research and
must write a research summary at the end of the session. This
program is used to expose minority students to research and,
hopefully, to spark an interest in biological research as well
as to recruit minority students to pursue a graduate career
at Purdue University. This program has been extremely successful.
In its 20-year history more than 700 students have matriculated,
of whom at least 375 have pursued or are currently pursuing
postgraduate education. Nearly 100 of these students have been
awarded a Ph.D., M.D., D.D.S. or D.V.M. at various universities.

The University of Arizona (UA) has a strong track record in
education and outreach. One of the most recognized programs
is "The Teacher Internships in Plant Genomics Program"
(TIPG), which is designed to provide pre-service and in-service
biology teachers with university-based lab experience
in plant genomics at UA.
Teacher Interns are placed in UA plant genomics labs for
eight-week sessions of summer research, and are generally
invited to return for a second summer. The interns are
paired with an experienced faculty member, post-doctoral
fellow, or graduate student who serves as the intern's
mentor. This program is designed to provide Teacher Interns
with opportunities to understand the nature of science,
gain first-hand experience in scientific inquiry, and
to better understand and share their ideas about genetics,
genomics, and plant biology.
In addition, this program provides scientist mentors with
an opportunity to learn about communicating and sharing
their work with science teachers, pre-college students,
and the general public. Scientist mentors learn how to
present content, concepts, and methodologies to a non-scientist
audience. The TIPG program was organized by Plant Science
Professor Rich Jorgensen and has been in place for the
last 2 years, funded through an NSF RUE grant to Jorgensen.
The program presently has 12 faculty mentors and has trained
11 teachers. Dr. Nadja Wehmeyer, with over 6 years of
experience in education and outreach in biology, is the
coordinator of this successful program.
The structure of each Teacher Intern's research experience
is tailored by the host PI and research mentor, but may
include: conducting a small, independent research project
with a finite endpoint or collecting and analyzing data
for a larger project that will continue beyond the termination
of the internship.
During the last week of the internship, the Teacher Interns
are expected to produce a poster summarizing the research
conducted during this program and present it at a poster
conference.
During weekly meetings, Teacher Interns have the opportunity
to develop and share teaching materials related to genetics,
genomics, and plant biology as well as discuss their summer
research. Equipment needed to teach these newly developed
activities will be provided on a loan basis for the teachers
throughout the school year, and up to $1000 is made available
to each teacher for materials and supplies. Providing
classroom support allows the Plant Genomics outreach to
target middle and high school students, reaching potentially
thousands of students, a large number of whom are underrepresented
minorities (especially Hispanic and Native American).
The Plant Genomics Internship aids teachers in enriching
their own understanding of plant biology and the nature
of science through research partnerships with university
scientists. Pre-service and in-service teachers will have
increased accessibility to university-based resources,
which they can bring back to the classroom. By partnering
with pre-service and in-service teachers to shape the
future, we are striving to provide the best possible context
for pre-college students to understand biology and the
nature of science.

At the University of Arizona, 19% of the undergraduates are
underrepresented minorities, among whom Hispanics predominate.
Moreover, a significant number of Native American undergraduates
are enrolled on our campus.
Underrepresented students are recruited for research training
through the Undergraduate Biology Research Program (UBRP),
which has links to statewide Community Colleges and targeted
funds for minorities (www.blc.arizona.edu/ubrp).
Of the 1,219 students accepted to UBRP since 1988, 57%
are women and 37% are minority students (of these, 19%
are students from ethnic groups underrepresented in the
sciences). Additional outreach programs to minority students
include the Minority Access to Research Careers Program
(MARC), which recruits students from UBRP, and the McNair
Program for disadvantaged and underrepresented minority
students.
In addition, the campus has a very active AISES group
(American Indian Science and Engineering Society)
which was ranked 2nd in the nation in 2000. A large contingent
of UA undergraduates attends the annual Society for Advancement
of Chicanos and Native Americans in Science (SACNAS) meeting, and
many UA students have presented and won awards at these
meetings. Moreover, research-experienced undergraduates
can participate in the BRAVO! (Biomedical Research Abroad:
Vistas Open!) Program, to work with their UBRP faculty
sponsor's foreign collaborator(s). Since 1992, 118 students
have had a BRAVO! experience lasting anywhere from 3 months
to a year.

The Undergraduate Biology Research Program (UBRP), University
of Arizona (http://www.blc.arizona.edu/ubrp/history.HTML)
is an educational program designed to teach students science
by involving them in biologically related research.
Students are paid for their time in the lab where they
develop an understanding of scientific method and receive
a realistic view of biological research. They also acquire
the tools necessary to be successful in post-graduate
studies in biology should they choose careers related
to biology or biomedical research.
UBRP demonstrates how the resources of a major research
university can be brought to bear on undergraduate education.

PROGRAM DESCRIPTION:
- unique research, mentoring, financial and academic opportunity
  for underrepresented minority students who have interest and
  potential to pursue careers in biomedical research
- training and financial support for the last two years of the
  student's enrollment in the University
- financial benefits including: tuition and fees support; health
  insurance; monthly stipend
- funding to attend national scientific meetings and to seek a
  summer research experience outside the UA
- outstanding faculty from Colleges of Science, Agriculture and
  Medicine with active and well-funded research programs to provide
  research guidance and intensive mentoring to participants
- overall mentoring provided by Professor Marc Tischler, Biochemistry,
  and by Professor William Velez, Mathematics
- assistance with preparation for the Graduate Record Exam, and
  applying to graduate schools and for graduate fellowships by
  Associate Dean Maria Teresa Velez, Graduate College
- a seminar series for trainees to meet outstanding minority
  scientists from other institutions and interact amongst themselves
  and with other UA faculty mentors

The Purdue MARC/AIM Summer Research Program, Purdue University
(http://www.biochem.purdue.edu/marc_aim/maim.htm)
offers to talented African-American, Pacific Islander,
Hispanic, and Native-American college students who are
U.S. citizens:

- 8 weeks of research under the direction of a Purdue faculty mentor
- GRE Preparation Workshop
- Graduate school information: how to apply and how to obtain funding
- An opportunity to gain viewpoints from faculty and from graduate students
- Oral and written presentations of research results
- Recreational activities and access to the university gym
- An opportunity to form friendships with other participants from
  all over the U.S.
- Double occupancy university housing
- A stipend of $3,600, from which you pay for meals and other
  living expenses
- Round-trip travel for external students