18.11 Annotating the genome

The process of ‘annotating the genome’ starts
once the genome sequence has been established and its assembly completed.
Annotation is the association of its component sequences with specific
functions, and, if the Saccharomyces
cerevisiae example is a guide, this process can continue for a long
time. Annotation requires sophisticated computation; that is, it is an
in silico analysis. Gene identification is probably the most difficult
problem and relies on sequence-alignment software together with ‘gene
finder’ programs.

Gene finding is easier with bacterial genomes, in which computer programs
can find 97-99% of all genes automatically. In eukaryotes, both gene finding
and gene function assignment remain challenging tasks. The problem can be
likened to identifying the beginning and end of every word in a book when
the text has lost all punctuation and you have no clear idea of the language
and vocabulary used in the book.

Sense is made of genome sequences by annotation in silico to:

identify ORFs by their start and stop codons, allowing for the
minimum length of functional proteins (Fig. 15);

detect the presence of recognisable functional motifs in segments of
the deduced gene or protein;

compare against known protein or DNA sequences using homologous
genes from the same or other genomes (Fig. 16).

Fig. 15. Searching for ORFs in DNA sequences,
every one of which has 6 reading frames.

Fig. 16. Sequence annotation with homologous
genes from the same or other genomes.
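
The six-frame ORF search of Fig. 15 can be sketched in a few lines of Python. This is only a toy illustration, not a real gene finder: it assumes ATG as the only start codon, the three standard stop codons, no introns, and a hypothetical default minimum length of 100 codons. The function names are invented for this example.

```python
# Toy ORF scanner: a sketch only, not a real gene finder.
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=100):
    """Scan all six reading frames for ATG...stop stretches.

    Returns (strand, frame, start, end, codons) tuples, where start and end
    are 0-based offsets on the strand that was scanned, and codons counts
    the start codon but not the stop codon.
    """
    seq = seq.upper()
    orfs = []
    # Three frames on the given strand, three on its reverse complement.
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            orf_start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if orf_start is None:
                    if codon == START:
                        orf_start = i
                elif codon in STOPS:
                    codons = (i - orf_start) // 3
                    if codons >= min_codons:
                        orfs.append((strand, frame, orf_start, i + 3, codons))
                    orf_start = None
    return orfs
```

Lowering min_codons makes the scan more sensitive but reports more spurious ORFs, which is why a minimum protein length is part of the search criteria.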

Further annotation is done experimentally by:

classical gene cloning and functional analysis;

analysis of cDNA clones or EST sequences (an expressed
sequence tag or EST is a short component sequence of a
transcribed cDNA, so it is a portion of an expressed gene), and gene
expression data.

No single method of genome annotation is comprehensive; all have their
limitations, so they must be used in concert. Many of the genes identified
in sequencing projects will be ‘new’ in the sense that when the sequence is
identified the gene function is unknown. Establishing the cellular role of
such new ORFs requires a different set of bioinformatics tools that
integrate sequence information with the accumulated knowledge of metabolism
so that conjectures can be made about likely functions. Those predictions are then tested experimentally by using
heterologous expression, gene knockouts, and characterisation of purified
proteins. Parallel analysis of phylogenetically diverse genomes can also
help in understanding the physiology of the organism whose genome is being
sequenced.

When the sequence of the whole genome has been established and annotated,
the genome can be compared with others on the
databases. Prokaryotic genomes are generally much smaller than those of
eukaryotes. The Escherichia coli genome, for example, is composed
of 4.64 Mb (megabase pairs) of DNA; that of Streptomyces coelicolor
is 8 Mb, while the yeast genome, at 12.1 Mb, is about three times the size
of the E. coli genome, and the human genome is 3,300 Mb (see Table
5.2).

The physical organisation is also different, because in prokaryotes the
genome is contained in a single, circular, DNA molecule. Eukaryotic nuclear
genomes are divided into linear DNA molecules, each contained in a different
chromosome. In addition, all eukaryotes have mitochondria, and these possess
small, usually circular, mitochondrial genomes. Photosynthetic eukaryotes
(plants, algae, some protists) have a third small genome in their
chloroplasts.

The size range of the genome corresponds to some extent with the degree
of complexity of the organism, but the fit is not exact by any means because
this correlation depends on the structure and organisation of the genes. For
example, the Escherichia coli genome has 4,397 genes and the yeast
genome comprises about 5,800 genes, so you might feel confident about
believing that yeast has more genes because it is a eukaryote, and you can
understand why it doesn’t have many more, because it’s a simple eukaryote.
However, the genome of the streptomycete bacterium Streptomyces coelicolor
contains more than 7,000 genes. This organism is a prokaryote, but it has
nearly 30% more genes than the model eukaryote, yeast.

Admittedly, Streptomyces is a complex bacterium and highly advanced in an
evolutionary sense; but it is a bacterium. The arithmetic difference lies in
the fact that the average yeast gene is 2,200 base pairs long, while the
average Streptomyces coelicolor gene is only 1,200 base pairs long.
But we can’t explain why such a difference in gene size exists.
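
As a rough arithmetic check on the figures quoted above, the sketch below divides each genome size by its gene count. The result is the average stretch of genome per gene, which includes intergenic DNA and so is not the same quantity as mean gene length; treat it only as a rough comparison. The S. coelicolor count of 7,000 is a lower bound taken from ‘more than 7,000’, and the function name is invented for this example.

```python
# Figures quoted in the text: genome size in base pairs and gene count.
# The S. coelicolor count is given only as "more than 7,000", so 7,000 is
# used here as a lower bound (an assumption for illustration).
GENOMES = {
    "Escherichia coli": (4_640_000, 4_397),
    "Streptomyces coelicolor": (8_000_000, 7_000),
    "Saccharomyces cerevisiae": (12_100_000, 5_800),
}


def bp_per_gene(genome_bp, gene_count):
    """Average stretch of genome available to each gene, in base pairs."""
    return genome_bp / gene_count


for name, (size, genes) in GENOMES.items():
    print(f"{name}: {bp_per_gene(size, genes):,.0f} bp per gene")
```

On these numbers the two bacteria come out at a little over 1,000 bp per gene and yeast at a little over 2,000 bp per gene, which is the same roughly two-fold contrast described in the text.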

The yeast Saccharomyces cerevisiae is a well-established model
organism with a long history in physiology, biochemistry and molecular
biology (see
Section 5.2); its genome continues
to be a useful model for eukaryotes, comprising a grand total of 12.1 Mb
distributed over 16 chromosomes, which range in size between 250 kb and more
than 2.5 Mb. The yeast genome-sequencing project was started in 1989. The
sequence of chromosome III was the first to be published in 1992,
chromosomes II and XI followed in 1994, and the sequence of the entire
genome was released in April 1996. Quality control measures ensured a 99.97%
level of accuracy of the sequence.

Today, the place to learn about this genome is the Saccharomyces
Genome Database (SGD) website at
https://www.yeastgenome.org/ and the
Yeast Genome Snapshot at
https://www.yeastgenome.org/genomesnapshot. As of January 2017, there
were 6,572 open reading frames (ORFs) which possibly encode metabolically
active proteins, of which 5,138 were verified, 754 were uncharacterised, and
680 were considered dubious.

On average, a protein‑encoding gene is found every two kb in the yeast
genome. The ORFs vary from 100 to more than 4,000 codons, although
two-thirds are less than 500 codons, and they are evenly distributed on the
two strands of the DNA. In addition to these, the yeast genome contains 27
rRNA genes in a large tandem array on chromosome XII, 77 genes for small
nuclear RNAs, 277 tRNA genes (belonging to 42 codon families) scattered
across the chromosomes, and 51 copies of the yeast retrotransposons (Ty
elements).

There are also non‑chromosomal elements, most notably the yeast
mitochondrial genome (80 kb) and the 6 kb
2μ plasmid DNA, but there may be other plasmids, too. So,
21 years after the genome was sequenced, only about 80% of the ORFs had been
verified. That rate of progress makes it all the more remarkable that, on
April 17, 2018, SGD announced a single publication in the journal Nature
by a team of researchers jointly led by Joseph Schacherer
and Gianni Liti, which reported the whole-genome
sequences and phenotypes of no fewer than 1,011 different
Saccharomyces cerevisiae yeast strains (Peter et al.,
2018).

Isolates of Saccharomyces cerevisiae
were gathered from many diverse geographical locations and ecological
niches; from wine, beer and bread, but also from rotting bananas, sea water,
human blood, sewage, termite mounds, and more. The authors then surveyed the
evolutionary relationships among the strains to describe the worldwide
population distribution of this species and deduce its historical spread.
This unusually large-scale population genomic survey demonstrates that the
likely geographic origin of S. cerevisiae lies somewhere in East
Asia. Budding yeast began spreading around the globe about 15,000 years ago,
undergoing several independent domestication events during its worldwide
journey. For example, whereas genomic markers of domestication appeared
about 4,000 years ago in sake yeast, such markers appeared in wine yeast
only 1,500 years ago. While domesticated isolates exhibit high variation in
ploidy, aneuploidy and genome content, genome evolution in wild isolates was
mainly driven by the accumulation of single nucleotide polymorphisms, most
of which are present at very low frequencies.

The stated purpose of studying a model organism like yeast is the
expectation that its analysis will enable the identification of genes
relevant to disease in humans; and this expectation seems to be fulfilled.
Comparing the sequences of human genes available in the sequence databases
with yeast ORFs shows that over 30% of yeast genes have homologues among the
human sequences, most of these representing basic cell functions. Finding
this sort of homology can contribute to the understanding of human disease.

The first example of this seems to be Friedreich’s ataxia,
which is the most common type of inherited ataxia (loss of control of bodily
movements) in humans, the biochemistry of which was uncovered by
demonstrating homology to a yeast ORF of known function. Friedreich’s ataxia
is caused by enlargement of a GAA repeat in an intron that results in
decreased expression of the frataxin gene; frataxin is a highly conserved
iron-binding protein present in most organisms, and Friedreich’s ataxia
pathology is associated with disruption of iron-sulfur cluster biosynthesis,
mitochondrial iron overload, and oxidative stress. Frataxin is a human
mitochondrial protein that has homologues in yeast. In yeast, mutants
defective in the frataxin homologue accumulate iron in mitochondria and show
increased sensitivity to oxidative stress. Biosynthesis of Fe-S clusters in
yeast is a vital process involving the delivery of elemental iron and sulfur
to scaffold proteins; the architecture of the protein complex to which
frataxin contributes is essential to ensure concerted and protected transfer
of potentially toxic iron and sulfur atoms to the mitochondrion.
This comparison suggests that Friedreich’s ataxia is caused by mitochondrial
dysfunction and may point towards novel methods of treatment (Pastore &
Puccio, 2013; Ranatunga
et al., 2016).

In many ways, this kind of comparison alone can justify all the effort
devoted to sequencing the yeast genome. Functional genomics studies the
roles of genes and proteins to define gene/protein function. The outcome is
known as the Gene Ontology. Originally,
ontology was a branch of metaphysics; a philosophical inquiry
into the nature of being. For the computer scientist, ontology is the
rigorous collection and organisation of knowledge about a specific feature.

The aims of Gene Ontology (GO) are to:

develop and standardise the vocabulary about the attributes of genes
and gene products that is species-neutral, and equally applicable to
prokaryotes and eukaryotes, and uni- and multicellular organisms;

annotate genes and gene products within sequences, and coordinate
understanding and distribution of annotation data;

and provide bioinformatics tools to aid access to all these data.

To achieve all this, there are three organising principles of GO to
describe the function of any gene/protein sequence as follows:

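Biological process; effectively the answer
to the question why does the sequence exist? This can be cast in very
broad terms describing the biological goals accomplished by function of
the sequence, for example mitosis, meiosis, mating, purine metabolism,
etc.

Molecular function; what the gene product does at the biochemical
level, for example catalytic, transporter, or binding activity.

Cellular component; where in the cell the gene product acts, for
example the nucleus, the mitochondrion, or the plasma membrane.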
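
One way to picture a GO annotation is as a record that ties a gene to a term under one of the three aspects used by the GO project (biological process, molecular function, cellular component). The sketch below is a minimal illustration of that idea only. The class and function names are invented, the gene identifiers are hypothetical, and real GO data use controlled term identifiers and evidence codes that are omitted here.

```python
from dataclasses import dataclass
from enum import Enum


class Aspect(Enum):
    """The three GO aspects (namespaces)."""
    BIOLOGICAL_PROCESS = "biological_process"
    MOLECULAR_FUNCTION = "molecular_function"
    CELLULAR_COMPONENT = "cellular_component"


@dataclass(frozen=True)
class Annotation:
    gene: str        # gene or gene-product identifier
    aspect: Aspect   # which of the three aspects the term belongs to
    term: str        # human-readable term name


def terms_by_aspect(annotations, gene):
    """Group the term names recorded for one gene under each aspect."""
    grouped = {aspect: [] for aspect in Aspect}
    for a in annotations:
        if a.gene == gene:
            grouped[a.aspect].append(a.term)
    return grouped


# Hypothetical records for invented genes, for illustration only.
EXAMPLE = [
    Annotation("geneX", Aspect.BIOLOGICAL_PROCESS, "purine metabolism"),
    Annotation("geneX", Aspect.CELLULAR_COMPONENT, "mitochondrion"),
    Annotation("geneY", Aspect.BIOLOGICAL_PROCESS, "mitosis"),
]
```

Because the vocabulary is shared, the same lookup works whether the records come from a prokaryote or a eukaryote, which is the point of a species-neutral ontology.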

Annotation has been automated by annotation programs
(available online) that quickly identify ORFs for hypothetical genes in a
genome. Many sequences are conserved across large evolutionary distances, so
many functional assignments can be inferred using information already
available from other organisms; this sequence search and comparison process
can also be automated.
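
The idea of inferring function from a similar, already characterised sequence can be sketched as follows. This is a deliberately crude stand-in: production pipelines use alignment-based search tools, whereas this example scores similarity by the proportion of shared three-letter words (a Jaccard index), simply to keep the code short. The function names, the 0.5 threshold, and any sequences passed in are assumptions for illustration.

```python
def kmers(seq, k=3):
    """Set of overlapping k-letter words in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b, k=3):
    """Shared k-mers divided by all k-mers seen in either sequence."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def infer_function(query, reference, threshold=0.5, k=3):
    """Return (name, function, score) for the best reference match,
    or None if no reference sequence reaches the threshold.

    reference maps a name to a (sequence, known_function) pair.
    """
    best = None
    for name, (seq, function) in reference.items():
        score = jaccard(query, seq, k)
        if best is None or score > best[2]:
            best = (name, function, score)
    if best is not None and best[2] >= threshold:
        return best
    return None
```

A returned match is only a conjecture about function; as the text notes, such predictions still need experimental testing.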

Annotating the genes of filamentous fungi, even other Ascomycota and
close relatives of Saccharomyces cerevisiae, is more demanding
because their genomes are much larger and their gene structure more complex
than those of yeast. Genes of filamentous fungi often contain multiple
introns (Section 18.6), with some
within the open reading frame of the gene (very few yeast genes contain
introns; those that do have a single intron at the start of the coding
sequence, often interrupting the initiation codon). Also, the
intron-boundary sequences may not become evident until the transcriptome is
analysed, and alternative splicing events catalogued (Section
18.7).

The greater complexity of gene structure in filamentous fungi demands
independent data on gene expression to make confident functional
assignments. Methods have been described that use cDNA or EST sequence
alignments, and gene expression data to predict reliably the function of
Aspergillus nidulans genes. We recommend you read the discussion and
explanation of the approach by Sims et al. (2004).

Yandell & Ence (2012) have published ‘A beginner’s guide to
eukaryotic genome annotation’ and further information and
advice is freely available online at:

The most up-to-date information on the genes of any organism in which you
are interested can be obtained from the website devoted to that organism
(use your preferred web search engine to find it). For example, entering ‘coprinopsis
cinerea genome’ into the search engine finds the
Coprinopsis cinerea home page, which gives you general information
about the organism and its genome, on the JGI Genome Portal
[https://genome.jgi.doe.gov/Copci1/Copci1.home.html].
This page has a menu of hyperlinks across the top that give access to the
deepest detail about the genome of this species.

The main Internet sites for fungal genomic data are discussed in the next
Section (Section 18.12).

Bioinformatics is essentially the use of computers to process biological
information when computation is necessary to manage, process, and understand
very large amounts of data. Although there are many bioinformatics tools and
databases, using them effectively often requires specialised knowledge;
where this is lacking, the BioStar platform can help.
BioStar is an online forum where experts and those seeking
solutions to problems of computational biology exchange ideas. BioStar can
be accessed at
https://www.biostars.org/
(Parnell et al., 2011).

Bioinformatics is particularly important as an adjunct to genomics
research, because of the large amount of complex data this type of research
generates, so to a great extent the word, and the approaches it encompasses,
have become synonymous with the use of computers to store, search and
characterise the genetic code of genes (genomics), the transcription
products of those genes (transcriptomics), the proteins related to each gene
(proteomics) and their associated functions (metabolomics) (see
Section 18.10). But there are other large data sets in need of analysis
that rightly fall within range of the fundamental definition of the word
‘bioinformatics’.

These are large data sets arising from:

Survey data and censuses, particularly, but not only, those
involving automatic data capture, and ‘surveys of
surveys’ (metadata) (for example see
Section 13.17).

Data generated by mathematical models that seek to simulate a
biological system and its behaviour in time (for example see
Section 4.9).

The aim of functional genomics is to determine the biological function of
all the genes and their products, how they are regulated and how they
interact with other genes and gene products. Add interactions with the
environment and this is fully
integrated biology; what has come to be known as
systems biology (Klipp et al., 2009; Nagasaki
et al., 2009; Horgan & Kenny, 2011). Comprehensive studies of such
large collections of molecules as occur in the transcriptome, proteome, and
metabolome require what are described as high throughput methods of analysis
at each stage from the generation of mutants through to the determination of
which proteins are associated with which functions.
Each stage generates massive amounts of data that are qualitatively and
quantitatively different, which must be integrated to allow construction of
realistic models of the living system (Delneri et al., 2001).

Functional genomic analysis of the yeast
Saccharomyces cerevisiae
established the key concepts, approaches and techniques, although research
on filamentous fungi is expanding (Foster et al., 2006).
Considerable progress was made in analysis of yeast gene function using
mutants with deletions of ORFs. However, genetic redundancy in the genome,
resulting perhaps from gene duplication(s) during evolution, can be a
problem in this type of analysis. In retrospect, analysis of yeast shows
that much of the redundancy in the yeast genome is made up of identical, or
almost identical, gene products fulfilling distinct physiological roles due
to differential expression of the genes under different
physiological conditions, and/or targeting the similar proteins to different
cellular compartments.

Nevertheless, more extensive studies require more extensive collections
of mutants; those in which entire gene families are deleted and, ultimately,
a collection in which all genes are represented by appropriate mutants.
There is scope for large scale international collaboration in this sort of
exercise and 1999 saw the establishment of a collection of mutant yeast
strains, each bearing a defined deletion in one of 6,000+ potential protein
encoding genes in yeast (Winzeler et al., 1999). This is the
EUROSCARF collection (EUROpean Saccharomyces Cerevisiae ARchive for
Functional analysis; see http://www.euroscarf.de/). Using a
PCR-based gene disruption strategy, mutant strains
with a deletion of most of the ORFs in the genome were prepared in this
systematic deletion project. In addition, each deleted ORF was flanked by
two 20 base pair sequences unique for each deletion. These allow the
sequences to be detected easily; effectively they act as molecular barcodes
that allow large numbers of deletion strains, potentially the whole library,
to be analysed in parallel at the same time.
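
The barcode idea can be illustrated with a short tally. The sketch assumes that the tags have been read out as sequence reads beginning with the 20-base barcode and that matching is exact; both are simplifying assumptions made for this example, not a description of how the original project detected its tags. The function name is invented, and any barcode sequences or strain names supplied to it would be hypothetical.

```python
from collections import Counter

BARCODE_LENGTH = 20  # each tag in the text is 20 base pairs long


def count_strains(reads, barcode_to_strain):
    """Tally reads per deletion strain by exact match on the leading barcode.

    Reads whose first 20 bases match no known barcode are counted under
    'unassigned'.
    """
    counts = Counter()
    for read in reads:
        tag = read[:BARCODE_LENGTH].upper()
        counts[barcode_to_strain.get(tag, "unassigned")] += 1
    return counts
```

Comparing such tallies before and after growing a pooled library under some condition indicates which deletion strains have become rarer or more common, all in one experiment.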

Another approach used a transposon that created
gene fusions in a yeast clone library so that the protein products of the
mutated yeast genes could be identified and analysed by immunofluorescence
using antibodies to the peptide introduced by the transposon. In the
original work a yeast genomic DNA library was mutagenised in Escherichia
coli with a multipurpose minitransposon derived from the bacterial
transposable element known as Tn3. The minitransposon contained cloning
sites and a 274-base pair sequence encoding 93 amino acids, called a
HAT tag, which was inserted into the yeast target proteins.

The HAT tag allows immunodetection of the mutated yeast protein.
Transposon mutagenesis generated 10^6 independent transformants.
Subsequently, individual transformant colonies were selected and stored in
96-well plates. Plasmids were prepared from these strains and transformed
into a diploid yeast strain in which homologous recombination integrated
each fragment at its corresponding genomic locus, thereby replacing its
genomic copy. Then, 92,544 plasmid preparations and yeast transformations
were carried out, identifying a collection of over 11,000 strains, each
carrying a transposon inserted within a region of the genome expressed
during vegetative growth and/or sporulation. These insertions affected
nearly 2,000 annotated genes, distributed over all 16 yeast chromosomes and
representing about one-third of the yeast genome.
The study demonstrated the value of a particular strategy for mutant
generation and detection, but it also indicated the scale of what has been
called ‘new yeast genetics’.

Messenger RNA molecules are the subject of transcriptome analyses and can
be studied in a fully comprehensive manner using
hybridisation-array analysis, which is described as a massively parallel technique because it allows so many
sequences to be examined at one time. Remember, though, that mRNA molecules
transmit instructions for synthesising proteins; they do not function
otherwise in the workings of the cell, so transcriptome analyses are
considered to be an indirect approach to functional genomics. The
transcriptome comprises the complete set of mRNAs synthesised in the cell
under any given well-defined set of physiological conditions. Unlike the
genome, which has a fixed collection of sequences,
the transcriptome is context dependent, which
means that its content of sequences depends on the cell response to the
current set of physiological circumstances, and the make up of that set will
change when the physiological circumstances change.

Those physiological circumstances will be adapted in response to changes
in both the intracellular and extracellular environment of the cell; its
nutritional status, state of differentiation, age, etc. The mRNA of genes
that are newly expressed (up-regulated) will appear in the
sequence collection, and the mRNA of genes that are not expressed (down-regulated)
in the new circumstance will disappear from, or be
greatly reduced in, the sequence collection. Determination of the
nature and sequence content of the transcriptome in all these circumstances
is precisely what transcriptome analysis is intended to achieve, because the
pattern of mRNA content in the transcriptome reveals the pattern of gene
regulation.

Hybridisation arrays are now used widely to
study the transcriptome because of their ability to measure the expression
of many genes with great efficiency. Microarrays permit assessment of the
relative expression levels of hundreds, even thousands, of genes in a single
experiment. Hybridisation arrays are also called DNA micro- or macroarrays,
DNA chips, gene chips, and bio chips (Nowrousian, 2007, 2014a). The web
definition of DNA microarray is: a collection of microscopic DNA spots
attached to a solid surface forming an array; used to measure the expression
levels of many genes simultaneously
(https://en.wiktionary.org/wiki/DNA_microarray).

The array of single-stranded DNA molecules is typically distributed on
glass, a nylon membrane, or silicon wafer (any of which might be called ‘a
chip’), each being immobilised at a specific location on the chip in a
predetermined (and computer-recorded) grid formation. Microarrays and
macroarrays differ in the size of the sample spots of DNA; in
macroarrays the size of the spot is over 300 µm, in
microarrays it is less than 200 µm. Macroarrays are normally
spotted by high-speed robotics onto nylon membranes, microarrays are made on
glass (usually called custom arrays) or quartz surfaces (GeneChip®, from
Affymetrix Inc.; see
https://www.affymetrix.com/site/mainPage.affx) (Lipshutz et al.,
1999). The immobilisation onto the solid matrix is the most crucial aspect
of the technique as it must preserve the biological activity of the
molecules. The spotted material can be genomic DNA, cDNA, PCR products (any
of these sized between 500 and 5,000 base pairs) or oligonucleotides (20 to
80-mer oligos). The identities and locations of the single-stranded DNAs are
known, so when the chip is treated with a suspension of experimental cDNA
molecules prepared from a set of mRNAs, the cDNAs complementary to those on
the chip will bind to those specific spots. The complementary binding pattern can be detected and, since the DNAs
at each position on each grid are known, the complementary binding pattern
indicates the pattern of gene expression in the sample.
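
The logic of reading an array can be sketched as a lookup from grid position to probe sequence. In this simplified model a cDNA ‘binds’ a spot only if it contains the exact reverse complement of the probe; real hybridisation tolerates some mismatch and depends on temperature and salt conditions, none of which is modelled here. The function names and the example layout are invented for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def hybridise(chip, cdnas):
    """Return grid positions whose probe is bound by at least one cDNA.

    chip maps (row, column) -> single-stranded probe sequence.
    The value returned for each bound position is the number of cDNAs
    that contain the probe's reverse complement.
    """
    bound = {}
    for position, probe in chip.items():
        target = reverse_complement(probe)
        hits = sum(1 for c in cdnas if target in c.upper())
        if hits:
            bound[position] = hits
    return bound
```

Because the probe at every position is known in advance, the set of bound positions translates directly into a list of expressed genes.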

Macroarrays are hybridised using a radioactive
probe, normally labelled with ³³P, an isotope of phosphorus that decays by
β-emission, so that the decay, and therefore the position of the
complementary binding, can be imaged with a phosphorimager. In this
device, β-particle emissions excite the phosphor molecules on the
plate in a way that can be detected by scanning the plate with a laser;
the attached computer converts the energy it detects to an image in which
different colours represent different levels of radioactivity.

Microarrays are exposed to a set of targets either
separately (single dye experiment) or in a mixture (two dye experiment) to
determine the identity/abundance of complementary sequences. Laser
excitation of the spots yields an emission with a spectrum characteristic of
the dye(s), which is measured using a scanning confocal laser microscope.
Monochrome images from the scanner are imported into software in which the
images are pseudo-coloured and merged and combined with information about
the DNAs immobilised on the chip. The software outputs an image which shows
whether expression of each gene represented on the chip is unchanged,
increased (up-regulated) or decreased (down-regulated) relative to a
reference sample. In addition, data is accumulated
from multiple experiments and can be examined using any number of
data mining software tools.
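
The unchanged, up-regulated, or down-regulated call for each spot is commonly based on the ratio of sample to reference intensity, expressed on a log2 scale. The sketch below shows that calculation under stated assumptions: the intensities are already background-corrected and normalised, and the two-fold cut-off (1.0 in log2 units) is an illustrative choice rather than a value taken from the text. The function name is invented.

```python
import math


def classify_expression(sample, reference, threshold=1.0):
    """Label a spot from two intensity readings.

    threshold is in log2 units, so 1.0 means a two-fold change; this
    cut-off is an illustrative assumption, not a value from the text.
    """
    if sample <= 0 or reference <= 0:
        raise ValueError("intensities must be positive")
    ratio = math.log2(sample / reference)
    if ratio >= threshold:
        return "up-regulated"
    if ratio <= -threshold:
        return "down-regulated"
    return "unchanged"
```

Using a log scale treats a doubling and a halving symmetrically (+1 and -1), which is why it is preferred to the raw ratio.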

There are many uses for DNA microarrays. Apart from the
expression profiling to examine the effect of physiological
circumstance on gene expression on which we have so far concentrated,
hybridisation arrays can be used to:

The proteome is the complete set of proteins synthesised
in the cell under a given set of conditions. The traditional method for
quantitative proteome analysis combines protein separation by
high-resolution 2-dimensional isoelectric focusing
(IEF)/SDS-PAGE (2DE) with mass spectrometric (MS)
or tandem mass spectrometric (MS/MS) identification of selected protein
spots detected in the 2DE gels by use of specific protein stains. Continued
improvement in technology is steadily increasing the throughput of protein
identifications from complex mixtures and permitting quantification of
protein expression levels and how they change in different circumstances
(Aebersold, 2003; Bhadauria et al., 2007; Rokas, 2009).

An important feature that arises from analysis of the proteome is the
enormous extent and complexity of the network of interactions among proteins
and between proteins and other components of the cells. These networks can
be visualised as maps of cellular function, depicting potential interactive
complexes and signalling pathways.
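
A protein interaction map of this kind is naturally represented as a graph, with proteins as nodes and observed interactions as edges. The sketch below builds such a graph from a list of pairs and groups together proteins that are linked directly or indirectly, as a crude proxy for candidate complexes or pathways. The function names are invented, and any protein names passed in would be hypothetical.

```python
from collections import defaultdict


def build_network(interactions):
    """Undirected adjacency map from (protein, protein) pairs."""
    network = defaultdict(set)
    for a, b in interactions:
        network[a].add(b)
        network[b].add(a)
    return network


def connected_groups(network):
    """Sets of proteins linked directly or indirectly by interactions."""
    seen, groups = set(), []
    for start in network:
        if start in seen:
            continue
        # Depth-first walk from an unvisited protein.
        group, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(network[node] - group)
        seen |= group
        groups.append(group)
    return groups
```

Real interaction data are noisy, so a connected group in such a map is a starting hypothesis about shared function rather than evidence of a physical complex.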

‘Metabolomics consists of strategies to
quantitatively identify cellular metabolites and to understand how
trafficking of these biochemical messengers through the metabolic network
influences phenotype’ (quoted from Jewett et al., 2006).

Metabolomics is particularly important in fungi because
these organisms are widely used to produce chemicals. The main difficulty in
metabolome analysis is not technical, as there are sufficient analytical
tools and mathematical strategies available for extensive metabolite
analyses. However, the indirect relationship between the metabolome and the
genome raises conceptual difficulties. The biosynthesis or degradation of a
single metabolite may involve many genes, and the metabolite itself may
impact on many more. Consequently, the bioinformatics tools and software
required must be exceptionally powerful.

Ultimately, you may think in terms of applying all this knowledge to the
creation of something entirely new. That is, to developing a biological
system of some form that does not already exist in the biosphere. In the
past this was achieved by the evolutionary process of artificial selection
(selective breeding), producing crop
species (like maize) or domesticated animals (like high milk-yield cattle)
that simply could not exist in the wild.

The ‘modern’ version of this is called synthetic biology,
and with the current passion for applying management definitions to long
standing activities it has been defined as the area of science that applies
engineering principles to biological systems to design and build novel
biological functions and systems.

Kaznessis (2007) adds the crucial rider that synthetic biological
engineering is emerging from molecular biology as a distinct discipline
based on quantification. And that is the real defining feature: this is a
branch of biology that depends on large scale computer
processing of large amounts of numerical data. In fact, this
is a branch of biology that verges on engineering (Silver et al.,
2014).