3
An Introduction to Bioinformatics Algorithms (www.bioalgorithms.info)
Gene Prediction: Computational Challenge
Gene: a sequence of nucleotides coding for a protein.
Gene Prediction Problem: determine the beginning and end positions of genes in a genome.

8
Central Dogma: Doubts
The Central Dogma was proposed in 1958 by Francis Crick.
Crick had very little supporting evidence in the late 1950s.
Before Crick’s seminal paper, all possible information transfers were considered viable.
Crick postulated that some of them are not viable (the missing arrows in the diagram).
In 1970 Crick published a paper defending the Central Dogma.

10
The Sly Fox
In the following string:
THE SLY FOX AND THE SHY DOG
delete 1, 2, and 3 nucleotides after the first ‘S’:
THE SYF OXA NDT HES HYD OG
THE SFO XAN DTH ESH YDO G
THE SOX AND THE SHY DOG
Which of the above makes the most sense?
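The deletion exercise above can be reproduced with a short Python sketch. The function name `delete_and_reframe` and its parameters are illustrative choices, not part of the original slides. It removes letters after the first occurrence of an anchor letter and regroups what remains into three-letter "codons".

```python
def delete_and_reframe(sentence: str, anchor: str, n_deleted: int, word_len: int = 3) -> str:
    """Delete n_deleted letters right after the first `anchor`, then regroup into triplets."""
    letters = sentence.replace(" ", "")        # drop the spaces: only the letter order matters
    cut = letters.index(anchor) + 1            # position just after the first anchor letter
    shifted = letters[:cut] + letters[cut + n_deleted:]
    # Re-read the remaining letters in groups of word_len, as a ribosome reads codons.
    return " ".join(shifted[i:i + word_len] for i in range(0, len(shifted), word_len))


if __name__ == "__main__":
    text = "THE SLY FOX AND THE SHY DOG"
    for n in (1, 2, 3):
        print(n, delete_and_reframe(text, "S", n))
```

Only the deletion of three letters leaves every later word intact, which is the point of the slide: a deletion whose length is a multiple of three preserves the reading frame.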

12
Great Discovery Provoking Wrong Assumption
In 1964, Charles Yanofsky and Sydney Brenner proved colinearity in the order of codons with respect to amino acids in proteins.
In 1967, Yanofsky and colleagues further proved that the sequence of codons in a gene determines the sequence of amino acids in a protein.
As a result, it was incorrectly assumed that the triplets encoding amino acid sequences form contiguous strips of information.

16
Exons and Introns
In eukaryotes, the gene is a combination of coding segments (exons) that are interrupted by non-coding segments (introns).
This makes computational gene prediction in eukaryotes even more difficult.
Prokaryotes don’t have introns: genes in prokaryotes are continuous.

29
mRNA is now Ready
From lectures by Chris Burge (MIT)

30
Gene Prediction Analogy
A newspaper is written in an unknown language.
– Certain pages contain an encoded message: say, 99 letters on page 7, 30 on page 12, and 63 on page 15.
How do you recognize the message? You could probably distinguish between the ads and the story (ads often contain the “$” sign).
The statistics-based approach to gene prediction tries to make similar distinctions between exons and introns.

31
Statistical Approach: Metaphor in Unknown Language
By noting the differing frequencies of symbols (e.g. ‘%’, ‘.’, ‘-’) and numerals, could you distinguish between a story and the stock report in a foreign newspaper?

32
Two Approaches to Gene Prediction
Statistical: coding segments (exons) have typical sequences on either end and use different subwords than non-coding segments (introns).
Similarity-based: many human genes are similar to genes in mice, chickens, or even bacteria. Therefore, already known mouse, chicken, and bacterial genes may help to find human genes.

33
Similarity-Based Approach: Metaphor in Different Languages
If you could compare the day’s news in English side by side with the same news in a foreign language, some similarities may become apparent.

36
Open Reading Frames (ORFs)
Detect potential coding regions by looking at ORFs.
– A genome of length n comprises roughly n/3 codons in each reading frame.
– Stop codons break the genome into segments between consecutive stop codons.
– The subsegments of these that start from the start codon (ATG) are ORFs.
ORFs in different frames may overlap.
[Figure: a genomic sequence with an open reading frame running from ATG to TGA]
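The ORF definition above can be sketched in Python. This is a minimal illustration under stated assumptions, not a production gene finder: it scans only the forward strand, uses the three standard stop codons (TAA, TAG, TGA), and reports the first ATG in each stop-to-stop segment. The function name `find_orfs` and its `min_codons` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}   # standard genetic code (assumed)
START_CODON = "ATG"


def find_orfs(dna: str, min_codons: int = 0):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    `end` is exclusive and includes the stop codon; `min_codons` counts the
    codons from the start codon up to, but not including, the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None                              # index of the current open ATG, if any
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START_CODON:
                start = i                         # open a new ORF at the first ATG
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                      # the stop codon closes the segment
    return orfs


if __name__ == "__main__":
    print(find_orfs("ATGTGA"))
    print(find_orfs("CCATGAAATAGATGCCCTGA"))
```

Scanning the reverse complement with the same function would cover the other three reading frames.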

37
Long vs. Short ORFs
Long open reading frames may be genes.
– At random, we should expect one stop codon every 64/3 ≈ 21 codons.
– However, genes are usually much longer than this.
A basic approach is to scan for ORFs whose length exceeds a certain threshold.
– This is naïve because some genes (e.g. some neural and immune system genes) are relatively short.
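The arithmetic behind the "one stop codon every 21 codons" estimate can be checked with a few lines of Python. The sketch assumes that all 64 codons are equally likely and independent, which is a simplification of real genomes; the function names are illustrative.

```python
N_STOP_CODONS = 3    # TAA, TAG, TGA in the standard genetic code
N_CODONS = 64        # 4^3 possible triplets


def expected_codons_per_stop() -> float:
    """Mean spacing between stop codons if codons were uniform and independent."""
    return N_CODONS / N_STOP_CODONS


def prob_random_run_without_stop(n_codons: int) -> float:
    """Chance that n_codons consecutive random codons contain no stop codon."""
    return (1 - N_STOP_CODONS / N_CODONS) ** n_codons


if __name__ == "__main__":
    print(round(expected_codons_per_stop(), 1))
    for n in (21, 50, 100):
        print(n, prob_random_run_without_stop(n))
```

Under this model a stop-free run of 100 codons arises by chance less than one percent of the time, which is why a length threshold is a reasonable first filter despite missing short genes.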

38
Testing ORFs: Codon Usage
Create a 64-element hash table and count the frequencies of codons in an ORF.
Amino acids typically have more than one codon, but in nature certain codons are used more often than others.
Uneven use of the codons may characterize a real gene.
This compensates for the pitfalls of the ORF length test.
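The 64-element table described above maps naturally onto a Python dictionary. The sketch below is one possible way to build it; `codon_frequencies` is a hypothetical helper that reads the ORF in frame from its first base and ignores any trailing partial codon or non-ACGT triplet.

```python
from collections import Counter
from itertools import product

# All 64 possible codons over the DNA alphabet.
ALL_CODONS = ["".join(triplet) for triplet in product("ACGT", repeat=3)]


def codon_frequencies(orf: str) -> dict:
    """Relative frequency of each of the 64 codons in an ORF read from position 0."""
    orf = orf.upper()
    counts = Counter(orf[i:i + 3] for i in range(0, len(orf) - 2, 3))
    total = sum(counts[c] for c in ALL_CODONS)    # skip triplets with non-ACGT characters
    return {c: (counts[c] / total if total else 0.0) for c in ALL_CODONS}


if __name__ == "__main__":
    freqs = codon_frequencies("ATGGCCGCCTGA")
    print({c: f for c, f in freqs.items() if f > 0})
```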

41
Codon Usage and Likelihood Ratio
An ORF is more “believable” than another if it has more “likely” codons.
Do sliding-window calculations to find ORFs that have the “likely” codon usage.
This allows higher precision in identifying true ORFs and is much better than merely testing for length.
However, the average vertebrate exon length is 130 nucleotides, which is often too small to produce reliable peaks in the likelihood ratio.
Further improvement: in-frame hexamer count (frequencies of pairs of consecutive codons).
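A sliding-window likelihood ratio can be sketched as follows. The slides do not give a formula, so this is one common formulation and should be read as an assumption: each codon contributes log(P_coding / P_background), and a window score is the sum over its codons. The frequency tables, the pseudocount, and the function name are all hypothetical.

```python
import math


def window_log_likelihood_ratios(seq, coding_freq, background_freq,
                                 window_codons=30, pseudo=1e-6):
    """Sum of per-codon log-likelihood ratios for each window, reading frame 0.

    coding_freq / background_freq: dicts mapping codon -> frequency (assumed inputs).
    pseudo: small pseudocount so that unseen codons do not produce log(0).
    """
    seq = seq.upper()
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    scores = [
        math.log((coding_freq.get(c, 0.0) + pseudo) /
                 (background_freq.get(c, 0.0) + pseudo))
        for c in codons
    ]
    # Slide one codon at a time; each entry is the score of one window.
    return [sum(scores[i:i + window_codons])
            for i in range(len(scores) - window_codons + 1)]


if __name__ == "__main__":
    coding = {"GCC": 0.8, "AAA": 0.2}        # toy numbers, not real codon usage
    background = {"GCC": 0.2, "AAA": 0.8}
    print(window_log_likelihood_ratios("GCCGCCAAAAAA", coding, background, window_codons=2))
```

Positive window scores favor the coding model and negative scores favor the background model. With short windows the sum rests on few codons, which is the reliability problem the slide notes for 130-nucleotide exons.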

42
Gene Prediction and Motifs
Upstream regions of genes often contain motifs that can be used for gene prediction.
[Figure: upstream region of a gene marked at positions -35, -10, and 0 (transcription start site), showing the Pribnow box (TATACT), the ribosomal binding site (TTCCAAGGAGG), the ATG start codon, and the STOP codon]
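One simple way to look for such upstream motifs is an approximate string search that tolerates a few mismatches. The sketch below is an illustration rather than the method used in the lectures; `find_motif` and its `max_mismatches` parameter are hypothetical, and the motif string is the Pribnow box sequence shown on the slide.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_motif(seq: str, motif: str, max_mismatches: int = 1):
    """Start positions where `motif` matches `seq` with at most max_mismatches."""
    seq, motif = seq.upper(), motif.upper()
    k = len(motif)
    return [i for i in range(len(seq) - k + 1)
            if hamming(seq[i:i + k], motif) <= max_mismatches]


if __name__ == "__main__":
    print(find_motif("GGTATACTGG", "TATACT", max_mismatches=0))
    print(find_motif("GGTATAATGG", "TATACT", max_mismatches=1))
```

A position weight matrix would capture weakly conserved motifs better than a fixed mismatch budget, at the cost of needing training sequences.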

44
Ribosomal Binding Site

45
Splicing Signals
Try to recognize the location of splicing signals at exon-intron junctions.
– This has yielded a weakly conserved donor splice site and acceptor splice site.
Profiles for these sites are still weak, which lends the problem to Hidden Markov Model (HMM) approaches, which capture the statistical dependencies between sites.

46
Donor and Acceptor Sites: GT and AG dinucleotides
The beginning and end of exons are signaled by donor and acceptor sites that usually have GT and AG dinucleotides.
Detecting these sites is difficult, because GT and AG appear very often.
[Figure: exon 1, followed by an intron that begins with GT (donor site) and ends with AG (acceptor site), followed by exon 2]
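To see why the dinucleotides named in the slide title (GT for donors, AG for acceptors) are weak signals on their own, one can simply list every occurrence. The helper `candidate_splice_sites` below is a hypothetical sketch that reports all positions, with no attempt to decide which are real splice sites.

```python
def candidate_splice_sites(dna: str):
    """Return (donor_positions, acceptor_positions) for every GT and AG in dna."""
    dna = dna.upper()
    donors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "AG"]
    return donors, acceptors


if __name__ == "__main__":
    donors, acceptors = candidate_splice_sites("CCGTAAGTTTAGCC")
    print("GT at", donors)
    print("AG at", acceptors)
```

In a sequence with uniform, independent bases, any given dinucleotide is expected at about 1 in 16 positions, so a long genomic region contains far more GT and AG occurrences than true splice sites.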

48
TestCode
A statistical test described by James Fickett in 1982: nucleotides in coding regions tend to be repeated with a periodicity of 3.
– Judges randomness instead of codon frequency.
– Finds “putative” coding regions, not introns, exons, or splice sites.
TestCode finds ORFs based on compositional bias with a periodicity of three.

49
TestCode Statistics
Define a window size of no less than 200 bp and slide the window down the sequence 3 bases at a time. In each window:
– Calculate for each base {A, T, G, C}:
  max(n_{3k+1}, n_{3k+2}, n_{3k}) / min(n_{3k+1}, n_{3k+2}, n_{3k})
  where n_{3k+1}, n_{3k+2}, and n_{3k} are the counts of that base at the three codon positions within the window.
Use these values to obtain a probability from a lookup table (which was previously defined and determined experimentally with known coding and noncoding sequences).
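The per-window max/min ratio can be sketched in Python as below. This covers only the ratio step: the probability lookup table mentioned on the slide is not reproduced here. Two choices are assumptions of this sketch rather than part of the slide: a base absent from the window gets a ratio of 1.0, and a base missing from at least one codon position gets infinity. The function names are hypothetical.

```python
def position_asymmetry(window: str) -> dict:
    """For each base, max/min of its counts at the three codon positions."""
    window = window.upper()
    ratios = {}
    for base in "ATGC":
        counts = [0, 0, 0]
        for i, ch in enumerate(window):
            if ch == base:
                counts[i % 3] += 1            # bucket by position modulo 3
        lo, hi = min(counts), max(counts)
        if hi == 0:
            ratios[base] = 1.0                # base absent: no asymmetry (assumption)
        elif lo == 0:
            ratios[base] = float("inf")       # avoid division by zero (assumption)
        else:
            ratios[base] = hi / lo
    return ratios


def sliding_asymmetry(seq: str, window_size: int = 200, step: int = 3):
    """(start, ratios) for each window, sliding `step` bases at a time."""
    return [(s, position_asymmetry(seq[s:s + window_size]))
            for s in range(0, len(seq) - window_size + 1, step)]


if __name__ == "__main__":
    print(position_asymmetry("AAAAACATG"))
    print(len(sliding_asymmetry("ACGT" * 60)))
```

A ratio near 1 means the base is spread evenly over the three positions, as expected in noncoding sequence; larger ratios indicate the period-3 bias that TestCode looks for.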

50
TestCode Statistics (cont’d)
Probabilities can be classified as indicative of “coding” or “noncoding” regions, or “no opinion” when it is unclear what level of randomization tolerance a sequence carries.
The resulting sequence of probabilities can be plotted.