Title: Gene prediction and HMM Computational Genomics 2005/6 Lecture 9b
1Gene prediction and HMM
Computational Genomics 2005/6, Lecture 9b
- Slides taken from (and rapidly mixed) Larry
Hunter, Tom Madej, William Stafford Noble, and
Eyal Pribman. Partially modified by Benny Chor.
2Annotation of Genomic Sequence
- Given the sequence of an organism's genome, we would like to be able to identify:
- Genes
- Exon boundaries and splice sites
- Beginning and end of translation
- Alternative splicings
- Regulatory elements (e.g. promoters)
- The only certain way to do this is experimentally, but it is time consuming and expensive.
- Computational methods can achieve reasonable accuracy quickly, and help direct experimental approaches.
primary goals
secondary goals
3Prokaryotic Gene Structure
Promoter CDS Terminator
Genomic DNA
transcription
mRNA
- Most bacterial promoters contain the Pribnow box, at about -10, which has the consensus sequence 5'-TATAAT-3'.
- The terminator: a signal at the end of the coding sequence that terminates the transcription of RNA.
- The coding sequence is composed of nucleotide triplets. Each triplet codes for an amino acid. The AAs are the building blocks of proteins.
4Pieces of a (Eukaryotic) Gene (on the genome)
exons (CDS, UTR): ~10^2-10^3 bp
introns: ~10^2-10^5 bp
5What is it about genes that we can measure (and
model)?
- Most of our knowledge is biased towards
- protein-coding characteristics
- ORF (Open Reading Frame): a sequence defined by an in-frame AUG and stop codon, which in turn defines a putative amino acid sequence.
- Codon Usage: most frequently measured by CAI (Codon Adaptation Index)
- Other phenomena
- Nucleotide frequencies and correlations
- value and structure
- Functional sites
- splice sites, promoters, UTRs, polyadenylation
sites
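The ORF definition above can be sketched in a few lines of Python. This is a minimal illustration, not a gene finder: it scans only the forward strand, works on DNA (so the start codon is ATG rather than AUG), and the function name `find_orfs` and the `min_codons` parameter are assumptions made for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=1):
    """Return (start, end) pairs of forward-strand ORFs, end exclusive.

    An ORF here runs from an ATG to the first in-frame stop codon.
    min_codons counts the codons before the stop, including the ATG.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the sequence codon by codon in this reading frame.
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs
```

For example, `find_orfs("ATGCGTATGTTTTGA")` reports a single ORF covering the whole 15-base string. Raising `min_codons` is the simplest way to use ORF length as a filter against spurious short ORFs.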
6A simple measure: ORF length
Comparison of Annotation and Spurious ORFs in S. cerevisiae
Basrai MA, Hieter P, and Boeke J. Genome Research 1997; 7:768-771
7Codon Adaptation Index (CAI)
- Parameters are empirically determined by examining a large set of example genes
- This is not perfect
- Genes sometimes have unusual codons for a reason
- The predictive power is dependent on length of sequence
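A rough sketch of how a CAI value can be computed, assuming the usual definition: each codon gets a weight w equal to its reference count divided by the count of the most used synonymous codon, and the CAI of a gene is the geometric mean of w over its codons. The reference table below is hypothetical and covers only two amino acids; a real table would be estimated from a large set of example genes, as the slide says.

```python
import math

# Hypothetical reference counts (per 1000 codons), two amino acids only.
REFERENCE_COUNTS = {
    "Leu": {"CTG": 50.0, "CTA": 5.0},
    "Lys": {"AAA": 30.0, "AAG": 10.0},
}

def relative_adaptiveness(reference):
    """w(codon) = count / count of the most used synonymous codon."""
    weights = {}
    for codons in reference.values():
        best = max(codons.values())
        for codon, count in codons.items():
            weights[codon] = count / best
    return weights

def cai(cds, weights):
    """Geometric mean of w over the codons of cds that have a weight."""
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    logs = [math.log(weights[c]) for c in codons if c in weights]
    if not logs:
        raise ValueError("no scorable codons in sequence")
    # Averaging logs and exponentiating gives the geometric mean.
    return math.exp(sum(logs) / len(logs))
```

With these made-up counts, a sequence using only the preferred codons (`"CTGAAA"`) scores 1.0, and one using only the rarer codons (`"CTAAAG"`) scores much lower. Because the score is an average over codons, short sequences give noisy values, which is the length dependence noted above.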
8CAI Example: Counts per 1000 codons
9Splice signals (mice): GT, AG
10General Things to Remember about (Protein-coding)
Gene Prediction Software
- It is, in general, organism-specific
- It works best on genes that are reasonably similar to something seen previously
- It finds protein coding regions far better than non-coding regions
- In the absence of external (direct) information, alternative forms will not be identified
- It is imperfect! (It's biology, after all)
11Rest of Lecture Outline
- Eukaryotic gene structure
- Modeling gene structure
- Using the model to make predictions
- Improving the model topology
- Modeling fixed-length signals
12A eukaryotic gene
- This is the human p53 tumor suppressor gene on chromosome 17.
- Genscan is one of the most popular gene prediction algorithms.
13A eukaryotic gene
Introns
Final exon
Initial exon
3' untranslated region
Internal exons
This particular gene lies on the reverse strand.
14An Intron
revcomp(CT) = AG
revcomp(AC) = GT
GT signals start of intron; AG signals end of intron
3' splice site
5' splice site
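For a gene on the reverse strand, the GT and AG signals appear on the forward strand as their reverse complements. A minimal sketch of the `revcomp` operation used in the labels above (the implementation is an illustration, assuming an uppercase or lowercase ACGT alphabet):

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of a DNA string over the alphabet ACGT."""
    return seq.upper().translate(COMPLEMENT)[::-1]
```

So an intron on the reverse strand shows up in the forward-strand sequence as starting with CT and ending with AC.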
15Signals vs contents
- In gene finding, a small pattern within the genomic DNA is referred to as a signal, whereas a region of genomic DNA is a content.
- Examples of signals: splice sites, starts and ends of transcription or translation, branch points, transcription factor binding sites
- Examples of contents: exons, introns, UTRs, promoter regions
16Prior knowledge
- We want to build a probabilistic model of a gene that incorporates our prior knowledge.
- E.g., the translated region must have a length that is a multiple of 3.
17Prior knowledge
- The translated region must have a length that is a multiple of 3.
- Some codons are more common than others.
- Exons are usually shorter than introns.
- The translated region begins with a start signal and ends with a stop codon.
- 5' splice sites (exon to intron) are usually GT.
- 3' splice sites (intron to exon) are usually AG.
- The distribution of nucleotides and dinucleotides is usually different in introns and exons.
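Several of these facts are hard constraints that can be checked directly on a candidate gene structure. The sketch below is only an illustration of that idea; the function names `check_cds` and `check_intron` are made up for this example, and a probabilistic model would treat the splice-site rules as strong tendencies rather than as strict tests.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def check_cds(cds):
    """Check a spliced coding sequence (DNA) against the hard constraints."""
    cds = cds.upper()
    return {
        "length_multiple_of_3": len(cds) % 3 == 0,
        "starts_with_ATG": cds.startswith("ATG"),
        "ends_with_stop": cds[-3:] in STOP_CODONS,
    }

def check_intron(intron):
    """Check the usual GT ... AG boundaries of an intron (DNA)."""
    intron = intron.upper()
    return {
        "starts_with_GT": intron.startswith("GT"),
        "ends_with_AG": intron.endswith("AG"),
    }
```

The softer facts in the list (codon preferences, exon and intron lengths, nucleotide and dinucleotide distributions) cannot be checked this way; they are the part that the emission and transition probabilities of the model are meant to capture.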
18A simple gene model
Intergenic
Intergenic
Intergenic
Gene
Transcription stop
Transcription start
Start
End
Intergenic
19A probabilistic gene model
Pr(TACAGTAGATATGA) = 0.0001
Pr(AACAGT) = 0.001
Pr(AACAGTAC) = 0.002
Intergenic
Intergenic
0.25
Intergenic
Gene
Transcription stop
Transcription start
0.67
1.00
0.75
Start
End
0.33
Intergenic
Every box stores transition probabilities for
outgoing arrows. Every arrow stores emission
probabilities for emitted nucleotides.
20Parse
S: ACTGACTACTACGACTACGATCTACTACGGGCGCGACCTATGCG
P: IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGGGGG
S: TATGTTTTGAACTGACTATGCGATCTACGACTCGACTAGCTAC
P: GGGGGGGGGGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- For a given sequence, a parse is an assignment of gene structure to that sequence.
- In a parse, every base is labeled according to the content it is predicted to belong to.
- In our simple model, the parse contains only I (intergenic) and G (gene).
- A more complete model would contain, e.g., "-" for intergenic, E for exon and I for intron.
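Since a parse is just one label per base, it can be stored as a string of the same length as the sequence and collapsed into labeled segments when needed. A small sketch, with the helper name `parse_segments` chosen for this example:

```python
from itertools import groupby

def parse_segments(parse):
    """Collapse a per-base label string into (label, start, end) runs.

    Positions are 0-based and end is exclusive.
    """
    segments = []
    pos = 0
    for label, run in groupby(parse):
        length = len(list(run))
        segments.append((label, pos, pos + length))
        pos += length
    return segments
```

Applied to the parse shown on this slide (39 I, then 15 G, then 33 I), this gives three segments, and slicing the sequence with the G segment recovers the predicted gene ATGCGTATGTTTTGA.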
21The probability of a parse
Pr(ATGCGTATGTTTTGA) = 0.00000000142
Pr(ACTGACTACTACGACTACGATCTACTACGGGCGCGACCT) = 0.0000543
Pr(ACTGACTATGCGATCTACGACTCGACTAGCTAC) = 0.0000789
Intergenic
Intergenic
0.25
Intergenic
Gene
Transcription stop
Transcription start
0.67
1.00
0.75
Start
End
0.33
Intergenic
S: ACTGACTACTACGACTACGATCTACTACGGGCGCGACCTATGCGTATGTTTTGAACTGACTATGCGATCTACGACTCGACTAGCTAC
P: IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGGGGGGGGGGGGGGGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
Pr(parse P | sequence S, model M) = 0.67 × 0.0000543 × 1.00 × 0.00000000142 × 0.75 × 0.0000789 = 3.057 × 10^-18
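The product on this slide can be reproduced directly. The sketch below multiplies the six factors in the order they appear (transition probabilities alternating with segment emission probabilities) and also computes the sum of logs, which is what one would use in practice because products of many small probabilities underflow quickly. The names `FACTORS` and `parse_probability` are introduced only for this example.

```python
import math

# Factors as listed on the slide, in order along the parse.
FACTORS = [
    0.67,            # transition probability
    0.0000543,       # Pr(leading intergenic segment)
    1.00,            # transition probability
    0.00000000142,   # Pr(gene segment)
    0.75,            # transition probability
    0.0000789,       # Pr(trailing intergenic segment)
]

def parse_probability(factors):
    """Return (probability, log probability) of a parse from its factors."""
    prob = math.prod(factors)
    log_prob = sum(math.log(f) for f in factors)
    return prob, log_prob

prob, log_prob = parse_probability(FACTORS)
print(f"{prob:.3e}")  # about 3.057e-18
```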
22Finding the best parse
- For a given sequence S, the model M assigns a probability Pr(P | S, M) to every parse P.
- We want to find the parse P that receives the highest probability.
23An analogy
- Given two sequences, S1 and S2, and a scoring system M, find the alignment A that maximizes the score.
- Solution: dynamic programming
24Dynamic programming
A C A C C T T G T A C G T
Intergenic 1, Intergenic 2, Intergenic 3, Intergenic 4, Gene
Intergenic 2
Intergenic 4
Intergenic 1
Gene
Transcription stop
Transcription start
Start
End
Intergenic 3
25DP matrix
Columns (sequence): A C A A T G T G A C
Rows (states): I1, I2, I3, I4, G
S: ACAATGTGAC
P: IIIGGGGGGI
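The DP matrix has one row per state and one column per base; filling it left to right and tracing back gives the best parse. The sketch below shows this with the Viterbi algorithm on a deliberately simplified model: two states (I and G) that each emit one base per step, rather than the five-state, emit-on-arrows model drawn on the slides. All probabilities are invented for illustration and must be non-zero, since the code works in log space.

```python
import math

# Hypothetical two-state model: I = intergenic (AT-rich), G = gene (GC-rich).
STATES = ("I", "G")
START_P = {"I": 0.9, "G": 0.1}
TRANS_P = {"I": {"I": 0.9, "G": 0.1}, "G": {"I": 0.1, "G": 0.9}}
EMIT_P = {
    "I": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "G": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}

def viterbi(seq, states=STATES, start_p=START_P, trans_p=TRANS_P, emit_p=EMIT_P):
    """Most probable state path for a non-empty seq, as a label string."""
    # score[t][s]: best log probability of any path ending in state s at t.
    score = [{s: math.log(start_p[s]) + math.log(emit_p[s][seq[0]])
              for s in states}]
    back = [{}]
    for t in range(1, len(seq)):
        score.append({})
        back.append({})
        for s in states:
            # Best predecessor state r for being in state s at position t.
            prev, best = max(
                ((r, score[t - 1][r] + math.log(trans_p[r][s])) for r in states),
                key=lambda pair: pair[1],
            )
            score[t][s] = best + math.log(emit_p[s][seq[t]])
            back[t][s] = prev
    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[-1][s])]
    for t in range(len(seq) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return "".join(reversed(path))
```

With these made-up parameters, `viterbi("ATATGCGCGCATAT")` labels the GC-rich middle as G and the flanks as I. The running time is proportional to the sequence length times the square of the number of states, the same shape of cost as the alignment DP in the analogy.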
26Beyond Simplest Model
- Improving the gene model topology
- Fixed-length signals
- PSSMs
- Dependencies between positions
- Variable-length contents
- Using HMMs
- Semi-Markov models
- Parsing algorithms
- Viterbi
- Posterior decoding
- Including other types of data
- Expressed sequence tags
- Orthology
27Improved model topology
Intergenic 2
Intergenic 4
Intergenic 1
Gene
Transcription stop
Transcription start
Start
End
Intergenic 3
- Draw a model that includes introns
28Improved model topology
Start
Transcription start
5' splice site
3' splice site
Transcription stop
End
29Improved model topology
Start
Transcription start
5' splice site
3' splice site
4 intergenics, 1 intron, 4 exons
Transcription stop
End
30Improved model topology
Start
Transcription start
Single exon
Initial exon
Internal exon
5' splice site
3' splice site
Intron
Final exon
Transcription stop
End
31Modeling the 5' splice site
5' splice site
3' splice site
Intron
GT
- Most introns begin with the letters GT.
- We can add this signal to the model.
32Modeling the 5' splice site
5' splice site
3' splice site
Intron
G: Pr(A)=0 Pr(C)=0 Pr(G)=1 Pr(T)=0
T: Pr(A)=0 Pr(C)=0 Pr(G)=0 Pr(T)=1
- Most introns begin with the letters GT.
- We can add this signal to the model.
- Indeed, we can model each nucleotide with its own
arrow.
33Modeling the 5' splice site
5' splice site
3' splice site
Intron
G: Pr(A)=0.01 Pr(C)=0.01 Pr(G)=0.97 Pr(T)=0.01
T: Pr(A)=0.01 Pr(C)=0.01 Pr(G)=0.01 Pr(T)=0.97
- Like most biological phenomena, the splice site signal admits exceptions.
- The resulting model of the 5' splice site is a length-2 PSSM.
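A length-2 PSSM is simply one probability column per position. The sketch below uses the probabilities from this slide; the function names `pssm_probability` and `scan` are assumptions made for the example.

```python
# One column per position of the 5' splice site signal.
PSSM_5SS = [
    {"A": 0.01, "C": 0.01, "G": 0.97, "T": 0.01},  # position 1: usually G
    {"A": 0.01, "C": 0.01, "G": 0.01, "T": 0.97},  # position 2: usually T
]

def pssm_probability(site, pssm=PSSM_5SS):
    """Probability of a site: product of the per-position probabilities."""
    if len(site) != len(pssm):
        raise ValueError("site length must match PSSM length")
    p = 1.0
    for base, column in zip(site, pssm):
        p *= column[base]
    return p

def scan(seq, pssm=PSSM_5SS):
    """Score every window of seq; return a list of (position, probability)."""
    width = len(pssm)
    return [(i, pssm_probability(seq[i:i + width], pssm))
            for i in range(len(seq) - width + 1)]
```

Here `pssm_probability("GT")` is 0.97 × 0.97 = 0.9409, while a non-canonical site such as GC still gets a small non-zero probability (0.97 × 0.01), which is how the model admits exceptions. Extending the signal to more conserved positions only means adding more columns.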
34Real splice sites
- Real splice sites show some conservation at positions beyond the first two.
- We can add additional arrows to model these states.
weblogo.berkeley.edu
35Modeling the 5' splice site
5' splice site
3' splice site
Intron