Title: Gene predictions for eukaryotes
1Gene predictions for eukaryotes
- attgccagtacgtagctagctacacgtatgctattacggatctgtagc
ttagcgtatctgtatgctgttagctgtacgtacgtatttttctagagctt
cgtagtctatggctagtcgtagtcgtagtcgttagcatctgtatgctgtt
agctgtacgtacgtatttttctagagcttcgtagtctatggctagtcgta
gtcgtagtcgttagcatctgtatgctgttagctgtacgtacgtatttttc
taggggagcttcgtagtctatggctagtcgtagtcgtagtcgttagcatc
tgtatgctgttagctgtacgtacgtatttttctaggggagcttcgtagtc
tatggctagtcgtagtcgtagtcgttagcatctgtatgctgttagctgta
cgtacgtatttttctaggggagcttcgtagtctatggctagtcgtagtcg
tagtcgttagcttagtcgtgtagtcttgatctacgtacgtatttttctag
agcttcgtagtctatggctagtcgtagtcgtagtcgttagcatctgtatg
ctgttagctgtacgtacgtatttttctagagcttcgtagtctatggctag
tcgtagtcgtagtcgttagcatctgtatgctgttagctgtacgtacgtat
ttttctaggggagcttcgtagtctatggctagtcgtagtcgtagtcgtta
gcatctgtatgctgttagctgtacgtacgtatttttctaggggagcttcg
tagtctatggctagtcgtagtcgtagtcgttagcatctgtatgtacgtac
gtatttttctagagcttcgtagtctatggctagtcgtagtcgtagtcgtt
agcatctgtatgctgttagctgtacgtacgtatttttctagagcttcgta
gtctatggctagtcgtagtcgtagtcgttagcatctgtatgctgttagct
gtacgtacgtatttttctaggggagcttcgtagtctatggctagtcgtag
tcgtagtcgttagcatctgtatgctgttagctgtacgtacgtatttttct
aggggagcttcgtagtctatggctagtcgtagtcgtagtcgttagcatct
gtatggtcgtagtcgttagcatctgtatgctgttagctgtacgtacgtat
ttttctaggggagcttcgtagtctatggctag
3Gene predictions for eukaryotes
4Gene predictions for eukaryotes
- Three different approaches to computational gene-finding
- Intrinsic: use statistical information about known genes (Hidden Markov Models)
- Extrinsic: compare genomic sequence with known proteins / genes
- Cross-species sequence comparison: search for similarities among genomes
5Hidden-Markov-Models (HMM) for gene prediction
- 3 5 6 6 6 4 6 5 1 6 5 1 2 s
- B F F U U U U U F F F F F F E f
- For sequence s and parse f:
- P(f): probability of f
- P(f,s): joint probability of f and s
- P(f,s) = P(f) P(s|f)
- P(f|s): a-posteriori probability of f
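The factorization P(f,s) = P(f) P(s|f) can be made concrete with the die example on this slide. The following is a minimal sketch, assuming F and U stand for a fair and an unfair die; all numeric parameters (`START`, `TRANS`, `EMIT`) are hypothetical values chosen for illustration, and the begin/end states B and E are left out for brevity.

```python
import math

# Hypothetical parameters for a fair (F) / unfair (U) die model.
START = {"F": 0.5, "U": 0.5}
TRANS = {"F": {"F": 0.9, "U": 0.1}, "U": {"F": 0.1, "U": 0.9}}
EMIT = {
    "F": {r: 1 / 6 for r in range(1, 7)},
    "U": {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5},
}

def log_prob_parse(f):
    """log P(f): probability of the state path alone."""
    lp = math.log(START[f[0]])
    for prev, cur in zip(f, f[1:]):
        lp += math.log(TRANS[prev][cur])
    return lp

def log_prob_given_parse(s, f):
    """log P(s|f): probability of the symbols given the state path."""
    return sum(math.log(EMIT[state][sym]) for state, sym in zip(f, s))

def log_joint(s, f):
    """log P(f,s) = log P(f) + log P(s|f)."""
    return log_prob_parse(f) + log_prob_given_parse(s, f)

# Sequence and parse from the slide (without B and E).
s = [3, 5, 6, 6, 6, 4, 6, 5, 1, 6, 5, 1, 2]
f = list("FFUUUUUFFFFFF")
print(log_joint(s, f))
```

Working in log space avoids numerical underflow, which matters once sequences reach genomic lengths.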
6Hidden-Markov-Models (HMM) for gene prediction
- 3 5 6 6 6 4 6 5 1 6 5 1 2
- B F F U U U U U F F F F F F E
- Goal: find path f with maximum a-posteriori probability P(f|s)
- Equivalent: find path that maximizes joint probability P(f,s)
- Optimal path calculated by dynamic programming (Viterbi algorithm)
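The dynamic programming step can be sketched for the same two-state die model. This is an illustrative implementation of the standard Viterbi recursion, not code from any gene finder; the parameter values are hypothetical and the begin/end states are omitted.

```python
import math

# Hypothetical two-state model: fair (F) and unfair (U) die.
STATES = ("F", "U")
START = {"F": 0.5, "U": 0.5}
TRANS = {"F": {"F": 0.9, "U": 0.1}, "U": {"F": 0.1, "U": 0.9}}
EMIT = {
    "F": {r: 1 / 6 for r in range(1, 7)},
    "U": {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5},
}

def viterbi(s):
    """Return (path, log P(path, s)) for a path maximizing the joint probability."""
    # v[k]: best log joint probability of any path that ends in state k.
    v = {k: math.log(START[k]) + math.log(EMIT[k][s[0]]) for k in STATES}
    back = []  # back[i][k]: best predecessor of state k at step i + 1
    for sym in s[1:]:
        ptr, nxt = {}, {}
        for k in STATES:
            best_prev = max(STATES, key=lambda j: v[j] + math.log(TRANS[j][k]))
            ptr[k] = best_prev
            nxt[k] = (v[best_prev] + math.log(TRANS[best_prev][k])
                      + math.log(EMIT[k][sym]))
        back.append(ptr)
        v = nxt
    # Trace back from the best final state.
    last = max(STATES, key=lambda k: v[k])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, v[last]

path, log_p = viterbi([3, 5, 6, 6, 6, 4, 6, 5, 1, 6, 5, 1, 2])
print("".join(path), log_p)
```

Because P(f|s) = P(f,s) / P(s) and P(s) does not depend on f, the path returned here also maximizes the a-posteriori probability.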
7Hidden-Markov-Models (HMM) for gene prediction
- 3 5 6 6 6 4 6 5 1 6 5 1 2
- B F F U U U U U F F F F F F E
- Program parameters learned from training data
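The slide does not say how the parameters are learned. One common approach for HMMs, when training sequences come with known parses, is to count transitions and emissions and normalize. The sketch below assumes that setting; the function name `estimate_parameters` and the pseudocount smoothing are illustrative choices, not a description of how any particular program is trained.

```python
from collections import Counter, defaultdict

def estimate_parameters(training, states, symbols, pseudocount=1.0):
    """Estimate transition and emission probabilities from labelled data.

    training: iterable of (s, f) pairs, with s the symbols and f the states.
    """
    trans_counts = defaultdict(Counter)
    emit_counts = defaultdict(Counter)
    for s, f in training:
        for state, sym in zip(f, s):
            emit_counts[state][sym] += 1
        for prev, cur in zip(f, f[1:]):
            trans_counts[prev][cur] += 1

    trans, emit = {}, {}
    for k in states:
        # Pseudocounts keep unseen events from getting probability zero.
        t_total = sum(trans_counts[k].values()) + pseudocount * len(states)
        trans[k] = {j: (trans_counts[k][j] + pseudocount) / t_total for j in states}
        e_total = sum(emit_counts[k].values()) + pseudocount * len(symbols)
        emit[k] = {x: (emit_counts[k][x] + pseudocount) / e_total for x in symbols}
    return trans, emit

# Usage with the die example: one labelled training sequence.
training = [([3, 5, 6, 6, 6, 4, 6, 5, 1, 6, 5, 1, 2], list("FFUUUUUFFFFFF"))]
trans, emit = estimate_parameters(training, states="FU", symbols=range(1, 7))
print(trans["F"], emit["U"][6])
```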
8Hidden-Markov-Models (HMM) for gene prediction
Application to gene prediction
A T A A T G C C T A G T C   s (DNA)
Z Z Z E E E E E E I I I I   f (parse)
Introns, exons etc. modeled as states in GHMM (generalized HMM)
Given sequence s, find parse that maximizes P(f|s)
(S. Karlin and C. Burge, 1997)
9 10AUGUSTUS
- Basic model for GHMM-based intrinsic gene finding
comparable to GenScan (M. Stanke)
11AUGUSTUS
12AUGUSTUS
13AUGUSTUS
- Features of AUGUSTUS
- Intron length model
- Initial pattern for exons
- Similarity-based weighting for splice sites
- Interpolated HMM
- Internal 3 content model
14Hidden-Markov-Models (HMM) for gene prediction
A T A A T G C C T A G T C   s (DNA)
Z Z Z E E E E I I I I   f (parse)
Explicit intron length model computationally expensive.
15AUGUSTUS
Intron length model
[Figure: state diagram with the states Exon, Intron (expl.), Intron (fixed), Intron (geo.), Exon]
- Explicit length distribution for short introns
- Geometric tail for long introns
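A length distribution of this shape can be sketched as follows: lengths up to some cutoff d get explicitly tabulated probabilities, and the remaining probability mass is spread over longer lengths geometrically. The function name, the cutoff, and all numbers are assumptions made for illustration; the actual AUGUSTUS parameterization is not given on the slide.

```python
def intron_length_prob(length, explicit, q):
    """P(intron length = length) under a mixed explicit/geometric model.

    explicit: list with explicit[l - 1] = P(length = l) for l = 1..d,
              summing to less than 1 (the rest is the tail mass).
    q:        continuation probability of the geometric tail, 0 < q < 1.
    """
    d = len(explicit)
    if length < 1:
        return 0.0
    if length <= d:
        return explicit[length - 1]       # short intron: table lookup
    tail_mass = 1.0 - sum(explicit)       # mass left for lengths > d
    return tail_mass * (1.0 - q) * q ** (length - d - 1)

# Hypothetical toy parameters: explicit probabilities for lengths 1..4.
explicit = [0.0, 0.1, 0.3, 0.2]
print(intron_length_prob(3, explicit, q=0.9))   # explicit part
print(intron_length_prob(10, explicit, q=0.9))  # geometric tail
```

A geometric tail corresponds to an HMM state with a self-transition, which is cheap to evaluate; this is the usual motivation for restricting the expensive explicit distribution to short introns.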
16AUGUSTUS
17AUGUSTUS
- Extension of AUGUSTUS to include extrinsic information:
- Protein sequences
- EST sequences
- Syntenic genomic sequences
- User-defined constraints
18Gene prediction by phylogenetic footprinting
- Comparison of genomic sequences (human and mouse)
19Gene prediction by phylogenetic footprinting
20AUGUSTUS
- Extended GHMM using extrinsic information
- Additional input data: collection h of hints about possible gene structure f for sequence s
- Consider s, f and h as the result of a random process. Define probability P(s,h,f)
- Find parse f that maximizes P(f|s,h) for given s and h.
21AUGUSTUS
- Hints created using
- Alignments to EST sequences
- Alignments to protein sequences
- Combined EST and protein alignment (EST alignments supported by protein alignments)
- Alignments of genomic sequences
- User-defined hints
22AUGUSTUS
EST
G1
Alignment to EST: hint to (partial) exon
23AUGUSTUS
Protein
EST
G1
EST alignment supported by protein: hint to exon (part), start codon
24AUGUSTUS
ESTs, Protein
G1
Alignment to ESTs, proteins: hints to introns, exons
25AUGUSTUS
G2
G1
Alignment of genomic sequences: hint to (partial) exon
26AUGUSTUS
- Consider different types of hints
- Types of hints: start, stop, dss, ass, exonpart, exon, introns
- Hint associated with position i in s (exons etc. associated with right end position)
- Max. one hint of each type allowed per position in s
- Each hint associated with a grade g that indicates its source.
27AUGUSTUS
h_{i,t}: information about hint of type t at position i
h_{i,t} = (grade, strand, (length, reading frame)) if hint available (hints created by protein alignments contain information about reading frame)
h_{i,t} takes a special "no hint" value if no hint of type t is available at i
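The hint representation described on the last two slides can be sketched as a small data structure. This is a hypothetical illustration, not the AUGUSTUS input format: the class names, the field names, and the grade labels ("est", "protein") are assumptions, and the "no hint" case is modelled here as `None`.

```python
from dataclasses import dataclass
from typing import Optional

# Hint types as listed on the slide.
HINT_TYPES = ("start", "stop", "dss", "ass", "exonpart", "exon", "introns")

@dataclass(frozen=True)
class Hint:
    type: str                     # one of HINT_TYPES
    position: int                 # position i in s (right end for exons etc.)
    grade: str                    # indicates the source of the hint
    strand: str                   # "+" or "-"
    length: Optional[int] = None  # only for hints that span an interval
    frame: Optional[int] = None   # reading frame, e.g. from protein alignments

class HintCollection:
    """Collection h: at most one hint of each type per position."""

    def __init__(self):
        self._hints = {}

    def add(self, hint):
        if hint.type not in HINT_TYPES:
            raise ValueError(f"unknown hint type: {hint.type}")
        key = (hint.position, hint.type)
        if key in self._hints:
            raise ValueError(f"position {hint.position} already has a {hint.type} hint")
        self._hints[key] = hint

    def get(self, position, hint_type):
        """Return h_{i,t}, or None if no hint of this type exists at i."""
        return self._hints.get((position, hint_type))

h = HintCollection()
h.add(Hint("exonpart", position=120, grade="est", strand="+", length=45))
h.add(Hint("exon", position=120, grade="protein", strand="+", length=90, frame=0))
print(h.get(120, "exon"), h.get(121, "exon"))
```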
28AUGUSTUS
Standard program version, without hints
A T A A T G C C T A G T C   s (sequence)
Z Z Z E E E E E E I I I I   f (parse)
Find parse that maximizes P(f|s)
29AUGUSTUS
AUGUSTUS using hints
A T A A T G C C T A G T C   s (sequence)
X   h (type 1)
    h (type 2)
X   h (type 3)
. . . .
Z Z Z E E E E E E I I I I   f (parse)
Find parse that maximizes P(f|s,h)
30AUGUSTUS
As in standard HMM theory: maximize joint probability P(f,s,h)
How to calculate P(f,s,h)?
31AUGUSTUS
Simplifying assumption: hints of different types t and at different positions i are independent of each other (for redundant hints, ignore weaker types).
34AUGUSTUS
- Results
- Gene (sub-)structures supported by hints receive a bonus compared to non-supported structures
- Gene (sub-)structures not supported by hints receive a malus (penalty)
- (M. Stanke et al. 2006, BMC Bioinformatics)
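The bonus/malus effect can be illustrated with a deliberately simplified scoring sketch. It assumes that each feature of a parse contributes an independent multiplicative factor depending on whether a hint of the same type sits at the same position; the function name and the factor values are hypothetical, and this is not the formula from the cited paper.

```python
import math

def log_score_with_hints(log_joint_fs, features, hints, bonus, malus):
    """Adjust log P(f,s) by per-feature hint factors (simplified sketch).

    features: set of (position, type) pairs implied by the parse f
    hints:    set of (position, type) pairs for which a hint exists
    bonus[t]: factor > 1 for a feature of type t that is supported by a hint
    malus[t]: factor < 1 for a feature of type t that has no supporting hint
    """
    score = log_joint_fs
    for pos, t in features:
        factor = bonus[t] if (pos, t) in hints else malus[t]
        score += math.log(factor)  # independence: factors multiply
    return score

# Two candidate parses with the same intrinsic score; only one exon is hinted.
bonus = {"exon": 10.0}
malus = {"exon": 0.5}
hints = {(120, "exon")}
print(log_score_with_hints(-100.0, {(120, "exon")}, hints, bonus, malus))
print(log_score_with_hints(-100.0, {(150, "exon")}, hints, bonus, malus))
```

With equal intrinsic scores, the parse whose exon coincides with a hint ends up ranked above the one without support.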
35AUGUSTUS
36AUGUSTUS
- Using hints from DIALIGN alignments
- Obtain large human/mouse sequence pairs (up to 50kb) from UCSC
- Run CHAOS to find anchor points
- Run DIALIGN using CHAOS anchor points
- Create hints h from DIALIGN fragments
- Run AUGUSTUS with hints
37AUGUSTUS
- Hints from DIALIGN fragments
- Consider fragments that pass a score threshold of 20
- Distinguish high scores (threshold 45) from low scores
- Consider reading frame given by DIALIGN
- Consider strand given by DIALIGN
- Result: 2 x 2 x 2 = 8 grades
38AUGUSTUS
EGASP: competition to evaluate and compare gene-prediction methods (Sanger Center, 2005)
AUGUSTUS best ab-initio method at EGASP
39 EGASP test results
44Application of AUGUSTUS in genome projects
- Brugia malayi (TIGR)
- Aedes aegypti (TIGR)
- Schistosoma mansoni (TIGR)
- Tetrahymena thermophila (TIGR)
- Galdieria sulphuraria (Michigan State Univ.)
- Coprinus cinereus (Univ. Göttingen)
- Tribolium castaneum (Univ. Göttingen)