Title: BIOINFORMATICS
1BIOINFORMATICS
Gene Finding with a Hidden Markov Model of
Genomic Structure and Evolution
Jakob Skou Pedersen and Jotun Hein
Deepak Verghese CS 6890
2A number of models have incorporated evolutionary
information:
- GPHMM
- Conserved Exon Method
- Two-step GLASS and ROSETTA
- TWINSCAN, which extends GENSCAN
- etc.
3- Do not exploit all the information in the evolutionary
pattern.
- Not easily extended to multiple genome sequences.
4(EHMM)
EVOLUTIONARY HIDDEN MARKOV MODEL
A Probabilistic model of both Genome Structure
and Evolution
- Composed of
- Hidden Markov Model (HMM)
- Phylogenetic Tree
5ADVANTAGES
- Can handle any number of sequences in an
alignment.
- Can have properties of higher-order HMMs.
- Can handle variability in the sequences along the
alignment.
- State-of-the-art evolutionary models can be
incorporated later.
- Evolutionary events between different genomes are
not treated independently.
6MODEL
- SCOPE
- Not to compete with the existing gene-finding methods
on performance, but to illustrate the power of
this approach.
- Relies on a pre-produced alignment.
7MARKOV CHAINS
- A set of states
- The transitions from one state to all other
states, including itself, are governed by a
probability distribution.
- First-order Markov chain: the probabilities
depend solely on the current state.
- n-th order Markov chain: the probabilities depend on the n previous states.
8HIDDEN MARKOV MODEL
5 Components
- A set of states
- Matrix of transition probabilities ( A )
- An alphabet of symbols ( C )
- Set of emission distributions ( e )
- Initial state distribution ( B )
9 Example of hidden Markov model
- A C A - - - A T G
- T C A A C T A T C
- A C A C - - A G C
- A G A - - - A T C
- A C C G - - A T C
Why the name "hidden"? There is NO 1:1 correspondence
between states and symbols.
10Components
- State k
- Emits symbols (observables) C
- PROBABILISTIC MODEL
- Emission Distribution e
- Initial state distribution B
- Transition probabilities A
11Path?
- Different state paths can generate the same sequence.
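To make the idea of multiple paths concrete, here is a small sketch that enumerates every state path for a short sequence in a toy two-state HMM and computes the joint probability of each. The states, probabilities, and the sequence "ACG" are assumptions chosen for illustration only.

```python
from itertools import product

# Toy HMM; every number here is an illustrative assumption.
STATES = ("coding", "noncoding")
INITIAL = {"coding": 0.5, "noncoding": 0.5}
TRANSITION = {
    "coding": {"coding": 0.8, "noncoding": 0.2},
    "noncoding": {"coding": 0.3, "noncoding": 0.7},
}
EMISSION = {
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def joint_probability(sequence, path):
    """P(sequence, path): one initial term, then transition * emission per step."""
    prob = INITIAL[path[0]] * EMISSION[path[0]][sequence[0]]
    for i in range(1, len(sequence)):
        prob *= TRANSITION[path[i - 1]][path[i]] * EMISSION[path[i]][sequence[i]]
    return prob


def all_paths(sequence):
    """Map every possible state path to its joint probability."""
    return {
        path: joint_probability(sequence, path)
        for path in product(STATES, repeat=len(sequence))
    }


paths = all_paths("ACG")
best_path = max(paths, key=paths.get)
```

Every one of the 2^3 paths has non-zero probability here, which is why the state sequence is "hidden". Brute-force enumeration grows exponentially with sequence length, so it is only useful for tiny examples.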
12In EHMM
- Emission distribution
- e specified by
- Evolutionary model Ek
- Phylogenetic tree T
13 PHYLOGENETIC TREES
14Motivation: the problem of explaining the
evolutionary history of today's species
- In Phylogenetic trees
- Leaves represent present day species
- Character states of inner nodes are missing data
- Interior nodes represent hypothesized ancestors
- The lengths of the branches of a tree represent the
evolutionary difference.
15Evolution is often modeled by continuous-time Markov
chains. Here, evolution along the branches
of the phylogenetic tree is modeled by
Ek, with transition probability matrix Pk(t). For a branch
of length t, $P_k(t) = \exp(t Q_k)$, where Q_k is the rate matrix of Ek.
Increasing the number of sequences
increases the amount of evolutionary
information.
THE ALIGNMENT COLUMN CORRESPONDS
TO THE STATE OF EVOLUTION AT THE LEAVES OF THE
PHYLOGENETIC TREE
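The relation between a rate matrix and the transition probabilities on a branch can be sketched numerically. The example below assumes a Jukes-Cantor rate matrix (every substitution has the same rate), which is far simpler than the codon and nucleotide models used in the paper, and evaluates the matrix exponential with a truncated power series. The function names and the rate and branch-length values are assumptions made for this sketch.

```python
import math


def jc_rate_matrix(rate):
    """Jukes-Cantor rate matrix: equal off-diagonal rates, rows sum to zero."""
    return [[-3.0 * rate if i == j else rate for j in range(4)] for i in range(4)]


def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(q, t, terms=40):
    """exp(t * q) by a truncated power series; adequate for short branches."""
    n = len(q)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]            # current series term, starts at I
    tq = [[t * x for x in row] for row in q]
    for k in range(1, terms):
        # term_k = term_{k-1} * (t q) / k
        term = [[x / k for x in row] for row in mat_mul(term, tq)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result


def jc_closed_form(rate, t, same):
    """Known closed-form Jukes-Cantor transition probability, used as a check."""
    decay = math.exp(-4.0 * rate * t)
    return 0.25 + 0.75 * decay if same else 0.25 - 0.25 * decay


P = mat_exp(jc_rate_matrix(1.0), 0.5)
```

Each row of `P` is a probability distribution over the nucleotide at the end of the branch. A production implementation would normally use an eigendecomposition or a library routine rather than a raw power series.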
16 THE PROBABILITY OF GENERATING AN ALIGNMENT
COLUMN IN STATE K EQUALS THE PROBABILITY OF
OBSERVING A GIVEN CHARACTER PATTERN ON THE
LEAVES OF T, GIVEN Ek
Phylogenetic tree of the entries of the 3
alignment columns
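The probability of a character pattern on the leaves of a tree is commonly computed with Felsenstein's pruning algorithm, which sums over the unobserved characters at the inner nodes. The sketch below assumes a Jukes-Cantor model, a nested-tuple tree encoding, made-up branch lengths, and a hypothetical third species ("rat"); none of these come from the paper. Gaps are given likelihood 1 for every character, matching the missing-data treatment mentioned in the slides.

```python
import math

NUCS = "ACGT"


def jc_prob(a, b, t):
    """Jukes-Cantor probability of a -> b over branch length t
    (t in expected substitutions per site)."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay


def partial_likelihoods(node, column):
    """Likelihood of the leaves below `node` for each character at `node`.

    A leaf is a species name; an inner node is the tuple
    (left_child, left_branch_length, right_child, right_branch_length).
    """
    if isinstance(node, str):
        observed = column[node]
        if observed == "-":                       # gap treated as missing data
            return {x: 1.0 for x in NUCS}
        return {x: 1.0 if x == observed else 0.0 for x in NUCS}
    left, t_left, right, t_right = node
    lik_left = partial_likelihoods(left, column)
    lik_right = partial_likelihoods(right, column)
    return {
        x: sum(jc_prob(x, y, t_left) * lik_left[y] for y in NUCS)
        * sum(jc_prob(x, z, t_right) * lik_right[z] for z in NUCS)
        for x in NUCS
    }


def column_probability(tree, column):
    """Emission probability of one alignment column (uniform root distribution)."""
    root = partial_likelihoods(tree, column)
    return sum(0.25 * root[x] for x in NUCS)


# Hypothetical tree and branch lengths, for illustration only.
TREE = ("human", 0.3, ("mouse", 0.1, "rat", 0.1), 0.2)
p_column = column_probability(TREE, {"human": "A", "mouse": "A", "rat": "G"})
```

In an EHMM each state would supply its own evolutionary model in place of `jc_prob`, so the same column can receive different emission probabilities in different states.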
17- A codon-based evolutionary model is used to calculate
the emission probability of the columns of A.
- A nucleotide-based evolutionary model is used to
calculate the emission probability of column B.
- The emission probability of C is obtained from the
equilibrium distribution of the relevant evolutionary model.
18Parameter Estimation
- Parameters of the HMM are estimated by a combination of
- Baum-Welch
- Powell
- Evolutionary model E
- divided into
- E equ
- E evo
19The initial state distribution B can be estimated by
Baum-Welch, but it is generally set to 0.00001
for all states except the intergenic state.
The expectation step of Baum-Welch estimates
- the number of nucleotides emitted from each state
- the expected number of state transitions
- the expected number of times a state is used.
Powell, another optimization method,
estimates E evo and the
phylogenetic tree T. The Baum-Welch method is used
to estimate E equ and
A.
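The expectation step can be illustrated on a toy HMM with the forward-backward recursions. The sketch below is a simplification: it works on a single nucleotide sequence rather than alignment columns, it uses no numerical scaling (so it only suits short inputs), and the two states with their transition and emission probabilities are assumptions. Only the initial distribution follows the slide, with 0.00001 for the non-intergenic state.

```python
# Toy model; transition and emission numbers are illustrative assumptions.
STATES = ("intergenic", "coding")
INITIAL = {"intergenic": 0.99999, "coding": 0.00001}
TRANSITION = {
    "intergenic": {"intergenic": 0.9, "coding": 0.1},
    "coding": {"intergenic": 0.2, "coding": 0.8},
}
EMISSION = {
    "intergenic": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
}


def forward(seq):
    """f[i][s] = P(x_1..x_i, state_i = s)."""
    f = [{s: INITIAL[s] * EMISSION[s][seq[0]] for s in STATES}]
    for sym in seq[1:]:
        prev = f[-1]
        f.append({s: EMISSION[s][sym] * sum(prev[r] * TRANSITION[r][s] for r in STATES)
                  for s in STATES})
    return f


def backward(seq):
    """b[i][s] = P(x_{i+1}..x_n | state_i = s)."""
    b = [{s: 1.0 for s in STATES}]
    for sym in reversed(seq[1:]):
        nxt = b[0]
        b.insert(0, {s: sum(TRANSITION[s][r] * EMISSION[r][sym] * nxt[r] for r in STATES)
                     for s in STATES})
    return b


def expected_counts(seq):
    """Expected state usage and expected transition counts (the E-step)."""
    f, b = forward(seq), backward(seq)
    likelihood = sum(f[-1].values())
    usage = {s: sum(f[i][s] * b[i][s] for i in range(len(seq))) / likelihood
             for s in STATES}
    transitions = {
        (r, s): sum(f[i][r] * TRANSITION[r][s] * EMISSION[s][seq[i + 1]] * b[i + 1][s]
                    for i in range(len(seq) - 1)) / likelihood
        for r in STATES for s in STATES
    }
    return usage, transitions, likelihood


SEQ = "ACGTAC"
usage, transitions, likelihood = expected_counts(SEQ)
```

The expected usages sum to the sequence length and the expected transition counts to one less, a useful sanity check. The maximization step would renormalize these counts into new transition probabilities. A derivative-free optimizer such as Powell's method is a natural fit for parameters that lack such closed-form updates.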
20The likelihood of an alignment ( x ) given
a parameterization of the EHMM is therefore found by
summing over all possible state paths.
This can be done by dynamic programming
in time linear in the length of the alignment.
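Written out in assumed notation (θ for the parameterization and π for a state path, neither taken from the slides), the sum over paths is $P(x \mid \theta) = \sum_{\pi} P(x, \pi \mid \theta)$. The forward algorithm evaluates it in one pass over the alignment columns. The sketch below uses a toy two-state model over pairwise alignment columns; the states, the hand-set column emission probabilities, and the example alignment are illustrative assumptions, where a real EHMM would obtain emissions from an evolutionary model and a tree.

```python
NUCS = "ACGT"

# Toy two-state model over pairwise alignment columns such as "AA" or "GT".
STATES = ("conserved", "neutral")
INITIAL = {"conserved": 0.5, "neutral": 0.5}
TRANSITION = {
    "conserved": {"conserved": 0.9, "neutral": 0.1},
    "neutral": {"conserved": 0.1, "neutral": 0.9},
}


def emission(state, column):
    """Probability of a two-row column; sums to 1 over the 16 possible columns."""
    if state == "neutral":
        return 1.0 / 16.0
    # "conserved" puts 0.8 of its mass on the 4 matching columns.
    return 0.2 if column[0] == column[1] else 0.2 / 12.0


def alignment_likelihood(columns):
    """Forward algorithm: sums over all state paths, one pass over the columns."""
    forward = {s: INITIAL[s] * emission(s, columns[0]) for s in STATES}
    for column in columns[1:]:
        forward = {
            s: emission(s, column) * sum(forward[r] * TRANSITION[r][s] for r in STATES)
            for s in STATES
        }
    return sum(forward.values())


likelihood = alignment_likelihood(["AA", "CC", "GT"])
```

The work per column is constant for a fixed number of states, so the total cost grows linearly with alignment length. Long alignments would need log-space arithmetic or scaling to avoid underflow, which this sketch omits.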
21The EHMM is fully probabilistic and can be used to
simulate data and find genes.
- The EUKARYOTIC GENOME MODEL can be used to generate
alignments.
- The reduced model produces only inner exons.
eukaryotic EHMM
22Results
- Benefits of modeling evolution with an EHMM were assessed
using a data set of orthologous
mouse/human gene pairs.
- The benefit will depend on the divergence between
the sequences compared.
- A key parameter for modelling the difference
between exons and introns is the dN/dS ratio.
24Moreover, the evolutionary model shows a
distinct difference between the intergenic/intron
state and the codon state.
25Evaluations were performed on both single and
aligned sequences
26Graphical Representation
27The simple model used now is not comparable to
state-of-the-art methods.
Any number of aligned sequences can be handled.
28- GENSCAN can be extended into an EHMM
- Splice site finders
- Models of ribosome binding sites and promoter
regions
- Non-geometric length distributions of exons
- Pseudo higher-order EHMMs can be constructed.
- The idea of the pair HMM can be extended to multiple sequences
29Disadvantages of the present model
- The existing framework does not model gaps but
treats them as missing data.
- The optimal data for an EHMM is a multiple alignment of
full-length genomes.
- The challenge in constructing the alignment is to
reduce the noise-to-signal ratio.
- BUT ..