Gene finding with GeneMark.HMM (Lukashin & Borodovsky, 1997)
Transcript and Presenter's Notes

1
Gene finding with GeneMark.HMM (Lukashin & Borodovsky, 1997)
  • CS 466
  • Saurabh Sinha

2
Gene finding in bacteria
  • Large number of bacterial genomes sequenced (10
    at the time of the paper, 1997)
  • Previous work: the GeneMark program identified genes
    as ORFs that look more like genes than non-genes
  • Uses Markov chains of coding and non-coding
    sequence
  • 5' (start) boundary not well predicted
  • Resolution of the start point: 100 nucleotides

3
GeneMark.hmm
  • Builds on GeneMark, but uses an HMM for better
    prediction of gene start and stop
  • Given DNA sequence S = b1, b2, ..., bL
  • Find functional sequence A = a1, ..., aL, where each
    ai = 0 if non-coding, 1 if coding on the forward
    strand, 2 if coding on the reverse strand
  • Sounds like the Fair Bet Casino problem (sequence
    of coin types: fair or biased)
  • Find Pr(A | S) and report the A that maximizes it

4
Functional sequence
  • A carries information about where the function
    switches from coding to non-coding (end of a gene)
    and vice versa
  • Model the sequence by an HMM with different states
    for coding and non-coding
  • The maximum-likelihood A is the optimal path
    through the HMM, given the sequence
  • Use the Viterbi algorithm to solve this problem
    (a minimal sketch follows)

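As a refresher, here is a minimal sketch of the standard Viterbi recursion, with one emission per state as in the Fair Bet Casino setting; the function and parameter names and the toy numbers are illustrative, not taken from the paper or the lecture.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable state path for a standard HMM, one emission per state (log space)."""
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    ptr = [{}]
    for i in range(1, len(obs)):
        V.append({})
        ptr.append({})
        for s in states:
            # Best previous state for reaching s at position i
            best_prev = max(states, key=lambda r: V[i - 1][r] + math.log(trans_p[r][s]))
            ptr[i][s] = best_prev
            V[i][s] = (V[i - 1][best_prev] + math.log(trans_p[best_prev][s])
                       + math.log(emit_p[s][obs[i]]))
    # Trace back from the best final state
    path = [max(states, key=lambda s: V[-1][s])]
    for i in range(len(obs) - 1, 0, -1):
        path.append(ptr[i][path[-1]])
    return list(reversed(path))

# Toy usage in the Fair Bet Casino spirit: F = fair coin, B = biased coin, observations H/T
states = ["F", "B"]
start_p = {"F": 0.5, "B": 0.5}
trans_p = {"F": {"F": 0.9, "B": 0.1}, "B": {"F": 0.1, "B": 0.9}}
emit_p = {"F": {"H": 0.5, "T": 0.5}, "B": {"H": 0.75, "T": 0.25}}
print(viterbi("HHTHHHHTTT", states, start_p, trans_p, emit_p))
```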
5
(No Transcript)
6
Hidden Markov Model
  • In some states, choose (i) a length of sequence
    to emit and (ii) the sequence to emit
  • This is different from the Fair Bet Casino
    problem. There, each state emitted exactly one
    observation (H or T)

7
Hidden Markov Model
  • "Typical" and "Atypical" gene states (one of each
    for the forward and reverse strands)
  • These two states emit coding sequence (between
    and excluding the start and stop codons) with
    different codon usage patterns
  • Clustering of E. coli genes showed that
    • the majority of genes belong to one cluster
      ("Typical")
    • many genes, believed to have been horizontally
      transferred into the genome, belong to another
      cluster ("Atypical")

8
Hidden State Trajectory A
  • This is similar to the functional sequence
    defined earlier
  • except that we have one entry per state visit, not
    one per nucleotide
  • Sequence of M hidden states ai, each having
    duration di
  • A = (a1, d1), (a2, d2), ..., (aM, dM)
  • Σ di = L
  • Find the A that maximizes Pr(A | S)

9
Formulation
  • Find trajectory (path) A that has the highest
    probability of occurring simultaneously with the
    sequence S
  • Maximizing Pr(A, S) is the same as maximizing
    Pr(A | S). Why? (see the one-line answer below)

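The answer in one line: for a fixed sequence S, Pr(S) is a constant, so dividing by it cannot change which trajectory is the maximizer.

  Pr(A | S) = Pr(A, S) / Pr(S), hence argmax_A Pr(A | S) = argmax_A Pr(A, S)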
10
Solution
  • Maximization problem solved by Viterbi algorithm
    (seen in previous lecture)

11
Solution
  • Maximize the joint probability over all possible
    trajectories:
  • A* = argmax_A Pr(A, S)
12
Solution
  • The joint probability factors along the trajectory:
  • Pr(A, S) = ∏_{m=1..M} Pr(a_m | a_{m-1}) Pr(d_m | a_m) Pr(s_m | a_m, d_m)
  • i.e., for each state in the trajectory: transition
    prob. × prob. of duration × prob. of the emitted
    sequence s_m (the d_m nucleotides emitted by state a_m)
  • Define (for dynamic programming): the joint
    probability of a partial trajectory of m states (with
    the last state being a_m) and a partial sequence of
    length l
13
Solution
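A compact Python sketch of the dynamic-programming recursion set up on the previous slide, with the quantity above called gamma; the transition, duration, and emission models are passed in as log-probability callables, and every name here is illustrative rather than the paper's.

```python
def semi_hmm_viterbi(seq, states, log_trans, log_dur, log_emit, max_dur):
    """gamma[l][a]: best log joint probability of a partial trajectory ending in state a
    that explains the first l symbols of seq; back[l][a] stores (previous state, duration)."""
    L = len(seq)
    NEG = float("-inf")
    gamma = [{a: NEG for a in states} for _ in range(L + 1)]
    back = [{a: None for a in states} for _ in range(L + 1)]
    for a in states:
        gamma[0][a] = 0.0                      # empty prefix (initial distribution folded into log_trans)
    for l in range(1, L + 1):
        for a in states:
            for d in range(1, min(max_dur, l) + 1):
                segment = seq[l - d:l]         # the d nucleotides this state would emit
                for a_prev in states:
                    score = (gamma[l - d][a_prev]
                             + log_trans(a_prev, a)   # transition prob.
                             + log_dur(a, d)          # prob. of duration
                             + log_emit(a, segment))  # prob. of the emitted sequence
                    if score > gamma[l][a]:
                        gamma[l][a] = score
                        back[l][a] = (a_prev, d)
    # Trace back the optimal trajectory as a list of (state, duration) pairs
    a = max(states, key=lambda s: gamma[L][s])
    l, path = L, []
    while l > 0 and back[l][a] is not None:
        a_prev, d = back[l][a]
        path.append((a, d))
        l, a = l - d, a_prev
    return list(reversed(path))
```

Runtime of this sketch is O(L · Dmax · K²) for K states and maximum duration Dmax, which is why the duration has to be bounded in practice.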
14
Parameters of the HMM
  • Transition probability distributions, emission
    probability distributions
  • Fixed a priori
  • What was the other possibility?
  • Learn parameters from data
  • Emission probabilities of the coding-sequence state
    obtained from previous statistical studies: what
    does a coding sequence look like in general?
  • Emission probabilities of non-coding sequence
    obtained similarly

15
Parameters of the HMM
  • Probability that a state a has duration d
    (i.e., length of emission is d) is learned from
    frequency distribution of lengths of known coding
    sequences

16
Parameters of the HMM
  • and non-coding sequences

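A minimal sketch of how such a duration distribution could be estimated, simply by normalizing the observed length counts; the variable names and the choice to skip smoothing are illustrative.

```python
from collections import Counter

def empirical_duration_dist(observed_lengths, max_len):
    """Pr(duration = d) for a state, from lengths of known coding (or non-coding) segments."""
    counts = Counter(d for d in observed_lengths if 1 <= d <= max_len)
    total = sum(counts.values())
    return {d: counts[d] / total for d in range(1, max_len + 1)}

# e.g. coding_dur = empirical_duration_dist(known_gene_lengths, max_len=7000)  # hypothetical input
```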
17
Parameters of the HMM
  • Emission probabilities of start codon fixed from
    previous studies
  • Pr(ATG) = 0.905, Pr(GTG) = 0.090, Pr(TTG) = 0.005
  • Transition probabilities: non-coding to
    Typical/Atypical coding state = 0.85/0.15

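The fixed values quoted on this slide, collected into a small configuration snippet for reference; the dictionary layout and the names are mine, only the numbers come from the slide.

```python
# Start codon emission probabilities (fixed from previous studies, per the slide)
START_CODON_PROBS = {"ATG": 0.905, "GTG": 0.090, "TTG": 0.005}

# Transition probabilities from the non-coding state into the two coding states
TRANS_FROM_NONCODING = {"typical": 0.85, "atypical": 0.15}
```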
18
Post-processing
  • As per the HMM, two genes cannot overlap. In
    reality, genes may overlap!

[Diagram: two overlapping genes, G1 and G2]
19
Post-processing
  • As per the HMM, two genes cannot overlap. In
    reality, genes may overlap!

[Diagram: overlapping genes G1 and G2; the HMM will predict
the second gene to begin after the end of the first]
What about the start codon for that second gene?
20
Post-processing
  • As per the HMM, two genes cannot overlap. In
    reality, genes may overlap!

[Diagram: overlapping genes G1 and G2]
  • Look for an RBS somewhere in this region
  • Take each start codon in this region, and find an
    RBS -19 to -4 bp upstream of it (a sketch of this
    search follows)

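A sketch of that idea: enumerate in-frame candidate start codons in the truncated region and pull out the window 19 to 4 bp upstream of each, to be scored for an RBS later. The reading-frame assumption and all names here are illustrative.

```python
START_CODONS = ("ATG", "GTG", "TTG")

def candidate_starts_with_upstream(seq, region_start, region_end):
    """Yield (position, upstream_window) for each candidate start codon in the region.
    Positions are assumed to already be in the reading frame of the downstream gene;
    the window covers -19 to -4 bp relative to the codon (16 bp)."""
    for pos in range(region_start, region_end - 2, 3):
        if seq[pos:pos + 3] in START_CODONS and pos >= 19:
            yield pos, seq[pos - 19:pos - 3]
```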
21
Ribosome binding site (RBS)
22
How to search for RBS?
  • Take 325 genes from E. coli (bacterium) with
    known RBS
  • Align them using sequence alignment
  • Build a PWM (position weight matrix) from the
    alignment and use it to scan for the RBS (sketch
    below)

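A sketch of the PWM step: estimate per-position base frequencies from the aligned known RBS sites, convert to log-odds against a uniform background, and take the best-scoring placement within an upstream window. The alignment itself is assumed already done; the names and the pseudocount are illustrative.

```python
import math
from collections import Counter

BASES = "ACGT"

def build_pwm(aligned_sites, pseudocount=1.0):
    """Log-odds PWM from equal-length aligned RBS sequences (uniform 0.25 background)."""
    width = len(aligned_sites[0])
    pwm = []
    for i in range(width):
        col = Counter(site[i] for site in aligned_sites)
        total = len(aligned_sites) + 4 * pseudocount
        pwm.append({b: math.log((col[b] + pseudocount) / total / 0.25) for b in BASES})
    return pwm

def best_pwm_score(window, pwm):
    """Best score of the PWM over all placements inside the upstream window."""
    width = len(pwm)
    scores = [sum(pwm[i][window[j + i]] for i in range(width))
              for j in range(len(window) - width + 1)]
    return max(scores) if scores else float("-inf")
```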
23
Gene prediction in different species
  • The coding and non-coding state emission
    probabilities need to be trained separately for each
    species in order to predict genes in that species

24
Gene prediction accuracy
  • Data set 1: all annotated E. coli genes
  • Data set 2: non-overlapping genes
  • Data set 3: genes with known RBS
  • Data set 4: genes with known start positions

25
Results
VA = Viterbi algorithm; PP = with post-processing
26
Results
  • Gene overlap is an important factor
  • Performance goes up from 58% to 71% when
    overlapping genes are excluded from the data set
  • Post-processing helps a lot
  • 58% --> 75% for data set 1
  • Missing genes (false negatives): < 5%
  • Wrong gene predictions (false positives): 8%
  • Are they really false positives, or are they
    unannotated genes?

27
Results
  • Compared with other programs

28
Results
  • Robustness to parameter settings
  • An alternative set of transition probability values
    was used
  • Little change in performance (a 20% change in
    parameter values leads to < 5% change in
    performance)

29
Higher Order Markov models
  • Sequence emissions were modeled by a second-order
    Markov chain:
  • Pr(X_i | X_{i-1}, X_{i-2}, ..., X_1) = Pr(X_i | X_{i-1}, X_{i-2})
  • Examined the effect of changing the Markov
    order (0, 1, 3, 4, 5)
  • Even a zeroth-order Markov chain does pretty well
    (a code sketch of the second-order model follows)

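A sketch of the second-order model in code: estimate Pr(next base | previous two bases) from training sequences, then score a candidate segment by summing log probabilities. The pseudocount smoothing and all names are illustrative.

```python
import math
from collections import Counter

def train_order2(training_seqs, pseudocount=1.0):
    """Return a function prob(context, base) approximating Pr(base | previous two bases)."""
    pair_counts, context_totals = Counter(), Counter()
    for s in training_seqs:
        for i in range(2, len(s)):
            pair_counts[(s[i - 2:i], s[i])] += 1
            context_totals[s[i - 2:i]] += 1
    def prob(context, base):
        return (pair_counts[(context, base)] + pseudocount) / (context_totals[context] + 4 * pseudocount)
    return prob

def log_prob_order2(seq, prob):
    """Log probability of seq under the chain, treating the first two bases as given."""
    return sum(math.log(prob(seq[i - 2:i], seq[i])) for i in range(2, len(seq)))
```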
30
Higher Order Markov models