Hidden Markov Models in Bioinformatics - PowerPoint PPT Presentation

1
Hidden Markov Models in Bioinformatics
  • Example Domain Gene Finding
  • Colin Cherry
  • colinc_at_cs

2
To recap last episode
  • Hidden Markov Models (HMMs)
  • Protein Family Characterization
  • Profile HMMs for protein family characterization
  • How profile HMMs can do homology search

3
...picking up where we left off
  • Profile HMMs were good to start with
  • Today's goal: Introduce HMMs as general tools in
    bioinformatics
  • I will use the problem of Gene Finding as an
    example of an ideal HMM problem domain

4
Learning Objectives
  • When I'm done, you should know:
  • When is an HMM a good fit for a problem space?
  • What materials are needed before work can begin
    with an HMM?
  • What are the advantages and disadvantages of
    using HMMs?
  • What are the general objectives and challenges in
    the gene finding task?

5
Outline
  • HMMs as Statistical Models
  • The Gene Finding task at a glance
  • Good problems for HMMs
  • HMM Advantages
  • HMM Disadvantages
  • Gene Finding Examples

6
Statistical Models
  • Definition
  • Any mathematical construct that attempts to
    parameterize a random process
  • Example: A normal distribution
  • Assumptions
  • Parameters
  • Estimation
  • Usage
  • HMMs are just a little more complicated

7
HMM Assumptions
  • Observations are ordered
  • Random process can be represented by a stochastic
    finite state machine with emitting states.

8
HMM Parameters
  • Using weather example
  • Modeling daily weather for a year
  • Ra Ra Su Su Su Ra..
  • Lots of parameters: one for each table entry
  • Represented in two tables:
  • One for emissions
  • One for transitions
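The two tables can be sketched concretely. All numbers below are hypothetical, and the Hi/Lo hidden states are invented for illustration; the slides only show the Ra/Su observation symbols:

```python
# Minimal sketch of HMM parameter tables (hypothetical numbers).
# Hidden states: "Hi" / "Lo" pressure; emitted symbols: "Ra" (rain), "Su" (sun).

# Transition table: P(next state | current state); each row sums to 1.
transitions = {
    "Hi": {"Hi": 0.7, "Lo": 0.3},
    "Lo": {"Hi": 0.4, "Lo": 0.6},
}

# Emission table: P(symbol | state); each row sums to 1.
emissions = {
    "Hi": {"Ra": 0.1, "Su": 0.9},
    "Lo": {"Ra": 0.8, "Su": 0.2},
}

# One free parameter per table entry, minus one per row for normalization:
n_states, n_symbols = 2, 2
free_params = n_states * (n_states - 1) + n_states * (n_symbols - 1)
print(free_params)  # 4
```

Every entry in the two tables is a parameter the training process must estimate, which is why parameter count drives how much training data is needed.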

9
HMM Estimation
  • Called training, it falls under machine learning
  • Feed an architecture (given in advance) a set of
    observation sequences
  • The training process will iteratively alter its
    parameters to fit the training set
  • The trained model will assign the training
    sequences high probability
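A minimal sketch of the labeled-data case, where estimation reduces to normalized counting. For unlabeled sequences the standard iterative approach is Baum-Welch (EM), not shown here; the states, symbols, and data below are hypothetical:

```python
from collections import defaultdict

def estimate_from_labeled(sequences):
    """Maximum-likelihood HMM tables from labeled (state, symbol) sequences.
    With unlabeled data, training would instead iterate Baum-Welch (EM)."""
    trans = defaultdict(lambda: defaultdict(int))
    emit = defaultdict(lambda: defaultdict(int))
    for seq in sequences:                      # seq = [(state, symbol), ...]
        for state, symbol in seq:
            emit[state][symbol] += 1           # count emissions
        for (s1, _), (s2, _) in zip(seq, seq[1:]):
            trans[s1][s2] += 1                 # count transitions
    # Normalize each row of counts into probabilities.
    norm = lambda table: {s: {k: v / sum(d.values()) for k, v in d.items()}
                          for s, d in table.items()}
    return norm(trans), norm(emit)

# Toy labeled weather data (hypothetical Hi/Lo states for Ra/Su symbols):
data = [[("Lo", "Ra"), ("Lo", "Ra"), ("Hi", "Su"),
         ("Hi", "Su"), ("Hi", "Su"), ("Lo", "Ra")]]
A, B = estimate_from_labeled(data)
print(B["Hi"]["Su"])  # 1.0
```

This count-based estimate assigns the training sequences high probability by construction, which is the goal the slide describes.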

10
HMM Usage
  • Two major tasks
  • Evaluate the probability of an observation
    sequence given the model (Forward)
  • Find the most likely path through the model for a
    given observation sequence (Viterbi)
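Both tasks can be sketched in a few lines over a toy model; the states, symbols, and probabilities below are hypothetical, not from the slides:

```python
# Toy two-state HMM (hypothetical parameters).
states = ["Hi", "Lo"]
start = {"Hi": 0.5, "Lo": 0.5}
trans = {"Hi": {"Hi": 0.7, "Lo": 0.3}, "Lo": {"Hi": 0.4, "Lo": 0.6}}
emit  = {"Hi": {"Ra": 0.1, "Su": 0.9}, "Lo": {"Ra": 0.8, "Su": 0.2}}

def forward(obs):
    """P(observation sequence | model), summed over all state paths."""
    alpha = {s: start[s] * emit[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s2: sum(alpha[s1] * trans[s1][s2] for s1 in states) * emit[s2][o]
                 for s2 in states}
    return sum(alpha.values())

def viterbi(obs):
    """Most likely state path for the observation sequence."""
    v = {s: (start[s] * emit[s][obs[0]], [s]) for s in states}
    for o in obs[1:]:
        v = {s2: max(((p * trans[s1][s2] * emit[s2][o], path + [s2])
                      for s1, (p, path) in v.items()), key=lambda t: t[0])
             for s2 in states}
    return max(v.values(), key=lambda t: t[0])[1]

obs = ["Ra", "Ra", "Su", "Su", "Su", "Ra"]
print(forward(obs))
print(viterbi(obs))
```

Forward answers "how likely is this sequence under the model?" (the scoring task); Viterbi answers "which states generated it?" (the labeling task used in gene finding).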

11
Gene Finding (An Ideal HMM Domain)
  • Our Objective
  • To find the coding and non-coding regions of an
    unlabeled string of DNA nucleotides
  • Our Motivation
  • Assist in the annotation of genomic data produced
    by genome sequencing methods
  • Gain insight into the mechanisms involved in
    transcription, splicing and other processes

12
Gene Finding Terminology
  • A string of DNA nucleotides containing a gene
    will have separate regions (lines)
  • Introns: non-coding regions within a gene
  • Exons: coding regions
  • Separated by functional sites (boxes)
  • Start and stop codons
  • Splice sites: acceptors and donors

13
Gene Finding Challenges
  • Need the correct reading frame
  • Introns can interrupt an exon in mid-codon
  • There is no hard and fast rule for identifying
    donor and acceptor splice sites
  • Signals are very weak

14
What makes a good HMM problem space?
  • Characteristics
  • Classification problems
  • There are two main types of output from an HMM
  • Scoring of sequences
  • (Protein family modeling)
  • Labeling of observations within a sequence
  • (Gene Finding)

15
HMM Problem Characteristics (Continued)
  • The observations in a sequence should have a
    clear and meaningful order
  • Unordered observations will not map easily to
    states
  • It's beneficial, but not necessary, for the
    observations to follow some sort of grammar
  • Makes it easier to design an architecture
  • Gene Finding
  • Protein Family Modeling

16
HMM Requirements
  • So you've decided you want to build an HMM;
    here's what you need:
  • An architecture
  • Probably the hardest part
  • Should be biologically sound and easy to interpret
  • A well-defined success measure
  • Necessary for any form of machine learning

17
HMM Requirements Continued
  • Training data
  • Labeled or unlabeled: it depends
  • You do not always need a labeled training set to
    do observation labeling, but it helps
  • Amount of training data needed is
  • Directly proportional to the number of free
    parameters in the model
  • Inversely proportional to the size of the
    training sequences
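The free-parameter count can be made concrete. For a fully connected discrete HMM, each row of the transition and emission tables must sum to 1, so one degree of freedom per row is removed; the 10-state figure below is hypothetical:

```python
def free_parameters(n_states, n_symbols):
    """Free parameters in a fully connected discrete HMM.
    Row normalization removes one degree of freedom per row."""
    transition = n_states * (n_states - 1)
    emission = n_states * (n_symbols - 1)
    return transition + emission

# A DNA model emits 4 symbols (A, C, G, T); the state count is hypothetical.
print(free_parameters(10, 4))  # 120
```

The larger this number, the more training sequences are needed to estimate the model reliably.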

18
Why HMMs might be a good fit for Gene Finding
  • Classification: Classifying observations within a
    sequence
  • Order: A DNA sequence is a series of ordered
    observations
  • Grammar / Architecture: Our grammatical structure
    (and the beginnings of our architecture) is right
    here
  • Success measure: percentage of complete exons
    correctly labeled
  • Training data: Available from various genome
    annotation projects

19
HMM Advantages
  • Statistical Grounding
  • Statisticians are comfortable with the theory
    behind hidden Markov models
  • Freedom to manipulate the training and
    verification processes
  • Mathematical / theoretical analysis of the
    results and processes
  • HMMs are still very powerful modeling tools, far
    more powerful than many statistical methods

20
HMM Advantages continued
  • Modularity
  • HMMs can be combined into larger HMMs
  • Transparency of the Model
  • Assuming an architecture with a good design
  • People can read the model and make sense of it
  • The model itself can help increase understanding

21
HMM Advantages continued
  • Incorporation of Prior Knowledge
  • Incorporate prior knowledge into the architecture
  • Initialize the model close to something believed
    to be correct
  • Use prior knowledge to constrain training process

22
How does Gene Finding make use of HMM advantages?
  • Statistics
  • Many systems alter the training process to better
    suit their success measure
  • Modularity
  • Almost all systems use a combination of models,
    each individually trained for each gene region
  • Prior Knowledge
  • A fair amount of prior biological knowledge is
    built into each architecture

23
HMM Disadvantages
  • Markov Chains
  • States are supposed to be independent
  • P(y) must be independent of P(x), and vice versa
  • This usually isn't true
  • Can get around it when relationships are local
  • Not good for RNA folding problems

24
HMM Disadvantages (Continued)
  • Standard Machine Learning Problems
  • Watch out for local maxima
  • Model may not converge to a truly optimal
    parameter set for a given training set
  • Avoid over-fitting
  • You're only as good as your training set
  • More training is not always good

25
HMM Disadvantages (Continued)
  • Speed!!!
  • Almost everything one does in an HMM involves
    enumerating all possible paths through the
    model
  • There are efficient ways to do this
  • Still slow in comparison to other methods
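The "efficient ways" are dynamic programming: Forward and Viterbi cost on the order of N²T operations instead of enumerating all N^T state paths. The numbers below are illustrative only:

```python
# Brute-force path enumeration vs. dynamic programming (illustrative only).
n_states, seq_len = 10, 1000   # hypothetical model size and sequence length

brute_force_paths = n_states ** seq_len     # all state paths: N^T (astronomical)
dp_operations = n_states ** 2 * seq_len     # Forward/Viterbi: O(N^2 * T)

print(dp_operations)  # 100000
```

Even with this speedup, the per-observation work over all state pairs keeps HMMs slower than many simpler methods.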

26
HMM Gene Finders: VEIL
  • A straight HMM Gene Finder
  • Takes advantage of grammatical structure and
    modular design
  • Uses many states that can only emit one symbol to
    get around state independence

27
HMM Gene Finders: HMMGene
  • Uses an extended HMM called a CHMM
  • CHMM: an HMM with classes
  • Takes full advantage of being able to modify the
    statistical algorithms
  • Uses high-order states
  • Trains everything at once

28
HMM Gene Finders: Genie
  • Uses a generalized HMM (GHMM)
  • Edges in model are complete HMMs
  • States can be any arbitrary program
  • States are actually neural networks specially
    designed for signal finding

29
Conclusions
  • HMMs have problems where they excel, and problems
    where they do not
  • You should consider using one if
  • Problem can be phrased as classification
  • Observations are ordered
  • The observations follow some sort of grammatical
    structure (optional)

30
Conclusions
  • Advantages
  • Statistics
  • Modularity
  • Transparency
  • Prior Knowledge
  • Disadvantages
  • State independence
  • Over-fitting
  • Local Maxima
  • Speed

31
Some final words
  • Lots of problems can be phrased as classification
    problems
  • Homology search, sequence alignment
  • If an HMM does not fit, there are all sorts of
    other methods to try from ML/AI
  • Neural Networks, Decision Trees, Probabilistic
    Reasoning, and Support Vector Machines have all
    been applied to Bioinformatics

32
Questions
  • Any Questions?