Hidden Markov Models in Bioinformatics - PowerPoint PPT Presentation

1
Hidden Markov Models in Bioinformatics
  • Example Domain Gene Finding
  • Colin Cherry
  • colinc_at_cs

2
To recap last episode
  • Hidden Markov Models (HMMs)
  • Protein Family Characterization
  • Profile HMMs for protein family characterization
  • How profile HMMs can do homology search

3
...picking up where we left off
  • Profile HMMs were good to start with
  • Today's goal: Introduce HMMs as general tools in
    bioinformatics
  • I will use the problem of Gene Finding as an
    example of an ideal HMM problem domain

4
Learning Objectives
  • When I'm done, you should know:
  • When is an HMM a good fit for a problem space?
  • What materials are needed before work can begin
    with an HMM?
  • What are the advantages and disadvantages of
    using HMMs?
  • What are the general objectives and challenges in
    the gene finding task?

5
Outline
  • HMMs as Statistical Models
  • The Gene Finding task at a glance
  • Good problems for HMMs
  • HMM Advantages
  • HMM Disadvantages
  • Gene Finding Examples

6
Statistical Models
  • Definition
  • Any mathematical construct that attempts to
    parameterize a random process
  • Example: A normal distribution
  • Assumptions
  • Parameters
  • Estimation
  • Usage
  • HMMs are just a little more complicated

7
HMM Assumptions
  • Observations are ordered
  • Random process can be represented by a stochastic
    finite state machine with emitting states.

8
HMM Parameters
  • Using weather example
  • Modeling daily weather for a year
  • Ra Ra Su Su Su Ra..
  • Lots of parameters: one for each table entry
  • Represented in two tables:
  • One for emissions
  • One for transitions
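The two tables can be sketched concretely. All numbers below are hypothetical, and the Hi/Lo hidden states are invented for illustration; the slides only show the Ra/Su observation symbols:

```python
# Minimal sketch of HMM parameter tables (hypothetical numbers).
# Hidden states: "Hi" / "Lo" pressure; emitted symbols: "Ra" (rain), "Su" (sun).

# Transition table: P(next state | current state); each row sums to 1.
transitions = {
    "Hi": {"Hi": 0.7, "Lo": 0.3},
    "Lo": {"Hi": 0.4, "Lo": 0.6},
}

# Emission table: P(symbol | state); each row sums to 1.
emissions = {
    "Hi": {"Ra": 0.1, "Su": 0.9},
    "Lo": {"Ra": 0.8, "Su": 0.2},
}

# One free parameter per table entry, minus one per row for normalization:
n_states, n_symbols = 2, 2
free_params = n_states * (n_states - 1) + n_states * (n_symbols - 1)
print(free_params)  # 4
```

Every entry in the two tables is a parameter the training process must estimate, which is why parameter count drives how much training data is needed.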

9
HMM Estimation
  • Called training, it falls under machine learning
  • Feed an architecture (given in advance) a set of
    observation sequences
  • The training process will iteratively alter its
    parameters to fit the training set
  • The trained model will assign the training
    sequences high probability
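A minimal sketch of the labeled-data case, where estimation reduces to normalized counting. For unlabeled sequences the standard iterative approach is Baum-Welch (EM), not shown here; the states, symbols, and data below are hypothetical:

```python
from collections import defaultdict

def estimate_from_labeled(sequences):
    """Maximum-likelihood HMM tables from labeled (state, symbol) sequences.
    With unlabeled data, training would instead iterate Baum-Welch (EM)."""
    trans = defaultdict(lambda: defaultdict(int))
    emit = defaultdict(lambda: defaultdict(int))
    for seq in sequences:                      # seq = [(state, symbol), ...]
        for state, symbol in seq:
            emit[state][symbol] += 1           # count emissions
        for (s1, _), (s2, _) in zip(seq, seq[1:]):
            trans[s1][s2] += 1                 # count transitions
    # Normalize each row of counts into probabilities.
    norm = lambda table: {s: {k: v / sum(d.values()) for k, v in d.items()}
                          for s, d in table.items()}
    return norm(trans), norm(emit)

# Toy labeled weather data (hypothetical Hi/Lo states for Ra/Su symbols):
data = [[("Lo", "Ra"), ("Lo", "Ra"), ("Hi", "Su"),
         ("Hi", "Su"), ("Hi", "Su"), ("Lo", "Ra")]]
A, B = estimate_from_labeled(data)
print(B["Hi"]["Su"])  # 1.0
```

This count-based estimate assigns the training sequences high probability by construction, which is the goal the slide describes.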

10
HMM Usage
  • Two major tasks
  • Evaluate the probability of an observation
    sequence given the model (Forward)
  • Find the most likely path through the model for a
    given observation sequence (Viterbi)
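Both tasks can be sketched in a few lines over a toy model; the states, symbols, and probabilities below are hypothetical, not from the slides:

```python
# Toy two-state HMM (hypothetical parameters).
states = ["Hi", "Lo"]
start = {"Hi": 0.5, "Lo": 0.5}
trans = {"Hi": {"Hi": 0.7, "Lo": 0.3}, "Lo": {"Hi": 0.4, "Lo": 0.6}}
emit  = {"Hi": {"Ra": 0.1, "Su": 0.9}, "Lo": {"Ra": 0.8, "Su": 0.2}}

def forward(obs):
    """P(observation sequence | model), summed over all state paths."""
    alpha = {s: start[s] * emit[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s2: sum(alpha[s1] * trans[s1][s2] for s1 in states) * emit[s2][o]
                 for s2 in states}
    return sum(alpha.values())

def viterbi(obs):
    """Most likely state path for the observation sequence."""
    v = {s: (start[s] * emit[s][obs[0]], [s]) for s in states}
    for o in obs[1:]:
        v = {s2: max(((p * trans[s1][s2] * emit[s2][o], path + [s2])
                      for s1, (p, path) in v.items()), key=lambda t: t[0])
             for s2 in states}
    return max(v.values(), key=lambda t: t[0])[1]

obs = ["Ra", "Ra", "Su", "Su", "Su", "Ra"]
print(forward(obs))
print(viterbi(obs))
```

Forward answers "how likely is this sequence under the model?" (the scoring task); Viterbi answers "which states generated it?" (the labeling task used in gene finding).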

11
Gene Finding (An Ideal HMM Domain)
  • Our Objective
  • To find the coding and non-coding regions of an
    unlabeled string of DNA nucleotides
  • Our Motivation
  • Assist in the annotation of genomic data produced
    by genome sequencing methods
  • Gain insight into the mechanisms involved in
    transcription, splicing and other processes

12
Gene Finding Terminology
  • A string of DNA nucleotides containing a gene
    will have separate regions (lines)
  • Introns: non-coding regions within a gene
  • Exons: coding regions
  • Separated by functional sites (boxes)
  • Start and stop codons
  • Splice sites: acceptors and donors

13
Gene Finding Challenges
  • Need the correct reading frame
  • Introns can interrupt an exon in mid-codon
  • There is no hard and fast rule for identifying
    donor and acceptor splice sites
  • Signals are very weak

14
What makes a good HMM problem space?
  • Characteristics
  • Classification problems
  • There are two main types of output from an HMM
  • Scoring of sequences
  • (Protein family modeling)
  • Labeling of observations within a sequence
  • (Gene Finding)

15
HMM Problem Characteristics (Continued)
  • The observations in a sequence should have a
    clear and meaningful order
  • Unordered observations will not map easily to
    states
  • It's beneficial, but not necessary, for the
    observations to follow some sort of grammar
  • Makes it easier to design an architecture
  • Gene Finding
  • Protein Family Modeling

16
HMM Requirements
  • So you've decided you want to build an HMM;
    here's what you need:
  • An architecture
  • Probably the hardest part
  • Should be biologically sound and easy to interpret
  • A well-defined success measure
  • Necessary for any form of machine learning

17
HMM Requirements Continued
  • Training data
  • Labeled or unlabeled: it depends
  • You do not always need a labeled training set to
    do observation labeling, but it helps
  • Amount of training data needed is
  • Directly proportional to the number of free
    parameters in the model
  • Inversely proportional to the size of the
    training sequences
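The free-parameter count can be made concrete. For a fully connected discrete HMM, each row of the transition and emission tables must sum to 1, so one degree of freedom per row is removed; the 10-state figure below is hypothetical:

```python
def free_parameters(n_states, n_symbols):
    """Free parameters in a fully connected discrete HMM.
    Row normalization removes one degree of freedom per row."""
    transition = n_states * (n_states - 1)
    emission = n_states * (n_symbols - 1)
    return transition + emission

# A DNA model emits 4 symbols (A, C, G, T); the state count is hypothetical.
print(free_parameters(10, 4))  # 120
```

The larger this number, the more training sequences are needed to estimate the model reliably.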

18
Why HMMs might be a good fit for Gene Finding
  • Classification: Classifying observations within a
    sequence
  • Order: A DNA sequence is a series of ordered
    observations
  • Grammar / Architecture: Our grammatical structure
    (and the beginnings of our architecture) is right
    here
  • Success measure: percentage of complete exons
    correctly labeled
  • Training data: Available from various genome
    annotation projects

19
HMM Advantages
  • Statistical Grounding
  • Statisticians are comfortable with the theory
    behind hidden Markov models
  • Freedom to manipulate the training and
    verification processes
  • Mathematical / theoretical analysis of the
    results and processes
  • HMMs are still very powerful modeling tools, far
    more powerful than many statistical methods

20
HMM Advantages continued
  • Modularity
  • HMMs can be combined into larger HMMs
  • Transparency of the Model
  • Assuming an architecture with a good design
  • People can read the model and make sense of it
  • The model itself can help increase understanding

21
HMM Advantages continued
  • Incorporation of Prior Knowledge
  • Incorporate prior knowledge into the architecture
  • Initialize the model close to something believed
    to be correct
  • Use prior knowledge to constrain training process

22
How does Gene Finding make use of HMM advantages?
  • Statistics
  • Many systems alter the training process to better
    suit their success measure
  • Modularity
  • Almost all systems use a combination of models,
    each individually trained for each gene region
  • Prior Knowledge
  • A fair amount of prior biological knowledge is
    built into each architecture

23
HMM Disadvantages
  • Markov Chains
  • States are supposed to be independent
  • P(y) must be independent of P(x), and vice versa
  • This usually isn't true
  • Can get around it when relationships are local
  • Not good for RNA folding problems

24
HMM Disadvantages (Continued)
  • Standard Machine Learning Problems
  • Watch out for local maxima
  • Model may not converge to a truly optimal
    parameter set for a given training set
  • Avoid over-fitting
  • You're only as good as your training set
  • More training is not always good

25
HMM Disadvantages (Continued)
  • Speed!!!
  • Almost everything one does in an HMM involves
    enumerating all possible paths through the
    model
  • There are efficient ways to do this
  • Still slow in comparison to other methods
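The "efficient ways" are dynamic programming: Forward and Viterbi cost on the order of N²T operations instead of enumerating all N^T state paths. The numbers below are illustrative only:

```python
# Brute-force path enumeration vs. dynamic programming (illustrative only).
n_states, seq_len = 10, 1000   # hypothetical model size and sequence length

brute_force_paths = n_states ** seq_len     # all state paths: N^T (astronomical)
dp_operations = n_states ** 2 * seq_len     # Forward/Viterbi: O(N^2 * T)

print(dp_operations)  # 100000
```

Even with this speedup, the per-observation work over all state pairs keeps HMMs slower than many simpler methods.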

26
HMM Gene Finders: VEIL
  • A straight HMM Gene Finder
  • Takes advantage of grammatical structure and
    modular design
  • Uses many states that can only emit one symbol to
    get around state independence

27
HMM Gene Finders: HMMGene
  • Uses an extended HMM called a CHMM
  • CHMM: an HMM with classes
  • Takes full advantage of being able to modify the
    statistical algorithms
  • Uses high-order states
  • Trains everything at once

28
HMM Gene Finders: Genie
  • Uses a generalized HMM (GHMM)
  • Edges in model are complete HMMs
  • States can be any arbitrary program
  • States are actually neural networks specially
    designed for signal finding

29
Conclusions
  • HMMs have problems where they excel, and problems
    where they do not
  • You should consider using one if
  • Problem can be phrased as classification
  • Observations are ordered
  • The observations follow some sort of grammatical
    structure (optional)

30
Conclusions
  • Advantages
  • Statistics
  • Modularity
  • Transparency
  • Prior Knowledge
  • Disadvantages
  • State independence
  • Over-fitting
  • Local Maxima
  • Speed

31
Some final words
  • Lots of problems can be phrased as classification
    problems
  • Homology search, sequence alignment
  • If an HMM does not fit, there are all sorts of
    other methods to try from ML/AI
  • Neural Networks, Decision Trees, Probabilistic
    Reasoning, and Support Vector Machines have all
    been applied to Bioinformatics

32
Questions
  • Any Questions?