Title: HMM for CpG Island Combined Model
1 HMM for CpG Islands
Arti Kelkar Pete Rossetti Peter Warren
2 HMM for CpG Islands
- HMM history
- General background
- Three Fundamental problems
- Evaluation
- Decoding
- Training
3 HMM for CpG Islands
- HMM Applications
- Bioinformatics
- Non-Bioinformatics
- CpG Islands Problem
- CpG Islands
- Definition
- Why interesting
- Hidden Markov Model for CpG
- What's Hidden
- Mathematica Implementation
- Training
- Decoding
4 Andrei Andreyevich Markov (1856-1922)
5 A. A. Markov
- Early 1900s
- Markov conceives Markov chains, including a proof of the Central Limit Theorem for Markov chains
- Studies with Chebyshev and takes over his classes at the Univ. of St. Petersburg
- 1913
- Russian government celebrates the 300th anniversary of the House of Romanov
- A. A. Markov organizes a counter-celebration: the 200th anniversary of Bernoulli's Law of Large Numbers
6 HMM History
- 1960s
- Use of HMMs developed by a Cold War-era research team in a classified program at the Communication Research Division of the Institute for Defense Analyses (Oscar Rothaus).
- 1970s
- HMM work is declassified and is soon being used in many peaceful applications.
7 Markov Chain
- Sunny yesterday => 0.5 probability that it will be sunny today, and 0.25 each that it will be cloudy or rainy
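A minimal sketch of the weather chain as a lookup table, reading the slide as 0.25 each for cloudy and rainy. Only the "sunny" row comes from the slide. The "cloudy" and "rainy" rows, and the helper name `path_probability`, are made-up placeholders so that the example runs.

```python
# TRANSITIONS[yesterday][today] = P(today | yesterday).
# Only the "sunny" row is taken from the slide; the other rows are assumed.
TRANSITIONS = {
    "sunny":  {"sunny": 0.5,  "cloudy": 0.25, "rainy": 0.25},
    "cloudy": {"sunny": 0.25, "cloudy": 0.5,  "rainy": 0.25},  # assumed
    "rainy":  {"sunny": 0.25, "cloudy": 0.25, "rainy": 0.5},   # assumed
}

def path_probability(states):
    """Probability of a weather sequence, given its first day."""
    prob = 1.0
    for yesterday, today in zip(states, states[1:]):
        prob *= TRANSITIONS[yesterday][today]
    return prob

# sunny -> sunny -> rainy: 0.5 * 0.25
print(path_probability(["sunny", "sunny", "rainy"]))  # 0.125
```

In a Markov chain the next state depends only on the current one, which is why a single table indexed by yesterday's weather is enough.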
8 Hidden Markov Model
9 HMM Definition
- Hidden Markov Model is a triplet (π, A, B)
- π: Vector of initial state probabilities
- A: Matrix of state transition probabilities
- B: Matrix of observation probabilities
- N: Number of hidden states in the model
- M: Number of observation symbols
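One way to hold the triplet in code is sketched below, with the initial-state vector written as `pi`. The class name `HMM`, the `validate` helper, and the toy numbers are illustrative assumptions, not part of the original deck.

```python
from dataclasses import dataclass
from math import isclose

@dataclass
class HMM:
    pi: list  # pi[i]   = P(first hidden state is i), length N
    A: list   # A[i][j] = P(next state is j | current state is i), N x N
    B: list   # B[i][k] = P(observing symbol k | state is i), N x M

    @property
    def N(self):
        """Number of hidden states."""
        return len(self.pi)

    @property
    def M(self):
        """Number of observation symbols."""
        return len(self.B[0])

    def validate(self):
        """Check shapes and that every distribution sums to 1."""
        assert len(self.A) == self.N and len(self.B) == self.N
        assert all(len(row) == self.N for row in self.A)
        assert all(len(row) == self.M for row in self.B)
        for dist in [self.pi, *self.A, *self.B]:
            assert isclose(sum(dist), 1.0), dist
        return True

# Toy model with N = 2 states and M = 3 symbols; all numbers are made up.
toy = HMM(pi=[0.5, 0.5],
          A=[[0.9, 0.1], [0.2, 0.8]],
          B=[[0.5, 0.25, 0.25], [0.1, 0.1, 0.8]])
toy.validate()
print(toy.N, toy.M)  # 2 3
```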
10 HMM Three Problems
- Evaluation
- Decoding
- Training
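11 HMM - Overview: Evaluation Problem
- Given a set of HMMs, which is the one most likely to have produced the observation sequence?
- Observation sequence: GACGAAACCCTGTCTCTATTTATCC
[Figure: the sequence is scored against HMM 1, HMM 2, HMM 3, ..., HMM n, giving p(HMM-1), p(HMM-2), p(HMM-3), ..., p(HMM-n)]
- The forward algorithm is used to compute each probability; the HMM with the maximum probability is chosen.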
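The evaluation step can be sketched as follows: run the forward algorithm once per candidate model and keep the model with the largest likelihood. The two candidate models below ("HMM 1" with uniform emissions, "HMM 2" with a C/G-preferring state and an A/T-preferring state) are hypothetical and exist only to make the sketch runnable.

```python
def forward(pi, A, B, obs):
    """Return P(obs | model), summing over all hidden state paths."""
    n_states = len(pi)
    # alpha[k] = P(x_1..x_i, state at position i is k)
    alpha = [pi[k] * B[k][obs[0]] for k in range(n_states)]
    for symbol in obs[1:]:
        alpha = [
            B[l][symbol] * sum(alpha[k] * A[k][l] for k in range(n_states))
            for l in range(n_states)
        ]
    return sum(alpha)

SYMBOLS = "ACGT"
seq = "GACGAAACCCTGTCTCTATTTATCC"
obs = [SYMBOLS.index(base) for base in seq]

# Hypothetical candidate models as (pi, A, B); emission columns follow SYMBOLS.
models = {
    "HMM 1": ([0.5, 0.5],
              [[0.9, 0.1], [0.1, 0.9]],
              [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]]),
    "HMM 2": ([0.5, 0.5],
              [[0.9, 0.1], [0.1, 0.9]],
              [[0.1, 0.4, 0.4, 0.1], [0.4, 0.1, 0.1, 0.4]]),
}

likelihoods = {name: forward(pi, A, B, obs)
               for name, (pi, A, B) in models.items()}
best = max(likelihoods, key=likelihoods.get)
print(best, likelihoods[best])
```

For sequences much longer than this one, the raw probabilities underflow, so a practical version would rescale `alpha` at each step or work with logarithms.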
12 HMM - Overview: Decoding Problem
- States: A+, C+, G+, T+, A-, C-, G-, T-
- Obs seq: A G C G C
[Figure: trellis with one column of the eight states per observed symbol]
13 HMM - Overview: Training Problem
- From raw sequence data to transition probabilities
[Figure: transition matrix with rows and columns A+ C+ G+ T+ A- C- G- T-]
- How?
14 HMM - Applications: Bioinformatics
- DNA Sequence analysis
- Protein family profiling
- Prediction of protein folding
- Prediction of genes
- Horizontal gene transfer
- Radiation hybrid mapping, linkage analysis
- Prediction of DNA functional sites.
- CpG island prediction
- Splicing signals prediction
15 HMM - Applications: Non-Bioinformatics
- Speech Recognition
- Vehicle Trajectory Projection
- Gesture Learning for Human-Robot Interface
- Positron Emission Tomography (PET)
- Optical Signal Detection
- Digital Communications
- Music Analysis
16 Some HMM-based Bioinformatics Resources
- PROBE www.ncbi.nlm.nih.gov/
- BLOCKS www.blocks.fhcrc.org/
- META-MEME www.cse.ucsd.edu/users/bgrundy/metameme.1.0.html
- SAM www.cse.ucsc.edu/research/compbio/sam.html
- HMMER hmmer.wustl.edu/
- HMMpro www.netid.com/
- GENEWISE www.sanger.ac.uk/Software/Wise2/
- PSI-BLAST www.ncbi.nlm.nih.gov/BLAST/newblast.html
- PFAM www.sanger.ac.uk/Pfam/
17 HMM for CpG Islands
- CpG ISLANDS
- CpG means C precedes G
- Not CG base pairs
18 HMM for CpG Islands
- Nucleotides - 4 bases in DNA
- A (Adenine)
- C (Cytosine)
- G (Guanine)
- T (Thymine)
19 HMM for CpG Islands: What's a CpG Island?
- CG-poor regions: P(CG) = 0.07!
- CG-rich region: P(CG) = 0.25
[Figure labels: gene coding region, promoter region]
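A small sketch of how such a figure could be measured on a sequence. It assumes P(CG) on this slide means the probability that a C is immediately followed by a G; the deck does not spell out the definition. The function name `prob_g_after_c` is made up for illustration.

```python
def prob_g_after_c(sequence):
    """Fraction of C's (not counting a final C) immediately followed by G."""
    pairs = list(zip(sequence, sequence[1:]))
    c_pairs = [pair for pair in pairs if pair[0] == "C"]
    if not c_pairs:
        return 0.0
    return sum(1 for pair in c_pairs if pair[1] == "G") / len(c_pairs)

# Pairs starting with C in "ACGCGT": CG, CG
print(prob_g_after_c("ACGCGT"))  # 1.0
# Pairs starting with C in "CACTCG": CA, CT, CG
print(prob_g_after_c("CACTCG"))
```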
20 HMM for CpG Islands: Why the difference?
- Away from gene regions
- The C in CG pairs is usually methylated
- Methylation inhibits gene transcription
- These CGs tend to mutate to TG
- Near promoter and coding regions
- Methylation is suppressed
- CGs remain CGs
- Makes transcription easier!
21 HMM for CpG Islands: Motivation
- CpG-rich regions are associated with genes which are frequently transcribed.
- Helps to understand gene expression related to location in genome.
22 HMM for CpG Islands: Motivation
- Q: Why an HMM?
- It can answer the questions:
- Short sequence: does it come from a CpG island or not?
- Long sequence: where are the CpG islands?
- So, what's a good model?
- Well, we need states for ISLAND bases and NON-ISLAND bases
23 HMM for CpG Islands: Straight Markov Models
CpG NON-Island (-)
CpG Island (+)
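With two separate Markov chains, a short sequence can be scored by a log-odds ratio: positive favours the island (+) chain, negative the non-island (-) chain. The sketch below uses deliberately simplified tables in which every row is uniform except the C row, whose C-to-G probability is set to 0.25 (+) and 0.07 (-). This is an assumption made for illustration, not a trained model, and `make_chain` and `log_odds` are made-up names.

```python
from math import log2

BASES = "ACGT"

def make_chain(p_c_to_g):
    """4x4 transition table: uniform rows, except that the C row uses p_c_to_g."""
    chain = {x: {y: 0.25 for y in BASES} for x in BASES}
    rest = (1.0 - p_c_to_g) / 3
    chain["C"] = {y: (p_c_to_g if y == "G" else rest) for y in BASES}
    return chain

ISLAND = make_chain(0.25)      # "+" chain (simplified, assumed)
NON_ISLAND = make_chain(0.07)  # "-" chain (simplified, assumed)

def log_odds(sequence):
    """Sum of log2(a+_xy / a-_xy) over adjacent pairs of bases."""
    return sum(log2(ISLAND[x][y] / NON_ISLAND[x][y])
               for x, y in zip(sequence, sequence[1:]))

print(log_odds("CGCG") > 0)  # True: C followed by G favours the island chain
print(log_odds("CACA") < 0)  # True: C followed by A favours the non-island chain
```

This answers the short-sequence question only. It cannot say where an island starts or ends inside a long sequence.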
24 HMM for CpG Islands: Combined Hidden Markov Model
CpG Island
CpG NON-Island
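One possible way to build the combined eight-state transition table from two four-state chains is sketched below. The switching probabilities `p_leave_island` and `p_enter_island`, and the choice to draw the next base from the destination region's chain, are modelling assumptions made for this example. The deck does not specify them.

```python
BASES = "ACGT"

def combine(island, non_island, p_leave_island, p_enter_island):
    """Merge two 4-state chains into one 8-state table over X+ and X- states."""
    table = {}
    for x in BASES:
        table[x + "+"] = {}
        table[x + "-"] = {}
        for y in BASES:
            # Stay in the same region, or switch and use the other chain.
            table[x + "+"][y + "+"] = (1 - p_leave_island) * island[x][y]
            table[x + "+"][y + "-"] = p_leave_island * non_island[x][y]
            table[x + "-"][y + "-"] = (1 - p_enter_island) * non_island[x][y]
            table[x + "-"][y + "+"] = p_enter_island * island[x][y]
    return table

# Placeholder chains (uniform) just to exercise the construction.
uniform = {x: {y: 0.25 for y in BASES} for x in BASES}
A = combine(uniform, uniform, p_leave_island=0.1, p_enter_island=0.05)
print(len(A), round(sum(A["C+"].values()), 10))  # 8 1.0
```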
25 HMM for CpG Islands: What's hidden?
Visible
Hidden
26 HMM for CpG Islands: The Three Problems
- (Evaluation not in CpG Islands)
- Training
- Decoding
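27 HMM for CpG Islands: Training Problem
How? Maximum likelihood (ML) estimation when the state path of the training data is known, or the forward/backward algorithm when it is not.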
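When the training sequences are already labeled as island (+) or non-island (-), the ML route reduces to counting transitions and normalizing each row. The sketch below shows that counting step. The toy training path, the `label` helper, and the optional `pseudocount` parameter are illustrative assumptions.

```python
from collections import Counter

STATES = ["A+", "C+", "G+", "T+", "A-", "C-", "G-", "T-"]

def estimate_transitions(labeled_paths, pseudocount=0.0):
    """ML transition probabilities from known state paths (count, then normalize)."""
    counts = Counter()
    for path in labeled_paths:
        counts.update(zip(path, path[1:]))
    table = {}
    for src in STATES:
        row = {dst: counts[(src, dst)] + pseudocount for dst in STATES}
        total = sum(row.values())
        # A state never seen as a source gets an all-zero row.
        table[src] = {dst: (row[dst] / total if total else 0.0) for dst in STATES}
    return table

def label(sequence, sign):
    """Turn 'CG' with sign '+' into the state path ['C+', 'G+']."""
    return [base + sign for base in sequence]

# Toy labeled path: an island stretch followed by a non-island stretch.
training = [label("CGCG", "+") + label("ATAT", "-")]
A = estimate_transitions(training)
print(A["C+"]["G+"])  # 1.0: C+ is always followed by G+ in this toy data
print(A["G+"]["A-"])  # 0.5: G+ is followed once by C+ and once by A-
```

With real data, a small positive `pseudocount` avoids zero probabilities for transitions that happen not to occur in the training set.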
28 HMM for CpG Islands: Decoding Problem
- Viterbi Algorithm
- Decoding: finding the meaning of an observation sequence by looking at the underlying states.
- Hidden states: A+, C+, G+, T+, A-, C-, G-, T-
- Observation sequence: CGCGA
- State sequences: C+,G+,C+,G+,A+ or C-,G-,C-,G-,A- or C+,G-,C+,G-,A+
- Most probable path: C+,G+,C+,G+,A+
29 HMM for CpG Islands: Decoding Problem II
- Viterbi Algorithm
- Hidden Markov model: $(S, a_{kl}, \Sigma, e_l(x))$.
- Observed symbol sequence: $E = x_1, \ldots, x_n$.
- Find: the most probable path of states that resulted in symbol sequence $E$.
- Let $v_k(i)$ be the partial probability of the most probable path of the symbol sequence $x_1, x_2, \ldots, x_i$ ending in state $k$. Then:
- $v_l(i+1) = e_l(x_{i+1}) \max_k \bigl( v_k(i)\, a_{kl} \bigr)$
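The recursion can be sketched in a few lines for the eight-state CpG model. The transition numbers are a simplified stand-in, not a trained model: a switch probability of 0.1 between island and non-island, uniform base transitions except C-to-G (0.25 in the island, 0.07 outside), and a uniform start. These values and the function names are assumptions made so the example runs.

```python
BASES = "ACGT"
STATES = [b + s for s in "+-" for b in BASES]  # A+ C+ G+ T+ A- C- G- T-

def emission(state, symbol):
    """e_l(x): state 'C+' or 'C-' emits 'C' with probability 1."""
    return 1.0 if state[0] == symbol else 0.0

def transition(k, l, switch=0.1, p_cg_island=0.25, p_cg_background=0.07):
    """a_kl for the simplified model described above (assumed values)."""
    region = (1.0 - switch) if k[1] == l[1] else switch
    p_cg = p_cg_island if l[1] == "+" else p_cg_background
    if k[0] == "C":
        base = p_cg if l[0] == "G" else (1.0 - p_cg) / 3
    else:
        base = 0.25
    return region * base

def viterbi(observations):
    """Return the most probable state path and its probability."""
    # Initialisation: uniform start over the eight states (assumed).
    v = [{k: emission(k, observations[0]) / len(STATES) for k in STATES}]
    back = []
    for x in observations[1:]:
        column, pointers = {}, {}
        for l in STATES:
            # v_l(i+1) = e_l(x_{i+1}) * max_k v_k(i) * a_kl
            best_k = max(STATES, key=lambda k: v[-1][k] * transition(k, l))
            column[l] = emission(l, x) * v[-1][best_k] * transition(best_k, l)
            pointers[l] = best_k
        v.append(column)
        back.append(pointers)
    # Termination: best final state, then follow the back-pointers.
    last = max(STATES, key=lambda k: v[-1][k])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, v[-1][last]

path, prob = viterbi("CGCGA")
print(path)  # ['C+', 'G+', 'C+', 'G+', 'A+']
```

Plain probabilities are fine for five symbols. For long sequences the products underflow, so the usual variant adds log-probabilities instead of multiplying.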
30 HMM for CpG Islands: Decoding Problem III
[Figure: Viterbi trellis with one column of the eight states A+, C+, G+, T+, A-, C-, G-, T- per observed symbol, for the observation sequence A C G C G]
31 HMM for CpG Islands: Decoding Problem III
- Summary
- Computationally inexpensive: like the forward algorithm, it reuses partial probabilities instead of enumerating every path, taking a max where the forward algorithm takes a sum.
- The largest partial probability at the final position is the probability of the most probable path.
- Decision of best path based on whole sequence, not an individual observation.
32 HMM for CpG Islands
- Now, on to our Mathematica implementation
33 HMM for CpG Islands
- References
- R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, 1998. Chapters 3 and 5.
- A. Krogh, M. Brown, I. Saira Mian, K. Sjolander, and D. Haussler. Hidden Markov Models in Computational Biology: Applications to Protein Modeling. J. Mol. Biol. (1994) 235, 1501-1531.
- L. Rabiner. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, Vol. 77, No. 2, Feb. 1989.
- On-line tutorial: http://www.comp.leeds.ac.uk/roger/HiddenMarkovModels/html_dev/main.html