Title: Evolving the Structure of Hidden Markov Models
1Evolving the Structure of Hidden Markov Models
- Paper from IEEE Trans. on EC
- Presenter: Cyrus
2Outline
- Background and Motivation
- The Proposed Method
- Result and Analysis
- Comments
4Hidden Markov Models (HMMs)
- Probabilistic finite-state machines used to find the underlying structures of sequential data
- Defined by the set of states, the transition probabilities between states, and a table of emission probabilities associated with each state for all possible symbols (e.g. ATCG in DNA) that occur in the sequence
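5HMMs: Illustration
- An example for DNA sequences
- Emission probabilities of the states shown in the figure:
C 0.6, T 0.4
A 0.25, T 0.5, C 0.25
A 0.9, T 0.1
A 0.05, C 0.05, G 0.9
G 1
- [Figure: hidden states S1, S2, S3, ..., Si, ..., Sn emitting symbols x1, x2, x3, ..., xi, ..., xn; example sequences AGTCG, AGCTG, AGTGC]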
6HMMs: Given the model and parameters
- We can calculate
- Probability of an observed sequence: forward-backward algorithm
- The most likely sequence of hidden states that could have generated a given output sequence: Viterbi algorithm (a dynamic programming algorithm)
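The two calculations on this slide can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper: the two-state model, its transition probabilities, and the function names are assumptions, and only the emission tables are borrowed from the illustration slide.

```python
def forward_probability(obs, states, start, trans, emit):
    """Return P(obs | model) by summing over all hidden state paths."""
    f = {s: start[s] * emit[s].get(obs[0], 0.0) for s in states}
    for symbol in obs[1:]:
        f = {t: emit[t].get(symbol, 0.0) * sum(f[s] * trans[s][t] for s in states)
             for t in states}
    return sum(f.values())


def viterbi(obs, states, start, trans, emit):
    """Return (most likely state path, joint probability of that path and obs)."""
    v = {s: start[s] * emit[s].get(obs[0], 0.0) for s in states}
    back = []  # one dict of back-pointers per position after the first
    for symbol in obs[1:]:
        prev, v, pointers = v, {}, {}
        for t in states:
            best = max(states, key=lambda s: prev[s] * trans[s][t])
            pointers[t] = best
            v[t] = prev[best] * trans[best][t] * emit[t].get(symbol, 0.0)
        back.append(pointers)
    last = max(states, key=lambda s: v[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, v[last]


# Hypothetical two-state model (transition values are assumed for illustration).
states = ["S1", "S2"]
start = {"S1": 0.5, "S2": 0.5}
trans = {"S1": {"S1": 0.7, "S2": 0.3}, "S2": {"S1": 0.4, "S2": 0.6}}
emit = {"S1": {"A": 0.9, "T": 0.1}, "S2": {"C": 0.6, "T": 0.4}}

likelihood = forward_probability("ATC", states, start, trans, emit)
path, path_prob = viterbi("ATC", states, start, trans, emit)
print(likelihood, path, path_prob)
```

The forward pass sums over predecessors where Viterbi takes the maximum, so the probability of the best path can never exceed the total sequence probability.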
7HMMs: Parameter Estimation
- Given a set of output sequences
- The modeling problem: how to determine the transition and emission probabilities
- Baum-Welch algorithm
- Update the parameters to maximize the model likelihood
- An expectation-maximization method: iterate until convergence
- Briefly introduced next
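8Baum-Welch Algorithm
- HMM parameter estimation
- Forward variable: the joint probability of the observed symbols up to a position and of being in a given state at that position
- Backward variable: the probability of the remaining observed symbols, given the state at that position
- Both will be used for parameter calculation
- Both recursions rely on the Markovian nature of the model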
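The equations on this slide did not survive extraction. The standard textbook definitions and recursions are given here in an assumed notation that may differ from the paper's: $x_1 \dots x_n$ is the observed sequence, $\pi_t$ the hidden state at position $t$, $a_{ij}$ the transition probability from state $i$ to state $j$, and $e_j(x)$ the probability that state $j$ emits symbol $x$.

```latex
% Assumed notation (not taken from the paper): f = forward, b = backward.
\begin{align}
f_i(t) &= P(x_1,\dots,x_t,\ \pi_t = i), &
f_j(t+1) &= e_j(x_{t+1}) \sum_i f_i(t)\, a_{ij}, \\
b_i(t) &= P(x_{t+1},\dots,x_n \mid \pi_t = i), &
b_i(t) &= \sum_j a_{ij}\, e_j(x_{t+1})\, b_j(t+1), \\
P(x) &= \sum_i f_i(t)\, b_i(t) & &\text{for any position } t.
\end{align}
```

Each recursion looks only one position back or ahead, which is where the Markov property is used.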
9Update Rules
- EM procedure: the parameters that maximize the likelihood value are calculated iteratively
- Update of the transition probability: the number of transitions from state i to state j, summed over the sequence, divided by the number of all transitions out of state i
10Update Rules
- Update of the emission probability: the number of times the symbol a is emitted when in state i, divided by the number of all emissions from state i
- In Baum-Welch, unknown transition and emission frequencies are replaced by their expected values
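The update formulas themselves were lost from these two slides. In the standard textbook form they read as follows. The symbols are assumptions rather than the paper's notation: $A_{ij}$ and $E_i(a)$ are the expected counts, and $f$ and $b$ are the forward and backward variables computed with the current parameters $a_{ij}$ and $e_i(a)$.

```latex
% Assumed notation (not taken from the paper); for several training
% sequences, A and E are additionally summed over the sequences.
\begin{align}
A_{ij} &= \frac{1}{P(x)} \sum_{t=1}^{n-1} f_i(t)\, a_{ij}\, e_j(x_{t+1})\, b_j(t+1), &
a_{ij} &\leftarrow \frac{A_{ij}}{\sum_{j'} A_{ij'}}, \\
E_i(a) &= \frac{1}{P(x)} \sum_{t:\, x_t = a} f_i(t)\, b_i(t), &
e_i(a) &\leftarrow \frac{E_i(a)}{\sum_{a'} E_i(a')}.
\end{align}
```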
11The procedure
- The parameters are re-estimated by the update rules
- The procedure is iterated until some convergence criterion is met
- Pseudo-counts are used to avoid excessive over-fitting
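One re-estimation step with pseudo-counts might look like the sketch below. It is an illustration under stated assumptions, not the authors' implementation: the start probabilities are kept fixed, no scaling is applied (so it only suits short sequences), and the toy model, the sequences, and all function names are made up for the example.

```python
import math


def forward(obs, start, trans, emit):
    """Forward table f[t][i] for one sequence."""
    n = len(start)
    f = [[start[i] * emit[i][obs[0]] for i in range(n)]]
    for x in obs[1:]:
        prev = f[-1]
        f.append([emit[j][x] * sum(prev[i] * trans[i][j] for i in range(n))
                  for j in range(n)])
    return f


def backward(obs, trans, emit):
    """Backward table b[t][i] for one sequence."""
    n = len(trans)
    b = [[1.0] * n]
    for x in reversed(obs[1:]):
        nxt = b[0]
        b.insert(0, [sum(trans[i][j] * emit[j][x] * nxt[j] for j in range(n))
                     for i in range(n)])
    return b


def baum_welch_step(seqs, start, trans, emit, alphabet, pseudo=1.0):
    """One EM update; returns (new_trans, new_emit, log-likelihood of the old model)."""
    n = len(start)
    A = [[pseudo] * n for _ in range(n)]                 # expected transition counts
    E = [{a: pseudo for a in alphabet} for _ in range(n)]  # expected emission counts
    log_lik = 0.0
    for obs in seqs:
        f, b = forward(obs, start, trans, emit), backward(obs, trans, emit)
        px = sum(f[-1])
        log_lik += math.log(px)
        for t, x in enumerate(obs):
            for i in range(n):
                E[i][x] += f[t][i] * b[t][i] / px
                if t + 1 < len(obs):
                    for j in range(n):
                        A[i][j] += (f[t][i] * trans[i][j]
                                    * emit[j][obs[t + 1]] * b[t + 1][j] / px)
    new_trans = [[A[i][j] / sum(A[i]) for j in range(n)] for i in range(n)]
    new_emit = [{a: E[i][a] / sum(E[i].values()) for a in alphabet} for i in range(n)]
    return new_trans, new_emit, log_lik


# Toy data and a hypothetical two-state starting model.
seqs = ["ATATATGC", "ATATGCGC", "TATAGCGC"]
start = [0.5, 0.5]
trans = [[0.8, 0.2], [0.3, 0.7]]
emit = [{"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
        {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}]

new_trans, new_emit = trans, emit
for _ in range(5):  # a fixed iteration count stands in for a convergence test
    new_trans, new_emit, log_lik = baum_welch_step(seqs, start, new_trans, new_emit, "ACGT")
print(log_lik)
```

With `pseudo=0.0` the step is plain maximum-likelihood EM, whose log-likelihood never decreases. A positive pseudo-count keeps every probability away from zero at the cost of that strict guarantee.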
12What Architecture to Choose?
- Usually requires expert knowledge to manually design the HMM
- [Figure: two candidate architectures, one or the other?]
13Automatic discovery of HMM structures
- Some significant attractions
- Allow the data to speak for themselves and remove the need for experts
- Possibility to find completely novel structures, free from theoretical prejudice
- Automation allows many more structures to be tested than is possible with manual design
14Motivation
- The aim of the paper
- Utilize genetic algorithms (GAs) to gain the advantage of automatic HMM structure discovery
- Retain some of the benefits of a hand-designed architecture for biological sequence analysis
- HMMs have received little attention from the evolutionary computing community
- Evolving the architecture of HMMs with a GA is novel
15Outline
- Background and Motivation
- The Proposed Method
- Result and Analysis
- Comments
16Representation of Block HMM
- To constrain the search of HMM topologies to biologically meaningful structures, the HMM structure is represented as a number of blocks
- Three basic structures in biological analysis are used as the blocks
17Blocks of HMMs
- (a) linear (to model conserved regions)
- (b) self-loop (to model a sequence of any length)
- (c) forward blocks (to describe varying length
subsequences)
18Block HMM
- Types: tied or untied
- Tied: all the emission and transition probabilities inside the block are set equal
- The blocks are fully linked together to form the whole HMM architecture
19Genetic Operators
- Crossover
- Each block is represented by a pair
- First element: block type (linear, self-loop, or forward)
- Second element: tied or untied, and other info
- The whole HMM is represented by a string of pairs
20Crossover
- Combination between blocks
- Full transitions between the blocks are not shown for simplicity
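The string-of-pairs representation and a crossover between two such strings can be sketched as follows. The slides do not preserve the paper's exact operator, so the one-point crossover here is an assumed stand-in, and the `Block` class, its `length` field (one guess at the "other info"), and the example parents are all hypothetical.

```python
import random
from dataclasses import dataclass

BLOCK_TYPES = ("linear", "self-loop", "forward")


@dataclass(frozen=True)
class Block:
    kind: str    # first element of the pair: one of BLOCK_TYPES
    tied: bool   # second element: True if probabilities inside the block are shared
    length: int  # hypothetical "other info": number of states in the block


def crossover(parent_a, parent_b, rng):
    """One-point crossover on two block strings of equal length (at least 2)."""
    assert len(parent_a) == len(parent_b) >= 2
    cut = rng.randint(1, len(parent_a) - 1)  # cut strictly inside the string
    child_1 = parent_a[:cut] + parent_b[cut:]
    child_2 = parent_b[:cut] + parent_a[cut:]
    return child_1, child_2


parent_a = [Block("linear", False, 5), Block("self-loop", True, 1), Block("linear", True, 6)]
parent_b = [Block("forward", False, 4), Block("linear", False, 3), Block("self-loop", False, 1)]
child_1, child_2 = crossover(parent_a, parent_b, random.Random(0))
print(child_1)
print(child_2)
```

Cutting only at block boundaries means every child is again a string of intact blocks, which is the point of the block representation.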
21Mutation
22Mutation
23Fitness Evaluation
- To achieve generalization ability and avoid over-fitting, the training data are split into two sets
- One half is used as the training set, with Baum-Welch estimating the parameters
- The other half is used as an evaluation set to measure the fitness for selecting members from the population
24Fitness Function
- The fitness is defined in terms of the parameters of the HMM individual, the ith sequence for evaluation, and its length
- Notice that the formula in the paper is not precise
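The fitness formula itself is missing from the slide. Because it involves each evaluation sequence and its length, one plausible reading is a length-normalized log-likelihood averaged over the evaluation set. The sketch below implements that assumed form together with the half-and-half split; it should not be read as the paper's definition, and `split_half`, `fitness`, and the uniform toy model are made-up names.

```python
import math
import random


def split_half(seqs, rng):
    """Shuffle and split into (training half, evaluation half)."""
    shuffled = list(seqs)
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


def fitness(eval_seqs, log_likelihood):
    """Assumed form: mean per-symbol log-likelihood on the evaluation set.

    `log_likelihood(x)` is any callable returning log P(x | trained HMM).
    """
    return sum(log_likelihood(x) / len(x) for x in eval_seqs) / len(eval_seqs)


# Toy stand-in for a trained HMM: every symbol has probability 1/4.
def uniform_log_likelihood(x):
    return len(x) * math.log(0.25)


data = ["ATGATG", "AAGATGAGG", "ACGT", "TTGACA", "TATAAT", "GGCC"]
train_set, eval_set = split_half(data, random.Random(0))
score = fitness(eval_set, uniform_log_likelihood)
print(len(train_set), len(eval_set), score)
```

Dividing by the sequence length keeps long sequences from dominating the score, which is the usual reason for such a normalization.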
25Reproduction
- The individuals are selected for reproduction with Boltzmann probability
- s (1) is the parameter to control selection strength; N is the population size
- Stochastic universal sampling is used to reduce genetic drift in selection
- It is a development of fitness-proportionate selection that exhibits no bias and minimal spread; it uses a single random value to sample all of the solutions by choosing them at evenly spaced intervals
26Stochastic universal sampling
- (A) Expectation pie: one pie slice for each E(i)
- (B) Divide another pie by the population size to get the children pie
- (C) Choose a random number in (0,1) and spin the children pie by that amount
- [Figure: expectation pie for individuals i1 to i4 with slices 0.58, 1.97, 0.22, and 1.23 (cumulative 0.58, 2.55, 2.77, 4), and children pie with equal slices for Child 1 to Child 4]
27Stochastic universal sampling
- (D) Superimpose the children pie on top of the expectation pie. This gives the number of children of each individual.
- [Figure: children pie with Child 1 to Child 4 superimposed on the expectation pie]
- The number of children generated cannot be less than the floor of E(i) and cannot be greater than the ceiling of E(i).
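The pie-spinning procedure can be written as a short routine, sketched here in Python. The Boltzmann formula was lost from the reproduction slide, so `boltzmann_expectations` uses one common form, with weights proportional to exp(s F_i / sigma) where sigma is the population standard deviation of fitness. That scaling is an assumption, not confirmed by the slides. The function names are made up, and the example reuses the four slice sizes from the pie figure.

```python
import math


def boltzmann_expectations(fitnesses, s=1.0):
    """Expected number of children E(i); the values sum to the population size N.

    Assumed form: weight_i = exp(s * F_i / sigma), sigma = std. dev. of fitness.
    """
    n = len(fitnesses)
    mean = sum(fitnesses) / n
    sigma = math.sqrt(sum((f - mean) ** 2 for f in fitnesses) / n) or 1.0
    # Subtracting the mean leaves the normalized weights unchanged and avoids overflow.
    weights = [math.exp(s * (f - mean) / sigma) for f in fitnesses]
    total = sum(weights)
    return [n * w / total for w in weights]


def stochastic_universal_sampling(expectations, offset):
    """Number of children per individual for one spin `offset` in [0, 1)."""
    n_children = round(sum(expectations))
    counts = [0] * len(expectations)
    i, cumulative = 0, expectations[0]
    for k in range(n_children):
        pointer = offset + k  # evenly spaced pointers, one per child
        while pointer >= cumulative and i < len(expectations) - 1:
            i += 1
            cumulative += expectations[i]
        counts[i] += 1
    return counts


expectations = [0.58, 1.97, 0.22, 1.23]  # slice sizes from the pie figure
counts = stochastic_universal_sampling(expectations, offset=0.5)
print(counts)
```

Because the pointers are exactly one unit apart, an individual with expectation E(i) is hit either floor(E(i)) or ceil(E(i)) times, which is the bound stated on the slide.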
28Outline
- Background and Motivation
- The Proposed Method
- Result and Analysis
- Comments
29Artificial Data
- Model (ATG) and (AAGATGAGGACG)
- Two-block models are used
- GA configuration
- Results
30Promoter Model of C. jejuni
- For each individual HMM in the GA
- 175 sequences available
- 132 sequences are used for Baum-Welch training, 43 for fitness evaluation
- The best HMMs are found with nine- or eight-block settings
- The 9- and 8-block models find the AAGGA and TAtAAT regions
- The 9- and 8-block models find the semi-conserved TGx upstream of the TATA box
- The 9-block model finds the ten-base periodicity that was discovered in a handcrafted HMM
- The 7-block model finds only AAGGA
31HMMs found by GA
- [Figure: the 9-block, 8-block, and 7-block HMMs found by the GA, with the ten-base periodicity annotated]
32Discrimination Test
- These HMM architectures are tested for discrimination ability
- The architectures are kept while the parameters are reset to random values for Baum-Welch training
- Five-fold cross-validation: each time, 140 of the 175 sequences are used for training and 35 for testing
- Background sequences are generated by a third-order Markov chain
- A log-odds threshold is set so that there are ten or fewer false positives (FP), and then the number of true positives (TP) is measured
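The thresholding step can be sketched as below. The function name and the toy scores are made up. Each score is taken to be a log-odds value, log P(x | HMM) minus log P(x | background), and a sequence is called positive when its score is strictly above the threshold.

```python
def true_positives_at_fp_limit(pos_scores, neg_scores, max_fp=10):
    """Return (threshold, TP) for the lowest threshold with at most max_fp false positives."""
    ranked = sorted(neg_scores, reverse=True)
    # Using the (max_fp + 1)-th highest background score as the threshold leaves
    # at most max_fp background scores strictly above it.
    threshold = ranked[max_fp] if len(ranked) > max_fp else float("-inf")
    tp = sum(score > threshold for score in pos_scores)
    return threshold, tp


# Toy log-odds scores: 20 background sequences and 5 real test sequences.
background_scores = list(range(20))
promoter_scores = [5, 9, 10, 25, 30]
threshold, tp = true_positives_at_fp_limit(promoter_scores, background_scores, max_fp=10)
print(threshold, tp)
```

Fixing the false-positive budget first makes the TP counts of different architectures comparable, since every model is judged at the same error level on the background set.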
33Results
- The total true positives over the five folds are counted out of 35 × 5 = 175 test sequences
- Compared with a previous handcrafted HMM (with expert knowledge)
34Outline
- Background and Motivation
- The Proposed Method
- Result and Analysis
- Comments
35Contributions
- This paper proposes a novel method to automatically discover the structure of HMMs using a GA
- To preserve biologically meaningful building blocks of HMMs, a block representation is employed
- The GA explores different combinations of these blocks and mutates the blocks to form new HMMs
- To avoid over-fitting, only half of the training data are used for training, while the other half is used for fitness evaluation
36Problems
- The huge complexity is still a great weakness, as the authors mention
- Each individual involves a whole process of training and testing an HMM!
- Some descriptions are not clear
- I had to refer to three other related papers by the same authors to get a more complete view
- Incorrect figure (how to mutate to untied?); lack of systematic descriptions of the mutation cases; fitness function not precise
37Questions
- Experiment
- More details are desired, such as the emission and transition probabilities of the blocks in the HMM
- More explanation is needed about those blocks which are not triple-equivalent in the results, and about their effect
- The reason for obtaining TP by setting a threshold so that there are ten or fewer FP should be justified
38Evolving the Structure of Hidden Markov Models
- The End! Thank You!
- Q & A
- 06238760
- Chan Tak Ming