1 Time-line Hidden Markov Experts for Time Series Prediction
Xin Wang (xinw_at_infoscience.otago.ac.nz)
PhD Candidate, Department of Information Science, University of Otago, Dunedin, New Zealand
2 Outline of Talk
- Background on Chaotic Time Series Prediction
- Mixture of Experts (ME) Models for Prediction
- Time-line Hidden Markov Experts (THME) for Prediction
- Experiments on One-step-ahead and Multi-step-ahead Prediction
3 Chaotic Time Series
- A chaotic time series is a chronological sequence of observations from a non-linear (deterministic) dynamical system.
- A simple time series: 0.08, 0.14, 0.19, 0.22, 0.23, 0.23, 0.22, 0.20, ...
- State space: an m-dimensional space of state vectors, e.g. Xt = (xt, xt-1, ..., xt-m+1).
- Velocity of the trajectory: the derivative of the state vector with respect to time t.
[Figure: a sample trajectory in the reconstructed state space.]
4 Prediction of Chaotic Time Series
- For a chaotic time series, by Takens' embedding theorem, there exists a mapping f from the state vector Xt to a future value of the time series, e.g. xt+1 = f(Xt).
- The tasks for prediction (a minimal embedding sketch follows):
  - Reconstruct the state space.
  - Learn the mapping f, i.e. approximate f with training samples.
  - Generate future values.
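As an illustration (not from the slides), a minimal Python sketch of delay-coordinate embedding for state-space reconstruction; the embedding dimension m and delay tau are assumed parameters.

```python
# Minimal sketch: delay-coordinate embedding of a scalar series into
# m-dimensional state vectors, with one-step-ahead targets.
import numpy as np

def embed(series, m, tau=1):
    """Return state vectors X_t = (x_t, x_{t-tau}, ..., x_{t-(m-1)tau})
    and the one-step-ahead targets x_{t+1}."""
    series = np.asarray(series, dtype=float)
    start = (m - 1) * tau
    X, y = [], []
    for t in range(start, len(series) - 1):
        X.append(series[t - np.arange(m) * tau])  # most recent value first
        y.append(series[t + 1])
    return np.array(X), np.array(y)

# Example with the short series shown earlier
X, y = embed([0.08, 0.14, 0.19, 0.22, 0.23, 0.23, 0.22, 0.20], m=3)
```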
5 Techniques for Prediction (1)
- Global Model
  - One regression model covering the entire range of the underlying trajectory, such as:
    - a polynomial, or
    - a neural network (MLP, RBF, etc.), or
    - another regression model, e.g. a Support Vector Machine (SVM).
  - These models learn from the observed samples (training set) and make predictions (for the test set) afterwards.
6 Techniques for Prediction (2)
- Local Model
- 1. Models based on Nearest Neighbours
  - Local averaging
  - Local regression
  - Locally weighted averaging
  - Locally weighted regression
- Neighbours of the query are identified from the observed samples; the prediction for the query is then obtained by averaging the target outputs of the neighbours, or by estimation from a (linear or non-linear) regression function built over the neighbours (a minimal sketch follows).
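A minimal sketch of one of these local models, k-nearest-neighbour local averaging; the function name and the Euclidean distance choice are assumptions for illustration.

```python
# Minimal sketch: k-nearest-neighbour local averaging for a query state vector.
import numpy as np

def knn_local_average(X_train, y_train, x_query, k=5):
    """Predict by averaging the targets of the k nearest neighbours of x_query."""
    dists = np.linalg.norm(X_train - x_query, axis=1)  # distances to observed states
    idx = np.argsort(dists)[:k]                        # indices of the k closest samples
    return float(y_train[idx].mean())
```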
7 Techniques for Prediction (3)
- 2. Models based on the Divide-and-Conquer principle
  - Piece-wise regression
  - Threshold Autoregressive Model (TAR)
  - Switching Regression
  - Mixture of Experts (ME) (in the connectionist community)
- Divide the state space into sub-spaces.
- Learn the mapping from the divided trajectory in each sub-space to the target output with a regression model (expert).
- Combine (linearly average) the outputs from the experts as the output of the model (see the sketch below).
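A minimal sketch of the combination step: the model output as a weighted (linear) average of the expert outputs. How the weights are obtained (a gating network, HMM state probabilities, etc.) is left abstract here.

```python
# Minimal sketch: linear (weighted-average) combination of expert outputs.
import numpy as np

def combine_experts(expert_outputs, weights):
    """expert_outputs: (M,) outputs of the M experts for one query;
    weights: (M,) non-negative combination weights (normalised to sum to 1)."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return float(np.dot(weights, expert_outputs))
```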
8 Three ME models (1)
- Gated Experts (GE)
  - The state space is divided into a set of sub-spaces.
  - A connectionist model (MLP) learns the mapping on the phases of the divided trajectory in each sub-space.
  - The experts are combined by the probabilities of the point on the trajectory being in each sub-space.
  - Problem: the combination relies only on the input position.
- Hidden Markov Experts (HME)
  - Similar to ME, but the experts are combined by an HMM.
  - The probabilities for the combination rely on the previous state and the state transition probabilities of the HMM.
9 Three ME models (2)
- Problem
  - Transition probabilities are constant, taking no account of influence from outside.
  - Unable to indicate a state transition at a distinct time point precisely.
- Input/Output HMM (IOHMM)
  - Local experts are combined by an inhomogeneous HMM, where the transition probabilities are time-varying.
  - More information is used for expert combination.
  - Problem: the experts must be linear perceptrons or MLPs.
10"Time-line" Hidden Markov Experts ------THME
- The trajectory is divided into phases belonging
to some categories according to the velocity. - A regression model is applied to learn the
mapping from the phases in each category to the
target outputs. - HMM is applied for expert combination.
- Each category defines a state of the trajectory
and associates with a state of the HMM. - The transition probabilities of the HMM are
designed as time-varying, the HMM thus is called
time-line HMM and the model is called THME. - The time-varying state transition probabilities
are conditional on the "velocity" of the
trajectory and modelled by a connectionist model.
11 Architecture of THME
- THME with M local experts moderated by an HMM.
- The input is the vector of embedded series values Xt.
- The experts are regression models trained with the samples in their categories; expert i produces the output yt(i).
- The experts are combined by the probabilities of the underlying process being in each state of the HMM.
[Figure: architecture of THME. Experts 1..M produce outputs from Xt, which are combined according to the state probabilities of the time-line HMM.]
12 Dividing the Trajectory & Local Learning
- The dividing of the trajectory in the state space is based on the information contained in the state vector and the corresponding output.
- To enable velocity-based dividing, the dividing feature includes the velocity of the state vector.
- The fuzzy C-means clustering algorithm is applied to divide the trajectory (a minimal sketch follows).
- The samples on the phases in each cluster (category) are then used to train a regression model, making it a local expert over those phases (state).
- The local experts can be MLPs, RBF networks, or SVMs.
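A minimal, self-contained sketch of fuzzy C-means clustering on the dividing features (assumed here to be state vectors augmented with their velocity); this is a generic implementation, not the authors' code.

```python
# Minimal sketch: fuzzy C-means clustering returning fuzzy membership degrees.
import numpy as np

def fuzzy_cmeans(F, n_clusters, m=2.0, n_iter=100, tol=1e-5, seed=0):
    """F: (N, d) feature matrix.  Returns (memberships U of shape (N, C), centres)."""
    rng = np.random.default_rng(seed)
    N = F.shape[0]
    U = rng.random((N, n_clusters))
    U /= U.sum(axis=1, keepdims=True)                 # rows are fuzzy memberships
    for _ in range(n_iter):
        W = U ** m                                    # fuzzified memberships
        centres = (W.T @ F) / W.sum(axis=0)[:, None]  # weighted cluster centres
        d = np.linalg.norm(F[:, None, :] - centres[None, :, :], axis=2) + 1e-12
        U_new = 1.0 / (d ** (2.0 / (m - 1.0)))        # standard FCM membership update
        U_new /= U_new.sum(axis=1, keepdims=True)
        if np.abs(U_new - U).max() < tol:
            U = U_new
            break
        U = U_new
    return U, centres

# Samples with the highest membership in cluster j would then be used to
# train expert j (an MLP, RBF network, or SVM).
```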
13 HMM for Expert Combination (1) ------ An Example of a Three-state HMM
- A process with observations generated from a hidden state series belonging to three kinds of states.
[Figure: a three-state HMM with states S1, S2, S3; transition probabilities a11, a12, ..., a33 between the states; emission probability distributions f(. | Si), giving P(yt | st = Si) at times t-1, t, t+1; and state probabilities Pt-1(Si), Pt(Si), Pt+1(Si).]
14 HMM for Expert Combination (2)
- Suppose an HMM with Gaussian emission distributions.
- Apply the experts to all the training samples.
- Expert j, trained with the samples on some phases of the trajectory, has a smaller error over those phases than over the rest of the trajectory.
- The samples on those phases therefore have higher emission probabilities, so the phases are associated with state j in the HMM (see the sketch below).
- The evolution of the underlying system from one phase to another on the trajectory is the state transition in the HMM.
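A small sketch of the Gaussian emission probabilities described above, centred on each expert's output with a state-specific variance (notation assumed).

```python
# Minimal sketch: Gaussian emission probability of the observed value y_t under
# state j, centred on expert j's prediction with a per-state variance.
import numpy as np

def emission_probs(y, expert_preds, sigmas):
    """y: (T,) targets; expert_preds: (T, M) outputs of the M experts;
    sigmas: (M,) per-state standard deviations.
    Returns B of shape (T, M) with B[t, j] = N(y_t; yhat_t^(j), sigma_j^2)."""
    resid = y[:, None] - expert_preds
    return np.exp(-0.5 * (resid / sigmas) ** 2) / (np.sqrt(2 * np.pi) * sigmas)
```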
15 Time-line HMM for Expert Combination
- In a traditional HMM, constant transition probabilities are held for all time points, so the expert combination is not always good.
- A time-line HMM can be applied instead: the transition probabilities are time-varying, i.e. for different time points there are different transition probabilities.
- Learning the time-line HMM means searching for the best transition probabilities at every time point so as to observe the samples with maximum probability.
- A modified Baum-Welch algorithm within the EM (Expectation-Maximisation) process is developed for time-line HMM learning.
16 Diagram of Expert Combination
- yt: the output of the model.
- P(st = Si): the probability of being in state Si at time t.
- A differencing operation produces the velocity from the input Xt.
- One node multiplies two real values; another multiplies two matrices.
- The "State Transition Network" generates the transition probabilities aij(t) for the time-line HMM.
[Figure: expert combination diagram. Experts 1..M take Xt as input; their outputs are weighted by P(st = Si), which is updated from P(st-1 = Si) through the time-varying transition probabilities aij(t) produced by the State Transition Network from the velocity of Xt.]
17"Time-line" HMM learning ------Modified
Baum-Welch Algorithm
- For the time-line HMM,
- Â Â Â Â Gaussian emission distribution is assumed
-
- State transition probability
- log Likelihood Function (auxiliary function)
about the current parameter ? and to be
estimated parameter ?
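A hedged sketch of the auxiliary function in LaTeX, taking the standard Baum-Welch form with the transition probabilities made time-dependent; the symbols λ, λ̄, π, a_ij(t), and σ_j are assumed notation, not necessarily those of the original slides.

```latex
% Standard Baum-Welch auxiliary (Q) function, with time-dependent transitions:
Q(\lambda, \bar{\lambda})
  = \sum_{\mathbf{s}} P(\mathbf{s} \mid \mathbf{y}, \lambda)\,
    \log P(\mathbf{y}, \mathbf{s} \mid \bar{\lambda})
  = \sum_{\mathbf{s}} P(\mathbf{s} \mid \mathbf{y}, \lambda)
    \Big[ \log \bar{\pi}_{s_1}
        + \sum_{t=2}^{T} \log \bar{a}_{s_{t-1} s_t}(t)
        + \sum_{t=1}^{T} \log \mathcal{N}\!\big(y_t;\, \hat{y}_t^{(s_t)}, \bar{\sigma}_{s_t}^2\big)
    \Big]
```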
18"Time-line" HMM learning ------EM Step in the
Algorithm
- Expectation (E-step)
- Forward and Backward to estimate Q function.
- Maximisation (M-step)
- Maximising the Q function to update the
parameter. - Initial state probability
- Time-varying transition probability
- Variance of Gaussian
Expectation Step Forward/backward steps to
estimate Q function.
Maximisation Step Maximising the Q function to
reach a critical point of ?.
EM step for time-line HMM Learning
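A minimal sketch (not the authors' implementation) of scaled forward-backward recursions for an HMM whose transition matrix varies with time, which is the core of the E-step.

```python
# Minimal sketch: scaled forward-backward for a time-varying transition matrix,
# returning the state posteriors used in the E-step.
import numpy as np

def forward_backward(pi, A, B):
    """pi: (M,) initial state probabilities;
    A: (T, M, M) time-varying transitions, A[t, i, j] = P(s_t = j | s_{t-1} = i);
    B: (T, M) emission probabilities P(y_t | s_t = j).
    Returns gamma with gamma[t, j] = P(s_t = j | y_1..y_T)."""
    T, M = B.shape
    alpha = np.zeros((T, M)); beta = np.zeros((T, M)); c = np.zeros(T)
    alpha[0] = pi * B[0]; c[0] = alpha[0].sum(); alpha[0] /= c[0]
    for t in range(1, T):                       # forward pass with scaling
        alpha[t] = (alpha[t - 1] @ A[t]) * B[t]
        c[t] = alpha[t].sum(); alpha[t] /= c[t]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):              # backward pass with the same scaling
        beta[t] = (A[t + 1] @ (B[t + 1] * beta[t + 1])) / c[t + 1]
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)
```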
19 State Transition Probability Modelling
- The state transition probabilities form a series of matrix entries, one set per time point.
- The state of the time series is defined by the velocity of the state vector, so this velocity can be used to learn the state transition probabilities.
- An RBF-structured state transition network performs the modelling (a sketch follows).
- In training, the network learns the mapping from the velocity to the transition probabilities at each time point.
- In prediction, a vector of the previous values of the time series is used by the network to estimate the transition probabilities.
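A minimal sketch of an RBF-structured state transition network that maps the velocity feature to a row-stochastic transition matrix through a softmax; the class name, centre/width handling, and the softmax output layer are assumptions, and training of the output weights is omitted.

```python
# Minimal sketch: an RBF network mapping the velocity at time t to an M x M
# transition matrix whose rows sum to 1.
import numpy as np

class RBFTransitionNet:
    def __init__(self, centres, widths, M, seed=0):
        rng = np.random.default_rng(seed)
        self.centres, self.widths, self.M = centres, widths, M
        self.W = rng.normal(scale=0.1, size=(len(centres), M * M))  # output weights

    def _hidden(self, v):
        d = np.linalg.norm(self.centres - v, axis=1)
        return np.exp(-(d ** 2) / (2 * self.widths ** 2))           # RBF activations

    def transition_matrix(self, v):
        """v: velocity feature (e.g. difference of successive state vectors)."""
        logits = (self._hidden(v) @ self.W).reshape(self.M, self.M)
        e = np.exp(logits - logits.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)                     # rows sum to 1
```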
20 Review of THME training

Training Step                       Technique Applied                             Outcome
Trajectory Dividing                 Fuzzy C-means clustering                      Fuzzy membership degrees
Expert Training                     Non-linear regression by MLP, RBF, or SVM     M regression models as experts
HMM Learning                        Modified Baum-Welch algorithm in EM steps     HMM Gaussian distribution parameters; transition probabilities for the time points
Transition Probability Modelling    RBF state transition network                  Time-varying transition probabilities for prediction
21 Prediction
- Prior probability: P(st = Si) = sum over j of aji(t) * P(st-1 = Sj).
- Combine the experts: yt = sum over i of P(st = Si) * yt(i), where yt(i) is the output of expert i.
- Single-step-ahead prediction: the posterior probability (by Bayes' law), P(st = Si | yt) proportional to P(yt | st = Si) * P(st = Si), is used for state re-estimation.
- Multi-step-ahead prediction: feed the output back as input for the next-step prediction and repeat the steps (a sketch of the loop follows).
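A minimal sketch of the prediction loop following the steps above; the expert interface (scikit-learn-style .predict), the transition_matrix method, and the embed_update helper are hypothetical names introduced for illustration only.

```python
# Minimal sketch: THME-style prediction, iterated for multi-step-ahead forecasting.
import numpy as np

def predict(x0, steps, experts, trans_net, sigmas, p0, embed_update):
    """x0: initial state vector; experts: fitted regressors with .predict;
    trans_net: maps velocity -> transition matrix; sigmas: (M,) per-state std devs;
    p0: (M,) initial state probabilities; embed_update(x, y_new) -> next state vector."""
    x, prev_x, p = x0, x0, p0
    preds = []
    for _ in range(steps):
        A = trans_net.transition_matrix(x - prev_x)        # time-varying transitions
        prior = p @ A                                      # prior state probabilities
        y_hat_i = np.array([e.predict(x[None, :])[0] for e in experts])
        y_hat = float(prior @ y_hat_i)                     # combine the experts
        preds.append(y_hat)
        # Posterior re-estimation by Bayes' law (constant factor cancels in the
        # normalisation).  In one-step-ahead prediction the actual observation
        # would be used here; in multi-step mode the prediction is fed back.
        lik = np.exp(-0.5 * ((y_hat - y_hat_i) / sigmas) ** 2) / sigmas
        p = prior * lik
        p /= p.sum()
        prev_x, x = x, embed_update(x, y_hat)              # feed the output back
    return np.array(preds)
```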
22 Experiments ------ Data Sets
- One-step-ahead prediction (compared with global models and HME)
  - Laser data and Leuven data
  - 1000 points for training, the next 500 for testing
  - 5-fold cross-validation
- Multi-step-ahead prediction (compared with benchmark results)
  - Laser data: first 1000 points for training, next 100 for testing
  - Leuven data: first 2000 points for training, next 200 for testing
  - Mackey-Glass data (delay 17): 1000 points for training, next 500 for testing; predict 85 steps ahead
    - 1. Direct prediction
    - 2. Iterated one-step-ahead prediction (85 iterations)
23 Prediction Result ------ One-step-ahead prediction with THME-MLP, THME-RBF, and THME-SVM, in NMSE (Normalized Mean Squared Error). Number of experts = 2.
24 THME-SVM for Laser Time Series ------ One-step-ahead Prediction
25 Prediction Error from (1) THME-RBF and (2) HMM-RBF for Laser Data
26 Expert Combination
- HMM transition probabilities.
- THME transition probabilities for the two experts at points 50-53.
27 Prior probabilities on Leuven data ------ One-step-ahead Prediction with THME-RBF
28 Prediction of Laser and Leuven Data ------ Multi-step-ahead Prediction
- Prediction of Laser data with THME-RBF and THME-SVM.
- Prediction of Leuven data with THME-RBF and THME-SVM.
29 Prediction of Mackey-Glass Data ------ Multi-step-ahead Prediction
- Prediction of Mackey-Glass data with THME-RBF and THME-SVM in direct and iterated modes.
30 Prediction of Laser Data ------ Multi-step-ahead with THME-RBF
31 Prediction of Leuven Data ------ Multi-step-ahead
32 Prediction Error for Mackey-Glass Data ------ Multi-step-ahead with THME-SVM
33 Review ------ Features of THME
- Dynamics introduced by the HMM.
- Time-varying transition probabilities detect state transitions.
- The experts are combined relying on both exterior information and the interior state status.
- Similar to IOHMM, but:
  - The experts can be MLPs, RBF networks, or SVMs.
  - The variance of the Gaussian emission distribution is adjustable to fit the noise level of each state, instead of being pre-set as in IOHMM.
  - This makes the state estimation more precise and gives a high-quality distribution evaluation for a series value.
34 Summary
- Velocity-based trajectory dividing in ME is applied to chaotic time series prediction.
- The "time-line" HMM introduces dynamics for the expert combination, and more information is utilised.
- A modified Baum-Welch algorithm in EM steps has been developed for learning the time-line HMM.
- The "time-line" hidden Markov expert model shows better performance on some time series in one-step-ahead and multi-step-ahead prediction.
- A connectionist network is used to model the time-varying state transition probabilities along a time series.
35 Discussion
- The feature scheme for dividing the trajectory may have other choices.
- How to choose the number of local experts.
- How to choose the parameters of the RBF transition probability network.
36 References
- L. E. Baum, T. Petrie, G. Soules and N. Weiss, "A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains", Annals of Mathematical Statistics, Vol. 41, pp. 164-171, 1970.
- Y. Bengio and P. Frasconi, "An Input Output HMM Architecture", in G. Tesauro, D. S. Touretzky and T. K. Leen (Eds.), Advances in Neural Information Processing Systems, Vol. 7, MIT Press, Cambridge, MA, 1995, pp. 427-434.
- J. Bezdek and S. Pal, Fuzzy Models for Pattern Recognition, IEEE Press, 1992.
- A. Dempster, N. Laird and D. Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm", Journal of the Royal Statistical Society, Series B, No. 39, pp. 1-38, 1977.
- J. D. Farmer and J. J. Sidorowich, "Predicting Chaotic Time Series", Physical Review Letters, Vol. 59, No. 8, pp. 845-848, 1987.
- R. A. Jacobs, M. I. Jordan, S. J. Nowlan and G. E. Hinton, "Adaptive Mixtures of Local Experts", Neural Computation, Vol. 3, pp. 79-87, 1991.
- F. Takens, "Detecting Strange Attractors in Turbulence", Proceedings of the Symposium on Dynamical Systems and Turbulence, Lecture Notes in Mathematics, 1980, pp. 366-381.
- A. S. Weigend, M. Mangeas and A. N. Srivastava, "Nonlinear Gated Experts for Time Series: Discovering Regimes and Avoiding Overfitting", International Journal of Neural Systems, Vol. 6, No. 4, pp. 373-399, 1995.
- A. S. Weigend and S. Shi, "Predicting Daily Probability Distributions of S&P500 Returns", Journal of Forecasting, Vol. 19, pp. 375-392, 2000.