Title: New developments for recurrent neural systems
1 New developments for recurrent neural systems
Barbara Hammer, Institute of Informatics, TU Clausthal, hammer_at_in.tu-clausthal.de
2 Overview
- Recurrent neural networks
  - Definition
  - Architectural bias
- Recurrent self-organizing maps
  - Definition
  - Capacity
- Contextual models
  - Background
  - Approximation capability
3 Recurrent neural networks - Definition
4 Recurrent neural networks
- Feedforward processing
- Recurrent processing
5 Recurrent neural networks
- Application areas
  - Hawkins, Boden: The applicability of recurrent neural networks for biological sequence analysis. IEEE/ACM TCBB, 2005
  - Xu, Hu, Wunsch: Inference of genetic regulatory networks with recurrent neural network models. IEMBS 2004
  - Pollastri, Baldi: Prediction of contact maps by GIOHMMs and recurrent neural networks using lateral propagation from all four cardinal corners. Bioinformatics, 2002
  - Bonet et al.: Predicting Human Immunodeficiency Virus (HIV) drug resistance using recurrent neural networks. Proceedings of the 10th International Electronic Conference on Synthetic Organic Chemistry, 2006
  - Reczko et al.: Finding signal peptides in human protein sequences using recurrent neural networks. WABI 2002
  - Chen, Chaudhari: Bidirectional segmented-memory recurrent neural network for protein secondary structure prediction. Soft Computing - A Fusion of Foundations, Methodologies and Applications, 2006
  - Bates et al.: Detection of seizure foci by recurrent neural networks. Proceedings of the 22nd Annual International Conference of the IEEE Engineering in Medicine and Biology Society, 2000
  - Güler, Übeyli, Güler: Recurrent neural networks employing Lyapunov exponents for EEG signals classification. Expert Systems with Applications, 2005
  - Petrosian, Prokhorov, Schiffer: Early recognition of Alzheimer's disease in EEG using recurrent neural network and wavelet transform. Proc. SPIE, 2000
6 Recurrent neural networks
7 Recurrent neural networks
- Feedforward neural network
  - neurons connected in an acyclic graph
  - every neuron computes x -> sgd(w^T x - b)
  - the network computes a function on vector spaces
- Recurrent neural network
  - a feedforward network enriched with recurrent connections which set a temporal context
  - the recurrent connections use the output of the previous time step
  - the network computes a function on time series
8Recurrent neural networks
x(t)
o(t)
z(t) f(x(t),z(t-1)) o(t) g(z(t))
z(t)
z(t-1)
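A minimal sketch of this recurrence in Python; the tanh transition and the linear readout are illustrative assumptions, not the specific network of the slides:

```python
import numpy as np

def rnn_forward(xs, W_in, W_rec, W_out, b, c):
    """Run the recurrence z(t) = f(x(t), z(t-1)), o(t) = g(z(t)) over a sequence.

    xs    : array of shape (T, n_in), the input time series
    W_in  : (n_hidden, n_in) input weights
    W_rec : (n_hidden, n_hidden) recurrent weights
    W_out : (n_out, n_hidden) readout weights
    b, c  : biases of the hidden layer and the readout
    """
    z = np.zeros(W_rec.shape[0])                # initial context z(0)
    outputs = []
    for x in xs:
        z = np.tanh(W_in @ x + W_rec @ z + b)   # f(x(t), z(t-1))
        outputs.append(W_out @ z + c)           # o(t) = g(z(t))
    return np.array(outputs)
```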
9 Recurrent neural networks
- well established; training by minimizing the quadratic error -> backpropagation through time, real-time recurrent learning, Kalman filtering, ...
- long-term dependencies cannot be captured due to vanishing gradients: the derivative vanishes if propagated through several time steps! (see the derivation below)
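To make the vanishing-gradient argument explicit, a standard chain-rule computation under the recurrence z(t) = f(x(t), z(t-1)) of the previous slide, with sigmoidal transfer function sgd; a(tau) denotes the pre-activation and W_rec the recurrent weight matrix (notation assumed here):

```latex
\frac{\partial z(t)}{\partial z(t-k)}
  = \prod_{\tau = t-k+1}^{t} \frac{\partial z(\tau)}{\partial z(\tau-1)}
  = \prod_{\tau = t-k+1}^{t} \operatorname{diag}\bigl(\mathrm{sgd}'(a(\tau))\bigr)\, W_{\mathrm{rec}}
```

Since the derivative of the logistic function is bounded by 1/4, this product shrinks geometrically in k whenever the recurrent weights are moderate, so error signals from many steps in the past barely influence the weight update.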
10Recurrent neural networks
fixed recurrent part based on universal pinciples
readout trained by means of simple gradient
mechansim
11 Recurrent neural networks
- Fractal prediction machines
  - input alphabet T, C, G, A; the context is two-dimensional (each symbol is associated with a corner of the unit square, see the sketch below)
  - the resulting points constitute a fractal
  - Markovian property: emphasis on one part (the recent suffix) of the sequence
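A sketch of the chaos-game style context computation behind fractal prediction machines; the corner assignment and the contraction factor k = 0.5 are illustrative choices:

```python
import numpy as np

# illustrative corner assignment for the four-letter alphabet
CORNERS = {"T": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 0.0), "A": (1.0, 1.0)}

def fractal_encode(sequence, k=0.5, start=(0.5, 0.5)):
    """Map a symbol sequence to points in the unit square.

    Each step contracts the previous context towards the corner of the
    current symbol: s_t = k * s_{t-1} + (1 - k) * corner(x_t).
    Sequences sharing a long suffix end up close together (Markovian bias).
    """
    s = np.array(start)
    points = []
    for symbol in sequence:
        s = k * s + (1.0 - k) * np.array(CORNERS[symbol])
        points.append(s.copy())
    return np.array(points)

# sequences with the same recent history map to nearby points
print(fractal_encode("TCGA")[-1], fractal_encode("ACGA")[-1])
```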
12 Recurrent neural networks
- Fractal prediction machine demo
  - daily volatility change of the Dow Jones Industrial Average, 2/1918-4/1997; predict the direction of the volatility move for the next day
  - Tino, Dorffner: Predicting the future from fractal representations of the past. Machine Learning, 2001
13 Recurrent neural networks
- echo state networks: recurrent part of very high dimension with random connections
- echo state property
  - in the limit, the context does not depend on the initialization
  - e.g. spectral radius of the recurrent weight matrix smaller than one
  - activations initialized by a long enough recurrence
  (a minimal sketch follows)
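A minimal echo state network sketch along these lines, assuming the usual recipe of random recurrent weights rescaled to a spectral radius below one, a washout phase, and a ridge-regression readout; all sizes and constants are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_reservoir(n_in, n_res, spectral_radius=0.9):
    """Random input and recurrent weights; the recurrent part is rescaled so
    that its spectral radius stays below one (a sufficient condition for the
    echo state property in the contracting case)."""
    W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
    W = rng.uniform(-0.5, 0.5, (n_res, n_res))
    W *= spectral_radius / max(abs(np.linalg.eigvals(W)))
    return W_in, W

def run_reservoir(xs, W_in, W, washout=100):
    """Collect reservoir states; the first `washout` states are discarded so
    that the context no longer depends on the arbitrary initialization."""
    z = np.zeros(W.shape[0])
    states = []
    for x in np.asarray(xs, float).reshape(len(xs), -1):
        z = np.tanh(W_in @ x + W @ z)
        states.append(z.copy())
    return np.array(states)[washout:]

def train_readout(states, targets, ridge=1e-6):
    """Only the readout is trained: ridge regression from states to targets
    (targets must be aligned with the states remaining after the washout)."""
    A = states.T @ states + ridge * np.eye(states.shape[1])
    return np.linalg.solve(A, states.T @ targets)
```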
14 Recurrent neural networks
- Mackey-Glass time series
- laser data
- Lorenz attractor
- Jaeger, Haas: Harnessing nonlinearity: predicting chaotic systems and saving energy in wireless communication. Science, 2004
15 Recurrent neural networks - Architectural bias
16 Recurrent neural networks
- Approximation completeness: RNNs can approximate every recursive system with continuous transition function and finite time horizon
- connection to recursive symbolic computation mechanisms? Can RNNs realize the dynamics of a symbolic formalism?
17 Recurrent neural networks
- Symbolic mechanisms
  - finite memory models: look only at a finite time window, i.e. f(x1, x2, ...) = f(x1, ..., xL) for a fixed L
  - finite state automata: computation based on a finite internal state
  - pushdown automata: computation based on an internal stack
  - context sensitive languages: computation in linear space
  - Turing machines
  - beyond (-> computation with real numbers!)
18 Recurrent neural networks
- RNNs with arbitrary weights: non-uniform Boolean circuits (super-Turing capability) [Siegelmann/Sontag]
- RNNs with rational weights: Turing machines [Siegelmann/Sontag]
- RNNs with limited noise: finite state automata [Omlin/Giles, Maass/Orponen]
- RNNs with Gaussian noise: finite memory models [Maass/Sontag]
19 Recurrent neural networks
- Motivation: architectural bias
  - easy: divide this form into two parts with the same size and form
  - difficult: divide this form into four parts with the same size and form
  - extremely difficult: divide this form into six parts with the same size and form
20 Recurrent neural networks
- RNNs are initialized with small weights: what is the bias?
- It holds [Hammer/Tino]:
  - small-weight RNNs -> FMMs: for every RNN with small weights one can find a finite memory length L such that the RNN can be approximated by a FMM with memory length L
  - FMMs -> small-weight RNNs: for every FMM, an RNN with randomly initialized small weights exists which approximates the FMM
  - small-weight RNNs have excellent generalization ability (distribution-independent UCED property): for RNNs with small weights, the empirical error represents the real error independently of the underlying distribution
21 Recurrent neural networks
- RNNs with arbitrary weights: non-uniform Boolean circuits (super-Turing capability) [Siegelmann/Sontag]
- RNNs with rational weights: Turing machines [Siegelmann/Sontag]
- RNNs with limited noise: finite state automata [Omlin/Giles, Maass/Orponen]
- RNNs with Gaussian noise: finite memory models [Maass/Sontag]
22 Recurrent self-organizing maps - Definition
23 Recurrent self-organizing maps
- Supervised learning
- Unsupervised learning
24 Recurrent self-organizing maps
- Self-organizing map (SOM) [Kohonen]: popular unsupervised self-organizing neural method for data mining and visualization
- network given by prototypes w_j in R^n arranged in a lattice of positions j = (j1, j2)
- mapping: x in R^n -> the position j in the lattice for which ||x - w_j|| is minimal
- Hebbian learning based on examples x_i and neighborhood cooperation:
  choose x_i, determine the winner j0 (||x_i - w_j0|| minimal), and adapt all prototypes
  w_j <- w_j + eta * nhd(j, j0) * (x_i - w_j)
  (a code sketch follows)
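A minimal sketch of one SOM update step; the Gaussian neighborhood on a rectangular lattice is an illustrative choice:

```python
import numpy as np

def som_step(prototypes, positions, x, eta=0.1, sigma=1.0):
    """One Hebbian SOM update: find the winner for x and pull all prototypes
    towards x, weighted by lattice distance to the winner.

    prototypes : (N, n) array of w_j
    positions  : (N, 2) array of lattice coordinates j = (j1, j2)
    """
    winner = np.argmin(np.linalg.norm(prototypes - x, axis=1))
    # neighborhood cooperation: Gaussian of the lattice distance to the winner
    lattice_dist = np.linalg.norm(positions - positions[winner], axis=1)
    nhd = np.exp(-lattice_dist**2 / (2 * sigma**2))
    prototypes += eta * nhd[:, None] * (x - prototypes)
    return winner
```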
25 Recurrent self-organizing maps
- Neural gas (NG) [Martinetz]: no prior lattice, adaptation according to the rank
  w_j <- w_j + eta * h(rk(w_j, x_i)) * (x_i - w_j), with h a decreasing function of the rank
- HSOM [Ritter]: hyperbolic lattice structure
  w_j <- w_j + eta * nhd_H(j, j0) * (x_i - w_j)
- but: all of these work for real vectors of fixed dimension only!
- How to deal with time series and recurrence?
26 Recurrent self-organizing maps
- Temporal Kohonen map (TKM) [Chappell/Taylor] and recurrent SOM (RSOM) [Varsta/Heikkonen] process a sequence x1, x2, x3, x4, ..., xt, ...
- TKM: d(x_t, w_i) = ||x_t - w_i||^2 + a * d(x_{t-1}, w_i), training w_i -> x_t
- RSOM: d(x_t, w_i) = ||y_t||^2 where y_t = (x_t - w_i) + a * y_{t-1}, training w_i -> y_t
27 Recurrent self-organizing maps
- TKM/RSOM compute a leaky average of the time series
- it is not clear how they can differentiate various contexts - there is no explicit context!
- (example: two different input histories yield the same leaky average)
28 Recurrent self-organizing maps
- Merge SOM (MSOM) [Hammer/Strickert, 2003]: explicit notion of context
  - every neuron carries a pair (w_j, c_j) in R^n x R^n
  - w_j represents the current entry x_t; c_j represents the context, i.e. the content of the winner of the last step
  - d(x_t, w_j) = a * ||x_t - w_j||^2 + (1 - a) * ||C_t - c_j||^2
    where C_t = beta * w_{I(t-1)} + (1 - beta) * c_{I(t-1)} ("merge"), and I(t-1) is the winner in step t-1
29 Recurrent self-organizing maps
- example (merge parameter 1/2):
  C1 = (42 + 50)/2 = 46
  C2 = (33 + 45)/2 = 39
  C3 = (33 + 38)/2 = 35.5
30 Recurrent self-organizing maps
- Training (Hebbian; a code sketch follows)
  - MSOM:
    w_j <- w_j + eta * nhd(j, j0) * (x_t - w_j)
    c_j <- c_j + eta * nhd(j, j0) * (C_t - c_j)
  - MNG:
    w_j <- w_j + eta * h(rk(w_j, x_t)) * (x_t - w_j)
    c_j <- c_j + eta * h(rk(w_j, x_t)) * (C_t - c_j)
  - MHSOM:
    w_j <- w_j + eta * nhd_H(j, j0) * (x_t - w_j)
    c_j <- c_j + eta * nhd_H(j, j0) * (C_t - c_j)
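A sketch of one merge-SOM training step without neighborhood cooperation; parameter names follow the reconstruction above and are assumptions where the slides are terse:

```python
import numpy as np

def msom_step(W, C, x, ctx, alpha=0.5, beta=0.5, eta=0.3):
    """One merge-SOM step: find the winner by the combined distance and move
    its weight towards the input and its context towards the merged context.

    W, C : (N, n) arrays of weights w_j and contexts c_j
    x    : current input x_t
    ctx  : merged context C_t computed from the previous winner
    """
    d = alpha * np.sum((W - x) ** 2, axis=1) + (1 - alpha) * np.sum((C - ctx) ** 2, axis=1)
    i = int(np.argmin(d))                      # winner I(t)
    W[i] += eta * (x - W[i])                   # w_j -> x_t
    C[i] += eta * (ctx - C[i])                 # c_j -> C_t
    # merged context for the next step: C_{t+1} = beta * w_I + (1 - beta) * c_I
    return i, beta * W[i] + (1 - beta) * C[i]

# usage over a sequence xs of shape (T, n):
# ctx = np.zeros(n)
# for x in xs:
#     winner, ctx = msom_step(W, C, x, ctx)
```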
31 Recurrent self-organizing maps
- Experiment
  - speaker identification, Japanese vowel "ae"
  - 9 speakers, 30 articulations per speaker in the training set
  - separate test set
  - http://kdd.ics.uci.edu/databases/JapaneseVowels/JapaneseVowels.html
  - each articulation is a time series of 12-dim. cepstrum vectors
32 Merge SOM
- MNG with posterior labeling
- merge parameter 0.5, a annealed 0.99 -> 0.63, learning rate 0.3
- 150 neurons: 0% training error, 2.7% test error
- 1000 neurons: 0% training error, 1.6% test error
- for comparison: rule based 5.9%, HMM 3.8% [Kudo et al.]
33 Merge SOM
- Experiment
  - Reber grammar
  - 3 * 10^6 input vectors for training, 10^6 vectors for testing
  - MNG, 617 neurons, merge parameter 0.5, a annealed 1 -> 0.57
  - evaluation on the test data
    - attach the longest unique sequence to each winner
    - 428 distinct words, average length 8.902
  - reconstruction from the map
    - backtracking of the best matching predecessor in (w, c) space
    - triplets: only valid Reber words
    - unlimited backtracking: average length 13.78
    - e.g. TVPXTTVVEBTSXXTVPSEBPVPXTVVEBPVVEB, BTXXVPXVPXVPSE, BTXXVPXVPSE
34 Merge SOM
- Experiment
  - classification of donor sites for C. elegans
  - 5 settings with 10000 training data and 10000 test data; windows of 50 nucleotides (T, C, G, A) embedded in 3 dimensions, 38% donor sites [Sonnenburg, Rätsch et al.]
  - MNG with posterior labeling
  - 512 neurons, merge parameter 0.25, learning rate 0.075, a annealed 0.999 -> 0.4 ... 0.7
  - 14.06 +/- 0.66% training error, 14.26 +/- 0.39% test error
  - sparse representation: 512 neurons x 6 dimensions
35 Recurrent self-organizing maps - Capacity
36 Recurrent self-organizing maps
- Theorem (context representation)
  - Assume
    - a SOM with merge context is given (no neighborhood)
    - a sequence x0, x1, x2, x3, ... is given
    - enough neurons are available
  - Then
    - the optimum weight/context pair for x_t is
      w = x_t,  c = sum_{i=0..t-1} beta * (1 - beta)^(t-i-1) * x_i
    - Hebbian training converges to this setting as a stable fixed point
  - Compare to the TKM
    - the optimum weights are
      w = sum_{i=0..t} (1 - a)^i * x_{t-i} / sum_{i=0..t} (1 - a)^i
    - but this is not a fixed point of TKM training
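The closed form of the optimal context follows by unrolling the merge recursion, assuming that the winner at every previous step already carries its own optimal pair and that the initial context is zero (beta denotes the merge parameter as above):

```latex
c_{\mathrm{opt}}(t) = C_t
  = \beta\, x_{t-1} + (1-\beta)\, C_{t-1}
  = \beta\, x_{t-1} + (1-\beta)\bigl(\beta\, x_{t-2} + (1-\beta)\, C_{t-2}\bigr)
  = \dots
  = \sum_{i=0}^{t-1} \beta\,(1-\beta)^{\,t-i-1}\, x_i
```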
37 Recurrent self-organizing maps
- Theorem (capacity)
  - MSOM can simulate finite automata
  - TKM cannot
  - => MSOM is strictly more powerful than TKM/RSOM!
- (figure: simulation of the transition function d on state and input, with inputs given in unary encoding, e.g. (1,0,0,0))
38 Recurrent self-organizing maps
- General recursive maps: a sequence x_t, x_{t-1}, x_{t-2}, ..., x_0 is processed entry by entry; every neuron (w, c) compares
  - the current entry via ||x_t - w||^2
  - the context via ||C_t - c||^2, where C_t summarizes the processing of x_{t-1}, x_{t-2}, ..., x_0
- The methods differ in the choice of the context!
- Hebbian learning: w -> x_t, c -> C_t
39Recurrent self-organizing maps
xt,xt-1,xt-2,,x0
(w,c)
xt w2
Ct - c2
xt
MSOM Ct merged content of the winner in the
previous time step TKM/RSOM Ct activation of
the current neuron (implicit c)
Ct
xt-1,xt-2,,x0
40 Recurrent self-organizing maps
- MSOM
  - C_t = merged content of the winner in the previous time step
- TKM/RSOM
  - C_t = activation of the current neuron (implicit c)
- Recursive SOM (RecSOM) [Voegtlin]
  - C_t = exponential transformation of the activation of all neurons:
    (exp(-d(x_{t-1}, w_1)), ..., exp(-d(x_{t-1}, w_N)))
- Feedback SOM (FSOM) [Horio/Yamakawa]
  - C_t = leaky integrated activation of all neurons:
    (d(x_{t-1}, w_1), ..., d(x_{t-1}, w_N)) + lambda * C_{t-1}
- SOM for structured data (SOMSD) [Hagenbuchner/Sperduti/Tsoi]
  - C_t = index of the winner in the previous step
- Supervised recurrent networks
  - C_t = sgd(activation), with the metric given by the dot product
41 Recurrent self-organizing maps
- ... for normalized or WTA semilinear context
42 Recurrent self-organizing maps
- Experiment
  - Mackey-Glass time series
  - 100 neurons, different lattices, different contexts
  - evaluation by the temporal quantization error:
    average over the map of (mean activity k steps into the past - observed activity k steps into the past)^2
    (a code sketch of this measure follows)
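A sketch of how this temporal quantization error can be computed from a sequence of winner indices; the exact reading of the definition (per-unit variance of the inputs observed k steps before the unit wins) is an assumption where the slide is terse:

```python
import numpy as np

def temporal_quantization_error(xs, winners, max_lag):
    """Temporal quantization error for lags k = 0..max_lag.

    xs      : (T,) or (T, n) array of inputs
    winners : (T,) array of winner indices, winners[t] belongs to xs[t]
    For each unit j and lag k, collect the inputs xs[t-k] over all t where
    unit j wins and average the squared deviation from their mean.
    """
    xs = np.asarray(xs, dtype=float)
    winners = np.asarray(winners)
    errors = []
    for k in range(max_lag + 1):
        per_unit = []
        for j in np.unique(winners):
            t_idx = np.where(winners == j)[0]
            t_idx = t_idx[t_idx >= k]          # need k steps of history
            if len(t_idx) == 0:
                continue
            past = xs[t_idx - k]               # inputs k steps into the past
            per_unit.append(np.mean((past - past.mean(axis=0)) ** 2))
        errors.append(np.mean(per_unit))
    return np.array(errors)
```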
43Recurrent self-organizing maps
SOM
quantization error
RSOM
NG
RecSOM
SOMSD
HSOMSD
MNG
now
past
44 Contextual models - Background
45 Contextual models
46 Contextual models
- time series -> sensor signals, spoken language, ...
- sequences -> text, DNA, ...
- tree structures -> terms, formulas, logic, ...
- graph structures -> chemical molecules, graphics, networks, ...
- neural networks for structures:
  - kernel methods [Haussler, Watkins et al.]
  - recursive networks [Küchler et al.]
47 Contextual models
- Recursive network
  - training: given patterns (x_i, f(x_i))
    - selection of the architecture
    - optimization of the weights
    - evaluation of the test error
  - (recursive unit with input, output, and two context inputs)
  - inputs: directed acyclic graphs over R^n with one supersource and fan-out 2
  - recursive encoding f_rec into R^c:
    f_rec(empty tree) = 0,  f_rec(a(l, r)) = f(a, f_rec(l), f_rec(r))
  - the network computes g o f_rec, mapping input structures to R^o
  (a code sketch follows)
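A sketch of the recursive encoding over binary trees; the single-layer tanh transition for f and the linear readout for g are illustrative choices:

```python
import numpy as np

class Node:
    """Binary tree node with a label in R^n and optional children."""
    def __init__(self, label, left=None, right=None):
        self.label, self.left, self.right = np.asarray(label, float), left, right

def f_rec(node, V, W_l, W_r, b, c_dim):
    """Recursive encoding: f_rec(empty) = 0, f_rec(a(l, r)) = f(a, f_rec(l), f_rec(r))."""
    if node is None:
        return np.zeros(c_dim)                       # initial context for the empty tree
    hl = f_rec(node.left, V, W_l, W_r, b, c_dim)     # encode left subtree
    hr = f_rec(node.right, V, W_l, W_r, b, c_dim)    # encode right subtree
    return np.tanh(V @ node.label + W_l @ hl + W_r @ hr + b)

def transduce(root, V, W_l, W_r, b, W_out, c_dim):
    """Supersource transduction: apply the readout g to the encoding of the root."""
    return W_out @ f_rec(root, V, W_l, W_r, b, c_dim)
```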
48 Contextual models
- Cascade correlation [Fahlman/Lebiere]
  - given data (x, y) in R^n x R, find f such that f(x) = y
  - the output weights minimize the error on the given data
  - each new hidden unit maximizes the correlation of the unit's output and the current error -> the unit can serve for error correction in subsequent steps
  - hidden units are cascaded: h_i(x) = f_i(x, h_1(x), ..., h_{i-1}(x)), etc. (sketch below)
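A sketch of the cascaded forward computation; training of the individual units (error minimization for the output, correlation maximization for new hidden units) is omitted, and the sigmoid units and parameter layout are illustrative assumptions:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cascade_forward(x, hidden_params, output_params):
    """Cascade correlation forward pass.

    hidden_params : list of (w, b); unit i sees the input x and the outputs
                    of all previously frozen units h_1, ..., h_{i-1}
    output_params : (w_out, b_out) for a linear output on x and all h_i
    """
    features = list(np.atleast_1d(x))
    for w, b in hidden_params:             # h_i(x) = f_i(x, h_1(x), ..., h_{i-1}(x))
        h = sigmoid(np.dot(w, features) + b)
        features.append(h)                 # the frozen unit feeds all later units
    w_out, b_out = output_params
    return np.dot(w_out, features) + b_out
```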
49 Contextual models
- few, cascaded, separately optimized neurons
- => efficient training
- => excellent generalization
- ... as shown e.g. for the two-spirals problem
50 Contextual models
- For trees: recursive processing of the structure, starting at the leaves and moving towards the root
- q^{-1}(h_i(v)) = (h_i(ch_1(v)), ..., h_i(ch_k(v))) gives the context (well defined since the structure is acyclic!)
- h_1(v) = f_1(l(v), h_1(ch_1(v)), ..., h_1(ch_k(v)))
- ... arbitrary recurrence between the hidden units is not possible, since the weights are frozen after adding a neuron!
- h_2(v) = f_2(l(v), h_2(ch_1(v)), ..., h_2(ch_k(v)), h_1(v), h_1(ch_1(v)), ..., h_1(ch_k(v)))
- etc.: no problem!
51 Contextual models
- Recursive cascade correlation
  - init
  - repeat
    - add h_i
    - train f_i(l(v), q^{-1}(h_i(v)), h_1(v), q^{-1}(h_1(v)), ..., h_{i-1}(v), q^{-1}(h_{i-1}(v))) on the correlation
    - train the output on the error
52 Contextual models
- Restricted recurrence allows us to also look at the parents:
  q^{-1}(h_i(v)) = (h_i(ch_1(v)), ..., h_i(ch_k(v)))
  q^{+1}(h_i(v)) = (h_i(pa_1(v)), ..., h_i(pa_k(v)))
- with full recurrence this would yield cycles; it is possible due to the restricted recurrence!
- Contextual cascade correlation:
  h_i(v) = f_i(l(v), q^{-1}(h_i(v)), h_1(v), q^{-1}(h_1(v)), q^{+1}(h_1(v)), ..., h_{i-1}(v), q^{-1}(h_{i-1}(v)), q^{+1}(h_{i-1}(v)))
53 Contextual models
- q^{+1} extends the context of h_i: with each added unit (i = 1, 2, 3, ...) the context of a vertex covers a larger part of the surrounding structure
54 Contextual models
- Experiment: QSPR problem [Micheli, Sperduti, Sona]: predict the boiling point of alkanes (in °C)
- alkanes C_nH_{2n+2}: methane, ethane, propane, butane, pentane, ...
  (example structures: hexane, 2-methylpentane)
- the boiling point grows with n and decreases with branching -> excellent benchmark
55 Contextual models
- representation of an alkane as a term (tree) over its CH3/CH2/CH groups:
  CH3(CH2(CH2(CH(CH2(CH3),CH(CH3,CH2(CH3))))))
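A small helper (hypothetical, limited to the functor(arg, ..., arg) syntax shown) that parses such a term into a nested (label, children) structure, from which a tree for a recursive network can be built:

```python
def parse_term(s):
    """Parse a term like 'CH3(CH2(CH3))' into nested (label, children) tuples."""
    s = s.strip()
    i = s.find("(")
    if i == -1:
        return (s, [])
    label, body = s[:i], s[i + 1:-1]          # strip the outer parentheses
    children, depth, start = [], 0, 0
    for j, ch in enumerate(body):             # split arguments at top-level commas
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            children.append(parse_term(body[start:j]))
            start = j + 1
    children.append(parse_term(body[start:]))
    return (label, children)

print(parse_term("CH3(CH2(CH2(CH(CH2(CH3),CH(CH3,CH2(CH3))))))"))
```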
56 Contextual models
- alkane data: 150 examples, bipolar encoding of the symbols
- 10-fold cross-validation
- number of neurons: 137 (CRecCC), 110/140 (RecCC)
- comparison to a FNN with direct encoding for restricted length [Cherqaoui/Villemin]
- codomain: [-164, 174] °C
57 Contextual models
58 Contextual models - Approximation capability
59 Contextual models
- Major problem [Giles]: RCC cannot represent all finite automata!
- (inclusion diagram of the model classes: NN, CC, RCC, RecCC, CRecCC, RNN, RecNN, and FSA)
60 Contextual models
- => RCC is strictly less powerful than RNNs, due to the restricted recurrence, when considering approximation for inputs of arbitrary size/length
- => it is not clear what we get for restricted size/length, resp. for approximation in the L1-norm
- => the restricted recurrence enables us to integrate the parents into the context, i.e. to deal with a larger set of inputs (acyclic graphs instead of trees)
61 Contextual models
- supersource transductions: the whole structure is mapped to a single real value at the supersource
- IO-isomorphic transductions: an output is produced for every vertex of the structure
62 Contextual models
- ... for L1-approximation we get [Hammer/Micheli/Sperduti]:
  - RCC is approximation complete for sequences and supersource transductions (required: a squashing function which is C1 and non-vanishing at one point)
  - RecCC with multiplicative neurons is approximation complete for tree structures and supersource transductions (required: a squashing function which is C2 and non-vanishing at one point)
  - contextual cascade correlation with multiplicative neurons is approximation complete for acyclic graphs and IO-isomorphic transductions (required: the graphs possess one supersource and satisfy a mild structural condition, and a squashing function which is C2 and non-vanishing at one point)
63 Conclusions
64 Conclusions
- Recurrent networks
  - FMM and learning bias -> alternative training mechanisms
- Recurrent self-organizing maps
  - the context defines the function and the capacity
- Contextual processing
  - general forms of recurrence open the way towards structures