Title: Neuroinformatics
1 Neuroinformatics
- Barbara Hammer, Institute of Informatics, Clausthal University of Technology
2 Problems in winter
3 Hartelijk bedankt (Dutch: thank you very much)
% meinauto = my car, kennzeichen = license plate, enthält = contains,
% holländisch = Dutch, rot = red
meinauto(X) :- left(Y,X), holländisch(Y), rot(Y).
meinauto(X) :- kennzeichen(X,Y), enthält(Y,Z), enthält(Y,14).
enthält([X|_],X).
license plate: GS-Z-14
- pattern recognition → vectors
- speech recognition, time series processing → sequences
- symbolic, general data → tree structures, graphs
4 Part 5: Symbolic structures and neural networks
- Feedforward networks
  - The good old days: KBANN and co.
  - Useful neurofuzzy systems, data mining pipeline
  - State of the art: structure kernels
- Recursive neural networks
  - The general idea: recursive distributed representations
  - One breakthrough: recursive networks
  - Going on: towards more complex structures
- Unsupervised methods
  - Recursive SOM/NG
5 The good old days: KBANN and co.
Feedforward neural network f_w: R^n → R^o
- black box
- distributed representation
- connection to rules for symbolic I/O?
6 The good old days: KBANN and co.
- Knowledge Based Artificial Neural Networks (Towell/Shavlik, AIJ 1994)
  - start with a network which represents known rules (see the sketch below)
  - train using additional data
  - extract a set of symbolic rules after training
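To make the rules-to-network step concrete, here is a minimal sketch in the spirit of KBANN (an illustration, not Towell/Shavlik's original algorithm): each propositional rule becomes a sigmoidal neuron with weights +W for positive antecedents, -W for negated ones, and a bias chosen so that the neuron fires exactly when the rule body is satisfied. The names and the value W = 4.0 are assumptions made for this sketch.

import math

W = 4.0  # large weight so the sigmoid saturates close to 0/1

def rule_to_neuron(antecedents):
    """antecedents: list of (name, positive?) pairs forming the rule body."""
    weights = {name: (W if pos else -W) for name, pos in antecedents}
    n_pos = sum(1 for _, pos in antecedents if pos)
    bias = -(n_pos - 0.5) * W   # fires only if all positive literals are on and no negated one is on
    return weights, bias

def activate(weights, bias, assignment):
    net = bias + sum(w * assignment.get(name, 0) for name, w in weights.items())
    return 1.0 / (1.0 + math.exp(-net))

# rule: output :- a, b, not c.
w, b = rule_to_neuron([("a", True), ("b", True), ("c", False)])
print(activate(w, b, {"a": 1, "b": 1, "c": 0}))  # close to 1
print(activate(w, b, {"a": 1, "b": 1, "c": 1}))  # close to 0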
7 The good old days: KBANN and co.
8 The good old days: KBANN and co.
Train with data: use some form of backpropagation, adding a penalty to the error, e.g. for changing the weights (see the sketch below).
- The initial network biases the training result, but
- there is no guarantee that the initial rules are preserved
- there is no guarantee that the hidden neurons maintain their semantics
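A minimal sketch of such a penalized error (my own illustration; the one-layer network, the squared penalty and the trade-off parameter lam are assumptions): the first term is the usual squared error, the second term punishes deviations from the rule-derived initial weights w_init.

import numpy as np

def net(X, w):
    # a single sigmoidal layer, just to keep the example self-contained
    return 1.0 / (1.0 + np.exp(-(X @ w)))

def penalized_error(w, w_init, X, y, lam=0.1):
    # squared error on the data plus a penalty for changing the inserted weights
    pred = net(X, w)
    return np.mean((pred - y) ** 2) + lam * np.sum((w - w_init) ** 2)

# the gradient of this error (derived by hand or via autograd) then drives
# the usual backpropagation / gradient descent loop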
9 The good old days: KBANN and co.
Extraction of (complete) rules:
- There is no exact, direct correspondence between a neuron and a single rule/Boolean variable.
- It is NP-complete to find a minimum logical description for a trained network (Golea, AISB'96).
- Therefore, a number of different rule extraction algorithms have been proposed, and this is still a topic of ongoing research.
10 The good old days: KBANN and co.
Extraction of (complete) rules:
- decompositional approach
- pedagogical approach
11 The good old days: KBANN and co.
- Decompositional approaches
  - subset algorithm, MofN algorithm: describe single neurons by sets of active predecessors (Craven/Shavlik, 1994); see the sketch after this list
  - local activation functions (RBF-like) allow an approximate direct description of single neurons (Andrews/Geva, 1996)
  - MLP2LN biases the weights towards 0/-1/1 during training and can then extract exact rules (Duch et al., 2001)
  - prototype-based networks can be decomposed along relevant input dimensions by decision tree nodes (Hammer et al., 2002)
- Observation
  - usually some variation of if-then rules is achieved
  - small rule sets are only achieved if further constraints guarantee that single weights/neurons have a meaning
  - tradeoff between accuracy and size of the description
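For the subset-style algorithms mentioned above, a minimal sketch for a single thresholded neuron over Boolean inputs (illustrative only; the function names are invented here, and the search is exponential in general, which is exactly why the NP-completeness remark on slide 9 matters):

from itertools import combinations

def subset_rules(weights, bias):
    """Find minimal sets S of positively weighted inputs such that
    'all of S on and all negatively weighted inputs off' forces the
    neuron to fire (net input > 0)."""
    pos = [i for i, w in enumerate(weights) if w > 0]
    neg = [i for i, w in enumerate(weights) if w < 0]
    rules = []
    for size in range(1, len(pos) + 1):
        for S in combinations(pos, size):
            if any(set(prev) <= set(S) for prev, _ in rules):
                continue                       # not minimal, skip
            if bias + sum(weights[i] for i in S) > 0:
                rules.append((S, tuple(neg)))  # antecedent: S true, neg false
    return rules

# neuron encoding roughly "x0 and x1 and not x2"
print(subset_rules([4.0, 4.0, -4.0], -6.0))    # -> [((0, 1), (2,))]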
12 The good old days: KBANN and co.
- Pedagogical approaches
  - extraction of conjunctive rules by extensive search (Saito/Nakano, 1988)
  - interval propagation (Gallant, 1993; Thrun, 1995)
  - extraction by minimum separation (Tickle/Diederich, 1994)
  - extraction of decision trees (Craven/Shavlik, 1994); see the sketch below
  - evolutionary approaches (Markovska, 2005)
- Observation
  - usually some variation of if-then rules is achieved
  - symbolic rule induction with a little (or a bit more) help from a neural network
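As an illustration of the pedagogical idea (in the spirit of decision-tree extraction à la Craven/Shavlik, but much simplified and not their TREPAN code): the trained network is queried only as a black box, and a symbolic learner is fitted to its answers. The name net_predict and the toy "network" below are hypothetical.

import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

def pedagogical_extract(net_predict, X, max_depth=3):
    """Query the trained network as an oracle and fit a decision tree to
    its answers; the tree can then be read off as if-then rules."""
    y_net = net_predict(X)            # labels produced by the network
    return DecisionTreeClassifier(max_depth=max_depth).fit(X, y_net)

# toy usage with a hypothetical 'network' that fires iff x0 + x1 > 1
X = np.random.rand(500, 3)
net_predict = lambda X: (X[:, 0] + X[:, 1] > 1).astype(int)
print(export_text(pedagogical_extract(net_predict, X)))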
13 The good old days: KBANN and co.
- What is this good for?
  - Nobody (inside neuroinformatics) uses FNNs these days?
  - Insertion of prior knowledge might be valuable, but efficient training algorithms allow one to substitute it by additional training data (generated via the rules)?
  - Validation of the network output might be valuable, but there exist alternative (good) guarantees from statistical learning theory?
  - If-then rules are not very interesting, since there exist good symbolic learners for learning propositional classification rules?
  - Propositional rule insertion/extraction is often an essential part of more complex rule insertion/extraction mechanisms.
  - One can extend it to automaton insertion/extraction for recurrent networks (basically, the transition function of the automaton is inserted/extracted this way).
  - People, e.g. in the medical domain, also want an explanation for a classification.
  - Extensions beyond propositional rules are interesting.
  - There is at least one application domain (inside neuroinformatics) where if-then rules are very interesting and not so easy to learn: fuzzy control!
14 Useful neurofuzzy systems
(figure: control loop with process, input, observation and control signal)
Fuzzy control: if (observation ∈ FM_I) then (control ∈ FM_O)
15 Useful neurofuzzy systems
Fuzzy control: if (observation ∈ FM_I) then (control ∈ FM_O)
Neurofuzzy control
Benefit: the form of the fuzzy rules (i.e. the neural architecture) and the shape of the fuzzy sets (i.e. the neural weights) can be learned from data!
16 Useful neurofuzzy systems
- NEFCON implements Mamdani control (Nauck/Klawonn/Kruse, 1994)
- ANFIS implements Takagi-Sugeno control (Jang, 1993); see the sketch below
- and many others
- Learning
  - of the rules: evolutionary or clustering
  - of the fuzzy set parameters: backpropagation or some form of Hebbian learning
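A minimal sketch of a zero-order Takagi-Sugeno system with Gaussian fuzzy sets (illustrative, not the ANFIS implementation; all parameter values are made up): rule firing strengths weight the crisp rule consequents, and since everything is differentiable, the Gaussian parameters and the consequents can be tuned by backpropagation as mentioned above.

import numpy as np

def takagi_sugeno(x, centers, widths, consequents):
    """x: input vector; one Gaussian membership per rule and input dimension;
    consequents: one crisp output value per rule (zero-order TS)."""
    memb = np.exp(-((x - centers) ** 2) / (2 * widths ** 2))  # (rules, dims)
    strength = memb.prod(axis=1)                              # rule firing strengths
    return np.dot(strength, consequents) / strength.sum()     # weighted average

# two rules over a 1-d observation, hypothetical parameters
centers = np.array([[0.0], [1.0]])
widths = np.array([[0.3], [0.3]])
consequents = np.array([-1.0, 1.0])   # control values of the two rules
print(takagi_sugeno(np.array([0.8]), centers, widths, consequents))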
17 State of the art: structure kernels
Kernel k(x, x′): just compute pairwise similarities for this complex data, using the structure information.
Data: sets, sequences, tree structures, graph structures.
18 State of the art: structure kernels
- Closure properties of kernels (Haussler; Watkins)
- Principled problems for complex structures: computing informative graph kernels is at least as hard as deciding graph isomorphism (Gärtner)
- Several promising proposals; taxonomy (Gärtner), ordered roughly from syntactic to semantic:
  - count common substructures
  - kernels derived from a probabilistic model
  - kernels derived from local transformations
19 State of the art: structure kernels
- Count common substructures (see the k-mer sketch below)
  - example with the 2-mers GA, AG, AT: GAGAGA → counts (3, 2, 0), GAT → counts (1, 0, 1), kernel value k = 3·1 + 2·0 + 0·1 = 3
  - efficient computation: dynamic programming, suffix trees
  - locality improved kernel (Sonnenburg et al.), bag of words (Joachims), string kernel (Lodhi et al.), spectrum kernel (Leslie et al.), word-sequence kernel (Cancedda et al.)
  - convolution kernels for language (Collins/Duffy; Kashima/Koyanagi; Suzuki et al.), kernels for relational learning (Zelenko et al.; Cumby/Roth; Gärtner et al.)
  - graph kernels based on paths or subtrees (Gärtner et al.; Kashima et al.), kernels for Prolog trees based on similar symbols (Passerini/Frasconi/De Raedt)
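A minimal sketch of a k-spectrum kernel as in the GA/AG/AT example above (the names are invented here; efficient implementations use suffix trees or tries rather than dictionaries):

from collections import Counter

def kmer_counts(s, k=2):
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))

def spectrum_kernel(s, t, k=2):
    """Inner product of the k-mer count vectors of the two strings."""
    cs, ct = kmer_counts(s, k), kmer_counts(t, k)
    return sum(cs[m] * ct[m] for m in cs if m in ct)

print(spectrum_kernel("GAGAGA", "GAT"))   # 3, matching the example above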
20 State of the art: structure kernels
- Derived from a probabilistic model
  - describe the data by a probabilistic model P(x), compare characteristics of P(x)
  - Fisher kernel (Jaakkola et al.; Karchin et al.; Pavlidis et al.; Smith/Gales; Sonnenburg et al.; Siolas et al.), tangent vector of log odds (Tsuda et al.), marginalized kernels (Tsuda et al.; Kashima et al.)
  - kernels of Gaussian models (Moreno et al.; Kondor/Jebara)
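For concreteness, the Fisher kernel mentioned above compares two examples via the gradient of the log-likelihood of a fitted model P(x|θ) (standard definition, recalled here rather than taken from the slide):

U_x = \nabla_\theta \log P(x \mid \theta), \qquad k(x, x') = U_x^{\top} \, I(\theta)^{-1} \, U_{x'},

where I(θ) is the Fisher information matrix (often approximated by the identity).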
21 State of the art: structure kernels
- Derived from local transformations
  - start from a local neighborhood ("is similar to"), described by a generator H
  - expand it to a global kernel
  - diffusion kernel (Kondor/Lafferty; Lafferty/Lebanon; Vert/Kanehisa)
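As a reminder (standard definition, not from the slide): the diffusion kernel turns the local generator H, e.g. the negative graph Laplacian, into a global kernel via the matrix exponential,

K_\beta = e^{\beta H} = \lim_{n \to \infty} \left( I + \tfrac{\beta H}{n} \right)^{n}, \qquad \beta > 0.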
22 State of the art: structure kernels
- Intelligent preprocessing (kernel extraction) allows an adequate integration of semantic/syntactic structure information
- This can be combined with state-of-the-art neural methods such as the SVM
- Very promising results for
  - classification of documents and text (Duffy; Leslie; Lodhi; ...)
  - detecting remote homologies for genomic sequences and further problems in genome analysis (Haussler; Sonnenburg; Vert; ...)
  - quantitative structure-activity relationships in chemistry (Baldi et al.)
23 Conclusions: feedforward networks
- propositional rule insertion and extraction are possible (to some extent)
- useful for neurofuzzy systems
- structure-based kernel extraction followed by learning with an SVM yields state-of-the-art results
- but: this is a sequential rather than a fully integrated neuro-symbolic approach
- FNNs themselves are restricted to flat data which can be processed in one shot; no recurrence
24 Recursive networks: the general idea of recursive distributed representations
- How to turn tree structures/acyclic graphs into a connectionist representation?
25 The general idea: recursive distributed representations
Recursion! A map f: R^i × R^c × R^c → R^c (input label plus two contexts) yields the encoding f_enc, where
  f_enc(ε) = 0 for the empty tree ε,
  f_enc(a(l,r)) = f(a, f_enc(l), f_enc(r)).
26 The general idea: recursive distributed representations
- encoding f_enc: (binary trees with labels in R^n) → R^c
  f_enc(ε) = 0, f_enc(a(l,r)) = f(a, f_enc(l), f_enc(r))
- decoding h_dec: R^o → (binary trees with labels in R^n)
  h_dec(0) = ε, h_dec(x) = h_0(x)( h_dec(h_1(x)), h_dec(h_2(x)) )
- based on maps f: R^{n+2c} → R^c, g: R^c → R^o, h: R^o → R^{n+2o} with h(x) = (h_0(x), h_1(x), h_2(x))
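A minimal numpy sketch of the encoding direction defined above (my own illustration; the single weight matrix and the tanh nonlinearity are assumptions): a binary tree is encoded bottom-up by repeatedly applying one map f to the label and the codes of the two subtrees.

import numpy as np

rng = np.random.default_rng(0)
n, c = 3, 5                                      # label and code dimensionality
W = rng.normal(scale=0.5, size=(c, n + 2 * c))   # parameters of f

def f(label, left_code, right_code):
    return np.tanh(W @ np.concatenate([label, left_code, right_code]))

def f_enc(tree):
    """tree: None for the empty tree, else (label, left_subtree, right_subtree)."""
    if tree is None:
        return np.zeros(c)            # f_enc(empty) = 0
    label, left, right = tree
    return f(label, f_enc(left), f_enc(right))

# encode the tree a(b, d) with random labels
a, b, d = (rng.normal(size=n) for _ in range(3))
code = f_enc((a, (b, None, None), (d, None, None)))
print(code.shape)                     # (5,)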
27 The general idea: recursive distributed representations
- recursive distributed description (Hinton, 1990)
  - general idea without a concrete implementation
- tensor construction (Smolensky, 1990)
  - encoding/decoding given by (a,b,c) → a⊗b⊗c
  - increasing dimensionality
- holographic reduced representation (Plate, 1995)
  - circular correlation/convolution
  - fixed encoding/decoding with fixed dimensionality (but potential loss of information)
  - necessity of chunking or clean-up for decoding
- binary spatter codes (Kanerva, 1996)
  - binary operations, fixed dimensionality, potential loss of information
  - necessity of chunking or clean-up for decoding
- RAAM (Pollack, 1990), LRAAM (Sperduti, 1994)
  - trainable networks, trained for the identity, fixed dimensionality
  - encoding optimized for the given training set
28 The general idea: recursive distributed representations
- Nevertheless, the results are not promising.
- Theorem (Hammer)
  - There exists a fixed-size neural network which can uniquely encode tree structures of arbitrary depth with discrete labels.
  - For every code, decoding of all trees up to height T requires O(2^T) neurons for sigmoidal networks.
  - ⇒ encoding seems possible, but no fixed-size architecture exists for decoding
29 One breakthrough: recursive networks
- Recursive networks (Goller/Küchler, 1996)
  - do not use decoding
  - combine encoding and mapping (encoding → transformation → output y)
  - train this combination directly for the given task with backpropagation through structure (see the sketch below)
  - ⇒ an efficient, data- and problem-adapted encoding is learned
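A hedged sketch of this combination using PyTorch autograd (an assumption of this illustration; the original work used explicitly derived backpropagation through structure): the encoder f and the output map g are composed and trained end-to-end on the task loss, so the gradient flows through the tree recursion.

import torch
import torch.nn as nn

n, c = 3, 8                                              # label and code dimensionality
f = nn.Sequential(nn.Linear(n + 2 * c, c), nn.Tanh())    # recursive encoder
g = nn.Linear(c, 1)                                      # output mapping

def encode(tree):
    if tree is None:
        return torch.zeros(c)
    label, left, right = tree
    return f(torch.cat([label, encode(left), encode(right)]))

# toy tree and target; autograd realizes backpropagation through structure
tree = (torch.randn(n), (torch.randn(n), None, None), None)
target = torch.tensor([1.0])
opt = torch.optim.SGD(list(f.parameters()) + list(g.parameters()), lr=0.1)

for _ in range(100):
    opt.zero_grad()
    loss = nn.functional.mse_loss(g(encode(tree)), target)
    loss.backward()
    opt.step()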
30 One breakthrough: recursive networks
- Applications
  - term classification (Goller/Küchler, 1996)
  - automated theorem proving (Goller, 1997)
  - learning tree automata (Küchler, 1998)
  - QSAR/QSPR problems (Schmitt/Goller, 1998; Bianucci/Micheli/Sperduti/Starita, 2000; Vullo/Frasconi, 2003)
  - logo recognition, image processing (Costa/Frasconi/Soda, 1999; Bianchini et al., 2005)
  - natural language parsing (Costa/Frasconi/Sturt/Lombardo/Soda, 2000, 2005)
  - document classification (Diligenti/Frasconi/Gori, 2001)
  - fingerprint classification (Yao/Marcialis/Roli/Frasconi/Pontil, 2001)
  - prediction of contact maps (Baldi/Frasconi/Pollastri/Vullo, 2002)
  - protein secondary structure prediction (Frasconi et al., 2005)
31 One breakthrough: recursive networks
- application: prediction of contact maps of proteins
(figure: amino acid sequence x1 ... x10 and its contact map; an entry such as (x2, x3) indicates whether the two residues are in contact)
32 One breakthrough: recursive networks
Results on PDB / PDBselect (Pollastri/Baldi/Vullo/Frasconi, NIPS 2002), reported as (Ct, nCt, dist. true pos.):
- threshold 6 Å: (0.71, 0.998, 0.59)
- threshold 12 Å: (0.43, 0.987, 0.55)
33 One breakthrough: recursive networks
Theory (approximation completeness): for every (reasonable) function f and every ε > 0 there exists a RecNN which approximates f up to ε (with an appropriate distance measure) (Hammer).
34 Going on: towards more complex structures
(Baldi/Frasconi, 2002)
35 Going on: towards more complex structures
Contextual cascade correlation (Micheli/Sperduti, 2003)
36 Contextual cascade correlation
Cascade correlation (Fahlman/Lebiere, 1990): given data (x, y) in R^n × R, find f such that f(x) ≈ y.
- minimize the error on the given data
- maximize the correlation of the new unit's output and the current error ⇒ the unit can serve for error correction in subsequent steps (see the sketch below)
- hidden units: h_i(x) = f_i(x, h_1(x), ..., h_{i-1}(x)), etc.
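A minimal sketch of the candidate unit's objective in standard cascade correlation (recalled here for clarity; the variable names are invented): the covariance between the candidate's output and the residual errors of the current network is maximized before the candidate is frozen and inserted.

import numpy as np

def candidate_correlation(v, E):
    """v: candidate outputs, shape (patterns,);
    E: residual errors of the current network, shape (patterns, outputs).
    Returns S = sum_o | sum_p (v_p - mean(v)) * (E_po - mean_o(E)) |."""
    v_centered = v - v.mean()
    E_centered = E - E.mean(axis=0)
    return np.abs(v_centered @ E_centered).sum()

# the candidate is trained (e.g. by gradient ascent on S) before being
# frozen and inserted; afterwards the output weights are retrained on the error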
37 Contextual cascade correlation
- few, cascaded, separately optimized neurons
- ⇒ efficient training
- ⇒ excellent generalization
- ... as shown e.g. for the two-spirals problem
38 Contextual cascade correlation
For trees: recursive processing of the structure, starting at the leaves and moving towards the root (acyclic!).
- q^{-1}(h_i(v)) = (h_i(ch_1(v)), ..., h_i(ch_k(v))) gives the context (the unit's values at the children)
- h_1(v) = f_1(l(v), h_1(ch_1(v)), ..., h_1(ch_k(v)))
- ... not possible, since weights are frozen after adding a neuron!
- h_2(v) = f_2(l(v), h_2(ch_1(v)), ..., h_2(ch_k(v)), h_1(v), h_1(ch_1(v)), ..., h_1(ch_k(v)))
- etc.: no problem!
39 Contextual cascade correlation
- cascade correlation for trees
  - init
  - repeat
    - add h_i
    - train f_i(l(v), q^{-1}(h_i(v)), h_1(v), q^{-1}(h_1(v)), ..., h_{i-1}(v), q^{-1}(h_{i-1}(v))) on the correlation
    - train the output on the error
40 Contextual cascade correlation
Restricted recurrence allows the units to also look at the parents:
- q^{-1}(h_i(v)) = (h_i(ch_1(v)), ..., h_i(ch_k(v)))
- q^{+1}(h_i(v)) = (h_i(pa_1(v)), ..., h_i(pa_k(v)))
- applying q^{+1} to the unit currently being trained would yield cycles
- applying q^{+1} to the earlier, frozen units is possible due to the restricted recurrence!
- contextual cascade correlation:
  h_i(v) = f_i(l(v), q^{-1}(h_i(v)), h_1(v), q^{-1}(h_1(v)), q^{+1}(h_1(v)), ..., h_{i-1}(v), q^{-1}(h_{i-1}(v)), q^{+1}(h_{i-1}(v)))
41 Contextual cascade correlation
q^{+1} extends the context of h_i
(figure: the context grows with i = 1, 2, 3)
42 Contextual cascade correlation
QSPR problem (Micheli/Sperduti/Sona): predict the boiling point of alkanes (in °C).
Alkanes C_nH_{2n+2}: methane, ethane, propane, butane, pentane, hexane, ..., and branched isomers such as 2-methyl-pentane.
The boiling point grows with n and depends systematically on the branching (side chains) ⇒ an excellent benchmark.
43 Contextual cascade correlation
Tree structure for alkanes, e.g.:
CH3(CH2(CH2(CH(CH2(CH3),CH(CH3,CH2(CH3))))))
44 Contextual cascade correlation
Alkane dataset: 150 examples, bipolar encoding of the symbols, 10-fold cross-validation; number of neurons 137 (CRecCC) vs. 110/140 (RecCC); compared to an FNN with direct encoding for restricted length (Cherqaoui/Villemin); codomain [-164, 174] °C.
45 Contextual cascade correlation
(figure: visualization of the contexts)
46 Contextual cascade correlation
QSAR problem (Micheli/Sperduti/Sona): predict the activity (IC50) of benzodiazepines.
- tranquillizers/analgesics: Valiquid, Rohypnol, Noctamid, ...
- function: they potentiate the inhibiting neurotransmitter GABA
- problem: cross tolerance, cross dependencies ⇒ very important to take an accurate dose
47 Contextual cascade correlation
- structure for benzodiazepines, e.g.:
  bdz(h,f,h,ph,h,h,f,h,h)
  bdz(h,h,h,ph,c3(h,h,h),h,c3(h,h,h),h,h)
  bdz(c3(h,h,o1(c3(h,h,h))),h,h,ph,h,h,n2(o,o),h,h)
48 Contextual cascade correlation
- Benzodiazepine dataset: 79 examples, bipolar encoding
- 5 test points taken from Hadjipavlou/Hansch plus additional folds
- max. number of neurons: 13-40
49 Contextual cascade correlation
50 Contextual cascade correlation
- Supersource transductions: the whole structure is mapped to a single output (e.g. a real number)
- IO-isomorphic transductions: one output per vertex
51 Contextual cascade correlation
- Approximation completeness
  - recursive cascade correlation is approximation complete for trees and supersource transductions
  - contextual recursive cascade correlation is approximation complete for acyclic graph structures (with a mild structural condition) and IO-isomorphic transductions
52 Conclusions: recursive networks
- very promising neural architectures for the direct processing of tree structures
- successful applications and mathematical background
- connections to symbolic mechanisms (tree automata)
- extensions to more complex structures (graphs) are under development
- a few approaches which achieve structured outputs
53 Unsupervised models
Self-organizing map: a network given by prototypes w_i ∈ R^n arranged on a lattice (index j = (j1, j2)); Hebbian learning based on examples x_i and neighborhood cooperation.
Online learning: iterate
- choose an example x_i
- determine the winner j0 with |x_i − w_{j0}| minimal
- adapt all w_j: w_j ← w_j + η · exp(−n(j, j0)/σ²) · (x_i − w_j)
(a minimal sketch follows below)
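A minimal numpy sketch of this online update (the learning rate, neighborhood range and map size are made-up values):

import numpy as np

def som_step(W, grid, x, eta=0.1, sigma=1.0):
    """W: prototypes, shape (neurons, dim); grid: lattice coordinates,
    shape (neurons, 2). One online SOM update for a single example x."""
    winner = np.argmin(np.linalg.norm(W - x, axis=1))
    lattice_dist = np.linalg.norm(grid - grid[winner], axis=1) ** 2
    h = np.exp(-lattice_dist / sigma ** 2)       # neighborhood cooperation
    W += eta * h[:, None] * (x - W)
    return W

# 5x5 map of 3-dimensional prototypes
grid = np.array([(i, j) for i in range(5) for j in range(5)], dtype=float)
W = np.random.rand(25, 3)
W = som_step(W, grid, np.random.rand(3))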
54 Unsupervised models
- Represent time series, e.g. the drinks of the last night
55 Old recursive SOM models
Sequence x1, x2, x3, x4, ..., xt, ...
- Temporal Kohonen map (TKM): d(x_t, w_i) = |x_t − w_i| + α · d(x_{t−1}, w_i); training: adapt w_i towards x_t
- Recurrent SOM (RSOM): d(x_t, w_i) = |y_t| where y_t = (x_t − w_i) + α · y_{t−1}; training: adapt w_i in the direction of y_t
56 Old recursive SOM models
- TKM/RSOM compute a leaky average of the time series, i.e. they are robust w.r.t. outliers
57 Old recursive SOM models
- TKM/RSOM compute a leaky average of the time series
- It is not clear how they can differentiate various contexts: there is no explicit context!
(figure: two different sequences whose leaky averages coincide, so one "is the same as" the other)
58 Merge neural gas / merge SOM
Neuron (w_j, c_j) ∈ R^n × R^n:
- explicit context, global recurrence
- w_j represents the current entry x_t
- c_j represents the context, which equals the (merged) winner content of the last time step
- distance: d(x_t, w_j) = α·‖x_t − w_j‖² + (1−α)·‖C_t − c_j‖²
- where C_t = β·w_{I(t−1)} + (1−β)·c_{I(t−1)} and I(t−1) is the winner in step t−1 (merge)
- training: adapt w_j towards x_t and c_j towards C_t (see the sketch below)
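A minimal numpy sketch of one merge-SOM step following the formulas above (parameter values are made up, neighborhood cooperation is omitted for brevity, and whether the merged context is computed before or after adapting the winner is a detail glossed over here):

import numpy as np

def msom_step(W, C, x, Ct, alpha=0.5, beta=0.5, eta=0.1):
    """W, C: weights and contexts, shape (neurons, dim); x: current entry;
    Ct: merged context from the previous winner. Returns the new merged context."""
    d = alpha * np.sum((x - W) ** 2, axis=1) + (1 - alpha) * np.sum((Ct - C) ** 2, axis=1)
    i = np.argmin(d)                        # winner
    W[i] += eta * (x - W[i])                # adapt weight towards the entry
    C[i] += eta * (Ct - C[i])               # adapt context towards the merged context
    return beta * W[i] + (1 - beta) * C[i]  # merged context for the next step

# usage over a sequence
dim, neurons = 2, 50
W, C = np.random.rand(neurons, dim), np.random.rand(neurons, dim)
Ct = np.zeros(dim)
for x in np.random.rand(200, dim):
    Ct = msom_step(W, C, x, Ct)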
59 Merge SOM
Example of the merged contexts (equal weighting):
C1 = (42 + 50)/2 = 46
C2 = (33 + 45)/2 = 39
C3 = (33 + 38)/2 = 35.5
60 Merge neural gas / SOM
- Speaker identification, Japanese vowel "ae" (UCI KDD archive)
- 9 speakers, 30 articulations each; each articulation is a time series of 12-dimensional cepstrum vectors
- MNG with 150 neurons: 2.7% test error; MNG with 1000 neurons: 1.6%; rule-based: 5.9%; HMM: 3.8%
61 Merge neural gas / SOM
Neuron (w, c) processing the sequence x_t, x_{t-1}, x_{t-2}, ..., x_0:
- w is compared to the current entry: ‖x_t − w‖²
- c is compared to the merge context C_t, the content of the winner for the past x_{t-1}, x_{t-2}, ..., x_0: ‖C_t − c‖²
- training: adapt w towards x_t, adapt c towards C_t
62 General recursive SOM
Neuron (w, c) processing x_t, x_{t-1}, x_{t-2}, ..., x_0:
- w is compared to the current entry x_t: ‖x_t − w‖²
- c is compared to a context C_t computed from the past x_{t-1}, x_{t-2}, ..., x_0: ‖C_t − c‖²
- choice of the context: RSOM/TKM: the neuron itself; MSOM: winner content; SOMSD: winner index; RecSOM: all activations
- training: adapt w towards x_t, adapt c towards C_t
63 General recursive SOM
- Experiment
  - Mackey-Glass time series
  - 100 neurons
  - different lattices
  - different contexts
  - evaluation by the temporal quantization error: average of (mean activity represented k steps into the past − observed activity k steps into the past)², taken over k
64 General recursive SOM
(figure: temporal quantization error, plotted from "now" back into the past, for SOM, NG, RSOM, RecSOM, SOMSD, HSOMSD and MNG)
65 Merge SOM
- Theory (capacity)
  - MSOM can simulate finite automata
  - TKM/RSOM cannot
  - ⇒ MSOM is strictly more powerful than TKM/RSOM!
(figure: simulation of an automaton transition δ(state, input), with inputs encoded as unit vectors such as (1,0,0,0))
66 General recursive SOM
(capacity results for the normalised WTA context)
67 Conclusions: unsupervised networks
- Recurrence for unsupervised networks
  - allows mining/visualization of temporal contexts
  - different choices of the context yield different capacity/efficiency
  - ongoing topic of research
68 Structures and neural networks
- Feedforward networks
  - rule insertion/extraction to link symbolic descriptions and neural networks
  - kernels for structures to process structured data
- Recursive networks for structure processing
  - standard recursive networks for dealing with tree-structured inputs
  - contextual recursive cascade correlation for IO-isomorphic transductions on acyclic graphs
- Unsupervised networks
  - variety of recursive SOM models with different capacity
  - relevant choice: the context representation