Title: Tutorial: Mathematical Aspects of Neural Networks
- Barbara Hammer, University of Osnabrück, Germany
- Thomas Villmann, University of Leipzig, Germany
Relevance of math for NNs
- Math is used to
- develop and present algorithms
  (linear algebra, analysis, statistics, optimization, control theory, statistical physics, differential geometry, ...)
- investigate applicability and evaluate algorithms
  (statistics, ...)
- investigate theoretical properties
  (algebra, Borsuk theorem, Christoffel symbols, differential topology, entropy, functional analysis, ...)
Relevance of math in NN history
- 1943 McCulloch/Pitts
- 1958 Rosenblatt
- 1969 Minsky/Papert
- backpropagation
- 1985 Hopfield networks for TSP
- SVM
... mathematical theory and mathematical questions established for most classical models
Classical models
- Recurrent networks
- Self-organizing maps
- Feed-forward networks
Papers in this session:
- Cottrell, Letremy: Analyzing qualitative variables using the Kohonen algorithm
- Claussen, Villmann: Magnification control in winner-relaxing neural gas
- Archambeau, Lee, Verleysen: On convergence problems of the EM algorithm for finite Gaussian mixtures
- Schiller, Steil: On the weight dynamics of recurrent learning
- Jain, Wysotzki: A neural graph algorithm based on local invariants
- Jain, Wysotzki: An associative memory for the automorphism group of structures
- Jianyu Li, Siwei Luo, Yingjian Qi: Approximation of functions by adaptively growing radial basis function neural networks
Feed-forward networks
... for classification and function approximation
1 Universal approximation ability: Does there exist an appropriate architecture for every function which is to be approximated?
2 Complexity of training: What do good error-minimization algorithms look like, and what is their complexity?
3 Learnability: Can generalization to previously unseen examples be guaranteed?
(Figure: 1 architecture, 2 optimization, 3 test; input x, output y, target y0)
1 Universal approximation (feed-forward networks)
- The perceptron solves only linearly separable problems: 1969 Minsky/Papert
- MLPs constitute universal approximators: 1989 Hornik/Stinchcombe/White, 1993 Leshno et al. (the standard network form is recalled below)
- RBF networks constitute universal approximators: 1990 Girosi/Poggio, 1993 Park/Sandberg
- SVMs constitute universal approximators: 2001 Steinwart, 2003 Hammer/Gersmann
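As a reminder (my addition, not on the original slide), the approximation results above concern networks of the standard single-hidden-layer form

\[ f_n(x) = \sum_{i=1}^{n} v_i \, \sigma\big(w_i^{\top} x + b_i\big), \]

and state that such sums are dense (uniformly on compacta, or in L^p) in the relevant function classes, for example whenever the activation \sigma is not a polynomial (Leshno et al.).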
1 Universal approximation (feed-forward networks): number of neurons, rates of convergence, ...?
- 1992 Sontag: n neurons are sufficient to interpolate n points for output dimension 1
- 1992 Jones, 1993 Barron: convergence of order 1/n for appropriate functions (recalled below)
- 1995 Girosi, 1997 Gurvits/Koiran, 1997 Kurkova/Kainen/Kreinovich, 1998 Kurkova/Savicky/Hlavackova, 2002 Lavretsky, ...
- Jianyu Li, Siwei Luo, Yingjian Qi: determine the number of neurons during training ... this session
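For orientation (my addition, following Barron 1993 and stated only roughly): if f has a Fourier transform \hat f with finite first moment C_f = \int \|\omega\| \, |\hat f(\omega)| \, d\omega, then there is a single-hidden-layer sigmoidal network f_n with n units such that

\[ \int_{B_r} \big(f(x) - f_n(x)\big)^2 \, \mu(dx) \;\le\; \frac{(2 r C_f)^2}{n}, \]

where B_r is a ball of radius r and \mu a probability measure on it; this is the "order 1/n" rate referred to above, with the constant depending on the smoothness of f but not on the input dimension.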
2 Complexity of training (feed-forward networks)
- Perceptron training is polynomial → Karmarkar algorithm
  ... investigation of the perceptron algorithm (a minimal sketch of the classical update follows below)
- SVM training is polynomial → quadratic optimization of the dual problem
  ... properties of online solutions, decomposition schemes
- MLP training is NP-hard: 1988 Blum/Rivest, 1990 Judd
  ... design of alternative learning algorithms, investigation of more realistic scenarios
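To make the perceptron algorithm mentioned above concrete, here is a minimal sketch (my illustration, not taken from the slides) of the classical Rosenblatt update on linearly separable data:

import numpy as np

def perceptron_train(X, y, max_epochs=100):
    """Classical perceptron: X of shape (n_samples, n_features), labels y in {-1, +1}.
    Converges to a separating hyperplane if the data are linearly separable."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:   # misclassified (or on the boundary)
                w += yi * xi             # Rosenblatt update
                b += yi
                errors += 1
        if errors == 0:                  # no mistakes in a full pass: done
            break
    return w, b

# toy usage: AND-like, linearly separable data
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1, -1, -1, 1])
w, b = perceptron_train(X, y)

The polynomial-time claim on the slide refers to finding a separating hyperplane by linear programming (e.g. with Karmarkar's algorithm); the online rule above instead has a mistake bound that depends on the margin of the data.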
2 Complexity of training (feed-forward networks): MLP training is NP-hard, the loading problem (a precise formulation follows below)
- 1988 Blum/Rivest: a 3-node network; 1990 Judd: networks which encode SAT
  (objections: the networks are too small or too specific, the activation function should be sigmoidal, approximate settings should be covered)
- 1995 Pinter, 1998 Hammer: more than one layer and two neurons, number of neurons related to the training-set size, varying number of hidden neurons
- 1996 Sima, 1997 Jones, 1997 Vu, 1998 Hammer: sigmoidal settings
- 1995 Hoeffgen/Simon/VanHorn, 2002 Bartlett/Ben-David, 2002 Sima, 2003 DasGupta/Hammer: approximate optimization is NP-hard in several settings, even for one neuron
- 2000 Ben-David/Simon: training a neuron optimally with a large margin is polynomial
... any other idea?
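For reference (my formulation, not from the slides), the loading problem is the decision problem: given a fixed architecture computing functions f_w and a training set (x_1, y_1), ..., (x_m, y_m), does there exist a weight vector w with

\[ f_w(x_i) = y_i \quad \text{for all } i = 1, \dots, m \, ? \]

The results above show that this is NP-hard already for very small threshold architectures, and that even approximate versions (minimizing the number of misclassified points) remain hard in several settings.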
3 Learnability (feed-forward networks)
- Statistical physics: measures the mean effects of online algorithms → Opper, ...
- Identification in the limit: a regularity can be learned exactly in the limit → Gold, ...
- PAC learnability: at least one good learning algorithm exists, i.e. valid generalization with high probability and guaranteed bounds → 1984 Valiant
- UCEM property: the empirical error converges uniformly to the real error with high probability, and guaranteed bounds can be derived → 1971 Vapnik/Chervonenkis (written out below)
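For concreteness (my addition), the UCEM property for a function class \mathcal{F} reads: for every \varepsilon > 0,

\[ P^m\Big( \sup_{f \in \mathcal{F}} \big| \hat{E}_m(f) - E(f) \big| > \varepsilon \Big) \longrightarrow 0 \quad (m \to \infty), \]

where \hat{E}_m is the empirical error on m i.i.d. examples and E the true error; PAC guarantees and explicit bounds follow by inverting such tail estimates.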
3 Learnability (feed-forward networks): statistical learning theory in three lines ...
- PAC → finite covering number
- UCEM → appropriate empirical covering numbers
- distribution-independent learnability → finite VC dimension (a classical bound is recalled below)
VC dimension of neural networks:
- 1989 Baum/Haussler: link NNs to VC theory and estimate the VC dimension of perceptron networks
- 1994 Maass: lower bound for perceptron networks
- 1992 Sontag: several ugly examples
- 1993 Macintyre/Sontag: the VC dimension of sigmoidal networks is finite
- 1995 Karpinski/Macintyre: estimate of the VC dimension of sigmoidal networks
- 1997 Koiran/Sontag: lower bound on the VC dimension of sigmoidal networks
- 1992 Haussler, ... Bartlett et al., 2002 Schmitt, ...
- VC dimension of SVMs scales with the margin → ESANN'03, luckiness framework for structural risk minimization
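One classical form of the resulting guarantee (my addition, standard VC theory, up to constants): with probability at least 1 - \delta over m i.i.d. examples, simultaneously for all f in a class of VC dimension d,

\[ E(f) \;\le\; \hat{E}_m(f) + \sqrt{\frac{d\big(\ln\tfrac{2m}{d} + 1\big) + \ln\tfrac{4}{\delta}}{m}} . \]

The results listed above determine when d is finite for a given network class and how it scales with the architecture (and, for SVMs, with the margin).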
Feed-forward networks: summary
The three questions (universal approximation ability, complexity of training, learnability) are formalized, with positive results and ongoing work.
Recurrent networks
(Architectures: NARX/TDNN, Elman, Hopfield; local/global feedback; discrete/continuous dynamics. Tasks: sequence prediction, sequence transduction, sequence generation, associative memory, optimization, binding and grouping, computation.)
1 Approximation ability and capacity
- as operators on functions (a state-space form is sketched below)
2 Complexity of training / design of training algorithms
- error optimization, Hebbian learning, energy function, stability constraints
- complexity, numerics, dynamic properties, potential
3 Learnability
→ ESANN'02
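For concreteness (my notation, not from the slides), a discrete-time recurrent network can be written as the state-space system

\[ x_{t+1} = \sigma\big(W x_t + V u_t + b\big), \qquad y_t = g(x_t), \]

so the network acts as an operator mapping input sequences (u_t) to output sequences (y_t); the approximation and capacity questions above are phrased for such operators, while associative-memory and optimization uses correspond to autonomous dynamics whose attractors encode the result.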
2 Training (recurrent networks): long-term dependencies
- 1994 Bengio/Simard/Frasconi (the gradient factorization behind this problem is recalled below)
- → design of learning algorithms and mathematical investigation of these algorithms: RTRL, BPTT, LSTM, EKF, EM approaches, recirculation, ...
- 2000 Atiya/Parlos: unification and one new approach
- Schiller/Steil ... this session
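To recall why long-term dependencies are hard (my sketch of the standard argument of Bengio et al.): with state update x_{t+1} = \sigma(W x_t + V u_t), the sensitivity of the state at time t to the state at an earlier time s factorizes as

\[ \frac{\partial x_t}{\partial x_s} \;=\; \prod_{k=s}^{t-1} \frac{\partial x_{k+1}}{\partial x_k} \;=\; \prod_{k=s}^{t-1} \operatorname{diag}\big(\sigma'(W x_k + V u_k)\big)\, W , \]

and this product of Jacobians typically vanishes or explodes exponentially in t - s, which is why gradient-based schemes such as BPTT and RTRL have difficulties latching onto long-term dependencies and why alternatives such as LSTM were proposed.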
2 Training (recurrent networks): stability
→ guarantees for global or local stability (a simple sufficient condition is sketched below)
- via linear matrix inequalities: 1997, 2000 Suykens/Vandewalle, 1999-2002 Steil et al., 2002 Liao/Chen/Sanchez, ...
- local/global stability and convergence rates for fully connected RNNs: 2001 Wersing/Beyn/Ritter, 2002 Chen/Lu/Amari, 2002 Chen, 2002 Peng/Qia/Xu, ...
- Schiller/Steil ... this session
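A much cruder check than the LMI criteria cited above, given only as an illustration (my sketch, using the usual contraction-mapping argument): if the activation has Lipschitz constant L (L = 1 for tanh) and the spectral norm of the recurrent weight matrix satisfies ||W|| * L < 1, then the state update is a contraction, so for any fixed input there is a unique, globally attractive fixed point.

import numpy as np

def sufficient_global_stability(W, lipschitz_sigma=1.0):
    """Sufficient (not necessary) condition for global stability of
    x_{t+1} = sigma(W x_t + V u + b) with fixed input u:
    contraction if ||W||_2 * L < 1, where L is the Lipschitz
    constant of sigma (1 for tanh, 1/4 for the logistic function)."""
    spectral_norm = np.linalg.norm(W, 2)   # largest singular value of W
    return spectral_norm * lipschitz_sigma < 1.0

# toy usage: a randomly drawn, weakly coupled recurrent weight matrix
rng = np.random.default_rng(0)
W = 0.5 * rng.standard_normal((10, 10)) / np.sqrt(10)
print(sufficient_global_stability(W))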
2 Training (recurrent networks): training based on stable states
- 1997 Lee: storing sequences in Hopfield-type networks (a minimal Hopfield sketch follows below)
- 2002 Welling/Hinton: mean-field Boltzmann machines
- 2002 Weng/Steil: training CLM
- 2001 Li/Lee: invariant matching
- 2002 Dang/Xu, 2002 Talavan/Yanez: TSP
- 2002 DiBlas/Jagota/Hughey: graph coloring
- ... this session: Jain/Wysotzki (graph isomorphisms; an associative memory for the automorphism group of structures), Schiller/Steil
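As a reminder of the kind of network these results build on (my illustration, not the method of any of the cited papers): a minimal binary Hopfield network with Hebbian storage and asynchronous updates, whose dynamics never increase the energy E(s) = -0.5 s^T W s and therefore settle in stable states.

import numpy as np

def hopfield_store(patterns):
    """Hebbian weight matrix for binary (+/-1) patterns; zero diagonal."""
    P = np.asarray(patterns, dtype=float)
    W = P.T @ P / len(P)
    np.fill_diagonal(W, 0.0)
    return W

def hopfield_recall(W, state, steps=200, rng=None):
    """Asynchronous updates: each single-unit update cannot increase
    the energy, so the trajectory converges to a stable state."""
    rng = rng or np.random.default_rng(0)
    s = np.array(state, dtype=float)
    for _ in range(steps):
        i = rng.integers(len(s))
        s[i] = 1.0 if W[i] @ s >= 0 else -1.0
    return s

# toy usage: store one pattern and recall it from a corrupted version
p = np.array([1, -1, 1, 1, -1, -1, 1, -1])
W = hopfield_store([p])
noisy = p.copy(); noisy[0] = -noisy[0]
print(hopfield_recall(W, noisy))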
Recurrent networks: summary
The three questions (universal approximation ability, complexity of training, learnability) are formalized, with positive results and ongoing work.
Self-organizing maps
... faithful data representation (VQ, SOM, NG, statistical approaches, ICA/PCA)
1 Training algorithms, convergence: How can reasonable training algorithms be designed? What is the objective or cost function? Does the algorithm converge to the desired states? (The standard online update rule is recalled below.)
2 Topology preservation: Does the topology of the representation match the underlying data topology?
3 Distribution representation: Is important information preserved? What is the magnification?
(Figure labels: 1 training, 2 data mining)
→ ESANN
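For reference (my notation), the standard online SOM update for a data point x reads: with winner s(x) = argmin_r ||x - w_r||,

\[ \Delta w_r \;=\; \varepsilon \, h_{\sigma}\big(r, s(x)\big)\,\big(x - w_r\big), \]

where h_\sigma is a neighborhood function on the lattice, e.g. h_\sigma(r, s) = \exp(-d(r, s)^2 / 2\sigma^2). Plain VQ updates only the winner, and NG replaces the lattice distance by the rank of w_r among all prototypes with respect to x.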
1 Training algorithms (self-organizing maps)
- iterative updates → e.g. Kushner/Clark or Ljung (stochastic approximation)
- batch updates → e.g. Geoffrey/Hinton (the batch update is written out below)
Cost function and convergence:
- VQ: ok
- NG: ok, 1993, 1994 Martinetz et al.
- SOM: no, 1992 Erwin et al.; but almost, 1999 Heskes
- convergence of SOM: yes in dimension one/two, otherwise difficulties (Cottrell, Der, Erwin, Flanagan, Fort, Herrmann, Lin, Pages, Ritter, Sadeghi, Obermayer, ...)
... this session: Archambeau/Lee/Verleysen, investigation of convergence problems of the EM algorithm
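For completeness (my notation), the batch variant alternates the winner assignment s(x_i) with the weighted-mean update

\[ w_r \;=\; \frac{\sum_i h_{\sigma}\big(r, s(x_i)\big)\, x_i}{\sum_i h_{\sigma}\big(r, s(x_i)\big)}, \]

an EM-like fixed-point iteration; this is why convergence questions for EM, as studied in the session contribution of Archambeau/Lee/Verleysen for Gaussian mixtures, are of direct interest here.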
2 Topology preservation (self-organizing maps: VQ, NG, SOM)
- 1992 Bauer/Pawelzik: topographic product
- 1997 Villmann et al.: topographic function
- 1992 Ritter et al., 1993 Heskes, 1994 Der/Herrmann, 1996 Bauer et al., 1999 Der/Herrmann/Villmann: mathematical investigation of mismatching states
- 1997 Bauer/Villmann, 1999 Ritter: alternative or adaptive lattices
3 Distribution preservation (self-organizing maps)
Thomas will fill this area ... (the magnification law is recalled below)
... this session: Claussen/Villmann, magnification control for winner-relaxing neural gas
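For orientation (my addition): magnification describes how the asymptotic density \rho of the prototypes relates to the data density p, typically via a power law

\[ \rho(w) \;\propto\; p(w)^{\alpha}, \]

with, for example, \alpha = d/(d+2) for vector quantization minimizing the squared error in d dimensions and \alpha = 2/3 for the one-dimensional SOM; magnification control modifies the learning rule so as to steer \alpha, e.g. towards \alpha = 1, an information-theoretically faithful representation.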
4 Recent developments (self-organizing maps): extension of SOM to general domains
- 2001 Kaski/Sinkkonen, 2002 Hammer/Strickert/Villmann: SOM/NG with adaptive metric
- 2002 Hagenbuchner/Sperduti/Tsoi, 2002 Voegtlin, 2003 Hammer/Micheli/Sperduti: SOM for sequences and structures
- 2001 Kohonen: SOM for discrete objects
- ... this session: Cottrell/Letremy, SOM for contingency analysis
Self-organizing maps: summary
Convergence, topology preservation, and distribution preservation are formalized.
Finale
- Theorem: You need at least two neurons to follow this talk.
- Proof: By contradiction. Assume you had only one neuron.
- Then you couldn't understand the following...
- If neither Thomas nor Barbara drinks beer, the idea for the special session will be good.
- If Thomas drinks beer and Barbara does not, the idea for the session will not be good.
- If Barbara drinks beer and Thomas does not, the idea for the special session will not be good.
- If both Barbara and Thomas drink beer, the idea for the special session will be good.
- ...because it includes XOR, which is not solvable with one neuron.
- Hence you couldn't follow the last proof in this talk.
- And since the last proof is deeply connected to the other 66 slides of the talk, you couldn't follow the talk.
Finale
- We need math to
- develop and present algorithms
- investigate applicability and evaluate algorithms
- investigate theoretical properties
- but math
- is often limited to simple questions (... how many neurons are sufficient?)
- is possibly not applicable (... who drank beer?)
- does not fit in all details (... we don't have 66 slides.)
1 Approximation ability (recurrent networks)
- Partially recurrent networks constitute universal approximators: 1992 Sontag, 1993 Funahashi/Nakamura, 2002 Back/Chen
- They show rich dynamic behavior: 1991 Wang, 2002 Tino et al., ... Pasemann, Haschke, ...
- They include automata, Turing machines, non-uniform circuits: Omlin, Giles, Carrasco, Forcada, Siegelmann, Sontag, Kilian, ...
- Hopfield networks can minimize polynomials, the number of stable patterns can be estimated, various extensions
3 Learnability (recurrent networks)
VC dimension of RNNs:
- 1997 Koiran/Sontag: in the general setting, the VC dimension depends on the maximum length of the input sequences
- 2003 Hammer/Tino: finite for small weights
Covering numbers or entropy numbers?
- 1999, 2001 Hammer: one can achieve distribution-dependent or posterior bounds (which might be very bad ...)
Non-i.i.d. data?
- 1993 Nobel/Dembo: finite VC dimension and finite mixing coefficients are sufficient
- 2001 Vidyasagar: ... is working on nice alternatives and generalizations