Title: Generalization and Systematicity in Echo State Networks
1. Generalization and Systematicity in Echo State Networks
- Stefan Frank
- Institute for Logic, Language and Computation
- University of Amsterdam
- The Netherlands
- Michal Cernanský
- Institute of Applied Informatics
- Slovak University of Technology
- Bratislava, Slovakia
2. Systematicity in language
- The ability to produce/understand some sentences is intrinsically connected to the ability to produce/understand certain others (Fodor & Pylyshyn, 1988)
- If you understand: Quokkas are cute. I eat nice food.
- ...you also understand: Quokkas are nice food. I eat cute quokkas. (and many more...)
- ...unless you learned (a bit of) English by memorizing a phrase book
3. Systematicity and connectionism: Fodor & Pylyshyn (1988)
- A compositional symbol system is needed to explain this phenomenon
- Neural networks do not provide such a system
- So connectionism cannot account for systematicity (and connectionist modelling should be abandoned)
Do neural networks learn sentences as if they memorize a phrase book, or can they display systematicity?
4. Systematicity and connectionism
- Systematicity in language is just like generalization in neural networks
- Do neural networks generalize to the same extent as people do?
- Hadley (1994):
- People display strong systematicity: words that have only been observed in one grammatical position (e.g., quokkas as a subject noun) can be generalized to new positions (e.g., quokkas as the object of eat)
- Connectionist models of sentence processing have not been shown to generalize in this way (note: in 1994)
5. Systematicity and connectionism
- Standard approach in connectionist modelling of sentence processing:
- Small, artificial language
- Random sampling of many sentences for training
- Simple recurrent network (SRN; Elman, 1990) trained on next-word prediction
- Test on new sentences
- Because of the large random sample, each word will have occurred in each legal position → no test for strong systematicity
- Even when SRN systematicity has been claimed:
- Excessive training → not psychologically realistic
- Training details were crucial → no robust outcomes
6. Echo state networks
- SRNs require slow, iterative training (e.g., backprop)
- Echo state network (ESN; Jaeger, 2001):
- Train only output connections
- One-shot learning by linear regression (see the sketch below)
- No training parameters
[Diagram: simple recurrent network vs. echo state network, each mapping input (words) through a recurrent layer to output (word predictions)]
Can ESNs display strong systematicity in sentence processing?
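To make the training scheme concrete, here is a minimal ESN sketch with a readout fitted in one shot by linear regression. The layer sizes, weight ranges, and spectral-radius value are illustrative assumptions, not the settings used in these simulations.

```python
import numpy as np

# Minimal ESN sketch: input and recurrent weights stay fixed and random;
# only the readout (output connections) is learned.
n_in, n_res, n_out = 26, 100, 26                    # one-hot words in, word predictions out
rng = np.random.default_rng(0)

W_in  = rng.uniform(-0.5, 0.5, (n_res, n_in))       # fixed input weights
W_res = rng.uniform(-0.5, 0.5, (n_res, n_res))      # fixed recurrent weights
W_res *= 0.9 / np.max(np.abs(np.linalg.eigvals(W_res)))  # keep spectral radius < 1

def run_reservoir(inputs):
    """Collect reservoir states for a sequence of one-hot word vectors."""
    x = np.zeros(n_res)
    states = []
    for u in inputs:
        x = np.tanh(W_in @ u + W_res @ x)
        states.append(x.copy())
    return np.array(states)

def train_readout(states, targets):
    """One-shot learning: ordinary least squares from states to next-word targets."""
    W_out, *_ = np.linalg.lstsq(states, targets, rcond=None)
    return W_out.T                                   # shape (n_out, n_res)

# At each step, the network's next-word estimate is then W_out @ x.
```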
7. Simulations: The language
- 26 words
- 12 plural nouns (3 females, 3 males, 6 animals)
- 10 transitive plural verbs
- 2 prepositions
- 1 relative-clause marker: that
- 1 end-of-sentence marker: end
- Sentence types (a toy generator sketch follows this list)
- Simple N V N: girls see boys end
- Prepositional phrase: girls see boys with quokkas end
- Subject-relative clause: girls that see boys like quokkas end
- Object-relative clause: girls that boys see like quokkas end
- Multiple embeddings: girls that see boys that quokkas like avoid elephants end
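As a quick illustration of these sentence types, here is a toy generator. The word lists are an illustrative subset of the 26-word lexicon, and only "with" is named as a preposition on the slide; everything else about the grammar is an assumption.

```python
import random

# Toy generator for the sentence types listed above (lexicon is a small subset).
NOUNS = ["girls", "women", "boys", "men", "quokkas", "elephants"]
VERBS = ["see", "like", "avoid"]

def simple():    # Simple N V N
    return [random.choice(NOUNS), random.choice(VERBS), random.choice(NOUNS)]

def with_pp():   # Prepositional phrase attached to the object
    return simple() + ["with", random.choice(NOUNS)]

def with_src():  # Subject-relative clause: "girls that see boys like quokkas"
    return [random.choice(NOUNS), "that", random.choice(VERBS),
            random.choice(NOUNS), random.choice(VERBS), random.choice(NOUNS)]

def with_orc():  # Object-relative clause: "girls that boys see like quokkas"
    return [random.choice(NOUNS), "that", random.choice(NOUNS),
            random.choice(VERBS), random.choice(VERBS), random.choice(NOUNS)]

def sentence():
    return " ".join(random.choice([simple, with_pp, with_src, with_orc])() + ["end"])

print(sentence())   # e.g. "girls that boys see like quokkas end"
```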
8. Simulations: Training and test sentences
- For training: 5,000 sentences; all females are subjects and all males are objects
- For testing: new sentences with one subject-relative clause (SRC) or one object-relative clause (ORC)
- SRC1: girls that like boys see men end
- SRC2: girls like boys that see men end
- ORC1: girls that women like see men end
- ORC2: girls like boys that women see end
- Mere generalization: 10,759 sentences with female subjects and male objects (as during training)
9. Simulations: Training and test sentences
- For training: 5,000 sentences; all females are subjects and all males are objects
- For testing: new sentences with one subject-relative clause (SRC) or one object-relative clause (ORC)
- SRC1: boys that like girls see women end
- SRC2: boys like girls that see women end
- ORC1: boys that men like see women end
- ORC2: boys like girls that men see end
- Mere generalization: 10,759 sentences with female subjects and male objects (as during training)
- Strong systematicity: 10,800 sentences with male subjects and female objects (unlike during training)
10. Simulations: Rating performance
- Performance
- The output vector is the network's estimate of next-word probabilities
- The true probability distribution follows from the grammar
- The cosine between the two is the measure of network performance (see the sketch below)
- Baseline
- Take all n-gram models (based on training sentences), from n = 1 to the number of words in the sentence so far
- The one that performs best (at each point in each test sentence) is the baseline
- To be considered systematic, the ESN should generally perform better than the best n-gram model
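A minimal sketch of this performance measure: the cosine between the network's output and the grammar's true next-word distribution, with the baseline taken as the best score among the available n-gram models. The example vectors and the way the n-gram distributions are obtained are assumptions for illustration.

```python
import numpy as np

def cosine(a, b):
    """Cosine between the network's output and the true next-word distribution."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def baseline(ngram_distributions, true_probs):
    """Best cosine achieved by any n-gram model at this point in the sentence."""
    return max(cosine(p, true_probs) for p in ngram_distributions)

# Made-up example over a 4-word vocabulary:
network_output = np.array([0.10, 0.55, 0.25, 0.10])   # network's estimate
true_probs     = np.array([0.00, 0.50, 0.50, 0.00])   # follows from the grammar
print(cosine(network_output, true_probs))              # closer to 1 is better
```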
11. ESN results: Generalization
- The ESN generally outperforms n-gram models when
testing for mere generalization
12. ESN results: Systematicity
- The ESN often performs much worse than n-gram
models when testing for strong systematicity
13. Improving ESN performance
- Old solution (Frank, 2006): add a layer of units → iterative training needed
- New solution: use informative rather than random word representations
- Let the representation of word i (i.e., input weight vector w_i) encode co-occurrence info (see the sketch below)
- Efficient (one-shot, non-iterative)
- Unsupervised (not task-dependent)
- Captures paradigmatic relations (representations of words from the same syntactic category tend to cluster together)
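A sketch of the idea behind these informative representations: build each word's input vector from co-occurrence counts over the training sentences. The window size and normalization used here are assumptions; the exact transformation in the simulations may differ.

```python
import numpy as np

def cooccurrence_vectors(sentences, vocab, window=1):
    """One-shot, unsupervised word vectors from co-occurrence counts.

    The rows of the returned matrix can replace the ESN's random input
    weight vectors w_i (the normalization choice here is an assumption).
    """
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for sent in sentences:
        words = sent.split()
        for i, w in enumerate(words):
            lo, hi = max(0, i - window), min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[index[w], index[words[j]]] += 1
    # Row-normalize so each word's vector sums to 1; words from the same
    # syntactic category end up with similar rows (paradigmatic clustering).
    return counts / counts.sum(axis=1, keepdims=True).clip(min=1)
```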
14. ESN results: Generalization
- The standard ESN and the ESN with informative word representations perform similarly when tested for mere generalization
15. ESN results: Systematicity
- The ESN with informative word representations generally outperforms both n-gram models and the standard ESN when tested for systematicity
- Strong systematicity without iterative training