Title: Simulation of Language Acquisition
1. Simulation of Language Acquisition
- Walter Daelemans
- (CNTS, University of Antwerp)
- walter.daelemans@ua.ac.be
- http://www.cnts.ua.ac.be/walter
- EMLAR 2005, Utrecht
2. Overview
- Theories, computational models and simulations
- Machine Learning
- Generalization versus abstraction
- Eager versus lazy learning
- Memory-based models of language acquisition and processing
- Case Study 1: Stress acquisition
- TiMBL crash course and demonstration
- Case Study 2: German plural
3. Simulation (1)
- Theory
- Explains and predicts empirical data (observations, experimental results)
- Cogsci: in terms of a knowledge representation, acquisition, and processing framework
- Problems
- Verbal
- Sometimes vague, underspecified
- "Every theoretical description, however exact, turns out to contain errors when you try to implement it" (Hugo Brandt Corstius, second law of Computational Linguistics)
4. Simulation (2)
- Computational Model
- Translation of a theory into a specific symbol representation and processing framework (algorithms and data structures)
- Advantages
- Precise formulation
- Explicit in all details
- Consistency and completeness can sometimes be proven
- Falsifiable through simulations
- Simulations
- A computational model with specific parameter settings, used to mimic specific empirical data
5. Machine Learning as a model for acquisition
- Cognitive architecture
- Competence (knowledge representation)
- Performance (search)
- Acquisition (search)
- Bias
- Restrictions on input and output representations
- Restrictions on learning algorithm
- Restrictions on knowledge representation formalism
6. (Figure: learning architecture. Experience feeds a Learning Component that, guided by Bias, searches a space of candidate representations (Ri, Rj, Rk, Rl); the selected representation is used by a Performance Component to map Input to Output.)
7. Generalisation ≠ Abstraction
(Figure: approaches placed along two axes, +/- generalisation and +/- abstraction. Rule induction, connectionism, statistics, and handcrafting ("fill in your most hated linguist here") combine generalisation with abstraction; Memory-Based Learning generalises without abstraction; table lookup does neither.)
8. Nativism ≠ Rule-Based
(Figure: approaches placed along a nativist-empiricist axis and a +/- rule-based axis. Nativist and rule-based: innate mental rules. Nativist, not rule-based: hard-wired neural networks, innate probabilities?, innate exemplars? Empiricist and rule-based: rule induction. Empiricist, not rule-based: connectionism, statistics, Memory-Based Learning.)
9. Machine Learning crash course
- "The field of machine learning is concerned with the question of how to construct computer programs that automatically learn with experience." (Mitchell, 1997)
- Dynamic process: learner L shows improvement on task T after learning.
- Getting rid of programming.
- Handcrafting versus learning.
- Machine Learning is task-independent.
10. Machine Learning: Roots
- Information theory
- Artificial intelligence
- Pattern recognition
- Took off during the 1970s
- Major algorithmic improvements during the 1980s
- Forking: neural networks, data mining
11. Machine Learning: 2 types
- Theoretical ML (what can be proven to be learnable, and by what?)
- Gold: identification in the limit
- Valiant: probably approximately correct (PAC) learning
- Empirical ML (on real or artificial data)
- Evaluation Criteria
- Accuracy
- Quality of solutions
- Time complexity
- Space complexity
- Noise resistance
12. Empirical ML: Key Terms 1
- Instances: individual examples of input-output mappings of a particular type
- Input consists of features
- Features have values
- Values can be
- Symbolic (e.g. letters, words, ...)
- Binary (e.g. indicators)
- Numeric (e.g. counts, signal measurements)
- Output can be
- Symbolic (classification: linguistic symbols, ...)
- Binary (discrimination, detection, ...)
- Numeric (regression)
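To make the terminology concrete, here is a minimal Python sketch of one symbolic instance. The feature names and the Dutch diminutive example are purely illustrative (loosely inspired by the -etje/-kje task that appears later in these slides), not the exact encoding used there:

    # One instance: a tuple of symbolic feature values plus an output class.
    instance = {
        "features": ("I", "N"),   # hypothetical: nucleus and coda of the last syllable
        "class": "-kje",          # output class: the diminutive suffix
    }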
13. Empirical ML: Key Terms 2
- A set of instances is an instance base
- Instance bases come as labeled training sets or unlabeled test sets (you know the labeling, the learner does not)
- An ML experiment consists of training on the training set, followed by testing on the disjoint test set
- Generalization performance (accuracy, precision, recall, F-score) is measured on the output predicted for the test set
- Splits in train and test sets should be systematic: n-fold cross-validation
- 10-fold CV
- Leave-one-out testing
- Significance tests on pairs or sets of (average) CV outcomes
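A minimal sketch of the n-fold cross-validation procedure described above (plain Python, illustrative only; train_and_test stands in for whatever learner and evaluation metric you use):

    import random

    def cross_validate(instances, train_and_test, n_folds=10, seed=1):
        """Each fold serves once as test set; the remaining folds form the training set."""
        data = list(instances)
        random.Random(seed).shuffle(data)
        folds = [data[i::n_folds] for i in range(n_folds)]
        scores = []
        for i, test_set in enumerate(folds):
            train_set = [x for j, fold in enumerate(folds) if j != i for x in fold]
            scores.append(train_and_test(train_set, test_set))
        return sum(scores) / len(scores)   # average generalization performance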
14. Empirical ML: 2 Flavors
- Eager
- Learning
- abstract model from data
- Classification
- apply abstracted model to new data
- Lazy
- Learning
- store data in memory
- Classification
- compare new data to data in memory
15. Eager vs Lazy Learning
- Eager
- Decision tree induction
- CART, C4.5
- Rule induction
- CN2, Ripper
- Hyperplane discriminators
- Winnow, perceptron, backprop, SVM
- Probabilistic
- Naïve Bayes, maximum entropy, HMM
- (Hand-made rulesets)
- Lazy
- k-Nearest Neighbour
- MBL, AM
- Local regression
16. Rule Induction
(Figure: an induced decision structure for the Dutch diminutive, choosing between -etje and -kje on the basis of the coda and nucleus of the last syllable.)
17. MBL
(Figure: the same -etje/-kje choice handled by memory-based learning; a new item, marked "?", described by the coda and nucleus of its last syllable, is matched against stored exemplars.)
18. Eager vs Lazy Learning
- Decision trees keep the smallest amount of informative decision boundaries (in the spirit of MDL, Rissanen, 1983)
- Rule induction keeps the smallest number of rules with the highest coverage and accuracy (MDL)
- Hyperplane discriminators keep just one hyperplane (or the vectors that support it)
- Probabilistic classifiers convert data to probability matrices
- k-NN retains every piece of information available at training time
19. Eager vs Lazy Learning
- Minimal Description Length principle
- Ockham's razor
- Length of abstracted model (covering core)
- Length of productive exceptions not covered by the core (periphery)
- The sum of both sizes should be minimal
- More minimal models are better
- "Learning = compression" dogma
- In ML, the focus has been on the length of the abstracted model, not on storing the periphery
20. Eager vs Lazy: So?
- Highly relevant to language modeling
- In language data, what is core? What is periphery?
- Often little or no noise, but productive exceptions
- (Sub-)subregularities, pockets of exceptions
- Disjunctiveness and polymorphism
- Some important elements of language have distributions other than the normal one
- E.g. word forms have a Zipfian distribution
- Hard to distinguish noise from exceptions on the basis of
- Frequency
- Typicality
21. (figure slide, no transcript)
22. ML and Natural Language
- Apparent conclusion: ML could be an interesting tool to do psycholinguistic modeling
- Next to probability theory, information theory, statistical analysis (natural allies)
- More and more annotated data available
- Skyrocketing computing power and memory
23. Case Study
- Exemplar-based acquisition of Dutch Stress
- (Durieux / Gillis / Daelemans)
24. MBL
- Use memory traces of experiences as a basis for analogical reasoning, rather than using rules or other abstractions extracted from experience and replacing the experiences.
- "This rule of nearest neighbor has considerable elementary intuitive appeal and probably corresponds to practice in many situations. For example, it is possible that much medical diagnosis is influenced by the doctor's recollection of the subsequent history of an earlier patient whose symptoms resemble in some way those of the current patient." (Fix and Hodges, 1952, p. 43)
25. MBL: Acquisition
- A language process is represented by a set of exemplars in memory
- Exemplars act as models
- Learning is incremental storage of exemplars
- Compression and metrics
- An exemplar consists of a set of (mostly symbolic) features
26. MBL: Processing
- New instances of a performance process are solved through
- Memory retrieval
- Analogical (similarity-based) reasoning
- Similarity metric
- Language(-faculty) independent
- Adaptive (feature and exemplar weighting)
27. Operationalization
- Basis: the k nearest neighbor algorithm (a minimal sketch follows below)
- store all examples in memory
- to classify a new instance X, look up the k examples in memory with the smallest distance D(X,Y) to X
- let each nearest neighbor vote with its class
- classify instance X with the class that has the most votes in the nearest neighbor set
- Choices
- similarity metric
- number of nearest neighbors (k)
- voting weights
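A minimal Python sketch of this procedure, for illustration only. It uses the plain overlap distance from the next slide and unweighted voting, not TiMBL's full machinery:

    from collections import Counter

    def overlap_distance(x, y):
        """Count the number of mismatching feature values."""
        return sum(1 for xi, yi in zip(x, y) if xi != yi)

    def knn_classify(memory, new_instance, k=1):
        """memory: a list of (feature_tuple, class_label) exemplars."""
        # Rank all stored exemplars by distance to the new instance.
        ranked = sorted(memory, key=lambda ex: overlap_distance(ex[0], new_instance))
        # Let the k nearest neighbors vote with their class.
        votes = Counter(label for _, label in ranked[:k])
        return votes.most_common(1)[0][0]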
28. The Overlap distance function
- Count the number of mismatching features
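In the usual MBL notation (standard definition, supplied here since the slide's formula is not in the transcript), the overlap distance between instances X and Y with n features is:

    \Delta(X, Y) = \sum_{i=1}^{n} \delta(x_i, y_i),
    where \delta(x_i, y_i) = 0 if x_i = y_i and 1 otherwise.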
29. The MVDM distance function
- Estimate a numeric distance between pairs of values
- "e" is more like "i" than like "p" in a phonetic task
- "book" is more like "document" than like "the" in a parsing task
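The standard MVDM definition from the memory-based learning literature (supplied here since the slide's formula is not in the transcript): the distance between two values v1 and v2 of the same feature is estimated from how differently they are distributed over the classes C:

    \delta(v_1, v_2) = \sum_{c \in C} | P(c \mid v_1) - P(c \mid v_2) |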
30. Feature weighting in the distance function
- Mismatching on a more important feature gives a larger distance
- Factor in the distance function
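Schematically (standard weighted-overlap form, not transcribed from the slide), a feature weight w_i simply scales each feature's contribution to the distance:

    \Delta(X, Y) = \sum_{i=1}^{n} w_i \, \delta(x_i, y_i)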
31. Entropy & IG Formulas
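The slide's formulas are not in the transcript; the standard definitions used for information-gain and gain-ratio feature weighting are (C is the set of classes, V_f the set of values of feature f):

    H(C) = - \sum_{c \in C} P(c) \log_2 P(c)
    IG(f) = H(C) - \sum_{v \in V_f} P(v) \, H(C \mid v)
    GR(f) = IG(f) / si(f),  with  si(f) = - \sum_{v \in V_f} P(v) \log_2 P(v)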
32. Exemplar weighting
- Scale the distance of a memory instance by some externally computed factor
- Smaller distance for good instances
- Bigger distance for bad instances
33. Distance weighting
- Relation between larger k and smoothing
- Make more distant neighbors contribute less in the class vote
- Linear inverse of distance (w.r.t. max)
- Inverse of distance
- Exponential decay
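As a sketch, the usual forms of the three voting-weight schemes named above, with d_j the distance of neighbor j and d_1, d_k the nearest and furthest of the k neighbors (the exact parameterization in TiMBL may differ slightly):

    inverse-linear:      w_j = (d_k - d_j) / (d_k - d_1)
    inverse distance:    w_j = 1 / (d_j + \epsilon)
    exponential decay:   w_j = e^{-\alpha d_j^{\beta}}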
34. Learning word stress: a case study
- Learn primary stress
- Compare MBL with PP/UG
- Match acquisition and processing data
- Durieux, G. (2003). Computermodellen en klemtoon. Fonologische Kruispunten, BICN.
- Daelemans, W., Gillis, S., and Durieux, G. (1994). The acquisition of stress: a data-oriented approach. Computational Linguistics 20: 421-451.
- Daelemans, W., Gillis, S., Durieux, G., and Van den Bosch, A. (1993). Learnability and markedness: Dutch stress assignment. In T.M. Ellison and J.M. Scobbie (Eds.), Computational Phonology. Edinburgh Working Papers in Cognitive Science, 8, pp. 157-178.
35. MBL for psychology
- Similarity metric
- Analogy engine
- Feature weighting
- Relevance assignment
- Information fusion
- Value weighting
- Implicit concept formation
- Exemplar weighting
- Recency, priming
- Distance-weighted extrapolation
- Distributions, probabilities
- Local modeling
- Heterogeneity and density
36. Dominant Linguistic Approach
- Principles and Parameters, UG
- Typology
- Acquisition
- Formalism: metrical trees, metrical grids
- Stress: prominence relations between constituents in a hierarchical structure
37. YOUPIE (Dresher & Kaye, 1990)
- Assumptions
- 11 parameters (216 languages)
- Task-specific system for learning stress (domain knowledge)
- Core grammar only
- Learning
- Cue-based parameter setting results in a grammar of stress
- Performance
- Generate tree with the grammar and algorithmically determine stress location
38. (Figure: YOUPIE learning. Primary Linguistic Data (words) feed cue-based learning, which yields a parameter setting, e.g. 1 0 1 0 0 0 0 1 1 0 1, for the UG stress grammar and its assignment rules.)
39. Parameters (with setting for Dutch)
40. MBL
- Assumptions
- Lexical storage and generalization
- Generic learning method, no task-specific linguistic knowledge
- Core and periphery
- Learning
- Based on storage of exemplars
- Performance
- Similarity-based reasoning with feature weighting on stored exemplars
41. (Figure: MBL learning. Primary Linguistic Data are stored as syllable-structure representations; for a new word, retrieval or similarity-based reasoning over the stored exemplars yields its stress pattern.)
42. YOUPIE tested
- Experimental design
- 216 languages
- 117 items per language, generated by the YOUPIE performance component (no exceptions, core only)
- For each language, a grammar is learned with the YOUPIE cue-based learning component
- Results
- For 60% of the languages, YOUPIE reconstructs the original parameter setting with which the words were generated
- For 21%, convergence is to a compatible setting
- For 19% of the languages, errors in one or more stress patterns
- Upper boundary!
- Perfect input, no exceptions to be learned
43. MBLP vs. YOUPIE
44. Discussion
- No significant quantitative difference in performance
- Clear qualitative difference
- YOUPIE: more languages perfectly learned
- MBLP: fewer errors per language
- Issues
- Real language data
- Core and periphery
- Acquisition
- Processing
45. Dutch stress
- Stress on one of the last three syllables
- Predictable, but not completely
- E.g. py-ja-ma, ca-na-da, pa-ra-plu
- Words not covered by the parameter configuration for Dutch need lexical marking with exception features (one, two, or completely idiosyncratic)
46. MBLP on Dutch data
- CELEX, 4868 monomorphemes
- Exemplar encoding schemes
- For each of the three final syllables
- S1: syllable weight (SL, L, H, SH)
- S2: nucleus and coda (complete rhymes, VC)
- S3: nucleus and coda (separate features, phonemes)
- S4: onset, nucleus, and coda (phonemes)
- Class: final, penultimate, antepenultimate
47. Results
48. Language Acquisition
- Learning rules or learning lexical items?
- Rules (Hochberg 88: Spanish; Nouveau 93: Dutch)
- Lexical learning lacks generalization capacity
- Lexical learning incompatible with acquisition data
- Imitation task
- Errors increase with irregularity
- Tendency to regularization (but irregularization occurs)
- By stress shift
- By changing the structure of the repeated word
49. Error Percentages
50. Discussion
- MBLP error correlates with markedness, like children's errors
- MBLP has a tendency for regularization, like children
- Direction of stress shifts
- Structural changes: from inspection of nearest neighbors
- Irregularization and the differences between 3- and 4-year-olds on marked patterns are hard to explain in a rule-based context
- Rule learning is not the only possible explanation for the language acquisition data
51. Adult processing
- Rule-based: stress grammar and a set of irregular words marked in the lexicon
- Known words: rule application, except when blocked by the lexicon
- Unknown words: rule application
- MBLP: lexical storage and analogy
- Known words: look-up
- Unknown words: analogy
52. Experimental set-up
- Stimuli
- Create pseudo-words and transcribe them (encoding 4)
- Have a machine learner assign stress (regular or irregular)
53. Experimental set-up
- Method
- 18 adult participants
- Reading task
- 3 independent judges, consensus
- Results
- Main effect for the regularity variable (ANOVA, p < .001): regular stress only in the regular conditions
- In all conditions, participants do the same as the model predicts (ANOVA, p < .001)
54. Results
55. Results
56. Discussion
- Adult speakers sometimes prefer marked stress patterns for non-words
- These cases are partially predictable with an MBLP model and are problematic for a rule-based model (regularization only)
- BUT
- MBLP has a significantly better match with participant behavior in the regular conditions
- Hypothesis: differences between the mental lexicon and CELEX
- A set-up with a population of machine learners, each using a different sample from CELEX, explains the variability
57. Summary
- Goal: put MBLP to the test on a concrete linguistic problem of sufficient complexity, by comparing it to
- Linguistic theory
- Child language acquisition data
- Adult processing data
- Results
- MBLP and YOUPIE (PP/UG) are comparable
- MBLP can learn the core as well as the periphery, using superficial representations
- MBLP shows the same errors and tendencies as children learning stress placement
- MBLP is a better predictor of human adult behaviour with non-words
58. Overall Conclusion
- Exemplar-based models should be taken as a serious alternative to rule-based / PP / UG / dual-route type theories
- Workable operationalisation of analogy
- Adequacy
- Similar results in morphology and syntax (grammatical relations, chunking, PP-attachment)
- We'll see ...
59. Simulation with TiMBL
- Demonstration: German plural
60. TiMBL (http://ilk.uvt.nl/timbl)
- Tilburg Memory-Based Learner
- Available for research and education
- Lazy learning, extending k-NN and IB1
- Optimized search for nearest neighbors
- Internal structure: a tree, not a flat instance base
- Tree ordered by the chosen feature weight
- Many built-in optional metrics: feature weights, distance function, distance weights, exemplar weights, ...
61. Current practice
- Default TiMBL settings
- k=1, Overlap, GR, no distance weighting
- Work well for some morpho-phonological tasks
- Rules of thumb
- Combine MVDM with a bigger k
- Combine distance weighting with a bigger k
- Very good bet: higher k, MVDM, GR, distance weighting (an example command follows below)
- Especially for sentence- and text-level tasks
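For example, the "very good bet" above would translate into a command line roughly like the following (file names are placeholders; the flags are the ones documented on the next slides: -mM for MVDM as global metric, -w 1 for gain ratio, -k for the number of neighbors, -d ID for inverse-distance weighting):

    Timbl -f train.data -t test.data -mM -w 1 -k 5 -d ID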
62. usage: Timbl -f data-file -t test-file [options]
- Algorithm and Metric options
- -a n : algorithm
- 0 or IB1 : IB1 (default)
- 1 or IG : IGTree
- 2 or TRIBL : TRIBL
- 3 or IB2 : IB2
- 4 or TRIBL2 : TRIBL2
- -m s : use feature metrics as specified in string s
- format: GlobalMetric:MetricRange:MetricRange
- e.g. -mO:N3:I2,5-7
- D : Dot product (Global only; numeric features implied)
- O : weighted Overlap (default)
- M : Modified value difference
- N : numeric values
- I : Ignore named values
63. (continued)
- -w 0 : No Weighting
- 1 : Weight using GainRatio (default)
- 2 : Weight using InfoGain
- 3 : Weight using Chi-square
- 4 : Weight using Shared Variance
- f : use Weights from file 'f'
- -b n : number of lines used for bootstrapping (IB2 only)
- -d val : weight neighbors as a function of their distance
- Z : all the same weight (default)
- ID : Inverse Distance
- IL : Inverse Linear
- ED:a : Exponential Decay with factor a (no whitespace!)
- ED:a:b : Exponential Decay with factor a and b (no whitespace!)
- -k n : k nearest neighbors (default n = 1)
64. (continued)
- -q n : TRIBL threshold at level n
- -L n : MVDM threshold at level n
- -R n : solve ties at random with seed n
- -t f : test using file 'f'
- -t leave_one_out : test with Leave One Out, using IB1
- -t cross_validate : Cross Validate Test, using IB1
- -t @f : test using files and options described in file 'f'
- Supported options: d e F k m o p q R t u v w x
- -t <file> is mandatory
65. Input options
- -f f : read from Datafile 'f'
- -f f : OR use filenames from 'f' for CV test
- -F format : assume the specified input format (Compact, C4.5, ARFF, Columns, Binary, Sparse)
- -l n : length of Features (Compact format only)
- -i f : read the InstanceBase from file 'f' (skips phases 1 & 2)
- -u f : read value_class probabilities from file 'f'
- -P d : read data using path 'd'
- -s : use exemplar weights from the input file
- -s0 : silently ignore the exemplar weights from the input file
66. Output options
- -e n : estimate time until n patterns tested
- -I f : dump the InstanceBase in file 'f'
- -n f : create names file 'f'
- -p n : show progress every n lines (default p = 100,000)
- -U f : save value_class probabilities in file 'f'
- -V : show VERSION
- +v or -v level : set or unset verbosity level, where level is
- s : work silently
- o : show all options set
- f : show Calculated Feature Weights (default)
- p : show MVD matrices
- e : show exact matches
- as : show advanced statistics (memory consuming)
- cm : show Confusion Matrix
- cs : show per Class Statistics (implies +vas)
- di : add distance to output file
- db : add distribution of best matched to output file
- k : add a summary for all k neighbors to output file (sets -x)
67. (continued)
- -W f : save current Weights in file 'f'
- +% or -% : do or don't save test result (%) to file
- -o s : use s as output filename
- -O d : save output using path 'd'
- Internal representation options
- -B n : number of bins used for discretization of numeric feature values
- -c n : clipping frequency for prestoring MVDM matrices
- -D : don't store distributions (saves memory, but disables the +vDB option)
- +H or -H : write hashed trees (default +H)
- -M n : size of MaxBests array
- -N n : number of features (default 2500)
- -T n : ordering of the Tree
- DO : none
- GRO : using GainRatio
- IGO : using InformationGain
- (and many others)
- +x or -x : do or don't use the exact match shortcut
68. Data Representation
- Symbolic features
- segmental information (syllable structure)
- stress
- gender
- German Plural (ca. 25,000 nouns from CELEX)
- Vorlesung (lecture): l e - z U N F en
- Classes: e, (e)n, s, er, -, U-, Uer, Ue
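The Vorlesung line above suggests how each noun becomes a training instance: syllable-structure features plus gender, with the plural suffix as the class. In a comma-separated (C4.5-style) TiMBL input file, that instance might look like the following line; the exact field layout is an assumption for illustration, not taken from the slides:

    l,e,-,z,U,N,F,en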
69. Cognitive Architectures of Inflectional Morphology: Dual Route
- Dual Route (Pinker, Clahsen, Marcus, ...)
- Rules for regular cases
- (over)generalization
- default behaviour
- Associative memory for exceptions
- irregularization / family effects
- Single Route (RM, MacWhinney, Plunkett, Elman, ...)
- Frequency-based regularity
(Figure: dual-route architecture. Input features feed both a pattern associator (associative memory) and a rule component; memory lookup yields the suffix class, and on failure the default rule applies.)
70. German Plural
- Notoriously complex, but routinely acquired (at age 5)
- Evidence for Dual Route?
- -s suffix is default/regular (novel words, surnames, acronyms, ...)
- -s suffix is infrequent (least frequent of the five most important suffixes)
71. (figure slide, no transcript)
72. The default status of -s
- Similar item missing: Fnöhk-s
- Surname, product name: Mann-s
- Borrowings: Kiosk-s
- Acronyms: BMW-s
- Lexicalized phrases: Vergissmeinnicht-s
- Onomatopoeia, truncated roots, derived nouns, ...
73. (figure slide, no transcript)
74. Discussion
- Three classes of plurals: ((-en, -), (-e, -er)), (-s)
- The first four suffixes seem regular and can be accurately learned using information from phonology and gender
- -s is learned reasonably well, but information is lacking
- Hypothesis: more features are needed (syntactic, semantic, meta-linguistic, ...) to enrich the lexical similarity space
- No difference in accuracy and speed of learning with and without Umlaut
- Overall generalization accuracy very high: 95% (90%)
- Schema-based learning (Köpcke)
- Example instance: ,,,,i,r,M e
75. (figure slide, no transcript)
76. (figure slide, no transcript)
77. Acquisition Data: Summary of previous studies
- Existing nouns
- (Park 78; Veit 86; Mills 86; Schamer-Wolles 88; Clahsen et al. 93; Sedlak et al. 98)
- Children mainly overapply -e or -(e)n
- -s plurals are learned late
- Novel words
- (Mugdan 77; MacWhinney 78; Phillis & Bouma 80; Schöler & Kany 89)
- Children inflect novel words with -e or -(e)n
- More irregular plural forms produced than defaults
78. MBLP simulation
- The model overapplies mainly -en and -e
- -s is learned late and imperfectly
- Mainly, but not completely, parallel to input frequency (more -s overgeneralization than -er overgeneralization)
79. Bartke, Marcus & Clahsen (1995)
- 37 children, age 3.6 to 6.6
- pictures of imaginary things, presented as neologisms
- names or roots
- rhymes of existing words or not
- choice: -en or -s
- Results
- children are aware that unusual-sounding words require the default
- children are aware that names require the default
80. MBLP simulation
- Sort CELEX data according to rhyme
- Compare overgeneralization
- to -en versus to -s
- as a percentage of the total number of errors
- Results
- when new words don't rhyme, more errors are made
- overgeneralization to -en drops below the level of overgeneralization to -s
81. Conclusions
- Computational models in language acquisition shouldn't necessarily be connectionist
- From rule induction to exemplar-based models
- TiMBL may be useful as software for computational psycholinguistics