Title: Lambert Schomaker
1 KI2-5: Grammar inference
Kunstmatige Intelligentie / RuG
2 Grammar inference (GI)
- methods aimed at uncovering the grammar which underlies an observed sequence of tokens
- two variants:
  - explicit, formal GI → deterministic token generators
  - implicit, statistical GI → stochastic token generators
3 Grammar inference
- AABBCCAA..(?).. what's next?
- ABA → 1A 1B 1A
- AABBAA → 2A 2B 2A
- or:
  - AAB (mirror-symmetric)
  - (2A B) (mirrored)
- operations: repetition, mirroring, insertion, substitution
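A minimal Python sketch of these two kinds of description (the function names and encodings are mine, not from the slides): run-length coding yields the kA kB kA form, and a reversal test detects mirror symmetry.

```python
from itertools import groupby

def run_lengths(s):
    """Run-length description of a token string: 'AABBAA' -> [(2, 'A'), (2, 'B'), (2, 'A')]."""
    return [(len(list(g)), t) for t, g in groupby(s)]

def is_mirror_symmetric(s):
    """True if the string reads the same forwards and backwards, e.g. 'AABBAA'."""
    return s == s[::-1]

print(run_lengths("AABBAA"))          # [(2, 'A'), (2, 'B'), (2, 'A')]
print(is_mirror_symmetric("AABBAA"))  # True
print(is_mirror_symmetric("AABBCC"))  # False
```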
4 Strings of tokens
- DNA: ACTGAGGACCTGAC
- output of speech recognizers
- words from an unknown language
- tokenized patterns in the real world
6 Strings of tokens (continued)
Example: A B A → Symm(B, A)
7 GI
- induction of structural patterns from observed data
- representation by a formal grammar
- versus: emulating the underlying grammar without making the rules explicit (NN, HMM)
8 GI, the engine
Data → [Grammar Induction] → Grammatical rules
- aaabbb → (seq (repeat 3 a) (repeat 3 b))
- ab → (seq a b)
- abccba → (symmetry (repeat 2 c) (seq a b))
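To make the rule notation concrete, here is a minimal Python sketch of a generator that expands such expressions back into the observed strings. The tuple encoding, and the convention that (symmetry center outer) means outer + center + mirrored outer, are assumptions read off from the three examples above.

```python
def generate(rule):
    """Expand a rule expression (nested tuples) into the token string it describes.

    Assumed rule forms, mirroring the slide's notation:
      ('seq', r1, r2, ...)        -> concatenation
      ('repeat', n, r)            -> n copies of r
      ('symmetry', center, outer) -> outer + center + reversed(outer)
      'a'                         -> a literal token
    """
    if isinstance(rule, str):
        return rule
    op = rule[0]
    if op == 'seq':
        return ''.join(generate(r) for r in rule[1:])
    if op == 'repeat':
        return generate(rule[2]) * rule[1]
    if op == 'symmetry':
        outer = generate(rule[2])
        return outer + generate(rule[1]) + outer[::-1]
    raise ValueError(f'unknown rule: {op}')

assert generate(('seq', ('repeat', 3, 'a'), ('repeat', 3, 'b'))) == 'aaabbb'
assert generate(('seq', 'a', 'b')) == 'ab'
assert generate(('symmetry', ('repeat', 2, 'c'), ('seq', 'a', 'b'))) == 'abccba'
```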
9 The hypothesis behind GI
Generator process G0 → Data (aaabbb, ab, abccba) → [Grammar Induction] → G
Find G ≈ G0
10 The hypothesis behind GI (continued)
It is not claimed that G0 actually exists.
11 Learning
- Until now it was implicitly assumed that the data consists of positive examples.
- A very large amount of data is needed to induce an underlying grammar.
- It is difficult to find a good approximation to G0 if there are no negative examples, e.g. aaxybb does NOT belong to the grammar.
12 Learning
Convergence G → G0 is assumed for infinite N:
sample1 → G1
sample2 → G12
sample3 → G123
. . .
sampleN → G
13 Learning (continued)
(Convergence G → G0 is assumed for infinite N.)
More realistic: a PAC, probably approximately correct, G.
14 PAC GI
L(G0): the language generated by G0
L(G): the language explained by G
P( p(L(G0) ⊕ L(G)) < ε ) > 1 − δ
15 PAC GI (continued)
The probability that the probability of finding elements in L(G0) XOR L(G) is smaller than ε will be larger than 1 − δ.
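As an illustration only, a small Monte Carlo sketch of what ε measures: the probability mass on which the two languages disagree (the symmetric difference). Both languages, the sampling distribution over short strings, and all names here are hypothetical.

```python
import random

random.seed(0)

def in_L0(s):
    """Hypothetical target language L(G0): strings of the form a^n b^n."""
    n = len(s) // 2
    return len(s) % 2 == 0 and s == 'a' * n + 'b' * n

def in_L(s):
    """Hypothetical induced language L(G): equal numbers of a's and b's (too broad)."""
    return s.count('a') == s.count('b')

def disagreement(n_samples=10000, max_len=8):
    """Estimate p(L(G0) XOR L(G)): how often the languages disagree on a random string."""
    hits = 0
    for _ in range(n_samples):
        length = random.randint(1, max_len)
        s = ''.join(random.choice('ab') for _ in range(length))
        hits += in_L0(s) != in_L(s)
    return hits / n_samples

print(disagreement())  # a small value of this estimate plays the role of epsilon
```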
16-19 Example
[Figure, built up over four slides: a state-transition diagram over the alphabet {a, b} is grown step by step to account for the observed strings a, aa, ab, ba, bb, ...]
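A common concrete form of this incremental construction is a prefix-tree acceptor built from the positive examples (which is then typically generalized by merging states). A minimal sketch under that assumption; the dict encoding and the '$' end marker are my own choices.

```python
def add_sample(tree, s):
    """Insert one positive example into a prefix tree; '$' marks string end."""
    node = tree
    for token in s:
        node = node.setdefault(token, {})
    node['$'] = True
    return tree

def accepts(tree, s):
    """Membership test: follow the tokens and check for an end marker."""
    node = tree
    for token in s:
        if token not in node:
            return False
        node = node[token]
    return '$' in node

tree = {}
for sample in ['a', 'aa', 'ab', 'ba', 'bb']:
    add_sample(tree, sample)

print(accepts(tree, 'ab'))   # True
print(accepts(tree, 'bab'))  # False
```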
20 Many GI approaches are known (Dupont, 1997)
21 Second group: Grammar Emulation
- Statistical methods, aiming at producing token sequences with the same statistical properties as the generator grammar G0:
  1. recurrent neural networks
  2. Markov models
  3. hidden Markov models
22 Grammar emulation, training
ABGBABGACTVYAB <x> . . .
A context window over the token sequence is fed to the grammar emulator, which must predict the next token x.
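A minimal sketch of how such (context window, next token) training pairs can be cut from a sequence; the window width of 4 is an arbitrary choice.

```python
def training_pairs(sequence, window=4):
    """Slice a token sequence into (context window, next token) training pairs."""
    return [(sequence[i:i + window], sequence[i + window])
            for i in range(len(sequence) - window)]

for context, target in training_pairs('ABGBABGACTVYAB', window=4)[:3]:
    print(context, '->', target)
# ABGB -> A
# BGBA -> B
# GBAB -> G
```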
23 Recurrent neural networks for grammar emulation
- Major types:
  - Jordan (output-layer recurrence)
  - Elman (hidden-layer recurrence)
24 Jordan MLPs
- Assumption: the current state is represented by the output unit activation at the previous time step(s) and by the current input.
[Diagram: input (state) units → hidden → output; the output at time t − Δt is fed back into the input state units.]
25 Elman MLPs
- Assumption: the current state is represented by the hidden unit activation at the previous time step(s) and by the current input.
[Diagram: input (state) units → hidden → output; the hidden activation at time t − Δt is fed back into the input state units.]
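A minimal forward-pass sketch of the Elman recurrence (untrained random weights; all sizes are arbitrary assumptions). A Jordan network would differ only in feeding back the previous output y instead of the hidden vector h.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 4 input tokens (one-hot), 8 hidden units, 4 output tokens.
n_in, n_hid, n_out = 4, 8, 4
W_xh = rng.normal(scale=0.1, size=(n_hid, n_in))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(n_hid, n_hid))  # hidden(t-1) -> hidden (Elman recurrence)
W_hy = rng.normal(scale=0.1, size=(n_out, n_hid))  # hidden -> output

def step(x, h_prev):
    """One Elman step: the state is the previous hidden activation plus the current input."""
    h = np.tanh(W_xh @ x + W_hh @ h_prev)
    y = W_hy @ h                      # next-token scores (softmax omitted)
    return y, h

h = np.zeros(n_hid)
for token in [0, 1, 0, 2]:            # toy token sequence
    x = np.eye(n_in)[token]           # one-hot input
    y, h = step(x, h)
print(np.argmax(y))                    # predicted next token index
```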
26 Markov variants
- Shannon: a fixed 5-letter window for English to predict the next letter
- Variable-length Markov Models, VLMM (Guyon & Pereira): the width of the context window used to predict the next token in a sequence is variable and depends on the statistics
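A minimal count-based sketch of the variable-length idea: back off to ever shorter contexts until one has been observed often enough. The toy text, the threshold, and the back-off rule are my simplifications; real VLMMs grow and prune context trees using statistical criteria.

```python
from collections import Counter, defaultdict

def train_counts(text, max_order=5):
    """Count next-letter frequencies for every context of up to max_order letters."""
    counts = defaultdict(Counter)
    for i in range(len(text)):
        for order in range(max_order + 1):
            if i - order >= 0:
                counts[text[i - order:i]][text[i]] += 1
    return counts

def predict(counts, history, max_order=5, min_count=2):
    """Back off to shorter contexts until one has been seen often enough."""
    for order in range(max_order, -1, -1):
        ctx = history[len(history) - order:]
        if sum(counts[ctx].values()) >= min_count:
            return counts[ctx].most_common(1)[0][0]
    return None

counts = train_counts("the cat sat on the mat and the cat ran")
print(predict(counts, "the c"))  # 'a', predicted from the longest context seen often enough
```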
27 Results
- Example output of a letter-level VLMM, trained on news item texts (250 MB training set):
"liferator member of flight since N. a report the managical including from C all N months after dispute. C and declaracter leaders first to do a lot of though a ground out and C C pairs due to each planner of the lux said the C nailed by the defender begin about in N. the spokesman standards of the arms responded victory the side honored by the accustomers was arrest two mentalisting the romatory accustomers of ethnic C C the procedure."