Title: Building Finite-State Machines
1Building Finite-State Machines
2Finite-State Toolkits
- In these slides, we'll use Xerox's regexp notation
- Their tool is XFST; the free version is called FOMA
- Usage:
- Enter a regular expression; it builds an FSA or FST
- Now type in an input string
- FSA: It tells you whether it's accepted
- FST: It tells you all the output strings (if any)
- Can also invert the FST to let you map outputs to inputs
- Could hook it up to other NLP tools that need finite-state processing of their input or output
- There are other tools for weighted FSMs (Thrax, OpenFST)
3Common Regular Expression Operators (in XFST
notation)
- concatenation: EF
- * + iteration: E*, E+
- | union: E | F
- & intersection: E & F
- ~ \ - complementation, minus: ~E, \x, F-E
- .x. crossproduct: E .x. F
- .o. composition: E .o. F
- .u upper (input) language: E.u ("domain")
- .l lower (output) language: E.l ("range")
600.465 - Intro to NLP - J. Eisner
4Common Regular Expression Operators (in XFST
notation)
- concatenation: EF
- EF = {ef : e ∈ E, f ∈ F}
- ef denotes the concatenation of 2 strings.
- EF denotes the concatenation of 2 languages.
- To pick a string in EF, pick e ∈ E and f ∈ F and concatenate them.
- To find out whether w ∈ EF, look for at least one way to split w into two "halves," w = ef, such that e ∈ E and f ∈ F.
- A language is a set of strings.
- It is a regular language if there exists an FSA that accepts all the strings in the language, and no other strings.
- If E and F denote regular languages, then so does EF. (We will have to prove this by finding the FSA for EF!)
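The two recipes above (picking a string in EF, and testing whether w is in EF) can be sketched for finite languages as follows. This is only an illustration with Python sets; the function names and the sample languages are made up for the example, and a real toolkit would build an FSA instead.

```python
from typing import Set

def concat(E: Set[str], F: Set[str]) -> Set[str]:
    """EF = {ef : e in E, f in F}, for finite languages E and F."""
    return {e + f for e in E for f in F}

def in_concat(w: str, E: Set[str], F: Set[str]) -> bool:
    """Is w in EF? Try every split w = e f, including empty halves."""
    return any(w[:i] in E and w[i:] in F for i in range(len(w) + 1))

# Hypothetical sample languages.
E = {"a", "ab"}
F = {"c", "bc"}
EF = concat(E, F)  # "abc" arises two ways: a+bc and ab+c
```

Note that a string only needs one successful split to be in EF, which is why `in_concat` uses `any`.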
5Common Regular Expression Operators (in XFST
notation)
- concatenation: EF
- * + iteration: E*, E+
- E* = {e1e2 ... en : n ≥ 0, e1 ∈ E, ..., en ∈ E}
- To pick a string in E*, pick any number of strings in E and concatenate them.
- To find out whether w ∈ E*, look for at least one way to split w into 0 or more sections, e1e2 ... en, all of which are in E.
- E+ = {e1e2 ... en : n > 0, e1 ∈ E, ..., en ∈ E} = EE*
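The membership test for E* amounts to searching over all ways to cut w into sections. A minimal sketch for a finite E, using dynamic programming over prefixes (the names `in_star` and `in_plus` are chosen here for illustration):

```python
from typing import Set

def in_star(w: str, E: Set[str]) -> bool:
    """Is w in E*? ok[j] is True when the prefix w[:j] splits into sections from E."""
    ok = [False] * (len(w) + 1)
    ok[0] = True  # zero sections give the empty string
    for j in range(1, len(w) + 1):
        ok[j] = any(ok[i] and w[i:j] in E for i in range(j))
    return ok[len(w)]

def in_plus(w: str, E: Set[str]) -> bool:
    """Is w in E+ = EE*? Peel off one section from E, then require the rest to be in E*."""
    return any(w[:i] in E and in_star(w[i:], E) for i in range(len(w) + 1))

# Hypothetical sample language.
E = {"ab", "c"}
```

The only difference between the two is the empty string: it is always in E*, but it is in E+ only if it is already in E.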
6Common Regular Expression Operators (in XFST
notation)
- concatenation: EF
- * + iteration: E*, E+
- | union: E | F
- E | F = {w : w ∈ E or w ∈ F} = E ∪ F
- To pick a string in E | F, pick a string from either E or F.
- To find out whether w ∈ E | F, check whether w ∈ E or w ∈ F.
7Common Regular Expression Operators (in XFST
notation)
- concatenation: EF
- * + iteration: E*, E+
- | union: E | F
- & intersection: E & F
- E & F = {w : w ∈ E and w ∈ F} = E ∩ F
- To pick a string in E & F, pick a string from E that is also in F.
- To find out whether w ∈ E & F, check whether w ∈ E and w ∈ F.
8Common Regular Expression Operators (in XFST
notation)
- concatenation: EF
- * + iteration: E*, E+
- | union: E | F
- & intersection: E & F
- ~ \ - complementation, minus: ~E, \x, F-E
- ~E = {e : e ∉ E} = Σ* - E
- E - F = {e : e ∈ E and e ∉ F} = E & ~F
- \E = Σ - E (any single character not in E)
Σ is the set of all letters, so Σ* is the set of all strings.
9Regular Expressions
- A language is a set of strings.
- It is a regular language if there exists an FSA that accepts all the strings in the language, and no other strings.
- If E and F denote regular languages, then so do EF, etc.
- Regular expression: EF*|(F & G)+
- Syntax: [figure: parse tree of the expression]
- Semantics: Denotes a regular language. As usual, can build semantics compositionally bottom-up. E, F, G must be regular languages. As a base case, e denotes {e} (a language containing a single string), so ef*|(f&g)+ is regular.
10Regular Expressionsfor Regular Relations
- A language is a set of strings.
- It is a regular language if there exists an FSA that accepts all the strings in the language, and no other strings.
- If E and F denote regular languages, then so do EF, etc.
- A relation is a set of pairs; here, pairs of strings.
- It is a regular relation if there exists an FST that accepts all the pairs in the relation, and no other pairs.
- If E and F denote regular relations, then so do EF, etc.
- EF = {(ef, e'f') : (e,e') ∈ E, (f,f') ∈ F}
- Can you guess the definitions for E*, E+, E | F, E & F when E and F are regular relations?
- Surprise: E & F isn't necessarily regular in the case of relations, so it is not supported.
11Common Regular Expression Operators (in XFST
notation)
- concatenation: EF
- * + iteration: E*, E+
- | union: E | F
- & intersection: E & F
- ~ \ - complementation, minus: ~E, \x, F-E
- .x. crossproduct: E .x. F
- E .x. F = {(e,f) : e ∈ E, f ∈ F}
- Combines two regular languages into a regular relation.
12Common Regular Expression Operators (in XFST
notation)
- concatenation: EF
- * + iteration: E*, E+
- | union: E | F
- & intersection: E & F
- ~ \ - complementation, minus: ~E, \x, F-E
- .x. crossproduct: E .x. F
- .o. composition: E .o. F
- E .o. F = {(e,f) : ∃m. (e,m) ∈ E, (m,f) ∈ F}
- Composes two regular relations into a regular relation.
- As we've seen, this generalizes ordinary function composition.
13Common Regular Expression Operators (in XFST
notation)
- concatenation: EF
- * + iteration: E*, E+
- | union: E | F
- & intersection: E & F
- ~ \ - complementation, minus: ~E, \x, F-E
- .x. crossproduct: E .x. F
- .o. composition: E .o. F
- .u upper (input) language: E.u ("domain")
- E.u = {e : ∃m. (e,m) ∈ E}
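For finite relations, the crossproduct and the upper and lower languages can be written directly as set comprehensions. The sketch below is only illustrative: the function names are chosen here, the sample data is hypothetical, and `invert` is included to show that the lower language of a relation is the upper language of its inverse.

```python
from typing import Set, Tuple

Relation = Set[Tuple[str, str]]

def crossproduct(E: Set[str], F: Set[str]) -> Relation:
    """E .x. F = {(e, f) : e in E, f in F}."""
    return {(e, f) for e in E for f in F}

def upper(R: Relation) -> Set[str]:
    """R.u: the input side (domain) of the relation."""
    return {e for (e, _) in R}

def lower(R: Relation) -> Set[str]:
    """R.l: the output side (range) of the relation."""
    return {m for (_, m) in R}

def invert(R: Relation) -> Relation:
    """Swap the two sides of every pair."""
    return {(m, e) for (e, m) in R}

# Hypothetical sample relation.
R = crossproduct({"a", "b"}, {"x"})
```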
14Common Regular Expression Operators (in XFST
notation)
- concatenation: EF
- * + iteration: E*, E+
- | union: E | F
- & intersection: E & F
- ~ \ - complementation, minus: ~E, \x, F-E
- .x. crossproduct: E .x. F
- .o. composition: E .o. F
- .u upper (input) language: E.u ("domain")
- .l lower (output) language: E.l ("range")
15Function from strings to ...
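- {false, true}: unweighted acceptor (FSA)
- strings: unweighted transducer (FST)
- numbers: weighted acceptor
- (string, num) pairs: weighted transducer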
16How to implement?
slide courtesy of L. Karttunen (modified)
- concatenation: EF
- * + iteration: E*, E+
- | union: E | F
- ~ \ - complementation, minus: ~E, \x, E-F
- & intersection: E & F
- .x. crossproduct: E .x. F
- .o. composition: E .o. F
- .u upper (input) language: E.u ("domain")
- .l lower (output) language: E.l ("range")
17Concatenation
example courtesy of M. Mohri
18Union
example courtesy of M. Mohri
19Closure (this example has outputs too)
example courtesy of M. Mohri
The loop creates (red machine)+. Then we add a state to get ε | (red machine)+. Why do it this way? Why not just make state 0 final?
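One way to explore the question is to build both versions and compare them. The sketch below uses a tiny unweighted acceptor without outputs (unlike the slide's example), with the empty string standing in for ε arcs. The `NFA` class, the fresh state name, and the sample machine for a*b are all assumptions made for the illustration.

```python
EPS = ""  # label used here for epsilon arcs (an assumption of this sketch)

class NFA:
    def __init__(self, start, finals, arcs):
        self.start, self.finals, self.arcs = start, set(finals), list(arcs)

    def _eps_closure(self, states):
        stack, seen = list(states), set(states)
        while stack:
            q = stack.pop()
            for (src, label, dst) in self.arcs:
                if src == q and label == EPS and dst not in seen:
                    seen.add(dst)
                    stack.append(dst)
        return seen

    def accepts(self, w):
        current = self._eps_closure({self.start})
        for ch in w:
            step = {dst for (src, label, dst) in self.arcs
                    if src in current and label == ch}
            current = self._eps_closure(step)
        return bool(current & self.finals)

def plus(m):
    """M+: loop from every final state back to the start state."""
    loop = [(f, EPS, m.start) for f in m.finals]
    return NFA(m.start, m.finals, m.arcs + loop)

def star(m):
    """M*: take M+ and add a fresh final start state, giving eps | M+."""
    p = plus(m)
    fresh = ("new",)  # hypothetical name for the added state
    return NFA(fresh, p.finals | {fresh}, p.arcs + [(fresh, EPS, m.start)])

def star_shortcut(m):
    """The tempting shortcut: just make the old start state final."""
    p = plus(m)
    return NFA(p.start, p.finals | {p.start}, p.arcs)

# Sample machine for a*b: state 0 has a self-loop, so it can be re-entered.
m = NFA(0, {1}, [(0, "a", 0), (0, "b", 1)])
```

With this machine the shortcut accepts "a", which is not in (a*b)*: the start state has an incoming arc, so making it final lets a path stop there partway through a string.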
20Upper language (domain)
example courtesy of M. Mohri
.u
Similarly construct the lower language .l. These are also called the input and output languages.
21Reversal
example courtesy of M. Mohri
22Inversion
example courtesy of M. Mohri
23Complementation
- Given a machine M, represent all strings not accepted by M
- Just change final states to non-final and vice-versa
- Works only if machine has been determinized and completed first (why?)
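A small experiment shows what goes wrong without completion. The sketch assumes a deterministic machine stored as a dictionary (a representation chosen here, not taken from any toolkit), with a missing transition meaning "reject". The sink state name is hypothetical.

```python
def accepts(dfa, w):
    q = dfa["start"]
    for ch in w:
        q = dfa["delta"].get((q, ch))
        if q is None:  # no arc: the machine is stuck and rejects
            return False
    return q in dfa["finals"]

def complete(dfa, alphabet):
    """Add a non-final sink state so every state has an arc for every symbol."""
    sink = "sink"  # hypothetical name; assumed not to clash with existing states
    states = set(dfa["states"]) | {sink}
    delta = dict(dfa["delta"])
    for q in states:
        for ch in alphabet:
            delta.setdefault((q, ch), sink)
    return {"states": states, "start": dfa["start"],
            "finals": set(dfa["finals"]), "delta": delta}

def flip(dfa):
    """Swap final and non-final states."""
    return {**dfa, "finals": set(dfa["states"]) - set(dfa["finals"])}

def complement(dfa, alphabet):
    return flip(complete(dfa, alphabet))

# Machine accepting only the string "a" over the alphabet {a, b}.
M = {"states": {0, 1}, "start": 0, "finals": {1}, "delta": {(0, "a"): 1}}
naive = flip(M)               # flipped without completing first
proper = complement(M, "ab")  # completed, then flipped
```

The naive version still rejects "b" and "aa", because those strings fall off the machine before reaching any state. Determinization matters for a similar reason: in a nondeterministic machine a string may have both accepting and rejecting paths, and flipping would leave it accepted.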
24Intersection
example adapted from M. Mohri
[figure: weighted acceptor with arcs fat/0.5, pig/0.3, eats/0, sleeps/0.6]
25Intersection
Paths 0→0→1→2 and 0→1→1→0 both accept "fat pig eats". So must the new machine: along path 0,0 → 0,1 → 1,1 → 2,0.
26Intersection
[figure: first machine with arcs fat/0.5, pig/0.3, eats/0, sleeps/0.6; second machine with arcs pig/0.4, fat/0.2, sleeps/1.3, eats/0.6; new machine so far: state 0,0]
Paths 0→0 and 0→1 both accept "fat". So must the new machine: along path 0,0 → 0,1.
27Intersection
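[figure: first machine with arcs fat/0.5, pig/0.3, eats/0, sleeps/0.6; second machine with arcs pig/0.4, fat/0.2, sleeps/1.3, eats/0.6; new machine so far: states 0,0 and 0,1 with arc fat/0.7]
Paths 0→1 and 1→1 both accept "pig". So must the new machine: along path 0,1 → 1,1.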
28Intersection
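[figure: first machine with arcs fat/0.5, pig/0.3, eats/0, sleeps/0.6; second machine with arcs pig/0.4, sleeps/1.3, fat/0.2, eats/0.6; new machine so far: states 0,0, 0,1 and 1,1 with arcs fat/0.7 and pig/0.7]
Paths 1→2 and 1→2 both accept "fat". So must the new machine: along path 1,1 → 2,2.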
29Intersection
[figure: first machine with arcs fat/0.5, eats/0, pig/0.3, sleeps/0.6; second machine with arcs pig/0.4, sleeps/1.3, fat/0.2, eats/0.6; new machine: states 0,0, 0,1 and 1,1 with arcs fat/0.7, pig/0.7 and sleeps/1.9]
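The step-by-step construction above can be sketched as a product construction that pairs up states, keeps only arcs whose words match, and adds their weights. The arc weights below are the ones shown on the slides. The state topology is an assumption pieced together from the "fat pig eats" path descriptions, and the target state of the second machine's eats and sleeps arcs in particular is not recoverable from the text.

```python
from typing import List, Tuple

Arc = Tuple[int, str, float, int]  # (source, word, weight, target)
PairArc = Tuple[Tuple[int, int], str, float, Tuple[int, int]]

def intersect(arcs1: List[Arc], arcs2: List[Arc],
              start1: int = 0, start2: int = 0) -> List[PairArc]:
    """Build only the paired states reachable from (start1, start2)."""
    start = (start1, start2)
    agenda, seen, result = [start], {start}, []
    while agenda:
        q1, q2 = agenda.pop()
        for (s1, word1, w1, t1) in arcs1:
            if s1 != q1:
                continue
            for (s2, word2, w2, t2) in arcs2:
                if s2 == q2 and word1 == word2:
                    # Same word on both arcs: pair the targets, add the weights.
                    result.append(((q1, q2), word1, w1 + w2, (t1, t2)))
                    if (t1, t2) not in seen:
                        seen.add((t1, t2))
                        agenda.append((t1, t2))
    return result

# Assumed topology (see the note above).
machine1 = [(0, "fat", 0.5, 0), (0, "pig", 0.3, 1),
            (1, "eats", 0.0, 2), (1, "sleeps", 0.6, 2)]
machine2 = [(0, "fat", 0.2, 1), (1, "pig", 0.4, 1),
            (1, "eats", 0.6, 0), (1, "sleeps", 1.3, 0)]

product = intersect(machine1, machine2)
# Rounded to hide floating-point noise in the sums.
weights = {word: round(w, 6) for (_, word, w, _) in product}
```

Under these assumptions the combined weights come out as fat 0.7, pig 0.7 and sleeps 1.9, matching the arcs shown for the new machine.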
30What Composition Means
[figure: f maps ab?d to abcd]
31What Composition Means
Relation composition: f ∘ g
[figure: f ∘ g maps ab?d to αβγδ, αβεδ, αβϵδ, ...]
32Relation set of pairs
f: ab?d → abcd, ab?d → abed, ab?d → abjd
g: abcd → αβγδ, abed → αβεδ, abed → αβϵδ
[figure: f maps ab?d to abcd]
33Relation set of pairs
f: ab?d → abcd, ab?d → abed, ab?d → abjd
g: abcd → αβγδ, abed → αβεδ, abed → αβϵδ
f ∘ g: ab?d → αβγδ, ab?d → αβεδ, ab?d → αβϵδ
[figure: f ∘ g maps ab?d to αβγδ, αβεδ, αβϵδ, ..., with arc labels 4, 2, 8]
34Intersection vs. Composition
Intersection: pig/0.3 (arc 0→1) & pig/0.4 (arc 1→1) = pig/0.7 (arc 0,1 → 1,1)
Composition: Wilbur:pig/0.3 (arc 0→1) .o. pig:pink/0.4 (arc 1→1) = Wilbur:pink/0.7 (arc 0,1 → 1,1)
35Intersection vs. Composition
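Intersection mismatch: pig/0.3 (arc 0→1) & elephant/0.4 (arc 1→1). The labels differ, so there is no pig/0.7 arc from 0,1 to 1,1.
Composition mismatch: Wilbur:pig/0.3 (arc 0→1) .o. elephant:gray/0.4 (arc 1→1). The intermediate symbols differ, so there is no Wilbur:gray/0.7 arc from 0,1 to 1,1.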
36Composition
example courtesy of M. Mohri
.o.
37Composition
.o.
a:b .o. b:b = a:b
38Composition
.o.
a:b .o. b:a = a:a
39Composition
.o.
a:b .o. b:a = a:a
40Composition
.o.
b:b .o. b:a = b:a
41Composition
.o.
a:b .o. b:a = a:a
42Composition
.o.
a:a .o. a:b = a:b
43Composition
.o.
b:b .o. a:b = nothing (since intermediate symbol doesn't match)
44Composition
.o.
b:b .o. b:a = b:a
45Composition
.o.
ab .o. ab ab
46Composition in Dyna
start = pair(start1, start2).
final(pair(Q1,Q2)) :- final1(Q1), final2(Q2).
edge(U, L, pair(Q1,Q2), pair(R1,R2)) min= edge1(U, Mid, Q1, R1) + edge2(Mid, L, Q2, R2).
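For readers who do not know Dyna, the three rules can be mirrored in Python roughly as follows. This is a sketch, not a translation endorsed by the source: the tuple layout of an edge and the function name are assumptions, and it reads the last rule as "add the two edge weights and keep the minimum over all matching intermediate symbols".

```python
from typing import Dict, List, Set, Tuple

# Assumed layout: (upper symbol, lower symbol, source state, target state, weight)
Edge = Tuple[str, str, int, int, float]

def compose(edges1: List[Edge], edges2: List[Edge],
            start1: int, start2: int,
            finals1: Set[int], finals2: Set[int]):
    # start = pair(start1, start2).
    start = (start1, start2)
    # final(pair(Q1,Q2)) :- final1(Q1), final2(Q2).
    finals = {(q1, q2) for q1 in finals1 for q2 in finals2}
    # edge(U, L, pair(Q1,Q2), pair(R1,R2)) min= edge1(U, Mid, ...) + edge2(Mid, L, ...).
    edge: Dict[tuple, float] = {}
    for (u, mid1, q1, r1, w1) in edges1:
        for (mid2, low, q2, r2, w2) in edges2:
            if mid1 == mid2:  # the intermediate symbol must match
                key = (u, low, (q1, q2), (r1, r2))
                edge[key] = min(edge.get(key, float("inf")), w1 + w2)
    return start, finals, edge

# The Wilbur example: Wilbur:pig/0.3 .o. pig:pink/0.4
_, _, wilbur = compose([("Wilbur", "pig", 0, 1, 0.3)],
                       [("pig", "pink", 1, 1, 0.4)], 0, 1, {1}, {1})

# Two intermediate symbols lead to the same composed edge; min= keeps the cheaper one.
_, _, cheapest = compose([("a", "x", 0, 1, 1.0), ("a", "y", 0, 1, 2.0)],
                         [("x", "b", 0, 1, 5.0), ("y", "b", 0, 1, 1.0)],
                         0, 0, {1}, {1})
```

Like the Dyna rules, this pairs every edge of the first machine with every edge of the second, so it may create edges between paired states that are not reachable from the start state.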
47Relation set of pairs
f: ab?d → abcd, ab?d → abed, ab?d → abjd
g: abcd → αβγδ, abed → αβεδ, abed → αβϵδ
f ∘ g: ab?d → αβγδ, ab?d → αβεδ, ab?d → αβϵδ
[figure: f ∘ g maps ab?d to αβγδ, αβεδ, αβϵδ, ..., with arc labels 4, 2, 8]
483 Uses of Set Composition
- Feed string into Greek transducer
- {abed → abed} .o. Greek = {abed → αβεδ, abed → αβϵδ}
- {abed} .o. Greek = {abed → αβεδ, abed → αβϵδ}
- [{abed} .o. Greek].l = {αβεδ, αβϵδ}
- Feed several strings in parallel
- {abcd, abed} .o. Greek = {abcd → αβγδ, abed → αβεδ, abed → αβϵδ}
- [{abcd, abed} .o. Greek].l = {αβγδ, αβεδ, αβϵδ}
- Filter result via Noε = {αβγδ, αβϵδ, ...}
- {abcd, abed} .o. Greek .o. Noε = {abcd → αβγδ, abed → αβϵδ}
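The three uses can be tried out with relations stored as Python sets of pairs. This is a sketch under assumptions: the Greek relation is cut down to the three pairs listed earlier, the outputs are written as αβγδ, αβεδ and αβϵδ (an assumed rendering of the slide's Greek strings), and a language is fed in by first turning it into an identity relation.

```python
def compose(R, S):
    """R .o. S = {(x, z) : some y has (x, y) in R and (y, z) in S}."""
    return {(x, z) for (x, y) in R for (y2, z) in S if y == y2}

def identity(L):
    """View a language as the relation that maps each string to itself."""
    return {(w, w) for w in L}

def lower(R):
    return {y for (_, y) in R}

# Assumed finite fragment of the Greek transducer.
Greek = {("abcd", "αβγδ"), ("abed", "αβεδ"), ("abed", "αβϵδ")}

# Filter: output strings containing no ε (restricted to Greek's outputs here).
no_eps = identity({w for w in lower(Greek) if "ε" not in w})

one = compose(identity({"abed"}), Greek)              # use 1: feed one string
several = compose(identity({"abcd", "abed"}), Greek)  # use 2: feed strings in parallel
filtered = compose(several, no_eps)                   # use 3: filter the result
```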
49What are the basic transducers?
- The operations on the previous slides combine transducers into bigger ones
- But where do we start?
- a:ε for a ∈ Σ
- ε:x for x ∈ Δ
- Q: Do we also need a:x? How about ε:ε?
50Some Xerox Extensions
slide courtesy of L. Karttunen (modified)
- $ containment
- => restriction
- -> @-> replacement
- Make it easier to describe complex languages and relations without extending the formal power of finite-state systems.
51Containment
Warning: "?" in regexps means any character at all. But "?" in machines means any character not explicitly mentioned anywhere in the machine.
52Restriction
slide courtesy of L. Karttunen (modified)
53Replacement
slide courtesy of L. Karttunen (modified)
54Replacement is Nondeterministic
55Replacement is Nondeterministic
56Replacement is Nondeterministic
slide courtesy of L. Karttunen (modified)
a b | b | b a | a b a -> x applied to "aba"
Four overlapping substrings match; we haven't told it which one to replace, so it chooses nondeterministically:
a b a → a x a
a b a → a x
a b a → x a
a b a → x
57More Replace Operators
slide courtesy of L. Karttunen
- Optional replacement: a b (->) b a
- Directed replacement
- guarantees a unique result by constraining the factorization of the input string by
- Direction of the match (rightward or leftward)
- Length (longest or shortest)
58@-> Left-to-right, Longest-match Replacement
slide courtesy of L. Karttunen
a b | b | b a | a b a @-> x applied to "aba"
[figure: the four candidate factorizations of a b a, giving a x a, a x, x a and x]
@-> left-to-right, longest match
@> left-to-right, shortest match
->@ right-to-left, longest match
>@ right-to-left, shortest match
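The difference between the undirected and the directed operator can be sketched on plain strings. This is an illustration rather than how XFST compiles these rules: patterns are given as a list of literal strings, and the nondeterministic version assumes the usual reading of obligatory replacement, in which any stretch left unreplaced must not itself contain a match.

```python
from typing import List, Set

def contains_match(s: str, patterns: List[str]) -> bool:
    return any(p in s for p in patterns)

def replace_all_ways(s: str, patterns: List[str], repl: str) -> Set[str]:
    """All outputs of an undirected obligatory replacement (like ->)."""
    results: Set[str] = set()

    def go(i: int, out: str, untouched: str) -> None:
        # `untouched` is the current run of characters copied without change.
        if i == len(s):
            if not contains_match(untouched, patterns):
                results.add(out)
            return
        go(i + 1, out + s[i], untouched + s[i])  # copy one character
        if not contains_match(untouched, patterns):
            for p in patterns:                   # or replace a match starting here
                if s.startswith(p, i):
                    go(i + len(p), out + repl, "")

    go(0, "", "")
    return results

def replace_longest_left_to_right(s: str, patterns: List[str], repl: str) -> str:
    """Directed replacement (like @->): leftmost match first, longest at that position."""
    out, i = [], 0
    while i < len(s):
        matches = [p for p in patterns if s.startswith(p, i)]
        if matches:
            out.append(repl)
            i += len(max(matches, key=len))
        else:
            out.append(s[i])
            i += 1
    return "".join(out)

patterns = ["ab", "b", "ba", "aba"]
every_way = replace_all_ways("aba", patterns, "x")
directed = replace_longest_left_to_right("aba", patterns, "x")
```

On "aba" the first function yields the four outputs listed for the undirected rule, while the directed version commits to the single longest match starting at the left edge.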
59Using ... for marking
slide courtesy of L. Karttunen (modified)
p o t a t o → p[o]t[a]t[o]
Note: actually have to write this as -> %[ ... %] or -> "[" ... "]", since [] are parens in the regexp language.
60Using ... for marking
slide courtesy of L. Karttunen (modified)
p o t a t o → p[o]t[a]t[o]
Which way does the FST transduce potatoe?
p o t a t o e → p[o]t[a]t[o][e]
vs.
p o t a t o e → p[o]t[a]t[o e]
How would you change it to get the other answer?
61Example Finnish Syllabification
slide courtesy of L. Karttunen
define C [ b | c | d | f | ... ]
define V [ a | e | i | o | u ]
C* V+ C* @-> ... "-" || _ C V
"Insert a hyphen after the longest instance of the C* V+ C* pattern in front of a C V pattern."
s t r u k t u r a l i s m i → s t r u k - t u - r a - l i s - m i
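The same rule can be approximated with an ordinary regular expression and a lookahead for the right context. This is a sketch, not the XFST semantics in general: the consonant class is an assumption (the slide elides the full list, so every lowercase ASCII letter other than the five listed vowels is treated as a consonant), and it relies on Python's leftmost, greedy matching agreeing with longest match for this particular pattern.

```python
import re

V = "[aeiou]"             # the vowels listed on the slide
C = "[b-df-hj-np-tv-z]"   # assumed consonant set: other lowercase ASCII letters

# C* V+ C* followed by (but not consuming) a C V pattern.
SYLLABLE = re.compile(f"({C}*{V}+{C}*)(?={C}{V})")

def syllabify(word: str) -> str:
    """Insert a hyphen after each matched C* V+ C* stretch."""
    return SYLLABLE.sub(r"\1-", word)

hyphenated = syllabify("strukturalismi")
```

The lookahead plays the role of the context `_ C V`: it must be present for a match but is left in place for the next match to start from.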
62Conditional Replacement
slide courtesy of L. Karttunen
63Hand-Coded Example Parsing Dates
slide courtesy of L. Karttunen
Today is Tuesday, July 25, 2000.
Need left-to-right, longest-match constraints.
64Source code Language of Dates
slide courtesy of L. Karttunen
- Day = Monday | Tuesday | ... | Sunday
- Month = January | February | ... | December
- Date = 1 | 2 | 3 | ... | 3 1
- Year = 0To9 (0To9 (0To9 (0To9))) - [0 ?*] (from 1 to 9999)
- AllDates = Day | (Day ", ") Month " " Date (", " Year)
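As a sanity check on the definition, the number of strings in AllDates can be counted by hand. The sketch assumes the shape reconstructed above: either a bare day name, or an optional day-name prefix, a month, a date from 1 to 31, and an optional year from 1 to 9999. The variable names are chosen for the illustration.

```python
days = 7        # Monday ... Sunday
months = 12     # January ... December
dates = 31      # 1 ... 31, with no check against the month
years = 9999    # 1 ... 9999

bare_day = days                    # AllDates may be just a day name
optional_day_prefix = days + 1     # one of 7 day names, or none
optional_year_suffix = years + 1   # one of 9999 years, or none

total = bare_day + optional_day_prefix * months * dates * optional_year_suffix
```

Under these assumptions the total is 29,760,007, which agrees with the count of date expressions reported for the compiled AllDates automaton.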
65Object code All Dates from 1/1/1 to 12/31/9999
slide courtesy of L. Karttunen
[figure: the compiled automaton for AllDates, with arcs labeled by commas, digits and the month names Jan through Dec]
13 states, 96 arcs; 29 760 007 date expressions
66Parser for Dates
slide courtesy of L. Karttunen (modified)
AllDates @-> "[DT " ... "]"
Compiles into an unambiguous transducer (23 states, 332 arcs).
Today is [DT Tuesday, July 25, 2000] because yesterday was [DT Monday] and it was [DT July 24] so tomorrow must be [DT Wednesday, July 26] and not [DT July 27] as it says on the program.
67Problem of Reference
slide courtesy of L. Karttunen
Valid dates:
- Tuesday, July 25, 2000
- Tuesday, February 29, 2000
- Monday, September 16, 1996
Invalid dates:
- Wednesday, April 31, 1996
- Thursday, February 29, 1900
- Tuesday, July 26, 2000
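The three kinds of invalidity (too many days for the month, a non-leap February 29, and a weekday that does not match the date) can be checked against Python's calendar arithmetic. This is a cross-check for the examples, not the finite-state solution the slides go on to build. The function name and the fixed "Day, Month Date, Year" input format are assumptions of the sketch.

```python
import datetime

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

def is_valid_date(expr: str) -> bool:
    """Check a string of the form 'Tuesday, July 25, 2000'."""
    day_name, month_and_date, year = [part.strip() for part in expr.split(",")]
    month_name, date = month_and_date.split()
    try:
        # Raises ValueError for April 31 or for February 29 in a non-leap year.
        d = datetime.date(int(year), MONTHS.index(month_name) + 1, int(date))
    except ValueError:
        return False
    # date.weekday() counts Monday as 0, matching the order of DAYS.
    return DAYS[d.weekday()] == day_name
```

For example, `is_valid_date("Thursday, February 29, 1900")` is false because 1900 was not a leap year, and `is_valid_date("Tuesday, July 26, 2000")` is false because that date fell on a Wednesday.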
68Refinement by Intersection
slide courtesy of L. Karttunen (modified)
Valid Dates
Q: Why does this rule end with a comma?
Q: Can we write the whole rule?
Q: Why do these rules start with spaces? (And is it enough?)
69Defining Valid Dates
slide courtesy of L. Karttunen
AllDates: 13 states, 96 arcs; 29 760 007 date expressions
ValidDates = AllDates & MaxDaysInMonth & LeapYears & WeekdayDates
ValidDates: 805 states, 6472 arcs; 7 307 053 date expressions
70Parser for Valid and Invalid Dates
slide courtesy of L. Karttunen
[AllDates - ValidDates] @-> "[ID " ... "]" ,
ValidDates @-> "[VD " ... "]"
2688 states, 20439 arcs