Title: LING 6932 Topics in Computational Linguistics
1LING 6932 Topics in Computational Linguistics
- Hana Filip
- Lecture 2: Regular Expressions, Finite State Automata
2Regular expressions
- formulas for specifying text strings
- How can we search for any of these strings?
- woodchuck
- woodchucks
- Woodchuck
- Woodchucks
Figure from Dorr/Monz slides
3Regular Expressions
- Basic patterns of regular expressions
- Perl-based syntax (slightly different from other notations for regular expressions, as used in UNIX, for example)
- /Woodchuck/ matches any string containing the substring Woodchuck (if your search application returns entire lines, for example)
- The / notation is used by Perl; the slashes are NOT part of the RE
- Google
- Woodchuck Draft Cider
- Producers of Woodchuck Draft Cider in Springfield, VT.
- www.woodchuck.com/ - 17k - Cached - Similar pages
Slide from Dorr/Monz
4Regular Expressions
- Regular expressions are CASE SENSITIVE
- The pattern /woodchuck/ will not match the string Woodchuck
- Disjunction: /[wW]oodchuck/
Slide from Dorr/Monz
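A minimal Python sketch of the two points on this slide. Python's re module follows a Perl-style syntax; the pattern is written as a raw string without the Perl slashes. The sample sentence is invented for illustration.

```python
import re

text = "Woodchucks and a woodchuck"

# Matching is case sensitive by default: only the lowercase occurrence is found.
lower_only = re.findall(r"woodchuck", text)

# The character class [wW] accepts either case for the first letter.
either_case = re.findall(r"[wW]oodchuck", text)

print(lower_only)    # ['woodchuck']
print(either_case)   # ['Woodchuck', 'woodchuck']
```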
5Regular Expressions
Slide from Dorr/Monz
6Regular Expressions
- Negation: /[^a]/ (caret)
- matches any single character except a
Slide from Dorr/Monz
7Regular Expressions
- Operators ?, * and +
- ? (0 or 1)
- /woodchucks?/ → woodchuck or woodchucks
- /colou?r/ → color or colour
- * (0 or more)
- /oo*h!/ → oh! or ooh! or ooooh!
- + (1 or more)
- /o+h!/ → oh! or ooh! or ooooh!
- Each operator applies to the immediately preceding character or regular expression
- Wild card .
- /beg.n/ → begin or began or begun
- . matches any single character between beg and n (except a carriage return)
Slide from Dorr/Monz
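The operators on this slide can be tried out with a short Python sketch. The helper name full and the test words are made up for illustration; re.fullmatch requires the whole word to match the pattern.

```python
import re

def full(pattern, word):
    """Return True if the entire word matches the pattern."""
    return re.fullmatch(pattern, word) is not None

# ? makes the preceding character optional.
optional = [w for w in ["color", "colour", "colouur"] if full(r"colou?r", w)]

# * allows zero or more of the preceding character.
star = [w for w in ["oh!", "ooh!", "ooooh!", "h!"] if full(r"oo*h!", w)]

# + requires at least one of the preceding character.
plus = [w for w in ["oh!", "ooh!", "h!"] if full(r"o+h!", w)]

# . stands for exactly one character.
wildcard = [w for w in ["begin", "began", "begun", "begn"] if full(r"beg.n", w)]

print(optional, star, plus, wildcard)
```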
8Regular Expressions
- Anchors ^ and $
- ^ start of line
- /^[A-Z]/ → Ramallah, Palestine
- /^[^A-Z]/ → verdad? really?
- $ end of line
- /\.$/ → It is over.
- /.$/ → ?
- Boundaries \b and \B
- /\bon\b/ → on my way, but not Monday (word boundary on both sides of on)
- /\Bon\b/ → automaton (non-boundary before on)
Slide from Dorr/Monz
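A small Python sketch of anchors and boundaries, using the example strings from this slide plus one invented string ("Is it over?"). re.search looks for the pattern anywhere in the string, so the anchors do the positional work.

```python
import re

lines = ["Ramallah, Palestine", "really?"]
starts_upper = [s for s in lines if re.search(r"^[A-Z]", s)]
starts_not_upper = [s for s in lines if re.search(r"^[^A-Z]", s)]

# \. is a literal period; $ anchors it to the end of the line.
ends_with_period = [s for s in ["It is over.", "Is it over?"] if re.search(r"\.$", s)]

words = ["on my way", "Monday", "automaton"]
# \b on both sides: "on" must stand alone as a word.
on_word = [s for s in words if re.search(r"\bon\b", s)]
# \B before, \b after: "on" must end a longer word.
on_suffix = [s for s in words if re.search(r"\Bon\b", s)]

print(starts_upper, starts_not_upper, ends_with_period, on_word, on_suffix)
```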
9Disjunction, Grouping, Precedence
- Disjunction |
- /yours|mine/ → it is either yours or mine
- /gupp(y|ies)/ → guppy or guppies
- Column 1 Column 2 Column 3: how do we express this?
- /Column␣[0-9]+␣*/ (␣ stands for a space; it is NOT an RE character)
- /(Column␣[0-9]+␣*)*/
- matches the word Column, followed by a number, followed by zero or more spaces, the whole pattern repeated any number of times (zero or more times)
Slide from Dorr/Monz
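The effect of grouping can be checked with a short Python sketch. Two choices here are assumptions of the sketch, not part of the slide: a literal space is typed where the slide uses a visible space symbol, and the guppy pattern uses a non-capturing group (?:...) so that re.findall returns the whole match.

```python
import re

either = re.findall(r"yours|mine", "it is either yours or mine")
pets = re.findall(r"gupp(?:y|ies)", "one guppy, two guppies")

header = "Column 1 Column 2 Column 3"

# Without parentheses the final * applies only to the space,
# so the pattern covers a single column, not the whole header.
ungrouped = re.fullmatch(r"Column [0-9]+ *", header)

# With parentheses the * repeats the whole "Column <number> <spaces>" unit.
grouped = re.fullmatch(r"(Column [0-9]+ *)*", header)

print(either, pets, ungrouped is None, grouped is not None)
```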
10Disjunction, Grouping, Precedence
- Operator Precedence Hierarchy (highest to lowest)
- Parenthesis: ()
- Counters: * + ? {}
- Sequences and anchors: the ^my end$
- Disjunction: |
- REs are greedy!
- They always match the largest string they can
Slide from Dorr/Monz
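A brief Python sketch of what the precedence hierarchy and greediness mean in practice. The test strings are invented for illustration.

```python
import re

# Counters bind tighter than sequences: in "the*" the star applies only to "e".
star_on_e = re.fullmatch(r"the*", "theee") is not None
star_on_seq = re.fullmatch(r"the*", "thethe") is not None
# Parentheses override this: now the star repeats the whole sequence.
star_on_group = re.fullmatch(r"(the)*", "thethe") is not None

# Sequences bind tighter than disjunction: "the|any" means "the" or "any".
alternatives = [w for w in ["the", "any", "thany"] if re.fullmatch(r"the|any", w)]

# Greedy matching: [a-z]* takes the longest run of letters available.
greedy = re.match(r"[a-z]*", "once upon a time").group()

print(star_on_e, star_on_seq, star_on_group, alternatives, greedy)
```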
11Example
- Find me all instances of the word the in a text.
- /the/
- Misses capitalized examples
- /[tT]he/
- Returns other or theology
- /\b[tT]he\b/ matches the or The
- /[^a-zA-Z][tT]he[^a-zA-Z]/
- /(^|[^a-zA-Z])[tT]he[^a-zA-Z]/
- Also matches the in the_ or the25, where \b sees no word boundary
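The refinement steps above can be replayed in Python by counting matches on one sentence. The sentence and the helper name count are invented for illustration; the last pattern uses a non-capturing group, which does not change what it matches.

```python
import re

text = "The other day, the cat saw the_end."

def count(pattern):
    """Number of non-overlapping matches of pattern in the sample text."""
    return len(re.findall(pattern, text))

n_plain = count(r"the")              # misses "The", hits "other"
n_class = count(r"[tT]he")           # finds "The", still hits "other"
n_boundary = count(r"\b[tT]he\b")    # drops "other", but also drops "the_end"
n_nonletter = count(r"(?:^|[^a-zA-Z])[tT]he[^a-zA-Z]")  # keeps "the_"

print(n_plain, n_class, n_boundary, n_nonletter)   # 3 4 2 3
```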
12Errors
- The process we just went through was based on fixing two kinds of errors
- Not matching things that we should have matched (The): false negatives
- Matching strings that we should not have matched (there, then, other): false positives
13Errors cont.
- We'll be telling the same story for many tasks
- Reducing the error rate for an application often involves two antagonistic efforts
- Increasing accuracy (minimizing false positives)
- Increasing coverage (minimizing false negatives)
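To make the two error types concrete, here is a small Python sketch that lists false positives and false negatives for a pattern. The hand-labelled token list and the function name errors are hypothetical, chosen only to mirror the examples on the Errors slides.

```python
import re

# Hypothetical gold labels: True means the token really is the word "the".
labelled = {"the": True, "The": True, "other": False, "then": False, "theology": False}

def errors(pattern):
    """Return (false_positives, false_negatives) of pattern on the labelled tokens."""
    false_pos = [t for t, gold in labelled.items() if re.search(pattern, t) and not gold]
    false_neg = [t for t, gold in labelled.items() if not re.search(pattern, t) and gold]
    return false_pos, false_neg

print(errors(r"the"))            # false positives and one false negative
print(errors(r"[tT]he"))         # false negative fixed, false positives remain
print(errors(r"\b[tT]he\b"))     # both lists empty on this toy data
```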
14More complex RE example
- Regular expressions for prices
- /\$[0-9]+/
- Doesn't deal with fractions of dollars
- /\$[0-9]+\.[0-9][0-9]/
- Doesn't allow $199, not at a word boundary
- /\b\$[0-9]+(\.[0-9][0-9])?\b/
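A hedged Python sketch of the price patterns. Two things in it are assumptions of this sketch: the dollar sign is escaped, since a bare $ is the end-of-line anchor, and the leading \b of the last pattern is left out, because there is no word boundary between a space (or the start of the line) and a dollar sign. The sample sentence is invented.

```python
import re

text = "It costs $199 or $24.99, not 199 dollars."

# Whole dollars only: the cents of $24.99 are cut off.
whole_dollars = re.findall(r"\$[0-9]+", text)

# Cents required: $199 is no longer found.
with_cents = re.findall(r"\$[0-9]+\.[0-9][0-9]", text)

# Cents optional, and the number must end at a word boundary.
prices = re.findall(r"\$[0-9]+(?:\.[0-9][0-9])?\b", text)

print(whole_dollars, with_cents, prices)
```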
15Advanced operators
Regular expression operators for counting (RE: match)
- {n}: exactly n occurrences of the previous character or expression
- {n,m}: from n to m occurrences of the previous character or expression
- {n,}: at least n occurrences of the previous character or expression
- /a\.{24}z/: a followed by 24 dots followed by z
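The counting operators behave the same way in Python. A minimal sketch, with strings built programmatically so the counts are easy to see:

```python
import re

# Exactly 24 literal dots between a and z.
exact = re.fullmatch(r"a\.{24}z", "a" + "." * 24 + "z") is not None
too_few = re.fullmatch(r"a\.{24}z", "a" + "." * 23 + "z") is not None

# Which run lengths of "a" (0 to 5) does each counter accept?
ranged = [n for n in range(6) if re.fullmatch(r"a{2,4}", "a" * n)]
at_least = [n for n in range(6) if re.fullmatch(r"a{3,}", "a" * n)]

print(exact, too_few, ranged, at_least)   # True False [2, 3, 4] [3, 4, 5]
```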
16Advanced operators
To refer to characters that are special themselves, precede them with a backslash (RE, match, example string matched)
- \* an asterisk (*), as in K*A*P*L*A*N
- \. a period (.), as in Dr. Livingston, I presume.
- \? a question mark (?), as in Would you light my candle?
- \n a newline
- \t a tab
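A short Python sketch contrasting escaped and unescaped special characters. The sample strings are invented for illustration.

```python
import re

# Unescaped, the dot is a wild card; escaped, it is a literal period.
unescaped_dot = re.findall(r".", "a.b")
escaped_dot = re.findall(r"\.", "a.b")

asterisks = re.findall(r"\*", "2*3*4")
question_marks = re.findall(r"\?", "Really? Yes.")

# \t and \n match a tab and a newline.
fields = re.split(r"\t", "a\tb\tc")
newline_count = len(re.findall(r"\n", "one\ntwo\n"))

print(unescaped_dot, escaped_dot, asterisks, question_marks, fields, newline_count)
```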
17Advanced operators
Slide from Dorr/Monz
18Substitutions and Memory
- Substitution operator
- s/regexp1/regexp2/ (UNIX, Perl)
- s/colour/color/
- s/colour/color/g substitutes as many times as possible!
- s/colour/color/i uses case-insensitive matching
Slide from Dorr/Monz
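Python has no s/// operator; a rough equivalent, offered here as a sketch, is re.sub. Note the different default: re.sub replaces every match unless count is given, so count=1 plays the role of a substitution without g. The sample sentence is invented.

```python
import re

text = "Colour me a colour, any colour."

# s/colour/color/ : first match only
first_only = re.sub(r"colour", "color", text, count=1)

# s/colour/color/g : as many times as possible
all_matches = re.sub(r"colour", "color", text)

# s/colour/color/i : first match, ignoring case
ignore_case = re.sub(r"colour", "color", text, count=1, flags=re.IGNORECASE)

print(first_only)
print(all_matches)
print(ignore_case)
```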
19Substitutions and Memory
- Substitutions
- the Xer they were, the Xer they will be
- constrain the two Xs to be the same string
- Use numbered memories or registers (\1, \2, etc.) to refer back to earlier matches; this is an extended feature of regular expressions
- /the (.*)er they were, the \1er they will be/
- /the (.*)er they (.*), the \1er they \2/
Slide from Dorr/Monz
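Back-references work the same way in Python's re module. A minimal sketch with invented test sentences:

```python
import re

one_memory = r"the (.*)er they were, the \1er they will be"

# \1 must repeat whatever the first parenthesised group matched.
same = re.search(one_memory, "the bigger they were, the bigger they will be") is not None
different = re.search(one_memory, "the bigger they were, the faster they will be") is not None

two_memories = r"the (.*)er they (.*), the \1er they \2"
match = re.search(two_memories, "the bigger they were, the bigger they were")

print(same, different, match.groups())
```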
20Eliza (Weizenbaum, 1966)
- User: Men are all alike
- ELIZA: IN WHAT WAY
- User: They're always bugging us about something or other
- ELIZA: CAN YOU THINK OF A SPECIFIC EXAMPLE?
- User: Well, my boyfriend made me come here
- ELIZA: YOUR BOYFRIEND MADE YOU COME HERE
- User: He says I'm depressed much of the time
- ELIZA: I AM SORRY TO HEAR THAT YOU ARE DEPRESSED
21Eliza-style regular expressions
Step 1: replace first-person with second-person references
- s/\bI('m| am)\b/YOU ARE/g
- s/\bmy\b/YOUR/g
- s/\bmine\b/YOURS/g
Step 2: use substitutions that look for relevant patterns in the input and create an appropriate output (reply)
- s/.* YOU ARE (depressed|sad) .*/I AM SORRY TO HEAR YOU ARE \1/
- s/.* YOU ARE (depressed|sad) .*/WHY DO YOU THINK YOU ARE \1/
- s/.* all .*/IN WHAT WAY/
- s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE/
Step 3: use scores to rank possible transformations
Slide from Dorr/Monz
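A toy Python sketch of steps 1 and 2. It is a simplification, not Weizenbaum's program: the function name eliza_reply is made up, the rules are tried in a fixed order where step 3 would use scores, only one of the two YOU ARE rules is included, and None is returned when no rule applies.

```python
import re

RULES = [
    (r".* YOU ARE (depressed|sad) .*", r"I AM SORRY TO HEAR YOU ARE \1"),
    (r".* all .*", "IN WHAT WAY"),
    (r".* always .*", "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
]

def eliza_reply(utterance):
    """Return an ELIZA-style reply, or None if no rule matches."""
    # Step 1: swap first-person references for second-person ones.
    s = re.sub(r"\bI('m| am)\b", "YOU ARE", utterance)
    s = re.sub(r"\bmy\b", "YOUR", s)
    s = re.sub(r"\bmine\b", "YOURS", s)
    # Step 2: the first rule whose pattern covers the input builds the reply.
    for pattern, template in RULES:
        match = re.fullmatch(pattern, s)
        if match:
            return match.expand(template).upper()
    return None

print(eliza_reply("Men are all alike"))
print(eliza_reply("He says I'm depressed much of the time"))
```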
22Summary on REs so far
- Regular expressions are perhaps the single most useful tool for text manipulation
- Dumb but ubiquitous
- Eliza: you can do a lot with simple regular-expression substitutions
23Three Views
- Three equivalent formal ways to look at what we're up to (thanks to Martin Kay)
Regular Expressions
Finite State Automata
Regular Languages
24Finite State Automata
- Terminology: Finite State Automata, Finite State Machines, FSA, Finite Automata
- Regular expressions are one way of specifying the structure of finite-state automata.
- FSAs and their close relatives are at the core of most algorithms for speech and language processing.
25Finite-state Automata (Machines)
Slide from Dorr/Monz
26Sheep FSA
- We can say the following things about this machine
- It has 5 states
- At least b, a, and ! are in its alphabet
- q0 is the start state
- q4 is the final (accept) state
- It has 5 transitions
27More Formally Defining an FSA
- You can specify an FSA by enumerating the following things.
- a finite set of states Q
- a finite alphabet of symbols Σ
- the start state q0
- the set of accepting/final states F such that F ⊆ Q
- a transition function δ(q,i) that maps Q × Σ to Q
- Given a state q ∈ Q and an input symbol i ∈ Σ, δ(q,i) returns a new state q′ ∈ Q.
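The five components can be written down directly as data. The sketch below encodes the sheep machine (the language of /baa+!/) under some assumptions: the class name FSA and its field names are hypothetical, and δ is stored as a partial table, so a missing (state, symbol) entry stands for "no legal move" where a total function would name a state.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FSA:
    states: frozenset    # Q
    alphabet: frozenset  # Σ
    start: str           # q0
    finals: frozenset    # F, a subset of Q
    delta: dict          # δ as a table: (state, symbol) -> state

sheep = FSA(
    states=frozenset({"q0", "q1", "q2", "q3", "q4"}),
    alphabet=frozenset("ba!"),
    start="q0",
    finals=frozenset({"q4"}),
    delta={
        ("q0", "b"): "q1",
        ("q1", "a"): "q2",
        ("q2", "a"): "q3",
        ("q3", "a"): "q3",   # loop: any number of further a's
        ("q3", "!"): "q4",
    },
)

print(len(sheep.states), len(sheep.delta))   # 5 5
```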
28Yet Another View
29Recognition
- Recognition is the process of determining if a string should be accepted by a machine
- Or it's the process of determining if a string is in the language we're defining with the machine
- Or it's the process of determining if a regular expression matches a string
30Recognition
- Traditionally (Turing's idea, 1936), this process is depicted with a tape.
http://www.cs.princeton.edu/introcs/75turing/
31Recognition - Execution
- Start in the start state
- Examine the current input in the active cell
- Consult the table: a finite table of instructions (a state transition diagram) that specifies exactly what action the machine takes at each step
- Go to a new state and update the tape pointer.
- Until you run out of tape.
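The loop described above can be sketched in a few lines of Python. This is one plausible reading of a table-driven deterministic recognizer, not necessarily the exact D-recognize algorithm from the lecture; the function name and the table for the sheep language /baa+!/ are assumptions of the sketch.

```python
SHEEP_DELTA = {
    ("q0", "b"): "q1",
    ("q1", "a"): "q2",
    ("q2", "a"): "q3",
    ("q3", "a"): "q3",
    ("q3", "!"): "q4",
}

def d_recognize(tape, delta, start, finals):
    """Return True if the deterministic machine accepts the whole tape."""
    state = start                           # start in the start state
    for symbol in tape:                     # examine the active cell
        nxt = delta.get((state, symbol))    # consult the table
        if nxt is None:                     # no entry: reject
            return False
        state = nxt                         # go to the new state, advance the pointer
    return state in finals                  # out of tape: accept only in a final state

for s in ["baa!", "baaaa!", "ba!", "baa"]:
    print(s, d_recognize(s, SHEEP_DELTA, "q0", {"q4"}))
```

Changing the machine means changing only the table passed in; the loop stays the same.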
32Input Tape
ACCEPT
Slide from Dorr/Monz
33Input Tape
REJECT
Slide from Dorr/Monz
34Adding a failing state
[Figure: state diagram with states q0, q1, q2, q3, q4 and arc labels b, a, a, a, !]
Slide from Dorr/Monz
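One way to add a failing state in code is to fill every empty cell of the transition table with an explicit sink state, so that the table becomes a total function. This is a sketch under assumptions: the state name qF and the function name add_fail_state are hypothetical, and the table is the sheep machine for /baa+!/.

```python
SHEEP_DELTA = {
    ("q0", "b"): "q1",
    ("q1", "a"): "q2",
    ("q2", "a"): "q3",
    ("q3", "a"): "q3",
    ("q3", "!"): "q4",
}

def add_fail_state(delta, states, alphabet, fail="qF"):
    """Return a copy of delta in which every missing entry leads to the fail state."""
    total = dict(delta)
    for state in list(states) + [fail]:
        for symbol in alphabet:
            # Existing transitions are kept; empty cells (and all cells of
            # the fail state itself) point to the fail state.
            total.setdefault((state, symbol), fail)
    return total

total_delta = add_fail_state(SHEEP_DELTA, ["q0", "q1", "q2", "q3", "q4"], "ba!")
print(len(total_delta))   # 18 entries: 6 states times 3 symbols
```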
35Tracing D-Recognize
36Key Points
- Deterministic means that at each point in processing there is always one unique thing to do (no choices).
- D-recognize is a simple table-driven interpreter
- The algorithm is universal for all unambiguous languages.
- To change the machine, you change the table.
37Key Points
- Deterministic Pattern
- Example: Consider a set of traffic lights; the sequence of lights is red - red/amber - green - amber - red. The sequence can be pictured as a state machine, where the different states of the traffic lights follow each other.
- Each state is dependent solely on the previous state, so if the lights are green, an amber light will always follow; that is, the system is deterministic. Deterministic systems are relatively easy to understand and analyse, once the transitions are fully known.
38Key Points
- Crudely, therefore, matching strings with regular expressions (a la Perl) is a matter of
- translating the expression into a machine (table) and
- passing the table to an interpreter
39Recognition as Search
- You can view this algorithm as state-space search.
- States are pairings of tape positions and state numbers.
- Operators are compiled into the table
- Goal state is a pairing with the end of tape position and a final accept state
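The search view can be made visible by recording the (tape position, machine state) pairs that a deterministic run passes through. A minimal sketch, assuming the sheep machine for /baa+!/; the function names trace and reaches_goal are hypothetical. With a deterministic table there is only ever one successor, so the search is a single path.

```python
SHEEP_DELTA = {
    ("q0", "b"): "q1",
    ("q1", "a"): "q2",
    ("q2", "a"): "q3",
    ("q3", "a"): "q3",
    ("q3", "!"): "q4",
}

def trace(tape, delta, start):
    """Return the search states (tape position, machine state) visited in order."""
    path = [(0, start)]
    state = start
    for position, symbol in enumerate(tape):
        state = delta.get((state, symbol))
        if state is None:          # dead end: no operator applies
            break
        path.append((position + 1, state))
    return path

def reaches_goal(tape, delta, start, finals):
    """Goal test: end-of-tape position paired with a final state."""
    position, state = trace(tape, delta, start)[-1]
    return position == len(tape) and state in finals

print(trace("baa!", SHEEP_DELTA, "q0"))
print(reaches_goal("ba!", SHEEP_DELTA, "q0", {"q4"}))
```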
40Generative Formalisms
- A formal language is a set of strings; a model m which can both generate and recognize all and only the strings of a formal language serves as a definition of that language
- each string is composed of symbols from a finite set of symbols (alphabet)
- L(m): the formal language L characterized by the model m
- Finite-state automata define formal languages (without having to enumerate all the strings in the language)
- The term Generative is based on the view that you can run the machine as a generator to get strings from the language.
41Generative Formalisms
- FSAs can be viewed from two perspectives
- Acceptors that can tell you if a string is in the language (recognition)
- Generators to produce all and only the strings in the language (production)
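The generator view can be sketched by walking the same transition table breadth-first and collecting every string that ends in a final state. The length bound is an assumption added here, since the language itself is infinite; the function name generate and the sheep table for /baa+!/ are likewise illustrative.

```python
import re
from collections import deque

SHEEP_DELTA = {
    ("q0", "b"): "q1",
    ("q1", "a"): "q2",
    ("q2", "a"): "q3",
    ("q3", "a"): "q3",
    ("q3", "!"): "q4",
}

def generate(delta, start, finals, max_len):
    """Return all accepted strings of length <= max_len, shortest first."""
    results = []
    queue = deque([(start, "")])
    while queue:
        state, prefix = queue.popleft()
        if state in finals:
            results.append(prefix)
        if len(prefix) < max_len:
            # Extend the prefix along every transition leaving this state.
            for (source, symbol), target in delta.items():
                if source == state:
                    queue.append((target, prefix + symbol))
    return results

strings = generate(SHEEP_DELTA, "q0", {"q4"}, max_len=6)
print(strings)   # ['baa!', 'baaa!', 'baaaa!']

# Every generated string is also matched by the regular expression.
print(all(re.fullmatch(r"baa+!", s) for s in strings))
```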
42Summary
- Regular expressions are just a compact textual representation of FSAs
- Recognition is the process of determining if a string/input is in the language defined by some machine.
- Recognition is straightforward with deterministic machines.