Title: CPSC 503 Computational Linguistics
1 CPSC 503 Computational Linguistics
- RegExps and Finite State Automata Lecture 2
- Giuseppe Carenini
2 Survey Results
By Student
By topic
3 Knowledge-Formalisms Map (including probabilistic formalisms)
Linguistic knowledge
- Morphology
- Syntax
- Semantics
- Pragmatics, Discourse and Dialogue
Formalisms
- State Machines (and prob. versions) (Finite State Automata, Finite State Transducers, Markov Models)
- Rule systems (and prob. versions) (e.g., (Prob.) Context-Free Grammars)
- Logical formalisms (First-Order Logics)
- AI planners
4 Next Two Lectures
- State Machines (no prob.)
- Finite State Automata (and Regular Expressions)
- Finite State Transducers
- (English) Morphology
5 Today 1/16
- Regular Expressions
- Errors
- Finite-state automata
- Generation
- Recognition
- Non-determinism
6 Regular Expressions
- Def. Notation to specify a set of strings
- [ ] disjunction of characters, [^ ] negation
- /CPSC50[34]/, /CPSC50[0-9]/, /CPSC50[^34]/
- . Any character (to match a literal period use \.)
- | OR /([Ff]rom|[Ss]ubject|[Dd]ate)/
- Anchors ^ (start of line), $ (end of line), \b (word boundary)
- /([Ff]rom\b|[Ss]ubject\b|[Dd]ate\b)/
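The character-class, negation, alternation and word-boundary notation on this slide can be tried out with Python's `re` module. This is a minimal sketch: the list of course codes and the sample header string are made up for illustration and are not from the lecture.

```python
import re

# Hypothetical course codes, used only for illustration.
codes = ["CPSC503", "CPSC504", "CPSC505"]

# [34] matches one character that is either 3 or 4.
either = [c for c in codes if re.fullmatch(r"CPSC50[34]", c)]
# [^34] matches one character that is anything except 3 or 4.
neither = [c for c in codes if re.fullmatch(r"CPSC50[^34]", c)]

# Alternation with |, and a word boundary \b after each keyword.
header = re.compile(r"([Ff]rom\b|[Ss]ubject\b|[Dd]ate\b)")
sample = "From: me, subject: dates and a date"  # hypothetical text
hits = [m.group(0) for m in header.finditer(sample)]

print(either)   # ['CPSC503', 'CPSC504']
print(neither)  # ['CPSC505']
print(hits)     # ['From', 'subject', 'date']
```

Note that "dates" is not matched: the \b after "date" requires the word to end there.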
7 Regular Expressions (cont.)
- ( ) Grouping /happy|ier/ vs. /happ(y|ier)/
- Operators applied to preceding item (character or exp.)
- ? Optional /colou?r/, /July? (fourth|4(th)?)/
- Repetitions
- + one or more
- * any number including none
- {num} num times
- /[0-9](\.[0-9]){3}/
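A short sketch of how grouping, the optional operator and counted repetition behave, again using Python's `re` module. The test strings are illustrative choices, not examples from the lecture.

```python
import re

# Without grouping, | splits the whole pattern: "happy" OR "ier".
ungrouped = re.compile(r"happy|ier")
# With grouping, | only chooses the ending: "happy" OR "happier".
grouped = re.compile(r"happ(y|ier)")

print(bool(ungrouped.fullmatch("ier")))    # True
print(bool(grouped.fullmatch("ier")))      # False
print(bool(grouped.fullmatch("happier")))  # True

# ? makes the preceding item (here the letter u) optional.
colour = re.compile(r"colou?r")
spellings = [w for w in ["color", "colour", "colouur"] if colour.fullmatch(w)]
print(spellings)  # ['color', 'colour']

# {3} repeats the preceding group exactly three times.
dotted = re.compile(r"[0-9](\.[0-9]){3}")
print(bool(dotted.fullmatch("1.2.3.4")))  # True
print(bool(dotted.fullmatch("1.2.3")))    # False
```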
8 Example of Usage: Text Searching
- Find me all instances of the determiner "the" in an English text.
- To count them
- To substitute them with something else
- You try /the/
The other cop went to the bank but there were no people there.
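To see what the first attempt actually does, the pattern /the/ can be run over the sample sentence, both to count matches and to substitute them. This is a small sketch with Python's `re` module; the replacement string "THE" is an arbitrary choice that makes the matches visible.

```python
import re

text = "The other cop went to the bank but there were no people there."

naive = re.compile(r"the")

# Counting: every occurrence of the letters t-h-e, wherever it appears.
count = len(naive.findall(text))

# Substituting: upper-case each match so it stands out.
replaced = naive.sub("THE", text)

print(count)     # 4
print(replaced)  # The oTHEr cop went to THE bank but THEre were no people THEre.
```

Only one of the four matches is the determiner, and the sentence-initial "The" is missed.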
9 Errors
- The process we just went through was based on fixing two kinds of errors
- Matching strings that we should not have matched (there, other): false positives
- Not matching things that we should have matched (The): false negatives
10 Errors cont.
- Reducing the error rate for an application often involves two antagonistic efforts
- Increasing accuracy (minimizing false positives)
- Increasing coverage (minimizing false negatives).
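The two kinds of error can be counted directly on the sample sentence "The other cop went to the bank but there were no people there." The sketch below rests on assumptions: the gold offsets of the two real determiners were marked by hand, and the two refined patterns are one plausible way of fixing /the/, not necessarily the ones used in class.

```python
import re

text = "The other cop went to the bank but there were no people there."

# Start offsets of the two real determiners, marked by hand for this sentence.
gold = {0, 22}

def error_counts(pattern):
    """Return (false_positives, false_negatives) for one pattern."""
    found = {m.start() for m in re.finditer(pattern, text)}
    return len(found - gold), len(gold - found)

# Assumed refinement steps: allow a capital T, then require word boundaries.
for pattern in [r"the", r"[tT]he", r"\b[tT]he\b"]:
    fp, fn = error_counts(pattern)
    print(f"/{pattern}/ false positives={fp} false negatives={fn}")
```

Allowing a capital T removes the false negative, and the word boundaries then remove the false positives.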
11 Finite State Automata
- Finite state automata implement (generate and recognize) regular expressions
- Regular expressions describe many linguistic phenomena
- Finite state automata model many linguistic phenomena
12 FSAs as Graphs
- Let's start with the sheep language from the text: /baa+!/
13 Verify
- It can generate the same set of strings (language)
- To generate a string
- follow a path leading to an accept state
- at each transition output corresponding symbol
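Generation by following a path can be sketched in a few lines. The transition list below is an assumption: it is the usual five-state sheep machine (states q0 to q4, with a self-loop on q3 for extra a's), written as integers 0 to 4 because the slide's graph is not reproduced in the text.

```python
# Assumed transitions of the sheep machine as (state, symbol, next_state).
TRANSITIONS = [
    (0, "b", 1),
    (1, "a", 2),
    (2, "a", 3),
    (3, "a", 3),  # self-loop: any number of extra a's
    (3, "!", 4),
]
ACCEPT = {4}

def generate(path):
    """Output the symbols along a path that starts at q0 and ends in an accept state."""
    if path[0] != 0 or path[-1] not in ACCEPT:
        raise ValueError("path must start at q0 and end in an accept state")
    out = []
    for src, dst in zip(path, path[1:]):
        symbols = [sym for (s, sym, d) in TRANSITIONS if s == src and d == dst]
        if not symbols:
            raise ValueError(f"no transition from q{src} to q{dst}")
        out.append(symbols[0])  # at each transition, output its symbol
    return "".join(out)

print(generate([0, 1, 2, 3, 4]))        # baa!
print(generate([0, 1, 2, 3, 3, 3, 4]))  # baaaa!
```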
14 Sheep FSA
- We can say the following things about this machine
- It has 5 states
- b, a, and ! are in its alphabet
- q0 is the start state
- q4 is an accept state
- It has 5 transitions
15 Sheep FSA
- We can say the following things about this machine
- It has 5 states
- At least b, a, and ! are in its alphabet
- q0 is the start state
- q4 is an accept state
- It has 5 transitions
16 But note
- There are other machines that correspond to this language
- More on this one later
17 More Formally
- You can specify an FSA by enumerating the following things.
- The set of states Q
- A finite alphabet Σ
- A start state
- A set of accept/final states
- A transition function that maps Q × Σ to Q
18 Represented as a Table
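One way to write down the five components and the transition table for the sheep machine is sketched below. The `FSA` container and the `table_rows` helper are hypothetical names introduced here, and the transitions are the assumed standard sheep machine; a missing table entry means no transition is allowed.

```python
from typing import NamedTuple

class FSA(NamedTuple):
    states: frozenset    # Q
    alphabet: frozenset  # finite alphabet
    start: int           # start state
    accept: frozenset    # accept/final states
    delta: dict          # (state, symbol) -> next state; missing key = no transition

SHEEP = FSA(
    states=frozenset({0, 1, 2, 3, 4}),
    alphabet=frozenset("ba!"),
    start=0,
    accept=frozenset({4}),
    delta={(0, "b"): 1, (1, "a"): 2, (2, "a"): 3, (3, "a"): 3, (3, "!"): 4},
)

def table_rows(fsa):
    """Return the column symbols and one row of next states per state."""
    symbols = sorted(fsa.alphabet)
    rows = [(q, [fsa.delta.get((q, s)) for s in symbols]) for q in sorted(fsa.states)]
    return symbols, rows

symbols, rows = table_rows(SHEEP)
print("state", symbols)
for q, row in rows:
    mark = ":" if q in SHEEP.accept else " "  # colon marks an accept state
    print(f"q{q}{mark}", row)
```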
19 About Alphabets
- Don't take that word too narrowly; it just means we need a finite set of symbols in the input.
- These symbols can and will stand for bigger objects that can have internal structure.
20 Dollars and Cents
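As an illustration of an alphabet whose symbols are whole words, here is a toy machine for money amounts. It is a hypothetical sketch, not the machine from the slide: the number vocabulary, the state numbering and the transitions are all invented for the example.

```python
# Hypothetical word-level alphabet: each input symbol is a whole word.
NUMBER_WORDS = {"two", "ten", "twenty"}

# Hypothetical transitions: number -> "dollars" -> (number -> "cents"),
# or number -> "cents" directly.
DELTA = {(1, "dollars"): 2, (1, "cents"): 4, (3, "cents"): 4}
for word in NUMBER_WORDS:
    DELTA[(0, word)] = 1
    DELTA[(2, word)] = 3
ACCEPT = {2, 4}

def accepts(words):
    """Run the machine over a list of word symbols."""
    state = 0
    for word in words:
        state = DELTA.get((state, word))
        if state is None:  # no transition for this word
            return False
    return state in ACCEPT

print(accepts("ten dollars twenty cents".split()))  # True
print(accepts("ten cents".split()))                 # True
print(accepts("dollars ten".split()))               # False
```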
21 Recognition
- Def. process of determining if a string is in the language we're defining with the machine
- Or it's the process of determining if the equivalent regular expression matches a string
22 Recognition Pseudocode (slide)
- Assume input on a tape
- Start in the start state pointing at the beginning of the tape
- Examine the current input symbol
- Consult the table
- (If a transition is allowed) Go to a new state and update the tape pointer (Else Fail).
- Repeat this process, until you run out of tape
- Now, if you are in an accept state accept the string, otherwise Fail
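The steps above translate almost line for line into code. This is a sketch rather than the exact D-recognize pseudocode from the textbook; the table is the assumed sheep machine, stored as a dictionary in which a missing entry means no transition is allowed.

```python
# Assumed sheep machine: (state, symbol) -> next state.
SHEEP_TABLE = {(0, "b"): 1, (1, "a"): 2, (2, "a"): 3, (3, "a"): 3, (3, "!"): 4}

def d_recognize(tape, table, start=0, accept=frozenset({4})):
    state = start                    # start in the start state
    index = 0                        # pointing at the beginning of the tape
    while index < len(tape):         # repeat until we run out of tape
        symbol = tape[index]         # examine the current input symbol
        next_state = table.get((state, symbol))  # consult the table
        if next_state is None:       # no transition allowed: fail
            return False
        state = next_state           # go to the new state
        index += 1                   # and update the tape pointer
    return state in accept           # accept only if we ended in an accept state

for s in ["baa!", "baaaa!", "ba!", "baa", "baa!a"]:
    print(s, d_recognize(s, SHEEP_TABLE))
```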
23 D-Recognize
24 Key Points
- D-recognize is a simple table-driven interpreter
- Matching strings with regular expressions (à la Perl) is a matter of
- translating the expression into a machine (table) and
- passing the table to an interpreter
25 FSA Generative Formalisms
- FSAs can be viewed from two perspectives
- Acceptors that can tell you if a string is in the language
- Generators to produce all and only the strings in the language
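The generator view can be sketched as a breadth-first walk over the machine that emits a string whenever it sits in an accept state. Because the sheep language is infinite, the sketch stops at a maximum length; `max_len` and the function name are choices made for this example, and the table is the assumed sheep machine.

```python
from collections import deque

# Assumed sheep machine: (state, symbol) -> next state.
SHEEP_TABLE = {(0, "b"): 1, (1, "a"): 2, (2, "a"): 3, (3, "a"): 3, (3, "!"): 4}

def generate_all(table, start, accept, max_len):
    """List every string of length <= max_len that the machine can generate."""
    results = []
    queue = deque([(start, "")])  # (current state, symbols output so far)
    while queue:
        state, prefix = queue.popleft()
        if state in accept:
            results.append(prefix)
        if len(prefix) == max_len:
            continue  # do not extend beyond the length bound
        for (src, symbol), dst in sorted(table.items()):
            if src == state:
                queue.append((dst, prefix + symbol))
    return results

strings = generate_all(SHEEP_TABLE, 0, {4}, max_len=6)
print(strings)  # ['baa!', 'baaa!', 'baaaa!']
```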
26 Non-Determinism
27 Non-Determinism cont.
- Yet another technique
- Epsilon (ε) transitions
- Key point: these transitions do not examine or advance the tape during recognition
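Because an epsilon transition consumes no input, a recognizer needs to know every state reachable "for free" from where it stands. A minimal sketch of that computation follows. The machine is an assumption: a sheep variant in which an epsilon arc from q3 back to q2 replaces the self-loop, with the empty string used as the epsilon marker.

```python
EPSILON = ""  # marker for transitions that read no input (a convention of this sketch)

# Assumed machine: (state, symbol) -> set of next states.
NFA = {
    (0, "b"): {1},
    (1, "a"): {2},
    (2, "a"): {3},
    (3, EPSILON): {2},  # jump back without advancing the tape
    (3, "!"): {4},
}

def epsilon_closure(states, table):
    """All states reachable from `states` using only epsilon transitions."""
    closure = set(states)
    stack = list(states)
    while stack:
        q = stack.pop()
        for nxt in table.get((q, EPSILON), set()):
            if nxt not in closure:
                closure.add(nxt)
                stack.append(nxt)
    return closure

print(epsilon_closure({3}, NFA))  # {2, 3}
print(epsilon_closure({0}, NFA))  # {0}
```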
28 Non-Deterministic Recognition: Key ideas
- An input can lead to multiple paths
- The algorithm may need to explore all possible paths
- Whenever there is a choice, one possibility is to explore alternatives one at a time.
- Save alternatives in an agenda
29 Non-Deterministic Recognition
- Success occurs when a path is found through the machine that ends in an accept state
- Failure occurs when none of the possible paths lead to an accept state
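An agenda-based recognizer along these lines can be sketched as follows. It is not the textbook's ND-recognize verbatim: the `seen` set is an addition that guards against looping on epsilon arcs, and the machine is an assumed non-deterministic sheep variant in which q2 has two choices on the symbol a.

```python
EPSILON = ""  # marker for transitions that read no input

# Assumed machine: (state, symbol) -> set of next states.
ND_SHEEP = {
    (0, "b"): {1},
    (1, "a"): {2},
    (2, "a"): {2, 3},  # the non-deterministic choice
    (3, "!"): {4},
}

def nd_recognize(tape, table, start=0, accept=frozenset({4})):
    agenda = [(start, 0)]  # search states: (machine state, tape position)
    seen = set()
    while agenda:
        state, index = agenda.pop()  # LIFO agenda = depth-first search
        if (state, index) in seen:
            continue
        seen.add((state, index))
        if index == len(tape) and state in accept:
            return True  # success: some path ends in an accept state
        for nxt in table.get((state, EPSILON), ()):
            agenda.append((nxt, index))  # epsilon: tape not advanced
        if index < len(tape):
            for nxt in table.get((state, tape[index]), ()):
                agenda.append((nxt, index + 1))  # save every alternative
    return False  # failure: no path leads to an accept state

for s in ["baaa!", "baa!", "ba!", "baa"]:
    print(s, nd_recognize(s, ND_SHEEP))
```

Swapping `agenda.pop()` for a pop from the front of the list would turn the same loop into a breadth-first search.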
30 Example (slide)
Tape: b a a a !
31 Recognition as Search
32 Equivalence between D and ND
- ND machines can always be converted to D ones
- That means that ND machines are not more powerful than D ones
- It also means that one way to do recognition with an ND machine is to turn it into a D one.
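The conversion is usually done with the subset construction, in which each state of the D machine stands for a set of ND states. The sketch below assumes a machine without epsilon transitions (handling them would additionally need an epsilon-closure step) and uses an assumed non-deterministic sheep variant; the function names are choices made for this example.

```python
# Assumed ND machine: (state, symbol) -> set of next states.
ND_SHEEP = {
    (0, "b"): {1},
    (1, "a"): {2},
    (2, "a"): {2, 3},
    (3, "!"): {4},
}

def determinize(table, start, accept, alphabet):
    """Subset construction for an ND machine with no epsilon transitions."""
    start_set = frozenset({start})
    dfa_table, dfa_accept = {}, set()
    todo, seen = [start_set], {start_set}
    while todo:
        current = todo.pop()
        if current & accept:  # accepting if it contains any ND accept state
            dfa_accept.add(current)
        for symbol in alphabet:
            target = frozenset(n for q in current for n in table.get((q, symbol), ()))
            if not target:
                continue  # no ND state can move on this symbol
            dfa_table[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    return dfa_table, start_set, dfa_accept

def d_recognize(tape, table, start, accept):
    state = start
    for symbol in tape:
        state = table.get((state, symbol))
        if state is None:
            return False
    return state in accept

dfa_table, dfa_start, dfa_accept = determinize(ND_SHEEP, 0, {4}, "ab!")
print(len(dfa_table), "transitions")
print(d_recognize("baaa!", dfa_table, dfa_start, dfa_accept))  # True
print(d_recognize("ba!", dfa_table, dfa_start, dfa_accept))    # False
```

In the worst case the number of subsets grows exponentially with the number of ND states, although this small machine does not show it.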
33 Why Bother?
- Non-determinism doesn't get us more formal power and it causes headaches, so why bother?
- More natural solutions
- Deterministic machines produced by the conversion can be too big
34 Next Time
- Read Chapter 1 (on-line) and Chapter 2 of textbook
- Try to understand
- ND-recognize algorithm
- and why it is a state-space search algorithm