LING 6932 Topics in Computational Linguistics

Slides: 43
Provided by: danjur
Learn more at: http://plaza.ufl.edu
1
LING 6932 Topics in Computational Linguistics
  • Hana Filip
  • Lecture 2: Regular Expressions, Finite State
    Automata

2
Regular expressions
  • formulas for specifying text strings
  • How can we search for any of these strings?
  • woodchuck
  • woodchucks
  • Woodchuck
  • Woodchucks

Figure from Dorr/Monz slides
3
Regular Expressions
  • Basic patterns of regular expressions
  • Perl-based syntax (slightly different from other
    notations for regular expressions as used in
    UNIX, for example)
  • /Woodchuck/ matches any string containing the
    substring Woodchuck, if your search application
    returns entire lines, for example
  • The slashes / are notation used by Perl, NOT part of the RE
  • Google
  • Woodchuck Draft Cider
  • Producers of Woodchuck Draft Cider in
    Springfield, VT.
  • www.woodchuck.com/ - 17k - Cached - Similar
    pages

Slide from Dorr/Monz
4
Regular Expressions
  • Regular expressions are CASE SENSITIVE
  • The pattern /woodchuck/ will not match the string
    Woodchuck
  • Disjunction /[wW]oodchuck/

Slide from Dorr/Monz
5
Regular Expressions
  • Ranges [A-Z]

Slide from Dorr/Monz
6
Regular Expressions
  • Negation /[^a]/ (caret)
  • matches any single character except a
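
The bracket notation from the last three slides can be tried out directly. The following is a minimal sketch in Python's `re` module rather than Perl (an assumption of this example; the class syntax is the same in both), and the sample strings other than the woodchuck forms are made up for illustration.

```python
import re

words = ["woodchuck", "Woodchuck", "woodchucks", "Woodchucks"]

# Matching is case sensitive: only the lowercase forms contain "woodchuck".
plain = [w for w in words if re.search(r"woodchuck", w)]

# The class [wW] accepts either letter in the first position.
either_case = [w for w in words if re.search(r"[wW]oodchuck", w)]

# [A-Z] is a range; [^a] is a negated class (any single character except "a").
first_upper = re.search(r"[A-Z]", "we saw Woodchucks").group()
first_non_a = re.search(r"[^a]", "aardvark").group()

print(plain)        # ['woodchuck', 'woodchucks']
print(either_case)  # all four forms
print(first_upper)  # W
print(first_non_a)  # r
```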

Slide from Dorr/Monz
7
Regular Expressions
  • Operators ?, * and +
  • ? (0 or 1)
  • /woodchucks?/ → woodchuck or woodchucks
  • /colou?r/ → color or colour
  • * (0 or more)
  • /oo*h!/ → oh! or ooh! or ooooh!
  • + (1 or more)
  • /o+h!/ → oh! or ooh! or ooooh!
  • * and + apply to the immediately preceding character or
    regular expression
  • Wild card . /beg.n/ → begin or began or begun
  • any character between beg and n (except a
    carriage return)
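
A short sketch of the counting operators and the wild card, again assuming Python's `re` module. The helper `full` and the extra test strings (`colouur`, `h!`, `begn`) are hypothetical additions that show what each pattern rejects.

```python
import re

def full(pattern, text):
    """True if the pattern matches the whole string."""
    return re.fullmatch(pattern, text) is not None

# ? makes the preceding character optional (0 or 1 occurrences).
optional = [w for w in ["color", "colour", "colouur"] if full(r"colou?r", w)]

# * allows 0 or more of the preceding character, + requires at least 1.
candidates = ["h!", "oh!", "ooh!", "ooooh!"]
star = [w for w in candidates if full(r"oo*h!", w)]
plus = [w for w in candidates if full(r"o+h!", w)]

# . matches any single character; in Python it excludes the newline by default.
wild = [w for w in ["begin", "began", "begun", "begn", "beg\nn"] if full(r"beg.n", w)]

print(optional)  # ['color', 'colour']
print(star)      # ['oh!', 'ooh!', 'ooooh!']
print(plus)      # ['oh!', 'ooh!', 'ooooh!']
print(wild)      # ['begin', 'began', 'begun']
```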

Slide from Dorr/Monz
8
Regular Expressions
  • Anchors ^ and $
  • ^ start of line
  • /^[A-Z]/ → Ramallah, Palestine
  • /^[^A-Z]/ → verdad? really?
  • $ end of line
  • /\.$/ → It is over.
  • /.$/ → ? (any single character at the end of a line)
  • Boundaries \b and \B
  • /\bon\b/ → on my way Monday (boundary)
  • /\Bon\b/ → automaton (non-boundary)
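
The anchors and boundaries can be checked with a few searches. This is an illustrative sketch using Python's `re` module (an assumption; the slides use Perl notation), with one made-up sentence, "on the automaton", to show where `\Bon\b` finds its match.

```python
import re

# ^ anchors at the start of the line, $ at the end.
starts_upper = bool(re.search(r"^[A-Z]", "Ramallah, Palestine"))
starts_non_upper = bool(re.search(r"^[^A-Z]", "really?"))
ends_period = bool(re.search(r"\.$", "It is over."))

# \b requires a word boundary, so the "on" inside "Monday" is skipped.
whole_word = re.findall(r"\bon\b", "on my way Monday")

# \B requires a non-boundary, so only the "on" ending "automaton" matches.
suffix_start = re.search(r"\Bon\b", "on the automaton").start()

print(starts_upper, starts_non_upper, ends_period)  # True True True
print(whole_word)    # ['on']
print(suffix_start)  # 14
```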

Slide from Dorr/Monz
9
Disjunction, Grouping, Precedence
  • Disjunction
  • /yours|mine/ → it is either yours or mine
  • /gupp(y|ies)/ → guppy or guppies
  • Column 1 Column 2 Column 3 ... How do we
    express this?
  • /Column␣[0-9]+␣*/ (␣ marks a space)
  • /(Column␣[0-9]+␣*)*/ (␣ is NOT a RE character)
  • matches the word Column, followed by a number,
    followed by zero or more spaces, the whole
    pattern repeated any number of times (zero or
    more times)
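
To see why the parentheses matter, the sketch below runs the disjunction and the Column patterns in Python's `re` module (an assumption of this example). The `(?:...)` non-capturing group is used only so that `findall` returns whole matches; it groups the same way as plain parentheses.

```python
import re

# Grouping limits the disjunction to the suffix.
grouped = re.findall(r"gupp(?:y|ies)", "one guppy, two guppies")

# Without parentheses, | splits the whole pattern into "guppy" or "ies".
ungrouped = re.findall(r"guppy|ies", "guppies")

header = "Column 1 Column 2 Column 3"

# Without grouping, the final * applies only to the space: one column at most.
one_column = re.fullmatch(r"Column [0-9]+ *", header) is not None

# With grouping, the whole "Column N" unit may repeat.
all_columns = re.fullmatch(r"(Column [0-9]+ *)*", header) is not None

print(grouped)      # ['guppy', 'guppies']
print(ungrouped)    # ['ies']
print(one_column)   # False
print(all_columns)  # True
```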

Slide from Dorr/Monz
10
Disjunction, Grouping, Precedence
  • Operator Precedence Hierarchy
  • Parenthesis ()
  • Counters * + ? {}
  • Sequences and anchors the ^my end$
  • Disjunction |
  • REs are greedy!
  • They always match the largest string they can

Slide from Dorr/Monz
11
Example
  • Find me all instances of the word the in a
    text.
  • /the/
  • Misses capitalized examples
  • /[tT]he/
  • Returns other or theology
  • /\b[tT]he\b/ matches the or The
  • /[^a-zA-Z][tT]he[^a-zA-Z]/
  • /(^|[^a-zA-Z])[tT]he[^a-zA-Z]/
  • Matches the_ or the25
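
The successive refinements can be compared by counting matches on one sample sentence. This is a sketch in Python's `re` module; the sentence and the helper `count` are made up for the illustration. The intended targets are the five standalone forms: The, the, the_, the25 would count as "the" only under the last pattern.

```python
import re

text = "The other theology; the end, the_ and the25."

def count(pattern):
    """Number of non-overlapping matches of pattern in text."""
    return sum(1 for _ in re.finditer(pattern, text))

n_plain = count(r"the")            # misses "The", hits "other" and "theology"
n_case = count(r"[tT]he")          # adds "The", still hits "other" and "theology"
n_bounded = count(r"\b[tT]he\b")   # drops the_ and the25: _ and digits are word characters
n_final = count(r"(^|[^a-zA-Z])[tT]he[^a-zA-Z]")  # The, the, the_, the25

print(n_plain, n_case, n_bounded, n_final)  # 5 6 2 4
```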

12
Errors
  • The process we just went through was based on
    fixing two kinds of errors
  • Not matching things that we should have matched
    (The)
  • False negatives
  • Matching strings that we should not have matched
    (there, then, other)
  • False positives

13
Errors cont.
  • We'll be telling the same story for many tasks
  • Reducing the error rate for an application often
    involves two antagonistic efforts
  • Increasing accuracy (minimizing false positives)
  • Increasing coverage (minimizing false negatives).

14
More complex RE example
  • Regular expressions for prices
  • /\$[0-9]+/
  • Doesn't deal with fractions of dollars
  • /\$[0-9]+\.[0-9][0-9]/
  • Doesn't allow $199, not at a word boundary
  • /\b\$[0-9]+(\.[0-9][0-9])?\b/
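
The price patterns can be compared on a sample sentence. The sketch below assumes Python's `re` module and a made-up sentence. One caveat worth checking in whatever regex engine is used: because $ is not a word character, a leading \b before \$ only matches when a word character comes directly before the dollar sign, so in this sketch the last refinement is also shown without the leading \b.

```python
import re

text = "It costs $199 or $24.99, not 199 dollars."

def matches(pattern):
    """Whole-match strings for every non-overlapping match in text."""
    return [m.group() for m in re.finditer(pattern, text)]

whole_dollars = matches(r"\$[0-9]+")                  # cuts $24.99 short
with_cents = matches(r"\$[0-9]+\.[0-9][0-9]")         # misses $199
leading_b = matches(r"\b\$[0-9]+(\.[0-9][0-9])?\b")   # a space precedes each $
no_leading_b = matches(r"\$[0-9]+(\.[0-9][0-9])?\b")  # optional cents, trailing boundary

print(whole_dollars)  # ['$199', '$24']
print(with_cents)     # ['$24.99']
print(leading_b)      # []
print(no_leading_b)   # ['$199', '$24.99']
```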

15
Advanced operators
Regular expression operators for counting
RE       Match
{n}      exactly n occurrences of the previous character or expression
{n,m}    from n to m occurrences of the previous character or expression
{n,}     at least n occurrences of the previous character or expression
/a\.{24}z/ matches a followed by 24 dots followed by z
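
A brief sketch of the brace counters, assuming Python's `re` module; the strings of repeated "o" are invented test inputs.

```python
import re

# {n}: exactly n occurrences of the preceding expression (here an escaped dot).
exact = re.fullmatch(r"a\.{24}z", "a" + "." * 24 + "z") is not None
too_few = re.fullmatch(r"a\.{24}z", "a" + "." * 23 + "z") is not None

runs = ["o", "oo", "ooo", "oooo"]

# {n,m}: from n to m occurrences; {n,}: at least n occurrences.
ranged = [s for s in runs if re.fullmatch(r"o{2,3}", s)]
at_least = [s for s in runs if re.fullmatch(r"o{2,}", s)]

print(exact, too_few)  # True False
print(ranged)          # ['oo', 'ooo']
print(at_least)        # ['oo', 'ooo', 'oooo']
```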
16
Advanced operators
To refer to characters that are special themselves,
precede them with a backslash
RE    Match              Example Strings Matched
\*    an asterisk        K*A*P*L*A*N
\.    a period           Dr. Livingston, I presume.
\?    a question mark    Would you light my candle?
\n    a newline
\t    a tab
17
Advanced operators
Slide from Dorr/Monz
18
Substitutions and Memory
  • Substitution operator
  • s/regexp1/regexp2/ (UNIX, Perl)

s/colour/color/
s/colour/color/g    Substitute as many times as possible!
s/colour/color/i    Case insensitive matching
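
For comparison, here is how the three substitutions might look in Python's `re.sub` (an assumption of this sketch; the slide shows UNIX/Perl syntax). Note that the defaults differ: `re.sub` replaces every match unless `count` is given, whereas Perl's s/// replaces only the first match unless the g flag is given.

```python
import re

text = "Colour me: colour and colour"

# s/colour/color/ : replace the first match only.
first_only = re.sub(r"colour", "color", text, count=1)

# s/colour/color/g : replace as many times as possible.
every = re.sub(r"colour", "color", text)

# s/colour/color/i : case-insensitive, first match only.
ignore_case = re.sub(r"colour", "color", text, count=1, flags=re.IGNORECASE)

print(first_only)   # Colour me: color and colour
print(every)        # Colour me: color and color
print(ignore_case)  # color me: colour and colour
```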
Slide from Dorr/Monz
19
Substitutions and Memory
  • Substitutions
  • the Xer they were, the Xer they will be
  • constrain the two Xs to be the same string

Using numbered memories or registers: \1, \2,
etc. are used to refer back to matches. An extended
feature of regular expressions.
/the (.*)er they were, the \1er they will be/
/the (.*)er they (.*), the \1er they \2/
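
The numbered registers behave the same way in Python's `re` module, which this sketch assumes. The example sentences are made up; the point is that \1 must repeat exactly what the first parenthesized group matched.

```python
import re

one_register = r"the (.*)er they were, the \1er they will be"

same = re.search(one_register, "the bigger they were, the bigger they will be")
different = re.search(one_register, "the bigger they were, the faster they will be")

# Two registers: \1 repeats the first group, \2 repeats the second.
two_registers = r"the (.*)er they (.*), the \1er they \2"
both = re.search(two_registers, "the faster they ran, the faster they ran")

print(same.group(1))  # bigg
print(different)      # None
print(both.groups())  # ('fast', 'ran')
```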
Slide from Dorr/Monz
20
Eliza (Weizenbaum, 1966)
  • User: Men are all alike
  • ELIZA: IN WHAT WAY
  • User: They're always bugging us about something
    or other
  • ELIZA: CAN YOU THINK OF A SPECIFIC EXAMPLE?
  • User: Well, my boyfriend made me come here
  • ELIZA: YOUR BOYFRIEND MADE YOU COME HERE
  • User: He says I'm depressed much of the time
  • ELIZA: I AM SORRY TO HEAR THAT YOU ARE DEPRESSED

21
Eliza-style regular expressions
Step 1: replace first person with second person
references
s/\bI('m| am)\b/YOU ARE/g
s/\bmy\b/YOUR/g
s/\bmine\b/YOURS/g
Step 2: use substitutions that look for relevant
patterns in the input and create an appropriate
output (reply)
  • s/.* YOU ARE (depressed|sad) .*/I AM SORRY TO HEAR YOU ARE \1/
  • s/.* YOU ARE (depressed|sad) .*/WHY DO YOU THINK YOU ARE \1/
  • s/.* all .*/IN WHAT WAY/
  • s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE/

Step 3: use scores to rank possible
transformations
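
The first two steps can be sketched in a few lines of Python. This is not Weizenbaum's program: it assumes Python's `re` module, replaces the scoring of Step 3 with a simple "first rule that matches wins" order, upper-cases the reply, and adds a hypothetical fallback reply `PLEASE GO ON`. The names `FIRST_TO_SECOND`, `REPLY_RULES`, and `reply` are made up for the example.

```python
import re

# Step 1: first person -> second person.
FIRST_TO_SECOND = [
    (r"\bI('m| am)\b", "YOU ARE"),
    (r"\bmy\b", "YOUR"),
    (r"\bmine\b", "YOURS"),
]

# Step 2: pattern -> reply template; \1 refers back to the matched group.
REPLY_RULES = [
    (r".* YOU ARE (depressed|sad) .*", r"I AM SORRY TO HEAR YOU ARE \1"),
    (r".* all .*", "IN WHAT WAY"),
    (r".* always .*", "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
]

def reply(utterance):
    text = utterance
    for pattern, replacement in FIRST_TO_SECOND:
        text = re.sub(pattern, replacement, text)
    # Assumption: rule order stands in for the scores of Step 3.
    for pattern, template in REPLY_RULES:
        match = re.fullmatch(pattern, text)
        if match:
            return match.expand(template).upper()
    return "PLEASE GO ON"  # hypothetical fallback, not from the slides

print(reply("Men are all alike"))
print(reply("They're always bugging us about something or other"))
print(reply("He says I'm depressed much of the time"))
```

The three calls reproduce the first, second, and fourth ELIZA replies of the dialogue, up to punctuation and the word THAT. The third reply (YOUR BOYFRIEND MADE YOU COME HERE) would need an additional me/YOU substitution that is not among the rules listed.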
Slide from Dorr/Monz
22
Summary on REs so far
  • Regular expressions are perhaps the single most
    useful tool for text manipulation
  • Dumb but ubiquitous
  • Eliza: you can do a lot with simple
    regular-expression substitutions

23
Three Views
  • Three equivalent formal ways to look at what
    we're up to (thanks to Martin Kay)

Regular Expressions
Finite State Automata
Regular Languages
24
Finite State Automata
  • Terminology: Finite State Automata, Finite State
    Machines, FSA, Finite Automata
  • Regular expressions are one way of specifying the
    structure of finite-state automata.
  • FSAs and their close relatives are at the core of
    most algorithms for speech and language
    processing.

25
Finite-state Automata (Machines)
Slide from Dorr/Monz
26
Sheep FSA
  • We can say the following things about this
    machine
  • It has 5 states
  • At least b, a, and ! are in its alphabet
  • q0 is the start state
  • q4 is the final (accept) state
  • It has 5 transitions

27
More Formally Defining an FSA
  • You can specify an FSA by enumerating the
    following things.
  • a finite set of states Q
  • a finite alphabet of symbols Σ
  • the start state q0
  • The set of accepting/final states F such that
    F ⊆ Q
  • A transition function δ(q,i) that maps Q × Σ to Q
  • Given a state q ∈ Q and an input symbol i ∈ Σ,
  • δ(q,i) returns a new state q′ ∈ Q.
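
The five components can be written down directly as data. The sketch below uses Python and assumes the transitions of the standard sheep-talk machine for /baa+!/ (the figure itself is not reproduced in this transcript, so the arcs are an assumption consistent with "5 states, 5 transitions"). The names `Q`, `SIGMA`, `START`, `FINAL`, and `DELTA` are chosen for the example.

```python
# Q: finite set of states (q0..q4 written as integers).
Q = {0, 1, 2, 3, 4}

# Sigma: finite alphabet.
SIGMA = {"b", "a", "!"}

# Start state and set of accepting states, with F a subset of Q.
START = 0
FINAL = {4}

# delta: (state, symbol) -> next state. Assumed arcs for /baa+!/.
# Missing pairs have no transition, so this delta is a partial function.
DELTA = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,  # self-loop: any number of further a's
    (3, "!"): 4,
}

# Sanity checks on the definition itself.
well_formed = (
    START in Q
    and FINAL <= Q
    and all(q in Q and i in SIGMA and r in Q for (q, i), r in DELTA.items())
)
print(well_formed, len(DELTA))  # True 5
```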

28
Yet Another View
  • State-transition table

29
Recognition
  • Recognition is the process of determining if a
    string should be accepted by a machine
  • Or it's the process of determining if a string
    is in the language we're defining with the
    machine
  • Or it's the process of determining if a regular
    expression matches a string

30
Recognition
  • Traditionally (Turing's idea, 1936), this process
    is depicted with a tape.

http://www.cs.princeton.edu/introcs/75turing/
31
Recognition - Execution
  • Start in the start state
  • Examine the current input in the active cell
  • Consult the table: a finite table of instructions
    (a state transition diagram) that specifies
    exactly what action the machine takes at each
    step
  • Go to a new state and update the tape pointer.
  • Until you run out of tape.
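
The execution steps above can be sketched as a small table-driven loop. This is an illustrative Python version in the spirit of D-recognize, not a transcription of the original pseudocode; the function name `d_recognize`, the dictionary encoding of the table, and the sheep-talk transitions for /baa+!/ are assumptions of the example.

```python
# Assumed transition table for the sheep language /baa+!/.
SHEEP_TABLE = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,
    (3, "!"): 4,
}

def d_recognize(tape, table, start, finals):
    """Return True if the deterministic machine accepts the tape."""
    state = start                             # start in the start state
    for symbol in tape:                       # examine the current cell
        state = table.get((state, symbol))    # consult the table
        if state is None:                     # no entry: reject
            return False
    return state in finals                    # out of tape: accept only in a final state

for tape in ["baa!", "baaaa!", "ba!", "baa", "baa!a"]:
    print(tape, d_recognize(tape, SHEEP_TABLE, 0, {4}))
```

Changing the machine means passing a different table; the loop itself stays the same.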

32
Input Tape
ACCEPT
Slide from Dorr/Monz
33
Input Tape
REJECT
Slide from Dorr/Monz
34
Adding a failing state
[Figure: the sheep FSA with states q0 to q4 and arcs labeled b, a, and !, extended with a failing state]
Slide from Dorr/Monz
35
Tracing D-Recognize
36
Key Points
  • Deterministic means that at each point in
    processing there is always one unique thing to do
    (no choices).
  • D-recognize is a simple table-driven interpreter
  • The algorithm is universal for all unambiguous
    regular languages.
  • To change the machine, you change the table.

37
Key Points
  • Deterministic Pattern
  • Example: Consider a set of traffic lights. The
    sequence of lights is red - red/amber - green -
    amber - red. The sequence can be pictured as a
    state machine, where the different states of the
    traffic lights follow each other.
  • Each state is dependent solely on the previous
    state, so if the lights are green, an amber light
    will always follow - that is, the system is
    deterministic. Deterministic systems are
    relatively easy to understand and analyse, once
    the transitions are fully known.

38
Key Points
  • Crudely, therefore, matching strings with regular
    expressions (à la Perl) is a matter of
  • translating the expression into a machine (table)
    and
  • passing the table to an interpreter

39
Recognition as Search
  • You can view this algorithm as state-space
    search.
  • States are pairings of tape positions and state
    numbers.
  • Operators are compiled into the table
  • Goal state is a pairing with the end of tape
    position and a final accept state

40
Generative Formalisms
  • A formal language is a set of strings; a model m
    which can both generate and recognize all and only
    the strings of a formal language defines it
  • each string is composed of symbols from a finite
    set of symbols (alphabet)
  • L(m): the formal language L characterized by
    the model m
  • Finite-state automata define formal languages
    (without having to enumerate all the strings in
    the language)
  • The term Generative is based on the view that you
    can run the machine as a generator to get strings
    from the language.

41
Generative Formalisms
  • FSAs can be viewed from two perspectives
  • Acceptors that can tell you if a string is in the
    language (recognition)
  • Generators to produce all and only the strings in
    the language (production)
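
The generator view can be illustrated by walking the same kind of table breadth-first and collecting every string that ends in a final state. This is a sketch under assumptions: Python, the sheep-talk table for /baa+!/, and a length bound `max_len` (the language is infinite, so some cutoff is needed). The name `generate` is chosen for the example.

```python
from collections import deque

# Assumed transition table for the sheep language /baa+!/.
SHEEP_TABLE = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,
    (3, "!"): 4,
}

def generate(table, start, finals, max_len):
    """List every accepted string of length <= max_len, shortest first."""
    results = []
    queue = deque([(start, "")])
    while queue:
        state, prefix = queue.popleft()
        if state in finals:
            results.append(prefix)
        if len(prefix) == max_len:
            continue  # do not extend past the length bound
        for (source, symbol), target in table.items():
            if source == state:
                queue.append((target, prefix + symbol))
    return results

print(generate(SHEEP_TABLE, 0, {4}, 6))  # ['baa!', 'baaa!', 'baaaa!']
```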

42
Summary
  • Regular expressions are just a compact textual
    representation of FSAs
  • Recognition is the process of determining if a
    string/input is in the language defined by some
    machine.
  • Recognition is straightforward with deterministic
    machines.