Regular Expression and Finite State Machine

Provided by: barbara86
1
Regular Expression and Finite State Machine
  • Based on Slides by Jim Martin

2
Regular Expressions and Text Searching
  • Everybody does it
  • Emacs, vi, perl, grep, sed, awk, etc.
  • REs
  • Character sequence
  • Kleene star
  • Character set, complement set
  • Anchors
  • Disjunction
  • Grouping

3
Some Examples
Courtesy of Kathy McCoy
4
RE          Description                        Uses?
/a*/        Zero or more a's                   Optional doubled modifiers (words)
/a+/        One or more a's                    Non-optional...
/a?/        Zero or one a's                    Optional...
/cat|dog/   cat or dog                         Words modifying pets
/^cat\.$/   A line that contains only "cat."   ??
            (^ anchors beginning, $ anchors end of line)
/\bun\B/    Beginnings of longer strings       Words prefixed by un
Courtesy of Kathy McCoy
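The operators in this table can be tried out directly. The following is a minimal sketch using Python's `re` module (Python is an assumption here; the slides use Perl-style notation), with the helper names `full_match` and `found_in` invented for illustration.

```python
import re

def full_match(pattern: str, text: str) -> bool:
    """True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

def found_in(pattern: str, text: str) -> bool:
    """True if pattern matches somewhere inside text."""
    return re.search(pattern, text) is not None

# Kleene star, plus, and optionality
print(full_match(r"a*", ""))         # zero a's is allowed
print(full_match(r"a+", ""))         # at least one a is required
print(full_match(r"a?", "aa"))       # at most one a is allowed

# Disjunction and anchors
print(full_match(r"cat|dog", "dog"))
print(found_in(r"^cat\.$", "cat."))      # the line is exactly "cat."
print(found_in(r"^cat\.$", "my cat."))   # extra text before "cat."

# Word boundary \b and non-boundary \B
print(found_in(r"\bun\B", "unhappy"))    # "un" starts a longer word
print(found_in(r"\bun\B", "un"))         # "un" alone: \B fails at the end
```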
5
E.G.                                        RE
Morphological variants of puppy             /pupp(y|ies)/
happier and happier, fuzzier and fuzzier    /(.+)ier and \1ier/
Courtesy of Kathy McCoy
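Grouping with a backreference is what makes the "Xier and Xier" pattern work: `\1` must repeat whatever the first group captured. A small sketch in Python's `re` module (an assumed choice of tool; the function name `repeated_comparative` is made up for this example):

```python
import re

PUPPY = re.compile(r"pupp(y|ies)")
REPEAT = re.compile(r"(.+)ier and \1ier")

def repeated_comparative(text: str):
    """Return the repeated stem in 'Xier and Xier', or None if absent."""
    m = REPEAT.search(text)
    return m.group(1) if m else None

print(bool(PUPPY.fullmatch("puppy")), bool(PUPPY.fullmatch("puppies")))
print(repeated_comparative("happier and happier"))
print(repeated_comparative("fuzzier and fuzzier"))
print(repeated_comparative("happier and fuzzier"))  # stems differ, so no match
```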
6
Optionality and Repetition
  • /[Ww]oodchucks?/ matches woodchucks, Woodchucks,
    woodchuck, Woodchuck
  • /colou?r/ matches color or colour
  • /he{3}/ matches heee
  • /(he){3}/ matches hehehe
  • /(he){3,}/ matches a sequence of at least 3 he's

Courtesy of Kathy McCoy
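A quick way to check claims like these is to run each pattern against the strings it is supposed to match. The sketch below assumes Python's `re` module; `CASES` and `all_match` are names invented for this example.

```python
import re

# Pattern -> strings that should match it in full.
CASES = {
    r"[Ww]oodchucks?": ["woodchucks", "Woodchucks", "woodchuck", "Woodchuck"],
    r"colou?r": ["color", "colour"],
    r"he{3}": ["heee"],              # the counter applies to the e only
    r"(he){3}": ["hehehe"],          # the counter applies to the group
    r"(he){3,}": ["hehehe", "hehehehe"],
}

def all_match(pattern: str, strings) -> bool:
    """True if every string matches the pattern from start to end."""
    return all(re.fullmatch(pattern, s) is not None for s in strings)

for pattern, strings in CASES.items():
    print(pattern, all_match(pattern, strings))
```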
7
Operator Precedence Hierarchy
  • 1. Parentheses ()
  • 2. Counters * + ? {}
  • 3. Sequences and anchors the ^my end$
  • 4. Disjunction |
  • Examples
  • /moo+/
  • /try|ies/
  • /and|or/

Courtesy of Kathy McCoy
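The precedence order explains why a counter attaches only to the character before it and why disjunction splits a whole sequence unless parentheses say otherwise. A short sketch in Python's `re` module (an assumed tool; `matches` is a helper name made up here):

```python
import re

def matches(pattern: str, text: str) -> bool:
    """True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

# Counters bind tighter than sequence: + applies to the last o only.
print(matches(r"moo+", "mooo"))
print(matches(r"moo+", "moomoo"))
print(matches(r"(moo)+", "moomoo"))   # parentheses change the scope

# Sequence binds tighter than disjunction: "try" OR "ies".
print(matches(r"try|ies", "ies"))
print(matches(r"try|ies", "tries"))
print(matches(r"tr(y|ies)", "tries"))  # grouping limits the disjunction
```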
8
A Simple Exercise
  • Write a regular expression to find all instances
    of the determiner the
  • The recent attempt by the police to retain their
    current rates of pay has not gathered much favor
    with the southern factions.

Courtesy of Kathy McCoy
9
A Simple Exercise
  • Write a regular expression to find all instances
    of the determiner the
  • /the/
  • The recent attempt by the police to retain their
    current rates of pay has not gathered much favor
    with the southern factions.

Courtesy of Kathy McCoy
10
A Simple Exercise
  • Write a regular expression to find all instances
    of the determiner the
  • /[Tt]he/
  • The recent attempt by the police to retain their
    current rates of pay has not gathered much favor
    with the southern factions.

Courtesy of Kathy McCoy
11
A Simple Exercise
  • Write a regular expression to find all instances
    of the determiner the
  • /\b[Tt]he\b/
  • The recent attempt by the police to retain their
    current rates of pay has not gathered much favor
    with the southern factions.

Courtesy of Kathy McCoy
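The three attempts at this exercise can be compared by counting what each one finds in the example sentence. A minimal sketch, assuming Python's `re` module; `count_matches` is a name chosen for illustration.

```python
import re

SENTENCE = (
    "The recent attempt by the police to retain their current rates of pay "
    "has not gathered much favor with the southern factions."
)

def count_matches(pattern: str) -> int:
    """Number of non-overlapping matches of pattern in the sentence."""
    return len(re.findall(pattern, SENTENCE))

print(count_matches(r"the"))          # misses "The", hits their/gathered/southern
print(count_matches(r"[Tt]he"))       # finds "The" too, still over-matches
print(count_matches(r"\b[Tt]he\b"))   # whole words only
```

Only the last pattern finds exactly the three determiners in the sentence.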
12
The Two Kinds of Errors
  • The process we just went through was based on
    fixing errors in the regular expression
  • Errors where some of the instances were missed
    (judged to not be instances when they should have
    been): false negatives
  • Errors where strings were included as instances
    (when they should not have been): false positives
  • This is pretty much going to be the story of the
    rest of the course!

Courtesy of Kathy McCoy
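The two kinds of errors can be counted mechanically once there is a reference answer. The sketch below works word by word on the exercise sentence and treats every word equal to "the" (ignoring case) as a true instance; that gold standard, like the names `TOKENS` and `errors`, is an assumption made for this example.

```python
import re

SENTENCE = (
    "The recent attempt by the police to retain their current rates of pay "
    "has not gathered much favor with the southern factions."
)
TOKENS = re.findall(r"\w+", SENTENCE)

def errors(pattern: str):
    """Return (false_positives, false_negatives) as lists of words."""
    false_pos, false_neg = [], []
    for token in TOKENS:
        predicted = re.search(pattern, token) is not None
        gold = token.lower() == "the"   # assumed reference answer
        if predicted and not gold:
            false_pos.append(token)
        elif gold and not predicted:
            false_neg.append(token)
    return false_pos, false_neg

for pattern in (r"the", r"[Tt]he", r"\b[Tt]he\b"):
    print(pattern, errors(pattern))
```

Widening the pattern to `[Tt]he` removes the false negative, and adding `\b` removes the false positives.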
13
Finite State Automata as Graphs
  • Regular expressions can be viewed as a textual
    way of specifying the structure of finite-state
    automata.
  • Let's start with the sheep language from the text
  • /baa+!/

14
Sheep FSA
  • We can say the following things about this
    machine
  • It has 5 states
  • At least b, a, and ! are in its alphabet
  • q0 is the start state
  • q4 is an accept state
  • It has 5 transitions
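A machine like this can be written down as a transition table and run with a short loop. The sketch below is one way to do it in Python; the diagram itself is not reproduced in this transcript, so the exact transitions are an assumption chosen to agree with the five states, five transitions, start state q0 and accept state q4 listed above.

```python
# (state, symbol) -> next state; states are numbered 0..4 for q0..q4.
TRANSITIONS = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,   # self-loop: any number of further a's
    (3, "!"): 4,
}
START = 0
ACCEPT = {4}

def accepts(string: str) -> bool:
    """Run the automaton over the string; reject on a missing transition."""
    state = START
    for symbol in string:
        state = TRANSITIONS.get((state, symbol))
        if state is None:
            return False
    return state in ACCEPT

for s in ("baa!", "baaaa!", "ba!", "baa"):
    print(s, accepts(s))
```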

15
But note
  • There are other machines that correspond to this
    language
  • More on this one later

16
Morphology
  • Morphology is the study of the ways that words
    are built up from smaller meaningful units called
    morphemes
  • We can usefully divide morphemes into two classes
  • Stems: the core meaning-bearing units
  • Affixes: bits and pieces that adhere to stems to
    change their meanings and grammatical functions

17
Morphology
  • We can also divide morphology up into two broad
    classes
  • Inflectional
  • Derivational

18
Inflectional Morphology
  • Inflectional morphology concerns the combination
    of stems and affixes where the resulting word
  • Has the same word class as the original
  • Serves a grammatical/semantic purpose different
    from the original

19
Nouns and Verbs (English)
  • Nouns are simple (not really)
  • Markers for plural and possessive
  • Verbs are only slightly more complex
  • Markers appropriate to the tense of the verb

20
Regulars and Irregulars
  • Ok so it gets a little complicated by the fact
    that some words misbehave (refuse to follow the
    rules)
  • Mouse/mice, goose/geese, ox/oxen
  • Go/went, fly/flew
  • The terms regular and irregular will be used to
    refer to words that follow the rules and those
    that don't.

21
Regular and Irregular Verbs
  • Regulars
  • Walk, walks, walking, walked, walked
  • Irregulars
  • Eat, eats, eating, ate, eaten
  • Catch, catches, catching, caught, caught
  • Cut, cuts, cutting, cut, cut

22
Derivational Morphology
  • Derivational morphology is the messy stuff that
    no one ever taught you.
  • Quasi-systematicity
  • Irregular meaning change
  • Changes of word class

23
Derivational Examples
  • Verb/Adj to Noun

Suffix   Base          Derived
-ation   computerize   computerization
-ee      appoint       appointee
-er      kill          killer
-ness    fuzzy         fuzziness
24
Derivational Examples
  • Noun/Verb to Adj

Suffix   Base          Derived
-al      Computation   Computational
-able    Embrace       Embraceable
-less    Clue          Clueless
25
Compute
  • Many paths are possible
  • Start with compute
  • Computer -> computerize -> computerization
  • Computation -> computational
  • Computer -> computerize -> computerizable
  • Compute -> computee

26
Stemming vs Morphology
  • Sometimes you just need to know the stem of a
    word and you don't care about the structure.
  • In fact you may not even care if you get the
    right stem, as long as you get a consistent
    string.
  • This is stemming; it most often shows up in IR
    applications

27
Stemming in IR
  • Run a stemmer on the documents to be indexed
  • Run a stemmer on users' queries
  • Match
  • This is basically a form of hashing
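The index-then-match loop can be sketched in a few lines. In the Python example below, `toy_stem` is a deliberately crude stand-in for a real stemmer, and `build_index` and `search` are hypothetical names; the point is only that documents and queries pass through the same function, so the stem acts like a hash key.

```python
from collections import defaultdict

def toy_stem(word: str) -> str:
    """Strip one of a few suffixes; a stand-in, not the Porter stemmer."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def build_index(docs: dict) -> dict:
    """Map each stem to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[toy_stem(token)].add(doc_id)
    return index

def search(index: dict, query: str) -> set:
    """Documents containing the stem of every query term."""
    postings = [index.get(toy_stem(term), set()) for term in query.split()]
    return set.intersection(*postings) if postings else set()

DOCS = {1: "wedding bells", 2: "he wears a hat", 3: "wearing hats"}
INDEX = build_index(DOCS)
print(search(INDEX, "wear"))         # finds both "wears" and "wearing"
print(search(INDEX, "hats wearing"))
```

Whether `toy_stem` returns a real stem does not matter here, only that it returns the same string for the document word and the query word.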

28
Porter Stemmer
  • No lexicon needed
  • Basically a set of staged sets of rewrite rules
    that strip suffixes
  • Handles both inflectional and derivational
    suffixes
  • Doesn't guarantee that the resulting stem is
    really a stem (see first bullet)
  • Lack of guarantee doesn't matter for IR

29
Porter Stemmer Examples
  • wear wear
  • wearable wearabl
  • wearer wearer
  • wearied weari
  • wearier wearier
  • weariest weariest
  • wearily wearili
  • weariness weari
  • wearing wear
  • wearisome wearisom
  • wearisomely wearisom
  • wears wear
  • weather weather
  • weathercock weathercock
  • weathercocks weathercock
  • web web
  • Webb webb
  • Webber webber
  • webs web
  • Webster webster
  • Websterville webstervil
  • wedded wedd
  • wedding wedd
  • weddings wedd
  • wedge wedg
  • wedged wedg
  • wedges wedg
  • wedging wedg

30
  • static RuleList step1a_rules[] = {
      101, "sses", "ss",   3,  1,  0, NULL,
      102, "ies",  "i",    2,  0,  0, NULL,
      103, "ss",   "ss",   1,  1,  0, NULL,
      104, "s",    LAMBDA, 0, -1,  0, NULL,
      000, NULL,   NULL,   0,  0,  0, NULL
    };
  • static RuleList step1b_rules[] = {
      105, "eed",  "ee",   2,  1,  0, NULL,
      106, "ed",   LAMBDA, 1, -1, -1, ContainsVowel,
      107, "ing",  LAMBDA, 2, -1, -1, ContainsVowel,
      000, NULL,   NULL,   0,  0,  0, NULL
    };
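To see how a staged rule table strips suffixes, here is a minimal Python sketch of the first table only. It keeps just the old-suffix and new-suffix columns and ignores the numeric fields and the condition function, so it is a simplification for illustration and not a port of the C code; `STEP1A` and `step1a` are names invented here.

```python
# (old suffix, new suffix), tried in order; the first rule that fits wins.
STEP1A = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ss", "ss"),   # fires before the "s" rule, so "caress" is left alone
    ("s", ""),      # LAMBDA in the table is the empty string
]

def step1a(word: str) -> str:
    """Apply the first matching suffix rule, or return the word unchanged."""
    for old, new in STEP1A:
        if word.endswith(old):
            return word[: len(word) - len(old)] + new
    return word

for w in ("caresses", "ponies", "caress", "wears", "wedges"):
    print(w, step1a(w))
```

The order of the rules matters: because "ss" is tested before "s", a word ending in "ss" keeps both letters.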

31
  • static RuleList step1b1_rules[] = {
      108, "at",   "ate",  1,  2,  0, NULL,
      109, "bl",   "ble",  1,  2,  0, NULL,
      110, "iz",   "ize",  1,  2,  0, NULL,
      111, "bb",   "b",    1,  0,  0, NULL,
      112, "dd",   "d",    1,  0,  0, NULL,
      113, "ff",   "f",    1,  0,  0, NULL,
      114, "gg",   "g",    1,  0,  0, NULL,
      115, "mm",   "m",    1,  0,  0, NULL,
      116, "nn",   "n",    1,  0,  0, NULL,
      117, "pp",   "p",    1,  0,  0, NULL,
      118, "rr",   "r",    1,  0,  0, NULL,
      119, "tt",   "t",    1,  0,  0, NULL,
      120, "ww",   "w",    1,  0,  0, NULL,
      121, "xx",   "x",    1,  0,  0, NULL,
      122, LAMBDA, "e",   -1,  0,  0, AddAnE,
      000, NULL,   NULL,   0,  0,  0, NULL
    };

32
  • static RuleList step1c_rules[] = {
      123, "y",       "i",    0,  0, -1, ContainsVowel,
      000, NULL,      NULL,   0,  0,  0, NULL
    };
  • static RuleList step2_rules[] = {
      203, "ational", "ate",  6,  2,  0, NULL,
      204, "tional",  "tion", 5,  3,  0, NULL,
      205, "enci",    "ence", 3,  3,  0, NULL,
      206, "anci",    "ance", 3,  3,  0, NULL,
      207, "izer",    "ize",  3,  2,  0, NULL,
      208, "abli",    "able", 3,  3,  0, NULL,
      209, "alli",    "al",   3,  1,  0, NULL,
      210, "entli",   "ent",  4,  2,  0, NULL,
      211, "eli",     "e",    2,  0,  0, NULL,
      213, "ousli",   "ous",  4,  2,  0, NULL,
      /* ... */
    };

33
  • static RuleList step3_rules[] = {
      301, "icate", "ic",   4,  1,  0, NULL,
      302, "ative", LAMBDA, 4, -1,  0, NULL,
      303, "alize", "al",   4,  1,  0, NULL,
      304, "iciti", "ic",   4,  1,  0, NULL,
      305, "ical",  "ic",   3,  1,  0, NULL,
      308, "ful",   LAMBDA, 2, -1,  0, NULL,
      309, "ness",  LAMBDA, 3, -1,  0, NULL,
      000, NULL,    NULL,   0,  0,  0, NULL
    };

34
  • static RuleList step4_rules[] = {
      401, "al",    LAMBDA, 1, -1,  1, NULL,
      402, "ance",  LAMBDA, 3, -1,  1, NULL,
      403, "ence",  LAMBDA, 3, -1,  1, NULL,
      405, "er",    LAMBDA, 1, -1,  1, NULL,
      406, "ic",    LAMBDA, 1, -1,  1, NULL,
      407, "able",  LAMBDA, 3, -1,  1, NULL,
      408, "ible",  LAMBDA, 3, -1,  1, NULL,
      409, "ant",   LAMBDA, 2, -1,  1, NULL,
      410, "ement", LAMBDA, 4, -1,  1, NULL,
      411, "ment",  LAMBDA, 3, -1,  1, NULL,
      /* ... */
    };

35
Problems with Stemming
36
Soundex
  • You work as a telephone information operator.
    Someone calls looking for our senior theory
    professor
  • What do you type as your query string?

37
Soundex
  1. Keep the first letter
  2. Drop non-initial occurrences of vowels, h, w and
    y
  3. Replace the remaining letters with numbers
    according to group (e.g., b, f, p, and v -> 1)
  4. Replace strings of identical numbers with a
    single number (333 -> 3)
  5. Drop any numbers beyond a third one
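The five steps translate almost line for line into code. The Python sketch below follows the steps as listed here; the full letter-to-digit table is the conventional Soundex grouping, which the slide only hints at with its first group, and the function name `soundex_slide` is invented. Other descriptions of Soundex add details such as zero-padding short codes, which are left out because the steps above do not mention them.

```python
# Conventional Soundex letter groups (only the first is given on the slide).
GROUPS = (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
          ("l", "4"), ("mn", "5"), ("r", "6"))
CODES = {letter: digit for letters, digit in GROUPS for letter in letters}

def soundex_slide(name: str) -> str:
    """Encode a name using the five steps listed above."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    first, rest = letters[0], letters[1:]             # step 1
    kept = [c for c in rest if c not in "aeiouhwy"]   # step 2
    digits = [CODES[c] for c in kept if c in CODES]   # step 3
    collapsed = []
    for d in digits:                                  # step 4
        if not collapsed or collapsed[-1] != d:
            collapsed.append(d)
    return first.upper() + "".join(collapsed[:3])     # step 5

for name in ("Robert", "Rupert", "Jurafsky", "Martin"):
    print(name, soundex_slide(name))
```

Robert and Rupert come out with the same code, which is the point of the scheme: different spellings of similar-sounding names land in the same bucket.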

38
Soundex
  • Effect is to map (hash) all similar sounding
    transcriptions to the same code.
  • Structure your directory so that it can be
    accessed by code as well as by correct spelling
  • Used for census records, phone directories,
    author searches in libraries etc.