Title: Regular Expression and Finite State Machine
1Regular Expression and Finite State Machine
- Based on Slides by Jim Martin
2Regular Expressions and Text Searching
- Everybody does it
- Emacs, vi, Perl, grep, sed, awk, etc.
- REs
- Character sequence
- Kleene star
- Character set, complement set
- Anchors
- Disjunction
- Grouping
3Some Examples
Courtesy of Kathy McCoy
4
RE            Description                                     Uses?
/a*/          Zero or more a's                                Optional doubled modifiers (words)
/a+/          One or more a's                                 Non-optional...
/a?/          Zero or one a's                                 Optional...
/cat|dog/     'cat' or 'dog'                                  Words modifying pets
/^cat\.$/     A line that contains only "cat."                ??
              (^ anchors the beginning, $ the end of line)
/\bun\B/      Beginnings of longer strings                    Words prefixed by 'un'
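The operators in the table can be tried out with a short sketch in Python's `re` module. The helper `matches` and the sample strings are made up for illustration, and a leading `b` is added to the star, plus, and question-mark patterns so that no empty matches are returned.

```python
import re

def matches(pattern, text):
    """Return every non-overlapping match of pattern in text."""
    return [m.group(0) for m in re.finditer(pattern, text)]

sample = "b ba baaa"
star = matches(r"ba*", sample)   # b followed by zero or more a's
plus = matches(r"ba+", sample)   # b followed by one or more a's
opt = matches(r"ba?", sample)    # b followed by zero or one a

# Disjunction: either alternative may match.
pets = matches(r"cat|dog", "my cat chased a dog")

# Anchors: ^ and $ force the pattern to cover the whole line.
lines = ["cat.", "a cat.", "cat"]
only_cat = [line for line in lines if re.search(r"^cat\.$", line)]

# \b requires a word boundary before "un"; \B forbids one after it,
# so a bare "un" and the "un" inside "sun" are both rejected.
un_words = matches(r"\bun\B\w*", "undo the unit, not the sun or an un")
```

Here `star` keeps all three tokens, `plus` drops the bare `b`, and `opt` stops after the first `a` of `baaa`.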
5
E.G.                                        RE
Morphological variants of puppy             /pupp(y|ies)/
happier and happier, fuzzier and fuzzier    /(.+)ier and \1ier/
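A small sketch of these two patterns in Python, assuming the group in the second one is `(.+)` (one or more characters) and that `\1` refers back to whatever that group matched. The test strings are illustrative.

```python
import re

# Grouping plus disjunction: "pupp" followed by either "y" or "ies".
puppy = re.findall(r"pupp(?:y|ies)", "one puppy, two puppies")

# Backreference: \1 must repeat exactly what group 1 captured.
comparative = re.compile(r"(.+)ier and \1ier")
same_stem = comparative.search("happier and happier")
different_stem = comparative.search("happier and fuzzier")

captured = same_stem.group(1) if same_stem else None
```

The backreference is what rules out "happier and fuzzier": the second stem has to be a copy of the first, which a plain `(.+)ier and (.+)ier` would not enforce.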
6Optionality and Repetition
- /[Ww]oodchucks?/ matches woodchucks, Woodchucks, woodchuck, Woodchuck
- /colou?r/ matches color or colour
- /he{3}/ matches heee
- /(he){3}/ matches hehehe
- /(he){3,}/ matches a sequence of at least 3 he's
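The optionality and counter operators can be checked directly. This is a minimal sketch using `re.fullmatch`, which succeeds only if the entire string matches; the variable names are illustrative.

```python
import re

# Character class plus optional final "s".
woodchuck = re.compile(r"[Ww]oodchucks?")
forms = ["woodchucks", "Woodchucks", "woodchuck", "Woodchuck"]
all_forms_match = all(woodchuck.fullmatch(w) for w in forms)

# Optional "u".
colour = [w for w in ["color", "colour", "colouur"] if re.fullmatch(r"colou?r", w)]

# {3} applies to the preceding item only: a single "e" or the group "(he)".
three_e = re.fullmatch(r"he{3}", "heee") is not None
three_e_on_hehehe = re.fullmatch(r"he{3}", "hehehe") is not None
three_he = re.fullmatch(r"(he){3}", "hehehe") is not None

# {3,} means three or more repetitions.
candidates = ["hehe", "hehehe", "hehehehe"]
at_least_three = [s for s in candidates if re.fullmatch(r"(he){3,}", s)]
```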
7Operator Precedence Hierarchy
- 1. Parentheses: ()
- 2. Counters: * + ? {}
- 3. Sequences and anchors: the ^my end$
- 4. Disjunction: |
- Examples
- /moo+/
- /try|ies/
- /and|or/
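The effect of the precedence hierarchy can be seen by comparing each example with an explicitly parenthesized variant. This sketch assumes the examples read `/moo+/` and `/try|ies/`; the test strings are made up.

```python
import re

def accepted(pattern, strings):
    """Return the strings that the pattern matches in full."""
    return [s for s in strings if re.fullmatch(pattern, s)]

# Counters bind tighter than sequence: /moo+/ means mo(o+), not (moo)+.
moo_words = ["moo", "mooo", "moomoo"]
moo_default = accepted(r"moo+", moo_words)
moo_grouped = accepted(r"(moo)+", moo_words)

# Sequence binds tighter than disjunction: /try|ies/ means (try)|(ies).
try_words = ["try", "tries", "ies"]
try_default = accepted(r"try|ies", try_words)
try_grouped = accepted(r"tr(y|ies)", try_words)
```

Without parentheses, `/try|ies/` accepts the bare suffix `ies` and misses `tries`, which is why the puppy example groups the alternatives.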
8A Simple Exercise
- Write a regular expression to find all instances of the determiner "the"
- The recent attempt by the police to retain their current rates of pay has not gathered much favor with the southern factions.
9A Simple Exercise
- Write a regular expression to find all instances of the determiner "the"
- /the/
- The recent attempt by the police to retain their current rates of pay has not gathered much favor with the southern factions.
10A Simple Exercise
- Write a regular expression to find all instances of the determiner "the"
- /[tT]he/
- The recent attempt by the police to retain their current rates of pay has not gathered much favor with the southern factions.
11A Simple Exercise
- Write a regular expression to find all instances of the determiner "the"
- /\b[tT]he\b/
- The recent attempt by the police to retain their current rates of pay has not gathered much favor with the southern factions.
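The three attempts can be compared by counting what each one matches in the example sentence. A minimal sketch, assuming the patterns are `/the/`, `/[tT]he/`, and `/\b[tT]he\b/`:

```python
import re

TEXT = ("The recent attempt by the police to retain their current rates of pay "
        "has not gathered much favor with the southern factions.")

PATTERNS = [r"the", r"[tT]he", r"\b[tT]he\b"]

# Number of non-overlapping matches for each attempt.
counts = {pattern: len(re.findall(pattern, TEXT)) for pattern in PATTERNS}

# Only whole-word matches survive the final pattern.
final_matches = re.findall(r"\b[tT]he\b", TEXT)
```

The first pattern misses the capitalized "The" and also fires inside "their", "gathered", and "southern"; the character class fixes the former and the word boundaries fix the latter.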
12The Two Kinds of Errors
- The process we just went through was based on fixing errors in the regular expression
- Errors where some of the instances were missed (judged to not be instances when they should have been): false negatives
- Errors where instances were included (when they should not have been): false positives
- This is pretty much going to be the story of the rest of the course!
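Both kinds of error can be counted for each attempt at the determiner exercise. In this sketch the gold standard is an assumption: every space-separated token equal to "the" (ignoring case and trailing punctuation) counts as a true instance. The function names are illustrative.

```python
import re

TEXT = ("The recent attempt by the police to retain their current rates of pay "
        "has not gathered much favor with the southern factions.")

def gold_starts(text):
    """Character offsets of tokens that are exactly the word 'the'."""
    starts, pos = set(), 0
    for token in text.split(" "):
        if token.strip(".,").lower() == "the":
            starts.add(pos)
        pos += len(token) + 1  # +1 for the single space separator
    return starts

def errors(pattern, text):
    """Return (false_positives, false_negatives) for a pattern."""
    found = {m.start() for m in re.finditer(pattern, text)}
    gold = gold_starts(text)
    return len(found - gold), len(gold - found)

first = errors(r"the", TEXT)
second = errors(r"[tT]he", TEXT)
third = errors(r"\b[tT]he\b", TEXT)
```

Each revision trades one kind of error against the other: the character class removes the false negative, and the word boundaries remove the false positives.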
13Finite State Automata as Graphs
- Regular expressions can be viewed as a textual way of specifying the structure of finite-state automata.
- Let's start with the sheep language from the text
- /baa+!/
14Sheep FSA
- We can say the following things about this machine
- It has 5 states
- At least b, a, and ! are in its alphabet
- q0 is the start state
- q4 is an accept state
- It has 5 transitions
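The machine can be written down as a transition table and run over an input string. The diagram itself is not reproduced here, so the arcs below are an assumption: they are the five transitions that accept /baa+!/ with states q0 through q4, including a self-loop on `a` at q3.

```python
# (state, symbol) -> next state; any missing pair means "reject".
TRANSITIONS = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,  # self-loop: any number of extra a's
    (3, "!"): 4,
}
START = 0
ACCEPT = {4}

def recognize(tape):
    """Return True if the machine ends in an accept state after reading tape."""
    state = START
    for symbol in tape:
        state = TRANSITIONS.get((state, symbol))
        if state is None:  # no legal transition for this symbol
            return False
    return state in ACCEPT

results = {s: recognize(s) for s in ["baa!", "baaaa!", "ba!", "baa", "baa!a", ""]}
```

Keeping the table separate from the driver loop means the same `recognize` function works for any deterministic machine; only `TRANSITIONS`, `START`, and `ACCEPT` change.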
15But note
- There are other machines that correspond to this language
- More on this one later
16Morphology
- Morphology is the study of the ways that words are built up from smaller meaningful units called morphemes
- We can usefully divide morphemes into two classes
- Stems: the core meaning-bearing units
- Affixes: bits and pieces that adhere to stems to change their meanings and grammatical functions
17Morphology
- We can also divide morphology up into two broad classes
- Inflectional
- Derivational
18Inflectional Morphology
- Inflectional morphology concerns the combination of stems and affixes where the resulting word
- Has the same word class as the original
- Serves a grammatical/semantic purpose different from the original
19Nouns and Verbs (English)
- Nouns are simple (not really)
- Markers for plural and possessive
- Verbs are only slightly more complex
- Markers appropriate to the tense of the verb
20Regulars and Irregulars
- OK, so it gets a little complicated by the fact that some words misbehave (refuse to follow the rules)
- Mouse/mice, goose/geese, ox/oxen
- Go/went, fly/flew
- The terms regular and irregular will be used to refer to words that follow the rules and those that don't.
21Regular and Irregular Verbs
- Regulars
- Walk, walks, walking, walked, walked
- Irregulars
- Eat, eats, eating, ate, eaten
- Catch, catches, catching, caught, caught
- Cut, cuts, cutting, cut, cut
22Derivational Morphology
- Derivational morphology is the messy stuff that no one ever taught you.
- Quasi-systematicity
- Irregular meaning change
- Changes of word class
23Derivational Examples
-ation computerize computerization
-ee appoint appointee
-er kill killer
-ness fuzzy fuzziness
24Derivational Examples
-al Computation Computational
-able Embrace Embraceable
-less Clue Clueless
25Compute
- Many paths are possible
- Start with compute
- Computer -> computerize -> computerization
- Computation -> computational
- Computer -> computerize -> computerizable
- Compute -> computee
26Stemming vs Morphology
- Sometimes you just need to know the stem of a word and you don't care about the structure.
- In fact you may not even care if you get the right stem, as long as you get a consistent string.
- This is stemming; it most often shows up in IR applications
27Stemming in IR
- Run a stemmer on the documents to be indexed
- Run a stemmer on users' queries
- Match
- This is basically a form of hashing
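The index-then-match loop can be sketched as follows. `toy_stem` is a hypothetical stand-in for a real stemmer (it only strips a plural "s" and then "ing" or "ed"), and the documents are made up; the point is only that documents and queries pass through the same function before lookup.

```python
from collections import defaultdict

def toy_stem(word):
    """Hypothetical crude stemmer, not the Porter algorithm."""
    word = word.lower()
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        word = word[:-1]
    for suffix in ("ing", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def build_index(docs):
    """Map each stem to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[toy_stem(word)].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing the stem of every query word."""
    postings = [index.get(toy_stem(word), set()) for word in query.split()]
    return set.intersection(*postings) if postings else set()

docs = {1: "wedding planning tips", 2: "she wedded him", 3: "weather report"}
index = build_index(docs)
hits = search(index, "weddings")
```

The stem acts as the hash key: "wedding", "wedded", and "weddings" all land in the same bucket, and it does not matter that "wedd" is not a real word as long as the mapping is consistent.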
28Porter Stemmer
- No lexicon needed
- Basically a set of staged sets of rewrite rules that strip suffixes
- Handles both inflectional and derivational suffixes
- Doesn't guarantee that the resulting stem is really a stem (see first bullet)
- Lack of guarantee doesn't matter for IR
29Porter Stemmer Examples
- wear wear
- wearable wearabl
- wearer wearer
- wearied weari
- wearier wearier
- weariest weariest
- wearily wearili
- weariness weari
- wearing wear
- wearisome wearisom
- wearisomely wearisom
- wears wear
- weather weather
- weathercock weathercock
- weathercocks weathercock
- web web
- Webb webb
- Webber webber
- webs web
- Webster webster
- Websterville webstervil
- wedded wedd
- wedding wedd
- weddings wedd
- wedge wedg
- wedged wedg
- wedges wedg
- wedging wedg
30
/* Fields per rule: id, old suffix, new suffix, old offset, new offset,
   minimum root size, condition function */
static RuleList step1a_rules[] =
{
    101, "sses", "ss",    3,  1,  0, NULL,
    102, "ies",  "i",     2,  0,  0, NULL,
    103, "ss",   "ss",    1,  1,  0, NULL,
    104, "s",    LAMBDA,  0, -1,  0, NULL,
    000, NULL,   NULL,    0,  0,  0, NULL
};

static RuleList step1b_rules[] =
{
    105, "eed",  "ee",    2,  1,  0, NULL,
    106, "ed",   LAMBDA,  1, -1, -1, ContainsVowel,
    107, "ing",  LAMBDA,  2, -1, -1, ContainsVowel,
    000, NULL,   NULL,    0,  0,  0, NULL
};
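How such a rule list is applied can be sketched with just the suffix pairs of step 1a. This is a simplified illustration in Python, not the C implementation the tables come from: it ignores the numeric fields and the condition functions, and the name `apply_first_match` is made up. Within one step, the first rule whose suffix matches is the only one applied.

```python
# (old suffix, replacement) pairs in the order of the step 1a table.
STEP1A_RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ss", "ss"),  # protects words like "caress" from the bare "s" rule
    ("s", ""),
]

def apply_first_match(word, rules):
    """Rewrite the word with the first rule whose old suffix matches."""
    for old, new in rules:
        if word.endswith(old):
            return word[: len(word) - len(old)] + new
    return word

examples = ["caresses", "ponies", "caress", "cats", "weathercocks", "web"]
stemmed = {w: apply_first_match(w, STEP1A_RULES) for w in examples}
```

Rule order matters: because "ss" is tried before "s", "caress" is left alone instead of being cut to "cares".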
31
static RuleList step1b1_rules[] =
{
    108, "at",   "ate",  1, 2, 0, NULL,
    109, "bl",   "ble",  1, 2, 0, NULL,
    110, "iz",   "ize",  1, 2, 0, NULL,
    111, "bb",   "b",    1, 0, 0, NULL,
    112, "dd",   "d",    1, 0, 0, NULL,
    113, "ff",   "f",    1, 0, 0, NULL,
    114, "gg",   "g",    1, 0, 0, NULL,
    115, "mm",   "m",    1, 0, 0, NULL,
    116, "nn",   "n",    1, 0, 0, NULL,
    117, "pp",   "p",    1, 0, 0, NULL,
    118, "rr",   "r",    1, 0, 0, NULL,
    119, "tt",   "t",    1, 0, 0, NULL,
    120, "ww",   "w",    1, 0, 0, NULL,
    121, "xx",   "x",    1, 0, 0, NULL,
    122, LAMBDA, "e",   -1, 0, 0, AddAnE,
    000, NULL,   NULL,   0, 0, 0, NULL
};
32
static RuleList step1c_rules[] =
{
    123, "y",  "i",  0, 0, -1, ContainsVowel,
    000, NULL, NULL, 0, 0,  0, NULL
};

static RuleList step2_rules[] =
{
    203, "ational", "ate",  6, 2, 0, NULL,
    204, "tional",  "tion", 5, 3, 0, NULL,
    205, "enci",    "ence", 3, 3, 0, NULL,
    206, "anci",    "ance", 3, 3, 0, NULL,
    207, "izer",    "ize",  3, 2, 0, NULL,
    208, "abli",    "able", 3, 3, 0, NULL,
    209, "alli",    "al",   3, 1, 0, NULL,
    210, "entli",   "ent",  4, 2, 0, NULL,
    211, "eli",     "e",    2, 0, 0, NULL,
    213, "ousli",   "ous",  4, 2, 0, NULL,
    ...
33
static RuleList step3_rules[] =
{
    301, "icate", "ic",    4,  1, 0, NULL,
    302, "ative", LAMBDA,  4, -1, 0, NULL,
    303, "alize", "al",    4,  1, 0, NULL,
    304, "iciti", "ic",    4,  1, 0, NULL,
    305, "ical",  "ic",    3,  1, 0, NULL,
    308, "ful",   LAMBDA,  2, -1, 0, NULL,
    309, "ness",  LAMBDA,  3, -1, 0, NULL,
    000, NULL,    NULL,    0,  0, 0, NULL
};
34
static RuleList step4_rules[] =
{
    401, "al",    LAMBDA, 1, -1, 1, NULL,
    402, "ance",  LAMBDA, 3, -1, 1, NULL,
    403, "ence",  LAMBDA, 3, -1, 1, NULL,
    405, "er",    LAMBDA, 1, -1, 1, NULL,
    406, "ic",    LAMBDA, 1, -1, 1, NULL,
    407, "able",  LAMBDA, 3, -1, 1, NULL,
    408, "ible",  LAMBDA, 3, -1, 1, NULL,
    409, "ant",   LAMBDA, 2, -1, 1, NULL,
    410, "ement", LAMBDA, 4, -1, 1, NULL,
    411, "ment",  LAMBDA, 3, -1, 1, NULL,
    ...
35Problems with Stemming
36Soundex
- You work as a telephone information operator. Someone calls looking for our senior theory professor
- What do you type as your query string?
37Soundex
- Keep the first letter
- Drop non-initial occurrences of vowels, h, w and y
- Replace the remaining letters with numbers according to group (e.g., b, f, p, and v -> 1)
- Replace strings of identical numbers with a single number (333 -> 3)
- Drop any numbers beyond a third one
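The steps above can be sketched in Python, following the order in which they are listed. Only the group for 1 is given above; the other five groups are the standard Soundex letter groups and are filled in here as an assumption. This sketch does not pad short codes with zeros, and it is not the full census variant of the algorithm, which treats h, w, and the first letter's own code specially.

```python
# Standard Soundex letter groups (only the first is given above).
GROUPS = [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
          ("l", "4"), ("mn", "5"), ("r", "6")]
CODES = {letter: digit for letters, digit in GROUPS for letter in letters}

def soundex(name):
    """Simplified Soundex: first letter plus at most three digits."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    first, rest = letters[0], letters[1:]
    # Vowels, h, w, and y have no code, so they are dropped here.
    digits = [CODES[c] for c in rest if c in CODES]
    # Collapse runs of identical digits into one.
    collapsed = []
    for d in digits:
        if not collapsed or collapsed[-1] != d:
            collapsed.append(d)
    # Keep the first letter and at most three digits.
    return first.upper() + "".join(collapsed[:3])

codes = {n: soundex(n) for n in ["Jurafsky", "Jarovski", "Robert", "Rupert"]}
```

Different spellings of a similar-sounding name collapse to one key, so a directory indexed by code can be searched without knowing the exact spelling.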
38Soundex
- Effect is to map (hash) all similar-sounding transcriptions to the same code.
- Structure your directory so that it can be accessed by code as well as by correct spelling
- Used for census records, phone directories, author searches in libraries, etc.