Title: TP2663 Asas Pemprosesan Bahasa Tabii
1. TP2663 Asas Pemprosesan Bahasa Tabii (Fundamentals of Natural Language Processing)
- Regular Expressions (RE) and Finite State Automata (FSA)
2. Regular Expressions (RE)
3. Regular Expressions
- An RE is a language for specifying text search strings.
- REs are used for text search in UNIX tools (vi, Perl, Emacs, grep) and in MS Word (version 6 and above).
- Many RE features are used in various search engines.
- REs are also a theoretical tool in computer science and linguistics.
4. Regular Expressions
- An RE is a formula in a special language that is used to specify simple classes of strings.
- A string is a sequence of symbols. For most text-based search techniques, a string is a sequence of alphanumeric characters (letters, numbers, spaces, tabs and punctuation).
- In the examples that follow, a search returns the complete line that contains the match.
5. Basic Regular Expression Patterns
- The simplest kind of RE is a plain sequence of characters.
- For example, to search for woodchuck, type /woodchuck/.
- The search might return the line "I'm called a little woodchuck".
- Other examples: [example table not preserved]
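6. Basic Regular Expression Patterns
- REs are case-sensitive.
- E.g. /woodchucks/ does not match Woodchucks.
- Use the square brackets [ and ] to specify a disjunction of characters, e.g. /[wW]oodchuck/ matches woodchuck or Woodchuck.
- Other examples: [example table not preserved]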
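The three points so far (literal sequences, case sensitivity, bracket disjunction) can be tried out with Python's standard `re` module. This is only a sketch: the slides write patterns between slashes, whereas Python takes them as strings, and the sample sentences other than the woodchuck line are made up for illustration.

```python
import re

line = "I'm called a little woodchuck"

# A plain character sequence matches itself anywhere in the line.
literal_hit = re.search(r"woodchuck", line) is not None

# Matching is case-sensitive: the lowercase pattern does not find "Woodchucks".
case_miss = re.search(r"woodchucks", "Woodchucks are shy") is None

# Square brackets give a disjunction over single characters.
either_case = [bool(re.search(r"[wW]oodchuck", s)) for s in ("woodchuck", "Woodchuck")]
```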
7. Basic Regular Expression Patterns
- A range can be placed on the characters to match by using the dash (-).
- Brackets [ ] together with the dash (-) specify any one character in a range, e.g. /[0-9]/ matches any single digit.
- Other examples: [example table not preserved]
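A short sketch of bracketed ranges using Python's `re` module. The sample text is an arbitrary illustration and does not come from the slides.

```python
import re

text = "Chapter 1: Down the Rabbit Hole"

upper = re.findall(r"[A-Z]", text)               # every single uppercase letter
digits = re.findall(r"[0-9]", text)              # every single digit
first_lower = re.search(r"[a-z]", text).group()  # first lowercase letter found
```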
8. Basic Regular Expression Patterns
- The brackets [ and ] can be combined with the caret ^ to specify a character that must not appear in the match.
- For this purpose the ^ must come immediately after the [.
- If the caret ^ is the first symbol after the open bracket [, the resulting pattern is negated.
- E.g. the pattern /[^a]/ matches any single character except a.
- Other examples: [example table not preserved]
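The negation rule can be checked with a small sketch in Python's `re` module. The test strings are illustrative only. The second pattern shows that a caret that is not the first symbol inside the brackets is just an ordinary character.

```python
import re

# Caret first inside the brackets: match any single character except "a".
not_a = re.findall(r"[^a]", "banana")

# Caret elsewhere inside the brackets: it is a literal "^".
literal_caret = re.findall(r"[a^]", "a^b")
```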
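9. Basic Regular Expression Patterns
- The question mark /?/ marks the preceding character as optional: it may be present or absent.
- /?/ means "the preceding character or nothing", e.g. /woodchucks?/ matches woodchuck or woodchucks.
- Other examples: [example table not preserved]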
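A minimal sketch of the optional-character operator in Python's `re` module. `re.fullmatch` is used so that the whole word has to match, and the word list is made up for illustration.

```python
import re

words = ["woodchuck", "woodchucks", "color", "colour"]

# The "s" may be present or absent.
plural_optional = [w for w in words if re.fullmatch(r"woodchucks?", w)]

# The "u" may be present or absent.
u_optional = [w for w in words if re.fullmatch(r"colou?r", w)]
```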
10. Basic Regular Expression Patterns
- The Kleene star (*) matches the preceding character whether or not it is present, and allows it to be repeated consecutively.
- The Kleene star (*) means zero or more occurrences of the immediately previous character or regular expression.
- /a*/ means any string of zero or more a's.
- Other examples: [example table not preserved]
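The "zero or more" reading of the Kleene star can be seen in a short sketch with Python's `re` module. The candidate strings are illustrative, and the second pattern is one possible way to write sheep talk using only the star.

```python
import re

candidates = ["", "a", "aaaa", "ab"]

# a* accepts the empty string as well as any run of a's.
zero_or_more_a = [s for s in candidates if re.fullmatch(r"a*", s)]

# "b", two obligatory a's, then zero or more further a's, then "!".
sheep = [s for s in ["ba!", "baa!", "baaaa!"] if re.fullmatch(r"baaa*!", s)]
```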
11. Basic Regular Expression Patterns
- The Kleene plus (+) matches the preceding character when it occurs at least once.
- The Kleene plus (+) means one or more of the previous character.
- It is a shorter way to specify "at least one" of some character, e.g. /[0-9]+/ is equivalent to /[0-9][0-9]*/.
- Examples: [example table not preserved]
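A sketch of the Kleene plus in Python's `re` module. It also checks, on a handful of made-up strings, that `[0-9]+` and `[0-9][0-9]*` accept the same inputs. That is a spot check on examples, not a proof of equivalence.

```python
import re

# One or more digits in a row.
one_or_more = re.findall(r"[0-9]+", "room 42, floor 7, block B")

# The plus form and the spelled-out star form agree on every sample string.
samples = ["", "7", "42", "4a"]
same_language = all(
    bool(re.fullmatch(r"[0-9]+", s)) == bool(re.fullmatch(r"[0-9][0-9]*", s))
    for s in samples
)
```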
12. Basic Regular Expression Patterns
- The period /./ matches any single character (except a carriage return). It is a wildcard.
- The wildcard is often used with the Kleene star to mean "any string of characters", as in /.*/.
- Examples: [example table not preserved]
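A sketch of the wildcard in Python's `re` module, with illustrative strings. One detail is specific to Python rather than taken from the slides: by default `.` matches anything except a newline (`\n`).

```python
import re

# One arbitrary character between "beg" and "n".
three = [w for w in ["begin", "began", "begun", "bgn"] if re.fullmatch(r"beg.n", w)]

# Wildcard plus Kleene star: any string of characters between two aardvarks.
between = re.search(r"aardvark.*aardvark", "an aardvark met another aardvark today").group()

# In Python the default wildcard does not cross a newline.
no_newline = re.search(r"a.b", "a\nb") is None
```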
13. Basic Regular Expression Patterns
- Anchors tie an RE to a particular location in a string.
- The most common anchors are the caret ^ and the dollar sign $.
- ^ matches the start of a line.
- E.g. /^The/ matches The only when it is at the start of a line.
- $ matches the end of a line.
- Examples: [example table not preserved]
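The two line anchors can be sketched in Python's `re` module as follows. The sentences are illustrative. The `re.MULTILINE` flag is a Python detail: without it, `^` and `$` anchor only at the start and end of the whole string.

```python
import re

at_start = bool(re.search(r"^The", "The dog barked"))     # "The" begins the line
not_at_start = bool(re.search(r"^The", "See The dog"))    # "The" is mid-line
at_end = bool(re.search(r"dog\.$", "I saw the dog."))     # "dog." ends the line
whole_line = bool(re.search(r"^The dog\.$", "The dog."))  # the line is exactly this

# With MULTILINE, ^ matches at the start of every line of the text.
per_line = re.findall(r"^The", "The cat\nThe dog", flags=re.MULTILINE)
```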
14. Basic Regular Expression Patterns
- The other anchors are \b and \B.
- \b matches a word boundary.
- \B matches a non-boundary (a position that is not a word boundary).
- Examples: [example table not preserved]
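A sketch of the word-boundary anchors in Python's `re` module, with illustrative sentences. It records the positions at which "the" is found as a whole word and the positions where it sits strictly inside a longer word.

```python
import re

text = "the other theology is there"

# "the" as a whole word: a boundary on both sides.
as_word = [m.start() for m in re.finditer(r"\bthe\b", text)]

# "the" strictly inside a word: a non-boundary on both sides (the one in "other").
inside = [m.start() for m in re.finditer(r"\Bthe\B", text)]

# "$" is not a word character, so there is a boundary between "$" and "9".
price = bool(re.search(r"\b99\b", "It costs $99"))
not_299 = re.search(r"\b99\b", "There are 299 bottles") is None
```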
15. Finite State Automata (FSA)
16. Three Views
- Three equivalent formal ways to look at what we're up to:
Regular Expressions
Finite State Automata
Regular Languages
17. Finite State Automata
- Terminology: Finite State Automata, Finite State Machines, FSA, Finite Automata.
- Regular expressions are one way to describe finite-state automata (FSAs).
- REs can be implemented with FSAs.
- FSAs can be described with REs.
- Both REs and FSAs can be used to describe regular languages.
18. Intuition: FSAs as Graphs
- Let's start with the sheep language from the text.
- /baa+!/
19. Sheep FSA
- We can say the following things about this machine:
- It has 5 states.
- At least b, a, and ! are in its alphabet.
- q0 is the start state.
- q4 is an accept (final) state.
- It has 5 transitions.
20. But note
- There are other machines that correspond to this language.
- More on this one later.
21. More Formally: Defining an FSA
- You can specify an FSA by enumerating the following things:
- The set of states Q
- A finite alphabet Σ
- A start state q0
- A set F of accepting/final states, F ⊆ Q
- A transition function δ(q, i) that maps Q × Σ to Q
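The five components can be written down directly as a data structure. The sketch below encodes the sheep machine this way. The class name `FSA`, its field names and the helper `is_well_formed` are hypothetical choices made for illustration. The five transitions are an assumption based on the sheep language /baa+!/ and the "5 states, 5 transitions" description, since the diagram itself is not reproduced here.

```python
from typing import NamedTuple


class FSA(NamedTuple):
    states: frozenset    # Q
    alphabet: frozenset  # Σ
    start: str           # q0
    finals: frozenset    # F, a subset of Q
    delta: dict          # δ as a dict: (state, symbol) -> state


# Assumed transitions for the sheep machine: b, a, a, a-loop, !
sheep = FSA(
    states=frozenset({"q0", "q1", "q2", "q3", "q4"}),
    alphabet=frozenset({"b", "a", "!"}),
    start="q0",
    finals=frozenset({"q4"}),
    delta={
        ("q0", "b"): "q1",
        ("q1", "a"): "q2",
        ("q2", "a"): "q3",
        ("q3", "a"): "q3",
        ("q3", "!"): "q4",
    },
)


def is_well_formed(m: FSA) -> bool:
    """Check q0 ∈ Q, F ⊆ Q, and that δ only mentions known states and symbols."""
    return (
        m.start in m.states
        and m.finals <= m.states
        and all(
            q in m.states and s in m.alphabet and r in m.states
            for (q, s), r in m.delta.items()
        )
    )
```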
22. Yet Another View
To read the first row: if we're in state 0 and we see the input b, we must go to state 1. If we're in state 0 and we see the input a or !, we fail.
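The state-transition table this slide refers to is not reproduced in the text, so the sketch below is an assumed reconstruction for the sheep language /baa+!/: one row per state, one column per input symbol, and `None` standing for "fail". The names `SYMBOLS`, `TABLE` and `next_state` are illustrative.

```python
SYMBOLS = ["b", "a", "!"]  # column order of the table
FAIL = None                # an empty cell: no legal move

# Row i holds the moves out of state i; columns follow SYMBOLS.
TABLE = [
    [1,    FAIL, FAIL],  # state 0
    [FAIL, 2,    FAIL],  # state 1
    [FAIL, 3,    FAIL],  # state 2
    [FAIL, 3,    4],     # state 3
    [FAIL, FAIL, FAIL],  # state 4 (accept state, no outgoing moves)
]


def next_state(state, symbol):
    """Look up one cell of the table; unknown symbols also fail."""
    if symbol not in SYMBOLS:
        return FAIL
    return TABLE[state][SYMBOLS.index(symbol)]
```

Reading the first row with this helper: `next_state(0, "b")` gives state 1, while `next_state(0, "a")` and `next_state(0, "!")` both fail.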
23. Exercise
- Produce the state-transition table for the following finite-state automaton. [figure not preserved]
24. Recognition
- Recognition is the process of determining if a string should be accepted by a machine.
- Or it's the process of determining if a string is in the language we're defining with the machine.
- Or it's the process of determining if a regular expression matches a string.
25. Recognition
- Traditionally (Turing's idea), this process is depicted with a tape.
26. Recognition
- Start in the start state.
- Examine the current input.
- Consult the table.
- Go to a new state and update the tape pointer.
- Repeat until you run out of tape.
If we are in an accepting state when we run out of input, the machine has successfully recognized the input as a string in the language.
27. D-Recognize: a deterministic algorithm
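The original slide showed the D-Recognize pseudocode as an image, which is not reproduced here. The sketch below is one way to write the recognition loop described on the previous slide (start state, examine input, consult table, advance, accept only at end of tape). The function name, its parameters and the dict encoding of the sheep table are illustrative choices, not the slide's own code.

```python
def d_recognize(tape, table, accept_states, start=0):
    """Return True if the deterministic machine accepts the whole tape."""
    state = start
    for symbol in tape:                       # the loop index is the tape pointer
        state = table.get((state, symbol))    # consult the table
        if state is None:                     # empty cell: reject immediately
            return False
    return state in accept_states             # out of tape: accept only in a final state


# Assumed table for the sheep language /baa+!/.
SHEEP_TABLE = {(0, "b"): 1, (1, "a"): 2, (2, "a"): 3, (3, "a"): 3, (3, "!"): 4}
```

For instance, `d_recognize("baaa!", SHEEP_TABLE, {4})` walks through states 0, 1, 2, 3, 3, 4 and accepts, while `"baa"` runs out of tape in state 3 and is rejected.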
28. Tracing D-Recognize
To read the first row: if we're in state 0 and we see the input b, we must go to state 1. If we're in state 0 and we see the input a or !, we fail.
29. Key Points
- Deterministic means that at each point in processing there is always one unique thing to do (no choices).
- D-Recognize is a simple table-driven interpreter.
- The algorithm is universal for all unambiguous regular languages.
- To change the machine, you change the table.
30. Key Points
- Crudely, therefore, matching strings with regular expressions (à la Perl) is a matter of:
- translating the expression into a machine (a table), and
- passing the table to an interpreter.
31. Generative Formalisms
- Formal languages are sets of strings composed of symbols from a finite set of symbols.
- Finite-state automata define formal languages (without having to enumerate all the strings in the language).
- The term "generative" is based on the view that you can run the machine as a generator to get strings from the language.
32. Generative Formalisms
- FSAs can be viewed from two perspectives:
- Acceptors that can tell you if a string is in the language
- Generators that produce all and only the strings in the language
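The generator view can be sketched by walking the same transition table breadth-first and collecting every string that ends in an accept state. Because the sheep language is infinite, the sketch stops at a length bound. The function name `generate`, the `max_len` parameter and the table encoding are illustrative assumptions.

```python
from collections import deque

# Assumed table for the sheep language /baa+!/.
SHEEP_TABLE = {(0, "b"): 1, (1, "a"): 2, (2, "a"): 3, (3, "a"): 3, (3, "!"): 4}


def generate(table, accept_states, start=0, max_len=6):
    """List every accepted string of length <= max_len, shortest first."""
    results = []
    queue = deque([(start, "")])  # pairs of (machine state, string built so far)
    while queue:
        state, prefix = queue.popleft()
        if state in accept_states:
            results.append(prefix)
        if len(prefix) == max_len:
            continue              # do not grow strings past the bound
        for (src, symbol), dst in table.items():
            if src == state:
                queue.append((dst, prefix + symbol))
    return results


sheep_strings = generate(SHEEP_TABLE, {4}, max_len=6)
```

The same table therefore serves both perspectives: an acceptor checks a given string against it, and a generator enumerates strings from it.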
33. Dollars and Cents
34. Summary
- Regular expressions are just a compact textual representation of FSAs.
- Recognition is the process of determining if a string/input is in the language defined by some machine.
- Recognition is straightforward with deterministic machines.
35. Three Views
- Three equivalent formal ways to look at what we're up to (thanks to Martin Kay):
Regular Expressions
Finite State Automata
Regular Languages
36. Non-Determinism
DFSA (deterministic finite-state automaton)
NFSA (non-deterministic finite-state automaton)
37. Non-Determinism (cont.)
- Yet another technique: epsilon (ε) transitions.
- Key point: these transitions do not examine or advance the tape during recognition.
- Read: if we are in state 3, we are allowed to move to state 2 without looking at the input or advancing our input pointer.
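An ε-transition can be stored in the same kind of table by using a special symbol that never appears on the tape. The sketch below assumes the machine has an ε-arc from state 3 back to state 2, as the reading above describes. The rest of the table, the empty-string marker `EPSILON` and the helper `epsilon_closure` are illustrative assumptions.

```python
EPSILON = ""  # marker for a move that consumes no input

# Each cell holds a set of target states, since an NFSA may have several.
NFSA = {
    (0, "b"): {1},
    (1, "a"): {2},
    (2, "a"): {3},
    (3, EPSILON): {2},  # from state 3 we may drop back to state 2 "for free"
    (3, "!"): {4},
}


def epsilon_closure(states, table):
    """All states reachable from `states` without reading any input."""
    closure = set(states)
    stack = list(states)
    while stack:
        q = stack.pop()
        for r in table.get((q, EPSILON), ()):
            if r not in closure:
                closure.add(r)
                stack.append(r)
    return closure
```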
38. Using an NFSA to accept strings
- Since there is more than one choice at some points, we might take the wrong choice.
- There are 3 standard solutions:
- Backup: put a marker when we come to a choice point. If we took the wrong choice, go back and try another path.
- Look-ahead: look ahead in the input to help us decide which path to take.
- Parallelism: look at every alternative path in parallel when we come to a choice point.
39. Non-Deterministic Recognition
- In an NFSA there exists at least one path through the machine for a string that is in the language defined by the machine.
- But not all paths through the machine for an accepted string lead to an accept state.
- No path through the machine leads to an accept state for a string that is not in the language.
40. Non-Deterministic Recognition
- So success in a non-deterministic recognition occurs when a path is found through the machine that ends in an accept state.
- Failure occurs when none of the possible paths leads to an accept state.
41. Example
42. Example
43. Example
44. Example
45. Example
46. Example
47. Example
48. Example
49. Example
50. Key Points
- States in the search space are pairings of tape positions and states in the machine (search-state vs machine-state).
- By keeping track of as yet unexplored states, a recognizer can systematically explore all the paths through the machine given an input.
51. ND-Recognize Code
52. ND-Recognize Code (cont.)
- Agenda: keeps track of all the currently unexplored choices generated during the course of processing.
- Current-search-state: the branch choice currently being explored.
- ACCEPT-STATE?: returns accept if the current search-state contains an accepting machine-state and a pointer to the end of the tape.
- GENERATE-NEW-STATES: creates search-states for any ε-transitions and any normal input-symbol transitions from the transition table.
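The ND-Recognize pseudocode itself was shown as an image and is not reproduced here. The sketch below follows the component descriptions above: an agenda of unexplored search-states, an accept test, and a function that generates new search-states. The machine (a non-deterministic sheep automaton with two a-arcs out of state 2) is an assumed example. The `seen` set is an addition not mentioned on the slide, included so that ε-cycles cannot cause an endless loop.

```python
EPSILON = ""  # marker for transitions that consume no input


def generate_new_states(state, pos, tape, table):
    """Search-states reachable in one step: ε-moves first, then input moves."""
    new = [(q, pos) for q in table.get((state, EPSILON), ())]
    if pos < len(tape):
        new += [(q, pos + 1) for q in table.get((state, tape[pos]), ())]
    return new


def nd_recognize(tape, table, accept_states, start=0):
    """Agenda-based search over (machine state, tape position) pairs."""
    agenda = [(start, 0)]
    seen = set(agenda)                 # assumption: guard against revisiting
    while agenda:
        state, pos = agenda.pop()      # current search-state (depth-first order)
        if pos == len(tape) and state in accept_states:
            return True                # accepting machine-state at end of tape
        for nxt in generate_new_states(state, pos, tape, table):
            if nxt not in seen:
                seen.add(nxt)
                agenda.append(nxt)
    return False                       # agenda exhausted: every path failed


# Assumed NFSA for sheep talk: in state 2, an "a" may loop or move on.
ND_SHEEP = {(0, "b"): {1}, (1, "a"): {2}, (2, "a"): {2, 3}, (3, "!"): {4}}
```

Using a list with `pop()` gives depth-first search. Taking items from the front of the agenda instead would give breadth-first search.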
53. Recognition as Search
- ND-RECOGNIZE accomplishes the task of recognizing strings in a regular language by providing a way to systematically explore all the possible paths through a machine.
- You can view this algorithm as state-space search.
- States are pairings of tape positions and state numbers given by the machine.
- Operators are compiled into the table.
- The goal state is a pairing of the end-of-tape position with a final accept state.
54. Infinite Search
- If you're not careful, such searches can go into an infinite loop.
- How?
55. Why Bother?
- Non-determinism doesn't get us more formal power and it causes headaches, so why bother?
- More natural solutions.
- Deterministic machines built by the conversion construction can be too big.
56. Equivalence
- Non-deterministic machines can be converted to deterministic ones with a fairly simple construction.
- That means that they have the same power: non-deterministic machines are not more powerful than deterministic ones.
- It also means that one way to do recognition with a non-deterministic machine is to turn it into a deterministic one.
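The slide does not name the construction. A standard one is the subset construction, in which each state of the deterministic machine is a set of states of the non-deterministic one. The sketch below applies it to an assumed non-deterministic sheep machine without ε-transitions. All function and variable names are illustrative.

```python
from collections import deque


def determinize(table, alphabet, start, accept_states):
    """Subset construction for an NFSA that has no ε-transitions."""
    start_set = frozenset({start})
    dfa_table, dfa_accept = {}, set()
    queue, seen = deque([start_set]), {start_set}
    while queue:
        current = queue.popleft()
        if current & accept_states:          # contains some NFSA accept state
            dfa_accept.add(current)
        for symbol in alphabet:
            target = frozenset(
                r for q in current for r in table.get((q, symbol), ())
            )
            if not target:
                continue                     # no move: leave the cell empty
            dfa_table[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return dfa_table, start_set, dfa_accept


def accepts(tape, dfa_table, start_set, dfa_accept):
    """Deterministic table-driven recognition over the constructed machine."""
    state = start_set
    for symbol in tape:
        state = dfa_table.get((state, symbol))
        if state is None:
            return False
    return state in dfa_accept


# Assumed NFSA for sheep talk: in state 2, an "a" may loop or move on.
ND_SHEEP = {(0, "b"): {1}, (1, "a"): {2}, (2, "a"): {2, 3}, (3, "!"): {4}}
DFA = determinize(ND_SHEEP, ["b", "a", "!"], 0, {4})
```

In this small case the deterministic machine has the same number of states as the original. In the worst case the construction can produce exponentially many, which is the size concern raised earlier.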
57. Regular Languages
- The class of languages characterizable by regular expressions.
- Given an alphabet Σ, the regular languages over Σ are defined as follows:
- The empty set ∅ is a regular language.
- ∀a ∈ Σ ∪ {ε}, {a} is a regular language.
- If L1 and L2 are regular languages, then so are:
- L1 · L2 = {xy | x ∈ L1, y ∈ L2}, the concatenation of L1 and L2
- L1 ∪ L2, the union of L1 and L2
- L1*, the Kleene closure of L1
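The three closure operations can be illustrated on small finite languages represented as Python sets of strings. This is only a sketch: a true Kleene closure is infinite, so the helper below stops after a fixed number of repetitions, and the `max_reps` parameter and all function names are illustrative.

```python
def concat(l1, l2):
    """L1 · L2 = {xy | x in L1, y in L2}."""
    return {x + y for x in l1 for y in l2}


def union(l1, l2):
    """L1 ∪ L2."""
    return l1 | l2


def kleene_closure(l1, max_reps):
    """Finite approximation of L1*: zero up to max_reps concatenations of L1."""
    result = {""}   # zero repetitions give the empty string
    layer = {""}
    for _ in range(max_reps):
        layer = concat(layer, l1)
        result |= layer
    return result
```

For example, with L1 = {"a", "b"} and L2 = {"c"}, concatenation gives {"ac", "bc"} and union gives {"a", "b", "c"}.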
58. Going from regexp to FSA
- Since all regular languages meet the above properties,
- and regular languages are the languages characterizable by regular expressions,
- all regular expression operators can be implemented by combinations of concatenation, union (disjunction), and closure.
- Counters (*, +) are repetition plus closure.
- Anchors are individual symbols.
- [ ], ( ) and . are kinds of disjunction.
59. Going from regexp to FSA
- So if we could just show how to turn closure/union/concatenation from regexps into FSAs, this would give an idea of how FSA compilation works.
- The actual proof that regular languages = FSAs has 2 parts:
- An FSA can be built for each regular language.
- A regular language can be built for each automaton.
- So I'll give the intuition of the first part:
- Take any regular expression and build an automaton.
- Intuition: induction.
- Base case: build an automaton for a single symbol (say a).
- Inductive step: show how to imitate the 3 regexp operations in automata.
60. Union
- Accept a string in either of two languages.
61. Concatenation
- Accept a string consisting of a string from language L1 followed by a string from language L2.
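The union and concatenation diagrams are not reproduced here, so the sketch below shows one common way to build them with ε-transitions: for concatenation, link every final state of the first machine to the start of the second; for union, add a new start state with ε-arcs to both old start states. The dict layout, the state renumbering via `_shift`, and all names are illustrative assumptions rather than the slides' own construction.

```python
EPSILON = ""  # marker for transitions that consume no input


def symbol_fsa(ch):
    """Base case: a two-state machine accepting exactly the string ch."""
    return {"start": 0, "finals": {1}, "edges": [(0, ch, 1)], "size": 2}


def _shift(m, offset):
    """Renumber a machine's states so two machines do not collide."""
    return (
        m["start"] + offset,
        {q + offset for q in m["finals"]},
        [(s + offset, c, d + offset) for s, c, d in m["edges"]],
    )


def concat(m1, m2):
    s2, f2, e2 = _shift(m2, m1["size"])
    edges = m1["edges"] + e2 + [(f, EPSILON, s2) for f in m1["finals"]]
    return {"start": m1["start"], "finals": f2, "edges": edges,
            "size": m1["size"] + m2["size"]}


def union(m1, m2):
    s1, f1, e1 = _shift(m1, 1)               # state 0 becomes the new start
    s2, f2, e2 = _shift(m2, 1 + m1["size"])
    edges = e1 + e2 + [(0, EPSILON, s1), (0, EPSILON, s2)]
    return {"start": 0, "finals": f1 | f2, "edges": edges,
            "size": 1 + m1["size"] + m2["size"]}


def accepts(m, tape):
    """Track all reachable states in parallel, following ε-arcs freely."""
    def closure(states):
        stack, seen = list(states), set(states)
        while stack:
            q = stack.pop()
            for s, c, d in m["edges"]:
                if s == q and c == EPSILON and d not in seen:
                    seen.add(d)
                    stack.append(d)
        return seen

    current = closure({m["start"]})
    for ch in tape:
        current = closure({d for s, c, d in m["edges"] if s in current and c == ch})
    return bool(current & m["finals"])


ab = concat(symbol_fsa("a"), symbol_fsa("b"))
a_or_b = union(symbol_fsa("a"), symbol_fsa("b"))
```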
62. FSAs and Computational Morphology
- An important use of FSAs is for morphology, the study of word parts.
63. English Morphology
- Morphology is the study of the ways that words are built up from smaller meaningful units called morphemes.
- We can usefully divide morphemes into two classes:
- Stems: the core meaning-bearing units.
- Affixes: bits and pieces that adhere to stems to change their meanings and grammatical functions.
64. Nouns and Verbs (English)
- Nouns are simple (not really):
- Markers for plural and possessive.
- Verbs are only slightly more complex:
- Markers appropriate to the tense of the verb.
65. Regulars and Irregulars
- OK, so it gets a little complicated by the fact that some words misbehave (refuse to follow the rules):
- Mouse/mice, goose/geese, ox/oxen
- Go/went, fly/flew
- The terms regular and irregular will be used to refer to words that follow the rules and those that don't.
66. Regular and Irregular Nouns and Verbs
- Regulars
- Walk, walks, walking, walked, walked
- Table, tables
- Irregulars
- Eat, eats, eating, ate, eaten
- Catch, catches, catching, caught, caught
- Cut, cuts, cutting, cut, cut
- Goose, geese
67. Compute
- Many paths are possible.
- Start with compute:
- Computer -> computerize -> computerization
- Computation -> computational
- Computer -> computerize -> computerizable
- Compute -> computee
68. Why care about morphology?
- Stemming in information retrieval:
- Might want to search for "aardvark" and find pages with both "aardvark" and "aardvarks".
- Morphology in machine translation:
- Need to know that the Spanish words quiero and quieres are both related to querer "want".
- Morphology in spell checking:
- Need to know that misclam and antiundoggingly are not words despite being made up of word parts.
69. Can't just list all words
- Turkish:
- Uygarlastiramadiklarimizdanmissinizcasina
- "(behaving) as if you are among those whom we could not civilize"
- Uygar "civilized" + las "become" + tir "cause" + ama "not able" + dik "past" + lar "plural" + imiz "p1pl" + dan "abl" + mis "past" + siniz "2pl" + casina "as if"
70. What we want
- Something to automatically do the following kinds of mappings:
- Cats -> cat +N +PL
- Cat -> cat +N +SG
- Cities -> city +N +PL
- Merging -> merge +V +Present-participle
- Caught -> catch +V +past-participle
71. FSAs and the Lexicon
- This will actually require a kind of FSA we won't be studying: the Finite State Transducer (FST).
- But we'll give a quick overview anyhow.
- First we'll capture the morphotactics:
- The rules governing the ordering of affixes in a language.
- Then we'll add in the actual words.
72. Simple Rules
73. Adding the Words
74. Derivational Rules
75. Parsing/Generation vs. Recognition
- Recognition is usually not quite what we need.
- Usually if we find some string in the language we need to find the structure in it (parsing).
- Or we have some structure and we want to produce a surface form (production/generation).
- Example:
- From "cats" to "cat +N +PL"
76. Finite State Transducers
- The simple story:
- Add another tape.
- Add extra symbols to the transitions.
- On one tape we read "cats", on the other we write "cat +N +PL".
77. Transitions
- c:c means read a c on one tape and write a c on the other.
- +N:ε means read a +N symbol on one tape and write nothing on the other.
- +PL:s means read +PL and write an s.
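The three transition types can be sketched as a tiny transducer that reads lexical symbols and writes surface characters. The state numbering, the table layout and the function name `to_surface` are illustrative assumptions. The `+SG:ε` arc is also an assumption, added so that the singular mapping from the earlier "What we want" list works as well.

```python
EPSILON = ""  # "write nothing"

# (state, symbol read) -> (next state, symbol written)
FST = {
    (0, "c"): (1, "c"),          # c:c
    (1, "a"): (2, "a"),          # a:a
    (2, "t"): (3, "t"),          # t:t
    (3, "+N"): (4, EPSILON),     # +N:ε
    (4, "+PL"): (5, "s"),        # +PL:s
    (4, "+SG"): (5, EPSILON),    # +SG:ε (assumed)
}
FINAL = {5}


def to_surface(lexical_symbols):
    """Read the lexical tape, write the surface tape; None if rejected."""
    state, written = 0, []
    for symbol in lexical_symbols:
        step = FST.get((state, symbol))
        if step is None:
            return None
        state, out = step
        written.append(out)
    return "".join(written) if state in FINAL else None
```

Running the same arcs with the two sides swapped would give the parsing direction, from "cats" to "cat +N +PL".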
78Lexical to Intermediate Level
79End of Topic 4