Title: Regular Expressions and Automata
Chapter 2 Regular Expressions and Automata
Korea Maritime and Ocean University NLP Jung Tae
LEE inverse90_at_nate.com
01 Regular Expression
02 Finite-State Automata
03 Regular Languages and FSAs
04 Summary
3NLP
01 RE
1. Regular Expression
- What is a regular expression?
- A formula in a special language that specifies simple classes of strings (a string is a sequence of symbols).
- An algebraic notation for characterizing a set of strings.
- A language for specifying text search strings.
- Regular expressions are therefore an important theoretical tool throughout computer science and linguistics.
- A regular expression search requires a pattern that we want to search for and a corpus of texts to search through.
- The simplest kind of regular expression is a sequence of simple characters.
- For example, to search for green, we type /green/.
- (Recall that we are assuming a search application that returns entire lines.)
- Basic Regular Expression Patterns
RE               Example Patterns Matched
/interested/     We are interested in NLP
/DOROTHY/        SURRENDER DOROTHY
/!/              I'm in danger now!
/Claire says,/   Dagmar, my gift plz, Claire says,
- Regular expressions are case sensitive: lower case is distinct from upper case.
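The line-returning search described above can be sketched with Python's standard `re` module. This is only an illustration: the helper name `matching_lines` and the tiny corpus are assumptions made for the example, not part of the original slides.

```python
import re

# A tiny corpus; each entry plays the role of one line of text.
lines = [
    "We are interested in NLP",
    "SURRENDER DOROTHY",
    "I'm in danger now!",
]

def matching_lines(pattern, corpus):
    """Return every line of corpus that contains a match for pattern."""
    return [line for line in corpus if re.search(pattern, line)]

interested = matching_lines(r"interested", lines)
upper = matching_lines(r"DOROTHY", lines)
lower = matching_lines(r"dorothy", lines)  # case sensitive, so nothing matches
```

Here `upper` contains the DOROTHY line while `lower` is empty, which is the case sensitivity the slide describes.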
- Basic Regular Expression Patterns
- The case-sensitivity problem is solved with the square brackets [ and ].
RE               Match           Example Patterns Matched
/[bB]lue/        Blue or blue    deep blue sea
/[abc]/          a, b, or c      algebra
/[1234567890]/   Any digit       plenty of 7 to 5
Use of the brackets [] to specify a disjunction of characters.
- The brackets can be used with the dash (-) to specify any one character in a range.
RE        Match                  Example Patterns Matched
/[A-Z]/   An upper case letter   we are INFINITY
/[a-z]/   A lower case letter    not enough to love
/[0-9]/   A single digit         chapter 2 RE
Use of the brackets [] plus the dash - to specify a range.
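A short sketch of bracket disjunction and ranges, again using Python's `re` module. The variable names and the first test sentence are made up for illustration.

```python
import re

# [bB] matches either b or B, so both spellings are found.
blue = re.findall(r"[bB]lue", "Blue whales in the deep blue sea")

# A range inside brackets matches any single character in that range.
first_upper = re.search(r"[A-Z]", "we are INFINITY").group()
first_digit = re.search(r"[0-9]", "chapter 2 RE").group()
first_abc = re.search(r"[abc]", "algebra").group()
```

`re.search` returns the first (leftmost) match, which is why only one character is reported for each range pattern.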
- Basic Regular Expression Patterns
- Brackets can also be used to specify what a single character cannot be, by use of the caret ^ as the first symbol after the open bracket.
RE         Match                      Example Patterns Matched
/[^A-Z]/   Not an upper case letter   Lee jung tae
/a^b/      The pattern a^b            look up a^b now
/[e^]/     Either e or ^              Kleene star
Use of the caret ^ for negation or just to mean ^.
- The question mark ? means the preceding character or nothing.
RE          Match             Example Patterns Matched
/means?/    mean or means     mean
/colou?r/   color or colour   colour
The question mark ? marks optionality of the previous expression.
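The caret and question-mark patterns can be tried out as follows. One caveat about this sketch: in Python's `re` module a caret outside brackets is always an anchor, so the literal pattern a^b has to be written with an escaped caret. The variable names and some test strings are assumptions for the example.

```python
import re

# A caret first inside brackets negates the set: first non-upper-case char.
not_upper = re.search(r"[^A-Z]", "Lee jung tae").group()

# A caret that is not first inside brackets is an ordinary character.
e_or_caret = re.findall(r"[e^]", "Kleene star ^")

# Outside brackets, Python treats ^ as an anchor, so escape it for a literal.
unescaped = re.search(r"a^b", "look up a^b now")
escaped = re.search(r"a\^b", "look up a^b now").group()

# The question mark makes the preceding character optional.
colours = re.findall(r"colou?r", "color and colour")
means = re.findall(r"means?", "mean means")
```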
- Basic Regular Expression Patterns
- Sometimes we need regular expressions that allow repetitions.
- Ex) ba! baa! baaa! baaaa! ba..a!
- These are based on the asterisk or *, commonly called the Kleene * (Kleene star).
- The Kleene star means zero or more occurrences of the immediately previous character or regular expression.
- Sometimes there is a shorter way to specify at least one of some character. This is the Kleene +, which means one or more of the previous character.
RE              Match                                               Example Patterns Matched
/[0-9]*/        String of digits or nothing                         123.45
/[0-9]+/        [0-9][0-9]*                                         .123
/beg.n/         Any char between beg and n                          begin, beg'n, begun
/^The dog\.$/   ^ matches start of line and $ matches end of line   The dog.
Use of the Kleene operators, the period, and the anchors.
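The difference between the Kleene star, the Kleene plus, the period, and the anchors shows up clearly when the patterns are run. A minimal sketch with Python's `re` module follows; the variable names and the "See The dog." test string are chosen only for this example.

```python
import re

# * allows zero occurrences, so it can match the empty string at position 0.
star_on_number = re.match(r"[0-9]*", "123.45").group()
star_on_dot = re.match(r"[0-9]*", ".123").group()

# + requires at least one digit, so the search skips the leading period.
plus_on_dot = re.search(r"[0-9]+", ".123").group()

# The period matches any single character between "beg" and "n".
beg_n = re.findall(r"beg.n", "begin, beg'n, begun")

# ^ and $ anchor the pattern to the start and end; \. is a literal period.
whole_line = re.search(r"^The dog\.$", "The dog.") is not None
inside_line = re.search(r"^The dog\.$", "See The dog.") is not None
```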
- Disjunction, Grouping, and Precedence
- We still cannot search for either of two strings, such as cat or dog. So we need a new operator, the disjunction operator, also called the pipe symbol |.
- To make the disjunction operator apply only to a specific pattern, we use the parenthesis operators ( and ).
- Ex) /guppy|ies/ matches only the strings guppy or ies. But we want guppy or guppies, so we write the pattern /gupp(y|ies)/.
Operator                Regular expression
Parenthesis             ()
Counters                * + ? {}
Sequences and anchors   the ^my end$
Disjunction             |
Operator precedence hierarchy (highest to lowest)
- Regular expressions always match the largest string they can. Patterns are greedy!
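The guppy example and the greedy behaviour can be checked with a few lines of Python. This sketch uses a non-capturing group `(?:...)`, a Python-specific choice made here so that `re.findall` returns the whole match rather than only the group; the test strings are invented for the example.

```python
import re

text = "guppy guppies"

# Without parentheses the pipe splits the whole pattern: "guppy" OR "ies".
ungrouped = re.findall(r"guppy|ies", text)

# Parentheses limit the disjunction to the suffix: "gupp" then "y" OR "ies".
grouped = re.findall(r"gupp(?:y|ies)", text)

# Greedy matching: [a-z]* takes the longest run of letters it can.
greedy = re.match(r"[a-z]*", "once upon a time").group()
```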
- There are more useful operators.
RE   Expansion      Match                         Examples
\d   [0-9]          Any digit                     Party of 5
\D   [^0-9]         Any non-digit                 Blue moon
\w   [a-zA-Z0-9_]   Any alphanumeric/underscore   Daiyu
\W   [^\w]          A non-alphanumeric            !!!!!
\s   [ \r\t\n\f]    Whitespace (space, tab)
\S   [^\s]          Non-whitespace                in Concord
Aliases for common sets of characters.
RE      Match
{n}     n occurrences of the previous char or expression
{n,m}   From n to m occurrences of the previous char or expression
Regular expression operators for counting.
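The aliases and counters behave as follows in Python's `re` module. Note one assumption of this sketch: for `str` patterns Python matches Unicode digits and word characters by default, so the ASCII expansions in the table hold exactly only with the `re.ASCII` flag, which is why it is passed here. The number string is invented for the example.

```python
import re

digits = re.findall(r"\d", "Party of 5", flags=re.ASCII)
first_non_digit = re.search(r"\D", "Blue moon", flags=re.ASCII).group()
words = re.findall(r"\S+", "in Concord")

# {n} means exactly n occurrences; {n,m} means from n to m occurrences.
exactly_three = re.fullmatch(r"\d{3}", "123", flags=re.ASCII) is not None
too_many = re.fullmatch(r"a{2,4}", "aaaaa") is not None

# \b is a word boundary, used so that only whole numbers are counted.
two_or_three = re.findall(r"\b\d{2,3}\b", "7 42 123 4567", flags=re.ASCII)
```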
- Regular Expression Substitution, Memory
- Ex) The Perl substitution operator s/regexp1/pattern/ allows a string characterized by a regular expression to be replaced by another string.
Example           RE                                          Replaced string
35 boxes          s/([0-9]+)/<\1>/                            <35> boxes
The Xer is Ying   s/The (.*)er is (.*)ing/The \1er will \2/   The Xer will Y
- To do this, we put parentheses ( and ) around the part of the pattern we want to reuse, and refer back to it with a number operator such as \1.
- The matched text is stored in a numbered memory called a register.
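The slides use Perl's s/// operator; a rough Python counterpart is `re.sub`, which supports the same \1 and \2 references to registers. The sentence "The bigger is growing" is an invented instance of the "The Xer is Ying" template.

```python
import re

# \1 in the replacement refers to the text captured by the first parentheses.
boxes = re.sub(r"([0-9]+)", r"<\1>", "35 boxes")

# Two registers: \1 holds X and \2 holds Y from "The Xer is Ying".
sentence = re.sub(
    r"The (.*)er is (.*)ing",
    r"The \1er will \2",
    "The bigger is growing",
)
```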
- http://www.codejs.co.kr/%EC%A0%95%EA%B7%9C%EC%8B%9D-regular-expression/
- This page contains information about meta-characters, written in Korean.
- http://gskinner.com/RegExr/
- This is a useful regular expression tool.
02 FSA
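2. Finite-State Automata
- Like regular expressions, finite-state automata (FSAs) are used to describe regular languages.
- They are a good theoretical foundation for much computational work.
- Any regular expression can be implemented as an FSA, except regular expressions that use the memory feature.
Three equivalent ways of describing regular languages: finite automata, regular expressions, and regular grammars.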
- Using an FSA to Recognize a Regular Language
- An automaton can model a regular expression.
- It recognizes a set of strings.
- Here is how the automaton for /baa+!/ looks (figure).
- State 0 is (generally) the start state.
- A final state, or accepting state, is represented by a double circle, like state 4.
- Using an FSA to Recognize a Regular Language
- The FSA can be used for recognizing (we also say accepting) strings in the following way.
- Using an FSA to Recognize a Regular Language
- We can also represent an automaton with a state-transition table.
- Formally, a finite automaton is defined by the following five parameters:
Q = q0, q1, ..., qN-1   A finite set of N states
Σ                       A finite input alphabet of symbols
q0                      The start state
F                       The set of final states, F ⊆ Q
δ(q,i)                  The transition function or transition matrix between states. Given a state q ∈ Q and an input symbol i ∈ Σ, δ(q,i) returns a new state q' ∈ Q. δ is thus a relation from Q × Σ to Q.
        Input
State   b   a   !
0       1   Ø   Ø
1       Ø   2   Ø
2       Ø   3   Ø
3       Ø   3   4
4:      Ø   Ø   Ø
State-transition table for the FSA. Ø marks a missing transition, and the colon marks state 4 as final.
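The state-transition table can be turned into a deterministic recognizer almost directly. The sketch below is one possible encoding, not the slides' own algorithm listing: the names `TRANSITIONS`, `START`, `FINAL`, and `d_recognize` are chosen for this example, and a missing dictionary entry plays the role of an empty table cell.

```python
# Transition table for the automaton with states 0 to 4.
# Only the defined transitions are stored; anything else is "no transition".
TRANSITIONS = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,  # self-loop on state 3
    (3, "!"): 4,
}
START = 0
FINAL = {4}

def d_recognize(tape):
    """Return True if the deterministic FSA accepts the input string."""
    state = START
    for symbol in tape:
        state = TRANSITIONS.get((state, symbol))
        if state is None:  # empty table cell: reject immediately
            return False
    return state in FINAL  # accept only if we end in a final state
```

For instance, `d_recognize("baaa!")` is accepted, while `d_recognize("ba!")` fails at state 2 because there is no transition on `!` from that state.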
- Formal language: a model that can both generate and recognize all and only the strings of a formal language acts as a definition of that formal language.
- A formal language is a set of strings.
- Each string is composed of symbols from a finite symbol set called an alphabet.
- The alphabet of the previous language is the set Σ = {a, b, !}.
- Given a model m (such as a particular FSA), we can use L(m) to mean the formal language characterized by m.
[Figure: the FSA m, with arcs labeled b, a, a, ! and a self-loop labeled a.]
L(m) = {baa!, baaa!, baaaa!, baaaaa!, baaaaaaa!, ...}
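- Consider a variation of the previous automaton, shown in the next figure.
[Figure: an FSA with arcs labeled b, a, a, ! and a self-loop labeled a on state 2.]
The self-loop is on state 2 instead of state 3.
- When we get to state 2, if we see an a, we don't know whether to remain in state 2 or go on to state 3.
- Automata with decision points like this are called non-deterministic FSAs (NFSAs or NFAs).
[Figure: an FSA with arcs labeled b, a, a, ! and an ε arc.]
Arcs that have no symbols on them are called ε-transitions. An automaton with ε-transitions is also an NFA.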
- Using an NFSA to Accept Strings
- We might follow the wrong arc and reject a string when we should have accepted it, because there is more than one choice at some point.
- There are three standard solutions to this problem:
- Backup: whenever we come to a choice point, we could put a marker to mark where we were in the input and what state the automaton was in. Then, if it turns out that we took the wrong choice, we could back up and try another path.
- Look-ahead: we could look ahead in the input to help us decide which path to take.
- Parallelism: whenever we come to a choice point, we could look at every alternative path in parallel.
- If the search yields a path ending in an accept state, ND-RECOGNIZE accepts the string.
- Otherwise, it rejects the string.
- Algorithms like this, which work by searching for solutions, are known as state-space search algorithms.
[Figure: steps 1 to 8 of a depth-first search on the input b a a a !. Depth-first search is implemented with a stack.]
[Figure: steps 1 to 6 of a breadth-first search on the input b a a a !. Breadth-first search is implemented with a queue.]
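The backup strategy with an agenda can be sketched as below. This is a simplified illustration in the spirit of ND-RECOGNIZE, not a transcription of it: the function name `nd_recognize`, the `breadth_first` flag, and the dictionary encoding of the NFSA (self-loop on state 2, no ε-transitions) are all assumptions of this example. Taking items from the end of the agenda gives a stack (depth-first); taking them from the front gives a queue (breadth-first).

```python
from collections import deque

# NFSA with the self-loop on state 2: on "a", state 2 may stay or move on.
NFSA = {
    (0, "b"): {1},
    (1, "a"): {2},
    (2, "a"): {2, 3},
    (3, "!"): {4},
}
FINAL = {4}

def nd_recognize(tape, breadth_first=False):
    """Search the space of (machine state, tape position) pairs."""
    agenda = deque([(0, 0)])  # start state 0 at the beginning of the tape
    while agenda:
        # Queue (FIFO) for breadth-first, stack (LIFO) for depth-first.
        state, index = agenda.popleft() if breadth_first else agenda.pop()
        if index == len(tape):
            if state in FINAL:
                return True  # some path ends in an accept state
            continue         # dead end: back up to another alternative
        for next_state in NFSA.get((state, tape[index]), ()):
            agenda.append((next_state, index + 1))
    return False  # agenda exhausted: no path accepts the string
```

Both search orders give the same answer on this small machine; they differ only in the order in which alternatives are explored and in how large the agenda grows.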
- Using an NFSA to Accept Strings
- Like depth-first search, breadth-first search has its pitfalls. As with depth-first search, if the state space is infinite, the search may never terminate.
- In addition, because the agenda grows with the search, breadth-first search may need an impractically large amount of memory if the state space is even moderately large.
- For larger problems, more complex search techniques such as dynamic programming or A* must be used.
- We will discuss these in another chapter.
Following van Santen and Sproat (1998), Kaplan and Kay (1994), and Lewis and Papadimitriou (1988).
- For any NFA, there is an exactly equivalent DFA.
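One standard way to obtain the equivalent DFA is the subset construction, in which each DFA state is a set of NFA states. The sketch below handles only NFAs without ε-transitions, which is an assumption of this example; the function names and the dictionary encoding are also chosen here for illustration.

```python
def nfa_to_dfa(nfa, start, finals, alphabet):
    """Subset construction for an NFA without ε-transitions."""
    start_set = frozenset([start])
    dfa = {}
    todo = [start_set]
    seen = {start_set}
    while todo:
        current = todo.pop()
        for symbol in alphabet:
            # All NFA states reachable from any member of the current set.
            target = frozenset(
                nxt for q in current for nxt in nfa.get((q, symbol), ())
            )
            if not target:
                continue  # no transition on this symbol
            dfa[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    # A DFA state is final if it contains at least one final NFA state.
    dfa_finals = {s for s in seen if s & finals}
    return dfa, start_set, dfa_finals

def dfa_accepts(dfa, start, finals, tape):
    """Run the resulting DFA on the input string."""
    state = start
    for symbol in tape:
        state = dfa.get((state, symbol))
        if state is None:
            return False
    return state in finals

# NFA with a self-loop on state 2 (non-deterministic on "a").
NFA = {(0, "b"): {1}, (1, "a"): {2}, (2, "a"): {2, 3}, (3, "!"): {4}}
DFA, DFA_START, DFA_FINALS = nfa_to_dfa(NFA, 0, {4}, "ab!")
```

In the resulting DFA the choice point disappears: reading a in state {2} leads to the single combined state {2, 3}, which loops on a and moves to {4} on !.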
03 RL
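3. Regular Languages and FSAs
- The regular languages are the class of languages definable by regular expressions.
- This is the same class of languages that finite-state automata can characterize.
- The class of regular languages over Σ is then formally defined as follows:
(a) Ø is a regular language.
(b) For every a in Σ ∪ {ε}, {a} is a regular language.
(c) If L1 and L2 are regular languages, then so are their concatenation L1·L2, their union L1 ∪ L2, and the Kleene closure L1*.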
- Regular languages are also closed under the following operations (in addition to the regular expression operations):
Intersection
Difference
Complementation
Reversal
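Closure under intersection can be made concrete with the product construction: run two DFAs in lockstep and accept only when both accept. The two example machines below (an even number of a's, and strings ending in b) and all names are invented for this sketch; they are not from the slides.

```python
def product(dfa1, start1, finals1, dfa2, start2, finals2, alphabet):
    """Build a DFA whose states are pairs of states of dfa1 and dfa2."""
    start = (start1, start2)
    trans, todo, seen = {}, [start], {start}
    while todo:
        p, q = todo.pop()
        for symbol in alphabet:
            if (p, symbol) in dfa1 and (q, symbol) in dfa2:
                target = (dfa1[(p, symbol)], dfa2[(q, symbol)])
                trans[((p, q), symbol)] = target
                if target not in seen:
                    seen.add(target)
                    todo.append(target)
    # A pair is final only if both components are final.
    finals = {(p, q) for (p, q) in seen if p in finals1 and q in finals2}
    return trans, start, finals

def accepts(trans, start, finals, tape):
    state = start
    for symbol in tape:
        state = trans.get((state, symbol))
        if state is None:
            return False
    return state in finals

# DFA 1: strings over {a, b} with an even number of a's (state 0 = even).
EVEN_A = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1}
# DFA 2: strings over {a, b} that end in b (state 1 = last symbol was b).
ENDS_B = {(0, "a"): 0, (0, "b"): 1, (1, "a"): 0, (1, "b"): 1}

BOTH, BOTH_START, BOTH_FINALS = product(EVEN_A, 0, {0}, ENDS_B, 0, {1}, "ab")
```

Because the product is itself a finite automaton, the intersection of the two regular languages is again regular.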
- Regular expressions are equivalent to FSAs.
- For the inductive step, we show that each of the primitive operations of a regular expression (concatenation, union, closure) can be imitated by an automaton.
- Start with the three base cases:
(a) r = ε
(b) r = Ø
(c) r = a
Automata for the base case (no operators) of the induction showing that any regular expression can be turned into an equivalent automaton.
- Regular expressions are equivalent to FSAs.
[Figures: automata for the inductive step, built from FSA1 and FSA2 and connected by ε-transitions.]
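The inductive step can be sketched in code: each operator takes existing automata and glues them together with ε-transitions. The version below follows the general style of Thompson's construction, which may differ in detail from the slide figures; the class `Fragment`, the helper names, and the use of `None` as the ε label are all assumptions of this example.

```python
import itertools

_ids = itertools.count()  # fresh state numbers
EPS = None                # label used for ε-transitions

class Fragment:
    """An automaton with one start state, one final state, and a list of arcs."""
    def __init__(self, start, final, arcs):
        self.start, self.final, self.arcs = start, final, arcs

def symbol(a):  # base case: r = a
    s, f = next(_ids), next(_ids)
    return Fragment(s, f, [(s, a, f)])

def concat(m1, m2):  # ε arc from the final state of m1 to the start of m2
    return Fragment(m1.start, m2.final,
                    m1.arcs + m2.arcs + [(m1.final, EPS, m2.start)])

def union(m1, m2):  # new start branches to both; both finals join a new final
    s, f = next(_ids), next(_ids)
    extra = [(s, EPS, m1.start), (s, EPS, m2.start),
             (m1.final, EPS, f), (m2.final, EPS, f)]
    return Fragment(s, f, m1.arcs + m2.arcs + extra)

def star(m):  # closure: allow skipping m entirely or looping back into it
    s, f = next(_ids), next(_ids)
    extra = [(s, EPS, m.start), (m.final, EPS, f),
             (s, EPS, f), (m.final, EPS, m.start)]
    return Fragment(s, f, m.arcs + extra)

def eps_closure(states, arcs):
    """All states reachable from the given states using only ε arcs."""
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for src, label, dst in arcs:
            if src == q and label is EPS and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen

def accepts(m, tape):
    """Simulate the automaton by tracking every state it could be in."""
    current = eps_closure({m.start}, m.arcs)
    for ch in tape:
        moved = {dst for src, label, dst in m.arcs
                 if src in current and label == ch}
        current = eps_closure(moved, m.arcs)
    return m.final in current

# b a a a* !  built only from the base case and the three operators.
sheep = concat(symbol("b"), concat(symbol("a"), concat(
    symbol("a"), concat(star(symbol("a")), symbol("!")))))
a_or_b = union(symbol("a"), symbol("b"))
```

Each call to `symbol` creates fresh states, so a fragment should be built once per use rather than shared between two positions in a larger expression.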
04 Summary
4. Summary
- Introduced the most important fundamental concept in language processing, the automaton.
- The regular expression language is a powerful tool for pattern-matching.
- Basic operations in regular expressions include concatenation of symbols, disjunction of symbols, counters, anchors, and precedence operators.
- The behavior of a deterministic automaton is fully determined by the state it is in.
- Any regular expression can be realized as an FSA.
- Memory is an advanced operation that is often considered part of regular expressions but cannot be realized as a finite automaton.
- Any NFA can be converted to a DFA.
- An NFA needs a search strategy (such as depth-first or breadth-first search) to accept strings.
Thank You