Regular Expressions and Automata - PowerPoint PPT Presentation

Provided by: tist263
Transcript and Presenter's Notes

1
Chapter 2 Regular Expressions and Automata
Korea Maritime and Ocean University, NLP
Jung Tae LEE, inverse90_at_nate.com
2
01 Regular Expression
02 Finite-State Automata
03 Regular Languages and FSAs
04 Summary
3

NLP
01 RE
1. Regular Expression
  • What is a regular expression (RE)?
  • A formula in a special language that specifies
    simple classes of strings
    (a string is a sequence of symbols).
  • An algebraic notation for characterizing a set of
    strings.
  • A language for specifying text search strings.
  • REs are therefore an important theoretical tool
    throughout computer science and linguistics.

4

  • An RE search requires a pattern that we want to
    search for and a corpus of texts to search
    through.
  • The simplest kind of regular expression is a sequence
    of simple characters.
  • For example, to search for green, we type /green/.
  • (Recall that we are assuming a search
    application that returns entire lines.)
  • Basic Regular Expression Patterns

RE               Example Patterns Matched
/interested/     We are interested in NLP
/DOROTHY/        SURRENDER DOROTHY
/!/              I'm in danger now!
/Claire says,/   Dagmar, my gift plz, Claire says,
  • Regular expressions are case sensitive: lower
    case is distinct from upper case.
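A minimal sketch of the line-returning search application assumed above, written with Python's re module rather than the /.../ notation of the slides. The helper name search_lines and the sample corpus are illustrative assumptions, not part of the original deck.

```python
import re

def search_lines(pattern, corpus):
    """Return every line of corpus that contains a match for pattern."""
    regex = re.compile(pattern)
    return [line for line in corpus.splitlines() if regex.search(line)]

# Illustrative three-line corpus built from the example strings on the slide.
corpus = "We are interested in NLP\nSURRENDER DOROTHY\nI'm in danger now!"

print(search_lines(r"interested", corpus))  # ['We are interested in NLP']
print(search_lines(r"dorothy", corpus))     # [] because matching is case sensitive
```

The second call returns nothing because lower-case dorothy does not match the upper-case text.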

5

  • Basic Regular Expression Patterns
  • The case-sensitivity problem is solved with
    square brackets, [ and ].

RE               Match           Example Patterns Matched
/[bB]lue/        Blue or blue    deep blue sea
/[abc]/          a, b, or c      algebra
/[1234567890]/   Any digit       plenty of 7 to 5
Use of the brackets [] to specify a disjunction
of characters.
  • The brackets can be used with the dash (-) to
    specify any one character in a range.

RE        Match                   Example Patterns Matched
/[A-Z]/   An upper case letter    we are INFINITY
/[a-z]/   A lower case letter     not enough to love
/[0-9]/   A single digit          chapter 2 RE
Use of the brackets [] plus the dash - to
specify a range.
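The bracket patterns in the two tables can be tried out as follows. This is a small sketch using Python's re module; the helper first_match is a hypothetical name introduced only for this example.

```python
import re

def first_match(pattern, text):
    """Return the first substring of text that matches pattern, or None."""
    found = re.search(pattern, text)
    return found.group() if found else None

print(first_match(r"[bB]lue", "deep blue sea"))            # blue
print(first_match(r"[abc]", "algebra"))                    # a
print(first_match(r"[1234567890]", "plenty of 7 to 5"))    # 7
print(first_match(r"[A-Z]", "we are INFINITY"))            # I
print(first_match(r"[0-9]", "chapter 2 RE"))               # 2
```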
6

  • Basic Regular Expression Patterns
  • Brackets can also be used to specify what a single
    character cannot be, by use of the caret ^.
  • The caret negates the set only when it is the first
    symbol after the open bracket. Anywhere else it
    simply stands for a caret.

RE         Match                       Example Patterns Matched
/[^A-Z]/   Not an upper case letter    Lee jung tae
/a^b/      The pattern a^b             look up a^b now
/[e^]/     Either e or ^               Kleene star
Use of the caret ^ for negation or just to mean ^.
  • The question mark ? means the preceding
    character or nothing.

RE          Match              Example Patterns Matched
/means?/    mean or means      mean
/colou?r/   color or colour    colour
The question mark ? marks optionality of the
previous expression.
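A short sketch of negated brackets and the optional-character operator in Python's re module. The test strings are illustrative; note that outside brackets Python treats ^ as a start-of-line anchor, so only the bracketed uses of the caret are shown here.

```python
import re

# [^A-Z] matches the first character that is not an upper case letter.
print(re.search(r"[^A-Z]", "Lee jung tae").group())   # e

# Inside brackets but not in first position, ^ is just a literal caret.
print(re.findall(r"[e^]", "Kleene ^ star"))           # ['e', 'e', 'e', '^']

# ? makes the preceding character optional.
print(re.findall(r"colou?r", "color or colour"))      # ['color', 'colour']
print(re.findall(r"means?", "mean, means"))           # ['mean', 'means']
```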
7

  • Basic Regular Expression Patterns
  • Sometimes we need regular expressions that allow
    repetition.
  • Ex) ba! baa! baaa! baaaa! ba...a!
  • These are based on the asterisk *, commonly
    called the Kleene * (Kleene star).
  • The Kleene star means zero or more
    occurrences of the immediately previous character
    or regular expression.
  • Sometimes there is a shorter way to specify at
    least one of some character. This is the Kleene +,
    which means one or more of the previous
    character.

RE              Match                                                Example Patterns Matched
/[0-9]*/        A string of digits, or nothing                       123.45
/[0-9]+/        Same as [0-9][0-9]*                                  .123
/beg.n/         Any character between beg and n                      begin, beg'n, begun
/^The dog\.$/   ^ matches start of line and $ matches end of line    The dog.
Use of the Kleene operators, the period,
and anchors.
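The repetition operators, the period, and the anchors can be checked with a few lines of Python. This is a sketch with illustrative inputs; the pattern /baa+!/ is the one used for the automaton later in the deck.

```python
import re

# Kleene +: b, then a, then one or more further a's, then !
sheep = re.compile(r"baa+!")
for s in ["ba!", "baa!", "baaaa!"]:
    print(s, bool(sheep.fullmatch(s)))   # ba! False, baa! True, baaaa! True

# One or more digits.
print(re.findall(r"[0-9]+", "123.45"))                  # ['123', '45']

# The period matches any single character, so begn (nothing between g and n) fails.
print(re.findall(r"beg.n", "begin beg'n begun begn"))   # ['begin', "beg'n", 'begun']

# ^ and $ anchor the match to the start and end of the line.
print(bool(re.search(r"^The dog\.$", "The dog.")))      # True
print(bool(re.search(r"^The dog\.$", "See The dog.")))  # False
```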
8

  • Disjunction, Grouping, and Precedence
  • Brackets alone cannot express a choice between
    whole strings such as cat or dog. So we need a
    new operator, the disjunction operator, also
    called the pipe symbol |.
  • To make the disjunction operator apply only to a
    specific pattern, we need to use the parenthesis
    operators ( and ).
  • Ex) /guppy|ies/ matches only the string guppy
    or the string ies. But we want guppy or guppies,
    so the pattern /gupp(y|ies)/ specifies that.

Operator                 Regular expression
Parenthesis              ( )
Counters                 * + ? {}
Sequences and anchors    the ^my end$
Disjunction              |
Operator precedence hierarchy, from highest to lowest.
  • REs always match the largest string they
    can. Patterns are greedy!
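The effect of precedence and of greedy matching can be seen in a small Python sketch. The word list and the sentence are illustrative inputs chosen for this example.

```python
import re

words = ["guppy", "guppies", "ies"]

# Without parentheses the pipe splits the whole pattern: guppy OR ies.
print([w for w in words if re.fullmatch(r"guppy|ies", w)])     # ['guppy', 'ies']

# With parentheses the disjunction applies only to the suffix.
print([w for w in words if re.fullmatch(r"gupp(y|ies)", w)])   # ['guppy', 'guppies']

# Greedy matching: [a-z]* could match the empty string, but takes all of "once".
print(re.search(r"[a-z]*", "once upon a time").group())        # once
```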
9

  • Advanced Operators
  • There are some more useful operators.

RE   Expansion       Match                          Example
\d   [0-9]           Any digit                      Party of 5
\D   [^0-9]          Any non-digit                  Blue moon
\w   [a-zA-Z0-9_]    Any alphanumeric/underscore    Daiyu
\W   [^\w]           A non-alphanumeric             !!!!!
\s   [ \r\t\n\f]     Whitespace (space, tab)
\S   [^\s]           Non-whitespace                 In Concord
Aliases for common sets of characters.

RE      Match
{n}     n occurrences of the previous char or expression
{n,m}   From n to m occurrences of the previous char or expression
Regular expression operators for counting.
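A brief sketch of the aliases and counters in Python's re module. One caveat worth stating: for Python 3 string patterns, \d, \w and \s are Unicode-aware by default, so they match more than the ASCII expansions in the table unless the re.ASCII flag is passed. The inputs are illustrative.

```python
import re

print(re.findall(r"\d", "Party of 5"))           # ['5']
print(re.findall(r"\W", "In Concord!"))          # [' ', '!']
print(re.findall(r"\S+", "In Concord"))          # ['In', 'Concord']

# {n}: exactly n occurrences. {n,m}: from n to m occurrences.
print(bool(re.fullmatch(r"a{3}", "aaa")))        # True
print(bool(re.fullmatch(r"a{2,4}", "aaaaa")))    # False
print(re.findall(r"\d{2,3}", "7 42 1234"))       # ['42', '123']
```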
10

  • Regular Expression Substitution, Memory
  • Ex) The Perl substitution operator s/regexp1/pattern/
    allows a string characterized by a regular
    expression to be replaced by another string.

Example            RE                                           Replaced string
35 boxes           s/([0-9]+)/<\1>/                             <35> boxes
The Xer is Ying    s/The (.*)er is (.*)ing/The \1er will \2/    The Xer will Y
  • To refer back to part of a match, we put
    parentheses ( and ) around that part of the
    pattern.
  • The matched text is stored in a numbered memory
    called a register, which \1, \2, ... refer to.
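The same substitutions can be sketched in Python, where re.sub plays the role of the Perl s/.../.../ operator. The last pattern, which reuses a register inside the pattern itself, is an added illustration and not taken from the slide.

```python
import re

# \1 in the replacement refers to the text captured by the first parentheses.
print(re.sub(r"([0-9]+)", r"<\1>", "35 boxes"))   # <35> boxes

# Two registers, \1 and \2.
print(re.sub(r"The (.*)er is (.*)ing", r"The \1er will \2", "The Xer is Ying"))
# The Xer will Y

# A register can also be used inside the pattern (illustrative example).
same = re.compile(r"the (.*)er they were, the \1er they will be")
print(bool(same.search("the bigger they were, the bigger they will be")))   # True
print(bool(same.search("the bigger they were, the faster they will be")))   # False
```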

11

  • References
  • http://www.codejs.co.kr/ECA095EAB79CEC8B9D-regular-expression/
  • This page contains information about
    metacharacters, written in Korean.
  • http://gskinner.com/RegExr/
  • A useful tool for trying out regular expressions.

12

02 FSA
2. Finite-State Automata
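  • What is a finite-state automaton (FSA)?
  • Like regular expressions, FSAs are used to describe
    regular languages.
  • They are a good theoretical foundation for much
    computational work.
  • Any regular expression can be implemented as an
    FSA, except REs that use the memory feature.

[Figure: three equivalent ways of describing regular
languages (finite automata, regular expressions, and
regular grammars).]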
13

  • Using an FSA to Recognize a Regular Language
  • An automaton can model a regular expression.
  • It recognizes a set of strings.
  • Here is how the automaton for /baa+!/ looks:

[Figure: FSA for /baa+!/ with states 0 to 4.]
  • State 0 is the start state (by convention).
  • A final state, or accepting state, is represented by
    a double circle, like state 4.

14

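  • Using an FSA to Recognize a Regular Language
  • The FSA can be used for recognizing (we also say
    accepting) strings in the following way.
  • Start in the start state and read the input one
    symbol at a time.
  • If the symbol matches an arc leaving the current
    state, cross that arc and move on to the next symbol.
  • If the input runs out while the machine is in an
    accepting state, the string is accepted. If no arc
    matches, or the input runs out in a non-accepting
    state, the string is rejected.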

 
15

  • Using an FSA to Recognize a Regular Language
  • An automaton can also be represented with a
    state-transition table.
  • Formally, a finite automaton is defined by the
    following five parameters:

Q = q0 q1 ... qN-1   A finite set of N states
Σ                    A finite input alphabet of symbols
q0                   The start state
F                    The set of final states, F ⊆ Q
δ(q,i)               The transition function or transition matrix between states.
                     Given a state q ∈ Q and an input symbol i ∈ Σ, δ(q,i) returns
                     a new state q' ∈ Q. δ is thus a relation from Q × Σ to Q.

         Input
State    b    a    !
0        1    ∅    ∅
1        ∅    2    ∅
2        ∅    3    ∅
3        ∅    3    4
4        ∅    ∅    ∅
State-transition table for the /baa+!/ automaton. ∅ marks a missing transition; state 4 is the final state.
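The table-driven recognition procedure can be sketched in a few lines of Python. This is a minimal sketch, assuming the table above describes /baa+!/ and that empty cells mean "no transition"; the names TRANSITIONS and d_recognize are choices made for this example.

```python
# Transition table for /baa+!/. A missing key plays the role of an empty cell.
TRANSITIONS = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,
    (3, "!"): 4,
}
START = 0
FINAL = {4}

def d_recognize(tape):
    """Return True if the deterministic automaton accepts the whole tape."""
    state = START
    for symbol in tape:
        state = TRANSITIONS.get((state, symbol))
        if state is None:          # no arc for this symbol: reject
            return False
    return state in FINAL          # accept only if we end in a final state

print(d_recognize("baaa!"))   # True
print(d_recognize("ba!"))     # False
```

Because the machine is deterministic, each input symbol is looked at exactly once and no backtracking is needed.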
 
16

  • Formal Languages
  • A model that can both generate
    and recognize all and
    only the strings of a formal language acts as
    a definition of that formal language.
  • A formal language is a set of strings.
  • Each string is composed of symbols from a finite
    symbol set called an alphabet.
  • The alphabet of the previous language is the set
    Σ = {a, b, !}.
  • Given a model m (such as a particular FSA), we
    can use L(m) to mean the formal language
    characterized by m.

[Figure: the automaton m, with arcs labeled b, a, a, a, !.]

L(m) = {baa!, baaa!, baaaa!, baaaaa!, baaaaaaa!, ...}
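Since the same model can generate as well as recognize, the strings of L(m) can be enumerated by walking the automaton. The sketch below assumes the /baa+!/ transition table from the previous slide and cuts the (infinite) language off at a maximum length; the function name generate is hypothetical.

```python
from collections import deque

TRANSITIONS = {(0, "b"): 1, (1, "a"): 2, (2, "a"): 3, (3, "a"): 3, (3, "!"): 4}
START = 0
FINAL = {4}

def generate(max_len):
    """Return all strings of L(m) with length <= max_len, shortest first."""
    results = []
    queue = deque([(START, "")])
    while queue:
        state, prefix = queue.popleft()
        if state in FINAL:
            results.append(prefix)
        if len(prefix) == max_len:
            continue                      # do not extend past the length limit
        for (src, symbol), dst in TRANSITIONS.items():
            if src == state:
                queue.append((dst, prefix + symbol))
    return results

print(generate(6))   # ['baa!', 'baaa!', 'baaaa!']
```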

17

  • Non-Deterministic FSAs
  • Consider a small change to the previous automaton,
    shown in the next figure.

[Figure: automaton with arcs labeled b, a, a, a, !.
The self-loop is on state 2 instead of state 3.]
  • When we get to state 2, if we see an a we don't
    know whether to remain in state 2 or go on to
    state 3. Automata with decision points like this
    are called non-deterministic FSAs (or NFSAs,
    NFAs).

[Figure: automaton with arcs labeled b, a, a, !, plus an arc labeled ε.
Arcs that have no symbols on them are called
ε-transitions. This automaton is also an NFA.]
18

  • Using an NFSA to Accept Strings
  • We might follow the wrong arc and reject a string
    when we should have accepted it, since there is
    more than one choice at some point.
  • There are three standard solutions to this
    problem:
  • Backup: whenever we come to a choice point, we
    could put a marker to mark where we were in the
    input and what state the automaton was in. Then if
    it turns out that we took the wrong choice, we
    could back up and try another path.
  • Look-ahead: we could look ahead in the input to
    help us decide which path to take.
  • Parallelism: whenever we come to a choice point,
    we could look at every alternative path in
    parallel.
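The parallelism strategy can be sketched by keeping a set of current states instead of a single one. The NFA below is an assumed encoding of the automaton with the self-loop on state 2; the names NFA and accepts_in_parallel are illustrative.

```python
# NFA for /baa+!/ with the self-loop on state 2.
# Each entry maps (state, symbol) to the SET of possible next states.
NFA = {
    (0, "b"): {1},
    (1, "a"): {2},
    (2, "a"): {2, 3},   # the choice point: stay in 2 or move on to 3
    (3, "!"): {4},
}
START = 0
FINAL = {4}

def accepts_in_parallel(tape):
    """Follow every alternative path at once by tracking a set of states."""
    current = {START}
    for symbol in tape:
        current = {nxt
                   for state in current
                   for nxt in NFA.get((state, symbol), set())}
        if not current:            # every path has died
            return False
    return bool(current & FINAL)   # accept if any path ends in a final state

print(accepts_in_parallel("baaa!"))   # True
print(accepts_in_parallel("ba!"))     # False
```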

19

  • Recognition as Search
  • ND-RECOGNIZE searches the space of possible paths
    through the automaton. If the search yields a path
    ending in an accept state,
    ND-RECOGNIZE accepts the string.
  • Otherwise, it rejects the string.
  • Algorithms like this, which work by searching for
    solutions, are known as state-space
    search algorithms.

[Figure: depth-first search trace on the input b a a a !, steps 1 to 8.]
Depth-first search is implemented with a stack.
20

  • Recognition as Search

[Figure: breadth-first search trace on the input b a a a !, steps 1 to 6.]
Breadth-first search is implemented with a queue.
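Both search orders can be sketched with one agenda of (state, input position) pairs: taking items from the end of the agenda gives depth-first search (a stack), taking them from the front gives breadth-first search (a queue). This is a simplified sketch in the spirit of ND-RECOGNIZE, assuming an NFA without ε-transitions; the encoding and names are illustrative.

```python
from collections import deque

# NFA for /baa+!/ with the self-loop on state 2 (assumed encoding).
NFA = {(0, "b"): {1}, (1, "a"): {2}, (2, "a"): {2, 3}, (3, "!"): {4}}
START = 0
FINAL = {4}

def nd_recognize(tape, depth_first=True):
    """Search the space of (state, position) pairs for an accepting path."""
    agenda = deque([(START, 0)])
    while agenda:
        # Stack behaviour for depth-first, queue behaviour for breadth-first.
        state, pos = agenda.pop() if depth_first else agenda.popleft()
        if pos == len(tape):
            if state in FINAL:
                return True        # found a path ending in an accept state
            continue               # dead end: fall back on the agenda
        for nxt in sorted(NFA.get((state, tape[pos]), ())):
            agenda.append((nxt, pos + 1))
    return False                   # agenda exhausted: reject

print(nd_recognize("baaa!", depth_first=True))    # True
print(nd_recognize("baaa!", depth_first=False))   # True
print(nd_recognize("ba!"))                        # False
```

The search terminates here because every step consumes one input symbol; with ε-transitions, visited search states would also need to be tracked.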
21

  • Using an NFSA to Accept Strings
  • Like DFS, BFS has its pitfalls. As with
    depth-first, if the state-space is infinite, the
    search may never terminate.
  • In addition, because the agenda grows quickly, the
    search may need an impractically large amount of
    memory if the state-space is even moderately large.
  • For larger problems, more complex search
    techniques such as dynamic programming or A* must
    be used.
  • These are discussed in another chapter.

Following van Santen and Sproat (1998), Kaplan
and Kay (1994), and Lewis and Papadimitriou (1988).
22

  • Relation of NFA and DFA
  • For any NFA, there is an exactly equivalent DFA.
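One standard way to obtain the equivalent DFA is the subset construction, in which each DFA state is a set of NFA states. The slide does not say which method it has in mind, so the sketch below is an assumption; it handles only NFAs without ε-transitions, and all names are illustrative.

```python
def nfa_to_dfa(nfa, start, finals, alphabet):
    """Subset construction for an NFA without epsilon-transitions."""
    start_set = frozenset([start])
    dfa = {}                     # maps (frozenset, symbol) to frozenset
    seen = {start_set}
    todo = [start_set]
    while todo:
        current = todo.pop()
        for symbol in alphabet:
            target = frozenset(n for s in current
                               for n in nfa.get((s, symbol), ()))
            if not target:
                continue         # no NFA state can move on this symbol
            dfa[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    # A DFA state is final if it contains at least one final NFA state.
    dfa_finals = {s for s in seen if s & finals}
    return dfa, start_set, dfa_finals

def dfa_accepts(dfa, start, finals, tape):
    state = start
    for symbol in tape:
        state = dfa.get((state, symbol))
        if state is None:
            return False
    return state in finals

# NFA for /baa+!/ with the self-loop on state 2 (assumed encoding).
NFA = {(0, "b"): {1}, (1, "a"): {2}, (2, "a"): {2, 3}, (3, "!"): {4}}
dfa, dfa_start, dfa_finals = nfa_to_dfa(NFA, 0, {4}, "ab!")
print(dfa_accepts(dfa, dfa_start, dfa_finals, "baaa!"))   # True
print(dfa_accepts(dfa, dfa_start, dfa_finals, "ba!"))     # False
```

In the worst case the DFA can have exponentially more states than the NFA, since its states are subsets of the NFA's states.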

 
23

03 RL
3. Regular Languages and FSAs
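  • What are regular languages?
  • The class of languages that are definable by regular
    expressions.
  • Equivalently, the class of languages characterizable
    by finite-state automata.
  • The class of regular languages over Σ is then
    formally defined as follows:
  • 1. Ø is a regular language.
  • 2. For every a in Σ ∪ {ε}, {a} is a regular language.
  • 3. If L1 and L2 are regular languages, then so are
    their concatenation L1 · L2 = {xy | x ∈ L1, y ∈ L2},
    their union L1 ∪ L2, and the Kleene closure L1*.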

 
24

  • Operations
  • Regular languages are also closed under the following
    operations:

Intersection      L1 ∩ L2
Difference        L1 − L2
Complementation   Σ* − L1
Reversal          L1^R
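Closure under complementation and intersection can be illustrated with deterministic automata: swapping final and non-final states of a complete DFA gives the complement, and running two DFAs in lockstep on pairs of states (the product construction) gives the intersection. The sketch below uses the /baa+!/ machine and a second, hypothetical machine that accepts strings with an even number of a's; all names are assumptions made for this example.

```python
ALPHABET = "ab!"
SINK = -1   # explicit dead state

def total(delta, states):
    """Complete a partial transition table so every pair has a next state."""
    full = dict(delta)
    for q in states | {SINK}:
        for sym in ALPHABET:
            full.setdefault((q, sym), SINK)
    return full

# m1 accepts /baa+!/
M1_FINAL = {4}
M1 = total({(0, "b"): 1, (1, "a"): 2, (2, "a"): 3, (3, "a"): 3, (3, "!"): 4},
           {0, 1, 2, 3, 4})

# m2 (hypothetical) accepts strings with an even number of a's
M2_FINAL = {0}
M2 = total({(0, "a"): 1, (1, "a"): 0, (0, "b"): 0, (1, "b"): 1,
            (0, "!"): 0, (1, "!"): 1}, {0, 1})

def in_complement_of_m1(tape):
    """Complement: same transitions, final and non-final states swapped."""
    q = 0
    for sym in tape:               # tape must use only symbols in ALPHABET
        q = M1[(q, sym)]
    return q not in M1_FINAL

def in_intersection(tape):
    """Product construction on the fly: the state is a pair (q1, q2)."""
    q1, q2 = 0, 0
    for sym in tape:
        q1, q2 = M1[(q1, sym)], M2[(q2, sym)]
    return q1 in M1_FINAL and q2 in M2_FINAL

print(in_complement_of_m1("ba!"))   # True
print(in_intersection("baa!"))      # True
print(in_intersection("baaa!"))     # False
```

Difference follows from these two operations, since L1 − L2 is the intersection of L1 with the complement of L2.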
 
25

  • REs are equivalent to FSAs.
  • The proof is by induction. For the inductive step,
    we show that each of the
    primitive operations of a regular
    expression (concatenation, union, closure) can be
    imitated by an automaton.
  • Start with the three base cases:

[Figure: (a) r = ε, (b) r = Ø, (c) r = a.]
Automata for the base case (no operators) of the
induction showing that any regular expression can
be turned into an equivalent automaton.
26

  • REs are equivalent to FSAs.
  • Concatenation

[Figure: FSA1 followed by FSA2, joined by an ε-transition.]
  • Closure

[Figure: FSA1 with added ε-transitions.]
27

  • REs are equivalent to FSAs.
  • Union

[Figure: union of two FSAs, built with ε-transitions.]
28

04 Summary
4. Summary
  • Introduced the most important fundamental concept
    in language processing, the automaton.
  • The RE language is a powerful tool for
    pattern matching.
  • Basic operations in REs include concatenation of
    symbols, disjunction of symbols, counters,
    anchors, and precedence operators.
  • The behavior of a deterministic automaton is fully
    determined by the state it is in.
  • Any RE can be realized as an FSA.
  • Memory is an advanced operation that is often
    considered part of regular expressions but cannot
    be realized as a finite automaton.
  • Any NFA can be converted to a DFA.
  • An NFA can be run by searching its state space,
    depth-first (with a stack) or breadth-first
    (with a queue).

29
Thank You