Pattern Matching on Strings using Regular Expressions

1
Pattern Matching on Strings using Regular Expressions
Num = 0 | [1-9][0-9]*
Email = [a-z]+ "@" [a-z]+ ("." [a-z]+ )*
Claus Brabrand, brabrand_at_itu.dk, IT University of Copenhagen
Jakob G. Thomsen, gedefar_at_cs.au.dk, Aarhus University
2
Abstract
We show how to achieve typed and unambiguous
declarative pattern matching on strings using
regular expressions extended with a simple
recording operator. We give a characterization
of ambiguity of regular expressions that leads to
a sound and complete static analysis. The
analysis is capable of pinpointing all
ambiguities in terms of the structure of the
regular expression and of reporting shortest ambiguous
strings. We also show how pattern matching can
be integrated into statically typed programming
languages for deconstructing strings and
reproducing typed and structured values. We
validate our approach by giving a full
implementation of the approach presented in this
paper. The resulting tool, reg-exp-rec, adds
typed and unambiguous pattern matching to Java in
a stand-alone and non-intrusive manner. We
evaluate the approach using several realistic
examples.
3
Outline
  • Pattern Matching (intro & motivation)
  • The Chomsky Hierarchy (1956)
  • Regular Expressions
  • The Recording Construction
  • Ambiguity
  • Disambiguation
  • Type Inference
  • Usage and Examples
  • Evaluation and Conclusion

4
Introduction & Motivation
  • Pattern matching: an indispensable problem
  • Many applications need to "parse" dynamic input:
  • 1) URLs
  • 2) Log Files
  • 3) DBLP

1) URL (protocol, host, path, query-string; the query-string is a list of key-value pairs):
http://first.dk/index.php?id=141&view=details
2) Log file:
13/02/2010 66.249.65.107 get /support.html
20/02/2010 42.116.32.64 post /search.html
3) DBLP:
<article>
  <title>Three Models for the...</title>
  <author>Noam Chomsky</author>
  <year>1956</year>
</article>
6
The Chomsky Hierarchy (1956)
  • Language classes (formalisms)
  • Type-3 regular expressions "enough" for
  • URLs, log files, DBLP, ...
  • "Trade" (excess) expressivity for
  • declarativity, simplicity, and static safety !

7
Type-0 java.net.URL
  • Turing-Complete programming (e.g., Java)
  • "unrestricted grammars" (e.g., rewriting
    systems)
  • Cyclomatic complexity (of official
    "java.net.URL")
  • 88 bug reports on Sun's Bug Repository !
  • Bug reports span more than a decade !

8
Type-1: Context-Sensitivity
  • Not widely used (or studied?) formalism
  • Presumably because:
  • Restricts expressivity w/o offering extra safety?

9
Type-2 Context-Free Grammars
  • Conceptually harder than regexps
  • Essentially (Type-3) Regular Expressions + recursion
  • The ultimate end-all scientific argument
  • We d

(conjecture!)
regexps 12 times more popular !
10
Type-?: Regexp Capture Groups
  • Capturing groups (Perl, PHP, Java regex, ...):
  • Syntax: (R) (i.e., in parentheses)
  • Back-references:
  • Syntax: \7 (i.e., "index of" capturing group)
  • Beyond regularity !
  • (a*)b\1 is non-regular: { aⁿ b aⁿ | n ≥ 0 }
  • In fact, not even context-free !!!
  • (.*).\1 is non-context-free: { ω α ω | ω ∈ Σ*, α ∈ Σ }

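
A minimal sketch of the back-reference example, using the standard java.util.regex API rather than anything from the talk's tool. The class and method names (BackReferenceDemo, matches) are made up for illustration. It reads the slide's first pattern as (a*)b\1, so the text after "b" must repeat exactly the run of a's recorded before it.

```java
import java.util.regex.Pattern;

final class BackReferenceDemo {
    // (a*)b\1 : group 1 records a run of a's; \1 demands the very same run again.
    static final Pattern SAME_RUN = Pattern.compile("(a*)b\\1");

    // True iff the whole string has the shape a^n b a^n for some n >= 0.
    static boolean matches(String s) {
        return SAME_RUN.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("aabaa"));  // true
        System.out.println(matches("aabaaa")); // false: the two runs differ in length
        System.out.println(matches("b"));      // true: n = 0
    }
}
```

No finite automaton can count the a's on both sides of "b", which is why a pattern with \1 is outside the regular languages.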
11
Type-?: Regexp Capture Groups
  • Interpretation with back-tracking
  • NP-complete (exponential worst-case) :-(

regexp "a?ⁿaⁿ" vs. string "aⁿ"
1 minute vs. 0.02 msecs
3,000,000 : 1 on strings of length 29 !!!
12
Type-3 Regular Expressions
Declarative !
Safe !
Simple !
  • Closure properties
  • Union
  • Concatenation
  • Iteration
  • Restriction
  • Intersection
  • Complement
  • ...
  • Decidability properties
  • ...
  • ...
  • Containment: L(R) ⊆ L(R')
  • Ambiguity
  • ...
  • ...

14
Regular Expressions
  • Syntax
  • Semantics
  • where:
  • L1 · L2 is concatenation (i.e., { ω1·ω2 | ω1 ∈ L1, ω2 ∈ L2 })
  • L* = ⋃(i≥0) L^i, where L^0 = { ε } and L^i = L · L^(i-1)

15
Common Extensions (sugar)
  • Any character (aka, dot):
  • "." as c1|c2|...|cn, ci ∈ Σ
  • Character ranges:
  • "[a-z]" as a|b|...|z
  • One-or-more regexps:
  • "R+" as R·R*
  • Optional regexp:
  • "R?" as ε|R
  • Various repetitions, e.g.:
  • "R{2,3}" as R·R·R?

17
Recording
  • Syntax: <x = R>
  • "x" is a recording identifier
  • (it "remembers" the substring it matches)
  • Semantics
  • Example (simplified emails):
    <user = [a-z]+ > "@" <domain = [a-z]+ ("." [a-z]+)* >
  • Matching against string:
    "obama@whitehouse.gov"
  • yields:
    user = "obama"
    domain = "whitehouse.gov"

NB: cannot use DFAs / NFAs ! They give only recognition (yes / no), not how (i.e., "the structure")
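
For comparison, the simplified email example can be approximated with named capturing groups in java.util.regex. This is only a sketch of the idea of "recording" a substring under a name. It is not the reg-exp-rec tool, it gives no static ambiguity or type guarantees, and the names EmailRecording and match are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EmailRecording {
    // Named groups play the role of the recordings "user" and "domain".
    static final Pattern EMAIL =
        Pattern.compile("(?<user>[a-z]+)@(?<domain>[a-z]+(\\.[a-z]+)*)");

    // Returns {user, domain}, or null if s is not a (simplified) email.
    static String[] match(String s) {
        Matcher m = EMAIL.matcher(s);
        if (!m.matches()) return null;
        return new String[] { m.group("user"), m.group("domain") };
    }

    public static void main(String[] args) {
        String[] r = match("obama@whitehouse.gov");
        System.out.println("user = " + r[0]);   // user = obama
        System.out.println("domain = " + r[1]); // domain = whitehouse.gov
    }
}
```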

18
Recording (structured)
  • Another example (with nested recordings):
    <date = <day = [0-9]{2} > "/" <month = [0-9]{2} > "/" <year = [0-9]{4} > >
  • Matching against string:
    "26/06/1992"
  • yields:
    date = 26/06/1992
    date.day = 26
    date.month = 06
    date.year = 1992
19
Recording (structured, lists)
  • Yet another example (yielding lists):
    <name = [a-z]+ > " " <name = [a-z]+ >
    ( <name = [a-z]+ > "\n" )*
    <name = [a-z]+ > (" " <name = [a-z]+ > )*
  • Matching against string:
    "obama bush"
  • yields a list structure:
    name = [obama,bush]
21
Abstract Syntax Trees (ASTs)
22
Ambiguity
  • Definition:
  • R ambiguous iff
  • ∃T,T' ∈ AST_R : T ≠ T' ∧ ‖T‖ = ‖T'‖
  • where ‖·‖ : AST → Σ* (the flattening) is

23
Characterization of Ambiguity
  • Theorem
  • R unambiguous iff

NB: sound & complete !
R ? R?R
24
Examples
  • Ambiguous:
  • a|a
  • L(a) ∩ L(a) = { a } ≠ Ø
  • a*·a*
  • L(a*) ⋈ L(a*) = { aⁿ } ≠ Ø
  • Unambiguous:
  • a|aa
  • L(a) ∩ L(aa) = Ø
  • a*·ba*
  • L(a*) ⋈ L(ba*) = Ø

25
Ambiguity Examples
  • a?b|(ab)*
  • (a|ab)·(ba|a)
  • (aa|aaa)*

ambiguous choice: a?b <--> (ab)*
shortest ambiguous string: "ab"
ambiguous concatenation: (a|ab) <--> (ba|a)
shortest ambiguous string: "aba"
ambiguous star: (aa|aaa)*
shortest ambiguous string: "aaaaa"
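
The three reported strings can be double-checked by brute force, directly from the definition of ambiguity (two distinct ASTs flattening to the same string). The sketch below is not the static analysis from the talk. It counts parse trees per string and enumerates strings in order of length, reading the three examples as a?b|(ab)*, (a|ab)·(ba|a) and (aa|aaa)*. All names (AmbiguityBruteForce, Re, parses, shortestAmbiguous) are hypothetical, and star iterations that consume the empty string are ignored so that counts stay finite.

```java
import java.util.ArrayList;
import java.util.List;

final class AmbiguityBruteForce {
    // Tiny regex AST, for illustration only.
    interface Re {}
    record Eps() implements Re {}
    record Chr(char c) implements Re {}
    record Alt(Re left, Re right) implements Re {}
    record Cat(Re left, Re right) implements Re {}
    record Star(Re body) implements Re {}

    static final Re A = new Chr('a'), B = new Chr('b');
    // a?b | (ab)*
    static final Re EX_CHOICE =
        new Alt(new Cat(new Alt(A, new Eps()), B), new Star(new Cat(A, B)));
    // (a|ab) . (ba|a)
    static final Re EX_CONCAT =
        new Cat(new Alt(A, new Cat(A, B)), new Alt(new Cat(B, A), A));
    // (aa|aaa)*
    static final Re EX_STAR =
        new Star(new Alt(new Cat(A, A), new Cat(A, new Cat(A, A))));

    // Number of distinct parse trees of r whose flattening is s.
    static long parses(Re r, String s) {
        if (r instanceof Eps) return s.isEmpty() ? 1 : 0;
        if (r instanceof Chr chr) return s.length() == 1 && s.charAt(0) == chr.c() ? 1 : 0;
        if (r instanceof Alt alt) return parses(alt.left(), s) + parses(alt.right(), s);
        if (r instanceof Cat cat) {
            long n = 0;
            for (int i = 0; i <= s.length(); i++)   // every way to split s in two
                n += parses(cat.left(), s.substring(0, i)) * parses(cat.right(), s.substring(i));
            return n;
        }
        Star star = (Star) r;
        if (s.isEmpty()) return 1;                  // zero iterations
        long n = 0;
        for (int i = 1; i <= s.length(); i++)       // first iteration takes a non-empty prefix
            n += parses(star.body(), s.substring(0, i)) * parses(star, s.substring(i));
        return n;
    }

    // First string (by length, then alphabet order) with more than one parse, or null.
    static String shortestAmbiguous(Re r, String alphabet, int maxLen) {
        List<String> level = List.of("");
        for (int len = 0; len <= maxLen; len++) {
            for (String s : level) if (parses(r, s) > 1) return s;
            List<String> next = new ArrayList<>();
            for (String s : level)
                for (char c : alphabet.toCharArray()) next.add(s + c);
            level = next;
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(shortestAmbiguous(EX_CHOICE, "ab", 5)); // ab
        System.out.println(shortestAmbiguous(EX_CONCAT, "ab", 5)); // aba
        System.out.println(shortestAmbiguous(EX_STAR, "ab", 5));   // aaaaa
    }
}
```

The enumeration is exponential in the length bound, so it is only useful as a sanity check on small examples.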
26
Ambiguity vs. Recordings
  • Ambiguities inside recordings:
  • <x = a|a >
  • <x = a* · a* >
  • ...is not a problem!
  • Contextual composition (of recordings):
  • <x = a* > · a*
  • <x = a > | a
  • ...is a problem!
  • Note: our tool tests only for these!

28
Disambiguation
  • 1) Manual rewriting
  • Always possible :-)
  • Tedious :-(
  • Error-prone :-(
  • Not structure-preserving :-(
  • 2) Restriction
  • R1 - R2
  • And then encode...
  • R^C as Σ* - R
  • R1 ∩ R2 as (R1^C | R2^C)^C
  • 3) Disambiguators
  • From characterization:
  • concat: '·L', '·R'
  • choice: '|L', '|R'
  • star: '*L', '*R'
  • (partial-order on ASTs)
  • 4) Default disambiguation
  • concat, choice, and star are all left-biased (by default) !
  • (Our tool does this)

29
Quizzz (Restriction vs. Recording)
  • Which can have recordings?
  • A) R1, R2, R3, R4, and R5 can have recordings
  • B) R1, R3, R4, and R5 can have recordings
  • C) R1, R4, and R5 can have recordings
  • D) R1 can have recordings
  • E) None of them can have recordings

R1 - R2
R3^C as Σ* - R3
R4 ∩ R5 as (R4^C | R5^C)^C
i.e., where do recordings make sense?
31
Type Inference
  • Type Inference
  • R (L,S)

32
Examples (Type Inference)
  • Regexp:
    Person = <name = [a-z]+ > " (" <age = [0-9]+ > ")"
  • compile (our tool):
    class Person { // auto-generated
      String name;
      int age;
      static Person match(String s) { ... }
      public String toString() { ... }
    }
  • Usage:
    String s = "obama (48)";
    Person p = Person.match(s);
    print(p.name + " is " + p.age + "y old");
33
Examples (Type Inference)
  • Regexp:
    Person = <name = [a-z]+ > " (" <age = [0-9]+ > ")"
    People = ( Person "\n" )*
  • compile (our tool):
    class People { // auto-generated
      String[] name;
      int[] age;
      static People match(String s) { ... }
      public String toString() { ... }
    }
  • Usage:
    String s = "obama (48)\nbush (63)\n";
    People p = People.match(s);
    println("Second name is " + p.name[1]);
34
Examples (Type Inference)
  • Regexp:
    Person = <name = [a-z]+ > " (" <age = [0-9]+ > ")"
    People = ( <person = Person > "\n" )*
  • compile (our tool):
    class People { // auto-generated
      Person[] person;
      class Person { // nested class
        String name;
        int age;
      }
      ...
    }
  • Usage:
    String s = "obama (48)\nbush (63)\n";
    People people = People.match(s);
    for (Person p : people.person) println(p.name);
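
To make the first Person example concrete, here is a hand-written approximation of what such a class could look like, built on java.util.regex named groups. It is a sketch under assumptions, not the code that reg-exp-rec actually generates. The class name PersonSketch and the choice to return null on a failed match are illustrative only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PersonSketch {
    // Person = <name = [a-z]+ > " (" <age = [0-9]+ > ")"
    private static final Pattern PERSON =
        Pattern.compile("(?<name>[a-z]+) \\((?<age>[0-9]+)\\)");

    final String name; // recorded as a string
    final int age;     // [0-9]+ is given the type int (very long digit runs would overflow)

    private PersonSketch(String name, int age) {
        this.name = name;
        this.age = age;
    }

    // Deconstructs s into typed fields; returns null if s does not match.
    static PersonSketch match(String s) {
        Matcher m = PERSON.matcher(s);
        if (!m.matches()) return null;
        return new PersonSketch(m.group("name"), Integer.parseInt(m.group("age")));
    }

    @Override
    public String toString() {
        return name + " (" + age + ")";
    }

    public static void main(String[] args) {
        PersonSketch p = match("obama (48)");
        System.out.println(p.name + " is " + p.age + "y old"); // obama is 48y old
    }
}
```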
36
URLs
  • URLs:
    "http://www.google.com/search?q=record&hl=en"
    (protocol, host, path, query-string)
  • Regexp:
    Host = <host = [a-z]+ ("." [a-z]+ )* >
    Path = <path = [a-z/.]* >
    Query = <query = [a-z&=]* >
    URL = "http://" Host "/" Path "?" Query
  • Query string further structured (list of key-value pairs):
    KeyVal = <key = [a-z]* > "=" <val = [a-z]* >
    Query = KeyVal ("&" KeyVal)*
37
URLs (Usage Example)
  • Regexp:
    Host = <host = [a-z]+ ("." [a-z]+ )* >
    Path = <path = [a-z/.]* >
    KeyVal = <key = [a-z]* > "=" <val = [a-z]* >
    Query = KeyVal ("&" KeyVal)*
    URL = "http://" Host "/" Path "?" Query
  • Usage (example):
    String s = "http://www.google.com/search?q=record";
    URL url = URL.match(s);
    print("Host is: " + url.host);
    if (url.key.length>0) print("1st key: " + url.key[0]);
    for (String val : url.val) println("value = " + val);
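
A rough equivalent of this usage can be written by hand with java.util.regex. This is a sketch only: the character classes are a best-effort reading of the slide's regexps, the lists are built by splitting the query string manually (plain capture groups do not yield lists), and UrlSketch and its fields are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UrlSketch {
    // URL = "http://" Host "/" Path "?" Query
    private static final Pattern URL = Pattern.compile(
        "http://(?<host>[a-z]+(\\.[a-z]+)*)/(?<path>[a-z/.]*)\\?(?<query>[a-z&=]*)");
    // KeyVal = <key> "=" <val>
    private static final Pattern KEY_VAL = Pattern.compile("(?<key>[a-z]*)=(?<val>[a-z]*)");

    final String host;
    final String path;
    final List<String> key = new ArrayList<>();
    final List<String> val = new ArrayList<>();

    private UrlSketch(String host, String path) {
        this.host = host;
        this.path = path;
    }

    // Returns the deconstructed URL, or null if s does not match.
    static UrlSketch match(String s) {
        Matcher m = URL.matcher(s);
        if (!m.matches()) return null;
        UrlSketch u = new UrlSketch(m.group("host"), m.group("path"));
        // Query = KeyVal ("&" KeyVal)* ; limit -1 keeps empty trailing pieces so they are rejected.
        for (String piece : m.group("query").split("&", -1)) {
            Matcher kv = KEY_VAL.matcher(piece);
            if (!kv.matches()) return null;
            u.key.add(kv.group("key"));
            u.val.add(kv.group("val"));
        }
        return u;
    }

    public static void main(String[] args) {
        UrlSketch url = match("http://www.google.com/search?q=record&hl=en");
        System.out.println("Host is: " + url.host); // Host is: www.google.com
        System.out.println("Keys: " + url.key);     // Keys: [q, hl]
        System.out.println("Values: " + url.val);   // Values: [record, en]
    }
}
```

Unlike the approach in the talk, nothing here checks statically that the pattern is unambiguous or that the field types are right.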
38
Log Files
Format:
13/02/2010 66.249.65.107 /support.html
20/02/2010 42.116.32.64 /search.html
...
Regexp:
Date = <date = <day = Day > "/"
               <month = Month > "/"
               <year = [0-9]{4} > >
IP = <ip = [0-9]{1,3} ("." [0-9]{1,3} ){3} >
Entry = <entry = Date " " IP " " Path "\n" >
Log = Entry*
Usage:
Log log = Log.match(log_file);
for (Entry e : log.entry)
  if (e.date.month == 02 && e.date.day == 29)
    print("Access on LEAP YEAR from IP " + e.ip);
39
Log Files (cont'd, ambiguity)
  • Assume we forgot "/" (between day & month)
  • Ambiguity
  • i.e. "1/01" (January 1) vs. "10/1" (January 10) :-)

Regexp:
Day = 0?[1-9] | [1-2][0-9] | 30 | 31
Month = 0?[1-9] | 10 | 11 | 12
Date = <date = <day = Day > // no slash !
               <month = Month > "/"
               <year = [0-9]{4} > >
Error:
ambiguous concatenation: <day> <--> <month>
shortest ambiguous string: "101"
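
The reported string "101" can be checked by hand: list every way of cutting it into a Day prefix and a Month suffix. The sketch below does this with String.matches. The Day and Month patterns follow the slide's definitions, while DayMonthSplits and splits are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

final class DayMonthSplits {
    static final String DAY = "0?[1-9]|[1-2][0-9]|30|31";
    static final String MONTH = "0?[1-9]|10|11|12";

    // All ways to read s as Day immediately followed by Month (no slash), shown as "day/month".
    static List<String> splits(String s) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i <= s.length(); i++) {
            String day = s.substring(0, i);
            String month = s.substring(i);
            if (day.matches(DAY) && month.matches(MONTH)) out.add(day + "/" + month);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(splits("101"));  // [1/01, 10/1]
        System.out.println(splits("1212")); // [12/12]
    }
}
```

Two splits for "101" means two different recordings for day and month, which is exactly the kind of ambiguity the error message flags.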
40
DBLP (Format)
  • DBLP (XML) Format

<article>
  <author>Noam Chomsky</author>
  <title>Three Models for the Description of Language</title>
  <year>1956</year>
  <journal>IRE Transactions on Information Theory</journal>
</article>
<article>
  <author>Claus Brabrand</author>
  <author>Jakob G Thomsen</author>
  <title>Typed and Unambiguous Pattern Matching on Strings using Regular Expressions</title>
  <year>2010</year>
  <note>Submitted</note>
</article>
...
41
DBLP (Regexp)
  • DBLP Regexp
  • Ambiguity !
  • EITHER 2 publications (.* = "")
  • OR 1 publication (.* = gray part) !!!

Author = "<author>" <author = [a-z]* > "</author>"
Title = "<title>" <title = [a-z]* > "</title>"
Article = "<article>" Author* Title .* "</article>"
DBLP = <pub = Article >*
ambiguous star: <pub>*
shortest ambiguous string:
"<article><title></title></article><article><title></title></article>"
42
DBLP (Disambiguated)
  • DBLP Regexp
  • Disambiguated (using "(R1-R2)")
  • Unambiguous! :-)

Author = "<author>" <author = [a-z]* > "</author>"
Title = "<title>" <title = [a-z]* > "</title>"
Article = "<article>" Author* Title .* "</article>"
DBLP = <pub = Article >*
Disambiguated replacement for Article:
Article = "<article>" Author* Title (.* - (.* "</article>" .*)) "</article>"
43
DBLP (Usage Example)
  • DBLP Regexp:
    Author = "<author>" <author = [a-z]* > "</author>"
    Title = "<title>" <title = [a-z]* > "</title>"
    Article = "<article>" Author* Title .* "</article>"
    DBLP = <article = Article >*
  • Usage (example):
    DBLP dblp = DBLP.match(readXMLfile("DBLP.xml"));
    for (Article a : dblp.article) print("Title: " + a.title);
45
Evaluation
  • Evaluation summary (table not recoverable; surviving labels: MatMult, NP-Complete, Frisch & Cardelli '04)
  • Also, (Type-3) regexps expressive "enough"
  • for URLs, Log files, DBLP, ...

46
Type-3 vs. Type-0 (URLs)
  • Regexps vs. Java

Regexps are 8 times more concise !
47
java.util.regex vs. Our approach
  • Efficiency (on DBLP)
  • java.util.regex:
  • Exponential, O(2^n): 2,500 chars in 2 mins !
  • In contrast, ours:
  • Linear (on DBLP): 1,200,000 chars in 6 secs !

2 mins vs. 10 msecs
48
Related Work
  • Recording (with lists in general)
  • "x as R" in XDuce, "x::R" in CDuce, and "x@R" in Scala and HaRP
  • Ambiguity
  • Book, Even, Greibach & Ott '71 and Hosoya '03 for XDuce, but indirectly via NFAs, not directly (syntax-directed)
  • Disambiguation
  • Vansummeren '06, but with global, not local disambiguation
  • Type inference
  • Exact type inference in XDuce & CDuce (soundness + completeness proof in Vansummeren '06), but not for stand-alone and non-intrusive usage (Java)

49
Conclusion
  • For string pattern matching, it is possible to "trade (excess) expressivity for safety + simplicity"
  • i.e., ambiguity checking and type inference !
  • stand-alone & non-intrusive language integration (Java) !
  • In conclusion:

We conclude that if regular expressions are
sufficiently expressive, they provide a simple,
declarative, and safe means for pattern matching
on strings, capable of extracting highly
structural information in a statically type-safe
and unambiguous manner.
50
</Talk>
http://www.cs.au.dk/gedefar/reg-exp-rec/
  • Questions ? Complaints ?