Regular expressions (contd.) -- remembering subpattern matches

Provided by: jimb9

1
Regular expressions (contd.) -- remembering
subpattern matches
  • When a <pattern> is being matched with a target
    string, substrings that match sub-patterns can be
    remembered and re-used later in the same pattern
  • Sub-patterns whose matching substrings are to be
    remembered are enclosed in parentheses
  • The sub-patterns are implicitly numbered,
    starting from 1 and their matching substrings can
    then be re-used later in the pattern by using
    back-references like \1 or \2 or \3
  • However, when the pattern is written as a
    double-quoted PHP string, the backslash must
    itself be escaped, so we type \\1 or \\2 or \\3
    in our regular expressions
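
The escaping rule above can be checked directly. The following is a minimal sketch, not taken from the slides: the sample sentence and the variable names are assumptions chosen for illustration. It finds doubled words with a back-reference and shows that the single-quoted and double-quoted spellings produce the same pattern.

```php
<?php
// Hypothetical sample text; the doubled words are deliberate.
$text = "This is is a test of the the pattern.";

// Single-quoted string: \1 reaches the regex engine unchanged.
$single = '/\b(\w+) \1\b/';
// Double-quoted string: the backslash is doubled, because "\1" on its
// own would be read by PHP as an octal character escape.
$double = "/\\b(\\w+) \\1\\b/";

// \1 must match exactly the text that group 1 matched.
preg_match_all($single, $text, $matches);
print_r($matches[0]);          // "is is" and "the the"
var_dump($single === $double); // bool(true)
```

Writing patterns in single quotes avoids the doubled backslashes; the slides use double quotes throughout, so they double them.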

2
Using back-references (contd.)
  • PHP code
  • <?php
  • $myString1 = "klmAklmAAklmABklmBklmBBklm";
  • echo "myString1 is $myString1 <br>";
  • $myString1 = preg_replace("/([A-Z])\\1/","_",$myString1);
  • echo "myString1 is now $myString1 ";
  • ?>
  • Resultant output is
  • myString1 is klmAklmAAklmABklmBklmBBklm
  • myString1 is now klmAklm_klmABklmBklm_klm

3
Using back-references (contd.)
  • PHP code
  • <?php
  • $myString = "klmAklmAAklmABklmBklmBBklm";
  • echo "myString is $myString <br>";
  • $myString = preg_replace("/([A-Z])\\1/","_",$myString);
  • echo "myString is now $myString ";
  • ?>
  • Resultant output is
  • myString is klmAklmAAklmABklmBklmBBklm
  • myString is now klmAklm_klmABklmBklm_klm

4
Regular expressions(contd.) -- using subpattern
matches in replacements
  • We saw that, within a regular expression,
    substrings that matched sub-patterns can be
    re-used later in the pattern by preceding the
    appropriate integer with a pair of backslashes,
    \\
  • Within a <replacement>, substrings that matched
    sub-patterns in the regular expression can be used
    by preceding the appropriate integer with a
    dollar sign, as in $1 or $2
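
As a small illustration of dollar references in a replacement, here is a sketch that reorders the parts of a date. The sample sentence and variable names are assumptions made for this example. It also shows the `${1}` form, which is useful when a literal digit must follow the reference.

```php
<?php
// Hypothetical input: ISO-style dates to be rewritten as day/month/year.
$dates = "Lecture on 2005-10-21, exam on 2005-12-09.";

// Three groups are captured; the replacement re-uses them in a new order.
$result = preg_replace('/(\d{4})-(\d{2})-(\d{2})/', '$3/$2/$1', $dates);
echo $result, "\n"; // Lecture on 21/10/2005, exam on 09/12/2005.

// ${1} delimits the group number so that the following 0 is a literal.
$padded = preg_replace('/(\d)/', '${1}0', 'a1b');
echo $padded, "\n"; // a10b
```

Single quotes are used here so that PHP never tries to interpret the dollar signs itself.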

5
Using sub-pattern matches in replacements
(contd.)
  • PHP code
  • <?php
  • $myString = "<p>This is paragraph 1.</p><p>This is paragraph 2.</p>";
  • echo "myString is ".str_replace("<","&lt;",$myString)." <br>";
  • $myString = preg_replace("/<(\w+)>(.*?)<\/\\1>/","$2",$myString);
  • echo "myString is now ".str_replace("<","&lt;",$myString);
  • ?>
  • Resultant output is
  • myString is <p>This is paragraph 1.</p><p>This is paragraph 2.</p>
  • myString is now This is paragraph 1.This is paragraph 2.

6
A reminder about greedy/frugal quantifiers
  • What would happen if we had used a greedy
    quantifier in previous slide?
  • PHP code
  • <?php
  • $myString = "<p>This is paragraph 1.</p><p>This is paragraph 2.</p>";
  • echo "myString is ".str_replace("<","&lt;",$myString)." <br>";
  • $myString = preg_replace("/<(\w+)>(.*)<\/\\1>/","$2",$myString);
  • echo "myString is now ".str_replace("<","&lt;",$myString);
  • ?>
  • Resultant output is
  • myString is <p>This is paragraph 1.</p><p>This is paragraph 2.</p>
  • myString is now This is paragraph 1.</p><p>This is paragraph 2.

7
Choice of regexp delimiters
  • Up to now, we have used the forward slash
    character to mark the start and end of regular
    expressions
  • In fact, we can use any non-alphanumeric
    character for the purpose, other than a
    backslash or whitespace
  • For example, instead of writing
  • /abc/
  • we could write
  • %abc%
  • This is useful when we wish to use the /
    character in our regular expression
  • See next slide

8
Using different regexp delimiters (contd.)
  • In the regular expression below,
  • we do not have to escape the / character at
    the start of the close tag, because
  • we are not using the / character as the
    regexp delimiter
  • <?php
  • $myString = "<p>This is paragraph 1.</p><p>This is paragraph 2.</p>";
  • echo "myString is ".str_replace("<","&lt;",$myString)." <br>";
  • $myString = preg_replace("%<(\w+)>(.*?)</\\1>%","$2",$myString);
  • echo "myString is now ".str_replace("<","&lt;",$myString);
  • ?>
  • Resultant output is
  • myString is <p>This is paragraph 1.</p><p>This is paragraph 2.</p>
  • myString is now This is paragraph 1.This is paragraph 2.
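
To see what a different delimiter saves, compare two spellings of the same pattern. This is only a sketch: the URL is a made-up example and `#` is just one of several delimiters PHP accepts.

```php
<?php
// Hypothetical URL used only for illustration.
$url = "http://example.com/docs/regex/intro.html";

// With / as the delimiter, every slash inside the pattern needs a backslash.
$withSlashes = preg_match('/^http:\/\/[^\/]+\/docs\//', $url);

// With # as the delimiter, the slashes can be written as they are.
$withHash = preg_match('#^http://[^/]+/docs/#', $url);

var_dump($withSlashes, $withHash); // int(1) int(1)
```

Both calls match the same text; the second is simply easier to read. Whichever delimiter is chosen must then be escaped if it appears inside the pattern itself.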

9
Using regexps to process nested HTML
  • PHP code
  • <?php
  • $myString = "<ol><li>fred</li><li>tom</li></ol>";
  • echo "myString is ".str_replace("<","&lt;",$myString)." <br>";
  • $myString = preg_replace("%<(\w+)>(.*)</\\1>%","$2",$myString);
  • echo "myString is now ".str_replace("<","&lt;",$myString);
  • ?>
  • Resultant output is
  • myString is <ol><li>fred</li><li>tom</li></ol>
  • myString is now <li>fred</li><li>tom</li>
  • Suppose we wanted to remove all pairs of HTML
    tags. That is, suppose we wanted
  • myString is <ol><li>fred</li><li>tom</li></ol>
  • myString is now fredtom
  • How would we achieve that?

10
Using regexps to process nested HTML (contd.)
  • Would a frugal quantifier do the trick? PHP code
  • <?php
  • $myString = "<ol><li>fred</li><li>tom</li></ol>";
  • echo "myString is ".str_replace("<","&lt;",$myString)." <br>";
  • $myString = preg_replace("%<(\w+)>(.*?)</\\1>%","$2",$myString);
  • echo "myString is now ".str_replace("<","&lt;",$myString);
  • ?>
  • No. The resultant output is still
  • myString is <ol><li>fred</li><li>tom</li></ol>
  • myString is now <li>fred</li><li>tom</li>
  • The reason is that, while preg_replace does
    replace all matching substrings in the target
    string, it does not perform replacement
    operations on the replacement string
  • The value <li>fred</li><li>tom</li> above is the
    result of a replacement operation, so it is not
    modified
  • However, suppose we wanted to remove all pairs,
    no matter how deep the nesting. How would we do
    that?

11
Using regexps to process nested HTML (contd.)
  • We must use repetition to attack the nested
    instances
  • <?php
  • $myString = "<ol><li>fred</li><li>tom</li></ol>";
  • echo "myString is ".str_replace("<","&lt;",$myString)." <br>";
  • $newString = preg_replace("%<(\w+)>(.*?)</\\1>%","$2",$myString);
  • while ($newString != $myString)
  • {
  •     $myString = $newString;
  •     echo "myString is now ".str_replace("<","&lt;",$myString)."<br>";
  •     $newString = preg_replace("%<(\w+)>(.*?)</\\1>%","$2",$myString);
  • }
  • ?>
  • The resultant output is now
  • myString is <ol><li>fred</li><li>tom</li></ol>
  • myString is now <li>fred</li><li>tom</li>
  • myString is now fredtom
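
The same repeat-until-unchanged idea can be packaged as a function. The version below is a sketch under stated assumptions: `stripTagPairs` is a hypothetical helper name, and, like the slide's pattern, it only handles tags without attributes. Instead of comparing strings it uses the optional count argument of preg_replace to detect when a pass made no replacements.

```php
<?php
// Hypothetical helper: strips matching tag pairs at every nesting depth.
function stripTagPairs(string $html): string
{
    // s lets the dot cross newlines inside an element's content.
    $pattern = '%<(\w+)>(.*?)</\1>%s';
    do {
        // $count receives the number of replacements made in this pass.
        $html = preg_replace($pattern, '$2', $html, -1, $count);
    } while ($count > 0);
    return $html;
}

echo stripTagPairs("<ol><li>fred</li><li>tom</li></ol>"), "\n"; // fredtom
echo stripTagPairs("<b><i>x</i></b> y"), "\n";                  // x y
```

Each pass removes one layer of tags, so the loop runs once per nesting level plus one final pass that finds nothing.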

12
Using regexps to process nested HTML (contd.)
  • Of course, we would not want to run words
    together like we did on the last slide, so we
    would use spaces in the replacement string
  • <?php
  • $myString = "<ol><li>fred</li><li>tom</li></ol>";
  • echo "myString is ".str_replace("<","&lt;",$myString)." <br>";
  • $newString = preg_replace("%<(\w+)>(.*?)</\\1>%","$2",$myString);
  • while ($newString != $myString)
  • {
  •     $myString = $newString;
  •     echo "myString is now ".str_replace("<","&lt;",$myString)."<br>";
  •     $newString = preg_replace("%<(\w+)>(.*?)</\\1>%"," $2 ",$myString);
  • }
  • ?>
  • The resultant output, as rendered by a browser
    (which collapses the doubled spaces), is now
  • myString is <ol><li>fred</li><li>tom</li></ol>
  • myString is now <li>fred</li><li>tom</li>
  • myString is now fred tom

13
More on regular expressions checking for context
  • All the preg_replace operations we have written
    so far have consumed all the characters that
    matched the regular expression
  • There was no notion of examining the context
    surrounding the consumed characters
  • any characters that were matched were consumed
  • We often need some way of matching characters
    without removing them from the target string
  • There are four meta-expressions for doing this, two
    for forward context and two for backward context

14
Look-ahead context checks
  • (?=regexp)
  • This is a positive look-ahead context check
  • It matches characters in the target string
    against the pattern specified by the embedded
    regular expression regexp without consuming them
    from the target string
  • Example
  • preg_replace("/\w+(?= cat)/","_",$myString)
  • This replaces with an underscore any word
    that is followed by a space and the word cat,
    without removing the space or the word cat from
    the target string
  • An example application is on the next slide

15
Look-ahead checks (contd.)
  • Program fragment
  • $myString = "tabby is a big cat. fido is a fat dog.";
  • echo "myString is $myString <br>";
  • $myString = preg_replace("/\w+(?= cat)/","_",$myString);
  • echo "myString is now $myString";
  • Output produced
  • myString is tabby is a big cat. fido is a fat dog.
  • myString is now tabby is a _ cat. fido is a fat dog.
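
The difference between consuming context and merely checking it shows up clearly when the two patterns are run side by side. This is a minimal sketch; the sample sentence and variable names are assumptions made for illustration.

```php
<?php
// Hypothetical sample text.
$text = "pay 100 USD or 90 EUR";

// Consuming version: " USD" is part of the match, so it is replaced too.
$consumed = preg_replace('/\d+ USD/', 'N', $text);

// Look-ahead version: " USD" is only checked, not consumed.
$kept = preg_replace('/\d+(?= USD)/', 'N', $text);

echo $consumed, "\n"; // pay N or 90 EUR
echo $kept, "\n";     // pay N USD or 90 EUR
```

In both cases only the number before USD is affected; the look-ahead version leaves the currency code in place.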

16
Look-ahead checks (contd.)
  • (?!regexp)
  • This is a negative look-ahead context check
  • It ensures that characters in the target string
    do not match the pattern specified by the
    embedded regular expression regexp
  • Example
  • preg_replace("/cow(?!boy)/","_",$myString)
  • This replaces all sub-strings "cow" with
    "_", provided these sub-strings are not followed
    by the sub-string "boy"

17
Look-ahead checks (contd.)
  • Program fragment
  • $myString = "Fred is a cowboy. Dolly is a cow.";
  • echo "myString is $myString <br>";
  • $myString = preg_replace("/cow(?!boy)/","_",$myString);
  • echo "myString is now $myString";
  • Output produced
  • myString is Fred is a cowboy. Dolly is a cow.
  • myString is now Fred is a cowboy. Dolly is a _.
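
Positive and negative look-ahead checks can be combined in a single pattern. The sketch below is a widely used idiom rather than anything from the slides, and the input string is an assumption: it inserts thousands separators by matching empty positions inside a number that are followed by one or more complete groups of three digits and then no further digit.

```php
<?php
// Hypothetical input containing one long and one short number.
$text = "1234567 and 89";

// \B        : a position between two word characters (here, two digits)
// (?=...)   : followed by one or more groups of exactly three digits
// (?!\d)    : and after those groups there is no further digit
$result = preg_replace('/\B(?=(\d{3})+(?!\d))/', ',', $text);

echo $result, "\n"; // 1,234,567 and 89
```

Nothing is consumed by this pattern at all, so the replacement only inserts commas and never removes a character.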

18
Look-behind context checks
  • (?<=regexp)
  • This is a positive look-behind context check
  • It ensures that preceding characters in the
    target string match the pattern specified by the
    embedded regular expression regexp
  • Example
  • preg_replace("/(?<=cow)boy/","girl",$myString)
  • This replaces all sub-strings "boy" with
    "girl", provided these sub-strings are preceded
    by the sub-string "cow", but the sub-string "cow"
    is not consumed.

19
Look-behind checks (contd.)
  • Program fragment
  • $myString = "Fred is a cowboy. Tom is a boy.";
  • echo "myString is $myString <br>";
  • $myString = preg_replace("/(?<=cow)boy/","girl",$myString);
  • echo "myString is now $myString";
  • Output produced
  • myString is Fred is a cowboy. Tom is a boy.
  • myString is now Fred is a cowgirl. Tom is a boy.

20
Look-behind checks (contd.)
  • (?<!regexp)
  • This is a negative look-behind context check
  • It ensures that preceding characters in the
    target string do not match the pattern specified
    by the embedded regular expression regexp
  • Example
  • preg_replace("/(?<!cow)boy/","girl",$myString)
  • This replaces all sub-strings "boy" with
    "girl", provided these sub-strings are not
    preceded by the sub-string "cow"

21
Look-behind checks (contd.)
  • Program fragment
  • $myString = "Fred is a cowboy. Tom is a boy.";
  • echo "myString is $myString <br>";
  • $myString = preg_replace("/(?<!cow)boy/","girl",$myString);
  • echo "myString is now $myString";
  • Output produced
  • myString is Fred is a cowboy. Tom is a boy.
  • myString is now Fred is a cowboy. Tom is a girl.
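
A further sketch of both look-behind forms, using a made-up price list (the text and variable names are assumptions for illustration). The first call rewrites only numbers directly preceded by a dollar sign; the second rewrites only numbers that are not part of a price.

```php
<?php
// Hypothetical sample text; single quotes keep the dollar signs literal.
$text = 'item 42 costs $15, item 7 costs $3';

// Positive look-behind: digits directly preceded by a dollar sign.
$prices = preg_replace('/(?<=\$)\d+/', 'X', $text);

// Negative look-behind: digits NOT preceded by a dollar sign or by
// another digit (the latter stops a match starting mid-number).
$ids = preg_replace('/(?<![$\d])\d+/', 'N', $text);

echo $prices, "\n"; // item 42 costs $X, item 7 costs $X
echo $ids, "\n";    // item N costs $15, item N costs $3
```

The dollar sign itself survives in the first result because the look-behind checks it without consuming it.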

22
Regexp pattern modifiers
  • We have seen that a regexp is of the form /.../
    where the slash characters are delimiters (and
    could be replaced by other non-alphanumeric
    printable characters)
  • The terminating character can be followed by a
    sequence of modifiers which affect the meaning of
    the regexp between the delimiting characters

23
Example pattern modifier: the caseless match
modifier
  • Program fragment
  • $myString = "Fred is a boy. Tom is a BOY.";
  • echo "myString is $myString <br>";
  • $newstring1 = preg_replace("/boy/","_",$myString);
  • echo "newstring1 is $newstring1 <br>";
  • $newstring2 = preg_replace("/boy/i","_",$myString);
  • echo "newstring2 is $newstring2";
  • Output produced
  • myString is Fred is a boy. Tom is a BOY.
  • newstring1 is Fred is a _. Tom is a BOY.
  • newstring2 is Fred is a _. Tom is a _.

24
Contrast these 2 examples
  • Program fragment 1
  • <?php
  • $oldstring1 = "<p>Fred is a boy.</p>";
  • echo "oldstring1 is ".str_replace("<","&lt;",$oldstring1)." <br>";
  • $newstring1 = preg_replace("%<p>.*</p>%","_",$oldstring1);
  • echo "newstring1 is ".str_replace("<","&lt;",$newstring1);
  • ?>
  • Output produced
  • oldstring1 is <p>Fred is a boy.</p>
  • newstring1 is _

25
  • Program fragment 2
  • <?php
  • $oldstring1 = "<p>Fred
  • is a boy.</p>";
  • echo "oldstring1 is ".str_replace("<","&lt;",$oldstring1)." <br>";
  • $newstring1 = preg_replace("%<p>.*</p>%","_",$oldstring1);
  • echo "newstring1 is ".str_replace("<","&lt;",$newstring1);
  • ?>
  • Output produced
  • oldstring1 is <p>Fred is a boy.</p>
  • newstring1 is <p>Fred is a boy.</p>
  • Why no replacement? Why no match?
  • Answer
  • the target string is a multi-line string
  • the . meta-character does not match newline
    characters

26
The dot-all modifier
  • Program fragment
  • <?php
  • $oldstring1 = "<p>Fred
  • is a boy.</p>";
  • echo "oldstring1 is ".str_replace("<","&lt;",$oldstring1)." <br>";
  • $newstring1 = preg_replace("%<p>.*</p>%s","_",$oldstring1);
  • echo "newstring1 is ".str_replace("<","&lt;",$newstring1);
  • ?>
  • Output produced
  • oldstring1 is <p>Fred is a boy.</p>
  • newstring1 is _
  • The dot-all modifier, s, says that the dot
    meta-character should match all characters,
    including newlines

27
The dot-all modifier again
  • Program fragment 2
  • <?php
  • $oldstring1 = "<p>Fred \n is a boy.</p>";
  • echo "oldstring1 is ".str_replace("<","&lt;",$oldstring1)." <br>";
  • $newstring1 = preg_replace("%<p>.*</p>%s","_",$oldstring1);
  • echo "newstring1 is ".str_replace("<","&lt;",$newstring1);
  • ?>
  • Output produced
  • oldstring1 is <p>Fred is a boy.</p>
  • newstring1 is _
  • \n is a newline but the dot-all modifier says
    that the dot meta-character should match all
    characters, including newlines
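
Modifiers can be written one after another following the closing delimiter. The sketch below assumes a subject string (made up for this example) that needs both the caseless and the dot-all modifier before the pattern matches, and tries each combination in turn.

```php
<?php
// Hypothetical subject: upper-case tags and an embedded newline.
$text = "<P>Fred\nis a boy.</P>";

$plain    = preg_match('%<p>.*</p>%', $text);   // 0: case differs, dot stops at the newline
$caseless = preg_match('%<p>.*</p>%i', $text);  // 0: dot still stops at the newline
$dotAll   = preg_match('%<p>.*</p>%s', $text);  // 0: case still differs
$both     = preg_match('%<p>.*</p>%is', $text); // 1: both obstacles removed

var_dump($plain, $caseless, $dotAll, $both);    // int(0) int(0) int(0) int(1)
```

The order of the modifier letters does not matter, so `is` and `si` behave the same way.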

28
The Ungreedy modifier
  • This is very similar to the use of the ?
    character to stop quantifiers being greedy

29
Example usage of the ungreedy modifier
  • Program fragment
  • $myString = "<p>Fred is a boy.</p><p>Ann is a girl.</p>";
  • echo "myString is ".str_replace("<","&lt;",$myString)." <br>";
  • $myString = preg_replace("%<p>(.*)</p>%U","$1",$myString);
  • echo "myString is now ".str_replace("<","&lt;",$myString);
  • Output produced
  • myString is <p>Fred is a boy.</p><p>Ann is a girl.</p>
  • myString is now Fred is a boy.Ann is a girl.
  • The * meta-character is normally greedy but the U
    modifier has made it ungreedy

30
2nd Example usage of U modifier, part 1
  • Program fragment 1
  • $myString = "<p>Fred is a boy.</p><p>Ann is a girl.</p><hr>";
  • echo "myString is $myString <br>";
  • $myString = preg_replace("%<p>(.*?)</p>(.*)%","$1x$2",$myString);
  • echo "myString is now $myString";
  • Output produced
  • myString is <p>Fred is a boy.</p><p>Ann is a girl.</p><hr>
  • myString is now Fred is a boy.x<p>Ann is a girl.</p><hr>
  • Here, the first * is frugal but the second * is
    greedy

31
2nd Example usage of U modifier, part 2
  • Program fragment 2
  • $myString = "<p>Fred is a boy.</p><p>Ann is a girl.</p><hr>";
  • echo "myString is $myString <br>";
  • $myString = preg_replace("%<p>(.*?)</p>(.*)%U","$1x$2",$myString);
  • echo "myString is now $myString";
  • Output produced
  • myString is <p>Fred is a boy.</p><p>Ann is a girl.</p><hr>
  • myString is now Fred is a boy.</p><p>Ann is a girl.x<hr>
  • Here, the first * is greedy but the second * is
    frugal
  • That is, the U modifier reverses the meaning of
    the presence or absence of the ? character after
    a quantifier
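
The reversal can be seen by capturing with all four combinations of quantifier and modifier. This is a minimal sketch; the subject string and variable names are assumptions for illustration.

```php
<?php
// Hypothetical subject with two adjacent paragraphs.
$text = "<p>one</p><p>two</p>";

preg_match('%<p>(.*)</p>%',   $text, $greedy);      // greedy by default
preg_match('%<p>(.*?)</p>%',  $text, $lazy);        // ? makes it frugal
preg_match('%<p>(.*)</p>%U',  $text, $flipped);     // U makes it frugal
preg_match('%<p>(.*?)</p>%U', $text, $flippedBack); // U plus ? is greedy again

echo $greedy[1], "\n";      // one</p><p>two
echo $lazy[1], "\n";        // one
echo $flipped[1], "\n";     // one
echo $flippedBack[1], "\n"; // one</p><p>two
```

Because U silently changes what every quantifier in the pattern means, many programmers prefer the explicit ? on individual quantifiers; that is a style preference rather than a rule.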

32
More on multi-line strings
  • By default, the subject string is regarded as
    consisting of a single "line" of characters (even
    if it actually contains several newlines).
  • The "start of line" metacharacter (^) matches
    only at the start of the string
  • The "end of line" metacharacter ($) matches only
    at the end of the string (or just before a
    final newline)
  • Suppose, however, that we want these
    metacharacters to also match at newlines inside
    the string

33
Multi-line strings(contd.)
  • Program fragment
  • $myString = "<p>Fred is a boy.</p>\n<p>Ann is a girl.</p>";
  • echo "myString is $myString <br>";
  • $myString = preg_replace("%^<p>(.*)</p>$%","x$1y",$myString);
  • echo "myString is now $myString";
  • Output produced
  • myString is <p>Fred is a boy.</p> <p>Ann is a girl.</p>
  • myString is now <p>Fred is a boy.</p> <p>Ann is a girl.</p>
  • On the one hand, the ^ and $ characters did not
    match at the newline in the middle of the string
  • On the other hand, the dot did not match the
    newline

34
The multi-line modifier
  • Program fragment
  • $myString = "<p>Fred is a boy.</p>\n<p>Ann is a girl.</p>";
  • echo "myString is $myString <br>";
  • $myString = preg_replace("%^<p>(.*)</p>$%m","x$1y",$myString);
  • echo "myString is now $myString";
  • Output produced
  • myString is <p>Fred is a boy.</p> <p>Ann is a girl.</p>
  • myString is now xFred is a boy.y xAnn is a girl.y
  • The multi-line modifier m has made the ^ and $
    characters match at the newline in the middle of
    the string
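
A short sketch of the m modifier on a plain three-line string (the text and variable names are assumptions for illustration). It collects the first word of each line and then replaces the last word of each line.

```php
<?php
// Hypothetical three-line subject string.
$text = "alpha one\nbeta two\ngamma three";

// Without m, ^ matches only at the very start of the string.
preg_match_all('/^\w+/', $text, $withoutM);
// With m, ^ also matches immediately after each newline.
preg_match_all('/^\w+/m', $text, $withM);

print_r($withoutM[0]); // alpha
print_r($withM[0]);    // alpha, beta, gamma

// With m, $ also matches immediately before each newline.
$marked = preg_replace('/\w+$/m', 'END', $text);
echo $marked, "\n";    // three lines, each now ending in END
```

The newlines themselves are never part of a match here, so the line structure of the string is preserved.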

35
CS4408 got here on 21 oct 2005
36
More PHP functions for using regular expressions
  • To be continued in next slide set