Title: Regular expressions (contd.) -- remembering subpattern matches
1. Regular expressions (contd.) -- remembering subpattern matches
- When a <pattern> is being matched with a target string, substrings that match sub-patterns can be remembered and re-used later in the same pattern
- Sub-patterns whose matching substrings are to be remembered are enclosed in parentheses
- The sub-patterns are implicitly numbered, starting from 1, and their matching substrings can then be re-used later in the pattern by using back-references like \1 or \2 or \3
- However, inside a double-quoted PHP string we need to escape the backslash to get it, so we must type \\1 or \\2 or \\3 in our regular expressions
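- A minimal sketch of a back-reference at work, assuming a made-up input sentence. It is written twice: once in a single-quoted string, where PHP passes \1 through unchanged, and once in a double-quoted string, where the backslashes are doubled as described above. The variable names are illustrative only.

```php
<?php
// Hypothetical input containing two accidentally doubled words.
$text = "this is is a test of of back-references";

// (\w+) remembers a word as sub-pattern 1; \1 then demands the same word again.
// Single-quoted string: \1 reaches the regexp engine as written.
$doubledSingle = preg_replace('/\b(\w+) \1\b/', '_', $text);

// Double-quoted string: each backslash is escaped, so \\1 becomes \1.
$doubledDouble = preg_replace("/\\b(\\w+) \\1\\b/", '_', $text);

echo $doubledSingle, "\n";   // this _ a test _ back-references
```

- Both calls hand the same pattern to the regexp engine, so both give the same result.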
2. Using back-references (contd.)
- PHP code
    <?php
    $myString1 = "klmAklmAAklmABklmBklmBBklm";
    echo "myString1 is $myString1 <br>";
    $myString1 = preg_replace("/([A-Z])\\1/","_",$myString1);
    echo "myString1 is now $myString1 ";
    ?>
- Resultant output is
    myString1 is klmAklmAAklmABklmBklmBBklm
    myString1 is now klmAklm_klmABklmBklm_klm
3. Using back-references (contd.)
- PHP code
    <?php
    $myString = "klmAklmAAklmABklmBklmBBklm";
    echo "myString is $myString <br>";
    $myString = preg_replace("/([A-Z])\\1/","_",$myString);
    echo "myString is now $myString ";
    ?>
- Resultant output is
    myString is klmAklmAAklmABklmBklmBBklm
    myString is now klmAklm_klmABklmBklm_klm
4. Regular expressions (contd.) -- using subpattern matches in replacements
- We saw that, within a regular expression, substrings that matched sub-patterns can be re-used later in the pattern by preceding the appropriate integer with a pair of backslashes, \\
- Within a <replacement>, substrings that matched sub-patterns in the regular expression can be used by preceding the appropriate integer with a dollar sign, as in $1 or $2
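- A small sketch of dollar references in a replacement, assuming a hypothetical "surname, forename" input. The second call shows the ${1} form, which PHP accepts when the reference is immediately followed by another digit.

```php
<?php
// Hypothetical input: a name written surname first.
$name = "Turing, Alan";

// Sub-pattern 1 is the surname, sub-pattern 2 the forename;
// the replacement re-uses them in the opposite order.
$swapped = preg_replace('/(\w+), (\w+)/', '$2 $1', $name);
echo $swapped, "\n";   // Alan Turing

// '$10' would mean sub-pattern 10, so braces mark where the number ends.
$padded = preg_replace('/(\d)/', '${1}0', "room 7");
echo $padded, "\n";    // room 70
```

- Single-quoted strings are used here so that PHP does not try to interpret the dollar signs itself.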
5. Using sub-pattern matches in replacements (contd.)
- PHP code
    <?php
    $myString = "<p>This is paragraph 1.</p><p>This is paragraph 2.</p>";
    echo "myString is ".str_replace("<","&lt;",$myString)." <br>";
    $myString = preg_replace("/<(\w+)>(.*?)<\/\\1>/","$2",$myString);
    echo "myString is now ".str_replace("<","&lt;",$myString);
    ?>
- Resultant output is
    myString is <p>This is paragraph 1.</p><p>This is paragraph 2.</p>
    myString is now This is paragraph 1.This is paragraph 2.
6. A reminder about greedy/frugal quantifiers
- What would happen if we had used a greedy quantifier in the previous slide?
- PHP code
    <?php
    $myString = "<p>This is paragraph 1.</p><p>This is paragraph 2.</p>";
    echo "myString is ".str_replace("<","&lt;",$myString)." <br>";
    $myString = preg_replace("/<(\w+)>(.*)<\/\\1>/","$2",$myString);
    echo "myString is now ".str_replace("<","&lt;",$myString);
    ?>
- Resultant output is
    myString is <p>This is paragraph 1.</p><p>This is paragraph 2.</p>
    myString is now This is paragraph 1.</p><p>This is paragraph 2.
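- The same contrast can be seen in isolation with preg_match, which stores what each sub-pattern captured. This is only a sketch; the quoted-text input is made up for illustration.

```php
<?php
// Hypothetical input with two quoted words.
$text = 'say "yes" or "no"';

// Greedy: .* grabs as much as it can, so the match runs to the LAST quote.
preg_match('/"(.*)"/', $text, $greedy);

// Frugal: .*? grabs as little as it can, so the match stops at the NEXT quote.
preg_match('/"(.*?)"/', $text, $frugal);

echo $greedy[1], "\n";   // yes" or "no
echo $frugal[1], "\n";   // yes
```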
7. Choice of regexp delimiters
- Up to now, we have used the forward slash character to mark the start and end of regular expressions
- In fact, we can use any non-alphanumeric, non-whitespace character other than the backslash for the purpose
- For example, instead of writing
    /abc/
- we could write
    %abc%
- This is useful when we wish to use the / character in our regular expression
- See next slide
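- A brief sketch of why the choice matters, assuming a hypothetical file path as the target string. The same replacement is written with three different delimiters; only the slash-delimited version needs its slashes escaped.

```php
<?php
// Hypothetical target string containing several slashes.
$path = "/usr/local/bin/php";

// Slash as delimiter: every / inside the pattern must be escaped.
$withSlash   = preg_replace('/\/usr\/local\//', '/opt/', $path);

// Other delimiters: the / characters can be written as they are.
$withHash    = preg_replace('#/usr/local/#', '/opt/', $path);
$withPercent = preg_replace('%/usr/local/%', '/opt/', $path);

echo $withSlash, "\n";   // /opt/bin/php
```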
8. Using different regexp delimiters (contd.)
- In the regular expression below, we do not have to escape the / character at the start of the close tag, because we are not using the / character as the regexp delimiter
    <?php
    $myString = "<p>This is paragraph 1.</p><p>This is paragraph 2.</p>";
    echo "myString is ".str_replace("<","&lt;",$myString)." <br>";
    $myString = preg_replace("%<(\w+)>(.*?)</\\1>%","$2",$myString);
    echo "myString is now ".str_replace("<","&lt;",$myString);
    ?>
- Resultant output is
    myString is <p>This is paragraph 1.</p><p>This is paragraph 2.</p>
    myString is now This is paragraph 1.This is paragraph 2.
9. Using regexps to process nested HTML
- PHP code
    <?php
    $myString = "<ol><li>fred</li><li>tom</li></ol>";
    echo "myString is ".str_replace("<","&lt;",$myString)." <br>";
    $myString = preg_replace("%<(\w+)>(.*)</\\1>%","$2",$myString);
    echo "myString is now ".str_replace("<","&lt;",$myString);
    ?>
- Resultant output is
    myString is <ol><li>fred</li><li>tom</li></ol>
    myString is now <li>fred</li><li>tom</li>
- Suppose we wanted to remove all pairs of HTML tags. That is, suppose we wanted
    myString is <ol><li>fred</li><li>tom</li></ol>
    myString is now fredtom
- How would we achieve that?
10. Using regexps to process nested HTML (contd.)
- Would a frugal quantifier do the trick? PHP code
    <?php
    $myString = "<ol><li>fred</li><li>tom</li></ol>";
    echo "myString is ".str_replace("<","&lt;",$myString)." <br>";
    $myString = preg_replace("%<(\w+)>(.*?)</\\1>%","$2",$myString);
    echo "myString is now ".str_replace("<","&lt;",$myString);
    ?>
- No. The resultant output is still
    myString is <ol><li>fred</li><li>tom</li></ol>
    myString is now <li>fred</li><li>tom</li>
- The reason is that, while preg_replace does replace all matching substrings in the target string, it does not perform replacement operations on the replacement string
- The value <li>fred</li><li>tom</li> above is the result of a replacement operation, so it is not modified
- However, suppose we wanted to remove all pairs, no matter how deep the nesting. How would we do that?
11. Using regexps to process nested HTML (contd.)
- We must use repetition to attack the nested instances
    <?php
    $myString = "<ol><li>fred</li><li>tom</li></ol>";
    echo "myString is ".str_replace("<","&lt;",$myString)." <br>";
    $newString = preg_replace("%<(\w+)>(.*?)</\\1>%","$2",$myString);
    while ($newString != $myString)
     { $myString = $newString;
       echo "myString is now ".str_replace("<","&lt;",$myString)." <br>";
       $newString = preg_replace("%<(\w+)>(.*?)</\\1>%","$2",$myString);
     }
    ?>
- The resultant output is now
    myString is <ol><li>fred</li><li>tom</li></ol>
    myString is now <li>fred</li><li>tom</li>
    myString is now fredtom
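- An alternative sketch of the same repetition, not taken from the slides: preg_replace accepts a fifth argument that receives the number of replacements made, so the loop can stop as soon as a pass changes nothing. The counter $passes is added purely for illustration.

```php
<?php
$myString = "<ol><li>fred</li><li>tom</li></ol>";
$passes = 0;   // illustrative counter: passes that actually changed something

do {
    // -1 means "no limit"; $count receives the number of replacements made.
    $myString = preg_replace('%<(\w+)>(.*?)</\1>%', '$2', $myString, -1, $count);
    if ($count > 0) {
        $passes++;
    }
} while ($count > 0);

echo $myString, " after ", $passes, " passes\n";   // fredtom after 2 passes
```

- The first pass strips the outer <ol> pair, the second strips both <li> pairs, and the third finds nothing left to replace.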
12. Using regexps to process nested HTML (contd.)
- Of course, we would not want to run words together like we did on the last slide, so we would use spaces in the replacement string
    <?php
    $myString = "<ol><li>fred</li><li>tom</li></ol>";
    echo "myString is ".str_replace("<","&lt;",$myString)." <br>";
    $newString = preg_replace("%<(\w+)>(.*?)</\\1>%","$2",$myString);
    while ($newString != $myString)
     { $myString = $newString;
       echo "myString is now ".str_replace("<","&lt;",$myString)." <br>";
       $newString = preg_replace("%<(\w+)>(.*?)</\\1>%"," $2 ",$myString);
     }
    ?>
- The resultant output (as displayed by a browser, which collapses the extra spaces) is now
    myString is <ol><li>fred</li><li>tom</li></ol>
    myString is now <li>fred</li><li>tom</li>
    myString is now fred tom
13. More on regular expressions -- checking for context
- All the preg_replace operations we have written so far have consumed all the characters that matched the regular expression
- There was no notion of examining the context surrounding the consumed characters
- any characters that were matched were consumed
- We often need some way of matching characters without removing them from the target string
- There are four meta-expressions for doing this, two for forward context and two for backward context
14. Look-ahead context checks
- (?=regexp)
- This is a positive lookahead context check
- It matches characters in the target string against the pattern specified by the embedded regular expression regexp without consuming them from the target string
- Example
    preg_replace("/\w+(?= cat)/","_",$myString)
- This replaces with an underscore any word that is followed by a space and the word cat, without removing the space or the word cat from the target string
- An example application is on the next slide
15. Look-ahead checks (contd.)
- Program fragment
    $myString = "tabby is a big cat. fido is a fat dog.";
    echo "myString is $myString <br>";
    $myString = preg_replace("/\w+(?= cat)/","_",$myString);
    echo "myString is now $myString";
- Output produced
    myString is tabby is a big cat. fido is a fat dog.
    myString is now tabby is a _ cat. fido is a fat dog.
16. Look-ahead checks (contd.)
- (?!regexp)
- This is a negative lookahead context check
- It ensures that the characters that follow in the target string do not match the pattern specified by the embedded regular expression regexp
- Example
    preg_replace("/cow(?!boy)/","_",$myString)
- This replaces all sub-strings cow with _, provided these sub-strings are not followed by the sub-string boy
17. Look-ahead checks (contd.)
- Program fragment
    $myString = "Fred is a cowboy. Dolly is a cow.";
    echo "myString is $myString <br>";
    $myString = preg_replace("/cow(?!boy)/","_",$myString);
    echo "myString is now $myString";
- Output produced
    myString is Fred is a cowboy. Dolly is a cow.
    myString is now Fred is a cowboy. Dolly is a _.
18. Look-behind context checks
- (?<=regexp)
- This is a positive look-behind context check
- It ensures that preceding characters in the target string match the pattern specified by the embedded regular expression regexp
- Example
    preg_replace("/(?<=cow)boy/","girl",$myString)
- This replaces all sub-strings boy with girl, provided these sub-strings are preceded by the sub-string cow, but the sub-string cow is not consumed.
19. Look-behind checks (contd.)
- Program fragment
    $myString = "Fred is a cowboy. Tom is a boy.";
    echo "myString is $myString <br>";
    $myString = preg_replace("/(?<=cow)boy/","girl",$myString);
    echo "myString is now $myString";
- Output produced
    myString is Fred is a cowboy. Tom is a boy.
    myString is now Fred is a cowgirl. Tom is a boy.
20. Look-behind checks (contd.)
- (?<!regexp)
- This is a negative look-behind context check
- It ensures that preceding characters in the target string do not match the pattern specified by the embedded regular expression regexp
- Example
    preg_replace("/(?<!cow)boy/","girl",$myString)
- This replaces all sub-strings boy with girl, provided these sub-strings are not preceded by the sub-string cow
21. Look-behind checks (contd.)
- Program fragment
    $myString = "Fred is a cowboy. Tom is a boy.";
    echo "myString is $myString <br>";
    $myString = preg_replace("/(?<!cow)boy/","girl",$myString);
    echo "myString is now $myString";
- Output produced
    myString is Fred is a cowboy. Tom is a boy.
    myString is now Fred is a cowboy. Tom is a girl.
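- Context checks can also be combined in one pattern. The sketch below uses made-up inputs: the first pattern consumes only boy while checking both sides of it, and the second consumes nothing at all, so the replacement text is simply inserted at each position where both checks succeed.

```php
<?php
// Look-behind and look-ahead around the same consumed text.
// Hypothetical input; "boy" is replaced only after "cow" and not before "s".
$people = "one cowboy, two cowboys, one boy";
$changed = preg_replace('/(?<=cow)boy(?!s)/', 'girl', $people);
echo $changed, "\n";      // one cowgirl, two cowboys, one boy

// A pattern made only of context checks matches an empty position:
// a digit must come before it, and whole groups of three digits
// must run from it to the end of the string.
$number = "1234567";
$withCommas = preg_replace('/(?<=\d)(?=(\d{3})+$)/', ',', $number);
echo $withCommas, "\n";   // 1,234,567
```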
22. Regexp pattern modifiers
- We have seen that a regexp is of the form /../ where the slash characters are delimiters (and could be replaced by other non-alphanumeric printable characters)
- The terminating character can be followed by a sequence of modifiers which affect the meaning of the regexp between the delimiting characters
23. Example pattern modifier: the caseless match modifier
- Program fragment
    $myString = "Fred is a boy. Tom is a BOY.";
    echo "myString is $myString <br>";
    $newstring1 = preg_replace("/boy/","_",$myString);
    echo "newstring1 is $newstring1 <br>";
    $newstring2 = preg_replace("/boy/i","_",$myString);
    echo "newstring2 is $newstring2";
- Output produced
    myString is Fred is a boy. Tom is a BOY.
    newstring1 is Fred is a _. Tom is a BOY.
    newstring2 is Fred is a _. Tom is a _.
24. Contrast these 2 examples
- Program fragment 1
    <?php
    $oldstring1 = "<p>Fred is a boy.</p>";
    echo "oldstring1 is ".str_replace("<","&lt;",$oldstring1)." <br>";
    $newstring1 = preg_replace("%<p>.*</p>%","_",$oldstring1);
    echo "newstring1 is ".str_replace("<","&lt;",$newstring1);
    ?>
- Output produced
    oldstring1 is <p>Fred is a boy.</p>
    newstring1 is _
25.
- Program fragment 2
    <?php
    $oldstring1 = "<p>Fred
    is a boy.</p>";
    echo "oldstring1 is ".str_replace("<","&lt;",$oldstring1)." <br>";
    $newstring1 = preg_replace("%<p>.*</p>%","_",$oldstring1);
    echo "newstring1 is ".str_replace("<","&lt;",$newstring1);
    ?>
- Output produced
    oldstring1 is <p>Fred is a boy.</p>
    newstring1 is <p>Fred is a boy.</p>
- Why no replacement? Why no match?
- Answer
    the target string is a multi-line string
    the . meta-character does not match newline characters
26. The dot-all modifier
- Program fragment
    <?php
    $oldstring1 = "<p>Fred
    is a boy.</p>";
    echo "oldstring1 is ".str_replace("<","&lt;",$oldstring1)." <br>";
    $newstring1 = preg_replace("%<p>.*</p>%s","_",$oldstring1);
    echo "newstring1 is ".str_replace("<","&lt;",$newstring1);
    ?>
- Output produced
    oldstring1 is <p>Fred is a boy.</p>
    newstring1 is _
- The dot-all modifier, s, says that the dot meta-character should match all characters, including newlines
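- Because modifiers form a sequence, several can follow the closing delimiter at once. The sketch below assumes a made-up target whose opening tag is in upper case and which contains a newline, so neither modifier is enough on its own. preg_match is used because it simply returns 1 for a match and 0 for no match.

```php
<?php
// Hypothetical target: upper-case opening tag, newline inside the paragraph.
$html = "<P>Fred\nis a boy.</p>";

$plain    = preg_match('%<p>.*</p>%',   $html);  // 0: <p> does not match <P>
$caseless = preg_match('%<p>.*</p>%i',  $html);  // 0: the dot stops at the newline
$dotAll   = preg_match('%<p>.*</p>%s',  $html);  // 0: case still differs
$both     = preg_match('%<p>.*</p>%is', $html);  // 1: both obstacles removed

echo "$plain $caseless $dotAll $both\n";         // 0 0 0 1
```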
27. The dot-all modifier again
- Program fragment 2
    <?php
    $oldstring1 = "<p>Fred \n is a boy.</p>";
    echo "oldstring1 is ".str_replace("<","&lt;",$oldstring1)." <br>";
    $newstring1 = preg_replace("%<p>.*</p>%s","_",$oldstring1);
    echo "newstring1 is ".str_replace("<","&lt;",$newstring1);
    ?>
- Output produced
    oldstring1 is <p>Fred is a boy.</p>
    newstring1 is _
- \n is a newline but the dot-all modifier says that the dot meta-character should match all characters, including newlines
28. The Ungreedy modifier
- This is very similar to the use of the ? character to stop quantifiers being greedy
29. Example usage of the ungreedy modifier
- Program fragment
    $myString = "<p>Fred is a boy.</p><p>Ann is a girl.</p>";
    echo "myString is ".str_replace("<","&lt;",$myString)." <br>";
    $myString = preg_replace("%<p>(.*)</p>%U","$1",$myString);
    echo "myString is now ".str_replace("<","&lt;",$myString);
- Output produced
    myString is <p>Fred is a boy.</p><p>Ann is a girl.</p>
    myString is now Fred is a boy.Ann is a girl.
- The * meta-character is normally greedy but the U modifier has made it ungreedy
30. 2nd Example usage of U modifier, part 1
- Program fragment 1
    $myString = "<p>Fred is a boy.</p><p>Ann is a girl.</p><hr>";
    echo "myString is $myString <br>";
    $myString = preg_replace("%<p>(.*?)</p>(.*)%","$1x$2",$myString);
    echo "myString is now $myString";
- Output produced
    myString is <p>Fred is a boy.</p><p>Ann is a girl.</p><hr>
    myString is now Fred is a boy.x<p>Ann is a girl.</p><hr>
- Here, the first * is frugal but the second is greedy
31. 2nd Example usage of U modifier, part 2
- Program fragment 2
    $myString = "<p>Fred is a boy.</p><p>Ann is a girl.</p><hr>";
    echo "myString is $myString <br>";
    $myString = preg_replace("%<p>(.*?)</p>(.*)%U","$1x$2",$myString);
    echo "myString is now $myString";
- Output produced
    myString is <p>Fred is a boy.</p><p>Ann is a girl.</p><hr>
    myString is now Fred is a boy.</p><p>Ann is a girl.x<hr>
- Here, the first * is greedy but the second is frugal
- That is, the U modifier reverses the meaning of the presence or absence of the ? character after a quantifier
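- The reversal can be checked directly by capturing with all four combinations. This is a sketch with a made-up target string; preg_match stores the text captured by sub-pattern 1 in element [1] of its third argument.

```php
<?php
// Hypothetical target with two adjacent tag pairs.
$html = "<b>one</b><b>two</b>";

preg_match('%<b>(.*)</b>%',   $html, $greedy);     // no ?, no U : greedy
preg_match('%<b>(.*?)</b>%',  $html, $frugal);     // ?,    no U : frugal
preg_match('%<b>(.*)</b>%U',  $html, $uPlain);     // no ?, U    : frugal
preg_match('%<b>(.*?)</b>%U', $html, $uQuestion);  // ?,    U    : greedy again

echo $greedy[1], "\n";      // one</b><b>two
echo $frugal[1], "\n";      // one
echo $uPlain[1], "\n";      // one
echo $uQuestion[1], "\n";   // one</b><b>two
```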
32. More on multi-line strings
- By default, the subject string is regarded as consisting of a single "line" of characters (even if it actually contains several newlines).
- The "start of line" metacharacter (^) matches only at the start of the string
- The "end of line" metacharacter ($) matches only at the end of the string (or just before a newline that ends it)
- Suppose, however, that we want these metacharacters to also match at newlines inside the string
33. Multi-line strings (contd.)
- Program fragment
    $myString = "<p>Fred is a boy.</p>\n<p>Ann is a girl.</p>";
    echo "myString is $myString <br>";
    $myString = preg_replace("%^<p>(.*)</p>$%","x$1y",$myString);
    echo "myString is now $myString";
- Output produced
    myString is <p>Fred is a boy.</p><p>Ann is a girl.</p>
    myString is now <p>Fred is a boy.</p><p>Ann is a girl.</p>
- On the one hand, the ^ and $ characters did not match at the newline in the middle of the string
- On the other hand, the dot did not match the newline
34. The multi-line modifier
- Program fragment
    $myString = "<p>Fred is a boy.</p>\n<p>Ann is a girl.</p>";
    echo "myString is $myString <br>";
    $myString = preg_replace("%^<p>(.*)</p>$%m","x$1y",$myString);
    echo "myString is now $myString";
- Output produced
    myString is <p>Fred is a boy.</p><p>Ann is a girl.</p>
    myString is now xFred is a boy.yxAnn is a girl.y
- The multiline modifier m has made the ^ and $ characters match at the newline in the middle of the string
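- A compact sketch of the m modifier on its own, assuming a made-up three-line target. preg_match_all collects every match into element [0] of its third argument, which makes it easy to see where ^ and $ were allowed to match.

```php
<?php
// Hypothetical three-line target string.
$text = "alpha one\nbeta two\ngamma three";

// Without m, ^ matches only at the very start of the string.
preg_match_all('/^\w+/', $text, $withoutM);

// With m, ^ also matches just after each newline ...
preg_match_all('/^\w+/m', $text, $firstWords);

// ... and $ also matches just before each newline.
preg_match_all('/\w+$/m', $text, $lastWords);

echo implode(",", $withoutM[0]), "\n";     // alpha
echo implode(",", $firstWords[0]), "\n";   // alpha,beta,gamma
echo implode(",", $lastWords[0]), "\n";    // one,two,three
```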
35. CS4408 got here on 21 Oct 2005
36. More PHP functions for using regular expressions
- To be continued in next slide set