Title: Chapter 18 Regular Expressions
1 Chapter 18: Regular Expressions
2 Outline
- The Structure of a Regular Expression
- Representing Characters
- Special concepts in XQuery
3 Regular Expressions
- What are they?
- patterns that describe strings
- How can we use them?
- to determine whether a string value matches a particular pattern (matches)
- to replace parts of a string that match a pattern (replace)
- to tokenize strings based on a delimiter pattern (tokenize)
4 The Structure of a Regular Expression
- is based on that of XML Schema
- can be composed of a number of different parts
- atoms
- quantifiers
- branches
- An atom can be:
- 1. a single character, such as d
- 2. an escape sequence that represents one or more characters, like \s or \p{Lu}
- 3. a character class expression that represents a range or choice of several characters, such as [a-z]
5 Quantifiers
- A quantifier indicates the number of times a matching string may appear
- It appears directly after an atom
6 Quantifiers
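The behaviour of the common quantifiers can be sketched as follows. This is a minimal illustration, assuming an XQuery 1.0 processor; the sample strings are made up for the example, and the ^ and $ anchors (covered on a later slide) are used only to force the whole string to match.

```xquery
(: Each quantifier applies to the atom directly before it, here the letter o. :)
(
  fn:matches("f",     "^fo?$"),     (: true:  ? allows zero or one o :)
  fn:matches("foo",   "^fo?$"),     (: false: two o's is too many for ? :)
  fn:matches("f",     "^fo*$"),     (: true:  * allows zero or more :)
  fn:matches("f",     "^fo+$"),     (: false: + requires at least one :)
  fn:matches("fooo",  "^fo+$"),     (: true :)
  fn:matches("foo",   "^fo{2}$"),   (: true:  exactly two :)
  fn:matches("foooo", "^fo{1,3}$"), (: false: four is outside one to three :)
  fn:matches("foooo", "^fo{2,}$")   (: true:  two or more :)
)
```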
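7 Parenthesized Sub-Expressions and Branches
- A parenthesized sub-expression can be used as an atom in a larger regular expression
- fo* matches fooo, but not fofo, because the quantifier applies only to the o
- (fo)* applies the quantifier to the whole sequence fo, so it matches fofo
- Branches, separated by a vertical bar (|), specify a choice between several different patterns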
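A short sketch of the difference between quantifying a single character and quantifying a parenthesized sub-expression, followed by a branch. It assumes an XQuery 1.0 processor; the strings foo, bar, and baz in the branch example are illustrative and not taken from the slides.

```xquery
(
  fn:matches("fooo", "^fo*$"),       (: true:  * repeats only the o :)
  fn:matches("fofo", "^fo*$"),       (: false: the second f is not allowed :)
  fn:matches("fofo", "^(fo)*$"),     (: true:  * repeats the whole group fo :)
  fn:matches("bar",  "^(foo|bar)$"), (: true:  second branch matches :)
  fn:matches("baz",  "^(foo|bar)$")  (: false: neither branch matches :)
)
```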
8 Representing Characters
- Representing Individual Characters
- Representing Any Character
- Representing Groups of Characters
9 Representing Individual Characters
- A single character can have a quantifier associated with it
- d+ef matches the strings def, ddef, dddef
- Certain characters, in order to be taken literally, must be escaped
- the asterisk (*) will be treated like a quantifier unless it is escaped
10 Representing Individual Characters
- These characters are escaped by preceding them
with a backslash.
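The effect of escaping can be seen by comparing an escaped and an unescaped asterisk. This is a sketch assuming an XQuery 1.0 processor; the sample strings are invented for illustration. Note that the backslash needs no extra escaping inside an XQuery string literal.

```xquery
(
  fn:matches("2*3", "^2\*3$"), (: true:  \* is a literal asterisk :)
  fn:matches("223", "^2\*3$"), (: false: the middle character is not an asterisk :)
  fn:matches("223", "^2*3$"),  (: true:  unescaped, * means zero or more 2s :)
  fn:matches("2*3", "^2*3$")   (: false: the literal asterisk in the input is not matched :)
)
```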
11 Representing Individual Characters
- Use the standard XML syntax for character references and predefined entity references in regular expressions
- a space: &#x20;
- a less-than symbol (<): &lt;
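A small sketch of character and entity references inside a pattern, assuming an XQuery 1.0 processor. The references are expanded when the string literal is parsed, so the regular expression engine sees the actual characters. The input strings are illustrative.

```xquery
(
  fn:matches("a b",     "^a&#x20;b$"), (: true:  &#x20; is a space :)
  fn:matches("a&#x9;b", "^a&#x20;b$"), (: false: the input contains a tab, not a space :)
  fn:matches("1 < 2",   "1 &lt; 2")    (: true:  &lt; is the less-than symbol :)
)
```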
12 Representing Any Character
- the period (.) represents one matching character
- a period followed by a quantifier (such as .*) represents multiple characters
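The period with and without a quantifier can be sketched like this, assuming an XQuery 1.0 processor and made-up sample strings.

```xquery
(
  fn:matches("cat",  "^c.t$"),  (: true:  . stands for the single character a :)
  fn:matches("coat", "^c.t$"),  (: false: . cannot cover two characters :)
  fn:matches("coat", "^c.*t$"), (: true:  .* covers oa :)
  fn:matches("ct",   "^c.*t$")  (: true:  .* may also match nothing :)
)
```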
13 Representing Groups of Characters
- Three different kinds of escapes
- multi-character escapes
- category escapes
- block escapes
- They all start with a backslash
14 Multi-Character Escapes
- each escape represents only one character in a matching string
- use a quantifier, as in \d+, to match several such characters in a row
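A brief sketch showing that a multi-character escape still matches just one character unless it is quantified. It assumes an XQuery 1.0 processor and uses the standard escapes \d (digit) and \s (whitespace); the sample strings are illustrative.

```xquery
(
  fn:matches("7",   "^\d$"),   (: true:  one digit :)
  fn:matches("42",  "^\d$"),   (: false: \d alone matches only one character :)
  fn:matches("42",  "^\d+$"),  (: true:  + allows several digits :)
  fn:matches("a b", "^a\sb$")  (: true:  \s matches the single space :)
)
```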
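15 Category Escapes
- The Unicode standard defines categories of characters based on their purpose.
- Category escapes take the form \p{XX}
- \p{XX} matches any character in category XX; \P{XX} matches any character not in it
- if only the Latin letters A to Z should match, it is better to use [A-Z] than \p{Lu}, which includes uppercase letters from all alphabets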
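The difference between \p{Lu} and [A-Z] can be sketched as follows, assuming an XQuery 1.0 processor. The character &#xC9; (Latin capital E with acute) is chosen here as an illustrative uppercase letter outside A to Z.

```xquery
(
  fn:matches("Q",      "^\p{Lu}$"), (: true:  Q is an uppercase letter :)
  fn:matches("q",      "^\p{Lu}$"), (: false :)
  fn:matches("q",      "^\P{Lu}$"), (: true:  \P negates the category :)
  fn:matches("&#xC9;", "^\p{Lu}$"), (: true:  uppercase in Unicode terms :)
  fn:matches("&#xC9;", "^[A-Z]$")   (: false: outside the range A to Z :)
)
```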
16 Block Escapes
- Unicode defines a numeric code point for each character.
- Each range of characters is represented by a block name, also defined by Unicode
- \p{IsLatin-1Supplement} matches any one character in the range #x0080 to #x00FF
- More details at the blocks file of the Unicode standard
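A minimal sketch of a block escape, assuming an XQuery 1.0 processor. The test characters are illustrative: &#xE9; (e with acute, code point #x00E9) lies inside the Latin-1 Supplement block, while plain e belongs to the BasicLatin block.

```xquery
(
  fn:matches("&#xE9;", "^\p{IsLatin-1Supplement}$"), (: true:  #x00E9 is in the block :)
  fn:matches("e",      "^\p{IsLatin-1Supplement}$"), (: false: #x0065 is below #x0080 :)
  fn:matches("e",      "^\p{IsBasicLatin}$")         (: true :)
)
```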
17 Examples for Representing Groups of Characters
18 Character Class Expressions
- Single Characters and Ranges
- Subtraction from a Range
- Negative Character Class Expressions
- Escaping Rules
19 Single Characters and Ranges
- List several characters inside square brackets.
- [def]* matches defdef, eddfefd
- [\p{Ll}\d] matches either a lowercase letter or a digit
- It is also possible to specify a range of characters, such as [a-z]; the endpoints must be single characters, not escapes like \d
- More than one range can appear in the same character class expression: [a-zA-Z0-9]
- Ranges and single characters can be combined in any order: [abc0-9], [a0-9bc]
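The character class expressions above can be tried out as follows. This is a sketch assuming an XQuery 1.0 processor; the input strings other than eddfefd are invented for the example.

```xquery
(
  fn:matches("eddfefd", "^[def]*$"),       (: true:  any mix of d, e, and f :)
  fn:matches("defg",    "^[def]*$"),       (: false: g is not in the list :)
  fn:matches("x9",      "^[\p{Ll}\d]+$"),  (: true:  lowercase letter, then digit :)
  fn:matches("X9",      "^[\p{Ll}\d]+$"),  (: false: X is uppercase :)
  fn:matches("Q",       "^[a-zA-Z0-9]$"),  (: true:  falls in the second range :)
  fn:matches("7",       "^[abc0-9]$"),     (: true :)
  fn:matches("7",       "^[a0-9bc]$")      (: true:  order inside the brackets does not matter :)
)
```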
20 Subtraction from a Range
- Subtraction lets you match a range of characters but leave a few out
- [a-z-[jkl]] matches any character from a to z except j, k, or l
- A hyphen can be used in the subtracted part to give a range: [a-z-[j-l]]
- You can also subtract from a multi-character escape: [\p{Lu}-[ABC]]
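A short sketch of character class subtraction, assuming an XQuery 1.0 processor, which inherits this syntax from XML Schema. The test characters are illustrative.

```xquery
(
  fn:matches("m", "^[a-z-[jkl]]$"),     (: true:  m is in a-z and not excluded :)
  fn:matches("k", "^[a-z-[jkl]]$"),     (: false: k is subtracted :)
  fn:matches("k", "^[a-z-[j-l]]$"),     (: false: same exclusion written as a range :)
  fn:matches("D", "^[\p{Lu}-[ABC]]$"),  (: true:  uppercase and not A, B, or C :)
  fn:matches("A", "^[\p{Lu}-[ABC]]$")   (: false: A is subtracted :)
)
```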
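21 Negative Character Class Expressions
- A caret (^) directly after the opening bracket means the string should not match any of the characters specified
- [^a-b] matches any character that is not a or b
- Any character class expression can be negated
- including those that specify single characters
- and those that specify ranges, such as [^a-z0-9]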
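Negation can be sketched as follows, assuming an XQuery 1.0 processor and illustrative test characters. The first ^ in each pattern is an anchor; only the ^ directly after the opening bracket negates the class.

```xquery
(
  fn:matches("c", "^[^a-b]$"),     (: true:  c is neither a nor b :)
  fn:matches("a", "^[^a-b]$"),     (: false :)
  fn:matches("!", "^[^a-z0-9]$"),  (: true:  not a lowercase letter or digit :)
  fn:matches("q", "^[^a-z0-9]$")   (: false :)
)
```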
22 Examples for Character Class Expressions
23 Escaping Rules for Character Class Expressions
- The characters [, ], \, and - must be escaped when included as single characters.
- The character \ must be escaped if it is the lower bound of a range.
- The characters [ and \ must be escaped if one of them is the upper bound of a range.
- The character ^ must be escaped only if it appears first in the character class expression, directly after the opening bracket ([).
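A few of these escaping rules in action, as a sketch assuming an XQuery 1.0 processor. The patterns are illustrative examples, not taken from the slides.

```xquery
(
  fn:matches("-", "^[a\-z]$"),   (: true:  \- is a literal hyphen, not a range :)
  fn:matches("m", "^[a\-z]$"),   (: false: the class holds only a, -, and z :)
  fn:matches("]", "^[\[\]]$"),   (: true:  escaped brackets as single characters :)
  fn:matches("^", "^[\^a]$"),    (: true:  ^ escaped because it comes first :)
  fn:matches("^", "^[a^]$")      (: true:  no escape needed when ^ is not first :)
)
```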
24 Special Concepts in XQuery
- Reluctant Quantifiers
- Anchors
- Back-References
- Using Flags
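25 Reluctant Quantifiers
- allow part of a regular expression to match the shortest possible string
- formed by adding a question mark (?) to the end of any of the kinds of quantifiers identified in Table 18-1
- against the string reluctant, r.*t matches all of reluctant (the longest possible match)
- against the same string, r.*?t matches only reluct (the shortest possible match)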
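The difference is easiest to see with fn:replace, which shows exactly which part of the string was matched. A minimal sketch, assuming an XQuery 1.0 processor; the replacement string X is arbitrary.

```xquery
(
  fn:replace("reluctant", "r.*t",  "X"), (: "X":    the greedy match consumes the whole string :)
  fn:replace("reluctant", "r.*?t", "X")  (: "Xant": the reluctant match stops at the first t :)
)
```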
26 More on Reluctant Quantifiers
- The replace function can use reluctant quantifiers
- Reluctant quantifiers are not supported in XML Schema
- r.*?tly matches all of reluctantly, because the match must end with tly
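27 Anchors
- XQuery adds the concept of anchors to XML Schema regular expressions
- str does not match 5str5 in XML Schema, where the expression must match the whole string
- In XQuery, anchors indicate that the expression should match the beginning or end of the string (or both)
- ^ matches the beginning of the string and $ matches the end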
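A sketch of how anchors change the result for the 5str5 example, assuming an XQuery 1.0 processor. Without anchors, fn:matches succeeds if the pattern matches any substring.

```xquery
(
  fn:matches("5str5", "str"),    (: true:  an unanchored pattern may match anywhere :)
  fn:matches("5str5", "^str"),   (: false: the string does not start with str :)
  fn:matches("5str5", "str$"),   (: false: the string does not end with str :)
  fn:matches("str5",  "^str"),   (: true :)
  fn:matches("str",   "^str$")   (: true:  anchored at both ends :)
)
```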
28 Anchors and Multi-Line Mode
- For some XQuery functions, you can specify that the processor should operate in multi-line mode
- In multi-line mode, anchors match the beginning and end of any line within the string
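A minimal sketch of multi-line mode, assuming an XQuery 1.0 processor. The variable $text and its contents are hypothetical; &#10; is a newline character, and multi-line mode is switched on with the m flag.

```xquery
let $text := "first line&#10;second line"
return (
  fn:matches($text, "^second"),          (: false: ^ matches only the start of the string :)
  fn:matches($text, "^second", "m"),     (: true:  ^ also matches after the newline :)
  fn:matches($text, "^first line$"),     (: false: $ matches only the end of the string :)
  fn:matches($text, "^first line$", "m") (: true:  $ also matches before the newline :)
)
```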
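29 Back-References
- Back-references allow you to ensure that certain characters in a string match each other
- Example: a string is a product number delimited by either single or double quotes
- ('|")\d{3}-[A-Z]{2}('|") also accepts a string whose opening and closing quotes differ
- '\d{3}-[A-Z]{2}'|"\d{3}-[A-Z]{2}" requires matching quotes, but repeats the whole pattern
- ('|")\d{3}-[A-Z]{2}\1 uses the back-reference \1 to require the same quote that the first sub-expression matched
- Sub-expressions are numbered from left to right, starting with 1 (not 0), and you can reference any of them by number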
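The back-reference version of the pattern can be sketched as follows, assuming an XQuery 1.0 processor. The product number 123-AB and the variable name $pattern are illustrative. Inside a string literal delimited by double quotes, a literal double quote is written as two double quotes.

```xquery
(: \1 must match the same quote character that the first group matched :)
let $pattern := "^('|"")\d{3}-[A-Z]{2}\1$"
return (
  fn:matches("'123-AB'",    $pattern), (: true:  single quotes on both sides :)
  fn:matches("""123-AB""",  $pattern), (: true:  double quotes on both sides :)
  fn:matches("'123-AB""",   $pattern)  (: false: opening and closing quotes differ :)
)
```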
30 Using Flags
- The regular expression functions accept a flags argument that allows for additional options
- Options are indicated by single letters
- the flags argument is a string
- duplicates are allowed
- Four options
- s indicates dot-all mode
- m indicates multi-line mode
- i indicates case-insensitive mode
- x indicates that whitespace characters in the regular expression should be ignored
31 Example for the flags argument
- $address is a two-line string: 123 Main Street, followed on a new line by Traverse City, MI 49684
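The four flags can be sketched against the two-line address above. This assumes an XQuery 1.0 processor and that the two lines are separated by a single newline (&#10;); the patterns are illustrative.

```xquery
let $address := "123 Main Street&#10;Traverse City, MI 49684"
return (
  fn:matches($address, "Street.*City"),         (: false: . does not match the newline :)
  fn:matches($address, "Street.*City", "s"),    (: true:  dot-all mode :)
  fn:matches($address, "Street$"),              (: false: $ is the end of the whole string :)
  fn:matches($address, "Street$", "m"),         (: true:  multi-line mode :)
  fn:matches($address, "street"),               (: false: case matters by default :)
  fn:matches($address, "street", "i"),          (: true:  case-insensitive mode :)
  fn:matches($address, "Main Street", "x"),     (: false: the space in the pattern is ignored :)
  fn:matches($address, "Main \s Street", "x"),  (: true:  \s still matches the space :)
  fn:matches($address, "street.*city", "si")    (: true:  flags can be combined :)
)
```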
32 Using Sub-Expressions with Replacement Variables
- The replace function allows parenthesized sub-expressions to be referenced by number in the replacement string, as $1, $2, and so on
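A minimal sketch of replacement variables, assuming an XQuery 1.0 processor. The date string is an invented example; $1, $2, and $3 refer to the first, second, and third parenthesized sub-expressions of the pattern.

```xquery
(: Reorder a year-month-day date into month/day/year. :)
fn:replace("2006-10-18", "^(\d{4})-(\d{2})-(\d{2})$", "$2/$3/$1")
(: returns "10/18/2006" :)
```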
33 Three Functions
- fn:matches
- fn:matches("abracadabra", "^a.*a$") returns true
- fn:replace
- fn:replace("abracadabra", "a.*?a", "*") returns "*c*bra"
- fn:tokenize
- fn:tokenize("The cat sat on the mat", "\s+") returns ("The", "cat", "sat", "on", "the", "mat")
- fn:tokenize("1, 15, 24, 50", ",\s*") returns ("1", "15", "24", "50")
- fn:tokenize("1,15,,24,50,", ",") returns ("1", "15", "", "24", "50", "")
34 Conclusion
- The Structure of a Regular Expression
- Representing Individual Characters
- Representing Any Character
- Representing Groups of Characters
- Character Class Expressions
- Reluctant Quantifiers
- Anchors
- Back-References
- Using Flags
- Using Sub-Expressions with Replacement Variables
35 The End