Title: Matching in list context (Chapter 11 continued)
1- Matching in list context (Chapter 11 continued)
- @array = ($str =~ /pattern/);
- This stores the list of the special capturing variables ($1, $2, ...) into @array only if there are grouped expressions in the pattern to capture matches. Otherwise, if there are no grouped expressions, either (1) or () is returned into @array, depending upon whether the match succeeds or not.
- The following results in ("cat chow", "cat", "chow") being assigned to @array.
- @array = ("Purina cat chow" =~ /((cat|dog|ferret) (food|chow))/);
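The three cases described above can be sketched as a short standalone script. The variable names @captures, @hit, and @miss are chosen here for illustration only.

```perl
use strict;
use warnings;

# Grouped expressions: the captures ($1, $2, $3) come back as a list.
my @captures = ("Purina cat chow" =~ /((cat|dog|ferret) (food|chow))/);
print join(" | ", @captures), "\n";    # cat chow | cat | chow

# No grouped expressions: (1) on success, the empty list on failure.
my @hit  = ("Purina cat chow" =~ /cat/);
my @miss = ("Purina cat chow" =~ /bird/);
print scalar(@hit), " ", scalar(@miss), "\n";    # 1 0
```

Because the outer parentheses open first, the whole phrase is $1 and the two inner groups are $2 and $3.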
2- The g command modifier causes matching to be done globally: it doesn't quit after finding the first match.
- @array = ($str =~ /pattern/g);
- Global matching is simplest when there are no grouped expressions in the pattern. With grouped expressions, the list holds the captures from every match rather than the whole matches.
- The following results in the list ("an ", "amp") being assigned to @array.
- @array = ("an example" =~ /a../g);
- In contrast, the following would result in the one-element list ("an ") being assigned to @array.
- @array = ("an example" =~ /(a..)/);
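As a small sketch of how the g modifier and grouped expressions interact, the script below runs both examples from this slide plus one extra case with two groups. The pattern /(a)(.)/g and the variable names are assumptions added for illustration.

```perl
use strict;
use warnings;

# Global match, no groups: every whole match is returned.
my @global = ("an example" =~ /a../g);
print join("|", @global), "\n";        # an |amp

# One group, no /g: only the capture from the first match.
my @single = ("an example" =~ /(a..)/);
print scalar(@single), "\n";           # 1

# Groups together with /g: the captures of every match, in order.
my @captures = ("an example" =~ /(a)(.)/g);
print join("|", @captures), "\n";      # a|n|a|m
```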
3- The following statement parses out all of the HTML tags and stores the list ("<h1>", "</h1>") in the @tags array.
- @tags = ("<h1>Title</h1>" =~ /<.+?>/g);
- Suppose $document is a (perhaps long) string that contains some text document, and suppose we want to pull out all the social security numbers from the document. If we assume social security numbers look like 123-45-6789, then a solution is
- @soc_numbers = ($document =~ /\d{3}-\d{2}-\d{4}/g);
- But what if the social security numbers are inconsistent in that some are missing the dashes? Then a solution is
- @soc_numbers = ($document =~ /\d{3}-?\d{2}-?\d{4}/g);
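The following sketch runs the tag pattern and both social security number patterns on made-up input. The contents of $document and the names @strict and @loose are hypothetical, and the numbers are not real.

```perl
use strict;
use warnings;

# Non-greedy match between angle brackets pulls out each tag.
my @tags = ("<h1>Title</h1>" =~ /<.+?>/g);
print join(" ", @tags), "\n";          # <h1> </h1>

# Hypothetical document text with one number missing its dashes.
my $document = "Ann: 123-45-6789, Bob: 987654321, Cy: 555-12-3456.";

# Dashes required: the number without dashes is skipped.
my @strict = ($document =~ /\d{3}-\d{2}-\d{4}/g);
print scalar(@strict), "\n";           # 2

# Dashes optional: all three numbers are found.
my @loose = ($document =~ /\d{3}-?\d{2}-?\d{4}/g);
print scalar(@loose), "\n";            # 3
```

Note that the looser pattern would also match any run of nine digits inside a longer number, so real data may need word boundaries (\b) around the pattern.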
4- Two very useful functions that take patterns and
return lists.
split(pattern, string) Returns a list consisting of the fields (the substrings not used in any matches) between successful matches of the pattern against the string. Trailing empty fields are omitted.
split(pattern,string,limit) Returns a list with at most limit number of fields.
grep(pattern, list) Returns a list consisting of those elements in the given list which successfully matched the pattern. (The name comes from the Unix grep utility, after the ed command g/re/p: globally search for a regular expression and print.)
5- We have used split often, even in the decoding routine, where we split about a one-character string.
- @nameValuePairs = split(/&/, $datastring);
- A string with more complicated delimiting patterns can also be split. In the following case, a delimiter is one or more colons, as in a string such as
- $str = "23:22::455:::9885";
- @numbers = split(/:+/, $str);
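A short sketch of split with a one-character delimiter, a one-or-more delimiter, a limit, and trailing empty fields. All of the input strings here are invented for illustration.

```perl
use strict;
use warnings;

# One-character delimiter, as in form decoding (hypothetical data).
my $datastring = "name=Ann&age=30&city=Ames";
my @nameValuePairs = split(/&/, $datastring);
print scalar(@nameValuePairs), "\n";       # 3

# One or more colons count as a single delimiter.
my $str = "a:b::c:::d";
my @fields = split(/:+/, $str);
print join(" ", @fields), "\n";            # a b c d

# With a limit of 2, the rest of the string stays in the last field.
my @two = split(/:+/, $str, 2);
print $two[1], "\n";                       # b::c:::d

# Trailing empty fields are dropped.
my @trimmed = split(/,/, "x,y,,,");
print scalar(@trimmed), "\n";              # 2
```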
6- grep is different from split in that you send it an array rather than a string. It "filters" the array based upon the regular expression. That is, only those array elements which match the pattern are returned.
- Suppose @domains contains some large number of named Web addresses. One simple call to grep can filter out only those addresses in the ".edu" domain, for example
- @edu_sites = grep(/\.edu/, @domains);
- Note: The period had to be escaped since it is a metacharacter.
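To see why the escape matters, the sketch below filters a small made-up list of host names with and without the backslash. The host names and the @unescaped array are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical host names.
my @domains = ("www.example.edu", "www.example.com",
               "reducer.example.org", "cs.example.edu");

# Escaped period: only names containing a literal ".edu" survive.
my @edu_sites = grep(/\.edu/, @domains);
print join(" ", @edu_sites), "\n";     # www.example.edu cs.example.edu

# Unescaped period matches any character, so "redu" in "reducer" also matches.
my @unescaped = grep(/.edu/, @domains);
print scalar(@unescaped), "\n";        # 3
```

A tighter filter could anchor the pattern at the end of the name, as in /\.edu$/, so that ".edu" in the middle of a name is not accepted.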
7- Example: Analyzing log files. A typical HTTP access log. See accesslog.txt.
8- The 10 different fields are actually standard.
- Results when we split out the first line (around delimiting spaces).
- @fields = split(/\s/, $line);

FIELD    First Line              Meaning
field0   136.201.141.108         Address (either IP or name) of client
field1   -                       Not used anymore
field2   -                       Not used anymore
field3   [09/Nov/2001:10:34:01   Date and time
field4   -0600]                  Time zone
field5   "GET                    Request method
field6   /                       Relative part of URL (here the site root)
field7   HTTP/1.1"               HTTP version
field8   200                     Status code (success code or error code)
field9   16058                   Bytes transferred
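The table can be reproduced with a few lines of Perl. The log line below is reassembled from the table's "First Line" column, assuming the usual bracketed timestamp of the common log format, so treat it as an illustration rather than the exact contents of accesslog.txt.

```perl
use strict;
use warnings;

# Log line reassembled from the table above (an assumption, not the actual file).
my $line = '136.201.141.108 - - [09/Nov/2001:10:34:01 -0600] "GET / HTTP/1.1" 200 16058';

# Split around single whitespace characters.
my @fields = split(/\s/, $line);

# Print each field next to its index.
for my $i (0 .. $#fields) {
    printf "field%d  %s\n", $i, $fields[$i];
}
```

Because the split is on whitespace alone, the brackets and quotation marks stay attached to the neighboring fields.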
9- Log file analysis can get very elaborate, and there are many commercial and free software packages available for that.
- For a simple example, we count the total number of hits (lines in the access log) and the total number of unique hits (different IP addresses).
- Notice that requesting one page can result in numerous lines in the access log, since all of the image transfers are separate HTTP transactions. (Some hit counters you find actually report the number of lines in the file!)
- Counting lines is easy. To count the number of unique IP addresses, we add IP addresses to a hash as the keys. Thus a new hash entry can only originate from a new IP address. We then count the number of keys in the hash.
- See source file hitcount.pl
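The counting idea can be sketched as follows. This is not the contents of hitcount.pl; it is a minimal illustration that uses an in-memory list of invented log lines instead of reading the access log, and the names @log_lines, %seen, $total_hits, and $unique_hits are assumptions.

```perl
use strict;
use warnings;

# Hypothetical log lines; a real script would read them from the access log.
my @log_lines = (
    '136.201.141.108 - - [09/Nov/2001:10:34:01 -0600] "GET / HTTP/1.1" 200 16058',
    '136.201.141.108 - - [09/Nov/2001:10:34:02 -0600] "GET /logo.gif HTTP/1.1" 200 2048',
    '10.0.0.7 - - [09/Nov/2001:10:35:10 -0600] "GET / HTTP/1.1" 200 16058',
);

my $total_hits = 0;
my %seen;    # keys are client addresses

foreach my $line (@log_lines) {
    $total_hits++;                         # every line is one hit
    my ($client) = split(/\s/, $line);     # first field is the client address
    $seen{$client} = 1;                    # a repeated address reuses its key
}

my $unique_hits = scalar(keys %seen);
print "Total hits: $total_hits\n";         # Total hits: 3
print "Unique hits: $unique_hits\n";       # Unique hits: 2
```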
10- The substitution operator
- $scalar_variable =~ s/pattern/replacement_string/command_modifiers
- The binding operator (=~) "binds" the substitution onto the string.
- The substitution operator s/// takes two arguments (in contrast to the match operator m//).
- It attempts to find a match for the pattern in the $scalar_variable, and if successful, replaces the match with the replacement_string.
- Thus, the scalar variable is altered if a successful match is found. In contrast, the match operator does not alter the string onto which it is bound.
11- The following attempts to replace "the" with "my".
- $str = "the cat in the hat";
- $str =~ s/the/my/;
- This causes $str to contain "my cat in the hat".
- By default, only the left-most occurrence is replaced.
- The g (global) command modifier causes substitutions to be made globally.
- $str = "the cat in the hat";
- $str =~ s/the/my/g;
- This causes $str to contain "my cat in my hat".
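A brief sketch contrasting a plain match, a single substitution, and a global substitution. The variable names are illustrative; the use of the return value of s///g as a count of replacements is standard Perl behavior, though the slides do not mention it.

```perl
use strict;
use warnings;

# A match only tests the string; it does not change it.
my $untouched = "the cat in the hat";
my $found = ($untouched =~ /the/) ? 1 : 0;
print "$found: $untouched\n";          # 1: the cat in the hat

# Without g, only the left-most occurrence is replaced.
my $first = "the cat in the hat";
$first =~ s/the/my/;
print "$first\n";                      # my cat in the hat

# With g, every occurrence is replaced; s///g returns how many.
my $all = "the cat in the hat";
my $count = ($all =~ s/the/my/g);
print "$count: $all\n";                # 2: my cat in my hat
```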
12- The following results in $str having the value "puppy ferret category" (non-global substitution).
- $str = "puppy dog category";
- $str =~ s/(cat|dog)/ferret/;
- A similar global substitution results in $str containing "puppy ferret ferretegory".
- $str = "puppy dog category";
- $str =~ s/(cat|dog)/ferret/g;
- The following replaces all whitespace characters with the empty string, resulting in $str containing "hello".
- $str = "h e l l o";
- $str =~ s/\s//g;
13- Captured matches can actually be included in the replacement string.
- $str = "puppy dog category";
- $str =~ s/(\w+)/$1s/g;
- This results in $str having the value "puppys dogs categorys".
- There is only one set of grouping parentheses used in this example, so we only need to use $1.
- As each match is found, $1 is assigned that new match. Thus, $1 may be reused several times during a global substitution.
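The sketch below runs the pluralizing example and adds a second, hypothetical example with three groups to show $1, $2, and $3 used together. The date string and the s{...}{...} delimiters are choices made here for illustration.

```perl
use strict;
use warnings;

# One group: $1 is reassigned for each word during the global substitution.
my $str = "puppy dog category";
$str =~ s/(\w+)/$1s/g;
print "$str\n";                        # puppys dogs categorys

# Three groups: reorder the pieces of a (hypothetical) date string.
# Braces are used as delimiters so the slashes need no escaping.
my $date = "2001-11-09";
$date =~ s{(\d+)-(\d+)-(\d+)}{$2/$3/$1};
print "$date\n";                       # 11/09/2001
```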
14- The transliteration operator
- $scalar =~ tr/search_characters/replacement_characters/
- This replaces the search characters with the corresponding replacement characters.
- It's usually used with single characters.
- $str = "the cat in the hat";
- $str =~ tr/a/u/;
- The result is "the cut in the hut"
- Transliteration can be done using substitutions, but tr automatically does global substitutions and only uses literal characters rather than patterns, which means you don't have to escape regular expression metacharacters.
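A small sketch comparing tr with the equivalent substitution. The dotted string and the counting idiom (an empty replacement list, which leaves the string unchanged and returns the number of matching characters) are additions for illustration.

```perl
use strict;
use warnings;

# tr replaces every "a" with "u" in one pass.
my $str = "the cat in the hat";
$str =~ tr/a/u/;
print "$str\n";                        # the cut in the hut

# The period is an ordinary character to tr, but needs escaping in s///.
my $with_tr = "a.b.c";
$with_tr =~ tr/./_/;
my $with_s = "a.b.c";
$with_s =~ s/\./_/g;
print "$with_tr $with_s\n";            # a_b_c a_b_c

# With an empty replacement list, tr just counts the characters.
my $phrase = "the cat in the hat";
my $a_count = ($phrase =~ tr/a//);
print "$a_count\n";                    # 2
```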
15- Example: Inspired by news sites which display parts of stories and provide links pointing to the full stories.
- See partialcontent.cgi
16- Each story is a text file (.news)
- Paragraphs must be separated by at least one blank line (\n\n)
- The program reads the directory and prints the first two paragraphs of only the .news files.
17- Acquiring only the .news stories from the directory is straightforward, especially with the power of grep.
    opendir(D, "$storyDataDir");
    @storyFiles = readdir(D);
    closedir(D);
    @storyFiles = grep(/\.news$/, @storyFiles);
- Note: The pattern escapes the period and anchors the match at the end of the file name, so that only names ending in ".news" are kept.
- We then loop over the .news files and process each one.
    foreach $file (@storyFiles) {
        if (open(STORY, "$storyDataDir$file")) {
            my @wholeStory = <STORY>;
            close(STORY);

            # join whole story into one string
            my $story = join("", @wholeStory);
        }
    }
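For experimenting with the directory-reading steps without the real story directory, here is a self-contained sketch. It is not partialcontent.cgi: it creates a scratch directory with invented files using the core File::Temp module, and it uses lexical file handles, which is a stylistic choice of this sketch. The %stories hash is also an assumption, added to keep each story's text.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical setup: a scratch directory holding a few sample files.
my $storyDataDir = tempdir(CLEANUP => 1) . "/";
foreach my $name ("a.news", "b.news", "notes.txt", "oldnews.txt") {
    open(my $out, ">", "$storyDataDir$name") or die "Cannot write $name: $!";
    print $out "First paragraph.\n\nSecond paragraph.\n\n";
    close($out);
}

# Read the directory listing.
opendir(my $dir, $storyDataDir) or die "Cannot open $storyDataDir: $!";
my @allFiles = readdir($dir);
closedir($dir);

# Escaped and anchored: only names ending in ".news".
my @storyFiles = sort grep { /\.news$/ } @allFiles;

# Unescaped and unanchored: "oldnews.txt" slips through as well.
my @loose = sort grep { /.news/ } @allFiles;

# Read each story into one string.
my %stories;
foreach my $file (@storyFiles) {
    if (open(my $story_fh, "<", "$storyDataDir$file")) {
        my @wholeStory = <$story_fh>;
        close($story_fh);
        $stories{$file} = join("", @wholeStory);
    }
}
print join(" ", @storyFiles), "\n";    # a.news b.news
```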
18- We can then extract all of the paragraphs with one global match!
- @paragraphs = ($story =~ /((?:.|\n)+?\n\s*\n)/g);
- The inner group is written (?:...) so that it groups without capturing. With a plain (...) there, the global match would also return that group's capture for every paragraph.
- It's then trivial to print the first two paragraphs.
- But the pattern certainly needs clarification.
- First we need to identify the space between paragraphs.
- \n\s*\n matches one or more consecutive blank lines. That is, two newline characters with zero or more whitespace characters in between.
- Since quantifiers are greedy, the pattern will not stop after finding the first in a sequence of blank lines.
19- Now we match paragraph content.
- (?:.|\n)+ matches one or more of any character. The wildcard alone doesn't match \n characters, and (?:...) groups the two alternatives without capturing.
- Now the whole pattern which matches a paragraph.
- /(?:.|\n)+?\n\s*\n/ matches one or more of anything, then blank line(s).
- Notes
- One would have been tempted to identify paragraphs as one or more wildcard characters (.+). But that would miss parts of paragraphs containing an inadvertent hard return (\n) between sentences.
- The extra metacharacter (?) after the + specifies non-greedy matching. Otherwise, the pattern would not stop after the first paragraph.
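The paragraph pattern can be tried on a small made-up story, as in the sketch below. The story text is hypothetical. The sketch uses a non-capturing inner group, (?:...), and also runs the same pattern with a capturing inner group to show that the global match then returns two items per paragraph.

```perl
use strict;
use warnings;

# Hypothetical story: a hard return inside paragraph one, and a
# whitespace-only line among the blank lines after paragraph two.
my $story = "First paragraph,\nwith a hard return.\n\n"
          . "Second paragraph.\n \n\n"
          . "Third paragraph.\n\n";

# Non-capturing inner group: one list element per paragraph.
my @paragraphs = ($story =~ /((?:.|\n)+?\n\s*\n)/g);
print scalar(@paragraphs), "\n";       # 3

# Capturing inner group: each match contributes $1 and $2.
my @with_inner = ($story =~ /((.|\n)+?\n\s*\n)/g);
print scalar(@with_inner), "\n";       # 6

# Print the first two paragraphs.
print $paragraphs[0], $paragraphs[1];
```

Each captured paragraph includes its trailing blank line(s), so the two can be printed back to back.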
20- There are still two subtle pitfalls regarding the structure of the news files.
- A sequence of two or more blank lines (\n\n\n) at the beginning of the file will cause the first \n to be matched as the first paragraph. (That is not a problem for multiple blank lines between paragraphs since \n\s*\n is greedy.)
- If there are no blank lines after the last paragraph in the file, the last paragraph will not be matched (hence not captured). That doesn't affect this application as long as there are three or more paragraphs in a file.
- How would you fix those problems?
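One possible answer, offered only as a sketch: normalize the story before matching by stripping leading whitespace and forcing exactly one blank line at the end. The story text, the names $pattern, @before, @after, and $fixed, and the non-capturing inner group are all assumptions of this example; other fixes are equally reasonable.

```perl
use strict;
use warnings;

# Hypothetical story with both problems: blank lines at the start
# and no blank line after the last paragraph.
my $story = "\n\n\nFirst.\n\nSecond.\n\nThird, with no blank line after it.";

my $pattern = qr/((?:.|\n)+?\n\s*\n)/;

# Without normalization: a bogus first "paragraph", and the last one is lost.
my @before = ($story =~ /$pattern/g);
print scalar(@before), "\n";           # 3

# Normalize: drop leading whitespace, end with exactly one blank line.
my $fixed = $story;
$fixed =~ s/\A\s+//;
$fixed =~ s/\s*\z/\n\n/;

my @after = ($fixed =~ /$pattern/g);
print $after[0];                       # First. (followed by a blank line)
print scalar(@after), "\n";            # 3
```

Before normalization the three results are the leading newlines, "First.", and "Second."; afterward they are the three real paragraphs.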