Title: Python
1 Python Pattern Matching with Regular Expressions (REs)
File: PythonREs.ppt
2 Foresight
- Pattern matching
  - Literal
  - With metacharacters
- Regular expressions (REs)
- Using REs in Python
3 Consider dir by Itself
D:\athomepc\day\idt>dir
 Volume in drive D has no label
 Volume Serial Number is 3E4B-1609
 Directory of D:\athomepc\day\idt

.              <DIR>        01-01-02   8:16a .
..             <DIR>        01-01-02   8:16a ..
SPRING~1 PDF       180,072  01-01-02   8:17a spring02idtfront.pdf
SPRING~2 PDF       241,542  01-01-02   8:19a spring02idtpartI.pdf
SPRING~3 PDF     1,246,514  01-01-02   8:20a spring02idtpartII.pdf
SPRING~4 PDF     2,517,343  01-01-02   8:22a spring02idtpartIII.pdf
SPRING~5 PDF     3,469,138  01-01-02   8:24a spring02idtpartIV.pdf
CASE1-~1 DOC        35,328  01-01-02   8:42a case1-python.doc
LECTUR~1 PPT        78,336  01-01-02   9:45a lecture01fall01.ppt
PYTHON~1 PPT        34,816  01-01-02   9:46a Python_Intro.ppt
PYTHON~2 PPT        37,376  01-01-02   9:46a Python_Structures.ppt
LECTUR~2 PPT       154,112  01-01-02  11:51a lecture01spring02.ppt
PYTHON~3 PPT        34,816  01-01-02  11:52a PythonREs.ppt
        11 file(s)      8,029,393 bytes
         2 dir(s)       1,209.06 MB free

D:\athomepc\day\idt>
4 Now dir with a Literal Search
D:\athomepc\day\idt>dir case1-python.doc
 Volume in drive D has no label
 Volume Serial Number is 3E4B-1609
 Directory of D:\athomepc\day\idt

CASE1-~1 DOC        35,328  01-01-02   8:42a case1-python.doc
         1 file(s)         35,328 bytes
         0 dir(s)       1,209.06 MB free

D:\athomepc\day\idt>
5 Now dir with *
D:\athomepc\day\idt>dir *.doc
 Volume in drive D has no label
 Volume Serial Number is 3E4B-1609
 Directory of D:\athomepc\day\idt

CASE1-~1 DOC        35,328  01-01-02   8:42a case1-python.doc
         1 file(s)         35,328 bytes
         0 dir(s)       1,209.06 MB free

D:\athomepc\day\idt>
6 Literal vs. Pattern Searches
- dir myfile.doc
  - Searches literally, for an exact match with myfile.doc
- dir my*.doc
  - Does a pattern search. Matches any file whose name begins with my, followed by 0 or more characters of any kind, followed by .doc
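The same literal-versus-pattern distinction can be tried from Python. This is a minimal sketch, assuming the standard fnmatch module as a stand-in for the wildcard matching that dir performs; the file names are hypothetical and are not taken from the directory listing.

```python
from fnmatch import fnmatchcase

# Hypothetical file names, used only for illustration.
names = ["myfile.doc", "mynotes.doc", "my.doc", "myfile.txt", "summary.doc"]

# Literal search: only an exact match counts.
literal_hits = [n for n in names if n == "myfile.doc"]

# Pattern search: * stands for zero or more characters of any kind.
pattern_hits = [n for n in names if fnmatchcase(n, "my*.doc")]

print(literal_hits)   # ['myfile.doc']
print(pattern_hits)   # ['myfile.doc', 'mynotes.doc', 'my.doc']
```

Note that my.doc matches the pattern too, because * is allowed to match zero characters.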
7 MetaCharacters
- dir treats * as a metacharacter: a character not taken literally, but as an instruction to match a certain kind of pattern (here, anything)
- The dir metacharacter scheme is very useful
8 On Beyond *
- ...and also very primitive and limited
- A step up: grep in Unix/Linux; support for RE searches in some text editors, e.g., TextPad (www.textpad.com)
- Regular expressions (REs) use a richer language and a larger set of metacharacters, giving us a very powerful capability to extract information (patterns) from text
9 Python's RE Metacharacters
- Here's the complete list:
  - . ^ $ * + ? { } [ ] \ | ( )
- No use memorizing. We'll learn by examples.
- A natural question: But what if I want to search for a pattern that contains what Python's RE counts as metacharacters?
  - Be just a little patient
10 Load Python's re Module
>>> import re
>>> teststring = "Television is public anomie number 1."
>>> teststring
'Television is public anomie number 1.'
>>> len(teststring)
37
>>> match = re.search('anomie', teststring)
>>> match == None
0
>>> match.span()
(21, 27)
>>> teststring[21:27]
'anomie'
>>>
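The session above comes from an older interpreter, where a false comparison prints as 0. As a sketch of the same literal search in current Python 3, where the comparison would print False and the usual test is "is not None", the following script repeats the steps. The variable names start and end are introduced here for illustration.

```python
import re

teststring = "Television is public anomie number 1."

match = re.search("anomie", teststring)

# re.search returns None when nothing matches, so test before using the result.
if match is not None:
    start, end = match.span()
    print(start, end)              # 21 27
    print(teststring[start:end])   # anomie
```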
11 Now a Nonliteral Match
>>> match = re.search('Television',teststring)
>>> match == None
0
>>> match = re.search('television',teststring)
>>> match == None
1
>>> match = re.search('[tT]elevision',teststring)
>>> match.span()
(0, 10)
>>> teststring
'Television is public anomie number 1.'
>>>
12 Square Bracket Notation [...]
- [tT] means any one of the characters t or T.
- [...] is called a character class
- Examples
  - [abc], [a-z], [A-Z]
  - [^tT]: not t and not T
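A short sketch of the three kinds of character class (a set, a range, and a negated set), run against the same test string used in the sessions. The variable names are illustrative only.

```python
import re

teststring = "Television is public anomie number 1."

# [tT] accepts either case for the first letter.
either_case = re.search(r"[tT]elevision", teststring)

# [a-z] is a range; the first lowercase letter in the string is 'e'.
first_lower = re.search(r"[a-z]", teststring)

# [^tT] is a negated class: the first character that is neither t nor T.
not_t = re.search(r"[^tT]", teststring)

print(either_case.span())    # (0, 10)
print(first_lower.group())   # e
print(not_t.span())          # (1, 2)
```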
13 Not Example
>>> teststring
'Television is public anomie number 1.'
>>> match = re.search('[^tT][a-z]+',teststring)
>>> match.span()
(1, 10)
>>> teststring[1:10]
'elevision'
>>>
Note: + means one or more of the previous; * means zero or more; ? means zero or one
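The three repetition metacharacters in the note can be compared side by side. This is a minimal sketch; the patterns and sample strings are hypothetical, chosen only so that each quantifier behaves differently.

```python
import re

# Hypothetical sample strings, chosen only to show each quantifier.
one_or_more = re.search(r"ab+c", "abbbc")   # + needs at least one b
none_for_plus = re.search(r"ab+c", "ac")    # no b at all, so no match
zero_or_more = re.search(r"ab*c", "ac")     # * also accepts zero b's
zero_or_one = re.search(r"ab?c", "abc")     # ? accepts zero or one b
too_many = re.search(r"ab?c", "abbc")       # two b's, so no match

print(one_or_more.group())    # abbbc
print(none_for_plus)          # None
print(zero_or_more.group())   # ac
print(zero_or_one.group())    # abc
print(too_many)               # None
```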
14 '\s\w\.' and '\s(\w)\.'
>>> teststring
'Television is public anomie number 1.'
>>> match = re.search('\s\w\.',teststring)
>>> match.span()
(34, 37)
>>> teststring[34:37]
' 1.'
>>> match = re.search('\s(\w)\.',teststring)
>>> match.span(0)
(34, 37)
>>> match.span(1)
(35, 36)
>>> teststring[35:36]
'1'
>>>
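Parentheses do not change what the pattern matches; they mark a group whose text and position can be read back separately. The sketch below uses the match object's group method alongside span, so the matched text is returned directly instead of being sliced out of the string by hand.

```python
import re

teststring = "Television is public anomie number 1."

# A raw string (r"...") keeps the backslashes intact for the re module.
match = re.search(r"\s(\w)\.", teststring)

print(repr(match.group(0)))   # ' 1.'  (group 0 is the whole match)
print(repr(match.group(1)))   # '1'    (group 1 is the parenthesized part)
print(match.span(1))          # (35, 36)
```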
15 [.] = \.
- Inside [...] most metacharacters are taken literally
- So, [.] = \.
- Note (again): [...] is called a character class
>>> match = re.search('\s(\w)[.]',teststring)
>>> match.span()
(34, 37)
>>>
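This also answers the earlier question about searching for text that contains metacharacters. The sketch below compares an unescaped period with the two literal forms, and adds re.escape, a standard re function not shown on the slides, which inserts the backslashes for you.

```python
import re

teststring = "Television is public anomie number 1."

any_char = re.search(r"e.", teststring)          # . matches any character
escaped = re.search(r"1\.", teststring)          # \. matches a literal period
in_class = re.search(r"1[.]", teststring)        # [.] does the same
built = re.search(re.escape("1."), teststring)   # re.escape adds the backslash

print(any_char.group())   # el
print(escaped.span())     # (35, 37)
print(in_class.span())    # (35, 37)
print(built.span())       # (35, 37)
```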
16 Avoiding Greed: ?
>>> newstring = '<div align="center">'
>>> newstring = newstring+'<i class="smaller">'
>>> newstring = newstring+'(As of 10:55 AM on 12/20/01)'
>>> newstring = newstring+'</i></div><br>'
>>> newstring
'<div align="center"><i class="smaller">(As of 10:55 AM on 12/20/01)</i></div><br>'
>>> match = re.search('<.*>',newstring)
>>> match.span()
(0, 81)
>>> match = re.search('<.*?>',newstring)
>>> match.group()
'<div align="center">'
>>>
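A sketch of the greedy and non-greedy forms side by side, using the HTML fragment from the session. The call to re.findall is an addition for illustration; it shows that the non-greedy form picks out each tag separately.

```python
import re

newstring = ('<div align="center"><i class="smaller">'
             '(As of 10:55 AM on 12/20/01)</i></div><br>')

greedy = re.search(r"<.*>", newstring)   # .* grabs as much as it can
lazy = re.search(r"<.*?>", newstring)    # .*? stops at the first possible >

print(greedy.span())   # (0, 81)
print(lazy.group())    # <div align="center">

# With the non-greedy form, findall returns every tag on its own.
tags = re.findall(r"<.*?>", newstring)
print(tags)
# ['<div align="center">', '<i class="smaller">', '</i>', '</div>', '<br>']
```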
17 More on Not Being Greedy
>>> match = re.search(r'<(\w).*?>(.*)</(\1)',newstring)
>>> match.groups()
('d', '<i class="smaller">(As of 10:55 AM on 12/20/01)</i>', 'd')
>>> match = re.search(r'<(\w).*?>([^<]+)</(\1)',newstring)
>>> match.groups()
('i', '(As of 10:55 AM on 12/20/01)', 'i')
>>>
\1 is called a backreference. It refers to group 1.
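A sketch of a backreference that captures the whole tag name instead of only its first letter. The pattern here is a variation written for illustration, not the one from the session. It also shows why the r prefix matters: in an ordinary string, "\1" is read by Python as the character with code 1 before the re module ever sees it.

```python
import re

newstring = ('<div align="center"><i class="smaller">'
             '(As of 10:55 AM on 12/20/01)</i></div><br>')

# Group 1 is the tag name; \1 requires the same name in the closing tag.
# [^<]+ keeps group 2 from running across any nested tag.
pattern = r"<(\w+)[^>]*>([^<]+)</\1>"
match = re.search(pattern, newstring)

print(match.group(1))   # i
print(match.group(2))   # (As of 10:55 AM on 12/20/01)

# Without the r prefix, "\1" is an escape for chr(1), not a backreference.
print("\1" == chr(1))   # True
print(r"\1" == "\\1")   # True
```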
18 Concluding
- REs are a very powerful tool, very often very useful
- The language notation is compact and a bit hard to read
- Practice, study the examples, don't worry about memorization.
19 Advice on Scripting
- Scripting, and programming in general, is a process
- Successful scripts don't spring into existence whole
- Scripts are built in small increments
- Attend to
  - Decomposition
  - Stories
  - Testing
20 Advice on Scripting
- Decomposition
  - Solve big problems by decomposing them into small problems and solving them
- Stories
  - Scripting/programming as a form of literature
  - Use comments with code to tell a clear story about what the code is or should be doing
- Testing
  - Everything, whole and part, often, varying inputs
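One way these three habits might look in a small RE script is sketched below. The function names (find_tags, tag_name, tag_names) are hypothetical, invented for this example. The task is split into two small pieces plus a combiner, each piece carries a comment telling its story, and the assertions test the parts and the whole with varied inputs.

```python
import re

def find_tags(html):
    """Story: return every tag (the text from < to the next >) in order."""
    return re.findall(r"<.*?>", html)

def tag_name(tag):
    """Story: return the name inside one tag, for example 'div' for '</div>'."""
    match = re.search(r"</?(\w+)", tag)
    return match.group(1) if match else None

def tag_names(html):
    """Story: combine the two small pieces to list the tag names in html."""
    return [tag_name(tag) for tag in find_tags(html)]

# Testing: each part first, then the whole, with varied inputs.
assert find_tags("no tags here") == []
assert find_tags("<b>bold</b>") == ["<b>", "</b>"]
assert tag_name("</div>") == "div"
assert tag_names('<div align="center"><br>') == ["div", "br"]
print("all tests passed")
```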
21 Readings
- IDT book, chapter 8, Text and Pattern Processing
- Further information (but beyond the scope of 101)
  - The Python online documentation on the re module
  - Regular Expression HOWTO by A.M. Kuchling at http://py-howto.sourceforge.net/ and also at http://py-howto.sourceforge.net/regex/regex.html