Title: An Introduction to Perl
1. An Introduction to Perl
- Sources and inspirations:
- http://www.cs.utk.edu/~plank/plank/classes/cs494/494/notes/Perl/lecture.html
- Randal L. Schwartz and Tom Christiansen, Learning Perl, 2nd ed., O'Reilly
- Randal L. Schwartz and Tom Phoenix, Learning Perl, 3rd ed., O'Reilly
- Dr. Nathalie Japkowicz, Dr. Alan Williams
Go O'Reilly!
CSI 3125, Perl, page 1
2. Perl overview (1)
- Perl: Practical Extraction and Report Language
- Perl: Pathologically Eclectic Rubbish Lister?
- It is a powerful general-purpose language, which is particularly
useful for writing quick and dirty programs.
- Invented by Larry Wall, with no apologies for its lack of elegance (!).
- If you know C and a fair bit of Unix (or Linux), you can learn Perl
in days (well, some of it...).
3. Perl overview (2)
- In the hierarchy of programming languages, Perl is located half-way
between high-level languages such as Pascal, C and C++, and shell
scripts (languages that add control structure to the Unix command-line
instructions) such as sh, sed and awk.
- By the way:
- awk: Aho, Weinberger, Kernighan
- sed: Stream Editor.
4. Advantages of Perl (1)
- Perl combines the best (according to its admirers) features of:
- Unix/Linux shell programming,
- the commands sed, grep, awk and tr,
- C,
- Cobol.
- Shell scripts are usually written in many small files that refer to
each other. Perl achieves the functionality of such scripts in a
single program file.
5. Advantages of Perl (2)
- Perl offers extremely strong regular expression capabilities, which
allow fast, flexible and reliable string handling operations,
especially pattern matching.
- As a result, Perl works particularly well in text processing
applications.
- As a matter of fact, it is Perl that allowed a lot of text documents
to be quickly moved to the HTML format in the early 1990s, allowing
the Web to expand so rapidly.
6. Disadvantages of Perl
- Perl is a jumble! It contains many, many features from many
languages and tools.
- It contains different constructs for the same functionality (for
example, there are at least 5 ways to perform a one-line if statement).
- It is not a very readable language.
- You cannot distribute a Perl program as an opaque binary. That is,
you cannot really commercialize products you develop in Perl.
7. Perl resources and versions
- http://www.perl.org tells you everything that you want to know about
Perl.
- What you will see here is Perl 5.
- Perl 5.8.0 was released in July 2002.
- Perl 6 (http://dev.perl.org/perl6/) is the next version, still under
development, but moving along nicely. The first book on Perl 6 is in
stores (http://www.oreilly.com/catalog/perl6es).
8. Scalar data: strings and numbers
- Scalars need not be defined or their types declared: Perl
understands from context.
    % cat hellos.pl
    #!/usr/bin/perl -w
    print "Hello" . " " . "world\n";
    print "hi there " . 2 . " worlds!" . "\n";
    print (("5" + 6) . " eggs\n" . " in " . " 3 + 2 = " . ("3" + "2") . " baskets\n" );
    % hellos.pl                  (invoke Perl)
    Hello world
    hi there 2 worlds!
    11 eggs
     in  3 + 2 = 5 baskets
9. Scalar variables
- Scalar variable names start with a dollar sign. They do not have to
be declared.
    % cat scalar.pl
    #!/usr/bin/perl -w
    $i = 1;
    $j = "2";
    print "$i and $j \n";
    $k = $i + $j;
    print "$k\n";
    print $i . $j . "\n";
    print '$k\n' . "\n";
    % scalar.pl
    1 and 2
    3
    12
    $k\n
10. Quotes and substitution
- Suppose $x = 3.
- Single-quotes ' ' allow no substitution except for the escape
sequences \\ and \'.
- print('$x\n'); gives $x\n and no new line.
- Double-quotes " " allow substitution of variables like $x and
control codes like \n (newline).
- print("$x\n"); gives 3 (and a new line).
- Back-quotes ` ` also allow substitution, then try to execute the
result as a system command, returning as the final value whatever the
system command outputs.
- $y = `date`; print($y); results in
- Sun Aug 10 07:04:17 EDT 2003
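The three quoting styles can be put side by side in one short script. This is only a sketch: the variable names are made up for the illustration, and the back-quote line assumes a Unix-like shell in which the echo command is available.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $x = 3;

# Single quotes: no substitution, so $x and \n stay literal.
my $single = 'x is $x\n';

# Double quotes: $x is substituted and \n becomes a newline.
my $double = "x is $x\n";

# Back-quotes: substitute, run the result as a system command,
# and return whatever the command prints (assumes echo exists).
my $shell = `echo x is $x`;
chomp($shell);    # remove the trailing newline of the command output

print $single, "\n";
print $double;
print "$shell\n";
```

The first line printed still contains the characters $x and \n, while the other two show the value 3.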
11. Control statements: if, else, elsif
    % cat names.pl
    #!/usr/bin/perl -w
    $name = <STDIN>;             # standard input
    chomp($name);                # cut newline
    if ($name gt 'fred') {
        print "'$name' follows 'fred'\n";
    } elsif ($name eq 'fred') {
        print "both names are 'fred'\n";
    } else {
        print "'$name' precedes 'fred'\n";
    }
    % names.pl
    Stan                         (my input)
    'Stan' precedes 'fred'       (Perl's output)
    % names.pl
    stan
    'stan' follows 'fred'
12. Control statements: loops (1)
    % cat oddsum_while.pl
    #!/usr/bin/perl -w
    # Add up some odd numbers
    $max = <STDIN>;
    $n = 1;
    while ($n <= $max) {
        $sum += $n;
        $n += 2;   # On to the next odd number
    }
    print "The total is $sum.\n";
    % oddsum_while.pl
    10                           (my input)
    Use of uninitialized value at oddnums.pl line 6, <STDIN> chunk 1.    (a warning)
    The total is 25.             (Perl's output)
13. Control statements: loops (2)
- End-line comments begin with #.
- It is okay, though not nice, to use a variable without
initialization (like $sum). Such a variable is initialized to 0 if it
is first used as a number or to the empty string "" if it is first
used as a string. In fact, it is always undef, variously converted.
- Perl can, if asked, issue a warning (use the -w flag).
- Of course, while is only one of many looping constructs in Perl.
Read on...
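The conversion of undef can be checked directly. The sketch below uses hypothetical variable names and switches off the "uninitialized" warning inside one block only, so that the two conversions run silently.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $count;    # declared but never assigned, so it holds undef
my $text;     # the same

print defined($count) ? "defined\n" : "undef\n";

my ($as_number, $as_string);
{
    # Silence the warning that -w would print, for this block only.
    no warnings 'uninitialized';
    $as_number = $count + 5;       # undef is used as the number 0
    $as_string = $text . "abc";    # undef is used as the empty string
}
print "$as_number $as_string\n";
```

Neither variable is changed by being used: both still hold undef after the block.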
14. Control statements: loops (3)
    % cat oddsum_until.pl
    #!/usr/bin/perl -w
    # Add up some odd numbers
    $max = <STDIN>;
    $n = 1;
    $sum = 0;
    until ($n > $max) {
        $sum += $n;
        $n += 2;   # On to the next odd number
    }
    print "The total is $sum.\n";
    % oddsum_until.pl
    10
    The total is 25.
15. Control statements: loops (4)
    % cat oddsum_for.pl
    #!/usr/bin/perl -w
    # Add up some odd numbers
    $max = <STDIN>;
    $sum = 0;
    for ($n = 1; $n <= $max; $n += 2) {
        $sum += $n;
    }
    print "The total is $sum.\n";
    % oddsum_for.pl
    10
    The total is 25.
- We also have do-while and do-until, and we have foreach. Read on.
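The do-while and do-until forms are mentioned but not shown. A possible version of the odd-number sum in both forms is sketched here; the bound is fixed at 10 instead of being read from <STDIN>, and the variable names $m and $sum2 are made up for the second loop.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $max = 10;    # fixed here instead of being read from <STDIN>

# do-while: the body runs once before the condition is first tested.
my ($n, $sum) = (1, 0);
do {
    $sum += $n;
    $n += 2;     # on to the next odd number
} while ($n <= $max);
print "The total is $sum.\n";

# do-until: the same loop with the condition inverted.
my ($m, $sum2) = (1, 0);
do {
    $sum2 += $m;
    $m += 2;
} until ($m > $max);
print "The total is $sum2.\n";
```

Because the test comes after the body, both loops add the first odd number even when the bound is smaller than 1, which the while and until loops would not do.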
16. Control statements: loops (5)
    % cat oddsum_foreach.pl
    #!/usr/bin/perl -w
    # Add up some odd numbers
    $max = <STDIN>;
    $sum = 0;
    foreach $n ( (1 .. $max) ) {
        if ( $n % 2 != 0 ) { $sum += $n; }
    }
    print "The total is $sum.\n";
    % oddsum_foreach.pl
    10
    The total is 25.
17. Control constructs compared
18. Lists and arrays
- A list is an ordered collection of scalars. An array is a variable
that contains a list.
- Each element is an independent scalar value. A list can hold
numbers, strings, undef values: any mixture of kinds of scalar values.
- To use an array element, prefix the array name with a $ and place a
subscript in square brackets.
- To access the whole array, prefix its name with a @.
- You can copy an array into another. You can use the operators sort,
reverse, push, pop, split.
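The operators just listed can be tried on a small array. This is a minimal sketch with made-up data and variable names.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# split turns one string into a list of words.
my @names = split(/\s+/, "Nathalie Frank hello John");

my @copy = @names;              # copying an array copies its elements
push(@copy, "Zebra");           # add one element at the end
my $last = pop(@copy);          # and take it off again

my @sorted   = sort(@names);    # the default order is string (ASCII) order
my @reversed = reverse(@names);

print "first element: $names[0]\n";    # one element: $ and a subscript
print "sorted: @sorted\n";             # whole array: @
print "reversed: @reversed\n";
print "size: ", scalar(@names), "\n";
```

Upper-case letters sort before lower-case ones, so hello ends up after Nathalie.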
19. Command-line arguments
- Suppose that a Perl program stored in the file cleanUp is invoked in
Unix/Linux with the command
    cleanUp -o result.htm data.htm
- The built-in list named @ARGV then contains three elements:
    ('-o', 'result.htm', 'data.htm')
- These three elements can be accessed as
    $ARGV[0] $ARGV[1] $ARGV[2]
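A script can walk through its arguments with an ordinary loop. In the sketch below the array @args and the fallback list are assumptions made so that the script also runs when started without arguments; a real program would simply use @ARGV.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Use the real arguments if there are any; otherwise pretend the
# script was started as:  cleanUp -o result.htm data.htm
my @args = @ARGV ? @ARGV : ('-o', 'result.htm', 'data.htm');

my @report;
for my $i (0 .. $#args) {       # $#args is the index of the last element
    push(@report, "argument $i is $args[$i]");
}
print "number of arguments: ", scalar(@args), "\n";
print "$_\n" foreach @report;
```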
20. Array examples (1)
    % cat arraysort.pl
    #!/usr/bin/perl -w
    $i = 0;
    while ($k = <STDIN>) {
        $a[$i++] = $k;
    }
    print " sorted \n";
    print sort(@a);
    % arraysort.pl
    Nathalie
    Frank
    hello
    John
    Zebra
    notary
    nil
    (control-D here)
     sorted
    Frank
    John
    Nathalie
    Zebra
    hello
    nil
    notary
21. Array examples (2A)
Reversing a text file (whole lines).
    % cat whole_rev.pl
    #!/usr/bin/perl -w
    while ($k = <STDIN>) {
        push(@a, $k);
    }
    print " reversed \n";
    while ($oldval = pop(@a)) {
        print $oldval;
    }
    % whole_rev.pl
    a b c d
    e f
    g h i
    (control-D here)
     reversed
    g h i
    e f
    a b c d
22. Array examples (2B)
Reversing each line in a text file.
    % cat each_rev.pl
    #!/usr/bin/perl -w
    while ($k = <STDIN>) {
        @a = split(/\s+/, $k);
        $s = "";
        for ($i = @a; $i > 0; $i--) {
            $s = "$s$a[$i-1] ";
        }
        chop($s);
        print "$s\n";
    }
    % each_rev.pl
    a bc d efg
    efg d bc a                   (output)
    hi j
    j hi                         (output)
    klm nopq st
    st nopq klm                  (output)
    (control-D)
split cuts the line on white space (we will see regular expressions
soon).
23. Array examples (3)
- Reversing a text file (whole lines)
    print reverse(<STDIN>);
- Reversing each line in a text file
    while ($k = <STDIN>) {
        $s = "";
        foreach $i (reverse(split(/\s+/, $k))) {
            $s = "$s$i ";
        }
        chop($s);
        print "$s\n";
    }
24. A digression: Perl's favourite default variable
By default, Perl reads into $_:
    while (<STDIN>) {
        $s = "";
        foreach $i (reverse(split(/\s+/, $_))) {
            $s = "$s$i ";
        }
        chop($s); print "$s\n";
    }
By default, Perl splits $_ too!
    while (<STDIN>) {
        $s = "";
        foreach $i (reverse(split(/\s+/))) {
            $s = "$s$i ";
        }
        chop($s); print "$s\n";
    }
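The default variable is used by more than the input operator. The sketch below makes two assumptions beyond the text above: foreach also falls back on $_ when no loop variable is named, and split with no arguments at all splits $_ on white space. The array @lines stands in for lines read from <STDIN>, and join is used to glue the words back together.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @lines = ("a bc d efg", "hi j");    # stands in for lines from <STDIN>
my @out;

foreach (@lines) {                     # no loop variable: each line is in $_
    my @words = split;                 # no arguments: split $_ on white space
    push(@out, join(" ", reverse(@words)));
}
print "$_\n" foreach @out;
```

With join there is no trailing blank to remove, so the chop step is not needed.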
25. Hashes
- A hash is similar to an array, but instead of subscripts, we can
have anything as a key, and we use curly brackets rather than square
brackets.
- The official name is associative array (known to be implemented by
hashing).
- Keys and values can be any scalars; keys are always converted to
strings.
- To refer to a hash as a whole, prefix its name with a %.
- If you assign a hash to an array, it becomes a simple list.
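These points can be checked on a small hash. The names and numbers below are made up for the illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %age = ("fred", 35, "wilma", 29);    # built from a key, value list
$age{"betty"} = 27;                     # curly brackets select one value
$age{2.5}     = "hello";                # the number 2.5 becomes the key "2.5"

my @flat  = %age;                       # a hash assigned to an array is a flat list
my $pairs = @flat / 2;                  # keys and values alternate in that list

my $found = exists $age{"2.5"} ? "yes" : "no";

foreach my $key (sort keys %age) {      # sort gives a predictable order
    print "$key => $age{$key}\n";
}
print "pairs: $pairs, string key 2.5 present: $found\n";
```

Without sort, keys returns the keys in an order that is not the order of insertion.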
26. Hash examples I (1)
    % cat hash_array.pl
    #!/usr/bin/perl -w
    %some_hash =
      ("foo", 35, "bar", 12.4, 2.5, "hello",
       "wilma", 1.72e30, "betty", "bye\n");
    @an_array = %some_hash;
    print "@an_array\n\n";
    foreach $key (keys %some_hash) {
        print "$key ";
        print delete $some_hash{$key};
        print "\n";
    }
27. Hash examples I (2)
    % hash_array.pl
    betty bye
    wilma 1.72e30 foo 35 2.5 hello bar 12.4

    betty bye
    wilma 1.72e30
    foo 35
    2.5 hello
    bar 12.4
28. Hash examples II
    % cat hash_arrows.pl
    #!/usr/bin/perl -w
    my %hash = ( "a" => 1, "b" => 2, "c" => 3);
    foreach $key (sort keys %hash) {
        $value = $hash{$key};
        print "$key => $value\n";
    }
    % hash_arrows.pl
    a => 1
    b => 2
    c => 3
29. A brief interlude: the diamond operator
    % cat concat
    #!/usr/bin/perl -w
    while ( <> ) {
        print $_;
    }
    % cat a
    one-a
    two-a
    % cat b
    three-b
    four-b
    five-b
    % concat a b
    one-a
    two-a
    three-b
    four-b
    five-b
<> loops over the files listed as command-line arguments; $_ is the
current input line.
    % concat a b > c
    % cat c
    one-a
    two-a
    three-b
    four-b
    five-b
30. Hash examples III: character frequency count
    % cat frequency.pl
    #!/usr/bin/perl -w
    while (<>) {
        # split $_ into single characters, loop
        foreach $c (split //) {
            # Increment count of $c
            $count{$c}++;
        }
    }
    # end of input, print count
    for $c (sort keys %count) {
        print "$c\t$count{$c}\n";
    }
31. Character frequency count (2)
    % frequency.pl
    Nathalie
    Fran
    hello
    John
    rather
    Notary
    F 1
    J 1
    ^D
    \n      8
    space   2
    1       2
    F       2
    J       2
    N       2
    a       5
    e       3
    h       4
    i       1
    l       3
    n       2
    o       3
    r       4
    t       3
    y       1
(The labels \n and space mark the counts of the newline and blank
characters.)
32. Subroutines
- A subroutine is a user-defined function. The syntax is very simple;
so is the semantics.
    #!/usr/bin/perl
    sub max {
        if ( $x > $y ) { $x }
        else { $y }
    }
    $x = 10; $y = 11;
    print &max . "\n";
- There are no arguments: the script accesses two global variables.
The subroutine call is marked with &. The value returned is that of
the last expression evaluated.
33. Subroutines (2)
- A few housekeeping rules.
- You can place your definitions anywhere in the file, though it is
recommended to have them at the beginning.
- Perl always uses the latest definition in the file: any preceding
one is ignored.
- Certain elements of the syntax are optional.
- The & might sometimes be omitted (but it is not a good idea).
- The return operator may precede a value to be returned (this can be
useful):
    if ( $x > $y ) { return $x; }
    else { return $y; }
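An early return is most useful when a subroutine has several exits. The helper sign below is hypothetical, written only for this illustration, and it receives its argument through the list @_.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# sign is a made-up helper: -1, 0 or 1 according to the sign of a number.
sub sign {
    my ($n) = @_;            # the argument arrives in the list @_
    return -1 if $n < 0;     # return leaves the subroutine at once
    return  1 if $n > 0;
    0;                       # otherwise: the last expression evaluated
}

print &sign(-7), " ", &sign(0), " ", &sign(42), "\n";
```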
34. Subroutines (3)
- Clearly, the use of global variables is much too limited.
Subroutines take arguments, and work on them via a predefined list
variable @_ or its elements $_[0], $_[1] and so on.
    #!/usr/bin/perl
    sub max {
        if ( $_[0] > $_[1] ) { $_[0] }
        else { $_[1] }
    }
    print &max ( 12, 13 ) . "\n";
35. Subroutines (4)
- $_[0], $_[1] are not fun to work with. We can rename them locally,
using the my operator: it creates a sub's private variables. Here, we
declare two such variables and right away initialize them.
    #!/usr/bin/perl
    sub max {
        my ( $a, $b ) = @_;
        if ( $a > $b ) { $a } else { $b }
    }
    print &max ( 15, 14 ) . "\n";
36. Subroutines (5)
- But this is not a safe max calculation.
    #!/usr/bin/perl
    sub max {
        my ( $a, $b ) = @_;
        if ( $a > $b ) { $a } else { $b }
    }
    print &max ( 16, 19, 23 ) . "\n";
    print &max ( 26 ) . "\n";
- This produces 19 (23 gets ignored) and 26 (the second value is
undef, that is, 0).
37. Subroutines (6)
- We could stop the subroutine if the number of arguments is wrong.
The (generally very useful!) operator die does that for us.
    #!/usr/bin/perl
    sub max {
        if ( @_ != 2 ) {
            die "max needs two arguments: @_\n";
        }
        my ( $a, $b ) = @_;
        if ( $a > $b ) { $a } else { $b }
    }
    print &max ( 16, 19, 23 ) . "\n";
The script is stopped after printing this:
    max needs two arguments: 16 19 23
38. Subroutines (7)
- We can have just a warning, if we use the operator warn instead.
    #!/usr/bin/perl
    sub max {
        if ( @_ != 2 ) {
            warn "max needs two arguments: @_\n";
        }
        my ( $a, $b ) = @_;
        if ( $a > $b ) { $a } else { $b }
    }
    print &max ( 16, 19, 23 ) . "\n";
The script prints this:
    max needs two arguments: 16 19 23
    19
39. Subroutines (8)
- It is, by the way, not a bad idea to generalize max by allowing it
to take any number of arguments.
    #!/usr/bin/perl
    sub max {
        my ( $curr_max ) = shift @_;
        foreach ( @_ ) {
            if ( $_ > $curr_max ) { $curr_max = $_; }
        }
        $curr_max;
    }
    print &max ( 15, 14 ) . "\n";
    print &max ( 16, 19, 23 ) . "\n";
    print &max ( 26 ) . "\n";
40. Subroutines (9)
- This even works for empty lists.
    #!/usr/bin/perl
    sub max {
        my ( $curr_max ) = shift @_;
        foreach ( @_ ) {
            if ( $_ > $curr_max ) { $curr_max = $_; }
        }
        $curr_max;
    }
    $z = &max ( );
    if ( defined $z ) { print $z . "\n"; }
    else { print "undefined\n"; }
41. Regular expressions (1)
- A regular expression (also called a pattern) is a template that
describes a class of strings. A string can either match or not match
the pattern.
- The simplest pattern is one character.
- A character class (the pattern matches any of these characters) is
written in square brackets:
    [01234567]    an octal digit
    [0-7]         an octal digit
    [0-9A-F]      a hex digit
    [^A-Za-z]     not a letter (^ "negates")
    [0-9-]        a decimal digit or a minus
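Each class can be tried against a single character with the match operator =~. The test characters and variable names below are chosen only for the illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Every test string is one character long, so the answer depends
# only on whether that character belongs to the class.
my $octal_7    = ("7" =~ /[0-7]/)     ? "yes" : "no";
my $octal_8    = ("8" =~ /[0-7]/)     ? "yes" : "no";
my $hex_C      = ("C" =~ /[0-9A-F]/)  ? "yes" : "no";
my $nonalpha_q = ("q" =~ /[^A-Za-z]/) ? "yes" : "no";
my $nonalpha_x = ("!" =~ /[^A-Za-z]/) ? "yes" : "no";
my $minus      = ("-" =~ /[0-9-]/)    ? "yes" : "no";

print "7 octal: $octal_7, 8 octal: $octal_8, C hex: $hex_C\n";
print "q not a letter: $nonalpha_q, ! not a letter: $nonalpha_x\n";
print "- digit or minus: $minus\n";
```

In the last class the minus stands at the end, so it is taken literally and not as part of a range.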
42. Regular expressions (2)
- Metacharacters
    .        (dot) any character except \n
- Anchors
    ^        the beginning of a string
    $        the end of a string
- Multipliers
    *        repeat the preceding item 0 or more times
    +        repeat the preceding item 1 or more times
    ?        make the preceding item optional
    {n}      repeat n times
    {n,m}    repeat n to m times (n <= m)
    {n,}     repeat n or more times
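The multipliers are easiest to compare when the whole string is pinned down with both anchors. In this sketch the test words and the hash %hits are made up; each word is checked against five patterns of the form "some number of a, then b, and nothing else".

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %hits;    # word => the multipliers whose pattern matched it
foreach my $word ("b", "ab", "aab", "aaab", "xaab") {
    my @matched;
    push(@matched, 'a*')    if $word =~ /^a*b$/;       # zero or more a
    push(@matched, 'a+')    if $word =~ /^a+b$/;       # one or more a
    push(@matched, 'a?')    if $word =~ /^a?b$/;       # an optional a
    push(@matched, 'a{2}')  if $word =~ /^a{2}b$/;     # exactly two a
    push(@matched, 'a{2,}') if $word =~ /^a{2,}b$/;    # two or more a
    $hits{$word} = "@matched";
    print "$word: $hits{$word}\n";
}
```

The word xaab matches none of the patterns because ^ ties the first a to the very start of the string.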
43. Regular expressions (3)
- The Boolean operator =~ tries to match a string with a regular
expression written inside slashes.
    $x = "01239876AGH";
    if ( $x =~ /0[1-9]{4,}/ )
        { print "yes1\n"; }
    if ( $x =~ /[A-Z]{3}/ )
        { print "yes2\n"; }
    if ( $x =~ /.*[A-Z]{4}/ )
        { print "yes3\n"; }
44. Regular expressions (4)
- Patterns can be grouped by parentheses (the whole pattern becomes
one item). Alternative is denoted by the bar |.
    $x = "01239876AGH";
    if ( $x =~ /([0-9]{4}|[A-Z]{3}){2,}/ )
        { print "yes4\n"; }
    if ( $x =~ /(0?4)|(5[1abc]{1,})/ )
        { print "yes5\n"; }
45. Regular expressions (5)
- The precedence of pattern elements (highest first):
    parentheses      ( )
    multipliers      * + ? {n} {n,m} {n,}
    sequence, anchors
    alternation      |
- Some character classes are predefined:
                                  class    not class
    digit                         \d       \D
    word char [a-zA-Z0-9_]        \w       \W
    whitespace                    \s       \S
- Some additional anchors:
    word boundary                 \b       \B
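The low precedence of alternation matters most in combination with anchors. In the sketch below (names chosen for the illustration) the first pattern reads as "fred at the start, or barney at the end", because each anchor binds to its own alternative; the parentheses in the second pattern make the anchors apply to both names.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my (%loose, %strict);
foreach my $name ("fred", "barney", "freddy", "big barney") {
    # Without parentheses: (^fred) or (barney$).
    $loose{$name}  = ($name =~ /^fred|barney$/)   ? "yes" : "no";
    # With parentheses: the whole string must be fred or barney.
    $strict{$name} = ($name =~ /^(fred|barney)$/) ? "yes" : "no";
    print "$name: loose=$loose{$name} strict=$strict{$name}\n";
}
```

Only freddy and big barney tell the two patterns apart: both pass the loose pattern and fail the strict one.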
46. Regular expression examples (1)
    $i = "Jim";
                       match?
    $i =~ /Jim/        yes
    $i =~ /J/          yes
    $i =~ /j/          no
    $i =~ /j/i         yes
    $i =~ /\w/         yes
    $i =~ /\W/         no
- Case is ignored in matching if the postfix i is used.
47. Regular expression examples (2)
    $j = "JjJjJjJj";
    $j =~ /j*/         yes   matches anything
    $j =~ /j+/         yes   matches the first j
    $j =~ /j?/         yes   matches the empty string at the start (j is optional)
    $j =~ /j{2}/       no
    $j =~ /j{2}/i      yes   ignores case
    $j =~ /(Jj){3}/    yes
48. Regular expression examples (3)
    $k = "Boom Boom, out go the lights!";
    $k =~ /Jim|Boom/         yes   matches Boom
    $k =~ /(Boom){2}/        no    a space between Booms
    $k =~ /(Boom ){2}/       no    fails on the comma
    $k =~ /(Boom\W){2}/      yes   \W is space, comma
    $k =~ /\bBoom\b/         yes
    $k =~ /\bBoom.*the\b/    yes
    $k =~ /\Bgo\B/           no    "go" is a complete word
    $k =~ /\Bgh\B/           yes   the "gh" inside "lights"
49. Regular expression substitution (1)
- We can modify a string variable by applying a substitution.
- The operator is =~ and the substitution is written as
    s/pattern1/pattern2/
    $v = "a string to play with";
    $v =~ s/\w+/just a single/;
    print "$v\n";
    just a single string to play with
50. Regular expression substitution (2)
- Matched patterns are remembered in built-in variables $1, $2, $3
etc. These variables keep their values till the next successful
matching operation.
- Each set of parentheses in a pattern corresponds to a "memory"
variable.
    $v = "just a single string to play with";
    $v =~ s/(\b\w+\b)(.*)/'$1'$2/;
    print "$v\n";
    print "$2, $1 $1\n";
    'just' a single string to play with
     a single string to play with, just just
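Memory variables are handy for rearranging the parts of a string. In this sketch the input string and the variable names are made up; the values of $1 and $2 are copied at once, since a later successful match would overwrite them.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $name = "Schwartz, Randal";

my ($last, $first) = ("", "");
if ($name =~ /^(\w+), (\w+)$/) {
    ($last, $first) = ($1, $2);    # copy before the next match replaces them
}

# The same memories can be used inside the replacement part.
my $swapped = $name;
$swapped =~ s/^(\w+), (\w+)$/$2 $1/;

print "first: $first, last: $last\n";
print "$swapped\n";
```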
51. Regular expression substitution (3)
- A substitution can be applied to all occurrences of the pattern,
that is, globally:
    s/pattern1/pattern2/g
    $v = "'just' a single string to play with";
    $v =~ s/\b\w+\b/word/g;
    print "$v\n";
    'word' word word word word word word
    $v =~ s/\b\w+\b$/last/;
    print "$v\n";
    'word' word word word word word last
52. Regular expression substitution (4)
Parentheses as memory can help construct powerful patterns with
"instant repetition". We can use \1, \2 etc. for matched substrings.
    $v = "This is a double double word.";
    $v =~ s/(\b\w+\b) \1/\1/;
    print "$v\n";
    This is a double word.
    $v = "This is a triple triple triple word.";
    $v =~ s/(\b\w+\b) \1 \1/\1/;
    print "$v\n";
    This is a triple word.
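Combined with the g modifier, the same idea removes every doubled word in one pass. This sketch uses a made-up sentence; it writes $1 in the replacement part, because perl -w reports \1 there as better written $1, and keeps \1 inside the pattern, where it is the right form.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text  = "Paris in the the spring is is nice";
my $fixed = $text;

# (\w+) remembers a word; \1 demands the same word again after a blank.
# The final \b keeps "the then" from being treated as a doubled "the".
$fixed =~ s/\b(\w+) \1\b/$1/g;

print "$fixed\n";
```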
53. Regular expression substitution (5)
Here is a more realistic example (last year's homework). This one
really needs explanations in class, so please ask.
    $Day = '0[1-9]|[12][0-9]|3[01]|[1-9]';
    $Month = '0[1-9]|1[012]|[1-9]';
    # Year: a number up to 31 must have a leading zero or two.
    $Year = '[0-9]{4}|[0-9]{3}|3[2-9]|[4-9][0-9]';
    while (<>) {
        # Find all dates, selecting and reinserting the context.
        # $1 and $6 match the context. Superfluous digits,
        # as 43 and 55 in 432001-01-2255, belong in the context.
        # "Dates" such as April 31 or February 30 are allowed.
        # There are no provisions for leap years.
        s/(\D)(($Year)-($Month)-($Day))(\D.)/$1<date>$2<\/date>$6/g;
        s/(\D)(($Day)-($Month)-($Year))(\D.)/$1<date>$2<\/date>$6/g;
        print $_;
    }
54. Regular expression substitution (6)
One example run, to show how it works.
DATA
    Both 12-09-2000 and 25-8-324 are good dates,
    but 30-14-1955 and 10-10-10 are not. OTOH, 10-10-010 is.
RESULTS
    Both <date>12-09-2000</date> and <date>25-8-324</date> are good dates,
    but 30-14-1955 and 10-10-10 are not. OTOH, <date>10-10-010</date> is.
55. In another course?
- Predefined variables (lots!)
- More on lists, arrays and hashes
- More on regular expressions
- File management
- Directory management
- Process management
- Perl database facilities
- CGI programming
- ... and more, and much more
56. Mistakes that novices make (1)
Thanks to Alan Williams for this list.
- Adapted from Programming Perl, page 361.
- Testing "all-at-once" instead of incrementally, either bottom-up or
top-down.
- Optimistically skipping print scaffolding to dump values and show
progress.
- Not running with the perl -w switch to catch obvious typographical
errors.
- Leaving off $ or @ or % from the front of a variable.
- Forgetting the trailing semicolon.
- Forgetting curly braces around a block.
57. Mistakes that novices make (2)
- Unbalanced (), [], {}, "", '', ``, and sometimes <>.
- Confusing '' and "", or / and \.
- Using == instead of eq, != instead of ne, = instead of ==, and so on.
- ('White' == 'Black') and ($x = 5) evaluate as (0 == 0) and (5) and
thus are true!
- Using "else if" instead of "elsif".
- Putting a comma after the file handle in a print statement.
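The comparison mistake is easy to reproduce. In the sketch below the variable names are made up, and the "numeric" warning is switched off inside one block so that the wrong comparison runs silently, as it would without -w.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my ($x, $y) = ('White', 'Black');

my $numeric_equal;
{
    no warnings 'numeric';    # -w would complain that the strings are not numbers
    # == compares numbers; both strings convert to 0, so the test succeeds.
    $numeric_equal = ($x == $y) ? "true" : "false";
}

# eq compares strings, which is what was meant.
my $string_equal = ($x eq $y) ? "true" : "false";

print "== says $numeric_equal, eq says $string_equal\n";
```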
58. Mistakes that novices make (3)
- Not chopping the output of backquotes `date` or not chopping input:
    print "Enter y to proceed: ";
    $ans = <STDIN>;
    chop $ans;
    if ($ans eq 'y') { print "You said y\n"; }
    else { print "You did not say 'y'\n"; }
- Forgetting that Perl array subscripts and string indexes normally
start at 0, not 1.
- Using $_, $1, or other side-effect variables, then modifying the
code in a way that unknowingly affects or is affected by these.
- Forgetting that regular expressions are greedy, seeking the longest
match not the shortest match.
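Greediness shows up as soon as a pattern contains .+ or .* between two delimiters. The sketch below uses a made-up string with two tags; the second substitution avoids the problem with a negated character class, which is one possible remedy among several.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $html = "<b>bold</b> and <i>italic</i>";

# Greedy: .+ runs on to the LAST > in the string.
my $greedy = $html;
$greedy =~ s/<.+>/[TAG]/;

# [^>]+ cannot step over a >, so the match ends at the first one.
my $careful = $html;
$careful =~ s/<[^>]+>/[TAG]/;

print "$greedy\n";
print "$careful\n";
```

The greedy version replaces the whole line, not just the first tag.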