Perl - PowerPoint PPT Presentation


About This Presentation

Title:

Perl


Slides: 14

Provided by: tri5274

Tags: gibberish | grep | perl | slashes

Transcript and Presenter's Notes

1
Perl 3 Regular Expressions

2
Patterns in Biology

3
Computers are good at finding Patterns

Find command in your word processor, Find File in your computer's operating system
Based on an underlying concept called a Regular Expression (regexp)
a regexp is a text string, such as aatcg
can also have variable characters: aa[ta]cg
or a wildcard: aaxcg
or a variable spacer: aax(1-20)cg

4
grep

grep is a handy Unix tool for searching text with regular expressions
it is a powerful and moderately complex tool (has one of the longest man pages in the online Unix help system)
does not require its own O'Reilly book, but is a solid chapter in intro and intermediate Unix/Linux books

5
Perl Regular Expressions

Perl regular expressions are more complex and more powerful than grep
Can find and substitute bits of text in a single command
Various options for fuzzy matches
Perl regular expressions can get extremely complex, well beyond the scope of this course
> man perlrequick

6
The Match Operator =~ / /

Perl uses a special type of operator to do text matching with regular expressions
=~ / /
The =~ symbol is a pattern match comparison operator
- it can be translated as "contains"
The forward slashes contain the pattern to be matched, like this
print "EcoRI site found!" if $dna =~ /GAATTC/;
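
The match on this slide can be tried as a complete script. This is only a minimal sketch: the sequence in $dna is made up for illustration and does not come from the slides.

```perl
use strict;
use warnings;

# Hypothetical example sequence (an assumption, not from the slides).
my $dna = "ATGCGAATTCGGA";

# =~ tests the string on the left against the pattern between the slashes.
my $found = ($dna =~ /GAATTC/) ? 1 : 0;

print "EcoRI site found!\n" if $found;
```

The pattern can sit anywhere inside the string; the match does not have to start at the first base.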

7
Alternative Characters

Square brackets within the match expression allow for alternative characters
if $dna =~ /GGG[GATC]CCC/
This will match a DNA string that contains GGG, then G, A, T, or C in the 4th position, followed by CCC
A vertical line within the /expression/ allows you to look for either of two completely different patterns
if $dna =~ /GAATTC|AAGCTT/
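
Both ideas on this slide, square brackets and the vertical line, can be checked with a few short strings. The test sequences below are invented for this sketch; AAGCTT is the HindIII recognition site.

```perl
use strict;
use warnings;

# Hypothetical test sequences, chosen only for illustration.
my $with_a  = "TTGGGACCCTT";   # A in the 4th position after GGG
my $with_n  = "TTGGGNCCCTT";   # N is not one of G, A, T, C
my $hindiii = "CCAAGCTTCC";    # contains AAGCTT but not GAATTC

# Square brackets: exactly one character from the listed set.
my $class_a = ($with_a =~ /GGG[GATC]CCC/) ? 1 : 0;
my $class_n = ($with_n =~ /GGG[GATC]CCC/) ? 1 : 0;

# Vertical line: either the pattern on the left or the one on the right.
my $either  = ($hindiii =~ /GAATTC|AAGCTT/) ? 1 : 0;

print "class with A: $class_a, class with N: $class_n, either site: $either\n";
```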

8
Wildcards

Perl has a set of wildcard characters for Reg. Exps. that are completely different from the ones used by the Unix shell
the dot (.) matches any character
\d matches any digit (a number from 0-9)
\w matches any text character (a letter, number, or underscore, not punctuation or space)
\s matches a white space character (space, tab, or newline); \s+ matches any amount
^ matches the beginning of a line
$ matches the end of a line
(Yes, this is very confusing!)
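
A small script can make the wildcards less confusing by testing each one against the same line. The line of text below is made up for this sketch and is not from the slides.

```perl
use strict;
use warnings;

# Hypothetical line of text, invented for this sketch.
my $line = "gene42 on chr7";

my $has_digit   = ($line =~ /\d/)     ? 1 : 0;  # any digit anywhere in the line
my $any_char    = ($line =~ /chr./)   ? 1 : 0;  # dot: any one character after "chr"
my $word_space  = ($line =~ /\w\s\w/) ? 1 : 0;  # text char, white space, text char
my $starts_gene = ($line =~ /^gene/)  ? 1 : 0;  # ^ ties the pattern to the start
my $ends_chr7   = ($line =~ /chr7$/)  ? 1 : 0;  # $ ties the pattern to the end
my $starts_chr  = ($line =~ /^chr/)   ? 1 : 0;  # "chr" is present, but not at the start

print "starts with gene: $starts_gene, starts with chr: $starts_chr\n";
```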

9
Repeat for a count

Use curly brackets to show that a character repeats a specific number (or range) of times
find an EcoRI fragment of 100-500 bp length (two EcoRI sites with any other sequence between)
if $ecofrag =~ /GAATTC[GATC]{100,500}GAATTC/
The + sign is used to indicate an unlimited number of repeats (occurs 1 or more times)
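
The repeat counts can be tested without typing out hundreds of bases by building the fragment with Perl's x (repetition) operator. The fragments here are synthetic, assumed only for this sketch.

```perl
use strict;
use warnings;

# Hypothetical fragment: two EcoRI sites with 120 bases (30 x "ACGT") between them.
my $ecofrag = "GAATTC" . ("ACGT" x 30) . "GAATTC";
my $in_range = ($ecofrag =~ /GAATTC[GATC]{100,500}GAATTC/) ? 1 : 0;

# Only 8 bases between the sites, so the {100,500} count cannot be met.
my $short_frag = "GAATTC" . "ACGTACGT" . "GAATTC";
my $too_short = ($short_frag =~ /GAATTC[GATC]{100,500}GAATTC/) ? 1 : 0;

# + means one or more of the character just before it.
my $poly_a = ("CCAAAAAAGG" =~ /CA+G/) ? 1 : 0;

print "120 bp spacer: $in_range, 8 bp spacer: $too_short, CA+G: $poly_a\n";
```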

10
It gets worse

What if you need to match text that contains a special character?
(the dot shows up all the time in GenBank IDs, filenames, etc.)
Now you have to use a backslash (\) to escape the wildcard meaning of that character
if $seqname =~ /\w+\.\d/
- This would match any sequence ID that has some text characters, then a dot, followed by a single digit, e.g. M65783.2
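
The effect of the backslash shows up when the same ID is written without a dot. In this sketch the second ID, M65783-2, is invented purely as a counterexample.

```perl
use strict;
use warnings;

my $seqname = "M65783.2";   # the ID used on the slide
my $dashed  = "M65783-2";   # hypothetical ID with a dash instead of a dot

# Escaped dot: only a literal "." is accepted between the text and the digit.
my $escaped_ok   = ($seqname =~ /\w+\.\d/) ? 1 : 0;
my $escaped_dash = ($dashed  =~ /\w+\.\d/) ? 1 : 0;

# Unescaped dot: any character is accepted there, including the dash.
my $plain_dash   = ($dashed  =~ /\w+.\d/)  ? 1 : 0;

print "escaped on dot: $escaped_ok, escaped on dash: $escaped_dash, unescaped on dash: $plain_dash\n";
```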

11
Grabbing parts of a string

Regular expressions can do more than just ask "if" questions
They can be used to extract parts of a line of text into variables. Check this out:
/^>(\w+)\s(.+)$/
Complete gibberish, right?
It means
- look for the > sign at the beginning of a FASTA formatted sequence file
- dump the first word (\w+) into variable $1 (the sequence ID)
- after a space, dump the rest of the line (.+), until you reach the end of line $, into variable $2 (the description)
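
The capture pattern can be run against a single FASTA header line. This is a minimal sketch; the header text is invented, and the names $id and $desc are just illustrative choices.

```perl
use strict;
use warnings;

# Hypothetical FASTA header line (an assumption, not from the slides).
my $header = ">M65783 Homo sapiens example gene";

my ($id, $desc);
if ($header =~ /^>(\w+)\s(.+)$/) {
    # Each pair of parentheses fills the next numbered variable.
    ($id, $desc) = ($1, $2);
}

print "ID: $id\n";
print "Description: $desc\n";
```

Because \w does not match a dot, a versioned ID such as M65783.2 would not be captured by this exact pattern; something like (\S+) in place of (\w+) is one possible way to allow it.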

12
You can also do Substitution