Title: Computer Programming for Biologists
Class 4, Feb 10th, 2012
Karsten Hokamp
http://bioinf.gen.tcd.ie/GE3027
Overview
- Revision
- Project
- Loop control
- Regular Expressions
Revision: program components
- expressions
  - 42
  - $base eq 'T'
  - num 1 0
- statements
  - $seq = 'atgaacgt';
  - print "hello world!\n";
- operators
  - +, -, *, /, …
- built-in functions
  - print, shift, length, …
- key words
  - foreach, while, if, …
Revision: Scalars
- most basic variable type
- indicated by dollar sign ($)
- assign value (right to left)
  - $sequence = 'atgaacctctac';
  - $repeat = 5;
  - $in = <>;
- default: $_
Revision: Arrays
- ordered list of elements
- indicated by at symbol (@)
- index starts at 0
- @letters = ('a'..'z') -> ('a', 'b', 'c', 'd', 'e', …, 'x', 'y', 'z')
- use $ and [] when working on individual elements
  - $letters[0] = 'A';
  - $first = $letters[0];
  - $last = $letters[-1];
- default: @ARGV
Index: 0 1 2 3 4 … 23 24 25
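The array operations on this slide can be tried out in a short script. This is a minimal sketch that reuses the variable names from the slide; the `$count` variable and the printed line are additions for illustration only.

```perl
use strict;
use warnings;

# Build the array of lowercase letters with the range operator.
my @letters = ('a' .. 'z');

# Single elements are scalars, so they take the $ sigil and a [] index.
$letters[0] = 'A';             # overwrite the first element
my $first = $letters[0];       # index 0 is the first element
my $last  = $letters[-1];      # index -1 is the last element
my $count = scalar @letters;   # an array in scalar context gives its length

print "first=$first last=$last count=$count\n";
```

Negative indices count from the end of the array, so `$letters[-1]` stays correct even when the array grows or shrinks.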
Revision: built-in functions
    $upper       = uc($in);
    $upper_first = ucfirst($in);
    $backwards   = reverse($in);
    $len         = length($sequence);
    $num         = scalar @ARGV;
    $file        = shift @ARGV;
    push @prot, $aa;
    @bases       = split //, $seq;
    $out         = join '_', @letters;
    @found       = keys %found;
    @order       = sort @letters;
    @out         = sort { $a <=> $b } @nums;
    $codon       = substr $seq, 0, 3, '';
    print "Hello world!\n";
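A few of these functions behave in ways that are easy to get wrong, so it can help to run them on a small input. The sketch below assumes a made-up sequence `atgaacgt`; the variable names `$joined` and `@sorted` are illustrative and do not come from the slide.

```perl
use strict;
use warnings;

my $seq = 'atgaacgt';

my $upper     = uc($seq);                  # 'ATGAACGT'
my $backwards = reverse($seq);             # scalar assignment reverses the string: 'tgcaagta'
my $len       = length($seq);              # 8
my @bases     = split //, $seq;            # empty pattern: one element per character
my $joined    = join '_', @bases[0 .. 2];  # 'a_t_g'

# The default sort compares strings; { $a <=> $b } compares numbers.
my @sorted = sort { $a <=> $b } (10, 9, 100);   # (9, 10, 100)

# Four-argument substr returns the first 3 characters AND replaces them
# with '' in $seq, so $seq becomes shorter.
my $codon = substr $seq, 0, 3, '';         # 'atg', $seq is now 'aacgt'

print "$upper $backwards $len $joined @sorted $codon $seq\n";
```

Note that `reverse` only reverses a string when it is used in scalar context, as in the assignment above; in list context (for example directly after `print`) it reverses the order of its arguments instead.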
Revision: structures
- branching
    if ($base eq 't') {
        $base = 'u';
    }
  -> also: unless ()
    if ($base eq 't') {
        $base = 'a';
    } elsif ($base eq 'a') {
        $base = 't';
    } else {
        ...
    }
- loops
    while ($in = <>) {
        $out .= uc($in);
    }
  -> also: until ()
    foreach $element (@list) {
        $i++;
        print "$i) $element";
    }
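Branching and looping are usually combined. As a small illustration (not taken from the slides), the sketch below walks over the bases of a made-up sequence with `foreach`, picks the complementary base with `if`/`elsif`/`else`, and then counts with an `until` loop. The fallback to `n` for unknown characters is an assumption made for this example.

```perl
use strict;
use warnings;

my $seq        = 'atgcx';
my $complement = '';

foreach my $base (split //, $seq) {
    if    ($base eq 'a') { $complement .= 't'; }
    elsif ($base eq 't') { $complement .= 'a'; }
    elsif ($base eq 'c') { $complement .= 'g'; }
    elsif ($base eq 'g') { $complement .= 'c'; }
    else                 { $complement .= 'n'; }   # anything else becomes 'n'
}

# until runs the block as long as the condition is false.
my $i = 0;
until ($i >= 3) {
    $i++;
}

print "$complement $i\n";
```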
Revision: conditions
- Representative values
- Examples
  - while (1) { ... }          endless loop
  - if ($text) { ... }         true if $text is neither '' nor 0
  - if (@rows) { ... }         true if array is not empty
  - until ($i > 100) { ... }   comparison or expression
  - while ($out = substr $seq, 0, 3, '') { ... }
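True and False
- false -> 0 or empty string ('') (also the string '0' and undef)
- true -> any other value (comparisons return 1 by default)
- Comparisons are evaluated after arithmetic: comparison operators have lower precedence than +, -, * and /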
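The truth rules can be checked directly. The following sketch uses a hypothetical helper `truth` (not part of Perl or of the slides) that reports how Perl evaluates a value in a condition, and it also shows what a comparison returns.

```perl
use strict;
use warnings;

# Hypothetical helper: report how Perl evaluates a value in a condition.
sub truth {
    my ($value) = @_;
    return $value ? 'true' : 'false';
}

my @empty = ();
my %result = (
    zero        => truth(0),
    empty       => truth(''),
    zero_string => truth('0'),
    zero_float  => truth('0.0'),           # a non-empty string other than '0' is true
    text        => truth('atg'),
    empty_array => truth(scalar @empty),   # an empty array has length 0
);

my $cmp_true  = (5 > 3);      # 1
my $cmp_false = (5 < 3);      # empty string
my $sum_first = (2 + 3 > 4);  # arithmetic happens first: 5 > 4 gives 1

print "$_: $result{$_}\n" foreach sort keys %result;
```

The string `'0.0'` is a common surprise: it is numerically zero but still counts as true, because only the exact string `'0'` and the empty string are false.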
Revision: data input
- word(s) from list of command line parameters
  - $in = shift;
- equivalent to the following (outside of a subroutine)
  - $in = shift @ARGV;    (@ARGV holds the command line arguments)
Data Input/Output
- Reading from STDIN, default input stream
  - $in = <>;
- equivalent to
  - $in = <STDIN>;
- Perl reads from the file(s) instead if command line argument(s) are present
  - perl prog.pl input.txt
- without arguments, <> falls back to STDIN
Programming Strategy
- start with pseudo code (comments)
- code small bits and run
- watch for warnings and errors
- dare to try things out
- check Perl documentation
Project
- Implement the following in a program
- Print a welcome message
- Read input from a file
- Separate header from sequence
- Report length of sequence
- Make sequence all upper case
- Reverse-complement the sequence
- Reformat sequence into 60 bp width
- Provide position numbers at each line
- Go to http://bioinf.gen.tcd.ie/GE3027/class3
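Two of the project steps, reverse-complementing and reformatting into numbered 60 bp lines, can be outlined as small subroutines. This is one possible sketch and not the course's reference solution: the subroutine names `reverse_complement` and `format_sequence` are hypothetical, the input is assumed to be upper-case A, C, G and T only, and the transliteration operator `tr///` is a shortcut that has not been introduced in these slides.

```perl
use strict;
use warnings;

# Reverse-complement a DNA string (upper-case A, C, G, T assumed).
sub reverse_complement {
    my ($dna) = @_;
    my $rc = reverse $dna;      # scalar context: reverses the string
    $rc =~ tr/ACGT/TGCA/;       # swap each base for its complement
    return $rc;
}

# Cut a sequence into lines of $width bases, each prefixed with the
# position of its first base.
sub format_sequence {
    my ($dna, $width) = @_;
    my @lines;
    my $pos = 1;
    while (my $chunk = substr($dna, 0, $width, '')) {
        push @lines, sprintf('%6d %s', $pos, $chunk);
        $pos += length $chunk;
    }
    return @lines;
}

# Separate the header from the sequence in a small made-up FASTA record.
my $fasta = ">test sequence\natgaacctctac\nggt\n";
my ($header, @rest) = split /\n/, $fasta;
my $seq = uc join '', @rest;

print "Header: $header\n";
print 'Length: ', length($seq), "\n";
print "$_\n" foreach format_sequence(reverse_complement($seq), 60);
```

The `while` loop relies on the four-argument `substr` from the revision slide: each pass removes one chunk from the copy of the sequence, and the loop ends when nothing is left.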
Project
Exercise: http://bioinf.gen.tcd.ie/GE3027/class4
Structures: Breaks
- Ways of breaking the loops
  - next: continues with the next iteration of the loop
  - last: continues after the loop
  - exit: exits the program
- example
    print "Type y to continue, q to quit";
    while (<>) {
        chomp;                  # remove the trailing newline before comparing
        if ($_ eq 'y') {
            last;
        } elsif ($_ eq 'q') {
            exit;
        } else {
            next;
        }
    }
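The same loop controls work in any loop, not only when reading from the keyboard. The sketch below runs over a made-up list of lines, skipping empty lines and comments with `next` and stopping at an assumed end marker `END` with `last`.

```perl
use strict;
use warnings;

my @lines = ('atgc', '', '# comment', 'ggta', 'END', 'ttaa');
my @kept;

foreach my $line (@lines) {
    next if $line eq '';                  # skip empty lines
    next if substr($line, 0, 1) eq '#';   # skip comment lines
    last if $line eq 'END';               # stop at the end marker
    push @kept, $line;
}

print "@kept\n";
```

Everything after the `last` is never visited, so `ttaa` is not kept even though it is a valid line.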
Regular Expressions
- constructs that describe patterns
- powerful methods for text processing
- search for patterns in a string
- search and extract patterns
- search and replace patterns
Regular Expressions
- Examples
- Look for a motif in a DNA/protein sequence
- Find low complexity repeats and mask with x's
- Find start of sequence string in GenBank record
- Extract e-mail addresses from a web-page
- Replace strings, e.g. @tcd.ie with @gmail.com
Regular Expressions
Find a pattern in a string (stored in a variable):
    $sequence = 'ataggctagctaga';
    if ($sequence =~ /ctag/) {
        print "Found!";
    }
- $sequence: string in which to search
- =~: binding operator
- / /: delimiters
- ctag: pattern
Without binding // to a variable, the regular expression works on $_.
Regular Expressions
Search modifier i: make search case-insensitive
    $sequence = 'ataggctagctaga';
    if ($sequence =~ /TAG/i) {
        print "Found!";
    }
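The matching examples above can be run side by side to see the effect of the binding operator, the default variable `$_` and the `i` modifier. This is a minimal sketch; the three short strings in the `foreach` list are made up for illustration.

```perl
use strict;
use warnings;

my $sequence = 'ataggctagctaga';

my $found_ctag  = ($sequence =~ /ctag/) ? 'yes' : 'no';   # yes
my $found_upper = ($sequence =~ /TAG/)  ? 'yes' : 'no';   # no: matching is case-sensitive
my $found_any   = ($sequence =~ /TAG/i) ? 'yes' : 'no';   # yes: i ignores case

# Without a binding operator the pattern is applied to $_,
# which foreach sets to each element in turn.
my @hits;
foreach ('atgctagg', 'ttttaaaa', 'cctagc') {
    push @hits, $_ if /ctag/;
}

print "$found_ctag $found_upper $found_any @hits\n";
```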
Regular Expressions
Metacharacters:
- ^ match at the beginning of a line
- $ match at the end of the line
- . match any character (except newline)
- \ escape the next metacharacter
    $sequence = ">sequence1\natgacctggaataggat";
    if ($sequence =~ /^>/) {    # line starts with >
        print "Found Fasta header!";
    }
/\.$/ matches a dot at the end of the line
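One detail worth testing is what counts as a "line" when a string contains several. In Perl, `^` and `$` anchor to the whole string by default; the `m` modifier, which is not covered on the slide and is mentioned here as an addition, makes them match at every line start and end. A small sketch:

```perl
use strict;
use warnings;

my $record = ">sequence1\natgacctggaataggat";

my $is_fasta      = ($record =~ /^>/)    ? 1 : 0;   # 1: string starts with >
my $atg_default   = ($record =~ /^atg/)  ? 1 : 0;   # 0: ^ only anchors at the string start
my $atg_multiline = ($record =~ /^atg/m) ? 1 : 0;   # 1: with m, ^ also matches after a newline

my $ends_with_dot = ('The end.' =~ /\.$/)   ? 1 : 0;   # 1: escaped dot is a literal dot
my $has_dot       = ('no dot here' =~ /\./) ? 1 : 0;   # 0
my $any_char      = ('no dot here' =~ /./)  ? 1 : 0;   # 1: unescaped dot matches any character

print "$is_fasta $atg_default $atg_multiline $ends_with_dot $has_dot $any_char\n";
```

When input is read one line at a time, each string holds a single line, so the distinction only matters for strings with embedded newlines such as the FASTA record here.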
Regular Expressions
Matching repetition:
- a? match 'a' 1 or 0 times
- a* match 'a' 0 or more times, i.e., any number of times
- a+ match 'a' 1 or more times, i.e., at least once
- a{n,m} match at least "n" times, but not more than "m" times
- a{n,} match "n" or more times
- a{n} match exactly "n" times
$sequence =~ /a{5,}/ finds repeats of 5 or more a's
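The quantifiers are easiest to compare on a few short strings. The sketch below uses made-up test words and a made-up sequence; the parentheses and the `g` modifier used to collect the matched text are explained later in these slides.

```perl
use strict;
use warnings;

my @words = qw(ct cat caat);

my @optional = grep { /^ca?t$/ } @words;   # ct cat
my @any      = grep { /^ca*t$/ } @words;   # ct cat caat
my @some     = grep { /^ca+t$/ } @words;   # cat caat

my $sequence = 'atgaaaaaacgtaaag';
my $has_long_run = ($sequence =~ /a{5,}/) ? 1 : 0;   # 1: there is a run of six a's

# Collect every run of three or more a's.
my @runs = $sequence =~ /a{3,}/g;          # ('aaaaaa', 'aaa')

print "@optional | @any | @some | $has_long_run | @runs\n";
```

Quantifiers are greedy: `a{3,}` takes as many a's as it can, which is why the first run is returned as six characters rather than three.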
Regular Expressions
Search for classes of characters:
- \d match a digit character
- \w match a word character (alphanumeric and _)
- \D match a non-digit character
- \W match a non-word character
- \s match a whitespace character
- \S match a non-whitespace character
    $date = '30 Jan 2009';
    if ($date =~ /\d{1,2} \w+ \d{2,4}/) {
        print "Correct date format!";
    }
also matches 1 February 09
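The date pattern is quite permissive, which becomes clear when it is tried on a few more strings. The sketch below wraps the slide's pattern in a hypothetical subroutine `loose_date` and compares it with `strict_date`, a variant with `^` and `$` added so that the whole string has to fit. Both names and the test strings are assumptions for this example.

```perl
use strict;
use warnings;

# The pattern from the slide: may match anywhere inside the string.
sub loose_date {
    my ($text) = @_;
    return $text =~ /\d{1,2} \w+ \d{2,4}/ ? 1 : 0;
}

# Same pattern, anchored at both ends so nothing else may surround it.
sub strict_date {
    my ($text) = @_;
    return $text =~ /^\d{1,2} \w+ \d{2,4}$/ ? 1 : 0;
}

foreach my $text ('30 Jan 2009', '1 February 09', 'Date: 30 Jan 2009!',
                  '30 12 2009', 'Jan 30 2009') {
    printf "%-20s loose=%d strict=%d\n", $text, loose_date($text), strict_date($text);
}
```

Because `\w` also matches digits, `30 12 2009` passes both checks; a stricter month test would need a character range such as the ones introduced on a later slide.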
Regular Expressions
Match special characters:
- \t matches a tabulator (tab)
- \b matches a word boundary
- \r matches return
- \n matches UNIX newline
- \cM matches Control-M (line-ending in Windows)
    while (my $line = <>) {
        if ($line =~ /\cM/) {
            warn "Windows line-ending detected!";
        }
    }
Regular Expressions
Search for range of characters:
- [] match any one of the characters specified within these brackets
- - specifies a range, e.g. a-z, or 0-9
- ^ match any character not in the list, e.g. [^A-Z]
    $sequence = 'ataggctapgctaga';
    if ($sequence =~ /[^acgt]/) {
        print "Sequence contains non-DNA character";
    }
$& is a special variable containing the last pattern match; $` and $' contain the strings before and after the match
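The three match variables can be inspected right after a successful match. This sketch reuses the sequence from the slide; the names `$before`, `$match` and `$after` are illustrative.

```perl
use strict;
use warnings;

my $sequence = 'ataggctapgctaga';
my ($before, $match, $after) = ('', '', '');

if ($sequence =~ /[^acgt]/) {
    # Copy the special variables straight away: the next successful
    # match would overwrite them.
    ($before, $match, $after) = ($`, $&, $');
}

print "non-DNA character '$match' after '$before' and before '$after'\n";
```

A bracket expression always stands for exactly one character of the string, so `$&` holds the single offending letter here.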
Regular Expressions
Search and replace (substitute): s/pattern1/pattern2/
    $sequence = 'ataggctagctaga';
    $rna = $sequence;
    $rna =~ s/t/u/;    # -> auaggctagctaga
Only the first match will be replaced!
Regular Expressions
Modifiers for substitution:
- i case-insensitive
- g global
- s match includes newline (. also matches newline)
    $sequence = 'ataggctagctaga';
    $rna = $sequence;
    $rna =~ s/t/u/g;    # -> auaggcuagcuaga
replaces all t in the string with u
Regular Expressions
Example: Clean up a sequence string
    $sequence = '1 ataggctagctagat 16 ttagagctagta';
    $sequence =~ s/[^actg]//g;    # -> ataggctagctagatttagagctagta
Deletes everything that is not a, c, t, or g.
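The substitution examples can be combined into one script to compare a single replacement, a global replacement and the clean-up. This is a sketch under the assumption that the raw input looks like two numbered lines; the variable names `$rna_first`, `$rna_all`, `$replaced` and `$clean` are illustrative.

```perl
use strict;
use warnings;

my $sequence = 'ataggctagctaga';

# Without g only the first t is replaced.
my $rna_first = $sequence;
$rna_first =~ s/t/u/;

# With g every t is replaced; s/// returns the number of replacements.
my $rna_all  = $sequence;
my $replaced = ($rna_all =~ s/t/u/g);

# Clean-up: lower-case the raw text, then delete everything
# that is not a, c, g or t (digits, spaces, newlines).
my $raw   = "1 ATAGGCTAGCTAGAT\n16 ttagagctagta\n";
my $clean = lc $raw;
$clean =~ s/[^acgt]//g;

print "$rna_first\n$rna_all ($replaced replaced)\n$clean\n";
```

Working on a copy (`$rna_first`, `$rna_all`, `$clean`) keeps the original string intact, because `s///` changes the variable it is bound to.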
Regular Expressions
- Extract matched patterns
  - put patterns in parentheses
  - \1, \2, \3, … refer back to ()s within the pattern match
  - $1, $2, $3, … refer back to ()s after the pattern match
    $sequence = ">test\natgtagagctagta";
    if ($sequence =~ /^>(.*)/) {
        $id = $1;
    }
  or
    $email =~ s/(.*)\@(.*)\.(.*)/\1 at \2 dot \3/;
    print "Changed address to $1 at $2 dot $3\n";
changes kahokamp@tcd.ie to kahokamp at tcd dot ie
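The capture examples can be run as follows. This sketch makes two choices that differ from the slide and are flagged as such: the replacement part uses `$1`, `$2`, `$3`, which is the form Perl's warnings recommend there, and `\1` is shown where it is required, inside the pattern itself. The variable names `$obscured` and `$repeat` and the string `atgcgcta` are made up for illustration.

```perl
use strict;
use warnings;

# Extract the identifier from a FASTA header.
my $record = ">test\natgtagagctagta";
my $id = '';
if ($record =~ /^>(.*)/) {
    $id = $1;                 # 'test': . does not match the newline
}

# Rewrite an e-mail address using three captured groups.
my $email    = 'kahokamp@tcd.ie';
my $obscured = $email;
$obscured =~ s/(.*)\@(.*)\.(.*)/$1 at $2 dot $3/;

# Inside the pattern, \1 refers to what the first () already matched:
# here, any two characters that are immediately repeated.
my ($repeat) = 'atgcgcta' =~ /(..)\1/;   # 'gc'

print "$id | $obscured | $repeat\n";
```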