Title: http://creativecommons.org/licenses/by-sa/2.0/
1. http://creativecommons.org/licenses/by-sa/2.0/
2. An Introduction to Perl for Bioinformatics, Part 2
- Will Hsiao
- Simon Fraser University
- Department of Molecular Biology and Biochemistry
- wwhsiao_at_sfu.ca
- www.pathogenomics.sfu.ca/brinkman
4. Outline
- Session 1
- Review of the previous day
- Perl historical perspective
- Expand on Regular Expression
- General Use of Perl
- Expand on Perl Functions and introduce Modules
- Interactive demo on Modules
- Break
- Session 2
- Use of Perl in Bioinformatics
- Object Oriented Perl
- Bioperl Overview
- Interactive demo on Bioperl
- Introduction to the Perl assignment
5. Perl in Bioinformatics
- Case in point 1: Human Genome data exchange
- "How Perl Saved the Human Genome Project"
- Lincoln Stein (1996), www.perl.org
- Different sequencing centres all had different data formats
- Perl allowed the various genome centres to exchange and communicate data with each other
- Introduces a project to produce modules to process all known forms of biological data (Bioperl)
6. Perl in Bioinformatics
- Case in point 2: Ensembl
- Much of Ensembl is written in Perl
- Ensembl has an extensive Perl API that allows you to access the Ensembl database directly from your Perl code
- Case in point 3: GMOD (Generic Model Organism Database)
- www.gmod.org
- A joint effort by model organism databases (worm, fly, corn, rat, yeast, E. coli, Arabidopsis, rice) to develop reusable components that can be adapted for other biological databases
- Written mostly in Java and Perl
7. Bioinformatics Spectrum
[Diagram labels: Biology, Math, Computer Science, Software/data analysis; languages shown: Perl, Java, C/C++]
8. Perl for bioinformatics in your lab
- Scripting
- automating repetitive analyses
- parsing results obtained from other programs
- Wrapping
- accessing other programs (e.g. BLAST) through Perl
- Web/CGI programming
- Develop an interactive web page for your lab
- Create web forms
9. Bioperl Overview
- The Bioperl project: www.bioperl.org
- Comprehensive, well-documented set of Perl modules
- Last stable release: 1.4.0 (developer release: 1.5.1)
- A bioinformatics toolkit for
- Format conversion
- Report processing
- Data manipulation
- Sequence analyses
- and more!
- Written in object-oriented Perl
10. What are objects?
- Examples of objects in real life
- Cars, dogs, dishwashers
- Objects have ATTRIBUTES and ACTIONS
- Some attributes of a dog
- Color of fur
- Height
- Owner's name
- Weight
- Tail position
- Some actions of a dog
- Bark
- Walk
- Run
- Eat
- Wag tail
11. What are programming objects?
- Borrows from the concept of real-life objects
- Attributes are stored as variables
- Actions are implemented as functions
A Program Dog Object
- Attributes: $fur_color, $weight, $tail_position
- Actions: sub dye_fur, sub eat, sub wag_tail
12. Object Exercise
- Pair up with your neighbours (2-3 people)
- In the next 2-3 minutes, come up with as many attributes and actions (aka methods) of a DNA sequence object as you can
- E.g. attributes of a DNA sequence object
- length = 300, percent_GC = 50
- E.g. methods of a DNA sequence object
- translate_to_protein, remove_polyA_tail
- Share with the class
13. Objects belong to Classes
- If we take all your suggestions and design a generic template, we can then use this template to create objects
- This template is called a class
- An instance of a class is called an object
[Diagram: DNA sequence objects 1, 2, 3 and 4 are all instances of the DNA Sequence Class]
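To make the template idea concrete, here is a minimal sketch of a DNA sequence class in plain Perl. The class name DNASequence and the methods seq_length and percent_GC are made up for this illustration (they are not part of Bioperl), and keeping the attributes in a blessed hash is just one common convention.

```perl
use strict;
use warnings;

# Hypothetical class for illustration only; this is not the Bioperl API.
package DNASequence;

# Constructor: builds one object (instance) from the class template.
sub new {
    my ($class, %args) = @_;
    my $self = { id => $args{id}, seq => uc $args{seq} };
    return bless $self, $class;
}

# Method: number of bases in the sequence.
sub seq_length {
    my ($self) = @_;
    return length $self->{seq};
}

# Method: percentage of bases that are G or C.
sub percent_GC {
    my ($self) = @_;
    return 0 unless $self->seq_length;
    my $gc = ($self->{seq} =~ tr/GC//);   # tr counts matches without changing the string
    return 100 * $gc / $self->seq_length;
}

package main;

# Two objects created from the same class, each with its own attribute values.
my $seq1 = DNASequence->new(id => 'seq1', seq => 'ATGCGC');
my $seq2 = DNASequence->new(id => 'seq2', seq => 'atatatat');

for my $s ($seq1, $seq2) {
    printf "%s: length=%d, percent_GC=%.1f\n",
        $s->{id}, $s->seq_length, $s->percent_GC;
}
```

Both objects share the same methods, but each carries its own sequence, which is the point of separating the class from its instances.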
14. How do we interact with an object?
- We have to refer to an object by its name
- Polo is the name of my dog
[Picture labels: "POLO", "WOOF"]
15. Interact with a program object
[Diagram labels: "Polo", "WOOF"]
A Program Dog Object
- Attributes: $fur_color, $weight, $tail_position
- Methods: sub dye_fur, sub eat, sub wag_tail
- Polo is the name of a program dog object
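As a rough sketch of how such a program dog object could look in Perl, the example below defines a small Dog class and then interacts with it through the name $polo. The class, its constructor arguments and the way each method changes an attribute are assumptions made for illustration; only the attribute and method names come from the slide.

```perl
use strict;
use warnings;

# Hypothetical Dog class, written only to illustrate the slide.
package Dog;

# Constructor: stores the attributes in a hash and blesses it into the class.
sub new {
    my ($class, %args) = @_;
    my $self = {
        name          => $args{name},
        fur_color     => $args{fur_color},
        weight        => $args{weight},
        tail_position => 'down',
    };
    return bless $self, $class;
}

# Action: change the fur_color attribute.
sub dye_fur {
    my ($self, $new_color) = @_;
    $self->{fur_color} = $new_color;
    return $self;
}

# Action: eating adds to the weight attribute (in kg).
sub eat {
    my ($self, $food_kg) = @_;
    $self->{weight} += $food_kg;
    return $self;
}

# Action: update the tail_position attribute.
sub wag_tail {
    my ($self) = @_;
    $self->{tail_position} = 'wagging';
    return $self;
}

package main;

# $polo is the name through which we reach this particular dog object.
my $polo = Dog->new(name => 'Polo', fur_color => 'brown', weight => 20);
$polo->dye_fur('black');
$polo->eat(0.5);
$polo->wag_tail;

print "$polo->{name}: $polo->{fur_color} fur, $polo->{weight} kg, ",
      "tail $polo->{tail_position}\n";
```

Every interaction goes through $polo: the methods are called on it and the attributes are read from it.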
16. A name is a reference
- Objects have unique names (labels)
- You refer to an object by its unique name
- This unique name that you give to an object is called a reference
17. Reference in Perl
- A reference is a scalar (simple) variable that refers to a chunk of memory
- Stored in that memory can be another variable or an object
[Diagram: $array_ref in "My Program" points to a chunk of "Memory"]
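A short sketch of the idea in the diagram: $array_ref is a scalar that refers to an array held elsewhere in memory. The array contents here are invented for the example.

```perl
use strict;
use warnings;

my @bases     = ('A', 'C', 'G', 'T');   # an ordinary array stored in memory
my $array_ref = \@bases;                # a scalar holding a reference to that array

print "Reference type: ", ref($array_ref), "\n";   # ARRAY
print "First base: $array_ref->[0]\n";             # the arrow reaches one element
print "All bases: @{$array_ref}\n";                # @{...} reaches the whole array

# A change made through the reference changes the original array,
# because both names lead to the same chunk of memory.
push @{$array_ref}, 'N';
print "Array now has ", scalar(@bases), " elements\n";
```

The same arrow notation is used later to reach the methods of an object through its reference.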
18. Reference to an object
- $my_protein is called a reference to an object (in this case a protein object)
- To access the attributes and methods of the protein object, you have to go through its reference (i.e. $my_protein)
- Objects have built-in functions (methods) that are useful
- These methods also have specific names
[Diagram: $my_protein in "My Program" points to a protein object in "Memory"]
A protein object
- Attributes: $SwissProt_ID, $name, $length, $source, @journal_articles, $domain_location
- Methods: sub new, sub return_ID, sub get_domain
19. Object Oriented Programming
- What is O-O Programming?
- Simple answer: a way to organize code so it interacts in certain ways and follows certain rules
- Long answer: to be found in books on O-O
- Why O-O Programming?
- Provides a well-defined framework
- Promotes good practices such as code reuse, abstraction, and cleaner design
- Does have certain trade-offs (e.g. O-O Perl is usually slower than procedural Perl)
- Designing good object classes requires forethought and skill
20. To use an object
- 1) Find out which class you need and learn about the class by reading its documentation
- 2) Make the class available to your program
- 3) Create a new object of the class
- 4) Start using the object by modifying its attributes and calling its methods
21. Example of using objects
- Task
- I have a sequence file in GenBank format that I want to convert to EMBL format
- How many objects do you think we need to accomplish the task above?
22. Step 1: Find the objects you need
- Objects that we need
- an object that reads in sequences from a file
- an object that represents a sequence record
- an object that writes sequences to a file
[Diagram: GenBank file -> Sequence File Input Object -> Sequence Object (in memory) -> Sequence File Output Object -> EMBL file]
23. Example of using objects
- Solution
- I remember that Bioperl provides this functionality, so first I'll take a look at the Bioperl documentation
- Website: http://www.bioperl.org
24. Bioperl Documentation demo
- Go to the webpage and navigate to the SeqIO documentation
- Pay attention to
- 1) the name of the module
- 2) Synopsis (code examples)
- 3) Description
- 4) list of methods
[Slides 25-28: documentation screenshots; visible link text: "List of Modules by Class", "Complete List of Modules by Name"]
29. Step 2: Make the object class available
- In Perl, classes are implemented as object-oriented modules
- To include a class, simply use the module
- E.g. use Bio::SeqIO;
- Note: the name of the module is case sensitive
- By using Bio::SeqIO, my program automatically gains access to any modules included in Bio::SeqIO
30. Step 3: Create an object
- Make up a name for my object reference (e.g. $seq_in)
- Create the object by calling the object class's new method
- every class has a constructor method to create an object of that class
- the constructor method is often called new
- use the single arrow operator (->) to call methods
- Assign the object to the object reference
- You can give the object you are about to create some initial attributes (e.g. the file name of my sequence record, the format of the record)
my $seq_in = Bio::SeqIO->new(-file   => 'myGBrecord',
                             -format => 'genbank');
31. Step 4: Call the object's methods
- We've seen the -> (single arrow) operator for calling a class method (e.g. new)
- The same operator is used for calling an object method
- E.g. to ask the $seq_in object to get a sequence record from your GenBank sequence file:
- my $seq_record = $seq_in->next_seq();
32. Putting it all together
#!/usr/bin/perl -w
use strict;
use Bio::SeqIO;

# Create a new Bio::SeqIO object and initialize some attributes
my $seq_in = Bio::SeqIO->new(
    -file   => 'myGBrecord',
    -format => 'genbank');

# A leading ">" in the file name opens the file for writing
my $seq_out = Bio::SeqIO->new(
    -file   => '>myEMBLrec',
    -format => 'EMBL');

# Read one record from the GenBank file and write it out as EMBL
my $seq_record = $seq_in->next_seq();
$seq_out->write_seq($seq_record);
33. More Bioperl modules
- Bio::SeqIO: Sequence Input/Output
- Retrieve sequence records and write to files
- Convert sequence records from one format to another
- Bio::Seq: Manipulating sequences
- Get subsequences ($seq->subseq($start, $end))
- Find the length of the sequence ($seq->length)
- Reverse complement a DNA sequence
- Translate a DNA sequence, etc.
- Bio::Annotation: Annotate a sequence
- Assign journal references to a sequence, etc.
- Bio::Annotation is associated with an entire sequence record and not just part of a sequence (see also Bio::SeqFeature)
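The Bio::Seq operations listed above might be used roughly as follows. This is a sketch that assumes Bioperl is installed; the sequence and its ID are invented for the example, and the object is built directly in memory rather than read from a file.

```perl
use strict;
use warnings;
use Bio::Seq;   # assumes Bioperl is installed

# Example data made up for illustration.
my $seq = Bio::Seq->new(
    -id       => 'example_seq',
    -alphabet => 'dna',
    -seq      => 'ATGGCCAAGTAA',
);

my $length  = $seq->length;            # number of bases
my $first3  = $seq->subseq(1, 3);      # coordinates are 1-based and inclusive
my $revcom  = $seq->revcom->seq;       # revcom returns a new sequence object
my $protein = $seq->translate->seq;    # translate also returns a new object

print "Length:             $length\n";
print "First codon:        $first3\n";
print "Reverse complement: $revcom\n";
print "Translation:        $protein\n";
```

Note that subseq returns a plain string, while revcom and translate return new sequence objects, which is why ->seq is called on their results.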
34. Some more Bioperl modules
- Bio::SeqFeature: Associate feature annotation with a sequence
- Features describe specific locations in the sequence
- E.g. 5' UTR, 3' UTR, CDS, SNP, etc.
- Using this object, you can add feature annotations to your sequences
- When you parse a GenBank file using Bioperl, the features of a record are stored as SeqFeature objects
- Bio::DB::GenBank, GenPept, EMBL and SwissProt: Remote Database Access
- You can retrieve a sequence from remote databases (through the Internet) using these objects
35. Even more Bioperl modules
- Bio::SearchIO: Parse sequence database search reports
- Parse BLAST reports (make custom reports)
- Parse HMMER, FASTA, SIM4, WABA, etc.
- Custom reports can be output to various formats (HTML, table, etc.)
- Bio::Tools::Run::StandAloneBlast: Run standalone BLAST through Perl
- By combining this and SearchIO, you can automate and customize BLAST searches
- Bio::Graphics: Draw biological entities (e.g. a gene, an exon, BLAST alignments, etc.)
36. Bioperl Summary
- For online documentation
- For this workshop: http://doc.bioperl.org/releases/bioperl-1.4/
- Tutorial: http://www.bioperl.org/wiki/HOWTO:Beginners
- HOWTOs: http://www.bioperl.org/wiki/HOWTOs
- Modules: http://www.bioperl.org/wiki/Category:Core_Modules
- Literature
- Stajich et al., The Bioperl toolkit: Perl modules for the life sciences. Genome Res. 2002 Oct;12(10):1611-8. PMID: 12368254
- Bioperl mailing list: bioperl-l_at_bioperl.org
- Best way to get help using Bioperl
- Very active list (upwards of 10 messages a day)
- Use Bioperl with caution: things change fast and without warning (unless you are on the mailing list)
37. Interactive demo on Bioperl
- Open your laptop!
- Open a terminal window
- Type: cd /perl_two
- Type: gedit ./bioperl_demo.pl
- Let's go over the example together
38. Summary for Session 2
- Perl is a popular language in bioinformatics because
- it handles text well
- it has a great user base and support (e.g. Bioperl)
- Bioperl is a large collection of object-oriented Perl modules for many biological data analyses
- An object is a collection of attributes and methods
- You have to access an object through its reference
- A reference is a name
39. Perl Documents
- In-line documentation
- POD: Plain Old Documentation
- Read POD by typing perldoc <module name>
- E.g. perldoc perl, perldoc Bio::SeqIO
- On-line documentation
- http://www.cpan.org
- http://www.perl.com
- http://www.bioperl.org
- Books
- Learning Perl (the best way to learn Perl if you know a bit about programming already)
- Beginning Perl for Bioinformatics (example-based way to learn Perl for bioinformatics)
- Programming Perl (THE Perl reference book; not for the faint of heart)
40. Additional Book References
- Perl Cookbook, 2nd edition (quick solutions to 80% of what you want to do)
- Learning Perl Objects, References & Modules (for people who want to learn objects, references and modules in Perl)
- Perl in a Nutshell (an okay quick reference)
- Perl CD Bookshelf, Version 4.0 (electronic version of the above books; best value, searchable, and kills fewer trees)
- Mastering Perl for Bioinformatics (more example-based learning)
- CGI Programming with Perl (rather outdated treatment of the subject; not really recommended)
- Perl Graphics Programming (if you want to generate graphics using Perl; side note: Perl is probably not the best tool for generating graphics)
41. Introduction to the Assignment, Part A
- Goals
- To convert passive knowledge to active skills
- To write some simple Perl programs by yourself
- Consists of 2 modules
- Write a program to convert a temperature from Fahrenheit to Celsius
- Write a program to count the frequencies of bases in a sequence (the sequence MAN1.fasta can be downloaded from the Day6 wiki)
42. Introduction to the Assignment, Part B
- Goals
- To see the power of Perl in bioinformatics
- To see how some common bioinformatics tasks are done using Perl
- Consists of 3 modules
- Download E. coli O157:H7 proteins in FASTA format
- Use a regular expression to find a protein motif
- Run BLAST on all proteins in the proteome (>5000 BLAST runs)
43. Introduction to the Assignment, Part B
- Most of the code is given to you; you just have to modify it (in total, no more than 15 lines of new code!)
- You are not expected to know everything in the scripts. It takes time to learn a new language
- TAs and your CS teammates will help you; don't wait until the last minute to ask for help
- Remember, you still have to hand in your own version of the assignment! No copying!
44. Acknowledgements
- Thanks to Sohrab Shah and Sanja Rojic (CS, UBC) for a wonderful collaboration on the lecture/lab material
- Some ideas in this lecture are borrowed from Lincoln Stein's workshop (http://stein.cshl.org/genome_informatics/)