Title: Outline for Today
1Outline for Today
- 9:00-9:30 Genome Canada Help Desk
- 9:30-10:00 Introduction to Programming
- 10:00-10:30 Perl for Bioinformatics
- 10:30-11:00 Break
- 11:00-12:00 Perl for Bioinformatics cont'd (Structured Lab)
- 12:00-1:00 Lunch
2Outline for Today
- 1:00-2:30 Perl for Bioinformatics structured lab (cont'd)
- 2:30-3:00 Break
- 3:00-5:00 Perl and the Web structured lab and lectures
- 5:00-6:00 Practice
- 7:00-9:00 Open Lab
3Introduction to Programming
- David Wishart
- david.wishart_at_ualberta.ca
4A Computer Program
    #!/usr/bin/perl
    # myprogram.pl
    # this is a comment line
    use warnings;
    use strict;
    print "Please enter a name and press Enter\n";
    my $sequenceName = <STDIN>;
    chomp($sequenceName);
    print "Please enter a DNA sequence and press Enter\n";
    my $sequence = <STDIN>;
    chomp($sequence);
    print "The sequence name is $sequenceName\n";
    print "The sequence is $sequence\n";
5Computer Programs
- A set of instructions that tell a computer what to do
- It is like a recipe or a set of instructions for cooking, playing musical notes, etc.
- For the computer to recognize those instructions, they need to be written in a language the computer understands
- There are many computer languages
6Computer Languages
- Binary or Machine code (1s and 0s that a computer truly understands)
- Assembly language (primitive language)
- Fortran (compiled language)
- Basic (an interpreted language)
- C (compiled language)
- Java (a pseudo-interpreted language)
- C++ (an object-oriented language)
7Computer Languages
- Different languages have different syntax or grammar
- Easier-to-learn languages like BASIC are similar to standard English
- Hard-to-learn languages like Assembler or C have a structure that is more mathematical
- All computer languages have to be convertible to 1s and 0s using either an interpreter or a compiler
8Compilers vs. Interpreters
- A compiler is a special computer program that translates a Fortran or C program (human-readable) to a form that a computer can execute (binary)
- Files on your computer that have the .exe suffix are the outputs of compilers
- Interpreters are programs that convert high level languages (Perl or Basic) into machine readable forms on the fly
9Computer Languages Have
- Words or names to store data (variables)
- Words or commands to read input
- Words or commands to send output
- Words or commands to perform mathematical operations
- Words or commands to perform character or string manipulation
- Words or commands to perform logical operations (AND, OR, NOT, IF, ELSE)
10More Simply Stated
- Expressions
- NOT, OR, AND, EQ, NEQ etc.
- Assignment
- myname = david
- Loops
- While <sun is down> party
- Decisions
- If <sun is up> sleep
- Input and Output
- Print "party on!"
11Writing Programs
- Requires that you decompose any complex operation into a set of really simple and logically correct operations
- Example: "Michael, open the door"
- Simple statement has many implied operations
- Humans can operate with this set of instructions, computers can't
12Michael, Open the Door
- Turn Michael on
- Push chair back 30 cm
- Stand up
- Rotate counter clockwise 90°
- Take one step forward
- If encountering an obstacle rotate cw 90° and take one step forward
- Rotate clockwise 90°
- Take 3 steps forward
- Raise right hand 36 cm
- Open hand
- Move hand forward 9 cm
- Close hand until pressure > 10 psi
- Rotate knob counter clockwise by 90°
- While holding knob, step back 1 step
13Programming
- Some educators have likened programming to the equivalent of "Simon Says" but with bits of paper (punch cards) or magnetic tape as the source of instructions
- It's also like trying to explain how to cook to a 5 year old, while the kid is in the kitchen and you're sitting in the living room
14Learning to Program
- Many beginning programmers learn by example or by charting the plans on paper
- Part of the process of learning to program involves learning to decompose what you want to do into either PSEUDOCODE or into FLOWCHARTS
- Pseudocode and flowcharts help sort out the structure and logic of the program
15Pseudocode
- A generic way of describing an algorithm using specific programming language-like notations or syntax
- Pseudocode is very human-readable
- Pseudocode is not readable by a computer, but it resembles a computer language
- An algorithm is a procedure or formula for solving a problem
16Pseudocode Example
Initialize arrays
Print request to screen
Open and read file entered by user
Change default reading mode to read from > to >
While sequences still exist in file:
    Read sequence titles and place titles into array
    Determine length of each sequence and place length into an array
    Determine GC content of each sequence and place into array
Close file
Print results to screen
Sort arrays containing GC
Print more results to screen
17Another Pseudocode Example
Declare <variable-name> as <type>
Declare <array-name>[<lowbound> to <upbound>] of <type>
<variable> = <expression>
If <condition>
    do stuff
Else
    do other stuff
While <condition>
    do some more stuff
For <variable> from <first value> to <last value>
    do stuff with variables
End
18The Flow Chart
- A graphic representation of an algorithm, often used in the design phase of programming to work out the logical flow of a program
- A useful illustrative tool (remember a picture is worth a thousand words)
- Some programming languages (like LOGO) are built to look like flow chart icons
19Flow Chart Symbols
Oval: Denotes beginning or end
Flow line: Denotes direction of logic
Parallelogram: Denotes either an input (READ) or output (PRINT) operation
Rectangle: Denotes a process (such as addition) to be carried out
Diamond: Denotes a decision or branch point (e.g. IF/THEN/ELSE)
20A Flow Chart
An Example of a Flow Chart for a Generic Computer
Program
21The Software Life Cycle
- Problem Definition and Specification
- Prototyping
- Requirements
- Design
- Development
- Testing
- Deployment
- Maintenance
22Pitfalls of Programming
- Most beginning programmers spend a very large amount of time in the development and testing phase of their program (debugging)
- Most common bugs are typographical errors (type slowly and carefully)
- Next most common bugs are errors in logic (forgetting to read a number or going into an infinite loop)
23Pitfalls of Programming
- The primary source of typographic errors comes from the difficult-to-comprehend and difficult-to-type syntax of many programming languages
- BASIC is probably the easiest programming language to learn and one with the fewest sources for typos
- Perl is one of the toughest to type
24Perl For Bioinformatics
- David Wishart
- david.wishart_at_ualberta.ca
25What is Perl?
- Practical Extraction and Report Language
- An interpreted computer programming language optimized for text extraction and manipulation
- Combines features of other pattern languages (sed, awk, sh, csh, grep)
- Developed by Larry Wall (a linguist by training, then working as a programmer at Unisys) in 1987
26Why Use Perl?
- Everyone else is using it
- Great for Scripting
- automation of repetitive tasks
- Great for Wrapping
- running C or Fortran programs through Perl
- Great for WWW CGI
- building interactive web pages
- Great for FTPing
- automated downloading of data
27More About Perl
- History
- http://history.perl.org/
- Useful Web Sites
- http://www.perl.org (How perl saved the human genome project)
- http://www.bioperl.org/wiki/Main_Page (Resource for many perl programs in bioinformatics)
- http://www.cpan.org (You can find all kinds of perl modules at this site)
- http://stein.cshl.org/lstein/text.html (Lincoln Stein, MD - perl activist and bioinformatician extraordinaire)
28More About Perl
29More About Perl
- Perl Tutorials in Bioinformatics
- http://bip.weizmann.ac.il/course/prog/index.html
- Part of the Perl Programming Course for Bioinformatics and Internet given by Jamie Prilusky of the Weizmann Institute
- http://www.sanbi.ac.za/tdrcourse/coursematerial.html
- An excellent tutorial developed by Peter van Heusden for a course given in South Africa on bioinformatics in tropical diseases
30Perl +s and -s
- High level language
- Interpreted
- Modular
- Object-oriented
- Cross-platform
- Very flexible
- Well supported
- CGI, web, FTP, DBI interfaces and links
- Slow
- High memory usage
- Complex syntax
- Not for number crunching
- Not the best for sophisticated data structures
31Downloading/Installing Perl
- Perl interpreters do not normally come with MacOS or Windows machines
- Some Linux/Unix OS come with Perl
- To find out if your computer has Perl installed or if Perl is in your path, type
- which perl
- It should return something like
- /usr/bin/perl or /usr/local/bin/perl
32Installing Perl
- If you don't get a response, then you will have to download and install Perl
- To download the Perl interpreter go to http://www.perl.org/get.html
- Identify your operating system and press the hyperlink that's appropriate
- Have either you or your system administrator install the Perl binary in the appropriate directories.
33Installing Perl
34Basic Perl Program Structure
- Tell the computer it's a Perl program
- Tell the computer where the Perl interpreter is located (/usr/bin/perl or /usr/local/bin/perl)
- Declare or define some variables
- Read some text input
- Manipulate the text input
- Output (write out) the manipulated text
35Sample Perl Program
    #!/usr/bin/perl
    # myprogram.pl
    # this is a comment line
    use warnings;
    use strict;
    print "Please enter a name and press Enter\n";
    my $sequenceName = <STDIN>;
    chomp($sequenceName);
    print "Please enter a DNA sequence and press Enter\n";
    my $sequence = <STDIN>;
    chomp($sequence);
    print "The sequence name is $sequenceName\n";
    print "The sequence is $sequence\n";
36Perl Syntax
- All comments begin with a #
- All executable Perl statements end with a semicolon
- REMEMBER THE SEMICOLON
- Perl programs should be named with a suffix .pl attached
- Name Perl programs after their expected function
38Use Warnings/Use Strict
- A nice way to help debug programs and to deal with the usual typos you make when you are learning to program
- Warnings tell you when Perl thinks you typed things you didn't even mean
- Strict makes Perl complain when variables are not declared, especially if they've been mistyped
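A minimal sketch of what strict catches, assuming a made-up snippet with a misspelled variable name ($sequnce instead of $sequence). The snippet is compiled inside a string eval only so that the error can be printed instead of stopping the script; the variable names are illustrative, not taken from the exercises.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A small piece of Perl held as text. It contains a typo: $sequnce was
# never declared with "my", only $sequence was.
my $snippet = q{
    use strict;
    my $sequence = "ATGC";
    $sequnce = "GGCC";
    1;
};

# eval compiles and runs the text; if strict rejects it, eval returns
# undef and the error message is left in $@.
my $compiled = eval $snippet;
my $strictError = $@;

if (!$compiled) {
    print "strict refused to run the snippet:\n$strictError";
}
```

Without use strict, Perl would quietly create a second, unrelated variable and the typo would go unnoticed.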
40The Rest of the Program
- Contains statements for input and output (print and <STDIN>)
- Contains statements for processing characters (chomp)
- Contains assignment statements where variables are assigned (=)
- Contains variable names ($sequence)
- Assignment, variables, input, output
41Running A Perl Program
- Call the Perl interpreter explicitly at the Unix prompt
- > perl myprogram.pl
OR
- Put a header (shebang) with the path to the Perl interpreter on line 1 followed by the program name (as a comment) on line 2
- #!/usr/bin/perl
- # myprogram.pl
42Running A Perl Program
- To run the program by its name (the shebang method) you must first make it executable using the chmod command; calling perl explicitly only needs read access
- > chmod a+rx myprogram.pl OR
- > chmod +x myprogram.pl OR
- > chmod 755 myprogram.pl
- Then just type the program name
- > myprogram.pl or ./myprogram.pl
43chmod Explained
- Chmod = change mode, or change access privileges
- Access (4 types): user (u), group (g), others (o) and all (a = u, g and o)
- Privileges (3 types): read (r), write (w) and execute (x)
- Changing access
- +rw (add read and write access)
- -wx (remove write and execute access)
- Usage: chmod a+rx prog.pl
44chmod Shortcut
- Usage: chmod a+rw prog.pl
- u g o
- 0 0 0 (one digit per access type, each digit built from 3 binary flags)
- Assign Execute = 1, Write = 2, Read = 4
- Add and subtract these numbers to change access
- chmod a+rwx pro.pl -> chmod 777 pro.pl
- chmod a+rx pro.pl -> chmod 755 pro.pl means give yourself read/write/execute and all others execute/read privileges
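The arithmetic behind the numeric shortcut can be sketched in Perl itself. The helper names mode_digit and octal_mode are hypothetical, introduced only for this illustration; the values 4, 2 and 1 are the ones listed above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Numeric value of each privilege: read 4, write 2, execute 1.
my %value = (r => 4, w => 2, x => 1);

# Add up the values of the privileges in a string such as "rx".
sub mode_digit {
    my ($privileges) = @_;
    my $digit = 0;
    foreach my $letter (split(//, $privileges)) {
        $digit += $value{$letter};
    }
    return $digit;
}

# Build the three-digit mode from the user, group and others privileges.
sub octal_mode {
    my ($user, $group, $others) = @_;
    return mode_digit($user) . mode_digit($group) . mode_digit($others);
}

print "user rwx, group rx,  others rx  -> ", octal_mode("rwx", "rx", "rx"), "\n";
print "user rwx, group rwx, others rwx -> ", octal_mode("rwx", "rwx", "rwx"), "\n";
```

Reading a mode back works the same way in reverse: 5 is 4 + 1, so read and execute but no write.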
45Exercise 1
- Reading and Writing a Sequence Entered by the
Keyboard (Terminal)
46Exercise 1
- Objective - to prompt a user to enter a name and a DNA sequence and then to print the information on the screen
- Key Concepts
- The print statement
- The new line character (\n)
- Chomp (removing \n or new line)
- Standard input (STDIN)
- Variables and variable names
47Algorithm for Exercise 1
- Print request to screen
- Read sequence name entered by user
- Print another request to screen
- Read sequence entered by user
- Print statement of fact to screen
48The Print Command
- The way to print data to the terminal
- Tells the Perl interpreter to print to the screen or terminal
- The main way the computer can communicate with the user
- Note the quotes, \n and semicolons
    print "What is your Name? \n";
    print "Have a nice day! \n";
49New line Character
- New lines or carriage returns are designated by a special character, \n
- It tells the computer to put in a carriage return after a string of text has been entered or read
- Computers see \n, we don't
- This stupid character shows up EVERYWHERE
50Chomp
- A command that Perl uses to gobble up \n characters at the ends of lines
    my $Name = "PacMan\n";
    chomp $Name;
    # $Name now equals "PacMan"
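A small sketch that makes the invisible \n visible by measuring the string length before and after chomp. The variable names are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $name = "PacMan\n";

my $lengthBefore = length($name);   # counts the 6 letters plus the \n
chomp($name);                       # removes the trailing \n, if there is one
my $lengthAfter = length($name);    # now only the 6 letters

print "Length before chomp: $lengthBefore\n";
print "Length after chomp:  $lengthAfter\n";

# A second chomp is harmless: there is no \n left to remove.
chomp($name);
print "Name is still '$name'\n";
```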
51<STDIN>
- Stands for STanDard INput
- Used to capture any keyboard input (i.e. typing) from the user
- A special kind of filehandle: a name for a source of input data
    print "What is your Name? \n";
    my $Name = <STDIN>;
    chomp $Name;
    print "Hello $Name";
52Variables (Scalars)
- Variables are holding places for calculations, constants, input from keyboards, files, etc.
- Variables are designated with a $ followed by letters and/or numbers
- Variables have names; the "my" in front declares them. Perl only insists on it under use strict, but it is good practice
    my $Number = <STDIN>;
    my $Five = 5;
    my $sum = $Five + $Number;
53The Assignment Operator
- Variables are assigned values or quantities using the = sign
- Sometimes you can think of it as
- x <- y rather than a statement of numeric equality
- Don't get confused between = and the == (comparison operator)
54The Assignment OperatorSome Simple Tricks
- $a = $b means $a <- $b
- $a += $b means $a = $a + $b
- $a -= $b means $a = $a - $b
- $a *= $b means $a = $a * $b
- $a /= $b means $a = $a / $b
- $a %= $b means $a = $a % $b
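A short sketch of the shortcut assignment operators in action, using a made-up counter. Each shortcut line does the same thing as the longer form shown in its comment.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $count = 10;

$count += 5;    # same as $count = $count + 5, giving 15
$count -= 3;    # same as $count = $count - 3, giving 12
$count *= 2;    # same as $count = $count * 2, giving 24
$count /= 4;    # same as $count = $count / 4, giving 6

print "Final value: $count\n";
```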
55Let's Give it a Try
- Go to Exercise 1
- Open a text editor
- Type in the program called typeseq.pl
- Don't type the comments, just read them
- Save the program as typeseq.pl
- Get back to the Unix prompt and type chmod +x typeseq.pl
- Type ./typeseq.pl or typeseq.pl and see what happens
56Text Editor Options
- Option 1: launch nedit by right-clicking on mouse and slide down to text editing option, choose nedit
- Option 2: go to the bottom of your screen and select the icon with the paper and pen, click on this icon to launch text edit
57NOTE!!!
- In the 1st line of exercise 1, change #!/usr/bin/perl to #!/usr/local/bin/perl
- which perl or locate perl
- (this command tells you the path you should use)
58Getting It To Run
- Type ./typeseq.pl
- Type perl typeseq.pl
- Both should work, but for some of you there may be differences in your account set-up
- Use the same protocol for all subsequent programs you will write today
59Exercise 2
- Reading and Writing a FASTA Sequence File From
Disk
60FASTA Format
>prompt anything goes here
THISISTHESEQUENCETHISISTHESEQUENCE
THISISTHESEQUENCETHISISGETTINGREPE
TITIVE
OR
>
THISISTHESEQUENCETHISISTHESEQUENCE
THISISTHESEQUENCETHISISGETTINGREPE
TITIVE
61Exercise 2
- Objective - to read a DNA sequence in FASTA format and then to print the sequence on the screen
- Key Concepts
- The Open statement
- The Die statement
- Filehandles
- The Diamond operator
- The While statement
- The Close statement
62Algorithm for Exercise 2
- Print request to screen
- Open file provided by user
- Read file provided by user
- Read and print contents of file to screen
- Close file
63The Open Statement
- The way to open a file (containing data or text) that exists somewhere on the computer's disk
- The "open" statement requires two arguments: the name of the filehandle and the filename
    open (DNAFILE, $fileToRead)
    # DNAFILE is the file handle; $fileToRead holds the filename (dna.txt)
64Filehandles
- Filehandles are the equivalent to <STDIN> -- just a holding place
- Can be associated with data on a disk or input typed from a keyboard
- Normally written in ALL_CAPS
    open (DNAFILE, $fileToRead)
    # DNAFILE is the file handle; $fileToRead holds the filename (dna.txt)
65The Die Statement
- The way Perl handles file opening or file finding errors (halts the program)
- The Die statement appears after Open
- If the file to be opened does not exist Perl needs to know what to do next
- Perl stores the system error message in $!, which the Die message can include
- Contains one argument (a message)
    open (F1, $file) or die ("oops $!");
66The Diamond (<>)
- Diamond operator tells the Perl program to read a file or keyboard input until it encounters a new line (\n) character
- Diamond brackets enclose Filehandles
- Diamonds direct data to a scalar variable
    my $Name = <STDIN>;          # From keyboard
    my $firstline = <DNAFILE>;   # From disk
67The While Statement
- Tells the computer to repeat an operation while a certain logical expression or condition is true
- The expression or condition to be tested is enclosed in ( ) brackets
- The operations to be repeated are enclosed in { } brackets
    while (test_is_true) {
        dosomething
        doanotherthing
    }
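As a concrete instance of the while pattern, here is a sketch that trims trailing A's from a made-up sequence. The sequence and variable names are assumptions for illustration only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $sequence = "ATGCGTAAAA";
my $removed  = 0;

# Keep looping while the sequence is not empty and its last letter is an A.
while (length($sequence) > 0 and substr($sequence, -1) eq "A") {
    chop($sequence);    # chop removes the last character, whatever it is
    $removed++;
}

print "Trimmed sequence: $sequence\n";
print "Bases removed: $removed\n";
```

The test in ( ) brackets is checked before every pass, so the loop stops as soon as the last letter is no longer an A.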
68The Close Statement
- Closes a previously opened file
- All files should be closed after use
- The "close" statement requires just one argument, the name of the Filehandle
- Always include a Die statement in case the file has already been closed
    close (FILE) or die ("oops $!");
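Putting open, die, the diamond operator, while and close together, the flow of Exercise 2 might look like the sketch below. This is not the course's seqfromfile.pl: to stay self-contained it first writes a tiny made-up FASTA file with File::Temp instead of asking the user for a filename, and it uses the three-argument form of open, which states the read mode explicitly.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a small FASTA file so the example does not depend on SARS.txt.
my ($tmp, $fileToRead) = tempfile(UNLINK => 1);
print $tmp ">demo made-up sequence\nATGCGTATAG\nCGATGCGCAT\n";
close($tmp) or die("Cannot close $fileToRead: $!");

# Open the file, read it one line at a time, and print each line.
my $lineCount = 0;
open(DNAFILE, "<", $fileToRead) or die("Cannot open $fileToRead: $!");
while (my $line = <DNAFILE>) {
    chomp($line);
    $lineCount++;
    print "$line\n";
}
close(DNAFILE) or die("Cannot close $fileToRead: $!");

print "Read $lineCount lines\n";
```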
69Let's Give it a Try
- Go to Exercise 2
- Open a text editor
- Type in the program seqfromfile.pl
- Don't type the comments, just read them
- Save the program
- Get back to the Unix prompt and type chmod +x seqfromfile.pl
- Type seqfromfile.pl (or ./seqfromfile.pl) and see what happens
70File Input
- To run this program you will need to have the sequence file called SARS.txt
- This file can be retrieved from http://gchelpdesk.ualberta.ca/sequences
- Click on the link SARS.txt, copy the file and paste it into a text editor (your choice).
- Save the file on your disk as SARS.txt
71Exercise 3
- Reading and Manipulating Multiple FASTA Sequence
Files From Disk
72Multi-FASTA Format
>sequence1
THISISTHESEQUENCETHISISTHESEQUENCE
THISISTHESEQUENCETHISISGETTINGREPE
TITIVE
>sequence2
SEQUENCESEQUENCESEQUENCESESEQUENCE
THISISTHESEQUENCETHISISGETTINGREPE
TITIVE
>sequence3
ANOTHERSEQUENCANOTHERSEQUENCEANOTH
73Exercise 3
- Objective - to read multiple FASTA sequences from a file and then to print useful data about the sequences
- Key Concepts
- Arrays and the Push operator
- If and If/Else statement
- The For loop
- The Matching operator
- The Substitution operator
- Comparison operators
- Sort, reverse, length, and scalar functions
74Algorithm for Exercise 3
- Initialize arrays
- Print request to screen
- Open and read file entered by user
- Change default reading mode to read from > to > instead of line by line
- While sequences still exist in file.
- Read sequence titles and place titles into array
- Determine length of each sequence and place length into an array
- Determine GC content of each sequence and place each value into array
- Close file
- Print results to screen
- Sort arrays containing GC
- Print more results to screen
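The steps above can be sketched as follows. This is an illustration, not the course's readmany.pl: the three sequences are made up, they are read through an in-memory filehandle rather than from shortseqs.txt, and GC bases are counted with tr, which returns the number of matching characters.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up multi-FASTA data, read through an in-memory filehandle so that
# no file on disk is needed.
my $fasta = ">seq1\nATGCGC\nGGCC\n>seq2\nATATATAT\n>seq3\nGGGCCCAT\n";
open(my $in, "<", \$fasta) or die("Cannot open in-memory data: $!");

my (@titles, @lengths, @gc);

{
    local $/ = ">";    # read from one ">" to the next instead of line by line
    while (my $record = <$in>) {
        chomp($record);                  # removes the trailing ">"
        next if $record eq "";           # nothing comes before the first ">"
        my ($title, @seqLines) = split(/\n/, $record);
        my $sequence = join("", @seqLines);
        next if length($sequence) == 0;  # skip titles with no sequence
        my $gcCount = ($sequence =~ tr/GCgc//);
        push(@titles,  $title);
        push(@lengths, length($sequence));
        push(@gc,      100 * $gcCount / length($sequence));
    }
}
close($in) or die("Cannot close in-memory data: $!");

for (my $i = 0; $i < scalar(@titles); $i++) {
    printf("%s\tlength %d\tGC %.1f%%\n", $titles[$i], $lengths[$i], $gc[$i]);
}

# Sort the GC values numerically, smallest first.
my @sortedGC = sort { $a <=> $b } @gc;
print "Sorted GC values: @sortedGC\n";
```

Setting $/ inside its own block with local means the normal line-by-line reading mode comes back automatically once the block ends.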
75Arrays
- Arrays are lists of numbers, letters or words (lists of scalars)
- Arrays are ordered sequentially (start at 0)
- Great way of holding, tracking and manipulating textual or numeric data
- Array names always start with @
    my @bases = ("A","C","G","T");
    # index:      0   1   2   3
76Accessing Arrays
- Array elements can be retrieved or altered by keeping track of the array index
- Array indices are enclosed in [ ] brackets
    my @bases = ("A","C","G","T");
    # set the variable $base3 to G
    my $base3 = $bases[2];
    # change the T to an X
    $bases[3] = "X";
77Pushing Arrays
- Push operator pushes or adds elements to the ends of an array
- A B C D ---> A B C D E
- Requires two arguments, the array name and the element (scalar) to be added
    my @bases = ("A","C","G","T");
    # add another character to bases
    push (@bases, "N");
    # now bases is ACGTN
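The array operations from the last three slides, combined into one runnable sketch. The variable names $thirdBase and $howMany are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @bases = ("A", "C", "G", "T");

# Indices start at 0, so index 2 is the third element.
my $thirdBase = $bases[2];

# Change the last element (index 3) from T to X.
$bases[3] = "X";

# Add a new element to the end of the array.
push(@bases, "N");

# scalar() gives the number of elements in the array.
my $howMany = scalar(@bases);

print "Third base: $thirdBase\n";
print "Bases now: @bases ($howMany elements)\n";
```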
78The For Loop
- Another method for repeating operations in a Perl program using indices
- $i=0: initialization of the counter index $i
- $i < 10: as long as the index is less than 10 continue the operation
- $i++: increment the counter/index by 1
    for ($i=0; $i<10; $i++) { dosomething }
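A minimal sketch of a for loop walking through an array by index. The sequence lengths are made-up values.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @lengths = (120, 45, 300);
my $total   = 0;

# $i starts at 0 and goes up by 1 until it reaches the number of elements.
for (my $i = 0; $i < scalar(@lengths); $i++) {
    print "Sequence $i has length $lengths[$i]\n";
    $total += $lengths[$i];
}

print "Total length: $total\n";
```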
79If/Else Statements
- Tells the computer to perform one operation if a statement or test is true and to perform another if the statement is false (a conditional process)
- The expression or condition to be tested is enclosed in ( ) brackets
- The operations to be performed if the condition or test is true are enclosed in { } brackets
- The Else is optional; if there are more than two conditions to be tested use Elsif
80If/Else Statements
    if (test_is_true) {
        dosomething
        doanotherthing
        doyetanotherthing
    } else {
        doadifferent_thing
        doanother_thing
    }
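A sketch of if, elsif and else with three possible outcomes. The function name classify_gc and the 40% and 60% cut-offs are arbitrary choices for illustration, not values from the exercises.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Put a GC percentage into one of three classes.
sub classify_gc {
    my ($gcPercent) = @_;
    if ($gcPercent < 40) {
        return "low";
    } elsif ($gcPercent <= 60) {
        return "medium";
    } else {
        return "high";
    }
}

foreach my $value (25, 50, 75) {
    print "GC content $value% is ", classify_gc($value), "\n";
}
```

Only the first test that is true gets its block run; the else block runs when none of the tests is true.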
81The Matching Operator
- Part of the regular expression operators in Perl's text manipulation arsenal
- Uses m (for match) followed by a character string enclosed by slashes /
- Determines whether a character string in the match argument matches a character string in the variable or array
- To apply the match globally use the g modifier at the end
- ^ means not (inside [ ] brackets)
82The Matching Operator
    $seq = "ACGXTTACTACGTA";
    # check for bad characters
    if ($seq =~ m/[^CATGN]/g) { die ("oops $!"); }
    $pattern = "AC.AGACGT";
    # check for this pattern in $seq
    $seq =~ m/$pattern/g;
    # found this pattern: ACGXTTACTACGTA
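A runnable sketch of the two uses of the matching operator: checking for characters that should not be there, and looking for a pattern. The helper name has_bad_characters and the pattern AC.A (A, C, any one character, A) are assumptions made for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return 1 if the sequence contains anything other than C, A, T, G or N.
sub has_bad_characters {
    my ($seq) = @_;
    if ($seq =~ m/[^CATGN]/) {
        return 1;
    }
    return 0;
}

my $seq = "ACGXTTACTACGTA";

if (has_bad_characters($seq)) {
    print "$seq contains a character that is not C, A, T, G or N\n";
}

# Brackets around the pattern capture the matched text into $1.
my $found = "";
if ($seq =~ m/(AC.A)/) {
    $found = $1;
    print "Found the pattern: $found\n";
}
```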
83The Substitution Operator
- Part of the regular expression operators in Perl's text manipulation arsenal
- Uses s (for substitution) followed by two arguments enclosed by slashes /
- Replaces the part of a string that matches in the first argument with another string in the second argument
- To apply the substitution globally use the g modifier at the end
84The Substitution Operator
    $simple = "123ABCABC";
    # replace first ABC with 456
    $simple =~ s/ABC/456/;
    # now $simple is 123456ABC
    $seq = "ACGXTTACTXXGTA";
    # replace all Xs with Ns
    $seq =~ s/X/N/g;
    # now $seq is ACGNTTACTNNGTA
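A sketch contrasting a single substitution with a global one, using a made-up DNA string converted to RNA. With the g modifier the substitution operator also returns how many replacements it made.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $dna = "ATGCTT";

# Without g only the first T is replaced.
my $firstOnly = $dna;
$firstOnly =~ s/T/U/;

# With g every T is replaced; the return value is the number of changes.
my $rna = $dna;
my $changes = ($rna =~ s/T/U/g);

print "First T only: $firstOnly\n";
print "All Ts:       $rna ($changes substitutions)\n";
```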
85Comparison Operators
String   Number   Meaning
eq       ==       Equal
ne       !=       Not equal
lt       <        Less than
gt       >        Greater than
le       <=       Less or equal
ge       >=       Greater or equal
86Perl Functions
- Sort Function -- sort (@array)
- sort(6, 5, 8, 3, 1) gives (1, 3, 5, 6, 8)
- Reverse Function -- reverse (@array)
- reverse(1, 2, 3, 4, 5) gives (5, 4, 3, 2, 1)
- Length Function -- length ($scalar)
- length("abcdefgh") gives 8
- Scalar Function -- scalar (@array)
- scalar(@array) gives 5 when @array is ("a", "b", "c", "d", "e")
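A sketch of the four functions on made-up data. One point worth seeing in practice: plain sort compares its items as text, so numbers with more than one digit need the numeric comparison { $a <=> $b } to come out in numeric order.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @lengths = (10, 9, 100, 25);

my @asText       = sort(@lengths);               # compared as character strings
my @asNumbers    = sort { $a <=> $b } @lengths;  # compared as numbers
my @largestFirst = reverse(@asNumbers);
my $howMany      = scalar(@lengths);
my $letters      = length("abcdefgh");

print "Sorted as text:    @asText\n";
print "Sorted as numbers: @asNumbers\n";
print "Largest first:     @largestFirst\n";
print "Number of items:   $howMany\n";
print "Length of string:  $letters\n";
```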
87Perl Miscellany
- \n means new line or carriage return
- [^\n]+ means more than 1 not new line
- $/ = "\n" means read to next line (default)
- $/ = ">" means read until next > found
- $/ = undef means read until EOF
- =, ==, gt, eq: watch meanings!
88Let's Give it a Try
- Go to Exercise 3
- Open a text editor
- Type in the program readmany.pl
- Don't type the comments, just read them
- Save the program
- Get back to the Unix prompt and type chmod +x readmany.pl
- Type readmany.pl (or ./readmany.pl) and see what happens
89File Input
- To run this program you will need to have the sequence file called shortseqs.txt
- This file can be retrieved from http://gchelpdesk.ualberta.ca/sequences
- Click on the link shortseqs.txt, copy the file and paste it into a text editor (your choice).
- Save the file as shortseqs.txt
90Exercise 4 - translate.pl
91The Fundamental Paradigm
DNA -> RNA -> Protein
92RNA Polymerase
5'                                 3'
Forward    ATGCTATCTGTACTATATGATCTA
Complement TACGATAGACATGATATACTAGAT
RNA        AUGCUAU
93The Genetic Code
94The Genetic Code
95Translating DNA/RNA (single frame)
Frame1   M  R  I  A  M  R  I
         ATGCGTATAGCGATGCGCATT
         TACGCATATCGCTACGCGTAA
Frame-1  H  T  Y  R  H  A  N
96Exercise 4
- Objective - to read multiple FASTA DNA sequences from a file and translate them in just one reading frame
- Key Concepts
- Hash tables
- Exists function
- Substring function
97Algorithm for Exercise 4
- Initialize arrays
- Create hash table of codon translations
- Print request to screen
- Open and read file entered by user
- Change default reading mode to read from > to > instead of line by line
- While sequences still exist in file.
- Read sequence titles and place titles into array
- Remove any non DNA characters and convert to lower case
- Read DNA sequence in chunks of 3
- Use hash tables to convert each codon to AA and place into array
- Convert array of AAs into a string
- Place each AA translation into an array of translations
- Close file
- Print results to screen
98Hashes
- Another way of preparing arrays
- Indexing of Hashes is not numeric but by keys or character strings (no ordering)
- Hash elements are assigned keys by the assignment symbol =>
- The keys must be unique
- Hash elements must be accessed using keys placed in { } brackets
- Hash names must begin with %
99Hash Example
    my %genetic_code = ("UUU" => "F",
                        "UUC" => "F",
                        "UUA" => "L",
                        "UCU" => "S",);
    print "$genetic_code{UUA}\n";   # prints the letter L
    $genetic_code{UUA} = "Leu";     # changes hash value from L to Leu
100Exists & Substr
- Exists checks whether a key exists in a hash table; the result is true/false
- if (exists($hash{$key}))
- Substr extracts substrings from longer character strings
- 3 argument function
- String to work on, start position, length
- $codon = substr($seq, $i, 3)
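Hashes, exists and substr combine naturally into a single-frame translation. The sketch below is not the course's translate.pl: the codon table is deliberately partial (only the codons needed for the demonstration sequence), the function name translate_frame1 is hypothetical, and unknown codons are shown as X.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Partial codon table: just enough entries for the example below.
my %genetic_code = ("ATG" => "M", "CGT" => "R", "ATA" => "I",
                    "GCG" => "A", "CGC" => "R", "ATT" => "I");

# Translate a DNA sequence in reading frame 1, three letters at a time.
sub translate_frame1 {
    my ($seq) = @_;
    $seq = uc($seq);    # the table keys are upper case
    my $protein = "";
    for (my $i = 0; $i + 3 <= length($seq); $i += 3) {
        my $codon = substr($seq, $i, 3);
        if (exists($genetic_code{$codon})) {
            $protein .= $genetic_code{$codon};
        } else {
            $protein .= "X";    # codon not in the table
        }
    }
    return $protein;
}

print translate_frame1("ATGCGTATAGCGATGCGCATT"), "\n";
```

The loop test $i + 3 <= length($seq) ignores one or two leftover letters at the end that do not make a full codon.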
101File Input
- To run this program you will need to have the sequence file called shortseqs.txt
- This file can be retrieved from http://gchelpdesk.ualberta.ca/sequences
- Click on the link shortseqs.txt, copy the file and paste it into a text editor (your choice).
- Save the file as shortseqs.txt
102Exercise 5 - revcomp.pl
5'                                 3'
Forward    ATGCTATCTGTACTATATGATCTA   (Sense)
Complement TACGATAGACATGATATACTAGAT   (Antisense)
5'                                 3'
Reverse Complement  TAGATCATATAGTACAGATAGCAT
103DNA Structure
104DNA - base pairing
- Hydrogen Bonds
- Base Stacking
- Hydrophobic Effect
105Base-pairing (Details)
3 H-bonds
2 H-bonds
106DNA Sequences
        5'                        3'
Single  ATGCTATCTGTACTATATGATCTA
        5'                        3'
Paired  ATGCTATCTGTACTATATGATCTA
        TACGATAGACATGATATACTAGAT
Read this way ----->
5' ATGATCGATAGACTGATCGATCGATCGATTAGATCC 3'
3' TACTAGCTATCTGACTAGCTAGCTAGCTAATCTAGG 5'
<----- Read this way
107DNA Sequence Nomenclature
5'                                 3'
Forward    ATGCTATCTGTACTATATGATCTA   (Sense)
Complement TACGATAGACATGATATACTAGAT   (Antisense)
5'                                 3'
Reverse Complement  TAGATCATATAGTACAGATAGCAT
108Exercise 5
- Objective - to read a DNA sequence file and determine its reverse complement and its content of A, T, G, and C
- Key Concepts
- Translation operator
- Split function
- Join function
- Reverse function
109Algorithm for Exercise 5
- Print request to screen
- Open and read file provided by user
- Change default reading mode to entire file instead of line by line
- Read sequence title
- Read sequence and convert to lower case, remove any non-DNA
- Convert sequence to complement and save separately
- Convert character string to array, reverse array, then convert to character string
- Print the results
- Count the number of A, T, G, C in each strand
- Print the results
110Tr, Split and Join
- Tr: translation operator, converts one argument to another
- $value =~ tr/GATC/CTAG/
- Split: split function, splits a character string into an array of characters
- @array = split(/\B/, $sequence)
- Join: join function, joins an array of characters into a single character string
- $string = join("", @array)
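A sketch of the core of a reverse-complement calculation using tr, split, reverse and join, plus base counting with tr. It is not the course's revcomp.pl: the function name reverse_complement is hypothetical, the sequence is the forward strand from the earlier figure rather than SARS.txt, and split(//, ...) with an empty pattern is used here as a common way to split a string into single characters.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Complement each base, then reverse the order of the bases.
sub reverse_complement {
    my ($sequence) = @_;
    my $complement = $sequence;
    $complement =~ tr/GATC/CTAG/;              # G<->C and A<->T
    my @bases    = split(//, $complement);     # string -> array of characters
    my @reversed = reverse(@bases);            # flip the order
    return join("", @reversed);                # array -> string
}

my $forward = "ATGCTATCTGTACTATATGATCTA";
my $revcomp = reverse_complement($forward);

# tr with an empty replacement counts characters without changing the string.
my %count;
$count{A} = ($forward =~ tr/A//);
$count{T} = ($forward =~ tr/T//);
$count{G} = ($forward =~ tr/G//);
$count{C} = ($forward =~ tr/C//);

print "Forward:            $forward\n";
print "Reverse complement: $revcomp\n";
print "A=$count{A} T=$count{T} G=$count{G} C=$count{C}\n";
```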
111File Input
- To run this program you will need to have the sequence file called SARS.txt
- This file can be retrieved from http://gchelpdesk.ualberta.ca/sequences
- Click on the link SARS.txt, copy the file and paste it into a text editor (your choice).
- Save the file on your disk as SARS.txt
112Exercise 6 - orfs.pl
ORF (open reading frame)
TATA box
Stop codon
Start codon
ATGACAGATTACAGATTACAGATTACAGGATAG
Frame 1
Frame 2
Frame 3
113Prokaryotes
- Simple gene structure
- Small genomes (0.5 to 10 million bp)
- No introns (uninterrupted)
- Genes are called Open Reading Frames or ORFs (include start and stop codon)
- High coding density (>90%)
- Some genes overlap (nested)
- Some genes are quite short (<60 bp)
114Prokaryotic Gene Structure
ORF (open reading frame)
TATA box
Stop codon
Start codon
ATGACAGATTACAGATTACAGATTACAGGATAG
Frame 1
Frame 2
Frame 3
115Gene Finding In Prokaryotes
- Scan forward strand until a start codon is found
- Staying in same frame scan in groups of three until a stop codon is found
- If # of codons between start and end is greater than 17, identify as gene and go to last start codon and proceed with step 1
- If # of codons between start and end is less than 18, go back to last start codon and go to step 1
- At end of chromosome, repeat process for reverse complement
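The forward-strand part of this scan can be sketched as below. This is an illustration under stated assumptions, not the course's orfs.pl: the function name find_orfs is hypothetical, the minimum number of codons is passed in as a parameter (the slide's rule corresponds to 18), ATG is the only start codon considered, the scan resumes after the stop codon once an ORF is accepted, and the reverse complement pass is left out.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %isStop = ("TAA" => 1, "TAG" => 1, "TGA" => 1);

# Return a list of "start-end" strings (0-based, inclusive) for ORFs on the
# forward strand with at least $minCodons codons between start and stop.
sub find_orfs {
    my ($seq, $minCodons) = @_;
    $seq = uc($seq);
    my @orfs;
    my $pos = 0;
    while ($pos + 3 <= length($seq)) {
        if (substr($seq, $pos, 3) ne "ATG") {
            $pos++;
            next;
        }
        # Start codon found: stay in frame and scan in groups of three.
        my $stop = -1;
        for (my $i = $pos + 3; $i + 3 <= length($seq); $i += 3) {
            if (exists($isStop{substr($seq, $i, 3)})) {
                $stop = $i;
                last;
            }
        }
        my $codonsBetween = ($stop - $pos) / 3 - 1;
        if ($stop >= 0 and $codonsBetween >= $minCodons) {
            push(@orfs, $pos . "-" . ($stop + 2));
            $pos = $stop + 3;    # continue after the stop codon
        } else {
            $pos++;              # too short or no stop codon: keep scanning
        }
    }
    return @orfs;
}

# The short example sequence from the gene structure slide.
my $demo  = "ATGACAGATTACAGATTACAGATTACAGGATAG";
my @found = find_orfs($demo, 5);
print "ORFs with at least 5 codons: @found\n";
```

With the slide's threshold of 18 codons the same short sequence yields no ORFs, which is the intended effect of the length filter.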
116Exercise 7 - transorfs.pl
Frame3   A Y S D A H
Frame2   C V R C A
Frame1   M R I A M R I
         ATGCGTATAGCGATGCGCATT
         TACGCATATCGCTACGCGTAA
Frame-1  H T Y R H A N
Frame-2  R I A I R M
Frame-3  A Y L S A C