Outline for Today - PowerPoint PPT Presentation

Slides: 117
Provided by: umani
Transcript and Presenter's Notes

Title: Outline for Today


1
Outline for Today
  • 9:00-9:30 Genome Canada Help Desk
  • 9:30-10:00 Introduction to Programming
  • 10:00-10:30 Perl for Bioinformatics
  • 10:30-11:00 Break
  • 11:00-12:00 Perl for Bioinformatics contd
    (Structured Lab)
  • 12:00-1:00 Lunch

2
Outline for Today
  • 1:00-2:30 Perl for Bioinformatics structured
    lab (contd)
  • 2:30-3:00 Break
  • 3:00-5:00 Perl and the Web structured lab and
    lectures
  • 5:00-6:00 Practice
  • 7:00-9:00 Open Lab

3
Introduction to Programming
  • David Wishart
  • david.wishart_at_ualberta.ca

4
A Computer Program
#!/usr/bin/perl
# myprogram.pl
# this is a comment line
use warnings;
use strict;
print "Please enter a name and press Enter\n";
my $sequenceName = <STDIN>;
chomp($sequenceName);
print "Please enter a DNA sequence and press Enter\n";
my $sequence = <STDIN>;
chomp($sequence);
print "The sequence name is $sequenceName\n";
print "The sequence is $sequence\n";
5
Computer Programs
  • A set of instructions that tell a computer what
    to do
  • It is like a recipe or set of instructions for
    cooking, playing musical notes etc.
  • To recognize those instructions, they need to be
    written in a language the computer understands
  • There are many computer languages

6
Computer Languages
  • Binary or Machine code (1s and 0s that a
    computer truly understands)
  • Assembly language (primitive language)
  • Fortran (compiled language)
  • Basic (an interpreted language)
  • C (compiled language)
  • Java (a pseudo-interpreted language)
  • C++ (an object-oriented language)

7
Computer Languages
  • Different languages have different syntax or
    grammar
  • Easier-to-learn languages like BASIC are similar
    to standard English
  • Hard-to-learn languages like Assembler or C have
    a structure that is more mathematical
  • All computer languages have to be convertible to
    1s and 0s using either an interpreter or a
    compiler

8
Compilers vs. Interpreters
  • A compiler is a special computer program that
    translates a Fortran or C program
    (human-readable) to a form that a computer can
    execute (binary)
  • Files on your computer that have the .exe suffix
    are the outputs of compilers
  • Interpreters are programs that convert high level
    languages (Perl or Basic) into machine readable
    forms on the fly

9
Computer Languages Have
  • Words or names to store data (variables)
  • Words or commands to read input
  • Words or commands to send output
  • Words or commands to perform mathematical
    operations
  • Words or commands to perform character or string
    manipulation
  • Words or commands to perform logical operations
    (AND, OR, NOT, IF, ELSE)

10
More Simply Stated
  • Expressions
  • NOT, OR, AND, EQ, NEQ etc.
  • Assignment
  • myname = david
  • Loops
  • While <sun is down> party
  • Decisions
  • If <sun is up> sleep
  • Input and Output
  • Print "party on!"
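
The five building blocks above can be sketched in a few lines of Perl. This is only an illustration: the counter $hours_of_darkness and the @log array are made-up names standing in for "sun is down" and for the printed output.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assignment: store a value in a variable
my $myname = "david";
my $hours_of_darkness = 3;   # hypothetical stand-in for "sun is down"
my @log;                     # collects the messages before printing them

# Loop: repeat while the condition is true
while ($hours_of_darkness > 0) {
    push @log, "party on!";
    $hours_of_darkness--;    # without this line the loop would never end
}

# Decision: an expression combining two tests with AND
if ($hours_of_darkness == 0 and $myname ne "") {
    push @log, "$myname sleeps";
}

# Output
print "$_\n" for @log;
```

Note that the loop needs a statement that eventually makes its condition false; forgetting it is the infinite-loop bug mentioned under the pitfalls of programming.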

11
Writing Programs
  • Requires that you decompose any complex operation
    into a set of really simple and logically correct
    operations
  • Example: "Michael, open the door"
  • Simple statement has many implied operations
  • Humans can operate with this set of instructions,
    computers can't

12
Michael, Open the Door
  • Turn Michael on
  • Push chair back 30 cm
  • Stand up
  • Rotate counter clockwise 90°
  • Take one step forward
  • If encountering an obstacle rotate cw 90° and
    take one step forward
  • Rotate clockwise 90°
  • Take 3 steps forward
  • Raise right hand 36 cm
  • Open hand
  • Move hand forward 9 cm
  • Close hand until pressure > 10 psi
  • Rotate knob counter clockwise by 90°
  • While holding knob, step back 1 step

13
Programming
  • Some educators have likened programming to the
    equivalent of Simon Says but with bits of paper
    (punch cards) or magnetic tape as the source of
    instructions
  • It's also like trying to explain how to cook to a
    5-year-old, while the kid is in the kitchen and
    you're sitting in the living room

14
Learning to Program
  • Many beginning programmers learn by example or by
    charting the plans on paper
  • Part of the process of learning to program
    involves learning to decompose what you want to
    do into either PSEUDOCODE or into FLOWCHARTS
  • Pseudocode and flowcharts help sort out the
    structure and logic of the program

15
Pseudocode
  • A generic way of describing an algorithm using
    specific programming language-like notations or
    syntax
  • Pseudocode is very human-readable
  • Pseudocode is not readable by a computer, but it
    resembles a computer language
  • An algorithm is a procedure or formula for
    solving a problem

16
Pseudocode Example
Initialize arrays
Print request to screen
Open and read file entered by user
Change default reading mode to read from > to >
While sequences still exist in file.
    Read sequence titles and place titles into array
    Determine length of each sequence and place length into an array
    Determine GC content of each sequence and place into array
Close file
Print results to screen
Sort arrays containing GC
Print more results to screen
17
Another Pseudocode Example
Declare <variable-name> as <type>
Declare <array-name>[<lowbound> to <upbound>] of <type>
<variable> = <expression>
If <condition>
    do stuff
Else
    do other stuff
While <condition>
    do some more stuff
For <variable> from <first value> to <last value>
    do stuff with variables
End
18
The Flow Chart
  • A graphic representation of an algorithm, often
    used in the design phase of programming to work
    out the logical flow of a program
  • A useful illustrative tool (remember a picture is
    worth a thousand words)
  • Some programming languages (like LOGO) are built
    to look like flow chart icons

19
Flow Chart Symbols
  • Oval: Denotes beginning or end
  • Flow line: Denotes direction of logic
  • Parallelogram: Denotes either an input (READ) or
    output (PRINT) operation
  • Rectangle: Denotes a process (such as addition)
    to be carried out
  • Diamond: Denotes a decision or branch point
    (e.g. IF/THEN/ELSE)
20
A Flow Chart
An Example of a Flow Chart for a Generic Computer
Program
21
The Software Life Cycle
  • Problem Definition and Specification
  • Prototyping
  • Requirements
  • Design
  • Development
  • Testing
  • Deployment
  • Maintenance

22
Pitfalls of Programming
  • Most beginning programmers spend a very large
    amount of time in the development and testing
    phase of their program (debugging)
  • Most common bugs are typographical errors (type
    slowly and carefully)
  • Next most common bugs are errors in logic
    (forgetting to read a number or going into an
    infinite loop)

23
Pitfalls of Programming
  • The primary source of typographic errors comes
    from the difficult-to-comprehend and
    difficult-to-type syntax of many programming
    languages
  • BASIC is probably the easiest programming
    language to learn and one with the fewest sources
    for typos
  • Perl is one of the toughest to type

24
Perl For Bioinformatics
  • David Wishart
  • david.wishart_at_ualberta.ca

25
What is Perl?
  • Practical Extraction and Report Language
  • An interpreted computer programming language
    optimized for text extraction and manipulation
  • Combines features of other pattern languages
    (sed, awk, sh, csh, grep)
  • Developed by Larry Wall (a linguist by training,
    then a programmer at Unisys) in 1987

26
Why Use Perl?
  • Everyone else is using it
  • Great for Scripting
  • automation of repetitive tasks
  • Great for Wrapping
  • running C or Fortran programs through Perl
  • Great for WWW CGI
  • building interactive web pages
  • Great for FTPing
  • automated downloading of data

27
More About Perl
  • History
  • http://history.perl.org/
  • Useful Web Sites
  • http://www.perl.org (How perl saved the human
    genome project)
  • http://www.bioperl.org/wiki/Main_Page (Resource
    for many perl programs in bioinformatics)
  • http://www.cpan.org (You can find all kinds of
    perl modules at this site)
  • http://stein.cshl.org/lstein/text.html (Lincoln
    Stein, MD - perl activist and bioinformatician
    extraordinaire)

28
More About Perl
29
More About Perl
  • Perl Tutorials in Bioinformatics
  • http://bip.weizmann.ac.il/course/prog/index.html
  • Part of the Perl Programming Course for
    Bioinformatics and Internet given by Jamie
    Prilusky of the Weizmann Institute
  • http://www.sanbi.ac.za/tdrcourse/coursematerial.html
  • An excellent tutorial developed by Peter van
    Heusden for a course given in South Africa on
    bioinformatics in tropical diseases

30
Perl +s and -s
  • High level language
  • Interpreted
  • Modular
  • Object-oriented
  • Cross-platform
  • Very flexible
  • Well supported
  • CGI, web, FTP, DBI interfaces and links
  • Slow
  • High memory usage
  • Complex syntax
  • Not for number crunching
  • Not the best for sophisticated data structures

31
Downloading/Installing Perl
  • Perl interpreters do not normally come with MacOS
    or Windows machines
  • Some Linux/Unix OS come with Perl
  • To find out if your computer has Perl installed
    or if Perl is in your path, type
  • which perl
  • It should return something like
  • /usr/bin/perl or /usr/local/bin/perl

32
Installing Perl
  • If you don't get a response, then you will have
    to download and install Perl
  • To download the Perl interpreter go to
    http://www.perl.org/get.html
  • Identify your operating system and press the
    hyperlink that's appropriate
  • Have either you or your system administrator
    install the Perl binary in the appropriate
    directories.

33
Installing Perl
34
Basic Perl Program Structure
  • Tell the computer it's a Perl program
  • Tell the computer where the Perl interpreter is
    located (/usr/bin/perl or /usr/local/bin/perl)
  • Declare or define some variables
  • Read some text input
  • Manipulate the text input
  • Output (write out) the manipulated text

35
Sample Perl Program
#!/usr/bin/perl
# myprogram.pl
# this is a comment line
use warnings;
use strict;
print "Please enter a name and press Enter\n";
my $sequenceName = <STDIN>;
chomp($sequenceName);
print "Please enter a DNA sequence and press Enter\n";
my $sequence = <STDIN>;
chomp($sequence);
print "The sequence name is $sequenceName\n";
print "The sequence is $sequence\n";
36
Perl Syntax
  • All comments begin with a #
  • All executable Perl statements end with a
    semicolon
  • REMEMBER THE SEMICOLON
  • Perl programs should be named with a suffix
    ".pl" attached
  • Name Perl programs after their expected function

37
Sample Perl Program
#!/usr/bin/perl
# myprogram.pl
# this is a comment line
use warnings;
use strict;
print "Please enter a name and press Enter\n";
my $sequenceName = <STDIN>;
chomp($sequenceName);
print "Please enter a DNA sequence and press Enter\n";
my $sequence = <STDIN>;
chomp($sequence);
print "The sequence name is $sequenceName\n";
print "The sequence is $sequence\n";
38
Use Warnings/Use Strict
  • A nice way to help debug programs and to deal
    with the usual typos you make when you are
    learning to program
  • Warnings tell you when Perl thinks you typed
    things you didn't even mean
  • Strict makes Perl complain when variables are not
    declared, especially if they've been mistyped

39
Sample Perl Program
#!/usr/bin/perl
# myprogram.pl
# this is a comment line
use warnings;
use strict;
print "Please enter a name and press Enter\n";
my $sequenceName = <STDIN>;
chomp($sequenceName);
print "Please enter a DNA sequence and press Enter\n";
my $sequence = <STDIN>;
chomp($sequence);
print "The sequence name is $sequenceName\n";
print "The sequence is $sequence\n";
40
The Rest of the Program
  • Contains statements for input and output (print
    and <STDIN>)
  • Contains statements for processing characters
    (chomp)
  • Contains assignment statements where variables
    are assigned (=)
  • Contains variable names ($sequence)
  • Assignment, variables, input, output

41
Running A Perl Program
  • Call the Perl interpreter explicitly at the Unix
    prompt
  • > perl myprogram.pl
  • OR
  • Put a header ("shebang") with the path to the
    Perl interpreter on line 1 followed by the
    program name on line 2
  • #!/usr/bin/perl
  • # myprogram.pl

42
Running A Perl Program
  • To run the program by its name you must first
    make it executable using the chmod command
  • > chmod a+rx myprogram.pl OR
  • > chmod +x myprogram.pl OR
  • > chmod 755 myprogram.pl
  • Then just type the program name
  • > myprogram.pl or ./myprogram.pl

43
chmod Explained
  • Chmod = change mode or change access privileges
  • Access (4 types): user (u), group (g),
    others (o) and all (a = ugo). Privileges (3
    types): read (r), write (w) and execute
    (x)
  • Changing access
  • +rw (add read & write access)
  • -wx (remove write and execute access)
  • Usage: chmod a+rx prog.pl

44
chmod Shortcut
  • Usage: chmod a+rw prog.pl
  • u g o
  • 0 0 0 (3 digits, each built from 3 binary flags)
  • Assign Execute = 1, Write = 2, Read = 4
  • Add and subtract #s to change access
  • chmod a+rwx pro.pl -> chmod 777 pro.pl
  • chmod a+rx pro.pl -> chmod 755 pro.pl means give
    yourself read/write/execute and all others
    execute/read privileges
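
The arithmetic behind the numeric shortcut can be sketched in Perl. The helper names perm_digit and mode_string are invented for this illustration and are not part of Perl or of chmod; the values 4, 2 and 1 are the ones given on the slide.

```perl
use strict;
use warnings;

# Values from the slide: read = 4, write = 2, execute = 1
my %value = (r => 4, w => 2, x => 1);

# perm_digit (hypothetical helper): "rx" -> 5, "rwx" -> 7, "" -> 0
sub perm_digit {
    my ($letters) = @_;
    my $digit = 0;
    foreach my $letter (split //, $letters) {
        die "unknown privilege '$letter'\n" unless exists $value{$letter};
        $digit += $value{$letter};
    }
    return $digit;
}

# mode_string (hypothetical helper): one digit each for user, group, others
sub mode_string {
    my ($user, $group, $others) = @_;
    return join "", map { perm_digit($_) } ($user, $group, $others);
}

print mode_string("rwx", "rx", "rx"), "\n";     # user rwx, group rx, others rx
print mode_string("rwx", "rwx", "rwx"), "\n";   # everything for everyone
```

The sketch assumes each letter appears at most once per group; it does not check for repeats such as "rr".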

45
Exercise 1
  • Reading and Writing a Sequence Entered by the
    Keyboard (Terminal)

46
Exercise 1
  • Objective - to prompt a user to enter a name and
    a DNA sequence and then to print the information
    on the screen
  • Key Concepts
  • The print statement
  • The new line character (\n)
  • Chomp (removing \n or new line)
  • Standard input (STDIN)
  • Variables and variable names

47
Algorithm for Exercise 1
  • Print request to screen
  • Read sequence name entered by user
  • Print another request to screen
  • Read sequence entered by user
  • Print statement of fact to screen
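
A minimal sketch of these five steps follows. To keep it runnable without a person at the keyboard, the typed input is simulated with an in-memory filehandle; read_answer and the sample input are made up for illustration, and in the real exercise you would read from STDIN (pass \*STDIN).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# read_answer (hypothetical helper): print a prompt, read one line from
# the given filehandle and strip the trailing newline.
sub read_answer {
    my ($prompt, $in) = @_;
    print $prompt;
    my $answer = <$in>;
    $answer = "" unless defined $answer;   # end of input: treat as empty
    chomp($answer);
    return $answer;
}

# Simulated keyboard input (an assumption for this sketch)
my $typed = "mygene\nATGCGT\n";
open(my $keyboard, "<", \$typed) or die "cannot open in-memory input: $!";

my $sequenceName = read_answer("Please enter a name and press Enter\n", $keyboard);
my $sequence     = read_answer("Please enter a DNA sequence and press Enter\n", $keyboard);
close($keyboard);

print "The sequence name is $sequenceName\n";
print "The sequence is $sequence\n";
```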

48
The Print Command
  • The way to print data to the terminal
  • Tells the Perl interpreter to print to the screen
    or terminal
  • The best way the user can communicate with the
    computer
  • Note the quotes, \n and semicolons

print "What is your Name? \n";
print "Have a nice day! \n";
49
New line Character
  • New lines or carriage returns are designated by a
    special character \n
  • It tells the computer to put in a carriage return
    after a string of text has been entered or read
  • Computers see \n, we don't
  • This stupid character shows up EVERYWHERE

50
Chomp
  • A command that Perl uses to gobble up \n
    characters at the ends of lines

my $Name = "PacMan\n";
chomp $Name;
# $Name now equals "PacMan"
51
<STDIN>
  • Stands for STanDard INput
  • Used to capture any keyboard input (i.e. typing)
    from the user
  • A special kind of filehandle: a name for the
    stream of input data

print "What is your Name? \n";
my $Name = <STDIN>;
chomp $Name;
print "Hello $Name";
52
Variables (Scalars)
  • Variables are holding places for calculations,
    constants, input from keyboards, files, etc.
  • Variables are designated with a $ followed by
    letters and/or numbers
  • Variables have names; the "my" in front is
    optional unless "use strict" is on - just good
    practice

my $Number = <STDIN>;
my $Five = 5;
my $sum = $Five + $Number;
53
The Assignment Operator
  • Variables are assigned values or quantities using
    the = sign
  • Sometimes you can think of it as
  • x <- y rather than a statement of numeric
    equality
  • Don't get confused between = and the ==
    (comparison operator)

54
The Assignment Operator: Some Simple Tricks
  • $a = $b means $a <- $b
  • $a += $b means $a = $a + $b
  • $a -= $b means $a = $a - $b
  • $a *= $b means $a = $a * $b
  • $a /= $b means $a = $a / $b
  • a b means a a b

55
Let's Give it a Try
  • Go to Exercise 1
  • Open a text editor
  • Type in the program called typeseq.pl
  • Don't type the comments, just read them
  • Save the program as typeseq.pl
  • Get back to the Unix prompt and type chmod +x
    typeseq.pl
  • Type ./typeseq.pl or typeseq.pl and see what happens

56
Text Editor Options
  • Option 1 launch nedit by right-clicking on
    mouse and slide down to text editing option,
    choose nedit
  • Option 2 go to the bottom of your screen and
    select the icon with the paper and pen, click on
    this icon to launch text edit

57
NOTE!!!
  • In the 1st line of exercise 1, change
    #!/usr/bin/perl to #!/usr/local/bin/perl
  • which perl or locate perl
  • (this command tells you the path you should use)

58
Getting It To Run
  • Type ./typeseq.pl
  • Type perl typeseq.pl
  • Both should work, but for some of you there may
    be differences in your account set-up
  • Use the same protocol for all subsequent programs
    you will write today

59
Exercise 2
  • Reading and Writing a FASTA Sequence File From
    Disk

60
FASTA Format
>prompt anything goes here
THISISTHESEQUENCETHISISTHESEQUENCE
THISISTHESEQUENCETHISISGETTINGREPE
TITIVE
OR
>
THISISTHESEQUENCETHISISTHESEQUENCE
THISISTHESEQUENCETHISISGETTINGREPE
TITIVE
61
Exercise 2
  • Objective - to read a DNA sequence in FASTA
    format and then to print the sequence on the
    screen
  • Key Concepts
  • The Open statement
  • The Die statement
  • Filehandles
  • The Diamond operator
  • The While statement
  • The Close statement

62
Algorithm for Exercise 2
  • Print request to screen
  • Open file provided by user
  • Read file provided by user
  • Read and print contents of file to screen
  • Close file
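
The open / read / print / close steps might look like the sketch below. So that it runs anywhere, it first writes a small demo FASTA file to a temporary location with the core File::Temp module; the file contents are invented for illustration, and in the exercise $fileToRead would come from the user instead.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a small demo FASTA file (an assumption for this sketch)
my ($tmp_fh, $fileToRead) = tempfile(UNLINK => 1);
print $tmp_fh ">demo sequence\nATGCGTATAG\nCGATGCGCATT\n";
close($tmp_fh) or die "cannot close $fileToRead: $!";

# Open the file, or stop with the system error message in $!
open(DNAFILE, "<", $fileToRead) or die "cannot open $fileToRead: $!";

# Read one line at a time until the file is exhausted
my $lineCount = 0;
while (my $line = <DNAFILE>) {
    print $line;
    $lineCount++;
}

close(DNAFILE) or die "cannot close $fileToRead: $!";
print "Read $lineCount lines\n";
```

The three-argument form of open with an explicit "<" is used here because it keeps the read mode separate from the filename.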

63
The Open Statement
  • The way to open a file (containing data or text)
    that exists somewhere on the computer's disk
  • The "Open" statement requires two arguments: the
    name of the filehandle and the filename

open (DNAFILE, $fileToRead);
DNAFILE is the file handle; $fileToRead holds the filename (dna.txt)
64
Filehandles
  • Filehandles are the equivalent to <STDIN> -- just
    a holding place
  • Can be associated with data on a disk or input
    typed from a keyboard
  • Normally written in ALL_CAPS

open (DNAFILE, $fileToRead);
DNAFILE is the file handle; $fileToRead holds the filename (dna.txt)
65
The Die Statement
  • The way Perl handles file opening or file finding
    errors (halts the program)
  • The Die statement appears after Open
  • If the file to be opened does not exist Perl
    needs to know what to do next
  • The Die statement prints its message and halts
    the program; the system error message is stored
    in $!
  • Contains one argument (a message)

open (F1, $file) or die ("oops $!");
66
The Diamond (<>)
  • Diamond operator tells the Perl program to read a
    file or keyboard input until it encounters a new
    line (\n) character
  • Diamond brackets enclose Filehandles
  • Diamonds direct data to a scalar variable

my $Name = <STDIN>;          # from keyboard
my $firstline = <DNAFILE>;   # from disk
67
The While Statement
  • Tells the computer to repeat an operation while a
    certain logical expression or condition is true
  • The expression or condition to be tested is
    enclosed in ( ) brackets
  • The operations to be repeated are enclosed in
    { } brackets

while (test_is_true) {
    dosomething;
    doanotherthing;
}
68
The Close Statement
  • Closes a previously opened file
  • All files should be closed after use
  • The "close" statement requires just one argument,
    the name of the Filehandle
  • Always include a Die statement in case the file
    has already been closed

close (FILE) or die ("oops $!");
69
Let's Give it a Try
  • Go to Exercise 2
  • Open a text editor
  • Type in the program seqfromfile.pl
  • Don't type the comments, just read them
  • Save the program
  • Get back to the Unix prompt and type chmod +x
    seqfromfile.pl
  • Type ./seqfromfile.pl and see what happens

70
File Input
  • To run this program you will need to have the
    sequence file called SARS.txt
  • This file can be retrieved from
    http://gchelpdesk.ualberta.ca/sequences
  • Click on the link SARS.txt, copy the file and
    paste it into a text editor (your choice).
  • Save the file on your disk as SARS.txt

71
Exercise 3
  • Reading and Manipulating Multiple FASTA Sequence
    Files From Disk

72
Multi-FASTA Format
>sequence1
THISISTHESEQUENCETHISISTHESEQUENCE
THISISTHESEQUENCETHISISGETTINGREPE
TITIVE
>sequence2
SEQUENCESEQUENCESEQUENCESESEQUENCE
THISISTHESEQUENCETHISISGETTINGREPE
TITIVE
>sequence3
ANOTHERSEQUENCANOTHERSEQUENCEANOTH
73
Exercise 3
  • Objective - to read multiple FASTA sequences from
    a file and then to print useful data about the
    sequences
  • Key Concepts
  • Arrays & Push operator
  • If and If/Else statement
  • The For loop
  • The Matching operator
  • The Substitution operator
  • Comparison operators
  • Sort, reverse, length, and scalar functions

74
Algorithm for Exercise 3
  • Initialize arrays
  • Print request to screen
  • Open and read file entered by user
  • Change default reading mode to read from > to
    > instead of line by line
  • While sequences still exist in file.
  • Read sequence titles and place titles into array
  • Determine length of each sequence and place
    length into an array
  • Determine GC content of each sequence and place
    each value into array
  • Close file
  • Print results to screen
  • Sort arrays containing GC
  • Print more results to screen
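
One way these steps could fit together is sketched below. It is not the course's readmany.pl: the three short sequences are invented, and an in-memory filehandle stands in for the file the user would name. The record separator $/ is set to ">" so that each read returns one whole FASTA entry.

```perl
use strict;
use warnings;

# Demo data (an assumption for this sketch)
my $fasta = ">sequence1\nATGCGC\nAT\n>sequence2\nGGGCCC\n>sequence3\nATATAT\n";
open(my $in, "<", \$fasta) or die "cannot open in-memory file: $!";

my (@titles, @lengths, @gc);
{
    local $/ = ">";                    # read from one ">" to the next
    while (my $record = <$in>) {
        chomp($record);                # removes the trailing ">" (the current $/)
        next if $record eq "";         # nothing precedes the first ">"
        my ($title, @lines) = split /\n/, $record;
        my $seq = join "", @lines;
        my $gcCount = ($seq =~ tr/GCgc//);   # tr returns how many characters matched
        push @titles,  $title;
        push @lengths, length($seq);
        push @gc,      length($seq) ? 100 * $gcCount / length($seq) : 0;
    }
}
close($in);

for (my $i = 0; $i < scalar(@titles); $i++) {
    printf "%s\tlength %d\tGC %.1f%%\n", $titles[$i], $lengths[$i], $gc[$i];
}

# Sort the sequence indices by GC content, highest first
my @byGC = sort { $gc[$b] <=> $gc[$a] } 0 .. $#titles;
print "Highest GC: $titles[$byGC[0]]\n";
```

Sorting the indices rather than the @gc array itself keeps titles, lengths and GC values lined up across the three arrays.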

75
Arrays
  • Arrays are lists of numbers, letters or words
    (lists of scalars)
  • Arrays are ordered sequentially (start at 0)
  • Great way of holding, tracking and manipulating
    textual or numeric data
  • Array names always start with @

my @bases = ("A","C","G","T");
#             0   1   2   3
76
Accessing Arrays
  • Array elements can be retrieved or altered by
    keeping track of the array index
  • Array indices are enclosed in [ ] brackets

my @bases = ("A","C","G","T");
# set the variable $thirdBase to G
my $thirdBase = $bases[2];
# change the T to an X
$bases[3] = "X";
77
Pushing Arrays
  • Push operator pushes or adds elements to the
    ends of an array
  • A B C D ---> A B C D E
  • Requires two arguments, the array name and the
    element (scalar) to be added

my @bases = ("A","C","G","T");
# add another character to @bases
push (@bases, "N");
# now @bases is A C G T N
78
The For Loop
  • Another method for repeating operations in a Perl
    program using indices
  • $i=0 initialization of the counter index $i
  • $i < 10 as long as the index is less than 10
    continue the operation
  • $i++ increment the counter/index by 1

for ($i=0; $i<10; $i++) { dosomething; }
79
If/Else Statements
  • Tells the computer to perform one operation if a
    statement or test is true and to perform another
    if the statement is false (a conditional process)
  • The expression or condition to be tested is
    enclosed in ( ) brackets
  • The operations to be performed if the condition
    or test is true are enclosed in { } brackets
  • The Else is optional; if more than two
    conditions are to be tested use Elsif

80
If/Else Statements
if (test_is_true) {
    dosomething;
    doanotherthing;
    doyetanotherthing;
} else {
    doadifferent_thing;
    doanother_thing;
}
81
The Matching Operator
  • Part of the regular expression operators in
    Perl's text manipulation arsenal
  • Uses m (for match) followed by a character
    string enclosed by slashes /
  • Determines whether a character string in the
    match argument matches a character string in the
    variable or array
  • To apply the match globally use the g
    modifier at the end
  • ^ inside [ ] means not

82
The Matching Operator
$seq = "ACGXTTACTACGTA";
# check for bad characters
if ($seq =~ m/[^CATGN]/g) { die ("oops $!"); }
$pattern = "AC.[AG]*ACGT";
# check for this pattern in $seq
$seq =~ m/$pattern/g;
# found this pattern in ACGXTTACTACGTA
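
A small runnable sketch of the matching operator follows. The helper name has_bad_characters and the "poly-A" check are made up for illustration; the character class [^CATGN] matches any character that is not one of those five letters, and !~ is the negated form of =~.

```perl
use strict;
use warnings;

# has_bad_characters (hypothetical helper): 1 if anything other than
# C, A, T, G or N appears in the sequence, otherwise 0
sub has_bad_characters {
    my ($seq) = @_;
    return $seq =~ m/[^CATGN]/ ? 1 : 0;
}

my $seq = "ACGXTTACTACGTA";
if (has_bad_characters($seq)) {
    print "$seq contains characters other than C, A, T, G or N\n";
}

# With the g modifier in list context, m// returns every match
my @hits = ($seq =~ m/AC/g);
print "AC occurs ", scalar(@hits), " times\n";

# !~ is true when the pattern does NOT match
print "no run of four As at the end\n" if $seq !~ m/AAAA$/;
```

Unlike the slide's version, this sketch reports the bad character instead of calling die, so the rest of the script can still run.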
83
The Substitution Operator
  • Part of the regular expression operators in
    Perl's text manipulation arsenal
  • Uses s (for substitution) followed by two
    arguments enclosed by slashes /
  • Replaces the part of a string that matches in the
    first argument with another string in the second
    argument
  • To apply the substitution globally use the g
    modifier at the end

84
The Substitution Operator
$simple = "123ABCABC";
# replace first ABC with 456
$simple =~ s/ABC/456/;
# now $simple is 123456ABC
$seq = "ACGXTTACTXXGTA";
# replace all Xs with Ns
$seq =~ s/X/N/g;
# now $seq is ACGNTTACTNNGTA
85
Comparison Operators
String  Number  Meaning
eq      ==      Equal
ne      !=      Not Equal
lt      <       Less than
gt      >       Greater than
le      <=      Less or equal
ge      >=      Greater or equal
86
Perl Functions
  • Sort Function -- sort (@array)
  • sort(6 5 8 3 1) gives 1 3 5 6 8
  • Reverse Function -- reverse (@array)
  • reverse(1 2 3 4 5) gives 5 4 3 2 1
  • Length Function -- length (scalar)
  • length("abcdefgh") gives 8
  • Scalar Function -- scalar (@array)
  • scalar(a b c d e) gives 5
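
One detail worth trying out: a plain sort compares elements as text, which happens to give the right order for single digits but not for longer numbers. The sketch below uses a made-up list to show the difference and the numeric comparison block { $a <=> $b }.

```perl
use strict;
use warnings;

my @numbers = (6, 50, 8, 300, 1);

my @asText     = sort @numbers;                 # default: compares as strings
my @asNumbers  = sort { $a <=> $b } @numbers;   # numeric comparison
my @descending = reverse @asNumbers;

print "string sort:  @asText\n";
print "numeric sort: @asNumbers\n";
print "descending:   @descending\n";
print "elements: ", scalar(@numbers), "\n";
print "characters in abcdefgh: ", length("abcdefgh"), "\n";
```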

87
Perl Miscellany
  • \n means new line or carriage return
  • [^\n]+ means more than 1 not new line
  • $/ = "\n" means read to next line (default)
  • $/ = ">" means read until next > found
  • $/ = undef means read until EOF
  • =, ==, >, eq: watch meanings!

88
Let's Give it a Try
  • Go to Exercise 3
  • Open a text editor
  • Type in the program readmany.pl
  • Don't type the comments, just read them
  • Save the program
  • Get back to the Unix prompt and type chmod +x
    readmany.pl
  • Type ./readmany.pl and see what happens

89
File Input
  • To run this program you will need to have the
    sequence file called shortseqs.txt
  • This file can be retrieved from
    http://gchelpdesk.ualberta.ca/sequences
  • Click on the link shortseqs.txt, copy the file
    and paste it into a text editor (your choice).
  • Save the file as shortseqs.txt

90
Exercise 4 - translate.pl
91
The Fundamental Paradigm
DNA -> RNA -> Protein
92
RNA Polymerase
[Figure: RNA polymerase moving along the DNA and producing mRNA (AUGCUAU...)]
           5'                       3'
Forward    ATGCTATCTGTACTATATGATCTA
Complement TACGATAGACATGATATACTAGAT
93
The Genetic Code
94
The Genetic Code
95
Translating DNA/RNA (single frame)
Frame1    M  R  I  A  M  R  I
          ATGCGTATAGCGATGCGCATT
          TACGCATATCGCTACGCGTAA
Frame-1   H  T  Y  R  H  A  N
Exercise 4
  • Objective - to read multiple FASTA DNA sequences
    from a file and translate them in just one
    reading frame
  • Key Concepts
  • Hash tables
  • Exists function
  • Substring function

97
Algorithm for Exercise 4
  • Initialize arrays
  • Create hash table of codon translations
  • Print request to screen
  • Open and read file entered by user
  • Change default reading mode to read from > to
    > instead of line by line
  • While sequences still exist in file.
  • Read sequence titles and place titles into array
  • Remove any non DNA characters and convert to
    lower case
  • Read DNA sequence in chunks of 3
  • Use hash tables to convert each codon to AA and
    place into array
  • Convert array of AAs into a string
  • Place each AA translation into an array of
    translations
  • Close file
  • Print results to screen
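
The core of this algorithm (clean the sequence, walk through it in chunks of 3, look each codon up in a hash) can be sketched as below. This is not the course's translate.pl: the codon table is deliberately partial, holding only the codons needed for the demo sequence plus the stop codons, and using "X" for a codon missing from the table is an assumption of this sketch.

```perl
use strict;
use warnings;

# Partial codon table for illustration only; a real program needs all 64 codons
my %genetic_code = (
    atg => "M", cgt => "R", cgc => "R",
    ata => "I", att => "I", gcg => "A",
    taa => "*", tag => "*", tga => "*",
);

# translate_frame1 (hypothetical helper): translate one reading frame
sub translate_frame1 {
    my ($dna) = @_;
    $dna = lc($dna);
    $dna =~ s/[^acgt]//g;              # remove any non-DNA characters
    my @aminoAcids;
    for (my $i = 0; $i + 3 <= length($dna); $i += 3) {
        my $codon = substr($dna, $i, 3);
        if (exists($genetic_code{$codon})) {
            push @aminoAcids, $genetic_code{$codon};
        } else {
            push @aminoAcids, "X";     # assumption: X marks a codon not in the table
        }
    }
    return join("", @aminoAcids);      # convert array of AAs into a string
}

print translate_frame1("ATGCGTATAGCGATGCGCATT"), "\n";
```

The loop condition $i + 3 <= length($dna) stops before a trailing partial codon, so leftover bases at the end are ignored.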

98
Hashes

  • Another way of preparing arrays
  • Indexing of Hashes is not numeric but by keys
    or character strings (no ordering)
  • Hash elements are assigned keys by the assignment
    symbol =>
  • The keys must be unique
  • Hash elements must be accessed using keys placed
    in { } brackets
  • Hash names must begin with %

99
Hash Example
my %genetic_code = (
    "UUU" => "F",
    "UUC" => "F",
    "UUA" => "L",
    "UCU" => "S",
);
print "$genetic_code{UUA}\n";    # prints the letter L
$genetic_code{UUA} = "Leu";      # changes hash value from L to Leu
100
Exists & Substr
  • Exists checks if a key exists in a hash table;
    the result is true/false
  • if (exists($hash{$value}))
  • Substr extracts substrings from longer
    character strings
  • 3 argument function
  • String to work on, start position, length
  • $codon = substr($seq, $i, 3)

101
File Input
  • To run this program you will need to have the
    sequence file called shortseqs.txt
  • This file can be retrieved from
    http://gchelpdesk.ualberta.ca/sequences
  • Click on the link shortseqs.txt, copy the file
    and paste it into a text editor (your choice).
  • Save the file as shortseqs.txt

102
Exercise 5 - revcomp.pl
           5'                       3'
Forward    ATGCTATCTGTACTATATGATCTA  (Sense)
Complement TACGATAGACATGATATACTAGAT  (Antisense)
           5'                       3'
Reverse
Complement TAGATCATATAGTACAGATAGCAT
103
DNA Structure
104
DNA - base pairing
  • Hydrogen Bonds
  • Base Stacking
  • Hydrophobic Effect


105
Base-pairing (Details)
G-C pair: 3 H-bonds
A-T pair: 2 H-bonds
106
DNA Sequences
Single  5' ATGCTATCTGTACTATATGATCTA 3'
Paired  5' ATGCTATCTGTACTATATGATCTA 3'
        3' TACGATAGACATGATATACTAGAT 5'
Read this way----->
5' ATGATCGATAGACTGATCGATCGATCGATTAGATCC 3'
3' TACTAGCTATCTGACTAGCTAGCTAGCTAATCTAGG 5'
<---Read this way
107
DNA Sequence Nomenclature
           5'                       3'
Forward    ATGCTATCTGTACTATATGATCTA  (Sense)
Complement TACGATAGACATGATATACTAGAT  (Antisense)
           5'                       3'
Reverse
Complement TAGATCATATAGTACAGATAGCAT
108
Exercise 5
  • Objective - to read a DNA sequence file and
    determine its reverse complement and its content
    of A,T,G, and C
  • Key Concepts
  • Translation operator
  • Split function
  • Join function
  • Reverse function

109
Algorithm for Exercise 5
  • Print request to screen
  • Open and read file provided by user
  • Change default reading mode to entire file
    instead of line by line
  • Read sequence title
  • Read sequence and convert to lower case, remove
    any non-DNA
  • Convert sequence to complement and save
    separately
  • Convert character string to array, reverse array,
    then convert to character string
  • Print the results
  • Count the number of A,T,G,C in each strand
  • Print the results
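
The complement, reverse and counting steps could be sketched as follows, using the forward strand from the slides as demo input. The helper names reverse_complement and base_counts are invented for illustration and this is not the course's revcomp.pl.

```perl
use strict;
use warnings;

# reverse_complement (hypothetical helper)
sub reverse_complement {
    my ($dna) = @_;
    $dna = lc($dna);
    $dna =~ s/[^acgt]//g;                       # remove any non-DNA characters
    (my $complement = $dna) =~ tr/acgt/tgca/;   # copy, then complement the copy
    my @bases = split(//, $complement);         # string -> array of characters
    return join("", reverse(@bases));           # reverse array -> string
}

# base_counts (hypothetical helper): returns a hash of counts for a, c, g, t
sub base_counts {
    my ($dna) = @_;
    my %count = (a => 0, c => 0, g => 0, t => 0);
    $count{$_}++ for grep { exists $count{$_} } split(//, lc($dna));
    return %count;
}

my $forward = "ATGCTATCTGTACTATATGATCTA";
my $revcomp = reverse_complement($forward);
print "forward: ", lc($forward), "\n";
print "revcomp: $revcomp\n";

my %n = base_counts($forward);
print "a=$n{a} t=$n{t} g=$n{g} c=$n{c}\n";
```

Here split uses an empty pattern (//) to break the string into single characters, which is a common alternative to the /\B/ pattern.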

110
Tr, Split and Join
  • Tr: translation operator, converts one argument
    to another
  • $value =~ tr/GATC/CTAG/
  • Split: split function, splits a character string
    into an array of characters
  • @array = split(/\B/, $sequence)
  • Join: join function, joins an array of
    characters into a single character string
  • $string = join("", @array)

111
File Input
  • To run this program you will need to have the
    sequence file called SARS.txt
  • This file can be retrieved from
    http://gchelpdesk.ualberta.ca/sequences
  • Click on the link SARS.txt, copy the file and
    paste it into a text editor (your choice).
  • Save the file on your disk as SARS.txt

112
Exercise 6 - orfs.pl
ORF (open reading frame)
TATA box
Stop codon
Start codon
ATGACAGATTACAGATTACAGATTACAGGATAG
Frame 1
Frame 2
Frame 3
113
Prokaryotes
  • Simple gene structure
  • Small genomes (0.5 to 10 million bp)
  • No introns (uninterrupted)
  • Genes are called Open Reading Frames or ORFs
    (include start & stop codon)
  • High coding density (>90%)
  • Some genes overlap (nested)
  • Some genes are quite short (<60 bp)

114
Prokaryotic Gene Structure
ORF (open reading frame)
TATA box
Stop codon
Start codon
ATGACAGATTACAGATTACAGATTACAGGATAG
Frame 1
Frame 2
Frame 3
115
Gene Finding In Prokaryotes
  • Scan forward strand until a start codon is found
  • Staying in same frame scan in groups of three
    until a stop codon is found
  • If # of codons between start and end is greater
    than 17, identify as gene and go to last start codon
    and proceed with step 1
  • If # of codons between start and end is less than
    18, go back to last start codon and go to step 1
  • At end of chromosome, repeat process for reverse
    complement
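
A simplified sketch of this scan on one strand is given below. It is not the course's orfs.pl: find_orfs is an invented helper, the codon count here includes both the start and the stop codon, every ATG is treated as a possible start (so nested starts in the same frame are reported separately), and the reverse complement is not scanned.

```perl
use strict;
use warnings;

my $MIN_CODONS = 18;   # threshold from the slide: shorter ORFs are ignored

# find_orfs (hypothetical helper): returns [start, codons] pairs, where
# start is the 0-based position of the ATG on the given strand
sub find_orfs {
    my ($dna, $minCodons) = @_;
    $dna = uc($dna);
    my %isStop = (TAA => 1, TAG => 1, TGA => 1);
    my @orfs;
    my $pos = 0;
    while (($pos = index($dna, "ATG", $pos)) >= 0) {
        # Stay in the same frame and scan in groups of three
        my $codons = 0;
        my $foundStop = 0;
        for (my $i = $pos; $i + 3 <= length($dna); $i += 3) {
            $codons++;
            if ($isStop{ substr($dna, $i, 3) }) {
                $foundStop = 1;
                last;
            }
        }
        push @orfs, [$pos, $codons] if $foundStop and $codons >= $minCodons;
        $pos++;   # simplification: resume the search just past this start codon
    }
    return @orfs;
}

my $demo = "ATGACAGATTACAGATTACAGATTACAGGATAG";

# A threshold of 5 is used only because the demo sequence is short
foreach my $orf (find_orfs($demo, 5)) {
    print "ORF at position $orf->[0], $orf->[1] codons\n";
}

my @long = find_orfs($demo, $MIN_CODONS);
print scalar(@long), " ORFs of at least $MIN_CODONS codons\n";
```

To cover both strands, the same function could be called a second time on the reverse complement of the sequence.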

116
Exercise 7 - transorfs.pl
Frame3      A  Y  S  D  A  H
Frame2     C  V  *  R  C  A
Frame1    M  R  I  A  M  R  I
          ATGCGTATAGCGATGCGCATT
          TACGCATATCGCTACGCGTAA
Frame-1   H  T  Y  R  H  A  N
Frame-2     R  I  A  I  R  M
Frame-3    A  Y  L  S  A  C