Title: Sequence Analysis
Sequence Analysis
Determining how similar two (or more) gene/protein sequences are to each other is a staple function in bioinformatics. This information is used to:
1) Identify genes/proteins
2) Infer gene/protein function
3) Measure genetic distance
The entire exercise relies on the comparison between two (or more) sequences, and is independent of any functional content within the sequence(s).
In pairwise analysis and multiple sequence alignments, two (or more) sequences are compared to each other and a similarity measurement is derived. This process is completely computational and there is no need for a database query. From this process we can:
1) Identify common regions of sequence identity (infer function).
2) Rank order multiple sequences to identify the sequences that are most similar (measure genetic distance).
In sequence identification, we compare our
sequence(s) of interest to an entire database of
(known) sequences, and identify those sequences
that are most similar to our sequence of interest.
Theoretical Basis of Pairwise Sequence Analysis
Needleman-Wunsch Algorithm
Global alignment (the entire sequence contributes to the alignment).
Fundamental principle: calculate the alignment score across two sequences. All possible pairs are represented by a two-dimensional array, and all possible comparisons are represented by pathways through the array.
This is an example of dynamic programming: solving a series of smaller subproblems, and reusing their results, to solve the entire problem (divide and conquer).
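The array idea above can be sketched in a few lines of Python. This is a minimal illustration, not the scheme used on the slides: the function name `nw_score` and the scoring values (match +1, mismatch -1, a flat -1 per gap position) are assumptions chosen to keep the example short.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score of a and b (illustrative scoring values)."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # a[:i] aligned against nothing but gaps
    for j in range(1, cols):
        score[0][j] = j * gap          # b[:j] aligned against nothing but gaps
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    # The bottom-right cell covers both complete sequences.
    return score[-1][-1]


print(nw_score("OFFICEUNIVERSITY", "COFFEEICEVARSITY"))
```

Each cell is filled from three already-computed neighbours, so every pathway through the array is accounted for without being enumerated one by one.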
DYNAMIC PROGRAMMING and SEQUENCE ALIGNMENTS
'Dynamic programming' is an efficient programming
technique for solving certain combinatorial
problems. It is particularly important in
bioinformatics as it is the basis of sequence
alignment algorithms for comparing protein and
DNA sequences. In the bioinformatics
application Dynamic Programming gives a
spectacular efficiency gain over a purely
recursive algorithm.
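The efficiency gain over plain recursion can be illustrated with a small experiment. The sketch below is an assumption-laden toy: `count_calls` is a hypothetical helper, and the scoring (match +1, mismatch -1, gap -1) is chosen only for brevity. It counts how many times the alignment subproblem is actually evaluated with and without stored results.

```python
from functools import lru_cache


def count_calls(a, b, memoize):
    """Count evaluations of the alignment recurrence for a and b."""
    calls = 0

    def solve(i, j):
        nonlocal calls
        calls += 1
        if i == 0 or j == 0:
            return -(i + j)            # only gaps remain, -1 each
        pair = 1 if a[i - 1] == b[j - 1] else -1
        return max(recurse(i - 1, j - 1) + pair,
                   recurse(i - 1, j) - 1,
                   recurse(i, j - 1) - 1)

    # With memoization each (i, j) subproblem is evaluated only once.
    recurse = lru_cache(maxsize=None)(solve) if memoize else solve
    recurse(len(a), len(b))
    return calls


naive = count_calls("ACGTAC", "AGTTCA", memoize=False)
stored = count_calls("ACGTAC", "AGTTCA", memoize=True)
print(naive, stored)
```

With stored results the number of evaluations is bounded by the number of cells, (len(a) + 1) * (len(b) + 1), while the purely recursive version revisits the same subproblems many times over.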
Don't expect much enlightenment from the
etymology of the term 'dynamic programming,'
though. Dynamic programming was formalized in the
early 1950s by mathematician Richard Bellman, who
was working at RAND Corporation on optimal
decision processes. He wanted to concoct an
impressive name that would shield his work from
US Secretary of Defense Charles Wilson, a man
known to be hostile to mathematics research. His
work involved time series and planning, thus
'dynamic' and 'programming' (note, nothing
particularly to do with computer programming).
Bellman especially liked 'dynamic' because "it's
impossible to use the word dynamic in a
derogatory sense"; he figured dynamic programming
was "something not even a Congressman could
object to."
DYNAMIC PROGRAMMING and SEQUENCE ALIGNMENTS
Alignment of 2 Sequences (words for demo purposes)
OFFICEUNIVERSITY
COFFEEICEVARSITY
Ungapped Alignment
OFFICEUNIVERSITY
COFFEEICEVARSITY

-OFFICEUNIVERSITY
COFFEEICEVARSITY
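An ungapped alignment amounts to sliding one word along the other and counting identical positions. The following is a minimal sketch of that idea; the helper name `ungapped_matches` and the `offset` convention are assumptions made for illustration.

```python
def ungapped_matches(a, b, offset=0):
    """Count identical positions with no internal gaps.

    offset > 0 shifts `a` to the right by that many places
    (like writing leading '-' characters in front of it);
    offset < 0 shifts `b` to the right instead.
    """
    if offset >= 0:
        pairs = zip(a, b[offset:])
    else:
        pairs = zip(a[-offset:], b)
    return sum(x == y for x, y in pairs)


top, bottom = "OFFICEUNIVERSITY", "COFFEEICEVARSITY"
print(ungapped_matches(top, bottom, 0))  # words written directly above each other
print(ungapped_matches(top, bottom, 1))  # top word shifted one place right
```

No single shift lines up both the OFF/ICE region and the ...RSITY region at once, which is the motivation for allowing gaps.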
DYNAMIC PROGRAMMING and SEQUENCE ALIGNMENTS
Alignment of 2 Sequences (words for demo purposes)
OFFICEUNIVERSITY
COFFEEICEVARSITY
Gapped Alignment
-OFF--ICEUNIVERSITY
COFFEEICE---VARSITY
If gaps at any position (and any length) are
allowed, the process becomes computationally
expensive, and in many cases the alignment does
not provide meaningful information. Hence gaps
must be limited to a useful and manageable number.
DYNAMIC PROGRAMMING and SEQUENCE ALIGNMENTS
Dynamic Programming (Initialization Step)
  O F F I C E U N I V E R S I T Y
C . . . . . . . . . . . . . . . .
O . . . . . . . . . . . . . . . .
F . . . . . . . . . . . . . . . .
F . . . . . . . . . . . . . . . .
E . . . . . . . . . . . . . . . .
E . . . . . . . . . . . . . . . .
I . . . . . . . . . . . . . . . .
C . . . . . . . . . . . . . . . .
E . . . . . . . . . . . . . . . .
V . . . . . . . . . . . . . . . .
A . . . . . . . . . . . . . . . .
R . . . . . . . . . . . . . . . .
S . . . . . . . . . . . . . . . .
I . . . . . . . . . . . . . . . .
T . . . . . . . . . . . . . . . .
Y . . . . . . . . . . . . . . . .
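DYNAMIC PROGRAMMING and SEQUENCE ALIGNMENTS
(x = the row and column letters match)
  O F F I C E U N I V E R S I T Y
C . . . . x . . . . . . . . . . .
O x . . . . . . . . . . . . . . .
F . x . . . . . . . . . . . . . .
F . . x . . . . . . . . . . . . .
E . . . . . x . . . . x . . . . .
E . . . . . x . . . . x . . . . .
I . . . x . . . . x . . . . x . .
C . . . . x . . . . . . . . . . .
E . . . . . x . . . . x . . . . .
V . . . . . . . . . x . . . . . .
A . . . . . . . . . . . . . . . .
R . . . . . . . . . . . x . . . .
S . . . . . . . . . . . . x . . .
I . . . x . . . . x . . . . x . .
T . . . . . . . . . . . . . . x .
Y . . . . . . . . . . . . . . . x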
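Marking the matching cells of the grid is a purely mechanical step, sketched below. The function names `match_matrix` and `render` are hypothetical. Note that this version marks every cell where the two letters are identical, so it may show more marks than a hand-drawn grid that only highlights the cells of interest.

```python
def match_matrix(row_seq, col_seq):
    """True wherever the row letter equals the column letter."""
    return [[r == c for c in col_seq] for r in row_seq]


def render(row_seq, col_seq):
    """Plain-text view of the match grid: 'x' for a match, '.' otherwise."""
    lines = ["  " + " ".join(col_seq)]
    for letter, flags in zip(row_seq, match_matrix(row_seq, col_seq)):
        lines.append(letter + " " + " ".join("x" if f else "." for f in flags))
    return "\n".join(lines)


print(render("COFFEEICEVARSITY", "OFFICEUNIVERSITY"))
```

Runs of marks along a diagonal correspond to stretches of the two words that align without gaps.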
DYNAMIC PROGRAMMING and SEQUENCE ALIGNMENTS
(x = the row and column letters match)
      O    F    F    I    C    E    U    N    I    V    E    R    S    I    T    Y
C    .    .    .    .    x    .    .    .    .    .    .    .    .    .    .    .
O    1    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .
F    .    1    .    .    .    .    .    .    .    .    .    .    .    .    .    .
F    .    .    1    .    .    .    .    .    .    .    .    .    .    .    .    .
E    .    . -0.3    .    .    x    .    .    .    .    x    .    .    .    .    .
E    .    .   -3    .    .    x    .    .    .    .    x    .    .    .    .    .
I    .    .    .    1    .    .    .    .    x    .    .    .    .    x    .    .
C    .    .    .    .    1    .    .    .    .    .    .    .    .    .    .    .
E    .    .    .    .    .    1 -0.3 -0.3   -3    .    x    .    .    .    .    .
V    .    .    .    .    .    .    .    .    .    1    .    .    .    .    .    .
A    .    .    .    .    .    .    .    .    .    .   -3    .    .    .    .    .
R    .    .    .    .    .    .    .    .    .    .    .    1    .    .    .    .
S    .    .    .    .    .    .    .    .    .    .    .    .    1    .    .    .
I    .    .    .    x    .    .    .    .    x    .    .    .    .    1    .    .
T    .    .    .    .    .    .    .    .    .    .    .    .    .    .    1    .
Y    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    1
Gap Penalties
1) Reduce the number of gaps in the alignment
2) Ensure a more meaningful alignment
3) Opening a gap is costly
4) Extending a gap is cheap
The gap opening penalty should be 2 to 3 times larger than the most negative value in the substitution matrix that is being used. The gap extension penalty should be 0.1 to 0.3 times the value of the gap opening penalty.
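The "costly to open, cheap to extend" rule can be written as a small cost function. This is a sketch under stated assumptions: the name `gap_cost` is hypothetical, and the defaults (-3 to open, -0.3 to extend, i.e. an extension of 0.1 times the opening penalty) are the values that appear in the worked OFFICE/COFFEE grids, with the opening penalty charged on the first gap position.

```python
def gap_cost(length, gap_open=-3.0, gap_extend=-0.3):
    """Total penalty for one run of `length` consecutive gap positions."""
    if length <= 0:
        return 0.0
    # The first position pays the opening penalty; each further
    # position pays only the (much smaller) extension penalty.
    return gap_open + (length - 1) * gap_extend


for n in (1, 2, 3):
    print(n, round(gap_cost(n), 1))
```

Under this scheme one gap of length 3 costs far less than three separate gaps of length 1, which pushes the alignment toward a few long gaps rather than many scattered ones.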
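DYNAMIC PROGRAMMING and SEQUENCE ALIGNMENTS
(cell scores summed along the path, starting from the bottom-right corner; x = the row and column letters match)
      O    F    F    I    C    E    U    N    I    V    E    R    S    I    T    Y
C    .    .    .    .    x    .    .    .    .    .    .    .    .    .    .    .
O  2.1    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .
F    .  1.1    .    .    .    .    .    .    .    .    .    .    .    .    .    .
F    .    .  0.1    .    .    .    .    .    .    .    .    .    .    .    .    .
E    .    . -0.9    .    .    x    .    .    .    .    x    .    .    .    .    .
E    .    . -0.6    .    .    x    .    .    .    .    x    .    .    .    .    .
I    .    .    .  2.4    .    .    .    .    x    .    .    .    .    x    .    .
C    .    .    .    .  1.4    .    .    .    .    .    .    .    .    .    .    .
E    .    .    .    .    .  0.4 -0.6 -0.3    0    .    x    .    .    .    .    .
V    .    .    .    .    .    .    .    .    .    3    .    .    .    .    .    .
A    .    .    .    .    .    .    .    .    .    .    2    .    .    .    .    .
R    .    .    .    .    .    .    .    .    .    .    .    5    .    .    .    .
S    .    .    .    .    .    .    .    .    .    .    .    .    4    .    .    .
I    .    .    .    x    .    .    .    .    x    .    .    .    .    3    .    .
T    .    .    .    .    .    .    .    .    .    .    .    .    .    .    2    .
Y    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    1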
DYNAMIC PROGRAMMING and SEQUENCE ALIGNMENTS
(cell scores summed along the path, starting from the top-left corner; x = the row and column letters match)
      O    F    F    I    C    E    U    N    I    V    E    R    S    I    T    Y
C    .    .    .    .    x    .    .    .    .    .    .    .    .    .    .    .
O    1    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .
F    .    2    .    .    .    .    .    .    .    .    .    .    .    .    .    .
F    .    .    3    .    .    .    .    .    .    .    .    .    .    .    .    .
E    .    .    0    .    .    x    .    .    .    .    x    .    .    .    .    .
E    .    . -0.3    .    .    x    .    .    .    .    x    .    .    .    .    .
I    .    .    .  0.7    .    .    .    .    x    .    .    .    .    x    .    .
C    .    .    .    .  1.7    .    .    .    .    .    .    .    .    .    .    .
E    .    .    .    .    .  2.7 -0.3 -0.6 -0.9    .    x    .    .    .    .    .
V    .    .    .    .    .    .    .    .    .  0.1    .    .    .    .    .    .
A    .    .    .    .    .    .    .    .    .    . -2.9    .    .    .    .    .
R    .    .    .    .    .    .    .    .    .    .    . -1.9    .    .    .    .
S    .    .    .    .    .    .    .    .    .    .    .    . -0.9    .    .    .
I    .    .    .    x    .    .    .    .    x    .    .    .    .  0.1    .    .
T    .    .    .    .    .    .    .    .    .    .    .    .    .    .  1.1    .
Y    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .  2.1
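The running totals in the grid can be reproduced by walking along the gapped alignment column by column. The sketch below uses the values that appear in the worked grids (match 1, mismatch -3, gap opening -3, gap extension -0.3). The function name `running_scores` is hypothetical, and treating the unpaired letters at either end as free (unpenalized) end gaps is an assumption made so that the walk starts at the first aligned pair.

```python
def running_scores(top, bottom, match=1.0, mismatch=-3.0,
                   gap_open=-3.0, gap_extend=-0.3):
    """Cumulative score after each column of a gapped alignment."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    columns = list(zip(top, bottom))
    # Assumption: gaps at either end of the alignment are not penalized.
    while columns and "-" in columns[0]:
        columns.pop(0)
    while columns and "-" in columns[-1]:
        columns.pop()
    totals, total, gap_row = [], 0.0, None
    for x, y in columns:
        if x == "-" or y == "-":
            row = "top" if x == "-" else "bottom"
            # Continuing a gap in the same row is cheap; starting one is costly.
            total += gap_extend if gap_row == row else gap_open
            gap_row = row
        else:
            total += match if x == y else mismatch
            gap_row = None
        totals.append(round(total, 1))
    return totals


scores = running_scores("-OFF--ICEUNIVERSITY", "COFFEEICE---VARSITY")
print(scores)
print("final score:", scores[-1])
```

The last value is the score of the whole alignment; it is the same whichever corner of the grid the summation starts from.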
DYNAMIC PROGRAMMING and SEQUENCE ALIGNMENTS
-OFF--ICEUNIVERSITY
COFFEEICE---VARSITY

-OFF--ICE
COFFEEICE

VERSITY
VARSITY
Theoretical Basis of Pairwise Sequence Analysis
Smith-Waterman Algorithm
Local alignment.
Fundamental principle: based on Needleman-Wunsch, but compares segments of all possible lengths and chooses whichever optimizes the similarity measure. This allows the user to search for conserved/functional domains within sequences.
Functionally, global alignments start at the far end of the alignment matrix and trace back across the whole matrix, whereas local alignments show only the regions that align.
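The difference from the global method can be sketched in code: a cell is never allowed to drop below zero, and the answer is the best cell anywhere in the array rather than the corner cell. This is a minimal illustration; the name `sw_score` and the scoring values (match +1, mismatch -1, flat gap -1) are assumptions, not values from the slides.

```python
def sw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment score between any segment of a and any segment of b."""
    best = 0
    prev = [0] * (len(b) + 1)          # previous row of the array
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(
                0,                     # start a fresh local alignment here
                prev[j - 1] + pair,    # extend along the diagonal
                prev[j] + gap,         # gap in b
                cur[j - 1] + gap,      # gap in a
            )
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best


print(sw_score("OFFICEUNIVERSITY", "COFFEEICEVARSITY"))
```

Because poorly matching flanks are reset to zero instead of dragging the score down, a short conserved region embedded in otherwise unrelated sequences still scores well.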
Pairwise Alignment
  Process: compares 2 sequences
  Objective: find common sequence motifs
  Application: http://www.ncbi.nlm.nih.gov/blast/bl2seq/wblast2.cgi
Multiple Alignments
  Process: compares 3 or more sequences
  Objective: find common sequence motifs, rank based on alignment scores
  Application: http://www.ebi.ac.uk/clustalw/
Sequence Searching
  Process: compares 1 sequence against thousands
  Objective: sequence identification, comparative genomics
  Application: http://www.ncbi.nlm.nih.gov/BLAST/
BLAST (Basic Local Alignment Search Tool)
Why is BLAST so fast? By pre-indexing all the possible 11-letter words in the database records.
EXAMPLE: AGTGTCGATCG
Steps
1) Find all the 11-letter words in your query sequence, plus a few variations.
2) Look these up in the 11-letter-word index.
3) Retrieve all sequences containing those words.
4) Use a rigorous algorithm (e.g. Smith-Waterman) to extend the match in both directions.
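Steps 1 to 3 can be sketched with an ordinary dictionary acting as the word index. This is a toy illustration of the indexing idea only, not BLAST's actual implementation: the function names `build_word_index` and `find_seeds` and the record names are hypothetical, only exact words are looked up (the "few variations" are left out), and the extension in step 4 is omitted.

```python
from collections import defaultdict


def build_word_index(database, k=11):
    """Map every k-letter word to the (record, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index


def find_seeds(query, index, k=11):
    """Look up each k-letter word of the query; return (query_pos, record, db_pos)."""
    hits = []
    for qpos in range(len(query) - k + 1):
        for name, dpos in index.get(query[qpos:qpos + k], []):
            hits.append((qpos, name, dpos))
    return hits


# Made-up records for demonstration; the query is the example word above.
database = {"rec1": "TTTAGTGTCGATCGAAA", "rec2": "CCCCCCCCCCCCCCC"}
index = build_word_index(database)
print(find_seeds("AGTGTCGATCG", index))
```

The index is built once, so each query costs only a handful of dictionary lookups; the expensive rigorous alignment is then run only around the seed positions that were found.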