Title: Algorithms in Bioinformatics
1Algorithms in Bioinformatics
- Lawrence D'Antonio
- Ramapo College of New Jersey
2Topics
- Algorithm basics
- Types of algorithms in bioinformatics
- Sequence alignment
- Database Searches
3Algorithm basics
- What is an algorithm?
- Algorithm complexity
- P vs. NP
- NP completeness
4What is an algorithm?
- An algorithm is a step-by-step procedure to solve a problem
- The word algorithm comes from the name of the 9th-century Islamic mathematician al-Khwarizmi
5Algorithm Complexity
- If the algorithm works with n pieces of data and the number of steps is proportional to n, then we say that the running time is O(n).
- If the number of steps is proportional to log n, then the running time is O(log n).
6Example
- Problem: find the largest element in a sequence of n elements.
- Solution idea: iteratively compare the sizes of the elements in the sequence.
7- Algorithm
- Initialize the first element as largest.
- For each remaining element: if the current element is larger than largest, make that element largest.
- Running time: O(n)
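The steps on this slide can be sketched in Python. This is a minimal illustration rather than a reference implementation; the function name `find_largest` is chosen here for the example.

```python
def find_largest(seq):
    """Return the largest element of a non-empty sequence in O(n) time."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    largest = seq[0]             # initialize the first element as largest
    for current in seq[1:]:      # every remaining element is visited once
        if current > largest:
            largest = current    # a larger element replaces the current largest
    return largest


print(find_largest([3, 9, 2, 7]))  # prints 9
```

Each element is compared exactly once, which is where the O(n) running time comes from.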
8Polynomial Time
- An algorithm is said to run in polynomial time if its running time can be written in the form O(n^k) for some constant power k.
- The underlying problem is then said to be of class P.
9Polynomial Time Examples
- Searching
- Binary Search O(log n)
- Sorting
- Quick Sort O(n log n) on average, O(n^2) in the worst case
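Binary search is listed above with running time O(log n). A minimal sketch, assuming the input list is already sorted in ascending order; the function name `binary_search` is illustrative.

```python
def binary_search(sorted_items, target):
    """Return an index of target in sorted_items, or -1 if it is absent."""
    lo, hi = 0, len(sorted_items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2          # look at the middle of the remaining range
        if sorted_items[mid] == target:
            return mid
        if sorted_items[mid] < target:
            lo = mid + 1              # discard the lower half
        else:
            hi = mid - 1              # discard the upper half
    return -1


print(binary_search([1, 3, 5, 7, 9, 11], 7))  # prints 3
```

The search range is halved at every step, so about log2(n) comparisons are needed.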
10NP Algorithms
- An algorithm is nondeterministic if it begins with guessing a solution to the problem and then verifies the guess.
- A problem is in the class NP if there is a nondeterministic algorithm for that problem which runs in polynomial time.
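The guess-then-verify idea can be illustrated with the partition problem (split a list of numbers into two parts with equal sums). The sketch below covers only the verification half: given a guessed set of positions, it checks the guess in time linear in the input. The function name and the choice of problem are assumptions made for illustration.

```python
def verify_partition(numbers, guess_indices):
    """Check whether the guessed positions hold exactly half of the total sum."""
    chosen = set(guess_indices)
    if any(i < 0 or i >= len(numbers) for i in chosen):
        return False                       # the guess refers to positions that do not exist
    chosen_sum = sum(numbers[i] for i in chosen)
    return 2 * chosen_sum == sum(numbers)  # both parts must have the same sum


numbers = [3, 1, 1, 2, 2, 1]
print(verify_partition(numbers, [0, 3]))   # 3 + 2 = 5 out of a total of 10: True
print(verify_partition(numbers, [0, 1]))   # 3 + 1 = 4: False
```

Verification is fast; the difficulty lies in the number of possible guesses, which is 2^n for n numbers.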
11NP Complete
- A problem is NP-complete if it is in NP and every other NP problem can be reduced to it in polynomial time, so that a solution to this problem can be used to solve all other NP problems.
- A problem is NP-hard if it is at least as hard as the NP-complete problems.
12NP Complete Examples
- Traveling salesman
- Knapsack problem
- Partition problem
- Graph coloring
13P = NP?
- P ⊆ NP
- If P ≠ NP, then no NP-complete problem can be solved in polynomial time; the best known algorithms for these problems take exponential time.
14Polynomial vs. Exponential
15Algorithms in Bioinformatics
- Algorithms to compare DNA, RNA, or protein sequences
- Database searches to find homologous sequences
- Sequence assembly
- Construction of evolutionary trees
- Structure prediction
16Edit operations on sequences
- Substitution: AATAAGC -> ATTAAGC
- Insertion: AAT-AAGC -> AATTAAGC
- Deletion: AATAAGC -> AA-AAGC
17What is sequence alignment?
- Compare two sequences using matches, substitutions and indels.
- G A A - - T C A T
- G - T G G - C A -
- 3 matches, 1 substitution, 5 indels
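The counts on this slide can be reproduced by walking through the alignment column by column. A small sketch, assuming each row is written as a string with "-" for a space; the function name is illustrative.

```python
def classify_columns(row_a, row_b, gap="-"):
    """Count (matches, substitutions, indels) in a two-row alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("both rows of an alignment must have the same length")
    matches = substitutions = indels = 0
    for x, y in zip(row_a, row_b):
        if x == gap or y == gap:
            indels += 1            # a character facing a space
        elif x == y:
            matches += 1           # identical characters
        else:
            substitutions += 1     # two different characters
    return matches, substitutions, indels


print(classify_columns("GAA--TCAT", "G-TGG-CA-"))  # prints (3, 1, 5)
```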
18Complexity of DNA Problems
- 3 billion base pairs in human genome
- Many NP-complete problems
- About 10^600 possible alignments for two 1000-character sequences
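One way to arrive at a count of this size is the central binomial coefficient C(2n, n), a common way of counting the alignments of two sequences of length n. That this is the count intended on the slide is an assumption; the sketch below only checks the order of magnitude for n = 1000.

```python
import math


def alignment_count(n):
    """Central binomial coefficient C(2n, n), used here as an alignment count."""
    return math.comb(2 * n, n)


digits = len(str(alignment_count(1000)))
print(digits)  # prints 601, i.e. the count is on the order of 10^600
```

A search space this large rules out enumerating alignments one by one, which motivates the dynamic programming approach.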
19Types of sequence alignment
- Determine the alignment of two sequences that maximizes similarity (global alignment)
- Determine substrings of two sequences with maximum similarity (local alignment)
- Determine the alignment for several sequences that maximizes the sum-of-pairs similarity (multiple alignment)
20Significance of Alignment
- Functional similarity
- Structural similarity
- Homology
21Scoring System
- Assign a score for each possible match, substitution and indel
- Distance functions: find the alignment that minimizes the distance between sequences
- Similarity functions: find the alignment that maximizes the similarity between sequences
22Edit Distance
- G A A - - T C A T
- G - T G G - C A -
- Similarity function: +1 for match, -1 for substitution, -2 for indel
- Score: -8
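The score on this slide follows from adding up one value per column. A minimal sketch using the slide's values (+1 match, -1 substitution, -2 indel); the function and parameter names are illustrative.

```python
def alignment_score(row_a, row_b, match=1, substitution=-1, indel=-2, gap="-"):
    """Sum the column scores of a two-row alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("both rows of an alignment must have the same length")
    score = 0
    for x, y in zip(row_a, row_b):
        if x == gap or y == gap:
            score += indel
        elif x == y:
            score += match
        else:
            score += substitution
    return score


# 3 matches, 1 substitution, 5 indels: 3*1 + 1*(-1) + 5*(-2) = -8
print(alignment_score("GAA--TCAT", "G-TGG-CA-"))  # prints -8
```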
23Dynamic Programming
- Used on optimization problems
- Bottom-up approach
- Recursively builds up solution from subproblem
optimal solutions
24Dynamic Programming Alignment Algorithm
(Needleman-Wunsch)
- Given sequences a1, a2, ..., an and b1, b2, ..., bm to be aligned
- Initialize alignment matrix (aligning with spaces)
- Entry i,j gives optimal alignment score for sequences a1, a2, ..., ai and b1, b2, ..., bj (where 1 ≤ i ≤ n, 1 ≤ j ≤ m)
25Computing Alignment Matrix
If a1, a2, ..., ai and b1, b2, ..., bj have been aligned, there are three possible next moves
- Match a(i+1) with b(j+1)
- Match a(i+1) with a space
- Match b(j+1) with a space
Choose the move that maximizes the similarity of the two sequences
26Global Alignment Matrix
27Optimal Global Alignment
28Alignment Running Time
- Assuming two sequences of n characters each
- Running time is O(n^2) (each entry of the matrix must be calculated)
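The matrix computation and the three possible moves can be sketched as follows. This is a minimal version of the dynamic programming scheme with a simple traceback, assuming the scoring values +1 match, -1 substitution, -2 indel; the function name and the tie-breaking order in the traceback are choices made for this example.

```python
def needleman_wunsch(a, b, match=1, substitution=-1, indel=-2):
    """Return (best score, aligned row for a, aligned row for b)."""
    n, m = len(a), len(b)
    # score[i][j] is the best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * indel            # prefix of a aligned with spaces only
    for j in range(1, m + 1):
        score[0][j] = j * indel            # prefix of b aligned with spaces only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else substitution
            score[i][j] = max(score[i - 1][j - 1] + pair,   # character with character
                              score[i - 1][j] + indel,      # character of a with a space
                              score[i][j - 1] + indel)      # character of b with a space
    # Trace back from the bottom-right corner to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else substitution
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + indel:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


print(needleman_wunsch("ACGT", "AGT"))  # prints (1, 'ACGT', 'A-GT')
```

The two nested loops fill (n + 1)(m + 1) entries with constant work each, which gives the quadratic running time for sequences of equal length.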
29Variations of Alignment Algorithm
- Gap penalty
- Local alignment
- Multiple alignment
30Gap Penalty
- A gap is a run of k consecutive spaces
- k consecutive spaces are more probable than k isolated spaces
- Typical gap penalty function: a + bk (affine gap penalty)
- Here the first space in a gap is penalized a + b; further spaces are penalized b each.
31Gap Penalty Example
- Use gap penalty 1 + k, with +1 for a match and -1 for a substitution
- A - A - C - A
- A C T A T C A
- Score: -6
- A A C - - - A
- A C T A T C A
- Score: -4
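Both scores in this example can be checked with a short scoring routine. The sketch assumes a gap of k consecutive spaces costs gap_open + gap_extend * k with both parameters equal to 1, and that matches and substitutions score +1 and -1; these values are inferred from the scores shown, and the names are illustrative.

```python
def affine_gap_score(row_a, row_b, match=1, substitution=-1,
                     gap_open=1, gap_extend=1, gap="-"):
    """Score an alignment in which a gap of length k costs gap_open + gap_extend * k."""
    score = 0
    gap_row = None                 # row holding the current run of spaces, if any
    for x, y in zip(row_a, row_b):
        if x == gap or y == gap:
            row = 0 if x == gap else 1
            if row != gap_row:
                score -= gap_open  # a new gap starts here
            score -= gap_extend    # every space in the gap pays the extension cost
            gap_row = row
        else:
            score += match if x == y else substitution
            gap_row = None         # the gap, if any, has ended
    return score


print(affine_gap_score("A-A-C-A", "ACTATCA"))  # three gaps of length 1: prints -6
print(affine_gap_score("AAC---A", "ACTATCA"))  # one gap of length 3: prints -4
```

The same three spaces cost 6 when isolated and 4 when consecutive, which is the effect the affine penalty is meant to produce.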
32Local Alignment
- Find conserved regions in otherwise dissimilar sequences (e.g., viral and host DNA)
- Smith-Waterman algorithm
- Includes a fourth possibility at each step (don't align)
33Local Alignment Example
- Align the following
- G C T C T G C G A A T A
- C G T T G A G A T A C T
34Optimal Local Alignment
- G C T C T G C G A A T A
- C G T T G A G A T A C T
- (G C T C) T G C G A A T A
- (C G T) T G A G - A T A (C T)
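The fourth possibility of local alignment, not aligning at all, amounts to never letting a matrix entry drop below zero. A minimal sketch of this idea, assuming the scoring values +1 match, -1 substitution, -2 indel (the scoring used for the example on the slides is not stated, so the result for those sequences is not claimed here); the function name is illustrative.

```python
def smith_waterman(a, b, match=1, substitution=-1, indel=-2):
    """Return (best local score, aligned piece of a, aligned piece of b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]   # first row and column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else substitution
            score[i][j] = max(0,                               # don't align
                              score[i - 1][j - 1] + pair,
                              score[i - 1][j] + indel,
                              score[i][j - 1] + indel)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best entry until an entry with score 0 is reached.
    row_a, row_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        pair = match if a[i - 1] == b[j - 1] else substitution
        if score[i][j] == score[i - 1][j - 1] + pair:
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + indel:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(row_a)), "".join(reversed(row_b))


print(smith_waterman("TTACGTTT", "GGACGGG"))  # prints (3, 'ACG', 'ACG')
score, piece_a, piece_b = smith_waterman("GCTCTGCGAATA", "CGTTGAGATACT")
print(score, piece_a, piece_b)
```

Which local alignment is reported for the slide's two sequences depends on the scoring values chosen.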
35Multiple Alignment
- Find the alignment among a set of sequences that maximizes the sum of scores for all pairs of sequences
- Dynamic programming run-time for k sequences of length n: O(k^2 2^k n^k)
- Multiple alignment is NP-complete
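The sum-of-pairs score of a given multiple alignment can be computed by scoring every pair of rows. A minimal sketch, assuming +1 match, -1 substitution, -2 indel, and assuming that a column position where both rows have a space contributes nothing; the function name is illustrative.

```python
from itertools import combinations


def sum_of_pairs(rows, match=1, substitution=-1, indel=-2, gap="-"):
    """Sum the pairwise column scores over all pairs of rows of a multiple alignment."""
    total = 0
    for row_x, row_y in combinations(rows, 2):
        for x, y in zip(row_x, row_y):
            if x == gap and y == gap:
                continue               # two spaces facing each other: no contribution (assumption)
            if x == gap or y == gap:
                total += indel
            elif x == y:
                total += match
            else:
                total += substitution
    return total


print(sum_of_pairs(["AT", "A-", "-T"]))  # prints -6
```

Scoring a given alignment is cheap; the exponential cost quoted above comes from searching over all alignments of k sequences at once.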
36Other Features
- Usually used for protein alignment
- Can be used for global or local alignment
37Multiple Alignment Example
38Multiple vs. Pairwise Alignment
- Optimal multiple alignment does not imply optimal pairwise alignment
- AT    A-
- A-    -T
- -T
39Substitution Matrices
- In homologous sequences certain amino acid substitutions are more likely to occur than others
- Types of substitution matrices
- PAM
- BLOSUM
40PAM Matrices
- Defines units of evolutionary distance
- 1 PAM unit represents an average of one mutation per 100 amino acids
- Start with a set of highly similar sequences and compute
- p_a: probability of occurrence of amino acid a
- M_ab: probability of a mutating to b
41PAM Matrix Formula
- Entries in a k-PAM matrix
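A common textbook form of such an entry is a log-odds score built from the quantities p_a and M_ab. The expression below is a sketch under stated assumptions: M^k denotes the k-th power of the 1-PAM mutation matrix, and the factor 10 and the base-10 logarithm are scaling conventions that may differ from the ones used on the original slide.

```latex
\mathrm{PAM}_k(a, b) = 10 \, \log_{10} \frac{\left(M^{k}\right)_{ab}}{p_b}
```

The ratio compares the probability that a mutates to b over k PAM units with the probability of seeing b by chance, so positive entries mark substitutions that occur more often than chance would suggest.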
42PAM250 Matrix
43BLOSUM Matrices (Omit)
- Uses log-odds ratio similar to PAM
- Uses short highly conserved sequences
- BLOSUM x matrices created after removing sequences that are more than x percent identical
- Better at local alignments
44BLOSUM Matrices
- A motif is a conserved amino acid pattern found in a group of proteins with similar biological meaning (PROSITE)
- A block is a conserved amino acid pattern in a group of proteins (no spaces allowed in the pattern) (BLOCKS)
45Motif Example
- Motif obtained from a group of 34 tubulin proteins
- MFYW . . FVLIH . FYW . . EGM
46Defining BLOSUM (I)
- BLOSUMn uses blocks that are n% identical (BLOSUM62 is most common)
- Consider all pairs of amino acids appearing in the same column in the blocks
47Defining BLOSUM (II)
- Define n(i,j) to be the frequency that amino acids i,j appear in a column pair
- Define e(i,j) to be the frequency that amino acids i,j appear in any pair
- Define the BLOSUM entry as a log-odds ratio of n(i,j) to e(i,j)
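With n(i,j) and e(i,j) as defined here, a log-odds entry can be written as below. This is a sketch assuming the widely used half-bit convention (factor 2, base-2 logarithm, rounding to the nearest integer); the scaling on the original slide may differ.

```latex
\mathrm{BLOSUM}(i, j) = \operatorname{round}\!\left( 2 \, \log_{2} \frac{n(i,j)}{e(i,j)} \right)
```

An entry is positive when the pair i,j is observed in block columns more often than expected, and negative when it is observed less often.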
48PAM vs. BLOSUM
- PAM derived from highly similar sequences (evolutionary model)
- BLOSUM derived from protein families sharing a common ancestor (conserved domain model)
49Database Searches
50FASTA
- Looks for sequences in a database similar to a query sequence
- Heuristic, exclusion method
- Compares query sequence to each database sequence (called the text)
51FASTA Algorithm (I)
- Look for small substrings in query and text that exactly match (hot spots)
- Find ten best diagonal runs of hot spots
52Hot Spot Example
- Query (columns of the comparison matrix): E K L A S R K L
- Text (rows of the comparison matrix): H A S H K L
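The hot spots for the two sequences of this example can be found by indexing the short substrings of the query and looking up each short substring of the text. A minimal sketch, assuming substrings of length 2; the function names and the choice k = 2 are illustrative and not taken from FASTA itself.

```python
from collections import defaultdict


def hot_spots(query, text, k=2):
    """Return sorted (query position, text position) pairs of exactly matching k-length substrings."""
    positions = defaultdict(list)
    for i in range(len(query) - k + 1):
        positions[query[i:i + k]].append(i)        # index every k-length substring of the query
    spots = []
    for j in range(len(text) - k + 1):
        for i in positions.get(text[j:j + k], []):
            spots.append((i, j))
    return sorted(spots)


def group_by_diagonal(spots):
    """Group hot spots by diagonal, identified by query position minus text position."""
    diagonals = defaultdict(list)
    for i, j in spots:
        diagonals[i - j].append((i, j))
    return dict(diagonals)


spots = hot_spots("EKLASRKL", "HASHKL")
print(spots)                     # prints [(1, 4), (3, 1), (6, 4)]
print(group_by_diagonal(spots))  # prints {-3: [(1, 4)], 2: [(3, 1), (6, 4)]}
```

Diagonal 2 holds two hot spots (AS and KL), so it is the diagonal run a FASTA-style search would favor for these sequences.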
53FASTA Algorithm (II)
- Find best local alignment for each run
- Combine these into larger alignment
- Do multiple alignment on query and texts having
highest score in last step
54BLAST
- Basic Local Alignment Search Tool
- Heuristic, exclusion method
- Computes statistical significance of alignment
scores
55BLAST Algorithm
- Find all w-length substrings in text that align to some w-length substring in query with score above a given threshold (called hits)
- Extend these hits as far as possible (segment pairs)
- Report the highest scoring segment pairs
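The hit-and-extend scheme can be sketched in a much simplified form. The version below is an illustration only and differs from BLAST in stated ways: hits are exact w-length matches rather than substrings scoring above a threshold, extension is ungapped, scoring is +1 match and -1 mismatch, and extension stops once the score falls more than x_drop below the best seen. All names and parameter values are assumptions made for this example.

```python
def extend_one_way(query, text, qi, tj, step, match, mismatch, x_drop):
    """Walk from (qi, tj) in direction step; return (best gain, length at best gain)."""
    gain = best_gain = 0
    length = best_length = 0
    while 0 <= qi < len(query) and 0 <= tj < len(text):
        gain += match if query[qi] == text[tj] else mismatch
        length += 1
        if gain > best_gain:
            best_gain, best_length = gain, length
        elif best_gain - gain > x_drop:
            break                                  # score dropped too far: stop extending
        qi += step
        tj += step
    return best_gain, best_length


def blast_like(query, text, w=3, match=1, mismatch=-1, x_drop=2):
    """Return segment pairs (score, query start, text start, length), best first."""
    seeds = {}
    for i in range(len(query) - w + 1):
        seeds.setdefault(query[i:i + w], []).append(i)
    found = set()
    for j in range(len(text) - w + 1):
        for i in seeds.get(text[j:j + w], []):     # exact w-length hit at (i, j)
            right_gain, right_len = extend_one_way(
                query, text, i + w, j + w, 1, match, mismatch, x_drop)
            left_gain, left_len = extend_one_way(
                query, text, i - 1, j - 1, -1, match, mismatch, x_drop)
            score = w * match + left_gain + right_gain
            found.add((score, i - left_len, j - left_len, w + left_len + right_len))
    return sorted(found, reverse=True)


print(blast_like("EKLASRKL", "HASHKL", w=2))  # prints [(3, 3, 1, 5), (2, 1, 4, 2)]
```

The top segment pair here aligns ASRKL in the query with ASHKL in the text: four matches and one mismatch, for a score of 3.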
56Other Bioinformatics Algorithms
- Palindromes
- Tandem Repeats
- Longest Common Subsequence
- Double Digest (NP-complete)
- Shortest Common Superstring (NP-complete)