Algorithms in Bioinformatics

Slides: 58

Provided by: lda7

Learn more at: https://phobos.ramapo.edu

1
Algorithms in Bioinformatics

Lawrence D'Antonio
Ramapo College of New Jersey

2
Topics

Algorithm basics
Types of algorithms in bioinformatics
Sequence alignment
Database Searches

3
Algorithm basics

What is an algorithm?
Algorithm complexity
P vs. NP
NP completeness

4
What is an algorithm?

An algorithm is a step-by-step procedure to solve
a problem
The word algorithm comes from the name of the
9th-century Islamic mathematician al-Khwarizmi

5
Algorithm Complexity

If the algorithm works with n pieces of data and
the number of steps is proportional to n, then we
say that the running time is O(n).
If the number of steps is proportional to log
n, then the running time is O(log n).

6
Example

Problem: find the largest element in a sequence
of n elements.
Solution idea: iteratively compare the sizes of
the elements in the sequence.

Algorithm:
Initialize first element as largest.
For each remaining element:
If the current element is larger than largest,
make that element largest.

Running time: O(n)
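
The steps above can be sketched in Python as follows. The function name find_largest is chosen here for illustration, and the sequence is assumed to be non-empty.

```python
def find_largest(seq):
    """Return the largest element of a non-empty sequence in O(n) steps."""
    largest = seq[0]              # initialize first element as largest
    for item in seq[1:]:          # for each remaining element
        if item > largest:        # current element larger than largest?
            largest = item        # make that element largest
    return largest


print(find_largest([3, 17, 4, 9]))  # prints 17
```

Each element is examined exactly once, which is where the O(n) running time comes from.
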
8
Polynomial Time

An algorithm is said to run in polynomial time if
its running time can be written in the form O(n^k)
for some constant k.
The underlying problem is said to be of class P.

9
Polynomial Time Examples

Searching
Binary Search: O(log n)
Sorting
Quick Sort: O(n log n) on average, O(n^2) in the
worst case
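
A minimal sketch of binary search, to show why the running time is O(log n): each comparison halves the part of the list that can still contain the target. The function name and the choice to return -1 for a missing target are assumptions made for this example, and the input list must already be sorted.

```python
def binary_search(sorted_items, target):
    """Return the index of target in sorted_items, or -1 if it is absent."""
    lo, hi = 0, len(sorted_items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2          # middle of the remaining range
        if sorted_items[mid] == target:
            return mid
        if sorted_items[mid] < target:
            lo = mid + 1              # target can only be in the right half
        else:
            hi = mid - 1              # target can only be in the left half
    return -1


print(binary_search([1, 3, 5, 7, 9, 11], 7))  # prints 3
```
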

10
NP Algorithms

An algorithm is nondeterministic if it begins
with guessing a solution to the problem and then
verifies the guess.
A problem is of category NP if there is a
nondeterministic algorithm for that problem which
runs in polynomial time.

11
NP Complete

A problem is NP-complete if it has an NP
algorithm, and every other NP problem can be
reduced to it in polynomial time, so a solution to
this problem can be used to solve all other NP
problems.
A problem is NP-hard if it is at least as hard as
the NP-complete problems.

12
NP Complete Examples

Traveling salesman
Knapsack problem
Partition problem
Graph coloring

13
P = NP?

P ⊆ NP
If P ≠ NP, then NP-complete problems cannot be
solved in polynomial time (the best known
algorithms take exponential time).

14
Polynomial vs. Exponential
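
A small sketch of how the two kinds of growth compare. The choice of n^2 as the polynomial and 2^n as the exponential, and the function name growth_table, are assumptions made here for illustration.

```python
def growth_table(sizes):
    """Return (n, n^2, 2^n) for each problem size n."""
    return [(n, n ** 2, 2 ** n) for n in sizes]


for n, poly, expo in growth_table([10, 20, 40, 80]):
    print(f"n={n:3d}  n^2={poly:6d}  2^n={expo}")
```

Doubling n multiplies n^2 by four, but it squares 2^n, so the exponential column quickly outgrows anything a computer can step through.
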
15
Algorithms in Bioinformatics

Algorithms to compare DNA, RNA, or protein
sequences
Database searches to find homologous sequences
Sequence assembly
Construction of evolutionary trees
Structure prediction

16
Edit operations on sequences

Substitution
AATAAGC
ATTAAGC
Insertion
AAT-AAGC
AATTAAGC
Deletion
AATAAGC
AA-AAGC

17
What is sequence alignment?

Compare two sequences using matches,
substitutions and indels.
G A A - - T C A T
G - T G G - C A -
3 matches, 1 substitution, 5 indels
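
The counts above can be checked with a short sketch that walks the alignment column by column. The function name classify_columns is hypothetical, and the alignment is written without the spaces used on the slide.

```python
def classify_columns(top, bottom, gap="-"):
    """Count (matches, substitutions, indels) in an alignment of equal-length rows."""
    matches = substitutions = indels = 0
    for x, y in zip(top, bottom):
        if x == gap or y == gap:
            indels += 1               # a character aligned with a space
        elif x == y:
            matches += 1
        else:
            substitutions += 1
    return matches, substitutions, indels


print(classify_columns("GAA--TCAT", "G-TGG-CA-"))  # prints (3, 1, 5)
```
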

18
Complexity of DNA Problems

3 billion base pairs in human genome
Many NP-complete problems
About 10^600 possible alignments for two 1000
character sequences

19
Types of sequence alignment

Determine the alignment of two sequences that
maximizes similarity (global alignment)
Determine substrings of two sequences with
maximum similarity (local alignment)
Determine the alignment for several sequences
that maximizes the sum of pairs similarity
(multiple alignment)

20
Significance of Alignment

Functional similarity
Structural similarity
Homology

21
Scoring System

Assign a score for each possible match,
substitution and indel
Distance functions: find the alignment that
minimizes the distance between sequences
Similarity functions: find the alignment that
maximizes the similarity between sequences

22
Edit Distance

G A A - - T C A T
G - T G G - C A -
Similarity function: 1 for match, -1 for
substitution, -2 for indel
Score: -8
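
A minimal sketch of how this score is obtained, using the similarity function given on the slide (1 for match, -1 for substitution, -2 for indel). The function name alignment_score is hypothetical.

```python
def alignment_score(top, bottom, match=1, mismatch=-1, indel=-2):
    """Sum the column scores of an alignment given as two equal-length rows."""
    score = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            score += indel            # character aligned with a space
        elif x == y:
            score += match
        else:
            score += mismatch         # substitution
    return score


# 3 matches, 1 substitution, 5 indels: 3*1 + 1*(-1) + 5*(-2)
print(alignment_score("GAA--TCAT", "G-TGG-CA-"))  # prints -8
```
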

23
Dynamic Programming

Used on optimization problems
Bottom-up approach
Builds up the solution from optimal solutions to
subproblems

24
Dynamic Programming Alignment Algorithm
(Needleman-Wunsch)

Given sequences a1, a2, ..., an and b1, b2, ..., bm
to be aligned
Initialize alignment matrix (aligning with
spaces)
Entry i,j gives optimal alignment score for
sequences a1, a2, ..., ai and b1, b2, ..., bj
(where 1 ≤ i ≤ n, 1 ≤ j ≤ m)

25
Computing Alignment Matrix

If a1, a2, ..., ai and b1, b2, ..., bj have been
aligned, there are three possible next moves:

Match a(i+1) with b(j+1)
Match a(i+1) with a space
Match b(j+1) with a space

Choose the move that maximizes the similarity of
the two sequences
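
The matrix computation can be sketched as below. This is an illustrative version of the dynamic programming idea, not the presenter's code: the function name and the scoring values (1 for match, -1 for substitution, -2 for indel) are assumptions taken from the earlier scoring example. When several alignments tie for the best score, the traceback returns just one of them.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, indel=-2):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * indel       # a[:i] aligned entirely with spaces
    for j in range(1, m + 1):
        score[0][j] = j * indel       # b[:j] aligned entirely with spaces
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,   # match a_i with b_j
                score[i - 1][j] + indel,      # match a_i with a space
                score[i][j - 1] + indel,      # match b_j with a space
            )
    # Trace back from the bottom-right corner to recover one alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            pair = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + pair:
                top.append(a[i - 1])
                bottom.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + indel:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


best, row_a, row_b = needleman_wunsch("ACGT", "AGT")
print(best)
print(row_a)
print(row_b)
```

Filling the (n+1) by (m+1) matrix takes constant work per entry, so the time is proportional to n times m.
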

26
Global Alignment Matrix

27
Optimal Global Alignment

28
Alignment Running Time

Assuming two sequences of n characters each
Running time is O(n^2) (each entry of the matrix
must be calculated)

29
Variations of Alignment Algorithm

Gap penalty
Local alignment
Multiple alignment

30
Gap Penalty

A gap is a number k of consecutive spaces
k consecutive spaces are more probable than k
isolated spaces
Typical gap penalty function: a + bk (affine
gap penalty)
Here the first space in a gap is penalized a + b,
further spaces are penalized b each.

31
Gap Penalty Example

Use penalty 1 + k for a gap of k spaces
A - A - C - A
A C T A T C A
Score: -6
A A C - - - A
A C T A T C A
Score: -4
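
The two scores can be reproduced with a short sketch. It assumes the affine penalty a + bk with a = 1 and b = 1 (so a gap of k spaces costs 1 + k), together with 1 for a match and -1 for a substitution. The slide does not state the match and substitution values, but these are the values that give -6 and -4. The function and parameter names are hypothetical.

```python
def affine_alignment_score(top, bottom, match=1, mismatch=-1,
                           gap_open=1, gap_extend=1):
    """Score an alignment where a gap of k spaces costs gap_open + gap_extend * k."""
    score = 0
    gap_in_top = gap_in_bottom = False    # are we inside a run of spaces?
    for x, y in zip(top, bottom):
        if x == "-":
            # the first space of a gap pays gap_open on top of gap_extend
            score -= gap_extend + (0 if gap_in_top else gap_open)
            gap_in_top, gap_in_bottom = True, False
        elif y == "-":
            score -= gap_extend + (0 if gap_in_bottom else gap_open)
            gap_in_top, gap_in_bottom = False, True
        else:
            score += match if x == y else mismatch
            gap_in_top = gap_in_bottom = False
    return score


print(affine_alignment_score("A-A-C-A", "ACTATCA"))  # prints -6
print(affine_alignment_score("AAC---A", "ACTATCA"))  # prints -4
```

Three isolated spaces cost 3 * (1 + 1) = 6, while one gap of three spaces costs 1 + 3 = 4, so the second alignment scores higher.
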

32
Local Alignment

Find conserved regions in otherwise dissimilar
sequences (e.g., viral and host DNA)
Smith-Waterman algorithm
Includes a fourth possibility at each step (don't
align)
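
A minimal sketch of the local alignment score, assuming the same illustrative scoring as before (1 for match, -1 for substitution, -2 for indel). The "don't align" possibility appears as the 0 inside max: a local alignment may start fresh at any cell instead of extending a negative-scoring prefix. The function name is hypothetical, and only the best score is returned, not the alignment itself.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, indel=-2):
    """Return the best local alignment score between a and b."""
    best = 0
    prev = [0] * (len(b) + 1)         # row i-1 of the matrix
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)     # row i of the matrix
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                    # fourth possibility: don't align
                prev[j - 1] + pair,   # match a_i with b_j
                prev[j] + indel,      # match a_i with a space
                curr[j - 1] + indel,  # match b_j with a space
            )
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("XXXACGTYYY", "ZZACGTZZ"))  # prints 4
```
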

33
Local Alignment Example

Align the following
G C T C T G C G A A T A
C G T T G A G A T A C T

34
Optimal Local Alignment

G C T C T G C G A A T A
C G T T G A G A T A C T
(G C T C) T G C G A A T A
(C G T) T G A G - A T A (C T)

35
Multiple Alignment

Find the alignment among a set of sequences that
maximizes the sum of scores for all pairs of
sequences
Dynamic programming run-time for k sequences of
length n: O(k^2 2^k n^k)
Multiple alignment is NP-complete
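
A small sketch of the sum-of-pairs score for a given multiple alignment. The scoring values (1, -1, -2) and the convention that two spaces in the same column score 0 are assumptions made for this example, as are the function names.

```python
from itertools import combinations


def pair_score(x, y, match=1, mismatch=-1, indel=-2):
    """Score one column of a pairwise alignment."""
    if x == "-" and y == "-":
        return 0                      # assumed convention: space vs. space is free
    if x == "-" or y == "-":
        return indel
    return match if x == y else mismatch


def sum_of_pairs(rows):
    """Sum the pairwise alignment scores over all pairs of rows."""
    total = 0
    for r1, r2 in combinations(rows, 2):
        total += sum(pair_score(x, y) for x, y in zip(r1, r2))
    return total


print(sum_of_pairs(["ACGT", "A-GT", "ACG-"]))
```

For k rows there are k(k-1)/2 pairs to score, which is the k^2 factor in the running time above.
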

36
Other Features

Usually used for protein alignment
Can be used for global or local alignment

37
Multiple Alignment Example
38
Multiple vs. Pairwise Alignment

Optimal multiple alignment does not imply optimal
pairwise alignment
Multiple alignment of AT, A and T:
A T
A -
- T
Induced pairwise alignment of A and T:
A -
- T

39
Substitution Matrices

In homologous sequences certain amino acid
substitutions are more likely to occur than
others
Types of substitution matrices
PAM
BLOSUM

40
PAM Matrices

Defines units of evolutionary distance
1 PAM unit represents an average of one mutation
per 100 amino acids
Start with a set of highly similar sequences and
compute:
p_a: probability of occurrence of amino acid a
M_ab: probability of a mutating to b

41
PAM Matrix Formula

Entries in a k-PAM matrix
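
The formula itself is missing from this transcript. One commonly used log-odds form, written with the p_a and M_ab defined on the previous slide, is given here as an assumption rather than as the slide's exact formula:

```latex
\mathrm{PAM}_k(a,b) \;=\; 10 \,\log_{10} \frac{\left(M^{k}\right)_{ab}}{p_b}
```

Here M^k is the k-th power of the one-step mutation matrix M, so (M^k)_ab is the probability that a has become b after k PAM units. Dividing by p_b compares this with the chance of seeing b at random.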

42
PAM250 Matrix
43
BLOSUM Matrices (Omit)

Uses log-odds ratio similar to PAM
Uses short highly conserved sequences
BLOSUM x matrices created after removing
sequences that are more than x percent identical
Better at local alignments

44
BLOSUM Matrices

A motif is a conserved amino acid pattern found
in a group of proteins with similar biological
meaning (PROSITE)
A block is a conserved amino acid pattern in a
group of proteins (no spaces allowed in the
pattern) (BLOCKS)

45
Motif Example

Motif obtained from a group of 34 tubulin
proteins
MFYW . . FVLIH . FYW . . EGM

46
Defining BLOSUM (I)

BLOSUMn uses blocks that are n% identical
(BLOSUM62 is most common)
Consider all pairs of amino acids appearing in
the same column in the blocks

47
Defining BLOSUM (II)

Define n(i,j) to be the frequency that amino
acids i,j appear in a column pair
Define e(i,j) to be the frequency that amino
acids i,j appear in any pair
Define BLOSUM entry
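
The entry formula is missing from this transcript. A commonly used log-odds form, using the n(i,j) and e(i,j) defined above, is given here as an assumption rather than as the slide's exact formula:

```latex
\mathrm{BLOSUM}(i,j) \;=\; 2 \,\log_{2} \frac{n(i,j)}{e(i,j)}
```

The result is usually rounded to the nearest integer. A positive entry means the pair i,j appears together in block columns more often than expected by chance.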

48
PAM vs. BLOSUM

PAM derived from highly similar sequences
(evolutionary model)
BLOSUM derived from protein families sharing a
common ancestor (conserved domain model)

49
Database Searches

FASTA
BLAST

50
FASTA

Looks for sequences in a database similar to a
query sequence
Heuristic, exclusion method
Compares query sequence to each database sequence
(called the text)

51
FASTA Algorithm (I)

Look for small substrings in query and text that
exactly match (hot spots)
Find ten best diagonal runs of hot spots

52
Hot Spot Example

E K L A S R K L
H
A
S
H
K
L

53
FASTA Algorithm (II)

Find best local alignment for each run
Combine these into larger alignment
Do a full dynamic programming alignment of the
query against the texts having the highest score
in the last step

54
BLAST

Basic Local Alignment Search Tool
Heuristic, exclusion method
Computes statistical significance of alignment
scores

55
BLAST Algorithm

Find all w-length substrings in text that align
to some w-length substring in query with score
above a given threshold (called hits)
Extend these hits as far as possible (segment
pairs)
Report the highest scoring segment pairs
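
The three steps can be sketched in a much simplified form. This is not the BLAST implementation: here a hit is an exact w-letter match (BLAST scores word pairs against a threshold), extension is ungapped, and extension stops once the score falls x_drop below the best seen so far. All names, the scoring values and the x_drop rule are assumptions made for illustration.

```python
def find_hits(query, text, w=3):
    """Return (i, j) where the w-letter words at query[i:] and text[j:] are equal."""
    words = {}
    for i in range(len(query) - w + 1):
        words.setdefault(query[i:i + w], []).append(i)
    return [(i, j)
            for j in range(len(text) - w + 1)
            for i in words.get(text[j:j + w], [])]


def extend(query, text, i, j, step, match, mismatch, x_drop):
    """Extend without gaps from (i, j) in direction step; return (gain, length)."""
    best = score = 0
    best_len = length = 0
    while 0 <= i < len(query) and 0 <= j < len(text):
        score += match if query[i] == text[j] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score >= x_drop:
            break                     # score dropped too far, stop extending
        i += step
        j += step
    return best, best_len


def segment_pair(query, text, i, j, w=3, match=1, mismatch=-1, x_drop=2):
    """Extend the hit at (i, j) both ways; return (score, query part, text part)."""
    right_gain, right_len = extend(query, text, i + w, j + w, 1,
                                   match, mismatch, x_drop)
    left_gain, left_len = extend(query, text, i - 1, j - 1, -1,
                                 match, mismatch, x_drop)
    score = w * match + left_gain + right_gain
    return (score,
            query[i - left_len:i + w + right_len],
            text[j - left_len:j + w + right_len])


query, text = "GGACGTACC", "TTACGTAAA"
pairs = [segment_pair(query, text, i, j) for i, j in find_hits(query, text)]
print(max(pairs))  # prints (5, 'ACGTA', 'ACGTA')
```
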

56
Other Bioinformatics Algorithms

Palindromes
Tandem Repeats
Longest Common Subsequence
Double Digest (NP complete)
Shortest Common Superstring (NP complete)

