Algorithms in Bioinformatics - PowerPoint PPT Presentation

Slides: 58
Provided by: lda7

Transcript and Presenter's Notes


1
Algorithms in Bioinformatics
  • Lawrence D'Antonio
  • Ramapo College of New Jersey

2
Topics
  • Algorithm basics
  • Types of algorithms in bioinformatics
  • Sequence alignment
  • Database Searches

3
Algorithm basics
  • What is an algorithm?
  • Algorithm complexity
  • P vs. NP
  • NP completeness

4
What is an algorithm?
  • An algorithm is a step-by-step procedure to solve
    a problem
  • The word algorithm comes from the 9th century
    Islamic mathematician al-Khwarizmi

5
Algorithm Complexity
  • If the algorithm works with n pieces of data and
    the number of steps is proportional to n, then we
    say that the running time is O(n).
  • If the number of steps is proportional to log
    n, then the running time is O(log n).

6
Example
  • Problem: find the largest element in a sequence
    of n elements.
  • Solution idea: iteratively compare each element
    with the largest one seen so far.

7
  • Algorithm:
  • Initialize the first element as the largest.
  • For each remaining element:
  • If the current element is larger than the largest,
    make that element the largest.

Running time: O(n)
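
The steps above can be sketched in Python as follows. The function name find_largest is only illustrative and does not come from the slides.

```python
def find_largest(seq):
    """Return the largest element of a non-empty sequence.

    Each element is examined exactly once, so the running time is O(n).
    """
    if not seq:
        raise ValueError("sequence must be non-empty")
    largest = seq[0]            # initialize the first element as the largest
    for x in seq[1:]:           # for each remaining element
        if x > largest:         # current element larger than the largest?
            largest = x         # then it becomes the largest
    return largest


print(find_largest([3, 9, 2, 7]))
```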
8
Polynomial Time
  • An algorithm is said to run in polynomial time if
    its running time can be written in the form O(n^k)
    for some power k.
  • The underlying problem is said to be of class P.

9
Polynomial Time Examples
  • Searching
  • Binary Search: O(log n)
  • Sorting
  • Quick Sort: O(n log n) on average, O(n^2) in the
    worst case
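
As an illustration of the O(log n) bound, here is a minimal binary search sketch. It assumes the input list is already sorted; the function name and the -1 "not found" convention are choices made for this example only.

```python
def binary_search(sorted_items, target):
    """Return an index of target in sorted_items, or -1 if it is absent.

    The search interval is halved on every iteration, so at most
    about log2(n) + 1 comparisons are needed.
    """
    lo, hi = 0, len(sorted_items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if sorted_items[mid] == target:
            return mid
        if sorted_items[mid] < target:
            lo = mid + 1        # target can only be in the upper half
        else:
            hi = mid - 1        # target can only be in the lower half
    return -1
```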

10
NP Algorithms
  • An algorithm is nondeterministic if it begins
    with guessing a solution to the problem and then
    verifies the guess.
  • A problem is of category NP if there is a
    nondeterministic algorithm for that problem which
    runs in polynomial time.

11
NP Complete
  • A problem is NP-complete if it has an NP
    algorithm, and solutions to this problem can be
    used to solve all other NP problems.
  • A problem is NP-hard if it is at least as hard as
    the NP-complete problems

12
NP Complete Examples
  • Traveling salesman
  • Knapsack problem
  • Partition problem
  • Graph coloring

13
P = NP ?
  • P ⊆ NP
  • If P ≠ NP, then NP-complete problems cannot be
    solved in polynomial time (the best known
    algorithms for them take exponential time).

14
Polynomial vs. Exponential
15
Algorithms in Bioinformatics
  • Algorithms to compare DNA, RNA, or protein
    sequences
  • Database searches to find homologous sequences
  • Sequence assembly
  • Construction of evolutionary trees
  • Structure prediction

16
Edit operations on sequences

Substitution    Insertion    Deletion
AATAAGC         AAT-AAGC     AATAAGC
ATTAAGC         AATTAAGC     AA-AAGC
17
What is sequence alignment?
  • Compare two sequences using matches,
    substitutions and indels.
  • G A A - - T C A T
  • G - T G G - C A -
  • 3 matches, 1 substitution, 5 indels

18
Complexity of DNA Problems
  • 3 billion base pairs in human genome
  • Many NP complete problems
  • About 10^600 possible alignments for two
    1000-character sequences

19
Types of sequence alignment
  • Determine the alignment of two sequences that
    maximizes similarity (global alignment)
  • Determine substrings of two sequences with
    maximum similarity (local alignment)
  • Determine the alignment for several sequences
    that maximizes the sum of pairs similarity
    (multiple alignment)

20
Significance of Alignment
  • Functional similarity
  • Structural similarity
  • Homology

21
Scoring System
  • Assign a score for each possible match,
    substitution and indel
  • Distance functions: find the alignment that
    minimizes the distance between sequences
  • Similarity functions: find the alignment that
    maximizes the similarity between sequences

22
Edit Distance
  • G A A - - T C A T
  • G - T G G - C A -
  • Similarity function: +1 for a match, -1 for a
    substitution, -2 for an indel
  • Score: 3(+1) + 1(-1) + 5(-2) = -8
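
The score on this slide can be checked with a short sketch that walks the two aligned rows column by column. The function name and the default parameter values are taken from this slide's scoring scheme; nothing else is assumed.

```python
def score_alignment(row_a, row_b, match=1, substitution=-1, indel=-2):
    """Score a given alignment; '-' marks a space (indel)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    score = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            score += indel
        elif x == y:
            score += match
        else:
            score += substitution
    return score


# The alignment from the slide: 3 matches, 1 substitution, 5 indels.
print(score_alignment("GAA--TCAT", "G-TGG-CA-"))  # -8
```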

23
Dynamic Programming
  • Used on optimization problems
  • Bottom-up approach
  • Recursively builds up solution from subproblem
    optimal solutions

24
Dynamic Programming Alignment Algorithm
(Needleman-Wunsch)
  • Given sequences a1, a2, ..., an and b1, b2, ..., bm
    to be aligned
  • Initialize alignment matrix (aligning with
    spaces)
  • Entry i,j gives optimal alignment score for
    sequences a1, a2, ..., ai and b1, b2, ..., bj
    (where 1 ≤ i ≤ n, 1 ≤ j ≤ m)

25
Computing Alignment Matrix
If a1, a2, ..., ai and b1, b2, ..., bj have been
aligned, there are three possible next moves:
  • Match a(i+1) with b(j+1)
  • Match a(i+1) with a space
  • Match b(j+1) with a space

Choose the move that maximizes the similarity of
the two sequences
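
A minimal sketch of the matrix computation and traceback described above. It assumes the simple scoring scheme from the Edit Distance slide (+1 match, -1 substitution, -2 indel); the function name and the tie-breaking order in the traceback are choices made for this example.

```python
def needleman_wunsch(a, b, match=1, substitution=-1, indel=-2):
    """Return (score, aligned_a, aligned_b) for an optimal global alignment."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * indel          # a[:i] aligned with spaces only
    for j in range(1, m + 1):
        score[0][j] = j * indel          # b[:j] aligned with spaces only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else substitution
            score[i][j] = max(
                score[i - 1][j - 1] + pair,   # a_i with b_j
                score[i - 1][j] + indel,      # a_i with a space
                score[i][j - 1] + indel,      # b_j with a space
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            pair = match if a[i - 1] == b[j - 1] else substitution
            if score[i][j] == score[i - 1][j - 1] + pair:
                top.append(a[i - 1])
                bottom.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + indel:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))
```

For example, needleman_wunsch("AC", "A") aligns AC with A- for a score of -1.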
26
Global Alignment Matrix
27
Optimal Global Alignment
28
Alignment Running Time
  • Assuming two sequences n characters each
  • Running time is O(n^2) (each entry of the matrix
    must be calculated)

29
Variations of Alignment Algorithm
  • Gap penalty
  • Local alignment
  • Multiple alignment

30
Gap Penalty
  • A gap is a number k of consecutive spaces
  • k consecutive spaces are more probable than k
    isolated spaces
  • Typical gap penalty function: a + bk (affine
    gap penalty)
  • Here the first space in a gap is penalized a + b,
    further spaces are penalized b each.

31
Gap Penalty Example
  • Use gap penalty 1 + k (with +1 for a match and -1
    for a substitution)
  • A - A - C - A
  • A C T A T C A
  • Score -6
  • A A C - - - A
  • A C T A T C A
  • Score -4
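
Both scores can be reproduced with the sketch below. It assumes +1 for a match and -1 for a substitution, which is what the two totals on the slide imply, and it charges a + bk for each gap of k spaces with a = b = 1. The function and parameter names are illustrative.

```python
def score_with_affine_gaps(row_a, row_b, match=1, substitution=-1,
                           gap_open=1, gap_extend=1):
    """Score an alignment where a gap of k spaces costs gap_open + gap_extend * k."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    score = 0
    gap_row = None                      # row holding the current gap, if any
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            current = "a" if x == "-" else "b"
            if current != gap_row:
                score -= gap_open       # charged once, when a gap starts
            score -= gap_extend         # charged for every space in the gap
            gap_row = current
        else:
            gap_row = None
            score += match if x == y else substitution
    return score


print(score_with_affine_gaps("A-A-C-A", "ACTATCA"))  # -6: three gaps of length 1
print(score_with_affine_gaps("AAC---A", "ACTATCA"))  # -4: one gap of length 3
```

The second alignment scores higher because its three spaces are grouped into one gap, so the opening cost is paid only once.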

32
Local Alignment
  • Find conserved regions in otherwise dissimilar
    sequences (e.g., viral and host DNA)
  • Smith-Waterman algorithm
  • Includes a fourth possibility at each step (don't
    align)
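
A minimal sketch of the Smith-Waterman idea, assuming a linear indel penalty and the same +1/-1/-2 scoring used earlier in these slides. The "don't align" option shows up as the 0 inside max(): a cell never goes negative, so a local alignment can start anywhere. Names and tie-breaking are choices made for this example.

```python
def smith_waterman(a, b, match=1, substitution=-1, indel=-2):
    """Return (score, aligned_a, aligned_b) for one best local alignment."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else substitution
            score[i][j] = max(
                0,                            # fourth option: don't align
                score[i - 1][j - 1] + pair,
                score[i - 1][j] + indel,
                score[i][j - 1] + indel,
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until a zero entry is reached.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        pair = match if a[i - 1] == b[j - 1] else substitution
        if score[i][j] == score[i - 1][j - 1] + pair:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + indel:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))
```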

33
Local Alignment Example
  • Align the following
  • G C T C T G C G A A T A
  • C G T T G A G A T A C T

34
Optimal Local Alignment
  • G C T C T G C G A A T A
  • C G T T G A G A T A C T
  • (G C T C) T G C G A A T A
  • (C G T) T G A G - A T A (C T)

35
Multiple Alignment
  • Find the alignment among a set of sequences that
    maximizes the sum of scores for all pairs of
    sequences
  • Dynamic programming run-time for k sequences of
    length n: O(k^2 2^k n^k)
  • Multiple alignment is NP-complete

36
Other Features
  • Usually used for protein alignment
  • Can be used for global or local alignment

37
Multiple Alignment Example
38
Multiple vs. Pairwise Alignment
  • Optimal multiple alignment does not imply optimal
    pairwise alignment
  • Multiple alignment      Induced pairwise alignment
  •   A T                     A -
  •   A -                     - T
  •   - T

39
Substitution Matrices
  • In homologous sequences certain amino acid
    substitutions are more likely to occur than
    others
  • Types of substitution matrices
  • PAM
  • BLOSUM

40
PAM Matrices
  • Defines units of evolutionary distance
  • 1 PAM unit represents an average of one mutation
    per 100 amino acids
  • Start with a set of highly similar sequences and
    compute
  • p_a: probability of occurrence of amino acid a
  • M_ab: probability of a mutating to b

41
PAM Matrix Formula
  • Entries in a k-PAM matrix
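
The formula itself is missing from this transcript. As a hedged sketch only, a commonly used log-odds form that is consistent with the definitions of p_a and M_ab on the previous slide is shown below; the factor 10 and the base-10 logarithm follow the usual convention and are assumptions here, not something the transcript states.

```latex
% Assumed form: M^k is the k-th power of the 1-PAM mutation matrix M,
% and p_b is the background probability of amino acid b.
\[
  \mathrm{PAM}_k(a,b) \;=\; 10 \,\log_{10} \frac{\left(M^{k}\right)_{ab}}{p_b}
\]
```

A positive entry means that a is replaced by b more often after k PAM units than chance alone would predict; a negative entry means less often.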

42
PAM250 Matrix
43
BLOSUM Matrices (Omit)
  • Uses log-odds ratio similar to PAM
  • Uses short highly conserved sequences
  • BLOSUM x matrices created after removing
    sequences that are more than x percent identical
  • Better at local alignments

44
BLOSUM Matrices
  • A motif is a conserved amino acid pattern found
    in a group of proteins with similar biological
    meaning (PROSITE)
  • A block is a conserved amino acid pattern in a
    group of proteins (no spaces allowed in the
    pattern) (BLOCKS)

45
Motif Example
  • Motif obtained from a group of 34 tubulin
    proteins
  • MFYW . . FVLIH . FYW . . EGM

46
Defining BLOSUM (I)
  • BLOSUMn uses blocks that are n% identical
    (BLOSUM62 is most common)
  • Consider all pairs of amino acids appearing in
    the same column in the blocks

47
Defining BLOSUM (II)
  • Define n(i,j) to be the frequency that amino
    acids i,j appear in a column pair
  • Define e(i,j) to be the frequency that amino
    acids i,j appear in any pair
  • Define BLOSUM entry
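
The formula for the entry is missing from this transcript. The sketch below assumes the usual log-odds form, scaled to half-bit units (2 times log base 2) and rounded to an integer; that scaling is an assumption, as are the function and parameter names.

```python
import math


def log_odds_score(observed, expected, scale=2.0):
    """Rounded log-odds score for one amino acid pair.

    observed: n(i,j), the frequency of the pair within block columns
    expected: e(i,j), the frequency expected for the pair by chance
    scale:    assumed scaling factor (2.0 gives half-bit units)
    """
    if observed <= 0 or expected <= 0:
        raise ValueError("frequencies must be positive")
    return round(scale * math.log2(observed / expected))


print(log_odds_score(0.04, 0.01))   # pair seen 4x more often than expected
print(log_odds_score(0.005, 0.01))  # pair seen half as often as expected
```

Pairs observed more often than expected get positive scores, and pairs observed less often get negative scores.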

48
PAM vs. BLOSUM
  • PAM derived from highly similar sequences
    (evolutionary model)
  • BLOSUM derived from protein families sharing a
    common ancestor (conserved domain model)

49
Database Searches
  • FASTA
  • BLAST

50
FASTA
  • Looks for sequences in a database similar to a
    query sequence
  • Heuristic, exclusion method
  • Compares query sequence to each database sequence
    (called the text)

51
FASTA Algorithm (I)
  • Look for small substrings in query and text that
    exactly match (hot spots)
  • Find ten best diagonal runs of hot spots

52
Hot Spot Example
        E K L A S R K L
      H
      A       x
      S         x
      H
      K   x         x
      L     x         x
    (x marks a cell where the two characters match)
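
The hot spot search for the two strings on this slide can be sketched as follows. It assumes hot spots are exact matches of length k = 2 and treats EKLASRKL as the query and HASHKL as the text, which the transcript does not state. Hot spots are grouped by diagonal (query position minus text position), since FASTA looks for runs along a diagonal.

```python
from collections import defaultdict


def hot_spots(query, text, k=2):
    """Map each diagonal (i - j) to the exact k-letter matches lying on it.

    Each match is reported as (i, j): its 0-based start in query and text.
    """
    positions = defaultdict(list)          # k-letter word -> starts in query
    for i in range(len(query) - k + 1):
        positions[query[i:i + k]].append(i)
    diagonals = defaultdict(list)
    for j in range(len(text) - k + 1):
        for i in positions.get(text[j:j + k], []):
            diagonals[i - j].append((i, j))
    return dict(diagonals)


print(hot_spots("EKLASRKL", "HASHKL"))
# {2: [(3, 1), (6, 4)], -3: [(1, 4)]}
```

Diagonal 2 carries two hot spots (AS and the second KL), which makes it the best diagonal run in this small example.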

53
FASTA Algorithm (II)
  • Find best local alignment for each run
  • Combine these into larger alignment
  • Do multiple alignment on query and texts having
    highest score in last step

54
BLAST
  • Basic Local Alignment Search Tool
  • Heuristic, exclusion method
  • Computes statistical significance of alignment
    scores

55
BLAST Algorithm
  • Find all w-length substrings in text that align
    to some w-length substring in query with score
    above a given threshold (called hits)
  • Extend these hits as far as possible (segment
    pairs)
  • Report the highest scoring segment pairs
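
A heavily simplified sketch of the three steps above. It is not the real BLAST: hits are restricted to exact w-letter matches instead of words scoring above a threshold under a substitution matrix, extension is ungapped with a simple drop-off cutoff, and no statistical significance is computed. All names, the +1/-1 scoring, and the x_drop value are assumptions made for illustration.

```python
def extend_hit(query, text, qi, ti, w, match=1, mismatch=-1, x_drop=2):
    """Extend a w-letter exact hit in both directions without gaps.

    Returns (score, query_start, text_start, length) of the best extension.
    """
    score = best = w * match
    right = k = 0
    while qi + w + k < len(query) and ti + w + k < len(text):
        score += match if query[qi + w + k] == text[ti + w + k] else mismatch
        k += 1
        if score > best:
            best, right = score, k
        elif best - score > x_drop:     # stop once the score drops too far
            break
    score = best
    left = k = 0
    while qi - k - 1 >= 0 and ti - k - 1 >= 0:
        score += match if query[qi - k - 1] == text[ti - k - 1] else mismatch
        k += 1
        if score > best:
            best, left = score, k
        elif best - score > x_drop:
            break
    return best, qi - left, ti - left, w + left + right


def blast_sketch(query, text, w=3, top=3):
    """Find exact w-letter hits, extend them, report the best segment pairs."""
    words = {}
    for i in range(len(query) - w + 1):
        words.setdefault(query[i:i + w], []).append(i)
    segment_pairs = set()               # a set removes duplicate extensions
    for j in range(len(text) - w + 1):
        for i in words.get(text[j:j + w], []):
            segment_pairs.add(extend_hit(query, text, i, j, w))
    return sorted(segment_pairs, reverse=True)[:top]


print(blast_sketch("TTACGTGG", "CCACGTAA"))  # [(4, 2, 2, 4)]
```

In the example, the hits ACG and CGT both extend to the same segment pair ACGT, starting at position 2 in each string, with score 4.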

56
Other Bioinformatics Algorithms
  • Palindromes
  • Tandem Repeats
  • Longest Common Subsequence
  • Double Digest (NP complete)
  • Shortest Common Superstring (NP complete)

57
References
  • Clote and Backofen, Computational Molecular
    Biology, Wiley
  • Gusfield, Algorithms on Strings, Trees, and
    Sequences, Cambridge University Press
  • Mount, Bioinformatics, Cold Spring Harbor Press
  • Setubal and Meidanis, Introduction to
    Computational Molecular Biology, PWS
  • Waterman, Introduction to Computational Biology,
    CRC Press