An Introduction to Sequence Alignment - PowerPoint PPT Presentation

1 / 31
About This Presentation
Title:

An Introduction to Sequence Alignment

Description:

DNA/RNA sequences: strings composed of an alphabet of 4 letters ... Find clues to gene function (ortholog?) Find other ... Bioinformatics 14, 25-39. BLAST ... – PowerPoint PPT presentation

Number of Views:85
Avg rating:3.0/5.0
Slides: 32
Provided by: sin63
Category:

less

Transcript and Presenter's Notes

Title: An Introduction to Sequence Alignment


1
An Introduction to Sequence Alignment
  • July 16, 2003
  • Sining Chen
  • Expressionist Meeting

2
What?
  • DNA/RNA sequences strings composed of an
    alphabet of 4 letters
  • Protein sequences alphabet of 20 letters

3
Why?
  • Identify a gene
  • Find clues to gene function (ortholog?)
  • Find other organisms with this gene (homology)
  • Gather info for an evolutionary model

4
An Example
  • GCGCATGGATTGAGCGA
  • TGCGCCATTGATGACCA
  • A possible alignment
  • -GCGC-ATGGATTGAGCGA
  • TGCGCCATTGAT-GACC-A

5

Alignments
  • -GCGC-ATGGATTGAGCGA
  • TGCGCCATTGAT-GACC-A
  • Three elements
  • Perfect matches
  • Mismatches
  • Insertions deletions (indel)

6
Significant Similarity in an Alignment
  • Ho the current alignment is a result of random
    line-up (the 2 sequences are unrelated)
  • Ha the sequences diverge from a common ancestor
    (related)
  • Test statistic Ymax length of the longest
    running perfect match subsequence

7
Exact Matching Subsequences
  • In DNA alignment, the matching probability
  • Under Ho lengths of exact match subseq Y should
    follow a geometric distribution

8
Well-matching Subsequences
  • Evolution may cause small differences to even
    sequences with a reasonably recent common
    ancestor.
  • We consider Ymax to be the longest subseq with up
    to k mismatches.
  • Y follow hyper-geometric distribution
  • P-value exact/simulated/approximate
    (independence among Y does not hold any more)

9
Choosing Alignments
  • There are many possible alignments
  • For example, compare
  • -GCGC-ATGGATTGAGCGA
  • TGCGCCATTGAT-GACC-A
  • to
  • ------GCGCATGGATTGAGCGA
  • TGCGCC----ATTGATGACCA--
  • Which one is better?

10
Scoring Rule
  • Example Score
  • ( matches) ( mismatches) ( indels) x 2

11
Example
  • Example
  • -GCGC-ATGGATTGAGCGA
  • TGCGCCATTGAT-GACC-A
  • Score (1x13) (-1x2) (-2x4) 3
  • ------GCGCATGGATTGAGCGA
  • TGCGCC----ATTGATGACCA--
  • Score (1x5) (-1x6) (-2x11) -23

12
Edit Distance
  • The edit distance between two sequences is the
    cost of the cheapest set of edit operations
    needed to transform one sequence into the other
  • Computing edit distance between two sequences
    almost equivalent to finding the alignment that
    minimizes the distance

13
Computing Edit Distance
  • How can we compute the edit distance??
  • If s n and t m, there are more than
    alignments
  • 2 sequences each of length 1000 gt 10600
  • The additive form of the score allows to perform
    dynamic programming to compute edit distance
    efficiently

14
Recursive Argument
  • Suppose we have two sequencess1..n1 and
    t1..m1
  • The best alignment must be in one of three cases
  • 1. Last position is (sn1,tm 1 )
  • 2. Last position is (sn 1,-)
  • 3. Last position is (-, tm 1 )

15
Recursive Argument
  • Suppose we have two sequencess1..n1 and
    t1..m1
  • The best alignment must be in one of three cases
  • 1. Last position is (sn1,tm 1 )
  • 2. Last position is (sn 1,-)
  • 3. Last position is (-, tm 1 )

16
Recursive Argument
  • Suppose we have two sequencess1..n1 and
    t1..m1
  • The best alignment must be in one of three cases
  • 1. Last position is (sn1,tm 1 )
  • 2. Last position is (sn 1,-)
  • 3. Last position is (-, tm 1 )

17
Recursive Argument
  • Define the notation
  • Using the recursive argument, we get the
    following recurrence for V

18
Recursive Argument
  • Of course, we also need to handle the base cases
    in the recursion

19
Dynamic Programming Algorithm
We fill the matrix using the recurrence rule
20
Dynamic Programming Algorithm
21
Reconstructing the Best Alignment
  • We now trace back the path the corresponds to the
    best alignment

AAAC AG-C
22
Reconstructing the Best Alignment
  • Sometimes, more than one alignment has the best
    score

AAAC A-GC
23
Complexity
  • Space O(mn)
  • Time O(mn)
  • Filling the matrix O(mn)
  • Backtrace O(mn)

24
Other scoring schemes
  • Needleman and Wunsch 1 for identical amino acid,
    0 otherwise
  • Dayhoff PAM scoring matrix variations include
    BLOSUM matrices(Henikoff and Henikoff 1992, Proc.
    Nat. Acad. Sci. 89, 10915-10919).
  • Different Gap Cost Function

25
Scoring matrix for protein sequences
26
  • Discussed above are global alignment algorithms
  • For local alignment algorithm use score 0 as
    threshold as in
  • Smith, Waterman (1981), J. Mol. Biol., 147,
    195-197

27
  • There exist other algorithms faster than dynamic
    programming under specific assumptions, e.g.
    BLAST, MegaBLAST
  • Zhang Z, Schwartz S, Wagner L, Miller W. 2000. A
    greedy algorithm for aligning DNA sequences. J
    Comp Biol. 7203-214.
  • Assume only sequencing error to search for
    very similar sequences. MegaBLAST behind NCBI
    HomoloGene.

28
Multiple Global Alignment
  • CLUSTAL W (Thompson, Higgins, Gibson 1994,
    Nucleic Acids Res, 22, 4673-4680)
  • Ungapped local alignments Lawrence et.al. 1993,
    Science 262, 208-214
  • Gapped local alignments Zhu et.al., 1998,
    Bayesian adaptive sequence alignment algorithms.
    Bioinformatics 14, 25-39

29
BLAST
  • Search a sequence database for fragments similar
    to the query sequence
  • Compile a list of high-scoring short words shared
    by the query sequence and the database
  • Scan the database for hits
  • Expand the hits to MSP (maximum segment pair
    a pair of equal-length/no-gap segments with the
    highest alignment score)

30
BLAST
  • Altschul, et. al. (1990) "Basic local alignment
    search tool." J. Mol. Biol. 215403-410.
  • Variations of BLAST designed for specific
    purposes
  • http//www.ncbi.nlm.nih.gov/BLAST/

31
Advertisement
Write a Comment
User Comments (0)
About PowerShow.com