Aligning Alignments Exactly

1
Aligning Alignments Exactly
  • By John Kececioglu and Dean Starrett, CS Dept., Univ. of Arizona
  • Appeared in the 8th ACM RECOMB, 2004
  • Presented by Jie Meng

2
  • Background
  • Definition
  • Hardness
  • An Exponential time algorithm

3
Alignments
  • Given two (DNA or protein) sequences, an
    alignment places them against each other so that
    similar parts line up as closely as possible,
    for example
  • A T C T C G C T- T G - A T G A T

There are four kinds of aligned columns: match, insertion,
deletion, mismatch.
4
Scoring Alignments
  • There are four types of aligned columns
  • Match: score(match) = 0.
  • Mismatch: score(mismatch) ≥ 0.
  • Insertion: score(insertion) ≥ 0.
  • Deletion: score(deletion) ≥ 0.
  • The score of an alignment is defined to be the
    sum of the scores of its aligned columns.
  • The goal is to minimize the score.
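
A minimal sketch of this column-by-column scoring. The function names and the default values mismatch=1 and indel=1 are assumptions chosen for illustration (the slide only fixes the match score at 0), and insertions and deletions share one score here for brevity.

```python
def column_score(a, b, match=0, mismatch=1, indel=1):
    """Score one aligned column of a pairwise alignment ('-' is a space)."""
    if a == "-" and b == "-":
        raise ValueError("a pairwise alignment has no all-space column")
    if a == "-" or b == "-":
        return indel  # insertion or deletion
    return match if a == b else mismatch


def alignment_score(row1, row2, **scores):
    """Sum the column scores of a two-row alignment (lower is better)."""
    if len(row1) != len(row2):
        raise ValueError("rows of an alignment must have equal length")
    return sum(column_score(a, b, **scores) for a, b in zip(row1, row2))


# Columns: A/A match, T/T match, -/G indel, C/G mismatch.
print(alignment_score("AT-C", "ATGG"))
```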

5
Gap-cost
  • We can refine score(indel) into score(open) and
    score(extension): a gap of size x then costs
    score(open) + x · score(extension) instead of
    x · score(indel).
  • AT----CGCTTCAT
    -TGCATAT-----
  • score(open) + 4 · score(extension)
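
The gap cost can be sketched as follows. The helper names are hypothetical, and the values open = 2 and extension = 1 in the usage lines are only illustrative.

```python
import re


def gap_cost(x, open_cost, ext_cost):
    """Cost of one gap of size x: one opening charge plus x extensions."""
    return open_cost + x * ext_cost


def row_gap_cost(row, open_cost, ext_cost):
    """Total gap cost of a row, charging each maximal run of '-' as one gap."""
    return sum(gap_cost(len(run), open_cost, ext_cost)
               for run in re.findall(r"-+", row))


print(gap_cost(4, open_cost=2, ext_cost=1))                       # the gap "----"
print(row_gap_cost("AT----CGCTTCAT", open_cost=2, ext_cost=1))    # one gap of size 4
print(row_gap_cost("-TGCATAT-----", open_cost=2, ext_cost=1))     # gaps of size 1 and 5
```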

6
Multiple Alignments
  • In general we also need to compare multiple
    sequences and find their similarities.
  • Multiple alignment generalizes the alignment idea
    to handle many sequences.
  • AT-C-TCGAT
    -TGCAT--AT
    ATCCA-CGCT

7
Sum-of-Pairs (SP) Score
  • Given a multiple alignment, the sum-of-pairs (SP)
    score is given by the sum of the induced pairwise
    alignment scores of each pair in the alignment.
  • AT-C-TCGAT
    -TGCAT--AT
    ATCCA-CGCT
  • induces the three pairwise alignments
  • AT-C-TCGAT   -TGCAT--AT   AT-C-TCGAT
    -TGCAT--AT   ATCCA-CGCT   ATCCA-CGCT
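
A small sketch of the SP score for the three-row example, assuming illustrative scores of match 0, mismatch 1 and indel 1 with no gap-open cost. The function names are hypothetical.

```python
from itertools import combinations


def pair_score(row1, row2, mismatch=1, indel=1):
    """Score of the pairwise alignment induced by two rows."""
    total = 0
    for a, b in zip(row1, row2):
        if a == "-" and b == "-":
            continue  # column with a space in both rows is dropped
        if a == "-" or b == "-":
            total += indel
        elif a != b:
            total += mismatch
    return total


def sp_score(rows, mismatch=1, indel=1):
    """Sum of induced pairwise scores over every pair of rows."""
    return sum(pair_score(r1, r2, mismatch, indel)
               for r1, r2 in combinations(rows, 2))


rows = ["AT-C-TCGAT", "-TGCAT--AT", "ATCCA-CGCT"]
print(sp_score(rows))  # 5 + 4 + 6 = 15 under the assumed scores
```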



8
BAD NEWS
  • Multiple alignment is NP-hard
  • One method is to approximate the optimal value
  • Progressive alignments
  • A problem arises naturally: Aligning Alignments

9
Aligning Alignments
  • Let S be a collection of strings s1, s2, s3, ..., sk
    over an alphabet Σ
  • An alignment of S is a matrix A with k rows such
    that
    i) each entry is either a letter or a space,
    ii) no column is all spaces,
    iii) reading across row i and removing the
    spaces gives string si
  • As before, we have three types of alignment
    score: match, mismatch and substitution

10
Aligning Alignments
  • Given two alignments, A with k sequences of length
    N and B with l sequences of length M, we want to
    align the columns of A and B

AT-C-TCGAT     CT-ATTGGAT
-TGCAT--AT     -TTAT-G--T
ATCCA-CGAT     CTTA-GGGAT
11
Aligning Alignments
  • In other words, we treat the columns of A and B
    as single letters, just as when aligning two
    sequences.
  • A = CT
        GT
        -T
  • B = AT
        -T
        GT

C-T
G-T
--T
-AT
--T
-GT
12
Aligning Alignments
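  • The score function is still sum-of-pairs, namely
    the sum of the scores of the induced pairwise
    alignments of each row Ai of A with each row Bj
    of B
  • The induced alignment of Ai and Bj may contain
    columns with a space in both sequences; such
    columns are simply removed
  • Ai a----aa-a
  • Bj aaa-a-a-a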
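
Removing the shared-space columns can be sketched like this, using the Ai and Bj rows from the slide. The function name is an assumption.

```python
def induced_pair(row_a, row_b):
    """Drop every column in which both rows hold a space ('-')."""
    kept = [(a, b) for a, b in zip(row_a, row_b)
            if not (a == "-" and b == "-")]
    return ("".join(a for a, _ in kept),
            "".join(b for _, b in kept))


# Columns 4 and 8 hold a space in both rows and are removed.
print(induced_pair("a----aa-a", "aaa-a-a-a"))
```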

13
Aligning Alignments
  • Without gap costs, aligning alignments is
    solvable in polynomial time. We can apply dynamic
    programming as we did for aligning sequences;
    the only difference is that here we align columns.
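
A minimal sketch of this column-level dynamic program, assuming no gap-open cost, illustrative scores (match 0, mismatch 1, indel 1), and a column cost that sums over every pair of one row of A with one row of B. All names are hypothetical, and this is not the paper's implementation.

```python
def pair_cost(a, b, mismatch=1, indel=1):
    """Cost of one letter/space of A against one letter/space of B."""
    if a == "-" and b == "-":
        return 0  # shared spaces vanish in the induced pairwise alignment
    if a == "-" or b == "-":
        return indel
    return 0 if a == b else mismatch


def column_cost(col_a, col_b):
    """Sum-of-pairs cost of placing a column of A against a column of B."""
    return sum(pair_cost(a, b) for a in col_a for b in col_b)


def align_alignments(A, B):
    """Minimum cross-pair SP cost of aligning the columns of A and B."""
    cols_a, cols_b = list(zip(*A)), list(zip(*B))
    gap_a, gap_b = "-" * len(A), "-" * len(B)  # all-space columns
    n, m = len(cols_a), len(cols_b)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + column_cost(cols_a[i - 1], gap_b)
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + column_cost(gap_a, cols_b[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i - 1][j - 1] + column_cost(cols_a[i - 1], cols_b[j - 1]),
                D[i - 1][j] + column_cost(cols_a[i - 1], gap_b),
                D[i][j - 1] + column_cost(gap_a, cols_b[j - 1]),
            )
    return D[n][m]


print(align_alignments(["AC", "A-"], ["AC"]))
```

The table has (N+1)(M+1) cells and each cell costs O(kl) to fill, which is where the polynomial bound comes from in this sketch.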

14
Aligning Alignments
  • With gap costs, this problem is NP-complete
  • We can use a reduction from the MAX-CUT problem
  • MAX-CUT: Given a graph G = (V, E) and an integer c,
    ask whether there is a partition of V into L and R
    (V = L ∪ R and L ∩ R = ∅) such that the size of
    the cut is no less than c
  • By cut, we mean the set of edges that have one
    end vertex in L and the other in R
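
To make the MAX-CUT decision problem concrete, here is a brute-force sketch that tries every partition. It takes exponential time and is meant only to illustrate the definition; the function names are assumptions.

```python
from itertools import product


def cut_size(edges, left):
    """Number of edges with exactly one end vertex in `left`."""
    return sum((u in left) != (v in left) for u, v in edges)


def has_cut_of_size(vertices, edges, c):
    """Decide MAX-CUT by trying all 2^n partitions of the vertices."""
    for mask in product([False, True], repeat=len(vertices)):
        left = {v for v, chosen in zip(vertices, mask) if chosen}
        if cut_size(edges, left) >= c:
            return True
    return False


triangle = [(1, 2), (2, 3), (1, 3)]
print(has_cut_of_size([1, 2, 3], triangle, 2))  # True: L = {1}, R = {2, 3}
print(has_cut_of_size([1, 2, 3], triangle, 3))  # False: a triangle is not bipartite
```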

15
NP-hardness
  • Given an instance of MAX-CUT, G = (V, E) with
    V = {v1, v2, ..., vn} and E = {e1, e2, ..., em},
    and an integer c,
  • we construct two multiple alignments A and B
    over the alphabet {0, 1}:
    both A and B have m edge rows and k dummy rows,
    and each edge row corresponds to an edge;
    A has 2n columns, and every two consecutive
    columns correspond to a vertex;
    B has 3n columns, and every three consecutive
    columns correspond to a vertex

16
NP-hardness
  • The dummy rows in A are (0-)^n; the dummy rows in B
    are (0--)^n
  • Edge rows in A: for the row of edge
    e = (vi, vj), the columns for vi and vj each
    hold the substring -1, with spaces elsewhere
  • Edge rows in B: for the row of edge
    e = (vi, vj) with i < j, the columns for vi hold
    the substring 010 and the columns for vj hold the
    substring -10

17
NP-hardness
  • For simplicity, let the score for a match be 0, the
    score for a mismatch be 1,
  • the gap open cost be 2, and the gap extension cost
    be 1
  • Ask whether there is an alignment whose
    score is less than d-c
  • So we have an instance of Aligning Alignments.

18
HOMEWORK4
  • Given a set of multiple alignments A1, A2, ...,
    An, where each Ai is a multiple alignment with ki
    sequences, is the problem of multiple alignment
    of A1, A2, ..., An without gap costs hard or
    easy? Use the method in this paper to
    align multiple alignments, i.e. align columns. If
    hard, prove it; otherwise, give an efficient
    algorithm and prove its complexity and correctness.

19
Exact Algorithm
  • The basic idea is still dynamic programming
  • We have to remember extra information in a set,
    the so-called shape S: for each row in a multiple
    alignment, we record the column of its
    right-most letter.
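
Read literally, the shape described here can be sketched as one number per row. This follows the slide's wording only and may differ from the paper's formal definition; the function name and the 1-based column numbering (with 0 for a row that has no letter yet) are assumptions.

```python
def shape(alignment):
    """For each row, the 1-based column of its right-most letter (0 if none)."""
    return tuple(len(row.rstrip("-")) for row in alignment)


# Rows ending in a letter, ending in a run of spaces, and holding no letter yet.
print(shape(["C-T", "G--", "---"]))
```

Knowing where each row's last letter sits is what lets the recurrence tell whether appending a new column extends an existing gap or opens a new one.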

20
Exact Algorithm
  • S(i, j)

21
Exact Algorithm
  • C(i,j,t)min
  • where g(Ai, Bj, s) is the total number of
    gaps initiated by appending columns Ai and Bj
    onto an alignment that ends in shape s

22
Exact Algorithm
  • The optimum value is
  • The problem here is that the number of shapes may be
    very large, so in the worst case the time and space
    complexity is exponential

23
Any Questions?
423B jmeng_at_cs.tamu.edu