Title: Computational Molecular Biology
1. Computational Molecular Biology
- Multiple Sequence Alignment
2. Sequence Alignment
- Problem Definition
- Given 2 DNA or protein sequences
- Find the best match between them
- What is an Alignment?
- Given 2 strings S1 and S2
- Goal: make the lengths of S1 and S2 equal by inserting spaces (denoted here by '-') into these strings
A - T C - A
- C T C A A
3. Matches, Mismatches and Indels
- Match: two aligned, identical characters in an alignment
- Mismatch: two aligned, unequal characters
- Indel: a character aligned with a space
A A C T A C T - C C T A A C A C T - -
- - C T C C T A C C T - - T A C T T T
10 matches, 2 mismatches, 7 indels
4. Basic Algorithmic Problem
- Find the alignment of the two strings that
- maximizes m, where m is a score computed from the numbers of matches, mismatches, and indels (e.g., m = #matches - #mismatches - #indels)
- Or minimizes m', where m' is the SP-score of the alignment
- m defines the similarity of the two strings; the problem is also called Optimal Global Alignment (see the dynamic-programming sketch below)
- Biologically, a mismatch represents a mutation, whereas an indel represents a historical insertion or deletion of a single character
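To make the global-alignment recurrence concrete, here is a minimal sketch of a Needleman-Wunsch-style dynamic program in Python. It minimizes a simple unit-cost score (0 per match, 1 per mismatch or indel); the function name and the scoring constants are illustrative assumptions, not taken from the slides.

def global_alignment(s, t, mismatch=1, indel=1):
    """Return (optimal score, aligned s, aligned t) under a simple
    0/1 scheme: 0 per match, `mismatch` per mismatch, `indel` per space."""
    n, m = len(s), len(t)
    # D[i][j] = optimal score of aligning s[:i] with t[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * indel
    for j in range(1, m + 1):
        D[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = D[i - 1][j - 1] + (0 if s[i - 1] == t[j - 1] else mismatch)
            D[i][j] = min(sub, D[i - 1][j] + indel, D[i][j - 1] + indel)
    # Traceback to recover one optimal alignment.
    a, b, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + (0 if s[i - 1] == t[j - 1] else mismatch):
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + indel:
            a.append(s[i - 1]); b.append('-'); i -= 1
        else:
            a.append('-'); b.append(t[j - 1]); j -= 1
    return D[n][m], ''.join(reversed(a)), ''.join(reversed(b))

print(global_alignment("ATCA", "CTCAA"))   # optimal score 2 for the slide-2 example strings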
5. Multiple Sequence Alignment
- Problem Definition
- Similar to the sequence alignment problem, but the input has more than 2 strings
- Challenges
- NP-hard, MAX-SNP-hard
- Approximation guarantee of factor 2 - 2/k, where k is the number of input sequences
- More work to reduce the time and space complexity
6. Sum of Pairs Score (SP-Score)
- Given a finite alphabet Σ, extended with a symbol '-' that denotes a space
- Consider k sequences over Σ that we want to align. After an alignment, each sequence has length l
- A score d is assigned to each pair of letters
7. SP-Score
- The SP-Score of an alignment A is defined as follows
- Consider a matrix of l columns and k rows, where the rows represent the sequences and the columns represent the aligned letters
- The SP-Score is the sum of the scores of all columns
- The score of each column is the sum of the scores of all distinct unordered pairs of letters in the column
- Equivalently, we can view it as the sum of the pairwise sequence alignment values induced by A (see the sketch below)
- Find an (optimal) alignment that minimizes the SP-Score value
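A minimal sketch of the SP-score computation in Python, summing the pairwise scores column by column exactly as defined above. The pair-score function d used here (0 for identical symbols or two spaces, 1 otherwise) is an illustrative assumption, not the matrix used later in the slides.

from itertools import combinations

def d(x, y):
    """Illustrative pair score: 0 for identical symbols (or two spaces), 1 otherwise."""
    return 0 if x == y else 1

def sp_score(alignment):
    """alignment: list of equal-length gapped strings (the rows of the alignment matrix)."""
    assert len(set(map(len, alignment))) == 1, "rows must have equal length"
    total = 0
    for col in zip(*alignment):                 # iterate over columns
        for x, y in combinations(col, 2):       # all distinct unordered pairs in the column
            total += d(x, y)
    return total

rows = ["A-TC-A",
        "-CTCAA",
        "ACTCAA"]
print(sp_score(rows))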
8.
- Proving that MSA with an SP-score that is a metric is NP-hard
9. Some Notations
10. Some Basic Properties
- Lemma 1: Let s1, s2 be two sequences over Σ such that l1 = |s1|, l2 = |s2|, l2 ≥ l1, and there are m symbols of s1 that do not appear in s2. Then every alignment of the set {s1, s2} has at least m + l2 - l1 mismatches
12. The construction
- Reduce the vertex cover (or node cover) problem to MSA
- Vertex cover
- Instance: a graph G = (V, E) and an integer k ≤ |V|
- Question: is there a vertex cover V1 of G of size k or less?
- MSA
- Instance: a set S = {s1, ..., sn} of finite sequences over a fixed alphabet Σ, an SP-score, and an integer C
- Question: is there a multiple alignment of the sequences in S that is of value C or less?
13. SP-Score (alphabet of size 6)
14. The Reduction
So we have the sequence set S, where T is a set of C2 sequences t and X contains C1 sequences x(k); the constants C1 and C2 will be determined later
15. An Example
16. Intuition
- By the above construction, an optimal alignment A of S is obtained when A satisfies certain properties (such an A is called a standard alignment)
- The value of a standard alignment is bounded by the given threshold C only when G has a vertex cover of size k
- How to obtain this
- Force the d's of the test sequences to be aligned with the b's of the edge sequences
- Only one b of each edge sequence can be aligned to a d
- The number of such alignments determines the value of the alignment
17. Standard Alignment
22.
- Let U_S and U_{S,X} denote the upper bounds of D(A_S) and D(A_{S,X}), respectively
- By Corollary 8 and Lemma 9, the standard alignment has value not greater than D_SD + U_S + U_{S,X}
- where D_SD = D(A_X) + D(A_T) + D(A_{X,T}) + D(A_{S,T}) over a standard alignment A
- Now, letting C1 > U_S and C2 > U_S + U_{S,X}, we can prove that an optimal alignment must be a standard one
25.
- Show the NP-hardness of MSA for any scoring matrix in a broad class M
- Show that there is a scoring matrix M0 such that MSA for M0 is MAX-SNP-hard
26. Interesting Observation
- In brute-force computations, the optimal MSA contains very few gaps
- This suggests the study of gap limitations
- Put an upper bound on the number of gaps one can insert during the alignment
- Special cases
- Gap-0: no gaps are allowed, but we can shift the strings against each other (i.e., insert spaces only at the beginning or at the end of a string)
- Gap-0-1: a gap-0 alignment such that the gap at the beginning or at the end of each string is exactly one space
27. Problem Definition
- Given a finite alphabet
- Scoring matrix
- For i, j > 0, s_{i,j} represents the penalty for aligning a_i with a_j
- For i > 0, s_{0,i} and s_{i,0} are called indel penalties
- Gap opening penalties (in addition to the indel penalties) apply when a_i is aligned with the first or last space in a run of spaces
28. Generic Scoring Matrix
where the entries v_A, v_T, u, x, y, z are fixed nonnegative numbers and u > max{0, v_A, v_T} holds
- Let M2 be the class of all scoring matrices that contain a generic submatrix M
- Let M1 be the class of all scoring matrices that contain a submatrix isomorphic to a generic matrix M with z > v_T
- Let M be the class of all scoring matrices that contain a submatrix isomorphic to a generic matrix M with y > u and z > v_T
- Theorem 1
- (a) The gap-0-1 multiple alignment problem is NP-hard for every scoring matrix M in M2
- (b) The gap-0 multiple alignment problem is NP-hard for every M in M1
- (c) The multiple alignment problem is NP-hard for every M in M
- Note that M is quite broad and covers most scoring schemes used in biological applications
29. Reduction
- Reduce from MAX-CUT-B
- Given G = (V, E), where k = |V| and each vertex has degree at most B
- Find a partition of V into two disjoint sets that maximizes the number of edges crossing the two sets
- Given a graph G = (V, E) with k vertices v_0, ..., v_{k-1} and l edges e_0, ..., e_{l-1}, we construct a set of k^2 sequences t_0, ..., t_{k^2-1} as follows
30. Reduction
- For each vertex v_i, construct a sequence t_i such that
- for each edge e_m = (v_h, v_i) incident at v_i, with h < i and n < k^5, set
- where t_{i,j} represents the character at the jth position in t_i
- For all other positions j, let t_{i,j} = T
- For i ≥ k, set t_i = TT···T with length k^{12}·l
31. An Example
32. Proof of Theorem 1(a)
- We will show that a gap-0-1 alignment partitions V into two disjoint subsets V0 and V1
- V0: all vertices v_i such that t_i remains in place (a space is appended at the end)
- V1: all vertices v_i such that t_i is shifted to the right
- Thus, based on the alignment, we can find the cut; and vice versa, based on the cut, we can find the alignment
- The remaining part is to prove that, if k is sufficiently large, the optimal gap-0-1 alignment yields a partition of V with a maximum edge cut
33. Proof of Theorem 1(a)
- Let c denote the cut determined by the alignment A
- Consider all the sequences t_i after the alignment A
- The total indel penalty is of order O(k^4) (it appears only in the first and last columns of the alignment matrix)
- The total number of mismatches before the alignment is 3k^5·l·(k^2 - 1)
- To maximally reduce this number
- one A-A match removes two A-T mismatches
- For each edge (v_h, v_i), if the two vertices are in different subsets of the partition, then a total of k^5 A-A matches between sequences t_h and t_i are created
- No other A-T mismatches can be eliminated
- Thus the SP-score is
- k^{12}·l·v_T·k^2(k^2 - 1)/2 + 3k^5·l·(u - v_T)(k^2 - 1) - c·k^5·(2u - v_A - v_T) + O(k^4)
34. Theorem 2
- Consider the following scoring matrix M0 for the alphabet Σ0 = {A, T, C}
- The gap-0-1 MSA problem is MAX-SNP-hard
- The gap-0 MSA problem is MAX-SNP-hard
- The MSA problem is MAX-SNP-hard
35. MAX-SNP-hard Proof
- To prove that a problem A' is MAX-SNP-hard, we L-reduce a problem A that is already MAX-SNP-hard to A'
- L-reduction
- There are two polynomial-time algorithms f, g and constants a, b > 0 such that for each instance I of A
- f produces an instance I' = f(I) of A' such that OPT(I') ≤ a·OPT(I)
- Given any solution of I' with cost c', g produces a solution of I with cost c such that |c - OPT(I)| ≤ b·|c' - OPT(I')|
36. Proof of Theorem 2
- To prove that MSA (with the scoring matrix M0 mentioned before) is MAX-SNP-hard
- L-reduce MAX-CUT-B to another optimization problem, called A, which is then L-reduced to a scaled version of MSA
- Problem A
- Given a graph G = (V, E) with bounded degree B. For every partition P = (V0, V1), let c_P be the size of the cut determined by P
- Find the partition P of V that minimizes d_P = 3|E| - 2c_P
37. Show A is MAX-SNP-hard
- Let f and g be identity functions
- Set a = 3B and b = 2; we can easily prove the two properties of the L-reduction, since
- c_P ≥ |E|/B for the optimal partition, and d_P = 3|E| - 2c_P ≤ 3|E|
- Any increase of c_P by 1 decreases d_P by 2 (a worked check follows below)
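As a worked check of the two L-reduction properties, using only the quantities defined on this slide (the bound OPT_cut(G) ≥ |E|/B is the one stated above):

\[
\mathrm{OPT}_A(G) \;=\; 3|E| - 2\,\mathrm{OPT}_{cut}(G) \;\le\; 3|E| \;\le\; 3B\cdot \mathrm{OPT}_{cut}(G),
\]
so property (1) holds with a = 3B. For property (2), every partition P satisfies c_P = (3|E| - d_P)/2, hence
\[
\big|c_P - \mathrm{OPT}_{cut}(G)\big| \;=\; \tfrac{1}{2}\,\big|d_P - \mathrm{OPT}_A(G)\big| \;\le\; 2\,\big|d_P - \mathrm{OPT}_A(G)\big|,
\]
so property (2) holds with b = 2 (indeed b = 1/2 would already suffice).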
38. Show A L-reduces to scaled MSA
Similar to the above construction, we have
39.
- Similar to the proof of Theorem 1, we have the optimal SP-score, where
- If the SP-score is scaled by a factor of k^{-5}/2 for an MSA of k sequences, then A L-reduces to MSA
40. GENETIC ALGORITHMS
41. How do GAs work?
- Create a population of random solutions
- Use natural selection, crossover, and mutation to improve the solutions
- Stop when certain criteria are satisfied, such as
- No improvement in the fitness function
- The improvement is less than some threshold
- The number of iterations exceeds some threshold
42. Terms and Definitions
- Chromosomes
- Potential solutions
- Population
- Collection of chromosomes
- Generations
- Successive populations
43. Terms and Definitions
- Crossover
- Exchange of genes between two chromosomes
- Mutation
- Random change of one or more genes in a chromosome
- Elitism
- Copy the best solutions without applying crossover or mutation
44. Terms and Definitions
- Offspring
- New chromosome created by crossover between two parent chromosomes
- Fitness function
- Measures how good a chromosome is
- Encoding scheme
- How do we represent every chromosome/gene?
- Binary, combination, syntax trees
45. Why are GAs attractive?
- No need for a problem-specific algorithm to solve the given problem. Only a fitness function is required to evaluate the quality of the solutions
- Implicitly a parallel technique, and can be implemented efficiently on powerful parallel computers for demanding large-scale problems
46. Basic Outline of a GA
- The initial population is composed of random chromosomes, called the first generation
- Evaluate the fitness of each chromosome in the population
- Create a new population
- Select two parent chromosomes from the population according to their fitness
- Apply crossover (with some probability) to form new offspring
- Apply mutation (with some probability) to the new offspring
- Place the new offspring in the new population
- The process is repeated until a satisfactory solution evolves (a minimal sketch follows below)
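A minimal sketch of this outline in Python, maximizing a toy fitness function over fixed-length binary chromosomes. The function names, rates, and the fitness function itself are illustrative assumptions, not part of the slides.

import random

def fitness(chrom):
    """Toy fitness: number of 1-bits (replace with a problem-specific function)."""
    return sum(chrom)

def select(pop):
    """Fitness-proportional (roulette-wheel) selection of one parent."""
    weights = [fitness(c) + 1 for c in pop]      # +1 avoids a zero total weight
    return random.choices(pop, weights=weights, k=1)[0]

def crossover(p1, p2, rate=0.7):
    if random.random() < rate:
        cut = random.randrange(1, len(p1))       # single-point crossover
        return p1[:cut] + p2[cut:]
    return p1[:]

def mutate(chrom, rate=0.01):
    return [1 - g if random.random() < rate else g for g in chrom]

def run_ga(length=20, pop_size=30, generations=50):
    pop = [[random.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        new_pop = [max(pop, key=fitness)]        # elitism: keep the best chromosome
        while len(new_pop) < pop_size:
            child = crossover(select(pop), select(pop))
            new_pop.append(mutate(child))
        pop = new_pop
    return max(pop, key=fitness)

best = run_ga()
print(best, fitness(best))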
48. Operations
- Mutation Operation
- Modifies a single parent
- Helps avoid local minima
49. Let's see some running examples
- Minimum of a function
- http://cs.felk.cvut.cz/xobitko/ga/example_f.html
- Elitism
- http://cs.felk.cvut.cz/xobitko/ga/params.html
- The travelling salesman problem
- http://cs.felk.cvut.cz/xobitko/ga/tspexample.html
50. Multiple Sequence Alignment
- The fitness function is used to compare the different alignments
- Based on the number of matching symbols and the number and size of gaps
- Also called the cost function
- Different weights for different types of matches
- Gap costs
- It can be simple and count the total matching symbols
- It can be complicated and consider the type of matching symbols, location in the sequence, neighboring symbols, etc. (a simple sketch follows below)
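A minimal sketch of a simple alignment fitness function of the kind described above, in Python. The particular weights (match reward, mismatch and gap penalties) are illustrative assumptions.

from itertools import combinations

def alignment_fitness(rows, match=1.0, mismatch=-0.5, gap=-1.0):
    """Score a candidate alignment (list of equal-length gapped strings).
    Higher is better: reward matching symbols, penalize mismatches and gaps."""
    score = 0.0
    for col in zip(*rows):
        for x, y in combinations(col, 2):
            if x == '-' or y == '-':
                if x != y:              # a character opposite a space
                    score += gap
            elif x == y:
                score += match
            else:
                score += mismatch
    return score

print(alignment_fitness(["A-TC-A", "-CTCAA", "ACTCAA"]))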
52. Scoring method
- Score zero for a match or for two opposing spaces
- Score one for a mismatch or for a character opposite a space
53. Assumptions
- Assume that two opposing spaces have a zero value
- Assume the other values satisfy the triangle inequality:
- s(x,z) ≤ s(x,y) + s(y,z)
- s(x,z) = cost of transforming character x into character z
54. Objective Functions
- Two objective functions
- SP
- The sum of the values of the pairwise alignments induced by an alignment A
- TA
- Using the topology of a tree, map the strings to the nodes of the tree
- The sum of the selected pairwise alignments is called the tree alignment value
55. Center Star Method
- For a set of k strings X
- Choose a center string Xc of X which minimizes Σ_{j≠c} D(Xc, Xj)
- Let M = min Σ_{j≠c} D(Xc, Xj)
- The center star is a star tree of k nodes, with the center node labeled Xc and each of the k-1 remaining nodes labeled by a distinct string in X \ {Xc}
- If Xi and Xj are strings labeling adjacent nodes of a tree T, then the alignment of Xi and Xj induced by A(T) has value D(Xi, Xj) (a center-selection sketch follows below)
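A minimal sketch of the center-selection step in Python: compute the pairwise optimal alignment values D(Xi, Xj) with a unit-cost DP and pick the string minimizing the row sum. The scoring (0 match, 1 mismatch or indel) is an illustrative assumption; only the center choice is shown here, not the merging of the pairwise alignments.

def pair_value(s, t):
    """Optimal pairwise alignment value D(s, t) under unit mismatch/indel costs."""
    prev = list(range(len(t) + 1))
    for i in range(1, len(s) + 1):
        cur = [i] + [0] * len(t)
        for j in range(1, len(t) + 1):
            sub = prev[j - 1] + (0 if s[i - 1] == t[j - 1] else 1)
            cur[j] = min(sub, prev[j] + 1, cur[j - 1] + 1)
        prev = cur
    return prev[-1]

def choose_center(strings):
    """Return (index of center Xc, M = sum of D(Xc, Xj) over j != c)."""
    sums = []
    for i, s in enumerate(strings):
        sums.append(sum(pair_value(s, t) for j, t in enumerate(strings) if j != i))
    c = min(range(len(strings)), key=lambda i: sums[i])
    return c, sums[c]

X = ["ACTCAA", "ATCA", "ACTCA", "CTCAA"]
print(choose_center(X))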
56. Center Star Method, Alg Ac
- Do an optimal pairwise alignment for each pair (Xc, Xj), for all j ≠ c
- s0 = max number of spaces placed before the first character of Xc
- sf = max number of spaces placed after the last character of Xc
- si = max number of spaces placed between Xc(i) and Xc(i+1)
57. Center Star Method, Alg Ac
- For Xc, insert s0, si, and sf spaces at the beginning, between positions i and i+1, and at the end of Xc, respectively. Call the result Xc'
- Then, for each Xj, do the optimal alignment against Xc' without modifying Xc'
58. Analysis
- d(Xi, Xj) ≥ D(Xi, Xj)
- V(Ac) = Σ_{i<j} d(Xi, Xj)
- V(Ac) is at most twice the value of the optimal multiple alignment of X
59. Analysis
- Lemma 3.1: For any two strings Xi, Xj, we have
- d(Xi, Xj) ≤ d(Xi, Xc) + d(Xc, Xj)
- = D(Xi, Xc) + D(Xc, Xj)
- (triangle inequality)
60. Analysis
- Let A be the optimal multiple alignment of the k strings in X
- Define V(A) = Σ_{i<j} d(Xi, Xj), where the d(Xi, Xj) are the pairwise values induced by A
61. Analysis
- Theorem 3.1
- V(Ac) / V(A) ≤ 2(k-1)/k < 2
- Proof (a sketch follows below)
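The proof on the original slide is not in the transcript; the following is a standard sketch of the argument, added here for completeness. It uses only Lemma 3.1 and the definition of M.

\[
V(A_c) \;=\; \sum_{i<j} d(X_i, X_j) \;\le\; \sum_{i<j} \big( D(X_i, X_c) + D(X_c, X_j) \big) \;=\; (k-1) \sum_{j \ne c} D(X_c, X_j) \;=\; (k-1)\,M .
\]
\[
2\,V(A) \;=\; \sum_{i \ne j} d(X_i, X_j) \;\ge\; \sum_{i \ne j} D(X_i, X_j) \;=\; \sum_i \sum_{j \ne i} D(X_i, X_j) \;\ge\; k\,M ,
\]
since M is the minimum row sum. Hence V(Ac)/V(A) ≤ (k-1)M / (kM/2) = 2(k-1)/k < 2.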
62. Disadvantages
- Requires all pairwise alignments
- Computationally expensive
- Faster: randomized alignments
- Randomly select a string Xi
- Build the multiple alignment with the star centered at Xi
- Select the best multiple alignment A from p such stars
- At most (k-1)p pairwise alignments need to be computed
63. Randomized Alignments
- Theorem 3.2: For any r > 1, let e(r) be the expected number of stars that need to be chosen at random before the value of the best resulting alignment is within a factor of 2 + 1/(r-1) of the optimal alignment. Then e(r) ≤ r
- e(r) is independent of k and of the length of the strings
64. Proof of Theorem 3.2
- For r = 2: for each string Xi,
- define M(i) = Σ_j D(Xi, Xj); then M(c) = M
- From Theorem 3.1,
- Σ_{(i,j)} D(Xi, Xj) = Σ_i M(i) ≤ 2(k-1)M, so the average value of M(i) < 2M
- Since min M(i) = M, the median of the M(i) is < 3M
- The expected number of centers selected before one with M(i) below the median is ≤ 2
65. Proof
- Suppose the median is γM, for 1 ≤ γ ≤ 3
- Then Σ_{(i,j)} D(Xi, Xj) ≥ kM/2 + kγM/2
- The value of the alignment obtained from any below-median star is ≤ 2(k-1)γM
- Therefore, the error ratio for this star is ≤ 2γ / (1/2 + γ/2)
- When γ = 3, the error ratio is ≤ 3
- So we have e(2) ≤ 2
66. Proof
- Now generalize this proof for r > 2
- At least k/r stars have M(i) less than or equal to (2r-1)M/(r-1)
- Minimum M(i) is M
- Mean < 2M
- The expected number of stars to pick before finding one with M(i) ≤ γM is ≤ r, for 1 ≤ γ ≤ (2r-1)/(r-1)
- Error ratio ≤ 2γ / (1/r + (r-1)γ/r)
- ≤ (2r-1)/(r-1) = 2 + 1/(r-1)
67. Theorem 3.3
- Picking p stars at random, the best resulting alignment will have value within a factor of 2 + 1/(r-1) of the optimal with probability at least
- 1 - ((r-1)/r)^p
68. Center Star Method
- Proof
- From Theorem 3.2: suppose the median value were actually 3M
- For half the stars M(i) = M, and M(i) = 3M for the other half
- Then Σ_{(i,j)} D(Xi, Xj) = 2kM
- An optimal SP alignment can be obtained from any center string Xi with M(i) = M
- The probability of selecting such a string is one-half
69. Tree Alignment Method
- Typical approach
- First find a multiple alignment, and then build a tree showing the evolutionary derivations
- Another approach (called tree alignment)
- First choose the topology of the tree, and then map the strings to the nodes of the tree
- The alignment is given by the pairwise alignments of the strings at the ends of the edges of the tree
70. Formal Definitions
- Let K be an input set of k strings
- Let K' ⊇ K be a set of strings containing K
- An evolutionary tree T_{K'} for K is a tree
- with at least k nodes
- each string in K labels exactly one node; each node gets exactly one label from K'
- The value of T_{K'} is V(T_{K'}) = Σ_{edges (X,Y)} D(X, Y)
- The problem is to find a set of strings K' and a tree T_{K'} for K which minimizes V(T_{K'})
71.
- The alignment value D(X,Y) is interpreted as the minimum "cost" to transform string X into string Y
- The sum of the alignment values of the edges gives the evolutionary cost implied by the tree
72. Method
- Let G be the complete graph with k nodes, each labeled with a distinct string in K
- Each edge (X,Y) has weight D(X,Y)
- Find the MST of G. This MST is an evolutionary tree for K (a sketch follows below)
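A minimal sketch of this method in Python: build the complete weighted graph using a unit-cost pairwise alignment value as the edge weight, then extract an MST with Prim's algorithm. The pairwise scoring is an illustrative assumption (the same 0/1 costs as in the earlier sketches).

def pair_value(s, t):
    """Optimal pairwise alignment value D(s, t) under unit mismatch/indel costs."""
    prev = list(range(len(t) + 1))
    for i in range(1, len(s) + 1):
        cur = [i] + [0] * len(t)
        for j in range(1, len(t) + 1):
            sub = prev[j - 1] + (0 if s[i - 1] == t[j - 1] else 1)
            cur[j] = min(sub, prev[j] + 1, cur[j - 1] + 1)
        prev = cur
    return prev[-1]

def mst_evolutionary_tree(strings):
    """Prim's algorithm on the complete graph with weights D(Xi, Xj).
    Returns the list of tree edges (i, j, weight)."""
    k = len(strings)
    D = [[pair_value(strings[i], strings[j]) for j in range(k)] for i in range(k)]
    in_tree = {0}
    edges = []
    while len(in_tree) < k:
        i, j = min(((i, j) for i in in_tree for j in range(k) if j not in in_tree),
                   key=lambda e: D[e[0]][e[1]])
        edges.append((i, j, D[i][j]))
        in_tree.add(j)
    return edges

print(mst_evolutionary_tree(["ACTCAA", "ATCA", "ACTCA", "CTCAA"]))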
73. Analysis
- Let T denote the optimal evolutionary tree for K
- Prove V(MST)/V(T) < 2, i.e., V(MST) < 2·OPT
- Let C be a traversal of the edges of T that traverses every edge exactly once in each direction
- Let C1, ..., Ck be the order in which C encounters the strings of K
- Let V(C) = D(Ck, C1) + Σ_{i<k} D(Ci, Ci+1)
74. Analysis
75. Analysis
- Corollary 4.1: V(C) ≤ 2V(T)
- Let D(Ci, Ci+1) be the largest distance between any adjacent strings in the traversal C
- Lemma 4.2
- V(MST) ≤ V(C) - D(Ci, Ci+1) ≤ V(C) - V(C)/k
76. Analysis
- Theorem 4.1
- For any set K of k strings, we have
- V(MST)/V(T_K) ≤ 2(k-1)/k < 2
- Theorem 4.2
- V(MST)/V(T_K) ≤ (k-1)/k · V(C)/V(T_K) ≤ 2(k-1)/k
- Corollary 4.2
- V(T_K) ≥ k·V(MST)/(2(k-1))
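Putting Corollary 4.1 and Lemma 4.2 together gives the bound of Theorem 4.1; written out as one chain of inequalities:

\[
V(\mathrm{MST}) \;\le\; V(C) - \frac{V(C)}{k} \;=\; \frac{k-1}{k}\,V(C) \;\le\; \frac{k-1}{k}\cdot 2\,V(T_K) \;=\; \frac{2(k-1)}{k}\,V(T_K) .
\]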
77. Constrained MSA
78. Motivation
- General SP MSA problem
- NP-completeness has already been established
- Approximation algorithms have been developed
- Heuristics are also available
- Constrained MSA
- Biologists often have additional knowledge of the data (e.g., active-site residues)
- The additional knowledge can specify matches at certain locations
- Models allow users to provide additional constraints
79. Definition of the CMSA Problem
- Suppose that P = p1 p2 ... pa is a common subsequence of S1, S2, ..., Sk
- The constrained multiple sequence alignment of S with respect to P is
- an MSA A with the constraint that there are a columns in A, c1, c2, ..., ca with c1 < c2 < ... < ca, such that the characters of column ci, 1 ≤ i ≤ a, are all equal to pi
80. Optimal CPSA
81. Dynamic Algorithm
82. Time and Space Complexities
83. CMSA
- The improvement to CPSA in turn improves the time and space complexity of
- Progressive CMSA from O(a·k·n^4) time and O(a·n^4) space to O(a·k^2·n^2) time and O(a·n^2) space
- Optimal CMSA
- This optimal CMSA algorithm involves the creation of a matrix with k+1 dimensions
- (Assume d(x,y) is the distance function and satisfies the triangle inequality)
- Let D(i1, ..., ik; γ) be the optimal CMSA score for
- S1[1..i1], ..., Sk[1..ik], where P[1..γ] is aligned in γ constrained columns
- Then the optimal alignment score is D(n1, ..., nk; a), where ni = |Si|
- Computing D (a pairwise sketch follows after this slide)
- D(0, ..., 0; 0) = 0
- Let each ej be 0 or 1, with ej·Sj[ij] denoting Sj[ij] if ej = 1 and a space if ej = 0, and
- d(x1, ..., xk) = Σ_{1≤i<j≤k} d(xi, xj)
- D(i1, i2, ..., ik; γ) is the minimum of
- if S1[i1] = ... = Sk[ik] = P[γ]:
- D(i1 - 1, ..., ik - 1; γ - 1) + d(S1[i1], ..., Sk[ik])
- min over e ∈ {0,1}^k, e ≠ (0,...,0), of ( D(i1 - e1, ..., ik - ek; γ) + d(e1·S1[i1], ..., ek·Sk[ik]) )
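Because the k-dimensional recurrence above is easiest to see in the pairwise case, here is a minimal sketch of the constrained pairwise alignment DP (CPSA, the k = 2 case) in Python. The cost function d and the example strings are illustrative assumptions; the table has the O(a·n^2) size mentioned above.

import math

def d(x, y):
    """Illustrative pair cost: 0 for a match, 1 for a mismatch or indel."""
    return 0 if x == y else 1

def cpsa(S, T, P):
    """Constrained pairwise alignment score: minimum-cost alignment of S and T
    in which the characters of P appear, in order, in fully matching columns.
    A sketch of the 3-dimensional DP described above (k = 2)."""
    n, m, a = len(S), len(T), len(P)
    INF = math.inf
    D = [[[INF] * (a + 1) for _ in range(m + 1)] for _ in range(n + 1)]
    D[0][0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            for g in range(a + 1):
                best = D[i][j][g]
                if i > 0 and j > 0:
                    # constrained column: both characters equal P[g]
                    if g > 0 and S[i - 1] == T[j - 1] == P[g - 1]:
                        best = min(best, D[i - 1][j - 1][g - 1] + d(S[i - 1], T[j - 1]))
                    # ordinary substitution column
                    best = min(best, D[i - 1][j - 1][g] + d(S[i - 1], T[j - 1]))
                if i > 0:
                    best = min(best, D[i - 1][j][g] + d(S[i - 1], '-'))
                if j > 0:
                    best = min(best, D[i][j - 1][g] + d('-', T[j - 1]))
                D[i][j][g] = best
    return D[n][m][a]

print(cpsa("ACTCAA", "AGTCA", "TC"))   # P = "TC" must appear as aligned columns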
84. CMSA (Center Star)
- The Center-Star method proposed for the general MSA problem can be modified to apply to the CMSA problem
- Consider each sequence as the center Sc, and consider each list of positions at which Sc is aligned with P
- Find the center Sc with the minimum star-sum score
- Create a constrained alignment matrix by merging the constrained pairwise sequence alignments between Sc and each Sj
85. CMSA (Center Star)
- The recurrence of Thm. 3.1 is only slightly modified
86. Example