Title: A Quick Tour:
1A Quick Tour: Algorithms, Complexity, Data Structure
Laxmi Parida, Computational Biology Center, IBM T. J. Watson Research Center
2Problem, Algorithm?
- Problem
- parameters (values are unspecified)
- solution (what is it)
- Algorithm
- step-by-step procedure to solve the problem
3Example Problem
Given n positive integers, reorder the numbers in ascending order.
An instance of the problem with n = 5 is 3, 6, 4, 2, 1.
4Algorithm 1
1. Obtain all possible orderings of the n numbers.
2. Check each ordering to see if it satisfies the ascending property.
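The two steps above can be sketched in Python as follows. This is only an illustration: the function names `is_ascending` and `sort_by_enumeration` are made up for this example, and the standard library's `itertools.permutations` is used to generate the orderings.

```python
from itertools import permutations

def is_ascending(seq):
    """Return True if every element is <= the element after it."""
    return all(a <= b for a, b in zip(seq, seq[1:]))

def sort_by_enumeration(numbers):
    """Try the orderings one by one and return the first ascending one."""
    for candidate in permutations(numbers):
        if is_ascending(candidate):
            return list(candidate)

# The instance from the example problem.
result = sort_by_enumeration([3, 6, 4, 2, 1])
```

In the worst case every one of the n! orderings is generated before the ascending one is found, which is why this approach is only usable for very small n.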
5Algorithm 2
1. Let i be 1.
2. If i is equal to n, then exit.
3. Go through the numbers from position i to n and pick the minimum, m_i.
4. Move m_i to position i, the start of the not-yet-sorted part of the list.
5. Increment i by 1 and go to Step 2.
Critical question: How efficient is the algorithm?
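A minimal Python sketch of Algorithm 2 follows. It assumes that the minimum is moved to position i (the start of the unsorted part) by swapping, which is the usual selection sort; the name `selection_sort` is chosen for this example.

```python
def selection_sort(numbers):
    """Repeatedly move the minimum of the unsorted part to its front."""
    items = list(numbers)          # work on a copy
    n = len(items)
    for i in range(n - 1):         # positions 0 .. n-2 (i = 1 .. n-1 on the slide)
        # index of the minimum among positions i .. n-1
        m = min(range(i, n), key=items.__getitem__)
        # swap it into position i
        items[i], items[m] = items[m], items[i]
    return items

result = selection_sort([3, 6, 4, 2, 1])
```

The scan in each round looks at n - i elements, so the total work is about n(n-1)/2 comparisons.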
6Algorithm 3
Sort(n)
1. Split the elements into 2 sets of n/2 elements.
2. Solve the problem for the two sets as Sort(n/2).
3. Merge the two solutions in linear time.
This is Merge Sort.
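The split, solve, and merge steps can be sketched in Python as below. The helper names `merge` and `merge_sort` are chosen for this example; the sketch returns a new list rather than sorting in place.

```python
def merge(left, right):
    """Merge two ascending lists into one ascending list in linear time."""
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])    # at most one of these two is non-empty
    merged.extend(right[j:])
    return merged

def merge_sort(numbers):
    """Sort(n): split into halves, sort each half, merge the results."""
    if len(numbers) <= 1:
        return list(numbers)
    mid = len(numbers) // 2
    return merge(merge_sort(numbers[:mid]), merge_sort(numbers[mid:]))

result = merge_sort([3, 6, 4, 2, 1])
```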
7Time Complexity (worst-case measure)
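Described in terms of the size of the input, n.
$f(n) = O(g(n))$ if $f(n) \le c\,g(n)$ for some constant $c > 0$ and all sufficiently large $n$.
- Algorithm 1: $n!$ orderings
- Algorithm 2: $O(n^2)$
- Algorithm 3: $O(n \log n)$, from $T(1) = O(1)$, $T(n) \le 2T(n/2) + O(n)$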
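The bound for Algorithm 3 can be checked by unrolling its recurrence. The derivation below is a sketch under two assumptions that are not on the slide: n is a power of 2, and the O(n) merge cost is written as cn for some constant c.

```latex
\begin{aligned}
T(n) &\le 2\,T(n/2) + cn \\
     &\le 4\,T(n/4) + 2cn \\
     &\le \dots \\
     &\le 2^{k}\,T(n/2^{k}) + k\,cn .
\end{aligned}
% Taking k = \log_2 n gives
% T(n) \le n\,T(1) + cn\log_2 n = O(n \log n).
```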
8Tractable Problems
A problem is tractable if it has a polynomial-time solution.
It is intractable if there exists no polynomial-time solution.
9Intractable Problems
Causes of intractability
- so difficult that exponential time is needed to discover the solution
- the solution itself is exponential in size
10How do we prove Intractability?
(NP-Complete Problems)
- Cook's theorem (1971): SATISFIABILITY (SAT) is NP-complete
- If a polynomial-time solution to a given problem P would imply that SAT can be solved in polynomial time, then P is intractable
- There exists a set of known NP-complete problems Ni
- If one of the Ni can be transformed to P in polynomial time, then P is intractable
11How do we deal with Intractable Problems?
- Intrinsic hardness
  - Approximation Algorithms (provable approximation bounds, hardness of approximation...)
  - Heuristics
- Exponential Size of Output
  - Output Sensitive Algorithms
  - Intelligent means of reducing output size
12Solving a Problem/ Designing an Algorithm
Consult an algorithmicist!
- Abstract the essentials of the problem
- Is it tractable?
  - No. Is it approximable?
    - Yes. Design an efficient algorithm
    - No. Can the problem be reformulated without significant loss?
      - Yes. Repeat the process.
      - No. Explore heuristics or other ad-hoc methods
  - Yes. Design an efficient algorithm
13The Mismatch Distance Problem I
Given strings s1 and s2, the mismatch distance of an alignment is the number of positions in either string that are matched to a blank in the other. The objective is to minimize this distance over all alignments.
Let s1 = GTTCAGT, s2 = TTCGTT
GTTCAGT-
-TTC-GTT
mismatch distance = 3
14Mismatch Distance Problem (Recursive
Formulation)
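$$
D(i,j) =
\begin{cases}
0, & \text{if } i = j = 0 \\
D(i-1,j-1), & \text{if } s_1[i] = s_2[j] \\
\min\{D(i,j-1),\ D(i-1,j)\} + 1, & \text{otherwise}
\end{cases}
$$
Terms with a negative index are left out of the minimum, so $D(i,0) = i$ and $D(0,j) = j$.
Required value is $D(n_1, n_2)$.
Evaluated by plain recursion, with $n = n_1 + n_2$: $T(1) = O(1)$, $T(n) \le 2T(n-1) + O(1)$, which is exponential in $n$.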
15Dynamic Programming (polynomial time)
- optimal substructure
- overlapping subproblems
- recursive formulation
- subproblem solutions in a "table" (programming)
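As an illustration of storing subproblem solutions in a table, the sketch below fills the table for Mismatch Distance Problem I bottom-up. The function name `mismatch_table_one` is chosen for this example; the rows are indexed by prefixes of s2 and the columns by prefixes of s1, which is an arbitrary layout choice.

```python
def mismatch_table_one(s1, s2):
    """Fill D, where D[i][j] is the Problem I distance between
    the first i symbols of s2 and the first j symbols of s1."""
    n1, n2 = len(s1), len(s2)
    D = [[0] * (n1 + 1) for _ in range(n2 + 1)]
    for j in range(1, n1 + 1):
        D[0][j] = j                      # j symbols against blanks
    for i in range(1, n2 + 1):
        D[i][0] = i                      # i symbols against blanks
    for i in range(1, n2 + 1):
        for j in range(1, n1 + 1):
            if s2[i - 1] == s1[j - 1]:
                D[i][j] = D[i - 1][j - 1]                 # symbols match
            else:
                D[i][j] = min(D[i][j - 1], D[i - 1][j]) + 1  # one blank
    return D

table = mismatch_table_one("GTTCAGT", "TTCGTT")
distance = table[-1][-1]
```

Each of the (n1 + 1)(n2 + 1) cells is computed once in constant time, so the running time is polynomial, in contrast to the plain recursion.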
16Dynamic Programming
   -  G  T  T  C  A  G  T
-  0  1  2  3  4  5  6  7
T  1  2  1  2  3  4  5  6
T  2  3  2  1  2  3  4  5
C  3  4  3  2  1  2  3  4
G  4  3  4  3  2  3  2  3
T  5  4  3  4  3  4  3  2
T  6  5  4  3  4  5  4  3
Alignment:
GTTCAGT-
-TTC-GTT
17The Dynamic Table (memoization)
- obtain optimal value
- use table to track the optimal path
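The two bullets above can be sketched together for Mismatch Distance Problem I: first fill the table to obtain the optimal value, then walk back from the last cell to recover one optimal alignment. The function name `align_one` and the use of `-` for a blank are choices made for this example; when several optimal alignments exist, which one is returned depends on the tie-breaking order in the traceback.

```python
def align_one(s1, s2):
    """Return (distance, aligned_s1, aligned_s2) for Problem I."""
    n1, n2 = len(s1), len(s2)
    # D[i][j]: distance between s1[:i] and s2[:j]; borders are i or j.
    D = [[i + j if i == 0 or j == 0 else 0 for j in range(n2 + 1)]
         for i in range(n1 + 1)]
    for i in range(1, n1 + 1):
        for j in range(1, n2 + 1):
            if s1[i - 1] == s2[j - 1]:
                D[i][j] = D[i - 1][j - 1]
            else:
                D[i][j] = min(D[i - 1][j], D[i][j - 1]) + 1
    # Traceback from the bottom-right cell to the top-left cell.
    a1, a2 = [], []
    i, j = n1, n2
    while i > 0 or j > 0:
        if i > 0 and j > 0 and s1[i - 1] == s2[j - 1]:
            a1.append(s1[i - 1]); a2.append(s2[j - 1])   # matched column
            i -= 1; j -= 1
        elif i > 0 and (j == 0 or D[i][j] == D[i - 1][j] + 1):
            a1.append(s1[i - 1]); a2.append("-")         # blank in s2
            i -= 1
        else:
            a1.append("-"); a2.append(s2[j - 1])         # blank in s1
            j -= 1
    return D[n1][n2], "".join(reversed(a1)), "".join(reversed(a2))

distance, top, bottom = align_one("GTTCAGT", "TTCGTT")
```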
18The Mismatch Distance Problem II
Given strings s1 and s2, the mismatch distance of an alignment is the number of positions that don't match with each other (a symbol matched to a blank, or a mismatch between two symbols). The objective is to minimize this distance over all alignments.
s1 = TCGTTCG, s2 = TACG
TCGTTCG
TA---CG
mismatch distance = 4
T-CGTTCG
TACG----
mismatch distance = 5
19Mismatch Distance Problem II (Recursive
Formulation)
$$
D(i,j) =
\begin{cases}
0, & \text{if } i = j = 0 \\
D(i-1,j-1), & \text{if } s_1[i] = s_2[j] \\
\min\{D(i-1,j-1),\ D(i,j-1),\ D(i-1,j)\} + 1, & \text{otherwise}
\end{cases}
$$
Required value is $D(n_1, n_2)$.
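A table-filling sketch of the Problem II recurrence is given below. The function name `mismatch_table_two` is chosen for this example, and the border values D(i,0) = i and D(0,j) = j are an assumption about how cells with a negative index are handled (a symbol against nothing counts as one blank). Rows are indexed by prefixes of s2 and columns by prefixes of s1.

```python
def mismatch_table_two(s1, s2):
    """Fill D, where D[i][j] is the Problem II distance between
    the first i symbols of s2 and the first j symbols of s1."""
    n1, n2 = len(s1), len(s2)
    D = [[0] * (n1 + 1) for _ in range(n2 + 1)]
    for j in range(1, n1 + 1):
        D[0][j] = j
    for i in range(1, n2 + 1):
        D[i][0] = i
    for i in range(1, n2 + 1):
        for j in range(1, n1 + 1):
            if s2[i - 1] == s1[j - 1]:
                D[i][j] = D[i - 1][j - 1]          # symbols match
            else:
                D[i][j] = min(D[i - 1][j - 1],     # mismatch
                              D[i][j - 1],         # blank in s2
                              D[i - 1][j]) + 1     # blank in s1
    return D

table = mismatch_table_two("TCGTTCG", "TACG")
distance = table[-1][-1]
```

The only change from Problem I is the extra diagonal term in the minimum, which allows two different symbols to share a column at a cost of 1.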
20Dynamic Programming
   -  T  C  G  T  T  C  G
-  0  1  2  3  4  5  6  7
T  1  0  1  2  3  4  5  6
A  2  1  1  2  3  4  5  6
C  3  2  1  2  3  4  4  5
G  4  3  2  1  2  3  4  4
Alignment:
TCGTTCG
TA---CG
21Mismatch Distance Problems
Problem II (distance 4):
TCGTTCG
TA---CG
Problem I (distance 5):
T-CGTTCG
TACG----
Problem I is equivalent to the Longest Common Subsequence Problem.
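The link between Problem I and the Longest Common Subsequence (LCS) problem can be made concrete: every symbol that is not part of a common subsequence must sit opposite a blank, so the Problem I distance equals len(s1) + len(s2) - 2 * LCS(s1, s2). The sketch below uses this identity; the function names are chosen for this example.

```python
def lcs_length(s1, s2):
    """Length of a longest common subsequence of s1 and s2."""
    n1, n2 = len(s1), len(s2)
    L = [[0] * (n2 + 1) for _ in range(n1 + 1)]
    for i in range(1, n1 + 1):
        for j in range(1, n2 + 1):
            if s1[i - 1] == s2[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return L[n1][n2]

def mismatch_distance_one(s1, s2):
    """Problem I distance via the LCS identity."""
    return len(s1) + len(s2) - 2 * lcs_length(s1, s2)

d = mismatch_distance_one("TCGTTCG", "TACG")
```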
22Mismatch Distance Problems
III. Distance: # of gaps - # of mismatches
IV. Obtain the k best solutions (k 2)
23Data Structure
- Organization of data
- to answer specific queries efficiently
- to retrieve efficiently, leading to efficient
algorithms
24A Quick Tour Continues Data Structure
26Graphs, Trees
- Graphs: entities with binary relationships
- Trees: acyclic graphs
  - rooted, unrooted
27Example Property of (compact) Trees
- root, internal, leaf nodes
- # of internal nodes < # of leaf nodes
28Suffix Tree (compact)
- rooted, directed tree
- leaves numbered bijectively from 1 to n; the path from the root to leaf i represents the suffix s[i..n]
- each internal node has at least 2 outgoing edges
- no two outgoing edges of a node start with the same symbol
Given a string s, the suffix tree represents all the suffixes of s.
29Suffix Tree
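example: s = GTTCGATT
root
  ATT ......... leaf 6
  CGATT ....... leaf 4
  G
    ATT ....... leaf 5
    TTCGATT ... leaf 1
  T
    (end) ..... leaf 8
    CGATT ..... leaf 3
    T
      (end) ... leaf 7
      CGATT ... leaf 2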
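A compact suffix tree for a short string can be built by inserting the suffixes one at a time and splitting an edge wherever a new suffix diverges from an existing label. The sketch below does this naively in quadratic time; it is not one of the linear-time constructions. The names `Node`, `build_suffix_tree`, `leaf_numbers`, and `contains` are chosen for this example, and the end marker `$` (assumed not to occur in the string) is an addition that guarantees every suffix ends at its own leaf.

```python
class Node:
    def __init__(self):
        self.children = {}  # first symbol of edge -> (edge label, child Node)
        self.leaf = None    # 1-based start of the suffix, set for leaves only

def build_suffix_tree(s, end="$"):
    """Insert every suffix of s + end into a compact trie."""
    text = s + end
    root = Node()
    for start in range(len(s)):
        node, rest = root, text[start:]
        while True:
            first = rest[0]
            if first not in node.children:
                leaf = Node()
                leaf.leaf = start + 1
                node.children[first] = (rest, leaf)   # new leaf edge
                break
            label, child = node.children[first]
            k = 0   # length of the common prefix of label and rest
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k < len(label):
                # Split the edge at the point of divergence.
                mid = Node()
                mid.children[label[k]] = (label[k:], child)
                node.children[first] = (label[:k], mid)
                child = mid
            node, rest = child, rest[k:]
    return root

def leaf_numbers(node):
    """Collect the suffix numbers stored at the leaves below node."""
    if node.leaf is not None:
        return [node.leaf]
    return [n for _, c in node.children.values() for n in leaf_numbers(c)]

def contains(root, pattern):
    """Return True if pattern occurs somewhere in the indexed string."""
    node, rest = root, pattern
    while rest:
        if rest[0] not in node.children:
            return False
        label, child = node.children[rest[0]]
        k = min(len(label), len(rest))
        if label[:k] != rest[:k]:
            return False
        node, rest = child, rest[k:]
    return True

tree = build_suffix_tree("GTTCGATT")
```

Because every substring is a prefix of some suffix, `contains` answers a substring query in time proportional to the pattern length, independent of the length of the indexed string.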
30 Patterns
Given the string GATCGATCGA, what are the patterns?
Maximal patterns (number of occurrences):
- GATCGA (2)
- GA (3)
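The numbers in parentheses appear to be occurrence counts, with overlapping occurrences counted separately. Under that assumption they can be checked with a short sketch; `count_occurrences` is a name chosen for this example, and the sketch does not attempt to decide which patterns are maximal.

```python
def count_occurrences(text, pattern):
    """Count the positions at which pattern occurs in text (overlaps allowed)."""
    last_start = len(text) - len(pattern)
    return sum(text.startswith(pattern, i) for i in range(last_start + 1))

s = "GATCGATCGA"
counts = {p: count_occurrences(s, p) for p in ("GATCGA", "GA")}
```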
31At most how many maximal non-unique patterns
exist in a string of length n?
32Problem
Consider a genomic sequence s.
s1, s2, ..., sm are fragment compomers (mass spectrometry).
Can we compute the sequence of s?
33One man's algorithm is another man's data
structure
Jon Bentley
34Problem Formulation
Given a set X and subsets S1, S2, S3, ..., Sm of X, is there a permutation o of the elements of X such that the elements of each Si are consecutive in o?
35PQ Trees
A tree with different kinds of nodes
- P node: the children can be in any order
- Q node: the children are in a fixed left-to-right or right-to-left order
- leaf nodes: elements of a set
36PQ Tree (example)
Consistent sequences:
A B C D E, B A C D E, A B E D C, B A E D C,
C D E A B, C D E B A, E D C A B, E D C B A
Is D A C B E consistent?
[Figure: PQ tree with leaves A, B, C, D, E]
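The sequences consistent with a PQ tree can be enumerated directly from the P and Q rules. The sketch below does this for a small tree. The tree used here, a P root whose children are a P node over A, B and a Q node over C, D, E, is an assumption inferred from the eight consistent sequences in the example; the tuple encoding and the name `frontiers` are also choices made for this illustration.

```python
from itertools import permutations, product

def frontiers(node):
    """Return the set of leaf sequences consistent with a PQ tree.

    A leaf is a string; an internal node is a pair (kind, children)
    with kind "P" (any order) or "Q" (given order or its reverse).
    """
    if isinstance(node, str):
        return {(node,)}
    kind, children = node
    if kind == "P":
        orders = permutations(children)
    else:
        orders = [tuple(children), tuple(reversed(children))]
    result = set()
    for order in orders:
        # Combine every choice of sequence for each child, in this order.
        for parts in product(*(frontiers(child) for child in order)):
            result.add(tuple(x for part in parts for x in part))
    return result

tree = ("P", [("P", ["A", "B"]), ("Q", ["C", "D", "E"])])
sequences = frontiers(tree)
is_dacbe_consistent = tuple("DACBE") in sequences
```

Enumeration grows factorially with the number of children of a P node, so this is only meant for checking small examples by hand.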
37Few Definitions
Least Common Ancestor (LCA) of nodes 1, 2, ..., k is the node p that is a common ancestor of all the nodes, such that there does not exist a common ancestor p' with p an ancestor of p'
- strips of Q
- whole P
Reachable Set (R(p)) is the collection of the leaf nodes that are reachable from node p
38Formulation as a PQ Tree Problem
Given a set X and subsets S1, S2, S3, ..., Sm of X, is there a permutation o of the elements of X such that the elements of each Si are consecutive in o?
Does there exist a PQ Tree such that S1, S2, S3, ..., Sm are consistent and, for each Si, R(LCA(Si)) = Si?
39Example
{e,b,d,a}, {h,f,a}, {c,h,f,g}, {b,d,a,f}
What permutation of a, b, c, d, e, f, g, h makes the elements of each of the four sets neighbors?
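For an instance this small, the question can be answered by brute force, which is useful as a check on a PQ tree computation. The sketch below is not the PQ tree algorithm: it simply tries permutations until every set occupies consecutive positions. The function names are chosen for this example.

```python
from itertools import permutations

def is_consecutive(order, subset):
    """True if the elements of subset occupy adjacent positions in order."""
    positions = sorted(order.index(x) for x in subset)
    return positions[-1] - positions[0] + 1 == len(positions)

def find_consecutive_order(elements, subsets):
    """Return some permutation in which every subset is consecutive, or None."""
    for order in permutations(elements):
        if all(is_consecutive(order, s) for s in subsets):
            return order
    return None

subsets = [set("ebda"), set("hfa"), set("chfg"), set("bdaf")]
order = find_consecutive_order("abcdefgh", subsets)
```

The search examines up to 8! = 40320 orderings here and grows factorially with the size of X, which is the motivation for the linear-time PQ tree approach.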
40Step 1
Add {e,b,d,a}
[Figure: PQ tree with leaves a, e, d, b]
Next set: {h,f,a}
41Step 2
Add {h,f,a}
[Figure: PQ tree with leaves a, e, f, h, d, b]
Next set: {c,h,f,g}
42Step 3
Add {c,h,f,g}
[Figure: PQ tree with leaves a, e, f, g, c, h, d, b]
Next set: {b,d,a,f}
43Step 4
Sets: {e,b,d,a}, {h,f,a}, {c,h,f,g}, {b,d,a,f}
Add {b,d,a,f}
[Figure: PQ tree with leaves f, a, h, e, g, c, d, b]
Resulting permutation: e b d a f h c g
44Observations (algorithm)
- only a constant number of levels affected
- linear in the size of the input
45Properties of PQ Tree
- minimal PQ Tree (unique)
- Reduce a tree using the following:
  - if only one child, merge with parent
  - if k children of a P node are P, merge the k nodes with parent
  - if k consecutive children of a Q node are Q, merge with parent
- This reduction gives a unique PQ Tree
46Reverse PQ Tree Problem
Given permutations o1, o2, ..., om of X, find the minimal PQ Tree that is consistent with o1, o2, ..., om.