Title: Assembling Algorithms and Techniques
1Assembling Algorithms and Techniques
- Upmanyu Misra
- Computational Issues in Molecular Biology
- CSE 397-497
2Motivation
- Direct sequencing of full stretches of DNA base pairs is not possible (as of now)
- Hence, we sample DNA stretches by breaking them into fragments
- Sequencing of fragments is done by previously discussed algorithms
- Fragments are to be assembled back to make sense of the full DNA stretches
3Approach
- Primitive (Root) Techniques
- Models used
- Algorithms: Greedy/Multicontig
- Heuristics
- Greedy Path Merge Heuristics
- Basics/Examples/Definitions wherever required
4Refreshing Basics
- Consider 4 strings (fragments) to be assembled
- CTAGG
- AGGTA
- GACTA
- TAGG
- We know that final sequence size should be around
10 base pairs
5How Assembling works
- _ _ C T A G G _ _
- _ _ _ _ A G G T A
- G A C T A _ _ _ _
- _ _ _ T A G G _ _
G A C T A G G T A (consensus)
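The layout above can be checked mechanically. The following is a minimal sketch, assuming each fragment is stored as an (offset, fragment) pair where the offset is the number of leading blanks in its row; the `layout` list and the `consensus` helper are illustrative names, not part of the original slides.

```python
from collections import Counter

# Layout from the slide: (offset, fragment); offset = number of leading blanks.
layout = [(2, "CTAGG"), (4, "AGGTA"), (0, "GACTA"), (3, "TAGG")]

def consensus(layout):
    """Majority vote per column; assumes every column is covered by a fragment."""
    width = max(off + len(frag) for off, frag in layout)
    columns = [Counter() for _ in range(width)]
    for off, frag in layout:
        for i, ch in enumerate(frag):
            columns[off + i][ch] += 1
    return "".join(col.most_common(1)[0][0] for col in columns)

result = consensus(layout)
print(result)                                     # GACTAGGTA
print(all(frag in result for _, frag in layout))  # True
```

Because every column here is coherent, the majority vote is trivial; with noisy data the vote is where base-call errors would be resolved.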
6Trivial, eh?
- Consensus sequence easily achieved
- Very close to our goal of finding a 10 base pair chain
- A combination was found such that each column was coherent
- The consensus contained each fragment as an exact substring
7Size does matter
- Errors creep in when we consider the real-life scenario
- Size of sequence increases
- Size of fragments increases
- Noise
- Loss of data
- Larger consensus goals
8Pandora's Box
- Errors might be of the following types (and many more)
- Base call errors: insertion, deletion, substitution
- Chimeras
- Unknown Orientations
- Repeated Regions
- Lack of coverage
9Primitive Solutions
- Assembling for Shotgun fragments
- Directed Sequencing or Walking
- Dual End Sequencing
- Non-Shotgun (based on experimental data)
- Sequencing by Hybridization
- Good news: they provide headway into the assembly problem
- Bad news: none of them, by itself, is good enough, not even close
10Usable Models
- Real-world problems: contamination, chimeras, experimental errors, orientation errors, repeated sections, lack of coverage
- Models: Shortest Common Superstring, Reconstruction, Multicontig
- These models are NP-hard
11Algorithms
- Again, these algorithms are of little value by themselves. Good supporting heuristics help to make them usable
- Two different approaches
- GREEDY
- MULTICONTIG (acyclic overlap graphs)
12Getting Started
- Linking overlaps to graphs: why?
- Superstrings can be mapped onto graphs
- We (the Computer Science community!) come across graphs frequently
- Comfortable (logically, psychologically, cognitively)
13Definitions
- def. The overlap multigraph OM(F) of a collection F is a directed, weighted multigraph. The set V of nodes is the set F. A directed edge from $a \in F$ to a different fragment $b \in F$ with weight $t \geq 0$ exists if the suffix of a with t characters is a prefix of b:
$\mathrm{suffix}(a, t) = \mathrm{prefix}(b, t)$
- $\kappa$ is the "killer agent", which deletes one character, such that $\kappa\,\mathrm{ATCG} = \mathrm{TCG}$ and $|\kappa| = -1$
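The definition translates almost directly into code. This is a minimal sketch, assuming fragments are plain Python strings and that an overlap is always shorter than both fragments; `overlap_edges` is an illustrative name, not something from the slides.

```python
from itertools import permutations

def overlap_edges(fragments):
    """Return the edges (a, b, t) of OM(F): suffix(a, t) == prefix(b, t), t >= 0."""
    edges = []
    for a, b in permutations(fragments, 2):       # ordered pairs, no self-loops
        for t in range(min(len(a), len(b))):      # t = 0 always matches
            if a[len(a) - t:] == b[:t]:
                edges.append((a, b, t))
    return edges

F = ["TACGA", "ACCC", "CTAAAG", "GACA"]
for a, b, t in overlap_edges(F):
    if t > 0:
        print(a, "->", b, "weight", t)
```

For this collection the positive-weight edges are TACGA -> ACCC (1), TACGA -> GACA (2), ACCC -> CTAAAG (1), CTAAAG -> GACA (1) and GACA -> ACCC (1); every ordered pair also carries a zero-weight edge.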
14Hence...
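Consider F = {a, b, c, d} with a = TACGA, b = ACCC, c = CTAAAG, d = GACA
[Figure: overlap graph OM(F) showing the positive-weight edges a → d (t = 2, since the suffix GA of TACGA is a prefix of GACA) and a → b, b → c, c → d, d → b (t = 1 each)]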
15Properties of the Graph
- A node must not have an edge to itself (no self-loops)
- We keep the zero-weight edges too
- (hence n(n-1) zero-weight edges)
- We denote the path P1 = d → b → c such that
- G A C A _ _ _ _ _ _ _ _
- _ _ _ A C C C _ _ _ _ _
- _ _ _ _ _ _ C T A A A G
a = TACGA, b = ACCC, c = CTAAAG, d = GACA
16Properties of the Graph
- If A is the set of fragments involved in P, we will have exactly $|A| - 1$ edges in P
- The common superstring will be called S(P)
- The relationship between the total length of A, the path's weight and the superstring length is $\|A\| = w(P) + |S(P)|$
17Picture is worth .
G A C A _ _ _ _ _ _ _ _
_ _ _ A C C C _ _ _ _ _
_ _ _ _ _ _ C T A A A G
$w(P) = 2$, $|S(P)| = 12$, $\|A\| = 14$
S(P) = G A C A C C C T A A A G
18Mathematical Representation
Computing length: $|S(P)| = \|A\| - w(P)$
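The length relation can be verified on the path d → b → c used above. This is a small sketch, assuming the heaviest edge between consecutive fragments is always taken; `max_overlap` and `path_superstring` are illustrative helpers.

```python
def max_overlap(a, b):
    """Largest t < min(len(a), len(b)) with suffix(a, t) == prefix(b, t)."""
    for t in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:t]):
            return t
    return 0

def path_superstring(path):
    """Glue the fragments of a path together; return (S(P), w(P))."""
    s, weight = path[0], 0
    for a, b in zip(path, path[1:]):
        t = max_overlap(a, b)
        s += b[t:]          # append only the part of b not covered by the overlap
        weight += t
    return s, weight

path = ["GACA", "ACCC", "CTAAAG"]        # d -> b -> c
s, w = path_superstring(path)
total = sum(len(f) for f in path)        # ||A||
print(s, w, total)                       # GACACCCTAAAG 2 14
print(total == w + len(s))               # True
```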
19Hamiltonian Path
- Each path we traverse gives a common superstring of the fragments involved
- If there is a path that passes through each vertex, we have a common superstring containing all fragments of F
- Hence $|S(P)| = \|F\| - w(P)$
- We have been talking about Hamiltonian paths
- Observation: since $\|F\|$ is constant, minimizing $|S(P)|$ is equivalent to maximizing w(P)
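For a handful of fragments the maximum-weight Hamiltonian path can simply be found by trying every ordering. The sketch below is a brute-force illustration only (it is exponential in the number of fragments); the helper names are assumptions made for this example.

```python
from itertools import permutations

def max_overlap(a, b):
    """Largest t < min(len(a), len(b)) with suffix(a, t) == prefix(b, t)."""
    for t in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:t]):
            return t
    return 0

def best_hamiltonian_path(fragments):
    """Try every ordering of the fragments and keep the heaviest path."""
    best_weight, best_order = -1, None
    for order in permutations(fragments):
        w = sum(max_overlap(a, b) for a, b in zip(order, order[1:]))
        if w > best_weight:
            best_weight, best_order = w, order
    return best_weight, best_order

F = ["TACGA", "ACCC", "CTAAAG", "GACA"]
w, order = best_hamiltonian_path(F)
print(w, order)
print(sum(len(f) for f in F) - w)   # length of the corresponding superstring
```

For this F the heaviest path is a → d → b → c with weight 4, so the superstring has 19 - 4 = 15 characters.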
20Hamiltonian Graph
http://mathworld.wolfram.com/HamiltonianGraph.html
21Backtracking
- Every path corresponds to a superstring
- Is the converse true? NO
- Consider shortest superstrings
- YES, they will always correspond to a path
22Important results
- def. A collection of fragments is substring-free if there are no two distinct strings in the collection such that one of them is a substring of the other
- For example,
- F = {GAA, CTGA, AGTCACGCAA} is substring-free
- but,
- F = {ACTGAA, TGCA, CTGA, ACGA} is not (CTGA is a substring of ACTGAA)
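The definition can be tested directly with Python's substring operator. A minimal sketch, with `is_substring_free` as an illustrative name:

```python
def is_substring_free(fragments):
    """True if no fragment is a substring of a different fragment."""
    return not any(a != b and a in b for a in fragments for b in fragments)

print(is_substring_free(["GAA", "CTGA", "AGTCACGCAA"]))      # True
print(is_substring_free(["ACTGAA", "TGCA", "CTGA", "ACGA"]))  # False
```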
23Important results
- THEOREM 1: Let F be a substring-free collection. Then for every common superstring S of F there is a Hamiltonian path P in OM(F) such that S(P) is a subsequence of S
- Example: F = {tag, cat, gac}
- S(P): tagacat / tagcatgac / catagac / ...
- S: any superstring (with/without faulty characters)
24Important results
[Figure: overlap graph on the nodes TAG, GAC and CAT, with edge weights 0 and 1]
S(P): tagacat / tagcatgac / catagac / ...
25Important results
- def. F dominates another collection G if every element of G is a substring of some element of F.
- For example,
- F = {ATGA, CATAGAA, GTACTAA}
- G = {TAG, CAT, TACT}
26Important results
- def. F and G are equivalent if F dominates G and G dominates F
- F = {TGA, CTTGAA, ACTTGAA}
- G = {TTGA, ACTTGAA}
27Important results
- LEMMA: Two equivalent substring-free collections are identical
- F = {TGA, CTTGAA, ACTTGAA}
- G = {TTGA, ACTTGAA}
- After removing substrings: F = {ACTTGAA}
- After removing substrings: G = {ACTTGAA}
28Important results
- THEOREM 2: Let F be a collection of strings. There is a unique substring-free collection G equivalent to F
- This result implies that if we are looking for common superstrings, then we just have to look at substring-free collections, since every collection has one equivalent to it.
- Hence, just removing all strings from F that are substrings of other elements in F solves our purpose
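[Diagram: every Hamiltonian path yields a common superstring; for a substring-free collection, a shortest common superstring (SCS) corresponds to a Hamiltonian path]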
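Dominance, equivalence and the reduction to a substring-free collection can be sketched in a few lines. The function names below are illustrative assumptions; the example collections are the ones used on the preceding slides.

```python
def dominates(F, G):
    """F dominates G if every element of G is a substring of some element of F."""
    return all(any(g in f for f in F) for g in G)

def equivalent(F, G):
    return dominates(F, G) and dominates(G, F)

def substring_free(F):
    """Drop every string that is a substring of a different string in F."""
    unique = set(F)
    return {f for f in unique if not any(f != g and f in g for g in unique)}

F = {"TGA", "CTTGAA", "ACTTGAA"}
G = {"TTGA", "ACTTGAA"}
print(equivalent(F, G))                        # True
print(substring_free(F), substring_free(G))    # {'ACTTGAA'} {'ACTTGAA'}
```

The two reduced collections come out identical, as the lemma predicts for equivalent substring-free collections.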
29GREEDY ALGORITHM
- Looking for shortest common superstrings is the same as looking for Hamiltonian paths of maximum weight in a directed multigraph
- Since only the heaviest edges are required, we may prune the weaker ones, saving time and space
- This is a greedy attempt
30GREEDY ALGORITHM
- Let's look at an example
- Considering previous graph
31Hence...
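Consider F = {a, b, c, d} with a = TACGA, b = ACCC, c = CTAAAG, d = GACA
[Figure: the overlap graph of F from slide 14, repeated as the example for the greedy algorithm, with the nodes labelled TACGA, GACA, ACCC and CTAAAG]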
32IMPLEMENTATION
- Graphs are easier to understand, but not necessary
- We can implement the greedy algorithm by repeating the following procedure until only one fragment remains.
33IMPLEMENTATION
- Take the pair (f, g) of fragments with the largest overlap, say t.
- Remove both fragments from F and add $f \kappa^t g$ to F
- Assumption: F is substring-free
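The procedure can be sketched without building a graph at all. This is a minimal, unoptimized illustration, assuming a substring-free input and an arbitrary (index-based) tie-break between equally good pairs; `greedy_superstring` is a hypothetical name.

```python
def max_overlap(a, b):
    """Largest t < min(len(a), len(b)) with suffix(a, t) == prefix(b, t)."""
    for t in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:t]):
            return t
    return 0

def greedy_superstring(fragments):
    """Repeatedly merge the pair with the largest overlap until one string is left."""
    frags = list(fragments)
    while len(frags) > 1:
        n = len(frags)
        # Ties are broken by the larger indices, which is an arbitrary choice.
        t, i, j = max(
            (max_overlap(frags[i], frags[j]), i, j)
            for i in range(n) for j in range(n) if i != j
        )
        merged = frags[i] + frags[j][t:]     # f followed by g minus its first t characters
        frags = [f for k, f in enumerate(frags) if k not in (i, j)] + [merged]
    return frags[0]

print(greedy_superstring(["TACGA", "ACCC", "CTAAAG", "GACA"]))  # TACGACACCCTAAAG
```

Recomputing all pairwise overlaps in every round keeps the sketch short; a practical implementation would compute them once and update them after each merge.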
34Greed seldom pays
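Consider F = {a, b, c} with a = ATGC, b = TGCAT, c = GCC
[Figure: overlap graph with edges a → b (weight 3), b → a (weight 2), a → c (weight 2) and b → c (weight 0)]
The greedy algorithm follows the path a → b → c, giving a total weight of 3, including a zero-weight edge
The best path would have been b → a → c, giving a weight of 4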
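The gap between the greedy choice and the optimum can be reproduced on this example. The sketch below assumes a graph-based greedy (edges taken by decreasing weight, rejecting any edge that would give a node two successors, two predecessors, or close a cycle) and compares it with an exhaustive search; all names are illustrative.

```python
from itertools import permutations

def max_overlap(a, b):
    """Largest t < min(len(a), len(b)) with suffix(a, t) == prefix(b, t)."""
    for t in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:t]):
            return t
    return 0

def path_weight(order):
    return sum(max_overlap(a, b) for a, b in zip(order, order[1:]))

def greedy_path(fragments):
    """Select edges by decreasing weight while the selection stays a set of paths."""
    edges = sorted(
        ((max_overlap(a, b), a, b) for a, b in permutations(fragments, 2)),
        key=lambda e: -e[0],
    )
    nxt, prev = {}, {}
    for _, a, b in edges:
        if a in nxt or b in prev:
            continue
        node = b                      # follow the chain starting at b
        while node in nxt:
            node = nxt[node]
        if node == a:                 # the edge a -> b would close a cycle
            continue
        nxt[a], prev[b] = b, a
    order = [next(f for f in fragments if f not in prev)]
    while order[-1] in nxt:
        order.append(nxt[order[-1]])
    return order

F = ["ATGC", "TGCAT", "GCC"]
greedy = greedy_path(F)
best = max(permutations(F), key=path_weight)
print(greedy, path_weight(greedy))   # ['ATGC', 'TGCAT', 'GCC'] 3
print(list(best), path_weight(best)) # ['TGCAT', 'ATGC', 'GCC'] 4
```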
35HEURISTICS
- Motivation
- Algorithms like Greedy do not guarantee optimal solutions
- Solving Shortest Common Superstring through Hamiltonian paths is NP-complete
- Closeness of the problem to the multiple alignment problem
36HEURISTICS
- The following heuristics mostly aim at solving two major problems
- Fragments can participate with either the direct or the reverse-complemented sequence
- Fragments themselves are usually much shorter than the alignment
37HEURISTICS
- This calls for discrimination between the treatment of different types of gaps (internal, external), to bind the fragment characters
- It also urges us to consider other criteria, besides the score, to assess the quality of an alignment
38Assessing Criteria
- SCORING
- Measure of participation of each aligned column in a multiple alignment
- Entropy may be used for that
- Entropy measures the uniformity of alignment
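One way to make the entropy score concrete is the Shannon entropy of the characters in a column. The sketch below is an assumption about the exact definition (base-2 logarithm, spaces treated like any other character); the slides do not spell it out.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (in bits) of the characters in one alignment column."""
    counts = Counter(column)
    n = len(column)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

print(column_entropy("AAAA"))  # 0.0, a perfectly uniform column
print(column_entropy("AACG"))  # 1.5, a mixed column
```

Under this definition a lower entropy means a more uniform column, so a good layout keeps the per-column entropies small.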
39Assessing Criteria
- COVERAGE
- A fragment covers a column i if it participates in this column either with a character or with an internal space
- Minimum, maximum and mean coverage can be calculated for a layout
- The lesser the coverage, the weaker the connection and the greater the independence
- Coverage bolsters/weakens the notion of a consensus sequence
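Coverage statistics are easy to compute once a layout is fixed. This sketch assumes the layout is a list of (offset, fragment) pairs, with any internal spaces already written into the fragment string, and reuses the four-fragment layout from slide 5; `coverage` is an illustrative name.

```python
def coverage(layout):
    """Number of fragments covering each column of the layout."""
    width = max(off + len(frag) for off, frag in layout)
    cov = [0] * width
    for off, frag in layout:
        for i in range(off, off + len(frag)):
            cov[i] += 1
    return cov

layout = [(2, "CTAGG"), (4, "AGGTA"), (0, "GACTA"), (3, "TAGG")]
cov = coverage(layout)
print(cov)                                       # [1, 1, 2, 3, 4, 3, 3, 1, 1]
print(min(cov), max(cov), sum(cov) / len(cov))   # minimum, maximum, mean
```

A column with coverage 0 would split the layout into independent pieces, and columns with coverage 1 give the consensus no redundancy to vote with.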
40Assessing Criteria
- LINKAGE
- Measure of the way individual fragments are linked to each other in the layout
- Fragments should have overlapping ends to show some linkage
41Summing up
- For practical implementation we may divide the process into three phases
- Finding overlaps
- Building a layout
- Computing the consensus
42Greedy Path Merge Algorithm
Clone-by-Clone Approach (by Human Genome project)
Whole Genome Shotgun (Celera Genomics)
Preliminary Data
Heuristic Greedy Path Merge
43Greedy Path Merge Algorithm
- HGP's clone-by-clone approach
- Constructs a tiling of the genome by overlapping pieces (about 150k bp)
- Concentrates on determining the sequence of each such piece
- Pieces are called BAC clones or simply BACs (Bacterial Artificial Chromosomes)
- A BAC is randomly broken into many smaller fragments that are cloned and sequenced
- The fragments are assembled, resulting in contigs
44Greedy Path Merge Algorithm
- WGS strategy
- The whole genome is randomly broken into smaller pieces that are sequenced
- Due to the size of the genome, a pure overlap-based approach is not feasible
- Additionally, Celera uses mate-pairs. These are fragments with a known relative distance and a standard deviation (a sort of experimental data from sampling and sequencing large chunks of DNA)
45Greedy Path Merge Algorithm
- def. Bactigs are pieces of DNA sequence that are obtained from a common source region of about 150k bp of contiguous DNA of the human genome by shotgun sequencing and assembly
- BACs start as phase-0 BACs, which means several bactigs of small size. They evolve up to phase-3 BACs, which means one bactig fully representing its source sequence
46Greedy Path Merge Algorithm
- This approach takes up the BACs and Celera's fragments and primarily aims at increasing the level of assembly of BACs using the information given by fragments and mate links
- Hence evolving phase-1/2 BACs to phase-3 BACs
47Greedy Path Merge Algorithm
- INITIAL BACTIG GRAPH
- BAC B = {B1, B2, ..., Bn}. A fragment f hits (or is embedded in) a bactig Bi if it (or its reverse complement) aligns with the bactig with a high density
- The bactig graph is a weighted, undirected multi-graph without self-loops
[Figure label: mate edge]
48Greedy Path Merge Algorithm
- Functions performed on Initial Bactig Graph
- Edge bundling: for more than one mate pair
- Transitive reduction: transitively reducing long mate edges
- Hence the final bactig graph is the initial one with edge bundling and transitive reduction performed on it
49Greedy Path Merge Algorithm
- BACTIG ORDERING PROBLEM
- Given a bactig graph G, find an ordering of G that maximizes the weight of valid (happy) mate edges
- The problem is NP-complete
- From here on we proceed almost exactly as in the normal greedy algorithm
50Greedy Path Merge Algorithm
- Throughout the implementation it is maintained that every node is adjacent to at most two selected edges. These edges form the selected path.
- The ordering of bactigs induced by such a path is called a scaffolding of bactigs
- The deviation from the normal greedy algorithm is that it can introduce new edges to the bactig graph, which are called inferred edges
52Greedy Path Merge Algorithm
- Given a bactig graph G, the output of the algorithm is a node-disjoint covering of G by selected paths, each one defining an ordering of the bactigs whose edges it covers
- The algorithm runs in O(mnm2) time, where m is the number of mate pair edges and n is the number of bactig edges
53THANK YOU