Assembling Algorithms and Techniques - PowerPoint PPT Presentation

1
Assembling Algorithms and Techniques
  • Upmanyu Misra
  • Computational Issues in Molecular Biology
  • CSE 397-497

2
Motivation
  • Direct sequencing of full stretches of DNA base
    pairs is not possible (as of now)
  • Hence, we sample DNA stretches by breaking them
    in fragments
  • Sequencing of fragments is done by previously
    discussed algorithms
  • Fragments are to be assembled back to make sense
    of the full DNA stretches

3
Approach
  • Primitive (Root) Techniques
  • Models used
  • Algorithms: Greedy/Multicontig
  • Heuristics
  • Greedy Path Merge Heuristics
  • Basics/Examples/Definitions wherever required

4
Refreshing Basics
  • Consider 4 strings (fragments) to be assembled
  • CTAGG
  • AGGTA
  • GACTA
  • TAGG
  • We know that final sequence size should be around
    10 base pairs

5
How Assembling works
  • _ _ C T A G G _ _
  • _ _ _ _ A G G T A
  • G A C T A _ _ _ _
  • _ _ _ T A G G _ _

G A C T A G G T A
CONSENSUS
6
Trivial, eh?
  • Consensus sequence easily achieved
  • Very close to our goal of finding a 10 base
    pair chain
  • A combination was found such that each column was
    coherent
  • The consensus contained each fragment as an exact
    substring
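
The two properties above (coherent columns, every fragment an exact substring of the consensus) can be checked mechanically. The following is a minimal sketch, assuming the layout from the "How Assembling works" slide is written as padded strings with "_" as the padding character; the function name `consensus` and the majority-vote rule are illustrative assumptions, not something the slides prescribe.

```python
from collections import Counter

GAP = "_"  # padding character used in the layout rows (assumption)

def consensus(rows):
    """Build a consensus by majority vote in every column, ignoring padding."""
    width = max(len(r) for r in rows)
    padded = [r.ljust(width, GAP) for r in rows]
    out = []
    for col in zip(*padded):
        bases = [ch for ch in col if ch != GAP]
        if bases:  # skip columns that no fragment covers
            out.append(Counter(bases).most_common(1)[0][0])
    return "".join(out)

layout = [
    "__CTAGG__",
    "____AGGTA",
    "GACTA____",
    "___TAGG__",
]
result = consensus(layout)
fragments = [row.strip(GAP) for row in layout]

print(result)                               # GACTAGGTA
print(all(f in result for f in fragments))  # True
```

Because every column here holds a single distinct base, the vote is unanimous; with noisy fragments the same function would pick the most frequent base per column.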

7
Size does matter
  • Errors creep in when we consider the real life
    scenario
  • Size of sequence increases
  • Size of fragments increases
  • Noise
  • Loss of data
  • Larger consensus goals

8
Pandora's Box
  • Errors might be of the following type (and many
    more)
  • Base call errors: insertion, deletion, substitution
  • Chimeras
  • Unknown Orientations
  • Repeated Regions
  • Lack of coverage

9
Primitive Solutions
  • Assembling for Shotgun fragments
  • Directed Sequencing or Walking
  • Dual End Sequencing
  • Non-Shotgun (based on experimental data)
  • Sequencing by Hybridization
  • Good news: they provide headway into the assembly
    problem
  • Bad news: none of them, by itself, is good
    enough, not even close

10
Usable Models
  • Real-world problems: contamination, chimeras, experimental errors, orientation errors, repeated sections, lack of coverage
  • Models: Shortest Common Superstring, Reconstruction, Multicontig
  • All three models are NP-hard
11
Algorithms
  • Again, these algorithms are of little value by
    themselves. Good supporting heuristics help to
    make them usable
  • Two different approaches
  • GREEDY
  • MULTICONTIG (acyclic overlap graphs)

12
Getting Started
  • Linking overlaps to graphs. Why?
  • Superstrings can be matched on to Graphs
  • We (the Computer Science community!) come across
    Graphs frequently
  • Comfortable (logically, psychologically,
    cognitively)

13
Definitions
  • def. The overlap multigraph OM(F) of a collection F
    is the directed, weighted multigraph defined as
    follows. The set V of nodes is the set F. A directed
    edge from $a \in F$ to a different fragment $b \in F$
    with weight $t \ge 0$ exists if the suffix of a with
    t characters is a prefix of b:

$\mathrm{suffix}(a, t) = \mathrm{prefix}(b, t)$
$\kappa$ is the killer agent such that $\kappa ATCG = TCG$, with $|\kappa| = -1$
14
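Hence...
Consider F = {a, b, c, d} with a = TACGA, b = ACCC, c = CTAAAG, d = GACA
[Figure: overlap multigraph OM(F) on the nodes a, b, c, d. The edge from a (TACGA) to d (GACA) has weight t = 2; the other nonzero edges have weight 1.]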
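
The edges of the overlap multigraph for this collection can be enumerated directly from the definition. This is a small sketch under the assumption that fragments are plain strings keyed by their labels; the helper names `overlaps` and `overlap_multigraph` are made up for illustration.

```python
def overlaps(a, b):
    """All t >= 0 such that the suffix of a with t characters is a prefix of b."""
    return [t for t in range(min(len(a), len(b)) + 1)
            if a[len(a) - t:] == b[:t]]

def overlap_multigraph(fragments):
    """Return the edge list (source, target, weight) of OM(F), without self-loops."""
    edges = []
    for x, a in fragments.items():
        for y, b in fragments.items():
            if x != y:
                edges.extend((x, y, t) for t in overlaps(a, b))
    return edges

F = {"a": "TACGA", "b": "ACCC", "c": "CTAAAG", "d": "GACA"}
edges = overlap_multigraph(F)
positive = sorted(e for e in edges if e[2] > 0)
zero_edges = [e for e in edges if e[2] == 0]

print(positive)
# [('a', 'b', 1), ('a', 'd', 2), ('b', 'c', 1), ('c', 'd', 1), ('d', 'b', 1)]
print(len(zero_edges))  # 12, one zero-weight edge per ordered pair
```

The quadratic scan over all pairs is fine for a slide-sized example; real assemblers use indexing to avoid comparing every pair.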
15
Properties of the Graph
  • A node must not have an edge to itself (no self-loops)
  • We keep the zero-weight edges too
  • (hence at least n(n-1) edges)
  • We denote by P1 the path d → b → c, which gives the layout
  • G A C A _ _ _ _ _ _ _ _
  • _ _ _ A C C C _ _ _ _ _
  • _ _ _ _ _ _ C T A A A G

a = TACGA, b = ACCC, c = CTAAAG, d = GACA
16
Properties of the Graph
  • If A is the set of fragments involved in P, we
    will have exactly $|A| - 1$ edges in P
  • The common superstring will be called S(P)
  • The relationship between the total length of A, the
    path's weight and the superstring length is
  • $\|A\| = w(P) + |S(P)|$
17
A picture is worth ...
G A C A _ _ _ _ _ _ _ _
_ _ _ A C C C _ _ _ _ _
_ _ _ _ _ _ C T A A A G
$w(P) = 2$
$|S(P)| = 12$
$\|A\| = 14$
S(P) = G A C A C C C T A A A G
18
Mathematical Representation
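  • Or, we may say,
  • if $\|A\| = \sum_{a \in A} |a|$ is the total length of the fragments in A, then computing the superstring length gives

$|S(P)| = \|A\| - w(P)$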
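
The length relation can be verified on the path d → b → c used in the preceding slides. The sketch below assumes each step of the path uses the heaviest edge between consecutive fragments; `max_overlap` and `path_superstring` are hypothetical helper names.

```python
def max_overlap(a, b):
    """Largest t such that the suffix of a with t characters is a prefix of b."""
    for t in range(min(len(a), len(b)), 0, -1):
        if a[len(a) - t:] == b[:t]:
            return t
    return 0

def path_superstring(path):
    """Return (S(P), w(P)) for a path given as a list of fragments."""
    s, w = path[0], 0
    for a, b in zip(path, path[1:]):
        t = max_overlap(a, b)
        w += t
        s += b[t:]  # append b without the t overlapping characters
    return s, w

P = ["GACA", "ACCC", "CTAAAG"]  # the path d -> b -> c
S, w = path_superstring(P)
total = sum(len(f) for f in P)  # total length of the fragments in A

print(S, w, total)            # GACACCCTAAAG 2 14
print(len(S) == total - w)    # True
```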
19
Hamiltonian Path
  • Each path we traverse gives a common superstring
    of the fragments involved
  • If there is a path that passes through each
    vertex, we have a common superstring containing
    all fragments of F
  • Hence $|S(P)| = \|F\| - w(P)$
  • We have been talking about Hamiltonian paths
  • Observation: since $\|F\|$ is constant, minimizing
    $|S(P)|$ is equivalent to maximizing w(P)

20
Hamiltonian Graph

http://mathworld.wolfram.com/HamiltonianGraph.html
21
Backtracking
  • Every path corresponds to a superstring
  • Is the converse true? NO
  • Consider shortest superstrings
  • YES, they will always correspond to a path
22
Important results
  • def. A collection of fragments is
    substring-free if there are no two distinct
    strings in the collection such that one of
    them is a substring of the other
  • For example,
  • F = {GAA, CTGA, AGTCACGCAA} is substring-free
  • but,
  • F = {ACTGAA, TGCA, CTGA, ACGA} is not
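
The definition translates into a direct pairwise test. This is a minimal sketch; the function name `is_substring_free` is an illustrative assumption.

```python
def is_substring_free(fragments):
    """True if no fragment is a substring of a different fragment."""
    return not any(a != b and a in b for a in fragments for b in fragments)

F1 = ["GAA", "CTGA", "AGTCACGCAA"]
F2 = ["ACTGAA", "TGCA", "CTGA", "ACGA"]

print(is_substring_free(F1))  # True
print(is_substring_free(F2))  # False, CTGA occurs inside ACTGAA
```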

23
Important results
  • THEOREM 1: Let F be a substring-free collection.
    Then for every common superstring S of F there is
    a Hamiltonian path P in OM(F) such that S(P) is a
    subsequence of S
  • Example: F = {tag, cat, gac}
  • S(P) = tagacat / tagcatgac / catagac / ...
  • S = any superstring (with/without faulty
    characters)

24
Important results
[Figure: overlap multigraph on the nodes TAG, GAC and CAT, with edges of weight 1 and 0]
S(P) = tagacat / tagcatgac / catagac / ...
25
Important results
  • def. F dominates another collection G if every
    element of G is a substring of some element of
    F.
  • For example,
  • F = {ATGA, CATAGAA, GTACTAA}
  • G = {TAG, CAT, TACT}

26
Important results
  • def. F and G are equivalent if F dominates G
    and G dominates F
  • F = {TGA, CTTGAA, ACTTGAA}
  • G = {TTGA, ACTTGAA}
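
Domination and equivalence can be checked with two short predicates. The sketch below uses the example collections from the two definitions; the function names are hypothetical.

```python
def dominates(F, G):
    """True if every element of G is a substring of some element of F."""
    return all(any(g in f for f in F) for g in G)

def equivalent(F, G):
    """True if F dominates G and G dominates F."""
    return dominates(F, G) and dominates(G, F)

# Example from the definition of domination
F1 = ["ATGA", "CATAGAA", "GTACTAA"]
G1 = ["TAG", "CAT", "TACT"]
print(dominates(F1, G1))   # True
print(dominates(G1, F1))   # False

# Example from the definition of equivalence
F2 = ["TGA", "CTTGAA", "ACTTGAA"]
G2 = ["TTGA", "ACTTGAA"]
print(equivalent(F2, G2))  # True
```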

27
Important results
  • LEMMA: Two equivalent substring-free collections
    are identical
  • F = {TGA, CTTGAA, ACTTGAA}
  • G = {TTGA, ACTTGAA}
  • Substring-free equivalent of F: {ACTTGAA}
  • Substring-free equivalent of G: {ACTTGAA}

28
Important results
  • THEOREM 2: Let F be a collection of strings. Then
    there is a unique substring-free collection G
    equivalent to F
  • This result implies that if we are looking for
    common superstrings, then we just have to look
    for substring-free collections, since every
    collection will have one equivalent to it.
  • Hence, just removing all strings from F that are
    substrings of other elements in F solves our
    purpose
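
Removing every string that is a substring of another element can be sketched in a few lines. The helper name `substring_free` is an assumption made for illustration; duplicates are collapsed first so that identical strings do not eliminate each other.

```python
def substring_free(fragments):
    """Return the substring-free collection equivalent to the input."""
    unique = set(fragments)  # identical strings count once
    return {f for f in unique
            if not any(f != g and f in g for g in unique)}

F = ["TGA", "CTTGAA", "ACTTGAA"]
G = ["TTGA", "ACTTGAA"]

print(substring_free(F))                       # {'ACTTGAA'}
print(substring_free(F) == substring_free(G))  # True
```

Both example collections reduce to the same set, as the lemma on equivalent substring-free collections predicts.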

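[Diagram: every Hamiltonian path gives a common superstring; for a substring-free collection, a shortest common superstring (SCS) also gives a Hamiltonian path]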
29
GREEDY ALGORITHM
  • Looking for shortest common superstrings is the same
    as looking for Hamiltonian paths of maximum
    weight in a directed multigraph
  • Since only heaviest edges are required, we may
    prune the weaker ones, saving time and space
  • This is a greedy attempt

30
GREEDY ALGORITHM
  • Let's look at an example
  • Considering previous graph

31
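Hence...
Consider F = {a, b, c, d} with a = TACGA, b = ACCC, c = CTAAAG, d = GACA
[Figure: the same overlap multigraph, with the greedy path TACGA → GACA → ACCC → CTAAAG highlighted. The edge from TACGA to GACA has weight t = 2; the other two selected edges have weight 1.]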
32
IMPLEMENTATION
  • Graphs are easier to understand, but not
    necessary
  • We can implement the greedy algorithm by
    repeatedly applying the following procedure
    until only one fragment remains.

33
IMPLEMENTATION
  • Take the pair (f, g) of fragments with the largest
    overlap, say t.
  • Remove both fragments from F and add $f\kappa^{t}g$ to F
    (f followed by g without its first t characters)
  • Assumption: F is substring-free
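
The merge procedure above can be sketched as a loop. This is an illustrative version, not the presenter's implementation: it assumes a substring-free input, breaks ties by taking the first best pair found, and the names `max_overlap` and `greedy_superstring` are hypothetical.

```python
def max_overlap(a, b):
    """Largest t such that the suffix of a with t characters is a prefix of b."""
    for t in range(min(len(a), len(b)), 0, -1):
        if a[len(a) - t:] == b[:t]:
            return t
    return 0

def greedy_superstring(fragments):
    """Repeatedly merge the pair with the largest overlap until one string remains."""
    F = list(fragments)
    while len(F) > 1:
        # Pick the ordered pair (i, j) with the largest overlap t.
        t, i, j = max(
            ((max_overlap(F[i], F[j]), i, j)
             for i in range(len(F)) for j in range(len(F)) if i != j),
            key=lambda cand: cand[0],
        )
        merged = F[i] + F[j][t:]  # f followed by g without its first t characters
        F = [f for k, f in enumerate(F) if k not in (i, j)] + [merged]
    return F[0]

fragments = ["TACGA", "ACCC", "CTAAAG", "GACA"]
result = greedy_superstring(fragments)
print(result)       # TACGACACCCTAAAG
print(len(result))  # 15
```

Recomputing all pairwise overlaps in every round keeps the sketch short; a practical implementation would cache them.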

34
Greed seldom pays
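Consider F = {a, b, c} with a = ATGC, b = TGCAT, c = GCC
[Figure: overlap graph with edges a → b (weight 3), a → c (weight 2), b → a (weight 2) and b → c (weight 0)]
The greedy algorithm follows the path a → b → c,
giving a total weight of 3, including a zero-weight
edge
The best path would have been b → a → c, giving a
weight of 4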
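
The gap between the greedy path and the best path can be reproduced by scoring every ordering of the three fragments. This brute-force sketch is only practical for tiny examples; the helper names are assumptions made for illustration.

```python
from itertools import permutations

def max_overlap(a, b):
    """Largest t such that the suffix of a with t characters is a prefix of b."""
    for t in range(min(len(a), len(b)), 0, -1):
        if a[len(a) - t:] == b[:t]:
            return t
    return 0

def path_weight(path):
    """Sum of the overlaps between consecutive fragments on the path."""
    return sum(max_overlap(a, b) for a, b in zip(path, path[1:]))

def superstring(path):
    """Spell out the common superstring defined by the path."""
    s = path[0]
    for a, b in zip(path, path[1:]):
        s += b[max_overlap(a, b):]
    return s

F = ["ATGC", "TGCAT", "GCC"]              # a, b, c
greedy_path = ("ATGC", "TGCAT", "GCC")    # a -> b -> c, as on the slide
best_path = max(permutations(F), key=path_weight)

print(path_weight(greedy_path), superstring(greedy_path))  # 3 ATGCATGCC
print(best_path)                                           # ('TGCAT', 'ATGC', 'GCC')
print(path_weight(best_path), superstring(best_path))      # 4 TGCATGCC
```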
35
HEURISTICS
  • Motivation
  • - Algorithms like Greedy do not
    guarantee optimal solutions
  • - Solving Shortest Common Superstrings
    through Hamiltonian paths is NP-Complete
  • - Closeness of problem to Multiple alignment
    problem

36
HEURISTICS
  • The following heuristics mostly aim at solving
    two major problems
  • Fragments can participate with either their direct or
    their reverse-complemented sequence
  • Fragments themselves are usually much shorter
    than the alignment
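
The first problem means that each fragment has to be considered in two orientations. A minimal sketch of the reverse complement over the alphabet A, C, G, T follows; the function name is an assumption, and ambiguity codes such as N are deliberately not handled.

```python
# Translation table mapping each base to its complement
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]

fragment = "TACGA"
rc = reverse_complement(fragment)
print(rc)                                  # TCGTA
print(reverse_complement(rc) == fragment)  # True
```

An assembler that does not know the orientation would compute overlaps against both `fragment` and `rc`.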

37
HEURISTICS
  • The shortness of the fragments calls for
    discriminating between different types of gaps
    (internal, external) to bind the fragment characters
  • It also urges us to consider other criteria, besides
    the score, to assess the quality of an alignment

38
Assessing Criteria
  • SCORING
  • Measure of participation of each aligned column
    in a multiple alignment
  • Entropy may be used for that
  • Entropy measures the uniformity of alignment
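
One way to score a column by entropy is the Shannon entropy of its character frequencies. The sketch below is an assumption about how such a score could be computed; it treats every character in the column, including a gap symbol if present, as an ordinary symbol.

```python
from collections import Counter
from math import log2

def column_entropy(column):
    """Shannon entropy (in bits) of the characters in one aligned column."""
    counts = Counter(column)
    n = len(column)
    return -sum((c / n) * log2(c / n) for c in counts.values())

uniform = column_entropy("AAAA")  # all fragments agree
split = column_entropy("AACC")    # two bases, evenly split
mixed = column_entropy("ACGT")    # every fragment disagrees

print(split, mixed)  # 1.0 2.0
```

A column where all fragments agree scores zero, and the score grows as the column becomes less uniform.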

39
Assessing Criteria
  • COVERAGE
  • A fragment covers a column i if it participates
    in this column either with a character or with an
    internal space
  • Minimum, maximum and mean coverage can be
    calculated for a layout
  • The lower the coverage, the weaker the connection
    and the greater the independence
  • Coverage bolsters/weakens the notion of consensus
    sequence
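
Coverage per column can be computed from a layout by counting, for each column, the fragments whose span includes it. The sketch assumes layout rows padded with "_" outside the fragment, so that anything between the first and last fragment character (including an internal space) counts as covered; the layout is the one from the "How Assembling works" slide.

```python
def coverage(rows, pad="_"):
    """Number of fragments covering each column of the layout."""
    width = max(len(r) for r in rows)
    cov = [0] * width
    for r in rows:
        cols = [i for i, ch in enumerate(r) if ch != pad]
        if cols:
            # A fragment covers every column from its first to its last character.
            for i in range(cols[0], cols[-1] + 1):
                cov[i] += 1
    return cov

layout = [
    "__CTAGG__",
    "____AGGTA",
    "GACTA____",
    "___TAGG__",
]
cov = coverage(layout)
mean_cov = sum(cov) / len(cov)

print(cov)                 # [1, 1, 2, 3, 4, 3, 3, 1, 1]
print(min(cov), max(cov))  # 1 4
```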

40
Assessing Criteria
  • Linkage
  • Measure of the way individual fragments are
    linked to each other in the layout
  • Fragments should have overlapping ends to show
    some linkage

41
Summing up
  • For practical implementation we may divide the
    process in three phases
  • Finding overlaps
  • Building a layout
  • Computing the consensus

42
Greedy Path Merge Algorithm
  • Clone-by-clone approach (Human Genome Project)
  • Whole Genome Shotgun (Celera Genomics)
  • Both supply the preliminary data
  • Heuristic: Greedy Path Merge
43
Greedy Path Merge Algorithm
  • HGP's clone-by-clone approach
  • Constructs a tiling of the genome by overlapping
    pieces (about 150k bp)
  • Concentrates on determining the sequence of each
    such piece
  • Pieces are called BAC clones or simply BACs
    (Bacterial Artificial Chromosomes)
  • BAC is randomly broken into many smaller
    fragments that are cloned and sequenced
  • The fragments are assembled resulting in Contigs

44
Greedy Path Merge Algorithm
  • WGS strategy
  • Whole Genome is randomly broken into smaller
    pieces that are sequenced
  • Due to size of Genome a pure overlap-based
    approach is not feasible
  • Additionally, Celera uses mate-pairs. These are
    pairs of fragments with a known relative distance and
    standard deviation (a sort of experimental data
    from sampling and sequencing large chunks of DNA)

45
Greedy Path Merge Algorithm
  • def. Bactigs are pieces of DNA sequence that
    are obtained from a common source region of
    about 150k bp of contiguous DNA of the human genome,
    obtained by shotgun sequencing and assembly
  • BACs start as phase-0 BACs, which means several
    bactigs of small size. They evolve up to phase-3
    BACs, each of which is one bactig fully representing
    its source sequence

46
Greedy Path Merge Algorithm
  • This approach takes up the BACs and Celera's
    fragments and primarily aims at increasing the
    level of assembly of BACs using the information
    given by fragments and mate links
  • Hence evolving phase-1/2 BACs to phase-3 BACs

47
Greedy Path Merge Algorithm
  • INITIAL BACTIG GRAPH
  • BAC B = {B1, B2, ..., Bn}. A fragment f hits (or is
    embedded in) a bactig Bi if it (or its reverse
    complement) aligns with the bactig with a high
    density
  • Bactig graph is a weighted, undirected
    multi-graph without self-loops

[Figure: bactig graph with a mate edge]
48
Greedy Path Merge Algorithm
  • Functions performed on Initial Bactig Graph
  • Edge bundling: for more than one mate pair
  • Transitive reduction: transitively reducing long
    mate edges
  • Hence Final Bactig Graph is the initial one
    with Edge Bundling and Transitive Reduction
    performed on it

49
Greedy Path Merge Algorithm
  • BACTIG ORDERING PROBLEM
  • Given a bactig graph G, find an ordering of G
    that maximizes the total weight of valid (happy) mate
    edges
  • The problem is NP-complete
  • From here on we proceed almost exactly as in
    normal Greedy Algorithm

50
Greedy Path Merge Algorithm
  • Throughout the implementation it is maintained
    that every node is adjacent to at most two
    selected edges. These edges form the selected
    path.
  • The ordering of Bactigs induced by such a path is
    called Scaffolding of Bactigs
  • The deviation from the actual algorithm is that
    it can introduce new edges to the bactig graph,
    which are called inferred edges
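
The invariant that every node is adjacent to at most two selected edges can be illustrated with a generic greedy edge selection. This is a heavily simplified sketch, not the Greedy Path Merge algorithm itself: it ignores bactig orientation, mate-edge validity and inferred edges, and the node names and weights are invented for the example.

```python
def greedy_paths(nodes, edges):
    """Select edges by decreasing weight so that the selection forms simple paths."""
    parent = {v: v for v in nodes}  # union-find forest, used to reject cycles

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    degree = {v: 0 for v in nodes}
    selected = []
    for u, v, w in sorted(edges, key=lambda e: -e[2]):
        # Accept only if both ends have room and the edge joins two different paths.
        if degree[u] < 2 and degree[v] < 2 and find(u) != find(v):
            parent[find(u)] = find(v)
            degree[u] += 1
            degree[v] += 1
            selected.append((u, v, w))
    return selected

# Hypothetical bactig graph with made-up bundled edge weights
nodes = ["B1", "B2", "B3", "B4"]
edges = [("B1", "B2", 5), ("B2", "B3", 4), ("B1", "B3", 3),
         ("B2", "B4", 2), ("B3", "B4", 1)]
selected = greedy_paths(nodes, edges)
print(selected)
# [('B1', 'B2', 5), ('B2', 'B3', 4), ('B3', 'B4', 1)]
```

Here the selected edges form the single path B1, B2, B3, B4; the edge B1 to B3 is rejected because it would close a cycle, and B2 to B4 because B2 already has two selected edges.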

52
Greedy Path Merge Algorithm
  • Given a bactig graph G. The output of the
    algorithm is a node-disjoint covering of G by
    selected paths, each one defining an ordering of
    the bactigs whose edges it covers
  • The algorithm runs in O(mnm2) time, where m is
    the number of mate pair edges and n is the number
    of bactig edges

53
THANK YOU