Chapter 5: DNA Sequence Assembly - PowerPoint PPT Presentation

Slides: 16
Provided by: min87
1
Chapter 5: DNA Sequence Assembly
2
  • Natura enim simplex est, et rerum causis
    superfluis non luxuriat. (For Nature is simple,
    and does not indulge in superfluous causes of
    things.)
  • I. Newton, Principia

3
DNA Sequencing
  • Two ways of copying DNA
  • 1. Polymerase Chain Reaction (1986): make many
    copies of a fragment (about 500 bp). Needs primers
    (end segments). Separate the double strand into
    two single strands. Anneal primers (5'-3' and
    3'-5'). Make two double chains. Repeat: 1, 2, 4,
    8, 16, ...
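
The doubling in the Repeat step can be tabulated with a minimal sketch. It assumes idealized PCR in which every cycle exactly doubles the number of copies; the function name `pcr_copies` is hypothetical and not from the slides.

```python
def pcr_copies(cycles: int, initial: int = 1) -> int:
    """Copies after `cycles` rounds, assuming each round doubles every copy."""
    if cycles < 0:
        raise ValueError("cycles must be non-negative")
    return initial * 2 ** cycles


# Copy counts after 0, 1, 2, 3, 4 cycles.
print([pcr_copies(c) for c in range(5)])  # [1, 2, 4, 8, 16]
```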

[Figure: DNA strands with their 5' and 3' ends labeled]

4
  • 2. Cloning. Insert a large piece of DNA into a
    cloning vector (virus, bacteria, yeast (YAC)).
    Then the vector is inserted into a host cell,
    where it is duplicated naturally.
  • DNA (shotgun) Sequencing
  • Make many copies (single strand)
  • Cut them into fragments of length about 500.
  • For each fragment of length L, use some process
    like PCR to generate copies of all lengths 1..L,
    labeled with a fluorescent dye. In the old scheme,
    you generate all fragments ending with A, then
    with C, then G, then T, and run them in 4 lanes
    of an electrophoresis gel. In the new scheme, you
    have 4 colors (of the dye) and run all fragments
    in 1 lane.
  • Then assemble all fragments into the shortest
    common superstring by GREEDY: repeatedly merge
    the pair with max overlap until finished.
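
The GREEDY merging step can be sketched as follows. This is a minimal illustration, not a production assembler: it assumes error-free fragments on a single strand, uses a quadratic scan for the best pair, and breaks ties arbitrarily. The names `overlap` and `greedy_scs` are hypothetical.

```python
from typing import List


def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_scs(fragments: List[str]) -> str:
    """Greedy common superstring: repeatedly merge the max-overlap pair."""
    # Drop duplicates and fragments contained in another fragment;
    # they are covered automatically by whatever contains them.
    strings = sorted(
        s for s in set(fragments)
        if not any(s != t and s in t for t in fragments)
    )
    while len(strings) > 1:
        best = (-1, 0, 1)  # (overlap length, index of left, index of right)
        for i, a in enumerate(strings):
            for j, b in enumerate(strings):
                if i != j:
                    k = overlap(a, b)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        merged = strings[i] + strings[j][k:]
        strings = [s for idx, s in enumerate(strings) if idx not in (i, j)]
        strings.append(merged)
    return strings[0] if strings else ""
```

The result always contains every input fragment, but it is not guaranteed to be a shortest superstring; GREEDY is an approximation.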

5
Mouse Whole Genome Assembly
  • Multiple copies of the genome are broken into
    pieces
  • Both ends of every piece are read.
  • The length (and orientation) of each piece forms
    a constraint.
  • Reads: 500-1000 bp
  • Quality array for each position.
  • Reconstruct genome from reads and constraints.
  • Issues: both ends of a read are usually low
    quality; chimeric reads; repetitive regions.

6
Mouse Genome Data Set
  • Dec. 2001 release, 33 million reads. After
    removing low quality reads, 30.5 million reads.
  • 12.9 million constraints.
  • After removing repeats, if two reads overlap by a
    large enough amount, merge them (with max score)
  • A contig is an ordered and oriented list of
    overlapping reads.
  • A scaffold is an ordered and oriented list of
    contigs.
  • Covered about 2.5 Gb (gigabases).
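
The contig and scaffold definitions above map naturally onto nested records. The following is a minimal sketch under stated assumptions: all class and field names (`PlacedRead`, `Contig`, `PlacedContig`, `Scaffold`, `gap_after`) are hypothetical, and the estimated gap between contigs is an assumed stand-in for what the length constraints would provide.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PlacedRead:
    read_id: str
    offset: int           # start position within the contig
    length: int           # read length in bp
    forward: bool = True  # orientation of the read


@dataclass
class Contig:
    """An ordered and oriented list of overlapping reads."""
    reads: List[PlacedRead] = field(default_factory=list)

    def span(self) -> int:
        if not self.reads:
            return 0
        start = min(r.offset for r in self.reads)
        end = max(r.offset + r.length for r in self.reads)
        return end - start

    def is_connected(self) -> bool:
        """True if the reads, sorted by offset, form one overlapping chain."""
        rs = sorted(self.reads, key=lambda r: r.offset)
        if not rs:
            return True
        reach = rs[0].offset + rs[0].length
        for r in rs[1:]:
            if r.offset >= reach:  # no overlap with anything placed so far
                return False
            reach = max(reach, r.offset + r.length)
        return True


@dataclass
class PlacedContig:
    contig: Contig
    forward: bool = True  # orientation of the contig
    gap_after: int = 0    # assumed estimated gap to the next contig, in bp


@dataclass
class Scaffold:
    """An ordered and oriented list of contigs."""
    contigs: List[PlacedContig] = field(default_factory=list)

    def span(self) -> int:
        total = sum(pc.contig.span() for pc in self.contigs)
        # Gaps are counted only between consecutive contigs.
        return total + sum(pc.gap_after for pc in self.contigs[:-1])
```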

7
Shortest Common Superstring
  • In FOCS 1990, we formulated and analyzed the
    following learning problem: infer the original DNA
    sequence from fragments. Or: given n strings,
    find the shortest common superstring.
  • The problem was proved to be NP-complete in 1980.
  • Open for 10 years: does GREEDY give a linear
    approximation ratio, i.e. greedy/opt < constant?
  • We solved this, proving approximation ratio 3,
    STOC 1991 (Blum, Jiang, Li, Tromp, Yannakakis).
  • Improvements by many: 2.89, 2.81, 2.79, 2.75,
    2.66, ...
  • Formal statement: given S = {s1, ..., sn}, find a
    shortest s such that for all i, si is a substring
    of s.
  • E.g. {alf, ate, half, lethal, alpha, alfalfa} →
    lethalphalfalfate
8
Many Systems
  • PHRAP
  • CAP
  • Euler
  • TIGR Assembler
  • Arachne
  • LSA
  • We will touch on some fundamental facets of the
    sequence assembly problem.

9
Why Shortest superstring?
  • Theorem. Given a DNA sequence S, randomly sample
    substrings (of length L) from it according to
    distribution D, 6 n log n / (Lε) times. Then given
    a query "is s a substring of S?", with probability
    > 1-ε we can answer approximately correctly (with
    prob_D > 1-ε, under distribution D).
  • Remarks
  • I.e., polynomial-time PAC learnable.
  • L = 500-700
  • Weak theorem, but it does show nontrivially why
    we need the shortest superstring: the shorter,
    the more likely it is correct.

10
Theorem. GREEDY achieves approximation ratio 4.
  • Proof. Given S = {s1, ..., sm}, construct G:
  • Nodes are s1, ..., sm
  • Edges: for each pair si, sj, add edge si → sj
    with weight pref(si, sj)
  • where pref is the prefix length, i.e.
    |si| = pref(si, sj) + overlap length with sj
  • |SCS(S)| ≥ length of shortest Hamilton cycle in G
  • (Modified) Greedy restated: find all cycles with
    minimum weights in G, then open the cycles and
    concatenate to obtain the final superstring.
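
The graph G and a minimum-weight set of cycles can be sketched for small inputs. This is a brute-force illustration under stated assumptions: no string is a substring of another, self-loops are allowed (using the longest proper self-overlap), and every permutation is tried as a successor map. A real implementation would use an assignment algorithm instead. The names `pref` and `min_cycle_cover` are hypothetical.

```python
from itertools import permutations
from typing import List, Tuple


def overlap(a: str, b: str) -> int:
    """Longest proper overlap: suffix of a equal to prefix of b, shorter than both."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def pref(a: str, b: str) -> int:
    """Edge weight a -> b: the part of a that precedes its overlap with b."""
    return len(a) - overlap(a, b)


def min_cycle_cover(strings: List[str]) -> Tuple[int, Tuple[int, ...]]:
    """Return (total weight, successor map) of a minimum-weight cycle cover.

    succ[i] is the index of the node that follows node i on its cycle.
    """
    n = len(strings)
    w = [[pref(a, b) for b in strings] for a in strings]
    best_weight, best_succ = None, None
    for succ in permutations(range(n)):
        weight = sum(w[i][succ[i]] for i in range(n))
        if best_weight is None or weight < best_weight:
            best_weight, best_succ = weight, succ
    return best_weight, best_succ


strings = ["lethal", "alpha", "half", "alfalfa", "ate"]
weight, succ = min_cycle_cover(strings)
print(weight, succ)
```

Because a Hamilton cycle is one particular cycle cover, the weight found here never exceeds the weight of the shortest Hamilton cycle in G.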

[Figure: si and sj overlapping; pref marks the part of si before the overlap with sj]
11
This minimum cycle exists
  • Assume the initial Hamilton cycle C has w(C) ≤ n,
    where n = |SCS(S)|
  • Then merging si with sj is equivalent to breaking
    C into two cycles C1 and C2. We have
  • w(C1) + w(C2) ≤ n
  • Proof: We merged (si, sj) because they have max
    overlap. The picture shows, with s' the
    predecessor of sj and s'' the successor of si
    on C,
  • d(si,sj) + d(s',s'') ≤ d(si,s'') + d(s',sj)
  • Continue this process, ending with self-cycles C1,
    C2, C3, C4, ...
  • Σ w(Ci) ≤ n.

[Figure: cycle C through si, sj and their neighbors, split into cycles C1 and C2]
12
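  • Then we open the cycles and concatenate
  • Let Wi = w(Ci)
  • Li = (length of the) longest string in Ci
  • open Ci: length < Wi + Li
  • n ≥ Σ Wi
  • Lemma. s1 and s2, from cycles of weights w1 and
    w2, overlap by at most w1 + w2
  • Σ(Li - 2Wi) ≤ n, by the lemma, since the Li's
    must all be in the final SCS.
  • Greedy(S) < Σ(Li + Wi)
  • = Σ(Li - 2Wi) + Σ 3Wi
  • ≤ n + 3n
  • = 4n.
  • QED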

[Figure: overlap of s1 (repeated periods w1) and s2 (repeated periods w2)]
13
Dealing with Repeats
  • When there are repeats, no system can handle
    them.
  • Are they unsolvable?
  • Observation: repeated sequences are often not
    exact repeats. We can use such small differences
    to separate them.
  • Formalizing this: given a collection of repeats
    (approximately similar strings), separate them
    into k piles, each with a center, minimizing the
    max Hamming distance. This problem is NP-hard. An
    interesting research problem. For k = O(1), similar
    techniques (as in the Motif finding PTAS lecture)
    can be used to give a PTAS. However, for larger k,
    this becomes exponential in k.
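
The objective in this formalization can be made concrete with a short sketch. It only evaluates a given choice of centers, by assigning each repeat copy to its nearest center and reporting the max Hamming distance. It does not search for optimal centers, which is the NP-hard part. It assumes all strings have equal length, and the names `hamming` and `assign_to_centers` are hypothetical.

```python
from typing import List, Tuple


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def assign_to_centers(
    strings: List[str], centers: List[str]
) -> Tuple[List[List[str]], int]:
    """Put each string in the pile of its nearest center.

    Returns (piles, radius), where radius is the max Hamming distance
    from any string to the center of its pile.
    """
    piles: List[List[str]] = [[] for _ in centers]
    radius = 0
    for s in strings:
        dists = [hamming(s, c) for c in centers]
        nearest = min(range(len(centers)), key=dists.__getitem__)
        piles[nearest].append(s)
        radius = max(radius, dists[nearest])
    return piles, radius


copies = ["AAAA", "AAAT", "CCCC", "CCCG"]
piles, radius = assign_to_centers(copies, ["AAAA", "CCCC"])
print(piles, radius)  # [['AAAA', 'AAAT'], ['CCCC', 'CCCG']] 1
```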

14
Projects
  • Comparative study: CAP, PHRAP, Arachne, Euler.

15
Open Problems
  • Prove approximation ratio 2 for Greedy.
  • Solve the repeat resolution problem.