DNA Sequencing - PowerPoint PPT Presentation

About This Presentation
Title:

DNA Sequencing

Description:

Start at primer (restriction site) Grow DNA chain ... ARD, CRB ? A. R. B. CS262 Lecture 9, Win06, Batzoglou. Sequencing and Fragment Assembly ... – PowerPoint PPT presentation

Slides: 58
Provided by: root
Category:
Tags: dna | ard | sequencing


Transcript and Presenter's Notes

Title: DNA Sequencing


1
DNA Sequencing
2
DNA Sequencing: gel electrophoresis
  1. Start at primer (restriction site)
  2. Grow DNA chain
  3. Include dideoxynucleoside (modified a, c, g, t)
  4. Stops reaction at all possible points
  5. Separate products by length, using gel
    electrophoresis

3
Electrophoresis diagrams
4
Method to sequence longer regions
genomic segment
cut many times at random (Shotgun)
Get one or two reads from each segment
500 bp
500 bp
5
Reconstructing the Sequence (Fragment Assembly)
reads
Cover region with 7-fold redundancy (7X)
Overlap reads and extend to reconstruct the
original genomic region
6
Definition of Coverage
C
  • Length of genomic segment L
  • Number of reads n
  • Length of each read l
  • Definition: Coverage C = n l / L
  • How much coverage is enough?
  • Lander-Waterman model
  • Assuming uniform distribution of reads, C = 10
    results in 1 gapped region per 1,000,000 nucleotides
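
The coverage definition and the Lander-Waterman figure can be checked with a few lines of Python. This is a minimal sketch: the function names and the example numbers (a 1 Mbp region, 500 bp reads) are assumptions for illustration, and the gap estimate uses the simplified form n·e^(-C), which ignores the minimum overlap needed to detect that two reads overlap.

```python
import math

def coverage(num_reads: int, read_len: int, genome_len: int) -> float:
    """Coverage C = n * l / L."""
    return num_reads * read_len / genome_len

def expected_gaps(num_reads: int, read_len: int, genome_len: int) -> float:
    """Simplified Lander-Waterman estimate of the number of gaps: n * exp(-C)."""
    c = coverage(num_reads, read_len, genome_len)
    return num_reads * math.exp(-c)

# Assumed example: 1,000,000-nucleotide region, 500 bp reads, 20,000 reads.
L, l, n = 1_000_000, 500, 20_000
print(coverage(n, l, L))       # 10.0
print(expected_gaps(n, l, L))  # roughly 0.9, i.e. about one gap per 1,000,000 nucleotides
```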

7
Repeats
  • Bacterial genomes: 5%
  • Mammals: 50%
  • Repeat types
  • Low-Complexity DNA (e.g. ATATATATACATA)
  • Microsatellite repeats (a1...ak)^N where k ~ 3-6
  • (e.g. CAGCAGTAGCAGCACCAG)
  • Transposons
  • SINE (Short Interspersed Nuclear Elements)
  • e.g., ALU: 300-long, 10^6 copies
  • LINE (Long Interspersed Nuclear Elements)
  • 4000-long, 200,000 copies
  • LTR retroposons (Long Terminal Repeats (700 bp)
    at each end)
  • cousins of HIV
  • Gene Families: genes duplicate, then diverge
    (paralogs)
  • Recent duplications: 100,000-long, very similar
    copies

8
Sequencing and Fragment Assembly
3x10^9 nucleotides
50% of human DNA is composed of repeats
Error! Glued together two distant regions
9
What can we do about repeats?
  • Two main approaches
  • Cluster the reads
  • Link the reads

12
Sequencing and Fragment Assembly
3x10^9 nucleotides
ARB, CRD or ARD, CRB ?
13
Sequencing and Fragment Assembly
3x10^9 nucleotides
14
Strategies for whole-genome sequencing
  1. Hierarchical: Clone-by-clone
    • Break genome into many long pieces
    • Map each long piece onto the genome
    • Sequence each piece with shotgun
    • Example: Yeast, Worm, Human, Rat
  2. Online version of (1): Walking
    • Break genome into many long pieces
    • Start sequencing each piece with shotgun
    • Construct map as you go
    • Example: Rice genome
  3. Whole genome shotgun
    • One large shotgun pass on the whole genome
    • Example: Drosophila, Human (Celera)

15
Hierarchical Sequencing
16
Hierarchical Sequencing Strategy
genome
  1. Obtain a large collection of BAC clones
  2. Map them onto the genome (Physical Mapping)
  3. Select a minimum tiling path
  4. Sequence each clone in the path with shotgun
  5. Assemble
  6. Put everything together

17
Methods of physical mapping
  • Goal
  • Make a map of the locations of each clone
    relative to one another
  • Use the map to select a minimal set of clones to
    sequence
  • Methods
  • Hybridization
  • Digestion

18
1. Hybridization
p1
pn
  • Short words, the probes, attach to complementary
    words
  • Construct many probes
  • Treat each BAC with all probes
  • Record which ones attach to it
  • Same words attaching to BACs X, Y → overlap

19
Hybridization Computational Challenge
  • Matrix
  • m probes × n clones
  • (i, j) = 1, if pi hybridizes to Cj;
    0, otherwise
  • Definition: Consecutive-ones matrix
  • 1s are consecutive in each row & col
  • Computational problem
  • Reorder the probes so that matrix is in
    consecutive-ones form
  • Can be solved in O(m^3) time (m > n)

[Figure: the 0/1 hybridization matrix with probes p1 ... pm and clones C1 ... Cn, shown before reordering and again with probes reordered as pi1 ... pim and clones as Cj1 ... Cjn, so that the 1s form consecutive runs]
20
Hybridization Computational Challenge
[Figure: the reordered 0/1 matrix, probes pi1 ... pim by clones Cj1 ... Cjn, in consecutive-ones form]
  • If we put the matrix in consecutive-ones form,
    then we can deduce the order of the clones
    and which pairs of clones overlap
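
To make the consecutive-ones idea concrete, here is a brute-force sketch for tiny matrices. It is not the O(m^3) method mentioned in the slides; it simply tries every probe order. The layout (rows are clones, columns are probes), the function names, and the toy data are assumptions for illustration, and only the rows are checked for consecutive 1s.

```python
from itertools import permutations

def rows_are_consecutive(matrix):
    """True if, in every row, the 1s form a single contiguous run."""
    for row in matrix:
        ones = [j for j, v in enumerate(row) if v == 1]
        if ones and ones[-1] - ones[0] + 1 != len(ones):
            return False
    return True

def find_consecutive_ones_order(matrix):
    """Try every column (probe) order; return the first that works, or None."""
    num_cols = len(matrix[0])
    for order in permutations(range(num_cols)):
        reordered = [[row[j] for j in order] for row in matrix]
        if rows_are_consecutive(reordered):
            return list(order)
    return None

# Toy data (hypothetical): rows are clones, columns are probes 0..3.
clones = [
    [1, 0, 1, 0],  # clone A hybridizes to probes 0 and 2
    [0, 1, 1, 0],  # clone B hybridizes to probes 1 and 2
    [0, 1, 0, 1],  # clone C hybridizes to probes 1 and 3
]
order = find_consecutive_ones_order(clones)
print(order)  # [0, 2, 1, 3]
```

With the probes in the order 0, 2, 1, 3 the rows read 1100, 0110, 0011, so clones A, B, C are laid out left to right and only neighbouring clones share a probe. The factorial search is only practical for a handful of probes.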

21
Hybridization Computational Challenge
  • Additional challenge
  • A probe (short word) can hybridize in many
    places in the genome
  • Computational Problem
  • Find the order of probes that implies the
    minimal probe repetition
  • Equivalent: find the shortest string of probes
    such that each clone appears as a substring
  • APX-hard
  • Solutions
  • Greedy, probabilistic, lots of manual curation

22
2. Digestion
  • Restriction enzymes cut DNA where specific words
    appear
  • Cut each clone separately with an enzyme
  • Run fragments on a gel and measure length
  • Clones Ca, Cb have fragments of length li, lj,
    lk → overlap
  • Double digestion
  • Cut with enzyme A, enzyme B, then enzymes A + B
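
One way to picture the digestion test is to count fragment lengths that two clones have in common, allowing for measurement error on the gel. The sketch below is an assumption-laden illustration: the function name, the 2% relative tolerance, the greedy matching, and the fragment lengths are all hypothetical and not taken from the lecture.

```python
def shared_fragments(frags_a, frags_b, tolerance=0.02):
    """Count fragments of clone A that match a distinct fragment of clone B
    within a relative length tolerance (greedy matching, shortest first)."""
    remaining = sorted(frags_b)
    shared = 0
    for length in sorted(frags_a):
        for i, other in enumerate(remaining):
            if abs(length - other) <= tolerance * max(length, other):
                shared += 1
                del remaining[i]  # each fragment of B can be matched only once
                break
    return shared

# Hypothetical gel measurements (in bp) for two clones.
clone_a = [1200, 3400, 5100, 800]
clone_b = [3390, 5150, 2200, 950]
print(shared_fragments(clone_a, clone_b))  # 2
```

Many shared fragment lengths suggest that the two clones overlap; a real pipeline would also weigh how likely such matches are by chance.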

23
Online Clone-by-clone: The Walking Method
24
The Walking Method
  1. Build a very redundant library of BACs with
    sequenced clone-ends (cheap to build)
  2. Sequence some seed clones
  3. Walk from seeds using clone-ends to pick
    library clones that extend left & right

25
Walking An Example
26
Advantages & Disadvantages of Hierarchical
Sequencing
  • Hierarchical Sequencing
  • ADV. Easy assembly
  • DIS. Build library & physical map;
    redundant sequencing
  • Whole Genome Shotgun (WGS)
  • ADV. No mapping, no redundant sequencing
  • DIS. Difficult to assemble and resolve repeats
  • The Walking method: motivation
  • Sequence the genome clone-by-clone without a
    physical map
  • The only costs involved are
  • Library of end-sequenced clones (cheap)
  • Sequencing

27
Walking off a Single Seed
  • Low redundant sequencing
  • Many sequential steps

28
Walking off a single clone is impractical
  • Cycle time to process one clone: 1-2 months
  • Grow clone
  • Prepare & shear DNA
  • Prepare shotgun library & perform shotgun
  • Assemble in a computer
  • Close remaining gaps
  • A mammalian genome would need 15,000 walking
    steps!
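
The 15,000 figure can be reproduced with simple arithmetic. This sketch assumes a 3x10^9-nucleotide genome, clones of 200,000 nucleotides (the upper end of typical BAC length), and no overlap between consecutive clones; the variable names are illustrative.

```python
genome_len = 3_000_000_000  # mammalian genome, about 3x10^9 nucleotides
clone_len = 200_000         # assumed clone length, no overlap between steps

steps = genome_len // clone_len
print(steps)  # 15000

# At 1-2 months per sequential step, walking from a single seed would take:
years_low = steps * 1 / 12
years_high = steps * 2 / 12
print(years_low, years_high)  # 1250.0 2500.0
```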

29
Walking off several seeds in parallel
Efficient
Inefficient
  • Few sequential steps
  • Additional redundant sequencing
  • In general, can sequence a genome in 5 walking
    steps, with <20% redundant sequencing

30
Using Two Libraries
Most inefficiency comes from closing a small gap
with a much larger clone
Solution: Use a second library of small clones
31
Some Terminology
  • insert: a fragment that was incorporated in a
    circular genome, and can be copied (cloned)
  • vector: the circular genome (host) that
    incorporated the fragment
  • BAC: Bacterial Artificial Chromosome, a type of
    insert + vector combination, typically of
    length 100-200 kb
  • read: a 500-900 long word that comes out of a
    sequencing machine
  • coverage: the average number of reads (or
    inserts) that cover a position in the target
    DNA piece
  • shotgun sequencing: the process of obtaining
    many reads from random locations in DNA, to
    detect overlaps and assemble
32
Whole Genome Shotgun Sequencing
genome
plasmids (2 - 10 Kbp)
forward-reverse paired reads
known dist
cosmids (40 Kbp)
500 bp
500 bp
33
Fragment Assembly (in whole-genome shotgun
sequencing)
34
Fragment Assembly
Given N reads, where N ~ 30 million,
we need to use a linear-time algorithm
35
Steps to Assemble a Genome
Some Terminology
  • read: a 500-900 long word that comes out of the
    sequencer
  • mate pair: a pair of reads from two ends of the
    same insert fragment
  • contig: a contiguous sequence formed by several
    overlapping reads with no gaps
  • supercontig (scaffold): an ordered and oriented
    set of contigs, usually by mate pairs
  • consensus sequence: sequence derived from the
    multiple alignment of reads in a contig

  1. Find overlapping reads
  2. Merge some good pairs of reads into longer
    contigs
  3. Link contigs to form supercontigs
  4. Derive consensus sequence
..ACGATTACAATAGGTT..
36
1. Find Overlapping Reads
Words indexed as (read, pos., word, orient.):
aaactgcag aactgcagt actgcagta gtacggatc tacggatct
gggcccaaa ggcccaaac gcccaaact actgcagta ctgcagtac
gtacggatc tacggatct acggatcta ctactacac tactacaca

The same index sorted, as (word, read, orient., pos.):
aaactgcag aactgcagt acggatcta actgcagta actgcagta cccaaactg
cggatctac ctactacac ctgcagtac ctgcagtac gcccaaact ggcccaaac
gggcccaaa gtacggatc gtacggatc tacggatct tacggatct tactacaca

Reads, each followed by its listed words:
aaactgcagtacggatct: aaactgcag aactgcagt gtacggatc tacggatct
gggcccaaactgcagtac: gggcccaaa ggcccaaac actgcagta ctgcagtac
gtacggatctactacaca: gtacggatc tacggatct ctactacac tactacaca
37
1. Find Overlapping Reads
  • Find pairs of reads sharing a k-mer, k ~ 24
  • Extend to full alignment; throw away if not >98%
    similar

TAGATTACACAGATTAC

TAGATTACACAGATTAC
  • Caveat: repeats
  • A k-mer that occurs N times causes O(N^2)
    read/read comparisons
  • ALU k-mers could cause up to 1,000,000^2
    comparisons
  • Solution
  • Discard all k-mers that occur too often
  • Set cutoff to balance sensitivity/speed tradeoff,
    according to genome at hand and computing
    resources available
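
The k-mer lookup with a frequency cutoff can be sketched as follows. This is an illustrative sketch, not the method of any particular assembler: the function name and the `max_occurrences` parameter are assumptions, reverse-complement orientation is ignored, and the three example reads are the ones listed on the earlier k-mer index slide (with k = 9 rather than k ~ 24 because the reads are so short).

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(reads, k, max_occurrences):
    """Return pairs of read indices that share at least one k-mer,
    skipping k-mers that occur in more than max_occurrences reads."""
    index = defaultdict(set)  # k-mer -> indices of reads containing it
    for read_id, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            index[read[pos:pos + k]].add(read_id)

    pairs = set()
    for kmer, read_ids in index.items():
        if len(read_ids) > max_occurrences:
            continue  # too frequent: likely a repeat, would cost O(N^2) comparisons
        pairs.update(combinations(sorted(read_ids), 2))
    return pairs

reads = [
    "aaactgcagtacggatct",
    "gggcccaaactgcagtac",
    "gtacggatctactacaca",
]
print(sorted(candidate_pairs(reads, k=9, max_occurrences=10)))  # [(0, 1), (0, 2)]
```

Each candidate pair would then be extended to a full alignment and kept only if it passes the similarity threshold. Lowering `max_occurrences` trades sensitivity for speed, as the slide notes.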

38
1. Find Overlapping Reads
  • Create local multiple alignments from the
    overlapping reads

TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA
39
1. Find Overlapping Reads
  • Correct errors using multiple alignment

TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTATTGA
TAG-TTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG-TTACACAGATTACTGA
TAG-TTACACAGATTATTGA
insert A
correlated errors, probably caused by repeats →
disentangle overlaps
replace T with C
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
In practice, error correction removes up to 98%
of the errors
TAG-TTACACAGATTATTGA
TAG-TTACACAGATTATTGA
40
2. Merge Reads into Contigs
  • Overlap graph
  • Nodes: reads r1 ... rn
  • Edges: overlaps (ri, rj, shift, orientation,
    score)

Reads that come from two regions of the genome
(blue and red) that contain the same repeat
Note: of course, we don't know the color
of these nodes
41
2. Merge Reads into Contigs
Unique Contig
Overcollapsed Contig
  • We want to merge reads up to potential repeat
    boundaries

42
2. Merge Reads into Contigs
  • Ignore non-maximal reads
  • Merge only maximal reads into contigs

43
2. Merge Reads into Contigs
r
r1
r2
r3
  • Remove transitively inferable overlaps
  • If read r overlaps to the right reads r1, r2, and
    r1 overlaps r2, then (r, r2) can be inferred by
    (r, r1) and (r1, r2)
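
Removing transitively inferable overlaps can be sketched on a small graph. In this illustration the overlap graph is reduced to a dictionary of "overlaps to the right" sets; the function name and data layout are assumptions, and the shift, orientation, and score attached to each overlap are ignored.

```python
def remove_transitive_overlaps(overlaps):
    """overlaps maps each read to the set of reads it overlaps to the right.
    Drop edge (r, r2) whenever some r1 has both (r, r1) and (r1, r2)."""
    reduced = {}
    for r, right in overlaps.items():
        inferable = {
            r2
            for r1 in right
            for r2 in overlaps.get(r1, set())
            if r2 in right and r2 != r1
        }
        reduced[r] = right - inferable
    return reduced

graph = {
    "r": {"r1", "r2", "r3"},
    "r1": {"r2", "r3"},
    "r2": {"r3"},
    "r3": set(),
}
reduced = remove_transitive_overlaps(graph)
print(reduced)
```

For this toy graph only the chain r, r1, r2, r3 remains, which is the layout the reads imply.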

44
2. Merge Reads into Contigs
45
2. Merge Reads into Contigs
repeat boundary???
sequencing error
b
a

b
a
  • Ignore hanging reads, when detecting repeat
    boundaries

46
Overlap graph after forming contigs
Unitigs: Gene Myers, '95
47
Repeats, errors, and contig lengths
  • Repeats shorter than read length are easily
    resolved
  • Read that spans across a repeat disambiguates
    order of flanking regions
  • Repeats with more base pair diffs than sequencing
    error rate are OK
  • We throw out overlaps between two reads in
    different copies of the repeat
  • To make the genome appear less repetitive, try
    to
  • Increase read length
  • Decrease sequencing error rate
  • Role of error correction
  • Discards up to 98% of single-letter sequencing
    errors
  • decreases error rate
  • → decreases effective repeat content
  • → increases contig length

48
2. Merge Reads into Contigs
  • Insert non-maximal reads whenever unambiguous

49
3. Link Contigs into Supercontigs
Normal density
Too dense → Overcollapsed
Inconsistent links → Overcollapsed?
50
3. Link Contigs into Supercontigs
Find all links between unique contigs
Connect contigs incrementally, if ≥ 2 links
supercontig (aka scaffold)
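
The rule of joining contigs only when at least two links support the connection can be sketched with a union-find structure. This is a simplified illustration: the function name, the `min_links` parameter, and the toy data are assumptions, and the sketch only groups contigs without computing their order, orientation, or gap sizes.

```python
from collections import Counter

def build_supercontigs(contigs, mate_links, min_links=2):
    """Group contigs that are connected by at least min_links mate-pair links."""
    parent = {c: c for c in contigs}

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path halving
            c = parent[c]
        return c

    # Count links per unordered contig pair.
    link_counts = Counter(tuple(sorted(pair)) for pair in mate_links)
    for (a, b), count in link_counts.items():
        if count >= min_links:
            parent[find(a)] = find(b)

    groups = {}
    for c in contigs:
        groups.setdefault(find(c), []).append(c)
    return sorted(sorted(g) for g in groups.values())

contigs = ["c1", "c2", "c3", "c4"]
links = [("c1", "c2"), ("c2", "c1"), ("c2", "c3"),
         ("c3", "c4"), ("c4", "c3"), ("c3", "c4")]
print(build_supercontigs(contigs, links))  # [['c1', 'c2'], ['c3', 'c4']]
```

The single link between c2 and c3 is not enough to merge the two groups, which guards against a stray chimeric or misplaced mate pair.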
51
3. Link Contigs into Supercontigs
Fill gaps in supercontigs with paths of repeat
contigs
52
4. Derive Consensus Sequence
TAGATTACACAGATTACTGA TTGATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAAACTA
TAG TTACACAGATTATTGACTTCATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGGGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA CTA
  • Derive multiple alignment from pairwise read
    alignments

Derive each consensus base by weighted
voting (Alternative: take maximum-quality letter)
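
Weighted voting per alignment column can be sketched as below. The representation is an assumption made for illustration: reads are equal-length strings already placed in the multiple alignment, "-" marks a gap, a space marks columns a read does not cover, and each read carries one hypothetical quality weight per column.

```python
from collections import defaultdict

def consensus(aligned_reads, qualities):
    """Column-wise weighted vote over aligned reads.
    '-' is a gap character; ' ' means the read does not cover the column."""
    length = len(aligned_reads[0])
    result = []
    for col in range(length):
        votes = defaultdict(float)
        for read, quals in zip(aligned_reads, qualities):
            base = read[col]
            if base != " ":
                votes[base] += quals[col]
        if not votes:
            continue  # no read covers this column
        winner = max(votes, key=votes.get)
        if winner != "-":  # a winning gap contributes no base
            result.append(winner)
    return "".join(result)

reads = ["TAGATTACA", "TAG-TTACA", "TAGATTACA"]
uniform = [[1.0] * 9 for _ in reads]
print(consensus(reads, uniform))  # TAGATTACA
```

With uniform weights this is a plain majority vote; giving a read higher per-column weights lets a single confident base outvote several low-quality ones.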
53
Some Assemblers
  • PHRAP
  • Early assembler, widely used, good model of read
    errors
  • Overlap O(n^2) → layout (no mate pairs) →
    consensus
  • Celera
  • First assembler to handle large genomes (fly,
    human, mouse)
  • Overlap → layout → consensus
  • Arachne
  • Public assembler (mouse, several fungi)
  • Overlap → layout → consensus
  • Phusion
  • Overlap → clustering → PHRAP → assemblage →
    consensus
  • Euler
  • Indexing → Euler graph → layout by picking paths
    → consensus

54
Quality of assemblies
Celera's assemblies of human and mouse
55
Quality of assemblies: mouse
56
Quality of assemblies: mouse
Terminology: N50 contig length. If we sort contigs
from largest to smallest, and start covering the
genome in that order, N50 is the length of the
contig that just covers the 50th percentile.
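
The N50 definition translates directly into a short function. One assumption to note: this sketch measures the 50% mark against the total length of the contigs given, not against the true genome length, and the contig lengths in the example are made up.

```python
def n50(contig_lengths):
    """Length of the contig at which the running total, taken from
    largest to smallest, first reaches half of the total contig length."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs given")

print(n50([80, 70, 50, 40, 30, 20]))  # 70
```

Here the total is 290, the largest contig alone covers 80, and adding the second brings the running total to 150, which passes the halfway point of 145, so N50 is 70.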
57
Quality of assemblies: rat
58
History of WGA
  • 1982: λ-virus, 48,502 bp
  • 1995: H. influenzae, 1 Mbp
  • 2000: fly, 100 Mbp
  • 2001 - present
  • human (3 Gbp), mouse (2.5 Gbp), rat, chicken, dog,
    chimpanzee, several fungal genomes

1997
"Let's sequence the human genome with the shotgun
strategy" (Gene Myers)
"That is impossible, and a bad idea anyway"
(Phil Green)
59
Genomes Sequenced
  • http://www.genome.gov/10002154