Title: DNA Sequencing
1DNA Sequencing
2DNA Sequencing: gel electrophoresis
- Start at primer (restriction site)
- Grow DNA chain
- Include dideoxynucleotides (modified A, C, G, T)
- Stops reaction at all possible points
- Separate products by length, using gel electrophoresis
3Electrophoresis diagrams
4Method to sequence longer regions
genomic segment
- Cut many times at random (shotgun)
- Get one or two reads from each segment (500 bp each)
5Reconstructing the Sequence (Fragment Assembly)
reads
Cover region with 7-fold redundancy (7X)
Overlap reads and extend to reconstruct the
original genomic region
6Definition of Coverage
- Length of genomic segment: L
- Number of reads: n
- Length of each read: l
- Definition: Coverage C = n l / L
- How much coverage is enough?
- Lander-Waterman model
- Assuming uniform distribution of reads, C = 10 results in 1 gapped region per 1,000,000 nucleotides
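The coverage definition and the gap estimate can be checked numerically. The sketch below is illustrative only: it assumes the usual Lander-Waterman approximation that the expected number of gaps is about n * exp(-C), which the slide does not spell out, and the function names and example values are made up for the illustration.

```python
import math

def coverage(n_reads: int, read_len: int, genome_len: int) -> float:
    """Coverage C = n * l / L."""
    return n_reads * read_len / genome_len

def expected_gaps(n_reads: int, read_len: int, genome_len: int) -> float:
    """Expected number of gaps, using the assumed approximation n * exp(-C)."""
    c = coverage(n_reads, read_len, genome_len)
    return n_reads * math.exp(-c)

# Assumed example: a 1,000,000-nucleotide segment, 500 bp reads, C = 10.
L = 1_000_000
l = 500
n = 20_000
print(coverage(n, l, L))                 # 10.0
print(round(expected_gaps(n, l, L), 2))  # 0.91
```

With these assumed values the estimate comes out close to one gap per 1,000,000 nucleotides, in line with the figure quoted for C = 10.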
7Repeats
- Bacterial genomes: 5%
- Mammals: 50%
- Repeat types
- Low-Complexity DNA (e.g. ATATATATACATA)
- Microsatellite repeats (a1...ak)^N where k ~ 3-6 (e.g. CAGCAGTAGCAGCACCAG)
- Transposons
- SINE (Short Interspersed Nuclear Elements)
- e.g., ALU: 300-long, 10^6 copies
- LINE (Long Interspersed Nuclear Elements)
- 4000-long, 200,000 copies
- LTR retroposons (Long Terminal Repeats (700 bp) at each end): cousins of HIV
- Gene Families: genes duplicate, then diverge (paralogs)
- Recent duplications: 100,000-long, very similar copies
8Sequencing and Fragment Assembly
3x10^9 nucleotides
50% of human DNA is composed of repeats
Error! Glued together two distant regions
9What can we do about repeats?
- Two main approaches
- Cluster the reads
- Link the reads
12Sequencing and Fragment Assembly
3x10^9 nucleotides
ARB, CRD or ARD, CRB ?
13Sequencing and Fragment Assembly
3x10^9 nucleotides
14Strategies for whole-genome sequencing
1. Hierarchical: Clone-by-clone
- Break genome into many long pieces
- Map each long piece onto the genome
- Sequence each piece with shotgun
- Example: Yeast, Worm, Human, Rat
2. Online version of (1): Walking
- Break genome into many long pieces
- Start sequencing each piece with shotgun
- Construct map as you go
- Example: Rice genome
3. Whole genome shotgun
- One large shotgun pass on the whole genome
- Example: Drosophila, Human (Celera),
15Hierarchical Sequencing
16Hierarchical Sequencing Strategy
genome
- Obtain a large collection of BAC clones
- Map them onto the genome (Physical Mapping)
- Select a minimum tiling path
- Sequence each clone in the path with shotgun
- Assemble
- Put everything together
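Once clones have been mapped, selecting a minimum tiling path can be viewed as an interval-cover problem. The sketch below is one possible illustration, not the procedure used by any particular project: it assumes each clone is already reduced to a hypothetical (name, start, end) interval on the genome and uses the standard greedy rule of always taking the clone that reaches furthest to the right.

```python
def minimum_tiling_path(clones, genome_len):
    """Greedy interval cover over half-open intervals [start, end).

    clones: list of (name, start, end) tuples (assumed mapped positions).
    Returns the chosen clone names in order, or None if some part of
    [0, genome_len) is not covered by any clone.
    """
    ordered = sorted(clones, key=lambda c: c[1])
    path = []
    frontier = 0   # everything in [0, frontier) is already covered
    i = 0
    while frontier < genome_len:
        best = None
        # Look at every clone that starts inside the covered region.
        while i < len(ordered) and ordered[i][1] <= frontier:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= frontier:
            return None   # no clone extends the frontier: physical gap
        path.append(best[0])
        frontier = best[2]
    return path

# Hypothetical clone map on a 500-unit genome.
clones = [("A", 0, 150), ("B", 100, 260), ("C", 120, 300),
          ("D", 250, 400), ("E", 290, 450), ("F", 380, 500)]
print(minimum_tiling_path(clones, 500))   # ['A', 'C', 'E', 'F']
```

Real tiling-path selection also has to cope with uncertain map positions and a required minimum overlap between neighbouring clones, which this sketch ignores.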
17Methods of physical mapping
- Goal
- Make a map of the locations of each clone relative to one another
- Use the map to select a minimal set of clones to sequence
- Methods
- Hybridization
- Digestion
181. Hybridization
p1
pn
- Short words, the probes, attach to complementary words
- Construct many probes
- Treat each BAC with all probes
- Record which ones attach to it
- Same words attaching to BACs X, Y → overlap
19Hybridization: Computational Challenge
- Matrix
- m probes × n clones
- (i, j) = 1, if pi hybridizes to Cj
- (i, j) = 0, otherwise
- Definition: Consecutive-ones matrix
- 1s are consecutive in each row and column
- Computational problem
- Reorder the probes so that the matrix is in consecutive-ones form
- Can be solved in O(m^3) time (m > n)
[Figure: 0/1 hybridization matrix of probes p1 ... pm against clones C1 ... Cn, shown before and after reordering to pi1 ... pim and Cj1 ... Cjn so that the 1s form consecutive runs]
20Hybridization: Computational Challenge
[Figure: matrix of probes pi1 ... pim against clones Cj1 ... Cjn in consecutive-ones form]
- If we put the matrix in consecutive-ones form, then we can deduce
- the order of the clones
- which pairs of clones overlap
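The reordering problem can be illustrated with a brute-force search on a tiny matrix. This is only a sketch under stated assumptions: it tries every probe order, so it is exponential and is not the polynomial-time method mentioned on the slides; the matrix is made-up data with probes as rows and clones as columns, and the function names are hypothetical.

```python
from itertools import permutations

def ones_are_consecutive(bits):
    """True if the 1s in the sequence form one contiguous block (or none)."""
    ones = [i for i, b in enumerate(bits) if b == 1]
    return not ones or ones[-1] - ones[0] + 1 == len(ones)

def find_probe_order(matrix):
    """Brute-force search for a probe (row) order in which every clone
    (column) has consecutive 1s. Returns the order or None."""
    n_probes = len(matrix)
    n_clones = len(matrix[0])
    for order in permutations(range(n_probes)):
        if all(
            ones_are_consecutive([matrix[p][c] for p in order])
            for c in range(n_clones)
        ):
            return list(order)
    return None

# Hypothetical data: rows are probes p0..p3, columns are clones C0..C2.
matrix = [
    [1, 0, 0],   # p0 hybridizes to C0 only
    [0, 1, 1],   # p1 hybridizes to C1 and C2
    [1, 1, 0],   # p2 hybridizes to C0 and C1
    [0, 0, 1],   # p3 hybridizes to C2 only
]
print(find_probe_order(matrix))   # [0, 2, 1, 3]
```

In the returned order p0, p2, p1, p3, clone C0 covers the first two probes, C1 the middle two and C2 the last two, which gives the clone order C0, C1, C2 and the overlapping pairs (C0, C1) and (C1, C2).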
21Hybridization: Computational Challenge
- Additional challenge
- A probe (short word) can hybridize in many places in the genome
- Computational problem
- Find the order of probes that implies the minimal probe repetition
- Equivalent: find the shortest string of probes such that each clone appears as a substring
- APX-hard
- Solutions
- Greedy, probabilistic, lots of manual curation
222. Digestion
- Restriction enzymes cut DNA where specific words appear
- Cut each clone separately with an enzyme
- Run fragments on a gel and measure length
- Clones Ca, Cb have fragments of length li, lj, lk → overlap
- Double digestion
- Cut with enzyme A, enzyme B, then enzymes A + B
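The fragment-length comparison can be sketched as counting shared fragment sizes between two clones. This is a rough illustration only: the relative tolerance, the fragment lengths and the function name are assumptions chosen for the example, and real fingerprint comparison uses a statistical score rather than a raw count.

```python
def shared_fragments(lengths_a, lengths_b, tolerance=0.02):
    """Count fragments of clone A that match a distinct fragment of clone B
    within a relative length tolerance (gel measurements are inexact)."""
    a = sorted(lengths_a)
    b = sorted(lengths_b)
    i = j = shared = 0
    while i < len(a) and j < len(b):
        if abs(a[i] - b[j]) <= tolerance * max(a[i], b[j]):
            shared += 1      # lengths agree: count the pair, advance both
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1           # a[i] is too short to match anything left in b
        else:
            j += 1
    return shared

# Hypothetical fragment lengths (bp) measured for two clones.
clone_a = [1200, 3400, 5100, 800]
clone_b = [3410, 5080, 7600, 950]
print(shared_fragments(clone_a, clone_b))   # 2
```

Many shared fragment lengths suggest that the two clones overlap; a single shared length could easily be a coincidence.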
23Online Clone-by-clone: The Walking Method
24The Walking Method
- Build a very redundant library of BACs with sequenced clone-ends (cheap to build)
- Sequence some seed clones
- Walk from seeds using clone-ends to pick library clones that extend left and right
25Walking: An Example
26Advantages and Disadvantages of Hierarchical Sequencing
- Hierarchical Sequencing
- ADV. Easy assembly
- DIS. Build library and physical map; redundant sequencing
- Whole Genome Shotgun (WGS)
- ADV. No mapping, no redundant sequencing
- DIS. Difficult to assemble and resolve repeats
- The Walking method: motivation
- Sequence the genome clone-by-clone without a physical map
- The only costs involved are
- Library of end-sequenced clones (cheap)
- Sequencing
27Walking off a Single Seed
- Low redundant sequencing
- Many sequential steps
28Walking off a single clone is impractical
- Cycle time to process one clone: 1-2 months
- Grow clone
- Prepare and shear DNA
- Prepare shotgun library and perform shotgun
- Assemble in a computer
- Close remaining gaps
- A mammalian genome would need 15,000 walking steps! (roughly 3x10^9 bp walked in clones of about 200 kb each)
29Walking off several seeds in parallel
Efficient
Inefficient
- Few sequential steps
- Additional redundant sequencing
- In general, can sequence a genome in 5 walking steps, with <20% redundant sequencing
30Using Two Libraries
Most inefficiency comes from closing a small gap
with a much larger clone
Solution: Use a second library of small clones
31Some Terminology
- insert: a fragment that was incorporated in a circular genome, and can be copied (cloned)
- vector: the circular genome (host) that incorporated the fragment
- BAC: Bacterial Artificial Chromosome, a type of insert + vector combination, typically of length 100-200 kb
- read: a 500-900 long word that comes out of a sequencing machine
- coverage: the average number of reads (or inserts) that cover a position in the target DNA piece
- shotgun sequencing: the process of obtaining many reads from random locations in DNA, to detect overlaps and assemble
32Whole Genome Shotgun Sequencing
[Figure: genome cut into plasmids (2-10 Kbp) and cosmids (40 Kbp); forward-reverse paired reads of 500 bp each, separated by a known distance]
33Fragment Assembly (in whole-genome shotgun sequencing)
34Fragment Assembly
Given N reads, where N ~ 30 million, we need to use a linear-time algorithm
35Steps to Assemble a Genome
Some Terminology
- read: a 500-900 long word that comes out of the sequencer
- mate pair: a pair of reads from two ends of the same insert fragment
- contig: a contiguous sequence formed by several overlapping reads with no gaps
- supercontig (scaffold): an ordered and oriented set of contigs, usually by mate pairs
- consensus sequence: the sequence derived from the multiple alignment of reads in a contig
1. Find overlapping reads
2. Merge some good pairs of reads into longer
contigs
3. Link contigs to form supercontigs
4. Derive consensus sequence
..ACGATTACAATAGGTT..
361. Find Overlapping Reads
Reads and the 9-letter words taken from them (as shown in the figure):
aaactgcagtacggatct: aaactgcag aactgcagt gtacggatc tacggatct
gggcccaaactgcagtac: gggcccaaa ggcccaaac actgcagta ctgcagtac
gtacggatctactacaca: gtacggatc tacggatct ctactacac tactacaca
Entries listed in read order, as (read, pos., word, orient.):
aaactgcag aactgcagt actgcagta gtacggatc tacggatct
gggcccaaa ggcccaaac gcccaaact actgcagta ctgcagtac
gtacggatc tacggatct acggatcta ctactacac tactacaca
Entries sorted by word, as (word, read, orient., pos.):
aaactgcag aactgcagt acggatcta actgcagta actgcagta cccaaactg
cggatctac ctactacac ctgcagtac ctgcagtac gcccaaact ggcccaaac
gggcccaaa gtacggatc gtacggatc tacggatct tacggatct tactacaca
371. Find Overlapping Reads
- Find pairs of reads sharing a k-mer, k ~ 24
- Extend to full alignment; throw away if not >98% similar
TAGATTACACAGATTAC
TAGATTACACAGATTAC
- Caveat: repeats
- A k-mer that occurs N times causes O(N^2) read/read comparisons
- ALU k-mers could cause up to 1,000,000^2 comparisons
- Solution
- Discard all k-mers that occur too often
- Set cutoff to balance sensitivity/speed tradeoff, according to genome at hand and computing resources available
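The k-mer filtering step can be sketched with a simple in-memory index. This is a minimal illustration under several assumptions: it looks at the forward strand only (orientation is ignored), uses k = 9 on three toy reads instead of k around 24 on millions of reads, and the function name and the `max_occurrences` cutoff parameter are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def candidate_overlaps(reads, k, max_occurrences):
    """Return pairs of read indices that share at least one k-mer,
    ignoring k-mers that occur in more than max_occurrences reads."""
    index = defaultdict(set)   # k-mer -> set of read indices containing it
    for read_id, seq in enumerate(reads):
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].add(read_id)

    pairs = set()
    for kmer, read_ids in index.items():
        if len(read_ids) > max_occurrences:
            continue   # too frequent, likely a repeat: skip its O(N^2) pairs
        pairs.update(combinations(sorted(read_ids), 2))
    return pairs

reads = [
    "aaactgcagtacggatct",
    "gggcccaaactgcagtac",
    "gtacggatctactacaca",
]
print(sorted(candidate_overlaps(reads, k=9, max_occurrences=2)))
# [(0, 1), (0, 2)]
```

Each candidate pair would then be extended to a full alignment and kept only if it passes the similarity threshold; lowering `max_occurrences` trades sensitivity for speed.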
381. Find Overlapping Reads
- Create local multiple alignments from the
overlapping reads
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA
391. Find Overlapping Reads
- Correct errors using multiple alignment
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTATTGA
TAG-TTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG-TTACACAGATTACTGA
TAG-TTACACAGATTATTGA
insert A
correlated errors probably caused by repeats → disentangle overlaps
replace T with C
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
In practice, error correction removes up to 98% of the errors
TAG-TTACACAGATTATTGA
TAG-TTACACAGATTATTGA
402. Merge Reads into Contigs
- Overlap graph
- Nodes: reads r1 ... rn
- Edges: overlaps (ri, rj, shift, orientation, score)
Reads that come from two regions of the genome (blue and red) that contain the same repeat
Note: of course, we don't know the color of these nodes
412. Merge Reads into Contigs
Unique Contig
Overcollapsed Contig
- We want to merge reads up to potential repeat
boundaries
422. Merge Reads into Contigs
- Ignore non-maximal reads
- Merge only maximal reads into contigs
432. Merge Reads into Contigs
r
r1
r2
r3
- Remove transitively inferable overlaps
- If read r overlaps reads r1 and r2 to the right, and r1 overlaps r2, then (r, r2) can be inferred from (r, r1) and (r1, r2)
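Removing transitively inferable overlaps can be sketched on a small directed graph. This is a simplified illustration: it assumes overlaps are stored as a hypothetical dictionary from each read to the set of reads it overlaps to the right, and it ignores the shift, orientation and score that a real overlap edge carries.

```python
def remove_transitive_overlaps(overlaps):
    """overlaps: dict mapping each read to the set of reads it overlaps
    to the right. Returns a new dict without transitively inferable edges."""
    reduced = {r: set(targets) for r, targets in overlaps.items()}
    for r, targets in overlaps.items():
        for r1 in targets:
            for r2 in overlaps.get(r1, ()):
                # (r, r2) follows from (r, r1) and (r1, r2): drop it.
                if r2 in targets and r2 != r1:
                    reduced[r].discard(r2)
    return reduced

# Hypothetical overlaps among four reads laid out left to right.
overlaps = {
    "r": {"r1", "r2", "r3"},
    "r1": {"r2", "r3"},
    "r2": {"r3"},
    "r3": set(),
}
print(remove_transitive_overlaps(overlaps))
# {'r': {'r1'}, 'r1': {'r2'}, 'r2': {'r3'}, 'r3': set()}
```

After the reduction each read points only to its immediate right neighbour, so a unique stretch of the genome becomes a simple chain.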
442. Merge Reads into Contigs
452. Merge Reads into Contigs
repeat boundary???
sequencing error
b
a
b
a
- Ignore hanging reads, when detecting repeat
boundaries
46Overlap graph after forming contigs
Unitigs (Gene Myers, '95)
47Repeats, errors, and contig lengths
- Repeats shorter than read length are easily resolved
- Read that spans across a repeat disambiguates order of flanking regions
- Repeats with more base pair diffs than sequencing error rate are OK
- We throw away overlaps between two reads in different copies of the repeat
- To make the genome appear less repetitive, try to
- Increase read length
- Decrease sequencing error rate
- Role of error correction
- Discards up to 98% of single-letter sequencing errors
- → decreases error rate
- → decreases effective repeat content
- → increases contig length
482. Merge Reads into Contigs
- Insert non-maximal reads whenever unambiguous
493. Link Contigs into Supercontigs
Normal density
Too dense → Overcollapsed
Inconsistent links → Overcollapsed?
503. Link Contigs into Supercontigs
Find all links between unique contigs
Connect contigs incrementally, if ≥ 2 links
supercontig (aka scaffold)
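The linking rule can be sketched with a union-find structure that joins two contigs only when enough mate pairs connect them. This is an illustrative simplification: it groups contigs but does not order or orient them, it assumes the input contigs are already the unique ones, and the names and the `min_links` parameter are hypothetical.

```python
from collections import Counter

def build_supercontigs(contigs, mate_links, min_links=2):
    """Group contigs joined by at least min_links mate-pair links.
    mate_links: iterable of (contig_a, contig_b), one entry per mate pair.
    Returns a sorted list of sorted groups of contig names."""
    parent = {c: c for c in contigs}

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]   # path halving
            c = parent[c]
        return c

    # Count links per unordered contig pair.
    link_counts = Counter(tuple(sorted(link)) for link in mate_links)
    for (a, b), count in link_counts.items():
        if count >= min_links:
            parent[find(a)] = find(b)

    groups = {}
    for c in contigs:
        groups.setdefault(find(c), []).append(c)
    return sorted(sorted(g) for g in groups.values())

contigs = ["c1", "c2", "c3", "c4"]
links = [("c1", "c2"), ("c2", "c1"), ("c2", "c3"),
         ("c3", "c4"), ("c4", "c3"), ("c3", "c4")]
print(build_supercontigs(contigs, links))
# [['c1', 'c2'], ['c3', 'c4']]
```

The single link between c2 and c3 is not enough to join them, which is the point of requiring at least two links: one stray mate pair may be a chimeric or misplaced read.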
513. Link Contigs into Supercontigs
Fill gaps in supercontigs with paths of repeat
contigs
524. Derive Consensus Sequence
TAGATTACACAGATTACTGA TTGATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAAACTA
TAG TTACACAGATTATTGACTTCATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGGGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA CTA
- Derive multiple alignment from pairwise read
alignments
- Derive each consensus base by weighted voting (Alternative: take maximum-quality letter)
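Weighted voting over alignment columns can be sketched as follows. The sketch assumes the reads are already placed in a multiple alignment of equal-length strings, with '-' for a gap and a space where a read does not cover the column, and that each read comes with one hypothetical quality weight per column; it is not the scoring scheme of any specific assembler.

```python
from collections import defaultdict

def consensus(aligned_reads, qualities):
    """Derive one consensus base per alignment column by weighted voting.
    aligned_reads: equal-length strings ('-' = gap, ' ' = not covered).
    qualities: per-read lists with one weight per column."""
    result = []
    for col in range(len(aligned_reads[0])):
        votes = defaultdict(float)
        for read, quals in zip(aligned_reads, qualities):
            base = read[col]
            if base != " ":
                votes[base] += quals[col]
        best = max(votes, key=votes.get) if votes else "N"
        if best != "-":
            result.append(best)   # a winning gap contributes no base
    return "".join(result)

reads = ["ACGT", "ACCT", "ACCT"]
# One high-quality read outvotes two low-quality reads in column 2.
print(consensus(reads, [[40] * 4, [10] * 4, [10] * 4]))   # ACGT
# With equal weights the majority wins instead.
print(consensus(reads, [[1] * 4, [1] * 4, [1] * 4]))      # ACCT
```

The alternative mentioned above, taking the maximum-quality letter, would replace the summed votes with the single highest weight seen for each base.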
53Some Assemblers
- PHRAP
- Early assembler, widely used, good model of read errors
- Overlap O(n^2) → layout (no mate pairs) → consensus
- Celera
- First assembler to handle large genomes (fly, human, mouse)
- Overlap → layout → consensus
- Arachne
- Public assembler (mouse, several fungi)
- Overlap → layout → consensus
- Phusion
- Overlap → clustering → PHRAP → assemblage → consensus
- Euler
- Indexing → Euler graph → layout by picking paths → consensus
54Quality of assemblies
Celera's assemblies of human and mouse
55Quality of assemblies: mouse
56Quality of assemblies: mouse
Terminology: N50 contig length. If we sort contigs from largest to smallest, and start covering the genome in that order, N50 is the length of the contig that just covers the 50th percentile.
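The N50 definition translates directly into a few lines of code. One assumption to note: this sketch measures the 50th percentile against the total length of the contigs themselves, since the true genome length is usually not known to the assembler; the function name and the example lengths are made up.

```python
def n50(contig_lengths):
    """Length of the contig at which the running total, taken from largest
    to smallest, first reaches half of the summed contig length."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    return 0   # no contigs

# Hypothetical contig lengths (total 300, half 150).
print(n50([80, 70, 50, 40, 30, 20, 10]))   # 70
```

Here the two largest contigs (80 + 70) already reach half of the total, so N50 is 70.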
57Quality of assemblies: rat
58History of WGA
- 1982: λ-virus, 48,502 bp
- 1995: H. influenzae, 1 Mbp
- 2000: fly, 100 Mbp
- 2001 - present
- human (3 Gbp), mouse (2.5 Gbp), rat, chicken, dog, chimpanzee, several fungal genomes
1997
- Gene Myers: "Let's sequence the human genome with the shotgun strategy"
- Phil Green: "That is impossible, and a bad idea anyway"
59Genomes Sequenced
- http://www.genome.gov/10002154