Title: Compartmentalized Shotgun Assembly
1 Compartmentalized Shotgun Assembly
CSA: Two stated motivations?
2 Matcher: matched
- matched Celera reads with PFP bactigs,
- 20.76 million Celera reads matched (76%),
- 0.62 million had a mate pair that matched,
- 2.97 million Celera reads were unique and
unscreened,
- 1.189 Gbp of unique DNA sequence, at 5.11X, yields
a predicted 240 Mbp of unique Celera sequence.
3 Combining Assembler: assembles Celera and PFP
sequence for a transient assembly
- first, Celera reads,
- are checked for over-collapsed regions,
- sequences with mate pairs that match the region are
kept,
- more mate pair matches means a higher value assembly,
- then Celera reads are combined with PFP reads,
- Greedy program recognizes highest value
assemblies first in order to build contigged
sequence,
- then Stones to fill the gaps.
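The greedy step above can be illustrated with a toy sketch. This is not the Combining Assembler itself: the `Candidate` record, its coordinates, and the use of the mate pair count as the "value" are all assumptions made for illustration. The idea shown is only "take the highest value pieces first, skip anything that conflicts, and report the gaps left over".

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical sub-assembly placed on a BAC region (half-open interval)."""
    name: str
    start: int
    end: int
    mate_pair_hits: int  # assumed stand-in for the assembly's "value"


def greedy_layout(candidates):
    """Accept candidates from highest to lowest value, skipping overlaps."""
    chosen = []
    for cand in sorted(candidates, key=lambda c: c.mate_pair_hits, reverse=True):
        # Keep the candidate only if it overlaps nothing already accepted.
        if all(cand.end <= c.start or cand.start >= c.end for c in chosen):
            chosen.append(cand)
    return sorted(chosen, key=lambda c: c.start)


def remaining_gaps(layout):
    """Intervals between consecutive accepted candidates (left for gap filling)."""
    return [(a.end, b.start) for a, b in zip(layout, layout[1:]) if b.start > a.end]


candidates = [
    Candidate("A", 0, 100, 5),
    Candidate("B", 80, 200, 9),
    Candidate("C", 210, 300, 2),
    Candidate("D", 190, 260, 1),
]
layout = greedy_layout(candidates)
print([c.name for c in layout], remaining_gaps(layout))
```

In this toy input, B wins over the overlapping A and D because it has the most mate pair support, and the interval between B and C is what a later gap-filling step would target.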
4 Results: PFP vs. CSA
- The GenBank (PFP) data for the Phase 1 and 2 BACs
yielded an average of 19.8 bactigs per BAC, of
average size 8,099 bp,
- Application of the Combining Assembler resulted
in individual Celera/BAC assemblies being put
together into an average of 1.83 scaffolds
(median of 1 scaffold) per BAC region consisting
of an average of 8.57 contigs of average size
18,973 bp.
p. 1313, 1st column, last paragraph
5 Compartmentalized Shotgun Assembly
6 Celera Unique Scaffolds: WGA
- The 5.89 million Celera fragments not matching
the GenBank data were assembled with the
whole-genome assembler.
- The Celera assembly resulted in a set of
scaffolds totaling 442 Mbp in span and consisting
of 326 Mbp of sequence. More than 20% of the
scaffolds were >5 kbp long, and these averaged
63% sequence and 27% gaps with a total of 302 Mbp
of sequence.
7 Compartmentalized Shotgun Assembly
8 Tiler: tiles
- Scaffolds into larger components using
- Mate End Pairs,
- BAC-end pairs,
- STS,
- Heuristic: a rule of thumb, simplification, or
educated guess that reduces or limits the search
for solutions in domains that are difficult and
poorly understood. Unlike algorithms, heuristics
do not guarantee optimal (or even feasible)
solutions and are often used with no theoretical
guarantee.
9 Compartmentalized Shotgun Assembly
- 3,845 Components
- shredded, WGA
10 93
- >100 kbp scaffolds
- 92% sequence, 8% gaps,
- 105,264 gaps, 1,935 scaffolds,
- 1.3 Mbp scaffold size, 23,242 bp contig size.
- >49% of gaps <500 bp,
- >62% of gaps <1 kb,
- No gap larger than 100 kbp.
11 How do you compare assemblies?
12 WGA vs. CSA
- This gives some measure of consistent coverage
- 1.982 Gbp (95.00%) of the WGA is covered by the
CSA,
- 2.169 Gbp (87.69%) of the CSA is covered by the
WGA.
- Only 31 scaffolds were unique to an assembly,
- 295 kb (0.012%) CSA inconsistent with WGA,
- 2.108 Mb (0.11%) WGA inconsistent with CSA,
Overall, CSA slightly better than WGA. Why? How
does the CSA compare with the Clone-by-Clone
approach?
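One way to read "X Gbp of assembly A is covered by assembly B" is as the total length of A touched by at least one alignment to B, counted once where alignments overlap. The sketch below shows that bookkeeping only. The interval representation (half-open `(start, end)` pairs on one assembly) and the function names are assumptions for illustration, not the comparison pipeline used for the slide's figures.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extends or sits inside the previous interval: grow it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]


def covered_fraction(aligned_intervals, assembly_length):
    """Fraction of an assembly covered by at least one aligned interval."""
    covered = sum(end - start for start, end in merge_intervals(aligned_intervals))
    return covered / assembly_length


# Toy example: three alignments on a 100 bp "assembly", two of them overlapping.
alignments = [(0, 10), (5, 20), (30, 40)]
print(merge_intervals(alignments), covered_fraction(alignments, 100))
```

Running the same calculation in both directions (A against B, then B against A) gives the two asymmetric percentages reported on the slide.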
13 Map first, then sequence
Sequence first, then map
14 Mapping Scaffolder: GM99 and fingerprint maps
15 Mapping Scaffolder: GM99 and fingerprint maps
16 Tab. 4
17 Assembly and Validation Analysis: did it really
work?
- Completeness of euchromatic sequence in the
assembly,
- estimate the size and number of gaps (Table 3),
18 Assembly and Validation Analysis: did it really
work?
- Completeness of euchromatic sequence in the
assembly,
- estimate the size and number of gaps (Table 3),
- compare to finished sequences of chromosomes 21 and 22,
- 3.4 Mb gaps, 75 gaps are repeats,
- match with STS data (ePCR, BLAST),
- 93.4% of those tested were found assembled, 5.5% in chaff
(98.9% in total),
- Correctness
- Mate-Pair analysis.
19 Mate Pair Analysis
- Valid: correct orientation and correct distance
(within 3 SD)
2.7% were found to be invalid.
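The validity rule above can be written as a short check. This is a minimal sketch under stated assumptions: the `MatePair` record, the convention that a correctly oriented pair is a '+' read to the left of a '-' read, and the library mean and SD values in the example are all hypothetical, chosen only to make the rule concrete.

```python
from dataclasses import dataclass


@dataclass
class MatePair:
    """Hypothetical placement of the two reads of a mate pair on a scaffold."""
    left_pos: int
    right_pos: int
    left_strand: str   # '+' or '-'
    right_strand: str  # '+' or '-'


def is_valid(pair, lib_mean, lib_sd, n_sd=3.0):
    """Valid = reads face each other and their distance is within n_sd SDs of the mean."""
    oriented = (
        pair.left_strand == "+"
        and pair.right_strand == "-"
        and pair.left_pos < pair.right_pos
    )
    distance = pair.right_pos - pair.left_pos
    return oriented and abs(distance - lib_mean) <= n_sd * lib_sd


def invalid_fraction(pairs, lib_mean, lib_sd):
    """Share of pairs that fail the validity check (0.0 for an empty list)."""
    if not pairs:
        return 0.0
    bad = sum(1 for p in pairs if not is_valid(p, lib_mean, lib_sd))
    return bad / len(pairs)


pairs = [
    MatePair(1000, 3050, "+", "-"),  # distance 2050: within 2000 +/- 600
    MatePair(1000, 4000, "+", "-"),  # distance 3000: too far apart
    MatePair(1000, 3000, "+", "+"),  # wrong orientation
]
print(invalid_fraction(pairs, lib_mean=2000, lib_sd=200))
```

A low invalid fraction across all libraries is the kind of evidence the slide uses to argue the assembly is correct, since misjoined contigs tend to produce pairs with the wrong orientation or spacing.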
20 CSA vs. PFP
What does this show?
21 Chromosome 21
Yellow: Same Orientation
Red: Out of Order, Orientation
22 Chromosome 8
24 What's the take-home message?
25 PFP
CSA
Fig. 7, key
26 Fig. 7
27 Gene Prediction and Annotation: Why's it So Hard
to Find Genes?
- Exons/Introns,
- Alternative Splicing/Termination,
- Alternate transcription start/stop sites,
- Tandem Repeats, Pseudogenes, etc.
- We don't really understand all there is to know
about gene and genome structure,
- etc.
28 Gene Number Predictions? Before PFP, WGA or CSA
- Textbooks: 100,000
- Upgraded to 142,634? (EST data)
- counts that fall far short
- EST data --> 35,000
- 35,000 genes based on the density of Chromosome
22
- 28,000 - 34,000: humans vs. pufferfish
29 Automated Gene Annotation: OTTO
- Tell me how it works.
- How was it validated, including Table 7.
- if necessary, use the Online Primer and other
NCBI resources to broaden your understanding,
- cDNAs, ESTs, RefSeq, Protein Sequence Databases,
BLAST, etc. are described in appropriate detail
on the Web.
30 Questions?
32 Repeat Resolver: ...most of the remaining gaps
were due to repeats.
Rocks: use low Discriminator Value contig
sets to fill gaps,
- find two or more mate pairs with unambiguous
matches in the scaffold near the gap (2 kb, 10 kb
or 50 kb), (1 in 10^7),
Stones:
- find mate pair matches 2 kb, 10 kb, and 50 kb
from gap, place the mate in the gap, check to see
if it's consistent with other placed sequences.
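The first part of the Stones step (finding reads near a gap whose mates would land inside it) can be sketched as follows. This covers only candidate selection, not the consistency check against other placed sequences. The tuple layout for anchored reads, the `tolerance` parameter, and the rule "a '+' read expects its mate one insert length downstream, a '-' read one insert length upstream" are assumptions for illustration.

```python
def stone_candidates(gap, anchored_reads, insert_size, tolerance=0):
    """Return (read_id, expected_mate_pos) for reads whose mate should fall in the gap.

    gap            -- (start, end) coordinates of the gap on the scaffold
    anchored_reads -- list of (read_id, position, strand) for placed reads
                      whose mate is not yet placed
    insert_size    -- library insert size, e.g. 2000, 10000 or 50000
    tolerance      -- slack allowed on either side of the gap
    """
    gap_start, gap_end = gap
    hits = []
    for read_id, pos, strand in anchored_reads:
        # Project where the unplaced mate should sit, given the library size.
        expected = pos + insert_size if strand == "+" else pos - insert_size
        if gap_start - tolerance <= expected <= gap_end + tolerance:
            hits.append((read_id, expected))
    return hits


gap = (10000, 12000)
reads = [("r1", 8500, "+"), ("r2", 14000, "-"), ("r3", 1000, "+")]
print(stone_candidates(gap, reads, insert_size=2000))
```

Calling the function once per library size (2 kb, 10 kb, 50 kb) would gather candidates from all three distances mentioned on the slide. Each candidate would still need to be checked for consistency with the other sequences placed in the gap.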