Genome sequence assembly - PowerPoint PPT Presentation

Provided by: cbcb6
2
Genome sequence assembly
  • Assembly concepts and methods
  • Mihai Pop
  • Center for Bioinformatics and Computational
    Biology
  • University of Maryland

3
Building a library
  • Break DNA into random fragments (8-10x coverage)


Actual situation
4
Building a library
  • Break DNA into random fragments (8-10x coverage)
  • Sequence the ends of the fragments
  • Amplify the fragments in a vector
  • Sequence 800-1000 (500-700) bases at each end of
    the fragment


5
Assembling the fragments
6
Forward-reverse constraints
  • The sequenced ends are facing towards each other
  • The distance between the two sequenced ends is known
    (within certain experimental error)

[Figure: a clone and its insert, with the forward (F) and reverse (R) reads at the two ends]
7
Building Scaffolds
  • Break DNA into random fragments (8-10x coverage)
  • Sequence the ends of the fragments
  • Assemble the sequenced ends
  • Build scaffolds


8
Assembly gaps
  • Sequencing gap: we know the order and orientation of the contigs and
    have at least one clone spanning the gap
  • Physical gap: no information is known about the adjacent contigs, nor
    about the DNA spanning the gap
9
Unifying view of assembly
Assembly
Scaffolding
10
Shotgun sequencing statistics
11
Typical contig coverage
Imagine raindrops on a sidewalk
12
Lander-Waterman statistics
  • $L$ = read length
  • $T$ = minimum detectable overlap
  • $G$ = genome size
  • $N$ = number of reads
  • $c$ = coverage ($NL/G$)
  • $\sigma = 1 - T/L$
  • $E(\text{islands}) = N e^{-c\sigma}$
  • $E(\text{island size}) = L\left(\frac{e^{c\sigma} - 1}{c} + 1 - \sigma\right)$
  • contig = island with 2 or more reads

13
Example
Genome size: 1 Mbp; read length: 600 bp; detectable overlap: 40 bp
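The Lander-Waterman expectations can be evaluated directly for these example parameters. The following is a minimal sketch, not part of the original slides: the function name and the chosen coverage values are illustrative assumptions, and it uses sigma = 1 - T/L, E(islands) = N e^(-c sigma), and E(island size) = L((e^(c sigma) - 1)/c + 1 - sigma).

```python
import math

def lander_waterman(genome_size, read_len, min_overlap, coverage):
    """Return (number of reads, expected islands, expected island size in bp).

    Hypothetical helper for illustration; argument names are assumptions.
    """
    n_reads = coverage * genome_size / read_len       # N = cG / L
    sigma = 1 - min_overlap / read_len                # sigma = 1 - T/L
    islands = n_reads * math.exp(-coverage * sigma)   # E(islands) = N e^(-c sigma)
    island_size = read_len * (
        (math.exp(coverage * sigma) - 1) / coverage + 1 - sigma
    )
    return n_reads, islands, island_size

if __name__ == "__main__":
    # Example from the slide: G = 1 Mbp, L = 600, T = 40.
    for c in (1, 3, 5, 8):
        n, islands, size = lander_waterman(1_000_000, 600, 40, c)
        print(f"{c}x: reads={n:.0f} islands={islands:.1f} island size={size:.0f} bp")
```

The number of expected islands falls off exponentially with coverage, which is why shotgun projects aim for the 8-10x range mentioned earlier.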
14
Experimental data
Caveat: numbers based on artificially chopping up the genome of
Wolbachia pipientis dMel
15
Read coverage vs. Clone coverage
[Figure: 4 kbp insert with a 1 kbp read at each end]
Read coverage 8X; clone (insert) coverage 16X
2X coverage in BAC-ends implies 100X coverage by
BACs (1 BAC clone is approx. 100 kbp)
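The relation between read coverage and clone coverage is simple arithmetic, sketched below. This assumes two end reads of 1 kbp per clone, which is inferred from the figure labels rather than stated on the slide; the function name is hypothetical.

```python
def clone_coverage(read_coverage, insert_len, read_len, reads_per_clone=2):
    """Clone (insert) coverage implied by a given read coverage.

    Each clone contributes reads_per_clone * read_len sequenced bases but
    spans insert_len bases of the genome, so coverage scales by that ratio.
    """
    return read_coverage * insert_len / (reads_per_clone * read_len)

if __name__ == "__main__":
    print(clone_coverage(8, 4_000, 1_000))    # 4 kbp inserts at 8X read coverage
    print(clone_coverage(2, 100_000, 1_000))  # 100 kbp BACs at 2X BAC-end coverage
```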
16
Assembly paradigms
  • Overlap-layout-consensus
    • greedy (TIGR Assembler, phrap, CAP3...)
    • graph-based (Celera Assembler, Arachne)
  • Eulerian path (especially useful for short read
    sequencing)

17
TIGR Assembler/phrap
  • Greedy
  • Build a rough map of fragment overlaps
  • Pick the largest scoring overlap
  • Merge the two fragments
  • Repeat until no more merges can be done
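The greedy loop above can be sketched in a few lines. This is a toy illustration, not the actual TIGR Assembler or phrap code: it assumes error-free reads, scores an overlap by its exact suffix-prefix length, ignores reverse complements, and uses hypothetical function names.

```python
def suffix_prefix_overlap(a, b, min_len):
    """Length of the longest suffix of a equal to a prefix of b (0 if < min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop reads fully contained in another read.
    contigs = [r for r in contigs if not any(r != o and r in o for o in contigs)]
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = suffix_prefix_overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return contigs  # no more merges can be done
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]

if __name__ == "__main__":
    genome = "ATGGCGTGCAATCCGTTAGC"
    reads = [genome[0:8], genome[5:13], genome[10:18], genome[14:20]]
    print(greedy_assemble(reads))
```

Because each merge is chosen locally, a greedy assembler can join reads across a repeat incorrectly; the later slides on repeats address that weakness.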

18
Overlap-layout-consensus
Main entity: read. Relationship between reads: overlap.
[Figure: overlap graph and layout of reads 1-9, with a consensus example over the sequences ACCTGA, ACCTGA, AGCTGA, ACCAGA]
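The consensus step can be illustrated with a per-column majority vote. This is a minimal sketch under the assumption that the four sequences on this slide are already-aligned reads of equal length; real assemblers also use quality values and handle gaps, and the function name is hypothetical.

```python
from collections import Counter

def column_consensus(aligned_reads):
    """Majority base in each column of equal-length, pre-aligned reads."""
    if len({len(r) for r in aligned_reads}) != 1:
        raise ValueError("reads must be aligned to the same length")
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*aligned_reads)
    )

if __name__ == "__main__":
    print(column_consensus(["ACCTGA", "ACCTGA", "AGCTGA", "ACCAGA"]))
```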
19
Paths through graphs and assembly
  • Hamiltonian circuit: visit each node (city)
    exactly once, returning to the start

Genome
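To make the Hamiltonian-circuit view concrete, the sketch below brute-forces a circuit in a tiny directed graph whose nodes stand for reads and whose edges stand for overlaps. This is an illustration only: the graph is made up, the function name is hypothetical, and trying every permutation is infeasible beyond a handful of nodes, which is why practical assemblers rely on heuristics.

```python
from itertools import permutations

def hamiltonian_circuit(graph):
    """Return a node order visiting every node once and returning to the start.

    graph maps each node to the set of nodes it has an edge to.
    Returns None when no such circuit exists.
    """
    nodes = sorted(graph)
    if not nodes:
        return None
    start, rest = nodes[0], nodes[1:]
    for order in permutations(rest):
        tour = [start, *order]
        # Every consecutive pair, plus the closing edge back to the start.
        if all(b in graph[a] for a, b in zip(tour, tour[1:] + [start])):
            return tour
    return None

if __name__ == "__main__":
    # Hypothetical overlap graph for a small circular genome.
    overlaps = {1: {2, 3}, 2: {3}, 3: {4}, 4: {1}}
    print(hamiltonian_circuit(overlaps))
```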
20
Implementation details
21
Overlap between two sequences
overlap (19 bases), identity 18/19 = 94.7%
overhang (6 bases)
AGCCTAGACCTACAGGATGCGCGGACACGTAGCCAGGAC
      CAGTACTTGGATGCGCTGACACGTAGCTTATCCGGT
overhang
  • overlap - region of similarity between the two sequences
  • overhang - un-aligned ends of the sequences
  • The assembler screens merges based on
    • length of overlap
    • identity in overlap region
    • maximum overhang size
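The three screening criteria can be checked on the two sequences shown on this slide. The sketch below assumes the overlap coordinates are already known (offset 14 in the first sequence, offset 8 in the second, length 19); the thresholds and all function and field names are illustrative assumptions, not values used by any particular assembler.

```python
from typing import NamedTuple

class OverlapStats(NamedTuple):
    length: int
    matches: int
    identity: float
    left_overhang: int
    right_overhang: int

def overlap_stats(a, b, a_start, b_start, length):
    """Identity and overhangs for an ungapped overlap a[a_start:] ~ b[b_start:]."""
    seg_a = a[a_start:a_start + length]
    seg_b = b[b_start:b_start + length]
    matches = sum(x == y for x, y in zip(seg_a, seg_b))
    # Overhang: bases of both sequences that sit side by side but are unaligned.
    left = min(a_start, b_start)
    right = min(len(a) - (a_start + length), len(b) - (b_start + length))
    return OverlapStats(length, matches, matches / length, left, right)

def passes_screen(stats, min_overlap=15, min_identity=0.9, max_overhang=10):
    """Accept a merge only if all three criteria hold (hypothetical thresholds)."""
    return (stats.length >= min_overlap
            and stats.identity >= min_identity
            and max(stats.left_overhang, stats.right_overhang) <= max_overhang)

if __name__ == "__main__":
    a = "AGCCTAGACCTACAGGATGCGCGGACACGTAGCCAGGAC"
    b = "CAGTACTTGGATGCGCTGACACGTAGCTTATCCGGT"
    stats = overlap_stats(a, b, 14, 8, 19)
    print(stats, passes_screen(stats))
```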

22
All pairs alignment
  • Needed by the assembler
  • Try all pairs: must consider $n^2$ pairs
  • Smarter solution: only n x coverage (e.g. 8)
    pairs are possible
    • Build a table of k-mers contained in sequences
      (single pass through the genome)
    • Generate the pairs from k-mer table (single pass
      through k-mer table)

k-mer
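The k-mer table idea can be sketched as follows: index every k-mer by the reads that contain it, then emit only the read pairs that share at least one k-mer as candidates for full alignment. This is a simplified illustration with hypothetical names; it ignores reverse complements and does not filter out highly frequent k-mers.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(reads, k):
    """Return the set of read-index pairs (i, j), i < j, sharing a k-mer."""
    table = defaultdict(set)
    # Pass 1: build the k-mer table in a single pass over the reads.
    for read_id, seq in enumerate(reads):
        for pos in range(len(seq) - k + 1):
            table[seq[pos:pos + k]].add(read_id)
    # Pass 2: generate candidate pairs in a single pass over the table.
    pairs = set()
    for read_ids in table.values():
        pairs.update(combinations(sorted(read_ids), 2))
    return pairs

if __name__ == "__main__":
    reads = ["AAACCCGG", "CCGGTTTT", "GAGAGAGA"]
    print(candidate_pairs(reads, 4))
```

Only the candidate pairs then need the more expensive alignment step, instead of every possible pair of reads.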
24
REPEATS
25
[Figure: reads 1-13 and the repeats RptA and RptB]
26
Non-repetitive overlap graph
27
Handling repeats
  • Repeat detection
    • pre-assembly: find fragments that belong to
      repeats
      • statistically (most existing assemblers)
      • repeat database (RepeatMasker)
    • during assembly: detect "tangles" indicative of
      repeats (Pevzner, Tang, Waterman 2001)
    • post-assembly: find repetitive regions and
      potential mis-assemblies
      • Reputer, RepeatMasker
      • "unhappy" mate-pairs (too close, too far,
        mis-oriented)
  • Repeat resolution
    • find DNA fragments belonging to the repeat
    • determine correct tiling across the repeat
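The "unhappy" mate-pair check can be sketched as a small classifier over read placements in a contig. This is an illustrative assumption about how such a check might look: the coordinate convention, the tolerance parameter, and all names are hypothetical, and real tools derive the allowed range from the library's measured insert-size distribution.

```python
from dataclasses import dataclass

@dataclass
class PlacedRead:
    start: int   # leftmost contig coordinate of the read
    end: int     # rightmost contig coordinate (exclusive)
    strand: str  # '+' (forward) or '-' (reverse)

def classify_mate_pair(r1, r2, mean_insert, tolerance):
    """Label a mate pair as happy, too close, too far, or mis-oriented."""
    left, right = sorted((r1, r2), key=lambda r: r.start)
    # Mates must face each other: forward read on the left, reverse on the right.
    if not (left.strand == "+" and right.strand == "-"):
        return "mis-oriented"
    span = right.end - left.start  # implied insert size
    if span < mean_insert - tolerance:
        return "too close"
    if span > mean_insert + tolerance:
        return "too far"
    return "happy"

if __name__ == "__main__":
    fwd = PlacedRead(0, 800, "+")
    rev = PlacedRead(3200, 4000, "-")
    print(classify_mate_pair(fwd, rev, mean_insert=4000, tolerance=400))
```

A cluster of unhappy pairs in one region is the signal of a potential mis-assembly; a single unhappy pair may just be a chimeric clone.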

28
Statistical repeat detection
  • Significant deviations from average coverage are
    flagged as repeats.
    • frequent k-mers are ignored
    • arrival rate of reads in contigs is compared
      with the theoretical value
      (e.g., 800 bp reads at 8x coverage: reads
      "arrive" every 100 bp)
  • Problem 1: assumption of uniform distribution of
    fragments leads to false positives
    • non-random libraries
    • poor clonability regions
  • Problem 2: repeats with low copy number are
    missed, leading to false negatives
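The arrival-rate comparison can be sketched as follows. The expected spacing (read length divided by coverage) comes from the slide's example; the Poisson-style threshold of z standard deviations is an assumption chosen for illustration, not the statistic used by any specific assembler, and the function names are hypothetical.

```python
import math

def expected_spacing(read_len, coverage):
    """Average distance between read starts, e.g. 800 bp at 8x -> 100 bp."""
    return read_len / coverage

def is_repeat_candidate(contig_len, n_reads, read_len, coverage, z=3.0):
    """Flag a contig whose read count is well above the uniform expectation.

    Assumes read starts are Poisson-distributed, so the standard deviation
    of the count is sqrt(expected); z is an illustrative cutoff.
    """
    expected = contig_len / expected_spacing(read_len, coverage)
    return n_reads > expected + z * math.sqrt(expected)

if __name__ == "__main__":
    print(expected_spacing(800, 8))
    print(is_repeat_candidate(10_000, 100, 800, 8))  # as expected for unique sequence
    print(is_repeat_candidate(10_000, 200, 800, 8))  # twice the expected read count
```

Both problems listed above show up here: a non-uniform library shifts the real expectation away from the model, and a two-copy repeat in a short contig may not exceed the threshold.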

29
Mis-assembled repeats
excision
collapsed tandem
rearrangement
30
SASA repeat (4776 AA, 14 kb) from Streptococcus
pneumoniae
MTETVEDKVSHSITGLDILKGIVAAGAVISGTVATQTKVFTNESAVLEKT
VEKTDALATNDTVVLGTISTSNSASSTSLSASESASTSASESASTSASTS
ASTSASESASTSASTSISASSTVVGSQTAAATEATAKKVEEDRKKPASDY
VASVTNVNLQSYAKRRKRSVDSIEQLLASIKNAAVFSGNTIVNGAPAINA
SLNIAKSETKVYTGEGVDSVYRVPIYYKLKVTNDGSKLTFTYTVTYVNPK
TNDLGNISSMRPGYSIYNSGTSTQTMLTLGSDLGKPSGVKNYITDKNGRQ
VLSYNTSTMTTQGSGYTWGNGAQMNGFFAKKGYGLTSSWTVPITGTDTSF
TFTPYAARTDRIGINYFNGGGKVVESSTTSQSLSQSKSLSVSASQSASAS
ASTSASASASTSASASASTSASASASTSASVSASTSASASASTSASASAS
TSASESASTSASASASTSASASASTSASASASTSASESASTSASASASTS
ASESASTSASASASTSASASASTSASGSASTSTSASASTSASASASTSAS
ASASISASESASTSASESASTSTSASASTSASESASTSASASASTSASAS
ASTSASASASTSASASTSASESASTSASASASTSASASASTSASASASTS
ASASASTSASVSASTSASASASTSASASASTSASESASTSASASASTSAS
ASASTSASASASTSASASASTSASASASTSASESASTSASASASTSASAS
ASTSASASASTSASASASTSASASASISASESASTSASASASTSASASAS
TSASASASTSASESASTSASASASTSASASASTSASASASTSASASASTS
ASASASTSASASASTSASESASTSASASASTSASESASTSASASASTSAS
ASASTSASASASTSASASASTSASASASTSASASASTSASASTSASESAS
TSASASASTSASASASTSASASASTSASESASTSASASASTSASASASTS
ASASASTSASASASTSASASASISASESASTSASASASTSASVSASTSAS
ASASTSASESASTSASASASTSASESASTSASASASTSASASASISASES
ASTSASASASTSASASASTSASASASTSASESASTSTSASASTSASESAS
TSASASASTSASASASTSASASASTSASASASTSASASTSASESASTSAS
ASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASAS
ASTSASASASTSASESASTSASASASTSASASASTSASASASTSASASAS
TSASVSASTSASESASTSASASASTSASASASTSASESASTSASASASTS
ASESASTSASASASTSASASASTSASASASTSASASASTSASASASTSAS
ASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASAS
ASTSASASASISASESASTSASASASTSASASASTSASVSASTSASASAS
TSASASASISASESASTSASASASTSASASASTSASASASTSASASASIS
ASESASTSASASASTSASASASTSASASASTSASASASTSASASASTSAS
ASASTSASASASTSASASASTSASASASTSASESASTSASASASTSASAS
ASTSASASASTSASVSASTSASESASTSASASASTSASASASTSASASAS
TSASESASTSASASASTSASASASTSASESASTSASASASTSASASASTS
ASASASTSASASASASTSASASASTSASASASTSASASASISASESASTS
ASESASTSTSASASTSASESASTSASASASTSASASASTSASASASTSAS
ASTSASESASTSASASASTSASASASTSASASASTSASASASTSASASAS
TSASVSASTSASASASTSASASASTSASESASTSASASASTSASASASTS
ASASASTSASASASTSASASASTSASESASTSASASASTSASASASTSAS
ASASTSASASASTSASASASISASESASTSASASASTSASASASTSASAS
ASTSASESASTSASASASTSASASASTSASASASTSASASASTSASASAS
TSASASASTSASESASTSASASASTSASESASTSASASASTSASASASTS
ASASASTSASASASTSASASASTSASASASTSASASTSASESASTSASAS
ASTSASASASTSASASASTSASESASTSASASASTSASASASTSASASAS
TSASASASTSASASASISASESASTSASASASTSASVSASTSASASASTS
ASESASTSASASASTSASESASTSASASASTSASASASISASESASTSAS
ASASTSASASASTSASASASTSASESASTSTSASASTSASESASTSASAS
ASTSASASASTSASASASTSASASASTSASASTSASESASTSASASASTS
ASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSAS
ASASTSASASASTSASASASTSASASASTSASESASTSASASASTSASAS
ASTSASASASTSASVSASTSASESASTSASASASTSASASASTSASASAS
TSASESASTSASASASTSASASASTSASESASTSASASASTSASASASTS
ASASASTSASASASASTSASASASTSASASASTSASASASISASESASTS
ASASASASTSASASASTSASASASTSASASASISASESASTSASESASTS
TSASASTSASESASTSASASASTSASASASTSASASASTSASASTSASES
ASTSASASASTSASASASTSASASASTSASASASTSASASASTSASVSAS
TSASASASTSASASASTSASESASTSASASTSASESASTSASASASTSAS
ASASTSASASASTSASESASTSASASASTSASASASTSASESASTSASAS
ASTSASASASTSASASASTSASESASTSASASASTSASESASTSASASAS
TSASASASTSASGSASTSTSASASTSASASASTSASASASISASESASTS
ASESASTSTSASASTSASESASTSASASASTSASASASTSASASASTSAS
ASTSASESASTSASASASTSASASASTSASASASTSASASASTSASVSAS
TSASASASTSASASASTSASESASTSASASASTSASASASTSASASASTS
ASASASTSASASASTSASESASTSASASASTSASASASTSASASASTSAS
ASASTSASASASISASESASTSASASASTSASASASTSASASASTSASES
ASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASAS
TSASESASTSASASASTSASESASTSASASASTSASASASTSASASASTS
ASASASTSASASASTSASASASTSASASTSASESASTSASASASTSASAS
ASTSASASASTSASESASTSASASASTSASASASTSASASASTSASASAS
TSASASASISASESASTSASASASTSASVSASTSASASASTSASESASTS
ASASASTSASESASTSASASASTSASASASISASESASTSASASASTSAS
ASASTSASASASTSASESASTSTSASASTSASESASTSASASASTSASAS
ASTSASASASTSASASASTSASASTSASESASTSASASASTSASASASTS
ASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSAS
ESASTSASASASTSASASASTSASASASTSASASASTSASVSASTSASES
ASTSASASASTSASASASTSASESASTSASASASTSASESASTSASASAS
TSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTS
ASASASTSASASASTSASASASTSASASASTSASASASTSASASASISAS
ESASTSASASASTSASASASTSASVSASTSASASASTSASASASISASES
ASTSASASASTSASASASTSASASASTSASASASISASESASTSASASAS
TSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTS
ASASASTSASASASTSASESASTSASASASTSASASASISASESASTSAS
ASASTSASASASTSASASASTSASESASTSTSASASTSASESASTSASAS
ASTSASASASTSASASASTSASASASTSASASTSASESASTSASASASTS
ASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSAS
ASASTSASESASTSASASASTSASASASTSASASASTSASASASTSASVS
ASTSASESASTSASASASTSASASASTSASESASTSASASASTSASESAS
TSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTS
ASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSAS
ASASISASESASTSASASASTSASASASTSASVSASTSASASASTSASAS
ASISASESASTSASASASTSASASASTSASASASTSASASASISASESAS
TSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTS
ASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSAS
ASASTSASASASTSVSNSANHSNSQVGNTSGSTGKSQKELPNTGTESSIG
SVLLGVLAAVTGIGLVAKRRKRDEEE