Title: Assembling and Annotating Genomes
1Assembling and Annotating Genomes
Deanna M. Church NCBI January 12, 2005
2Of mice and men
3Of mice and men
Fleischman et al. (1991) PNAS 88:10885-10889
Both carry mutations in the Kit gene.
4The Basic Model
5Why sequence?
I. Complete parts list for a given organism
Genes, promoters, regulatory regions, variation, ????
High quality, finished (or essentially finished) sequence
II. Genes, Genes, Genes
Draft is probably good enough
III. Annotating a finished genome (Human, soon to be mouse)
Low coverage (2X sequence coverage).
6What data is represented in GenBank
Data in GenBank is an interpretation of primary sequence data
Steps for small, single pass sequence
- Sequence reaction
- Read gel/call chromatogram (Phred/TraceTuner)
- Submit sequence
Last step for large molecules (BACs, fosmids, long cDNAs)
- Assemble sequence and submit consensus (Phrap, CAP3, CAP4)
7Getting the raw data
>500 Million Traces (and counting)
http://www.ncbi.nlm.nih.gov/Traces/
8Getting the raw data
And they just keep coming
9Getting the raw data
Scripted access for bulk retrieval
10Genome Sequencing Strategies
Not all bases are created equal
11Private and public efforts
Science (June, 1998)
Craig Venter
Science (September, 1998)
12Hierarchical Shotgun Assembly
Putting Genomes Together
This part is relatively cheap and easy
This part is hard and expensive
13HTGS keywords
htgs_phase0: low coverage sequence, 1-2X
htgs_phase1: generally 4-5X sequence coverage; several fragments, not ordered or oriented
htgs_phase2: sequence coverage can vary (generally 5-10X), but fragments are ordered and oriented
htgs_phase3: highly accurate, finished sequence. Error rate <10^-5
Draft sequence: phase 1 or 2, but >90% of the bases are high quality (phred 20 or better)
htgs_active_fin: center has finished shotgun phase and moved to finishing
htgs_cancelled: sequencing has been discontinued on this clone
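The "phred 20 or better" and "error rate <10^-5" figures above are related through the standard phred definition, Q = -10 log10(P), where P is the estimated probability that a base call is wrong. A minimal sketch follows; the function names and the list-of-integers input are assumptions made for illustration, not part of any NCBI tool.

```python
def phred_to_error_prob(q):
    """Convert a phred quality score to an estimated per-base error probability."""
    return 10 ** (-q / 10)


def fraction_high_quality(quals, threshold=20):
    """Fraction of bases whose phred score is at or above the threshold."""
    if not quals:
        return 0.0
    return sum(1 for q in quals if q >= threshold) / len(quals)


def meets_draft_criterion(quals, threshold=20, min_fraction=0.90):
    """True if more than 90% of bases are phred 20 or better (the slide's draft rule)."""
    return fraction_high_quality(quals, threshold) > min_fraction
```

Phred 20 corresponds to an estimated error of 1 in 100 per base, and phred 50 to 1 in 100,000, the same order as the phase 3 error rate quoted above.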
14The Raw Data
15Putting genomes together
UCSC: Jim Kent; NCBI: Paul Kitts, Greg Schuler, Richa Agarwala
- Remove contaminants (vector, E. coli, other organisms, virus)
- Bin clones by chromosome arm
- Incorporate clone order information using TPF
- Identify fragment overlaps
- Determine fragment order and orientation, remove sequence redundancy (this produces sequence contigs given NT_XXXXXX type accession numbers)
- Place contigs on chromosome
16Putting genomes together
UCSC: Jim Kent; NCBI: Paul Kitts, Greg Schuler, Richa Agarwala
Overlapping draft clones
When BAC clones overlap, the sequence can be made non-redundant. These contigs are given NT_XXXXXX accession numbers.
17Sequence Tagged Sites (STS)
M. Olson, L. Hood, C. Cantor, and D. Botstein. A common language for physical mapping of the human genome. Science 245, 1434-1435 (1989).
STS marker D6S1606
forward primer
microsatellite
GAGTTTGCACCATTGCACTCCAGCCTGGGCAAC (CA)n
AACGTGGCATGTGCCTGTACTCTCC CTCAAACGTGGTAACGTGAGGTCG
GACCCGTTG (GT)n TTGCACCGTACACGGACATGAGAGG
reverse primer
PCR product size 92 - 100 bases
18The Original Genome Resources - STS Maps
meiosis - genetic; radiation - RH; clones - clone based
genome
- Each line represents an individual cell line/animal that carries a particular break
- STSs can be amplified from DNA in these cell lines/animals
- Based on cell line/animal marker content, the breaks can be determined and the markers ordered.
19Electronic PCR (e-PCR)
STS marker D6S1606
microsatellite repeat
forward primer
GAGTTTGCACCATTGCACTCCAGCCTGGGCAACAAGAGTGAAACTCTGTC
ACAGA (CA)n AACGTGGCATGTGCCTGTACTCTC CTCAAACGTGGTA
ACGTGAGGTCGGACCCGTTGTTCTCACTTTGAGACAGTGTCT (GT)n
TTGCACCGTACACGGACATGAGAG
reverse primer
PCR product size 92 - 100 bases
Schuler (1997), Genome Research 7, 541-550
E-PCR software searches DNA sequences for exact
matches to both primers in correct order,
orientation, and spacing to be consistent with
known PCR product size.
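The search described above can be sketched in a few lines. This is an illustration of the idea only, not the e-PCR program itself: it scans one strand, requires exact primer matches, and the function names and the (start, size) return format are assumptions.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T, N only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def epcr_hits(template, fwd_primer, rev_primer, min_size, max_size):
    """Find (start, product_size) pairs where the forward primer is followed by
    the reverse primer's binding site at a spacing consistent with the expected
    PCR product size. Only the given strand of the template is searched."""
    template = template.upper()
    fwd = fwd_primer.upper()
    rev_site = reverse_complement(rev_primer)  # reverse primer as seen on this strand
    hits = []
    start = template.find(fwd)
    while start != -1:
        site = template.find(rev_site, start + len(fwd))
        while site != -1:
            size = site + len(rev_site) - start
            if size > max_size:
                break  # later sites can only give larger products
            if size >= min_size:
                hits.append((start, size))
            site = template.find(rev_site, site + 1)
        start = template.find(fwd, start + 1)
    return hits
```

For a marker like D6S1606, `min_size` and `max_size` would be set from the reported product range (92 to 100 bases), which varies with the length of the microsatellite repeat.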
20Electronic PCR (e-PCR)
http://www.ncbi.nlm.nih.gov/sutils/e-pcr/
21Putting genomes together
Ideally
Non-sequence based Map
22Putting genomes together
More like
23Human assembly Build 35
The Starting Material
Framework assemblies
388 contigs - 3.02 Gb
Contig Information
Type of source sequence   Number used   Length (bp)
Draft only                46            10,284,900
Finished only             334           2,833,780,000
N50 length: Contig length at which 50% of the bases in the assembly reside in a contig of at least that size.
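The N50 definition above translates directly into a short calculation: sort contigs from longest to shortest and accumulate lengths until half of the total bases are covered. A minimal sketch, with the function name and the plain list-of-lengths input assumed for illustration:

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths: the length L such that
    contigs of length >= L together contain at least 50% of all bases."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
```

For example, contigs of 2, 3, 4, 5 and 6 units total 20; the two longest (6 and 5) already hold 11 of those, so the N50 is 5.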
AGP: A Golden Path
http://www.ncbi.nlm.nih.gov/genome/guide/human/HsStats.html
24Current Human assembly Build 34 (the essentially finished genome)
Contig information
Range in kb   Number   Length (kb)   Percent of total
<300          218      30,276        1
300-1000      74       44,028        1.45
1000-5000     87       208,365       6.89
>5000         119      2,737,630     90.64
N50: 29,105
N50 length: Contig length at which 50% of the bases in the assembly reside in a contig of at least that size.
25Contigs and components in the MapViewer
26Mouse Genome Sequencing
27Putting genomes together
David Jaffe, Jim Mullikin
BAC clones were constructed and end sequenced before the WGS project started
WGS
For the mouse project, only 40 kb clones and BAC clones are available
End-sequence all clones and retain pairing information (mate-pairs)
Each end sequence is referred to as a read
28Putting genomes together
David Jaffe Jim Mullikin
Constructing Supercontigs (scaffolds)
29Intermediate assemblies
Sanger Institute: Jim Mullikin; WIBR: David Jaffe; NCBI: Richa Agarwala, Victor Sapojnikov, Wratko Hlavina, Deanna Church
30The Mouse Genome- MGSCv3
(The Mouse Genome Sequencing Consortium)
David Jaffe - Arachne
Jim Mullikin - Phusion
The Starting Material
Waterston et al, 2004
RPCI-23: 197 Kb; RPCI-24: 155 Kb
The Assembly
Total length of the assembly: 2.5 Gb (90.9% of genome; assumes a 2.75 Gb genome)
224,713 WGS contigs
CAAA01000100
42,620 Supercontigs
(274 finished BACs, 49.5 Mb)
31The Mouse Genome- over time
32Contig/Supercontig size by chromosome
[Chart: y-axis ticks 0-80; units not recoverable]
33How does MGSCv3 compare to non-sequence based maps?
80% of STS markers on WI-Genetic Map localized by e-PCR
72% of STS markers on WI/MRC RH Map localized by e-PCR
Chromosome 7
<3% chromosome conflict.
34Finished NT Contig By Build
Finished sequences are used to build hand-curated
contigs (NT contigs)
Currently 1.8 Gb of (mostly) non-redundant sequence; 1.1 Gb in Build 33
35The Mouse Genome- over time
NCBI Richa Agarwala
Mouse Build 30
36The Mouse Genome- combining resources
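NCBI: Richa Agarwala, Deanna Church
Unplaced versus Total curated Contigs, Build 30
[Chart; data labels in extraction order, chromosome assignment not recoverable: 0, 0.27, 1.93, 0.9, 0.56, 1.83, 1.19, 3.64, 1.38, 0, 0, 1.41, 4.07, 3.61, 1.27, 0, 5.56, 4.48, 2.94, 0, 100]
780 Mb of Curated NT Sequence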
37The Mouse Genome- combining resources
NCBI: Richa Agarwala, Deanna Church
Mmu4 unplaced contigs (Build 30)
10 unplaced NT contigs (11 GenBank accessions)
Do align to WGS contigs mapped to Mmu4: NT_039271, NT_039272, NT_039276, NT_039280
Align to WGS contigs mapped to another chromosome: NT_039273 (MmuX)
No hits/bad hits (mostly chrUn): NT_039269, NT_039270, NT_039274, NT_039278, NT_039279
38Segmental Duplications
Large, nearly identical copies of genomic DNA.
>1 Kb, >90% identity
Intrachromosomal
Interchromosomal
39Segmental Duplications
WGAC Analysis: Whole Genome Assembly Comparison
BLAST the genome against itself and look for sequence similarity.
Caveat: difficult to distinguish between biological duplication and artificial duplication introduced when producing draft assemblies.
WSSD Analysis: Whole Genome Shotgun Sequence Detection
BLAST WGS reads against an assembly and look for increased depth of coverage
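The depth-of-coverage idea behind WSSD can be illustrated with a toy calculation: count aligned reads per fixed window and flag windows whose count is well above the typical value. This is a simplified sketch, not the published WSSD procedure; the window size, the use of the median as baseline, and the two-fold cutoff are all assumptions chosen for illustration.

```python
from statistics import median


def window_read_counts(read_starts, assembly_length, window):
    """Count reads per fixed-size window, binning each read by its start position."""
    n_windows = (assembly_length + window - 1) // window
    counts = [0] * n_windows
    for start in read_starts:
        if 0 <= start < assembly_length:
            counts[start // window] += 1
    return counts


def flag_high_depth(counts, fold=2.0):
    """Return indices of windows whose read count is at least `fold` times the median."""
    baseline = median(counts)
    if baseline <= 0:
        return []
    return [i for i, c in enumerate(counts) if c >= fold * baseline]
```

A collapsed duplication shows up this way because reads from every copy pile onto the single copy present in the assembly.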
40Segmental Duplications
41Segmental Duplications
MGSCv3 (>90% ID, >10 Kb)
60% of all duplications map to chrUn in MGSCv3
42Segmental Duplications
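Comparison of duplication in the Mouse and Human Genomes
WGAC analysis
[Column headers were not captured in the transcript; columns are shown in extraction order. ND = not determined.]
Size     (1)    (2)    (3)    (4)    (5)
>1 Kb    5.25   ND     ND     3.74   2.35
>5 Kb    4.78   1.95   1.01   3.25   2.00
>10 Kb   4.52   0.70   0.38   2.71   1.60
>20 Kb   4.06   0.11   0.10   2.23   1.14
Duplications are underrepresented in the Whole Genome Assembly (MGSCv3)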
43Segmental Duplications
WSSD Finished BACs
Duplicated pre-quality score
Unique pre-quality score
Duplicated post-quality score
Unique post-quality score
44Segmental Duplications
WSSD (>95% id) analysis of Build 30 BACs
The 6 BACs (5 NT clones) from Mmu4 that hit chrUn are on the duplication positive list
45Segmental Duplications
Case Western Reserve: Evan Eichler, Jeff Bailey
46Segmental Duplications
RP23-3D2 chr.X_A3
Bari, Italy: Mario Ventura, Mariano Rocchi
- Validated 18/27 (67%) in silico predictions by FISH
- 16/18 (90%) were clustered intrachromosomal duplications
This region is described in Mileham and Brown (1996) as a repeat sequence island
47Segmental Duplications
Gene Content of Duplications
Domain            U     D    Enrichment
serpin            39    6    57.5
lectin_c          75    4    19.9
7tm               208   3    5.4
ANF_receptor      34    3    33
Defensin_propep   3     3    373.5
KRAB              68    3    16.5
defensins         2     3    560.3
lipocalin         23    2    32.5
AAA               35    1    10.7
DEAD              41    1    9.1
ENV_polyprotein   4     1    93.4
MAGE              5     1    74.7
RNA_helicase      10    1    37.4
Human: 5% of the genome is in duplicated regions; 6% of RefSeqs align to these regions
Mouse: 1.5-2% of the genome is in duplicated regions; 0.5% of RefSeqs align to these regions
48MGSCv3 Duplication Analysis
Evan Eichler, Xinwei She, Ginger Chang, Eray Tuzan, Deanna Church
All columns: both non redundant dup
(1) WGAC (Mb); (2) WSSD supported WGAC (Mb); (3) WSSD overlap WGAC (%); (4) WSSD (Mb); (5) WGAC overlap WSSD (%); (6) Proportion of WSSD supported WGAC in chrom (%)
Chrom   (1)     (2)     (3)     (4)     (5)     (6)
chr1    3.25    0.38    11.58   0.57    66.51   0.21
chr2    2.03    0.13    6.57    0.32    42.11   0.08
chr3    2.17    0.11    5.23    0.16    69.09   0.08
chr4    2.19    0.27    12.12   0.69    38.64   0.19
chr5    2.81    0.42    14.96   0.88    47.92   0.31
chr6    3.72    0.37    9.97    0.86    43.00   0.27
chr7    4.48    0.78    17.41   2.10    37.16   0.64
chr8    1.54    0.15    9.54    0.27    54.63   0.12
chr9    1.56    0.10    6.11    0.34    28.03   0.08
chr10   1.62    0.10    5.94    0.19    51.39   0.08
chr11   1.13    0.08    6.94    0.21    36.63   0.07
chr12   1.79    0.39    21.85   0.88    44.42   0.37
chr13   1.86    0.41    22.08   1.01    40.66   0.38
chr14   1.19    0.15    12.39   0.33    44.38   0.14
chr15   0.94    0.04    3.87    0.05    77.47   0.04
chr16   1.08    0.01    0.75    0.02    40.64   0.01
chr17   3.35    0.22    6.62    0.99    22.30   0.26
chr18   0.75    0.02    2.62    0.02    87.52   0.02
chr19   0.92    0.05    5.53    0.31    16.52   0.09
chrUn   23.78   13.03   54.80   82.02   15.89   12.91
chrX    3.17    0.31    9.91    0.86    36.41   0.23
Build 33 data
49The Mouse Genome - combining resources
NCBI: Richa Agarwala, Deanna Church
50Mouse assemblies Build 32
Framework assemblies
Contig information
All
Range in kb   Number   Length        Percent of total
<300          39373    1.94 x 10^8   7.10
300-1000      72       4.23 x 10^7   1.54
1000-5000     116      3.14 x 10^8   11.46
>5000         156      2.19 x 10^9   79.92
Mapped
Range in kb   Number   Length        Percent of total
<300          98       8.33 x 10^6   0.33
300-1000      70       4.13 x 10^7   1.62
1000-5000     116      3.14 x 10^8   12.30
>5000         156      2.19 x 10^9   85.76
54Mapped Scaffold N50
55The Mouse Genome- combining resources
RefSeqs with multiple alignments to the genome
56Finished Sequence in 'Random' Bin
combined_2
57The Mouse Genome - combining resources
NCBI: Richa Agarwala, Deanna Church
Mouse Build 33 (current)
Clone based TPF
- Local order and orientation problems
MGSCv3 based TPF
- Increased artificial duplication
- Lots of finished sequence in random bin
Combined TPF
- Not perfect, but better outcome. Manual curation helps
And the winner is
58Build 33
[Figure: chromosomes 1-19 and X]
Reference assembly N50 22.3 Mb
59Chromosome 7 inversion still present
60Mmu7 (3M 6M)
61Segmental Duplication
Genome annotation will under-represent the gene content if segmental duplications are not included in the reference assembly.
62Large scale variation in the genome
Nature Genetics, Sept. 2004
63Types of annotation
Feature                       Method
Genes                         By alignment, by prediction
Markers                       By ePCR
Variation                     By alignment
Clones/Cytogenetic location   By alignment (BAC ends, insert) or assembly
Phenotype                     Via gene identification, associated markers
Cytogenetic Position          By annotated BAC-END sequenced clones; by FISH-mapped clones used in assembly
Sequence characteristics      CpG islands, source of assembly
Gene Trap Clones              By alignment
Note: Genes from other organisms are also positioned, based on alignment of mRNAs from one species to the genome of another. Example: the human Map Viewer shows the position of ESTs and other mRNAs from cow, pig, mouse, and rat.
64Reference Sequences
Goal: One sequence entry for each naturally occurring DNA, RNA and protein molecule
Key: Curated annotation; Calculated annotation
chromosome: NC_000000
RNA: NM_000000, NR_000000
protein: NP_000000
predicted RNA: XM_000000/XR_000000
predicted protein: XP_000000
Multiple products for one gene are instantiated
as separate RefSeqs with the same LocusID.
65Why do we need RefSeq?
66mRNA alignment
- General alignment
- at least 50% of length or >1.0 kb
- >95% identity, unless short exon
- No longer one alignment per contig per strand (changed recently because this led to failure to annotate all members of a gene cluster)
- Constraints on intron length (compactness)
- Shift within 3 nt to find splice sites conforming to consensus (GT-AG, GC-AG, AT-AC)
- Rank alignment by bit score, identity, score, gaps, compactness
- global alignment
- Best placement
- Add to score for introns to compensate for gap penalty
- Known ambiguity if gene/pseudogene pairs are highly related, and few introns in gene
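The "shift within 3 nt" step can be sketched as follows. This is an illustration under stated assumptions, not NCBI's implementation: introns are 0-based half-open intervals on a genomic string, a shift is accepted only if the bases that move between exon and intron are identical (so the cDNA alignment is unchanged), and smaller shifts are preferred. All function names are hypothetical.

```python
CONSENSUS = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}


def is_consensus(genome, start, end):
    """True if the intron genome[start:end] has a consensus donor/acceptor pair."""
    return (genome[start:start + 2], genome[end - 2:end]) in CONSENSUS


def shift_preserves_alignment(genome, start, end, k):
    """True if moving the intron by k bases swaps only identical bases across it."""
    if k > 0:
        return genome[start:start + k] == genome[end:end + k]
    if k < 0:
        return genome[start + k:start] == genome[end + k:end]
    return True


def adjust_intron(genome, start, end, max_shift=3):
    """Return (start, end) of the nearest placement within max_shift nt that has
    consensus splice sites and leaves the alignment unchanged, or None."""
    genome = genome.upper()
    for k in sorted(range(-max_shift, max_shift + 1), key=abs):
        s, e = start + k, end + k
        if s < 0 or e > len(genome) or e - s < 4:
            continue
        if shift_preserves_alignment(genome, start, end, k) and is_consensus(genome, s, e):
            return s, e
    return None
```

In the test case below, an aligner has placed the intron one base too far left; because the base before the intron equals the last base of the intron, sliding the gap right by 1 nt recovers a GT-AG intron without changing any aligned base.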
67Aligning cDNAs to the genome
- Different algorithms can produce different results
- Trying to balance alignment with searching for splice sites.
NM_003490 (synapsin 3)
Between exons 7 and 8
68Making Gene Models (at NCBI)
70Conflict resolution
- Integrated comparison with Ensembl and UCSC
- Placement of CDS
- Placement of and consensus splice junctions
- % identity between RefSeq and Genome
- Reading frame
- Possible Actions
- Review current evidence
- Review alignment algorithms
- Review current RefSeqs
71Future consensus annotation
- CCDS identifier assigned to annotated proteins that are consistently placed
- Sequence may not be identical because NCBI annotates and places existing RefSeqs that are based on cDNAs and Ensembl generates mRNA and protein products solely from the reference genome
- cDNA (and thus protein) from a different allele
- RNA editing
- selenoproteins
- ribosomal slippage
- non-AUG initiation codon
- cDNA source has undetected sequence errors
72Future consensus annotation
- Preliminary Statistics based on Human Build 34.3
Count   Total   Conditions Satisfied
7802    7802    100% nucleotide position
1499    9301    100% protein position
3053    12336   100% exon position
23      12359   NCBI/Hinxton both "good"
1540    13899   NCBI annotation projected
1772    15671   One model better
52      15723   Other model better
73Now that the genome is together
http://www.ncbi.nlm.nih.gov/mapview/map_search.cgi?taxid=10090
74Data Access
http://www.ncbi.nlm.nih.gov/genome/seq/MmBlast.html
/HsBlast.html
/RnBlast.html
/DrBlast.html
DATABASES
Entry point into the Genome- view BLAST results
in the Map Viewer
Other data sets
Gene Trap Clones
75Data Access
76Navigating by location
77Multiple assemblies can be a good thing
Alignment of human Reference mRNAs
256 Reference assembly only; 10 Celera assembly only
- Assembly Gaps
- Assembly Errors
- Biological variation
78Multiple assemblies can be a good thing
81Multiple assemblies can be a good thing
Inversions: An exon of DOCK3 is inverted in the reference assembly relative to other available information.
Celera Assembly
Reference Assembly
Other sequence data indicate the reference assembly includes an inversion
NM_004947    181    tgaaggggatctttcctgcaaattacattcacttgaaaaaggcaattgtcagtaataggg  240
AY254099     181    ............................................................  240
AY145303     158    ............................................................  217
AY145302     509    .a......c................t.....t.......................c....  568
AK172930     518    .a......c................t.....t.......................c....  577
AK122353     445    .c.....t..a......t.c.gc..tg.............t..ctg...a.ag..c.aa.  504
AY233380     158    .c.....t..a......t.c.gc...g.............t..ctg...a.ag..c.aa.  217
AC121608     21865     .....c................t.....t.......................c....  21921
AL672208     61296     .....c................t.....t.......................c....  61240
82Multiple assemblies can be a good thing
84Genome assembly and annotation is an ongoing issue.
Weigh all of the evidence carefully
Multiple lines of evidence better than a single
thread
85Take home messages
Genome assembly and annotation is still not a
trivial problem
Be critical and review the evidence
http://www.ncbi.nlm.nih.gov/projects/assembly
86Assembly Database
NCBI: Eugene Yaschenko, Vladimir Alekseyev, Mike Dicuccio, Deanna Church
TIGR: Martin Shumway, Steve Salzberg
87Acknowledgments
Genome Team: Richa Agarwala, Hsiu-Chuan Chen, Slava Chetvernin, Deanna Church, Olga Ermolaeva, Wratko Hlavina, Wonhee Jang, Jonathan Kans, Yuri Kapustin, Ken Katz, Paul Kitts, Donna Maglott, Jim Ostell, Kim Pruitt, Sergey Resenchuk, Victor Sapojnikov, Greg Schuler, Steve Sherry, Andrei Shkeda, Alexandre Souvorov, Tatiana Tatusova, Lukas Wagner
RefSeq Curator Staff; BLAST Team; Entrez Team; NCBI Service Desk Staff
Duplication Analysis: Evan Eichler, Xinwei She, Ze Cheng, Eray Tuzan, Jeff Bailey, Mario Ventura, Mariano Rocchi
Trace and Assembly Archive: Vladimir Alekseyev, Anton Butanaev, Alexey Egorov, Andrew Klymenko, Sergey Pomorov, Eugene Yaschenko, Mike Dicuccio
88Acknowledgments
Mouse Genome Sequencing Consortium
Sanger Institute
Washington University Genome Sequencing Center
Whitehead (Broad) Institute Genome Center
Baylor College of Medicine
Cold Spring Harbor Laboratory
Genome Therapeutics Corporation
Harvard Partners Genome Center
Joint Genome Institute
NIH Intramural Sequencing Center
UK-MRC Sequencing Consortium
The University of Oklahoma Advanced Center for Genome Technology
The University of Texas Southwest