1
Assembling and Annotating Genomes
Deanna M. Church NCBI January 12, 2005
2
Of mice and men
3
Of mice and men
Fleischman et al. (1991) PNAS 88:10885-10889
Both carry mutations in the Kit gene.
4
The Basic Model
5
Why sequence?

I. Complete parts list for a given organism
   Genes, promoters, regulatory regions, variation, ????
   High quality, finished (or essentially finished) sequence
II. Genes, Genes, Genes
   Draft is probably good enough
III. Annotating a finished genome (Human, soon to be mouse)
   Low coverage (2X sequence coverage).
6
What data is represented in GenBank?
Data in GenBank is an interpretation of primary sequence data.

Steps for small, single pass sequence:
- Sequence reaction
- Read gel/call chromatogram (Phred/TraceTuner)
- Submit sequence

Last step for large molecules (BAC, fosmids, long cDNAs):
- Assemble sequence and submit consensus (Phrap, CAP3, CAP4)
7
Getting the raw data
>500 Million Traces (and counting)
http://www.ncbi.nlm.nih.gov/Traces/
8
Getting the raw data
And they just keep coming
9
Getting the raw data
Scripted access for bulk retrieval
10
Genome Sequencing Strategies
Not all bases are created equal
11
Private and public efforts
Science (June, 1998)
Craig Venter
Science (September, 1998)
12
Hierarchical Shotgun Assembly
Putting Genomes Together
This part is relatively cheap and easy
This part is hard and expensive
13
HTGS keywords
htgs_phase0: low coverage sequence, 1-2X
htgs_phase1: generally 4-5X sequence coverage; several fragments, not ordered or oriented
htgs_phase2: sequence coverage can vary (generally 5-10X), but fragments are ordered and oriented
htgs_phase3: highly accurate, finished sequence; error rate <10^-5
Draft sequence: phase 1 or 2, but >90% of the bases are high quality (phred 20 or better)
htgs_active_fin: center has finished shotgun phase and moved to finishing
htgs_cancelled: sequencing has been discontinued on this clone
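The quality thresholds in the HTGS definitions can be made concrete with a small Python sketch. It relies on the standard phred relation Q = -10 log10(P), under which phred 20 is a 1-in-100 error probability and an error rate of 10^-5 corresponds to phred 50. The function names and the example quality lists are hypothetical, chosen only for illustration.

```python
def phred_to_error_prob(q):
    """Convert a phred quality score Q to an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def fraction_high_quality(quals, threshold=20):
    """Fraction of bases whose phred score is at least `threshold`."""
    if not quals:
        return 0.0
    return sum(1 for q in quals if q >= threshold) / len(quals)


def meets_draft_criterion(quals, threshold=20, min_fraction=0.90):
    """True if more than `min_fraction` of the bases are phred `threshold` or better."""
    return fraction_high_quality(quals, threshold) > min_fraction


if __name__ == "__main__":
    # Hypothetical per-base quality scores for one clone.
    example = [30] * 95 + [10] * 5
    print(phred_to_error_prob(20))
    print(meets_draft_criterion(example))
```

The draft test uses a strict "greater than 90%" comparison to mirror the wording of the slide; a center's actual pipeline may define the cutoff differently.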
14
The Raw Data
15
Putting genomes together
UCSC: Jim Kent. NCBI: Paul Kitts, Greg Schuler, Richa Agarwala.
- Remove contaminants (vector, E. coli, other organisms, virus)
- Bin clones by chromosome arm
- Incorporate clone order information using TPF
- Identify fragment overlaps
- Determine fragment order and orientation, remove sequence redundancy (this produces sequence contigs given NT_XXXXXX type accession numbers)
- Place contigs on chromosome
16
Putting genomes together
UCSC: Jim Kent. NCBI: Paul Kitts, Greg Schuler, Richa Agarwala.
Overlapping draft clones
When BAC clones overlap, the sequence can be made non-redundant. These contigs are given NT_XXXXXX accession numbers.
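To illustrate what "making the sequence non-redundant" means, here is a minimal Python sketch that merges two clone sequences sharing an exact end overlap. It is a toy model, not the NCBI or UCSC procedure: real assemblies tolerate sequencing errors, handle both orientations, and use clone order information. The function names and the minimum overlap length are assumptions.

```python
def longest_overlap(left, right, min_len=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 when no exact overlap of at least `min_len` bases exists.
    `min_len` is assumed to be 1 or more.
    """
    max_len = min(len(left), len(right))
    for k in range(max_len, min_len - 1, -1):
        if left[-k:] == right[:k]:
            return k
    return 0


def merge_clones(left, right, min_len=3):
    """Merge two overlapping sequences so the shared bases appear only once."""
    k = longest_overlap(left, right, min_len)
    if k == 0:
        return None  # no usable overlap, keep the clones separate
    return left + right[k:]


if __name__ == "__main__":
    # Two hypothetical clone sequences sharing the bases CGGA.
    print(merge_clones("ACGTACGGA", "CGGATTT"))
```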
17
Sequence Tagged Sites (STS)
"A common language for physical mapping of the human genome." M. Olson, L. Hood, C. Cantor, and D. Botstein, Science 245, 1434-1435 (1989).
STS marker D6S1606
Figure labels: forward primer, microsatellite, reverse primer
5' GAGTTTGCACCATTGCACTCCAGCCTGGGCAAC (CA)n AACGTGGCATGTGCCTGTACTCTCC 3'
3' CTCAAACGTGGTAACGTGAGGTCGGACCCGTTG (GT)n TTGCACCGTACACGGACATGAGAGG 5'
PCR product size: 92 - 100 bases
18
The Original Genome Resources- STS Maps
Figure labels: meiosis (genetic), radiation (RH), clones (clone based), genome

- Each line represents an individual cell line/animal that carries a particular break.
- STSs can be amplified from DNA in these cell lines/animals.
- Based on cell line/animal marker content, the breaks can be determined and the markers ordered.

19
Electronic PCR (e-PCR)
STS marker D6S1606
Figure labels: forward primer, microsatellite repeat, reverse primer
5' GAGTTTGCACCATTGCACTCCAGCCTGGGCAACAAGAGTGAAACTCTGTCACAGA (CA)n AACGTGGCATGTGCCTGTACTCTC 3'
3' CTCAAACGTGGTAACGTGAGGTCGGACCCGTTGTTCTCACTTTGAGACAGTGTCT (GT)n TTGCACCGTACACGGACATGAGAG 5'
PCR product size: 92 - 100 bases
Schuler (1997), Genome Research 7, 541-550
e-PCR software searches DNA sequences for exact matches to both primers in the correct order, orientation, and spacing to be consistent with the known PCR product size.
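The search described above can be sketched in a few lines of Python. This is not the e-PCR program itself, only an illustration of the idea under simple assumptions: exact string matching, a single strand of the template, and toy primers invented for the example (the slide does not mark the exact primer boundaries for D6S1606).

```python
def reverse_complement(seq):
    """Reverse complement of a DNA string (A, C, G, T, N only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def find_all(haystack, needle):
    """All 0-based start positions of `needle` in `haystack`, overlaps included."""
    positions = []
    start = haystack.find(needle)
    while start != -1:
        positions.append(start)
        start = haystack.find(needle, start + 1)
    return positions


def epcr_hits(template, fwd_primer, rev_primer, min_size, max_size):
    """Return (start, end, size) for each predicted product on the given strand.

    A hit needs an exact forward-primer match followed by an exact match to the
    reverse complement of the reverse primer, with the implied product size
    inside [min_size, max_size].
    """
    template = template.upper()
    fwd = fwd_primer.upper()
    rev_site = reverse_complement(rev_primer)
    hits = []
    for f in find_all(template, fwd):
        for r in find_all(template, rev_site):
            size = r + len(rev_site) - f
            if r >= f and min_size <= size <= max_size:
                hits.append((f, r + len(rev_site), size))
    return hits


if __name__ == "__main__":
    # Hypothetical template and primers, for illustration only.
    template = "TTTT" + "ACGTAC" + "T" * 10 + "GGATCA" + "CCCC"
    print(epcr_hits(template, "ACGTAC", "TGATCC", 20, 25))
```

To cover both strands, the same function can be called a second time on `reverse_complement(template)`.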
20
Electronic PCR (e-PCR)
http://www.ncbi.nlm.nih.gov/sutils/e-pcr/
21
Putting genomes together
Ideally
Non-sequence based Map
22
Putting genomes together
More like
23
Human assembly Build 35
The Starting Material
Framework assemblies
388 contigs, 3.02 Gb
Contig Information
Type of source sequence   Number used   Length (bp)
Draft only                46            10,284,900
Finished only             334           2,833,780,000
N50 length: contig length at which 50% of the bases in the assembly reside in a contig of at least that size.
AGP: A Golden Path
http://www.ncbi.nlm.nih.gov/genome/guide/human/HsStats.html
24
Current Human assembly Build 34 (the essentially finished genome)
Contig information
Range in kb   Number   Length (kb)   Percent of total
<300          218      30,276        1
300-1000      74       44,028        1.45
1000-5000     87       208,365       6.89
>5000         119      2,737,630     90.64
N50: 29,105
N50 length: contig length at which 50% of the bases in the assembly reside in a contig of at least that size.
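The N50 definition translates directly into code. The sketch below is one straightforward way to compute it in Python; the function name and the example contig lengths are made up for illustration.

```python
def n50(lengths):
    """Contig length at which 50% of the bases reside in contigs of at least that size."""
    total = sum(lengths)
    running = 0
    # Walk from the largest contig down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("n50 needs at least one contig with nonzero length")


if __name__ == "__main__":
    # Hypothetical contig lengths.
    print(n50([2, 3, 4, 5, 6]))
```

In the example the contigs total 20 bases; the two largest (6 and 5) already hold 11 of them, so the N50 is 5.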
25
Contigs and components in the MapViewer
26
Mouse Genome Sequencing
27
David Jaffe, Jim Mullikin
Putting genomes together
BAC clones were constructed and end sequenced before the WGS project started
WGS
For the mouse project, only 40 kb clones and BAC clones are available
End-sequence all clones and retain pairing information (mate-pairs)
Each end sequence is referred to as a read
28
Putting genomes together
David Jaffe, Jim Mullikin
Constructing Supercontigs (scaffolds)
29
Intermediate assemblies
Sanger Institute: Jim Mullikin
WIBR: David Jaffe
NCBI: Richa Agarwala, Victor Sapojnikov, Wratko Hlavina, Deanna Church
30
The Mouse Genome- MGSCv3
(The Mouse Genome Sequencing Consortium)
David Jaffe: Arachne
Jim Mullikin: Phusion
The Starting Material
Waterston et al, 2004
RPCI-23: 197 Kb; RPCI-24: 155 Kb
The Assembly
Total length of the assembly: 2.5 Gb (90.9% of genome, assuming a 2.75 Gb genome)
224,713 WGS contigs
CAAA01000100
42,620 Supercontigs
(274 finished BACs, 49.5 Mb)
31
The Mouse Genome- over time
32
Contig/Supercontig size by chromosome
(Chart not transcribed; axis ticks run from 0 to 80.)
33
How does MGSCv3 compare to Non-Sequence based maps?
80% of STS markers on WI-Genetic Map localized by e-PCR
72% of STS markers on WI/MRC RH Map localized by e-PCR
Chromosome 7
<3% chromosome conflict.
34
Finished NT Contig By Build
Finished sequences are used to build hand-curated
contigs (NT contigs)
Currently 1.8 Gb (mostly) non-redundant sequence; 1.1 Gb in Build 33
35
The Mouse Genome- over time
NCBI: Richa Agarwala
Mouse Build 30
36
The Mouse Genome- combining resources
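NCBI: Richa Agarwala, Deanna Church
Unplaced versus Total curated Contigs, Build 30
(Chart not transcribed. Data labels in transcript order, chromosome assignment not recoverable: 0, .27, 1.93, 0.9, .56, 1.83, 1.19, 3.64, 1.38, 0, 0, 1.41, 4.07, 3.61, 1.27, 0, 5.56, 4.48, 2.94, 0, 100)
780 Mb of Curated NT Sequence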
37
The Mouse Genome- combining resources
NCBI: Richa Agarwala, Deanna Church
Mmu4 unplaced contigs (Build 30)
10 unplaced NT contigs (11 GenBank accessions)
- Align to WGS contigs mapped to Mmu4: NT_039271, NT_039272, NT_039276, NT_039280
- Align to WGS contigs mapped to another chromosome: NT_039273 (MmuX)
- No hits/bad hits (mostly chrUn): NT_039269, NT_039270, NT_039274, NT_039278, NT_039279
38
Segmental Duplications
Large, nearly identical copies of genomic DNA.
> 1 Kb, > 90% identity
Intrachromosomal
Interchromosomal
39
Segmental Duplications
WGAC Analysis: Whole Genome Assembly Comparison
BLAST the genome against itself and look for sequence similarity.
Caveat: difficult to distinguish between biological duplication and artificial duplication introduced when producing draft assemblies.
WSSD Analysis: Whole Genome Shotgun Sequence Detection
BLAST WGS reads against an assembly and look for increased depth of coverage
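The depth-of-coverage idea behind WSSD can be sketched as follows. This is a simplified illustration in Python, not the published WSSD method: it counts read alignments by start position in fixed windows and flags windows whose count exceeds the mean by a chosen number of standard deviations. The window size, the cutoff rule, and all names are assumptions made for the example.

```python
from statistics import mean, pstdev


def reads_per_window(read_starts, assembly_length, window):
    """Count read alignments starting in each fixed-size window of the assembly."""
    n_windows = (assembly_length + window - 1) // window
    counts = [0] * n_windows
    for pos in read_starts:
        if 0 <= pos < assembly_length:
            counts[pos // window] += 1
    return counts


def flag_excess_depth(counts, n_sd=2.0):
    """Indices of windows whose read count exceeds mean + n_sd * standard deviation."""
    cutoff = mean(counts) + n_sd * pstdev(counts)
    return [i for i, c in enumerate(counts) if c > cutoff]


if __name__ == "__main__":
    # Hypothetical alignments: 5 reads per 100 bp window, plus 25 extra reads
    # piling up in the window that starts at position 300.
    starts = [w * 100 + 10 * k for w in range(10) for k in range(5)]
    starts += [300 + k for k in range(25)]
    counts = reads_per_window(starts, assembly_length=1000, window=100)
    print(counts)
    print(flag_excess_depth(counts))
```

A window flagged this way is only a candidate: collapsed duplications and common repeats both raise read depth, so repeats would need to be masked before counting.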
40
Segmental Duplications
41
Segmental Duplications
MGSCv3 (>90% ID, >10 Kb)
60% of all duplications map to chrUn in MGSCv3
42
Segmental Duplications
Comparison of duplication in the Mouse and Human Genomes
WGAC analysis (column headers were not preserved in the transcript)
Size      Col 1   Col 2   Col 3   Col 4   Col 5
>1 Kb     5.25    ND      ND      3.74    2.35
>5 Kb     4.78    1.95    1.01    3.25    2.00
>10 Kb    4.52    0.70    0.38    2.71    1.60
>20 Kb    4.06    0.11    0.10    2.23    1.14
Duplications are underrepresented in the Whole Genome Assembly (MGSCv3)
43
Segmental Duplications
WSSD Finished BACs
Duplicated pre-quality score
Unique pre-quality score
Duplicated post-quality score
Unique post-quality score
44
Segmental Duplications
WSSD (>95% id) analysis of Build 30 BACs
The 6 BACs (5 NT clones) from Mmu4 that hit chrUn are on the duplication positive list
45
Segmental Duplications
Case Western Reserve: Evan Eichler, Jeff Bailey
46
Segmental Duplications
RP23-3D2 chr.X_A3
Bari, Italy: Mario Ventura, Mariano Rocchi

Validated 18/27 (67%) in silico predictions by FISH
16/18 (about 90%) were clustered intrachromosomal duplications

This region is described in Mileham and Brown (1996) as a repeat sequence island
47
Segmental Duplications
Gene Content of Duplications
Domain            U     D    Enrichment
serpin            39    6    57.5
lectin_c          75    4    19.9
7tm               208   3    5.4
ANF_receptor      34    3    33
Defensin_propep   3     3    373.5
KRAB              68    3    16.5
defensins         2     3    560.3
lipocalin         23    2    32.5
AAA               35    1    10.7
DEAD              41    1    9.1
ENV_polyprotein   4     1    93.4
MAGE              5     1    74.7
RNA_helicase      10    1    37.4
Human: 5% of the genome is in duplicated regions; 6% of RefSeqs align to these regions
Mouse: 1.5-2% of the genome is in duplicated regions; 0.5% of RefSeqs align to these regions
48
MGSCv3 Duplication Analysis
Evan Eichler, Xinwei She, Ginger Chang, Eray Tuzan, Deanna Church
Column group label (all six columns): both non redundant dup
Columns: (1) WGAC (Mb); (2) WSSD supported WGAC (Mb); (3) WSSD overlap WGAC (%); (4) WSSD (Mb); (5) WGAC overlap WSSD (%); (6) Proportion of WSSD supported WGAC in chrom (%)
Chrom   (1)     (2)     (3)     (4)     (5)     (6)
chr1    3.25    0.38    11.58   0.57    66.51   0.21
chr2    2.03    0.13    6.57    0.32    42.11   0.08
chr3    2.17    0.11    5.23    0.16    69.09   0.08
chr4    2.19    0.27    12.12   0.69    38.64   0.19
chr5    2.81    0.42    14.96   0.88    47.92   0.31
chr6    3.72    0.37    9.97    0.86    43.00   0.27
chr7    4.48    0.78    17.41   2.10    37.16   0.64
chr8    1.54    0.15    9.54    0.27    54.63   0.12
chr9    1.56    0.10    6.11    0.34    28.03   0.08
chr10   1.62    0.10    5.94    0.19    51.39   0.08
chr11   1.13    0.08    6.94    0.21    36.63   0.07
chr12   1.79    0.39    21.85   0.88    44.42   0.37
chr13   1.86    0.41    22.08   1.01    40.66   0.38
chr14   1.19    0.15    12.39   0.33    44.38   0.14
chr15   0.94    0.04    3.87    0.05    77.47   0.04
chr16   1.08    0.01    0.75    0.02    40.64   0.01
chr17   3.35    0.22    6.62    0.99    22.30   0.26
chr18   0.75    0.02    2.62    0.02    87.52   0.02
chr19   0.92    0.05    5.53    0.31    16.52   0.09
chrUn   23.78   13.03   54.80   82.02   15.89   12.91
chrX    3.17    0.31    9.91    0.86    36.41   0.23
Build 33 data
49
NCBI: Richa Agarwala, Deanna Church
The Mouse Genome- combining resources
50
Mouse assemblies Build 32
Framework assemblies
Contig information

All
Range in kb   Number   Length        Percent of total
<300          39373    1.94 x 10^8   7.10
300-1000      72       4.23 x 10^7   1.54
1000-5000     116      3.14 x 10^8   11.46
>5000         156      2.19 x 10^9   79.92

Mapped
Range in kb   Number   Length        Percent of total
<300          98       8.33 x 10^6   0.33
300-1000      70       4.13 x 10^7   1.62
1000-5000     116      3.14 x 10^8   12.30
>5000         156      2.19 x 10^9   85.76
54
Mapped Scaffold N50
55
The Mouse Genome- combining resources
RefSeqs with multiple alignments to the genome
56
Finished Sequence in 'Random' Bin
combined_2
57
NCBI: Richa Agarwala, Deanna Church
The Mouse Genome- combining resources
Mouse Build 33 (current)
Clone based TPF
- Local Order and Orientation problems
MGSCv3 based TPF
- Increased artificial duplication
- Lots of finished sequence in random bin
Combined TPF
- Not perfect, but better outcome. Manual curation helps
And the winner is
58
(Figure not transcribed; panel labels are chromosomes 1-19 and X.)
Build 33
Reference assembly N50: 22.3 Mb
59
Chromosome 7 inversion still present
60
Mmu7 (3M 6M)
61
Segmental Duplication: Genome annotation will under-represent the gene content if segmental duplications are not included in the reference assembly.
62
Large scale variation in the genome
Nature Genetics, Sept. 2004
63
Types of annotation
Feature                       Method
Genes                         By alignment, by prediction
Markers                       By ePCR
Variation                     By alignment
Clones/Cytogenetic location   By alignment (BAC ends, insert) or assembly
Phenotype                     Via Gene identification, associated markers
Cytogenetic Position          By annotated BAC-END sequenced clones; by FISH-mapped clones used in assembly
Sequence characteristics      CpG islands, source of assembly
Gene Trap Clones              By alignment
Note: Genes from other organisms are also positioned based on alignment of mRNAs from one species on the genome of another. Example: the human Map Viewer shows the position of ESTs and other mRNAs from cow, pig, mouse, and rat.
64
Reference Sequences
Goal: One sequence entry for each naturally occurring DNA, RNA and protein molecule
Key: Curated annotation, Calculated annotation
chromosome: NC_000000
RNA: NM_000000, NR_000000
protein: NP_000000
predicted RNA: XM_000000 / XR_000000
predicted protein: XP_000000
Multiple products for one gene are instantiated as separate RefSeqs with the same LocusID.
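The accession prefixes on this slide lend themselves to a simple lookup. The Python sketch below maps each prefix to the molecule type named on the slide. The curated/calculated status attached to each prefix is an assumption: it treats the entries labeled "predicted" (XM_, XR_, XP_) as calculated annotation and the rest as curated, since the color key did not survive in the transcript. The function and table names are hypothetical.

```python
# Prefix -> (molecule type from the slide, assumed annotation status).
REFSEQ_PREFIXES = {
    "NC_": ("chromosome", "curated"),
    "NM_": ("RNA", "curated"),
    "NR_": ("RNA", "curated"),
    "NP_": ("protein", "curated"),
    "XM_": ("predicted RNA", "calculated"),
    "XR_": ("predicted RNA", "calculated"),
    "XP_": ("predicted protein", "calculated"),
}


def classify_refseq(accession):
    """Return (molecule type, status) for a RefSeq accession such as NM_003490."""
    prefix = accession[:3].upper()
    if prefix not in REFSEQ_PREFIXES:
        raise ValueError(f"unrecognized RefSeq prefix in {accession!r}")
    return REFSEQ_PREFIXES[prefix]


if __name__ == "__main__":
    for acc in ("NC_000000", "NM_003490", "XP_000000"):
        print(acc, classify_refseq(acc))
```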
65
Why do we need RefSeq?
66
mRNA alignment

General alignment
- At least 50% of length or >1.0 kb
- >95% identity, unless short exon
- No longer one alignment per contig per strand (changed recently because this led to failure to annotate all members of a gene cluster)
- Constraints on intron length (compactness)
- Shift within 3 nt to find splice sites conforming to consensus (GT-AG, GC-AG, AT-AC)
- Rank alignment by bit score, identity, score, gaps, compactness
Global alignment
Best placement
- Add to score for introns to compensate for gap penalty
- Known ambiguity if gene/pseudogene pairs are highly related, and few introns in gene
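Two of the criteria above, the length/identity filter and the splice-site shift, can be sketched in Python. This is an illustration of the stated rules, not NCBI's annotation code. The function names, the 0-based half-open intron coordinates, and the choice to try the smallest shifts first are assumptions; the sketch also does not re-check that the exon bases still match after a shift, which a real aligner would have to do.

```python
CONSENSUS_SPLICE_PAIRS = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}


def passes_general_filter(aligned_len, mrna_len, identity_pct, short_exon=False):
    """At least 50% of the mRNA length or >1.0 kb aligned, and >95% identity
    (the identity rule is waived for a short exon)."""
    long_enough = aligned_len >= 0.5 * mrna_len or aligned_len > 1000
    similar_enough = identity_pct > 95.0 or short_exon
    return long_enough and similar_enough


def intron_is_consensus(genome, intron_start, intron_end):
    """Check donor/acceptor dinucleotides of genome[intron_start:intron_end]."""
    donor = genome[intron_start:intron_start + 2].upper()
    acceptor = genome[intron_end - 2:intron_end].upper()
    return (donor, acceptor) in CONSENSUS_SPLICE_PAIRS


def shift_to_consensus(genome, intron_start, intron_end, max_shift=3):
    """Smallest shift (within max_shift nt) giving consensus splice sites, or None."""
    for shift in sorted(range(-max_shift, max_shift + 1), key=abs):
        start, end = intron_start + shift, intron_end + shift
        if start >= 0 and end <= len(genome) and intron_is_consensus(genome, start, end):
            return shift
    return None


if __name__ == "__main__":
    # Hypothetical genomic sequence: exon, GT...AG intron, exon.
    genome = "AAACC" + "GTTTTTTAG" + "GGTTT"
    print(shift_to_consensus(genome, 4, 13))
    print(passes_general_filter(600, 1000, 98.0))
```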

67
Aligning cDNAs to the genome

Different algorithms can produce different
results
Trying to balance alignment with searching for
splice sites.

NM_003490 (synapsin 3)
Between exons 7 and 8
68
Making Gene Models (at NCBI)
70
Conflict resolution

Integrated comparison with Ensembl and UCSC
- Placement of CDS
- Placement of and consensus splice junctions
- % identity between RefSeq and Genome
- Reading frame
Possible Actions
- Review current evidence
- Review alignment algorithms
- Review current RefSeqs

71
Future consensus annotation

CCDS identifier assigned to annotated proteins that are consistently placed
Sequence may not be identical because NCBI annotates and places existing RefSeqs that are based on cDNAs, and Ensembl generates mRNA and protein products solely from the reference genome:
- cDNA (and thus protein) from a different allele
- RNA editing
- selenoproteins
- ribosomal slippage
- non-AUG initiation codon
- cDNA source has undetected sequence errors

72
Future consensus annotation

Preliminary Statistics based on Human Build 34.3

Count   Total   Conditions Satisfied
7802    7802    100% nucleotide position
1499    9301    100% protein position
3053    12336   100% exon position
23      12359   NCBI/Hinxton both "good"
1540    13899   NCBI annotation projected
1772    15671   One model better
52      15723   Other model better
73
Now that the genome is together
http://www.ncbi.nlm.nih.gov/mapview/map_search.cgi?taxid=10090
74
Data Access
http://www.ncbi.nlm.nih.gov/genome/seq/MmBlast.html
(also /HsBlast.html, /RnBlast.html, /DrBlast.html)
DATABASES
Entry point into the Genome: view BLAST results in the Map Viewer
Other data sets
Gene Trap Clones
75
Data Access
76
Navigating by location
77
Multiple assemblies can be a good thing
Alignment of human Reference mRNAs
256 Reference assembly only; 10 Celera assembly only
- Assembly Gaps
- Assembly Errors
- Biological variation

78
Multiple assemblies can be a good thing
81
Multiple assemblies can be a good thing
Inversions: An exon of DOCK3 is inverted in the reference assembly relative to other available information.
Celera Assembly
Reference Assembly
Other sequence data indicate the reference assembly includes an inversion
NM_004947    181 tgaaggggatctttcctgcaaattacattcacttgaaaaaggcaattgtcagtaataggg 240
AY254099     181 ............................................................ 240
AY145303     158 ............................................................ 217
AY145302     509 .a......c................t.....t.......................c.... 568
AK172930     518 .a......c................t.....t.......................c.... 577
AK122353     445 .c.....t..a......t.c.gc..tg.............t..ctg...a.ag..c.aa. 504
AY233380     158 .c.....t..a......t.c.gc...g.............t..ctg...a.ag..c.aa. 217
AC121608   21865    .....c................t.....t.......................c.... 21921
AL672208   61296    .....c................t.....t.......................c.... 61240
82
Multiple assemblies can be a good thing
84
Genome assembly and annotation is an ongoing
issue.
Weigh all of the evidence carefully
Multiple lines of evidence better than a single
thread
85
Take home messages
Genome assembly and annotation is still not a
trivial problem
Be critical and review the evidence
http://www.ncbi.nlm.nih.gov/projects/assembly
86
Assembly Database
NCBI: Eugene Yaschenko, Vladimir Alekseyev, Mike Dicuccio, Deanna Church
TIGR: Martin Shumway, Steve Salzberg
87
Acknowledgments
Genome Team: Richa Agarwala, Hsiu-Chuan Chen, Slava Chetvernin, Deanna Church, Olga Ermolaeva, Wratko Hlavina, Wonhee Jang, Jonathan Kans, Yuri Kapustin, Ken Katz, Paul Kitts, Donna Maglott, Jim Ostell, Kim Pruitt, Sergey Resenchuk, Victor Sapojnikov, Greg Schuler, Steve Sherry, Andrei Shkeda, Alexandre Souvorov, Tatiana Tatusova, Lukas Wagner
RefSeq Curator Staff, BLAST Team, Entrez Team, NCBI Service Desk Staff
Duplication Analysis: Evan Eichler, Xinwei She, Ze Cheng, Eray Tuzan, Jeff Bailey, Mario Ventura, Mariano Rocchi
Trace and Assembly Archive: Vladimir Alekseyev, Anton Butanaev, Alexey Egorov, Andrew Klymenko, Sergey Pomorov, Eugene Yaschenko, Mike Dicuccio
88
Acknowledgments
Mouse Genome Sequencing Consortium
Sanger Institute
Washington University Genome Sequencing Center
Whitehead (Broad) Institute Genome Center
Baylor College of Medicine
Cold Spring Harbor Laboratory
Genome Therapeutics Corporation
Harvard Partners Genome Center
Joint Genome Institute
NIH Intramural Sequencing Center
UK-MRC Sequencing Consortium
The University of Oklahoma Advanced Center for Genome Technology
The University of Texas Southwest
