Title: Personalized Structural Variation of Human Genomes
1. Personalized Structural Variation of Human Genomes
- Evan Eichler
- University of Washington
Human Variome Project Meeting, Sept 27th, 2008
2. Goals of Human Genome Structural Variation Sequencing Project
- Sequence inversions, insertions and deletions (>5 kbp) at the single-basepair level in order to develop genotype assays to assess phenotypic consequence
  - Copy-number status
  - Sequence content
  - Sequence organization (i.e. proximity with respect to functional promoter, alleles vs. paralogues)
Color-Blindness in Humans: The Opsin Loci
Adapted from Deeb, SS (2005) Clin. Genet. 67:369-377
3. Sequence-Based Resolution of Structural Variation
Human Genomic DNA
Genomic Library (1 million clones)
Sequence ends of genomic inserts
Map to human genome
Dataset:
- 1,122,408 fosmid pairs preprocessed (15.5X genome coverage)
- 639,204 fosmid pairs, BEST pairs (8.8X genome coverage)
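The coverage figures above can be roughly reproduced as physical (clone) coverage: the number of clones times the insert length, divided by the genome size. The sketch below assumes a 40 kbp fosmid insert and a 2.9 Gbp genome; neither value is stated on this slide, and the function name is illustrative.

```python
def physical_coverage(n_pairs, insert_bp, genome_bp):
    """Physical (clone) coverage: bases spanned by all inserts / genome size."""
    return n_pairs * insert_bp / genome_bp

# Assumed values (not given on the slide): ~40 kbp inserts, ~2.9 Gbp genome.
INSERT_BP = 40_000
GENOME_BP = 2.9e9

for label, n_pairs in [("preprocessed", 1_122_408), ("BEST pairs", 639_204)]:
    cov = physical_coverage(n_pairs, INSERT_BP, GENOME_BP)
    print(f"{label}: {cov:.1f}X")
```

Under these assumptions the two datasets come out at about 15.5X and 8.8X, matching the slide.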
4. Structural Variation Sequencing Project
- 8 HapMap genomes sequenced to 0.3X Sanger sequence (10X physical fosmid clone coverage)
- Identifies 1700 sites of structural variation and 525 novel insertions
- Identifies 4 million SNPs and 795,000 indels (3.5% vs. 10% FP)
- Additional 19 genomes underway (WashU) (6 CEU, 6 ASN, 7 YRI)
- 40% map to duplications; 20% are complex structures
Kidd et al., Nature 2008
5. Structural Variation of the GSTM1 Locus
Japanese Sample
Yoruban Sample
- 91% of human genome basepairs are covered by 4 or more clones
Browser view (http://hgsv.washington.edu)
6. Sequenced Structural Variation of APOBEC3B
- 24.5 kb deletion eliminates most of APOBEC3B but creates a fusion gene
- Complete sequence facilitates rapid genotyping
7. World-Wide Distribution of APOBEC3B Deletion
- Fusion APOBEC3A/3B: <1% frequency in Africans, 88% in Papua New Guineans
- Analysis of 1269 human DNA samples. Fst places in top 0.169
Kidd et al., Hum. Mol. Genet., 2007
8. Structural Variation Map of the Human Genome
Kidd et al., Nature, 2008.
9. Genotyping
Probe coverage for sequenced deletions on commercial SNP platforms
Cooper et al. (2008) Nature Genet. Sept 7 Epub.
10. Next-Generation ESP Technology: A More Comprehensive Catalogue of Structural Variation?
- Tuzun et al., 2005 (ESP, Sanger, 40 kbp fosmids) vs. Korbel et al., 2007 (454, 3 kbp plasmids), same genome
- 297 sites (275 completely sequenced)
  - 102 deletions, 139 insertions, 56 inversions
- 181 sites not detected by Korbel
  - 117/181 (64.6%) carried duplicated sequences
  - Of these 181: 42 deletions, 107 insertions, 32 inversions
- 116 intersected sites
  - 53/116 (45.6%) carried duplicated sequences
  - 60 deletions, 32 insertions, 24 inversions
- 75% of insertions are missed, with a bias against SD events
- 800 additional sites found (complementary approaches)
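As a worked check of the counts on this slide, the per-class fraction of fosmid ESP sites missed by the Korbel et al. approach can be tallied directly. The dictionary and function names are illustrative; the counts are those listed above.

```python
# Sites detected only by fosmid ESP ("missed") vs. also found by Korbel ("shared").
missed = {"deletion": 42, "insertion": 107, "inversion": 32}
shared = {"deletion": 60, "insertion": 32, "inversion": 24}

def missed_fraction(kind):
    """Fraction of fosmid ESP sites of this class not detected by Korbel et al."""
    return missed[kind] / (missed[kind] + shared[kind])

for kind in missed:
    total = missed[kind] + shared[kind]
    print(f"{kind}: {missed[kind]}/{total} missed ({missed_fraction(kind):.1%})")
```

Insertions come out at 107/139, about 77%, which appears to be the basis of the "75% of insertions are missed" statement; deletions and inversions are missed less often (about 41% and 57%).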
11. Depth-of-Coverage
- Whole genome shotgun sequence detection of duplicated sequences (Bailey et al., 2002)
- Establish benchmarks for depth of coverage based on X, autosome and duplications of known copy number (33 BACs); compute depth of coverage in 5 kb windows; call regions where 6/7 windows exceed 3 s.d. of depth of coverage
- Map reads using the mrFAST algorithm to non-RepeatMasked regions of the genome
- 75 million 454 WGS reads (JDW) and 200-400 million Solexa WGS reads per individual (CEPH Trio)
Aksay, G and Alkan, C
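The window-calling rule on this slide (6 of 7 consecutive 5 kb windows exceeding 3 s.d.) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes "exceed 3 s.d." means depth above the control mean plus three standard deviations, and the function name and toy read counts are made up.

```python
from statistics import mean, stdev

def call_duplicated_windows(depths, control_depths, n_sd=3.0, span=7, min_hits=6):
    """Return indices of windows inside any run of `span` consecutive windows
    in which at least `min_hits` exceed mean + n_sd * s.d. of the controls."""
    threshold = mean(control_depths) + n_sd * stdev(control_depths)
    high = [d > threshold for d in depths]
    called = set()
    for start in range(len(depths) - span + 1):
        if sum(high[start:start + span]) >= min_hits:
            called.update(range(start, start + span))
    return sorted(called)

# Toy read counts per 5 kb window (illustrative values only).
control = [100, 96, 104, 98, 102, 101, 99, 97, 103, 100]
depths = [100, 99, 210, 205, 198, 101, 207, 202, 200, 98, 102]
print(call_duplicated_windows(depths, control))  # [2, 3, 4, 5, 6, 7, 8]
```

Allowing one of seven windows to fall below the threshold keeps a single noisy window (index 5 in the toy data) from splitting an otherwise continuous call.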
12. Correlation of Depth-of-Coverage with Copy-Number
454 WGS (JDW) Correlation with Copy-Number: R² = 0.94, R² = 0.96
Solexa WGS (NA12878) Correlation with Copy-Number: R² = 0.92, R² = 0.93
13. Personalized Duplication or Copy-Number Variation Maps
Tracks: Venter (Sanger), Watson (454), NA12878 (Solexa), NA12891 (Solexa), NA12892 (Solexa)
Regions: CNP1, CNP2
- Two known 70 kbp CNPs: the CNP1 duplication is absent in Venter but predicted in Watson and NA12878; CNP2 is present in the mother but in neither the father nor the child
14. Homozygous Deletion
Watson (454)
Venter (Sanger)
NA12878 (Solexa)
NA12878 NIM validated
NA12892 Agilent validated
15. Summary
- Sequencing: 1700 sites of common structural variation discovered and being sequenced; 500 structural variants (>5 kbp) per individual; human genome incomplete (at 15.6% of sites the reference is the minor allele, and 26.3% of sequence that is CNV is not in the reference genome)
- Clone resource provides a means to sequence any complex region of interest
- Genotyping: current commercial platforms cannot adequately and directly detect >50% of common structural variants
- Next-generation sequencing will increase the yield to several thousand sites per individual but will be biased to unique regions of the genome
  - +ve: copy number of duplications may be estimated by a depth-of-coverage approach
  - -ve: ESP is biased against insertions and events mapping within duplicated regions; these require longer reads or clone reagents
16. Acknowledgements
UWGSC: Maynard Olson, Rajinder Kaul
Eichler Lab: Jeff Kidd, Greg Cooper, Andy Sharp, Heather Mefford, Andy Itsara, Can Alkan, Gozde Aksay, Fereydoun Hormozdiari, Carl Baker, Eray Tuzun, Priscillia Siswara, Francesca Antonacci, Ze Cheng, Matthew Johnson, Zhaoshi Jiang, Xinwei She, Neil Shaffer, Maika Malig
UCSF: Dan Pinkel, Donna Albertson
WashU: Rick Wilson, Tina Graves
Oxford: Jonathan Flint, Samantha Knight
Agencourt: Doug Smith
U. of Pavia: Orsetta Zuffardi, Stefania Gimelli
UW: Joshua Smith, Debbie Nickerson, Troy Zerr
U. Nijmegen: Bert de Vries, Joris Veltman
Stanford: Rick Myers, Devin Absher, Jun Li
Epicure Consortium: Thomas Sander, Ingo Helbig
1000 Genomes Consortium
NIH: Andy Singleton
17. Properties of Normal Structural Variation
- Common: 50.3% (866/1720) of events seen in 2 or more individuals (n=9 individuals total)
- Small: median 7.8 kbp and average 13.1 kbp, with an average of 500-600 events per individual (>5 kbp)
- Gene family bias: 107 sequenced events directly affect gene structure; 87 of these belong to gene families
- Recurrence: estimate that 18% of the same events occur on different SNP haplotypes
- Human Genome Reference Incomplete
  - At 15.6% of sites, the reference genome is the minor allele
  - 26.3% of sites of structural variation correspond to sequence that is not represented once within the human genome
18. HERC2 Duplication
Watson (454)
Venter (Sanger)
NA12878 (Solexa)
NA12891 (Solexa)
NA12892 (Solexa)
No large differences!
Alkan, C.
19. Deletion Detection
- For unique regions (no CNV detected)
  - avg: 1672.62 reads/5 kbp
  - median: 1640
  - stdev: 423.17
- For fosmid ESP deletion regions (validated by one orthogonal method)
  - avg: 1273.96 reads/5 kbp
  - median: 1143
  - stdev: 663.96
- Of the 164 deletions in NA12878, 73 (44.5%) show no evidence of depth-of-coverage depression
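The statistics above can be put on a common scale by expressing the mean depth in deletion regions as a z-score against the unique-region distribution. This is an illustrative calculation from the slide's own numbers, not an analysis described in the talk; the variable names are hypothetical.

```python
# Summary statistics copied from the slide (reads per 5 kbp window, NA12878).
unique_mean, unique_sd = 1672.62, 423.17
deletion_mean = 1273.96

# Distance of the average deletion window below the unique-region mean, in s.d. units.
z = (deletion_mean - unique_mean) / unique_sd
# Depth in deletion regions relative to unique regions.
ratio = deletion_mean / unique_mean
# Fraction of validated deletions with no depth-of-coverage depression.
no_depression = 73 / 164

print(f"z = {z:.2f}, depth ratio = {ratio:.2f}, no depression = {no_depression:.1%}")
```

The average deletion window sits less than one standard deviation below the unique-region mean (z of about -0.94, depth ratio of about 0.76), so many individual deletions would be hard to separate from normal variation in depth, consistent with 44.5% showing no detectable depression.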
20. Hemizygous Deletion
21. ESP Analysis: NA12878 ESP Placement Stats
Max span: 1 million bp
Library 1: 71,848,232 pairs mapped; expected insert size 100 bp (5X unmasked physical coverage)
Library 2: 28,739,625 pairs mapped; expected insert size 150 bp (3X unmasked physical coverage)
22. ESP Analysis: NA12878 ESP Placement Stats
- Map ESP against the repeatmasked hg18 reference genome using mrFAST (Tuzun et al., 2005; Kidd et al., 2008)
- Sites supported by >2 independent clones are considered; any clone can have multiple discordant mappings in the first pass
- An algorithm based on Set-Cover is implemented to find a subset of repetitive (and unique) mappings where the total number of sites is minimized (at the end, each clone is assigned to a single location)
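The Set-Cover step can be illustrated with the standard greedy approximation: repeatedly pick the candidate site that explains the most still-unassigned clones, then assign those clones to it. This is only a sketch of the general technique; the slide does not say which set-cover variant was used, and the function name, the input layout and the `min_support` default (3, reading ">2 clones" literally) are assumptions.

```python
def assign_clones(mappings, min_support=3):
    """Greedy set cover.
    mappings: {clone_id: set of candidate site ids from discordant mappings}.
    Returns {site_id: set of clone_ids}, each clone assigned to one site."""
    # Invert to site -> clones that could support it.
    site_clones = {}
    for clone, sites in mappings.items():
        for site in sites:
            site_clones.setdefault(site, set()).add(clone)

    unassigned = set(mappings)
    chosen = {}
    while unassigned and site_clones:
        # Site covering the most unassigned clones; sorted() makes ties deterministic.
        best = max(sorted(site_clones), key=lambda s: len(site_clones[s] & unassigned))
        covered = site_clones[best] & unassigned
        if len(covered) < min_support:
            break  # no remaining site has enough independent support
        chosen[best] = covered
        unassigned -= covered
    return chosen

# Toy example: clone c1 maps to sites A and B, c3 to A and C, and so on.
toy = {"c1": {"A", "B"}, "c2": {"A"}, "c3": {"A", "C"}, "c4": {"C"}, "c5": {"B"}}
print(assign_clones(toy))
```

In the toy input, site A explains three clones and is selected; the ambiguous clones c1 and c3 are thereby resolved to A, leaving B and C with one clone each, below the support threshold.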
23. Comparison with Kidd 2008
- Library 1 only
- Insert size (100 bp) is too small to compare against longer insertions in the Kidd structural variation set
- Smaller insertions and deletions may intersect with the 1-100 bp indel set