Title: Human Population Genomics
1- Human Population Genomics
- Man, Woman, Birth, Death,
- Infinity, Plus
- Altruism, Cheap Talks,
- Bad Behavior, Money, God and
- Diversity on Steroids
2Jack Schwartz (1930-2009)
3Lord Jeffrey (misattributed, badly paraphrased)
- Damn the Human Genomes.
- Small populations
- Genes too distant
- Pestered with duplications
- Feeble contrivance
- Could make a better one myself!
4Small Populations
- Non-equilibrium Models
- Population Bottlenecks
- Not Well-mixed
- Migration/Colonization Patterns
- Catastrophic Infections
- Heterozygous Advantages
5Wright-Fisher Process
6Moran Process
[Figure labels: death, time]
- Overlapping generations
- Distribution of time to replication
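As a rough illustration of the Moran process with overlapping generations, the sketch below simulates a neutral two-allele population of fixed size: at each event one individual is chosen to reproduce and one is chosen to die. The function names, the population size, and the uniform (selection-free) choice of parent and victim are assumptions made for this example, not details taken from the slides.

```python
import random

def moran_step(count_a, n, rng):
    """One Moran event: one individual reproduces and one dies, so n stays fixed."""
    parent_is_a = rng.random() < count_a / n  # parent drawn uniformly at random
    dead_is_a = rng.random() < count_a / n    # individual to die drawn uniformly at random
    return count_a + int(parent_is_a) - int(dead_is_a)

def simulate_moran(n=50, initial_a=25, seed=0, max_events=1_000_000):
    """Track the number of A alleles until fixation, loss, or max_events."""
    rng = random.Random(seed)
    count_a = initial_a
    trajectory = [count_a]
    for _ in range(max_events):
        if count_a in (0, n):  # absorbing states: allele lost or fixed
            break
        count_a = moran_step(count_a, n, rng)
        trajectory.append(count_a)
    return trajectory

trajectory = simulate_moran()
print(len(trajectory) - 1, "events; final count of allele A:", trajectory[-1])
```

Because only one individual is replaced per event, generations overlap, in contrast to the Wright-Fisher process where the whole population is resampled at once.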
7Forces in Population Genetics
- How to understand forces that produce and
maintain inherited genetic variation
- Forces
- Mutation
- Recombination
- Natural Selection
- Population Structure/Migration
- Random birth/death (drift)
9Genes Too Distant
- About 20,000 genes (estimate in the 1980s: 120,000)
- Occurring about every 150 Kb
- Many more functional ncRNA
- snoRNA, siRNA, piRNA, etc.
- Uncharacterized
10Y
- From a gene's point of view, reshuffling is a
great restorative
- The Y, in its solitary state, disapproves of such
laxity. Apart from small parts near each tip
which line up with a shared section of the X, it
stands aloof from the great DNA swap. Its genes,
such as they are, remain in purdah as the
generations succeed. As a result, each Y is a
genetic republic, insulated from the outside
world. Like most closed societies it becomes both
selfish and wasteful. Every lineage evolves an
identity of its own which, quite often, collapses
under the weight of its own inborn weaknesses.
- Celibacy has ruined man's chromosome.
- Steve Jones, Y: The Descent of Men, 2002.
11DAZ locus on Y Chromosome
12Optical Mapping
- Capture and immobilize whole genomes as massive
collections of single DNA molecules
Cells gently lysed to extract genomic DNA
DNA captured in parallel arrays of long single
DNA molecules using microfluidic device
Genomic DNA, captured as single DNA molecules
produced by random breakage of intact chromosomes
13?
2. Interrogate with restriction endonucleases
3. Maintain order of restriction fragments in each molecule
Digestion reveals 6-nucleotide cleavage sites as
gaps
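To make the idea of an ordered restriction map concrete, here is a minimal sketch that computes the in silico map of a known sequence: the ordered list of fragment lengths obtained by cutting at every occurrence of a 6-nucleotide recognition site. The function name, the `cut_offset` parameter, and the toy sequence are illustrative assumptions; CGGCCG is the Eag I recognition sequence.

```python
def in_silico_map(sequence, site, cut_offset=1):
    """Return the ordered fragment lengths from cutting `sequence` at every `site`.

    cut_offset is the position inside the site where the strand is cut
    (an assumed parameter for this sketch).
    """
    sequence = sequence.upper()
    site = site.upper()
    cuts = []
    start = sequence.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = sequence.find(site, start + 1)  # allow overlapping sites
    boundaries = [0] + cuts + [len(sequence)]
    return [right - left for left, right in zip(boundaries, boundaries[1:])]

toy = "AAAACGGCCGTTTTTTCGGCCGAA"
print(in_silico_map(toy, "CGGCCG"))  # [5, 12, 7]
```

An optical map of a single molecule is a noisy observation of this kind of ordered fragment list.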
14??
- Single molecule maps are aligned to
sequence-based in silico maps to validate an
assembly
- The Bayesian likelihood function leads to
long-range haplotypic score function.
- Later we will see how to perform a shotgun map
assembly
15???
16????
- Sizing Error
- (Bernoulli labeling, absorption cross-section,
PSF)
- Partial Digestion
- False Optical Sites
- Orientation
- Spurious molecules, Optical chimerism, Calibration
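Several of the error sources listed above can be mimicked in a few lines. The sketch below takes a true ordered list of fragment lengths and returns one noisy single-molecule observation with partial digestion, false cut sites, sizing error, and unknown orientation. The parameter names, the Gaussian sizing model, and the uniform placement of false cuts are simplifying assumptions for illustration only, not the error model used by the authors.

```python
import random

def observe_molecule(fragments, digestion_rate=0.8, sizing_sd=0.1,
                     n_false_cuts=0, rng=None):
    """Simulate one noisy optical-map observation of a true fragment list."""
    rng = rng or random.Random()
    total = sum(fragments)
    # True cut positions are the running sums of the fragment lengths.
    true_cuts, position = [], 0.0
    for length in fragments[:-1]:
        position += length
        true_cuts.append(position)
    # Partial digestion: each true site is cut with probability digestion_rate.
    cuts = [c for c in true_cuts if rng.random() < digestion_rate]
    # False optical sites: spurious cuts placed uniformly along the molecule.
    cuts += [rng.uniform(0, total) for _ in range(n_false_cuts)]
    cuts.sort()
    bounds = [0.0] + cuts + [total]
    observed = [right - left for left, right in zip(bounds, bounds[1:])]
    # Sizing error: multiplicative Gaussian noise on every fragment length.
    observed = [max(0.0, x * (1 + rng.gauss(0, sizing_sd))) for x in observed]
    # Orientation: the molecule is read in either direction with equal chance.
    if rng.random() < 0.5:
        observed.reverse()
    return observed

print(observe_molecule([5, 12, 7], n_false_cuts=1, rng=random.Random(0)))
```

Map assembly has to recover the true fragment list from many such observations.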
Image of restriction enzyme digested YAC clone
YAC clone 6H3, derived from human chromosome 11,
digested with the restriction endonuclease Eag I
and Mlu I, stained with a fluorochrome and imaged
by fluorescence microscopy.
17?????
Various combinations of error sources lead to
NP-hard Problems
18Pestered with duplications
- Complex Genome Structures
- Segmental Duplications
- Many types of Polymorphisms
- (SNPs, CNVs, SVs, etc.)
- Models of Genome Dynamics
- GOD (Genome Organizing Devices)
- Models of Coalescence
19Segmental Duplications
- Segmental duplications have been found to be
associated with genomic disorders.
- Deletions: Williams-Beuren syndrome
- Duplications: Charcot-Marie-Tooth disease type 1A
- Inversions: Haemophilia A
- Translocations: Derivative 22, der(22), syndrome.
- Segmental duplications may be related to cancer
development by causing copy number fluctuations
- Duplication of myc in lung cancer, and ERBB2 in
breast cancer.
20Recent Segmental Duplications
Human
- 3.5-5% of the human genome is found to contain
segmental duplications, with length > 5 kb or > 1 kb,
identity > 90%.
- August 2001 assembly,
- Bailey, et al. 2002.
- April 2003 assembly,
- Cheung, et al. 2003.
- These duplications are estimated to have emerged
about 40 Mya under neutral assumption.
- The duplications are mostly interspersed
(non-tandem), and happen both inter- and
intra-chromosomally.
From Bailey, et al. 2002
21Recent Segmental Duplications
Mouse
- 1.2% of the mouse genome is found to contain
segmental duplications, with length > 5 kb,
identity > 90%.
- February 2003 mouse assembly,
- Cheung, et al. 2003.
- These duplications are estimated to have emerged
about 25 Mya under neutral assumption.
- The duplications happen both inter- and
intra-chromosomally.
From Cheung, et al. 2003
22Duplication Flanking Sequences
- What are the molecular mechanisms that caused the
recent segmental duplications in the human and
mouse genomes?
- Thermodynamic instability in the DNA sequences
- Recombination between homologous repeat elements
- Other unknown mechanisms.
23Thermodynamics
[Figure: control vs. data; windows from -512 bp to 512 bp around the 5'-breakpoint and 3'-breakpoint of the duplicated region]
24?
25??
- The duplication flanking regions contain more
repeats than the control sequences: SINE, LINE,
MaLR and HERV.
- Only the repeats from the younger subfamilies
(with lower sequence divergence levels) are
over-represented in the duplication flanking
sequences.
26The Model
27The Mathematical Model
[Figure: state-transition diagram along the axis "Time after duplication", divided into divergence intervals 0 ≤ δ < ε, ε ≤ δ < 2ε, ..., (k-1)ε ≤ δ < kε. Three rows of states f--, f+- and f++; inputs h0--, h1--, h0+-, h0++, h1++ (H0, H1); edge labels α, 2β, γ, β/2, 2γ; self-loop labels 1-α-2β, 1-α-β/2-γ, 1-α-2γ.]
h1: proportion of duplications by repeat recombination
h1++: proportion of duplications by recombination of the specific repeat
h1--: proportion of duplications by recombination of other repeats
h0: proportion of duplications by other repeat-unrelated mechanism
h0++: proportion of h0 with common specific repeat in the flanking regions
h0+-: proportion of h0 with no common specific repeat in the flanking regions
h0--: proportion of h0 with no specific repeat in the flanking regions
α: mutation rate in duplicated sequences
β: insertion rate of the specific repeat
γ: mutation rate in the specific repeat
δ: divergence level of duplications
ε: divergence interval of duplications
28Model Fitting
29Mer Frequencies
30Copy Number Variation Data
31CNVs in Unique regions
OR
32CNVs in Unique regions
                     Yoruba  Japanese  Chinese  CEPH
No polymorphism         810       817      817   799
Amplifications only      43        43       46    55
Deletions only           46        37       36    44
Mixed                     1       321        1   211
33CNVs in SD regions
AND
34CNV in SD regions
                     Yoruba  Japanese  Chinese  CEPH
No polymorphism         786       794      785   741
Amplifications only     124       135      141   129
Deletions only          101        86      101   141
Mixed                    43        40       27    44
Unique and SD regions show completely different
behavior of CNVs!
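One hedged way to quantify the contrast between the two tables is to compare, per population, the share of regions showing amplifications only or deletions only. The sketch below uses the counts from the slides; the "Mixed" row is deliberately left out of both tables as a simplifying assumption, and the dictionary layout and function name are invented for this example.

```python
POPULATIONS = ["Yoruba", "Japanese", "Chinese", "CEPH"]

# Counts copied from the two slides (rows: no polymorphism, amplifications only, deletions only).
UNIQUE = {"none": [810, 817, 817, 799], "amp": [43, 43, 46, 55], "del": [46, 37, 36, 44]}
SD = {"none": [786, 794, 785, 741], "amp": [124, 135, 141, 129], "del": [101, 86, 101, 141]}

def polymorphic_fraction(table):
    """Fraction of regions with amplifications only or deletions only, per population."""
    fractions = []
    for none, amp, dele in zip(table["none"], table["amp"], table["del"]):
        fractions.append((amp + dele) / (none + amp + dele))
    return fractions

for name, unique_frac, sd_frac in zip(POPULATIONS,
                                      polymorphic_fraction(UNIQUE),
                                      polymorphic_fraction(SD)):
    print(f"{name}: unique {unique_frac:.3f}, SD {sd_frac:.3f}")
```

Under this simple measure, the SD regions are more than twice as polymorphic as the unique regions in every population.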
35Distance-dependent recombination
- The chance of recombination depends on the
distance between Allele A and its copy
36Simulation (probabilistic model)
37Simulation (probabilistic model)
38Observations and Conclusions
- Mutation rate of 0.0001 and recombination rate of
0.001 in SD regions constitute the best fit to
observed real-life data.
- Single mutations cannot explain the observed data,
but convergence via recombination can.
- Evolution-by-Duplication (EBD) appears to play a
crucial role in evolution and molds the genetic
circuitry in a rather constrained way, before it
is subject to selection pressure
39Feeble Contrivance
- GWAS (Genome-Wide Association Studies)
- Common Variants vs. Rare Variants
- Haplotype Phasing/Linkage Analysis
- Poor Experiment Design
- Reference Sequences
- Genotypic vs. Haplotypic References
- Weak Technologies
40Common vs. Rare Disease Variants
- From Ionita-Laza (2009)
- There are two disease models
- CDCV - common disease, common variants
- CDRV - common disease, rare variants
- The current genome-wide association studies only
consider common variants (frequency at least 5%).
- Feasible with available resources
- The common loci identified so far have small
effects (ORs 1.1-1.5) and only explain a small
percentage of the estimated heritability.
- Rare susceptibility variants are expected to play
an important role
- population genetics theory (Pritchard, 2001)
- empirical evidence (BMI, blood pressure, autism,
Mendelian diseases, etc.)
41Effect Size Distribution
42Capture-Recapture Model
- Suppose we have sequence data on Nind individuals
in a genomic region.
- An individual shows variation at a position if
the corresponding allele is different from the
ancestral one.
- A position is variable, or is a variant, if there
is at least one individual in the dataset with a
variation at that position.
- Let xs be the number of individuals with
variation at position s, with xs > 0.
- What is N, the total, unknown number of variants
in the region?
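The quantities defined above can be illustrated with a small sketch that computes xs from a toy 0/1 variation matrix and then applies the Chao1 estimator, a standard capture-recapture-style lower bound on the total number of classes. Chao1 is used here only as a simple stand-in; it is not the estimator of Ionita-Laza et al., and the toy data and function names are assumptions.

```python
from collections import Counter

def variant_counts(genotypes):
    """genotypes[i][s] is 1 if individual i differs from the ancestral allele at position s.

    Returns xs for every variable position (positions with xs > 0).
    """
    n_positions = len(genotypes[0])
    counts = [sum(individual[s] for individual in genotypes) for s in range(n_positions)]
    return [x for x in counts if x > 0]

def chao1(xs):
    """Chao1 lower-bound estimate of N from singleton and doubleton counts."""
    frequency = Counter(xs)
    singletons, doubletons = frequency[1], frequency[2]
    observed = len(xs)
    if doubletons > 0:
        return observed + singletons ** 2 / (2 * doubletons)
    return observed + singletons * (singletons - 1) / 2  # bias-corrected form

toy = [
    [1, 0, 0, 1, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1, 0],
]
xs = variant_counts(toy)
print(xs, chao1(xs))  # [3, 1, 2, 1] 6.0
```

Here four variants are observed, two of them in a single individual, so the estimate suggests that roughly two more variants remain unseen.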
43One can estimate the following
- Δ(t): NEW variants expected to be found in a
FUTURE dataset of size t · Nind.
- t is a multiplier of the initial dataset size, Nind.
- Δf(t): new variants with frequency at least f
. . .
44ENCODE dataset
- Ten 500 Kb genomic regions were sequenced in
several unrelated DNA samples
- 8 Yoruba (YRI)
- 16 CEPH European (CEPH)
- 7 Han Chinese (CHB)
- 8 Japanese (JPT)
- To make results comparable across the four
populations (YRI, CEPH, CHB and JPT), they
considered only 7 of the sequenced individuals
for each dataset.
45ENCODE - Δf(t)
- From Ionita-Laza et al. 2009
46How to Make a Better Human?
- Debugging a human better
- Sequencing a genome
- Sequencing a population
47S M A S H
- Single
- Molecule
- Approach to
- Sequencing-by-
- Hybridization
48SMASH
- Sequence a human-size genome of about 6 Gb,
including both haplotypes.
- Integrate
- Optical Mapping (Ordered Restriction Maps)
- Hybridization (with short nucleobase probes, PNA
or LNA oligomers, with dsDNA on a surface), and
- Positional Sequencing by Hybridization (efficient
polynomial time algorithms to solve localized
versions of the PSBH problems)
49?
- Genomic DNA is carefully extracted
50??
- LNA probes of length 6-8 nucleotides are
hybridized to dsDNA (double-stranded genomic DNA)
- The modified DNA is stretched on a 1 x 1 chip.
51???
- DNA adheres to the surface along the channels and
stretches out.
- Size from 0.3-3 million base pairs in length.
- Bright emitters are attached to the probes and
imaged (Fig 3).
52????
- A restriction enzyme breaks the DNA at specific sites.
- The cut fragments of DNA relax like entropic
springs, leaving small visible gaps
53?????
- The DNA is then stained with a fluorogen (Fig 5)
and reimaged.
- The two images are combined in a composite image
- suggesting the locations of a specific short word
(e.g., probes) within the context of a pattern of
restriction sites.
54??????
- The integrated intensity measures the length of
the DNA fragments.
- The bright emitters on probes provide a profile
for locations of the probes.
The restriction sites are represented by a tall
rectangle; the probe sites by small circles
55???????
- These steps are repeated for all possible probe
compositions
- (modulo reverse complementarity).
- Software assembles the haplotypic ordered
restriction maps with approximate probe locations
superimposed on the map.
56SMASH
- Local clusters of overlapping words are combined
by our PSBH (positional sequencing by
hybridization) algorithm
57Probe Map (lambda DNA)
58Final Probe Map
- Consensus map with 2 probe locations
- 14.8% and 52.4% of the DNA length.
- In close agreement with the correct map
- 50.2% and 85.7% (known from the sequence)
- Implied probe hybridization rate 42%.
- Significantly better than the needed 30%
59Four AFM images of lambda DNA with PNA probes
A
60Combinatorial Structure
61Discretization
62Prediction
The probability of successfully computing the
correct restriction map as a function of the
number of cuts in the map and number of molecules
used in creating the map
63Gentig Bayesian Approach
64Bayesian Model
65Robustness
- BAC clones with 6-cutters
- Average clone size 160 Kb, average fragment
size 4 Kb, average number of cut sites 40.
- Parameters
- Digestion rate can be as low as 10%
- Orientation of DNA need not be known.
- 40% foreign DNA
- 85% DNA partially broken
- Relative sizing error up to 30%
- 30% spurious randomly located cuts
66Single Molecule Haplotyping: Candida albicans
- The left end of chromosome 1 of the common fungus
Candida albicans (being sequenced by Stanford).
- Three polymorphisms
- (A) Fragment 2 is of size 41.19kb (top) vs
38.73kb (bottom).
- (B) The 3rd fragment of size 7.76kb is missing
from the top haplotype.
- (C) The large fragment in the middle is of size
61.78kb vs 59.66kb.
67Problem to Solve
- Given probe maps of some small region of the
genome for all N-bp hybridization probes (e.g.
all 2080 probes of 6-bp).
- With known error rates (false positive, false
negatives and sizing errors).
- Can we reconstruct the complete sequence?
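The figure of 2080 probes follows from counting 6-mers modulo reverse complementarity: of the 4^6 = 4096 words, 4^3 = 64 are their own reverse complement, so the number of distinct probes is (4096 + 64) / 2 = 2080. The short sketch below checks this by enumeration; the function names are illustrative.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(word):
    return word.translate(COMPLEMENT)[::-1]

def canonical_probes(k):
    """One representative per {word, reverse complement} pair of k-mers."""
    probes = set()
    for letters in product("ACGT", repeat=k):
        word = "".join(letters)
        probes.add(min(word, reverse_complement(word)))
    return sorted(probes)

print(len(canonical_probes(6)))  # 2080
```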
68Basic reconstruction algorithm
- Keep track of multiple sequence assemblies.
- Initialize with all possible 5-bp sequences.
- Try all 4 possible extensions of each sequence.
- Check if the probe is present in the corresponding map;
if not, add a penalty score to the sequence
involved.
- Periodically delete sequences with high penalty.
- Stop when the missing probe rate jumps significantly
from the false negative rate (2%) to (100% - false
extension rate), 55%.
- Return highest scoring sequence.
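The steps above can be sketched as a small beam search. This is a deliberately simplified illustration, not the authors' PSBH algorithm: it ignores probe positions and sizing errors, assumes an error-free probe set, and stops at a known target length instead of watching the missing-probe rate. All function and parameter names are hypothetical.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def probe_set(seq, k=6):
    """All k-mers that would hybridize to either strand of `seq` (no errors)."""
    probes = set()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        probes.add(word)
        probes.add(reverse_complement(word))
    return probes

def reconstruct(probes, target_length, k=6, beam_width=200):
    """Extend candidate assemblies one base at a time, penalizing missing probes."""
    # Initialize with all possible (k-1)-bp sequences, each with penalty 0.
    candidates = {"".join(p): 0 for p in product("ACGT", repeat=k - 1)}
    for _ in range(target_length - (k - 1)):
        extended = {}
        for seq, penalty in candidates.items():
            for base in "ACGT":  # try all 4 possible extensions
                longer = seq + base
                missing = longer[-k:] not in probes
                extended[longer] = penalty + int(missing)
        # Delete high-penalty assemblies: keep only the best beam_width.
        ranked = sorted(extended.items(), key=lambda item: item[1])
        candidates = dict(ranked[:beam_width])
    return min(candidates.items(), key=lambda item: item[1])

truth = "AAAAACAAACCAACACAACC"
best, penalty = reconstruct(probe_set(truth), len(truth))
print(best, penalty)
```

Because probes bind double-stranded DNA, the search cannot tell a sequence from its reverse complement, so either strand may be returned with zero penalty.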
69Anomalies
- Irresolvable Ambiguities
- From assemblies based on 6bp probes
- Error Pattern: s w sRC
- Correct Pattern: s wRC sRC
- s = tcgcc (any 5 bases)
- sRC = ggcga (reverse complement of s)
- w = CCCCTAAC (any short sequence under 50bp)
- wRC = GTTAGGGG (reverse complement of w)
Assembly: tcgccCCCCTAACggcga
Correct: tcgccGTTAGGGGggcga
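The ambiguity can be checked directly: the reverse complement of s w sRC is s wRC sRC, so the two patterns contain exactly the same 6-bp probes once each probe is identified with its reverse complement. The sketch below verifies this for the example on the slide; the helper names are illustrative.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def double_stranded_spectrum(seq, k=6):
    """Set of k-mers in seq, each identified with its reverse complement."""
    seq = seq.upper()
    words = set()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        words.add(min(word, reverse_complement(word)))
    return words

s, w = "tcgcc", "CCCCTAAC"
assembled = s + w + reverse_complement(s)
correct = s + reverse_complement(w) + reverse_complement(s)

print(assembled)  # tcgccCCCCTAACggcga
print(correct)    # tcgccGTTAGGGGggcga
print(reverse_complement(assembled) == correct)                                  # True
print(double_stranded_spectrum(assembled) == double_stranded_spectrum(correct))  # True
```

With 6-bp probes, a 5-base inverted repeat on both sides of w is enough to hide the orientation of w from the probe content alone.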
70Directed Eulerian Graph
71?
- Mixing solid bases with wild-card bases
- E.g., xx-x-x-xx (9-mers) or xxx--x--x--xxx (14-mers)
- An inert base
- Universal: in terms of its ability to form base
pairs with the other natural DNA/RNA bases.
- Examples
- The naturally occurring base hypoxanthine, as its
ribo- or 2'-deoxyribonucleoside
- 2'-deoxyisoinosine
- 7-deaza-2'-deoxyinosine
- 2-aza-2'-deoxyinosine
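As a small illustration of gapped probes, the sketch below slides a pattern such as xx-x-x-xx along a sequence and reports the word each window would present to the probe, with wild-card positions written as N. The function name and the use of N for the universal base are assumptions for this example.

```python
def gapped_words(sequence, pattern):
    """Apply a solid/wild-card pattern ('x' = solid base, '-' = universal base) to every window."""
    width = len(pattern)
    words = []
    for i in range(len(sequence) - width + 1):
        window = sequence[i:i + width]
        word = "".join(base if mark == "x" else "N"
                       for base, mark in zip(window, pattern))
        words.append(word)
    return words

print(gapped_words("ACGTACGTAC", "xx-x-x-xx"))  # ['ACNTNCNTA', 'CGNANGNAC']
```

The 9-mer pattern above has 6 solid positions, so it spans a longer stretch of sequence than an ungapped 6-mer while using the same number of specific bases.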
72Simulation Results
UNGAPPED
GAPPED
73 1000 Rupees Genome
$22.67 US for 6 billion bases; $135 billion US
for the entire human population
74Who we are
- Population
- David Albers (Columbia)
- Eric Aslakson (NYU)
- Mickey Atwal (CSHL)
- Ivan Iossifov (CSHL)
- Hossein Khiabanian (Columbia)
- Samantha Kleinberg (NYU)
- Partha Mitra (CSHL)
- Michaela Oswald (CSHL)
- Raul Rabadan (Columbia)
- Vladimir Trifonov (Columbia)
- Daniel Valente (CSHL)
- Chris Wiggins (Columbia)
- Polymorphisms
- Iuliana Ionita-Laza (Harvard)
- Antonina Mitrofanova (NYU)
- Joey Zhao (Princeton)
- SMASH
- TS Anantharaman (OpGen)
- Charles Cantor (Sequenom)
- Vladimir Demidov (BU)
- Pierre Franquin (NYU)
- Alex Lim (Ex-NYU)
- Toto Paxia (Ex-NYU)
- Jason Reed (UCLA)
- Andrew Sundstrom (NYU)
- SUTTA
- Giuseppe Narzisi (NYU)
- Alessio Narzisi (NYU/Catania)
75Lord Jeffrey
- Beware prejudices.
- They are like rats, and men's minds are like
traps; prejudices get in easily, but it is
doubtful if they ever get out.