Human Population Genomics - PowerPoint PPT Presentation

About This Presentation

Title:

Human Population Genomics

Description:

Human Population Genomics: Man, Woman, Birth, Death, Infinity, Plus Altruism, Cheap Talks, Bad Behavior, Money, God and

Slides: 72

Provided by: Giuseppe95

Learn more at: https://cs.nyu.edu

Transcript and Presenter's Notes

Title: Human Population Genomics

1

Human Population Genomics
Man, Woman, Birth, Death,
Infinity, Plus
Altruism, Cheap Talks,
Bad Behavior, Money, God and
Diversity on Steroids

2
Jack Schwartz (1930-2009)
3
Lord Jeffrey (misattributed, badly paraphrased)

Damn the Human Genomes.
Small populations
Genes too distant
Pestered with duplications
Feeble contrivance
Could make a better one myself!

4
Small Populations

Non-equilibrium Models
Population Bottlenecks
Not Well-mixed
Migration/Colonization Patterns
Catastrophic Infections
Heterozygous Advantages

5
Wright-Fisher Process
6
Moran Process
death
time

Overlapping generations
Distribution of time to replication
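
The Moran process on this slide can be illustrated with a minimal sketch. It assumes the standard neutral version (one individual chosen uniformly at random to reproduce and one to die at each step, so generations overlap); the function name and parameters are hypothetical and not taken from the slides.

```python
import random

def moran_until_absorbed(n, i, rng):
    """Neutral Moran process: n individuals, i of them carry allele A.

    Each step picks one individual to reproduce and one to die,
    both uniformly at random, until A is lost (0) or fixed (n).
    Returns the final count of A and the number of steps taken.
    """
    steps = 0
    while 0 < i < n:
        birth_is_a = rng.random() < i / n  # parent carries A
        death_is_a = rng.random() < i / n  # individual that dies carries A
        i += int(birth_is_a) - int(death_is_a)
        steps += 1
    return i, steps

rng = random.Random(0)
final, steps = moran_until_absorbed(20, 10, rng)
print(final, steps)
```

Repeating the run many times with different seeds would give an empirical distribution of absorption times, which is one way to read "distribution of time to replication" on the slide.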

7
Forces in Population Genetics

How to understand forces that produce and
maintain inherited genetic variation
Forces
Mutation
Recombination
Natural Selection
Population Structure/Migration
Random birth/death (drift)

8
(No Transcript)
9
Genes Too Distant

20,000 genes (estimate in the 1980s: 120,000)
Occurring about every 150 kb
Many more functional ncRNA
snoRNA, siRNA, piRNA, etc.
Uncharacterized

10
Y

From a gene's point of view, reshuffling is a
great restorative
The Y, in its solitary state, disapproves of such
laxity. Apart from small parts near each tip
which line up with a shared section of the X, it
stands aloof from the great DNA swap. Its genes,
such as they are, remain in purdah as the
generations succeed. As a result, each Y is a
genetic republic, insulated from the outside
world. Like most closed societies it becomes both
selfish and wasteful. Every lineage evolves an
identity of its own which, quite often, collapses
under the weight of its own inborn weaknesses.
Celibacy has ruined man's chromosome.
Steve Jones, Y: The Descent of Men, 2002.

11
DAZ locus on Y Chromosome
12
Optical Mapping

1. Capture and immobilize whole genomes as massive
collections of single DNA molecules

Cells gently lysed to extract genomic DNA
DNA captured in parallel arrays of long single
DNA molecules using microfluidic device
Genomic DNA, captured as single DNA molecules
produced by random breakage of intact chromosomes
13
2. Interrogate with restriction endonucleases
3. Maintain order of restriction fragments in each molecule
Digestion reveals 6-nucleotide cleavage sites as gaps
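
An ordered restriction map can be sketched in silico as the list of fragment lengths between consecutive cleavage sites. The snippet below is only an illustration: the function name is hypothetical, it scans a single strand, and it assumes the enzyme cuts one base into its 6-nucleotide recognition site (as Eag I does at C^GGCCG).

```python
def ordered_restriction_map(seq, site, cut_offset=1):
    """Return fragment lengths, in order, after cutting seq at every site.

    cut_offset is the position of the cut inside the recognition site
    (an assumption of this sketch; 1 means after the first base).
    """
    seq = seq.upper()
    cuts = []
    start = 0
    while True:
        i = seq.find(site, start)
        if i == -1:
            break
        cuts.append(i + cut_offset)
        start = i + 1  # allow overlapping occurrences of the site
    bounds = [0] + cuts + [len(seq)]
    return [b - a for a, b in zip(bounds, bounds[1:])]

toy = "AAAA" + "CGGCCG" + "TTTTTTTT" + "CGGCCG" + "GG"
print(ordered_restriction_map(toy, "CGGCCG"))  # [5, 14, 7]
```

The fragment order is what distinguishes this from a plain list of fragment sizes: the gaps are observed along one intact molecule.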
14

Single molecule maps are aligned to
sequence-based in silico maps to validate an
assembly
The Bayesian likelihood function leads to a
long-range haplotypic score function.
Later we will see how to perform a shotgun map
assembly.

15
16

Sizing Error
(Bernoulli labeling, absorption cross-section,
PSF)
Partial Digestion
False Optical Sites
Orientation
Spurious molecules, Optical chimerism, Calibration
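
The error sources listed above can be combined in a small simulation of one observed molecule. This is a hedged sketch, not the error model used in the talk: the function name, the default rates, and the choice of multiplicative Gaussian sizing noise are all assumptions made for illustration.

```python
import random

def observe_molecule(cuts, length, rng, digest_rate=0.8, false_cuts=1,
                     sizing_sd=0.05, flip_prob=0.5):
    """Simulate the fragment sizes seen on one noisy single-molecule map."""
    # Partial digestion: each true site is cut with probability digest_rate.
    seen = [c for c in cuts if rng.random() < digest_rate]
    # False optical sites: spurious cuts at uniformly random positions.
    seen += [rng.uniform(0, length) for _ in range(false_cuts)]
    seen.sort()
    bounds = [0.0] + seen + [float(length)]
    frags = [b - a for a, b in zip(bounds, bounds[1:])]
    # Sizing error: multiplicative Gaussian noise on each fragment (assumed).
    frags = [f * max(0.0, 1.0 + rng.gauss(0.0, sizing_sd)) for f in frags]
    # Unknown orientation: the molecule may be read in either direction.
    if rng.random() < flip_prob:
        frags.reverse()
    return frags

rng = random.Random(42)
print(observe_molecule([10, 30, 45], 60, rng))
```

Turning every error source off (digestion rate 1, no false cuts, zero sizing noise, no flips) returns the true map, which is a convenient sanity check.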

Image of a restriction-enzyme-digested YAC clone:
YAC clone 6H3, derived from human chromosome 11,
digested with the restriction endonucleases Eag I
and Mlu I, stained with a fluorochrome and imaged
by fluorescence microscopy.
17
Various combinations of error sources lead to
NP-hard Problems
18
Pestered with duplications

Complex Genome Structures
Segmental Duplications
Many types of Polymorphisms
(SNPs, CNVs, SVs, etc.)
Models of Genome Dynamics
GOD (Genome Organizing Devices)
Models of Coalescence

19
Segmental Duplications

Segmental duplications have been found to be
associated with genomic disorders.
Deletions: Williams-Beuren syndrome
Duplications: Charcot-Marie-Tooth disease type 1A
Inversions: Haemophilia A
Translocations: Derivative 22, der(22), syndrome.
Segmental duplications may be related to cancer
development by causing copy number fluctuations:
duplication of MYC in lung cancer, and of ERBB2 in
breast cancer.

20
Recent Segmental Duplications
Human

3.5-5% of the human genome is found to contain
segmental duplications, with length > 5 kb or > 1 kb
and identity > 90%.
August 2001 assembly:
Bailey et al. 2002.
April 2003 assembly:
Cheung et al. 2003.
These duplications are estimated to have emerged
about 40 Mya under the neutral assumption.
The duplications are mostly interspersed
(non-tandem), and happen both inter- and
intra-chromosomally.

From Bailey, et al. 2002
21
Recent Segmental Duplications
Mouse

1.2% of the mouse genome is found to contain
segmental duplications, with length > 5 kb and
identity > 90%.
February 2003 mouse assembly:
Cheung et al. 2003.
These duplications are estimated to have emerged
about 25 Mya under the neutral assumption.
The duplications happen both inter- and
intra-chromosomally.

From Cheung, et al. 2003
22
Duplication Flanking Sequences

What are the molecular mechanisms that caused the
recent segmental duplications in the human and
mouse genomes?
Thermodynamic instability in the DNA sequences
Recombination between homologous repeat elements
Other unknown mechanisms.

23
Thermodynamics
(Figure: Control vs. Data around the 5'-breakpoint and 3'-breakpoint of the duplicated region; axis marks at -512 bp and 512 bp.)
24
25

The duplication flanking regions contain more
repeats than the control sequences: SINE, LINE,
MaLR and HERV.
Only the repeats from the younger subfamilies
(with lower sequence divergence levels) are
over-represented in the duplication flanking
sequences.

26
The Model
27
The Mathematical Model
Time after duplication
(Figure: state-transition diagram of the model. Divergence intervals: 0 ≤ δ < ε, ε ≤ δ < 2ε, ..., (k-1)ε ≤ δ < kε. State labels: h0--, h0+-, h0++, h1--, h1++, f--, f+-, f++, H0, H1. Transition labels: α, 2β, β/2, γ, 2γ, 1-α-2β, 1-α-β/2-γ, 1-α-2γ.)
h1: proportion of duplications by repeat recombination
h1++: proportion of duplications by recombination of the specific repeat
h1--: proportion of duplications by recombination of other repeats
h0: proportion of duplications by other repeat-unrelated mechanism
h0++: proportion of h0 with common specific repeat in the flanking regions
h0+-: proportion of h0 with no common specific repeat in the flanking regions
h0--: proportion of h0 with no specific repeat in the flanking regions
α: mutation rate in duplicated sequences
β: insertion rate of the specific repeat
γ: mutation rate in the specific repeat
δ: divergence level of duplications
ε: divergence interval of duplications.
28
Model Fitting
29
Mer Frequencies
30
Copy Number Variation Data
31
CNVs in Unique regions
OR
32
CNVs in Unique regions
                     Yoruba  Japanese  Chinese  CEPH
No polymorphism         810       817      817   799
Amplifications only      43        43       46    55
Deletions only           46        37       36    44
Mixed 1 321 1 211
33
CNVs in SD regions
AND
34
CNV in SD regions
                     Yoruba  Japanese  Chinese  CEPH
No polymorphism         786       794      785   741
Amplifications only     124       135      141   129
Deletions only          101        86      101   141
Mixed                    43        40       27    44
Unique and SD regions show completely different
behavior of CNVs!
35
Distance-dependent recombination

The chance of recombination depends on the
distance between Allele A and its copy

36
Simulation (probabilistic model)
37
Simulation (probabilistic model)
38
Observations and Conclusions

Mutation rate of 0.0001 and recombination rate of
0.001 in SD regions constitute the best fit to
observed real life data.
Single mutations cannot explain the observed data,
but convergence via recombination can.
Evolution-by-Duplication (EBD) appears to play a
crucial role in evolution and molds the genetic
circuitry in a rather constrained way, before it
is subject to selection pressure.

39
Feeble Contrivance

GWAS (Genome-Wide Association Studies)
Common Variants vs. Rare Variants
Haplotype Phasing/Linkage Analysis
Poor Experiment Design
Reference Sequences
Genotypic vs. Haplotypic References
Weak Technologies

40
Common vs. Rare Disease Variants

From Ionita-Laza (2009)
There are two disease models
CDCV - common disease, common variants
CDRV - common disease, rare variants
The current genome-wide association studies only
consider common variants (frequency at least 5%).
Feasible with available resources
The common loci identified so far have small
effects (ORs 1.1-1.5) and only explain a small
percentage of the estimated heritability.
Rare susceptibility variants are expected to play
an important role
population genetics theory (Pritchard, 2001)
empirical evidence (BMI, blood pressure, autism,
Mendelian diseases etc.)

41
Effect Size Distribution
42
Capture-Recapture Model

Suppose we have sequence data on Nind individuals
in a genomic region.
An individual shows variation at a position if
the corresponding allele is different from the
ancestral one.
A position is variable or is a variant if there
is at least one individual in the dataset with a
variation at that position.
Let x_s be the number of individuals with
variation at position s (x_s > 0).
What is N, the total, unknown number of variants
in the region?

43
One can estimate the following

Δ(t): NEW variants expected to be found in a
FUTURE dataset of size t · Nind.
t is a multiplier of the initial dataset size, Nind.
Δf(t): new variants with frequency at least f
. . .
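
To make the quantities above concrete, here is a minimal sketch of one way to extrapolate the number of new variants from the counts x_s. The slides do not give the estimator used by Ionita-Laza et al.; the code below swaps in the classical Good-Toulmin non-parametric estimator, which is only reliable for t ≤ 1, and all function names are hypothetical.

```python
from collections import Counter

def frequency_counts(x):
    """n_k = number of variant positions carried by exactly k individuals."""
    return Counter(v for v in x if v > 0)

def good_toulmin_new_variants(x, t):
    """Good-Toulmin estimate of new variants in a future dataset of
    size t * Nind, given x_s for each observed variant position s.

    Uses sum over k of (-1)^(k+1) * t^k * n_k; intended for t <= 1.
    """
    n = frequency_counts(x)
    return sum((-1) ** (k + 1) * t ** k * n_k for k, n_k in n.items())

# Toy data: three singletons, two doubletons, one variant seen three times.
x = [1, 1, 1, 2, 2, 3]
print(good_toulmin_new_variants(x, 1.0))  # 2.0
print(good_toulmin_new_variants(x, 0.5))  # 1.125
```

The estimate is driven mostly by the singletons (n_1), which matches the intuition that many rare variants in the sample signal many more unseen ones.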

44
ENCODE dataset

Ten 500 kb genomic regions were sequenced in
several unrelated DNA samples
8 Yoruba (YRI)
16 CEPH European (CEPH)
7 Han Chinese (CHB)
8 Japanese (JPT)
To make results comparable across the four
populations (YRI, CEPH, CHB and JPT), they
considered only 7 of the sequenced individuals
for each dataset.

45
ENCODE - Δf(t)

From Ionita-Laza et al. 2009

46
How to Make a Better Human?

Debugging a human better
Sequencing a genome
Sequencing a population

47
S · M · A · S · H

Single
Molecule
Approach to
Sequencing-by-
Hybridization

48
SMASH

Sequence a human-size genome of about 6
Gb, including both haplotypes.
Integrate:
Optical Mapping (Ordered Restriction Maps)
Hybridization (with short nucleobase probes, PNA
or LNA oligomers, with dsDNA on a surface), and
Positional Sequencing by Hybridization (efficient
polynomial time algorithms to solve localized
versions of the PSBH problems)

49

Genomic DNA is carefully extracted

50

LNA probes of length 6-8 nucleotides are
hybridized to dsDNA (double-stranded genomic DNA)
The modified DNA is stretched on a 1 x 1 chip.

51

DNA adheres to the surface along the channels and
stretches out.
Sizes range from 0.3 to 3 million base pairs in length.
Bright emitters are attached to the probes and
imaged (Fig 3).

52

A restriction enzyme breaks the DNA at specific sites.
The cut fragments of DNA relax like entropic
springs, leaving small visible gaps.

53

The DNA is then stained with a fluorogen (Fig 5)
and reimaged.
The two images are combined in a composite image
suggesting the locations of a specific short word
(e.g., probes) within the context of a pattern of
restriction sites.

54

The integrated intensity measures the length of
the DNA fragments.
The bright emitters on the probes provide a profile
for the locations of the probes.

The restriction sites are represented by tall
rectangles, the probe sites by small circles.
55

These steps are repeated for all possible probe
compositions
(modulo reverse complementarity).
Software assembles the haplotypic ordered
restriction maps with approximate probe locations
superimposed on the map.

56
SMASH

Local clusters of overlapping words are combined
by our PSBH (positional sequencing by
hybridization) algorithm

57
Probe Map (lambda DNA)
58
Final Probe Map

Consensus map with 2 probe locations at
14.8% and 52.4% of the DNA length.
In close agreement with the correct map:
50.2% and 85.7% (known from the sequence)
Implied probe hybridization rate: 42%.
Significantly better than the needed 30%

59
Four AFM images of lambda DNA with PNA probes
A
60
Combinatorial Structure
61
Discretization
62
Prediction
The probability of successfully computing the
correct restriction map as a function of the
number of cuts in the map and number of molecules
used in creating the map
63
Gentig Bayesian Approach
64
Bayesian Model
65
Robustness

BAC Clones with 6-cutters
Average clone size 160 kb, average fragment
size 4 kb, average number of cut sites 40.
Parameters
Digestion rate can be as low as 10%
Orientation of DNA need not be known.
40% foreign DNA
85% DNA partially broken
Relative sizing error up to 30%
30% spurious randomly located cuts

66
Single Molecule Haplotyping: Candida albicans

The left end of chromosome 1 of the common fungus
Candida albicans (being sequenced by Stanford).
Three polymorphisms
(A) Fragment 2 is of size 41.19kb (top) vs
38.73kb (bottom).
(B) The 3rd fragment of size 7.76kb is missing
from the top haplotype.
(C)The large fragment in the middle is of size
61.78kb vs 59.66kb.

67
Problem to Solve

Given probe maps of some small region of the
genome for all N-bp hybridization probes (e.g.
all 2080 probes of 6-bp).
With known error rates (false positive, false
negatives and sizing errors).
Can we reconstruct the complete sequence?
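
The figure of 2080 probes of 6 bp can be checked by enumerating all 6-mers and merging each with its reverse complement, since a probe and its reverse complement mark the same site on double-stranded DNA. The helper names below are hypothetical.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def canonical_probes(k):
    """All k-mers modulo reverse complementarity.

    Each class is represented by the lexicographically smaller of a
    k-mer and its reverse complement.
    """
    probes = set()
    for letters in product("ACGT", repeat=k):
        kmer = "".join(letters)
        probes.add(min(kmer, revcomp(kmer)))
    return probes

print(len(canonical_probes(6)))  # 2080
```

The count follows from 4^6 = 4096 six-mers, of which 4^3 = 64 are their own reverse complement: (4096 - 64) / 2 + 64 = 2080.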

68
Basic reconstruction algorithm

Keep track of multiple sequence assemblies.
Initialize with all possible 5-bp sequences.
Try all 4 possible extensions of each sequence.
Check if the probe is present in the corresponding map;
if not, add a penalty score to the sequence
involved.
Periodically delete sequences with high penalty.
Stop when the missing probe rate jumps significantly
from the false negative rate (2%) to (100% - false
extension rate), 55%.
Return highest scoring sequence.
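
The steps above can be sketched as a small beam search. This is a simplified illustration under stated assumptions, not the PSBH implementation from the talk: probes are read on a single strand, the probe maps are given as a set of (position, k-mer) pairs, the search stops at a known target length instead of watching the missing-probe rate, and all names and defaults are hypothetical.

```python
from itertools import product

def reconstruct(observed, length, k=6, tol=0, max_penalty=2):
    """Extend candidate sequences base by base, penalizing missing probes.

    observed: set of (position, k-mer) pairs taken from the probe maps.
    tol: positional tolerance when looking a probe up in its map.
    """
    def present(pos, kmer):
        return any((pos + d, kmer) in observed for d in range(-tol, tol + 1))

    # Initialize with all possible (k-1)-bp sequences, each with penalty 0.
    beam = {"".join(p): 0 for p in product("ACGT", repeat=k - 1)}
    for _ in range(length - (k - 1)):
        extended = {}
        for seq, penalty in beam.items():
            for base in "ACGT":  # try all 4 possible extensions
                cand = seq + base
                miss = 0 if present(len(cand) - k, cand[-k:]) else 1
                # Delete sequences whose penalty has grown too high.
                if penalty + miss <= max_penalty:
                    extended[cand] = penalty + miss
        beam = extended
    # Return the candidate with the lowest penalty.
    return min(beam, key=beam.get)

truth = "ACGTTGCAAGGCTA"
probe_maps = {(i, truth[i:i + 6]) for i in range(len(truth) - 5)}
print(reconstruct(probe_maps, len(truth)))  # ACGTTGCAAGGCTA
```

With error-free maps only the true sequence keeps a penalty of zero; with real false negatives the penalty threshold and the positional tolerance would have to be tuned to the known error rates.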

69
Anomalies

Irresolvable Ambiguities
From assemblies based on 6bp probes
Error Pattern: s w sRC
Correct Pattern: s wRC sRC
s = tcgcc (any 5 bases)
sRC = ggcga (reverse complement of s)
w = CCCCTAAC (any short sequence under 50bp)
wRC = GTTAGGGG (reverse complement of w)

Assembly: tcgccCCCCTAACggcga
Correct: tcgccGTTAGGGGggcga
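
One way to see why this ambiguity cannot be resolved from probe content alone: the erroneous assembly is exactly the reverse complement of the correct sequence, so both contain the same 6-bp probes once reverse complements are merged. The check below is a sketch with hypothetical helper names; it compares probe content only and ignores probe positions.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def probe_spectrum(seq, k=6):
    """Multiset of k-mers in seq, merged with their reverse complements."""
    kmers = (seq[i:i + k] for i in range(len(seq) - k + 1))
    return Counter(min(km, revcomp(km)) for km in kmers)

s, w = "TCGCC", "CCCCTAAC"
assembly = s + w + revcomp(s)          # error pattern:   s w sRC
correct = s + revcomp(w) + revcomp(s)  # correct pattern: s wRC sRC

print(revcomp(assembly) == correct)                         # True
print(probe_spectrum(assembly) == probe_spectrum(correct))  # True
```

Because w is short, the positions of the probes inside it presumably fall below the positional resolution of the maps, leaving nothing to distinguish the two orientations.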
70
Directed Eulerian Graph
71

Mixing solid bases with wild-card bases
E.g., xx-x-x-xx (9-mers) or xxx--x--x--xxx
(14-mers)
An inert base
Universal: in terms of its ability to form base
pairs with the other natural DNA/RNA bases.
Examples:
The naturally occurring base hypoxanthine, as its
ribo- or 2'-deoxyribonucleoside
2'-deoxyisoinosine
7-deaza-2'-deoxyinosine
2-aza-2'-deoxyinosine