Human Population Genomics - PowerPoint PPT Presentation

Slides: 72
Provided by: Giuseppe95
Learn more at: https://cs.nyu.edu
Transcript and Presenter's Notes

Title: Human Population Genomics


1
  • Human Population Genomics
  • Man, Woman, Birth, Death,
  • Infinity, Plus
  • Altruism, Cheap Talks,
  • Bad Behavior, Money, God and
  • Diversity on Steroids

2
Jack Schwartz (1930-2009)
3
Lord Jeffrey (misattributed, badly paraphrased)
  • Damn the Human Genomes.
  • Small populations
  • Genes too distant
  • Pestered with duplications
  • Feeble contrivance
  • Could make a better one myself!

4
Small Populations
  • Non-equilibrium Models
  • Population Bottlenecks
  • Not Well-mixed
  • Migration/Colonization Patterns
  • Catastrophic Infections
  • Heterozygous Advantages

5
Wright-Fisher Process
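
A minimal sketch of the Wright-Fisher process named on this slide. It assumes a haploid population of constant size, two neutral alleles, non-overlapping generations, and no mutation; the function name and parameters are illustrative and not taken from the presentation.

```python
import random


def wright_fisher(n, initial_count, generations, seed=None):
    """Track the copy number of one allele under neutral drift.

    n             -- constant number of gene copies (assumed haploid)
    initial_count -- copies of the tracked allele in generation 0
    """
    rng = random.Random(seed)
    count = initial_count
    trajectory = [count]
    for _ in range(generations):
        p = count / n
        # Each of the n offspring picks a parent uniformly at random,
        # so the new count is a Binomial(n, p) draw.
        count = sum(1 for _ in range(n) if rng.random() < p)
        trajectory.append(count)
    return trajectory


print(wright_fisher(n=50, initial_count=25, generations=20, seed=1))
```

Counts of 0 and n are absorbing in this sketch: once an allele is lost or fixed, drift alone cannot change it.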
6
Moran Process
(Figure; labels: death, time)
  • Overlapping generations
  • Distribution of time to replication
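
A minimal sketch of a neutral Moran process, to illustrate the overlapping generations mentioned above. It assumes two alleles, constant population size, and no mutation or selection; in each step one individual reproduces and one dies. The names are illustrative only.

```python
import random


def moran(n, initial_count, steps, seed=None):
    """Track the copy number of one allele, one birth and one death per step."""
    rng = random.Random(seed)
    count = initial_count
    trajectory = [count]
    for _ in range(steps):
        p = count / n
        parent_has_allele = rng.random() < p  # individual chosen to reproduce
        dead_has_allele = rng.random() < p    # individual chosen to die
        count += int(parent_has_allele) - int(dead_has_allele)
        trajectory.append(count)
    return trajectory


print(moran(n=20, initial_count=10, steps=30, seed=3))
```

Because only one individual is replaced per step, the count changes by at most one at a time, in contrast to the whole-population resampling of the Wright-Fisher process.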

7
Forces in Population Genetics
  • How to understand forces that produce and
    maintain inherited genetic variation
  • Forces
  • Mutation
  • Recombination
  • Natural Selection
  • Population Structure/Migration
  • Random birth/death (drift)

8
(No Transcript)
9
Genes Too Distant
  • About 20,000 genes (the estimate in the 1980s was 120,000)
  • Occurring about every 150 Kb
  • Many more functional ncRNA
  • snoRNA, siRNA, piRNA, etc.
  • Uncharacterized

10
Y
  • From a gene's point of view, reshuffling is a
    great restorative
  • The Y, in its solitary state disapproves of such
    laxity. Apart from small parts near each tip
    which line up with a shared section of the X, it
    stands aloof from the great DNA swap. Its genes,
    such as they are, remain in purdah as the
    generations succeed. As a result, each Y is a
    genetic republic, insulated from the outside
    world. Like most closed societies it becomes both
    selfish and wasteful. Every lineage evolves an
    identity of its own which, quite often, collapses
    under the weight of its own inborn weaknesses.
  • Celibacy has ruined man's chromosome.
  • Steve Jones, Y: The Descent of Men, 2002.

11
DAZ locus on Y Chromosome
12
Optical Mapping
  1. Capture and immobilize whole genomes as massive
    collections of single DNA molecules

Cells gently lysed to extract genomic DNA.
DNA captured in parallel arrays of long single
DNA molecules using a microfluidic device.
Genomic DNA, captured as single DNA molecules
produced by random breakage of intact chromosomes.
13
?
  2. Interrogate with restriction endonucleases
  3. Maintain order of restriction fragments in each
    molecule

Digestion reveals 6-nucleotide cleavage sites as
gaps
14
??
  • Single molecule maps are aligned to
    sequence-based in silico maps to validate an
    assembly
  • The Bayesian likelihood function leads to a
    long-range haplotypic score function.
  • Later we will see how to perform a shotgun map
    assembly
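
A minimal sketch of how a sequence-based in silico map can be produced: find every occurrence of a recognition site and report the ordered fragment lengths. The EcoRI site GAATTC (cut after the G) is used purely as an illustration of a 6-nucleotide site; the function name and test sequence are hypothetical, and only one strand is scanned.

```python
def in_silico_map(sequence, site, cut_offset=0):
    """Return the ordered restriction fragment lengths of a linear sequence.

    cut_offset is the cut position relative to the start of the site.
    """
    cuts = []
    start = sequence.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = sequence.find(site, start + 1)
    bounds = [0] + cuts + [len(sequence)]
    return [right - left for left, right in zip(bounds, bounds[1:])]


toy = "AAAA" + "GAATTC" + "CCCCC" + "GAATTC" + "TT"
print(in_silico_map(toy, "GAATTC", cut_offset=1))  # [5, 11, 7]
```

An ordered list of fragment lengths like this is what a single-molecule map would be aligned against, after allowing for sizing error, missed cuts, and false cuts.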

15
???
16
????
  • Sizing Error
  • (Bernoulli labeling, absorption cross-section,
    PSF)
  • Partial Digestion
  • False Optical Sites
  • Orientation
  • Spurious molecules, Optical chimerism, Calibration

Image of a restriction-enzyme-digested YAC clone:
YAC clone 6H3, derived from human chromosome 11,
digested with the restriction endonucleases Eag I
and Mlu I, stained with a fluorochrome and imaged
by fluorescence microscopy.
17
?????
Various combinations of error sources lead to
NP-hard Problems
18
Pestered with duplications
  • Complex Genome Structures
  • Segmental Duplications
  • Many types of Polymorphisms
  • (SNPs, CNVs, SVs, etc.)
  • Models of Genome Dynamics
  • GOD (Genome Organizing Devices)
  • Models of Coalescence

19
Segmental Duplications
  • Segmental duplications have been found to be
    associated with genomic disorders.
  • Deletions: Williams-Beuren syndrome
  • Duplications: Charcot-Marie-Tooth disease type 1A
  • Inversions: Haemophilia A
  • Translocations: Derivative 22 der(22) syndrome.
  • Segmental duplications may be related to cancer
    development by causing copy number fluctuations
  • Duplication of myc in lung cancer, and ERBB2 in
    breast cancer.

20
Recent Segmental Duplications
Human
  • 3.5-5% of the human genome is found to contain
    segmental duplications, with length > 5 kb or 1 kb,
    identity > 90%.
  • August, 2001 assembly,
  • Bailey, et al. 2002.
  • April, 2003 assembly,
  • Cheung, et al. 2003.
  • These duplications are estimated to have emerged
    about 40Mya under neutral assumption.
  • The duplications are mostly interspersed
    (non-tandem), and happen both inter- and
    intra-chromosomally.

From Bailey, et al. 2002
21
Recent Segmental Duplications
Mouse
  • 1.2% of the mouse genome is found to contain
    segmental duplications, with length > 5 kb,
    identity > 90%.
  • February, 2003 mouse assembly,
  • Cheung, et al. 2003.
  • These duplications are estimated to have emerged
    about 25Mya under neutral assumption.
  • The duplications happen both inter- and
    intra-chromosomally.

From Cheung, et al. 2003
22
Duplication Flanking Sequences
  • What are the molecular mechanisms that caused the
    recent segmental duplications in the human and
    mouse genomes?
  • Thermodynamic instability in the DNA sequences
  • Recombination between homologous repeat elements
  • Other unknown mechanisms.

23
Thermodynamics
(Figure: Control vs. Data profiles around the 5'-breakpoint
and the 3'-breakpoint, from -512bp to 512bp relative to the
duplicated region)
24
?
25
??
  • The duplication flanking regions contain more
    repeats than the control sequences: SINE, LINE,
    MaLR and HERV.
  • Only the repeats from the younger subfamilies
    (with lower sequence divergence levels) are
    over-represented in the duplication flanking
    sequences.

26
The Model
27
The Mathematical Model
Time after duplication
(Figure: transition diagram of the model across the divergence
intervals 0 ≤ δ < ε, ε ≤ δ < 2ε, ..., (k-1)ε ≤ δ < kε.
State labels in the figure: h0--, h0-, h0, H0, h1--, h1, H1,
f--, f-, f. Transition labels in the figure: α, β/2, 2?,
1-α-2β, 1-α-β/2-?, 1-α-2?.)
  • h1: proportion of duplications by repeat
    recombination
  • h1: proportion of duplications by recombination
    of the specific repeat
  • h1--: proportion of duplications by recombination
    of other repeats
  • h0: proportion of duplications by other
    repeat-unrelated mechanism
  • h0: proportion of h0 with common specific repeat
    in the flanking regions
  • h0-: proportion of h0 with no common specific
    repeat in the flanking regions
  • h0--: proportion of h0 with no specific repeat
    in the flanking regions
  • α: mutation rate in duplicated sequences
  • β: insertion rate of the specific repeat
  • ?: mutation rate in the specific repeat
  • δ: divergence level of duplications
  • ε: divergence interval of duplications.
28
Model Fitting
29
Mer Frequencies
30
Copy Number Variation Data
31
CNVs in Unique regions
OR
32
CNVs in Unique regions
                     Yoruba  Japanese  Chinese  Ceph
No polymorphism         810       817      817   799
Amplifications only      43        43       46    55
Deletions only           46        37       36    44
Mixed                     1       321        1   211
33
CNVs in SD regions
AND
34
CNV in SD regions
                     Yoruba  Japanese  Chinese  Ceph
No polymorphism         786       794      785   741
Amplifications only     124       135      141   129
Deletions only          101        86      101   141
Mixed                    43        40       27    44
Unique and SD regions show completely different
behavior of CNVs!
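
A small sketch that makes the contrast concrete using the counts in the two tables. For each population it computes the ratio of regions with amplifications only or deletions only to regions with no polymorphism. This ratio is an illustrative summary chosen here, not a statistic from the slides, and the "Mixed" row is left out.

```python
# (no polymorphism, amplifications only, deletions only) per population,
# copied from the tables for unique and SD regions.
UNIQUE = {
    "Yoruba": (810, 43, 46),
    "Japanese": (817, 43, 37),
    "Chinese": (817, 46, 36),
    "Ceph": (799, 55, 44),
}
SD = {
    "Yoruba": (786, 124, 101),
    "Japanese": (794, 135, 86),
    "Chinese": (785, 141, 101),
    "Ceph": (741, 129, 141),
}


def cnv_ratio(counts):
    """Regions with a single-type CNV per region with no polymorphism."""
    none, amplified, deleted = counts
    return (amplified + deleted) / none


for population in UNIQUE:
    print(population,
          round(cnv_ratio(UNIQUE[population]), 3),
          round(cnv_ratio(SD[population]), 3))
```

In every population the ratio in SD regions is more than twice the ratio in unique regions.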
35
Distance-dependent recombination
  • The chance of recombination depends on the
    distance between Allele A and its copy

36
Simulation (probabilistic model)
37
Simulation (probabilistic model)
38
Observations and Conclusions
  • Mutation rate of 0.0001 and recombination rate of
    0.001 in SD regions constitute the best fit to
    observed real life data.
  • Single mutations cannot explain the observed data,
    which can instead be explained by convergence via
    recombination.
  • Evolution-by-Duplication (EBD) appears to play a
    crucial role in evolution and molds the genetic
    circuitry in a rather constrained way, before it
    is subject to selection pressure

39
Feeble Contrivance
  • GWAS (Genome-Wide Association Studies)
  • Common Variants vs. Rare Variants
  • Haplotype Phasing/Linkage Analysis
  • Poor Experiment Design
  • Reference Sequences
  • Genotypic vs. Haplotypic References
  • Weak Technologies

40
Common vs. Rare Disease Variants
  • From Ionita-Laza (2009)
  • There are two disease models
  • CDCV - common disease, common variants
  • CDRV - common disease, rare variants
  • The current genome-wide association studies only
    consider common variants (frequency at least 5%).
  • Feasible with available resources
  • The common loci identified so far have small
    effects (ORs 1.1-1.5) and only explain a small
    percentage of the estimated heritability.
  • Rare susceptibility variants are expected to play
    an important role
  • population genetics theory (Pritchard, 2001)
  • empirical evidence (BMI, blood pressure, autism,
    Mendelian diseases etc.)

41
Effect Size Distribution
42
Capture-Recapture Model
  • Suppose we have sequence data on Nind individuals
    in a genomic region.
  • An individual shows variation at a position if
    the corresponding allele is different from the
    ancestral one.
  • A position is variable or is a variant if there
    is at least one individual in the dataset with a
    variation at that position.
  • Let x_s be the number of individuals with
    variation at position s, x_s > 0.
  • What is N, the total, unknown number of variants
    in the region?

43
One can estimate the following
  • ?(t): NEW variants expected to be found in a
    FUTURE dataset of size t · Nind.
  • t is a multiplier of initial dataset size, Nind.
  • ?f(t): new variants with frequency at least f
    . . .
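
A rough sketch of the idea, using the classical Good-Toulmin (Efron-Thisted) unseen-species estimator rather than the estimator of Ionita-Laza et al., which the slides do not spell out. Variants play the role of species; n_k is the number of variants carried by exactly k individuals. The alternating series is only well behaved for t ≤ 1, and the function name is hypothetical.

```python
from collections import Counter


def expected_new_variants(carrier_counts, t):
    """Good-Toulmin estimate of new variants in t * Nind further individuals.

    carrier_counts -- one entry per variable position: the number of
                      individuals showing variation there (x_s > 0)
    t              -- multiplier of the initial dataset size, 0 <= t <= 1
    """
    if not 0 <= t <= 1:
        raise ValueError("this simple estimator is only used for 0 <= t <= 1")
    n_k = Counter(x for x in carrier_counts if x > 0)
    # sum over k of (-1)^(k+1) * t^k * n_k
    return sum((-1) ** (k + 1) * t ** k * n for k, n in n_k.items())


# Toy data: three singletons, one doubleton, one variant seen three times.
print(expected_new_variants([1, 1, 2, 3, 1], t=0.5))  # 1.375
```

Singletons dominate the estimate: the more variants seen in only one individual, the more new variants a further sample is expected to reveal.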

44
ENCODE dataset
  • Ten 500Kb genomic regions were sequenced in
    several unrelated DNA samples
  • 8 Yoruba (YRI)
  • 16 CEPH European (CEPH)
  • 7 Han Chinese (CHB)
  • 8 Japanese (JPT)
  • To make results comparable across the four
    populations (YRI, CEPH, CHB and JPT), they
    considered only 7 of the sequenced individuals
    for each dataset.

45
ENCODE - ?f(t)
  • From Ionita-Laza et al. 2009

46
How to Make a Better Human?
  • Debugging a human better
  • Sequencing a genome
  • Sequencing a population

47
S M A S H
  • Single
  • Molecule
  • Approach to
  • Sequencing-by-
  • Hybridization

48
SMASH
  • Sequence a human-size genome of about 6
    Gb, including both haplotypes.
  • Integrate
  • Optical Mapping (Ordered Restriction Maps)
  • Hybridization (of short nucleobase probes, PNA
    or LNA oligomers, with dsDNA on a surface), and
  • Positional Sequencing by Hybridization (efficient
    polynomial time algorithms to solve localized
    versions of the PSBH problems)

49
?
  • Genomic DNA is carefully extracted

50
??
  • LNA probes of length 6-8 nucleotides are
    hybridized to dsDNA (double-stranded genomic DNA)
  • The modified DNA is stretched on a 1 x 1 chip.

51
???
  • DNA adheres to the surface along the channels and
    stretches out.
  • Sizes range from 0.3 to 3 million base pairs in length.
  • Bright emitters are attached to the probes and
    imaged (Fig 3).

52
????
  • A restriction enzyme breaks the DNA at specific sites.
  • The cut fragments of DNA relax like entropic
    springs, leaving small visible gaps

53
?????
  • The DNA is then stained with a fluorogen (Fig 5)
    and reimaged.
  • The two images are combined in a composite image
    suggesting the locations of a specific short word
    (e.g., probes) within the context of a pattern of
    restriction sites.

54
??????
  • The integrated intensity measures the length of
    the DNA fragments.
  • The bright emitters on probes provide a profile
    for locations of the probes.

The restriction sites are represented by a tall
rectangle, the probe sites by small circles.
55
???????
  • These steps are repeated for all possible probe
    compositions (modulo reverse complementarity).
  • Software assembles the haplotypic ordered
    restriction maps with approximate probe locations
    superimposed on the map.

56
SMASH
  • Local clusters of overlapping words are combined
    by our PSBH (positional sequencing by
    hybridization) algorithm

57
Probe Map (lambda DNA)
58
Final Probe Map
  • Consensus map with 2 probe locations
  • 14.8% and 52.4% of the DNA length.
  • In close agreement with the correct map
  • 50.2% and 85.7% (known from the sequence)
  • Implied probe hybridization rate 42%.
  • Significantly better than the needed 30%

59
Four AFM images of lambda DNA with PNA probes
A
60
Combinatorial Structure
61
Discretization
62
Prediction
The probability of successfully computing the
correct restriction map as a function of the
number of cuts in the map and number of molecules
used in creating the map
63
Gentig Bayesian Approach
64
Bayesian Model
65
Robustness
  • BAC Clones with 6-cutters
  • Average Clone Size 160 Kb, Average Fragment
    Size 4 Kb, Average Number of Cutsites 40.
  • Parameters
  • Digestion rate can be as low as 10%
  • Orientation of DNA need not be known.
  • 40% foreign DNA
  • 85% DNA partially broken
  • Relative sizing error up to 30%
  • 30% spurious randomly located cuts

66
Single Molecule Haplotyping: Candida albicans
  • The left end of chromosome 1 of the common fungus
    Candida albicans (being sequenced by Stanford).
  • Three polymorphisms
  • (A) Fragment 2 is of size 41.19kb (top) vs
    38.73kb (bottom).
  • (B) The 3rd fragment of size 7.76kb is missing
    from the top haplotype.
  • (C) The large fragment in the middle is of size
    61.78kb vs 59.66kb.

67
Problem to Solve
  • Given probe maps of some small region of the
    genome for all N-bp hybridization probes (e.g.
    all 2080 probes of 6 bp),
  • with known error rates (false positives, false
    negatives and sizing errors),
  • can we reconstruct the complete sequence?
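
The figure of 2080 probes can be checked by counting 6-mers modulo reverse complementarity, since a probe and its reverse complement hybridize to the same double-stranded site. A short sketch, with illustrative names:

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word):
    return word.translate(COMPLEMENT)[::-1]


def canonical_probes(k):
    """All k-mers, keeping one representative per reverse-complement pair."""
    words = ("".join(bases) for bases in product("ACGT", repeat=k))
    return {min(word, reverse_complement(word)) for word in words}


print(len(canonical_probes(6)))  # 2080
```

Of the 4^6 = 4096 six-mers, 64 are their own reverse complement, so the count is (4096 - 64) / 2 + 64 = 2080.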

68
Basic reconstruction algorithm
  • Keep track of multiple sequence assemblies.
  • Initialize with all possible 5-bp sequences.
  • Try all 4 possible extensions of each sequence.
  • Check if the probe is present in the corresponding
    map; if not, add a penalty score to the sequence
    involved.
  • Periodically delete sequences with high penalty.
  • Stop when the missing probe rate jumps significantly
    from the False Negative rate (2%) to (100% - false
    extension rate), 55%.
  • Return highest scoring sequence.
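
A simplified sketch of the extension loop described in this list. It makes several assumptions that are not in the slides: probes are treated as a plain set of 6-mers read from one strand (no positions, no reverse complements), the target length is known in advance instead of using the missing-probe-rate stopping rule, and pruning keeps a fixed number of candidates after every step. All names are hypothetical.

```python
from itertools import product


def reconstruct(probes, k, length, beam=200):
    """Extend candidate sequences base by base, penalizing missing probes.

    probes -- set of k-mers reported as present
    Returns (best_sequence, penalty).
    """
    # Initialize with all possible (k-1)-bp sequences, each with zero penalty.
    candidates = [("".join(p), 0) for p in product("ACGT", repeat=k - 1)]
    for _ in range(length - (k - 1)):
        extended = []
        for sequence, penalty in candidates:
            for base in "ACGT":  # try all 4 possible extensions
                grown = sequence + base
                missing = 0 if grown[-k:] in probes else 1
                extended.append((grown, penalty + missing))
        # Delete the sequences with the highest penalties.
        extended.sort(key=lambda candidate: candidate[1])
        candidates = extended[:beam]
    return candidates[0]


target = "ACGTTGCAAGGCTA"
spectrum = {target[i:i + 6] for i in range(len(target) - 5)}
print(reconstruct(spectrum, k=6, length=len(target)))
```

With an error-free spectrum and no repeated 5-mers in the target, exactly one candidate reaches the full length with zero penalty; with false negatives the penalties act as a score rather than a hard filter.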

69
Anomalies
  • Irresolvable Ambiguities
  • From assemblies based on 6bp probes
  • Error Pattern: s w sRC
  • Correct Pattern: s wRC sRC
  • s = tcgcc (any 5 bases)
  • sRC = ggcga (reverse complement of s)
  • w = CCCCTAAC (any short sequence under 50bp)
  • wRC = GTTAGGGG (reverse complement of w)

Assembly: tcgccCCCCTAACggcga
Correct:  tcgccGTTAGGGGggcga
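
A short check of why this ambiguity cannot be resolved from 6 bp probes alone, using the example sequences from this slide. The erroneous pattern s w sRC is exactly the reverse complement of the correct pattern s wRC sRC, so both contain the same probes once a probe and its reverse complement are treated as one. Function names are illustrative.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word):
    return word.translate(COMPLEMENT)[::-1]


def canonical_spectrum(sequence, k=6):
    """Set of k-mers in the sequence, modulo reverse complementarity."""
    sequence = sequence.upper()
    words = (sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return {min(word, reverse_complement(word)) for word in words}


assembly = "tcgccCCCCTAACggcga"
correct = "tcgccGTTAGGGGggcga"

print(reverse_complement(assembly.upper()) == correct.upper())         # True
print(canonical_spectrum(assembly) == canonical_spectrum(correct))     # True
```

Inside a longer assembly the flanking s and sRC are shared by both versions, so only the orientation of w differs, and the probe content cannot distinguish the two.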
70
Directed Eulerian Graph
71
?
  • Mixing solid bases with wild-card bases
  • E.g., xx-x-x-xx (9-mers) or xxx--x--x--xxx
    (14-mers)
  • An inert base
  • Universal in terms of its ability to form base
    pairs with the other natural DNA/RNA bases.
  • Examples
  • The naturally occurring base hypoxanthine, as its
    ribo- or 2'-deoxyribonucleoside;
    2'-deoxyisoinosine; 7-deaza-2'-deoxyinosine;
    2-aza-2'-deoxyinosine
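
A minimal sketch of what a gapped probe "reads" from a sequence: solid positions (x) report the base, wild-card positions (-) pair with anything and report nothing. The function name is hypothetical and only one strand is considered.

```python
def gapped_spectrum(sequence, pattern):
    """List the gapped words seen at each offset of the sequence.

    pattern uses 'x' for a solid base and '-' for a wild-card base.
    """
    width = len(pattern)
    words = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        word = "".join(base if mark == "x" else "-"
                       for base, mark in zip(window, pattern))
        words.append(word)
    return words


print(gapped_spectrum("ACGTACGTAC", "xx-x-x-xx"))
# ['AC-T-C-TA', 'CG-A-G-AC']
```

The 9-mer pattern carries 6 solid bases and the 14-mer pattern carries 8, so gapped probes span a longer stretch of sequence with the same number of informative positions as ungapped 6-mers or 8-mers.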

72
Simulation Results
UNGAPPED
GAPPED
73
1000 Rupees Genome
US$22.67 for 6 billion bases; US$135 billion
for the entire human population
74
Who we are
  • Population
  • David Albers (Columbia)
  • Eric Aslakson (NYU)
  • Mickey Atwal (CSHL)
  • Ivan Iossifov (CSHL)
  • Hossein Khiabanian (Columbia)
  • Samantha Kleinberg (NYU)
  • Partha Mitra (CSHL)
  • Michaela Oswald (CSHL)
  • Raul Rabadan (Columbia)
  • Vladimir Trifonov (Columbia)
  • Daniel Valente (CSHL)
  • Chris Wiggins (Columbia)
  • Polymorphisms
  • Iuliana Ionita-Laza (Harvard)
  • Antonina Mitrofanova (NYU)
  • Joey Zhao (Princeton)
  • SMASH
  • TS Anantharaman (OpGen)
  • Charles Cantor (Sequenom)
  • Vladimir Demidov (BU)
  • Pierre Franquin (NYU)
  • Alex Lim (Ex-NYU)
  • Toto Paxia (Ex-NYU)
  • Jason Reed (UCLA)
  • Andrew Sundstrom (NYU)
  • SUTTA
  • Giuseppe Narzisi (NYU)
  • Alessio Narzisi (NYU/Catania)

75
Lord Jeffrey
  • Beware prejudices.
  • They are like rats, and men's minds are like
    traps; prejudices get in easily, but it is
    doubtful if they ever get out.