Title: Cryptic Variation in the Human Mutation Rate
1Cryptic Variation in the Human Mutation Rate
Alan Hodgkinson, Adam Eyre-Walker, Manolis Ladoukakis
2Variation in the mutation rate
- Between different chromosomes
- Between regions on chromosomes
- Neighbouring nucleotides
3Simple context effects
Hwang and Green (2004) PNAS 101: 13994-14001
6Cryptic Variation
- Remote context
- AGTCGGTTACCGTGACGTTGAACGTGT
- Degenerate context
- AGTCGGTTACCGTGYSRGYGAACGTGT
- No context / Complex context
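A degenerate context can be made concrete by reading Y, S and R as standard IUPAC ambiguity codes. The sketch below is illustrative only: the helper name `degenerate_to_regex` and the candidate sequences are hypothetical, and the motif is just the degenerate core of the example sequence shown above.

```python
import re

# Standard IUPAC nucleotide ambiguity codes
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def degenerate_to_regex(motif):
    """Compile an IUPAC-coded motif into a regular expression."""
    return re.compile("".join(IUPAC[base] for base in motif))

# Degenerate core of the example context (YSRGY)
motif = degenerate_to_regex("YSRGY")

# Hypothetical candidate sequences, for illustration only
candidates = ["CCAGT", "TGGGC", "ACGTT"]
matching = [s for s in candidates if motif.fullmatch(s)]
print(matching)
```

Many different exact sequences satisfy one degenerate motif, which is why such a context is harder to detect than a fixed neighbouring-nucleotide effect.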
8Our approach to the problem
- Search for SNPs in human sequences that also
have a SNP in the orthologous position in chimp.
Do we see more coincident SNPs than expected by
chance?
13The method
- Extract all human SNPs from dbSNP and construct a BLAST database on a chromosome-by-chromosome basis.
- Extract all chimp SNPs from dbSNP with 50bp either side of the SNP.
- BLAST chimp SNPs against the human database.
- Extract results above a certain level of homology where there is a SNP on both sequences, and reduce to 40bp either side of the central position.
- Repeat the analysis both including and excluding CpG effects.
15Results
- 1.5 million chimp SNPs.
- 310,000 81bp alignments containing a human and a chimp SNP.
- Observe the number of coincident SNPs.
- Calculate the expected number, taking into account the effects of neighbouring nucleotides.
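One way to take neighbouring nucleotides into account is to compute the expectation separately for each triplet context and sum the results. The sketch below assumes that human and chimp SNPs occur independently within a context; the exact model used in the analysis is not given here, and the function name and toy counts are hypothetical.

```python
def expected_coincident(sites, human_snps, chimp_snps):
    """Expected number of coincident SNPs, summed over contexts.

    sites[c]      = number of aligned sites with context c
    human_snps[c] = number of those sites carrying a human SNP
    chimp_snps[c] = number of those sites carrying a chimp SNP
    Assumes independence within each context: E_c = h_c * k_c / n_c.
    """
    total = 0.0
    for context, n in sites.items():
        if n > 0:
            total += human_snps.get(context, 0) * chimp_snps.get(context, 0) / n
    return total

# Toy counts (not real data): one mutable context and one slow context
sites = {"ACG": 1000, "ATA": 2000}
human = {"ACG": 100, "ATA": 20}
chimp = {"ACG": 50, "ATA": 40}

with_context = expected_coincident(sites, human, chimp)
# Ignoring context: pool everything into a single class
without_context = sum(human.values()) * sum(chimp.values()) / sum(sites.values())
print(with_context, without_context)
```

In this toy case the context-aware expectation is larger than the pooled one, which is why ignoring simple context effects would inflate the apparent excess.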
16Results
         Obs     Exp    Ratio
All      11571   6592   1.76 (1.72,1.79)
No-CpG   5028    2533   1.98 (1.93,2.04)
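The ratios in the table are observed divided by expected. How the intervals in parentheses were obtained is not stated; the sketch below assumes the observed count is Poisson distributed and uses a normal approximation, which gives values close to those shown. The function name is hypothetical.

```python
import math

def ratio_with_ci(observed, expected, z=1.96):
    """Observed/expected ratio with an approximate 95% interval.

    Assumption: observed ~ Poisson, so its standard error is sqrt(observed);
    the expected count is treated as known without error.
    """
    ratio = observed / expected
    half_width = z * math.sqrt(observed) / expected
    return ratio, ratio - half_width, ratio + half_width

for label, obs, exp in [("All", 11571, 6592), ("No-CpG", 5028, 2533)]:
    ratio, low, high = ratio_with_ci(obs, exp)
    print(f"{label}: {ratio:.2f} ({low:.2f},{high:.2f})")
```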
17Results
C/T G/A C/A G/T C/G A/T
C/T 1.91 1.04 1.19 1.21 0.96
G/A 1.83 1.24 1.02 1.14 1.40
C/A 1.23 1.08 4.81 1.28 1.39
G/T 1.15 1.38 4.95 1.27 0.77
C/G 1.09 1.14 1.24 1.40 2.79
A/T 0.94 1.06 1.79 0.99 15.43
18Alternative Explanations
- Bias in the Method
- Selection
- Ancestral Polymorphism
- Paralogous SNPs
20Methodological Bias
- Simulated data with the same density of human and chimp SNPs as dbSNP under different divergence and mutation patterns.
- Method worked well under realistic conditions.
21Methodological Bias
All sites (HG)
Div   Obs    Exp    Ratio   95% CI
0     839    812    1.033   (0.963,1.103)
1     2419   2316   1.040   (1.003,1.086)
2     681    685    0.995   (0.920,1.069)
Non-CpG sites (HG)
Div   Obs    Exp    Ratio   95% CI
0     401    428    0.936   (0.844,1.028)
1     1182   1228   0.963   (0.908,1.018)
2     374    400    0.935   (0.840,1.030)
25Selection
- Areas of low SNP density result in clustering of SNPs in the remaining regions.
[Figure: SNP positions in Human and Chimp]
- Result: an apparent excess of coincident SNPs.
26Selection
29Ancestral Polymorphism
- SNP inherited from the common ancestor of chimp and human.
- Result: an increase in coincident SNPs.
30Ancestral Polymorphism
- Under ancestral polymorphism, expect the observed/expected ratio to be the same for all SNP types.
C/T G/A C/A G/T C/G A/T
C/T 1.91 1.04 1.19 1.21 0.96
G/A 1.83 1.24 1.02 1.14 1.40
C/A 1.23 1.08 4.81 1.28 1.39
G/T 1.15 1.38 4.95 1.27 0.77
C/G 1.09 1.14 1.24 1.40 2.79
A/T 0.94 1.06 1.79 0.99 15.43
32Ancestral Polymorphism
- Repeated initial analysis with macaque data.
- Humans and macaques split 23-24 million years ago, so we expect there to be no shared polymorphisms.
         Obs   Exp   Ratio
All      77    47    1.64 (1.27,2.00)
No-CpG   34    23    1.51 (1.001,2.02)
36Paralogous SNPs
- Excess of coincident SNPs may be a consequence of artifactual SNPs called as a result of substitutions in paralogous regions.
- Musumeci et al (2010): 8.32% of human variation in dbSNP may be due to paralogy.
AGCTGCACGT (T/A) CGGCATCCAA   SNP
AGCTGCACGT  T    CGGCATCCAA   Chromosome 1
AGCTGCACGT  A    CGGCATCCAA   Chromosome 7
Artifactual SNP
38Paralogous SNPs
AGCTGCACGT (T/A) CGGCATCCAA   AGCTGCACGT T CGGCATCCAA
AGCTGCACGT (T/A) CGGCATCCAA   AGCTGCACGT T CGGCATCCAA   AGCTGCACGT A CGGCATCCAA
3.6% of coincident SNPs are possibly a consequence of paralogous sequences
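The idea in the diagram above, that a SNP is suspect when each of its alleles matches a different location in the reference genome, could be checked roughly as follows. This is a sketch under that assumption only; the actual screening procedure is not described here, and the function names and toy genome are hypothetical.

```python
def allele_hits(genome, left, right, alleles):
    """For each allele, list every (chromosome, start) where
    left + allele + right occurs exactly in the reference."""
    hits = {}
    for allele in alleles:
        query = left + allele + right
        found = []
        for chrom, seq in genome.items():
            start = seq.find(query)
            while start != -1:
                found.append((chrom, start))
                start = seq.find(query, start + 1)
        hits[allele] = found
    return hits

def possibly_paralogous(hits):
    """Flag the SNP if every allele has its own exact match in the reference."""
    return all(len(locations) > 0 for locations in hits.values())

# Toy reference using the flanking sequences from the diagram
left, right = "AGCTGCACGT", "CGGCATCCAA"
genome = {
    "chr1": "TT" + left + "T" + right + "GG",
    "chr7": "CC" + left + "A" + right + "TT",
}
hits = allele_hits(genome, left, right, ["T", "A"])
print(hits, possibly_paralogous(hits))
```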
39Alternative Explanations
- Bias in the method
- Selection
- Ancestral Polymorphism
- Paralogous SNPs
Cryptic variation in the mutation rate
40Context Analysis
- 4517 sequences containing non-CpG coincident SNPs flanked by 200bp.
- Tabulate triplet frequencies at each position in the surrounding sequences.
- Test whether the proportions of triplets we observe at each position are significantly different from the proportions in the sequences as a whole.
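The tabulation and test above could be sketched like this. It is a minimal illustration that assumes all sequences have the same length and uses a plain chi-square statistic against the pooled triplet proportions; the test actually used is not specified, and the function names are hypothetical.

```python
from collections import Counter

def triplet_counts_by_position(seqs):
    """counts[i] holds the triplets starting at offset i across all sequences."""
    length = len(seqs[0])
    counts = [Counter() for _ in range(length - 2)]
    for seq in seqs:
        for i in range(length - 2):
            counts[i][seq[i:i + 3]] += 1
    return counts

def chi_square_vs_overall(counts):
    """Chi-square statistic at each position against the pooled proportions."""
    overall = Counter()
    for position_counts in counts:
        overall.update(position_counts)
    grand_total = sum(overall.values())
    stats = []
    for position_counts in counts:
        n = sum(position_counts.values())
        stat = 0.0
        for triplet, total in overall.items():
            expected = n * total / grand_total
            stat += (position_counts.get(triplet, 0) - expected) ** 2 / expected
        stats.append(stat)
    return stats

# Toy sequences (not real data)
seqs = ["ACGTA", "ACGTT", "ACGAA"]
counts = triplet_counts_by_position(seqs)
stats = chi_square_vs_overall(counts)
print(counts[0], [round(s, 2) for s in stats])
```

A position with a strong local context would show a large statistic relative to the others; a flat profile corresponds to no obvious context.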
42Context Analysis
- Coincident SNP in central position
No obvious context surrounding coincident SNPs
45Genomic Distribution
- Tallied the number of coincident SNPs per Mb.
- 3.91 coincident SNPs per Mb.
- 1.68 non-CpG coincident SNPs per Mb.
- If randomly distributed, expect a Poisson distribution with μ = σ² = 3.91.
- Observed σ² = 13.27 (p < 0.001), so sampling variance explains approximately 30% of total variance.
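The 30% figure follows from the Poisson property that the variance equals the mean: the sampling variance is the mean count per window, and its share is that mean divided by the observed variance. A small sketch, with a hypothetical helper for computing the same quantity from raw per-window counts:

```python
from statistics import mean, variance

def sampling_variance_fraction(counts):
    """Share of the observed variance explained by Poisson sampling alone.
    Under a Poisson model the sampling variance equals the mean."""
    return mean(counts) / variance(counts)

# Values reported above: mean 3.91 per window, observed variance 13.27
reported_fraction = 3.91 / 13.27
print(round(reported_fraction, 3))

# Toy per-window counts (not real data) showing over-dispersion
toy_counts = [0, 1, 2, 3, 10, 12, 1, 2]
print(round(sampling_variance_fraction(toy_counts), 3))
```

A fraction well below 1 means the counts are over-dispersed relative to a random distribution along the genome.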
46Genomic Distribution
Feature                   r        r²       p
SNP density               0.256    0.0655   <0.001
Distance to Telomere      -0.022   0.0004   0.226
Distance to Centromere    0.011    0.0001   0.565
Recombination Rate        0.107    0.0114   <0.001
Nucleosome Association    0.004    0.0000   0.832
Gene Density              -0.022   0.0004   0.230
GC content                -0.006   0.0000   0.741
49Genomic Distribution
- SNP densities must drive coincident SNP densities to a certain extent, as approximately half of coincident SNPs are created by chance alone.
- Recombination rate is positively correlated with SNP density (r = 0.242, p < 0.001).
- Partial correlation controlling for SNP density: r = 0.048, p = 0.011.
- SNP densities explain 6.5% of the variance, recombination rate explains 0.2% of the variance of coincident SNPs.
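The partial correlation can be recovered from the three pairwise correlations with the standard first-order formula. The sketch below plugs in the reported values: 0.107 (coincident SNPs vs recombination rate), 0.256 (coincident SNPs vs SNP density) and 0.242 (recombination rate vs SNP density). The function name is hypothetical.

```python
import math

def partial_correlation(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y, controlling for z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# x = coincident SNP density, y = recombination rate, z = SNP density
r_partial = partial_correlation(r_xy=0.107, r_xz=0.256, r_yz=0.242)
print(round(r_partial, 3), round(r_partial ** 2, 4))
```

Squaring the partial correlation gives the roughly 0.2% of variance attributed to recombination rate, and squaring 0.256 gives the 6.5% attributed to SNP density.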
50Genomic Distribution
Feature                   r        r²       p
Coincident SNP Density    0.256    0.0655   <0.001
Distance to Telomere      -0.171   0.0292   <0.001
Distance to Centromere    -0.047   0.0022   0.012
Recombination Rate        0.234    0.0548   <0.001
Nucleosome Association    0.187    0.0350   <0.001
Gene Density              0.064    0.0041   0.001
GC content                0.184    0.0339   <0.001
52Quantification
- Use a log-normal distribution of relative mutation rates due to cryptic variation.
- Model the number of coincident SNPs under the effects of cryptic variation.
- Incorporate effects of divergence.
What level of variation in the log-normal distribution explains our results?
53Log-normal model
Fastest 5% of sites mutate 16.4 times faster than slowest 5% of sites.
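To relate a 16.4-fold difference to the spread of a log-normal distribution, one possible reading is the ratio of the 95th to the 5th percentile of relative mutation rates. That reading is an assumption made here for illustration (the comparison could instead be between mean rates within each tail), and the function names are hypothetical.

```python
import math
from statistics import NormalDist

def percentile_ratio(sigma, tail=0.05):
    """Ratio of the (1 - tail) to the tail quantile of a log-normal
    whose underlying normal has standard deviation sigma."""
    z = NormalDist().inv_cdf(1 - tail)
    return math.exp(2 * z * sigma)

def sigma_from_percentile_ratio(ratio, tail=0.05):
    """Invert percentile_ratio: the sigma that yields the given fold difference."""
    z = NormalDist().inv_cdf(1 - tail)
    return math.log(ratio) / (2 * z)

# Assumed reading: 16.4 is the 95th/5th percentile ratio
sigma = sigma_from_percentile_ratio(16.4)
print(round(sigma, 2), round(percentile_ratio(sigma), 1))
```

The mean of the log-normal does not enter, because it cancels in a ratio of quantiles.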
58Summary
- Cryptic variation in the mutation rate.
- No obvious context surrounding coincident SNPs.
- Variation is truly cryptic.
- Genomic distribution of coincident SNPs is over-dispersed.
- Variation in mutation rate is substantial.
59Acknowledgments
Manolis Ladoukakis
Adam Eyre-Walker