Title: Centro Nacional Genotipado
1Centro Nacional Genotipado
- Bioinformatic analysis of gene and genome sequences and expression.
- Human SNPs. Theory and practice.
- Arcadi Navarro
- Madrid, 6 April 2006
2What is CeGen?
- CeGen is a technology platform, an initiative of GENOMA ESPAÑA, that aims to provide the expertise and infrastructure needed to carry out large-scale, low-cost genotyping projects for SNPs (Single Nucleotide Polymorphisms).
- The initiative is aimed at research groups in universities, hospitals, research centres and industry.
- CeGen aims to contribute to a qualitative and quantitative leap in research through high value-added services provided from Spain.
3SNPs: Single Nucleotide Polymorphisms
- Frequent.
- Well distributed.
- Stable.
- Functional?
- Allow large-scale processing.
4What is a SNP (Single Nucleotide Polymorphism)?
A SNP is a position in a genome at which two or
more different bases occur in the population,
each with a frequency > 1%.
GATTTAGATCGCGATAGAG
GATTTAGATCTCGATAGAG
- SNPs are the most common type of genetic variation (genetic markers).
- Most SNPs have only two variants (alleles), e.g. G/T or A/C.
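As a minimal sketch of the definition above, the two example sequences can be compared position by position to locate the variable site. The function name `snp_positions` is an illustrative assumption; real SNP discovery works on population samples (to apply the 1% frequency threshold), not on a single pair of sequences.

```python
def snp_positions(seq_a, seq_b):
    """Return (1-based position, base in seq_a, base in seq_b) for each mismatch."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned and of equal length")
    return [
        (i + 1, x, y)
        for i, (x, y) in enumerate(zip(seq_a, seq_b))
        if x != y
    ]


# The two sequences shown on the slide differ at a single site.
print(snp_positions("GATTTAGATCGCGATAGAG", "GATTTAGATCTCGATAGAG"))
# -> [(11, 'G', 'T')]
```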
5Scientific interest
- There is great demand for analysing large numbers of SNPs, and the available technologies are costly and highly varied.
- CeGen offers, from Spain, large-scale, low-cost genotyping services tailored to each need.
- CeGen offers scientific support and access to technologies for projects that would otherwise require far more time and resources.
- The easier genotyping becomes, the more numerous and more ambitious the projects that can be undertaken.
- E.g. association studies, which require genotyping many individuals (cases, controls)
- Possibility of whole genome scans
- BioBank projects
6Strategic interest
- CeGen can help, at several levels, with applications for research projects that involve genotyping
- Proposing SNP selection strategies
- Offering the large-scale technologies that make the project viable
- Calculating budgets that can be included in the application
7Back to SNPs
- Frequent.
- Well distributed.
- Stable.
- Functional?
- Allow large-scale processing.
8Terminology
An allele is one of a number of alternative forms of the same gene occupying a given locus. If we are considering SNPs, an allele is one of two alternative forms.
A locus is the physical location of an allele on the chromosome.
A haplotype is a set of alleles that tend to be inherited together (not easily separable by recombination).
Example: consider 2 loci, each with two possible alleles, the first locus being either A or a, the second locus being B or b. Then each chromosome of an individual carries one of 4 possible haplotypes: AB, Ab, aB, ab.
9tttctccatttgtcgtgacacctttgttgacaccttcatttctgcattct
caattctatttcactggtctatggcagagaacacaaaatatggccagtgg
cctaaatccagcctactaccttttttttttttttgtaacattttactaac
atagccattcccatgtgtttccatgtgtctgggctgcttttgcactctaa
tggcagagttaagaaattgtagcagagaccacaatgcctcaaatatttac
tctacagccctttataaaaacagtgtgccaactcctgatttatgaactta
tcattatgtcaataccatactgtctttattactgtagttttataagtcat
gacatcagataatgtaaatcctccaactttgtttttaatcaaaagtgttt
tggccatcctagatatactttgtattgccacataaatttgaagatcagcc
tgtcagtgtctacaaaatagcatgctaggattttgatagggattgtgtag
aatctatagattaattagaggagaatgactatcttgacaatactgctgcc
cctctgtattcgtgggggattggttccacaacaacacccaccccccactc
ggcaacccctgaaacccccacatcccccagcttttttcccctgctaccaa
aatccatggatgctcaagtccatataaaatgccatactatttgcatataa
cctctgcaatcctcccctatagtttagatcatctctagattacttataat
actaataaaatctaaatgctatgtaaatagttgctatactgtgttgaggg
ttttttgttttgttttgttttatttgtttgtttgtttgtattttaagaga
tggtgtcttgctttgttgcccaggctggagtgcagtggtgagatcatagc
ttactgcagcctcaaactcctggactcaaacagtcctcccacctcagcct
cccaaagtgctgggatacaggtgtgacccactgtgcccagttattatttt
ttatttgtattattttactgttgtattatttttaattattttttctgaat
attttccatctatagttggttgaatcatggatgtggaacaggcaaatatg
gagggctaactgtattgcatcttccagttcatgagtatgcagtctctctg
tttatttaaagttttagtttttctcaaccatgtttacttttcagtataca
agactttgacgttttttgttaaatgtatttgtaagtattttattatttgt
gatgttatttaaaaagaaattgttgactgggcacagtggctcacgcctgt
aatcccagcactttgggaggctgaggcgggcagatcacgaggtcaggaga
tcaagaccatcctggctaacatggtaaaaccccgtctctactaaaaatag
aaaaaaattagccaggcgtggtggcgagtgcctgtagtcccagctactcg
ggaggctgaggcaggagaatggtgtgaacctgggaggcggagcttgcagt
gagctgagatcgtgccactgcattccagcctgcgtgacagagcgagactc
tgtcaaaaaaataaataaaatttaaaaaaagaagaagaaattattttctt
aatttcattttcaggttttttatttatttctactatatggatacatgatt
gatttttgtatattgatcatgtatcctgcaaactagctaacatagtttat
tatttctctttttttgtggattttaaaggattttctacatagataaataa
acacacataaacagttttacttctttcttttcaacctagactggatgcat
tttttgtttttgtttgtttgtttgctttttaacttgctgcagtgactaga
gaatgtattgaagaatatattgttgaacaaaagcagtgagagtggacatc
cctgctttccccctgattttagggggaatgttttcagtctttcactattt
aatatgattttagctataggtttatcctagatccctgttatcatgttgag
gaaattcccttctatttctagtttgttgagattttttaattcatgtgatt
gcgctatctggctttgctctca
12http://www.ncbi.nlm.nih.gov/About/primer/snps.html
13What are SNPs useful for? (I)
- Genotyping in humans
- Search for disease susceptibility genes
- Diagnosis / prognosis
- Drug metabolism
- Adverse drug reactions (personalised medicine)
- Response to environmental factors
- Forensic genetics
- Genome structure and dynamics
- Genome evolution
14What are SNPs useful for? (II)
- Genotyping in other species
- Microbial identification
- Analysis of microbial communities
- Genotyping in yeast
- Use similar to that in humans for mouse and other model organisms
- Widely used in domesticated species (QTL), both animal and plant
- Identification of plant varieties
- Their use will expand with new SNP maps
15SNPs in the Human Genome
- All humans share 99.9% of the same genetic sequence
- SNPs occur about every 1000 base pairs
- 90% of human genome variation comes from SNPs
- The human genome contains 10 million validated SNPs and 21 million submitted.
- 340,000 SNPs are found in genes
- SNPs are not evenly spaced along the sequence
- SNP-rich regions
- SNP-poor regions
16What is a Haplotype?
- A haplotype is a sequence of alleles stretching along an extended segment of DNA: a sort of super allele!
- Haplotypes are usually inherited as a single unit from parents.
Genotype Aa Bb can arise from two different haplotype pairs: AB / ab or Ab / aB
17Alleles of Adjacent SNPs on a Chromosome form
Haplotypes
a. Short stretch of DNA from 4 different people; 3 SNPs are present
b. Haplotypes made up of a combination of different alleles at 20 nearby SNPs
c. Genotyping just 3 tag SNPs can distinguish all 4 haplotypes
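The idea in panel c can be sketched as a brute-force search for the smallest set of SNP columns whose alleles tell every haplotype apart. The haplotype strings below are hypothetical (the figure's actual sequences are not reproduced here), and `minimal_tag_snps` is an illustrative name; practical tag-SNP selection usually relies on LD-based heuristics rather than exhaustive search.

```python
from itertools import combinations


def minimal_tag_snps(haplotypes):
    """Return the smallest tuple of 0-based SNP indices that distinguishes
    all distinct haplotypes (exhaustive search, fine for small examples)."""
    n_snps = len(haplotypes[0])
    n_distinct = len(set(haplotypes))
    for size in range(1, n_snps + 1):
        for columns in combinations(range(n_snps), size):
            signatures = {tuple(h[c] for c in columns) for h in haplotypes}
            if len(signatures) == n_distinct:
                return columns
    return tuple(range(n_snps))


# Hypothetical haplotypes over 4 SNPs: SNP 0 and SNP 1 are redundant
# (perfect LD), and SNP 3 does not vary.
example = ["ACTG", "ACAG", "GTTG", "GTAG"]
print(minimal_tag_snps(example))  # -> (0, 2)
```

Here SNP 1 carries no information beyond SNP 0, which is exactly the redundancy that LD creates and tag SNPs exploit.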
18What is Linkage Disequilibrium?
- Linkage Disequilibrium (LD) is the non-independence (non-randomness) of alleles at different sites (different SNPs for the rest of the session).
- Example
- Suppose that allele A at locus 1 and allele B at locus 2 are at frequencies $p_A$ and $p_B$, respectively, in the population.
- If the two loci are independent, then we would expect to see the AB haplotype at frequency $p_A p_B$.
- If the population frequency of the AB haplotype is either higher or lower than this, implying that particular alleles tend to be observed together, then the two loci are said to be in LD.
19Linkage Equilibrium
E.g. two adjacent SNPs (A and B) are genotyped in a population.
There are 4 possible haplotypes.
Under linkage equilibrium we have what we expect:
$f_{AB} = f_A f_B$, $f_{aB} = f_a f_B$, $f_{Ab} = f_A f_b$, $f_{ab} = f_a f_b$
SNP 2 \ SNP 1    A       a       Total
B                fAB     faB     fB
b                fAb     fab     fb
Total            fA      fa
20Linkage Disequilibrium (LD)
E.g. two adjacent SNPs are genotyped in a population.
There are 4 possible haplotypes.
Under linkage disequilibrium we have different results:
$f_{AB} = f_A f_B + D$, $f_{aB} = f_a f_B - D$, $f_{Ab} = f_A f_b - D$, $f_{ab} = f_a f_b + D$
where D is the LD coefficient:
$D = f_{AB} f_{ab} - f_{aB} f_{Ab}$ or $D = f_{AB} - f_A f_B$
SNP 2 \ SNP 1    A       a       Total
B                fAB     faB     fB
b                fAb     fab     fb
Total            fA      fa
21Linkage Disequilibrium (LD)
Linkage disequilibrium:
SNP 2 \ SNP 1    A       a       Total
B                0.5     0       0.5
b                0       0.5     0.5
Total            0.5     0.5
Linkage equilibrium:
SNP 2 \ SNP 1    A       a       Total
B                0.25    0.25    0.5
b                0.25    0.25    0.5
Total            0.5     0.5
where D is 0.25:
$D = f_{AB} f_{ab} - f_{aB} f_{Ab} = 0.5 \times 0.5 - 0 \times 0 = 0.25$
$D = f_{AB} - f_A f_B = 0.5 - 0.5 \times 0.5 = 0.25$
22Linkage Disequilibrium (LD)
Linkage disequilibrium:
SNP 2 \ SNP 1    A       a       Total
B                0.02    0.58    0.6
b                0.18    0.22    0.4
Total            0.2     0.8
Linkage equilibrium:
SNP 2 \ SNP 1    A       a       Total
B                0.12    0.48    0.6
b                0.08    0.32    0.4
Total            0.2     0.8
where D is -0.10:
$D = f_{AB} f_{ab} - f_{aB} f_{Ab} = 0.02 \times 0.22 - 0.58 \times 0.18 = -0.10$
$D = f_{AB} - f_A f_B = 0.02 - 0.2 \times 0.6 = -0.10$
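The two expressions for D used in these examples (the cross-product of haplotype frequencies, and the observed AB frequency minus its expectation under independence) can be checked against both tables with a short sketch. The function name `ld_coefficient` and its argument order are assumptions made for illustration.

```python
def ld_coefficient(f_AB, f_Ab, f_aB, f_ab):
    """Return D computed two equivalent ways from the four haplotype frequencies."""
    # Cross-product form: D = fAB*fab - faB*fAb
    d_cross = f_AB * f_ab - f_aB * f_Ab
    # Marginal form: D = fAB - fA*fB, with allele frequencies as row/column sums
    f_A = f_AB + f_Ab
    f_B = f_AB + f_aB
    d_marginal = f_AB - f_A * f_B
    return d_cross, d_marginal


# First example: only AB and ab haplotypes are present.
print(ld_coefficient(0.5, 0.0, 0.0, 0.5))
# Second example: fAB = 0.02, fAb = 0.18, faB = 0.58, fab = 0.22.
print(ld_coefficient(0.02, 0.18, 0.58, 0.22))
```

Both forms agree whenever the four frequencies sum to 1, which is why either can be used.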
23Linkage Disequilibrium (LD)
24- Assessing LD
- Measuring it with some parameter (we have just seen D)
- Testing statistically whether it exists: random association or LD?
25The Measure of LD
- The D coefficient depends on the marginal allele frequencies in the contingency table.
- This limits D as a measure of association: its range depends on the data, so values cannot be compared across SNPs or populations.
- However, D can be normalised to D', making it comparable across populations and SNPs.
26The D' measure of LD
- $D' = D / D_{max}$
- $D_{max}$ is the maximum absolute value D can take given the allele frequencies:
- $D_{max} = \min(f_A f_b, f_a f_B)$ when $D > 0$
- $D_{max} = \min(f_A f_B, f_a f_b)$ when $D < 0$
27Example D' Calculation
Hap Freqs: $f_{AB} = 0.765$, $f_{aB} = 0.235$, $f_{Ab} = 0.167$, $f_{ab} = 0.833$
Allele Freqs: $f_A = 0.52$, $f_a = 0.48$, $f_B = 0.59$, $f_b = 0.41$
$D = (0.765 \times 0.833) - (0.167 \times 0.235) = 0.025$
Since D is positive (> 0): $D_{max} = \min(0.52 \times 0.41, 0.48 \times 0.59) = 0.2132$
$D' = D / D_{max} = 0.025 / 0.2132$
Thus, $D' = 0.117$
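The normalisation step can be sketched as follows, taking the slide's D = 0.025 and its allele frequencies as given (the listed haplotype frequencies are not used here). The function name `d_prime` is an assumption, and the sign of D is kept so that D' ranges from -1 to 1.

```python
def d_prime(d, f_A, f_B):
    """Normalise D by the largest magnitude it could take given the allele frequencies."""
    f_a, f_b = 1.0 - f_A, 1.0 - f_B
    if d >= 0:
        d_max = min(f_A * f_b, f_a * f_B)
    else:
        d_max = min(f_A * f_B, f_a * f_b)
    # Guard against a monomorphic SNP, where D and Dmax are both zero.
    return d / d_max if d_max > 0 else 0.0


print(round(d_prime(0.025, 0.52, 0.59), 3))   # -> 0.117
print(round(d_prime(-0.10, 0.2, 0.6), 3))     # -> -0.833
```

The second call reuses the earlier table with D = -0.10, where Dmax = min(0.2 x 0.6, 0.8 x 0.4) = 0.12.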
28Interpretation of D' coefficient
- D' = 1 (perfect positive LD between SNP alleles)
- D' = 0 (linkage equilibrium, or no association between SNP alleles)
- D' = -1 (perfect negative LD between SNP alleles)
- D' = 0.87 (strong positive LD between SNP alleles)
- D' = 0.12 (weak positive LD between SNP alleles)
D' is constrained between -1 and 1.
Significance (P-value) for D' is determined from the Chi-squared distribution
29Interpretation of D' coefficient
D' = 0 means no LD. |D'| = 1 means complete LD. Careful: |D'| can be 1 when 3 haplotypes are present
30LD Plots of Adjacent SNPs
LD varies significantly across genomic regions
31The r² measure of LD
The disequilibrium coefficient r² (sometimes also denoted by $\Delta^2$) represents the statistical correlation between 2 sites. Consider two biallelic loci on the same chromosome, with alleles A and a at the first locus and with alleles B and b at the second locus. The allele frequencies will be written as $p_A$, $p_a$, $p_B$, and $p_b$, and the four haplotype frequencies will be written as $p_{AB}$, $p_{Ab}$, $p_{aB}$, and $p_{ab}$. Then
$r^2 = \frac{(p_{AB} p_{ab} - p_{Ab} p_{aB})^2}{p_A p_a p_B p_b}$
32r² is related to D, of course
[Figure: two diagrams of alleles A/a against B/b, with axes labelled x and y]
33r² vs. D'
Both measures are 1 in the case of complete disequilibrium and 0 if there is no LD. But r² = 1 corresponds to the situation where only 2 haplotypes are present (out of a possible four), while D' is less specific: D' = 1 can reflect either 2 or 3 haplotypes present.
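A small numerical sketch makes the contrast concrete. It uses the standard definitions $D = f_{AB} - f_A f_B$, $D' = D / D_{max}$ and $r^2 = D^2 / (f_A f_a f_B f_b)$; the haplotype frequencies are hypothetical, the function name `ld_measures` is an assumption, and both SNPs are assumed to be polymorphic.

```python
def ld_measures(f_AB, f_Ab, f_aB, f_ab):
    """Return (D', r^2) for two biallelic SNPs from haplotype frequencies."""
    f_A = f_AB + f_Ab
    f_B = f_AB + f_aB
    f_a, f_b = 1.0 - f_A, 1.0 - f_B
    d = f_AB - f_A * f_B
    if d >= 0:
        d_max = min(f_A * f_b, f_a * f_B)
    else:
        d_max = min(f_A * f_B, f_a * f_b)
    d_prime = d / d_max if d_max > 0 else 0.0
    r2 = d * d / (f_A * f_a * f_B * f_b)
    return d_prime, r2


# Two haplotypes present (AB and ab): both measures equal 1.
print(ld_measures(0.5, 0.0, 0.0, 0.5))
# Three haplotypes present (aB is missing): D' is still 1, but r^2 is only 1/3.
print(ld_measures(0.5, 0.25, 0.0, 0.25))
```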
34The representation of LD
This is just Excel, say.
[Plot: Linkage Disequilibrium D' (y-axis, 0 to 1) against Distance Between SNPs in base pairs (x-axis, 5 kb to 160 kb)]
35The representation of LD
36The representation of LD
This is a HelixTree software screenshot: D' = 0 is blue, D' = 1 is red.
37The representation of LD
The Graphic Overview of Linkage Disequilibrium (GOLD) software package: http://www.sph.umich.edu/csg/abecasis/GOLD/
High recombination
High LD
Map
38The representation of LD
High recombination
Haploview
High LD
39Testing for LD
LD tests
- LD for k alleles
- Dm
- Fisher's exact test
- $\chi^2$
41where the quantities are: the degrees of freedom; n, the sample size; t, the observed likelihood ratio (LR); the average LR in the permuted distribution; and the standard deviation of the permuted LR distribution
Zhao et al., Ann. Hum. Genet. 63:167-179, 1999
42FNF: fraction not found. Based on the fact that LD reduces the number of haplotypes
Ke: expected number of haplotypes under linkage equilibrium, given the allele frequencies and the sample size
Ko: observed number of haplotypes
Kmin: minimum possible number of haplotypes
Slatkin, Genetics 154:1367-1378, 2000; Mateu et al., Am. J. Hum. Genet. 68:103-117, 2001
43OK, fine
But, why is there any LD at all in the genome?
44The origins of Linkage Disequilibrium
Variations in Chromosomes within a Population
45So LD is the basis of, for example, association studies (you'll see more about this later...). And we can go even deeper: LD decays with recombination
$D_{t+1} = (1 - r) D_t$
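Iterating the recurrence $D_{t+1} = (1 - r) D_t$ gives $D_t = (1 - r)^t D_0$, where r is the recombination fraction between the two loci per generation. The sketch below illustrates how quickly LD erodes; the function names and the example values (D0 = 0.25, r = 0.01) are assumptions chosen for illustration.

```python
import math


def ld_after(d0, r, generations):
    """D after a number of generations, from D_{t+1} = (1 - r) * D_t."""
    return d0 * (1.0 - r) ** generations


def generations_to_halve(r):
    """Number of generations for D to fall to half its starting value."""
    return math.log(0.5) / math.log(1.0 - r)


# Loci 1 cM apart (r = 0.01): LD halves roughly every 69 generations.
print(round(generations_to_halve(0.01), 1))
for t in (0, 10, 100, 1000):
    print(t, ld_after(0.25, 0.01, t))
```

For tightly linked loci (small r) the decay is slow, which is why LD persists over short physical distances.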
46LD is a function of distance
- Distances
- 1) Physical distance between alleles is measured in base pairs.
- 2) Genetic distance is based on the probability of recombination; the unit is called the Morgan.
- A distance of 1 centiMorgan (cM) between two alleles means that they have a 1% chance of being separated by recombination per meiosis.
- In humans, a genetic distance of 1 cM is roughly equal to a physical distance of 1 million base pairs (1 Mbp).
47Section 1
LD decays with recombination
1,000 gens. ago
Time present
48- (Think Finland: 1000 founders, 2000 years ago, consistent expansion)
- Few (maybe no) recurrences of the disease-causing mutation
- (Think Earth: 10,000 "founders" (Ne), 100,000 years ago)
- Assume old mutations cause common diseases
49Wait a sec, these are the haplotypes, right?
Variations in Chromosomes within a Population
50And remember one can select tag-SNPs
a. Short stretch of DNA from 4 different people; 3 SNPs are present
b. Haplotypes made up of a combination of different alleles at 20 nearby SNPs
c. Genotyping just 3 tag SNPs can distinguish all 4 haplotypes
51Cool!!!
- Nowadays, we can massively genotype individuals. We could potentially cover the whole genome using the property of LD and a few tag-SNPs and... and... and...
- But
- How many SNPs to tag all the genome?
- And... can we easily ascertain individual haplotypes?
52Haplotyping: Phase Problem
Observed: SNP1 G/T, SNP2 A/C
Possible haplotypes: GA, TC or GC, TA
n SNPs → $2^n$ possible haplotypes
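The phase ambiguity can be made explicit by enumerating every haplotype pair compatible with an unphased genotype. This is only a sketch: the representation of a genotype as a list of allele pairs and the name `possible_phasings` are assumptions. With k heterozygous SNPs there are $2^{k-1}$ distinct unordered pairs.

```python
from itertools import product


def possible_phasings(genotype):
    """genotype: list of (allele, allele) tuples, one per SNP, phase unknown.
    Returns the sorted list of distinct unordered haplotype pairs."""
    pairs = set()
    for flips in product([False, True], repeat=len(genotype)):
        hap1 = "".join(b if flip else a for (a, b), flip in zip(genotype, flips))
        hap2 = "".join(a if flip else b for (a, b), flip in zip(genotype, flips))
        pairs.add(tuple(sorted((hap1, hap2))))
    return sorted(pairs)


# The slide's example: SNP1 is G/T and SNP2 is A/C.
print(possible_phasings([("G", "T"), ("A", "C")]))
# -> [('GA', 'TC'), ('GC', 'TA')]
```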
53The Problem
- It's not yet easy to measure an individual's (only two) haplotypes
- Molecular haplotyping (nucleotide sequencing) is the gold standard
- A more efficient strategy
- Focus on regions, such as certain genes
- Estimate haplotypes from SNP data (genotypes)
- Use an LD map, and reduce the number of loci needed to represent the haplotype
- Use a haplotype map (DB): key SNPs, haplotype blocks with strong LD
54Molecular Haplotyping
- Hetero-duplex analysis, mismatch detection, allele-specific PCR
- Have potential to get high-throughput
- Only practical for short haplotypes (2-5 kb vs. 50-100 kb)
- Costly
- Rolling Circle amplification method, etc
- Can handle larger size
- Difficult to automate
55In-silico Haplotyping
- Alias: Haplotype Reconstruction, Haplotype Inference, Computational Haplotyping, Statistical Haplotyping, etc.
- Advantages
- Cost effective
- High-throughput
- Difficulty
- Phase ambiguity: the number of possible haplotypes increases exponentially with the number of SNPs
56In-silico Haplotyping: Two Tasks
- I. Reconstruction of the haplotypes of the sampled individuals.
- II. Estimation of haplotype frequencies in a population.
57In-silico Haplotyping Approaches
- Clark's algorithm
- E-M algorithm (expectation-maximization algorithm)
- Bayesian algorithm
- Message: many different approaches
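To give a flavour of the E-M approach, here is a minimal sketch for the simplest case of two biallelic SNPs, where only double heterozygotes (Aa Bb) have ambiguous phase. The slides only name the algorithm, so the input encoding (copies of allele A and allele B per individual), the function name `em_two_locus`, the fixed iteration count and the toy sample are all assumptions; real tools handle many SNPs and missing data.

```python
from collections import Counter


def em_two_locus(genotypes, iterations=50):
    """Estimate haplotype frequencies for two biallelic SNPs.
    genotypes: list of (copies of A at SNP 1, copies of B at SNP 2), each 0, 1 or 2."""
    known = Counter()       # haplotype counts from unambiguous individuals
    n_double_het = 0        # individuals that are Aa Bb (phase unknown)
    for g_a, g_b in genotypes:
        if g_a == 1 and g_b == 1:
            n_double_het += 1
            continue
        # At most one SNP is heterozygous, so both haplotypes are determined.
        first = ("A" if g_a >= 1 else "a") + ("B" if g_b >= 1 else "b")
        second = ("A" if g_a == 2 else "a") + ("B" if g_b == 2 else "b")
        known[first] += 1
        known[second] += 1

    total = 2 * len(genotypes)
    freqs = {h: 0.25 for h in ("AB", "Ab", "aB", "ab")}
    for _ in range(iterations):
        # E-step: probability that a double heterozygote is AB/ab rather than Ab/aB.
        cis = freqs["AB"] * freqs["ab"]
        trans = freqs["Ab"] * freqs["aB"]
        p_cis = cis / (cis + trans) if cis + trans > 0 else 0.5
        # M-step: re-estimate frequencies from expected haplotype counts.
        counts = {
            "AB": known["AB"] + n_double_het * p_cis,
            "ab": known["ab"] + n_double_het * p_cis,
            "Ab": known["Ab"] + n_double_het * (1 - p_cis),
            "aB": known["aB"] + n_double_het * (1 - p_cis),
        }
        freqs = {h: c / total for h, c in counts.items()}
    return freqs


# Toy sample: 4 AB/AB, 4 ab/ab and 2 double heterozygotes.
sample = [(2, 2)] * 4 + [(0, 0)] * 4 + [(1, 1)] * 2
print(em_two_locus(sample))
```

In this toy sample the unambiguous individuals carry only AB and ab, so the iterations resolve the double heterozygotes as AB/ab.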
58How Far Does Association (LD) Extend Between
Neighboring Common Sites?
Theoretical (given 1 cM/Mb): 3-8 kb, but...
59Strategy for Assessing Extent of LD
Distance from core single nucleotide polymorphism
(SNP)
19 regions; 44 Caucasian samples from Utah; a great deal of DNA sequencing per sample
60How far does the signal reach? Results
61MYSTERY: What explains the long-range LD?
LD and population genomics
- Maybe an important event in population history?
62Positive Control 48 Swedes
Identical pattern to Utah
6396 Nigerians (Yoruba)
Much Less LD
Associations in Africans a SUBSET of those in
Caucasians
MUST be influenced by population history
64Confirmation of less LD in Africans from Direct
DNA Sequencing
65More evidence from Genotyping 5,000 SNPs
(Gabriel et al. 2002)
66Explanation: Bottleneck or Founder Effect in History of North Europeans
Ancestral Population
100,000 years ago
Likely <10 founding chromosomes
What was this event? (1) Out of Africa? (2) Founding of Europe?
North Europeans
Yoruba Ancestors
67Given the demographic properties of LD, which
populations are best suited for association-based
mapping studies?
- LD reflects the ages of haplotypes in populations.
- A population founded more recently is useful for detecting long-range associations between disease-causing mutations and marker SNPs.
- Older populations are useful for fine-scale mapping.
- But things are always more complex
68How far does the signal reach? Results
69Maybe clearer this way
LD varies substantially across the genome!
70MYSTERY: What explains the huge genomic variance in LD distribution?
LD and population genomics
- Maybe a lot of intra-genomic diversity. Maybe haplotype blocks?
71...it is not simple even within a population: a patterned structure of recombination in the genome can create blocks of LD
72Haplotype Blocks
- The human genome may be described as a series of regions of high LD, called haplotype blocks
- These are separated by smaller regions of low LD, usually attributed to recombination hotspots
- A haplotype block consists of a few common haplotypes that account for a large DNA segment
73Haplotype Blocks
Haplotype Block
- Each row represents a SNP
- Blue dot: major allele
- Yellow dot: minor allele
- Each column represents a single chromosome
- The 147 SNPs are divided into 18 blocks defined by black lines.
- The expanded box on the right is a SNP block of 26 SNPs over 19 kb of genomic DNA. The 4 most common of 7 different haplotypes include 80% of the chromosomes, and can be distinguished with 2 SNPs
Chromosomes
SNPS
74These would be blocks
High recombination
High LD
Map
75... Likely to be caused by recombination hotspots
(but things are not that easy)
76So what we need is a haplotype map
- The Haplotype Map, HapMap, will be a map of these
haplotype blocks and the SNPs that identify the
haplotypes. The HapMap will be a key resource for
finding genes that contribute to disease risk and
drug response.
77- What is the HapMap?
- The HapMap is a catalogue of common genetic variants (SNPs) that occur in humans.
- What information does the HapMap provide?
- the characteristics of the SNPs (sequence variation, allele freqs)
- where they occur in the genome (relative positions)
- how they are distributed in human populations (LD and haplotype blocks)
78Aims of the Hapmap Project
- To develop a map of the human genome that describes the common patterns of DNA sequence variation (haplotypes)
- For use in establishing connections between genetic variants and disease.
- Populations sampled (n = 270 people)
- African
- Asian
- European
- 614,030 SNPs genotyped (55 million genotypes)
79The Construction of the HapMap
- Three main steps
- 1. SNPs are identified in different individuals from different ethnic groups
- 2. Adjacent SNPs that are inherited together are compiled into haplotypes
- 3. SNPs that uniquely represent haplotypes, i.e. tagSNPs, are identified for use in genetic association studies of disease
81Nicotine metabolising gene
82HapMap Genotype Data Analysis
- The raw genotype data can be downloaded from the HapMap website
- HAPLOVIEW is a useful tool for conducting genotype analysis
84LD and disease
You'll see a lot about this later. Let me just tell you a couple of things
85Gene mapping by linkage in a dominant Mendelian disease: only recombination events in the families carry information to narrow down the location of the gene
86In LD mapping, all the recombination events in
the history of the disease are used to find the
gene region
87LD and complex diseases: LD between a marker and the (unknown) genetic variant contributing to the disease underlies the association approach (i.e., comparing allele frequencies between cases and controls)
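The comparison of allele frequencies between cases and controls can be sketched as a chi-squared test on a 2x2 table of allele counts. The counts below are hypothetical, the function name `allele_association` is an assumption, and the p-value uses the 1-degree-of-freedom identity P(X > x) = erfc(sqrt(x / 2)); a real study would also need to address multiple testing and population structure.

```python
import math


def allele_association(case_A, case_a, ctrl_A, ctrl_a):
    """Chi-squared test (1 d.f., no continuity correction) on allele counts."""
    n = case_A + case_a + ctrl_A + ctrl_a
    numerator = n * (case_A * ctrl_a - case_a * ctrl_A) ** 2
    denominator = (
        (case_A + case_a) * (ctrl_A + ctrl_a) * (case_A + ctrl_A) * (case_a + ctrl_a)
    )
    chi2 = numerator / denominator
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical marker: allele A on 60 of 100 case chromosomes
# but only 40 of 100 control chromosomes.
chi2, p = allele_association(60, 40, 40, 60)
print(chi2, p)
```

A marker in LD with the causal variant shows such a frequency difference even though it is not itself causal.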
88Power to detect association improves when using
haplotypes
89In summary
- Knowing about LD and Haplotype Blocks empowers us
to detect association between markers and disease
and to perform many linkage-based disease
studies.