Centro Nacional Genotipado

1
Centro Nacional Genotipado
  • Bioinformatic analysis of the sequences and
    expression of genes and genomes.
  • Human SNPs. Theory and practice.
  • Arcadi Navarro
  • Madrid, 6 April 2006

2
What is CeGen?
  • CeGen is a technology platform, an initiative of
    GENOMA ESPAÑA, whose aim is to provide the
    expertise and infrastructure needed to carry out
    large-scale, low-cost SNP (Single Nucleotide
    Polymorphism) genotyping projects.
  • The initiative is aimed at research groups in
    universities, hospitals, research centres and
    industry.
  • CeGen aims to contribute to a qualitative and
    quantitative leap in research through high
    value-added services provided from Spain.

3
SNPs: Single Nucleotide Polymorphisms
  • Frequent.
  • Well distributed.
  • Stable.
  • Functional?
  • Amenable to large-scale processing.

4
What is a SNP (Single Nucleotide Polymorphism)?
A SNP is a position in a genome at which two or
more different bases occur in the population,
each with a frequency >1%.
GATTTAGATCGCGATAGAG
GATTTAGATCTCGATAGAG
  • SNPs are the most common type of variation
    (genetic markers).
  • Most SNPs have only two variants (alleles),
    e.g. G/T or A/C
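The two example sequences on this slide can be compared position by position to locate the SNP. The following is a minimal sketch, assuming the two sequences are already aligned and of equal length; the function name `snp_positions` is hypothetical and chosen only for illustration.

```python
def snp_positions(seq_a, seq_b):
    """Return the 1-based positions at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned and of equal length")
    return [i + 1 for i, (x, y) in enumerate(zip(seq_a, seq_b)) if x != y]


# The two sequences shown on the slide
first = "GATTTAGATCGCGATAGAG"
second = "GATTTAGATCTCGATAGAG"
for pos in snp_positions(first, second):
    # Report the two alleles seen at each differing position
    print(pos, first[pos - 1], second[pos - 1])
```

Real SNP discovery works on many aligned reads or chromosomes and also needs the population frequency of each allele; this sketch only shows the position-wise comparison.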

5
Scientific interest
  • There is great demand for analysing SNPs in
    large numbers, and the available technologies
    are costly and highly varied.
  • CeGen offers, from Spain, large-scale, low-cost
    genotyping services tailored to each need.
  • CeGen offers scientific support and access to
    technologies for projects that would otherwise
    require far more time and resources
  • The easier genotyping becomes, the more
    ambitious and numerous the projects that can be
    undertaken
  • E.g. in association studies, where many
    individuals (cases, controls) must be genotyped
  • Possibility of a whole genome scan
  • BioBank projects

6
Strategic interest
  • CeGen can help with applications for research
    projects that involve genotyping, at several
    levels
  • Proposing SNP selection strategies
  • Offering the large-scale technologies that make
    the project viable
  • Calculating budgets that can be included in the
    application

7
Back to SNPs
  • Frequent.
  • Well distributed.
  • Stable.
  • Functional?
  • Amenable to large-scale processing.

8
Terminology
An allele is one of a number of alternative forms
of the same gene occupying a given locus. If we
are considering SNPs, an allele is one of two
alternative forms.
A locus is the physical location of an allele on
the chromosome.
A haplotype is a set of alleles that tend to be
inherited together (not easily separable by
recombination).
Example: consider 2 loci, each with two possible
alleles, the first locus being either A or a, the
second locus being B or b. Then there are 4
possible haplotypes: AB, Ab, aB, ab.
9
tttctccatttgtcgtgacacctttgttgacaccttcatttctgcattct
caattctatttcactggtctatggcagagaacacaaaatatggccagtgg
cctaaatccagcctactaccttttttttttttttgtaacattttactaac
atagccattcccatgtgtttccatgtgtctgggctgcttttgcactctaa
tggcagagttaagaaattgtagcagagaccacaatgcctcaaatatttac
tctacagccctttataaaaacagtgtgccaactcctgatttatgaactta
tcattatgtcaataccatactgtctttattactgtagttttataagtcat
gacatcagataatgtaaatcctccaactttgtttttaatcaaaagtgttt
tggccatcctagatatactttgtattgccacataaatttgaagatcagcc
tgtcagtgtctacaaaatagcatgctaggattttgatagggattgtgtag
aatctatagattaattagaggagaatgactatcttgacaatactgctgcc
cctctgtattcgtgggggattggttccacaacaacacccaccccccactc
ggcaacccctgaaacccccacatcccccagcttttttcccctgctaccaa
aatccatggatgctcaagtccatataaaatgccatactatttgcatataa
cctctgcaatcctcccctatagtttagatcatctctagattacttataat
actaataaaatctaaatgctatgtaaatagttgctatactgtgttgaggg
ttttttgttttgttttgttttatttgtttgtttgtttgtattttaagaga
tggtgtcttgctttgttgcccaggctggagtgcagtggtgagatcatagc
ttactgcagcctcaaactcctggactcaaacagtcctcccacctcagcct
cccaaagtgctgggatacaggtgtgacccactgtgcccagttattatttt
ttatttgtattattttactgttgtattatttttaattattttttctgaat
attttccatctatagttggttgaatcatggatgtggaacaggcaaatatg
gagggctaactgtattgcatcttccagttcatgagtatgcagtctctctg
tttatttaaagttttagtttttctcaaccatgtttacttttcagtataca
agactttgacgttttttgttaaatgtatttgtaagtattttattatttgt
gatgttatttaaaaagaaattgttgactgggcacagtggctcacgcctgt
aatcccagcactttgggaggctgaggcgggcagatcacgaggtcaggaga
tcaagaccatcctggctaacatggtaaaaccccgtctctactaaaaatag
aaaaaaattagccaggcgtggtggcgagtgcctgtagtcccagctactcg
ggaggctgaggcaggagaatggtgtgaacctgggaggcggagcttgcagt
gagctgagatcgtgccactgcattccagcctgcgtgacagagcgagactc
tgtcaaaaaaataaataaaatttaaaaaaagaagaagaaattattttctt
aatttcattttcaggttttttatttatttctactatatggatacatgatt
gatttttgtatattgatcatgtatcctgcaaactagctaacatagtttat
tatttctctttttttgtggattttaaaggattttctacatagataaataa
acacacataaacagttttacttctttcttttcaacctagactggatgcat
tttttgtttttgtttgtttgtttgctttttaacttgctgcagtgactaga
gaatgtattgaagaatatattgttgaacaaaagcagtgagagtggacatc
cctgctttccccctgattttagggggaatgttttcagtctttcactattt
aatatgattttagctataggtttatcctagatccctgttatcatgttgag
gaaattcccttctatttctagtttgttgagattttttaattcatgtgatt
gcgctatctggctttgctctca
10
(No Transcript)
11
(No Transcript)
12
http://www.ncbi.nlm.nih.gov/About/primer/snps.html
13
What are SNPs used for? (I)
  • Genotyping in humans
  • Searching for disease susceptibility genes
  • Diagnosis / prognosis
  • Drug metabolism
  • Adverse drug reactions (personalized medicine)
  • Response to environmental factors
  • Forensic genetics
  • Genome structure and dynamics
  • Genome evolution

14
What are SNPs used for? (II)
  • Genotyping in other species
  • Microbial identification
  • Analysis of microbial communities
  • Genotyping in yeast
  • Similar use to humans in mouse and other model
    organisms
  • Widely used in domestic species (QTL), both
    animals and plants
  • Identification of plant varieties
  • Their use will expand with new SNP maps

15
SNPs in the Human Genome
  • All humans share 99.9% of the same genetic
    sequence
  • SNPs occur about every 1000 base pairs
  • 90% of human genome variation comes from SNPs
  • The human genome contains 10 million validated
    SNPs and 21 million submitted.
  • 340,000 SNPs are found in genes
  • SNPs are not evenly spaced along the sequence
  • SNP-rich regions
  • SNP-poor regions

16
What is a Haplotype?
  • A haplotype is a sequence of alleles stretching
    along an extended segment of DNA: a sort of
    super allele!
  • Haplotypes are usually inherited as a single unit
    from parents.

[Figure: the genotype Aa Bb can be carried as the
haplotype pair AB / ab or as the pair Ab / aB.]
17
Alleles of Adjacent SNPs on a Chromosome form
Haplotypes
a. Short stretch of DNA for 4 different people:
3 SNPs are present
b. Haplotypes made up of a combination of
different alleles at 20 nearby SNPs
c. Genotyping just 3 tag SNPs can distinguish
all 4 haplotypes
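The idea in panel c, that a few tag SNPs are enough to tell the haplotypes apart, can be sketched as a small search. The example below is a minimal illustration, assuming a brute-force search over SNP subsets (fine for a handful of sites, not for real data); the four 6-SNP haplotypes are hypothetical and are not the ones in the figure.

```python
from itertools import combinations


def minimal_tag_snps(haplotypes):
    """Return the smallest set of site indices (0-based) that still
    distinguishes every haplotype from every other one."""
    n_sites = len(haplotypes[0])
    for size in range(1, n_sites + 1):
        for sites in combinations(range(n_sites), size):
            # Keep only the chosen sites and check all haplotypes stay distinct
            projected = {tuple(h[i] for i in sites) for h in haplotypes}
            if len(projected) == len(haplotypes):
                return sites
    return tuple(range(n_sites))  # haplotypes were not all distinct


# Hypothetical haplotypes (one string per haplotype, one letter per SNP)
haps = ["AAGCTA", "AAGTTA", "GCGCAC", "GCACAC"]
tags = minimal_tag_snps(haps)
print(tags, [tuple(h[i] for i in tags) for h in haps])
```

In this toy data several SNPs carry identical information (they are in complete LD), which is why three of the six sites are enough.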
18
What is Linkage Disequilibrium?
  • Linkage Disequilibrium (LD) is nonindependence
    (nonrandomness) of alleles at different sites
    (different SNPs for the rest of the session).
  • Example:
  • Suppose that allele A at locus 1 and allele B
    at locus 2 are at frequencies pA and pB,
    respectively, in the population.
  • If the two loci are independent, then we
    would expect to see the AB haplotype at frequency
    pA × pB.
  • If the population frequency of the AB
    haplotype is either higher or lower than
    this, implying that particular alleles tend to be
    observed together, then the two loci are said to
    be in LD.

19
Linkage Equilibrium
E.g. two adjacent SNPs (A and B) are genotyped in
a population.
There are 4 possible haplotypes.
Under linkage equilibrium we have what we
expect:
$f_{AB} = f_A f_B$, $f_{aB} = f_a f_B$, $f_{Ab} = f_A f_b$, $f_{ab} = f_a f_b$

                  SNP 1
                A      a
SNP 2     B     fAB    faB    fB
          b     fAb    fab    fb
                fA     fa
20
Linkage Disequilibrium (LD)
E.g. two adjacent SNPs are genotyped in a
population.
There are 4 possible haplotypes.
Under linkage disequilibrium we have different
results:
$f_{AB} = f_A f_B + D$, $f_{aB} = f_a f_B - D$, $f_{Ab} = f_A f_b - D$, $f_{ab} = f_a f_b + D$
where D is the LD coefficient:
$D = f_{AB} f_{ab} - f_{aB} f_{Ab}$ or $D = f_{AB} - f_A f_B$

                  SNP 1
                A      a
SNP 2     B     fAB    faB    fB
          b     fAb    fab    fb
                fA     fa
21
Linkage Disequilibrium (LD)
Linkage disequilibrium
                  SNP 1
                A      a
SNP 2     B     0.5    0      0.5
          b     0      0.5    0.5
                0.5    0.5

Linkage equilibrium
                  SNP 1
                A      a
SNP 2     B     0.25   0.25   0.5
          b     0.25   0.25   0.5
                0.5    0.5

In the linkage disequilibrium table D is 0.25:
$D = f_{AB} f_{ab} - f_{aB} f_{Ab} = 0.5 \times 0.5 - 0 \times 0 = 0.25$
$D = f_{AB} - f_A f_B = 0.5 - 0.5 \times 0.5 = 0.25$
22
Linkage Disequilibrium (LD)
Linkage disequilibrium
                  SNP 1
                A      a
SNP 2     B     0.02   0.58   0.6
          b     0.18   0.22   0.4
                0.2    0.8

Linkage equilibrium
                  SNP 1
                A      a
SNP 2     B     0.12   0.48   0.6
          b     0.08   0.32   0.4
                0.2    0.8

In the linkage disequilibrium table D is -0.10:
$D = f_{AB} f_{ab} - f_{aB} f_{Ab} = 0.02 \times 0.22 - 0.58 \times 0.18 = -0.10$
$D = f_{AB} - f_A f_B = 0.02 - 0.6 \times 0.2 = -0.10$
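The two ways of computing D used on these slides can be checked numerically. This is a minimal sketch, assuming the four haplotype frequencies sum to 1; the function name `ld_coefficient` is hypothetical.

```python
def ld_coefficient(f_AB, f_Ab, f_aB, f_ab):
    """Return D computed two ways from the four haplotype frequencies."""
    f_A = f_AB + f_Ab  # marginal frequency of allele A (SNP 1)
    f_B = f_AB + f_aB  # marginal frequency of allele B (SNP 2)
    d_cross = f_AB * f_ab - f_aB * f_Ab  # cross-product form
    d_marginal = f_AB - f_A * f_B        # observed minus expected form
    return d_cross, d_marginal


# Haplotype frequencies from the linkage disequilibrium table on this slide
d_cross, d_marginal = ld_coefficient(f_AB=0.02, f_Ab=0.18, f_aB=0.58, f_ab=0.22)
print(round(d_cross, 4), round(d_marginal, 4))
```

Both forms should agree, up to floating-point rounding, with the value of about -0.10 given on the slide. Feeding in the linkage equilibrium table (0.12, 0.08, 0.48, 0.32) should give a D of essentially zero.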
23
Linkage Disequilibrium (LD)
24
  • Assessing LD
  • Measuring it with some parameter (we have just
    seen D)
  • Testing statistically whether it exists: random
    association or LD?

25
The Measure of LD
  • The D coefficient depends on the marginal allele
    frequencies in the contingency table.
  • This limitation disqualifies D as a useful
    measure of association, because it is data
    dependent and cannot be compared across
    different SNPs or populations.
  • However, D can be normalised to D', making it
    comparable across populations and SNPs.

26
The D' measure of LD
  • $D' = D / D_{max}$
  • $D_{max}$ is the absolute maximum value D can
    take, i.e.
  • $D_{max} = \min(f_A f_b, f_a f_B)$ when $D > 0$
  • $D_{max} = \min(f_A f_B, f_a f_b)$ when $D < 0$

27
Example D' Calculation
Hap Freqs: fAB = 0.765, faB = 0.235, fAb = 0.167, fab = 0.833
Allele Freqs: fA = 0.52, fa = 0.48, fB = 0.59, fb = 0.41
$D = (0.765 \times 0.833) - (0.167 \times 0.235) = 0.025$
Since D is positive (> 0):
$D_{max} = \min(0.52 \times 0.41, 0.48 \times 0.59) = 0.2132$
$D' = D / D_{max} = 0.025 / 0.2132$
Thus, D' = 0.117
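The normalisation step of this example can be reproduced in a few lines. This is a minimal sketch with a hypothetical function name, `d_prime`. Note that the haplotype frequencies as transcribed on this slide do not sum to 1, so the sketch starts from the stated D = 0.025 and the allele frequencies instead of recomputing D.

```python
def d_prime(d, f_A, f_B):
    """Normalise D by the largest value it could take given the allele
    frequencies. The sign of D is kept, so the result lies in [-1, 1]."""
    f_a, f_b = 1.0 - f_A, 1.0 - f_B
    if d >= 0:
        d_max = min(f_A * f_b, f_a * f_B)
    else:
        d_max = min(f_A * f_B, f_a * f_b)
    return d / d_max if d_max > 0 else 0.0


# Values stated on the slide: D = 0.025, fA = 0.52, fB = 0.59
print(round(d_prime(0.025, 0.52, 0.59), 3))
```

With these inputs the result should match the D' of about 0.117 given on the slide.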
28
Interpretation of D' coefficient
  • D' = 1 (perfect positive LD between SNP alleles)
  • D' = 0 (linkage equilibrium, or no association
    between SNP alleles)
  • D' = -1 (perfect negative LD between SNP alleles)
  • D' = 0.87 (strong positive LD between SNP alleles)
  • D' = 0.12 (weak positive LD between SNP alleles)

D' is constrained between -1 and 1.
Significance (P-value) for D' is determined from
the chi-squared distribution.
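The slide does not say which chi-squared statistic is used. One common choice for two biallelic SNPs, shown here as an assumption and not as the author's method, is $\chi^2 = N D^2 / (f_A f_a f_B f_b)$ with 1 degree of freedom, where N is the number of chromosomes sampled. The sketch below uses only the standard library; the function name and the sample size of 200 chromosomes are hypothetical.

```python
import math


def ld_chi_square(d, f_A, f_B, n_chromosomes):
    """Chi-squared statistic (1 degree of freedom) and P-value for LD
    between two biallelic SNPs, given D and the allele frequencies."""
    f_a, f_b = 1.0 - f_A, 1.0 - f_B
    chi2 = n_chromosomes * d * d / (f_A * f_a * f_B * f_b)
    # Upper tail of a chi-squared distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# D and allele frequencies from the worked example; sample size is assumed
chi2, p = ld_chi_square(0.025, 0.52, 0.59, n_chromosomes=200)
print(round(chi2, 2), round(p, 3))
```

The same D can be significant or not depending on the sample size, which is one reason to report a test alongside the D' value.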
29
Interpretation of D' coefficient
D' = 0 means no LD. D' = 1 means complete LD.
Careful: D' can be 1 when 3 haplotypes are
present.
30
LD Plots of Adjacent SNPs
LD varies significantly across genomic regions
31
The r² measure of LD
The disequilibrium coefficient r² (sometimes also
denoted by Δ²) represents the statistical
correlation between 2 sites. Consider two
biallelic loci on the same chromosome, with
alleles A and a at the first locus and with
alleles B and b at the second locus. The allele
frequencies will be written as pA, pa, pB, and
pb, and the four haplotype frequencies will be
written as pAB, pAb, paB, and pab. Then
$r^2 = \frac{(p_{AB} p_{ab} - p_{Ab} p_{aB})^2}{p_A p_a p_B p_b}$
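With the notation just introduced, r² can be computed directly from the four haplotype frequencies using the standard definition $r^2 = D^2 / (p_A p_a p_B p_b)$, where $D = p_{AB} p_{ab} - p_{Ab} p_{aB}$. This is a minimal sketch; the function name `r_squared` is hypothetical.

```python
def r_squared(p_AB, p_Ab, p_aB, p_ab):
    """Squared correlation between the alleles at two biallelic loci."""
    p_A = p_AB + p_Ab
    p_B = p_AB + p_aB
    d = p_AB * p_ab - p_Ab * p_aB
    denom = p_A * (1.0 - p_A) * p_B * (1.0 - p_B)
    # If either locus is monomorphic the correlation is undefined; return 0
    return d * d / denom if denom > 0 else 0.0


print(r_squared(0.5, 0.0, 0.0, 0.5))      # only AB and ab present
print(r_squared(0.25, 0.25, 0.25, 0.25))  # all four haplotypes equally common
```

The first call describes two loci whose alleles always travel together; the second describes two independent loci.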
32
r² is related to D, of course
[Figure: two panels, each with alleles A and a along
the x axis and alleles B and b along the y axis.]
33
r² vs. D'
Both measures are 1 in case of complete
disequilibrium and 0 if there is no LD. But
r² = 1 corresponds to the situation where only 2
haplotypes are present (out of the possible
four), while D' is less strict: D' = 1 can
reflect 2 or 3 haplotypes present.
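The difference between the two measures shows up as soon as exactly three haplotypes are present. The sketch below is a minimal illustration with hypothetical haplotype frequencies (AB = 0.5, aB = 0.25, ab = 0.25, Ab absent); the function name `d_prime_and_r2` is also hypothetical.

```python
def d_prime_and_r2(p_AB, p_Ab, p_aB, p_ab):
    """Return (D', r^2) for two biallelic loci from haplotype frequencies."""
    p_A, p_B = p_AB + p_Ab, p_AB + p_aB
    p_a, p_b = 1.0 - p_A, 1.0 - p_B
    d = p_AB * p_ab - p_Ab * p_aB
    if d >= 0:
        d_max = min(p_A * p_b, p_a * p_B)
    else:
        d_max = min(p_A * p_B, p_a * p_b)
    d_prime = d / d_max if d_max > 0 else 0.0
    denom = p_A * p_a * p_B * p_b
    r2 = d * d / denom if denom > 0 else 0.0
    return d_prime, r2


# Three haplotypes present: the Ab haplotype is never observed
print(d_prime_and_r2(0.5, 0.0, 0.25, 0.25))
# Two haplotypes present: only AB and ab
print(d_prime_and_r2(0.5, 0.0, 0.0, 0.5))
```

In the three-haplotype case D' is 1 while r² is only about one third; in the two-haplotype case both are 1.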
34
The representation of LD
This is just Excel, say.
[Plot: Linkage Disequilibrium D' (y axis, 0 to 1)
against Distance Between SNPs (Base Pairs)
(x axis, 5kb to 160kb).]
35
The representation of LD
36
The representation of LD
This is a HelixTree software screenshot: D' = 0
is blue, D' = 1 is red.
37
The representation of LD
The Graphic Overview of Linkage Disequilibrium
(GOLD) software package:
http://www.sph.umich.edu/csg/abecasis/GOLD/
High recombination
High LD
Map
38
The representation of LD
High recombination
Haploview
High LD
39
Testing for LD
LD tests
  • LD for k alleles: Dm, Fisher's exact test, χ²
  • LD for n loci: ξ, FNF, χ², Fisher's exact test

40
(No Transcript)
41
where ν are the degrees of freedom, n is the
sample size, t is the observed likelihood ratio
(LR), μ is the average LR in a permuted
distribution and σ is the standard deviation of
the permuted LR distribution.
Zhao et al., Ann. Hum. Genet. 63:167-179, 1999
42
FNF: fraction not found. Based on the fact that
LD reduces the number of haplotypes.
Ke: expected number of haplotypes under linkage
equilibrium, given the allele frequencies and the
sample size
Ko: observed number of haplotypes
Kmin: minimum possible number of haplotypes
Slatkin, Genetics 154:1367-1378, 2000; Mateu et
al., Am. J. Hum. Genet. 68:103-117, 2001
43
OK, fine
But, why is there any LD at all in the genome?
44
The origins of Linkage Disequilibrium
Variations in Chromosomes within a Population
45
So LD is the basis of, for example, association
studies (you'll see more about this later...).
And we can go even deeper: LD decays with
recombination
$D_{t+1} = (1 - r) D_t$
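Applying the recurrence repeatedly gives $D_t = (1 - r)^t D_0$, so LD decays geometrically with the number of generations. The sketch below is a minimal illustration assuming a constant recombination fraction r per generation and no other forces (drift, selection, migration); the function names and the example value r = 0.01 are hypothetical.

```python
import math


def ld_after(d0, r, generations):
    """D after a number of generations, from D_{t+1} = (1 - r) * D_t."""
    return d0 * (1.0 - r) ** generations


def generations_to_decay(r, fraction=0.5):
    """Generations needed for D to fall to the given fraction of its start."""
    return math.log(fraction) / math.log(1.0 - r)


# Example: initial D of 0.25 and a recombination fraction of 0.01
for t in (0, 10, 100, 1000):
    print(t, round(ld_after(0.25, 0.01, t), 5))
print(round(generations_to_decay(0.01), 1))
```

Tightly linked sites (small r) keep their LD for many more generations, which is why LD is expected to fall off with distance.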
46
LD is a function of distance
  • Distances
  • 1) Physical distances between alleles are
    measured in base pairs.
  • 2) Genetic distance is based on the probability
    of recombination; the unit is called the Morgan.
  • - A distance of 1 centiMorgan (cM) between two
    alleles means that they have a 1% chance of being
    separated by recombination in each generation.
  • - In humans, a genetic distance of 1 cM is
    roughly equal to a physical distance of 1 million
    base pairs (1 Mbp).

47
Section 1
LD decays with recombination
1,000 gens. ago
Time present
48
  • (Think Finland: 1000 founders, 2000 years ago,
    consistent expansion)
  • Few (maybe none) reoccurrences of
    disease-causing mutation
  • (Think Earth: 10,000 "founders" (Ne), 100,000
    years ago)
  • Assume old mutations cause common diseases

49
Wait a sec, these are the haplotypes, right?
Variations in Chromosomes within a Population
50
And remember: one can select tag-SNPs
a. Short stretch of DNA for 4 different people:
3 SNPs are present
b. Haplotypes made up of a combination of
different alleles at 20 nearby SNPs
c. Genotyping just 3 tag SNPs can distinguish
all 4 haplotypes
51
Cool!!!
  • Nowadays, we can massively genotype individuals.
    We could potentially cover the whole genome
    using the property of LD and a few
    tag-SNPs and... and... and...
  • But
  • How many SNPs to tag all the genome?
  • And can we easily ascertain individual
    haplotypes?

52
Haplotyping: Phase Problem
Observed: SNP1 G/T, SNP2 A/C
Possible haplotypes: GA, TC or GC, TA
n SNPs → $2^n$ possible haplotypes
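The phase ambiguity can be made concrete by listing every haplotype pair that is compatible with an unphased genotype. This is a minimal sketch; the function name `compatible_haplotype_pairs` and the genotype representation (one pair of alleles per SNP) are assumptions made for illustration.

```python
from itertools import product


def compatible_haplotype_pairs(genotype):
    """List the unordered haplotype pairs consistent with an unphased
    genotype, given as one (allele, allele) tuple per SNP."""
    # A heterozygous SNP can be phased two ways, a homozygous SNP only one
    options = [((a, b), (b, a)) if a != b else ((a, b),) for a, b in genotype]
    pairs = set()
    for choice in product(*options):
        hap1 = "".join(site[0] for site in choice)
        hap2 = "".join(site[1] for site in choice)
        pairs.add(tuple(sorted((hap1, hap2))))
    return sorted(pairs)


# The genotype from the slide: SNP1 G/T, SNP2 A/C
print(compatible_haplotype_pairs([("G", "T"), ("A", "C")]))
```

For the slide's genotype this yields the two resolutions GA/TC and GC/TA. An individual heterozygous at k SNPs has 2^(k-1) compatible pairs, so the number of possibilities grows quickly with the number of SNPs.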
53
The Problem
  • It's not yet easy to measure an individual's
    (only two) haplotypes
  • Molecular haplotyping (nucleotide sequencing) is
    the gold standard
  • A more efficient strategy:
  • Focus on regions, such as certain genes
  • Estimate haplotypes from SNP data (genotypes)
  • Use an LD map, and reduce the number of loci
    needed to represent the haplotype
  • Use a haplotype map (DB): key SNPs, haplotype
    blocks with strong LD

54
Molecular Haplotyping
  • Hetero-duplex analysis, mismatch detection,
    allele-specific PCR
  • Have potential to get high-throughput
  • Only practical for short haplotypes (2-5 kb vs.
    50-100kb)
  • Costly
  • Rolling Circle amplification method, etc
  • Can handle larger size
  • Difficult to automate

55
In-silico Haplotyping
  • Also known as Haplotype Reconstruction,
    Haplotype Inference, Computational Haplotyping,
    Statistical Haplotyping, etc.
  • Advantages
  • Cost effective
  • High-throughput
  • Difficulty
  • Phase ambiguity: the number of possible
    haplotypes increases exponentially with the
    number of SNPs

56
In-silico Haplotyping: Two Tasks
  • I. Reconstruction of the haplotypes of the
    sampled individuals.
  • II. Estimation of haplotype frequencies in a
    population.

57
In-silico Haplotyping: Approaches
  • Clark's algorithm
  • E-M algorithm (expectation-maximization
    algorithm)
  • Bayesian algorithm
  • Message: there are many different approaches
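To give a flavour of the E-M approach, here is a minimal sketch for the simplest case of two biallelic SNPs, where only double heterozygotes (Aa Bb) have ambiguous phase. It assumes random mating (Hardy-Weinberg proportions) and is not the implementation used by any particular haplotyping program; the function name, the input layout and the example counts are hypothetical.

```python
def em_haplotype_frequencies(counts, n_iter=100, tol=1e-10):
    """Estimate haplotype frequencies for two biallelic SNPs by E-M.

    counts[i][j] is the number of individuals carrying i copies of
    allele A (0, 1 or 2) and j copies of allele B (0, 1 or 2).
    """
    total = 2 * sum(sum(row) for row in counts)  # number of chromosomes
    # Haplotypes that can be counted without ambiguity
    known = {
        "AB": 2 * counts[2][2] + counts[2][1] + counts[1][2],
        "Ab": 2 * counts[2][0] + counts[2][1] + counts[1][0],
        "aB": 2 * counts[0][2] + counts[1][2] + counts[0][1],
        "ab": 2 * counts[0][0] + counts[1][0] + counts[0][1],
    }
    n_double_het = counts[1][1]  # phase unknown: AB/ab or Ab/aB
    freqs = {h: 0.25 for h in known}
    for _ in range(n_iter):
        # E step: probability that a double heterozygote is AB/ab
        cis = freqs["AB"] * freqs["ab"]
        trans = freqs["Ab"] * freqs["aB"]
        w = cis / (cis + trans) if cis + trans > 0 else 0.5
        # M step: re-estimate frequencies from expected haplotype counts
        new = {
            "AB": (known["AB"] + w * n_double_het) / total,
            "ab": (known["ab"] + w * n_double_het) / total,
            "Ab": (known["Ab"] + (1 - w) * n_double_het) / total,
            "aB": (known["aB"] + (1 - w) * n_double_het) / total,
        }
        converged = max(abs(new[h] - freqs[h]) for h in new) < tol
        freqs = new
        if converged:
            break
    return freqs


# Hypothetical sample: 10 aa bb, 5 Aa Bb and 10 AA BB individuals
sample = [[10, 0, 0], [0, 5, 0], [0, 0, 10]]
print(em_haplotype_frequencies(sample))
```

In this toy sample all unambiguous individuals carry only AB or ab, so the iterations assign the double heterozygotes to AB/ab as well. With more SNPs the number of phase resolutions per individual grows, and the E step has to sum over all of them.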

58
How Far Does Association (LD) Extend Between
Neighboring Common Sites?
Theoretical (given 1 cM/Mb): 3-8 kb, but...
59
Strategy for Assessing Extent of LD
Distance from core single nucleotide polymorphism
(SNP)
19 regions; 44 Caucasian samples from Utah; a
great deal of DNA sequencing per sample
60
How far does the signal reach? Results
61
MYSTERY: What explains the long-range LD?
LD and population genomics
→ Maybe an important event in population history?
62
Positive Control: 48 Swedes
Identical pattern to Utah
63
96 Nigerians (Yoruba)
Much Less LD
Associations in Africans: a SUBSET of those in
Caucasians
MUST be influenced by population history
64
Confirmation of less LD in Africans from Direct
DNA Sequencing
65
More evidence from Genotyping 5,000 SNPs
(Gabriel et al. 2002)
66
Explanation: Bottleneck or Founder Effect in
History of North Europeans
Ancestral Population
100,000 years ago
Likely <10 founding chromosomes
What was this event? (1) Out of Africa?
(2) Founding of Europe?
North Europeans
Yoruba Ancestors
67
Given the demographic properties of LD, which
populations are best suited for association-based
mapping studies?
  • LD reflects the ages of haplotypes in
    populations.
  • A population founded more recently is useful for
    detecting long-range associations between
    disease-causing mutations and marker SNPs.
  • Older populations are useful for fine-scale
    mapping.
  • But things are always more complex

68
How far does the signal reach? Results
69
Maybe clearer this way
LD varies substantially across the genome!
70
MYSTERY: What explains the huge genomic variance
in LD distribution?
LD and population genomics
→ Maybe a lot of intra-genomic diversity. Maybe
haplotype blocks?
71
...it is not simple: even within a population, a
patterned structure of recombination in the
genome can create blocks of LD
72
Haplotype Blocks
  • The human genome may be described as a series of
    regions of high LD called haplotype blocks
  • These are separated by smaller regions of low LD
    usually attributed to recombination hotspots
  • A haplotype block consists of a few common
    haplotypes that account for a large DNA segment

73
Haplotype Blocks
Haplotype Block
  • Each row represents a SNP
  • Blue dot: major allele
  • Yellow dot: minor allele
  • Each column represents a single chromosome
  • The 147 SNPs are divided into 18 blocks defined
    by black lines.
  • The expanded box on the right is a SNP block of
    26 SNPs over 19kb of genomic DNA. The 4 most
    common of 7 different haplotypes include 80% of
    the chromosomes, and can be distinguished with 2
    SNPs

[Figure axes: Chromosomes (columns), SNPs (rows)]
74
These would be blocks
High recombination
High LD
Map
75
... Likely to be caused by recombination hotspots
(but things are not that easy)
76
So what we need is a haplotype map
  • The Haplotype Map, HapMap, will be a map of these
    haplotype blocks and the SNPs that identify the
    haplotypes. The HapMap will be a key resource for
    finding genes that contribute to disease risk and
    drug response.

77
  • What is the HapMap?
  • The HapMap is a catalogue of common genetic
    variants (SNPs) that occur in humans.
  • What information does the HapMap provide?
  • the characteristics of the SNPs (sequence
    variation, allele freqs)
  • where they occur in the genome (relative
    positions)
  • how they are distributed in human populations (LD
    and haplotype blocks)

78
Aims of the HapMap Project
  • To develop a map of the human genome that
    describes the common patterns of DNA sequence
    variation (haplotypes)
  • For use in establishing connections between
    genetic variants and disease.
  • Populations sampled (n = 270 people)
  • African
  • Asian
  • European
  • 614,030 SNPs genotyped (55 million genotypes)

79
The Construction of the HapMap
  • Three main steps
  • 1. SNPs are identified in different individuals
    from different ethnic groups
  • 2. Adjacent SNPs that are inherited together are
    compiled into haplotypes
  • 3. SNPs that uniquely represent haplotypes, i.e.
    tagSNPs, are identified for use in genetic
    association studies of disease

80
(No Transcript)
81
Nicotine metabolising gene
82
HapMap Genotype Data Analysis
  • The raw genotype data can be downloaded from the
    HapMap website
  • HAPLOVIEW is a useful tool for conducting
    genotype analysis

83
(No Transcript)
84
LD and disease
You'll see a lot about this later. Let me just
tell you a couple of things
85
Gene mapping by linkage in a dominant Mendelian
disease: only recombination events in the
families carry information to narrow down the
location of the gene
86
In LD mapping, all the recombination events in
the history of the disease are used to find the
gene region
87
LD and complex diseases: LD between a marker and
the (unknown) genetic variant contributing to the
disease underlies the association approach (i.e.,
comparing allele frequencies between cases and
controls)
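Comparing allele frequencies between cases and controls can be sketched as a chi-squared test on a 2×2 table of allele counts. This is a minimal illustration using the standard 2×2 formula with 1 degree of freedom; it is one common way to run such a comparison, not necessarily the one intended in the talk, and the function name and the counts are hypothetical.

```python
import math


def allelic_chi_square(case_a1, case_a2, ctrl_a1, ctrl_a2):
    """Chi-squared statistic (1 degree of freedom) and P-value for a 2x2
    table of allele counts: rows are cases/controls, columns are alleles."""
    n = case_a1 + case_a2 + ctrl_a1 + ctrl_a2
    numerator = n * (case_a1 * ctrl_a2 - case_a2 * ctrl_a1) ** 2
    denominator = ((case_a1 + case_a2) * (ctrl_a1 + ctrl_a2)
                   * (case_a1 + ctrl_a1) * (case_a2 + ctrl_a2))
    chi2 = numerator / denominator
    # Upper tail of a chi-squared distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts: allele 1 on 60 of 100 case chromosomes
# and on 40 of 100 control chromosomes
chi2, p = allelic_chi_square(60, 40, 40, 60)
print(round(chi2, 2), round(p, 4))
```

The marker being tested need not be the causal variant: if it is in LD with the causal variant, its allele frequencies will also differ between cases and controls, only less strongly.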
88
Power to detect association improves when using
haplotypes
89
In summary
  • Knowing about LD and Haplotype Blocks empowers us
    to detect association between markers and disease
    and to perform many linkage-based disease
    studies.