Title: Computational and Statistical Challenges in Association Studies
1Computational and Statistical Challenges in
Association Studies
- Eleazar Eskin
- University of California, Los Angeles
2The Human Genome Project
What we are announcing today is that we have
reached a milestone, that is, covering the genome
in a working draft of the human sequence.
I would be willing to make a prediction that
within 10 years, we will have the potential of
offering any of you the opportunity to find out
what particular genetic conditions you may be at
increased risk for
Washington, DC, June 26, 2000.
3Human Genetics
Mother
Father
- Disease Risk
- Genetic factors account for 20-80% of disease risk.
- Many genes contribute to complex diseases.
Child
- Personalized Medicine
- Treatment decisions influenced by diagnostics
- Understanding Disease Biology
- New drug targets.
- Understanding of mechanism of disease.
Where are the risk factors? (Genetic Basis of
Disease)
4Human Genetics
Mother
Father
- Disease Risk
- Genetic factors account for 20-80% of disease risk.
- Many genes contribute to complex diseases.
no recombination shown
Child
- Personalized Medicine
- Treatment decisions influenced by diagnostics
- Understanding Disease Biology
- New drug targets.
- Understanding of mechanism of disease.
Where are the risk factors? (Genetic Basis of
Disease)
5Disease Association Studies: The search for genetic factors
- Comparing the DNA contents of two populations
- Cases - individuals carrying the disease.
- Controls - background population.
A difference within a gene between the two
populations is evidence that the gene is involved in
the disease.
6Single Nucleotide Polymorphisms(SNPs)
AGAGCCGTCGACAGGTATAGCCTA
AGAGCCGTCGACATGTATAGTCTA
AGAGCAGTCGACAGGTATAGTCTA
AGAGCAGTCGACAGGTATAGCCTA
AGAGCCGTCGACATGTATAGCCTA
AGAGCAGTCGACATGTATAGCCTA
AGAGCCGTCGACAGGTATAGCCTA
AGAGCCGTCGACAGGTATAGCCTA
- Human Variation
- Humans differ by 0.1% of their DNA.
- A significant fraction of this variation is
accounted for by SNPs.
7Single Nucleotide Polymorphisms: Association
Analysis
Cases (Individuals with the disease)
AGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCGACATGATAGCC
AGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTC
AGAGCAGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCGACATGATAGCC
AGAGCAGTCGACAGGTATAGCCTACATGAGATCAACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGCC
AGAGCCGTCGACATGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCAACATGATAGCC
AGAGCCGTCGACATGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGCC
AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGTC
AGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCC
Controls (Healthy individuals)
AGAGCAGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGCC
AGAGCAGTCGACATGTATAGTCTACATGAGATCAACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCC
AGAGCAGTCGACATGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCAACATGATAGCC
AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGTC
AGAGCCGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCAACATGATAGCC
AGAGCAGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCTGTAGAGCAGTGAGATCGACATGATAGCC
AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCC
AGAGCCGTCGACAGGTATAGTCTACATGAGATCAACATGAGATCTGTAGAGCAGTGAGATCGACATGATAGTC
8Single Nucleotide Polymorphisms Association
Analysis
Cases
[Figure: dense block of long genomic sequences from case individuals]
Controls
[Figure: dense block of long genomic sequences from control individuals]
9Single Nucleotide Polymorphisms(SNPs)
Cases
[Figure: dense block of long genomic sequences from case individuals]
Controls
[Figure: dense block of long genomic sequences from control individuals]
- Millions of Common SNPs
- Correlations between SNPs
- SNP locations unknown
False Positives
Challenges
10The International HapMap Project
- Successor to the Human Genome Project
- International consortium that aims to genotype
the genomes of 270 individuals from four
different populations.
- Launched in 2002. First phase was finished in
October 2005 (Nature, 2005).
- Collected genotypes for 3.9 million SNPs.
- Location and correlation structure of many common
SNPs.
11Public Genotype Data Growth
- More SNPs increase genome coverage in association
studies.
- More genotypes allow for discovery of weaker
associations.
12Some Computational Challenges
- Genetics - identifying disease genes
- Haplotype phasing - preprocessing SNPs
- Association study design
- Association study analysis
- Population stratification
- Inferring evolutionary processes (recombination
rates, selection, haplotype ancestry).
- Etc.
- Genomics - functions of disease genes
- Predicting functional effect of variation
- Understanding disease effect on gene regulation
- Understanding disease effect on metabolic
pathways
- Combining systems biology with genetics
- Etc.
13Haplotype Phasing using Imperfect Phylogeny
14Haplotype Phasing
Haplotypes
ATCCGA AGACGC
- High-throughput, cost-effective sequencing
technology gives genotypes and not haplotypes.
15Haplotype Limited Diversity
- Previous studies on local haplotype structure
- (Daly et al., 2001) chromosome 5q31.
- (Patil et al., 2001) chromosome 21.
- Study findings
- The SNPs on each haplotype are correlated.
- SNPs can be separated into blocks of limited
diversity.
- Local regions have few haplotypes.
16Haplotype Data in a Block
(Daly et al., 2001) Block 6 from Chromosome 5q31
17ExamplePhasing
Genotypes: 22222222 22000001 22022002 22222222 22000001
22022002 22000001
Maximum likelihood haplotype inference is
an NP-hard problem.
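To make the phasing ambiguity concrete, here is a minimal Python sketch that lists every haplotype pair compatible with one genotype. It assumes the common encoding in which 0 and 1 mark homozygous sites and 2 marks a heterozygous site; the function name is hypothetical and this is plain enumeration, not the HAP algorithm.

```python
from itertools import product


def compatible_haplotype_pairs(genotype):
    """List unordered haplotype pairs consistent with a genotype string.

    Assumed encoding: '0' or '1' = homozygous site, '2' = heterozygous site.
    """
    het_sites = [i for i, g in enumerate(genotype) if g == "2"]
    pairs = set()
    # Try every assignment of alleles to the heterozygous sites.
    for bits in product("01", repeat=len(het_sites)):
        h1 = list(genotype)
        h2 = list(genotype)
        for pos, bit in zip(het_sites, bits):
            h1[pos] = bit
            h2[pos] = "1" if bit == "0" else "0"  # complementary allele
        # Sort so that (h1, h2) and (h2, h1) count as one pair.
        pairs.add(tuple(sorted(("".join(h1), "".join(h2)))))
    return sorted(pairs)


pairs = compatible_haplotype_pairs("22000001")
for h1, h2 in pairs:
    print(h1, h2)
```

A genotype with k heterozygous sites has 2^(k-1) compatible pairs, which is why exhaustive search over all individuals quickly becomes infeasible.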
18Narrowing the Search: Perfect Phylogeny
- A directed phylogenetic tree.
- {0,1} alphabet.
- Each site mutates at most once.
- No recombination.
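[Figure: perfect phylogeny rooted at 00000; each edge is labeled with the site that mutates.
00000 -(2)-> 01000; 01000 -(1)-> 11000; 01000 -(5)-> 01001;
11000 -(3)-> 11100; 11100 -(4)-> 11110]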
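A set of binary haplotypes fits a perfect phylogeny rooted at the all-zero haplotype exactly when no pair of sites shows all three of the gametes 01, 10 and 11. The sketch below checks that standard pairwise condition on the six haplotypes of the tree above; the function name is hypothetical, and the check covers only the rooted case with 0 as the ancestral state.

```python
from itertools import combinations


def admits_rooted_perfect_phylogeny(haplotypes):
    """Pairwise test for a perfect phylogeny rooted at the all-zero haplotype.

    For every pair of sites, the gametes 01, 10 and 11 must not all occur.
    """
    n_sites = len(haplotypes[0])
    forbidden = {("0", "1"), ("1", "0"), ("1", "1")}
    for i, j in combinations(range(n_sites), 2):
        gametes = {(h[i], h[j]) for h in haplotypes}
        if forbidden <= gametes:
            return False
    return True


tree_haplotypes = ["00000", "01000", "11000", "01001", "11100", "11110"]
print(admits_rooted_perfect_phylogeny(tree_haplotypes))

# Adding 10001 makes sites 1 and 5 conflict (gametes 10, 01 and 11 all occur).
print(admits_rooted_perfect_phylogeny(tree_haplotypes + ["10001"]))
```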
19The Perfect Phylogeny Haplotype Problem (PPH)
- Given genotypes over a short region.
- Find compatible haplotypes which correspond to a
perfect phylogeny tree.
- Gusfield 02.
- PPH deficiency: the data does not fit the model.
20Solving PPH
- A very simple $O(nm^2)$ algorithm for the PPH problem.
(Also Gusfield 02, Bafna et al., 2003)
But in practice, we do not expect to see
perfect phylogeny in biological data.
We extend our algorithms to the case where the
data is almost perfect phylogeny.
Eskin, Halperin, Karp. "Large Scale
Reconstruction of Haplotypes from Genotype
Data." RECOMB 2003.
21HAP Algorithm
- HAP Local Predictions
- http://research.calit2.net/hap/
- Over 6,000 users of webserver.
- Main Ideas
- Imperfect Phylogeny
- Maximum Likelihood Criterion
- Extremely efficient.
- Orders of magnitude faster than other algorithms.
22Public Genotype Data Growth
Eskin, Halperin, Karp RECOMB 2003
23Phasing Methods
- HAP is one of many phasing algorithms.
- Clark, 1990; Excoffier and Slatkin, 1995; PHASE
(Stephens et al., 2001); HAPLOTYPER (Niu et al.,
2002); Gusfield, 2000; Lancia et al., 2001; many
more.
Algorithms were designed for only 4-12 SNPs!
How do we phase entire chromosomes?
HAP tiling extension: phasing for long
regions. Leverages the speed of HAP.
24Scaling to Whole Genomes: HAP-TILE
genotypes
Local predictions
- For each window we compute the haplotypes
using HAP
- We tile the windows using dynamic programming
25Haplotype Tiling Problem
(ignoring homozygous positions)
001000 110111 010000 101111 011111 100000
000101 111010 000011 111100
100110 011001
00100000110 11011111001
- NP-Hard Problem
- Dynamic Programming Solution
- (Eskin et al. 2004.)
26Incorporating Physical Length
Length-based Prediction Confidences
(minimum weighted number of conflicts)
10kb: 4.2, 12kb: 3.8, 38kb: 1.2, 43kb: 0.9, 22kb: 2.2, 14kb: 3.6
001000 110111 010000 101111 011111 100000
000101 111010 001011 110100
100110 011001
00100000110 11011111001
27Phasing Running Time Comparison (Phaseoff
Competition)
Marchini et al. American Journal of Human
Genetics, 2006.
28Phasing Running Time Comparison (NCBI dbSNP
Benchmark)
29Public Genotype Data Growth
Eskin, Halperin, Karp RECOMB 2003
30RECOMB 2003 Submission
31Genome Haplotype Resource
Haplotype predictions and tag SNPs for all 286
million genotypes, including HapMap data, in NCBI's
dbSNP database are publicly available at
http://www.ncbi.nlm.nih.gov/projects/SNP/. Phase
2 HapMap haplotypes will be available soon.
Noah Zaitlen, Hyun Min Kang, Michael Feolo,
Stephen Sherry, Eran Halperin, Eleazar Eskin.
"Inference and Analysis of Haplotypes from
Combined Genotyping Studies Deposited in dbSNP."
Genome Research 15(11):1594-600. 2005.
David A. Hinds, Laura L. Stuve, Geoffrey B. Nilsen,
Eran Halperin, Eleazar Eskin, Dennis G.
Ballinger, Kelly A. Frazer, David R. Cox.
"Whole-Genome Patterns of Common DNA Variation in
Three Human Populations." Science
307(5712):1072-1079. 18 February 2005.
32Human Variation and Disease
- Goals
- Identify variation that plays a role in human
disease.
- Understand function of that variation.
- Genetic Architecture of Complex Traits
- Post-HapMap data
- identifies location of variation.
- provides structure of variation.
- Integrated Genomics Approaches
- Allows the leveraging of existing genomic
resources such as aligned genomes.
- Integration of additional data such as gene
expression.
- Convergence of Research Areas: Genetics
(Association) and Bioinformatics (Functional
Annotation)
33Human Complex Traits Challenges
- Little is known about complex disease
- Models of complex trait factors unknown.
- Most factors in complex diseases are unknown.
- Degree of interactions between factors unknown.
- High degree of genetic variation between
individuals.
- Very little phenotypic information on
individuals.
- No direct experimentation possible.
34Weighted Haplotype Association
35Association Statistics
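- Assume we are given N/2 cases and N/2 control
individuals.
- Since each individual has 2 chromosomes, we have
a total of N case chromosomes and N control
chromosomes.
- At SNP A, let $\hat{p}_A^+$ and $\hat{p}_A^-$ be the observed case
and control frequencies respectively, and let $p_A^+$ and $p_A^-$ be
the true frequencies.
- We know that
- $\hat{p}_A^+ \sim N(p_A^+,\; p_A^+(1-p_A^+)/N)$.
- $\hat{p}_A^- \sim N(p_A^-,\; p_A^-(1-p_A^-)/N)$.
36Association Statistics
- $\hat{p}_A^+ \sim N(p_A^+,\; p_A^+(1-p_A^+)/N)$.
- $\hat{p}_A^- \sim N(p_A^-,\; p_A^-(1-p_A^-)/N)$.
- $\hat{p}_A^+ - \hat{p}_A^- \sim N(p_A^+ - p_A^-,\; (p_A^+(1-p_A^+) + p_A^-(1-p_A^-))/N)$
- We approximate
- $p_A^+(1-p_A^+) + p_A^-(1-p_A^-) \approx 2\,p_A(1-p_A)$, where $p_A = (p_A^+ + p_A^-)/2$
- then if $p_A^+ = p_A^-$
- $S_A = \dfrac{\hat{p}_A^+ - \hat{p}_A^-}{\sqrt{2\,p_A(1-p_A)/N}} \sim N(0, 1)$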
37Association Statistic
- Under the null hypothesis $p_A^+ - p_A^- = 0$.
- We compute the statistic $S_A$.
- If $S_A < \Phi^{-1}(\alpha/2)$ or $S_A > -\Phi^{-1}(\alpha/2)$ then the
association is significant at level $\alpha$.
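The test can be sketched in a few lines of Python. The sketch assumes that S_A is the case/control allele frequency difference divided by its estimated standard deviation, with the pooled observed frequency standing in for p_A; the function names and the example counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def association_statistic(case_count, control_count, n_chromosomes):
    """Normalized allele frequency difference between cases and controls.

    case_count / control_count: chromosomes carrying the allele in each group.
    n_chromosomes: N, the number of chromosomes in each group.
    """
    p_case = case_count / n_chromosomes
    p_ctrl = control_count / n_chromosomes
    p_avg = (p_case + p_ctrl) / 2  # pooled estimate of the allele frequency
    if p_avg in (0.0, 1.0):
        raise ValueError("SNP is monomorphic in the sample")
    return (p_case - p_ctrl) / sqrt(2 * p_avg * (1 - p_avg) / n_chromosomes)


def is_significant(s, alpha=0.05):
    """Two-sided test: reject when s falls in either tail of N(0, 1)."""
    z = NormalDist().inv_cdf(alpha / 2)  # negative quantile
    return s < z or s > -z


s = association_statistic(case_count=300, control_count=250, n_chromosomes=1000)
print(round(s, 3), is_significant(s))
```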
38Association Power
- Let's assume that SNP A is causal and $p_A^+ \neq p_A^-$.
- Given the true $p_A^+$ and $p_A^-$, if we collect N
individuals and compute the statistic $S_A$, the
probability that $S_A$ is significant at level $\alpha$
is the power.
- Power is the chance of detecting an association
of a certain strength with a certain number of
individuals.
39Association Statistic
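- Let's assume that $p_A^+ \neq p_A^-$; then $S_A \sim N(\lambda_A\sqrt{N},\, 1)$, where $\lambda_A = (p_A^+ - p_A^-)/\sqrt{2\,p_A(1-p_A)}$.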
40Association Power
Power of association test
Threshold for significance
Non-centrality parameter.
41Association Power
- The statistical power of an association test with N
individuals, non-centrality parameter $\lambda\sqrt{N}$,
and significance threshold $\alpha$ is $P(\alpha, \lambda\sqrt{N})$.
- Note that if $\lambda = 0$, power is always $\alpha$.
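A minimal sketch of the power calculation, assuming the statistic is normally distributed with mean lambda * sqrt(N) and unit variance and that the test is two-sided at level alpha. The function name `association_power` and the example values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def association_power(lam, n, alpha):
    """Probability that a N(lam * sqrt(n), 1) statistic lands in a rejection tail."""
    nd = NormalDist()
    z = nd.inv_cdf(alpha / 2)  # lower-tail threshold (negative)
    mu = lam * sqrt(n)         # non-centrality parameter
    lower_tail = nd.cdf(z - mu)        # P(S < z)
    upper_tail = 1 - nd.cdf(-z - mu)   # P(S > -z)
    return lower_tail + upper_tail


print(association_power(0.0, 1000, 0.05))  # equals alpha when lambda = 0
print(association_power(0.1, 1000, 0.05))
```

With lambda = 0 both tails contribute alpha/2, which recovers the note above that power equals alpha under the null.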
42Indirect Association
- Now let's assume that we have 2 markers, A and B.
Let us assume that marker B is the causal
mutation, but we are observing marker A.
- If we observed marker B directly, our statistic
would be
43Indirect Association
- However, we are observing A, where our statistic
is
- What is the relation between $S_A$ and $S_B$?
44Indirect Association
45Indirect Association
- We assume conditional probability distributions
are equal in case and control samples
46Indirect Association
47Indirect Association
48Indirect Association
- How many individuals, $N_A$, do we need to collect
at marker A to achieve the same power as if we
collected $N_B$ individuals at marker B?
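A short sketch of the sample size question, assuming the standard result that the non-centrality parameter at the observed marker is the causal one scaled by the correlation r, so that equal power requires N_A = N_B / r^2. This matches the N/r_h^2 statement made later for weighted haplotypes. The function names and frequencies are hypothetical.

```python
from math import ceil


def r_squared(p_ab, p_a, p_b):
    """Squared correlation between two biallelic markers.

    p_ab: frequency of the haplotype carrying allele A and allele B.
    p_a, p_b: frequencies of allele A and allele B.
    """
    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))


def required_sample_size(n_b, r2):
    """Individuals needed at the proxy marker to match n_b at the causal marker."""
    return ceil(n_b / r2)


r2 = r_squared(p_ab=0.4, p_a=0.5, p_b=0.5)
print(r2, required_sample_size(1000, r2))
```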
49Visualization in terms of Power
Power of association test
Threshold for significance
Non-centrality parameters.
50Correlating Haplotypes with the Disease
- The disease may be correlated with a SNP not in
the panel.
- The disease may be more correlated with a
haplotype (group of SNPs) than with any single
SNP in the panel.
- Haplotype tests
- Which haplotypes should we test?
- Which blocks should we pick?
51Key Problem Indirect Association
- We have the HapMap.
- Information on 4,000,000 SNPs.
- Affymetrix gene chip collects information on
500,000 SNPs.
- What about the remaining 3,500,000 SNPs?
- So far, we have designed studies by picking tag
SNPs with high $r^2$.
- Can we use the HapMap when performing
association?
- Multi-Tag methods.
52Haplotypes as Proxies for Hidden SNPs (de Bakker
et al., 2005)
53Haplotypes as Proxies for Hidden SNPs (de Bakker
et al., 2005)
54WHAP - Weighted HAPlotype Analysis
A
0.71AA 0.29AG
55WHAP - Weighted Haplotypes
A
0.71AA 0.29AG
56Basic MultiMarker Method
- For each SNP in HapMap, find the haplotype among
genotyped SNPs that has the highest $r^2$ to the SNP.
- Perform association at each SNP and each added
haplotype.
- Now instead of performing 500,000 tests, we
perform 4,000,000 tests.
57Weighted Haplotype Test
- For each haplotype h, we assign a weight $w_h$.
- We use a weighted allele frequency statistic.
- This statistic is the weighted numerator in $S_A$.
- What is the variance of this statistic?
- Complication: haplotype frequencies are not
independent!
58Weighted Haplotype Example
- Assume we have 4 haplotypes AB, Ab, aB and ab.
- If we set the weights so that $w_{AB} = w_{Ab} = 1$ and
$w_{aB} = w_{ab} = 0$, this is equivalent to looking at the
single SNP A.
- If we set the weights so that $w_{AB} = 1$ and
$w_{Ab} = w_{aB} = w_{ab} = 0$, this is equivalent to looking at
the single haplotype AB.
- Other weights can be something in between.
59Variance of Wh
- A fact from statistics:
- $Var(A+B) = Var(A) + Var(B) + 2\,Cov(A,B)$
- Let us assume that we have a 4-sided die.
- Let $A_1, A_2, A_3, A_4$ be the events of rolling a 1,
2, 3, or 4, which have probabilities $p_1, p_2, p_3$, and
$p_4$ respectively.
- $Var(A_i) = p_i(1-p_i)$
- $Cov(A_i, A_j) = -p_i p_j$ for $i \neq j$
60Variance of Wh
- If we are rolling N times, the variance and
covariance of the observed frequencies are
- $Var(\hat{p}_i) = p_i(1-p_i)/N$
- $Cov(\hat{p}_i, \hat{p}_j) = -p_i p_j/N$
- Then in the cases and controls
- $Var(\hat{p}_i^+ - \hat{p}_i^-) = 2p_i(1-p_i)/N$
- If we assign a weight $w_i$ to each side of the die
- $Var(w_i(\hat{p}_i^+ - \hat{p}_i^-)) = w_i^2 \cdot 2p_i(1-p_i)/N$
(cases and controls are independent)
- $Cov(w_i\hat{p}_i, w_j\hat{p}_j) = -w_i w_j p_i p_j/N$
61Variance of Wh
- Then $N \cdot Var(w_1\hat{p}_1 + w_2\hat{p}_2 + w_3\hat{p}_3 + w_4\hat{p}_4)$ is
- $N\,[\,Var(w_1\hat{p}_1) + Var(w_2\hat{p}_2) + Var(w_3\hat{p}_3) + Var(w_4\hat{p}_4)$
- $+\,2Cov(w_1\hat{p}_1, w_2\hat{p}_2) + 2Cov(w_1\hat{p}_1, w_3\hat{p}_3) + 2Cov(w_1\hat{p}_1, w_4\hat{p}_4)$
- $+\,2Cov(w_2\hat{p}_2, w_3\hat{p}_3) + 2Cov(w_2\hat{p}_2, w_4\hat{p}_4) + 2Cov(w_3\hat{p}_3, w_4\hat{p}_4)\,]$
- $= w_1^2p_1(1-p_1) + w_2^2p_2(1-p_2) + w_3^2p_3(1-p_3) + w_4^2p_4(1-p_4)$
- $-\,2(w_1w_2p_1p_2 + w_1w_3p_1p_3 + w_1w_4p_1p_4 + w_2w_3p_2p_3 + w_2w_4p_2p_4 + w_3w_4p_3p_4)$
- $= w_1^2p_1 - w_1^2p_1^2 + w_2^2p_2 - w_2^2p_2^2 + w_3^2p_3 - w_3^2p_3^2 + w_4^2p_4 - w_4^2p_4^2$
- $-\,2(w_1w_2p_1p_2 + w_1w_3p_1p_3 + w_1w_4p_1p_4 + w_2w_3p_2p_3 + w_2w_4p_2p_4 + w_3w_4p_3p_4)$
62Variance of Wh
- $N \cdot Var(w_1\hat{p}_1 + w_2\hat{p}_2 + w_3\hat{p}_3 + w_4\hat{p}_4)$
- $= w_1^2p_1 + w_2^2p_2 + w_3^2p_3 + w_4^2p_4$
- $-\,w_1^2p_1^2 - w_2^2p_2^2 - w_3^2p_3^2 - w_4^2p_4^2$
- $-\,2(w_1w_2p_1p_2 + w_1w_3p_1p_3 + w_1w_4p_1p_4 + w_2w_3p_2p_3 + w_2w_4p_2p_4 + w_3w_4p_3p_4)$
- $= w_1^2p_1 + w_2^2p_2 + w_3^2p_3 + w_4^2p_4 - (w_1p_1 + w_2p_2 + w_3p_3 + w_4p_4)^2$
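The closed form at the end of the derivation can be checked numerically against the full sum over variances and covariances. The sketch below assumes multinomial sampling, for which the covariance between two different outcomes is negative; the function names and the example weights and probabilities are hypothetical.

```python
def variance_closed_form(w, p):
    """N * Var(sum_i w_i * p_hat_i) as sum(w^2 p) - (sum(w p))^2."""
    second_moment = sum(wi * wi * pi for wi, pi in zip(w, p))
    mean = sum(wi * pi for wi, pi in zip(w, p))
    return second_moment - mean * mean


def variance_from_covariances(w, p):
    """Same quantity, summed term by term over the covariance matrix."""
    total = 0.0
    for i in range(len(p)):
        for j in range(len(p)):
            # Diagonal terms are variances; off-diagonal terms are negative.
            cov = p[i] * (1 - p[i]) if i == j else -p[i] * p[j]
            total += w[i] * w[j] * cov
    return total


probs = [0.4, 0.1, 0.3, 0.2]           # haplotypes AB, Ab, aB, ab
snp_a_weights = [1.0, 1.0, 0.0, 0.0]   # equivalent to testing SNP A alone
print(variance_closed_form(snp_a_weights, probs))

mixed_weights = [0.75, 0.0, 0.5, 0.1]
print(variance_closed_form(mixed_weights, probs),
      variance_from_covariances(mixed_weights, probs))
```

With the single-SNP weights the result is 0.25, which is p_A(1 - p_A) for p_A = 0.5, as expected for one SNP.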
63The ?-test
- Each haplotype h is assigned a weight $w_h$.
- N is the number of individuals.
- $p_h$ - the probability for h in cases/controls, or
average.
- Under the null, the test statistic is $\chi^2$ distributed.
64Non-Centrality Parameter
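- Under weights $w_1, w_2, w_3, w_4$ and true case/control
probabilities $p_1^+, p_2^+, p_3^+, p_4^+$ and
$p_1^-, p_2^-, p_3^-, p_4^-$, $W_h$ is expected to be $\sum_i w_i(p_i^+ - p_i^-)$.
- When normalizing for the variance, the
non-centrality parameter is $\lambda_h\sqrt{N}$, with $\lambda_h = \dfrac{\sum_i w_i(p_i^+ - p_i^-)}{\sqrt{2\left(\sum_i w_i^2 p_i - (\sum_i w_i p_i)^2\right)}}$.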
65Wh and indirect association
- Let us assume that SNP C is causal with
non-centrality parameter $\lambda_C$.
- If we perform weighted haplotype association, the
non-centrality parameter is $\lambda_h$.
- How are they related? (i.e., what is the power of
the weighted haplotype association test?)
- Using the same technique, we can show that $\lambda_C r_h = \lambda_h$,
where $r_h$ is the conceptual equivalent of r in
the 2-SNP case.
66Indirect Association (Flashback)
67Indirect Association (Flashback)
- Since conditional probability distributions are
equal in case and control samples
68Indirect Association (Wh)
- Since conditional probability distributions are
equal in case and control samples
69Indirect Association (Flashback)
70Indirect Association (Wh)
71Indirect Association (Flashback)
- How many individuals, $N_A$, do we need to collect
at marker A to achieve the same power as if we
collected $N_B$ individuals at marker B?
72The Relation to Power
The power of detecting the SNP with N individuals
is the same as using the tag SNPs with $N/r_h^2$
individuals.
73Setting the weights
- Power depends on $r_h$. We want to set the weights
so that $r_h$ is maximized.
- Proving what the maximum is is not so easy.
74Choosing the Weights
- Optimal weights
- $w_h(s_5) = P(s_5 = A \mid h) = q_{Ah}$
75Choosing the Weights
- Optimal weights
- $w_h(s_5) = P(s_5 = A \mid h) = q_{Ah}$
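The weight rule can be illustrated on a toy reference panel. The sketch below estimates each weight as the fraction of reference chromosomes with tag haplotype h that carry allele A at the hidden SNP, then measures the squared Pearson correlation between the weights and the hidden allele. Reading r_h as this correlation is an assumption made for illustration, and the panel, the function names and the counts are all hypothetical.

```python
from collections import Counter


def conditional_weights(reference):
    """w_h = P(hidden allele = A | tag haplotype h), estimated from a panel.

    reference: list of (tag_haplotype, hidden_allele) with hidden_allele in {0, 1}.
    """
    totals = Counter()
    carriers = Counter()
    for hap, allele in reference:
        totals[hap] += 1
        carriers[hap] += allele
    return {hap: carriers[hap] / totals[hap] for hap in totals}


def squared_correlation(reference, weights):
    """Squared Pearson correlation between per-chromosome weight and hidden allele."""
    n = len(reference)
    xs = [weights[hap] for hap, _ in reference]
    ys = [allele for _, allele in reference]
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mx) ** 2 for x in xs) / n
    var_y = sum((y - my) ** 2 for y in ys) / n
    return cov * cov / (var_x * var_y)


# Toy panel of 10 chromosomes: tag haplotype over SNPs A/a and B/b, plus hidden allele.
panel = ([("AB", 1)] * 3 + [("AB", 0)] + [("Ab", 0)] * 2
         + [("aB", 1)] + [("aB", 0)] + [("ab", 0)] * 2)

weights = conditional_weights(panel)
r2_weighted = squared_correlation(panel, weights)
r2_tag_a = squared_correlation(panel, {"AB": 1, "Ab": 1, "aB": 0, "ab": 0})
r2_tag_b = squared_correlation(panel, {"AB": 1, "Ab": 0, "aB": 1, "ab": 0})
print(weights, r2_weighted, r2_tag_a, r2_tag_b)
```

On this panel the weighted haplotype score correlates with the hidden SNP better than either tag SNP on its own.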
76The Relation to Power
- This is exactly $r^2$ in the case of one tag SNP.
- WHAP always has at least as much power as
- single SNP test
- single haplotype test
- haplotype group test
- $\chi^2$ with k degrees of freedom.
77WHAP FAQ
- Why is WHAP always more powerful?
- What if there is a SNP with $r^2 = 1$?
- What if the weights are set wrong?
- Does the false positive rate increase?
- Does the power decrease?
- Mathematically what does this affect?
- Why can the weights be wrong?
- i.e., what are the underlying assumptions of the
method?
78Apply tests $T_1, \ldots, T_{4M}$
Cases: 0.5M SNPs
Controls: 0.5M SNPs
HapMap: 4M SNPs
Use as training dataset to get the weights
Tests $T_1, \ldots, T_{4M}$
Positive results give evidence for a causal SNP
- can be verified by a follow-up/two-stage study.
79How Many SNPs are Captured?
81Power Simulations
- Relative power to using all SNPs.
- Tested on the ENCODE regions, Affy 500k tag
SNPs.
82Practical Issues
- We assume we have the haplotype frequencies in
the HapMap (not the phase).
- We assume the case/control populations are coming
from the same population as the HapMap.
- Over-fitting
- Train with half of the data, test the other half.
- No correlation between the haps and random SNPs.
84WHAP $r^2$ in a region. Red lines are collected
SNPs. Blue lines are $r_h^2$ values.
85Associations using WHAP. Red lines are
associations at collected SNPs. Blue lines are
associations at uncollected SNPs inferred by WHAP.
87Optimal Genome Wide Tagging by Reduction to SAT
88Correlation Structure
89Example $r^2$ Matrix
90Graph Representation
91Satisfiability and SAT Solvers
- Boolean variables called literals
- Logical operators
- AND ∧
- OR ∨
- NOT ¬
- Example
- (s1 ? s2) ? (s2 ? s3 ? s1)
- s1 false s2 false s3 true
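A tiny brute-force satisfiability check shows what a SAT solver computes. The sketch assumes a signed-integer clause encoding (k for variable s_k, -k for its negation), and the example formula (s1 ∨ ¬s2) ∧ (s2 ∨ s3 ∨ ¬s1) is a hypothetical one chosen for illustration. Real solvers are far more sophisticated than this enumeration.

```python
from itertools import product


def satisfying_assignments(clauses, n_vars):
    """Return every truth assignment that satisfies all clauses of a CNF formula.

    clauses: list of clauses; each clause is a list of signed integers,
             where k means variable k is true and -k means it is false.
    """
    solutions = []
    for values in product([False, True], repeat=n_vars):
        # A clause holds if at least one of its literals is true.
        if all(any(values[abs(lit) - 1] == (lit > 0) for lit in clause)
               for clause in clauses):
            solutions.append(values)
    return solutions


formula = [[1, -2], [2, 3, -1]]  # (s1 or not s2) and (s2 or s3 or not s1)
solutions = satisfying_assignments(formula, n_vars=3)
print(len(solutions), (False, False, True) in solutions)
```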
92Negation Normal Form
A. Darwiche
rooted DAG (Circuit)
93CNF Form and Logical Solutions
94NNF Form of Solutions
95Local Single SNP r2 Tagging
- Generate a clause for each SNP
- Clause for SNP $s_i$ contains all SNPs that cover it
- Input CNF as conjunction of all clauses
- Compile with minSAT solver
- Find solutions by traversal of NNF
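The clause construction can be sketched on a small example. Here a SNP is assumed to cover another when their r^2 meets a threshold (0.8 is a hypothetical choice), and the minimum tag sets are found by exhaustive search over subsets rather than by the minSAT compilation to NNF described above. The matrix and function names are illustrative only.

```python
from itertools import combinations


def cover_clauses(r2, threshold=0.8):
    """One clause per SNP: the set of SNPs (itself included) that cover it."""
    n = len(r2)
    return [{j for j in range(n) if i == j or r2[i][j] >= threshold}
            for i in range(n)]


def minimum_tag_sets(clauses):
    """All smallest tag sets that satisfy every clause (brute force)."""
    n = len(clauses)
    for size in range(1, n + 1):
        found = [set(tags) for tags in combinations(range(n), size)
                 if all(clause & set(tags) for clause in clauses)]
        if found:
            return found
    return []


r2_matrix = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.85, 0.1],
    [0.1, 0.85, 1.0, 0.2],
    [0.1, 0.1, 0.2, 1.0],
]
clauses = cover_clauses(r2_matrix)
print(clauses)
print(minimum_tag_sets(clauses))
```

Exhaustive search is exponential in the number of SNPs, which is the motivation for handing the same clauses to a SAT-based method at genome scale.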
96Optimal Tagging
97Whole Genome Tagging
98MultiMarker Example
99MultiMarker Tagging
100Best N-tagging
- Fixed budget of SNPs
- Maximize function such as
- Average $r^2$
- Power
- Switch from minSAT to maxSAT
- Each literal $s_i$ has a weight $w_i$ (+,-)
- Choose literals that maximize weight
- Enforce constraints with infinite weight
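For a fixed budget, the objective can be illustrated by scoring every tag set of the allowed size. This sketch maximizes the average, over all SNPs, of the best r^2 to any chosen tag; it uses exhaustive search in place of the weighted maxSAT formulation, and the matrix and function name are hypothetical.

```python
from itertools import combinations


def best_n_tags(r2, budget):
    """Pick `budget` tags maximizing the average best r^2 over all SNPs."""
    n = len(r2)

    def average_best_r2(tags):
        # Each SNP is scored by its strongest correlation with a chosen tag.
        return sum(max(r2[i][t] for t in tags) for i in range(n)) / n

    best = max(combinations(range(n), budget), key=average_best_r2)
    return best, average_best_r2(best)


r2_matrix = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.85, 0.1],
    [0.1, 0.85, 1.0, 0.2],
    [0.1, 0.1, 0.2, 1.0],
]
print(best_n_tags(r2_matrix, budget=1))
print(best_n_tags(r2_matrix, budget=2))
```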
101Adder Circuit
102UCLA: Adnan Darwiche, Arthur Choi, Knot Pipatswisawat
ICSI: Eran Halperin, Richard Karp
Perlegen Sciences: David Hinds, David Cox
Ph.D. Students: Buhm Han, Nils Homer, Hyun Min Kang,
Sean O'Rourke, Jimmie Ye, Noah Zaitlen
Webserver Hosted By