Computational and Statistical Challenges in Association Studies

1
Computational and Statistical Challenges in
Association Studies
  • Eleazar Eskin
  • University of California, Los Angeles

2
The Human Genome Project
"What we are announcing today is that we have
reached a milestone ... that is, covering the genome
in ... a working draft of the human sequence."
"I would be willing to make a prediction that
within 10 years, we will have the potential of
offering any of you the opportunity to find out
what particular genetic conditions you may be at
increased risk for"
Washington, DC, June 26, 2000.
3
Human Genetics
Mother
Father
  • Disease Risk
  • Genetic factors account for 20-80% of disease
    risk.
  • Many genes contribute to complex diseases.

Child
  • Personalized Medicine
  • Treatment decisions influenced by diagnostics
  • Understanding Disease Biology
  • New drug targets.
  • Understanding of mechanism of disease.

Where are the risk factors? (Genetic Basis of
Disease)
4
Human Genetics
Mother
Father
  • Disease Risk
  • Genetic factors account for 20-80% of disease
    risk.
  • Many genes contribute to complex diseases.

no recombination shown
Child
  • Personalized Medicine
  • Treatment decisions influenced by diagnostics
  • Understanding Disease Biology
  • New drug targets.
  • Understanding of mechanism of disease.

Where are the risk factors? (Genetic Basis of
Disease)
5
Disease Association Studies: The search for
genetic factors
  • Comparing the DNA contents of two populations
  • Cases - individuals carrying the disease.
  • Controls - background population.

Differences within a gene between the two
populations are evidence that the gene is involved in
the disease.
6
Single Nucleotide Polymorphisms (SNPs)
AGAGCCGTCGACAGGTATAGCCTA
AGAGCCGTCGACATGTATAGTCTA
AGAGCAGTCGACAGGTATAGTCTA
AGAGCAGTCGACAGGTATAGCCTA
AGAGCCGTCGACATGTATAGCCTA
AGAGCAGTCGACATGTATAGCCTA
AGAGCCGTCGACAGGTATAGCCTA
AGAGCCGTCGACAGGTATAGCCTA
  • Human Variation
  • Humans differ by 0.1% of their DNA.
  • A significant fraction of this variation is
    accounted for by SNPs.
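
To make the idea of a SNP concrete, here is a minimal Python sketch that scans the eight aligned sequences shown on this slide and reports the columns where more than one base occurs. The function name `variable_sites` is illustrative, and the sketch assumes the sequences are already aligned and of equal length.

```python
def variable_sites(sequences):
    """Return the 0-based positions at which the aligned sequences disagree."""
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must be aligned to the same length")
    return [i for i in range(length) if len({s[i] for s in sequences}) > 1]


# The eight sequences from the slide, one per chromosome.
sequences = [
    "AGAGCCGTCGACAGGTATAGCCTA",
    "AGAGCCGTCGACATGTATAGTCTA",
    "AGAGCAGTCGACAGGTATAGTCTA",
    "AGAGCAGTCGACAGGTATAGCCTA",
    "AGAGCCGTCGACATGTATAGCCTA",
    "AGAGCAGTCGACATGTATAGCCTA",
    "AGAGCCGTCGACAGGTATAGCCTA",
    "AGAGCCGTCGACAGGTATAGCCTA",
]

snps = variable_sites(sequences)
print(snps)  # [5, 13, 20]
for pos in snps:
    # Show the alleles observed at each variable site.
    print(pos, sorted({s[pos] for s in sequences}))
```

All other columns are identical across the eight sequences, so only these three positions carry information for an association test.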

7
Single Nucleotide Polymorphisms: Association
Analysis
Cases (Individuals with the disease)
AGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCGACATGATAGCC
AGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTC
AGAGCAGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCGACATGATAGCC
AGAGCAGTCGACAGGTATAGCCTACATGAGATCAACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGCC
AGAGCCGTCGACATGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCAACATGATAGCC
AGAGCCGTCGACATGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGCC
AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGTC
AGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCC
Controls (Healthy individuals)
AGAGCAGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGCC
AGAGCAGTCGACATGTATAGTCTACATGAGATCAACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCC
AGAGCAGTCGACATGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCAACATGATAGCC
AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGTC
AGAGCCGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCAACATGATAGCC
AGAGCAGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCTGTAGAGCAGTGAGATCGACATGATAGCC
AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCC
AGAGCCGTCGACAGGTATAGTCTACATGAGATCAACATGAGATCTGTAGAGCAGTGAGATCGACATGATAGTC
8
Single Nucleotide Polymorphisms: Association
Analysis
Cases
[Graphic: a dense block of long DNA sequences]
Controls
[Graphic: a dense block of long DNA sequences]
9
Single Nucleotide Polymorphisms (SNPs)
Cases
[Graphic: a dense block of long DNA sequences]
Controls
[Graphic: a dense block of long DNA sequences]
  • Millions of Common SNPs
  • Correlations between SNPs
  • SNP locations unknown

False Positives
Challenges
10
  • Successor to the Human Genome Project
  • International consortium that aims to genotype
    the genomes of 270 individuals from four
    different populations.
  • Launched in 2002. First phase was finished in
    October (Nature, 2005).
  • Collected genotypes for 3.9 million SNPs.
  • Location and correlation structure of many common
    SNPs.

11
Public Genotype Data Growth
  • More SNPs increase genome coverage in association
    studies.
  • More genotypes allow for discovery of weaker
    associations.

12
Some Computational Challenges
  • Genetics - identifying disease genes
  • Haplotype phasing - preprocessing SNPs
  • Association study design
  • Association study analysis
  • Population stratification
  • Inferring evolutionary processes (recombination
    rates, selection, haplotype ancestry).
  • Etc
  • Genomics - functions of disease genes
  • Predicting functional effect of variation
  • Understanding disease effect on gene regulation
  • Understanding disease effect on metabolic
    pathways
  • Combining systems biology with genetics
  • Etc

13
Haplotype Phasing using Imperfect Phylogeny
14
Haplotype Phasing
Haplotypes
ATCCGA AGACGC
  • High-throughput, cost-effective sequencing
    technology gives genotypes, not haplotypes.

15
Haplotype Limited Diversity
  • Previous studies on local haplotype structure
  • (Daly et al., 2001) chromosome 5q31.
  • (Patil et al., 2001) chromosome 21.
  • Study findings
  • The SNPs on each haplotype are correlated.
  • SNPs can be separated into blocks of limited
    diversity.
  • Local regions have few haplotypes.

16
Haplotype Data in a Block
(Daly et al., 2001) Block 6 from Chromosome 5q31
17
Example: Phasing
Genotypes:
22222222
22000001
22022002
22222222
22000001
22022002
22000001
Maximum Likelihood Haplotype Inference is
an NP-Hard Problem
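
To see why the search space is large, the following Python sketch enumerates every haplotype pair that is compatible with one genotype. It assumes the common encoding in which 0 and 1 mark homozygous sites and 2 marks a heterozygous site; the slide does not state its encoding, so this is an assumption, and `compatible_pairs` is an illustrative name.

```python
from itertools import product


def compatible_pairs(genotype):
    """List all unordered haplotype pairs that could produce `genotype`.

    Assumed encoding: '0' or '1' = homozygous for that allele,
    '2' = heterozygous (the two haplotypes differ at that site).
    """
    het_sites = [i for i, g in enumerate(genotype) if g == "2"]
    pairs = set()
    for bits in product("01", repeat=len(het_sites)):
        h1 = list(genotype)
        h2 = list(genotype)
        for site, bit in zip(het_sites, bits):
            h1[site] = bit
            h2[site] = "1" if bit == "0" else "0"  # complementary allele
        # Sort so that (h1, h2) and (h2, h1) count as the same pair.
        pairs.add(tuple(sorted(("".join(h1), "".join(h2)))))
    return sorted(pairs)


for pair in compatible_pairs("22000001"):
    print(pair)
print(len(compatible_pairs("22222222")))
```

A genotype with k heterozygous sites has 2^(k-1) compatible pairs (for k of at least 1), so the all-heterozygous genotype above already has 128 candidate explanations before any other individual is considered.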
18
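Narrowing the Search: Perfect Phylogeny
  • A directed phylogenetic tree.
  • {0,1} alphabet.
  • Each site mutates at most once.
  • No recombination.

Tree (root 00000; each edge is labeled with the site that mutates):
00000 --2--> 01000
01000 --1--> 11000
01000 --5--> 01001
11000 --3--> 11100
11100 --4--> 11110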
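
A standard way to check whether a set of binary haplotypes fits a perfect phylogeny is the four-gamete test: no pair of sites may show all four combinations 00, 01, 10, and 11. The Python sketch below applies that test to the six haplotypes in the tree on this slide; it is an illustration of the model, not the algorithm used by the talk, and the function name is hypothetical.

```python
from itertools import combinations


def admits_perfect_phylogeny(haplotypes):
    """Four-gamete test on equal-length binary haplotype strings."""
    n_sites = len(haplotypes[0])
    for i, j in combinations(range(n_sites), 2):
        gametes = {(h[i], h[j]) for h in haplotypes}
        if len(gametes) == 4:  # 00, 01, 10 and 11 all present
            return False
    return True


tree_haplotypes = ["00000", "01000", "11000", "01001", "11100", "11110"]
print(admits_perfect_phylogeny(tree_haplotypes))              # True
print(admits_perfect_phylogeny(tree_haplotypes + ["10001"]))  # False
```

Adding 10001 breaks the model because sites 1 and 5 then show all four gametes, which would require a repeated mutation or a recombination.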
19
The Perfect Phylogeny Haplotype Problem (PPH)
  • Given genotypes over a short region.
  • Find compatible haplotypes which correspond to a
    perfect phylogeny tree.
  • (Gusfield, 2002.)
  • PPH deficiency: the data does not fit the model.

20
Solving PPH
  • A very simple $O(nm^2)$ algorithm for the PPH problem.
    (Also Gusfield, 2002; Bafna et al., 2003.)

But in practice, we do not expect to see
perfect phylogeny in biological data.
We extend our algorithms to the case where the
data is almost perfect phylogeny.
Eskin, Halperin, Karp. "Large Scale
Reconstruction of Haplotypes from Genotype
Data." RECOMB 2003.
21
HAP Algorithm
  • HAP Local Predictions
  • http://research.calit2.net/hap/
  • Over 6,000 users of webserver.
  • Main Ideas
  • Imperfect Phylogeny
  • Maximum Likelihood Criterion
  • Extremely efficient.
  • Orders of magnitude faster than other algorithms.

Eskin, Halperin, Karp. "Large Scale
Reconstruction of Haplotypes from Genotype
Data." RECOMB 2003.
22
Public Genotype Data Growth
Eskin, Halperin, Karp RECOMB 2003
23
Phasing Methods
  • HAP is one of many phasing algorithms.
  • Clark, 1990; Excoffier and Slatkin, 1995; PHASE
    (Stephens et al., 2001); HAPLOTYPER (Niu et al.,
    2002); Gusfield, 2000; Lancia et al., 2001; and many
    more.

Algorithms were designed for only 4-12 SNPs!
How do we phase entire chromosomes?
HAP tiling extension: phasing for long
regions. Leverages the speed of HAP.
24
Scaling to Whole Genomes: HAP-TILE
genotypes
Local predictions
  • For each window we compute the haplotypes
    using HAP
  • We tile the windows using dynamic programming

25
Haplotype Tiling Problem
(ignoring homozygous positions)
001000 110111 010000 101111 011111 100000
000101 111010 000011 111100
100110 011001
00100000110 11011111001
  • NP-Hard Problem
  • Dynamic Programming Solution
  • (Eskin et al. 2004.)

26
Incorporating Physical Length
Length-based Prediction Confidences
(minimum weighted number of conflicts)
10kb: 4.2, 12kb: 3.8, 38kb: 1.2, 43kb: 0.9, 22kb: 2.2, 14kb: 3.6
001000 110111 010000 101111 011111 100000
000101 111010 001011 110100
100110 011001

00100000110 11011111001
27
Phasing Running Time Comparison (Phaseoff
Competition)
Marchini et al. American Journal of Human
Genetics, 2006.
28
Phasing Running Time Comparison (NCBI dbSNP
Benchmark)
29
Public Genotype Data Growth
Eskin, Halperin, Karp RECOMB 2003
30
RECOMB 2003 Submission
31
Genome Haplotype Resource
Haplotype predictions and tag SNPs for all 286
million genotypes, including HapMap data, in NCBI's
dbSNP database are publicly available at
http://www.ncbi.nlm.nih.gov/projects/SNP/
Phase 2 HapMap haplotypes will be available soon.
Noah Zaitlen, Hyun Min Kang, Michael Feolo,
Stephen Sherry, Eran Halperin, Eleazar Eskin.
"Inference and Analysis of Haplotypes from
Combined Genotyping Studies Deposited in dbSNP."
Genome Research 15(11):1594-600. 2005.
David A. Hinds, Laura L. Stuve, Geoffrey B. Nilsen,
Eran Halperin, Eleazar Eskin, Dennis G.
Ballinger, Kelly A. Frazer, David R. Cox.
"Whole-Genome Patterns of Common DNA Variation in
Three Human Populations." Science
307(5712):1072-1079. 18 February 2005.
32
Human Variation and Disease
  • Goals
  • Identify variation that plays a role in human
    disease.
  • Understand function of that variation.
  • Genetic Architecture of Complex Traits
  • Post-HapMap data
  • identifies location of variation.
  • provides structure of variation.
  • Integrated Genomics Approaches
  • Allows the leveraging of existing genomic
    resources such as aligned genomes.
  • Integration of additional data such as gene
    expression.
  • Convergence of Research Areas: Genetics
    (Association) and Bioinformatics (Functional
    Annotation)

33
Human Complex Traits Challenges
  • Little is known about complex disease
  • Models of complex trait factors unknown.
  • Most factors in complex diseases are unknown.
  • Degree of interactions between factors unknown.
  • High degree of genetic variation between
    individuals.
  • Very little phenotypic information on
    individuals.
  • No direct experimentation possible.

34
Weighted Haplotype Association
35
Association Statistics
  • Assume we are given N/2 cases and N/2 control
    individuals.
  • Since each individual has 2 chromosomes, we have
    a total of N case chromosomes and N control
    chromosomes.
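  • At SNP A, let $\hat{p}_A^+$ and $\hat{p}_A^-$ be the observed case
    and control frequencies respectively, and $p_A^+$ and $p_A^-$ the
    true frequencies.
  • We know that
  • $\hat{p}_A^+ \sim N\big(p_A^+,\ p_A^+(1-p_A^+)/N\big)$.
  • $\hat{p}_A^- \sim N\big(p_A^-,\ p_A^-(1-p_A^-)/N\big)$.
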




36
Association Statistics
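  • $\hat{p}_A^+ \sim N\big(p_A^+,\ p_A^+(1-p_A^+)/N\big)$.
  • $\hat{p}_A^- \sim N\big(p_A^-,\ p_A^-(1-p_A^-)/N\big)$.
  • $\hat{p}_A^+ - \hat{p}_A^- \sim N\big(p_A^+ - p_A^-,\ (p_A^+(1-p_A^+) + p_A^-(1-p_A^-))/N\big)$
  • We approximate
  • $p_A^+(1-p_A^+) + p_A^-(1-p_A^-) \approx 2\,p_A(1-p_A)$,
    with $p_A$ the average of $p_A^+$ and $p_A^-$
  • then, if $p_A^+ = p_A^-$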






37
Association Statistic
  • Under the null hypothesis $p_A^+ - p_A^- = 0$.
  • We compute the statistic $S_A$.
  • If $S_A < \Phi^{-1}(\alpha/2)$ or $S_A > -\Phi^{-1}(\alpha/2)$, then the
    association is significant at level $\alpha$.
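
The slide's formula for the statistic did not survive extraction. As a minimal sketch, the Python below assumes the statistic is the observed case/control frequency difference divided by its approximate null standard deviation sqrt(2 p (1 - p) / N), with p taken as the average of the two observed frequencies; that form is suggested by the variance given on the previous slides but is an assumption here. The function names and allele counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def association_statistic(case_count, control_count, n_chrom):
    """Standardized allele-frequency difference between cases and controls.

    case_count / control_count: copies of the allele among the n_chrom
    case chromosomes and the n_chrom control chromosomes.
    """
    p_case = case_count / n_chrom
    p_ctrl = control_count / n_chrom
    p_avg = (p_case + p_ctrl) / 2  # assumed pooled estimate under the null
    return (p_case - p_ctrl) / sqrt(2 * p_avg * (1 - p_avg) / n_chrom)


def is_significant(s, alpha):
    """Two-sided test: s < Phi^-1(alpha/2) or s > -Phi^-1(alpha/2)."""
    threshold = NormalDist().inv_cdf(alpha / 2)  # a negative number
    return s < threshold or s > -threshold


s = association_statistic(600, 500, 1000)
print(round(s, 2), is_significant(s, 0.05))
```

With equal counts in cases and controls the statistic is exactly zero and the test never rejects, which matches the null hypothesis stated above.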

38
Association Power
  • Let's assume that SNP A is causal and $p_A^+ \neq p_A^-$.
  • Given the true $p_A^+$ and $p_A^-$, if we collect N
    individuals and compute the statistic $S_A$, the
    probability that $S_A$ is significant at level $\alpha$
    is the power.
  • Power is the chance of detecting an association
    of a certain strength with a certain number of
    individuals.

39
Association Statistic
  • Let's assume that $p_A^+ \neq p_A^-$; then

40
Association Power
Power of association test
Threshold for significance
Non-centrality parameter.
41
Association Power
  • Statistical power of an association with N
    individuals, non-centrality parameter $\lambda$,
    and significance threshold $\alpha$ is $P(\alpha, \lambda)$.
  • Note that if $\lambda = 0$, power is always $\alpha$.
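
As a rough sketch of the power calculation, the Python below assumes the test statistic is normally distributed with unit variance and mean equal to a non-centrality value `ncp`, and that the test is the two-sided one described earlier. The function name and the choice to pass the non-centrality as a single number are assumptions for illustration.

```python
from statistics import NormalDist


def power(alpha, ncp):
    """Probability that a N(ncp, 1) statistic falls in the two-sided
    rejection region at significance level alpha."""
    z = NormalDist()
    lower = z.inv_cdf(alpha / 2)  # negative threshold
    # P(S < lower) + P(S > -lower) for S ~ N(ncp, 1)
    return z.cdf(lower - ncp) + (1.0 - z.cdf(-lower - ncp))


for ncp in (0.0, 1.0, 2.0, 3.0):
    print(ncp, round(power(0.05, ncp), 3))
```

At zero non-centrality the function returns alpha itself, which is the observation made in the last bullet, and power grows as the non-centrality moves away from zero in either direction.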

42
Indirect Association
  • Now let's assume that we have 2 markers, A and B.
    Let us assume that marker B is the causal
    mutation, but we are observing marker A.
  • If we observed marker B directly our statistic
    would be

43
Indirect Association
  • However, we are observing A where our statistic
    is
  • What is the relation between $S_A$ and $S_B$?

44
Indirect Association
  • We want to relate
  • to

45
Indirect Association
  • We assume conditional probability distributions
    are equal in case and control samples

46
Indirect Association
  • Then

47
Indirect Association
  • Note that

48
Indirect Association
  • How many individuals, $N_A$, do we need to collect
    at marker A to achieve the same power as if we
    collected $N_B$ individuals at marker B?

49
Visualization in terms of Power
Power of association test
Threshold for significance
Non-centrality parameters.
50
Correlating Haplotypes with the Disease
  • The disease may be correlated with a SNP not in
    the panel.
  • The disease may be more correlated with a
    haplotype (group of SNPs) than with any single
    SNP in the panel.
  • Haplotype tests
  • Which haplotypes should we test?
  • Which blocks should we pick?

51
Key Problem: Indirect Association
  • We have the HapMap.
  • Information on 4,000,000 SNPs.
  • Affymetrix gene chip collects information on
    500,000 SNPs.
  • What about the remaining 3,500,000 SNPs?
  • So far, we have designed studies by picking tag
    SNPs with high $r^2$.
  • Can we use the HapMap when performing
    association?
  • Multi-Tag methods.

52
Haplotypes as Proxies for Hidden SNPs (de Bakker
et al., 2005)
53
Haplotypes as Proxies for Hidden SNPs (de Bakker
2005)
54
WHAP - Weighted HAPlotype Analysis
A
0.71AA 0.29AG
55
WHAP - Weighted Haplotypes
A
0.71AA 0.29AG
56
Basic MultiMarker Method
  • For each SNP in HapMap, find the haplotype among
    genotyped SNPs that has the highest $r^2$ with the SNP.
  • Perform association at each SNP and each added
    haplotype.
  • Now instead of performing 500,000 tests, we
    perform 4,000,000 tests.

57
Weighted Haplotype Test
  • For each haplotype h, we assign a weight $w_h$
  • We use a weighted allele frequency statistic
  • This statistic is the weighted numerator in $S_A$.
  • What is the variance of this statistic?
  • Complication: haplotype frequencies are not
    independent!

58
Weighted Haplotype Example
  • Assume we have 4 haplotypes: AB, Ab, aB and ab.
  • If we set the weights so that $w_{AB} = w_{Ab} = 1$ and
    $w_{aB} = w_{ab} = 0$, this is equivalent to looking at the
    single SNP A.
  • If we set the weights so that $w_{AB} = 1$ and
    $w_{Ab} = w_{aB} = w_{ab} = 0$, this is equivalent to looking at
    the single haplotype AB.
  • Other weights give something in between.
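
The two special cases above can be checked with a few lines of Python. This is a minimal sketch with made-up haplotype frequencies; `weighted_difference` is an illustrative name for the weighted case-minus-control frequency, which is what the weights are applied to.

```python
def weighted_frequency(weights, freqs):
    """Sum of haplotype frequencies, each scaled by its weight."""
    return sum(weights[h] * freqs[h] for h in freqs)


def weighted_difference(weights, case_freqs, control_freqs):
    """Weighted case frequency minus weighted control frequency."""
    return (weighted_frequency(weights, case_freqs)
            - weighted_frequency(weights, control_freqs))


# Hypothetical haplotype frequencies, for illustration only.
case_freqs = {"AB": 0.40, "Ab": 0.30, "aB": 0.20, "ab": 0.10}
control_freqs = {"AB": 0.30, "Ab": 0.30, "aB": 0.25, "ab": 0.15}

snp_a_weights = {"AB": 1, "Ab": 1, "aB": 0, "ab": 0}    # single SNP A
hap_ab_weights = {"AB": 1, "Ab": 0, "aB": 0, "ab": 0}   # single haplotype AB

print(weighted_difference(snp_a_weights, case_freqs, control_freqs))
print(weighted_difference(hap_ab_weights, case_freqs, control_freqs))
```

With the first weight vector the result is the difference in the frequency of allele A (0.70 versus 0.60); with the second it is the difference in the frequency of haplotype AB alone.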

59
Variance of $W_h$
  • Use from statistics
  • $\mathrm{Var}(A+B) = \mathrm{Var}(A) + \mathrm{Var}(B) + 2\,\mathrm{Cov}(A,B)$
  • Let us assume that we have a 4-sided die.
  • Let $A_1, A_2, A_3, A_4$ be the events of rolling a 1,
    2, 3, or 4, which have probabilities $p_1, p_2, p_3$, and
    $p_4$ respectively.
  • $\mathrm{Var}(A_i) = p_i(1-p_i)$
  • $\mathrm{Cov}(A_i,A_j) = -p_i p_j$ for $i \neq j$

60
Variance of $W_h$
  • If we are rolling N times, the variance and
    covariance of the observed frequencies are
  • $\mathrm{Var}(\hat{p}_i) = p_i(1-p_i)/N$
  • $\mathrm{Cov}(\hat{p}_i,\hat{p}_j) = -p_i p_j/N$
  • Then in the cases and controls
  • $\mathrm{Var}(\hat{p}_i^+ - \hat{p}_i^-) = 2p_i(1-p_i)/N$
  • If we assign a weight $w_i$ to each side of the die
  • $\mathrm{Var}\big(w_i(\hat{p}_i^+ - \hat{p}_i^-)\big) = w_i^2 \cdot 2p_i(1-p_i)/N$
    (case/controls are independent)
  • $\mathrm{Cov}(w_i\hat{p}_i, w_j\hat{p}_j) = w_i w_j\,\mathrm{Cov}(\hat{p}_i,\hat{p}_j) = -w_i w_j p_i p_j/N$

61
Variance of $W_h$
  • Then, with $i$ and $j$ ranging over the four sides,
\[
\begin{aligned}
N \cdot \mathrm{Var}(w_1\hat{p}_1 + w_2\hat{p}_2 + w_3\hat{p}_3 + w_4\hat{p}_4)
&= N\Big(\sum_{i} \mathrm{Var}(w_i\hat{p}_i) + 2\sum_{i<j} \mathrm{Cov}(w_i\hat{p}_i, w_j\hat{p}_j)\Big) \\
&= \sum_{i} w_i^2 p_i(1-p_i) - 2\sum_{i<j} w_i w_j p_i p_j \\
&= \sum_{i} \big(w_i^2 p_i - w_i^2 p_i^2\big) - 2\sum_{i<j} w_i w_j p_i p_j
\end{aligned}
\]

62
Variance of $W_h$
\[
\begin{aligned}
N \cdot \mathrm{Var}(w_1\hat{p}_1 + w_2\hat{p}_2 + w_3\hat{p}_3 + w_4\hat{p}_4)
&= \sum_{i} \big(w_i^2 p_i - w_i^2 p_i^2\big) - 2\sum_{i<j} w_i w_j p_i p_j \\
&= \sum_{i} w_i^2 p_i - \Big(\sum_{i} w_i^2 p_i^2 + 2\sum_{i<j} w_i w_j p_i p_j\Big) \\
&= w_1^2 p_1 + w_2^2 p_2 + w_3^2 p_3 + w_4^2 p_4 - (w_1 p_1 + w_2 p_2 + w_3 p_3 + w_4 p_4)^2
\end{aligned}
\]
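
The closed form at the end of this derivation can be checked numerically. The sketch below computes N times the variance of the weighted sum in two ways: directly from the standard multinomial covariance matrix (variance p_i(1 - p_i) on the diagonal, covariance -p_i p_j off it), and from the closed form (sum of w_i^2 p_i minus the square of the sum of w_i p_i). The weights and probabilities are made up for illustration.

```python
def variance_from_covariances(w, p):
    """N * Var(sum_i w_i * p_hat_i) using the multinomial covariance matrix."""
    total = 0.0
    for i in range(len(p)):
        for j in range(len(p)):
            cov = p[i] * (1 - p[i]) if i == j else -p[i] * p[j]
            total += w[i] * w[j] * cov
    return total


def variance_closed_form(w, p):
    """sum_i w_i^2 p_i - (sum_i w_i p_i)^2"""
    second_moment = sum(wi * wi * pi for wi, pi in zip(w, p))
    mean = sum(wi * pi for wi, pi in zip(w, p))
    return second_moment - mean * mean


p = [0.4, 0.3, 0.2, 0.1]          # hypothetical die probabilities
w = [1.0, 0.5, 0.25, 0.0]         # hypothetical weights
print(variance_from_covariances(w, p), variance_closed_form(w, p))

# With weights (1, 1, 0, 0) the result reduces to q(1 - q) for q = p_1 + p_2.
print(variance_closed_form([1, 1, 0, 0], p))
```

The second print shows the single-SNP special case: grouping the first two sides gives a frequency of 0.7 and a variance term of 0.7 times 0.3.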




63
The ?-test
  • Each haplotype h is assigned a weight $w_h$.
  • N is the number of individuals.
  • $p_h$ - the probability for h in cases/controls, or
    average.
  • Under the null, the ?-test is $\chi^2$ distributed.

64
Non-Centrality Parameter
  • Under weights $w_1, w_2, w_3, w_4$ and true case/control
    probabilities $p_1^+, p_2^+, p_3^+, p_4^+$ and
    $p_1^-, p_2^-, p_3^-, p_4^-$, $W_h$ is expected to be
  • When normalizing for the variance, the
    non-centrality parameter is

65
Wh and indirect association
  • Let us assume that SNP C is causal with
    non-centrality parameter $\lambda_C$.
  • If we perform weighted haplotype association, the
    non-centrality parameter is $\lambda_h$.
  • How are they related? (i.e., what is the power of
    the weighted haplotype association test?)
  • Using the same technique, we can show that $\lambda_C r_h = \lambda_h$,
    where $r_h$ is the conceptual equivalent of $r$ in
    the 2-SNP case.

66
Indirect Association (Flashback)
  • We want to relate
  • to

67
Indirect Association (Flashback)
  • Since conditional probability distributions are
    equal in case and control samples

68
Indirect Association (Wh)
  • Since conditional probability distributions are
    equal in case and control samples

69
Indirect Association (Flashback)
  • Then

70
Indirect Association (Wh)
71
Indirect Association (Flashback)
  • How many individuals, $N_A$, do we need to collect
    at marker A to achieve the same power as if we
    collected $N_B$ individuals at marker B?

72
The Relation to Power
The power of detecting the SNP with N individuals
is the same as using the tag SNPs with $N/r_h^2$
individuals.
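
For the simplest case of one tag SNP, where the correlation is the ordinary pairwise r^2, the sample-size inflation can be computed directly. The Python sketch below uses the standard definition r^2 = D^2 / (p_A (1 - p_A) p_B (1 - p_B)) with D = p_AB - p_A p_B; the function names and the frequencies are hypothetical, and the weighted-haplotype correlation itself is not computed here.

```python
def r_squared(p_ab, p_a, p_b):
    """Pairwise r^2 from the AB haplotype frequency and the two allele frequencies."""
    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))


def individuals_needed_at_tag(n_causal, r2):
    """Sample size at the tag giving the same power as n_causal at the causal SNP."""
    return n_causal / r2


r2 = r_squared(p_ab=0.25, p_a=0.4, p_b=0.3)
print(round(r2, 3), round(individuals_needed_at_tag(1000, r2)))
```

When the tag and the causal SNP are perfectly correlated, r^2 is 1 and no extra individuals are needed; as r^2 falls, the required sample size grows in inverse proportion.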
73
Setting the weights
  • Power depends on $r_h$. We want to set the weights
    so that $r_h$ is maximized.
  • Proving what is the maximum is not so easy.

74
Choosing the Weights
  • Optimal weights
  • $w_h(s_5) = P(s_5 = A \mid h) = q_{Ah}$

75
Choosing the Weights
  • Optimal weights
  • $w_h(s_5) = P(s_5 = A \mid h) = q_{Ah}$

76
The Relation to Power
  • This is exactly $r^2$ in the case of one tag SNP.
  • WHAP always has at least as much power as
  • single SNP test
  • single haplotype test
  • haplotype group test
  • $\chi^2$ with k degrees of freedom.

77
WHAP FAQ
  • Why is WHAP always more powerful?
  • What if there is a SNP with $r^2 = 1$?
  • What if the weights are set wrong?
  • Does the false positive rate increase?
  • Does the power decrease?
  • Mathematically what does this affect?
  • Why can the weights be wrong?
  • i.e. What are the underlying assumptions of the
    method.

78
Apply tests $T_1, \ldots, T_{4M}$
Cases: 0.5M SNPs
Controls: 0.5M SNPs
HapMap: 4M SNPs
Use as training dataset to get the weights
Tests $T_1, \ldots, T_{4M}$
Positive results give evidence for a causal SNP
- can be verified by a follow-up/two-stage study.
79
How Many SNPs are Captured?
81
Power Simulations
  • Relative power to using all SNPs.
  • Tested on the ENCODE regions, Affy 500k tag
    SNPs.

82
Practical Issues
  • We assume we have the haplotype frequencies in
    the HapMap (not the phase).
  • We assume the case/control populations are coming
    from the same population as the HapMap.
  • Over-fitting
  • Train with half of the data, test the other half.
  • No correlation between the haps and random SNPs.

84
WHAP $r^2$ in a region. Red lines are collected
SNPs. Blue lines are $r_h^2$ values.
85
Associations using WHAP. Red lines are
associations at collected SNPs. Blue lines are
associations at uncollected SNPs inferred by WHAP.
87
Optimal Genome Wide Tagging by Reduction to SAT
88
Correlation Structure
89
Example $r^2$ Matrix
90
Graph Representation
91
Satisfiability and SAT Solvers
  • Boolean variables called literals
  • Logical operators
  • AND ∧
  • OR ∨
  • NOT ¬
  • Example
  • (s1 ? s2) ? (s2 ? s3 ? s1)
  • s1 false s2 false s3 true

92
Negation Normal Form
A. Darwiche
rooted DAG (Circuit)
93
CNF Form and Logical Solutions
94
NNF Form of Solutions
95
Local Single SNP $r^2$ Tagging
  • Generate a clause for each SNP
  • Clause for SNP $s_i$ contains all covers
  • Input CNF as conjunction of all clauses
  • Compile with minSAT solver
  • Find solutions by traversal of NNF
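
The clause construction above can be illustrated without a SAT solver. In the Python sketch below, each SNP gets a "clause" listing the SNPs that cover it (itself, plus any SNP whose r^2 with it meets a threshold), and a brute-force search stands in for the minSAT compilation step: it returns a smallest set of tags that satisfies every clause. The r^2 matrix, the 0.8 threshold, and the function names are all assumptions for illustration, and brute force is only practical for tiny inputs.

```python
from itertools import combinations


def cover_clauses(r2, threshold):
    """clause[i] = set of SNPs that can tag SNP i (i itself or a strong neighbor)."""
    n = len(r2)
    return [{j for j in range(n) if j == i or r2[i][j] >= threshold}
            for i in range(n)]


def minimum_tag_set(r2, threshold=0.8):
    """Smallest set of SNPs such that every clause contains a chosen tag."""
    n = len(r2)
    clauses = cover_clauses(r2, threshold)
    for size in range(1, n + 1):
        for tags in combinations(range(n), size):
            chosen = set(tags)
            if all(clause & chosen for clause in clauses):
                return sorted(chosen)
    return []  # only reached for an empty matrix


# Hypothetical symmetric r^2 matrix for four SNPs.
r2 = [
    [1.00, 0.90, 0.10, 0.00],
    [0.90, 1.00, 0.85, 0.10],
    [0.10, 0.85, 1.00, 0.20],
    [0.00, 0.10, 0.20, 1.00],
]
print(minimum_tag_set(r2))  # [1, 3]
```

SNP 1 covers SNPs 0, 1 and 2, while SNP 3 is correlated with nothing and must tag itself, so two tags are the minimum for this matrix.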

96
Optimal Tagging
97
Whole Genome Tagging
98
MultiMarker Example
99
MultiMarker Tagging
100
Best N-tagging
  • Fixed budget of SNPs
  • Maximize function such as
  • Average $r^2$
  • Power
  • Switch from minSAT to maxSAT
  • Each literal $s_i$ has a weight $w_i$ (+, -)
  • Choose literals that maximize weight
  • Enforce constraints with infinite weight
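
A fixed-budget version of the problem can be sketched the same way. The Python below picks the N tags that maximize the average, over all SNPs, of the best r^2 to any chosen tag. Exhaustive search replaces the weighted maxSAT formulation, and the objective, the matrix, and the function name are assumptions made for illustration.

```python
from itertools import combinations


def best_n_tags(r2, budget):
    """Choose `budget` tags maximizing the average best-r^2 over all SNPs."""
    n = len(r2)

    def average_r2(tags):
        # Each SNP is scored by its strongest correlation with a chosen tag.
        return sum(max(r2[i][t] for t in tags) for i in range(n)) / n

    return max(combinations(range(n), budget), key=average_r2)


# Hypothetical symmetric r^2 matrix for four SNPs.
r2_matrix = [
    [1.00, 0.90, 0.10, 0.00],
    [0.90, 1.00, 0.85, 0.10],
    [0.10, 0.85, 1.00, 0.20],
    [0.00, 0.10, 0.20, 1.00],
]
print(best_n_tags(r2_matrix, 1))  # (1,)
print(best_n_tags(r2_matrix, 2))  # (1, 3)
```

Unlike the minimum-cover version, this objective has no hard threshold: with a budget of one, SNP 3 is simply left poorly covered in exchange for good coverage of the other three.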

101
Adder Circuit
102
UCLA: Adnan Darwiche, Arthur Choi, Knot Pipatswisawat
ICSI: Eran Halperin, Richard Karp
Perlegen Sciences: David Hinds, David Cox
Ph.D. Students: Buhm Han, Nils Homer, Hyun Min Kang, Sean O'Rourke, Jimmie Ye, Noah Zaitlen
Webserver Hosted By