Handling and analyzing data from a genome-wide association study

1
Handling and analyzing data from a genome-wide
association study
  • Laura Scott
  • Biostatistics Department
  • Center for Statistical Genetics
  • University of Michigan

2
Outline
  • Storing large amounts of genotype data
  • Quality control
  • Generating initial association analysis
  • Viewing results
  • Imputation of missing SNP genotypes
  • Storing results and planning specialized analysis

5
Genotype data is huge
  • 500,000 SNPs × 2000 cases and controls =
    1,000,000,000 genotypes!
  • Need compact ways to store data
  • If we store each genotype as 00, 01, or 11, we
    will have a file that looks like
  • 100011001010001001001100...
  • 000000100001000010001000...
  • Total file space for 300K SNPs: 4 gigabytes
  • Largest chromosome file: 0.4 gigabytes
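
One possible compact layout, sketched here as an illustration and not taken from the slides, packs four genotypes into each byte (2 bits per genotype). The code values (0, 1, 2 allele copies, 3 for missing) and all function names are assumptions; real programs define their own binary formats.

```python
# Hypothetical 2-bit codes: 0, 1, 2 = copies of a chosen allele, 3 = missing.
MISSING = 3


def pack_genotypes(genotypes):
    """Pack a sequence of 2-bit genotype codes into bytes, four per byte."""
    packed = bytearray((len(genotypes) + 3) // 4)
    for i, g in enumerate(genotypes):
        if not 0 <= g <= 3:
            raise ValueError(f"genotype code out of range: {g}")
        packed[i // 4] |= g << (2 * (i % 4))
    return bytes(packed)


def unpack_genotypes(packed, n):
    """Recover the first n genotype codes from a packed byte string."""
    return [(packed[i // 4] >> (2 * (i % 4))) & 0b11 for i in range(n)]


def packed_size_bytes(n_snps, n_samples):
    """Bytes needed when each sample's row is padded to a whole byte."""
    return n_samples * ((n_snps + 3) // 4)


row = [0, 1, 2, MISSING, 2, 0]
assert unpack_genotypes(pack_genotypes(row), len(row)) == row
print(packed_size_bytes(500_000, 2000))  # 250000000 bytes
```

Under these assumptions the 1,000,000,000 genotypes above fit in 250 megabytes, before any index or header information.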

6
Need to do extensive planning for genotype data
before it arrives
  • Chromosome datasets are too large for SAS and
    other commonly used analytic packages
  • Need programs to select and write out genotype
    data in multiple formats
  • Tests of procedures with large-scale trial
    datasets

7
Gather other data needed for analysis
  • SNP information: chromosome, position
  • SNP annotation: gene, function
  • Translation of called allele to a standard allele
    (example: forward strand of a given genome build)
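
A minimal sketch of the allele translation step, assuming each SNP comes with a strand label of "+" or "-" relative to the chosen genome build. The function names and the strand encoding are hypothetical, not part of any specific annotation file format.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def to_forward(alleles, strand):
    """Return the alleles as they read on the forward strand."""
    if strand == "+":
        return tuple(alleles)
    if strand == "-":
        return tuple(COMPLEMENT[a] for a in alleles)
    raise ValueError(f"unknown strand: {strand!r}")


def is_strand_ambiguous(alleles):
    """A/T and C/G SNPs read the same on both strands, so a strand
    mix-up cannot be detected from the allele names alone."""
    a, b = alleles
    return COMPLEMENT[a] == b


print(to_forward(("A", "G"), "-"))      # ('T', 'C')
print(is_strand_ambiguous(("A", "T")))  # True
```

Flagging A/T and C/G SNPs separately is a design choice: for those SNPs the strand has to come from the annotation, because the alleles themselves give no hint.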

8
How good is the data?
  • Identify and remove bad samples and SNPs
  • Compute summary statistics
  • Percent successfully genotyped samples
  • Average genotyping success rate
  • Duplicate sample error rate
  • Non-Mendelian inheritance error rates (errors not
    consistent with normal transmission of
    chromosomes in family members)
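
Two of these summary statistics can be sketched as follows, assuming autosomal SNPs and genotypes coded as the number of copies of one allele (0, 1, 2), with None for a missing call. The coding and function names are illustrative assumptions.

```python
def duplicate_discordance(g1, g2):
    """Fraction of SNPs called differently in two runs of the same
    sample, among SNPs called in both runs."""
    both = [(a, b) for a, b in zip(g1, g2) if a is not None and b is not None]
    if not both:
        return float("nan")
    return sum(a != b for a, b in both) / len(both)


def transmissible(genotype):
    """Allele counts (0 or 1) that a parent with this genotype can pass on."""
    return {0: {0}, 1: {0, 1}, 2: {1}}[genotype]


def is_mendelian_error(father, mother, child):
    """True if the child's genotype cannot arise from the parents' genotypes."""
    possible = {f + m for f in transmissible(father) for m in transmissible(mother)}
    return child not in possible


print(is_mendelian_error(0, 0, 1))  # True: neither parent carries the allele
print(is_mendelian_error(1, 1, 2))  # False
```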

9
Identify bad samples and remove
  • Poor quality samples
  • Sample genotype success rate < 95% to 97.5%
  • Greater proportion of heterozygous genotypes than
    expected
  • Related individuals (if independent samples)
  • Based on pair-wise comparisons of similarity of
    genotypes
  • Sample switches
  • Wrong sex
  • Regions of homozygosity in cell line
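
The first two sample checks can be sketched as below, assuming genotypes coded 0, 1, 2 with None for missing. The default call-rate threshold follows the range on the slide; the heterozygosity range is a placeholder that would have to be chosen from the data, and all names are hypothetical.

```python
def sample_call_rate(genotypes):
    """Fraction of SNPs with a non-missing genotype for one sample."""
    return sum(g is not None for g in genotypes) / len(genotypes)


def sample_heterozygosity(genotypes):
    """Fraction of called genotypes that are heterozygous."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return float("nan")
    return sum(g == 1 for g in called) / len(called)


def flag_samples(samples, min_call_rate=0.95, het_range=None):
    """Return {sample_id: [reasons]} for samples that fail a check.

    samples maps sample id -> list of genotypes; het_range is an optional
    (low, high) pair of acceptable heterozygosity values.
    """
    flagged = {}
    for sample_id, genotypes in samples.items():
        reasons = []
        if sample_call_rate(genotypes) < min_call_rate:
            reasons.append("low call rate")
        if het_range is not None:
            het = sample_heterozygosity(genotypes)
            if not het_range[0] <= het <= het_range[1]:
                reasons.append("unusual heterozygosity")
        if reasons:
            flagged[sample_id] = reasons
    return flagged
```

Checks for relatedness, sample switches and sex need pair-wise comparisons or sex-chromosome data and are not covered by this sketch.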

10
Identify poor quality SNPs and remove
  • Observed genotype proportions are not consistent
    with those expected from the allele frequencies
    (Hardy-Weinberg equilibrium (HWE))
  • HWE p-value < 10^-4 to 10^-6
  • Look for deviation from expected distribution of
    p-values under the null
  • Genotyping success rate < 95%
  • Duplicate sample or Non-Mendelian error rate is
    elevated
  • Differential missingness in cases and controls
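
A minimal sketch of an HWE check using the 1-degree-of-freedom chi-square goodness-of-fit test. This is one common choice and an assumption here; the slides do not say which test was used, and an exact test is often preferred when genotype counts are small.

```python
import math


def hwe_chisq_pvalue(n_aa, n_ab, n_bb):
    """P-value of the 1-df chi-square test for Hardy-Weinberg equilibrium,
    given the observed counts of the three genotypes at one SNP."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    if p in (0.0, 1.0):
        return 1.0  # monomorphic SNP: nothing to test
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chisq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chisq / 2))


print(hwe_chisq_pvalue(25, 50, 25))  # 1.0, counts match HWE exactly
print(hwe_chisq_pvalue(50, 0, 50) < 1e-6)  # True, no heterozygotes at all
```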

11
Programs are available for large scale quality
control analysis
  • Plink
  • Duplicate error rates, sample relatedness, HWE, ...
  • Developed by Shaun Purcell
  • http://pngu.mgh.harvard.edu/purcell/plink/
  • GAINQC: software used for the quality control
    analysis of the GAIN project
  • Duplicate error rates, sample relatedness, HWE, ...
  • Developed by Shyam Gopalakrishnan and Goncalo
    Abecasis
  • Available from gopalakr_at_umich.edu

12
Initial analysis is straightforward once you have
everything in place
  • Case/control association
  • Use test that is not affected by deviations from
    HWE
  • Cochran-Armitage test for trend
  • Equivalent to score test in logistic regression
  • TDT or other family-based test
  • Quantitative trait association
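
The Cochran-Armitage trend test can be sketched for a single SNP as follows, using the textbook 2 × 3 table formula with additive weights (0, 1, 2). The function name and argument layout are assumptions; this is an illustration, not the code of any of the programs named in this talk.

```python
import math


def armitage_trend_test(cases, controls, weights=(0, 1, 2)):
    """Cochran-Armitage trend test for one SNP.

    cases and controls are genotype counts ordered by number of copies
    of the tested allele (0, 1, 2). Returns (chi-square, p-value), 1 df.
    """
    R, S = sum(cases), sum(controls)
    N = R + S
    cols = [r + s for r, s in zip(cases, controls)]
    t = sum(w * (r * S - s * R) for w, r, s in zip(weights, cases, controls))
    k = len(cols)
    var = (R * S / N) * (
        sum(w * w * n * (N - n) for w, n in zip(weights, cols))
        - 2 * sum(
            weights[i] * weights[j] * cols[i] * cols[j]
            for i in range(k)
            for j in range(i + 1, k)
        )
    )
    if var == 0:
        return 0.0, 1.0  # no genotype or phenotype variation
    chisq = t * t / var
    return chisq, math.erfc(math.sqrt(chisq / 2))


chisq, p = armitage_trend_test(cases=(20, 50, 30), controls=(40, 45, 15))
print(round(chisq, 2))  # 11.79
```

The statistic depends only on genotype counts, not on allele counts that assume HWE, which is why it fits the requirement stated above.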

13
Programs are available for large scale
case-control or family-based analysis
  • Plink
  • Case/control, TDT, quantitative traits
  • Developed by Shaun Purcell
  • http://pngu.mgh.harvard.edu/purcell/plink/
  • Merlin
  • Quantitative traits in independent samples or
    families, ability to impute genotypes for untyped
    individuals based on genotyped family members
  • Developed by Goncalo Abecasis
  • http://www.sph.umich.edu/csg/abecasis/Merlin/

14
Are the results believable?
  • Are stronger associations correlated with poorer
    quality control measures?
  • Is there a strong deviation from expected
    distribution of p-values?
  • Is there confounding from differences in the
    genetic origins of case and control samples
    (population stratification)?
  • Genomic control
  • Eigenstrat analysis
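
Genomic control can be sketched as below, assuming the input is a list of 1-df chi-square statistics from the genome-wide scan. The inflation factor is the observed median divided by the median of a 1-df chi-square distribution (about 0.455). Function names are hypothetical, and Eigenstrat-style analysis is not covered.

```python
import statistics

CHI2_1DF_MEDIAN = 0.4549364  # median of a chi-square distribution with 1 df


def genomic_inflation(chisq_stats):
    """Inflation factor lambda: observed median over the expected median."""
    return statistics.median(chisq_stats) / CHI2_1DF_MEDIAN


def genomic_control_adjust(chisq_stats):
    """Divide every statistic by lambda (never inflating when lambda < 1)."""
    lam = max(genomic_inflation(chisq_stats), 1.0)
    return [x / lam for x in chisq_stats]
```

A lambda close to 1 suggests little overall inflation; a clearly larger value points toward stratification or other systematic problems.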

15
Seeing from many different angles is believing
(sometimes)
  • Plink graphical output
  • User added custom tracks in the UCSC browser
  • http://genome.ucsc.edu/
  • http://genome.ucsc.edu/goldenPath/help/hgTracksHelp.html#CustomTracks
  • Homemade graphs
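
One way to produce a custom track is to write association results in bedGraph format, with -log10(p) as the plotted value. The sketch below assumes results arrive as (chromosome, 1-based position, p-value) tuples with UCSC-style chromosome names; the function name and track name are illustrative.

```python
import math


def bedgraph_lines(results, name="GWAS -log10(p)"):
    """Yield the lines of a bedGraph custom track.

    results: iterable of (chrom, position_1based, p_value).
    bedGraph coordinates are 0-based with an exclusive end, so a SNP at
    1-based position pos covers the interval [pos - 1, pos).
    """
    yield f'track type=bedGraph name="{name}"'
    for chrom, pos, p in sorted(results):
        yield f"{chrom}\t{pos - 1}\t{pos}\t{-math.log10(p):.3f}"


for line in bedgraph_lines([("chr1", 100, 0.001)]):
    print(line)
```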

16
FUSION T2D association
17
Many different ways to display similar data
Zeggini et al. (2007) Science 316:1336-1341
Diabetes Genetics Initiative (2007) Science 316:1331-1336
Scott et al. (2007) Science 316:1341-1345
18
Getting more for your genotyping dollars:
imputation of SNP genotypes
  • Impute/predict genotypes for
  • Missing data within genotyped markers
  • Untyped markers
  • Uses haplotype structure of existing sample such
    as HapMap samples to infer data for samples with
    sparser marker set

19
Observed genotypes
Study Sample
HapMap
Gonçalo Abecasis
20
Identify match among reference
Gonçalo Abecasis
21
Phase chromosomes, impute missing genotypes
Gonçalo Abecasis
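
The idea on the last three slides can be illustrated with a deliberately simplified toy: take one study haplotype typed at a few markers, find the reference haplotype that matches it best, and copy that reference's alleles into the untyped positions. This assumes already-phased haplotypes and a single best match; actual imputation programs model unphased genotypes, mosaics of several reference haplotypes, and uncertainty, so this is not their algorithm.

```python
def impute_from_reference(study_hap, reference_haps):
    """Fill untyped positions of one haplotype from its closest reference.

    study_hap: list of alleles (0/1) with None at untyped markers.
    reference_haps: list of complete haplotypes over the same markers.
    """
    def mismatches(ref):
        # Count disagreements at typed markers only.
        return sum(s is not None and s != r for s, r in zip(study_hap, ref))

    best = min(reference_haps, key=mismatches)
    return [r if s is None else s for s, r in zip(study_hap, best)]


reference = [[0, 0, 0, 0, 0], [0, 1, 1, 1, 1], [1, 1, 0, 0, 0]]
print(impute_from_reference([0, None, 1, None, 1], reference))
# [0, 1, 1, 1, 1]
```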
22
Imputing genotype data allows much more thorough
analysis
  • Allows testing of untyped variation
  • Allows easy combination of data across genotyping
    platforms
  • Provides complete data for analysis with multiple
    SNPs

23
Imputed data takes care to generate, analyze and
understand
  • Requires large scale computing resources
  • Need to assess quality of imputation
  • Compare imputed genotypes to actual genotypes
  • Error rates are higher than for genotyped SNPs
  • Works less well for rarer alleles
  • Best to take account of the uncertainty of
    imputed SNPs in analysis
  • Need ways to take into account fractional
    genotype counts
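
A fractional genotype count can be represented as an expected allele dosage computed from the three imputed genotype probabilities. The sketch below also shows a simple concordance check against actual genotypes; the probability layout (P(0 copies), P(1 copy), P(2 copies)) and the function names are assumptions.

```python
def expected_dosage(probs):
    """Expected allele count from (P(0 copies), P(1 copy), P(2 copies))."""
    p0, p1, p2 = probs
    return p1 + 2 * p2


def best_guess(probs):
    """Most likely genotype (0, 1 or 2)."""
    return max(range(3), key=lambda g: probs[g])


def concordance(imputed_probs, actual):
    """Fraction of best-guess imputed genotypes that equal the actual
    genotypes, skipping positions where the actual call is missing."""
    pairs = [(best_guess(p), a) for p, a in zip(imputed_probs, actual) if a is not None]
    if not pairs:
        return float("nan")
    return sum(g == a for g, a in pairs) / len(pairs)
```

Using the dosage as the predictor in a regression keeps the imputation uncertainty in the analysis, whereas the best-guess genotype discards it.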

24
Imputation programs are available
  • IMPUTE
  • Developed by Jonathan Marchini
  • Nature Genetics, Advance online publication
  • http://www.stats.ox.ac.uk/marchini/software
  • MACH 1.0, Markov Chain Haplotyping
  • Developed by Goncalo Abecasis
  • http://www.sph.umich.edu/csg/abecasis/MACH/

25
Need to store results and prepare for large scale
specialized analysis
  • System to store, view, merge results
  • SQL database
  • Plink
  • Testing speed of specialized analyses in
    different statistical packages
  • Potential development of software to run large
    scale specialized analysis
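
A minimal sketch of a results store using SQLite through Python's standard sqlite3 module. The table name, columns and function names are hypothetical; the point is only that results from several analyses can sit in one table and be merged or ranked with SQL.

```python
import sqlite3


def create_results_db(path=":memory:"):
    """Open (or create) a results database with one row per SNP and analysis."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS assoc (
               snp TEXT, chrom TEXT, pos INTEGER,
               analysis TEXT, p_value REAL,
               PRIMARY KEY (snp, analysis))"""
    )
    return conn


def store_results(conn, analysis, rows):
    """rows: iterable of (snp, chrom, pos, p_value) for one analysis."""
    conn.executemany(
        "INSERT OR REPLACE INTO assoc VALUES (?, ?, ?, ?, ?)",
        [(snp, chrom, pos, analysis, p) for snp, chrom, pos, p in rows],
    )
    conn.commit()


def top_hits(conn, analysis, limit=10):
    """Smallest p-values for one analysis."""
    return conn.execute(
        "SELECT snp, chrom, pos, p_value FROM assoc "
        "WHERE analysis = ? ORDER BY p_value LIMIT ?",
        (analysis, limit),
    ).fetchall()
```

For example, `store_results(conn, "trend", rows)` followed by `top_hits(conn, "trend", 20)` would return the 20 strongest signals from a trend-test run.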

26
Summary: ideally what needs to happen before
getting the data
  • Ability to store, select and write out genotype
    data in multiple formats for quality control and
    association analysis
  • Identification of primary quality control and
    analysis programs
  • Systems to store, view, merge results
  • Adequate computing resources to do intensive
    computing
  • Testing of standard and specialized processes
    with large-scale trial datasets

27
FUSION study
U Michigan: Gonçalo Abecasis, Yun Li, Jun Ding, Paul Scheet
CIDR: Kimberly Doheny, Elizabeth Pugh
U Michigan: Michael Boehnke, Karen Conneely, Charles Ding,
William Duren, Terry Gliedt, Larry Hu, Anne Jackson, Xiao-Yi Li,
Andrew Skol, Heather Stringham, Peggy White, Cristen Willer,
Fang Xiang, Rui Xiao
NHGRI / NIH: Francis Collins, Lori Bonnycastle, Peter Chines,
Michael Erdos, Narisu Narisu, L. Prokunina-Olsson, Nancy Riebow,
Andrew Sprau, Amy Swift, Maurine Tong
Calvin College: Randall Pruim
USC: Richard Bergman, Thomas Buchanan, Richard Watanabe
UNC-Chapel Hill: Karen Mohlke, Kyle Gaulton, Jason Luo, Li Qin
National Public Health Institute Helsinki: Jaakko Tuomilehto,
Timo Valle