Data Mining in Linkage Disequilibrium Mapping

1
Data Mining in Linkage Disequilibrium Mapping
  • Jing Hua Zhao
  • Epidemiology
  • j.zhao_at_public-health.ucl.ac.uk
  • June 2003

2
Outline of the Talk
  • The problem
  • Why data mining?
  • Haplotype construction
  • Challenging issues

3
Current Paradigm
  • Complex traits (Lander & Schork 1994)
  • Association mapping (Risch & Merikangas 1996)
  • The need for both family-based and
    population-based studies (Hodge et al. 2003)
  • SNPs

4
Linkage Disequilibrium
  • The raw data is genetic markers
  • LD is the non-random association between alleles
    at different loci
  • Contains information on genetics of population
    (selection, mutation, recombination, admixture)
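
To make the definition of LD concrete, here is a minimal sketch for two biallelic loci. It computes the coefficient D together with the common standardised forms D' and r^2. The function name and argument names are illustrative and do not come from the talk.

```python
def ld_measures(p_ab, p_a, p_b):
    """Return (D, D', r^2) for two biallelic loci.

    p_ab : frequency of the haplotype carrying allele A (locus 1) and B (locus 2)
    p_a  : frequency of allele A at locus 1
    p_b  : frequency of allele B at locus 2
    """
    # D is the departure of the haplotype frequency from independence.
    d = p_ab - p_a * p_b

    # Largest |D| attainable given the allele frequencies and the sign of D.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max if d_max > 0 else 0.0

    # r^2 is the squared correlation between the two allele indicators.
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2
```

With `p_ab = p_a * p_b` all three measures are zero, which is the "random association" case that LD is defined against.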

5
A Model with LDs
  • Log-linear model to allow for higher order
    interaction (Weir & Wilson 1986)
  • Applicable to a variety of null hypotheses
    (Huttley & Wilson 2000)
  • Number of terms grows exponentially with the
    number of loci
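
The exponential growth can be illustrated with a short count. This sketch assumes biallelic loci and a saturated model with one term per non-empty subset of loci; that simplification is made here and is not stated on the slide. Multiallelic markers would give more terms.

```python
from math import comb


def loglinear_term_counts(n_loci):
    """Number of k-locus terms in a saturated model over biallelic loci.

    Assumption: one term for every subset of k loci, k = 1..n_loci,
    so the total is 2**n_loci - 1.
    """
    return {k: comb(n_loci, k) for k in range(1, n_loci + 1)}


def total_terms(n_loci):
    """Total number of main-effect and interaction terms."""
    return sum(loglinear_term_counts(n_loci).values())
```

For example, `total_terms(10)` is 1023 and `total_terms(20)` is 1048575, which is why exhaustive fitting quickly becomes impractical.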

6
Why Data Mining?
  • 1.8 million SNPs, 1,240 hits on haplotype and
    data mining in 0.15 seconds
  • Data mining is the process of exploration and
    analysis, by automatic or semi-automatic means,
    of large quantities of data in order to discover
    meaningful patterns and results (Berry & Linoff
    1997, 2000)

7
A Statistical Perspective
  • Traditionally EDA, for a particular question
  • Sheer size of data is problematic
  • Now DM could be defined as the process of
    secondary analysis of large databases aimed at
    finding unsuspected relationships which are of
    interest or value to the database owners (Hand
    1998)

8
Haplotype Pattern Mining
  • Figure 1 (a) Strongly disease-associated
    haplotype patterns
  • Enumeration
  • Depth-first search (DFS), which has good
    running-time properties

9
Significance
  • A simple chi-squared statistic from a 2x2 table
    of disease-associated and control chromosomes,
    in accordance with D; significance determined
    via simulation
  • Simulation on prevalence, evolutionary history
    and sample size, robustness
  • Applicable to family data (Zhang et al. 2001)
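
A minimal sketch of the 2x2 statistic follows, with an empirical p-value obtained by shuffling case/control labels. Label permutation is used here as one simple form of simulation and is an assumption; the slide does not specify the simulation scheme. All names are illustrative.

```python
import random


def chi2_2x2(a, b, c, d):
    """Pearson chi-squared for the table [[a, b], [c, d]] (no correction)."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def permutation_p_value(carrier, is_case, n_perm=1000, seed=0):
    """Empirical p-value for pattern/disease association.

    carrier : list of bools, True if the chromosome carries the pattern
    is_case : list of bools, True if the chromosome is disease-associated
    Assumption: chromosomes are exchangeable under the null hypothesis.
    """
    def stat(labels):
        a = sum(1 for x, y in zip(carrier, labels) if x and y)
        b = sum(1 for x, y in zip(carrier, labels) if x and not y)
        c = sum(1 for x, y in zip(carrier, labels) if not x and y)
        d = sum(1 for x, y in zip(carrier, labels) if not x and not y)
        return chi2_2x2(a, b, c, d)

    observed = stat(is_case)
    rng = random.Random(seed)
    labels = list(is_case)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if stat(labels) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return observed, (hits + 1) / (n_perm + 1)
```

The empirical p-value avoids relying on the asymptotic chi-squared distribution, which matters when many overlapping patterns are tested on the same chromosomes.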

10
Emerging Rules
  • LD patterns are highly structured (Daly et al.
    2001)
  • 5-8 markers (Niu et al. 2002; Zaykin et al.
    2002; Toivonen et al. 2000)
  • htSNPs (Johnson et al. 2001)

11
Problem of Haplotype Uncertainty
  • EM (Ceppellini et al. 1955)
  • MCMC (Guo & Thompson 1992; Lazzaroni & Lange
    1997; Stephens et al. 2001; Niu et al. 2002)
  • Heuristic algorithms
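
As an illustration of the EM approach, here is a minimal sketch for the smallest non-trivial case, two biallelic loci. Only double heterozygotes have ambiguous phase. The genotype coding, the function name and the uniform starting values are assumptions made for this example.

```python
def em_two_locus(genotype_counts, n_iter=100, tol=1e-10):
    """EM estimates of the four two-locus haplotype frequencies.

    genotype_counts : dict {(g1, g2): count}, where g1 and g2 are the
    number of copies (0, 1, 2) of allele "1" at locus 1 and locus 2.
    Returns a dict keyed by (allele at locus 1, allele at locus 2).
    """
    base = {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.0}
    n_double_het = 0
    for (g1, g2), n in genotype_counts.items():
        if g1 == 1 and g2 == 1:
            n_double_het += n  # phase unknown: 00/11 or 01/10
            continue
        # At most one locus is heterozygous, so the phase is known.
        a1 = [0, 0] if g1 == 0 else [1, 1] if g1 == 2 else [0, 1]
        a2 = [0, 0] if g2 == 0 else [1, 1] if g2 == 2 else [0, 1]
        for hap in zip(a1, a2):
            base[hap] += n

    total = 2 * sum(genotype_counts.values())
    freq = {hap: 0.25 for hap in base}  # assumed starting point
    for _ in range(n_iter):
        # E-step: probability that a double heterozygote is 00/11.
        cis = freq[(0, 0)] * freq[(1, 1)]
        trans = freq[(0, 1)] * freq[(1, 0)]
        w = cis / (cis + trans) if cis + trans > 0 else 0.5
        # M-step: re-estimate frequencies from expected counts.
        new = {
            (0, 0): (base[(0, 0)] + n_double_het * w) / total,
            (1, 1): (base[(1, 1)] + n_double_het * w) / total,
            (0, 1): (base[(0, 1)] + n_double_het * (1 - w)) / total,
            (1, 0): (base[(1, 0)] + n_double_het * (1 - w)) / total,
        }
        delta = max(abs(new[h] - freq[h]) for h in freq)
        freq = new
        if delta < tol:
            break
    return freq
```

With more loci the number of compatible haplotype pairs per genotype grows quickly, which is what motivates the MCMC and heuristic alternatives listed above.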

12
Haplotype Reconstruction
  • Table of genotypes (Xie & Ott 1993)
  • Table of sufficient statistics (Zhao et al. 2000)
    and linked list
  • Binary trees (Zhao & Sham 2002)
  • Mixed-radix number (Zhao & Sham 2003)
  • QuickSort (Zhao & Qian submitted)
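
One way to read the mixed-radix item is that a multilocus haplotype is mapped to a single integer, with the number of alleles at each locus acting as that digit's radix. The sketch below shows this encoding and its inverse. It illustrates the general idea only and is not taken from the cited paper; the function names are hypothetical.

```python
def encode(alleles, radices):
    """Map a haplotype to one integer index.

    alleles : 0-based allele index at each locus
    radices : number of alleles at each locus (the mixed radix)
    """
    index = 0
    for allele, radix in zip(alleles, radices):
        if not 0 <= allele < radix:
            raise ValueError("allele index out of range for its locus")
        index = index * radix + allele
    return index


def decode(index, radices):
    """Inverse of encode: recover the allele index at each locus."""
    alleles = []
    for radix in reversed(radices):
        index, allele = divmod(index, radix)
        alleles.append(allele)
    return list(reversed(alleles))
```

For loci with 2, 3 and 4 alleles, `encode([1, 2, 3], [2, 3, 4])` gives 23, the last of the 24 possible haplotypes. Integer keys of this kind can be stored in a flat array or sorted directly, which connects to the sorting-based item in the list above.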

13
Examples
  • HLA (the evolution of EM algorithms, information
    content of SNP and SSR)
  • ALDH2 (missing data, effectiveness of heuristic
    method)
  • APOC (the disadvantage of QuickSort, heuristics,
    the inclusion of covariates)

14
Challenging Issues
  • Genotype/phenotype relationship in Whitehall II
    data (10,308 civil servants, with 7,000 APOE
    genotypings)
  • Associated with cognitive declines
  • Need longitudinal data
  • Will tie up with BioBank project

15
Statistical Methodology
  • GLM needs to be extended
  • The same applies to LDA models such as GLMM
  • Search and Sort paradigm (Knuth)