Dimensionality Reduction in the Analysis of Human Genetics Data

1
Dimensionality Reduction in the Analysis of Human
Genetics Data
Petros Drineas
Rensselaer Polytechnic Institute, Computer Science Department
To access my web page
drineas
2
Human genetic history
Much of the biological and evolutionary history
of our species is written in our DNA
sequences. Population genetics can help
translate that historical message.
The genetic variation among humans is a small
portion of the human genome. All humans are
more than 99.9% identical.
3
Our objective
Fact: Dimensionality reduction techniques (such
as Principal Components Analysis, PCA) separate
different populations and result in plots that
correlate well with geography.
4
Our objective
  • Fact
  • Dimensionality reduction techniques (such as
    Principal Components Analysis, PCA) separate
    different populations and result in plots that
    correlate well with geography.
  • Our goal
  • Based on this observation, we seek unsupervised,
    efficient algorithms for the selection of a small
    set of genetic markers that can be used to
  • capture population structure, and
  • predict individual ancestry.

5
The math behind
  • To this end, we employ matrix algorithms and
    matrix decompositions such as
  • the Singular Value Decomposition (SVD), and
  • the CX decomposition.
  • We provide novel, unsupervised algorithms for
    selecting ancestry informative markers.

6
Overview
  • Background
  • The Singular Value Decomposition (SVD)
  • The CX decomposition
  • Removing redundant markers (Column Subset
    Selection Problem)
  • Selecting Ancestry Informative Markers
  • A worldwide set of populations
  • Admixed European-American populations
  • The POPulation REference Sample (POPRES)

7
Single Nucleotide Polymorphisms (SNPs)
Single Nucleotide Polymorphisms (SNPs) are the most
common type of genetic variation in the genome
across different individuals. They are known
locations in the human genome where two alternate
nucleotide bases (alleles) are observed (out of
A, C, G, T).
[Figure: matrix of genotypes such as AG, CT, GG, with
one row per individual and one column per SNP.]
There are 10 million SNPs in the human genome,
so this matrix could have 10 million columns.
8
Our data as a matrix
[Figure: the genotype matrix (individuals as rows, SNPs
as columns, entries such as AG, CT, GG) next to its
numeric encoding with entries in {-1, 0, 1}.]
Encoding example: AA → 1, AG → 0, GG → -1
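The encoding step can be sketched as follows. This is a minimal illustration, assuming the slide's example reads AA → 1, AG → 0, GG → -1; the helper name `encode_genotypes` and the per-SNP `first_alleles` convention (which homozygote gets +1) are hypothetical and not taken from the talk.

```python
import numpy as np

def encode_genotypes(genotypes, first_alleles):
    """Encode an individuals-by-SNPs array of two-letter genotypes as +1/0/-1.

    first_alleles[j] is the allele of SNP j whose homozygote is coded +1
    (a hypothetical convention; the other homozygote is coded -1).
    """
    genotypes = np.asarray(genotypes)
    encoded = np.zeros(genotypes.shape, dtype=int)
    for i in range(genotypes.shape[0]):
        for j in range(genotypes.shape[1]):
            g = genotypes[i, j]
            if g[0] != g[1]:
                encoded[i, j] = 0      # heterozygote
            elif g[0] == first_alleles[j]:
                encoded[i, j] = 1      # homozygous for the first allele
            else:
                encoded[i, j] = -1     # homozygous for the other allele
    return encoded

# Toy data: 3 individuals, 3 SNPs (made up for illustration).
toy = [["AA", "CT", "GG"],
       ["AG", "CC", "GG"],
       ["GG", "TT", "GT"]]
A = encode_genotypes(toy, first_alleles=["A", "C", "G"])
print(A)
```

Each row of `A` is then one individual expressed as a numeric vector, which is the form the matrix decompositions in the rest of the talk operate on.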
9
Forging population variation structure
  • Genetic diversity and population (sub)structure
    are caused by
  • Mutation
  • Mutations are changes to the base pair sequence
    of the DNA.
  • Natural selection
  • Genotypes that correspond to favorable traits and
    are heritable become more common in successive
    generations of a population of reproducing
    organisms.
  • Mutations increase genetic diversity.
  • Under natural selection, beneficial mutations
    increase in frequency, and vice versa.

10
Forging population variation structure
  • Genetic drift
  • Sampling effects on evolution
  • Example: say that the RAF of a SNP in a small
    population is p.
  • The offspring generation would (in expectation)
    have a RAF of p as well for the same SNP.
  • In reality, it will have a RAF of p' (a drifted
    frequency).
  • Gene flow
  • Transfer of alleles between populations
    (immigration)
  • Non-random mating
  • Reduces interaction between (sub)populations.
  • Other demographic events
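The sampling effect behind genetic drift can be illustrated with a short simulation. This is only a sketch under an assumed model: each generation, the 2N allele copies of N diploid individuals are drawn independently with the current frequency p (binomial resampling). The slide does not name a model, and the function name and parameter values are hypothetical.

```python
import numpy as np

def simulate_drift(p0, n_individuals, n_generations, rng):
    """Track one SNP's allele frequency under binomial resampling."""
    freqs = [p0]
    p = p0
    for _ in range(n_generations):
        # Each of the 2N allele copies is drawn independently with probability p,
        # so the expected frequency stays p but the realized one drifts.
        count = rng.binomial(2 * n_individuals, p)
        p = count / (2 * n_individuals)
        freqs.append(p)
    return np.array(freqs)

rng = np.random.default_rng(0)
trajectory = simulate_drift(p0=0.3, n_individuals=50, n_generations=20, rng=rng)
print(trajectory)
```

With a small `n_individuals` the trajectory wanders noticeably away from the starting frequency; with a large one it stays close, which is why drift matters most in small populations.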

11
Dimensionality reduction
[Figure: the numeric individuals-by-SNPs matrix with
entries in {-1, 0, 1}.]
  • Mutation, natural selection, population history,
    etc. result in significant structure in the data.
  • This structure can be extracted via
    dimensionality reduction techniques.
  • In most cases, unsupervised techniques suffice!
  • In most cases, linear dimensionality reduction
    techniques (PCA) suffice!

12
Why study population structure?
  • Mapping causative genes for common complex
    disorders
  • (e.g. diabetes, heart conditions, obesity, etc.)
  • History of human populations
  • Genealogy
  • Forensics
  • Conservation genetics

13
Population stratification
  • Definition
  • Correlation between subpopulations in the
    case/control samples and the phenotype under
    investigation.
  • Effects
  • Confounds the study, typically leading to false
    positive correlations.
  • Solution
  • The problem can be addressed either by careful
    sample collection or by statistical
    post-processing of the results (Price et al
    (2006) Nat Genet).

14
Population stratification (contd)
[Figure: cases and controls drawn from Population 1 and
Population 2, with genotypes AA, AC, and CC.]
Example of the confounding effects of population
stratification in an association study.
Marchini et al (2004) Nat Genet
15
Recall our objective
  • Develop unsupervised, efficient algorithms for
    the selection of a small set of SNPs that can be
    used to
  • capture population structure, and
  • predict individual ancestry.
  • Why? Cost efficiency.
  • Let's discuss (briefly) prior work.

16
Inferring population structure
[Figure: inferred population structure by region:
Oceania, Africa, Europe, Central Asia, East Asia,
Middle East, America.]
377 STRPs, Rosenberg et al (2004) Science
  • Examples of available algorithms/software
    packages
  • STRUCTURE: Pritchard et al (2000) Genetics
  • FRAPPE: Li et al (2008) Science

17
Selecting ancestry informative markers
  • Existing methods (Fst, Informativeness, δ)
  • Rosenberg et al (2003) Am J Hum Genet
  • Allele frequency based.
  • Require prior knowledge of individual ancestry
    (supervised).
  • Such knowledge may not be available.
  • (e.g., populations of complex ancestry, large
    multi-centered studies of anonymous samples,
    etc.)
  • Unsupervised feature selection techniques are
    often preferable because they tend to not overfit
    the data.
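To make "allele frequency based" and "supervised" concrete, here is a rough sketch of two such measures for a biallelic SNP and two labeled populations: the absolute frequency difference, and a textbook form of Fst computed from expected heterozygosities. These formulas are assumptions chosen for illustration; the cited work may use different estimators. The function name and the +1/0/-1 encoding are also assumptions.

```python
import numpy as np

def delta_and_fst(genotypes, labels):
    """Per-SNP frequency difference and Fst for two labeled populations.

    genotypes: individuals-by-SNPs array with entries +1/0/-1.
    labels: array of 0/1 population labels (the supervised input).
    """
    genotypes = np.asarray(genotypes, dtype=float)
    labels = np.asarray(labels)
    # Frequency of the allele coded +1: (g + 1) / 2 averaged over individuals.
    p1 = ((genotypes[labels == 0] + 1) / 2).mean(axis=0)
    p2 = ((genotypes[labels == 1] + 1) / 2).mean(axis=0)
    delta = np.abs(p1 - p2)
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)                       # pooled heterozygosity
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2  # mean within-population
    # Guard against monomorphic SNPs, where h_total is 0.
    safe_total = np.where(h_total > 0, h_total, 1.0)
    fst = np.where(h_total > 0, (h_total - h_within) / safe_total, 0.0)
    return delta, fst

# SNP 0 is fixed for opposite alleles in the two groups; SNP 1 does not differ.
G = [[1, 0], [1, 0], [-1, 0], [-1, 0]]
delta, fst = delta_and_fst(G, labels=[0, 0, 1, 1])
print(delta, fst)
```

Note that both quantities need `labels`, which is exactly the prior knowledge of ancestry that the unsupervised approach in this talk tries to avoid.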

18
The Singular Value Decomposition (SVD)
Let A be a matrix with m rows (one for each
subject) and n columns (one for each SNP).
Matrix rows are points (vectors) in a Euclidean
space; e.g., given 2 objects (x and d), each
described with respect to two features, we get a
2-by-2 matrix. Two objects are close if the angle
between their corresponding vectors is small.
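The notion of closeness by angle can be written down in a few lines. The vectors `x` and `d` below are made-up values for two objects described by two features; the helper name is hypothetical.

```python
import numpy as np

def angle_between(u, v):
    """Angle in radians between two row vectors."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    # Clip to guard against tiny floating-point overshoots outside [-1, 1].
    return np.arccos(np.clip(cos, -1.0, 1.0))

x = np.array([1.0, 0.0])
d = np.array([1.0, 1.0])
theta = angle_between(x, d)
print(theta)
```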
19
SVD, intuition
Let the blue circles represent m data points in a
2-D Euclidean space. Then, the SVD of the m-by-2
matrix of the data will return
20
Singular values
$\sigma_1$ measures how much of the data variance is
explained by the first singular vector. $\sigma_2$
measures how much of the data variance is
explained by the second singular vector.
21
SVD formal definition
$A = U \Sigma V^T$
$\rho$: rank of A.
U (V): orthogonal matrix containing the left (right)
singular vectors of A.
$\Sigma$: diagonal matrix containing the singular values
of A.
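The definition can be checked numerically on a small example. The sketch below uses NumPy's `np.linalg.svd` on a made-up subject-by-SNP matrix; the matrix itself is random toy data, not data from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy subject-by-SNP matrix with entries in {-1, 0, 1}.
A = rng.choice([-1, 0, 1], size=(6, 4)).astype(float)

# Thin SVD: U is 6-by-4, s holds the singular values, Vt is 4-by-4.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

print("singular values:", s)
print("reconstruction error:", np.linalg.norm(A - U @ np.diag(s) @ Vt))
```

The singular vectors come back with orthonormal columns (for U) and orthonormal rows (for `Vt`), and the singular values are returned in decreasing order.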
22
Rank-k approximations via the SVD
[Figure: $A = U \Sigma V^T$ with objects as rows and
features as columns; blocks of U, $\Sigma$, and $V^T$
are marked as significant or noise.]
23
Rank-k approximations ($A_k$)
  • $A_k = U_k \Sigma_k V_k^T$
  • $U_k$ ($V_k$): orthogonal matrix containing the top k
    left (right) singular vectors of A.
  • $\Sigma_k$: diagonal matrix containing the top k singular
    values of A.
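A rank-k approximation can be built by truncating the SVD. The sketch below uses random toy data and a hypothetical helper name; it also prints the Frobenius-norm error, which for the truncated SVD equals the square root of the sum of the squared discarded singular values.

```python
import numpy as np

def rank_k_approximation(A, k):
    """Keep only the top k singular triplets of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 5))
A2 = rank_k_approximation(A, k=2)

s = np.linalg.svd(A, compute_uv=False)
err = np.linalg.norm(A - A2, "fro")
print(err, np.sqrt(np.sum(s[2:] ** 2)))
```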

24
PCA and SVD
Principal Components Analysis (PCA) essentially
amounts to the computation of the Singular Value
Decomposition (SVD) of a covariance matrix. SVD
is the algorithmic tool behind MultiDimensional
Scaling (MDS) and Factor Analysis.
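The link between PCA and the SVD can be seen on a small example. This sketch assumes the usual convention of mean-centering each column before PCA; the data are random and only illustrate that the eigenvalues of the covariance matrix match the squared singular values of the centered data (divided by m - 1).

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((50, 6))
A_centered = A - A.mean(axis=0)

# Route 1: eigendecomposition of the covariance matrix.
cov = A_centered.T @ A_centered / (A.shape[0] - 1)
eigvals = np.linalg.eigh(cov)[0][::-1]   # eigh returns ascending order

# Route 2: SVD of the centered data matrix.
U, s, Vt = np.linalg.svd(A_centered, full_matrices=False)
svd_vals = s ** 2 / (A.shape[0] - 1)

# Coordinates of each individual on the top two principal components.
coords = A_centered @ Vt[:2].T
print(eigvals)
print(svd_vals)
```

The rows of `coords` are what one would scatter-plot to look for population clusters.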
25
The data
[Figure: the sampled populations (European Americans,
South Altaians, Spanish, African Americans, Japanese,
Chinese, Puerto Rico, Nahua, Mende, Mbuti, Mala,
Burunge, Quechua), grouped into Africa, Europe,
E Asia, and America.]
274 individuals from 12 populations genotyped on
10,000 SNPs. Shriver et al (2005) Human Genomics
26
[Figure: plot of the individuals, labeled America,
Africa, Asia, and Europe.]
Paschou et al (2007) PLoS Genetics
27
[Figure: plot of the individuals, labeled America,
Africa, Asia, and Europe.]
Not altogether satisfactory: the singular vectors
are linear combinations of all SNPs and, of
course, cannot be assayed! Can we find actual
SNPs that capture the information in the singular
vectors? (E.g., spanning the same subspace.)
Paschou et al (2007) PLoS Genetics
28
SVD decomposes a matrix as $A \approx U_k X$,
where $U_k$ contains the top k left singular vectors.
The SVD has strong optimality properties.
  • $X = U_k^T A = \Sigma_k V_k^T$
  • The columns of $U_k$ are linear combinations of up
    to all columns of A.

29
CX decomposition
Goal: make (some norm of) A-CX small, where C
consists of c columns of A.
Why? If A is a subject-SNP matrix, then
selecting representative columns is equivalent to
selecting representative SNPs to capture the same
structure as the top eigenSNPs. We want c as
small as possible!
30
CX decomposition
Goal: make (some norm of) A-CX small, where C
consists of c columns of A.
Theory: for any matrix A, we can find C such
that $\|A - CX\|$ is almost equal to the norm of
$A - A_k$, with $c \approx k$.
31
CX decomposition
C consists of c columns of A.
Easy to prove that the optimal $X = C^+ A$. ($C^+$ is
the Moore-Penrose pseudoinverse of C.) Thus, the
challenging part is to find good columns (SNPs)
of A to include in C. From a mathematical
perspective, this is a hard combinatorial problem.
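Once the columns are fixed, computing X is a least-squares problem. The sketch below uses `np.linalg.pinv` on random toy data; the column indices in `cols` are an arbitrary choice for illustration, not a recommended selection.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((10, 7))
cols = [0, 3, 5]                 # arbitrary columns, for illustration only
C = A[:, cols]

# X = C^+ A minimizes the Frobenius norm of A - C X over all X.
X = np.linalg.pinv(C) @ A
residual = np.linalg.norm(A - C @ X, "fro")

# Any perturbed X should do no better.
X_other = X + 0.1 * rng.standard_normal(X.shape)
residual_other = np.linalg.norm(A - C @ X_other, "fro")
print(residual, residual_other)
```

The selected columns themselves are reproduced exactly by `C @ X`; the residual comes entirely from the columns that were left out.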
32
A theorem
Drineas et al (2008) SIAM J Mat Anal Appl
Given an m-by-n matrix A, there exists an
algorithm that picks, in expectation, at most
$O(k \log k / \epsilon^2)$ columns of A, runs in
$O(mn^2)$ time, and with probability at least
$1 - 10^{-20}$ satisfies
$\|A - CX\|_F \le (1 + \epsilon) \|A - A_k\|_F$.
33
The CX algorithm
Input: m-by-n matrix A, target rank k, number of
columns c
Output: C, the matrix consisting of the selected
columns
  • CX algorithm
  • Compute probabilities $p_j$ summing to 1
  • For each j = 1, 2, ..., n, pick the j-th column
    of A with probability $\min\{1, c p_j\}$
  • Let C be the matrix consisting of the chosen
    columns
  • (C has in expectation at most c columns)
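The sampling step of the algorithm above can be sketched as follows. The probabilities are passed in as an argument; the uniform `p` used in the example is only a placeholder, and the function name is hypothetical.

```python
import numpy as np

def sample_columns(A, p, c, rng):
    """Keep column j of A independently with probability min(1, c * p[j])."""
    p = np.asarray(p, dtype=float)
    keep_prob = np.minimum(1.0, c * p)
    chosen = np.flatnonzero(rng.random(A.shape[1]) < keep_prob)
    return chosen, A[:, chosen]

rng = np.random.default_rng(4)
A = rng.standard_normal((10, 30))
p = np.full(A.shape[1], 1.0 / A.shape[1])   # placeholder probabilities (uniform)
chosen, C = sample_columns(A, p, c=6, rng=rng)

expected_columns = np.minimum(1.0, 6 * p).sum()
print(len(chosen), "columns kept; expected number:", expected_columns)
```

Because the p_j sum to 1, the expected number of kept columns is the sum of min(1, c p_j), which is at most c.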

34
Subspace sampling (Frobenius norm)
$V_k$: orthogonal matrix containing the top k right
singular vectors of A. $\Sigma_k$: diagonal matrix
containing the top k singular values of A.
Remark: The rows of $V_k^T$ are orthonormal vectors,
but its columns $(V_k^T)^{(i)}$ are not.
35
Subspace sampling (Frobenius norm)
$V_k$: orthogonal matrix containing the top k right
singular vectors of A. $\Sigma_k$: diagonal matrix
containing the top k singular values of A.
Remark: The rows of $V_k^T$ are orthonormal vectors,
but its columns $(V_k^T)^{(i)}$ are not.
Subspace sampling in $O(mn^2)$ time:
$p_j = \|(V_k^T)^{(j)}\|_2^2 / k$
Leverage scores: $\|(V_k^T)^{(j)}\|_2^2$ (many
references in the statistics community).
Normalization: division by k, s.t. the $p_j$ sum up to 1.
36
Deterministic variant of CX
Paschou et al (2007) PLoS Genetics; Mahoney and
Drineas (2009) PNAS
Input: m-by-n matrix A, integer k, and c
(number of SNPs to pick)
Output: the selected SNPs
  • CX algorithm
  • Compute the scores $p_j$
  • Pick the columns (SNPs) corresponding to the top
    c scores.

37
Deterministic variant of CX (contd)
Paschou et al (2007) PLoS Genetics; Mahoney and
Drineas (2009) PNAS
Input: m-by-n matrix A, integer k, and c
(number of SNPs to pick)
Output: the selected SNPs; we will call them PCA
Informative Markers, or PCAIMs
  • CX algorithm
  • Compute the scores $p_j$
  • Pick the columns (SNPs) corresponding to the top
    c scores.

In order to estimate k for SNP data, we developed
a permutation-based test to determine whether a
certain principal component is significant or
not. (A similar test was presented in Patterson
et al (2006) PLoS Genetics)
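The deterministic selection step can be sketched in a few lines. The scoring formula (squared column norms of V_k^T divided by k) is our reading of the subspace-sampling slides, and the toy matrix, in which SNPs 0 and 1 separate two groups of individuals, is made up; the function name `select_pcaims` is hypothetical. The permutation test for choosing k is not sketched here.

```python
import numpy as np

def select_pcaims(A, k, c):
    """Return the indices of the c columns (SNPs) with the largest scores."""
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    scores = np.sum(Vt[:k, :] ** 2, axis=0) / k
    order = np.argsort(scores)[::-1]          # highest score first
    return order[:c], scores

rng = np.random.default_rng(6)
A = rng.uniform(-0.1, 0.1, size=(8, 6))       # weak noise at every SNP
A[:4, :2] = 1.0                               # SNPs 0 and 1 separate two groups
A[4:, :2] = -1.0
selected, scores = select_pcaims(A, k=1, c=2)
print(sorted(selected.tolist()))              # the two structured SNPs: [0, 1]
```

In this toy example the two selected SNPs are identical columns, which previews the redundancy issue with highly correlated markers.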
38
Worldwide data
[Figure: the sampled populations (European Americans,
South Altaians, Spanish, African Americans, Japanese,
Chinese, Puerto Rico, Nahua, Mende, Mbuti, Mala,
Burunge, Quechua), grouped into Africa, Europe,
E Asia, and America.]
274 individuals, 12 populations, 10,000 SNPs
using the Affymetrix array. Shriver et al (2005)
Human Genomics
39
Selecting PCA-correlated SNPs for individual
assignment to four continents (Africa, Europe,
Asia, America)
[Figure: PCA scores of the SNPs in chromosomal order,
with the top 30 PCA-correlated SNPs marked; panels for
Africa, Europe, Asia, and America.]
Paschou et al (2007) PLoS Genetics
40
Correlation coefficient between true and
predicted membership of an individual to a
particular geographic continent. (Use a subset
of SNPs, cluster the individuals using k-means.)
Paschou et al (2007) PLoS Genetics
41
Cross-validation on HapMap data
Paschou et al (2007) PLoS Genetics
42
  • 9 indigenous populations from four different
    continents (Africa, Europe, Asia, Americas)
  • All SNPs and 10 principal components: perfect
    clustering!
  • 50 PCAIMs: almost perfect clustering.

43
The Human Genome Diversity Panel
Approx. 1000 individuals, 650K Illumina Array
Li et al (2008) Science
44
Highest scoring genes in HGDP dataset
Gene | Function (RefSeq)
EDAR | Ectodermal development, hair follicle formation.
PTK6 | Intracellular signal transducer in epithelial tissues. Sensitization of cells to epidermal growth factor.
GALNT13 | Initiates O-linked glycosylation of mucins.
SPATA20 | Associated with spermatogenesis.
MCHR1 | Plasma membrane protein which binds melanin-concentrating hormone. Probably involved in the neuronal regulation of food consumption.
FOXP1 | Forkhead box transcription factors play important roles in the regulation of tissue- and cell type-specific gene transcription during both development and adulthood.
PSCD3 | Involved in the control of Golgi structure and function.
CNTNAP2 | Member of the neurexin family which functions in the vertebrate nervous system as cell adhesion molecules and receptors.
OCA2 | Skin/Hair/Eye pigmentation.
EGFR | This protein is a receptor for members of the epidermal growth factor family. Associated with the melanin pathway.
Barreiro et al (2008) Nat Genet; Sabeti et al
(2007) Nature; The International HapMap
Consortium (2007) Nature
45
A problem with the CX decomposition
Input: m-by-n matrix A, integer k, and c (number
of SNPs to pick)
Output: the selected PCA Informative Markers,
or PCAIMs
  • CX algorithm
  • Compute the scores $p_j$
  • Pick the columns (SNPs) corresponding to the top
    c scores.

Problem: Highly correlated SNPs (a.k.a. SNPs
that are in LD) get similar high scores, and
thus the deterministic variant would select
redundant SNPs. How do we remove this redundancy?
46
Column Subset Selection Problem (CSSP)
Definition: Given an m-by-c matrix A, find k
columns of A forming an m-by-k matrix C that are
maximally uncorrelated (a.k.a., select maximally
uncorrelated SNPs).
47
Column Subset Selection Problem (CSSP)
Definition: Given an m-by-c matrix A, find k
columns of A forming an m-by-k matrix C that are
maximally uncorrelated (a.k.a., select maximally
uncorrelated SNPs).
Metric of correlation: A common formulation is to
select a set of SNPs that span a parallelepiped
of maximal volume. This formulation is NP-hard
(i.e., intractable even for small values of m, c,
and k).
48
Column Subset Selection Problem (CSSP)
Definition: Given an m-by-c matrix A, find k
columns of A forming an m-by-k matrix C that are
maximally uncorrelated (a.k.a., select maximally
uncorrelated SNPs).
Metric of correlation: A common formulation is to
select a set of SNPs that span a parallelepiped
of maximal volume. This formulation is NP-hard
(i.e., intractable even for small values of m, c,
and k).
Significant prior work: The CSSP has been
studied in the Numerical Linear Algebra
community, and many provably accurate
approximation algorithms exist.
49
Prior work on the CSSP, 1965-2000
Boutsidis, Mahoney, and Drineas (2009) SODA, under
review in Num Math
50
The greedy QR algorithm
We use a standard greedy approach (the
Rank-Revealing QR factorization). The algorithm
performs k iterations:
  • In the first iteration, the top PCAIM is picked.
  • In the second iteration, a PCAIM is picked that
    is as uncorrelated with the previously selected
    PCAIM as possible.
  • In the third iteration, the chosen PCAIM has to
    be as uncorrelated as possible with the first two
    previously selected PCAIMs.
  • And so on.

Efficient implementations are available, and run
in minutes for typical values of m, c, and k.
Paschou et al (2008) PLoS Genetics
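The greedy idea can be sketched as QR with column pivoting, written here in a simple Gram-Schmidt form. This is an illustrative sketch, not the implementation used in the cited work: "top" is taken to mean the column with the largest norm, which is what plain column pivoting does, and the function name and toy matrix are made up. It assumes k does not exceed the rank of A.

```python
import numpy as np

def greedy_column_selection(A, k):
    """Pick k columns greedily: largest remaining norm after deflation."""
    R = np.array(A, dtype=float)          # working copy; columns get deflated
    selected = []
    for _ in range(k):
        norms = np.linalg.norm(R, axis=0)
        if selected:
            norms[selected] = -1.0        # never pick a column twice
        j = int(np.argmax(norms))
        selected.append(j)
        q = R[:, j] / np.linalg.norm(R[:, j])
        # Remove the component along q from every column, so columns
        # correlated with the chosen one shrink and become unattractive.
        R -= np.outer(q, q @ R)
    return selected

# Columns 0 and 1 are nearly parallel; column 2 points in a different direction.
A = np.array([[3.0, 2.9, 0.0],
              [0.0, 0.1, 1.0],
              [0.0, 0.0, 0.0]])
picked = greedy_column_selection(A, k=2)
print(picked)   # column 1 is skipped as redundant: [0, 2]
```

Picking by raw score alone would have chosen columns 0 and 1; the deflation step is what steers the second pick away from the near-duplicate.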
51
A European American sample
  • Datasets
  • CHORI dataset
  • 980 European Americans, 300,000 SNPs (Illumina
    chip)
  • Simon et al (2006) Am J Cardiol; Albert et al
    (2001) JAMA
  • Coriell dataset
  • 541 European Americans, 300,000 SNPs (Illumina
    chip)
  • Fung et al (2006) Lancet
  • HapMap
  • 90 Yoruba (YRI), 90 CEPH (CEU), 90 Han Chinese
    and Japanese (CHB-JPT)

Paschou et al (2008) PLoS Genetics
52
[Figure: PCA plot with groups labeled Europeans,
Africans, and Asians.]
PCA plot of European-American populations and
HapMap 2 populations. (The top three eigenSNPs
are presented.)
Paschou et al (2008) PLoS Genetics
53
(No Transcript)
54
POPRES - The Population Reference Sample
  • 6,000 individuals of European, African-American,
    East Asian, South Asian, and Mexican origin
  • Genotyped with Affy 500K array set
  • Nelson et al (2008) Am J Hum Genet
  • 3,192 Europeans
  • Correlation between genetic structure and
    geographic origin
  • Lao et al (2008) Current Biology
  • Novembre et al (2008) Nature
  • We analyzed 1,387 individuals from Novembre et
    al.
  • Randomly included only 200 UK and 125 Swiss
    French (to even out sample sizes)
  • Excluded:
  • Europeans not sampled in Europe
  • Putative relatives
  • Outliers in preliminary PCA

55
(No Transcript)
56
Conclusions
  • Using linear algebraic techniques (e.g., matrix
    decompositions) we selected markers that capture
    population structure.
  • Our technique requires no prior assumptions and
    builds upon the power of SVD and PCA to identify
    population structure in various settings,
    including admixed populations.
  • Prior theoretical work and mathematical
    understanding of the underlying problem were
    fundamental in designing our algorithm!

57
Future research
  • Unsupervised dimensionality reduction techniques
    are NOT successful in separating cases from
    controls in GWAS studies.
  • Why?
  • Because the disease signal is too weak.
  • Potential remedies?
  • Supervised techniques (Fisher Discriminant
    Analysis or LDA, etc.).
  • Sparse approximations for regression problems.
  • Goal?
  • Design a global test that may help uncover
    effects of gene-gene interactions in disease risk.

58
Acknowledgements
  • Collaborators
  • P. Paschou, Democritus University, Greece
  • E. Ziv, UCSF
  • E. Burchard, UCSF
  • K. K. Kidd, Yale University
  • M. Shriver, Penn State
  • R. Krauss, Oakland Research Institute
  • Students
  • Asif Javed, RPI (now at IBM)
  • Jamey Lewis, RPI

Funding: NSF (Drineas), Tourette Syndrome
Association (Paschou), NIH (Ziv)
  • P. Paschou, M. W. Mahoney, A. Javed, J. Kidd, A.
    Pakstis, S. Gu, K. Kidd, and P. Drineas. (2007)
    Intra- and inter-population genotype
    reconstruction from tagging SNPs, Genome
    Research, 17(1), pp. 96-107.
  • P. Paschou, E. Ziv, E. Burchard, S. Choudhry, W.
    Rodriguez-Cintron, M. W. Mahoney, and P. Drineas.
    (2007) PCA-correlated SNPs for structure
    identification in worldwide human populations,
    PLoS Genetics, 3(9), pp. 1672-1686.
  • P. Paschou, P. Drineas, J. Lewis, C. Nievergelt,
    D. Nickerson, J. Smith, P. Ridker, D. Chasman, R.
    Krauss, and E. Ziv. (2008) Tracing sub-structure
    in the European American population with
    PCA-informative markers, PLoS Genetics, 4(7), pp.
    1-13.
  • M. W. Mahoney and P. Drineas. (2009) CUR matrix
    decompositions for improved data analysis,
    Proceedings of the National Academy of Sciences,
    106(3), pp. 697-702.