Title: Dimensionality Reduction in the Analysis of Human Genetics Data
1Dimensionality Reduction in the Analysis of Human Genetics Data
Petros Drineas
Rensselaer Polytechnic Institute, Computer Science Department
2Human genetic history
Much of the biological and evolutionary history
of our species is written in our DNA
sequences. Population genetics can help
translate that historical message.
The genetic variation among humans is a small
portion of the human genome. All humans are
more than 99.9% identical.
4Our objective
- Fact
  - Dimensionality reduction techniques (such as Principal Components Analysis, PCA) separate different populations and result in plots that correlate well with geography.
- Our goal
  - Based on this observation, we seek unsupervised, efficient algorithms for the selection of a small set of genetic markers that can be used to
    - capture population structure, and
    - predict individual ancestry.
5The math behind
- To this end, we employ matrix algorithms and matrix decompositions such as
  - the Singular Value Decomposition (SVD), and
  - the CX decomposition.
- We provide novel, unsupervised algorithms for selecting ancestry informative markers.
6Overview
- Background
- The Singular Value Decomposition (SVD)
- The CX decomposition
- Removing redundant markers (Column Subset Selection Problem)
- Selecting Ancestry Informative Markers
  - A worldwide set of populations
  - Admixed European-American populations
  - The POPulation REference Sample (POPRES)
7Single Nucleotide Polymorphisms (SNPs)
Single Nucleotide Polymorphisms are the most common
type of genetic variation in the genome across
different individuals. They are known locations
in the human genome where two alternate
nucleotide bases (alleles) are observed (out of
A, C, G, T).
SNPs
AG CT GT GG CT CC CC CC CC AG AG AG AG AG AA CT
AA GG GG CC GG AG CG AC CC AA CC AA GG TT AG CT
CG CG CG AT CT CT AG CT AG GG GT GA AG GG TT
TT GG TT CC CC CC CC GG AA AG AG AG AA CT AA GG
GG CC GG AA GG AA CC AA CC AA GG TT AA TT GG GG
GG TT TT CC GG TT GG GG TT GG AA GG TT TT GG
TT CC CC CC CC GG AA AG AG AA AG CT AA GG GG CC
AG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT
CT CT AG CT AG GG GT GA AG GG TT TT GG TT CC
CC CC CC GG AA AG AG AG AA CC GG AA CC CC AG GG
CC AC CC AA CG AA GG TT AG CT CG CG CG AT CT CT
AG CT AG GT GT GA AG GG TT TT GG TT CC CC CC
CC GG AA GG GG GG AA CT AA GG GG CT GG AA CC AC
CG AA CC AA GG TT GG CC CG CG CG AT CT CT AG CT
AG GG TT GG AA GG TT TT GG TT CC CC CG CC AG
AG AG AG AG AA CT AA GG GG CT GG AG CC CC CG AA
CC AA GT TT AG CT CG CG CG AT CT CT AG CT AG GG
TT GG AA GG TT TT GG TT CC CC CC CC GG AA AG
AG AG AA TT AA GG GG CC AG AG CG AA CC AA CG AA
GG TT AA TT GG GG GG TT TT CC GG TT GG GT TT GG
AA
individuals
There are 10 million SNPs in the human genome,
so this matrix could have 10 million columns.
8Our data as a matrix
[Genotype matrix from the previous slide: rows are individuals, columns are SNPs.]
SNPs
Individuals
0 0 0 1 0 -1 1 1 1 0 0 0 0 0 1 0
1 -1 -1 1 -1 0 0 0 1 1 1 1 -1 -1 0 0 0
0 0 0 0 0 0 0 0 1 0 0 0 -1 -1 -1 1
-1 -1 1 1 1 -1 1 0 0 0 1 0 1 -1 -1 1
-1 1 -1 1 1 1 1 1 -1 -1 1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 1 -1 -1 1 -1 -1 -1 1 -1 -1 1
1 1 -1 1 0 0 1 0 0 1 -1 -1 1 0 0 0 0
1 1 1 1 -1 -1 0 0 0 0 0 0 0 0 0 0
0 1 0 0 0 -1 -1 -1 1 -1 -1 1 1 1 -1 1
0 0 0 1 1 -1 1 1 1 0 -1 1 0 1 1 0 1
-1 -1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 -1 -1 -1 1 -1 -1 1 1 1 -1 1 -1 -1 -1 1
0 1 -1 -1 0 -1 1 1 0 0 1 1 1 -1 -1 -1 1
0 0 0 0 0 0 0 0 0 1 -1 -1 1 -1 -1 -1
1 -1 -1 1 0 1 0 0 0 0 0 1 0 1 -1 -1 0
-1 0 1 -1 0 1 1 1 -1 -1 0 0 0 0 0 0
0 0 0 0 0 1 -1 -1 1 -1 -1 -1 1 -1 -1 1
1 1 -1 1 0 0 0 1 -1 1 -1 -1 1 0 0 0 1
1 1 0 1 -1 -1 1 -1 -1 -1 -1 -1 -1 1 -1 -1
-1 0 -1 -1 1
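Example encoding for a SNP whose second allele is G: homozygous for the other allele is encoded as 1, heterozygous (one copy of G) as 0, and GG as -1.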
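The encoding above can be sketched in a few lines of Python. This is only an illustration: the function name `encode_snp`, the choice of A and G as the two alleles, and the toy genotype column are assumptions, not part of the original slides.

```python
import numpy as np

def encode_snp(genotypes, second_allele):
    """Encode one SNP column as +1 / 0 / -1.

    Counts the copies of `second_allele` in each two-letter genotype:
    0 copies -> +1, 1 copy -> 0, 2 copies -> -1.
    """
    return np.array([1 - g.count(second_allele) for g in genotypes])

# Hypothetical SNP with alleles A and G, observed in four individuals.
column = ["AG", "GG", "AA", "AG"]
encoded = encode_snp(column, second_allele="G")
print(encoded)
```

Applying the same function to every SNP column (each with its own pair of alleles) would give the numeric individuals-by-SNPs matrix used in the rest of the talk.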
9Forging population variation structure
- Genetic diversity and population (sub)structure are caused by
  - Mutation
    - Mutations are changes to the base pair sequence of the DNA.
  - Natural selection
    - Genotypes that correspond to favorable traits and are heritable become more common in successive generations of a population of reproducing organisms.
- Mutations increase genetic diversity.
- Under natural selection, beneficial mutations increase in frequency, and vice versa.
10Forging population variation structure
- Genetic drift
  - Sampling effects on evolution
  - Example: say that the RAF of a SNP in a small population is p.
  - The offspring generation would (in expectation) have a RAF of p as well for the same SNP.
  - In reality, it will have a RAF of p' (a drifted frequency).
- Gene flow
  - Transfer of alleles between populations (immigration)
- Non-random mating
  - Reduces interaction between (sub)populations.
- Other demographic events
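The genetic drift example can be made concrete with a small simulation. The sketch below assumes a simple Wright-Fisher style model (each generation resamples 2N allele copies binomially from the previous frequency); the slides do not name a specific model, and `simulate_drift` and its parameters are illustrative.

```python
import numpy as np

def simulate_drift(p0, n_individuals, n_generations, seed=0):
    """Track an allele frequency under pure sampling (no selection, no mutation)."""
    rng = np.random.default_rng(seed)
    n_copies = 2 * n_individuals          # diploid population: two copies per individual
    p = p0
    trajectory = [p]
    for _ in range(n_generations):
        # The expected frequency stays p, but the realized one drifts.
        p = rng.binomial(n_copies, p) / n_copies
        trajectory.append(p)
    return np.array(trajectory)

traj = simulate_drift(p0=0.3, n_individuals=50, n_generations=100)
print("start:", traj[0], "after 100 generations:", traj[-1])
```

Smaller values of `n_individuals` make the frequency wander faster, which is why drift matters most in small populations.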
11Dimensionality reduction
[Encoded individuals-by-SNPs matrix (entries -1, 0, 1), as shown on slide 8.]
- Mutation, natural selection, population history, etc. result in significant structure in the data.
- This structure can be extracted via dimensionality reduction techniques.
- In most cases, unsupervised techniques suffice!
- In most cases, linear dimensionality reduction techniques (PCA) suffice!
12Why study population structure?
- Mapping causative genes for common complex disorders (e.g., diabetes, heart conditions, obesity, etc.)
- History of human populations
- Genealogy
- Forensics
- Conservation genetics
13Population stratification
- Definition
  - Correlation between subpopulations in the case/control samples and the phenotype under investigation.
- Effects
  - Confounds the study, typically leading to false positive correlations.
- Solution
  - The problem can be addressed either by careful sample collection or by statistical post-processing of the results (Price et al (2006) Nat Genet).
14Population stratification (contd)
[Figure: example of the confounding effects of population stratification in an association study. Cases and controls are drawn from Population 1 and Population 2; genotypes shown are AA, AC, and CC.]
Marchini et al (2004) Nat Genet
15Recall our objective
- Develop unsupervised, efficient algorithms for the selection of a small set of SNPs that can be used to
  - capture population structure, and
  - predict individual ancestry.
- Why? Cost efficiency.
- Let's discuss (briefly) prior work.
16 Inferring population structure
[Figure: inferred population structure across Oceania, Africa, Europe, Central Asia, East Asia, Middle East, and America. 377 STRPs, Rosenberg et al (2004) Science]
- Examples of available algorithms/software packages
  - STRUCTURE: Pritchard et al (2000) Genetics
  - FRAPPE: Li et al (2008) Science
17Selecting ancestry informative markers
- Existing methods (Fst, Informativeness, δ)
  - Rosenberg et al (2003) Am J Hum Genet
  - Allele frequency based.
  - Require prior knowledge of individual ancestry (supervised).
- Such knowledge may not be available (e.g., populations of complex ancestry, large multi-centered studies of anonymous samples, etc.).
- Unsupervised feature selection techniques are often preferable because they tend to not overfit the data.
18The Singular Value Decomposition (SVD)
Let A be a matrix with m rows (one for each
subject) and n columns (one for each SNP).
Matrix rows are points (vectors) in a
Euclidean space. E.g., given 2 objects (x and d),
each described with respect to two features, we
get a 2-by-2 matrix. Two objects are close if
the angle between their corresponding vectors is
small.
19SVD, intuition
Let the blue circles represent m data points in a
2-D Euclidean space. Then, the SVD of the m-by-2
matrix of the data will return two singular vectors
(directions in the plane) and two singular values.
20Singular values
$\sigma_1$ measures how much of the data variance is
explained by the first singular vector. $\sigma_2$
measures how much of the data variance is
explained by the second singular vector.
21SVD formal definition
$A = U \Sigma V^T$
- $\rho$: rank of A.
- U (V): orthogonal matrix containing the left (right) singular vectors of A.
- $\Sigma$: diagonal matrix containing the singular values of A.
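As a quick check of the definition, the following sketch computes the SVD of a small toy matrix with NumPy and verifies the stated properties. The random toy matrix and variable names are assumptions made for illustration; `numpy.linalg.svd` returns the singular values as a vector and the right singular vectors already transposed.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy individuals-by-SNPs matrix with entries in {-1, 0, 1}.
A = rng.choice([-1.0, 0.0, 1.0], size=(6, 10))

# Thin SVD: U is 6x6, s holds the singular values, Vt is 6x10 (V transposed).
U, s, Vt = np.linalg.svd(A, full_matrices=False)
rho = np.linalg.matrix_rank(A)

# Multiplying the three factors back together recovers A.
A_rebuilt = U @ np.diag(s) @ Vt

print("rank:", rho)
print("singular values (non-increasing):", np.round(s, 3))
print("reconstruction error:", np.linalg.norm(A - A_rebuilt))
```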
22Rank-k approximations via the SVD
[Figure: $A = U \Sigma V^T$ for an objects-by-features matrix A, with the singular vectors and singular values split into a significant part and a noise part.]
23Rank-k approximations (Ak)
$A_k = U_k \Sigma_k V_k^T$
- $U_k$ ($V_k$): orthogonal matrix containing the top k left (right) singular vectors of A.
- $\Sigma_k$: diagonal matrix containing the top k singular values of A.
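A minimal sketch of the rank-k approximation, assuming a toy random matrix and a hypothetical helper `rank_k_approx`. It also checks the standard fact that the Frobenius-norm error of $A_k$ equals the square root of the sum of the squared discarded singular values.

```python
import numpy as np

def rank_k_approx(A, k):
    """Return A_k = U_k Sigma_k V_k^T together with all singular values of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    return A_k, s

rng = np.random.default_rng(0)
A = rng.choice([-1.0, 0.0, 1.0], size=(8, 12))   # toy individuals-by-SNPs matrix
k = 3
A_k, s = rank_k_approx(A, k)

error = np.linalg.norm(A - A_k, "fro")
tail = np.sqrt(np.sum(s[k:] ** 2))               # energy in the discarded singular values
print("||A - A_k||_F =", round(error, 6), " tail =", round(tail, 6))
```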
24PCA and SVD
Principal Components Analysis (PCA) essentially
amounts to the computation of the Singular Value
Decomposition (SVD) of a covariance matrix. SVD
is the algorithmic tool behind MultiDimensional
Scaling (MDS) and Factor Analysis.
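The link between PCA and the SVD can be illustrated numerically. The sketch below is one common way to set it up, not necessarily the exact pipeline used in the talk: it assumes each SNP column is mean-centered, and compares the eigenvalues of the covariance matrix with the squared singular values of the centered data.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.choice([-1.0, 0.0, 1.0], size=(8, 5))    # toy individuals-by-SNPs matrix

# Center each SNP (column) before computing covariances.
A_centered = A - A.mean(axis=0)

# Route 1: eigenvalues of the SNP-by-SNP covariance matrix, largest first.
cov = np.cov(A_centered, rowvar=False)
eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]

# Route 2: singular values of the centered data matrix.
U, s, Vt = np.linalg.svd(A_centered, full_matrices=False)
variances = s ** 2 / (A.shape[0] - 1)

# Coordinates of each individual along the top two principal components.
coords = A_centered @ Vt[:2].T
print(np.round(eigvals, 4))
print(np.round(variances, 4))
```

Plotting the two columns of `coords` against each other gives the kind of scatter plot in which populations separate.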
25The data
[Map of sampled populations, grouped by region (Africa, Europe, E Asia, America): European Americans, South Altaians, Spanish, African Americans, Japanese, Chinese, Puerto Rico, Nahua, Mende, Mbuti, Mala, Burunge, Quechua.]
274 individuals from 12 populations genotyped on 10,000 SNPs. Shriver et al (2005) Human Genomics
26
[Figure: plot of the individuals, with clusters labeled America, Africa, Asia, and Europe.]
Paschou et al (2007) PLoS Genetics
27
[Figure: plot of the individuals, with clusters labeled America, Africa, Asia, and Europe.]
Not altogether satisfactory: the singular vectors
are linear combinations of all SNPs and, of
course, cannot be assayed! Can we find actual
SNPs that capture the information in the singular
vectors? (E.g., spanning the same subspace.)
Paschou et al (2007) PLoS Genetics
28SVD decomposes a matrix as
$A \approx U_k X$, where $U_k$ contains the top k left singular vectors.
The SVD has strong optimality properties.
- $X = U_k^T A = \Sigma_k V_k^T$
- The columns of $U_k$ are linear combinations of up to all columns of A.
29CX decomposition
Goal: make (some norm of) A-CX small, where C consists of c columns of A.
Why? If A is a subject-SNP matrix, then
selecting representative columns is equivalent to
selecting representative SNPs to capture the same
structure as the top eigenSNPs. We want c as
small as possible!
30CX decomposition
Goal: make (some norm of) A-CX small, where C consists of c columns of A.
Theory: for any matrix A, we can find C such
that the norm of A-CX is almost equal to the norm
of $A - A_k$, with $c \approx k$.
31CX decomposition
C consists of c columns of A.
Easy to prove that the optimal X is $X = C^+ A$. ($C^+$ is the
Moore-Penrose pseudoinverse of C.) Thus, the
challenging part is to find good columns (SNPs)
of A to include in C. From a mathematical
perspective, this is a hard combinatorial problem.
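Once the columns are fixed, computing X is a least-squares problem. The sketch below shows this step only; the helper `cx_decompose`, the toy matrix, and the arbitrary column choice are assumptions for illustration (choosing good columns is the hard part discussed in the slides).

```python
import numpy as np

def cx_decompose(A, cols):
    """Given column indices, return C = A[:, cols] and the optimal X = pinv(C) A."""
    C = A[:, cols]
    X = np.linalg.pinv(C) @ A      # Moore-Penrose pseudoinverse of C, applied to A
    return C, X

rng = np.random.default_rng(2)
A = rng.choice([-1.0, 0.0, 1.0], size=(12, 8))   # toy individuals-by-SNPs matrix
cols = [0, 3, 5]                                 # arbitrary illustrative choice of SNPs
C, X = cx_decompose(A, cols)

residual = np.linalg.norm(A - C @ X, "fro")
print("||A - CX||_F =", round(residual, 4))
```

The selected columns themselves are reproduced exactly by CX; the residual comes entirely from the columns that were not selected.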
32A theorem
Drineas et al (2008) SIAM J Mat Anal Appl
Given an m-by-n matrix A, there exists an
algorithm that picks, in expectation, at most
$O(k \log k / \epsilon^2)$ columns of A, runs in $O(mn^2)$ time,
and with probability at least $1 - 10^{-20}$ satisfies
$\|A - CX\|_F \le (1+\epsilon)\,\|A - A_k\|_F$.
33The CX algorithm
Input: m-by-n matrix A, target rank k, number of columns c
Output: C, the matrix consisting of the selected columns
- CX algorithm
  - Compute probabilities $p_j$ summing to 1.
  - For each j = 1, 2, ..., n, pick the j-th column of A with probability $\min\{1, c\,p_j\}$.
  - Let C be the matrix consisting of the chosen columns.
  - (C has, in expectation, at most c columns.)
35Subspace sampling (Frobenius norm)
$V_k$: orthogonal matrix containing the top k right
singular vectors of A. $\Sigma_k$: diagonal matrix
containing the top k singular values of A.
Remark: The rows of $V_k^T$ are orthonormal vectors,
but its columns $(V_k^T)^{(i)}$ are not.
Subspace sampling in $O(mn^2)$ time: $p_j = \|(V_k^T)^{(j)}\|_2^2 / k$
- Leverage scores: $\|(V_k^T)^{(j)}\|_2^2$ (many references in the statistics community)
- Normalization: the factor 1/k, so that the $p_j$ sum up to 1
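The sampling step can be sketched as follows. This assumes the probabilities are the normalized leverage scores, i.e. the squared Euclidean norms of the columns of $V_k^T$ divided by k, and that each column is kept independently with probability min{1, c p_j}; the function names and the toy matrix are illustrative.

```python
import numpy as np

def leverage_probs(A, k):
    """Normalized leverage scores of the columns of A w.r.t. the top-k right singular subspace."""
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    Vk_t = Vt[:k, :]                          # k-by-n: rows are orthonormal, columns are not
    return np.sum(Vk_t ** 2, axis=0) / k      # squared column norms, divided by k

def sample_columns(A, k, c, seed=0):
    """Keep column j independently with probability min(1, c * p_j)."""
    p = leverage_probs(A, k)
    rng = np.random.default_rng(seed)
    keep = rng.random(A.shape[1]) < np.minimum(1.0, c * p)
    return np.flatnonzero(keep), p

rng = np.random.default_rng(3)
A = rng.choice([-1.0, 0.0, 1.0], size=(10, 20))   # toy individuals-by-SNPs matrix
cols, p = sample_columns(A, k=3, c=6)
print("probabilities sum to", round(p.sum(), 6))
print("sampled columns:", cols)
```

Because each column is an independent coin flip, the number of selected columns varies from run to run; its expectation is at most c.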
37Deterministic variant of CX (contd)
Paschou et al (2007) PLoS Genetics; Mahoney and Drineas (2009) PNAS
Input: m-by-n matrix A, integer k, and c (number of SNPs to pick)
Output: the selected SNPs; we will call them PCA Informative Markers, or PCAIMs
- CX algorithm
  - Compute the scores $p_j$.
  - Pick the columns (SNPs) corresponding to the top c scores.
In order to estimate k for SNP data, we developed
a permutation-based test to determine whether a
certain principal component is significant or
not. (A similar test was presented in Patterson
et al (2006) PLoS Genetics.)
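A minimal sketch of the deterministic variant, under the assumption that the scores are the normalized leverage scores of the columns with respect to the top k right singular vectors. The function `select_pcaims` and the synthetic two-population data (five SNPs that differ between the groups, the rest mostly uninformative) are made up for illustration.

```python
import numpy as np

def select_pcaims(A, k, c):
    """Return the indices of the c highest-scoring columns and all scores."""
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    scores = np.sum(Vt[:k, :] ** 2, axis=0) / k
    top = np.argsort(scores)[::-1][:c]        # indices sorted by decreasing score
    return top, scores

rng = np.random.default_rng(0)
m, n_informative, n_noise = 40, 5, 25

# Two synthetic populations of 20 individuals each.
labels = np.repeat([1.0, -1.0], m // 2)
# Informative SNPs: one population is +1, the other -1.
informative = np.tile(labels[:, None], (1, n_informative))
# Uninformative SNPs: mostly 0, occasionally +1 or -1, independent of population.
noise = rng.choice([-1.0, 0.0, 1.0], size=(m, n_noise), p=[0.1, 0.8, 0.1])
A = np.hstack([informative, noise])

top, scores = select_pcaims(A, k=1, c=5)
print("selected SNP indices:", sorted(top.tolist()))
```

In this toy example the five informative SNPs are identical copies of each other and all receive the same high score, which previews the redundancy issue with correlated SNPs.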
38Worldwide data
[Map of sampled populations, grouped by region (Africa, Europe, E Asia, America): European Americans, South Altaians, Spanish, African Americans, Japanese, Chinese, Puerto Rico, Nahua, Mende, Mbuti, Mala, Burunge, Quechua.]
274 individuals, 12 populations, 10,000 SNPs using the Affymetrix array. Shriver et al (2005) Human Genomics
39Selecting PCA-correlated SNPs for individual assignment to four continents (Africa, Europe, Asia, America)
[Figure: PCA-scores of the SNPs by chromosomal order, with the top 30 PCA-correlated SNPs marked; individuals are grouped as Africa, Europe, Asia, and America.]
Paschou et al (2007) PLoS Genetics
40Correlation coefficient between true and
predicted membership of an individual to a
particular geographic continent. (Use a subset
of SNPs, cluster the individuals using k-means.)
Paschou et al (2007) PLoS Genetics
41 Cross-validation on HapMap data
Paschou et al (2007) PLoS Genetics
42
- 9 indigenous populations from four different continents (Africa, Europe, Asia, Americas)
- All SNPs and 10 principal components: perfect clustering!
- 50 PCAIMs SNPs: almost perfect clustering.
43The Human Genome Diversity Panel
Approx. 1000 individuals, 650K Illumina Array. Li et al (2008) Science
44Highest scoring genes in HGDP dataset
Gene Function (RefSeq)
EDAR Ectodermal development, hair follicle formation.
PTK6 Intracellular signal transducer in epithelial tissues. Sensitization of cells to epidermal growth factor.
GALNT13 Initiates O-linked glycosylation of mucins.
SPATA20 Associated with spermatogenesis.
MCHR1 Plasma membrane protein which binds melanin-concentrating hormone. Probably involved in the neuronal regulation of food consumption.
FOXP1 Forkhead box transcription factors play important roles in the regulation of tissue- and cell type-specific gene transcription during both development and adulthood.
PSCD3 Involved in the control of Golgi structure and function.
CNTNAP2 Member of the neurexin family which functions in the vertebrate nervous system as cell adhesion molecules and receptors.
OCA2 Skin/Hair/Eye pigmentation.
EGFR This protein is a receptor for members of the epidermal growth factor family. Associated with the melanin pathway.
Barreiro et al (2008) Nat Genet, Sabeti et al
(2007) Nature, The International HapMap
Consortium (2007) Nature
45 A problem with the CX decomposition
Input: m-by-n matrix A, integer k, and c (number of SNPs to pick)
Output: the selected PCA Informative Markers, or PCAIMs
- CX algorithm
  - Compute the scores $p_j$.
  - Pick the columns (SNPs) corresponding to the top c scores.
Problem: Highly correlated SNPs (a.k.a. SNPs
that are in LD) get similar high scores, and
thus the deterministic variant would select
redundant SNPs. How do we remove this redundancy?
48Column Subset Selection Problem (CSSP)
Definition: Given an m-by-c matrix A, find k
columns of A forming an m-by-k matrix C that are
maximally uncorrelated (a.k.a., select maximally
uncorrelated SNPs).
Metric of correlation: A common formulation is to
select a set of SNPs that span a parallelepiped of
maximal volume. This formulation is NP-hard (i.e.,
intractable even for small values of m, c, and k).
Significant prior work: The CSSP has been
studied in the Numerical Linear Algebra
community, and many provably accurate
approximation algorithms exist.
49Prior work on the CSSP: 1965-2000
Boutsidis, Mahoney, and Drineas (2009) SODA, under review in Num Math
50The greedy QR algorithm
We use a standard greedy approach (the
Rank-Revealing QR factorization). The algorithm
performs k iterations:
- In the first iteration, the top PCAIM is picked.
- In the second iteration, a PCAIM is picked that is as uncorrelated as possible with the previously selected PCAIM.
- In the third iteration, the chosen PCAIM has to be as uncorrelated as possible with the first two previously selected PCAIMs.
- And so on.
Efficient implementations are available, and run
in minutes for typical values of m, c, and k.
Paschou et al (2008) PLoS Genetics
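The greedy idea can be sketched with a column-pivoted Gram-Schmidt loop, which makes the same column choices as a QR factorization with column pivoting. This is an illustrative stand-in, not the implementation used in the cited work: the function `greedy_select` is hypothetical, and it starts from the column of largest norm rather than from the top-scoring PCAIM. In practice a library routine such as `scipy.linalg.qr(..., pivoting=True)` would likely be used instead.

```python
import numpy as np

def greedy_select(A, k, tol=1e-10):
    """Greedily pick up to k columns, each as far as possible from the span of those already picked."""
    R = np.array(A, dtype=float)              # residuals of all columns; working copy
    chosen = []
    for _ in range(k):
        norms = np.linalg.norm(R, axis=0)
        j = int(np.argmax(norms))             # column with the largest remaining residual
        if norms[j] < tol:                    # nothing new left to add
            break
        chosen.append(j)
        q = R[:, j] / norms[j]
        R -= np.outer(q, q @ R)               # remove the direction q from every column
    return chosen

rng = np.random.default_rng(4)
B = rng.standard_normal((20, 3))
# Columns 0 and 1 are identical (perfectly correlated SNPs); columns 2 and 3 are distinct.
A = np.column_stack([B[:, 0], B[:, 0], B[:, 1], B[:, 2]])

picked = greedy_select(A, k=3)
print("picked columns:", picked)
```

Once one of the two identical columns is picked, the other has zero residual and is never selected, which is exactly the redundancy removal the slide describes.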
51A European American sample
- Datasets
  - CHORI dataset
    - 980 European Americans, 300,000 SNPs (Illumina chip)
    - Simon et al (2006) Am J Cardiol; Albert et al (2001) JAMA
  - Coriell dataset
    - 541 European Americans, 300,000 SNPs (Illumina chip)
    - Fung et al (2006) LANCET
  - HapMap
    - 90 Yoruba (YRI), 90 CEPH (CEU), 90 Han Chinese and Japanese (CHB-JPT)
Paschou et al (2008) PLoS Genetics
52
[Figure: PCA plot of European-American populations and HapMap 2 populations, with clusters labeled Europeans, Africans, and Asians. (The top three eigenSNPs are presented.)]
Paschou et al (2008) PLoS Genetics
54POPRES - The Population Reference Sample
- 6,000 individuals of European, African-American, East Asian, South Asian, and Mexican origin
- Genotyped with Affy 500K array set
- Nelson et al (2008) Am J Hum Genet
- 3,192 Europeans
- Correlation between genetic structure and geographic origin
  - Lao et al (2008) Current Biology
  - Novembre et al (2008) Nature
- We analyzed 1,387 individuals from Novembre et al.
  - Randomly included only 200 UK and 125 Swiss French (to even out sample sizes)
  - Excluded
    - Europeans not sampled in Europe
    - Putative relatives
    - Outliers in preliminary PCA
56Conclusions
- Using linear algebraic techniques (e.g., matrix decompositions) we selected markers that capture population structure.
- Our technique requires no prior assumptions and builds upon the power of SVD and PCA to identify population structure in various settings, including admixed populations.
- Prior theoretical work and mathematical understanding of the underlying problem were fundamental in designing our algorithm!
57Future research
- Unsupervised dimensionality reduction techniques are NOT successful in separating cases from controls in GWAS studies.
  - Why? Because the disease signal is too weak.
- Potential remedies?
  - Supervised techniques (Fisher Discriminant Analysis or LDA, etc.).
  - Sparse approximations for regression problems.
- Goal?
  - Design a global test that may help uncover effects of gene-gene interactions in disease risk.
58Acknowledgements
- Collaborators
  - P. Paschou, Democritus University, Greece
  - E. Ziv, UCSF
  - E. Burchard, UCSF
  - K. K. Kidd, Yale University
  - M. Shriver, Penn State
  - R. Krauss, Oakland Research Institute
- Students
  - Asif Javed, RPI (now at IBM)
  - Jamey Lewis, RPI
Funding: NSF (Drineas), Tourette Syndrome Association (Paschou), NIH (Ziv)
- P. Paschou, M. W. Mahoney, A. Javed, J. Kidd, A. Pakstis, S. Gu, K. Kidd, and P. Drineas. (2007) Intra- and inter-population genotype reconstruction from tagging SNPs, Genome Research, 17(1), pp. 96-107.
- P. Paschou, E. Ziv, E. Burchard, S. Choudhry, W. Rodriguez-Cintron, M. W. Mahoney, and P. Drineas. (2007) PCA-correlated SNPs for structure identification in worldwide human populations, PLoS Genetics, 3(9), pp. 1672-1686.
- P. Paschou, P. Drineas, J. Lewis, C. Nievergelt, D. Nickerson, J. Smith, P. Ridker, D. Chasman, R. Krauss, and E. Ziv. (2008) Tracing sub-structure in the European American population with PCA-informative markers, PLoS Genetics, 4(7), pp. 1-13.
- M. W. Mahoney and P. Drineas. (2009) CUR matrix decompositions for improved data analysis, Proceedings of the National Academy of Sciences, 106(3), pp. 697-702.