Dimensionality Reduction in the Analysis of Human Genetics Data

1
Dimensionality Reduction in the Analysis of Human
Genetics Data
Petros Drineas Rensselaer Polytechnic
Institute Computer Science Department
To access my web page
drineas
2
Human genetic history
Much of the biological and evolutionary history
of our species is written in our DNA
sequences. Population genetics can help
translate that historical message.
The genetic variation among humans is a small
portion of the human genome. All humans are
more than 99.9% identical.
3
Our objective
Fact: Dimensionality reduction techniques (such
as Principal Components Analysis, PCA) separate
different populations and result in plots that
correlate well with geography.
4
Our objective

Our goal:
Based on this observation, we seek unsupervised,
efficient algorithms for the selection of a small
set of genetic markers that can be used to
capture population structure, and
predict individual ancestry.

5
The math behind

To this end, we employ matrix algorithms and
matrix decompositions such as
the Singular Value Decomposition (SVD), and
the CX decomposition.
We provide novel, unsupervised algorithms for
selecting ancestry informative markers.

6
Overview

Background
The Singular Value Decomposition (SVD)
The CX decomposition
Removing redundant markers (Column Subset
Selection Problem)
Selecting Ancestry Informative Markers
A worldwide set of populations
Admixed European-American populations
The POPulation REference Sample (POPRES)

7
Single Nucleotide Polymorphisms (SNPs)
Single Nucleotide Polymorphisms (SNPs) are the most
common type of genetic variation in the genome across
different individuals. They are known locations
in the human genome where two alternate
nucleotide bases (alleles) are observed (out of
A, C, G, T).
[Figure: genotype matrix with one row per individual and one column per SNP; each entry is a genotype such as AG, CT, GT, GG, CC, or AA.]
There are 10 million SNPs in the human genome,
so this matrix could have 10 million columns.
8
Our data as a matrix
[Figure: the genotype matrix (individuals by SNPs), shown next to its numeric encoding with entries in {-1, 0, 1}.]
Encoding example, for a SNP whose alleles include G: homozygous for the other allele is 1, heterozygous (one copy of G) is 0, and GG is -1.
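The encoding on this slide can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the function name `encode_genotypes`, the toy genotypes, and the choice of which allele counts as the +1 homozygote (`ref_alleles`) are all assumptions made for the example.

```python
import numpy as np

def encode_genotypes(genotypes, ref_alleles):
    """Encode two-letter genotypes as +1 / 0 / -1.

    genotypes:   list of rows (one per individual), each a list of strings like "AG".
    ref_alleles: one allele per SNP; two copies of it map to +1 (hypothetical choice).
    """
    encoded = np.zeros((len(genotypes), len(ref_alleles)), dtype=int)
    for i, row in enumerate(genotypes):
        for j, g in enumerate(row):
            copies = g.count(ref_alleles[j])  # 0, 1, or 2 copies of the chosen allele
            encoded[i, j] = copies - 1        # 2 -> +1, 1 -> 0, 0 -> -1
    return encoded

# Toy data: 3 individuals (rows) by 3 SNPs (columns).
genotypes = [["AG", "CT", "GG"],
             ["AA", "CC", "GG"],
             ["GG", "TT", "GT"]]
ref_alleles = ["A", "C", "T"]
A = encode_genotypes(genotypes, ref_alleles)
print(A)
```

Which homozygote gets +1 is arbitrary for the methods that follow: flipping it only flips the sign of that column.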
9
Forging population variation structure

Genetic diversity and population (sub)structure
are caused by:
Mutation
Mutations are changes to the base pair sequence
of the DNA.
Natural selection
Genotypes that correspond to favorable traits and
are heritable become more common in successive
generations of a population of reproducing
organisms.
Mutations increase genetic diversity.
Under natural selection, beneficial mutations
increase in frequency, and vice versa.

10
Forging population variation structure

Genetic drift
Sampling effects on evolution
Example: say that the RAF of a SNP in a small
population is p.
The offspring generation would (in expectation)
have a RAF of p as well for the same SNP.
In reality, it will have a RAF of p' (a drifted
frequency).
Gene flow
Transfer of alleles between populations
(immigration)
Non-random mating
Reduces interaction between (sub)populations.
Other demographic events
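The drift example above can be illustrated with a small simulation. The sketch below assumes a simple binomial resampling model (each allele copy in the next generation is drawn independently with the current frequency p); the model, the function name `simulate_drift`, and the population sizes are assumptions for illustration and are not taken from the slides.

```python
import numpy as np

def simulate_drift(p0, n_individuals, n_generations, rng):
    """Track the allele frequency of one SNP under pure binomial resampling."""
    n_alleles = 2 * n_individuals          # diploid: two allele copies per individual
    p = p0
    freqs = [p]
    for _ in range(n_generations):
        # Expected frequency stays p, but the realized frequency p' fluctuates.
        p = rng.binomial(n_alleles, p) / n_alleles
        freqs.append(p)
    return np.array(freqs)

rng = np.random.default_rng(0)
small = simulate_drift(0.5, n_individuals=20, n_generations=50, rng=rng)
large = simulate_drift(0.5, n_individuals=20000, n_generations=50, rng=rng)
print("small population, final frequency:", small[-1])
print("large population, final frequency:", large[-1])
```

The per-generation fluctuation shrinks as the population grows, which is why the slide singles out a small population.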

11
Dimensionality reduction
[Figure: the numerically encoded genotype matrix (individuals by SNPs), with entries in {-1, 0, 1}.]

Mutation, natural selection, population history,
etc. result in significant structure in the data.
This structure can be extracted via
dimensionality reduction techniques.
In most cases, unsupervised techniques suffice!
In most cases, linear dimensionality reduction
techniques (PCA) suffice!

12
Why study population structure?

Mapping causative genes for common complex
disorders
(e.g. diabetes, heart conditions, obesity, etc.)
History of human populations
Genealogy
Forensics
Conservation genetics

13
Population stratification

Definition
Correlation between subpopulations in the
case/control samples and the phenotype under
investigation.
Effects
Confounds the study, typically leading to false
positive correlations.
Solution
The problem can be addressed either by careful
sample collection or by statistical
post-processing of the results (Price et al
(2006) Nat Genet).

14
Population stratification (contd)
[Figure: cases and controls drawn from Population 1 and Population 2, with genotypes AA, AC, and CC.]
Example of the confounding effects of population
stratification in an association study.
Marchini et al (2004) Nat Genet
15
Recall our objective

Develop unsupervised, efficient algorithms for
the selection of a small set of SNPs that can be
used to
capture population structure, and
predict individual ancestry.
Why? Cost efficiency.
Let's discuss (briefly) prior work.

16
Inferring population structure
[Figure: inferred population structure, with groups labeled Oceania, Africa, Europe, Central Asia, East Asia, Middle East, and America.]
377 STRPs, Rosenberg et al (2004) Science

Examples of available algorithms/software
packages
STRUCTURE Pritchard et al (2000) Genetics
FRAPPE Li et al (2008) Science

17
Selecting ancestry informative markers

Existing methods (Fst, Informativeness, d)
Rosenberg et al (2003) Am J Hum Genet
Allele frequency based.
Require prior knowledge of individual ancestry
(supervised).
Such knowledge may not be available.
(e.g., populations of complex ancestry, large
multi-centered studies of anonymous samples,
etc.)
Unsupervised feature selection techniques are
often preferable because they tend to not overfit
the data.

18
The Singular Value Decomposition (SVD)
Let A be a matrix with m rows (one for each
subject) and n columns (one for each SNP).
Matrix rows are points (vectors) in a
Euclidean space. E.g., given 2 objects (x and d),
each described with respect to two features, we
get a 2-by-2 matrix. Two objects are close if
the angle between their corresponding vectors is
small.
19
SVD, intuition
Let the blue circles represent m data points in a
2-D Euclidean space. Then, the SVD of the m-by-2
matrix of the data will return
20
Singular values
$\sigma_1$ measures how much of the data variance is
explained by the first singular vector.
$\sigma_2$ measures how much of the data variance is
explained by the second singular vector.
21
SVD formal definition
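$A = U \Sigma V^T$, where A is m-by-n, U is m-by-$\rho$, $\Sigma$ is $\rho$-by-$\rho$, and $V^T$ is $\rho$-by-n.
$\rho$: rank of A.
U (V): orthogonal matrix containing the left (right) singular vectors of A.
$\Sigma$: diagonal matrix containing the singular values of A.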
22
Rank-k approximations via the SVD
[Figure: A (objects by features) factored as U, $\Sigma$, and $V^T$; the leading singular components are marked significant and the remaining ones noise.]
23
Rank-k approximations (Ak)

$A_k = U_k \Sigma_k V_k^T$
$U_k$ ($V_k$): orthogonal matrix containing the top k
left (right) singular vectors of A.
$\Sigma_k$: diagonal matrix containing the top k singular
values of A.
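A rank-k approximation built from the top k singular triplets can be sketched with NumPy as follows. The random test matrix and the helper name `rank_k_approximation` are illustrative assumptions; the check at the end uses the standard fact that the Frobenius error of the truncated SVD equals the root of the sum of the discarded squared singular values.

```python
import numpy as np

def rank_k_approximation(A, k):
    """Return U_k * Sigma_k * V_k^T, built from the top k singular triplets of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 5))        # hypothetical 8 objects by 5 features
A2 = rank_k_approximation(A, 2)

s = np.linalg.svd(A, compute_uv=False)
err = np.linalg.norm(A - A2, "fro")
print("rank of A_2:", np.linalg.matrix_rank(A2))
print("error:", err, "discarded singular values:", np.sqrt(np.sum(s[2:] ** 2)))
```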

24
PCA and SVD
Principal Components Analysis (PCA) essentially
amounts to the computation of the Singular Value
Decomposition (SVD) of a covariance matrix. SVD
is the algorithmic tool behind MultiDimensional
Scaling (MDS) and Factor Analysis.
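The link between PCA and the SVD can be checked numerically. The sketch below uses simulated genotypes for two hypothetical groups (the allele frequencies, group sizes, and seed are assumptions, not data from the talk), mean-centers each SNP column, and compares the SVD of the centered matrix with the eigenvalues of the covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: two groups of 30 individuals whose allele
# frequencies differ slightly at each of 200 SNPs.
freq_a = rng.uniform(0.1, 0.9, size=200)
freq_b = np.clip(freq_a + rng.normal(0, 0.2, size=200), 0.05, 0.95)
pop_a = rng.binomial(2, freq_a, size=(30, 200)) - 1   # entries in {-1, 0, 1}
pop_b = rng.binomial(2, freq_b, size=(30, 200)) - 1
A = np.vstack([pop_a, pop_b]).astype(float)

# PCA through the SVD of the mean-centered data matrix.
Ac = A - A.mean(axis=0)
U, s, Vt = np.linalg.svd(Ac, full_matrices=False)
pcs = U * s          # coordinates of each individual along the principal components

# Same spectrum from the covariance matrix of the SNP columns.
cov = Ac.T @ Ac / (A.shape[0] - 1)
eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]

print("top eigenvalues of covariance:", eigvals[:3])
print("squared singular values / (m - 1):", s[:3] ** 2 / (A.shape[0] - 1))
```

Plotting the first two columns of `pcs` is the kind of picture the talk refers to when it says PCA separates populations.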
25
The data
[Figure: the sampled populations: European Americans, South Altaians, Spanish, African Americans, Japanese, Chinese, Puerto Rico, Nahua, Mende, Mbuti, Mala, Burunge, and Quechua, with region labels Africa, Europe, E Asia, and America.]
274 individuals from 12 populations genotyped on 10,000 SNPs.
Shriver et al (2005) Human Genomics
26
[Figure: plot with groups labeled America, Africa, Asia, and Europe.]
Paschou et al (2007) PLoS Genetics
27
[Figure: plot with groups labeled America, Africa, Asia, and Europe.]
Not altogether satisfactory: the singular vectors
are linear combinations of all SNPs and, of
course, cannot be assayed! Can we find actual
SNPs that capture the information in the singular
vectors? (E.g., spanning the same subspace.)
Paschou et al (2007) PLoS Genetics
28
SVD decomposes a matrix as $A \approx U_k X$, where $U_k$ contains the
top k left singular vectors of A and
$X = U_k^T A = \Sigma_k V_k^T$.
The SVD has strong optimality properties.
The columns of $U_k$ are linear combinations of up
to all columns of A.

29
CX decomposition
Goal: make (some norm) of $A - CX$ small.
C: c columns of A.
Why? If A is a subject-SNP matrix, then
selecting representative columns is equivalent to
selecting representative SNPs to capture the same
structure as the top eigenSNPs. We want c as
small as possible!
30
CX decomposition
Goal: make (some norm) of $A - CX$ small.
C: c columns of A.
Theory: for any matrix A, we can find C such
that $\|A - CX\|$ is almost equal to $\|A - A_k\|$ with
$c \approx k$.
31
CX decomposition
C: c columns of A.
Easy to prove that the optimal X is $X = C^+ A$. ($C^+$ is the
Moore-Penrose pseudoinverse of C.) Thus, the
challenging part is to find good columns (SNPs)
of A to include in C. From a mathematical
perspective, this is a hard combinatorial problem.
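Once the columns are fixed, computing the best X is a least-squares problem. The following sketch uses an arbitrary random matrix and an arbitrary column choice (both assumptions for illustration) to form X from the pseudoinverse of C and compare its residual with that of a perturbed X.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((10, 6))
cols = [0, 3, 4]                      # hypothetical choice of columns
C = A[:, cols]

# Optimal X for fixed C: pseudoinverse of C times A.
X = np.linalg.pinv(C) @ A
residual = np.linalg.norm(A - C @ X, "fro")

# Any other X of the same shape should do no better.
X_other = X + 0.1 * rng.standard_normal(X.shape)
residual_other = np.linalg.norm(A - C @ X_other, "fro")

print("residual with pinv(C) @ A:", residual)
print("residual with perturbed X:", residual_other)
```

The selected columns themselves are reproduced exactly by C @ X, so all of the error comes from the columns that were left out.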
32
A theorem
Drineas et al (2008) SIAM J Mat Anal Appl
Given an m-by-n matrix A, there exists an
algorithm that picks, in expectation, at most
$O(k \log k / \epsilon^2)$ columns of A, runs in $O(mn^2)$ time,
and with probability at least $1 - 10^{-20}$ satisfies
$\|A - CX\|_F \le (1 + \epsilon) \|A - A_k\|_F$.
33
The CX algorithm
Input: m-by-n matrix A, target rank k, number of
columns c
Output: C, the matrix consisting of the
selected columns

CX algorithm
Compute probabilities $p_j$ summing to 1
For each j = 1, 2, ..., n, pick the j-th column of A
with probability $\min\{1, c p_j\}$
Let C be the matrix consisting of the chosen
columns
(C has in expectation at most c columns)
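The sampling step above can be sketched as follows. The function name `cx_sample_columns` is hypothetical, and the probabilities used in the demo (proportional to squared column norms) are only a placeholder chosen for this sketch, since this slide does not yet say how the p_j are computed.

```python
import numpy as np

def cx_sample_columns(A, p, c, rng):
    """Keep column j independently with probability min(1, c * p[j])."""
    keep_prob = np.minimum(1.0, c * np.asarray(p))
    chosen = np.flatnonzero(rng.random(A.shape[1]) < keep_prob)
    return A[:, chosen], chosen

rng = np.random.default_rng(3)
A = rng.standard_normal((20, 50))

# Placeholder probabilities (an assumption): squared column norms, normalized to sum to 1.
p = np.sum(A ** 2, axis=0)
p = p / p.sum()

C, chosen = cx_sample_columns(A, p, c=10, rng=rng)
expected_columns = np.minimum(1.0, 10 * p).sum()
print("columns kept:", len(chosen), "expected at most:", expected_columns)
```

Because each column is kept independently, the number of columns in C is random; only its expectation is bounded by c.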

34
Subspace sampling (Frobenius norm)
$V_k$: orthogonal matrix containing the top k right
singular vectors of A.
$\Sigma_k$: diagonal matrix containing the top k singular values of A.
Remark: The rows of $V_k^T$ are orthonormal vectors,
but its columns $(V_k^T)^{(i)}$ are not.
35
Subspace sampling (Frobenius norm)
Subspace sampling in $O(mn^2)$ time:
$p_j = \|(V_k^T)^{(j)}\|_2^2 / k$
The squared column norms are the leverage scores (many references in the
statistics community).
The division by k is a normalization s.t. the $p_j$ sum up to 1.
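Normalized leverage scores of this kind can be computed directly from the SVD. The sketch below assumes the scores are the squared Euclidean norms of the columns of V_k^T divided by k, so that they sum to 1; the helper name `leverage_scores` and the random test matrix are illustrative.

```python
import numpy as np

def leverage_scores(A, k):
    """Normalized leverage scores of the columns of A w.r.t. the top-k right singular subspace."""
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    Vk_t = Vt[:k, :]                          # k-by-n; its rows are orthonormal
    return np.sum(Vk_t ** 2, axis=0) / k      # squared column norms, scaled to sum to 1

rng = np.random.default_rng(4)
A = rng.standard_normal((30, 12))
p = leverage_scores(A, k=3)
print("scores:", np.round(p, 3))
print("sum of scores:", p.sum())
```

The scores sum to 1 because the k rows of V_k^T each have unit norm, so the total squared mass of the matrix is exactly k.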
36
Deterministic variant of CX
Paschou et al (2007) PLoS Genetics; Mahoney and
Drineas (2009) PNAS
Input: m-by-n matrix A, integer k, and c
(number of SNPs to pick)
Output: the selected SNPs

CX algorithm
Compute the scores $p_j$
Pick the columns (SNPs) corresponding to the top
c scores.

37
Deterministic variant of CX (contd)
Paschou et al (2007) PLoS Genetics; Mahoney and
Drineas (2009) PNAS
We will call the selected SNPs PCA Informative Markers,
or PCAIMs.

In order to estimate k for SNP data, we developed
a permutation-based test to determine whether a
certain principal component is significant or
not. (A similar test was presented in Patterson
et al (2006) PLoS Genetics.)
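The deterministic variant (score every SNP, keep the top c) can be sketched as below. This is an illustration under stated assumptions, not the published implementation: the function name `select_pcaims`, the synthetic matrix, and the planted informative columns (2, 7, and 11) are all made up for the example, and k is fixed by hand rather than estimated with a permutation test.

```python
import numpy as np

def select_pcaims(A, k, c):
    """Return the indices of the c columns with the highest top-k leverage scores."""
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    scores = np.sum(Vt[:k, :] ** 2, axis=0) / k
    top = np.argsort(scores)[::-1][:c]        # indices sorted by decreasing score
    return top, scores

rng = np.random.default_rng(5)
m, n = 40, 20
A = 0.01 * rng.standard_normal((m, n))        # mostly uninformative columns

# Plant structure in three columns (hypothetical "informative SNPs").
u1 = np.repeat([1.0, -1.0], 20)               # splits individuals into two halves
u2 = np.tile(np.repeat([1.0, -1.0], 10), 2)   # a second, orthogonal split
A[:, 2] += u1
A[:, 7] += u2
A[:, 11] -= u1

top, scores = select_pcaims(A, k=2, c=3)
print("selected columns:", sorted(top.tolist()))
```

Columns 2 and 11 carry the same split with opposite sign, so both receive high scores; this is the redundancy that a later slide addresses.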
38
Worldwide data
[Figure: the sampled populations shown earlier, with region labels Africa, Europe, E Asia, and America.]
274 individuals, 12 populations, 10,000 SNPs
using the Affymetrix array.
Shriver et al (2005) Human Genomics
39
Selecting PCA-correlated SNPs for individual
assignment to four continents (Africa, Europe,
Asia, America)
[Figure: PCA scores plotted for SNPs in chromosomal order, with the top 30 PCA-correlated SNPs marked; labels: Africa, Europe, Asia, America.]
Paschou et al (2007) PLoS Genetics
40
Correlation coefficient between true and
predicted membership of an individual to a
particular geographic continent. (Use a subset
of SNPs, cluster the individuals using k-means.)
Paschou et al (2007) PLoS Genetics
41
Cross-validation on HapMap data
Paschou et al (2007) PLoS Genetics
42

9 indigenous populations from four different
continents (Africa, Europe, Asia, Americas)
All SNPs and 10 principal components: perfect
clustering!
50 PCAIMs: almost perfect clustering.

43
The Human Genome Diversity Panel: approx. 1000
individuals, 650K Illumina array.
Li et al (2008) Science
44
Highest scoring genes in HGDP dataset
Gene Function (RefSeq)
EDAR Ectodermal development, hair follicle formation.
PTK6 Intracellular signal transducer in epithelial tissues. Sensitization of cells to epidermal growth factor.
GALNT13 Initiates O-linked glycosylation of mucins.
SPATA20 Associated with spermatogenesis.
MCHR1 Plasma membrane protein which binds melanin-concentrating hormone. Probably involved in the neuronal regulation of food consumption.
FOXP1 Forkhead box transcription factors play important roles in the regulation of tissue- and cell type-specific gene transcription during both development and adulthood.
PSCD3 Involved in the control of Golgi structure and function.
CNTNAP2 Member of the neurexin family which functions in the vertebrate nervous system as cell adhesion molecules and receptors.
OCA2 Skin/Hair/Eye pigmentation.
EGFR This protein is a receptor for members of the epidermal growth factor family. Associated with the melanin pathway.
Barreiro et al (2008) Nat Genet, Sabeti et al
(2007) Nature, The International HapMap
Consortium (2007) Nature
45
A problem with the CX decomposition
Input: m-by-n matrix A, integer k, and c (number
of SNPs to pick)
Output: the selected PCA Informative Markers, or PCAIMs

CX algorithm
Compute the scores $p_j$
Pick the columns (SNPs) corresponding to the top
c scores.

Problem: Highly correlated SNPs (i.e., SNPs
that are in linkage disequilibrium, LD) get similarly high scores, and
thus the deterministic variant would select
redundant SNPs. How do we remove this redundancy?
46
Column Subset Selection Problem (CSSP)
Definition: Given an m-by-c matrix A, find k
columns of A forming an m-by-k matrix C that are
maximally uncorrelated (i.e., select maximally
uncorrelated SNPs).
47
Column Subset Selection Problem (CSSP)
Metric of correlation: A
common formulation is to select a set of SNPs
that span a parallelepiped of maximal
volume. This formulation is NP-hard (i.e.,
intractable even for small values of m, c, and k).
48
Column Subset Selection Problem (CSSP)
Significant prior work: The CSSP has been
studied in the Numerical Linear Algebra
community, and many provably accurate
approximation algorithms exist.
49
Prior work on the CSSP, 1965-2000
Boutsidis, Mahoney, and Drineas (2009) SODA, under review in
Num Math
50
The greedy QR algorithm
We use a standard greedy approach (the
Rank-Revealing QR factorization). The algorithm
performs k iterations:
In the first iteration, the top PCAIM is picked.
In the second iteration, a PCAIM is picked that is as
uncorrelated with the previously selected
PCAIM as possible.
In the third iteration, the chosen PCAIM has to be as
uncorrelated as possible with the first two
previously selected PCAIMs.
And so on.
Efficient implementations are available, and run
in minutes for typical values of m, c, and k.
Paschou et al (2008) PLoS Genetics
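The greedy idea can be sketched with a plain column-pivoted Gram-Schmidt loop. This is a simplified stand-in and not a full rank-revealing QR factorization: the function name `greedy_column_selection` and the tiny test matrix are assumptions, and the first pick here is simply the column with the largest norm rather than the top-scoring PCAIM.

```python
import numpy as np

def greedy_column_selection(A, k):
    """Pick k columns greedily: each step takes the column with the largest
    residual norm after projecting out the columns already chosen."""
    R = np.array(A, dtype=float)                  # residual copy of A
    available = np.ones(R.shape[1], dtype=bool)
    selected = []
    for _ in range(k):
        norms = np.where(available, np.sum(R ** 2, axis=0), -1.0)
        j = int(np.argmax(norms))
        selected.append(j)
        available[j] = False
        q = R[:, j] / np.linalg.norm(R[:, j])     # unit vector along the chosen column
        R = R - np.outer(q, q @ R)                # remove that direction from every column
    return selected

# Toy matrix: columns 0 and 1 are perfectly correlated, column 2 is independent of both.
u1 = np.array([1, 1, 1, 1, -1, -1, -1, -1], dtype=float)
u2 = np.array([1, 1, -1, -1, 1, 1, -1, -1], dtype=float)
u3 = np.array([1, -1, 1, -1, 1, -1, 1, -1], dtype=float)
A = np.column_stack([3.0 * u1, 2.9 * u1, 2.0 * u2, 0.5 * u3])

picked = greedy_column_selection(A, 2)
print("selected columns:", picked)
```

After column 0 is chosen, its near-duplicate (column 1) has almost no residual left, so the second pick skips it in favor of column 2.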
51
A European American sample

Datasets
CHORI dataset
980 European Americans, 300,000 SNPs (Illumina
chip)
Simon et al (2006) Am J Cardiol Albert et al
(2001) JAMA
Coriell dataset
541 European Americans, 300,000 SNPs (Illumina
chip)
Fung et al (2006) Lancet
HapMap
90 Yoruba (YRI), 90 CEPH (CEU), 90 Han Chinese
and Japanese (CHB+JPT)

Paschou et al (2008) PLoS Genetics
52
[Figure: PCA plot of European-American populations and
HapMap 2 populations, with groups labeled Europeans, Africans, and Asians. The top three eigenSNPs
are presented.]
Paschou et al (2008) PLoS Genetics
54
POPRES - The Population Reference Sample

6,000 individuals of European, African-American,
East Asian, South Asian, and Mexican origin
Genotyped with Affy 500K array set
Nelson et al (2008) Am J Hum Genet
3,192 Europeans
Correlation between genetic structure and
geographic origin
Lao et al (2008) Current Biology
Novembre et al (2008) Nature
We analyzed 1,387 individuals from Novembre et
al.
Randomly included only 200 UK and 125 Swiss
French individuals (to even out sample sizes).
Excluded:
Europeans not sampled in Europe
Putative relatives
Outliers in preliminary PCA

56
Conclusions

Using linear algebraic techniques (e.g., matrix
decompositions) we selected markers that capture
population structure.
Our technique requires no prior assumptions and
builds upon the power of SVD and PCA to identify
population structure in various settings,
including admixed populations.
Prior theoretical work and mathematical
understanding of the underlying problem was
fundamental in designing our algorithm!

57
Future research

Unsupervised dimensionality reduction techniques
are NOT successful in separating cases from
controls in GWAS.
Why?
Because the disease signal is too weak.
Potential remedies?
Supervised techniques (Fisher Discriminant
Analysis, also known as LDA, etc.).
Sparse approximations for regression problems.
Goal?
Design a global test that may help uncover
effects of gene-gene interactions in disease risk.

58
Acknowledgements

Collaborators
P. Paschou, Democritus University, Greece
E. Ziv, UCSF
E. Burchard, UCSF
K. K. Kidd, Yale University
M. Shriver, Penn State
R. Krauss, Oakland Research Institute

Students
Asif Javed, RPI (now at IBM)
Jamey Lewis, RPI

Funding NSF (Drineas), Tourette Syndrome
Association (Paschou), NIH (Ziv)

P. Paschou, M. W. Mahoney, A. Javed, J. Kidd, A.
Pakstis, S. Gu, K. Kidd, and P. Drineas. (2007)
Intra- and inter-population genotype
reconstruction from tagging SNPs, Genome
Research, 17(1), pp. 96-107.
P. Paschou, E. Ziv, E. Burchard, S. Choudhry, W.
Rodriguez-Cintron, M. W. Mahoney, and P. Drineas.
(2007) PCA-correlated SNPs for structure
identification in worldwide human populations,
PLoS Genetics, 3(9), pp. 1672-1686.
P. Paschou, P. Drineas, J. Lewis, C. Nievergelt,
D. Nickerson, J. Smith, P. Ridker, D. Chasman, R.
Krauss, and E. Ziv. (2008) Tracing sub-structure
in the European American population with
PCA-informative markers, PLoS Genetics, 4(7), pp.
1-13.
M. W. Mahoney and P. Drineas. (2009) CUR matrix
decompositions for improved data analysis,
Proceedings of the National Academy of Sciences,
106(3), pp. 697-702.