Title: CLUSTERING
1 CLUSTERING GENE EXPRESSION DATA BASED ON P-VALUES
Rebecka Jornsten
Department of Statistics
Rutgers University
rebecka_at_stat.rutgers.edu
Joint work with Prof. Regina Liu and Jun Li
2 Gene Expression Data
N: the number of genes considered
C: the number of experimental conditions
$n_j$: the number of replicates under the j-th condition
$X_{ijk}$: the gene expression level of the k-th replicate under the j-th condition for the i-th gene

Gene   Cond1                  Cond2                  ...   Cond3
       rep1    rep2    rep3   rep1    rep2    rep3         rep1    rep2    rep3
1      0.46    0.30    0.80   1.51    0.90    1.01   ...   -0.43   0.12    -0.17
2      -0.10   0.49    0.24   0.06    0.46    0.20   ...   0.27    1.32    0.89
3      0.15    0.74    0.04   0.10    0.20    0.55   ...   -0.54   -0.42   -0.05

Example: $X_{322}$ is the entry for gene 3, condition 2, replicate 2 (0.20 in the table).
Assumption: $X_{ijk}$ iid Normal with mean $\mu_{ij}$ and SD $\sigma_{ij}$
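The data layout and the normality assumption on this slide can be sketched as a three-dimensional array. This is only an illustration: the dimensions, the seed, the values of the means and SDs, and the use of the same number of replicates under every condition are assumptions made here, not part of the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumed): N genes, C conditions, n replicates per condition.
N, C, n = 3, 3, 3

# Assumed gene-by-condition means mu_ij and SDs sigma_ij.
mu = rng.normal(0.0, 1.0, size=(N, C))
sigma = np.full((N, C), 0.3)

# X[i, j, k] ~ Normal(mu[i, j], sigma[i, j]), independently over replicates k.
X = rng.normal(loc=mu[:, :, None], scale=sigma[:, :, None], size=(N, C, n))

# Per-gene, per-condition averages (the "average data" used by some methods).
gene_means = X.mean(axis=2)
```

Note that the array is 0-indexed, so the slide's $X_{322}$ corresponds to `X[2, 1, 1]`.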
3 Gene clustering
We perform these two tasks (gene clustering and gene selection) jointly.
- Which genes are similar? Which are different?
Components in clustering analysis
1. Similarity/dissimilarity measure
2. Clustering algorithm
3. Cluster validation
Gene selection
- Which genes are significantly differentially
expressed?
Components in gene selection
1. Type of test
2. Computing p-values (based on distribution
assumptions, or permutation)
3. Adjusting p-values for multiple comparisons
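As an illustration of step 3, here is a minimal sketch of the Benjamini-Hochberg (BH) step-up adjustment, the correction used for comparison later in this talk. The function name `bh_adjust` is hypothetical; genes whose adjusted p-value falls at or below the target FDR level would be selected.

```python
import numpy as np

def bh_adjust(pvals):
    """Return Benjamini-Hochberg adjusted p-values (hypothetical helper)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Scale the i-th smallest p-value by m / i.
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p-value.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(ranked, 1.0)
    return adjusted

# Usage: select the genes with adjusted p-value at or below a 5% FDR level.
adjusted = bh_adjust([0.01, 0.04, 0.03, 0.005])
selected = adjusted <= 0.05
```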
4 Motivation
- P-value based clustering
  - Explicitly takes into account the variability of the data (the experimental setup)
  - Provides a standardized way to assess the degree of similarity between genes
  - Is less arbitrary than many of the existing choices of dissimilarity measures, such as Euclidean distance
  - Can be easily calibrated for different separating criteria
- Joint gene clustering and selection
  - Increased power by selecting clusters of genes, rather than genes one-by-one
  - Name-that-cluster: we can anchor the clusters with pseudo-genes with experimental profiles of interest
5 P-value as the dissimilarity measure
Similarity between genes (separating criterion): a hypothesis testing problem
Examples (a lot of flexibility!):
- Testing whether the gene mean vectors across experimental conditions are equal
- Testing whether gene mean and variance vectors across experimental conditions are equal
- Alt 1: assuming independent errors for all genes
- Alt 2: assuming independent errors for replicates, but allowing genes to be correlated within each replicate (pairwise tests)
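One way to write the first two example criteria as hypotheses for a gene pair (i, i'), with $\mu_{ij}$ and $\sigma_{ij}$ denoting the mean and SD of gene i under condition j. This is a sketch of the formulation only; the exact test statistics are not spelled out on the slide.

```latex
% Equal mean vectors across the C conditions
H_0:\ \mu_{ij} = \mu_{i'j} \ \text{for all } j = 1,\dots,C
\quad\text{vs.}\quad
H_1:\ \mu_{ij} \neq \mu_{i'j} \ \text{for some } j

% Equal mean and variance vectors across the C conditions
H_0:\ \mu_{ij} = \mu_{i'j} \ \text{and}\ \sigma_{ij} = \sigma_{i'j} \ \text{for all } j
\quad\text{vs.}\quad
H_1:\ \mu_{ij} \neq \mu_{i'j} \ \text{or}\ \sigma_{ij} \neq \sigma_{i'j} \ \text{for some } j
```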
6 P-value as the dissimilarity measure (ctd)
- Determine whether or not two genes are dissimilar according to some specified criterion
- Determine whether or not to reject the corresponding null hypothesis associated with the specified criterion
- Small p-values: strong evidence against similarity
- Dissimilarity measure: 1 - P-value
- What we need for clustering: apply our chosen test and P-value computation method to all pairs of genes
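The all-pairs step can be sketched as follows. The test used here is an assumption chosen for illustration: a classical F-test of equal mean vectors for two genes, with independent errors and a common error variance. It stands in for whichever test and p-value computation method is actually chosen. The function names are hypothetical, and the data are assumed to be an array of shape (genes, conditions, replicates).

```python
import numpy as np
from scipy import stats

def pair_pvalue(xa, xb):
    """F-test of equal condition means for two genes; xa, xb have shape (C, n).

    Assumes independent normal errors with a common variance (illustrative choice).
    """
    C, n = xa.shape
    # Full model: each gene has its own mean under each condition.
    rss_full = ((xa - xa.mean(axis=1, keepdims=True)) ** 2).sum() \
             + ((xb - xb.mean(axis=1, keepdims=True)) ** 2).sum()
    # Null model: the two genes share one mean under each condition.
    pooled = np.concatenate([xa, xb], axis=1)
    rss_null = ((pooled - pooled.mean(axis=1, keepdims=True)) ** 2).sum()
    df1, df2 = C, 2 * C * (n - 1)
    f_stat = ((rss_null - rss_full) / df1) / (rss_full / df2)
    return stats.f.sf(f_stat, df1, df2)

def pvalue_dissimilarity(X):
    """Return the N x N matrix of 1 - p-value over all gene pairs."""
    N = X.shape[0]
    D = np.zeros((N, N))
    for a in range(N):
        for b in range(a + 1, N):
            D[a, b] = D[b, a] = 1.0 - pair_pvalue(X[a], X[b])
    return D
```

The resulting matrix is symmetric with a zero diagonal, so it can be passed to any clustering algorithm that accepts precomputed dissimilarities.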
7 Clustering algorithms
The gene-gene P-values provide the dissimilarity measures. Now, how do we cluster the genes? Different clustering algorithms will generate different results. Some approaches:
- PAM: a global cost function that emphasizes within-cluster similarity; tends to generate equal-size clusters
- PAMsil: uses the silhouette validation criterion to cluster, taking both between-cluster and within-cluster (dis)similarities into account (van der Laan et al); greedy/aggressive
- Hierarchical clustering: more or less greedy depending on the selected linkage; generates clusters by cutting the tree at a chosen level
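To make the first approach concrete, here is a minimal sketch of PAM-style k-medoids clustering on a precomputed dissimilarity matrix (for example, 1 - P-value). It uses a greedy build step followed by first-improvement swaps, which is a simplification of the published PAM algorithm, and it does not implement PAMsil. The function names are hypothetical.

```python
import numpy as np

def total_cost(D, medoids):
    """Sum over all items of the dissimilarity to their closest medoid."""
    return D[:, medoids].min(axis=1).sum()

def pam(D, k):
    """Simplified PAM on an N x N dissimilarity matrix D; returns (medoids, labels)."""
    N = D.shape[0]
    # BUILD: start from the most central item, then add medoids greedily.
    medoids = [int(np.argmin(D.sum(axis=1)))]
    while len(medoids) < k:
        candidates = [c for c in range(N) if c not in medoids]
        costs = [total_cost(D, medoids + [c]) for c in candidates]
        medoids.append(candidates[int(np.argmin(costs))])
    # SWAP: exchange a medoid for a non-medoid whenever that lowers the cost.
    best = total_cost(D, medoids)
    improved = True
    while improved:
        improved = False
        for pos in range(k):
            for c in range(N):
                if c in medoids:
                    continue
                trial = medoids.copy()
                trial[pos] = c
                cost = total_cost(D, trial)
                if cost < best - 1e-12:
                    medoids, best, improved = trial, cost, True
    # Assign every item to its closest medoid.
    labels = D[:, medoids].argmin(axis=1)
    return np.array(medoids), labels
```

Because the cost is a sum over all items, large clusters dominate it, which is consistent with the remark that PAM tends to produce clusters of similar size.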
8 Cluster validation: how many clusters?
This is a difficult problem for noisy gene expression data! Some approaches:
- Silhouette width: P-values are already standardized, so our silhouette (for gene i) corresponds to the difference between the average P-value from gene i to the members of its own cluster and the average P-value from gene i to the nearest competing cluster. We then select the number of clusters to maximize the non-standardized silhouette width.
- Combined P-values: we select the number of clusters such that the combined P-values between all clusters satisfy a chosen criterion (e.g. not exceeding 5%).
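The non-standardized silhouette described above can be sketched as follows, assuming a symmetric matrix `P` of gene-gene P-values and a vector of cluster labels with at least two clusters. The function name is hypothetical, and assigning width 0 to a gene that is alone in its cluster is a convention assumed here.

```python
import numpy as np

def pvalue_silhouette(P, labels):
    """Per-gene silhouette: mean P-value to own cluster minus mean P-value
    to the nearest competing cluster (the one with the largest mean P-value)."""
    P = np.asarray(P, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    widths = np.zeros(len(labels))
    for i in range(len(labels)):
        own = labels == labels[i]
        own[i] = False  # exclude gene i itself
        if not own.any():
            continue  # singleton cluster: width 0 (assumed convention)
        within = P[i, own].mean()
        nearest = max(P[i, labels == c].mean() for c in clusters if c != labels[i])
        widths[i] = within - nearest
    return widths

# Usage: average the widths and compare across candidate numbers of clusters.
P_toy = np.array([[1.0, 0.8, 0.1, 0.1],
                  [0.8, 1.0, 0.1, 0.1],
                  [0.1, 0.1, 1.0, 0.8],
                  [0.1, 0.1, 0.8, 1.0]])
good = pvalue_silhouette(P_toy, [0, 0, 1, 1]).mean()
bad = pvalue_silhouette(P_toy, [0, 1, 0, 1]).mean()
```

In terms of the dissimilarity 1 - P-value, this is the usual silhouette numerator (nearest-cluster dissimilarity minus own-cluster dissimilarity) without the division by the larger of the two.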
9 Simulation study 1a: Increased power
3 clusters with sizes 20/20/20
False positive/negative discovery rates for:
1. BH corrections
2. PAM on p-values
3. PAMsil on p-values, non-standardized sil
4. PAMsil
5. PAM on average data
6. PAM on full data
10 Simulation study 1b: Increased power
3 clusters with sizes 40/10/10
False positive/negative discovery rates for:
1. BH corrections
2. PAM on p-values
3. PAMsil on p-values, non-standardized sil
4. PAMsil
5. PAM on average data
6. PAM on full data
11 Simulation study 2a: Flexibility
3 clusters, 20 in each; the first and last 10 members of each cluster have two different variances. Use an LRT to test gene pairs for equal mean and variance vectors. The number of clusters should be 6.
[Figure panels: P-value matrix; Euclidean distance based on average; Euclidean distance based on full data]
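A likelihood ratio test of equal mean and variance vectors for a gene pair can be sketched as below. This is an assumed formulation, not necessarily the one used in the simulation: it takes independent normal data with a separate mean and variance per gene and condition, and it uses the asymptotic chi-square reference distribution, which can be rough with few replicates (a permutation p-value would be an alternative). The function name is hypothetical.

```python
import numpy as np
from scipy import stats

def lrt_pair_pvalue(xa, xb):
    """LRT of H0: equal means and variances under every condition.

    xa, xb: arrays of shape (C, n) holding the replicates of two genes.
    """
    C, n = xa.shape
    # Maximum likelihood variance estimates (ddof=0) under the full model.
    var_a = xa.var(axis=1)
    var_b = xb.var(axis=1)
    # Under H0 the 2n observations per condition share one mean and variance.
    var_0 = np.concatenate([xa, xb], axis=1).var(axis=1)
    # Twice the log-likelihood difference between the full and null models.
    stat = np.sum(2 * n * np.log(var_0) - n * np.log(var_a) - n * np.log(var_b))
    # Full model has 4C parameters, null model has 2C: 2C degrees of freedom.
    return stats.chi2.sf(stat, df=2 * C)
```

With this criterion, two genes with identical mean profiles but different variances receive a small p-value, which is what lets the 3 mean clusters split into 6.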
12 Simulation study 2b: Flexibility
3 clusters with sizes 20/50/10; 5 members of the first cluster have a larger variance than the rest.
Error rates for:
1. PAMsil on p-values
2. PAMsil on average data
3. PAM on average
13 Simulation study 3: Effect of Replication
3 clusters, 20 members in each; 5, 15 or 25 replicates for each gene
[Figure panels: P-value matrix; Euclidean distance based on average; Euclidean distance based on full data]
14 Simulation study 3: Effect of Replications
3 clusters, 20 members in each; 5, 15 or 25 replicates for each gene
Table of the fraction of times we choose 3 clusters over 2 with each method, for 5, 15 or 25 replicates. Note that the p-value based method adapts to the data.

Method                 No. of replicates
                       5      15     25
PAMsil on p-values     .19    1      1
PAMsil on average      .42    .40    .39
PAMsil on full data    .38    .39    .45
15 Using microarrays to screen anti-inflammatory drugs in injured spinal cord
Data provided by R. Hart, Dept. Neuroscience, Rutgers
- 7 conditions: uninjured, injured but untreated, and injured treated by five different drugs
- 3 replicates of each condition
- 1664 genes
- Cluster genes to look for useful patterns
16 P-value based clustering
3 genes, chemokines thought to be beneficial for
recovery
The null cluster (1518 genes)
7 genes, related to cleaning mechanisms in the
cell (macrophages), as well as stress-response
(hsp)
Drug 4, NS398, a COX-2 inhibitor
17 Selecting genes
Alt 1: Name-that-cluster: use Welch F to decide which cluster is the null cluster
Alt 2: Benjamini-Hochberg (BH) correction
Comparison
- Benjamini-Hochberg: 152 genes selected at FDR 1%
- P-value based clustering (PAMsil): 146 genes not in the null cluster
- Overlap: 97 genes selected by BH only, 55 by both, 91 by P-value based clustering only (97 + 55 = 152, 55 + 91 = 146)
Side note 1: If we increase to FDR 5% for BH, the overlap is 104 genes, out of 379.
Side note 2: If we cluster with PAM we get clusters of roughly equal size, and all clusters contain many null genes.
18 Clustering genes filtered by BH
PAMsil with P-values for the BH subset (152 genes)
Missing some chemokines, and the stress
response genes
PAMsil on average data for the BH subset (152
genes)
19 Alternative clustering
When we increase the number of clusters to 5 for the BH filtered average data (PAMsil), the clustering is dominated by a few outliers.
Macrophage inflammatory protein
Two outlying genes: PAMsil may be too aggressive?
20 Concluding Remarks
- P-value-based clustering approach
  - reflects the exact experimental setup
  - has valid statistical justification
  - allows for flexible separating criteria
- Joint gene selection/clustering approach
  - increased power/reduced false negative rate
- PAMsil preferable to PAM, since PAM tends to generate equal-sized clusters
- PAMsil on P-values preferable to PAMsil on average, since it is less sensitive to noisy/outlying genes
21 Future work
- Explore other clustering algorithms
  - asymmetric costs?
  - focus only on between-cluster p-values?
- Explore the use of other tests
- More extensive simulations, and application to other gene expression data sets
- How to deal with rag-bag genes?
  - allow for rag-bag clusters
  - post-processing/filtering and validation after clustering