CLUSTERING - PowerPoint PPT Presentation

About This Presentation

Title:

CLUSTERING

Description:

Joint work with Prof. Regina Liu and Jun Li. rebecka_at_stat.rutgers.edu. Gene Expression Data ... X : the gene expression level of the k-th replicate under the ...

Transcript and Presenter's Notes

Title: CLUSTERING

1
CLUSTERING GENE EXPRESSION DATA BASED ON
P-VALUES
Rebecka Jornsten
Department of Statistics
Rutgers University
rebecka_at_stat.rutgers.edu
Joint work with Prof. Regina Liu and Jun Li
2
Gene Expression Data
N: the number of genes considered
C: the number of experimental conditions
n_j: the number of replicates under the j-th condition
X_ijk: the gene expression level of the k-th replicate under the j-th condition for the i-th gene

        Cond1                  Cond2                        Cond3
Gene    rep1   rep2   rep3     rep1   rep2   rep3     ...   rep1   rep2   rep3
1       0.46   0.30   0.80     1.51   0.90   1.01     ...   -0.43  0.12   -0.17
2       -0.10  0.49   0.24     0.06   0.46   0.20     ...   0.27   1.32   0.89
3       0.15   0.74   0.04     0.10   0.20   0.55     ...   -0.54  -0.42  -0.05

Example: X_322 is the entry for gene 3, condition 2, replicate 2 (0.20 in the table).
Assumption: X_ijk iid Normal with mean μ_ij and SD σ_ij
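The indexing and the normal model on this slide can be made concrete with a small simulation. This is only a sketch: the function name `simulate_expression`, the toy means and SDs, and the nested-list layout are assumptions made for illustration, not part of the presentation.

```python
import random

def simulate_expression(mu, sigma, n_reps, seed=0):
    """Draw X[i][j][k] ~ Normal(mu[i][j], sigma[i][j]) independently.

    i indexes genes, j indexes conditions, k indexes replicates.
    """
    rng = random.Random(seed)
    return [
        [[rng.gauss(mu[i][j], sigma[i][j]) for _ in range(n_reps[j])]
         for j in range(len(mu[i]))]
        for i in range(len(mu))
    ]

# Hypothetical toy setup: N = 3 genes, C = 2 conditions, n_j = 3 replicates.
mu = [[0.5, 1.0], [0.0, 0.3], [0.3, 0.3]]
sigma = [[0.3, 0.3], [0.3, 0.3], [0.3, 0.3]]
X = simulate_expression(mu, sigma, n_reps=[3, 3])

# X_322 in the slide's 1-based notation is X[2][1][1] with 0-based indexing.
x_322 = X[2][1][1]
```

Keeping the replicates as the innermost list mirrors the X_ijk notation, so a gene's full profile is `X[i]` and its replicates under one condition are `X[i][j]`.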
3
Gene clustering
Which genes are similar? Which are different?
Components in clustering analysis:
1. Similarity/dissimilarity measure
2. Clustering algorithm
3. Cluster validation

Gene selection
Which genes are significantly differentially expressed?
Components in gene selection:
1. Type of test
2. Computing p-values (based on distribution assumptions, or permutation)
3. Adjusting p-values for multiple comparisons

We perform these two tasks jointly.
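For the multiple-comparison step in gene selection, one common adjustment is the Benjamini-Hochberg (BH) procedure. The sketch below shows the standard step-up adjustment in plain Python. The function names and the p-values are made up for illustration.

```python
def bh_adjust(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

def select_genes(pvalues, fdr):
    """Indices of genes whose adjusted p-value is at most the FDR level."""
    return [i for i, q in enumerate(bh_adjust(pvalues)) if q <= fdr]

raw = [0.01, 0.04, 0.03, 0.005]   # made-up p-values, one per gene
adj = bh_adjust(raw)
selected = select_genes(raw, fdr=0.03)
```

Here genes are selected one by one from their individual p-values, which is the baseline the cluster-based selection is compared against.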
4

Motivation

P-value based clustering
- Explicitly takes into account the variability of the data (the experimental setup)
- Provides a standardized way to assess the degree of similarity between genes
- Is less arbitrary than many of the existing choices of dissimilarity measures, such as Euclidean distance
- Can be easily calibrated for different separating criteria

Joint gene clustering and selection
- Increased power by selecting clusters of genes, rather than genes one by one
- Name-that-cluster: we can anchor the clusters with pseudo-genes with experimental profiles of interest

5
P-value as the dissimilarity measure
Similarity between genes (separating criterion): a hypothesis testing problem
Examples (a lot of flexibility!):
- Testing whether the gene mean vectors across experimental conditions are equal
- Testing whether gene mean and variance vectors across experimental conditions are equal
- Alt 1: assuming independent errors for all genes
- Alt 2: assuming independent errors for replicates, but allowing genes to be correlated within each replicate (pairwise tests)

6
P-value as the dissimilarity measure (ctd)

Determine whether or not two genes are dissimilar according to some specified criterion
Determine whether or not to reject the corresponding null hypothesis associated with the specified criterion

Small p-values → strong evidence against similarity
Dissimilarity measure: 1 - P-value
What we need for clustering: apply our chosen test and P-value computation method to all pairs of genes
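As a rough illustration of turning pairwise tests into a dissimilarity matrix, the sketch below uses a Monte Carlo permutation test. The slides do not say which test was used, so the statistic (sum over conditions of squared differences in replicate means), the within-condition shuffling scheme, and all function names are assumptions chosen for simplicity.

```python
import random

def mean(values):
    return sum(values) / len(values)

def profile_distance(a, b):
    """Sum over conditions of the squared difference in replicate means."""
    return sum((mean(a_j) - mean(b_j)) ** 2 for a_j, b_j in zip(a, b))

def permutation_pvalue(a, b, n_perm=999, seed=0):
    """Permutation p-value for H0: genes a and b have the same
    distribution in every condition. a[j], b[j] are replicate lists."""
    rng = random.Random(seed)
    observed = profile_distance(a, b)
    hits = 0
    for _ in range(n_perm):
        perm_a, perm_b = [], []
        for a_j, b_j in zip(a, b):
            pooled = list(a_j) + list(b_j)
            rng.shuffle(pooled)            # reassign replicates within a condition
            perm_a.append(pooled[:len(a_j)])
            perm_b.append(pooled[len(a_j):])
        if profile_distance(perm_a, perm_b) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)       # add-one correction keeps p > 0

def pvalue_dissimilarity(genes, n_perm=999, seed=0):
    """Matrix D with D[i][l] = 1 - p-value for every pair of genes."""
    n = len(genes)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for l in range(i + 1, n):
            p = permutation_pvalue(genes[i], genes[l], n_perm, seed)
            D[i][l] = D[l][i] = 1.0 - p
    return D

# Made-up data: 3 genes, 3 conditions, 3 replicates each.
genes = [
    [[0.1, 0.0, 0.2], [1.0, 1.1, 0.9], [0.0, 0.1, -0.1]],
    [[0.1, 0.0, 0.2], [1.0, 1.1, 0.9], [0.0, 0.1, -0.1]],    # same as gene 0
    [[5.0, 5.2, 4.9], [-3.0, -3.1, -2.9], [8.0, 8.1, 7.9]],  # very different
]
D = pvalue_dissimilarity(genes)
```

With only a few replicates per condition the set of distinct permutations is small, so a test based on distributional assumptions (such as the normal model) may be preferable in practice. Either way, the resulting matrix of 1 - P-values is what a clustering algorithm would take as input.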

7
Clustering algorithms
The gene-gene P-values provide the dissimilarity measures. Now, how do we cluster the genes?
Different clustering algorithms will generate different results. Some approaches:

- PAM: a global cost function that emphasizes within-cluster similarity → tends to generate equal-size clusters
- PAMsil: uses the silhouette validation criterion to cluster, taking both between-cluster and within-cluster (dis)similarities into account (van der Laan et al) → greedy/aggressive
- Hierarchical clustering: more or less greedy depending on the selected linkage; generates clusters by cutting the tree at a chosen level
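To show how a medoid-based method can work directly on a precomputed dissimilarity matrix, here is a small PAM-style sketch (greedy BUILD, then SWAP). It is a simplification, not the implementation used in the study: it accepts the first improving swap rather than searching for the best one, it does not implement PAMsil, and the toy matrix stands in for a matrix of 1 - P-values.

```python
def total_cost(D, medoids):
    """Sum over items of the dissimilarity to the closest medoid."""
    return sum(min(D[i][m] for m in medoids) for i in range(len(D)))

def pam(D, k):
    """PAM-style k-medoids on a precomputed dissimilarity matrix D."""
    n = len(D)
    # BUILD: greedily add the medoid that lowers the total cost the most.
    medoids = []
    for _ in range(k):
        candidates = [c for c in range(n) if c not in medoids]
        medoids.append(min(candidates, key=lambda c: total_cost(D, medoids + [c])))
    # SWAP: replace a medoid by a non-medoid whenever that lowers the cost.
    improved = True
    while improved:
        improved = False
        current = total_cost(D, medoids)
        for pos in range(k):
            for o in range(n):
                if o in medoids:
                    continue
                trial = medoids[:pos] + [o] + medoids[pos + 1:]
                cost = total_cost(D, trial)
                if cost < current - 1e-12:
                    medoids, current, improved = trial, cost, True
    # Assign every item to its closest medoid.
    labels = [min(range(k), key=lambda c: D[i][medoids[c]]) for i in range(n)]
    return medoids, labels

# Toy dissimilarities: two tight groups of three items each.
groups = [0, 0, 0, 1, 1, 1]
D = [[0.0 if i == l else (0.1 if groups[i] == groups[l] else 0.9)
      for l in range(6)] for i in range(6)]
medoids, labels = pam(D, k=2)
```

Because the cost is a sum over all items, a large group contributes many terms, which is one way to see why this kind of objective may prefer splitting a large null cluster over isolating a small one.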

8
Cluster validation: how many clusters?
This is a difficult problem for noisy gene expression data! Some approaches:

- Silhouette width: P-values are already standardized, so our silhouette (for gene i) corresponds to the difference between the average P-value from gene i to members of its cluster and the average P-value from gene i to the nearest competing cluster. We then select the number of clusters to maximize the non-standardized silhouette width.
- Combined P-values: we select the number of clusters such that the combined P-values between all clusters satisfy a chosen criterion (e.g. not exceeding 5%).
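The non-standardized silhouette can be sketched on a dissimilarity matrix as b(i) - a(i), without the usual division by max(a(i), b(i)). With dissimilarity 1 - P-value this equals the within-cluster average P-value minus the average P-value to the nearest competing cluster. The function name, the zero width for singleton clusters, and the candidate labelings below are assumptions for illustration.

```python
def nonstandardized_silhouette(D, labels):
    """Mean over items of b(i) - a(i), where
    a(i) = average dissimilarity from i to the other members of its cluster,
    b(i) = smallest average dissimilarity from i to any other cluster."""
    n = len(D)
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    widths = []
    for i in range(n):
        own = [m for m in clusters[labels[i]] if m != i]
        others = [mem for lab, mem in clusters.items() if lab != labels[i]]
        if not own or not others:
            widths.append(0.0)   # assumed convention for singletons / one cluster
            continue
        a = sum(D[i][m] for m in own) / len(own)
        b = min(sum(D[i][m] for m in mem) / len(mem) for mem in others)
        widths.append(b - a)
    return sum(widths) / n

# Toy dissimilarities: two tight groups of three items each.
groups = [0, 0, 0, 1, 1, 1]
D = [[0.0 if i == l else (0.1 if groups[i] == groups[l] else 0.9)
      for l in range(6)] for i in range(6)]

# Hypothetical labelings, as a clustering algorithm might return for k = 2, 3.
candidates = {2: [0, 0, 0, 1, 1, 1], 3: [0, 0, 2, 1, 1, 1]}
scores = {k: nonstandardized_silhouette(D, labs) for k, labs in candidates.items()}
best_k = max(scores, key=scores.get)
```

In this toy case the two-cluster labeling scores higher, so k = 2 would be chosen.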

9
Simulation study 1a: Increased power
3 clusters with sizes 20/20/20
[Figure: false positive/negative discovery rates for each method]
1. BH corrections
2. PAM on p-values
3. PAMsil on p-values, non-standardized sil
4. PAMsil
5. PAM on average data
6. PAM on full data
10
Simulation study 1b: Increased power
3 clusters with sizes 40/10/10
[Figure: false positive/negative discovery rates for each method]
1. BH corrections
2. PAM on p-values
3. PAMsil on p-values, non-standardized sil
4. PAMsil
5. PAM on average data
6. PAM on full data
11
Simulation study 2a: Flexibility
3 clusters, 20 in each; the first and last 10 members of each cluster have two different variances. Use an LRT to test gene pairs for equal mean and variance vectors. The number of clusters should be 6.
[Figure panels: P-value matrix; Euclidean distance based on average; Euclidean distance based on full data]
12
Simulation study 2b: Flexibility
3 clusters with sizes 20/50/10; 5 members of the first cluster have a larger variance than the rest.
[Figure: error rates for each method]
1. PAMsil on p-values
2. PAMsil on average data
3. PAM on average
13
Simulation study 3: Effect of Replication
3 clusters, 20 members in each; 5, 15 or 25 replicates for each gene
[Figure panels: P-value matrix; Euclidean distance based on average; Euclidean distance based on full data]
14
Simulation study 3: Effect of Replications
3 clusters, 20 members in each; 5, 15 or 25 replicates for each gene. Table of the fraction of times we choose 3 clusters over 2 with each method, for 5, 15 or 25 replicates. Note that the p-value based method adapts to the data.

                        No. of replicates
Method                  5      15     25
PAMsil on p-values      .19    1      1
PAMsil on average       .42    .40    .39
PAMsil on full data     .38    .39    .45
15
Using microarrays to screen anti-inflammatory drugs in injured spinal cord
Data provided by R. Hart, Dept. of Neuroscience, Rutgers

- 7 conditions: uninjured, injured but untreated, and injured treated by five different drugs
- 3 replicates of each condition
- 1664 genes
- Cluster genes to look for useful patterns
16
P-value based clustering
- 3 genes, chemokines thought to be beneficial for recovery
- The null cluster (1518 genes)
- 7 genes, related to cleaning mechanisms in the cell (macrophages), as well as stress response (hsp)
- Drug 4, NS398, a COX-2 inhibitor
17
Selecting genes
Alt 1: Name-that-cluster. Use the Welch F test to decide which cluster is the null cluster.
Alt 2: Benjamini-Hochberg (BH) correction.

Comparison
- Benjamini-Hochberg: 152 genes selected at FDR 1%
- P-value based clustering (PAMsil): 146 genes not in the null cluster
- [Venn diagram: 97 genes selected by BH only, 55 by both, 91 by clustering only]

Side note 1: If we increase to FDR 5% for BH, the overlap is 104 genes, out of 379.
Side note 2: If we cluster with PAM we get clusters of roughly equal size, and all clusters contain many null genes.
18
Clustering genes filtered by BH
PAMsil with P-values for the BH subset (152 genes)
Missing some chemokines, and the stress response genes
PAMsil on average data for the BH subset (152 genes)
19
Alternative clustering
When we increase the number of clusters to 5 for the BH-filtered average data (PAMsil), the clustering is dominated by a few outliers.
Macrophage inflammatory protein
Two outlying genes: PAMsil may be too aggressive?
20
Concluding Remarks