Title: Cluster Analysis of Gene Expression Profiles
1 Cluster Analysis of Gene Expression Profiles
- Identifying groups of genes that exhibit similar expression "behavior" across a number of experimental conditions
- Assuming that such "co-expression" will tell us something about how these genes are regulated, or possibly even something about their function (Functional Genomics)
- Using information from multiple genes at a time, as opposed to the single-gene-at-a-time analysis we have done so far
- We can also cluster biological samples based on the expression of some or all of the genes
- Example: identifying groups of molecularly similar tumors ("molecular phenotyping")
- "Unsupervised learning" in computer science lingo
2 Cluster analysis
source("http://eh3.uc.edu/teaching/cfg/2006/R/ClusterAnalysis.R")
- Often a large portion of genes are "not interesting"
- The meaning of "not interesting" depends on the context
- Possibly we are interested only in genes whose expression is not constant across all experimental conditions. To remove "non-interesting" genes one can apply a "variation filter".
- Various sorts of "filtering" of "non-interesting" genes generally amount to performing some kind of informal statistical testing with a very low confidence.
- For now, we will just play with our data, with some more exciting examples to follow
- We have six measurements for each gene and will try to cluster genes and experimental conditions using these data
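As a rough illustration of a "variation filter", the R sketch below keeps only the genes whose variance across conditions exceeds a cutoff. The toy matrix `expr`, the gene names, and the cutoff of 0.5 are assumptions made for this example; they are not part of the course data.

```r
# Hypothetical expression matrix: 4 genes (rows) x 6 conditions (columns)
expr <- rbind(
  geneA = c(1.0, 1.1, 0.9, 1.0, 1.1, 0.9),    # nearly constant
  geneB = c(0.2, 2.5, 0.1, 2.4, 0.3, 2.6),    # strongly varying
  geneC = c(5.0, 5.0, 5.0, 5.0, 5.0, 5.0),    # constant
  geneD = c(-1.5, 1.5, -1.4, 1.6, -1.6, 1.4)  # strongly varying
)

# Per-gene variance across conditions
gene_var <- apply(expr, 1, var, na.rm = TRUE)

# Keep genes whose variance exceeds an (arbitrary, assumed) cutoff
cutoff <- 0.5
keep <- gene_var > cutoff
filtered <- expr[keep, , drop = FALSE]
rownames(filtered)
```

The choice of cutoff is informal, which is exactly the point made above: such a filter behaves like a statistical test whose confidence level is not controlled.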
3 Cluster analysis
- > load(url("http://eh3.uc.edu/teaching/cfg/2006/data/SimpleData.RData"))
- > Nic<-grep("Nic",dimnames(SimpleData)[[2]])
- > Ctl<-grep("Ctl",dimnames(SimpleData)[[2]])
- > MNic<-apply(SimpleData[,Nic],1,mean,na.rm=TRUE)
- > VNic<-apply(SimpleData[,Nic],1,var,na.rm=TRUE)
- > MCtl<-apply(SimpleData[,Ctl],1,mean,na.rm=TRUE)
- > VCtl<-apply(SimpleData[,Ctl],1,var,na.rm=TRUE)
- > NNic<-apply(!is.na(SimpleData[,Nic]),1,sum,na.rm=TRUE)
- > NCtl<-apply(!is.na(SimpleData[,Ctl]),1,sum,na.rm=TRUE)
- > VNicCtl<-(((NNic-1)*VNic)+((NCtl-1)*VCtl))/(NCtl+NNic-2)
- > DF<-NNic+NCtl-2
- > TStat<-abs(MNic-MCtl)/((VNicCtl*((1/NNic)+(1/NCtl)))^0.5)
- > TPvalue<-2*pt(TStat,DF,lower.tail=FALSE)
- > SigGenes<-(TPvalue<0.001)
- > sum(SigGenes)
- [1] 7
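The per-gene computation above is a two-sample t-test with a pooled variance estimate. The sketch below repeats the same arithmetic for a single gene and compares it with R's built-in `t.test(..., var.equal = TRUE)`. The measurement values `nic` and `ctl` are made up for illustration.

```r
# Hypothetical measurements for one gene under the two treatments
nic <- c(2.1, 2.4, 2.2)
ctl <- c(1.0, 1.3, 0.9)

n1 <- length(nic)
n2 <- length(ctl)

# Pooled variance and degrees of freedom, as in the slide
pooled_var <- ((n1 - 1) * var(nic) + (n2 - 1) * var(ctl)) / (n1 + n2 - 2)
df <- n1 + n2 - 2

# Absolute t-statistic and two-sided p-value
t_stat <- abs(mean(nic) - mean(ctl)) / sqrt(pooled_var * (1 / n1 + 1 / n2))
p_manual <- 2 * pt(t_stat, df, lower.tail = FALSE)

# Reference: built-in equal-variance two-sample t-test
p_builtin <- t.test(nic, ctl, var.equal = TRUE)$p.value
all.equal(p_manual, p_builtin)
```

Doing the arithmetic with `apply` over the whole matrix, as on the slide, avoids calling `t.test` once per gene.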
4 Cluster analysis
- > library(marray)
- > library(mclust)
- > pal<-maPalette(low="green", high="red", mid="black")
- > MinExp<-min(SimpleData[SigGenes,2:7])
- > MaxExp<-max(SimpleData[SigGenes,2:7])
- > heatmap(data.matrix(SimpleData[SigGenes,2:7]),Colv=NA,Rowv=NA,col=pal,labRow=as.character(SimpleData[SigGenes,1]),scale="none")
- > maColorBar(seq(MinExp,MaxExp,(MaxExp-MinExp)/5), col=pal, horizontal=FALSE, k=5)
5 Cluster analysis
- > heatmap(data.matrix(SimpleData[SigGenes,2:7]),col=pal,labRow=as.character(SimpleData[SigGenes,1]),scale="none")
- Genes were selected based on their differences between the Nic and Ctl treatments, but these differences are not obvious in the heatmap except for one gene
6 Cluster analysis - centered data
- > CenteredData<-SimpleData[,2:7]-apply(SimpleData[,2:7],1,mean,na.rm=TRUE)
- > heatmap(data.matrix(CenteredData[SigGenes,]),col=pal,labRow=as.character(SimpleData[SigGenes,1]),scale="none")
- > heatmap(data.matrix(SimpleData[SigGenes,2:7]),col=pal,labRow=as.character(SimpleData[SigGenes,1]))
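To see what the centering step does, the sketch below subtracts each gene's mean from its own row on a small, hypothetical matrix `x`. After centering, genes measured on very different absolute scales become comparable in terms of their pattern of change across conditions.

```r
# Hypothetical data: 2 genes (rows) x 3 conditions (columns)
x <- rbind(g1 = c(10, 11, 12),
           g2 = c(100, 90, 110))

# Subtract the row mean from every entry of that row.
# A vector of length nrow(x) is recycled down the columns, so entry [i, j]
# gets the mean of row i subtracted (same idea as the apply(..., 1, mean) call).
centered <- x - rowMeans(x, na.rm = TRUE)
centered
```

Every row of `centered` has mean zero, which is why a heatmap of the centered data with `scale="none"` shows the relative changes rather than the overall expression level.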
7 Hierarchical Clustering
- Calculating the "distance" or "similarity" between each pair of expression profiles
- Merging the two "closest" profiles, forming a "node" in the clustering tree, and re-calculating the "distance" between such a "sub-cluster" and the rest of the profiles or sub-clusters using one of the "linkage" principles. Then the two closest sub-clusters are merged again
- Complete linkage - define the distance (similarity) between two clusters as the maximum distance (minimum similarity) between pairs of profiles in which one profile is from the first sub-cluster and the other profile is from the second sub-cluster
- Average linkage - define the distance (similarity) between two clusters as the average distance (similarity) between pairs of profiles in which one profile is from the first sub-cluster and the other profile is from the second sub-cluster
- Single linkage - define the distance (similarity) between two clusters as the minimum distance (maximum similarity) between pairs of profiles in which one profile is from the first sub-cluster and the other profile is from the second sub-cluster
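The three linkage rules differ only in how the between-profile distances are summarized. A minimal sketch, assuming two tiny sub-clusters of one-dimensional "profiles" (the values in `a` and `b` are made up):

```r
a <- c(0, 1)   # profiles in the first sub-cluster (hypothetical, 1-D)
b <- c(4, 6)   # profiles in the second sub-cluster (hypothetical, 1-D)

# All pairwise distances with one profile from each sub-cluster
pair_d <- abs(outer(a, b, "-"))

complete_d <- max(pair_d)    # complete linkage: largest pairwise distance
average_d  <- mean(pair_d)   # average linkage: mean pairwise distance
single_d   <- min(pair_d)    # single linkage: smallest pairwise distance

c(complete = complete_d, average = average_d, single = single_d)
```

For these values the pairwise distances are 4, 6, 3 and 5, so the complete, average and single linkage distances are 6, 4.5 and 3.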
8 Euclidean Distance
- R actually operates on distances, so similarities have to be transformed into distances, which is usually straightforward
- Euclidean distance
- In 2 and 3 dimensions, this is our usual, everyday distance
- > EDistances<-dist(CenteredData[SigGenes,],method="euclidean", diag=TRUE, upper=TRUE)
- > print(EDistances,digits=2)
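As a small sketch of both points on this slide, the code below computes a Euclidean distance by hand, compares it with `dist()`, and then turns a similarity (Pearson's correlation) into a distance using `1 - r`. The vectors and the `1 - r` transformation are assumptions chosen for illustration; other transformations are possible.

```r
# Euclidean distance by hand versus dist()
x <- c(1, 2, 3)
y <- c(4, 6, 3)
d_manual  <- sqrt(sum((x - y)^2))   # sqrt(9 + 16 + 0)
d_builtin <- as.numeric(dist(rbind(x, y), method = "euclidean"))

# Turning a similarity into a distance: 1 - Pearson correlation between rows
profiles <- rbind(p1 = c(1, 2, 3, 4),
                  p2 = c(2, 4, 6, 8),   # perfectly correlated with p1
                  p3 = c(4, 3, 2, 1))   # perfectly anti-correlated with p1
cor_dist <- as.dist(1 - cor(t(profiles)))
cor_dist
```

With this transformation perfectly correlated profiles are at distance 0 and perfectly anti-correlated profiles are at distance 2, and the resulting object can be passed to `hclust` like any other `dist` object.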
9 Distance Matrix
- Distance Matrix - whole
          34  440  596 2797 4466 4512 7651
  34    0.00 8.55 5.64 5.46 8.15 8.03 9.14
  440   8.55 0.00 3.01 3.19 0.82 0.82 1.13
  596   5.64 3.01 0.00 0.33 2.53 2.48 3.59
  2797  5.46 3.19 0.33 0.00 2.71 2.62 3.72
  4466  8.15 0.82 2.53 2.71 0.00 0.47 1.18
  4512  8.03 0.82 2.48 2.62 0.47 0.00 1.14
  7651  9.14 1.13 3.59 3.72 1.18 1.14 0.00
- Distance Matrix - lower triangular
- > EDistances<-dist(CenteredData[SigGenes,],method="euclidean")
- > print(EDistances,digits=2)
          34  440  596 2797 4466 4512
  440   8.55
  596   5.64 3.01
  2797  5.46 3.19 0.33
  4466  8.15 0.82 2.53 2.71
  4512  8.03 0.82 2.48 2.62 0.47
  7651  9.14 1.13 3.59 3.72 1.18 1.14
10 Dendrograms - Complete Linkage
- > Clustering<-hclust(EDistances,method="complete")
- > plot(Clustering)
- Distance Matrix - lower triangular
          34  440  596 2797 4466 4512
  440   8.55
  596   5.64 3.01
  2797  5.46 3.19 0.33
  4466  8.15 0.82 2.53 2.71
  4512  8.03 0.82 2.48 2.62 0.47
  7651  9.14 1.13 3.59 3.72 1.18 1.14
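The dendrogram can be reproduced without the original data by typing in the rounded distances shown on the slide. The sketch below rebuilds the distance object from those values and runs complete-linkage clustering; the variable names `m`, `d` and `hc` are chosen here for illustration, and the heights are only as precise as the two-digit rounding.

```r
genes <- c("34", "440", "596", "2797", "4466", "4512", "7651")

# Full symmetric distance matrix, values copied from the slide (rounded)
m <- matrix(c(
  0.00, 8.55, 5.64, 5.46, 8.15, 8.03, 9.14,
  8.55, 0.00, 3.01, 3.19, 0.82, 0.82, 1.13,
  5.64, 3.01, 0.00, 0.33, 2.53, 2.48, 3.59,
  5.46, 3.19, 0.33, 0.00, 2.71, 2.62, 3.72,
  8.15, 0.82, 2.53, 2.71, 0.00, 0.47, 1.18,
  8.03, 0.82, 2.48, 2.62, 0.47, 0.00, 1.14,
  9.14, 1.13, 3.59, 3.72, 1.18, 1.14, 0.00
), nrow = 7, byrow = TRUE, dimnames = list(genes, genes))

d  <- as.dist(m)                       # keep only the lower triangle
hc <- hclust(d, method = "complete")   # complete-linkage clustering

hc$height                   # merge heights, from first merge to last
three <- cutree(hc, k = 3)  # cut the tree into three gene clusters
three
```

The first merge joins genes 596 and 2797 at height 0.33, the smallest entry in the matrix, and gene 34 joins last at height 9.14, its largest distance to any other gene.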
11 Clustering genes and samples
- > EDistancesS<-dist(t(CenteredData[SigGenes,]),method="euclidean")
- > ClusteringS<-hclust(EDistancesS,method="complete")
- > heatmap(data.matrix(CenteredData[SigGenes,]),Colv=as.dendrogram(ClusteringS),Rowv=as.dendrogram(Clustering),col=pal,scale="none")
- > TwoClusters<-cutree(ClusteringS,k=2, h=NULL)
- > TwoClusters
    Ctl   Nic Nic.1 Nic.2 Ctl.1 Ctl.2
      1     2     2     2     1     1
12 Clustering by partitioning: K-means algorithm
- For a pre-specified number of clusters, iterate between calculating cluster "centroids" (i.e. cluster means) and re-assigning each profile to the cluster with the closest "centroid"
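The iteration described above can be sketched in a few lines of R. This is a minimal illustration, not the algorithm used by R's built-in `kmeans` (which defaults to the Hartigan-Wong method); the function name `simple_kmeans`, the toy points, and the choice of starting centroids are assumptions, and empty clusters are not handled.

```r
# Minimal k-means sketch. x: numeric matrix with profiles in rows;
# centers: matrix of starting centroids, one per row.
simple_kmeans <- function(x, centers, iter.max = 10) {
  k <- nrow(centers)
  cluster <- integer(nrow(x))
  for (i in seq_len(iter.max)) {
    # Squared Euclidean distance from every profile to every centroid
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    # Re-assign each profile to the cluster with the closest centroid
    cluster <- max.col(-d, ties.method = "first")
    # Re-calculate the centroids as cluster means
    new_centers <- do.call(rbind, lapply(seq_len(k), function(j)
      colMeans(x[cluster == j, , drop = FALSE])))
    # Stop when the centroids no longer change
    if (isTRUE(all.equal(new_centers, centers, check.attributes = FALSE))) break
    centers <- new_centers
  }
  list(cluster = cluster, centers = centers)
}

# Hypothetical 2-D profiles forming two well-separated groups
x <- rbind(c(0, 0), c(0, 1), c(1, 0),
           c(10, 10), c(10, 11), c(11, 10))
fit <- simple_kmeans(x, centers = x[c(1, 4), ])
fit$cluster
fit$centers
```

The result depends on the starting centroids, which is one reason the built-in `kmeans` offers an `nstart` argument for trying several random starts.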
13 Clustering k-means
- > TwoCKmeans<-kmeans(t(CenteredData[SigGenes,]),2, iter.max=10)
- > TwoCKmeans
  K-means clustering with 2 clusters of sizes 3, 3
  Cluster means:
           34        440        596       2797      4466       4512      7651
  1  2.510742 -0.9565299  0.2554164  0.3246475 -0.770937 -0.7398173 -1.181848
  2 -2.510742  0.9565299 -0.2554164 -0.3246475  0.770937  0.7398173  1.181848
  Clustering vector:
    Ctl   Nic Nic.1 Nic.2 Ctl.1 Ctl.2
      1     2     2     2     1     1
  Within cluster sum of squares by cluster:
  [1] 1.0805679 0.9474704
  Available components:
  [1] "cluster"  "centers"  "withinss" "size"
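The components listed in the output can be accessed directly from the fitted object. The sketch below clusters the samples (columns) of a small, hypothetical centered matrix `toy`; the values, the seed, and the use of `nstart = 5` are assumptions made for this example.

```r
set.seed(1)  # k-means starts from randomly chosen centroids

# Hypothetical centered data: 3 genes (rows) x 6 samples (columns)
toy <- rbind(
  g1 = c(-2.0,  2.0,  2.1,  1.9, -2.1, -1.9),
  g2 = c( 1.0, -1.0, -1.1, -0.9,  1.1,  0.9),
  g3 = c(-0.5,  0.5,  0.6,  0.4, -0.6, -0.4)
)
colnames(toy) <- c("Ctl", "Nic", "Nic.1", "Nic.2", "Ctl.1", "Ctl.2")

# Transpose so that samples are the rows being clustered
fit <- kmeans(t(toy), centers = 2, iter.max = 10, nstart = 5)

fit$cluster    # cluster label for every sample
fit$centers    # cluster means, one row per cluster, one column per gene
fit$withinss   # within-cluster sum of squares, one value per cluster
fit$size       # number of samples in each cluster
```

The numeric cluster labels are arbitrary: which group is called 1 and which is called 2 can change between runs, so only the grouping itself should be interpreted.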
14 Questions
- How many clusters are there in the data?
- What is the statistical significance of a clustering?
- What is the confidence in assigning any particular expression profile to any particular cluster?
- These are difficult questions, and particularly difficult to answer when using heuristic methods like hierarchical clustering and k-means
- Need statistical models
15 Two genes at a time
- Are these two genes co-expressed?
- By looking at their expression patterns alone,
combined with the null distribution of the
similarity measure in non-co-expressed genes, we
could conclude that this is the case.
16 Another look
- What if we knew that there are two and only two distinct patterns in the data and we knew how they look (thick dashed lines)?
- Given this additional information we are likely to conclude that our two genes actually have different patterns of expression.
17 Many genes at a time
- Simultaneous detection of patterns of expression defined by groups of expression profiles and assignment of individual expression profiles to appropriate patterns.
- By looking at all genes at the same time, we came to a completely different conclusion than when looking at only two of them.
- Questions: How many clusters? How confident are we in the number of clusters in the data? How confident are we that our two genes belong to two different clusters? Does such a confidence statement take into account the uncertainty about the true number of clusters?
18 Gene-specific normalization of the data
19 Clustering using non-normalized data
- Figure panels: K-means, Euclidean distance, Pearson's correlation
20 Clustering using normalized data
- Figure panels: K-means, Euclidean distance, Pearson's correlation
21 Why do we cluster?
- Co-expression
- Co-regulation
- Functional relationship
- Assigning function to genes
22 Why do we cluster - Functional Annotation?
23 Dissecting the gene expression regulatory mechanisms
- S. Tavazoie, J.D. Hughes, M.J. Campbell, R.J. Cho, G.M. Church. Systematic determination of genetic network architecture. Nat. Genet. 22 (1999) 281-285.