Title: Discrimination and clustering with microarray gene expression data
1Discrimination and clustering with microarray
gene expression data
- Terry Speed, Jane Fridlyand, Yee Hwa Yang
and Sandrine Dudoit
Department of Statistics, UC Berkeley;
Department of Biochemistry, Stanford University
ENAR, Charlotte NC, March 27, 2001
2Outline
- Introductory comments
- Classification
- Clustering
- A synthesis
- Concluding remarks
3Tumor classification
- A reliable and precise classification of
tumors is essential for successful treatment
of cancer.
- Current methods for classifying human
malignancies rely on a variety of morphological,
clinical and molecular variables.
- In spite of recent progress, there are still
uncertainties in diagnosis. Also, it is likely
that the existing classes are heterogeneous.
- DNA microarrays may be used to characterize
the molecular variations among tumors by
monitoring gene expression on a genomic scale.
4Tumor classification, ctd
- There are three main types of statistical
problems associated with tumor classification:
- 1. The identification of new/unknown tumor
classes using gene expression profiles
- 2. The classification of malignancies into known
classes
- 3. The identification of marker genes that
characterize the different tumor classes.
- These issues are relevant to other questions
we meet, e.g. characterising/classifying neurons
or the toxicity of chemicals administered to
cells or model animals.
5Gene Expression Data
- Gene expression data on p genes for n samples
                         mRNA samples
Genes   sample1  sample2  sample3  sample4  sample5  ...
1          0.46     0.30     0.80     1.51     0.90  ...
2         -0.10     0.49     0.24     0.06     0.46  ...
3          0.15     0.74     0.04     0.10     0.20  ...
4         -0.45    -1.03    -0.79    -0.56    -0.32  ...
5         -0.06     1.06     1.35     1.09    -1.09  ...
Entry (i, j): gene expression level of gene i in mRNA sample j,
measured as Log(Red intensity / Green intensity) or
Log(Avg. PM - Avg. MM)
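As a rough illustration of the genes-by-samples layout, the sketch below builds a small matrix of log ratios from two-colour intensities. The intensity values are made up, the function name `log_ratio_matrix` is hypothetical, and base 2 is an assumption (the slide only says "Log").

```python
import math

def log_ratio_matrix(red, green):
    """Return x with x[i][j] = log2(red[i][j] / green[i][j]).

    Rows are genes (i = 1..p) and columns are mRNA samples (j = 1..n).
    Base 2 is an assumption; the slide only says "Log".
    """
    return [[math.log2(r / g) for r, g in zip(red_row, green_row)]
            for red_row, green_row in zip(red, green)]

# Hypothetical intensities for p = 2 genes and n = 3 samples.
red = [[1200.0, 800.0, 1500.0], [400.0, 950.0, 700.0]]
green = [[900.0, 650.0, 850.0], [430.0, 570.0, 590.0]]

x = log_ratio_matrix(red, green)
print(len(x), "genes x", len(x[0]), "samples")
```

A positive entry means the red channel is brighter than the green channel for that gene and sample; a negative entry means the reverse.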
6Comparison of discrimination methods
- In this field many people are inventing new
methods of classification or using quite complex
ones (e.g. SVMs). Is this necessary?
- We did a study comparing several methods on
three publicly available tumor data sets (the
Leukemia data set, the Lymphoma data set, and the
NCI 60 tumor cell line data), as well as some
unpublished data sets.
- We compared NN, FLDA, DLDA, DQDA and CART, the
last with or without aggregation (bagging or
boosting).
- The results were unequivocal: simplest is best!
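To give a sense of how simple the simple methods are, here is a minimal sketch of DLDA (diagonal linear discriminant analysis), assuming its usual definition: each class has its own mean profile, every gene has one variance pooled across classes, and a sample goes to the class whose mean is closest in variance-scaled squared distance. The function names and the toy data are hypothetical, and this is not the code used in the study.

```python
from collections import defaultdict

def fit_dlda(X, y):
    """X: list of samples, each a list of p expression values; y: class labels.

    Returns (class means, pooled per-gene variances). Assumes more samples
    than classes and non-zero within-class variance for every gene.
    """
    p = len(X[0])
    groups = defaultdict(list)
    for xi, yi in zip(X, y):
        groups[yi].append(xi)
    means = {k: [sum(col) / len(rows) for col in zip(*rows)]
             for k, rows in groups.items()}
    # Pooled within-class sum of squares for each gene.
    ss = [0.0] * p
    for k, rows in groups.items():
        for row in rows:
            for j in range(p):
                ss[j] += (row[j] - means[k][j]) ** 2
    dof = len(X) - len(groups)
    return means, [s / dof for s in ss]

def predict_dlda(model, x):
    """Assign x to the class with the smallest variance-scaled squared distance."""
    means, var = model
    def score(k):
        return sum((xj - mj) ** 2 / vj for xj, mj, vj in zip(x, means[k], var))
    return min(means, key=score)

# Toy learning set: 4 samples, 2 genes, 2 classes (made-up values).
X = [[0.0, 1.0], [0.2, 0.8], [2.0, 3.0], [2.2, 3.3]]
y = ["a", "a", "b", "b"]
model = fit_dlda(X, y)
print(predict_dlda(model, [0.1, 0.9]), predict_dlda(model, [2.1, 3.1]))
```

Because the covariance matrix is taken to be diagonal, no matrix inversion is needed, which keeps the rule usable when p is much larger than n.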
8Images of correlation matrix between 81 samples
Panels: 4,682 genes; 50 genes
Lymphoma data set: 29 B-CLL, 9 FL, 43 DLBCL
11Cluster Analysis
- Can cluster genes, cell samples, or both.
- Strengthens signal when averages are taken within
clusters of genes (Eisen).
- Useful (essential?) when seeking new subclasses
of cells, tumors, etc.
- Leads to readily interpreted figures.
12Clusters
Taken from Nature, February 2000: paper by A.
Alizadeh et al., "Distinct types of diffuse large
B-cell lymphoma identified by gene expression
profiling".
13Discovering sub-groups
14Clustering problems
- Suppose we have gene expression data on p genes
for n tumor mRNA samples in the form of gene
expression profiles $x_i = (x_{i1}, \ldots, x_{ip})$,
$i = 1, \ldots, n$.
- Three related tasks are:
- 1. Estimating the number of tumor clusters
- 2. Assigning each tumor sample to a
cluster
- 3. Assessing the strength/confidence of
cluster assignments for individual tumors.
- These are generic clustering problems.
15Assessing the strength/confidence of cluster
assignments
- The silhouette width of an observation is
$s = (b - a) / \max(a, b)$,
where a is the average dissimilarity between
the observation and all others in the cluster to
which it belongs, and b is the smallest of the
average dissimilarities between the observation
and ones in other clusters. Large s means well
clustered.
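The silhouette width can be computed directly from its definition. The sketch below assumes Euclidean distance as the dissimilarity and sets s = 0 for an observation that is alone in its cluster (a common convention, not stated on the slide); the function names are hypothetical.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def silhouette_widths(X, labels, dissim=euclidean):
    """Return s = (b - a) / max(a, b) for every observation in X."""
    clusters = set(labels)
    widths = []
    for i, xi in enumerate(X):
        def avg_to(c):
            # Average dissimilarity from observation i to the members of cluster c.
            d = [dissim(xi, X[j]) for j in range(len(X))
                 if labels[j] == c and j != i]
            return sum(d) / len(d) if d else None
        a = avg_to(labels[i])
        others = [avg_to(c) for c in clusters if c != labels[i]]
        if a is None or not others:
            widths.append(0.0)  # singleton cluster, or only one cluster
            continue
        b = min(others)
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return widths

# Two well-separated groups on a line (made-up values).
X = [[0.0], [1.0], [10.0], [11.0]]
labels = [0, 0, 1, 1]
s = silhouette_widths(X, labels)
print([round(v, 3) for v in s])
```

For the first observation a = 1 and b = 10.5, so s = 9.5 / 10.5, close to 1, as expected for a well-clustered point.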
16Bagging
- In discriminant analysis, it is well known that
gains in accuracy can be obtained by aggregating
predictors built from perturbed versions of the
learning set (cf. bagging and boosting).
- In the bootstrap aggregating or bagging
procedure, perturbed learning sets of the same
size as the original learning set are formed by
drawing at random with replacement from the
learning set, i.e., by forming non-parametric
bootstrap replicates of the learning set.
- Predictors are built for each perturbed dataset
and aggregated by plurality voting.
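The bagging procedure described above can be sketched in a few lines. The base predictor here is a nearest-neighbour rule chosen only for brevity, B = 50 replicates and the fixed seed are assumptions, and the names `nearest_neighbour` and `bagged_predict` are hypothetical.

```python
import random
from collections import Counter

def nearest_neighbour(X, y):
    """Base predictor: classify x by the label of its nearest learning-set sample."""
    def predict(x):
        i = min(range(len(X)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(X[i], x)))
        return y[i]
    return predict

def bagged_predict(build, X, y, x_new, B=50, seed=0):
    """Build one predictor per bootstrap replicate and aggregate by plurality vote."""
    rng = random.Random(seed)
    n = len(X)
    votes = Counter()
    for _ in range(B):
        # Non-parametric bootstrap: draw n indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        predictor = build([X[i] for i in idx], [y[i] for i in idx])
        votes[predictor(x_new)] += 1
    return votes.most_common(1)[0][0], votes

# Toy learning set with two classes (made-up values).
X = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]]
y = ["a", "a", "a", "b", "b", "b"]
label, votes = bagged_predict(nearest_neighbour, X, y, [0.05])
print(label, dict(votes))
```

The vote counts give a rough indication of how stable the prediction is under perturbation of the learning set.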
17Bagging a clustering algorithm
- For a fixed number k of clusters:
- Generate multiple bootstrap learning sets (B = 50)
- Apply the clustering algorithm to each bootstrap
learning set
- Re-label the clusters for the bootstrap learning
sets so that there is maximum overlap with the
original clustering of these observations
- The cluster assignment of each observation is
then obtained by plurality voting.
- Record for each observation its cluster vote
(CV), which is the proportion of votes in favour
of the winning cluster.
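The steps above can be sketched as follows. The slide does not name the clustering algorithm, so a plain k-means is used as a stand-in. Re-labelling tries every permutation of the k labels, which is only practical for small k. Each observation gets at most one vote per replicate, which is a simplifying assumption. All function names are hypothetical.

```python
import random
from collections import Counter
from itertools import permutations

def sqdist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans(X, k, iters=20):
    """Stand-in clustering algorithm (an assumption, not the slides' choice)."""
    distinct = sorted(set(map(tuple, X)))
    centres = [list(distinct[(c * len(distinct)) // k]) for c in range(k)]
    labels = [0] * len(X)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: sqdist(x, centres[c])) for x in X]
        for c in range(k):
            members = [x for x, lab in zip(X, labels) if lab == c]
            if members:  # keep the old centre if a cluster empties
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def bagged_clustering(X, k, cluster=kmeans, B=50, seed=0):
    """Return (assignment, cluster votes) for each observation in X."""
    rng = random.Random(seed)
    n = len(X)
    original = cluster(X, k)
    votes = [Counter() for _ in range(n)]
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap learning set
        boot = cluster([X[i] for i in idx], k)
        # Re-label so that overlap with the original clustering is maximal.
        best = max(permutations(range(k)),
                   key=lambda perm: sum(perm[b] == original[i]
                                        for b, i in zip(boot, idx)))
        # One vote per distinct observation drawn in this replicate.
        for i, b in dict(zip(idx, boot)).items():
            votes[i][best[b]] += 1
    assignment, cv = [], []
    for i in range(n):
        if votes[i]:
            lab, count = votes[i].most_common(1)[0]
            assignment.append(lab)
            cv.append(count / sum(votes[i].values()))
        else:  # observation never drawn: fall back to the original label
            assignment.append(original[i])
            cv.append(0.0)
    return assignment, cv

# Two well-separated groups (made-up values).
X = [[0.0], [0.2], [0.4], [10.0], [10.2], [10.4]]
assignment, cv = bagged_clustering(X, k=2)
print(assignment, [round(v, 2) for v in cv])
```

A CV near 1 marks an observation whose cluster membership is stable across bootstrap replicates; a CV near 1/k marks one that is essentially unassigned.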
18Lymphoma data set
19Leukemia data set
21Comparison of clustering and other approaches to
microarray data analysis
- Cluster analyses
- 1) Usually outside the normal framework of
statistical inference
- 2) Less appropriate when only a few genes are
likely to change.
- 3) Needs lots of experiments
- Single gene approaches
- 1) May be too noisy in general to show much
- 2) May not reveal coordinated effects of
positively correlated genes.
- 3) Harder to relate to pathways.
23Clustering as a means to an end
- We and others (Stanford) are working on methods
which try to combine clustering with more
traditional approaches to microarray data
analysis.
- Idea: find clusters of genes and average their
responses to reduce noise and enhance
interpretability.
- Use testing to assign significance to averages
of clusters of genes as we would with single
genes.
24Clustering genes
- Let p = number of genes.
- 1. Calculate within class correlation.
- 2. Perform hierarchical clustering which will
produce (2p-1) clusters of genes.
- 3. Average within clusters of genes.
- 4. Perform testing on averages of clusters of
genes as if they were single genes.
E.g. p = 5: genes 1, 2, 3, 4, 5 give 2p-1 = 9 clusters,
the five single genes plus
Cluster 6 = (1,2)
Cluster 7 = (1,2,3)
Cluster 8 = (4,5)
Cluster 9 = (1,2,3,4,5)
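The four steps can be sketched as below, under several assumptions that the slides do not state: "within class correlation" is taken to mean the Pearson correlation after centring each gene within each class, the dissimilarity is 1 minus that correlation, the hierarchical clustering uses average linkage, and the test statistic is a two-sample Welch t. All function names and the toy data are hypothetical.

```python
import math
from statistics import mean, variance

def within_class_residuals(X, classes):
    """X[g][j] = expression of gene g in sample j; centre each gene within each class."""
    out = []
    for row in X:
        mu = {c: mean(v for v, cj in zip(row, classes) if cj == c)
              for c in set(classes)}
        out.append([v - mu[cj] for v, cj in zip(row, classes)])
    return out

def corr(u, v):
    mu, mv = mean(u), mean(v)
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    den = math.sqrt(sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v))
    return num / den

def gene_clusters(X, classes):
    """Average-linkage agglomeration; returns all 2p-1 clusters as tuples of gene indices."""
    R = within_class_residuals(X, classes)
    d = lambda A, B: mean(1 - corr(R[g], R[h]) for g in A for h in B)
    active = [(g,) for g in range(len(X))]
    clusters = list(active)  # the p single-gene clusters
    while len(active) > 1:
        A, B = min(((A, B) for i, A in enumerate(active) for B in active[i + 1:]),
                   key=lambda ab: d(*ab))
        active = [C for C in active if C not in (A, B)] + [A + B]
        clusters.append(A + B)  # one new cluster per merge, p-1 merges in total
    return clusters

def t_statistic(values, classes, group):
    a = [v for v, c in zip(values, classes) if c == group]
    b = [v for v, c in zip(values, classes) if c != group]
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

def cluster_t_statistics(X, classes, group):
    """Average genes within each cluster, test the averages, sort by |t|."""
    n = len(X[0])
    result = []
    for cl in gene_clusters(X, classes):
        avg = [mean(X[g][j] for g in cl) for j in range(n)]
        result.append((t_statistic(avg, classes, group), cl))
    return sorted(result, key=lambda r: -abs(r[0]))

# Toy data: p = 3 genes, 3 control and 3 treatment samples (made-up values).
X = [[1.0, 1.2, 0.9, 3.0, 3.3, 2.9],
     [1.1, 1.3, 1.0, 3.1, 3.4, 3.0],
     [0.5, 0.1, 0.4, 0.2, 0.6, 0.3]]
classes = ["ctl", "ctl", "ctl", "trt", "trt", "trt"]
clusters = gene_clusters(X, classes)
results = cluster_t_statistics(X, classes, "trt")
print(len(clusters), "clusters; top:", results[0])
```

With p = 5 genes the same function returns 9 clusters, matching the 2p-1 count in the example above. Gene indices here start at 0 rather than 1.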
25Data - Ro1
- Transgenic mice with a modified Gi coupled
receptor (Ro1).
- Experiment: induced expression of Ro1 in mice.
- 8 control (ctl) mice
- 9 treatment mice, eight weeks after Ro1 was
induced.
- Long-term question: Which groups of genes work
together?
- Based on the paper "Conditional expression of a
Gi-coupled receptor causes ventricular conduction
delay and a lethal cardiomyopathy", see Redfern C.
et al., PNAS, April 25, 2000.
- http://www.pnas.org also
- http://www.GenMAPP.org/ (Conklin lab, UCSF)
26Histogram
Cluster of genes (1703, 3754)
27Top 15 averages of gene clusters
T       Group ID   Genes in cluster
-13.4   7869       (1703, 3754)
-12.1   3754
 11.8   6175
 11.7   4689
 11.3   6089
 11.2   1683
-10.7   2272
 10.7   9955       (6194, 1703, 3754)
 10.7   5179
 10.6   3916
-10.4   8255       (4572, 4772, 5809)
-10.4   4772
-10.4   10548      (2534, 1343, 1954)
 10.3   9476       (6089, 5455, 3236, 4014)
Slide annotations: "Might be influenced by 3754"; "Correlation"
28Closing remarks
- More sophisticated classification methods may
become justified when data sets are larger.
- There seems to be considerable room for
approaches which bring cluster analysis into a
more traditional statistical framework.
- The idea of using clustering to obtain
derived variables seems promising, but has yet to
realise this promise.
29Acknowledgments
- UCB: Yee Hwa Yang, Jane Fridlyand
- WEHI: Natalie Thorne
- PMCI: David Bowtell, Chuang Fong Kong
- Stanford: Sandrine Dudoit
- UCSF: Bruce Conklin, Karen Vranizan