Discrimination and clustering with microarray gene expression data

1
Discrimination and clustering with microarray
gene expression data
  • Terry Speed, Jane Fridlyand, Yee Hwa
    Yang and Sandrine Dudoit

Department of Statistics, UC Berkeley,
Department of Biochemistry, Stanford University

ENAR, Charlotte NC, March 27 2001
2
Outline
  • Introductory comments
  • Classification
  • Clustering
  • A synthesis
  • Concluding remarks

3
Tumor classification
  • A reliable and precise classification of
    tumors is essential for successful treatment
    of cancer.
  • Current methods for classifying human
    malignancies rely on a variety of morphological,
    clinical and molecular variables.
  • In spite of recent progress, there are still
    uncertainties in diagnosis. Also, it is likely
    that the existing classes are heterogeneous.
  • DNA microarrays may be used to characterize
    the molecular variations among tumors by
    monitoring gene expression on a genomic scale.

4
Tumor classification, ctd
  • There are three main types of statistical
    problems associated with tumor classification:
  • 1. The identification of new/unknown tumor
    classes using gene expression profiles
  • 2. The classification of malignancies into known
    classes
  • 3. The identification of marker genes that
    characterize the different tumor classes.
  • These issues are relevant to other questions
    we meet, e.g. characterising/classifying neurons
    or the toxicity of chemicals administered to
    cells or model animals.

5
Gene Expression Data
  • Gene expression data on p genes for n samples

Rows are genes, columns are mRNA samples:

Gene   sample1  sample2  sample3  sample4  sample5  ...
1         0.46     0.30     0.80     1.51     0.90  ...
2        -0.10     0.49     0.24     0.06     0.46  ...
3         0.15     0.74     0.04     0.10     0.20  ...
4        -0.45    -1.03    -0.79    -0.56    -0.32  ...
5        -0.06     1.06     1.35     1.09    -1.09  ...

Entry (i, j) is the gene expression level of gene i in mRNA sample j:
Log(Red intensity / Green intensity), or
Log(Avg. PM - Avg. MM)
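As a small illustration of the two expression measures on this slide, the sketch below computes a log ratio for one two-colour spot and a log average difference for one probe set. The function names, the base-2 logarithm and the intensity values are assumptions made for the example; the slide does not state a base.

```python
import math

def log_ratio(red, green):
    """Log2 of red over green intensity for one spot (base 2 is an assumption)."""
    return math.log2(red / green)

def log_avg_diff(pm, mm):
    """Log2 of (average PM - average MM) for one probe set."""
    diff = sum(pm) / len(pm) - sum(mm) / len(mm)
    if diff <= 0:
        # The logarithm is undefined unless average PM exceeds average MM.
        raise ValueError("average PM must exceed average MM")
    return math.log2(diff)

# Hypothetical intensities, not taken from any data set.
print(log_ratio(800.0, 400.0))                        # red is twice green
print(log_avg_diff([300.0, 500.0], [100.0, 188.0]))   # avg PM 400, avg MM 144
```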
6
Comparison of discrimination methods
  • In this field many people are inventing new
    methods of classification or using quite complex
    ones (e.g. SVMs). Is this necessary?
  • We did a study comparing several methods on
    three publicly available tumor data sets: the
    Leukemia data set, the Lymphoma data set, and the
    NCI 60 tumor cell line data, as well as some
    unpublished data sets.
  • We compared NN, FLDA, DLDA, DQDA and CART, the
    last with or without aggregation (bagging or
    boosting).
  • The results were unequivocal: simplest is best!

7
(No Transcript)
8
Images of the correlation matrix between 81 samples,
computed using 4,682 genes and using 50 genes.
Lymphoma data set: 29 B-CLL, 9 FL, 43 DLBCL.
9
(No Transcript)
10
(No Transcript)
11
Cluster Analysis
  • Can cluster genes, cell samples, or both.
  • Strengthens signal when averages are taken within
    clusters of genes (Eisen).
  • Useful (essential?) when seeking new subclasses
    of cells, tumors, etc.
  • Leads to readily interpreted figures.

12
Clusters
Taken from Nature, February 2000: paper by A.
Alizadeh et al., "Distinct types of diffuse large
B-cell lymphoma identified by gene expression
profiling".
13
Discovering sub-groups
14
Clustering problems
  • Suppose we have gene expression data on p genes
    for n tumor mRNA samples in the form of gene
    expression profiles
    $x_i = (x_{i1}, \ldots, x_{ip})$, $i = 1, \ldots, n$.
  • Three related tasks are:
  • 1. Estimating the number of tumor clusters
  • 2. Assigning each tumor sample to a
    cluster
  • 3. Assessing the strength/confidence of
    cluster assignments for individual tumors.
  • These are generic clustering problems.

15
Assessing the strength/confidence of cluster
assignments
  • The silhouette width of an observation is
  • $s = (b - a)/\max(a, b)$
  • where a is the average dissimilarity between
    the observation and all others in the cluster to
    which it belongs, and b is the smallest of the
    average dissimilarities between the observation
    and ones in other clusters. Large s means well
    clustered.
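The definition above can be sketched directly from a matrix of pairwise dissimilarities. This is a minimal illustration, not the authors' code: the function name and the toy data are hypothetical, and setting s to 0 for an observation that is alone in its cluster is a common convention assumed here.

```python
def silhouette_widths(dist, labels):
    """Return s = (b - a) / max(a, b) for each observation.

    dist   -- symmetric matrix (list of lists) of pairwise dissimilarities
    labels -- cluster label of each observation
    """
    n = len(labels)
    widths = []
    for i in range(n):
        # Group the dissimilarities from observation i by cluster.
        by_cluster = {}
        for j in range(n):
            if j != i:
                by_cluster.setdefault(labels[j], []).append(dist[i][j])
        own = by_cluster.pop(labels[i], [])
        if not own or not by_cluster:
            # Singleton cluster or a single cluster overall: s = 0 (assumed convention).
            widths.append(0.0)
            continue
        a = sum(own) / len(own)                                  # own cluster
        b = min(sum(d) / len(d) for d in by_cluster.values())    # nearest other cluster
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return widths

# Toy one-dimensional example (hypothetical values): two tight, well-separated groups.
points = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(p - q) for q in points] for p in points]
print(silhouette_widths(dist, [0, 0, 1, 1]))   # all four widths are close to 0.9
```

Widths near 1 indicate well-clustered observations; widths near 0 or below suggest an observation sits between clusters or in the wrong one.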

16
Bagging
  • In discriminant analysis, it is well known that
    gains in accuracy can be obtained by aggregating
    predictors built from perturbed versions of the
    learning set (cf. bagging and boosting).
  • In the bootstrap aggregating or bagging
    procedure, perturbed learning sets of the same
    size as the original learning set are formed by
    drawing at random with replacement from the
    learning set, i.e., by forming non-parametric
    bootstrap replicates of the learning set.
  • Predictors are built for each perturbed dataset
    and aggregated by plurality voting.
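The bagging procedure described above can be sketched as follows. The nearest-centroid predictor is only a placeholder chosen to keep the example short (the talk compares NN, FLDA, DLDA, DQDA and CART); the function names, the number of replicates and the toy learning set are all assumptions.

```python
import random
from collections import Counter

def nearest_centroid(train):
    """Build a toy predictor: assign x to the class with the closest mean."""
    sums = {}
    for x, label in train:
        s, n = sums.get(label, (0.0, 0))
        sums[label] = (s + x, n + 1)
    centroids = {label: s / n for label, (s, n) in sums.items()}
    return lambda x: min(centroids, key=lambda label: abs(x - centroids[label]))

def bagged_predict(train, x, build=nearest_centroid, n_boot=25, seed=0):
    """Aggregate predictors built on bootstrap replicates by plurality voting."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_boot):
        # Perturbed learning set: same size, drawn at random with replacement.
        boot = [rng.choice(train) for _ in train]
        votes[build(boot)(x)] += 1
    return votes.most_common(1)[0][0]

# Hypothetical one-gene learning set with two tumor classes.
learning_set = [(0.1, "A"), (0.4, "A"), (0.2, "A"), (0.3, "A"),
                (5.1, "B"), (5.4, "B"), (5.2, "B"), (5.3, "B")]
print(bagged_predict(learning_set, 0.5))
print(bagged_predict(learning_set, 4.8))
```

Any function that maps a learning set to a predictor can be passed as `build`; aggregation tends to help most with unstable predictors such as classification trees.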

17
Bagging a clustering algorithm
  • For a fixed number k of clusters:
  • Generate multiple bootstrap learning sets (B = 50)
  • Apply the clustering algorithm to each bootstrap
    learning set
  • Re-label the clusters for the bootstrap learning
    sets so that there is maximum overlap with the
    original clustering of these observations
  • The cluster assignment of each observation is
    then obtained by plurality voting.
  • Record for each observation its cluster vote
    (CV), which is the proportion of votes in favour
    of the winning cluster.
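The steps above can be sketched as below. This is an illustration under stated assumptions rather than the authors' implementation: the one-dimensional k-means routine stands in for whatever clustering algorithm is being bagged, each observation receives one vote per bootstrap set in which it appears, and all names and data are hypothetical.

```python
import random
from collections import Counter
from itertools import permutations

def kmeans_1d(values, k, iters=20):
    """Toy 1-D k-means with a deterministic start (placeholder algorithm)."""
    srt = sorted(values)
    centres = [srt[(2 * c + 1) * len(srt) // (2 * k)] for c in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: abs(v - centres[c])) for v in values]
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centres[c] = sum(members) / len(members)
    return labels

def bag_cluster(values, k, cluster=kmeans_1d, n_boot=50, seed=0):
    """Return (cluster, cluster vote) for each observation."""
    rng = random.Random(seed)
    n = len(values)
    original = cluster(values, k)
    votes = [Counter() for _ in range(n)]
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]      # bootstrap learning set
        boot = cluster([values[i] for i in idx], k)

        def overlap(perm):
            # Agreement with the original clustering after re-labelling.
            return sum(perm[b] == original[i] for b, i in zip(boot, idx))

        best = max(permutations(range(k)), key=overlap)
        # One vote per observation per bootstrap set, even if drawn repeatedly.
        for i, lab in {i: best[b] for b, i in zip(boot, idx)}.items():
            votes[i][lab] += 1
    result = []
    for i in range(n):
        if votes[i]:
            lab, count = votes[i].most_common(1)[0]
            result.append((lab, count / sum(votes[i].values())))
        else:
            # Never drawn in any bootstrap set: keep the original label, no CV.
            result.append((original[i], None))
    return result

# Hypothetical expression values for eight samples forming two clear groups.
values = [0.1, 0.3, 0.2, 0.4, 5.0, 5.2, 5.1, 5.3]
for label, cv in bag_cluster(values, k=2):
    print(label, cv)
```

Trying every permutation of labels is only practical for small k; a larger k would need a matching algorithm for the re-labelling step.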

18
Lymphoma data set
19
Leukemia data set
20
(No Transcript)
21
Comparison of clustering and other approaches to
microarray data analysis
  • Cluster analyses:
  • 1) are usually outside the normal framework of
    statistical inference
  • 2) are less appropriate when only a few genes are
    likely to change
  • 3) need lots of experiments.
  • Single gene approaches:
  • 1) may be too noisy in general to show much
  • 2) may not reveal coordinated effects of
    positively correlated genes
  • 3) are harder to relate to pathways.

22
(No Transcript)
23
Clustering as a means to an end
  • We and others (Stanford) are working on methods
    which try to combine clustering with more
    traditional approaches to microarray data
    analysis.
  • Idea: find clusters of genes and average their
    responses to reduce noise and enhance
    interpretability.
  • Use testing to assign significance with averages
    of clusters of genes as we would with single
    genes.

24
Clustering genes
E.g. p = 5
  • Let p = number of genes.
  • 1. Calculate within class correlation.
  • 2. Perform hierarchical clustering which will
    produce (2p-1) clusters of genes.
  • 3. Average within clusters of genes.
  • 4. Perform testing on averages of clusters of
    genes as if they were single genes.

Cluster 6 = (1,2)
Cluster 7 = (1,2,3)
Cluster 8 = (4,5)
Cluster 9 = (1,2,3,4,5)
Dendrogram leaves (genes): 1, 2, 3, 4, 5
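The four steps on this slide can be sketched as below for a tiny data set. Several choices are assumptions made for illustration and are not stated on the slide: within-class correlation is computed after centring each gene within its class, clustering uses average linkage on 1 - correlation, and the test is a Welch-type two-sample t-statistic. Genes are indexed from 0 in the code, and the data are invented.

```python
import math
from statistics import mean, variance

def within_class_corr(x, y, classes):
    """Correlation of two genes after centring each within its class."""
    def centred(v):
        out = list(v)
        for c in set(classes):
            idx = [i for i, k in enumerate(classes) if k == c]
            m = mean(v[i] for i in idx)
            for i in idx:
                out[i] = v[i] - m
        return out
    rx, ry = centred(x), centred(y)
    num = sum(a * b for a, b in zip(rx, ry))
    den = math.sqrt(sum(a * a for a in rx) * sum(b * b for b in ry))
    return num / den if den > 0 else 0.0

def gene_clusters(data, classes):
    """Average-linkage clustering; returns all 2p-1 clusters (p singletons first)."""
    p = len(data)
    d = [[1.0 - within_class_corr(data[i], data[j], classes) for j in range(p)]
         for i in range(p)]
    active = [(g,) for g in range(p)]
    clusters = list(active)

    def avg_dist(pair):
        a, b = pair
        return sum(d[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(active) > 1:
        pairs = [(a, b) for i, a in enumerate(active) for b in active[i + 1:]]
        a, b = min(pairs, key=avg_dist)          # closest pair of clusters
        merged = tuple(sorted(a + b))
        active = [c for c in active if c not in (a, b)] + [merged]
        clusters.append(merged)
    return clusters

def t_statistic(values, classes, treated="trt"):
    """Welch-type t-statistic, treated minus the rest (assumed test)."""
    x = [v for v, c in zip(values, classes) if c == treated]
    y = [v for v, c in zip(values, classes) if c != treated]
    se = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / se

def cluster_tests(data, classes):
    """Average each cluster of genes, then test the average like a single gene."""
    results = []
    for members in gene_clusters(data, classes):
        avg = [mean(data[g][j] for g in members) for j in range(len(classes))]
        results.append((members, t_statistic(avg, classes)))
    return results

# Hypothetical data: 5 genes (rows) by 6 samples (3 control, 3 treatment).
classes = ["ctl", "ctl", "ctl", "trt", "trt", "trt"]
data = [
    [0.1, 0.5, 0.3, 1.2, 1.6, 1.3],
    [0.2, 0.6, 0.3, 1.1, 1.7, 1.4],
    [0.0, 0.4, 0.1, 0.9, 1.2, 1.3],
    [1.0, 0.7, 0.9, 0.2, 0.1, 0.4],
    [1.1, 0.6, 0.8, 0.3, 0.0, 0.5],
]
for members, t in sorted(cluster_tests(data, classes), key=lambda r: -abs(r[1])):
    print(members, round(t, 1))
```

With p = 5 the function returns 5 singletons plus 4 merged clusters, matching the 2p-1 = 9 clusters on the slide.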
25
Data - Ro1
  • Transgenic mice with a modified Gi coupled
    receptor (Ro1).
  • Experiment: induced expression of Ro1 in mice.
  • 8 control (ctl) mice
  • 9 treatment mice, eight weeks after Ro1 was
    induced.
  • Long-term question: which groups of genes work
    together?
  • Based on the paper "Conditional expression of a
    Gi-coupled receptor causes ventricular conduction
    delay and a lethal cardiomyopathy"; see Redfern C.
    et al., PNAS, April 25, 2000.
  • http://www.pnas.org, also
  • http://www.GenMAPP.org/ (Conklin lab, UCSF)

26
Histogram
Cluster of genes (1703, 3754)
27
Top 15 averages of gene clusters
T       Group ID   Genes in cluster
-13.4   7869       (1703, 3754)
-12.1   3754
 11.8   6175
 11.7   4689
 11.3   6089
 11.2   1683
-10.7   2272
 10.7   9955       (6194, 1703, 3754)
 10.7   5179
 10.6   3916
-10.4   8255       (4572, 4772, 5809)
-10.4   4772
-10.4   10548      (2534, 1343, 1954)
 10.3   9476       (6089, 5455, 3236, 4014)

Might be influenced by 3754
Correlation
28
Closing remarks
  • More sophisticated classification methods may
    become justified when data sets are larger.
  • There seems to be considerable room for
    approaches which bring cluster analysis into a
    more traditional statistical framework.
  • The idea of using clustering to obtain
    derived variables seems promising, but has yet to
    realise this promise.

29
Acknowledgments
  • UCB
      • Yee Hwa Yang
      • Jane Fridlyand
  • WEHI
      • Natalie Thorne
  • PMCI
      • David Bowtell
      • Chuang Fong Kong
  • Stanford
      • Sandrine Dudoit
  • UCSF
      • Bruce Conklin
      • Karen Vranizan