Title: Part 5: Linking Microarray Data with Survival Analysis
1Part 5 Linking Microarray Data with Survival
Analysis
2Use of microarray data via model-based
classification in the study and prediction of
survival from lung cancer (Ben-Tovim Jones et
al., 2005)
3Problems
- Censored observations: the time of occurrence of the event (death) has not yet been observed.
- Small sample sizes: the study is limited by patient numbers.
- Specific patient group: is the study applicable to other populations?
- Difficulty in integrating different studies (different microarray platforms).
4A Case Study: The Lung Cancer data sets from CAMDA03
Four independently acquired lung cancer data sets (Harvard, Michigan, Stanford and Ontario).
The challenge: to integrate information from different data sets (2 Affy chips of different versions, 2 cDNA arrays).
The final goal: to make an impact on cancer biology and eventually patient care.
Especially, we welcome the methodology of survival analysis using microarrays for cancer prognosis (Park et al. Bioinformatics S120, 2002).
5Methodology of Survival Analysis using Microarrays
Cluster the tissue samples (e.g. using hierarchical clustering), then compare the survival curves for each cluster using a non-parametric Kaplan-Meier analysis (Alizadeh et al. 2000).
Park et al. (2002) and Nguyen and Rocke (2002) used partial least squares with the proportional hazards model of Cox.
Unsupervised vs. supervised methods: the semi-supervised approach of Bair and Tibshirani (2004) combines gene expression data with the clinical data.
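The Kaplan-Meier step mentioned above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the code used in any of the cited studies; the function name and the toy follow-up times are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct observed event time.

    times  -- follow-up time for each patient
    events -- 1 if the event (e.g. recurrence) was observed, 0 if censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Group all patients who share this follow-up time
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events > 0:
            # Product-limit update: multiply by the fraction surviving past t
            surv *= 1.0 - n_events / n_at_risk
            curve.append((t, surv))
        # Events and censored patients both leave the risk set
        n_at_risk -= n_leaving
    return curve


# Hypothetical toy data for one tissue cluster (not from the lung cancer sets)
toy_curve = kaplan_meier([5, 8, 12, 20, 25], [1, 1, 0, 1, 0])
for t, s in toy_curve:
    print(t, round(s, 3))
```

Censored patients lower the number at risk without causing a drop in the curve, which is how the estimator copes with the censoring problem listed earlier.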
6AIM: To link gene-expression data with survival from lung cancer in the CAMDA03 challenge.
A. CLUSTER ANALYSIS: We apply a model-based clustering approach to classify tumour tissues on the basis of microarray gene expression.
B. SURVIVAL ANALYSIS: The association between the clusters so formed and patient survival (recurrence) times is established.
C. DISCRIMINANT ANALYSIS: We demonstrate the potential of the clustering-based prognosis as a predictor of the outcome of disease.
7Lung Cancer
Approx. 80% of lung cancer patients have non-small-cell lung cancer (NSCLC), of which adenocarcinoma is the most common form.
All patients diagnosed with NSCLC are treated on the basis of stage at presentation (tumour size, lymph node involvement and presence of metastases).
Yet 30% of patients with resected stage I lung cancer will die of metastatic cancer within 5 years of surgery.
We want a prognostic test for early-stage lung adenocarcinoma to identify patients who are more likely to recur and who would therefore benefit from adjuvant therapy.
8Lung Cancer Data Sets
(see http://www.camda.duke.edu/camda03)
Wigle et al. (2002), Garber et al. (2001),
Bhattacharjee et al. (2001), Beer et al. (2002).
9Heat Map for 2880 Ontario Genes (39 Tissues)
[Figure: heat map; axes: Genes, Tissues]
11Heat Maps for the 20 Ontario Gene-Groups (39 Tissues)
[Figure: heat maps; axes: Genes, Tissues]
Tissues are ordered as Recurrence (1-24) and Censored (25-39).
12Expression Profiles for Useful Metagenes (Ontario 39 Tissues)
[Figure: log expression value vs. tissues for Gene Groups 1, 2, 19 and 20. Tissues are ordered as Recurrence (1-24) and Censored (25-39) and marked as Our Tissue Cluster 1 or Our Tissue Cluster 2.]
13Tissue Clusters
Cluster analysis via EMMIX-GENE of 20 metagenes yields two clusters:
- Cluster 1 (31 tissues): 23 (recurrence) plus 8 (censored). Poor prognosis.
- Cluster 2 (8 tissues): 1 (recurrence) plus 7 (censored). Good prognosis.
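To give a feel for model-based clustering of tissues, the sketch below fits a two-component normal mixture by EM to a single metagene value per tissue and assigns each tissue to the component with the higher posterior probability. It is a deliberately simplified, univariate stand-in and not the EMMIX-GENE procedure itself; the function names, starting values and toy expression values are all assumptions made for illustration.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_two_component_mixture(x, n_iter=100):
    """EM for a two-component univariate normal mixture.

    Returns (weights, means, variances, labels); label 0 is the component
    that starts with the lower mean.
    """
    xs = sorted(x)
    half = len(xs) // 2
    # Crude start: lower half vs. upper half of the sorted values
    means = [sum(xs[:half]) / half, sum(xs[half:]) / (len(xs) - half)]
    overall = sum(xs) / len(xs)
    var0 = sum((v - overall) ** 2 for v in xs) / len(xs)
    variances = [var0, var0]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: posterior probability that each tissue belongs to component 0
        post = []
        for v in x:
            a = weights[0] * normal_pdf(v, means[0], variances[0])
            b = weights[1] * normal_pdf(v, means[1], variances[1])
            post.append(a / (a + b))
        # M-step: re-estimate weight, mean and variance of each component
        for k, resp in enumerate((post, [1 - p for p in post])):
            total = sum(resp)
            weights[k] = total / len(x)
            means[k] = sum(r * v for r, v in zip(resp, x)) / total
            var_k = sum(r * (v - means[k]) ** 2 for r, v in zip(resp, x)) / total
            variances[k] = max(var_k, 1e-6)  # floor to avoid a degenerate component
    # Assign each tissue using the posteriors from the last E-step
    labels = [0 if p >= 0.5 else 1 for p in post]
    return weights, means, variances, labels


# Hypothetical metagene values for 8 tissues (not taken from the Ontario data)
toy_values = [-1.2, -0.9, -1.1, -1.0, -0.8, 1.0, 1.1, 0.9]
weights, means, variances, labels = fit_two_component_mixture(toy_values)
print(labels)
```

In practice the clustering is multivariate (20 metagenes per tissue here), so each component would have a mean vector and a covariance matrix rather than a single mean and variance.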
14SURVIVAL ANALYSIS: LONG-TERM SURVIVOR (LTS) MODEL
S(t) = pr{T > t} = p1 S1(t) + p2,
where T is time to recurrence and p1 = 1 - p2 is the prior prob. of recurrence.
Adopt a Weibull model for the survival function for recurrence, S1(t).
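A small numerical sketch may help show how a long-term survivor model behaves. It assumes the usual mixture form S(t) = p1 S1(t) + p2 with p2 = 1 - p1, and one common Weibull parameterisation, S1(t) = exp(-(t/scale)^shape). The slide does not show which parameterisation was used, and the parameter values below are invented for illustration.

```python
import math


def weibull_survival(t, scale, shape):
    """S1(t) = exp(-(t / scale) ** shape); an assumed parameterisation."""
    return math.exp(-((t / scale) ** shape))


def lts_survival(t, p1, scale, shape):
    """Overall survival S(t) = p1 * S1(t) + p2, with p2 = 1 - p1.

    p1 is the prior probability of recurrence; p2 is the proportion of
    long-term survivors, for whom recurrence never occurs.
    """
    p2 = 1.0 - p1
    return p1 * weibull_survival(t, scale, shape) + p2


# Hypothetical parameter values, chosen only to show the shape of the curve
for t in (0.0, 250.0, 500.0, 1000.0, 5000.0):
    print(t, round(lts_survival(t, p1=0.6, scale=500.0, shape=1.5), 3))
```

The curve starts at 1 and levels off at p2 instead of falling to zero, which is the feature that separates this model from an ordinary Weibull fit.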
15Fitted LTS Model vs. Kaplan-Meier
16PCA of Tissues Based on Metagenes
[Figure: second PC vs. first PC]
17PCA of Tissues Based on Metagenes
[Figure: second PC vs. first PC]
18PCA of Tissues Based on All Genes (via SVD)
[Figure: second PC vs. first PC]
19PCA of Tissues Based on All Genes (via SVD)
[Figure: second PC vs. first PC]
20Cluster-Specific Kaplan-Meier Plots
21Survival Analysis for Ontario Dataset
Cluster   No. of Tissues   No. of Censored   Mean time to Failure (±SE)
1         29               8                 665 ± 85.9
2         8                7                 1388 ± 155.7
A significant difference between Kaplan-Meier estimates for the two clusters (P = 0.027).
- Cox's proportional hazards analysis
Variable                        Hazard ratio (95% CI)   P-value
Cluster 1 vs. Cluster 2         6.78 (0.9-51.5)         0.06
Tumor stage (I vs. II & III)    1.07 (0.57-2.0)         0.83
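The slide does not name the test used to compare the Kaplan-Meier estimates for the two clusters. A common choice is the log-rank test, and the sketch below shows that test under this assumption. The function name and the toy data are hypothetical; the code is an illustration, not a reproduction of the reported P-value.

```python
import math


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi_square, p_value) on 1 d.f."""
    pooled = [(t, e, 0) for t, e in zip(times_a, events_a)]
    pooled += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in pooled if e == 1})
    observed_a = expected_a = variance = 0.0
    for t in event_times:
        # Numbers still at risk just before time t, per group
        n_a = sum(1 for s, _, g in pooled if s >= t and g == 0)
        n_b = sum(1 for s, _, g in pooled if s >= t and g == 1)
        # Events occurring exactly at time t
        d_a = sum(1 for s, e, g in pooled if s == t and e == 1 and g == 0)
        d_b = sum(1 for s, e, g in pooled if s == t and e == 1 and g == 1)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            # Hypergeometric variance of the group-A event count at time t
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    chi_square = (observed_a - expected_a) ** 2 / variance
    # Upper tail of chi-square with 1 d.f.: P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi_square / 2.0))
    return chi_square, p_value


# Hypothetical toy data: group A fails early, group B fails late
chi_sq, p = logrank_test([1, 2, 3, 4, 5], [1, 1, 1, 1, 1],
                         [10, 11, 12, 13, 14], [1, 1, 1, 1, 1])
print(round(chi_sq, 2), p < 0.05)
```

With as few events as in the good-prognosis cluster (one recurrence), any such test has limited power, which is consistent with the wide confidence interval in the Cox table.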
22Discriminant Analysis (Supervised Classification)
A prognosis classifier was developed to predict the class of origin of a tumor tissue with a small error rate after correction for the selection bias.
A support vector machine (SVM) was adopted to identify important genes that play a key role in predicting the clinical outcome, using all the genes and the metagenes.
A cross-validation (CV) procedure was used to calculate the prediction error, after correction for the selection bias.
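Correcting for selection bias means that gene selection must be repeated inside every cross-validation fold, using the training tissues only, so that the held-out tissues play no part in choosing the genes. The sketch below illustrates that idea. To stay dependency-free it swaps the SVM for a nearest-centroid classifier and ranks genes by the absolute difference in class means; both are stand-ins chosen for brevity, and all names and data are hypothetical.

```python
import random


def select_genes(X, y, k):
    """Rank genes by absolute difference of class means; return top-k indices."""
    n_genes = len(X[0])
    scores = []
    for j in range(n_genes):
        g0 = [row[j] for row, label in zip(X, y) if label == 0]
        g1 = [row[j] for row, label in zip(X, y) if label == 1]
        scores.append(abs(sum(g0) / len(g0) - sum(g1) / len(g1)))
    return sorted(range(n_genes), key=lambda j: -scores[j])[:k]


def nearest_centroid_predict(X_train, y_train, x_new, genes):
    """Predict the class whose centroid (on the chosen genes) is closest."""
    best_label, best_dist = None, float("inf")
    for label in (0, 1):
        rows = [r for r, l in zip(X_train, y_train) if l == label]
        centroid = [sum(r[j] for r in rows) / len(rows) for j in genes]
        dist = sum((x_new[j] - c) ** 2 for j, c in zip(genes, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def cv_error(X, y, k_genes, n_folds=10):
    """Cross-validated error with gene selection redone in every fold."""
    n = len(X)
    errors = 0
    for fold in range(n_folds):
        test_idx = [i for i in range(n) if i % n_folds == fold]
        train_idx = [i for i in range(n) if i % n_folds != fold]
        X_tr = [X[i] for i in train_idx]
        y_tr = [y[i] for i in train_idx]
        # Key step: genes are chosen from the training fold only
        genes = select_genes(X_tr, y_tr, k_genes)
        for i in test_idx:
            if nearest_centroid_predict(X_tr, y_tr, X[i], genes) != y[i]:
                errors += 1
    return errors / n


# Hypothetical toy data: 20 tissues, 5 genes, only gene 0 is informative
rng = random.Random(0)
y = [i % 2 for i in range(20)]
X = [[2.0 * label + rng.uniform(-0.3, 0.3)]
     + [rng.uniform(-0.3, 0.3) for _ in range(4)] for label in y]
toy_error = cv_error(X, y, k_genes=1)
print(toy_error)
```

Selecting the genes once on all tissues and then cross-validating only the classifier would let information from the test tissues leak into the gene list, and the resulting error estimate would be too optimistic.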
23ONTARIO DATA (39 tissues): Support Vector Machine (SVM) with Recursive Feature Elimination (RFE)
[Figure: error rate (CV10E), axis from 0 to 0.12, vs. log2 (number of genes), axis from 0 to 12]
Ten-fold cross-validation error rate (CV10E) of the support vector machine (SVM), applied to g = 2 clusters (G1: 1-14, 16-29, 33, 36, 38; G2: 15, 30-32, 34, 35, 37, 39).
24STANFORD DATA
918 genes based on 73 tissue samples from 67
patients. Row and column normalized, retained
451 genes after select-genes step. Used 20
metagenes to cluster tissues. Retrieved
histological groups.
25Heat Maps for the 20 Stanford Gene-Groups (73 Tissues)
[Figure: heat maps; axes: Genes, Tissues]
Tissues are ordered by their histological classification: Adenocarcinoma (1-41), Fetal Lung (42), Large cell (43-47), Normal (48-52), Squamous cell (53-68), Small cell (69-73).
26STANFORD CLASSIFICATION
- Cluster 1: 1-19 (good prognosis)
- Cluster 2: 20-26 (long-term survivors)
- Cluster 3: 27-35 (poor prognosis)
27Heat Maps for the 15 Stanford Gene-Groups (35 Tissues)
[Figure: heat maps; axes: Genes, Tissues]
Tissues are ordered by the Stanford classification into AC groups: AC group 1 (1-19), AC group 2 (20-26), AC group 3 (27-35).
28Expression Profiles for Top Metagenes (Stanford 35 AC Tissues)
[Figure: log expression value vs. tissues for Gene Groups 1 to 4. Tissues are marked as Stanford AC group 1, 2 or 3, or as Misallocated.]
29Cluster-Specific Kaplan-Meier Plots
30Cluster-Specific Kaplan-Meier Plots
31Survival Analysis for Stanford Dataset
Cluster   No. of Tissues   No. of Censored   Mean time to Failure (±SE)
1         17               10                37.5 ± 5.0
2         5                0                 5.2 ± 2.3
A significant difference in survival between clusters (P < 0.001).
- Cox's proportional hazards analysis
Variable                          Hazard ratio (95% CI)   P-value
Cluster 3 vs. Clusters 1 & 2      13.2 (2.1-81.1)         0.005
Grade 3 vs. grades 1 or 2         1.94 (0.5-8.5)          0.38
Tumor size                        0.96 (0.3-2.8)          0.93
No. of tumors in lymph nodes      1.65 (0.7-3.9)          0.25
Presence of metastases            4.41 (1.0-19.8)         0.05
32Survival Analysis for Stanford Dataset
- Univariate Cox's proportional hazards analysis (metagenes)
Metagene   Coefficient (SE)   P-value
1          1.37 (0.44)        0.002
2          -0.24 (0.31)       0.44
3          0.14 (0.34)        0.68
4          -1.01 (0.56)       0.07
5          0.66 (0.65)        0.31
6          -0.63 (0.50)       0.20
7          -0.68 (0.57)       0.24
8          0.75 (0.46)        0.10
9          -1.13 (0.50)       0.02
10         0.73 (0.39)        0.06
11         0.35 (0.50)        0.48
12         -0.55 (0.41)       0.18
13         -0.61 (0.48)       0.20
14         0.22 (0.36)        0.53
15         1.70 (0.92)        0.06
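The P-values in the univariate table are consistent with a two-sided Wald test, z = coefficient / SE, although the slide does not state which test was used. The sketch below works through that arithmetic for metagene 1 (coefficient 1.37, SE 0.44) and also converts the coefficient to a hazard ratio with an approximate 95% interval. The function names are hypothetical.

```python
import math


def wald_test(coef, se):
    """Two-sided Wald test for a Cox regression coefficient."""
    z = coef / se
    # Two-sided normal tail probability: P(|Z| > |z|) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


def hazard_ratio_ci(coef, se, z_crit=1.96):
    """Hazard ratio exp(coef) with an approximate 95% confidence interval."""
    return (math.exp(coef),
            math.exp(coef - z_crit * se),
            math.exp(coef + z_crit * se))


# Metagene 1 from the univariate table: coefficient 1.37, SE 0.44
z, p_value = wald_test(1.37, 0.44)
hr, lower, upper = hazard_ratio_ci(1.37, 0.44)
print(round(z, 2), round(p_value, 3))
print(round(hr, 2), round(lower, 2), round(upper, 2))
```

A positive coefficient corresponds to a hazard ratio above 1, that is, higher expression of the metagene is associated with a higher hazard.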
33Survival Analysis for Stanford Dataset
- Multivariate Cox's proportional hazards analysis (metagenes)
Metagene   Coefficient (SE)   P-value
1          3.44 (0.95)        0.0003
2          -1.60 (0.62)       0.010
8          -1.55 (0.73)       0.033
11         1.16 (0.54)        0.031
The final model consists of four metagenes.
34STANFORD DATA: Support Vector Machine (SVM) with Recursive Feature Elimination (RFE)
[Figure: error rate (CV10E), axis from 0 to 0.07, vs. log2 (number of genes), axis from 0 to 10]
Ten-fold cross-validation error rate (CV10E) of the support vector machine (SVM), applied to g = 2 clusters.
35CONCLUSIONS
We applied a model-based clustering approach to classify tumors using their gene signatures into:
(a) clusters corresponding to tumor type;
(b) clusters corresponding to clinical outcomes for tumors of a given subtype.
In (a), there was almost perfect correspondence between cluster and tumor type, at least for non-AC tumors (but not in the Ontario dataset).
36CONCLUSIONS (cont.)
The clusters in (b) were identified with clinical
outcomes (e.g. recurrence/recurrence-free and
death/long-term survival). We were able to show
that gene-expression data provide prognostic
information, beyond that of clinical indicators
such as stage.
37CONCLUSIONS (cont.)
Based on the tissue clusters, a discriminant analysis using support vector machines (SVM) further demonstrated the potential of gene expression as a tool for guiding therapy and patient care for lung cancer patients.
This supervised classification procedure was used to provide marker genes for prediction of clinical outcomes, in addition to those provided by the cluster-genes step in the initial unsupervised classification.
38LIMITATIONS
- Small number of tumors available (e.g. Ontario and Stanford datasets).
- Clinical data available for only subsets of the tumors, often for only one tumor type (AC).
- High proportion of censored observations limits comparison of survival rates.