Title: Part II: Discriminative Margin Clustering
1 Part II: Discriminative Margin Clustering
- Joint work with
- Rob Tibshirani, Dept of Statistics
- Patrick O. Brown, School of Medicine
- Stanford University
2 Gene Expression
- Micro-array technology
- Find expression values of all genes in a tissue
- Expression pattern of genes related to characteristics of tissue type
- Gene expression is combinatorial
- Many factors need to combine for expression of a gene
- Combinations of expressions lead to certain phenotypes
- Poorly understood
3 Feature Sets for Tumors
- Set of genes with higher expression in a cancer type compared to every normal tissue type in the body
- Combinatorial gene expression signature
- Potential use in diagnostics and drug treatments
- If these genes encode cell surface proteins
- can target them using antibodies
- Kills tumor cells
- Does not harm normal cells
4 Feature Set Definition
Convex combination of genes which gives maximum separation in expression values
Constraint: $w_1 + w_2 = 1$
[Figure: tumor t and the normal set N (around 100 samples) plotted by expression value for Gene x against expression value for Gene y, with the combination $w_1 x + w_2 y$ and the margin m marked.]
5 Computing the Feature Set
Definition naturally extends to collections of tumor samples
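One way to make this extension concrete is the linear program below. It is a sketch under assumptions, not a formulation taken from the slides: $T$ denotes the collection of tumor samples, $N$ the normal set, $t_g$ and $n_g$ the expression of gene $g$ in a tumor and a normal sample, and the non-negativity of the weights follows from reading "convex combination" literally.

```latex
\begin{aligned}
\max_{w,\,m}\quad & m \\
\text{s.t.}\quad & \sum_g w_g\, t_g - \sum_g w_g\, n_g \;\ge\; m
      \qquad \forall\, t \in T,\ n \in N \\
& \sum_g w_g = 1, \qquad w_g \ge 0 \quad \forall\, g
\end{aligned}
```

With a single tumor and two genes this reduces to the constraint $w_1 + w_2 = 1$ of the feature set definition.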
6 Example
Gene                     T      N1     N2
$g_1$                    100    50     10
$g_2$                    100    10     50
$w_1 g_1 + w_2 g_2$      100    30     30
$w_1 = 0.5$
$w_2 = 0.5$
Margin $= 100 - 30 = 70$
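The arithmetic in this example can be checked with a short Python sketch. It is only an illustration: the function names `margin` and `best_weights` are made up here, and a coarse grid search over the first weight stands in for whatever optimization method was actually used.

```python
def margin(weights, tumor, normals):
    """Separation between the tumor and the closest normal under the given weights."""
    def score(sample):
        return sum(w * x for w, x in zip(weights, sample))
    return min(score(tumor) - score(normal) for normal in normals)


def best_weights(tumor, normals, steps=100):
    """Grid search over convex weights (w1, 1 - w1) for two genes (illustrative only)."""
    candidates = [(i / steps, 1 - i / steps) for i in range(steps + 1)]
    return max(candidates, key=lambda w: margin(w, tumor, normals))


# Expression values (g1, g2) taken from the example table.
tumor = (100, 100)
normals = [(50, 10), (10, 50)]

w = best_weights(tumor, normals)
print(w, margin(w, tumor, normals))
```

The search recovers equal weights of 0.5 and a margin of 70, matching the table. Putting all the weight on one gene gives a margin of only 50, because one of the normal tissues expresses that gene at 50.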
7 Contrast with Previous Work
- Previous work focused just on classifiers
- Separating tumor class from corresponding normal class
- Separating tumor from all other tumor tissues
- Linear and quadratic Support Vector Machines
- Brown et al., Moler et al., Ramaswamy et al., Su et al., Grate et al.
- Problem: Many cancers have poorly understood subtypes
- We focus on two combined aspects
- Classifiers separating tumor from all normal tissue classes
- Clustering tumors based on this paradigm of separation
8 Traditional Clustering
- Cluster tissues based on similarity of gene expression patterns
- Similar tissues have correlated gene expressions
- Eisen et al., PNAS 1998
- Problem: Genes driving the clustering
- Large classes of genes that are all regulated together
- Cell cycle and cell proliferation
- Protein biosynthesis and cell growth
- Respiration
- We need to weight these gene classes appropriately
9 Our Results
- Feature sets for tumor samples very small
- Picks only one from a correlated set of genes
- Genes with different functions expressed in different normal tissues
- Hierarchically cluster tumor samples
- Similarity metric for two tumor sets: Combined Margin
- Tumor samples with similar feature sets group together
- Identify natural clusters of tumor samples
- Construct feature sets for each cluster
- Biological significance
10 Clustering Hardness
- Given
- Set of n tumors
- Margin M
- Find largest tumor subset with margin $\ge M$
- Problem is $n^{1-\epsilon}$ hard to approximate
- Reduction from maximum clique problem
11 Clustering Algorithm
[Figure: tumors A to H and the normal samples plotted by expression of Gene x and Gene y, with the margins m1 and m2 marked.]
12 Cluster Boundaries
- Each node in tree labeled with combined margin of tumor samples in sub-tree
- Margin reduces as we move up the tree
- Chop tree at a chosen margin cut-off
- Sub-trees are the clusters
- Breast cancer samples group into three clusters
- ERBB2 (ERBB2 and GRB7)
- Luminal A type (ESR1, NAT1 and GATA3)
- Basal cell type(?) (Keratin, Fibrillin and Fibronectin)
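The tree building and margin cut-off can be sketched in Python as follows. This is a minimal illustration under several assumptions: only two genes are used, a grid search stands in for a proper optimizer, merging is greedy (the pair of clusters with the largest combined margin is merged first), and the sample values and the names `combined_margin` and `cluster` are invented for the example. Because the combined margin can only shrink as a cluster grows, stopping the merges at the cut-off plays the same role as chopping the finished tree there.

```python
def combined_margin(tumor_samples, normals, steps=100):
    """Best margin shared by a set of tumors, over convex weights on two genes.

    A grid search over w1 stands in for a proper optimizer (illustrative only).
    """
    best = float("-inf")
    for i in range(steps + 1):
        w = (i / steps, 1 - i / steps)
        tumor_scores = [w[0] * x + w[1] * y for x, y in tumor_samples]
        normal_scores = [w[0] * x + w[1] * y for x, y in normals]
        best = max(best, min(tumor_scores) - max(normal_scores))
    return best


def cluster(tumors, normals, cutoff):
    """Greedy agglomerative clustering; stops when no merge keeps margin >= cutoff."""
    clusters = [[name] for name in tumors]
    while len(clusters) > 1:
        candidates = []
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                members = clusters[i] + clusters[j]
                m = combined_margin([tumors[k] for k in members], normals)
                candidates.append((m, i, j))
        m, i, j = max(candidates)  # merge with the largest combined margin
        if m < cutoff:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical expression values (gene x, gene y); not data from the study.
tumors = {"A": (100, 20), "B": (90, 15), "C": (20, 100), "D": (15, 95)}
normals = [(50, 10), (10, 50)]

clusters = cluster(tumors, normals, cutoff=35)
print(clusters)
```

In this toy data, A and B are separated from the normals by gene x and C and D by gene y, so the two pairs merge with margins 40 and 45, while joining all four samples would drop the margin below the cut-off.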
13 Properties of Feature Sets
- Feature set for a tumor cluster
- Has at most 20 genes
- Most of the weight concentrated on a few genes
Tumor Cluster        Genes                 Fraction of weight
ERBB2 Breast         ERBB2                 65%
Luminal A Breast     ESR1, NAT1, GATA3     55%
Prostate sub-type    AMACR                 40%
Ovarian sub-type     MSLN, PAX8, COL1A2    65%
14 Quality of Clustering
- Random partitioning of tumor samples
- Divide tumor samples randomly into training and test groups
- Cluster training group
- Find cluster with best feature set margin for test sample
- Label the sample with the tumor type for that cluster
- Classifies unknown tumor samples accurately
- At least 75% accuracy in categorizing test samples
- At least 90% accuracy for CNS, Breast, Kidney, Ovary and Prostate cancers
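The labelling step of this evaluation can be sketched as below. The sketch assumes that each cluster found on the training group is summarized by one weight vector (its feature set) and that a test sample's margin is its weighted expression minus the highest weighted expression among the normals. The cluster labels, weights and expression values are hypothetical.

```python
def sample_margin(weights, sample, normals):
    """Margin of one sample against all normals under a cluster's feature set."""
    def score(values):
        return sum(w * x for w, x in zip(weights, values))
    return score(sample) - max(score(normal) for normal in normals)


def classify(sample, cluster_feature_sets, normals):
    """Return the label of the cluster whose feature set gives the best margin."""
    return max(
        cluster_feature_sets,
        key=lambda label: sample_margin(cluster_feature_sets[label], sample, normals),
    )


# Hypothetical feature sets learned from a training group (two genes only).
normals = [(50, 10), (10, 50)]
cluster_feature_sets = {
    "tumor type X": (1.0, 0.0),
    "tumor type Y": (0.0, 1.0),
}

label = classify((95, 20), cluster_feature_sets, normals)
print(label)
```

Accuracy would then be the fraction of held-out test samples whose assigned label matches their known tumor type.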
15 Discussion
- Small feature sets for a tumor class
- Based only on discriminating it versus normal tissues
- Property: Also discriminates it from other tumor classes
- Highly expressed genes unique to the tumor class
- Biological validation of our method
- ERBB2 and ESR1 can be targeted by monoclonal antibodies
- Some of the most effective treatments for breast cancers
- AMACR is a recently recognized prostate cancer marker
- Function not very well understood
- MSLN is a well studied ovarian cancer marker
16 Expanding Feature Sets
- Consider weighted combinations which have close to optimal margin
- Let optimal margin = M
- P(?): Polytope of feature sets with margin ? M
- ?
- Find weight vector with min Euclidean norm in P(?)
- Intuition
- Manhattan norm of any weight vector = 1
- Minimizing Euclidean norm spreads the weights
- Around 100 genes in feature set
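The spreading effect can be illustrated with a small Python sketch. The threshold that defines "close to optimal" is taken here, purely as an assumption, to be a fixed fraction of the optimal margin M (the hypothetical parameter `fraction`, which only makes sense when M is positive). The expression values are invented, and a grid search over two genes stands in for a quadratic program.

```python
def margin(weights, tumor, normals):
    """Separation between the tumor and the closest normal under the given weights."""
    def score(sample):
        return sum(w * x for w, x in zip(weights, sample))
    return min(score(tumor) - score(normal) for normal in normals)


def spread_weights(tumor, normals, fraction=0.97, steps=100):
    """Min Euclidean norm weights among those with margin >= fraction * optimal.

    `fraction` is an assumed way of expressing "close to optimal margin".
    """
    candidates = [(i / steps, 1 - i / steps) for i in range(steps + 1)]
    optimal = max(margin(w, tumor, normals) for w in candidates)
    near_optimal = [w for w in candidates
                    if margin(w, tumor, normals) >= fraction * optimal]
    return min(near_optimal, key=lambda w: w[0] ** 2 + w[1] ** 2)


# Hypothetical data: two genes with nearly the same separation from the normal.
tumor = (100, 96)
normals = [(10, 10)]

candidates = [(i / 100, 1 - i / 100) for i in range(101)]
sharpest = max(candidates, key=lambda w: margin(w, tumor, normals))
spread = spread_weights(tumor, normals)
print(sharpest, spread)
```

Here the optimal margin puts all the weight on the first gene, while the minimum-norm solution inside the near-optimal region splits the weight evenly across both genes. Every candidate has Manhattan norm 1, so lowering the Euclidean norm can only be achieved by spreading the weight.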
17 Genes in Larger Feature Sets
- Genes with similar expression patterns
- Example: ERBB2 and GRB7
- Genes expressed across cancer types
- Not very strongly expressed
- Do not drive the clustering
- Example: Proliferation and cell cycle related genes
- C20ORF1, CENPF, NUF2R, TOPK, L2DTL, KNSL1,
- Example: Possible alterations to chromosome 22
- PRAME
18 Future Work
- Identify cell surface proteins in feature sets
- Possible use in chemotherapy and diagnostics
- Findings for Ovarian and Pancreatic cancers being tested in the laboratory
- Identify genes highly expressed across cancer types
- Examples: TFAP2A, ADAM12 and LOX
- Biological significance?
- Succinct representations for biological functions
- Examples: Cell cycle, respiration,
- Applications in clustering and modeling gene expression