Title: Classifying Gene Expression Profiles from Pairwise mRNA Comparisons
1Classifying Gene Expression Profiles from
Pairwise mRNA Comparisons
- Yan Qi
- Biomedical Engineering Department
- 10/3/2006
2Outline
- The problem of Molecular classification
- The TSP classifier
- Results on three Cancer datasets
- Extensions of the TSP classifier
3Molecular Classification
- Objective predict class labels, e.g. cancer
subtypes, disease states using gene expression
profiles - Data G x n matrix, G is number of genes, n is
number of samples, each column is a gene
expression profile -
4Mathematical formulation
- Gene expression profile X ( X1, X2, , XG)
- Binary Class label Y 1 or Y 2
- Classifier A mapping f from X to Y
- Training dataset
- A G x n matrix, n n1n2
- Y G x 1 vector where n1 entries are 1, n2
entries are 2 - Learning find a mapping from A to f that
minimizes generalization error
5Challenges to standard learning methods
- Statistical dilemma n ltlt G
- Examples
- Consequence Over-fitting hence poor
generalizability - Practical issue complex f and DB
- Example ANN, SVM, random forests
- Consequence results are hard to interpret
biologically, inefficient in diagnostic settings
6The TSP classifierMotivation and strategy
- Rank-based scoring exploits the expression levels
of genes relative to each other and obtains
invariance to normalization. - Reduce model complexity by making the classifier
parameter free - Select informative gene pairs by proper LOOCV and
construct intuitive and biologically
interpretable classification rule by voting.
7TSP classifierrank-based score
- Idea replace expression value by genes ranks
within profiles. - Feature
- Score
-
- where
8Gene pair selection
- TSP
- The number of TSP and sample size
- A few when sample size is not too small ( gt102 ).
- Many when sample size is small ( lt 102 ).
- Example myocardial tissue gene expression
profiles - G 22283 n112, n210
- 2460 statistically significant TSPs
9Classification with one gene pair
- Let be a unique TSP
- Suppose
- TSP classifier
- Error on training set
10An Example
11Classification with multiple gene pairs
majority vote
- Seek a mapping from outputs of multiple single
TSP classifiers to a final prediction. - Let
- Output from represent a vote from
TSP i. - Final prediction class that receives the
majority vote from - Assume the features are
conditionally independent given the class and
equal, a Naïve Bayes classifier is equivalent to
using majority vote
12Classification with multiple TSPsmajority vote
Naïve Bayes classifier
- Assume
- Where
- Let
-
- Naïve Bayes Classifier
- Each TSP contribute equally if
-
13Loop of cross-validation
- Leave-one-out CV
- Estimated accuracy 1 - e/n where e is total
number of errors in cross-validation - Only the TSPs are determined by CV, unbiased
error estimate. - More complicated models, e.g. ANN and Decision
tree need to include both model topology and
parameters in CV loop, more likely to be biased.
14Three bench-mark cancer datasets
- Determine lymphnode status in breast tumor
samples ( West et al. 2001, G7129, n 49 ) - Classify leukemia subtypes ( Golub et al 1999, G
7129, n72) - Distinguish prostate tumors from normal samples (
Singh et al. 2002, G 12600, n102)
15Statistical significance of the score is
evaluated by permutation test
- Repeat e.g. 1000 times
- Keep feature matrix A
- Randomize class labels by keeping
- n1 and n2 unchanged
- Get top score Zmax
- Get histogram of Zmax
16The top scoring gene pairs for the three cancer
studies
17What does TSP represent?
- Change weak predictors into strong predictors?
- Change reference e.g. gene 3 as reference for
gene 1 and gene 2 - Combine two markers e.g. x4 and X5 might be
individual markers, one for each class, x4-X5 ?
18Performance of TSP classifier compared with
previous studies
- Breast cancer k-nearest neighbors (8-26) DLDA
(8-19) DQDA (11-26) - logitboost (9-21)
random forests (6-20) SVM (7-29) - Leukemia correlation analysis weighted vote
(85 on test set and 95 - on CV set.
- Prostate cancer k-nearest neighbour to genes
chosen by t-statistic
19Discussion and extensions
- Multiple class classification
- Unique TSP when there are gtgt1 TSPs, which is the
most informative? - kTSP there might be many pairs of genes with
informative ordering, combine this information
for more accurate classification? - Normalization invariance a suitable method to
integrate heterogeneous microarray datasets where
experimental conditions and normalization schemes
differ.