Transcript: Supervised Learning, Classification, Discrimination


1
Supervised Learning, Classification,
Discrimination
Slides recycled from PPT slides by Darlene Goldstein,
http://statwww.epfl.ch/davison/teaching/Microarrays/
2
Gene expression data
  • Data on G genes for n samples

                 mRNA samples
          sample1 sample2 sample3 sample4 sample5
Genes  1     0.46    0.30    0.80    1.51    0.90  ...
       2    -0.10    0.49    0.24    0.06    0.46  ...
       3     0.15    0.74    0.04    0.10    0.20  ...
       4    -0.45   -1.03   -0.79   -0.56   -0.32  ...
       5    -0.06    1.06    1.35    1.09   -1.09  ...

Entry (i, j) is the gene expression level of gene i in mRNA sample j
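A minimal R sketch of this layout, using the values shown above (rows are genes, columns are samples):

    # Toy gene expression matrix: rows = genes, columns = mRNA samples
    X <- matrix(c( 0.46,  0.30,  0.80,  1.51,  0.90,
                  -0.10,  0.49,  0.24,  0.06,  0.46,
                   0.15,  0.74,  0.04,  0.10,  0.20,
                  -0.45, -1.03, -0.79, -0.56, -0.32,
                  -0.06,  1.06,  1.35,  1.09, -1.09),
                nrow = 5, byrow = TRUE,
                dimnames = list(paste0("gene", 1:5), paste0("sample", 1:5)))
    X["gene2", "sample4"]  # expression level of gene i = 2 in sample j = 4

For classification, the samples are the observations, so the classifiers below operate on the transpose t(X) (one row per sample).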
3
Machine learning tasks
  • Task: assign objects to classes (groups) on the
    basis of measurements made on the objects
  • Unsupervised: classes unknown, want to discover
    them from the data (cluster analysis)
  • Supervised: classes are predefined, want to use
    a (training or learning) set of labeled objects
    to form a classifier for classification of future
    observations

4
Discrimination
  • Objects (e.g. arrays) are to be classified as
    belonging to one of a number of predefined
    classes 1, 2, ..., K
  • Each object is associated with a class label (or
    response) Y ∈ {1, 2, ..., K} and a feature vector
    (vector of predictor variables) of G
    measurements X = (X1, ..., XG)
  • Aim: predict Y from X

5
Example Tumor Classification
  • Reliable and precise classification essential for
    successful cancer treatment
  • Current methods for classifying human
    malignancies rely on a variety of morphological,
    clinical and molecular variables
  • Uncertainties in diagnosis remain; it is likely
    that existing classes are heterogeneous
  • Characterize molecular variations among tumors by
    monitoring gene expression (microarray)
  • Hope that microarrays will lead to more reliable
    tumor classification (and therefore more
    appropriate treatments and better outcomes)

6
Tumor Classification Using Gene Expression Data
  • Three main types of statistical problems
    associated with tumor classification
  • Identification of new/unknown tumor classes using
    gene expression profiles (unsupervised learning:
    clustering)
  • Classification of malignancies into known classes
    (supervised learning: discrimination)
  • Identification of marker genes that
    characterize the different tumor classes (feature
    or variable selection)

7
Classifiers
  • A predictor or classifier partitions the space of
    gene expression profiles into K disjoint subsets,
    A1, ..., AK, such that for a sample with
    expression profile X = (X1, ..., XG) ∈ Ak the
    predicted class is k
  • Classifiers are built from a learning set (LS)
    L = {(X1, Y1), ..., (Xn, Yn)}
  • A classifier C built from a learning set L is a map
    C(·, L): X → {1, 2, ..., K}
  • Predicted class for observation X:
    C(X, L) = k if X ∈ Ak

8
Decision Theory (I)
  • Can view classification as statistical decision
    theory: we must decide which of the classes an
    object belongs to
  • Use the observed feature vector X to aid in
    decision making
  • Denote the population proportion of objects of class
    k by pk = p(Y = k)
  • Assume objects in class k have feature vectors
    with class conditional density pk(X) = p(X | Y = k)

9
Decision Theory (II)
  • One criterion for assessing classifier quality is
    the misclassification rate,
    p(C(X) ≠ Y)
  • A loss function L(i, j) quantifies the loss
    incurred by erroneously classifying a member of
    class i as class j
  • The risk function R(C) for a classifier is the
    expected (average) loss:
    R(C) = E[L(Y, C(X))]

10
Decision Theory (III)
  • Typically L(i, i) = 0
  • In many cases one can assume symmetric loss with
    L(i, j) = 1 for i ≠ j (so that different types of
    errors are equivalent)
  • In this case, the risk is simply the
    misclassification probability
  • There are some important examples, such as in
    diagnosis, where the loss function is not
    symmetric

11
Maximum likelihood discriminant rule
  • A maximum likelihood estimator (MLE) chooses the
    parameter value that makes the chance of the
    observations the highest
  • For known class conditional densities pk(X), the
    maximum likelihood (ML) discriminant rule
    predicts the class of an observation X by
    C(X) = argmaxk pk(X)
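For intuition, a minimal R sketch of the ML rule with two known (hypothetical) one-dimensional class densities, N(0, 1) and N(2, 1):

    # ML discriminant rule with known class conditional densities
    p1 <- function(x) dnorm(x, mean = 0, sd = 1)  # class 1 density
    p2 <- function(x) dnorm(x, mean = 2, sd = 1)  # class 2 density
    ml_rule <- function(x) which.max(c(p1(x), p2(x)))  # C(x) = argmaxk pk(x)
    ml_rule(0.8)  # 1: x is more likely under class 1's density
    ml_rule(1.2)  # 2: x is more likely under class 2's density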

12
Fisher Linear Discriminant Analysis
  • First applied in 1935 by M. Barnard at the
    suggestion of R. A. Fisher (1936), Fisher linear
    discriminant analysis (FLDA)
  • finds linear combinations of the gene expression
    profiles X = (X1, ..., XG) with large ratios of
    between-groups to within-groups sums of squares
    (discriminant variables)
  • predicts the class of an observation X by the
    class whose mean vector is closest to X in terms
    of the discriminant variables
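A minimal sketch of FLDA in R via MASS::lda, using the built-in iris data in place of expression profiles:

    # FLDA: the columns of fit$scaling define the discriminant variables,
    # linear combinations with large between/within sums-of-squares ratios
    library(MASS)
    fit <- lda(Species ~ ., data = iris)
    fit$scaling                       # coefficients of discriminant variables
    pred <- predict(fit, iris)$class  # class of the nearest mean vector
    mean(pred != iris$Species)        # resubstitution error (downward biased)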

13
Gaussian ML Discriminant Rules
  • For multivariate Gaussian (normal) class
    densities X | Y = k ~ N(μk, Σk), the ML classifier
    is
    C(X) = argmink { (X - μk)' Σk^-1 (X - μk) + log |Σk| }
  • In general, this is a quadratic rule (quadratic
    discriminant analysis, or QDA)
  • In practice, population mean vectors μk and
    covariance matrices Σk are estimated by the
    corresponding sample quantities

14
Gaussian ML Discriminant Rules
  • When all class densities have the same covariance
    matrix, Σk = Σ, the discriminant rule is linear
    (linear discriminant analysis, or LDA; FLDA for
    K = 2):
    C(X) = argmink (X - μk)' Σ^-1 (X - μk)
  • When all class densities have the same diagonal
    covariance matrix Σ = diag(σ1², ..., σG²), the
    discriminant rule is again linear (diagonal
    linear discriminant analysis, or DLDA)
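In R, QDA and LDA are available as MASS::qda and MASS::lda; DLDA has no standard base function, so the following is a hypothetical hand-rolled sketch, where mu is a named list of class mean vectors and s2 the vector of pooled per-gene variances, both estimated from the learning set:

    library(MASS)
    qfit <- qda(Species ~ ., data = iris)  # quadratic rule, per-class Sigma_k
    lfit <- lda(Species ~ ., data = iris)  # linear rule, pooled Sigma

    # DLDA sketch: C(x) = argmink sum_g (x_g - mu_kg)^2 / sigma_g^2
    dlda_predict <- function(x, mu, s2) {
      names(mu)[which.min(sapply(mu, function(m) sum((x - m)^2 / s2)))]
    }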

15
Nearest Neighbor Classification
  • Based on a measure of distance between
    observations (e.g. Euclidean distance or one
    minus correlation)
  • The k-nearest neighbor rule (Fix and Hodges (1951))
    classifies an observation X as follows:
  • find the k observations in the learning set
    closest to X
  • predict the class of X by majority vote, i.e.,
    choose the class that is most common among those
    k observations
  • The number of neighbors k can be chosen by
    cross-validation (more on this later)
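A minimal sketch with class::knn, again on iris as a stand-in; k is chosen by leave-one-out cross-validation on the learning set via class::knn.cv:

    library(class)
    X <- as.matrix(iris[, 1:4]); y <- iris$Species
    # leave-one-out CV error rate for each candidate k
    cv_err <- sapply(1:15, function(k) mean(knn.cv(X, y, k = k) != y))
    best_k <- which.min(cv_err)
    # majority-vote prediction (Euclidean distance) for new observations
    pred <- knn(train = X, test = X[1:5, ], cl = y, k = best_k)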

16
How to construct a tree predictor
  • BINARY RECURSIVE PARTITIONING
  • Binary: split a parent node into two child nodes
  • Recursive: each child node can be treated as a
    parent node
  • Partitioning: the data set is partitioned into
    mutually exclusive subsets at each split

17
Tree construction
(Figure: an example tree; node class distributions are shown as
percentages High / Low risk.)

Root (High 17, Low 83): Is BP < 91?
  Yes (High 70, Low 30): classified as high risk!
  No  (High 12, Low 88): Is age < 62.5?
    Yes (High 2, Low 98): classified as low risk!
    No  (High 23, Low 77): Is ST present?
      Yes (High 50, Low 50): classified as high risk!
      No  (High 11, Low 89): classified as low risk!
18
Classification Trees
  • Partition the feature space into a set of
    rectangles, then fit a simple model in each one
  • Binary tree structured classifiers are
    constructed by repeated splits of subsets (nodes)
    of the measurement space X into two descendant
    subsets (starting with X itself)
  • Each terminal subset is assigned a class label;
    the resulting partition of X corresponds to the
    classifier
  • rpart() function in R (see the sketch below)
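A minimal sketch of growing a classification tree with rpart (iris again stands in for expression data):

    library(rpart)
    fit <- rpart(Species ~ ., data = iris, method = "class")
    print(fit)  # splits, node class distributions, terminal class labels
    pred <- predict(fit, iris, type = "class")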

19
Classification Tree

20
Three Aspects of Tree Construction
  • Split selection rule
  • Split-stopping rule
  • Class assignment rule
  • Different approaches to these three issues exist
    (e.g. CART: Classification And Regression Trees,
    Breiman et al. (1984); C4.5 and C5.0, Quinlan
    (1993))

21
Three Rules (CART)
  • Splitting: at each node, choose the split maximizing
    the decrease in impurity (e.g. Gini index, entropy,
    misclassification error)
  • Split-stopping: grow a large tree, prune to obtain
    a sequence of subtrees, then use cross-validation
    to identify the subtree with the lowest
    misclassification rate (see the sketch below)
  • Class assignment: for each terminal node, choose
    the class minimizing the resubstitution estimate
    of misclassification probability, given that a
    case falls into this node
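A minimal rpart sketch of these rules: Gini splitting is rpart's default for classification, a deliberately large tree is grown, and cost-complexity pruning picks the subtree with the lowest cross-validated error:

    library(rpart)
    big <- rpart(Species ~ ., data = iris, method = "class",
                 control = rpart.control(cp = 0, minsplit = 2))  # grow large
    printcp(big)  # the xerror column holds cross-validated error rates
    best_cp <- big$cptable[which.min(big$cptable[, "xerror"]), "CP"]
    pruned <- prune(big, cp = best_cp)  # subtree with lowest CV error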

22-25
(No transcript: figure-only slides)
26
Other Classifiers Include
  • Support vector machines (SVMs)
  • Neural networks
  • Random forest predictors
  • ... and hundreds more

27
Feature selection and missing data
  • Feature selection
  • Automatic with trees
  • For DA and NN, a preliminary selection step is needed
  • Need to account for selection when assessing
    performance
  • Missing data
  • Automatic imputation with trees
  • Otherwise, impute (or ignore)

28
Performance Assessment
  • error rate
  • test set error
  • learning set error (aka resubstitution error)
  • cross-validation
29
Performance assessment (I)
  • Resubstitution estimation: error rate on the
    learning set
  • Problem: downward bias
  • Test set estimation: divide the cases in the learning
    set into two sets, L1 and L2; the classifier is built
    using L1 and the error rate is computed on L2. L1 and
    L2 must be iid.
  • Problem: reduced effective sample size

30
Performance assessment (II)
  • V-fold cross-validation (CV) estimation: cases
    in the learning set are randomly divided into V subsets
    of (nearly) equal size. Build classifiers
    leaving one subset out; error rates are computed
    on the left-out subset and averaged (see the
    sketch below).
  • Bias-variance tradeoff: smaller V can give
    larger bias but smaller variance
  • Out-of-bag estimation: only used when dealing
    with bagged predictors
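A minimal V-fold CV sketch; f_fit and f_pred are placeholders for any fit/predict pair (e.g. thin wrappers around MASS::lda and predict):

    # V-fold cross-validated error rate for a generic classifier
    cv_error <- function(X, y, V = 10, f_fit, f_pred) {
      folds <- sample(rep(1:V, length.out = nrow(X)))  # random fold labels
      errs <- sapply(1:V, function(v) {
        fit <- f_fit(X[folds != v, , drop = FALSE], y[folds != v])
        mean(f_pred(fit, X[folds == v, , drop = FALSE]) != y[folds == v])
      })
      mean(errs)  # average left-out-set error over the V folds
    }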

31
Performance assessment (III)
  • A common error is to do feature selection using all
    of the data, then use CV only for model building and
    classification
  • However, usually the features are unknown and the
    intended inference includes feature selection.
    Then CV estimates as above tend to be downward
    biased.
  • Features should be selected only from the
    learning set used to build the model (and not the
    entire learning set), as in the sketch below
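A sketch of selection done inside each fold; top_genes is a hypothetical helper that ranks genes (e.g. by a BSS/WSS ratio) using only the fold's learning part:

    folds <- sample(rep(1:10, length.out = nrow(X)))
    errs <- sapply(1:10, function(v) {
      train <- folds != v
      g <- top_genes(X[train, ], y[train], m = 50)  # select on learning part only
      fit <- MASS::lda(X[train, g, drop = FALSE], grouping = y[train])
      mean(predict(fit, X[!train, g, drop = FALSE])$class != y[!train])
    })
    mean(errs)  # CV estimate that accounts for the selection step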

32
Aggregating classifiers
  • Breiman (1996, 1998) found that gains in accuracy
    could be obtained by aggregating predictors built
    from perturbed versions of the learning set; the
    multiple versions of the predictor are aggregated
    by voting
  • Let C(., Lb) denote the classifier built from the
    bth perturbed learning set Lb, and let wb denote
    the weight given to predictions made by this
    classifier. The predicted class for an
    observation x is given by
    argmaxk Σb wb I(C(x, Lb) = k)

33
Bagging
  • Bagging = bootstrap aggregating
  • Nonparametric bootstrap (standard bagging):
    perturbed learning sets are drawn at random with
    replacement from the learning set; predictors are
    built on each perturbed dataset and aggregated
    by plurality voting (wb = 1), as in the sketch below
  • Parametric bootstrap: perturbed learning sets
    are sampled from a fitted multivariate Gaussian
  • Convex pseudo-data (Breiman 1996)
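A minimal nonparametric-bootstrap bagging sketch with rpart trees and equal-weight plurality voting:

    library(rpart)
    bag_trees <- function(data, B = 50) {
      lapply(1:B, function(b) {
        boot <- data[sample(nrow(data), replace = TRUE), ]  # resample cases
        rpart(Species ~ ., data = boot, method = "class")
      })
    }
    bag_predict <- function(trees, newdata) {
      # one column of class votes per tree, then plurality vote per case
      votes <- sapply(trees,
                      function(t) as.character(predict(t, newdata, type = "class")))
      apply(votes, 1, function(v) names(which.max(table(v))))
    }
    trees <- bag_trees(iris)
    pred  <- bag_predict(trees, iris)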

34
Aggregation By-products: Out-of-bag estimation
of error rate
  • Out-of-bag error rate estimate: unbiased
  • Use the left-out cases from each bootstrap sample
    as a test set
  • Classify these test set cases, and compare to the
    class labels of the learning set to get the
    out-of-bag estimate of the error rate

35
Aggregation By-products: Case-wise information
  • Class probability estimates (votes), in [0, 1]: the
    proportion of votes for the winning class
    gives a measure of prediction confidence
  • Vote margins, in (-1, 1): the proportion of votes for
    the true class minus the maximum of the
    proportions of votes for each of the other
    classes; can be used to detect mislabeled
    (learning set) cases

36
Aggregation By-products: Variable importance
statistics
  • Measure of predictive power
  • For each tree, randomly permute the values of the
    jth variable for the out-of-bag cases and use these
    to get new classifications
  • Several possible importance measures

37
Aggregation By-products: Intrinsic case
proximities
  • Proportion of trees for which cases i and j are
    in the same terminal node
  • Clustering
  • Outlier detection: outlier measure =
    1 / sum(squared proximities of cases in same class)
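In R, the by-products on the last few slides come essentially for free from the randomForest package (a minimal sketch):

    library(randomForest)
    rf <- randomForest(Species ~ ., data = iris,
                       importance = TRUE, proximity = TRUE)
    rf$err.rate[rf$ntree, "OOB"]  # out-of-bag error rate estimate
    head(rf$votes)                # class probability estimates (vote proportions)
    importance(rf)                # permutation-based variable importance
    outlier(rf)                   # outlyingness from within-class proximities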

38
Boosting
  • Freund and Schapire (1997), Breiman (1998)
  • Data resampled adaptively so that the weights in
    the resampling are increased for those cases most
    often misclassified
  • Predictor aggregation is done by weighted voting
    (see the sketch below)
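A two-class sketch in the spirit of AdaBoost.M1, with rpart stumps as base classifiers; it assumes a data frame whose class column is a factor named y (names are illustrative, and the weighted error is assumed to stay in (0, 1/2)):

    library(rpart)
    ada_sketch <- function(data, B = 25) {
      n <- nrow(data); w <- rep(1 / n, n)  # start with equal case weights
      fits <- vector("list", B); alpha <- numeric(B)
      for (b in 1:B) {
        fits[[b]] <- rpart(y ~ ., data = data, weights = w, method = "class",
                           control = rpart.control(maxdepth = 1))  # a stump
        miss <- predict(fits[[b]], data, type = "class") != data$y
        err <- sum(w * miss) / sum(w)      # weighted error of this round
        alpha[b] <- log((1 - err) / err)   # weight of this classifier's vote
        w <- w * exp(alpha[b] * miss)      # upweight misclassified cases
      }
      list(fits = fits, alpha = alpha)     # aggregate by weighted voting
    }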

39
Comparison of classifiers
  • Dudoit, Fridlyand, Speed (JASA, 2002)
  • FLDA
  • DLDA
  • DQDA
  • NN
  • CART
  • Bagging and boosting

40
Comparison study datasets
  • Leukemia, Golub et al. (1999):
    n = 72 samples, G = 3,571 genes,
    3 classes (B-cell ALL, T-cell ALL, AML)
  • Lymphoma, Alizadeh et al. (2000):
    n = 81 samples, G = 4,682 genes,
    3 classes (B-CLL, FL, DLBCL)
  • NCI 60, Ross et al. (2000):
    n = 64 samples, G = 5,244 genes,
    8 classes

41
Leukemia data, 2 classes: test set error rates,
150 LS/TS runs
42
Leukemia data, 3 classes: test set error rates,
150 LS/TS runs
43
Lymphoma data, 3 classes: test set error rates,
150 LS/TS runs
44
NCI 60 data: test set error rates, 150 LS/TS runs
45
Results
  • In the main comparison of Dudoit et al., NN and
    DLDA had the smallest error rates and FLDA had the
    highest
  • For the lymphoma and leukemia datasets,
    increasing the number of genes to G = 200 didn't
    greatly affect the performance of the various
    classifiers; there was an improvement for the NCI
    60 dataset
  • More careful selection of a small number of genes
    (10) improved the performance of FLDA dramatically

46
Comparison study Discussion (I)
  • Diagonal LDA: ignoring correlation between
    genes helped here
  • Unlike classification trees and nearest
    neighbors, LDA is unable to take into account
    gene interactions
  • Although nearest neighbors are simple and
    intuitive classifiers, their main limitation is
    that they give very little insight into the
    mechanisms underlying the class distinctions

47
Comparison study Discussion (II)
  • Classification trees are capable of handling and
    revealing interactions between variables
  • Useful by-products of aggregated classifiers:
    prediction votes, variable importance statistics
  • Variable selection: a crude criterion such as
    BSS/WSS may not identify the genes that
    discriminate between all the classes and may not
    reveal interactions between genes
  • With larger training sets, expect improvement in
    the performance of aggregated classifiers

48
Acknowledgements
  • Sandrine Dudoit
  • Jane Fridlyand
  • Yee Hwa (Jean) Yang
  • Terry Speed