Title: ICS 278: Data Mining Lectures 10,11: Classification Algorithms
1. ICS 278 Data Mining, Lectures 10 and 11: Classification Algorithms
- Padhraic Smyth
- Department of Information and Computer Science
- University of California, Irvine
2. Notation
- Variables X, C, ... with values x, c (lower case)
- Vectors indicated by boldface X
- Components of X indicated by Xj with values xj
- Matrix data set D with n rows and p columns
- The jth column contains values for variable Xj
- The ith row contains a vector of measurements on object i, indicated by x(i)
- The jth measurement value for the ith object is xj(i)
- Unknown parameter for a model: θ
- Can also use other Greek letters, like α, β, δ, γ
- Vector of parameters: θ (boldface)
3. Classification
- Predictive modeling: predict Y given X
- Y is real-valued → regression
- Y is categorical → classification
- Often use C rather than Y to indicate the class variable
- Classification has many applications: speech recognition, document classification, OCR, loan approval, face recognition, etc.
4. Classification vs. Regression
- Similar in many ways
- Both learn a mapping from X to C or Y
- Both are sensitive to the dimensionality of X
- Generalization to new data is important in both
- Test error versus model complexity
- Many models can be used for either classification or regression, e.g., trees and neural networks
- Most important differences
- Categorical Y versus real-valued Y
- Different score functions
- E.g., classification error versus squared error
5. Decision Region Terminology [figure]
6. Probabilistic view of Classification
- Notation: let there be K classes c1, ..., cK
- Class marginals: p(ck) = probability of class k
- Class-conditional probabilities: p(x | ck) = probability of x given ck, k = 1, ..., K
- Posterior class probabilities (by Bayes rule):
    p(ck | x) = p(x | ck) p(ck) / p(x), k = 1, ..., K
    where p(x) = Σj p(x | cj) p(cj)
- In theory this is all we need... in practice this may not be the best approach.
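As a toy numeric illustration of the Bayes rule computation above (the prior and likelihood values here are hypothetical, chosen only to make the arithmetic visible):

# Sketch: posterior class probabilities via Bayes rule.
# The priors and likelihood values below are made up for illustration.
priors = {"c1": 0.7, "c2": 0.3}          # p(ck)
likelihoods = {"c1": 0.2, "c2": 0.6}     # p(x | ck) at some fixed x

# p(x) = sum_j p(x | cj) p(cj)
p_x = sum(likelihoods[c] * priors[c] for c in priors)

# p(ck | x) = p(x | ck) p(ck) / p(x)
posteriors = {c: likelihoods[c] * priors[c] / p_x for c in priors}
print(posteriors)   # {'c1': 0.4375, 'c2': 0.5625}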
7. Example of Probabilistic Classification
[Figure: class-conditional densities p(x | c1) and p(x | c2)]
8. Example of Probabilistic Classification
[Figure: class-conditional densities p(x | c1), p(x | c2), and posterior p(c1 | x) on a 0 to 1 scale]
9. Example of Probabilistic Classification
[Figure: class-conditional densities p(x | c1), p(x | c2), and posterior p(c1 | x) on a 0 to 1 scale]
10. Decision Regions and Bayes Error Rate
[Figure: densities p(x | c1) and p(x | c2) with alternating decision regions labeled class c1 and class c2]
- Optimal decision regions: regions where one class is more likely
- Optimal decision regions ↔ optimal decision boundaries
11. Decision Regions and Bayes Error Rate
[Figure: the same densities and decision regions, with the error region shaded]
- Optimal decision regions: regions where one class is more likely
- Optimal decision regions ↔ optimal decision boundaries
- Bayes error rate: fraction of examples misclassified by the optimal classifier
    = shaded area in the figure (see equation 10.3 in the text)
12. Procedure for the optimal Bayes classifier
- For each class learn a model p(x | ck)
- E.g., each class is multivariate Gaussian with its own mean and covariance
- Use Bayes rule to obtain p(ck | x)
- ⇒ this yields the optimal decision regions/boundaries
- ⇒ use these decision regions/boundaries for classification
- Correct in theory... but practical problems include:
- How do we model p(x | ck)?
- Even if we know the model for p(x | ck), modeling a distribution or density will be very difficult in high dimensions (e.g., p = 100)
- Alternative approach: model the decision boundaries directly
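As a concrete sketch of this procedure, the following toy Python fits one Gaussian per class and classifies by maximizing log p(x | ck) + log p(ck). The two-feature setup, priors, and simulated data are assumptions made purely for illustration:

# Sketch: optimal Bayes procedure with Gaussian class-conditional models.
import numpy as np

def fit_gaussian(X):
    return X.mean(axis=0), np.cov(X, rowvar=False)

def log_gaussian_pdf(x, mean, cov):
    d = len(mean)
    diff = x - mean
    return -0.5 * (d * np.log(2 * np.pi) + np.log(np.linalg.det(cov))
                   + diff @ np.linalg.solve(cov, diff))

rng = np.random.default_rng(0)
X1 = rng.normal(loc=0.0, scale=1.0, size=(100, 2))   # simulated class c1 data
X2 = rng.normal(loc=2.0, scale=1.0, size=(100, 2))   # simulated class c2 data
params = {"c1": fit_gaussian(X1), "c2": fit_gaussian(X2)}
priors = {"c1": 0.5, "c2": 0.5}

def classify(x):
    # argmax_k [log p(x | ck) + log p(ck)], equivalent to maximizing p(ck | x)
    return max(priors, key=lambda c: log_gaussian_pdf(x, *params[c]) + np.log(priors[c]))

print(classify(np.array([1.8, 2.1])))   # likely 'c2'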
13. Three categories of classifiers in general
- Generative (or class-conditional) classifiers
- Learn models for p(x | ck), use Bayes rule to find decision boundaries
- Examples: naïve Bayes models, Gaussian classifiers
- Regression (or posterior class probability) classifiers
- Learn a model for p(ck | x) directly
- Examples: logistic regression (see lecture 5/6), neural networks
- Discriminative classifiers
- No probabilities
- Learn the decision boundaries directly
- Examples:
- Linear boundaries: perceptrons, linear SVMs
- Piecewise linear boundaries: decision trees, nearest-neighbor classifiers
- Non-linear boundaries: non-linear SVMs
- Note: one can usually post-fit class probability estimates p(ck | x) to a discriminative classifier
14. Which type of classifier is appropriate?
- Let's look at the score functions
- c(i) = true class, c(x(i); θ) = class predicted by the classifier
- Class-mismatch loss functions
- S(θ) = (1/n) Σi cost(c(i), c(x(i); θ))
- where cost(i, j) = cost of misclassifying true class i as predicted class j
- e.g., cost(i, j) = 0 if i = j, 1 otherwise (misclassification error or 0-1 loss)
- more generally, cost(i, j) is a matrix of K x K losses (e.g., surgery, spam email, etc.)
- Class-probability loss functions
- S(θ) = (1/n) Σi log p(c(i) | x(i); θ) (log probability score)
- or S(θ) = (1/n) Σi [c(i) − p(c(i) | x(i); θ)]² (Brier score)
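A small sketch of these three score functions for a binary problem (the labels and predicted probabilities are toy values; for 0/1 labels the Brier term [c(i) − p]² equals (1 − p_true)², which is the form used below):

# Sketch: 0-1 loss, log probability score, and Brier score (binary case).
import math

true_class = [1, 0, 1, 1]
pred_prob  = [0.9, 0.2, 0.4, 0.8]          # predicted p(c = 1 | x)
pred_class = [1 if p > 0.5 else 0 for p in pred_prob]

n = len(true_class)
zero_one = sum(c != ch for c, ch in zip(true_class, pred_class)) / n

# probability assigned to the *true* class of each example
p_true = [p if c == 1 else 1 - p for c, p in zip(true_class, pred_prob)]
log_score   = sum(math.log(p) for p in p_true) / n        # higher is better
brier_score = sum((1 - p) ** 2 for p in p_true) / n       # lower is better

print(zero_one, log_score, brier_score)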
15. Example: classifying spam email
- 0-1 loss function
- Appropriate if we just want to maximize accuracy
- Asymmetric cost matrix
- Appropriate if missing non-spam emails is more costly than failing to detect spam emails
- Probability loss
- Appropriate if we want to rank all emails by p(spam | email features), e.g., to allow the user to look at emails via a ranked list
- In general: don't solve a harder problem than you need to, or don't model aspects of the problem you don't need to (e.g., modeling p(x | c)) - Vapnik, 1996
16. Examples of classifiers
- Generative/class-conditional/probabilistic, based on p(x | ck)
- Naïve Bayes (simple, but often effective in high dimensions)
- Parametric generative models, e.g., Gaussian (can be effective in low-dimensional problems; leads to quadratic boundaries in general)
- Regression-based, p(ck | x) directly
- Logistic regression: simple, linear in "odds" space
- Neural network: non-linear extension of logistic regression, can be difficult to work with
- Discriminative models, focus on locating optimal decision boundaries
- Linear discriminants, perceptrons: simple, sometimes effective
- Support vector machines: generalization of linear discriminants, can be quite effective, computational complexity is an issue
- Nearest neighbor: simple, can scale poorly in high dimensions
- Decision trees: "Swiss army knife", often effective in high dimensions
17. Naïve Bayes Classifiers
- Generative probabilistic model with a conditional independence assumption on p(x | ck), i.e.,
    p(x | ck) = Πj p(xj | ck)
- Typically used with nominal variables
- Real-valued variables are discretized to create nominal versions
- (The alternative is to model each p(xj | ck) with a parametric model; this is less widely used)
- Comments:
- Simple to train (just estimate conditional probabilities for each feature-class pair)
- Often works surprisingly well in practice
- e.g., state of the art for text classification, the basis of many widely used spam filters
- Feature selection can be helpful, e.g., via information gain
- Note that even if the CI assumptions are not met, it may still be able to approximate the optimal decision boundaries (this seems to happen in practice)
- However... on most problems it can usually be beaten with a more complex model (plus more work)
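A minimal naïve Bayes sketch for nominal features might look like the following. It is illustrative only: the add-one smoothing scheme and the toy weather data are assumptions, not from the slides.

# Sketch: naïve Bayes for nominal features, with add-one smoothing.
from collections import Counter, defaultdict
import math

def train_nb(X, y):
    classes = Counter(y)
    cond = defaultdict(Counter)              # cond[(j, c)][value] = count
    for xi, c in zip(X, y):
        for j, v in enumerate(xi):
            cond[(j, c)][v] += 1
    priors = {c: k / len(y) for c, k in classes.items()}
    return priors, cond, classes

def predict_nb(x, priors, cond, classes):
    scores = {}
    for c in priors:
        s = math.log(priors[c])              # log p(c)
        for j, v in enumerate(x):            # + sum_j log p(xj | c), smoothed
            counts = cond[(j, c)]
            s += math.log((counts[v] + 1) / (classes[c] + len(counts) + 1))
        scores[c] = s
    return max(scores, key=scores.get)

X = [("sunny", "hot"), ("rain", "cool"), ("sunny", "cool"), ("rain", "hot")]
y = ["no", "yes", "yes", "no"]
model = train_nb(X, y)
print(predict_nb(("sunny", "cool"), *model))   # 'yes'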
18. Link between Logistic Regression and Naïve Bayes
[Figure: side-by-side derivations for naïve Bayes and logistic regression, showing the functional forms of the two models]
19. Imbalanced Class Distributions
- Common in data mining to have one class be much less likely than the others
- e.g., 0.1% of examples are fraudulent or have a disease
- If we train a standard classifier on a random sample of data it is very difficult to beat the "majority classifier" in terms of accuracy
- Approaches:
- Stratified sampling: artificially create training data with 50% of each class present, and then correct for this in prediction
- E.g., learn p(x | c) on stratified data and use the true p(c) when predicting with a probabilistic model
- Use a different score function:
- We are often interested in scoring/screening/ranking cases when using the model
- Thus, scores such as how many of the class of interest are ranked in the top 1% of predictions may be more relevant than overall accuracy (e.g., in document retrieval)
20. Ranking and Lift Curves
- Many problems where we are interested in ranking examples in terms of how likely they are to belong to the positive class
- E.g., credit scoring, fraud detection, medical screening, document retrieval
- E.g., use the classifier to rank N test examples according to p(c | x) and then pick the top K, where K is much smaller than N
- Lift curve:
- n = number of true positives that appear in the top K% of the ranked list
- r = number of true positives that would appear if we ranked randomly
- n/r is the "lift" provided by the classifier for the top K%
- e.g., K = 10%, r = 200, n = 300: lift = 1.5, or a 50% increase in lift
- Random ranking gives lift = 1, or a 0% increase in lift
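The lift computation can be sketched as follows (the scores and labels are toy values; frac is the top-K fraction of the ranked list):

# Sketch: lift at the top K% of a ranked list.
scores = [0.9, 0.8, 0.75, 0.6, 0.5, 0.4, 0.3, 0.2, 0.15, 0.1]
labels = [1,   1,   0,    1,   0,   0,   1,   0,   0,    0]

def lift_at(scores, labels, frac):
    ranked = sorted(zip(scores, labels), reverse=True)
    top = ranked[: max(1, int(frac * len(ranked)))]
    n = sum(lab for _, lab in top)       # true positives in the top K%
    r = sum(labels) * frac               # expected count under random ranking
    return n / r

print(lift_at(scores, labels, 0.2))   # top 20%: 2 positives vs 0.8 expected -> lift 2.5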
21. Example: Mailing Campaign
- Target variable: response/no-response from a mailing campaign
- Training and test sets each of size 250k
- The standard model had 80 variables; variable selection reduced this to 7
- Note the non-monotonicity in the lower curve (undesirable)
22. ROC plots
- Rank the N test examples by p(c | x)
- or by whatever real number our classifier produces that indicates likelihood of belonging to class 1
- Let k = number of examples in class 1 and m = number in class 0, where k + m = N
- For all possible thresholds on this ranked list:
- count the number of true positives, kt
- true positive rate = kt / k
- count the number of false alarms, mt
- false positive rate = mt / m
- ROC plot: plot of true positive rate kt/k versus false positive rate mt/m
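A sketch of this threshold sweep, plus AUC by the trapezoid rule (toy data matching the notation above, with k = 6 class 1's and m = 4 class 0's):

# Sketch: TPR/FPR over all thresholds of a ranked list, plus AUC.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1]
labels = [1,   1,   0,   1,   1,    0,   1,   0,   1,   0]

k, m = sum(labels), len(labels) - sum(labels)
ranked = [lab for _, lab in sorted(zip(scores, labels), reverse=True)]

tpr, fpr, kt, mt = [0.0], [0.0], 0, 0
for lab in ranked:                 # sweep the threshold down the ranked list
    kt += lab
    mt += 1 - lab
    tpr.append(kt / k)
    fpr.append(mt / m)

auc = sum((fpr[i+1] - fpr[i]) * (tpr[i+1] + tpr[i]) / 2 for i in range(len(fpr) - 1))
print(list(zip(fpr, tpr)), auc)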
23. ROC Example
- N = 10 examples
- k = 6 true class 1's
- m = 4 class 0's
- The first column is a possible ranking from a classifier
24. ROC Plot
- Area under the curve (AUC) is often used as a metric to summarize the ROC plot
- Online example at http://www.anaesthetist.com/mnm/stats/roc/
- The diagonal line corresponds to random ranking
25. Example: Link Prediction in Coauthor Graphs (O'Madadhain, Hutchins, Smyth, SIGKDD, 2005)
- Binary classification problem
- Training data:
- graph of coauthor links, 100k authors, 300k links
- data over several years
- Test data: coauthor graph for the same authors in a future year
- Classification problem:
- predict whether a pair (A, B) will coauthor
- Training and test pairs selected in various ways
- Compared a variety of different classifiers and evaluation metrics
- Skewed class distribution:
- No link present (class 0) in 93.8% of test examples
- Link present (class 1) in 6.2% of test examples
26. Evaluation Metrics
- Classification error
- If p(link | A, B) > 0.5, predict a link
- Brier score
- Σ [p(link | A, B) − I(A, B)]², where I(A, B) indicates whether a link is actually present
- ROC area
- area under the ROC plot (between 0 and 1)
27. Link Prediction Evaluation [results figure]
28. Link Prediction Evaluation [results figure]
29. Link Prediction Evaluation [results figure]
30. Lift Curves for Different Models
[Figure: lift curves; base rate of links = 6.2%]
31. Interpretation of Ranking at the Top of the Ranked List
- Top 50 ranked candidates:
- "Averaged" model: contains 44 true links
- "Logistic" model: contains 40 true links
- Baseline: contains 3 true links
- Top 500 ranked candidates:
- "Averaged" model: contains 300 true links
- "Logistic" model: contains 298 true links
- Baseline: contains 31 true links
32. Lift Curves for Different Models
[Figure: lift curves; base rate of links = 0.2%]
33. Calibration
- In addition to ranking, we may be interested in how accurate our estimates of p(c | x) are
- i.e., if the model says p(c | x) = 0.9, how accurate is this number?
- Calibration:
- a model is well-calibrated if its probabilistic predictions match real-world empirical frequencies
- i.e., if a classifier predicts p(c | x) = 0.9 for 100 examples, then on average we would expect about 90 of these examples to belong to class c, and 10 not to
- We can estimate calibration curves by binning a classifier's probabilistic predictions and measuring how many examples in each bin actually belong to class c
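A sketch of the binning procedure (toy predictions and outcomes; ten equal-width bins are an arbitrary choice):

# Sketch: empirical calibration curve by binning predicted probabilities.
preds  = [0.1, 0.15, 0.3, 0.35, 0.55, 0.6, 0.8, 0.85, 0.9, 0.95]
actual = [0,   0,    0,   1,    1,    0,   1,   1,    1,   1]

bins = [[] for _ in range(10)]
for p, a in zip(preds, actual):
    bins[min(int(p * 10), 9)].append((p, a))

for i, b in enumerate(bins):
    if b:
        mean_pred = sum(p for p, _ in b) / len(b)    # average predicted prob
        frac_pos  = sum(a for _, a in b) / len(b)    # empirical frequency
        print(f"bin {i}: predicted {mean_pred:.2f}, observed {frac_pos:.2f}")
# A well-calibrated model has predicted ~ observed in every bin.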
34. Calibration in Probabilistic Prediction [figure]
35. Linear Discriminants
- Discriminant → a method for computing class decision boundaries
- Linear discriminant → linear decision boundaries
- Linear Discriminant Analysis (LDA)
- Earliest known classifier (1936, R. A. Fisher)
- See section 10.4 for the mathematical details
- Find a projection onto a vector such that the means for each class (2 classes) are separated as much as possible (with the variances taken into account appropriately)
- Reduces to a special case of the parametric Gaussian classifier in certain situations
- Many subsequent variations on this basic theme (e.g., regularized LDA)
- Other linear discriminants:
- Decision boundary = (p-1)-dimensional hyperplane in p dimensions
- Perceptron learning algorithms (pre-dated neural networks)
- Simple error-correction-based learning algorithms
- Linear SVMs: use a sophisticated "margin" idea for selecting the hyperplane
36. Nearest Neighbor Classifiers
- kNN: select the k nearest neighbors to x from the training data and predict the majority class among these neighbors
- k is a parameter:
- Small k: noisier estimates; large k: smoother estimates
- The best value of k is often chosen by cross-validation
- Comments:
- Virtually assumption-free
- Gives piecewise linear boundaries (i.e., non-linear overall)
- Interesting theoretical properties:
    Bayes error ≤ error(kNN) ≤ 2 × Bayes error (asymptotically)
- Disadvantages:
- Can scale poorly with dimensionality; sensitive to the distance metric
- Requires fast lookup at run time to do classification with large n
- Does not provide any interpretable model
37. Local Decision Boundaries
- Boundary? Points that are equidistant between points of class 1 and class 2
- Note: locally, the boundary is (1) linear (because of Euclidean distance), (2) halfway between the two class points, and (3) at right angles to the connector
[Figure: points of classes 1 and 2 in the Feature 1 / Feature 2 plane, with a query point marked "?"]
38. Finding the Decision Boundaries
[Figure: the same points, with local boundaries being constructed]
39. Finding the Decision Boundaries
[Figure: the same points, with more local boundaries constructed]
40. Finding the Decision Boundaries
[Figure: the same points, with the local boundaries completed]
41. Overall Boundary: Piecewise Linear
[Figure: the same points, partitioned into a decision region for class 1 and a decision region for class 2 by a piecewise linear boundary]
42. Example: Choosing k in kNN
[Figure: example from G. Ridgeway, 2003]
43. Decision Tree Classifiers
- Widely used in practice
- Can handle both real-valued and nominal inputs (unusual)
- Good with high-dimensional data
- similar algorithms as used in constructing regression trees
- historically, developed both in statistics and computer science
- Statistics:
- Breiman, Friedman, Olshen and Stone: CART, 1984
- Computer science:
- Quinlan: ID3, C4.5 (1980s-1990s)
44. Decision Tree Example
[Figure: data plotted in the Income/Debt plane]
45. Decision Tree Example
[Figure: first split, Income > t1]
46. Decision Tree Example
[Figure: second split, Debt > t2]
47. Decision Tree Example
[Figure: third split, Income > t3]
48. Decision Tree Example
[Figure: final partition from the splits Income > t1, Debt > t2, Income > t3]
- Note: the tree boundaries are piecewise linear and axis-parallel
49. Binary split selection criteria
- Q(t) = N1 Q1(t) + N2 Q2(t), where t is the threshold
- = average quality of the split
- Let p1k be the proportion of class k points in region 1
- Error criterion for a branch:
- Q1(t) = 1 − p1k, for the majority class k in region 1
- Gini index: Q1(t) = Σk p1k (1 − p1k)
- Cross-entropy: Q1(t) = − Σk p1k log p1k
- Cross-entropy and Gini work better in practice than direct minimization of classification error at each node
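The three branch-quality measures can be sketched directly from the class proportions p1k (the proportions below are toy values):

# Sketch: branch-quality measures from class proportions in a region.
import math

def error_q(p):    return 1 - max(p)                      # misclassification error
def gini_q(p):     return sum(pk * (1 - pk) for pk in p)  # Gini index
def entropy_q(p):  return -sum(pk * math.log(pk) for pk in p if pk > 0)  # cross-entropy

p1 = [0.8, 0.2]    # proportions of the two classes in region 1
print(error_q(p1), gini_q(p1), entropy_q(p1))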
50. How to Choose the Right-Sized Tree?
[Figure: predictive error versus size of decision tree, showing the error on training data, the error on test data, and the ideal range for tree size]
51. Choosing a Good Tree for Prediction
- General idea:
- grow a large tree
- prune it back to create a family of subtrees
- "weakest link" pruning
- score the subtrees and pick the best one
- Massive data sizes (e.g., n = 100k data points):
- use the training data set to fit a set of trees
- use a validation data set to score the subtrees
- Smaller data sizes (e.g., n = 1k or less):
- use cross-validation
- use explicit penalty terms (e.g., Bayesian methods)
52. Example: Spam Email Classification
- Data set (from the UCI Machine Learning Archive):
- 4601 email messages from 1999
- Manually labelled as spam (60%) or non-spam (40%)
- 54 features: percentage of words matching a specific word/character
- business, address, internet, free, george, !, $, etc.
- Average/longest/sum lengths of uninterrupted sequences of CAPS
- Error rates (Hastie, Tibshirani, Friedman, 2001):
- Training: 3056 emails; testing: 1536 emails
- Decision tree: 8.7%
- Logistic regression: 7.6%
- Naïve Bayes: 10% (typically)
55. Treating Missing Data in Trees
- Missing values are common in practice
- Approaches to handling missing values:
- During training:
- Ignore rows with missing values (inefficient)
- During testing:
- Send the example being classified down both branches and average the predictions
- Replace missing values with an imputed value (can be suboptimal)
- Other approaches:
- Treat "missing" as a unique value (useful if missing values are correlated with the class)
- Surrogate splits method:
- Search for and store surrogate variables/splits during training
56. Other Issues with Classification Trees
- Why use binary splits?
- Multiway splits can be used, but cause fragmentation
- Linear combination splits?
- can produce small improvements
- optimization is much more difficult (need weights and a split point)
- trees become much less interpretable
- Model instability:
- A small change in the data can lead to a completely different tree
- Model averaging techniques (like bagging) can be useful
- Tree bias:
- Poor at approximating non-axis-parallel boundaries
- Producing rule sets from tree models (e.g., C5.0)
57. Why Trees are Widely Used in Practice
- Can handle high-dimensional data
- builds a model using one dimension at a time
- Can handle any type of input variables
- categorical, real-valued, etc.
- most other methods require data of a single type (e.g., only real-valued)
- Invariant to monotonic transformations of the input variables
- E.g., using x, 10x + 2, log(x), 2^x, etc., will not change the tree
- Trees are (somewhat) interpretable
- a domain expert can read off the tree's logic
- Tree algorithms are relatively easy to code and test
58. Limitations of Trees
- Representational bias:
- classification: piecewise linear boundaries, parallel to the axes
- regression: piecewise constant surfaces
- Trees do not scale well to massive data sets (e.g., N in the millions)
- repeated (unpredictable) access of subsets of the data
- e.g., compare to linear scanning
- High variance:
- trees can be unstable as a function of the sample
- e.g., a small change in the data → a completely different tree
- this causes two problems:
- 1. High variance contributes to prediction error
- 2. High variance reduces interpretability
- Trees are good candidates for model combining
- Often used with boosting and bagging
59. Decision Trees are Not Stable
- Moving just one example slightly may lead to quite different trees and space partitions! This illustrates the lack of stability against small perturbations of the data.
- Figure from Duda, Hart & Stork, Chap. 8
60. Example of Tree Instability
[Figure: 2 trees fit to 2 splits of the data, from G. Ridgeway, 2003]
61. Model Averaging
- Can average over parameters and models
- E.g., a weighted linear combination of predictions from multiple models:
    ŷ = Σk wk yk
- Why? Predictions from a point estimate of the parameters, or from any single model, have only a small chance of being the best
- Averaging makes our predictions more stable and less sensitive to random variations in a particular data set (good for less stable models like trees)
62. Model Averaging
- Model averaging flavors:
- Fully Bayesian: average over uncertainty in parameters and models
- "Empirical Bayesian": learn weights over multiple models
- E.g., stacking and bagging
- Build multiple simple models in a systematic way and combine them, e.g.:
- Bagging:
- Build models on random subsets of the data and then combine
- E.g., random forests: stochastically perturb the data, learn multiple trees, and then combine for prediction
- Stacking/ensemble methods:
- Build multiple different models and then learn to combine them
- Combining weights are learned on a different data set than that used for parameter estimation
- Boosting:
- Start with a simple model
- Reweight the training data to emphasize where the model makes errors
63. Bagging for Combining Classifiers
- Training data set of size N
- Generate B "bootstrap" sampled data sets of size N
- Bootstrap sample = sample with replacement
- e.g., B = 100
- Build B models (e.g., trees), one for each bootstrap sample
- The intuition is that the bootstrapping "perturbs" the data enough to make the models more resistant to true variability
- For prediction, combine the predictions from the B models
- E.g., for classification, p(c | x) = fraction of the B models that predict c
- Plus: generally improves accuracy for models such as trees
- Negative: lose interpretability
- Related techniques: random forests, boosting
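A bagging sketch follows. The base learner here is a trivial majority-class stand-in, purely to keep the example self-contained; in practice it would be a tree learner:

# Sketch: bagging -- bootstrap B training sets, fit a base learner to
# each, and estimate p(c | x) as the fraction of models predicting c.
import random
from collections import Counter

def bagged_predict(x, data, fit_and_predict, B=100):
    votes = Counter()
    for _ in range(B):
        boot = [random.choice(data) for _ in range(len(data))]  # sample w/ replacement
        votes[fit_and_predict(boot, x)] += 1
    return {c: v / B for c, v in votes.items()}

def majority_class(boot, x):     # trivial stand-in for a tree learner
    return Counter(label for _, label in boot).most_common(1)[0][0]

data = [((0, 0), "a"), ((1, 1), "a"), ((5, 5), "b")]
print(bagged_predict((0.5, 0.5), data, majority_class, B=50))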
64. [Figure: bagging results; green = majority vote, purple = averaging the probabilities. From Hastie, Tibshirani, and Friedman, 2001]
65. Illustration of Boosting
[Figure: color of points = class label; diameter of points = weight at each iteration; dashed line = single-stage classifier; green line = combined, boosted classifier; dotted blue in the last two panels = bagging. From G. Rätsch, PhD thesis, 2001]
66. Support Vector Machines
- Support vector machines:
- Use a specific loss function, the "margin"
- Results in a convex optimization problem, solvable by quadratic programming
- The decision boundary is represented by examples ("support vectors") in the training data
- Linear version:
- Uses clever placement of the hyperplane
- Very useful in high-dimensional problems, e.g., text classification
- Non-linear version:
- "kernel trick" for high-dimensional problems
- Some parameter tuning required, e.g., using validation data
- Computational complexity can be O(N^3) without speedups
- Heuristic approximations:
- e.g., Platt (1999), Sequential Minimal Optimization (SMO)
- Will discuss SVMs again in a future lecture on text classification
67. Experiments by Komarek and Moore, 2005 [figure]
68. Accuracies and Training Time (Komarek and Moore, 2005) [figure]
69. Accuracies and Training Time (Komarek and Moore, 2005) [figure]
70. From Caruana and Niculescu-Mizil, 2005: results averaged over 8 well-known classification data sets [figure]
71. Comparison of accuracy across three classifiers (Naive Bayes, Maximum Entropy, and Linear SVM) using three data sets: 20 newsgroups, the Recreation sub-tree of the Open Directory, and University Web pages from WebKB. From Chakrabarti, 2003, Chapter 5.
72. Summary on Classifiers
- Simple models (can be effective on some problems):
- Logistic regression
- Naïve Bayes
- K nearest-neighbors
- Decision trees
- Good for high-dimensional problems with different data types
- State of the art:
- Support vector machines
- Boosted trees (e.g., boosting with "decision stumps")
- Many tradeoffs in interpretability, score functions, etc.
73. Decision Tree Classifiers
- Task: Classification
- Representation: Decision boundaries = hierarchy of axis-parallel boundaries
- Score Function: Cross-validated error
- Search/Optimization: Greedy search in tree space
- Data Management: None specified
- Models, Parameters: Tree
74. Naïve Bayes Classifier
- Task: Classification
- Representation: Conditional independence probability model
- Score Function: Likelihood
- Search/Optimization: Closed-form probability estimates
- Data Management: None specified
- Models, Parameters: Conditional probability tables
75. Logistic Regression
- Task: Classification
- Representation: Log-odds(C) = linear function of the X's
- Score Function: Log-likelihood
- Search/Optimization: Iterative (Newton) method
- Data Management: None specified
- Models, Parameters: Logistic weights
76. Nearest Neighbor Classifier
- Task: Classification
- Representation: Memory-based
- Score Function: Cross-validated error (for selecting k)
- Search/Optimization: None
- Data Management: None specified
- Models, Parameters: None
77. Support Vector Machines
- Task: Classification
- Representation: Hyperplanes
- Score Function: "Margin"
- Search/Optimization: Convex optimization (quadratic programming)
- Data Management: None specified
- Models, Parameters: None
78. Neural Networks
- Task: Regression
- Representation: Y = nonlinear function of the X's
- Score Function: Least-squares
- Search/Optimization: Gradient descent
- Data Management: None specified
- Models, Parameters: Network weights
79. Multivariate Linear Regression
- Task: Regression
- Representation: Y = weighted linear sum of the X's
- Score Function: Least-squares
- Search/Optimization: Linear algebra
- Data Management: None specified
- Models, Parameters: Regression coefficients
80. Autoregressive Time Series Models
- Task: Time series regression
- Representation: X_t = weighted linear sum of earlier X's
- Score Function: Least-squares
- Search/Optimization: Linear algebra
- Data Management: None specified
- Models, Parameters: Regression coefficients
81. Software for Predictive Modeling
- Research software implementations
- Many very good implementations of algorithms available on the Web from researchers
- E.g., SVMLight by Thorsten Joachims
- Weka
- Free package, useful for classification and regression
- MATLAB
- Many free toolboxes on the Web for regression and prediction
- e.g., see http://lib.stat.cmu.edu/matlab/ and in particular the CompStats toolbox
- R
- General-purpose statistical computing environment (successor to S)
- Free (!)
- Widely used by statisticians; has a huge library of functions and visualization tools
- Commercial tools
- SAS, other statistical packages
82. Additional Reading
- Chapters 10 and 11 in the text
- Suggested background reading for further information:
- Review paper by Greg Ridgeway on the class Web site
- Thorough and informative
- The Elements of Statistical Learning, T. Hastie, R. Tibshirani, and J. Friedman, Springer-Verlag, 2001
- Learning with Kernels, B. Schölkopf and A. Smola, MIT Press, 2002
- Classification and Regression Trees, Breiman, Friedman, Olshen, and Stone, Wadsworth Press, 1984
83. Backup Slides (not used)
84. Decision Tree Pseudocode

node = tree_design(Data = {X, C})
  for i = 1 to d
    quality_variable(i) = quality_score(X_i, C)
  end
  node = {X_split, Threshold} for max(quality_variable)
  {Data_right, Data_left} = split(Data, X_split, Threshold)
  if node == leaf?
    return(node)
  else
    node_right = tree_design(Data_right)
    node_left = tree_design(Data_left)
  end
end
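A runnable Python rendering of this pseudocode might look like the following. This is a sketch under simplifying assumptions not stated on the slide: real-valued features, 0/1 labels, Gini split quality, and growth stopped at pure nodes.

# Sketch: greedy binary tree construction matching the pseudocode above.
def gini(labels):
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)      # fraction of class 1
    return 2 * p * (1 - p)

def tree_design(X, C):
    if len(set(C)) <= 1:               # pure node -> leaf
        return {"leaf": True, "class": C[0]}
    best = None
    for j in range(len(X[0])):         # quality of each variable/threshold pair
        for t in sorted(set(row[j] for row in X)):
            left  = [c for row, c in zip(X, C) if row[j] <= t]
            right = [c for row, c in zip(X, C) if row[j] > t]
            q = len(left) * gini(left) + len(right) * gini(right)
            if left and right and (best is None or q < best[0]):
                best = (q, j, t)
    if best is None:                   # no useful split -> majority-class leaf
        return {"leaf": True, "class": max(set(C), key=C.count)}
    _, j, t = best
    L = [(row, c) for row, c in zip(X, C) if row[j] <= t]
    R = [(row, c) for row, c in zip(X, C) if row[j] > t]
    return {"leaf": False, "split": (j, t),
            "left":  tree_design([r for r, _ in L], [c for _, c in L]),
            "right": tree_design([r for r, _ in R], [c for _, c in R])}

print(tree_design([[1, 5], [2, 4], [8, 1], [9, 2]], [0, 0, 1, 1]))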
85. Computational Complexity for a Binary Tree
- At the root node, for each of the p variables:
- Sort all values and compute the quality of each split
- O(p N log N) time for real-valued or ordinal variables
- Subsequent internal node operations each take O(N log N)
- e.g., a balanced tree of depth K requires ... (Homework 2 problem)
- This assumes the data are in main memory
- If the data are on disk then repeated access of subsets at different nodes may be very slow (impossible to pre-index)
- Note: the time difference between retrieving data in RAM and data on disk may be O(10^3) or more
86. Splitting on a Nominal Attribute
- Nominal attribute with m values
- e.g., the name of a state or a city in marketing data
- 2^(m-1) possible subsets ⇒ exhaustive search is O(2^(m-1))
- For small m, a simple approach is to branch on specific values
- But for large m this may not work well
- Neat trick for the 2-class problem:
- For each predictor value, calculate the proportion of class 1's
- Order the m values according to these proportions
- Now treat as an ordinal variable and select the best split (linear in m)
- This gives the optimal split for the Gini index, among all possible 2^(m-1) splits (Breiman et al., 1984)
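A sketch of this ordering trick for the 2-class case (the state values and labels are toy data):

# Sketch: order nominal values by proportion of class 1's, then scan
# the m-1 ordinal split points instead of all 2^(m-1) subsets.
from collections import defaultdict

values = ["CA", "NV", "OR", "CA", "NV", "WA", "WA", "OR"]
labels = [1,    0,    1,    1,    0,    0,    1,    1]

counts = defaultdict(lambda: [0, 0])        # value -> [n_class1, n_total]
for v, c in zip(values, labels):
    counts[v][0] += c
    counts[v][1] += 1

# order values by proportion of class 1's, then treat as ordinal
ordered = sorted(counts, key=lambda v: counts[v][0] / counts[v][1])
print(ordered)
# candidate binary splits: ordered[:i] vs ordered[i:] for i = 1..m-1
for i in range(1, len(ordered)):
    print(set(ordered[:i]), "vs", set(ordered[i:]))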