Title: David Newman, UC Irvine Lecture 5: Classification 1
1 CS 277 Data Mining: Lecture 5, Classification (cont.)
- David Newman
- Department of Computer Science
- University of California, Irvine
2 Notices
- Project proposal due next Tuesday (Oct 16)
- Homework 2 (text classification) available Tues
3 Homework 1 Comments
- Overall, good
- Please, no handwritten work
- Graphs/plots/charts
  - ALWAYS label the x and y axes
  - include a title
  - choose an appropriate plot type (histogram or line plot?)
- Use precise, formal language
  - don't be chatty or informal
  - less is more
- Complexity analysis
  - define variables
  - don't use constants
  - state assumptions
- Solutions/comments in the hw1 web directory (hw1.solutions.txt) - let's review
4 Today
- Lecture: Classification
- Link: CAIDA
- www.caida.org
- www.caida.org/tools/visualization/walrus/gallery1/
5 Nearest Neighbor Classifiers
- kNN: select the k nearest neighbors to x from the training data and take the majority class among these neighbors
- k is a parameter
  - small k: noisier estimates; large k: smoother estimates
  - best value of k often chosen by cross-validation
- pseudo-code on whiteboard (see the sketch below)
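A minimal sketch of the kNN rule above, in Python (not from the lecture; NumPy arrays and Euclidean distance are assumptions):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=5):
    """Classify a single point x by majority vote among its k nearest training points."""
    # Euclidean distance from x to every training example
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    # indices of the k closest training examples
    nearest = np.argsort(dists)[:k]
    # majority class among those neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]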
6 Train and test data
[Figure: data matrix W (words) with a class label column, split by rows into Dtrain and Dtest]
7 GOLDEN RULE FOR PREDICTION
- NEVER LET YOUR MODEL SEE YOUR TEST DATA
8 Train, hold-out and test data
[Figure: the same data matrix, now split by rows into Dtrain, Dholdout and Dtest]
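A minimal sketch of such a three-way split (not from the slides; the 60/20/20 proportions are an assumption):

import numpy as np

def train_holdout_test_split(X, y, frac_train=0.6, frac_holdout=0.2, seed=0):
    """Randomly split (X, y) by rows into train, hold-out and test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train = int(frac_train * len(X))
    n_holdout = int(frac_holdout * len(X))
    i_train = idx[:n_train]
    i_hold = idx[n_train:n_train + n_holdout]
    i_test = idx[n_train + n_holdout:]   # the model never sees these rows
    return (X[i_train], y[i_train]), (X[i_hold], y[i_hold]), (X[i_test], y[i_test])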
9 Decision Tree Classifiers
- Widely used in practice
- Can handle both real-valued and nominal inputs (unusual)
- Good with high-dimensional data
- Similar algorithms as used in constructing regression trees
- Historically, developed both in statistics and computer science
  - Statistics: Breiman, Friedman, Olshen and Stone, CART, 1984
  - Computer science: Quinlan, ID3, C4.5 (1980s-1990s)
- Try it out in Weka (implementation of C4.5 called J48)
10 Decision Tree Example
11 Decision Tree Example
12 Decision Tree Example
13 Decision Tree Example
14 Decision Tree Example
[Figures, slides 10-14: a two-dimensional feature space (Income vs. Debt) is recursively partitioned by the splits Income > t1, Debt > t2, and Income > t3, with the corresponding decision tree built up alongside the partition]
Note: tree decision boundaries are piecewise linear and axis-parallel
15 Decision Tree Example (2)
16 Decision tree example (cont.)
17 Decision tree example (cont.)
- Highest information gain: creates a pure node.
18 Decision tree example (cont.)
- Lowest information gain: all child nodes have near-equal yes/no proportions.
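A minimal sketch of how the information gain of a candidate split could be computed (not from the slides; binary 0/1 labels and the boolean-mask interface are assumptions):

import numpy as np

def entropy(y):
    """Entropy of a binary label vector y (values 0/1)."""
    p = y.mean()
    if p in (0.0, 1.0):          # pure node
        return 0.0
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def information_gain(y, mask):
    """Gain of splitting labels y by the boolean mask (True = one child, False = the other)."""
    n = len(y)
    left, right = y[mask], y[~mask]
    child_entropy = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(y) - child_entropy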
19Decision Tree Pseudocode
node tree-design (Data X,C) for i 1 to
d quality_variable(i) quality_score(Xi,
C) end node X_split, Threshold for
maxquality_variable Data_right, Data_left
split(Data, X_split, threshold) if node
leaf? return(node) else node_right
tree-design(Data_right) node_left
tree-design(Data_left) end end
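A runnable Python sketch of the same recursion (not the lecture's code; the entropy-based quality score, stopping rule, and dictionary tree representation are assumptions):

import numpy as np

def entropy(y):
    """Entropy of a 0/1 label vector."""
    p = y.mean()
    return 0.0 if p in (0.0, 1.0) else -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def tree_design(X, y, depth=0, max_depth=3, min_leaf=5):
    """Recursively grow a binary classification tree over real-valued features."""
    # make a leaf if the node is pure, too small, or too deep
    if len(set(y)) == 1 or len(y) < 2 * min_leaf or depth == max_depth:
        return {"leaf": True, "label": int(round(y.mean()))}
    best = None
    for i in range(X.shape[1]):                    # score each variable/threshold pair
        for t in np.unique(X[:, i])[:-1]:          # drop the max so both children are non-empty
            right = X[:, i] > t
            gain = entropy(y) - (right.mean() * entropy(y[right])
                                 + (~right).mean() * entropy(y[~right]))
            if best is None or gain > best[0]:
                best = (gain, i, t, right)
    if best is None:                               # no usable split found
        return {"leaf": True, "label": int(round(y.mean()))}
    _, i, t, right = best
    return {"leaf": False, "var": i, "threshold": t,
            "right": tree_design(X[right], y[right], depth + 1, max_depth, min_leaf),
            "left": tree_design(X[~right], y[~right], depth + 1, max_depth, min_leaf)}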
20 Decision Trees are not stable
- Moving just one example slightly may lead to quite different trees and space partitions!
- Lack of stability against small perturbations of the data.
- Figure from Duda, Hart & Stork, Chap. 8
21 How to Choose the Right-Sized Tree?
[Figure: predictive error vs. size of decision tree; error on training data keeps decreasing while error on test data falls and then rises, with an ideal range for tree size in between]
22 Choosing a Good Tree for Prediction
- General idea
  - grow a large tree
  - prune it back to create a family of subtrees
    - weakest link pruning
  - score the subtrees and pick the best one
- Massive data sizes (e.g., n = 100k data points)
  - use a training data set to fit a set of trees
  - use a validation data set to score the subtrees
- Smaller data sizes (e.g., n = 1k or less)
  - use cross-validation
  - use explicit penalty terms (e.g., Bayesian methods)
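A minimal sketch of this grow-then-prune idea using scikit-learn's cost-complexity (weakest-link) pruning (not the lecture's tooling; the train/validation arrays are placeholders):

from sklearn.tree import DecisionTreeClassifier

def best_pruned_tree(X_train, y_train, X_val, y_val):
    """Grow a large tree, then pick the cost-complexity pruned subtree
    that scores best on a held-out validation set."""
    full_tree = DecisionTreeClassifier(random_state=0)
    # candidate pruning strengths; each alpha corresponds to one subtree
    path = full_tree.cost_complexity_pruning_path(X_train, y_train)
    best_alpha, best_score = 0.0, -1.0
    for alpha in path.ccp_alphas:
        tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
        tree.fit(X_train, y_train)
        score = tree.score(X_val, y_val)   # validation accuracy
        if score > best_score:
            best_alpha, best_score = alpha, score
    return DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X_train, y_train)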
23 Example: Spam Email Classification
- Data Set (from the UCI Machine Learning Archive)
  - 4601 email messages from 1999
  - Manually labeled as spam (60%), non-spam (40%)
  - 54 features: percentage of words matching a specific word/character (NOT BAG OF WORDS)
    - business, address, internet, free, george, !, $, etc.
  - Average/longest/sum lengths of uninterrupted sequences of CAPS
- Error Rates (Hastie, Tibshirani, Friedman, 2001)
  - Training: 3056 emails, Testing: 1536 emails
  - Decision tree: 8.7%
  - Logistic regression: 7.6%
  - Naïve Bayes: ~10% (typically)
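A sketch of running this kind of comparison on the UCI spambase data (not from the lecture; the file path, column layout, and random split are assumptions, so error rates will differ from the slide):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

# spambase.data: 57 feature columns followed by a 0/1 spam label (path is an assumption)
data = np.loadtxt("spambase.data", delimiter=",")
X, y = data[:, :-1], data[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1536, random_state=0)

for name, model in [("decision tree", DecisionTreeClassifier(random_state=0)),
                    ("logistic regression", LogisticRegression(max_iter=5000))]:
    model.fit(X_train, y_train)
    err = 1.0 - model.score(X_test, y_test)   # test error rate
    print(f"{name}: {100 * err:.1f}% test error")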
26 Treating Missing Data in Trees
- Missing values are common in practice
- Approaches to handling missing values
  - During training
    - Ignore rows with missing values (inefficient)
  - During testing
    - Send the example being classified down both branches and average the predictions
    - Replace missing values with an imputed value (can be suboptimal)
  - Other approaches
    - Treat missing as a unique value (useful if missing values are correlated with the class)
    - Surrogate splits method
      - Search for and store surrogate variables/splits during training
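A minimal sketch of the imputation approach listed above, using scikit-learn (a sketch only; column-mean imputation and the toy data are assumptions, not the lecture's recommendation):

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier

# toy training matrix with missing entries marked as NaN
X_train = np.array([[1.0, np.nan], [2.0, 3.0], [np.nan, 4.0], [5.0, 6.0]])
y_train = np.array([0, 0, 1, 1])

imputer = SimpleImputer(strategy="mean")   # replace NaN with the column mean
X_filled = imputer.fit_transform(X_train)

tree = DecisionTreeClassifier().fit(X_filled, y_train)
# at test time, apply the same imputer before classifying
print(tree.predict(imputer.transform([[np.nan, 5.0]])))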
27 Other Issues with Classification Trees
- Why use binary splits (for real-valued data)?
  - Multiway splits can be used, but cause fragmentation
- Linear combination splits?
  - can produce small improvements
  - optimization is much more difficult (need weights and split point)
  - trees are much less interpretable
- Model instability
  - A small change in the data can lead to a completely different tree
  - Model averaging techniques (like bagging) can be useful
- Tree bias
  - Poor at approximating non-axis-parallel boundaries
- Producing rule sets from tree models (e.g., C5.0)
28 Why Trees are widely used in Practice
- Can handle high-dimensional data
  - builds a model using one dimension at a time
- Can handle any type of input variables
  - categorical, real-valued, etc.
  - most other methods require data of a single type (e.g., only real-valued)
- Trees are (somewhat) interpretable
  - a domain expert can read off the tree's logic
- Tree algorithms are relatively easy to code and test
29 Limitations of Trees
- Representational Bias
  - classification: piecewise linear boundaries, parallel to axes
  - regression: piecewise constant surfaces
- High Variance
  - trees can be unstable as a function of the sample
    - e.g., a small change in the data -> a completely different tree
  - causes two problems
    - 1. High variance contributes to prediction error
    - 2. High variance reduces interpretability
  - Trees are good candidates for model combining
    - Often used with boosting and bagging
- Trees do not scale well to massive data sets (e.g., N in the millions)
  - requires repeated random access to subsets of the data
30 Evaluating Classification Results
- Summary statistics
  - Empirical estimate of the score function on test data: error rate, accuracy, etc.
- More detailed breakdown
  - Confusion matrix
  - Can be quite useful in detecting systematic errors
- Detection vs. false-alarm plots (2 classes)
  - Binary classifier with a real-valued output for each example, where higher means more likely to be class 1
  - For each possible threshold, calculate
    - Detection rate: fraction of class 1 detected
    - False alarm rate: fraction of class 2 detected
  - Plot y (detection rate) versus x (false alarm rate)
  - Also known as ROC, precision-recall, specificity/sensitivity
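A minimal sketch of building such a detection vs. false-alarm (ROC) curve from real-valued scores (not from the slides; the score/label arrays and the 0/1 label encoding are assumptions):

import numpy as np

def roc_points(scores, labels):
    """Detection rate (fraction of class 1 found) vs. false alarm rate
    (fraction of class 2 flagged) at every possible score threshold."""
    points = []
    for t in np.unique(scores):
        predicted_1 = scores >= t
        detection = (predicted_1 & (labels == 1)).sum() / (labels == 1).sum()
        # class 2 is encoded as label 0 here (an assumption)
        false_alarm = (predicted_1 & (labels == 0)).sum() / (labels == 0).sum()
        points.append((false_alarm, detection))   # (x, y) for the ROC plot
    return points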
31 Naïve Bayes Text Classification
- K classes c1, ..., cK
- Class-conditional probabilities
  - p(d | ck): probability of document d given class ck
  - Naïve Bayes assumption: p(d | ck) = Πi p(wi | ck)
- Posterior class probabilities (by Bayes rule)
  - p(ck | d) ∝ p(d | ck) p(ck)
32 Naïve Bayes Text Classification
- Multivariate Bernoulli / Binary model
- Multinomial model
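A minimal sketch of the multinomial model's posterior computation in log space (not the lecture's code; the word-count document representation and the toy numbers are assumptions):

import numpy as np

def multinomial_nb_log_posterior(x_counts, class_priors, word_probs):
    """Unnormalized log p(ck | d) = log p(ck) + sum_i n_i(d) * log p(wi | ck)
    for a document given by its word-count vector x_counts.
    class_priors: length-K array of p(ck); word_probs: K x V array of p(wi | ck)."""
    return np.log(class_priors) + x_counts @ np.log(word_probs).T

# toy example: 2 classes, vocabulary of 3 words
priors = np.array([0.5, 0.5])
word_probs = np.array([[0.6, 0.3, 0.1],
                       [0.2, 0.2, 0.6]])
doc = np.array([3, 1, 0])   # word counts in the document
print(multinomial_nb_log_posterior(doc, priors, word_probs).argmax())  # predicted class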
33 Naïve Bayes Text Classification