David Newman, UC Irvine. Lecture 5: Classification 1 (transcript)
1
CS 277 Data Mining
Lecture 5: Classification (cont.)
  • David Newman
  • Department of Computer Science
  • University of California, Irvine

2
Notices
  • Project proposal due next Tuesday (Oct 16)
  • Homework 2 (text classification) available Tues

3
Homework 1 comments
  • Overall, good
  • Please, no handwritten work
  • Graphs/plots/charts
    • ALWAYS label the x and y axes
    • include a title
    • choose an appropriate type: histogram or line plot?
  • Use precise, formal language
    • don't be chatty or informal
    • less is more
  • Complexity analysis
    • define variables
    • don't use constants
    • state assumptions
  • Solutions/comments in the hw1 web directory: hw1.solutions.txt
  • let's review

4
Today
  • Lecture: Classification
  • Link: CAIDA
    • www.caida.org
    • www.caida.org/tools/visualization/walrus/gallery1/

5
Nearest Neighbor Classifiers
  • kNN: select the k nearest neighbors to x from the
    training data and predict the majority class among
    these neighbors
  • k is a parameter
  • Small k: noisier estimates; large k: smoother
    estimates
  • Best value of k is often chosen by cross-validation
  • → pseudo-code on whiteboard (a runnable sketch follows below)
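Since the pseudo-code lives on the whiteboard, here is a minimal NumPy sketch of the same procedure (function and variable names are mine, not from the lecture):

    import numpy as np
    from collections import Counter

    def knn_predict(X_train, y_train, x, k):
        """Predict the majority class among the k training points nearest to x."""
        dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to x
        nearest = np.argsort(dists)[:k]               # indices of the k closest
        return Counter(y_train[nearest]).most_common(1)[0][0]

    # toy usage: two well-separated clusters
    X = np.array([[0., 0.], [0., 1.], [1., 0.], [5., 5.], [5., 6.], [6., 5.]])
    y = np.array([0, 0, 0, 1, 1, 1])
    print(knn_predict(X, y, np.array([4.5, 5.0]), k=3))   # -> 1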

6
Train and test data
[Figure: document-word matrix W (words), with a class-label column, split row-wise into Dtrain and Dtest]
7
GOLDEN RULE FOR PREDICTION
  • NEVER LET YOUR MODEL SEE YOUR TEST DATA

8
Train, hold-out and test data
[Figure: the same matrix W, with a class-label column, split row-wise into Dtrain, Dholdout, and Dtest]
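A small sketch of carving data into these three sets (the 60/20/20 fractions are illustrative, not from the slides):

    import numpy as np

    def train_holdout_test_split(X, y, f_train=0.6, f_holdout=0.2, seed=0):
        """Shuffle once, then partition the rows into Dtrain, Dholdout, Dtest."""
        idx = np.random.default_rng(seed).permutation(len(X))
        n_tr = int(f_train * len(X))
        n_ho = int(f_holdout * len(X))
        tr, ho, te = idx[:n_tr], idx[n_tr:n_tr + n_ho], idx[n_tr + n_ho:]
        return (X[tr], y[tr]), (X[ho], y[ho]), (X[te], y[te])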
9
Decision Tree Classifiers
  • Widely used in practice
  • Can handle both real-valued and nominal inputs
    (unusual)
  • Good with high-dimensional data
  • Similar algorithms to those used in constructing
    regression trees
  • Historically, developed in both statistics and
    computer science
    • Statistics: Breiman, Friedman, Olshen and Stone, CART, 1984
    • Computer science: Quinlan, ID3, C4.5 (1980s-1990s)
  • Try it out in Weka (its implementation of C4.5 is
    called J48); a Python analogue is sketched below
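If you would rather try this in Python than in Weka, scikit-learn's DecisionTreeClassifier is a CART-style analogue of J48 (a different algorithm family than C4.5, so results will differ); a minimal sketch on a built-in dataset:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # criterion="entropy" scores splits by information gain, as in ID3/C4.5
    clf = DecisionTreeClassifier(criterion="entropy").fit(X_tr, y_tr)
    print("test accuracy:", clf.score(X_te, y_te))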

10
Decision Tree Example
[Figure: two classes of training examples plotted by Income (x-axis) and Debt (y-axis)]
11
Decision Tree Example
[Figure: first split, Income > t1, shown as a vertical boundary at t1; the remaining region (??) is not yet classified]
12
Decision Tree Example
[Figure: second split, Debt > t2, added as a horizontal boundary at t2; one region (??) remains unclassified]
13
Decision Tree Example
[Figure: third split, Income > t3, added as a vertical boundary at t3; every region now has a class]
14
Decision Tree Example
[Figure: the final partition of the Income-Debt plane by the thresholds t1, t2, t3]
Note: tree boundaries are piecewise linear and
axis-parallel
15
Decision Tree Example (2)
16
Decision tree example (cont.)
17
Decision tree example (cont.)
Highest information gain. Creates a pure node.
18
Decision tree example (cont.)
Lowest information gain. All child nodes have
near-equal yes/no proportions.
19
Decision Tree Pseudocode
node = tree-design(Data X, C)
    for i = 1 to d
        quality_variable(i) = quality_score(X_i, C)
    end
    node = (X_split, threshold) for the max quality_variable
    (Data_right, Data_left) = split(Data, X_split, threshold)
    if node is a leaf
        return(node)
    else
        node_right = tree-design(Data_right)
        node_left = tree-design(Data_left)
    end
end
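A minimal runnable Python rendering of this recursion, assuming real-valued features and non-negative integer class labels; helper names mirror the pseudocode, and information gain stands in for the unspecified quality_score:

    import numpy as np

    def entropy(C):
        """Shannon entropy of a vector of class labels."""
        _, counts = np.unique(C, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def quality_score(x, C, t):
        """Information gain from splitting feature values x at threshold t."""
        left, right = C[x <= t], C[x > t]
        if len(left) == 0 or len(right) == 0:
            return 0.0
        w = len(left) / len(C)
        return entropy(C) - w * entropy(left) - (1 - w) * entropy(right)

    def tree_design(X, C, min_size=5):
        """Recursive tree construction following the pseudocode above."""
        # leaf test: pure node, or too few examples -> majority-class leaf
        if len(np.unique(C)) == 1 or len(C) < min_size:
            return {"leaf": True, "label": np.bincount(C).argmax()}
        best_gain, X_split, threshold = 0.0, None, None
        for i in range(X.shape[1]):                    # for i = 1 to d
            for t in np.unique(X[:, i])[:-1]:          # candidate thresholds
                g = quality_score(X[:, i], C, t)
                if g > best_gain:
                    best_gain, X_split, threshold = g, i, t
        if X_split is None:                            # no split improves purity
            return {"leaf": True, "label": np.bincount(C).argmax()}
        mask = X[:, X_split] <= threshold
        return {"leaf": False, "var": X_split, "t": threshold,
                "left": tree_design(X[mask], C[mask], min_size),
                "right": tree_design(X[~mask], C[~mask], min_size)}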
20
Decision Trees are not stable
Moving just one example slightly may lead to
quite different trees and space partitions! Trees
lack stability against small perturbations of the data.
Figure from Duda, Hart & Stork, Chap. 8
21
How to Choose the Right-Sized Tree?
[Figure: predictive error vs. size of decision tree; error on training data decreases steadily while error on test data falls and then rises, and the ideal range for tree size lies around the test-error minimum]
22
Choosing a Good Tree for Prediction
  • General idea:
    • grow a large tree
    • prune it back to create a family of subtrees
      (weakest-link pruning)
    • score the subtrees and pick the best one
  • Massive data sizes (e.g., n = 100k data points)
    • use the training data set to fit a set of trees
    • use a validation data set to score the subtrees
  • Smaller data sizes (e.g., n = 1k or less)
    • use cross-validation
    • use explicit penalty terms (e.g., Bayesian methods)
  • A scikit-learn sketch of this recipe follows below
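scikit-learn packages this grow-then-prune recipe as cost-complexity (weakest-link) pruning; a sketch that scores the subtree family on a validation set (dataset choice is illustrative):

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

    # the pruning path enumerates the family of subtrees of the large tree
    path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)

    # score each subtree on the validation set and keep the best one
    scores = [DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_tr, y_tr)
              .score(X_val, y_val) for a in path.ccp_alphas]
    best = path.ccp_alphas[int(np.argmax(scores))]
    print("best ccp_alpha:", best, " validation accuracy:", max(scores))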

23
Example Spam Email Classification
  • Data Set (from the UCI Machine Learning Archive)
    • 4601 email messages from 1999
    • manually labeled as spam (60%), non-spam (40%)
    • 54 features: percentage of words matching a
      specific word/character (NOT BAG OF WORDS)
      • business, address, internet, free, george, !, $, etc.
    • average/longest/sum lengths of uninterrupted
      sequences of CAPS
  • Error Rates (Hastie, Tibshirani, Friedman, 2001)
    • training: 3056 emails; testing: 1536 emails
    • decision tree: 8.7%
    • logistic regression: 7.6%
    • naïve Bayes: 10% (typically)

26
Treating Missing Data in Trees
  • Missing values are common in practice
  • Approaches to handling missing values:
  • During training
    • ignore rows with missing values (inefficient)
  • During testing
    • send the example being classified down both
      branches and average the predictions
    • replace missing values with an imputed value
      (can be suboptimal)
  • Other approaches
    • treat missing as a unique value (useful if
      missing values are correlated with the class)
    • surrogate splits method: search for and store
      surrogate variables/splits during training
  • Two of these options are sketched below
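A small pandas sketch of two of the listed options (column names and values are invented for illustration):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"income": [40.0, np.nan, 75.0, 52.0],
                       "city": ["SF", None, "LA", "SF"]})

    # imputation: replace missing values with the column median (can be suboptimal)
    df["income"] = df["income"].fillna(df["income"].median())

    # treat missing as its own category (useful if missingness correlates with the class)
    df["city"] = df["city"].fillna("MISSING")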

27
Other Issues with Classification Trees
  • Why use binary splits (for real-valued data)?
    • multiway splits can be used, but cause fragmentation
  • Linear combination splits?
    • can produce small improvements
    • optimization is much more difficult (need weights
      and a split point)
    • trees become much less interpretable
  • Model instability
    • a small change in the data can lead to a
      completely different tree
    • model averaging techniques (like bagging) can be
      useful
  • Tree bias
    • poor at approximating non-axis-parallel boundaries
  • Producing rule sets from tree models (e.g., C5.0)

28
Why Trees are widely used in Practice
  • Can handle high-dimensional data
    • builds a model using one dimension at a time
  • Can handle any type of input variables
    • categorical, real-valued, etc.
    • most other methods require data of a single type
      (e.g., only real-valued)
  • Trees are (somewhat) interpretable
    • a domain expert can read off the tree's logic
  • Tree algorithms are relatively easy to code and test

29
Limitations of Trees
  • Representational bias
    • classification: piecewise linear boundaries,
      parallel to axes
    • regression: piecewise constant surfaces
  • High variance
    • trees can be unstable as a function of the sample
    • e.g., a small change in the data -> a completely
      different tree
    • causes two problems:
      1. High variance contributes to prediction error
      2. High variance reduces interpretability
    • trees are good candidates for model combining
    • often used with boosting and bagging
  • Trees do not scale well to massive data sets
    (e.g., N in millions)
    • repeated random access of subsets of the data

30
Evaluating Classification Results
  • Summary statistics
    • empirical estimate of the score function on test
      data: error rate, accuracy, etc.
  • More detailed breakdown
    • confusion matrix
    • can be quite useful for detecting systematic errors
  • Detection vs. false-alarm plots (2 classes)
    • binary classifier with a real-valued output for
      each example, where higher means more likely to
      be class 1
    • for each possible threshold, calculate:
      • detection rate: fraction of class 1 detected
      • false-alarm rate: fraction of class 2 detected
    • plot y (detection rate) versus x (false-alarm
      rate); a sketch follows below
    • also known as an ROC curve; closely related to
      precision-recall and specificity/sensitivity plots
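A minimal NumPy sketch of the threshold sweep described above; it traces the same curve that sklearn.metrics.roc_curve would produce:

    import numpy as np

    def detection_false_alarm(scores, labels):
        """One (false-alarm rate, detection rate) point per threshold.

        scores: real-valued classifier outputs, higher = more likely class 1
        labels: 1 for class 1, 0 for class 2
        """
        order = np.argsort(-scores)            # sweep thresholds from high to low
        labels = labels[order]
        detection = np.cumsum(labels == 1) / max((labels == 1).sum(), 1)
        false_alarm = np.cumsum(labels == 0) / max((labels == 0).sum(), 1)
        return false_alarm, detection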

31
Naïve Bayes Text Classification
  • K classes c_1, ..., c_K
  • Class-conditional probabilities: p(d | c_k) =
    probability of document d given class c_k
    • p(d | c_k) = Π_i p(w_i | c_k)
  • Posterior class probabilities (by Bayes' rule):
    p(c_k | d) ∝ p(d | c_k) p(c_k)

32
Naïve Bayes Text Classification
  • Multivariate Bernoulli / binary model: x_i ∈ {0, 1}
    indicates whether word w_i occurs in d
  • Multinomial model: x_i counts occurrences of w_i in d
    (a scoring sketch follows below)
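A sketch of scoring a document under the multinomial model (variable names and the smoothing suggestion are my own; x is the document's word-count vector):

    import numpy as np

    def multinomial_nb_log_posterior(x, log_prior, log_phi):
        """Unnormalized log p(c_k | d) for every class k.

        x         : word-count vector for document d, shape (V,)
        log_prior : log p(c_k), shape (K,)
        log_phi   : log p(w_i | c_k), shape (K, V), e.g. from smoothed counts
        """
        # log p(c_k | d) = log p(c_k) + sum_i x_i log p(w_i | c_k) + const
        return log_prior + log_phi @ x

For the Bernoulli model, x would instead be a 0/1 occurrence vector, and absent words would also contribute log(1 - p(w_i | c_k)) terms to the sum.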

33
Naïve Bayes Text Classification
  • Next: Estimating φ and θ