Title: Data Mining Classification: Alternative Techniques
1Data Mining Classification Alternative
Techniques
- Lecture Notes for Chapter 5
- Introduction to Data Mining
- by
- Tan, Steinbach, Kumar
2Data Mining Classification Alternative
Techniques
3Rule-Based Classifier
- Classify records by using a collection of if…then rules
- Rule: (Condition) → y
- where
- Condition is a conjunction of attribute tests
- y is the class label
- LHS: rule antecedent or condition
- RHS: rule consequent
- Examples of classification rules
- (Blood Type=Warm) ∧ (Lay Eggs=Yes) → Birds
- (Taxable Income < 50K) ∧ (Refund=Yes) → Evade=No
4Rule-based Classifier (Example)
5Application of Rule-Based Classifier
- A rule r covers an instance x if the attributes of the instance satisfy the condition of the rule
- The rule r1 covers a hawk → Bird
- The rule r3 covers the grizzly bear → Mammal
6Rule Coverage and Accuracy
- Coverage of a rule
- Fraction of records that satisfy the antecedent of the rule
- Accuracy of a rule
- Fraction of records satisfying the antecedent that also satisfy the consequent of the rule
- Example: (Status=Single) → No has Coverage = 40%, Accuracy = 50%
- (a code sketch of this computation follows below)
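A minimal Python sketch of the coverage/accuracy computation above; the record format (dicts with a "class" key) and the five example records are illustrative assumptions, not the lecture's actual table:

    def coverage_and_accuracy(records, condition, consequent):
        covered = [r for r in records if condition(r)]            # satisfy the antecedent
        correct = [r for r in covered if r["class"] == consequent]  # also satisfy the consequent
        coverage = len(covered) / len(records)
        accuracy = len(correct) / len(covered) if covered else 0.0
        return coverage, accuracy

    records = [
        {"Status": "Single", "class": "No"},
        {"Status": "Single", "class": "Yes"},
        {"Status": "Married", "class": "No"},
        {"Status": "Divorced", "class": "No"},
        {"Status": "Single", "class": "No"},
    ]
    # Rule: (Status = Single) -> No
    cov, acc = coverage_and_accuracy(records, lambda r: r["Status"] == "Single", "No")
    print(cov, acc)   # 0.6, 0.666...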
7How does Rule-based Classifier Work?
- A lemur triggers rule r3, so it is classified as a mammal
- A turtle triggers both r4 and r5
- A dogfish shark triggers none of the rules
8Characteristics of Rule-Based Classifier
- Mutually exclusive rules
- Classifier contains mutually exclusive rules if the rules are independent of each other
- Every record is covered by at most one rule
- Exhaustive rules
- Classifier has exhaustive coverage if it accounts for every possible combination of attribute values
- Each record is covered by at least one rule
9From Decision Trees To Rules
- Rules are mutually exclusive and exhaustive
- Rule set contains as much information as the tree
10Rules Can Be Simplified
Initial rule: (Refund=No) ∧ (Status=Married) → No
Simplified rule: (Status=Married) → No
11Effect of Rule Simplification
- Rules are no longer mutually exclusive
- A record may trigger more than one rule
- Solution?
- Ordered rule set
- Unordered rule set: use voting schemes
- Rules are no longer exhaustive
- A record may not trigger any rules
- Solution?
- Use a default class
12Ordered Rule Set
- Rules are rank ordered according to their priority
- An ordered rule set is known as a decision list
- When a test record is presented to the classifier
- It is assigned to the class label of the highest ranked rule it has triggered
- If none of the rules fired, it is assigned to the default class
13Rule Ordering Schemes
- Rule-based ordering
- Individual rules are ranked based on their quality/priority
- Class-based ordering
- Rules that belong to the same class appear together
14Building Classification Rules
- Direct Method
- Extract rules directly from data
- e.g. RIPPER, CN2, 1R, and AQ
- Indirect Method
- Extract rules from other classification models (e.g., decision trees, neural networks, SVM, etc.)
- e.g., C4.5rules
15Direct Method Sequential Covering
16Example of Sequential Covering
17Example of Sequential Covering
18Aspects of Sequential Covering
- Rule Growing
- Rule evaluation
- Instance Elimination
- Stopping Criterion
- Rule Pruning
19Rule Growing
20Rule Evaluation
- Evaluation metric determines which conjunct should be added during rule growing
- Accuracy
- Laplace
- M-estimate
- where n = number of instances covered by the rule, nc = number of instances of class c covered by the rule, k = number of classes, p = prior probability
- (these metrics are sketched in code below)
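A minimal Python sketch of the three metrics named above, using the symbols defined on this slide; the exact Laplace and m-estimate forms are taken from the textbook, and the sample values in the call are illustrative:

    # Accuracy, Laplace and m-estimate for rule evaluation (symbols n, nc, k, p as above)
    def rule_accuracy(n, nc):
        return nc / n

    def laplace(n, nc, k):
        return (nc + 1) / (n + k)

    def m_estimate(n, nc, k, p):
        return (nc + k * p) / (n + k)

    # a rule covering 10 instances, 8 of them from the predicted class, 2 classes:
    print(rule_accuracy(10, 8), laplace(10, 8, 2), m_estimate(10, 8, 2, p=0.5))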
21Rule Growing (Examples)
- CN2 Algorithm
- Start from an empty conjunct: {}
- Add the conjunct that minimizes the entropy measure: {A}, {A,B}, …
- Determine the rule consequent by taking the majority class of instances covered by the rule
- RIPPER Algorithm
- Start from an empty rule: {} → class
- Add the conjunct that maximizes FOIL's information gain measure
- R0: {} → class (initial rule)
- R1: {A} → class (rule after adding conjunct)
- Gain(R0, R1) = t × [ log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0)) ]
- where t = number of positive instances covered by both R0 and R1
- p0 = number of positive instances covered by R0
- n0 = number of negative instances covered by R0
- p1 = number of positive instances covered by R1
- n1 = number of negative instances covered by R1
- (a code sketch of FOIL's gain follows below)
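A small Python sketch of FOIL's information gain as defined above; the counts passed in the example call are illustrative:

    import math

    # FOIL's information gain. Because R1 specializes R0, every instance covered
    # by R1 is also covered by R0, so t = p1.
    def foil_gain(p0, n0, p1, n1):
        t = p1
        return t * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

    # R0 covers 100 positives / 400 negatives; adding conjunct A leaves 30 / 10:
    print(foil_gain(p0=100, n0=400, p1=30, n1=10))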
22Instance Elimination
- Why do we need to eliminate instances?
- Otherwise, the next rule is identical to the previous rule
- Why do we remove positive instances?
- Ensure that the next rule is different
- Why do we remove negative instances?
- Prevent underestimating the accuracy of the rule
- Compare rules R2 and R3 in the diagram
23Stopping Criterion and Rule Pruning
- Examples of stopping criteria
- If the rule does not improve significantly after adding a conjunct
- If the rule starts covering examples from another class
- Rule Pruning
- Similar to post-pruning of decision trees
- Example: reduced error pruning using a validation set
- Remove one of the conjuncts in the rule
- Compare the error rate on the validation set before and after pruning
- If the error improves, prune the conjunct
24Summary of Direct Method
- Initial rule set is empty
- Repeat
- Grow a single rule
- Remove instances covered by the rule
- Prune the rule (if necessary)
- Add the rule to the current rule set
- (a code sketch of this loop follows below)
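A minimal Python sketch of this sequential-covering loop; grow_rule, prune_rule and covers are assumed helper functions, not part of the lecture:

    def sequential_covering(records, grow_rule, prune_rule, covers, max_rules=20):
        rule_set = []                       # initial rule set is empty
        remaining = list(records)
        while remaining and len(rule_set) < max_rules:
            rule = grow_rule(remaining)                                   # grow a single rule
            remaining = [r for r in remaining if not covers(rule, r)]     # remove covered instances
            rule = prune_rule(rule)                                       # prune the rule (if necessary)
            rule_set.append(rule)                                         # add rule to the rule set
        return rule_set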
25Direct Method RIPPER
- For a 2-class problem, choose one of the classes as the positive class and the other as the negative class
- Learn the rules for the positive class
- Use the negative class as the default
- For a multi-class problem
- Order the classes according to increasing class prevalence (fraction of instances that belong to a particular class)
- Learn the rule set for the smallest class first, treating the rest as the negative class
- Repeat with the next smallest class as the positive class
26Direct Method RIPPER
- Rule growing
- Start from an empty rule: {} → class
- Add conjuncts as long as they improve FOIL's information gain
- Stop when the rule no longer covers negative examples
- Prune the rule immediately using incremental reduced error pruning
- Measure for pruning: v = (p - n) / (p + n)
- p = number of positive examples covered by the rule in the validation set
- n = number of negative examples covered by the rule in the validation set
- Pruning method: delete any final sequence of conditions that maximizes v
27Direct Method RIPPER
- Building a Rule Set
- Use the sequential covering algorithm
- Grow a rule to cover the current set of positive examples
- Eliminate both positive and negative examples covered by the rule
- Each time a rule is added to the rule set, compute the new description length
- Stop adding new rules when the new description length is d bits longer than the smallest description length obtained so far
28Indirect Methods
29Indirect Method C4.5rules
- Extract rules for every path from the root to a leaf node
- For each rule r: A → y,
- consider an alternative rule r′: A′ → y, where A′ is obtained by removing one of the conjuncts in A
- Compare the pessimistic error rate for r against all r′s
- Prune if one of the r′s has a lower pessimistic error rate
- Repeat until the pessimistic error rate can no longer be improved
30Indirect Method C4.5rules
- Use class-based ordering
- Rules that predict the same class are grouped together into the same subset
- Compute the total description length for each class
- Classes are ordered in increasing order of their total description length
31Example
C4.5rules:
(Give Birth=No, Can Fly=Yes) → Birds
(Give Birth=No, Live in Water=Yes) → Fishes
(Give Birth=Yes) → Mammals
(Give Birth=No, Can Fly=No, Live in Water=No) → Reptiles
( ) → Amphibians
32Characteristics of Rule-Based Classifiers
- As highly expressive as decision trees
- Easy to interpret
- Easy to generate
- Can classify new instances rapidly
- Performance comparable to decision trees
33Data Mining Classification Alternative
Techniques
- Instance-Based Classifiers
34Instance-Based Classifiers
- Store the training records
- Use training records to predict the class
label of unseen cases
35Instance Based Classifiers
- Examples
- Rote-learner
- Memorizes the entire training data and performs classification only if the attributes of a record match one of the training examples exactly
- Nearest neighbor
- Uses the k closest points (nearest neighbors) for performing classification
36Nearest Neighbor Classifiers
- Basic idea
- If it walks like a duck, quacks like a duck, then it's probably a duck
37Nearest-Neighbor Classifiers
- Requires three things
- The set of stored records
- A distance metric to compute the distance between records
- The value of k, the number of nearest neighbors to retrieve
- To classify an unknown record
- Compute the distance to the training records
- Identify the k nearest neighbors
- Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)
38Definition of Nearest Neighbor
The k-nearest neighbors of a record x are the data points that have the k smallest distances to x
391 nearest-neighbor
Voronoi Diagram
40Nearest Neighbor Classification
- Compute the distance between two points
- Example: Euclidean distance
- Determine the class from the nearest neighbor list
- Take the majority vote of class labels among the k-nearest neighbors
- Weigh the vote according to distance
- weight factor, w = 1/d²
- (a k-NN sketch with both voting schemes follows below)
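A minimal Python sketch of k-nearest-neighbor classification with Euclidean distance and optional distance-weighted voting (w = 1/d²); the training points are illustrative:

    import math
    from collections import Counter

    def knn_predict(train, query, k=3, weighted=False):
        # train: list of (feature_vector, label) pairs
        nearest = sorted((math.dist(x, query), y) for x, y in train)[:k]
        votes = Counter()
        for d, y in nearest:
            votes[y] += 1.0 / (d * d + 1e-12) if weighted else 1.0   # w = 1/d^2 if weighted
        return votes.most_common(1)[0][0]

    train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((3.0, 3.2), "B"), ((2.9, 3.1), "B")]
    print(knn_predict(train, (1.1, 1.0), k=3))              # "A"
    print(knn_predict(train, (2.0, 2.0), k=3, weighted=True))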
41Nearest Neighbor Classification
- Choosing the value of k
- If k is too small, sensitive to noise points
- If k is too large, neighborhood may include
points from other classes
42Nearest Neighbor Classification
- Scaling issues
- Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes
- Example
- height of a person may vary from 1.5m to 1.8m
- weight of a person may vary from 90lb to 300lb
- income of a person may vary from $10K to $1M
- (a min-max scaling sketch follows below)
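A minimal Python sketch of min-max scaling to [0, 1], one simple way to keep a large-range attribute such as income from dominating the distance; the values are illustrative:

    def min_max_scale(column):
        lo, hi = min(column), max(column)
        return [(v - lo) / (hi - lo) for v in column]

    heights = [1.5, 1.6, 1.7, 1.8]                   # metres
    incomes = [10_000, 50_000, 250_000, 1_000_000]   # dollars
    print(min_max_scale(heights))
    print(min_max_scale(incomes))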
43Nearest Neighbor Classification
- Problem with Euclidean measure
- High dimensional data
- curse of dimensionality
- Can produce counter-intuitive results, e.g. both of these pairs of binary vectors are the same Euclidean distance apart:
- 1 1 1 1 1 1 1 1 1 1 1 0  vs  0 1 1 1 1 1 1 1 1 1 1 1  →  d = 1.4142
- 1 0 0 0 0 0 0 0 0 0 0 0  vs  0 0 0 0 0 0 0 0 0 0 0 1  →  d = 1.4142
- Solution: normalize the vectors to unit length
44Nearest neighbor Classification
- k-NN classifiers are lazy learners
- They do not build models explicitly
- Unlike eager learners such as decision tree induction and rule-based systems
- Classifying unknown records is relatively expensive
45Example PEBLS
- PEBLS: Parallel Exemplar-Based Learning System (Cost & Salzberg)
- Works with both continuous and nominal features
- For nominal features, the distance between two nominal values is computed using the modified value difference metric (MVDM)
- Each record is assigned a weight factor
- Number of nearest neighbors, k = 1
46Example PEBLS
Distance between nominal attribute values:
d(Single, Married)   = |2/4 - 0/4| + |2/4 - 4/4| = 1
d(Single, Divorced)  = |2/4 - 1/2| + |2/4 - 1/2| = 0
d(Married, Divorced) = |0/4 - 1/2| + |4/4 - 1/2| = 1
d(Refund=Yes, Refund=No) = |0/3 - 3/7| + |3/3 - 4/7| = 6/7
(a code sketch of the MVDM computation follows below)
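A minimal Python sketch of the MVDM computation above; the value/class counts reproduce the figures implied by the slide's arithmetic:

    # counts[value][cls] = number of training records with that attribute value and class
    def mvdm(counts, v1, v2):
        n1, n2 = sum(counts[v1].values()), sum(counts[v2].values())
        classes = set(counts[v1]) | set(counts[v2])
        return sum(abs(counts[v1].get(c, 0) / n1 - counts[v2].get(c, 0) / n2)
                   for c in classes)

    status = {"Single":   {"Yes": 2, "No": 2},
              "Married":  {"Yes": 0, "No": 4},
              "Divorced": {"Yes": 1, "No": 1}}
    refund = {"Yes": {"Yes": 0, "No": 3},
              "No":  {"Yes": 3, "No": 4}}

    print(mvdm(status, "Single", "Married"))    # 1.0
    print(mvdm(status, "Single", "Divorced"))   # 0.0
    print(mvdm(refund, "Yes", "No"))            # 6/7 ~ 0.857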
47Example PEBLS
Distance between record X and record Y
- where
- wX ≈ 1 if X makes accurate predictions most of the time
- wX > 1 if X is not reliable for making predictions
48Data Mining Classification Alternative
Techniques
49Thomas Bayes
- English minister and mathematician (1701-1761)
- http://www-history.mcs.st-andrews.ac.uk/PictDisplay/Bayes.html
50Bayes Classifier
- A probabilistic framework for solving classification problems
- Conditional probability
- Bayes theorem
51Example of Bayes Theorem
- Given
- A doctor knows that meningitis causes a stiff neck 50% of the time
- Prior probability of any patient having meningitis is 1/50,000
- Prior probability of any patient having a stiff neck is 1/20
- If a patient has a stiff neck, what's the probability he/she has meningitis?
- (the arithmetic is worked out below)
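The arithmetic, worked out with Bayes theorem, P(M|S) = P(S|M) P(M) / P(S):

    p_s_given_m = 0.5        # P(stiff neck | meningitis)
    p_m = 1 / 50_000         # P(meningitis)
    p_s = 1 / 20             # P(stiff neck)
    print(p_s_given_m * p_m / p_s)   # P(meningitis | stiff neck) = 0.0002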
52Using Bayes Theorem for Classification
- Consider each attribute and the class label as random variables
- Given a record with attributes (X1, X2, …, Xd)
- Goal is to predict class Y
- Specifically, we want to find the value of Y that maximizes P(Y | X1, X2, …, Xd)
- Can we estimate P(Y | X1, X2, …, Xd) directly from data?
53Using Bayes Theorem for Classification
- Approach
- Compute the posterior probability P(Y | X1, X2, …, Xd) using Bayes theorem
- Maximum a-posteriori: choose the Y that maximizes P(Y | X1, X2, …, Xd)
- Equivalent to choosing the value of Y that maximizes P(X1, X2, …, Xd | Y) P(Y)
- How to estimate P(X1, X2, …, Xd | Y)?
54Naïve Bayes Classifier
- Assume independence among the attributes Xi when the class is given:
- P(X1, X2, …, Xd | Yj) = P(X1 | Yj) P(X2 | Yj) ⋯ P(Xd | Yj)
- Can estimate P(Xi | Yj) for all Xi and Yj from data
- A new point is classified as Yj if P(Yj) ∏i P(Xi | Yj) is maximal
55Conditional Independence
- X and Y are conditionally independent given Z if P(X | Y, Z) = P(X | Z)
- Example: arm length and reading skills
- A young child has shorter arm length and limited reading skills, compared to adults
- If age is fixed, there is no apparent relationship between arm length and reading skills
- Arm length and reading skills are conditionally independent given age
56Estimate Probabilities from Data
- Class prior: P(Y) = Nc / N
- e.g., P(No) = 7/10, P(Yes) = 3/10
- For discrete attributes: P(Xi | Yk) = |Xik| / Nc
- where |Xik| is the number of instances having attribute value Xi and belonging to class Yk
- Examples:
- P(Status=Married | No) = 4/7, P(Refund=Yes | Yes) = 0
57Estimate Probabilities from Data
- For continuous attributes
- Discretization: partition the range into bins
- Replace the continuous value with the bin value
- Attribute changed from continuous to ordinal
- Probability density estimation
- Assume the attribute follows a normal distribution
- Use data to estimate the parameters of the distribution (e.g., mean and standard deviation)
- Once the probability distribution is known, use it to estimate the conditional probability P(Xi | Yk)
58Estimate Probabilities from Data
- Normal distribution
- One for each (Xi, Yj) pair
- For (Income, Class=No):
- sample mean = 110
- sample variance = 2975
- (the density at Income = 120K is computed in the sketch below)
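A minimal Python sketch of the normal-density estimate; with the slide's parameters (mean 110, variance 2975 for Income given Class=No), the density at Income = 120K comes out to roughly 0.0072, the value used on the next slide:

    import math

    def gaussian_density(x, mean, variance):
        return math.exp(-(x - mean) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)

    print(gaussian_density(120, mean=110, variance=2975))   # ~0.0072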
59Example of Naïve Bayes Classifier
Given a test record X = (Refund=No, Married, Income=120K):
- P(X | Class=No) = P(Refund=No | Class=No) × P(Married | Class=No) × P(Income=120K | Class=No)
  = 4/7 × 4/7 × 0.0072 = 0.0024
- P(X | Class=Yes) = P(Refund=No | Class=Yes) × P(Married | Class=Yes) × P(Income=120K | Class=Yes)
  = 1 × 0 × 1.2 × 10^-9 = 0
- Since P(X | No) P(No) > P(X | Yes) P(Yes)
- Therefore P(No | X) > P(Yes | X) → Class = No
60Naïve Bayes Classifier
- If one of the conditional probabilities is zero, then the entire expression becomes zero
- Probability estimation: use a smoothed estimate such as the Laplace or m-estimate correction
- where c = number of classes, p = prior probability, m = parameter
- (the corrections are sketched in code below)
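A minimal Python sketch of the smoothed estimates referred to above, using the slide's symbols; the exact Laplace and m-estimate forms are the textbook's corrections, and the example call revisits P(Refund=Yes | Yes) = 0/3:

    # n_ic = records of class Y with attribute value Xi, n_c = records of class Y
    def laplace_estimate(n_ic, n_c, c):
        return (n_ic + 1) / (n_c + c)

    def m_estimate(n_ic, n_c, m, p):
        return (n_ic + m * p) / (n_c + m)

    # P(Refund=Yes | Yes) = 0/3 without smoothing; smoothing keeps it non-zero:
    print(laplace_estimate(0, 3, c=2))        # 0.2
    print(m_estimate(0, 3, m=3, p=0.5))       # 0.25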
61Example of Naïve Bayes Classifier
A: attributes, M: mammals, N: non-mammals
P(A | M) P(M) > P(A | N) P(N) → Mammals
62Naïve Bayes (Summary)
- Robust to isolated noise points
- Handles missing values by ignoring the instance during probability estimate calculations
- Robust to irrelevant attributes
- Independence assumption may not hold for some attributes
- Use other techniques such as Bayesian Belief Networks (BBN)
63Bayesian Belief Networks
- Provides a graphical representation of probabilistic relationships among a set of random variables
- Consists of
- A directed acyclic graph (DAG)
- Node corresponds to a variable
- Arc corresponds to a dependence relationship between a pair of variables
- A probability table associating each node to its immediate parents
64Conditional Independence
D is a parent of C; A is a child of C; B is a descendant of D; D is an ancestor of A
- A node in a Bayesian network is conditionally independent of all of its nondescendants, if its parents are known
65Conditional Independence
66Probability Tables
- If X does not have any parents, the table contains the prior probability P(X)
- If X has only one parent (Y), the table contains the conditional probability P(X | Y)
- If X has multiple parents (Y1, Y2, …, Yk), the table contains the conditional probability P(X | Y1, Y2, …, Yk)
67Example of Bayesian Belief Network
68Example of Inferencing using BBN
- Given X = (E=No, D=Yes, CP=Yes, BP=High)
- Compute P(HD | E, D, CP, BP)
- P(HD=Yes | E=No, D=Yes) = 0.55, P(CP=Yes | HD=Yes) = 0.8, P(BP=High | HD=Yes) = 0.85
- P(HD=Yes | E=No, D=Yes, CP=Yes, BP=High) ∝ 0.55 × 0.8 × 0.85 = 0.374
- P(HD=No | E=No, D=Yes) = 0.45, P(CP=Yes | HD=No) = 0.01, P(BP=High | HD=No) = 0.2
- P(HD=No | E=No, D=Yes, CP=Yes, BP=High) ∝ 0.45 × 0.01 × 0.2 = 0.0009
Classify X as HD = Yes (the arithmetic is reproduced in the sketch below)
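A few lines of Python reproducing the arithmetic above:

    # Unnormalized posteriors for HD, using the conditional probabilities quoted on the slide
    p_hd_yes = 0.55 * 0.80 * 0.85   # P(HD=Yes|E,D) * P(CP=Yes|HD=Yes) * P(BP=High|HD=Yes)
    p_hd_no  = 0.45 * 0.01 * 0.20   # P(HD=No|E,D)  * P(CP=Yes|HD=No)  * P(BP=High|HD=No)
    print(p_hd_yes, p_hd_no)        # 0.374 vs 0.0009 -> classify HD = Yes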
69Data Mining Classification Alternative
Techniques
- Artificial Neural Networks
70Artificial Neural Networks (ANN)
Output Y is 1 if at least two of the three inputs
are equal to 1.
71Artificial Neural Networks (ANN)
72Artificial Neural Networks (ANN)
- Model is an assembly of inter-connected nodes and weighted links
- Output node sums up its input values according to the weights of its links
- Compare the output node against some threshold t
Perceptron Model
73General Structure of ANN
Training ANN means learning the weights of the
neurons
74Artificial Neural Networks (ANN)
- Various types of neural network topology
- Single-layered network (perceptron) versus multi-layered network
- Feed-forward versus recurrent network
- Various types of activation functions (g)
75Perceptron
- Single layer network
- Contains only input and output nodes
- Activation function: g = sign(w·x)
- Applying the model is straightforward
- X1 = 1, X2 = 0, X3 = 1 → y = sign(0.2) = 1
76Perceptron Learning Rule
- Initialize the weights (w0, w1, …, wd)
- Repeat
- For each training example (xi, yi)
- Compute f(w, xi)
- Update the weights
- Until stopping condition is met
77Perceptron Learning Rule
- Weight update formula: wj ← wj + λ (yi - f(w, xi)) xij, where λ is the learning rate
- Intuition
- Update the weight based on the error e = y - f(x, w)
- If y = f(x, w), e = 0: no update needed
- If y > f(x, w), e = 2: weight must be increased so that f(x, w) will increase
- If y < f(x, w), e = -2: weight must be decreased so that f(x, w) will decrease
- (the update loop is sketched in code below)
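A minimal Python sketch of this update loop; the bias handling (constant input of 1) and the tiny AND-style dataset are assumptions for illustration:

    def sign(z):
        return 1 if z >= 0 else -1

    def train_perceptron(data, lr=0.1, epochs=20, dim=2):
        w = [0.0] * (dim + 1)                       # last weight acts as the bias
        for _ in range(epochs):
            for x, y in data:
                xb = list(x) + [1.0]
                f = sign(sum(wi * xi for wi, xi in zip(w, xb)))
                e = y - f                           # 0, +2 or -2
                w = [wi + lr * e * xi for wi, xi in zip(w, xb)]
        return w

    data = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]   # linearly separable
    print(train_perceptron(data))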
78Example of Perceptron Learning
79Perceptron Learning Rule
- Since f(w, x) is a linear combination of the input variables, the decision boundary is linear
- For nonlinearly separable problems, the perceptron learning algorithm will fail because no linear hyperplane can separate the data perfectly
80Nonlinearly Separable Data
XOR Data
81Multilayer Neural Network
- Hidden layers
- Intermediary layers between the input and output layers
- Hidden units (nodes)
- Nodes embedded in hidden layers
- More general activation functions (sigmoid, linear, etc.)
82Multi-layer Neural Network
- Multi-layer neural network can solve any type of
classification task involving nonlinear decision
surfaces
XOR Data
83Learning Multi-layer Neural Network
- Can we apply the perceptron learning rule to each node, including hidden nodes?
- The perceptron learning rule computes the error term e = y - f(w, x) and updates the weights accordingly
- Problem: how to determine the true value of y for hidden nodes?
- Approximate the error in hidden nodes by the error in the output nodes
- Problems
- Not clear how an adjustment in the hidden nodes affects the overall error
- No guarantee of convergence to an optimal solution
84Gradient Descent for Multilayer NN
- Weight update
- Error function
- Activation function f must be differentiable
- For sigmoid function
- Stochastic gradient descent (update the weight
immediately)
85Gradient Descent for MultiLayer NN
- For output neurons, weight update formula is the
same as before (gradient descent for perceptron) - For hidden neurons
86Design Issues in ANN
- Number of nodes in input layer
- One input node per binary/continuous attribute
- k or log2(k) nodes for each categorical attribute with k values
- Number of nodes in output layer
- One output node for a binary class problem
- k or log2(k) nodes for a k-class problem
- Number of nodes in hidden layer
- Initial weights and biases
87Characteristics of ANN
- Multilayer ANNs are universal approximators
- Can handle redundant attributes because the weights are learnt automatically
- Gradient descent may converge to a local minimum
- Model building can be very time consuming, but testing can be very fast
88Data Mining Classification Alternative
Techniques
89Support Vector Machines
- Find a linear hyperplane (decision boundary) that
will separate the data
90Support Vector Machines
91Support Vector Machines
- Another possible solution
92Support Vector Machines
93Support Vector Machines
- Which one is better? B1 or B2?
- How do you define better?
94Support Vector Machines
- Find the hyperplane that maximizes the margin → B1 is better than B2
95SVM vs Perceptron
- Perceptron: minimizes least squares error (gradient descent)
- SVM: maximizes the margin
96Support Vector Machines
97Learning Linear SVM
- We want to maximize the margin: 2 / ||w||
- Which is equivalent to minimizing: L(w) = ||w||² / 2
- But subject to the following constraints: yi (w·xi + b) ≥ 1 for all i
- This is a constrained optimization problem
- Solve it using the Lagrange multiplier method
98Learning Linear SVM
- Lagrange multiplier formulation
- Take derivatives w.r.t. w and b
- Additional constraints on the multipliers λi
- Dual problem
99Learning Linear SVM
- Bigger picture
- The learning algorithm needs to find w and b
- To solve for w: w = Σi λi yi xi
- But λi is zero for points that do not reside on the margin hyperplanes w·x + b = ±1
- Data points whose λi are not zero are called support vectors
100Example of Linear SVM
Support vectors
101Learning Linear SVM
- Bigger picture
- The decision boundary depends only on the support vectors
- If you have a data set with the same support vectors, the decision boundary will not change
- How to classify using SVM once w and b are found? Given a test record xi, predict sign(w·xi + b)
102Support Vector Machines
- What if the problem is not linearly separable?
103Support Vector Machines
- What if the problem is not linearly separable?
- Introduce slack variables ξi
- Need to minimize: ||w||²/2 + C (Σi ξi)^k
- Subject to: yi (w·xi + b) ≥ 1 - ξi, ξi ≥ 0
- If k is 1 or 2, this leads to the same objective function as linear SVM but with different constraints (see textbook)
104Nonlinear Support Vector Machines
- What if decision boundary is not linear?
105Nonlinear Support Vector Machines
- Trick: transform the data into a higher dimensional space
Decision boundary: w·Φ(x) + b = 0
106Learning NonLinear SVM
- Optimization problem
- Which leads to the same set of equations (but involving Φ(x) instead of x)
107Learning NonLinear SVM
- Issues
- What type of mapping function Φ should be used?
- How to do the computation in high dimensional space?
- Most computations involve the dot product Φ(xi)·Φ(xj)
- Curse of dimensionality?
108Learning Nonlinear SVM
- Kernel Trick
- Φ(xi)·Φ(xj) = K(xi, xj)
- K(xi, xj) is a kernel function (expressed in terms of the coordinates in the original space)
- Examples (two common kernels are sketched below)
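Two common kernel functions, sketched in Python; the parameter values (degree, c, gamma) are illustrative defaults, not values from the lecture:

    import math

    def polynomial_kernel(x, y, degree=2, c=1.0):
        # equals phi(x).phi(y) for a polynomial feature mapping
        return (sum(a * b for a, b in zip(x, y)) + c) ** degree

    def rbf_kernel(x, y, gamma=0.5):
        # Gaussian (radial basis function) kernel
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
        return math.exp(-gamma * sq_dist)

    print(polynomial_kernel((1.0, 2.0), (0.5, -1.0)))
    print(rbf_kernel((1.0, 2.0), (0.5, -1.0)))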
109Example of Nonlinear SVM
SVM with polynomial degree 2 kernel
110Learning Nonlinear SVM
- Advantages of using a kernel
- Don't have to know the mapping function Φ
- Computing the dot product Φ(xi)·Φ(xj) in the original space avoids the curse of dimensionality
- Not all functions can be kernels
- Must make sure there is a corresponding Φ in some high-dimensional space
- Mercer's theorem (see textbook)
111Data Mining Classification Alternative
Techniques
112Ensemble Methods
- Construct a set of classifiers from the training data
- Predict the class label of test records by combining the predictions made by multiple classifiers
113Why Ensemble Methods work?
- Suppose there are 25 base classifiers
- Each classifier has error rate ε = 0.35
- Assume the errors made by the classifiers are uncorrelated
- Probability that the ensemble classifier (majority vote) makes a wrong prediction:
- P(ensemble wrong) = Σ(i=13..25) C(25, i) ε^i (1 - ε)^(25-i) ≈ 0.06
- (this sum is computed in the sketch below)
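The calculation described above, in a few lines of Python (the majority vote is wrong only when 13 or more of the 25 base classifiers are wrong):

    from math import comb

    eps, T = 0.35, 25
    ensemble_error = sum(comb(T, i) * eps**i * (1 - eps)**(T - i) for i in range(13, T + 1))
    print(ensemble_error)   # ~0.06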
114General Approach
115Types of Ensemble Methods
- Bayesian ensemble
- Example: mixture of Gaussians
- Manipulate the data distribution
- Example: resampling methods
- Manipulate the input features
- Example: feature subset selection
- Manipulate the class labels
- Example: error-correcting output coding
- Introduce randomness into the learning algorithm
- Example: random forests
116Bagging
- Sampling with replacement
- Build a classifier on each bootstrap sample
- Each record has probability 1 - (1 - 1/n)^n ≈ 0.632 of being included in a given bootstrap sample of size n
- (a bagging sketch follows below)
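A minimal Python sketch of bagging; train_majority is a toy base learner invented for illustration:

    import random

    def bagging(records, train_base, num_classifiers=10, seed=0):
        rng = random.Random(seed)
        n = len(records)
        models = []
        for _ in range(num_classifiers):
            sample = [records[rng.randrange(n)] for _ in range(n)]   # sampling with replacement
            models.append(train_base(sample))
        return models

    def bagged_predict(models, x):
        votes = [m(x) for m in models]
        return max(set(votes), key=votes.count)    # majority vote

    # toy base learner (illustrative): always predicts the majority class of its sample
    def train_majority(sample):
        labels = [y for _, y in sample]
        maj = max(set(labels), key=labels.count)
        return lambda x, maj=maj: maj

    data = [((i,), "A") for i in range(6)] + [((i,), "B") for i in range(4)]
    models = bagging(data, train_majority, num_classifiers=5)
    print(bagged_predict(models, (0,)))   # vote over the 5 bootstrap models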
117Bagging Algorithm
118Bagging Example
- Consider a 1-dimensional data set
- Classifier is a decision stump
- Decision rule: x ≤ k versus x > k
- Split point k is chosen based on entropy
(decision stump: if x ≤ k predict y_left, otherwise predict y_right)
119Bagging Example
120Bagging Example
121Bagging Example
122Bagging Example
- Assume the test set is the same as the original data
- Use majority vote to determine the class predicted by the ensemble classifier
Predicted Class
123Boosting
- An iterative procedure to adaptively change the distribution of the training data by focusing more on previously misclassified records
- Initially, all N records are assigned equal weights
- Unlike bagging, weights may change at the end of each boosting round
124Boosting
- Records that are wrongly classified will have their weights increased
- Records that are classified correctly will have their weights decreased
- Example 4 is hard to classify
- Its weight is increased, so it is more likely to be chosen again in subsequent rounds
125AdaBoost
- Base classifiers: C1, C2, …, CT
- Error rate εi: the weighted fraction of training records misclassified by Ci
- Importance of a classifier: αi = ½ ln((1 - εi) / εi)
126AdaBoost Algorithm
- Weight update: multiply the weight of each record by exp(-αi) if it was classified correctly and by exp(αi) if it was misclassified, then renormalize
- If any intermediate round produces an error rate higher than 50%, the weights are reverted back to 1/n and the resampling procedure is repeated
- Classification: C*(x) = argmax_y Σi αi δ(Ci(x) = y)
- (one boosting round is sketched in code below)
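A minimal Python sketch of one AdaBoost round (importance α plus the exponential weight update); the labels and predictions are illustrative ±1 values:

    import math

    def adaboost_round(weights, labels, predictions):
        err = sum(w for w, y, p in zip(weights, labels, predictions) if y != p)   # weighted error
        alpha = 0.5 * math.log((1 - err) / err)                                   # classifier importance
        new_w = [w * math.exp(-alpha if y == p else alpha)
                 for w, y, p in zip(weights, labels, predictions)]
        z = sum(new_w)                       # normalization factor
        return alpha, [w / z for w in new_w]

    weights     = [0.25, 0.25, 0.25, 0.25]
    labels      = [1, 1, -1, -1]
    predictions = [1, -1, -1, -1]            # one mistake, on record 2
    alpha, new_weights = adaboost_round(weights, labels, predictions)
    print(alpha, new_weights)                # the misclassified record gets a larger weight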
127AdaBoost Algorithm
128AdaBoost Example
- Consider a 1-dimensional data set
- Classifier is a decision stump
- Decision rule: x ≤ k versus x > k
- Split point k is chosen based on entropy
(decision stump: if x ≤ k predict y_left, otherwise predict y_right)
129AdaBoost Example
- Training sets for the first 3 boosting rounds
- Summary
130AdaBoost Example
Predicted Class
131Data Mining Classification Alternative
Techniques
132Class Imbalance Problem
- Many classification problems have skewed classes (more records from one class than another)
- Credit card fraud
- Intrusion detection
- Defective products in a manufacturing assembly line
133Challenges
- Evaluation measures such as accuracy are not well-suited for imbalanced classes
- Detecting the rare class is like finding a needle in a haystack
134Confusion Matrix
a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative)
135Accuracy
136Problem with Accuracy
- Consider a 2-class problem
- Number of Class 0 examples = 9990
- Number of Class 1 examples = 10
- If a model predicts everything to be class 0, accuracy is 9990/10000 = 99.9%
- This is misleading because the model does not detect any class 1 example
- Detecting the rare class is usually more interesting (e.g., frauds, intrusions, defects, etc.)
137Alternative Measures
138ROC (Receiver Operating Characteristic)
- A graphical approach for displaying the trade-off between detection rate and false alarm rate
- Developed in the 1950s for signal detection theory to analyze noisy signals
- ROC curve plots TPR against FPR
- Performance of a model is represented as a point on the ROC curve
- Changing the threshold parameter of the classifier changes the location of the point
139ROC Curve
- (TPR, FPR)
- (0,0): declare everything to be negative class
- (1,1): declare everything to be positive class
- (1,0): ideal
- Diagonal line
- Random guessing
- Below diagonal line
- Prediction is opposite of the true class
140ROC (Receiver Operating Characteristic)
- To draw an ROC curve, the classifier must produce a continuous-valued output
- Outputs are used to rank test records, from the most likely positive-class record to the least likely positive-class record
- Many classifiers produce only discrete outputs (i.e., the predicted class)
- How to get continuous-valued outputs?
- Decision trees, rule-based classifiers, neural networks, Bayesian classifiers, k-nearest neighbors, SVM
141Example Decision Trees
Decision Tree
Continuous-valued outputs
142ROC Curve Example
143ROC Curve Example
- 1-dimensional data set containing 2 classes (positive and negative)
- Any point located at x > t is classified as positive
144Using ROC for Model Comparison
- Neither model consistently outperforms the other
- M1 is better for small FPR
- M2 is better for large FPR
- Area Under the ROC Curve (AUC)
- Ideal: area = 1
- Random guess: area = 0.5
145How to Construct an ROC curve
- Use a classifier that produces a continuous-valued output for each test instance: score(A)
- Sort the instances according to score(A) in decreasing order
- Apply a threshold at each unique value of score(A)
- Count the number of TP, FP, TN, FN at each threshold
- TPR = TP / (TP + FN)
- FPR = FP / (FP + TN)
- (the procedure is sketched in code below)
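A minimal Python sketch of this procedure; the scores and labels are illustrative and assumed to be distinct:

    def roc_points(scores, labels):
        pos = sum(1 for y in labels if y == 1)
        neg = len(labels) - pos
        ranked = sorted(zip(scores, labels), reverse=True)   # decreasing score
        points, tp, fp = [(0.0, 0.0)], 0, 0
        for s, y in ranked:                                  # threshold at each score
            if y == 1:
                tp += 1
            else:
                fp += 1
            points.append((fp / neg, tp / pos))              # (FPR, TPR)
        return points

    scores = [0.95, 0.85, 0.78, 0.66, 0.60, 0.55, 0.53, 0.43, 0.42, 0.25]
    labels = [1,    1,    0,    1,    1,    1,    0,    0,    1,    0]
    print(roc_points(scores, labels))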
146How to construct an ROC curve
Threshold ≥
ROC Curve
147Handling Class Imbalanced Problem
- Class-based ordering (e.g., RIPPER)
- Rules for the rare class have higher priority
- Cost-sensitive classification
- Misclassifying a rare-class record as majority class is more expensive than misclassifying a majority-class record as rare class
- Sampling-based approaches
148Cost Matrix
C(i, j): cost of misclassifying a class i example as class j
149Computing Cost of Classification
Accuracy = 80%, Cost = 3910
Accuracy = 90%, Cost = 4255
150Cost Sensitive Classification
- Example: Bayesian classifier
- Given a test record x
- Compute p(i | x) for each class i
- Decision rule: classify x as class k if p(k | x) is maximal
- For 2 classes, classify x as + if p(+ | x) > p(- | x)
- This decision rule implicitly assumes that C(+,+) = C(-,-) = 0 and C(+,-) = C(-,+)
151Cost Sensitive Classification
- General decision rule
- Classify test record x as the class k that minimizes the expected cost
- 2-class case
- Cost(+) = p(+ | x) C(+,+) + p(- | x) C(-,+)
- Cost(-) = p(+ | x) C(+,-) + p(- | x) C(-,-)
- Decision rule: classify x as + if Cost(+) < Cost(-)
- If C(+,+) = C(-,-) = 0, this reduces to: classify x as + if p(+ | x) C(+,-) > p(- | x) C(-,+)
- (this rule is sketched in code below)
152Sampling-based Approaches
- Modify the distribution of the training data so that the rare class is well represented in the training set
- Undersample the majority class
- Oversample the rare class
- Both approaches have advantages and disadvantages