Data Mining Classification: Alternative Techniques

1
Data Mining Classification: Alternative Techniques
  • Lecture Notes for Chapter 5
  • Introduction to Data Mining
  • by
  • Tan, Steinbach, Kumar

2
Data Mining Classification: Alternative Techniques
  • Rule-based Classifier

3
Rule-Based Classifier
  • Classify records by using a collection of
    if...then rules
  • Rule: (Condition) → y
  • where
  • Condition is a conjunction of attribute tests
  • y is the class label
  • LHS: rule antecedent or condition
  • RHS: rule consequent
  • Examples of classification rules:
  • (Blood Type = Warm) ∧ (Lay Eggs = Yes) → Birds
  • (Taxable Income < 50K) ∧ (Refund = Yes) → Evade = No

4
Rule-based Classifier (Example)
r1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
r2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
r3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
r4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
r5: (Live in Water = sometimes) → Amphibians
5
Application of Rule-Based Classifier
  • A rule r covers an instance x if the attributes
    of the instance satisfy the condition of the rule

The rule r1 covers a hawk ⇒ Bird. The rule r3
covers the grizzly bear ⇒ Mammal.
6
Rule Coverage and Accuracy
  • Coverage of a rule
  • Fraction of records that satisfy the antecedent
    of a rule
  • Accuracy of a rule
  • Fraction of records that satisfy both the
    antecedent and consequent of a rule

(Status = Single) → No: Coverage = 40%,
Accuracy = 50%
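
A minimal sketch (not from the original slides) of computing these two measures; the records and the rule predicate below are hypothetical:

    # Coverage and accuracy of the rule (Status = Single) -> No
    records = [
        {"Status": "Single",   "Class": "No"},
        {"Status": "Single",   "Class": "Yes"},
        {"Status": "Married",  "Class": "No"},
        {"Status": "Divorced", "Class": "No"},
        {"Status": "Single",   "Class": "No"},
    ]

    def antecedent(r):                    # (Status = Single)
        return r["Status"] == "Single"

    covered = [r for r in records if antecedent(r)]
    coverage = len(covered) / len(records)          # fraction satisfying antecedent
    accuracy = sum(r["Class"] == "No" for r in covered) / len(covered)
    print(coverage, accuracy)                       # -> 0.6 0.666...
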
7
How does Rule-based Classifier Work?
A lemur triggers rule r3, so it is classified as
a mammal. A turtle triggers both r4 and r5. A
dogfish shark triggers none of the rules.
8
Characteristics of Rule-Based Classifier
  • Mutually exclusive rules
  • Classifier contains mutually exclusive rules if
    the rules are independent of each other
  • Every record is covered by at most one rule
  • Exhaustive rules
  • Classifier has exhaustive coverage if it accounts
    for every possible combination of attribute
    values
  • Each record is covered by at least one rule

9
From Decision Trees To Rules
Rules are mutually exclusive and exhaustive. The
rule set contains as much information as the tree.
10
Rules Can Be Simplified
Initial rule: (Refund = No) ∧ (Status = Married) → No
Simplified rule: (Status = Married) → No
11
Effect of Rule Simplification
  • Rules are no longer mutually exclusive
  • A record may trigger more than one rule
  • Solution?
  • Ordered rule set
  • Unordered rule set use voting schemes
  • Rules are no longer exhaustive
  • A record may not trigger any rules
  • Solution?
  • Use a default class

12
Ordered Rule Set
  • Rules are rank ordered according to their
    priority
  • An ordered rule set is known as a decision list
  • When a test record is presented to the classifier
  • It is assigned to the class label of the highest
    ranked rule it has triggered
  • If none of the rules fired, it is assigned to the
    default class

13
Rule Ordering Schemes
  • Rule-based ordering
  • Individual rules are ranked based on their
    quality/priority
  • Class-based ordering
  • Rules that belong to the same class appear
    together

14
Building Classification Rules
  • Direct Method
  • Extract rules directly from data
  • e.g., RIPPER, CN2, 1R, and AQ
  • Indirect Method
  • Extract rules from other classification models
    (e.g., decision trees, neural networks, SVM,
    etc.)
  • e.g., C4.5rules

15
Direct Method: Sequential Covering
16
Example of Sequential Covering
17
Example of Sequential Covering
18
Aspects of Sequential Covering
  • Rule Growing
  • Rule evaluation
  • Instance Elimination
  • Stopping Criterion
  • Rule Pruning

19
Rule Growing
  • Two common strategies

20
Rule Evaluation
  • Evaluation metric determines which conjunct
    should be added during rule growing
  • Accuracy = nc / n
  • Laplace = (nc + 1) / (n + k)
  • M-estimate = (nc + k p) / (n + k)

n: number of instances covered by rule
nc: number of instances of class c covered by rule
k: number of classes
p: prior probability
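
A small sketch of these three metrics in Python, using the symbols defined above (the counts below are illustrative):

    def rule_accuracy(nc, n):
        return nc / n

    def laplace(nc, n, k):
        return (nc + 1) / (n + k)

    def m_estimate(nc, n, k, p):
        return (nc + k * p) / (n + k)

    # e.g., a rule covering n = 10 instances, nc = 8 of the target class,
    # k = 2 classes, prior p = 0.5:
    print(rule_accuracy(8, 10), laplace(8, 10, 2), m_estimate(8, 10, 2, 0.5))
    # -> 0.8 0.75 0.75
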
21
Rule Growing (Examples)
  • CN2 Algorithm:
  • Start from an empty conjunct: {}
  • Add conjuncts that minimize the entropy measure:
    {A}, {A,B}, ...
  • Determine the rule consequent by taking the majority
    class of instances covered by the rule
  • RIPPER Algorithm:
  • Start from an empty rule: {} → class
  • Add conjuncts that maximize FOIL's information
    gain measure:
  • R0: {} → class (initial rule)
  • R1: {A} → class (rule after adding conjunct)
  • Gain(R0, R1) = t [ log2(p1 / (p1 + n1)) − log2(p0 / (p0 + n0)) ]
  • where t: number of positive instances covered by
    both R0 and R1
  • p0: number of positive instances covered by R0
  • n0: number of negative instances covered by R0
  • p1: number of positive instances covered by R1
  • n1: number of negative instances covered by R1
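A sketch of FOIL's information gain as defined above, assuming R1 refines R0 (so every instance covered by R1 is also covered by R0, and t = p1):

    from math import log2

    def foil_gain(p0, n0, p1, n1):
        t = p1                      # positives covered by both R0 and R1
        return t * (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0)))

    # Illustrative counts: R0 covers 100 pos / 90 neg, R1 covers 80 pos / 10 neg
    print(foil_gain(p0=100, n0=90, p1=80, n1=10))   # positive gain: R1 is purer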

22
Instance Elimination
  • Why do we need to eliminate instances?
  • Otherwise, the next rule is identical to the
    previous rule
  • Why do we remove positive instances?
  • Ensure that the next rule is different
  • Why do we remove negative instances?
  • Prevent underestimating accuracy of rule
  • Compare rules R2 and R3 in the diagram

23
Stopping Criterion and Rule Pruning
  • Examples of stopping criteria
  • If rule does not improve significantly after
    adding conjunct
  • If rule starts covering examples from another
    class
  • Rule Pruning
  • Similar to post-pruning of decision trees
  • Example using validation set (reduced error
    pruning)
  • Remove one of the conjuncts in the rule
  • Compare error rate on validation set before and
    after pruning
  • If error improves, prune the conjunct

24
Summary of Direct Method
  • Initial rule set is empty
  • Repeat
  • Grow a single rule
  • Remove instances covered by the rule
  • Prune the rule (if necessary)
  • Add rule to the current rule set
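
A minimal sketch of this loop; learn_one_rule and covers are hypothetical placeholders for the rule-growing procedure and the coverage test:

    def sequential_covering(instances, learn_one_rule, covers):
        rules = []
        while instances:
            rule = learn_one_rule(instances)    # grow (and possibly prune) one rule
            if rule is None:                    # stopping criterion
                break
            rules.append(rule)                  # add rule to the current rule set
            # remove instances covered by the rule
            instances = [x for x in instances if not covers(rule, x)]
        return rules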

25
Direct Method: RIPPER
  • For 2-class problem, choose one of the classes as
    positive class, and the other as negative class
  • Learn the rules for positive class
  • Use negative class as default
  • For multi-class problem
  • Order the classes according to increasing class
    prevalence (fraction of instances that belong to
    a particular class)
  • Learn the rule set for smallest class first,
    treat the rest as negative class
  • Repeat with next smallest class as positive class

26
Direct Method: RIPPER
  • Rule growing:
  • Start from an empty rule: {} → y
  • Add conjuncts as long as they improve FOIL's
    information gain
  • Stop when rule no longer covers negative examples
  • Prune the rule immediately using incremental
    reduced error pruning
  • Measure for pruning: v = (p − n) / (p + n)
  • p number of positive examples covered by the
    rule in the validation set
  • n number of negative examples covered by the
    rule in the validation set
  • Pruning method delete any final sequence of
    conditions that maximizes v

27
Direct Method: RIPPER
  • Building a Rule Set
  • Use sequential covering algorithm
  • Grow a rule to cover the current set of positive
    examples
  • Eliminate both positive and negative examples
    covered by the rule
  • Each time a rule is added to the rule set,
    compute the new description length
  • stop adding new rules when the new description
    length is d bits longer than the smallest
    description length obtained so far

28
Indirect Methods
29
Indirect Method: C4.5rules
  • Extract rules for every path from root to leaf
    nodes
  • For each rule, r: A → y,
  • consider an alternative rule r′: A′ → y, where A′ is
    obtained by removing one of the conjuncts in A
  • Compare the pessimistic error rate for r against
    all r′
  • Prune if one of the r′ has a lower pessimistic
    error rate
  • Repeat until the pessimistic error rate can no longer
    be improved

30
Indirect Method: C4.5rules
  • Use class-based ordering
  • Rules that predict the same class are grouped
    together into the same subset
  • Compute total description length for each class
  • Classes are ordered in increasing order of their
    total description length

31
Example
C4.5rules:
(Give Birth = No, Can Fly = Yes) → Birds
(Give Birth = No, Live in Water = Yes) → Fishes
(Give Birth = Yes) → Mammals
(Give Birth = No, Can Fly = No, Live in Water = No) → Reptiles
( ) → Amphibians
32
Characteristics of Rule-Based Classifiers
  • As highly expressive as decision trees
  • Easy to interpret
  • Easy to generate
  • Can classify new instances rapidly
  • Performance comparable to decision trees

33
Data Mining Classification: Alternative Techniques
  • Instance-Based Classifiers

34
Instance-Based Classifiers
  • Store the training records
  • Use training records to predict the class
    label of unseen cases

35
Instance Based Classifiers
  • Examples
  • Rote-learner
  • Memorizes entire training data and performs
    classification only if attributes of record match
    one of the training examples exactly
  • Nearest neighbor
  • Uses k closest points (nearest neighbors) for
    performing classification

36
Nearest Neighbor Classifiers
  • Basic idea
  • If it walks like a duck, quacks like a duck, then
    it's probably a duck

37
Nearest-Neighbor Classifiers
  • Requires three things
  • The set of stored records
  • Distance metric to compute distance between
    records
  • The value of k, the number of nearest neighbors
    to retrieve
  • To classify an unknown record
  • Compute distance to other training records
  • Identify k nearest neighbors
  • Use class labels of nearest neighbors to
    determine the class label of unknown record
    (e.g., by taking majority vote)

38
Definition of Nearest Neighbor
The k-nearest neighbors of a record x are the data
points that have the k smallest distances to x
39
1 nearest-neighbor
Voronoi Diagram
40
Nearest Neighbor Classification
  • Compute distance between two points
  • Example Euclidean distance
  • Determine the class from nearest neighbor list
  • take the majority vote of class labels among the
    k-nearest neighbors
  • Weigh the vote according to distance
  • weight factor: w = 1/d²
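
A self-contained sketch of k-NN with Euclidean distance and (optionally distance-weighted) majority voting; the training data is hypothetical:

    import math
    from collections import Counter

    def knn_predict(train, test_point, k=3, weighted=False):
        # train: list of (feature_vector, label) pairs
        nearest = sorted((math.dist(x, test_point), y) for x, y in train)[:k]
        votes = Counter()
        for d, y in nearest:
            votes[y] += 1.0 / (d * d + 1e-12) if weighted else 1.0  # w = 1/d^2
        return votes.most_common(1)[0][0]

    train = [((1.0, 1.0), "A"), ((1.2, 0.9), "A"),
             ((5.0, 5.0), "B"), ((4.8, 5.2), "B")]
    print(knn_predict(train, (1.1, 1.0), k=3))   # -> 'A'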

41
Nearest Neighbor Classification
  • Choosing the value of k
  • If k is too small, sensitive to noise points
  • If k is too large, neighborhood may include
    points from other classes

42
Nearest Neighbor Classification
  • Scaling issues
  • Attributes may have to be scaled to prevent
    distance measures from being dominated by one of
    the attributes
  • Example
  • height of a person may vary from 1.5m to 1.8m
  • weight of a person may vary from 90lb to 300lb
  • income of a person may vary from $10K to $1M

43
Nearest Neighbor Classification
  • Problem with Euclidean measure
  • High dimensional data
  • curse of dimensionality
  • Can produce counter-intuitive results

X1 = (1 1 1 1 1 1 1 1 1 1 1 0) vs X2 = (0 1 1 1 1 1 1 1 1 1 1 1): d = 1.4142
X3 = (1 0 0 0 0 0 0 0 0 0 0 0) vs X4 = (0 0 0 0 0 0 0 0 0 0 0 1): d = 1.4142
  • Solution: normalize the vectors to unit length
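
A quick sketch reproducing the example above: the raw distances are equal, but after normalization the first (mostly overlapping) pair becomes much closer than the second:

    import math

    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def unit(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    x1 = [1]*11 + [0]; x2 = [0] + [1]*11    # share ten 1s
    x3 = [1] + [0]*11; x4 = [0]*11 + [1]    # share no 1s
    print(dist(x1, x2), dist(x3, x4))                          # 1.4142 and 1.4142
    print(dist(unit(x1), unit(x2)), dist(unit(x3), unit(x4)))  # ~0.4264 vs 1.4142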

44
Nearest Neighbor Classification
  • k-NN classifiers are lazy learners
  • They do not build models explicitly
  • Unlike eager learners such as decision tree
    induction and rule-based systems
  • Classifying unknown records is relatively
    expensive

45
Example: PEBLS
  • PEBLS: Parallel Exemplar-Based Learning System
    (Cost & Salzberg)
  • Works with both continuous and nominal features
  • For nominal features, distance between two
    nominal values is computed using the modified value
    difference metric (MVDM)
  • Each record is assigned a weight factor
  • Number of nearest neighbors: k = 1

46
Example: PEBLS
Distance between nominal attribute values:
d(V1, V2) = Σc |n1c/n1 − n2c/n2|

d(Single, Married)   = |2/4 − 0/4| + |2/4 − 4/4| = 1
d(Single, Divorced)  = |2/4 − 1/2| + |2/4 − 1/2| = 0
d(Married, Divorced) = |0/4 − 1/2| + |4/4 − 1/2| = 1
d(Refund=Yes, Refund=No) = |0/3 − 3/7| + |3/3 − 4/7| = 6/7
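
A sketch of MVDM from per-class value counts, reproducing the numbers above (counts taken from the example's Marital Status columns):

    def mvdm(counts1, counts2):
        n1, n2 = sum(counts1.values()), sum(counts2.values())
        classes = set(counts1) | set(counts2)
        return sum(abs(counts1.get(c, 0) / n1 - counts2.get(c, 0) / n2)
                   for c in classes)

    single   = {"Yes": 2, "No": 2}
    married  = {"Yes": 0, "No": 4}
    divorced = {"Yes": 1, "No": 1}
    print(mvdm(single, married))    # -> 1.0
    print(mvdm(single, divorced))   # -> 0.0
    print(mvdm(married, divorced))  # -> 1.0
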
47
Example: PEBLS
Distance between record X and record Y:
Δ(X, Y) = wX wY Σi d(Xi, Yi)²
where wX is the number of times X is used for
prediction divided by the number of times X
predicts correctly:
wX ≈ 1 if X makes accurate predictions most of
the time; wX > 1 if X is not reliable for making
predictions
48
Data Mining Classification: Alternative Techniques
  • Bayesian Classifiers

49
Thomas Bayes
  • English minister and mathematician (1701–1761)

http://www-history.mcs.st-andrews.ac.uk/PictDisplay/Bayes.html
50
Bayes Classifier
  • A probabilistic framework for solving
    classification problems
  • Conditional probability: P(Y | X) = P(X, Y) / P(X)
  • Bayes theorem: P(Y | X) = P(X | Y) P(Y) / P(X)

51
Example of Bayes Theorem
  • Given
  • A doctor knows that meningitis causes stiff neck
    50% of the time
  • Prior probability of any patient having
    meningitis is 1/50,000
  • Prior probability of any patient having stiff
    neck is 1/20
  • If a patient has stiff neck, what's the
    probability he/she has meningitis?
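
Applying Bayes theorem to the numbers above:

    P(M | S) = P(S | M) P(M) / P(S) = (0.5 × 1/50,000) / (1/20) = 0.0002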

52
Using Bayes Theorem for Classification
  • Consider each attribute and class label as random
    variables
  • Given a record with attributes (X1, X2,, Xd)
  • Goal is to predict class Y
  • Specifically, we want to find the value of Y that
    maximizes P(Y X1, X2,, Xd )
  • Can we estimate P(Y X1, X2,, Xd ) directly from
    data?

53
Using Bayes Theorem for Classification
  • Approach:
  • Compute the posterior probability P(Y | X1, X2, ...,
    Xd) using the Bayes theorem
  • Maximum a-posteriori: choose Y that maximizes
    P(Y | X1, X2, ..., Xd)
  • Equivalent to choosing the value of Y that maximizes
    P(X1, X2, ..., Xd | Y) P(Y)
  • How to estimate P(X1, X2, ..., Xd | Y)?

54
Naïve Bayes Classifier
  • Assume independence among attributes Xi when
    class is given:
  • P(X1, X2, ..., Xd | Yj) = P(X1 | Yj) P(X2 | Yj) ... P(Xd | Yj)
  • Can estimate P(Xi | Yj) for all Xi and Yj from
    data
  • New point is classified to Yj if P(Yj) ∏i P(Xi | Yj)
    is maximal

55
Conditional Independence
  • X and Y are conditionally independent given Z if
    P(X | Y, Z) = P(X | Z)
  • Example Arm length and reading skills
  • Young child has shorter arm length and limited
    reading skills, compared to adults
  • If age is fixed, no apparent relationship between
    arm length and reading skills
  • Arm length and reading skills are conditionally
    independent given age

56
Estimate Probabilities from Data
  • Class prior: P(Y) = Nc / N
  • e.g., P(No) = 7/10, P(Yes) = 3/10
  • For discrete attributes: P(Xi | Yk) = |Xik| / Nc
  • where |Xik| is the number of instances having
    attribute value Xi and belonging to class Yk
  • Examples:
  • P(Status=Married | No) = 4/7, P(Refund=Yes | Yes) = 0
57
Estimate Probabilities from Data
  • For continuous attributes
  • Discretization Partition the range into bins
  • Replace continuous value with bin value
  • Attribute changed from continuous to ordinal
  • Probability density estimation
  • Assume attribute follows a normal distribution
  • Use data to estimate parameters of distribution
    (e.g., mean and standard deviation)
  • Once probability distribution is known, use it to
    estimate the conditional probability P(XiY)

58
Estimate Probabilities from Data
  • Normal distribution:
    P(Xi | Yj) = 1/sqrt(2π σij²) · exp(−(Xi − μij)² / (2 σij²))
  • One for each (Xi, Yj) pair
  • For (Income, Class=No):
  • If Class = No:
  • sample mean μ = 110
  • sample variance σ² = 2975
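
A one-line sketch evaluating this density for Income = 120 with the sample statistics above:

    import math

    def normal_pdf(x, mean, var):
        return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

    print(normal_pdf(120, 110, 2975))   # ~0.0072, the value used on the next slide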

59
Example of Naïve Bayes Classifier
Given a test record: X = (Refund = No, Married, Income = 120K)
  • P(X | Class=No) = P(Refund=No | Class=No)
    × P(Married | Class=No)
    × P(Income=120K | Class=No)
    = 4/7 × 4/7 × 0.0072 = 0.0024
  • P(X | Class=Yes) = P(Refund=No | Class=Yes)
    × P(Married | Class=Yes)
    × P(Income=120K | Class=Yes)
    = 1 × 0 × 1.2 × 10⁻⁹ = 0
  • Since P(X | No) P(No) > P(X | Yes) P(Yes),
  • therefore P(No | X) > P(Yes | X) ⇒ Class = No

60
Naïve Bayes Classifier
  • If one of the conditional probabilities is zero,
    then the entire expression becomes zero
  • Probability estimation:
  • Original: P(Ai | C) = Nic / Nc
  • Laplace: P(Ai | C) = (Nic + 1) / (Nc + c)
  • m-estimate: P(Ai | C) = (Nic + m p) / (Nc + m)

c: number of classes
p: prior probability
m: parameter
Nic: number of instances with attribute value Ai in class C
Nc: number of instances in class C
61
Example of Naïve Bayes Classifier
A: attributes, M: mammals, N: non-mammals
Since P(A | M) P(M) > P(A | N) P(N) ⇒ Mammals
62
Naïve Bayes (Summary)
  • Robust to isolated noise points
  • Handle missing values by ignoring the instance
    during probability estimate calculations
  • Robust to irrelevant attributes
  • Independence assumption may not hold for some
    attributes
  • Use other techniques such as Bayesian Belief
    Networks (BBN)

63
Bayesian Belief Networks
  • Provides graphical representation of
    probabilistic relationships among a set of random
    variables
  • Consists of
  • A directed acyclic graph (dag)
  • Node corresponds to a variable
  • Arc corresponds to dependence relationship
    between a pair of variables
  • A probability table associating each node to its
    immediate parents

64
Conditional Independence
D is a parent of C; A is a child of C; B is a
descendant of D; D is an ancestor of A
  • A node in a Bayesian network is conditionally
    independent of all of its nondescendants, if its
    parents are known

65
Conditional Independence
  • Naïve Bayes assumption:
    P(X1, X2, ..., Xd | y) = P(X1 | y) P(X2 | y) ... P(Xd | y)

66
Probability Tables
  • If X does not have any parents, table contains
    prior probability P(X)
  • If X has only one parent (Y), table contains
    conditional probability P(X | Y)
  • If X has multiple parents (Y1, Y2, ..., Yk), table
    contains conditional probability P(X | Y1, Y2, ..., Yk)

67
Example of Bayesian Belief Network
68
Example of Inferencing using BBN
  • Given X = (E = No, D = Yes, CP = Yes, BP = High)
  • Compute P(HD | E, D, CP, BP)
  • P(HD=Yes | E=No, D=Yes) = 0.55; P(CP=Yes | HD=Yes)
    = 0.8; P(BP=High | HD=Yes) = 0.85
  • P(HD=Yes | E=No, D=Yes, CP=Yes, BP=High) ∝ 0.55
    × 0.8 × 0.85 = 0.374
  • P(HD=No | E=No, D=Yes) = 0.45; P(CP=Yes | HD=No)
    = 0.01; P(BP=High | HD=No) = 0.2
  • P(HD=No | E=No, D=Yes, CP=Yes, BP=High) ∝ 0.45
    × 0.01 × 0.2 = 0.0009

⇒ Classify X as HD = Yes
69
Data Mining Classification: Alternative Techniques
  • Artificial Neural Networks

70
Artificial Neural Networks (ANN)
Output Y is 1 if at least two of the three inputs
are equal to 1.
71
Artificial Neural Networks (ANN)
72
Artificial Neural Networks (ANN)
  • Model is an assembly of inter-connected nodes and
    weighted links
  • Output node sums its input values, weighted by
    the links
  • Compare output node against some threshold t

Perceptron Model
73
General Structure of ANN
Training ANN means learning the weights of the
neurons
74
Artificial Neural Networks (ANN)
  • Various types of neural network topology
  • single-layered network (perceptron) versus
    multi-layered network
  • Feed-forward versus recurrent network
  • Various types of activation functions (g)

75
Perceptron
  • Single layer network
  • Contains only input and output nodes
  • Activation function: g = sign(w · x)
  • Applying the model is straightforward:
  • X1 = 1, X2 = 0, X3 = 1 ⇒ y = sign(0.2) = 1

76
Perceptron Learning Rule
  • Initialize the weights (w0, w1, ..., wd)
  • Repeat
  • For each training example (xi, yi)
  • Compute f(w, xi)
  • Update the weights
  • Until stopping condition is met
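
A minimal sketch of this procedure (λ is the learning rate; the AND data set below is illustrative):

    def sign(z):
        return 1 if z >= 0 else -1

    def train_perceptron(data, d, lam=0.1, epochs=100):
        w = [0.0] * (d + 1)                      # w[0] acts as the bias weight
        for _ in range(epochs):
            converged = True
            for x, y in data:                    # y in {-1, +1}
                xe = [1.0] + list(x)             # prepend 1 for the bias term
                f = sign(sum(wi * xi for wi, xi in zip(w, xe)))
                if f != y:                       # error e = y - f is +2 or -2
                    converged = False
                    w = [wi + lam * (y - f) * xi for wi, xi in zip(w, xe)]
            if converged:                        # stopping condition: no errors
                break
        return w

    data = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]  # AND
    print(train_perceptron(data, d=2))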

77
Perceptron Learning Rule
  • Weight update formula:
    wj ← wj + λ (yi − f(w, xi)) xij, where λ is the learning rate
  • Intuition:
  • Update weight based on error e = y − f(x, w)
  • If y = f(x, w), e = 0: no update needed
  • If y > f(x, w), e = 2: weight must be increased so
    that f(x, w) will increase
  • If y < f(x, w), e = −2: weight must be decreased so
    that f(x, w) will decrease

78
Example of Perceptron Learning
79
Perceptron Learning Rule
  • Since f(w,x) is a linear combination of input
    variables, decision boundary is linear
  • For nonlinearly separable problems, perceptron
    learning algorithm will fail because no linear
    hyperplane can separate the data perfectly

80
Nonlinearly Separable Data
XOR Data
81
Multilayer Neural Network
  • Hidden layers
  • intermediary layers between input and output layers
  • Hidden units (nodes)
  • nodes embedded in hidden layers
  • More general activation functions (sigmoid,
    linear, etc.)

82
Multi-layer Neural Network
  • Multi-layer neural network can solve any type of
    classification task involving nonlinear decision
    surfaces

XOR Data
83
Learning Multi-layer Neural Network
  • Can we apply perceptron learning rule to each
    node, including hidden nodes?
  • Perceptron learning rule computes the error term
    e = y − f(w, x) and updates weights accordingly
  • Problem: how to determine the true value of y
    for hidden nodes?
  • Approximate error in hidden nodes by error in the
    output nodes
  • Problems:
  • Not clear how adjustment in the hidden nodes
    affects overall error
  • No guarantee of convergence to optimal solution

84
Gradient Descent for Multilayer NN
  • Weight update: wj ← wj − λ ∂E/∂wj
  • Error function: E = ½ Σi (yi − f(w, xi))²
  • Activation function f must be differentiable
  • For sigmoid function: σ(z) = 1 / (1 + e^(−z)),
    σ′(z) = σ(z) (1 − σ(z))
  • Stochastic gradient descent (update the weights
    immediately after each example)

85
Gradient Descent for MultiLayer NN
  • For output neurons, weight update formula is the
    same as before (gradient descent for perceptron)
  • For hidden neurons

86
Design Issues in ANN
  • Number of nodes in input layer
  • One input node per binary/continuous attribute
  • k or log2 k nodes for each categorical attribute
    with k values
  • Number of nodes in output layer
  • One output for binary class problem
  • k or log2 k nodes for k-class problem
  • Number of nodes in hidden layer
  • Initial weights and biases

87
Characteristics of ANN
  • Multilayer ANNs are universal approximators
  • Can handle redundant attributes because weights
    are automatically learnt
  • Gradient descent may converge to local minimum
  • Model building can be very time consuming, but
    testing can be very fast

88
Data Mining Classification: Alternative Techniques
  • Support Vector Machines

89
Support Vector Machines
  • Find a linear hyperplane (decision boundary) that
    will separate the data

90
Support Vector Machines
  • One Possible Solution

91
Support Vector Machines
  • Another possible solution

92
Support Vector Machines
  • Other possible solutions

93
Support Vector Machines
  • Which one is better? B1 or B2?
  • How do you define better?

94
Support Vector Machines
  • Find the hyperplane that maximizes the margin
    ⇒ B1 is better than B2

95
SVM vs Perceptron
Perceptron: minimizes least square error (gradient descent)
SVM: maximizes margin
96
Support Vector Machines
97
Learning Linear SVM
  • We want to maximize: margin = 2 / ||w||
  • Which is equivalent to minimizing: L(w) = ||w||² / 2
  • But subject to the following constraints:
    w · xi + b ≥ 1 if yi = +1, w · xi + b ≤ −1 if yi = −1
  • or, equivalently: yi (w · xi + b) ≥ 1
  • This is a constrained optimization problem
  • Solve it using the Lagrange multiplier method
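
In practice the quadratic program is rarely solved by hand; a sketch using scikit-learn's SVC (assuming it is available) on hypothetical toy data:

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[1, 1], [2, 2], [2, 0], [0, 0], [1, 0], [0, 1]])
    y = np.array([1, 1, 1, -1, -1, -1])

    clf = SVC(kernel="linear", C=1e6)     # very large C approximates a hard margin
    clf.fit(X, y)
    print(clf.coef_, clf.intercept_)      # w and b
    print(clf.support_vectors_)           # points with nonzero Lagrange multipliers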

98
Learning Linear SVM
  • Lagrange multiplier:
    L = ||w||²/2 − Σi λi [yi (w · xi + b) − 1]
  • Take derivatives w.r.t. w and b:
    w = Σi λi yi xi, Σi λi yi = 0
  • Additional constraints: λi ≥ 0
  • Dual problem: maximize
    LD = Σi λi − ½ Σi Σj λi λj yi yj (xi · xj)
99
Learning Linear SVM
  • Bigger picture:
  • Learning algorithm needs to find w and b
  • To solve for w: w = Σi λi yi xi
  • But:
  • λi is zero for points that do not reside on the
    margin hyperplanes w · x + b = ±1
  • Data points whose λi's are not zero are called
    support vectors

100
Example of Linear SVM
Support vectors
101
Learning Linear SVM
  • Bigger picture:
  • Decision boundary depends only on support vectors
  • If you have a data set with the same support vectors,
    the decision boundary will not change
  • How to classify using SVM once w and b are found?
    Given a test record xi, classify by the sign of
    f(xi) = w · xi + b

102
Support Vector Machines
  • What if the problem is not linearly separable?

103
Support Vector Machines
  • What if the problem is not linearly separable?
  • Introduce slack variables ξi
  • Need to minimize: L(w) = ||w||²/2 + C Σi ξi^k
  • Subject to: yi (w · xi + b) ≥ 1 − ξi, ξi ≥ 0
  • If k is 1 or 2, this leads to the same objective
    function as linear SVM but with different
    constraints (see textbook)

104
Nonlinear Support Vector Machines
  • What if decision boundary is not linear?

105
Nonlinear Support Vector Machines
  • Trick: transform data into a higher-dimensional
    space

Decision boundary
106
Learning NonLinear SVM
  • Optimization problem
  • Which leads to the same set of equations (but
    involving Φ(x) instead of x)

107
Learning NonLinear SVM
  • Issues:
  • What type of mapping function Φ should be used?
  • How to do the computation in high-dimensional
    space?
  • Most computations involve the dot product Φ(xi) · Φ(xj)
  • Curse of dimensionality?

108
Learning Nonlinear SVM
  • Kernel Trick:
  • Φ(xi) · Φ(xj) = K(xi, xj)
  • K(xi, xj) is a kernel function (expressed in
    terms of the coordinates in the original space)
  • Examples:
  • Polynomial: K(x, y) = (x · y + 1)^p
  • Gaussian (RBF): K(x, y) = exp(−||x − y||² / (2σ²))
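
A sketch verifying the kernel trick numerically for the degree-2 polynomial kernel, whose explicit 2-D mapping is Φ(x) = (x1², x2², √2 x1x2, √2 x1, √2 x2, 1):

    import math

    def K(x, y):                                 # kernel in the original space
        return (x[0]*y[0] + x[1]*y[1] + 1) ** 2

    def phi(x):                                  # explicit high-dimensional map
        r2 = math.sqrt(2)
        return [x[0]**2, x[1]**2, r2*x[0]*x[1], r2*x[0], r2*x[1], 1.0]

    x, y = (1.0, 2.0), (3.0, -1.0)
    print(K(x, y))                                    # 4.0
    print(sum(a*b for a, b in zip(phi(x), phi(y))))   # 4.0, identical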

109
Example of Nonlinear SVM
SVM with polynomial degree 2 kernel
110
Learning Nonlinear SVM
  • Advantages of using kernel:
  • Don't have to know the mapping function Φ
  • Computing the dot product Φ(xi) · Φ(xj) in the
    original space avoids the curse of dimensionality
  • Not all functions can be kernels:
  • Must make sure there is a corresponding Φ in some
    high-dimensional space
  • Mercer's theorem (see textbook)

111
Data Mining Classification: Alternative Techniques
  • Ensemble Methods

112
Ensemble Methods
  • Construct a set of classifiers from the training
    data
  • Predict class label of test records by combining
    the predictions made by multiple classifiers

113
Why Ensemble Methods work?
  • Suppose there are 25 base classifiers
  • Each classifier has error rate ε = 0.35
  • Assume errors made by classifiers are
    uncorrelated
  • Probability that the majority-vote ensemble makes a
    wrong prediction (13 or more classifiers wrong):
    P(wrong) = Σi=13..25 C(25, i) ε^i (1 − ε)^(25−i) ≈ 0.06
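
The binomial sum above can be checked directly:

    from math import comb

    eps, n = 0.35, 25
    p_wrong = sum(comb(n, i) * eps**i * (1 - eps)**(n - i)
                  for i in range(13, n + 1))    # majority (13+) of 25 wrong
    print(p_wrong)   # ~0.06, far below the individual error rate of 0.35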

114
General Approach
115
Types of Ensemble Methods
  • Bayesian ensemble
  • Example: mixture of Gaussians
  • Manipulate data distribution
  • Example: resampling methods
  • Manipulate input features
  • Example: feature subset selection
  • Manipulate class labels
  • Example: error-correcting output coding
  • Introduce randomness into the learning algorithm
  • Example: random forests

116
Bagging
  • Sampling with replacement
  • Build classifier on each bootstrap sample
  • Each instance has probability 1 − (1 − 1/n)^n of
    being selected in a given bootstrap sample
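
A sketch of the bootstrap loop; base_learner is a hypothetical placeholder for any classifier-training function:

    import random

    def bagging(data, base_learner, num_models=10):
        n = len(data)
        models = []
        for _ in range(num_models):
            sample = [random.choice(data) for _ in range(n)]  # with replacement
            models.append(base_learner(sample))
        return models

    # Probability an instance appears in a given bootstrap sample:
    n = 1000
    print(1 - (1 - 1/n) ** n)   # ~0.632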

117
Bagging Algorithm
118
Bagging Example
  • Consider a 1-dimensional data set
  • Classifier is a decision stump:
  • Decision rule: x ≤ k versus x > k
  • Split point k is chosen based on entropy
  • Stump: if x ≤ k predict y_left, else predict y_right
119
Bagging Example
120
Bagging Example
121
Bagging Example
  • Summary of Training sets

122
Bagging Example
  • Assume test set is the same as the original data
  • Use majority vote to determine class of ensemble
    classifier

Predicted Class
123
Boosting
  • An iterative procedure to adaptively change
    distribution of training data by focusing more on
    previously misclassified records
  • Initially, all N records are assigned equal
    weights
  • Unlike bagging, weights may change at the end of
    each boosting round

124
Boosting
  • Records that are wrongly classified will have
    their weights increased
  • Records that are classified correctly will have
    their weights decreased
  • Example 4 is hard to classify
  • Its weight is increased, therefore it is more
    likely to be chosen again in subsequent rounds

125
AdaBoost
  • Base classifiers: C1, C2, ..., CT
  • Error rate of classifier Ci:
    εi = (1/N) Σj wj δ(Ci(xj) ≠ yj)
  • Importance of a classifier:
    αi = ½ ln((1 − εi) / εi)

126
AdaBoost Algorithm
  • Weight update:
    wj ← (wj / Zi) × exp(−αi) if Ci(xj) = yj,
    (wj / Zi) × exp(αi) if Ci(xj) ≠ yj,
    where Zi is a normalization factor
  • If any intermediate round produces an error rate
    higher than 50%, the weights are reverted back to
    1/n and the resampling procedure is repeated
  • Classification:
    C*(x) = arg max_y Σi αi δ(Ci(x) = y)
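
A sketch of one boosting round using the formulas above (the instance weights and correctness flags are illustrative):

    import math

    def adaboost_round(weights, correct):
        # correct[j] is True if the current classifier got instance j right
        eps = sum(w for w, c in zip(weights, correct) if not c) / sum(weights)
        alpha = 0.5 * math.log((1 - eps) / eps)
        new_w = [w * math.exp(-alpha if c else alpha)
                 for w, c in zip(weights, correct)]
        z = sum(new_w)                          # normalization factor Z
        return [w / z for w in new_w], alpha

    w, flags = [0.25] * 4, [True, True, True, False]
    print(adaboost_round(w, flags))   # the misclassified instance gains weight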

127
AdaBoost Algorithm
128
AdaBoost Example
  • Consider a 1-dimensional data set
  • Classifier is a decision stump:
  • Decision rule: x ≤ k versus x > k
  • Split point k is chosen based on entropy
  • Stump: if x ≤ k predict y_left, else predict y_right
129
AdaBoost Example
  • Training sets for the first 3 boosting rounds
  • Summary

130
AdaBoost Example
  • Weights
  • Classification

Predicted Class
131
Data Mining Classification: Alternative Techniques
  • Imbalanced Class Problem

132
Class Imbalance Problem
  • Lots of classification problems where the classes
    are skewed (more records from one class than
    another)
  • Credit card fraud
  • Intrusion detection
  • Defective products in manufacturing assembly line

133
Challenges
  • Evaluation measures such as accuracy are not
    well-suited for imbalanced classes
  • Detecting the rare class is like finding needle
    in a haystack

134
Confusion Matrix
  • Confusion Matrix

a: TP (true positive), b: FN (false negative),
c: FP (false positive), d: TN (true negative)
135
Accuracy
  • Most widely-used metric:
    Accuracy = (a + d) / (a + b + c + d)
             = (TP + TN) / (TP + TN + FP + FN)
136
Problem with Accuracy
  • Consider a 2-class problem
  • Number of Class 0 examples 9990
  • Number of Class 1 examples 10
  • If a model predicts everything to be class 0,
    accuracy is 9990/10000 = 99.9%
  • This is misleading because the model does not
    detect any class 1 example
  • Detecting the rare class is usually more
    interesting (e.g., frauds, intrusions, defects,
    etc)

137
Alternative Measures
  • Precision: p = TP / (TP + FP)
  • Recall: r = TP / (TP + FN)
  • F-measure: F = 2rp / (r + p) = 2TP / (2TP + FP + FN)
138
ROC (Receiver Operating Characteristic)
  • A graphical approach for displaying trade-off
    between detection rate and false alarm rate
  • Developed in 1950s for signal detection theory to
    analyze noisy signals
  • ROC curve plots TPR against FPR
  • Performance of a model represented as a point in
    an ROC curve
  • Changing the threshold parameter of classifier
    changes the location of the point

139
ROC Curve
  • (TPR,FPR)
  • (0,0) declare everything to be
    negative class
  • (1,1) declare everything to be positive
    class
  • (1,0) ideal
  • Diagonal line
  • Random guessing
  • Below diagonal line
  • prediction is opposite of the true class

140
ROC (Receiver Operating Characteristic)
  • To draw ROC curve, classifier must produce
    continuous-valued output
  • Outputs are used to rank test records, from the
    most likely positive class record to the least
    likely positive class record
  • Many classifiers produce only discrete outputs
    (i.e., predicted class)
  • How to get continuous-valued outputs?
  • Decision trees, rule-based classifiers, neural
    networks, Bayesian classifiers, k-nearest
    neighbors, SVM

141
Example Decision Trees
Decision Tree
Continuous-valued outputs
142
ROC Curve Example
143
ROC Curve Example
  • 1-dimensional data set containing 2 classes
    (positive and negative)
  • Any point located at x > t is classified as positive
144
Using ROC for Model Comparison
  • Neither model consistently outperforms the other
  • M1 is better for small FPR
  • M2 is better for large FPR
  • Area Under the ROC Curve (AUC)
  • Ideal: area = 1
  • Random guess: area = 0.5

145
How to Construct an ROC curve
  • Use a classifier that produces a continuous-valued
    output for each test instance: score(A)
  • Sort the instances according to score(A) in
    decreasing order
  • Apply a threshold at each unique value of
    score(A)
  • Count the number of TP, FP, TN, FN at each
    threshold
  • TPR = TP / (TP + FN)
  • FPR = FP / (FP + TN)
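
A sketch of this construction (for simplicity the threshold is lowered one record at a time rather than per unique score; the scores below are illustrative):

    def roc_points(scores, labels):             # labels: True = positive
        order = sorted(range(len(scores)), key=lambda i: -scores[i])
        P = sum(labels); N = len(labels) - P
        tp = fp = 0
        points = [(0.0, 0.0)]                   # threshold above every score
        for i in order:
            if labels[i]: tp += 1
            else:         fp += 1
            points.append((fp / N, tp / P))     # (FPR, TPR)
        return points

    scores = [0.95, 0.93, 0.87, 0.85, 0.76, 0.53, 0.43, 0.25]
    labels = [True, True, False, True, False, True, False, False]
    print(roc_points(scores, labels))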

146
How to construct an ROC curve
Threshold ≥
ROC Curve
147
Handling Class Imbalanced Problem
  • Class-based ordering (e.g. RIPPER)
  • Rules for rare class have higher priority
  • Cost-sensitive classification
  • Misclassifying rare class as majority class is
    more expensive than misclassifying majority as
    rare class
  • Sampling-based approaches

148
Cost Matrix
C(i, j): cost of misclassifying a class i example as
class j
149
Computing Cost of Classification
Accuracy = 80%, Cost = 3910
Accuracy = 90%, Cost = 4255
150
Cost Sensitive Classification
  • Example: Bayesian classifier
  • Given a test record x:
  • Compute p(i | x) for each class i
  • Decision rule: classify x as class k if p(k | x) is
    maximal
  • For 2-class, classify x as + if p(+ | x) > p(− | x)
  • This decision rule implicitly assumes that
    C(+,+) = C(−,−) = 0 and C(+,−) = C(−,+)

151
Cost Sensitive Classification
  • General decision rule:
  • Classify test record x as the class k that minimizes
    the expected cost
  • 2-class:
  • Cost(+) = p(+ | x) C(+,+) + p(− | x) C(−,+)
  • Cost(−) = p(+ | x) C(+,−) + p(− | x) C(−,−)
  • Decision rule: classify x as + if Cost(+) < Cost(−)
  • if C(+,+) = C(−,−) = 0, this reduces to classifying
    x as + if p(+ | x) C(+,−) > p(− | x) C(−,+)
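
A sketch of this rule; the cost matrix below is hypothetical:

    def cost_sensitive_predict(p_pos, C):
        # C[i][j]: cost of predicting class j for a true class i (0 = -, 1 = +)
        p_neg = 1 - p_pos
        cost_pos = p_pos * C[1][1] + p_neg * C[0][1]   # Cost(+)
        cost_neg = p_pos * C[1][0] + p_neg * C[0][0]   # Cost(-)
        return "+" if cost_pos < cost_neg else "-"

    C = [[0, 1], [100, 0]]   # missing a rare + costs 100, a false alarm costs 1
    print(cost_sensitive_predict(0.2, C))   # '+' even though p(+|x) is only 0.2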

152
Sampling-based Approaches
  • Modify the distribution of training data so that
    rare class is well-represented in training set
  • Undersample the majority class
  • Oversample the rare class
  • Advantages and disadvantages