Data Mining Classification: Alternative Techniques

1
Data Mining Classification: Alternative Techniques
  • Lecture Notes for Chapter 5
  • Introduction to Data Mining
  • by
  • Tan, Steinbach, Kumar

2
Data Mining Classification: Alternative Techniques
  • Rule-based Classifier

3
Rule-Based Classifier
  • Classify records by using a collection of
    if...then rules
  • Rule: (Condition) → y
  • where
  • Condition is a conjunction of attribute tests
  • y is the class label
  • LHS: rule antecedent or condition
  • RHS: rule consequent
  • Examples of classification rules:
  • (Blood Type = Warm) ∧ (Lay Eggs = Yes) → Birds
  • (Taxable Income < 50K) ∧ (Refund = Yes) → Evade = No

4
Rule-based Classifier (Example)
r1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
r2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
r3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
r4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
r5: (Live in Water = sometimes) → Amphibians
5
Application of Rule-Based Classifier
  • A rule r covers an instance x if the attributes
    of the instance satisfy the condition of the rule

The rule r1 covers a hawk ⇒ Bird. The rule r3
covers the grizzly bear ⇒ Mammal.
6
Rule Coverage and Accuracy
  • Coverage of a rule
  • Fraction of records that satisfy the antecedent
    of a rule
  • Accuracy of a rule
  • Fraction of records that satisfy both the
    antecedent and consequent of a rule

(Status = Single) → No: Coverage = 40%,
Accuracy = 50%
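
A minimal sketch (not from the original slides) of computing these two measures; the records and the rule predicate below are hypothetical:

    # Coverage and accuracy of the rule (Status = Single) -> No
    records = [
        {"Status": "Single",   "Class": "No"},
        {"Status": "Single",   "Class": "Yes"},
        {"Status": "Married",  "Class": "No"},
        {"Status": "Divorced", "Class": "No"},
        {"Status": "Single",   "Class": "No"},
    ]

    def antecedent(r):                    # (Status = Single)
        return r["Status"] == "Single"

    covered = [r for r in records if antecedent(r)]
    coverage = len(covered) / len(records)          # fraction satisfying antecedent
    accuracy = sum(r["Class"] == "No" for r in covered) / len(covered)
    print(coverage, accuracy)                       # -> 0.6 0.666...
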
7
How does Rule-based Classifier Work?
A lemur triggers rule r3, so it is classified as
a mammal. A turtle triggers both r4 and r5. A
dogfish shark triggers none of the rules.
8
Characteristics of Rule-Based Classifier
  • Mutually exclusive rules
  • Classifier contains mutually exclusive rules if
    the rules are independent of each other
  • Every record is covered by at most one rule
  • Exhaustive rules
  • Classifier has exhaustive coverage if it accounts
    for every possible combination of attribute
    values
  • Each record is covered by at least one rule

9
From Decision Trees To Rules
Rules are mutually exclusive and exhaustive. The
rule set contains as much information as the tree.
10
Rules Can Be Simplified
Initial rule: (Refund = No) ∧ (Status = Married) → No
Simplified rule: (Status = Married) → No
11
Effect of Rule Simplification
  • Rules are no longer mutually exclusive
  • A record may trigger more than one rule
  • Solution?
  • Ordered rule set
  • Unordered rule set use voting schemes
  • Rules are no longer exhaustive
  • A record may not trigger any rules
  • Solution?
  • Use a default class

12
Ordered Rule Set
  • Rules are rank ordered according to their
    priority
  • An ordered rule set is known as a decision list
  • When a test record is presented to the classifier
  • It is assigned to the class label of the highest
    ranked rule it has triggered
  • If none of the rules fired, it is assigned to the
    default class

13
Rule Ordering Schemes
  • Rule-based ordering
  • Individual rules are ranked based on their
    quality/priority
  • Class-based ordering
  • Rules that belong to the same class appear
    together

14
Building Classification Rules
  • Direct Method
  • Extract rules directly from data
  • e.g., RIPPER, CN2, 1R, and AQ
  • Indirect Method
  • Extract rules from other classification models
    (e.g., decision trees, neural networks, SVM,
    etc.)
  • e.g., C4.5rules

15
Direct Method: Sequential Covering
16
Example of Sequential Covering
17
Example of Sequential Covering
18
Aspects of Sequential Covering
  • Rule Growing
  • Rule evaluation
  • Instance Elimination
  • Stopping Criterion
  • Rule Pruning

19
Rule Growing
  • Two common strategies

20
Rule Evaluation
  • Evaluation metric determines which conjunct
    should be added during rule growing
  • Accuracy = nc / n
  • Laplace = (nc + 1) / (n + k)
  • M-estimate = (nc + k p) / (n + k)

n: number of instances covered by rule
nc: number of instances of class c covered by rule
k: number of classes
p: prior probability
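
A small sketch of these three metrics in Python, using the symbols defined above (the counts below are illustrative):

    def rule_accuracy(nc, n):
        return nc / n

    def laplace(nc, n, k):
        return (nc + 1) / (n + k)

    def m_estimate(nc, n, k, p):
        return (nc + k * p) / (n + k)

    # e.g., a rule covering n = 10 instances, nc = 8 of the target class,
    # k = 2 classes, prior p = 0.5:
    print(rule_accuracy(8, 10), laplace(8, 10, 2), m_estimate(8, 10, 2, 0.5))
    # -> 0.8 0.75 0.75
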
21
Rule Growing (Examples)
  • CN2 Algorithm:
  • Start from an empty conjunct: {}
  • Add conjuncts that minimize the entropy measure:
    {A}, {A,B}, ...
  • Determine the rule consequent by taking the majority
    class of instances covered by the rule
  • RIPPER Algorithm:
  • Start from an empty rule: {} → class
  • Add conjuncts that maximize FOIL's information
    gain measure:
  • R0: {} → class (initial rule)
  • R1: {A} → class (rule after adding conjunct)
  • Gain(R0, R1) = t [ log2(p1 / (p1 + n1)) − log2(p0 / (p0 + n0)) ]
  • where t: number of positive instances covered by
    both R0 and R1
  • p0: number of positive instances covered by R0
  • n0: number of negative instances covered by R0
  • p1: number of positive instances covered by R1
  • n1: number of negative instances covered by R1
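A sketch of FOIL's information gain as defined above, assuming R1 refines R0 (so every instance covered by R1 is also covered by R0, and t = p1):

    from math import log2

    def foil_gain(p0, n0, p1, n1):
        t = p1                      # positives covered by both R0 and R1
        return t * (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0)))

    # Illustrative counts: R0 covers 100 pos / 90 neg, R1 covers 80 pos / 10 neg
    print(foil_gain(p0=100, n0=90, p1=80, n1=10))   # positive gain: R1 is purer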

22
Instance Elimination
  • Why do we need to eliminate instances?
  • Otherwise, the next rule is identical to the
    previous rule
  • Why do we remove positive instances?
  • Ensure that the next rule is different
  • Why do we remove negative instances?
  • Prevent underestimating accuracy of rule
  • Compare rules R2 and R3 in the diagram

23
Stopping Criterion and Rule Pruning
  • Examples of stopping criteria
  • If rule does not improve significantly after
    adding conjunct
  • If rule starts covering examples from another
    class
  • Rule Pruning
  • Similar to post-pruning of decision trees
  • Example using validation set (reduced error
    pruning)
  • Remove one of the conjuncts in the rule
  • Compare error rate on validation set before and
    after pruning
  • If error improves, prune the conjunct

24
Summary of Direct Method
  • Initial rule set is empty
  • Repeat
  • Grow a single rule
  • Remove instances covered by the rule
  • Prune the rule (if necessary)
  • Add rule to the current rule set
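
A minimal sketch of this loop; learn_one_rule and covers are hypothetical placeholders for the rule-growing procedure and the coverage test:

    def sequential_covering(instances, learn_one_rule, covers):
        rules = []
        while instances:
            rule = learn_one_rule(instances)    # grow (and possibly prune) one rule
            if rule is None:                    # stopping criterion
                break
            rules.append(rule)                  # add rule to the current rule set
            # remove instances covered by the rule
            instances = [x for x in instances if not covers(rule, x)]
        return rules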

25
Direct Method: RIPPER
  • For 2-class problem, choose one of the classes as
    positive class, and the other as negative class
  • Learn the rules for positive class
  • Use negative class as default
  • For multi-class problem
  • Order the classes according to increasing class
    prevalence (fraction of instances that belong to
    a particular class)
  • Learn the rule set for smallest class first,
    treat the rest as negative class
  • Repeat with next smallest class as positive class

26
Direct Method: RIPPER
  • Rule growing:
  • Start from an empty rule: {} → y
  • Add conjuncts as long as they improve FOIL's
    information gain
  • Stop when rule no longer covers negative examples
  • Prune the rule immediately using incremental
    reduced error pruning
  • Measure for pruning: v = (p − n) / (p + n)
  • p number of positive examples covered by the
    rule in the validation set
  • n number of negative examples covered by the
    rule in the validation set
  • Pruning method delete any final sequence of
    conditions that maximizes v

27
Direct Method: RIPPER
  • Building a Rule Set
  • Use sequential covering algorithm
  • Grow a rule to cover the current set of positive
    examples
  • Eliminate both positive and negative examples
    covered by the rule
  • Each time a rule is added to the rule set,
    compute the new description length
  • stop adding new rules when the new description
    length is d bits longer than the smallest
    description length obtained so far

28
Indirect Methods
29
Indirect Method: C4.5rules
  • Extract rules for every path from root to leaf
    nodes
  • For each rule, r: A → y,
  • consider an alternative rule r′: A′ → y, where A′ is
    obtained by removing one of the conjuncts in A
  • Compare the pessimistic error rate for r against
    all r′
  • Prune if one of the r′ has a lower pessimistic
    error rate
  • Repeat until the pessimistic error rate can no longer
    be improved

30
Indirect Method: C4.5rules
  • Use class-based ordering
  • Rules that predict the same class are grouped
    together into the same subset
  • Compute total description length for each class
  • Classes are ordered in increasing order of their
    total description length

31
Example
C4.5rules:
(Give Birth = No, Can Fly = Yes) → Birds
(Give Birth = No, Live in Water = Yes) → Fishes
(Give Birth = Yes) → Mammals
(Give Birth = No, Can Fly = No, Live in Water = No) → Reptiles
( ) → Amphibians
32
Characteristics of Rule-Based Classifiers
  • As highly expressive as decision trees
  • Easy to interpret
  • Easy to generate
  • Can classify new instances rapidly
  • Performance comparable to decision trees

33
Data Mining Classification: Alternative Techniques
  • Instance-Based Classifiers

34
Instance-Based Classifiers
  • Store the training records
  • Use training records to predict the class
    label of unseen cases

35
Instance Based Classifiers
  • Examples
  • Rote-learner
  • Memorizes entire training data and performs
    classification only if attributes of record match
    one of the training examples exactly
  • Nearest neighbor
  • Uses k closest points (nearest neighbors) for
    performing classification

36
Nearest Neighbor Classifiers
  • Basic idea
  • If it walks like a duck, quacks like a duck, then
    it's probably a duck

37
Nearest-Neighbor Classifiers
  • Requires three things
  • The set of stored records
  • Distance metric to compute distance between
    records
  • The value of k, the number of nearest neighbors
    to retrieve
  • To classify an unknown record
  • Compute distance to other training records
  • Identify k nearest neighbors
  • Use class labels of nearest neighbors to
    determine the class label of unknown record
    (e.g., by taking majority vote)

38
Definition of Nearest Neighbor
The k-nearest neighbors of a record x are the data
points that have the k smallest distances to x
39
1 nearest-neighbor
Voronoi Diagram
40
Nearest Neighbor Classification
  • Compute distance between two points
  • Example Euclidean distance
  • Determine the class from nearest neighbor list
  • take the majority vote of class labels among the
    k-nearest neighbors
  • Weigh the vote according to distance
  • weight factor: w = 1/d²
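
A self-contained sketch of k-NN with Euclidean distance and (optionally distance-weighted) majority voting; the training data is hypothetical:

    import math
    from collections import Counter

    def knn_predict(train, test_point, k=3, weighted=False):
        # train: list of (feature_vector, label) pairs
        nearest = sorted((math.dist(x, test_point), y) for x, y in train)[:k]
        votes = Counter()
        for d, y in nearest:
            votes[y] += 1.0 / (d * d + 1e-12) if weighted else 1.0  # w = 1/d^2
        return votes.most_common(1)[0][0]

    train = [((1.0, 1.0), "A"), ((1.2, 0.9), "A"),
             ((5.0, 5.0), "B"), ((4.8, 5.2), "B")]
    print(knn_predict(train, (1.1, 1.0), k=3))   # -> 'A'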

41
Nearest Neighbor Classification
  • Choosing the value of k
  • If k is too small, sensitive to noise points
  • If k is too large, neighborhood may include
    points from other classes

42
Nearest Neighbor Classification
  • Scaling issues
  • Attributes may have to be scaled to prevent
    distance measures from being dominated by one of
    the attributes
  • Example
  • height of a person may vary from 1.5m to 1.8m
  • weight of a person may vary from 90lb to 300lb
  • income of a person may vary from $10K to $1M

43
Nearest Neighbor Classification
  • Problem with Euclidean measure
  • High dimensional data
  • curse of dimensionality
  • Can produce counter-intuitive results

X1 = (1 1 1 1 1 1 1 1 1 1 1 0) vs X2 = (0 1 1 1 1 1 1 1 1 1 1 1): d = 1.4142
X3 = (1 0 0 0 0 0 0 0 0 0 0 0) vs X4 = (0 0 0 0 0 0 0 0 0 0 0 1): d = 1.4142
  • Solution: normalize the vectors to unit length
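
A quick sketch reproducing the example above: the raw distances are equal, but after normalization the first (mostly overlapping) pair becomes much closer than the second:

    import math

    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def unit(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    x1 = [1]*11 + [0]; x2 = [0] + [1]*11    # share ten 1s
    x3 = [1] + [0]*11; x4 = [0]*11 + [1]    # share no 1s
    print(dist(x1, x2), dist(x3, x4))                          # 1.4142 and 1.4142
    print(dist(unit(x1), unit(x2)), dist(unit(x3), unit(x4)))  # ~0.4264 vs 1.4142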

44
Nearest Neighbor Classification
  • k-NN classifiers are lazy learners
  • They do not build models explicitly
  • Unlike eager learners such as decision tree
    induction and rule-based systems
  • Classifying unknown records is relatively
    expensive

45
Example: PEBLS
  • PEBLS: Parallel Exemplar-Based Learning System
    (Cost & Salzberg)
  • Works with both continuous and nominal features
  • For nominal features, distance between two
    nominal values is computed using the modified value
    difference metric (MVDM)
  • Each record is assigned a weight factor
  • Number of nearest neighbors: k = 1

46
Example: PEBLS
Distance between nominal attribute values:
d(V1, V2) = Σc |n1c/n1 − n2c/n2|

d(Single, Married)   = |2/4 − 0/4| + |2/4 − 4/4| = 1
d(Single, Divorced)  = |2/4 − 1/2| + |2/4 − 1/2| = 0
d(Married, Divorced) = |0/4 − 1/2| + |4/4 − 1/2| = 1
d(Refund=Yes, Refund=No) = |0/3 − 3/7| + |3/3 − 4/7| = 6/7
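
A sketch of MVDM from per-class value counts, reproducing the numbers above (counts taken from the example's Marital Status columns):

    def mvdm(counts1, counts2):
        n1, n2 = sum(counts1.values()), sum(counts2.values())
        classes = set(counts1) | set(counts2)
        return sum(abs(counts1.get(c, 0) / n1 - counts2.get(c, 0) / n2)
                   for c in classes)

    single   = {"Yes": 2, "No": 2}
    married  = {"Yes": 0, "No": 4}
    divorced = {"Yes": 1, "No": 1}
    print(mvdm(single, married))    # -> 1.0
    print(mvdm(single, divorced))   # -> 0.0
    print(mvdm(married, divorced))  # -> 1.0
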
47
Example: PEBLS
Distance between record X and record Y:
Δ(X, Y) = wX wY Σi d(Xi, Yi)²
where wX is the number of times X is used for
prediction divided by the number of times X
predicts correctly:
wX ≈ 1 if X makes accurate predictions most of
the time; wX > 1 if X is not reliable for making
predictions
48
Data Mining Classification: Alternative Techniques
  • Bayesian Classifiers

49
Thomas Bayes
  • English minister and mathematician (1701–1761)

http://www-history.mcs.st-andrews.ac.uk/PictDisplay/Bayes.html
50
Bayes Classifier
  • A probabilistic framework for solving
    classification problems
  • Conditional probability: P(Y | X) = P(X, Y) / P(X)
  • Bayes theorem: P(Y | X) = P(X | Y) P(Y) / P(X)

51
Example of Bayes Theorem
  • Given
  • A doctor knows that meningitis causes stiff neck
    50% of the time
  • Prior probability of any patient having
    meningitis is 1/50,000
  • Prior probability of any patient having stiff
    neck is 1/20
  • If a patient has stiff neck, what's the
    probability he/she has meningitis?
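
Applying Bayes theorem to the numbers above:

    P(M | S) = P(S | M) P(M) / P(S) = (0.5 × 1/50,000) / (1/20) = 0.0002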

52
Using Bayes Theorem for Classification
  • Consider each attribute and class label as random
    variables
  • Given a record with attributes (X1, X2,, Xd)
  • Goal is to predict class Y
  • Specifically, we want to find the value of Y that
    maximizes P(Y X1, X2,, Xd )
  • Can we estimate P(Y X1, X2,, Xd ) directly from
    data?

53
Using Bayes Theorem for Classification
  • Approach:
  • Compute the posterior probability P(Y | X1, X2, ...,
    Xd) using the Bayes theorem
  • Maximum a-posteriori: choose Y that maximizes
    P(Y | X1, X2, ..., Xd)
  • Equivalent to choosing the value of Y that maximizes
    P(X1, X2, ..., Xd | Y) P(Y)
  • How to estimate P(X1, X2, ..., Xd | Y)?

54
Naïve Bayes Classifier
  • Assume independence among attributes Xi when
    class is given:
  • P(X1, X2, ..., Xd | Yj) = P(X1 | Yj) P(X2 | Yj) ... P(Xd | Yj)
  • Can estimate P(Xi | Yj) for all Xi and Yj from
    data
  • New point is classified to Yj if P(Yj) ∏i P(Xi | Yj)
    is maximal

55
Conditional Independence
  • X and Y are conditionally independent given Z if
    P(X | Y, Z) = P(X | Z)
  • Example Arm length and reading skills
  • Young child has shorter arm length and limited
    reading skills, compared to adults
  • If age is fixed, no apparent relationship between
    arm length and reading skills
  • Arm length and reading skills are conditionally
    independent given age

56
Estimate Probabilities from Data
  • Class prior: P(Y) = Nc / N
  • e.g., P(No) = 7/10, P(Yes) = 3/10
  • For discrete attributes: P(Xi | Yk) = |Xik| / Nc
  • where |Xik| is the number of instances having
    attribute value Xi and belonging to class Yk
  • Examples:
  • P(Status=Married | No) = 4/7, P(Refund=Yes | Yes) = 0
57
Estimate Probabilities from Data
  • For continuous attributes
  • Discretization Partition the range into bins
  • Replace continuous value with bin value
  • Attribute changed from continuous to ordinal
  • Probability density estimation
  • Assume attribute follows a normal distribution
  • Use data to estimate parameters of distribution
    (e.g., mean and standard deviation)
  • Once probability distribution is known, use it to
    estimate the conditional probability P(XiY)

58
Estimate Probabilities from Data
  • Normal distribution:
    P(Xi | Yj) = 1/sqrt(2π σij²) · exp(−(Xi − μij)² / (2 σij²))
  • One for each (Xi, Yj) pair
  • For (Income, Class=No):
  • If Class = No:
  • sample mean μ = 110
  • sample variance σ² = 2975
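
A one-line sketch evaluating this density for Income = 120 with the sample statistics above:

    import math

    def normal_pdf(x, mean, var):
        return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

    print(normal_pdf(120, 110, 2975))   # ~0.0072, the value used on the next slide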

59
Example of Naïve Bayes Classifier
Given a test record: X = (Refund = No, Married, Income = 120K)
  • P(X | Class=No) = P(Refund=No | Class=No)
    × P(Married | Class=No)
    × P(Income=120K | Class=No)
    = 4/7 × 4/7 × 0.0072 = 0.0024
  • P(X | Class=Yes) = P(Refund=No | Class=Yes)
    × P(Married | Class=Yes)
    × P(Income=120K | Class=Yes)
    = 1 × 0 × 1.2 × 10⁻⁹ = 0
  • Since P(X | No) P(No) > P(X | Yes) P(Yes),
  • therefore P(No | X) > P(Yes | X) ⇒ Class = No

60
Naïve Bayes Classifier
  • If one of the conditional probabilities is zero,
    then the entire expression becomes zero
  • Probability estimation:
  • Original: P(Ai | C) = Nic / Nc
  • Laplace: P(Ai | C) = (Nic + 1) / (Nc + c)
  • m-estimate: P(Ai | C) = (Nic + m p) / (Nc + m)

c: number of classes
p: prior probability
m: parameter
Nic: number of instances with attribute value Ai in class C
Nc: number of instances in class C
61
Example of Naïve Bayes Classifier
A: attributes, M: mammals, N: non-mammals
Since P(A | M) P(M) > P(A | N) P(N) ⇒ Mammals
62
Naïve Bayes (Summary)
  • Robust to isolated noise points
  • Handle missing values by ignoring the instance
    during probability estimate calculations
  • Robust to irrelevant attributes
  • Independence assumption may not hold for some
    attributes
  • Use other techniques such as Bayesian Belief
    Networks (BBN)

63
Bayesian Belief Networks
  • Provides graphical representation of
    probabilistic relationships among a set of random
    variables
  • Consists of
  • A directed acyclic graph (dag)
  • Node corresponds to a variable
  • Arc corresponds to dependence relationship
    between a pair of variables
  • A probability table associating each node to its
    immediate parents

64
Conditional Independence
D is a parent of C; A is a child of C; B is a
descendant of D; D is an ancestor of A
  • A node in a Bayesian network is conditionally
    independent of all of its nondescendants, if its
    parents are known

65
Conditional Independence
  • Naïve Bayes assumption:
    P(X1, X2, ..., Xd | y) = P(X1 | y) P(X2 | y) ... P(Xd | y)

66
Probability Tables
  • If X does not have any parents, table contains
    prior probability P(X)
  • If X has only one parent (Y), table contains
    conditional probability P(X | Y)
  • If X has multiple parents (Y1, Y2, ..., Yk), table
    contains conditional probability P(X | Y1, Y2, ..., Yk)

67
Example of Bayesian Belief Network
68
Example of Inferencing using BBN
  • Given X = (E = No, D = Yes, CP = Yes, BP = High)
  • Compute P(HD | E, D, CP, BP)
  • P(HD=Yes | E=No, D=Yes) = 0.55; P(CP=Yes | HD=Yes)
    = 0.8; P(BP=High | HD=Yes) = 0.85
  • P(HD=Yes | E=No, D=Yes, CP=Yes, BP=High) ∝ 0.55
    × 0.8 × 0.85 = 0.374
  • P(HD=No | E=No, D=Yes) = 0.45; P(CP=Yes | HD=No)
    = 0.01; P(BP=High | HD=No) = 0.2
  • P(HD=No | E=No, D=Yes, CP=Yes, BP=High) ∝ 0.45
    × 0.01 × 0.2 = 0.0009

⇒ Classify X as HD = Yes
69
Data Mining Classification: Alternative Techniques
  • Artificial Neural Networks

70
Artificial Neural Networks (ANN)
Output Y is 1 if at least two of the three inputs
are equal to 1.
71
Artificial Neural Networks (ANN)
72
Artificial Neural Networks (ANN)
  • Model is an assembly of inter-connected nodes and
    weighted links
  • Output node sums its input values, weighted by
    the links
  • Compare output node against some threshold t

Perceptron Model
73
General Structure of ANN
Training ANN means learning the weights of the
neurons
74
Artificial Neural Networks (ANN)
  • Various types of neural network topology
  • single-layered network (perceptron) versus
    multi-layered network
  • Feed-forward versus recurrent network
  • Various types of activation functions (g)

75
Perceptron
  • Single layer network
  • Contains only input and output nodes
  • Activation function: g = sign(w · x)
  • Applying the model is straightforward:
  • X1 = 1, X2 = 0, X3 = 1 ⇒ y = sign(0.2) = 1

76
Perceptron Learning Rule
  • Initialize the weights (w0, w1, ..., wd)
  • Repeat
  • For each training example (xi, yi)
  • Compute f(w, xi)
  • Update the weights
  • Until stopping condition is met
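
A minimal sketch of this procedure (λ is the learning rate; the AND data set below is illustrative):

    def sign(z):
        return 1 if z >= 0 else -1

    def train_perceptron(data, d, lam=0.1, epochs=100):
        w = [0.0] * (d + 1)                      # w[0] acts as the bias weight
        for _ in range(epochs):
            converged = True
            for x, y in data:                    # y in {-1, +1}
                xe = [1.0] + list(x)             # prepend 1 for the bias term
                f = sign(sum(wi * xi for wi, xi in zip(w, xe)))
                if f != y:                       # error e = y - f is +2 or -2
                    converged = False
                    w = [wi + lam * (y - f) * xi for wi, xi in zip(w, xe)]
            if converged:                        # stopping condition: no errors
                break
        return w

    data = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]  # AND
    print(train_perceptron(data, d=2))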

77
Perceptron Learning Rule
  • Weight update formula:
    wj ← wj + λ (yi − f(w, xi)) xij, where λ is the learning rate
  • Intuition:
  • Update weight based on error e = y − f(x, w)
  • If y = f(x, w), e = 0: no update needed
  • If y > f(x, w), e = 2: weight must be increased so
    that f(x, w) will increase
  • If y < f(x, w), e = −2: weight must be decreased so
    that f(x, w) will decrease

78
Example of Perceptron Learning
79
Perceptron Learning Rule
  • Since f(w,x) is a linear combination of input
    variables, decision boundary is linear
  • For nonlinearly separable problems, perceptron
    learning algorithm will fail because no linear
    hyperplane can separate the data perfectly

80
Nonlinearly Separable Data
XOR Data
81
Multilayer Neural Network
  • Hidden layers
  • intermediary layers between input and output layers
  • Hidden units (nodes)
  • nodes embedded in hidden layers
  • More general activation functions (sigmoid,
    linear, etc.)

82
Multi-layer Neural Network
  • Multi-layer neural network can solve any type of
    classification task involving nonlinear decision
    surfaces

XOR Data
83
Learning Multi-layer Neural Network
  • Can we apply perceptron learning rule to each
    node, including hidden nodes?
  • Perceptron learning rule computes the error term
    e = y − f(w, x) and updates weights accordingly
  • Problem: how to determine the true value of y
    for hidden nodes?
  • Approximate error in hidden nodes by error in the
    output nodes
  • Problems:
  • Not clear how adjustment in the hidden nodes
    affects overall error
  • No guarantee of convergence to optimal solution

84
Gradient Descent for Multilayer NN
  • Weight update: wj ← wj − λ ∂E/∂wj
  • Error function: E = ½ Σi (yi − f(w, xi))²
  • Activation function f must be differentiable
  • For sigmoid function: σ(z) = 1 / (1 + e^(−z)),
    σ′(z) = σ(z) (1 − σ(z))
  • Stochastic gradient descent (update the weights
    immediately after each example)

85
Gradient Descent for MultiLayer NN
  • For output neurons, weight update formula is the
    same as before (gradient descent for perceptron)
  • For hidden neurons

86
Design Issues in ANN
  • Number of nodes in input layer
  • One input node per binary/continuous attribute
  • k or log2 k nodes for each categorical attribute
    with k values
  • Number of nodes in output layer
  • One output for binary class problem
  • k or log2 k nodes for k-class problem
  • Number of nodes in hidden layer
  • Initial weights and biases

87
Characteristics of ANN
  • Multilayer ANNs are universal approximators
  • Can handle redundant attributes because weights
    are automatically learnt
  • Gradient descent may converge to local minimum
  • Model building can be very time consuming, but
    testing can be very fast

88
Data Mining Classification: Alternative Techniques
  • Support Vector Machines

89
Support Vector Machines
  • Find a linear hyperplane (decision boundary) that
    will separate the data

90
Support Vector Machines
  • One Possible Solution

91
Support Vector Machines
  • Another possible solution

92
Support Vector Machines
  • Other possible solutions

93
Support Vector Machines
  • Which one is better? B1 or B2?
  • How do you define better?

94
Support Vector Machines
  • Find the hyperplane that maximizes the margin
    ⇒ B1 is better than B2

95
SVM vs Perceptron
Perceptron: minimizes least square error (gradient descent)
SVM: maximizes margin
96
Support Vector Machines
97
Learning Linear SVM
  • We want to maximize: margin = 2 / ||w||
  • Which is equivalent to minimizing: L(w) = ||w||² / 2
  • But subject to the following constraints:
    w · xi + b ≥ 1 if yi = +1, w · xi + b ≤ −1 if yi = −1
  • or, equivalently: yi (w · xi + b) ≥ 1
  • This is a constrained optimization problem
  • Solve it using the Lagrange multiplier method
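
In practice the quadratic program is rarely solved by hand; a sketch using scikit-learn's SVC (assuming it is available) on hypothetical toy data:

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[1, 1], [2, 2], [2, 0], [0, 0], [1, 0], [0, 1]])
    y = np.array([1, 1, 1, -1, -1, -1])

    clf = SVC(kernel="linear", C=1e6)     # very large C approximates a hard margin
    clf.fit(X, y)
    print(clf.coef_, clf.intercept_)      # w and b
    print(clf.support_vectors_)           # points with nonzero Lagrange multipliers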

98
Learning Linear SVM
  • Lagrange multiplier:
    L = ||w||²/2 − Σi λi [yi (w · xi + b) − 1]
  • Take derivatives w.r.t. w and b:
    w = Σi λi yi xi, Σi λi yi = 0
  • Additional constraints: λi ≥ 0
  • Dual problem: maximize
    LD = Σi λi − ½ Σi Σj λi λj yi yj (xi · xj)
99
Learning Linear SVM
  • Bigger picture:
  • Learning algorithm needs to find w and b
  • To solve for w: w = Σi λi yi xi
  • But:
  • λi is zero for points that do not reside on the
    margin hyperplanes w · x + b = ±1
  • Data points whose λi's are not zero are called
    support vectors

100
Example of Linear SVM
Support vectors
101
Learning Linear SVM
  • Bigger picture:
  • Decision boundary depends only on support vectors
  • If you have a data set with the same support vectors,
    the decision boundary will not change
  • How to classify using SVM once w and b are found?
    Given a test record xi, classify by the sign of
    f(xi) = w · xi + b

102
Support Vector Machines
  • What if the problem is not linearly separable?

103
Support Vector Machines
  • What if the problem is not linearly separable?
  • Introduce slack variables ξi
  • Need to minimize: L(w) = ||w||²/2 + C Σi ξi^k
  • Subject to: yi (w · xi + b) ≥ 1 − ξi, ξi ≥ 0
  • If k is 1 or 2, this leads to the same objective
    function as linear SVM but with different
    constraints (see textbook)

104
Nonlinear Support Vector Machines
  • What if decision boundary is not linear?

105
Nonlinear Support Vector Machines
  • Trick: transform data into a higher-dimensional
    space

Decision boundary
106
Learning NonLinear SVM
  • Optimization problem
  • Which leads to the same set of equations (but
    involving Φ(x) instead of x)

107
Learning NonLinear SVM
  • Issues:
  • What type of mapping function Φ should be used?
  • How to do the computation in high-dimensional
    space?
  • Most computations involve the dot product Φ(xi) · Φ(xj)
  • Curse of dimensionality?

108
Learning Nonlinear SVM
  • Kernel Trick:
  • Φ(xi) · Φ(xj) = K(xi, xj)
  • K(xi, xj) is a kernel function (expressed in
    terms of the coordinates in the original space)
  • Examples:
  • Polynomial: K(x, y) = (x · y + 1)^p
  • Gaussian (RBF): K(x, y) = exp(−||x − y||² / (2σ²))
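
A sketch verifying the kernel trick numerically for the degree-2 polynomial kernel, whose explicit 2-D mapping is Φ(x) = (x1², x2², √2 x1x2, √2 x1, √2 x2, 1):

    import math

    def K(x, y):                                 # kernel in the original space
        return (x[0]*y[0] + x[1]*y[1] + 1) ** 2

    def phi(x):                                  # explicit high-dimensional map
        r2 = math.sqrt(2)
        return [x[0]**2, x[1]**2, r2*x[0]*x[1], r2*x[0], r2*x[1], 1.0]

    x, y = (1.0, 2.0), (3.0, -1.0)
    print(K(x, y))                                    # 4.0
    print(sum(a*b for a, b in zip(phi(x), phi(y))))   # 4.0, identical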

109
Example of Nonlinear SVM
SVM with polynomial degree 2 kernel
110
Learning Nonlinear SVM
  • Advantages of using kernel:
  • Don't have to know the mapping function Φ
  • Computing the dot product Φ(xi) · Φ(xj) in the
    original space avoids the curse of dimensionality
  • Not all functions can be kernels:
  • Must make sure there is a corresponding Φ in some
    high-dimensional space
  • Mercer's theorem (see textbook)

111
Data Mining Classification: Alternative Techniques
  • Ensemble Methods

112
Ensemble Methods
  • Construct a set of classifiers from the training
    data
  • Predict class label of test records by combining
    the predictions made by multiple classifiers

113
Why Ensemble Methods work?
  • Suppose there are 25 base classifiers
  • Each classifier has error rate ε = 0.35
  • Assume errors made by classifiers are
    uncorrelated
  • Probability that the majority-vote ensemble makes a
    wrong prediction (13 or more classifiers wrong):
    P(wrong) = Σi=13..25 C(25, i) ε^i (1 − ε)^(25−i) ≈ 0.06
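
The binomial sum above can be checked directly:

    from math import comb

    eps, n = 0.35, 25
    p_wrong = sum(comb(n, i) * eps**i * (1 - eps)**(n - i)
                  for i in range(13, n + 1))    # majority (13+) of 25 wrong
    print(p_wrong)   # ~0.06, far below the individual error rate of 0.35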

114
General Approach
115
Types of Ensemble Methods
  • Bayesian ensemble
  • Example: mixture of Gaussians
  • Manipulate data distribution
  • Example: resampling methods
  • Manipulate input features
  • Example: feature subset selection
  • Manipulate class labels
  • Example: error-correcting output coding
  • Introduce randomness into the learning algorithm
  • Example: random forests

116
Bagging
  • Sampling with replacement
  • Build classifier on each bootstrap sample
  • Each instance has probability 1 − (1 − 1/n)^n of
    being selected in a given bootstrap sample
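
A sketch of the bootstrap loop; base_learner is a hypothetical placeholder for any classifier-training function:

    import random

    def bagging(data, base_learner, num_models=10):
        n = len(data)
        models = []
        for _ in range(num_models):
            sample = [random.choice(data) for _ in range(n)]  # with replacement
            models.append(base_learner(sample))
        return models

    # Probability an instance appears in a given bootstrap sample:
    n = 1000
    print(1 - (1 - 1/n) ** n)   # ~0.632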

117
Bagging Algorithm
118
Bagging Example
  • Consider a 1-dimensional data set
  • Classifier is a decision stump:
  • Decision rule: x ≤ k versus x > k
  • Split point k is chosen based on entropy
  • Stump: if x ≤ k predict y_left, else predict y_right
119
Bagging Example
120
Bagging Example
121
Bagging Example
  • Summary of Training sets

122
Bagging Example
  • Assume test set is the same as the original data
  • Use majority vote to determine class of ensemble
    classifier

Predicted Class
123
Boosting
  • An iterative procedure to adaptively change
    distribution of training data by focusing more on
    previously misclassified records
  • Initially, all N records are assigned equal
    weights
  • Unlike bagging, weights may change at the end of
    each boosting round

124
Boosting
  • Records that are wrongly classified will have
    their weights increased
  • Records that are classified correctly will have
    their weights decreased
  • Example 4 is hard to classify
  • Its weight is increased, therefore it is more
    likely to be chosen again in subsequent rounds

125
AdaBoost
  • Base classifiers: C1, C2, ..., CT
  • Error rate of classifier Ci:
    εi = (1/N) Σj wj δ(Ci(xj) ≠ yj)
  • Importance of a classifier:
    αi = ½ ln((1 − εi) / εi)

126
AdaBoost Algorithm
  • Weight update:
    wj ← (wj / Zi) × exp(−αi) if Ci(xj) = yj,
    (wj / Zi) × exp(αi) if Ci(xj) ≠ yj,
    where Zi is a normalization factor
  • If any intermediate round produces an error rate
    higher than 50%, the weights are reverted back to
    1/n and the resampling procedure is repeated
  • Classification:
    C*(x) = arg max_y Σi αi δ(Ci(x) = y)
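
A sketch of one boosting round using the formulas above (the instance weights and correctness flags are illustrative):

    import math

    def adaboost_round(weights, correct):
        # correct[j] is True if the current classifier got instance j right
        eps = sum(w for w, c in zip(weights, correct) if not c) / sum(weights)
        alpha = 0.5 * math.log((1 - eps) / eps)
        new_w = [w * math.exp(-alpha if c else alpha)
                 for w, c in zip(weights, correct)]
        z = sum(new_w)                          # normalization factor Z
        return [w / z for w in new_w], alpha

    w, flags = [0.25] * 4, [True, True, True, False]
    print(adaboost_round(w, flags))   # the misclassified instance gains weight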

127
AdaBoost Algorithm
128
AdaBoost Example
  • Consider a 1-dimensional data set
  • Classifier is a decision stump:
  • Decision rule: x ≤ k versus x > k
  • Split point k is chosen based on entropy
  • Stump: if x ≤ k predict y_left, else predict y_right
129
AdaBoost Example
  • Training sets for the first 3 boosting rounds
  • Summary

130
AdaBoost Example
  • Weights
  • Classification

Predicted Class
131
Data Mining Classification: Alternative Techniques
  • Imbalanced Class Problem

132
Class Imbalance Problem
  • Lots of classification problems where the classes
    are skewed (more records from one class than
    another)
  • Credit card fraud
  • Intrusion detection
  • Defective products in manufacturing assembly line

133
Challenges
  • Evaluation measures such as accuracy are not
    well-suited for imbalanced classes
  • Detecting the rare class is like finding needle
    in a haystack

134
Confusion Matrix
  • Confusion Matrix

a: TP (true positive), b: FN (false negative),
c: FP (false positive), d: TN (true negative)
135
Accuracy
  • Most widely-used metric:
    Accuracy = (a + d) / (a + b + c + d)
             = (TP + TN) / (TP + TN + FP + FN)
136
Problem with Accuracy
  • Consider a 2-class problem
  • Number of Class 0 examples 9990
  • Number of Class 1 examples 10
  • If a model predicts everything to be class 0,
    accuracy is 9990/10000 = 99.9%
  • This is misleading because the model does not
    detect any class 1 example
  • Detecting the rare class is usually more
    interesting (e.g., frauds, intrusions, defects,
    etc)

137
Alternative Measures
  • Precision: p = TP / (TP + FP)
  • Recall: r = TP / (TP + FN)
  • F-measure: F = 2rp / (r + p) = 2TP / (2TP + FP + FN)
138
ROC (Receiver Operating Characteristic)
  • A graphical approach for displaying trade-off
    between detection rate and false alarm rate
  • Developed in 1950s for signal detection theory to
    analyze noisy signals
  • ROC curve plots TPR against FPR
  • Performance of a model represented as a point in
    an ROC curve
  • Changing the threshold parameter of classifier
    changes the location of the point

139
ROC Curve
  • (TPR,FPR)
  • (0,0) declare everything to be
    negative class
  • (1,1) declare everything to be positive
    class
  • (1,0) ideal
  • Diagonal line
  • Random guessing
  • Below diagonal line
  • prediction is opposite of the true class

140
ROC (Receiver Operating Characteristic)
  • To draw ROC curve, classifier must produce
    continuous-valued output
  • Outputs are used to rank test records, from the
    most likely positive class record to the least
    likely positive class record
  • Many classifiers produce only discrete outputs
    (i.e., predicted class)
  • How to get continuous-valued outputs?
  • Decision trees, rule-based classifiers, neural
    networks, Bayesian classifiers, k-nearest
    neighbors, SVM

141
Example Decision Trees
Decision Tree
Continuous-valued outputs
142
ROC Curve Example
143
ROC Curve Example
  • 1-dimensional data set containing 2 classes
    (positive and negative)
  • Any point located at x > t is classified as positive
144
Using ROC for Model Comparison
  • Neither model consistently outperforms the other
  • M1 is better for small FPR
  • M2 is better for large FPR
  • Area Under the ROC Curve (AUC)
  • Ideal: area = 1
  • Random guess: area = 0.5

145
How to Construct an ROC curve
  • Use a classifier that produces a continuous-valued
    output for each test instance: score(A)
  • Sort the instances according to score(A) in
    decreasing order
  • Apply a threshold at each unique value of
    score(A)
  • Count the number of TP, FP, TN, FN at each
    threshold
  • TPR = TP / (TP + FN)
  • FPR = FP / (FP + TN)
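
A sketch of this construction (for simplicity the threshold is lowered one record at a time rather than per unique score; the scores below are illustrative):

    def roc_points(scores, labels):             # labels: True = positive
        order = sorted(range(len(scores)), key=lambda i: -scores[i])
        P = sum(labels); N = len(labels) - P
        tp = fp = 0
        points = [(0.0, 0.0)]                   # threshold above every score
        for i in order:
            if labels[i]: tp += 1
            else:         fp += 1
            points.append((fp / N, tp / P))     # (FPR, TPR)
        return points

    scores = [0.95, 0.93, 0.87, 0.85, 0.76, 0.53, 0.43, 0.25]
    labels = [True, True, False, True, False, True, False, False]
    print(roc_points(scores, labels))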

146
How to construct an ROC curve
Threshold ≥
ROC Curve
147
Handling Class Imbalanced Problem
  • Class-based ordering (e.g. RIPPER)
  • Rules for rare class have higher priority
  • Cost-sensitive classification
  • Misclassifying rare class as majority class is
    more expensive than misclassifying majority as
    rare class
  • Sampling-based approaches

148
Cost Matrix
C(i, j): cost of misclassifying a class i example as
class j
149
Computing Cost of Classification
Accuracy = 80%, Cost = 3910
Accuracy = 90%, Cost = 4255
150
Cost Sensitive Classification
  • Example: Bayesian classifier
  • Given a test record x:
  • Compute p(i | x) for each class i
  • Decision rule: classify x as class k if p(k | x) is
    maximal
  • For 2-class, classify x as + if p(+ | x) > p(− | x)
  • This decision rule implicitly assumes that
    C(+,+) = C(−,−) = 0 and C(+,−) = C(−,+)

151
Cost Sensitive Classification
  • General decision rule:
  • Classify test record x as the class k that minimizes
    the expected cost
  • 2-class:
  • Cost(+) = p(+ | x) C(+,+) + p(− | x) C(−,+)
  • Cost(−) = p(+ | x) C(+,−) + p(− | x) C(−,−)
  • Decision rule: classify x as + if Cost(+) < Cost(−)
  • if C(+,+) = C(−,−) = 0, this reduces to classifying
    x as + if p(+ | x) C(+,−) > p(− | x) C(−,+)
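
A sketch of this rule; the cost matrix below is hypothetical:

    def cost_sensitive_predict(p_pos, C):
        # C[i][j]: cost of predicting class j for a true class i (0 = -, 1 = +)
        p_neg = 1 - p_pos
        cost_pos = p_pos * C[1][1] + p_neg * C[0][1]   # Cost(+)
        cost_neg = p_pos * C[1][0] + p_neg * C[0][0]   # Cost(-)
        return "+" if cost_pos < cost_neg else "-"

    C = [[0, 1], [100, 0]]   # missing a rare + costs 100, a false alarm costs 1
    print(cost_sensitive_predict(0.2, C))   # '+' even though p(+|x) is only 0.2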

152
Sampling-based Approaches
  • Modify the distribution of training data so that
    rare class is well-represented in training set
  • Undersample the majority class
  • Oversample the rare class
  • Advantages and disadvantages