Transcript and Presenter's Notes

Title: Business Systems Intelligence: 5. Classification 2


1
Business Systems Intelligence: 5. Classification 2
Dr. Brian Mac Namee (www.comp.dit.ie/bmacnamee)
2
Acknowledgments
  • These notes are based (heavily) on those
    provided by the authors to accompany Data
    Mining: Concepts and Techniques by Jiawei Han
    and Micheline Kamber
  • Some slides are also based on trainers' kits
    provided by SAS

More information about the book is available at
www-sal.cs.uiuc.edu/hanj/bk2/ and information on
SAS is available at www.sas.com
3
Classification &amp; Prediction
  • Today we will look at:
  • What are classification and prediction?
  • Issues regarding classification and prediction
  • Classification techniques
    • Case based reasoning (k-nearest neighbour
      algorithm)
    • Decision tree induction
    • Bayesian classification
    • Neural networks
    • Support vector machines (SVM)
    • Classification based on association rule
      mining concepts
    • Other classification methods
  • Prediction
  • Classification accuracy

4
Classification
  • Classification
  • Predicts categorical class labels
  • Typical Applications
  • CreditHistory, Salary -> CreditApproval
    (Yes/No)
  • Temp, Humidity -> Rain (Yes/No)

5
Linear Classification
  • Binary Classification problem
  • The data above the red line belongs to class x
  • The data below red line belongs to class o
  • Examples: SVM, Perceptron, Probabilistic
    Classifiers
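The slide's figure is not in the transcript; below is a minimal sketch of one of the named examples, a perceptron, learning a linear boundary on toy 2-D data (the data and learning rate are illustrative assumptions).

# A minimal perceptron sketch: learn a line w.x + b = 0 separating two classes.
def train_perceptron(points, labels, epochs=100, lr=0.1):
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x1, x2), y in zip(points, labels):
            if y * (w[0] * x1 + w[1] * x2 + b) <= 0:  # misclassified?
                w[0] += lr * y * x1                   # nudge the boundary
                w[1] += lr * y * x2                   # towards this example
                b += lr * y
    return w, b

# Class x (+1) above the boundary, class o (-1) below it
points = [(1.0, 2.0), (2.0, 3.0), (3.0, 3.0), (1.0, 0.0), (2.0, 1.0), (3.0, 0.0)]
labels = [1, 1, 1, -1, -1, -1]
w, b = train_perceptron(points, labels)
print(w, b)  # parameters of the learned decision boundary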

6
Discriminative Classifiers
  • Advantages
  • Prediction accuracy is generally high
  • Robust, works when training examples contain
    errors
  • Fast evaluation of the learned target function
  • Criticism
  • Long training time
  • Difficult to understand the learned function
    (weights)
  • Not easy to incorporate domain knowledge

7
Artificial Neural Networks
  • A biologically inspired classification
    technique
  • Formed from interconnected layers of simple
    artificial neurons
  • ANN history
  • 1943 McCulloch &amp; Pitts
  • 1959 Rosenblatt (Perceptron)
  • 1959 Widrow &amp; Hoff (ADALINE and MADALINE)
  • 1969 Marvin Minsky and Seymour Papert's
    Perceptrons
  • 1974 Werbos (Backprop)
  • 1982 John Hopfield

8
An Artificial Neuron
  • The n-dimensional input vector x is mapped into
    variable y by means of the scalar product and a
    nonlinear function mapping
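The slide's formula image is missing from the transcript; in symbols, y = f(sum_i(w_i * x_i) + b) for weights w, bias b, and a nonlinear activation f. A minimal sketch, assuming a sigmoid activation:

import math

def neuron(x, w, b):
    # Scalar product of inputs and weights, plus bias, gives the net input
    net = sum(wi * xi for wi, xi in zip(w, x)) + b
    # Nonlinear function mapping: a sigmoid squashes net into (0, 1)
    return 1.0 / (1.0 + math.exp(-net))

print(neuron([0.5, -1.2, 3.0], [0.4, 0.1, -0.2], b=0.1))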

9
ANN Multi-Layer Perceptrons (MLPs)
  • Multi Layer Perceptrons (MLPs) are one of the
    best known ANN types
  • Composed of layers of fully interconnected
    artificial neurons
  • Training involves repeatedly presenting a series
    of training cases to the network and adjusting
    neurons' weights and biases to minimise
    classification error
  • Typically the backpropagation of error algorithm
    is used for training

10
MLP Example
  • Remember our surfing example
  • An MLP can be built and trained to perform
    classification for this problem

11
Network Training
  • The ultimate objective of training
  • Obtain a set of weights that classifies almost
    all of the tuples in the training data correctly
  • Steps
  • Initialize weights with random values
  • Feed the input tuples into the network one by one
  • For each unit
  • Compute the net input to the unit as a linear
    combination of all the inputs to the unit
  • Compute the output value using the activation
    function
  • Compute the error
  • Update the weights and the bias
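A minimal sketch of these steps for a single sigmoid unit (the delta rule); a full MLP additionally backpropagates the error through its hidden layers. The toy data, epoch count, and learning rate are assumptions.

import math, random

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

def train_unit(tuples, epochs=2000, lr=0.5, seed=1):
    rng = random.Random(seed)
    n = len(tuples[0][0])
    w = [rng.uniform(-0.5, 0.5) for _ in range(n)]  # initialise weights randomly
    b = rng.uniform(-0.5, 0.5)
    for _ in range(epochs):
        for x, target in tuples:                    # feed tuples one by one
            net = sum(wi * xi for wi, xi in zip(w, x)) + b  # net input
            out = sigmoid(net)                      # output via activation function
            err = (target - out) * out * (1 - out)  # error term
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]  # update weights
            b += lr * err                           # update the bias
    return w, b

w, b = train_unit([([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)])  # learns AND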

12
Summary of ANN Classification
  • Strengths
  • Fast classification
  • Very good generalization capacity
  • Weaknesses
  • No explanation capability (black box)
  • Training can be slow (eager learning)
  • Retraining is difficult
  • Lots of other network types, but MLP is probably
    the most common

13
Support Vector Machines (SVM)
  • In classification problems we try to create
    decision boundaries between classes
  • A choice must be made between possible boundaries

[Figure: examples of Class 1 and Class 2 separated by several possible decision boundaries]
14
SVMs (cont)
  • The decision boundary should be as far away from
    the data of both classes as possible

15
Margins
16
Linear Support Vector Machine
  • Given a set of points xi with class labels
    yi ∈ {-1, 1}
  • The SVM finds a hyperplane defined by the pair
    (w, b), where w is the normal to the plane and
    b/||w|| is its distance from the origin
  • Where
  • x - feature vector
  • y - class label
  • w - weight vector (the margin is 2/||w||)
  • b - bias

17
SVMs: The Clever Bit!
  • What about when classes are not linearly
    separable?
  • Kernel functions and the kernel trick are used to
    transform data into a different linearly
    separable feature space

18
SVMs: The Clever Bit! (cont...)
  • What if the data is not linearly separable?
  • Project the data to high dimensional space where
    it is linearly separable and then we can use
    linear SVM (Using Kernels)
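A minimal sketch of two common kernel functions; the slides do not name specific kernels, so the RBF and polynomial kernels below are standard illustrative choices:

import math

def rbf_kernel(x, z, gamma=1.0):
    # exp(-gamma * ||x - z||^2): a dot product in an implicit,
    # infinite-dimensional feature space
    return math.exp(-gamma * sum((xi - zi) ** 2 for xi, zi in zip(x, z)))

def poly_kernel(x, z, degree=2):
    # (x.z + 1)^d: a dot product under an explicit polynomial feature map
    return (sum(xi * zi for xi, zi in zip(x, z)) + 1) ** degree

print(rbf_kernel([1.0, 0.0], [0.0, 1.0]), poly_kernel([1.0, 0.0], [0.0, 1.0]))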

19
SVM Example
Example of Non-linear SVM
20
SVM Example (cont)
Results
21
Summary of SVM Classification
  • Strengths
  • Over-fitting is not common
  • Works well with high dimensional data
  • Fast classification
  • Good generalization capacity
  • Weaknesses
  • Retraining is difficult
  • No explanation capability
  • Slow training
  • At the cutting edge of machine learning

22
SVM vs. ANN
  • SVM
  • Relatively new concept
  • Nice generalization properties
  • Hard to learn - learned in batch mode using
    quadratic programming techniques
  • Using kernels can learn very complex functions
  • ANN
  • Quite old
  • Generalizes well but doesn't have strong
    mathematical foundation
  • Can easily be learned in incremental fashion
  • To learn complex functions use multilayer
    perceptron (not that trivial)

23
SVM Related Links
  • http://svm.dcs.rhbnc.ac.uk/
  • http://www.kernel-machines.org/
  • C. J. C. Burges. A Tutorial on Support Vector
    Machines for Pattern Recognition. Knowledge
    Discovery and Data Mining, 2(2), 1998.
  • SVMlight Software (in C)
    http://ais.gmd.de/~thorsten/svm_light
  • BOOK: An Introduction to Support Vector Machines,
    N. Cristianini and J. Shawe-Taylor, Cambridge
    University Press

24
Association-Based Classification
  • Several methods for association-based
    classification
  • ARCS: Quantitative association mining and
    clustering of association rules (Lent et al. '97)
  • It beats C4.5 in (mainly) scalability and also
    accuracy
  • Associative classification (Liu et al. '98)
  • It mines high support and high confidence rules
    in the form of cond_set => y, where y is a
    class label
  • CAEP (Classification by aggregating emerging
    patterns) (Dong et al. '99)
  • Emerging patterns (EPs): itemsets whose
    support increases significantly from one class to
    another
  • Mine EPs based on minimum support and growth rate

25
What Is Prediction?
  • Prediction is similar to classification
  • First, construct a model
  • Second, use model to predict unknown value
  • Major method for prediction is regression
  • Linear and multiple regression
  • Non-linear regression
  • Prediction is different from classification
  • Classification predicts categorical class
    labels
  • Prediction models continuous-valued functions

26
Regression Analysis and Log-Linear Models in
Prediction
  • Linear regression: Y = α + β X
  • Two parameters, α and β, specify the line and are
    to be estimated by using the data at hand
  • Using the least squares criterion on the known
    values of Y1, Y2, ..., X1, X2, ...
  • Multiple regression: Y = b0 + b1 X1 + b2 X2
  • Many nonlinear functions can be transformed into
    the above
  • Log-linear models
  • The multi-way table of joint probabilities is
    approximated by a product of lower-order tables
  • Probability: p(a, b, c, d) = α_ab β_ac χ_ad δ_bcd
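A minimal sketch of the least squares fit for the simple linear case, using the closed-form estimates; the data points are illustrative assumptions:

def fit_line(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # beta = sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)^2)
    beta = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
            / sum((x - mean_x) ** 2 for x in xs))
    alpha = mean_y - beta * mean_x  # the fitted line passes through the means
    return alpha, beta

alpha, beta = fit_line([1, 2, 3, 4], [1.9, 4.1, 6.0, 8.2])
print(alpha, beta)  # predict an unknown value with Y = alpha + beta * X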

27
Prediction: Numerical Data
28
Prediction: Categorical Data
29
Concerns Over Classification Techniques
  • When choosing a technique for a specific
    classification problem we must consider the
    following issues
  • Classification accuracy
  • Training speed
  • Classification speed
  • Danger of over-fitting
  • Generalisation capacity
  • Implications for retraining
  • Explanation capability

30
Evaluating Classification Accuracy
  • During development, and in testing before
    deploying a classifier in the wild, we need to be
    able to quantify the performance of the
    classifier
  • How accurate is the classifier?
  • When the classifier is wrong, how is it wrong?
  • Useful to decide on which classifier (which
    parameters) to use and to estimate what the
    performance of the system will be

31
Evaluating Classifiers (cont)
  • How we do this depends on how much data is
    available
  • If there is unlimited data available then there
    is no problem
  • Usually we have less data than we would like so
    we have to compromise
  • Use hold-out testing sets
  • Cross validation
  • K-fold cross validation
  • Leave-one-out validation
  • Parallel live test

32
Hold-Out Testing Sets
  • Split the available data into a training set and
    a test set
  • Train the classifier in the training set and
    evaluate based on the test set
  • A couple of drawbacks
  • We may not have enough data
  • We may happen upon an unfortunate split

[Figure: the total number of available examples split into a training set and a test set]
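A minimal sketch of the split; the 2/3 : 1/3 ratio and the shuffling are assumptions, as the slides don't fix a ratio:

import random

def holdout_split(examples, train_frac=2/3, seed=42):
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)                  # guard against an ordered data set
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]  # (training set, test set)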
33
K-Fold Cross Validation
  • Divide the entire data set into k folds
  • For each of k experiments, use the kth fold for
    testing and everything else for training

[Figure: the total number of available examples divided into folds, with a different fold (K = 0, 1, 2, 3) used as the test set in each experiment]
34
K-Fold Cross Validation (cont)
  • The error of the system is calculated as the
    average error across the k folds
  • The main advantages of k-fold cross validation
    are that every example is used in testing at some
    stage and the problem of an unfortunate split is
    avoided
  • Any value can be used for k
  • 10 is most common
  • Depends on the data set
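A minimal sketch of the procedure; train and evaluate are hypothetical stand-ins for whatever classifier is being assessed:

def k_fold_accuracy(examples, k, train, evaluate):
    folds = [examples[i::k] for i in range(k)]  # k roughly equal folds
    scores = []
    for i in range(k):
        test = folds[i]                         # fold i is held out for testing
        training = [e for j, fold in enumerate(folds) if j != i for e in fold]
        model = train(training)                 # everything else trains the model
        scores.append(evaluate(model, test))
    return sum(scores) / k                      # average across the k folds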

35
Leave-One-Out Cross Validation
  • Extreme case of k-fold cross validation
  • With N data examples perform N experiments with
    N-1 training cases and 1 test case

[Figure: N experiments over the total number of available examples, each holding out one example (K = 0, 1, 2, ..., N-1) as the single test case]
36
Classifier Accuracy
  • The accuracy of a classifier on a given test set
    is the percentage of test set tuples that are
    correctly classified by the classifier
  • Often also referred to as recognition rate
  • Error rate (or misclassification rate) is the
    opposite of accuracy

37
False Positives Vs False Negatives
  • While it is useful to generate the simple
    accuracy of a classifier, sometimes we need more
  • When is the classifier wrong?
  • False positives vs false negatives
  • Related to type I and type II errors in
    statistics
  • Often there is a different cost associated with
    false positives and false negatives
  • Think about diagnosing diseases

38
Confusion Matrix
  • Device used to illustrate how a classifier is
    performing in terms of false positives and false
    negatives
  • Gives us more information than a single
    accuracy figure
  • Allows us to think about the cost of mistakes
  • Can be extended to any number of classes

                          Classifier Result
                     Class A (yes)  Class B (no)
Expected  Class A (yes)    tp            fn
Result    Class B (no)     fp            tn
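A minimal sketch of filling the matrix from expected and predicted labels; the "yes"/"no" labels and tp/tn/fp/fn naming follow the table above:

def confusion(expected, predicted):
    counts = {"tp": 0, "fn": 0, "fp": 0, "tn": 0}
    for e, p in zip(expected, predicted):
        if e == "yes":                            # expected Class A
            counts["tp" if p == "yes" else "fn"] += 1
        else:                                     # expected Class B
            counts["fp" if p == "yes" else "tn"] += 1
    return counts

c = confusion(["yes", "yes", "no", "no"], ["yes", "no", "yes", "no"])
accuracy = (c["tp"] + c["tn"]) / sum(c.values())  # fraction correctly classified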
39
Other Accuracy Measures
  • Sometimes a simple accuracy measure is not enough
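The slide's formula images are not in the transcript; sensitivity, specificity, and precision are the measures usually presented at this point, sketched here from the confusion-matrix counts of the previous slide:

def sensitivity(tp, fn):
    return tp / (tp + fn)   # true positive rate: positives correctly recognised

def specificity(tn, fp):
    return tn / (tn + fp)   # true negative rate: negatives correctly recognised

def precision(tp, fp):
    return tp / (tp + fp)   # fraction of positive predictions that are correct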

40
ROC Curves
  • Receiver Operating Characteristic (ROC) curves
    were originally used to make sense of noisy radio
    signals
  • Can be used to help us talk about classifier
    performance and determine the best operating
    point for a classifier

41
ROC Curves (cont)
  • Consider how the relationship between true
    positives and false positives can change
  • We need to choose the best operating point

For some great ROC curve examples have a look here
42
ROC Curves (cont)
  • ROC curves can be used to compare classifiers
  • The greater the area under the curve the more
    accurate the classifier
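A minimal sketch of tracing the curve and measuring the area under it by the trapezoid rule; the example labels and scores are illustrative assumptions:

def roc_auc(labels, scores):
    # labels: 1 = positive, 0 = negative; a higher score means "more positive"
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]            # (false positive rate, true positive rate)
    for _, label in ranked:          # sweep the decision threshold downwards
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return sum((x2 - x1) * (y1 + y2) / 2  # trapezoid rule over the curve
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

print(roc_auc([1, 1, 0, 1, 0, 0], [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]))  # 0.889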

43
Over-Fitting
  • When we train a classifier we are trying to
    learn a function approximated by the training
    data we happen to use
  • What if the training data doesn't cover the whole
    problem space?
  • We can learn the training data too closely, which
    hampers the ability to generalise
  • This problem is known as over-fitting
  • Depending on the type of classifier used there
    are different approaches to avoiding this

44
Ensembles
  • In order to improve classification accuracy we
    can aggregate the results of an ensemble of
    classifiers

45
Bagging
  • Given a set S of s samples
  • Generate a bootstrap sample T from S
  • Cases in S may not appear in T or may appear more
    than once
  • Repeat this sampling procedure, getting a
    sequence of k independent training sets
  • A corresponding sequence of classifiers C1,
    C2, ..., Ck is constructed for each of these
    training sets, by using the same classification
    algorithm

46
Bagging (cont)
  • To classify an unknown sample X, let each
    classifier predict or vote
  • The Bagged Classifier C counts the votes and
    assigns X to the class with the most votes
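A minimal sketch of both steps; train_classifier is a hypothetical stand-in for the shared base learning algorithm, and a trained classifier is assumed to be callable on a sample:

import random
from collections import Counter

def bag(S, k, train_classifier, seed=0):
    rng = random.Random(seed)
    classifiers = []
    for _ in range(k):
        # Bootstrap sample T: |S| cases drawn with replacement, so a case
        # may appear more than once in T, or not at all
        T = [rng.choice(S) for _ in S]
        classifiers.append(train_classifier(T))
    return classifiers

def bagged_classify(classifiers, X):
    votes = Counter(c(X) for c in classifiers)  # let each classifier vote
    return votes.most_common(1)[0][0]           # class with the most votes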

47
Boosting Technique Algorithm
  • Assign every example an equal weight 1/N
  • For t = 1, 2, ..., T do
  • Obtain a hypothesis (classifier) h(t) under w(t)
  • Calculate the error of h(t) and re-weight the
    examples based on the error. Each classifier is
    dependent on the previous ones; samples that are
    incorrectly predicted are weighted more heavily
  • Normalize w(t+1) to sum to 1 (weights assigned to
    different classifiers sum to 1)
  • Output a weighted sum of all the hypotheses, with
    each hypothesis weighted according to its
    accuracy on the training set
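A minimal sketch of this loop in the AdaBoost style; train_weak is a hypothetical stand-in for the base learner, class labels are assumed to be +1/-1, and the weighted error is assumed to satisfy 0 < err < 1:

import math

def boost(examples, labels, T, train_weak):
    N = len(examples)
    w = [1.0 / N] * N                            # equal initial weights 1/N
    hypotheses = []
    for _ in range(T):
        h = train_weak(examples, labels, w)      # hypothesis h(t) under w(t)
        err = sum(wi for wi, x, y in zip(w, examples, labels) if h(x) != y)
        alpha = 0.5 * math.log((1 - err) / err)  # weight by training accuracy
        # Incorrectly predicted samples are weighted more heavily
        w = [wi * math.exp(-alpha * y * h(x))
             for wi, x, y in zip(w, examples, labels)]
        total = sum(w)
        w = [wi / total for wi in w]             # normalise w(t+1) to sum to 1
        hypotheses.append((alpha, h))
    # Output: a weighted sum (vote) of all the hypotheses
    return lambda x: 1 if sum(a * h(x) for a, h in hypotheses) > 0 else -1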

48
Summary
  • Classification is an extensively studied problem
  • Mainly in statistics and machine learning
  • Classification is probably one of the most widely
    used data mining techniques
  • Scalability is still an important issue for
    database applications
  • Research directions: classification of
    non-relational data, e.g. text, spatial,
    multimedia, etc.

49
Questions?
  • ?