Title: Business Systems Intelligence: 5. Classification 2
1 Business Systems Intelligence: 5. Classification 2
Dr. Brian Mac Namee (www.comp.dit.ie/bmacnamee)
2 Acknowledgments
- These notes are based (heavily) on those provided by the authors to accompany Data Mining: Concepts and Techniques by Jiawei Han and Micheline Kamber
- Some slides are also based on trainers' kits provided by SAS
- More information about the book is available at www-sal.cs.uiuc.edu/hanj/bk2/
- Information on SAS is available at www.sas.com
3 Classification & Prediction
- Today we will look at:
  - What are classification and prediction?
  - Issues regarding classification and prediction
  - Classification techniques:
    - Case-based reasoning (the k-nearest neighbour algorithm)
    - Decision tree induction
    - Bayesian classification
    - Neural networks
    - Support vector machines (SVMs)
    - Classification based on association rule mining concepts
    - Other classification methods
  - Prediction
  - Classification accuracy
4 Classification
- Classification:
  - Predicts categorical class labels
- Typical applications:
  - CreditHistory, Salary -> CreditApproval (Yes/No)
  - Temp, Humidity -> Rain (Yes/No)
5 Linear Classification
- A binary classification problem
- The data above the red line belongs to class x
- The data below the red line belongs to class o
- Examples: SVM, perceptron, probabilistic classifiers
6 Discriminative Classifiers
- Advantages:
  - Prediction accuracy is generally high
  - Robust: works even when training examples contain errors
  - Fast evaluation of the learned target function
- Criticisms:
  - Long training time
  - Difficult to understand the learned function (weights)
  - Not easy to incorporate domain knowledge
7 Artificial Neural Networks
- A biologically inspired classification technique
- Formed from interconnected layers of simple artificial neurons
- ANN history:
  - 1943: McCulloch & Pitts
  - 1959: Rosenblatt (Perceptron)
  - 1959: Widrow & Hoff (ADALINE and MADALINE)
  - 1969: Marvin Minsky and Seymour Papert's Perceptrons
  - 1974: Werbos (backpropagation)
  - 1982: John Hopfield
8 An Artificial Neuron
- The n-dimensional input vector x is mapped onto the output variable y by means of a scalar product followed by a nonlinear function mapping
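As a minimal sketch (not from the notes), a single artificial neuron with an assumed sigmoid activation can be written as follows; the weight, bias, and input values are illustrative only.

```python
import numpy as np

def sigmoid(z):
    """Nonlinear activation function."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(x, w, b):
    """Map the n-dimensional input x to y: scalar product plus bias,
    followed by the nonlinear activation."""
    return sigmoid(np.dot(w, x) + b)

# Illustrative values only
x = np.array([0.5, -1.2, 3.0])   # input vector
w = np.array([0.4, 0.1, -0.7])   # weights
b = 0.2                          # bias
print(neuron_output(x, w, b))
```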
9 ANN: Multi-Layer Perceptrons (MLPs)
- Multi-layer perceptrons (MLPs) are one of the best known ANN types
- Composed of layers of fully interconnected artificial neurons
- Training involves repeatedly presenting a series of training cases to the network and adjusting the neurons' weights and biases to minimise classification error
- Typically the backpropagation of error algorithm is used for training (see the sketch that follows)
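As a hedged illustration (not part of the original notes), a library such as scikit-learn provides an MLP trained with backpropagation; the toy data and hyperparameters below are assumptions made for the sketch.

```python
from sklearn.neural_network import MLPClassifier

# Toy training data: two features, binary class labels (illustrative only)
X_train = [[0.1, 0.2], [0.9, 0.8], [0.2, 0.1], [0.8, 0.9]]
y_train = [0, 1, 0, 1]

# One hidden layer of 10 neurons; weights are adjusted by backpropagation
mlp = MLPClassifier(hidden_layer_sizes=(10,), max_iter=1000, random_state=1)
mlp.fit(X_train, y_train)

print(mlp.predict([[0.85, 0.75]]))  # predicted class label for a new case
```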
10 MLP Example
- Remember our surfing example?
- An MLP can be built and trained to perform classification for this problem
11 Network Training
- The ultimate objective of training:
  - Obtain a set of weights that makes almost all of the tuples in the training data classified correctly
- Steps (sketched in code below):
  - Initialise the weights with random values
  - Feed the input tuples into the network one by one
  - For each unit:
    - Compute the net input to the unit as a linear combination of all the inputs to the unit
    - Compute the output value using the activation function
    - Compute the error
    - Update the weights and the bias
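The following sketch (my own illustration, not from the notes) runs these steps for a single-neuron network trained by gradient descent; the learning rate, data, and number of epochs are assumed values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative training tuples (inputs X and target labels y)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 0.0, 0.0, 1.0])  # an AND-like target

rng = np.random.default_rng(0)
w = rng.normal(size=2)   # initialise weights with random values
b = 0.0                  # bias
lr = 0.5                 # assumed learning rate

for epoch in range(1000):
    for x_i, t in zip(X, y):           # feed the tuples one by one
        net = np.dot(w, x_i) + b       # net input: linear combination
        out = sigmoid(net)             # output via the activation function
        err = t - out                  # compute the error
        grad = err * out * (1 - out)   # gradient through the sigmoid
        w += lr * grad * x_i           # update the weights
        b += lr * grad                 # update the bias

print(np.round([sigmoid(np.dot(w, x_i) + b) for x_i in X], 2))
```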
12 Summary of ANN Classification
- Strengths:
  - Fast classification
  - Very good generalization capacity
- Weaknesses:
  - No explanation capability (black box)
  - Training can be slow (eager learning)
  - Retraining is difficult
- Lots of other network types exist, but the MLP is probably the most common
13 Support Vector Machines (SVMs)
- In classification problems we try to create decision boundaries between classes
- A choice must be made between the possible boundaries
(Figure: several candidate decision boundaries separating Class 1 and Class 2)
14 SVMs (cont)
- The decision boundary should be as far away from the data of both classes as possible
15 Margins
16 Linear Support Vector Machine
- Given a set of points x_i with labels y_i in {-1, +1}
- The SVM finds the separating hyperplane w . x + b = 0, defined by the pair (w, b), where w is the normal to the plane and b is the bias (the plane's offset from the origin)
- Where:
  - x: feature vector
  - y: class label
  - w: weight vector (normal to the hyperplane), which determines the margin
  - b: bias
- Training chooses (w, b) to maximise the margin 2/||w|| subject to y_i (w . x_i + b) >= 1 for every training point
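As an illustrative sketch only (not from the notes), a linear SVM can be fitted with scikit-learn; the toy data below is an assumption of the example.

```python
from sklearn.svm import SVC

# Toy linearly separable data: points with labels in {-1, +1} (illustrative)
X = [[1, 1], [2, 2], [2, 0], [0, 0], [-1, -1], [0, -2]]
y = [1, 1, 1, -1, -1, -1]

clf = SVC(kernel="linear")
clf.fit(X, y)

print(clf.coef_)       # w, the normal to the separating hyperplane
print(clf.intercept_)  # b, the bias term
print(clf.predict([[1.5, 0.5]]))
```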
17 SVMs: The Clever Bit!
- What about when the classes are not linearly separable?
- Kernel functions and the kernel trick are used to transform the data into a different, linearly separable feature space
18 SVMs: The Clever Bit! (cont...)
- What if the data is not linearly separable?
- Project the data into a high-dimensional space where it is linearly separable and then use a linear SVM (using kernels)
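A hedged sketch of this idea (my example, not the one in the notes): scikit-learn's RBF kernel handles the classic concentric-circles data that no linear boundary can separate. The dataset and gamma value are assumptions.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: not linearly separable in the original feature space
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
kernel_svm = SVC(kernel="rbf", gamma=2.0).fit(X, y)   # the kernel trick

print("linear SVM accuracy:", linear_svm.score(X, y))  # poor fit expected
print("RBF-kernel SVM accuracy:", kernel_svm.score(X, y))
```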
19 SVM Example
Example of a non-linear SVM
20 SVM Example (cont)
Results
21 Summary of SVM Classification
- Strengths:
  - Over-fitting is not common
  - Works well with high-dimensional data
  - Fast classification
  - Good generalization capacity
- Weaknesses:
  - Retraining is difficult
  - No explanation capability
  - Slow training
- SVMs are at the cutting edge of machine learning
22 SVM vs. ANN
- SVM:
  - Relatively new concept
  - Nice generalization properties
  - Hard to learn: learned in batch mode using quadratic programming techniques
  - Using kernels, can learn very complex functions
- ANN:
  - Quite old
  - Generalizes well but doesn't have a strong mathematical foundation
  - Can easily be learned in an incremental fashion
  - To learn complex functions, use a multi-layer perceptron (not that trivial)
23 SVM Related Links
- http://svm.dcs.rhbnc.ac.uk/
- http://www.kernel-machines.org/
- C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Knowledge Discovery and Data Mining, 2(2), 1998.
- SVMlight software (in C): http://ais.gmd.de/thorsten/svm_light
- Book: An Introduction to Support Vector Machines, N. Cristianini and J. Shawe-Taylor, Cambridge University Press
24 Association-Based Classification
- Several methods exist for association-based classification:
  - ARCS: quantitative association mining and clustering of association rules (Lent et al., 1997)
    - It beats C4.5 in (mainly) scalability and also accuracy
  - Associative classification (Liu et al., 1998)
    - It mines high-support, high-confidence rules of the form cond_set => y, where y is a class label
  - CAEP: Classification by Aggregating Emerging Patterns (Dong et al., 1999)
    - Emerging patterns (EPs): itemsets whose support increases significantly from one class to another
    - Mine EPs based on minimum support and growth rate
25 What Is Prediction?
- Prediction is similar to classification:
  - First, construct a model
  - Second, use the model to predict unknown values
- The major method for prediction is regression:
  - Linear and multiple regression
  - Non-linear regression
- Prediction is different from classification:
  - Classification predicts a categorical class label
  - Prediction models continuous-valued functions
26 Regression Analysis and Log-Linear Models in Prediction
- Linear regression: Y = α + βX
  - The two parameters, α and β, specify the line and are estimated using the data at hand
  - Apply the least squares criterion to the known values Y1, Y2, ... and X1, X2, ...
- Multiple regression: Y = b0 + b1X1 + b2X2
  - Many nonlinear functions can be transformed into the above
- Log-linear models:
  - The multi-way table of joint probabilities is approximated by a product of lower-order tables
  - Probability: p(a, b, c, d) = αab βac χad δbcd
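As a small worked sketch (my own, with made-up numbers), the least squares estimates of α and β for simple linear regression can be computed directly with NumPy:

```python
import numpy as np

# Illustrative paired observations (X_i, Y_i)
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 4.3, 6.2, 7.9, 10.1])

# Least squares estimates for Y = alpha + beta * X
beta = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
alpha = Y.mean() - beta * X.mean()

print(alpha, beta)
print(alpha + beta * 6.0)  # predicted continuous value for a new X
```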
27 Prediction: Numerical Data
28 Prediction: Categorical Data
29 Concerns Over Classification Techniques
- When choosing a technique for a specific classification problem we must consider the following issues:
  - Classification accuracy
  - Training speed
  - Classification speed
  - Danger of over-fitting
  - Generalisation capacity
  - Implications for retraining
  - Explanation capability
30 Evaluating Classification Accuracy
- During development, and in testing before deploying a classifier in the wild, we need to be able to quantify its performance:
  - How accurate is the classifier?
  - When the classifier is wrong, how is it wrong?
- This is useful for deciding which classifier (and which parameters) to use, and for estimating what the performance of the system will be
31 Evaluating Classifiers (cont)
- How we do this depends on how much data is available
- If there is unlimited data available then there is no problem
- Usually we have less data than we would like, so we have to compromise:
  - Use hold-out testing sets
  - Cross validation
    - K-fold cross validation
    - Leave-one-out validation
  - Parallel live test
32 Hold-Out Testing Sets
- Split the available data into a training set and a test set
- Train the classifier on the training set and evaluate it based on the test set
- A couple of drawbacks:
  - We may not have enough data
  - We may happen upon an unfortunate split
(Figure: the total number of available examples divided into a training set and a test set)
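A hedged sketch of a hold-out split using scikit-learn; the dataset, classifier, and 30% split proportion are assumptions of the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hold out 30% of the available examples as an unseen test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

clf = DecisionTreeClassifier().fit(X_train, y_train)
print("hold-out accuracy:", clf.score(X_test, y_test))
```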
33 K-Fold Cross Validation
- Divide the entire data set into k folds
- For each of k experiments, use the kth fold for testing and everything else for training
(Figure: the total number of available examples, with the test set fold rotating through positions k = 0, 1, 2, 3)
34 K-Fold Cross Validation (cont)
- The accuracy of the system is calculated as the average error across the k folds (a code sketch follows)
- The main advantages of k-fold cross validation are that every example is used in testing at some stage and the problem of an unfortunate split is avoided
- Any value can be used for k:
  - 10 is most common
  - Depends on the data set
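A minimal k-fold sketch using scikit-learn; the dataset, classifier, and k = 10 are assumptions of the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 10-fold cross validation: every example is used for testing exactly once
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print("per-fold accuracy:", scores)
print("mean accuracy:", scores.mean())
```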
35 Leave-One-Out Cross Validation
- The extreme case of k-fold cross validation
- With N data examples, perform N experiments with N-1 training cases and 1 test case
(Figure: the total number of available examples, with a single test example rotating through positions k = 0, 1, 2, ..., N)
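An illustrative sketch (not from the notes) of leave-one-out validation via scikit-learn; the dataset and classifier are assumed.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# N experiments, each holding out exactly one example for testing
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=LeaveOneOut())
print("leave-one-out accuracy:", scores.mean())
```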
36 Classifier Accuracy
- The accuracy of a classifier on a given test set is the percentage of test set tuples that are correctly classified by the classifier
- This is often also referred to as the recognition rate
- The error rate (or misclassification rate) is the opposite of accuracy
37 False Positives vs. False Negatives
- While it is useful to generate the simple accuracy of a classifier, sometimes we need more:
  - When is the classifier wrong?
  - False positives vs. false negatives
- These are related to type I and type II errors in statistics
- Often there is a different cost associated with false positives and false negatives
  - Think about diagnosing diseases
38 Confusion Matrix
- A device used to illustrate how a classifier is performing in terms of false positives and false negatives
- Gives us more information than a single accuracy figure
- Allows us to think about the cost of mistakes
- Can be extended to any number of classes

                            Classifier Result
                            Class A (yes)         Class B (no)
Expected    Class A (yes)   correct               false negative (fn)
Result      Class B (no)    false positive (fp)   correct
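A hedged sketch (my illustration) of computing a confusion matrix with scikit-learn; the label vectors are made up for the example.

```python
from sklearn.metrics import confusion_matrix

# Expected (true) labels vs. classifier results, illustrative only
y_true = [1, 1, 1, 0, 0, 0, 1, 0]   # 1 = Class A (yes), 0 = Class B (no)
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

# Rows are expected classes, columns are classifier results
print(confusion_matrix(y_true, y_pred, labels=[1, 0]))
```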
39 Other Accuracy Measures
- Sometimes a simple accuracy measure is not enough
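The notes do not list the other measures here, but common examples built on the confusion matrix counts include precision and recall; the sketch below (my own, with invented labels) shows them via scikit-learn.

```python
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

# Precision: of the cases predicted positive, how many really were positive?
# Recall: of the truly positive cases, how many did the classifier find?
print("precision:", precision_score(y_true, y_pred))
print("recall:", recall_score(y_true, y_pred))
```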
40 ROC Curves
- Receiver Operating Characteristic (ROC) curves were originally used to make sense of noisy radio signals
- They can be used to help us talk about classifier performance and determine the best operating point for a classifier
41 ROC Curves (cont)
- Consider how the relationship between true positives and false positives can change
- We need to choose the best operating point
42 ROC Curves (cont)
- ROC curves can be used to compare classifiers
- The greater the area under the curve, the more accurate the classifier
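A minimal sketch (my own example) of computing an ROC curve and the area under it with scikit-learn; the scores and labels are invented.

```python
from sklearn.metrics import roc_curve, roc_auc_score

# True labels and the classifier's scores for the positive class (illustrative)
y_true = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.3, 0.9, 0.6, 0.2]

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("false positive rates:", fpr)
print("true positive rates:", tpr)
print("area under the ROC curve:", roc_auc_score(y_true, y_score))
```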
43 Over-Fitting
- When we train a classifier we are trying to learn a function approximated by the training data we happen to use
- What if the training data doesn't cover the whole problem space?
- We can learn the training data too closely, which hampers the ability to generalise
- This problem is known as over-fitting
- Depending on the type of classifier used there are different approaches to avoiding this
44 Ensembles
- In order to improve classification accuracy we can aggregate the results of an ensemble of classifiers
45 Bagging
- Given a set S of s samples:
  - Generate a bootstrap sample T from S
    - Cases in S may not appear in T or may appear more than once
  - Repeat this sampling procedure, getting a sequence of k independent training sets
  - A corresponding sequence of classifiers C1, C2, ..., Ck is constructed for each of these training sets, using the same classification algorithm
46 Bagging (cont)
- To classify an unknown sample X, let each classifier predict or vote
- The bagged classifier C counts the votes and assigns X to the class with the most votes
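A hedged sketch of bagging with scikit-learn; the dataset, base classifier, and k = 10 bootstrap samples are assumptions of the example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# k = 10 classifiers, each trained on a bootstrap sample drawn from S;
# the bagged classifier predicts by majority vote
bagged = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10,
                           bootstrap=True, random_state=0)
bagged.fit(X, y)
print(bagged.predict(X[:5]))
```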
47 Boosting Technique: Algorithm
- Assign every example an equal weight 1/N
- For t = 1, 2, ..., T do:
  - Obtain a hypothesis (classifier) h(t) under the weights w(t)
  - Calculate the error of h(t) and re-weight the examples based on the error; each classifier is dependent on the previous ones, and samples that are incorrectly predicted are weighted more heavily
  - Normalise w(t+1) to sum to 1
- Output a weighted sum of all the hypotheses, with each hypothesis weighted according to its accuracy on the training set (sketched below)
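A minimal boosting sketch (mine, not the notes') using scikit-learn's AdaBoost, which follows this weight-and-vote scheme; the dataset and parameters are assumed.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# T = 50 rounds: each round re-weights misclassified examples more heavily,
# and the final output is an accuracy-weighted vote over all hypotheses
boosted = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                             n_estimators=50, random_state=0)
boosted.fit(X, y)
print(boosted.score(X, y))
```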
48 Summary
- Classification is an extensively studied problem, mainly in statistics and machine learning
- Classification is probably one of the most widely used data mining techniques
- Scalability is still an important issue for database applications
- Research directions: classification of non-relational data, e.g. text, spatial and multimedia data
49 Questions?