Transcript and Presenter's Notes

Title: Business Systems Intelligence: 5. Classification 2


1
Business Systems Intelligence: 5. Classification 2
Dr. Brian Mac Namee (www.comp.dit.ie/bmacnamee)
2
Acknowledgments
  • These notes are based (heavily) on those
    provided by the authors to accompany Data
    Mining: Concepts and Techniques by Jiawei Han
    and Micheline Kamber
  • Some slides are also based on trainers' kits
    provided by SAS

More information about the book is available at
www-sal.cs.uiuc.edu/hanj/bk2/ and information on
SAS is available at www.sas.com
3
Classification &amp; Prediction
  • Today we will look at:
  • What are classification and prediction?
  • Issues regarding classification and prediction
  • Classification techniques
    • Case based reasoning (k-nearest neighbour
      algorithm)
    • Decision tree induction
    • Bayesian classification
    • Neural networks
    • Support vector machines (SVM)
    • Classification based on association rule
      mining concepts
    • Other classification methods
  • Prediction
  • Classification accuracy

4
Classification
  • Classification
  • Predicts categorical class labels
  • Typical Applications
  • CreditHistory, Salary -> CreditApproval
    (Yes/No)
  • Temp, Humidity -> Rain (Yes/No)

5
Linear Classification
  • Binary Classification problem
  • The data above the red line belongs to class x
  • The data below red line belongs to class o
  • Examples: SVM, Perceptron, Probabilistic
    Classifiers
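The slide's figure is not in the transcript; below is a minimal sketch of one of the named examples, a perceptron, learning a linear boundary on toy 2-D data (the data and learning rate are illustrative assumptions).

# A minimal perceptron sketch: learn a line w.x + b = 0 separating two classes.
def train_perceptron(points, labels, epochs=100, lr=0.1):
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x1, x2), y in zip(points, labels):
            if y * (w[0] * x1 + w[1] * x2 + b) <= 0:  # misclassified?
                w[0] += lr * y * x1                   # nudge the boundary
                w[1] += lr * y * x2                   # towards this example
                b += lr * y
    return w, b

# Class x (+1) above the boundary, class o (-1) below it
points = [(1.0, 2.0), (2.0, 3.0), (3.0, 3.0), (1.0, 0.0), (2.0, 1.0), (3.0, 0.0)]
labels = [1, 1, 1, -1, -1, -1]
w, b = train_perceptron(points, labels)
print(w, b)  # parameters of the learned decision boundary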

6
Discriminative Classifiers
  • Advantages
  • Prediction accuracy is generally high
  • Robust, works when training examples contain
    errors
  • Fast evaluation of the learned target function
  • Criticism
  • Long training time
  • Difficult to understand the learned function
    (weights)
  • Not easy to incorporate domain knowledge

7
Artificial Neural Networks
  • A biologically inspired classification
    technique
  • Formed from interconnected layers of simple
    artificial neurons
  • ANN history
  • 1943 McCulloch &amp; Pitts
  • 1959 Rosenblatt (Perceptron)
  • 1959 Widrow &amp; Hoff (ADALINE and MADALINE)
  • 1969 Marvin Minsky and Seymour Papert's
    Perceptrons
  • 1974 Werbos (Backprop)
  • 1982 John Hopfield

8
An Artificial Neuron
  • The n-dimensional input vector x is mapped into
    variable y by means of the scalar product and a
    nonlinear function mapping
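The slide's formula image is missing from the transcript; in symbols, y = f(sum_i(w_i * x_i) + b) for weights w, bias b, and a nonlinear activation f. A minimal sketch, assuming a sigmoid activation:

import math

def neuron(x, w, b):
    # Scalar product of inputs and weights, plus bias, gives the net input
    net = sum(wi * xi for wi, xi in zip(w, x)) + b
    # Nonlinear function mapping: a sigmoid squashes net into (0, 1)
    return 1.0 / (1.0 + math.exp(-net))

print(neuron([0.5, -1.2, 3.0], [0.4, 0.1, -0.2], b=0.1))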

9
ANN Multi-Layer Perceptrons (MLPs)
  • Multi Layer Perceptrons (MLPs) are one of the
    best known ANN types
  • Composed of layers of fully interconnected
    artificial neurons
  • Training involves repeatedly presenting a series
    of training cases to the network and adjusting
    neurons' weights and biases to minimise
    classification error
  • Typically the backpropagation of error algorithm
    is used for training

10
MLP Example
  • Remember our surfing example
  • An MLP can be built and trained to perform
    classification for this problem

11
Network Training
  • The ultimate objective of training
  • Obtain a set of weights that classifies almost
    all of the tuples in the training data correctly
  • Steps
  • Initialize weights with random values
  • Feed the input tuples into the network one by one
  • For each unit
  • Compute the net input to the unit as a linear
    combination of all the inputs to the unit
  • Compute the output value using the activation
    function
  • Compute the error
  • Update the weights and the bias
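A minimal sketch of these steps for a single sigmoid unit (the delta rule); a full MLP additionally backpropagates the error through its hidden layers. The toy data, epoch count, and learning rate are assumptions.

import math, random

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

def train_unit(tuples, epochs=2000, lr=0.5, seed=1):
    rng = random.Random(seed)
    n = len(tuples[0][0])
    w = [rng.uniform(-0.5, 0.5) for _ in range(n)]  # initialise weights randomly
    b = rng.uniform(-0.5, 0.5)
    for _ in range(epochs):
        for x, target in tuples:                    # feed tuples one by one
            net = sum(wi * xi for wi, xi in zip(w, x)) + b  # net input
            out = sigmoid(net)                      # output via activation function
            err = (target - out) * out * (1 - out)  # error term
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]  # update weights
            b += lr * err                           # update the bias
    return w, b

w, b = train_unit([([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)])  # learns AND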

12
Summary of ANN Classification
  • Strengths
  • Fast classification
  • Very good generalization capacity
  • Weaknesses
  • No explanation capability (black box)
  • Training can be slow (eager learning)
  • Retraining is difficult
  • Lots of other network types, but MLP is probably
    the most common

13
Support Vector Machines (SVM)
  • In classification problems we try to create
    decision boundaries between classes
  • A choice must be made between possible boundaries

[Figure: examples of Class 1 and Class 2 separated by several possible decision boundaries]
14
SVMs (cont)
  • The decision boundary should be as far away from
    the data of both classes as possible

15
Margins
16
Linear Support Vector Machine
  • Given a set of points xi with class labels
    yi ∈ {-1, 1}
  • The SVM finds a hyperplane defined by the pair
    (w, b), where w is the normal to the plane and
    b/||w|| is its distance from the origin
  • Where
  • x - feature vector
  • y - class label
  • w - weight vector (the margin is 2/||w||)
  • b - bias

17
SVMs: The Clever Bit!
  • What about when classes are not linearly
    separable?
  • Kernel functions and the kernel trick are used to
    transform data into a different linearly
    separable feature space

18
SVMs: The Clever Bit! (cont...)
  • What if the data is not linearly separable?
  • Project the data to high dimensional space where
    it is linearly separable and then we can use
    linear SVM (Using Kernels)
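A minimal sketch of two common kernel functions; the slides do not name specific kernels, so the RBF and polynomial kernels below are standard illustrative choices:

import math

def rbf_kernel(x, z, gamma=1.0):
    # exp(-gamma * ||x - z||^2): a dot product in an implicit,
    # infinite-dimensional feature space
    return math.exp(-gamma * sum((xi - zi) ** 2 for xi, zi in zip(x, z)))

def poly_kernel(x, z, degree=2):
    # (x.z + 1)^d: a dot product under an explicit polynomial feature map
    return (sum(xi * zi for xi, zi in zip(x, z)) + 1) ** degree

print(rbf_kernel([1.0, 0.0], [0.0, 1.0]), poly_kernel([1.0, 0.0], [0.0, 1.0]))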

19
SVM Example
Example of Non-linear SVM
20
SVM Example (cont)
Results
21
Summary of SVM Classification
  • Strengths
  • Over-fitting is not common
  • Works well with high dimensional data
  • Fast classification
  • Good generalization capacity
  • Weaknesses
  • Retraining is difficult
  • No explanation capability
  • Slow training
  • At the cutting edge of machine learning

22
SVM vs. ANN
  • SVM
  • Relatively new concept
  • Nice generalization properties
  • Hard to learn - learned in batch mode using
    quadratic programming techniques
  • Using kernels can learn very complex functions
  • ANN
  • Quite old
  • Generalizes well but doesn't have strong
    mathematical foundation
  • Can easily be learned in incremental fashion
  • To learn complex functions use multilayer
    perceptron (not that trivial)

23
SVM Related Links
  • http://svm.dcs.rhbnc.ac.uk/
  • http://www.kernel-machines.org/
  • C. J. C. Burges. A Tutorial on Support Vector
    Machines for Pattern Recognition. Knowledge
    Discovery and Data Mining, 2(2), 1998.
  • SVMlight Software (in C)
    http://ais.gmd.de/~thorsten/svm_light
  • BOOK: An Introduction to Support Vector Machines,
    N. Cristianini and J. Shawe-Taylor, Cambridge
    University Press

24
Association-Based Classification
  • Several methods for association-based
    classification
  • ARCS: Quantitative association mining and
    clustering of association rules (Lent et al. '97)
  • It beats C4.5 in (mainly) scalability and also
    accuracy
  • Associative classification (Liu et al. '98)
  • It mines high support and high confidence rules
    in the form of cond_set => y, where y is a
    class label
  • CAEP (Classification by aggregating emerging
    patterns) (Dong et al. '99)
  • Emerging patterns (EPs): itemsets whose
    support increases significantly from one class to
    another
  • Mine EPs based on minimum support and growth rate

25
What Is Prediction?
  • Prediction is similar to classification
  • First, construct a model
  • Second, use model to predict unknown value
  • Major method for prediction is regression
  • Linear and multiple regression
  • Non-linear regression
  • Prediction is different from classification
  • Classification predicts categorical class
    labels
  • Prediction models continuous-valued functions

26
Regression Analysis and Log-Linear Models in
Prediction
  • Linear regression: Y = α + β X
  • Two parameters, α and β, specify the line and are
    to be estimated by using the data at hand
  • Using the least squares criterion on the known
    values of Y1, Y2, ..., X1, X2, ...
  • Multiple regression: Y = b0 + b1 X1 + b2 X2
  • Many nonlinear functions can be transformed into
    the above
  • Log-linear models
  • The multi-way table of joint probabilities is
    approximated by a product of lower-order tables
  • Probability: p(a, b, c, d) = α_ab β_ac χ_ad δ_bcd
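A minimal sketch of the least squares fit for the simple linear case, using the closed-form estimates; the data points are illustrative assumptions:

def fit_line(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # beta = sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)^2)
    beta = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
            / sum((x - mean_x) ** 2 for x in xs))
    alpha = mean_y - beta * mean_x  # the fitted line passes through the means
    return alpha, beta

alpha, beta = fit_line([1, 2, 3, 4], [1.9, 4.1, 6.0, 8.2])
print(alpha, beta)  # predict an unknown value with Y = alpha + beta * X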

27
Prediction: Numerical Data
28
Prediction: Categorical Data
29
Concerns Over Classification Techniques
  • When choosing a technique for a specific
    classification problem we must consider the
    following issues
  • Classification accuracy
  • Training speed
  • Classification speed
  • Danger of over-fitting
  • Generalisation capacity
  • Implications for retraining
  • Explanation capability

30
Evaluating Classification Accuracy
  • During development, and in testing before
    deploying a classifier in the wild, we need to be
    able to quantify the performance of the
    classifier
  • How accurate is the classifier?
  • When the classifier is wrong, how is it wrong?
  • Useful to decide on which classifier (which
    parameters) to use and to estimate what the
    performance of the system will be

31
Evaluating Classifiers (cont)
  • How we do this depends on how much data is
    available
  • If there is unlimited data available then there
    is no problem
  • Usually we have less data than we would like so
    we have to compromise
  • Use hold-out testing sets
  • Cross validation
  • K-fold cross validation
  • Leave-one-out validation
  • Parallel live test

32
Hold-Out Testing Sets
  • Split the available data into a training set and
    a test set
  • Train the classifier in the training set and
    evaluate based on the test set
  • A couple of drawbacks
  • We may not have enough data
  • We may happen upon an unfortunate split

[Figure: the total number of available examples split into a training set and a test set]
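A minimal sketch of the split; the 2/3 : 1/3 ratio and the shuffling are assumptions, as the slides don't fix a ratio:

import random

def holdout_split(examples, train_frac=2/3, seed=42):
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)                  # guard against an ordered data set
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]  # (training set, test set)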
33
K-Fold Cross Validation
  • Divide the entire data set into k folds
  • For each of k experiments, use the kth fold for
    testing and everything else for training

[Figure: the total number of available examples divided into folds, with a different fold (K = 0, 1, 2, 3) used as the test set in each experiment]
34
K-Fold Cross Validation (cont)
  • The error of the system is calculated as the
    average error across the k folds
  • The main advantages of k-fold cross validation
    are that every example is used in testing at some
    stage and the problem of an unfortunate split is
    avoided
  • Any value can be used for k
  • 10 is most common
  • Depends on the data set
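A minimal sketch of the procedure; train and evaluate are hypothetical stand-ins for whatever classifier is being assessed:

def k_fold_accuracy(examples, k, train, evaluate):
    folds = [examples[i::k] for i in range(k)]  # k roughly equal folds
    scores = []
    for i in range(k):
        test = folds[i]                         # fold i is held out for testing
        training = [e for j, fold in enumerate(folds) if j != i for e in fold]
        model = train(training)                 # everything else trains the model
        scores.append(evaluate(model, test))
    return sum(scores) / k                      # average across the k folds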

35
Leave-One-Out Cross Validation
  • Extreme case of k-fold cross validation
  • With N data examples perform N experiments with
    N-1 training cases and 1 test case

[Figure: N experiments over the total number of available examples, each holding out one example (K = 0, 1, 2, ..., N-1) as the single test case]
36
Classifier Accuracy
  • The accuracy of a classifier on a given test set
    is the percentage of test set tuples that are
    correctly classified by the classifier
  • Often also referred to as recognition rate
  • Error rate (or misclassification rate) is the
    opposite of accuracy

37
False Positives Vs False Negatives
  • While it is useful to generate the simple
    accuracy of a classifier, sometimes we need more
  • When is the classifier wrong?
  • False positives vs false negatives
  • Related to type I and type II errors in
    statistics
  • Often there is a different cost associated with
    false positives and false negatives
  • Think about diagnosing diseases

38
Confusion Matrix
  • Device used to illustrate how a classifier is
    performing in terms of false positives and false
    negatives
  • Gives us more information than a single
    accuracy figure
  • Allows us to think about the cost of mistakes
  • Can be extended to any number of classes

                          Classifier Result
                     Class A (yes)  Class B (no)
Expected  Class A (yes)    tp            fn
Result    Class B (no)     fp            tn
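A minimal sketch of filling the matrix from expected and predicted labels; the "yes"/"no" labels and tp/tn/fp/fn naming follow the table above:

def confusion(expected, predicted):
    counts = {"tp": 0, "fn": 0, "fp": 0, "tn": 0}
    for e, p in zip(expected, predicted):
        if e == "yes":                            # expected Class A
            counts["tp" if p == "yes" else "fn"] += 1
        else:                                     # expected Class B
            counts["fp" if p == "yes" else "tn"] += 1
    return counts

c = confusion(["yes", "yes", "no", "no"], ["yes", "no", "yes", "no"])
accuracy = (c["tp"] + c["tn"]) / sum(c.values())  # fraction correctly classified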
39
Other Accuracy Measures
  • Sometimes a simple accuracy measure is not enough
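The slide's formula images are not in the transcript; sensitivity, specificity, and precision are the measures usually presented at this point, sketched here from the confusion-matrix counts of the previous slide:

def sensitivity(tp, fn):
    return tp / (tp + fn)   # true positive rate: positives correctly recognised

def specificity(tn, fp):
    return tn / (tn + fp)   # true negative rate: negatives correctly recognised

def precision(tp, fp):
    return tp / (tp + fp)   # fraction of positive predictions that are correct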

40
ROC Curves
  • Receiver Operating Characteristic (ROC) curves
    were originally used to make sense of noisy radio
    signals
  • Can be used to help us talk about classifier
    performance and determine the best operating
    point for a classifier

41
ROC Curves (cont)
  • Consider how the relationship between true
    positives and false positives can change
  • We need to choose the best operating point

For some great ROC curve examples have a look here
42
ROC Curves (cont)
  • ROC curves can be used to compare classifiers
  • The greater the area under the curve the more
    accurate the classifier
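A minimal sketch of tracing the curve and measuring the area under it by the trapezoid rule; the example labels and scores are illustrative assumptions:

def roc_auc(labels, scores):
    # labels: 1 = positive, 0 = negative; a higher score means "more positive"
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]            # (false positive rate, true positive rate)
    for _, label in ranked:          # sweep the decision threshold downwards
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return sum((x2 - x1) * (y1 + y2) / 2  # trapezoid rule over the curve
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

print(roc_auc([1, 1, 0, 1, 0, 0], [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]))  # 0.889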

43
Over-Fitting
  • When we train a classifier we are trying to
    learn a function approximated by the training
    data we happen to use
  • What if the training data doesn't cover the whole
    problem space?
  • We can learn the training data too closely, which
    hampers the ability to generalise
  • This problem is known as over-fitting
  • Depending on the type of classifier used there
    are different approaches to avoiding this

44
Ensembles
  • In order to improve classification accuracy we
    can aggregate the results of an ensemble of
    classifiers

45
Bagging
  • Given a set S of s samples
  • Generate a bootstrap sample T from S
  • Cases in S may not appear in T or may appear more
    than once
  • Repeat this sampling procedure, getting a
    sequence of k independent training sets
  • A corresponding sequence of classifiers C1,
    C2, ..., Ck is constructed for each of these
    training sets, by using the same classification
    algorithm

46
Bagging (cont)
  • To classify an unknown sample X, let each
    classifier predict or vote
  • The Bagged Classifier C counts the votes and
    assigns X to the class with the most votes
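A minimal sketch of both steps; train_classifier is a hypothetical stand-in for the shared base learning algorithm, and a trained classifier is assumed to be callable on a sample:

import random
from collections import Counter

def bag(S, k, train_classifier, seed=0):
    rng = random.Random(seed)
    classifiers = []
    for _ in range(k):
        # Bootstrap sample T: |S| cases drawn with replacement, so a case
        # may appear more than once in T, or not at all
        T = [rng.choice(S) for _ in S]
        classifiers.append(train_classifier(T))
    return classifiers

def bagged_classify(classifiers, X):
    votes = Counter(c(X) for c in classifiers)  # let each classifier vote
    return votes.most_common(1)[0][0]           # class with the most votes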

47
Boosting Technique Algorithm
  • Assign every example an equal weight 1/N
  • For t = 1, 2, ..., T do
  • Obtain a hypothesis (classifier) h(t) under w(t)
  • Calculate the error of h(t) and re-weight the
    examples based on the error. Each classifier is
    dependent on the previous ones; samples that are
    incorrectly predicted are weighted more heavily
  • Normalize w(t+1) to sum to 1 (weights assigned to
    different classifiers sum to 1)
  • Output a weighted sum of all the hypotheses, with
    each hypothesis weighted according to its
    accuracy on the training set
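A minimal sketch of this loop in the AdaBoost style; train_weak is a hypothetical stand-in for the base learner, class labels are assumed to be +1/-1, and the weighted error is assumed to satisfy 0 < err < 1:

import math

def boost(examples, labels, T, train_weak):
    N = len(examples)
    w = [1.0 / N] * N                            # equal initial weights 1/N
    hypotheses = []
    for _ in range(T):
        h = train_weak(examples, labels, w)      # hypothesis h(t) under w(t)
        err = sum(wi for wi, x, y in zip(w, examples, labels) if h(x) != y)
        alpha = 0.5 * math.log((1 - err) / err)  # weight by training accuracy
        # Incorrectly predicted samples are weighted more heavily
        w = [wi * math.exp(-alpha * y * h(x))
             for wi, x, y in zip(w, examples, labels)]
        total = sum(w)
        w = [wi / total for wi in w]             # normalise w(t+1) to sum to 1
        hypotheses.append((alpha, h))
    # Output: a weighted sum (vote) of all the hypotheses
    return lambda x: 1 if sum(a * h(x) for a, h in hypotheses) > 0 else -1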

48
Summary
  • Classification is an extensively studied problem
  • Mainly in statistics and machine learning
  • Classification is probably one of the most widely
    used data mining techniques
  • Scalability is still an important issue for
    database applications
  • Research directions: classification of
    non-relational data, e.g. text, spatial,
    multimedia, etc.

49
Questions?
  • ?