Title: Classification
1. Classification
- A task of induction to find patterns
2. Outline
- Data and its format
- Problem of Classification
- Learning a classifier
- Different approaches
- Key issues
3. Data and its format
- Data
- attribute-value pairs
- with/without class
- Data type
- continuous/discrete
- nominal
- Data format
- Flat
- If not flat, what should we do?
4. Sample data
5. Induction from databases
- Inferring knowledge from data
- The task of deduction
- infer information that is a logical consequence of querying a database
- Who conducted this class before?
- Which courses are attended by Mary?
- Deductive databases extending the RDBMS
6. Classification
- It is one type of induction
- data with class labels
- Examples:
- If weather is rainy then no golf
- If ...
- If ...
7. Different approaches
- There exist many techniques
- Decision trees
- Neural networks
- K-nearest neighbours
- Naïve Bayesian classifiers
- Support Vector Machines
- Ensemble methods
- Semi-supervised
- and many more ...
8. A decision tree
9. Inducing a decision tree
- There are many possible trees
- let's try it on the golfing data
- How to find the most compact one that is consistent with the data (i.e., accurate)?
- Why the most compact?
- Occam's razor principle
- Issue of efficiency w.r.t. optimality
- How to find an optimal tree?
- Is there any need for a quick review of basic probability theory?
10. Information gain
- Entropy: entropy(S) = -p1 log2 p1 - p2 log2 p2 - ... - pn log2 pn, where pi is the proportion of class i among the instances in node S
- Information gain: the difference between the entropy of the node before splitting and the weighted sum of the entropies of the nodes after splitting
11. Building a compact tree
- The key to building a decision tree: which attribute to choose in order to branch.
- The heuristic is to choose the attribute with the maximum information gain (a small sketch follows this slide).
- Another way to put it: reduce uncertainty as much as possible.
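To make the heuristic concrete, here is a minimal Python sketch (not from the slides) that computes entropy and information gain for one attribute; the golf-style records and attribute names are illustrative assumptions, not the course dataset.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels: -sum p_i * log2(p_i)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr, class_attr="Play"):
    """Entropy before splitting minus weighted entropy after splitting on attr."""
    before = entropy([r[class_attr] for r in rows])
    after = 0.0
    for value in set(r[attr] for r in rows):
        subset = [r[class_attr] for r in rows if r[attr] == value]
        after += len(subset) / len(rows) * entropy(subset)
    return before - after

# Illustrative golf-style records (made up for the sketch).
data = [
    {"Outlook": "sunny",    "Wind": "weak",   "Play": "No"},
    {"Outlook": "sunny",    "Wind": "strong", "Play": "No"},
    {"Outlook": "overcast", "Wind": "weak",   "Play": "Yes"},
    {"Outlook": "rain",     "Wind": "weak",   "Play": "Yes"},
    {"Outlook": "rain",     "Wind": "strong", "Play": "No"},
]
for a in ("Outlook", "Wind"):
    print(a, round(information_gain(data, a), 3))
```

The attribute with the larger printed value is the one the heuristic would branch on first.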
12. Learning a decision tree
Should Outlook be chosen first? If not, which one should?
[Figure: a decision tree with Outlook at the root (branches sunny, overcast, rain); the sunny branch splits on Humidity (high, normal) and the rain branch splits on Wind (strong, weak), ending in Yes/No leaves.]
13. Issues of Decision Trees
- Number of values of an attribute
- Your solution?
- When to stop
- Data fragmentation problem
- Any solution?
- Mixed data types
- Scalability
14. Rules and Tree stumps
- Generating rules from decision trees
- One path is a rule
- We can do better. Why?
- Tree stumps and 1R
- For each attribute value, determine a default (most frequent) class; this gives one rule per value
- Calculate the number of errors for each rule
- Find the total number of errors for that attribute's rule set
- For n attributes, there are n rule sets
- Choose the rule set that has the fewest errors
- Let's go back to our example data and learn a 1R rule (a sketch follows this slide)
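A minimal Python sketch of 1R, following the steps above; the small weather-style dataset and attribute names are assumptions for illustration only.

```python
from collections import Counter, defaultdict

def one_r(rows, attributes, class_attr):
    """For each attribute, build one rule per value (majority class) and
    pick the attribute whose rule set makes the fewest errors."""
    best = None
    for attr in attributes:
        by_value = defaultdict(list)
        for r in rows:
            by_value[r[attr]].append(r[class_attr])
        rules, errors = {}, 0
        for value, labels in by_value.items():
            majority, count = Counter(labels).most_common(1)[0]
            rules[value] = majority
            errors += len(labels) - count          # instances this rule misclassifies
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best                                    # (attribute, {value: class}, total errors)

data = [
    {"Outlook": "sunny",    "Wind": "weak",   "Play": "No"},
    {"Outlook": "sunny",    "Wind": "strong", "Play": "No"},
    {"Outlook": "overcast", "Wind": "weak",   "Play": "Yes"},
    {"Outlook": "rain",     "Wind": "weak",   "Play": "Yes"},
    {"Outlook": "rain",     "Wind": "strong", "Play": "No"},
]
print(one_r(data, ["Outlook", "Wind"], "Play"))
```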
15. K-Nearest Neighbor
- One of the most intuitive classification algorithms
- An unseen instance's class is determined by its nearest neighbor
- The problem is that it is sensitive to noise
- Instead of using one neighbor, we can use k neighbors (a sketch follows this slide)
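A minimal k-NN sketch in plain Python; the toy 2-D points and the choice of k are illustrative assumptions. It also shows the noise/majority point from the next slide: a single noisy neighbor can mislead 1-NN, while k=3 recovers the majority class.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify query by majority vote among the k closest training points."""
    nearest = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy 2-D data: one noisy "B" sits inside the "A" cluster.
train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"), ((1.5, 1.5), "B"),
         ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B")]
print(knn_predict(train, (1.4, 1.4), k=1))   # "B": 1-NN follows the noisy point
print(knn_predict(train, (1.4, 1.4), k=3))   # "A": 3 neighbors vote out the noise
```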
16. K-NN
- New problems
- How large should k be?
- lazy learning: does it learn?
- large storage
- A toy example (noise, majority)
- How good is k-NN?
- How to compare
- Speed
- Accuracy
17. Naïve Bayes Classifier
- This is a direct application of Bayes' rule
- P(C|X) = P(X|C) P(C) / P(X)
- X is a vector of (x1, x2, ..., xn)
- That's the best classifier we can build
- But there are problems
- There are only a limited number of instances
- How to estimate P(X|C)?
- Your suggestions?
18. NBC (2)
- Assume conditional independence between the xi's
- We have
- P(C|x) = P(x1|C) ... P(xi|C) ... P(xn|C) P(C)
- What's missing? Is it really correct? Why?
- An example (Golfing or not)
- How good is it in reality?
- Even when the assumption does not hold
- How to update an NBC when new data streams in?
- What if one of the P(xi|C) is 0?
- Laplace estimator: adding 1 to each count (a sketch follows this slide)
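A minimal naïve Bayes sketch for categorical attributes with the Laplace (add-one) estimator, written from the description above; the tiny dataset and attribute names are assumptions for illustration.

```python
from collections import Counter, defaultdict

def train_nbc(rows, attributes, class_attr):
    """Estimate P(C) and the counts needed for P(xi|C), for add-one smoothing."""
    class_counts = Counter(r[class_attr] for r in rows)
    cond_counts = defaultdict(Counter)               # (attr, class) -> Counter of values
    values = {a: set(r[a] for r in rows) for a in attributes}
    for r in rows:
        for a in attributes:
            cond_counts[(a, r[class_attr])][r[a]] += 1
    return class_counts, cond_counts, values, len(rows)

def predict_nbc(model, instance, attributes):
    class_counts, cond_counts, values, n = model
    best_class, best_score = None, -1.0
    for c, cc in class_counts.items():
        score = cc / n                               # P(C)
        for a in attributes:                         # product of Laplace-smoothed P(xi|C)
            count = cond_counts[(a, c)][instance[a]]
            score *= (count + 1) / (cc + len(values[a]))
        if score > best_score:
            best_class, best_score = c, score
    return best_class

data = [
    {"Outlook": "sunny",    "Wind": "weak",   "Play": "No"},
    {"Outlook": "overcast", "Wind": "weak",   "Play": "Yes"},
    {"Outlook": "rain",     "Wind": "weak",   "Play": "Yes"},
    {"Outlook": "rain",     "Wind": "strong", "Play": "No"},
]
model = train_nbc(data, ["Outlook", "Wind"], "Play")
print(predict_nbc(model, {"Outlook": "sunny", "Wind": "strong"}, ["Outlook", "Wind"]))
```

Note that an attribute value never seen with a class still gets a small non-zero probability, which is exactly what the add-one count buys.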
19. No Free Lunch
- If the goal is to obtain good generalization performance, there are no context-independent or usage-independent reasons to favor one learning or classification method over another.
- http://en.wikipedia.org/wiki/No-Free-Lunch_theorems
- What does it indicate?
- Or is it easy to choose a good classifier for your application?
- Again, there is no off-the-shelf solution for a reasonably challenging application.
20. Ensemble Methods
- Motivation
- Improve the stability of classification (a bagging-with-voting sketch follows this slide)
- Model generation
- Bagging (Bootstrap Aggregating)
- Boosting
- Model combination
- Majority voting
- Meta learning
- Stacking (using different types of classifiers)
- Examples (classify-ensemble.ppt)
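A minimal sketch of bagging with majority voting. It assumes scikit-learn decision trees as the base models and the iris data purely for convenience; neither is named in the slides, and any base learner and dataset would do.

```python
import numpy as np
from collections import Counter
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

# Model generation: train each tree on a bootstrap sample (sampling with replacement).
models = []
for _ in range(10):
    idx = rng.integers(0, len(X), size=len(X))
    models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Model combination: majority voting over the individual predictions.
def vote(x):
    preds = [m.predict([x])[0] for m in models]
    return Counter(preds).most_common(1)[0][0]

print(sum(vote(x) == label for x, label in zip(X, y)) / len(X))  # ensemble training accuracy
```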
21. AdaBoost.M1 (from the Weka book)
Model generation
- Assign equal weight to each training instance
- For t iterations:
- Apply the learning algorithm to the weighted dataset and store the resulting model
- Compute the model's error e on the weighted dataset
- If e = 0 or e > 0.5, terminate model generation
- For each instance in the dataset:
- If classified correctly by the model, multiply the instance's weight by e/(1-e)
- Normalize the weights of all instances
Classification
- Assign weight 0 to all classes
- For each of the t models (or fewer):
- For the class this model predicts, add -log(e/(1-e)) to this class's weight
- Return the class with the highest weight
(a runnable sketch follows this slide)
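A runnable Python sketch of the weight-update scheme above. Using scikit-learn decision stumps as the weak learner and the breast-cancer data are assumptions for illustration; any learner that accepts instance weights would work.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

def adaboost_m1(X, y, t=20):
    n = len(y)
    w = np.full(n, 1.0 / n)                      # equal initial weights
    models, alphas = [], []
    for _ in range(t):
        m = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = m.predict(X)
        e = w[pred != y].sum()                   # error on the weighted dataset
        if e == 0 or e > 0.5:
            break
        w[pred == y] *= e / (1 - e)              # shrink weights of correctly classified instances
        w /= w.sum()                             # normalize
        models.append(m)
        alphas.append(-np.log(e / (1 - e)))      # vote weight used at classification time
    return models, alphas

def classify(models, alphas, x):
    scores = {}
    for m, a in zip(models, alphas):
        c = m.predict([x])[0]
        scores[c] = scores.get(c, 0.0) + a       # add -log(e/(1-e)) to the predicted class
    return max(scores, key=scores.get)

X, y = load_breast_cancer(return_X_y=True)
models, alphas = adaboost_m1(X, y)
print(sum(classify(models, alphas, x) == label for x, label in zip(X, y)) / len(X))
```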
22. Using many different classifiers
- We have learned some basic and often-used classifiers
- There are many more out there.
- Regression
- Discriminant analysis
- Neural networks
- Support vector machines
- Pick the most suitable one for an application
- Where to find all these classifiers?
- Don't reinvent a wheel that is not as round
- We will likely come back to classification and discuss support vector machines as requested
23. Assignment 3
- Questions about classification and evaluation (deadline 2/14, Wednesday)
- Manually create a decision tree for the golfing data (D)
- Manually create an NBC for D
- How would you create 1-NN for D? Discuss your thoughts.
- Run your decision tree algorithm (if you don't want to implement your own, you can use an available one) on D using 10-fold cross-validation (or leave-one-out for this particular D) and 5x2-fold cross-validation
- Discuss the differences between the above two evaluation methods
24. Some software for demo or for teaching
- C4.5 at the Rulequest site: http://www.rulequest.com/download.html
- The free demo versions of Magnum Opus (for association rule mining) can be downloaded from the Rulequest site
- Alphaminer (you probably will like it) at http://www.eti.hku.hk/alphaminer/
- WEKA: http://www.cs.waikato.ac.nz/ml/weka/
25. Classification via Neural Networks
[Figure: a perceptron: weighted inputs are summed and passed through a squashing function to produce the output.]
26. What can a perceptron do?
- Neuron as a computing device
- To separate linearly separable points
- Nice things about a perceptron
- distributed representation
- local learning
- weight adjusting
27. Linear threshold unit
- Basic concepts: projection, thresholding
[Figure: a weight vector W (example values .11, .6), an input vector L (.7, .7), and a threshold (.5); the shaded region marks the input vectors whose projection onto W exceeds the threshold, i.e., the vectors that evoke output 1.]
28. E.g. 1: solution region for the AND problem
- Find a weight vector that satisfies all the constraints

AND problem:
x1 x2 | output
 0  0 |   0
 0  1 |   0
 1  0 |   0
 1  1 |   1
29. E.g. 2: Solution region for the XOR problem?

XOR problem:
x1 x2 | output
 0  0 |   0
 0  1 |   1
 1  0 |   1
 1  1 |   0
30. Learning by error reduction
- Perceptron learning algorithm
- If the activation level of the output unit is 1 when it should be 0, reduce the weight on the link to the ith input unit by rLi, where Li is the ith input value and r is a learning rate
- If the activation level of the output unit is 0 when it should be 1, increase the weight on the link to the ith input unit by rLi
- Otherwise, do nothing
(a sketch follows this slide)
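A minimal Python sketch of this error-reduction rule on the AND problem from slide 28; the learning rate, epoch count, and zero initialization are arbitrary choices for the example.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
targets = np.array([0, 0, 0, 1])          # AND problem
w = np.zeros(2)
theta = 0.0                               # threshold, adjusted alongside the weights
r = 0.1                                   # learning rate

for epoch in range(25):
    for x, t in zip(X, targets):
        out = 1 if w @ x > theta else 0   # linear threshold unit
        if out == 1 and t == 0:           # fired when it should not: reduce weights by r*Li
            w -= r * x
            theta += r
        elif out == 0 and t == 1:         # silent when it should fire: increase weights by r*Li
            w += r * x
            theta -= r
        # otherwise, do nothing

print(w, theta, [1 if w @ x > theta else 0 for x in X])
```

After a few epochs the outputs match the AND targets; running the same loop on the XOR targets never converges, which is the point of slide 29.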
31. Multi-layer perceptrons
- Using the chain rule, we can back-propagate the errors through a multi-layer perceptron (a sketch follows this slide).
[Figure: a multi-layer perceptron with an input layer, a hidden layer, and an output layer.]
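A small numpy sketch of back-propagation on the XOR problem from slide 29, with one hidden layer of sigmoid units. The layer sizes, learning rate, and epoch count are illustrative assumptions, and it may need a different random seed or more epochs to converge.

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([[0], [1], [1], [0]], dtype=float)          # XOR targets

W1, b1 = rng.normal(0, 1, (2, 3)), np.zeros(3)           # input -> hidden
W2, b2 = rng.normal(0, 1, (3, 1)), np.zeros(1)           # hidden -> output
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 0.5

for epoch in range(10000):
    # forward pass
    h = sigmoid(X @ W1 + b1)
    y = sigmoid(h @ W2 + b2)
    # backward pass: the chain rule gives each layer's error signal (delta)
    d_out = (y - t) * y * (1 - y)
    d_hid = (d_out @ W2.T) * h * (1 - h)
    # gradient-descent weight updates
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_hid; b1 -= lr * d_hid.sum(axis=0)

print(np.round(y.ravel(), 2))   # should end up close to [0, 1, 1, 0]
```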