Title: Classification
1. Classification
- A task of induction to find patterns
2. Outline
- Data and its format
- Problem of Classification
- Learning a classifier
- Different approaches
- Key issues
3. Data and its format
- Data
- attribute-value pairs
- with/without class
- Data type
- continuous/discrete
- nominal
- Data format
- Flat
- If not flat, what should we do?
4. Sample data
5. Induction from databases
- Inferring knowledge from data
- The task of deduction
- infer information that is a logical consequence of querying a database
- Who conducted this class before?
- Which courses are attended by Mary?
- Deductive databases extending the RDBMS
6. Classification
- It is one type of induction
- data with class labels
- Examples:
- If weather is rainy then no golf
- If ...
- If ...
7. Different approaches
- There exist many techniques
- Decision trees
- Neural networks
- K-nearest neighbours
- Naïve Bayesian classifiers
- Support Vector Machines
- Ensemble methods
- Semi-supervised
- and many more ...
8. A decision tree
9. Inducing a decision tree
- There are many possible trees
- let's try it on the golfing data
- How to find the most compact one that is consistent with the data (i.e., accurate)?
- Why the most compact?
- Occam's razor principle
- Issue of efficiency w.r.t. optimality
- How to find an optimal tree?
- Is there any need for a quick review of basic probability theory?
10. Information gain
- Entropy: entropy(S) = -p1 log2 p1 - p2 log2 p2 - ... - pn log2 pn, where pi is the proportion of class i among the instances in node S
- Information gain: the difference between the entropy of the node before splitting and the weighted sum of the entropies of the nodes after splitting
11. Building a compact tree
- The key to building a decision tree: which attribute to choose in order to branch.
- The heuristic is to choose the attribute with the maximum information gain (a small sketch follows this slide).
- Another way to put it: reduce uncertainty as much as possible.
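To make the heuristic concrete, here is a minimal Python sketch (not from the slides) that computes entropy and information gain for one attribute; the golf-style records and attribute names are illustrative assumptions, not the course dataset.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels: -sum p_i * log2(p_i)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr, class_attr="Play"):
    """Entropy before splitting minus weighted entropy after splitting on attr."""
    before = entropy([r[class_attr] for r in rows])
    after = 0.0
    for value in set(r[attr] for r in rows):
        subset = [r[class_attr] for r in rows if r[attr] == value]
        after += len(subset) / len(rows) * entropy(subset)
    return before - after

# Illustrative golf-style records (made up for the sketch).
data = [
    {"Outlook": "sunny",    "Wind": "weak",   "Play": "No"},
    {"Outlook": "sunny",    "Wind": "strong", "Play": "No"},
    {"Outlook": "overcast", "Wind": "weak",   "Play": "Yes"},
    {"Outlook": "rain",     "Wind": "weak",   "Play": "Yes"},
    {"Outlook": "rain",     "Wind": "strong", "Play": "No"},
]
for a in ("Outlook", "Wind"):
    print(a, round(information_gain(data, a), 3))
```

The attribute with the larger printed value is the one the heuristic would branch on first.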
12. Learning a decision tree
Should Outlook be chosen first? If not, which one should?
[Figure: a decision tree with Outlook at the root (branches sunny, overcast, rain); the sunny branch splits on Humidity (high, normal) and the rain branch splits on Wind (strong, weak), ending in Yes/No leaves.]
13. Issues of Decision Trees
- Number of values of an attribute
- Your solution?
- When to stop
- Data fragmentation problem
- Any solution?
- Mixed data types
- Scalability
14. Rules and Tree stumps
- Generating rules from decision trees
- One path is a rule
- We can do better. Why?
- Tree stumps and 1R
- For each attribute value, determine a default (most frequent) class; this gives one rule per value
- Calculate the number of errors for each rule
- Find the total number of errors for that attribute's rule set
- For n attributes, there are n rule sets
- Choose the rule set that has the fewest errors
- Let's go back to our example data and learn a 1R rule (a sketch follows this slide)
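A minimal Python sketch of 1R, following the steps above; the small weather-style dataset and attribute names are assumptions for illustration only.

```python
from collections import Counter, defaultdict

def one_r(rows, attributes, class_attr):
    """For each attribute, build one rule per value (majority class) and
    pick the attribute whose rule set makes the fewest errors."""
    best = None
    for attr in attributes:
        by_value = defaultdict(list)
        for r in rows:
            by_value[r[attr]].append(r[class_attr])
        rules, errors = {}, 0
        for value, labels in by_value.items():
            majority, count = Counter(labels).most_common(1)[0]
            rules[value] = majority
            errors += len(labels) - count          # instances this rule misclassifies
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best                                    # (attribute, {value: class}, total errors)

data = [
    {"Outlook": "sunny",    "Wind": "weak",   "Play": "No"},
    {"Outlook": "sunny",    "Wind": "strong", "Play": "No"},
    {"Outlook": "overcast", "Wind": "weak",   "Play": "Yes"},
    {"Outlook": "rain",     "Wind": "weak",   "Play": "Yes"},
    {"Outlook": "rain",     "Wind": "strong", "Play": "No"},
]
print(one_r(data, ["Outlook", "Wind"], "Play"))
```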
15. K-Nearest Neighbor
- One of the most intuitive classification algorithms
- An unseen instance's class is determined by its nearest neighbor
- The problem is that it is sensitive to noise
- Instead of using one neighbor, we can use k neighbors (a sketch follows this slide)
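A minimal k-NN sketch in plain Python; the toy 2-D points and the choice of k are illustrative assumptions. It also shows the noise/majority point from the next slide: a single noisy neighbor can mislead 1-NN, while k=3 recovers the majority class.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify query by majority vote among the k closest training points."""
    nearest = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy 2-D data: one noisy "B" sits inside the "A" cluster.
train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"), ((1.5, 1.5), "B"),
         ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B")]
print(knn_predict(train, (1.4, 1.4), k=1))   # "B": 1-NN follows the noisy point
print(knn_predict(train, (1.4, 1.4), k=3))   # "A": 3 neighbors vote out the noise
```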
16. K-NN
- New problems
- How large should k be?
- lazy learning: does it learn?
- large storage
- A toy example (noise, majority)
- How good is k-NN?
- How to compare
- Speed
- Accuracy
17. Naïve Bayes Classifier
- This is a direct application of Bayes' rule
- P(C|X) = P(X|C) P(C) / P(X)
- X is a vector of (x1, x2, ..., xn)
- That's the best classifier we can build
- But there are problems
- There are only a limited number of instances
- How to estimate P(X|C)?
- Your suggestions?
18. NBC (2)
- Assume conditional independence between the xi's
- We have
- P(C|x) = P(x1|C) ... P(xi|C) ... P(xn|C) P(C)
- What's missing? Is it really correct? Why?
- An example (Golfing or not)
- How good is it in reality?
- Even when the assumption does not hold
- How to update an NBC when new data streams in?
- What if one of the P(xi|C) is 0?
- Laplace estimator: adding 1 to each count (a sketch follows this slide)
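A minimal naïve Bayes sketch for categorical attributes with the Laplace (add-one) estimator, written from the description above; the tiny dataset and attribute names are assumptions for illustration.

```python
from collections import Counter, defaultdict

def train_nbc(rows, attributes, class_attr):
    """Estimate P(C) and the counts needed for P(xi|C), for add-one smoothing."""
    class_counts = Counter(r[class_attr] for r in rows)
    cond_counts = defaultdict(Counter)               # (attr, class) -> Counter of values
    values = {a: set(r[a] for r in rows) for a in attributes}
    for r in rows:
        for a in attributes:
            cond_counts[(a, r[class_attr])][r[a]] += 1
    return class_counts, cond_counts, values, len(rows)

def predict_nbc(model, instance, attributes):
    class_counts, cond_counts, values, n = model
    best_class, best_score = None, -1.0
    for c, cc in class_counts.items():
        score = cc / n                               # P(C)
        for a in attributes:                         # product of Laplace-smoothed P(xi|C)
            count = cond_counts[(a, c)][instance[a]]
            score *= (count + 1) / (cc + len(values[a]))
        if score > best_score:
            best_class, best_score = c, score
    return best_class

data = [
    {"Outlook": "sunny",    "Wind": "weak",   "Play": "No"},
    {"Outlook": "overcast", "Wind": "weak",   "Play": "Yes"},
    {"Outlook": "rain",     "Wind": "weak",   "Play": "Yes"},
    {"Outlook": "rain",     "Wind": "strong", "Play": "No"},
]
model = train_nbc(data, ["Outlook", "Wind"], "Play")
print(predict_nbc(model, {"Outlook": "sunny", "Wind": "strong"}, ["Outlook", "Wind"]))
```

Note that an attribute value never seen with a class still gets a small non-zero probability, which is exactly what the add-one count buys.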
19. No Free Lunch
- If the goal is to obtain good generalization performance, there are no context-independent or usage-independent reasons to favor one learning or classification method over another.
- http://en.wikipedia.org/wiki/No-Free-Lunch_theorems
- What does it indicate?
- Or is it easy to choose a good classifier for your application?
- Again, there is no off-the-shelf solution for a reasonably challenging application.
20. Ensemble Methods
- Motivation
- Improve the stability of classification (a bagging-with-voting sketch follows this slide)
- Model generation
- Bagging (Bootstrap Aggregating)
- Boosting
- Model combination
- Majority voting
- Meta learning
- Stacking (using different types of classifiers)
- Examples (classify-ensemble.ppt)
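A minimal sketch of bagging with majority voting. It assumes scikit-learn decision trees as the base models and the iris data purely for convenience; neither is named in the slides, and any base learner and dataset would do.

```python
import numpy as np
from collections import Counter
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

# Model generation: train each tree on a bootstrap sample (sampling with replacement).
models = []
for _ in range(10):
    idx = rng.integers(0, len(X), size=len(X))
    models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Model combination: majority voting over the individual predictions.
def vote(x):
    preds = [m.predict([x])[0] for m in models]
    return Counter(preds).most_common(1)[0][0]

print(sum(vote(x) == label for x, label in zip(X, y)) / len(X))  # ensemble training accuracy
```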
21. AdaBoost.M1 (from the Weka book)
Model generation
- Assign equal weight to each training instance
- For t iterations:
- Apply the learning algorithm to the weighted dataset and store the resulting model
- Compute the model's error e on the weighted dataset
- If e = 0 or e > 0.5, terminate model generation
- For each instance in the dataset:
- If classified correctly by the model, multiply the instance's weight by e/(1-e)
- Normalize the weights of all instances
Classification
- Assign weight 0 to all classes
- For each of the t models (or fewer):
- For the class this model predicts, add -log(e/(1-e)) to this class's weight
- Return the class with the highest weight
(a runnable sketch follows this slide)
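A runnable Python sketch of the weight-update scheme above. Using scikit-learn decision stumps as the weak learner and the breast-cancer data are assumptions for illustration; any learner that accepts instance weights would work.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

def adaboost_m1(X, y, t=20):
    n = len(y)
    w = np.full(n, 1.0 / n)                      # equal initial weights
    models, alphas = [], []
    for _ in range(t):
        m = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = m.predict(X)
        e = w[pred != y].sum()                   # error on the weighted dataset
        if e == 0 or e > 0.5:
            break
        w[pred == y] *= e / (1 - e)              # shrink weights of correctly classified instances
        w /= w.sum()                             # normalize
        models.append(m)
        alphas.append(-np.log(e / (1 - e)))      # vote weight used at classification time
    return models, alphas

def classify(models, alphas, x):
    scores = {}
    for m, a in zip(models, alphas):
        c = m.predict([x])[0]
        scores[c] = scores.get(c, 0.0) + a       # add -log(e/(1-e)) to the predicted class
    return max(scores, key=scores.get)

X, y = load_breast_cancer(return_X_y=True)
models, alphas = adaboost_m1(X, y)
print(sum(classify(models, alphas, x) == label for x, label in zip(X, y)) / len(X))
```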
22. Using many different classifiers
- We have learned some basic and often-used classifiers
- There are many more out there.
- Regression
- Discriminant analysis
- Neural networks
- Support vector machines
- Pick the most suitable one for an application
- Where to find all these classifiers?
- Don't reinvent a wheel that is not as round
- We will likely come back to classification and discuss support vector machines as requested
23. Assignment 3
- Questions about classification and evaluation (deadline 2/14, Wednesday)
- Manually create a decision tree for the golfing data (D)
- Manually create an NBC for D
- How would you create 1-NN for D? Discuss your thoughts.
- Run your decision tree algorithm (if you don't want to implement your own, you can use an available one) on D using 10-fold cross-validation (or leave-one-out for this particular D) and 5x2-fold cross-validation
- Discuss the differences between the above two evaluation methods
24. Some software for demo or for teaching
- C4.5 at the Rulequest site: http://www.rulequest.com/download.html
- The free demo versions of Magnum Opus (for association rule mining) can be downloaded from the Rulequest site
- Alphaminer (you probably will like it) at http://www.eti.hku.hk/alphaminer/
- WEKA: http://www.cs.waikato.ac.nz/ml/weka/
25. Classification via Neural Networks
[Figure: a perceptron: weighted inputs are summed and passed through a squashing function to produce the output.]
26. What can a perceptron do?
- Neuron as a computing device
- To separate linearly separable points
- Nice things about a perceptron
- distributed representation
- local learning
- weight adjusting
27. Linear threshold unit
- Basic concepts: projection, thresholding
[Figure: a weight vector W (example values .11, .6), an input vector L (.7, .7), and a threshold (.5); the shaded region marks the input vectors whose projection onto W exceeds the threshold, i.e., the vectors that evoke output 1.]
28. E.g. 1: solution region for the AND problem
- Find a weight vector that satisfies all the constraints

AND problem:
x1 x2 | output
 0  0 |   0
 0  1 |   0
 1  0 |   0
 1  1 |   1
29. E.g. 2: Solution region for the XOR problem?

XOR problem:
x1 x2 | output
 0  0 |   0
 0  1 |   1
 1  0 |   1
 1  1 |   0
30. Learning by error reduction
- Perceptron learning algorithm
- If the activation level of the output unit is 1 when it should be 0, reduce the weight on the link to the ith input unit by rLi, where Li is the ith input value and r is a learning rate
- If the activation level of the output unit is 0 when it should be 1, increase the weight on the link to the ith input unit by rLi
- Otherwise, do nothing
(a sketch follows this slide)
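A minimal Python sketch of this error-reduction rule on the AND problem from slide 28; the learning rate, epoch count, and zero initialization are arbitrary choices for the example.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
targets = np.array([0, 0, 0, 1])          # AND problem
w = np.zeros(2)
theta = 0.0                               # threshold, adjusted alongside the weights
r = 0.1                                   # learning rate

for epoch in range(25):
    for x, t in zip(X, targets):
        out = 1 if w @ x > theta else 0   # linear threshold unit
        if out == 1 and t == 0:           # fired when it should not: reduce weights by r*Li
            w -= r * x
            theta += r
        elif out == 0 and t == 1:         # silent when it should fire: increase weights by r*Li
            w += r * x
            theta -= r
        # otherwise, do nothing

print(w, theta, [1 if w @ x > theta else 0 for x in X])
```

After a few epochs the outputs match the AND targets; running the same loop on the XOR targets never converges, which is the point of slide 29.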
31. Multi-layer perceptrons
- Using the chain rule, we can back-propagate the errors through a multi-layer perceptron (a sketch follows this slide).
[Figure: a multi-layer perceptron with an input layer, a hidden layer, and an output layer.]
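A small numpy sketch of back-propagation on the XOR problem from slide 29, with one hidden layer of sigmoid units. The layer sizes, learning rate, and epoch count are illustrative assumptions, and it may need a different random seed or more epochs to converge.

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([[0], [1], [1], [0]], dtype=float)          # XOR targets

W1, b1 = rng.normal(0, 1, (2, 3)), np.zeros(3)           # input -> hidden
W2, b2 = rng.normal(0, 1, (3, 1)), np.zeros(1)           # hidden -> output
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 0.5

for epoch in range(10000):
    # forward pass
    h = sigmoid(X @ W1 + b1)
    y = sigmoid(h @ W2 + b2)
    # backward pass: the chain rule gives each layer's error signal (delta)
    d_out = (y - t) * y * (1 - y)
    d_hid = (d_out @ W2.T) * h * (1 - h)
    # gradient-descent weight updates
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_hid; b1 -= lr * d_hid.sum(axis=0)

print(np.round(y.ravel(), 2))   # should end up close to [0, 1, 1, 0]
```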