Title: Classification
1. Classification
- Today: Basic Problem
- Decision Trees
2. Classification Problem
- Given a database D = {t1, t2, ..., tn} and a set of classes C = {C1, ..., Cm}, the Classification Problem is to define a mapping f: D → C where each ti is assigned to one class.
- This actually divides D into equivalence classes.
- Prediction is similar, but may be viewed as having an infinite number of classes.
3. Classification Example: Grading
- If x >= 90 then grade = A.
- If 80 <= x < 90 then grade = B.
- If 70 <= x < 80 then grade = C.
- If 60 <= x < 70 then grade = D.
- If x < 50 then grade = F.
[Figure: decision tree assigning grades A-F by repeatedly splitting on the value of x]
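The grading rules above can be written directly as a classifier. A minimal sketch follows; the function name is illustrative, and since the slide leaves scores in [50, 60) unassigned, the sketch returns None there rather than guessing a class.

```python
def assign_grade(x):
    """Map a numeric score x to a letter grade using the rules on slide 3."""
    if x >= 90:
        return "A"
    if 80 <= x < 90:
        return "B"
    if 70 <= x < 80:
        return "C"
    if 60 <= x < 70:
        return "D"
    if x < 50:
        return "F"
    return None  # 50 <= x < 60 is not covered by the stated rules

print(assign_grade(85))  # B
```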
4. Classification Techniques
- Approach:
- Create a specific model by evaluating training data (or using domain experts' knowledge).
- Apply the model developed to new data.
- Classes must be predefined.
- Most common techniques use DTs, or are based on distances or statistical methods.
5. Defining Classes
6. Issues in Classification
- Missing Data
- Ignore
- Replace with assumed value
- Measuring Performance
- Classification accuracy on test data
- Confusion matrix
- OC Curve
7. Height Example Data
8. Classification Performance
- True Positive
- False Negative
- True Negative
- False Positive
9. Confusion Matrix Example
- Using the height data example, with Output1 the correct class and Output2 the actual assignment.
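A confusion matrix is just a tally of (correct, assigned) label pairs. A minimal sketch is below; the label lists are hypothetical values standing in for the height data's Output1 and Output2.

```python
from collections import Counter

def confusion_matrix(correct, assigned):
    """Count (correct_class, assigned_class) pairs."""
    return Counter(zip(correct, assigned))

# Hypothetical labels standing in for Output1 (correct) and Output2 (assigned)
output1 = ["Tall", "Short", "Medium", "Tall", "Medium"]
output2 = ["Tall", "Medium", "Medium", "Short", "Medium"]

for (truth, guess), count in confusion_matrix(output1, output2).items():
    print(f"correct={truth:6s} assigned={guess:6s} count={count}")
```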
10. Operating Characteristic Curve
11. Classification Using Decision Trees
- Partitioning based: divide the search space into rectangular regions.
- A tuple is placed into a class based on the region within which it falls.
- DT approaches differ in how the tree is built: DT Induction.
- Internal nodes are associated with an attribute, and arcs with values for that attribute.
- Algorithms: ID3, C4.5, CART
12. Decision Tree
- Given:
- D = {t1, ..., tn} where ti = <ti1, ..., tih>
- Database schema contains {A1, A2, ..., Ah}
- Classes C = {C1, ..., Cm}
- A Decision or Classification Tree is a tree associated with D such that:
- Each internal node is labeled with an attribute, Ai
- Each arc is labeled with a predicate which can be applied to the attribute at its parent
- Each leaf node is labeled with a class, Cj
13. DT Induction
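The induction procedure on this slide appears as a figure. A generic sketch of top-down DT induction, with the attribute-selection criterion left as a parameter, might look like the following; the names and data are illustrative, not taken from the slides.

```python
def build_tree(tuples, attributes, choose_attribute):
    """Generic top-down decision-tree induction.

    tuples: list of (dict_of_attribute_values, class_label)
    attributes: attributes still available for splitting
    choose_attribute: splitting criterion, e.g. information gain (ID3)
    """
    classes = {c for _, c in tuples}
    if len(classes) == 1:          # pure node: stop with that class
        return classes.pop()
    if not attributes:             # no attributes left: majority class
        return max(classes, key=lambda c: sum(1 for _, lbl in tuples if lbl == c))

    best = choose_attribute(tuples, attributes)
    node = {"attribute": best, "branches": {}}
    remaining = [a for a in attributes if a != best]
    for value in {t[best] for t, _ in tuples}:   # one arc per attribute value
        subset = [(t, c) for t, c in tuples if t[best] == value]
        node["branches"][value] = build_tree(subset, remaining, choose_attribute)
    return node

# Minimal usage with a trivial criterion (always split on the first attribute)
data = [({"Gender": "F", "Height": "short"}, "Short"),
        ({"Gender": "M", "Height": "tall"}, "Tall")]
print(build_tree(data, ["Gender", "Height"], lambda tpls, attrs: attrs[0]))
```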
14. DT Splits Area
[Figure: the search space divided into rectangular regions by splits on Gender (M/F) and Height]
15. Comparing DTs
[Figure: two decision trees for the same data, one balanced and one deep]
16. DT Issues
- Choosing Splitting Attributes
- Ordering of Splitting Attributes
- Splits
- Tree Structure
- Stopping Criteria
- Training Data
- Pruning
17. Information/Entropy
- Given probabilities p1, p2, ..., ps whose sum is 1, entropy is defined as H(p1, ..., ps) = sum_i pi log(1/pi).
- Entropy measures the amount of randomness or surprise or uncertainty.
- Goal in classification:
- no surprise
- entropy = 0
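A minimal sketch of the entropy formula above; base-10 logarithms are assumed here, since that is what the worked example on slide 19 uses.

```python
import math

def entropy(probabilities, base=10):
    """H(p1,...,ps) = sum of p_i * log(1/p_i); terms with p_i = 0 contribute 0."""
    return sum(p * math.log(1.0 / p, base) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))   # maximum surprise for two equally likely classes
print(entropy([1.0]))        # a sure outcome: entropy 0, no surprise
```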
18. ID3
- Creates the tree using information theory concepts and tries to reduce the expected number of comparisons.
- ID3 chooses the split attribute with the highest information gain.
19. ID3 Example (Output1)
- Starting state entropy:
- (4/15) log(15/4) + (8/15) log(15/8) + (3/15) log(15/3) = 0.4384
- Gain using gender:
- Female: (3/9) log(9/3) + (6/9) log(9/6) = 0.2764
- Male: (1/6) log(6/1) + (2/6) log(6/2) + (3/6) log(6/3) = 0.4392
- Weighted sum: (9/15)(0.2764) + (6/15)(0.4392) = 0.34152
- Gain: 0.4384 - 0.34152 = 0.09688
- Gain using height:
- 0.4384 - (2/15)(0.301) = 0.3983
- Choose height as the first splitting attribute
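The arithmetic above can be reproduced from class counts. A short sketch, with the counts (4/8/3 overall, 3/6 for female, 1/2/3 for male) read off this slide and base-10 logarithms as in the worked figures:

```python
import math

def entropy(counts, base=10):
    """Entropy of a class-count distribution (base-10 logs, as in the slides)."""
    total = sum(counts)
    return sum((c / total) * math.log(total / c, base) for c in counts if c > 0)

# Class counts read off the slide: overall 4/8/3; Female 3/6; Male 1/2/3
start = entropy([4, 8, 3])                                 # ~0.438
weighted = (9 / 15) * entropy([3, 6]) + (6 / 15) * entropy([1, 2, 3])
print("gain on gender ~", round(start - weighted, 4))      # ~0.097
```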
20. C4.5
- ID3 favors attributes with a large number of divisions
- Improved version of ID3:
- Missing Data
- Continuous Data
- Pruning
- Rules
- GainRatio
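GainRatio addresses ID3's bias toward many-valued attributes by dividing the gain by the entropy of the subset proportions. A hedged sketch of that common definition is below; the numbers in the usage line reuse the gender split from slide 19.

```python
import math

def entropy(probs, base=10):
    return sum(p * math.log(1.0 / p, base) for p in probs if p > 0)

def gain_ratio(gain, subset_sizes):
    """GainRatio = Gain / H(|D1|/|D|, ..., |Ds|/|D|)."""
    total = sum(subset_sizes)
    split_info = entropy([s / total for s in subset_sizes])
    return gain / split_info if split_info > 0 else 0.0

# e.g. the gender split from the ID3 example: subsets of size 9 and 6
print(gain_ratio(0.09688, [9, 6]))
```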
21. CART
- Creates a binary tree
- Uses entropy
- Formula to choose the split point, s, for node t: Phi(s/t) = 2 P_L P_R sum_j |P(Cj|tL) - P(Cj|tR)|
- P_L, P_R: probability that a tuple in the training set will be on the left or right side of the tree.
22. CART Example
- At the start, there are six choices for the split point (right branch on equality):
- P(Gender) = 2(6/15)(9/15)(2/15 + 4/15 + 3/15) = 0.224
- P(1.6) = 0
- P(1.7) = 2(2/15)(13/15)(0 + 8/15 + 3/15) = 0.169
- P(1.8) = 2(5/15)(10/15)(4/15 + 6/15 + 3/15) = 0.385
- P(1.9) = 2(9/15)(6/15)(4/15 + 2/15 + 3/15) = 0.256
- P(2.0) = 2(12/15)(3/15)(4/15 + 8/15 + 3/15) = 0.32
- Split at 1.8
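A sketch of the split measure from slide 21, with illustrative per-class counts on each side of a candidate split; hand-worked values can differ slightly from the figures above depending on how the class probabilities are normalized.

```python
def cart_split_measure(left_counts, right_counts):
    """Phi(s/t) = 2 * P_L * P_R * sum_j |P(Cj|tL) - P(Cj|tR)|."""
    n_left, n_right = sum(left_counts), sum(right_counts)
    total = n_left + n_right
    p_l, p_r = n_left / total, n_right / total
    diff = sum(abs(l / n_left - r / n_right)
               for l, r in zip(left_counts, right_counts))
    return 2 * p_l * p_r * diff

# Illustrative per-class counts on each side of a candidate split point
print(cart_split_measure([1, 4, 0], [3, 4, 3]))
```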
23. Problem to Work On: Training Dataset
This follows an example from Quinlan's ID3.
24. Output: A Decision Tree for buys_computer
[Figure: decision tree with root age?; the < 30 branch leads to student? (no → no, yes → yes), the 30..40 branch leads to yes, and the > 40 branch leads to credit rating? (excellent → no, fair → yes)]
25. Bayesian Classification: Why?
- Probabilistic learning: Calculates explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems.
- Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data.
- Probabilistic prediction: Predicts multiple hypotheses, weighted by their probabilities.
- Standard: Even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured.
26. Bayesian Theorem: Basics
- Let X be a data sample whose class label is unknown
- Let H be a hypothesis that X belongs to class C
- For classification problems, determine P(H|X): the probability that the hypothesis holds given the observed data sample X
- P(H): prior probability of hypothesis H (i.e. the initial probability before we observe any data; reflects the background knowledge)
- P(X): probability that the sample data is observed
- P(X|H): probability of observing the sample X, given that the hypothesis holds
27. Bayes Theorem (Recap)
- Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes theorem: P(H|X) = P(X|H) P(H) / P(X)
- MAP (maximum a posteriori) hypothesis: the hypothesis h that maximizes P(X|h) P(h)
- Practical difficulty: requires initial knowledge of many probabilities, significant computational cost, insufficient data
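As a quick numeric check of the theorem, here is a minimal sketch; the three input probabilities are illustrative values, not figures from the slides.

```python
def posterior(p_x_given_h, p_h, p_x):
    """Bayes theorem: P(H|X) = P(X|H) * P(H) / P(X)."""
    return p_x_given_h * p_h / p_x

# Illustrative numbers: likelihood 0.05, prior 0.6, evidence 0.05*0.6 + 0.02*0.4
print(posterior(0.05, 0.6, 0.05 * 0.6 + 0.02 * 0.4))  # ~0.79
```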
28. Naïve Bayes Classifier
- A simplifying assumption: attributes are conditionally independent.
- The probability of observing, say, two elements y1 and y2 together, given that the current class is C, is the product of the probabilities of each element taken separately, given the same class: P(y1, y2 | C) = P(y1 | C) P(y2 | C)
- No dependence relation between attributes
- Greatly reduces the computation cost: only count the class distribution.
- Once the probability P(X|Ci) is known, assign X to the class with maximum P(X|Ci) P(Ci)
29. Training dataset
Class: C1: buys_computer = 'yes', C2: buys_computer = 'no'
Data sample: X = (age < 30, income = medium, student = yes, credit_rating = fair)
30. Naïve Bayesian Classifier: Example
- Compute P(X|Ci) for each class:
- P(age < 30 | buys_computer = yes) = 2/9 = 0.222
- P(age < 30 | buys_computer = no) = 3/5 = 0.6
- P(income = medium | buys_computer = yes) = 4/9 = 0.444
- P(income = medium | buys_computer = no) = 2/5 = 0.4
- P(student = yes | buys_computer = yes) = 6/9 = 0.667
- P(student = yes | buys_computer = no) = 1/5 = 0.2
- P(credit_rating = fair | buys_computer = yes) = 6/9 = 0.667
- P(credit_rating = fair | buys_computer = no) = 2/5 = 0.4
- X = (age < 30, income = medium, student = yes, credit_rating = fair)
- P(X | buys_computer = yes) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
- P(X | buys_computer = no) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
- Multiplying by the P(Ci)'s, we can conclude that
- X belongs to class buys_computer = yes
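The multiplication above can be checked in a few lines. In this sketch the class priors are assumed to be 9/14 and 5/14, inferred from the per-class denominators on the slide (9 'yes' and 5 'no' tuples).

```python
# Conditional probabilities as given on the slide
p_x_given_yes = 0.222 * 0.444 * 0.667 * 0.667   # ~0.044
p_x_given_no  = 0.6   * 0.4   * 0.2   * 0.4     # ~0.019

# Class priors assumed from the per-class denominators (9 'yes', 5 'no' tuples)
p_yes, p_no = 9 / 14, 5 / 14

score_yes = p_x_given_yes * p_yes
score_no  = p_x_given_no  * p_no
print(f"yes: {score_yes:.4f}  no: {score_no:.4f}")
print("X is assigned to buys_computer =", "yes" if score_yes > score_no else "no")
```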
31. Naïve Bayesian Classifier: Comments
- Advantages:
- Easy to implement
- Good results obtained in most of the cases
- Disadvantages:
- Assumption of class conditional independence, therefore loss of accuracy
- Practically, dependencies exist among variables
- E.g., hospital patients: Profile: age, family history, etc.
- Symptoms: fever, cough, etc. Disease: lung cancer, diabetes, etc.
- Dependencies among these cannot be modeled by a Naïve Bayesian Classifier
- How to deal with these dependencies?
- Bayesian Belief Networks
32. Classification Using Distance
- Place items in the class to which they are closest.
- Must determine the distance between an item and a class.
- Classes are represented by:
- Centroid: central value.
- Medoid: representative point.
- Individual points
- Algorithm: KNN
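As a sketch of the centroid option above (assumptions: Euclidean distance, numeric features, and illustrative function names and data):

```python
import math

def centroid(points):
    """Central value of a class: the mean of its points."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))

def classify_by_centroid(item, classes):
    """Assign item to the class whose centroid is closest (Euclidean distance)."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(classes, key=lambda c: dist(item, centroid(classes[c])))

# Illustrative 1-D height data grouped by class
classes = {"Short": [(1.6,), (1.65,)], "Medium": [(1.75,), (1.8,)], "Tall": [(1.95,), (2.0,)]}
print(classify_by_centroid((1.78,), classes))   # Medium
```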
33. K Nearest Neighbor (KNN)
- Training set includes classes.
- Examine the K items nearest to the item to be classified.
- The new item is placed in the class containing the most of these K items.
- O(q) for each tuple to be classified. (Here q is the size of the training set.)
34. KNN
35. KNN Algorithm
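The algorithm on this slide appears as a figure. A minimal KNN sketch consistent with the description on slide 33 (linear scan of the q training tuples, majority class among the K nearest) might look like this; the names and data are illustrative.

```python
import math
from collections import Counter

def knn_classify(item, training_set, k=3):
    """training_set: list of (feature_tuple, class_label).
    Scans all q training tuples (O(q)), keeps the K nearest,
    and returns the class occurring most often among them."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    nearest = sorted(training_set, key=lambda tc: dist(item, tc[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Illustrative height data
train = [((1.6,), "Short"), ((1.65,), "Short"), ((1.75,), "Medium"),
         ((1.8,), "Medium"), ((1.95,), "Tall"), ((2.0,), "Tall")]
print(knn_classify((1.77,), train, k=3))   # Medium
```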