Title: Classification
1. Classification
Based in part on Chapter 10 of Hand, Mannila, and Smyth and Chapter 7 of Han and Kamber.
David Madigan
2. Predictive Modeling
Goal: learn a mapping y = f(x; θ). Need:
1. A model structure
2. A score function
3. An optimization strategy
Categorical y ∈ {c1, …, cm}: classification. Real-valued y: regression.
Note: we usually assume c1, …, cm are mutually exclusive and exhaustive.
3. Probabilistic Classification
Let p(ck) = probability that a randomly chosen object comes from class ck. Objects from ck have density p(x | ck, θk) (e.g., MVN). Then:
p(ck | x) ∝ p(x | ck, θk) p(ck)
Bayes error rate: the minimum achievable error rate; a lower bound on the error rate of any classifier.
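A minimal numeric sketch of the rule p(ck | x) ∝ p(x | ck, θk) p(ck), assuming two univariate Gaussian class-conditional densities with made-up priors and parameters:

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

priors = {"c1": 0.7, "c2": 0.3}                   # p(ck), illustrative values
params = {"c1": (0.0, 1.0), "c2": (2.0, 1.0)}     # (mu_k, sigma_k) for p(x | ck, theta_k)

def posterior(x):
    # p(ck | x) is proportional to p(x | ck, theta_k) p(ck); normalize over classes
    unnorm = {k: normal_pdf(x, *params[k]) * priors[k] for k in priors}
    z = sum(unnorm.values())
    return {k: v / z for k, v in unnorm.items()}

print(posterior(0.5))   # class c1 gets posterior probability of roughly 0.86
```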
4. Bayes error rate about 6% (figure)
5. Classifier Types
- Discrimination: direct mapping from x to c1, …, cm (e.g., perceptron, SVM, CART)
- Regression: model p(ck | x) (e.g., logistic regression, CART)
- Class-conditional: model p(x | ck, θk) (e.g., Bayesian classifiers, LDA)
6. Simple Two-Class Perceptron
Define h(x) = wᵀx. Classify as class 1 if h(x) > 0, class 2 otherwise.
Score function: misclassification errors on training data.
For training, replace the class-2 xj's by -xj; now we need h(xj) > 0 for every training point.
- Initialize the weight vector w
- Repeat one or more times:
  - For each training data point xi:
    - If the point is correctly classified, do nothing
    - Else add the point to the weight vector: w ← w + xi
Guaranteed to converge when there is perfect (linear) separation.
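A minimal sketch of this training loop, assuming the standard additive update and numeric feature vectors (the function names and the unit learning rate are illustrative assumptions):

```python
import numpy as np

def train_perceptron(X, y, passes=10, lr=1.0):
    """Two-class perceptron. X: (n, p) array; y: labels in {1, 2}.
    As on the slide, class-2 points are negated so that a correct
    classification always means w @ x > 0."""
    Z = np.where((y == 1)[:, None], X, -X)   # replace class-2 x_j by -x_j
    w = np.zeros(X.shape[1])                 # initialize weight vector
    for _ in range(passes):                  # repeat one or more times
        for z in Z:                          # for each training data point
            if w @ z <= 0:                   # misclassified (or on the boundary)
                w = w + lr * z               # else: move w toward the point
    return w

def classify(w, x):
    return 1 if w @ x > 0 else 2
```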
7. Linear Discriminant Analysis
K classes; X is an n × p data matrix.
p(ck | x) ∝ p(x | ck, θk) p(ck)
Could model each class density as multivariate normal:
p(x | ck, θk) = N(x; μk, Σk)
LDA assumes Σk = Σ for all k. Then
log [ p(ck | x) / p(cl | x) ] = log(πk / πl) − ½ (μk + μl)ᵀ Σ⁻¹ (μk − μl) + xᵀ Σ⁻¹ (μk − μl)
This is linear in x.
8. Linear Discriminant Analysis (cont.)
It follows that the classifier should predict the class with the largest linear discriminant function:
δk(x) = xᵀ Σ⁻¹ μk − ½ μkᵀ Σ⁻¹ μk + log πk
If we don't assume the Σk's are identical, we get Quadratic Discriminant Analysis.
9. Linear Discriminant Analysis (cont.)
Can estimate the LDA parameters via maximum likelihood: π̂k = nk/n, μ̂k = the mean of the class-k x's, and Σ̂ = the pooled within-class covariance matrix.
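A sketch of the maximum-likelihood estimates and the linear discriminant δk(x), assuming X is a NumPy array and y a vector of class labels (function names are illustrative):

```python
import numpy as np

def fit_lda(X, y):
    """ML estimates for LDA: class priors, class means, pooled covariance."""
    classes = np.unique(y)
    n, p = X.shape
    priors = {k: np.mean(y == k) for k in classes}          # pi_k = n_k / n
    means  = {k: X[y == k].mean(axis=0) for k in classes}   # mu_k
    Sigma  = sum((X[y == k] - means[k]).T @ (X[y == k] - means[k])
                 for k in classes) / n                       # pooled (common) covariance
    return priors, means, np.linalg.inv(Sigma)

def lda_predict(x, priors, means, Sigma_inv):
    """Pick the class with the largest linear discriminant delta_k(x)."""
    def delta(k):
        mu = means[k]
        return x @ Sigma_inv @ mu - 0.5 * mu @ Sigma_inv @ mu + np.log(priors[k])
    return max(priors, key=delta)
```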
12. LDA vs. QDA (figures)
15. LDA (cont.)
- Fisher's LDA is optimal if the classes are MVN with a common covariance matrix
- Computational complexity: O(mp²n)
16. Logistic Regression
Note that LDA's log posterior odds are linear in x. Linear logistic regression looks the same:
log [ p(ck | x) / p(cl | x) ] = αk + βkᵀ x
But the estimation procedure for the coefficients is different: LDA maximizes the joint likelihood of (y, X), whereas logistic regression maximizes the conditional likelihood of y given X. The two usually give similar predictions.
17. Logistic Regression MLE
For the two-class case, the log-likelihood is
ℓ(β) = Σi [ yi βᵀxi − log(1 + exp(βᵀxi)) ]
To maximize, we need to solve the (non-linear) score equations:
∂ℓ/∂β = Σi xi (yi − p(xi; β)) = 0
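A sketch of solving these score equations by Newton-Raphson (equivalently, iteratively reweighted least squares); the function name and the fixed iteration count are assumptions for illustration:

```python
import numpy as np

def fit_logistic(X, y, iters=25):
    """Two-class logistic regression by Newton-Raphson on the score equations.
    X: (n, p) design matrix (include a column of ones for the intercept);
    y: 0/1 responses."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))      # fitted probabilities p(x_i; beta)
        score = X.T @ (y - p)                    # gradient of the log-likelihood
        W = p * (1 - p)                          # diagonal of the weight matrix
        hessian = X.T @ (X * W[:, None])         # X' W X
        beta = beta + np.linalg.solve(hessian, score)
    return beta
```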
18. Logistic Regression Modeling
South African Heart Disease Example (y = MI)

Variable    Coef.    S.E.    Z score (Wald)
Intercept  -4.130    0.964   -4.285
sbp         0.006    0.006    1.023
tobacco     0.080    0.026    3.034
ldl         0.185    0.057    3.219
famhist     0.939    0.225    4.178
obesity    -0.035    0.029   -1.187
alcohol     0.001    0.004    0.136
age         0.043    0.010    4.184
19. Tree Models
- Easy to understand
- Can handle mixed data, missing values, etc.
- Sequential fitting method can be sub-optimal
- Usually grow a large tree and prune it back rather than attempt to optimally stop the growing process
21. Training Dataset
This follows an example from Quinlan's ID3.
22. Output: A Decision Tree for buys_computer
age?
├─ <30:    student?
│          ├─ no  → no
│          └─ yes → yes
├─ 30..40: yes
└─ >40:    credit rating?
           ├─ excellent → no
           └─ fair      → yes
26. Confusion matrix
27. Algorithm for Decision Tree Induction
- Basic algorithm (a greedy algorithm); see the sketch after this list
  - Tree is constructed in a top-down recursive divide-and-conquer manner
  - At start, all the training examples are at the root
  - Attributes are categorical (if continuous-valued, they are discretized in advance)
  - Examples are partitioned recursively based on selected attributes
  - Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
- Conditions for stopping partitioning
  - All samples for a given node belong to the same class
  - There are no remaining attributes for further partitioning; majority voting is employed for classifying the leaf
  - There are no samples left
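A compact sketch of this greedy, top-down procedure, using information gain as the selection measure; the input format (a list of dicts of categorical attributes plus a class-label key) is an assumption for illustration:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def build_tree(rows, attrs, target):
    """rows: list of dicts of categorical attributes; target: class-label key."""
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:                  # all samples in one class
        return labels[0]
    if not attrs:                              # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    def gain(a):                               # information gain of splitting on a
        split = Counter(r[a] for r in rows)
        rem = sum((cnt / len(rows)) * entropy([r[target] for r in rows if r[a] == v])
                  for v, cnt in split.items())
        return entropy(labels) - rem
    best = max(attrs, key=gain)                # greedy choice of test attribute
    node = {best: {}}
    for v in set(r[best] for r in rows):       # partition recursively on the chosen attribute
        subset = [r for r in rows if r[best] == v]
        node[best][v] = build_tree(subset, [a for a in attrs if a != best], target)
    return node

# hypothetical usage (rows would come from the training table above):
# build_tree(rows, ["age", "income", "student", "credit_rating"], "buys_computer")
```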
28. Information Gain (ID3/C4.5)
- Select the attribute with the highest information gain
- Assume there are two classes, P and N
- Let the set of examples S contain p elements of class P and n elements of class N
- The amount of information needed to decide if an arbitrary example in S belongs to P or N is defined as
  I(p, n) = −(p/(p+n)) log2(p/(p+n)) − (n/(p+n)) log2(n/(p+n))
  e.g., I(0.5, 0.5) = 1; I(0.9, 0.1) = 0.47; I(0.99, 0.01) = 0.08
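A short check of this definition that reproduces the example values on the slide:

```python
import math

def I(p, n):
    """Information needed to decide the class of an example in S with p P's and n N's."""
    def term(f):
        return 0.0 if f == 0 else -f * math.log2(f)
    return term(p / (p + n)) + term(n / (p + n))

print(round(I(0.5, 0.5), 2), round(I(0.9, 0.1), 2), round(I(0.99, 0.01), 2))
# prints: 1.0 0.47 0.08
```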
29. Information Gain in Decision Tree Induction
- Assume that using attribute A, a set S will be partitioned into sets S1, S2, …, Sv
- If Si contains pi examples of P and ni examples of N, the entropy, or the expected information needed to classify objects in all subtrees Si, is
  E(A) = Σ_{i=1..v} ((pi + ni)/(p + n)) I(pi, ni)
- The encoding information that would be gained by branching on A is
  Gain(A) = I(p, n) − E(A)
30. Attribute Selection by Information Gain Computation
- Class P: buys_computer = "yes"
- Class N: buys_computer = "no"
- I(p, n) = I(9, 5) = 0.940
- Compute the entropy for age (see the sketch below)
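A sketch of this computation, assuming the per-bracket class counts of the usual 14-example buys_computer table (≤30: 2 yes / 3 no, 31..40: 4 / 0, >40: 3 / 2), which is consistent with I(9, 5) = 0.940 above; treat the counts as an assumption if the slide's table differed:

```python
import math

def I(p, n):
    if p == 0 or n == 0:
        return 0.0
    t = p + n
    return -(p / t) * math.log2(p / t) - (n / t) * math.log2(n / t)

# assumed (yes, no) counts in each age bracket: <=30, 31..40, >40
age_counts = [(2, 3), (4, 0), (3, 2)]

E_age = sum(((p + n) / 14) * I(p, n) for p, n in age_counts)   # expected information
gain_age = I(9, 5) - E_age                                      # Gain(age)
print(round(E_age, 3), round(gain_age, 3))   # approx. 0.694 and 0.246
```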
31. Gini Index (IBM IntelligentMiner)
- If a data set T contains examples from n classes, the gini index gini(T) is defined as
  gini(T) = 1 − Σ_{j=1..n} pj²
  where pj is the relative frequency of class j in T.
- If a data set T is split into two subsets T1 and T2 with sizes N1 and N2 respectively, the gini index of the split data is defined as
  gini_split(T) = (N1/N) gini(T1) + (N2/N) gini(T2)
- The attribute that provides the smallest gini_split(T) is chosen to split the node
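A minimal sketch of these two definitions (the example labels are made up):

```python
from collections import Counter

def gini(labels):
    """gini(T) = 1 - sum_j p_j^2 over the class frequencies in T."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(left, right):
    """Size-weighted gini index of a binary split T -> (T1, T2)."""
    n = len(left) + len(right)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

labels = ["yes"] * 9 + ["no"] * 5
print(round(gini(labels), 3))                         # 0.459 for 9 yes / 5 no
print(round(gini_split(labels[:7], labels[7:]), 3))   # gini of one candidate split
```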
32. Avoid Overfitting in Classification
- The generated tree may overfit the training data
  - Too many branches, some of which may reflect anomalies due to noise or outliers
  - Results in poor accuracy for unseen samples
- Two approaches to avoid overfitting
  - Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
    - Difficult to choose an appropriate threshold
  - Postpruning: remove branches from a "fully grown" tree to get a sequence of progressively pruned trees
    - Use a set of data different from the training data to decide which is the best pruned tree
33. Approaches to Determine the Final Tree Size
- Separate training (2/3) and testing (1/3) sets
- Use cross-validation, e.g., 10-fold cross-validation
- Use the minimum description length (MDL) principle: halt growth of the tree when the encoding is minimized
34. Nearest Neighbor Methods
- k-NN assigns an unknown object to the most common class among its k nearest neighbors
- Choice of k? (bias-variance tradeoff again)
- Choice of metric?
- Need all the training data to be present to classify a new point ("lazy" methods)
- Surprisingly strong asymptotic results (e.g., no decision rule is more than twice as accurate as 1-NN)
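A minimal k-NN sketch using a Euclidean metric (both k and the metric are the choices flagged above; the toy data are made up):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Assign x_new to the most common class among its k nearest
    training points (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]

# toy usage
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array(["a", "a", "b", "b"])
print(knn_predict(X, y, np.array([0.2, 0.1]), k=3))   # 'a'
```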
35. Flexible Metric NN Classification
36. Naïve Bayes Classification
Recall p(ck | x) ∝ p(x | ck) p(ck). Now suppose the features are conditionally independent given the class (graphically, C is the parent of x1, x2, …, xp):
p(x | ck) = Π_{j=1..p} p(xj | ck)
Then
p(ck | x) ∝ p(ck) Π_{j=1..p} p(xj | ck)
Equivalently, for two classes, the log posterior odds are a sum of "weights of evidence":
log [ p(c1 | x) / p(c0 | x) ] = log [ p(c1) / p(c0) ] + Σ_j log [ p(xj | c1) / p(xj | c0) ]
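A sketch of this weights-of-evidence form for two classes with binary features; all parameter values are illustrative assumptions:

```python
import numpy as np

def naive_bayes_log_odds(x, log_prior_odds, cond_probs):
    """Two-class naive Bayes on binary features, written as a sum of
    weights of evidence: log p(c1|x)/p(c0|x) = log p(c1)/p(c0)
    + sum_j log p(x_j|c1)/p(x_j|c0)."""
    total = log_prior_odds
    for j, xj in enumerate(x):
        p1, p0 = cond_probs[j]              # P(x_j = 1 | c1), P(x_j = 1 | c0)
        if xj == 1:
            total += np.log(p1 / p0)        # weight of evidence of feature j
        else:
            total += np.log((1 - p1) / (1 - p0))
    return total                            # > 0 favours class c1

# toy usage with two binary features and made-up probabilities
w = naive_bayes_log_odds([1, 0], log_prior_odds=np.log(0.3 / 0.7),
                         cond_probs=[(0.8, 0.2), (0.4, 0.5)])
print(w > 0)
```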
37. Evidence Balance Sheet
38. Naïve Bayes (cont.)
- Despite the crude conditional independence assumption, works well in practice (see Friedman, 1997 for a partial explanation)
- Can be further enhanced with boosting, bagging, model averaging, etc.
- Can relax the conditional independence assumptions in myriad ways ("Bayesian networks")
39. Dietterich (1999): Analysis of 33 UCI datasets