Title: Classification
1. Classification
Based in part on Chapter 10 of Hand, Mannila, and Smyth and Chapter 7 of Han and Kamber.
David Madigan
2. Predictive Modeling
Goal: learn a mapping y = f(x; θ). Need:
1. A model structure
2. A score function
3. An optimization strategy
Categorical y ∈ {c1, …, cm}: classification. Real-valued y: regression.
Note: we usually assume c1, …, cm are mutually exclusive and exhaustive.
3. Probabilistic Classification
Let p(ck) = probability that a randomly chosen object comes from class ck. Objects from ck have density p(x | ck, θk) (e.g., MVN). Then:
p(ck | x) ∝ p(x | ck, θk) p(ck)
Bayes error rate: the minimum achievable error rate; a lower bound on the error rate of any classifier.
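A minimal numeric sketch of the rule p(ck | x) ∝ p(x | ck, θk) p(ck), assuming two univariate Gaussian class-conditional densities with made-up priors and parameters:

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

priors = {"c1": 0.7, "c2": 0.3}                   # p(ck), illustrative values
params = {"c1": (0.0, 1.0), "c2": (2.0, 1.0)}     # (mu_k, sigma_k) for p(x | ck, theta_k)

def posterior(x):
    # p(ck | x) is proportional to p(x | ck, theta_k) p(ck); normalize over classes
    unnorm = {k: normal_pdf(x, *params[k]) * priors[k] for k in priors}
    z = sum(unnorm.values())
    return {k: v / z for k, v in unnorm.items()}

print(posterior(0.5))   # class c1 gets posterior probability of roughly 0.86
```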
4. Bayes error rate about 6% (figure)
5. Classifier Types
- Discrimination: direct mapping from x to c1, …, cm (e.g., perceptron, SVM, CART)
- Regression: model p(ck | x) (e.g., logistic regression, CART)
- Class-conditional: model p(x | ck, θk) (e.g., Bayesian classifiers, LDA)
6. Simple Two-Class Perceptron
Define h(x) = wᵀx. Classify as class 1 if h(x) > 0, class 2 otherwise.
Score function: misclassification errors on training data.
For training, replace the class-2 xj's by -xj; now we need h(xj) > 0 for every training point.
- Initialize the weight vector w
- Repeat one or more times:
  - For each training data point xi:
    - If the point is correctly classified, do nothing
    - Else add the point to the weight vector: w ← w + xi
Guaranteed to converge when there is perfect (linear) separation.
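A minimal sketch of this training loop, assuming the standard additive update and numeric feature vectors (the function names and the unit learning rate are illustrative assumptions):

```python
import numpy as np

def train_perceptron(X, y, passes=10, lr=1.0):
    """Two-class perceptron. X: (n, p) array; y: labels in {1, 2}.
    As on the slide, class-2 points are negated so that a correct
    classification always means w @ x > 0."""
    Z = np.where((y == 1)[:, None], X, -X)   # replace class-2 x_j by -x_j
    w = np.zeros(X.shape[1])                 # initialize weight vector
    for _ in range(passes):                  # repeat one or more times
        for z in Z:                          # for each training data point
            if w @ z <= 0:                   # misclassified (or on the boundary)
                w = w + lr * z               # else: move w toward the point
    return w

def classify(w, x):
    return 1 if w @ x > 0 else 2
```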
7. Linear Discriminant Analysis
K classes; X is an n × p data matrix.
p(ck | x) ∝ p(x | ck, θk) p(ck)
Could model each class density as multivariate normal:
p(x | ck, θk) = N(x; μk, Σk)
LDA assumes Σk = Σ for all k. Then
log [ p(ck | x) / p(cl | x) ] = log(πk / πl) − ½ (μk + μl)ᵀ Σ⁻¹ (μk − μl) + xᵀ Σ⁻¹ (μk − μl)
This is linear in x.
8. Linear Discriminant Analysis (cont.)
It follows that the classifier should predict the class with the largest linear discriminant function:
δk(x) = xᵀ Σ⁻¹ μk − ½ μkᵀ Σ⁻¹ μk + log πk
If we don't assume the Σk's are identical, we get Quadratic Discriminant Analysis.
9. Linear Discriminant Analysis (cont.)
Can estimate the LDA parameters via maximum likelihood: π̂k = nk/n, μ̂k = the mean of the class-k x's, and Σ̂ = the pooled within-class covariance matrix.
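A sketch of the maximum-likelihood estimates and the linear discriminant δk(x), assuming X is a NumPy array and y a vector of class labels (function names are illustrative):

```python
import numpy as np

def fit_lda(X, y):
    """ML estimates for LDA: class priors, class means, pooled covariance."""
    classes = np.unique(y)
    n, p = X.shape
    priors = {k: np.mean(y == k) for k in classes}          # pi_k = n_k / n
    means  = {k: X[y == k].mean(axis=0) for k in classes}   # mu_k
    Sigma  = sum((X[y == k] - means[k]).T @ (X[y == k] - means[k])
                 for k in classes) / n                       # pooled (common) covariance
    return priors, means, np.linalg.inv(Sigma)

def lda_predict(x, priors, means, Sigma_inv):
    """Pick the class with the largest linear discriminant delta_k(x)."""
    def delta(k):
        mu = means[k]
        return x @ Sigma_inv @ mu - 0.5 * mu @ Sigma_inv @ mu + np.log(priors[k])
    return max(priors, key=delta)
```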
12. LDA vs. QDA (figures)
15. LDA (cont.)
- Fisher's LDA is optimal if the classes are MVN with a common covariance matrix
- Computational complexity: O(mp²n)
16. Logistic Regression
Note that LDA's log posterior odds are linear in x. Linear logistic regression looks the same:
log [ p(ck | x) / p(cl | x) ] = αk + βkᵀ x
But the estimation procedure for the coefficients is different: LDA maximizes the joint likelihood of (y, X), whereas logistic regression maximizes the conditional likelihood of y given X. The two usually give similar predictions.
17. Logistic Regression MLE
For the two-class case, the log-likelihood is
ℓ(β) = Σi [ yi βᵀxi − log(1 + exp(βᵀxi)) ]
To maximize, we need to solve the (non-linear) score equations:
∂ℓ/∂β = Σi xi (yi − p(xi; β)) = 0
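A sketch of solving these score equations by Newton-Raphson (equivalently, iteratively reweighted least squares); the function name and the fixed iteration count are assumptions for illustration:

```python
import numpy as np

def fit_logistic(X, y, iters=25):
    """Two-class logistic regression by Newton-Raphson on the score equations.
    X: (n, p) design matrix (include a column of ones for the intercept);
    y: 0/1 responses."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))      # fitted probabilities p(x_i; beta)
        score = X.T @ (y - p)                    # gradient of the log-likelihood
        W = p * (1 - p)                          # diagonal of the weight matrix
        hessian = X.T @ (X * W[:, None])         # X' W X
        beta = beta + np.linalg.solve(hessian, score)
    return beta
```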
18. Logistic Regression Modeling
South African Heart Disease Example (y = MI)

Variable    Coef.    S.E.    Z score (Wald)
Intercept  -4.130    0.964   -4.285
sbp         0.006    0.006    1.023
tobacco     0.080    0.026    3.034
ldl         0.185    0.057    3.219
famhist     0.939    0.225    4.178
obesity    -0.035    0.029   -1.187
alcohol     0.001    0.004    0.136
age         0.043    0.010    4.184
19. Tree Models
- Easy to understand
- Can handle mixed data, missing values, etc.
- Sequential fitting method can be sub-optimal
- Usually grow a large tree and prune it back rather than attempt to optimally stop the growing process
21. Training Dataset
This follows an example from Quinlan's ID3.
22. Output: A Decision Tree for buys_computer
age?
├─ <30:    student?
│          ├─ no  → no
│          └─ yes → yes
├─ 30..40: yes
└─ >40:    credit rating?
           ├─ excellent → no
           └─ fair      → yes
26. Confusion matrix
27. Algorithm for Decision Tree Induction
- Basic algorithm (a greedy algorithm); see the sketch after this list
  - Tree is constructed in a top-down recursive divide-and-conquer manner
  - At start, all the training examples are at the root
  - Attributes are categorical (if continuous-valued, they are discretized in advance)
  - Examples are partitioned recursively based on selected attributes
  - Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
- Conditions for stopping partitioning
  - All samples for a given node belong to the same class
  - There are no remaining attributes for further partitioning; majority voting is employed for classifying the leaf
  - There are no samples left
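A compact sketch of this greedy, top-down procedure, using information gain as the selection measure; the input format (a list of dicts of categorical attributes plus a class-label key) is an assumption for illustration:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def build_tree(rows, attrs, target):
    """rows: list of dicts of categorical attributes; target: class-label key."""
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:                  # all samples in one class
        return labels[0]
    if not attrs:                              # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    def gain(a):                               # information gain of splitting on a
        split = Counter(r[a] for r in rows)
        rem = sum((cnt / len(rows)) * entropy([r[target] for r in rows if r[a] == v])
                  for v, cnt in split.items())
        return entropy(labels) - rem
    best = max(attrs, key=gain)                # greedy choice of test attribute
    node = {best: {}}
    for v in set(r[best] for r in rows):       # partition recursively on the chosen attribute
        subset = [r for r in rows if r[best] == v]
        node[best][v] = build_tree(subset, [a for a in attrs if a != best], target)
    return node

# hypothetical usage (rows would come from the training table above):
# build_tree(rows, ["age", "income", "student", "credit_rating"], "buys_computer")
```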
28. Information Gain (ID3/C4.5)
- Select the attribute with the highest information gain
- Assume there are two classes, P and N
- Let the set of examples S contain p elements of class P and n elements of class N
- The amount of information needed to decide if an arbitrary example in S belongs to P or N is defined as
  I(p, n) = −(p/(p+n)) log2(p/(p+n)) − (n/(p+n)) log2(n/(p+n))
  e.g., I(0.5, 0.5) = 1; I(0.9, 0.1) = 0.47; I(0.99, 0.01) = 0.08
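A short check of this definition that reproduces the example values on the slide:

```python
import math

def I(p, n):
    """Information needed to decide the class of an example in S with p P's and n N's."""
    def term(f):
        return 0.0 if f == 0 else -f * math.log2(f)
    return term(p / (p + n)) + term(n / (p + n))

print(round(I(0.5, 0.5), 2), round(I(0.9, 0.1), 2), round(I(0.99, 0.01), 2))
# prints: 1.0 0.47 0.08
```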
29. Information Gain in Decision Tree Induction
- Assume that using attribute A, a set S will be partitioned into sets S1, S2, …, Sv
- If Si contains pi examples of P and ni examples of N, the entropy, or the expected information needed to classify objects in all subtrees Si, is
  E(A) = Σ_{i=1..v} ((pi + ni)/(p + n)) I(pi, ni)
- The encoding information that would be gained by branching on A is
  Gain(A) = I(p, n) − E(A)
30. Attribute Selection by Information Gain Computation
- Class P: buys_computer = "yes"
- Class N: buys_computer = "no"
- I(p, n) = I(9, 5) = 0.940
- Compute the entropy for age (see the sketch below)
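A sketch of this computation, assuming the per-bracket class counts of the usual 14-example buys_computer table (≤30: 2 yes / 3 no, 31..40: 4 / 0, >40: 3 / 2), which is consistent with I(9, 5) = 0.940 above; treat the counts as an assumption if the slide's table differed:

```python
import math

def I(p, n):
    if p == 0 or n == 0:
        return 0.0
    t = p + n
    return -(p / t) * math.log2(p / t) - (n / t) * math.log2(n / t)

# assumed (yes, no) counts in each age bracket: <=30, 31..40, >40
age_counts = [(2, 3), (4, 0), (3, 2)]

E_age = sum(((p + n) / 14) * I(p, n) for p, n in age_counts)   # expected information
gain_age = I(9, 5) - E_age                                      # Gain(age)
print(round(E_age, 3), round(gain_age, 3))   # approx. 0.694 and 0.246
```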
31. Gini Index (IBM IntelligentMiner)
- If a data set T contains examples from n classes, the gini index gini(T) is defined as
  gini(T) = 1 − Σ_{j=1..n} pj²
  where pj is the relative frequency of class j in T.
- If a data set T is split into two subsets T1 and T2 with sizes N1 and N2 respectively, the gini index of the split data is defined as
  gini_split(T) = (N1/N) gini(T1) + (N2/N) gini(T2)
- The attribute that provides the smallest gini_split(T) is chosen to split the node
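A minimal sketch of these two definitions (the example labels are made up):

```python
from collections import Counter

def gini(labels):
    """gini(T) = 1 - sum_j p_j^2 over the class frequencies in T."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(left, right):
    """Size-weighted gini index of a binary split T -> (T1, T2)."""
    n = len(left) + len(right)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

labels = ["yes"] * 9 + ["no"] * 5
print(round(gini(labels), 3))                         # 0.459 for 9 yes / 5 no
print(round(gini_split(labels[:7], labels[7:]), 3))   # gini of one candidate split
```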
32. Avoid Overfitting in Classification
- The generated tree may overfit the training data
  - Too many branches, some of which may reflect anomalies due to noise or outliers
  - Results in poor accuracy for unseen samples
- Two approaches to avoid overfitting
  - Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
    - Difficult to choose an appropriate threshold
  - Postpruning: remove branches from a "fully grown" tree to get a sequence of progressively pruned trees
    - Use a set of data different from the training data to decide which is the best pruned tree
33. Approaches to Determine the Final Tree Size
- Separate training (2/3) and testing (1/3) sets
- Use cross-validation, e.g., 10-fold cross-validation
- Use the minimum description length (MDL) principle: halt growth of the tree when the encoding is minimized
34. Nearest Neighbor Methods
- k-NN assigns an unknown object to the most common class among its k nearest neighbors
- Choice of k? (bias-variance tradeoff again)
- Choice of metric?
- Need all the training data to be present to classify a new point ("lazy" methods)
- Surprisingly strong asymptotic results (e.g., no decision rule is more than twice as accurate as 1-NN)
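A minimal k-NN sketch using a Euclidean metric (both k and the metric are the choices flagged above; the toy data are made up):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Assign x_new to the most common class among its k nearest
    training points (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]

# toy usage
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array(["a", "a", "b", "b"])
print(knn_predict(X, y, np.array([0.2, 0.1]), k=3))   # 'a'
```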
35. Flexible Metric NN Classification
36. Naïve Bayes Classification
Recall p(ck | x) ∝ p(x | ck) p(ck). Now suppose the features are conditionally independent given the class (graphically, C is the parent of x1, x2, …, xp):
p(x | ck) = Π_{j=1..p} p(xj | ck)
Then
p(ck | x) ∝ p(ck) Π_{j=1..p} p(xj | ck)
Equivalently, for two classes, the log posterior odds are a sum of "weights of evidence":
log [ p(c1 | x) / p(c0 | x) ] = log [ p(c1) / p(c0) ] + Σ_j log [ p(xj | c1) / p(xj | c0) ]
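A sketch of this weights-of-evidence form for two classes with binary features; all parameter values are illustrative assumptions:

```python
import numpy as np

def naive_bayes_log_odds(x, log_prior_odds, cond_probs):
    """Two-class naive Bayes on binary features, written as a sum of
    weights of evidence: log p(c1|x)/p(c0|x) = log p(c1)/p(c0)
    + sum_j log p(x_j|c1)/p(x_j|c0)."""
    total = log_prior_odds
    for j, xj in enumerate(x):
        p1, p0 = cond_probs[j]              # P(x_j = 1 | c1), P(x_j = 1 | c0)
        if xj == 1:
            total += np.log(p1 / p0)        # weight of evidence of feature j
        else:
            total += np.log((1 - p1) / (1 - p0))
    return total                            # > 0 favours class c1

# toy usage with two binary features and made-up probabilities
w = naive_bayes_log_odds([1, 0], log_prior_odds=np.log(0.3 / 0.7),
                         cond_probs=[(0.8, 0.2), (0.4, 0.5)])
print(w > 0)
```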
37. Evidence Balance Sheet
38. Naïve Bayes (cont.)
- Despite the crude conditional independence assumption, works well in practice (see Friedman, 1997 for a partial explanation)
- Can be further enhanced with boosting, bagging, model averaging, etc.
- Can relax the conditional independence assumptions in myriad ways ("Bayesian networks")
39. Dietterich (1999): Analysis of 33 UCI datasets