Title: Statistical Classification
Classification Problems
- Given an input X = (x1, x2, ..., xm)
- Predict the class label y ∈ Y
- Y = {-1, +1}: binary classification problems
- Y = {1, 2, 3, ..., c}: multi-class classification problems
- Goal: learn the function f: X → Y
Examples of Classification Problem
- Text categorization
- Input features X
- Word frequency, e.g., (campaigning, 1), (democrats, 2), (basketball, 0), ...
- Class label y
- y = +1: politics
- y = -1: non-politics
Doc: "Months of campaigning and weeks of round-the-clock efforts in Iowa all came down to a final push Sunday, ..."
Topic: Politics or Non-politics?
Examples of Classification Problem
- Image classification
- Input features X
- Color histogram, e.g., (red, 1004), (blue, 23000), ...
- Class label y
- y = +1: bird image
- y = -1: non-bird image
Which images are birds, which are not?
Classification Problems
How do we obtain f?
Learn the classification function f from examples.
Learning from Examples
- Training examples: D = {(x1, y1), (x2, y2), ..., (xn, yn)}
- Independent and Identically Distributed (i.i.d.)
- Each training example is drawn independently from the same underlying distribution
- Training examples are similar to testing examples
Learning from Examples
- Given the training examples
- Goal: learn a classification function f: X → Y that is consistent with the training examples
- What is the easiest way to do it?
K Nearest Neighbor (kNN) Approach
How many neighbors should we count?
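A minimal sketch of the kNN decision rule in Python, using Euclidean distance and a simple majority vote (the function and data names are illustrative, not from the slides):

import numpy as np
from collections import Counter

def knn_predict(x_query, X_train, y_train, k):
    """Predict the label of x_query by majority vote among its k nearest neighbors."""
    # Euclidean distance from the query point to every training point
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    # Majority vote over their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Example: X_train = np.array([[0., 0.], [1., 1.], [0., 1.]]); y_train = np.array([-1, 1, -1])
# knn_predict(np.array([0.1, 0.2]), X_train, y_train, k=3)  ->  -1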
Cross Validation
- Divide the training examples into two sets
- A training set (80%) and a validation set (20%)
- Predict the class labels of the examples in the validation set using the examples in the training set
- Choose the number of neighbors k that maximizes the classification accuracy (as sketched below)
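One way to run this selection, sketched here with scikit-learn on synthetic data (the dataset, split seed, and range of k values are placeholders for illustration):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data stands in for the real training examples
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
# 80% / 20% split into a training set and a validation set
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

best_k, best_acc = 1, 0.0
for k in range(1, 21):
    acc = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr).score(X_val, y_val)
    if acc > best_acc:
        best_k, best_acc = k, acc
# best_k is the number of neighbors that maximizes validation accuracy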
Leave-One-Out Method
- For k = 1, 2, ..., K:
- Set Err(k) = 0
- Randomly select a training data point and hide its class label
- Use the remaining data and the given k to predict the class label of the held-out data point
- Err(k) = Err(k) + 1 if the predicted label differs from the true label
- Repeat the procedure until all training examples have been tested
- Choose the k whose Err(k) is minimal
Example result: Err(1) = 3, Err(2) = 2, Err(3) = 6, so k = 2 is chosen.
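A sketch of the same procedure using scikit-learn's LeaveOneOut splitter, again on synthetic stand-in data; it tallies Err(k) for each candidate k and picks the minimizer:

from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=100, n_features=5, random_state=0)
errors = {}
for k in range(1, 6):
    # With LeaveOneOut, each fold holds out one point and scores 0 or 1
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=LeaveOneOut())
    errors[k] = int(len(scores) - scores.sum())   # Err(k): number of misclassified held-out points
best_k = min(errors, key=errors.get)              # the k whose Err(k) is minimal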
Probabilistic Interpretation of kNN
- Estimate the probability Pr(y | x) around the location of x
- Count the data points of class y in the neighborhood of x
- Bias and variance tradeoff
- A small neighborhood → large variance → unreliable estimation
- A large neighborhood → large bias → inaccurate estimation
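In symbols, with N_k(x) denoting the set of k nearest neighbors of x (notation introduced here for readability), the kNN estimate is

\Pr(y \mid x) \;\approx\; \frac{\left|\{\, i \;:\; x_i \in N_k(x),\ y_i = y \,\}\right|}{k}

i.e., the fraction of the k nearest neighbors that belong to class y.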
Weighted kNN
- Weight the contribution of each close neighbor based on its distance
- Weight function
- Prediction (one common choice for both is sketched below)
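A minimal weighted-kNN sketch, assuming a Gaussian (RBF) weight exp(-||x - x_i||^2 / 2σ²); this is a common choice and is consistent with the σ² estimated in the next section, but the exact weight function on the slides may differ:

import numpy as np

def weighted_knn_predict(x_query, X_train, y_train, sigma2):
    """Weighted vote: training point x_i contributes w_i = exp(-||x - x_i||^2 / (2*sigma^2))."""
    w = np.exp(-np.sum((X_train - x_query) ** 2, axis=1) / (2.0 * sigma2))
    # Score each class by the total weight of its training points; predict the highest-scoring class
    return max(set(y_train.tolist()), key=lambda c: w[y_train == c].sum())

With a very large sigma2 all training points count almost equally; with a small sigma2 only the closest points matter, which is the bias-variance tradeoff described above.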
Estimate σ² in the Weight Function
- Leave-one-out cross validation
- The training dataset D is divided into two sets
- Validation set: {(x1, y1)}
- Training set: D-1 = D \ {(x1, y1)}
- Compute Pr(y | x1, D-1); this probability is a function of σ²
- In general, we can write the expression for Pr(y | xi, D-i)
- Validation set: {(xi, yi)}
- Training set: D-i = D \ {(xi, yi)}
- Estimate σ² by maximizing the likelihood (written out below)
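With a Gaussian weight w_j(x) = exp(-||x - x_j||^2 / 2σ²), which is a common choice and is assumed here, the leave-one-out probability and the log-likelihood take the form

\Pr(y \mid x_i, D_{-i}) \;=\;
\frac{\sum_{j \ne i,\, y_j = y} \exp\!\left(-\lVert x_i - x_j\rVert^2 / 2\sigma^2\right)}
     {\sum_{j \ne i} \exp\!\left(-\lVert x_i - x_j\rVert^2 / 2\sigma^2\right)},
\qquad
l(\sigma^2) \;=\; \sum_{i=1}^{n} \log \Pr\!\left(y_i \mid x_i, D_{-i}\right)

and σ² is chosen as the maximizer of l(σ²).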
Optimization
- The log-likelihood l(σ²) is a DC (difference of two convex functions) function
Challenges in Optimization
- Convex functions are the easiest to optimize
- Single-mode functions are the second easiest
- Multi-mode functions are difficult to optimize
Gradient Ascent
- Compute the derivative of l(θ), i.e., ∇l(θ)
- Update θ: θ ← θ + t · ∇l(θ)
How do we decide the step size t?
Gradient Ascent: Line Search
Excerpt from the slides by Stephen Boyd
Gradient Ascent
- Stop criterion: ||∇l(θ)|| ≤ ε, where ε is a predefined small value
- Start from θ0; define α, β, and ε
- Compute ∇l(θ)
- Choose the step size t via backtracking line search
- Update θ ← θ + t · ∇l(θ)
- Repeat until ||∇l(θ)|| ≤ ε (a sketch follows)
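A compact sketch of this procedure; the default values of α, β, and ε below are illustrative, and l and grad_l are the log-likelihood and its gradient supplied by the caller:

import numpy as np

def gradient_ascent(l, grad_l, theta0, alpha=0.3, beta=0.8, eps=1e-6, max_iter=1000):
    """Maximize l(theta) by gradient ascent with backtracking line search."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        g = grad_l(theta)
        if np.linalg.norm(g) <= eps:                  # stop criterion: ||grad l(theta)|| <= eps
            break
        t = 1.0
        # Backtracking: shrink t until the step gives a sufficient increase in l
        while l(theta + t * g) < l(theta) + alpha * t * np.dot(g, g):
            t *= beta
        theta = theta + t * g                         # update: theta <- theta + t * grad l(theta)
    return theta

# Example: maximizing l(theta) = -(theta - 3)^2 converges to theta = 3
# gradient_ascent(lambda th: -(th[0] - 3) ** 2, lambda th: np.array([-2 * (th[0] - 3)]), [0.0])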
ML = Statistics + Optimization
- Modeling: Pr(y | x; θ)
- θ is the parameter(s) involved in the model
- Search for the best parameter θ
- Maximum likelihood estimation
- Construct a log-likelihood function l(θ)
- Search for the optimal solution θ*
Instance-Based Learning (Ch. 8)
- Key idea: just store all training examples
- k nearest neighbor
- Given a query example, take a vote among its k nearest neighbors (if the target function is discrete-valued)
- Take the mean of the f values of the k nearest neighbors (if the target function is real-valued)
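A minimal sketch of the real-valued (regression) variant, under the same illustrative naming as the earlier kNN sketch:

import numpy as np

def knn_regress(x_query, X_train, f_train, k):
    """Estimate f(x_query) as the mean f value of the k nearest training points."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    return float(np.mean(f_train[np.argsort(dists)[:k]]))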
When to Consider Nearest Neighbor?
- Lots of training data
- Less than 20 attributes per example
- Advantages
- Training is very fast
- Learn complex target functions
- Don't lose information
- Disadvantages
- Slow at query time
- Easily fooled by irrelevant attributes
KD Tree for NN Search
- Each node contains
- Children information
- The tightest box that bounds all the data points within the node (a minimal node layout is sketched below)
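A possible node structure matching this description, with a simple median-split construction (the leaf size and field names are my choices, not from the slides):

from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class KDNode:
    points: np.ndarray                     # the data points stored under this node
    lo: np.ndarray                         # lower corner of the tightest bounding box
    hi: np.ndarray                         # upper corner of the tightest bounding box
    left: Optional["KDNode"] = None        # children information
    right: Optional["KDNode"] = None

def build_kdtree(points: np.ndarray, leaf_size: int = 8) -> KDNode:
    """Recursively split the points; every node records the tightest box bounding them."""
    node = KDNode(points=points, lo=points.min(axis=0), hi=points.max(axis=0))
    if len(points) > leaf_size:
        dim = int(np.argmax(node.hi - node.lo))        # split along the widest dimension
        order = np.argsort(points[:, dim])
        mid = len(points) // 2
        node.left = build_kdtree(points[order[:mid]], leaf_size)
        node.right = build_kdtree(points[order[mid:]], leaf_size)
    return node

Splitting along the widest dimension keeps the bounding boxes compact, which is what makes the pruning in the next sketch effective.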
NN Search by KD Tree
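A sketch of the search itself, reusing KDNode and build_kdtree from the previous sketch: the bounding box of a subtree is used to skip subtrees that cannot contain a point closer than the best found so far.

import numpy as np

def box_dist(q, lo, hi):
    """Smallest possible distance from query q to any point inside the box [lo, hi]."""
    return float(np.linalg.norm(np.maximum(lo - q, 0.0) + np.maximum(q - hi, 0.0)))

def nn_search(node, q, best=(np.inf, None)):
    """Return (distance, point) of q's nearest neighbor, pruning boxes that cannot beat the current best."""
    if node is None or box_dist(q, node.lo, node.hi) >= best[0]:
        return best                                    # the whole box is too far away: prune
    if node.left is None and node.right is None:       # leaf: scan its points directly
        d = np.linalg.norm(node.points - q, axis=1)
        i = int(np.argmin(d))
        return (float(d[i]), node.points[i]) if d[i] < best[0] else best
    # Descend into the child whose bounding box is closer to q first
    for child in sorted((node.left, node.right), key=lambda c: box_dist(q, c.lo, c.hi)):
        best = nn_search(child, q, best)
    return best

# Example: root = build_kdtree(np.random.rand(1000, 2)); dist, point = nn_search(root, np.array([0.5, 0.5]))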
Curse of Dimensionality
- Imagine instances described by 20 attributes, but only 2 are relevant to the target function
- Curse of dimensionality: nearest neighbor is easily misled when X is high-dimensional
- Consider N data points uniformly distributed in a p-dimensional unit ball centered at the origin, and consider the NN estimate at the origin. The mean distance from the origin to the closest data point is given below.
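A standard closed form for this distance (it is derived as the median distance to the closest point in Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning) is

d(p, N) \;=\; \left( 1 - \left( \tfrac{1}{2} \right)^{1/N} \right)^{1/p}

For example, with N = 500 points in p = 10 dimensions, d ≈ 0.52: the nearest neighbor is typically more than halfway to the boundary of the ball, so the estimate at the origin is hardly "local" at all.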