Title: K nearest neighbor and Rocchio algorithm
1. K Nearest Neighbor and Rocchio Algorithm
- LING 572
- Fei Xia
- 1/11/2007
2. Announcement
- Hw2 is online now. It is due on Jan 20.
- Hw1 is due at 11pm on Jan 13 (Sat).
- Lab session after the class.
- Read the DT tutorial before next Tuesday's class.
3. K-Nearest Neighbor (kNN)
4. Instance-based (IB) learning
- No training: store all training instances.
- → Lazy learning
- Examples:
- kNN
- Locally weighted regression
- Radial basis functions
- Case-based reasoning
- The most well-known IB method: kNN
5. kNN
6. kNN
- For a new document d,
- find k training documents that are closest to d.
- perform majority voting or weighted voting.
- Properties:
- A lazy classifier: no training.
- Feature selection and the distance measure are crucial.
7. The algorithm
- Determine the parameter k.
- Calculate the distance between the query instance and all the training instances.
- Sort the distances and determine the k nearest neighbors.
- Gather the labels of the k nearest neighbors.
- Use simple majority voting or weighted voting.
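A minimal sketch of this procedure in Python; the function names and data layout are illustrative, not from the slides:

```python
from collections import Counter
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_classify(query, training_data, k):
    """Classify `query` by majority vote among its k nearest neighbors.

    training_data: list of (feature_vector, label) pairs.
    """
    # Lazy learning: all the work happens here, at test time.
    neighbors = sorted(training_data, key=lambda inst: euclidean(query, inst[0]))
    # Gather the labels of the k nearest neighbors and take a majority vote.
    top_labels = [label for _, label in neighbors[:k]]
    return Counter(top_labels).most_common(1)[0][0]
```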
8. Picking k
- Use N-fold cross validation: pick the k that minimizes the cross-validation error.
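One way to implement this, sketched with scikit-learn (a library choice not made in the slides; `X` and `y` stand for the user's feature matrix and labels):

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def pick_k(X, y, candidate_ks=(1, 3, 5, 7, 9), n_folds=10):
    """Return the k with the best mean N-fold cross-validation accuracy."""
    scores = {
        k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=n_folds).mean()
        for k in candidate_ks
    }
    # Maximizing accuracy is equivalent to minimizing cross-validation error.
    return max(scores, key=scores.get)
```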
9. Normalizing attribute values
- Distance could be dominated by attributes with large numbers. Ex: features = age, income.
- Original data: x1 = (35, 76K), x2 = (36, 80K), x3 = (70, 79K)
- Assume age ∈ [0, 100] and income ∈ [0, 200K].
- After normalization: x1 = (0.35, 0.38), x2 = (0.36, 0.40), x3 = (0.70, 0.395)
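The arithmetic can be checked with a small Python sketch (the function name is illustrative; the ranges and data points come from the slide):

```python
def min_max_normalize(x, ranges):
    """Scale each attribute to [0, 1] using its assumed (min, max) range."""
    return tuple((v - lo) / (hi - lo) for v, (lo, hi) in zip(x, ranges))

ranges = [(0, 100), (0, 200_000)]          # age in [0, 100], income in [0, 200K]
for x in [(35, 76_000), (36, 80_000), (70, 79_000)]:
    print(min_max_normalize(x, ranges))    # (0.35, 0.38), (0.36, 0.4), (0.7, 0.395)
```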
10. The Choice of Features
- Imagine there are 100 features, and only 2 of them are relevant to the target label.
- kNN is easily misled in high-dimensional space.
- → Feature weighting or feature selection
11. Feature weighting
- Stretch the j-th axis by weight wj.
- Use cross-validation to automatically choose the weights w1, ..., wn.
- Setting wj to zero eliminates this dimension altogether.
12. Similarity measure
- Euclidean distance: $d(x, y) = \sqrt{\sum_j (x_j - y_j)^2}$
- Weighted Euclidean distance: $d_w(x, y) = \sqrt{\sum_j w_j (x_j - y_j)^2}$
- Cosine similarity: $\cos(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|}$
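These measures translate directly into Python (a sketch with illustrative names):

```python
import math

def weighted_euclidean(x, y, w):
    """Weighted Euclidean distance; setting w[j] = 0 drops dimension j."""
    return math.sqrt(sum(wj * (a - b) ** 2 for wj, a, b in zip(w, x, y)))

def cosine_similarity(x, y):
    """Dot product of x and y divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norms
```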
13. Voting
- Majority voting: $c^* = \arg\max_c \sum_i \delta(c, f_i(x))$, where $f_i(x)$ is the class label of the $i$-th nearest neighbor and $\delta$ is the Kronecker delta.
- Weighted voting (the weight is on each neighbor): $c^* = \arg\max_c \sum_i w_i \, \delta(c, f_i(x))$, with $w_i = 1/\mathrm{dist}(x, x_i)$.
- → With weighted voting, we can use all the training examples.
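A sketch of weighted voting in Python, using the inverse-distance weights from the slide (names are illustrative):

```python
from collections import defaultdict

def weighted_vote(neighbors):
    """Pick the class with the largest total weight.

    neighbors: list of (distance, label) pairs; w_i = 1 / dist(x, x_i).
    """
    scores = defaultdict(float)
    for dist, label in neighbors:
        scores[label] += 1.0 / dist  # assumes dist > 0; add an epsilon if needed
    return max(scores, key=scores.get)
```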
14. Summary of kNN
- Strengths:
- Simplicity (conceptual)
- Efficiency at training: no training
- Handling multi-class
- Stability and robustness: averaging over k neighbors
- Prediction accuracy: good when the training data is large
- Weaknesses:
- Efficiency at testing time: need to calculate distances to all training instances
- Theoretical validity
- It is not clear which types of distance measures and features to use.
15. Rocchio Algorithm
16. Relevance Feedback for IR
- The issue: "plane" vs. "aircraft"
- Take advantage of user feedback on the relevance of docs to improve IR results:
- The user issues a short, simple query.
- The user marks returned documents as relevant or non-relevant.
- The system computes a better representation of the information need based on the feedback.
- Relevance feedback can go through one or more iterations.
- Idea: it may be difficult to formulate a good query when you don't know the collection well, so iterate.
17. Rocchio Algorithm
- The Rocchio algorithm incorporates relevance feedback information into the vector space model.
- We want to maximize sim(Q, Cr) - sim(Q, Cnr).
- The optimal query vector for separating relevant and non-relevant documents (with cosine similarity) is given below, where Qopt is the optimal query, Cr is the set of relevant doc vectors, and N is the collection size.
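In its standard form in the IR literature (matching the legend above), the optimal query is:

$$\vec{Q}_{opt} = \frac{1}{|C_r|} \sum_{\vec{d}_j \in C_r} \vec{d}_j \; - \; \frac{1}{N - |C_r|} \sum_{\vec{d}_j \notin C_r} \vec{d}_j$$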
18. Rocchio (1971) Algorithm (SMART)
- qm: the modified query vector
- q0: the original query vector
- α, β, γ: weights
- Dr: the set of known relevant doc vectors
- Dnr: the set of known irrelevant doc vectors
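The update rule these symbols describe, in its standard SMART-era form, is:

$$\vec{q}_m = \alpha \, \vec{q}_0 + \beta \, \frac{1}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j \; - \; \gamma \, \frac{1}{|D_{nr}|} \sum_{\vec{d}_j \in D_{nr}} \vec{d}_j$$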
19. Relevance feedback assumptions
- Relevance prototypes are "well-behaved":
- Term distributions in relevant documents will be similar.
- Term distributions in non-relevant documents will be different from those in relevant documents.
- Either: all relevant documents are tightly clustered around a single prototype.
- Or: there are different prototypes, but they have significant vocabulary overlap.
- Similarities between relevant and irrelevant documents are small.
20. Rocchio Algorithm for text classification
- Training time: construct a set of prototype vectors, one vector per class.
- Testing time: for a new document, find the most similar prototype vector.
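A compact sketch of both phases in Python (illustrative names; it instantiates the α/β prototype formula discussed on the next two slides):

```python
import math

def rocchio_train(docs, labels, alpha=1.0, beta=1.0):
    """Build one prototype vector per class.

    docs: list of feature vectors (e.g., tf-idf); labels: parallel class labels.
    Prototype = alpha * centroid(positive docs) - beta * centroid(negative docs).
    """
    prototypes = {}
    dim = len(docs[0])
    for c in set(labels):
        pos = [d for d, l in zip(docs, labels) if l == c]
        neg = [d for d, l in zip(docs, labels) if l != c]
        prototypes[c] = [
            alpha * sum(d[j] for d in pos) / len(pos)
            - (beta * sum(d[j] for d in neg) / len(neg) if neg else 0.0)
            for j in range(dim)
        ]
    return prototypes

def rocchio_classify(doc, prototypes):
    """Assign the class whose prototype has the highest cosine similarity."""
    def cos(x, y):
        dot = sum(a * b for a, b in zip(x, y))
        return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))
    return max(prototypes, key=lambda c: cos(doc, prototypes[c]))
```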
21. Training time
- Cj: the set of positive examples for class j
- D: the set of positive and negative examples
- α, β: weights
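In the standard formulation, the prototype vector for class j is:

$$\vec{c}_j = \alpha \, \frac{1}{|C_j|} \sum_{\vec{d} \in C_j} \vec{d} \; - \; \beta \, \frac{1}{|D - C_j|} \sum_{\vec{d} \in D - C_j} \vec{d}$$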
22. Why this formula?
- Rocchio showed that when α = β = 1, each prototype vector maximizes the quantity shown below.
- How does maximizing this formula connect to classification accuracy?
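The maximized quantity is presumably the mean-similarity difference (a standard reading of this result):

$$\frac{1}{|C_j|} \sum_{\vec{d} \in C_j} \mathrm{sim}(\vec{c}_j, \vec{d}) \; - \; \frac{1}{|D - C_j|} \sum_{\vec{d} \in D - C_j} \mathrm{sim}(\vec{c}_j, \vec{d})$$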
23. Testing time
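Following slide 20, the decision rule picks the class with the most similar prototype:

$$c^* = \arg\max_j \, \cos(\vec{c}_j, \vec{d})$$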
24. kNN vs. Rocchio
- kNN:
- Lazy learning: no training.
- Uses all the training instances at testing time.
- Rocchio algorithm:
- At training time, calculates prototype vectors.
- At testing time, uses only the prototype vectors instead of all the training instances.
- A linear classifier: not as expressive as kNN.
25. Summary of Rocchio
- Strengths:
- Simplicity (conceptual)
- Efficiency at training time
- Efficiency at testing time
- Handling multi-class
- Weaknesses:
- Theoretical validity
- Stability and robustness
- Prediction accuracy: it does not work well when the categories are not linearly separable.
26. Additional slides
27. Three major design choices
- The weight of feature fk in document di (e.g., tf-idf)
- Document length normalization
- The similarity measure (e.g., cosine)
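As an example of the first choice, one common tf-idf variant (the slides do not commit to a specific one) is:

$$w_{ik} = tf_{ik} \cdot \log \frac{N}{df_k},$$

where $tf_{ik}$ is the count of feature $f_k$ in document $d_i$, $df_k$ is the number of documents containing $f_k$, and $N$ is the total number of documents.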
28. Extending Rocchio?
- Generalized instance set (GIS) algorithm (Lam and Ho, 1998)