Title: K nearest neighbor and Rocchio algorithm
1. K Nearest Neighbor and Rocchio Algorithm
- LING 572
- Fei Xia
- 1/11/2007
2. Announcement
- Hw2 is online now. It is due on Jan 20.
- Hw1 is due at 11pm on Jan 13 (Sat).
- Lab session after the class.
- Read the DT tutorial before next Tuesday's class.
3. K-Nearest Neighbor (kNN)
4. Instance-based (IB) learning
- No training: store all training instances.
- → Lazy learning
- Examples:
- kNN
- Locally weighted regression
- Radial basis functions
- Case-based reasoning
- The most well-known IB method: kNN
5. kNN
6. kNN
- For a new document d,
- find k training documents that are closest to d.
- perform majority voting or weighted voting.
- Properties:
- A lazy classifier: no training.
- Feature selection and the distance measure are crucial.
7. The algorithm
- Determine the parameter k.
- Calculate the distance between the query instance and all the training instances.
- Sort the distances and determine the k nearest neighbors.
- Gather the labels of the k nearest neighbors.
- Use simple majority voting or weighted voting.
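A minimal sketch of this procedure in Python; the function names and data layout are illustrative, not from the slides:

```python
from collections import Counter
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_classify(query, training_data, k):
    """Classify `query` by majority vote among its k nearest neighbors.

    training_data: list of (feature_vector, label) pairs.
    """
    # Lazy learning: all the work happens here, at test time.
    neighbors = sorted(training_data, key=lambda inst: euclidean(query, inst[0]))
    # Gather the labels of the k nearest neighbors and take a majority vote.
    top_labels = [label for _, label in neighbors[:k]]
    return Counter(top_labels).most_common(1)[0][0]
```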
8. Picking k
- Use N-fold cross validation: pick the k that minimizes the cross-validation error.
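One way to implement this, sketched with scikit-learn (a library choice not made in the slides; `X` and `y` stand for the user's feature matrix and labels):

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def pick_k(X, y, candidate_ks=(1, 3, 5, 7, 9), n_folds=10):
    """Return the k with the best mean N-fold cross-validation accuracy."""
    scores = {
        k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=n_folds).mean()
        for k in candidate_ks
    }
    # Maximizing accuracy is equivalent to minimizing cross-validation error.
    return max(scores, key=scores.get)
```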
9. Normalizing attribute values
- Distance could be dominated by attributes with large numbers. Ex: features = age, income.
- Original data: x1 = (35, 76K), x2 = (36, 80K), x3 = (70, 79K)
- Assume age ∈ [0, 100] and income ∈ [0, 200K].
- After normalization: x1 = (0.35, 0.38), x2 = (0.36, 0.40), x3 = (0.70, 0.395)
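The arithmetic can be checked with a small Python sketch (the function name is illustrative; the ranges and data points come from the slide):

```python
def min_max_normalize(x, ranges):
    """Scale each attribute to [0, 1] using its assumed (min, max) range."""
    return tuple((v - lo) / (hi - lo) for v, (lo, hi) in zip(x, ranges))

ranges = [(0, 100), (0, 200_000)]          # age in [0, 100], income in [0, 200K]
for x in [(35, 76_000), (36, 80_000), (70, 79_000)]:
    print(min_max_normalize(x, ranges))    # (0.35, 0.38), (0.36, 0.4), (0.7, 0.395)
```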
10. The Choice of Features
- Imagine there are 100 features, and only 2 of them are relevant to the target label.
- kNN is easily misled in high-dimensional space.
- → Feature weighting or feature selection
11. Feature weighting
- Stretch the j-th axis by weight wj.
- Use cross-validation to automatically choose the weights w1, ..., wn.
- Setting wj to zero eliminates this dimension altogether.
12. Similarity measure
- Euclidean distance: $d(x, y) = \sqrt{\sum_j (x_j - y_j)^2}$
- Weighted Euclidean distance: $d_w(x, y) = \sqrt{\sum_j w_j (x_j - y_j)^2}$
- Cosine similarity: $\cos(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|}$
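These measures translate directly into Python (a sketch with illustrative names):

```python
import math

def weighted_euclidean(x, y, w):
    """Weighted Euclidean distance; setting w[j] = 0 drops dimension j."""
    return math.sqrt(sum(wj * (a - b) ** 2 for wj, a, b in zip(w, x, y)))

def cosine_similarity(x, y):
    """Dot product of x and y divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norms
```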
13. Voting
- Majority voting: $c^* = \arg\max_c \sum_i \delta(c, f_i(x))$, where $f_i(x)$ is the class label of the $i$-th nearest neighbor and $\delta$ is the Kronecker delta.
- Weighted voting (the weight is on each neighbor): $c^* = \arg\max_c \sum_i w_i \, \delta(c, f_i(x))$, with $w_i = 1/\mathrm{dist}(x, x_i)$.
- → With weighted voting, we can use all the training examples.
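A sketch of weighted voting in Python, using the inverse-distance weights from the slide (names are illustrative):

```python
from collections import defaultdict

def weighted_vote(neighbors):
    """Pick the class with the largest total weight.

    neighbors: list of (distance, label) pairs; w_i = 1 / dist(x, x_i).
    """
    scores = defaultdict(float)
    for dist, label in neighbors:
        scores[label] += 1.0 / dist  # assumes dist > 0; add an epsilon if needed
    return max(scores, key=scores.get)
```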
14. Summary of kNN
- Strengths:
- Simplicity (conceptual)
- Efficiency at training: no training
- Handling multi-class
- Stability and robustness: averaging over k neighbors
- Prediction accuracy: good when the training data is large
- Weaknesses:
- Efficiency at testing time: need to calculate distances to all training instances
- Theoretical validity
- It is not clear which types of distance measures and features to use.
15. Rocchio Algorithm
16. Relevance Feedback for IR
- The issue: "plane" vs. "aircraft"
- Take advantage of user feedback on the relevance of docs to improve IR results:
- The user issues a short, simple query.
- The user marks returned documents as relevant or non-relevant.
- The system computes a better representation of the information need based on the feedback.
- Relevance feedback can go through one or more iterations.
- Idea: it may be difficult to formulate a good query when you don't know the collection well, so iterate.
17. Rocchio Algorithm
- The Rocchio algorithm incorporates relevance feedback information into the vector space model.
- We want to maximize sim(Q, Cr) - sim(Q, Cnr).
- The optimal query vector for separating relevant and non-relevant documents (with cosine similarity) is given below, where Qopt is the optimal query, Cr is the set of relevant doc vectors, and N is the collection size.
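In its standard form in the IR literature (matching the legend above), the optimal query is:

$$\vec{Q}_{opt} = \frac{1}{|C_r|} \sum_{\vec{d}_j \in C_r} \vec{d}_j \; - \; \frac{1}{N - |C_r|} \sum_{\vec{d}_j \notin C_r} \vec{d}_j$$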
18. Rocchio (1971) Algorithm (SMART)
- qm: the modified query vector
- q0: the original query vector
- α, β, γ: weights
- Dr: the set of known relevant doc vectors
- Dnr: the set of known irrelevant doc vectors
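The update rule these symbols describe, in its standard SMART-era form, is:

$$\vec{q}_m = \alpha \, \vec{q}_0 + \beta \, \frac{1}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j \; - \; \gamma \, \frac{1}{|D_{nr}|} \sum_{\vec{d}_j \in D_{nr}} \vec{d}_j$$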
19. Relevance feedback assumptions
- Relevance prototypes are "well-behaved":
- Term distributions in relevant documents will be similar.
- Term distributions in non-relevant documents will be different from those in relevant documents.
- Either: all relevant documents are tightly clustered around a single prototype.
- Or: there are different prototypes, but they have significant vocabulary overlap.
- Similarities between relevant and irrelevant documents are small.
20. Rocchio Algorithm for text classification
- Training time: construct a set of prototype vectors, one vector per class.
- Testing time: for a new document, find the most similar prototype vector.
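A compact sketch of both phases in Python (illustrative names; it instantiates the α/β prototype formula discussed on the next two slides):

```python
import math

def rocchio_train(docs, labels, alpha=1.0, beta=1.0):
    """Build one prototype vector per class.

    docs: list of feature vectors (e.g., tf-idf); labels: parallel class labels.
    Prototype = alpha * centroid(positive docs) - beta * centroid(negative docs).
    """
    prototypes = {}
    dim = len(docs[0])
    for c in set(labels):
        pos = [d for d, l in zip(docs, labels) if l == c]
        neg = [d for d, l in zip(docs, labels) if l != c]
        prototypes[c] = [
            alpha * sum(d[j] for d in pos) / len(pos)
            - (beta * sum(d[j] for d in neg) / len(neg) if neg else 0.0)
            for j in range(dim)
        ]
    return prototypes

def rocchio_classify(doc, prototypes):
    """Assign the class whose prototype has the highest cosine similarity."""
    def cos(x, y):
        dot = sum(a * b for a, b in zip(x, y))
        return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))
    return max(prototypes, key=lambda c: cos(doc, prototypes[c]))
```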
21. Training time
- Cj: the set of positive examples for class j
- D: the set of positive and negative examples
- α, β: weights
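In the standard formulation, the prototype vector for class j is:

$$\vec{c}_j = \alpha \, \frac{1}{|C_j|} \sum_{\vec{d} \in C_j} \vec{d} \; - \; \beta \, \frac{1}{|D - C_j|} \sum_{\vec{d} \in D - C_j} \vec{d}$$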
22. Why this formula?
- Rocchio showed that when α = β = 1, each prototype vector maximizes the quantity shown below.
- How does maximizing this formula connect to classification accuracy?
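The maximized quantity is presumably the mean-similarity difference (a standard reading of this result):

$$\frac{1}{|C_j|} \sum_{\vec{d} \in C_j} \mathrm{sim}(\vec{c}_j, \vec{d}) \; - \; \frac{1}{|D - C_j|} \sum_{\vec{d} \in D - C_j} \mathrm{sim}(\vec{c}_j, \vec{d})$$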
23. Testing time
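Following slide 20, the decision rule picks the class with the most similar prototype:

$$c^* = \arg\max_j \, \cos(\vec{c}_j, \vec{d})$$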
24. kNN vs. Rocchio
- kNN:
- Lazy learning: no training.
- Uses all the training instances at testing time.
- Rocchio algorithm:
- At training time, calculates prototype vectors.
- At testing time, uses only the prototype vectors instead of all the training instances.
- A linear classifier: not as expressive as kNN.
25. Summary of Rocchio
- Strengths:
- Simplicity (conceptual)
- Efficiency at training time
- Efficiency at testing time
- Handling multi-class
- Weaknesses:
- Theoretical validity
- Stability and robustness
- Prediction accuracy: it does not work well when the categories are not linearly separable.
26. Additional slides
27. Three major design choices
- The weight of feature fk in document di (e.g., tf-idf)
- Document length normalization
- The similarity measure (e.g., cosine)
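As an example of the first choice, one common tf-idf variant (the slides do not commit to a specific one) is:

$$w_{ik} = tf_{ik} \cdot \log \frac{N}{df_k},$$

where $tf_{ik}$ is the count of feature $f_k$ in document $d_i$, $df_k$ is the number of documents containing $f_k$, and $N$ is the total number of documents.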
28. Extending Rocchio?
- Generalized instance set (GIS) algorithm (Lam and Ho, 1998)