Title: Active Learning for Internet Information Retrieval

1. Active Learning for Internet Information Retrieval
- Yu Hen Hu
- University of Wisconsin-Madison
- Dept. of Electrical and Computer Engineering
- hu_at_engr.wisc.edu
- Part of this work was done in collaboration with Partha Niyogi and B. H. (Fred) Juang (Bell Labs)
2. Outline
- Internet Information Retrieval
- Pattern Classification
- Active Learning
  - An introduction to active learning
  - A minimax active learning strategy
3. Internet Information Retrieval
[Diagram: distributed sources of documents (web pages and others) connect through the Internet to search/index engines, which serve results to the user interface.]
4. Internet Search: State of the Art
- Too many results?
- Example: you want information about repairing portable computers; instead, you get portable machine tools!
5. Issues in Internet Search
- Precision
  - WYGIWYW: What you get is what you want?
  - What fraction of the retrieved documents are relevant?
- Recall
  - Are all the relevant documents found in this search?
- Efficiency
  - How soon can the desired search results be obtained?
6. Internet Document Retrieval Process
[Diagram: a web-bot searches the web to index web documents. Text and web documents pass through word breaking, a stoplist, stemming, feature normalization, and noise reduction before entering the DATABASE (indexing, Boolean search, etc.). The user submits queries through query pre-processing; results are presented back, and relevance feedback flows through an intelligent active learning user interface.]
7. Information Retrieval as Pattern Classification
- Given a set of feature vectors {x_i}, each representing a document.
- A user provides a query x, also in the form of a feature vector.
- Using x as a prototype, for a given error bound ε, label a document as relevant if its feature vector satisfies
  - d(x_i, x) ≤ ε
- Otherwise, the document is labeled irrelevant.
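This thresholding rule can be sketched in a few lines. The Euclidean metric and the toy vectors below are assumptions for illustration; the slide leaves d(·,·) unspecified.

```python
import math

def retrieve(query, doc_vectors, eps):
    """Label each document relevant iff d(x_i, x) <= eps (Euclidean metric assumed)."""
    labels = {}
    for name, vec in doc_vectors.items():
        dist = math.dist(query, vec)  # Euclidean distance between feature vectors
        labels[name] = "relevant" if dist <= eps else "irrelevant"
    return labels

# Toy 2-D feature vectors (hypothetical)
docs = {"d1": [1.0, 0.0], "d2": [0.0, 1.0]}
print(retrieve([1.0, 0.1], docs, 0.5))
```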
8. Feature Vector: Document Term Vector
- Relative term frequency (within a document)
  - tf(t, d) = (count of term t) / (# of terms in document d)
- Inverse document frequency
  - df(t) = (total # of documents) / (# of documents containing t)
- Weighted term frequency
  - d_t = tf(t, d) · log df(t)
- Document term vector: D = [d_1, d_2, …]
9. Term Vector Example
- Document 1: "The weather is great these days."
- Document 2: "These are great ideas."
- Document 3: "You look great."
- Stoplist: eliminate "the", "is", "these", "are", "you".
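The example can be worked through with the definitions from the previous slide. The whitespace tokenization and natural-log base are assumptions; the slides do not specify either.

```python
import math
from collections import Counter

docs = {
    "D1": "the weather is great these days",
    "D2": "these are great ideas",
    "D3": "you look great",
}
stoplist = {"the", "is", "these", "are", "you"}

# Tokenize and drop stoplist words
tokens = {d: [w for w in text.split() if w not in stoplist]
          for d, text in docs.items()}

vocab = sorted({w for ws in tokens.values() for w in ws})
n_docs = len(docs)
# df(t) = (total # of documents) / (# of documents containing t)
df = {t: n_docs / sum(t in ws for ws in tokens.values()) for t in vocab}

def term_vector(doc_id):
    """Document term vector D = [d_t], with d_t = tf(t, d) * log df(t)."""
    counts = Counter(tokens[doc_id])
    n_terms = len(tokens[doc_id])
    return [counts[t] / n_terms * math.log(df[t]) for t in vocab]
```

Note that "great" appears in all three documents, so df("great") = 1 and its weight log df = 0: a term common to every document carries no discriminating information.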
10. Pattern Classification
- Let x be a feature vector and {C_k, 1 ≤ k ≤ K} be K class labels. We assume x has a mixture probability distribution
  - p(x) = Σ_{k=1}^{K} p[C_k] p(x | C_k)
- A pattern classifier is a decision function d(x) ∈ {C_k, 1 ≤ k ≤ K} that maps each x to a class label.
- Thus, d(x) partitions the feature space X into disjoint regions R_k = {x : x ∈ X, d(x) = C_k}.
11. Bayes Rule and Posterior Probability
- p[C_k] = p[x ∈ C_k]: prior probability
- p(x | C_k): likelihood function, a conditional probability
- p[C_k | x]: posterior probability
- To minimize P[error], one must maximize p[C_k | x] for each x.
- MAP (maximum a posteriori probability) classifier:
  - d_MAP(x) = C_k s.t. p[C_k | x] ≥ p[C_k' | x] for all k' ≠ k
- Bayes rule: p[C_k | x] = p(x | C_k) p[C_k] / p(x)
- Decision boundary
  - Finding d(x) is equivalent to finding the decision boundaries.
  - If K = 2, η(x) = p[C_2 | x] = 1/2 at the decision boundary.
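A minimal sketch of a MAP classifier, assuming two hypothetical 1-D Gaussian class-conditional densities with equal priors (the densities and parameters are illustrative, not from the slides):

```python
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical priors p[Ck] and class-conditional parameters (mu, sigma)
priors = {"C1": 0.5, "C2": 0.5}
params = {"C1": (0.0, 1.0), "C2": (2.0, 1.0)}

def d_map(x):
    """MAP rule: pick the class maximizing p[Ck] * p(x | Ck),
    which by Bayes rule is proportional to the posterior p[Ck | x]."""
    return max(priors, key=lambda c: priors[c] * gaussian_pdf(x, *params[c]))
```

With equal priors and equal variances, the decision boundary falls at the midpoint x = 1, so d_map(0.2) returns "C1" and d_map(1.8) returns "C2".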
12. Probability of Mis-Classification
- The quality of a pattern classifier is often measured by the probability of mis-classification.
[Figure: the curves p[C_1] p(x | C_1) and p[C_2] p(x | C_2) plotted against x, crossing at the decision boundary x_0, which splits the axis into R_1 = {x : d(x) = C_1} and R_2 = {x : d(x) = C_2}.]
13. Excessive Probability of Mis-Classification
[Figure: P_ece is the excess error area between the estimated decision boundary x̂ and the optimal decision boundary x*; the remaining area is the Bayes error P*.]
14. Upper Bound of P_ece
- Assume p(x) = 1 (uniform); η(x) = P[C_2 | x] is unknown.
- [Equation: upper bound on P_ece, not recoverable from the transcript.]
- For 1-D, K = 2 problems, the above upper bound can be used to estimate P_ece.
- From an excessive-probability-of-mis-classification point of view, the error in estimating x* is not the only concern; the slope of p[C_2 | x] near x* also matters.
[Figure: a posterior-probability interpretation, showing η(x) crossing the level 1/2 at x* for different slopes.]
15. Active Learning Agent for Internet Search
- A mediator (agent) between the human user and the search engine, performing tasks to
  - interpret human queries,
  - organize search results,
  - solicit relevance feedback from users (asking questions).
16. Relevance Feedback and Active Learning
- Relevance feedback
  - After searching with the initial query provided by the user, the IR agent presents examples of retrieved documents and asks the user to label each as relevant or irrelevant.
- Use of relevance feedback
  - Based on the user's feedback, the query can be refined to perform a refined search, pruning out more irrelevant documents and retrieving more relevant documents like those the user specified.
- Relation to active learning
  - Too many questions will quickly bore the user, who may abandon the search prematurely.
  - The agent must select a small subset of the most informative documents on which to ask for user feedback.
  - The process of selecting the right question to ask can be formulated as an active learning problem.
17. What is Learning?
- A process to find relations between input and output (a mapping)
  - modeling, estimation, detection, classification, etc.
- Samples of output values at selected input points are given, say (x_1, y_1), (x_2, y_2), (x_3, y_3).
- Goal of learning: find y = f(x).
18. What is Active Learning?
- The learning algorithm (learner) actively requests the next sample (asks a question).
- Sequential sampling (learning)
  - The learner decides where to take the next sample based on observations of past samples, rather than sampling randomly.
- We use "active sampling" and "active learning" interchangeably.
19. Why Active Learning?
- FASTER
  - Learning can be more efficient!
  - Fewer samples may imply less training time.
- CHEAPER
  - Samples (or labeling samples) are expensive!
20. Active Learning: A Pattern Classification Formulation
- Let X be a feature space and x ∈ X be a feature vector. Each x is associated with one of K labels C = {C_k, 1 ≤ k ≤ K}. The prior probability p[C_k] is the probability that x is associated with label C_k.
- We want to devise a pattern classifier d(x), using a learning algorithm (agent), that satisfies a performance constraint: P[error] must be within a pre-defined bound. Equivalently, the probability of excessive classification error must be bounded by a small positive number ε:
  - P_ece ≤ ε
- An oracle (the user) will provide training samples {(x_i, y_i) : x_i ∈ X, y_i ∈ C} at a cost.
- The goal of active learning is to minimize this sampling cost (# of labeled samples) subject to the performance constraint P_ece ≤ ε.
- Conventional pattern classification sampling: the oracle provides both x_i and y_i; the x_i are often randomly sampled within X.
- Active learning sampling: the agent specifies x_i, and the oracle provides the corresponding y_i.
21. A One-Dimensional Formulation
- Assume a unique decision boundary x* ∈ [0, 1].
- Define η(x) = p[C_2 | x]. η(x) is unknown, but assumed to be non-decreasing on [0, 1], with η(0) < 1/2 < η(1).
- Each time the agent requests one or a few training samples at a sampling point x ∈ [0, 1], the oracle returns a class label y(x).
- We assume the class labels returned by repeatedly sampling at the same value of x obey a binomial distribution with mean η(x).
  - For example, if η(x) = 0.7 and we sample 10 times at x, then on average we observe the class label C_2 in 7 of the 10 trials, and C_1 in the rest.
  - In practice, repeated sampling at the same point can be replaced by taking multiple samples in a small neighborhood surrounding x.
- Our goal is to find an estimate of x* such that P_ece ≤ ε while using the smallest number of samples.
22. More on the 1-D Task
- A stochastic search problem
  - Find x* ∈ [0, 1] such that η(x*) = 1/2. We only know that η(x) is non-decreasing, with η(0) < 1/2 < η(1).
  - Perform an experiment at each x, with return y(x) from the oracle: y(x) = 0 if x ∈ C_1 and y(x) = 1 if x ∈ C_2.
  - But p[x ∈ C_2] = η(x) and p[x ∈ C_1] = 1 − η(x) are both unknown; only a 0 or 1 is observed for each x. Repeated sampling at the same x may yield different labels!
- The Robbins-Monro method can be used:
  - If x is to the left of x*, y(x) is more likely to be 0, so x should be incremented toward x*.
  - If x is to the right of x*, y(x) is more likely to be 1, so x should be decremented toward x*.
  - It converges in probability, but is often slow, and too many samples may be needed.
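The Robbins-Monro iteration can be sketched as follows. The particular η(x), the step size a_n = 1/n, and the true boundary x* = 0.6 are illustrative assumptions, not values from the slides.

```python
import random

def eta(x):
    # Hypothetical non-decreasing posterior p[C2 | x]; true boundary x* = 0.6
    return min(1.0, max(0.0, 0.5 + 2.0 * (x - 0.6)))

def robbins_monro(n_steps=5000, x0=0.5, seed=0):
    rng = random.Random(seed)
    x = x0
    for n in range(1, n_steps + 1):
        # Oracle returns label 1 (C2) with probability eta(x)
        y = 1 if rng.random() < eta(x) else 0
        # Move left when labels suggest eta(x) > 1/2, right otherwise;
        # the decaying step a_n = 1/n gives convergence in probability.
        x -= (1.0 / n) * (y - 0.5)
        x = min(1.0, max(0.0, x))  # keep the iterate inside [0, 1]
    return x
```

As the slide notes, convergence is slow: the estimate drifts toward x* at a rate governed by the decaying step size, and thousands of oracle calls may be needed for a tight estimate.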
23. Overview of a New Approach
- By repeated sampling at the same sampling point x, we can estimate Pr[η(x) ≤ 1/2] = Pr[the decision boundary x* is to the right of the current sampling point x].
- n sampling points partition the interval [0, 1] into n+1 sub-intervals.
- Combining these probabilities, we can estimate Pr[x* ∈ a sub-interval] for each sub-interval. This is the empirical p.d.f. of x* over [0, 1].
- An estimate of x* is the mean of this empirical p.d.f.
- Question: how do we actively select a new sampling point?
- Answer: select the next sampling point to minimize the maximum of P_ece. This is a minimax heuristic.
- Our contribution is a new algorithm that facilitates this minimax active sampling strategy.
24. Estimating Pr[η ≤ 1/2 | r, n]
- Assume we observe r 1's out of n trials.
- If we sample at x = 0.1 five times and observe 0, 0, 1, 0, 0, what is the probability that p[C_2 | x = 0.1] = η(0.1) > 0.5?
- From the observed sequence, we derived a formula for p[η ≤ 1/2 | r, n].
[Figure: a plot of P[η ≤ 1/2 | r, n] as a function of r and n, for n up to 36.]
25. Details of the Formula
- Assume a uniform prior p(η) = 1. The posterior after observing r 1's in n trials is proportional to η^r (1 − η)^(n−r), so
  - p[η ≤ 1/2 | r, n] = ∫_0^{1/2} η^r (1 − η)^(n−r) dη / ∫_0^1 η^r (1 − η)^(n−r) dη
- No closed-form solution is available.
- The formula is numerically ill-conditioned due to the subtraction of series terms.
- Example: n = 6, r = 1 gives
  - p[η ≤ 1/2 | r, n] = 0.9375,
  - which says that if in 6 trials we observe a 1 (i.e., C_2) only once, then x is to the left of x* with 93.75% probability.
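One way to evaluate p[η ≤ 1/2 | r, n] while avoiding the ill-conditioned series is to note that, under the uniform prior, the posterior is Beta(r+1, n−r+1), and its CDF at 1/2 equals a binomial tail sum with all-positive terms. This identity is a standard Beta/binomial fact, not the slide's own derivation; it reproduces the worked example above.

```python
from math import comb

def p_eta_below_half(r, n):
    """P[eta <= 1/2 | r ones in n trials], assuming the uniform prior p(eta) = 1.
    Uses the identity P[Beta(r+1, n-r+1) <= 1/2] = P[Binomial(n+1, 1/2) >= r+1],
    a sum of positive terms (no catastrophic cancellation)."""
    return sum(comb(n + 1, j) for j in range(r + 1, n + 2)) / 2 ** (n + 1)

print(p_eta_below_half(1, 6))  # the slide's example: n = 6, r = 1 gives 0.9375
```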
26. Partition of the Interval
- Assume η(0) < 1/2 and η(1) > 1/2.
- If P[η(0.5) ≤ 1/2] = 0.3, then
  - P[x* ∈ [0.5, 1]] = 0.3, and
  - P[x* ∈ [0, 0.5]] = 0.7. Next,
- if P[η(0.25) ≤ 1/2] = 0.2, then
  - P[x* ∈ [0.25, 0.5] | x* ∈ [0, 0.5]] = 0.2
  - P[x* ∈ [0, 0.25] | x* ∈ [0, 0.5]] = 0.8
- Hence,
  - P[x* ∈ [0.25, 0.5]] = 0.2 × 0.7 = 0.14
  - P[x* ∈ [0, 0.25]] = 0.8 × 0.7 = 0.56
- This leads to a tree representation.
[Figure: a tree interpretation of the partition of the interval [0, 1] into sub-intervals by the sampling points, with P[x* ∈ sub-interval] computed at each leaf.]
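The tree computation can be sketched directly. The function below assumes each new measurement reports P[η(x) ≤ 1/2] conditioned on the sub-interval being split, matching the worked numbers above.

```python
def interval_probs(samples):
    """samples: list of (x, p_right) in sampling order, where p_right is the
    conditional probability that x* lies to the right of x within the
    sub-interval being split. Returns (lo, hi, P[x* in [lo, hi]]) tuples."""
    intervals = [(0.0, 1.0, 1.0)]  # start with all mass on [0, 1]
    for x, p_right in samples:
        new = []
        for lo, hi, p in intervals:
            if lo < x < hi:
                # Split this interval: mass p_right goes right, the rest left
                new.append((lo, x, p * (1 - p_right)))
                new.append((x, hi, p * p_right))
            else:
                new.append((lo, hi, p))
        intervals = new
    return intervals

# The slide's example: sample at 0.5 (p_right = 0.3), then at 0.25 (p_right = 0.2)
print(interval_probs([(0.5, 0.3), (0.25, 0.2)]))
```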
27. Upper Bound over Multiple Intervals
- Since η(x) is unknown, P_excess can be further bounded from above by a function F(x) of the sampling points and interval probabilities.
- [Equations: the bound and its supporting definitions are not recoverable from the transcript.]
28. Minimax Active Sampling Strategy
- Given the sampling points x_i and their interval probabilities, choose the next x such that F(x) is minimized.
- F(x) is a piecewise-linear function, so the solution must occur at one of the centers of the sub-intervals, (x_i + x_{i+1})/2.
- P_ece can be estimated using the previous upper bounds to verify whether P_ece ≤ ε is satisfied.
- Open issues
  - How many samples should be taken repeatedly at each sampling point?
  - Is the total number of sampling points minimized?
29. Future Work
- Refine the results
  - Relax assumptions about constant p(x), p(η), etc.
  - Theoretical bounds: what is the minimum number of samples for a given task? Is this the correct question to ask?
  - What if the sampling points are pre-determined, but labels are to be queried?
  - Higher-dimensional generalization
- Relations to other methods
  - Importance sampling, design of experiments
  - Query learning
- Applications to real-world problems
  - Internet information retrieval
  - Digital libraries