Transcript and Presenter's Notes
Title: Active Learning for Internet Information Retrieval
1
Active Learning for Internet Information Retrieval
  • Yu Hen Hu
  • University of Wisconsin-Madison
  • Dept. Electrical and Computer Engineering
  • hu@engr.wisc.edu

Part of this work was done in collaboration with Partha Niyogi and B. H. (Fred) Juang (Bell Labs)
2
Outline
  • Internet Information Retrieval
  • Pattern Classification
  • Active learning
  • An introduction to active learning
  • Minimax active learning strategy

3
Internet Information Retrieval
[Diagram: distributed sources of documents (web pages and others) connect through the Internet to search/index engines and a user interface.]
4
Internet Search: State of the Art
Too many results?
You want information about repairing portable computers. Instead, you get portable machine tools!
5
Issues in Internet Search
  • Precision
  • WYGIWYW: What you get is what you want?
  • What fraction of the retrieved documents are relevant?
  • Recall
  • Are all the relevant documents found in this search?
  • Efficiency
  • How soon can the desired search results be obtained?

6
Internet Document Retrieval Process
[Flow diagram: a web-bot searches the web to index web documents. Text and web documents are broken into words, filtered through a stoplist, stemmed, feature-normalized, and noise-reduced before entering the database (indexing, Boolean search, etc.). On the user side, query pre-processing, result presentation, and relevance feedback connect the user to the database through an intelligent active-learning user interface.]
7
Information Retrieval as Pattern Classification
  • Given a set of feature vectors x_i, each representing a document.
  • A user provides a query, x, also in the form of a feature vector.
  • Using x as a prototype, for a given error bound ε, label a document as relevant if its feature vector satisfies
  • d(x_i, x) ≤ ε
  • Otherwise, the document is labeled as irrelevant (a minimal code sketch follows below).
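As a rough illustration of the labeling rule above, the following Python sketch labels each document by thresholding its distance to the query prototype. The function name, the toy vectors, and the Euclidean distance are assumptions for illustration; the slides leave d(·,·) unspecified.

```python
import numpy as np

def label_documents(doc_vectors, query, eps):
    """Label each document relevant if d(x_i, x) <= eps, irrelevant otherwise.
    Euclidean distance is assumed here; the slides do not fix the metric."""
    return ["relevant" if np.linalg.norm(x_i - query) <= eps else "irrelevant"
            for x_i in doc_vectors]

# Toy usage with made-up 2-D term vectors and an arbitrary error bound.
docs = [np.array([0.9, 0.1]), np.array([0.2, 0.8]), np.array([0.85, 0.2])]
print(label_documents(docs, query=np.array([1.0, 0.0]), eps=0.5))
```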

8
Feature Vector: Document Term Vector
  • Relative term frequency (within a document):
  • tf(t, d) = (count of term t in d) / (# of terms in document d)
  • Inverse document frequency:
  • df(t) = (total # of documents) / (# of documents containing t)
  • Weighted term frequency:
  • d_t = tf(t, d) · log df(t)
  • Document term vector: D = [d_1, d_2, ...]

9
Term Vector Example
  • Document 1: The weather is great these days.
  • Document 2: These are great ideas.
  • Document 3: You look great.
  • Eliminate the stoplist words "the", "is", "these", "are", "you" (a worked sketch follows below).
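A minimal Python sketch of the term-vector computation from the previous slide, applied to these three documents. The natural logarithm and the variable names are assumptions; the slides do not specify the log base.

```python
import math
from collections import Counter

docs = [
    "The weather is great these days",
    "These are great ideas",
    "You look great",
]
stoplist = {"the", "is", "these", "are", "you"}
docs = [[t for t in d.lower().split() if t not in stoplist] for d in docs]

n_docs = len(docs)
vocab = sorted({t for d in docs for t in d})
n_containing = {t: sum(t in d for d in docs) for t in vocab}  # # of documents containing t

def term_vector(d):
    """d_t = tf(t, d) * log df(t), with tf(t, d) the relative term frequency in d
    and df(t) = (total # of documents) / (# of documents containing t)."""
    counts = Counter(d)
    return [counts[t] / len(d) * math.log(n_docs / n_containing[t]) for t in vocab]

for d in docs:
    print(dict(zip(vocab, term_vector(d))))
```

Note that "great", which appears in all three documents, gets weight log(3/3) = 0, so it carries no discriminating information.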

10
Pattern Classification
  • Let x be a feature vector and {Ck, 1 ≤ k ≤ K} be K class labels. We assume x has a mixture probability distribution p(x) = Σk P[Ck] p(x | Ck).
  • A pattern classifier is a decision function d(x) ∈ {Ck, 1 ≤ k ≤ K} that maps each x to a class label.
  • Thus, d(x) partitions the feature space X into disjoint regions Rk = {x : x ∈ X, d(x) = Ck}.

11
Bayes Rule and Posterior Probability
  • P[Ck] = P[x ∈ Ck]: prior probability
  • p(x | Ck): likelihood function (class-conditional probability)
  • P[Ck | x]: posterior probability
  • To minimize P[error], one must maximize P[Ck | x] for each x.
  • MAP (maximum posterior probability) classifier:
  • d_MAP(x) = Ck s.t.
  • P[Ck | x] ≥ P[Ck' | x] for all k' ≠ k
  • Bayes rule: P[Ck | x] = p(x | Ck) P[Ck] / p(x)
  • Decision boundary:
  • Finding d(x) is equivalent to finding the decision boundaries.
  • If K = 2, η(x) = P[C2 | x] = 1/2 at the decision boundary (a toy sketch follows below).
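A toy MAP classifier in Python. The two Gaussian class-conditional densities and the priors below are assumptions chosen purely for illustration; they are not taken from the slides.

```python
from scipy.stats import norm

# Hypothetical two-class setup: priors and Gaussian likelihoods assumed for illustration.
priors = {"C1": 0.6, "C2": 0.4}
likelihoods = {"C1": norm(loc=0.0, scale=1.0), "C2": norm(loc=2.0, scale=1.0)}

def d_map(x):
    """MAP classifier: pick the class maximizing P[Ck | x] via Bayes rule.
    The evidence p(x) is common to all classes and can be dropped from the comparison."""
    scores = {k: priors[k] * likelihoods[k].pdf(x) for k in priors}
    return max(scores, key=scores.get)

print(d_map(0.5), d_map(1.8))  # the boundary sits where P[C2 | x] crosses 1/2
```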

12
Probability of Mis-Classification
  • The quality of a pattern classifier is often
    determined by the probability of
    mis-classification

[Figure: the class-conditional curves P[C1] p(x | C1) and P[C2] p(x | C2) plotted against x, with decision threshold x0 separating R1 = {x : d(x) = C1} from R2 = {x : d(x) = C2}.]
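The expression for the probability of misclassification itself did not survive the transcript. A reconstruction of the standard two-class form suggested by the figure, not copied from the slide, is:

```latex
P[\mathrm{error}]
  = P[C_1]\int_{R_2} p(x \mid C_1)\,dx + P[C_2]\int_{R_1} p(x \mid C_2)\,dx
```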
13
Excess Probability of Mis-classification
[Figure: the excess probability of misclassification P_ece is the additional error incurred when the estimated decision boundary x̂ differs from the optimal decision boundary x*; the optimal boundary achieves the Bayes error P*.]
14
Upper Bound of P_ece
Assume p(x) = 1 (a uniform density on [0, 1]).
η(x) = P[C2 | x] is unknown.
P_ece ≤ (upper-bound expression; a hedged reconstruction follows below)
  • For 1-D, K = 2 problems, the above upper bound can be used to estimate P_ece.
  • From an excess probability of misclassification point of view, the error in estimating x* is not the only concern; the slope of P[C2 | x] near x* also matters.

[Figure: a posterior probability interpretation. η(x) is plotted against x and crosses the level 1/2 at the optimal boundary x*.]
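The bound referred to above is also missing from the transcript. Under the stated assumptions (p(x) = 1 on [0, 1], a single boundary, K = 2), a standard reconstruction of the excess error of an estimated boundary x̂ relative to the optimal boundary x* is:

```latex
P_{ece}
  = \int_{\min(x^*,\hat{x})}^{\max(x^*,\hat{x})} \lvert 2\eta(x) - 1 \rvert \, dx
  \;\le\; \lvert \hat{x} - x^* \rvert
```

So both the estimation error |x̂ − x*| and the slope of η(x) = P[C2 | x] near x* determine how large P_ece actually is.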
15
Active Learning Agent for Internet Search
  • A mediator (agent) between the human user and the search engine, performing tasks to:
  • Interpret human queries,
  • Organize search results,
  • Solicit relevance feedback from users (asking questions).

16
Relevance Feedback and Active Learning
  • Relevance feedback
  • After a search using the initial query provided by the user, the IR agent presents examples of retrieved documents and asks the user to label each of them as relevant or irrelevant.
  • Use of relevance feedback
  • Based on the user's feedback, the query can be refined to perform a more focused search, to prune out more irrelevant documents, and to retrieve more relevant documents like those the user specified.
  • Relation to active learning
  • Too many questions will quickly bore the user, who may abandon the search prematurely.
  • The agent must select a succinct subset of documents on which to ask the user for feedback.
  • The process of selecting the right questions to ask can be formulated as an active learning problem.

17
What is Learning?
  • A process to find relations between input and output (a mapping)
  • Modeling, estimation, detection, classification, etc.
  • Samples of output values at selected input points are given, say (x1, y1), (x2, y2), (x3, y3), ...
  • Goal of learning: find y = f(x)

18
What is Active Learning?
  • The learning algorithm (learner) actively requests the next sample (asks questions).
  • Sequential sampling (learning)
  • The learner decides where to take the next sample based on observations of past samples, rather than sampling randomly.
  • We will use "active sampling" and "active learning" interchangeably.

19
Why Active Learning?
  • FASTER
  • Learning can be more efficient!
  • Fewer samples may imply less training time.
  • CHEAPER
  • Samples (or labeling samples) are expensive!

20
Active Learning: A Pattern Classification Formulation
  • Let X be a feature space and x ∈ X a feature vector. Each x is associated with one of K labels C = {Ck, 1 ≤ k ≤ K}. The prior probability P[Ck] is the probability that x is associated with label Ck.
  • We want to devise a pattern classifier d(x) using a learning algorithm (agent) that satisfies a performance constraint: P[error] must be within a pre-defined bound. Equivalently, the excess probability of classification error must be bounded by a small positive number ε:
  • P_ece ≤ ε
  • An oracle (the user) will provide training samples {(xi, yi) : xi ∈ X, yi ∈ C} at a cost.
  • The goal of active learning is to minimize this sampling cost (# of labeled samples) subject to the performance constraint P_ece ≤ ε.
  • Conventional pattern classification sampling: the oracle provides both xi and yi; the xi are often randomly sampled within X.
  • Active learning sampling: the agent specifies xi, and the oracle provides the corresponding yi.

21
A one-dimensional formulation
  • Assume a unique decision boundary x* ∈ [0, 1].
  • Define η(x) = P[C2 | x]. η(x) is unknown, but assumed to be non-decreasing on [0, 1], with
  • η(0) < 1/2 = η(x*) < η(1).
  • Each time the agent requests one or a few training samples at a sampling point x ∈ [0, 1], the oracle returns a class label y(x).
  • We assume that the class labels returned when sampling repeatedly at the same value of x obey a binomial distribution with mean η(x).
  • For example, if η(x) = 0.7 and we sample 10 times at x, we observe, on average, the class label C2 7 out of 10 times, with the remainder being C1 (see the simulation sketch below).
  • In practice, repeated sampling at the same point can be replaced by taking multiple samples in a small neighborhood surrounding x.
  • Our goal is to find an estimate of x* such that P_ece ≤ ε while using the smallest number of samples.
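A small simulation of the oracle model above. The particular η(x) below is a hypothetical non-decreasing function, chosen only so that η = 0.7 at the sampling point used; it is not from the slides.

```python
import random

def eta(x):
    """A hypothetical non-decreasing eta(x) = P[C2 | x] with its 1/2 crossing at x* = 0.4."""
    return min(1.0, max(0.0, 0.5 + 1.25 * (x - 0.4)))

def oracle_label(x):
    """Return C2 with probability eta(x) and C1 otherwise (binomial label model)."""
    return "C2" if random.random() < eta(x) else "C1"

labels = [oracle_label(0.56) for _ in range(10)]  # eta(0.56) = 0.7
print(labels.count("C2"), "out of 10 labels are C2 (about 7 on average)")
```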

22
More on the 1-D Task
  • A stochastic search problem:
  • Find x* ∈ [0, 1] such that η(x*) = 1/2. We only know that η(x) is non-decreasing and that η(0) < 1/2 < η(1).
  • Perform experiments at each x, with return y(x) from the oracle: y(x) = 0 if x ∈ C1 and y(x) = 1 if x ∈ C2.
  • But P[x ∈ C2] = η(x) and P[x ∈ C1] = 1 − η(x) are both unknown. Only 0 or 1 is observed for each x, and repeated sampling at the same x may yield different labels!
  • The Robbins-Monro method can be used (a sketch follows below):
  • If x is to the left of x*, y(x) is more likely to be 0, so x should be incremented to move closer to x*.
  • If x is to the right of x*, y(x) is more likely to be 1, so x should be decremented to move closer to x*.
  • It converges in probability, but it is often slow and may need too many samples.
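A sketch of the Robbins-Monro iteration for this task. The step sizes c/n and the specific η(x) are assumptions; the slide gives only the qualitative increment/decrement rule.

```python
import random

def eta(x):
    """Hypothetical non-decreasing eta(x) with eta(x*) = 1/2 at x* = 0.4 (illustration only)."""
    return min(1.0, max(0.0, 0.5 + 1.25 * (x - 0.4)))

def robbins_monro(x0=0.9, n_steps=5000, c=1.0):
    """Stochastic approximation of the root of eta(x) - 1/2 = 0 from noisy 0/1 labels."""
    x = x0
    for n in range(1, n_steps + 1):
        y = 1.0 if random.random() < eta(x) else 0.0  # noisy label y(x) from the oracle
        x -= (c / n) * (y - 0.5)                       # decrement if y = 1, increment if y = 0
        x = min(1.0, max(0.0, x))                      # keep the iterate inside [0, 1]
    return x

print(robbins_monro())  # drifts (slowly, in probability) toward x* = 0.4
```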

23
Overview of a New Approach
  • By sampling repeatedly at the same sampling point x, we can estimate P[η(x) ≤ 1/2] = P[the decision boundary x* is to the right of the current sampling point x].
  • n sampling points partition the interval [0, 1] into n+1 sub-intervals.
  • Combining these probabilities, we can estimate the probability
  • P[x* ∈ a sub-interval]
  • for each sub-interval. This is the p.d.f. of x* over [0, 1].
  • An estimate of x* is the mean of this empirical p.d.f.
  • Question: how do we actively select a new sampling point?
  • Answer: select the next sampling point to minimize the maximum of P_ece. This is a minimax heuristic.
  • Our contribution is a new algorithm that facilitates this minimax active sampling strategy.

24
Estimating P[η ≤ 1/2 | r, n]
  • Assume we observe r ones out of n trials.
  • If we sample at x = 0.1 five times and observe 0, 0, 1, 0, 0, what is the probability that P[C2 | x = 0.1] = η(0.1) > 0.5?
  • From this observed sequence, we derived a formula for
  • P[η ≤ 1/2 | r, n]

[Plot: P[η ≤ 1/2 | r, n] as a function of r and n, for n up to 36.]
25
Details of the Formula
  • Assume p(η) = 1 (a uniform prior on η).
  • No closed-form solution is available.
  • The formula is numerically ill-conditioned due to the subtraction of series terms.
  • Example: n = 6, r = 1,
  • P[η ≤ 1/2 | r, n] = 0.9375,
  • which says that if, in 6 trials, the label 1 (C2) is observed only once, then x is to the left of x* with 93.75% probability (a numerical check follows below).
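One way to reproduce the 0.9375 figure is through the Beta posterior that follows from the uniform prior p(η) = 1. This is a reconstruction and may differ from the series formula actually derived on the slide.

```python
from scipy.stats import beta

def prob_eta_below_half(r, n):
    """P[eta <= 1/2 | r ones in n trials] under a uniform prior on eta:
    the posterior is Beta(r + 1, n - r + 1), so evaluate its CDF at 1/2."""
    return beta.cdf(0.5, r + 1, n - r + 1)

print(prob_eta_below_half(1, 6))  # 0.9375, matching the slide's example
```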

26
Partition of the Interval
  • Assume η(0) < 1/2 and η(1) > 1/2.
  • If P[η(0.5) ≤ 1/2] = 0.3, then
  • P[x* ∈ [0.5, 1]] = 0.3, and
  • P[x* ∈ [0, 0.5]] = 0.7. Next,
  • if P[η(0.25) ≤ 1/2] = 0.2, then
  • P[x* ∈ [0.25, 0.5] | x* ∈ [0, 0.5]] = 0.2
  • P[x* ∈ [0, 0.25] | x* ∈ [0, 0.5]] = 0.8
  • Hence,
  • P[x* ∈ [0.25, 0.5]] = 0.2 × 0.7 = 0.14
  • P[x* ∈ [0, 0.25]] = 0.8 × 0.7 = 0.56
  • This leads to a tree representation (a code sketch follows below).

[Figure: a tree interpretation of the partition of the interval [0, 1] into sub-intervals by the sampling points, with P[x* ∈ sub-interval] computed for each sub-interval.]
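A small sketch of the interval-probability bookkeeping behind the tree, reproducing the numbers above. The function and variable names are assumptions for illustration.

```python
def refine(intervals, x_new, p_right):
    """Split the sub-interval containing x_new.
    p_right = P[eta(x_new) <= 1/2], i.e. the (conditional) probability that the
    true boundary x* lies to the right of x_new within that sub-interval."""
    out = []
    for lo, hi, p in intervals:
        if lo < x_new < hi:
            out.append((lo, x_new, p * (1.0 - p_right)))  # x* to the left of x_new
            out.append((x_new, hi, p * p_right))          # x* to the right of x_new
        else:
            out.append((lo, hi, p))
    return out

intervals = [(0.0, 1.0, 1.0)]
intervals = refine(intervals, 0.5, 0.3)   # -> (0, 0.5): 0.7,  (0.5, 1): 0.3
intervals = refine(intervals, 0.25, 0.2)  # -> (0, 0.25): 0.56, (0.25, 0.5): 0.14, (0.5, 1): 0.3
x_star_estimate = sum(p * (lo + hi) / 2 for lo, hi, p in intervals)  # mean of the empirical p.d.f.
print(intervals, x_star_estimate)
```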
27
Upper Bound over Multiple Intervals
  • But since η(x) is unknown, P_excess can be further simplified to an upper bound over the sub-intervals; the resulting objective is the piece-wise linear function F(x) minimized on the next slide. (The detailed expressions and weights from this slide did not survive the transcript.)

28
Minimax Active Sampling Strategy
  • Given the sampling points xi and the associated interval probabilities, choose x such that F(x) is minimized.
  • F(x) is a piece-wise linear function, so the solution must occur at one of the centers of the sub-intervals, (xi + xi+1)/2 (a selection sketch follows below).
  • P_ece can be estimated using the previous upper bounds to verify whether P_ece ≤ ε is satisfied.
  • Open issues:
  • How many samples should be taken repeatedly at each sampling point?
  • Is the total number of sampling points minimized?
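Because F(x) itself is not reproduced in the transcript, the sketch below only encodes the property stated above: the minimizer must be one of the sub-interval midpoints, so it evaluates a caller-supplied F at those midpoints. The interface is hypothetical, not the authors' implementation.

```python
def next_sampling_point(sample_points, F):
    """Enumerate the midpoints (x_i + x_{i+1})/2 of the sub-intervals of [0, 1]
    induced by the current sampling points and return the one minimizing F."""
    xs = sorted({0.0, 1.0, *sample_points})
    midpoints = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    return min(midpoints, key=F)

# Toy usage with an arbitrary placeholder objective.
print(next_sampling_point([0.25, 0.5], F=lambda x: (x - 0.3) ** 2))
```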

29
Future Work
  • Refine the results
  • Relax the assumptions of constant p(x), p(η), etc.
  • Theoretical bounds: what is the minimum number of samples for a given task? Is this the correct question to ask?
  • What if the sampling points are pre-determined, and only the labels are to be queried?
  • Higher-dimensional generalization
  • Relations to other methods
  • Importance sampling, design of experiments
  • Query learning
  • Applications to real-world problems
  • Internet information retrieval
  • Digital libraries