Transcript and Presenter's Notes
Title: Active Learning for Internet Information Retrieval
1
Active Learning for Internet Information Retrieval
  • Yu Hen Hu
  • University of Wisconsin-Madison
  • Dept. Electrical and Computer Engineering
  • hu@engr.wisc.edu

Part of this work was done in collaboration with Partha Niyogi and B. H. (Fred) Juang (Bell Labs)
2
Outline
  • Internet Information Retrieval
  • Pattern Classification
  • Active learning
  • An introduction to active learning
  • Minimax active learning strategy

3
Internet Information Retrieval
[Diagram: distributed sources of documents (web pages and others) connect through the Internet to search/index engines and a user interface.]
4
Internet Search: State of the Art
Too many results?
You want information about repairing portable computers. Instead, you get portable machine tools!
5
Issues in Internet Search
  • Precision
  • WYGIWYW: What you get is what you want?
  • What fraction of the retrieved documents are relevant?
  • Recall
  • Are all the relevant documents found in this search?
  • Efficiency
  • How soon can the desired search results be obtained?

6
Internet Document Retrieval Process
[Flow diagram: a web-bot searches the web to index web documents. Text and web documents are broken into words, filtered through a stoplist, stemmed, feature-normalized, and noise-reduced before entering the database (indexing, Boolean search, etc.). On the user side, query pre-processing, result presentation, and relevance feedback connect the user to the database through an intelligent active-learning user interface.]
7
Information Retrieval as Pattern Classification
  • Given a set of feature vectors x_i, each representing a document.
  • A user provides a query, x, also in the form of a feature vector.
  • Using x as a prototype, for a given error bound ε, label a document as relevant if its feature vector satisfies
  • d(x_i, x) ≤ ε
  • Otherwise, the document is labeled as irrelevant (a minimal code sketch follows below).
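As a rough illustration of the labeling rule above, the following Python sketch labels each document by thresholding its distance to the query prototype. The function name, the toy vectors, and the Euclidean distance are assumptions for illustration; the slides leave d(·,·) unspecified.

```python
import numpy as np

def label_documents(doc_vectors, query, eps):
    """Label each document relevant if d(x_i, x) <= eps, irrelevant otherwise.
    Euclidean distance is assumed here; the slides do not fix the metric."""
    return ["relevant" if np.linalg.norm(x_i - query) <= eps else "irrelevant"
            for x_i in doc_vectors]

# Toy usage with made-up 2-D term vectors and an arbitrary error bound.
docs = [np.array([0.9, 0.1]), np.array([0.2, 0.8]), np.array([0.85, 0.2])]
print(label_documents(docs, query=np.array([1.0, 0.0]), eps=0.5))
```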

8
Feature Vector: Document Term Vector
  • Relative term frequency (within a document):
  • tf(t, d) = (count of term t in d) / (# of terms in document d)
  • Inverse document frequency:
  • df(t) = (total # of documents) / (# of documents containing t)
  • Weighted term frequency:
  • d_t = tf(t, d) · log df(t)
  • Document term vector: D = [d_1, d_2, ...]

9
Term Vector Example
  • Document 1: The weather is great these days.
  • Document 2: These are great ideas.
  • Document 3: You look great.
  • Eliminate the stoplist words "the", "is", "these", "are", "you" (a worked sketch follows below).
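A minimal Python sketch of the term-vector computation from the previous slide, applied to these three documents. The natural logarithm and the variable names are assumptions; the slides do not specify the log base.

```python
import math
from collections import Counter

docs = [
    "The weather is great these days",
    "These are great ideas",
    "You look great",
]
stoplist = {"the", "is", "these", "are", "you"}
docs = [[t for t in d.lower().split() if t not in stoplist] for d in docs]

n_docs = len(docs)
vocab = sorted({t for d in docs for t in d})
n_containing = {t: sum(t in d for d in docs) for t in vocab}  # # of documents containing t

def term_vector(d):
    """d_t = tf(t, d) * log df(t), with tf(t, d) the relative term frequency in d
    and df(t) = (total # of documents) / (# of documents containing t)."""
    counts = Counter(d)
    return [counts[t] / len(d) * math.log(n_docs / n_containing[t]) for t in vocab]

for d in docs:
    print(dict(zip(vocab, term_vector(d))))
```

Note that "great", which appears in all three documents, gets weight log(3/3) = 0, so it carries no discriminating information.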

10
Pattern Classification
  • Let x be a feature vector and {Ck, 1 ≤ k ≤ K} be K class labels. We assume x has a mixture probability distribution p(x) = Σk P[Ck] p(x | Ck).
  • A pattern classifier is a decision function d(x) ∈ {Ck, 1 ≤ k ≤ K} that maps each x to a class label.
  • Thus, d(x) partitions the feature space X into disjoint regions Rk = {x : x ∈ X, d(x) = Ck}.

11
Bayes Rule and Posterior Probability
  • P[Ck] = P[x ∈ Ck]: prior probability
  • p(x | Ck): likelihood function (class-conditional probability)
  • P[Ck | x]: posterior probability
  • To minimize P[error], one must maximize P[Ck | x] for each x.
  • MAP (maximum posterior probability) classifier:
  • d_MAP(x) = Ck s.t.
  • P[Ck | x] ≥ P[Ck' | x] for all k' ≠ k
  • Bayes rule: P[Ck | x] = p(x | Ck) P[Ck] / p(x)
  • Decision boundary:
  • Finding d(x) is equivalent to finding the decision boundaries.
  • If K = 2, η(x) = P[C2 | x] = 1/2 at the decision boundary (a toy sketch follows below).
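A toy MAP classifier in Python. The two Gaussian class-conditional densities and the priors below are assumptions chosen purely for illustration; they are not taken from the slides.

```python
from scipy.stats import norm

# Hypothetical two-class setup: priors and Gaussian likelihoods assumed for illustration.
priors = {"C1": 0.6, "C2": 0.4}
likelihoods = {"C1": norm(loc=0.0, scale=1.0), "C2": norm(loc=2.0, scale=1.0)}

def d_map(x):
    """MAP classifier: pick the class maximizing P[Ck | x] via Bayes rule.
    The evidence p(x) is common to all classes and can be dropped from the comparison."""
    scores = {k: priors[k] * likelihoods[k].pdf(x) for k in priors}
    return max(scores, key=scores.get)

print(d_map(0.5), d_map(1.8))  # the boundary sits where P[C2 | x] crosses 1/2
```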

12
Probability of Mis-Classification
  • The quality of a pattern classifier is often
    determined by the probability of
    mis-classification

[Figure: the class-conditional curves P[C1] p(x | C1) and P[C2] p(x | C2) plotted against x, with decision threshold x0 separating R1 = {x : d(x) = C1} from R2 = {x : d(x) = C2}.]
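The expression for the probability of misclassification itself did not survive the transcript. A reconstruction of the standard two-class form suggested by the figure, not copied from the slide, is:

```latex
P[\mathrm{error}]
  = P[C_1]\int_{R_2} p(x \mid C_1)\,dx + P[C_2]\int_{R_1} p(x \mid C_2)\,dx
```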
13
Excess Probability of Mis-classification
[Figure: the excess probability of misclassification P_ece is the additional error incurred when the estimated decision boundary x̂ differs from the optimal decision boundary x*; the optimal boundary achieves the Bayes error P*.]
14
Upper Bound of P_ece
Assume p(x) = 1 (a uniform density on [0, 1]).
η(x) = P[C2 | x] is unknown.
P_ece ≤ (upper-bound expression; a hedged reconstruction follows below)
  • For 1-D, K = 2 problems, the above upper bound can be used to estimate P_ece.
  • From an excess probability of misclassification point of view, the error in estimating x* is not the only concern; the slope of P[C2 | x] near x* also matters.

[Figure: a posterior probability interpretation. η(x) is plotted against x and crosses the level 1/2 at the optimal boundary x*.]
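The bound referred to above is also missing from the transcript. Under the stated assumptions (p(x) = 1 on [0, 1], a single boundary, K = 2), a standard reconstruction of the excess error of an estimated boundary x̂ relative to the optimal boundary x* is:

```latex
P_{ece}
  = \int_{\min(x^*,\hat{x})}^{\max(x^*,\hat{x})} \lvert 2\eta(x) - 1 \rvert \, dx
  \;\le\; \lvert \hat{x} - x^* \rvert
```

So both the estimation error |x̂ − x*| and the slope of η(x) = P[C2 | x] near x* determine how large P_ece actually is.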
15
Active Learning Agent for Internet Search
  • A mediator (agent) between the human user and the search engine, performing tasks to:
  • Interpret human queries,
  • Organize search results,
  • Solicit relevance feedback from users (asking questions).

16
Relevance Feedback and Active Learning
  • Relevance feedback
  • After a search using the initial query provided by the user, the IR agent presents examples of retrieved documents and asks the user to label each of them as relevant or irrelevant.
  • Use of relevance feedback
  • Based on the user's feedback, the query can be refined to perform a more focused search, to prune out more irrelevant documents, and to retrieve more relevant documents like those the user specified.
  • Relation to active learning
  • Too many questions will quickly bore the user, who may abandon the search prematurely.
  • The agent must select a succinct subset of documents on which to ask the user for feedback.
  • The process of selecting the right questions to ask can be formulated as an active learning problem.

17
What is Learning?
  • A process to find relations between input and output (a mapping)
  • Modeling, estimation, detection, classification, etc.
  • Samples of output values at selected input points are given, say (x1, y1), (x2, y2), (x3, y3), ...
  • Goal of learning: find y = f(x)

18
What is Active Learning?
  • The learning algorithm (learner) actively requests the next sample (asks questions).
  • Sequential sampling (learning)
  • The learner decides where to take the next sample based on observations of past samples, rather than sampling randomly.
  • We will use "active sampling" and "active learning" interchangeably.

19
Why Active Learning?
  • FASTER
  • Learning can be more efficient!
  • Fewer samples may imply less training time.
  • CHEAPER
  • Samples (or labeling samples) are expensive!

20
Active Learning: A Pattern Classification Formulation
  • Let X be a feature space and x ∈ X a feature vector. Each x is associated with one of K labels C = {Ck, 1 ≤ k ≤ K}. The prior probability P[Ck] is the probability that x is associated with label Ck.
  • We want to devise a pattern classifier d(x) using a learning algorithm (agent) that satisfies a performance constraint: P[error] must be within a pre-defined bound. Equivalently, the excess probability of classification error must be bounded by a small positive number ε:
  • P_ece ≤ ε
  • An oracle (the user) will provide training samples {(xi, yi) : xi ∈ X, yi ∈ C} at a cost.
  • The goal of active learning is to minimize this sampling cost (# of labeled samples) subject to the performance constraint P_ece ≤ ε.
  • Conventional pattern classification sampling: the oracle provides both xi and yi; the xi are often randomly sampled within X.
  • Active learning sampling: the agent specifies xi, and the oracle provides the corresponding yi.

21
A one-dimensional formulation
  • Assume a unique decision boundary x* ∈ [0, 1].
  • Define η(x) = P[C2 | x]. η(x) is unknown, but assumed to be non-decreasing on [0, 1], with
  • η(0) < 1/2 = η(x*) < η(1).
  • Each time the agent requests one or a few training samples at a sampling point x ∈ [0, 1], the oracle returns a class label y(x).
  • We assume that the class labels returned when sampling repeatedly at the same value of x obey a binomial distribution with mean η(x).
  • For example, if η(x) = 0.7 and we sample 10 times at x, we observe, on average, the class label C2 7 out of 10 times, with the remainder being C1 (see the simulation sketch below).
  • In practice, repeated sampling at the same point can be replaced by taking multiple samples in a small neighborhood surrounding x.
  • Our goal is to find an estimate of x* such that P_ece ≤ ε while using the smallest number of samples.
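A small simulation of the oracle model above. The particular η(x) below is a hypothetical non-decreasing function, chosen only so that η = 0.7 at the sampling point used; it is not from the slides.

```python
import random

def eta(x):
    """A hypothetical non-decreasing eta(x) = P[C2 | x] with its 1/2 crossing at x* = 0.4."""
    return min(1.0, max(0.0, 0.5 + 1.25 * (x - 0.4)))

def oracle_label(x):
    """Return C2 with probability eta(x) and C1 otherwise (binomial label model)."""
    return "C2" if random.random() < eta(x) else "C1"

labels = [oracle_label(0.56) for _ in range(10)]  # eta(0.56) = 0.7
print(labels.count("C2"), "out of 10 labels are C2 (about 7 on average)")
```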

22
More on the 1-D Task
  • A stochastic search problem:
  • Find x* ∈ [0, 1] such that η(x*) = 1/2. We only know that η(x) is non-decreasing and that η(0) < 1/2 < η(1).
  • Perform experiments at each x, with return y(x) from the oracle: y(x) = 0 if x ∈ C1 and y(x) = 1 if x ∈ C2.
  • But P[x ∈ C2] = η(x) and P[x ∈ C1] = 1 − η(x) are both unknown. Only 0 or 1 is observed for each x, and repeated sampling at the same x may yield different labels!
  • The Robbins-Monro method can be used (a sketch follows below):
  • If x is to the left of x*, y(x) is more likely to be 0, so x should be incremented to move closer to x*.
  • If x is to the right of x*, y(x) is more likely to be 1, so x should be decremented to move closer to x*.
  • It converges in probability, but it is often slow and may need too many samples.
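A sketch of the Robbins-Monro iteration for this task. The step sizes c/n and the specific η(x) are assumptions; the slide gives only the qualitative increment/decrement rule.

```python
import random

def eta(x):
    """Hypothetical non-decreasing eta(x) with eta(x*) = 1/2 at x* = 0.4 (illustration only)."""
    return min(1.0, max(0.0, 0.5 + 1.25 * (x - 0.4)))

def robbins_monro(x0=0.9, n_steps=5000, c=1.0):
    """Stochastic approximation of the root of eta(x) - 1/2 = 0 from noisy 0/1 labels."""
    x = x0
    for n in range(1, n_steps + 1):
        y = 1.0 if random.random() < eta(x) else 0.0  # noisy label y(x) from the oracle
        x -= (c / n) * (y - 0.5)                       # decrement if y = 1, increment if y = 0
        x = min(1.0, max(0.0, x))                      # keep the iterate inside [0, 1]
    return x

print(robbins_monro())  # drifts (slowly, in probability) toward x* = 0.4
```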

23
Overview of a New Approach
  • By sampling repeatedly at the same sampling point x, we can estimate P[η(x) ≤ 1/2] = P[the decision boundary x* is to the right of the current sampling point x].
  • n sampling points partition the interval [0, 1] into n+1 sub-intervals.
  • Combining these probabilities, we can estimate the probability
  • P[x* ∈ a sub-interval]
  • for each sub-interval. This is the p.d.f. of x* over [0, 1].
  • An estimate of x* is the mean of this empirical p.d.f.
  • Question: how do we actively select a new sampling point?
  • Answer: select the next sampling point to minimize the maximum of P_ece. This is a minimax heuristic.
  • Our contribution is a new algorithm that facilitates this minimax active sampling strategy.

24
Estimating P[η ≤ 1/2 | r, n]
  • Assume we observe r ones out of n trials.
  • If we sample at x = 0.1 five times and observe 0, 0, 1, 0, 0, what is the probability that P[C2 | x = 0.1] = η(0.1) > 0.5?
  • From this observed sequence, we derived a formula for
  • P[η ≤ 1/2 | r, n]

[Plot: P[η ≤ 1/2 | r, n] as a function of r and n, for n up to 36.]
25
Details of the Formula
  • Assume p(η) = 1 (a uniform prior on η).
  • No closed-form solution is available.
  • The formula is numerically ill-conditioned due to the subtraction of series terms.
  • Example: n = 6, r = 1,
  • P[η ≤ 1/2 | r, n] = 0.9375,
  • which says that if, in 6 trials, the label 1 (C2) is observed only once, then x is to the left of x* with 93.75% probability (a numerical check follows below).
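One way to reproduce the 0.9375 figure is through the Beta posterior that follows from the uniform prior p(η) = 1. This is a reconstruction and may differ from the series formula actually derived on the slide.

```python
from scipy.stats import beta

def prob_eta_below_half(r, n):
    """P[eta <= 1/2 | r ones in n trials] under a uniform prior on eta:
    the posterior is Beta(r + 1, n - r + 1), so evaluate its CDF at 1/2."""
    return beta.cdf(0.5, r + 1, n - r + 1)

print(prob_eta_below_half(1, 6))  # 0.9375, matching the slide's example
```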

26
Partition of the Interval
  • Assume η(0) < 1/2 and η(1) > 1/2.
  • If P[η(0.5) ≤ 1/2] = 0.3, then
  • P[x* ∈ [0.5, 1]] = 0.3, and
  • P[x* ∈ [0, 0.5]] = 0.7. Next,
  • if P[η(0.25) ≤ 1/2] = 0.2, then
  • P[x* ∈ [0.25, 0.5] | x* ∈ [0, 0.5]] = 0.2
  • P[x* ∈ [0, 0.25] | x* ∈ [0, 0.5]] = 0.8
  • Hence,
  • P[x* ∈ [0.25, 0.5]] = 0.2 × 0.7 = 0.14
  • P[x* ∈ [0, 0.25]] = 0.8 × 0.7 = 0.56
  • This leads to a tree representation (a code sketch follows below).

[Figure: a tree interpretation of the partition of the interval [0, 1] into sub-intervals by the sampling points, with P[x* ∈ sub-interval] computed for each sub-interval.]
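A small sketch of the interval-probability bookkeeping behind the tree, reproducing the numbers above. The function and variable names are assumptions for illustration.

```python
def refine(intervals, x_new, p_right):
    """Split the sub-interval containing x_new.
    p_right = P[eta(x_new) <= 1/2], i.e. the (conditional) probability that the
    true boundary x* lies to the right of x_new within that sub-interval."""
    out = []
    for lo, hi, p in intervals:
        if lo < x_new < hi:
            out.append((lo, x_new, p * (1.0 - p_right)))  # x* to the left of x_new
            out.append((x_new, hi, p * p_right))          # x* to the right of x_new
        else:
            out.append((lo, hi, p))
    return out

intervals = [(0.0, 1.0, 1.0)]
intervals = refine(intervals, 0.5, 0.3)   # -> (0, 0.5): 0.7,  (0.5, 1): 0.3
intervals = refine(intervals, 0.25, 0.2)  # -> (0, 0.25): 0.56, (0.25, 0.5): 0.14, (0.5, 1): 0.3
x_star_estimate = sum(p * (lo + hi) / 2 for lo, hi, p in intervals)  # mean of the empirical p.d.f.
print(intervals, x_star_estimate)
```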
27
Upper Bound over Multiple Intervals
  • But since η(x) is unknown, P_excess can be further simplified to an upper bound over the sub-intervals; the resulting objective is the piece-wise linear function F(x) minimized on the next slide. (The detailed expressions and weights from this slide did not survive the transcript.)

28
Minimax Active Sampling Strategy
  • Given the sampling points xi and the associated interval probabilities, choose x such that F(x) is minimized.
  • F(x) is a piece-wise linear function, so the solution must occur at one of the centers of the sub-intervals, (xi + xi+1)/2 (a selection sketch follows below).
  • P_ece can be estimated using the previous upper bounds to verify whether P_ece ≤ ε is satisfied.
  • Open issues:
  • How many samples should be taken repeatedly at each sampling point?
  • Is the total number of sampling points minimized?
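Because F(x) itself is not reproduced in the transcript, the sketch below only encodes the property stated above: the minimizer must be one of the sub-interval midpoints, so it evaluates a caller-supplied F at those midpoints. The interface is hypothetical, not the authors' implementation.

```python
def next_sampling_point(sample_points, F):
    """Enumerate the midpoints (x_i + x_{i+1})/2 of the sub-intervals of [0, 1]
    induced by the current sampling points and return the one minimizing F."""
    xs = sorted({0.0, 1.0, *sample_points})
    midpoints = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    return min(midpoints, key=F)

# Toy usage with an arbitrary placeholder objective.
print(next_sampling_point([0.25, 0.5], F=lambda x: (x - 0.3) ** 2))
```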

29
Future Work
  • Refine the results
  • Relax the assumptions of constant p(x), p(η), etc.
  • Theoretical bounds: what is the minimum number of samples for a given task? Is this the correct question to ask?
  • What if the sampling points are pre-determined, and only the labels are to be queried?
  • Higher-dimensional generalization
  • Relations to other methods
  • Importance sampling, design of experiments
  • Query learning
  • Applications to real-world problems
  • Internet information retrieval
  • Digital libraries