Search Engine Technology


1
Search Engine Technology (5/6)
http://www.cs.columbia.edu/~radev/SET07.html
  • October 4 and 11, 2007
  • Prof. Dragomir R. Radev
  • radev@umich.edu

2
Final projects
  • Two formats
  • A software system that performs a specific
    search-engine related task. We will create a web
    page with all such code and make it available to
    the IR community.
  • A research experiment documented in the form of a
    paper. Look at the proceedings of the SIGIR, WWW,
    or ACL conferences for a sample format. I will
    encourage the authors of the most successful
    papers to consider submitting them to one of the
    IR-related conferences.
  • Deliverables:
  • System (code, documentation, examples) or Paper
    (plus code and data)
  • Poster (to be presented in class)
  • Web page that describes the project.

3
SET Fall 2007
9. Text classification: Naïve Bayesian
classifiers, decision trees
4
Introduction
  • Text classification: assigning documents to
    predefined categories (topics, languages, users)
  • A given set of classes C
  • Given x, determine its class in C
  • Hierarchical vs. flat
  • Overlapping (soft) vs. non-overlapping (hard)

5
Introduction
  • Ideas: manual classification using rules (e.g.,
    Columbia AND University → Education; Columbia AND
    South Carolina → Geography)
  • Popular techniques: generative (kNN, Naïve Bayes)
    vs. discriminative (SVM, regression)
  • Generative: model the joint probability p(x,y) and use
    Bayesian prediction to compute p(y|x)
  • Discriminative: model p(y|x) directly.

6
Bayes formula
Full probability
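The formulas for this slide were not captured in the transcript; the standard statements (in LaTeX), which the drug-test example on the next slide uses, are:

    P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
    \qquad \text{(Bayes' formula)}

    P(B) = \sum_i P(B \mid A_i)\, P(A_i)
    \qquad \text{(full probability, for a partition } A_1, A_2, \ldots\text{)}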
7
Example (performance-enhancing drug)
  • Drug (D) with values y/n
  • Test (T) with values +/-
  • P(D=y) = 0.001
  • P(T=+|D=y) = 0.8
  • P(T=+|D=n) = 0.01
  • Given: an athlete tests positive
  • P(D=y|T=+) = ?
  • P(T=+|D=y)P(D=y) / (P(T=+|D=y)P(D=y) + P(T=+|D=n)P(D=n))
    = (0.8 × 0.001) / (0.8 × 0.001 + 0.01 × 0.999) ≈ 0.074 (checked below)
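As a quick check of the slide's arithmetic, a minimal Python sketch using only the values given above:

    # Values from the slide
    p_d = 0.001               # P(D=y): prior probability of drug use
    p_pos_given_d = 0.8       # P(T=+ | D=y)
    p_pos_given_not_d = 0.01  # P(T=+ | D=n)

    # Full probability of a positive test
    p_pos = p_pos_given_d * p_d + p_pos_given_not_d * (1 - p_d)

    # Bayes' formula
    p_d_given_pos = p_pos_given_d * p_d / p_pos
    print(round(p_d_given_pos, 3))   # 0.074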

8
Naïve Bayesian classifiers
  • Naïve Bayesian classifier (formula below)
  • Assuming statistical independence of the features
  • Features: typically words (or phrases)
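The classifier formula itself did not survive the transcript; the usual form (in LaTeX), combining the class prior with the independence assumption over features x_1, ..., x_n, is:

    \hat{c} = \arg\max_{c \in C} P(c) \prod_{j=1}^{n} P(x_j \mid c)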

9
Example
  • p(well) = 0.9, p(cold) = 0.05, p(allergy) = 0.05
  • p(sneeze|well) = 0.1
  • p(sneeze|cold) = 0.9
  • p(sneeze|allergy) = 0.9
  • p(cough|well) = 0.1
  • p(cough|cold) = 0.8
  • p(cough|allergy) = 0.7
  • p(fever|well) = 0.01
  • p(fever|cold) = 0.7
  • p(fever|allergy) = 0.4

Example from Ray Mooney
10
Example (contd)
  • Features: sneeze, cough, no fever
  • P(well|e) = (.9)(.1)(.1)(.99) / p(e) = 0.0089/p(e)
  • P(cold|e) = (.05)(.9)(.8)(.3) / p(e) = 0.01/p(e)
  • P(allergy|e) = (.05)(.9)(.7)(.6) / p(e) = 0.019/p(e)
  • P(e) = 0.0089 + 0.01 + 0.019 = 0.0379
  • P(well|e) = .23
  • P(cold|e) = .26
  • P(allergy|e) = .50 (reproduced in the sketch below)

Example from Ray Mooney
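A minimal Python sketch that reproduces these numbers (probabilities copied from the previous slide; an absent feature contributes 1 - p):

    # Priors and conditionals from the previous slide
    priors = {'well': 0.9, 'cold': 0.05, 'allergy': 0.05}
    cond = {
        'sneeze': {'well': 0.1,  'cold': 0.9, 'allergy': 0.9},
        'cough':  {'well': 0.1,  'cold': 0.8, 'allergy': 0.7},
        'fever':  {'well': 0.01, 'cold': 0.7, 'allergy': 0.4},
    }
    evidence = {'sneeze': True, 'cough': True, 'fever': False}

    # Unnormalized scores: P(class) * product over features
    scores = {}
    for c, prior in priors.items():
        s = prior
        for f, present in evidence.items():
            s *= cond[f][c] if present else 1 - cond[f][c]
        scores[c] = s

    p_e = sum(scores.values())   # ~0.038
    for c in scores:
        print(c, round(scores[c] / p_e, 2))
    # well 0.23, cold 0.28, allergy 0.49 (the slide rounds the cold term
    # to 0.01 before normalizing, which yields its 0.23 / 0.26 / 0.50)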
11
Issues with NB
  • Where do we get the prior values? Use
    maximum likelihood estimation (N_i/N)
  • Same for the conditionals: these are based on a
    multinomial generator, and the MLE estimator is
    (T_ji / Σ_i T_ji)
  • Smoothing is needed (why?)
  • Laplace smoothing: (T_ji + 1) / Σ_i (T_ji + 1)
  • Implementation: how to avoid floating-point
    underflow? (see the sketch below)
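A minimal sketch of a multinomial Naïve Bayes trainer with Laplace smoothing and log-space scoring; names and structure are my own, not the course's code:

    import math
    from collections import Counter

    def train_nb(docs, labels):
        """docs: list of token lists; labels: parallel list of class names."""
        classes = sorted(set(labels))
        prior = {c: labels.count(c) / len(labels) for c in classes}   # N_i / N
        counts = {c: Counter() for c in classes}
        for tokens, y in zip(docs, labels):
            counts[y].update(tokens)
        vocab = {t for tokens in docs for t in tokens}
        cond = {}
        for c in classes:
            total = sum(counts[c].values())
            # Laplace smoothing: (T_ji + 1) / sum_i (T_ji + 1)
            cond[c] = {t: (counts[c][t] + 1) / (total + len(vocab))
                       for t in vocab}
        return prior, cond, vocab

    def classify(tokens, prior, cond, vocab):
        # Summing logs instead of multiplying probabilities avoids underflow
        scores = {}
        for c in prior:
            scores[c] = math.log(prior[c]) + sum(
                math.log(cond[c][t]) for t in tokens if t in vocab)
        return max(scores, key=scores.get)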

12
Spam recognition
Return-Path: <ig_esq@rediffmail.com>
X-Sieve: CMU Sieve 2.2
From: "Ibrahim Galadima" <ig_esq@rediffmail.com>
Reply-To: galadima_esq@netpiper.com
To: webmaster@aclweb.org
Date: Tue, 14 Jan 2003 21:06:26 -0800
Subject: Gooday

DEAR SIR

FUNDS FOR INVESTMENTS

THIS LETTER MAY COME TO YOU AS A SURPRISE SINCE I
HAD NO PREVIOUS CORRESPONDENCE WITH YOU. I AM THE
CHAIRMAN, TENDER BOARD OF INDEPENDENT NATIONAL
ELECTORAL COMMISSION (INEC). I GOT YOUR CONTACT IN
THE COURSE OF MY SEARCH FOR A RELIABLE PERSON
WITH WHOM TO HANDLE A VERY CONFIDENTIAL TRANSACTION
INVOLVING THE TRANSFER OF FUND VALUED
AT TWENTY ONE MILLION SIX HUNDRED THOUSAND UNITED
STATES DOLLARS (US$20M) TO A SAFE FOREIGN
ACCOUNT. THE ABOVE FUND IN QUESTION IS NOT
CONNECTED WITH ARMS, DRUGS OR MONEY LAUNDERING. IT
IS A PRODUCT OF OVER-INVOICED CONTRACT AWARDED IN
1999 BY INEC TO A
13
SpamAssassin
  • http://spamassassin.apache.org/
  • http://spamassassin.apache.org/tests_3_1_x.html

14
Feature selection: the χ2 test
  • For a term t:
  • C = class, I_t = feature
  • Testing for independence: P(C=0, I_t=0) should be
    equal to P(C=0) P(I_t=0)
  • P(C=0) = (k00 + k01)/n
  • P(C=1) = 1 - P(C=0) = (k10 + k11)/n
  • P(I_t=0) = (k00 + k10)/n
  • P(I_t=1) = 1 - P(I_t=0) = (k01 + k11)/n

15
Feature selection: the χ2 test
  • High values of χ2 indicate lower belief in
    independence.
  • In practice, compute χ2 for all words and pick
    the top k among them (see the sketch below).
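A sketch of the χ2 computation for one term, using the k_ij counts defined on the previous slide (first index = class, second = feature value):

    def chi_square(k00, k01, k10, k11):
        """Chi-square statistic for the 2x2 class/term contingency table."""
        n = k00 + k01 + k10 + k11
        chi2 = 0.0
        # (observed, row total, column total) for each cell
        cells = [(k00, k00 + k01, k00 + k10),
                 (k01, k00 + k01, k01 + k11),
                 (k10, k10 + k11, k00 + k10),
                 (k11, k10 + k11, k01 + k11)]
        for observed, row, col in cells:
            expected = row * col / n        # n * P(row) * P(col)
            chi2 += (observed - expected) ** 2 / expected
        return chi2

    # Rank all terms by chi_square(...) and keep the top k as features.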

16
Feature selection: mutual information
  • No document length scaling is needed
  • Documents are assumed to be generated according
    to the multinomial model
  • Measures amount of information: if the
    distribution is the same as the background
    distribution, then MI = 0
  • X = word, Y = class (formula below)
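The MI formula itself is not in the transcript; the standard definition (in LaTeX), which is zero exactly when X and Y are independent, is:

    I(X;Y) = \sum_{x}\sum_{y} P(x,y)\,\log\frac{P(x,y)}{P(x)\,P(y)}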

17
Well-known datasets
  • 20 newsgroups
  • http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/
  • Reuters-21578
  • http://www.daviddlewis.com/resources/testcollections/reuters21578/
  • Categories: grain, acquisitions, corn, crude, wheat,
    trade
  • WebKB
  • http://www-2.cs.cmu.edu/~webkb/
  • Classes: course, student, faculty, staff, project, dept,
    other
  • NB performance (2000), per class in the order above:
  • P: 26, 43, 18, 6, 13, 2, 94
  • R: 83, 75, 77, 9, 73, 100, 35

18
Evaluation of text classification
  • Macroaveraging: average the per-class measures over classes
  • Microaveraging: compute the measure from the pooled
    contingency table (both are sketched below)
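A minimal sketch of the two averaging schemes for F1; per-class counts are assumed to be given as (TP, FP, FN) triples:

    def f1(tp, fp, fn):
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        return 2 * p * r / (p + r) if p + r else 0.0

    def macro_micro_f1(per_class):
        """per_class: list of (tp, fp, fn) triples, one per class."""
        # Macroaveraging: average the per-class scores
        macro = sum(f1(*c) for c in per_class) / len(per_class)
        # Microaveraging: pool the contingency tables, then score once
        tp, fp, fn = (sum(c[i] for c in per_class) for i in range(3))
        return macro, f1(tp, fp, fn)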

19
Vector space classification
[Figure: documents from topic1 and topic2 plotted in a
two-dimensional feature space (x1, x2)]
20
Decision surfaces
[Figure: a decision surface separating topic1 from topic2
in the (x1, x2) feature space]
21
Decision trees
[Figure: decision-tree boundaries separating topic1 from
topic2 in the (x1, x2) feature space]
22
Classification using decision trees
  • Expected information need:
  • I(s1, s2, ..., sm) = -Σ_i p_i log2(p_i), where p_i = s_i/s
  • s = data samples
  • m = number of classes
23
(No Transcript)
24
Decision tree induction
  • I(s1, s2) = I(9, 5) = -(9/14) log2(9/14) - (5/14) log2(5/14)
    = 0.940

25
Entropy and information gain
  • E(A) = Σ_j [(s_1j + ... + s_mj)/s] · I(s_1j, ..., s_mj)
  • Entropy: expected information based on the
    partitioning into subsets by attribute A
  • Gain(A) = I(s1, s2, ..., sm) - E(A)
26
Entropy
  • Age < 30: s11 = 2, s21 = 3, I(s11, s21) = 0.971
  • Age in 31..40: s12 = 4, s22 = 0, I(s12, s22) = 0
  • Age > 40: s13 = 3, s23 = 2, I(s13, s23) = 0.971

27
Entropy (contd)
  • E(age) = (5/14) I(s11, s21) + (4/14) I(s12, s22)
    + (5/14) I(s13, s23) = 0.694
  • Gain(age) = I(s1, s2) - E(age) = 0.246
  • Gain(income) = 0.029, Gain(student) = 0.151,
    Gain(credit) = 0.048 (the age numbers are reproduced below)
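The entropy and gain values above can be reproduced in a few lines of Python (counts taken from the slides):

    import math

    def info(*counts):
        """Expected information I(s1, ..., sm) = -sum p_i log2 p_i."""
        s = sum(counts)
        return -sum(c / s * math.log2(c / s) for c in counts if c)

    i_root = info(9, 5)                                   # 0.940
    e_age = (5/14) * info(2, 3) + (4/14) * info(4, 0) \
            + (5/14) * info(3, 2)                         # 0.694
    gain_age = i_root - e_age   # ≈ 0.247 (the slide reports 0.246,
                                # i.e., the rounded 0.940 - 0.694)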

28
Final decision tree
[Decision tree: the root splits on age.
  age < 30    → split on student: no → no, yes → yes
  age 31..40  → yes
  age > 40    → split on credit: excellent → no, fair → yes]
29
Other techniques
  • Bayesian classifiers
  • X: age < 30, income = medium, student = yes,
    credit = fair
  • P(yes) = 9/14 = 0.643
  • P(no) = 5/14 = 0.357

30
Example
  • P(age < 30 | yes) = 2/9 = 0.222
  • P(age < 30 | no) = 3/5 = 0.600
  • P(income = medium | yes) = 4/9 = 0.444
  • P(income = medium | no) = 2/5 = 0.400
  • P(student = yes | yes) = 6/9 = 0.667
  • P(student = yes | no) = 1/5 = 0.200
  • P(credit = fair | yes) = 6/9 = 0.667
  • P(credit = fair | no) = 2/5 = 0.400

31
Example (contd)
  • P(X|yes) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
  • P(X|no) = 0.600 × 0.400 × 0.200 × 0.400 = 0.019
  • P(X|yes) P(yes) = 0.044 × 0.643 = 0.028
  • P(X|no) P(no) = 0.019 × 0.357 = 0.007
  • Answer: yes or no?

32
SET Fall 2007
10. Linear classifiers; kernel methods;
support vector machines
33
Linear boundary
[Figure: a linear boundary separating topic1 from topic2
in the (x1, x2) feature space]
34
Vector space classifiers
  • Using centroids
  • Boundary line that is equidistant from two
    centroids

35
Generative models: kNN
  • Assign each element to the closest cluster
  • K-nearest neighbors
  • Very easy to program (see the sketch below)
  • Tessellation; nonlinearity
  • Issues: choosing k, b?
  • Demo:
  • http://www-2.cs.cmu.edu/~zhuxj/courseproject/knndemo/KNN.html
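Since the slide notes that kNN is very easy to program, here is a minimal sketch (Euclidean distance, majority vote; not the course's code):

    import math
    from collections import Counter

    def knn_classify(x, examples, k=3):
        """examples: list of (vector, label) pairs; x: a feature vector."""
        def dist(a, b):
            return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
        nearest = sorted(examples, key=lambda ex: dist(x, ex[0]))[:k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]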

36
Linear separators
  • Two-dimensional line:
  • w1x1 + w2x2 = b is the linear separator
  • w1x1 + w2x2 > b for the positive class
  • In n-dimensional spaces (see below)
37
Example 1
x2
topic2
topic1
x1
38
Example 2
  • Classifier for "interest" in Reuters-21578
  • b = 0
  • If the document is "rate discount dlr world", its
    score will be 0.67·1 + 0.46·1 + (-0.71)·1 + (-0.35)·1
    = 0.07 > 0

Example from MSR
39
Example: perceptron algorithm
[Figure: the perceptron algorithm box (input, algorithm
steps, output); a sketch follows below]
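The algorithm box itself is not in the transcript; a minimal sketch of the classic perceptron update rule (labels in {-1, +1}; names are mine, not the slide's):

    def perceptron_train(data, epochs=10, eta=1.0):
        """data: list of (feature_vector, label) with labels in {-1, +1}.
        Returns weights w and bias b of a separating hyperplane, if the
        data are linearly separable and enough epochs are run."""
        n = len(data[0][0])
        w, b = [0.0] * n, 0.0
        for _ in range(epochs):
            for x, y in data:
                activation = sum(wi * xi for wi, xi in zip(w, x)) + b
                if y * activation <= 0:        # misclassified: update
                    w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                    b += eta * y
        return w, b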
40
Slide from Chris Bishop
41
Linear classifiers
  • What is the major shortcoming of a perceptron?
  • How to determine the dimensionality of the
    separator?
  • Bias-variance tradeoff (example)
  • How to deal with multiple classes?
  • Any-of: build a classifier for each class
  • One-of: harder (as J hyperplanes do not divide R^M
    into J regions); instead use class complements
    and scoring

42
Support vector machines
  • Introduced by Vapnik in the early 90s.

43
Issues with SVM
  • Soft margins (inseparability)
  • Kernels (non-linearity)

44
The kernel idea
[Figure: the same data before and after the kernel mapping]
45
Example
(mapping to a higher-dimensional space)
46
The kernel trick
Polynomial kernel
Sigmoid kernel
RBF kernel
(standard forms of these are given below)
Many other kernels are useful for IR, e.g.,
string kernels, subsequence kernels, tree
kernels, etc.
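The kernel formulas themselves are not in the transcript; the textbook forms (in LaTeX, with c, d, κ, θ, σ as free parameters) are:

    K(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^{\top}\mathbf{z} + c)^{d}
    \qquad \text{(polynomial)}

    K(\mathbf{x}, \mathbf{z}) = \tanh(\kappa\, \mathbf{x}^{\top}\mathbf{z} + \theta)
    \qquad \text{(sigmoid)}

    K(\mathbf{x}, \mathbf{z}) = \exp\!\left(-\frac{\lVert \mathbf{x} - \mathbf{z} \rVert^{2}}{2\sigma^{2}}\right)
    \qquad \text{(RBF)}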
47
SVM (contd)
  • Evaluation:
  • SVM > kNN > decision tree > NB
  • Implementation:
  • Quadratic optimization
  • Use a toolkit (e.g., Thorsten Joachims's SVMlight)

48
Semi-supervised learning
  • EM
  • Co-training
  • Graph-based

49
Exploiting Hyperlinks: Co-training
  • Each document instance has two alternate views
    (Blum and Mitchell, 1998)
  • terms in the document, x1
  • terms in the hyperlinks that point to the
    document, x2
  • Each view is sufficient to determine the class of
    the instance
  • The labeling function that classifies examples is
    the same whether applied to x1 or x2
  • x1 and x2 are conditionally independent, given
    the class
Slide from Pierre Baldi
50
Co-training Algorithm
  • Labeled data are used to infer two Naïve Bayes
    classifiers, one for each view
  • Each classifier will
  • examine unlabeled data
  • pick the most confidently predicted positive and
    negative examples
  • add these to the labeled examples
  • Classifiers are now retrained on the augmented
    set of labeled examples (a sketch follows below)

Slide from Pierre Baldi
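A sketch of the loop described on the slide above. The train_view_classifier helper and its predict/confidence methods are placeholders for the two per-view Naïve Bayes classifiers, not a real API, and the sketch simplifies Blum and Mitchell's separate positive/negative quotas to a single most-confident pick per view:

    def co_train(labeled, unlabeled, train_view_classifier, rounds=10, k=1):
        """labeled: list of ((x1, x2), y); unlabeled: list of (x1, x2) pairs."""
        for _ in range(rounds):
            if not unlabeled:
                break
            # Train one classifier per view on the current labeled pool
            c1 = train_view_classifier([(x[0], y) for x, y in labeled])
            c2 = train_view_classifier([(x[1], y) for x, y in labeled])
            moved = set()
            for clf, view in ((c1, 0), (c2, 1)):
                # Pick the k unlabeled examples this view is most confident about
                ranked = sorted(range(len(unlabeled)),
                                key=lambda i: clf.confidence(unlabeled[i][view]),
                                reverse=True)
                for i in ranked[:k]:
                    if i not in moved:
                        labeled.append((unlabeled[i],
                                        clf.predict(unlabeled[i][view])))
                        moved.add(i)
            # Remove the newly labeled examples; retrain in the next round
            unlabeled = [x for i, x in enumerate(unlabeled) if i not in moved]
        return labeled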
51
Conclusion
  • SVMs are widely considered to be the best method
    for text classification (look at papers by
    Sebastiani, Cristianini, Joachims), e.g., 86%
    accuracy on Reuters.
  • NB is also good in many circumstances

52
Readings
  • For October 11: MRS18
  • For October 18: MRS17, MRS19