Title: Search Engine Technology
1 Search Engine Technology (5/6)
http://www.cs.columbia.edu/radev/SET07.html
- October 4 and 11, 2007
- Prof. Dragomir R. Radev
- radev_at_umich.edu
2 Final projects
- Two formats
- A software system that performs a specific search-engine-related task. We will create a web page with all such code and make it available to the IR community.
- A research experiment documented in the form of a paper. Look at the proceedings of the SIGIR, WWW, or ACL conferences for a sample format. I will encourage the authors of the most successful papers to consider submitting them to one of the IR-related conferences.
- Deliverables
- System (code, documentation, examples) or Paper (code, data)
- Poster (to be presented in class)
- Web page that describes the project
3 SET Fall 2007
9. Text classification: Naïve Bayesian classifiers, decision trees
4 Introduction
- Text classification: assigning documents to predefined categories (topics, languages, users)
- A given set of classes C
- Given x, determine its class in C
- Hierarchical vs. flat
- Overlapping (soft) vs. non-overlapping (hard)
5 Introduction
- Ideas: manual classification using rules (e.g., Columbia AND University → Education; Columbia AND South Carolina → Geography)
- Popular techniques: generative (kNN, Naïve Bayes) vs. discriminative (SVM, regression)
- Generative: model the joint prob. p(x,y) and use Bayesian prediction to compute p(y|x)
- Discriminative: model p(y|x) directly
6 Bayes formula
P(A|B) = P(B|A) P(A) / P(B)
Full probability: P(B) = P(B|A1) P(A1) + ... + P(B|An) P(An)
7 Example (performance-enhancing drug)
- Drug (D) with values y/n
- Test (T) with values +/-
- P(D=y) = 0.001
- P(T=+|D=y) = 0.8
- P(T=+|D=n) = 0.01
- Given: an athlete tests positive
- P(D=y|T=+) = P(T=+|D=y) P(D=y) / (P(T=+|D=y) P(D=y) + P(T=+|D=n) P(D=n)) = (0.8 × 0.001) / (0.8 × 0.001 + 0.01 × 0.999) ≈ 0.074
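The slide's posterior can be checked in a few lines of Python (values taken directly from the slide):

```python
# Bayes' rule for the drug-test example on this slide.
p_d = 0.001          # P(D=y): prior probability of drug use
p_t_given_d = 0.8    # P(T=+|D=y): test sensitivity
p_t_given_nd = 0.01  # P(T=+|D=n): false-positive rate

# Law of total probability: P(T=+)
p_t = p_t_given_d * p_d + p_t_given_nd * (1 - p_d)

# Posterior P(D=y|T=+)
posterior = p_t_given_d * p_d / p_t
print(round(posterior, 3))  # 0.074
```

Note how the tiny prior dominates: even with an accurate test, a positive result leaves only a 7.4% chance of actual drug use.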
8 Naïve Bayesian classifiers
- Naïve Bayesian classifier: pick the class c maximizing P(c) P(x|c)
- Assuming statistical independence of the features: P(x|c) = Π_i P(x_i|c)
- Features: typically words (or phrases)
9 Example
- p(well) = 0.9, p(cold) = 0.05, p(allergy) = 0.05
- p(sneeze|well) = 0.1
- p(sneeze|cold) = 0.9
- p(sneeze|allergy) = 0.9
- p(cough|well) = 0.1
- p(cough|cold) = 0.8
- p(cough|allergy) = 0.7
- p(fever|well) = 0.01
- p(fever|cold) = 0.7
- p(fever|allergy) = 0.4
Example from Ray Mooney
10 Example (contd)
- Features: sneeze, cough, no fever
- P(well|e) = (0.9)(0.1)(0.1)(0.99) / p(e) = 0.0089/p(e)
- P(cold|e) = (0.05)(0.9)(0.8)(0.3) / p(e) = 0.01/p(e)
- P(allergy|e) = (0.05)(0.9)(0.7)(0.6) / p(e) = 0.019/p(e)
- p(e) = 0.0089 + 0.01 + 0.019 = 0.0379
- P(well|e) = .23
- P(cold|e) = .26
- P(allergy|e) = .50
Example from Ray Mooney
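The same computation in Python (probabilities from the slide; the unrounded posteriors differ in the third decimal from the slide, which rounds the intermediate products):

```python
# Naïve Bayes posteriors for the sneeze/cough/no-fever example above.
priors = {"well": 0.9, "cold": 0.05, "allergy": 0.05}
cond = {  # P(symptom | class)
    "well":    {"sneeze": 0.1, "cough": 0.1, "fever": 0.01},
    "cold":    {"sneeze": 0.9, "cough": 0.8, "fever": 0.7},
    "allergy": {"sneeze": 0.9, "cough": 0.7, "fever": 0.4},
}

# Evidence: sneeze, cough, NO fever, so use 1 - P(fever|class)
scores = {c: priors[c] * cond[c]["sneeze"] * cond[c]["cough"] * (1 - cond[c]["fever"])
          for c in priors}
p_e = sum(scores.values())              # ≈ 0.0379
post = {c: scores[c] / p_e for c in scores}
print({c: round(p, 2) for c, p in post.items()})
```

Allergy wins, even though "well" has by far the largest prior.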
11 Issues with NB
- Where do we get the prior values? Use maximum likelihood estimation (N_i/N)
- Same for the conditionals: these are based on a multinomial generator and the MLE estimator is (T_ji / Σ_i T_ji)
- Smoothing is needed (why?)
- Laplace smoothing: ((T_ji + 1) / Σ_i (T_ji + 1))
- Implementation: how to avoid floating-point underflow?
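A minimal sketch of both fixes, Laplace smoothing and log-space scoring (the toy corpus and the spam/ham labels are made up for illustration):

```python
import math
from collections import Counter

# Sketch: multinomial NB training with Laplace smoothing, and log-space
# scoring to avoid floating-point underflow. Toy corpus is made up.
docs = [("buy cheap pills now", "spam"),
        ("meeting agenda for monday", "ham"),
        ("cheap meds buy now", "spam"),
        ("project meeting notes", "ham")]

vocab = {w for text, _ in docs for w in text.split()}
classes = {c for _, c in docs}
counts = {c: Counter() for c in classes}
doc_counts = Counter(c for _, c in docs)
for text, c in docs:
    counts[c].update(text.split())

def log_posterior(text, c):
    # log P(c) + sum_i log P(w_i|c), with Laplace ((T_ji + 1) / sum(T_ji + 1))
    total = sum(counts[c].values())
    lp = math.log(doc_counts[c] / len(docs))
    for w in text.split():
        lp += math.log((counts[c][w] + 1) / (total + len(vocab)))
    return lp

scores = {c: log_posterior("cheap pills", c) for c in classes}
print(max(scores, key=scores.get))  # spam
```

Summing logs instead of multiplying probabilities keeps long documents from underflowing to 0.0; smoothing keeps an unseen word from sending the log to minus infinity.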
12Spam recognition
Return-Path ltig_esq_at_rediffmail.comgt X-Sieve CMU
Sieve 2.2 From "Ibrahim Galadima"
ltig_esq_at_rediffmail.comgt Reply-To
galadima_esq_at_netpiper.com To webmaster_at_aclweb.org
Date Tue, 14 Jan 2003 210626 -0800 Subject
Gooday DEAR SIR FUNDS FOR INVESTMENTS THIS
LETTER MAY COME TO YOU AS A SURPRISE SINCE I
HAD NO PREVIOUS CORRESPONDENCE WITH YOU I AM THE
CHAIRMAN TENDER BOARD OF INDEPENDENT NATIONAL
ELECTORAL COMMISSION INEC I GOT YOUR CONTACT IN
THE COURSE OF MY SEARCH FOR A RELIABLE PERSON
WITH WHOM TO HANDLE A VERY CONFIDENTIAL TRANSACTIO
N INVOLVING THE ! TRANSFER OF FUND VALUED
AT TWENTY ONE MILLION SIX HUNDRED THOUSAND UNITED
STATES DOLLARS US20M TO A SAFE FOREIGN
ACCOUNT THE ABOVE FUND IN QUESTION IS NOT
CONNECTED WITH ARMS, DRUGS OR MONEY LAUNDERING IT
IS A PRODUCT OF OVER INVOICED CONTRACT AWARDED IN
1999 BY INEC TO A
13 SpamAssassin
- http://spamassassin.apache.org/
- http://spamassassin.apache.org/tests_3_1_x.html
14 Feature selection: the χ² test
- For a term t:
- C = class, I_t = feature
- Testing for independence: P(C=0, I_t=0) should be equal to P(C=0) P(I_t=0)
- P(C=0) = (k00 + k01)/n
- P(C=1) = 1 - P(C=0) = (k10 + k11)/n
- P(I_t=0) = (k00 + k10)/n
- P(I_t=1) = 1 - P(I_t=0) = (k01 + k11)/n
15 Feature selection: the χ² test
- High values of χ² indicate lower belief in independence.
- In practice, compute χ² for all words and pick the top k among them.
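A sketch of the χ² statistic for one term, computed from the 2×2 contingency table defined on the previous slide (the counts here are made up; a large value argues against independence of term and class):

```python
# χ² statistic for term/class independence from a 2×2 contingency table.
# k[i][j] = number of docs with C=i and I_t=j; toy counts, assumed.
k = [[4900, 100],   # C=0: term absent / term present
     [ 900, 100]]   # C=1: term absent / term present

n = sum(sum(row) for row in k)
chi2 = 0.0
for i in (0, 1):
    for j in (0, 1):
        p_c = sum(k[i]) / n                # P(C=i)
        p_t = (k[0][j] + k[1][j]) / n      # P(I_t=j)
        expected = n * p_c * p_t           # count expected under independence
        chi2 += (k[i][j] - expected) ** 2 / expected
print(round(chi2, 1))
```

Running this for every term in the vocabulary and keeping the top-k scorers gives the feature-selection procedure described above.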
16 Feature selection: mutual information
- No document length scaling is needed
- Documents are assumed to be generated according to the multinomial model
- Measures amount of information: if the distribution is the same as the background distribution, then MI = 0
- X = word, Y = class
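The MI criterion can be sketched from a joint count table over word occurrence X and class Y (toy counts, assumed; note MI would be exactly 0 if the rows were proportional):

```python
import math

# Mutual information I(X;Y) between word occurrence X and class Y,
# estimated from a joint count table (toy numbers, assumed).
joint = [[40, 10],   # counts: X=0 with Y=0, Y=1
         [10, 40]]   # counts: X=1 with Y=0, Y=1
n = sum(map(sum, joint))

mi = 0.0
for x in (0, 1):
    for y in (0, 1):
        p_xy = joint[x][y] / n
        p_x = sum(joint[x]) / n
        p_y = (joint[0][y] + joint[1][y]) / n
        mi += p_xy * math.log2(p_xy / (p_x * p_y))
print(round(mi, 3))
```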
17 Well-known datasets
- 20 newsgroups
- http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/
- Reuters-21578
- http://www.daviddlewis.com/resources/testcollections/reuters21578/
- Categories: grain, acquisitions, corn, crude, wheat, trade
- WebKB
- http://www-2.cs.cmu.edu/webkb/
- course, student, faculty, staff, project, dept, other
- NB performance (2000)
- P = 26, 43, 18, 6, 13, 2, 94
- R = 83, 75, 77, 9, 73, 100, 35
18 Evaluation of text classification
- Microaveraging: uses the pooled contingency table
- Macroaveraging: averages over classes
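The difference is easiest to see numerically. A sketch with two classes and made-up per-class TP/FP counts (macro averages the per-class precisions; micro pools the counts first, so frequent classes dominate):

```python
# Micro- vs. macro-averaged precision from per-class (TP, FP) counts.
# Class names and counts are toy values, assumed for illustration.
per_class = {"grain": (50, 10), "trade": (5, 15)}

macro = sum(tp / (tp + fp) for tp, fp in per_class.values()) / len(per_class)

tp_sum = sum(tp for tp, _ in per_class.values())
fp_sum = sum(fp for _, fp in per_class.values())
micro = tp_sum / (tp_sum + fp_sum)

print(round(macro, 3), round(micro, 3))
```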
19 Vector space classification
(Figure: documents from topic1 and topic2 plotted in a two-dimensional space with axes x1 and x2)
20 Decision surfaces
(Figure: a decision surface separating topic1 from topic2 in the x1-x2 plane)
21 Decision trees
(Figure: decision-tree boundaries separating topic1 from topic2 in the x1-x2 plane)
22 Classification using decision trees
- Expected information need:
- I(s1, s2, ..., sm) = -Σ_i p_i log(p_i)
- s = data samples
- m = number of classes
24 Decision tree induction
- I(s1, s2) = I(9, 5) = -9/14 log(9/14) - 5/14 log(5/14) = 0.940
25 Entropy and information gain
E(A) = Σ_j ((s_1j + ... + s_mj)/s) I(s_1j, ..., s_mj)
Entropy: expected information based on the partitioning into subsets by A
Gain(A) = I(s1, s2, ..., sm) - E(A)
26 Entropy
- Age < 30: s11 = 2, s21 = 3, I(s11, s21) = 0.971
- Age in 31 .. 40: s12 = 4, s22 = 0, I(s12, s22) = 0
- Age > 40: s13 = 3, s23 = 2, I(s13, s23) = 0.971
27 Entropy (contd)
- E(age) = 5/14 I(s11, s21) + 4/14 I(s12, s22) + 5/14 I(s13, s23) = 0.694
- Gain(age) = I(s1, s2) - E(age) = 0.246
- Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit) = 0.048
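The entropy and gain figures above are easy to reproduce (the per-partition counts come from the previous slide; small differences past the third decimal are rounding):

```python
import math

# Reproduces I(9,5), E(age), and Gain(age) for the 14-sample set above.
def info(*counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

i_all = info(9, 5)                        # ≈ 0.940

# age partitions: (<30: 2 yes, 3 no), (31..40: 4 yes, 0 no), (>40: 3 yes, 2 no)
partitions = [(2, 3), (4, 0), (3, 2)]
e_age = sum((a + b) / 14 * info(a, b) for a, b in partitions)   # ≈ 0.694

gain_age = i_all - e_age
print(round(i_all, 3), round(e_age, 3), round(gain_age, 3))
```

Since Gain(age) beats Gain(income), Gain(student), and Gain(credit), age becomes the root of the tree.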
28 Final decision tree
age?
- < 30 → student? (no → no; yes → yes)
- 31 .. 40 → yes
- > 40 → credit? (excellent → no; fair → yes)
29 Other techniques
- Bayesian classifiers
- X: age < 30, income = medium, student = yes, credit = fair
- P(yes) = 9/14 = 0.643
- P(no) = 5/14 = 0.357
30Example
- P (age lt 30 yes) 2/9 0.222P (age lt 30
no) 3/5 0.600P (income medium yes) 4/9
0.444P (income medium no) 2/5 0.400P
(student yes yes) 6/9 0.667P (student
yes no) 1/5 0.200P (credit fair yes)
6/9 0.667P (credit fair no) 2/5 0.400
31 Example (contd)
- P(X|yes) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
- P(X|no) = 0.600 × 0.400 × 0.200 × 0.400 = 0.019
- P(X|yes) P(yes) = 0.044 × 0.643 = 0.028
- P(X|no) P(no) = 0.019 × 0.357 = 0.007
- Answer: yes (0.028 > 0.007)
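The decision above, computed directly from the slide's numbers:

```python
# Naïve Bayes decision for X = (age<30, income=medium, student=yes, credit=fair)
p_x_yes = 0.222 * 0.444 * 0.667 * 0.667   # ≈ 0.044
p_x_no = 0.600 * 0.400 * 0.200 * 0.400    # ≈ 0.019

score_yes = p_x_yes * 9 / 14              # ≈ 0.028
score_no = p_x_no * 5 / 14                # ≈ 0.007

print("yes" if score_yes > score_no else "no")  # yes
```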
32 SET Fall 2007
10. Linear classifiers: kernel methods, support vector machines
33 Linear boundary
(Figure: a linear boundary separating topic1 from topic2 in the x1-x2 plane)
34 Vector space classifiers
- Using centroids
- Boundary: the line that is equidistant from the two centroids
35 Generative models: kNN
- Assign each element to the closest cluster
- K-nearest neighbors
- Very easy to program
- Tessellation; nonlinearity
- Issues: choosing k, b?
- Demo
- http://www-2.cs.cmu.edu/zhuxj/courseproject/knndemo/KNN.html
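A minimal kNN classifier sketch, majority vote over Euclidean distance (the 2-D training points, labels, and queries are made up):

```python
from collections import Counter

# Minimal k-nearest-neighbor classifier: majority vote among the k
# training points closest to the query. Toy 2-D data, assumed.
train = [((1.0, 1.0), "topic1"), ((1.2, 0.8), "topic1"),
         ((4.0, 4.2), "topic2"), ((4.1, 3.9), "topic2"), ((3.8, 4.0), "topic2")]

def knn(query, k=3):
    # Sort training points by squared Euclidean distance to the query
    nearest = sorted(train,
                     key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], query)))
    votes = Counter(label for _, label in nearest[:k])
    return votes.most_common(1)[0][0]

print(knn((4.0, 4.0)))  # topic2
print(knn((1.1, 0.9)))  # topic1
```

The induced decision boundary is the tessellation mentioned above: piecewise, and in general nonlinear.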
36 Linear separators
- Two-dimensional line:
- w1x1 + w2x2 = b is the linear separator
- w1x1 + w2x2 > b for the positive class
37 Example 1
(Figure: topic1 and topic2 points separated by a line in the x1-x2 plane)
38 Example 2
- Classifier for the interest category in Reuters-21578
- b = 0
- If the document is rate discount dlr world, its score will be 0.67·1 + 0.46·1 + (-0.71)·1 + (-0.35)·1 = 0.07 > 0
Example from MSR
39 Example: perceptron algorithm
(Figure: the perceptron algorithm, showing its input, update loop, and output)
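The perceptron referenced on this slide can be sketched in a few lines (the toy data set is made up and linearly separable; the bias is folded into the weight vector, with unit learning rate):

```python
# Minimal perceptron training loop: update weights only on mistakes.
# Toy separable data, assumed. Labels are +1 / -1.
data = [((1.0, 2.0), 1), ((2.0, 1.5), 1), ((-1.0, -1.0), -1), ((-2.0, -0.5), -1)]

w = [0.0, 0.0, 0.0]                      # w1, w2, b
for _ in range(100):                     # epochs
    errors = 0
    for (x1, x2), y in data:
        pred = 1 if w[0] * x1 + w[1] * x2 + w[2] > 0 else -1
        if pred != y:                    # mistake-driven update
            w[0] += y * x1
            w[1] += y * x2
            w[2] += y
            errors += 1
    if errors == 0:                      # converged: a separator was found
        break

print(all((1 if w[0] * x1 + w[1] * x2 + w[2] > 0 else -1) == y
          for (x1, x2), y in data))  # True
```

Convergence is guaranteed only when the data are linearly separable, which is the shortcoming the next slide asks about.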
40 (Figure: slide from Chris Bishop)
41 Linear classifiers
- What is the major shortcoming of a perceptron?
- How to determine the dimensionality of the separator?
- Bias-variance tradeoff (example)
- How to deal with multiple classes?
- Any-of: build a separate classifier for each class
- One-of: harder (as J hyperplanes do not divide R^M into J regions); instead use class complements and scoring
42 Support vector machines
- Introduced by Vapnik in the early 1990s.
43 Issues with SVM
- Soft margins (inseparability)
- Kernels (non-linearity)
44 The kernel idea
(Figure: the data before and after the kernel mapping)
45 Example
(Figure: mapping to a higher-dimensional space)
46 The kernel trick
- Polynomial kernel: K(x, y) = (x·y + c)^d
- Sigmoid kernel: K(x, y) = tanh(κ x·y + θ)
- RBF kernel: K(x, y) = exp(-||x - y||² / (2σ²))
- Many other kernels are useful for IR, e.g., string kernels, subsequence kernels, tree kernels, etc.
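The trick itself is that a kernel value equals an inner product in a higher-dimensional feature space without ever mapping the points. A sketch with the quadratic kernel K(x, y) = (x·y)², whose explicit feature map in 2-D is φ(x) = (x1², x2², √2·x1x2):

```python
import math

# Kernel trick check: K(x, y) = (x·y)^2 equals <phi(x), phi(y)>
# for the explicit map phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2).
def kernel(x, y):
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def phi(x):
    return (x[0] ** 2, x[1] ** 2, math.sqrt(2) * x[0] * x[1])

x, y = (1.0, 2.0), (3.0, 0.5)
lhs = kernel(x, y)                                  # kernel in input space
rhs = sum(a * b for a, b in zip(phi(x), phi(y)))    # dot product in feature space
print(abs(lhs - rhs) < 1e-9)  # True
```

For the RBF kernel the implicit feature space is infinite-dimensional, which is exactly why computing K directly matters.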
47 SVM (contd)
- Evaluation
- SVM > kNN > decision tree > NB
- Implementation
- Quadratic optimization
- Use a toolkit (e.g., Thorsten Joachims's SVMlight)
48 Semi-supervised learning
- EM
- Co-training
- Graph-based
49 Exploiting hyperlinks: co-training
- Each document instance has two alternate views (Blum and Mitchell 1998)
- terms in the document, x1
- terms in the hyperlinks that point to the document, x2
- Each view is sufficient to determine the class of the instance
- The labeling function that classifies examples is the same applied to x1 or x2
- x1 and x2 are conditionally independent, given the class
Slide from Pierre Baldi
50 Co-training algorithm
- Labeled data are used to infer two Naïve Bayes classifiers, one for each view
- Each classifier will
- examine unlabeled data
- pick the most confidently predicted positive and negative examples
- add these to the labeled examples
- Classifiers are now retrained on the augmented set of labeled examples
Slide from Pierre Baldi
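The control flow of the algorithm above can be sketched as follows. Everything here is a stand-in: the two views are (document text, link text) pairs, the tiny data set is made up, and a trivial keyword scorer replaces the Naïve Bayes classifiers purely to keep the loop visible:

```python
# Schematic co-training loop. Each example is ((view0_text, view1_text), label).
labeled = [(("cheap pills", "spam link"), 1), (("meeting notes", "work link"), 0)]
unlabeled = [("buy cheap meds", "spam link"), ("agenda for meeting", "work link")]

def train(examples, view):
    # Stand-in classifier: score by word overlap with each class, on one view.
    pos_words = {w for x, y in examples if y == 1 for w in x[view].split()}
    neg_words = {w for x, y in examples if y == 0 for w in x[view].split()}
    return lambda x: (sum(w in pos_words for w in x[view].split())
                      - sum(w in neg_words for w in x[view].split()))

for _ in range(2):                                  # co-training rounds
    c1, c2 = train(labeled, 0), train(labeled, 1)   # one classifier per view
    for clf in (c1, c2):
        if not unlabeled:
            break
        scored = sorted(unlabeled, key=clf)
        pos, neg = scored[-1], scored[0]            # most confident each way
        labeled.append((pos, 1))                    # add self-labeled examples
        unlabeled.remove(pos)
        if neg is not pos and neg in unlabeled:
            labeled.append((neg, 0))
            unlabeled.remove(neg)

print(len(labeled), len(unlabeled))
```

Each round, the classifier trained on one view labels examples that then help retrain the classifier for the other view, which is where the conditional-independence assumption earns its keep.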
51 Conclusion
- SVMs are widely considered to be the best method for text classification (look at papers by Sebastiani, Cristianini, Joachims), e.g., 86% accuracy on Reuters.
- NB is also good in many circumstances
52 Readings
- For October 11: MRS18
- For October 18: MRS17, MRS19