Title: Search Engine Technology
1 Search Engine Technology (5/6)
http://www.cs.columbia.edu/radev/SET07.html
- October 4 and 11, 2007
- Prof. Dragomir R. Radev
- radev_at_umich.edu
2 Final projects
- Two formats
- A software system that performs a specific search-engine-related task. We will create a web page with all such code and make it available to the IR community.
- A research experiment documented in the form of a paper. Look at the proceedings of the SIGIR, WWW, or ACL conferences for a sample format. I will encourage the authors of the most successful papers to consider submitting them to one of the IR-related conferences.
- Deliverables
- System (code, documentation, examples) or Paper (code, data)
- Poster (to be presented in class)
- Web page that describes the project
3 SET Fall 2007
9. Text classification: Naïve Bayesian classifiers, decision trees
4 Introduction
- Text classification: assigning documents to predefined categories (topics, languages, users)
- A given set of classes C
- Given x, determine its class in C
- Hierarchical vs. flat
- Overlapping (soft) vs. non-overlapping (hard)
5 Introduction
- Ideas: manual classification using rules (e.g., Columbia AND University → Education; Columbia AND South Carolina → Geography)
- Popular techniques: generative (kNN, Naïve Bayes) vs. discriminative (SVM, regression)
- Generative: model the joint prob. p(x,y) and use Bayesian prediction to compute p(y|x)
- Discriminative: model p(y|x) directly
6 Bayes formula
P(A|B) = P(B|A) P(A) / P(B)
Full probability: P(B) = P(B|A1) P(A1) + ... + P(B|An) P(An)
7 Example (performance-enhancing drug)
- Drug (D) with values y/n
- Test (T) with values +/-
- P(D=y) = 0.001
- P(T=+|D=y) = 0.8
- P(T=+|D=n) = 0.01
- Given: an athlete tests positive
- P(D=y|T=+) = P(T=+|D=y) P(D=y) / (P(T=+|D=y) P(D=y) + P(T=+|D=n) P(D=n)) = (0.8 × 0.001) / (0.8 × 0.001 + 0.01 × 0.999) ≈ 0.074
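The slide's posterior can be checked in a few lines of Python (values taken directly from the slide):

```python
# Bayes' rule for the drug-test example on this slide.
p_d = 0.001          # P(D=y): prior probability of drug use
p_t_given_d = 0.8    # P(T=+|D=y): test sensitivity
p_t_given_nd = 0.01  # P(T=+|D=n): false-positive rate

# Law of total probability: P(T=+)
p_t = p_t_given_d * p_d + p_t_given_nd * (1 - p_d)

# Posterior P(D=y|T=+)
posterior = p_t_given_d * p_d / p_t
print(round(posterior, 3))  # 0.074
```

Note how the tiny prior dominates: even with an accurate test, a positive result leaves only a 7.4% chance of actual drug use.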
8 Naïve Bayesian classifiers
- Naïve Bayesian classifier: pick the class c maximizing P(c) P(x|c)
- Assuming statistical independence of the features: P(x|c) = Π_i P(x_i|c)
- Features: typically words (or phrases)
9 Example
- p(well) = 0.9, p(cold) = 0.05, p(allergy) = 0.05
- p(sneeze|well) = 0.1
- p(sneeze|cold) = 0.9
- p(sneeze|allergy) = 0.9
- p(cough|well) = 0.1
- p(cough|cold) = 0.8
- p(cough|allergy) = 0.7
- p(fever|well) = 0.01
- p(fever|cold) = 0.7
- p(fever|allergy) = 0.4
Example from Ray Mooney
10 Example (contd)
- Features: sneeze, cough, no fever
- P(well|e) = (0.9)(0.1)(0.1)(0.99) / p(e) = 0.0089/p(e)
- P(cold|e) = (0.05)(0.9)(0.8)(0.3) / p(e) = 0.01/p(e)
- P(allergy|e) = (0.05)(0.9)(0.7)(0.6) / p(e) = 0.019/p(e)
- p(e) = 0.0089 + 0.01 + 0.019 = 0.0379
- P(well|e) = .23
- P(cold|e) = .26
- P(allergy|e) = .50
Example from Ray Mooney
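The same computation in Python (probabilities from the slide; the unrounded posteriors differ in the third decimal from the slide, which rounds the intermediate products):

```python
# Naïve Bayes posteriors for the sneeze/cough/no-fever example above.
priors = {"well": 0.9, "cold": 0.05, "allergy": 0.05}
cond = {  # P(symptom | class)
    "well":    {"sneeze": 0.1, "cough": 0.1, "fever": 0.01},
    "cold":    {"sneeze": 0.9, "cough": 0.8, "fever": 0.7},
    "allergy": {"sneeze": 0.9, "cough": 0.7, "fever": 0.4},
}

# Evidence: sneeze, cough, NO fever, so use 1 - P(fever|class)
scores = {c: priors[c] * cond[c]["sneeze"] * cond[c]["cough"] * (1 - cond[c]["fever"])
          for c in priors}
p_e = sum(scores.values())              # ≈ 0.0379
post = {c: scores[c] / p_e for c in scores}
print({c: round(p, 2) for c, p in post.items()})
```

Allergy wins, even though "well" has by far the largest prior.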
11 Issues with NB
- Where do we get the prior values? Use maximum likelihood estimation (N_i/N)
- Same for the conditionals: these are based on a multinomial generator and the MLE estimator is (T_ji / Σ_i T_ji)
- Smoothing is needed (why?)
- Laplace smoothing: ((T_ji + 1) / Σ_i (T_ji + 1))
- Implementation: how to avoid floating-point underflow?
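A minimal sketch of both fixes, Laplace smoothing and log-space scoring (the toy corpus and the spam/ham labels are made up for illustration):

```python
import math
from collections import Counter

# Sketch: multinomial NB training with Laplace smoothing, and log-space
# scoring to avoid floating-point underflow. Toy corpus is made up.
docs = [("buy cheap pills now", "spam"),
        ("meeting agenda for monday", "ham"),
        ("cheap meds buy now", "spam"),
        ("project meeting notes", "ham")]

vocab = {w for text, _ in docs for w in text.split()}
classes = {c for _, c in docs}
counts = {c: Counter() for c in classes}
doc_counts = Counter(c for _, c in docs)
for text, c in docs:
    counts[c].update(text.split())

def log_posterior(text, c):
    # log P(c) + sum_i log P(w_i|c), with Laplace ((T_ji + 1) / sum(T_ji + 1))
    total = sum(counts[c].values())
    lp = math.log(doc_counts[c] / len(docs))
    for w in text.split():
        lp += math.log((counts[c][w] + 1) / (total + len(vocab)))
    return lp

scores = {c: log_posterior("cheap pills", c) for c in classes}
print(max(scores, key=scores.get))  # spam
```

Summing logs instead of multiplying probabilities keeps long documents from underflowing to 0.0; smoothing keeps an unseen word from sending the log to minus infinity.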
12Spam recognition
Return-Path ltig_esq_at_rediffmail.comgt X-Sieve CMU
Sieve 2.2 From "Ibrahim Galadima"
ltig_esq_at_rediffmail.comgt Reply-To
galadima_esq_at_netpiper.com To webmaster_at_aclweb.org
Date Tue, 14 Jan 2003 210626 -0800 Subject
Gooday DEAR SIR FUNDS FOR INVESTMENTS THIS
LETTER MAY COME TO YOU AS A SURPRISE SINCE I
HAD NO PREVIOUS CORRESPONDENCE WITH YOU I AM THE
CHAIRMAN TENDER BOARD OF INDEPENDENT NATIONAL
ELECTORAL COMMISSION INEC I GOT YOUR CONTACT IN
THE COURSE OF MY SEARCH FOR A RELIABLE PERSON
WITH WHOM TO HANDLE A VERY CONFIDENTIAL TRANSACTIO
N INVOLVING THE ! TRANSFER OF FUND VALUED
AT TWENTY ONE MILLION SIX HUNDRED THOUSAND UNITED
STATES DOLLARS US20M TO A SAFE FOREIGN
ACCOUNT THE ABOVE FUND IN QUESTION IS NOT
CONNECTED WITH ARMS, DRUGS OR MONEY LAUNDERING IT
IS A PRODUCT OF OVER INVOICED CONTRACT AWARDED IN
1999 BY INEC TO A
13 SpamAssassin
- http://spamassassin.apache.org/
- http://spamassassin.apache.org/tests_3_1_x.html
14 Feature selection: the χ² test
- For a term t:
- C = class, I_t = feature
- Testing for independence: P(C=0, I_t=0) should be equal to P(C=0) P(I_t=0)
- P(C=0) = (k00 + k01)/n
- P(C=1) = 1 - P(C=0) = (k10 + k11)/n
- P(I_t=0) = (k00 + k10)/n
- P(I_t=1) = 1 - P(I_t=0) = (k01 + k11)/n
15 Feature selection: the χ² test
- High values of χ² indicate lower belief in independence.
- In practice, compute χ² for all words and pick the top k among them.
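A sketch of the χ² statistic for one term, computed from the 2×2 contingency table defined on the previous slide (the counts here are made up; a large value argues against independence of term and class):

```python
# χ² statistic for term/class independence from a 2×2 contingency table.
# k[i][j] = number of docs with C=i and I_t=j; toy counts, assumed.
k = [[4900, 100],   # C=0: term absent / term present
     [ 900, 100]]   # C=1: term absent / term present

n = sum(sum(row) for row in k)
chi2 = 0.0
for i in (0, 1):
    for j in (0, 1):
        p_c = sum(k[i]) / n                # P(C=i)
        p_t = (k[0][j] + k[1][j]) / n      # P(I_t=j)
        expected = n * p_c * p_t           # count expected under independence
        chi2 += (k[i][j] - expected) ** 2 / expected
print(round(chi2, 1))
```

Running this for every term in the vocabulary and keeping the top-k scorers gives the feature-selection procedure described above.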
16 Feature selection: mutual information
- No document length scaling is needed
- Documents are assumed to be generated according to the multinomial model
- Measures amount of information: if the distribution is the same as the background distribution, then MI = 0
- X = word, Y = class
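The MI criterion can be sketched from a joint count table over word occurrence X and class Y (toy counts, assumed; note MI would be exactly 0 if the rows were proportional):

```python
import math

# Mutual information I(X;Y) between word occurrence X and class Y,
# estimated from a joint count table (toy numbers, assumed).
joint = [[40, 10],   # counts: X=0 with Y=0, Y=1
         [10, 40]]   # counts: X=1 with Y=0, Y=1
n = sum(map(sum, joint))

mi = 0.0
for x in (0, 1):
    for y in (0, 1):
        p_xy = joint[x][y] / n
        p_x = sum(joint[x]) / n
        p_y = (joint[0][y] + joint[1][y]) / n
        mi += p_xy * math.log2(p_xy / (p_x * p_y))
print(round(mi, 3))
```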
17 Well-known datasets
- 20 newsgroups
- http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/
- Reuters-21578
- http://www.daviddlewis.com/resources/testcollections/reuters21578/
- Categories: grain, acquisitions, corn, crude, wheat, trade
- WebKB
- http://www-2.cs.cmu.edu/webkb/
- course, student, faculty, staff, project, dept, other
- NB performance (2000)
- P = 26, 43, 18, 6, 13, 2, 94
- R = 83, 75, 77, 9, 73, 100, 35
18 Evaluation of text classification
- Microaveraging: uses the pooled contingency table
- Macroaveraging: averages over classes
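The difference is easiest to see numerically. A sketch with two classes and made-up per-class TP/FP counts (macro averages the per-class precisions; micro pools the counts first, so frequent classes dominate):

```python
# Micro- vs. macro-averaged precision from per-class (TP, FP) counts.
# Class names and counts are toy values, assumed for illustration.
per_class = {"grain": (50, 10), "trade": (5, 15)}

macro = sum(tp / (tp + fp) for tp, fp in per_class.values()) / len(per_class)

tp_sum = sum(tp for tp, _ in per_class.values())
fp_sum = sum(fp for _, fp in per_class.values())
micro = tp_sum / (tp_sum + fp_sum)

print(round(macro, 3), round(micro, 3))
```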
19 Vector space classification
(Figure: documents from topic1 and topic2 plotted in a two-dimensional space with axes x1 and x2)
20 Decision surfaces
(Figure: a decision surface separating topic1 from topic2 in the x1-x2 plane)
21 Decision trees
(Figure: decision-tree boundaries separating topic1 from topic2 in the x1-x2 plane)
22 Classification using decision trees
- Expected information need:
- I(s1, s2, ..., sm) = -Σ_i p_i log(p_i)
- s = data samples
- m = number of classes
24 Decision tree induction
- I(s1, s2) = I(9, 5) = -9/14 log(9/14) - 5/14 log(5/14) = 0.940
25 Entropy and information gain
E(A) = Σ_j ((s_1j + ... + s_mj)/s) I(s_1j, ..., s_mj)
Entropy: expected information based on the partitioning into subsets by A
Gain(A) = I(s1, s2, ..., sm) - E(A)
26 Entropy
- Age < 30: s11 = 2, s21 = 3, I(s11, s21) = 0.971
- Age in 31 .. 40: s12 = 4, s22 = 0, I(s12, s22) = 0
- Age > 40: s13 = 3, s23 = 2, I(s13, s23) = 0.971
27 Entropy (contd)
- E(age) = 5/14 I(s11, s21) + 4/14 I(s12, s22) + 5/14 I(s13, s23) = 0.694
- Gain(age) = I(s1, s2) - E(age) = 0.246
- Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit) = 0.048
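The entropy and gain figures above are easy to reproduce (the per-partition counts come from the previous slide; small differences past the third decimal are rounding):

```python
import math

# Reproduces I(9,5), E(age), and Gain(age) for the 14-sample set above.
def info(*counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

i_all = info(9, 5)                        # ≈ 0.940

# age partitions: (<30: 2 yes, 3 no), (31..40: 4 yes, 0 no), (>40: 3 yes, 2 no)
partitions = [(2, 3), (4, 0), (3, 2)]
e_age = sum((a + b) / 14 * info(a, b) for a, b in partitions)   # ≈ 0.694

gain_age = i_all - e_age
print(round(i_all, 3), round(e_age, 3), round(gain_age, 3))
```

Since Gain(age) beats Gain(income), Gain(student), and Gain(credit), age becomes the root of the tree.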
28 Final decision tree
age?
- < 30 → student? (no → no; yes → yes)
- 31 .. 40 → yes
- > 40 → credit? (excellent → no; fair → yes)
29 Other techniques
- Bayesian classifiers
- X: age < 30, income = medium, student = yes, credit = fair
- P(yes) = 9/14 = 0.643
- P(no) = 5/14 = 0.357
30Example
- P (age lt 30 yes) 2/9 0.222P (age lt 30
no) 3/5 0.600P (income medium yes) 4/9
0.444P (income medium no) 2/5 0.400P
(student yes yes) 6/9 0.667P (student
yes no) 1/5 0.200P (credit fair yes)
6/9 0.667P (credit fair no) 2/5 0.400
31 Example (contd)
- P(X|yes) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
- P(X|no) = 0.600 × 0.400 × 0.200 × 0.400 = 0.019
- P(X|yes) P(yes) = 0.044 × 0.643 = 0.028
- P(X|no) P(no) = 0.019 × 0.357 = 0.007
- Answer: yes (0.028 > 0.007)
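The decision above, computed directly from the slide's numbers:

```python
# Naïve Bayes decision for X = (age<30, income=medium, student=yes, credit=fair)
p_x_yes = 0.222 * 0.444 * 0.667 * 0.667   # ≈ 0.044
p_x_no = 0.600 * 0.400 * 0.200 * 0.400    # ≈ 0.019

score_yes = p_x_yes * 9 / 14              # ≈ 0.028
score_no = p_x_no * 5 / 14                # ≈ 0.007

print("yes" if score_yes > score_no else "no")  # yes
```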
32 SET Fall 2007
10. Linear classifiers: kernel methods, support vector machines
33 Linear boundary
(Figure: a linear boundary separating topic1 from topic2 in the x1-x2 plane)
34 Vector space classifiers
- Using centroids
- Boundary: the line that is equidistant from the two centroids
35 Generative models: kNN
- Assign each element to the closest cluster
- K-nearest neighbors
- Very easy to program
- Tessellation; nonlinearity
- Issues: choosing k, b?
- Demo
- http://www-2.cs.cmu.edu/zhuxj/courseproject/knndemo/KNN.html
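A minimal kNN classifier sketch, majority vote over Euclidean distance (the 2-D training points, labels, and queries are made up):

```python
from collections import Counter

# Minimal k-nearest-neighbor classifier: majority vote among the k
# training points closest to the query. Toy 2-D data, assumed.
train = [((1.0, 1.0), "topic1"), ((1.2, 0.8), "topic1"),
         ((4.0, 4.2), "topic2"), ((4.1, 3.9), "topic2"), ((3.8, 4.0), "topic2")]

def knn(query, k=3):
    # Sort training points by squared Euclidean distance to the query
    nearest = sorted(train,
                     key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], query)))
    votes = Counter(label for _, label in nearest[:k])
    return votes.most_common(1)[0][0]

print(knn((4.0, 4.0)))  # topic2
print(knn((1.1, 0.9)))  # topic1
```

The induced decision boundary is the tessellation mentioned above: piecewise, and in general nonlinear.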
36 Linear separators
- Two-dimensional line:
- w1x1 + w2x2 = b is the linear separator
- w1x1 + w2x2 > b for the positive class
37 Example 1
(Figure: topic1 and topic2 points separated by a line in the x1-x2 plane)
38 Example 2
- Classifier for the interest category in Reuters-21578
- b = 0
- If the document is rate discount dlr world, its score will be 0.67·1 + 0.46·1 + (-0.71)·1 + (-0.35)·1 = 0.07 > 0
Example from MSR
39 Example: perceptron algorithm
(Figure: the perceptron algorithm, showing its input, update loop, and output)
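The perceptron referenced on this slide can be sketched in a few lines (the toy data set is made up and linearly separable; the bias is folded into the weight vector, with unit learning rate):

```python
# Minimal perceptron training loop: update weights only on mistakes.
# Toy separable data, assumed. Labels are +1 / -1.
data = [((1.0, 2.0), 1), ((2.0, 1.5), 1), ((-1.0, -1.0), -1), ((-2.0, -0.5), -1)]

w = [0.0, 0.0, 0.0]                      # w1, w2, b
for _ in range(100):                     # epochs
    errors = 0
    for (x1, x2), y in data:
        pred = 1 if w[0] * x1 + w[1] * x2 + w[2] > 0 else -1
        if pred != y:                    # mistake-driven update
            w[0] += y * x1
            w[1] += y * x2
            w[2] += y
            errors += 1
    if errors == 0:                      # converged: a separator was found
        break

print(all((1 if w[0] * x1 + w[1] * x2 + w[2] > 0 else -1) == y
          for (x1, x2), y in data))  # True
```

Convergence is guaranteed only when the data are linearly separable, which is the shortcoming the next slide asks about.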
40 (Figure: slide from Chris Bishop)
41 Linear classifiers
- What is the major shortcoming of a perceptron?
- How to determine the dimensionality of the separator?
- Bias-variance tradeoff (example)
- How to deal with multiple classes?
- Any-of: build a separate classifier for each class
- One-of: harder (as J hyperplanes do not divide R^M into J regions); instead use class complements and scoring
42 Support vector machines
- Introduced by Vapnik in the early 1990s.
43 Issues with SVM
- Soft margins (inseparability)
- Kernels (non-linearity)
44 The kernel idea
(Figure: the data before and after the kernel mapping)
45 Example
(Figure: mapping to a higher-dimensional space)
46 The kernel trick
- Polynomial kernel: K(x, y) = (x·y + c)^d
- Sigmoid kernel: K(x, y) = tanh(κ x·y + θ)
- RBF kernel: K(x, y) = exp(-||x - y||² / (2σ²))
- Many other kernels are useful for IR, e.g., string kernels, subsequence kernels, tree kernels, etc.
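The trick itself is that a kernel value equals an inner product in a higher-dimensional feature space without ever mapping the points. A sketch with the quadratic kernel K(x, y) = (x·y)², whose explicit feature map in 2-D is φ(x) = (x1², x2², √2·x1x2):

```python
import math

# Kernel trick check: K(x, y) = (x·y)^2 equals <phi(x), phi(y)>
# for the explicit map phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2).
def kernel(x, y):
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def phi(x):
    return (x[0] ** 2, x[1] ** 2, math.sqrt(2) * x[0] * x[1])

x, y = (1.0, 2.0), (3.0, 0.5)
lhs = kernel(x, y)                                  # kernel in input space
rhs = sum(a * b for a, b in zip(phi(x), phi(y)))    # dot product in feature space
print(abs(lhs - rhs) < 1e-9)  # True
```

For the RBF kernel the implicit feature space is infinite-dimensional, which is exactly why computing K directly matters.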
47 SVM (contd)
- Evaluation
- SVM > kNN > decision tree > NB
- Implementation
- Quadratic optimization
- Use a toolkit (e.g., Thorsten Joachims's SVMlight)
48 Semi-supervised learning
- EM
- Co-training
- Graph-based
49 Exploiting hyperlinks: co-training
- Each document instance has two alternate views (Blum and Mitchell 1998)
- terms in the document, x1
- terms in the hyperlinks that point to the document, x2
- Each view is sufficient to determine the class of the instance
- The labeling function that classifies examples is the same applied to x1 or x2
- x1 and x2 are conditionally independent, given the class
Slide from Pierre Baldi
50 Co-training algorithm
- Labeled data are used to infer two Naïve Bayes classifiers, one for each view
- Each classifier will
- examine unlabeled data
- pick the most confidently predicted positive and negative examples
- add these to the labeled examples
- Classifiers are now retrained on the augmented set of labeled examples
Slide from Pierre Baldi
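The control flow of the algorithm above can be sketched as follows. Everything here is a stand-in: the two views are (document text, link text) pairs, the tiny data set is made up, and a trivial keyword scorer replaces the Naïve Bayes classifiers purely to keep the loop visible:

```python
# Schematic co-training loop. Each example is ((view0_text, view1_text), label).
labeled = [(("cheap pills", "spam link"), 1), (("meeting notes", "work link"), 0)]
unlabeled = [("buy cheap meds", "spam link"), ("agenda for meeting", "work link")]

def train(examples, view):
    # Stand-in classifier: score by word overlap with each class, on one view.
    pos_words = {w for x, y in examples if y == 1 for w in x[view].split()}
    neg_words = {w for x, y in examples if y == 0 for w in x[view].split()}
    return lambda x: (sum(w in pos_words for w in x[view].split())
                      - sum(w in neg_words for w in x[view].split()))

for _ in range(2):                                  # co-training rounds
    c1, c2 = train(labeled, 0), train(labeled, 1)   # one classifier per view
    for clf in (c1, c2):
        if not unlabeled:
            break
        scored = sorted(unlabeled, key=clf)
        pos, neg = scored[-1], scored[0]            # most confident each way
        labeled.append((pos, 1))                    # add self-labeled examples
        unlabeled.remove(pos)
        if neg is not pos and neg in unlabeled:
            labeled.append((neg, 0))
            unlabeled.remove(neg)

print(len(labeled), len(unlabeled))
```

Each round, the classifier trained on one view labels examples that then help retrain the classifier for the other view, which is where the conditional-independence assumption earns its keep.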
51 Conclusion
- SVMs are widely considered to be the best method for text classification (look at papers by Sebastiani, Cristianini, Joachims), e.g., 86% accuracy on Reuters.
- NB is also good in many circumstances
52 Readings
- For October 11: MRS18
- For October 18: MRS17, MRS19