Title: Text Categorization
1. Text Categorization
David Madigan, Rutgers University
joint work with Alex Genkin and David Lewis
2. Statistical Analysis of Text
- Statistical text analysis has a long history in literary analysis and in solving disputed authorship problems
- First (?) was Thomas C. Mendenhall in 1887
3. Mendenhall
- Mendenhall was Professor of Physics at Ohio State and at the University of Tokyo, Superintendent of the U.S. Coast and Geodetic Survey, and, later, President of Worcester Polytechnic Institute
[Photo: Mendenhall Glacier, Juneau, Alaska]
4. [Figure: χ² = 127.2, df = 12]
5.
- Hamilton versus Madison
- Used Naïve Bayes with Poisson and Negative Binomial models
- Out-of-sample predictive performance
6. Today
- Statistical methods routinely used for textual analyses of all kinds
- Machine translation, part-of-speech tagging, information extraction, question answering, text categorization, disputed authorship (stylometry), etc.
- Not reported in the statistical literature
Mosteller, Wallace, Efron, Thisted
7. Text Categorization
- Automatic assignment of documents to categories
- Applications include e-mail filtering, pornography detection, medical coding, essay grading, and news filtering
- Modern TC research dates back to the 1960s; mostly knowledge-engineering approaches through the 1980s
- The statistical approach now dominates: learn a classifier from a set of labeled documents
8. Text Categorization Research
- Very active research area (e.g., Joachims' 1998 SVM paper has been cited in over 250 publications)
- Statisticians?
Sebastiani's Bibliography on Automated Text Categorization
9. Terminology, etc.
- Document representation via "bag of words"
- The w_i's might be 0/1, counts, or weights (e.g., tf-idf, LSI)
- Phrases, syntactic information, synonyms, NLP, etc.?
- Stopwords, stemming
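A minimal sketch of the tf-idf bag-of-words representation, assuming scikit-learn (not software from the talk); the documents are invented:

```python
# Bag-of-words with tf-idf weights: each document becomes a sparse
# vector over the vocabulary, with weights w_i as on the slide.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

# stop_words drops common function words; the default tokenizer
# lowercases and splits on word boundaries (no stemming).
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)   # sparse matrix: documents x vocabulary

print(vectorizer.get_feature_names_out())  # the vocabulary
print(X.toarray())                         # the tf-idf weights
```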
10. Test Collections
- Reuters-21578: 9,603 training documents, 3,299 test documents, 90 categories, multi-label
- New Reuters: 800,000 documents
- Medline: 11,000,000 documents, MeSH headings
- TREC conferences and collections
- Newsgroups, WebKB
11. Reuters-21578 Evaluation
- Binary classifiers: with a 2x2 table of counts a, b, c, d (rows = predicted 0/1, columns = true 0/1),
  recall = d/(b+d) (sensitivity)
  precision = d/(c+d) (predictive value positive)
- Multiple binary classifiers: micro-averaging pools the counts across classifiers. In the slide's example, two classifiers with (p = 1.0, r = 1.0) and (p = 0.5, r = 1.0) yield micro-averaged precision 2/3.
- F1 measure: the harmonic mean of precision and recall
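A small sketch of these evaluation quantities in Python; the two classifiers' counts are hypothetical but chosen to reproduce the slide's numbers:

```python
# Per-classifier precision/recall/F1, plus micro-averaged precision,
# which pools counts across classifiers before taking the ratio.
def prf(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0   # harmonic mean
    return p, r, f1

# Hypothetical counts: classifier A has p = 1.0, r = 1.0;
# classifier B has p = 0.5, r = 1.0.
counts = [dict(tp=1, fp=0, fn=0), dict(tp=1, fp=1, fn=0)]
for c in counts:
    print(prf(**c))

# Micro-averaged precision pools the counts: (1 + 1) / (1 + 2) = 2/3.
tp = sum(c["tp"] for c in counts)
fp = sum(c["fp"] for c in counts)
print(tp / (tp + fp))
```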
12. Reuters Results
13. AdaBoost.MH
- Multiclass, multi-label
- At each iteration, learns a simple score-producing classifier on weighted training data and then updates the weights
- Final decision averages over the classifiers
[Diagram: data with initial weights → score from simple classifier → revised weights]
14. AdaBoost.MH (Schapire and Singer, 2000)
15. AdaBoost.MH's weak learner is a stump (two words!)
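As a hedged illustration (not BoosTexter itself), a weak learner of this kind can be sketched as a stump on the presence or absence of a single word, chosen to minimize weighted training error:

```python
# A decision stump over word presence, the kind of weak learner
# AdaBoost.MH uses for text. X is a binary doc-term matrix.
import numpy as np

def best_stump(X, y, w):
    """X: (n x V) binary matrix; y in {-1,+1}; w: document weights."""
    best = None
    for j in range(X.shape[1]):
        present = X[:, j] == 1
        for sign in (+1, -1):                  # predict +sign if word present
            pred = np.where(present, sign, -sign)
            err = w[pred != y].sum()           # weighted error
            if best is None or err < best[0]:
                best = (err, j, sign)
    return best  # (weighted error, word index, sign)

# Toy data: two documents, two words.
X = np.array([[1, 0], [0, 1]])
y = np.array([+1, -1])
w = np.array([0.5, 0.5])
print(best_stump(X, y, w))   # picks a word that splits the labels

# AdaBoost then increases the weights of misclassified documents,
# and the final score averages the stumps' weighted votes.
```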
16. AdaBoost.MH Comments
- Software implementation: BoosTexter
- Some theoretical support in terms of bounds on generalization error
- 3 days of CPU time for Reuters with 10,000 boosting iterations
17. Support Vector Machine
Two-class classifier of the form $f(x) = \mathrm{sign}(h(x))$, $h(x) = \sum_i w_i K(x, x_i) + b$, with parameters chosen to minimize
$\sum_i [1 - y_i h(x_i)]_+ + \lambda\, w^\top K w$
where $\lambda$ is a tuning constant, $w^\top K w$ is the complexity penalty, and $K$ is the Gram matrix. Many of the fitted $w_i$ are usually zero; the $x_i$ corresponding to the non-zero $w_i$ are the support vectors.
18. Hastie, Friedman and Tibshirani
19. SVM Comments
- Polynomial ($K(x, x') = (1 + \langle x, x' \rangle)^d$) or radial basis function ($K(x, x') = \exp(-\|x - x'\|^2 / c)$) kernels often used
- In fact, for text categorization, not using a kernel seems to do a little better than using a kernel!
- Generalization bounds available, but not useful in practice?
- Very similar to a form of ridge logistic regression, which provides similar Reuters performance (Zhang and Oles, 2002)
- Software: SVM Light, WEKA, etc.
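Since the linear (no-kernel) SVM works well for text, here is a hedged sketch using scikit-learn rather than the packages the slide names; the documents and labels are invented:

```python
# Linear SVM text classifier: tf-idf features feeding a hinge-loss
# linear classifier, the "no kernel" setup the slide recommends.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Toy labeled documents (hypothetical, for illustration only).
docs = ["oil prices rise", "wheat harvest falls", "crude oil exports"]
labels = ["oil", "grain", "oil"]

clf = make_pipeline(TfidfVectorizer(), LinearSVC(C=1.0))  # C: tuning constant
clf.fit(docs, labels)
print(clf.predict(["oil futures climb"]))
```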
20. Zhang and Oles
- Ridge logistic regression
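The objective itself did not survive extraction; as a hedged reconstruction in standard notation (with labels $y_i \in \{-1, +1\}$), ridge logistic regression solves

```latex
\hat{\beta} = \arg\min_{\beta}\;
  \sum_{i=1}^{n} \log\!\bigl(1 + \exp(-y_i\,\beta^{\top} x_i)\bigr)
  \;+\; \lambda \lVert \beta \rVert_2^2
```

i.e., the logistic log-likelihood with a quadratic (Gaussian-prior) penalty in place of the SVM's hinge loss.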
21. Zhang & Oles Results
- 10,000 binary features selected via an information-gain criterion
- Numerical optimization: Gauss-Seidel with a trust region
22. Tibshirani's LASSO
Least absolute shrinkage and selection operator:
$\hat{w} = \arg\min_w \sum_i (y_i - w^\top x_i)^2 \ \text{subject to} \ \sum_j |w_j| \le t$
23. Bayesian Sparse Model
LASSO = MAP estimation with a Laplace prior
- Simultaneous feature selection and shrinkage
- Outperforms SVM and random forests on several standard (small) problems using a Gaussian kernel (Figueiredo, 2001)
- Similar to Tipping's Relevance Vector Machine (JMLR, 2001)
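MAP estimation under a Laplace prior corresponds to L1-penalized (lasso-type) logistic regression; a minimal sketch with scikit-learn on synthetic data (not the authors' own implementation):

```python
# L1-penalized logistic regression: the Laplace-prior MAP estimate,
# giving simultaneous shrinkage and feature selection (sparse fit).
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100, n_features=500,
                           n_informative=10, random_state=0)

# C is the inverse regularization strength: smaller C = sharper prior.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(X, y)
print((clf.coef_ != 0).sum(), "non-zero coefficients out of 500")
```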
24. [Figure: Laplace versus Gaussian prior densities]
25. Bayes versus Sparse Bayes: Old Reuters
- 20,320 log-tf features
- Sparse models have from 52 to 357 non-zero posterior modes
- Modified Zhang-Oles algorithm for Laplace and probit
26. Bayes versus Sparse: New Reuters
- 47,152 log-tf features, 20,000 labeled documents
- No ad hoc feature selection
- With 1,000 labeled documents, F1 = 0.7
27. Dense versus Sparse: OHSUMED
- 122,076 log-tf features, 20,000 labeled documents
- No ad hoc feature selection
- Median number of features used: 34
28. How Bayes?
- EM, ECM, Gauss-Seidel
- MCMC, variational methods
- Online EM, Quasi-Bayes (Titterington, 1984; Smith and Makov, 1978)
- Sequential MC (Ridgeway and Madigan, 2003; Chopin, 2002)
29. Full Bayes versus MAP Bayes
30. Approximate Online Sparse Bayes
- Quasi-Bayes: optimal Gaussian approximation to the posterior as each new observation arrives
- Alternative: quadratic approximation to the log-likelihood of each new observation at the current mode
- Shooting algorithm (Fu, 1998)
31. Shooting
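A minimal sketch of the shooting (coordinate descent) algorithm for the lasso, written here as an illustration (not the talk's code) and assuming the columns of X are roughly standardized:

```python
# Shooting (Fu, 1998): cycle through coordinates, soft-thresholding each
# one with the others held fixed, for 0.5*||y - Xw||^2 + lam*||w||_1.
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * max(abs(z) - t, 0.0)

def shooting_lasso(X, y, lam, n_iter=100):
    n, p = X.shape
    w = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0)            # per-coordinate curvature
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ w + X[:, j] * w[j]   # partial residual without coord j
            z = X[:, j] @ r
            w[j] = soft_threshold(z, lam) / col_ss[j]
    return w

# Large lam drives many coordinates exactly to zero: sparse solutions.
```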
32. [Figure] pima (UCI): batch n = 40, online n = 160
33. Why Bayes?
- Can incorporate external information, e.g., topic descriptions
- Natural sequential learning paradigm
- Borrowing strength across topics
- Simultaneous feature selection and shrinkage via sparse priors
34. Conclusions
- Regularization/shrinkage is critical for predictive modeling with HDLSS ("short, fat") data
- The sparse Bayesian classifier is highly competitive and performs simultaneous feature selection and shrinkage
- Hierarchical partition model for the multi-label setting
- Full Bayes versus MAP plug-in
35. Part-of-Speech Tagging
- Assign grammatical tags to words
- A basic task in the analysis of natural-language data
- Phrase identification, entity extraction, etc.
- Ambiguity: "tag" could be a noun or a verb ("a tag is a part-of-speech label"); context resolves the ambiguity
36. The Penn Treebank POS Tag Set
37. POS Tagging Process [Figure: Berlin Chen]
38. POS Tagging Algorithms
- Rule-based taggers: large numbers of hand-crafted rules
- Probabilistic taggers: use a tagged corpus to train some sort of model, e.g., an HMM
[Diagram: hidden tags tag1 → tag2 → tag3 emitting word1, word2, word3]
- Clever tricks for reducing the number of parameters
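A toy sketch of the pictured HMM and Viterbi decoding; the tag set and all probabilities are hypothetical illustrations:

```python
# HMM tagging: hidden tags emit words; Viterbi recovers the most
# probable tag sequence given per-word emission log-probabilities.
import numpy as np

tags = ["NOUN", "VERB"]
trans = np.log(np.array([[0.7, 0.3],      # p(tag_t | tag_{t-1})
                         [0.6, 0.4]]))
start = np.log(np.array([0.8, 0.2]))      # p(tag_1)

def viterbi(emit_logprobs):
    """emit_logprobs: (T x K) array of log p(word_t | tag_k)."""
    T, K = emit_logprobs.shape
    score = start + emit_logprobs[0]
    back = []
    for t in range(1, T):
        cand = score[:, None] + trans + emit_logprobs[t]
        back.append(cand.argmax(axis=0))  # best predecessor for each tag
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for b in reversed(back):              # follow back-pointers
        path.append(int(b[path[-1]]))
    return [tags[k] for k in reversed(path)]

# Toy emission log-probs for a 3-word sentence (hypothetical numbers).
emit = np.log(np.array([[0.6, 0.4], [0.3, 0.7], [0.8, 0.2]]))
print(viterbi(emit))
```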
39. Some Details
Charniak et al., 1993, achieved 95% accuracy on the Brown Corpus with
p(tag i | word j) = (number of times word j appears with tag i) / (number of times word j appears)
and, for words never seen in training,
p(tag i | unknown word) = (number of times a word that had never been seen with tag i gets tag i) / (number of such occurrences in total)
plus a modification that uses word suffixes.
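A toy illustration of the count-ratio estimate above, using invented counts for the ambiguous word "flies":

```python
# p(tag | word) estimated as count(word, tag) / count(word).
counts = {("flies", "NN"): 21, ("flies", "VBZ"): 13}   # hypothetical counts
word_total = sum(n for (w, _), n in counts.items() if w == "flies")

for (w, tag), n in counts.items():
    print(f"p({tag} | {w}) = {n / word_total:.3f}")
```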
40. Recent Developments
- Toutanova et al., 2003, use a "dependency network" and a richer feature set
- Log-linear model for $t_i \mid t_{-i}, w$
- The model included, for example, features for whether the word contains a number, uppercase characters, a hyphen, etc. (up to 300,000 features)
- Regularization of the estimation process is critical (Gaussian priors)
- 96.6% accuracy on the Penn corpus
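A sketch of the kind of feature templates such a log-linear tagger might use; the templates here are illustrative, not Toutanova et al.'s exact set:

```python
# Rich features for one token: surface-form tests plus tag context on
# both sides (the "dependency network" conditions on both neighbors).
def features(word, prev_tag, next_tag):
    return {
        "word=" + word.lower(): 1.0,
        "prev_tag=" + prev_tag: 1.0,
        "next_tag=" + next_tag: 1.0,
        "has_digit": float(any(c.isdigit() for c in word)),
        "has_upper": float(any(c.isupper() for c in word)),
        "has_hyphen": float("-" in word),
        "suffix3=" + word[-3:]: 1.0,
    }

print(features("stock-index", "DT", "NN"))
```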
41. Named-Entity Classification
- "Mrs. Frank" is a person
- "Steptoe and Johnson" is a company
- "Honduras" is a location
"Finally we come to Jordan..." But "Jordan, a vice president of Steptoe & Johnson,..." But "Mr. Jordan, a vice president of Steptoe & Johnson,..."
(spelling clue: "Mr."; context clue: "a vice president of")
42. NYMBLE (Bikel et al., 1998)
[Diagram: hidden name classes nc1 → nc2 → nc3 emitting word1, word2, word3]
- Name classes: Not-A-Name, Person, Location, etc.
- Smoothing for sparse training data; word features
- Training: 100,000 words from the WSJ
- Accuracy: 93%
- 450,000 words → same accuracy
43. Training-Development-Test
44. Collins and Singer: Co-training
- POS tagging to identify proper nouns and their contexts
45.
- Start with a set of seed rules and lots of unlabeled data
- Seed rules
- Label the data using spelling rules, then learn context rules
- Label the data using context rules, then learn spelling rules (see the sketch below)
- The algorithm achieved 91% accuracy
"... says Mr. Cooper, a vice-president of ..." (spelling clue: "Mr."; context clue: the appositive)
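A schematic, runnable sketch of the co-training loop on toy data; majority-vote rule induction stands in for the confidence-thresholded decision lists of Collins and Singer, and the examples and seeds are invented:

```python
# Co-training over two views: spelling rules label data to train context
# rules, and context rules label data to train spelling rules.
from collections import Counter, defaultdict

data = [  # each example has a spelling view and a context view (toy)
    {"spelling": "Mr.", "context": "appositive=president"},
    {"spelling": "Cooper", "context": "appositive=president"},
    {"spelling": "Incorporated", "context": "preceded-by=unit-of"},
    {"spelling": "Steptoe", "context": "preceded-by=unit-of"},
]

def label_with(rules, view):
    # Label each example by looking up its view feature in the rule table.
    return [(x, rules.get(x[view])) for x in data]

def induce_rules(labeled, view):
    # Learn feature -> class rules by majority vote over labeled examples.
    votes = defaultdict(Counter)
    for x, label in labeled:
        if label is not None:
            votes[x[view]][label] += 1
    return {feat: c.most_common(1)[0][0] for feat, c in votes.items()}

spelling_rules = {"Mr.": "person", "Incorporated": "organization"}  # seeds
context_rules = {}
for _ in range(3):  # alternate between the two views
    context_rules.update(induce_rules(label_with(spelling_rules, "spelling"), "context"))
    spelling_rules.update(induce_rules(label_with(context_rules, "context"), "spelling"))

print(spelling_rules)  # "Cooper" and "Steptoe" acquire labels via context
```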
46. Standard Rule Induction Algorithm
k = 3 (the number of classes)