Title: Statistical Methods for Text Classification
1. Statistical Methods for Text Classification
David Madigan, Rutgers University / DIMACS, stat.rutgers.edu/madigan
David D. Lewis, www.daviddlewis.com
Joint work with Alex Genkin, Paul Kantor, Vladimir Menkov, Aynur Dayanik, Dmitriy Fradkin
2. Statistical Analysis of Text
- Statistical text analysis has a long history in literary analysis and in solving disputed authorship problems
- First (?) was Thomas C. Mendenhall in 1887
4.
- Used Naïve Bayes with Poisson and Negative Binomial models
- Out-of-sample predictive performance
5. Today
- Statistical methods routinely used for textual analyses of all kinds
- Machine translation, part-of-speech tagging, information extraction, question-answering, text categorization, etc.
- Not reported in the statistical literature
6. Text Categorization
- Automatic assignment of documents with respect to a manually defined set of categories
- Applications: automated indexing, spam filtering, content filters, medical coding, CRM, essay grading
- The dominant technology is supervised machine learning
- Manually classify some documents, then learn a classification rule from them (possibly with manual intervention)
7. Document Representation
- Documents are usually represented as a bag of words
- The x_i's might be 0/1, counts, or weights (e.g. TF-IDF, LSI); see the sketch after this list
- Many text processing choices: stopwords, stemming, phrases, synonyms, NLP, etc.
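As an illustration of these representation choices, here is a minimal sketch using scikit-learn (a library not mentioned in the slides); the toy documents, stopword list, and normalization settings are assumptions for the example.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "statistical methods for text classification",
    "text categorization with logistic regression",
]

# 0/1 or count representation (bag of words)
counts = CountVectorizer(stop_words="english").fit_transform(docs)

# TF-IDF weights with cosine (L2) normalization, as on the Data Sets slide
tfidf = TfidfVectorizer(stop_words="english", norm="l2").fit_transform(docs)

print(counts.toarray())
print(tfidf.toarray())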
8. Classifier Representation
- For instance, a linear classifier (sketch below):
  f(x_i) = \sum_j \beta_j x_{ij}, \qquad \hat{y}_i = +1 \text{ if } f(x_i) > 0, \text{ else } \hat{y}_i = -1
- The x_i's are derived from the text of the document
- y_i indicates whether to put the document in the category
- The \beta_j are parameters chosen to give good classification effectiveness
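A minimal sketch of the linear classification rule above, assuming NumPy and dense document vectors; the function name is purely illustrative.

import numpy as np

def linear_classify(beta, x):
    """Score a document vector x with weights beta: f(x) = sum_j beta_j * x_j,
    then assign +1 if the score is positive, else -1."""
    return 1 if np.dot(beta, x) > 0 else -1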
9. Logistic Regression Model
- Linear model for the log odds of category membership:
  \log \frac{p(y = +1 \mid x_i)}{p(y = -1 \mid x_i)} = \sum_j \beta_j x_{ij} = \beta^{\top} x_i
- A conditional probability model
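The same model written directly as a conditional probability; a small sketch, assuming y in {-1, +1} and a dense feature vector.

import numpy as np

def prob_positive(beta, x):
    """P(y = +1 | x) under the logistic model: 1 / (1 + exp(-beta'x)).
    The log odds of this probability is exactly the linear predictor beta'x."""
    return 1.0 / (1.0 + np.exp(-np.dot(beta, x)))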
10. Maximum Likelihood Training
- Choose parameters (the \beta_j's) that maximize the probability (likelihood) of the class labels (y_i's) given the documents (x_i's); the loss being minimized is sketched below
- Maximizing the (log-)likelihood can be viewed as minimizing a loss function
- Tends to overfit. Not defined if d > n. Hence feature selection.
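For concreteness, the loss that maximum likelihood training minimizes is the negative log-likelihood; a sketch, assuming labels coded as -1/+1 and no penalty term (which is why it can overfit when d > n).

import numpy as np

def neg_log_likelihood(beta, X, y):
    """Negative log-likelihood sum_i log(1 + exp(-y_i * beta'x_i)) for the
    logistic model, with y_i in {-1, +1}. Unpenalized, so it overfits when
    there are more features than documents."""
    margins = y * (X @ beta)
    return np.sum(np.logaddexp(0.0, -margins))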
11. Shrinkage Methods
- Feature selection is a discrete process: individual variables are either in or out. Combinatorial nightmare.
- This can have high variance: a different dataset from the same source can result in a totally different model
- Shrinkage methods allow a variable to be partly included in the model. That is, the variable is included but with a shrunken coefficient.
- An elegant way to tackle over-fitting
12. Ridge Regression

  \min_{\beta} \sum_i (y_i - \beta^{\top} x_i)^2 \quad \text{subject to} \quad \sum_j \beta_j^2 \le s

Equivalently:

  \hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \sum_i (y_i - \beta^{\top} x_i)^2 + \lambda \sum_j \beta_j^2

This leads to \hat{\beta}^{\text{ridge}} = (X^{\top} X + \lambda I)^{-1} X^{\top} y. Choose \lambda by cross-validation; works even when X^{\top} X is singular.
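A sketch of the closed-form ridge solution for the linear-regression case; ridge logistic regression has no closed form, and \lambda is still chosen by cross-validation as the slide says. The function name and use of NumPy are assumptions.

import numpy as np

def ridge_fit(X, y, lam):
    """Ridge estimate (X'X + lam*I)^{-1} X'y; the added lam*I makes the
    system solvable even when X'X is singular (e.g. when d > n)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)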
14. Ridge Regression = Bayesian MAP Regression
- Suppose we believe each \beta_j is near 0
- Encode this belief as separate Gaussian prior distributions over the values of \beta_j
- Choosing the maximum a posteriori (MAP) value of \beta gives the same result as ridge logistic regression (sketch below)
- A N(0, \tau^2) prior on each \beta_j is the same as ridge with \lambda = 1/(2\tau^2)
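A sketch of the equivalence, assuming independent N(0, \tau^2) priors on the coefficients and log-likelihood \ell(\beta):

\hat{\beta}_{\mathrm{MAP}}
  = \arg\max_{\beta}\Bigl[\ell(\beta) + \sum_j \log p(\beta_j)\Bigr]
  = \arg\min_{\beta}\Bigl[-\ell(\beta) + \frac{1}{2\tau^2}\sum_j \beta_j^2\Bigr],

i.e. ridge with \lambda = 1/(2\tau^2).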
15. Least Absolute Shrinkage and Selection Operator (LASSO)
Tibshirani (1996)

  \min_{\beta} \sum_i (y_i - \beta^{\top} x_i)^2 \quad \text{subject to} \quad \sum_j |\beta_j| \le s

A quadratic programming algorithm is needed to solve for the parameter estimates.
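A minimal sketch of lasso fitting with scikit-learn (illustrative only; the talk's own software is BBR, and the synthetic data and penalty weight below are assumptions).

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 200))          # d > n, as is typical for text data
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=50)

model = Lasso(alpha=0.1).fit(X, y)      # L1 penalty drives most coefficients to exactly 0
print("non-zero coefficients:", int((model.coef_ != 0).sum()))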
17. Same as putting a double exponential (Laplace) prior on each \beta_j
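A sketch of why: with density p(\beta_j \mid \lambda) = \frac{\lambda}{2} e^{-\lambda |\beta_j|} on each coefficient,

  -\log p(\beta \mid \lambda) = \lambda \sum_j |\beta_j| + \text{const},

so the MAP estimate minimizes the negative log-likelihood plus an L1 penalty, i.e. the lasso criterion.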
18. Data Sets
- ModApte subset of Reuters-21578
  - 90 categories, 9,603 training docs, 18,978 features
- Reuters RCV1-v2
  - 103 categories, 23,149 training docs, 47,152 features
- OHSUMED heart disease categories
  - 77 categories, 83,944 training docs, 122,076 features
- Cosine-normalized TFxIDF weights
19. Dense vs. Sparse Models (Macroaveraged F1)
22. Bayesian Unsupervised Feature Selection and Weighting
- Stopwords: low-content words that are typically discarded
  - Give them a prior with mean 0 and low variance
- Inverse document frequency (IDF) weighting
  - Rare words are more likely to be content indicators
  - Make the variance of the prior inversely proportional to frequency in the collection (see the sketch below)
- Experiments in progress
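A minimal sketch of the frequency-scaled prior variances described above; the function name and the base variance constant are assumptions.

import numpy as np

def frequency_scaled_prior_variances(doc_freq, base_var=1.0):
    """Prior variance for each word's coefficient, inversely proportional to
    how often the word occurs in the collection: frequent (stopword-like)
    words are shrunk hard toward the 0 prior mean, rare words get more
    freedom to act as content indicators."""
    doc_freq = np.asarray(doc_freq, dtype=float)
    return base_var / doc_freq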
23. Bayesian Use of Domain Knowledge
- We often believe that certain words are positively or negatively associated with a category
- The prior mean can encode the strength of positive or negative association
- The prior variance encodes confidence
24. First Experiments
- 27 RCV1-v2 Region categories
- CIA World Factbook entry for each country
- Give content words higher prior mean and/or variance
- Only 10 training examples per category
  - Shows off prior knowledge
  - Limited data is often the case in applications
25. Results (Preliminary)
26. Polytomous Logistic Regression
- Logistic regression trivially generalizes to 1-of-K problems (sketch below)
  - Cleaner than SVMs, error-correcting codes, etc.
- The Laplace prior is particularly appealing here
  - Suppose 99 classes and a word that predicts class 17
  - The word gets used 100 times if you build 100 models, or if you use polytomous with a Gaussian prior
  - With a Laplace prior and polytomous, it is used only once
- Experiments in progress, particularly on author identification
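For reference, the polytomous (multinomial) model for a 1-of-K problem can be written as

  P(y = k \mid x) = \frac{\exp(\beta_k^{\top} x)}{\sum_{k'=1}^{K} \exp(\beta_{k'}^{\top} x)}, \qquad k = 1, \dots, K,

so a Laplace prior on every \beta_{kj} can zero out a word's coefficient in all but the few classes where it actually carries signal.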
27. 1-of-K Sample Results: brittany-l
89 authors with at least 50 postings. 10,076 training documents, 3,322 test documents.
BMR-Laplace classification, default hyperparameter.
28. 1-of-K Sample Results: brittany-l
4.6 million parameters.
89 authors with at least 50 postings. 10,076 training documents, 3,322 test documents.
BMR-Laplace classification, default hyperparameter.
29. How Bayes?
EM, ECM, Gauss-Seidel
MCMC
Online EM, Quasi-Bayes (Titterington, 1984; Smith & Makov, 1978)
Sequential MC (Chopin, 2002; Ridgeway & Madigan, 2003)
30. Approximate Online Sparse Bayes
- Quasi-Bayes: optimal Gaussian approximation to the posterior as each new observation arrives
- Alternative: quadratic approximation to the log-likelihood of each new observation at the current mode
- Shooting algorithm (Fu, 1998); sketched below
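A sketch of the shooting (coordinate descent) update for the lasso-penalized least-squares problem; the talk applies the same idea to a quadratic approximation of the logistic log-likelihood, so this standalone version is illustrative only.

import numpy as np

def soft_threshold(z, gamma):
    """Soft-thresholding operator used by the shooting algorithm."""
    return np.sign(z) * max(abs(z) - gamma, 0.0)

def shooting_lasso(X, y, lam, n_sweeps=100):
    """Cycle through coordinates, solving each one-dimensional lasso problem
    exactly while holding the other coefficients fixed."""
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(n_sweeps):
        for j in range(d):
            r_j = y - X @ beta + X[:, j] * beta[j]   # partial residual excluding feature j
            beta[j] = soft_threshold(X[:, j] @ r_j, lam) / (X[:, j] @ X[:, j])
    return beta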
31. Shooting
32. pima (UCI), batch n = 40, online n = 160
33. Sequential MC
- Here "time" is accumulating data
- Standard particle filtering ideas apply
- Need some way to deal with degeneracy
- Gilks and Berzuini (2001) resample-move: effective, but not a one-pass algorithm
- Balakrishnan & Madigan (2004) use the Liu & West density estimation shrinkage idea to make a one-pass version
35. Liu and West (2000)
38. Text Categorization Summary
- Conditional probability models (logistic, probit, etc.)
- As powerful as other discriminative models (SVM, boosting, etc.)
- The Bayesian framework provides a much richer ability to insert task knowledge
- Code: http://stat.rutgers.edu/madigan/BBR
- Polytomous and domain-specific priors now available
39. The Last Slide
- Statistical methods for text mining work well on certain types of problems
- Many problems remain unsolved
  - Which financial news stories are likely to impact the market?
  - Where did soccer originate?
  - Attribution
40. Hastie, Friedman & Tibshirani
41. Outline
- Part-of-speech tagging, entity recognition
- Text categorization
- Logistic regression and friends
- The richness of Bayesian regularization
- Sparseness-inducing priors
- Word-specific priors: stop words, IDF, domain knowledge, etc.
- Polytomous logistic regression
42. Part-of-Speech Tagging
- Assign grammatical tags to sequences of words
- A basic task in the analysis of natural language data
- Phrase identification, entity extraction, etc.
- Ambiguity: "tag" could be a noun or a verb (as in "a tag is a part-of-speech label"); context resolves the ambiguity
43. The Penn Treebank POS Tag Set
44. POS Tagging Algorithms
- Rule-based taggers: large numbers of hand-crafted rules
- Probabilistic taggers: use a tagged corpus to train some sort of model, e.g. an HMM (see the factorization after this list)
  (HMM diagram: hidden tags tag1, tag2, tag3 emitting words word1, word2, word3)
- Clever tricks for reducing the number of parameters (aka priors)
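For an HMM tagger like the one sketched in the diagram, the joint probability of a tag sequence and word sequence factorizes as (a standard formulation, not taken verbatim from the slides):

  p(t_1,\dots,t_n,\; w_1,\dots,w_n) = \prod_{i=1}^{n} p(t_i \mid t_{i-1})\, p(w_i \mid t_i),

and tagging picks the tag sequence maximizing this, typically with the Viterbi algorithm.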
45. Some Details
Charniak et al. (1993) achieved 95% accuracy on the Brown Corpus with

  p(tag i | word j) = (number of times word j appears with tag i) / (number of times word j appears)

and, for previously unseen words,

  (number of times a word that had never been seen with tag i gets tag i) / (number of such occurrences in total),

plus a modification that uses word suffixes.
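A minimal sketch of the relative-frequency estimate above, assuming a corpus given as (word, tag) pairs; unknown-word handling and the suffix modification are omitted.

from collections import Counter

def tag_given_word_probs(tagged_corpus):
    """Estimate p(tag i | word j) as
    (# times word j appears with tag i) / (# times word j appears)."""
    word_counts = Counter(word for word, _ in tagged_corpus)
    pair_counts = Counter(tagged_corpus)
    return {(w, t): c / word_counts[w] for (w, t), c in pair_counts.items()}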
46. Recent Developments
- Toutanova et al. (2003) use a dependency network and a richer feature set
- Log-linear model for p(t_i | t_{-i}, w)
- The model included, for example, features for whether the word contains a number, uppercase characters, a hyphen, etc.
- Regularization of the estimation process is critical
- 96.6% accuracy on the Penn corpus
47. Named-Entity Classification
- "Mrs. Frank" is a person
- "Steptoe and Johnson" is a company
- "Honduras" is a location
- etc.
- Bikel et al. (1998) from BBN: Nymble, a statistical approach using HMMs
48. (HMM diagram: hidden name classes nc1, nc2, nc3 emitting words word1, word2, word3)
- Name classes: Not-A-Name, Person, Location, etc.
- Smoothing for sparse training data; word features
- Training: 100,000 words from the WSJ
- Accuracy: 93%
- 450,000 training words gave the same accuracy
49. Training-development-test