Title: Statistical Methods for Text Mining
1. Statistical Methods for Text Mining
David Madigan, Rutgers University / DIMACS, stat.rutgers.edu/madigan
David D. Lewis, www.daviddlewis.com
Joint work with Alex Genkin, Paul Kantor, Vladimir Menkov, Aynur Dayanik, Dmitriy Fradkin
2. Statistical Analysis of Text
- Statistical text analysis has a long history in literary analysis and in solving disputed authorship problems
- First (?) was Thomas C. Mendenhall in 1887
3. Mendenhall
- Mendenhall was Professor of Physics at Ohio State and at the University of Tokyo, Superintendent of the U.S. Coast and Geodetic Survey, and later, President of Worcester Polytechnic Institute
Mendenhall Glacier, Juneau, Alaska
5.
- Used Naïve Bayes with Poisson and Negative Binomial model
- Out-of-sample predictive performance
6. Today
- Statistical methods routinely used for textual analyses of all kinds
- Machine translation, part-of-speech tagging, information extraction, question-answering, text categorization, etc.
- Not reported in the statistical literature
7. Text Categorization
- Automatic assignment of documents with respect to a manually defined set of categories
- Applications: automated indexing, spam filtering, content filters, medical coding, CRM, essay grading
- Dominant technology is supervised machine learning
- Manually classify some documents, then learn a classification rule from them (possibly with manual intervention)
8. Document Representation
- Documents usually represented as a bag of words (sketch below)
- The x_ij values might be 0/1, counts, or weights (e.g. tf/idf, LSI)
- Many text processing choices: stopwords, stemming, phrases, synonyms, NLP, etc.
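For illustration only (not from the slides): a minimal sketch of a bag-of-words / tf-idf representation built with scikit-learn, with made-up toy documents standing in for a real collection.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical); the talk's experiments use Reuters and OHSUMED documents
docs = [
    "interest rates rise as markets fall",
    "central bank cuts interest rates again",
    "new treatment for heart disease reported",
]

# Bag of words with tf/idf weights; dropping stopwords is one of the
# text-processing choices listed on the slide
vectorizer = TfidfVectorizer(stop_words="english", norm="l2")
X = vectorizer.fit_transform(docs)          # sparse n_docs x n_terms matrix
print(X.shape)
print(vectorizer.get_feature_names_out())
```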
9. Classifier Representation
- For instance, a linear classifier:
  f(x_i) = Σ_j β_j x_ij;   y_i = +1 if f(x_i) > 0, else y_i = −1
- x_i derived from the text of the document
- y_i indicates whether to put the document in the category
- β_j are parameters chosen to give good classification effectiveness
10. Logistic Regression Model
- Linear model for the log odds of category membership (sketch below):
  log [ p(y = +1 | x_i) / p(y = −1 | x_i) ] = Σ_j β_j x_ij = β·x_i
- Conditional probability model
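As a small illustration (not part of the deck), the logistic model maps the linear score to a class probability; the weights and document vector below are made up.

```python
import numpy as np

def prob_positive(beta, x):
    """p(y = +1 | x) under the logistic model: log-odds = beta . x."""
    return 1.0 / (1.0 + np.exp(-np.dot(beta, x)))

# Made-up weights and document feature vector
beta = np.array([0.8, -1.2, 0.0, 2.1])
x = np.array([1.0, 0.0, 3.0, 0.5])
print(prob_positive(beta, x))   # probability of category membership
```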
11. Maximum Likelihood Training
- Choose parameters (the β_j) that maximize the probability (likelihood) of the class labels (y_i) given the documents (x_i)
- Tends to overfit
- Not defined if d > n
- Feature selection.
12. Shrinkage Methods
- Feature selection is a discrete process: individual variables are either in or out. Combinatorial nightmare.
- This method can have high variance: a different dataset from the same source can result in a totally different model
- Shrinkage methods allow a variable to be partly included in the model. That is, the variable is included but with a shrunken coefficient
- An elegant way to tackle overfitting
13. Ridge Regression
  β̂ = argmin_β Σ_i (y_i − β·x_i)²   subject to   Σ_j β_j² ≤ s
Equivalently:
  β̂ = argmin_β { Σ_i (y_i − β·x_i)² + λ Σ_j β_j² }
This leads to β̂ = (XᵀX + λI)⁻¹ Xᵀy. Choose λ by cross-validation (sketch below).
Works even when XᵀX is singular.
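Not the authors' code, just a hedged sketch: the same L2 (ridge) penalty applied to logistic regression for text categorization in scikit-learn, with the penalty strength (C = 1/λ) chosen by cross-validation; the data here are random stand-ins.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Random stand-in for a tf/idf document-term matrix and +1/-1 labels
rng = np.random.default_rng(0)
X = rng.random((200, 50))
y = np.where(X[:, 0] + X[:, 1] > 1.0, 1, -1)

# L2 penalty = ridge; C = 1/lambda, chosen by cross-validation
grid = GridSearchCV(LogisticRegression(penalty="l2", max_iter=1000),
                    {"C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
grid.fit(X, y)
print(grid.best_params_)
```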
15. Ridge Regression = Bayesian MAP Regression
- Suppose we believe each β_j is near 0
- Encode this belief as separate Gaussian prior distributions over the values of β_j
- Choosing the maximum a posteriori value of β gives the same result as ridge logistic regression
- Same as ridge, with λ inversely proportional to the prior variance
16. Least Absolute Shrinkage and Selection Operator (LASSO)
(Tibshirani, 1996)
  β̂ = argmin_β Σ_i (y_i − β·x_i)²   subject to   Σ_j |β_j| ≤ s
- Quadratic programming algorithm needed to solve for the parameter estimates
- Modified Gauss-Seidel; highly tuned C implementation (sketch below)
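A hedged sketch (scikit-learn, not the BBR software): L1-penalized, lasso-style logistic regression, which likewise drives many coefficients exactly to zero; the data are synthetic stand-ins.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data; in the talk X would be a sparse tf/idf matrix
rng = np.random.default_rng(1)
X = rng.random((200, 100))
y = np.where(X[:, 0] - X[:, 1] > 0.0, 1, -1)

# L1 penalty corresponds to the lasso constraint (equivalently a Laplace prior)
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
clf.fit(X, y)
print("nonzero coefficients:", np.count_nonzero(clf.coef_))
```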
18. Same as putting a double-exponential (Laplace) prior on each β_j
19. LARS
- New geometrical insights into Lasso and Stagewise
- Leads to a highly efficient Lasso algorithm for linear regression
20. LARS
- Start with all coefficients β_j = 0
- Find the predictor x_j most correlated with y
- Increase β_j in the direction of the sign of its correlation with y. Take residuals r = y − ŷ along the way. Stop when some other predictor x_k has as much correlation with r as x_j has
- Increase (β_j, β_k) in their joint least squares direction until some other predictor x_m has as much correlation with the residual r
- Continue until all predictors are in the model (path sketch below)
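Not from the slides: a short sketch that traces the LARS/lasso coefficient path on synthetic data using scikit-learn's lars_path, showing the order in which predictors enter.

```python
import numpy as np
from sklearn.linear_model import lars_path

# Synthetic regression data standing in for real predictors
rng = np.random.default_rng(2)
X = rng.standard_normal((100, 10))
true_beta = np.zeros(10)
true_beta[:3] = [3.0, -2.0, 1.5]
y = X @ true_beta + 0.1 * rng.standard_normal(100)

# method="lasso" computes the lasso path via the LARS algorithm
alphas, active, coefs = lars_path(X, y, method="lasso")
print("order in which predictors enter:", active)
print("coefficient path shape:", coefs.shape)   # (n_features, n_steps)
```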
22. Data Sets
- ModApte subset of Reuters-21578
  - 90 categories; 9,603 training docs; 18,978 features
- Reuters RCV1-v2
  - 103 categories; 23,149 training docs; 47,152 features
- OHSUMED heart disease categories
  - 77 categories; 83,944 training docs; 122,076 features
- Cosine-normalized TFxIDF weights
23. Dense vs. Sparse Models (Macroaveraged F1)
26. Hastie, Friedman and Tibshirani
27. Bayesian Unsupervised Feature Selection and Weighting
- Stopwords: low-content words that typically are discarded
  - Give them a prior with mean 0 and low variance
- Inverse document frequency (IDF) weighting
  - Rare words more likely to be content indicators
  - Make the variance of the prior inversely proportional to frequency in the collection
- Experiments in progress
28. Bayesian Use of Domain Knowledge
- Often believe that certain words are positively or negatively associated with a category
- Prior mean can encode the strength of positive or negative association
- Prior variance encodes confidence (sketch below)
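To make this concrete, here is a hedged sketch (plain scipy, not the authors' BBR/BMR software) of MAP logistic regression with word-specific Gaussian prior means and variances; all of the data and prior settings below are made up.

```python
import numpy as np
from scipy.optimize import minimize

def map_logistic(X, y, prior_mean, prior_var):
    """MAP estimate for logistic regression with independent Gaussian priors
    beta_j ~ N(prior_mean[j], prior_var[j]); labels y are in {-1, +1}."""
    def neg_log_posterior(beta):
        margins = y * (X @ beta)
        nll = np.sum(np.logaddexp(0.0, -margins))             # -log likelihood
        penalty = np.sum((beta - prior_mean) ** 2 / (2.0 * prior_var))
        return nll + penalty
    res = minimize(neg_log_posterior, x0=prior_mean.copy(), method="L-BFGS-B")
    return res.x

# Made-up example: word 0 is believed positively associated with the category,
# so its prior gets a positive mean and a larger variance (more freedom)
rng = np.random.default_rng(3)
X = rng.random((50, 5))
y = np.where(X[:, 0] > 0.5, 1, -1)
prior_mean = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
prior_var = np.array([4.0, 1.0, 1.0, 1.0, 1.0])
print(map_logistic(X, y, prior_mean, prior_var))
```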
29. First Experiments
- 27 RCV1-v2 Region categories
- CIA World Factbook entry for country
- Give content words higher mean and/or variance
- Only 10 training examples per category
- Shows off prior knowledge
- Limited data often the case in applications
30. Results (Preliminary)
31. Polytomous Logistic Regression
- Logistic regression trivially generalizes to 1-of-k problems (sketch below)
- Cleaner than SVMs, error-correcting codes, etc.
- Laplace prior particularly appealing here
  - Suppose 99 classes and a word that predicts class 17
  - The word gets used 100 times if we build 100 models, or if we use polytomous with a Gaussian prior
  - With a Laplace prior and polytomous it's used only once
- Experiments in progress, particularly on author id
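A minimal sketch (scikit-learn, not the authors' BMR software) of polytomous (multinomial) logistic regression with an L1 penalty, the penalized-likelihood analogue of the Laplace prior; the data are synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic 1-of-k data standing in for, say, an author-identification task
X, y = make_classification(n_samples=500, n_features=200, n_informative=20,
                           n_classes=5, random_state=0)

# Multinomial (polytomous) model; the L1 penalty plays the role of a Laplace prior
clf = LogisticRegression(penalty="l1", solver="saga", C=1.0, max_iter=5000)
clf.fit(X, y)
print("coefficient matrix shape:", clf.coef_.shape)       # (n_classes, n_features)
print("fraction of zero weights:", (clf.coef_ == 0).mean())
```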
32. 1-of-K Sample Results: brittany-l
89 authors with at least 50 postings. 10,076 training documents, 3,322 test documents.
BMR-Laplace classification, default hyperparameter.
33. 1-of-K Sample Results: brittany-l
4.6 million parameters.
89 authors with at least 50 postings. 10,076 training documents, 3,322 test documents.
BMR-Laplace classification, default hyperparameter.
34. The Federalist
- Mosteller and Wallace attributed all 12 disputed papers to Madison
- Historical evidence is more muddled
- Our results suggest attribution is highly dependent on the document representation
- Attribution using part-of-speech tags and word suffixes gives better predictions on the undisputed papers and assigns three disputed papers to Hamilton
35. Hyperparameter Selection
- Cross-validation hyperparameter selection is cumbersome and risks overfitting
- One-standard-error rule (sketch below)
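Not from the slide, just an illustrative sketch of the one-standard-error rule: among the penalty settings whose cross-validated score is within one standard error of the best, pick the most heavily regularized one.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def one_se_choice(X, y, Cs, cv=5):
    """Smallest C (strongest penalty) whose mean CV accuracy is within one
    standard error of the best mean CV accuracy."""
    means, ses = [], []
    for C in Cs:
        scores = cross_val_score(LogisticRegression(C=C, max_iter=1000), X, y, cv=cv)
        means.append(scores.mean())
        ses.append(scores.std(ddof=1) / np.sqrt(len(scores)))
    means, ses = np.array(means), np.array(ses)
    best = means.argmax()
    threshold = means[best] - ses[best]
    return min(C for C, m in zip(Cs, means) if m >= threshold)

# Usage with made-up data
X, y = make_classification(n_samples=300, n_features=50, random_state=0)
print(one_se_choice(X, y, Cs=[0.01, 0.1, 1.0, 10.0]))
```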
37. Florentina Bunea, Florida State
38. How Bayes?
- EM / ECM / Gauss-Seidel
- MCMC
- Online EM, Quasi-Bayes (Titterington, 1984; Smith and Makov, 1978)
- Sequential MC (Chopin, 2002; Ridgeway and Madigan, 2003)
39. Approximate Online Sparse Bayes
- Quasi-Bayes: optimal Gaussian approximation to the posterior as each new observation arrives
- Alternative: quadratic approximation to the log-likelihood of each new observation at the current mode
- Shooting algorithm (Fu, 1998)
40. Shooting (sketch below)
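The shooting algorithm itself appears on the slide only as a figure; here is a hedged coordinate-descent sketch of the lasso "shooting" update for linear regression, minimizing 0.5·||y − Xβ||² + λ·Σ_j |β_j| on made-up data.

```python
import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * max(abs(z) - lam, 0.0)

def shooting_lasso(X, y, lam, n_sweeps=100):
    """Cyclic coordinate-descent ("shooting") sketch for the lasso objective
    0.5 * ||y - X beta||^2 + lam * sum_j |beta_j|."""
    n, d = X.shape
    beta = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_sweeps):
        for j in range(d):
            r_j = y - X @ beta + X[:, j] * beta[j]   # residual excluding coordinate j
            z = X[:, j] @ r_j
            beta[j] = soft_threshold(z, lam) / col_sq[j]
    return beta

# Toy usage with synthetic data
rng = np.random.default_rng(4)
X = rng.standard_normal((100, 20))
true = np.zeros(20)
true[:3] = [2.0, -1.0, 0.5]
y = X @ true + 0.1 * rng.standard_normal(100)
print(np.round(shooting_lasso(X, y, lam=5.0), 2))
```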
41. pima (UCI), batch n = 40, online n = 160
42. Sequential MC
- Data accumulating over time
- Standard particle filtering ideas apply
- Need some way to deal with degeneracy
- Gilks and Berzuini (2001) resample-move: effective, but not a one-pass algorithm
- Balakrishnan and Madigan (2004) use the Liu and West density-estimation shrinkage idea to make a one-pass version (sketch below)
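For orientation only: a bare-bones one-pass sequential importance sampling sketch for the posterior over logistic-regression weights. It resamples when the effective sample size drops, but deliberately omits the rejuvenation / kernel-shrinkage step (the Liu and West idea the slide refers to) that is needed to really control degeneracy; everything here is a made-up illustration.

```python
import numpy as np

def sequential_mc_logistic(X, y, n_particles=2000, prior_sd=1.0, seed=0):
    """One-pass SMC sketch: particles drawn from the prior, reweighted by each
    observation's likelihood, resampled when the effective sample size is low."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    particles = rng.normal(0.0, prior_sd, size=(n_particles, d))
    log_w = np.zeros(n_particles)
    for i in range(n):                                    # one pass over the data
        margins = y[i] * (particles @ X[i])
        log_w += -np.logaddexp(0.0, -margins)             # log p(y_i | x_i, beta)
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        if 1.0 / np.sum(w ** 2) < n_particles / 2:        # effective sample size low
            idx = rng.choice(n_particles, size=n_particles, p=w)
            particles, log_w = particles[idx], np.zeros(n_particles)
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    return (w[:, None] * particles).sum(axis=0)           # posterior mean estimate

# Toy usage
rng = np.random.default_rng(5)
X = rng.standard_normal((200, 3))
y = np.where(X @ np.array([1.5, -1.0, 0.0]) + 0.3 * rng.standard_normal(200) > 0, 1, -1)
print(sequential_mc_logistic(X, y))
```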
44. Liu and West (2000)
47. Text Categorization Summary
- Conditional probability models (logistic, probit, etc.)
- As powerful as other discriminative models (SVM, boosting, etc.)
- Bayesian framework provides a much richer ability to insert task knowledge
- Code: http://stat.rutgers.edu/madigan/BBR
- Polytomous, domain-specific priors now available
48. The Last Slide
- Statistical methods for text mining work well on certain types of problems
- Many problems remain unsolved
  - Which financial news stories are likely to impact the market?
  - Where did soccer originate?
  - Attribution
49. Outline
- Part-of-Speech Tagging, Entity Recognition
- Text categorization
- Logistic regression and friends
- The richness of Bayesian regularization
- Sparseness-inducing priors
- Word-specific priors: stop words, IDF, domain knowledge, etc.
- Polytomous logistic regression
50. Part-of-Speech Tagging
- Assign grammatical tags to sequences of words
- Basic task in the analysis of natural language data
- Phrase identification, entity extraction, etc.
- Ambiguity: "tag" could be a noun or a verb ("a tag is a part-of-speech label"); context resolves the ambiguity
51. The Penn Treebank POS Tag Set
52. POS Tagging Algorithms
- Rule-based taggers: large numbers of hand-crafted rules
- Probabilistic taggers: use a tagged corpus to train some sort of model, e.g. an HMM (sketch below)
  [Diagram: hidden tags tag1, tag2, tag3 emitting word1, word2, word3]
- Clever tricks for reducing the number of parameters (aka priors)
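Not from the slides: a compact sketch of the HMM view of tagging via Viterbi decoding; the tag set, transition table, and emission probabilities below are all made up for illustration.

```python
import numpy as np

def viterbi(words, tags, log_init, log_trans, log_emit):
    """Most likely tag sequence under a first-order HMM.
    log_trans[i, j] = log P(tag_j | tag_i); log_emit[i] maps word -> log P(word | tag_i)."""
    T, K = len(words), len(tags)
    score = np.full((T, K), -np.inf)
    back = np.zeros((T, K), dtype=int)
    for k in range(K):
        score[0, k] = log_init[k] + log_emit[k].get(words[0], -20.0)
    for t in range(1, T):
        for k in range(K):
            cand = score[t - 1] + log_trans[:, k]
            back[t, k] = cand.argmax()
            score[t, k] = cand.max() + log_emit[k].get(words[t], -20.0)
    path = [int(score[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return [tags[k] for k in reversed(path)]

# Tiny made-up model: two tags and a couple of words
tags = ["NOUN", "VERB"]
log_init = np.log([0.6, 0.4])
log_trans = np.log([[0.3, 0.7],    # NOUN -> NOUN / VERB
                    [0.8, 0.2]])   # VERB -> NOUN / VERB
log_emit = [{"dogs": np.log(0.5), "bark": np.log(0.1)},    # NOUN emissions
            {"dogs": np.log(0.05), "bark": np.log(0.6)}]   # VERB emissions
print(viterbi(["dogs", "bark"], tags, log_init, log_trans, log_emit))
```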
53. Some details
Charniak et al. (1993) achieved 95% accuracy on the Brown Corpus by estimating p(tag i | word j) as
  (number of times word j appears with tag i) / (number of times word j appears)
and, for previously unseen words, as
  (number of times a word that had never been seen with tag i gets tag i) / (number of such occurrences in total),
plus a modification that uses word suffixes.
54. Recent Developments
- Toutanova et al. (2003) use a dependency network and a richer feature set
- Log-linear model for p(t_i | t_−i, w)
- Model included, for example, a feature for whether the word contains a number, uppercase characters, a hyphen, etc.
- Regularization of the estimation process is critical
- 96.6% accuracy on the Penn corpus
55. Named-Entity Classification
- "Mrs. Frank" is a person
- "Steptoe and Johnson" is a company
- "Honduras" is a location
- etc.
- Bikel et al. (1998) from BBN: "Nymble", a statistical approach using HMMs
56.
  [Diagram: hidden name classes nc1, nc2, nc3 emitting word1, word2, word3]
- Name classes: Not-A-Name, Person, Location, etc.
- Smoothing for sparse training data; word features
- Training: 100,000 words from WSJ
- Accuracy: 93%
- 450,000 words → same accuracy
57. Training-development-test