Title: Statistical Methods for Text Mining
1. Statistical Methods for Text Mining
David Madigan, Rutgers University / DIMACS, stat.rutgers.edu/madigan
David D. Lewis, www.daviddlewis.com
Joint work with Alex Genkin, Paul Kantor, Vladimir Menkov, Aynur Dayanik, Dmitriy Fradkin
2. Statistical Analysis of Text
- Statistical text analysis has a long history in literary analysis and in solving disputed authorship problems
- First (?) was Thomas C. Mendenhall in 1887
3. Mendenhall
- Mendenhall was Professor of Physics at Ohio State and at the University of Tokyo, Superintendent of the U.S. Coast and Geodetic Survey, and later, President of Worcester Polytechnic Institute
Mendenhall Glacier, Juneau, Alaska
5.
- Used Naïve Bayes with Poisson and Negative Binomial model
- Out-of-sample predictive performance
6. Today
- Statistical methods routinely used for textual analyses of all kinds
- Machine translation, part-of-speech tagging, information extraction, question-answering, text categorization, etc.
- Not reported in the statistical literature
7. Text Categorization
- Automatic assignment of documents with respect to a manually defined set of categories
- Applications: automated indexing, spam filtering, content filters, medical coding, CRM, essay grading
- Dominant technology is supervised machine learning
- Manually classify some documents, then learn a classification rule from them (possibly with manual intervention)
8. Document Representation
- Documents usually represented as a bag of words (sketch below)
- The x_ij values might be 0/1, counts, or weights (e.g. tf/idf, LSI)
- Many text processing choices: stopwords, stemming, phrases, synonyms, NLP, etc.
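For illustration only (not from the slides): a minimal sketch of a bag-of-words / tf-idf representation built with scikit-learn, with made-up toy documents standing in for a real collection.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical); the talk's experiments use Reuters and OHSUMED documents
docs = [
    "interest rates rise as markets fall",
    "central bank cuts interest rates again",
    "new treatment for heart disease reported",
]

# Bag of words with tf/idf weights; dropping stopwords is one of the
# text-processing choices listed on the slide
vectorizer = TfidfVectorizer(stop_words="english", norm="l2")
X = vectorizer.fit_transform(docs)          # sparse n_docs x n_terms matrix
print(X.shape)
print(vectorizer.get_feature_names_out())
```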
9. Classifier Representation
- For instance, a linear classifier:
  f(x_i) = Σ_j β_j x_ij;   y_i = +1 if f(x_i) > 0, else y_i = −1
- x_i derived from the text of the document
- y_i indicates whether to put the document in the category
- β_j are parameters chosen to give good classification effectiveness
10. Logistic Regression Model
- Linear model for the log odds of category membership (sketch below):
  log [ p(y = +1 | x_i) / p(y = −1 | x_i) ] = Σ_j β_j x_ij = β·x_i
- Conditional probability model
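As a small illustration (not part of the deck), the logistic model maps the linear score to a class probability; the weights and document vector below are made up.

```python
import numpy as np

def prob_positive(beta, x):
    """p(y = +1 | x) under the logistic model: log-odds = beta . x."""
    return 1.0 / (1.0 + np.exp(-np.dot(beta, x)))

# Made-up weights and document feature vector
beta = np.array([0.8, -1.2, 0.0, 2.1])
x = np.array([1.0, 0.0, 3.0, 0.5])
print(prob_positive(beta, x))   # probability of category membership
```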
11. Maximum Likelihood Training
- Choose parameters (the β_j) that maximize the probability (likelihood) of the class labels (y_i) given the documents (x_i)
- Tends to overfit
- Not defined if d > n
- Feature selection.
12. Shrinkage Methods
- Feature selection is a discrete process: individual variables are either in or out. Combinatorial nightmare.
- This method can have high variance: a different dataset from the same source can result in a totally different model
- Shrinkage methods allow a variable to be partly included in the model. That is, the variable is included but with a shrunken coefficient
- An elegant way to tackle overfitting
13. Ridge Regression
  β̂ = argmin_β Σ_i (y_i − β·x_i)²   subject to   Σ_j β_j² ≤ s
Equivalently:
  β̂ = argmin_β { Σ_i (y_i − β·x_i)² + λ Σ_j β_j² }
This leads to β̂ = (XᵀX + λI)⁻¹ Xᵀy. Choose λ by cross-validation (sketch below).
Works even when XᵀX is singular.
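Not the authors' code, just a hedged sketch: the same L2 (ridge) penalty applied to logistic regression for text categorization in scikit-learn, with the penalty strength (C = 1/λ) chosen by cross-validation; the data here are random stand-ins.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Random stand-in for a tf/idf document-term matrix and +1/-1 labels
rng = np.random.default_rng(0)
X = rng.random((200, 50))
y = np.where(X[:, 0] + X[:, 1] > 1.0, 1, -1)

# L2 penalty = ridge; C = 1/lambda, chosen by cross-validation
grid = GridSearchCV(LogisticRegression(penalty="l2", max_iter=1000),
                    {"C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
grid.fit(X, y)
print(grid.best_params_)
```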
15. Ridge Regression = Bayesian MAP Regression
- Suppose we believe each β_j is near 0
- Encode this belief as separate Gaussian prior distributions over the values of β_j
- Choosing the maximum a posteriori value of β gives the same result as ridge logistic regression
- Same as ridge, with λ inversely proportional to the prior variance
16. Least Absolute Shrinkage and Selection Operator (LASSO)
(Tibshirani, 1996)
  β̂ = argmin_β Σ_i (y_i − β·x_i)²   subject to   Σ_j |β_j| ≤ s
- Quadratic programming algorithm needed to solve for the parameter estimates
- Modified Gauss-Seidel; highly tuned C implementation (sketch below)
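A hedged sketch (scikit-learn, not the BBR software): L1-penalized, lasso-style logistic regression, which likewise drives many coefficients exactly to zero; the data are synthetic stand-ins.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data; in the talk X would be a sparse tf/idf matrix
rng = np.random.default_rng(1)
X = rng.random((200, 100))
y = np.where(X[:, 0] - X[:, 1] > 0.0, 1, -1)

# L1 penalty corresponds to the lasso constraint (equivalently a Laplace prior)
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
clf.fit(X, y)
print("nonzero coefficients:", np.count_nonzero(clf.coef_))
```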
18. Same as putting a double-exponential (Laplace) prior on each β_j
19. LARS
- New geometrical insights into Lasso and Stagewise
- Leads to a highly efficient Lasso algorithm for linear regression
20. LARS
- Start with all coefficients β_j = 0
- Find the predictor x_j most correlated with y
- Increase β_j in the direction of the sign of its correlation with y. Take residuals r = y − ŷ along the way. Stop when some other predictor x_k has as much correlation with r as x_j has
- Increase (β_j, β_k) in their joint least squares direction until some other predictor x_m has as much correlation with the residual r
- Continue until all predictors are in the model (path sketch below)
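Not from the slides: a short sketch that traces the LARS/lasso coefficient path on synthetic data using scikit-learn's lars_path, showing the order in which predictors enter.

```python
import numpy as np
from sklearn.linear_model import lars_path

# Synthetic regression data standing in for real predictors
rng = np.random.default_rng(2)
X = rng.standard_normal((100, 10))
true_beta = np.zeros(10)
true_beta[:3] = [3.0, -2.0, 1.5]
y = X @ true_beta + 0.1 * rng.standard_normal(100)

# method="lasso" computes the lasso path via the LARS algorithm
alphas, active, coefs = lars_path(X, y, method="lasso")
print("order in which predictors enter:", active)
print("coefficient path shape:", coefs.shape)   # (n_features, n_steps)
```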
22. Data Sets
- ModApte subset of Reuters-21578
  - 90 categories; 9,603 training docs; 18,978 features
- Reuters RCV1-v2
  - 103 categories; 23,149 training docs; 47,152 features
- OHSUMED heart disease categories
  - 77 categories; 83,944 training docs; 122,076 features
- Cosine-normalized TFxIDF weights
23. Dense vs. Sparse Models (Macroaveraged F1)
26. Hastie, Friedman and Tibshirani
27. Bayesian Unsupervised Feature Selection and Weighting
- Stopwords: low-content words that typically are discarded
  - Give them a prior with mean 0 and low variance
- Inverse document frequency (IDF) weighting
  - Rare words more likely to be content indicators
  - Make the variance of the prior inversely proportional to frequency in the collection
- Experiments in progress
28. Bayesian Use of Domain Knowledge
- Often believe that certain words are positively or negatively associated with a category
- Prior mean can encode the strength of positive or negative association
- Prior variance encodes confidence (sketch below)
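To make this concrete, here is a hedged sketch (plain scipy, not the authors' BBR/BMR software) of MAP logistic regression with word-specific Gaussian prior means and variances; all of the data and prior settings below are made up.

```python
import numpy as np
from scipy.optimize import minimize

def map_logistic(X, y, prior_mean, prior_var):
    """MAP estimate for logistic regression with independent Gaussian priors
    beta_j ~ N(prior_mean[j], prior_var[j]); labels y are in {-1, +1}."""
    def neg_log_posterior(beta):
        margins = y * (X @ beta)
        nll = np.sum(np.logaddexp(0.0, -margins))             # -log likelihood
        penalty = np.sum((beta - prior_mean) ** 2 / (2.0 * prior_var))
        return nll + penalty
    res = minimize(neg_log_posterior, x0=prior_mean.copy(), method="L-BFGS-B")
    return res.x

# Made-up example: word 0 is believed positively associated with the category,
# so its prior gets a positive mean and a larger variance (more freedom)
rng = np.random.default_rng(3)
X = rng.random((50, 5))
y = np.where(X[:, 0] > 0.5, 1, -1)
prior_mean = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
prior_var = np.array([4.0, 1.0, 1.0, 1.0, 1.0])
print(map_logistic(X, y, prior_mean, prior_var))
```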
29. First Experiments
- 27 RCV1-v2 Region categories
- CIA World Factbook entry for country
- Give content words higher mean and/or variance
- Only 10 training examples per category
- Shows off prior knowledge
- Limited data often the case in applications
30. Results (Preliminary)
31. Polytomous Logistic Regression
- Logistic regression trivially generalizes to 1-of-k problems (sketch below)
- Cleaner than SVMs, error-correcting codes, etc.
- Laplace prior particularly appealing here
  - Suppose 99 classes and a word that predicts class 17
  - The word gets used 100 times if we build 100 models, or if we use polytomous with a Gaussian prior
  - With a Laplace prior and polytomous it's used only once
- Experiments in progress, particularly on author id
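A minimal sketch (scikit-learn, not the authors' BMR software) of polytomous (multinomial) logistic regression with an L1 penalty, the penalized-likelihood analogue of the Laplace prior; the data are synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic 1-of-k data standing in for, say, an author-identification task
X, y = make_classification(n_samples=500, n_features=200, n_informative=20,
                           n_classes=5, random_state=0)

# Multinomial (polytomous) model; the L1 penalty plays the role of a Laplace prior
clf = LogisticRegression(penalty="l1", solver="saga", C=1.0, max_iter=5000)
clf.fit(X, y)
print("coefficient matrix shape:", clf.coef_.shape)       # (n_classes, n_features)
print("fraction of zero weights:", (clf.coef_ == 0).mean())
```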
32. 1-of-K Sample Results: brittany-l
89 authors with at least 50 postings. 10,076 training documents, 3,322 test documents.
BMR-Laplace classification, default hyperparameter.
33. 1-of-K Sample Results: brittany-l
4.6 million parameters.
89 authors with at least 50 postings. 10,076 training documents, 3,322 test documents.
BMR-Laplace classification, default hyperparameter.
34. The Federalist
- Mosteller and Wallace attributed all 12 disputed papers to Madison
- Historical evidence is more muddled
- Our results suggest attribution is highly dependent on the document representation
- Attribution using part-of-speech tags and word suffixes gives better predictions on the undisputed papers and assigns three disputed papers to Hamilton
35. Hyperparameter Selection
- Cross-validation hyperparameter selection is cumbersome and risks overfitting
- One-standard-error rule (sketch below)
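Not from the slide, just an illustrative sketch of the one-standard-error rule: among the penalty settings whose cross-validated score is within one standard error of the best, pick the most heavily regularized one.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def one_se_choice(X, y, Cs, cv=5):
    """Smallest C (strongest penalty) whose mean CV accuracy is within one
    standard error of the best mean CV accuracy."""
    means, ses = [], []
    for C in Cs:
        scores = cross_val_score(LogisticRegression(C=C, max_iter=1000), X, y, cv=cv)
        means.append(scores.mean())
        ses.append(scores.std(ddof=1) / np.sqrt(len(scores)))
    means, ses = np.array(means), np.array(ses)
    best = means.argmax()
    threshold = means[best] - ses[best]
    return min(C for C, m in zip(Cs, means) if m >= threshold)

# Usage with made-up data
X, y = make_classification(n_samples=300, n_features=50, random_state=0)
print(one_se_choice(X, y, Cs=[0.01, 0.1, 1.0, 10.0]))
```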
37. Florentina Bunea, Florida State
38. How Bayes?
- EM / ECM / Gauss-Seidel
- MCMC
- Online EM, Quasi-Bayes (Titterington, 1984; Smith and Makov, 1978)
- Sequential MC (Chopin, 2002; Ridgeway and Madigan, 2003)
39. Approximate Online Sparse Bayes
- Quasi-Bayes: optimal Gaussian approximation to the posterior as each new observation arrives
- Alternative: quadratic approximation to the log-likelihood of each new observation at the current mode
- Shooting algorithm (Fu, 1998)
40. Shooting (sketch below)
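The shooting algorithm itself appears on the slide only as a figure; here is a hedged coordinate-descent sketch of the lasso "shooting" update for linear regression, minimizing 0.5·||y − Xβ||² + λ·Σ_j |β_j| on made-up data.

```python
import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * max(abs(z) - lam, 0.0)

def shooting_lasso(X, y, lam, n_sweeps=100):
    """Cyclic coordinate-descent ("shooting") sketch for the lasso objective
    0.5 * ||y - X beta||^2 + lam * sum_j |beta_j|."""
    n, d = X.shape
    beta = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_sweeps):
        for j in range(d):
            r_j = y - X @ beta + X[:, j] * beta[j]   # residual excluding coordinate j
            z = X[:, j] @ r_j
            beta[j] = soft_threshold(z, lam) / col_sq[j]
    return beta

# Toy usage with synthetic data
rng = np.random.default_rng(4)
X = rng.standard_normal((100, 20))
true = np.zeros(20)
true[:3] = [2.0, -1.0, 0.5]
y = X @ true + 0.1 * rng.standard_normal(100)
print(np.round(shooting_lasso(X, y, lam=5.0), 2))
```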
41. pima (UCI), batch n = 40, online n = 160
42. Sequential MC
- Data accumulating over time
- Standard particle filtering ideas apply
- Need some way to deal with degeneracy
- Gilks and Berzuini (2001) resample-move: effective, but not a one-pass algorithm
- Balakrishnan and Madigan (2004) use the Liu and West density-estimation shrinkage idea to make a one-pass version (sketch below)
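For orientation only: a bare-bones one-pass sequential importance sampling sketch for the posterior over logistic-regression weights. It resamples when the effective sample size drops, but deliberately omits the rejuvenation / kernel-shrinkage step (the Liu and West idea the slide refers to) that is needed to really control degeneracy; everything here is a made-up illustration.

```python
import numpy as np

def sequential_mc_logistic(X, y, n_particles=2000, prior_sd=1.0, seed=0):
    """One-pass SMC sketch: particles drawn from the prior, reweighted by each
    observation's likelihood, resampled when the effective sample size is low."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    particles = rng.normal(0.0, prior_sd, size=(n_particles, d))
    log_w = np.zeros(n_particles)
    for i in range(n):                                    # one pass over the data
        margins = y[i] * (particles @ X[i])
        log_w += -np.logaddexp(0.0, -margins)             # log p(y_i | x_i, beta)
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        if 1.0 / np.sum(w ** 2) < n_particles / 2:        # effective sample size low
            idx = rng.choice(n_particles, size=n_particles, p=w)
            particles, log_w = particles[idx], np.zeros(n_particles)
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    return (w[:, None] * particles).sum(axis=0)           # posterior mean estimate

# Toy usage
rng = np.random.default_rng(5)
X = rng.standard_normal((200, 3))
y = np.where(X @ np.array([1.5, -1.0, 0.0]) + 0.3 * rng.standard_normal(200) > 0, 1, -1)
print(sequential_mc_logistic(X, y))
```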
44. Liu and West (2000)
47. Text Categorization Summary
- Conditional probability models (logistic, probit, etc.)
- As powerful as other discriminative models (SVM, boosting, etc.)
- Bayesian framework provides a much richer ability to insert task knowledge
- Code: http://stat.rutgers.edu/madigan/BBR
- Polytomous, domain-specific priors now available
48. The Last Slide
- Statistical methods for text mining work well on certain types of problems
- Many problems remain unsolved
  - Which financial news stories are likely to impact the market?
  - Where did soccer originate?
  - Attribution
49. Outline
- Part-of-Speech Tagging, Entity Recognition
- Text categorization
- Logistic regression and friends
- The richness of Bayesian regularization
- Sparseness-inducing priors
- Word-specific priors: stop words, IDF, domain knowledge, etc.
- Polytomous logistic regression
50. Part-of-Speech Tagging
- Assign grammatical tags to sequences of words
- Basic task in the analysis of natural language data
- Phrase identification, entity extraction, etc.
- Ambiguity: "tag" could be a noun or a verb ("a tag is a part-of-speech label"); context resolves the ambiguity
51. The Penn Treebank POS Tag Set
52. POS Tagging Algorithms
- Rule-based taggers: large numbers of hand-crafted rules
- Probabilistic taggers: use a tagged corpus to train some sort of model, e.g. an HMM (sketch below)
  [Diagram: hidden tags tag1, tag2, tag3 emitting word1, word2, word3]
- Clever tricks for reducing the number of parameters (aka priors)
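Not from the slides: a compact sketch of the HMM view of tagging via Viterbi decoding; the tag set, transition table, and emission probabilities below are all made up for illustration.

```python
import numpy as np

def viterbi(words, tags, log_init, log_trans, log_emit):
    """Most likely tag sequence under a first-order HMM.
    log_trans[i, j] = log P(tag_j | tag_i); log_emit[i] maps word -> log P(word | tag_i)."""
    T, K = len(words), len(tags)
    score = np.full((T, K), -np.inf)
    back = np.zeros((T, K), dtype=int)
    for k in range(K):
        score[0, k] = log_init[k] + log_emit[k].get(words[0], -20.0)
    for t in range(1, T):
        for k in range(K):
            cand = score[t - 1] + log_trans[:, k]
            back[t, k] = cand.argmax()
            score[t, k] = cand.max() + log_emit[k].get(words[t], -20.0)
    path = [int(score[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return [tags[k] for k in reversed(path)]

# Tiny made-up model: two tags and a couple of words
tags = ["NOUN", "VERB"]
log_init = np.log([0.6, 0.4])
log_trans = np.log([[0.3, 0.7],    # NOUN -> NOUN / VERB
                    [0.8, 0.2]])   # VERB -> NOUN / VERB
log_emit = [{"dogs": np.log(0.5), "bark": np.log(0.1)},    # NOUN emissions
            {"dogs": np.log(0.05), "bark": np.log(0.6)}]   # VERB emissions
print(viterbi(["dogs", "bark"], tags, log_init, log_trans, log_emit))
```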
53. Some details
Charniak et al. (1993) achieved 95% accuracy on the Brown Corpus by estimating p(tag i | word j) as
  (number of times word j appears with tag i) / (number of times word j appears)
and, for previously unseen words, as
  (number of times a word that had never been seen with tag i gets tag i) / (number of such occurrences in total),
plus a modification that uses word suffixes.
54. Recent Developments
- Toutanova et al. (2003) use a dependency network and a richer feature set
- Log-linear model for p(t_i | t_−i, w)
- Model included, for example, a feature for whether the word contains a number, uppercase characters, a hyphen, etc.
- Regularization of the estimation process is critical
- 96.6% accuracy on the Penn corpus
55. Named-Entity Classification
- "Mrs. Frank" is a person
- "Steptoe and Johnson" is a company
- "Honduras" is a location
- etc.
- Bikel et al. (1998) from BBN: "Nymble", a statistical approach using HMMs
56.
  [Diagram: hidden name classes nc1, nc2, nc3 emitting word1, word2, word3]
- Name classes: Not-A-Name, Person, Location, etc.
- Smoothing for sparse training data; word features
- Training: 100,000 words from WSJ
- Accuracy: 93%
- 450,000 words → same accuracy
57. Training-development-test