Title: Statistical Methods for Text Classification
1. Statistical Methods for Text Classification
David Madigan, Rutgers University / DIMACS, stat.rutgers.edu/madigan
David D. Lewis, www.daviddlewis.com
Joint work with Alex Genkin, Paul Kantor, Vladimir Menkov, Aynur Dayanik, Dmitriy Fradkin
2. Statistical Analysis of Text
- Statistical text analysis has a long history in literary analysis and in solving disputed authorship problems
- First (?) was Thomas C. Mendenhall in 1887
4.
- Used Naïve Bayes with Poisson and Negative Binomial models
- Out-of-sample predictive performance
5. Today
- Statistical methods routinely used for textual analyses of all kinds
- Machine translation, part-of-speech tagging, information extraction, question-answering, text categorization, etc.
- Not reported in the statistical literature
6. Text Categorization
- Automatic assignment of documents with respect to a manually defined set of categories
- Applications: automated indexing, spam filtering, content filters, medical coding, CRM, essay grading
- The dominant technology is supervised machine learning
- Manually classify some documents, then learn a classification rule from them (possibly with manual intervention)
7. Document Representation
- Documents are usually represented as a bag of words
- The x_i's might be 0/1, counts, or weights (e.g. TF-IDF, LSI); see the sketch after this list
- Many text processing choices: stopwords, stemming, phrases, synonyms, NLP, etc.
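As an illustration of these representation choices, here is a minimal sketch using scikit-learn (a library not mentioned in the slides); the toy documents, stopword list, and normalization settings are assumptions for the example.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "statistical methods for text classification",
    "text categorization with logistic regression",
]

# 0/1 or count representation (bag of words)
counts = CountVectorizer(stop_words="english").fit_transform(docs)

# TF-IDF weights with cosine (L2) normalization, as on the Data Sets slide
tfidf = TfidfVectorizer(stop_words="english", norm="l2").fit_transform(docs)

print(counts.toarray())
print(tfidf.toarray())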
8. Classifier Representation
- For instance, a linear classifier (sketch below):
  f(x_i) = \sum_j \beta_j x_{ij}, \qquad \hat{y}_i = +1 \text{ if } f(x_i) > 0, \text{ else } \hat{y}_i = -1
- The x_i's are derived from the text of the document
- y_i indicates whether to put the document in the category
- The \beta_j are parameters chosen to give good classification effectiveness
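A minimal sketch of the linear classification rule above, assuming NumPy and dense document vectors; the function name is purely illustrative.

import numpy as np

def linear_classify(beta, x):
    """Score a document vector x with weights beta: f(x) = sum_j beta_j * x_j,
    then assign +1 if the score is positive, else -1."""
    return 1 if np.dot(beta, x) > 0 else -1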
9. Logistic Regression Model
- Linear model for the log odds of category membership:
  \log \frac{p(y = +1 \mid x_i)}{p(y = -1 \mid x_i)} = \sum_j \beta_j x_{ij} = \beta^{\top} x_i
- A conditional probability model
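The same model written directly as a conditional probability; a small sketch, assuming y in {-1, +1} and a dense feature vector.

import numpy as np

def prob_positive(beta, x):
    """P(y = +1 | x) under the logistic model: 1 / (1 + exp(-beta'x)).
    The log odds of this probability is exactly the linear predictor beta'x."""
    return 1.0 / (1.0 + np.exp(-np.dot(beta, x)))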
10. Maximum Likelihood Training
- Choose parameters (the \beta_j's) that maximize the probability (likelihood) of the class labels (y_i's) given the documents (x_i's); the loss being minimized is sketched below
- Maximizing the (log-)likelihood can be viewed as minimizing a loss function
- Tends to overfit. Not defined if d > n. Hence feature selection.
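For concreteness, the loss that maximum likelihood training minimizes is the negative log-likelihood; a sketch, assuming labels coded as -1/+1 and no penalty term (which is why it can overfit when d > n).

import numpy as np

def neg_log_likelihood(beta, X, y):
    """Negative log-likelihood sum_i log(1 + exp(-y_i * beta'x_i)) for the
    logistic model, with y_i in {-1, +1}. Unpenalized, so it overfits when
    there are more features than documents."""
    margins = y * (X @ beta)
    return np.sum(np.logaddexp(0.0, -margins))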
11. Shrinkage Methods
- Feature selection is a discrete process: individual variables are either in or out. Combinatorial nightmare.
- This can have high variance: a different dataset from the same source can result in a totally different model
- Shrinkage methods allow a variable to be partly included in the model. That is, the variable is included but with a shrunken coefficient.
- An elegant way to tackle over-fitting
12. Ridge Regression

  \min_{\beta} \sum_i (y_i - \beta^{\top} x_i)^2 \quad \text{subject to} \quad \sum_j \beta_j^2 \le s

Equivalently:

  \hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \sum_i (y_i - \beta^{\top} x_i)^2 + \lambda \sum_j \beta_j^2

This leads to \hat{\beta}^{\text{ridge}} = (X^{\top} X + \lambda I)^{-1} X^{\top} y. Choose \lambda by cross-validation; works even when X^{\top} X is singular.
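A sketch of the closed-form ridge solution for the linear-regression case; ridge logistic regression has no closed form, and \lambda is still chosen by cross-validation as the slide says. The function name and use of NumPy are assumptions.

import numpy as np

def ridge_fit(X, y, lam):
    """Ridge estimate (X'X + lam*I)^{-1} X'y; the added lam*I makes the
    system solvable even when X'X is singular (e.g. when d > n)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)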
14. Ridge Regression = Bayesian MAP Regression
- Suppose we believe each \beta_j is near 0
- Encode this belief as separate Gaussian prior distributions over the values of \beta_j
- Choosing the maximum a posteriori (MAP) value of \beta gives the same result as ridge logistic regression (sketch below)
- A N(0, \tau^2) prior on each \beta_j is the same as ridge with \lambda = 1/(2\tau^2)
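A sketch of the equivalence, assuming independent N(0, \tau^2) priors on the coefficients and log-likelihood \ell(\beta):

\hat{\beta}_{\mathrm{MAP}}
  = \arg\max_{\beta}\Bigl[\ell(\beta) + \sum_j \log p(\beta_j)\Bigr]
  = \arg\min_{\beta}\Bigl[-\ell(\beta) + \frac{1}{2\tau^2}\sum_j \beta_j^2\Bigr],

i.e. ridge with \lambda = 1/(2\tau^2).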
15. Least Absolute Shrinkage and Selection Operator (LASSO)
Tibshirani (1996)

  \min_{\beta} \sum_i (y_i - \beta^{\top} x_i)^2 \quad \text{subject to} \quad \sum_j |\beta_j| \le s

A quadratic programming algorithm is needed to solve for the parameter estimates.
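A minimal sketch of lasso fitting with scikit-learn (illustrative only; the talk's own software is BBR, and the synthetic data and penalty weight below are assumptions).

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 200))          # d > n, as is typical for text data
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=50)

model = Lasso(alpha=0.1).fit(X, y)      # L1 penalty drives most coefficients to exactly 0
print("non-zero coefficients:", int((model.coef_ != 0).sum()))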
17. Same as putting a double exponential (Laplace) prior on each \beta_j
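A sketch of why: with density p(\beta_j \mid \lambda) = \frac{\lambda}{2} e^{-\lambda |\beta_j|} on each coefficient,

  -\log p(\beta \mid \lambda) = \lambda \sum_j |\beta_j| + \text{const},

so the MAP estimate minimizes the negative log-likelihood plus an L1 penalty, i.e. the lasso criterion.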
18. Data Sets
- ModApte subset of Reuters-21578
  - 90 categories, 9,603 training docs, 18,978 features
- Reuters RCV1-v2
  - 103 categories, 23,149 training docs, 47,152 features
- OHSUMED heart disease categories
  - 77 categories, 83,944 training docs, 122,076 features
- Cosine-normalized TFxIDF weights
19. Dense vs. Sparse Models (Macroaveraged F1)
22. Bayesian Unsupervised Feature Selection and Weighting
- Stopwords: low-content words that are typically discarded
  - Give them a prior with mean 0 and low variance
- Inverse document frequency (IDF) weighting
  - Rare words are more likely to be content indicators
  - Make the variance of the prior inversely proportional to frequency in the collection (see the sketch below)
- Experiments in progress
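A minimal sketch of the frequency-scaled prior variances described above; the function name and the base variance constant are assumptions.

import numpy as np

def frequency_scaled_prior_variances(doc_freq, base_var=1.0):
    """Prior variance for each word's coefficient, inversely proportional to
    how often the word occurs in the collection: frequent (stopword-like)
    words are shrunk hard toward the 0 prior mean, rare words get more
    freedom to act as content indicators."""
    doc_freq = np.asarray(doc_freq, dtype=float)
    return base_var / doc_freq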
23. Bayesian Use of Domain Knowledge
- We often believe that certain words are positively or negatively associated with a category
- The prior mean can encode the strength of positive or negative association
- The prior variance encodes confidence
24. First Experiments
- 27 RCV1-v2 Region categories
- CIA World Factbook entry for each country
- Give content words higher prior mean and/or variance
- Only 10 training examples per category
  - Shows off prior knowledge
  - Limited data is often the case in applications
25. Results (Preliminary)
26. Polytomous Logistic Regression
- Logistic regression trivially generalizes to 1-of-K problems (sketch below)
  - Cleaner than SVMs, error-correcting codes, etc.
- The Laplace prior is particularly appealing here
  - Suppose 99 classes and a word that predicts class 17
  - The word gets used 100 times if you build 100 models, or if you use polytomous with a Gaussian prior
  - With a Laplace prior and polytomous, it is used only once
- Experiments in progress, particularly on author identification
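For reference, the polytomous (multinomial) model for a 1-of-K problem can be written as

  P(y = k \mid x) = \frac{\exp(\beta_k^{\top} x)}{\sum_{k'=1}^{K} \exp(\beta_{k'}^{\top} x)}, \qquad k = 1, \dots, K,

so a Laplace prior on every \beta_{kj} can zero out a word's coefficient in all but the few classes where it actually carries signal.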
27. 1-of-K Sample Results: brittany-l
89 authors with at least 50 postings. 10,076 training documents, 3,322 test documents.
BMR-Laplace classification, default hyperparameter.
28. 1-of-K Sample Results: brittany-l
4.6 million parameters.
89 authors with at least 50 postings. 10,076 training documents, 3,322 test documents.
BMR-Laplace classification, default hyperparameter.
29. How Bayes?
EM, ECM, Gauss-Seidel
MCMC
Online EM, Quasi-Bayes (Titterington, 1984; Smith & Makov, 1978)
Sequential MC (Chopin, 2002; Ridgeway & Madigan, 2003)
30. Approximate Online Sparse Bayes
- Quasi-Bayes: optimal Gaussian approximation to the posterior as each new observation arrives
- Alternative: quadratic approximation to the log-likelihood of each new observation at the current mode
- Shooting algorithm (Fu, 1998); sketched below
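A sketch of the shooting (coordinate descent) update for the lasso-penalized least-squares problem; the talk applies the same idea to a quadratic approximation of the logistic log-likelihood, so this standalone version is illustrative only.

import numpy as np

def soft_threshold(z, gamma):
    """Soft-thresholding operator used by the shooting algorithm."""
    return np.sign(z) * max(abs(z) - gamma, 0.0)

def shooting_lasso(X, y, lam, n_sweeps=100):
    """Cycle through coordinates, solving each one-dimensional lasso problem
    exactly while holding the other coefficients fixed."""
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(n_sweeps):
        for j in range(d):
            r_j = y - X @ beta + X[:, j] * beta[j]   # partial residual excluding feature j
            beta[j] = soft_threshold(X[:, j] @ r_j, lam) / (X[:, j] @ X[:, j])
    return beta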
31. Shooting
32. pima (UCI), batch n = 40, online n = 160
33. Sequential MC
- Here "time" is accumulating data
- Standard particle filtering ideas apply
- Need some way to deal with degeneracy
- Gilks and Berzuini (2001) resample-move: effective, but not a one-pass algorithm
- Balakrishnan & Madigan (2004) use the Liu & West density estimation shrinkage idea to make a one-pass version
35. Liu and West (2000)
38. Text Categorization Summary
- Conditional probability models (logistic, probit, etc.)
- As powerful as other discriminative models (SVM, boosting, etc.)
- The Bayesian framework provides a much richer ability to insert task knowledge
- Code: http://stat.rutgers.edu/madigan/BBR
- Polytomous and domain-specific priors now available
39. The Last Slide
- Statistical methods for text mining work well on certain types of problems
- Many problems remain unsolved
  - Which financial news stories are likely to impact the market?
  - Where did soccer originate?
  - Attribution
40. Hastie, Friedman & Tibshirani
41. Outline
- Part-of-speech tagging, entity recognition
- Text categorization
- Logistic regression and friends
- The richness of Bayesian regularization
- Sparseness-inducing priors
- Word-specific priors: stop words, IDF, domain knowledge, etc.
- Polytomous logistic regression
42. Part-of-Speech Tagging
- Assign grammatical tags to sequences of words
- A basic task in the analysis of natural language data
- Phrase identification, entity extraction, etc.
- Ambiguity: "tag" could be a noun or a verb (as in "a tag is a part-of-speech label"); context resolves the ambiguity
43. The Penn Treebank POS Tag Set
44. POS Tagging Algorithms
- Rule-based taggers: large numbers of hand-crafted rules
- Probabilistic taggers: use a tagged corpus to train some sort of model, e.g. an HMM (see the factorization after this list)
  (HMM diagram: hidden tags tag1, tag2, tag3 emitting words word1, word2, word3)
- Clever tricks for reducing the number of parameters (aka priors)
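For an HMM tagger like the one sketched in the diagram, the joint probability of a tag sequence and word sequence factorizes as (a standard formulation, not taken verbatim from the slides):

  p(t_1,\dots,t_n,\; w_1,\dots,w_n) = \prod_{i=1}^{n} p(t_i \mid t_{i-1})\, p(w_i \mid t_i),

and tagging picks the tag sequence maximizing this, typically with the Viterbi algorithm.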
45. Some Details
Charniak et al. (1993) achieved 95% accuracy on the Brown Corpus with

  p(tag i | word j) = (number of times word j appears with tag i) / (number of times word j appears)

and, for previously unseen words,

  (number of times a word that had never been seen with tag i gets tag i) / (number of such occurrences in total),

plus a modification that uses word suffixes.
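A minimal sketch of the relative-frequency estimate above, assuming a corpus given as (word, tag) pairs; unknown-word handling and the suffix modification are omitted.

from collections import Counter

def tag_given_word_probs(tagged_corpus):
    """Estimate p(tag i | word j) as
    (# times word j appears with tag i) / (# times word j appears)."""
    word_counts = Counter(word for word, _ in tagged_corpus)
    pair_counts = Counter(tagged_corpus)
    return {(w, t): c / word_counts[w] for (w, t), c in pair_counts.items()}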
46. Recent Developments
- Toutanova et al. (2003) use a dependency network and a richer feature set
- Log-linear model for p(t_i | t_{-i}, w)
- The model included, for example, features for whether the word contains a number, uppercase characters, a hyphen, etc.
- Regularization of the estimation process is critical
- 96.6% accuracy on the Penn corpus
47. Named-Entity Classification
- "Mrs. Frank" is a person
- "Steptoe and Johnson" is a company
- "Honduras" is a location
- etc.
- Bikel et al. (1998) from BBN: Nymble, a statistical approach using HMMs
48. (HMM diagram: hidden name classes nc1, nc2, nc3 emitting words word1, word2, word3)
- Name classes: Not-A-Name, Person, Location, etc.
- Smoothing for sparse training data; word features
- Training: 100,000 words from the WSJ
- Accuracy: 93%
- 450,000 training words gave the same accuracy
49. Training-development-test