1
Statistical Methods for Text Classification
David Madigan, Rutgers University / DIMACS
stat.rutgers.edu/madigan
David D. Lewis www.daviddlewis.com
joint work with Alex Genkin, Paul Kantor,
Vladimir Menkov, Aynur Dayanik, Dmitriy Fradkin
2
Statistical Analysis of Text
  • Statistical text analysis has a long history in
    literary analysis and in solving disputed
    authorship problems
  • First (?) was Thomas C. Mendenhall in 1887

3
(No Transcript)
4
  • Used Naïve Bayes with Poisson and Negative
    Binomial models
  • Out-of-sample predictive performance

5
Today
  • Statistical methods routinely used for textual
    analyses of all kinds
  • Machine translation, part-of-speech tagging,
    information extraction, question-answering, text
    categorization, etc.
  • But rarely reported in the statistical literature

6
Text Categorization
  • Automatic assignment of documents to a manually
    defined set of categories
  • Applications: automated indexing, spam filtering,
    content filters, medical coding, CRM, essay
    grading
  • Dominant technology is supervised machine
    learning
  • Manually classify some documents, then learn a
    classification rule from them (possibly with
    manual intervention)

7
Document Representation
  • Documents usually represented as bag of words
  • The x_ij's might be 0/1, counts, or weights (e.g.,
    TF/IDF, LSI)
  • Many text-processing choices: stopwords,
    stemming, phrases, synonyms, NLP, etc. (one
    common pipeline is sketched below)
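As a concrete illustration of such a pipeline, a minimal sketch using scikit-learn's TfidfVectorizer; the toolkit and settings are assumptions, not part of the original talk:

```python
# Minimal bag-of-words representation with TF-IDF weights.
# scikit-learn is an illustrative choice; the talk names no toolkit.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "rates rise as markets react",
    "central bank holds rates steady",
]

# stop_words="english" drops low-content words; each column of X
# is one vocabulary term, each row one document.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)          # sparse matrix of TF-IDF weights
print(vectorizer.get_feature_names_out())
print(X.toarray())
```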

8
Classifier Representation
  • For instance, linear classifier

f(x_i) = Σ_j β_j x_ij,   with y_i = +1 if f(x_i) > 0, else y_i = −1
  • x_i is derived from the text of the document
  • y_i indicates whether to put the document in the
    category
  • The β_j are parameters chosen to give good
    classification effectiveness (a minimal sketch in
    code follows)
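A minimal sketch of this decision rule in NumPy; the weights and feature values are invented for illustration:

```python
# Linear classifier: score is a weighted sum of features,
# sign of the score gives the predicted class label.
import numpy as np

beta = np.array([0.8, -1.2, 0.5])      # learned weights (illustrative)
x_i = np.array([1.0, 0.0, 2.0])        # document feature vector (illustrative)

f = beta @ x_i                          # f(x_i) = sum_j beta_j * x_ij
y_hat = 1 if f > 0 else -1
print(f, y_hat)                         # 1.8, so predict +1
```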

9
Logistic Regression Model
  • Linear model for log odds of category membership

log [ p(y = +1 | x_i) / p(y = −1 | x_i) ] = Σ_j β_j x_ij = β·x_i
  • Conditional probability model
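Inverting the log-odds gives the familiar sigmoid form:

p(y = +1 | x_i) = 1 / (1 + exp(−β·x_i))

so the linear score is mapped to a probability in (0, 1), and the decision boundary f(x_i) = 0 corresponds to probability 1/2.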

10
Maximum Likelihood Training
  • Choose the parameters (β_j's) that maximize the
    probability (likelihood) of the class labels (y_i's)
    given the documents (x_i's)
  • Maximizing the (log-)likelihood can be viewed as
    minimizing a loss function (written out below)
  • Tends to overfit; the maximum likelihood estimate
    is not even defined if d > n. Hence feature
    selection.
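Written out, the loss for logistic regression is the negative log-likelihood

−ℓ(β) = Σ_i log(1 + exp(−y_i β·x_i)),

a smooth convex function of β; maximizing the likelihood is exactly minimizing this loss.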

11
Shrinkage Methods
  • Feature selection is a discrete process:
    individual variables are either in or out.
    A combinatorial nightmare.
  • It can have high variance: a different
    dataset from the same source can result in a
    totally different model
  • Shrinkage methods allow a variable to be partly
    included in the model. That is, the variable is
    included, but with a shrunken coefficient
  • An elegant way to tackle over-fitting


12
Ridge Regression
Minimize Σ_i (y_i − β·x_i)²  subject to  Σ_j β_j² ≤ s
Equivalently: minimize Σ_i (y_i − β·x_i)² + λ Σ_j β_j²
This leads to β̂_ridge = (XᵀX + λI)⁻¹ Xᵀy. Choose λ by cross-validation.
Works even when XᵀX is singular
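A minimal sketch of the closed-form ridge solution in NumPy; the data and the choice of λ are invented for illustration:

```python
# Ridge regression via its closed form: (X'X + lambda*I)^{-1} X'y.
# Adding lambda*I makes the matrix invertible even when X'X is singular.
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 50                          # fewer documents than features
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

lam = 1.0                              # choose by cross-validation in practice
beta = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print(beta[:5])
```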
13
(No Transcript)
14
Ridge Regression = Bayesian MAP Regression
  • Suppose we believe each β_j is near 0
  • Encode this belief as separate Gaussian prior
    distributions over the values of β_j
  • Choosing the maximum a posteriori (MAP) value of β
    gives the same result as ridge logistic regression

same as ridge with λ = 1/(2τ²), where τ² is the prior variance (derivation sketched below)
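To see the equivalence: with independent priors β_j ~ N(0, τ²), the negative log-posterior is

−log p(β | data) = −ℓ(β) + Σ_j β_j² / (2τ²) + const,

the ridge objective with penalty λ = 1/(2τ²); a smaller prior variance means stronger shrinkage toward 0.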
15
Least Absolute Shrinkage and Selection Operator
(LASSO)
(Tibshirani, 1996)
Minimize Σ_i (y_i − β·x_i)²  subject to  Σ_j |β_j| ≤ s
A quadratic programming algorithm is needed to solve
for the parameter estimates
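For illustration, a minimal lasso fit using scikit-learn; the library choice and α value are assumptions (the talk's own solver appears on the "Shooting" slides):

```python
# Lasso fit: many coefficients come out exactly zero (sparse model).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

model = Lasso(alpha=0.1).fit(X, y)
print(np.sum(model.coef_ != 0), "nonzero coefficients out of", X.shape[1])
```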
16
(No Transcript)
17
Same as putting a double-exponential (Laplace) prior
on each β_j
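Concretely, with p(β_j) = (γ/2) exp(−γ|β_j|), the negative log-prior adds γ Σ_j |β_j| to the objective, exactly the lasso's L1 penalty; unlike the Gaussian, this prior concentrates enough mass at 0 that many MAP coefficients come out exactly 0.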
18
Data Sets
  • ModApte subset of Reuters-21578
  • 90 categories; 9,603 training docs; 18,978 features
  • Reuters RCV1-v2
  • 103 categories; 23,149 training docs; 47,152 features
  • OHSUMED heart disease categories
  • 77 categories; 83,944 training docs; 122,076 features
  • Cosine-normalized TFxIDF weights

19
Dense vs. Sparse Models (Macroaveraged F1)
20
(No Transcript)
21
(No Transcript)
22
Bayesian Unsupervised Feature Selection and
Weighting
  • Stopwords: low-content words that are typically
    discarded
  • Give them a prior with mean 0 and low variance
  • Inverse document frequency (IDF) weighting:
  • Rare words are more likely to be content indicators
  • Make the variance of the prior inversely
    proportional to frequency in the collection
    (sketched below)
  • Experiments in progress
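A minimal sketch of the scheme just described; the exact functional form (variance = c / document frequency) is an assumption, since the slide says only "inversely proportional to frequency":

```python
# Word-specific Gaussian prior variances: common (stopword-like) words
# get tiny variance (pinned near 0); rare words get larger variance.
import numpy as np

doc_freq = np.array([950, 400, 12, 3])   # docs containing each term (illustrative)
n_docs = 1000

c = 1.0                                   # overall scale (a tuning constant)
prior_var = c / doc_freq                  # rare terms -> large prior variance
prior_mean = np.zeros_like(prior_var)     # all prior means at 0 (unsupervised)
print(prior_var)
```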

23
Bayesian Use of Domain Knowledge
  • Often believe that certain words are positively
    or negatively associated with a category
  • The prior mean can encode the strength of the
    positive or negative association
  • The prior variance encodes confidence

24
First Experiments
  • 27 RCV1-v2 Region categories
  • CIA World Factbook entry for each country
  • Give content words higher mean and/or variance
  • Only 10 training examples per category
  • Shows off prior knowledge
  • Limited training data is often the case in
    applications

25
Results (Preliminary)
26
Polytomous Logistic Regression
  • Logistic regression trivially generalizes to
    1-of-K problems
  • Cleaner than SVMs, error-correcting codes, etc.
  • The Laplace prior is particularly appealing here
  • Suppose there are 99 classes and a word that
    predicts class 17
  • The word gets used 100 times if we build 100
    models, or if we use polytomous with a Gaussian prior
  • With a Laplace prior and polytomous, it's used only
    once
  • Experiments in progress, particularly on author
    identification (a minimal fitting sketch follows)
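For illustration, a sparse polytomous fit; scikit-learn with an L1 penalty is an assumed stand-in for the talk's BMR software:

```python
# Multinomial (polytomous) logistic regression with an L1,
# Laplace-prior-like penalty. The saga solver fits the softmax
# model directly rather than one-vs-rest.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 50))
y = rng.integers(0, 5, size=300)          # 5 classes (illustrative)

clf = LogisticRegression(penalty="l1", solver="saga", C=0.5, max_iter=5000)
clf.fit(X, y)
print("nonzero weights:", np.sum(clf.coef_ != 0), "of", clf.coef_.size)
```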

27
1-of-K Sample Results: brittany-l
89 authors with at least 50 postings; 10,076
training documents; 3,322 test documents.
BMR-Laplace classification, default
hyperparameter
28
1-of-K Sample Results: brittany-l
4.6 million parameters
89 authors with at least 50 postings; 10,076
training documents; 3,322 test documents.
BMR-Laplace classification, default
hyperparameter
29
How Bayes?
EM, ECM, Gauss-Seidel
MCMC
Online EM, Quasi-Bayes (Titterington, 1984;
Smith & Makov, 1978)
Sequential MC (Chopin, 2002; Ridgeway & Madigan,
2003)
30
Approximate Online Sparse Bayes
  • Quasi-Bayes: optimal Gaussian approximation to
    the posterior as each new observation arrives
  • Alternative: a quadratic approximation to the
    log-likelihood of each new observation at the
    current mode

Shooting algorithm (Fu, 1998)
31
Shooting
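A minimal sketch of Fu's shooting algorithm, coordinate-wise soft-thresholding for the lasso objective ½‖y − Xβ‖² + λ Σ_j |β_j|; details such as the fixed iteration count are illustrative:

```python
# Shooting (coordinate descent) for the lasso, per Fu (1998):
# cycle through coordinates, soft-thresholding each one in turn.
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward 0 by t; exact zeros give the sparse solution."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def shooting_lasso(X, y, lam, n_iter=100):
    n, d = X.shape
    beta = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0)             # X_j' X_j for each coordinate
    for _ in range(n_iter):
        for j in range(d):
            # partial residual: leave coordinate j out of the current fit
            r = y - X @ beta + X[:, j] * beta[j]
            beta[j] = soft_threshold(X[:, j] @ r, lam) / col_sq[j]
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)
print(shooting_lasso(X, y, lam=5.0).round(2))  # mostly zeros except j = 0, 1
```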
32
pima (UCI), batch n = 40, online n = 160
33
Sequential MC
  • Data accumulate over time
  • Standard particle filtering ideas apply
  • Need some way to deal with degeneracy
  • Gilks and Berzuini (2001): resample-move is
    effective but not a one-pass algorithm
  • Balakrishnan and Madigan (2004) use the Liu and
    West density-estimation shrinkage idea to make a
    one-pass version

34
(No Transcript)
35
Liu and West (2000)
36
(No Transcript)
37
(No Transcript)
38
Text Categorization Summary
  • Conditional probability models (logistic,
    probit, etc.)
  • As powerful as other discriminative models (SVMs,
    boosting, etc.)
  • The Bayesian framework provides a much richer
    ability to insert task knowledge
  • Code: http://stat.rutgers.edu/madigan/BBR
  • Polytomous and domain-specific priors now available

39
The Last Slide
  • Statistical methods for text mining work well on
    certain types of problems
  • Many problems remain unsolved:
  • Which financial news stories are likely to impact
    the market?
  • Where did soccer originate?
  • Attribution

40
Hastie, Friedman and Tibshirani
41
Outline
  • Part-of-Speech Tagging, Entity Recognition
  • Text categorization
  • Logistic regression and friends
  • The richness of Bayesian regularization
  • Sparseness-inducing priors
  • Word-specific priors: stop words, IDF, domain
    knowledge, etc.
  • Polytomous logistic regression

42
Part-of-Speech Tagging
  • Assign grammatical tags to sequences of words
  • Basic task in the analysis of natural language
    data
  • Phrase identification, entity extraction, etc.
  • Ambiguity: "tag" could be a noun or a verb
  • In "a tag is a part-of-speech label," context
    resolves the ambiguity

43
The Penn Treebank POS Tag Set
44
POS Tagging Algorithms
  • Rule-based taggers: large numbers of hand-crafted
    rules
  • Probabilistic taggers: use a tagged corpus to
    train some sort of model, e.g., an HMM.

[HMM diagram: hidden tags tag1 → tag2 → tag3, each emitting word1, word2, word3]
  • Clever tricks for reducing the number of
    parameters (aka priors); a minimal tagging sketch
    follows
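To make the HMM concrete, a toy bigram tagger with Viterbi decoding; the tag set, probabilities, and sentence are invented for illustration (real taggers estimate these from a tagged corpus, as on the next slide):

```python
# Toy bigram HMM tagger: find the tag sequence maximizing
# prod_i p(tag_i | tag_{i-1}) * p(word_i | tag_i) via Viterbi.
import math

tags = ["NOUN", "VERB"]
trans = {("<s>", "NOUN"): 0.7, ("<s>", "VERB"): 0.3,
         ("NOUN", "NOUN"): 0.3, ("NOUN", "VERB"): 0.7,
         ("VERB", "NOUN"): 0.6, ("VERB", "VERB"): 0.4}
emit = {("NOUN", "dogs"): 0.6, ("VERB", "dogs"): 0.1,
        ("NOUN", "bark"): 0.2, ("VERB", "bark"): 0.8}

def viterbi(words):
    # best[t] = (log-prob of best path ending in tag t, that path)
    best = {t: (math.log(trans[("<s>", t)] * emit[(t, words[0])]), [t])
            for t in tags}
    for w in words[1:]:
        best = {t: max((lp + math.log(trans[(prev, t)] * emit[(t, w)]),
                        path + [t])
                       for prev, (lp, path) in best.items())
                for t in tags}
    return max(best.values())[1]

print(viterbi(["dogs", "bark"]))   # ['NOUN', 'VERB']
```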

45
some details
Charniak et al. (1993) achieved 95% accuracy on
the Brown Corpus with

p(tag_i | word_j) ≈ (number of times word j appears with tag i) / (number of times word j appears)

and, for words never seen in training,

(number of times a word that had never been seen with tag i gets tag i) / (number of such occurrences in total)

plus a modification that uses word suffixes
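A minimal sketch of these count-based estimates over a toy tagged corpus (the corpus is invented):

```python
# Estimate p(tag | word) by counting in a tagged corpus.
from collections import Counter

tagged = [("the", "DT"), ("tag", "NN"), ("is", "VBZ"),
          ("a", "DT"), ("label", "NN"), ("we", "PRP"),
          ("tag", "VBP"), ("words", "NNS")]

word_tag = Counter(tagged)            # times word j appears with tag i
word = Counter(w for w, _ in tagged)  # times word j appears

def p_tag_given_word(tag, w):
    return word_tag[(w, tag)] / word[w]

print(p_tag_given_word("NN", "tag"))   # 0.5: "tag" is NN once out of 2
```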
46
Recent Developments
  • Toutanova et al. (2003) use a dependency network
    and a richer feature set
  • Log-linear model for p(t_i | t_-i, w)
  • The model included, for example, features for
    whether the word contains a number, uppercase
    characters, a hyphen, etc. (a sketch follows)
  • Regularization of the estimation process is critical
  • 96.6% accuracy on the Penn Treebank corpus
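A minimal sketch of word-shape features of this kind; the function and feature names are illustrative, not Toutanova et al.'s actual feature set:

```python
# Word-shape features of the kind used in rich log-linear taggers.
def word_features(w: str) -> dict:
    return {
        "has_number": any(c.isdigit() for c in w),
        "has_upper": any(c.isupper() for c in w),
        "has_hyphen": "-" in w,
        "suffix3": w[-3:],            # suffixes help with unknown words
    }

print(word_features("Mid-1990s"))
# {'has_number': True, 'has_upper': True, 'has_hyphen': True, 'suffix3': '90s'}
```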

47
Named-Entity Classification
  • "Mrs. Frank" is a person
  • "Steptoe and Johnson" is a company
  • "Honduras" is a location
  • etc.
  • Bikel et al. (1998) from BBN: Nymble, a statistical
    approach using HMMs

48
[HMM diagram: hidden name classes nc1 → nc2 → nc3, each emitting word1, word2, word3]
  • Name classes: Not-A-Name, Person, Location,
    etc.
  • Smoothing for sparse training data; word
    features
  • Training: 100,000 words from the WSJ
  • Accuracy: 93%
  • 450,000 words → same accuracy

49
training / development / test split