Statistical Methods for Text Mining - PowerPoint PPT Presentation

Transcript and Presenter's Notes
1
Statistical Methods for Text Mining
David Madigan Rutgers University
DIMACS stat.rutgers.edu/madigan
David D. Lewis www.daviddlewis.com
joint work with Alex Genkin, Paul Kantor,
Vladimir Menkov, Aynur Dayanik, Dmitriy Fradkin
2
Statistical Analysis of Text
  • Statistical text analysis has a long history in
    literary analysis and in solving disputed
    authorship problems
  • The first (?) was Thomas C. Mendenhall in 1887

3
Mendenhall
  • Mendenhall was Professor of Physics at Ohio State
    and at the University of Tokyo, Superintendent of
    the U.S. Coast and Geodetic Survey, and later,
    President of Worcester Polytechnic Institute

Mendenhall Glacier, Juneau, Alaska
4
(No Transcript)
5
  • Used Naïve Bayes with Poisson and Negative
    Binomial models
  • Out-of-sample predictive performance
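A Poisson Naïve Bayes classifier of the kind mentioned above can be sketched in a few lines. This is a toy illustration with invented per-author word rates, not the actual model or data from the study:

```python
import math

def poisson_log_pmf(k, lam):
    # log P(X = k) for a Poisson with rate lam
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def classify(counts, rates_a, rates_b, prior_a=0.5):
    # Naive Bayes: sum per-word Poisson log-likelihoods plus log prior
    score_a = math.log(prior_a)
    score_b = math.log(1 - prior_a)
    for word, k in counts.items():
        score_a += poisson_log_pmf(k, rates_a[word])
        score_b += poisson_log_pmf(k, rates_b[word])
    return "A" if score_a > score_b else "B"

# Hypothetical rates: author A uses "whilst" more often per block of text
rates_a = {"whilst": 3.0, "upon": 1.0}
rates_b = {"whilst": 0.5, "upon": 2.0}
print(classify({"whilst": 4, "upon": 1}, rates_a, rates_b))
```

Out-of-sample evaluation would apply `classify` to documents held out of the rate estimation.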

6
Today
  • Statistical methods routinely used for textual
    analyses of all kinds
  • Machine translation, part-of-speech tagging,
    information extraction, question-answering, text
    categorization, etc.
  • Not reported in the statistical literature

7
Text Categorization
  • Automatic assignment of documents with respect to
    manually defined set of categories
  • Applications: automated indexing, spam filtering,
    content filters, medical coding, CRM, essay
    grading
  • Dominant technology is supervised machine
    learning
  • Manually classify some documents, then learn a
    classification rule from them (possibly with
    manual intervention)

8
Document Representation
  • Documents usually represented as bag of words
  • The xij's might be 0/1, counts, or weights (e.g.
    tf-idf, LSI)
  • Many text processing choices: stopwords,
    stemming, phrases, synonyms, NLP, etc.

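A minimal sketch of the bag-of-words representation with tf-idf weighting (one common variant of the weighting; the toy corpus is invented):

```python
import math
from collections import Counter

def bag_of_words(doc):
    # x_i: term-count vector for one document
    return Counter(doc.lower().split())

def tfidf(doc_counts, doc_freq, n_docs):
    # tf-idf weight: count * log(N / df)
    return {w: tf * math.log(n_docs / doc_freq[w])
            for w, tf in doc_counts.items()}

docs = ["the cat sat", "the dog ran", "the cat ran"]
bags = [bag_of_words(d) for d in docs]
df = Counter(w for b in bags for w in b)   # document frequency
weights = tfidf(bags[0], df, len(docs))
print(weights["the"])  # 0.0 -- "the" appears in every document
```

Words occurring in every document (like stopwords) get weight zero, which is one motivation for IDF weighting.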
9
Classifier Representation
  • For instance, linear classifier

f(xi) = Σj bj xij ;  yi = +1 if f(xi) > 0, else yi = -1
  • xi's derived from text of document
  • yi indicates whether to put document in category
  • bj's are parameters chosen to give good
    classification effectiveness

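The linear rule above can be sketched directly; the weights here are hypothetical, standing in for learned parameters:

```python
def predict(x, b):
    # f(x_i) = sum_j b_j * x_ij ; predict y = +1 if f > 0 else -1
    f = sum(b.get(word, 0.0) * count for word, count in x.items())
    return 1 if f > 0 else -1

# Hypothetical learned weights for a "sports" category
b = {"goal": 0.8, "election": -0.6}
print(predict({"goal": 2, "match": 1}, b))   # 1 -> in the category
```

Words absent from the weight vector contribute nothing, so sparse models score documents quickly.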
10
Logistic Regression Model
  • Linear model for log odds of category membership

log [ p(y=1|xi) / p(y=-1|xi) ] = Σj bj xij = b·xi
  • Conditional probability model
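Inverting the log-odds gives the conditional probability via the logistic (sigmoid) function; a small sketch with hypothetical weights:

```python
import math

def prob_positive(x, b):
    # log-odds: log p(y=1|x)/p(y=-1|x) = b . x
    # so p(y=1|x) = 1 / (1 + exp(-b . x))
    f = sum(b.get(w, 0.0) * v for w, v in x.items())
    return 1.0 / (1.0 + math.exp(-f))

b = {"goal": 0.8, "election": -0.6}   # hypothetical weights
p = prob_positive({"goal": 2}, b)     # f = 1.6, p well above 0.5
```

Unlike a bare linear classifier, this yields calibrated membership probabilities, not just a sign.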

11
Maximum Likelihood Training
  • Choose parameters (bj's) that maximize the
    probability (likelihood) of the class labels
    (yi's) given the documents (xi's)
  • Tends to overfit
  • Not defined if d > n
  • One remedy: feature selection

12
Shrinkage Methods
  • Feature selection is a discrete process:
    individual variables are either in or out. A
    combinatorial nightmare.
  • This method can have high variance: a different
    dataset from the same source can result in a
    totally different model
  • Shrinkage methods allow a variable to be partly
    included in the model. That is, the variable is
    included but with a shrunken coefficient
  • An elegant way to tackle over-fitting


13
Ridge Regression
minimize Σi (yi - b·xi)²  subject to  Σj bj² ≤ s
Equivalently: minimize Σi (yi - b·xi)² + λ Σj bj²
This leads to b = (XTX + λI)⁻¹ XTy. Choose λ by
cross-validation.
works even when XTX is singular
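A minimal pure-Python sketch of the ridge solution for two features, on toy, invented data, showing how the penalty stabilizes the estimate when XTX is nearly singular:

```python
def ridge(X, y, lam):
    # Closed form b = (X'X + lam I)^{-1} X'y, written out for 2 features
    a11 = sum(x[0] * x[0] for x in X) + lam
    a12 = sum(x[0] * x[1] for x in X)
    a22 = sum(x[1] * x[1] for x in X) + lam
    g1 = sum(x[0] * yi for x, yi in zip(X, y))
    g2 = sum(x[1] * yi for x, yi in zip(X, y))
    det = a11 * a22 - a12 * a12          # 2x2 inverse via the determinant
    return [(a22 * g1 - a12 * g2) / det, (a11 * g2 - a12 * g1) / det]

X = [[1.0, 1.0], [1.0, 1.01]]            # nearly collinear columns
b0 = ridge(X, [1.0, 2.0], 0.0)           # unpenalized: huge, unstable b
b1 = ridge(X, [1.0, 2.0], 1.0)           # ridge: small, shrunken b
```

With λ = 0 the near-singular XTX blows the coefficients up; even a modest λ pulls them back toward 0.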
14
15
Ridge Regression = Bayesian MAP Regression
  • Suppose we believe each bj is near 0
  • Encode this belief as separate Gaussian prior
    distributions over values of bj
  • Choosing maximum a posteriori value of the b
    gives same result as ridge logistic regression

same as ridge, with λ determined by the prior
variance (smaller prior variance ⇒ more shrinkage)
16
Least Absolute Shrinkage and Selection Operator
(LASSO)
Tibshirani (1996)
minimize Σi (yi - b·xi)²  subject to  Σj |bj| ≤ s
  • Quadratic programming algorithm needed to solve
    for the parameter estimates
  • Modified Gauss-Seidel; highly tuned C
    implementation

17
(No Transcript)
18
Same as putting a double exponential or Laplace
prior on each bj
19
LARS
  • New geometrical insights into Lasso and
    Stagewise
  • Leads to a highly efficient Lasso algorithm for
    linear regression


20
LARS
  • Start with all coefficients bj = 0
  • Find the predictor xj most correlated with y
  • Increase bj in the direction of the sign of its
    correlation with y. Take residuals r = y - ŷ along
    the way. Stop when some other predictor xk has as
    much correlation with r as xj has
  • Increase (bj, bk) in their joint least squares
    direction until some other predictor xm has as
    much correlation with the residual r
  • Continue until all predictors are in the model

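The correlation-chasing idea above can be sketched via the closely related incremental forward-stagewise procedure (tiny fixed steps toward the most correlated predictor, rather than LARS's exact joint least-squares moves); the data here are made up:

```python
def stagewise(X, y, eps=0.01, steps=1000):
    # Repeatedly bump the coefficient most correlated with the
    # current residual by a small step eps (forward stagewise).
    p = len(X[0])
    b = [0.0] * p
    for _ in range(steps):
        # residual r = y - yhat under the current coefficients
        r = [yi - sum(bj * xj for bj, xj in zip(b, x))
             for x, yi in zip(X, y)]
        # correlation (inner product) of each predictor with r
        corr = [sum(x[j] * ri for x, ri in zip(X, r)) for j in range(p)]
        j = max(range(p), key=lambda k: abs(corr[k]))
        b[j] += eps if corr[j] > 0 else -eps
    return b

X = [[1.0, 0.0], [0.0, 1.0]]      # orthonormal toy design
b = stagewise(X, [1.0, -0.5])     # approaches the least-squares fit
```

With small enough steps, the coefficient paths this traces are closely related to the Lasso path, which is the insight LARS makes exact and efficient.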

21
(No Transcript)
22
Data Sets
  • ModApte subset of Reuters-21578
  • 90 categories; 9,603 training docs; 18,978 features
  • Reuters RCV1-v2
  • 103 cats; 23,149 training docs; 47,152 features
  • OHSUMED heart disease categories
  • 77 cats; 83,944 training docs; 122,076 features
  • Cosine-normalized TFxIDF weights

23
Dense vs. Sparse Models (Macroaveraged F1)
24
(No Transcript)
25
(No Transcript)
26
Hastie, Friedman & Tibshirani
27
Bayesian Unsupervised Feature Selection and
Weighting
  • Stopwords: low-content words that typically are
    discarded
  • Give them a prior with mean 0 and low variance
  • Inverse document frequency (IDF) weighting:
  • Rare words more likely to be content indicators
  • Make variance of prior inversely proportional to
    frequency in collection
  • Experiments in progress

28
Bayesian Use of Domain Knowledge
  • Often believe that certain words are positively
    or negatively associated with category
  • Prior mean can encode strength of positive or
    negative association
  • Prior variance encodes confidence

29
First Experiments
  • 27 RCV1-v2 Region categories
  • CIA World Factbook entry for country
  • Give content words higher mean and/or variance
  • Only 10 training examples per category
  • Shows off prior knowledge
  • Limited data is often the case in applications

30
Results (Preliminary)
31
Polytomous Logistic Regression
  • Logistic regression trivially generalizes to
    1-of-k problems
  • Cleaner than SVMs, error correcting codes, etc.
  • Laplace prior particularly appealing here
  • Suppose 99 classes and a word that predicts class
    17
  • The word gets used 100 times if we build 100
    separate models, or if we use polytomous with a
    Gaussian prior
  • With a Laplace prior and polytomous it's used
    only once
  • Experiments in progress, particularly on author
    id

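The 1-of-K model scores each class with its own linear function and normalizes with a softmax; a minimal sketch with hypothetical weight vectors (under a Laplace prior, most entries would be exactly zero, as the slide argues):

```python
import math

def softmax_probs(x, B):
    # p(y = k | x) proportional to exp(b_k . x) over the K classes
    scores = [sum(bk.get(w, 0.0) * v for w, v in x.items()) for bk in B]
    m = max(scores)                       # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical sparse weights for 3 classes: each word votes for
# at most one class, as a Laplace prior tends to produce
B = [{"goal": 1.2}, {"election": 0.9}, {}]
p = softmax_probs({"goal": 1}, B)
```

Note that the probabilities always sum to one, so the K binary problems are fit jointly rather than independently.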
32
1-of-K Sample Results: brittany-l
89 authors with at least 50 postings. 10,076
training documents, 3,322 test documents.
BMR-Laplace classification, default
hyperparameter
33
1-of-K Sample Results: brittany-l
4.6 million parameters
89 authors with at least 50 postings. 10,076
training documents, 3,322 test documents.
BMR-Laplace classification, default
hyperparameter
34
The Federalist
  • Mosteller and Wallace attributed all 12 disputed
    papers to Madison
  • Historical evidence is more muddled
  • Our results suggest attribution is highly
    dependent on the document representation
  • Attribution using part-of-speech tags and word
    suffixes gives better predictions on the
    undisputed papers and assigns three disputed
    papers to Hamilton

35
Hyperparameter Selection
  • CV hyperparameter selection is cumbersome and
    risks overfitting
  • One standard error rule: choose the simplest
    model whose CV error is within one standard error
    of the minimum

36
(No Transcript)
37
Florentina Bunea, Florida State
38
How Bayes?
EM, ECM, Gauss-Seidel
MCMC
Online EM, Quasi-Bayes (Titterington, 1984;
Smith & Makov, 1978)
Sequential MC (Chopin, 2002; Ridgeway & Madigan,
2003)
39
Approximate Online Sparse Bayes
  • Quasi-Bayes: optimal Gaussian approximation to
    the posterior as each new observation arrives
  • Alternative: quadratic approximation to the
    log-likelihood of each new observation at the
    current mode

Shooting algorithm (Fu, 1998)
40
Shooting
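A sketch of the shooting (coordinate descent) algorithm for the lasso in its textbook form, on toy orthonormal data; this is an illustration of the idea, not the tuned implementation mentioned earlier:

```python
def soft_threshold(z, g):
    # Shrink z toward 0 by g; set it exactly to 0 inside [-g, g]
    return (z - g) if z > g else (z + g) if z < -g else 0.0

def shooting(X, y, lam, sweeps=100):
    # Update one b_j at a time: soft-threshold its least-squares
    # update against the partial residual, holding the others fixed
    n, p = len(X), len(X[0])
    b = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            r = [y[i] - sum(b[k] * X[i][k] for k in range(p) if k != j)
                 for i in range(n)]
            z = sum(X[i][j] * r[i] for i in range(n))
            s = sum(X[i][j] ** 2 for i in range(n))
            b[j] = soft_threshold(z, lam) / s
    return b

# Orthonormal toy design: small coefficients are zeroed out exactly
b = shooting([[1.0, 0.0], [0.0, 1.0]], [2.0, 0.3], lam=0.5)
```

The soft-thresholding step is what makes the lasso produce exact zeros, unlike ridge's proportional shrinkage.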
41
pima (UCI), batch n=40, online n=160
42
Sequential MC
  • Data accumulate over time
  • Standard particle filtering ideas apply
  • Need some way to deal with degeneracy
  • Gilks and Berzuini (2001) resample-move:
    effective but not a one-pass algorithm
  • Balakrishnan & Madigan (2004) uses the Liu & West
    density estimation shrinkage idea to make a
    one-pass version

43
(No Transcript)
44
Liu and West (2000)
45
(No Transcript)
46
(No Transcript)
47
Text Categorization Summary
  • Conditional probability models (logistic,
    probit, etc.)
  • As powerful as other discriminative models (SVM,
    boosting, etc.)
  • Bayesian framework provides a much richer ability
    to insert task knowledge
  • Code: http://stat.rutgers.edu/madigan/BBR
  • Polytomous, domain-specific priors now available

48
The Last Slide
  • Statistical methods for text mining work well on
    certain types of problems
  • Many problems remain unsolved
  • Which financial news stories are likely to impact
    the market?
  • Where did soccer originate?
  • Attribution

49
Outline
  • Part-of-Speech Tagging, Entity Recognition
  • Text categorization
  • Logistic regression and friends
  • The richness of Bayesian regularization
  • Sparseness-inducing priors
  • Word-specific priors: stop words, IDF, domain
    knowledge, etc.
  • Polytomous logistic regression

50
Part-of-Speech Tagging
  • Assign grammatical tags to sequences of words
  • Basic task in the analysis of natural language
    data
  • Phrase identification, entity extraction, etc.
  • Ambiguity: "tag" could be a noun or a verb
  • In "a tag is a part-of-speech label", context
    resolves the ambiguity

51
The Penn Treebank POS Tag Set
52
POS Tagging Algorithms
  • Rule-based taggers: large numbers of hand-crafted
    rules
  • Probabilistic taggers: use a tagged corpus to
    train some sort of model, e.g. an HMM

[HMM diagram: tag1 -> tag2 -> tag3, each tag
emitting its word: word1, word2, word3]

  • Clever tricks for reducing the number of
    parameters (aka priors)

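Once a tagged corpus has supplied transition and emission probabilities, decoding finds the most probable tag sequence via the Viterbi algorithm. A minimal sketch with invented probabilities for a two-tag toy problem:

```python
def viterbi(words, tags, p_start, p_trans, p_emit):
    # Most probable tag path under the HMM:
    # p(tags, words) = p(t1) p(w1|t1) * prod_i p(t_i|t_{i-1}) p(w_i|t_i)
    # V[i][t] = (best probability of any path ending in tag t, that path)
    V = [{t: (p_start[t] * p_emit[t].get(words[0], 1e-6), [t])
          for t in tags}]
    for w in words[1:]:
        row = {}
        for t in tags:
            prob, path = max(
                (V[-1][s][0] * p_trans[s][t] * p_emit[t].get(w, 1e-6),
                 V[-1][s][1] + [t])
                for s in tags)
            row[t] = (prob, path)
        V.append(row)
    return max(V[-1].values())[1]

# Invented tables: nouns tend to start sentences, verbs follow nouns
tags = ["N", "V"]
p_start = {"N": 0.7, "V": 0.3}
p_trans = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.6, "V": 0.4}}
p_emit = {"N": {"dogs": 0.5, "run": 0.1}, "V": {"dogs": 0.1, "run": 0.6}}
print(viterbi(["dogs", "run"], tags, p_start, p_trans, p_emit))
```

Here "run" alone is ambiguous, but the transition table (verbs tend to follow nouns) resolves it, which is exactly the role context plays above.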
53
some details
Charniak et al., 1993, achieved 95% accuracy on
the Brown Corpus with estimates of the form
  (number of times word j appears with tag i) /
  (number of times word j appears)
and, for words never seen in training,
  (number of times a word that had never been seen
   with tag i gets tag i) /
  (number of such occurrences in total)
plus a modification that uses word suffixes
54
Recent Developments
  • Toutanova et al., 2003, use a dependency network
    and a richer feature set
  • Log-linear model for p(ti | t-i, w)
  • Model included, for example, features for
    whether the word contains a number, uppercase
    characters, a hyphen, etc.
  • Regularization of the estimation process is
    critical
  • 96.6% accuracy on the Penn Treebank corpus

55
Named-Entity Classification
  • Mrs. Frank is a person
  • Steptoe and Johnson is a company
  • Honduras is a location
  • etc.
  • Bikel et al. (1998) from BBN: "Nymble", a
    statistical approach using HMMs

56
[HMM diagram: name classes nc1 -> nc2 -> nc3, each
emitting word1, word2, word3]
  • Name classes: Not-A-Name, Person, Location,
    etc.
  • Smoothing for sparse training data; word
    features
  • Training: 100,000 words from WSJ
  • Accuracy: 93%
  • 450,000 words -> same accuracy

57
training-development-test