Title: Sparse Bayesian Classifiers
1 Sparse Bayesian Classifiers
David Madigan, Rutgers University / DIMACS (stat.rutgers.edu/madigan)
David D. Lewis (www.daviddlewis.com)
2 Statistical Analysis of Text
- Statistical text analysis has a long history in literary analysis and in solving disputed authorship problems
- The first (?) such study is Thomas C. Mendenhall's in 1887
3 χ² = 127.2, df = 12
5
- Used Naïve Bayes with Poisson and negative binomial models
- Out-of-sample predictive performance
6 Today
- Statistical methods are routinely used for textual analyses of all kinds
- Machine translation, part-of-speech tagging, information extraction, question answering, text categorization, etc.
- Largely not reported in the statistical literature
7 Text Categorization
- Automatic assignment of documents with respect to a manually defined set of categories
- Applications: automated indexing, spam filtering, content filters, medical coding, CRM, essay grading
- Dominant technology is supervised machine learning: manually classify some documents, then learn a classification rule from them (possibly with manual intervention)
8 Document Representation
- Documents usually represented as a bag of words (sketch below)
- The xi entries might be 0/1 indicators, counts, or weights (e.g. TF/IDF, LSI)
- Many text-processing choices: stopwords, stemming, phrases, synonyms, NLP, etc.
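To make the representation concrete, here is a minimal bag-of-words sketch in Python (the toy tokenizer and four-word vocabulary are illustrative assumptions, not the preprocessing used in the experiments):

    import re
    from collections import Counter

    def bag_of_words(text, vocabulary):
        # Map a document to a vector of term counts over a fixed vocabulary.
        tokens = re.findall(r"[a-z]+", text.lower())  # toy tokenizer: lowercase words only
        counts = Counter(t for t in tokens if t in vocabulary)
        return [counts[term] for term in vocabulary]

    vocab = ["wheat", "corn", "export", "market"]
    x = bag_of_words("Wheat and corn export quotas moved the market.", vocab)
    # x == [1, 1, 1, 1]; entries could instead be 0/1 indicators or TFxIDF weights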
9 Classifier Representation
- For instance, a linear classifier:
  f(xi) = Σj βj xij ;   yi = +1 if f(xi) > 0, else yi = −1
- xi derived from the text of the document
- yi indicates whether to put the document in the category
- βj are parameters chosen to give good classification effectiveness
10 Logistic Regression Model
- Linear model for the log odds of category membership:
  log [ p(y = +1 | xi) / p(y = −1 | xi) ] = Σj βj xij = βᵀxi
- Conditional probability model (a minimal sketch follows)
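A minimal sketch of the resulting conditional probability (plain Python; β is assumed to have been estimated already):

    import math

    def p_positive(beta, x):
        # p(y = +1 | x) = 1 / (1 + exp(-beta'x)) under the logistic model
        score = sum(b * xj for b, xj in zip(beta, x))  # linear predictor beta'x
        return 1.0 / (1.0 + math.exp(-score))

    # log(p / (1 - p)) recovers the linear predictor, i.e. the log odds above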
11 Maximum Likelihood Training
- Choose the parameters (βj) that maximize the probability (likelihood) of the class labels (yi) given the documents (xi)
- Tends to overfit
- Not defined if d > n
- Feature selection is one remedy
12 Shrinkage Methods
- Feature selection is a discrete process: individual variables are either in or out. A combinatorial nightmare.
- It can also have high variance: a different dataset from the same source can produce a totally different model
- Shrinkage methods allow a variable to be partly included in the model; the variable enters, but with a shrunken coefficient
- An elegant way to tackle overfitting
13 Ridge Regression
  minimize Σi (yi − βᵀxi)²   subject to   Σj βj² ≤ s
Equivalently:
  minimize Σi (yi − βᵀxi)² + λ Σj βj²
This leads to:
  β̂ = (XᵀX + λI)⁻¹ Xᵀy
Choose λ by cross-validation. Works even when XᵀX is singular.
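A minimal numpy sketch of the closed-form ridge estimate (illustrative; in practice λ comes from the cross-validation just described):

    import numpy as np

    def ridge(X, y, lam):
        # beta_hat = (X'X + lam * I)^{-1} X'y; adding lam * I makes the
        # system solvable even when X'X itself is singular (e.g. d > n)
        d = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)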
15 Least Absolute Shrinkage and Selection Operator (LASSO)
(Tibshirani, 1996)
  minimize Σi (yi − βᵀxi)²   subject to   Σj |βj| ≤ s
- A quadratic programming algorithm is needed to solve for the parameter estimates
- Modified Gauss-Seidel; highly tuned C implementation (coordinate-update sketch below)
- http://stat.rutgers.edu/madigan/BBR
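BBR itself fits Bayesian logistic regression; purely as an illustration of the cyclic, Gauss-Seidel-style coordinate updates, here is a sketch for the squared-error lasso, minimizing 0.5·||y − Xβ||² + λ·Σj |βj|:

    import numpy as np

    def lasso_cd(X, y, lam, n_sweeps=100):
        # Cyclic coordinate descent: optimize one beta_j at a time with the
        # rest held fixed; each one-dimensional problem has a closed form.
        n, d = X.shape
        beta = np.zeros(d)
        for _ in range(n_sweeps):
            for j in range(d):
                r = y - X @ beta + X[:, j] * beta[j]  # residual with feature j removed
                rho = X[:, j] @ r                     # feature j's fit to that residual
                z = X[:, j] @ X[:, j]                 # assumes no all-zero columns
                beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / z  # soft-threshold
        return beta

The soft-threshold step is what sets coefficients exactly to zero, giving the sparsity that motivates the lasso.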
17 Same as putting a double-exponential (Laplace) prior on each βj
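Spelled out: with independent Laplace priors p(βj) = (λ/2) exp(−λ|βj|), the MAP estimate maximizes

    Σi log p(yi | xi, β) − λ Σj |βj| + constant,

which is exactly the lasso-penalized log-likelihood; the kink of |βj| at zero is what drives many coefficients exactly to zero.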
19 Data Sets
- ModApte subset of Reuters-21578: 90 categories; 9,603 training docs; 18,978 features
- Reuters RCV1-v2: 103 categories; 23,149 training docs; 47,152 features
- OHSUMED heart disease categories: 77 categories; 83,944 training docs; 122,076 features
- Cosine-normalized TFxIDF weights
20 Dense vs. Sparse Models (Macroaveraged F1)
23 Aleks Jakulin
24 Hastie, Friedman, and Tibshirani
25 Domain Knowledge in Text Classification
- Certain words are positively or negatively associated with a category
- Domain knowledge: textual descriptions of the categories
- Prior mean quantifies the strength of the positive or negative association
- Prior variance quantifies our confidence in the domain knowledge
Aynur Dayanik
26 An Example Model (category: grain)
27 Using Domain Knowledge (DK)
- Give domain words a higher prior mean or variance
- Two methods: for each DK term t and category q, and a manually chosen constant C,
  - the first method sets a DK-based prior variance
  - the second method sets a DK-based prior mode
- Here σ² is the prior variance for all other words, chosen by 5-fold CV on the training data
- Used TFxIDF weighting on the prior-knowledge documents to compute significance(t, q) (illustrative sketch below)
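As a rough illustration of the first method (the exact formula is not reproduced here; the combination rule below and the names prior_variance and dk_terms are assumptions for the sketch):

    def prior_variance(term, cat, sigma2, C, significance, dk_terms):
        # Domain-knowledge terms get a DK-based prior variance (weaker
        # shrinkage); every other term keeps the CV-chosen sigma2.
        if term in dk_terms[cat]:
            return C * significance(term, cat)  # assumed rule, not the paper's formula
        return sigma2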
28 Experiments
- Data sets
  - TREC 2004 Genomics data
    - Categories: 32 MeSH categories under the Cells hierarchy
    - Documents: 3,742 training and 4,175 test
    - Prior knowledge: MeSH category descriptions
  - ModApte subset of Reuters-21578
    - Categories: 10 most frequent categories
    - Documents: 9,603 training and 3,299 test
    - Prior knowledge: keywords selected by hand (Wu and Srihari, 2004)
- Big (all training examples) and small training sets
- Limited, biased data is often the case in applications
29 MeSH Prior-Knowledge Example
- MeSH Heading: Neurons
- Scope Note: The basic cellular units of nervous tissue. Each neuron consists of a body, an axon, and dendrites. Their purpose is to receive, conduct, and transmit impulses in the nervous system.
- Entry Term: Nerve Cells
- See Also: Neural Conduction
30 MeSH Results (big training data)
31 MeSH Results (training: 500 random examples)
32 MeSH Results (training: 5 positive and 5 random examples per category)
33 Prior Knowledge for ModApte
34 ModApte Results (training: 100 random samples)
35 ModApte Results (training: 5 positive and 5 random samples per category)
36 Bayesian Priors (per D. M. Titterington)
37 Polytomous Logistic Regression
- Sparse Bayesian (a.k.a. lasso) logistic regression trivially generalizes to 1-of-K problems
- The Laplace prior is particularly appealing here
- Suppose 100 classes and a word that predicts class 17
- The word gets used 100 times if we build 100 binary models, or if we use a polytomous model with a Gaussian prior
- With a Laplace prior and the polytomous model it is used only once
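Concretely, the polytomous (1-of-K) model is

    p(y = k | xi) = exp(βkᵀxi) / Σl exp(βlᵀxi),   k = 1, ..., K,

with one parameter vector βk per class; under a Laplace prior on every βkj, a word predictive only of class 17 typically keeps a nonzero coefficient in β17 alone.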
38 1-of-K Sample Results: brittany-l
89 authors with at least 50 postings; 10,076 training documents; 3,322 test documents. BMR-Laplace classification, default hyperparameter.
42 Cross-Topic Mini-Experiment
43 Cross-Topic Mini-Experiment
44 The Federalist
Joint work with Li Ye
- Mosteller and Wallace attributed all 12 disputed papers to Madison
- The historical evidence is more muddled
- Our results suggest the attribution is highly dependent on the document representation
- Attribution using part-of-speech tags and word suffixes gives better predictions on the undisputed papers and assigns four disputed papers to Hamilton
46 Four papers to Hamilton
48 Hyperparameter Selection
- CV hyperparameter selection is cumbersome and risks overfitting
- One-standard-error rule (see the sketch below)
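A sketch of the one-standard-error rule (lambdas, cv_mean, and cv_se are assumed precomputed over a grid of penalty values):

    import numpy as np

    def one_se_rule(lambdas, cv_mean, cv_se):
        # Pick the most heavily regularized model whose CV error is within
        # one standard error of the best model's CV error.
        best = int(np.argmin(cv_mean))
        threshold = cv_mean[best] + cv_se[best]
        candidates = [lam for lam, m in zip(lambdas, cv_mean) if m <= threshold]
        return max(candidates)  # larger lambda means a sparser model here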
50 Florentina Bunea and Andrew Nobel
51 Hyperparameter Selection
- Hierarchical prior
- Optimization alternates between the coefficients and the hyperparameters
- Improved predictive performance?
Mike West
52 How Bayes?
- EM / ECM / Gauss-Seidel
- MCMC
- Online EM, Quasi-Bayes (Titterington, 1984; Smith and Makov, 1978)
- Sequential MC (Chopin, 2002; Ridgeway and Madigan, 2003)
53 Approximate Online Sparse Bayes
- Quasi-Bayes: optimal Gaussian approximation to the posterior as each new observation arrives
- Alternative: quadratic approximation to the log-likelihood of each new observation at the current mode
- Shooting algorithm (Fu, 1998)
54 Shooting
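Schematically, for the squared-error lasso the shooting algorithm cycles through the coordinates, soft-thresholding each one with the others held fixed:

    βj ← sign(ρj) · max(|ρj| − λ, 0) / (xjᵀxj),   where ρj = xjᵀ(y − Σ_{l≠j} xl βl)

This is the same per-coordinate update sketched for BBR above, which in the online setting would be applied to the accumulated quadratic approximation of the log-likelihood.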
55 Pima (UCI); batch n = 40, online n = 160
56 Sequential MC
- Time: accumulating data
- Standard particle filtering ideas apply
- Need some way to deal with degeneracy
- Gilks and Berzuini (2001): resample-move is effective but not a one-pass algorithm
- Balakrishnan and Madigan (2004) use the Liu and West density-estimation shrinkage idea to make a one-pass version
57 Text Categorization Summary
- Conditional probability models (logistic, probit, etc.)
- As powerful as other discriminative models (SVM, boosting, etc.)
- The Bayesian framework provides a much richer ability to insert task knowledge
- Code: http://stat.rutgers.edu/madigan/BBR
- Polytomous and domain-specific priors now available
58 The Last Slide
- Statistical methods for text mining work well on certain types of problems
- Many problems remain unsolved:
  - Which financial news stories are likely to impact the market?
  - Where did soccer originate?
  - Attribution