Title: Sparse Bayesian Classifiers
1 Sparse Bayesian Classifiers
David Madigan, Rutgers University / DIMACS (stat.rutgers.edu/madigan)
David D. Lewis (www.daviddlewis.com)
2 Statistical Analysis of Text
- Statistical text analysis has a long history in literary analysis and in solving disputed authorship problems
- The first (?) such study is Thomas C. Mendenhall's in 1887
3 χ² = 127.2, df = 12
5
- Used Naïve Bayes with Poisson and negative binomial models
- Out-of-sample predictive performance
6 Today
- Statistical methods are routinely used for textual analyses of all kinds
- Machine translation, part-of-speech tagging, information extraction, question answering, text categorization, etc.
- Largely not reported in the statistical literature
7 Text Categorization
- Automatic assignment of documents with respect to a manually defined set of categories
- Applications: automated indexing, spam filtering, content filters, medical coding, CRM, essay grading
- Dominant technology is supervised machine learning: manually classify some documents, then learn a classification rule from them (possibly with manual intervention)
8 Document Representation
- Documents usually represented as a bag of words (sketch below)
- The xi entries might be 0/1 indicators, counts, or weights (e.g. TF/IDF, LSI)
- Many text-processing choices: stopwords, stemming, phrases, synonyms, NLP, etc.
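To make the representation concrete, here is a minimal bag-of-words sketch in Python (the toy tokenizer and four-word vocabulary are illustrative assumptions, not the preprocessing used in the experiments):

    import re
    from collections import Counter

    def bag_of_words(text, vocabulary):
        # Map a document to a vector of term counts over a fixed vocabulary.
        tokens = re.findall(r"[a-z]+", text.lower())  # toy tokenizer: lowercase words only
        counts = Counter(t for t in tokens if t in vocabulary)
        return [counts[term] for term in vocabulary]

    vocab = ["wheat", "corn", "export", "market"]
    x = bag_of_words("Wheat and corn export quotas moved the market.", vocab)
    # x == [1, 1, 1, 1]; entries could instead be 0/1 indicators or TFxIDF weights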
9 Classifier Representation
- For instance, a linear classifier:
  f(xi) = Σj βj xij ;   yi = +1 if f(xi) > 0, else yi = −1
- xi derived from the text of the document
- yi indicates whether to put the document in the category
- βj are parameters chosen to give good classification effectiveness
10 Logistic Regression Model
- Linear model for the log odds of category membership:
  log [ p(y = +1 | xi) / p(y = −1 | xi) ] = Σj βj xij = βᵀxi
- Conditional probability model (a minimal sketch follows)
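A minimal sketch of the resulting conditional probability (plain Python; β is assumed to have been estimated already):

    import math

    def p_positive(beta, x):
        # p(y = +1 | x) = 1 / (1 + exp(-beta'x)) under the logistic model
        score = sum(b * xj for b, xj in zip(beta, x))  # linear predictor beta'x
        return 1.0 / (1.0 + math.exp(-score))

    # log(p / (1 - p)) recovers the linear predictor, i.e. the log odds above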
11 Maximum Likelihood Training
- Choose the parameters (βj) that maximize the probability (likelihood) of the class labels (yi) given the documents (xi)
- Tends to overfit
- Not defined if d > n
- Feature selection is one remedy
12 Shrinkage Methods
- Feature selection is a discrete process: individual variables are either in or out. A combinatorial nightmare.
- It can also have high variance: a different dataset from the same source can produce a totally different model
- Shrinkage methods allow a variable to be partly included in the model; the variable enters, but with a shrunken coefficient
- An elegant way to tackle overfitting
13 Ridge Regression
  minimize Σi (yi − βᵀxi)²   subject to   Σj βj² ≤ s
Equivalently:
  minimize Σi (yi − βᵀxi)² + λ Σj βj²
This leads to:
  β̂ = (XᵀX + λI)⁻¹ Xᵀy
Choose λ by cross-validation. Works even when XᵀX is singular.
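A minimal numpy sketch of the closed-form ridge estimate (illustrative; in practice λ comes from the cross-validation just described):

    import numpy as np

    def ridge(X, y, lam):
        # beta_hat = (X'X + lam * I)^{-1} X'y; adding lam * I makes the
        # system solvable even when X'X itself is singular (e.g. d > n)
        d = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)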
15 Least Absolute Shrinkage and Selection Operator (LASSO)
(Tibshirani, 1996)
  minimize Σi (yi − βᵀxi)²   subject to   Σj |βj| ≤ s
- A quadratic programming algorithm is needed to solve for the parameter estimates
- Modified Gauss-Seidel; highly tuned C implementation (coordinate-update sketch below)
- http://stat.rutgers.edu/madigan/BBR
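BBR itself fits Bayesian logistic regression; purely as an illustration of the cyclic, Gauss-Seidel-style coordinate updates, here is a sketch for the squared-error lasso, minimizing 0.5·||y − Xβ||² + λ·Σj |βj|:

    import numpy as np

    def lasso_cd(X, y, lam, n_sweeps=100):
        # Cyclic coordinate descent: optimize one beta_j at a time with the
        # rest held fixed; each one-dimensional problem has a closed form.
        n, d = X.shape
        beta = np.zeros(d)
        for _ in range(n_sweeps):
            for j in range(d):
                r = y - X @ beta + X[:, j] * beta[j]  # residual with feature j removed
                rho = X[:, j] @ r                     # feature j's fit to that residual
                z = X[:, j] @ X[:, j]                 # assumes no all-zero columns
                beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / z  # soft-threshold
        return beta

The soft-threshold step is what sets coefficients exactly to zero, giving the sparsity that motivates the lasso.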
17 Same as putting a double-exponential (Laplace) prior on each βj
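Spelled out: with independent Laplace priors p(βj) = (λ/2) exp(−λ|βj|), the MAP estimate maximizes

    Σi log p(yi | xi, β) − λ Σj |βj| + constant,

which is exactly the lasso-penalized log-likelihood; the kink of |βj| at zero is what drives many coefficients exactly to zero.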
19 Data Sets
- ModApte subset of Reuters-21578: 90 categories; 9,603 training docs; 18,978 features
- Reuters RCV1-v2: 103 categories; 23,149 training docs; 47,152 features
- OHSUMED heart disease categories: 77 categories; 83,944 training docs; 122,076 features
- Cosine-normalized TFxIDF weights
20 Dense vs. Sparse Models (Macroaveraged F1)
23 Aleks Jakulin
24 Hastie, Friedman, and Tibshirani
25 Domain Knowledge in Text Classification
- Certain words are positively or negatively associated with a category
- Domain knowledge: textual descriptions of the categories
- Prior mean quantifies the strength of the positive or negative association
- Prior variance quantifies our confidence in the domain knowledge
Aynur Dayanik
26 An Example Model (category: grain)
27 Using Domain Knowledge (DK)
- Give domain words a higher prior mean or variance
- Two methods: for each DK term t and category q, and a manually chosen constant C,
  - the first method sets a DK-based prior variance
  - the second method sets a DK-based prior mode
- Here σ² is the prior variance for all other words, chosen by 5-fold CV on the training data
- Used TFxIDF weighting on the prior-knowledge documents to compute significance(t, q) (illustrative sketch below)
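As a rough illustration of the first method (the exact formula is not reproduced here; the combination rule below and the names prior_variance and dk_terms are assumptions for the sketch):

    def prior_variance(term, cat, sigma2, C, significance, dk_terms):
        # Domain-knowledge terms get a DK-based prior variance (weaker
        # shrinkage); every other term keeps the CV-chosen sigma2.
        if term in dk_terms[cat]:
            return C * significance(term, cat)  # assumed rule, not the paper's formula
        return sigma2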
28 Experiments
- Data sets
  - TREC 2004 Genomics data
    - Categories: 32 MeSH categories under the Cells hierarchy
    - Documents: 3,742 training and 4,175 test
    - Prior knowledge: MeSH category descriptions
  - ModApte subset of Reuters-21578
    - Categories: 10 most frequent categories
    - Documents: 9,603 training and 3,299 test
    - Prior knowledge: keywords selected by hand (Wu and Srihari, 2004)
- Big (all training examples) and small training sets
- Limited, biased data is often the case in applications
29 MeSH Prior-Knowledge Example
- MeSH Heading: Neurons
- Scope Note: The basic cellular units of nervous tissue. Each neuron consists of a body, an axon, and dendrites. Their purpose is to receive, conduct, and transmit impulses in the nervous system.
- Entry Term: Nerve Cells
- See Also: Neural Conduction
30 MeSH Results (big training data)
31 MeSH Results (training: 500 random examples)
32 MeSH Results (training: 5 positive and 5 random examples per category)
33 Prior Knowledge for ModApte
34 ModApte Results (training: 100 random samples)
35 ModApte Results (training: 5 positive and 5 random samples per category)
36 Bayesian Priors (per D. M. Titterington)
37 Polytomous Logistic Regression
- Sparse Bayesian (a.k.a. lasso) logistic regression trivially generalizes to 1-of-K problems
- The Laplace prior is particularly appealing here
- Suppose 100 classes and a word that predicts class 17
- The word gets used 100 times if we build 100 binary models, or if we use a polytomous model with a Gaussian prior
- With a Laplace prior and the polytomous model it is used only once
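Concretely, the polytomous (1-of-K) model is

    p(y = k | xi) = exp(βkᵀxi) / Σl exp(βlᵀxi),   k = 1, ..., K,

with one parameter vector βk per class; under a Laplace prior on every βkj, a word predictive only of class 17 typically keeps a nonzero coefficient in β17 alone.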
38 1-of-K Sample Results: brittany-l
89 authors with at least 50 postings; 10,076 training documents; 3,322 test documents. BMR-Laplace classification, default hyperparameter.
42 Cross-Topic Mini-Experiment
43 Cross-Topic Mini-Experiment
44 The Federalist
Joint work with Li Ye
- Mosteller and Wallace attributed all 12 disputed papers to Madison
- The historical evidence is more muddled
- Our results suggest the attribution is highly dependent on the document representation
- Attribution using part-of-speech tags and word suffixes gives better predictions on the undisputed papers and assigns four disputed papers to Hamilton
46 Four papers to Hamilton
48 Hyperparameter Selection
- CV hyperparameter selection is cumbersome and risks overfitting
- One-standard-error rule (see the sketch below)
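A sketch of the one-standard-error rule (lambdas, cv_mean, and cv_se are assumed precomputed over a grid of penalty values):

    import numpy as np

    def one_se_rule(lambdas, cv_mean, cv_se):
        # Pick the most heavily regularized model whose CV error is within
        # one standard error of the best model's CV error.
        best = int(np.argmin(cv_mean))
        threshold = cv_mean[best] + cv_se[best]
        candidates = [lam for lam, m in zip(lambdas, cv_mean) if m <= threshold]
        return max(candidates)  # larger lambda means a sparser model here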
50 Florentina Bunea and Andrew Nobel
51 Hyperparameter Selection
- Hierarchical prior
- Optimization alternates between the coefficients and the hyperparameters
- Improved predictive performance?
Mike West
52 How Bayes?
- EM / ECM / Gauss-Seidel
- MCMC
- Online EM, Quasi-Bayes (Titterington, 1984; Smith and Makov, 1978)
- Sequential MC (Chopin, 2002; Ridgeway and Madigan, 2003)
53 Approximate Online Sparse Bayes
- Quasi-Bayes: optimal Gaussian approximation to the posterior as each new observation arrives
- Alternative: quadratic approximation to the log-likelihood of each new observation at the current mode
- Shooting algorithm (Fu, 1998)
54 Shooting
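Schematically, for the squared-error lasso the shooting algorithm cycles through the coordinates, soft-thresholding each one with the others held fixed:

    βj ← sign(ρj) · max(|ρj| − λ, 0) / (xjᵀxj),   where ρj = xjᵀ(y − Σ_{l≠j} xl βl)

This is the same per-coordinate update sketched for BBR above, which in the online setting would be applied to the accumulated quadratic approximation of the log-likelihood.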
55 Pima (UCI); batch n = 40, online n = 160
56 Sequential MC
- Time: accumulating data
- Standard particle filtering ideas apply
- Need some way to deal with degeneracy
- Gilks and Berzuini (2001): resample-move is effective but not a one-pass algorithm
- Balakrishnan and Madigan (2004) use the Liu and West density-estimation shrinkage idea to make a one-pass version
57 Text Categorization Summary
- Conditional probability models (logistic, probit, etc.)
- As powerful as other discriminative models (SVM, boosting, etc.)
- The Bayesian framework provides a much richer ability to insert task knowledge
- Code: http://stat.rutgers.edu/madigan/BBR
- Polytomous and domain-specific priors now available
58 The Last Slide
- Statistical methods for text mining work well on certain types of problems
- Many problems remain unsolved:
  - Which financial news stories are likely to impact the market?
  - Where did soccer originate?
  - Attribution