Sparse Bayesian Classifiers - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Sparse Bayesian Classifiers


1
Sparse Bayesian Classifiers
David Madigan Rutgers University
DIMACS stat.rutgers.edu/madigan
David D. Lewis www.daviddlewis.com
2
Statistical Analysis of Text
  • Statistical text analysis has a long history in literary analysis and in solving disputed authorship problems
  • The first (?) such study was Thomas C. Mendenhall's in 1887

3
χ² = 127.2, df = 12
4
(No Transcript)
5
  • Used Naïve Bayes with Poisson and negative binomial models
  • Out-of-sample predictive performance

6
Today
  • Statistical methods routinely used for textual
    analyses of all kinds
  • Machine translation, part-of-speech tagging,
    information extraction, question-answering, text
    categorization, etc.
  • Not reported in the statistical literature

7
Text Categorization
  • Automatic assignment of documents to a manually defined set of categories
  • Applications: automated indexing, spam filtering, content filters, medical coding, CRM, essay grading
  • The dominant technology is supervised machine learning
  • Manually classify some documents, then learn a classification rule from them (possibly with manual intervention)

8
Document Representation
  • Documents are usually represented as a bag of words
  • The x_i's might be 0/1, counts, or weights (e.g. tf-idf, LSI)
  • Many text-processing choices: stopwords, stemming, phrases, synonyms, NLP, etc. (a minimal sketch of the bag-of-words step follows below)
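
A minimal sketch of the bag-of-words step described above (the vocabulary and documents are invented for illustration; a real system would also apply the stopword/stemming choices listed):

from collections import Counter

docs = ["wheat prices rose sharply", "corn and wheat exports fell"]

# Fix a vocabulary and its ordering from the training documents.
vocab = sorted({w for d in docs for w in d.split()})

def bag_of_words(text, binary=False):
    """Return a count (or 0/1) vector over the fixed vocabulary."""
    counts = Counter(text.split())
    return [(1 if counts[w] else 0) if binary else counts[w] for w in vocab]

x = bag_of_words("wheat exports rose")  # words outside the vocabulary are ignored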

9
Classifier Representation
  • For instance, a linear classifier:

f(x_i) = Σ_j β_j x_ij;  assign y_i = 1 if f(x_i) > 0, else y_i = -1
  • The x_i's are derived from the text of the document
  • y_i indicates whether to put the document in the category
  • The β_j are parameters chosen to give good classification effectiveness
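
A sketch of that decision rule, with x and beta as plain Python lists (hypothetical inputs, not the tuned implementation referenced later):

def linear_score(x, beta):
    """f(x) = sum_j beta_j * x_j"""
    return sum(b * v for b, v in zip(beta, x))

def classify(x, beta):
    """Assign +1 if the score is positive, otherwise -1."""
    return 1 if linear_score(x, beta) > 0 else -1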

10
Logistic Regression Model
  • Linear model for log odds of category membership

log [ p(y = 1 | x_i) / p(y = -1 | x_i) ] = Σ_j β_j x_ij = βᵀx_i
  • Conditional probability model
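
The same linear score read as a conditional probability; a sketch under the +1/-1 label coding used above:

import math

def log_odds(x, beta):
    """log [ p(y=+1|x) / p(y=-1|x) ] = sum_j beta_j * x_j"""
    return sum(b * v for b, v in zip(beta, x))

def p_positive(x, beta):
    """p(y = +1 | x) = 1 / (1 + exp(-beta . x))"""
    return 1.0 / (1.0 + math.exp(-log_odds(x, beta)))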

11
Maximum Likelihood Training
  • Choose parameters (the β_j's) that maximize the probability (likelihood) of the class labels (y_i's) given the documents (x_i's); see the sketch below
  • Tends to overfit
  • Not defined if d > n
  • Feature selection
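
A sketch of the objective behind "maximize the likelihood": the negative log-likelihood of logistic regression under +1/-1 labels, which a trainer would minimize over beta (no penalty term, so with d > n or small n it overfits exactly as the slide warns):

import math

def neg_log_likelihood(beta, X, y):
    """-sum_i log p(y_i | x_i), with y_i in {+1, -1}."""
    total = 0.0
    for xi, yi in zip(X, y):
        score = sum(b * v for b, v in zip(beta, xi))
        total += math.log(1.0 + math.exp(-yi * score))  # -log p(y_i | x_i)
    return total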

12
Shrinkage Methods
  • Feature selection is a discrete process: individual variables are either in or out. A combinatorial nightmare.
  • It can have high variance: a different dataset from the same source can result in a totally different model
  • Shrinkage methods allow a variable to be partly included in the model. That is, the variable is included but with a shrunken coefficient
  • An elegant way to tackle overfitting


13
Ridge Regression
minimize Σ_i (y_i − βᵀx_i)²   subject to   Σ_j β_j² ≤ s
Equivalently: minimize Σ_i (y_i − βᵀx_i)² + λ Σ_j β_j²
This leads to β̂ = (XᵀX + λI)⁻¹ Xᵀy. Choose λ by cross-validation.
Works even when XᵀX is singular.
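
A sketch of that closed form in numpy (λ would be chosen by cross-validation as above):

import numpy as np

def ridge_fit(X, y, lam):
    """beta_hat = (X'X + lam I)^(-1) X'y; well-defined for lam > 0 even if X'X is singular."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)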
14
(Figure: coefficient shrinkage as a function of the bound s)
15
Least Absolute Shrinkage Selection Operator
(LASSO)
(Tibshirani, 1996)
minimize Σ_i (y_i − βᵀx_i)²   subject to   Σ_j |β_j| ≤ s
  • A quadratic programming algorithm is needed to solve for the parameter estimates
  • Modified Gauss-Seidel; highly tuned C implementation (a sketch of the penalized form follows below)
  • http://stat.rutgers.edu/madigan/BBR
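
A sketch of the equivalent penalized form of the lasso objective (least-squares version shown for brevity; the BBR software applies the same L1 penalty to the logistic likelihood):

import numpy as np

def lasso_objective(beta, X, y, lam):
    """sum_i (y_i - x_i . beta)^2 + lam * sum_j |beta_j|"""
    resid = y - X @ beta
    return float(resid @ resid + lam * np.abs(beta).sum())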

16
(No Transcript)
17
Same as putting a double-exponential (Laplace) prior on each β_j (see the sketch below)
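
A sketch of the standard algebra behind that equivalence (not reproduced from the slide): with independent Laplace priors

  p(\beta_j) = \frac{\lambda}{2}\, e^{-\lambda |\beta_j|},

the negative log-posterior is

  -\log p(\beta \mid \text{data}) = -\log p(\text{data} \mid \beta) + \lambda \sum_j |\beta_j| + \text{const},

so the MAP estimate of β is exactly the L1-penalized (lasso) solution.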
18
(No Transcript)
19
Data Sets
  • ModApte subset of Reuters-21578
  • 90 categories; 9,603 training docs; 18,978 features
  • Reuters RCV1-v2
  • 103 categories; 23,149 training docs; 47,152 features
  • OHSUMED heart disease categories
  • 77 categories; 83,944 training docs; 122,076 features
  • Cosine-normalized TFxIDF weights (see the sketch below)
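
A sketch of the cosine-normalized TFxIDF weighting mentioned in the last bullet (a common variant; the exact TF and IDF formulas used in these experiments are not specified here):

import math

def tfidf_cosine(doc_counts, doc_freq, n_docs):
    """doc_counts: term -> count in this document; doc_freq: term -> number of docs containing it."""
    w = {t: c * math.log(n_docs / doc_freq[t]) for t, c in doc_counts.items()}
    norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
    return {t: v / norm for t, v in w.items()}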

20
Dense vs. Sparse Models (Macroaveraged F1)
21
(No Transcript)
22
(No Transcript)
23
Aleks Jakulin
24
Hastie, Friedman & Tibshirani
25
Domain Knowledge in Text Classification
  • Certain words are positively or negatively associated with a category
  • Domain knowledge: textual descriptions for categories
  • The prior mean quantifies the strength of the positive or negative association
  • The prior variance quantifies our confidence in the domain knowledge

Aynur Dayanik
26
An Example Model (category: grain)
27
Using Domain Knowledge (DK)
  • Give domain words a higher prior mean or variance
  • Two methods: for each DK term t and category q, and a manually chosen constant C,
  • the first method sets a DK-based prior variance
  • the second method sets a DK-based prior mode
  • Here σ² is the variance for all other words, chosen by 5-fold CV on the training data
  • Used TFxIDF weighting on the prior-knowledge documents to compute significance(t, q)
(a hypothetical sketch of this prior construction follows below)
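
A hypothetical sketch of assembling such per-feature priors; significance, C, and sigma2 stand in for the quantities described above, and the exact formulas from the slide are not reproduced:

def dk_priors(vocab, dk_terms, significance, C, sigma2, method="variance"):
    """Return per-feature (prior mean, prior variance).
    method="variance": DK terms get an enlarged, significance-scaled variance.
    method="mode":     DK terms get a nonzero, significance-scaled prior mean."""
    means, variances = {}, {}
    for t in vocab:
        if t in dk_terms and method == "variance":
            means[t], variances[t] = 0.0, C * significance[t]
        elif t in dk_terms:  # method == "mode"
            means[t], variances[t] = C * significance[t], sigma2
        else:
            means[t], variances[t] = 0.0, sigma2
    return means, variances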

28
Experiments
  • Data sets:
  • TREC 2004 Genomics data
  • Categories: 32 MeSH categories under the Cells hierarchy
  • Documents: 3,742 training and 4,175 test
  • Prior knowledge: MeSH category descriptions
  • ModApte subset of Reuters-21578
  • Categories: 10 most frequent categories
  • Documents: 9,603 training and 3,299 test
  • Prior knowledge: keywords selected by hand (Wu & Srihari, 2004)
  • Big (all training examples) and small training sets
  • Limited, biased data are common in applications

29
MeSH Prior Knowledge Example
  • MeSH Heading: Neurons
  • Scope Note: The basic cellular units of nervous tissue. Each neuron consists of a body, an axon, and dendrites. Their purpose is to receive, conduct, and transmit impulses in the nervous system.
  • Entry Term: Nerve Cells
  • See Also: Neural Conduction

30
MeSH Results (Big training data)
31
MeSH Results (training: 500 random examples)
32
MeSH Results (training: 5 positive and 5 random examples per category)
33
Prior Knowledge for ModApte
34
ModApte Results (training: 100 random samples)
35
ModApte Results (training: 5 positive and 5 random samples per category)
36
Bayesian Priors (per D.M. Titterington)
37
Polytomous Logistic Regression
  • Sparse Bayesian (a.k.a. lasso) logistic regression trivially generalizes to 1-of-K problems
  • The Laplace prior is particularly appealing here
  • Suppose there are 100 classes and a word that predicts class 17
  • The word gets used 100 times if we build 100 binary models, or if we use a polytomous model with a Gaussian prior
  • With a Laplace prior and a polytomous model it is used only once (see the sketch below)
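
A sketch of the polytomous model's class probabilities; B is a hypothetical K x d weight matrix, and with a Laplace prior most of its entries are driven exactly to zero, so a word like the one above keeps a nonzero weight for only one class:

import math

def class_probs(x, B):
    """Softmax over K linear scores: p(y = k | x) proportional to exp(B[k] . x)."""
    scores = [sum(b * v for b, v in zip(row, x)) for row in B]
    m = max(scores)                        # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]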

38
1-of-K Sample Results: brittany-l
89 authors with at least 50 postings. 10,076
training documents, 3,322 test documents.
BMR-Laplace classification, default
hyperparameter
39
(No Transcript)
40
(No Transcript)
41
(No Transcript)
42
Cross-Topic Mini-Experiment
43
Cross-Topic Mini-Experiment
44
The Federalist
Joint work with Li Ye
  • Mosteller and Wallace attributed all 12 disputed
    papers to Madison
  • Historical evidence is more muddled
  • Our results suggest attribution is highly
    dependent on the document representation
  • Attribution using part-of-speech tags and word suffixes gives better predictions on the undisputed papers and assigns four disputed papers to Hamilton

45
(No Transcript)
46
four papers to Hamilton
47
(No Transcript)
48
Hyperparameter Selection
  • CV hyperparameter selection is cumbersome and
    risks overfitting
  • One-standard-error rule: pick the most regularized model whose CV error is within one standard error of the best (see the sketch below)
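
A sketch of that rule, with illustrative names:

def one_se_rule(lambdas, cv_means, cv_ses):
    """lambdas sorted from most to least regularization;
    cv_means / cv_ses: CV error and its standard error for each lambda."""
    best = min(range(len(lambdas)), key=lambda i: cv_means[i])
    threshold = cv_means[best] + cv_ses[best]
    for i, lam in enumerate(lambdas):   # first (most regularized) lambda under the threshold
        if cv_means[i] <= threshold:
            return lam
    return lambdas[best]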

49
(No Transcript)
50
Florentina Bunea and Andrew Nobel
51
Hyperparameter Selection
  • Hierarchical prior
  • Optimization alternates between the regression coefficients and the prior hyperparameters
  • Improved predictive performance?

Mike West
52
How Bayes?
EM / ECM / Gauss-Seidel
MCMC
Online EM, Quasi-Bayes (Titterington, 1984; Smith & Makov, 1978)
Sequential MC (Chopin, 2002; Ridgeway & Madigan, 2003)
53
Approximate Online Sparse Bayes
  • Quasi-Bayes: optimal Gaussian approximation to the posterior as each new observation arrives
  • Alternative: quadratic approximation to the log-likelihood of each new observation at the current mode

Shooting algorithm (Fu, 1998)
54
Shooting
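
A sketch of the shooting idea for the lasso with squared-error loss: cycle through the coordinates and soft-threshold each one while holding the others fixed. This is an illustrative implementation, not the tuned C code referenced earlier:

import numpy as np

def shooting_lasso(X, y, lam, n_sweeps=100):
    """Coordinate-wise minimization of ||y - X beta||^2 + lam * ||beta||_1.
    Assumes no all-zero columns in X."""
    n, d = X.shape
    beta = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0)                     # sum_i x_ij^2 for each j
    for _ in range(n_sweeps):
        for j in range(d):
            r = y - X @ beta + X[:, j] * beta[j]      # residual with coordinate j removed
            rho = X[:, j] @ r
            beta[j] = np.sign(rho) * max(abs(rho) - lam / 2.0, 0.0) / col_sq[j]
        # a convergence check on the change in beta would normally go here
    return beta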
55
Pima (UCI), batch n = 40, online n = 160
56
Sequential MC
  • Data accumulate over time
  • Standard particle filtering ideas apply
  • Need some way to deal with degeneracy
  • Gilks and Berzuini (2001): resample-move is effective but not a one-pass algorithm
  • Balakrishnan & Madigan (2004) use the Liu & West density-estimation shrinkage idea to make a one-pass version

57
Text Categorization Summary
  • Conditional probability models (logistic, probit, etc.)
  • As powerful as other discriminative models (SVM, boosting, etc.)
  • The Bayesian framework provides a much richer ability to insert task knowledge
  • Code: http://stat.rutgers.edu/madigan/BBR
  • Polytomous models and domain-specific priors are now available

58
The Last Slide
  • Statistical methods for text mining work well on
    certain types of problems
  • Many problems remain unsolved
  • Which financial news stories are likely to impact
    the market?
  • Where did soccer originate?
  • Attribution