Text Categorization
Transcript and Presenter's Notes
1
Text Categorization
David Madigan Rutgers University
joint work with Alex Genkin and David Lewis
2
Statistical Analysis of Text
  • Statistical text analysis has a long history in
    literary analysis and in solving disputed
    authorship problems
  • The first (?) was Thomas C. Mendenhall in 1887

3
Mendenhall
  • Mendenhall was Professor of Physics at Ohio State
    and at the University of Tokyo, Superintendent of
    the U.S. Coast and Geodetic Survey, and, later,
    President of Worcester Polytechnic Institute

(Photo: Mendenhall Glacier, Juneau, Alaska)
4
(Figure: χ² = 127.2, df = 12)
5
  • Hamilton versus Madison (the disputed Federalist
    papers)
  • Used Naïve Bayes with Poisson and Negative
    Binomial models
  • Out-of-sample predictive performance

6
Today
  • Statistical methods routinely used for textual
    analyses of all kinds
  • Machine translation, part-of-speech tagging,
    information extraction, question-answering, text
    categorization, disputed authorship (stylometry),
    etc.
  • Not reported in the statistical literature

Mosteller, Wallace, Efron, Thisted
7
Text Categorization
  • Automatic assignment of documents to categories
  • Applications include e-mail filtering,
    pornography detection, medical coding, essay
    grading, and news filtering
  • Modern TC research dates back to the 1960s;
    mostly knowledge-engineering approaches through
    the 1980s
  • The statistical approach now dominates: learn a
    classifier from a set of labeled documents

8
Text Categorization Research
  • Very active research area (e.g., Joachims' 1998
    SVM paper has been cited in over 250
    publications)
  • Statisticians?

Sebastiani's Bibliography on Automated Text
Categorization
9
Terminology, etc.
  • Document representation via "bag of words"
  • The w_i's might be 0/1, counts, or weights (e.g.,
    tf-idf, LSI)
  • Phrases, syntactic information, synonyms, NLP,
    etc. ?
  • Stopwords, stemming

10
Test Collections
  • Reuters-21578: 9603 training documents, 3299 test
    documents, 90 categories, multi-label
  • New Reuters: 800,000 documents
  • Medline: 11,000,000 documents, MeSH headings
  • TREC conferences and collections
  • Newsgroups, WebKB

11
Reuters-21578 Evaluation
  • Binary classifiers: results summarized in a 2x2
    table of true versus predicted labels, with cell
    counts

                   predict 0   predict 1
        true 0         a           c
        true 1         b           d

  • recall = d/(b+d) (sensitivity)
  • precision = d/(c+d) (predictive value positive)
  • Multiple binary classifiers: micro-averaging
    pools the counts across classifiers before forming
    the ratios; in the slide's example, two classifiers
    with p = 1.0 and p = 0.5 (r = 1.0 for both) give
    micro-averaged precision = 2/3
  • F1 measure: harmonic mean of precision and recall
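These quantities are easy to compute directly. A short sketch in plain
Python (the counts are illustrative assumptions) reproduces the example
above, where pooling a classifier at p = 1.0 with one at p = 0.5, both
at r = 1.0, gives micro-averaged precision 2/3:

  def precision_recall(tp, fp, fn):
      # precision = d/(c+d), recall = d/(b+d) in the slide's notation
      p = tp / (tp + fp) if tp + fp else 0.0
      r = tp / (tp + fn) if tp + fn else 0.0
      return p, r

  def f1(p, r):
      # F1 = harmonic mean of precision and recall
      return 2 * p * r / (p + r) if p + r else 0.0

  # two binary classifiers: (true positives, false positives, false negatives)
  counts = [(1, 0, 0), (1, 1, 0)]           # illustrative counts

  for tp, fp, fn in counts:                 # per-classifier: (1.0, 1.0), (0.5, 1.0)
      print(precision_recall(tp, fp, fn))

  # micro-averaging pools the counts before forming the ratios
  TP, FP, FN = (sum(c[i] for c in counts) for i in range(3))
  p, r = precision_recall(TP, FP, FN)
  print(p, r, f1(p, r))                     # 0.667, 1.0, 0.8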
12
Reuters Results
13
AdaBoost.MH
  • Multiclass, multilabel
  • At each iteration, learns a simple score-producing
    classifier on weighted training data and then
    updates the weights
  • Final decision averages over the classifiers (see
    the sketch below)

(Schematic: data and initial weights → score from
simple classifier → revised weights)
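A minimal sketch of this loop with single-word stumps, using numpy.
This is plain two-class AdaBoost on 0/1 word features rather than the
full multiclass/multilabel AdaBoost.MH, and all names are illustrative:

  import numpy as np

  def train_adaboost(X, y, n_rounds=10):
      """X: (n_docs, n_words) 0/1 array; y: array of labels in {-1, +1}."""
      n, m = X.shape
      w = np.full(n, 1.0 / n)                    # initial weights: uniform
      stumps = []
      for _ in range(n_rounds):
          # weak learner: the single word (and sign) with least weighted error
          best = None
          for j in range(m):
              for sign in (1, -1):
                  pred = sign * (2 * X[:, j] - 1)    # +/-1 by word presence
                  err = w[pred != y].sum()
                  if best is None or err < best[0]:
                      best = (err, j, sign)
          err, j, sign = best
          err = np.clip(err, 1e-10, 1 - 1e-10)
          alpha = 0.5 * np.log((1 - err) / err)      # this stump's vote
          pred = sign * (2 * X[:, j] - 1)
          w *= np.exp(-alpha * y * pred)             # upweight the mistakes
          w /= w.sum()
          stumps.append((alpha, j, sign))
      return stumps

  def predict(stumps, X):
      # final decision: sign of the alpha-weighted sum of stump scores
      score = sum(a * s * (2 * X[:, j] - 1) for a, j, s in stumps)
      return np.sign(score)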
14
AdaBoost.MH
Schapire and Singer, 2000
15
AdaBoost.MH's weak learner is a stump
(two words!)
16
AdaBoost.MH Comments
  • Software implementation: BoosTexter
  • Some theoretical support in terms of bounds on
    generalization error
  • 3 days of CPU time for Reuters with 10,000
    boosting iterations

17
Support Vector Machine
Two-class classifier of the form

  f(x) = \sum_i w_i K(x, x_i) + b,   classify by sign(f(x)),

with parameters chosen to minimize

  \sum_i [1 - y_i f(x_i)]_+ + \lambda \sum_{i,j} w_i w_j K(x_i, x_j)

where \lambda is a tuning constant, the second term is
the complexity penalty, and K is the Gram matrix. Many
of the fitted w's are usually zero; the x's
corresponding to the non-zero w's are the support
vectors.
18
(Figure: Hastie, Friedman & Tibshirani)
19
SVM Comments
  • Polynomial (K(x, x') = (1 + ⟨x, x'⟩)^d) or radial
    basis function (K(x, x') = exp(−γ‖x − x'‖²))
    kernels often used
  • In fact, for text categorization not using a
    kernel seems to do a little better than using a
    kernel!
  • Generalization bounds available, but not useful in
    practice?
  • Very similar to a form of ridge logistic
    regression; provides similar Reuters performance
    (Zhang and Oles, 2002)
  • Software: SVM Light, WEKA, etc. (see the sketch
    below)
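For the no-kernel case, a linear SVM over tf-idf features is a few
lines with scikit-learn (used here as a stand-in for SVM Light; the
two-document corpus and labels are placeholders):

  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.svm import LinearSVC

  docs = ["wheat and grain exports rose",     # placeholder documents
          "interest rates were cut again"]
  labels = [1, 0]                             # 1 = in category, 0 = not

  vec = TfidfVectorizer()                     # bag of words, tf-idf weights
  X = vec.fit_transform(docs)

  clf = LinearSVC(C=1.0)                      # C plays the tuning-constant role
  clf.fit(X, labels)
  print(clf.predict(vec.transform(["grain shipments increased"])))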

20
Zhang and Oles
  • Ridge Logistic Regression
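In the usual statement (the slide's own notation is not recoverable
from this transcript; here y_i ∈ {−1, +1}), ridge logistic regression
solves

  \min_\beta \; \sum_i \log\bigl(1 + \exp(-y_i\, \beta^T x_i)\bigr) + \lambda \|\beta\|_2^2

with the Gaussian-prior (ridge) penalty \lambda \|\beta\|_2^2
controlling complexity.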

21
ZO Results
  • 10,000 binary features selected via an information
    gain criterion
  • Numerical optimization: Gauss-Seidel with a trust
    region

22
Tibshirani's LASSO
(least absolute shrinkage and selection operator)
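In its standard penalized form (the slide's figure is not
recoverable), the lasso solves

  \min_w \; \sum_i (y_i - w^T x_i)^2 + \lambda \sum_j |w_j|,

equivalently least squares subject to \sum_j |w_j| \le t; the L1
penalty sets many coordinates of w exactly to zero, so it selects
features and shrinks coefficients at once.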
23
Bayesian Sparse Model
LASSO penalty ↔ Laplace prior
  • Simultaneous feature selection and shrinkage
  • Outperforms SVM and random forests on several
    standard (small) problems using a Gaussian kernel
  • Figueiredo (2001); similar to Tipping's Relevance
    Vector Machine (JMLR, 2001)
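The link, stated in standard notation: with independent Laplace priors
p(w_j) \propto \exp(-\lambda |w_j|), the negative log posterior is

  -\log p(w \mid \text{data}) = -\sum_i \log p(y_i \mid x_i, w) + \lambda \sum_j |w_j| + \text{const},

so the MAP estimate solves a lasso-type problem: feature selection and
shrinkage happen simultaneously.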

24
(Figure: Laplace versus Gaussian prior densities)
25
Bayes versus Sparse Bayes: Old Reuters
  • 20,320 log-tf features
  • Sparse models have from 52 to 357 non-zero
    coefficients at the posterior mode
  • Modified ZO algorithm for Laplace and Probit

26
Bayes versus Sparse Bayes: New Reuters
  • 47,152 log-tf features; 20,000 labeled documents
  • No ad-hoc feature selection
  • With 1,000 labeled documents, F1 = 0.7

27
Dense versus Sparse: OHSUMED
  • 122,076 log-tf features; 20,000 labeled documents
  • No ad-hoc feature selection
  • Median number of features used: 34

28
How Bayes?
EM, ECM, Gauss-Seidel
MCMC, variational methods
Online EM, Quasi-Bayes (Titterington, 1984; Smith &
Makov, 1978)
Sequential MC (Ridgeway & Madigan, 2003; Chopin, 2002)
29
Full-Bayes versus MAP Bayes
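In standard notation (the slide's formulas are not recoverable from
this transcript), full Bayes averages predictions over the posterior,

  p(y \mid x, D) = \int p(y \mid x, w)\, p(w \mid D)\, dw,

while MAP Bayes plugs in the posterior mode, p(y \mid x, \hat{w}_{\mathrm{MAP}}).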
30
Approximate Online Sparse Bayes
  • Quasi-Bayes: optimal Gaussian approximation to
    the posterior as each new observation arrives
  • Alternative: quadratic approximation to the
    log-likelihood of each new observation at the
    current mode

Shooting algorithm (Fu, 1998)
31
Shooting
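A minimal numpy sketch of the shooting (coordinatewise descent)
algorithm for the lasso objective \sum_i (y_i - w^T x_i)^2 + \lambda \sum_j |w_j|;
the zero initialization and fixed sweep count are my simplifications:

  import numpy as np

  def shooting_lasso(X, y, lam, n_sweeps=100):
      """Coordinatewise ('shooting') descent for the lasso."""
      n, m = X.shape
      w = np.zeros(m)
      for _ in range(n_sweeps):
          for j in range(m):
              r = y - X @ w + X[:, j] * w[j]   # residual excluding coordinate j
              c = 2 * X[:, j] @ r
              a = 2 * X[:, j] @ X[:, j]
              if c > lam:                      # soft-threshold the 1-D solution
                  w[j] = (c - lam) / a
              elif c < -lam:
                  w[j] = (c + lam) / a
              else:
                  w[j] = 0.0
      return w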
32
(Plot: pima data (UCI); batch with n = 40, online with
n = 160)
33
Why Bayes?
  • Can incorporate external information, e.g., topic
    descriptions
  • Natural sequential learning paradigm
  • Borrowing strength across topics
  • Simultaneous feature selection and shrinkage via
    sparse priors

34
Conclusions
  • Regularization/shrinkage is critical for
    predictive modeling with HDLSS (high dimension,
    low sample size: "short, fat" data)
  • Sparse Bayesian classifier is highly competitive
    and performs simultaneous feature selection and
    shrinkage
  • Hierarchical partition model for multi-label
    setting
  • Full Bayes versus MAP plug-in

35
Part-of-Speech Tagging
  • Assign grammatical tags to words
  • Basic task in the analysis of natural language
    data
  • Phrase identification, entity extraction, etc.
  • Ambiguity: "tag" could be a noun or a verb
  • A tag here is a part-of-speech label; context
    resolves the ambiguity

36
The Penn Treebank POS Tag Set
37
POS Tagging Process
(Diagram credit: Berlin Chen)
38
POS Tagging Algorithms
  • Rule-based taggers: large numbers of hand-crafted
    rules
  • Probabilistic taggers: use a tagged corpus to
    train some sort of model, e.g., an HMM (see the
    Viterbi sketch below)

(HMM diagram: hidden tags tag1 → tag2 → tag3, each
emitting the corresponding word1, word2, word3)
  • clever tricks for reducing the number of
    parameters
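A compact Viterbi decoder for an HMM of this shape, in plain Python;
the toy probability tables in the usage lines are illustrative
assumptions, not trained values:

  import math

  def viterbi(words, tags, start, trans, emit):
      """start[t] = p(t at position 0), trans[s][t] = p(t | s),
      emit[t][w] = p(w | t); unseen words get a tiny floor probability."""
      # best[t] = (log prob of the best path ending in tag t, that path)
      best = {t: (math.log(start[t] * emit[t].get(words[0], 1e-8)), [t])
              for t in tags}
      for w in words[1:]:
          new = {}
          for t in tags:
              lp, path = max((best[s][0] + math.log(trans[s][t]), best[s][1])
                             for s in tags)
              new[t] = (lp + math.log(emit[t].get(w, 1e-8)), path + [t])
          best = new
      return max(best.values())[1]

  tags = ["N", "V"]
  start = {"N": 0.7, "V": 0.3}
  trans = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.8, "V": 0.2}}
  emit = {"N": {"tag": 0.02, "dogs": 0.02}, "V": {"tag": 0.01, "run": 0.03}}
  print(viterbi(["dogs", "run"], tags, start, trans, emit))   # ['N', 'V']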

39
some details
Charniak et al. (1993) achieved 95% accuracy on the
Brown Corpus with

  p(tag i | word j) = (number of times word j appears
    with tag i) / (number of times word j appears)

and, for unknown words,

  p(tag i | unseen word) = (number of times a word that
    had never been seen with tag i gets tag i) /
    (number of such occurrences in total)

plus a modification that uses word suffixes.
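The first of these estimates is just a ratio of counts; a small sketch
over a list of (word, tag) pairs (the toy corpus is mine):

  from collections import Counter

  corpus = [("the", "DT"), ("tag", "NN"), ("is", "VBZ"),
            ("the", "DT"), ("tag", "VB"), ("it", "PRP")]   # toy tagged corpus

  word_tag = Counter(corpus)               # times word j appears with tag i
  word = Counter(w for w, _ in corpus)     # times word j appears

  def p_tag_given_word(tag, w):
      return word_tag[(w, tag)] / word[w]

  print(p_tag_given_word("NN", "tag"))     # 0.5
  print(p_tag_given_word("VB", "tag"))     # 0.5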
40
Recent Developments
  • Toutanova et al. (2003) use a dependency
    network and a richer feature set
  • Log-linear model for p(t_i | t_{−i}, w)
  • Model included, for example, features for whether
    the word contains a number, uppercase characters,
    a hyphen, etc. (up to 300,000 features; see the
    sketch below)
  • Regularization of the estimation process is
    critical (Gaussian priors)
  • 96.6% accuracy on the Penn corpus

41
Named-Entity Classification
  • Mrs. Frank is a person
  • Steptoe and Johnson is a company
  • Honduras is a location

Example: "Finally we come to Jordan ..." versus "But
Jordan, a vice president of Steptoe and Johnson, ..."
versus "But Mr. Jordan, a vice president of Steptoe
and Johnson, ..." ("Mr." is a spelling clue; "a vice
president of" is a context clue)
42
NYMBLE (Bikel et al., 1998)
(HMM diagram: name classes nc1 → nc2 → nc3, each
emitting the corresponding word1, word2, word3)
  • Name classes: Not-A-Name, Person, Location, etc.
  • Smoothing for sparse training data; word features
  • Training: 100,000 words from WSJ
  • Accuracy: 93%
  • 450,000 words → same accuracy

43
training-development-test
44
Collins and Singer: Co-training
  • POS tagging to identify proper nouns and their
    contexts

45
  • Start with a set of seed rules and lots of
    unlabeled data
  • Seed rules
  • Label the data using spelling rules, then learn
    context rules
  • Label the data using context rules, then learn
    spelling rules (see the sketch below)
  • The algorithm achieved 91% accuracy

Example: "... says Mr. Cooper, a vice-president of ..."
("Mr." is a spelling clue; "a vice-president of" is a
context clue)
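A schematic of the alternating loop in plain Python. The rule
representation ((feature, label) pairs fired by set membership) and
the induction step are crude stand-ins for Collins and Singer's
precision-ranked rules, and all names are illustrative:

  from collections import Counter

  def label_with(rules, feats):
      # fire any rule whose feature appears among the example's features
      for feat, label in rules:
          if feat in feats:
              return label
      return None

  def induce_rules(labeled, min_count=1):
      # keep (feature, label) pairs seen together often enough --
      # a stand-in for precision-based rule scoring
      counts = Counter((f, y) for feats, y in labeled for f in feats)
      return {fy for fy, c in counts.items() if c >= min_count}

  def cotrain(examples, seed_spelling, n_rounds=5):
      """examples: (spelling_features, context_features) pairs of sets."""
      spelling, context = set(seed_spelling), set()
      for _ in range(n_rounds):
          # label with spelling rules, then learn context rules
          lab = [(c, label_with(spelling, s)) for s, c in examples]
          context |= induce_rules([(c, y) for c, y in lab if y])
          # label with context rules, then learn spelling rules
          lab = [(s, label_with(context, c)) for s, c in examples]
          spelling |= induce_rules([(s, y) for s, y in lab if y])
      return spelling, context

  examples = [({"Mr.", "Cooper"}, {"a vice-president of"}),
              ({"Jordan"}, {"a vice-president of"})]
  print(cotrain(examples, {("Mr.", "person")}))   # "Jordan" labeled via context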
46
Standard rule induction algorithm
k = 3 (the number of classes)