Title: Text Categorization
1. Text Categorization
David Madigan, Rutgers University
joint work with Alex Genkin and David Lewis
2. Statistical Analysis of Text
- Statistical text analysis has a long history in literary analysis and in solving disputed authorship problems
- First (?) was Thomas C. Mendenhall in 1887
3. Mendenhall
- Mendenhall was Professor of Physics at Ohio State and at the University of Tokyo, Superintendent of the U.S. Coast and Geodetic Survey, and, later, President of Worcester Polytechnic Institute
[Photo: Mendenhall Glacier, Juneau, Alaska]
4. [Figure: χ² = 127.2, df = 12]
5.
- Hamilton versus Madison
- Used Naïve Bayes with Poisson and Negative Binomial models
- Out-of-sample predictive performance
6. Today
- Statistical methods routinely used for textual analyses of all kinds
- Machine translation, part-of-speech tagging, information extraction, question answering, text categorization, disputed authorship (stylometry), etc.
- Not reported in the statistical literature
Mosteller, Wallace, Efron, Thisted
7. Text Categorization
- Automatic assignment of documents to categories
- Applications include e-mail filtering, pornography detection, medical coding, essay grading, and news filtering
- Modern TC research dates back to the 1960s; mostly knowledge-engineering approaches through the 1980s
- The statistical approach now dominates: learn a classifier from a set of labeled documents
8. Text Categorization Research
- Very active research area (e.g., Joachims' 1998 SVM paper has been cited in over 250 publications)
- Statisticians?
Sebastiani's Bibliography on Automated Text Categorization
9. Terminology, etc.
- Document representation via "bag of words"
- The w_i's might be 0/1, counts, or weights (e.g., tf-idf, LSI)
- Phrases, syntactic information, synonyms, NLP, etc.?
- Stopwords, stemming
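A minimal sketch of the tf-idf bag-of-words representation, assuming scikit-learn (not software from the talk); the documents are invented:

```python
# Bag-of-words with tf-idf weights: each document becomes a sparse
# vector over the vocabulary, with weights w_i as on the slide.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

# stop_words drops common function words; the default tokenizer
# lowercases and splits on word boundaries (no stemming).
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)   # sparse matrix: documents x vocabulary

print(vectorizer.get_feature_names_out())  # the vocabulary
print(X.toarray())                         # the tf-idf weights
```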
10. Test Collections
- Reuters-21578: 9,603 training documents, 3,299 test documents, 90 categories, multi-label
- New Reuters: 800,000 documents
- Medline: 11,000,000 documents, MeSH headings
- TREC conferences and collections
- Newsgroups, WebKB
11. Reuters-21578 Evaluation
- Binary classifiers: with a 2x2 table of counts a, b, c, d (rows = predicted 0/1, columns = true 0/1),
  recall = d/(b+d) (sensitivity)
  precision = d/(c+d) (predictive value positive)
- Multiple binary classifiers: micro-averaging pools the counts across classifiers. In the slide's example, two classifiers with (p = 1.0, r = 1.0) and (p = 0.5, r = 1.0) yield micro-averaged precision 2/3.
- F1 measure: the harmonic mean of precision and recall
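A small sketch of these evaluation quantities in Python; the two classifiers' counts are hypothetical but chosen to reproduce the slide's numbers:

```python
# Per-classifier precision/recall/F1, plus micro-averaged precision,
# which pools counts across classifiers before taking the ratio.
def prf(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0   # harmonic mean
    return p, r, f1

# Hypothetical counts: classifier A has p = 1.0, r = 1.0;
# classifier B has p = 0.5, r = 1.0.
counts = [dict(tp=1, fp=0, fn=0), dict(tp=1, fp=1, fn=0)]
for c in counts:
    print(prf(**c))

# Micro-averaged precision pools the counts: (1 + 1) / (1 + 2) = 2/3.
tp = sum(c["tp"] for c in counts)
fp = sum(c["fp"] for c in counts)
print(tp / (tp + fp))
```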
12. Reuters Results
13. AdaBoost.MH
- Multiclass, multi-label
- At each iteration, learns a simple score-producing classifier on weighted training data and then updates the weights
- Final decision averages over the classifiers
[Diagram: data with initial weights → score from simple classifier → revised weights]
14. AdaBoost.MH (Schapire and Singer, 2000)
15. AdaBoost.MH's weak learner is a stump (two words!)
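As a hedged illustration (not BoosTexter itself), a weak learner of this kind can be sketched as a stump on the presence or absence of a single word, chosen to minimize weighted training error:

```python
# A decision stump over word presence, the kind of weak learner
# AdaBoost.MH uses for text. X is a binary doc-term matrix.
import numpy as np

def best_stump(X, y, w):
    """X: (n x V) binary matrix; y in {-1,+1}; w: document weights."""
    best = None
    for j in range(X.shape[1]):
        present = X[:, j] == 1
        for sign in (+1, -1):                  # predict +sign if word present
            pred = np.where(present, sign, -sign)
            err = w[pred != y].sum()           # weighted error
            if best is None or err < best[0]:
                best = (err, j, sign)
    return best  # (weighted error, word index, sign)

# Toy data: two documents, two words.
X = np.array([[1, 0], [0, 1]])
y = np.array([+1, -1])
w = np.array([0.5, 0.5])
print(best_stump(X, y, w))   # picks a word that splits the labels

# AdaBoost then increases the weights of misclassified documents,
# and the final score averages the stumps' weighted votes.
```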
16. AdaBoost.MH Comments
- Software implementation: BoosTexter
- Some theoretical support in terms of bounds on generalization error
- 3 days of CPU time for Reuters with 10,000 boosting iterations
17. Support Vector Machine
Two-class classifier of the form $f(x) = \mathrm{sign}(h(x))$, $h(x) = \sum_i w_i K(x, x_i) + b$, with parameters chosen to minimize
$\sum_i [1 - y_i h(x_i)]_+ + \lambda\, w^\top K w$
where $\lambda$ is a tuning constant, $w^\top K w$ is the complexity penalty, and $K$ is the Gram matrix. Many of the fitted $w_i$ are usually zero; the $x_i$ corresponding to the non-zero $w_i$ are the support vectors.
18. Hastie, Friedman and Tibshirani
19. SVM Comments
- Polynomial ($K(x, x') = (1 + \langle x, x' \rangle)^d$) or radial basis function ($K(x, x') = \exp(-\|x - x'\|^2 / c)$) kernels often used
- In fact, for text categorization, not using a kernel seems to do a little better than using a kernel!
- Generalization bounds available, but not useful in practice?
- Very similar to a form of ridge logistic regression, which provides similar Reuters performance (Zhang and Oles, 2002)
- Software: SVM Light, WEKA, etc.
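Since the linear (no-kernel) SVM works well for text, here is a hedged sketch using scikit-learn rather than the packages the slide names; the documents and labels are invented:

```python
# Linear SVM text classifier: tf-idf features feeding a hinge-loss
# linear classifier, the "no kernel" setup the slide recommends.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Toy labeled documents (hypothetical, for illustration only).
docs = ["oil prices rise", "wheat harvest falls", "crude oil exports"]
labels = ["oil", "grain", "oil"]

clf = make_pipeline(TfidfVectorizer(), LinearSVC(C=1.0))  # C: tuning constant
clf.fit(docs, labels)
print(clf.predict(["oil futures climb"]))
```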
20. Zhang and Oles
- Ridge logistic regression
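The objective itself did not survive extraction; as a hedged reconstruction in standard notation (with labels $y_i \in \{-1, +1\}$), ridge logistic regression solves

```latex
\hat{\beta} = \arg\min_{\beta}\;
  \sum_{i=1}^{n} \log\!\bigl(1 + \exp(-y_i\,\beta^{\top} x_i)\bigr)
  \;+\; \lambda \lVert \beta \rVert_2^2
```

i.e., the logistic log-likelihood with a quadratic (Gaussian-prior) penalty in place of the SVM's hinge loss.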
21. Zhang & Oles Results
- 10,000 binary features selected via an information-gain criterion
- Numerical optimization: Gauss-Seidel with a trust region
22. Tibshirani's LASSO
Least absolute shrinkage and selection operator:
$\hat{w} = \arg\min_w \sum_i (y_i - w^\top x_i)^2 \ \text{subject to} \ \sum_j |w_j| \le t$
23. Bayesian Sparse Model
LASSO = MAP estimation with a Laplace prior
- Simultaneous feature selection and shrinkage
- Outperforms SVM and random forests on several standard (small) problems using a Gaussian kernel (Figueiredo, 2001)
- Similar to Tipping's Relevance Vector Machine (JMLR, 2001)
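MAP estimation under a Laplace prior corresponds to L1-penalized (lasso-type) logistic regression; a minimal sketch with scikit-learn on synthetic data (not the authors' own implementation):

```python
# L1-penalized logistic regression: the Laplace-prior MAP estimate,
# giving simultaneous shrinkage and feature selection (sparse fit).
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100, n_features=500,
                           n_informative=10, random_state=0)

# C is the inverse regularization strength: smaller C = sharper prior.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(X, y)
print((clf.coef_ != 0).sum(), "non-zero coefficients out of 500")
```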
24. [Figure: Laplace versus Gaussian prior densities]
25. Bayes versus Sparse Bayes: Old Reuters
- 20,320 log-tf features
- Sparse models have from 52 to 357 non-zero posterior modes
- Modified Zhang-Oles algorithm for Laplace and probit
26. Bayes versus Sparse: New Reuters
- 47,152 log-tf features, 20,000 labeled documents
- No ad hoc feature selection
- With 1,000 labeled documents, F1 = 0.7
27. Dense versus Sparse: OHSUMED
- 122,076 log-tf features, 20,000 labeled documents
- No ad hoc feature selection
- Median number of features used: 34
28. How Bayes?
- EM, ECM, Gauss-Seidel
- MCMC, variational methods
- Online EM, Quasi-Bayes (Titterington, 1984; Smith and Makov, 1978)
- Sequential MC (Ridgeway and Madigan, 2003; Chopin, 2002)
29. Full Bayes versus MAP Bayes
30. Approximate Online Sparse Bayes
- Quasi-Bayes: optimal Gaussian approximation to the posterior as each new observation arrives
- Alternative: quadratic approximation to the log-likelihood of each new observation at the current mode
- Shooting algorithm (Fu, 1998)
31. Shooting
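A minimal sketch of the shooting (coordinate descent) algorithm for the lasso, written here as an illustration (not the talk's code) and assuming the columns of X are roughly standardized:

```python
# Shooting (Fu, 1998): cycle through coordinates, soft-thresholding each
# one with the others held fixed, for 0.5*||y - Xw||^2 + lam*||w||_1.
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * max(abs(z) - t, 0.0)

def shooting_lasso(X, y, lam, n_iter=100):
    n, p = X.shape
    w = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0)            # per-coordinate curvature
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ w + X[:, j] * w[j]   # partial residual without coord j
            z = X[:, j] @ r
            w[j] = soft_threshold(z, lam) / col_ss[j]
    return w

# Large lam drives many coordinates exactly to zero: sparse solutions.
```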
32. [Figure] pima (UCI): batch n = 40, online n = 160
33. Why Bayes?
- Can incorporate external information, e.g., topic descriptions
- Natural sequential learning paradigm
- Borrowing strength across topics
- Simultaneous feature selection and shrinkage via sparse priors
34. Conclusions
- Regularization/shrinkage is critical for predictive modeling with HDLSS ("short, fat") data
- The sparse Bayesian classifier is highly competitive and performs simultaneous feature selection and shrinkage
- Hierarchical partition model for the multi-label setting
- Full Bayes versus MAP plug-in
35. Part-of-Speech Tagging
- Assign grammatical tags to words
- A basic task in the analysis of natural-language data
- Phrase identification, entity extraction, etc.
- Ambiguity: "tag" could be a noun or a verb ("a tag is a part-of-speech label"); context resolves the ambiguity
36. The Penn Treebank POS Tag Set
37. POS Tagging Process [Figure: Berlin Chen]
38. POS Tagging Algorithms
- Rule-based taggers: large numbers of hand-crafted rules
- Probabilistic taggers: use a tagged corpus to train some sort of model, e.g., an HMM
[Diagram: hidden tags tag1 → tag2 → tag3 emitting word1, word2, word3]
- Clever tricks for reducing the number of parameters
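A toy sketch of the pictured HMM and Viterbi decoding; the tag set and all probabilities are hypothetical illustrations:

```python
# HMM tagging: hidden tags emit words; Viterbi recovers the most
# probable tag sequence given per-word emission log-probabilities.
import numpy as np

tags = ["NOUN", "VERB"]
trans = np.log(np.array([[0.7, 0.3],      # p(tag_t | tag_{t-1})
                         [0.6, 0.4]]))
start = np.log(np.array([0.8, 0.2]))      # p(tag_1)

def viterbi(emit_logprobs):
    """emit_logprobs: (T x K) array of log p(word_t | tag_k)."""
    T, K = emit_logprobs.shape
    score = start + emit_logprobs[0]
    back = []
    for t in range(1, T):
        cand = score[:, None] + trans + emit_logprobs[t]
        back.append(cand.argmax(axis=0))  # best predecessor for each tag
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for b in reversed(back):              # follow back-pointers
        path.append(int(b[path[-1]]))
    return [tags[k] for k in reversed(path)]

# Toy emission log-probs for a 3-word sentence (hypothetical numbers).
emit = np.log(np.array([[0.6, 0.4], [0.3, 0.7], [0.8, 0.2]]))
print(viterbi(emit))
```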
39. Some Details
Charniak et al., 1993, achieved 95% accuracy on the Brown Corpus with
p(tag i | word j) = (number of times word j appears with tag i) / (number of times word j appears)
and, for words never seen in training,
p(tag i | unknown word) = (number of times a word that had never been seen with tag i gets tag i) / (number of such occurrences in total)
plus a modification that uses word suffixes.
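A toy illustration of the count-ratio estimate above, using invented counts for the ambiguous word "flies":

```python
# p(tag | word) estimated as count(word, tag) / count(word).
counts = {("flies", "NN"): 21, ("flies", "VBZ"): 13}   # hypothetical counts
word_total = sum(n for (w, _), n in counts.items() if w == "flies")

for (w, tag), n in counts.items():
    print(f"p({tag} | {w}) = {n / word_total:.3f}")
```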
40. Recent Developments
- Toutanova et al., 2003, use a "dependency network" and a richer feature set
- Log-linear model for $t_i \mid t_{-i}, w$
- The model included, for example, features for whether the word contains a number, uppercase characters, a hyphen, etc. (up to 300,000 features)
- Regularization of the estimation process is critical (Gaussian priors)
- 96.6% accuracy on the Penn corpus
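A sketch of the kind of feature templates such a log-linear tagger might use; the templates here are illustrative, not Toutanova et al.'s exact set:

```python
# Rich features for one token: surface-form tests plus tag context on
# both sides (the "dependency network" conditions on both neighbors).
def features(word, prev_tag, next_tag):
    return {
        "word=" + word.lower(): 1.0,
        "prev_tag=" + prev_tag: 1.0,
        "next_tag=" + next_tag: 1.0,
        "has_digit": float(any(c.isdigit() for c in word)),
        "has_upper": float(any(c.isupper() for c in word)),
        "has_hyphen": float("-" in word),
        "suffix3=" + word[-3:]: 1.0,
    }

print(features("stock-index", "DT", "NN"))
```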
41. Named-Entity Classification
- "Mrs. Frank" is a person
- "Steptoe and Johnson" is a company
- "Honduras" is a location
"Finally we come to Jordan..." But "Jordan, a vice president of Steptoe & Johnson,..." But "Mr. Jordan, a vice president of Steptoe & Johnson,..."
(spelling clue: "Mr."; context clue: "a vice president of")
42. NYMBLE (Bikel et al., 1998)
[Diagram: hidden name classes nc1 → nc2 → nc3 emitting word1, word2, word3]
- Name classes: Not-A-Name, Person, Location, etc.
- Smoothing for sparse training data; word features
- Training: 100,000 words from the WSJ
- Accuracy: 93%
- 450,000 words → same accuracy
43. Training-Development-Test
44. Collins and Singer: Co-training
- POS tagging to identify proper nouns and their contexts
45.
- Start with a set of seed rules and lots of unlabeled data
- Seed rules
- Label the data using spelling rules, then learn context rules
- Label the data using context rules, then learn spelling rules (see the sketch below)
- The algorithm achieved 91% accuracy
"... says Mr. Cooper, a vice-president of ..." (spelling clue: "Mr."; context clue: the appositive)
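A schematic, runnable sketch of the co-training loop on toy data; majority-vote rule induction stands in for the confidence-thresholded decision lists of Collins and Singer, and the examples and seeds are invented:

```python
# Co-training over two views: spelling rules label data to train context
# rules, and context rules label data to train spelling rules.
from collections import Counter, defaultdict

data = [  # each example has a spelling view and a context view (toy)
    {"spelling": "Mr.", "context": "appositive=president"},
    {"spelling": "Cooper", "context": "appositive=president"},
    {"spelling": "Incorporated", "context": "preceded-by=unit-of"},
    {"spelling": "Steptoe", "context": "preceded-by=unit-of"},
]

def label_with(rules, view):
    # Label each example by looking up its view feature in the rule table.
    return [(x, rules.get(x[view])) for x in data]

def induce_rules(labeled, view):
    # Learn feature -> class rules by majority vote over labeled examples.
    votes = defaultdict(Counter)
    for x, label in labeled:
        if label is not None:
            votes[x[view]][label] += 1
    return {feat: c.most_common(1)[0][0] for feat, c in votes.items()}

spelling_rules = {"Mr.": "person", "Incorporated": "organization"}  # seeds
context_rules = {}
for _ in range(3):  # alternate between the two views
    context_rules.update(induce_rules(label_with(spelling_rules, "spelling"), "context"))
    spelling_rules.update(induce_rules(label_with(context_rules, "context"), "spelling"))

print(spelling_rules)  # "Cooper" and "Steptoe" acquire labels via context
```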
46. Standard Rule Induction Algorithm
k = 3 (the number of classes)