Title: Supervised Learning for Text Classification
1Supervised Learning for Text Classification
2Predictive Modeling
Goal: learn a mapping y = f(x). Need:
1. A model structure
2. A score function
3. An optimization strategy
Categorical y ∈ {c1, ..., cm}: classification. Real-valued y: regression.
Note: usually assume c1, ..., cm are mutually exclusive and exhaustive.
3Probabilistic Classification
Let p(ck) prob. that a randomly chosen object
comes from ck Objects from ck have p(x ck ,
?k) (e.g., MVN) Then p(ck x ) ? p(x ck ,
?k) p(ck)
Bayes Error Rate
- Lower bound on the error rate achievable by any classifier
4 Bayes error rate about 6%
5Classifier Types
Discriminative: model p(ck | x) - e.g. linear regression, logistic regression, CART
Generative: model p(x | ck, θk) - e.g. Bayesian classifiers, LDA
6Regression for Binary Classification
- Can fit a linear regression model to a 0/1 response
- Predicted values are not necessarily between zero and one
- With p > 1, the decision boundary is linear
- e.g. 0.5 = b0 + b1 x1 + b2 x2
zeroOneR.txt
8Naïve Bayes via a Toy Spam Filter Example
- Naïve Bayes is a generative model that makes drastic simplifying assumptions
- Consider a small training data set for spam along with a bag-of-words representation
11Naïve Bayes Machinery
- We need a way to estimate
- Via Bayes theorem we have
or, on the log-odds scale
12Naïve Bayes Machinery
and
leading to
13Maximum Likelihood Estimation
weights of evidence
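As a sketch of the standard Naïve Bayes machinery these slides walk through (written here for the spam example; the slides' own equations are not reproduced): with a bag-of-words representation x = (x1, ..., xp) and the "naïve" assumption that words are independent given the class,

\log \frac{p(\mathrm{spam} \mid x)}{p(\mathrm{ham} \mid x)} = \log \frac{p(\mathrm{spam})}{p(\mathrm{ham})} + \sum_{j} \log \frac{p(x_j \mid \mathrm{spam})}{p(x_j \mid \mathrm{ham})}

Each term in the sum is a word's "weight of evidence", and maximum likelihood estimates the word probabilities by their relative frequencies within each class.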
14Naïve Bayes Prediction
- Usually add a small constant (e.g. 0.5) to the counts to avoid divide-by-zero problems and to reduce bias
- New message: "the quick rabbit rests"
15
- Predicted log odds: 0.51 + 0.51 + 0.51 + 0.51 + 1.10 + 0 = 3.04
- Corresponds to a spam probability of 0.95
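A minimal sketch of this prediction step in Python; the per-word counts below are hypothetical, made up for illustration, and only the final log-odds-to-probability conversion (3.04 to 0.95) comes from the slide.

import math

# Hypothetical word counts per class (stand-ins for the toy training set)
spam_counts = {"the": 4, "quick": 4, "rabbit": 4, "rests": 0}
ham_counts  = {"the": 2, "quick": 2, "rabbit": 2, "rests": 2}
n_spam, n_ham = 5, 5   # assumed number of training messages per class

def weight_of_evidence(word, smooth=0.5):
    # add a small constant (0.5) to avoid divide-by-zero and reduce bias
    p_spam = (spam_counts.get(word, 0) + smooth) / (n_spam + 2 * smooth)
    p_ham = (ham_counts.get(word, 0) + smooth) / (n_ham + 2 * smooth)
    return math.log(p_spam / p_ham)

def log_odds(message, prior_log_odds=0.0):
    return prior_log_odds + sum(weight_of_evidence(w) for w in message.split())

print(log_odds("the quick rabbit rests"))
print(1 / (1 + math.exp(-3.04)))   # the slide's log odds of 3.04 -> probability ~0.95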
16Linear Discriminant Analysis
K classes, X an n × p data matrix.
p(ck | x) ∝ p(x | ck, θk) p(ck)
Could model each class density as multivariate normal.
LDA assumes a common covariance matrix, Σk = Σ, for all k. Then the log posterior ratio between any two classes is linear in x.
17Linear Discriminant Analysis (cont.)
It follows that the classifier should predict the class that maximizes the linear discriminant function δk(x).
If we don't assume the Σk's are identical, we get Quadratic Discriminant Analysis (QDA).
18Linear Discriminant Analysis (cont.)
Can estimate the LDA parameters via maximum
likelihood
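A small sketch of the maximum likelihood fit and the resulting decision rule (my code, not the course's; assumes the common-covariance model above):

import numpy as np

def fit_lda(X, y):
    # ML estimates: class priors, class means, pooled (common) covariance
    classes = np.unique(y)
    priors = {k: np.mean(y == k) for k in classes}
    means = {k: X[y == k].mean(axis=0) for k in classes}
    n = len(y)
    Sigma = sum(np.cov(X[y == k].T, bias=True) * np.sum(y == k) for k in classes) / n
    return priors, means, np.linalg.inv(Sigma)

def lda_predict(x, priors, means, Sigma_inv):
    # predict the class with the largest linear discriminant function
    def delta(k):
        mu = means[k]
        return x @ Sigma_inv @ mu - 0.5 * mu @ Sigma_inv @ mu + np.log(priors[k])
    return max(priors, key=delta)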
19LDA
QDA
T. Hastie, R. Tibshirani, and J. Friedman (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag.
20Logistic Regression
Note that LDA is linear in x.
Linear logistic regression looks the same.
But the estimation procedure for the coefficients is different: LDA maximizes the joint likelihood p(y, X); logistic regression maximizes the conditional likelihood p(y | X). Usually similar predictions.
21Logistic Regression MLE
For the two-class case, the log-likelihood is ℓ(β) = Σi [ yi β'xi − log(1 + exp(β'xi)) ].
To maximize, we need to solve the (non-linear) score equations ∂ℓ/∂β = Σi xi (yi − p(xi; β)) = 0.
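One standard way to solve these score equations is Newton-Raphson (equivalently, iteratively reweighted least squares); a bare-bones sketch with no intercept handling or step-size safeguards:

import numpy as np

def fit_logistic(X, y, n_iter=25):
    # Two-class logistic regression, y in {0, 1}
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-(X @ beta)))      # fitted probabilities
        grad = X.T @ (y - p)                   # the score equations
        W = p * (1 - p)                        # IRLS weights
        hess = X.T @ (X * W[:, None])          # Fisher information X'WX
        beta += np.linalg.solve(hess, grad)    # Newton step
    return beta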
22Logistic Regression Modeling
South African Heart Disease Example (y = MI)
Coefficient estimates with Wald statistics
23Simple Two-Class Perceptron
Define h(x) = w'x. Classify as class 1 if h(x) > 0, class 2 otherwise.
Score function: number of misclassification errors on training data.
For training, replace class-2 xj's by -xj; now need h(x) > 0 for every training point.
Initialize the weight vector. Repeat one or more times:
  For each training data point xi:
    If the point is correctly classified, do nothing
    Else add xi to the weight vector
Guaranteed to converge to a separating hyperplane (if one exists).
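A direct transcription of this procedure as a sketch (the class-2 sign flip is folded into labels y in {+1, -1}, so a point is misclassified when y * h(x) <= 0):

import numpy as np

def train_perceptron(X, y, n_epochs=100):
    w = np.zeros(X.shape[1])             # initialize weight vector
    for _ in range(n_epochs):            # repeat one or more times
        for xi, yi in zip(X, y):         # for each training data point
            if yi * (w @ xi) <= 0:       # misclassified (or on the boundary)
                w += yi * xi             # update: add the (signed) point to w
    return w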
24 Orange: Least Squares
T. Hastie, R. Tibshirani, and J. Friedman (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag.
25Optimal Hyperplane
The optimal hyperplane separates the two
classes and maximizes the distance to the closest
point from either class. Finding this hyperplane
is a convex optimization problem. This notion
plays an important role in support vector machines
26 Blue: Optimal; Red: Logistic Regression
T. Hastie, R. Tibshirani, and J. Friedman (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag.
27Measuring the Performance of a Binary Classifier
28 Suppose we use a cutoff of 0.5
Cross-tabulate predicted outcome (1 / 0) against actual outcome (1 / 0) on the test data.
29 More generally

                        actual outcome
                          1        0
  predicted outcome  1    a        b
                     0    c        d

  misclassification rate = (b + c) / (a + b + c + d)
  sensitivity = a / (a + c)   (aka recall)
  specificity = d / (b + d)
  predictive value positive = a / (a + b)   (aka precision)
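These definitions as a small helper (a sketch using the same a, b, c, d cell labels as the table above):

def binary_metrics(a, b, c, d):
    # a: pred 1 & actual 1, b: pred 1 & actual 0,
    # c: pred 0 & actual 1, d: pred 0 & actual 0
    return {
        "misclassification rate": (b + c) / (a + b + c + d),
        "sensitivity (recall)": a / (a + c),
        "specificity": d / (b + d),
        "precision (PPV)": a / (a + b),
    }

# the cutoff-0.5 table on the next slide: a=7, b=3, c=0, d=10
print(binary_metrics(7, 3, 0, 10))   # sensitivity 1.0, specificity ~0.77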
30 Suppose we use a cutoff of 0.5

                        actual outcome
                          1        0
  predicted outcome  1    7        3
                     0    0       10

  sensitivity = 7 / (7 + 0) = 100%
  specificity = 10 / (3 + 10) = 77%
31 Suppose we use a cutoff of 0.8

                        actual outcome
                          1        0
  predicted outcome  1    5        2
                     0    2       11

  sensitivity = 5 / (5 + 2) = 71%
  specificity = 11 / (2 + 11) = 85%
32- Note there are 20 possible thresholds
- An ROC curve computes sensitivity and specificity for all possible thresholds and plots them
- Note: if the threshold is at the minimum, c = d = 0, so sensitivity = 1 and specificity = 0
- If the threshold is at the maximum, a = b = 0, so sensitivity = 0 and specificity = 1
35- Area under the ROC curve (AUC) is a common measure of predictive performance
- So is squared error, Σi (yi − ŷi)², also known as the Brier score
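A sketch of how these quantities could be computed with scikit-learn (the toy labels and probabilities below are made up; this is not the course's code):

from sklearn.metrics import roc_curve, roc_auc_score, brier_score_loss

y = [1, 1, 0, 1, 0, 0]                  # true 0/1 labels (toy)
p = [0.9, 0.8, 0.6, 0.55, 0.3, 0.1]     # predicted probabilities (toy)

fpr, tpr, thresholds = roc_curve(y, p)  # sensitivity/specificity at every threshold
print(roc_auc_score(y, p))              # area under the ROC curve
print(brier_score_loss(y, p))           # mean of (y_i - p_i)^2, i.e. the slide's sum / n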
36A Close Look at Logistic Regression for Text
Classification
37Logistic Regression in One Slide
Example: Predict the gender (y = M/F) of a person given their height (x = a number).
38Logistic Regression Model
- Linear model for log odds of class membership
- We will call any model with these semantics a
logistic regression model
39
- It's arbitrary which form we write
- Notationally convenient in different cases
40Equivalent Forms of LR Model (1)
- Exponential model for odds ratio
- "I'll give you 3 to 1 against this document being about Sports."
41Equivalent Forms of LR Model (2)
- Logistic model for probability of class membership
- "I think there's a probability of 0.25 that this document is about Sports."
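To connect the last two slides (a quick check, not from the slides themselves): odds of 3 to 1 against Sports means odds = 1/3, so p = odds / (1 + odds) = 0.25, and the corresponding log odds is ln(1/3) ≈ -1.10.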
42 One Beta or Two?
- We could also write the logistic form as
- with β₋₁ = 0
- Suggests natural generalization...
43Polytomous Logistic Regression (PLR)
- Elegant approach to multiclass problems
- Also known as polychotomous LR, multinomial LR,
and, ambiguously, multiple LR and multivariate LR
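In its usual form (a sketch consistent with the β₋₁ = 0 convention above, not necessarily the slides' exact notation), PLR models p(y = k | x) = exp(βk'x) / Σj exp(βj'x), with one parameter vector per class and one of them fixed at zero for identifiability.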
44Why LR is Interesting (1)
- Usual advantages of linear models
- Computationally efficient
- Take advantage of model and data sparsity
- Numeric or discrete inputs
- Can use kernels
- Natural loss function easy to optimize (more
later)
45Why LR is Interesting (2)
- Probabilistic predictions!
- Optimize expected value of effectiveness measure w/o changing model
- Including utilities, rankings, batch measures
- Highlight uncertain test cases
- Estimate number of class members
- Truth is rarely deterministic
46Why LR is Interesting (3)
- Parameters have a meaning
- How log odds increases w/ feature values
- Lets you
- Look at model and see if sensible
- Use domain knowledge to guide parameter fitting (more later)
- Build some parts of model by hand
- Caveat: realistically, a lot can (and does) complicate this interpretation
47Conditional Maximum Likelihood Fitting
- Find parameters (βj's)
- that give the largest possible likelihood (probability)
- of the set of class labels (yi's)
- given the corresponding vectors (xi's)
48Loglikelihood
- For optimization, we equivalently maximize the
logarithm of the conditional likelihood, i.e. find
49(Negated) Loglikelihood as Loss Function
- Negative of log-likelihood measures loss (degree of error) in predictions
- Sum over training examples
- Continuous, differentiable, convex
- No local minima
- But global minimum may not be unique
- Amenable to off-the-shelf optimization approaches
50(Negated) Loglikelihood as Loss Function
Hastie, Tibshirani, and Friedman (2001)
51Problems with Likelihood
- Linear separation of classes leads to infinite
parameter values
Likelihood is maximized by assigning some training points probability exactly 1.0...
...and the others exactly 0.0. That requires infinite parameter values.
52 P(y | x)
53Problems with Likelihood
- Usually too many parameters, too little data
- Result: overfitting (predictions worse on test data than on training)
- Algorithmic kludge: stop before convergence
- More principled kludge: feature selection
- But let's take the high road instead...
54Bayesian Statistics
- Parameters viewed as drawn from a prior distribution, p(θ)
- p(θ) summarizes what we know about θ before seeing data
- After seeing data, D, we have a posterior distribution, p(θ|D)
- Summarizes what we now know about θ
55Bayes Rule
- The two are connected by Bayes Rule
- Gives a convenient way to favor some parameter values over others
- True believers say it's the only legitimate way
56Bayesian MAP Training
- Find parameters (βj's) that maximize log posterior probability of class labels (yi's) given documents (xi's) and prior p(β) on the βj's
- MAP selection is theoretically inferior to making use of the entire posterior distribution
- But only if the model is exactly correct, etc.
- In practice MAP can be just as good
57Priors
- Any multivariate probability distribution p(θ) can be used
- Most lead to intractable, multimodal posteriors
- Let's look at a simple case
- Independent univariate prior for each parameter
- Joint prior is just the product of these
58 Earth's Favorite Distribution
- Suppose our prior beliefs about the βj's are independent Gaussians
59Diagonal Gaussian Prior
- If gaussians are independent, joint distribution
is multivariate gaussian with diagonal covariance
matrix
60Penalized Likelihood
- Diagonal Gaussian prior gives this intuitive function to maximize: the log-likelihood minus a penalty proportional to Σj βj²
- Can't overfit if parameter values are small enough
- The prior variance, σ², trades off fit and penalty
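A sketch of that objective as code (the function and variable names are mine, not the authors'):

import numpy as np

def penalized_loglik(beta, X, y, sigma2):
    # logistic log-likelihood minus the diagonal-Gaussian (ridge) penalty
    eta = X @ beta
    loglik = np.sum(y * eta - np.log1p(np.exp(eta)))
    penalty = np.sum(beta ** 2) / (2 * sigma2)   # smaller sigma2 = stronger shrinkage
    return loglik - penalty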
61 Gaussian gives dense model (coefficients plotted against prior variance)
62Laplace Distribution
- Of course, we could pick any univariate distribution for our prior
- How about the Laplace distribution?
63Multivariate Laplace
- Again we can define a multivariate distribution
as the product of independent Laplace
distributions
64Penalized Likelihood
- Independent Laplace priors give this not-so-intuitive function to maximize: the log-likelihood minus a penalty proportional to Σj |βj|
- Again, favors small parameter values
- So what's the difference from the Gaussian prior?
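The practical difference shows up in sparsity. A sketch of the comparison using scikit-learn on synthetic data (not the BBR/BMR software described later); the Laplace/L1 fit zeroes out most coefficients while the Gaussian/L2 fit keeps them all nonzero:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))      # 50 features, most of them irrelevant
y = (X[:, 0] - X[:, 1] + 0.5 * rng.normal(size=200) > 0).astype(int)

l2 = LogisticRegression(penalty="l2", C=0.5).fit(X, y)
l1 = LogisticRegression(penalty="l1", C=0.5, solver="liblinear").fit(X, y)

print("nonzero coefficients, Gaussian/L2:", np.sum(l2.coef_ != 0))   # typically all 50
print("nonzero coefficients, Laplace/L1: ", np.sum(l1.coef_ != 0))   # typically a handful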
65 Gaussian gives dense model (coefficients plotted against prior variance)
66 Laplace gives sparse model
67Text Classification Example
- ModApte subset of Reuters-21578
- 90 categories; 9,603 training docs; 18,978 features
- Reuters RCV1-v2
- 103 categories; 23,149 training docs; 47,152 features
- OHSUMED heart disease categories
- 77 categories; 83,944 training docs; 122,076 features
- Cosine-normalized TF×IDF weights
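This document representation (cosine-normalized TF×IDF) is, for instance, what scikit-learn's TfidfVectorizer produces with its defaults; a sketch, with a two-document toy corpus:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(use_idf=True, norm="l2")   # l2 norm = cosine normalization
X_train = vectorizer.fit_transform(["acquisition of shares", "wheat exports rise"])
print(X_train.shape)   # (number of training docs, number of features)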
68Dense vs. Sparse Models (Macroaveraged F1)
70An Example Model (category grain)
71Bayesian Use of Domain Knowledge
- Suppose we know (or have resources that suggest)
- Certain words are positively or negatively associated with the category
- Certain words are mostly unrelated to content
- Prior mean can encode positive or negative association
- Prior variance can encode how confident we are
72DK-Based Prior Variance
- Words we believe to be good content indicators should be allowed to get larger parameter values
- Higher prior variance means less penalization
- We used
- C is a tuning constant
- significance based on TF×IDF in training data, or in prior-knowledge texts (more later)
- σ² is the baseline variance for words with no prior knowledge (chosen by cross-validation or heuristics)
73DK-based Prior Mode
- Idea: Prior mode (most likely parameter value) should be > 0 for words we believe to be positively associated with the category
- σ is a standard deviation found by cross-validation
- Analogous to the combining of queries and documents in relevance feedback
74Experiments
- Data sets
- TREC 2004 Genomics data
- Categories: 32 MeSH categories under the Cells hierarchy
- Documents: 3,742 training and 4,175 test
- Prior knowledge: MeSH category descriptions
- ModApte subset of Reuters-21578
- Categories: 10 most frequent categories
- Documents: 9,603 training and 3,299 test
- Prior knowledge: keywords selected by hand (Wu & Srihari, 2004)
- Study different training set sizes
75Sources of Prior Knowledge
- Text that is strongly associated with a category
- But which doesn't have the same statistical properties as training examples
- e.g. category descriptions
- Human intuition
76 Text as Prior Knowledge: MeSH Category Description
- MeSH Heading: Neurons
- Scope Note: The basic cellular units of nervous tissue. Each neuron consists of a body, an axon, and dendrites. Their purpose is to receive, conduct, and transmit impulses in the nervous system.
- Entry Term: Nerve Cells
- See Also: Neural Conduction
IDF on domain texts gives low significance
IDF gives high significance
77 Priors on βs (Laplace, mode 0, domain-IDF-based variance)
78 Priors on βs (Laplace, domain-IDF-based mode, fixed variance)
79 MeSH Results (training: 3,742 random examples)
80 MeSH Results (training: 500 random examples)
81 MeSH Results (training: 5 positive + 5 random examples/category)
82 Prior Knowledge from Human Intuition (Wu & Srihari)
83 ModApte Results (training: 100 random examples)
84 ModApte Results (training: 5 positive + 5 random examples/category)
85Advertisement
- Joint work with Aynur Dayanik, Alex Genkin, Michael Hollander, and Vladimir Menkov
- Bayesian logistic regression software available (binary and polytomous)
- http://www.stat.rutgers.edu/madigan/BBR/
- http://www.stat.rutgers.edu/madigan/BMR/
- Optimizer not the best, but fast enough to use (especially BBR)
- Long tradition of early slow optimizers ;-)
86Off the MAP
- Bayesian MAP training only uses the single most probable β from the posterior distribution
- Would a weighted combination of βs be better?
- Maybe, but not guaranteed (despite theoretical optimality)
- Computationally expensive
- High-dimensional integrals, Monte Carlo algorithms, etc.
87Application Authorship Attribution
88Some Background
- Identification technologies important for homeland security and in the legal system
- Authorship attribution for textual artifacts using topic-independent "stylometric" features has a long history
- Historical focus on small numbers of authors and low-dimensional representations via function words
91- Used Naïve Bayes with Poisson and Negative Binomial models
- Out-of-sample predictive performance
1 of K: given training documents from each of K known authors, assign each test document to one of the K authors
93 odd man out: training documents from known authors; each test document is either by one of them ("which one?") or by a new author
aka novelty detection
94 document pairs: classify a pair of test documents into one of several possible author configurations
95 anti-aliasing: in this example, the red author and the grey author are the same real person
96Other Related Problems
- Author gender
- Author nationality
- Sentiment (positive/negative feeling)
- Rhetorical style
- Multi-authored documents
971-of-K Authorship Attribution
- Represent documents in a topic-free fashion
- Function words: "and", "of", "the", etc.
- "upon"?
- Sentence lengths, word lengths, deep linguistics, stylometric features
- Parts of speech? Word endings? Word prefixes?
- Combinations of the above
- High-dimensional document representations
98Polytomous Logistic Regression
- Sparse Bayesian (aka lasso) logistic regression trivially generalizes to 1-of-K problems
- Laplace prior particularly appealing here
- Suppose 100 classes and a word that predicts class 17
- The word gets used 100 times if we build 100 binary models, or if we use polytomous with a Gaussian prior
- With a Laplace prior and polytomous, it's used only once
991-of-K Sample Results brittany-l
89 authors with at least 50 postings. 10,076
training documents, 3,322 test documents.
BMR-Laplace classification, default
hyperparameter
103Cross-Topic Mini-Experiment
104Cross-Topic Mini-Experiment
105 odd-man-out Sample Results: RCV-1
- K-L-M approach: K primary authors, L decoy authors, M test authors (plus 50 of the K authors)
114 RCV-1 journalists with 200 articles. Argamon function words. Average of 10 replications. BBR threshold tuning for F1.
106 odd-man-out Sample Results: RCV-1
- K-L-M approach: K primary authors, L decoy authors, M test authors (plus 50 of the K authors)
114 RCV-1 journalists with 200 articles. Argamon function words. Average of 10 replications. BBR threshold tuning for F1.
107KDD Challenge
- 150,000 scientific abstracts
- Task 1: cluster documents written by "T. Suzuki" into real people
- Task 2: find documents that have had a single author deleted and/or replaced
- Features: words, co-authors, institutions, MeSH headings
108Gamon (2004)
- Discriminate between Anne, Charlotte, and Emily
Brontë
109Koppel et al (2004)
- "Unstable" words can be replaced without changing meaning
- Use machine translation algorithms to generate multiple document versions with the same meaning
- Function words are unstable
110Software
- BMR Software. Sparse and non-sparse Bayesian multinomial regression software for large numbers of classes and features
- Featex Software. Tool for creating high-dimensional document representations for authorship attribution
111Conclusions
- Lots of interesting open problems
- How real are the non-literary applications?
112Some More Case Studies
113Example 1 Classifying Customer Comments
- Comments logged by customer service at a large telecom co.
- Free-form text, formatted account info
- Goal was classification, to support
- Informal analysis by account managers
- Possible automated response
- (Joint work with Bill Gale)
114Example (Simulated) Records
- 11-Oct-1999, 17, 7735555555, CST PRIMARY LANG OF CHINESE AND TOLHIM WE WLD CALL BACK BY CHINEES SPEAKER
- 12-Oct-1999, 75, 9085555555, MRS RICHARDS WANTS TV OFR FREE MILES. SNT IT.
115Needs and Techniques
- Non-technical managers wanted to define classes of customers
- Supervised learning from examples
- Active learning: reduce amount of data to label
- Data accessed via menu-based interface with oddly limited boolean querying
- Learned rule-based classifiers obeying syntactic restrictions using Cohen's Grendel system
116Example Classifier
- HEAR or CUT or STATIC or NOISE
- or (LINE and NOISY)
- or (TALKING and BAD)
- or (LINES and DIRECT)
- Accuracy 88% (90% if negation allowed)
117Example 2 Counting Types of Device Failure
- Trouble tickets for service calls on PBXes
- Repair person enters failure type, attributes
- Plus textual notes (sometimes)
- As part of process improvement
- Reorganized taxonomy of failure types
- Want to know number of failures in each class
- (Joint work with Mark Jones)
118Trouble Ticket (Simulated)
- Customer: Giant Foods, Inc., Key West, FL
- Model: RX1837
- Date: 5-Oct-1999
- Last Service: 18-Nov-1997
- Problem Code: OverHeat
- Resolution Code: ReplacePart
- Notes: rpl fan, vac dust rec maint pln
119Goals and Techniques
- Leverage similarities between old and new taxonomies
- Classifiers use old class labels as predictors
- (along with words and attributes)
- Old class labels used to guide selection of data to label
- First attempt: Naive Bayes classifier tuned to minimize error rate
121What's the Problem?
- Built classifier to minimize # of errors
- But goal here is counting class members
- Better approach:
- Predict probability of class membership
- Add up probabilities to estimate count
- Used logistic regression to rescale Naive Bayes outputs to be probabilities
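A sketch of the two estimators being contrasted (the toy probabilities below are made up):

import numpy as np

p = np.array([0.05, 0.2, 0.2, 0.4, 0.1])   # calibrated class-membership probabilities

hard_count = np.sum(p > 0.5)   # minimize-errors view: 0 here, nothing crosses 0.5
soft_count = np.sum(p)         # expected number of class members: 0.95 here
print(hard_count, soft_count)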
124Why Are Results So Different?
- Probability estimates fairly well calibrated...
- About 20% of docs with p near 0.2 are class members
- ...but almost all estimates are less than 0.5
- If minimizing error rate, classifiers almost always say a document is not a class member
- Knowledge of the mining goal is critical
125Example 3 Categorizing Nonprofit Activities
- Class labels here are meant to be attributes for mining and analysis
- Joint work with Thomas H. Pollak and Sheryl Romeo of The Urban Institute
- In progress
126Background
- Tax-exempt groups report finances and programs (activities) to the IRS
- Urban Institute and Guidestar digitize reports for access and analysis
- Groups categorize themselves using the NTEE taxonomy
- Groups don't categorize their programs
127Program Record (Portions)
- Name: WEST CHESTER FIRE TRAINING CTR.
- Group Purpose: FIREFIGHTER EDUCATION TRAINING
- City: W CHESTER, PA
- NTEE Code: M24
- Program Achievements: COMPLETED CONSTRUCTION AND DEDICATED A NEW FIRE AND SMOKE TRAINING BUILDING.
- Exp1: 209403, Exp2: 218554, Grants: 0
128Program-Level Categorization
- NCCS wants a category label associated with each program
- Support manual and automated data mining
- e.g., Is there a correlation between lack of a food bank and health problems in a city?
- New taxonomy (NPC)
- Finer-grained, with different emphases, than NTEE
129NPC Taxonomy (Portion)
- B Education
- B01 Education, General/Other
- B02 Education Policy Programs
-
- B04 Educational Programs
- B04.01 Educational Programs, General/Other
- B04.02 Adult Education Programs
- B04.02.01 Adult Education Programs, General/Other
- B04.02.02 Adult Basic Education Programs
130Manual Categorization w/ NPC
Elite coder blind agreement on random sample of
200 program records
131Scale
- 300,000 program records/year
- Short texts
- Batches arrive monthly
- Classification allowed to take several days
- Even w/ 797 classes, speed not a big issue
- Potential savings: 3,000 person-hours/yr
132Data
- Labeled examples (from previous social science studies)
- 12,531 labeled by summer interns
- 11,879 labeled by NCCS personnel
- 390,000 unlabeled examples
- Textual and nontextual attributes
- NPC & NTEE taxonomies
- Existing manually engineered classifier
133Data Difficulties
- Multiple coders
- Intern-labeled data less consistent, invalid categories, variable format
- Engineered classifier and intern data use an old version of NPC
- Missing values
- Labeled data a geographic subset, not random
- Variable # of programs per organization
134Explored Many Techniques
- Which text fields, and whether to merge
- Phrase formation
- Nontextual attributes
- Choice/tuning of learning algorithm
- How/how much to use intern-labeled data
- How/how much to use prior knowledge
- Balancing of data by source
135What Mattered
- Program and organization-level text fields
- Avoiding overfitting to organizations
- Using NTEE class as predictor
- Discriminative learning (vs. naïve Bayes)
- Efficient software
- Granularity of desired classification
136 What Didn't Matter (Much)
- Multiword phrasal attributes
- Financial attributes
- Ordering of data
- Which of several discriminative algorithms
- Used SNoW w/ Winnow and perceptron
- Also BoosTexter, but too slow
- Intern-labeled data
137 Accuracy: Engineered vs. Learned
Train 8,910; Validation 593; Test 2,376
Elite-coded, balanced by coder
138Adjusting Results for Plausibility
- However, some errors are worse than others
- Had NCCS judge (blindly) category assignments for plausibility
- Human, engineered, learned, hybrid, random
- Sample of 705 programs, stratified on human level-1 category
- Can't specify all plausible categories in advance
139Plausibility-Adjusted (Tent.)
Mistakes made by manually engineered classifier
more likely to be plausible
140Best of Both Worlds?
- Engineered classifier had a similar form to the SNoW classifiers
- Linear model gives a score for each category
- Choose the highest-scoring category
- SNoW algorithms are incremental
- Used engineered classifier as starting point
- After a couple of days' translation work
141Plausibility-Adjusted (Tent.)
Helps, though still below goal
142Next Steps Toward Goal
- More labeled data becoming available
- Explicitly capture notion of plausibility
- Does not correspond to closeness in taxonomy
- Better combination of engineered & learned classifiers
- Kicking out difficult cases to a human