Title: Supervised Learning for Text Classification
1Supervised Learning for Text Classification
2Predictive Modeling
Goal: learn a mapping y = f(x). Need:
1. A model structure
2. A score function
3. An optimization strategy
Categorical y ∈ {c1, ..., cm}: classification. Real-valued y: regression.
Note: usually assume c1, ..., cm are mutually exclusive and exhaustive.
3Probabilistic Classification
Let p(ck) prob. that a randomly chosen object
comes from ck Objects from ck have p(x ck ,
?k) (e.g., MVN) Then p(ck x ) ? p(x ck ,
?k) p(ck)
Bayes Error Rate
- Lower bound on the error rate achievable by any classifier
4 Bayes error rate about 6%
5Classifier Types
Discriminative: model p(ck | x) - e.g. linear regression, logistic regression, CART
Generative: model p(x | ck, θk) - e.g. Bayesian classifiers, LDA
6Regression for Binary Classification
- Can fit a linear regression model to a 0/1 response
- Predicted values are not necessarily between zero and one
- With p > 1, the decision boundary is linear
- e.g. 0.5 = b0 + b1 x1 + b2 x2
zeroOneR.txt
8Naïve Bayes via a Toy Spam Filter Example
- Naïve Bayes is a generative model that makes drastic simplifying assumptions
- Consider a small training data set for spam along with a bag-of-words representation
11Naïve Bayes Machinery
- We need a way to estimate
- Via Bayes theorem we have
or, on the log-odds scale
12Naïve Bayes Machinery
and
leading to
13Maximum Likelihood Estimation
weights of evidence
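As a sketch of the standard Naïve Bayes machinery these slides walk through (written here for the spam example; the slides' own equations are not reproduced): with a bag-of-words representation x = (x1, ..., xp) and the "naïve" assumption that words are independent given the class,

\log \frac{p(\mathrm{spam} \mid x)}{p(\mathrm{ham} \mid x)} = \log \frac{p(\mathrm{spam})}{p(\mathrm{ham})} + \sum_{j} \log \frac{p(x_j \mid \mathrm{spam})}{p(x_j \mid \mathrm{ham})}

Each term in the sum is a word's "weight of evidence", and maximum likelihood estimates the word probabilities by their relative frequencies within each class.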
14Naïve Bayes Prediction
- Usually add a small constant (e.g. 0.5) to the counts to avoid divide-by-zero problems and to reduce bias
- New message: "the quick rabbit rests"
15
- Predicted log odds: 0.51 + 0.51 + 0.51 + 0.51 + 1.10 + 0 = 3.04
- Corresponds to a spam probability of 0.95
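A minimal sketch of this prediction step in Python; the per-word counts below are hypothetical, made up for illustration, and only the final log-odds-to-probability conversion (3.04 to 0.95) comes from the slide.

import math

# Hypothetical word counts per class (stand-ins for the toy training set)
spam_counts = {"the": 4, "quick": 4, "rabbit": 4, "rests": 0}
ham_counts  = {"the": 2, "quick": 2, "rabbit": 2, "rests": 2}
n_spam, n_ham = 5, 5   # assumed number of training messages per class

def weight_of_evidence(word, smooth=0.5):
    # add a small constant (0.5) to avoid divide-by-zero and reduce bias
    p_spam = (spam_counts.get(word, 0) + smooth) / (n_spam + 2 * smooth)
    p_ham = (ham_counts.get(word, 0) + smooth) / (n_ham + 2 * smooth)
    return math.log(p_spam / p_ham)

def log_odds(message, prior_log_odds=0.0):
    return prior_log_odds + sum(weight_of_evidence(w) for w in message.split())

print(log_odds("the quick rabbit rests"))
print(1 / (1 + math.exp(-3.04)))   # the slide's log odds of 3.04 -> probability ~0.95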
16Linear Discriminant Analysis
K classes, X an n × p data matrix.
p(ck | x) ∝ p(x | ck, θk) p(ck)
Could model each class density as multivariate normal.
LDA assumes a common covariance matrix, Σk = Σ, for all k. Then the log posterior ratio between any two classes is linear in x.
17Linear Discriminant Analysis (cont.)
It follows that the classifier should predict the class that maximizes the linear discriminant function δk(x).
If we don't assume the Σk's are identical, we get Quadratic Discriminant Analysis (QDA).
18Linear Discriminant Analysis (cont.)
Can estimate the LDA parameters via maximum
likelihood
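A small sketch of the maximum likelihood fit and the resulting decision rule (my code, not the course's; assumes the common-covariance model above):

import numpy as np

def fit_lda(X, y):
    # ML estimates: class priors, class means, pooled (common) covariance
    classes = np.unique(y)
    priors = {k: np.mean(y == k) for k in classes}
    means = {k: X[y == k].mean(axis=0) for k in classes}
    n = len(y)
    Sigma = sum(np.cov(X[y == k].T, bias=True) * np.sum(y == k) for k in classes) / n
    return priors, means, np.linalg.inv(Sigma)

def lda_predict(x, priors, means, Sigma_inv):
    # predict the class with the largest linear discriminant function
    def delta(k):
        mu = means[k]
        return x @ Sigma_inv @ mu - 0.5 * mu @ Sigma_inv @ mu + np.log(priors[k])
    return max(priors, key=delta)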
19LDA
QDA
T. Hastie, R. Tibshirani, and J. Friedman (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag.
20Logistic Regression
Note that LDA is linear in x.
Linear logistic regression looks the same.
But the estimation procedure for the coefficients is different: LDA maximizes the joint likelihood p(y, X); logistic regression maximizes the conditional likelihood p(y | X). Usually similar predictions.
21Logistic Regression MLE
For the two-class case, the log-likelihood is ℓ(β) = Σi [ yi β'xi − log(1 + exp(β'xi)) ].
To maximize, we need to solve the (non-linear) score equations ∂ℓ/∂β = Σi xi (yi − p(xi; β)) = 0.
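One standard way to solve these score equations is Newton-Raphson (equivalently, iteratively reweighted least squares); a bare-bones sketch with no intercept handling or step-size safeguards:

import numpy as np

def fit_logistic(X, y, n_iter=25):
    # Two-class logistic regression, y in {0, 1}
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-(X @ beta)))      # fitted probabilities
        grad = X.T @ (y - p)                   # the score equations
        W = p * (1 - p)                        # IRLS weights
        hess = X.T @ (X * W[:, None])          # Fisher information X'WX
        beta += np.linalg.solve(hess, grad)    # Newton step
    return beta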
22Logistic Regression Modeling
South African Heart Disease Example (y = MI)
Coefficient estimates with Wald statistics
23Simple Two-Class Perceptron
Define h(x) = w'x. Classify as class 1 if h(x) > 0, class 2 otherwise.
Score function: number of misclassification errors on training data.
For training, replace class-2 xj's by -xj; now need h(x) > 0 for every training point.
Initialize the weight vector. Repeat one or more times:
  For each training data point xi:
    If the point is correctly classified, do nothing
    Else add xi to the weight vector
Guaranteed to converge to a separating hyperplane (if one exists).
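A direct transcription of this procedure as a sketch (the class-2 sign flip is folded into labels y in {+1, -1}, so a point is misclassified when y * h(x) <= 0):

import numpy as np

def train_perceptron(X, y, n_epochs=100):
    w = np.zeros(X.shape[1])             # initialize weight vector
    for _ in range(n_epochs):            # repeat one or more times
        for xi, yi in zip(X, y):         # for each training data point
            if yi * (w @ xi) <= 0:       # misclassified (or on the boundary)
                w += yi * xi             # update: add the (signed) point to w
    return w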
24 Orange: Least Squares
T. Hastie, R. Tibshirani, and J. Friedman (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag.
25Optimal Hyperplane
The optimal hyperplane separates the two
classes and maximizes the distance to the closest
point from either class. Finding this hyperplane
is a convex optimization problem. This notion
plays an important role in support vector machines
26 Blue: Optimal; Red: Logistic Regression
T. Hastie, R. Tibshirani, and J. Friedman (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag.
27Measuring the Performance of a Binary Classifier
28 Suppose we use a cutoff of 0.5
Cross-tabulate predicted outcome (1 / 0) against actual outcome (1 / 0) on the test data.
29 More generally

                        actual outcome
                          1        0
  predicted outcome  1    a        b
                     0    c        d

  misclassification rate = (b + c) / (a + b + c + d)
  sensitivity = a / (a + c)   (aka recall)
  specificity = d / (b + d)
  predictive value positive = a / (a + b)   (aka precision)
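These definitions as a small helper (a sketch using the same a, b, c, d cell labels as the table above):

def binary_metrics(a, b, c, d):
    # a: pred 1 & actual 1, b: pred 1 & actual 0,
    # c: pred 0 & actual 1, d: pred 0 & actual 0
    return {
        "misclassification rate": (b + c) / (a + b + c + d),
        "sensitivity (recall)": a / (a + c),
        "specificity": d / (b + d),
        "precision (PPV)": a / (a + b),
    }

# the cutoff-0.5 table on the next slide: a=7, b=3, c=0, d=10
print(binary_metrics(7, 3, 0, 10))   # sensitivity 1.0, specificity ~0.77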
30 Suppose we use a cutoff of 0.5

                        actual outcome
                          1        0
  predicted outcome  1    7        3
                     0    0       10

  sensitivity = 7 / (7 + 0) = 100%
  specificity = 10 / (3 + 10) = 77%
31 Suppose we use a cutoff of 0.8

                        actual outcome
                          1        0
  predicted outcome  1    5        2
                     0    2       11

  sensitivity = 5 / (5 + 2) = 71%
  specificity = 11 / (2 + 11) = 85%
32- Note there are 20 possible thresholds
- An ROC curve computes sensitivity and specificity for all possible thresholds and plots them
- Note: if the threshold is at the minimum, c = d = 0, so sensitivity = 1 and specificity = 0
- If the threshold is at the maximum, a = b = 0, so sensitivity = 0 and specificity = 1
35- Area under the ROC curve (AUC) is a common measure of predictive performance
- So is squared error, Σi (yi − ŷi)², also known as the Brier score
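A sketch of how these quantities could be computed with scikit-learn (the toy labels and probabilities below are made up; this is not the course's code):

from sklearn.metrics import roc_curve, roc_auc_score, brier_score_loss

y = [1, 1, 0, 1, 0, 0]                  # true 0/1 labels (toy)
p = [0.9, 0.8, 0.6, 0.55, 0.3, 0.1]     # predicted probabilities (toy)

fpr, tpr, thresholds = roc_curve(y, p)  # sensitivity/specificity at every threshold
print(roc_auc_score(y, p))              # area under the ROC curve
print(brier_score_loss(y, p))           # mean of (y_i - p_i)^2, i.e. the slide's sum / n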
36A Close Look at Logistic Regression for Text
Classification
37Logistic Regression in One Slide
Example: Predict the gender (y = M/F) of a person given their height (x = a number).
38Logistic Regression Model
- Linear model for log odds of class membership
- We will call any model with these semantics a
logistic regression model
39
- It's arbitrary which form we write
- Notationally convenient in different cases
40Equivalent Forms of LR Model (1)
- Exponential model for odds ratio
- "I'll give you 3 to 1 against this document being about Sports."
41Equivalent Forms of LR Model (2)
- Logistic model for probability of class membership
- "I think there's a probability of 0.25 that this document is about Sports."
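To connect the last two slides (a quick check, not from the slides themselves): odds of 3 to 1 against Sports means odds = 1/3, so p = odds / (1 + odds) = 0.25, and the corresponding log odds is ln(1/3) ≈ -1.10.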
42 One Beta or Two?
- We could also write the logistic form as
- with β₋₁ = 0
- Suggests natural generalization...
43Polytomous Logistic Regression (PLR)
- Elegant approach to multiclass problems
- Also known as polychotomous LR, multinomial LR,
and, ambiguously, multiple LR and multivariate LR
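In its usual form (a sketch consistent with the β₋₁ = 0 convention above, not necessarily the slides' exact notation), PLR models p(y = k | x) = exp(βk'x) / Σj exp(βj'x), with one parameter vector per class and one of them fixed at zero for identifiability.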
44Why LR is Interesting (1)
- Usual advantages of linear models
- Computationally efficient
- Take advantage of model and data sparsity
- Numeric or discrete inputs
- Can use kernels
- Natural loss function easy to optimize (more
later)
45Why LR is Interesting (2)
- Probabilistic predictions!
- Optimize expected value of effectiveness measure w/o changing model
- Including utilities, rankings, batch measures
- Highlight uncertain test cases
- Estimate number of class members
- Truth is rarely deterministic
46Why LR is Interesting (3)
- Parameters have a meaning
- How log odds increases w/ feature values
- Lets you
- Look at model and see if sensible
- Use domain knowledge to guide parameter fitting (more later)
- Build some parts of model by hand
- Caveat: realistically, a lot can (and does) complicate this interpretation
47Conditional Maximum Likelihood Fitting
- Find parameters (βj's)
- that give the largest possible likelihood (probability)
- of the set of class labels (yi's)
- given the corresponding vectors (xi's)
48Loglikelihood
- For optimization, we equivalently maximize the
logarithm of the conditional likelihood, i.e. find
49(Negated) Loglikelihood as Loss Function
- Negative of log-likelihood measures loss (degree of error) in predictions
- Sum over training examples
- Continuous, differentiable, convex
- No local minima
- But global minimum may not be unique
- Amenable to off-the-shelf optimization approaches
50(Negated) Loglikelihood as Loss Function
Hastie, Tibshirani, and Friedman (2001)
51Problems with Likelihood
- Linear separation of classes leads to infinite
parameter values
Likelihood is maximized by assigning some training points probability exactly 1.0...
...and the others exactly 0.0. That requires infinite parameter values.
52 P(y | x)
53Problems with Likelihood
- Usually too many parameters, too little data
- Result: overfitting (predictions worse on test data than on training)
- Algorithmic kludge: stop before convergence
- More principled kludge: feature selection
- But let's take the high road instead...
54Bayesian Statistics
- Parameters viewed as drawn from a prior distribution, p(θ)
- p(θ) summarizes what we know about θ before seeing data
- After seeing data, D, we have a posterior distribution, p(θ|D)
- Summarizes what we now know about θ
55Bayes Rule
- The two are connected by Bayes Rule
- Gives a convenient way to favor some parameter values over others
- True believers say it's the only legitimate way
56Bayesian MAP Training
- Find parameters (βj's) that maximize log posterior probability of class labels (yi's) given documents (xi's) and prior p(β) on the βj's
- MAP selection is theoretically inferior to making use of the entire posterior distribution
- But only if the model is exactly correct, etc.
- In practice MAP can be just as good
57Priors
- Any multivariate probability distribution p(θ) can be used
- Most lead to intractable, multimodal posteriors
- Let's look at a simple case
- Independent univariate prior for each parameter
- Joint prior is just the product of these
58 Earth's Favorite Distribution
- Suppose our prior beliefs about the βj's are independent Gaussians
59Diagonal Gaussian Prior
- If gaussians are independent, joint distribution
is multivariate gaussian with diagonal covariance
matrix
60Penalized Likelihood
- Diagonal Gaussian prior gives this intuitive function to maximize: the log-likelihood minus a penalty proportional to Σj βj²
- Can't overfit if parameter values are small enough
- The prior variance, σ², trades off fit and penalty
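A sketch of that objective as code (the function and variable names are mine, not the authors'):

import numpy as np

def penalized_loglik(beta, X, y, sigma2):
    # logistic log-likelihood minus the diagonal-Gaussian (ridge) penalty
    eta = X @ beta
    loglik = np.sum(y * eta - np.log1p(np.exp(eta)))
    penalty = np.sum(beta ** 2) / (2 * sigma2)   # smaller sigma2 = stronger shrinkage
    return loglik - penalty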
61 Gaussian gives dense model (coefficients plotted against prior variance)
62Laplace Distribution
- Of course, we could pick any univariate distribution for our prior
- How about the Laplace distribution?
63Multivariate Laplace
- Again we can define a multivariate distribution
as the product of independent Laplace
distributions
64Penalized Likelihood
- Independent Laplace priors give this not-so-intuitive function to maximize: the log-likelihood minus a penalty proportional to Σj |βj|
- Again, favors small parameter values
- So what's the difference from the Gaussian prior?
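The practical difference shows up in sparsity. A sketch of the comparison using scikit-learn on synthetic data (not the BBR/BMR software described later); the Laplace/L1 fit zeroes out most coefficients while the Gaussian/L2 fit keeps them all nonzero:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))      # 50 features, most of them irrelevant
y = (X[:, 0] - X[:, 1] + 0.5 * rng.normal(size=200) > 0).astype(int)

l2 = LogisticRegression(penalty="l2", C=0.5).fit(X, y)
l1 = LogisticRegression(penalty="l1", C=0.5, solver="liblinear").fit(X, y)

print("nonzero coefficients, Gaussian/L2:", np.sum(l2.coef_ != 0))   # typically all 50
print("nonzero coefficients, Laplace/L1: ", np.sum(l1.coef_ != 0))   # typically a handful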
65 Gaussian gives dense model (coefficients plotted against prior variance)
66 Laplace gives sparse model
67Text Classification Example
- ModApte subset of Reuters-21578
- 90 categories; 9,603 training docs; 18,978 features
- Reuters RCV1-v2
- 103 categories; 23,149 training docs; 47,152 features
- OHSUMED heart disease categories
- 77 categories; 83,944 training docs; 122,076 features
- Cosine-normalized TF×IDF weights
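This document representation (cosine-normalized TF×IDF) is, for instance, what scikit-learn's TfidfVectorizer produces with its defaults; a sketch, with a two-document toy corpus:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(use_idf=True, norm="l2")   # l2 norm = cosine normalization
X_train = vectorizer.fit_transform(["acquisition of shares", "wheat exports rise"])
print(X_train.shape)   # (number of training docs, number of features)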
68Dense vs. Sparse Models (Macroaveraged F1)
70An Example Model (category grain)
71Bayesian Use of Domain Knowledge
- Suppose we know (or have resources that suggest)
- Certain words are positively or negatively associated with the category
- Certain words are mostly unrelated to content
- Prior mean can encode positive or negative association
- Prior variance can encode how confident we are
72DK-Based Prior Variance
- Words we believe to be good content indicators should be allowed to get larger parameter values
- Higher prior variance means less penalization
- We used
- C is a tuning constant
- significance based on TF×IDF in training data, or in prior-knowledge texts (more later)
- σ² is the baseline variance for words with no prior knowledge (chosen by cross-validation or heuristics)
73DK-based Prior Mode
- Idea: Prior mode (most likely parameter value) should be > 0 for words we believe to be positively associated with the category
- σ is a standard deviation found by cross-validation
- Analogous to the combining of queries and documents in relevance feedback
74Experiments
- Data sets
- TREC 2004 Genomics data
- Categories: 32 MeSH categories under the Cells hierarchy
- Documents: 3,742 training and 4,175 test
- Prior knowledge: MeSH category descriptions
- ModApte subset of Reuters-21578
- Categories: 10 most frequent categories
- Documents: 9,603 training and 3,299 test
- Prior knowledge: keywords selected by hand (Wu & Srihari, 2004)
- Study different training set sizes
75Sources of Prior Knowledge
- Text that is strongly associated with a category
- But which doesn't have the same statistical properties as training examples
- e.g. category descriptions
- Human intuition
76 Text as Prior Knowledge: MeSH Category Description
- MeSH Heading: Neurons
- Scope Note: The basic cellular units of nervous tissue. Each neuron consists of a body, an axon, and dendrites. Their purpose is to receive, conduct, and transmit impulses in the nervous system.
- Entry Term: Nerve Cells
- See Also: Neural Conduction
IDF on domain texts gives low significance
IDF gives high significance
77 Priors on βs (Laplace, mode 0, domain-IDF-based variance)
78 Priors on βs (Laplace, domain-IDF-based mode, fixed variance)
79 MeSH Results (training: 3,742 random examples)
80 MeSH Results (training: 500 random examples)
81 MeSH Results (training: 5 positive + 5 random examples/category)
82 Prior Knowledge from Human Intuition (Wu & Srihari)
83 ModApte Results (training: 100 random examples)
84 ModApte Results (training: 5 positive + 5 random examples/category)
85Advertisement
- Joint work with Aynur Dayanik, Alex Genkin, Michael Hollander, and Vladimir Menkov
- Bayesian logistic regression software available (binary and polytomous)
- http://www.stat.rutgers.edu/madigan/BBR/
- http://www.stat.rutgers.edu/madigan/BMR/
- Optimizer not the best, but fast enough to use (especially BBR)
- Long tradition of early slow optimizers ;-)
86Off the MAP
- Bayesian MAP training only uses the single most probable β from the posterior distribution
- Would a weighted combination of βs be better?
- Maybe, but not guaranteed (despite theoretical optimality)
- Computationally expensive
- High-dimensional integrals, Monte Carlo algorithms, etc.
87Application Authorship Attribution
88Some Background
- Identification technologies important for homeland security and in the legal system
- Authorship attribution for textual artifacts using topic-independent "stylometric" features has a long history
- Historical focus on small numbers of authors and low-dimensional representations via function words
91- Used Naïve Bayes with Poisson and Negative Binomial models
- Out-of-sample predictive performance
1 of K: given training documents from each of K known authors, assign each test document to one of the K authors
93 odd man out: training documents from known authors; each test document is either by one of them ("which one?") or by a new author
aka novelty detection
94 document pairs: classify a pair of test documents into one of several possible author configurations
95 anti-aliasing: in this example, the red author and the grey author are the same real person
96Other Related Problems
- Author gender
- Author nationality
- Sentiment (positive/negative feeling)
- Rhetorical style
- Multi-authored documents
971-of-K Authorship Attribution
- Represent documents in a topic-free fashion
- Function words: "and", "of", "the", etc.
- "upon"?
- Sentence lengths, word lengths, deep linguistics, stylometric features
- Parts of speech? Word endings? Word prefixes?
- Combinations of the above
- High-dimensional document representations
98Polytomous Logistic Regression
- Sparse Bayesian (aka lasso) logistic regression trivially generalizes to 1-of-K problems
- Laplace prior particularly appealing here
- Suppose 100 classes and a word that predicts class 17
- The word gets used 100 times if we build 100 binary models, or if we use polytomous with a Gaussian prior
- With a Laplace prior and polytomous, it's used only once
991-of-K Sample Results brittany-l
89 authors with at least 50 postings. 10,076
training documents, 3,322 test documents.
BMR-Laplace classification, default
hyperparameter
103Cross-Topic Mini-Experiment
104Cross-Topic Mini-Experiment
105 odd-man-out Sample Results: RCV-1
- K-L-M approach: K primary authors, L decoy authors, M test authors (plus 50 of the K authors)
114 RCV-1 journalists with 200 articles. Argamon function words. Average of 10 replications. BBR threshold tuning for F1.
106 odd-man-out Sample Results: RCV-1
- K-L-M approach: K primary authors, L decoy authors, M test authors (plus 50 of the K authors)
114 RCV-1 journalists with 200 articles. Argamon function words. Average of 10 replications. BBR threshold tuning for F1.
107KDD Challenge
- 150,000 scientific abstracts
- Task 1: cluster documents written by "T. Suzuki" into real people
- Task 2: find documents that have had a single author deleted and/or replaced
- Features: words, co-authors, institutions, MeSH headings
108Gamon (2004)
- Discriminate between Anne, Charlotte, and Emily
Brontë
109Koppel et al (2004)
- "Unstable" words can be replaced without changing meaning
- Use machine translation algorithms to generate multiple document versions with the same meaning
- Function words are unstable
110Software
- BMR Software. Sparse and non-sparse Bayesian multinomial regression software for large numbers of classes and features
- Featex Software. Tool for creating high-dimensional document representations for authorship attribution
111Conclusions
- Lots of interesting open problems
- How real are the non-literary applications?
112Some More Case Studies
113Example 1 Classifying Customer Comments
- Comments logged by customer service at a large telecom co.
- Free-form text, formatted account info
- Goal was classification, to support
- Informal analysis by account managers
- Possible automated response
- (Joint work with Bill Gale)
114Example (Simulated) Records
- 11-Oct-1999, 17, 7735555555, CST PRIMARY LANG OF CHINESE AND TOLHIM WE WLD CALL BACK BY CHINEES SPEAKER
- 12-Oct-1999, 75, 9085555555, MRS RICHARDS WANTS TV OFR FREE MILES. SNT IT.
115Needs and Techniques
- Non-technical managers wanted to define classes of customers
- Supervised learning from examples
- Active learning: reduce amount of data to label
- Data accessed via menu-based interface with oddly limited boolean querying
- Learned rule-based classifiers obeying syntactic restrictions using Cohen's Grendel system
116Example Classifier
- HEAR or CUT or STATIC or NOISE
- or (LINE and NOISY)
- or (TALKING and BAD)
- or (LINES and DIRECT)
- Accuracy 88% (90% if negation allowed)
117Example 2 Counting Types of Device Failure
- Trouble tickets for service calls on PBXes
- Repair person enters failure type, attributes
- Plus textual notes (sometimes)
- As part of process improvement
- Reorganized taxonomy of failure types
- Want to know number of failures in each class
- (Joint work with Mark Jones)
118Trouble Ticket (Simulated)
- Customer: Giant Foods, Inc., Key West, FL
- Model: RX1837
- Date: 5-Oct-1999
- Last Service: 18-Nov-1997
- Problem Code: OverHeat
- Resolution Code: ReplacePart
- Notes: rpl fan, vac dust rec maint pln
119Goals and Techniques
- Leverage similarities between old and new taxonomies
- Classifiers use old class labels as predictors
- (along with words and attributes)
- Old class labels used to guide selection of data to label
- First attempt: Naive Bayes classifier tuned to minimize error rate
121What's the Problem?
- Built classifier to minimize # of errors
- But goal here is counting class members
- Better approach:
- Predict probability of class membership
- Add up probabilities to estimate count
- Used logistic regression to rescale Naive Bayes outputs to be probabilities
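A sketch of the two estimators being contrasted (the toy probabilities below are made up):

import numpy as np

p = np.array([0.05, 0.2, 0.2, 0.4, 0.1])   # calibrated class-membership probabilities

hard_count = np.sum(p > 0.5)   # minimize-errors view: 0 here, nothing crosses 0.5
soft_count = np.sum(p)         # expected number of class members: 0.95 here
print(hard_count, soft_count)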
124Why Are Results So Different?
- Probability estimates fairly well calibrated...
- About 20% of docs with p near 0.2 are class members
- ...but almost all estimates are less than 0.5
- If minimizing error rate, classifiers almost always say a document is not a class member
- Knowledge of the mining goal is critical
125Example 3 Categorizing Nonprofit Activities
- Class labels here are meant to be attributes for mining and analysis
- Joint work with Thomas H. Pollak and Sheryl Romeo of The Urban Institute
- In progress
126Background
- Tax-exempt groups report finances and programs (activities) to the IRS
- Urban Institute and Guidestar digitize reports for access and analysis
- Groups categorize themselves using the NTEE taxonomy
- Groups don't categorize their programs
127Program Record (Portions)
- Name: WEST CHESTER FIRE TRAINING CTR.
- Group Purpose: FIREFIGHTER EDUCATION TRAINING
- City: W CHESTER, PA
- NTEE Code: M24
- Program Achievements: COMPLETED CONSTRUCTION AND DEDICATED A NEW FIRE AND SMOKE TRAINING BUILDING.
- Exp1: 209403, Exp2: 218554, Grants: 0
128Program-Level Categorization
- NCCS wants a category label associated with each program
- Support manual and automated data mining
- e.g., Is there a correlation between lack of a food bank and health problems in a city?
- New taxonomy (NPC)
- Finer-grained, with different emphases, than NTEE
129NPC Taxonomy (Portion)
- B Education
- B01 Education, General/Other
- B02 Education Policy Programs
-
- B04 Educational Programs
- B04.01 Educational Programs, General/Other
- B04.02 Adult Education Programs
- B04.02.01 Adult Education Programs, General/Other
- B04.02.02 Adult Basic Education Programs
130Manual Categorization w/ NPC
Elite coder blind agreement on random sample of
200 program records
131Scale
- 300,000 program records/year
- Short texts
- Batches arrive monthly
- Classification allowed to take several days
- Even w/ 797 classes, speed not a big issue
- Potential savings: 3,000 person-hours/yr
132Data
- Labeled examples (from previous social science studies)
- 12,531 labeled by summer interns
- 11,879 labeled by NCCS personnel
- 390,000 unlabeled examples
- Textual and nontextual attributes
- NPC & NTEE taxonomies
- Existing manually engineered classifier
133Data Difficulties
- Multiple coders
- Intern-labeled data less consistent, invalid categories, variable format
- Engineered classifier and intern data use an old version of NPC
- Missing values
- Labeled data a geographic subset, not random
- Variable # of programs per organization
134Explored Many Techniques
- Which text fields, and whether to merge
- Phrase formation
- Nontextual attributes
- Choice/tuning of learning algorithm
- How/how much to use intern-labeled data
- How/how much to use prior knowledge
- Balancing of data by source
135What Mattered
- Program and organization-level text fields
- Avoiding overfitting to organizations
- Using NTEE class as predictor
- Discriminative learning (vs. naïve Bayes)
- Efficient software
- Granularity of desired classification
136 What Didn't Matter (Much)
- Multiword phrasal attributes
- Financial attributes
- Ordering of data
- Which of several discriminative algorithms
- Used SNoW w/ Winnow and perceptron
- Also BoosTexter, but too slow
- Intern-labeled data
137 Accuracy: Engineered vs. Learned
Train 8,910; Validation 593; Test 2,376
Elite-coded, balanced by coder
138Adjusting Results for Plausibility
- However, some errors are worse than others
- Had NCCS judge (blindly) category assignments for plausibility
- Human, engineered, learned, hybrid, random
- Sample of 705 programs, stratified on human level-1 category
- Can't specify all plausible categories in advance
139Plausibility-Adjusted (Tent.)
Mistakes made by manually engineered classifier
more likely to be plausible
140Best of Both Worlds?
- Engineered classifier had a similar form to the SNoW classifiers
- Linear model gives a score for each category
- Choose the highest-scoring category
- SNoW algorithms are incremental
- Used engineered classifier as starting point
- After a couple of days' translation work
141Plausibility-Adjusted (Tent.)
Helps, though still below goal
142Next Steps Toward Goal
- More labeled data becoming available
- Explicitly capture notion of plausibility
- Does not correspond to closeness in taxonomy
- Better combination of engineered & learned classifiers
- Kicking out difficult cases to a human