Title: Learning with Positive and Unlabeled Examples using Weighted Logistic Regression
1. Learning with Positive and Unlabeled Examples using Weighted Logistic Regression
- Wee Sun Lee
- National University of Singapore
- Bing Liu
- University of Illinois, Chicago
2. Personalized Web Browser
- Learn web pages that are of interest to you!
- Information that is available to the browser when it is installed:
  - Your bookmarks (or cached documents): positive examples
  - All documents on the web: unlabeled examples!
3. Direct Marketing
- Company has a database with details of its customers: positive examples
- Wants to find people who are similar to its own customers
- Buys a database consisting of details of people, some of whom may be potential customers: unlabeled examples
4. Assumptions
- All examples are drawn independently from a fixed underlying distribution
- Negative examples are never labeled
- With fixed probability α, a positive example is independently left unlabeled (see the toy sketch below)
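A toy sketch of this sampling model (the generator, names, and data are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_pu_labels(y_true, alpha):
    # Negatives are never labeled; each positive is independently
    # left unlabeled with probability alpha.
    labeled_pos = (y_true == 1) & (rng.random(len(y_true)) >= alpha)
    return labeled_pos.astype(int)  # 1 = labeled positive, 0 = unlabeled

y_true = rng.integers(0, 2, size=1000)    # hidden true labels (toy data)
y_pu = make_pu_labels(y_true, alpha=0.3)  # what the learner actually sees
```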
5. Are Unlabeled Examples Helpful?
- Function is known to be either x1 < 0 or x2 > 0
- Which one is it?
- Not learnable with only positive examples. However, the addition of unlabeled examples makes it learnable.
[Figure: the two candidate regions, x1 < 0 and x2 > 0]
6. Related Work
- Denis (1998) showed that function classes learnable in the statistical query model are learnable from positive and unlabeled examples.
- Muggleton (2001) showed that learning from positive examples is possible if the distribution of inputs is known.
- Liu et al. (2002) give sample complexity bounds and an algorithm based on EM.
- Yu et al. (2002) give an algorithm based on SVM.
7. Approach
- Label all unlabeled examples as negative (Denis 1998):
  - Negative examples are always labeled negative
  - Positive examples are labeled negative with probability α
  - Training with one-sided noise
- Problem: α is not known
- Also, what if there is some noise on the negative examples? Negative examples may occasionally be labeled positive with small probability.
8. Selecting Threshold and Robustness to Noise
- Approach: reweight the examples and learn the conditional probability P(Y=1|X)
- If you weight the examples by:
  - multiplying the negative examples by a weight equal to the number of positive examples, and
  - multiplying the positive examples by a weight equal to the number of negative examples ...
9. Selecting Threshold and Robustness to Noise
- ... then P(Y=1|X) > 0.5 when X is a positive example and P(Y=1|X) < 0.5 when X is a negative example, as long as α + β < 1, where:
  - α is the probability that a positive example is labeled negative
  - β is the probability that a negative example is labeled positive
- Okay even if some of the positive examples are not actually positive (noise). (See the weighting sketch below.)
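A minimal numpy sketch of this weighting scheme, assuming y_pu marks the labeled positives as 1 and everything treated as negative (unlabeled plus negative) as 0; the helper name is ours:

```python
import numpy as np

def balancing_weights(y_pu):
    # Each class is weighted by the size of the other class: positives
    # (label 1) get weight n_neg, "negatives" (label 0) get weight n_pos.
    n_pos = int(np.sum(y_pu == 1))
    n_neg = int(np.sum(y_pu == 0))
    return np.where(y_pu == 1, n_neg, n_pos).astype(float)

print(balancing_weights(np.array([1, 0, 0, 0, 1, 0])))  # [4. 2. 2. 2. 4. 2.]
```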
10. Weighted Logistic Regression
- Practical algorithm: reweight the examples, then do logistic regression with a linear function to learn P(Y=1|X) (see the sketch after this list)
- Compose the linear function with a sigmoid, then do maximum likelihood estimation
- Convex optimization problem
- Will learn the correct conditional probability if it can be represented
- Minimizes an upper bound on the weighted classification error if it cannot be represented, so it still makes sense
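One way this could look in practice; scikit-learn is a stand-in (the slides do not prescribe a library), X is toy data, and balancing_weights and y_pu come from the hypothetical sketches above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(len(y_pu), 5))   # toy feature matrix

# Linear function composed with a sigmoid, fit by (penalized) maximum
# likelihood; sample_weight carries the reweighting from slides 8-9.
clf = LogisticRegression(C=1.0)       # sklearn's C is the inverse of slide 11's c
clf.fit(X, y_pu, sample_weight=balancing_weights(y_pu))

p_hat = clf.predict_proba(X)[:, 1]    # estimated P(Y=1 | X)
y_pred = (p_hat > 0.5).astype(int)    # threshold at 0.5, per slide 9
```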
11. Selecting Regularization Parameter
- Regularization is important when learning with noise
- Add c times the sum of the squared weight values to the cost function as regularization
- How do we choose the value of c?
- When both positive and negative examples are available, a validation set can be used to choose c
- Weighted examples in a validation set could be used to choose c, but does this make sense?
12. Selecting Regularization Parameter
- Performance criterion pr/P(Y=1) can be estimated directly from the validation set as r^2/P(f(X)=1)
  - Recall r = P(f(X)=1 | Y=1)
  - Precision p = P(Y=1 | f(X)=1)
  - (By Bayes' rule, p = r P(Y=1)/P(f(X)=1), so pr/P(Y=1) = r^2/P(f(X)=1); both r and P(f(X)=1) can be estimated without negative labels.)
- Can be used for:
  - tuning the regularization parameter c
  - comparing different algorithms when only positive and unlabeled examples (no negatives) are available (sketch below)
- Behavior similar to the commonly used F-score, F = 2pr/(p+r)
- Reasonable whenever use of the F-score is reasonable
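A sketch of this estimator on a validation set (function and argument names are ours); note that it needs only the labeled positives and the overall rate of positive predictions, never a negative label:

```python
import numpy as np

def pu_score(y_pred, is_labeled_pos):
    # Estimate pr / P(Y=1) as r^2 / P(f(X)=1): r is estimated on the
    # labeled positives (representative, since positives are left
    # unlabeled at random); P(f(X)=1) on the whole validation set.
    r = y_pred[is_labeled_pos].mean()   # recall estimate, P(f(X)=1 | Y=1)
    p_fx1 = y_pred.mean()               # estimate of P(f(X)=1)
    return r ** 2 / p_fx1 if p_fx1 > 0 else 0.0
```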
13. Experimental Setup
- 20 Newsgroups dataset
  - 1 group positive, 19 others negative
- Term frequencies as features, normalized to length 1
- Randomly split:
  - 50% train
  - 20% validation
  - 30% test
- Validation set used to select the regularization parameter from a small discrete set, then retrain on the training + validation set (protocol sketch below)
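Putting the earlier sketches together, one plausible rendering of this protocol; the toy data, the candidate grid for c, and all helper names are assumptions, not the authors' code:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
tf_counts = rng.poisson(1.0, size=(1000, 50)).astype(float)  # toy term counts
X = normalize(tf_counts, norm="l2")   # term frequencies, normalized to length 1

X_tr, X_rest, y_tr, y_rest = train_test_split(X, y_pu, train_size=0.5)   # 50% train
X_val, X_te, y_val, y_te = train_test_split(X_rest, y_rest, train_size=0.4)  # 20%/30%

def fit(c, X, y):
    # weighted logistic regression from the earlier sketch
    return LogisticRegression(C=c).fit(X, y, sample_weight=balancing_weights(y))

# Select c from a small discrete set by the pu_score criterion ...
best_c = max([0.01, 0.1, 1.0, 10.0],
             key=lambda c: pu_score(fit(c, X_tr, y_tr).predict(X_val), y_val == 1))

# ... then retrain on training + validation before evaluating on the test set.
final = fit(best_c, np.vstack([X_tr, X_val]), np.concatenate([y_tr, y_val]))
```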
14. Results
F-score averaged over 20 groups:

α     Opt     pr/P(Y=1)   Weighted Error   S-EM    1-Cls SVM
0.3   0.757   0.754       0.646            0.661   0.15
0.7   0.675   0.659       0.619            0.59    0.153
15. Conclusions
- Learning from positive and unlabeled examples by learning P(Y=1|X) after setting all unlabeled examples to negative
- Reweighting the examples allows a threshold at 0.5 and makes the method tolerant to negative examples that are mislabeled as positive
- The performance measure pr/P(Y=1) can be estimated from data
  - Useful when the F-score is reasonable
  - Can be used to select the regularization parameter
- Logistic regression with a linear function, combined with these methods, works well on text classification