Title: Linear Models (II)
1. Linear Models (II)
2. Recap
- Classification problems
- Inputs x → output y
- y is from a discrete set
- Example: height 1.8m → male or female?
- Statistical learning approaches for classification problems
- Training examples: (1.8m, m), (1.87, m), (1.65, f), (1.66, m), (1.58, f), (1.63, f)
- Learn p(h|male), p(male) and p(h|female), p(female)
- Compare p(male|1.8) and p(female|1.8)
3. Recap
- Generative Model
- p(y|x): determines the class y for object x
- p(y): how frequently class y appears
- p(x|y): the input pattern for class y
- Example
- 1.8m → male or female?
- p(male|1.8m) = p(male)p(1.8m|male)/p(1.8m)
- p(female|1.8m) = p(female)p(1.8m|female)/p(1.8m)
- p(1.8m) = p(1.8m|male)p(male) + p(1.8m|female)p(female)
4. Recap
- Learning p(x|y) and p(y)
- p(y) = #examples(y) / #examples
- Maximum likelihood estimation for p(x|y)
- Example
- Training examples: (1.8m, m), (1.87, m), (1.65, f), (1.66, m), (1.58, f), (1.63, f)
- p(male) = N_male/N, p(female) = N_female/N
- Assume that the height distributions for male and female are Gaussian: (μ_male, σ_male), (μ_female, σ_female)
- MLE estimation (see the sketch below)
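Below is a minimal Python sketch of the Gaussian MLE fit and the Bayes-rule prediction described above, using the six training examples from the slides; the function names and the final query height are illustrative choices, not part of the lecture.

    import math

    males   = [1.80, 1.87, 1.66]    # heights labelled m
    females = [1.65, 1.58, 1.63]    # heights labelled f

    def fit_gaussian(xs):
        # MLE estimates: sample mean and (biased) sample variance
        mu = sum(xs) / len(xs)
        var = sum((x - mu) ** 2 for x in xs) / len(xs)
        return mu, var

    def gaussian_pdf(x, mu, var):
        return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

    mu_m, var_m = fit_gaussian(males)
    mu_f, var_f = fit_gaussian(females)
    p_male, p_female = 0.5, 0.5     # 3 of the 6 examples in each class

    # Bayes rule: p(male|h) = p(male) p(h|male) / p(h)
    h = 1.80
    num_m = p_male * gaussian_pdf(h, mu_m, var_m)
    num_f = p_female * gaussian_pdf(h, mu_f, var_f)
    print(num_m / (num_m + num_f))  # p(male|1.8)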
7. Recap
- Naïve Bayes
- Input x is a vector x = (x1, x2, …, xm)
- Assume the features are independent of each other given the class y: p(x|y) = p(x1|y)p(x2|y)…p(xm|y) (see the decision rule below)
- Each p(xi|y) is estimated using the MLE approach
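Written out, the resulting naïve Bayes decision rule (standard, stated here for completeness) is:

    \hat{y} = \arg\max_{y} \; p(y) \prod_{i=1}^{m} p(x_i \mid y)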
8. Text Classification (I)
- Learning to classify text
- Input x: a document, represented by a vector of words
- Output y: interesting or not
- +1 for an interesting document, -1 for an uninteresting one
- Generative model for text classification (TC)
- p(+), p(-)
- p(doc|+), p(doc|-)
- Naïve Bayes approach
9. Text Classification (II)
- Learning parameters for TC
- p(+) = n(+)/N, p(-) = n(-)/N
- n(±): number of positive (or negative) documents
- N: total number of documents
- Apply MLE for estimating p(w|+), p(w|-) (see the sketch below)
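A minimal sketch of these estimates, assuming a Bernoulli model over words (each word either present or absent in a document); the function and variable names are illustrative, not from the lecture:

    from collections import Counter

    def train_nb(docs, labels):
        # docs: list of word sets; labels: +1 / -1 per document
        n_pos = sum(1 for y in labels if y == +1)
        n_neg = len(labels) - n_pos
        p_pos, p_neg = n_pos / len(labels), n_neg / len(labels)
        pos_counts, neg_counts = Counter(), Counter()
        for words, y in zip(docs, labels):
            (pos_counts if y == +1 else neg_counts).update(words)
        # MLE: p(w|+) = fraction of positive documents containing w
        p_w_pos = {w: c / n_pos for w, c in pos_counts.items()}
        p_w_neg = {w: c / n_neg for w, c in neg_counts.items()}
        return p_pos, p_neg, p_w_pos, p_w_neg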
10. Text Classification (III)
- An example: the Twenty Newsgroups dataset
11. Text Classification (IV)
- Any problems with the naïve Bayes text classifier?
12. Text Classification (V)
- Problems
- Irrelevant words
- Unseen words
- Solution for irrelevant words
- Select relevant words using the mutual information I(x, y) (see the formula after this list)
- x: whether or not word x appears in a document
- y: whether or not the document is of interest
- Solution for unseen words
- Word class approach
- Introduce word classes T = {t1, t2, …, tm}
- Compute p(ti|+), p(ti|-)
- When w is unseen, replace p(w|±) with p(ti|±) for the class ti that contains w
- Word correlation approach
- Find the correlations p(w|w') between words using web information
- p(w|±) = Σ_w' p(w|w') p(w'|±)
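For reference, the mutual information used above is the standard quantity (the binary ranges for x and y match the setup on this slide):

    I(x, y) = \sum_{x \in \{0,1\}} \sum_{y \in \{+,-\}} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}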
13. Logistic Regression Model
- The Gaussian generative model finds a linear decision boundary.
- Why not learn a linear decision boundary directly?
14. Logistic Regression Model
- Model the log-ratio of the positive class to the negative class as a linear function of x:
    log [ p(+|x) / p(-|x) ] = x·w + c
- Result:
    p(+|x) = 1 / (1 + exp(-x·w - c)),  p(-|x) = 1 / (1 + exp(x·w + c))
15. Logistic Regression Model
- Assume the inputs and outputs are related through the log-linear function above
- Estimate the weights w and c: MLE approach (see the sketch below)
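A minimal sketch of the MLE fit by gradient ascent on the log-likelihood; the step size, iteration count, and all names are illustrative assumptions, not values from the lecture:

    import math

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    def fit_logistic(xs, ys, lr=0.1, iters=2000):
        # xs: scalar inputs; ys: labels in {+1, -1}
        w, c = 0.0, 0.0
        for _ in range(iters):
            gw = gc = 0.0
            for x, y in zip(xs, ys):
                # gradient of log p(y|x) with p(y|x) = sigmoid(y * (x*w + c))
                g = y * (1.0 - sigmoid(y * (x * w + c)))
                gw += g * x
                gc += g
            w += lr * gw / len(xs)
            c += lr * gc / len(xs)
        return w, c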
16. Example 1: Heart Disease
- Input feature x: age group ID

    ID  age group    ID  age group
     1   25-29        5   45-49
     2   30-34        6   50-54
     3   35-39        7   55-59
     4   40-44        8   60-64

- Output y: having heart disease or not
- +1: having heart disease
- -1: no heart disease
17. Example 1: Heart Disease
- Logistic regression model
- Learning w and c: MLE approach
- Numerical optimization gives w = 0.58, c = -3.34
18. Example 1: Heart Disease
- w = 0.58
- An older person is more likely to have heart disease
- c = -3.34
- i·w + c < 0 → p(+|i) < p(-|i)
- i·w + c > 0 → p(+|i) > p(-|i)
- i·w + c = 0 → decision boundary
- i = 5.78 → about 53 years old (worked out below)
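The boundary solves the linear equation below; with the rounded parameters shown, -c/w comes out slightly below the slide's 5.78, presumably because the slide used unrounded estimates:

    i \cdot w + c = 0 \quad\Rightarrow\quad i = -\frac{c}{w} = \frac{3.34}{0.58} \approx 5.76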
19. Naïve Bayes Solution
- Inaccurate fitting
- The age distribution is not Gaussian
- Decision boundary: i = 5.59
- Close to the estimate from logistic regression
- Even though naïve Bayes does not fit the input patterns well, it still works fine for the decision boundary
20. Problems with Using Histogram Data?
21. Uneven Sampling for Different Ages
22. Solution
- w = 0.63, c = -3.56 → i = 5.65
23. Example 2: Text Classification
- Input x: a binary vector
- Each word is a different dimension
- xi = 0 if the i-th word does not appear in the document
- xi = 1 if it does
- Output y: interesting document or not
- +1: interesting
- -1: uninteresting
24. Example 2: Text Classification
Doc 1: "The purpose of the Lady Bird Johnson Wildflower Center is to educate people around the world, …"
Doc 2: "Rain Bird is one of the leading irrigation manufacturers in the world, providing complete irrigation solutions for people …"

    term     the   world   people   company   center
    Doc 1     1      1        1        0         1
    Doc 2     1      1        1        1         0
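A minimal sketch of building these binary term vectors; the vocabulary list comes from the table above, everything else is illustrative:

    vocab = ["the", "world", "people", "company", "center"]

    def to_binary_vector(text, vocab):
        words = set(text.lower().replace(",", " ").split())
        return [1 if w in words else 0 for w in vocab]

    doc1 = ("The purpose of the Lady Bird Johnson Wildflower Center "
            "is to educate people around the world")
    print(to_binary_vector(doc1, vocab))  # [1, 1, 1, 0, 1], as in the table
    # Doc 2's snippet is truncated on the slide; 'company' presumably
    # occurs in the elided text, which is why the table shows 1 for it.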
25. Example 2: Text Classification
- Logistic regression model
- Every term ti is assigned a weight wi
- Learning the parameters: MLE approach
- Needs numerical solutions (see the sketch below)
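A minimal sketch of scoring a document once the weights are learned; the weight values below are made up for illustration, not learned values from the lecture:

    import math

    def predict_prob(x, w, c):
        # p(+|x) for binary term vector x, per-term weights w, threshold c
        z = sum(wi * xi for wi, xi in zip(w, x)) + c
        return 1.0 / (1.0 + math.exp(-z))

    w = [0.1, 0.2, 0.3, -0.5, 0.8]   # one weight per vocabulary term (illustrative)
    c = -0.4
    print(predict_prob([1, 1, 1, 0, 1], w, c))  # Doc 1 from the previous slide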
26. Example 2: Text Classification
- Weight wi
- wi > 0: term ti is positive evidence
- wi < 0: term ti is negative evidence
- wi = 0: term ti is irrelevant to whether the document is interesting
- The larger the weight wi (in magnitude), the more important term ti is in determining whether the document is interesting
- Threshold c
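Putting the weights and the threshold together, the classification rule is the usual linear one:

    \hat{y} = \mathrm{sign}\Big( \sum_{i} w_i x_i + c \Big)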
27. Example 2: Text Classification
- Dataset: Reuters-21578
- Classification accuracy
- Naïve Bayes: 77%
- Logistic regression: 88%
28. Why Does Logistic Regression Work Better for Text Classification?
- Common words
- Get small weights in logistic regression
- Get large weights in naïve Bayes, whose weights are determined by p(w|+) and p(w|-) (see the expansion after this list)
- Independence assumption
- Naïve Bayes assumes that each word is generated independently given the class
- Logistic regression is able to take the correlations between words into account
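One way to make the weight comparison concrete (a standard derivation, assuming a multinomial naïve Bayes over word counts x_w, which the slide does not spell out):

    \log \frac{p(+ \mid x)}{p(- \mid x)} = \log \frac{p(+)}{p(-)} + \sum_{w} x_w \log \frac{p(w \mid +)}{p(w \mid -)}

Each word's effective weight is thus log p(w|+)/p(w|-), fixed by the class-conditional frequencies rather than fit jointly to the decision boundary.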
29. Comparison
- Generative Model
- Models P(x|y)
- Models the input patterns
- Usually converges fast
- Cheap computation
- Robust to noisy data
- But: usually performs worse
- Discriminative Model
- Models P(y|x) directly
- Models the decision boundary
- Usually good performance
- But: slow convergence, expensive computation, sensitive to noisy data