Title: I256: Applied Natural Language Processing
1I256 Applied Natural Language Processing
Marti Hearst, October 18, 2006
(Many slides originally by Barbara Rosario, modified here)
2Today
- Algorithms for Classification
  - Binary classification
    - Perceptron
    - Winnow
    - Support Vector Machines (SVM)
    - Kernel Methods
  - Multi-Class classification
    - Decision Trees
    - Naïve Bayes
    - K nearest neighbor
3Binary Classification examples
- Spam filtering (spam, not spam)
- Customer service message classification (urgent vs. not urgent)
- Information retrieval (relevant, not relevant)
- Sentiment classification (positive, negative)
- Sometimes it is convenient to treat a multi-way problem as a set of binary ones: one class versus all the others, for each class
4Binary Classification
- Given some data items that belong to a positive (+1) or a negative (-1) class
- Task: train the classifier and predict the class for a new data item
- Geometrically: find a separator
5Linear versus Non Linear algorithms
- Linearly separable data: all the data points can be correctly classified by a linear (hyperplanar) decision boundary
6Linearly separable data
7Non linearly separable data
8Non linearly separable data
Non Linear Classifier
9Linear versus Non Linear algorithms
- Is the data linearly or non-linearly separable?
  - We can find out only empirically
- Linear algorithms (algorithms that find a linear decision boundary)
  - When we think the data is linearly separable
  - Advantages
    - Simpler, fewer parameters
  - Disadvantages
    - High-dimensional data (as in NLP) is usually not linearly separable
  - Examples: Perceptron, Winnow, SVM
  - Note: we can also use linear algorithms for non-linear problems (see Kernel methods)
10Linear versus Non Linear algorithms
- Non-linear
  - When the data is not linearly separable
  - Advantages
    - More accurate
  - Disadvantages
    - More complicated, more parameters
  - Example: Kernel methods
- Note: the distinction between linear and non-linear also applies to multi-class classification (we'll see this later)
11Simple linear algorithms
- Perceptron and Winnow algorithms
  - Linear
  - Binary classification
  - Online (process data sequentially, one data point at a time)
  - Mistake-driven
- Simple single-layer neural networks
12Linear binary classification
- Data: $\{(x_i, y_i)\}_{i=1,\ldots,n}$
  - $x \in \mathbb{R}^d$ ($x$ is a vector in $d$-dimensional space) → feature vector
  - $y \in \{-1, +1\}$ → label (class, category)
- Question
  - Design a linear decision boundary $w \cdot x + b = 0$ (equation of a hyperplane) such that the classification rule associated with it has minimal probability of error
- Classification rule
  - $y = \mathrm{sign}(w \cdot x + b)$, which means
    - if $w \cdot x + b > 0$ then $y = +1$
    - if $w \cdot x + b < 0$ then $y = -1$
13Linear binary classification
- Find a good hyperplane
  - $(w, b) \in \mathbb{R}^{d+1}$
  - that correctly classifies as many data points as possible
- In online fashion: one data point at a time, update weights as necessary
(Figure: the hyperplane $w \cdot x + b = 0$)
Classification rule: $y = \mathrm{sign}(w \cdot x + b)$
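The classification rule can be written in a few lines. This is a minimal sketch in plain Python; the weights `w`, the bias `b`, and the two test points are made-up values chosen only for illustration, and points exactly on the boundary are assigned to -1 by convention here.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))

def decision(x, w, b):
    """Return +1 if w.x + b > 0, otherwise -1."""
    return 1 if dot(w, x) + b > 0 else -1

# Hypothetical 2-D hyperplane: x1 + x2 - 1 = 0
w = [1.0, 1.0]
b = -1.0

above = decision([2.0, 2.0], w, b)   # 2 + 2 - 1 = 3 > 0
below = decision([0.0, 0.0], w, b)   # 0 + 0 - 1 = -1 < 0
print(above, below)                  # 1 -1
```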
14Perceptron algorithm
- Initialize $w_1 = 0$
- Updating rule: for each data point $x_i$
  - If class($x_i$) ≠ decision($x_i$, $w_k$)
    - then
      - $w_{k+1} \leftarrow w_k + y_i x_i$
      - $k \leftarrow k + 1$
    - else
      - $w_{k+1} \leftarrow w_k$
- Function decision($x$, $w$)
  - If $w \cdot x + b > 0$ return +1
  - Else return -1
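(Figure: weight vector $w_k$ and the hyperplane $w_k \cdot x + b = 0$, with the +1 region on one side and the -1 region on the other)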
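The update rule above could be sketched in Python as follows. Two details are assumptions that the slide does not spell out: the bias `b` is updated like a weight on a constant feature 1, and the data is passed over repeatedly until a full pass makes no mistakes. The toy AND-style data set is hypothetical.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def decision(x, w, b):
    return 1 if dot(w, x) + b > 0 else -1

def train_perceptron(data, epochs=100):
    """data: list of (x, y) pairs with y in {-1, +1}."""
    w = [0.0] * len(data[0][0])   # initialize weights to zero
    b = 0.0
    mistakes = 0
    for _ in range(epochs):
        errors_this_pass = 0
        for x, y in data:
            if decision(x, w, b) != y:
                # additive, mistake-driven update: w <- w + y * x
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y            # assumption: bias treated as an extra weight
                mistakes += 1
                errors_this_pass += 1
        if errors_this_pass == 0:  # a clean pass: every point is classified correctly
            break
    return w, b, mistakes

# Hypothetical, linearly separable toy data (logical AND with -1/+1 labels)
data = [([0.0, 0.0], -1), ([0.0, 1.0], -1), ([1.0, 0.0], -1), ([1.0, 1.0], 1)]
w, b, mistakes = train_perceptron(data)
print(all(decision(x, w, b) == y for x, y in data))  # True
```

Because the toy data is linearly separable, the loop stops after a finite number of mistakes; on non-separable data it would simply run for all `epochs` passes.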
15Perceptron algorithm
- Online: can adjust to a changing target over time
- Advantages
  - Simple and computationally efficient
  - Guaranteed to learn a linearly separable problem (convergence, global optimum)
- Limitations
  - Only linear separations
  - Only converges for linearly separable data
  - Not really efficient with many features
16Winnow algorithm
- Another online algorithm for learning perceptron weights
  - $f(x) = \mathrm{sign}(w \cdot x + b)$
- Linear, binary classification
- Update rule: again error-driven, but multiplicative (instead of additive)
17Winnow algorithm
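- Initialize $w_1$: zero for the Perceptron; for Winnow use positive weights (e.g. all ones), because a multiplicative update never moves a weight away from zero
- Updating rule: for each data point $x_i$
  - If class($x_i$) ≠ decision($x_i$, $w_k$)
    - then
      - $w_{k+1} \leftarrow w_k + y_i x_i$ (Perceptron)
      - $w_{k+1} \leftarrow w_k \cdot \exp(y_i x_i)$ (Winnow, applied component-wise)
      - $k \leftarrow k + 1$
    - else
      - $w_{k+1} \leftarrow w_k$
- Function decision($x$, $w$)
  - If $w \cdot x + b > 0$ return +1
  - Else return -1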
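(Figure: weight vector $w_k$ and the hyperplane $w_k \cdot x + b = 0$, with the +1 region on one side and the -1 region on the other)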
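A minimal sketch of the multiplicative update, under several assumptions that are not on the slide: features are binary (0/1), weights start at 1, the bias is a fixed threshold `b = -n/2` that is not learned, and a step size `eta = ln 2` is used so that each mistake doubles or halves the weights of the active features. The toy data (target: feature 0 OR feature 1, with four irrelevant features) is hypothetical.

```python
import math

def decision(x, w, b):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

def train_winnow(data, eta=math.log(2.0), epochs=50):
    """data: list of (x, y) pairs, x binary, y in {-1, +1}."""
    n = len(data[0][0])
    w = [1.0] * n      # positive start; zero weights would never change
    b = -n / 2.0       # assumption: fixed threshold
    for _ in range(epochs):
        errors = 0
        for x, y in data:
            if decision(x, w, b) != y:
                # multiplicative, mistake-driven update: w_j <- w_j * exp(eta * y * x_j)
                w = [wj * math.exp(eta * y * xj) for wj, xj in zip(w, x)]
                errors += 1
        if errors == 0:
            break
    return w, b

# Hypothetical data: label is +1 when feature 0 or feature 1 is on
data = [
    ([1, 0, 0, 0, 0, 0], 1),
    ([0, 1, 0, 0, 0, 0], 1),
    ([0, 0, 1, 1, 0, 0], -1),
    ([0, 0, 0, 0, 1, 1], -1),
    ([1, 0, 1, 0, 1, 0], 1),
    ([0, 0, 1, 0, 1, 1], -1),
]
w, b = train_winnow(data)
print(all(decision(x, w, b) == y for x, y in data))  # True
```

Only the weights of features that were active on a mistake are changed, so in this run the two relevant features end up with larger weights than the irrelevant ones.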
18Perceptron vs. Winnow
- Assume
  - $N$ available features
  - only $K$ relevant items, with $K \ll N$
- Perceptron: number of mistakes $O(KN)$
- Winnow: number of mistakes $O(K \log N)$
- Winnow is more robust to high-dimensional feature spaces
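To get a feel for the gap between the two bounds, the expressions can be evaluated for concrete numbers. This is only an illustration: big-O notation hides constant factors, the natural logarithm is an arbitrary choice, and the values of `K` and `N` are made up.

```python
import math

def perceptron_bound(K, N):
    """Growth term of the Perceptron mistake bound, K * N (constants ignored)."""
    return K * N

def winnow_bound(K, N):
    """Growth term of the Winnow mistake bound, K * log N (constants ignored)."""
    return K * math.log(N)

K, N = 10, 100_000                       # hypothetical: 10 relevant features out of 100,000
print(perceptron_bound(K, N))            # 1000000
print(round(winnow_bound(K, N), 1))      # 115.1
```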
19Perceptron vs. Winnow
- Perceptron
  - Online: can adjust to a changing target over time
  - Advantages
    - Simple and computationally efficient
    - Guaranteed to learn a linearly separable problem
  - Limitations
    - Only linear separations
    - Only converges for linearly separable data
    - Not really efficient with many features
- Winnow
  - Online: can adjust to a changing target over time
  - Advantages
    - Simple and computationally efficient
    - Guaranteed to learn a linearly separable problem
    - Suitable for problems with many irrelevant attributes
  - Limitations
    - Only linear separations
    - Only converges for linearly separable data
    - Not really efficient with many features
  - Used in NLP
20Large margin classifier
- Another family of linear algorithms
- Intuition (Vapnik, 1965)
  - If the classes are linearly separable
    - Separate the data
    - Place the hyperplane far from the data: large margin
    - Statistical results guarantee good generalization
(Figure: a separating hyperplane that passes close to the data, labeled BAD)
21Large margin classifier
- Intuition (Vapnik, 1965): if linearly separable
  - Separate the data
  - Place the hyperplane far from the data: large margin
  - Statistical results guarantee good generalization
(Figure: a separating hyperplane far from both classes, labeled GOOD)
→ Maximal Margin Classifier
22Large margin classifier
- If not linearly separable
- Allow some errors
- Still, try to place hyperplane far from each
class
23Large Margin Classifiers
- Advantages
- Theoretically better (better error bounds)
- Limitations
  - Computationally more expensive: training requires solving a large quadratic programming problem
24Support Vector Machine (SVM)
- Large Margin Classifier
- Linearly separable case
- Goal find the hyperplane that maximizes the
margin
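The quantity being maximized can be made concrete. The sketch below computes the geometric margin of a given hyperplane, i.e. the smallest signed distance $y_i (w \cdot x_i + b) / \lVert w \rVert$ over the training points, and compares two candidate separators. It does not train an SVM; the data and both hyperplanes are hypothetical.

```python
import math

def geometric_margin(data, w, b):
    """Smallest signed distance of any point to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in data
    )

# Hypothetical data: negatives at x1 = 0, positives at x1 = 3
data = [([0.0, 0.0], -1), ([0.0, 1.0], -1), ([3.0, 0.0], 1), ([3.0, 1.0], 1)]

close_w, close_b = [1.0, 0.0], -0.5     # vertical line x1 = 0.5, hugging the negatives
middle_w, middle_b = [1.0, 0.0], -1.5   # vertical line x1 = 1.5, halfway between classes

print(geometric_margin(data, close_w, close_b))    # 0.5
print(geometric_margin(data, middle_w, middle_b))  # 1.5
```

Both lines separate the data, but the second has three times the margin, which is the one a maximal margin classifier would prefer.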
25Support Vector Machine (SVM)
- Text classification
- Hand-writing recognition
- Computational biology (e.g., micro-array data)
- Face detection
- Face expression recognition
- Time series prediction
26Non Linear problem
27Non Linear problem
28Non Linear problem
- Kernel methods
  - A family of non-linear algorithms
  - Transform the non-linear problem into a linear one (in a different feature space)
  - Use linear algorithms to solve the linear problem in the new space
29Main intuition of Kernel methods
- (Copy here from black board)
30Basic principle kernel methods
31Basic principle kernel methods
- Linear separability is more likely in high dimensions
- Mapping: $\Phi$ maps the input into a high-dimensional feature space
- Classifier: construct a linear classifier in the high-dimensional feature space
- Motivation: an appropriate choice of $\Phi$ leads to linear separability
- We can do this efficiently!
32Basic principle kernel methods
- We can use the linear algorithms seen before
(Perceptron, SVM) for classification in the
higher dimensional space
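A small worked example of the idea, under made-up assumptions: one-dimensional points labeled +1 when they are far from the origin cannot be split by a single threshold, but after the hypothetical mapping `phi(x) = (x, x^2)` a linear rule separates them. The `kernel` function shows why this can be efficient: it returns the inner product in the mapped space without building the mapped vectors.

```python
def phi(x):
    """Hypothetical feature map from 1-D to 2-D: x -> (x, x^2)."""
    return (x, x * x)

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def decision(z, w, b):
    return 1 if dot(w, z) + b > 0 else -1

# Not linearly separable in 1-D: the positives lie on both sides of the negatives
data = [(-2.0, 1), (-1.0, -1), (1.0, -1), (2.0, 1)]

# In the mapped space the line x^2 = 2.5 separates the classes
w, b = (0.0, 1.0), -2.5
predictions = [decision(phi(x), w, b) for x, _ in data]
print(predictions)  # [1, -1, -1, 1]

def kernel(x, z):
    """Inner product phi(x).phi(z), computed directly from x and z."""
    return x * z + (x * x) * (z * z)

print(kernel(2.0, 3.0) == dot(phi(2.0), phi(3.0)))  # True
```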
33Multi-class classification
- Given some data items that belong to one of M possible classes
- Task: train the classifier and predict the class for a new data item
- Geometrically: a harder problem, no more simple geometry
34Multi-class classification
35Multi-class classification Examples
- Author identification
- Language identification
- Text categorization (topics)
36(Some) Algorithms for Multi-class classification
- Linear
  - Parallel class separators: Decision Trees
  - Non-parallel class separators: Naïve Bayes
- Non-linear
  - K-nearest neighbors
37Linear, parallel class separators (e.g., Decision Trees)
38Linear, NON parallel class separators (e.g., Naïve Bayes)
39Non Linear (e.g., k Nearest Neighbor)
40Decision Trees
- A decision tree is a classifier in the form of a tree structure, where each node is either
  - a leaf node, which indicates the value of the target attribute (class) of examples, or
  - a decision node, which specifies some test to be carried out on a single attribute value, with one branch and sub-tree for each possible outcome of the test.
- A decision tree can be used to classify an example by starting at the root of the tree and moving through it until a leaf node is reached, which provides the classification of the instance.
41Training Examples
Goal: learn when we can play tennis and when we cannot
42Decision Tree for PlayTennis
Outlook
  Sunny → Humidity
    High → No
    Normal → Yes
  Overcast → Yes
  Rain → Wind
    Strong → No
    Weak → Yes
43Decision Tree for PlayTennis
Outlook
  Sunny → Humidity
    High → No
    Normal → Yes
  Overcast
  Rain
44Decision Tree for PlayTennis
Outlook  Temperature  Humidity  Wind  PlayTennis
Sunny    Hot          High      Weak  ?
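Classifying this example amounts to walking from the root to a leaf. The sketch below assumes the usual PlayTennis tree (Outlook at the root, Humidity under Sunny, Wind under Rain); the nested-tuple representation and the `classify` helper are illustrative choices, not part of the slides.

```python
# Decision node: (attribute, {value: subtree}); leaf: a class label string.
tree = ("Outlook", {
    "Sunny": ("Humidity", {"High": "No", "Normal": "Yes"}),
    "Overcast": "Yes",
    "Rain": ("Wind", {"Strong": "No", "Weak": "Yes"}),
})

def classify(node, example):
    """Follow the branch matching the example's attribute value until a leaf is reached."""
    while not isinstance(node, str):
        attribute, branches = node
        node = branches[example[attribute]]
    return node

example = {"Outlook": "Sunny", "Temperature": "Hot", "Humidity": "High", "Wind": "Weak"}
print(classify(tree, example))  # No
```

Temperature is never tested on this path: the example goes Outlook = Sunny, then Humidity = High, and lands on the leaf "No".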
45Decision Tree for Reuter classification
46Decision Tree for Reuter classification
47Building Decision Trees
- Given training data, how do we construct them?
- The central focus of the decision tree growing algorithm is selecting which attribute to test at each node in the tree. The goal is to select the attribute that is most useful for classifying examples.
- Top-down, greedy search through the space of possible decision trees.
  - That is, it picks the best attribute and never looks back to reconsider earlier choices.
48Building Decision Trees
- Splitting criterion
  - Finding the features and the values to split on
    - For example, why test first on cts and not on vs?
    - Why test on cts < 2 and not cts < 5?
  - Choose the split that gives the maximum information gain (or the maximum reduction of uncertainty)
- Stopping criterion
  - When all the elements at one node have the same class, there is no need to split further
- In practice, one first builds a large tree and then prunes it back (to avoid overfitting)
- See Foundations of Statistical Natural Language Processing, Manning and Schütze, for a good introduction
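Information gain can be computed directly from label counts. This is a minimal sketch for categorical attributes, using the standard definition (entropy of the labels minus the weighted entropy after the split); the four-example data set is hypothetical and chosen so that one attribute is perfectly informative and the other is useless.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(examples, labels, attribute):
    """Reduction in label entropy obtained by splitting on `attribute`."""
    groups = {}
    for example, label in zip(examples, labels):
        groups.setdefault(example[attribute], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical toy data
examples = [
    {"Outlook": "Sunny", "Wind": "Weak"},
    {"Outlook": "Sunny", "Wind": "Strong"},
    {"Outlook": "Overcast", "Wind": "Weak"},
    {"Outlook": "Overcast", "Wind": "Strong"},
]
labels = ["No", "No", "Yes", "Yes"]

gain_outlook = information_gain(examples, labels, "Outlook")  # 1 bit: the split is pure
gain_wind = information_gain(examples, labels, "Wind")        # 0 bits: the split tells us nothing
best = max(["Outlook", "Wind"], key=lambda a: information_gain(examples, labels, a))
print(best)  # Outlook
```

A greedy tree builder would test Outlook first here, then recurse on each branch with the remaining attributes.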
49Decision Trees Strengths
- Decision trees are able to generate understandable rules.
- Decision trees perform classification without requiring much computation.
- Decision trees are able to handle both continuous and categorical variables.
- Decision trees provide a clear indication of which features are most important for prediction or classification.
50Decision Trees weaknesses
- Decision trees are prone to errors in classification problems with many classes and a relatively small number of training examples.
- Decision trees can be computationally expensive to train.
  - Need to compare all possible splits
  - Pruning is also expensive
- Most decision-tree algorithms only examine a single field at a time. This leads to rectangular classification boxes that may not correspond well with the actual distribution of records in the decision space.
51Decision Trees
52Naïve Bayes
More powerful than Decision Trees
53Naïve Bayes Models
- Graphical Models: graph theory plus probability theory
  - Nodes are variables
  - Edges are conditional probabilities
(Figure: node A with edges to nodes B and C; distributions $P(A)$, $P(B \mid A)$, $P(C \mid A)$)
54Naïve Bayes Models
- Graphical Models: graph theory plus probability theory
  - Nodes are variables
  - Edges are conditional probabilities
  - Absence of an edge between nodes implies independence between the variables of the nodes
(Figure: node A with edges to nodes B and C; distributions $P(A)$, $P(B \mid A)$, $P(C \mid A)$)
55Naïve Bayes for text classification
56Naïve Bayes for text classification
earn
Shr
per
57Naïve Bayes for text classification
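(Figure: a Topic node with edges to the word nodes $w_1, w_2, \ldots, w_n$)
- The words depend on the topic: $P(w_i \mid \text{Topic})$
  - $P(\text{cts} \mid \text{earn}) > P(\text{tennis} \mid \text{earn})$
- Naïve Bayes assumption: all words are independent given the topic
- From the training set we learn the probabilities $P(w_i \mid \text{Topic})$ for each word and for each topic in the training set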
58Naïve Bayes for text classification
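(Figure: a Topic node with edges to the word nodes $w_1, w_2, \ldots, w_n$)
- To classify a new example
  - Calculate $P(\text{Topic} \mid w_1, w_2, \ldots, w_n)$ for each topic
- Bayes decision rule
  - Choose the topic $T'$ for which
  - $P(T' \mid w_1, w_2, \ldots, w_n) > P(T \mid w_1, w_2, \ldots, w_n)$ for each $T \neq T'$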
59Naïve Bayes Math
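- Naïve Bayes defines a joint probability distribution
  - $P(\text{Topic}, w_1, w_2, \ldots, w_n) = P(\text{Topic}) \prod_{i=1}^{n} P(w_i \mid \text{Topic})$
- We learn $P(\text{Topic})$ and $P(w_i \mid \text{Topic})$ in training
- Test: we need $P(\text{Topic} \mid w_1, w_2, \ldots, w_n)$
  - $P(\text{Topic} \mid w_1, w_2, \ldots, w_n) = P(\text{Topic}, w_1, w_2, \ldots, w_n) / P(w_1, w_2, \ldots, w_n)$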
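The training and test steps can be sketched with simple counts. Since the denominator $P(w_1, \ldots, w_n)$ is the same for every topic, comparing the joint probabilities is enough to pick the best topic. Two things below are assumptions not taken from the slides: add-one (Laplace) smoothing, used so that unseen words do not produce zero probabilities, and summing log probabilities to avoid numerical underflow. The four-document corpus is hypothetical.

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (topic, list_of_words). Returns counts needed for P(Topic) and P(w | Topic)."""
    topic_counts = Counter(topic for topic, _ in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for topic, words in docs:
        word_counts[topic].update(words)
        vocab.update(words)
    return topic_counts, word_counts, vocab

def log_joint(topic, words, model):
    """log P(Topic) + sum_i log P(w_i | Topic), with add-one smoothing (an assumption)."""
    topic_counts, word_counts, vocab = model
    total_words = sum(word_counts[topic].values())
    score = math.log(topic_counts[topic] / sum(topic_counts.values()))
    for w in words:
        score += math.log((word_counts[topic][w] + 1) / (total_words + len(vocab)))
    return score

def classify(words, model):
    """Bayes decision rule: pick the topic with the highest joint probability."""
    return max(model[0], key=lambda topic: log_joint(topic, words, model))

# Hypothetical training corpus
docs = [
    ("earn", ["cts", "shr", "net"]),
    ("earn", ["cts", "per", "shr"]),
    ("sport", ["tennis", "match", "win"]),
    ("sport", ["tennis", "net", "win"]),
]
model = train(docs)
print(classify(["cts", "shr"], model))     # earn
print(classify(["tennis", "win"], model))  # sport
```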
60Naïve Bayes Strengths
- Very simple model
  - Easy to understand
  - Very easy to implement
- Very efficient, fast training and classification
- Modest storage requirements
- Widely used because it works really well for text categorization
- Linear, but non-parallel decision boundaries
61Naïve Bayes weaknesses
- The Naïve Bayes independence assumption has two consequences
  - The linear ordering of words is ignored (bag-of-words model)
  - The words are assumed independent of each other given the class, which is false
    - "President" is more likely to occur in a context that contains "election" than in a context that contains "poet"
- The Naïve Bayes assumption is inappropriate if there are strong conditional dependencies between the variables
- (But even if the model is not right, Naïve Bayes models do well in a surprisingly large number of cases, because often we are interested in classification accuracy and not in accurate probability estimates)
62k Nearest Neighbor Classification
- Nearest Neighbor classification rule: to classify a new object, find the object in the training set that is most similar, then assign the category of this nearest neighbor
- K Nearest Neighbor (KNN): consult the k nearest neighbors and decide based on the majority category of these neighbors. More robust than k = 1
- An example of a similarity measure often used in NLP is cosine similarity
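A minimal sketch of k nearest neighbor classification with cosine similarity. The vectors stand for hypothetical term counts over a three-word vocabulary, the labels are made up, and ties in the vote are broken arbitrarily by `Counter.most_common`.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def knn_classify(query, training, k=3):
    """Majority label among the k training items most similar to the query."""
    neighbors = sorted(training, key=lambda item: cosine(query, item[0]), reverse=True)[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical term-count vectors with labels
training = [
    ([3, 0, 1], "sports"),
    ([2, 0, 0], "sports"),
    ([0, 3, 1], "finance"),
    ([0, 2, 2], "finance"),
    ([1, 1, 0], "finance"),
]

print(knn_classify([2, 0, 1], training, k=3))  # sports
print(knn_classify([0, 1, 1], training, k=1))  # finance
```

Every query is compared against the whole training set, which is why the method is simple to state but expensive at classification time.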
631-Nearest Neighbor
641-Nearest Neighbor
653-Nearest Neighbor
663-Nearest Neighbor
Assign the category of the majority of the
neighbors
67k Nearest Neighbor Classification
- Strengths
  - Robust
  - Conceptually simple
  - Often works well
  - Powerful (arbitrary decision boundaries)
- Weaknesses
  - Performance is very dependent on the similarity measure used (and to a lesser extent on the number of neighbors k used)
  - Finding a good similarity measure can be difficult
  - Computationally expensive
68Summary
- Algorithms for Classification
  - Linear versus non-linear classification
  - Binary classification
    - Perceptron
    - Winnow
    - Support Vector Machines (SVM)
    - Kernel Methods
  - Multi-Class classification
    - Decision Trees
    - Naïve Bayes
    - K nearest neighbor
- On Wednesday: Weka