I256: Applied Natural Language Processing - PowerPoint PPT Presentation

About This Presentation

Title:

I256: Applied Natural Language Processing

Description:

I256: Applied Natural Language Processing. Marti Hearst, Nov 1, 2006 (most slides originally by Barbara Rosario, modified here). Today: Algorithms for Classification ... – PowerPoint PPT presentation

Slides: 66

Provided by: coursesIs1

Learn more at: https://courses.ischool.berkeley.edu

Transcript and Presenter's Notes

Title: I256: Applied Natural Language Processing

1
I256 Applied Natural Language Processing
Marti Hearst Nov 1, 2006 (Most slides originally
by Barbara Rosario, modified here)
2
Today

Algorithms for Classification
Binary classification
Perceptron
Winnow
Support Vector Machines (SVM)
Kernel Methods
Multi-Class classification
Decision Trees
Naïve Bayes
K nearest neighbor

3
Binary Classification examples

Spam filtering (spam, not spam)
Customer service message classification (urgent
vs. not urgent)
Sentiment classification (positive, negative)
Sometimes it is convenient to treat a multi-way
problem as a set of binary ones: one class
versus all the others, for each class

4
Binary Classification

Given some data items that belong to a positive
(+1) or a negative (-1) class
Task: train the classifier and predict the class
for a new data item
Geometrically: find a separator

5
Linear versus Non Linear algorithms

Linearly separable data: all the data points
can be correctly classified by a linear
(hyperplanar) decision boundary

6
Linearly separable data
7
Non linearly separable data
8
Non linearly separable data
Non Linear Classifier
9
Linear versus Non Linear algorithms

Linearly or non-linearly separable data?
We can find out only empirically
Linear algorithms (algorithms that find a linear
decision boundary)
When we think the data is linearly separable
Advantages
Simpler, fewer parameters
Disadvantages
High dimensional data (like for NLP) is usually
not linearly separable
Examples: Perceptron, Winnow, SVM
Note: we can use linear algorithms also for
non-linear problems (see Kernel methods)

10
Linear versus Non Linear algorithms

Non Linear
When the data is non linearly separable
Advantages
More accurate
Disadvantages
More complicated, more parameters
Example Kernel methods
Note: the distinction between linear and
non-linear also applies to multi-class
classification (we'll see this later)

11
Simple linear algorithms

Perceptron and Winnow algorithm
Binary classification
Online (process data sequentially, one data point
at a time)
Mistake-driven

12
Linear binary classification

Data: $(x_i, y_i)$, $i = 1, \dots, n$
$x \in \mathbb{R}^d$ (x is a vector in d-dimensional
space)
→ feature vector
$y \in \{-1, +1\}$
→ label (class, category)
Question:
Design a linear decision boundary $w \cdot x + b = 0$
(equation of a hyperplane) such that the
classification rule associated with it has
minimal probability of error
Classification rule:
$y = \mathrm{sign}(w \cdot x + b)$, which means
if $w \cdot x + b > 0$ then $y = +1$ (positive example)
if $w \cdot x + b < 0$ then $y = -1$ (negative example)

13
Linear binary classification

Find a good hyperplane
$(w, b) \in \mathbb{R}^{d+1}$
that correctly classifies as many data points
as possible
In online fashion: try one data point at a
time, update weights as necessary

$w \cdot x + b = 0$
Classification rule: $y = \mathrm{sign}(w \cdot x + b)$
14
Perceptron algorithm

Initialize $w_1 = 0$
Updating rule: for each data point $x_i$
If class($x_i$) != decision($x_i$, $w_k$)
then
$w_{k+1} \leftarrow w_k + y_i x_i$
$k \leftarrow k + 1$
else
$w_{k+1} \leftarrow w_k$
Function decision(x, w)
If $w \cdot x + b > 0$ return +1
Else return -1

[Figure: weight vector $w_k$ and the hyperplane $w_k \cdot x + b = 0$, with regions labeled +1, 0, -1]
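
The updating rule on this slide can be sketched in Python as follows. This is a minimal illustration rather than the course's reference implementation: the function names, the bias update `b += y`, the outer epoch loop, and the toy AND-style data are all assumptions added for the example.

```python
def decision(x, w, b):
    """Return +1 if w . x + b > 0, else -1 (the slide's decision function)."""
    activation = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if activation > 0 else -1


def train_perceptron(data, epochs=100):
    """data: list of (x, y) pairs, x a list of floats, y in {-1, +1}."""
    d = len(data[0][0])
    w = [0.0] * d          # initialize w_1 = 0
    b = 0.0                # assumption: bias learned like an extra weight
    for _ in range(epochs):
        mistakes = 0
        for x, y in data:
            if decision(x, w, b) != y:
                # mistake-driven, additive update: w_{k+1} = w_k + y_i * x_i
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:  # a full pass without errors: stop
            break
    return w, b


# Hypothetical, linearly separable toy data (logical AND).
toy = [([0.0, 0.0], -1), ([0.0, 1.0], -1), ([1.0, 0.0], -1), ([1.0, 1.0], 1)]
w, b = train_perceptron(toy)
print(w, b)
```

Because the toy data is linearly separable, the loop stops once a full pass makes no mistakes; on non-separable data it would simply run for the full number of epochs.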
15
Perceptron algorithm

Online: can adjust to a changing target over time
Advantages
Simple and computationally efficient
Guaranteed to learn a linearly separable problem
(convergence, global optimum)
Limitations
Only linear separations
Only converges for linearly separable data
Not really efficient with many features

16
Winnow algorithm

Another online algorithm for learning perceptron
weights
$f(x) = \mathrm{sign}(w \cdot x + b)$
Linear, binary classification
Update rule: again error-driven, but
multiplicative (instead of additive)

17
Winnow algorithm

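Initialize $w_1 = 0$ (Perceptron); for Winnow, start from positive weights (e.g., all ones), since a multiplicative update cannot move away from 0
Updating rule: for each data point $x_i$
If class($x_i$) != decision($x_i$, $w_k$)
then
$w_{k+1} \leftarrow w_k + y_i x_i$ (Perceptron)
$w_{k+1} \leftarrow w_k \cdot \exp(y_i x_i)$, element-wise (Winnow)
$k \leftarrow k + 1$
else
$w_{k+1} \leftarrow w_k$
Function decision(x, w)
If $w \cdot x + b > 0$ return +1
Else return -1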

[Figure: weight vector $w_k$ and the hyperplane $w_k \cdot x + b = 0$, with regions labeled +1, 0, -1]
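
A hedged Python sketch of the multiplicative update may help contrast it with the Perceptron. Several choices here are assumptions not stated on the slide: weights start at 1 (a multiplicative update cannot leave 0), the bias is replaced by a fixed threshold `theta` (that is, b = -theta), features are 0/1, and `eta` scales the exponent.

```python
import math


def winnow_decision(x, w, theta):
    """Return +1 if w . x > theta, else -1 (theta plays the role of -b)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > theta else -1


def train_winnow(data, theta, eta=1.0, epochs=50):
    """data: list of (x, y) pairs, x a list of 0/1 features, y in {-1, +1}."""
    d = len(data[0][0])
    w = [1.0] * d  # assumption: positive initial weights
    for _ in range(epochs):
        mistakes = 0
        for x, y in data:
            if winnow_decision(x, w, theta) != y:
                # multiplicative update: w_j <- w_j * exp(eta * y * x_j)
                # active features are promoted (y = +1) or demoted (y = -1)
                w = [wi * math.exp(eta * y * xi) for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:
            break
    return w


# Hypothetical toy data: the label is +1 exactly when feature 0 is on;
# features 1 and 2 are irrelevant.
toy = [
    ([1, 0, 0], 1), ([0, 1, 1], -1), ([1, 1, 0], 1),
    ([0, 0, 1], -1), ([0, 1, 0], -1), ([1, 0, 1], 1),
]
theta = 1.5
w = train_winnow(toy, theta)
print(w)
```

After training, the weight of the relevant feature has grown while the irrelevant ones have shrunk, which is the behavior that makes Winnow attractive when most features are irrelevant.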
18
Perceptron vs. Winnow

Assume
N available features
only K relevant items, with $K \ll N$
Perceptron: number of mistakes $O(KN)$
Winnow: number of mistakes $O(K \log N)$
Winnow is more robust to high-dimensional feature
spaces

19
Perceptron vs. Winnow

Perceptron
Online: can adjust to a changing target over time
Advantages
Simple and computationally efficient
Guaranteed to learn a linearly separable problem
Limitations
only linear separations
only converges for linearly separable data
not really efficient with many features

Winnow
Online: can adjust to a changing target over time
Advantages
Simple and computationally efficient
Guaranteed to learn a linearly separable problem
Suitable for problems with many irrelevant
attributes
Limitations
only linear separations
only converges for linearly separable data
not really efficient with many features
Used in NLP

20
Large margin classifier

Another family of linear algorithms
Intuition (Vapnik, 1965)
If the classes are linearly separable
Separate the data
Place the hyperplane far from the data: large
margin
Statistical results guarantee good generalization

BAD
21
Large margin classifier

Intuition (Vapnik, 1965): if linearly separable
Separate the data
Place the hyperplane far from the data: large
margin
Statistical results guarantee good generalization

GOOD
→ Maximal Margin Classifier
22
Large margin classifier

If not linearly separable
Allow some errors
Still, try to place hyperplane far from each
class

23
Large Margin Classifiers

Advantages
Theoretically better (better error bounds)
Limitations
Computationally more expensive: training requires
solving a large quadratic programming problem

24
Support Vector Machine (SVM)

Large Margin Classifier
Linearly separable case
Goal: find the hyperplane that maximizes the
margin
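
To make "margin" concrete, the sketch below computes the geometric margin of a given hyperplane, $\min_i y_i (w \cdot x_i + b) / \lVert w \rVert$, on a toy data set and compares two candidate separators. It does not solve the SVM optimization problem itself; the data points, the two hyperplanes, and the function name are assumptions chosen for illustration.

```python
import math


def geometric_margin(data, w, b):
    """Smallest signed distance from any point to the hyperplane w . x + b = 0.

    Positive only if every point is on the correct side.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in data
    )


# Hypothetical 2-D data: two positive and two negative points.
toy = [([2.0, 2.0], 1), ([3.0, 3.0], 1), ([0.0, 0.0], -1), ([1.0, 0.0], -1)]

# Two hyperplanes that both separate the data.
margin_a = geometric_margin(toy, [1.0, 1.0], -2.5)  # sits between the classes
margin_b = geometric_margin(toy, [1.0, 1.0], -3.5)  # hugs the positive class
print(margin_a, margin_b)
```

Both hyperplanes classify every point correctly, but the first leaves more room on each side; a maximal margin classifier would prefer it.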

25
Support Vector Machine (SVM) Applications

Text classification
Hand-writing recognition
Computational biology (e.g., micro-array data)
Face detection
Face expression recognition
Time series prediction

26
Non Linear problem
27
Non Linear problem
28
Non Linear problem

Kernel methods
A family of non-linear algorithms
Transform the non-linear problem into a linear one
(in a different feature space)
Use linear algorithms to solve the linear problem
in the new space

29
Basic principle kernel methods

$\Phi: \mathbb{R}^d \to \mathbb{R}^D$ ($D \gg d$)

Xx z
30
Basic principle kernel methods

Linear separability more likely in high
dimensions
Mapping: $\Phi$ maps input into a high-dimensional
feature space
Classifier: construct a linear classifier in the
high-dimensional feature space
Motivation: an appropriate choice of $\Phi$ leads to
linear separability
We can do this efficiently!
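
The following sketch illustrates both points under stated assumptions. It uses a degree-2 polynomial feature map `phi` (chosen here for illustration, not taken from the slides) to show that XOR-style data, which no line separates in 2-D, becomes linearly separable in the mapped space, and that the inner product in that space equals the kernel $(x \cdot z)^2$ computed directly in the input space.

```python
import math


def phi(x):
    """Explicit degree-2 feature map R^2 -> R^3 (an illustrative choice)."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2]


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def poly_kernel(x, z):
    """K(x, z) = (x . z)^2, computed without building phi(x) or phi(z)."""
    return dot(x, z) ** 2


# XOR-style data with +/-1 coordinates: not linearly separable in R^2.
xor_data = [([1, 1], 1), ([-1, -1], 1), ([1, -1], -1), ([-1, 1], -1)]

# In the mapped space a linear rule suffices: look only at the middle
# coordinate sqrt(2) * x1 * x2, i.e. w = (0, 1, 0), b = 0.
w, b = [0.0, 1.0, 0.0], 0.0
predictions = [1 if dot(w, phi(x)) + b > 0 else -1 for x, _ in xor_data]

# The kernel gives the same inner product as the explicit mapping.
x, z = [1.0, 2.0], [3.0, -1.0]
print(predictions, poly_kernel(x, z), dot(phi(x), phi(z)))
```

The efficiency claim rests on the second part: an algorithm that only needs inner products can call `poly_kernel` and never construct the higher-dimensional vectors.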

31
Basic principle kernel methods

We can use the linear algorithms seen before
(Perceptron, SVM) for classification in the
higher dimensional space
HOWEVER: according to Dan Klein, kernel methods
are too hard to understand and no one uses them
right!

32
MultiLayer Neural Networks

Also known as a multi-layer perceptron
Also known as artificial neural networks, to
distinguish from the biological ones
Many learning algorithms, but most popular is
backpropagation
The output values are compared with the correct
answer to compute the value of some predefined
error-function.
Propagate the errors back through the network
Adjust the weights to reduce the errors
Continue iterating some number of times.
Can be linear or nonlinear
Tends to work very well, but
Is very slow to run
Isn't great with huge feature sets (slow and
memory-intensive)

33
Multilayer Neural Network Applied to Sentence
Boundary Detection
Features in Descriptor Array
34
Multilayer Neural Networks

Backpropagation algorithm
Present a training sample to the neural network.
Compare the network's output to the desired
output from that sample. Calculate the error in
each output neuron.
For each neuron, calculate what the output should
have been, and a scaling factor, how much lower
or higher the output must be adjusted to match
the desired output. This is the local error.
Adjust the weights of each neuron to lower the
local error.
Assign "blame" for the local error to neurons at
the previous level, giving greater responsibility
to neurons connected by stronger weights.
Repeat the steps above on the neurons at the
previous level, using each one's "blame" as its
error.
For a detailed example, see
http://galaxy.agh.edu.pl/vlsi/AI/backp_t_en/backprop.html

35
Multi-class classification
36
Multi-class classification

Given some data items that belong to one of M
possible classes
Task: train the classifier and predict the class
for a new data item
Geometrically: harder problem, no more simple
geometry

37
Multi-class classification Examples

Author identification
Language identification
Text categorization (topics)

38
(Some) Algorithms for Multi-class classification

Linear
Decision trees, Naïve Bayes
Non Linear
K-nearest neighbors
Neural Networks

39
Linear class separators (e.g., Naïve Bayes)
40
Non Linear (e.g., k Nearest Neighbor)
41
Decision Trees

A decision tree is a classifier in the form of a
tree structure, where each node is either
Leaf node - indicates the value of the target
attribute (class) of examples, or
Decision node - specifies some test to be carried
out on a single attribute value, with one branch
and sub-tree for each possible outcome of the
test.
A decision tree can be used to classify an
example by starting at the root of the tree and
moving through it until reaching a leaf node,
which provides the classification of the instance.

42
Decision Tree Example
Goal: learn when we can play tennis and when we
cannot
43
Decision Tree for PlayTennis
Outlook
  Sunny: Humidity
    High: No
    Normal: Yes
  Overcast: Yes
  Rain: Wind
    Strong: No
    Weak: Yes
44
Decision Tree for PlayTennis
Outlook
  Sunny: Humidity
    High: No
    Normal: Yes
  Overcast
  Rain
45
Decision Tree for PlayTennis
Outlook  Temperature  Humidity  Wind  PlayTennis
Sunny    Hot          High      Weak  ?
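
Classifying the example above amounts to walking the tree from the root to a leaf. The sketch below assumes the standard PlayTennis tree (Sunny tests Humidity, Overcast is always Yes, Rain tests Wind); the nested-dictionary representation and the function name are illustrative choices, not part of the slides.

```python
# Assumed tree: internal nodes are dicts, leaves are class labels.
PLAY_TENNIS_TREE = {
    "attribute": "Outlook",
    "branches": {
        "Sunny": {
            "attribute": "Humidity",
            "branches": {"High": "No", "Normal": "Yes"},
        },
        "Overcast": "Yes",
        "Rain": {
            "attribute": "Wind",
            "branches": {"Strong": "No", "Weak": "Yes"},
        },
    },
}


def classify_with_tree(tree, example):
    """Follow the branch matching the example's value until a leaf is hit."""
    node = tree
    while isinstance(node, dict):
        value = example[node["attribute"]]
        node = node["branches"][value]
    return node


example = {"Outlook": "Sunny", "Temperature": "Hot",
           "Humidity": "High", "Wind": "Weak"}
answer = classify_with_tree(PLAY_TENNIS_TREE, example)
print(answer)
```

Note that Temperature and Wind are never consulted for this example: only the attributes on the path from the root to the leaf matter.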
46
Decision Tree for Reuters classification
47
Decision Tree for Reuters classification
48
Building Decision Trees

Given training data, how do we construct them?
The central focus of the decision tree growing
algorithm is selecting which attribute to test at
each node in the tree. The goal is to select the
attribute that is most useful for classifying
examples.
Top-down, greedy search through the space of
possible decision trees.
That is, it picks the best attribute and never
looks back to reconsider earlier choices.

49
Building Decision Trees

Splitting criterion
Finding the features and the values to split on
For example, why test first on cts and not on vs?
Why test on cts < 2 and not cts < 5?
Choose the split that gives the maximum
information gain (or the maximum reduction of
uncertainty)
Stopping criterion
When all the elements at one node have the same
class, no need to split further
In practice, one first builds a large tree and
then prunes it back (to avoid overfitting)
See Foundations of Statistical Natural Language
Processing, Manning and Schütze, for a good
introduction
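
As a rough illustration of the splitting criterion, the sketch below computes entropy and information gain for categorical attributes and picks the attribute with the largest gain. The attribute names `a` and `b`, the toy rows, and the function names are hypothetical; numeric thresholds such as cts < 2 would first have to be turned into yes/no attributes.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Reduction in entropy obtained by splitting on `attribute`."""
    n = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    # Expected entropy after the split, weighted by group size.
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical data: attribute "a" predicts the label perfectly,
# attribute "b" carries no information about it.
rows = [{"a": 0, "b": 0}, {"a": 0, "b": 1}, {"a": 1, "b": 0}, {"a": 1, "b": 1}]
labels = ["no", "no", "yes", "yes"]

gain_a = information_gain(rows, labels, "a")
gain_b = information_gain(rows, labels, "b")
best = max(["a", "b"], key=lambda attr: information_gain(rows, labels, attr))
print(gain_a, gain_b, best)
```

A greedy tree builder would apply this choice at the root, partition the data by the chosen attribute, and recurse on each partition until the stopping criterion holds.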

50
Decision Trees Strengths

Decision trees are able to generate
understandable rules.
Decision trees perform classification without
requiring much computation.
Decision trees are able to handle both continuous
and categorical variables.
Decision trees provide a clear indication of
which features are most important for prediction
or classification.

51
Decision Trees Weaknesses

Decision trees are prone to errors in
classification problems with many classes and a
relatively small number of training examples.
Decision trees can be computationally expensive
to train.
Need to compare all possible splits
Pruning is also expensive
Most decision-tree algorithms only examine a
single field at a time. This leads to rectangular
classification boxes that may not correspond well
with the actual distribution of records in the
decision space.

52
Naïve Bayes Models

Graphical Models graph theory plus probability
theory
Nodes are variables
Edges are conditional probabilities

A
$P(A)$, $P(B \mid A)$, $P(C \mid A)$
53
Naïve Bayes Models

Graphical Models graph theory plus probability
theory
Nodes are variables
Edges are conditional probabilities
Absence of an edge between nodes implies
independence between the variables of the nodes

A
$P(A)$, $P(B \mid A)$, $P(C \mid A)$
54
Naïve Bayes for text classification
55
Naïve Bayes for text classification
earn
Shr
per
56
Naïve Bayes for text classification
Topic
w1
w3
wn-1

The words depend on the topic: $P(w_i \mid \text{Topic})$
$P(\text{cts} \mid \text{earn}) > P(\text{tennis} \mid \text{earn})$
Naïve Bayes assumption: all words are independent
given the topic
From the training set we learn the probabilities
$P(w_i \mid \text{Topic})$ for each word and for each
topic in the training set

57
Naïve Bayes for text classification
Topic
w1
w3
wn-1

To classify a new example:
Calculate $P(\text{Topic} \mid w_1, w_2, \dots, w_n)$ for each topic
Bayes decision rule:
Choose the topic $T'$ for which
$P(T' \mid w_1, w_2, \dots, w_n) > P(T \mid w_1, w_2, \dots, w_n)$
for each $T \neq T'$

58
Naïve Bayes Math

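Naïve Bayes: define a joint probability
distribution
$P(\text{Topic}, w_1, w_2, \dots, w_n) = P(\text{Topic}) \prod_{i=1}^{n} P(w_i \mid \text{Topic})$
We learn $P(\text{Topic})$ and $P(w_i \mid \text{Topic})$ in training
Test: we need $P(\text{Topic} \mid w_1, w_2, \dots, w_n)$
$P(\text{Topic} \mid w_1, w_2, \dots, w_n) = P(\text{Topic}, w_1, w_2, \dots, w_n) / P(w_1, w_2, \dots, w_n)$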
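
A minimal Python sketch of training and classification under this model follows. Because the denominator $P(w_1, \dots, w_n)$ is the same for every topic, the code only compares the joint probabilities, in log space to avoid underflow. Add-one smoothing, skipping words never seen in training, the function names, and the three toy documents are assumptions not found on the slides.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs):
    """docs: list of (topic, list_of_words). Returns the counts needed later."""
    topic_counts = Counter(topic for topic, _ in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for topic, words in docs:
        word_counts[topic].update(words)
        vocab.update(words)
    return topic_counts, word_counts, vocab


def classify_nb(model, words):
    """Return the topic maximizing log P(Topic) + sum_i log P(w_i | Topic)."""
    topic_counts, word_counts, vocab = model
    n_docs = sum(topic_counts.values())
    best_topic, best_score = None, -math.inf
    for topic, count in topic_counts.items():
        total = sum(word_counts[topic].values())
        score = math.log(count / n_docs)          # log P(Topic)
        for w in words:
            if w in vocab:                        # assumption: ignore unseen words
                # add-one smoothed estimate of P(w | Topic)
                p = (word_counts[topic][w] + 1) / (total + len(vocab))
                score += math.log(p)
        if score > best_score:
            best_topic, best_score = topic, score
    return best_topic


# Hypothetical training documents.
docs = [
    ("earn", ["cts", "shr", "net"]),
    ("earn", ["cts", "per", "shr"]),
    ("sport", ["tennis", "match", "net"]),
]
model = train_naive_bayes(docs)
print(classify_nb(model, ["cts", "shr"]), classify_nb(model, ["tennis", "match"]))
```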

59
Naïve Bayes Strengths

Very simple model
Easy to understand
Very easy to implement
Very efficient, fast training and classification
Modest space storage
Widely used because it works somewhat well for
text categorization
Linear, but non parallel, decision boundaries

60
Naïve Bayes Weaknesses

Naïve Bayes independence assumption
Ignores the sequential ordering of words (uses
bag of words model)
Naïve Bayes assumption is inappropriate if there
are strong conditional dependencies between the
variables
But even if the model is not right, Naïve Bayes
models do well in a surprisingly large number of
cases because often we are interested in
classification accuracy and not in accurate
probability estimations

61
Multinomial Naïve Bayes

(Based on a paper by McCallum & Nigam '98)
Features include the number of times words occur
in the document, not binary (present/absent)
indicators
Uses a statistical formula known as the
multinomial distribution.
Authors compared, on several text classification
tasks
Multinomial naïve bayes
Binary-featured multi-variate Bernoulli-distributed
Results
Multinomial much better when using large
vocabularies.
However, they note that Bernoulli can handle
other features (e.g., from-title) as numbers,
whereas this will confuse the multinomial
version.

Andrew McCallum and Kamal Nigam. A Comparison of
Event Models for Naive Bayes Text Classification.
In AAAI/ICML-98 Workshop on Learning for Text
Categorization.
62
k Nearest Neighbor Classification

Nearest Neighbor classification rule: to classify
a new object, find the object in the training set
that is most similar. Then assign the category of
this nearest neighbor
K Nearest Neighbor (KNN): consult the k nearest
neighbors. Decision based on the majority
category of these neighbors. More robust than
k = 1
An example of a similarity measure often used in
NLP is cosine similarity
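
The rule above can be sketched in a few lines of Python. The vectors below stand for hypothetical word-count features, and the function names and labels are illustrative assumptions; ties in the vote are broken arbitrarily by `Counter.most_common`.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_classify(training, query, k=3):
    """Majority category among the k training items most similar to query."""
    ranked = sorted(
        training,
        key=lambda item: cosine_similarity(item[0], query),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training set: (feature vector, category).
training = [
    ([3, 0, 1], "earn"), ([2, 0, 0], "earn"),
    ([0, 2, 1], "sport"), ([0, 3, 0], "sport"), ([1, 1, 0], "sport"),
]
print(knn_classify(training, [2, 0, 1], k=3))
print(knn_classify(training, [0, 1, 0], k=3))
```

Every query is compared against the whole training set, which is why the method becomes computationally expensive as the training set grows.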

63
1-Nearest Neighbor
64
1-Nearest Neighbor
65
3-Nearest Neighbor
66
3-Nearest Neighbor
Assign the category of the majority of the
neighbors
67
k Nearest Neighbor Classification

Strengths
Robust
Conceptually simple
Often works well
Powerful (arbitrary decision boundaries)
Weaknesses
Performance is very dependent on the similarity
measure used (and to a lesser extent on the
number of neighbors k used)
Finding a good similarity measure can be
difficult
Computationally expensive

68
Summary