Transcript and Presenter's Notes

Title: I256: Applied Natural Language Processing


1
I256: Applied Natural Language Processing
Marti Hearst, Nov 1, 2006 (Most slides originally
by Barbara Rosario, modified here)
2
Today
  • Algorithms for Classification
  • Binary classification
  • Perceptron
  • Winnow
  • Support Vector Machines (SVM)
  • Kernel Methods
  • Multi-Class classification
  • Decision Trees
  • Naïve Bayes
  • K nearest neighbor

3
Binary Classification examples
  • Spam filtering (spam, not spam)
  • Customer service message classification (urgent
    vs. not urgent)
  • Sentiment classification (positive, negative)
  • Sometimes it can be convenient to treat a
    multi-way problem as a binary one: one class
    versus all the others, for each class (as
    sketched below)
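A minimal sketch of this one-versus-all reduction, assuming scikit-learn and NumPy are available; the toy data, labels, and the choice of Perceptron as the base binary classifier are illustrative, not from the slides.

    import numpy as np
    from sklearn.linear_model import Perceptron

    # Toy 3-class data; each row is a feature vector, labels are 0, 1, 2.
    X = np.array([[0.1, 1.0], [0.2, 0.9], [1.0, 0.1],
                  [0.9, 0.2], [1.0, 1.0], [0.9, 1.1]])
    y = np.array([0, 0, 1, 1, 2, 2])

    # One binary classifier per class: class c versus all the others.
    scores = []
    for c in sorted(set(y)):
        clf = Perceptron().fit(X, (y == c).astype(int))
        scores.append(clf.decision_function(X))     # signed distance to the boundary
    pred = np.argmax(np.vstack(scores), axis=0)      # most confident class wins
    print(pred)                                      # should typically recover y on this toy data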

4
Binary Classification
  • Given some data items that belong to a positive
    (+1) or a negative (-1) class
  • Task: train the classifier and predict the class
    for a new data item
  • Geometrically: find a separator

5
Linear versus Non Linear algorithms
  • Linearly separable data: all the data points
    can be correctly classified by a linear
    (hyperplane) decision boundary

6
Linearly separable data
7
Non linearly separable data
8
Non linearly separable data
Non Linear Classifier
9
Linear versus Non Linear algorithms
  • Linearly or non-linearly separable data?
  • We can find out only empirically
  • Linear algorithms (algorithms that find a linear
    decision boundary)
  • When we think the data is linearly separable
  • Advantages
  • Simpler, fewer parameters
  • Disadvantages
  • High-dimensional data (as in NLP) is usually
    not linearly separable
  • Examples: Perceptron, Winnow, SVM
  • Note: we can also use linear algorithms for
    non-linear problems (see kernel methods)

10
Linear versus Non Linear algorithms
  • Non Linear
  • When the data is not linearly separable
  • Advantages
  • More accurate
  • Disadvantages
  • More complicated, more parameters
  • Example: kernel methods
  • Note: the distinction between linear and non
    linear also applies to multi-class
    classification (we'll see this later)

11
Simple linear algorithms
  • Perceptron and Winnow algorithms
  • Binary classification
  • Online (process data sequentially, one data point
    at a time)
  • Mistake-driven

12
Linear binary classification
  • Data: {(x_i, y_i)}, i = 1 ... n
  • x_i ∈ R^d (x is a vector in d-dimensional
    space)
  • → feature vector
  • y_i ∈ {-1, +1}
  • → label (class, category)
  • Question:
  • Design a linear decision boundary w·x + b = 0
    (the equation of a hyperplane) such that the
    classification rule associated with it has
    minimal probability of error
  • Classification rule:
  • y = sign(w·x + b), which means
  • if w·x + b > 0 then y = +1 (positive example)
  • if w·x + b < 0 then y = -1 (negative example)
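A tiny NumPy illustration of this classification rule; the weight vector, bias, and input below are arbitrary values chosen for the example.

    import numpy as np

    w = np.array([0.4, -1.2, 0.7])   # hypothetical weight vector in R^3
    b = 0.1                          # hypothetical bias

    def decision(x, w, b):
        """Linear classification rule: y = sign(w.x + b)."""
        return 1 if np.dot(w, x) + b > 0 else -1

    print(decision(np.array([1.0, 0.2, 0.5]), w, b))   # prints 1 or -1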

13
Linear binary classification
  • Find a good hyperplane
  • (w, b) ∈ R^{d+1}
  • that correctly classifies data points as much
    as possible
  • In online fashion: try one data point at a
    time, update the weights as necessary

w·x + b = 0
Classification rule: y = sign(w·x + b)
14
Perceptron algorithm
  • Initialize w_1 = 0
  • Updating rule: for each data point x_i
  • If class(x_i) ≠ decision(x_i, w_k)
  • then
  • w_{k+1} ← w_k + y_i x_i
  • k ← k + 1
  • else
  • w_{k+1} ← w_k
  • Function decision(x, w)
  • If w·x + b > 0 return +1
  • Else return -1

(Figure: weight vector w_k and separating hyperplane
w_k·x + b = 0, with the regions labeled +1 and -1)
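A runnable NumPy sketch of the perceptron update above; as an assumption of this sketch, the bias b is folded into the weight vector through a constant feature, and the toy data is invented.

    import numpy as np

    def perceptron_train(X, y, epochs=10):
        X = np.hstack([X, np.ones((len(X), 1))])   # fold the bias b into w
        w = np.zeros(X.shape[1])                   # initialize w_1 = 0
        for _ in range(epochs):
            for xi, yi in zip(X, y):
                pred = 1 if w @ xi > 0 else -1     # decision(x, w)
                if pred != yi:                     # mistake-driven update
                    w += yi * xi                   # w_{k+1} <- w_k + y_i x_i
        return w

    def perceptron_predict(X, w):
        X = np.hstack([X, np.ones((len(X), 1))])
        return np.where(X @ w > 0, 1, -1)

    X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -1.0]])
    y = np.array([1, 1, -1, -1])
    w = perceptron_train(X, y)
    print(perceptron_predict(X, w))                # expected: [ 1  1 -1 -1]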
15
Perceptron algorithm
  • Online: can adjust to a changing target over time
  • Advantages
  • Simple and computationally efficient
  • Guaranteed to learn a linearly separable problem
    (convergence, global optimum)
  • Limitations
  • Only linear separations
  • Only converges for linearly separable data
  • Not really efficient with many features

16
Winnow algorithm
  • Another online algorithm for learning perceptron
    weights
  • f(x) = sign(w·x + b)
  • Linear, binary classification
  • Update rule: again error-driven, but
    multiplicative (instead of additive)

17
Winnow algorithm
  • Initialize w_1 = 0
  • Updating rule: for each data point x_i
  • If class(x_i) ≠ decision(x_i, w_k)
  • then
  • w_{k+1} ← w_k + y_i x_i        (Perceptron)
  • w_{k+1} ← w_k · exp(y_i x_i)   (Winnow)
  • k ← k + 1
  • else
  • w_{k+1} ← w_k
  • Function decision(x, w)
  • If w·x + b > 0 return +1
  • Else return -1

(Figure: weight vector w_k and separating hyperplane
w_k·x + b = 0, with the regions labeled +1 and -1)
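A sketch of a Winnow-style multiplicative update, assuming NumPy; note that, unlike the template above, the weights start at 1 (a multiplicative update cannot move weights away from 0), and the learning rate, threshold, and toy data are arbitrary choices for this example.

    import numpy as np

    def winnow_train(X, y, eta=0.5, epochs=10):
        n_features = X.shape[1]
        w = np.ones(n_features)            # weights start at 1 for a multiplicative update
        theta = n_features / 2.0           # fixed decision threshold (an assumption here)
        for _ in range(epochs):
            for xi, yi in zip(X, y):
                pred = 1 if w @ xi > theta else -1
                if pred != yi:             # mistake-driven, multiplicative update
                    w *= np.exp(eta * yi * xi)
        return w, theta

    def winnow_predict(X, w, theta):
        return np.where(X @ w > theta, 1, -1)

    # Binary features; only the first feature is actually relevant.
    X = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 0, 1]], dtype=float)
    y = np.array([1, 1, -1, -1])
    w, theta = winnow_train(X, y)
    print(winnow_predict(X, w, theta))     # expected: [ 1  1 -1 -1]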
18
Perceptron vs. Winnow
  • Assume
  • N available features
  • only K relevant items, with K << N
  • Perceptron: number of mistakes is O(K·N)
  • Winnow: number of mistakes is O(K log N)
  • Winnow is more robust to high-dimensional feature
    spaces

19
Perceptron vs. Winnow
  • Perceptron
  • Online: can adjust to a changing target over time
  • Advantages
  • Simple and computationally efficient
  • Guaranteed to learn a linearly separable problem
  • Limitations
  • only linear separations
  • only converges for linearly separable data
  • not really efficient with many features
  • Winnow
  • Online: can adjust to a changing target over time
  • Advantages
  • Simple and computationally efficient
  • Guaranteed to learn a linearly separable problem
  • Suitable for problems with many irrelevant
    attributes
  • Limitations
  • only linear separations
  • only converges for linearly separable data
  • not really efficient with many features
  • Used in NLP

20
Large margin classifier
  • Another family of linear algorithms
  • Intuition (Vapnik, 1965)
  • If the classes are linearly separable
  • Separate the data
  • Place the hyperplane far from the data: large
    margin
  • Statistical results guarantee good generalization

BAD
21
Large margin classifier
  • Intuition (Vapnik, 1965): if linearly separable
  • Separate the data
  • Place the hyperplane far from the data: large
    margin
  • Statistical results guarantee good generalization

GOOD
→ Maximal Margin Classifier
22
Large margin classifier
  • If not linearly separable
  • Allow some errors
  • Still, try to place hyperplane far from each
    class

23
Large Margin Classifiers
  • Advantages
  • Theoretically better (better error bounds)
  • Limitations
  • Computationally more expensive: requires solving
    a large quadratic programming problem

24
Support Vector Machine (SVM)
  • Large Margin Classifier
  • Linearly separable case
  • Goal: find the hyperplane that maximizes the
    margin
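A minimal scikit-learn sketch of a linear-kernel SVM, assuming scikit-learn is installed; the toy data and the value of C are invented for illustration.

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -2.0], [-1.5, -2.5]])
    y = np.array([1, 1, -1, -1])

    clf = SVC(kernel="linear", C=1.0)      # C trades margin size against training errors
    clf.fit(X, y)
    print(clf.support_vectors_)            # the points that determine the margin
    print(clf.predict([[1.0, 1.0], [-1.0, -1.0]]))   # expected: [ 1 -1]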

25
Support Vector Machine (SVM) Applications
  • Text classification
  • Hand-writing recognition
  • Computational biology (e.g., micro-array data)
  • Face detection
  • Face expression recognition
  • Time series prediction

26
Non Linear problem
27
Non Linear problem
28
Non Linear problem
  • Kernel methods
  • A family of non-linear algorithms
  • Transform the non-linear problem into a linear
    one (in a different feature space)
  • Use linear algorithms to solve the linear problem
    in the new space

29
Basic principle: kernel methods
  • Φ : R^d → R^D   (D >> d)

(Figure: the mapping Φ sends the input points into a
higher-dimensional space where a linear separator exists)
30
Basic principle: kernel methods
  • Linear separability is more likely in high
    dimensions
  • Mapping: Φ maps the input into a high-dimensional
    feature space
  • Classifier: construct a linear classifier in the
    high-dimensional feature space
  • Motivation: an appropriate choice of Φ leads to
    linear separability
  • We can do this efficiently!
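A sketch of this principle, assuming scikit-learn and NumPy: points inside versus outside a circle are not linearly separable in R^2, but an explicit quadratic map Φ makes them separable, and a polynomial kernel gets the same effect without computing Φ explicitly. The map, kernel degree, and random data are choices made for this example.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, size=(200, 2))
    y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 0.5, 1, -1)   # class: inside the circle?

    def phi(X):
        """Quadratic feature map R^2 -> R^3: (x1, x2) -> (x1^2, sqrt(2)*x1*x2, x2^2)."""
        return np.column_stack([X[:, 0] ** 2,
                                np.sqrt(2) * X[:, 0] * X[:, 1],
                                X[:, 1] ** 2])

    linear_on_phi = SVC(kernel="linear").fit(phi(X), y)    # linear classifier after Phi
    kernelized = SVC(kernel="poly", degree=2).fit(X, y)    # same idea via the kernel trick
    print(linear_on_phi.score(phi(X), y), kernelized.score(X, y))   # both close to 1.0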

31
Basic principle: kernel methods
  • We can use the linear algorithms seen before
    (Perceptron, SVM) for classification in the
    higher-dimensional space
  • HOWEVER: according to Dan Klein, kernel methods
    are too hard to understand and no one uses them
    right!

32
MultiLayer Neural Networks
  • Also known as a multi-layer perceptron
  • Also known as artificial neural networks, to
    distinguish from the biological ones
  • Many learning algorithms, but most popular is
    backpropagation
  • The output values are compared with the correct
    answer to compute the value of some predefined
    error-function.
  • Propagate the errors back through the network
  • Adjust the weights to reduce the errors
  • Continue iterating some number of times.
  • Can be linear or nonlinear
  • Tends to work very well, but
  • Is very slow to run
  • Isn't great with huge feature sets (slow and
    memory-intensive)

33
Multilayer Neural Network Applied to Sentence
Boundary Detection
Features in Descriptor Array
34
Multilayer Neural Networks
  • Backpropagation algorithm
  • Present a training sample to the neural network.
  • Compare the network's output to the desired
    output from that sample. Calculate the error in
    each output neuron.
  • For each neuron, calculate what the output should
    have been, and a scaling factor, how much lower
    or higher the output must be adjusted to match
    the desired output. This is the local error.
  • Adjust the weights of each neuron to lower the
    local error.
  • Assign "blame" for the local error to neurons at
    the previous level, giving greater responsibility
    to neurons connected by stronger weights.
  • Repeat the steps above on the neurons at the
    previous level, using each one's "blame" as its
    error.
  • For a detailed example, see
  • http://galaxy.agh.edu.pl/vlsi/AI/backp_t_en/backprop.html
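A compact backpropagation sketch in NumPy, following the steps above on a one-hidden-layer network with sigmoid units and squared error; the XOR data, layer sizes, learning rate, and iteration count are arbitrary choices for this example.

    import numpy as np

    rng = np.random.default_rng(0)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    t = np.array([[0], [1], [1], [0]], dtype=float)           # XOR targets

    W1 = rng.normal(size=(2, 8)); b1 = np.zeros(8)            # input -> hidden
    W2 = rng.normal(size=(8, 1)); b2 = np.zeros(1)            # hidden -> output
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    lr = 1.0
    for _ in range(10000):
        h = sigmoid(X @ W1 + b1)                              # forward pass
        y = sigmoid(h @ W2 + b2)
        d_out = (y - t) * y * (1 - y)                         # error at the output layer
        d_hid = (d_out @ W2.T) * h * (1 - h)                  # "blame" assigned to hidden units
        W2 -= lr * h.T @ d_out;  b2 -= lr * d_out.sum(axis=0) # adjust weights to reduce error
        W1 -= lr * X.T @ d_hid;  b1 -= lr * d_hid.sum(axis=0)

    print(np.round(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2), 2))   # should approach 0, 1, 1, 0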

35
Multi-class classification
36
Multi-class classification
  • Given some data items that belong to one of M
    possible classes
  • Task: train the classifier and predict the class
    for a new data item
  • Geometrically: a harder problem, no more simple
    geometry

37
Multi-class classification Examples
  • Author identification
  • Language identification
  • Text categorization (topics)

38
(Some) Algorithms for Multi-class classification
  • Linear
  • Decision trees, Naïve Bayes
  • Non Linear
  • K-nearest neighbors
  • Neural Networks

39
Linear class separators (e.g., Naïve Bayes)
40
Non Linear (e.g., k Nearest Neighbor)
41
Decision Trees
  • A decision tree is a classifier in the form of a
    tree structure, where each node is either
  • a leaf node, which indicates the value of the
    target attribute (class) of examples, or
  • a decision node, which specifies some test to be
    carried out on a single attribute value, with one
    branch and sub-tree for each possible outcome of
    the test.
  • A decision tree can be used to classify an
    example by starting at the root of the tree and
    moving through it until a leaf node is reached,
    which provides the classification of the instance.
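A minimal scikit-learn decision tree sketch; the handful of PlayTennis-style rows and the integer encoding below are illustrative, not the full dataset shown on the following slides.

    from sklearn.tree import DecisionTreeClassifier, export_text

    # Encoding (an assumption of this sketch):
    # Outlook: 0=Sunny, 1=Overcast, 2=Rain; Humidity: 0=Normal, 1=High; Wind: 0=Weak, 1=Strong
    X = [[0, 1, 0], [0, 1, 1], [1, 1, 0], [2, 1, 0],
         [2, 0, 0], [2, 0, 1], [1, 0, 1], [0, 0, 0]]
    y = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes", "Yes"]

    tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)
    print(export_text(tree, feature_names=["Outlook", "Humidity", "Wind"]))
    print(tree.predict([[0, 1, 0]]))   # Sunny, High humidity, Weak wind -> likely "No"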

42
Decision Tree Example
Goal: learn when we can play tennis and when we
cannot
43
Decision Tree for PlayTennis
Outlook?
  Sunny    → Humidity?  (High → No,  Normal → Yes)
  Overcast → Yes
  Rain     → Wind?      (Strong → No,  Weak → Yes)
44
Decision Tree for PlayTennis
Outlook?
  Sunny → Humidity?  (High → No,  Normal → Yes)
45
Decision Tree for PlayTennis
Outlook   Temperature   Humidity   Wind   PlayTennis
Sunny     Hot           High       Weak   ?
46
Decision Tree for Reuter classification
47
Decision Tree for Reuter classification
48
Building Decision Trees
  • Given training data, how do we construct them?
  • The central focus of the decision tree growing
    algorithm is selecting which attribute to test at
    each node in the tree. The goal is to select the
    attribute that is most useful for classifying
    examples.
  • Top-down, greedy search through the space of
    possible decision trees.
  • That is, it picks the best attribute and never
    looks back to reconsider earlier choices.

49
Building Decision Trees
  • Splitting criterion
  • Finding the features and the values to split on
  • for example, why test first on cts and not on vs?
  • Why test on cts < 2 and not on cts < 5?
  • Split that gives us the maximum information gain
    (or the maximum reduction of uncertainty); see
    the sketch below
  • Stopping criterion
  • When all the elements at one node have the same
    class, there is no need to split further
  • In practice, one first builds a large tree and
    then prunes it back (to avoid overfitting)
  • See Foundations of Statistical Natural Language
    Processing, Manning and Schütze, for a good
    introduction
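A small sketch of the splitting criterion: information gain is the entropy of the parent node minus the weighted entropy of the children. The counts below correspond to the classic 14-example PlayTennis data split on Outlook.

    import math
    from collections import Counter

    def entropy(labels):
        counts = Counter(labels)
        total = len(labels)
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    def information_gain(parent, children):
        total = len(parent)
        weighted = sum(len(ch) / total * entropy(ch) for ch in children)
        return entropy(parent) - weighted

    parent = ["Yes"] * 9 + ["No"] * 5                       # the 14 PlayTennis examples
    split_on_outlook = [["Yes"] * 2 + ["No"] * 3,           # Sunny
                        ["Yes"] * 4,                         # Overcast
                        ["Yes"] * 3 + ["No"] * 2]            # Rain
    print(round(information_gain(parent, split_on_outlook), 3))   # about 0.247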

50
Decision Trees Strengths
  • Decision trees are able to generate
    understandable rules.
  • Decision trees perform classification without
    requiring much computation.
  • Decision trees are able to handle both continuous
    and categorical variables.
  • Decision trees provide a clear indication of
    which features are most important for prediction
    or classification.

51
Decision Trees Weaknesses
  • Decision trees are prone to errors in
    classification problems with many classes and a
    relatively small number of training examples.
  • Decision trees can be computationally expensive to
    train.
  • Need to compare all possible splits
  • Pruning is also expensive
  • Most decision-tree algorithms only examine a
    single field at a time. This leads to rectangular
    classification boxes that may not correspond well
    with the actual distribution of records in the
    decision space.

52
Naïve Bayes Models
  • Graphical models: graph theory plus probability
    theory
  • Nodes are variables
  • Edges are conditional probabilities

(Graph: node A with children B and C; P(A), P(B|A), P(C|A))
53
Naïve Bayes Models
  • Graphical models: graph theory plus probability
    theory
  • Nodes are variables
  • Edges are conditional probabilities
  • Absence of an edge between nodes implies
    independence between the variables of the nodes

(Graph: node A with children B and C; P(A), P(B|A), P(C|A))
54
Naïve Bayes for text classification
55
Naïve Bayes for text classification
(Graph: topic node earn with word nodes such as shr and per)
56
Naïve Bayes for text classification
(Graph: Topic node with word nodes w1, w2, ..., wn)
  • The words depend on the topic: P(w_i | Topic)
  • P(cts | earn) > P(tennis | earn)
  • Naïve Bayes assumption: all words are independent
    given the topic
  • From the training set we learn the probabilities
    P(w_i | Topic) for each word and for each topic

57
Naïve Bayes for text classification
(Graph: Topic node with word nodes w1, w2, ..., wn)
  • To classify a new example:
  • Calculate P(Topic | w1, w2, ..., wn) for each topic
  • Bayes decision rule:
  • Choose the topic T for which
  • P(T | w1, w2, ..., wn) > P(T' | w1, w2, ..., wn)
    for each T' ≠ T

58
Naïve Bayes Math
  • Naïve Bayes defines a joint probability
    distribution:
  • P(Topic, w1, w2, ..., wn) = P(Topic) ∏_i P(w_i | Topic)
  • We learn P(Topic) and P(w_i | Topic) in training
  • Test: we need P(Topic | w1, w2, ..., wn)
  • P(Topic | w1, w2, ..., wn) =
    P(Topic, w1, w2, ..., wn) / P(w1, w2, ..., wn)
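A hand-computed sketch of the joint-probability formula and decision rule above; the topics, vocabulary, and probability values are invented for illustration.

    priors = {"earn": 0.6, "sports": 0.4}                   # P(Topic)
    likelihood = {                                          # P(w_i | Topic)
        "earn":   {"cts": 0.05, "shr": 0.04, "tennis": 0.001},
        "sports": {"cts": 0.001, "shr": 0.001, "tennis": 0.05},
    }

    def posterior_scores(words):
        """P(Topic) * prod_i P(w_i | Topic), proportional to P(Topic | w1..wn)."""
        scores = {}
        for topic, prior in priors.items():
            score = prior
            for w in words:
                score *= likelihood[topic].get(w, 1e-6)     # tiny floor for unseen words
            scores[topic] = score
        return scores

    scores = posterior_scores(["cts", "shr"])
    print(max(scores, key=scores.get), scores)              # "earn" wins for these words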

59
Naïve Bayes Strengths
  • Very simple model
  • Easy to understand
  • Very easy to implement
  • Very efficient, fast training and classification
  • Modest storage requirements
  • Widely used because it works somewhat well for
    text categorization
  • Linear (but non-parallel) decision boundaries

60
Naïve Bayes Weaknesses
  • Naïve Bayes independence assumption
  • Ignores the sequential ordering of words (uses a
    bag-of-words model)
  • Naïve Bayes assumption is inappropriate if there
    are strong conditional dependencies between the
    variables
  • But even if the model is not right, Naïve Bayes
    models do well in a surprisingly large number of
    cases because often we are interested in
    classification accuracy and not in accurate
    probability estimations

61
Multinomial Naïve Bayes
  • (Based on a paper by McCallum & Nigam, 1998)
  • Features include the number of times words occur
    in the document, not binary (present/absent)
    indicators
  • Uses a statistical formula known as the
    multinomial distribution.
  • The authors compared, on several text
    classification tasks:
  • Multinomial Naïve Bayes
  • Binary-featured multi-variate Bernoulli model
  • Results:
  • Multinomial is much better when using large
    vocabularies.
  • However, they note that Bernoulli can handle
    other features (e.g., from-title) as numbers,
    whereas this will confuse the multinomial
    version.

Andrew McCallum and Kamal Nigam. A Comparison of
Event Models for Naive Bayes Text Classification.
In AAAI/ICML-98 Workshop on Learning for Text
Categorization.
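A minimal multinomial Naïve Bayes sketch with word-count features, assuming scikit-learn; the tiny corpus and topic labels are invented.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    docs = ["shr cts net profit rose", "profit cts shr dividend",
            "match tennis set win", "tennis open win title"]
    topics = ["earn", "earn", "sports", "sports"]

    vec = CountVectorizer()                 # features are word counts, not binary indicators
    X = vec.fit_transform(docs)
    clf = MultinomialNB().fit(X, topics)
    print(clf.predict(vec.transform(["cts profit up", "tennis title match"])))
    # expected: ['earn' 'sports'] on this toy corpus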
62
k Nearest Neighbor Classification
  • Nearest Neighbor classification rule: to classify
    a new object, find the object in the training set
    that is most similar, then assign the category of
    this nearest neighbor
  • K Nearest Neighbor (KNN): consult the k nearest
    neighbors; the decision is based on the majority
    category of these neighbors. More robust than
    k = 1
  • An example of a similarity measure often used in
    NLP is cosine similarity (see the sketch below)
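A sketch of k-nearest-neighbor classification with cosine similarity, assuming scikit-learn; the toy count vectors stand in for document feature vectors.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    X = np.array([[3, 0, 1], [4, 1, 0], [0, 3, 2], [1, 4, 3]], dtype=float)
    y = ["earn", "earn", "sports", "sports"]

    # metric="cosine" ranks neighbors by cosine distance (1 - cosine similarity).
    knn = KNeighborsClassifier(n_neighbors=3, metric="cosine")
    knn.fit(X, y)
    print(knn.predict([[2, 0, 1], [0, 2, 2]]))   # expected: ['earn' 'sports'] on this toy data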

63
1-Nearest Neighbor
64
1-Nearest Neighbor
65
3-Nearest Neighbor
66
3-Nearest Neighbor
Assign the category of the majority of the
neighbors
67
k Nearest Neighbor Classification
  • Strengths
  • Robust
  • Conceptually simple
  • Often works well
  • Powerful (arbitrary decision boundaries)
  • Weaknesses
  • Performance is very dependent on the similarity
    measure used (and to a lesser extent on the
    number of neighbors k used)
  • Finding a good similarity measure can be
    difficult
  • Computationally expensive

68
Summary
  • Algorithms for Classification
  • Linear versus non linear classification
  • Binary classification
  • Perceptron
  • Winnow
  • Support Vector Machines (SVM)
  • Kernel Methods
  • Multilayer Neural Networks
  • Multi-Class classification
  • Decision Trees
  • Naïve Bayes
  • K nearest neighbor

69
Next Time
  • More learning algorithms
  • Clustering