Title: CSA3180: Natural Language Processing
1 CSA3180 Natural Language Processing
- Statistics 2: Probability and Classification II
- Experiments/Outcomes/Events
- Independence/Dependence
- Bayes Rule
- Conditional Probabilities/Chain Rule
- Classification II
2 Introduction
- Slides based on lectures by Mike Rosner (2003) and material by Mary Dalrymple, King's College London
3 Experiments, Basic Outcome, Sample Space
- Probability theory is founded upon the notion of an experiment.
- An experiment is a situation which can have one or more different basic outcomes.
- Example: if we throw a die, there are six possible basic outcomes.
- A sample space Ω is the set of all possible basic outcomes. For example,
- If we toss a coin, Ω = {H, T}
- If we toss a coin twice, Ω = {HT, TH, TT, HH}
- If we throw a die, Ω = {1, 2, 3, 4, 5, 6}
4 Events
- An event A ⊆ Ω is a set of basic outcomes, e.g.
- tossing two heads: {HH}
- throwing a 6: {6}
- getting either a 2 or a 4: {2, 4}
- Ω itself is the certain event, whilst ∅ is the impossible event.
- Event space = the set of all events, i.e. the subsets of the sample space
5 Probability Distribution
- A probability distribution of an experiment is a function that assigns a number (or probability) between 0 and 1 to each basic outcome, such that the sum of all the probabilities = 1.
- Probability distribution functions (PDFs)
- The probability p(E) of an event E is the sum of the probabilities of all the basic outcomes in E.
- A uniform distribution is one in which each basic outcome is equally likely.
6 Probability of an Event
- Sample space for a die throw = set of basic outcomes {1, 2, 3, 4, 5, 6}
- If the die is not loaded, the distribution is uniform.
- Thus each basic outcome, e.g. 6 (throwing a six), is assigned the same probability 1/6.
- So p({3, 6}) = p(3) + p(6) = 2/6 = 1/3
7 Probability Estimates
- Repeat the experiment T times and count the frequency of E.
- Estimated p(E) = count(E)/T
- This can be done over m runs, yielding estimates p1(E), ..., pm(E).
- Best estimate is the (possibly weighted) average of the individual pi(E)
8 3-Times Coin Toss
- Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
- Cases with exactly 2 tails: E = {HTT, THT, TTH}
- Each experiment: 1000 cases (3000 tosses).
- c1(E) = 386, p1(E) = .386
- c2(E) = 375, p2(E) = .375
- pmean(E) = (.386 + .375)/2 ≈ .381
- A uniform distribution is one in which each basic outcome is equally likely.
- Assuming a uniform distribution, p(E) = 3/8 = .375
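The relative-frequency estimate above is easy to reproduce in a short simulation. The sketch below (my own illustration, not part of the slides) repeats the 3-toss experiment in runs of 1000 cases, mirroring the numbers on this slide; the estimates should hover around 3/8 = .375.

```python
import random

# Estimate p(exactly 2 tails in 3 tosses) by relative frequency.
def run(cases=1000):
    hits = 0
    for _ in range(cases):
        tosses = [random.choice("HT") for _ in range(3)]
        if tosses.count("T") == 2:
            hits += 1
    return hits / cases

estimates = [run(), run()]                          # two runs, as on the slide
print(estimates, sum(estimates) / len(estimates))   # mean estimate, close to .375
```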
9 Word Probability
- General problem: what is the probability of the next word/character/phoneme in a sequence, given the first N words/characters/phonemes?
- To approach this problem we study an experiment whose sample space is the set of possible words.
- The same approach could be used to study the probability of the next character or phoneme.
10 Word Probability
- I would like to make a phone _____.
- Look it up in the phone ________, quick!
- The phone ________ you requested is
- Context can have a decisive effect on word probability
11 Word Probability
- Approximation 1: all words are equally probable
- Then the probability of each word = 1/N, where N is the number of word types.
- But all words are not equally probable
- Approximation 2: the probability of each word is the same as its frequency of occurrence in a corpus.
12 Word Probability
- Estimate p(w), the probability of word w
- Given corpus C: p(w) ≈ count(w)/size(C)
- Example
- Brown corpus: 1,000,000 tokens
- the: 69,971 tokens
- Probability of the = 69,971/1,000,000 ≈ .07
- rabbit: 11 tokens
- Probability of rabbit = 11/1,000,000 ≈ .00001
- Conclusion: the next word is most likely to be the
- Is this correct?
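A minimal sketch of this unigram estimate, using a tiny made-up corpus in place of the Brown corpus (the text is purely illustrative):

```python
from collections import Counter

# p(w) is approximated by count(w)/size(C) over a corpus C.
corpus = "look at the cute rabbit the rabbit nibbles the carrot".split()
counts = Counter(corpus)
size = len(corpus)

def p(word):
    return counts[word] / size

print(p("the"), p("rabbit"))  # relative frequencies in this toy corpus
```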
13 Word Probability
- Given the context Look at the cute ...
- is 'the' more likely than 'rabbit'?
- Context matters in determining what word comes next.
- What is the probability of the next word in a sequence, given the first N words?
14 Independent Events
[Diagram: sample space with event A = eggs and event B = Monday]
15 Sample Space
- (eggs,mon) (cereal,mon) (nothing,mon)
- (eggs,tue) (cereal,tue) (nothing,tue)
- (eggs,wed) (cereal,wed) (nothing,wed)
- (eggs,thu) (cereal,thu) (nothing,thu)
- (eggs,fri) (cereal,fri) (nothing,fri)
- (eggs,sat) (cereal,sat) (nothing,sat)
- (eggs,sun) (cereal,sun) (nothing,sun)
16 Independent Events
- Two events, A and B, are independent if the fact that A occurs does not affect the probability of B occurring.
- When two events, A and B, are independent, the probability of both occurring, p(A, B), is the product of the prior probabilities of each, i.e.
- p(A, B) = p(A) × p(B)
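As a quick check of the product rule on the breakfast/day sample space above, assuming (purely for illustration) that each of the 21 pairs is equally likely:

```python
from itertools import product

# Check p(A, B) = p(A) * p(B) under a uniform distribution over all 21 pairs.
breakfasts = ["eggs", "cereal", "nothing"]
days = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]
space = list(product(breakfasts, days))

p_eggs = sum(1 for b, d in space if b == "eggs") / len(space)                  # 1/3
p_mon = sum(1 for b, d in space if d == "mon") / len(space)                    # 1/7
p_both = sum(1 for b, d in space if b == "eggs" and d == "mon") / len(space)   # 1/21

print(p_both, p_eggs * p_mon)  # equal, so A and B are independent here
```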
17 Dependent Events
- Two events, A and B, are dependent if the
occurrence of one affects the probability of the
occurrence of the other.
18 Dependent Events
[Diagram: overlapping events A and B, with intersection A ∩ B, within the sample space]
19 Conditional Probability
- The conditional probability of an event A, given that event B has already occurred, is written p(A|B)
- In general p(A|B) ≠ p(B|A)
20 Dependent Events: p(A|B) ≠ p(B|A)
[Diagram: overlapping events A and B, with intersection A ∩ B, within the sample space]
21 Example Dependencies
- Consider the fair die example with
- A = outcome divisible by 2
- B = outcome divisible by 3
- C = outcome divisible by 4
- p(A|B) = p(A ∩ B)/p(B) = (1/6)/(1/3) = 1/2
- p(A|C) = p(A ∩ C)/p(C) = (1/6)/(1/6) = 1
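These two conditional probabilities can be confirmed by direct enumeration over the six die outcomes; a small sketch:

```python
# Verify p(A|B) and p(A|C) from the fair-die example by enumeration.
omega = set(range(1, 7))

def p(event):
    return len(event) / len(omega)   # uniform die

A = {x for x in omega if x % 2 == 0}   # divisible by 2
B = {x for x in omega if x % 3 == 0}   # divisible by 3
C = {x for x in omega if x % 4 == 0}   # divisible by 4

print(p(A & B) / p(B))   # p(A|B) = 0.5
print(p(A & C) / p(C))   # p(A|C) = 1.0
```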
22 Conditional Probability
- Intuitively, after B has occurred, event A is replaced by A ∩ B, the sample space Ω is replaced by B, and probabilities are renormalised accordingly.
- The conditional probability of an event A given that B has occurred (p(B) > 0) is thus given by p(A|B) = p(A ∩ B)/p(B).
- If A and B are independent, p(A ∩ B) = p(A) × p(B), so p(A|B) = p(A) × p(B)/p(B) = p(A)
23 Bayesian Inversion
- For A and B both to occur, either A must occur first, then B, or vice versa. We get the following possibilities:
- p(A|B) = p(A ∩ B)/p(B) and p(B|A) = p(A ∩ B)/p(A)
- Hence p(A|B) p(B) = p(B|A) p(A)
- We can thus express p(A|B) in terms of p(B|A)
- p(A|B) = p(B|A) p(A)/p(B)
- This equivalence, known as Bayes' Theorem, is useful when one or other quantity is difficult to determine.
24 Bayes Theorem
- p(B|A) = p(B ∩ A)/p(A) = p(A|B) p(B)/p(A)
- The denominator p(A) can be ignored if we are only interested in which event out of some set is most likely.
- Typically we are interested in the value of B that is most likely given an observation A, i.e.
- arg maxB p(A|B) p(B)/p(A) = arg maxB p(A|B) p(B)
25 Chain Rule
- We can extend the definition of conditional probability to more than two events:
- p(A1 ∩ ... ∩ An) = p(A1) p(A2|A1) p(A3|A1 ∩ A2) ... p(An|A1 ∩ ... ∩ An-1)
- The chain rule allows us to talk about the probability of sequences of events p(A1, ..., An)
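For example, applied to a three-word sequence (my own worked instance of the rule above):

```latex
% Chain rule for a three-word sequence
p(w_1, w_2, w_3) = p(w_1)\, p(w_2 \mid w_1)\, p(w_3 \mid w_1, w_2)
```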
26 Classification II
- Linear algorithms in Classification I
- Non-linear algorithms
- Kernel methods
- Multi-class classification
- Decision trees
- Naïve Bayes
27 Non-Linear Problems
28 Non-Linear Problems
29 Non-Linear Problems
- Kernel methods
- A family of non-linear algorithms
- Transform the non-linear problem into a linear one (in a different feature space)
- Use linear algorithms to solve the linear problem in the new space
30 Kernel Methods
- Linear separability is more likely in high dimensions
- Mapping: φ maps the input into a high-dimensional feature space
- Classifier: construct a linear classifier in the high-dimensional feature space
- Motivation: an appropriate choice of φ leads to linear separability
- We can do this efficiently!
31 Kernel Methods
[Diagram: points x, z in the input space X mapped into a higher-dimensional feature space]
32 Kernel Methods
- We can use the linear algorithms seen before (Perceptron, SVM) for classification in the higher-dimensional space
- Kernel methods basically transform any algorithm that depends solely on the dot product between two vectors, by replacing the dot product with a kernel function
- The non-linear kernel algorithm is the linear algorithm operating in the range space of φ
- φ is never explicitly computed (kernels are used instead)
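A small sketch of that last point, using the quadratic kernel k(x, z) = (x · z)² as an illustrative choice: for 2-D inputs this kernel corresponds to the explicit map φ(x) = (x1², √2·x1·x2, x2²), but the kernel gives the same dot product without ever building φ(x).

```python
import numpy as np

# The kernel trick: k(x, z) = (x . z)^2 equals the dot product in the
# feature space defined by phi, so phi never has to be computed explicitly.
def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def kernel(x, z):
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

print(np.dot(phi(x), phi(z)))  # explicit map in feature space: 16.0
print(kernel(x, z))            # same value via the kernel:      16.0
```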
33 Multi-class Classification
- Given some data items that belong to one of M possible classes
- Task: train the classifier and predict the class for a new data item
- Geometrically a harder problem: no more simple geometry
34 Multi-class Classification
35 Multi-class Classification
- Author identification
- Language identification
- Text categorization (topics)
36 Multi-class Classification
- Linear
- Parallel class separators: Decision Trees
- Non-parallel class separators: Naïve Bayes
- Non-linear
- k-nearest neighbors
37 Linear, parallel class separators (e.g. Decision Trees)
38 Linear, non-parallel class separators (e.g. Naïve Bayes)
39 Non-linear separators (e.g. k-Nearest Neighbors)
40 Decision Trees
- A decision tree is a classifier in the form of a tree structure, where each node is either
- a leaf node, which indicates the value of the target attribute (class) of examples, or
- a decision node, which specifies some test to be carried out on a single attribute value, with one branch and sub-tree for each possible outcome of the test.
- A decision tree can be used to classify an example by starting at the root of the tree and moving through it until a leaf node is reached, which provides the classification of the instance.
41 Goal: learn when we can play tennis and when we cannot
42 Decision Trees
[Diagram: PlayTennis decision tree]
- Outlook = Sunny → test Humidity: High → No, Normal → Yes
- Outlook = Overcast → Yes
- Outlook = Rain → test Wind: Strong → No, Weak → Yes
43 Decision Trees
[Diagram: the Sunny branch of the tree]
- Outlook = Sunny → test Humidity: High → No, Normal → Yes
44
Outlook | Temperature | Humidity | Wind | PlayTennis
Sunny   | Hot         | High     | Weak | ?
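To trace how this example falls through the tree, here is a hand-coded sketch of the PlayTennis tree shown two slides back (my own transcription of the diagram):

```python
# Hand-coded version of the PlayTennis tree, used to classify the new example.
def play_tennis(outlook, humidity, wind):
    # Temperature is not tested anywhere in this tree.
    if outlook == "Sunny":
        return "No" if humidity == "High" else "Yes"
    if outlook == "Overcast":
        return "Yes"
    if outlook == "Rain":
        return "No" if wind == "Strong" else "Yes"
    raise ValueError("unknown outlook")

print(play_tennis("Sunny", "High", "Weak"))  # -> "No"
```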
45 Decision Tree for Reuters
46 Decision Trees for Reuters
47 Building Decision Trees
- Given training data, how do we construct them?
- The central focus of the decision-tree growing algorithm is selecting which attribute to test at each node in the tree. The goal is to select the attribute that is most useful for classifying examples.
- Top-down, greedy search through the space of possible decision trees.
- That is, it picks the best attribute and never looks back to reconsider earlier choices.
48 Building Decision Trees
- Splitting criterion
- Finding the features and the values to split on
- For example, why test first on cts and not on vs?
- Why test on cts < 2 and not cts < 5?
- Choose the split that gives us the maximum information gain (or the maximum reduction of uncertainty); see the sketch after this list
- Stopping criterion
- When all the elements at one node have the same class, there is no need to split further
- In practice, one first builds a large tree and then prunes it back (to avoid overfitting)
- See Foundations of Statistical Natural Language Processing, Manning and Schütze, for a good introduction
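A minimal sketch of the information-gain computation behind the splitting criterion. The 9-yes/5-no counts and the three-way split below mirror the classic PlayTennis data split on Outlook, used here only as an illustration:

```python
import math
from collections import Counter

# Information gain = entropy(parent) - weighted entropy of the children.
def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(parent, children):
    total = len(parent)
    remainder = sum(len(ch) / total * entropy(ch) for ch in children)
    return entropy(parent) - remainder

parent = ["yes"] * 9 + ["no"] * 5          # 9 yes / 5 no overall
outlook_split = [                          # Sunny, Overcast, Rain
    ["yes"] * 2 + ["no"] * 3,
    ["yes"] * 4,
    ["yes"] * 3 + ["no"] * 2,
]
print(information_gain(parent, outlook_split))   # about 0.247 bits
```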
49 Decision Trees: Strengths
- Decision trees are able to generate understandable rules.
- Decision trees perform classification without requiring much computation.
- Decision trees are able to handle both continuous and categorical variables.
- Decision trees provide a clear indication of which features are most important for prediction or classification.
50 Decision Trees: Weaknesses
- Decision trees are prone to errors in classification problems with many classes and a relatively small number of training examples.
- Decision trees can be computationally expensive to train.
- We need to compare all possible splits
- Pruning is also expensive
- Most decision-tree algorithms only examine a single field at a time. This leads to rectangular classification boxes that may not correspond well with the actual distribution of records in the decision space.
51 Naïve Bayes
- More powerful than Decision Trees
52 Naïve Bayes
- Graphical models: graph theory plus probability theory
- Nodes are variables
- Edges are conditional probabilities
[Diagram: node A with children B and C, labelled P(A), P(B|A), P(C|A)]
53 Naïve Bayes
- Graphical models: graph theory plus probability theory
- Nodes are variables
- Edges are conditional probabilities
- Absence of an edge between nodes implies independence between the variables of the nodes
[Diagram: node A with children B and C, labelled P(A), P(B|A), P(C|A)]
54 Naïve Bayes
55 Naïve Bayes
[Diagram: a Reuters "earn" example, with observed words such as Shr and per]
56 Naïve Bayes
[Diagram: Topic node with word children w1, w2, ..., wn]
- The words depend on the topic: P(wi | Topic)
- P(cts | earn) > P(tennis | earn)
- Naïve Bayes assumption: all words are independent given the topic
- From the training set we learn the probabilities P(wi | Topic) for each word and for each topic in the training set
57 Naïve Bayes
[Diagram: Topic node with word children w1, w2, ..., wn]
- To classify a new example
- Calculate P(Topic | w1, w2, ..., wn) for each topic
- Bayes decision rule
- Choose the topic T for which
- P(T | w1, w2, ..., wn) > P(T' | w1, w2, ..., wn) for each T' ≠ T
58 Naïve Bayes: Math
- Naïve Bayes defines a joint probability distribution:
- P(Topic, w1, w2, ..., wn) = P(Topic) ∏i P(wi | Topic)
- We learn P(Topic) and P(wi | Topic) in training
- At test time we need P(Topic | w1, w2, ..., wn)
- P(Topic | w1, w2, ..., wn) = P(Topic, w1, w2, ..., wn) / P(w1, w2, ..., wn)
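A compact sketch of the whole pipeline under these formulas, with a made-up two-topic training set and add-one smoothing (both of which are my own illustrative choices, not part of the slides):

```python
import math
from collections import Counter, defaultdict

# Train: estimate P(Topic) and P(wi | Topic); classify by
# arg max_Topic P(Topic) * prod_i P(wi | Topic), computed in log space.
train = [
    ("earn", "shr cts net profit".split()),
    ("earn", "cts per shr dividend".split()),
    ("sport", "tennis match win set".split()),
]

prior = Counter(topic for topic, _ in train)
word_counts = defaultdict(Counter)
for topic, words in train:
    word_counts[topic].update(words)
vocab = {w for _, words in train for w in words}

def log_score(topic, words):
    total = sum(word_counts[topic].values())
    score = math.log(prior[topic] / len(train))               # log P(Topic)
    for w in words:
        # add-one smoothing so unseen words do not zero out the product
        score += math.log((word_counts[topic][w] + 1) / (total + len(vocab)))
    return score

doc = "shr cts profit".split()
print(max(prior, key=lambda t: log_score(t, doc)))            # -> "earn"
```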
59 Naïve Bayes: Strengths
- Very simple model
- Easy to understand
- Very easy to implement
- Very efficient, fast training and classification
- Modest storage requirements
- Widely used because it works really well for text categorization
- Linear, but non-parallel decision boundaries
60 Naïve Bayes: Weaknesses
- The Naïve Bayes independence assumption has two consequences:
- The linear ordering of words is ignored (bag-of-words model)
- The words are independent of each other given the class: false!
- President is more likely to occur in a context that contains election than in a context that contains poet
- The Naïve Bayes assumption is inappropriate if there are strong conditional dependencies between the variables
- (But even if the model is not right, Naïve Bayes models do well in a surprisingly large number of cases, because often we are interested in classification accuracy and not in accurate probability estimates)