Title: Natural Language Processing (2): Basic Probability
1 Natural Language Processing (2): Basic Probability
- Dr. Xuan Wang
- Intelligence Computing Research Center
- Harbin Institute of Technology Shenzhen Graduate School
- Slides from Dr. Mary P. Harper, ECE, Purdue University
2 Motivation
- Statistical NLP aims to do statistical inference.
- Statistical inference consists of taking some data (generated according to some unknown probability distribution) and then making some inferences about this distribution.
- An example of statistical inference is the task of language modeling, namely predicting the next word given a window of previous words. To do this, we need a model of the language.
- Probability theory helps us to find such a model.
3 Probability Terminology
- Probability theory deals with predicting how likely it is that something will happen.
- The process by which an observation is made is called an experiment or a trial (e.g., tossing a coin twice).
- The collection of basic outcomes (or sample points) for our experiment is called the sample space.
4 Probability Terminology
- An event is a subset of the sample space.
- Probabilities are numbers between 0 and 1, where 0 indicates impossibility and 1 certainty.
- A probability function/distribution distributes a probability mass of 1 throughout the sample space.
5 Experiments and Sample Spaces
- The set of possible basic outcomes of an experiment is the sample space Ω:
  - coin toss (Ω = {head, tail})
  - tossing a coin 2 times (Ω = {HH, HT, TH, TT})
  - dice roll (Ω = {1, 2, 3, 4, 5, 6})
  - missing word (|Ω| ≈ vocabulary size)
- Discrete (countable) versus continuous (uncountable) sample spaces.
- Every observation/trial is a basic outcome or sample point.
- An event A is a set of basic outcomes with A ⊆ Ω; ∅ is the impossible event.
6 Events and Probability
- The probability of an event A is denoted P(A) (also called the prior probability, i.e., the probability before we consider any additional knowledge).
- Example experiment: toss a coin three times
  - Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
- Cases with two or more tails:
  - A = {HTT, THT, TTH, TTT}
  - P(A) = |A| / |Ω| = 4/8 = 1/2 (assuming a uniform distribution)
- All heads:
  - A = {HHH}
  - P(A) = |A| / |Ω| = 1/8
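As a sanity check on the counts above, here is a minimal Python sketch (names are illustrative) that enumerates the sample space for three coin tosses and computes P(A) = |A| / |Ω| under the uniform-distribution assumption:

```python
from itertools import product
from fractions import Fraction

# Sample space for tossing a coin three times: all H/T sequences of length 3.
omega = [''.join(toss) for toss in product('HT', repeat=3)]   # 8 outcomes

# Event A: outcomes with two or more tails.
two_or_more_tails = [w for w in omega if w.count('T') >= 2]

# Event B: all heads.
all_heads = [w for w in omega if w == 'HHH']

# Under a uniform distribution, P(A) = |A| / |Omega|.
print(Fraction(len(two_or_more_tails), len(omega)))  # 1/2
print(Fraction(len(all_heads), len(omega)))          # 1/8
```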
7 Probability Properties
- Basic properties:
  - P(A) ∈ [0, 1]
  - P(Ω) = 1
  - For disjoint events: P(∪i Ai) = Σi P(Ai)
  - NB: the axiomatic definition of probability takes the above three conditions as axioms.
- Immediate consequences:
  - P(∅) = 0
  - P(Ā) = 1 - P(A)
  - A ⊆ B ⇒ P(A) ≤ P(B)
8 Joint Probability
- The joint probability of A and B: P(A,B) = P(A ∩ B)
- A 2-dimensional table (A × B) with a value in every cell gives the probability of each specific pair occurring.
9 Conditional Probability
- Sometimes we have partial knowledge about the outcome of an experiment; then the conditional (or posterior) probability can be helpful.
- If we know that event B is true, then we can determine the probability that A is true given this knowledge:
  P(A|B) = P(A,B) / P(B)
10 Conditional and Joint Probabilities
- P(A|B) = P(A,B) / P(B), so P(A,B) = P(A|B) P(B)
- P(B|A) = P(A,B) / P(A), so P(A,B) = P(B|A) P(A)
- Chain rule: P(A1, A2, ..., An) = P(A1) P(A2|A1) P(A3|A1, A2) ... P(An|A1, ..., An-1)
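These identities can be verified numerically. The sketch below uses a small made-up joint table (the numbers are my illustration, not from the slides) to check that both factorizations of the product rule recover the same joint probability:

```python
# Toy joint distribution over two binary events A and B (values are illustrative).
p_joint = {('a', 'b'): 0.30, ('a', '~b'): 0.10,
           ('~a', 'b'): 0.20, ('~a', '~b'): 0.40}

p_a = sum(p for (a, _), p in p_joint.items() if a == 'a')   # marginal P(A)
p_b = sum(p for (_, b), p in p_joint.items() if b == 'b')   # marginal P(B)

p_a_given_b = p_joint[('a', 'b')] / p_b   # P(A|B) = P(A,B) / P(B)
p_b_given_a = p_joint[('a', 'b')] / p_a   # P(B|A) = P(A,B) / P(A)

# Both factorizations recover the same joint probability P(A,B).
assert abs(p_a_given_b * p_b - p_joint[('a', 'b')]) < 1e-12
assert abs(p_b_given_a * p_a - p_joint[('a', 'b')]) < 1e-12
```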
11 Bayes' Rule
- Since P(A,B) = P(B,A) (i.e., P(A ∩ B) = P(B ∩ A)) and P(A,B) = P(A|B) P(B) = P(B|A) P(A):
- P(A|B) = P(A,B) / P(B) = P(B|A) P(A) / P(B)
- P(B|A) = P(A,B) / P(A) = P(A|B) P(B) / P(A)
12 Example
- S = have a stiff neck, M = have meningitis
- P(S|M) = 0.5, P(M) = 1/50,000, P(S) = 1/20
- I have a stiff neck; should I worry?
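Applying Bayes' rule from the previous slide, P(M|S) = P(S|M) P(M) / P(S) = 0.5 × (1/50,000) / (1/20) = 0.0002, so a stiff neck alone gives only about a 1-in-5,000 chance of meningitis. A small sketch of the arithmetic:

```python
from fractions import Fraction

p_s_given_m = Fraction(1, 2)        # P(S|M): stiff neck given meningitis
p_m = Fraction(1, 50_000)           # P(M): prior probability of meningitis
p_s = Fraction(1, 20)               # P(S): probability of a stiff neck

# Bayes' rule: P(M|S) = P(S|M) * P(M) / P(S)
p_m_given_s = p_s_given_m * p_m / p_s
print(p_m_given_s)         # 1/5000
print(float(p_m_given_s))  # 0.0002
```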
13Independence
Two events A and B are independent of each other
if P(A) P(AB) Example two coin tosses,
weather today and weather on March 4th, 1789 If A
and B are independent, then we compute P(A,B)
from P(A) and P(B) as P(A,B) P(AB) P(B)
P(A) P(B) Two events A and B are conditionally
independent of each other given C if P(AC)
P(AB,C)
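A minimal check of independence for two fair coin tosses; the joint table here is the standard uniform one, written out only for illustration:

```python
# Joint distribution of two fair, independent coin tosses.
p_joint = {(x, y): 0.25 for x in 'HT' for y in 'HT'}

p_first_h = sum(p for (x, _), p in p_joint.items() if x == 'H')    # P(A): first toss is H
p_second_h = sum(p for (_, y), p in p_joint.items() if y == 'H')   # P(B): second toss is H
p_first_h_given_second_h = p_joint[('H', 'H')] / p_second_h        # P(A|B)

# Independence: P(A) == P(A|B)  and  P(A,B) == P(A) * P(B)
assert abs(p_first_h - p_first_h_given_second_h) < 1e-12
assert abs(p_joint[('H', 'H')] - p_first_h * p_second_h) < 1e-12
```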
14 A Golden Rule (of Statistical NLP)
- If we are interested in which event B is most likely given A, we can use Bayes' rule and maximize over all B: argmax_B P(B|A) = argmax_B P(A|B) P(B) / P(A) = argmax_B P(A|B) P(B).
- P(A) is a normalizing constant.
- The denominator need not be computed, since it is the same for every B we compare.
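A hedged sketch of this argmax rule; the candidate hypotheses, priors, and likelihoods below are made up for illustration:

```python
# Candidate hypotheses B with prior P(B) and likelihood P(A|B); values are illustrative.
candidates = {
    'B1': {'prior': 0.7, 'likelihood': 0.1},
    'B2': {'prior': 0.2, 'likelihood': 0.5},
    'B3': {'prior': 0.1, 'likelihood': 0.3},
}

# argmax_B P(B|A) = argmax_B P(A|B) * P(B); P(A) is the same for every B, so it is dropped.
best = max(candidates, key=lambda b: candidates[b]['likelihood'] * candidates[b]['prior'])
print(best)   # 'B2' here, since 0.5 * 0.2 = 0.10 beats 0.07 and 0.03
```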
15 Random Variables (RV)
- Random variables (RVs) allow us to talk about the probabilities of numerical values that are related to the event space (with a specific numeric range).
- An RV is a function X: Ω → Q
  - in general Q ⊆ Rⁿ, typically Q ⊆ R
  - it is easier to handle real numbers than real-world events
- An RV is discrete if Q is a countable subset of R; it is an indicator RV (or Bernoulli trial) if Q = {0, 1}.
- We can define a probability mass function (pmf) for an RV X that gives the probability it takes on different values:
  - p_X(x) = P(X = x) = P(A_x), where A_x = {ω ∈ Ω : X(ω) = x}
  - often written just p(x) if it is clear from context which RV is meant.
16 Example
- Suppose we define a discrete RV X that is the sum of the faces of two dice; then Q = {2, ..., 11, 12}, with the pmf as follows:
  - P(X=2) = 1/36
  - P(X=3) = 2/36
  - P(X=4) = 3/36
  - P(X=5) = 4/36
  - P(X=6) = 5/36
  - P(X=7) = 6/36
  - P(X=8) = 5/36
  - P(X=9) = 4/36
  - P(X=10) = 3/36
  - P(X=11) = 2/36
  - P(X=12) = 1/36
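The same pmf can be derived by enumerating the 36 equally likely outcomes of two dice; a small sketch:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Enumerate the 36 equally likely outcomes of rolling two dice, grouped by their sum.
counts = Counter(d1 + d2 for d1, d2 in product(range(1, 7), repeat=2))

# pmf: p_X(x) = |{omega : X(omega) = x}| / |Omega|
pmf = {x: Fraction(c, 36) for x, c in sorted(counts.items())}
for x, p in pmf.items():
    # Fractions print in reduced form, e.g., P(X=3) = 1/18 rather than 2/36.
    print(f"P(X={x}) = {p}")
```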
17 Expectation and Variance
- The expectation is the mean or average of an RV, defined as E(X) = Σ_x x p(x).
- The variance of an RV is a measure of whether the values of the RV tend to vary over trials: Var(X) = E((X - E(X))²) = E(X²) - (E(X))².
- The standard deviation (σ) is the square root of the variance.
18 Examples
- What is the expectation of the sum of the numbers on two dice?
  - 2 · P(X=2) = 2 · 1/36 = 1/18
  - 3 · P(X=3) = 3 · 2/36 = 3/18
  - 4 · P(X=4) = 4 · 3/36 = 6/18
  - 5 · P(X=5) = 5 · 4/36 = 10/18
  - 6 · P(X=6) = 6 · 5/36 = 15/18
  - 7 · P(X=7) = 7 · 6/36 = 21/18
  - 8 · P(X=8) = 8 · 5/36 = 20/18
  - 9 · P(X=9) = 9 · 4/36 = 18/18
  - 10 · P(X=10) = 10 · 3/36 = 15/18
  - 11 · P(X=11) = 11 · 2/36 = 11/18
  - 12 · P(X=12) = 12 · 1/36 = 6/18
  - E(SUM) = 126/18 = 7
- Or more simply:
  - E(SUM) = E(D1 + D2) = E(D1) + E(D2)
  - E(D1) = E(D2) = 1 · 1/6 + 2 · 1/6 + ... + 6 · 1/6 = (1 + 2 + 3 + 4 + 5 + 6)/6 = 21/6
  - Hence, E(SUM) = 21/6 + 21/6 = 7
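The same numbers can be computed directly from the pmf; the sketch below also computes the variance, which for the sum of two fair dice works out to 35/6 (about 5.83):

```python
from fractions import Fraction

# pmf of the sum of two fair dice, as on the previous slides.
pmf = {2: 1, 3: 2, 4: 3, 5: 4, 6: 5, 7: 6, 8: 5, 9: 4, 10: 3, 11: 2, 12: 1}
pmf = {x: Fraction(c, 36) for x, c in pmf.items()}

# E(X) = sum_x x * p(x)
expectation = sum(x * p for x, p in pmf.items())

# Var(X) = E(X^2) - E(X)^2
second_moment = sum(x * x * p for x, p in pmf.items())
variance = second_moment - expectation ** 2

print(expectation)  # 7
print(variance)     # 35/6  (about 5.83)
```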
19 Examples
20 Joint and Conditional Distributions for RVs
21 Estimating Probability Functions
22 Parametric Methods
- Assume that the language phenomenon is acceptably modeled by one of the well-known standard distributions (e.g., binomial, normal).
- By assuming an explicit probabilistic model of the process by which the data were generated, determining a particular probability distribution within the family requires only the specification of a few parameters, which requires less training data (i.e., only a small number of parameters need to be estimated).
23 Non-parametric Methods
- No assumption is made about the underlying distribution of the data.
- For example, simply estimating P empirically by counting a large number of random events is a distribution-free method.
- Given less prior information, more training data is needed.
24 Estimating Probability
- Example: toss a coin three times
  - Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
  - Count cases with exactly two tails: A = {HTT, THT, TTH}
- Run the experiment 1000 times (i.e., 3000 tosses)
  - Counted 386 cases with two tails (HTT, THT, or TTH)
  - Estimate: p(A) = 386 / 1000 = 0.386
- Run again: 373, 399, 382, 355, 372, 406, 359
  - p(A) ≈ 0.379 (weighted average), or simply 3032 / 8000
- Under the uniform distribution assumption, p(A) = 3/8 = 0.375
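A hedged simulation sketch of this experiment (the random seed and the exact counts will of course differ from the figures above):

```python
import random

def estimate_two_tails(num_experiments=1000, seed=0):
    """Estimate P(exactly two tails in three tosses) by Monte Carlo simulation."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_experiments):
        tosses = [rng.choice('HT') for _ in range(3)]
        if tosses.count('T') == 2:       # exactly two tails: HTT, THT, or TTH
            hits += 1
    return hits / num_experiments

print(estimate_two_tails())   # something near the true value 3/8 = 0.375
```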
25 Standard Distributions
- In practice, one commonly finds the same basic form of a probability mass function, but with different constants employed.
- Families of pmfs are called distributions, and the constants that define the different possible pmfs in one family are called parameters.
- Discrete distributions: the binomial distribution, the multinomial distribution, the Poisson distribution.
- Continuous distributions: the normal distribution, the standard normal distribution.
26 Standard Distributions: Binomial
27 Binomial Distribution
- Works well for tossing a coin. However, for linguistic corpora one never has complete independence from one sentence to the next, so it is only an approximation.
- Use it when counting whether something has a certain property or not (assuming independence).
- It is actually quite common in statistical NLP, e.g., looking through a corpus to estimate the percentage of sentences that contain a certain word, or how often a verb is used as transitive versus intransitive.
- The expectation is n·p and the variance is n·p·(1-p).
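A minimal sketch of the binomial pmf and its moments using only the standard library; the coin-toss parameters are illustrative:

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p): n independent trials, success probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.5                       # e.g., 10 tosses of a fair coin
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

mean = sum(k * pk for k, pk in enumerate(pmf))
variance = sum((k - mean)**2 * pk for k, pk in enumerate(pmf))

print(mean, n * p)                   # both approximately 5.0:  E(X) = n*p
print(variance, n * p * (1 - p))     # both approximately 2.5:  Var(X) = n*p*(1-p)
```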
28 Standard Distributions: Normal
29 Frequentist Statistics
30 Bayesian Statistics I: Bayesian Updating
- Updating
- Assume that the data are coming in sequentially and are independent.
- Given an a-priori probability distribution, we can update our beliefs when a new datum comes in by calculating the Maximum A Posteriori (MAP) distribution.
- The MAP probability becomes the new prior, and the process repeats on each new datum.
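A hedged sketch of sequential updating for a coin's head probability, assuming a Beta prior; the Beta-Binomial setup and the data sequence are my illustration, not from the slides. With a Beta(a, b) prior, each new observation yields another Beta posterior, whose MAP (mode) is (a-1)/(a+b-2):

```python
# Sequential Bayesian updating for the probability of heads, theta.
# Prior: Beta(a, b). After observing a head the posterior is Beta(a+1, b);
# after a tail, Beta(a, b+1). The posterior serves as the prior for the next datum.
a, b = 2.0, 2.0                        # weak prior centred on theta = 0.5 (illustrative)

data = ['H', 'H', 'T', 'H', 'T', 'H']  # data arriving one at a time (made up)

for datum in data:
    if datum == 'H':
        a += 1
    else:
        b += 1
    map_theta = (a - 1) / (a + b - 2)  # mode of Beta(a, b), valid for a, b > 1
    print(f"after {datum}: MAP estimate of theta = {map_theta:.3f}")
```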
31 Bayesian Statistics: MAP
32 Bayesian Statistics II: Bayesian Decision Theory
- Bayesian statistics can be used to evaluate which model or family of models better explains some data.
- We define two different models of the event and calculate the likelihood ratio between these two models.
33 Bayesian Decision Theory
34 Essential Information Theory
- Developed by Shannon in the 1940s.
- The goal is to maximize the amount of information that can be transmitted over an imperfect communication channel.
- Shannon wished to determine the theoretical maxima for data compression (entropy H) and transmission rate (channel capacity C).
- If a message is transmitted at a rate slower than C, then the probability of transmission errors can be made as small as desired.
35 Entropy
36 Entropy (cont.)
37 Using the Formula: Examples
38 The Limits
39 Coding Interpretation of Entropy
- H(p) is the least (average) number of bits needed to encode a message (string, sequence, series, ...), where each element is the result of a random process with some distribution p.
- Compression algorithms:
  - do well on data with repeating (easily predictable, low-entropy) patterns
  - their results have high entropy ⇒ compressing compressed data does nothing
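A minimal sketch computing H(p) = -Σ_x p(x) log2 p(x) for a few distributions, illustrating that predictable (peaked) distributions have low entropy:

```python
from math import log2

def entropy(p):
    """H(p) = -sum_x p(x) * log2 p(x), in bits; zero-probability outcomes contribute 0."""
    return -sum(px * log2(px) for px in p if px > 0)

print(entropy([0.5, 0.5]))             # 1.0 bit  (fair coin)
print(entropy([1/8] * 8))              # 3.0 bits (8 equiprobable outcomes)
print(entropy([0.9, 0.1]))             # about 0.47 bits (highly predictable, low entropy)
```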
40 Coding Example
41 Properties of Entropy I
42 Joint Entropy
43 Conditional Entropy
44 Properties of Entropy II
45 Chain Rule for Entropy
46 Entropy Rate
- Because the amount of information contained in a message depends on its length, we may want to compare messages using the entropy rate (the entropy per unit).
- The entropy rate of a language is the limit of the entropy rate of a sample of the language as the sample gets longer and longer.
47 Mutual Information
48 Relationship between I and H
49 Mutual Information (cont.)
50 Mutual Information (cont.)
51 Mutual Information and Entropy
52 The Noisy Channel Model
- We want to optimize communication across a channel in terms of throughput and accuracy: the communication of messages in the presence of noise in the channel.
- There is a duality between compression (achieved by removing all redundancy) and transmission accuracy (achieved by adding controlled redundancy so that the input can be recovered in the presence of noise).
53 The Noisy Channel Model
- Goal: encode the message in such a way that it occupies minimal space while still containing enough redundancy to be able to detect and correct errors.
54 Language and the Noisy Channel Model
- In language we can't control the encoding phase; we can only decode the output to give the most likely input.
- Determine the most likely input given the output! (See the sketch below.)
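In noisy channel terms, decoding picks the input i that maximizes P(i|o) = P(o|i) P(i) / P(o), i.e., argmax_i P(o|i) P(i). A hedged sketch with a made-up channel model and a made-up language-model prior (all candidate words and numbers are illustrative):

```python
from math import log

# Prior P(i) over candidate inputs (a toy "language model"); values are illustrative.
prior = {'their': 0.6, 'there': 0.3, 'three': 0.1}

# Channel model P(o | i): probability that input i was corrupted into the observed output o.
channel = {'their': 0.2, 'there': 0.5, 'three': 0.05}   # for one fixed observed output

# Noisy channel decoding: argmax_i P(o|i) * P(i); log space avoids underflow on long inputs.
best_input = max(prior, key=lambda i: log(channel[i]) + log(prior[i]))
print(best_input)   # 'there' here: 0.5 * 0.3 = 0.15 beats 0.12 and 0.005
```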
55 The Noisy Channel Model
56 Relative Entropy: Kullback-Leibler Divergence
57 Entropy and Language
- Entropy is a measure of uncertainty. The more we know about something, the lower the entropy.
- If a language model captures more of the structure of the language than another model, then its entropy should be lower.
- Entropy can be thought of as a measure of how surprised we will be to see the next word, given the previous words we have already seen.
58 Entropy and Language
- (Per-character entropy comparison across several languages; the values cited range from about 3.98 bits to 11.46 bits per character.)
59 An Example: Meng's Profile
- http://219.223.235.139/weblog/profile.php?umengxj
- Aoccdrnig to rscheearch at an Elingsh uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer are in the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit a porbelm. Tihs is bcuseae we do not raed ervey lteter by it slef but the wrod as a wlohe and the biran fguiers it out aynawy. so please excuse me for every typo in the blog, btw fixes and patches are welcome.
60 Perplexity
- A measure related to the notion of cross entropy and used in the speech recognition community is called perplexity.
- A perplexity of k means that you are as surprised on average as you would have been if you had had to guess between k equiprobable choices at each step.
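A minimal sketch relating perplexity to cross entropy: perplexity = 2^H, where H is the average negative log2 probability the model assigns per word (the example probabilities are made up):

```python
from math import log2

def perplexity(word_probs):
    """Perplexity = 2 ** cross-entropy, where cross-entropy is the average -log2 P(word)."""
    cross_entropy = -sum(log2(p) for p in word_probs) / len(word_probs)
    return 2 ** cross_entropy

# If the model assigns every word probability 1/k, perplexity is exactly k:
print(perplexity([0.125] * 10))          # 8.0 -> as surprised as guessing among 8 options
print(perplexity([0.25, 0.1, 0.05]))     # a more typical mixed case
```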
61 World Super Star
- A.M. Turing Award
- ACM's most prestigious technical award is accompanied by a prize of $100,000. It is given to an individual selected for contributions of a technical nature made to the computing community. The contributions should be of lasting and major technical importance to the computer field. Financial support of the Turing Award is provided by the Intel Corporation.
62 Alan Turing (1912-1954)
63 2004 Winners: Vinton G. Cerf and Robert E. Kahn
- Citation:
- For pioneering work on internetworking, including the design and implementation of the Internet's basic communications protocols, TCP/IP, and for inspired leadership in networking.
64 Vinton Cerf
- Known as a "Father of the Internet"; served as chairman of ICANN.
- ICANN (Internet Corporation for Assigned Names and Numbers), founded in October 1998, is the non-profit body that coordinates the Internet's IP address space, protocol parameters, Domain Name System, and root server system.
- Cerf co-designed the TCP/IP protocols with Robert E. Kahn. In December 1997, President Clinton presented Cerf and Kahn with the U.S. National Medal of Technology for founding and developing the Internet.
65 Vinton G. Cerf
- Vinton Gray Cerf grew up in San Francisco, the son of a Navy officer who fought in World War II. He received a B.S. in mathematics from Stanford and graduate degrees from UCLA. However, it was his research work at Stanford that would begin his lifelong involvement in the Internet.
66 Robert E. Kahn
- Robert E. Kahn is a co-inventor of the TCP/IP protocols and played a central role in the development of the Arpanet.
- He is a member of the National Academy of Engineering, an IEEE fellow, a fellow of the American Association for Artificial Intelligence, and an ACM fellow.
- In 1986 he founded CNRI (Corporation for National Research Initiatives), which he heads and which has provided support for the IETF.
- Born in 1938, he worked in 1969 on the Interface Message Processors (IMPs) for the Arpanet and around 1970 on the Network Control Protocol (NCP); in the 1980s he proposed the National Information Infrastructure (NII), popularly known as the Information Superhighway.
- In 1997 he received the U.S. National Medal of Technology together with Vinton Cerf.