Title: Reasoning Under Uncertainty
1. Reasoning Under Uncertainty
- Artificial Intelligence
- Chapter 9
2. Part 2: Reasoning
3. Notation
- Random variable (RV): a variable (uppercase) that takes on values (lowercase) from a domain of mutually exclusive and exhaustive values
- A=a: a proposition, world state, event, effect, etc.
- - abbreviate P(A=true) to P(a)
- - abbreviate P(A=false) to P(¬a)
- - abbreviate P(A=value) to P(value)
- - abbreviate P(A≠value) to P(¬value)
- Atomic event: a complete specification of the state of the world about which the agent is uncertain
4. Notation
- P(a): prior probability of RV A=a, i.e. the degree of belief in proposition a in the absence of any other relevant information
- P(a|e): conditional probability of RV A=a given E=e, i.e. the degree of belief in proposition a when all that is known is evidence e
- P(A): probability distribution, i.e. the set of P(a_i) for all i
- Joint probabilities are for conjunctions of propositions
5. Reasoning under Uncertainty
- Rather than reasoning about the truth or falsity of a proposition, reason about the degree of belief that the proposition is true.
- Use a knowledge base of known probabilities to determine probabilities for query propositions.
6. Reasoning under Uncertainty using Full Joint Distributions
- Assume a simplified Clue game having two characters, two weapons and two rooms
- each row in the table below is an atomic event
- - one of these must be true
- - the list must be mutually exclusive
- - the list must be exhaustive

  Who    What  Where    Probability
  plum   rope  hall     1/8
  plum   rope  kitchen  1/8
  plum   pipe  hall     1/8
  plum   pipe  kitchen  1/8
  green  rope  hall     1/8
  green  rope  kitchen  1/8
  green  pipe  hall     1/8
  green  pipe  kitchen  1/8

- the prior probability for each is 1/8
- - each atomic event is equally likely
- - e.g. P(plum, rope, hall) = 1/8
- Σ_i P(atomic_event_i) = 1, since each RV's domain is exhaustive and mutually exclusive (a code sketch of this table follows below)
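The table above fits in a few lines of code. A minimal sketch (the dict-of-tuples representation and the name FJD are mine, not from the slides):

    from fractions import Fraction

    # Full joint distribution for the simplified Clue game: each key is an
    # atomic event (who, what, where), each value is its prior probability.
    FJD = {
        (who, what, where): Fraction(1, 8)
        for who in ("plum", "green")
        for what in ("rope", "pipe")
        for where in ("hall", "kitchen")
    }

    # The atomic events are mutually exclusive and exhaustive, so they sum to 1.
    assert sum(FJD.values()) == 1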
7. Determining Marginal Probabilities using Full Joint Distributions
- The probability of any proposition a is equal to the sum of the probabilities of the atomic events in which it holds; this set of events is called e(a).
- P(a) = Σ P(e_i), where e_i is an element of e(a)
- i.e. a is the disjunction of the atomic events in the set e(a)
- recall this property of atomic events: any proposition is logically equivalent to the disjunction of all atomic events that entail the truth of that proposition (a code sketch follows below)
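Marginalization can be written directly against the FJD dict sketched above. The helper name prob and the predicate-based interface are my own assumptions, not the slides':

    # P(a) = sum of P(e_i) over every atomic event e_i in which proposition a
    # holds; `holds` is any predicate over atomic events.
    def prob(holds, fjd):
        return sum(p for event, p in fjd.items() if holds(event))

    # e.g. the marginal P(plum) worked on the next slide:
    print(prob(lambda e: e[0] == "plum", FJD))  # 1/2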
8. Determining Marginal Probabilities using Full Joint Distributions
- Assume a simplified Clue game having two characters, two weapons and two rooms
- P(a) = Σ P(e_i), where e_i is an element of e(a)

  Who    What  Where    Probability
  plum   rope  hall     1/8
  plum   rope  kitchen  1/8
  plum   pipe  hall     1/8
  plum   pipe  kitchen  1/8
  green  rope  hall     1/8
  green  rope  kitchen  1/8
  green  pipe  hall     1/8
  green  pipe  kitchen  1/8

- P(plum) = 1/8 + 1/8 + 1/8 + 1/8 = 1/2
- a probability obtained in this manner is called a marginal probability
- it can be just a prior probability (shown) or more complex (next)
- this process is called marginalization or summing out
9. Reasoning under Uncertainty using Full Joint Distributions
- Assume a simplified Clue game having two characters, two weapons and two rooms

  Who    What  Where    Probability
  plum   rope  hall     1/8
  plum   rope  kitchen  1/8
  plum   pipe  hall     1/8
  plum   pipe  kitchen  1/8
  green  rope  hall     1/8
  green  rope  kitchen  1/8
  green  pipe  hall     1/8
  green  pipe  kitchen  1/8

- P(green, pipe) = 1/8 + 1/8 = 1/4
- P(rope, ¬hall) = 1/8 + 1/8 = 1/4
- P(rope ∨ hall) = 6 × 1/8 = 3/4
- (these queries are checked in code below)
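Assuming the prob helper sketched earlier, these three queries can be checked mechanically; conjunction, negation and disjunction all become ordinary boolean operators inside the predicate:

    who, what, where = 0, 1, 2  # tuple positions in each atomic event

    print(prob(lambda e: e[who] == "green" and e[what] == "pipe", FJD))   # 1/4
    print(prob(lambda e: e[what] == "rope" and e[where] != "hall", FJD))  # 1/4
    print(prob(lambda e: e[what] == "rope" or e[where] == "hall", FJD))   # 3/4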
10. Independence
- Using the game Clue for an example is uninteresting! Why?
- Because the random variables Who, What, Where are independent.
- Does picking the murderer from the deck of cards affect which weapon is chosen? The location?
- No! Each is randomly selected.
11. Independence
- Unconditional (absolute) independence: RVs have no effect on each other's probabilities
- - 1. P(X|Y) = P(X)
- - 2. P(Y|X) = P(Y)
- - 3. P(X,Y) = P(X) P(Y)
- Example (full Clue: 6 characters, 6 weapons, 9 rooms, so 324 atomic events)
- - P(green | hall) = P(green, hall) / P(hall) = (6/324) / (1/9) = P(green) = 1/6
- - P(hall | green) = P(hall) = 1/9
- - P(green, hall) = P(green) P(hall) = 1/54
- We need a more interesting example!
12. Independence
- Conditional independence: RVs X and Y are each dependent on another RV Z, but are independent of each other given Z
- - 1. P(X|Y,Z) = P(X|Z)
- - 2. P(Y|X,Z) = P(Y|Z)
- - 3. P(X,Y|Z) = P(X|Z) P(Y|Z)
- Idea: sneezing (X) and itchy eyes (Y) are both directly caused by hayfever (Z)
- but neither sneezing nor itchy eyes has a direct effect on the other
13. Reasoning under Uncertainty using Full Joint Distributions
- Assume three boolean RVs, Hayfever (HF), Sneeze (SN), ItchyEyes (IE), and fictional probabilities
- P(a) = Σ P(e_i), where e_i is an element of e(a)

  HF     SN     IE     Probability
  false  false  false  0.50
  false  false  true   0.09
  false  true   false  0.10
  false  true   true   0.10
  true   false  false  0.01
  true   false  true   0.06
  true   true   false  0.04
  true   true   true   0.10

- P(sn) = 0.1 + 0.1 + 0.04 + 0.1 = 0.34
- P(hf) = 0.01 + 0.06 + 0.04 + 0.1 = 0.21
- P(sn, ie) = 0.1 + 0.1 = 0.20
- P(hf, sn) = 0.04 + 0.1 = 0.14
- (these marginals are verified in code below)
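The same predicate-summing sketch carries over to this table (the dict name HSI and the index names are mine):

    # Full joint distribution over (HF, SN, IE), copied from the table above.
    HSI = {
        (False, False, False): 0.50, (False, False, True): 0.09,
        (False, True,  False): 0.10, (False, True,  True): 0.10,
        (True,  False, False): 0.01, (True,  False, True): 0.06,
        (True,  True,  False): 0.04, (True,  True,  True): 0.10,
    }
    hf, sn, ie = 0, 1, 2  # tuple positions in each atomic event

    print(prob(lambda e: e[sn], HSI))            # P(sn)     ≈ 0.34
    print(prob(lambda e: e[hf], HSI))            # P(hf)     ≈ 0.21
    print(prob(lambda e: e[sn] and e[ie], HSI))  # P(sn,ie)  ≈ 0.20
    print(prob(lambda e: e[hf] and e[sn], HSI))  # P(hf,sn)  ≈ 0.14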
14. Reasoning under Uncertainty using Full Joint Distributions
- Assume three boolean RVs, Hayfever (HF), Sneeze (SN), ItchyEyes (IE)
- and fictional probabilities

  HF     SN     IE     Probability
  false  false  false  0.50
  false  false  true   0.09
  false  true   false  0.10
  false  true   true   0.10
  true   false  false  0.01
  true   false  true   0.06
  true   true   false  0.04
  true   true   true   0.10

- P(a|e) = P(a, e) / P(e)
- P(hf | sn) = P(hf, sn) / P(sn) = 0.14 / 0.34 ≈ 0.41
- P(hf | ie) = P(hf, ie) / P(ie) = 0.16 / 0.35 ≈ 0.46
- (see the sketch below)
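A conditional query is then one division over the helpers already sketched (the name cond is my own):

    # P(a|e) = P(a, e) / P(e)
    def cond(holds_a, holds_e, fjd):
        joint = prob(lambda ev: holds_a(ev) and holds_e(ev), fjd)
        return joint / prob(holds_e, fjd)

    print(cond(lambda e: e[hf], lambda e: e[sn], HSI))  # P(hf|sn) ≈ 0.41
    print(cond(lambda e: e[hf], lambda e: e[ie], HSI))  # P(hf|ie) ≈ 0.46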
15. Reasoning under Uncertainty using Full Joint Distributions
- Assume three boolean RVs, Hayfever (HF), Sneeze (SN), ItchyEyes (IE)
- and fictional probabilities

  HF     SN     IE     Probability
  false  false  false  0.50
  false  false  true   0.09
  false  true   false  0.10
  false  true   true   0.10
  true   false  false  0.01
  true   false  true   0.06
  true   true   false  0.04
  true   true   true   0.10

- P(a|e) = P(a, e) / P(e)
- Instead of computing P(e), could use normalization
- P(hf | sn) = 0.14 / P(sn)
- also compute P(¬hf | sn) = 0.20 / P(sn)
- since P(hf | sn) + P(¬hf | sn) = 1, substituting and solving gives P(sn) = 0.34!
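The normalization trick in code, under the same assumptions as the earlier sketches:

    # Unnormalized numerators: P(hf,sn) and P(¬hf,sn).
    num_pos = prob(lambda e: e[hf] and e[sn], HSI)        # 0.14
    num_neg = prob(lambda e: (not e[hf]) and e[sn], HSI)  # 0.20

    # P(hf|sn) + P(¬hf|sn) = 1, so the normalizing constant recovers P(sn).
    alpha = 1 / (num_pos + num_neg)   # 1 / 0.34
    print(alpha * num_pos)            # P(hf|sn)  ≈ 0.41
    print(alpha * num_neg)            # P(¬hf|sn) ≈ 0.59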
16. Combining Multiple Evidence
- As evidence describing the state of the world is accumulated, we'd like to be able to easily update the degree of belief in a conclusion.
- Using the full joint probability distribution table:
- P(v1,...,vk | vk+1,...,vn) = Σ P(V1=v1,...,Vn=vn) / Σ P(Vk+1=vk+1,...,Vn=vn)
- - the sum of all entries in the table where V1=v1, ..., Vn=vn
- - divided by the sum of all entries in the table corresponding to the evidence, where Vk+1=vk+1, ..., Vn=vn
17. Combining Multiple Evidence using Full Joint Distributions
- Assume three boolean RVs, Hayfever (HF), Sneeze (SN), ItchyEyes (IE), and fictional probabilities

  HF     SN     IE     Probability
  false  false  false  0.50
  false  false  true   0.09
  false  true   false  0.10
  false  true   true   0.10
  true   false  false  0.01
  true   false  true   0.06
  true   true   false  0.04
  true   true   true   0.10

- P(a | b, c) = P(a, b, c) / Σ P(b, c), as described on the prior slide
- P(hf | sn, ie) = P(hf, sn, ie) / Σ P(sn, ie) = 0.10 / (0.1 + 0.1) = 0.5
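With the cond helper sketched earlier, multiple evidence just means conjoining both observations in the evidence predicate:

    print(cond(lambda e: e[hf], lambda e: e[sn] and e[ie], HSI))  # P(hf|sn,ie) = 0.5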
18. Combining Multiple Evidence (cont.)
- FJDT techniques are intractable in general because the table size grows exponentially with the number of RVs.
- Independence assertions can help reduce the size of the domain and the complexity of the inference problem.
- Independence assertions are usually based on knowledge of the domain, enabling the FJD table to be factored into separate JD tables.
- it's a good thing that parts of problem domains are often independent
- but typically the subsets of dependent RVs are quite large
19. Probability Rules for Multi-valued Variables
- Summing Out: P(Y) = Σ P(Y, z), summing over all values z of RV Z
- Conditioning: P(Y) = Σ P(Y|z) P(z), summing over all values z of RV Z
- Product Rule: P(X, Y) = P(X|Y) P(Y) = P(Y|X) P(X)
- Chain Rule: P(X, Y, Z) = P(X | Y, Z) P(Y | Z) P(Z)
- - this is a generalization of the product rule with Y replaced by Y, Z
- - the order of the RVs doesn't matter, i.e. any order gives the same result
- Conditionalized Chain Rule (let Y = A, B): P(X, A | B) = P(X | A, B) P(A | B) = P(A | X, B) P(X | B)    (order doesn't matter)
20. Bayes' Rule
- Bayes' Rule: P(b|a) = P(a|b) P(b) / P(a)
- - derived from P(a ∧ b) = P(b|a) P(a) = P(a|b) P(b); just divide both sides of the equation by P(a)
- - the basis of AI systems using probabilistic reasoning
- For example (worked in code below):
- Example 1: a = happy, b = sun; P(sun | happy) = ?
- - P(happy | sun) = 0.95, P(sun) = 0.5, P(happy) = 0.75
- - P(sun | happy) = (0.95 × 0.5) / 0.75 ≈ 0.63
- Example 2: a = sneeze, b = fall; P(fall | sneeze) = ?
- - P(sneeze | fall) = 0.85, P(fall) = 0.25, P(sneeze) = 0.3
- - P(fall | sneeze) = (0.85 × 0.25) / 0.3 ≈ 0.71
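Bayes' rule is one line of arithmetic; a minimal sketch (the function name bayes is mine):

    # P(b|a) = P(a|b) P(b) / P(a)
    def bayes(p_a_given_b, p_b, p_a):
        return p_a_given_b * p_b / p_a

    print(bayes(0.95, 0.5, 0.75))  # P(sun|happy)   ≈ 0.63
    print(bayes(0.85, 0.25, 0.3))  # P(fall|sneeze) ≈ 0.71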
21. Bayes' Rule
- P(b|a) = P(a|b) P(b) / P(a). What's the benefit of being able to calculate P(b|a) from the three probabilities on the right?
- Usefulness of Bayes' Rule:
- - many problems have good estimates of the probabilities on the right
- - P(b|a) is needed to identify cause, classification, diagnosis, etc.
- - the typical use is to calculate diagnostic knowledge from causal knowledge
22. Bayes' Rule
- Causal knowledge: from causes to effects
- - e.g. P(sneeze | cold): probability of effect sneeze given cause common cold
- - this probability the doctor obtains from experience treating patients and understanding the disease process
- Diagnostic knowledge: from effects to causes
- - e.g. P(cold | sneeze): probability of cause common cold given effect sneeze
- - knowing this probability helps a doctor make a disease diagnosis based on a patient's symptoms
- - diagnostic knowledge is more fragile than causal knowledge, since it can change significantly over time given variations in the rate of occurrence of its causes (due to epidemics, etc.)
23. Bayes' Rule
- Using Bayes' Rule with causal knowledge:
- - we want to determine diagnostic knowledge (diagnostic reasoning) that is difficult to obtain from a general population
- - e.g. symptom is s = stiffNeck, disease is m = meningitis
- - P(s|m) = 1/2, the causal knowledge
- - P(m) = 1/50000, P(s) = 1/20, the prior probabilities
- - P(m|s) = ?, the desired diagnostic knowledge
- - P(m|s) = (1/2 × 1/50000) / (1/20) = 1/5000
- - the doctor can now use P(m|s) to guide diagnosis (see the sketch below)
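The bayes helper sketched earlier reproduces the meningitis computation exactly (Fraction keeps the arithmetic exact):

    from fractions import Fraction

    p_m_given_s = bayes(Fraction(1, 2), Fraction(1, 50000), Fraction(1, 20))
    print(p_m_given_s)  # 1/5000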
24. Combining Multiple Evidence using Bayes' Rule
- How do you update the conditional probability of Y given two pieces of evidence A and B?
- General Bayes' Rule for multi-valued RVs: P(Y|X) = P(X|Y) P(Y) / P(X)
- let X = A, B
- P(Y | A, B) = P(A, B | Y) P(Y) / P(A, B) = P(Y) (P(B | A, Y) P(A|Y)) / (P(B|A) P(A))
- = P(Y) (P(A|Y) / P(A)) (P(B | A, Y) / P(B|A))
- - the conditionalized chain rule is used in the numerator, the product rule in the denominator
- Problems:
- - P(B | A, Y) is generally hard to compute or obtain
- - doesn't scale well for n evidence RVs; the table size grows O(2^n)
25. Combining Multiple Evidence using Bayes' Rule
- The problems can be circumvented:
- If A and B are conditionally independent given Y, then P(A, B | Y) = P(A|Y) P(B|Y), and for P(A, B) use the product rule
- - P(Y | A, B) = P(Y) P(A, B | Y) / P(A, B)    (Bayes' Rule, multiple evidence)
- - P(Y | A, B) = P(Y) (P(A|Y) / P(A)) (P(B|Y) / P(B|A))
- - no joint probabilities; the representation grows O(n)
- If A is also unconditionally independent of B, then P(A, B | Y) = P(A|Y) P(B|Y) and P(A, B) = P(A) P(B)
- - P(Y | A, B) = P(Y) P(A, B | Y) / P(A, B)    (Bayes' Rule, multiple evidence)
- - P(Y | A, B) = P(Y) (P(A|Y) / P(A)) (P(B|Y) / P(B))
- - this equation is used to define a naïve Bayes classifier.
26. Combining Multiple Evidence using Bayes' Rule
- Example:
- What is the likelihood that a patient has sclerosing cholangitis?
- - doctor's initial belief: P(sc) = 1/1,000,000
- - examination reveals jaundice: P(j) = 1/10,000, P(j|sc) = 1/5
- - doctor's belief given the test result: P(sc|j) = P(sc) P(j|sc) / P(j) = 2/1000
- - tests reveal fibrosis of the bile ducts: P(f|sc) = 4/5, P(f) = 1/100
- - the doctor naïvely assumes jaundice and fibrosis are independent
- - doctor's belief now rises: P(sc | j, f) = 16/100
- - P(sc | j, f) = P(sc) (P(j|sc) / P(j)) (P(f|sc) / P(f)), i.e. the form P(Y | A, B) = P(Y) (P(A|Y)/P(A)) (P(B|Y)/P(B))
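The doctor's two belief updates as straight arithmetic (a sketch; the variable names are mine):

    p_sc = 1 / 1_000_000                              # initial belief

    p_sc_given_j = p_sc * (1/5) / (1/10_000)          # after jaundice: 2/1000
    p_sc_given_jf = p_sc_given_j * (4/5) / (1/100)    # after fibrosis: 16/100

    print(p_sc_given_j, p_sc_given_jf)                # 0.002 0.16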
27. Naïve Bayes Classifier
- Naïve Bayes classifier: used where a single class is predicted from a number of features, or where a single cause influences a number of effects
- based on P(Y | A, B) = P(Y) (P(A|Y)/P(A)) (P(B|Y)/P(B))
- given RV C:
- - the domain is the possible classifications, say c1, c2, c3
- - classify an input example with features F1, ..., Fn
- compute:
- - P(c1 | F1, ..., Fn), P(c2 | F1, ..., Fn), P(c3 | F1, ..., Fn)
- - naïvely assume the features are independent
- - choose the value for C that gives the maximum probability
- works surprisingly well in practice, even when the independence assumptions aren't true (a sketch follows below)
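A minimal classifier sketch (all names and the toy numbers below are hypothetical). Since the evidence term P(F1, ..., Fn) is identical for every class, it can be dropped and the unnormalized scores compared directly:

    # score(c) is proportional to P(c) * product of P(f_i|c); argmax wins.
    def naive_bayes(priors, likelihoods, features):
        """priors: {class: P(class)};
        likelihoods: {class: {feature: P(feature|class)}};
        features: list of observed feature values."""
        scores = {}
        for c, p_c in priors.items():
            score = p_c
            for f in features:
                score *= likelihoods[c][f]
            scores[c] = score
        return max(scores, key=scores.get)

    # Toy example: classify 'sneeze' + 'itchy' as cold vs. hayfever.
    priors = {"cold": 0.7, "hayfever": 0.3}
    likelihoods = {"cold":     {"sneeze": 0.5, "itchy": 0.1},
                   "hayfever": {"sneeze": 0.9, "itchy": 0.8}}
    print(naive_bayes(priors, likelihoods, ["sneeze", "itchy"]))  # hayfever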
28. Bayesian Networks
- AKA Bayes Nets, Belief Nets, Causal Nets, etc.
- Encodes the full joint probability distribution (FJPD) for the set of RVs defining a problem domain
- Uses a space-efficient data structure by exploiting:
- - the fact that dependencies between RVs are generally local
- - which results in lots of conditionally independent RVs
- Captures both qualitative and quantitative relationships between RVs
29. Bayesian Networks
- Can be used to compute any value in the FJPD
- Can be used to reason:
- - predictive/causal reasoning: forward (top-down) from causes to effects
- - diagnostic reasoning: backward (bottom-up) from effects to causes
30. Bayesian Network Representation
- Is an augmented DAG (i.e. directed, acyclic graph)
- Represented by (V, E) where:
- - V is a set of vertices
- - E is a set of directed edges joining vertices, with no loops
- Each vertex contains:
- - the RV's name
- - either a prior probability distribution or a conditional probability distribution table (CDT) that quantifies the effects of the parents on this RV
- Each directed arc:
- - is from a cause (parent) to its immediate effects (children)
- - represents a direct causal relationship between RVs
31. Bayesian Network Representation
- Example in class
- - each row in a conditional probability table must sum to 1
- - columns don't need to sum to 1
- - values obtained from experts
- The number of probabilities required is typically far fewer than the number required for a FJDT
- Quantitative information is usually given by an expert or determined empirically from data
32. Conditional Independence
- Assume effects are conditionally independent of each other given their common cause
- The net is constructed so that, given its parents, a node is conditionally independent of its non-descendant RVs in the net
- P(X1=x1, ..., Xn=xn) = P(x1 | parents(X1)) × ... × P(xn | parents(Xn))
- Note: the full joint probability distribution isn't needed; we only need conditionals relative to the parent RVs
33. Algorithm for Constructing Bayesian Networks
- 1. Choose a set of relevant random variables
- 2. Choose an ordering for them
- - assume they're X1 .. Xm, where X1 is first, X2 is second, etc.
- 3. For i = 1 to m:
- - add a new node for Xi to the network
- - set Parents(Xi) to be a minimal subset of X1 .. Xi-1 such that Xi is conditionally independent of all other members of X1 .. Xi-1 given Parents(Xi)
- - add a directed arc from each node in Parents(Xi) to Xi
- - non-root nodes define a conditional probability table, P(Xi = x | combinations of Parents(Xi)); root nodes define a prior probability distribution at Xi, P(Xi)
34. Algorithm for Constructing Bayesian Networks
- For a given set of random variables (RVs) there is not, in general, a unique Bayesian Net, but all of them represent the same information
- For the best net, topologically sort the RVs in step 2:
- - each RV comes before all of its children
- - the first nodes are roots, then the nodes they directly influence
- The best Bayesian Network for a problem has:
- - the fewest probabilities and arcs
- - CDT probabilities that are easy to determine
- The algorithm won't construct a net that violates the rules of probability
35. Computing Joint Probabilities using a Bayesian Network
- Use the product rule
- Simplify using independence
- For example, compute P(a,b,c,d) = P(d,c,b,a):
- - order the RVs in the joint probability bottom up: D, C, B, A
- - P(d,c,b,a) = P(d|c,b,a) P(c,b,a)              product rule on P(d,c,b,a)
- - = P(d|c) P(c,b,a)                             conditional independence of D given C
- - = P(d|c) P(c|b,a) P(b,a)                      product rule on P(c,b,a)
- - = P(d|c) P(c|b,a) P(b|a) P(a)                 product rule on P(b,a)
- - = P(d|c) P(c|b,a) P(b) P(a)                   independence of B and A given no evidence
- (a code sketch of this network follows below)
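Here is a minimal sketch of that four-node network in code (the CPT numbers are invented for illustration): A and B are roots, C has parents A and B, D has parent C, and the joint is the product of each node's CPT entry given its parents.

    # Net: {var: (parents, cpt)}, with cpt mapping parent values to P(var=True|...).
    NET = {
        "A": ((), {(): 0.3}),
        "B": ((), {(): 0.6}),
        "C": (("A", "B"), {(True, True): 0.9, (True, False): 0.5,
                           (False, True): 0.4, (False, False): 0.1}),
        "D": (("C",), {(True,): 0.7, (False,): 0.2}),
    }

    # P(v1,...,vn) = product of P(vi | Parents(Vi)); linear in the node count.
    def joint(assign, net):
        p = 1.0
        for var, (parents, cpt) in net.items():
            pt = cpt[tuple(assign[q] for q in parents)]
            p *= pt if assign[var] else 1 - pt
        return p

    # P(a,b,c,d) = P(d|c) P(c|b,a) P(b) P(a) = 0.7 * 0.9 * 0.6 * 0.3
    print(joint({"A": True, "B": True, "C": True, "D": True}, NET))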
36. Computing Joint Probabilities using a Bayesian Network
- Any entry in the full joint dist. table (i.e. any atomic event) can be computed!
- P(v1,...,vn) = Π P(vi | Parents(Vi)), over i from 1 to n
- e.g. given boolean RVs, what is P(a,..,h,k,..,p)?
- P(a) P(b) P(c) P(d|a,b) P(e|b,c) P(f) P(g|d,e) P(h) P(k|f,g) P(l|g,h) P(m|k) P(n|k) P(o|k,l) P(p|l)
- Note: this is fast, i.e. linear in the number of nodes in the net!
37. Computing Joint Probabilities using a Bayesian Network
- How is any joint probability computed?
- - sum the relevant joint probabilities
- e.g. compute P(a,b):
- - P(a,b,c,d) + P(a,b,c,¬d) + P(a,b,¬c,d) + P(a,b,¬c,¬d)
- e.g. compute P(c):
- - P(a,b,c,d) + P(a,¬b,c,d) + P(¬a,b,c,d) + P(¬a,¬b,c,d) + P(a,b,c,¬d) + P(a,¬b,c,¬d) + P(¬a,b,c,¬d) + P(¬a,¬b,c,¬d)
- A BN can answer any query (i.e. probability) about the domain by summing the relevant joint probs.
- Enumeration can require many computations! (see the sketch below)
38. Computing Conditional Probabilities using a Bayesian Network
- The basic task of a probabilistic system is to compute conditional probabilities.
- Any conditional probability can be computed:
- - P(v1,...,vk | vk+1,...,vn) = Σ P(V1=v1,...,Vn=vn) / Σ P(Vk+1=vk+1,...,Vn=vn)
- The key problem is that the technique of enumerating joint probabilities can make the computations intractable (exponential in the number of RVs).
39. Computing Conditional Probabilities using a Bayesian Network
- These computations generally rely on the simplifications resulting from the independence of the RVs.
- Every variable that isn't an ancestor of a query variable or an evidence variable is irrelevant to the query.
- Which ancestors are irrelevant?
40. Independence in a Bayesian Network
- Given a Bayesian Network, how is independence established?
- A node is conditionally independent (CI) of its non-descendants, given its parents.
- e.g. Given D and E, G is CI of ?
41. Independence in a Bayesian Network
- Given a Bayesian Network, how is independence established?
- A node is conditionally independent (CI) of its non-descendants, given its parents.
- e.g. Given D and E, G is CI of ? A, B, C, F, H
- e.g. Given F and G, K is CI of ?
42. Independence in a Bayesian Network
- Given a Bayesian Network, how is independence established?
- A node is conditionally independent of all other nodes in the network given its parents, children, and children's parents, which is called its Markov blanket
- e.g. What is the Markov blanket for G?
- Given this blanket, G is CI of ? A, B, C, M, N, O, P
- What about absolute independence?
43. Computing Conditional Probabilities using a Bayesian Network
- The general algorithm for computing conditional probabilities is complicated.
- It is easy if the query involves nodes that are directly connected to each other.
- - the examples are assumed to use boolean RVs
- Simple causal inference, P(E|C):
- - the conditional prob. dist. of effect E given cause C as evidence
- - reasoning in the same direction as the arc, e.g. disease to symptom
- Simple diagnostic inference, P(Q|E):
- - the conditional prob. dist. of query Q given effect E as evidence
- - reasoning in the direction opposite the arc, e.g. symptom to disease
44. Computing Conditional Probabilities: Causal (Top-Down) Inference
- Compute P(e|c):
- - the conditional probability of effect E=e given cause C=c as evidence
- - assume arcs exist to E from C and from a second cause C2
- 1. Rewrite the conditional probability of e in terms of e and all of its parents (that aren't evidence), given evidence c
- 2. Re-express each joint probability back to the probability of e given all of its parents
- 3. Simplify using independence, and look up the required values in the Bayesian Network
45. Computing Conditional Probabilities: Causal (Top-Down) Inference
- Compute P(e|c)
- = P(e,c) / P(c)                                    product rule
- = (P(e,c,c2) + P(e,c,¬c2)) / P(c)                  marginalizing
- = P(e,c,c2) / P(c) + P(e,c,¬c2) / P(c)             algebra
- = P(e,c2|c) + P(e,¬c2|c)                           product rule, e.g. X = e,c2
- = P(e|c2,c) P(c2|c) + P(e|¬c2,c) P(¬c2|c)          conditionalized chain rule
- Simplify given that C and C2 are independent:
- - P(c2|c) = P(c2)
- - P(¬c2|c) = P(¬c2)
- = P(e|c2,c) P(c2) + P(e|¬c2,c) P(¬c2)              algebra
- now look up the values to finish the computation (a numeric sketch follows below)
46. Computing Conditional Probabilities: Diagnostic (Bottom-Up) Inference
- Compute P(c|e):
- - the conditional probability of cause C=c given effect E=e as evidence
- - assume an arc exists from C to E
- - idea: convert to causal inference using Bayes' rule
- 1. Use Bayes' rule: P(c|e) = P(e|c) P(c) / P(e)
- 2. Compute P(e|c) using the causal inference method
- 3. Look up the value of P(c) in the Bayesian Net
- 4. Use normalization to avoid computing P(e)
- - requires computing P(¬c|e), using steps 1-3 above (see the sketch below)
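Continuing the numeric sketch above with an invented prior P(c), diagnostic inference normalizes two causal computations instead of ever touching P(e):

    p_c = 0.1  # invented prior, as looked up in the net
    # Causal inference for ¬c, with invented CPT entries as before.
    p_e_given_not_c = 0.3 * p_c2 + 0.05 * (1 - p_c2)

    # Unnormalized Bayes' rule numerators; normalization replaces P(e).
    num_c = p_e_given_c * p_c
    num_not_c = p_e_given_not_c * (1 - p_c)
    print(num_c / (num_c + num_not_c))  # P(c|e) ≈ 0.26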
47. Summary: the Good News
- Bayesian Nets are the bread and butter of the AI-uncertainty community (like resolution to AI-logic)
- Bayesian Nets are a compact representation:
- - they don't require exponential storage to hold all of the info in the full joint probability distribution (FJPD) table
- - they are a decomposed representation of the FJPD table
- - the conditional probability distribution tables in non-root nodes are only exponential in the maximum number of parents of any node
- Bayesian Nets are fast at computing joint probs P(V1, ..., Vk), i.e. the prior probability of V1, ..., Vk
- - computing the probability of an atomic event can be done in time linear in the number of nodes in the net
48. Summary: the Bad News
- Conditional probabilities can also be computed:
- - P(Q | E1, ..., Ek), the posterior probability of query Q given multiple evidence E1, ..., Ek
- - this requires enumerating all of the matching entries, which takes time exponential in the number of variables
- - in special cases it can be done faster (in less than polynomial time), e.g. a polytree (a net structured like a tree) takes linear time
- In general, inference in Bayesian Networks (BNs) is NP-hard.
- - but BNs are well studied, so there exist many efficient exact solution methods as well as a variety of approximation techniques