1
Review
2
Belief and Probability
  • The connection between toothaches and cavities is
    not a logical consequence in either direction.
  • However, we can provide a degree of belief in such
    sentences.
  • We usually get this belief from statistical data.
  • Assigning probability 0 to a sentence corresponds
    to an unequivocal belief that the sentence is
    false.
  • Assigning probability 1 to a sentence corresponds
    to an unequivocal belief that the sentence is
    true.

3
Syntax
  • Basic element: random variable
  • Possible worlds are defined by assignments of
    values to random variables.
  • Boolean random variables, e.g.,
  • Cavity (do I have a cavity?)
  • Discrete random variables, e.g.,
  • Weather is one of ⟨sunny, rainy, cloudy, snow⟩
  • Domain values must be exhaustive and mutually
    exclusive.
  • Elementary propositions, e.g.,
  • Weather = sunny (abbreviated as sunny)
  • Cavity = false (abbreviated as ¬cavity)
  • Complex propositions are formed from elementary
    propositions and standard logical connectives,
    e.g.,
  • Weather = sunny ∨ Cavity = false
  • sunny ∨ ¬cavity

4
Atomic events
  • Atomic event: a complete specification of the
    state of the world.
  • E.g., if the world is described by only two
    Boolean variables, Cavity and Toothache, then
    there are 4 distinct atomic events:
  • Cavity = false ∧ Toothache = false
  • Cavity = false ∧ Toothache = true
  • Cavity = true ∧ Toothache = false
  • Cavity = true ∧ Toothache = true
  • Atomic events are mutually exclusive and
    exhaustive.

5
Joint probability
  • Joint probability distribution for a set of
    random variables gives the probability of every
    atomic event on those random variables.
  • If we consider all the variables, then the joint
    probability distribution is called the full joint
    probability distribution.
  • A full joint distribution specifies the
    probability of every atomic event.
  • Any probabilistic question about a domain can be
    answered by the full joint distribution.

6
Prior and Conditional probability
  • Prior or unconditional probability associated
    with a proposition is the degree of belief
    accorded to it in the absence of any other
    information.
  • P(Cavity = true) = 0.1 (or P(cavity) = 0.1)
  • P(Weather = sunny) = 0.7 (or P(sunny) = 0.7)
  • Conditional or posterior probabilities, e.g.,
  • P(cavity | toothache) = 0.8
  • i.e., given that toothache is all I know
  • Definition of conditional probability:
    P(a | b) = P(a ∧ b) / P(b)
  • Product rule: P(a ∧ b) = P(a | b) P(b)

7
Chain rule
  • The chain rule is derived by successive
    application of the product rule:
    P(x1, ..., xn) = P(xn | x1, ..., xn-1) P(x1, ..., xn-1)
                   = ∏i P(xi | x1, ..., xi-1)

8
Inference by enumeration
  • Suppose we are given a proposition f. Start with
    the joint probability distribution
  • Sum up the probabilities of the atomic events
    where f is true.

9
Inference by enumeration
10
Inference by enumeration
11
Inference by enumeration
  • We can also compute conditional probabilities

12
Normalization Constant
13
Hidden Variables
  • What does it mean to compute P(cavity ∧ toothache)?
  • We sum up 0.108 + 0.012 = 0.12,
  • i.e., the probabilities of the atomic events that
    make the proposition cavity ∧ toothache true. So,
    P(cavity ∧ toothache) = 0.12 (see the sketch
    below).
  • The variable Catch is called a hidden
    variable.
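
A minimal sketch of this kind of lookup in code. The full joint table itself is
not reproduced in this transcript, so the entries below are the standard
Toothache/Catch/Cavity example values, chosen to be consistent with the
0.108 + 0.012 sum above.

# Inference by enumeration over an (assumed) full joint distribution.
joint = {
    # (toothache, catch, cavity): probability
    (True,  True,  True):  0.108, (True,  False, True):  0.012,
    (False, True,  True):  0.072, (False, False, True):  0.008,
    (True,  True,  False): 0.016, (True,  False, False): 0.064,
    (False, True,  False): 0.144, (False, False, False): 0.576,
}

def prob(pred):
    """Sum the probabilities of the atomic events satisfying pred."""
    return sum(p for event, p in joint.items() if pred(*event))

p_cavity_and_toothache = prob(lambda t, c, cav: cav and t)   # 0.108 + 0.012 = 0.12
p_toothache = prob(lambda t, c, cav: t)                      # 0.20
print(p_cavity_and_toothache / p_toothache)                  # P(cavity | toothache) = 0.6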

14
Independence
  • We can write
    P(toothache, catch, cavity, cloudy)
      = P(cloudy | toothache, catch, cavity) P(toothache, catch, cavity)
      = P(cloudy) P(toothache, catch, cavity)
  • Thus, the 32-element table for four variables can
    be constructed from one 8-element table and one
    4-element table!
  • A and B are independent if for each a, b in the
    domains of A and B respectively, we have
  • P(a | b) = P(a), or
  • P(b | a) = P(b), or
  • P(a, b) = P(a) P(b)
  • Absolute independence is powerful but rare.

15
Bayes' Rule
  • Product rule: P(a ∧ b) = P(a | b) P(b) = P(b | a) P(a)
  • Bayes' rule: P(a | b) = P(b | a) P(a) / P(b)
  • Useful for assessing diagnostic probability from
    causal probability:
    P(Cause | Effect) = P(Effect | Cause) P(Cause) / P(Effect)

16
Bayes rule (contd)
  • Let s be the proposition that the patient has a
    stiff neck,
  • and m be the proposition that the patient has
    meningitis.
  • P(s | m) = 0.5
  • P(m) = 1/50000
  • P(s) = 1/20
  • P(m | s) = P(s | m) P(m) / P(s)
  •          = (0.5) × (1/50000) / (1/20)
  •          = 0.0002
  • That is, we expect only 1 in 5000 patients with a
    stiff neck to have meningitis (see the check
    below).
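
As a quick check, the arithmetic of this example can be reproduced directly,
using only the numbers quoted above:

# Bayes' rule for the meningitis example.
p_s_given_m = 0.5        # P(s | m)
p_m = 1 / 50000          # P(m)
p_s = 1 / 20             # P(s)

p_m_given_s = p_s_given_m * p_m / p_s
print(p_m_given_s)       # 0.0002, i.e. 1 in 5000 stiff-neck patients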

17
More than two variables
Now, the Naïve Bayes model makes the following
assumption: although Effect1, ..., Effectn might
not be independent in general, they are
independent given the value of Cause. This is
called conditional independence. E.g., if I have
a cavity, the probability that the probe catches
in it doesn't depend on whether I have a
toothache. We write this as P(toothache, catch,
cavity) = P(toothache | cavity) · P(catch |
cavity) · P(cavity).
18
Naïve Bayes
What about when the Naïve Bayes assumption
doesn't hold? Then we have a network of
inter-dependencies. Let's first review
conditional independence.
For finding α we also need to compute the
corresponding unnormalized term for the other
class value; α is then the reciprocal of their sum.
19
Conditional Independence Equations
  • Equivalent statements:
  • P(toothache, catch | cavity) = P(toothache |
    cavity) P(catch | cavity)
  • P(toothache | catch, cavity) = P(toothache |
    cavity)
  • In general:
    P(Effect1, ..., Effectn | Cause) = ∏i P(Effecti | Cause)

20
Conditional Independence (contd)
  • We can write out the full joint distribution using
    the chain rule, e.g.:
  • P(toothache, catch, cavity)
    = P(toothache | catch, cavity) P(catch, cavity)
    = P(toothache | catch, cavity) P(catch | cavity)
      P(cavity)
    = P(toothache | cavity) P(catch | cavity)
      P(cavity)   (because of conditional independence)
  • In most cases, the use of conditional
    independence reduces the size of the
    representation of the joint distribution from
    exponential in n to linear in n.
  • Conditional independence is our most basic and
    robust form of knowledge about uncertain
    environments.
21
Bayesian Networks Motivation
  • The full joint probability can be used to answer
    any question about the domain,
  • but it becomes intractable as the number of
    variables grows.
  • Furthermore, specifying probabilities of atomic
    events is rather unnatural and can be very
    difficult.

22
Bayesian networks
  • Syntax:
  • a set of nodes, one per variable
  • a directed, acyclic graph (a link means
    "directly influences")
  • a conditional distribution for each node given
    its parents:
  • P(Xi | Parents(Xi))
  • The conditional distribution is represented as a
    conditional probability table (CPT) giving the
    distribution over Xi for each combination of
    parent values.

23
Example
  • Topology of network encodes conditional
    independence assertions
  • Weather is independent of the other variables
  • Toothache and Catch are conditionally independent
    given Cavity, which is indicated by the absence
    of a link between them.

24
Another Example
The topology shows that burglary and earthquakes
directly affect the probability of the alarm, but
whether Mary or John calls depends only on the
alarm. Thus, our assumptions are that they
don't perceive any burglaries directly, and they
don't confer before calling.
25
Semantics
  • The full joint distribution is defined as the
    product of the local conditional distributions:
  • P(x1, ..., xn) = ∏i=1..n P(xi | parents(Xi))
  • e.g.,
  • P(j ∧ m ∧ a ∧ ¬b ∧ ¬e)
    = P(j | a) P(m | a) P(a | ¬b, ¬e) P(¬b) P(¬e)
    (evaluated numerically in the sketch below)
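
This product can be evaluated directly once the CPT entries are known. The CPT
numbers are not reproduced in this transcript, so the values below are the
usual burglary-network figures and should be read as an assumption:

# One joint entry as a product of CPT entries (assumed standard CPT values).
p_not_b, p_not_e = 1 - 0.001, 1 - 0.002   # P(¬b), P(¬e)
p_a_given_nb_ne = 0.001                   # P(a | ¬b, ¬e)
p_j_given_a, p_m_given_a = 0.90, 0.70     # P(j | a), P(m | a)

p = p_j_given_a * p_m_given_a * p_a_given_nb_ne * p_not_b * p_not_e
print(p)                                  # ≈ 0.000628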

26
Inference in Bayesian Networks
  • The basic task for any probabilistic inference
    system is to compute the posterior probability
    of a query variable, given some observed events
    (or effects), that is, some assignment of values
    to a set of evidence variables.
  • A typical query:
  • P(X | e1, ..., em)
  • We could ask: What's the probability of a
    burglary if both Mary and John call?
  • P(burglary | johncalls, marycalls)?

27
Inference by enumeration
  • Sum out variables from the joint without actually
    constructing its explicit representation

Here e (earthquake) and a (alarm) are values of the
hidden variables; all possible values of e and a
have to be considered.
Now, rewrite the full joint entries using products
of CPT entries.
28
Numerically
Complete it as an exercise.
  • P(b | j,m) = α P(b) Σe P(e) Σa P(a | b,e) P(j | a) P(m | a)
    ≈ α × 0.00059
  • P(¬b | j,m) = α P(¬b) Σe P(e) Σa P(a | ¬b,e) P(j | a) P(m | a)
    ≈ α × 0.0015
  • P(B | j,m) = α ⟨0.00059, 0.0015⟩ ≈ ⟨0.284, 0.716⟩
    (see the sketch below)
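
A sketch of this enumeration in code. The CPTs are again not in the transcript,
so the standard textbook values are assumed; with them the computation
reproduces the 0.00059 / 0.0015 and ⟨0.284, 0.716⟩ figures above.

# Inference by enumeration for P(B | j=true, m=true) in the burglary network.
P_B = {True: 0.001, False: 0.999}
P_E = {True: 0.002, False: 0.998}
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(a=true | b, e)
P_J = {True: 0.90, False: 0.05}                      # P(j=true | a)
P_M = {True: 0.70, False: 0.01}                      # P(m=true | a)

def unnormalized(b):
    """Sum out the hidden variables E and A for evidence j=true, m=true."""
    total = 0.0
    for e in (True, False):
        for a in (True, False):
            p_a = P_A[(b, e)] if a else 1 - P_A[(b, e)]
            total += P_B[b] * P_E[e] * p_a * P_J[a] * P_M[a]
    return total

scores = {b: unnormalized(b) for b in (True, False)}        # ≈ 0.00059, 0.0015
alpha = 1 / sum(scores.values())
print({b: round(alpha * s, 3) for b, s in scores.items()})  # {True: 0.284, False: 0.716}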

29
Machine Learning
30
Pseudo-code for 1R
  • For each attribute,
  •   For each value of the attribute, make a rule as
      follows:
  •     count how often each class appears
  •     find the most frequent class
  •     make the rule assign that class to this
        attribute-value
  •   Calculate the error rate of the rules
  • Choose the rules with the smallest error rate
    (a sketch of this procedure follows below)
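
A minimal 1R sketch in Python, assuming instances are dictionaries of attribute
values and labels is a parallel list of class values (the names one_r,
instances, labels are illustrative, not from the slides):

# 1R: build one rule set per attribute; keep the attribute with the fewest errors.
from collections import Counter

def one_r(instances, labels, attributes):
    """Return (best_attribute, rules) where rules maps value -> predicted class."""
    best = None
    for attr in attributes:
        rules, errors = {}, 0
        for value in set(x[attr] for x in instances):
            # Count how often each class appears with this attribute value
            counts = Counter(y for x, y in zip(instances, labels) if x[attr] == value)
            majority, hits = counts.most_common(1)[0]
            rules[value] = majority                    # most frequent class wins
            errors += sum(counts.values()) - hits      # errors for this value
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best[0], best[1]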

31
Evaluating the weather attributes
Arbitrarily breaking the tie between the
first and third rule sets, we pick the
first. Oddly enough, the game is played when it's
overcast and rainy but not when it's sunny.
Perhaps it's an indoor pursuit.
32
Statistical modeling: probabilities for the
weather data
33
Naïve Bayes for classification
  • Classification learning: what's the probability
    of the class given an instance?
  • e: instance
  • h: class value for the instance
  • Naïve Bayes assumption: the evidence can be split
    into independent parts (i.e., the attributes of
    the instance!)
  • P(h | e) = P(e | h) P(h) / P(e)
  •          = P(e1 | h) P(e2 | h) ... P(en | h) P(h) / P(e)

34
The weather data example
  • P(Play=yes | e) =
  •   P(Outlook=Sunny | Play=yes)
  •   × P(Temp=Cool | Play=yes)
  •   × P(Humidity=High | Play=yes)
  •   × P(Windy=True | Play=yes)
  •   × P(Play=yes) / P(e)
  • = (2/9) × (3/9) × (3/9) × (3/9) × (9/14) / P(e)
    = 0.0053 / P(e)
  • Don't worry about the 1/P(e): it's α, the
    normalization constant.

35
The weather data example
P(Play=no | e) = P(Outlook=Sunny | Play=no)
× P(Temp=Cool | Play=no)
× P(Humidity=High | Play=no)
× P(Windy=True | Play=no)
× P(Play=no) / P(e)
= (3/5) × (1/5) × (4/5) × (3/5) × (5/14) / P(e)
= 0.0206 / P(e)
36
Normalization constant
  • α = 1/P(e) = 1/(0.0053 + 0.0206)
  • So,
  • P(Play=yes | e) = 0.0053 / (0.0053 + 0.0206)
    = 20.5%
  • P(Play=no | e) = 0.0206 / (0.0053 + 0.0206)
    = 79.5%
    (see the sketch below)
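
The same arithmetic in a few lines of Python, using only the counts quoted on
the previous two slides:

# Naive Bayes on the weather example: unnormalized scores, then normalization.
from math import prod

likelihoods = {
    "yes": [2/9, 3/9, 3/9, 3/9],   # Sunny, Cool, High, Windy=True given Play=yes
    "no":  [3/5, 1/5, 4/5, 3/5],   # the same attribute values given Play=no
}
priors = {"yes": 9/14, "no": 5/14}

scores = {c: prod(likelihoods[c]) * priors[c] for c in priors}  # ≈ 0.0053, 0.0206
alpha = 1 / sum(scores.values())
print({c: round(alpha * s, 3) for c, s in scores.items()})      # {'yes': 0.205, 'no': 0.795}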

37
The zero-frequency problem
  • What if an attribute value doesn't occur with
    every class value (e.g., Humidity = High for
    class Play = yes)?
  • The probability P(Humidity=High | Play=yes) would
    be zero!
  • The a posteriori probability would then also be
    zero,
  • no matter how likely the other values are!
  • Remedy: add 1 to the count for every attribute
    value-class combination (Laplace estimator),
  • i.e., initialize the counters to 1 instead of 0.
  • Result: probabilities will never be zero
    (this also stabilizes the probability estimates).
    A sketch of the smoothed estimate follows below.
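
One common way to write the Laplace-smoothed estimate (the function name and
the correction of the denominator by the number of attribute values are
assumptions, not shown on the slide):

# Laplace-smoothed estimate of P(attribute = value | class).
def laplace_estimate(count_value_in_class, count_class, n_values):
    """Add 1 to each value count and n_values to the class total."""
    return (count_value_in_class + 1) / (count_class + n_values)

# A value never seen with Play=yes (0 of 9), for an attribute with 3 values:
print(laplace_estimate(0, 9, 3))   # 1/12 ≈ 0.083 instead of 0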

38
Constructing Decision Trees
  • Normal procedure: top-down, in a recursive
    divide-and-conquer fashion
  • First, an attribute is selected for the root node
    and a branch is created for each possible
    attribute value
  • Then the instances are split into subsets (one
    for each branch extending from the node)
  • Finally, the same procedure is repeated
    recursively for each branch, using only the
    instances that reach the branch
  • The process stops if all instances have the same
    class

39
Which attribute to select?
(Figures (a)-(d): the candidate splits, one per
attribute.)
40
A criterion for attribute selection
  • Which is the best attribute?
  • The one which will result in the smallest tree.
  • Heuristic: choose the attribute that produces the
    purest nodes.
  • A popular impurity criterion is the entropy of the
    nodes:
  • the lower the entropy, the purer the node.
  • Strategy: choose the attribute that results in the
    lowest entropy of the children nodes.

41
Example: attribute Outlook
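
The figure for this example is not in the transcript; the sketch below shows
the entropy computation it illustrates, assuming the usual weather-data counts
(Outlook = sunny: 2 yes / 3 no, overcast: 4/0, rainy: 3/2):

# Entropy of a node and the expected entropy after splitting on an attribute.
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def split_entropy(children):
    """Weighted average entropy of the child nodes after a split."""
    total = sum(sum(c) for c in children)
    return sum(sum(c) / total * entropy(c) for c in children)

outlook = [[2, 3], [4, 0], [3, 2]]   # [yes, no] counts per Outlook value
print(entropy([9, 5]))               # ≈ 0.94 before the split
print(split_entropy(outlook))        # ≈ 0.69 after splitting on Outlook
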
42
The final decision tree
  • Note: not all leaves need to be pure; sometimes
    identical instances have different classes.
  • ⇒ Splitting stops when the data can't be split any
    further.

43
Numerical attributes
  • Tests in nodes can be of the form xj > constant.
  • This divides the space into rectangles.

44
Considering splits
  • The only thing we really need to do differently
    in our algorithm is to consider splitting between
    each data point in each dimension.
  • So, in our bankruptcy domain, we'd consider 9
    different splits in the R dimension
  • In general, we'd expect to consider m - 1 splits,
    if we have m data points
  • But in our data set we have some examples with
    equal R values.

45
Considering splits II
  • And there are another 6 possible splits in the L
    dimension,
  • because L is really an integer, so there are lots
    of duplicate L values.

46
Bankruptcy Example
47
Bankruptcy Example
  • We consider all the possible splits in each
    dimension, and compute the average entropies of
    the children.
  • And we see that, conveniently, all the points
    with L not greater than 1.5 are of class 0, so we
    can make a leaf there.

48
Bankruptcy Example
  • Now, we consider all the splits of the remaining
    part of space.
  • Note that we have to recalculate all the average
    entropies again, because the points that fall
    into the leaf node are taken out of consideration.

49
Bankruptcy Example
  • Now the best split is at R > 0.9. And we see that
    all the points for which that's true are
    positive, so we can make another leaf.

50
Bankruptcy Example
  • Continuing in this way, we finally obtain

51
Rules Covering
  • A strategy for generating a rule set directly is
    based on covering.
  • A rule "if lhs then rhs" covers a data instance if
    the tests in lhs are true for that instance.

52
Example: contact lenses data
53
Example: contact lenses data
The numbers on the right show the fraction of
correct instances in the set singled out by
that choice (the second numbers show the coverage).
In this case, "correct" means that the
recommendation is "hard".
54
Selecting a Test for the LHS of a Rule
  • Goal: maximize accuracy
  • t: total number of instances covered by the rule
  • p: positive examples of the class covered by the
    rule
  • t - p: number of errors made by the rule
  • ⇒ Select the test that maximizes the ratio p/t
    (see the sketch below)
  • We are finished when p/t = 1 or the set of
    instances can't be split any further
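
A small sketch of this selection step; the representation of candidate tests as
(name, predicate) pairs is an assumption for illustration:

# Pick the candidate LHS test with the highest accuracy p/t on the covered set.
def best_test(candidate_tests, instances, labels, target_class):
    best, best_ratio = None, -1.0
    for name, test in candidate_tests:
        covered = [y for x, y in zip(instances, labels) if test(x)]
        t = len(covered)                                  # instances covered
        p = sum(1 for y in covered if y == target_class)  # correct ones
        if t > 0 and p / t > best_ratio:
            best, best_ratio = name, p / t
    return best, best_ratio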

55
Modified rule and resulting data
The rule isn't very accurate, getting only 4 out
of the 12 instances that it covers. So, it needs
further refinement.
56
Further refinement
57
Modified rule and resulting data
Should we stop here? Perhaps. But let's say we
are going for exact rules, no matter how complex
they become. So, let's refine further.
58
Further refinement
59
The result
60
Linear Hypothesis Class
  • Equation of a hyperplane in the feature space:
    w · x + b = 0
  • w and b are to be learned.
  • In two dimensions, we can see the geometric
    interpretation of w and b.
  • The vector w is perpendicular to the linear
    separator; such a vector is known as the normal
    vector.
  • The scalar b, which is called the offset, is
    proportional to the perpendicular distance from
    the origin to the linear separator.
  • The constant of proportionality is the negative
    of the magnitude of the normal vector.

61
Hyperplane Geometry
62
Linear classifier
  • We can now exploit the sign of this distance to
    define a linear classifier, one whose decision
    boundary is a hyperplane.
  • Instead of using 0 and 1 as the class labels
    (which was an arbitrary choice anyway), we use the
    sign of the distance, either +1 or -1, as the
    labels (these are the values of the yi's).

63
Margin
  • A variant of the signed distance of a training
    point to a hyperplane is the margin of the point.

Margin: γi = yi (w · xi + b), proportional to the
perpendicular distance of point xi to the
hyperplane.
γi > 0: the point is correctly classified (the sign
of the distance equals yi).
γi < 0: the point is incorrectly classified (the
sign of the distance differs from yi).
64
Training: the Perceptron algorithm
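
The algorithm itself is not reproduced in this transcript; the sketch below is
a standard perceptron training loop driven by the margin yi (w · xi + b) from
the previous slide (eta and the epoch limit are illustrative choices):

# Perceptron training: update w, b on every misclassified point.
import numpy as np

def perceptron(X, y, epochs=100, eta=1.0):
    """X: (n, d) array of points; y: labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (np.dot(w, xi) + b) <= 0:   # margin <= 0: misclassified
                w += eta * yi * xi              # push the separator toward xi
                b += eta * yi
                mistakes += 1
        if mistakes == 0:                       # converged on separable data
            break
    return w, b
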
65
Artificial Neural Networks
  • The basic idea in neural nets is to define
    interconnected networks of simple units (let's
    call them "artificial neurons") in which each
    connection has a weight.
  • Weight wij is the weight of the ith input into
    unit j.
  • The networks have some inputs where the feature
    values are placed and they compute one or more
    output values.
  • Each output unit corresponds to a class. The
    network prediction is the output whose value is
    highest.
  • The learning takes place by adjusting the weights
    in the network so that the desired output is
    produced whenever a sample in the input data set
    is presented.

66
Single Perceptron Unit
  • We start by looking at a simpler kind of
    "neural-like" unit called a perceptron.
  • This is where the perceptron algorithm that we
    saw earlier came from.
  • Perceptrons antedate the modern neural nets.
  • A perceptron unit basically compares a weighted
    combination of its inputs against a threshold
    value and then outputs a 1 if the weighted inputs
    exceed the threshold.
  • Trick: we treat the (arbitrary) threshold as if
    it were a weight w0 on a constant input x0 whose
    value is 1.
  • In this way, we can write the basic rule of
    operation as computing the weighted sum of all
    the inputs and comparing it to 0.

67
Linear Classifier: Single Perceptron Unit
68
Multi-Layer Perceptron
  • Yes: the introduction of "hidden" units into
    these networks makes them much more powerful;
  • they are no longer limited to linearly separable
    problems.
  • Earlier layers transform the problem into more
    tractable problems for the later layers.

Check XOR example in the full slides.
69
Multi-Layer Perceptron Learning
  • Any set of training points can be separated by a
    three-layer perceptron network.
  • Almost any set of points is separable by a
    two-layer perceptron network.
  • However, the presence of the discontinuous
    threshold in the operation means that there is no
    simple local search for a good set of weights;
  • one is forced into trying possibilities in a
    combinatorial way.
  • The limitations of the single-layer perceptron
    and the lack of a good learning algorithm for
    multilayer perceptrons essentially killed the
    field for quite a few years.

70
Sigmoid Unit
  • The key property of the sigmoid is that it is
    differentiable.
  • The output of this function (y) varies smoothly
    with changes in the weights.
  • So, we can use gradient descent (see the sketch
    below).
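
A minimal sketch of the sigmoid and its derivative; the y(1 - y) form is the
one the delta rules on the following slides rely on:

# Sigmoid activation and its derivative expressed in terms of the output y.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_deriv_from_output(y):
    """dy/dz for y = sigmoid(z)."""
    return y * (1.0 - y)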

71
Gradient Descent
Here we see the gradient of the training error as
a function of the weights. The descent rule is
basically to change the weights by taking a small
step (determined by the learning rate η) in the
direction opposite this gradient:
Δwij = -η ∂E/∂wij.
Online version: we consider each time only the
error for one data item.
72
Had we a single unit
Substituting into the equation of the previous
slide, we get (for an arbitrary ith element) the
delta rule.
73
Derivative of the sigmoid
74
Generalized Delta Rule
In general, for a hidden unit j we have
δj = yj (1 - yj) Σk wjk δk,
where the sum ranges over the units k immediately
downstream of j (immediate downstream only).
75
Generalized Delta Rule
  • For an output unit n we have
    δn = yn (1 - yn) (yn - y),

where yn is the output of this output unit and y
is the real class of the data instance in focus.
76
Backpropagation Algorithm
  • Initialize the weights to small random values.
  • Choose a random training sample, say (xm, ym).
  • Compute the total input zj and output yj for each
    unit (forward propagation).
  • Compute δn for the output layer:
    δn = yn (1 - yn) (yn - ynm).
  • Compute δj for all preceding layers by the
    backprop rule.
  • Compute the weight changes by the descent rule
    (repeat for all weights; see the sketch below).
  • Note that each expression involves only data local
    to a particular unit; we don't have to look
    around, summing things over the whole network.
  • It is for this reason (simplicity, locality and,
    therefore, efficiency) that backpropagation has
    become the dominant paradigm for training neural
    nets.
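
A compact sketch of one such update for a network with a single hidden layer of
sigmoid units, following the steps above (the shapes and names W1, W2, b1, b2,
eta are illustrative assumptions):

# One backpropagation step: forward pass, deltas, then the descent rule.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, target, W1, b1, W2, b2, eta=0.1):
    # Forward propagation: compute the z's and y's for each unit
    y1 = sigmoid(W1 @ x + b1)             # hidden-layer outputs
    y2 = sigmoid(W2 @ y1 + b2)            # network outputs

    # Output layer: delta_n = y_n (1 - y_n) (y_n - target_n)
    d2 = y2 * (1 - y2) * (y2 - target)
    # Hidden layer: backpropagate the deltas through the downstream weights
    d1 = y1 * (1 - y1) * (W2.T @ d2)

    # Descent rule: each weight change uses only quantities local to its unit
    W2 -= eta * np.outer(d2, y1);  b2 -= eta * d2
    W1 -= eta * np.outer(d1, x);   b1 -= eta * d1
    return W1, b1, W2, b2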

77
Backpropagation Example
First do forward propagation: compute the zi's and
yi's.

(Figure: a small example network with inputs x1, x2
and constant -1 bias inputs, two hidden units 1 and
2 with weights w01, w11, w21 and w02, w12, w22, and
an output unit 3 with weights w03, w13, w23.)