Title: CS G120 Artificial Intelligence
1. CS G120 Artificial Intelligence
- Prof. C. Hafner
- Class Notes March 26, 2009
2. Outline
- Learning agents
- Inductive learning
- Decision tree learning
- Bayesian learning
3. Learning
- Learning is essential for unknown environments, i.e., when the designer lacks omniscience
- Learning is useful as a system construction method, i.e., expose the agent to reality rather than trying to write it down
- Learning modifies the agent's decision mechanisms to improve performance
4. How do people learn?
- By experience
- By being told (in person, reading, TV, etc.)
- Inductive learning framework
- Agent gets positive and negative examples of some concept
- A major problem is overfitting
- Compare with learning a new model or theory, such as the balance of power among the 3 branches of US government
- (Skill acquisition not considered)
5. Inductive Learning Elements
- Design of a learning element is affected by
- Which components of the performance element are to be learned
- What feedback is available to learn these components
- What representation is used for the components
- Type of feedback
- Supervised learning: correct answers given for each example
- Unsupervised learning: correct answers not given
- Reinforcement learning: occasional rewards
6. Learning frequently applied to
- Classification problems
- Finite number of classes, pre-defined features
- Score(Ci) = Σ wi Fi
- Apply supervised learning to select the weights
- Also applied to finding a good heuristic function for searching
- H = Σ wi Fi
- Apply supervised or reinforcement learning to select the weights (see the sketch below)
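To make the weighted-feature score concrete, here is a minimal sketch (not from the notes) of classifying with Score(Ci) = Σ wi Fi and adjusting the weights from labeled examples with a simple perceptron-style supervised update; the function names and learning rate are illustrative assumptions.

```python
# Minimal sketch: classification by weighted feature sums, with weights
# adjusted from labeled examples (perceptron-style supervised update).
# Feature names, weights, and the learning rate are illustrative assumptions.

def score(weights, features):
    """Score(Ci) = sum of w_i * F_i for one class (weights and features share keys)."""
    return sum(weights[f] * v for f, v in features.items())

def classify(class_weights, features):
    """Pick the class with the highest weighted-feature score."""
    return max(class_weights, key=lambda c: score(class_weights[c], features))

def train(class_weights, examples, rate=0.1):
    """One supervised pass: move weights toward the correct class, away from the wrong guess."""
    for features, label in examples:
        guess = classify(class_weights, features)
        if guess != label:
            for f, v in features.items():
                class_weights[label][f] += rate * v
                class_weights[guess][f] -= rate * v
    return class_weights
```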
7. Inductive Learning agents
8. Learning decision trees
- Problem: decide whether to wait for a table at a restaurant, based on the following attributes
- Alternate: is there an alternative restaurant nearby?
- Bar: is there a comfortable bar area to wait in?
- Fri/Sat: is today Friday or Saturday?
- Hungry: are we hungry?
- Patrons: number of people in the restaurant (None, Some, Full)
- Price: price range ($, $$, $$$)
- Raining: is it raining outside?
- Reservation: have we made a reservation?
- Type: kind of restaurant (French, Italian, Thai, Burger)
- WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60)
9. Decision trees
- One possible representation for hypotheses
- E.g., here is the "true" tree for deciding whether to wait
10. Attribute-based representations
- Examples described by attribute values (Boolean, discrete, continuous)
- E.g., situations where I will/won't wait for a table
- Classification of examples is positive (T) or negative (F)
11. Expressiveness
- Decision trees can express any function of the input attributes
- E.g., for Boolean functions, truth table row → path to leaf
- Trivially, there is a consistent decision tree for any training set, with one path to a leaf for each example (unless f is nondeterministic in x), but it probably won't generalize to new examples
- Prefer to find more compact decision trees
12. Example contd.
- Decision tree learned from the 12 examples
- Substantially simpler than the true tree: a more complex hypothesis isn't justified by the small amount of data
13. Hypothesis spaces
- How many distinct decision trees with n Boolean attributes?
- = number of Boolean functions
- = number of distinct truth tables with 2^n rows = 2^(2^n)
- E.g., with 6 Boolean attributes, there are 2^64 = 18,446,744,073,709,551,616 trees
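A quick way to check the 2^(2^n) count (the 6-attribute number above is 2^64):

```python
# Number of distinct Boolean functions of n attributes = 2 ** (2 ** n).
for n in range(1, 7):
    print(n, 2 ** (2 ** n))   # n = 6 prints 18446744073709551616
```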
14. Decision tree learning
- Aim: find a small tree consistent with the training examples
- Idea: (recursively) choose the "most significant" attribute as the root of each (sub)tree
15. Choosing an attribute
- Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative"
- Patrons? is a better choice
16. Using information theory
- To implement Choose-Attribute in the DTL algorithm
- Information Content (Entropy):
- I(P(v1), ..., P(vn)) = Σi −P(vi) log2 P(vi)
- For a training set containing p positive examples and n negative examples, the entropy of the whole set is I(p/(p+n), n/(p+n)) (see the sketch below)
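A minimal sketch of the entropy formula above, specialized to a training set with p positive and n negative examples (the function names are mine, not part of the DTL algorithm):

```python
import math

def I(*probs):
    """I(P(v1), ..., P(vn)) = sum_i -P(vi) * log2 P(vi)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def training_set_entropy(p, n):
    """Entropy of a training set with p positive and n negative examples."""
    return I(p / (p + n), n / (p + n))

print(training_set_entropy(6, 6))   # 1.0 bit, as computed on slide 18
```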
17. Information gain
- A chosen attribute A divides a collection C into subsets E1, ..., Ev according to their values for A, where A has v distinct values
- Information Gain (IG) = reduction in entropy from the attribute test: IG(A) = H(parent) − Σj P(Ej) · H(Ej)
- Choose the attribute with the largest IG
18. Information gain
- For the training set, p = n = 6, so I(6/12, 6/12) = 1 bit
- Consider the attributes Patrons and Type (and others too)
- Patrons has the highest IG of all attributes and so is chosen by the DTL algorithm as the root (see the sketch below)
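The sketch below reproduces this comparison. The split counts for Patrons (None: 0+/2−, Some: 4+/0−, Full: 2+/4−) and Type (French 1+/1−, Italian 1+/1−, Thai 2+/2−, Burger 2+/2−) are assumptions recalled from the standard 12-example restaurant data in AIMA; they are not listed in these notes.

```python
import math

def I(p, n):
    """Entropy (bits) of a collection with p positive and n negative examples."""
    total = p + n
    return -sum(x / total * math.log2(x / total) for x in (p, n) if x > 0)

def gain(splits, p=6, n=6):
    """IG(A) = I(p/(p+n), n/(p+n)) minus the weighted entropy of A's subsets."""
    remainder = sum((pi + ni) / (p + n) * I(pi, ni) for pi, ni in splits)
    return I(p, n) - remainder

print(gain([(0, 2), (4, 0), (2, 4)]))            # Patrons: about 0.541 bits
print(gain([(1, 1), (1, 1), (2, 2), (2, 2)]))    # Type: 0.0 bits
```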
19. Information Theory and Entropy
- We measure the quantity of information by the resources needed to represent/store/transmit the information
- Messages are sequences of 0s and 1s (dots/dashes), which we call bits (for binary digits)
- You need to send a message containing the identity of a spy
- It is known to be Mr. Brown or Mr. Smith
- You can send the message with 1 bit; therefore the event "the spy is Smith" has 1 bit of information
20. Calculating quantity of information
- Def: A uniform distribution over a set of possible outcomes (X1 . . . Xn) means the outcomes are equally probable; that is, they each have probability 1/n
- Suppose there are 8 people who could be the spy. Then the message requires 3 bits. If there are 64 possible spies, the message requires 6 bits, etc. (assuming a uniform distribution)
- Def: The information quantity of a message where the (uniform) probability of each value is p: I = −log2 p bits
21. Intuition and Examples
- Intuitively, the more surprising a message is, the more information it contains. If there are 64 equally probable spies, we are more surprised by the identity of the spy than if there are only two equally probable spies.
- There are 26 letters in the alphabet. Assuming they are equally probable, how much information is in each letter? I = −log2(1/26) = log2 26 ≈ 4.7 bits
- Assuming the digits 0 to 9 are equally probable, will the information in each digit be more or less than the information in each letter? (See the sketch below.)
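Checking the letter figure and the digits question with the same I = −log2 p formula (10 equally probable values carry log2 10 ≈ 3.32 bits, so each digit carries less information than a letter):

```python
import math

# I = -log2(p) for a uniformly distributed value with probability p.
print(-math.log2(1 / 26))   # letters: about 4.70 bits each
print(-math.log2(1 / 10))   # digits: about 3.32 bits each (less than a letter)
```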
22. Sequences of messages
- Things get interesting when we look beyond a single message to a long sequence of messages
- Consider a 4-sided die with symbols A, B, C, D
- Let 00 = A, 01 = B, 10 = C, 11 = D
- Each message is 2 bits. If you throw the die 800 times, you get a message 1600 bits long
- That's the best you can do if A, B, C, D are equally probable
23. Non-uniform distributions (cont.)
- Consider a 4-sided die with symbols A, B, C, D
- But assume P(A) = 7/8 and P(B) = P(C) = P(D) = 1/24
- We can take advantage of that with a different code:
- 0 = A, 10 = B, 110 = C, 111 = D
- If we throw the die 800 times, what is the expected length of the message? What is the entropy? (See the sketch below.)
- ENTROPY is the average information (in bits) of events in a long repeated sequence
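A small sketch answering both questions for this die: the expected length of the variable-length code above, versus the entropy of the distribution (numbers are computed here, not quoted from the notes).

```python
import math

probs = {'A': 7 / 8, 'B': 1 / 24, 'C': 1 / 24, 'D': 1 / 24}
code_len = {'A': 1, 'B': 2, 'C': 3, 'D': 3}     # codewords 0, 10, 110, 111

# Expected bits per throw with this code, and for 800 throws.
per_throw = sum(probs[s] * code_len[s] for s in probs)
print(per_throw, 800 * per_throw)               # about 1.21 bits/throw, about 967 bits total

# Entropy: average information per throw, a lower bound for any code's average length.
print(-sum(p * math.log2(p) for p in probs.values()))   # about 0.74 bits/throw
```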
24. Entropy
- Formula for entropy with outcomes x1 . . . xn:
- H = −Σi P(xi) log2 P(xi) bits
- For a uniform distribution this is the same as −log2 P(x1), since all the P(xi) are the same
- What does it mean? Consider a 6-sided die with equally probable outcomes
- −log2(1/6) ≈ 2.58 tells us a long sequence of die throws can be transmitted using 2.58 bits per throw on average, and this is the theoretical best
25. Review/Explain Entropy
- Entropy is sometimes called "disorder": it represents the lack of predictability as to the outcome for any element of a sequence (or set)
- If a set has just one outcome, entropy = 1 · (−log2 1) = 0
- If there are 2 outcomes, then 50/50 probability gives the maximum entropy: complete unpredictability. This generalizes to any uniform distribution over n outcomes.
- −(0.5 log2 0.5 + 0.5 log2 0.5) = 1 bit
- Note: log2(1/2) = −log2(2) = −1
26. Calculating Entropy
- Consider a biased coin: P(heads) = 3/4, P(tails) = 1/4
- What is the entropy of a coin toss outcome?
- H = 1/4 · (−log2(1/4)) + 3/4 · (−log2(3/4)) ≈ 0.811 bits
- Using the Information Theory Log Table:
- H = 0.25 × 2.0 + 0.75 × 0.415 = 0.5 + 0.311 = 0.811
- A fair coin toss has more information (1 bit)
- The more unbalanced the probabilities, the more predictable the outcome, and the less you learn from each message
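A one-line check of the biased-coin arithmetic above:

```python
import math

# H = 1/4 * (-log2 1/4) + 3/4 * (-log2 3/4)
print(0.25 * 2.0 - 0.75 * math.log2(0.75))   # about 0.811 bits
```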
27. Maximum disorder
[Figure: H (entropy in bits) on the vertical axis vs. probability of x1 on the horizontal axis, reaching its maximum of 1 bit at probability 1/2]
- Entropy for a set containing 2 possible outcomes (x1, x2)
- What if there are 3 possible outcomes? For the equal-probability case, H = −log2(1/3) ≈ 1.58
28. Define classification tree and ID3 algorithm
- Def: Given a table with one result attribute and several designated predictor attributes, a classification tree for that table is a tree such that
- Each leaf node is labeled with a value of the result attribute
- Each non-leaf node is labeled with the name of a predictor
- Each link is labeled with one value of its parent's predictor
- Def: the ID3 algorithm takes a table as input and learns a classification tree that efficiently maps predictor value sets into their results from the table
29. A trivial example of a classification tree
[Figure: a classification tree with root node Color; the red branch leads to the leaf apple, and the yellow branch leads to a Shape node whose round branch leads to lemon and whose oblong branch leads to banana]
- The goal is to create an efficient classification tree which always gives the same answer as the table
30. A well-known toy example: sunburn data
- Predictor attributes: hair, height, weight, lotion
31. [Figure: decision-tree fragment for the sunburn data, with hair-color branches Blonde, Brown, Red and class labels N, Y]
32. Review the algorithm
- Create the root, and make its COLLECTION the entire table
- Select any non-singular leaf node N to SPLIT
- Choose the best attribute A for splitting N (use info theory)
- For each value of A (a1, a2, . . .) create a child of N, Nai
- Label the links from N to its children A = ai
- SPLIT the collection of N among its children according to their values of A
- When no more non-singular leaf nodes exist, the tree is finished (see the sketch below)
- Def: a singular node is one whose COLLECTION includes just one value for the result attribute
- (therefore its entropy = 0)
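A minimal runnable sketch of the algorithm reviewed above, assuming each example is a dict of predictor values plus a 'result' key; the function names (id3, best_attribute) and the majority-vote fallback for inconsistent data are my additions.

```python
# Sketch of an ID3-style learner: recursively split on the attribute with the
# largest information gain until every leaf's collection is singular.
import math
from collections import Counter

def entropy(examples):
    """Entropy (bits) of the result values in a collection of examples."""
    counts = Counter(e['result'] for e in examples)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def best_attribute(examples, attributes):
    """Attribute with the largest information gain on this collection."""
    def remainder(attr):
        rem = 0.0
        for value in {e[attr] for e in examples}:
            subset = [e for e in examples if e[attr] == value]
            rem += len(subset) / len(examples) * entropy(subset)
        return rem
    return max(attributes, key=lambda a: entropy(examples) - remainder(a))

def id3(examples, attributes):
    """Return a tree: a result value (leaf) or (attribute, {value: subtree})."""
    if entropy(examples) == 0:                 # singular node: one result value
        return examples[0]['result']
    if not attributes:                         # inconsistent data: take the majority
        return Counter(e['result'] for e in examples).most_common(1)[0][0]
    attr = best_attribute(examples, attributes)
    children = {}
    for value in {e[attr] for e in examples}:
        subset = [e for e in examples if e[attr] == value]
        children[value] = id3(subset, [a for a in attributes if a != attr])
    return (attr, children)
```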
33. Choosing the best attribute to SPLIT: the one that is MOST INFORMATIVE (highest IG), i.e., the one that reduces the entropy (DISORDER) the most
- Assume there are k attributes we can choose. For each one, we compute how much less entropy exists in the resulting children than we had in the parent: IG = H(N) − weighted sum of H(children of N)
- Each child's entropy is weighted by the probability of that child (estimated by the proportion of the parent's collection that would be transferred to the child in the split)
34. C = {S, D, X, A, E, P, J, K} (3, 5)
- Calculate entropy: H(C) = −3/8 log2(3/8) − 5/8 log2(5/8) = 0.53 + 0.424 = 0.954
- Find the information gain (IG) for all 4 predictors: hair, height, weight, lotion
- Start with Lotion, values (yes, no):
  Child 1 (yes): {D, X, K} (0, 3), entropy 0
  Child 2 (no): {S, A, E, P, J} (3, 2), entropy = −3/5 log2(3/5) − 2/5 log2(2/5) = 0.971
  Child set entropy = 3/8 × 0 + 5/8 × 0.971 = 0.607
  IG(Lotion) = 0.954 − 0.607 = 0.347
- Then try Hair color, values (blond, brown, red):
  Child 1 (blond): {S, D, A, K} (2, 2), entropy 1
  Child 2 (brown): {X, P, J} (0, 3), entropy 0
  Child 3 (red): {E} (1, 0), entropy 0
  Child set entropy = 4/8 × 1 + 3/8 × 0 + 1/8 × 0 = 0.5
  IG(Hair color) = 0.954 − 0.5 = 0.454
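The numbers above (and the Height and Weight gains on the next slide) can be reproduced from the (positive, negative) counts alone; a short sketch, with helper names of my choosing:

```python
import math

def H(p, n):
    """Entropy (bits) of a collection with p positive and n negative examples."""
    total = p + n
    return -sum(x / total * math.log2(x / total) for x in (p, n) if x > 0)

root = (3, 5)                                   # C = {S,D,X,A,E,P,J,K}

def ig(children):
    """IG = H(root) minus the weighted sum of the child entropies."""
    total = sum(p + n for p, n in children)
    return H(*root) - sum((p + n) / total * H(p, n) for p, n in children)

print(H(*root))                                 # about 0.954
print(ig([(0, 3), (3, 2)]))                     # Lotion: about 0.347
print(ig([(2, 2), (0, 3), (1, 0)]))             # Hair color: about 0.454
```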
35. Next try Height, values (average, tall, short):
  Child 1 (average): {S, E, J} (2, 1), entropy = −2/3 log2(2/3) − 1/3 log2(1/3) = 0.92
  Child 2 (tall): {D, P} (0, 2), entropy 0
  Child 3 (short): {X, A, K} (1, 2), entropy 0.92
  Child set entropy = 3/8 × 0.92 + 2/8 × 0 + 3/8 × 0.92 = 0.69
  IG(Height) = 0.954 − 0.69 = 0.26
- Next try Weight . . . IG(Weight) = 0.954 − 0.94 = 0.014
- So Hair color wins. Draw the first split and assign the collections:
[Figure: root node N1 splits on Hair Color; the Red branch is a "yes" leaf, the Brown branch is a "no" leaf, and the Blond branch leads to node S2 with collection C = {S, D, A, K} (2, 2), entropy 1]
36. Node S2: C = {S, D, A, K} (2, 2), entropy 1
- Start with Lotion, values (yes, no):
  Child 1 (yes): {D, K} (0, 2), entropy 0
  Child 2 (no): {S, A} (2, 0), entropy 0
  Child set entropy = 0
  IG(Lotion) = 1 − 0 = 1
- No reason to go any farther
[Figure: the finished tree. Node S1 splits on Hair Color: Red → yes, Brown → no, Blond → node S2. S2 splits on Lotion: yes → no, no → yes]
37. Limitations of DTL
- Inconsistency
- Can use majority, if enough data
- Missing data
- Overfitting (a problem with all inductive learning)
- Multivalued attributes
- Use gain ratio (see the sketch below)
- Numerical attributes
- Search for split points that maximize IG
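For the multivalued-attribute point above, a hedged sketch of the gain-ratio idea (as in C4.5): divide the information gain by the "split information" of the attribute, so attributes with many small branches are penalized. The demo counts are the sunburn root-node counts; the function names are mine.

```python
import math

def entropy(counts):
    """Entropy (bits) of a list of class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def gain_ratio(parent_counts, child_counts):
    """child_counts: one class-count list per attribute value."""
    total = sum(parent_counts)
    remainder = sum(sum(c) / total * entropy(c) for c in child_counts)
    gain = entropy(parent_counts) - remainder
    split_info = entropy([sum(c) for c in child_counts])  # penalizes many small branches
    return gain / split_info if split_info else 0.0

# Hair color on the sunburn root: blond (2,2), brown (0,3), red (1,0)
print(gain_ratio([3, 5], [[2, 2], [0, 3], [1, 0]]))
```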
38. Performance Evaluation of DTL
- Training set/test set division
- Addresses the overfitting problem to some extent
- K-fold cross-validation (k = 5, 10, or N; see the sketch below)
- The problem of peeking (parameter setting and evaluation require separate test sets)
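A minimal sketch of k-fold cross-validation as described above; learn and accuracy are hypothetical stand-ins for a learner (e.g., the ID3 sketch earlier) and an evaluation function.

```python
def k_fold_cross_validation(examples, k, learn, accuracy):
    """Average test accuracy over k train/test splits (k = len(examples) gives leave-one-out)."""
    folds = [examples[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]
        train = [e for j, fold in enumerate(folds) if j != i for e in fold]
        model = learn(train)                  # train only on the other k-1 folds
        scores.append(accuracy(model, test))  # evaluate on the held-out fold
    return sum(scores) / k
```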
39. Performance measurement
- Learning curve: % correct on the test set as a function of training set size