Title: CS G120 Artificial Intelligence
1. CS G120 Artificial Intelligence
- Prof. C. Hafner
- Class Notes March 26, 2009
2. Outline
- Learning agents
- Inductive learning
- Decision tree learning
- Bayesian learning
3. Learning
- Learning is essential for unknown environments, i.e., when the designer lacks omniscience
- Learning is useful as a system construction method, i.e., expose the agent to reality rather than trying to write it down
- Learning modifies the agent's decision mechanisms to improve performance
4. How do people learn?
- By experience
- By being told (in person, reading, TV, etc.)
- Inductive learning framework
- Agent gets positive and negative examples of some concept
- A major problem is overfitting
- Compare with learning a new model or theory, such as the balance of power among the 3 branches of US government
- (Skill acquisition not considered)
5. Inductive Learning Elements
- Design of a learning element is affected by
- Which components of the performance element are to be learned
- What feedback is available to learn these components
- What representation is used for the components
- Type of feedback
- Supervised learning: correct answers given for each example
- Unsupervised learning: correct answers not given
- Reinforcement learning: occasional rewards
6. Learning frequently applied to
- Classification problems
- Finite number of classes, pre-defined features
- Score(Ci) = Σ wi Fi
- Apply supervised learning to select the weights
- Also applied to finding a good heuristic function for searching
- H = Σ wi Fi
- Apply supervised or reinforcement learning to select the weights (see the sketch below)
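To make the weighted-feature score concrete, here is a minimal sketch (not from the notes) of classifying with Score(Ci) = Σ wi Fi and adjusting the weights from labeled examples with a simple perceptron-style supervised update; the function names and learning rate are illustrative assumptions.

```python
# Minimal sketch: classification by weighted feature sums, with weights
# adjusted from labeled examples (perceptron-style supervised update).
# Feature names, weights, and the learning rate are illustrative assumptions.

def score(weights, features):
    """Score(Ci) = sum of w_i * F_i for one class (weights and features share keys)."""
    return sum(weights[f] * v for f, v in features.items())

def classify(class_weights, features):
    """Pick the class with the highest weighted-feature score."""
    return max(class_weights, key=lambda c: score(class_weights[c], features))

def train(class_weights, examples, rate=0.1):
    """One supervised pass: move weights toward the correct class, away from the wrong guess."""
    for features, label in examples:
        guess = classify(class_weights, features)
        if guess != label:
            for f, v in features.items():
                class_weights[label][f] += rate * v
                class_weights[guess][f] -= rate * v
    return class_weights
```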
7. Inductive Learning agents
8. Learning decision trees
- Problem: decide whether to wait for a table at a restaurant, based on the following attributes
- Alternate: is there an alternative restaurant nearby?
- Bar: is there a comfortable bar area to wait in?
- Fri/Sat: is today Friday or Saturday?
- Hungry: are we hungry?
- Patrons: number of people in the restaurant (None, Some, Full)
- Price: price range ($, $$, $$$)
- Raining: is it raining outside?
- Reservation: have we made a reservation?
- Type: kind of restaurant (French, Italian, Thai, Burger)
- WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60)
9. Decision trees
- One possible representation for hypotheses
- E.g., here is the "true" tree for deciding whether to wait
10. Attribute-based representations
- Examples described by attribute values (Boolean, discrete, continuous)
- E.g., situations where I will/won't wait for a table
- Classification of examples is positive (T) or negative (F)
11. Expressiveness
- Decision trees can express any function of the input attributes
- E.g., for Boolean functions, truth table row → path to leaf
- Trivially, there is a consistent decision tree for any training set, with one path to a leaf for each example (unless f is nondeterministic in x), but it probably won't generalize to new examples
- Prefer to find more compact decision trees
12. Example contd.
- Decision tree learned from the 12 examples
- Substantially simpler than the true tree: a more complex hypothesis isn't justified by the small amount of data
13. Hypothesis spaces
- How many distinct decision trees with n Boolean attributes?
- = number of Boolean functions
- = number of distinct truth tables with 2^n rows = 2^(2^n)
- E.g., with 6 Boolean attributes, there are 2^64 = 18,446,744,073,709,551,616 trees
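A quick way to check the 2^(2^n) count (the 6-attribute number above is 2^64):

```python
# Number of distinct Boolean functions of n attributes = 2 ** (2 ** n).
for n in range(1, 7):
    print(n, 2 ** (2 ** n))   # n = 6 prints 18446744073709551616
```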
14. Decision tree learning
- Aim: find a small tree consistent with the training examples
- Idea: (recursively) choose the "most significant" attribute as the root of each (sub)tree
15. Choosing an attribute
- Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative"
- Patrons? is a better choice
16. Using information theory
- To implement Choose-Attribute in the DTL algorithm
- Information Content (Entropy):
- I(P(v1), ..., P(vn)) = Σi −P(vi) log2 P(vi)
- For a training set containing p positive examples and n negative examples, the entropy of the whole set is I(p/(p+n), n/(p+n)) (see the sketch below)
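A minimal sketch of the entropy formula above, specialized to a training set with p positive and n negative examples (the function names are mine, not part of the DTL algorithm):

```python
import math

def I(*probs):
    """I(P(v1), ..., P(vn)) = sum_i -P(vi) * log2 P(vi)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def training_set_entropy(p, n):
    """Entropy of a training set with p positive and n negative examples."""
    return I(p / (p + n), n / (p + n))

print(training_set_entropy(6, 6))   # 1.0 bit, as computed on slide 18
```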
17. Information gain
- A chosen attribute A divides a collection C into subsets E1, ..., Ev according to their values for A, where A has v distinct values
- Information Gain (IG) = reduction in entropy from the attribute test: IG(A) = H(parent) − Σj P(Ej) · H(Ej)
- Choose the attribute with the largest IG
18. Information gain
- For the training set, p = n = 6, so I(6/12, 6/12) = 1 bit
- Consider the attributes Patrons and Type (and others too)
- Patrons has the highest IG of all attributes and so is chosen by the DTL algorithm as the root (see the sketch below)
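The sketch below reproduces this comparison. The split counts for Patrons (None: 0+/2−, Some: 4+/0−, Full: 2+/4−) and Type (French 1+/1−, Italian 1+/1−, Thai 2+/2−, Burger 2+/2−) are assumptions recalled from the standard 12-example restaurant data in AIMA; they are not listed in these notes.

```python
import math

def I(p, n):
    """Entropy (bits) of a collection with p positive and n negative examples."""
    total = p + n
    return -sum(x / total * math.log2(x / total) for x in (p, n) if x > 0)

def gain(splits, p=6, n=6):
    """IG(A) = I(p/(p+n), n/(p+n)) minus the weighted entropy of A's subsets."""
    remainder = sum((pi + ni) / (p + n) * I(pi, ni) for pi, ni in splits)
    return I(p, n) - remainder

print(gain([(0, 2), (4, 0), (2, 4)]))            # Patrons: about 0.541 bits
print(gain([(1, 1), (1, 1), (2, 2), (2, 2)]))    # Type: 0.0 bits
```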
19. Information Theory and Entropy
- We measure the quantity of information by the resources needed to represent/store/transmit the information
- Messages are sequences of 0s and 1s (dots/dashes), which we call bits (for binary digits)
- You need to send a message containing the identity of a spy
- It is known to be Mr. Brown or Mr. Smith
- You can send the message with 1 bit; therefore the event "the spy is Smith" has 1 bit of information
20. Calculating quantity of information
- Def: A uniform distribution over a set of possible outcomes (X1 . . . Xn) means the outcomes are equally probable; that is, they each have probability 1/n
- Suppose there are 8 people who could be the spy. Then the message requires 3 bits. If there are 64 possible spies, the message requires 6 bits, etc. (assuming a uniform distribution)
- Def: The information quantity of a message where the (uniform) probability of each value is p: I = −log2 p bits
21. Intuition and Examples
- Intuitively, the more surprising a message is, the more information it contains. If there are 64 equally probable spies, we are more surprised by the identity of the spy than if there are only two equally probable spies.
- There are 26 letters in the alphabet. Assuming they are equally probable, how much information is in each letter? I = −log2(1/26) = log2 26 ≈ 4.7 bits
- Assuming the digits 0 to 9 are equally probable, will the information in each digit be more or less than the information in each letter? (See the sketch below.)
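Checking the letter figure and the digits question with the same I = −log2 p formula (10 equally probable values carry log2 10 ≈ 3.32 bits, so each digit carries less information than a letter):

```python
import math

# I = -log2(p) for a uniformly distributed value with probability p.
print(-math.log2(1 / 26))   # letters: about 4.70 bits each
print(-math.log2(1 / 10))   # digits: about 3.32 bits each (less than a letter)
```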
22. Sequences of messages
- Things get interesting when we look beyond a single message to a long sequence of messages
- Consider a 4-sided die with symbols A, B, C, D
- Let 00 = A, 01 = B, 10 = C, 11 = D
- Each message is 2 bits. If you throw the die 800 times, you get a message 1600 bits long
- That's the best you can do if A, B, C, D are equally probable
23. Non-uniform distributions (cont.)
- Consider a 4-sided die with symbols A, B, C, D
- But assume P(A) = 7/8 and P(B) = P(C) = P(D) = 1/24
- We can take advantage of that with a different code:
- 0 = A, 10 = B, 110 = C, 111 = D
- If we throw the die 800 times, what is the expected length of the message? What is the entropy? (See the sketch below.)
- ENTROPY is the average information (in bits) of events in a long repeated sequence
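A small sketch answering both questions for this die: the expected length of the variable-length code above, versus the entropy of the distribution (numbers are computed here, not quoted from the notes).

```python
import math

probs = {'A': 7 / 8, 'B': 1 / 24, 'C': 1 / 24, 'D': 1 / 24}
code_len = {'A': 1, 'B': 2, 'C': 3, 'D': 3}     # codewords 0, 10, 110, 111

# Expected bits per throw with this code, and for 800 throws.
per_throw = sum(probs[s] * code_len[s] for s in probs)
print(per_throw, 800 * per_throw)               # about 1.21 bits/throw, about 967 bits total

# Entropy: average information per throw, a lower bound for any code's average length.
print(-sum(p * math.log2(p) for p in probs.values()))   # about 0.74 bits/throw
```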
24. Entropy
- Formula for entropy with outcomes x1 . . . xn:
- H = −Σi P(xi) log2 P(xi) bits
- For a uniform distribution this is the same as −log2 P(x1), since all the P(xi) are the same
- What does it mean? Consider a 6-sided die with equally probable outcomes
- −log2(1/6) ≈ 2.58 tells us a long sequence of die throws can be transmitted using 2.58 bits per throw on average, and this is the theoretical best
25. Review/Explain Entropy
- Entropy is sometimes called "disorder": it represents the lack of predictability as to the outcome for any element of a sequence (or set)
- If a set has just one outcome, entropy = 1 · (−log2 1) = 0
- If there are 2 outcomes, then 50/50 probability gives the maximum entropy: complete unpredictability. This generalizes to any uniform distribution over n outcomes.
- −(0.5 log2 0.5 + 0.5 log2 0.5) = 1 bit
- Note: log2(1/2) = −log2(2) = −1
26. Calculating Entropy
- Consider a biased coin: P(heads) = 3/4, P(tails) = 1/4
- What is the entropy of a coin toss outcome?
- H = 1/4 · (−log2(1/4)) + 3/4 · (−log2(3/4)) ≈ 0.811 bits
- Using the Information Theory Log Table:
- H = 0.25 × 2.0 + 0.75 × 0.415 = 0.5 + 0.311 = 0.811
- A fair coin toss has more information (1 bit)
- The more unbalanced the probabilities, the more predictable the outcome, and the less you learn from each message
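A one-line check of the biased-coin arithmetic above:

```python
import math

# H = 1/4 * (-log2 1/4) + 3/4 * (-log2 3/4)
print(0.25 * 2.0 - 0.75 * math.log2(0.75))   # about 0.811 bits
```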
27. Maximum disorder
[Figure: H (entropy in bits) on the vertical axis vs. probability of x1 on the horizontal axis, reaching its maximum of 1 bit at probability 1/2]
- Entropy for a set containing 2 possible outcomes (x1, x2)
- What if there are 3 possible outcomes? For the equal-probability case, H = −log2(1/3) ≈ 1.58
28. Define classification tree and ID3 algorithm
- Def: Given a table with one result attribute and several designated predictor attributes, a classification tree for that table is a tree such that
- Each leaf node is labeled with a value of the result attribute
- Each non-leaf node is labeled with the name of a predictor
- Each link is labeled with one value of its parent's predictor
- Def: the ID3 algorithm takes a table as input and learns a classification tree that efficiently maps predictor value sets into their results from the table
29. A trivial example of a classification tree
[Figure: a classification tree with root node Color; the red branch leads to the leaf apple, and the yellow branch leads to a Shape node whose round branch leads to lemon and whose oblong branch leads to banana]
- The goal is to create an efficient classification tree which always gives the same answer as the table
30. A well-known toy example: sunburn data
- Predictor attributes: hair, height, weight, lotion
31. [Figure: decision-tree fragment for the sunburn data, with hair-color branches Blonde, Brown, Red and class labels N, Y]
32. Review the algorithm
- Create the root, and make its COLLECTION the entire table
- Select any non-singular leaf node N to SPLIT
- Choose the best attribute A for splitting N (use info theory)
- For each value of A (a1, a2, . . .) create a child of N, Nai
- Label the links from N to its children A = ai
- SPLIT the collection of N among its children according to their values of A
- When no more non-singular leaf nodes exist, the tree is finished (see the sketch below)
- Def: a singular node is one whose COLLECTION includes just one value for the result attribute
- (therefore its entropy = 0)
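A minimal runnable sketch of the algorithm reviewed above, assuming each example is a dict of predictor values plus a 'result' key; the function names (id3, best_attribute) and the majority-vote fallback for inconsistent data are my additions.

```python
# Sketch of an ID3-style learner: recursively split on the attribute with the
# largest information gain until every leaf's collection is singular.
import math
from collections import Counter

def entropy(examples):
    """Entropy (bits) of the result values in a collection of examples."""
    counts = Counter(e['result'] for e in examples)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def best_attribute(examples, attributes):
    """Attribute with the largest information gain on this collection."""
    def remainder(attr):
        rem = 0.0
        for value in {e[attr] for e in examples}:
            subset = [e for e in examples if e[attr] == value]
            rem += len(subset) / len(examples) * entropy(subset)
        return rem
    return max(attributes, key=lambda a: entropy(examples) - remainder(a))

def id3(examples, attributes):
    """Return a tree: a result value (leaf) or (attribute, {value: subtree})."""
    if entropy(examples) == 0:                 # singular node: one result value
        return examples[0]['result']
    if not attributes:                         # inconsistent data: take the majority
        return Counter(e['result'] for e in examples).most_common(1)[0][0]
    attr = best_attribute(examples, attributes)
    children = {}
    for value in {e[attr] for e in examples}:
        subset = [e for e in examples if e[attr] == value]
        children[value] = id3(subset, [a for a in attributes if a != attr])
    return (attr, children)
```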
33. Choosing the best attribute to SPLIT: the one that is MOST INFORMATIVE (highest IG), i.e., the one that reduces the entropy (DISORDER) the most
- Assume there are k attributes we can choose. For each one, we compute how much less entropy exists in the resulting children than we had in the parent: IG = H(N) − weighted sum of H(children of N)
- Each child's entropy is weighted by the probability of that child (estimated by the proportion of the parent's collection that would be transferred to the child in the split)
34. C = {S, D, X, A, E, P, J, K} (3, 5)
- Calculate entropy: H(C) = −3/8 log2(3/8) − 5/8 log2(5/8) = 0.53 + 0.424 = 0.954
- Find the information gain (IG) for all 4 predictors: hair, height, weight, lotion
- Start with Lotion, values (yes, no):
  Child 1 (yes): {D, X, K} (0, 3), entropy 0
  Child 2 (no): {S, A, E, P, J} (3, 2), entropy = −3/5 log2(3/5) − 2/5 log2(2/5) = 0.971
  Child set entropy = 3/8 × 0 + 5/8 × 0.971 = 0.607
  IG(Lotion) = 0.954 − 0.607 = 0.347
- Then try Hair color, values (blond, brown, red):
  Child 1 (blond): {S, D, A, K} (2, 2), entropy 1
  Child 2 (brown): {X, P, J} (0, 3), entropy 0
  Child 3 (red): {E} (1, 0), entropy 0
  Child set entropy = 4/8 × 1 + 3/8 × 0 + 1/8 × 0 = 0.5
  IG(Hair color) = 0.954 − 0.5 = 0.454
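The numbers above (and the Height and Weight gains on the next slide) can be reproduced from the (positive, negative) counts alone; a short sketch, with helper names of my choosing:

```python
import math

def H(p, n):
    """Entropy (bits) of a collection with p positive and n negative examples."""
    total = p + n
    return -sum(x / total * math.log2(x / total) for x in (p, n) if x > 0)

root = (3, 5)                                   # C = {S,D,X,A,E,P,J,K}

def ig(children):
    """IG = H(root) minus the weighted sum of the child entropies."""
    total = sum(p + n for p, n in children)
    return H(*root) - sum((p + n) / total * H(p, n) for p, n in children)

print(H(*root))                                 # about 0.954
print(ig([(0, 3), (3, 2)]))                     # Lotion: about 0.347
print(ig([(2, 2), (0, 3), (1, 0)]))             # Hair color: about 0.454
```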
35. Next try Height, values (average, tall, short):
  Child 1 (average): {S, E, J} (2, 1), entropy = −2/3 log2(2/3) − 1/3 log2(1/3) = 0.92
  Child 2 (tall): {D, P} (0, 2), entropy 0
  Child 3 (short): {X, A, K} (1, 2), entropy 0.92
  Child set entropy = 3/8 × 0.92 + 2/8 × 0 + 3/8 × 0.92 = 0.69
  IG(Height) = 0.954 − 0.69 = 0.26
- Next try Weight . . . IG(Weight) = 0.954 − 0.94 = 0.014
- So Hair color wins. Draw the first split and assign the collections:
[Figure: root node N1 splits on Hair Color; the Red branch is a "yes" leaf, the Brown branch is a "no" leaf, and the Blond branch leads to node S2 with collection C = {S, D, A, K} (2, 2), entropy 1]
36. Node S2: C = {S, D, A, K} (2, 2), entropy 1
- Start with Lotion, values (yes, no):
  Child 1 (yes): {D, K} (0, 2), entropy 0
  Child 2 (no): {S, A} (2, 0), entropy 0
  Child set entropy = 0
  IG(Lotion) = 1 − 0 = 1
- No reason to go any farther
[Figure: the finished tree. Node S1 splits on Hair Color: Red → yes, Brown → no, Blond → node S2. S2 splits on Lotion: yes → no, no → yes]
37. Limitations of DTL
- Inconsistency
- Can use majority, if enough data
- Missing data
- Overfitting (a problem with all inductive learning)
- Multivalued attributes
- Use gain ratio (see the sketch below)
- Numerical attributes
- Search for split points that maximize IG
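For the multivalued-attribute point above, a hedged sketch of the gain-ratio idea (as in C4.5): divide the information gain by the "split information" of the attribute, so attributes with many small branches are penalized. The demo counts are the sunburn root-node counts; the function names are mine.

```python
import math

def entropy(counts):
    """Entropy (bits) of a list of class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def gain_ratio(parent_counts, child_counts):
    """child_counts: one class-count list per attribute value."""
    total = sum(parent_counts)
    remainder = sum(sum(c) / total * entropy(c) for c in child_counts)
    gain = entropy(parent_counts) - remainder
    split_info = entropy([sum(c) for c in child_counts])  # penalizes many small branches
    return gain / split_info if split_info else 0.0

# Hair color on the sunburn root: blond (2,2), brown (0,3), red (1,0)
print(gain_ratio([3, 5], [[2, 2], [0, 3], [1, 0]]))
```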
38. Performance Evaluation of DTL
- Training set/test set division
- Addresses the overfitting problem to some extent
- K-fold cross-validation (k = 5, 10, or N; see the sketch below)
- The problem of peeking (parameter setting and evaluation require separate test sets)
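A minimal sketch of k-fold cross-validation as described above; learn and accuracy are hypothetical stand-ins for a learner (e.g., the ID3 sketch earlier) and an evaluation function.

```python
def k_fold_cross_validation(examples, k, learn, accuracy):
    """Average test accuracy over k train/test splits (k = len(examples) gives leave-one-out)."""
    folds = [examples[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]
        train = [e for j, fold in enumerate(folds) if j != i for e in fold]
        model = learn(train)                  # train only on the other k-1 folds
        scores.append(accuracy(model, test))  # evaluate on the held-out fold
    return sum(scores) / k
```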
39. Performance measurement
- Learning curve: % correct on the test set as a function of training set size