Title: Donald
1. Donald Godel Rumsfeld
Winner of 2003 Foot in the Mouth Award
- "Reports that say that something hasn't happened are always interesting to me, because as we know, there are known knowns; there are things we know we know," Rumsfeld said. "We also know there are known unknowns; that is to say, we know there are some things we do not know. But there are also unknown unknowns, the ones we don't know we don't know."
- Rumsfeld talking about the reported lack of WMDs in Iraq (News Conference, April 2003)
"We think we know what he means," said Plain English Campaign spokesman John Lister. "But we don't know if we really know."
2. 12/2
- Decisions.. Decisions
- Vote on final
- In-class
- (16th, 2:40pm)
- OR Take-home
- (will be due by 16th)
- Clarification on HW5
- Participation survey
3. Learning
Dimensions:
What can be learned?
--Any of the boxes representing the agent's knowledge
--Action description, effect probabilities, causal relations in the world (and the probabilities of causation), utility models (sort of, through credit assignment), sensor data interpretation models
What feedback is available?
--Supervised, unsupervised, reinforcement learning
--Credit assignment problem
What prior knowledge is available?
--Tabula rasa (agent's head is a blank slate) or pre-existing knowledge
4. Inductive Learning (Classification Learning)
- Given a set of labeled examples, and a space of hypotheses
- Find the rule that underlies the labeling
- (so you can use it to predict future unlabeled examples)
- Tabula rasa, fully supervised
- Idea
- Loop through all hypotheses
- Rank each hypothesis in terms of its match to data
- Pick the best hypothesis
Closely related to function learning or curve-fitting (regression)
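The loop, rank, pick idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the course's implementation: the hypothesis space is taken to be conjunctions over boolean features (each feature required true, required false, or ignored), and the names `all_conjunctions`, `predict`, `best_hypothesis` and the toy data are hypothetical.

```python
from itertools import product

def all_conjunctions(n):
    """Yield every conjunctive hypothesis over n boolean features.

    A hypothesis is a tuple with one entry per feature:
    True (must be true), False (must be false), or None (don't care).
    """
    return product((True, False, None), repeat=n)

def predict(h, x):
    """A conjunction labels x positive iff every constrained feature matches."""
    return all(c is None or c == xi for c, xi in zip(h, x))

def best_hypothesis(examples, n):
    """Loop through all hypotheses, rank each by its error count, pick the best."""
    def errors(h):
        return sum(predict(h, x) != label for x, label in examples)
    return min(all_conjunctions(n), key=errors)

# Hypothetical toy data: the label is (feature 0 AND NOT feature 2).
examples = [
    ((True, True, False), True),
    ((True, False, False), True),
    ((True, True, True), False),
    ((False, True, False), False),
    ((False, False, True), False),
]
h = best_hypothesis(examples, 3)
print(h)  # the one conjunction that makes no errors on this data
```

Exhaustive enumeration is only feasible for tiny hypothesis spaces, which is why the choice of bias and search strategy matters.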
5. A classification learning example: predicting when Russell will wait for a table
--similar to predicting credit card fraud, predicting when people are likely to respond to junk mail
6. False positive: the hypothesis classifies the example as +ve, but it is actually -ve
A good hypothesis will have the fewest false positives ($F_h^+$) and the fewest false negatives ($F_h^-$). Ideally, we want them to be zero.
$\mathrm{Rank}(h) = f(F_h^+, F_h^-)$
--f depends on the domain
--in a medical domain, false negatives are penalized more
--in a junk-mailing domain, false negatives are penalized less
H1: Russell waits only in Italian restaurants
False +ves: X10; false -ves: X1, X3, X4, X8, X12
H2: Russell waits only in cheap French restaurants
False +ves: (none); false -ves: X1, X3, X4, X6, X8, X12
Ranking hypotheses
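A small Python sketch of ranking by false positives and false negatives. The weighted sum is just one possible choice of f, assumed here for illustration; the function names, the cost parameters, and the five examples are hypothetical and are not the X1..X12 data from the slide.

```python
def count_errors(hypothesis, examples):
    """Return (false_positives, false_negatives) of a hypothesis on labeled examples."""
    fp = sum(1 for x, label in examples if hypothesis(x) and not label)
    fn = sum(1 for x, label in examples if not hypothesis(x) and label)
    return fp, fn

def rank(hypothesis, examples, fp_cost=1.0, fn_cost=1.0):
    """Lower is better. A weighted sum is one assumed form of f."""
    fp, fn = count_errors(hypothesis, examples)
    return fp_cost * fp + fn_cost * fn

# Hypothetical examples: (cuisine, price) -> did the customer wait?
examples = [
    (("italian", "cheap"), True),
    (("italian", "expensive"), False),
    (("french", "expensive"), True),
    (("thai", "cheap"), True),
    (("burger", "cheap"), False),
]

def h1(x):
    """Waits only in Italian restaurants."""
    return x[0] == "italian"

def h2(x):
    """Waits only in cheap French restaurants."""
    return x[0] == "french" and x[1] == "cheap"

print(count_errors(h1, examples), count_errors(h2, examples))
```

With equal costs the two hypotheses tie on this toy data; raising `fn_cost` favors h1 and raising `fp_cost` favors h2, which is the sense in which f depends on the domain.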
7. When do you know you have learned the concept well?
- You can classify all new instances (test cases) correctly, always
- Always?
- Maybe the training samples are not completely representative of the test samples
- So, we go with "probably"
- Correctly?
- May be impossible if the training data has noise (the teacher may make mistakes too)
- So, we go with "approximately"
- The goal of a learner then is to produce a probably approximately correct (PAC) hypothesis, for a given approximation (error rate) e and probability d.
- When is a learner A better than learner B?
- For the same e, d bounds, A needs fewer training samples than B to reach PAC.
8. PAC learning
Note: This result only holds for finite hypothesis spaces (e.g. not valid for the space of line hypotheses!)
9. Inductive Learning (Classification Learning)
- Given a set of labeled examples, and a space of hypotheses
- Find the rule that underlies the labeling
- (so you can use it to predict future unlabeled examples)
- Tabula rasa, fully supervised
- Idea
- Loop through all hypotheses
- Rank each hypothesis in terms of its match to data
- Pick the best hypothesis
- Main variations
- Bias: what sort of rule are you looking for?
- If you are looking for only conjunctive hypotheses, there are just $3^n$
- Search
- Greedy search
- Decision tree learner
- Systematic search
- Version space learner
- Iterative search
- Neural net learner
It can be shown that the sample complexity of PAC learning grows with 1/e, log(1/d) AND log |H|
The main problem is that the space of hypotheses is too large
Given examples described in terms of n boolean variables, there are $2^{2^n}$ different hypotheses
For 6 features, there are 18,446,744,073,709,551,616 hypotheses
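The hypothesis counts can be checked directly in Python. The sample bound below is the standard textbook bound for a consistent learner over a finite hypothesis space, m >= (1/e)(ln|H| + ln(1/d)); it is assumed here to be the kind of result the PAC slide refers to, since the slide's own formula did not survive. Function names are hypothetical.

```python
import math

def num_boolean_functions(n):
    """Number of distinct boolean functions of n boolean variables: 2^(2^n)."""
    return 2 ** (2 ** n)

def num_conjunctions(n):
    """Each variable appears positively, negatively, or not at all: 3^n."""
    return 3 ** n

def pac_sample_bound(h_size, epsilon, delta):
    """Examples sufficient for a consistent learner over a finite space:
    m >= (1/epsilon) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / epsilon)

print(num_boolean_functions(6))  # matches the 20-digit count on the slide
print(pac_sample_bound(num_conjunctions(6), 0.1, 0.05))
print(pac_sample_bound(num_boolean_functions(6), 0.1, 0.05))
```

Restricting the bias to conjunctions shrinks log |H|, so the bound on the number of examples shrinks with it.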
10. The more expressive the bias, the larger the hypothesis space → the slower the learning
--Line fitting is faster than curve fitting
--Line fitting may miss non-line patterns
IMPORTANCE OF Bias in Learning
"Gavagai" example.
-The "whole object" bias in language learning.
11. Uses different biases in predicting Russell's waiting habits
Decision Trees
--Examples are used to
--Learn topology
--Order of questions
Association rules
--Examples are used to
--Learn support and confidence of association rules
If patrons=full and day=Friday then wait (0.3/0.7)
If wait>60 and Reservation=no then wait (0.4/0.9)
Neural Nets
--Examples are used to
--Learn topology
--Learn edge weights
Naïve Bayes (bayesnet learning)
--Examples are used to
--Learn topology
--Learn CPTs
12. Mirror, Mirror, on the wall: which learning bias is the best of all?
Well, there is no such thing, silly!
--Each bias makes it easier to learn some patterns and harder (or impossible) to learn others
--A line-fitter can fit the best line to the data very fast but won't know what to do if the data doesn't fall on a line
--A curve fitter can fit lines as well as curves but takes longer to fit lines than a line fitter
--Different types of bias classes (decision trees, NNs, etc.) provide different ways of naturally carving up the space of all possible hypotheses
So a more reasonable question is:
--What is the bias class that has a specialization corresponding to the type of patterns that underlie my data?
--In this bias class, what is the most restrictive bias that still can capture the true pattern in the data?
--Decision trees can capture all boolean functions
--but are faster at capturing conjunctive boolean functions
--Neural nets can capture all boolean or real-valued functions
--but are faster at capturing linearly separable functions
--Bayesian learning can capture all probabilistic dependencies
--but is faster at capturing single level dependencies (naïve Bayes classifier)
13. 12/4
- Interactive Review next class!!
- Minh's review: next Monday evening
- Rao's review: Reading day?
- Vote on participation credit
- Should I consider participation credit or not?
14. Fitting test cases vs. predicting future cases: the BIG TENSION.
Review
[Figure: three alternatives labeled 1, 2, and 3]
Why not the 3rd?
17. Learning Decision Trees---How?
Basic Idea
--Pick an attribute
--Split examples in terms of that attribute
--If all examples are +ve, label Yes. Terminate
--If all examples are -ve, label No. Terminate
--If some are +ve, some are -ve, continue splitting recursively
18. Depending on the order we pick, we can get smaller or bigger trees
Which tree is better? Why do you think so??
19. Basic Idea
--Pick an attribute
--Split examples in terms of that attribute
--If all examples are +ve, label Yes. Terminate
--If all examples are -ve, label No. Terminate
--If some are +ve, some are -ve, continue splitting recursively
--If no attributes are left to split? (label with majority element)
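The recursive procedure can be sketched in Python as follows. This is a minimal illustration: the attribute to split on is simply the first one in the list given (no information gain yet), and the tree representation, function names, and toy examples are all hypothetical.

```python
from collections import Counter

def learn_tree(examples, attributes):
    """examples: list of (dict attribute -> value, bool label).
    Returns a bool leaf or a nested (attribute, {value: subtree}) node."""
    labels = [label for _, label in examples]
    if all(labels):
        return True            # all examples are +ve: label Yes
    if not any(labels):
        return False           # all examples are -ve: label No
    if not attributes:
        # no attributes left to split on: label with the majority element
        return Counter(labels).most_common(1)[0][0]
    attr, rest = attributes[0], attributes[1:]   # naive choice: first attribute
    branches = {}
    for value in sorted({x[attr] for x, _ in examples}):
        subset = [(x, label) for x, label in examples if x[attr] == value]
        branches[value] = learn_tree(subset, rest)
    return (attr, branches)

def classify(tree, x):
    """Follow the branches matching x until a leaf is reached."""
    while not isinstance(tree, bool):
        attr, branches = tree
        tree = branches[x[attr]]
    return tree

# Hypothetical toy data.
examples = [
    ({"patrons": "some", "hungry": True}, True),
    ({"patrons": "full", "hungry": True}, True),
    ({"patrons": "full", "hungry": False}, False),
    ({"patrons": "none", "hungry": False}, False),
]
big = learn_tree(examples, ["patrons", "hungry"])
small = learn_tree(examples, ["hungry", "patrons"])
print(big)
print(small)
```

On this toy data, splitting on `hungry` first yields a one-level tree, while splitting on `patrons` first needs a second split under `full`: the attribute order changes the size of the tree.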
20. The Information Gain Computation
$P_+ = N_+ / (N_+ + N_-)$
$P_- = N_- / (N_+ + N_-)$
$I(P_+, P_-) = -P_+ \log_2 P_+ - P_- \log_2 P_-$
The difference between the information content before the split and the residual information after it is the information gain. So, pick the feature with the largest info gain, i.e. the smallest residual info.
Given k mutually exclusive and exhaustive events $E_1, \ldots, E_k$ whose probabilities are $p_1, \ldots, p_k$, the information content (entropy) is defined as
$\sum_i -p_i \log_2 p_i$
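21. A simple example
$I(1/2, 1/2) = -\tfrac{1}{2} \log_2 \tfrac{1}{2} - \tfrac{1}{2} \log_2 \tfrac{1}{2} = 1/2 + 1/2 = 1$
$I(1, 0) = -1 \log_2 1 - 0 \log_2 0 = 0$
$V(M) = \tfrac{2}{4} I(1/2, 1/2) + \tfrac{2}{4} I(1/2, 1/2) = 1$
$V(A) = \tfrac{2}{4} I(1, 0) + \tfrac{2}{4} I(0, 1) = 0$
$V(N) = \tfrac{2}{4} I(1/2, 1/2) + \tfrac{2}{4} I(1/2, 1/2) = 1$
So Anxious is the best attribute to split on. Once you split on Anxious, the problem is solved.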
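The same arithmetic in Python, as a sketch. The slide's example table is not in the transcript, so the four examples below are a hypothetical table chosen only to reproduce the residual values V(M) = 1, V(A) = 0, V(N) = 1; the function names are also assumptions.

```python
import math

def info(*probs):
    """Entropy I(p1, ..., pk) in bits; terms with p = 0 contribute 0."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

def residual_info(examples, attr):
    """Expected entropy of the label after splitting on attr (V(attr) above)."""
    total = len(examples)
    remainder = 0.0
    for value in {x[attr] for x, _ in examples}:
        labels = [label for x, label in examples if x[attr] == value]
        p_pos = sum(labels) / len(labels)
        remainder += len(labels) / total * info(p_pos, 1 - p_pos)
    return remainder

# Hypothetical table: A determines the label, M and N carry no information.
examples = [
    ({"M": 1, "A": 1, "N": 1}, True),
    ({"M": 1, "A": 0, "N": 0}, False),
    ({"M": 0, "A": 1, "N": 0}, True),
    ({"M": 0, "A": 0, "N": 1}, False),
]
before = info(0.5, 0.5)
gains = {attr: before - residual_info(examples, attr) for attr in ("M", "A", "N")}
best = max(gains, key=gains.get)
print(gains, best)
```

Picking the attribute with the largest gain is the same as picking the one with the smallest residual information, since the information before the split is the same for every attribute.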
23. Lesson: Every bias makes something easier to learn and others harder to learn
Evaluating the Decision Trees
Learning curves
Given N examples, partition them into Ntr, the training set, and Ntest, the test instances
Loop for i = 1 to |Ntr|
  Loop for Ns in subsets of Ntr of size i
    Train the learner over Ns
    Test the learned pattern over Ntest and compute the accuracy (% correct)
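A Python sketch of the learning-curve loop. One deliberate departure from the loop as stated, labeled as an assumption: instead of visiting every subset of each size (which is exponential), it averages over a fixed number of random subsets. The function names, the `trials` and `seed` parameters, and the stand-in majority learner are hypothetical.

```python
import random

def learning_curve(train, examples, n_test, trials=20, seed=0):
    """Return [(training_size, mean_accuracy)] for sizes 1..|Ntr|.

    train(subset) must return a classifier: a function x -> label.
    Assumption: average over `trials` random subsets per size rather
    than enumerating all subsets of that size.
    """
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    test_set, train_set = shuffled[:n_test], shuffled[n_test:]
    curve = []
    for size in range(1, len(train_set) + 1):
        accuracies = []
        for _ in range(trials):
            subset = rng.sample(train_set, size)
            classifier = train(subset)
            correct = sum(classifier(x) == label for x, label in test_set)
            accuracies.append(correct / len(test_set))
        curve.append((size, sum(accuracies) / trials))
    return curve

def majority_learner(subset):
    """Trivial stand-in learner: always predict the most common training label."""
    positives = sum(label for _, label in subset)
    guess = positives * 2 >= len(subset)
    return lambda x: guess
```

Any learner with the same `train(subset) -> classifier` shape, such as a decision tree learner, could be passed in place of `majority_learner`.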
24. Problems with Info. Gain. Heuristics
- Feature correlation: the Costanza party problem
- No obvious solution
- Overfitting: we may look too hard for patterns where there are none
- E.g. coin tosses classified by the day of the week, the shirt I was wearing, the time of the day, etc.
- Solution: don't consider splitting if the information gain given by the best feature is below a minimum threshold
- Can use the χ² test for statistical significance
- Will also help when we have noisy samples
- We may prefer features with very high branching
- E.g. branch on the "universal time string" for the Russell restaurant example
- Branch on social security number to look for patterns on who will get an A
- Solution: gain ratio, the ratio of the information gain with the attribute A to the information content of answering the question "What is the value of A?"
- The denominator is smaller for attributes with smaller domains.
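A sketch of the gain ratio in Python, under the definition given above: information gain divided by the entropy of the attribute's own value distribution. The function names and the four-example data set (with a unique `id` standing in for something like a social security number) are hypothetical.

```python
import math
from collections import Counter

def entropy(items):
    """Entropy in bits of the empirical distribution of items."""
    total = len(items)
    return sum(-c / total * math.log2(c / total) for c in Counter(items).values())

def info_gain(examples, attr):
    """Entropy of the labels minus the residual entropy after splitting on attr."""
    labels = [label for _, label in examples]
    remainder = 0.0
    for value in {x[attr] for x, _ in examples}:
        subset = [label for x, label in examples if x[attr] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy(labels) - remainder

def gain_ratio(examples, attr):
    """Gain divided by the information in 'what is the value of attr?'."""
    split_info = entropy([x[attr] for x, _ in examples])
    return info_gain(examples, attr) / split_info if split_info > 0 else 0.0

# Hypothetical data: "id" is unique per example, "anxious" is binary.
examples = [
    ({"id": 1, "anxious": True}, True),
    ({"id": 2, "anxious": True}, True),
    ({"id": 3, "anxious": False}, False),
    ({"id": 4, "anxious": False}, False),
]
for attr in ("id", "anxious"):
    print(attr, info_gain(examples, attr), gain_ratio(examples, attr))
```

Both attributes have the same plain information gain here, but the high-branching `id` attribute has a larger denominator and so a lower gain ratio.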
26. Neural Network Learning
- Idea: Since classification is really a question of finding a surface to separate the +ve examples from the -ve examples, why not directly search in the space of possible surfaces?
- Mathematically, a surface is a function
- Need a way of learning functions
- Threshold units
27. Neural Net is a collection of (differentiable) threshold units with interconnections
28. The Brain Connection
A Threshold Unit is sort of like a neuron
Threshold Functions (differentiable)
29. Perceptron Networks
What happened to the "Threshold"?
--Can model as an extra weight with static input
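A short Python sketch of folding the threshold into an extra weight. The static input value of -1 is a sign convention assumed here (other texts use +1 with a negated weight), and the function names are hypothetical.

```python
from itertools import product

def threshold_unit(weights, threshold, inputs):
    """Fire (1) iff the weighted sum of the inputs reaches the threshold."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) >= threshold else 0

def bias_unit(weights_with_bias, inputs):
    """Same unit with the threshold folded in as an extra weight.

    The extra weight is paired with a static input of -1 (assumed
    convention), so the unit fires iff the augmented sum is >= 0.
    """
    augmented = list(inputs) + [-1]
    return 1 if sum(w * x for w, x in zip(weights_with_bias, augmented)) >= 0 else 0

# Example: a unit computing AND with weights (1, 1) and threshold 1.5.
for x in product((0, 1), repeat=2):
    print(x, threshold_unit([1, 1], 1.5, x), bias_unit([1, 1, 1.5], x))
```

With the threshold treated as one more weight, a learning rule only has to adjust weights.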
30. Can Perceptrons Learn All Boolean Functions?
--Are all boolean functions linearly separable?
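One way to probe the question is to run the perceptron learning rule on a separable function (AND) and on XOR, the classic example of a boolean function that is not linearly separable. The sketch below is a minimal single-unit trainer; the function name, the `epochs` and `rate` parameters, and the -1 static input for the threshold weight are assumptions made for illustration.

```python
def train_perceptron(examples, epochs=100, rate=1):
    """Perceptron learning rule on a single threshold unit.

    examples: list of (inputs, target) with target 0 or 1.
    Returns (weights, converged); weights[-1] is the threshold weight,
    paired with a static input of -1.
    """
    n = len(examples[0][0])
    weights = [0] * (n + 1)
    for _ in range(epochs):
        mistakes = 0
        for inputs, target in examples:
            x = list(inputs) + [-1]
            output = 1 if sum(w * xi for w, xi in zip(weights, x)) > 0 else 0
            if output != target:
                mistakes += 1
                # move the weights toward (or away from) the misclassified input
                weights = [w + rate * (target - output) * xi
                           for w, xi in zip(weights, x)]
        if mistakes == 0:
            return weights, True
    return weights, False

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
print(train_perceptron(AND)[1])  # an error-free pass is reached
print(train_perceptron(XOR)[1])  # no error-free pass exists for XOR
```

Since no single line separates the XOR positives from the negatives, no setting of the weights can classify all four XOR examples, no matter how many epochs are allowed.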
31. Perceptron Training in Action
A nice applet at http://neuron.eng.wayne.edu/java/Perceptron/New38.html
32. Comparing Perceptrons and Decision Trees in Majority Function and Russell Domain
[Two plots, one for the majority function and one for the Russell domain, each comparing Decision Trees and Perceptron]
Majority function is linearly separable.. Russell domain is apparently not....
Encoding: one input unit per attribute. The unit takes as many distinct real values as the size of the attribute domain
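Two small sketches in Python. The first shows why the majority function is linearly separable: a single unit with every weight set to 1 and a threshold of n/2 computes it. The second illustrates the encoding described above; the attribute name, its values, and the particular real numbers assigned to them are hypothetical.

```python
from itertools import product

def majority(bits):
    """True iff more than half of the inputs are 1."""
    return sum(bits) > len(bits) / 2

def majority_perceptron(bits):
    """One threshold unit: every weight is 1 and the threshold is n/2."""
    weights = [1] * len(bits)
    threshold = len(bits) / 2
    return sum(w * b for w, b in zip(weights, bits)) > threshold

# Encoding sketch: one input unit per attribute, one real value per
# element of the attribute's domain (values chosen arbitrarily here).
PATRONS = {"none": 0.0, "some": 1.0, "full": 2.0}

def encode(example, domains):
    """Map an attribute -> value dict to a list of real-valued inputs."""
    return [domains[attr][example[attr]] for attr in sorted(domains)]

agree = all(majority(b) == majority_perceptron(b)
            for b in product((0, 1), repeat=5))
print(agree)
print(encode({"patrons": "full"}, {"patrons": PATRONS}))
```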