Donald - PowerPoint PPT Presentation

About This Presentation

Title:

Donald

Slides: 30

Provided by: subbraoka

Learn more at: https://rakaposhi.eas.asu.edu

Transcript and Presenter's Notes

Title: Donald

1
Donald Godel Rumsfeld
Winner of 2003 Foot in the Mouth Award

''Reports that say that something hasn't happened
are always interesting to me, because as we know,
there are known knowns, there are things we know
we know,'' Rumsfeld said. ''We also know
there are known unknowns; that is to say, we know
there are some things we do not know. But there
are also unknown unknowns, the ones we don't
know we don't know.''
Rumsfeld talking about the reported lack of WMDs
in Iraq (News Conference, April 2003)

''We think we know what he means,'' said Plain
English Campaign spokesman John Lister. ''But we
don't know if we really know.''
2
12/2

Decisions.. Decisions
Vote on final
In-class
(16th, 2:40pm)
OR Take-home
(will be due by 16th)
Clarification on HW5
Participation survey

3
Learning
Dimensions:
What can be learned?
--Any of the boxes representing the agent's knowledge
--Action descriptions, effect probabilities, causal relations in the world (and the probabilities of causation), utility models (sort of, through credit assignment), sensor data interpretation models
What feedback is available?
--Supervised, unsupervised, reinforcement learning
--Credit assignment problem
What prior knowledge is available?
--Tabula rasa (agent's head is a blank slate) or pre-existing knowledge
4
Inductive Learning (Classification Learning)

Given a set of labeled examples and a space of
hypotheses,
find the rule that underlies the labeling
(so you can use it to predict future unlabeled
examples)
Tabula rasa, fully supervised
Idea:
Loop through all hypotheses
Rank each hypothesis in terms of its match to
the data
Pick the best hypothesis

Closely related to function learning
or curve-fitting (regression)
5
A classification learning example: predicting when
Russell will wait for a table
--similar to predicting credit card fraud, or
predicting when people are likely to respond to
junk mail
6
Ranking hypotheses
False positive: the hypothesis classifies the example as +ve, but it is actually -ve
A good hypothesis will have the fewest false positives ($F_h^+$) and the fewest false negatives ($F_h^-$)
Ideally, we want them to be zero
$Rank(h) = f(F_h^+, F_h^-)$ --f depends on the domain
--in a medical domain, false negatives are penalized more
--in a junk-mailing domain, false negatives are penalized less
H1: Russell waits only in Italian restaurants
false +ves: X10; false -ves: X1, X3, X4, X8, X12
H2: Russell waits only in cheap French restaurants
false +ves: (none); false -ves: X1, X3, X4, X6, X8, X12
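The ranking idea can be sketched in a few lines. The error sets below are the ones listed on the slide for H1 and H2; the linear weighted form of f and the cost values are assumptions made only for illustration, since the slide says no more than that f depends on the domain.

```python
# Error sets as listed on the slide (the slide lists no false positives for H2).
errors = {
    "H1: waits only in Italian restaurants": {
        "false_pos": ["X10"],
        "false_neg": ["X1", "X3", "X4", "X8", "X12"],
    },
    "H2: waits only in cheap French restaurants": {
        "false_pos": [],
        "false_neg": ["X1", "X3", "X4", "X6", "X8", "X12"],
    },
}


def rank(err, fp_cost, fn_cost):
    """Weighted error count; lower is better. The linear form is an assumption."""
    return fp_cost * len(err["false_pos"]) + fn_cost * len(err["false_neg"])


def best_hypothesis(errors, fp_cost, fn_cost):
    """Return the name of the hypothesis with the lowest weighted error."""
    return min(errors, key=lambda h: rank(errors[h], fp_cost, fn_cost))


# Medical-style costs: a false negative is (hypothetically) 5x as bad.
medical_choice = best_hypothesis(errors, fp_cost=1, fn_cost=5)
# Junk-mail-style costs: a false positive is (hypothetically) 5x as bad.
junk_mail_choice = best_hypothesis(errors, fp_cost=5, fn_cost=1)
```

With these made-up costs the medical setting prefers H1 (it misses one fewer positive example), while the junk-mail setting prefers H2 (it never raises a false alarm).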
7
When do you know you have learned the concept
well?

You can classify all new instances (test cases)
correctly, always
Always?
Maybe the training samples are not completely
representative of the test samples
So, we go with probably
Correctly?
May be impossible if the training data has noise
(the teacher may make mistakes too)
So, we go with approximately

The goal of a learner then is to produce a
probably approximately correct (PAC) hypothesis,
for a given approximation (error rate) ε and
probability δ.
When is a learner A better than learner B?
For the same ε, δ bounds, A needs fewer training
samples than B to reach PAC.

8
PAC learning
Note: This result only holds for finite
hypothesis spaces (e.g. not valid for the space
of line hypotheses!)
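The result the note refers to did not survive in the transcript. As a stand-in, the sketch below uses the standard textbook bound for a learner that outputs a hypothesis consistent with the training data over a finite hypothesis space H: m >= (1/ε)(ln|H| + ln(1/δ)) examples suffice. Treat the choice of this particular bound, and the function name, as assumptions rather than a reconstruction of the slide.

```python
import math


def pac_sample_bound(epsilon, delta, hypothesis_count):
    """Examples sufficient for a consistent learner over a finite hypothesis space.

    Uses m >= (1/epsilon) * (ln|H| + ln(1/delta)); assumed here, not taken from the slide.
    """
    return math.ceil((math.log(hypothesis_count) + math.log(1.0 / delta)) / epsilon)


# Conjunctive hypotheses over n = 10 boolean variables: 3**10 of them.
m_conjunctions = pac_sample_bound(0.1, 0.05, 3 ** 10)
# All boolean functions over the same 10 variables: 2**(2**10) of them.
m_all_functions = pac_sample_bound(0.1, 0.05, 2 ** (2 ** 10))
```

Because the bound depends on ln|H|, it has nothing to say about infinite spaces such as the space of lines, which is the point of the note.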
9
Inductive Learning (Classification Learning)

Given a set of labeled examples and a space of
hypotheses,
find the rule that underlies the labeling
(so you can use it to predict future unlabeled
examples)
Tabula rasa, fully supervised
Idea:
Loop through all hypotheses
Rank each hypothesis in terms of its match to
the data
Pick the best hypothesis

Main variations:
Bias: what sort of rule are you looking for?
If you are looking for only conjunctive
hypotheses, there are just $3^n$
Search:
Greedy search
--Decision tree learner
Systematic search
--Version space learner
Iterative search
--Neural net learner

It can be shown that the sample complexity of PAC
learning grows in proportion to 1/ε, log(1/δ) and log |H|
The main problem is that the space of
hypotheses is too large
Given examples described in terms of n boolean variables,
there are $2^{2^n}$ different hypotheses
For 6 features, there are $2^{2^6} = 2^{64} =$
18,446,744,073,709,551,616 hypotheses
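The two counts are easy to check directly. A boolean function over n variables assigns one of two labels to each of the 2^n possible examples, and a conjunctive hypothesis lets each variable appear positively, appear negated, or be left out. The function names below are made up for this sketch.

```python
def num_boolean_functions(n):
    """Number of distinct labelings of the 2**n possible examples."""
    return 2 ** (2 ** n)


def num_conjunctions(n):
    """Each variable is positive, negated, or absent in the conjunction."""
    return 3 ** n


all_functions_6 = num_boolean_functions(6)
conjunctions_6 = num_conjunctions(6)
```

For 6 features the conjunctive bias shrinks the space from 18,446,744,073,709,551,616 hypotheses to 729.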
10
IMPORTANCE OF Bias in Learning
The more expressive the bias, the larger the
hypothesis space → the slower the learning
--Line fitting is faster than curve fitting
--Line fitting may miss non-line patterns
Gavagai example
--The whole object bias in language learning
11
Uses different biases in predicting Russell's
waiting habits
Decision Trees --Examples are used to --Learn
topology --Order of questions
Association rules --Examples are used to
--Learn support and confidence of
association rules
If patrons=full and day=Friday then wait (0.3/0.7)
If wait>60 and reservation=no then wait (0.4/0.9)
Neural Nets --Examples are used to --Learn
topology --Learn edge weights
Naïve Bayes (Bayes net learning) --Examples are
used to --Learn topology --Learn CPTs
12
Mirror, Mirror, on the wall: Which learning
bias is the best of all?
Well, there is no such thing, silly!
--Each bias makes it easier to learn some patterns and harder (or impossible) to learn others
--A line-fitter can fit the best line to the data very fast but won't know what to do if the data doesn't fall on a line
--A curve fitter can fit lines as well as curves but takes longer to fit lines than a line fitter
--Different types of bias classes (decision trees, NNs etc.) provide different ways of naturally carving up the space of all possible hypotheses
So a more reasonable question is:
--What is the bias class that has a specialization corresponding to the type of patterns that underlie my data?
--In this bias class, what is the most restrictive bias that still can capture the true pattern in the data?
--Decision trees can capture all boolean functions --but are faster at capturing conjunctive boolean functions
--Neural nets can capture all boolean or real-valued functions --but are faster at capturing linearly separable functions
--Bayesian learning can capture all probabilistic dependencies --but is faster at capturing single-level dependencies (naïve Bayes classifier)
13
12/4

Interactive Review next class!!
Minh's review: next Monday evening
Rao's review: Reading day?
Vote on participation credit
Should I consider participation credit or not?

14
Review
Fitting test cases vs. predicting future
cases: The BIG TENSION.
[Figure: three alternatives labeled 1, 2 and 3]
Why not the 3rd?
16
Uses different biases in predicting Russell's
waiting habits
17
Learning Decision Trees: How?
Basic Idea:
--Pick an attribute
--Split examples in terms of that attribute
--If all examples are +ve, label Yes. Terminate
--If all examples are -ve, label No. Terminate
--If some are +ve and some are -ve, continue splitting recursively
18
Depending on the order we pick, we can get
smaller or bigger trees
Which tree is better? Why do you think so??
19
Basic Idea:
--Pick an attribute
--Split examples in terms of that attribute
--If all examples are +ve, label Yes. Terminate
--If all examples are -ve, label No. Terminate
--If some are +ve and some are -ve, continue splitting recursively
--If no attributes are left to split on, label with the majority element
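The recursion above can be sketched as follows. This is a minimal illustration, not the lecture's implementation: it splits on attributes in the order they are given (choosing the order well is the subject of the next slides), and the toy examples and attribute names are hypothetical.

```python
from collections import Counter


def majority_label(examples):
    """Most common label among the examples (used when attributes run out)."""
    return Counter(label for _, label in examples).most_common(1)[0][0]


def learn_tree(examples, attributes):
    """Return a label (leaf) or an (attribute, {value: subtree}) pair."""
    labels = {label for _, label in examples}
    if len(labels) == 1:          # all +ve or all -ve: terminate
        return labels.pop()
    if not attributes:            # nothing left to split on
        return majority_label(examples)
    attr, rest = attributes[0], attributes[1:]
    branches = {}
    for value in {features[attr] for features, _ in examples}:
        subset = [(f, l) for f, l in examples if f[attr] == value]
        branches[value] = learn_tree(subset, rest)
    return (attr, branches)


def classify(tree, features):
    """Follow branches until a leaf label is reached."""
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches[features[attr]]
    return tree


# Hypothetical toy data, not the lecture's restaurant table.
examples = [
    ({"patrons": "full", "hungry": "yes"}, "Yes"),
    ({"patrons": "full", "hungry": "no"}, "No"),
    ({"patrons": "none", "hungry": "yes"}, "No"),
    ({"patrons": "some", "hungry": "no"}, "Yes"),
]
tree = learn_tree(examples, ["patrons", "hungry"])
```

Note that `classify` raises a `KeyError` for an attribute value never seen in training; a fuller version would need a policy for that case.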
20
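The Information Gain Computation
Given k mutually exclusive and exhaustive events
$E_1, \ldots, E_k$ whose probabilities are $p_1, \ldots, p_k$, the
information content (entropy) is defined as
$$\sum_i -p_i \log_2 p_i$$
For a set with $N_+$ positive and $N_-$ negative examples:
$$P_+ = \frac{N_+}{N_+ + N_-}, \qquad P_- = \frac{N_-}{N_+ + N_-}$$
$$I(P_+, P_-) = -P_+ \log_2 P_+ - P_- \log_2 P_-$$
The difference between the information before the split and the
residual information after it is the information gain
So, pick the feature with the largest info gain, i.e. the
smallest residual info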
21
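A simple example
$$I(1/2, 1/2) = -\tfrac{1}{2} \log_2 \tfrac{1}{2} - \tfrac{1}{2} \log_2 \tfrac{1}{2} = \tfrac{1}{2} + \tfrac{1}{2} = 1$$
$$I(1, 0) = -1 \log_2 1 - 0 \log_2 0 = 0 \quad (\text{taking } 0 \log_2 0 = 0)$$
$$V(M) = \tfrac{2}{4} I(1/2, 1/2) + \tfrac{2}{4} I(1/2, 1/2) = 1$$
$$V(A) = \tfrac{2}{4} I(1, 0) + \tfrac{2}{4} I(0, 1) = 0$$
$$V(N) = \tfrac{2}{4} I(1/2, 1/2) + \tfrac{2}{4} I(1/2, 1/2) = 1$$
So Anxious (A) is the best attribute to split on. Once
you split on Anxious, the problem is solved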
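The arithmetic of the example can be reproduced in code. The four-row table below is hypothetical: the slide's data did not survive in the transcript, so the rows were chosen only so that M and N each split the examples into two evenly mixed halves while A separates them perfectly, matching the V values above.

```python
import math


def information(probabilities):
    """Entropy in bits, with 0 * log2(0) taken as 0."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)


def residual_information(examples, attribute):
    """Weighted entropy of the labels after splitting on the attribute (V on the slide)."""
    total = len(examples)
    residual = 0.0
    for value in {features[attribute] for features, _ in examples}:
        labels = [label for features, label in examples if features[attribute] == value]
        p_pos = sum(labels) / len(labels)
        residual += len(labels) / total * information([p_pos, 1 - p_pos])
    return residual


# Hypothetical data consistent with the slide's numbers; True marks a +ve example.
examples = [
    ({"M": 1, "A": 1, "N": 1}, True),
    ({"M": 0, "A": 1, "N": 0}, True),
    ({"M": 1, "A": 0, "N": 0}, False),
    ({"M": 0, "A": 0, "N": 1}, False),
]
residuals = {attr: residual_information(examples, attr) for attr in ("M", "A", "N")}
best_attribute = min(residuals, key=residuals.get)
```

Picking the attribute with the smallest residual is the same as picking the one with the largest information gain, since the information before the split is the same for every attribute.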
23
Lesson: Every bias makes some things easier to
learn and others harder to learn
Evaluating the Decision Trees
Learning curves: Given N examples, partition
them into Ntr, the training set, and Ntest, the test
instances
Loop for i = 1 to |Ntr|
  Loop for Ns in subsets of Ntr of size i
    Train the learner over Ns
    Test the learned pattern over Ntest and compute the accuracy (% correct)
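A minimal sketch of the learning-curve loop follows. It departs from the slide in one labeled way: instead of enumerating every subset of each size, it averages over a fixed number of random subsets, which keeps the cost manageable. The `train_fn` interface, the `trials` parameter and the majority-class stand-in learner are all assumptions made for the sketch.

```python
import random


def learning_curve(examples, train_fn, n_test, trials=20, seed=0):
    """Return [(training_size, mean_accuracy)] on a held-out test set."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    test, train_pool = shuffled[:n_test], shuffled[n_test:]
    curve = []
    for size in range(1, len(train_pool) + 1):
        accuracies = []
        for _ in range(trials):
            sample = rng.sample(train_pool, size)   # random subset of this size
            predict = train_fn(sample)              # train the learner
            correct = sum(predict(x) == y for x, y in test)
            accuracies.append(correct / len(test))  # fraction correct on the test set
        curve.append((size, sum(accuracies) / trials))
    return curve


def train_majority(sample):
    """Stand-in learner: always predicts the most common training label."""
    labels = [y for _, y in sample]
    guess = max(set(labels), key=labels.count)
    return lambda x: guess


# Hypothetical data: integers labeled by parity.
data = [(i, i % 2 == 0) for i in range(20)]
curve = learning_curve(data, train_majority, n_test=5, trials=3)
```

Any learner can be plugged in as long as `train_fn` takes a list of (example, label) pairs and returns a prediction function.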
24
Problems with Info. Gain Heuristics

Feature correlation: The Costanza party problem
No obvious solution
Overfitting: We may look too hard for patterns
where there are none
E.g. coin tosses classified by the day of the
week, the shirt I was wearing, the time of the
day etc.
Solution: Don't consider splitting if the
information gain given by the best feature is
below a minimum threshold
Can use the χ² test for statistical significance
Will also help when we have noisy samples
We may prefer features with very high branching
e.g. Branch on the universal time string for the
Russell restaurant example
Branch on social security number to look
for patterns on who will get an A
Solution: gain ratio --ratio of information
gain with the attribute A to the information
content of answering the question "What is the
value of A?"
The denominator is smaller for attributes with
smaller domains.
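The gain ratio can be sketched directly from that definition. The tiny table is hypothetical: an `id` column that is unique per row (standing in for a social security number) and a `studied` column, each of which separates the labels perfectly and so has the same information gain.

```python
import math
from collections import Counter


def entropy(values):
    """Entropy in bits of the empirical distribution of the values."""
    total = len(values)
    return sum(-c / total * math.log2(c / total) for c in Counter(values).values())


def information_gain(rows, labels, attribute):
    """Label entropy before the split minus the weighted entropy after it."""
    total = len(rows)
    residual = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [l for row, l in zip(rows, labels) if row[attribute] == value]
        residual += len(subset) / total * entropy(subset)
    return entropy(labels) - residual


def gain_ratio(rows, labels, attribute):
    """Information gain divided by the information in the attribute's own value."""
    split_info = entropy([row[attribute] for row in rows])
    if split_info == 0:          # attribute has a single value: no useful split
        return 0.0
    return information_gain(rows, labels, attribute) / split_info


# Hypothetical rows; labels say who got an A.
rows = [
    {"id": 1, "studied": "yes"},
    {"id": 2, "studied": "yes"},
    {"id": 3, "studied": "no"},
    {"id": 4, "studied": "no"},
]
labels = ["A", "A", "not A", "not A"]
ratio_id = gain_ratio(rows, labels, "id")
ratio_studied = gain_ratio(rows, labels, "studied")
```

Plain information gain cannot tell the two attributes apart here, but dividing by the information content of the attribute value (2 bits for `id`, 1 bit for `studied`) penalizes the high-branching one.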

26
Neural Network Learning

Idea: Since classification is really a question
of finding a surface to separate the +ve examples
from the -ve examples, why not directly search in
the space of possible surfaces?
Mathematically, a surface is a function
Need a way of learning functions
Threshold units

27
Neural Net is a collection of (differentiable)
threshold units with interconnections
28
The Brain Connection
A Threshold Unit is sort of like a neuron
Threshold Functions (differentiable)
29
Perceptron Networks
What happened to the Threshold? --Can model
as an extra weight with static input
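One way to see the threshold-as-weight trick: prepend a constant input to every example and let its weight be learned like any other. The sketch below uses the classic perceptron update rule with a fixed input of -1, so the first weight acts as the threshold. The sign convention, the learning rate and the AND training set are assumptions for illustration.

```python
def predict(weights, inputs):
    """Threshold unit; weights[0] multiplies a constant -1 input, i.e. the threshold."""
    extended = [-1.0] + list(inputs)
    activation = sum(w * x for w, x in zip(weights, extended))
    return 1 if activation > 0 else 0


def train_perceptron(examples, n_inputs, rate=1.0, epochs=100):
    """Perceptron learning rule; stops early once an epoch makes no mistakes."""
    weights = [0.0] * (n_inputs + 1)
    for _ in range(epochs):
        mistakes = 0
        for inputs, target in examples:
            error = target - predict(weights, inputs)
            if error != 0:
                mistakes += 1
                extended = [-1.0] + list(inputs)
                weights = [w + rate * error * x for w, x in zip(weights, extended)]
        if mistakes == 0:
            break
    return weights


# Boolean AND is linearly separable, so the rule converges on it.
and_examples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
and_weights = train_perceptron(and_examples, n_inputs=2)
```

Because the threshold is just another weight, the same update line adjusts it along with the input weights and no special case is needed.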

30
Can Perceptrons Learn All Boolean Functions?
--Are all boolean functions linearly separable?
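For two inputs the question can be explored by brute force. The sketch below enumerates single threshold units over a small integer grid of weights and thresholds (the grid is an arbitrary choice for this illustration) and collects the truth tables they compute. A grid search cannot prove a function is out of reach, but XOR is unreachable for any weights: it would need w1 > t, w2 > t and t >= 0, yet also w1 + w2 <= t, which is a contradiction.

```python
from itertools import product

# The four possible inputs, in a fixed order that defines each truth table.
INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]


def threshold_unit(w1, w2, t):
    """Truth table (as a tuple) computed by one unit with the given weights and threshold."""
    return tuple(1 if w1 * x1 + w2 * x2 - t > 0 else 0 for x1, x2 in INPUTS)


def representable_functions(grid=range(-2, 3)):
    """All truth tables some unit on the grid can compute."""
    return {threshold_unit(w1, w2, t) for w1, w2, t in product(grid, repeat=3)}


found = representable_functions()
XOR = (0, 1, 1, 0)
XNOR = (1, 0, 0, 1)
missing = [f for f in product((0, 1), repeat=4) if f not in found]
```

Of the 16 boolean functions of two inputs, the grid finds 14; the two that are missing are XOR and its negation.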
31
Perceptron Training in Action
A nice applet at
http://neuron.eng.wayne.edu/java/Perceptron/New38.html
32
Comparing Perceptrons and Decision Trees in the
Majority Function and Russell Domain
[Figure: decision trees vs. perceptron; panels: Majority function, Russell Domain]
Majority function is linearly separable.
Russell domain is apparently not.
Encoding: one input unit per attribute. The unit
takes as many distinct real values as the size
of the attribute domain
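The encoding described here can be sketched as a lookup from each attribute value to a real number. The attribute names, their domains and the choice of 0, 1, 2, ... as the real values are hypothetical; the slide only says that each unit takes as many distinct values as the attribute has.

```python
def make_encoder(domains):
    """Build a function mapping an example to one real-valued input per attribute."""
    index = {
        attr: {value: float(i) for i, value in enumerate(values)}
        for attr, values in domains.items()
    }
    order = list(domains)

    def encode(example):
        return [index[attr][example[attr]] for attr in order]

    return encode


# Hypothetical attribute domains for a restaurant-style example.
domains = {
    "patrons": ["none", "some", "full"],
    "hungry": ["no", "yes"],
}
encode = make_encoder(domains)
vector = encode({"patrons": "full", "hungry": "yes"})
```

One consequence worth keeping in mind is that this encoding imposes an arbitrary numeric order on the values of each attribute, which may be part of why a single perceptron has trouble in such a domain.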
