1
Empirical Methods for AI (and CS)
  • CS1573 AI Application Development
  • (after a tutorial by Paul Cohen, Ian P. Gent, and
    Toby Walsh)

2
Overview
  • Introduction
  • What are empirical methods?
  • Why use them?
  • Case Study
  • Experiment design
  • Data analysis

3
Empirical Methods for CS
  • Part I: Introduction

4
What does empirical mean?
  • Relying on observations, data, experiments
  • Empirical work should complement theoretical work
  • Theories often have holes
  • Theories are suggested by observations
  • Theories are tested by observations
  • Conversely, theories direct our empirical
    attention
  • In addition (in this course at least), empirical
    means "wanting to understand the behavior of
    complex systems"

5
Why We Need Empirical Methods: Cohen's 1990 Survey
of 150 AAAI Papers
  • Roughly 60% of the papers gave no evidence that
    the work they described had been tried on more
    than a single example problem.
  • Roughly 80% of the papers made no attempt to
    explain performance, to tell us why it was good
    or bad and under which conditions it might be
    better or worse.
  • Only 16% of the papers offered anything that
    might be interpreted as a question or a
    hypothesis.
  • Theory papers generally had no applications or
    empirical work to support them; empirical papers
    were demonstrations, not experiments, and had no
    underlying theoretical support.
  • The essential synergy between theory and
    empirical work was missing.

6
Empirical CS/AI (Theory, not Theorems)
  • Computer programs are formal objects
  • so let's reason about them entirely formally?
  • Two reasons why we can't or won't
  • theorems are hard
  • some questions are empirical in nature
  • e.g. are finite state automata adequate to
    represent the sort of knowledge met in practice?
  • e.g. even though our problem is intractable in
    general, are the instances met in practice easy
    to solve?

7
Empirical CS/AI
  • Treat computer programs as natural objects
  • like fundamental particles, chemicals, living
    organisms
  • Build (approximate) theories about them
  • construct hypotheses
  • e.g. greedy search is important to chess
  • test with empirical experiments
  • e.g. compare search with planning
  • refine hypotheses and modelling assumptions
  • e.g. greediness not important, but search is!

8
Empirical CS/AI
  • Many advantages over other sciences
  • Cost
  • no need for expensive super-colliders
  • Control
  • unlike the real world, we often have complete
    command of the experiment
  • Reproducibility
  • in theory, computers are entirely deterministic
  • Ethics
  • no ethics panels needed before you run experiments

9
Types of hypotheses
  • My search program is better than yours
  • not very helpful: a beauty competition?
  • Search cost grows exponentially with number of
    variables for this kind of problem
  • better as we can extrapolate to data not yet
    seen?
  • Constraint systems are better at handling
    over-constrained systems, but OR systems are
    better at handling under-constrained systems
  • even better as we can extrapolate to new
    situations?

10
A typical conference conversation
  • What are you up to these days?
  • I'm running an experiment to compare the
    Davis-Putnam algorithm with GSAT.
  • Why?
  • I want to know which is faster
  • Why?
  • Lots of people use each of these algorithms
  • How will these people use your result?...

11
Keep in mind the BIG picture
  • What are you up to these days?
  • I'm running an experiment to compare the
    Davis-Putnam algorithm with GSAT.
  • Why?
  • I have this hypothesis that neither will dominate
  • What use is this?
  • A portfolio containing both algorithms will be
    more robust than either algorithm on its own

12
Keep in mind the BIG picture
  • ...
  • Why are you doing this?
  • Because many real problems are intractable in
    theory but need to be solved in practice.
  • How does your experiment help?
  • It helps us understand the difference between
    average and worst case results
  • So why is this interesting?
  • Intractability is one of the BIG open questions
    in CS!

13
Why is empirical CS/AI in vogue?
  • Inadequacies of theoretical analysis
  • problems often aren't as hard in practice as
    theory predicts in the worst case
  • average-case analysis is very hard (and often
    based on questionable assumptions)
  • Some spectacular successes
  • phase transition behaviour
  • local search methods
  • theory lagging behind algorithm design

14
Why is empirical CS/AI in vogue?
  • Compute power ever increasing
  • even intractable problems coming into range
  • easy to perform large (and sometimes meaningful)
    experiments
  • Empirical CS/AI perceived to be easier than
    theoretical CS/AI
  • often a false perception, as experiments are
    easier to mess up than proofs

15
Empirical Methods for CS
  • Part II: A Case Study

16
Case Study
  • Scheduling processors on ring network
  • jobs spawned as binary trees
  • KOSO
  • keep one, send one to my left or right
    arbitrarily
  • KOSO*
  • keep one, send one to my least heavily loaded
    neighbour (see the simulation sketch below)
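A minimal Python sketch of the setup just described; it is not the
original experiment code, and the parameter values, ring size, and
function names are illustrative assumptions. It only shows the two
forwarding strategies on a ring of processors with jobs spawning as
binary trees.

```python
# Sketch of ring scheduling with KOSO and KOSO* (illustrative only).
import random

def simulate(num_procs=10, spawn_prob=0.9, max_depth=10, strategy="koso", seed=0):
    """Return the number of time steps until all work is finished.

    Each queued job does one unit of work per step; with probability
    spawn_prob it spawns two children (a binary tree of jobs).  The
    processor keeps one child and ships the other to a ring neighbour:
    an arbitrary neighbour for KOSO, the least-loaded one for KOSO*.
    """
    random.seed(seed)
    queues = [[] for _ in range(num_procs)]
    queues[0].append(0)                      # one root job at depth 0
    steps = 0
    while any(queues):
        sends = []                           # (destination, job) pairs
        for p, q in enumerate(queues):
            if not q:
                continue
            depth = q.pop(0)                 # run one job for one step
            if depth < max_depth and random.random() < spawn_prob:
                q.append(depth + 1)          # keep one child ...
                left, right = (p - 1) % num_procs, (p + 1) % num_procs
                if strategy == "koso":
                    dest = random.choice([left, right])
                else:                        # KOSO*: least-loaded neighbour
                    dest = min(left, right, key=lambda n: len(queues[n]))
                sends.append((dest, depth + 1))
        for dest, job in sends:              # deliveries arrive after the step
            queues[dest].append(job)
        steps += 1
    return steps

# Tiny comparison run (illustrative only, not the experiment in the slides):
for s in ("koso", "koso*"):
    print(s, simulate(strategy=s))
```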

17
Theory
  • On complete binary trees, KOSO is asymptotically
    optimal
  • So KOSO* can't be any better?
  • But assumptions unrealistic
  • tree not complete
  • asymptotically not necessarily the same as in
    practice!

18
Lesson: Evaluation begins with claims. Lesson:
Demonstration is good, understanding is better
  • Hypothesis (or claim): KOSO takes longer than
    KOSO* because KOSO* balances loads better
  • The "because" phrase indicates a hypothesis about
    why it works. This is a better hypothesis than
    the beauty contest demonstration that KOSO*
    beats KOSO
  • Experiment design
  • Independent variables: KOSO v KOSO*, no. of
    processors, no. of jobs, probability(job will
    spawn), ...
  • Dependent variable: time to complete jobs

19
Criticism: This experiment design includes no
direct measure of the hypothesized effect
  • Hypothesis: KOSO takes longer than KOSO* because
    KOSO* balances loads better
  • But the experiment design includes no direct
    measure of load balancing
  • Independent variables: KOSO v KOSO*, no. of
    processors, no. of jobs, probability(job will
    spawn), ...
  • Dependent variable: time to complete jobs

20
Lesson: The task of empirical work is to explain
variability
Empirical work assumes the variability in a
dependent variable (e.g., run time) is the sum of
causal factors and random noise. Statistical
methods assign parts of this variability to the
factors and the noise.
Number of processors and number of jobs explain
74% of the variance in run time. Algorithm
explains almost none. (A small sketch of this kind
of variance partition follows below.)
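A minimal sketch, using made-up data rather than the KOSO
measurements, of how a least-squares fit assigns variance in run time
to factors versus noise; the factor names, coefficients, and sample
size are illustrative assumptions.

```python
# Partitioning variance between experimental factors (illustrative data).
import numpy as np

rng = np.random.default_rng(0)
n = 200
procs = rng.integers(2, 16, n)          # number of processors
jobs = rng.integers(100, 1000, n)       # number of jobs
algo = rng.integers(0, 2, n)            # 0 = KOSO, 1 = KOSO*
runtime = 5.0 * jobs / procs + 2.0 * algo + rng.normal(0, 30, n)

def r_squared(X, y):
    """Fraction of variance in y explained by a least-squares fit on X."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

print("procs + jobs explain:", r_squared(np.column_stack([procs, jobs]), runtime))
print("algorithm explains:  ", r_squared(algo.reshape(-1, 1), runtime))
```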
21
Lesson: Keep the big picture in mind
  • Why are you studying this?
  • Load balancing is important to get good
    performance out of parallel computers
  • Why is this important?
  • Parallel computing promises to tackle many of our
    computational bottlenecks
  • How do we know this? It's in the first paragraph
    of the paper!

22
Empirical Methods for CS
  • Part III: Experiment design

23
Experimental Life Cycle
  • Exploration
  • Hypothesis construction
  • Experiment
  • Data analysis
  • Drawing of conclusions

24
Types of experiment designs
  • Manipulation experiment
  • Observation experiment
  • Factorial experiment

25
Manipulation experiment
  • Independent variable, x
  • x = identity of parser, size of dictionary, ...
  • Dependent variable, y
  • y = accuracy, speed, ...
  • Hypothesis
  • x influences y
  • Manipulation experiment
  • change x, record y

26
Observation experiment
  • Predictor, x
  • x = volatility of stock prices, ...
  • Response variable, y
  • y = fund performance, ...
  • Hypothesis
  • x influences y
  • Observation experiment
  • classify according to x, compute y

27
Factorial experiment
  • Several independent variables, xi
  • there may be no simple causal links
  • data may come that way
  • e.g. programs have different languages,
    algorithms, ...
  • Factorial experiment
  • every possible combination of the xi is
    considered (see the sketch below)
  • expensive, as its name suggests!
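A minimal sketch of enumerating a factorial design: every combination
of the independent variables is run. The factor names and the
run_experiment stub are illustrative assumptions, not from the slides.

```python
# Enumerating a full factorial design (illustrative factors).
from itertools import product

factors = {
    "algorithm": ["KOSO", "KOSO*"],
    "processors": [4, 8, 16],
    "jobs": [100, 1000],
}

def run_experiment(algorithm, processors, jobs):
    # Placeholder for the real measurement (e.g. run time of one trial).
    return 0.0

results = []
for combo in product(*factors.values()):
    setting = dict(zip(factors.keys(), combo))
    results.append((setting, run_experiment(**setting)))

print(len(results), "runs")   # 2 * 3 * 2 = 12 combinations
```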

28
Some problem issues
  • Control
  • Ceiling and Floor effects
  • Sampling Biases

29
Control
  • A control is an experiment in which the
    hypothesised variation does not occur
  • so the hypothesized effect should not occur
    either
  • BUT remember
  • placebos cure a large percentage of patients!

30
Control: MYCIN case study
  • MYCIN was a medical expert system
  • recommended therapy for blood/meningitis
    infections
  • How to evaluate its recommendations?
  • Shortliffe used
  • 10 sample problems, 8 therapy recommenders
  • 5 faculty, 1 resident, 1 postdoc, 1 student
  • 8 impartial judges gave 1 point per problem
  • max score was 80
  • MYCIN 65, faculty 40-60, postdoc 60, resident 45,
    student 30

31
Control: MYCIN case study
  • What were the controls?
  • Control for judges' bias for/against computers
  • judges did not know who recommended each therapy
  • Control for easy problems
  • medical student did badly, so problems not easy
  • Control for our standard being low
  • e.g. random choice should do worse
  • Control for factor of interest
  • e.g. the hypothesis in MYCIN that "knowledge is
    power"
  • have groups with different levels of knowledge

32
Ceiling and Floor Effects
  • Well designed experiments (with good controls)
    can still go wrong
  • What if all our algorithms do particularly well?
  • Or they all do badly?
  • We've got little evidence to choose between them

33
Ceiling and Floor Effects
  • Ceiling effects arise when test problems are
    insufficiently challenging
  • floor effects are the opposite, when problems are
    too challenging
  • A problem in AI because we often repeatedly use
    the same benchmark sets
  • most benchmarks will lose their challenge
    eventually?
  • but how do we detect this effect?

34
Ceiling Effects: machine learning
  • 14 datasets from UCI corpus of benchmarks
  • used as mainstay of ML community
  • Problem is learning classification rules
  • each item is vector of features and a
    classification
  • measure classification accuracy of method (max
    100%)
  • Compare C4 with 1R, two competing algorithms

35
Ceiling Effects: machine learning
  • DataSet  BC    CH    GL    G2    HD    HE    ...  Mean
  • C4       72    99.2  63.2  74.3  73.6  81.2  ...  85.9
  • 1R       72.5  69.2  56.4  77    78    85.1  ...  83.8
  • Max      72.5  99.2  63.2  77    78    85.1  ...  87.4
  • C4 achieves only about 2% better accuracy than 1R
  • The best of C4/1R achieves 87.4% accuracy
  • We have only weak evidence that C4 is better
  • Both methods are performing near the ceiling of
    what is possible, so comparison is hard!

36
Ceiling Effects: machine learning
  • In fact 1R only uses one feature (the best one)
  • C4 uses on average 6.6 features
  • The extra 5.6 features buy only about a 2%
    improvement
  • Conclusion?
  • Either real world learning problems are easy (use
    1R)
  • Or we need more challenging datasets
  • We need to be aware of ceiling effects in results

37
Sampling bias
  • Data collection is biased against certain data
  • e.g. a teacher who says "Girls don't answer maths
    questions"
  • observation might suggest
  • girls don't answer many questions
  • but it may be that the teacher doesn't ask them
    many questions
  • Experienced AI researchers don't do that, right?

38
Sampling bias: Phoenix case study
  • AI system to fight (simulated) forest fires
  • Experiments suggest that wind speed is
    uncorrelated with time to put out the fire
  • obviously incorrect as high winds spread forest
    fires

39
Sampling bias: Phoenix case study
  • Wind speed vs containment time (max 150 hours)
  • wind 3: 120, 55, 79, 10, 140, 26, 15, 110, 12, 54, 10, 103
  • wind 6: 78, 61, 58, 81, 71, 57, 21, 32, 70
  • wind 9: 62, 48, 21, 55, 101
  • What's the problem?

40
Sampling bias: Phoenix case study
  • The cut-off of 150 hours introduces sampling bias
  • many high-wind fires get cut off, but few
    low-wind ones do
  • On the remaining data, there is no correlation
    between wind speed and time (r = -0.53)
  • In fact, the data shows that
  • a lot of high-wind fires take > 150 hours to
    contain
  • those that don't are similar to low-wind fires
  • You wouldn't do this, right?
  • you might if you had automated data analysis
    (a small demonstration of this censoring effect
    follows below)
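A minimal sketch, with simulated fires rather than the Phoenix data,
showing how a cut-off on the dependent variable can weaken or hide a
real correlation; all numbers here are illustrative assumptions.

```python
# Censoring the dependent variable attenuates a real correlation (illustrative).
import random

def correlation(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

random.seed(1)
wind = [random.choice([3, 6, 9]) for _ in range(500)]
time = [20 * w + random.gauss(0, 60) for w in wind]   # time really grows with wind

kept = [(w, t) for w, t in zip(wind, time) if 0 < t <= 150]  # drop fires over 150 h
print("all fires:     r =", round(correlation(wind, time), 2))
print("after cut-off: r =", round(correlation(*zip(*kept)), 2))
```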

41
Empirical Methods for CS
  • Part IV: Data analysis

42
Kinds of data analysis
  • Exploratory data analysis (EDA): looking for
    patterns in data
  • Statistical inferences from sample data
  • Testing hypotheses
  • Estimating parameters
  • Building mathematical models of datasets
  • Machine learning, data mining
  • We will introduce hypothesis testing and
    computer-intensive methods

43
The logic of hypothesis testing
  • Example: toss a coin ten times and observe eight
    heads. Is the coin fair (i.e., what is its
    long-run behavior?) and what is your residual
    uncertainty?
  • You say, "If the coin were fair, then eight or
    more heads is pretty unlikely, so I think the
    coin isn't fair."
  • Like proof by contradiction: assert the opposite
    (the coin is fair), show that the sample result
    (≥ 8 heads) has low probability p, then reject
    the assertion, with residual uncertainty related
    to p.
  • Estimate p with a sampling distribution.

44
The logic of hypothesis testing
  • Establish a null hypothesis: H0: p = .5, the
    coin is fair
  • Establish a statistic: r, the number of heads in
    N tosses
  • Figure out the sampling distribution of r given
    H0
  • The sampling distribution will tell you the
    probability p of a result at least as extreme as
    your sample result, r = 8
  • If this probability is very low, reject H0, the
    null hypothesis
  • Residual uncertainty is p (see the computation
    below)
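A minimal sketch of the coin computation: the probability, under H0 (a
fair coin), of a result at least as extreme as r = 8 heads in N = 10
tosses, using the exact binomial distribution.

```python
# Exact one-tailed binomial p-value for the coin example.
from math import comb

def binomial_tail(n, r, p=0.5):
    """P(at least r successes in n trials), success probability p."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(r, n + 1))

print(binomial_tail(10, 8))   # about 0.055: unlikely under H0, but not negligible
```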

45
A common statistical test: the Z test for
different means
  • A sample of N = 25 computer science students has
    mean IQ m = 135. Are they smarter than average?
  • The population mean is 100 with standard
    deviation 15
  • The null hypothesis, H0, is that the CS students
    are average, i.e., the mean IQ of the
    population of CS students is 100.
  • What is the probability p of drawing the sample
    if H0 were true? If p is small, then H0 is
    probably false.
  • Find the sampling distribution of the mean of a
    sample of size 25, from a population with mean
    100

46
Reject the null hypothesis?
  • Commonly we reject H0 when the probability of
    obtaining the sample statistic (e.g., mean = 135)
    given the null hypothesis is low, say < .05.
  • A test statistic value, e.g. Z = 11.67, recodes
    the sample statistic (mean = 135) to make it easy
    to find the probability of the sample statistic
    given H0.
  • We find the probabilities by looking them up in
    tables, or statistics packages provide them.
  • For example, Pr(Z ≥ 1.65) ≈ .05 and Pr(Z ≥ 2.33)
    ≈ .01.
  • Pr(Z ≥ 11) is approximately zero, so reject H0
    (see the computation below).
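A minimal sketch of the Z computation above: sample mean 135, n = 25,
population mean 100 and standard deviation 15 under H0.

```python
# Z statistic and one-tailed p-value for the IQ example.
from math import erfc, sqrt

def z_test(sample_mean, n, pop_mean, pop_sd):
    """Return the Z statistic and the one-tailed p-value P(Z >= z)."""
    z = (sample_mean - pop_mean) / (pop_sd / sqrt(n))
    p_one_tailed = 0.5 * erfc(z / sqrt(2))   # upper tail of the standard normal
    return z, p_one_tailed

z, p = z_test(135, 25, 100, 15)
print(z, p)   # z is about 11.67; p is effectively zero, so reject H0
```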

47
Summary of hypothesis testing
  • H0 negates what you want to demonstrate; find the
    probability p of the sample statistic under H0 by
    comparing the test statistic to the sampling
    distribution; if the probability is low, reject
    H0 with residual uncertainty proportional to p.
  • Example: we want to demonstrate that CS graduate
    students are smarter than average. H0 is that
    they are average. t = 2.89, p = .022
  • Have we proved CS students are smarter? NO!
  • We have only shown that a mean of 135 is unlikely
    if they aren't. We never prove what we want to
    demonstrate; we only reject H0, with residual
    uncertainty.
  • And failing to reject H0 does not prove H0,
    either!

48
Computer-intensive Methods
  • Basic idea: construct sampling distributions by
    simulating on a computer the process of drawing
    samples.
  • Three main methods
  • Monte Carlo simulation: when one knows the
    population parameters
  • Bootstrap: when one doesn't
  • Randomization: also assumes nothing about the
    population
  • Enormous advantage: works for any statistic and
    makes no strong assumptions (e.g., normality).
    A bootstrap sketch follows below.
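A minimal sketch of the bootstrap: build a sampling distribution for a
statistic by resampling the observed data with replacement. The data
values and the resample count here are illustrative assumptions.

```python
# Bootstrap sampling distribution of the mean (illustrative data).
import random

def bootstrap(data, statistic, n_resamples=1000):
    """Return the statistic computed on n_resamples resamples of the data."""
    out = []
    for _ in range(n_resamples):
        resample = [random.choice(data) for _ in data]
        out.append(statistic(resample))
    return out

random.seed(0)
sample = [12, 15, 9, 22, 17, 30, 11, 25, 14, 19]       # some observed values
means = bootstrap(sample, lambda xs: sum(xs) / len(xs))
means.sort()
# A rough 95% confidence interval for the mean, from the percentiles:
print(means[25], means[975])
```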

49
Summary
  • Empirical CS and AI are exacting sciences
  • There are many ways to do experiments wrong
  • We are experts in doing experiments badly
  • As you perform experiments, you'll make many
    mistakes
  • Learn from those mistakes, and ours!