Title: AI1 Experimental Methodology: Lecture 8/9, Experimental Design and Statistics
AI1 Experimental Methodology, Lecture 8/9
Experimental Design and Statistics
Exp Methods Course
- 1. AI as Experimental Science
- 2. Experiments involving Humans
- 3. Data, visualisation and correlation
- 5. Introduction to Knowledge Based Systems
- 6. Knowledge Acquisition
- 7. Building and Evaluating Symbolic Systems
- 8/9. Experimental design and statistics
- 10. Experiments with other systems
1. Reminder
Tools for Analysing Data
- Data normally comes in sets - a single experiment may involve repeating a test a number of times.
- Visualisation techniques are used for exploratory data analysis
- - display relationships between variables visually to make patterns in the dataset apparent
- - tools for this include MATLAB, a matrix manipulation system with excellent graphical display abilities.
- Statistical tests are used for confirmatory experiments
- - to determine the extent to which an anticipated effect is present in the data from the experiment
- - visualisation plays a much less significant role here.
Class by degree and birth month (figure)
By degree and birth month (figure)
By degree and birth month (figure)
Scatter plots (figure)
Summary statistics
- Summary statistics express a property of the data set in a single number, or a set of a few numbers.
- Most common: mean, mode, median, variance and standard deviation.
- The mean gives the centre of mass of the set:
- \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}
- So the mean of 2, 3, 6, 1, 5, 1 is 18/6 = 3.
Variance
- The variance \sigma_x^2 is the mean squared deviation from the centre:
- \sigma_x^2 = \frac{(x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n}
- The standard deviation \sigma_x is the square root of the variance:
- \sigma_x = \sqrt{\sigma_x^2}
- (a short code sketch of these statistics follows)
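A minimal sketch (in Python, not part of the original lecture) of the three statistics just defined, using the example data 2, 3, 6, 1, 5, 1; note that the variance here divides by n, as in the formula above.

    # Mean, variance and standard deviation as defined on the slides
    # (variance divides by n, i.e. the population form).
    data = [2, 3, 6, 1, 5, 1]

    n = len(data)
    mean = sum(data) / n                               # centre of mass: 18/6 = 3.0
    variance = sum((x - mean) ** 2 for x in data) / n  # mean squared deviation from the mean
    std_dev = variance ** 0.5                          # square root of the variance

    print(mean, variance, std_dev)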
Linear correlation
- Linear correlation measures how well the data fit the model of a straight-line relationship.
- 1. Compute the means of the x and y data from the scatter plot separately.
- 2. For each point in the scatter plot (pair of data), calculate the deviation of each datum from its mean and multiply them, that is, compute (x - mean(x))(y - mean(y)).
- 3. Sum these products over all the data pairs and divide by N-1 for N data.
- 4. Work out the standard deviations of x and y separately, and divide the sum from step 3 by the product of these standard deviations.
Pearson's Correlation Coefficient
- Measures how well the data fit the straight-line model it assumes:
- r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{(N-1)\,\sigma_x \sigma_y}
- Lies between -1 (low X means high Y)
- and +1 (high X means high Y),
- with 0 meaning no correlation.
- (see the code sketch below for the step-by-step calculation)
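A sketch of the four-step calculation above, in Python; the study-hours and exam-mark values are invented for illustration and are not the lecture's data.

    # Pearson's r following steps 1-4; x and y are hypothetical.
    x = [2, 4, 5, 7, 9, 11]          # hypothetical hours of study
    y = [35, 48, 50, 62, 71, 80]     # hypothetical exam marks

    n = len(x)
    mean_x = sum(x) / n              # step 1: means of x and y
    mean_y = sum(y) / n
    summed = sum((xi - mean_x) * (yi - mean_y)
                 for xi, yi in zip(x, y)) / (n - 1)    # steps 2-3: products summed, / (N-1)
    sd_x = (sum((xi - mean_x) ** 2 for xi in x) / (n - 1)) ** 0.5   # step 4: std deviations
    sd_y = (sum((yi - mean_y) ** 2 for yi in y) / (n - 1)) ** 0.5
    r = summed / (sd_x * sd_y)       # between -1 and +1

    print(round(r, 3))               # close to +1: more study goes with higher marks

scipy.stats.pearsonr(x, y) gives the same coefficient, along with a p-value.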
Study v exam performance (figure)
Study v exam performance (figure)
2. Hypothesis testing
Role of Experiment in Design
- Often experiments are used to guide new designs or to help understand an existing design.
- Programs are not themselves experiments. They are normally part of the basis for conducting experiments (on an algorithm, a system, or a group of people).
- Three types of activity:
- Exploratory: where we are wondering what to design.
- Formative evaluation: where we experiment with a preliminary design with the aim of building a better one.
- Summative evaluation: where a final design is analysed definitively.
Hypothesis Formation
- Typical hypothesis: factor X affects behaviour Y.
- Typical null hypothesis: no effect of X on Y.
- What will we measure about X and Y?
- Will our experiments aim to prove or disprove the (null) hypothesis?
- Observation v Manipulation
- - Observation experiments: look at a population to see if X correlates with Y.
- - Manipulation experiments: change X and see what happens to Y,
- - but we need to be sure that the change in Y is due only to the differences in X.
Attempt to disprove the hypothesis
- Formulate a precise experimental question or hypothesis.
- We are testing whether the evidence supports our hypothesis or not,
- e.g. living near a power cable increases the likelihood of certain cancers,
- or setting the rate of mutation too high in a genetic algorithm results in slow convergence or poor solutions being found.
- Design an experiment to disprove our hypothesis
- - a positive result could be caused by something we haven't thought of
- - but a single negative result disproves the hypothesis.
- This means finding a way to answer the question: are measurements of X and Y related?
Observation v Manipulation
- Observation experiments
- - necessary when we cannot directly manipulate X
- - group subjects based on a measurement of X,
- - e.g. 2 groups, one of people close to power lines and one of those far from power lines,
- - and see the variation in the incidence of cancers.
- Manipulation experiments
- - used when the factor of interest is directly manipulable,
- - e.g. the genetic algorithm example: run the program with different values of the mutation rate parameter and see what happens.
Influence of other factors?
- How do we know that the effects we see (variations in measured behaviour) are due only to changes in the factor of interest?
- Other factors may influence the behaviour of interest and may contaminate our experiments.
- Consider this during experimental design:
- - a well-designed experiment allows us just one explanation for the effects we see in the data it produces,
- - while a poor design may allow many.
- When you look at data, and consider people's conclusions based on it, you always need to ask what else (apart from what they suggest) might account for the effects described.
Almonds are good for you?
- "Almonds: It may sound pretty nutty, but even though almonds are very high in fat ... they may be good for your heart! A major study of 26,000 members of the Seventh Day Adventist Church in the United States showed that those who ate almonds, peanuts and walnuts at least six times a week had an average lifespan of seven years longer than the general population, and a substantially lower rate of heart attack."
- (p. 77, The Food Medicine Bible, Earl Mindell and Carol Colman, 1994.)
- Can we conclude that almonds are good for you?
- It could be the peanuts or walnuts, or the combination.
- Or maybe it only applies to Seventh Day Adventists.
- Or something else is going on.
Control experiments
- To resolve these issues we would need to do more experiments (or do this one more carefully):
- - to demonstrate that almonds, and not the other nuts, accounted for the healthier people;
- - to demonstrate that Seventh Day Adventists are typical of the general population in relation to health.
- Control experiments
- - their purpose is to eliminate alternative explanations of the data obtained from an experiment.
- - They are vitally important: many an interesting experiment has been rendered useless by poor controls.
Types of variables
- Other factors may affect the behaviour we are investigating:
- - the factor we wish to study is the independent variable (the thing we can vary as we choose);
- - the behaviour of interest is the dependent variable (because it depends on the factor(s));
- - other factors are extraneous variables (things that vary without our wanting them to).
- Control experiments try to eliminate the disturbances caused by extraneous variables by controlling them in some way.
Controlling for extraneous variation (1)
- 1. Make the extraneous variable an independent one, and include it in the experiment (if we can),
- - i.e. varying the value of the extraneous variable together with that of the independent variable;
- - only possible if there are not too many extraneous variables.
- 2. Partition the test cases so that the effects of the extraneous variable cancel out,
- - e.g. for the effect of gender on measured intelligence:
- - collect a large number of pairs of 1 male and 1 female,
- - such that each pair is closely matched on age, socio-economic class, domestic situation, training, etc.,
- - so that differences within each pair are due solely to gender.
Controlling for extraneous variation (2)
- 3. Take a random sample of the population of individuals for each of the values of the independent variable, and compare the behaviours of these samples,
- - e.g. run 100 randomly different runs of a genetic algorithm for each chosen value of mutation rate (see the sketch after this list).
- Effects of other, extraneous, variables should appear as random variation in the dependent variable
- - the effects of the independent variable will not be random
- - so a statistical test can distinguish them.
- Be careful that samples really are random with respect to the extraneous variables
- - if there is some cause-effect relationship we don't know about, the effects of extraneous variables may compound instead of cancelling out.
- Care is therefore needed in selecting random samples.
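A sketch of the randomised design just described: many independently seeded runs per mutation rate, so that extraneous variation shows up as noise in the dependent variable. run_ga is a hypothetical stand-in for whatever genetic algorithm is under test.

    # Randomised design: many independently seeded runs per mutation rate.
    # run_ga() is a hypothetical stand-in for the GA under test; here it just
    # returns a random placeholder "solution quality".
    import random

    def run_ga(mutation_rate, seed):
        random.seed(seed)
        # ... the real GA would run here ...
        return random.random()

    mutation_rates = [0.001, 0.01, 0.1]
    runs_per_rate = 100

    results = {m: [run_ga(m, seed) for seed in range(runs_per_rate)]
               for m in mutation_rates}

    for m, scores in results.items():
        print(m, sum(scores) / len(scores))   # mean quality per mutation rate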
Choosing test problems
- How do we choose the set of tests that vary the factor X of interest, and how do we make measurements of the behaviour Y we are studying?
- 1. Make a set of test problems to assess the performance of the system
- - what the performance results tell us depends on what we are comparing our system against
- - test problems should be fair: not so hard that no comparator system could do well on them, nor so easy that any system could do well.
- e.g. MYCIN: can it perform as well as human experts?
- - a set of test problems was given to MYCIN and to human experts
- - human novices were also included in the comparator set.
- If novices and experts both do well, the problems are too easy; if both do badly, they are too hard; a fair test divides the two.
Measurement procedure
- 2. Given our test set, what do we measure?
- - For MYCIN, the responses produced for the test problems were checked by human experts.
- - The experts were not told where the solutions came from, i.e. which were generated by MYCIN and which were generated by the comparator set of humans.
- - Possible biases were controlled for by blinding the judges to information which might bias their response.
- MYCIN was a single-blind trial, since only the judges were unaware whether a solution was human- or machine-generated.
- When knowledge available to the subject (or experimenter) might cause a systematic variation in the measured effects, double-blind trials are also widely used,
- e.g. in drug testing neither subjects nor experimenters know who takes which drugs.
Design (figure)
3. Statistical Tests for Confirmatory Experiments
4. Statistical Measures of Independence: Chi-squared
Evaluating usability example
- We want to evaluate the usability of an interface, so we ask users to rate it as
- 1. easy to use
- 2. average
- 3. difficult to use
- We test it on different groups of users, recording how many users select each rating, for each of
- a. Children (under 12 years)
- b. Teenagers (13 - 18 years)
- c. Adults (over 18)
- If usability is not consistently rated, then ratings will be equally spread across 1 - 3.
- Is there a difference between the different user groups?
Evaluating usability example (table)
Alternative view of the data (table)
Chi-squared statistical test
- Measuring similarity of distributions of data
- - one way is the chi-squared (\chi^2) statistical test
- - it tells us how likely these data could arise by chance if no effect were present.
- Suppose there is no effect present (i.e. the data are independent)
- - we would then expect 130/3 to choose each rating
- - there are different numbers in each group, so we calculate this in proportion to the number in each group.
- Make a table of expected and actual frequencies; the expected value of each cell is
- expected = (row total x column total) / overall total
- (a short code sketch of this calculation follows)
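A minimal sketch of the expected-frequency formula above; the observed counts below are placeholders, not the lecture's usability data.

    # Expected frequency of each cell = row total * column total / overall total.
    # The observed counts are placeholders, not the lecture's usability data.
    observed = [[10, 15, 5],
                [12, 20, 8],
                [14, 16, 30]]

    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    for row in expected:
        print([round(e, 2) for e in row])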
Considering the ratings overall
Square the differences, divide by the expected values and sum:
\chi^2 = \sum \frac{(O - E)^2}{E} = \frac{(36 - 43.3)^2}{43.3} + \frac{(51 - 43.3)^2}{43.3} + \frac{(43 - 43.3)^2}{43.3} = 2.6
Result: \chi^2 = 2.6, df = 2, p > 0.05, NS,
i.e. no significant difference in usability rating across the groups as a whole.
(reproduced with scipy in the sketch below)
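The same one-way test can be reproduced with scipy, which compares the observed totals 36, 51 and 43 against the uniform expectation of 43.3 each; it should give roughly chi-squared = 2.6 and a p-value above 0.05.

    # One-way (goodness-of-fit) chi-squared on the overall rating totals.
    from scipy.stats import chisquare

    observed = [36, 51, 43]              # easy / average / difficult totals (130 ratings)
    chi2_value, p = chisquare(observed)  # uniform expected frequencies by default

    print(round(chi2_value, 2), round(p, 3))   # about 2.6 and p > 0.05: not significant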
Comparison between groups: expected frequencies (table)
Comparison between groups
- \chi^2 = \sum \frac{(O - E)^2}{E}
- = \frac{(7 - 8.9)^2}{8.9} + \frac{(20 - 12.6)^2}{12.6} + \frac{(5 - 10.6)^2}{10.6} + \frac{(33 - 17.2)^2}{17.2} + \ldots (nine cells in all)
- = 0.406 + 4.37 + 2.96 + 13.93 + 0.5 + 6.84 + 9.02 + 0.95 + 14.5 = 53.47
- Degrees of freedom = (r-1)(c-1) = 2 x 2 = 4
- Result: \chi^2 = 53.47, df = 4, p < 0.01
- So we reject the null hypothesis.
What might we infer from this?
- Result: \chi^2 = 53.47, df = 4, p < 0.01
- Look this up in statistical tables: the chance of obtaining the actual frequencies from an experiment with true frequencies equal to those expected is p < 0.01.
- So we reject the null hypothesis,
- i.e. we do appear to have differing ratings of usability between the groups.
- (the sketch below does the same table lookup with scipy)
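Instead of printed tables, the lookup can be done with the chi-squared survival function; for chi-squared = 53.47 with 4 degrees of freedom it gives a probability far below 0.01.

    # Probability of a chi-squared value at least this large arising by chance,
    # given independence (df = 4).
    from scipy.stats import chi2

    p = chi2.sf(53.47, df=4)
    print(p)                 # far below 0.01, so the null hypothesis is rejected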
4. Robotics example
Further example (from Cohen)
- Suppose we have a robot
- - it works in a difficult (windy) environment
- - it has to tackle problems for which it may or may not have time to work out a "best" plan of work to suit the conditions.
- Given different levels of wind speed (W) and different allowed thinking times (T), how do changes in these influence the success of the result (R)?
- Hypothesis: wind speed and outcome are independent when there is plenty of thinking time, but not when there is inadequate thinking time.
- Run observational experiments and see what happens.
- Compare:
- - when thinking time (T) is adequate
- - when thinking time is not adequate.
When thinking time is adequate (table)
Expected values (adequate) (table)
Expected values (adequate)
- \chi^2 = \sum \frac{(O - E)^2}{E}
- For T adequate this is
- \frac{(30-27.95)^2}{27.95} + \frac{(5-7.05)^2}{7.05} + \frac{(32-31.94)^2}{31.94} + \frac{(8-8.06)^2}{8.06} + \frac{(53-55.1)^2}{55.1} + \frac{(16-13.9)^2}{13.9} = 1.145
- A low value, so a small difference.
- Probability of independence is 0.56 (not significant).
When thinking time is inadequate (table)
Expected values (inadequate) (table)
Expected values (inadequate)
- \chi^2 = \sum \frac{(O - E)^2}{E}
- For T inadequate this is
- \frac{(55-42.71)^2}{42.71} + \frac{(30-42.29)^2}{42.29} + \frac{(35-38.69)^2}{38.69} + \frac{(42-38.31)^2}{38.31} + \frac{(10-18.59)^2}{18.59} + \frac{(27-18.41)^2}{18.41} = 15.79
- A high value, so a big difference.
- Probability of independence is 0.0004 (significant at the < 0.01 level).
- (see the scipy check below)
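Both robot tables are given in full above, so the whole test can be checked with scipy.stats.chi2_contingency (reading the counts as three wind-speed levels against success/failure, which is how the sums above are laid out); it reproduces chi-squared of about 1.15 (p about 0.56) for adequate thinking time and about 15.8 (p about 0.0004) for inadequate.

    # Check of the two robot tables (rows: three wind-speed levels,
    # columns: success / failure, counts as read off the sums above).
    from scipy.stats import chi2_contingency

    adequate   = [[30, 5], [32, 8], [53, 16]]
    inadequate = [[55, 30], [35, 42], [10, 27]]

    for label, table in [("adequate", adequate), ("inadequate", inadequate)]:
        chi2_value, p, dof, expected = chi2_contingency(table)
        print(label, round(chi2_value, 2), round(p, 4), dof)
    # adequate:   chi-squared about 1.15, p about 0.56   (looks independent)
    # inadequate: chi-squared about 15.8, p about 0.0004 (not independent)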
Conclusions regarding the hypotheses
- When thinking time is adequate, the probability of independence is 0.56 (not significant).
- When thinking time is inadequate, the probability of independence is 0.0004 (significant at the < 0.01 level).
- So we reject the null hypothesis.
- This supports the hypothesis:
- - wind speed and outcome are independent when there is plenty of thinking time, but not when there is inadequate thinking time.
Chi-squared - summary
- 1. Assume that the data are independent.
- 2. Calculate the expected frequencies of each kind of result for a sample of the same size and composition as the one you have, given the independence assumption.
- 3. Calculate the squared deviation between actual and expected frequencies, divide each by the expected frequency, and sum over the whole table.
- 4. Work out the degrees of freedom.
- 5. Consult tables giving chi-squared distribution probabilities to find the chance that the data could have been generated by accident, given the assumption of independence.
Presenting Experimental Work
- 1. Give enough information so that conclusions and analysis can be checked by an interested reader,
- - i.e. state sample sizes, sample variances and means, and other statistical information the reader may need.
- 2. Give enough information for the reader to be able to replicate your work - make replication possible,
- - i.e. give clear descriptions of methods used, parameters chosen and details of algorithms; make training and test data sets available, and say where to get them from.
- In general, display data visually in informative ways.
- Use tools such as MATLAB to create clear and effective graphical presentations that convey the information (a rough matplotlib sketch follows).
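The slides recommend MATLAB; the following is a rough matplotlib equivalent showing means with error bars. The numbers are invented purely to demonstrate the plotting calls.

    # Means with standard-deviation error bars per condition; numbers invented
    # purely to demonstrate the plotting calls.
    import matplotlib.pyplot as plt

    mutation_rates = [0.001, 0.01, 0.1]       # hypothetical conditions
    mean_quality   = [0.62, 0.71, 0.55]       # hypothetical summary statistics
    std_quality    = [0.05, 0.04, 0.09]

    plt.errorbar(mutation_rates, mean_quality, yerr=std_quality, fmt="o-", capsize=4)
    plt.xscale("log")
    plt.xlabel("mutation rate")
    plt.ylabel("mean solution quality (+/- 1 s.d.)")
    plt.title("Report sample sizes, means and variances alongside the plot")
    plt.show()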
5. Genetic algorithms example
Genetic algorithms example (figure)
Genetic algorithms example (figure)
GAs (figure)
GAs (figure)
GAs (figure)
GAs - t-test (figure)
6. Statistical Measures of Independence: t-test
Do another example
By whatever.
Further example (figure)
Further example (figure)
Further example (figure)
Standard Error I
Standard Error II
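The closing slides on the t-test and standard error survive here only as titles, so the following is just a hedged sketch of those two ideas: the standard error of the mean (s / sqrt(n)) and a two-sample t-test, applied to two made-up sets of GA run scores.

    # Standard error of the mean and a two-sample t-test on two made-up
    # sets of GA run scores (low vs high mutation rate).
    from math import sqrt
    from statistics import mean, stdev
    from scipy.stats import ttest_ind

    low_mutation  = [0.61, 0.64, 0.59, 0.66, 0.62, 0.63]   # hypothetical run scores
    high_mutation = [0.52, 0.55, 0.49, 0.58, 0.50, 0.54]

    for scores in (low_mutation, high_mutation):
        se = stdev(scores) / sqrt(len(scores))    # standard error of the mean
        print(round(mean(scores), 3), round(se, 4))

    t, p = ttest_ind(low_mutation, high_mutation)   # do the group means differ?
    print(round(t, 2), round(p, 4))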