Title: AI1 Experimental Methodology: Lecture 8/9, Experimental Design and Statistics
AI1 Experimental Methodology, Lecture 8/9
Experimental Design and Statistics
Exp Methods Course
- 1. AI as Experimental Science
- 2. Experiments involving Humans
- 3. Data, visualisation and correlation
- 5. Introduction to Knowledge Based Systems
- 6. Knowledge Acquisition
- 7. Building and Evaluating Symbolic Systems
- 8/9. Experimental design and statistics
- 10. Experiments with other systems
1. Reminder
Tools for Analysing Data
- Data normally comes in sets - a single experiment may involve repeating a test a number of times.
- Visualisation techniques are used for exploratory data analysis
- - display relationships between variables visually to make patterns in the dataset apparent
- - tools for this include MATLAB, a matrix manipulation system with excellent graphical display abilities.
- Statistical tests are used for confirmatory experiments
- - to determine the extent to which an anticipated effect is present in the data from the experiment
- - visualisation plays a much less significant role here.
Class by degree and birth month (figure)
By degree and birth month (figure)
By degree and birth month (figure)
Scatter plots (figure)
Summary statistics
- Summary statistics express a property of the data set in a single number, or a set of a few numbers.
- Most common: mean, mode, median, variance and standard deviation.
- The mean gives the centre of mass of the set:
- \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}
- So the mean of 2, 3, 6, 1, 5, 1 is 18/6 = 3.
Variance
- The variance \sigma_x^2 is the mean squared deviation from the centre:
- \sigma_x^2 = \frac{(x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n}
- The standard deviation \sigma_x is the square root of the variance:
- \sigma_x = \sqrt{\sigma_x^2}
- (a short code sketch of these statistics follows)
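A minimal sketch (in Python, not part of the original lecture) of the three statistics just defined, using the example data 2, 3, 6, 1, 5, 1; note that the variance here divides by n, as in the formula above.

    # Mean, variance and standard deviation as defined on the slides
    # (variance divides by n, i.e. the population form).
    data = [2, 3, 6, 1, 5, 1]

    n = len(data)
    mean = sum(data) / n                               # centre of mass: 18/6 = 3.0
    variance = sum((x - mean) ** 2 for x in data) / n  # mean squared deviation from the mean
    std_dev = variance ** 0.5                          # square root of the variance

    print(mean, variance, std_dev)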
Linear correlation
- Linear correlation measures how well the data fit the model of a straight-line relationship.
- 1. Compute the means of the x and y data from the scatter plot separately.
- 2. For each point in the scatter plot (pair of data), calculate the deviation of each datum from its mean and multiply them, that is, compute (x - mean(x))(y - mean(y)).
- 3. Sum these products over all the data pairs and divide by N-1 for N data.
- 4. Work out the standard deviations of x and y separately, and divide the sum from step 3 by the product of these standard deviations.
Pearson's Correlation Coefficient
- Measures how well the data fit the straight-line model it assumes:
- r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{(N-1)\,\sigma_x \sigma_y}
- Lies between -1 (low X means high Y)
- and +1 (high X means high Y),
- with 0 meaning no correlation.
- (see the code sketch below for the step-by-step calculation)
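A sketch of the four-step calculation above, in Python; the study-hours and exam-mark values are invented for illustration and are not the lecture's data.

    # Pearson's r following steps 1-4; x and y are hypothetical.
    x = [2, 4, 5, 7, 9, 11]          # hypothetical hours of study
    y = [35, 48, 50, 62, 71, 80]     # hypothetical exam marks

    n = len(x)
    mean_x = sum(x) / n              # step 1: means of x and y
    mean_y = sum(y) / n
    summed = sum((xi - mean_x) * (yi - mean_y)
                 for xi, yi in zip(x, y)) / (n - 1)    # steps 2-3: products summed, / (N-1)
    sd_x = (sum((xi - mean_x) ** 2 for xi in x) / (n - 1)) ** 0.5   # step 4: std deviations
    sd_y = (sum((yi - mean_y) ** 2 for yi in y) / (n - 1)) ** 0.5
    r = summed / (sd_x * sd_y)       # between -1 and +1

    print(round(r, 3))               # close to +1: more study goes with higher marks

scipy.stats.pearsonr(x, y) gives the same coefficient, along with a p-value.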
Study v exam performance (figure)
Study v exam performance (figure)
2. Hypothesis testing
Role of Experiment in Design
- Often experiments are used to guide new designs or to help understand an existing design.
- Programs are not themselves experiments. They are normally part of the basis for conducting experiments (on an algorithm, a system, or a group of people).
- Three types of activity:
- Exploratory: where we are wondering what to design.
- Formative evaluation: where we experiment with a preliminary design with the aim of building a better one.
- Summative evaluation: where a final design is analysed definitively.
Hypothesis Formation
- Typical hypothesis: factor X affects behaviour Y.
- Typical null hypothesis: no effect of X on Y.
- What will we measure about X and Y?
- Will our experiments aim to prove or disprove the (null) hypothesis?
- Observation v Manipulation
- - Observation experiments: look at a population to see if X correlates with Y.
- - Manipulation experiments: change X and see what happens to Y,
- - but we need to be sure that the change in Y is due only to the differences in X.
Attempt to disprove the hypothesis
- Formulate a precise experimental question or hypothesis.
- We are testing whether the evidence supports our hypothesis or not,
- e.g. living near a power cable increases the likelihood of certain cancers,
- or setting the rate of mutation too high in a genetic algorithm results in slow convergence or poor solutions being found.
- Design an experiment to disprove our hypothesis
- - a positive result could be caused by something we haven't thought of
- - but a single negative result disproves the hypothesis.
- This means finding a way to answer the question: are measurements of X and Y related?
Observation v Manipulation
- Observation experiments
- - necessary when we cannot directly manipulate X
- - group subjects based on a measurement of X,
- - e.g. 2 groups, one of people close to power lines and one of those far from power lines,
- - and see the variation in the incidence of cancers.
- Manipulation experiments
- - used when the factor of interest is directly manipulable,
- - e.g. the genetic algorithm example: run the program with different values of the mutation rate parameter and see what happens.
Influence of other factors?
- How do we know that the effects we see (variations in measured behaviour) are due only to changes in the factor of interest?
- Other factors may influence the behaviour of interest and may contaminate our experiments.
- Consider this during experimental design:
- - a well-designed experiment allows us just one explanation for the effects we see in the data it produces,
- - while a poor design may allow many.
- When you look at data, and consider people's conclusions based on it, you always need to ask what else (apart from what they suggest) might account for the effects described.
Almonds are good for you?
- "Almonds: It may sound pretty nutty, but even though almonds are very high in fat ... they may be good for your heart! A major study of 26,000 members of the Seventh Day Adventist Church in the United States showed that those who ate almonds, peanuts and walnuts at least six times a week had an average lifespan of seven years longer than the general population, and a substantially lower rate of heart attack."
- (p. 77, The Food Medicine Bible, Earl Mindell and Carol Colman, 1994.)
- Can we conclude that almonds are good for you?
- It could be the peanuts or walnuts, or the combination.
- Or maybe it only applies to Seventh Day Adventists.
- Or something else is going on.
Control experiments
- To resolve these issues we would need to do more experiments (or do this one more carefully):
- - to demonstrate that almonds, and not the other nuts, accounted for the healthier people;
- - to demonstrate that Seventh Day Adventists are typical of the general population in relation to health.
- Control experiments
- - their purpose is to eliminate alternative explanations of the data obtained from an experiment.
- - They are vitally important: many an interesting experiment has been rendered useless by poor controls.
Types of variables
- Other factors may affect the behaviour we are investigating:
- - the factor we wish to study is the independent variable (the thing we can vary as we choose);
- - the behaviour of interest is the dependent variable (because it depends on the factor(s));
- - other factors are extraneous variables (things that vary without our wanting them to).
- Control experiments try to eliminate the disturbances caused by extraneous variables by controlling them in some way.
Controlling for extraneous variation (1)
- 1. Make the extraneous variable an independent one, and include it in the experiment (if we can),
- - i.e. varying the value of the extraneous variable together with that of the independent variable;
- - only possible if there are not too many extraneous variables.
- 2. Partition the test cases so that the effects of the extraneous variable cancel out,
- - e.g. for the effect of gender on measured intelligence:
- - collect a large number of pairs of 1 male and 1 female,
- - such that each pair is closely matched on age, socio-economic class, domestic situation, training, etc.,
- - so that differences within each pair are due solely to gender.
Controlling for extraneous variation (2)
- 3. Take a random sample of the population of individuals for each of the values of the independent variable, and compare the behaviours of these samples,
- - e.g. run 100 randomly different runs of a genetic algorithm for each chosen value of mutation rate (see the sketch after this list).
- Effects of other, extraneous, variables should appear as random variation in the dependent variable
- - the effects of the independent variable will not be random
- - so a statistical test can distinguish them.
- Be careful that samples really are random with respect to the extraneous variables
- - if there is some cause-effect relationship we don't know about, the effects of extraneous variables may compound instead of cancelling out.
- Care is therefore needed in selecting random samples.
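A sketch of the randomised design just described: many independently seeded runs per mutation rate, so that extraneous variation shows up as noise in the dependent variable. run_ga is a hypothetical stand-in for whatever genetic algorithm is under test.

    # Randomised design: many independently seeded runs per mutation rate.
    # run_ga() is a hypothetical stand-in for the GA under test; here it just
    # returns a random placeholder "solution quality".
    import random

    def run_ga(mutation_rate, seed):
        random.seed(seed)
        # ... the real GA would run here ...
        return random.random()

    mutation_rates = [0.001, 0.01, 0.1]
    runs_per_rate = 100

    results = {m: [run_ga(m, seed) for seed in range(runs_per_rate)]
               for m in mutation_rates}

    for m, scores in results.items():
        print(m, sum(scores) / len(scores))   # mean quality per mutation rate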
Choosing test problems
- How do we choose the set of tests that vary the factor X of interest, and how do we make measurements of the behaviour Y we are studying?
- 1. Make a set of test problems to assess the performance of the system
- - what the performance results tell us depends on what we are comparing our system against
- - test problems should be fair: not so hard that no comparator system could do well on them, nor so easy that any system could do well.
- e.g. MYCIN: can it perform as well as human experts?
- - a set of test problems was given to MYCIN and to human experts
- - human novices were also included in the comparator set.
- If novices and experts both do well, the problems are too easy; if both do badly, they are too hard; a fair test divides the two.
Measurement procedure
- 2. Given our test set, what do we measure?
- - For MYCIN, the responses produced for the test problems were checked by human experts.
- - The experts were not told where the solutions came from, i.e. which were generated by MYCIN and which were generated by the comparator set of humans.
- - Possible biases were controlled for by blinding the judges to information which might bias their response.
- MYCIN was a single-blind trial, since only the judges were unaware whether a solution was human- or machine-generated.
- When knowledge available to the subject (or experimenter) might cause a systematic variation in the measured effects, double-blind trials are also widely used,
- e.g. in drug testing neither subjects nor experimenters know who takes which drugs.
Design (figure)
3. Statistical Tests for Confirmatory Experiments
4. Statistical Measures of Independence: Chi-squared
Evaluating usability example
- We want to evaluate the usability of an interface, so we ask users to rate it as
- 1. easy to use
- 2. average
- 3. difficult to use
- We test it on different groups of users, recording how many users select each rating, for each of
- a. Children (under 12 years)
- b. Teenagers (13 - 18 years)
- c. Adults (over 18)
- If usability is not consistently rated, then ratings will be equally spread across 1 - 3.
- Is there a difference between the different user groups?
Evaluating usability example (table)
Alternative view of the data (table)
Chi-squared statistical test
- Measuring similarity of distributions of data
- - one way is the chi-squared (\chi^2) statistical test
- - it tells us how likely these data could arise by chance if no effect were present.
- Suppose there is no effect present (i.e. the data are independent)
- - we would then expect 130/3 to choose each rating
- - there are different numbers in each group, so we calculate this in proportion to the number in each group.
- Make a table of expected and actual frequencies; the expected value of each cell is
- expected = (row total x column total) / overall total
- (a short code sketch of this calculation follows)
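A minimal sketch of the expected-frequency formula above; the observed counts below are placeholders, not the lecture's usability data.

    # Expected frequency of each cell = row total * column total / overall total.
    # The observed counts are placeholders, not the lecture's usability data.
    observed = [[10, 15, 5],
                [12, 20, 8],
                [14, 16, 30]]

    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    for row in expected:
        print([round(e, 2) for e in row])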
Considering the ratings overall
Square the differences, divide by the expected values and sum:
\chi^2 = \sum \frac{(O - E)^2}{E} = \frac{(36 - 43.3)^2}{43.3} + \frac{(51 - 43.3)^2}{43.3} + \frac{(43 - 43.3)^2}{43.3} = 2.6
Result: \chi^2 = 2.6, df = 2, p > 0.05, NS,
i.e. no significant difference in usability rating across the groups as a whole.
(reproduced with scipy in the sketch below)
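The same one-way test can be reproduced with scipy, which compares the observed totals 36, 51 and 43 against the uniform expectation of 43.3 each; it should give roughly chi-squared = 2.6 and a p-value above 0.05.

    # One-way (goodness-of-fit) chi-squared on the overall rating totals.
    from scipy.stats import chisquare

    observed = [36, 51, 43]              # easy / average / difficult totals (130 ratings)
    chi2_value, p = chisquare(observed)  # uniform expected frequencies by default

    print(round(chi2_value, 2), round(p, 3))   # about 2.6 and p > 0.05: not significant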
Comparison between groups: expected frequencies (table)
Comparison between groups
- \chi^2 = \sum \frac{(O - E)^2}{E}
- = \frac{(7 - 8.9)^2}{8.9} + \frac{(20 - 12.6)^2}{12.6} + \frac{(5 - 10.6)^2}{10.6} + \frac{(33 - 17.2)^2}{17.2} + \ldots (nine cells in all)
- = 0.406 + 4.37 + 2.96 + 13.93 + 0.5 + 6.84 + 9.02 + 0.95 + 14.5 = 53.47
- Degrees of freedom = (r-1)(c-1) = 2 x 2 = 4
- Result: \chi^2 = 53.47, df = 4, p < 0.01
- So we reject the null hypothesis.
What might we infer from this?
- Result: \chi^2 = 53.47, df = 4, p < 0.01
- Look this up in statistical tables: the chance of obtaining the actual frequencies from an experiment with true frequencies equal to those expected is p < 0.01.
- So we reject the null hypothesis,
- i.e. we do appear to have differing ratings of usability between the groups.
- (the sketch below does the same table lookup with scipy)
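Instead of printed tables, the lookup can be done with the chi-squared survival function; for chi-squared = 53.47 with 4 degrees of freedom it gives a probability far below 0.01.

    # Probability of a chi-squared value at least this large arising by chance,
    # given independence (df = 4).
    from scipy.stats import chi2

    p = chi2.sf(53.47, df=4)
    print(p)                 # far below 0.01, so the null hypothesis is rejected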
4. Robotics example
Further example (from Cohen)
- Suppose we have a robot
- - it works in a difficult (windy) environment
- - it has to tackle problems for which it may or may not have time to work out a "best" plan of work to suit the conditions.
- Given different levels of wind speed (W) and different allowed thinking times (T), how do changes in these influence the success of the result (R)?
- Hypothesis: wind speed and outcome are independent when there is plenty of thinking time, but not when there is inadequate thinking time.
- Run observational experiments and see what happens.
- Compare:
- - when thinking time (T) is adequate
- - when thinking time is not adequate.
When thinking time is adequate (table)
Expected values (adequate) (table)
Expected values (adequate)
- \chi^2 = \sum \frac{(O - E)^2}{E}
- For T adequate this is
- \frac{(30-27.95)^2}{27.95} + \frac{(5-7.05)^2}{7.05} + \frac{(32-31.94)^2}{31.94} + \frac{(8-8.06)^2}{8.06} + \frac{(53-55.1)^2}{55.1} + \frac{(16-13.9)^2}{13.9} = 1.145
- A low value, so a small difference.
- Probability of independence is 0.56 (not significant).
When thinking time is inadequate (table)
Expected values (inadequate) (table)
Expected values (inadequate)
- \chi^2 = \sum \frac{(O - E)^2}{E}
- For T inadequate this is
- \frac{(55-42.71)^2}{42.71} + \frac{(30-42.29)^2}{42.29} + \frac{(35-38.69)^2}{38.69} + \frac{(42-38.31)^2}{38.31} + \frac{(10-18.59)^2}{18.59} + \frac{(27-18.41)^2}{18.41} = 15.79
- A high value, so a big difference.
- Probability of independence is 0.0004 (significant at the < 0.01 level).
- (see the scipy check below)
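Both robot tables are given in full above, so the whole test can be checked with scipy.stats.chi2_contingency (reading the counts as three wind-speed levels against success/failure, which is how the sums above are laid out); it reproduces chi-squared of about 1.15 (p about 0.56) for adequate thinking time and about 15.8 (p about 0.0004) for inadequate.

    # Check of the two robot tables (rows: three wind-speed levels,
    # columns: success / failure, counts as read off the sums above).
    from scipy.stats import chi2_contingency

    adequate   = [[30, 5], [32, 8], [53, 16]]
    inadequate = [[55, 30], [35, 42], [10, 27]]

    for label, table in [("adequate", adequate), ("inadequate", inadequate)]:
        chi2_value, p, dof, expected = chi2_contingency(table)
        print(label, round(chi2_value, 2), round(p, 4), dof)
    # adequate:   chi-squared about 1.15, p about 0.56   (looks independent)
    # inadequate: chi-squared about 15.8, p about 0.0004 (not independent)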
Conclusions regarding the hypotheses
- When thinking time is adequate, the probability of independence is 0.56 (not significant).
- When thinking time is inadequate, the probability of independence is 0.0004 (significant at the < 0.01 level).
- So we reject the null hypothesis.
- This supports the hypothesis:
- - wind speed and outcome are independent when there is plenty of thinking time, but not when there is inadequate thinking time.
Chi-squared - summary
- 1. Assume that the data are independent.
- 2. Calculate the expected frequencies of each kind of result for a sample of the same size and composition as the one you have, given the independence assumption.
- 3. Calculate the squared deviation between actual and expected frequencies, divide each by the expected frequency, and sum over the whole table.
- 4. Work out the degrees of freedom.
- 5. Consult tables giving chi-squared distribution probabilities to find the chance that the data could have been generated by accident, given the assumption of independence.
Presenting Experimental Work
- 1. Give enough information so that conclusions and analysis can be checked by an interested reader,
- - i.e. state sample sizes, sample variances and means, and other statistical information the reader may need.
- 2. Give enough information for the reader to be able to replicate your work - make replication possible,
- - i.e. give clear descriptions of methods used, parameters chosen and details of algorithms; make training and test data sets available, and say where to get them from.
- In general, display data visually in informative ways.
- Use tools such as MATLAB to create clear and effective graphical presentations that convey the information (a rough matplotlib sketch follows).
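The slides recommend MATLAB; the following is a rough matplotlib equivalent showing means with error bars. The numbers are invented purely to demonstrate the plotting calls.

    # Means with standard-deviation error bars per condition; numbers invented
    # purely to demonstrate the plotting calls.
    import matplotlib.pyplot as plt

    mutation_rates = [0.001, 0.01, 0.1]       # hypothetical conditions
    mean_quality   = [0.62, 0.71, 0.55]       # hypothetical summary statistics
    std_quality    = [0.05, 0.04, 0.09]

    plt.errorbar(mutation_rates, mean_quality, yerr=std_quality, fmt="o-", capsize=4)
    plt.xscale("log")
    plt.xlabel("mutation rate")
    plt.ylabel("mean solution quality (+/- 1 s.d.)")
    plt.title("Report sample sizes, means and variances alongside the plot")
    plt.show()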
5. Genetic algorithms example
Genetic algorithms example (figure)
Genetic algorithms example (figure)
GAs (figure)
GAs (figure)
GAs (figure)
GAs - t-test (figure)
6. Statistical Measures of Independence: t-test
Do another example
By whatever.
Further example (figure)
Further example (figure)
Further example (figure)
Standard Error I
Standard Error II
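The closing slides on the t-test and standard error survive here only as titles, so the following is just a hedged sketch of those two ideas: the standard error of the mean (s / sqrt(n)) and a two-sample t-test, applied to two made-up sets of GA run scores.

    # Standard error of the mean and a two-sample t-test on two made-up
    # sets of GA run scores (low vs high mutation rate).
    from math import sqrt
    from statistics import mean, stdev
    from scipy.stats import ttest_ind

    low_mutation  = [0.61, 0.64, 0.59, 0.66, 0.62, 0.63]   # hypothetical run scores
    high_mutation = [0.52, 0.55, 0.49, 0.58, 0.50, 0.54]

    for scores in (low_mutation, high_mutation):
        se = stdev(scores) / sqrt(len(scores))    # standard error of the mean
        print(round(mean(scores), 3), round(se, 4))

    t, p = ttest_ind(low_mutation, high_mutation)   # do the group means differ?
    print(round(t, 2), round(p, 4))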