AI1 Experimental Methodology: Lecture 8/9 Experimental Design and Statistics
1
AI1 Experimental Methodology Lecture 8/9
Experimental Design and Statistics
2
Exp Methods Course
  • 1. AI as Experimental Science
  • 2. Experiments involving Humans
  • 3. Data, visualisation and correlation
  • 5. Introduction to Knowledge Based Systems
  • 6. Knowledge Acquisition
  • 7. Building and Evaluating Symbolic Systems
  • 8/9. Experimental design and statistics
  • 10. Experiments with other systems

3
1. Reminder
4
Tools for Analysing Data
  • Data normally comes in sets - a single experiment
    may involve repeating a test a number of times.
  • Visualisation techniques are used for exploratory
    data analysis
  • - display relationships between variables
    visually to make patterns in the dataset apparent
  • - one tool for this is MATLAB, a matrix
    manipulation system with excellent graphical
    display abilities.
  • Statistical tests are used for confirmatory
    experiments
  • - to determine the extent to which an anticipated
    effect is present in the data from the experiment
  • - visualisation plays a much less significant
    role here
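For exploratory visualisation, a scatter plot is often the first step. A minimal sketch in Python with matplotlib (an alternative to MATLAB; the study-hours data here are hypothetical):

import matplotlib.pyplot as plt

# Hypothetical data: hours of study (x) versus exam mark (y)
study_hours = [2, 5, 1, 8, 4, 7, 3, 6]
exam_marks = [45, 60, 38, 82, 55, 75, 50, 68]

# A scatter plot makes any relationship between the variables apparent
plt.scatter(study_hours, exam_marks)
plt.xlabel("Hours of study")
plt.ylabel("Exam mark")
plt.title("Study v exam performance")
plt.show()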

5
Class by degree and birth month
6
By degree and birth month
7
By degree and birth month
8
Scatter plots
9
Summary statistics
  • Summary statistics express a property of the data
    set in a single number or a set of a few numbers.
  • Most common: mean, mode, median, variance and
    standard deviation
  • Mean gives the centre of mass of the set:
  • $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
  • So the mean of 2, 3, 6, 1, 5, 1 is 18/6 = 3

10
Variance
  • Variance $\sigma_x^2$ is the mean squared deviation
    from the centre:
  • $\sigma_x^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$
  • The standard deviation $\sigma_x$ is the square root
    of the variance:
  • $\sigma_x = \sqrt{\sigma_x^2}$
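A quick check of these definitions in Python (a sketch, not from the slides), using the example data above; note that numpy's np.var divides by n by default, matching the definition here:

import numpy as np

data = np.array([2, 3, 6, 1, 5, 1])

mean = data.sum() / len(data)           # centre of mass: 18/6 = 3.0
variance = ((data - mean) ** 2).mean()  # mean squared deviation: 22/6 ≈ 3.67
std_dev = np.sqrt(variance)             # square root of the variance ≈ 1.91

print(mean, variance, std_dev)
# numpy's built-ins agree: np.mean(data), np.var(data), np.std(data)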

11
Linear correlation
  • Linear correlation measures how well the data fit
    the model of a straight line relationship.
  • 1. Compute the means of the x and y data from the
    scatter plot separately.
  • 2. For each point in the scatter plot (pair of
    data) calculate the deviation of each datum from
    its mean and multiply, that is
  • compute (x - mean(x))(y - mean(y))
  • 3. Sum these products for all the data pairs and
    divide by N-1 for N data.
  • 4. Work out the standard deviations of x and y
    separately, and divide the sum from step 3 by the
    product of these standard deviations.

12
Pearson's Correlation Coefficient
  • Measures how well the data fit the straight line
    model it assumes
  • correlation $r = \frac{\sum (x - \bar{x})(y - \bar{y})}{(N-1)\,\sigma_x \sigma_y}$
  • Lies between -1 (low X means high Y)
  • and 1 (high X means high Y)
  • with 0 meaning no correlation
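A minimal sketch of the four-step recipe in Python (the paired data are hypothetical); np.corrcoef provides the same value as a cross-check:

import numpy as np

# Hypothetical paired data: hours of study (x) and exam mark (y)
x = np.array([2, 5, 1, 8, 4, 7, 3, 6], dtype=float)
y = np.array([45, 60, 38, 82, 55, 75, 50, 68], dtype=float)
N = len(x)

# Steps 1 and 2: means, then the product of deviations for each pair
products = (x - x.mean()) * (y - y.mean())

# Step 3: sum the products and divide by N-1
covariance = products.sum() / (N - 1)

# Step 4: divide by the product of the sample standard deviations
r = covariance / (x.std(ddof=1) * y.std(ddof=1))

print(r)                        # Pearson's r computed by the recipe
print(np.corrcoef(x, y)[0, 1])  # numpy's value agrees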

13
Study v exam performance
14
Study v exam performance
15
2. Hypothesis testing
16
Role of Experiment in Design
  • Often experiments are used to guide new designs
    or help understand existing design.
  • Programs are not themselves experiments. They
    are normally a part of the basis for conducting
    experiments (on an algorithm or a system or a
    group of people).
  • Three types of activity
  • Exploratory: where we are wondering what to
    design.
  • Formative evaluation: where we experiment with a
    preliminary design with the aim of building a
    better one.
  • Summative evaluation: where a final design is
    analysed definitively.

17
Hypothesis Formation
  • Typical hypothesis: factor X affects behaviour Y.
  • Typical null hypothesis: X has no effect on Y.
  • What will we measure about X and Y?
  • Will our experiments aim to prove or disprove
    the (null) hypothesis?
  • Observation v Manipulation
  • - Observation experiments: look at a population to
    see if X correlates with Y.
  • - Manipulation experiments: change X and see
    what happens to Y,
  • but we need to be sure that any change in Y is due
    only to the differences in X.

18
Attempt to disprove hypothesis
  • Formulate a precise experimental question or
    hypothesis
  • Testing whether evidence supports our hypothesis
    or not,
  • e.g. living near a power cable increases the
    likelihood of certain cancers,
  • or setting the rate of mutation too high in a
    genetic algorithm results in slow convergence or
    poor solutions being found
  • Design an experiment to disprove our hypothesis
  • - a positive result could be caused by something
    we haven't thought of
  • - but a single negative result disproves the
    hypothesis
  • This means finding a way to answer the question
  • Are measurements of X and Y related?

19
Observation v Manipulation
  • Observation experiments
  • necessary when cannot directly manipulate X
  • group subjects based on measurement of X
  • e.g. 2 groups, one of people close to power lines,
    one of those far from power lines,
  • and see the variation in incidence of cancers
  • Manipulation experiments
  • when factor of interest directly manipulable.
  • e.g. genetic algorithm example - run the program
    with
  • different values of the mutation rate
    parameter
  • and see what happens

20
Influence of other factors?
  • How do we know that the effects we see (variations
    in measured behaviour) are due only to changes in
    the factor of interest?
  • other factors may influence behaviour of interest
    and may contaminate our experiments.
  • Consider this during experimental design
  • well-designed experiment allows us just one
    explanation for effects we see in data it
    produces
  • while a poor design may allow many.
  • When you look at data, and consider people's
    conclusions based on it, you need always to ask
    what else (apart from what they suggest) might
    account for the effects described.

21
Almonds are good for you?
  • Almonds It may sound pretty nutty, but even
    though almonds are very high in fat ... they may
    be good for your heart! A major study of 26,000
    members of the Seventh Day Adventist Church in
    the United States showed that those who ate
    almonds, peanuts and walnuts at least six times a
    week had an average lifespan of seven years
    longer than the general population, and a
    substantially lower rate of heart attack.
  • (p. 77, The Food Medicine Bible,
  • Earl Mindell and Carol Colman, 1994.)
  • Can we conclude that almonds are good for you?
  • It could be the peanuts or walnuts, or a
    combination
  • Or maybe it only applies to Seventh Day Adventists
  • Or something else is going on.

22
Control experiments
  • To resolve these issues we would need to do more
    experiments (or do this one more carefully)
  • to demonstrate that almonds accounted for
    healthier people and not the other nuts
  • to demonstrate that Seventh Day Adventists are
    typical of the general population in relation to
    health
  • Control experiments
  • purpose is to eliminate alternative explanations
    of the data obtained from an experiment.
  • They are vitally important: many an interesting
    experiment has been rendered useless by poor
    controls.

23
Types of variables
  • Other factors may affect behaviour we are
    investigating.
  • the factor we wish to study is the independent
    variable (the thing we can vary as we choose)
  • the behaviour of interest is the dependent
    variable (because it depends on the factor(s))
  • other factors are extraneous variables (things
    that vary without our wanting them to)
  • Control experiments try to eliminate the
    disturbances caused by extraneous variables by
    controlling them in some way.

24
Controlling for extraneous variation (1)
  • 1. Make the extraneous variable an independent
    one, and include it in the experiment (if we can)
  • i.e. varying value of the extraneous variable
    together with that of the independent variable
  • only possible if not too many extraneous
    variables
  • 2. Partition the test cases such that the
    extraneous variable effects cancel out.
  • e.g. effect of gender on measured intelligence
  • - collect a large number of pairs of one male and
    one female
  • such that each pair closely matched on age,
    socio-economic class, domestic situation,
    training, etc.
  • so differences within each pair due solely to
    gender

25
Controlling for extraneous variation (2)
  • 3. Take a random sample of the population of
    individuals for each of the values of the
    independent variable
  • and compare the behaviours of these samples,
  • e.g. run 100 randomly seeded runs of a genetic
    algorithm for each chosen value of the mutation
    rate (see the sketch below)
  • Effects of other, extraneous, variables should
    appear as random variation in the dependent
    variable
  • - effects of independent variable will not be
    random
  • - a statistical test can distinguish them.
  • Be careful that samples really are random with
    respect to the extraneous variables.
  • - if there is some cause-effect relationship we
    don't know about, effects of extraneous variables
    may compound instead of cancelling out.
  • Have to be very careful in selecting random
    samples.
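A minimal sketch of point 3 in Python (not from the slides; run_ga is a hypothetical stand-in for a real genetic algorithm): many randomly seeded runs per mutation rate, so that extraneous variation shows up as random noise in the dependent variable.

import random

def run_ga(mutation_rate, seed):
    """Hypothetical stand-in for a genetic algorithm run;
    returns a fitness score for one randomly seeded trial."""
    rng = random.Random(seed)
    # Extraneous variables appear as random noise around the true effect
    return 1.0 - abs(mutation_rate - 0.01) * 10 + rng.gauss(0, 0.05)

# 100 randomly seeded runs for each chosen value of the mutation rate
results = {}
for rate in [0.001, 0.01, 0.1]:
    scores = [run_ga(rate, seed) for seed in range(100)]
    results[rate] = sum(scores) / len(scores)

print(results)  # mean fitness per rate; a statistical test would follow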

26
Choosing test problems
  • How do we choose the set of tests that vary the
    factor X of interest and how do we make
    measurements of the behaviour Y we are studying.
  • Make set of test problems to assess performance
    of system
  • - what the results of performance tell us depends
    on what we are comparing our system against
  • - test problems should be fair: not so hard that
    no comparator system could do well on them, nor so
    easy that any system could do well.
  • e.g. MYCIN: can it perform as well as human
    experts?
  • set of test problems for MYCIN and human experts
  • - human novices were also included in the
    comparator set
  • If novices and experts both do well, the problems
    are too easy; if both do badly, too hard; a fair
    test divides the two

27
Measurement procedure
  • 2. Given our test set, what do we measure?
  • - for MYCIN, the responses to test problems were
    checked by human experts.
  • - experts were not told where solutions came from,
  • i.e. which were generated by MYCIN and which were
    generated by the comparator set of humans.
  • - possible biases were controlled for by blinding
    judges to information which might bias their
    response.
  • MYCIN was a single-blind trial, since only the
    judges were unaware whether a solution was human
    or machine generated.
  • When knowledge available to subject (or
    experimenter) might cause a systematic variation
    in the measured effects, double blind trials also
    widely used
  • e.g. in drug testing neither subjects nor
    experimenters know who takes which drugs

28
Design
  • Often

29
3. Statistical Tests for Confirmatory Experiments
30
4. Statistical Measures of Independence: Chi-squared
31
Evaluating usability example
  • We want to evaluate the usability of an
    interface, so we ask users to rate it as
  • 1. easy to use
  • 2. average
  • 3. difficult to use
  • We test it on different groups of users,
    recording how many users select each rating, for
    each of
  • a. Children (under 12 years)
  • b. Teenagers (13 - 18 years)
  • c. Adults (over 18)
  • If there is no consistency of usability then
    ratings will be equally spread across 1 - 3
  • Is there a difference between different users?

32
Evaluating usability example
33
Alternative view of the data
34
Chi-squared statistical test
  • Measuring similarity of distributions of data
  • - one way is the chi-squared ($\chi^2$)
    statistical test
  • - tells us how likely these data could arise by
    chance if no effect were present
  • Suppose there is no effect present
  • (i.e. the data are independent)
  • - we would expect 130/3 ≈ 43.3 users to choose
    each rating.
  • There are different numbers in each group, so we
    calculate this proportional to the number in each
    group
  • Make a table of expected and actual frequencies -
    the expected value of each cell is
  • $E = \frac{\text{row total} \times \text{column total}}{\text{overall total}}$
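A minimal sketch of this expected-frequency calculation in Python (the counts are hypothetical, since the full table is not reproduced in the transcript):

import numpy as np

# Hypothetical 3x3 contingency table: rows = age groups (children,
# teenagers, adults), columns = ratings (easy, average, difficult)
observed = np.array([[10, 15, 7],
                     [12, 16, 14],
                     [14, 20, 22]])

row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
grand_total = observed.sum()

# Expected value of each cell = row total x column total / overall total
expected = np.outer(row_totals, col_totals) / grand_total
print(expected)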

35
Considering the ratings overall
Square the differences, divide by expected and sum:
$\chi^2 = \sum \frac{(O - E)^2}{E} = \frac{(36 - 43.3)^2}{43.3} + \frac{(51 - 43.3)^2}{43.3} + \frac{(43 - 43.3)^2}{43.3} = 2.6$
Result: $\chi^2 = 2.6$, df $= 2$, $p > 0.05$, NS,
i.e. no significant differences in usability rating across the groups as a whole
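This overall result can be verified in Python; scipy's one-way chi-squared test uses equal expected frequencies by default, matching the calculation above:

from scipy.stats import chisquare

# Observed totals for ratings 1-3 across all 130 users
observed = [36, 51, 43]

# Default expected frequencies are uniform: 130/3 ≈ 43.3 per rating
chi2, p = chisquare(observed)
print(chi2, p)  # chi2 ≈ 2.6, p ≈ 0.27 > 0.05, so not significant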
36
Comparison between groups expected frequencies
37
Comparison between groups
  • $\chi^2 = \sum \frac{(O - E)^2}{E}$
  • $= \frac{(7 - 8.9)^2}{8.9} + \frac{(20 - 12.6)^2}{12.6} + \frac{(5 - 10.6)^2}{10.6} + \cdots + \frac{(33 - 17.2)^2}{17.2}$
  • $= 0.406 + 4.37 + 2.96 + 13.93 + 0.5 + 6.84 + 9.02 + 0.95 + 14.5 = 53.47$
  • Degrees of freedom $= (r-1)(c-1) = 2 \times 2 = 4$
  • Result: $\chi^2 = 53.47$, df $= 4$, $p < 0.01$
  • So we reject the null hypothesis

38
What might we infer from this?
  • Result: $\chi^2 = 53.47$, df $= 4$, $p < 0.01$
  • Look this up in statistical tables - the chance
    of obtaining the actual frequencies from an
    experiment with true frequencies equal to those
    expected is $p < 0.01$
  • So we reject the null hypothesis
  • i.e. we do appear to have differing ratings of
    usability between groups
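In place of printed tables, the critical value can be computed directly (a minimal sketch):

from scipy.stats import chi2

# Critical value at the 0.01 significance level with 4 degrees of freedom
critical = chi2.ppf(0.99, df=4)    # ≈ 13.28
print(critical, 53.47 > critical)  # 53.47 exceeds it: reject the null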

39
4. Robotics example
40
Further example (from Cohen)
  • Suppose we have a robot
  • it works in a difficult (windy) environment
  • it has to tackle problems for which it may or may
    not have time to work out a "best" plan of work
    to suit the conditions.
  • Given different levels of wind speed (W)
  • and different allowed thinking times (T),
  • how do changes in these influence the success of
    the result (R)?
  • Hypothesis wind speed and outcome are
    independent when there is plenty of thinking
    time, but not when there is inadequate thinking
    time
  • Run observational experiments and see what
    happens
  • Compare - when thinking time (T) is adequate
  • - when thinking time is not adequate

41
When thinking time is adequate
42
Expected values (adequate)
43
Expected values (adequate)
  • $\chi^2 = \sum \frac{(O - E)^2}{E}$
  • For T adequate this is
  • $\frac{(30-27.95)^2}{27.95} + \frac{(5-7.05)^2}{7.05} + \frac{(32-31.94)^2}{31.94} + \frac{(8-8.06)^2}{8.06} + \frac{(53-55.1)^2}{55.1} + \frac{(16-13.9)^2}{13.9}$
  • $= 1.145$
  • A low value, so a small difference.
  • Probability of independence is 0.56
  • (not significant)

44
When thinking time is inadequate
45
Expected values (inadequate)
46
Expected values (inadequate)
  • $\chi^2 = \sum \frac{(O - E)^2}{E}$
  • For T inadequate this is
  • $\frac{(55-42.71)^2}{42.71} + \frac{(30-42.29)^2}{42.29} + \frac{(35-38.69)^2}{38.69} + \frac{(42-38.31)^2}{38.31} + \frac{(10-18.59)^2}{18.59} + \frac{(27-18.41)^2}{18.41}$
  • $= 15.79$
  • A high value, so a big difference.
  • Probability of independence is 0.0004
  • (significant at the < 0.01 level)

47
Conclusions re: hypotheses
  • When thinking time is adequate, probability of
    independence is 0.56 (not significant)
  • When thinking time is inadequate, probability of
    independence is 0.0004
  • (significant at the < 0.01 level)
  • So, reject null hypothesis
  • Support for hypothesis
  • Wind speed and outcome are independent when
    there is plenty of thinking time, but not when
    there is inadequate thinking time
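Both results can be reproduced with scipy (a verification sketch; the observed counts are read off the expected-value slides, with rows as wind-speed levels and columns as success/failure):

from scipy.stats import chi2_contingency

# Observed counts: rows = three wind-speed levels, columns = (success, failure)
adequate = [[30, 5], [32, 8], [53, 16]]
inadequate = [[55, 30], [35, 42], [10, 27]]

for name, table in [("adequate", adequate), ("inadequate", inadequate)]:
    chi2, p, dof, expected = chi2_contingency(table)
    print(name, round(chi2, 3), round(p, 4), dof)
# adequate:   chi2 ≈ 1.145, p ≈ 0.56   (not significant)
# inadequate: chi2 ≈ 15.79, p ≈ 0.0004 (significant at the < 0.01 level)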

48
Chi-squared - summary
  • 1. Assume that the data are independent.
  • 2. Calculate expected frequencies of each kind of
    result for a sample of the same size and
    composition as the one you have, given the
    independence assumption.
  • 3. Calculate the squared deviation between actual
    and expected frequencies, divide each by the
    expected frequency, and sum over the whole table.
  • 4. Work out the degrees of freedom
  • 5. Consult tables giving $\chi^2$ distribution
    probabilities to find the chance that the data
    could have been generated by accident, given the
    assumption of independence
49
Presenting Experimental Work
  • 1. Give enough information so that conclusions
    and analysis can be checked by an interested
    reader.
  • i.e. state sample sizes, sample variances and
    means, and other statistical information the
    reader may need.
  • 2. Give enough information for reader to be able
    to replicate your work - make replication
    possible.
  • i.e. give clear descriptions of methods used,
    parameters chosen, and details of algorithms;
    make training and test data sets available, and
    say where to get them from.
  • In general, display data visually in informative
    ways.
  • Use tools such as MATLAB to create clear and
    effective graphical presentations that convey
    information

50
5. Genetic algorithms example
51
Genetic algorithms example
  • Still to do for Thursday

52
Genetic algorithms example
  • Change

53
GAs
  • Change

54
GAs
  • Change

55
GAs
  • Change

56
GAs - t-test
  • Change

57
6. Statistical Measures of Independence: t-test
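The slides for this section were left as placeholders. As a minimal sketch of the idea, an independent-samples t-test in Python might compare GA results at two mutation rates (the score lists are hypothetical):

from scipy.stats import ttest_ind

# Hypothetical final fitness scores from repeated GA runs
# at two mutation-rate settings
scores_low_rate = [0.91, 0.88, 0.93, 0.90, 0.87, 0.92, 0.89, 0.94]
scores_high_rate = [0.82, 0.85, 0.79, 0.84, 0.81, 0.86, 0.80, 0.83]

# Two-sample t-test: null hypothesis is that the group means are equal
t_stat, p_value = ttest_ind(scores_low_rate, scores_high_rate)
print(t_stat, p_value)  # a small p suggests the mutation rate matters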
58
Do another example
59
By whatever.
60
Further example
  • Distributions
  • Normal etc

61
Further example
62
Further example
63
Standard Error I
64
Standard Error II