Title: Statistical Methods in Computer Science
1. Statistical Methods in Computer Science: The Basis for Experiment Design
- Ido Dagan
2. Experimental Lifecycle
- Hypothesis → Experiment → Analysis → (back to Hypothesis)
3. Proving a Theory?
- We've discussed 4 methods of proving a proposition:
  - Everyone knows it
  - Someone specific says it
  - An experiment supports it
  - We can mathematically prove it
- Some propositions cannot be verified empirically
  - "This compiler has linear run-time"
  - Infinite possible inputs → cannot prove empirically
  - But they may still be disproved
    - e.g., code that causes the compiler to run non-linearly
4. Karl Popper's Philosophy of Science
- Popper advanced a particular philosophy of science: falsifiability
- For a theory to be considered scientific, it must be falsifiable
  - There must be some way to refute it, in principle
  - Not falsifiable ⇔ not scientific
- Examples:
  - "All crows are black": falsifiable by finding a white crow
  - "Compiles in linear time": falsifiable by non-linear performance
- A theory is tested on its predictions
5. Proving by Disproving...
- Platt ("Strong Inference", 1964) offers a specific method:
  1. Devise alternative hypotheses for observations
  2. Devise experiment(s) allowing elimination of hypotheses
  3. Carry out experiments to obtain a clean result
  4. Go to 1.
- The idea is to eliminate (falsify) hypotheses
6. Forming Hypotheses
- So, to support theory X, we:
  - Construct falsification hypotheses X1, ..., Xn, ...
  - Systematically try to disprove X by attempting to prove each Xi
  - If all falsification hypotheses are eliminated, this lends support to the theory
- Note that future falsification hypotheses may be formed
  - The theory must continue to hold against attacks
- Popper: scientific evolution, survival of the fittest theory
  - e.g., Newton's theory
- How does this view hold in computer science?
7. Forming Hypotheses in CS
- Carefully identify the theoretical object we are studying
  - e.g., the relation between input size and run-time is linear
  - e.g., the display improves user performance
- Identify the falsification hypothesis (null hypothesis) H0
  - e.g., there is an input size for which run-time is non-linear
  - e.g., the display will have no effect on user performance
- Now, experiment to eliminate H0
8. The Basics of Experiment Design
- Experiments identify a relation between variables X, Y, ...
- Simple experiments provide an indication of a relation
  - Better/worse, linear or non-linear, ...
- Advanced experiments help identify causes and interactions
  - e.g., linear in input size, but the constant factor depends on the type of data
9. Types of Experiments and Variables
- Manipulation experiments
  - Manipulate (set the value of) independent variables (e.g., input size)
  - Observe (measure the value of) dependent variables (e.g., run time)
- Observation experiments
  - Observe predictor variables (e.g., a person's height)
  - Observe response variables (e.g., running speed)
  - Also running time, if observing a system in actual use
- Other variables
  - Endogenous: on the causal path between the independent and dependent variables
  - Exogenous: other variables influencing the dependent variables
10. An Example of an Observation Experiment
- Theory: gender affects test-score performance
- Falsifying hypothesis: gender does not affect performance
  - i.e., men and women perform the same
- Cannot use manipulation experiments
  - Cannot control gender
- Must use observation experiments
11-14. An Example Observation Experiment (à la Empirical Methods in AI, Cohen 1995)
- Two sampled children (observed attributes):
  - Child 1: Gender Male, Siblings 2, Mother artist, Height 145 cm, Test score 650
  - Child 2: Gender Female, Siblings 3, Mother doctor, Height 135 cm, Test score 720
  - Also observed: teacher's attitude, child confidence
- Independent (predictor) variable: Gender
- Dependent (response) variable: Test score
- Endogenous variables (on the causal path from gender to score): e.g., child confidence, teacher's attitude
- Exogenous variables (other influences on the score): e.g., siblings, mother's occupation, height
15. Experiment Design: Introduction
- Different experiment types explore different hypotheses
- For instance, a very simple design: the treatment experiment
  - Sometimes known as a lesion study
  - Treatment:  Ind1       Ex1 Ex2 ... Exn  →  Dep1
  - Control:    Not(Ind1)  Ex1 Ex2 ... Exn  →  Dep2
- Treatment condition: independent variable set to "with treatment"
- Control condition: independent variable set to "no treatment"
[Diagram: variables V0, V1, V2, ..., Vn feeding into the dependent variable]
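The treatment/control layout on this slide can be sketched as a simulation. The measurement function, effect size, and sample sizes below are synthetic assumptions for illustration only:

```python
# Treatment vs. control: same exogenous conditions, only the
# independent variable differs between the two groups.
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

def run_trial(treatment: bool) -> float:
    """Hypothetical dependent measurement, e.g., task completion time."""
    base = random.gauss(100.0, 5.0)        # exogenous noise, same for both groups
    return base - (8.0 if treatment else 0.0)  # assumed treatment effect

treatment = [run_trial(True) for _ in range(30)]
control = [run_trial(False) for _ in range(30)]

effect = statistics.mean(control) - statistics.mean(treatment)
```

Because the exogenous noise is identical in distribution across the two conditions, the difference of means estimates the treatment effect alone.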
16. Single-Factor Treatment Experiments
- A generalization of treatment experiments
- Allow comparison of different conditions:
  - Treatment 1:  Ind1      Ex1 Ex2 ... Exn  →  Dep1
  - Treatment 2:  Ind2      Ex1 Ex2 ... Exn  →  Dep2
  - Control:      Not(Ind)  Ex1 Ex2 ... Exn  →  Dep3
- Compare performance of algorithm A to B to C ...
- Control condition: optional (e.g., to establish a baseline)
- Determine the relation between the categorical variable V0 and the dependent variable
[Diagram: V0 and variables V1, ..., Vn influencing the dependent variable]
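A single-factor design with algorithms as the categorical variable might look like the sketch below; the algorithms themselves are trivial stand-ins, and run time is the assumed dependent variable:

```python
# Single factor: the algorithm (levels A, B, C), everything else held fixed.
import timeit

algorithms = {
    "A": sorted,                              # condition A: sort ascending
    "B": lambda xs: sorted(xs, reverse=True), # condition B: sort descending
    "C": list,                                # control: copy only, no real work
}

data = list(range(10_000, 0, -1))  # identical input for every condition

results = {
    name: min(timeit.repeat(lambda f=f: f(data), number=1, repeat=5))
    for name, f in algorithms.items()
}
```

Holding the input (an exogenous variable) fixed across all levels is what makes the per-condition times comparable.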
17. Careful!
- An effect on the dependent variable may not be what you expect
- Example experiment:
  - Hypothesis: a fly's ears are on its wings
  - Fly with two wings. Make a loud noise. Observe flight.
  - Fly with one wing. Make a loud noise. No flight.
  - Conclusion: a fly with only one wing cannot hear!
- What's going on here?
  - First, interpretation by the experimenter
  - But also, lack of sufficient falsifiability
    - There are other possible explanations for why the fly wouldn't fly
18. Controlling for Other Factors
- Often, we cannot manipulate all exogenous variables
- Then, we need to make sure they are sampled randomly
  - Randomization averages out their effect
- This can be difficult
  - e.g., suppose we are trying to relate gender and math scores
  - We control for the effect of # of siblings by random sampling
  - But # of siblings may be related to gender
    - Parents continue to have children hoping for a boy (Beal 1994)
    - Thus # of siblings is tied to gender
  - Must separate results based on # of siblings
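Separating results by the confounded variable can be sketched as stratification; the records and field names below are synthetic:

```python
# When an exogenous variable (# of siblings) may be tied to the predictor
# (gender), compare genders *within* each sibling count rather than overall.
from collections import defaultdict
from statistics import mean

records = [  # synthetic observations
    {"gender": "M", "siblings": 1, "score": 640},
    {"gender": "F", "siblings": 1, "score": 655},
    {"gender": "M", "siblings": 3, "score": 610},
    {"gender": "F", "siblings": 3, "score": 625},
]

by_stratum = defaultdict(list)
for r in records:
    by_stratum[(r["siblings"], r["gender"])].append(r["score"])

# Mean score per (siblings, gender) stratum; gender comparisons are then
# made only between cells sharing the same sibling count.
strata = {key: mean(scores) for key, scores in by_stratum.items()}
```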
19. Factorial Experiment Designs
- Every combination of factor values is sampled
  - The hope is to exclude or reveal interactions
- This creates a combinatorial number of experiments
  - N factors, k values each: k^N combinations
- Strategies for eliminating values:
  - Merge values into categories; skip values
  - Focus on extremes, to get a general trend
    - But this may hide behavior at intermediate values
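Enumerating a full factorial design makes the k^N blow-up concrete; the factors and values below are illustrative:

```python
# Full factorial design: every combination of every factor's values.
from itertools import product

factors = {
    "input_size": [1_000, 10_000, 100_000],       # 3 values
    "data_type": ["sorted", "reversed", "random"], # 3 values
    "algorithm": ["A", "B"],                       # 2 values
}

# product(...) yields one tuple per configuration: 3 * 3 * 2 = 18 runs.
configurations = list(product(*factors.values()))
```

Adding a fourth 3-valued factor would triple this to 54 runs, which is why merging or skipping values quickly becomes necessary.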
20. Tips for Factorial Experiments
- For numerical variables, 2 value ranges are not enough
  - They don't give a good sense of the function relating the variables
- Measure, measure, measure
  - Piggybacking measurements on planned experiments is cheaper than re-running experiments
- Simplify comparisons
  - Use the same number of data points (trials) for all configurations
21. Experiment Validity
- Types of validity: internal and external
- Internal validity
  - The experiment shows the relationship (the independent variable causes the dependent)
- External validity
  - The degree to which results generalize to other conditions
- Threats: uncontrolled conditions threatening validity
22. Internal Validity Threats: Examples
- Order effects
  - Practice effects in human or animal test subjects
    - e.g., user performance improves over successive user-interface tasks
    - Solution: randomize the order of presentation to subjects
  - A bug or side effect in the testing system leaves the system unclean for the next trial: need to clean the system between experiments
  - If treatment/control are given in two different orders
    - e.g., run with/without the new algorithm, for the same users
    - The order may be good for treatment, bad for control (or vice versa)
    - Solution: counter-balancing (all possible orders)
- Demand effects
  - The experimenter influences the subjects
    - e.g., guiding subjects
- Confounding effects: variable relations aren't clear
  - See: fly with no wings cannot hear
23. External Threats to Validity
- Sampling bias: non-representative samples
  - e.g., non-representative external factors
- Floor and ceiling effects
  - The problems tested are too hard or too easy
- Regression effects
  - Results have nowhere to go but up or down
- Solution approach: run pilot experiments
24. Sampling Bias
- The setting prefers measuring specific values over others
- For instance:
  - "Random" selection of mice from a cage for an experiment
    - Selects for specific traits: slow, doesn't bite (not aggressive), ...
  - Including only results that were found by some deadline
- Solution: detect, and remove
  - e.g., by visualization, looking for non-normal distributions
  - e.g., a surprising distribution of the dependent data for different values of the independent variable
25. Baselines: Floor and Ceiling Effects
- How do we know A is good? Bad?
  - Maybe the problems are too simple? Too hard?
- For example:
  - A new machine learning algorithm has 95% accuracy
  - Is this good?
- Controlling for floor/ceiling effects:
  - Establish baselines
    - Show whether a "silly" approach achieves a close result
  - Comparison to a strawman (easy) or ironman (hard)
    - May be misleading if not chosen appropriately
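The 95%-accuracy example is exactly where a "silly" baseline helps. A sketch with synthetic labels (the class names are illustrative):

```python
# If 95% of labels belong to one class, always predicting that class
# already scores 95% accuracy -- so a 95%-accurate classifier may have
# learned nothing at all.
labels = ["spam"] * 95 + ["ham"] * 5  # synthetic, heavily imbalanced

majority = max(set(labels), key=labels.count)  # majority-class "predictor"
baseline_accuracy = sum(1 for y in labels if y == majority) / len(labels)
# baseline_accuracy is 0.95 here
```

A reported accuracy only means something relative to such a baseline (the floor), and likewise relative to the best achievable score (the ceiling).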
26. Regression Effects
- General phenomenon: regression towards the mean
  - Repeated measurements converge towards mean values
- Example threat: run a program on 100 different inputs
  - Problems 6, 14, 15 get a very low score
  - We now fix the problem that affected only these inputs, and want to re-test
  - If chance has anything to do with scoring, then we must re-run all inputs
  - Why?
    - Scores on 6, 14, 15 have nowhere to go but up
    - So re-running only these problems will show improvement by chance
- Solution:
  - Re-run the complete tests, or sample conditions uniformly
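The re-test trap above can be demonstrated by simulation. The quality level and noise magnitude below are arbitrary assumptions; the point is that nothing changes between the two runs:

```python
# Regression towards the mean: inputs whose first score was unusually low
# tend to score higher on an identical re-run, with no real fix at all.
import random
from statistics import mean

random.seed(1)  # fixed seed for reproducibility

TRUE_QUALITY = 70.0  # every input has the same underlying quality
first = [TRUE_QUALITY + random.gauss(0, 10) for _ in range(100)]
second = [TRUE_QUALITY + random.gauss(0, 10) for _ in range(100)]  # no change made

# Select the 3 worst scorers of the first run (like problems 6, 14, 15)...
worst = sorted(range(100), key=lambda i: first[i])[:3]

# ...and their apparent "improvement" on the re-run.
improvement = mean(second[i] - first[i] for i in worst)
# improvement comes out positive purely by chance: the selected scores
# were mostly noise, and noise does not repeat.
```

This is why only re-running the "fixed" inputs is misleading, while re-running the full test suite lets chance average out.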
27. Summary
- Defensive thinking
  - "If I were trying to disprove the claim, what would I do?"
  - Then think of ways to counter any possible attack on the claim
- Strong Inference; Popper's falsification ideas
  - Science moves by disproving theories (empirically)
- Experiment design
  - Ideal independent variables: easy to manipulate
  - Ideal dependent variables: measurable, sensitive, and meaningful
  - Carefully think through threats
- Next week: Hypothesis testing