Title: IEOR 170: Quantitative Evaluation
1. IEOR 170: Quantitative Evaluation
Slides based on those of John Canny and Maneesh Agrawala
2. Administrivia
- Design Notebook/Idea Logs review deadline is 4/25/2007
- High-Fidelity Prototype and Evaluation assignment has been released today
  - Due 4/25/2007
- 15% of the grading is group-specific
3. Previously on IEOR 170
- Heuristic Evaluation
  - Nielsen's 10 Heuristics
  - Evaluation Process
  - Pros and Cons
4. Qualitative vs. Quantitative Studies
- Qualitative: what we've been doing so far
  - Contextual Inquiry: trying to understand users' tasks and their conceptual model
  - Usability Studies: looking for critical incidents in a user interface
- Qualitative methods help us
  - Understand what's going on
  - Look for problems
  - Get a rough idea of the usability of an interface
5. Quantitative Studies
- Quantitative
  - Used to reliably measure something
  - Compare two or more designs on a measurable aspect
- Approaches
  - Collect and analyze user events that occur in natural use
    - Key presses, mouse clicks
  - Controlled experiments
- Examples of measures
  - Time to complete a task
  - Average number of errors on a task
  - Users' ratings of an interface
    - Ease of use, elegance, performance, robustness, speed, …
- You could argue that users' perception of speed, error rates, etc. is more important than their actual values.
6. Comparison
- Qualitative studies
  - Faster, less expensive → especially useful in the early stages of the design cycle
  - In real-world design a quantitative study is not always necessary
- Quantitative studies
  - Reliable, repeatable results → the scientific method
  - The best studies produce generalizable results
7. Steps in Designing an Experiment
- State a lucid, testable hypothesis
- Identify variables (independent, dependent, control, random)
- Design the experiment protocol
- Choose the user population
- Apply for human subjects protocol review
- Run pilot studies
- Run the experiment
- Perform statistical analysis
- Draw conclusions
8. Example: Menu Selection
Guimbretière et al. '03
9. Lucid, Testable Hypothesis
- Because users must reach for it, the tool palette will be slower
- Other hypotheses?
10. Experiment Design
- Testable hypothesis
  - A precise statement of the expected outcome
- Factors (independent variables)
  - Attributes we manipulate/vary in each condition
  - Levels: the values for the independent variables
- Response variables (dependent variables)
  - The outcome of the experiment (measurements)
  - Usually measure user performance
    - Time
    - Errors
11. Experiment Design
- Control variables
  - Attributes that are fixed throughout the experiment
- Confound: an attribute that varied and was not accounted for
  - Problem: the confound, rather than the IV, could have caused the changes in the DVs
  - Confounds make it difficult or impossible to draw conclusions
- Random variables
  - Attributes that are randomly sampled
  - Increases generalizability
12. Variables
- Independent variables
- Dependent variables
- Control variables
- Random variables
13. Variables
- Independent variables
  - Menu type (4 choices)
  - Device type (2 choices)
- Dependent variables
  - Time
  - Error rate
  - User satisfaction
- Control variables
  - Location/environment, …
  - Device type?
- Random variables
  - Attributes of subjects: age, sex, …
14. Goals
- Internal validity
  - Manipulation of the IV is the cause of the change in the DV
  - Requires that the experiment is replicable
- External validity
  - Results are generalizable to other experimental settings
  - Ecological validity: results are generalizable to real-world settings
- Confidence in results
  - Statistics
15. Experimental Protocol
- What is the task?
- What are all the combinations of conditions?
- How often should each combination of conditions be repeated?
- Between subjects or within subjects?
- Avoid bias (instructions, ordering, …)
16. Task Must Reflect Hypothesis
- Connect the dots, choosing the given color for each one
- Connected dots are filled in gray; the next dot is shown open, in green
17. Number of Conditions
- Consider all combinations to isolate the effects of each IV (factorial design)
- (4 menu types) × (2 device types) = 8 combinations
  - Tool Palette + Pen
  - Tool Palette + Mouse
  - Tool Glass + Pen
  - Tool Glass + Mouse
  - Flow Menu + Pen
  - Flow Menu + Mouse
  - Control Menu + Pen
  - Control Menu + Mouse
- Adding levels or factors can yield lots of combinations; a quick enumeration sketch follows below
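Enumerating a full factorial design is just the cross product of the factor levels. A minimal sketch in Python (the level names simply mirror the menu-selection example above):

```python
from itertools import product

# Levels for each factor in the menu-selection example
menu_types = ["Tool Palette", "Tool Glass", "Flow Menu", "Control Menu"]
device_types = ["Pen", "Mouse"]

# Full factorial design: every combination of levels
conditions = list(product(menu_types, device_types))
for menu, device in conditions:
    print(f"{menu} + {device}")
print(f"{len(conditions)} combinations")  # 4 x 2 = 8
```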
18. Reducing the Number of Conditions
- Vary only one independent variable, leaving the others fixed
- Problem?
19. Reducing the Number of Conditions
- Vary only one independent variable, leaving the others fixed
- Problem: you will miss the effects of interactions
20. Other Reduction Strategies
- Run a few independent variables at a time
  - If a variable has a strong effect, include it in future studies
  - Otherwise pick a fixed control value for it
- Fractional factorial design
  - Procedures for choosing the subset of independent variables to vary in each experiment
21. Choosing Subjects
- Pick a balanced sample reflecting the intended user population
  - Novices, experts
  - Age group
  - Sex
  - …
- Example
  - 12 non-colorblind right-handed adults (male and female)
- A population group can also be an IV or a controlled variable
- What is the disadvantage of making population a controlled variable?
- What are the pros/cons of making population an IV?
22. Between Subjects Design
23. Within Subjects Design
24. Between vs. Within Subjects
- Between subjects
  - Each participant uses one condition
  - +/− Participants cannot compare conditions
  - + Can collect more data for a given condition
  - − Need more participants
- Within subjects
  - All participants try all conditions
  - + Compare one person across conditions to isolate the effects of individual differences
  - + Requires fewer participants
  - − Fatigue effects
  - − Bias due to ordering/learning effects
25. Within Subjects: Ordering Effects
- In within-subjects designs the ordering of conditions is a variable that can confound results
  - Why?
- Turn it into a random variable
  - Randomize the order of conditions across subjects
  - Counterbalancing (ensure all orderings are covered)
  - Latin square (partial counterbalancing; a generation sketch follows below)
- Menu selection example: within subjects, each subject tries each condition multiple times, ordering counterbalanced
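A Latin square ordering is easy to generate programmatically. A minimal sketch using the simple cyclic construction, assuming four conditions labeled A–D (note that a cyclic square gives each condition every position once, but does not balance carryover effects; a balanced Latin square construction is needed for that):

```python
def latin_square(conditions):
    """Cyclic Latin square: row i is the condition list rotated by i, so each
    condition appears exactly once in every row and every column."""
    n = len(conditions)
    return [[conditions[(i + j) % n] for j in range(n)] for i in range(n)]

# One row per subject (or subject group), one column per trial slot
for row in latin_square(["A", "B", "C", "D"]):
    print(" -> ".join(row))
```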
26. Run the Experiment
- Always pilot it first!
  - It reveals unexpected problems
  - You can't change the experiment design after starting it
- Always follow the same steps: use a checklist
- Get consent from subjects
- Debrief subjects afterwards
27. Results: Statistical Analysis
- Compute central tendencies (descriptive summary statistics) for each independent variable
  - Mean
  - Standard deviation
28. Normal Distributions
- Often DVs are assumed to have a normal distribution
- (Figure: at left is the density, at right is the cumulative probability.)
- Normal distributions are completely characterized by their mean and variance (the mean squared deviation from the mean).
29. Are the Results Meaningful?
- Hypothesis testing
  - Hypothesis: manipulation of the IV affects the DV in some way
  - Null hypothesis: manipulation of the IV has no effect on the DV
  - The null hypothesis is assumed true unless the statistics allow us to reject it
- Statistical significance (p-value)
  - The likelihood that the results are due to chance variation
  - p < 0.05 is usually considered significant (sometimes p < 0.01)
  - Means there is less than a 5% chance that the null hypothesis is true
- Statistical tests
  - t-test (1 factor, 2 levels)
  - Correlation
  - ANOVA (1 factor with > 2 levels, or multiple factors)
  - MANOVA (> 1 dependent variable)
30. t-test
- Compares the means of 2 groups (a scipy sketch follows below)
  - Null hypothesis: no difference between the means
- Assumptions
  - Samples are normally distributed
    - Very robust in practice
  - Population variances are equal (between-subjects tests)
    - Reasonably robust for differing variances
  - Individual observations in the samples are independent
    - Extremely important!
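As a concrete illustration, a minimal sketch of running t-tests with scipy; the timing numbers are invented for illustration:

```python
from scipy import stats

# Hypothetical task-completion times (seconds), one value per participant
design_a = [12.1, 10.4, 13.2, 11.8, 12.9, 10.9, 11.5, 12.3]
design_b = [14.0, 13.1, 15.2, 12.8, 14.6, 13.7, 15.0, 13.4]

# Between subjects (two separate groups): independent-samples t-test
t, p = stats.ttest_ind(design_a, design_b)
print(f"t = {t:.2f}, p = {p:.4f}")

# Within subjects (same people in both conditions): paired t-test instead
t, p = stats.ttest_rel(design_a, design_b)
print(f"t = {t:.2f}, p = {p:.4f}")
```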
31. Correlation
- Measures the extent to which two variables are related (a scipy sketch follows below)
- Does not imply cause and effect
  - Example: ice cream eating and drowning
- Need a large enough sample size
- Regression
  - Compute the best fit
    - Linear
    - Logistic
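A minimal sketch of computing a correlation and a linear fit with scipy; the paired measurements are invented for illustration:

```python
from scipy import stats

# Hypothetical paired observations, e.g. hours of practice vs. task time (s)
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [20.1, 18.3, 17.9, 15.2, 14.8, 13.5, 12.9, 11.4]

# Pearson correlation: strength of the linear relationship (not causation!)
r, p = stats.pearsonr(x, y)
print(f"r = {r:.3f}, p = {p:.4f}")

# Linear regression: best-fit line y = slope * x + intercept
fit = stats.linregress(x, y)
print(f"y = {fit.slope:.2f} * x + {fit.intercept:.2f}")
```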
32. Lies, Damn Lies, and Statistics
- A common mistake (made by famous HCI researchers)
  - Increasing n, the number of trials, by running each subject several times
  - No! The analysis only works when trials are independent
  - All the trials for one subject are dependent, because that subject may be faster/slower/less error-prone than others
- Making this error will not help you become a famous HCI researcher.
33. Statistics with Care
- What you can do to get better significance (a sketch follows below)
  - Run each subject several times and compute the average for each subject
  - Run the analysis as usual on the subjects' average times, with n = the number of subjects
  - This decreases the per-subject variance while keeping the data independent
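A minimal sketch of this aggregation step, assuming raw trial times keyed by subject (the data are hypothetical):

```python
from statistics import mean
from scipy import stats

# Hypothetical raw trial times (seconds): several trials per subject
trials_a = {"s1": [12.0, 11.5, 12.4], "s2": [10.2, 10.8, 10.5], "s3": [13.1, 12.7, 13.4]}
trials_b = {"s1": [13.9, 14.2, 13.6], "s2": [12.5, 12.9, 12.2], "s3": [15.0, 14.4, 14.8]}

# Collapse to one number per subject so the observations stay independent
avg_a = [mean(times) for times in trials_a.values()]
avg_b = [mean(times) for times in trials_b.values()]

# Within-subjects design: pair each subject's averages across conditions
t, p = stats.ttest_rel(avg_a, avg_b)
print(f"t = {t:.2f}, p = {p:.4f}")
```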
34. Statistics with Care
- Another common mistake
  - An experiment fails to find a significant difference between the test and control cases (say at p = 0.05), so you conclude that there is no significant difference
  - No!
- A difference-of-averages test can only confirm (with high probability) that there is a difference. Failure to prove a significant difference can be because
  - There is no difference, OR
  - The number of subjects in the experiment is too small
35. Statistics with Care
- Example: what should you conclude if you find no significant difference at p = 0.05, but there is a difference at p = 0.2?
- First of all, the result does not confirm a significant difference with any confidence.
- However, while there may not be a significant difference, it is more likely that there is one but it is too weak to detect at the N chosen. Therefore, try repeating the experiment with a larger N.
36. Statistics with Care
- You write a paper with 20 different studies, all of which demonstrate effects at p = 0.05 significance. They're all right, right?
- Actually, there is a significant probability (as high as 63%) that there is no real effect in at least one case. (The arithmetic is sketched below.)
- Remember: a p-value is an upper bound on the probability of no effect, so there is always a chance the experiment gives the wrong result.
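That figure comes from compounding the per-study error rate. A minimal sketch of the arithmetic, along with the standard Bonferroni-style fix:

```python
# Chance that at least one of 20 independent p = 0.05 results is a false positive
alpha, k = 0.05, 20
p_at_least_one_wrong = 1 - (1 - alpha) ** k
print(f"{p_at_least_one_wrong:.2f}")  # 1 - 0.95**20, roughly 0.64

# Bonferroni correction: test each study at alpha / k so the overall
# error rate stays near alpha
print(f"per-study threshold: {alpha / k}")  # 0.0025
```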
38. Basics of Quantitative Methods
- Random variables, probabilities, distributions
- Review of statistics
- Collecting data
- Analyzing the data
39. Random Variables
- Random variables take on different values according to a probability distribution.
- E.g. X ∈ {1, 2, 3} is a discrete random variable with three possible values.
- To characterize the variable, we need to define the probabilities for each value: Pr[X=1] = Pr[X=2] = 1/4, Pr[X=3] = 1/2.
- On each trial or experiment, we should see one of these three values with the given probability.
40. Random Variables and Trials
- When we examine X after a series of trials, we might see the values 1, 1, 3, 2, 3, 1, 3, 3, 3, 1, 2, …
- We often want to denote the value of X on a particular trial, such as Xi for the ith trial.
- Then the above sequence could also be written as X1 = 1, X2 = 1, X3 = 3, X4 = 2, X5 = 3, X6 = 1, X7 = 3, X8 = 3, X9 = 3, X10 = 1, X11 = 2, …
- For large N, the sequence X1, …, XN should contain the value 3 about N/2 times, the value 2 about N/4 times, and the value 1 about N/4 times. (A quick simulation is sketched below.)
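A minimal simulation of this random variable, assuming numpy is available; over many trials the empirical frequencies approach 1/4, 1/4, 1/2:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

# Sample X in {1, 2, 3} with Pr[X=1] = Pr[X=2] = 1/4, Pr[X=3] = 1/2
samples = rng.choice([1, 2, 3], size=N, p=[0.25, 0.25, 0.5])

for value in (1, 2, 3):
    print(f"Pr[X={value}] ~ {np.mean(samples == value):.3f}")
```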
41. Random Variables and Trials
- Q: How would you represent a fair coin toss with a random variable?
  - X ∈ {H, T}, Pr[X=H] = Pr[X=T] = 1/2
- Q: How would you represent a 6-sided die toss?
  - Y ∈ {1, 2, 3, 4, 5, 6}, Pr[Y=i] = 1/6 for 1 ≤ i ≤ 6, Pr[Y=i] = 0 otherwise
42. Independence
- Consider a random variable X which is the value of a fair die toss. Now consider Y, which is the value of another fair die toss.
- Knowing the value of X tells us nothing about the value of Y, and vice versa. We say X and Y are independent random variables.
- However, if we define Z = X + Y, then Z is dependent on X and vice versa (large values of X increase the probability of large values of Z, and Z must be at least X + 1).
43. Independent Trials
- We will often want to use random variables whose values on different trials are independent.
- If this is true, we say the experiment has independent trials.
- Example: tossing a fair die many times. Each toss is a random variable which is independent of the other trials.
44. Random Variables
- Given Pr[X=1] = Pr[X=2] = 1/4, Pr[X=3] = 1/2, we can also represent the distribution with a graph (a bar chart of the three probabilities).
45. Continuous Random Variables
- Some random variables take on continuous values, e.g. Y ∈ [-1, 1].
- The probability must be defined by a probability density function (pdf).
- E.g. p(Y) = (3/4)(1 - Y²)
- Note that the area under the curve is the total probability, which must be 1.
- (Figure: the pdf above, peaking at 3/4 over the interval [-1, 1].)
46. Continuous Random Variables
- The area under the pdf curve between two values gives the probability that the value of the variable lies in that range, i.e. Pr[a < Y < b]. (A numerical check is sketched below.)
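A minimal numerical check of both facts with scipy, using the example pdf from the previous slide:

```python
from scipy.integrate import quad

# The example pdf p(Y) = (3/4)(1 - Y^2) on [-1, 1]
p = lambda y: 0.75 * (1 - y**2)

# Total area under the pdf must be 1
total, _ = quad(p, -1, 1)
print(f"total probability: {total:.3f}")  # 1.000

# Pr[a < Y < b] is the area under the pdf between a and b
prob, _ = quad(p, 0.0, 0.5)
print(f"Pr[0 < Y < 0.5] = {prob:.3f}")
```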
47. Meaning of the Distribution
- The limit of the area as the range [a, a + dY] goes to zero gives the value of the density: Pr[a < Y < a + dY] ≈ p(a) dY.
48. CDF: Cumulative Distribution
- The CDF is the area under the distribution from -∞ up to some value v: C(v) = Pr[Y < v].
- So C(-∞) = 0 and C(∞) = 1.
49. Mean and Variance
- The mean is the expected value of the variable. It's roughly the average value of the variable over many trials.
- Mean = E[Y]
- In this case E[Y] = 1/2
50. Variance
- Variance is the expected value of the squared difference from the mean: Var[Y] = E[(Y - E[Y])²]. It's roughly the squared width of the distribution.
- Standard deviation Std[X] is the square root of the variance.
51. Mean and Variance
- What are the mean and variance of the following distribution?
- (Figure: a discrete distribution over the values 2, 3, and 4, with bar heights 1/2 and 1/4.)
52. Sums of Random Variables
- For any X1 and X2, the expected value of a sum is the sum of the expected values: E[X1 + X2] = E[X1] + E[X2]
- For independent X1 and X2, the variance of the sum is also the sum of the variances: Var[X1 + X2] = Var[X1] + Var[X2]
53. Identical Trials
- For independent trials with the same mean and variance E[X] and Var[X]:
  - E[X1 + ⋯ + Xn] = n E[X]
  - Var[X1 + ⋯ + Xn] = n Var[X]
  - Std[X1 + ⋯ + Xn] = √n Std[X]
  - where Std[X] = Var[X]^(1/2)
54. Identical Trials
- If we define Avg(X1, …, Xn) = (X1 + ⋯ + Xn)/n, then
  - E[Avg(X1, …, Xn)] = E[X]
  - while Std[Avg(X1, …, Xn)] = (1/√n) Std[X]
- I.e. the standard deviation of an average decreases with n, the number of trials. (A simulation is sketched below.)
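A minimal simulation of the 1/√n law, assuming numpy: the standard deviation of the average of n fair die rolls shrinks as n grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Std of a single fair die roll, estimated empirically (about 1.71)
single_std = rng.integers(1, 7, size=100_000).std()

for n in (1, 4, 16, 64):
    # 10,000 experiments, each averaging n fair die rolls
    averages = rng.integers(1, 7, size=(10_000, n)).mean(axis=1)
    print(f"n={n:3d}: std of average = {averages.std():.3f}, "
          f"predicted = {single_std / np.sqrt(n):.3f}")
```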
55. Identical Trials
- I.e. the distribution narrows in a relative sense.
- (Figure: the blue curve is the sum of 100 random trials, the red curve is the sum of 200.)
56. Detecting Differences
- The more times you repeat an experiment, the narrower the distributions of the measured average values for the two conditions.
- So the more likely you are to detect a difference in a test variable between the two cases.
58. Variable Types
- Independent variables: the ones you control
  - Aspects of the interface design
  - Characteristics of the testers
  - Discrete: A, B, or C
  - Continuous: time between clicks for a double-click
- Dependent variables: the ones you measure
  - Time to complete tasks
  - Number of errors
59. Some Statistics
- Variables X, Y
- A relation (hypothesis), e.g. X > Y
- We would often like to know if a relation is true
  - e.g. X = time taken by novice users, Y = time taken by users with some training
- To find out if the relation is true, we do experiments to get lots of x's and y's (observations)
- Suppose avg(x) > avg(y), or that most of the x's are larger than all of the y's. What does that prove?
60. Significance
- The significance or p-value of an outcome is the probability that it happens by chance if the relation does not hold.
- E.g. p = 0.05 means that there is a 1/20 chance that the observation happens if the hypothesis is false.
- So the smaller the p-value, the greater the significance.
61. Significance
- For instance, p = 0.001 means there is a 1/1000 chance that the observation would happen if the hypothesis is false. So the hypothesis is almost surely true.
- Significance increases with the number of trials.
- CAVEAT: You have to make assumptions about the probability distributions to get good p-values. There is always an implied model of user performance.
62. Normal Distributions
- Many variables have a normal distribution (pdf)
- (Figure: at left is the density, at right is the cumulative probability.)
- Normal distributions are completely characterized by their mean and variance (the mean squared deviation from the mean).
63. Normal Distributions
- The density of a normal distribution at one standard deviation from the mean is about 60% of its peak value. (A quick check is sketched below.)
- (Figure: the density, with one standard deviation marked.)
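A quick check of that 60% figure with scipy: the standard normal density one standard deviation out, relative to its peak, is e^(-1/2) ≈ 0.61.

```python
from scipy.stats import norm

# Density at one standard deviation, relative to the peak at the mean
ratio = norm.pdf(1) / norm.pdf(0)
print(f"{ratio:.3f}")  # 0.607, i.e. about 60%
```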
64. t-test
- The t-test asks for the probability that E[X] > E[Y] is false.
- I.e. the null hypothesis for the t-test is that E[X] = E[Y].
- What is the probability of that given the observations?
65. t-test
- We actually ask for the probability that E[X] and E[Y] are at least as different as the observed means.
- (Figure: the overlapping sampling distributions of X and Y.)
66. Analyzing the Numbers
- Example: prove that task 1 is faster on design A than design B.
- Suppose the average time for design B is 20% higher than for A.
- Suppose subjects' times in the study have a std. dev. which is 30% of their mean time (typical).
- How many subjects are needed?
67. Analyzing the Numbers
- Example: prove that task 1 is faster on design A than design B.
- Suppose the average time for design B is 20% higher than for A.
- Suppose subjects' times in the study have a std. dev. which is 30% of their mean time (typical).
- How many subjects are needed?
  - Need at least 13 subjects for significance p = 0.01
  - Need at least 22 subjects for significance p = 0.001
  - (assumes subjects use both designs; a reconstruction of the arithmetic is sketched below)
68. Analyzing the Numbers (cont.)
- I.e. even with a strong (20%) difference, you need lots of subjects to prove it.
- Usability test data is quite variable
  - 4 times as many tests will only narrow the range by 2x
  - The breadth of the range depends on the square root of the number of test users
- This is where surveys or automatic usability testing can help
69. Lies, Damn Lies, and Statistics
- A common mistake (made by famous HCI researchers)
  - Increasing n, the number of trials, by running each subject several times
  - No! The analysis only works when trials are independent
  - All the trials for one subject are dependent, because that subject may be faster/slower/less error-prone than others
- Making this error will not help you become a famous HCI researcher.
70. Statistics with Care
- What you can do to get better significance
  - Run each subject several times and compute the average for each subject
  - Run the analysis as usual on the subjects' average times, with n = the number of subjects
  - This decreases the per-subject variance while keeping the data independent
71. Statistics with Care
- Another common mistake
  - An experiment fails to find a significant difference between the test and control cases (say at p = 0.05), so you conclude that there is no significant difference
  - No!
- A difference-of-averages test can only confirm (with high probability) that there is a difference. Failure to prove a significant difference can be because
  - There is no difference, OR
  - The number of subjects in the experiment is too small
72. Statistics with Care
- Example: what should you conclude if you find no significant difference at p = 0.05, but there is a difference at p = 0.2?
- First of all, the result does not confirm a significant difference with any confidence.
- However, while there may not be a significant difference, it is more likely that there is one but it is too weak to detect at the N chosen. Therefore, try repeating the experiment with a larger N.
73. Statistics with Care
- You write a paper with 20 different studies, all of which demonstrate effects at p = 0.05 significance. They're all right, right?
- Actually, there is a significant probability (as high as 63%) that there is no real effect in at least one case.
- Remember: a p-value is an upper bound on the probability of no effect, so there is always a chance the experiment gives the wrong result.
74. Using Subjects
- Between-subjects experiment
  - Two groups of test users
  - Each group uses only 1 of the systems
- Within-subjects experiment
  - One group of test users
  - Each person uses both systems
75. Between Subjects
- Two groups of testers, each using 1 system
- Advantages
  - Users only have to use one system (practical)
  - No learning effects
- Disadvantages
  - Per-user performance differences are confounded with system differences
  - Much harder to get significant results (many more subjects needed)
  - Harder to even predict how many subjects will be needed (it depends on the subjects)
76. Within Subjects
- One group of testers who use both systems
- Advantages
  - Much more significance for a given number of test subjects
- Disadvantages
  - Users have to use both systems (two sessions)
  - Order and learning effects (can be minimized by experiment design)
77. Example
- Same experiment as before
  - System B is 20% slower than A
  - Subjects have a 30% std. dev. in their times
- Within subjects
  - Need 13 subjects for significance p = 0.01
- Between subjects
  - Typically require 52 subjects for significance p = 0.01
  - But depending on the subjects, we may get lower or higher significance
78. Experimental Details
- Learning effects
  - Subjects do better when they repeat a trial
  - This can bias within-subjects studies
  - So balance the order of trials, with equal numbers of A-B and B-A orders
- What if someone doesn't finish?
  - Multiply the time and number of errors by 1/(fraction of the trial that they completed)
- Pilot study to fix problems
  - Do 2: first with colleagues, then with real users
79. Reporting the Results
- Report what you did and what happened
- Images and graphs help people get it!
80. Summary
- Random variables
- Distributions
- Statistics (and some hazard warnings)
- Experiment design guidelines