Transcript and Presenter's Notes

Title: IEOR 170: Quantitative Evaluation


1
IEOR 170 Quantitative Evaluation
  • Jingtao Wang
  • 4/16/2007

Slides based on those of John Canny and Maneesh
Agrawala
2
Administrivia
  • Design Notebook/Idea Logs review deadline is
    4/25/2007
  • High-Fidelity Prototype and Evaluation assignment
    has been released today
  • Due 4/25/2007
  • 15% of the grading is group-specific

3
Previously on IEOR 170
  • Heuristic Evaluation
  • Nielsen's 10 Heuristics
  • Evaluation Process
  • Pros and Cons

4
Qualitative vs. Quantitative Studies
  • Qualitative: what we've been doing so far
  • Contextual Inquiry: trying to understand users'
    tasks and their conceptual model.
  • Usability Studies: looking for critical incidents
    in a user interface
  • Qualitative methods help us
  • Understand what's going on,
  • Look for problems,
  • Get a rough idea of the usability of an
    interface.

5
Quantitative Studies
  • Quantitative
  • Used to reliably measure something
  • Compare two or more designs on a measurable
    aspect
  • Approaches
  • Collect and analyze user events that occur in
    natural use
  • Key presses, Mouse clicks
  • Controlled experiments
  • Examples of measures
  • Time to complete a task.
  • Average number of errors on a task.
  • Users' ratings of an interface
  • Ease of use, elegance, performance, robustness,
    speed, …
  • You could argue that users' perception of
    speed, error rates, etc. is more important than
    their actual values.

6
Comparison
  • Qualitative studies
  • Faster, less expensive → especially useful in the
    early stage of the design cycle
  • In real-world design quantitative study not
    always necessary
  • Quantitative studies
  • Reliable, repeatable results → scientific method
  • Best studies produce generalizable results

7
Steps in Designing an Experiment
  • State a lucid, testable hypothesis
  • Identify variables (independent, dependent
    control, random)
  • Design the experiment protocol
  • Choose user population
  • Apply for human subjects protocol review
  • Run pilot studies
  • Run the experiment
  • Perform statistical analysis
  • Draw conclusions

8
Example Menu Selection
Guimbretière et al. '03
9
Lucid, Testable Hypothesis
  • Because users must reach for it, tool palette
    will be slower
  • Other hypotheses?

10
Experiment Design
  • Testable hypothesis
  • Precise statement of expected outcome
  • Factors (independent variables)
  • Attributes we manipulate/vary in each condition
  • Levels: values for independent variables
  • Response variables (dependent variables)
  • Outcome of experiment (measurements)
  • Usually measure user performance
  • Time
  • Errors

11
Experiment Design
  • Control variables
  • Attributes that will be fixed throughout the
    experiment
  • Confound: an attribute that varied and was not
    accounted for
  • Problem: the confound, rather than the IV, could
    have caused the change in the DVs
  • Confounds make it difficult/impossible to draw
    conclusions
  • Random variables
  • Attributes that are randomly sampled
  • Increases generalizability

12
Variables
  • Independent variables
  • Dependent variables
  • Control variables
  • Random variables

13
Variables
  • Independent variables
  • Menu type (4 choices)
  • Device type (2 choices)
  • Dependent variables
  • Time
  • Error rate
  • User satisfaction
  • Control variables
  • Location/environment, …
  • Device type?
  • Random variables
  • Attributes of subjects
  • Age, sex, …

14
Goals
  • Internal validity
  • Manipulation of IV is cause of change in DV
  • Requires that experiment is replicable
  • External validity
  • Results are generalizable to other experimental
    settings
  • Ecological validity results generalizable to
    real-world settings
  • Confident in results
  • Statistics

15
Experimental Protocol
  • What is the task?
  • What are all the combinations of conditions?
  • How often to repeat each combination of
    conditions?
  • Between subjects or within subjects
  • Avoid bias (instructions, ordering, …)

16
Task Must Reflect Hypothesis
  • Connect the dots, choosing the given color for
    each one
  • Connected dots are filled in gray; the next dot is
    shown open in green

17
Number of Conditions
  • Consider all combinations to isolate effects of
    each IV (factorial design)
  • (4 menu types) × (2 device types) = 8 combinations
  • Tool Palette × Pen
  • Tool Palette × Mouse
  • Tool Glass × Pen
  • Tool Glass × Mouse
  • Flow Menu × Pen
  • Flow Menu × Mouse
  • Control Menu × Pen
  • Control Menu × Mouse
  • Adding levels or factors can yield lots of
    combinations
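Enumerating the full factorial crossing is mechanical; a minimal Python sketch (the factor labels mirror the slide, the variable names are illustrative):

```python
from itertools import product

# Factor levels from the slide's menu-selection example.
menus = ["Tool Palette", "Tool Glass", "Flow Menu", "Control Menu"]
devices = ["Pen", "Mouse"]

# A full factorial design crosses every level of every factor.
conditions = list(product(menus, devices))
print(len(conditions))  # 4 x 2 = 8 conditions
```

Adding a third factor (say, 3 levels of user expertise) would multiply this to 24 conditions, which is why combinations grow quickly.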

18
Reducing Number of Conditions
  • Vary only one independent variable leaving others
    fixed
  • Problem?

19
Reducing Number of Conditions
  • Vary only one independent variable leaving others
    fixed
  • Problem Will miss effects of interactions

20
Other Reduction Strategies
  • Run a few independent variables at a time
  • If strong effect, include variable in future
    studies
  • Otherwise pick fixed control value for it
  • Fractional factorial design
  • Procedures for choosing subset of independent
    variables to vary in each experiment

21
Choosing Subjects
  • Pick balanced sample reflecting intended user
    population
  • Novices, experts
  • Age group
  • Sex
  • Example
  • 12 non-colorblind right-handed adults (male and
    female)
  • Population group can also be an IV or a
    controlled variable
  • What is the disadvantage of making population a
    controlled variable?
  • What are the pros/cons of making population an
    IV?

22
Between Subjects Design
23
Within Subjects Design
24
Between vs. Within Subjects
  • Between subjects
  • Each participant uses one condition
  • +/− Participants cannot compare conditions
  • + Can collect more data for a given condition
  • − Need more participants
  • Within subjects
  • All participants try all conditions
  • + Compare one person across conditions to isolate
    effects of individual differences
  • + Requires fewer participants
  • − Fatigue effects
  • − Bias due to ordering/learning effects

25
Within Subjects Ordering Effects
  • In within-subjects designs ordering of conditions
    is a variable that can confound results
  • Why?
  • Turn it into a random variable
  • Randomize order of conditions across subjects
  • Counterbalancing (ensure all orderings are
    covered)
  • Latin square (partial counterbalancing)
  • Menu selection example Within-subjects, each
    subject tries each condition multiple times,
    ordering counterbalanced
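A balanced Latin square for an even number of conditions can be built with the standard first-row construction (0, 1, n−1, 2, n−2, …, with each later row shifted by one); a sketch, with an illustrative function name:

```python
def balanced_latin_square(n):
    """Orderings of n conditions (n even): each condition appears once
    in every position, and each condition immediately follows every
    other condition exactly once across the orderings."""
    # First ordering: 0, 1, n-1, 2, n-2, ... ; each later ordering
    # shifts every entry by +1 (mod n).
    first, lo, hi, take_lo = [0], 1, n - 1, True
    while len(first) < n:
        if take_lo:
            first.append(lo)
            lo += 1
        else:
            first.append(hi)
            hi -= 1
        take_lo = not take_lo
    return [[(c + i) % n for c in first] for i in range(n)]

orders = balanced_latin_square(4)
# orders[0] == [0, 1, 3, 2]: subject 1 sees condition 0, then 1, then 3, then 2
```

For the 4 menu types this gives 4 orderings; assigning subjects to them in rotation partially counterbalances ordering effects without needing all 24 permutations.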

26
Run the Experiment
  • Always pilot it first!
  • Reveals unexpected problems
  • Can't change the experiment design after starting it
  • Always follow the same steps; use a checklist
  • Get consent from subjects
  • Debrief subjects afterwards

27
Results Statistical Analysis
  • Compute central tendencies (descriptive summary
    statistics) for each independent variable
  • Mean
  • Standard deviation
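As a sketch, Python's standard statistics module computes both summaries directly (the completion times below are hypothetical):

```python
import statistics

# Hypothetical task-completion times (seconds) for one condition.
times = [12.1, 9.8, 11.4, 10.7, 13.0, 10.2]

mean = statistics.mean(times)  # central tendency
sd = statistics.stdev(times)   # sample standard deviation (n - 1 denominator)
print(round(mean, 2), round(sd, 2))  # 11.2 1.21
```

Note `stdev` uses the n − 1 (sample) denominator, which is what you want when the subjects are a sample from a larger population; `pstdev` would treat them as the whole population.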

28
Normal Distributions
  • Often DVs are assumed to have a Normal
    distribution
  • At left is the density, right is the cumulative
    prob.
  • Normal distributions are completely characterized
    by their mean and variance (mean squared
    deviation from the mean).

29
Are the Results Meaningful?
  • Hypothesis testing
  • Hypothesis: manipulation of the IV affects the DV
    in some way
  • Null Hypothesis: manipulation of the IV has no effect
    on the DV
  • Null hypothesis assumed true unless statistics allow
    us to reject it
  • Statistical Significance (p value)
  • Likelihood that results are due to chance
    variation
  • p < 0.05 usually considered significant
    (sometimes p < 0.01)
  • Means that if the null hypothesis were true, there
    would be a <5% chance of results this extreme
  • Statistical tests
  • T-test (1 factor, 2 levels)
  • Correlation
  • ANOVA (1 factor with > 2 levels, or multiple factors)
  • MANOVA (> 1 dependent variable)

30
T-test
  • Compare means of 2 groups
  • Null hypothesis No difference between means
  • Assumptions
  • Samples are normally distributed
  • Very robust in practice
  • Population variances are equal (between subjects
    tests)
  • Reasonably robust for differing variances
  • Individual observations in samples are
    independent
  • Extremely important!
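A minimal sketch of the computation behind the equal-variance (Student's) two-sample t-test, in plain Python; in practice you would use a statistics package and compare the statistic against a t distribution with nx + ny − 2 degrees of freedom. The data and function name here are illustrative:

```python
import math

def two_sample_t(x, y):
    """Student's t statistic for two independent samples,
    assuming equal population variances (pooled estimate)."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    # Unbiased sample variances (n - 1 denominator)
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    pooled = ((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2)
    return (mx - my) / math.sqrt(pooled * (1 / nx + 1 / ny))

# Hypothetical completion times under two designs
t = two_sample_t([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
print(round(t, 4))  # -1.2247, with nx + ny - 2 = 4 degrees of freedom
```

The independence assumption in the slide is baked in here: each element of `x` and `y` must be a separate, independent observation (one per subject).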

31
Correlation
  • Measure extent to which two variables are related
  • Does not imply cause and effect
  • Example Ice cream eating and drowning
  • Need a large enough sample size
  • Regression
  • Compute the best fit
  • Linear
  • Logistic

32
Lies, Damn lies and Statistics
  • A common mistake (made by famous HCI researchers)
  • Increasing n, the number of trials, by running
    each subject several times.
  • No! The analysis only works when trials are
    independent.
  • All the trials for one subject are dependent,
    because that subject may be faster/slower/less
    error-prone than others.
  • Making this error will not help you become a
    famous HCI researcher ☺.

33
Statistics with Care
  • What you can do to get better significance
  • Run each subject several times, compute the
    average for each subject.
  • Run the analysis as usual on subjects' average
    times, with n = the number of subjects.
  • This decreases the per-subject variance, while
    keeping data independent.
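The recipe above might look like this in Python (the subject IDs and times are hypothetical):

```python
from statistics import mean

# Hypothetical raw data: several trials per subject (seconds).
trials = {
    "s1": [10.2, 11.0, 9.8],
    "s2": [14.1, 13.5, 13.9],
    "s3": [11.7, 12.3, 12.0],
}

# Collapse to one independent observation per subject before analysis.
per_subject = [mean(ts) for ts in trials.values()]
n = len(per_subject)  # n = number of subjects, NOT number of trials
print(n, [round(v, 2) for v in per_subject])
```

The t-test (or ANOVA) is then run on `per_subject` with n = 3, not on the 9 raw trials.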

34
Statistics with Care
  • Another common mistake
  • An experiment fails to find a significant
    difference between test and control cases (say at
    p = 0.05), so you conclude that there is no
    significant difference.
  • No!
  • A difference-of-averages test can only confirm
    (with high probability) that there is a
    difference. Failure to prove a significant
    difference can be because
  • There is no difference, OR
  • The number of subjects in the experiment is too
    small

35
Statistics with Care
  • Example: what should you conclude if you find no
    significant difference at p = 0.05, but there is
    a difference at p = 0.2?
  • First of all, the result does not confirm a
    significant difference with any confidence.
  • However, while there may not be a significant
    difference, it is more likely that there is but
    it is too weak at the N chosen. Therefore, try
    repeating the experiment with a larger N.

36
Statistics with Care
  • You write a paper with 20 different studies, all
    of which demonstrate effects at p = 0.05
    significance. They're all right, right?
  • Actually, there is a significant probability (as
    high as 63%) that there is no real effect in at
    least one case.
  • Remember: a p-value is an upper bound on the
    probability of no effect, so there is always a
    chance the experiment gives the wrong result.
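The slide's figure comes from a simple independence calculation: with 20 studies each tested at p = 0.05, the chance that at least one shows an effect purely by chance is 1 − 0.95²⁰ ≈ 0.64. A one-line check:

```python
# Probability that at least one of 20 independent studies shows a
# "significant" effect purely by chance, if each uses p = 0.05.
p_at_least_one_false_positive = 1 - (1 - 0.05) ** 20
print(round(p_at_least_one_false_positive, 2))  # ~0.64
```

This is why multiple comparisons usually call for a correction (e.g. a stricter per-test threshold).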

37
(No Transcript)
38
Basics of Quantitative Methods
  • Random variables, probabilities, distributions
  • Review of statistics
  • Collecting data
  • Analyzing the data

39
Random Variables
  • Random variables take on different values
    according to a probability distribution.
  • E.g. X ∈ {1, 2, 3} is a discrete random variable
    with three possible values.
  • To characterize the variable, we need to define
    the probabilities for each value:
  • Pr[X=1] = Pr[X=2] = ¼, Pr[X=3] = ½
  • On each trial or experiment, we should see one of
    these three values with the given probability.
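This distribution is easy to simulate, which is a good way to sanity-check the claimed long-run frequencies; a sketch using the standard library:

```python
import random

random.seed(0)  # for reproducibility
N = 100_000

# Draw N trials of X with Pr[X=1] = Pr[X=2] = 1/4, Pr[X=3] = 1/2.
samples = random.choices([1, 2, 3], weights=[1, 1, 2], k=N)

for v in (1, 2, 3):
    print(v, samples.count(v) / N)  # approx. 0.25, 0.25, 0.50
```

With N large, the observed proportions settle near the defining probabilities, which is exactly the "about N/2, N/4, N/4 times" claim on the next slide.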

40
Random Variables and Trials
  • When we examine X after a series of trials, we
    might see the values 1, 1, 3, 2, 3, 1, 3, 3, 3,
    1, 2, …
  • We often want to denote the value of X on a
    particular trial, such as Xi for the ith trial.
  • Then the above sequence could also be written as
  • X1 = 1, X2 = 1, X3 = 3, X4 = 2, X5 = 3, X6 = 1,
    X7 = 3, X8 = 3, X9 = 3, X10 = 1, X11 = 2, …
  • For large N, the sequence X1, …, XN should
    contain the value 3 about N/2 times, the value 2
    about N/4 times, and the value 1 about N/4 times.

41
Random Variables and Trials
  • Q How would you represent a fair coin toss with
    a random variable?
  • X ∈ {H, T}, Pr[X=H] = ½, Pr[X=T] = ½
  • Q How would you represent a 6-sided die toss?
  • Y ∈ {1, 2, 3, 4, 5, 6}, Pr[Y=i] = 1/6 for 1 ≤ i ≤ 6,
    Pr[Y=i] = 0 otherwise

42
Independence
  • Consider a random variable X which is the value
    of a fair die toss. Now consider Y, which is the
    value of another fair die toss.
  • Knowing the value of X tells us nothing about the
    value of Y and vice versa. We say X and Y are
    independent random variables.
  • However, if we defined Z = X + Y, then Z is
    dependent on X and vice versa (large values of X
    increase the probability of large values of Z,
    and Z must be at least X + 1).

43
Independent Trials
  • We will often want to use random variables whose
    values on different trials are independent.
  • If this is true, we say the experiment has
    independent trials.
  • Example tossing a fair die many times. Each toss
    is a random variable which is independent of the
    other trials.

44
Random Variables
  • Given Pr[X=1] = Pr[X=2] = ¼, Pr[X=3] = ½ we can
    also represent the distribution with a graph

45
Continuous Random Variables
  • Some random variables take on continuous values,
    e.g. Y ∈ [−1, 1].
  • The probability must be defined by a probability
    density function (pdf).
  • E.g. p(Y) = ¾ (1 − Y²)
  • Note that the area under the curve is the total
    probability, which must be 1.

(figure: the density p(Y) = ¾ (1 − Y²) on [−1, 1], peaking at ¾)
46
Continuous Random Variables
  • The area under the pdf curve between two values
    gives the probability that the value of the
    variable lies in that range.
  • i.e. Pr[a < Y < b] = the integral of p(Y) from a to b
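Both facts, the total area being 1 and Pr[a < Y < b] being the area between a and b, can be checked numerically for the example density; a sketch using a simple midpoint rule:

```python
# Numerically check that p(Y) = 3/4 * (1 - Y**2) integrates to 1 on [-1, 1],
# and compute Pr[0 < Y < 1] as the area under the curve on that range.
def p(y):
    return 0.75 * (1 - y * y)

def area(a, b, steps=100_000):
    h = (b - a) / steps
    # Midpoint rule: sum of strip areas p(midpoint) * h
    return sum(p(a + (i + 0.5) * h) for i in range(steps)) * h

print(round(area(-1.0, 1.0), 6))  # 1.0 (total probability)
print(round(area(0.0, 1.0), 6))   # 0.5 (by symmetry of the density)
```

The exact values follow from integrating ¾ (1 − Y²): the antiderivative is ¾ (Y − Y³/3).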

47
Meaning of the Distribution
  • The limit of the area as the range [a, a + dY]
    goes to zero gives Pr[a < Y < a + dY] ≈ p(a) dY

(figure: the density with a narrow range of width dY marked at Y = a)
48
CDF Cumulative Distribution
  • The CDF is the area under the distribution from
    −∞ to some value v: C(v) = the integral of p(Y) up to v
  • So C(−∞) = 0 and C(∞) = 1

(figure: the cumulative area under the density from −1 up to v)
49
Mean and Variance
  • The mean is the expected value of the variable.
    It's roughly the average value of the variable
    over many trials.
  • Mean: E[Y] = the integral of Y p(Y) dY
  • In this case E[Y] = ½

(figure: the density with the mean marked at ½)
50
Variance
  • Variance is the expected value of the squared
    difference from the mean. It's roughly the squared
    width of the distribution.
  • Var[Y] = E[(Y − E[Y])²]
  • Standard deviation Std[Y] is the square root of the
    variance.

(figure: the density with its spread around the mean marked)
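Applying these definitions to a fair six-sided die gives a concrete worked example (a sketch in Python):

```python
# Mean and variance of a fair six-sided die, directly from the definitions.
values = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6

mean = sum(v * p for v, p in zip(values, probs))               # E[Y]
var = sum((v - mean) ** 2 * p for v, p in zip(values, probs))  # E[(Y - E[Y])^2]
std = var ** 0.5

print(mean, round(var, 4), round(std, 4))  # 3.5, 2.9167 (= 35/12), 1.7078
```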
51
Mean and Variance
  • What is the mean and variance for the following
    distribution?

(figure: a discrete distribution over the values 2, 3, and 4 with probabilities ¼ and ½)
52
Sums of Random Variables
  • For any X1 and X2, the expected value of a sum is
    the sum of the expected values
  • E[X1 + X2] = E[X1] + E[X2]
  • For independent X1 and X2, the variance of the
    sum is also the sum of the variances
  • Var[X1 + X2] = Var[X1] + Var[X2]

53
Identical Trials
  • For independent trials with the same mean and
    variance E[X] and Var[X],
  • E[X1 + … + Xn] = n E[X]
  • Var[X1 + … + Xn] = n Var[X]
  • Std[X1 + … + Xn] = √n Std[X]
  • where Std[X] = Var[X]^½

54
Identical Trials
  • If we define Avg(X1, …, Xn) = (X1 + … + Xn)/n,
    then
  • E[Avg(X1, …, Xn)] = E[X]
  • while
  • Std[Avg(X1, …, Xn)] = (1/√n) Std[X]
  • i.e. the standard deviation in an average value
    decreases with n, the number of trials.
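The 1/√n narrowing is easy to see concretely for die tosses (a sketch; the numbers follow from Var[X] = 35/12 for a fair die):

```python
import math

# Std[X] for one fair die toss; the std of the average of n tosses
# is Std[X] / sqrt(n).
var_x = sum((v - 3.5) ** 2 for v in range(1, 7)) / 6  # 35/12
std_x = math.sqrt(var_x)

for n in (1, 25, 100):
    print(n, round(std_x / math.sqrt(n), 4))
# Quadrupling n (25 -> 100) only halves the spread of the average.
```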

55
Identical Trials
  • i.e. the distribution narrows in a relative
    sense.
  • The blue curve is the sum of 100 random trials,
    the red curve is the sum of 200.

56
Detecting Differences
  • The more times you repeat an experiment, the
    narrower the distributions of measured average
    values for two conditions.
  • So the more likely you are to detect a difference
    in a test variable between two cases.

57
  • Break

58
Variable Types
  • Independent variables: the ones you control
  • Aspects of the interface design
  • Characteristics of the testers
  • Discrete: A, B, or C
  • Continuous: time between clicks for double-click
  • Dependent variables: the ones you measure
  • Time to complete tasks
  • Number of errors

59
Some Statistics
  • Variables X, Y
  • A relation (hypothesis), e.g. X > Y
  • We would often like to know if a relation is true
  • e.g. X = time taken by novice users
  • Y = time taken by users with some training
  • To find out if the relation is true we do
    experiments to get lots of x's and y's
    (observations)
  • Suppose avg(x) > avg(y), or that most of the x's
    are larger than all of the y's. What does that
    prove?

60
Significance
  • The significance or p-value of an outcome is the
    probability that it happens by chance if the
    relation does not hold.
  • E.g. p = 0.05 means that there is a 1/20 chance
    that the observation happens if the hypothesis is
    false.
  • So the smaller the p-value, the greater the
    significance.

61
Significance
  • For instance p = 0.001 means there is a 1/1000
    chance that the observation would happen if the
    hypothesis is false. So the hypothesis is almost
    surely true.
  • Significance increases with number of trials.
  • CAVEAT You have to make assumptions about the
    probability distributions to get good p-values.
    There is always an implied model of user
    performance.

62
Normal Distributions
  • Many variables have a Normal distribution (pdf)
  • At left is the density, right is the cumulative
    prob.
  • Normal distributions are completely characterized
    by their mean and variance (mean squared
    deviation from the mean).

63
Normal Distributions
  • The std. deviation for a normal distribution
    occurs where the density is at about 60% of its
    maximum value

(figure: normal density with one standard deviation marked)
64
T-test
  • The T-test asks for the probability that E[X] >
    E[Y] is false.
  • i.e. the null hypothesis for the T-test is
    whether E[X] = E[Y].
  • What is the probability of that given the
    observations?

65
T-test
  • We actually ask for the probability that E[X] and
    E[Y] are at least as different as the observed
    means.

(figure: two overlapping distributions for X and Y)
66
Analyzing the Numbers
  • Example: prove that task 1 is faster on design A
    than design B.
  • Suppose the average time for design B is 20%
    higher than A.
  • Suppose subjects' times in the study have a std.
    dev. which is 30% of their mean time (typical).
  • How many subjects are needed?

67
Analyzing the Numbers
  • Example: prove that task 1 is faster on design A
    than design B.
  • Suppose the average time for design B is 20%
    higher than A.
  • Suppose subjects' times in the study have a std.
    dev. which is 30% of their mean time (typical).
  • How many subjects are needed?
  • Need at least 13 subjects for significance p = 0.01
  • Need at least 22 subjects for significance
    p = 0.001
  • (assumes subjects use both designs)

68
Analyzing the Numbers (cont.)
  • i.e. even with a strong (20%) difference, you need
    lots of subjects to prove it.
  • Usability test data is quite variable
  • 4 times as many tests will only narrow the range by
    2x
  • breadth of range depends on the square root of the
    number of test users
  • This is when surveys or automatic usability
    testing can help

69
Lies, Damn lies and Statistics
  • A common mistake (made by famous HCI researchers)
  • Increasing n, the number of trials, by running
    each subject several times.
  • No! The analysis only works when trials are
    independent.
  • All the trials for one subject are dependent,
    because that subject may be faster/slower/less
    error-prone than others.
  • Making this error will not help you become a
    famous HCI researcher ☺.

70
Statistics with Care
  • What you can do to get better significance
  • Run each subject several times, compute the
    average for each subject.
  • Run the analysis as usual on subjects' average
    times, with n = the number of subjects.
  • This decreases the per-subject variance, while
    keeping data independent.

71
Statistics with Care
  • Another common mistake
  • An experiment fails to find a significant
    difference between test and control cases (say at
    p = 0.05), so you conclude that there is no
    significant difference.
  • No!
  • A difference-of-averages test can only confirm
    (with high probability) that there is a
    difference. Failure to prove a significant
    difference can be because
  • There is no difference, OR
  • The number of subjects in the experiment is too
    small

72
Statistics with Care
  • Example: what should you conclude if you find no
    significant difference at p = 0.05, but there is
    a difference at p = 0.2?
  • First of all, the result does not confirm a
    significant difference with any confidence.
  • However, while there may not be a significant
    difference, it is more likely that there is but
    it is too weak at the N chosen. Therefore, try
    repeating the experiment with a larger N.

73
Statistics with Care
  • You write a paper with 20 different studies, all
    of which demonstrate effects at p = 0.05
    significance. They're all right, right?
  • Actually, there is a significant probability (as
    high as 63%) that there is no real effect in at
    least one case.
  • Remember: a p-value is an upper bound on the
    probability of no effect, so there is always a
    chance the experiment gives the wrong result.

74
Using Subjects
  • Between subjects experiment
  • Two groups of test users
  • Each group uses only 1 of the systems
  • Within subjects experiment
  • One group of test users
  • Each person uses both systems

75
Between Subjects
  • Two groups of testers, each use 1 system
  • Advantages
  • Users only have to use one system (practical).
  • No learning effects.
  • Disadvantages
  • Per-user performance differences confounded with
    system differences
  • Much harder to get significant results (many more
    subjects needed).
  • Harder to even predict how many subjects will be
    needed (depends on subjects).

76
Within Subjects
  • One group of testers who use both systems
  • Advantages
  • Much more significance for a given number of test
    subjects.
  • Disadvantages
  • Users have to use both systems (two sessions).
  • Order and learning effects (can be minimized by
    experiment design).

77
Example
  • Same experiment as before
  • System B is 20% slower than A
  • Subjects have 30% std. dev. in their times.
  • Within subjects
  • Need 13 subjects for significance p = 0.01
  • Between subjects
  • Typically require 52 subjects for significance
    p = 0.01.
  • But depending on the subjects, we may get lower
    or higher significance.

78
Experimental Details
  • Learning effects
  • Subjects do better when they repeat a trial
  • This can bias within-subjects studies
  • So balance the order of trials with equal
    numbers of A-B and B-A orders.
  • What if someone doesn't finish?
  • Multiply time and number of errors by 1/fraction
    of the trial that they completed.
  • Pilot study to fix problems
  • Do 2: first with colleagues, then with real users

79
Reporting the Results
  • Report what you did and what happened
  • Images and graphs help people "get it"!

80
Summary
  • Random variables
  • Distributions
  • Statistics (and some hazard warnings)
  • Experiment design guidelines