Title: SBD: Usability Evaluation
1. SBD: Usability Evaluation
2. Scenario-Based Design (process overview)
- ANALYZE: analysis of stakeholders, field studies; claims about current practice -> Problem scenarios
- DESIGN: metaphors, information technology, HCI theory, guidelines -> Activity scenarios -> Information scenarios -> Interaction scenarios
- PROTOTYPE & EVALUATE: iterative analysis of usability claims and re-design -> Usability specifications; formative evaluation; summative evaluation
3. Evaluation
- Formative vs. Summative
- Analytic vs. Empirical
4. Usability Engineering
- Cycle: Requirements Analysis -> Design -> Develop -> Evaluate
- Many iterations
5. Usability Engineering
- Formative evaluation
- Summative evaluation
6. Usability Evaluation
- Analytic Methods
  - Usability inspection, expert review
  - Heuristic Evaluation
  - Cognitive walkthrough
  - GOMS analysis
- Empirical Methods
  - Usability Testing
    - Field or lab
    - Observation, problem identification
  - Controlled Experiment
    - Formal, controlled scientific experiment
    - Comparisons, statistical analysis
7. User Interface Metrics
- Ease of learning
  - learning time
- Ease of use
  - performance time, error rates
- User satisfaction
  - surveys
- NOT "user friendly" (too vague to measure)
8. Usability Testing
9. Usability Testing
- Formative: helps guide design
- Early in design process
  - when the architecture is finalized, it's too late!
- A few users
- Usability problems, incidents
- Qualitative feedback from users
- Quantitative usability specification
10. Usability Specification Table

Scenario task                           | Worst case | Planned target | Best case (expert) | Observed
Find the most expensive house for sale? | 1 min      | 10 sec         | 3 sec              | ??? sec
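A minimal sketch (Python) of how observed times from a test session might be checked against a specification row like the one above; the task name, times, and thresholds are illustrative, not measured data.

    # Hypothetical check of observed task times against a usability spec row.
    spec = {
        "find most expensive house": {"worst": 60.0, "target": 10.0, "best": 3.0},  # seconds
    }
    observed = {"find most expensive house": [14.2, 9.1, 11.8, 8.7, 10.5]}  # per-user times

    for task, times in observed.items():
        avg = sum(times) / len(times)
        s = spec[task]
        status = ("meets target" if avg <= s["target"]
                  else "acceptable" if avg <= s["worst"]
                  else "fails worst case")
        print(f"{task}: avg {avg:.1f}s -> {status}")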
11. Usability Test Setup
- Set of benchmark tasks
  - Easy to hard, specific to open-ended
  - Coverage of different UI features
  - E.g. find the 5 most expensive houses for sale
  - Different types: learnability vs. performance
- Consent forms
  - Not needed unless video-taping users' faces (new rule)
- Experimenters
  - Facilitator: instructs the user
  - Observers: take notes, collect data, video-tape the screen
  - Executor: runs the prototype if it is faked
- Users
  - 3-5 users; quality, not quantity
12. Usability Test Procedure
- Goal: mimic real life
  - Do not cheat by showing them how to use the UI!
- Initial instructions
  - "We are evaluating the system, not you."
- Repeat:
  - Give the user a task
  - Ask the user to think aloud
  - Observe; note mistakes and problems
  - Avoid interfering; hint only if completely stuck
- Interview
  - Verbal feedback
  - Questionnaire
- 1 hour / user
13. Usability Lab
14. Data
- Note taking
  - E.g. user keeps clicking on the wrong button
- Verbal protocol (think aloud)
  - E.g. user thinks that button does something else
- Rough quantitative measures
  - HCI metrics, e.g. task completion time, ...
- Interview feedback and surveys
- Video-tape screen & mouse
- Eye tracking, biometrics?
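A small illustrative sketch of turning the rough notes and timestamps above into quantitative measures; the event log, timestamps, and the "wrong" keyword are invented for the example.

    # Derive rough measures (task completion time, error count) from hand-logged events.
    from datetime import datetime

    events = [
        ("10:02:10", "task 1 start"),
        ("10:02:45", "clicked the wrong button"),
        ("10:03:05", "task 1 done"),
    ]

    def to_time(s):
        return datetime.strptime(s, "%H:%M:%S")

    completion = (to_time(events[-1][0]) - to_time(events[0][0])).seconds
    errors = sum(1 for _, note in events if "wrong" in note)
    print(f"task 1: {completion} s, {errors} error(s)")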
15. Analyze
- Initial reaction
  - "stupid user!", "that's developer X's fault!", "this sucks"
- Mature reaction
  - how can we redesign the UI to solve that usability problem?
  - the user is always right
- Identify usability problems
  - Learning issues, e.g. can't figure out or didn't notice a feature
  - Performance issues, e.g. arduous, tiring to solve tasks
  - Subjective issues, e.g. annoying, ugly
- Problem severity: critical vs. minor
16. Cost-Importance Analysis
- Importance: 1-5 (task effect, frequency)
  - 5: critical, major impact on user, frequent occurrence
  - 3: user can complete the task, but with difficulty
  - 1: minor problem, small speed bump, infrequent
- Ratio: importance / cost
  - Sort by this
  - 3 categories: must fix, next version, ignored

Problem | Importance | Solutions | Cost | Ratio I/C
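The table above can be worked mechanically; the sketch below (Python) computes the importance/cost ratio, sorts by it, and buckets problems into the three categories. The example problems, scores, and the cut-off ratios (2 and 1) are assumptions for illustration.

    # Cost-importance analysis: sort problems by importance/cost, then bucket them.
    problems = [
        {"problem": "zoom feature not noticed", "importance": 4, "cost": 2},
        {"problem": "confusing icon label",     "importance": 2, "cost": 1},
        {"problem": "crash on empty query",     "importance": 5, "cost": 8},
    ]

    for p in problems:
        p["ratio"] = p["importance"] / p["cost"]

    for p in sorted(problems, key=lambda p: p["ratio"], reverse=True):
        # Cut-offs are illustrative, not from the slides.
        category = ("must fix" if p["ratio"] >= 2
                    else "next version" if p["ratio"] >= 1
                    else "ignore")
        print(f'{p["problem"]:<28} I={p["importance"]} C={p["cost"]} '
              f'I/C={p["ratio"]:.1f} -> {category}')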
17. Refine UI
- Simple solutions vs. major redesigns
- Solve problems in order of importance/cost
- Example
  - Problem: user didn't know he could zoom in to see more
  - Potential solutions
    - Better zoom button icon, tooltip
    - Add a zoom bar slider (like Moosburg)
    - Icons for different zoom levels: boundaries, roads, buildings
    - NOT more help documentation!!! You can do better.
- Iterate
  - Test, refine, test, refine, test, refine, ...
  - Until? Meets the usability specification
18. Project Usability Evaluation
- Usability Evaluation
  - > 3 users; not (tainted) HCI students
  - Simple data collection (biometrics optional!)
  - Exploit this opportunity to improve your design
- Report
  - Procedure (users, tasks, specs, data collection)
  - Usability problems identified, specs not met
  - Design modifications
19. Controlled Experiments
20. Usability Test vs. Controlled Experiment
- Usability test
  - Formative: helps guide design
  - Single UI, early in design process
  - Few users
  - Usability problems, incidents
  - Qualitative feedback from users
- Controlled experiment
  - Summative: measures the final result
  - Compare multiple UIs
  - Many users, strict protocol
  - Independent & dependent variables
  - Quantitative results, statistical significance
21. What is Science?
22. Scientific Method
- Form hypothesis
- Collect data
- Analyze
- Accept/reject hypothesis
- How to prove a hypothesis in science?
  - Easier to disprove things, by counterexample
  - Null hypothesis: the opposite of the hypothesis
  - Disprove the null hypothesis
  - Hence, the hypothesis is supported
23. Empirical Experiment
- Typical question
  - Which visualization is better in which situations?
  - Spotfire vs. TableLens
24. Cause and Effect
- Goal: determine cause and effect
  - Cause: visualization tool (Spotfire vs. TableLens)
  - Effect: user performance time on task T
- Procedure
  - Vary the cause
  - Measure the effect
- Problem: random variation
  - Cause: vis tool OR random variation?
- (Diagram: real world -> collected data, with random variation -> uncertain conclusions)
25. Stats to the Rescue
- Goal
  - Show the measured effect is unlikely to result from random variation
- Hypothesis
  - Cause: visualization tool (e.g. Spotfire != TableLens)
- Null hypothesis
  - Visualization tool has no effect (e.g. Spotfire = TableLens)
  - Hence cause = random variation
- Stats
  - If the null hypothesis were true, then the measured effect would occur with probability < 5% (e.g. measured effect >> random variation)
- Hence
  - Null hypothesis unlikely to be true
  - Hence, hypothesis likely to be true
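One way to make the "measured effect vs. random variation" argument concrete is a permutation-style simulation: repeatedly shuffle the tool labels and count how often chance alone produces a difference as large as the observed one. This is only an illustration of the logic; the slides themselves use t-tests and ANOVA (later), and the timing data here is invented.

    # Permutation-style illustration of the null-hypothesis argument.
    import random

    spotfire  = [37, 41, 35, 44, 39, 42, 38, 40]   # invented task times (sec)
    tablelens = [30, 33, 29, 35, 31, 28, 34, 32]
    observed_diff = abs(sum(spotfire) / 8 - sum(tablelens) / 8)

    pooled = spotfire + tablelens
    hits, trials = 0, 10_000
    for _ in range(trials):
        random.shuffle(pooled)                      # pretend the tool labels are arbitrary
        a, b = pooled[:8], pooled[8:]
        if abs(sum(a) / 8 - sum(b) / 8) >= observed_diff:
            hits += 1

    print("p ~", hits / trials)   # small p => difference unlikely to be random variation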
26. Variables
- Independent variables (what you vary) and treatments (the variable values)
  - Visualization tool
    - Spotfire, TableLens, Excel
  - Task type
    - Find, count, pattern, compare
  - Data size (# of items)
    - 100, 1000, 1,000,000
- Dependent variables (what you measure)
  - User performance time
  - Errors
  - Subjective satisfaction (survey)
  - HCI metrics
27. Example: 2 x 3 design

                          Ind Var 2: Task Type
                          Task1    Task2    Task3
Ind Var 1:   Spotfire
Vis. Tool    TableLens

Each cell: measured user performance times (dep var)
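A tiny sketch enumerating the six cells of this 2 x 3 design with itertools.product; the extra independent variables from the previous slide could be added as further factors in the same way.

    # Enumerate the cells of the factorial design above.
    from itertools import product

    vis_tools = ["Spotfire", "TableLens"]
    task_types = ["Task1", "Task2", "Task3"]

    cells = list(product(vis_tools, task_types))   # 2 x 3 = 6 cells
    for tool, task in cells:
        print(tool, task)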
28. Groups
- Between-subjects variable
  - 1 group of users for each variable treatment
  - Group 1: 20 users, Spotfire
  - Group 2: 20 users, TableLens
  - Total: 40 users, 20 per cell
- Within-subjects (repeated) variable
  - All users perform all treatments
  - Counter-balancing order effects
  - Group 1: 20 users, Spotfire then TableLens
  - Group 2: 20 users, TableLens then Spotfire
  - Total: 40 users, 40 per cell
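A sketch of how the two grouping schemes above could be assigned in code, assuming 40 hypothetical users and the two tools; within-subjects order alternates between users to counter-balance order effects.

    # Between-subjects vs. within-subjects assignment (illustrative users and tools).
    users = [f"user{i:02d}" for i in range(40)]

    # Between-subjects: each user sees exactly one tool.
    between = {"Spotfire": users[:20], "TableLens": users[20:]}

    # Within-subjects: every user sees both tools, order counter-balanced.
    within = {u: (["Spotfire", "TableLens"] if i % 2 == 0
                  else ["TableLens", "Spotfire"])
              for i, u in enumerate(users)}

    print(len(between["Spotfire"]), "users per cell (between)")
    print(within["user00"], within["user01"])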
29. Issues
- Eliminate or measure extraneous factors
  - Randomize
  - Fairness
  - Identical procedures
  - Bias
- User privacy, data security
  - IRB (Institutional Review Board)
30. Procedure
- For each user
  - Sign legal forms
  - Pre-survey: demographics
  - Instructions
    - Do not reveal the true purpose of the experiment
  - Training runs
  - Actual runs
    - Give task
    - Measure performance
  - Post-survey: subjective measures
- n users
31. Data
- Measured dependent variables
- Spreadsheet:

User | Spotfire task 1 | Spotfire task 2 | Spotfire task 3 | TableLens task 1 | TableLens task 2 | TableLens task 3
32. Step 1: Visualize it
- Dig out interesting facts
- Qualitative conclusions
- Guide stats
- Guide future experiments
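A possible sketch of this step: load a wide spreadsheet laid out like the previous slide and box-plot every per-user time per condition before running any statistics, so spread and outliers are visible. The file name "results.csv" and its column names are assumptions.

    # Visualize all per-user times before doing stats.
    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("results.csv")            # columns: user, spotfire_task1, ..., tablelens_task3
    long = df.melt(id_vars="user", var_name="condition", value_name="time_sec")

    long.boxplot(column="time_sec", by="condition")
    plt.ylabel("Task time (sec)")
    plt.show()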
33. Step 2: Stats

                          Ind Var 2: Task Type
                          Task1    Task2    Task3
Ind Var 1:   Spotfire      37.2     54.5    103.7
Vis. Tool    TableLens     29.8     53.2    145.4

Average user performance times in seconds (dep var)
34. TableLens better than Spotfire?
- Problem with averages: lossy
  - Compares only 2 numbers
  - What about the 40 data values? (Show me the data!)
- (Bar chart: average performance time in secs, Spotfire vs. TableLens)
35. The real picture
- Need stats that compare all the data
- (Chart: all per-user performance times in secs, Spotfire vs. TableLens)
36. Statistics
- t-test
  - Compares 1 dep var on 2 treatments of 1 ind var
- ANOVA (Analysis of Variance)
  - Compares 1 dep var on n treatments of m ind vars
- Result
  - p = probability that the difference between treatments is random (null hypothesis)
  - statistical significance level
  - typical cut-off: p < 0.05
  - Hypothesis confidence = 1 - p
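A sketch of both tests using SciPy on invented timing data. Note that scipy.stats.f_oneway is a one-way ANOVA (one independent variable); a factorial ANOVA over several independent variables would need, e.g., statsmodels.

    # t-test and one-way ANOVA on invented task times.
    from scipy import stats

    spotfire  = [37, 41, 35, 44, 39, 42, 38, 40]
    tablelens = [30, 33, 29, 35, 31, 28, 34, 32]

    # t-test: 1 dep var, 2 treatments of 1 ind var
    t, p = stats.ttest_ind(spotfire, tablelens)
    print(f"t-test: p = {p:.4f}")              # p < 0.05 => difference unlikely to be random

    # One-way ANOVA: 1 dep var, n treatments (a third tool added for illustration)
    excel = [45, 50, 48, 52, 47, 49, 51, 46]
    f, p = stats.f_oneway(spotfire, tablelens, excel)
    print(f"ANOVA:  p = {p:.4f}")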
37. In Excel
38. p < 0.05
- Woohoo!
  - Found a statistically significant difference
  - Averages determine which is better
- Conclusion
  - Cause: visualization tool (e.g. Spotfire != TableLens)
  - Vis tool has an effect on user performance for task T
  - 95% confident that TableLens is better than Spotfire
    - NOT: "TableLens beats Spotfire 95% of the time"
  - 5% chance of being wrong!
  - Be careful about generalizing
39. p > 0.05
- Hence, no difference?
  - Vis tool has no effect on user performance for task T?
  - Spotfire = TableLens?
- NOT!
  - Did not detect a difference, but they could still be different
  - A potential real effect did not overcome random variation
  - Provides evidence for Spotfire = TableLens, but not proof
- Boring; basically found nothing
- How?
  - Not enough users
  - Need better tasks, data, ...
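For the "not enough users" case, a power calculation estimates how many users would be needed to detect an effect; the sketch below uses statsmodels with an assumed medium effect size of 0.5, not measured data.

    # How many users per group to detect a medium effect with 80% power at p < 0.05?
    from statsmodels.stats.power import TTestIndPower

    n = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.8)
    print(f"~{n:.0f} users per group")         # roughly 64 for a medium effect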
40. Data Mountain
- Robertson, Data Mountain (Microsoft)
41. Data Mountain Experiment
- Data Mountain vs. IE favorites
- 32 subjects
- Organize 100 pages, then retrieve them based on cues
- Indep. vars
  - UI: Data Mountain (old, new), IE
  - Cue: Title, Summary, Thumbnail, all 3
- Dependent variables
  - User performance time
  - Error rates: wrong pages, failed to find within 2 min
  - Subjective ratings
42. Data Mountain Results
- Spatial Memory!
- Limited scalability?