Title: An Introduction to Usability Testing
1An Introduction to Usability Testing
- Bill Killam, MA CHFP
- Adjunct Professor
- University of Maryland
- bkillam_at_user-centereddesign.com
2 3Definitions
- Usability testing is the common name for multiple forms of user-based system evaluation
- Popularized in the media by Jakob Nielsen in the 1990s and usually thought of as related to web site design
- Usability testing is one of the activities of a Human Factors Engineering (or user-centered design) methodology that is 50 years old
4What is being tested?
- Sid Smith's user-system interface (as opposed to user-computer interface)
- Systems are made up of specific users performing specific activities within a specific environment
- We can't redesign users, but we can design equipment, so our goal as designers is to design equipment to optimize system performance
5What does usability mean?
- ISO 9126
- A set of attributes that bear on the effort
needed for use, and on the individual assessment
of such use, by a stated or implied set of users
- ISO 9241 Definition
- Extent to which a product can be used by
specified users to achieve specified goals with
effectiveness, efficiency and satisfaction in a
specified context of use.
6What does usability mean?
- Jakob Nielsen
- Satisfaction
- Efficiency
- Learnability
- Low Errors
- Memorability
- Ben Shneiderman
- Ease of learning
- Speed of task completion
- Low error rate
- Retention of knowledge over time
- User satisfaction
7Usability Defined
- Accessibility
- A precursor to usability: if users cannot gain access to the product, its usability is a moot point
- Functional Suitability
- Does the product contain the functionality required by the user?
- Ease-of-learning
- Can users determine what functions the product performs?
- Can the user figure out how to exercise the functionality provided?
- Ease-of-use
- Can the user exercise the functionality accurately and efficiently once it's learned (includes accessibility issues)?
- Can users use it safely?
- Ease-of-recall
- Can the knowledge of operation be easily maintained over time?
- Subjective Preference
- Do users like using it?
8What determines a product's usability?
9perceptual issues (our brains deceive our senses)
15we have trouble with patterned data
22our perceptual abilities are limited in the
presence of noise
23THE QUICK BROWN FOX JUMPED OVER THE LAZY DOG'S BACK.
24THE QUICK BROWN FOX JUMPED OVER THE LAZY DOG'S BACK.
25The quick brown fox jumped over the lazy dog's back.
26our cognitive abilities and presumptions affect what we see
31- Jack and Jill went
- went up the
- Hill to fetch a
- a pail of milk
32- FINISHED FILES ARE THE RE-
- SULT OF YEARS OF SCIENTIF-
- IC STUDY COMBINED WITH THE
- EXPERIENCE OF MANY YEARS
33our cognitive abilities are limited
34- Let's play the game of 15. The pieces of the game are the numbers 1, 2, 3, 4, 5, 6, 7, 8, and 9. Each player takes a digit in turn. Once a digit is taken, the other player cannot use it. The first player to get three digits that sum to 15 wins.
- Here's a sample game: Player A takes 8. Player B takes 2. Then A takes 4, and B takes 3. A takes 5. What digit should B take? (A brute-force check of this position follows below.)
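The blocking move can be checked mechanically; a minimal sketch in Python, assuming only the position described above:

from itertools import combinations

# Brute-force check of the game-of-15 position above.
# A holds {8, 4, 5}, B holds {2, 3}; three digits summing to 15 win.
a_digits, b_digits = {8, 4, 5}, {2, 3}
remaining = set(range(1, 10)) - a_digits - b_digits

# Which remaining digit would let A complete a sum of 15 with two digits A already holds?
threats = {d for d in remaining
           if any(x + y + d == 15 for x, y in combinations(a_digits, 2))}
print(threats)  # {6} -- B must take 6 to block A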
39some mental processes have priority
41Red
Green
Blue
Orange
Yellow
Black
42Stroop
Stroop
Stroop
Stroop
Stroop
Stroop
43Orange
Yellow
Green
Black
Blue
Red
44our memory affects our abilities
45our psychology affects our abilities
46and we need protection from ourselves
48Usability Testing is
- any of a number of methodologies used to assess, based on our understanding of human capabilities and limitations, how a product's design contributes to or hinders its usability when used by the intended users to perform the intended tasks in the intended environment
49Usability Testing is not
- basic or even applied research
- a large-scale effort that can demonstrate internal and external reliability and validity
- an effort that requires statistical analysis to determine the significance of the results
50Usability Testing is not
- Market Research
- Qualitative and quantitative market research is an attempt to understand the actual or potential user population for a product or service
- Size of markets
- Reasons for purchasing/not purchasing
51Usability Testing is not
- User Acceptance Testing
- A process to ensure the functional specifications have been met
- It occurs after development, just before a product is shipped
- It tells you the functionality exists, regardless of whether it is usable
- It may capture some subjective preference data, but that data is not truly based on the product's usability
52When is usability best assessed?
- On an existing product, to determine if usability problems exist
- During the design phase of a product
- During the development phase of a product, to assess proposed changes
- Once a design is completed, to determine if goals were met
- Each of these represents a different purpose with a different approach, and they are often funded from different sources
53 54Types of Usability Testing
- Non-User Based
- Compliance Reviews
- Expert Reviews (Heuristic Evaluations)
- Cognitive Walkthroughs
- User-based
- Ethnographic Observation
- User Surveys/Questionnaires
- Think Aloud Protocols
- Performance-based Protocols
- Co-Discovery Protocols
55Non-User Based Testing
56Compliance Testing
- Possible (within limits) to be performed by anyone
- Can remove the low-level usability issues that often mask more significant usability issues
57Compliance Testing (concluded)
- Style Guide-based Testing
- Checklists
- Interpretation Issues
- Scope Limitations
- Available Standards
- Commercial GUI and Web Standards and Style Guides
- Domain-Specific GUI and Web Standards and Style Guides
- Internal Standards and Style Guides
- Interface Specification Testing
- May revert to user acceptance testing
58Expert Review
- Aka Heuristic Evaluation
- One or two usability experts review a product, application, etc.
- Free-format review or structured review based on heuristics
- Subjective, but based on sound usability and design principles
- Highly dependent on the qualifications of the reviewer(s)
59Expert Review (Concluded)
- Nielsen's 10 Most Common Mistakes Made by Web Developers (three versions)
- Shneiderman's 8 Golden Rules
- Constantine and Lockwood's Heuristics
- Forrester Group Heuristics
- Norman's 4 Principles of Usability
60Cognitive Walkthrough
- Team Approach
- Best if a diverse population of reviewers
- Issues related to cognition (understanding) more than presentation
- Also a low-cost form of usability testing
- Highly dependent on the qualifications of the
reviewer(s)
61User Based Testing
62User Surveys
- Standardized vs. Custom made
- Formats
- Closed ended
- Open ended
- Likert scale questions
- Cognitive Testing of the Survey itself
- User bias in evaluations
- Leniency Effect
- Central tendency
- Strictness Bias
- Strategic Responding
- The intent of the survey is so obvious that the participants attempt to respond in a way they believe will help
- Self-Selection versus Directed Surveys
63User Interviews
- Formats
- Closed ended
- Open ended
- Likert scale questions
- Strategic Responding
- The intent of the interview is so obvious that
the participants attempt to respond in a way they
believe will help
64Ethnographic Observation
- Field Study
- Natural Environment for the User
- Can be time-consuming and logistically prohibitive
65How to Design and Conduct a User-Based Test in a Lab
66Test Set-up
- What's the hypothesis?
- Required for research
- Required for usability testing?
- Define Your Variables
- Dependent and Independent Variables
- Confounding Variables
- Operationalize Your Variables (see the sketch below)
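To make the last item concrete, here is a minimal sketch (Python, with hypothetical field names) of how a vague dependent variable such as "ease of use" might be operationalized as measurable quantities per task, with the design variant as the independent variable:

from dataclasses import dataclass

@dataclass
class TaskMeasure:
    participant_id: str
    design_variant: str      # independent variable, e.g. "old" vs. "new"
    task_id: str
    completed: bool          # operationalized dependent variable: task success
    time_on_task_s: float    # operationalized dependent variable: efficiency
    error_count: int         # operationalized dependent variable: accuracy

# One made-up observation from a pilot session
print(TaskMeasure("P01", "new", "find_clinical_trial", True, 142.0, 1))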
67Participant Issues
- User-types
- Users versus user surrogates
- All profiles or specific user profiles/personas?
- Critical segments?
- How many?
- Relationship to statistical significance
- Discount Usability: whose rule? (see the problem-discovery sketch below)
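The "discount usability" argument about how many participants are needed is usually framed with the problem-discovery model attributed to Nielsen and Landauer; a minimal sketch, assuming the commonly quoted average detection probability p = 0.31 (the actual value varies by study and product):

def proportion_found(n_users: int, p: float = 0.31) -> float:
    # Proportion of usability problems expected to be found by n users,
    # assuming each user independently exposes a given problem with probability p.
    return 1 - (1 - p) ** n_users

for n in (1, 3, 5, 8, 15):
    print(n, round(proportion_found(n), 2))
# With p = 0.31, five users find roughly 84% of the findable problems.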
68Participant Issues (concluded)
- Selecting subjects
- Screeners
- Getting Subjects
- Convenience Sampling
- Recruiting
- Participant stipends
- Over recruiting
- Scheduling
69Test Set-up (concluded)
- Select a Protocol
- Between Subject Designs
- Within Subject Designs
- Selecting a protocol
- Non-interrupted think-aloud (with or without a critical incident analysis)
- Interrupted think-aloud
- Non-interrupted performance-based (with or without a critical incident analysis)
- Interrupted performance-based
- Co-discovery
70Defining Task Scenarios
- Areas of concern, redesign, or of interest
- Short, unambiguous tasks to be performed
- Interrelationship
- Wording is critical
- In the users' own terms
- Does not contain seeds to the correct solution
- Enough to form a complete test, but able to stay within the time limit
- Flexibility is key
- Variations ARE allowed
71Defining Task Scenarios (concluded)
- Scenarios are contrived for testing, may not be representative of real-world usage patterns, and are NOT required!
72Preparing Test Materials
- Consent form!
- Video release form!
- Receipt and confidentiality agreement!
- Demographic survey
- Facilitator's Guide
- Introductory comments
- Participant task descriptions
- Questionnaires, SUS, Cooper-Harper, etc.
- Note Taker's Forms
73Piloting the Design
- Getting subjects
- Convenience sampling
- Almost anyone will do
- Collect data
- Check for timing
74Conduct the Evaluation and Collect Data
- Collecting interaction data
- The data is NOT in the interface, the data is in the user!
- Collecting observed data
- Behavior
- Reactions
- Collecting participant comments
- Collecting subjective data
- Pre-test data
- Post-scenario data
- Post-test data
75Reporting the Results
- Traditional (parametric) descriptive and predictive statistics are meaningless
- These statistics require a normal distribution of the results to be valid
- The confidence intervals are too wide with small samples to draw a meaningful conclusion
- Example: Assume you test 8 people and 7 of them complete the task. The confidence interval is roughly 52 percentage points wide, so you can only claim that you believe between 47% and 99% of all users will likely succeed
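The 47% to 99% range appears consistent with an exact (Clopper-Pearson) binomial interval, although the deck does not name the method; a minimal sketch of that calculation:

from scipy.stats import beta

def clopper_pearson(successes: int, n: int, alpha: float = 0.05):
    # Exact two-sided binomial confidence interval for a completion rate.
    lower = beta.ppf(alpha / 2, successes, n - successes + 1) if successes > 0 else 0.0
    upper = beta.ppf(1 - alpha / 2, successes + 1, n - successes) if successes < n else 1.0
    return lower, upper

lo, hi = clopper_pearson(7, 8)
print(f"{lo:.1%} to {hi:.1%}")  # about 47.3% to 99.7%, an interval roughly 52 points wide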
76Reporting the Results (continued)
- Non-parametric statistics CAN be used
- Mann-Whitney U test (rank-sum test), Wilcoxon signed-rank test
- Is it worth it?
77Wilcoxon Signed-Rank Test
- Take a poll of the participants comparing the new product against the old version of the product. People might be asked to comment on the statement "The new design is an improvement over the old design" and given a choice of answers from "Definitely, it's the tops" to "No, definitely not, it's awful." The data would be a collection of opinions. Making up some data for purposes of demonstration, assume the following scale and results
78Wilcoxon Signed-Rank Test (continued)
- The hypothesis you would want to test would be: "The participants consider the new product an improvement."
- Always be cautious with rating scales, as people are not like rulers, watches, or thermometers. The difference between opinions cannot be neatly measured in the same way as the differences between two lengths, times, or temperatures. Is the difference between "Yes, it's a good asset" (2) and "Yes, it is an asset, it's OK" (3) the same amount of difference of opinion as between "I have no opinion" (4) and "Not really an asset" (5)? Is it sensible to measure the difference between two opinions numerically? Is the opinion "Yes, it is an asset, it's OK," rated as 3, three times weaker than the opinion rated as 1, "Definitely an asset, it's the tops"? There is nevertheless an order to the different opinions. Think of other situations where there is an order, but doing arithmetic with the numbering does not make sense.
- A quick eyeball test shows that none of those questioned thought it was awful and only one person thought it not very good, so a first impression is that people generally approve. If you start by assuming that in the population there is no opinion one way or the other, and that people's responses are symmetrically distributed about "no opinion," you can test the hypothesis that people think the shopping centre is an asset, with the null hypothesis that people have no opinion about it, their response being the median value 4.
79Wilcoxon Signed-Rank Test (continued)
- You need to be careful to choose the appropriate test statistic for the problem you are tackling.
- For a one-tailed test, where the alternative hypothesis is that the median is greater than a given value, the test statistic is W−. For a one-tailed test, where the alternative hypothesis is that the median is less than a given value, the test statistic is W+.
- For a two-tailed test, the test statistic is the smaller of W+ and W−.
- As people who think it an improvement will give a rating of less than 4, the null and alternative hypotheses can be stated as follows.
- H0: the median response is 4
- H1: the median response is less than 4
- One-tailed test, significance level 5%
80Wilcoxon Signed-Rank Test (continued)
- List the values
- Find the difference between each value and the median.
- Ignore the zeros and rank the absolute values of the remaining scores.
- Ignore the signs, start with the smallest difference and give it rank 1. Where two or more differences have the same value, find their mean rank and use this.
- Now check that W+ + W− equals ½n(n+1), where n is the number in the sample (having ignored the zeros). In this case n = 10.
- ½n(n+1) = ½ × 10 × 11 = 55
- W+ + W− = 9.5 + 45.5 = 55
81Wilcoxon Signed-Rank Test (continued)
- Compare the test statistic with the critical value in the tables. If the null hypothesis were true and the median is 4, you would expect W+ and W− to have roughly the same value. There are two possible test statistics here, W+ = 9.5 and W− = 45.5, and you have to decide which one to use. We are interested in W+, the sum of the ranks of ratings greater than 4. W+ is much less than W−, which suggests that more people felt the shopping centre was an asset. It could also suggest that those who expressed a negative view expressed a very strong one, with lots of high numbers in the ratings.
- Now you need to compare the value of W+, the test statistic, with the critical value from the table. Given that W+ is small, the key question becomes: is W+ significantly smaller than would happen by chance? The table helps you decide this by supplying the critical value. For a sample of 10, at the 5% significance level for a one-tailed test, the value is 10. As W+ is 9.5, which is less than this, the evidence suggests that we can reject the null hypothesis. Your conclusion is that the evidence shows, at the 5% significance level, that the public thinks the new shopping centre is an asset to the town.
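For reference, the same one-tailed, one-sample test can be run in software; a minimal sketch using scipy.stats.wilcoxon with made-up ratings on the 1-7 scale described above (not the actual response table):

import numpy as np
from scipy.stats import wilcoxon

ratings = np.array([1, 2, 2, 3, 3, 3, 4, 4, 2, 3, 5, 6])  # hypothetical responses, 1-7 scale
diffs = ratings - 4          # differences from the hypothesized median of 4

# zero_method="wilcox" drops zero differences, matching the hand calculation;
# alternative="less" is the one-tailed test that the median response is below 4.
stat, p_value = wilcoxon(diffs, zero_method="wilcox", alternative="less")
print(stat, p_value)         # stat is W+, the sum of ranks of positive differences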
82Reporting the Results (continued)
- Real data from small-number testing comes from:
- The Principle of Inter-Ocular Trauma (the finding hits you between the eyes)
- The observer's ability to explain it
- (not from user comments or even user behavior)
83The section for Clinicians was mistaken by many participants to be general user information, rather than for those who manage clinical trials. Consider visual separation or labelling to ensure the audience for this information is understood (e.g., "Conducting Clinical Trials (Info for Clinicians)").
84Participants had difficulty understanding what content was searched. Many thought all content in Clinical Trials would be searched, not just ongoing trials.
A few participants wanted to use the global NCI search to search Clinical Trials (consider labelling this "Search NCI" or "NCI Search").
Some participants responded to the term "Find" even when the search form was on the page.
85Bold form labels draw users' eyes away from the form and reduce usability. Consider removing the bold and possibly bolding the content instead.
86Disciplines
Participants (without prior exposure) failed to recognize the five primary disciplines as navigational elements. The most common expectation (if noticed at all) was that the links would provide definitions of the terms.
87State Pages
The placement of this group suggests secondary
content, not primary content. Consider moving
this to the left side of the page.
88Reporting the Results (continued)
Cooper-Harper
SUS
89Reporting the Results (continued)
Memphis and DC results: range 0-100 (100 being the best)
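Since the deck reports scores on the 0-100 scale of the SUS (listed among the questionnaires on slide 72), here is a minimal sketch of the standard SUS scoring arithmetic, with made-up responses:

def sus_score(responses):
    # responses: the ten SUS item answers, each 1-5, in questionnaire order.
    assert len(responses) == 10 and all(1 <= r <= 5 for r in responses)
    odd = sum(r - 1 for r in responses[0::2])   # items 1,3,5,7,9 contribute (response - 1)
    even = sum(5 - r for r in responses[1::2])  # items 2,4,6,8,10 contribute (5 - response)
    return (odd + even) * 2.5                   # rescale the 0-40 raw score to 0-100

print(sus_score([4, 2, 4, 1, 5, 2, 4, 1, 4, 2]))  # made-up answers -> 82.5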
90Reporting the Results (continued)
Memphis and DC results: range 1-10 (10 being the best)