Title: An Introduction to Usability Testing
1An Introduction to Usability Testing
- Bill Killam, MA CHFP
- Adjunct Professor
- University of Maryland
- bkillam_at_user-centereddesign.com
2 3Definitions
- Usability testing is the common name for multiple forms of user-based system evaluation
- Popularized in the media by Jakob Nielsen in the 1990s and usually thought of as related to web site design
- Usability testing is one of the activities of a Human Factors Engineering (or user-centered design) methodology that is 50 years old
4What is being tested?
- Sid Smith's user-system interface (as opposed to user-computer interface)
- Systems are made up of specific users performing specific activities within a specific environment
- We can't redesign users, but we can design equipment, so our goal as designers is to design equipment to optimize system performance
5What does usability mean?
- ISO 9126
- A set of attributes that bear on the effort
needed for use, and on the individual assessment
of such use, by a stated or implied set of users
- ISO 9241 Definition
- Extent to which a product can be used by
specified users to achieve specified goals with
effectiveness, efficiency and satisfaction in a
specified context of use.
6What does usability mean?
- Jakob Nielsen
- Satisfaction
- Efficiency
- Learnability
- Low Errors
- Memorability
- Ben Shneiderman
- Ease of learning
- Speed of task completion
- Low error rate
- Retention of knowledge over time
- User satisfaction
7Usability Defined
- Accessibility
- A precursor to usability: if users cannot gain access to the product, its usability is a moot point
- Functional Suitability
- Does the product contain the functionality required by the user?
- Ease-of-learning
- Can users determine what functions the product performs?
- Can the user figure out how to exercise the functionality provided?
- Ease-of-use
- Can the user exercise the functionality accurately and efficiently once it's learned (includes accessibility issues)?
- Can users use it safely?
- Ease-of-recall
- Can the knowledge of operation be easily maintained over time?
- Subjective Preference
- Do users like using it?
8What determines a product's usability?
9perceptual issues (our brains deceive our senses)
15we have trouble with patterned data
22our perceptual abilities are limited in the
presence of noise
23THE QUICK BROWN FOX JUMPED OVER THE LAZY DOG'S BACK.
24THE QUICK BROWN FOX JUMPED OVER THE LAZY DOG'S BACK.
25The quick brown fox jumped over the lazy dog's back.
26our cognitive abilities and presumptions affect what we see
31- Jack and Jill went
- went up the
- Hill to fetch a
- a pail of milk
32- FINISHED FILES ARE THE RE-
- SULT OF YEARS OF SCIENTIF-
- IC STUDY COMBINED WITH THE
- EXPERIENCE OF MANY YEARS
33our cognitive abilities are limited
34- Let's play the game of 15. The pieces of the game are the numbers 1, 2, 3, 4, 5, 6, 7, 8, and 9. Each player takes a digit in turn. Once a digit is taken, the other player cannot use it. The first player to get three digits that sum to 15 wins.
- Here's a sample game: Player A takes 8. Player B takes 2. Then A takes 4, and B takes 3. A takes 5. What digit should B take? (A brute-force check of this position follows below.)
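The blocking move can be checked mechanically; a minimal sketch in Python, assuming only the position described above:

from itertools import combinations

# Brute-force check of the game-of-15 position above.
# A holds {8, 4, 5}, B holds {2, 3}; three digits summing to 15 win.
a_digits, b_digits = {8, 4, 5}, {2, 3}
remaining = set(range(1, 10)) - a_digits - b_digits

# Which remaining digit would let A complete a sum of 15 with two digits A already holds?
threats = {d for d in remaining
           if any(x + y + d == 15 for x, y in combinations(a_digits, 2))}
print(threats)  # {6} -- B must take 6 to block A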
39some mental processes have priority
41Red
Green
Blue
Orange
Yellow
Black
42Stroop
Stroop
Stroop
Stroop
Stroop
Stroop
43Orange
Yellow
Green
Black
Blue
Red
44our memory affects our abilities
45our psychology affects our abilities
46and we need protection from ourselves
48Usability Testing is
- any of a number of methodologies used to assess, based on our understanding of human capabilities and limitations, how a product's design contributes to or hinders its usability when used by the intended users to perform the intended tasks in the intended environment
49Usability Testing is not
- basic or even applied research
- a large-scale effort that can demonstrate internal and external reliability and validity
- an effort that requires statistical analysis to determine the significance of the results
50Usability Testing is not
- Market Research
- Qualitative and quantitative market research is an attempt to understand the actual or potential user population for a product or service
- Size of markets
- Reasons for purchasing/not purchasing
51Usability Testing is not
- User Acceptance Testing
- A process to ensure the functional specifications have been met
- It occurs after development, just before a product is shipped
- It tells you the functionality exists, regardless of whether it is usable
- It may capture some subjective preference data, but that data is not truly based on the product's usability
52When is usability best assessed?
- On an existing product, to determine if usability problems exist
- During the design phase of a product
- During the development phase of a product, to assess proposed changes
- Once a design is completed, to determine if goals were met
- Each of these represents a different purpose with a different approach, and they are often funded from different sources
53 54Types of Usability Testing
- Non-User Based
- Compliance Reviews
- Expert Reviews (Heuristic Evaluations)
- Cognitive Walkthroughs
- User-based
- Ethnographic Observation
- User Surveys/Questionnaires
- Think Aloud Protocols
- Performance-based Protocols
- Co-Discovery Protocols
55Non-User Based Testing
56Compliance Testing
- Possible (within limits) to be performed by anyone
- Can remove the low-level usability issues that often mask more significant usability issues
57Compliance Testing (concluded)
- Style Guide-based Testing
- Checklists
- Interpretation Issues
- Scope Limitations
- Available Standards
- Commercial GUI and Web Standards and Style Guides
- Domain-Specific GUI and Web Standards and Style Guides
- Internal Standards and Style Guides
- Interface Specification Testing
- May revert to user acceptance testing
58Expert Review
- Aka Heuristic Evaluation
- One or two usability experts review a product, application, etc.
- Free-format review or structured review based on heuristics
- Subjective, but based on sound usability and design principles
- Highly dependent on the qualifications of the reviewer(s)
59Expert Review (Concluded)
- Nielsen's 10 Most Common Mistakes Made by Web Developers (three versions)
- Shneiderman's 8 Golden Rules
- Constantine and Lockwood's Heuristics
- Forrester Group Heuristics
- Norman's 4 Principles of Usability
60Cognitive Walkthrough
- Team Approach
- Best if a diverse population of reviewers
- Issues related to cognition (understanding) more than presentation
- Also a low-cost form of usability testing
- Highly dependent on the qualifications of the
reviewer(s)
61User Based Testing
62User Surveys
- Standardized vs. Custom made
- Formats
- Closed ended
- Open ended
- Likert scale questions
- Cognitive Testing of the Survey itself
- User bias in evaluations
- Leniency Effect
- Central tendency
- Strictness Bias
- Strategic Responding
- The intent of the survey is so obvious that the participants attempt to respond in a way they believe will help
- Self-Selection versus Directed Surveys
63User Interviews
- Formats
- Closed ended
- Open ended
- Likert scale questions
- Strategic Responding
- The intent of the interview is so obvious that
the participants attempt to respond in a way they
believe will help
64Ethnographic Observation
- Field Study
- Natural Environment for the User
- Can be time-consuming and logistically prohibitive
65How to Design and Conduct a User-Based Test in a Lab
66Test Set-up
- What's the hypothesis?
- Required for research
- Required for usability testing?
- Define Your Variables
- Dependent and Independent Variables
- Confounding Variables
- Operationalize Your Variables (see the sketch below)
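To make the last item concrete, here is a minimal sketch (Python, with hypothetical field names) of how a vague dependent variable such as "ease of use" might be operationalized as measurable quantities per task, with the design variant as the independent variable:

from dataclasses import dataclass

@dataclass
class TaskMeasure:
    participant_id: str
    design_variant: str      # independent variable, e.g. "old" vs. "new"
    task_id: str
    completed: bool          # operationalized dependent variable: task success
    time_on_task_s: float    # operationalized dependent variable: efficiency
    error_count: int         # operationalized dependent variable: accuracy

# One made-up observation from a pilot session
print(TaskMeasure("P01", "new", "find_clinical_trial", True, 142.0, 1))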
67Participant Issues
- User-types
- Users versus user surrogates
- All profiles or specific user profiles/personas?
- Critical segments?
- How many?
- Relationship to statistical significance
- Discount Usability: whose rule? (see the problem-discovery sketch below)
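The "discount usability" argument about how many participants are needed is usually framed with the problem-discovery model attributed to Nielsen and Landauer; a minimal sketch, assuming the commonly quoted average detection probability p = 0.31 (the actual value varies by study and product):

def proportion_found(n_users: int, p: float = 0.31) -> float:
    # Proportion of usability problems expected to be found by n users,
    # assuming each user independently exposes a given problem with probability p.
    return 1 - (1 - p) ** n_users

for n in (1, 3, 5, 8, 15):
    print(n, round(proportion_found(n), 2))
# With p = 0.31, five users find roughly 84% of the findable problems.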
68Participant Issues (concluded)
- Selecting subjects
- Screeners
- Getting Subjects
- Convenience Sampling
- Recruiting
- Participant stipends
- Over recruiting
- Scheduling
69Test Set-up (concluded)
- Select a Protocol
- Between Subject Designs
- Within Subject Designs
- Selecting a protocol
- Non-interrupted think-aloud (with or without a critical incident analysis)
- Interrupted think-aloud
- Non-interrupted performance-based (with or without a critical incident analysis)
- Interrupted performance-based
- Co-discovery
70Defining Task Scenarios
- Areas of concern, redesign, or of interest
- Short, unambiguous tasks to be performed
- Interrelationship
- Wording is critical
- In the users' own terms
- Does not contain seeds to the correct solution
- Enough to form a complete test, but able to stay within the time limit
- Flexibility is key
- Variations ARE allowed
71Defining Task Scenarios (concluded)
- Scenarios are contrived for testing, may not be representative of real-world usage patterns, and are NOT required!
72Preparing Test Materials
- Consent form!
- Video release form!
- Receipt and confidentiality agreement!
- Demographic survey
- Facilitator's Guide
- Introductory comments
- Participant task descriptions
- Questionnaires, SUS, Cooper-Harper, etc.
- Note Taker's Forms
73Piloting the Design
- Getting subjects
- Convenience sampling
- Almost anyone will do
- Collect data
- Check for timing
74Conduct the Evaluation and Collect Data
- Collecting interaction data
- The data is NOT in the interface, the data is in the user!
- Collecting observed data
- Behavior
- Reactions
- Collecting participant comments
- Collecting subjective data
- Pre-test data
- Post-scenario data
- Post-test data
75Reporting the Results
- Traditional (parametric) descriptive and predictive statistics are meaningless
- These statistics require a normal distribution of the results to be valid
- The confidence intervals are too wide with small samples to draw a meaningful conclusion
- Example: Assume you test 8 people and 7 of them complete the task. The confidence interval is roughly 52 percentage points wide, so you can only claim that you believe between 47% and 99% of all users will likely succeed
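The 47% to 99% range appears consistent with an exact (Clopper-Pearson) binomial interval, although the deck does not name the method; a minimal sketch of that calculation:

from scipy.stats import beta

def clopper_pearson(successes: int, n: int, alpha: float = 0.05):
    # Exact two-sided binomial confidence interval for a completion rate.
    lower = beta.ppf(alpha / 2, successes, n - successes + 1) if successes > 0 else 0.0
    upper = beta.ppf(1 - alpha / 2, successes + 1, n - successes) if successes < n else 1.0
    return lower, upper

lo, hi = clopper_pearson(7, 8)
print(f"{lo:.1%} to {hi:.1%}")  # about 47.3% to 99.7%, an interval roughly 52 points wide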
76Reporting the Results (continued)
- Non-parametric statistics CAN be used
- Mann-Whitney U test (rank-sum test), Wilcoxon signed-rank test
- Is it worth it?
77Wilcoxon Signed-Rank Test
- Take a poll of the participants comparing the new product against the old version of the product. People might be asked to comment on the statement "The new design is an improvement over the old design" and given a choice of answers from "Definitely, it's the tops" to "No, definitely not, it's awful." The data would be a collection of opinions. Making up some data for purposes of demonstration, assume the following scale and results
78Wilcoxon Signed-Rank Test (continued)
- The hypothesis you would want to test would be: "The participants consider the new product an improvement."
- Always be cautious with rating scales, as people are not like rulers, watches, or thermometers. The difference between opinions cannot be neatly measured in the same way as the differences between two lengths, times, or temperatures. Is the difference between "Yes, it's a good asset" (2) and "Yes, it is an asset, it's OK" (3) the same amount of difference of opinion as between "I have no opinion" (4) and "Not really an asset" (5)? Is it sensible to measure the difference between two opinions numerically? Is the opinion "Yes, it is an asset, it's OK," rated as 3, three times weaker than the opinion rated as 1, "Definitely an asset, it's the tops"? There is nevertheless an order to the different opinions. Think of other situations where there is an order, but doing arithmetic with the numbering does not make sense.
- A quick eyeball test shows that none of those questioned thought it was awful and only one person thought it not very good, so a first impression is that people generally approve. If you start by assuming that in the population there is no opinion one way or the other, and that people's responses are symmetrically distributed about "no opinion," you can test the hypothesis that people think the shopping centre is an asset, with the null hypothesis that people have no opinion about it, their response being the median value 4.
79Wilcoxon Signed-Rank Test (continued)
- You need to be careful to choose the appropriate test statistic for the problem you are tackling.
- For a one-tailed test, where the alternative hypothesis is that the median is greater than a given value, the test statistic is W−. For a one-tailed test, where the alternative hypothesis is that the median is less than a given value, the test statistic is W+.
- For a two-tailed test, the test statistic is the smaller of W+ and W−.
- As people who think it an improvement will give a rating of less than 4, the null and alternative hypotheses can be stated as follows.
- H0: the median response is 4
- H1: the median response is less than 4
- One-tailed test, significance level 5%
80Wilcoxon Signed-Rank Test (continued)
- List the values
- Find the difference between each value and the median.
- Ignore the zeros and rank the absolute values of the remaining scores.
- Ignore the signs, start with the smallest difference and give it rank 1. Where two or more differences have the same value, find their mean rank and use this.
- Now check that W+ + W− equals ½n(n+1), where n is the number in the sample (having ignored the zeros). In this case n = 10.
- ½n(n+1) = ½ × 10 × 11 = 55
- W+ + W− = 9.5 + 45.5 = 55
81Wilcoxon Signed-Rank Test (continued)
- Compare the test statistic with the critical value in the tables. If the null hypothesis were true and the median is 4, you would expect W+ and W− to have roughly the same value. There are two possible test statistics here, W+ = 9.5 and W− = 45.5, and you have to decide which one to use. We are interested in W+, the sum of the ranks of ratings greater than 4. W+ is much less than W−, which suggests that more people felt the shopping centre was an asset. It could also suggest that those who expressed a negative view expressed a very strong one, with lots of high numbers in the ratings.
- Now you need to compare the value of W+, the test statistic, with the critical value from the table. Given that W+ is small, the key question becomes: is W+ significantly smaller than would happen by chance? The table helps you decide this by supplying the critical value. For a sample of 10, at the 5% significance level for a one-tailed test, the value is 10. As W+ is 9.5, which is less than this, the evidence suggests that we can reject the null hypothesis. Your conclusion is that the evidence shows, at the 5% significance level, that the public thinks the new shopping centre is an asset to the town.
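For reference, the same one-tailed, one-sample test can be run in software; a minimal sketch using scipy.stats.wilcoxon with made-up ratings on the 1-7 scale described above (not the actual response table):

import numpy as np
from scipy.stats import wilcoxon

ratings = np.array([1, 2, 2, 3, 3, 3, 4, 4, 2, 3, 5, 6])  # hypothetical responses, 1-7 scale
diffs = ratings - 4          # differences from the hypothesized median of 4

# zero_method="wilcox" drops zero differences, matching the hand calculation;
# alternative="less" is the one-tailed test that the median response is below 4.
stat, p_value = wilcoxon(diffs, zero_method="wilcox", alternative="less")
print(stat, p_value)         # stat is W+, the sum of ranks of positive differences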
82Reporting the Results (continued)
- Real data from small-number testing comes from:
- The Principle of Inter-Ocular Trauma (the finding hits you between the eyes)
- The observer's ability to explain it
- (not from user comments or even user behavior)
83The section for Clinicians was mistaken by many participants to be general user information, rather than for those who manage clinical trials. Consider visual separation or labelling to ensure the audience for this information is understood (e.g., "Conducting Clinical Trials (Info for Clinicians)").
84Participants had difficulty understanding what content was searched. Many thought all content in Clinical Trials would be searched, not just ongoing trials.
A few participants wanted to use the global NCI search to search Clinical Trials (consider labelling this "Search NCI" or "NCI Search").
Some participants responded to the term "Find" even when the search form was on the page.
85Bold form labels draw users' eyes away from the form and reduce usability. Consider removing the bold and possibly bolding the content instead.
86Disciplines
Participants (without prior exposure) failed to recognize the five primary disciplines as navigational elements. The most common expectation (if noticed at all) was that the links would provide definitions of the terms.
87State Pages
The placement of this group suggests secondary
content, not primary content. Consider moving
this to the left side of the page.
88Reporting the Results (continued)
Cooper-Harper
SUS
89Reporting the Results (continued)
Memphis and DC results: range 0-100 (100 being the best)
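Since the deck reports scores on the 0-100 scale of the SUS (listed among the questionnaires on slide 72), here is a minimal sketch of the standard SUS scoring arithmetic, with made-up responses:

def sus_score(responses):
    # responses: the ten SUS item answers, each 1-5, in questionnaire order.
    assert len(responses) == 10 and all(1 <= r <= 5 for r in responses)
    odd = sum(r - 1 for r in responses[0::2])   # items 1,3,5,7,9 contribute (response - 1)
    even = sum(5 - r for r in responses[1::2])  # items 2,4,6,8,10 contribute (5 - response)
    return (odd + even) * 2.5                   # rescale the 0-40 raw score to 0-100

print(sus_score([4, 2, 4, 1, 5, 2, 4, 1, 4, 2]))  # made-up answers -> 82.5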
90Reporting the Results (continued)
Memphis and DC results: range 1-10 (10 being the best)