An Introduction to Usability Testing

1
An Introduction to Usability Testing
  • Bill Killam, MA CHFP
  • Adjunct Professor
  • University of Maryland
  • bkillam@user-centereddesign.com

2
  • The Background

3
Definitions
  • Usability testing is the common name for
    multiple forms of user-based system evaluation
  • Popularized in the media by Jakob Nielsen in the
    1990s and usually thought of as related to web
    site design
  • Usability testing is one of the activities of a
    Human Factors Engineering (or user-centered
    design) methodology that is 50 years old

4
What is being tested?
  • Sid Smith's user-system interface (as opposed to
    user-computer interface)
  • Systems are made up of specific users performing
    specific activities within a specific environment
  • We can't redesign users, but we can design
    equipment, so our goal as designers is to design
    equipment to optimize system performance

5
What does usability mean?
  • ISO 9126
  • "A set of attributes that bear on the effort
    needed for use, and on the individual assessment
    of such use, by a stated or implied set of users"
  • ISO 9241 Definition
  • "Extent to which a product can be used by
    specified users to achieve specified goals with
    effectiveness, efficiency and satisfaction in a
    specified context of use."

6
What does usability mean?
  • Jakob Nielsen
  • Satisfaction
  • Efficiency
  • Learnability
  • Low Errors
  • Memorability
  • Ben Shneiderman
  • Ease of learning
  • Speed of task completion
  • Low error rate
  • Retention of knowledge over time
  • User satisfaction

7
Usability Defined
  • Accessibility
  • A precursor to usability: if users cannot gain
    access to the product, its usability is a moot
    point
  • Functional Suitability
  • Does the product contain the functionality
    required by the user?
  • Ease-of-learning
  • Can users determine what functions the product
    performs?
  • Can the user figure out how to exercise the
    functionality provided?
  • Ease-of-use
  • Can the user exercise the functionality
    accurately and efficiently once it's learned
    (includes accessibility issues)?
  • Can users use it safely?
  • Ease-of-recall
  • Can the knowledge of operation be easily
    maintained over time?
  • Subjective Preference
  • Do users like using it?

8
What determines a product's usability?
9
perceptual issues (our brains deceive our senses)
10

11-14
(image slides, no transcript)
15
we have trouble with patterned data
16-21
(image slides, no transcript)
22
our perceptual abilities are limited in the
presence of noise
23
THE QUICK BROWN FOX JUMPED OVER THE LAZY DOG'S
BACK.
24
THE QUICK BROWN FOX JUMPED OVER THE LAZY DOG'S
BACK.
25
The quick brown fox jumped over the lazy dog's
back.
26
our cognitive abilities and presumptions affect
what we see
27-30
(image slides, no transcript)
31
  • Jack and Jill went
  • went up the
  • Hill to fetch a
  • a pail of milk

32
  • FINISHED FILES ARE THE RE-
  • SULT OF YEARS OF SCIENTIF-
  • IC STUDY COMBINED WITH THE
  • EXPERIENCE OF MANY YEARS
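
The slide above is a classic counting demo: asked to count the F's, most
people report three, because cognition tends to process the word "OF"
phonetically (as a "v" sound) and skips it. A quick sketch confirms the
actual count:

```python
# Classic demo: most readers count only three F's, because the brain
# tends to process "OF" phonetically and overlooks its F.
sentence = ("FINISHED FILES ARE THE RESULT OF YEARS OF SCIENTIFIC "
            "STUDY COMBINED WITH THE EXPERIENCE OF MANY YEARS")
print(sentence.count("F"))  # → 6, not the 3 most people report
```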

33
our cognitive abilities are limited
34
  • Let's play the game of 15. The pieces of the
    game are the numbers 1, 2, 3, 4, 5, 6, 7, 8, and
    9. Each player takes a digit in turn. Once a
    digit is taken, the other player cannot use it.
    The first player to get three digits that sum to
    15 wins.
  • Here's a sample game: Player A takes 8. Player B
    takes 2. Then A takes 4, and B takes 3. A takes
    5. What digit should B take?
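
A hint at why this game feels hard: it is tic-tac-toe in disguise (place
the digits 1-9 in a 3×3 magic square and three-in-a-row is exactly a sum
of 15), but the numeric framing defeats our pattern recognition. A small
brute-force sketch of B's decision (the helper name `winning_digit` is
mine, not from the slides):

```python
from itertools import combinations

def winning_digit(hand, available):
    """Return a digit from `available` that gives `hand` three digits summing to 15."""
    for d in sorted(available):
        if any(a + b + d == 15 for a, b in combinations(hand, 2)):
            return d
    return None

a_hand, b_hand = {8, 4, 5}, {2, 3}
free = set(range(1, 10)) - a_hand - b_hand

print(winning_digit(b_hand, free))  # None: B cannot win outright
print(winning_digit(a_hand, free))  # 6: A threatens 4 + 5 + 6, so B must take 6
```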

35-38
(image slides, no transcript)
39
some mental processes have priority
40
(image slide, no transcript)
41
Red
Green
Blue
Orange
Yellow
Black
42
Stroop
Stroop
Stroop
Stroop
Stroop
Stroop
43
Orange
Yellow
Green
Black
Blue
Red
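
The color lists on slides 41-43 were originally printed in mismatched ink
colors, which this transcript cannot show. A minimal re-creation of the
Stroop demo for a terminal, assuming ANSI color support:

```python
# Re-create the Stroop demo: print each color name in a deliberately
# mismatched ink color using ANSI escape codes (SGR color parameters).
COLORS = {"Red": 31, "Green": 32, "Blue": 34, "Orange": 33, "Yellow": 93, "Black": 90}

names = list(COLORS)
for i, word in enumerate(names):
    ink = COLORS[names[(i + 1) % len(names)]]  # shift by one so word never matches ink
    print(f"\033[{ink}m{word}\033[0m")
```

The task is to name the ink color aloud, not to read the word; reading
wins the race, which is the "some mental processes have priority" point.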
44
our memory affects our abilities
45
our psychology affects our abilities
46
and we need protection from ourselves
47
(image slide, no transcript)
48
Usability Testing is
  • any of a number of methodologies used to assess,
    based on our understanding of human capabilities
    and limitations, how a product's design
    contributes to or hinders its usability when used
    by the intended users to perform the intended
    tasks in the intended environment

49
Usability Testing is not
  • basic or even applied research
  • a large-scale effort that can demonstrate
    internal and external reliability and validity
  • an activity that requires statistical analysis to
    determine the significance of the results

50
Usability Testing is not
  • Market Research
  • Qualitative and quantitative market research is
    an attempt to understand the actual or potential
    user population for a product or service
  • Size of markets
  • Reasons for purchasing/not purchasing

51
Usability Testing is not
  • User Acceptance Testing
  • A process to ensure the functional specifications
    have been met
  • It occurs after development, just before a
    product is shipped
  • It tells you the functionality exists,
    regardless of whether it is usable
  • It may capture some subjective preference data,
    but it is not truly based on the product's
    usability

52
When is usability best assessed?
  • On an existing product to determine if usability
    problems exist
  • During the design phase of a product
  • During the development phase of a product to
    assess proposed changes
  • Once a design is completed to determine if goals
    were met
  • Each of these represents a different purpose with
    a different approach, and they are often funded
    from different sources

53
  • The Methods

54
Types of Usability Testing
  • Non-User Based
  • Compliance Reviews
  • Expert Reviews (Heuristic Evaluations)
  • Cognitive Walkthroughs
  • User-based
  • Ethnographic Observation
  • User Surveys/Questionnaires
  • Think Aloud Protocols
  • Performance-based Protocols
  • Co-Discover Protocols

55
Non-User Based Testing
56
Compliance Testing
  • Possible (within limits) to be performed by
    anyone
  • Can remove the low level usability issues that
    often mask more significant usability issues

57
Compliance Testing (concluded)
  • Style Guide-based Testing
  • Checklists
  • Interpretation Issues
  • Scope Limitations
  • Available Standards
  • Commercial GUI / Web Standards and Style Guides
  • Domain-Specific GUI / Web Standards and Style
    Guides
  • Internal Standards and Style Guides
  • Interface Specification Testing
  • May revert to user acceptance testing

58
Expert Review
  • Aka Heuristic Evaluation
  • One or two usability experts review a product,
    application, etc.
  • Free format review or structured review based on
    heuristics
  • Subjective but based on sound usability and
    design principles
  • Highly dependent on the qualifications of the
    reviewer(s)

59
Expert Review (Concluded)
  • Nielsen's 10 Most Common Mistakes Made by Web
    Developers (three versions)
  • Shneiderman's 8 Golden Rules
  • Constantine & Lockwood's Heuristics
  • Forrester Group Heuristics
  • Norman's 4 Principles of Usability

60
Cognitive Walkthrough
  • Team Approach
  • Best with a diverse population of reviewers
  • Issues related to cognition (understanding) more
    than presentation
  • Also a low-cost form of usability testing
  • Highly dependent on the qualifications of the
    reviewer(s)

61
User Based Testing
62
User Surveys
  • Standardized vs. custom-made
  • Formats
  • Closed ended
  • Open ended
  • Likert scale questions
  • Cognitive Testing of the Survey itself
  • User bias in evaluations
  • Leniency Effect
  • Central tendency
  • Strictness Bias
  • Strategic Responding
  • the intent of the survey is so obvious that the
    participants attempt to respond in a way they
    believe will help
  • Self Selection versus Directed Surveys

63
User Interviews
  • Formats
  • Closed ended
  • Open ended
  • Likert scale questions
  • Strategic Responding
  • The intent of the interview is so obvious that
    the participants attempt to respond in a way they
    believe will help

64
Ethnographic Observation
  • Field Study
  • Natural Environment for the User
  • Can be time-consuming and logistically
    prohibitive

65
How to Design & Conduct a User-based Test in a
Lab
66
Test Set-up
  • What's the hypothesis?
  • Required for research
  • Required for usability testing?
  • Define Your Variables
  • Dependent and Independent Variables
  • Confounding Variables
  • Operationalize Your Variables

67
Participant Issues
  • User-types
  • Users versus user surrogates
  • All profiles or specific user profiles/personas?
  • Critical segments?
  • How many?
  • Relationship to statistical significance
  • Discount Usability: whose rule?
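
The "how many?" question is usually answered with Nielsen and Landauer's
problem-discovery model: the expected share of problems found by n users
is 1 - (1 - p)^n, where p is the chance that one participant encounters a
given problem (about 0.31 on average in their data). A sketch:

```python
def problems_found(n_users, p=0.31):
    """Expected share of usability problems uncovered by n_users independent
    participants, where p is the probability that a single participant hits
    a given problem (0.31 is the average Nielsen and Landauer report)."""
    return 1 - (1 - p) ** n_users

for n in (1, 3, 5, 15):
    print(f"{n:2d} users: {problems_found(n):.0%}")
```

With p = 0.31, five users already uncover roughly 84% of the problems,
which is the usual argument for small discount-usability samples.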

68
Participant Issues (concluded)
  • Selecting subjects
  • Screeners
  • Getting Subjects
  • Convenience Sampling
  • Recruiting
  • Participant stipends
  • Over recruiting
  • Scheduling

69
Test Set-up (concluded)
  • Select a Protocol
  • Between Subject Designs
  • Within Subject Designs
  • Selecting a protocol
  • Non-interrupted think aloud (with or without a
    critical incident analysis)
  • Interrupted think aloud
  • Non-interrupted performance-based (with or
    without a critical incident analysis)
  • Interrupted performance-based
  • Co-discovery

70
Defining Task Scenarios
  • Areas of concern, redesign, or of interest
  • Short, unambiguous tasks to be performed
  • Interrelationship
  • Wording is critical
  • In the user's own terms
  • Does not contain seeds to the correct solution
  • Enough to form a complete test but able to stay
    within the time limit
  • Flexibility is key
  • Variations ARE allowed

71
Defining Task Scenarios (concluded)
  • Scenarios are contrived for testing, may not be
    representative of real world usage patterns, and
    are NOT required!

72
Preparing Test Materials
  • Consent form!
  • Video release form!
  • Receipt and confidentiality agreement!
  • Demographic survey
  • Facilitator's Guide
  • Introductory comments
  • Participant task descriptions
  • Questionnaires, SUS, Cooper-Harper, etc.
  • Note Taker's Forms

73
Piloting the Design
  • Getting subjects
  • Convenience sampling
  • Almost anyone will do
  • Collect data
  • Check for timing

74
Conduct the Evaluation and Collect Data
  • Collecting interaction data
  • The data is NOT in the interface, the data is in
    the user!
  • Collecting observed data
  • Behavior
  • Reactions
  • Collecting participant comments
  • Collecting subjective data
  • Pre-test data
  • Post scenario data
  • Post test data

75
Reporting the Results
  • Traditional (Parametric) Descriptive and
    Predictive Statistics are meaningless
  • These statistics require a normal distribution of
    the results to be valid
  • The confidence intervals are too great with small
    samples to draw a meaningful conclusion
  • Example: Assume you test 8 people and 7 of them
    complete a task. Your confidence interval is
    52 percentage points wide. So, you can only claim
    that you believe that between 47% and 99% of all
    users will likely succeed
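
The 47-99 figure above is consistent with an exact (Clopper-Pearson)
binomial interval for 7 successes out of 8. A self-contained sketch that
recovers it by bisection (function names are mine, for illustration):

```python
from math import comb

def binom_ge(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def binom_le(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided confidence interval for a binomial proportion."""
    lower, upper = 0.0, 1.0
    if k > 0:  # lower bound: the p where P(X >= k) = alpha/2
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            lo, hi = (mid, hi) if binom_ge(k, n, mid) < alpha / 2 else (lo, mid)
        lower = lo
    if k < n:  # upper bound: the p where P(X <= k) = alpha/2
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            lo, hi = (mid, hi) if binom_le(k, n, mid) > alpha / 2 else (lo, mid)
        upper = lo
    return lower, upper

lo, hi = clopper_pearson(7, 8)
print(f"7 of 8 succeeded: 95% CI is {lo:.1%} to {hi:.1%}")  # ≈ 47.3% to 99.7%
```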

76
Reporting the Results (continued)
  • Non-Parametric Statistics CAN be used
  • Mann-Whitney U Test (Rank Sum), Wilcoxon
    Signed-Rank Test
  • Is it worth it?

77
Wilcoxon Signed-Rank Test
  • Take a poll of the participants comparing the
    new product against the old version of the
    product. People might be asked to comment on the
    statement "The new design is an improvement over
    the old design." and given a choice of answers
    from "Definitely, it's the tops." to "No,
    definitely not, it's awful." The data would be a
    collection of opinions. Making up some data for
    purposes of demonstration, assume the following
    scale and results

78
Wilcoxon Signed-Rank Test (continued)
  • The hypothesis you would want to test would be
    "The participants consider the new product an
    improvement."
  • Always be cautious with rating scales, as people
    are not like rulers, watches, or thermometers.
    The difference between opinions cannot be neatly
    measured in the same way as the differences
    between two lengths, times or temperatures. Is
    the difference between "Yes, it's a good asset"
    (2) and "Yes it is an asset, it's OK" (3) the
    same amount of difference of opinion as between
    "I have no opinion" (4) and "Not really an
    asset." (5)? Is it sensible to measure the
    difference between two opinions numerically? Is
    the opinion "Yes it is an asset, it's OK", rated
    as 3, three times weaker than the opinion rated
    as 1, "Definitely an asset, it's the tops."?
    There is nevertheless an order to the different
    opinions. Think of other situations where there
    is an order, but doing arithmetic with the
    numbering does not make sense.
  • A quick eyeball test shows that none of those
    questioned thought it was awful and only one
    person thought it not very good, so a first
    impression is that people generally approve. If
    you start by assuming that in the population
    there is no opinion one way or the other, and
    that people's responses are symmetrically
    distributed about "no opinion", you can test the
    hypothesis that people think the shopping centre
    is an asset, with the null hypothesis that people
    have no opinion about it, their response being
    the median value 4.

79
Wilcoxon Signed-Rank Test (continued)
  • You need to be careful to choose the appropriate
    test statistic for the problem you are tackling.
  • For a one-tailed test, where the alternative
    hypothesis is that the median is greater than a
    given value, the test statistic is W−. For a one-
    tailed test, where the alternative hypothesis is
    that the median is less than a given value, the
    test statistic is W+.
  • For a two-tailed test the test statistic is the
    smaller of W+ and W−
  • As people who think it an improvement will give a
    rating of less than 4, the null and alternative
    hypotheses can be stated as follows.
  • H0: the median response is 4
  • H1: the median response is less than 4
  • 1-tailed test, significance level 5%

80
Wilcoxon Signed-Rank Test (continued)
  • List the values
  • Find the difference between each value and the
    median.
  • Ignore the zeros and rank the absolute values of
    the remaining scores.
  • Ignore the signs, start with the smallest
    difference and give it rank 1. Where two or more
    differences have the same value, find their mean
    rank, and use this.
  • Now check that W+ + W− is the same as ½n(n+1),
    where n is the number in the sample (having
    ignored the zeros). In this case n = 10.
  • ½n(n+1) = ½ × 10 × 11 = 55
  • W+ + W− = 9.5 + 45.5 = 55
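
The ranking steps above can be sketched in code. The slide's data table
did not survive transcription, so the ratings below are invented for
illustration (they give W+ = 10.5, not the slide's 9.5):

```python
def signed_rank(values, m0):
    """Wilcoxon signed-rank statistics (W+, W-) against hypothesized median m0."""
    diffs = [v - m0 for v in values if v != m0]            # drop zero differences
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):                                  # mean ranks for ties
        j = i
        while j < len(order) and abs(diffs[order[j]]) == abs(diffs[order[i]]):
            j += 1
        for k in order[i:j]:
            ranks[k] = (i + j + 1) / 2                     # average of ranks i+1 .. j
        i = j
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus

ratings = [2, 3, 3, 1, 2, 5, 3, 2, 6, 3]   # hypothetical 1-7 responses, H0 median = 4
wp, wm = signed_rank(ratings, 4)
n = sum(1 for v in ratings if v != 4)
print(wp, wm, wp + wm == n * (n + 1) / 2)  # → 10.5 44.5 True
```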

81
Wilcoxon Signed-Rank Test (continued)
  • Compare the test statistic with the critical
    value in the tables. If the null hypothesis were
    true, and the median is 4, you would expect W+
    and W− to have roughly the same value. There are
    two possible test statistics here, W+ = 9.5 and
    W− = 45.5, and you have to decide which one to
    use. We are interested in W+, the sum of the
    ranks of ratings greater than 4. W+ is much less
    than W−, which suggests that more people felt the
    shopping centre was an asset. It could also
    suggest that those who expressed a negative view
    expressed a very strong one, with lots of high
    numbers in the ratings.
  • Now you need to compare the value of W+, the
    test statistic, with the critical value from the
    table. Given that W+ is small, the key question
    becomes "Is W+ significantly smaller than would
    happen by chance?" The table helps you decide
    this by supplying the critical value. For a
    sample of 10, at the 5% significance level for a
    1-tailed test, the value is 10. As W+ is 9.5,
    which is less than this, the evidence suggests
    that we can reject the null hypothesis. Your
    conclusion is that the evidence shows, at the 5%
    significance level, that the public thinks the
    new shopping centre is an asset to the town.

82
Reporting the Results (continued)
  • Real data from small-number testing comes from
  • The Principle of Inter-ocular Trauma (it hits
    you between the eyes)
  • The observer's ability to explain it
  • (not from user comments or even user behavior)

83
The section for Clinicians was mistaken by many
participants to be general user information,
rather than for those who manage Clinical Trials.
Consider visual separation or labelling to ensure
the audience for this information is
understood (e.g., "Conducting Clinical Trials
(Info for Clinicians)")
84
Participants had difficulty understanding what
content was searched. Many thought all content in
Clinical Trials would be searched, not just
ongoing trials.
A few participants wanted to use the global NCI
search to search Clinical Trials (consider
labelling this "Search NCI" or "NCI Search").
Some participants responded to the term "Find"
even when the search form was on the page.
85
Bold form labels draw users' eyes away from the
form and reduce usability. Consider removing the
bold and possibly bolding the content instead.
86
Disciplines
Participants (without prior exposure) failed to
recognize the five primary disciplines as
navigational elements. The most common
expectation (if noticed at all) was that the
links would provide definitions of the terms.
87
State Pages
The placement of this group suggests secondary
content, not primary content. Consider moving
this to the left side of the page.
88
Reporting the Results (continued)
Cooper-Harper
SUS
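
The 0-100 scores reported on the next slide are presumably SUS results.
Standard SUS scoring for one respondent looks like this (the example
responses are invented):

```python
def sus_score(responses):
    """Score one SUS questionnaire: ten Likert responses from 1 to 5.
    Odd-numbered items are positively worded (contribute response - 1);
    even-numbered items are negatively worded (contribute 5 - response).
    The sum of contributions (0-40) is scaled by 2.5 to give 0-100."""
    assert len(responses) == 10 and all(1 <= r <= 5 for r in responses)
    total = sum((r - 1) if i % 2 == 0 else (5 - r)
                for i, r in enumerate(responses))
    return total * 2.5

print(sus_score([4, 2, 5, 1, 4, 2, 5, 1, 4, 2]))  # → 85.0
```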
89
Reporting the Results (continued)
(Chart: Memphis vs. DC scores; range 0-100, 100
being the best)
90
Reporting the Results (continued)
(Chart: Memphis vs. DC scores; range 1-10, 10
being the best)