1
CS 544 Evaluation
Acknowledgement: Some of the material in these
lectures is based on material prepared for
similar courses by Saul Greenberg (University of
Calgary), Ravin Balakrishnan (University of
Toronto), James Landay (University of California
at Berkeley), monica schraefel (University of
Toronto), and Colin Ware (University of New
Hampshire). Used with the permission of the
respective original authors.
2
Goals
Methods
Products
3
Why bother?
(Diagram: the iterative Design / Implement / Evaluate cycle)
  • Tied to the usability engineering lifecycle
  • Pre-design
  • investing in new expensive system requires proof
    of viability
  • Initial design stages
  • develop and evaluate initial design ideas with
    the user
  • Iterative design
  • does the system behaviour match the user's task
    requirements?
  • are there specific problems with the design?
  • can users provide feedback to modify design?
  • Acceptance testing
  • verify that human/computer system meets expected
    performance criteria
  • ease of learning, usability, users' attitudes,
    performance criteria
  • e.g., a first-time user will take 1-3 minutes to
    learn how to withdraw $50 from the automatic
    teller

4
The right method? No such thing!
  • Methods enable but also limit evidence
  • All methods are valuable, but all have weaknesses
    or limitations
  • You can offset the different weaknesses of
    various methods by using multiple methods
  • You can choose such multiple methods so that they
    have patterned diversity, i.e., so the strengths
    of some methods offset the weaknesses of others

5
Taxonomy of Research Strategies
6
Maximization of 3 desirable features
  1. Generalizability of the evidence over the
    populations of Actors
  2. Precision of measurement of the behaviours that
    are being studied (and precision of control over
    extraneous factors that are not being studied)
  3. Realism of the situation or context within which
    the evidence is gathered, in relation to the
    contexts to which you want your evidence to apply

Although you always want to maximize all three of
these criteria simultaneously, you
cannot do so. This is the fundamental dilemma of
the research process. Therefore, each study must
be interpreted in relation to other evidence
bearing on the same questions.
7
Quadrant I: Field Strategies
  • Field Study
  • direct observations of natural, ongoing systems
  • minimal intrusion/disturbance of systems
  • e.g., cultural anthropology, case studies
  • Field Experiment
  • within an ongoing natural system
  • some intrusion: one or more features of the
    system are manipulated
  • e.g., the Hawthorne studies (varying lighting in
    an organization)

8
Quadrant II: Experimental Strategies
  • Lab experiment
  • concocted situation, rules of operation,
    individuals or groups engage in behaviours
    specified by rules
  • extraneous factors eliminated (which may or may
    not be relevant)
  • considerable precision
  • more obtrusive, reduced realism, less
    generalizable
  • e.g., unnatural task in a lab setting (target
    acquisition)
  • Lab simulation
  • to gain some realism, the concocted situation is
    made to seem natural
  • e.g., giving a natural task in a lab setting

9
Quadrant III: Respondent Strategies
  • Sample Survey
  • evidence obtained to estimate the distribution of
    some variables, or relationships among them,
    within a specified population
  • careful sampling from that population
  • e.g., public opinion surveys
  • Judgment Study
  • obtain information about the properties of a
    certain set of stimulus materials
  • focus is set of properties of stimulus materials,
    rather than attributes of the respondents
  • e.g., psychophysics studies (systematic relations
    between properties of the physical stimulus world
    and the psychological perception of those stimuli)

10
Quadrant IV: Theoretical Strategies
  • Formal Theory
  • does not involve the gathering of any empirical
    observations
  • general relations among a number of variables of
    interest
  • based on earlier empirical evidence
  • e.g., model human processor, Fitts' Law
  • Computer Simulation
  • complete and closed system that models the
    operation of the concrete system without any
    behaviour by any system participants
  • e.g., physics simulator

11
Comparison Techniques
  • Baserates
  • must know how often Y occurs in the general case,
    to know whether Y in some particular case is (or
    is not) notable
  • e.g., users can set up a network connection in
    less than 5 minutes in WinXP (is this an
    improvement?)
  • Correlation
  • how the values of property X vary in relation to
    the values in property Y
  • not necessarily causal
  • e.g., number of files and time spent in Windows
    Explorer (see the sketch below)
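
A minimal sketch (assuming Python; the file counts and
times below are invented for illustration, not course
data) of computing such a correlation:

    import math

    # Hypothetical observations: files per user, and minutes
    # that user spends in Windows Explorer per day.
    files = [120, 450, 80, 900, 300, 650]
    minutes = [10, 25, 8, 41, 18, 30]

    def pearson_r(xs, ys):
        """Pearson correlation between two equal-length samples."""
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
        sy = math.sqrt(sum((y - my) ** 2 for y in ys))
        return cov / (sx * sy)

    print(f"r = {pearson_r(files, minutes):.3f}")
    # A high r only shows co-variation; it does not establish causation.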

12
Basic Experimental Design
  • Independent variables
  • Factors that are manipulated in the experiment
    (e.g., W, A in Fitts' Law)
  • Dependent variables
  • Factors that may depend on the independent
    variables (e.g., performance time)
  • Wide range of independent variables
  • E.g., a Fitts' law experiment (see the sketch
    below)
  • Ws range from character size (10 pixels) to icon
    size (40 pixels)
  • Ds range from short (50 pixels) to long (800
    pixels, the screen size)
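
A minimal sketch (an assumed setup, not the course's
actual experiment code; the movement times are invented)
of crossing W and D into conditions and fitting Fitts'
law, MT = a + b * ID with ID = log2(D/W + 1):

    import itertools
    import math

    widths = [10, 20, 40]        # target widths W (pixels)
    distances = [50, 200, 800]   # movement distances D (pixels)

    # Cross the independent variables into a full factorial design.
    conditions = list(itertools.product(widths, distances))
    ids = [math.log2(d / w + 1) for w, d in conditions]

    # Hypothetical mean movement times (ms), one per condition.
    mt = [420, 560, 700, 380, 500, 640, 330, 450, 580]

    # Least-squares fit of the dependent variable MT on ID.
    n = len(ids)
    mean_id, mean_mt = sum(ids) / n, sum(mt) / n
    b = (sum((x - mean_id) * (y - mean_mt) for x, y in zip(ids, mt))
         / sum((x - mean_id) ** 2 for x in ids))
    a = mean_mt - b * mean_id
    print(f"MT = {a:.0f} + {b:.0f} * ID (ms)")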

13
Other experimental examples
  • reading task; dependent variable: reading
    performance
  • formatting task; dependent variables: speed and
    accuracy

                large screen   small screen
  blue font          10             10
  black font         10             10

                Mac users   PC users
  easy             15          15
  medium           15          15
  hard             15          15
14
Randomization and true experiments
  • can only control a small number of variables,
    what do you do with the others?
  • have to do something else with all other factors
  • randomization: a random assignment procedure for
    allocating cases to conditions (see the sketch
    below)
  • does not guarantee an equal distribution of the
    extraneous factors, but makes an unequal
    distribution of any one factor highly unlikely
  • statistical inference: selection and allocation
    of cases to conditions require a random component
    in the procedure
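
A minimal sketch of random assignment (the participant
IDs and condition names are hypothetical):

    import random

    # Randomly allocate cases to conditions so that no extraneous
    # factor is likely to pile up in any one condition.
    participants = [f"P{i:02d}" for i in range(1, 21)]
    conditions = ["small screen", "large screen"]

    random.shuffle(participants)          # the random component
    half = len(participants) // 2
    assignment = {
        conditions[0]: participants[:half],
        conditions[1]: participants[half:],
    }
    for condition, group in assignment.items():
        print(condition, "->", sorted(group))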

15
Validity of Findings
  • Internal validity
  • the presence of X (or variation in the level of
    X) caused the change in the values of Y
  • need to rule out plausible rival hypotheses
  • e.g., study comparing readability on small and
    large screens that finds small screen slows
    reading, when in fact it was the glare of the
    screen that caused the difference in performance
  • Construct validity
  • the extent to which the methods used are in
    agreement with the theoretical concept
    (construct) of interest

16
  • External validity
  • findings will be replicable (repeatable)
  • generalizable to intended population
  • no one study has external validity
  • typical threats
  • non-representative users evaluated
  • non-representative tasks
  • non-representative environment (quiet lab vs.
    noisy office)

17
Measures and Manipulations
  • a record may be made by the actor, the
    investigator, or an uninvolved third party
  • the degree to which actors are aware of being
    observed impacts the naturalness of their
    behaviour
  • Self-reports: participants knowingly report their
    own behaviour
  • Observations: participants' behaviour recorded by
    the investigator or a tool (visible vs.
    non-visible)
  • Archival records: data recorded independent of
    the study (public vs. private)
  • Trace measures: records of behaviour made without
    the actors' awareness

18
Strengths and Weaknesses
  • Self-reports
  • questionnaires, interviews, rating scales, paper
    and pencil tests
  • frequently-used, very versatile, relatively cheap
  • potentially reactive
  • Observations
  • by visible observer, potentially reactive
  • vulnerable to observer errors
  • can only be used on overt behaviour, not thoughts
  • versatile, costly
  • The strength of one measure can offset the
    weakness of another. Unlike study designs, an
    investigator can and should use multiple measures.

19
Manipulating Variables
  • Selection: select cases to be alike on a certain
    variable (e.g., Mac users vs. PC users)
  • not a true experiment, because assignment is not
    random
  • Direct intervention: force the independent
    variable (e.g., small vs. large screen)
  • a true experiment, but not always possible
  • Inductions: less direct intervention
  • 3 ways: misleading instructions, false feedback,
    experimental confederates

20
Ethics in reporting
UofT Bulletin 24 Sept 2001 (also covered in The
Economist)
21
Ethics in treatment of subjects
  • Testing can be a distressing experience
  • pressure to perform, errors inevitable
  • feelings of inadequacy
  • competition with other subjects
  • Golden rule
  • subjects should always be treated with respect

22
Managing subjects in an ethical manner
  • Before the test
  • don't waste the user's time
  • use pilot tests to debug experiments,
    questionnaires, etc.
  • have everything ready before the user shows up
  • make users feel comfortable
  • emphasize that it is the system that is being
    tested, not the user
  • acknowledge that the software may have problems
  • let users know they can stop at any time
  • maintain privacy
  • tell the user that individual test results will
    be kept completely confidential
  • inform the user
  • explain any monitoring that is being used
  • answer all the user's questions (but avoid bias)
  • only use volunteers
  • the user must sign an informed consent form

23
Managing subjects in an ethical manner
  • During the test
  • don't waste the user's time
  • never have the user perform unnecessary tasks
  • make users comfortable
  • try to give the user an early success experience
  • keep a relaxed atmosphere in the room
  • coffee, breaks, etc.
  • hand out test tasks one at a time
  • never indicate displeasure with the user's
    performance
  • avoid disruptions
  • stop the test if it becomes too unpleasant
  • maintain privacy
  • do not allow the user's management to observe the
    test

24
Managing subjects in an ethical manner
  • After the test
  • make the users feel comfortable
  • state that the user has helped you find areas of
    improvement
  • inform the user
  • answer questions about the experiment that could
    have biased the results had they been answered
    earlier
  • maintain privacy
  • never report results in a way that individual
    users can be identified
  • only show videotapes outside the research group
    with the user's permission

25
University Involvement in Ethics
  • Document evaluation protocol (strategy, methods,
    measures, number of subjects, subject
    recruitment, consent form, etc.)
  • Document purpose of evaluation
  • Submitted to Office of Research Studies (ORS)
  • Reviewed by a committee (different committees for
    different kinds of evaluation)
  • Usually 2-8 weeks for approval

26
More on Observation
  • Three general approaches
  • simple observation
  • think-aloud
  • co-discovery learning

27
Simple Observation
  • The user is given the task (or not), and the
    evaluator just watches the user
  • Problem
  • does not give insight into the user's decision
    process or attitude

28
The Think Aloud Method
  • Subjects are asked to say what they are
    thinking/doing
  • what they believe is happening
  • what they are trying to do
  • why they took an action
  • Gives insight into what the user is thinking
  • Problems
  • awkward/uncomfortable for the subject (thinking
    aloud is not normal!)
  • thinking about it may alter the way people
    perform their task
  • hard to talk while concentrating on the problem
  • Most widely used evaluation method in industry

"Hmm, what does this do? I'll try it... Oops, now
what happened?"
30
Co-discovery Learning
  • Two people work together on a task
  • normal conversation between the two users is
    monitored
  • removes awkwardness of think-aloud, more natural
  • provides insights into thinking process of both
    users


31
Recording observations
  • How do we record user actions during observation
    for later analysis?
  • if no record is kept, the evaluator may forget,
    miss, or misinterpret events
  • paper and pencil
  • primitive but cheap
  • evaluators record events, interpretations, and
    extraneous observations
  • hard to get detail (writing is slow)
  • coding schemes or forms that just need to be
    ticked off (see the sketch after this list)
  • audio recording
  • good for recording talk produced by thinking
    aloud/co-discovery interaction
  • hard to tie into user actions (i.e., what they
    are doing on the screen)
  • video recording
  • can see and hear what a user is doing
  • one camera for screen, another for subject
    (picture in picture)
  • can be intrusive during initial period of use
  • Companies often build usability labs with
    one-way mirrors, video cams, etc.
  • ideally have a system that synchronizes all these
    different records together
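
A minimal sketch of such a record: a timestamped coding
log (the event codes and notes are hypothetical, not
from the course):

    import csv
    import time

    # A simple coding scheme: each observed event gets a code.
    EVENT_CODES = {"E": "error", "H": "help request", "S": "task success"}

    def log_event(writer, code, note=""):
        """Append one coded observation with a wall-clock timestamp."""
        writer.writerow([time.strftime("%H:%M:%S"), code,
                         EVENT_CODES[code], note])

    with open("session01.csv", "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["time", "code", "meaning", "note"])
        log_event(w, "H", "could not find the print dialog")
        log_event(w, "E", "clicked Cancel instead of OK")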

32
Querying Users via Interviews
  • Excellent for pursuing specific issues
  • vary questions to suit the context
  • probe more deeply on interesting issues as they
    arise
  • good for exploratory studies via open-ended
    questioning
  • often leads to specific constructive suggestions
  • Problems
  • accounts are subjective
  • time consuming
  • evaluator can easily bias the interview
  • prone to rationalization of events/thoughts by
    the user
  • the user's reconstruction may be wrong

33
How to interview
  • Plan a set of central questions
  • could be based on results of user observations
  • gets things started
  • focuses the interview
  • ensures a base of consistency
  • Structured interview: only ask planned questions
  • Semi-structured interview: allow new questions
    to follow from answers to planned questions
  • Try not to ask leading questions
  • Start with individual discussions to discover
    different perspectives, and continue with group
    discussions
  • the larger the group, the more the universality
    of comments can be ascertained
  • also encourages discussion between users

34
Retrospective Interview
  • Post-observation interview to clarify events that
    occurred during system use
  • perform an observational test
  • create a video record of it
  • have users view the video and comment on what
    they did
  • excellent for grounding a post-test interview
  • avoids erroneous reconstruction
  • users often offer concrete suggestions

"Do you know why you never tried that option?"
"I didn't see it. Why don't you make it look like
a button?"
35
Querying users via Questionnaires and Surveys
  • Questionnaires / Surveys
  • preparation expensive, but administration cheap
  • can reach a wide subject group (e.g. mail)
  • does not require presence of evaluator
  • results can be quantified (see the sketch below)
  • only as good as the questions asked
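
A minimal sketch of quantifying results, tallying
invented scalar (1-5) responses:

    from collections import Counter
    from statistics import mean, median

    # Hypothetical responses to a 1-5 scalar question.
    responses = [4, 5, 3, 4, 2, 5, 4, 3, 4, 5]

    print("distribution:", sorted(Counter(responses).items()))
    print(f"mean = {mean(responses):.2f}, median = {median(responses)}")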

36
Querying Users via Questionnaires / Surveys
  • establish the purpose of the questionnaire
  • what information is sought?
  • how would you analyze the results?
  • what would you do with your analysis?
  • do not ask questions whose answers you will not
    use!
  • e.g. how old are you?
  • determine the audience you want to reach
  • typical survey: a random sample of between 50 and
    1000 users of the product
  • determine how you will deliver and collect the
    questionnaire
  • on-line for computer users
  • web site with forms
  • surface mail
  • including a pre-addressed reply envelope gives
    far better response
  • determine the demographics
  • e.g. computer experience

37
Styles of questions
  • Open-ended questions
  • asks for unprompted opinions
  • good for general subjective information
  • but difficult to analyze rigorously
  • E.g., "Can you suggest any improvements to the
    interface?"

38
Styles of questions
  • Closed questions
  • restricts the respondents' responses by supplying
    alternative answers
  • can be easily analyzed
  • but watch out for hard to interpret responses!
  • alternative answers should be very specific
  • "Do you use computers at work?"
  • O often    O sometimes    O rarely
  • vs.
  • "In your typical work day, do you use
    computers?"
  • O over 4 hrs a day
  • O between 2 and 4 hrs daily
  • O between 1 and 2 hrs daily
  • O less than 1 hr a day

39
Styles of questions
  • Scalar
  • ask user to judge a specific statement on a
    numeric scale
  • scale usually corresponds with agreement or
    disagreement with a statement
  • "Characters on the computer screen are:"
  •   hard to read   1   2   3   4   5   easy to read
  • The scale usually has an uneven (odd) number of
    points. Why?

40
Styles of questions
  • Multi-choice
  • respondent offered a choice of explicit
    responses
  • How do you most often get help with the system?
    (tick one)
  • O on-line manual
  • O paper manual
  • O ask a colleague
  • Which types of software have you used? (tick all
    that apply)
  • O word processor
  • O database
  • O spreadsheet
  • O compiler

41
Styles of questions
  • Ranked
  • respondent places an ordering on items in a list
  • useful to indicate a user's preferences
  • forced choice
  • Rank the usefulness of these methods of issuing a
    command
  • (1 = most useful, 2 = next most useful, ..., 0 if
    not used)
  • __2__ command line
  • __1__ menu selection
  • __3__ control key accelerator

42
Styles of questions
  • Combining open-ended and closed questions
  • gets a specific response, but allows room for the
    user's opinion
  • "It is easy to recover from mistakes."
  •   disagree   1   2   3   4   5   agree
  •   comment: "the undo facility is really helpful"

43
Assessing any evaluation
  • What strategy, method, measures were used?
  • What are the inherent weaknesses/strengths of the
    strategies, methods, measures?
  • How (if at all) did the investigators
    mitigate/address the weaknesses? (Did they
    acknowledge the weaknesses?)
  • Key: think of these questions when you are
    planning your own evaluation!

44
Readings
  • McGrath, J. (1994). Methodology matters: Doing
    research in the behavioural and social sciences.
    (BGBG 152-169)
  • Kennedy, S. (1989). Using video in the BNR
    usability lab. (Reprinted in BGBG 182-185)