Transcript and Presenter's Notes

Title: Evaluating Exploratory Search Systems


1
Evaluating Exploratory Search Systems
  • Ryen White
  • Microsoft Research
  • ryenw@microsoft.com
  • research.microsoft.com/ryenw/talks/ppt/WhiteIMT542E.ppt

2
Overview
  • Short, selfish bit about me
  • User evaluation in IR
  • Case study combining two approaches
  • User study
  • Log-based
  • Introduction to Exploratory Search Systems
  • Focus on evaluation
  • Short group activity
  • Wrap-up

3
Me, Me, Me
  • Interested in understanding and supporting
    people's search behaviors, in particular on the
    Web
  • Ph.D. in Interactive Information Retrieval from
    the University of Glasgow, Scotland (2001-2004)
  • Post-doc at the University of Maryland
    Human-Computer Interaction Lab (2004-2006)
  • Instructor for a course on Human-Computer
    Interaction at the UMD College of Library and
    Information Studies
  • Researcher in the Text Mining, Search, and
    Navigation group at Microsoft Research, Redmond
    (2006-present)

4
Overview
  • Short, selfish bit about me
  • User evaluation in IR
  • Case study combining two approaches
  • User study
  • Log-based
  • Introduction to Exploratory Search Systems
  • Focus on evaluation
  • Short group activity
  • Wrap-up

5
Search Interfaces
  • There are lots of different search interfaces,
    for lots of different situations
  • Big question: How do we evaluate these interfaces?

6
Some Approaches
  • Laboratory Experiments
  • Naturalistic Studies
  • Longitudinal Studies
  • Formative (during) and Summative (after)
    evaluations
  • Traditional usability studies
  • Is an interface usable? Generally not
    comparative.
  • Case Studies
  • Often designer, not user, driven

7
Research Questions
  • Research questions are questions that you hope
    your study will answer (a formal statement of
    your goal)
  • Hypotheses are specific predictions about
    relationships among variables
  • Questions should be meaningful, answerable,
    concise, open-ended, and value-free

8
Research Questions: Example 1
  • For a study of advanced query syntax (e.g., +, -,
    "", site:), the research questions were:
  • Is there a relationship between the use of
    advanced syntax and other characteristics of a
    search?
  • Is there a relationship between the use of
    advanced syntax and post-query navigation
    behaviors?
  • Is there a relationship between the use of
    advanced syntax and measures of search success?

9
Research Questions: Example 2
  • For a study of an interface gadget that points
    users to popular destinations (i.e., pages that
    many people visit):
  • Are popular destinations preferable and more
    effective than query refinement suggestions and
    unaided Web search for:
  • Searches that are well-defined (known-item
    tasks)?
  • Searches that are ill-defined (exploratory
    tasks)?
  • Should popular destinations be taken from the end
    of query trails or the end of session trails?
  • More on this research question in the case study
    later!

10
Variables
  • Independent Variable (IV): the cause; this is
    often (but not always) controlled or manipulated
    by the investigator
  • Dependent Variable (DV): the effect; this is
    what is proposed to change as a result of
    different values of the independent variable
  • Other variables
  • Intervening variable: explains the link between
    variables
  • Moderating variable: affects the direction/strength
    of the IV-to-DV relationship
  • Confounding variable: not controlled for, but
    affects the DV

11
Hypotheses
  • Alternative Hypothesis: a statement describing
    the relationship between two or more variables
  • E.g., "Search engine users that use advanced query
    syntax find more relevant Web pages"
  • Null Hypothesis: a statement declaring that there
    is no relationship among variables; you may have
    heard of
  • "rejecting the null hypothesis"
  • "failing to reject the null hypothesis"
  • E.g., "Search engine users that use advanced query
    syntax find Web pages that are no more or less
    relevant than those found by other users"

12
Experimental Design
  • Within- and/or between-subjects
  • Within-subjects: all subjects use all systems
  • Between-subjects: subjects use only one system;
    different blocks of subjects use each system
  • Control
  • System with no modifications (in within-subjects)
  • Group of subjects that do not use the experimental
    system, but instead use a baseline (in
    between-subjects)
  • Factorial Designs
  • More than one variable (factor), e.g., system ×
    task type

13
Tasks
  • Task or topic?
  • Task is the activity the user is asked to perform
  • Topic is the subject matter of the task
  • Artificial tasks
  • Subjects given task (or even queries); relevance
    pre-determined
  • Simulated work tasks (Borlund, 2000)
  • Subjects given task; compose queries; determine
    relevance
  • Natural tasks (Kelly & Belkin, 2004)
  • Subjects construct own tasks as part of real needs

14
System and Task Rotation
  • Rotation (counterbalancing) to counteract
    learning effects
  • Latin Square rotation (see the sketch below)
  • n × n table filled with n different symbols so
    that each symbol occurs exactly once in each row
    and exactly once in each column
  • Factorial rotation
  • All possible combinations
  • Factorial requires twice as many subjects
  • Twice as expensive to perform
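
A minimal sketch of the Latin Square idea, assuming Python (not from
the talk): a cyclic construction fills an n × n table so that each
system appears exactly once per row (subject) and per column
(position). Note that a plain cyclic square does not balance carryover
effects between adjacent conditions; a balanced (Williams) design
would be needed for that.

    def latin_square(n):
        # Row r is the condition sequence rotated by r positions, so each
        # condition occurs exactly once in every row and every column.
        return [[(row + col) % n for col in range(n)] for row in range(n)]

    systems = ["Baseline", "QSuggest", "QDestination", "SDestination"]
    for subject, row in enumerate(latin_square(len(systems)), start=1):
        print(f"Subject {subject}: {[systems[i] for i in row]}")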

15
Data Collection
  • Questionnaires
  • Diaries
  • Interviews
  • Focus groups
  • Observation
  • Think-aloud
  • Logging (system, proxy server, client)

16
Data Analysis: Quantitative
  • Descriptive Statistics
  • Describe the characteristics of a sample or the
    relationships among variables
  • Present summary information about the sample
  • E.g., mean, correlation coefficient
  • Inferential Statistics
  • Used for hypothesis testing
  • Demonstrate cause/effect relationships
  • E.g., t-value (from t-test), F-value (from ANOVA);
    see the sketch below
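
As a hypothetical illustration, assuming Python with SciPy (all task
times below are invented), an independent-samples t-test comparing
completion times on two systems:

    from scipy import stats

    # Invented completion times (seconds) for two between-subjects groups
    baseline_times = [320, 410, 298, 355, 372, 301]
    suggestion_times = [250, 305, 271, 288, 260, 240]

    # Two-sided independent-samples t-test; a small p-value (e.g., < 0.05)
    # lets us reject the null hypothesis of no difference between systems
    t_value, p_value = stats.ttest_ind(baseline_times, suggestion_times)
    print(f"t = {t_value:.2f}, p = {p_value:.3f}")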

17
Data Analysis: Qualitative
  • Coding: open questions, transcribed think-aloud,
    etc.
  • Classifying or categorizing individual pieces of
    data
  • Open Coding: codes are suggested by the
    investigator's examination and questioning of the
    data
  • Iterative process
  • Closed Coding: codes are identified before the
    data is collected
  • Each passage can have more than one code (see the
    sketch below)
  • Not all passages have to have a code
  • Code, code, and code some more!
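
A small hypothetical sketch, assuming Python (passages and code labels
are invented), of tallying codes after a coding pass; note that
passages can carry several codes or none:

    from collections import Counter

    coded_passages = [
        {"text": "I kept retyping the query", "codes": ["reformulation", "frustration"]},
        {"text": "The suggestions saved me typing", "codes": ["suggestion_uptake"]},
        {"text": "OK, moving on", "codes": []},  # passages may go uncoded
    ]

    # Tally how often each code was applied across all passages
    counts = Counter(code for p in coded_passages for code in p["codes"])
    print(counts.most_common())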

18
Overview
  • Short, selfish bit about me
  • User evaluation in IR
  • Case study combining two approaches
  • User study
  • Log-based
  • Introduction to Exploratory Search Systems
  • Focus on evaluation
  • Short group activity
  • Wrap-up

19
Case Study: Leveraging popular destinations to
enhance Web search interaction
  • White, R.W., Bilenko, M. and Cucerzan, S. (2007).
    Studying the use of popular destinations to
    enhance Web search interaction. In Proceedings
    of the 30th ACM SIGIR Conference on Research and
    Development in Information Retrieval, pp. 159-166.

20
Motivation
  • Query suggestion is a popular approach to help
    users better define their information needs
  • Incremental; may be inappropriate for exploratory
    needs
  • In exploratory searches, users rely a lot on
    browsing
  • Can we use the places others go rather than what
    they say?

[Screenshot: query suggestions shown for the query "hubble telescope"]
21
Search Trails from user logs
  • Initiated with a query to a top-5 search engine
  • Query trails (see the sketch after the figure)
  • Query → Query
  • Session trails
  • Query → Event
  • Session timeout
  • Visit homepage
  • Type URL
  • Check Web-based email or log on to online service

[Figure: example trail for the query "digital cameras", passing through
pages on dpreview.com and pmai.org (steps S1-S4), ending at the query
trail end.]
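
A minimal sketch of trail segmentation, assuming Python and an
invented log format (this is not the study's actual pipeline): a query
trail runs from a query until the next query or an inactivity timeout.

    SESSION_TIMEOUT = 30 * 60  # assumed inactivity cutoff, in seconds

    def query_trails(events):
        """events: time-ordered (timestamp, kind, url) tuples,
        where kind is 'query' or 'visit'."""
        trails, current, last_time = [], None, None
        for timestamp, kind, url in events:
            timed_out = (last_time is not None
                         and timestamp - last_time > SESSION_TIMEOUT)
            if kind == "query" or timed_out:
                if current:
                    trails.append(current)  # close the previous trail
                current = [url] if kind == "query" else None
            elif current is not None:
                current.append(url)  # a page visit extends the trail
            last_time = timestamp
        if current:
            trails.append(current)
        return trails
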
22
Popular Destinations
  • Pages at which other users end up frequently
    after submitting the same or similar queries, and
    then browsing away from initially clicked search
    results
  • Popular destinations lie at the end of many
    users' trails (see the sketch below)
  • May not be among the top-ranked results
  • May not contain the queried terms
  • May not even be indexed by the search engine
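
A hypothetical sketch of the core idea, assuming Python (the URLs are
invented): count the pages at which trails for the same or similar
queries end, and surface the most frequent ones as destinations.

    from collections import Counter

    def popular_destinations(trails_for_query, k=3):
        # Each trail is the list of pages visited after one query;
        # its last element is the trail-end (destination) page.
        ends = Counter(trail[-1] for trail in trails_for_query if trail)
        return [url for url, _ in ends.most_common(k)]

    trails = [
        ["results.example", "hubblesite.org", "nasa.gov/hubble"],
        ["results.example", "nasa.gov/hubble"],
        ["results.example", "wikipedia.org/Hubble", "nasa.gov/hubble"],
    ]
    print(popular_destinations(trails))  # most frequent endpoints first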

23
Suggesting Destinations
  • Can we exploit a corpus of trails to support Web
    search?

24
Research Questions
  • RQ1: Are destination suggestions preferable and
    more effective than query refinement suggestions
    and unaided Web search for:
  • Searches that are well-defined (known-item
    tasks)?
  • Searches that are ill-defined (exploratory
    tasks)?
  • RQ2: Should destination suggestions be taken from
    the end of the query trails or the end of the
    session trails?

25
User Study
  • Conducted a user study to answer these questions
  • 36 subjects drawn from a subject pool within our
    organization
  • 4 systems
  • 2 task types (known-item and exploratory)
  • Within-subjects experimental design
  • Graeco-Latin square design
  • Subjects attempted 2 known-item and 2 exploratory
    tasks, one on each system

26
Systems: Unaided Web Search
  • Live Search backend
  • No direct support for query refinement

[Screenshot: unaided search results for the query "hubble telescope"]
27
Systems: Query Suggestion
  • Suggests queries based on popular extensions of
    the current query typed by the user

[Screenshot: query suggestions for the query "hubble telescope"]
28
Systems: Destination Suggestion
  • Query Destination (unaided + page support)
  • Suggests pages many users visit before their next
    query
  • Session Destination (unaided + page support)
  • Same as above, but before session end, not the
    next query

[Screenshot: destination suggestions for the query "hubble telescope"]
29
Tasks
  • Tasks taken and adapted from TREC Interactive
    Track and QA communities (e.g., Live QnA, Yahoo!
    Answers)
  • Six of each task type; subjects chose without
    replacement
  • Two task types: known-item and exploratory
  • Known-item: "Identify three tropical storms
    (hurricanes and typhoons) that have caused
    property damage and/or loss of life."
  • Exploratory: "You are considering purchasing
    a Voice Over Internet Protocol (VoIP) telephone.
    You want to learn more about VoIP technology and
    providers that offer the service, and select the
    provider and telephone that best suits you."

30
Methodology
  • Subjects
  • Chose two known-item and two exploratory tasks
    from six
  • Completed demographic and experience
    questionnaire
  • For each of the four interfaces, subjects were
  • Given an explanation of interface functionality
    (2 min.)
  • Asked to attempt the task on the assigned system
    (10 min.)
  • Asked to complete a post-search questionnaire
    after each task
  • After using all four systems, subjects answered an
    exit questionnaire

31
Findings: System Ranking
  • Subjects asked to rank the systems in preference
    order
  • Subjects preferred QuerySuggestion and
    QueryDestination
  • Differences not statistically significant
  • Overall ranking merges performance on different
    types of search task to produce one ranking

[Table: relative ranking of systems (lower = better).]
32
Findings: Subject Comments
  • Responses to open-ended questions
  • Baseline
  • + familiarity of the system (e.g., "was familiar
    and I didn't end up using suggestions" (S36))
  • - lack of support for query formulation ("Can be
    difficult if you don't pick good search terms"
    (S20))
  • - difficulty locating relevant documents (e.g.,
    "Difficult to find what I was looking for" (S13))

33
Findings: Subject Comments
  • Query Suggestion
  • + rapid support for query formulation (e.g., "was
    useful in saving typing and coming up with new
    ideas for query expansion" (S12); "helps me
    better phrase the search term" (S24); "made my
    next query easier" (S21))
  • - suggestion quality (e.g., "Not relevant" (S11);
    "Popular queries weren't what I was looking for"
    (S18))
  • - quality of results they led to (e.g., "Results
    (after clicking on suggestions) were of low
    quality" (S35); "Ultimately unhelpful" (S1))

34
Findings: Subject Comments
  • QueryDestination
  • + support for accessing new information sources
    (e.g., "provided potentially helpful and new
    areas / domains to look at" (S27))
  • + bypassing the need to browse to these pages
    ("Useful to try to 'cut to the chase' and go
    where others may have found answers to the topic"
    (S3))
  • - lack of specificity in the suggested domains
    ("Should just link to site-specific query, not
    site itself" (S16); "Sites were not very
    specific" (S24); "Too general/vague" (S28))
  • - quality of the suggestions ("Not relevant"
    (S11); "Irrelevant" (S6))

35
Findings: Subject Comments
  • SessionDestination
  • + utility of the suggested domains ("suggestions
    make an awful lot of sense in providing search
    assistance, and seemed to help very nicely" (S5))
  • - irrelevance of the suggestions (e.g., "did not
    seem reliable, not much help" (S30); "irrelevant,
    not my style" (S21))
  • - need to include explanations about why the
    suggestions were offered (e.g., "low-quality
    results, not enough information presented" (S35))

36
Findings: Task Completion
  • Subjects felt that they were more successful for
    known-item searches on QuerySuggestion and more
    successful for exploratory searches on
    QueryDestination

[Table: perceptions of task success (lower = better, scale 1-5).]
37
Findings: Task Completion Time
[Bar chart: mean task completion time (seconds) by system (Baseline,
QSuggest, QDestination, SDestination) and task category (known-item,
exploratory). Known-item values: 359.8, 348.8, 272.3, 232.3;
exploratory values: 513.7, 474.2, 467.8, 472.2.]
  • QuerySuggestion and QueryDestination sped up
    known-item performance
  • Exploratory tasks took longer

38
Findings: Interaction
[Table: suggestion uptake (values are percentages).]
  • Known-item tasks
  • Subjects used query suggestions most heavily
  • Exploratory tasks
  • Subjects benefited most from destination
    suggestions
  • Subjects submitted fewer queries and clicked
    fewer search results on QueryDestination

39
Log Analysis
  • These findings are all from the laboratory
  • Logs from consenting users of the Windows Live
    Toolbar allowed us to determine the external
    validity of our experimental findings
  • Do the behaviors observed in the study mimic
    those of real users in the wild?
  • Extracted search sessions from the logs that
    started with the same initial queries as our user
    study subjects

40
Log Analysis: Search Trails
  • Initiated with a query to a top-5 search engine
  • Query trails
  • Query → Query
  • Session trails
  • Query → Event
  • Session timeout
  • Visit homepage
  • Type URL
  • Check Web-based email or log on to online service

[Figure: example trail for the query "digital cameras", passing through
pages on dpreview.com and pmai.org (steps S1-S4), ending at the query
trail end.]
41
Log Analysis: Trails
  • We extracted 2,038 trails from the logs that
    began with the same query as a user study session
  • 700 from known-item and 1,338 from exploratory
    tasks
  • In vitro group: user study subjects
  • Ex vitro group: remote subjects
  • Compared
  • Number of query iterations, unique query terms,
    result clicks, and unique domains visited

42
Log Analysis: Results
[Table: in vitro vs. ex vitro interaction statistics. Annotation:
"These numbers are high!"]
  • Generally the same, apart from the number of
    unique query terms submitted
  • Subjects may be taking terms from the textual
    task descriptions provided to them

43
Log Analysis: Results
  • Known-item tasks
  • 72% overlap between queries issued and terms
    appearing in the task description
  • Exploratory tasks
  • 79% overlap between queries issued and terms
    appearing in the task description
  • Could confound the experiment if we are interested
    in query formulation behavior; needs to be
    addressed! (See the sketch below.)
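
A minimal sketch of the overlap measure implied here, assuming Python
and a simple tokenizer (the queries below are invented): the share of
distinct query terms that also appear in the task description.

    import re

    def tokens(text):
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    def term_overlap(queries, task_description):
        query_terms = set().union(*(tokens(q) for q in queries))
        return len(query_terms & tokens(task_description)) / len(query_terms)

    queries = ["voip providers", "voip telephone reviews"]
    task = ("You are considering purchasing a Voice Over Internet "
            "Protocol (VoIP) telephone.")
    print(f"{term_overlap(queries, task):.0%}")  # 50% in this invented example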

44
Conclusions
  • User study compared popular destinations with
    traditional query refinement and unaided Web
    search
  • Results revealed that
  • RQ1a: Query suggestion preferred for known-item
    tasks
  • RQ1b: Destination suggestion preferred for
    exploratory tasks
  • RQ2: Destinations from query trails rather than
    session trails
  • Differences in the number of unique query terms
    suggest that textual task descriptions may
    introduce some degree of experimental bias

45
Case Study
  • What did we learn?
  • Showed how a user evaluation can be conducted
  • Showed how analysis of different sources
    (questionnaire responses and interaction logs,
    both local and remote) can be combined to
    answer our research questions
  • Showed that the findings of a user study can be
    generalized in some respects to the real world
    (i.e., they have some external validity)
  • Anything else?

46
Overview
  • Short, selfish bit about me
  • User evaluation in IR
  • Case study combining two approaches
  • User study
  • Log-based
  • Introduction to Exploratory Search Systems
  • Focus on evaluation
  • Short group activity
  • Wrap-up

47
Exploratory Search
User's search problem
  • Exploratory search describes
  • an information-seeking problem context that is
    open-ended, persistent, and multi-faceted
  • commonly arising in scientific discovery,
    learning, and decision-making contexts
  • information-seeking processes that are
    opportunistic, iterative, and multi-tactical
  • exploratory tactics are used in all manner of
    information seeking and reflect seeker
    preferences and experience as much as the goal

User's search strategies
48
Marchionini's definition
49
Exploratory Search Systems
  • Support both querying and browsing activities
  • Search engines generally just support querying
  • Help users explore complex information spaces
  • Help users learn about new topics; go beyond
    finding
  • Can consider user context
  • E.g., task constraints, user emotion, changing
    needs

50
Overview
  • Short, selfish bit about me
  • User evaluation in IR
  • Case study combining two approaches
  • User study
  • Log-based
  • Introduction to Exploratory Search Systems
  • Focus on evaluation
  • Short group activity
  • Wrap-up

51
Group Activity
  • Divide into two groups of 3-4 people
  • Each group designs an evaluation of an
    exploratory search system
  • Two systems
  • mSpace: faceted spatial browser for classical
    music
  • PhotoMesa: photo browser with flexible filtering,
    grouping, and zooming tools
  • You pick the evaluation criteria, comparator
    systems, approach, metrics, etc.

52
mSpace (mspace.fm)
53
PhotoMesa (photomesa.com)
54
Some questions to think about
  • What are the independent/dependent variables?
  • Which experimental design?
  • What task types? What tasks? What topics?
  • Any comparator systems?
  • What subjects? How many? How will you recruit?
  • Which instruments? (e.g., questionnaires)
  • Which data analysis methods
    (qualitative/quantitative)?
  • Most importantly: Which metrics?
  • How do you determine user and system performance?

55
Overview
  • Short, selfish bit about me
  • User evaluation in IR
  • Case study combining two approaches
  • User study
  • Log-based
  • Introduction to Exploratory Search Systems
  • Focus on evaluation
  • Short group activity
  • Wrap-up

56
Evaluating Exploratory Search
  • SIGIR 2006 workshop on Evaluating Exploratory
    Search Systems
  • Brought together around 40 experts to discuss
    issues in the evaluation of exploratory search
    systems
  • http://research.microsoft.com/ryenw/eess
  • What metrics did they come up with?
  • How do they compare to yours?

57
Metrics from workshop
  • Engagement and enjoyment
  • E.g., task focus, happiness with system
    responses, the number of actionable events (e.g.,
    purchases, forms filled)
  • Information novelty
  • E.g., the amount of new information encountered
  • Task success
  • E.g., did they reach the target document? Did they
    encounter sufficient information en route?
  • Task time, to assess efficiency
  • Learning and cognition
  • E.g., cognitive load, attainment of learning
    outcomes, richness/completeness of the
    post-exploration perspective, amount of topic
    space covered, number of insights

58
Activity Wrap-up
  • [insert summary of comments from group activity]

59
Conclusion
  • We have
  • Described aspects of user experimentation in IR
  • Walked through a case study
  • Introduced exploratory search
  • Planned an evaluation of exploratory search
    systems
  • Related our proposed metrics to those of others
    interested in evaluating exploratory search
    systems

60
Acknowledgements
  • Although modified, a few of the earlier slides in
    this lecture were based on an excellent SIGIR
    2006 tutorial given by Diane Kelly and David
    Harper. Thank you, Diane and David!

61
Referenced Reading
  • Borlund, P. (2000). Experimental components for
    the evaluation of interactive information
    retrieval systems. Journal of Documentation,
    56(1): 71-90.
  • Kelly, D. and Belkin, N.J. (2004). Display time
    as implicit feedback: Understanding task effects.
    In Proceedings of the 27th ACM SIGIR Conference
    on Research and Development in Information
    Retrieval, pp. 377-384.