Evaluating Exploratory Search Systems

About This Presentation

Slides: 62

Provided by: RyenW

Transcript and Presenter's Notes

1
Evaluating Exploratory Search Systems

Ryen White
Microsoft Research
ryenw_at_microsoft.com
research.microsoft.com/ryenw/talks/ppt/WhiteIMT542E.ppt

2
Overview

Short, selfish bit about me
User evaluation in IR
Case study combining two approaches
User study
Log-based
Introduction to Exploratory Search Systems
Focus on evaluation
Short group activity
Wrap-up

3
Me, Me, Me

Interested in understanding and supporting people's search behaviors, in particular on the Web
Ph.D. in Interactive Information Retrieval from University of Glasgow, Scotland (2001-2004)
Post-doc at University of Maryland Human-Computer Interaction Lab (2004-2006)
Instructor for course on Human-Computer Interaction at UMD College of Library and Information Studies
Researcher in Text Mining, Search, and Navigation group at Microsoft Research, Redmond (2006-present)

4
Overview

Short, selfish bit about me
User evaluation in IR
Case study combining two approaches
User study
Log-based
Introduction to Exploratory Search Systems
Focus on evaluation
Short group activity
Wrap-up

5
Search Interfaces

There are lots of different search interfaces,
for lots of different situations
Big question: How do we evaluate these interfaces?

6
Some Approaches

Laboratory Experiments
Naturalistic Studies
Longitudinal Studies
Formative (during) and Summative (after)
evaluations
Traditional usability studies
Is an interface usable? Generally not
comparative.
Case Studies
Often designer, not user, driven

7
Research Questions

Research questions are questions that you hope
that your study will answer (a formal statement
of your goal)
Hypotheses are specific predictions about
relationships among variables
Questions should be meaningful, answerable,
concise, open-ended, and value-free

8
Research Questions: Example 1

For a study of advanced query syntax (e.g., +, -, "", site:), the research questions were:
Is there a relationship between the use of
advanced syntax and other characteristics of a
search?
Is there a relationship between the use of
advanced syntax and post-query navigation
behaviors?
Is there a relationship between the use of
advanced syntax and measures of search success?

9
Research Questions: Example 2

For a study of an interface gadget that points users to popular destinations (i.e., pages that many people visit):
Are popular destinations preferable and more effective than query refinement suggestions and unaided Web search for:
Searches that are well-defined (known-item tasks)?
Searches that are ill-defined (exploratory tasks)?
Should popular destinations be taken from the end of query trails or the end of session trails?
More on this research question in the case study later!

10
Variables

Independent Variable (IV): the cause; this is often (but not always) controlled or manipulated by the investigator
Dependent Variable (DV): the effect; this is what is proposed to change as a result of different values of the independent variable
Other variables
Intervening variable: explains the link between variables
Moderating variable: affects the direction/strength of the IV-to-DV relationship
Confounding variable: not controlled for, affects the DV

11
Hypotheses

Alternative Hypothesis: a statement describing the relationship between two or more variables
E.g., search engine users that use advanced query syntax find more relevant Web pages
Null Hypothesis: a statement declaring that there is no relationship among variables; you may have heard of
"rejecting the null hypothesis"
"failing to reject the null hypothesis"
E.g., search engine users that use advanced query syntax find Web pages that are no more or less relevant than those found by other users

12
Experimental Design

Within and/or Between Subjects
Within-subjects: all subjects use all systems
Between-subjects: subjects use only one system; different blocks of users use each system
Control
System with no modifications (in within-subjects)
Group of subjects that do not use the experimental system, but instead use a baseline (in between-subjects)
Factorial Designs
>1 variable (factor), e.g., system × task type

13
Tasks

Task or topic?
Task is the activity the user is asked to perform
Topic is the subject matter of the task
Artificial tasks
Subjects given task or even queries; relevance pre-determined
Simulated work tasks (Borlund, 2000)
Subjects given task; compose queries; determine relevance
Natural tasks (Kelly & Belkin, 2004)
Subjects construct own tasks as part of real needs

14
System and Task Rotation

Rotation and counterbalancing to counteract learning effects
Latin Square rotation
n × n table filled with n different symbols so that each symbol occurs exactly once in each row and exactly once in each column
Factorial rotation
All possible combinations
Factorial has twice as many subjects
Twice as expensive to perform
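
The two rotation schemes above can be sketched in a few lines of Python. This is only an illustration, not the procedure used in any study described in this talk: the system labels and task names are made up, and the cyclic construction shown is just one simple way to build a Latin square. It balances the position of each system but does not balance which system follows which.

```python
from itertools import product


def latin_square(symbols):
    """Build a cyclic n x n Latin square: row i is the symbol list rotated left by i.

    Each symbol occurs exactly once in every row and every column.
    """
    n = len(symbols)
    return [[symbols[(row + col) % n] for col in range(n)] for row in range(n)]


def factorial_rotation(systems, tasks):
    """Enumerate all system x task combinations (full factorial)."""
    return list(product(systems, tasks))


# Hypothetical labels, used only for illustration.
systems = ["A", "B", "C", "D"]
tasks = ["known-item", "exploratory"]

square = latin_square(systems)
for group, order in enumerate(square, start=1):
    print(f"subject group {group}: {' -> '.join(order)}")

print(len(factorial_rotation(systems, tasks)), "system/task combinations")
```

Each row gives the order in which one block of subjects would use the systems; the factorial enumeration shows how quickly the number of conditions grows once a second factor is crossed with the first.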

15
Data Collection

Questionnaires
Diaries
Interviews
Focus groups
Observation
Think-aloud
Logging (system, proxy server, client)

16
Data Analysis: Quantitative

Descriptive Statistics
Describe the characteristics of a sample or the relationship among variables
Present summary information about the sample
E.g., mean, correlation coefficient
Inferential Statistics
Used for hypothesis testing
Demonstrate cause/effect relationships
E.g., t-value (from t-test), F-value (from ANOVA)
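
As a rough illustration of the difference between the two kinds of statistics, the sketch below computes a descriptive summary (the mean) and an inferential one (a paired-samples t-value, which fits a within-subjects design where every subject uses both systems). The timing data are invented for the example, and the function name `paired_t` is hypothetical. The resulting t-value would still need to be compared against a t distribution with the returned degrees of freedom to obtain a p-value.

```python
import math
import statistics


def paired_t(xs, ys):
    """Return (t, degrees of freedom) for a paired-samples t-test.

    xs and ys are measurements from the same subjects under two conditions.
    Raises ZeroDivisionError if all pairwise differences are identical.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(xs, ys)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    t = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Made-up task times in seconds for six subjects on two systems.
baseline = [310, 295, 402, 350, 288, 365]
suggestions = [280, 300, 360, 330, 270, 340]

print("mean baseline:", statistics.mean(baseline))        # descriptive
print("mean suggestions:", statistics.mean(suggestions))  # descriptive
t_value, df = paired_t(baseline, suggestions)             # inferential
print(f"t({df}) = {t_value:.2f}")
```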

17
Data Analysis: Qualitative

Coding: open questions, transcribed think-aloud
Classifying or categorizing individual pieces of data
Open coding: codes are suggested by the investigator's examination and questioning of the data
Iterative process
Closed coding: codes are identified before the data is collected
Each passage can have more than one code
Not every passage has to have a code
Code, code, and code some more!

18
Overview

Short, selfish bit about me
User evaluation in IR
Case study combining two approaches
User study
Log-based
Introduction to Exploratory Search Systems
Focus on evaluation
Short group activity
Wrap-up

19
Case Study: Leveraging popular destinations to enhance Web search interaction

White, R.W., Bilenko, M., Cucerzan, S. (2007).
Studying the use of popular destinations to
enhance web search interaction. In Proceedings
of the 30th ACM SIGIR Conference on Research and
Development in Information Retrieval, pp. 159-166.

20
Motivation

Query suggestion is a popular approach to help
users better define their information needs
Incremental; may be inappropriate for exploratory needs
In exploratory searches users rely a lot on browsing
Can we use the places others go rather than what they say?

[Figure: query suggestions for the query "hubble telescope"]
21
Search Trails from user logs

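Initiated with a query to a top-5 search engine
Query trails: Query → Query
Session trails: Query → Event, where the event is one of
Session timeout
Visit homepage
Type URL
Check Web-based email or log on to an online service

[Figure: example search trail for the query "digital cameras", passing through pages S1 to S4 and sites including dpreview.com and pmai.org, with the query trail end marked]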
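
A minimal sketch of how a time-ordered interaction log might be cut into the two kinds of trail described on this slide. The event names (`query`, `page`, `timeout`, and so on), the tuple layout, and the second query in the example are assumptions made for illustration; the actual log format and extraction rules used in the study are not given here.

```python
# Hypothetical event kinds that end a session trail (mirrors the slide's list).
SESSION_END_EVENTS = {"timeout", "homepage", "typed_url", "email_or_login"}


def split_trails(events):
    """Split time-ordered (kind, value) events into query and session trails.

    A query trail runs from one query up to the next query.
    A session trail runs from the first query up to a session-ending event.
    Each trail is a (query, [pages visited]) tuple.
    """
    query_trails, session_trails = [], []
    q_trail = s_trail = None
    for kind, value in events:
        if kind == "query":
            if q_trail is not None:
                query_trails.append(q_trail)  # next query closes the query trail
            q_trail = (value, [])
            if s_trail is None:
                s_trail = (value, [])
        elif kind == "page":
            if q_trail is not None:  # ignore browsing before any query
                q_trail[1].append(value)
                s_trail[1].append(value)
        elif kind in SESSION_END_EVENTS:
            if q_trail is not None:
                query_trails.append(q_trail)
                session_trails.append(s_trail)
            q_trail = s_trail = None
    if q_trail is not None:  # flush trails still open at the end of the log
        query_trails.append(q_trail)
        session_trails.append(s_trail)
    return query_trails, session_trails


events = [
    ("query", "digital cameras"),
    ("page", "S1"),
    ("page", "S2"),
    ("query", "digital camera reviews"),  # made-up follow-up query
    ("page", "S3"),
    ("page", "dpreview.com"),
    ("timeout", None),
]
query_trails, session_trails = split_trails(events)
print(query_trails)
print(session_trails)
```

In this toy log the first query trail ends at S2, while the single session trail carries on through the follow-up query and ends at dpreview.com.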
22
Popular Destinations

Pages at which other users end up frequently
after submitting the same or similar queries, and
then browsing away from initially clicked search
results
Popular destinations lie at the end of many users' trails
May not be among the top-ranked results
May not contain the queried terms
May not even be indexed by the search engine
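
One way to picture the idea is to count, for each query, how often each page appears as the last page of a trail. The sketch below is an assumption-laden simplification rather than the method used in the study: it matches queries only after lowercasing and trimming (the slide also allows "similar" queries), it skips trails with a single page so that the endpoint is reached by browsing away from the first clicked result, and the trail data are invented.

```python
from collections import Counter, defaultdict


def popular_destinations(trails, top_k=3):
    """Return {query: [(page, count), ...]} ranked by how often a page ends a trail.

    trails is an iterable of (query, [pages in visit order]) tuples.
    """
    ends = defaultdict(Counter)
    for query, pages in trails:
        if len(pages) < 2:
            continue  # assumption: the user must browse beyond the first click
        ends[query.lower().strip()][pages[-1]] += 1
    return {query: counts.most_common(top_k) for query, counts in ends.items()}


# Made-up trails for illustration.
trails = [
    ("digital cameras", ["S1", "dpreview.com"]),
    ("Digital Cameras", ["S2", "S3", "dpreview.com"]),
    ("digital cameras", ["S1", "pmai.org"]),
    ("digital cameras", ["S4"]),
]
print(popular_destinations(trails))
```

Because the count is taken over where users end up rather than over ranked results, a destination can surface even if it is not highly ranked or does not contain the query terms.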

23
Suggesting Destinations

Can we exploit a corpus of trails to support Web
search?

24
Research Questions

RQ1: Are destination suggestions preferable and more effective than query refinement suggestions and unaided Web search for
Searches that are well-defined (known-item tasks)?
Searches that are ill-defined (exploratory tasks)?
RQ2: Should destination suggestions be taken from the end of the query trails or the end of the session trails?

25
User Study

Conducted a user study to answer these questions
36 subjects drawn from subject pool within our
organization
4 systems
2 task types (known-item and exploratory)
Within-subject experimental design
Graeco-Latin square design
Subjects attempted 2 known-item and 2 exploratory
tasks, one on each system

26
Systems: Unaided Web Search

Live Search backend
No direct support for query refinement

[Figure: example for the query "hubble telescope"]
27
Systems: Query Suggestion

Suggests queries based on popular extensions for the current query typed by the user

[Figure: example for the query "hubble telescope"]
28
Systems: Destination Suggestion

Query Destination (unaided + page support)
Suggests pages many users visit before their next query
Session Destination (unaided + page support)
Same as above, but before session end, not next query

[Figure: example for the query "hubble telescope"]
29
Tasks

Tasks taken and adapted from TREC Interactive
Track and QA communities (e.g., Live QnA, Yahoo!
Answers)
Six of each task type; subjects chose without replacement
Two task types: known-item and exploratory
Known-item task: "Identify three tropical storms (hurricanes and typhoons) that have caused property damage and/or loss of life."
Exploratory task: "You are considering purchasing a Voice Over Internet Protocol (VoIP) telephone. You want to learn more about VoIP technology and providers that offer the service, and select the provider and telephone that best suits you."

30
Methodology

Subjects
Chose two known-item and two exploratory tasks from six
Completed a demographic and experience questionnaire
For each of the four interfaces, subjects were
Given an explanation of interface functionality (2 min.)
Asked to attempt the task on the assigned system (10 min.)
Asked to complete a post-search questionnaire after each task
After using all four systems, subjects answered an exit questionnaire

31
Findings: System Ranking

Subjects asked to rank the systems in preference order
Subjects preferred QuerySuggestion and QueryDestination
Differences not statistically significant
Overall ranking merges performance on different types of search task to produce one ranking

[Table: relative ranking of systems (lower = better)]
32
Findings: Subject Comments

Responses to open-ended questions
Baseline
+ familiarity of the system (e.g., "was familiar and I didn't end up using suggestions" (S36))
- lack of support for query formulation ("Can be difficult if you don't pick good search terms" (S20))
- difficulty locating relevant documents (e.g., "Difficult to find what I was looking for" (S13))

33
Findings: Subject Comments

Query Suggestion
+ rapid support for query formulation (e.g., "was useful in saving typing and coming up with new ideas for query expansion" (S12); "helps me better phrase the search term" (S24); "made my next query easier" (S21))
- suggestion quality (e.g., "Not relevant" (S11); "Popular queries weren't what I was looking for" (S18))
- quality of results they led to (e.g., "Results (after clicking on suggestions) were of low quality" (S35); "Ultimately unhelpful" (S1))

34
Findings: Subject Comments

QueryDestination
+ support for accessing new information sources (e.g., "provided potentially helpful and new areas / domains to look at" (S27))
+ bypassing the need to browse to these pages ("Useful to try to cut to the chase and go where others may have found answers to the topic" (S3))
- lack of specificity in the suggested domains ("Should just link to site-specific query, not site itself" (S16); "Sites were not very specific" (S24); "Too general/vague" (S28))
- quality of the suggestions ("Not relevant" (S11); "Irrelevant" (S6))

35
Findings: Subject Comments

SessionDestination
+ utility of the suggested domains ("suggestions make an awful lot of sense in providing search assistance, and seemed to help very nicely" (S5))
- irrelevance of the suggestions (e.g., "did not seem reliable, not much help" (S30); "irrelevant, not my style" (S21))
- need to include explanations about why the suggestions were offered (e.g., "low-quality results, not enough information presented" (S35))

36
Findings: Task Completion

Subjects felt that they were more successful for known-item searches on QuerySuggestion and more successful for exploratory searches on QueryDestination

[Table: perceptions of task success (lower = better, scale 1-5)]
37
Findings: Task Completion Time

[Figure: bar chart of task completion time in seconds (y-axis, 0 to 600) for each system (Baseline, QSuggest, QDestination, SDestination), grouped by task category (known-item, exploratory). Bar labels as extracted, with the bar-to-system mapping not recoverable: 513.7, 474.2, 467.8, 472.2, 359.8, 348.8, 272.3, 232.3]

QuerySuggestion and QueryDestination sped up known-item performance
Exploratory tasks took longer

38
Findings: Interaction

[Table: suggestion uptake (values are percentages)]

Known-item tasks: subjects used query suggestion most heavily
Exploratory tasks: subjects benefited most from destination suggestions
Subjects submitted fewer queries and clicked fewer search results on QueryDestination

39
Log Analysis

These findings are all from the laboratory
Logs from consenting users of the Windows Live
Toolbar allowed us to determine the external
validity of our experimental findings
Do the behaviors observed in the study mimic
those of real users in the wild?
Extracted search sessions from the logs that
started with the same initial queries as our user
study subjects

40
Log Analysis: Search Trails

Initiated with a query to a top-5 search engine
Query trails: Query → Query
Session trails: Query → Event, where the event is one of
Session timeout
Visit homepage
Type URL
Check Web-based email or log on to an online service

[Figure: example search trail for the query "digital cameras", passing through pages S1 to S4 and sites including dpreview.com and pmai.org, with the query trail end marked]
41
Log Analysis: Trails

We extracted 2,038 trails from the logs that began with the same query as a user study session
700 from known-item and 1,338 from exploratory tasks
In vitro group: user study subjects
Ex vitro group: remote subjects
Compared the number of query iterations, unique query terms, result clicks, and unique domains visited

42
Log Analysis: Results

[Table: in vitro vs. ex vitro comparison; callout: "These numbers are high!"]

Generally the same, apart from the number of unique query terms submitted
Subjects may be taking terms from the textual task descriptions provided to them

43
Log Analysis: Results

Known-item tasks
72% overlap between queries issued and terms appearing in the task description
Exploratory tasks
79% overlap between queries issued and terms appearing in the task description
Could confound the experiment if we are interested in query formulation behavior; need to address!
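
The overlap figures above depend on how overlap is defined, which this slide does not spell out. The sketch below shows one plausible reading, stated as an assumption: the share of unique query terms that also occur in the task description, using lowercase alphanumeric tokens with no stemming. The queries are invented; the task text is the known-item task quoted earlier in the talk.

```python
import re


def tokenize(text):
    """Lowercase a string and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def term_overlap(queries, task_description):
    """Fraction of unique query terms that also appear in the task description.

    Assumed definition; no stemming, so "hurricane" does not match "hurricanes".
    """
    query_terms = set()
    for query in queries:
        query_terms |= tokenize(query)
    if not query_terms:
        return 0.0
    task_terms = tokenize(task_description)
    return len(query_terms & task_terms) / len(query_terms)


task = ("Identify three tropical storms (hurricanes and typhoons) that have "
        "caused property damage and/or loss of life.")
queries = ["tropical storms damage", "hurricane katrina deaths"]  # made-up queries
print(f"{term_overlap(queries, task):.0%} of query terms come from the task text")
```

A check like this, run per subject and per task, is one way to see how much a textual task description may be feeding query formulation.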

44
Conclusions

User study compared the popular destinations with
traditional query refinement and unaided Web
search
Results revealed that
RQ1a: Query suggestion preferred for known-item tasks
RQ1b: Destination suggestion preferred for exploratory tasks
RQ2: Destinations from query trails rather than session trails
Differences in the number of unique query terms suggest that textual task descriptions may introduce some degree of experimental bias

45
Case Study

What did we learn?
Showed how a user evaluation can be conducted
Showed how analysis of different sources (questionnaire responses and interaction logs, both local and remote) can be combined to answer our research questions
Showed that the findings of a user study can be
generalized in some respects to the real world
(i.e., has some external validity)
Anything else?

46
Overview

Short, selfish bit about me
User evaluation in IR
Case study combining two approaches
User study
Log-based
Introduction to Exploratory Search Systems
Focus on evaluation
Short group activity
Wrap-up

47
Exploratory Search

Exploratory search describes
The user's search problem: an information-seeking problem context that is open-ended, persistent, and multi-faceted; commonly used in scientific discovery, learning, and decision making contexts
The user's search strategies: information-seeking processes that are opportunistic, iterative, and multi-tactical; exploratory tactics are used in all manner of information seeking and reflect seeker preferences and experience as much as the goal

48
Marchionini's definition
49
Exploratory Search Systems

Support both querying and browsing activities
Search engines generally just support querying
Help users explore complex information spaces
Help users learn about new topics; go beyond finding
Can consider user context
E.g., task constraints, user emotion, changing needs

50
Overview

Short, selfish bit about me
User evaluation in IR
Case study combining two approaches
User study
Log-based
Introduction to Exploratory Search Systems
Focus on evaluation
Short group activity
Wrap-up

51
Group Activity

Divide into two groups of 3-4 people
Each group designs an evaluation of an
exploratory search system
Two systems
mSpace: faceted spatial browser for classical music
PhotoMesa: photo browser with flexible filtering, grouping, and zooming tools
You pick the evaluation criteria, comparator
systems, approach, metrics, etc.

52
mSpace (mspace.fm)
53
PhotoMesa (photomesa.com)
54
Some questions to think about

What are the independent/dependent variables?
Which experimental design?
What task types? What tasks? What topics?
Any comparator systems?
What subjects? How many? How will you recruit?
Which instruments? (e.g., questionnaires)
Which data analysis methods (qualitative/quantitative)?
Most importantly: Which metrics?
How do you determine user and system performance?

55
Overview

Short, selfish bit about me
User evaluation in IR
Case study combining two approaches
User study
Log-based
Introduction to Exploratory Search Systems
Focus on evaluation
Short group activity
Wrap-up

56
Evaluating Exploratory Search

SIGIR 2006 workshop on Evaluating Exploratory
Search Systems
Brought together around 40 experts to discuss
issues in the evaluation of exploratory search
systems
http://research.microsoft.com/ryenw/eess
What metrics did they come up with?
How do they compare to yours?

57
Metrics from workshop

Engagement and enjoyment
e.g., task focus, happiness with system
responses, the number of actionable events (e.g.,
purchases, forms filled)
Information novelty
e.g., the amount of new information encountered
Task success
e.g., reach target document? encountered
sufficient information en route?
Task time to assess efficiency
Learning and cognition
e.g., cognitive loads, attainment of learning
outcomes, richness/completeness of
post-exploration perspective, amount of topic
space covered, number of insights

58
Activity Wrap-up

insert summary of comments from group activity

59
Conclusion

We have
Described aspects of user experimentation in IR
Walked through a case study
Introduced exploratory search
Planned evaluation of exploratory search systems
Related our proposed metrics to those of others
interested in evaluating exploratory search
systems

60
Acknowledgements

Although modified, a few of the earlier slides in this lecture were based on an excellent SIGIR 2006 tutorial given by Diane Kelly and David Harper. Thank you, Diane and David!

61
Referenced Reading

Borlund, P. (2000). Experimental components for the evaluation of interactive information retrieval systems. Journal of Documentation, 56(1), 71-90.
Kelly, D. and Belkin, N.J. (2004). Display time as implicit feedback: Understanding task effects. In Proceedings of the 27th ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 377-384.
