Title: Evaluating Exploratory Search Systems
1Evaluating Exploratory Search Systems
- Ryen White
- Microsoft Research
- ryenw_at_microsoft.com
- research.microsoft.com/ryenw/talks/ppt/WhiteIMT542E.ppt
2Overview
- Short, selfish bit about me
- User evaluation in IR
- Case study combining two approaches
- User study
- Log-based
- Introduction to Exploratory Search Systems
- Focus on evaluation
- Short group activity
- Wrap-up
3Me, Me, Me
- Interested in understanding and supporting people's search behaviors, in particular on the Web
- Ph.D. in Interactive Information Retrieval from University of Glasgow, Scotland (2001-2004)
- Post-doc at University of Maryland Human-Computer Interaction Lab (2004-2006)
- Instructor for course on Human-Computer Interaction at UMD College of Library and Information Studies
- Researcher in Text Mining, Search, and Navigation group at Microsoft Research, Redmond (2006-present)
5Search Interfaces
- There are lots of different search interfaces, for lots of different situations
- Big question: How do we evaluate these interfaces?
6Some Approaches
- Laboratory Experiments
- Naturalistic Studies
- Longitudinal Studies
- Formative (during) and Summative (after) evaluations
- Traditional usability studies
- Is an interface usable? Generally not comparative.
- Case Studies
- Often designer, not user, driven
7Research Questions
- Research questions are questions that you hope your study will answer (a formal statement of your goal)
- Hypotheses are specific predictions about relationships among variables
- Questions should be meaningful, answerable, concise, open-ended, and value-free
8Research Questions: Example 1
- For a study of advanced query syntax (e.g., +, -, "", site:), the research questions were:
- Is there a relationship between the use of advanced syntax and other characteristics of a search?
- Is there a relationship between the use of advanced syntax and post-query navigation behaviors?
- Is there a relationship between the use of advanced syntax and measures of search success?
9Research Questions: Example 2
- For a study of an interface gadget that points users to popular destinations (i.e., pages that many people visit):
- Are popular destinations preferable and more effective than query refinement suggestions and unaided Web search for:
- Searches that are well-defined (known-item tasks)?
- Searches that are ill-defined (exploratory tasks)?
- Should popular destinations be taken from the end of query trails or the end of session trails?
- More on this research question in the case study later!
10Variables
- Independent Variable (IV): the cause; this is often (but not always) controlled or manipulated by the investigator
- Dependent Variable (DV): the effect; this is what is proposed to change as a result of different values of the independent variable
- Other variables
- Intervening variable: explains the link between variables
- Moderating variable: affects the direction/strength of the IV-to-DV relationship
- Confounding variable: not controlled for, affects the DV
11Hypotheses
- Alternative Hypothesis: a statement describing the relationship between two or more variables
- E.g., Search engine users that use advanced query syntax find more relevant Web pages
- Null Hypothesis: a statement declaring that there is no relationship among variables; you may have heard of:
- "rejecting the null hypothesis"
- "failing to reject the null hypothesis"
- E.g., Search engine users that use advanced query syntax find Web pages that are no more or less relevant than those found by other users
12Experimental Design
- Within and/or Between Subjects
- Within-subjects: All subjects use all systems
- Between-subjects: Subjects use only one system, different blocks of users use each system
- Control
- System with no modifications (in within-subjects)
- Group of subjects that do not use the experimental system, but instead use a baseline (in between-subjects)
- Factorial Designs
- More than one variable (factor), e.g., system × task type
13Tasks
- Task or topic?
- Task is the activity the user is asked to perform
- Topic is the subject matter of the task
- Artificial tasks
- Subjects given task or even queries; relevance pre-determined
- Simulated work tasks (Borlund, 2000)
- Subjects given task; compose queries; determine relevance
- Natural tasks (Kelly & Belkin, 2004)
- Subjects construct own tasks as part of real needs
14System and Task Rotation
- Rotation and counterbalancing to counteract learning effects
- Latin Square rotation
- n × n table filled with n different symbols so that each symbol occurs exactly once in each row and exactly once in each column
- Factorial rotation
- all possible combinations
- Factorial has twice as many subjects
- Twice as expensive to perform
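The two rotation schemes can be sketched in a few lines of Python. This is a minimal illustration rather than anything from the lecture: the cyclic construction is only one of many valid Latin squares, the function names are made up, and the system labels A, B, C are hypothetical.

```python
from itertools import permutations


def latin_square(symbols):
    """Return a cyclic n x n Latin square: row i is the symbol list rotated left by i."""
    n = len(symbols)
    return [[symbols[(row + col) % n] for col in range(n)] for row in range(n)]


def factorial_rotation(symbols):
    """Return every possible ordering of the symbols (n! orderings in total)."""
    return [list(order) for order in permutations(symbols)]


systems = ["A", "B", "C"]  # hypothetical system labels

# Each row of the square is the order in which one subject uses the systems.
square = latin_square(systems)
for subject, order in enumerate(square, start=1):
    print(f"Subject {subject}: {' -> '.join(order)}")

all_orders = factorial_rotation(systems)
print(len(square), "orderings in the Latin square,", len(all_orders), "in the factorial rotation")
```

With three conditions the Latin square needs 3 orderings and the factorial rotation needs 3! = 6, which is consistent with the "twice as many subjects" remark. The gap widens with more conditions (4 versus 24 orderings for four conditions).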
15Data Collection
- Questionnaires
- Diaries
- Interviews
- Focus groups
- Observation
- Think-aloud
- Logging (system, proxy server, client)
16Data Analysis: Quantitative
- Descriptive Statistics
- Describes the characteristics of a sample or the relationship among variables
- Presents summary information about the sample
- E.g., mean, correlation coefficient
- Inferential Statistics
- Used for hypotheses testing
- Demonstrate cause/effect relationships
- E.g., t-value (from t-test), F-value (from ANOVA)
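To make the descriptive/inferential distinction concrete, here is a small sketch that computes a mean (descriptive) and a paired-samples t-value (inferential) using only the Python standard library. A paired test is assumed because it fits a within-subjects design; the task times are invented purely for illustration and are not data from any study.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(sample_a, sample_b):
    """Return (t, degrees of freedom) for a paired-samples t-test."""
    if len(sample_a) != len(sample_b):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(sample_a, sample_b)]
    n = len(diffs)
    # t = mean difference divided by the standard error of the differences
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1


# Invented task completion times (seconds), one pair per subject.
baseline = [310, 295, 402, 350, 288, 330]
experimental = [280, 300, 360, 310, 270, 305]

print("Mean baseline time:", mean(baseline))
print("Mean experimental time:", mean(experimental))

t_value, df = paired_t(baseline, experimental)
print(f"t({df}) = {t_value:.2f}")
```

The t-value is then compared against the critical value for the chosen significance level and degrees of freedom to decide whether to reject the null hypothesis.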
17Data Analysis: Qualitative
- Coding: open-ended questions, transcribed think-aloud
- Classifying or categorizing individual pieces of data
- Open Coding: codes are suggested by the investigator's examination and questioning of the data
- Iterative process
- Closed Coding: codes are identified before the data is collected
- Each passage can have more than one code
- Not every passage has to have a code
- Code, code, and code some more!
19Case Study: Leveraging popular destinations to enhance Web search interaction
- White, R.W., Bilenko, M., and Cucerzan, S. (2007). Studying the use of popular destinations to enhance web search interaction. In Proceedings of the 30th ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 159-166.
20Motivation
- Query suggestion is a popular approach to help users better define their information needs
- Incremental; may be inappropriate for exploratory needs
- In exploratory searches users rely a lot on browsing
- Can we use places others go rather than what they say?
[Screenshot: query suggestions for the query "hubble telescope"]
21Search Trails from user logs
- Initiated with a query to a top-5 search engine
- Query trails
- Query → Query
- Session trails
- Query → Event, where the event is one of:
- Session timeout
- Visit homepage
- Type URL
- Check Web-based email or log on to online service
[Figure: example search trail for the query "digital cameras" through pages S1 to S4 (sites shown: dpreview.com, pmai.org), with the query trail end marked]
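The trail definitions above can be sketched as a simple segmentation over one user's time-ordered events. This is an illustrative sketch only: the `(kind, value)` event format, the event kind names, and the second query in the example are assumptions, not the log format used in the study.

```python
# Event kinds that end a session trail (names are hypothetical).
SESSION_END_EVENTS = {"timeout", "homepage", "typed_url", "login"}


def extract_trails(events):
    """Split time-ordered (kind, value) events into query trails and session trails.

    A query trail runs from a query up to the next query or session-ending event.
    A session trail runs from a query up to the next session-ending event.
    Pages seen before any query are ignored, since trails start with a query.
    """
    query_trails, session_trails = [], []
    query_trail = session_trail = None
    for kind, value in events:
        if kind == "query":
            if query_trail is not None:
                query_trails.append(query_trail)  # the previous query trail ends here
            query_trail = [value]
            if session_trail is None:
                session_trail = [value]
            else:
                session_trail.append(value)
        elif kind == "page" and query_trail is not None:
            query_trail.append(value)
            session_trail.append(value)
        elif kind in SESSION_END_EVENTS and query_trail is not None:
            query_trails.append(query_trail)
            session_trails.append(session_trail)
            query_trail = session_trail = None
    if query_trail is not None:  # the log ended without an explicit session end
        query_trails.append(query_trail)
        session_trails.append(session_trail)
    return query_trails, session_trails


events = [
    ("query", "digital cameras"),
    ("page", "dpreview.com"),
    ("query", "digital camera reviews"),  # made-up follow-up query
    ("page", "pmai.org"),
    ("timeout", None),
]
query_trails, session_trails = extract_trails(events)
print(query_trails)
print(session_trails)
```

In this example the log yields two query trails (one per query) but a single session trail that spans both queries.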
22Popular Destinations
- Pages at which other users end up frequently after submitting the same or similar queries, and then browsing away from initially clicked search results
- Popular destinations lie at the end of many users' trails
- May not be among the top-ranked results
- May not contain the queried terms
- May not even be indexed by the search engine
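One simple way to picture popular destinations is to count, for each query, the pages on which trails end. The sketch below is a deliberate simplification and not the method used in the study: it matches only identical queries (after lowercasing) rather than "similar" ones, treats each trail as a hypothetical `(query, pages)` pair, and uses made-up `.example` domains.

```python
from collections import Counter, defaultdict


def popular_destinations(trails, top_k=3):
    """Map each query to the pages on which its trails most frequently end."""
    end_counts = defaultdict(Counter)
    for query, pages in trails:
        if pages:  # a trail with no visited pages has no destination
            end_counts[query.lower()][pages[-1]] += 1
    return {
        query: [page for page, _ in counts.most_common(top_k)]
        for query, counts in end_counts.items()
    }


# Made-up trails: (query, pages visited in order).
trails = [
    ("hubble telescope", ["a.example", "b.example"]),
    ("Hubble Telescope", ["c.example", "b.example"]),
    ("hubble telescope", ["c.example"]),
    ("digital cameras", []),
]
destinations = popular_destinations(trails)
print(destinations)
```

Because the count is taken over trail end points rather than search results, a destination found this way need not be highly ranked, contain the query terms, or be indexed at all.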
23Suggesting Destinations
- Can we exploit a corpus of trails to support Web
search?
24Research Questions
- RQ1: Are destination suggestions preferable and more effective than query refinement suggestions and unaided Web search for:
- Searches that are well-defined (known-item tasks)?
- Searches that are ill-defined (exploratory tasks)?
- RQ2: Should destination suggestions be taken from the end of the query trails or the end of the session trails?
25User Study
- Conducted a user study to answer these questions
- 36 subjects drawn from subject pool within our organization
- 4 systems
- 2 task types (known-item and exploratory)
- Within-subject experimental design
- Graeco-Latin square design
- Subjects attempted 2 known-item and 2 exploratory tasks, one on each system
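A Graeco-Latin square superimposes two Latin squares so that every pairing of their symbols occurs exactly once. The sketch below builds one 4 x 4 square from arithmetic in the finite field GF(4). It is only one possible construction; the mapping of rows to subject groups, columns to session positions, and the task slot labels are assumptions for illustration, not the assignment actually used in the study.

```python
# Multiplication table for the finite field GF(4), elements encoded as 0..3.
# Addition in GF(4) is bitwise XOR.
GF4_MUL = [
    [0, 0, 0, 0],
    [0, 1, 2, 3],
    [0, 2, 3, 1],
    [0, 3, 1, 2],
]


def graeco_latin_square_4(first, second):
    """Superimpose two orthogonal 4 x 4 Latin squares over the given labels."""
    square = []
    for row in range(4):
        cells = []
        for col in range(4):
            a = GF4_MUL[1][row] ^ col  # first Latin square: 1 * row + col
            b = GF4_MUL[2][row] ^ col  # second Latin square: 2 * row + col
            cells.append((first[a], second[b]))
        square.append(cells)
    return square


systems = ["Baseline", "QuerySuggestion", "QueryDestination", "SessionDestination"]
task_slots = ["known-item 1", "exploratory 1", "known-item 2", "exploratory 2"]  # hypothetical

square = graeco_latin_square_4(systems, task_slots)
for group, row in enumerate(square, start=1):
    print(f"Group {group}:", "; ".join(f"{s} + {t}" for s, t in row))
```

Under this reading, each row is the schedule for one group of subjects: every group uses each system once and each task slot once, and across the four groups every system is paired with every task slot exactly once. With 36 subjects, each row would be used by nine subjects.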
26Systems: Unaided Web Search
- Live Search backend
- No direct support for query refinement
[Screenshot: query "hubble telescope"]
27Systems: Query Suggestion
- Suggests queries based on popular extensions for the current query typed by the user
[Screenshot: query "hubble telescope"]
28Systems: Destination Suggestion
- Query Destination (unaided page support)
- Suggests pages many users visit before next query
- Session Destination (unaided page support)
- Same as above, but before session end, not next query
[Screenshot: query "hubble telescope"]
29Tasks
- Tasks taken and adapted from TREC Interactive Track and QA communities (e.g., Live QnA, Yahoo! Answers)
- Six of each task type, subject chose without replacement
- Two task types: known-item and exploratory
- Known-item task: Identify three tropical storms (hurricanes and typhoons) that have caused property damage and/or loss of life.
- Exploratory task: You are considering purchasing a Voice Over Internet Protocol (VoIP) telephone. You want to learn more about VoIP technology and providers that offer the service, and select the provider and telephone that best suits you.
30Methodology
- Subjects
- Chose two known-item and two exploratory tasks from six of each type
- Completed demographic and experience questionnaire
- For each of four interfaces, subjects were:
- Given an explanation of interface functionality (2 min.)
- Asked to attempt the task on the assigned system (10 min.)
- Asked to complete a post-search questionnaire after each task
- After using four systems, subjects answered exit questionnaire
31Findings: System Ranking
- Subjects asked to rank the systems in preference order
- Subjects preferred QuerySuggestion and QueryDestination
- Differences not statistically significant
- Overall ranking merges performance on different types of search task to produce one ranking
[Table: Relative ranking of systems (lower = better)]
32Findings: Subject Comments
- Responses to open-ended questions
- Baseline
- (+) familiarity of the system (e.g., "was familiar and I didn't end up using suggestions" (S36))
- (-) lack of support for query formulation ("Can be difficult if you don't pick good search terms" (S20))
- (-) difficulty locating relevant documents (e.g., "Difficult to find what I was looking for" (S13))
33Findings: Subject Comments
- Query Suggestion
- (+) rapid support for query formulation (e.g., "was useful in saving typing and coming up with new ideas for query expansion" (S12); "helps me better phrase the search term" (S24); "made my next query easier" (S21))
- (-) suggestion quality (e.g., "Not relevant" (S11); "Popular queries weren't what I was looking for" (S18))
- (-) quality of results they led to (e.g., "Results (after clicking on suggestions) were of low quality" (S35); "Ultimately unhelpful" (S1))
34Findings: Subject Comments
- QueryDestination
- (+) support for accessing new information sources (e.g., "provided potentially helpful and new areas / domains to look at" (S27))
- (+) bypassing the need to browse to these pages ("Useful to try to 'cut to the chase' and go where others may have found answers to the topic" (S3))
- (-) lack of specificity in the suggested domains ("Should just link to site-specific query, not site itself" (S16); "Sites were not very specific" (S24); "Too general/vague" (S28))
- (-) quality of the suggestions ("Not relevant" (S11); "Irrelevant" (S6))
35Findings: Subject Comments
- SessionDestination
- (+) utility of the suggested domains ("suggestions make an awful lot of sense in providing search assistance, and seemed to help very nicely" (S5))
- (-) irrelevance of the suggestions (e.g., "did not seem reliable, not much help" (S30); "irrelevant, not my style" (S21))
- (-) need to include explanations about why the suggestions were offered (e.g., "low-quality results, not enough information presented" (S35))
36Findings: Task Completion
- Subjects felt that they were more successful for known-item searches on QuerySuggestion and more successful for exploratory searches on QueryDestination
[Table: Perceptions of task success (lower = better, scale 1-5)]
37Findings: Task Completion Time
[Figure: bar chart of task completion time in seconds (y-axis, 0 to 600) by task category (Known-item, Exploratory) for the four systems (Baseline, QSuggest, QDestination, SDestination). Bar labels as extracted: 513.7, 474.2, 467.8, 472.2, 359.8, 348.8, 272.3, 232.3]
- QuerySuggestion and QueryDestination sped up known-item performance
- Exploratory tasks took longer
38Findings: Interaction
[Table: Suggestion uptake (values are percentages)]
- Known-item tasks
- subjects used query suggestion most heavily
- Exploratory tasks
- subjects benefited most from destination suggestions
- Subjects submitted fewer queries and clicked fewer search results on QueryDestination
39Log Analysis
- These findings are all from the laboratory
- Logs from consenting users of the Windows Live Toolbar allowed us to determine the external validity of our experimental findings
- Do the behaviors observed in the study mimic those of real users "in the wild"?
- Extracted search sessions from the logs that started with the same initial queries as our user study subjects
41Log Analysis: Trails
- We extracted 2,038 trails from the logs that began with the same query as a user study session
- 700 from known-item and 1,338 from exploratory tasks
- In vitro group: User study subjects
- Ex vitro group: Remote subjects
- Compared:
- number of query iterations, number of unique query terms, number of result clicks, and number of unique domains visited
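The four comparison measures can be computed per trail and then averaged within each group. The sketch below assumes a hypothetical trail format of `(kind, value)` events with kinds `query`, `result_click`, and `page`, and it interprets "query iterations" as the number of queries submitted. Both are assumptions, and the URL paths are made up.

```python
from urllib.parse import urlparse


def trail_measures(trail):
    """Compute the four comparison measures for one trail of (kind, value) events."""
    queries = [value for kind, value in trail if kind == "query"]
    urls = [value for kind, value in trail if kind in ("result_click", "page")]
    return {
        "query_iterations": len(queries),
        "unique_query_terms": len({term for q in queries for term in q.lower().split()}),
        "result_clicks": sum(1 for kind, _ in trail if kind == "result_click"),
        "unique_domains": len({urlparse(url).netloc for url in urls}),
    }


trail = [
    ("query", "digital cameras"),
    ("result_click", "https://www.dpreview.com/"),
    ("page", "https://www.dpreview.com/reviews"),  # made-up path
    ("query", "digital cameras reviews"),
    ("result_click", "https://pmai.org/"),
]
measures = trail_measures(trail)
print(measures)
```

Running the same function over the in vitro and ex vitro trails gives two sets of per-trail values that can be compared measure by measure.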
42Log Analysis: Results
[Table: in vitro and ex vitro comparison; callout: "These numbers are high!"]
- Generally the same, apart from the number of unique query terms submitted
- Subjects may be taking terms from the textual task descriptions provided to them
43Log Analysis: Results
- Known-item tasks
- 72% overlap between queries issued and terms appearing in the task description
- Exploratory tasks
- 79% overlap between queries issued and terms appearing in the task description
- Could confound the experiment if we are interested in query formulation behavior; need to address!
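A rough way to check for this kind of contamination is to measure how many of a subject's query terms also appear in the task description. The slides do not say exactly how overlap was computed, so the definition below (share of unique query terms found in the description, after lowercasing and simple tokenization) is an assumption, and the example queries are invented.

```python
import re


def terms(text):
    """Lowercase the text and return its set of alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def term_overlap(queries, task_description):
    """Percentage of unique query terms that also occur in the task description."""
    query_terms = set()
    for query in queries:
        query_terms |= terms(query)
    if not query_terms:
        return 0.0
    shared = query_terms & terms(task_description)
    return 100.0 * len(shared) / len(query_terms)


task_description = (
    "You are considering purchasing a Voice Over Internet Protocol (VoIP) telephone. "
    "You want to learn more about VoIP technology and providers that offer the service, "
    "and select the provider and telephone that best suits you."
)
queries = ["voip providers", "cheap voip phone"]  # invented queries

overlap = term_overlap(queries, task_description)
print(f"{overlap:.1f}% of unique query terms appear in the task description")
```

A high value for study subjects compared with remote users would support the concern that subjects copy terms from the task text rather than formulating queries as they naturally would.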
44Conclusions
- User study compared popular destinations with traditional query refinement and unaided Web search
- Results revealed that:
- RQ1a: Query suggestion preferred for known-item tasks
- RQ1b: Destination suggestion preferred for exploratory tasks
- RQ2: Destinations from query trails rather than session trails
- Differences in the number of unique query terms suggest that textual task descriptions may introduce some degree of experimental bias
45Case Study
- What did we learn?
- Showed how a user evaluation can be conducted
- Showed how analysis of different sources, questionnaire responses and interaction logs (both local and remote), can be combined to answer our research questions
- Showed that the findings of a user study can be generalized in some respects to the real world (i.e., they have some external validity)
- Anything else?
47Exploratory Search
- Exploratory search describes:
- User's search problem: an information-seeking problem context that is open-ended, persistent, and multi-faceted
- commonly used in scientific discovery, learning, and decision making contexts
- User's search strategies: information-seeking processes that are opportunistic, iterative, and multi-tactical
- exploratory tactics are used in all manner of information seeking and reflect seeker preferences and experience as much as the goal
48Marchionini's definition
49Exploratory Search Systems
- Support both querying and browsing activities
- Search engines generally just support querying
- Help users explore complex information spaces
- Help users learn about new topics: go beyond finding
- Can consider user context
- E.g., task constraints, user emotion, changing needs
51Group Activity
- Divide into two groups of 3-4 people
- Each group designs an evaluation of an exploratory search system
- Two systems
- mSpace: faceted spatial browser for classical music
- PhotoMesa: photo browser with flexible filtering, grouping, and zooming tools
- You pick the evaluation criteria, comparator systems, approach, metrics, etc.
52mSpace (mspace.fm)
53PhotoMesa (photomesa.com)
54Some questions to think about
- What are the independent/dependent variables?
- Which experimental design?
- What task types? What tasks? What topics?
- Any comparator systems?
- What subjects? How many? How will you recruit?
- Which instruments? (e.g., questionnaires)
- Which data analysis methods (qualitative/quantitative)?
- Most importantly: Which metrics?
- How do you determine user and system performance?
56Evaluating Exploratory Search
- SIGIR 2006 workshop on Evaluating Exploratory Search Systems
- Brought together around 40 experts to discuss issues in the evaluation of exploratory search systems
- http://research.microsoft.com/ryenw/eess
- What metrics did they come up with?
- How do they compare to yours?
57Metrics from workshop
- Engagement and enjoyment
- e.g., task focus, happiness with system responses, the number of actionable events (e.g., purchases, forms filled)
- Information novelty
- e.g., the amount of new information encountered
- Task success
- e.g., reach target document? encountered sufficient information en route?
- Task time to assess efficiency
- Learning and cognition
- e.g., cognitive loads, attainment of learning outcomes, richness/completeness of post-exploration perspective, amount of topic space covered, number of insights
58Activity Wrap-up
- insert summary of comments from group activity
59Conclusion
- We have
- Described aspects of user experimentation in IR
- Walked through a case study
- Introduced exploratory search
- Planned evaluation of exploratory search systems
- Related our proposed metrics to those of others
interested in evaluating exploratory search
systems
60Acknowledgements
- Although modified, a few of the earlier slides in this lecture were based on an excellent SIGIR 2006 tutorial given by Diane Kelly and David Harper. Thank you Diane and David!
61Referenced Reading
- Borlund, P. (2000). Experimental components for the evaluation of interactive information retrieval systems. Journal of Documentation, 56(1), 71-90.
- Kelly, D. and Belkin, N.J. (2004). Display time as implicit feedback: Understanding task effects. In Proceedings of the 27th ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 377-384.