Title: Mining Rich Session Context to Improve Web Search
1Mining Rich Session Context to Improve Web Search
- Guangyu Zhu, University of Maryland
- Gilad Mishne, Yahoo! Labs
2Motivations
- To propose an efficient and scalable framework for mining general web user behavior data
- Query/click logs are useful, but limited (< 5% of traffic)
- All user actions count
- The web and web user behaviors both constantly evolve
- Focus on sessions of general web browsing activities
- A logical unit that is general across all categories
- To learn the preferences, intents, and judgment of users from rich contextual information
- To learn session context models to improve core web search ranking and other web search experiences
3Roadmap
- Motivations
- Mining web sessions
- ClickRank
- Applications to web search
- Site ranking
- Page ranking
- Mining dynamic quicklinks
4Session identification
- We define a session as an active trail of user clicks represented by the URL referral structure
- A new session starts:
- After 30 minutes of inactivity
- On the occurrence of a URL without a referrer URL
- We used aggregate, anonymous general user behavior data collected by Yahoo! Toolbar
- 30 billion events over a 6-month period in 2008
- Event fields: cookie, timestamp, URL, referral URL, event attributes
- No personal information in source data
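The two session-boundary rules above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `Event` record and the `split_sessions` function are hypothetical names, timestamps are assumed to be in seconds, and a missing referrer is assumed to be stored as `None`.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple, Optional

SESSION_TIMEOUT_SECONDS = 30 * 60  # 30 minutes of inactivity, as stated on the slide


class Event(NamedTuple):
    cookie: str               # anonymous user identifier (assumed field layout)
    timestamp: float          # seconds since an arbitrary epoch (assumption)
    url: str
    referrer: Optional[str]   # None when the event carries no referrer URL


def split_sessions(events: Iterable[Event]) -> list:
    """Group events per cookie in time order and cut them into sessions."""
    by_cookie = defaultdict(list)
    for event in events:
        by_cookie[event.cookie].append(event)

    sessions = []
    for user_events in by_cookie.values():
        user_events.sort(key=lambda e: e.timestamp)
        current = []
        for event in user_events:
            idle_too_long = bool(current) and (
                event.timestamp - current[-1].timestamp > SESSION_TIMEOUT_SECONDS
            )
            # Start a new session on a long gap or on a URL without a referrer.
            if current and (idle_too_long or event.referrer is None):
                sessions.append(current)
                current = []
            current.append(event)
        if current:
            sessions.append(current)
    return sessions
```

Grouping by cookie before sorting keeps each user's trail independent, so the same pass could be run per partition of a distributed log.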
5Session characteristics
- Search sessions account for less than 5% of users' online activities
- A web session contains significantly richer activity context and diversity than a search session
6Session characteristics
- The events per session and session duration
exhibit power law behaviors in web-scale general
user behavior data sources
7Histogram session representation
- We compute a distribution of activities over
structured intents, given a list of URLs and
their intent interpretations
- Sessions are highly diverse
- Use PCA to reduce dimensions
- The first 6 eigenvalues are significant
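The histogram representation can be illustrated with a short Python sketch. It only covers the distribution step, not the PCA step. The intent names are taken from the attribute labels of the session categorization slide; the function name `session_histogram`, the `intent_of` lookup table, and the choice to skip URLs without a known intent are assumptions made for illustration.

```python
from collections import Counter

# Intent categories, named after the attributes in the session categorization table.
INTENTS = ["search", "mail", "information", "rich_content", "shopping"]


def session_histogram(urls, intent_of):
    """Return the fraction of a session's page visits that falls into each intent.

    urls      -- list of URLs visited in one session
    intent_of -- dict mapping a URL to one of INTENTS (hypothetical lookup)
    """
    # URLs with no known intent are skipped (an assumption, not from the slides).
    counts = Counter(intent_of[u] for u in urls if u in intent_of)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(INTENTS)
    return [counts[intent] / total for intent in INTENTS]
```

Each session becomes a fixed-length vector regardless of how many pages it contains, which is what makes dimensionality reduction and clustering applicable afterwards.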
8Session categorization
Cluster centroids

Attribute          Full data         1         2         3         4         5         6         7         8         9        10
Cluster size (%)         100      29.8      16.6      14.3      11.9      11.0       4.7       4.6       3.5       2.1       1.5
Search                23.630     0.340    98.430     1.190     2.350     2.350    56.180    41.520    52.230     6.460     0.090
Mail                  16.810     0.070     0.660    97.250     0.390     0.400     1.290    51.790     0.710     9.790     0.080
Information           12.260     0.040     0.270     0.390     1.030    96.500    24.580     2.650     0.500     5.970     0.020
Rich content          34.320    99.420     0.370     0.650     0.450     0.360     0.640     0.950    45.250    60.510    99.540
Shopping              12.850     0.080     0.240     0.410    95.670     0.290    16.920     2.600     0.860    16.840     0.060
Total events           9.040    11.140     2.890     5.660     6.250     5.330     4.240     5.380     4.260     7.850   151.680
Total time           420.300   532.490   261.370   303.850   235.780   298.910   228.400   455.580   218.010   439.780  4237.650

Intent labels annotated on the clusters:
- Addiction to content rich websites
- Collecting info during shopping
- Browsing content rich websites
- Reformulating search queries
- Reading email
- Informational queries
- Navigational queries
- Intent-driven web browsing patterns emerge from session clusters
- K-means clustering is sufficient to reveal meaningful intent patterns, such as long sessions of content browsing and query reformulation
- Simple and effective
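As an illustration of the clustering step, here is a plain k-means (Lloyd's algorithm) sketch in Python over session vectors. It is not the authors' implementation: the function name `kmeans`, the fixed iteration count, and the deterministic initialization from the first `k` points are simplifying assumptions.

```python
def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm on lists of floats.

    Initial centroids are simply the first k points (a simplification;
    practical implementations usually use random or k-means++ seeding).
    """
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for idx, p in enumerate(points):
            assignment[idx] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the previous centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment
```

Applied to per-session intent histograms, each resulting centroid can be read like a column of the cluster centroid table: a typical mix of intents for that group of sessions.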
9Roadmap
- Motivations
- Mining web sessions
- ClickRank
- Applications to web search
- Site ranking
- Page ranking
- Mining dynamic quicklinks
10ClickRank Overview
- ClickRank is derived from contextual indicators of user preferences and judgment in general web sessions
- Dwell time on the page
- Click order in the session
- Page load time
- Frequency of occurrence in the session
- We compute a local ClickRank function for each
visited page in a session by incorporating
session context models, and then aggregate these
values to obtain the global ClickRank
11Local ClickRank
- We define the local ClickRank function as [equation missing]
- The rank-based weight function is computed from the rank of the page visit event in the session
- The temporal weight function is computed from temporal information associated with browsing of the page
- The definition also includes an indicator function
12ClickRank incorporates click order
- We define the weight function for an event at a given rank in a session with a given total number of events as [equation missing]
- Motivated by experiments on implicit user preference judgments in Joachims et al., SIGIR 2005
- The weight function is monotonically decreasing w.r.t. the rank of the event within a session
- [equation missing]
- The mean and variance of the local ClickRank function are finite
13ClickRank incorporates temporal signals
- We define another weight function to incorporate more temporal information: [equation missing]
- Its two variables are the normalized dwell time on the page and the page load time w.r.t. the entire session
- The indicator function above defines a filter that factors in the time range of interest
14Global ClickRank
- Given a set of web sessions, the global ClickRank is computed from local ClickRank functions by an aggregation function
- Aggregation operators to compute global ClickRank are more general
- Sum, average, and filter, e.g. by criteria such as time and demographics
- Filtering sessions is much more flexible than filtering links
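The aggregation step can be sketched in Python as a single pass over sessions. This is a hedged illustration only: the local ClickRank values are taken as given inputs, and the session layout (a dict with a `local_scores` mapping), the function name `global_clickrank`, and the choice to average over the number of kept sessions are all assumptions.

```python
from collections import defaultdict


def global_clickrank(sessions, keep=lambda session: True, average=False):
    """Aggregate per-session local scores into one score per page.

    sessions -- iterable of dicts; each has a "local_scores" dict that maps
                a page to its local ClickRank in that session (assumed layout)
    keep     -- session filter, e.g. by time range or demographic attribute
    average  -- if True, divide the sums by the number of kept sessions
    """
    totals = defaultdict(float)
    kept = 0
    for session in sessions:  # one pass; only running sums are held in memory
        if not keep(session):
            continue
        kept += 1
        for page, score in session["local_scores"].items():
            totals[page] += score
    if average and kept:
        return {page: total / kept for page, total in totals.items()}
    return dict(totals)
```

Because the state is just a running sum per page, new sessions can be folded in later without revisiting old ones, and filtering amounts to skipping sessions rather than rebuilding a link graph.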
15Theoretical framework of ClickRank
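- The local ClickRank function defines a random variable associated with a web page, given an observed session
- [equations missing]
- Convergence property: as the number of observed sessions grows, the sample mean of this random variable converges to its expectation by the strong law of large numbers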
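For reference, the strong law of large numbers invoked here can be written out as below. The notation is assumed rather than taken from the slide: X_1, ..., X_n stand for the local ClickRank values of one page over n observed sessions, and treating them as independent and identically distributed with finite mean is an assumption of this sketch.

```latex
% Assumed notation (not from the slide): X_1, ..., X_n are the local ClickRank
% values of one page over n observed sessions, treated as i.i.d. with finite mean \mu.
\[
  \bar{X}_n \;=\; \frac{1}{n} \sum_{j=1}^{n} X_j
  \;\xrightarrow{\ \text{a.s.}\ }\; \mu \;=\; \mathbb{E}[X_1]
  \qquad \text{as } n \to \infty .
\]
```

Under this reading, an averaged global score stabilizes as more sessions are observed.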
16Relation to graph-based models
- ClickRank is based on an intentional surfer model
- ClickRank is data driven
- ClickRank does not embed rigid assumptions on the traversing scheme over the web
- Better reflects users' information needs and adapts faster to constantly changing user behaviors
- Significantly more efficient and scalable compared to approaches based on explicit graph formulations
- The ClickRank computational framework is well suited for distributed computing
- ClickRank can be computed incrementally
- One pass over the entire data and memory friendly
17Roadmap
- Motivations
- Mining web sessions
- ClickRank
- Applications to web search
- Site ranking
- Page ranking
- Mining dynamic quicklinks
18Applications to web search
- Datasets
- 3.3 billion web sessions extracted from Yahoo! Toolbar data over 6 months in 2008
- Site ranking
- Compute ClickRank of 16.3 million websites in 56 minutes
- Page ranking
- Compute ClickRank of 3.1 billion web pages in 1 hour and 32 minutes
19Site ranking
- ClickRank is more reliable and richer than
results computed using only static link structure
The BrowseRank results are cited from Liu et al., SIGIR 2008, which used MSN Toolbar data
20Page ranking methodology
- We evaluated ClickRank with a state-of-the-art search engine with hundreds of ranking signals
- We learn the ranking model using gradient boosted decision trees (GBDT)
- Quantify the variable importance of individual features
21Page ranking
- We used a set of 9,000 randomly sampled queries from search logs
- We computed the ClickRank feature only for documents that are visited by more than 5 users over time
Summary of the page ranking experiment
22Page ranking
- The ClickRank value is quantized within the range [0, 255] to mirror the setting in a production system
- We used DCG and NDCG to quantitatively evaluate ranking performance
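For readers unfamiliar with the two metrics, here is a minimal Python sketch of DCG and NDCG at a cutoff k. The slides do not state which DCG variant was used; this sketch assumes the linear-gain form, where each relevance grade is divided by log2(rank + 1), and the function names are illustrative.

```python
import math


def dcg(relevances, k):
    """Discounted cumulative gain of the top-k results (linear-gain variant)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg(relevances, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0
```

Here `relevances` lists the editorial relevance grades of the returned documents in ranked order, so DCG(1), DCG(5), and DCG(10) correspond to `dcg(relevances, 1)`, `dcg(relevances, 5)`, and `dcg(relevances, 10)`.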
23Page ranking
- The ClickRank feature brings 1.02%, 0.97%, 1.11%, and 1.331% web search improvements in DCG(1), DCG(5), DCG(10), and NDCG
- A 1% gain over a production system is very significant
- ClickRank affects 81.2% of over 9,000 queries and covers 62.5% of documents
24Competitive insights of ClickRank
- ClickRank brings higher improvements to long queries
- Ranked 25th in variable importance among several hundred ranking signals
- For comparison, the highest-ranking feature derived from page visit count ranked 56th, and a feature based on propagation of authority through the web link graph ranked 108th
25Mining dynamic quicklinks
- Many commercial search engines provide quick access links to popular destinations within a site
- These links are traditionally mined from search engine query logs
- Query or search session logs are limited in scope and coverage
- Query logs favor old, navigational links
26Mining dynamic quicklinks
- We demonstrate ClickRank for discovering recent, dynamic content
- We adapt the time range in the temporal weight function w.r.t. the content refresh rate found by the crawler
- Use the indicator function as a term that specifies recency of the content
27Mining dynamic quicklinks
Search results with quicklinks mined by ClickRank
for August 10, 2008
28Mining dynamic quicklinks
Search results with quicklinks mined by ClickRank
for August 10, 2008
29Mining dynamic quicklinks
Search results with quicklinks mined by ClickRank
for August 16, 2008
30Mining dynamic quicklinks
Search results with quicklinks mined by ClickRank
for August 16, 2008
31Conclusion
- We expand the use of general user behavior data for web search ranking and other applications
- We introduce ClickRank, an efficient, scalable algorithm for estimating web page importance by incorporating rich contextual information
- ClickRank is shown to be a novel and effective query-independent ranking signal, especially on long queries
- Our results highlight the potential of data-driven user behavior modeling at the web scale
32Thank You!
Guangyu Zhu
zhugy_at_umiacs.umd.edu