Mining Rich Session Context to Improve Web Search

1
Mining Rich Session Context to Improve Web Search
  • Guangyu Zhu, University of Maryland
  • Gilad Mishne, Yahoo! Labs
2
Motivations
  • To propose an efficient and scalable framework
    for mining general web user behavior data
  • Query/click logs are useful, but limited (< 5% of
    traffic)
  • All user actions count
  • The web and web user behaviors both constantly
    evolve
  • Focus on sessions of general web browsing
    activities
  • A logical unit that is general across all
    categories
  • To learn the preferences, intents, and judgment
    of users from rich contextual information
  • To learn session context models to improve core
    web search ranking, and other web search
    experience

3
Roadmap
  • Motivations
  • Mining web sessions
  • ClickRank
  • Applications to web search
  • Site ranking
  • Page ranking
  • Mining dynamic quicklinks

4
Session identification
  • We define a session as an active trail of user
    clicks represented by the URL referral structure
  • A new session starts after 30 minutes of
    inactivity, or on the occurrence of a URL without
    a referrer URL
  • We used aggregate, anonymous general user
    behavior data collected by Yahoo! Toolbar
  • 30 billion events over a 6-month period in 2008
  • Each event records a cookie, timestamp, URL,
    referral URL, and event attributes
  • No personal information in source data
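The session rule above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the `Event` record and `split_sessions` function are hypothetical names, and the sketch assumes the events belong to a single cookie and are already sorted by timestamp.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional

SESSION_TIMEOUT_S = 30 * 60  # 30 minutes of inactivity, as stated on the slide


@dataclass
class Event:
    cookie: str               # anonymous user identifier
    timestamp: float          # seconds since some epoch
    url: str
    referrer: Optional[str]   # None or "" when the event has no referrer URL


def split_sessions(events: Iterable[Event]) -> List[List[Event]]:
    """Split one user's time-ordered events into sessions (assumed input order)."""
    sessions: List[List[Event]] = []
    for ev in events:
        starts_new = (
            not sessions                      # very first event
            or not ev.referrer                # URL seen without a referrer URL
            or ev.timestamp - sessions[-1][-1].timestamp > SESSION_TIMEOUT_S
        )
        if starts_new:
            sessions.append([ev])
        else:
            sessions[-1].append(ev)
    return sessions
```

A real toolbar log would first be grouped by cookie; that grouping step is left out here.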

5
Session characteristics
  • Search sessions account for less than 5% of
    users' online activities
  • A web session contains significantly richer
    activity context and diversity than a search
    session

6
Session characteristics
  • The events per session and session duration
    exhibit power law behaviors in web-scale general
    user behavior data sources

7
Histogram session representation
  • We compute a distribution of activities over
    structured intents, given a list of URLs and
    their intent interpretations
  • Sessions are highly diverse
  • Use PCA to reduce dimensions
  • The first 6 eigenvalues are significant
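As a rough sketch of the histogram representation, the function below turns a session's URL list into a normalized distribution over intents. The intent names are taken from the attribute rows of the cluster table on the next slide; the URL-to-intent lookup `url_intent` is a hypothetical input, and the PCA step is not shown.

```python
from collections import Counter
from typing import Dict, List, Sequence

# Intent categories as they appear in the cluster table (names lowercased here).
INTENTS = ["search", "mail", "information", "rich_content", "shopping"]


def intent_histogram(
    urls: Sequence[str],
    url_intent: Dict[str, str],          # hypothetical URL -> intent mapping
    intents: Sequence[str] = INTENTS,
) -> List[float]:
    """Return the fraction of a session's events that fall under each intent."""
    counts = Counter(
        url_intent[u] for u in urls if url_intent.get(u) in intents
    )
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(intents)      # no URL had a known intent
    return [counts[name] / total for name in intents]
```

Each session then becomes one fixed-length vector, which is the form that dimensionality reduction and clustering expect.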

8
Session categorization
Cluster centroids

Attribute      Full data         1         2         3         4         5         6         7         8         9        10
                  (100%)   (29.8%)   (16.6%)   (14.3%)   (11.9%)   (11.0%)    (4.7%)    (4.6%)    (3.5%)    (2.1%)    (1.5%)
Search            23.630     0.340    98.430     1.190     2.350     2.350    56.180    41.520    52.230     6.460     0.090
Mail              16.810     0.070     0.660    97.250     0.390     0.400     1.290    51.790     0.710     9.790     0.080
Information       12.260     0.040     0.270     0.390     1.030    96.500    24.580     2.650     0.500     5.970     0.020
Rich content      34.320    99.420     0.370     0.650     0.450     0.360     0.640     0.950    45.250    60.510    99.540
Shopping          12.850     0.080     0.240     0.410    95.670     0.290    16.920     2.600     0.860    16.840     0.060
Total events       9.040    11.140     2.890     5.660     6.250     5.330     4.240     5.380     4.260     7.850   151.680
Total time       420.300   532.490   261.370   303.850   235.780   298.910   228.400   455.580   218.010   439.780  4237.650

Session patterns annotated on the clusters:
Addiction to content rich websites
Collecting info during shopping
Browsing content rich websites
Reformulating search queries
Reading email
Informational queries
Navigational queries
  • Intent-driven web browsing patterns emerge from
    session clusters
  • K-means clustering is sufficient to reveal
    meaningful intent patterns, such as long sessions
    of content browsing and query reformulation
  • Simple and effective
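For readers who want to see the clustering step concretely, here is a minimal plain-Python k-means (Lloyd's algorithm) over session vectors. It is a sketch only: the deterministic "first k points" initialization is an assumption made for reproducibility, and nothing here reflects the tooling actually used for the slide.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def _sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[Vector], k: int, iters: int = 50) -> Tuple[List[List[float]], List[int]]:
    """Cluster points into k groups; returns (centroids, label per point)."""
    # Assumed initialization: the first k points (real runs usually randomize).
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: _sq_dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break                                   # converged
        centroids = new_centroids
    return centroids, labels
```

Applied to intent histograms, each centroid reads like a column of the table above: the average share of each intent within the cluster.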

9
Roadmap
  • Motivations
  • Mining web sessions
  • ClickRank
  • Applications to web search
  • Site ranking
  • Page ranking
  • Mining dynamic quicklinks

10
ClickRank Overview
  • ClickRank is derived from contextual indicators
    of user preferences and judgment in general web
    sessions
  • Dwell time on the page
  • Click order in the session
  • Page load time
  • Frequency of occurrence in the session
  • We compute a local ClickRank function for each
    visited page in a session by incorporating
    session context models, and then aggregate these
    values to obtain the global ClickRank

11
Local ClickRank
  • We define the local ClickRank function in terms
    of two weight functions and an indicator function
  • One weight function is computed from the rank of
    the page visit event in the session
  • A second weight function is computed from
    temporal information associated with browsing of
    the page
  • The indicator function selects the events that
    belong to the page being scored
12
ClickRank incorporates click order
  • We define the weight function for an event in
    terms of its rank in the session and the total
    number of events in that session
  • Motivated by experiments on implicit user
    preference judgments in Joachims et al., SIGIR
    2005
  • The weight function is monotonically decreasing
    w.r.t. the rank of the event within a session
  • The mean and variance of the local ClickRank
    function are finite
13
ClickRank incorporates temporal signals
  • We define another weight function to incorporate
    more temporal information
  • Its inputs are the dwell time on the page and the
    page load time, each normalized w.r.t. the entire
    session
  • The indicator function above
    defines a filter that factors in the time range
    of interest
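A possible shape for such a temporal weight is sketched below. The functional form is an assumption chosen for illustration (a saturating reward for dwell time, an exponential penalty for load time), as are the rate parameters `lam_dwell` and `lam_load` and the normalization by session totals; only the ingredients (normalized dwell time, normalized load time, a time-range indicator) come from the slide.

```python
import math
from typing import Optional, Tuple


def temporal_weight(
    dwell: float,                 # seconds spent on the page
    load: float,                  # seconds the page took to load
    session_dwell: float,         # total dwell time in the session
    session_load: float,          # total load time in the session
    timestamp: float,             # time of the page visit
    window: Optional[Tuple[float, float]] = None,  # time range of interest
    lam_dwell: float = 1.0,       # hypothetical rate constants
    lam_load: float = 1.0,
) -> float:
    """Assumed temporal weight: longer dwell raises it, slower load lowers it."""
    # Indicator function: events outside the time range of interest get weight 0.
    if window is not None and not (window[0] <= timestamp <= window[1]):
        return 0.0
    d = dwell / session_dwell if session_dwell > 0 else 0.0
    l = load / session_load if session_load > 0 else 0.0
    return (1.0 - math.exp(-lam_dwell * d)) * math.exp(-lam_load * l)
```

Passing `window=None` disables the filter, so the same function serves both unfiltered ranking and time-restricted uses.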

14
Global ClickRank
  • Given a set of web sessions, the global ClickRank
    is computed from local ClickRank functions by an
    aggregation function
  • Aggregation operators to compute global ClickRank
    are more general
  • Sum, average, and filter, e.g. by criteria such
    as time and demographics
  • Filtering sessions is much more flexible than
    filtering links
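The aggregation step can be illustrated as follows. The session layout (a dict with a `"scores"` map of page to local ClickRank plus arbitrary metadata) and the `keep` predicate are hypothetical; the average here divides by the number of kept sessions, treating a page that is absent from a session as scoring zero, which is one of several reasonable readings of "average".

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Iterable, Optional

Session = Dict[str, Any]  # hypothetical: {"time": ..., "scores": {url: local score}}


def global_clickrank(
    sessions: Iterable[Session],
    keep: Optional[Callable[[Session], bool]] = None,  # session-level filter
    average: bool = False,
) -> Dict[str, float]:
    """Aggregate local ClickRank values over sessions by sum or average."""
    totals: Dict[str, float] = defaultdict(float)
    n_kept = 0
    for s in sessions:
        if keep is not None and not keep(s):
            continue                       # filter whole sessions, not links
        n_kept += 1
        for url, score in s["scores"].items():
            totals[url] += score
    if average and n_kept > 0:
        return {url: total / n_kept for url, total in totals.items()}
    return dict(totals)
```

A time or demographic restriction is then just a different `keep` function, for example `lambda s: s["time"] >= cutoff`, with no change to the aggregation itself.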

15
Theoretical framework of ClickRank
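  • The local ClickRank function defines a random
    variable associated with a web page, given an
    observed session
  • Convergence property: as the number of observed
    sessions grows, the average of the local
    ClickRank values for a page converges to its
    expected value, by the strong law of large
    numbers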
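Stated in symbols, the convergence property is the standard strong law of large numbers. The notation below is an assumption introduced here, not the slide's own: X_j(p) denotes the local ClickRank of page p in the j-th session, and sessions are treated as independent and identically distributed.

```latex
\frac{1}{n}\sum_{j=1}^{n} X_j(p)
\;\xrightarrow{\ \text{a.s.}\ }\;
\mathbb{E}\left[X_1(p)\right]
\quad \text{as } n \to \infty,
\qquad \text{provided } \mathbb{E}\left|X_1(p)\right| < \infty .
```

The finite-mean condition is the one the click-order slide asserts for the local ClickRank function.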

16
Relation to graph-based models
  • ClickRank is based on an intentional surfer model
  • ClickRank is data driven
  • ClickRank does not embed rigid assumptions on the
    traversing scheme over the web
  • Better reflects users' information needs and
    adapts faster to constantly changing user
    behaviors
  • Significantly more efficient and scalable
    compared to approaches based on explicit graph
    formulations
  • The ClickRank computational framework is well
    suited for distributed computing
  • ClickRank can be computed incrementally
  • One pass over the entire data set, and memory
    friendly
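The incremental and distributed properties follow naturally when the aggregation is a sum: each shard of data can be reduced independently and the partial results merged. The sketch below assumes sum aggregation and uses hypothetical function names; it is not a description of the production system.

```python
from typing import Dict, Iterable, Tuple


def partial_clickrank(local_scores: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Single pass over one shard of (url, local ClickRank) pairs."""
    acc: Dict[str, float] = {}
    for url, score in local_scores:
        acc[url] = acc.get(url, 0.0) + score
    return acc


def merge(a: Dict[str, float], b: Dict[str, float]) -> Dict[str, float]:
    """Combine two partial results; the inputs are left unmodified."""
    merged = dict(a)
    for url, score in b.items():
        merged[url] = merged.get(url, 0.0) + score
    return merged
```

New data can be folded in with `merge(existing_totals, partial_clickrank(new_shard))`, so earlier sessions never need to be reprocessed, and memory use scales with the number of distinct pages rather than the number of events.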

17
Roadmap
  • Motivations
  • Mining web sessions
  • ClickRank
  • Applications to web search
  • Site ranking
  • Page ranking
  • Mining dynamic quicklinks

18
Applications to web search
  • Datasets
  • 3.3 billion web sessions extracted from Yahoo!
    Toolbar data over 6 months in 2008
  • Site ranking
  • Compute ClickRank of 16.3 million websites in 56
    minutes
  • Page ranking
  • Compute ClickRank of 3.1 billion web pages in 1
    hour and 32 minutes

19
Site ranking
  • ClickRank is more reliable and richer than
    results computed using only static link structure

The BrowseRank results are cited from Liu et al.,
SIGIR 2008, which used MSN Toolbar data
20
Page ranking methodology
  • We evaluated ClickRank with a state-of-the-art
    search engine with hundreds of ranking signals
  • We learn the ranking model using gradient boosted
    decision trees (GBDT)
  • Quantify the variable importance of individual
    features

21
Page ranking
  • We used a set of 9,000 randomly sampled queries
    from search logs
  • We computed the ClickRank feature only for
    documents visited by more than 5 users over time

Summary of the page ranking experiment
22
Page ranking
  • The ClickRank value is quantized to the range
    [0, 255], to mirror the setting in a
    production system
  • We used DCG and NDCG to quantitatively evaluate
    ranking performance
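Both steps on this slide are easy to make concrete. The sketch below assumes linear min-max scaling for the quantization and the linear-gain, log2-discount form of DCG; the slide specifies neither, and other variants (for example exponential gain) are common.

```python
import math
from typing import Sequence


def quantize(value: float, lo: float, hi: float, levels: int = 256) -> int:
    """Map value from [lo, hi] to an integer in [0, levels - 1] (assumed linear)."""
    if hi <= lo:
        return 0
    frac = min(max((value - lo) / (hi - lo), 0.0), 1.0)   # clamp to [0, 1]
    return int(round(frac * (levels - 1)))


def dcg(gains: Sequence[float], k: int) -> float:
    """Discounted cumulative gain of the top-k results, in ranked order."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains[:k], start=1))


def ndcg(gains: Sequence[float], k: int) -> float:
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(gains, reverse=True), k)
    return dcg(gains, k) / ideal if ideal > 0 else 0.0
```

Here `gains` holds the editorial relevance grade of each returned document; DCG(1), DCG(5), and DCG(10) correspond to `k` of 1, 5, and 10.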

23
Page ranking
  • The ClickRank feature brings 1.02%, 0.97%, 1.11%,
    and 1.331% web search improvements in DCG(1),
    DCG(5), DCG(10), and NDCG
  • A 1% gain over a production system is very
    significant
  • ClickRank affects 81.2% of the more than 9,000
    queries and covers 62.5% of documents

24
Competitive insights of ClickRank
  • ClickRank brings higher improvements to long
    queries
  • Ranked 25th in variable importance among several
    hundred ranking signals
  • Outranks the highest-ranking feature derived from
    page visit count (ranked 56th) and a feature
    based on propagation of authority through the web
    link graph (ranked 108th)

25
Mining dynamic quicklinks
  • Many commercial search engines provide quick
    access links to popular destinations within the
    site
  • These links are traditionally mined from search
    engine query logs
  • Query or search session logs are limited in scope
    and coverage
  • Query logs favor old, navigational links

26
Mining dynamic quicklinks
  • We demonstrate ClickRank for discovering recent,
    dynamic content
  • We adapt the time range in the temporal weight
    function w.r.t. the content refresh rate found by
    the crawler
  • Use the indicator function as a term that
    specifies recency of the content
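A recency term of this kind could look like the sketch below. The choice of window (one crawler-observed refresh interval ending at a reference time) and all names are assumptions for illustration; the slide only says that the time range is adapted to the refresh rate.

```python
def recency_indicator(event_time: float, now: float, refresh_interval: float) -> int:
    """Return 1 if the visit falls within one refresh interval before `now`, else 0.

    `refresh_interval` stands in for the content refresh rate found by the
    crawler, expressed as a duration in the same units as the timestamps.
    """
    age = now - event_time
    return 1 if 0 <= age <= refresh_interval else 0
```

For a site whose content turns over daily, only visits from the past day would then contribute to the ClickRank used for its quicklinks.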

27
Mining dynamic quicklinks
Search results with quicklinks mined by ClickRank
for August 10, 2008
28
Mining dynamic quicklinks
Search results with quicklinks mined by ClickRank
for August 10, 2008
29
Mining dynamic quicklinks
Search results with quicklinks mined by ClickRank
for August 16, 2008
30
Mining dynamic quicklinks
Search results with quicklinks mined by ClickRank
for August 16, 2008
31
Conclusion
  • We expand the use of general user behavior data
    for web search ranking and other applications
  • We introduce ClickRank, an efficient, scalable
    algorithm for estimating web page importance by
    incorporating rich contextual information
  • ClickRank is shown to be a novel and effective
    query-independent ranking signal, especially on
    long queries
  • Our results highlight the potential of
    data-driven user behavior modeling at the web
    scale

32
Thank You!
Guangyu Zhu
zhugy_at_umiacs.umd.edu