Title: Characterizing Visitors to a Website Across Multiple Sessions
1. Characterizing Visitors to a Website Across Multiple Sessions
Arindam Banerjee, Joydeep Ghosh
2. Motivation
- Why characterize or predict web user behavior?
  - Site-centric view: personalization, sticky websites
  - User-centric view: personal agents for information acquisition
  - Universalist approaches: PageRank, web metrics
3. Clustering Users from Web Logs
- Wide variety of web behavior, so segment users based on surfing behavior as a first step to further analysis.
- User: a set of sessions
- Session: a sequence of (page id, time spent on that page) tuples
- How to cluster sets of sequences?
4. The Approach
- Cluster Sessions
  - Session Similarity Measure
  - Session Similarity Graph
  - Outlier Detection
  - Graph Partitioning
- Create a Cluster Space
- Cluster users in this Space
5. A Similarity Measure for Sessions
- Overlap between two sessions is represented by their longest common subsequence (LCS)
- Obtain session similarity using the LCS and time information: session similarity = (time similarity in LCS) x (importance of LCS)
- The similarity component
  - Average, over the pages in the LCS, of the min/max ratio of the times the two sessions spent on that page
- The importance component
  - Average, over the two sessions, of the fraction of overall session time spent in the LCS
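The measure above can be written as a formula. This is a sketch in assumed notation, since the slides give no symbols: the LCS of sessions $s_1$ and $s_2$ contains $n$ pages, $t_i^{(1)}$ and $t_i^{(2)}$ are the times the two sessions spent on the $i$-th LCS page, and $T^{(1)}$ and $T^{(2)}$ are the total session times.

```latex
\begin{align*}
\mathrm{sim}(s_1, s_2) &= \frac{1}{n} \sum_{i=1}^{n}
    \frac{\min\bigl(t_i^{(1)}, t_i^{(2)}\bigr)}{\max\bigl(t_i^{(1)}, t_i^{(2)}\bigr)} \\
\mathrm{imp}(s_1, s_2) &= \frac{1}{2} \left(
    \frac{\sum_{i=1}^{n} t_i^{(1)}}{T^{(1)}} + \frac{\sum_{i=1}^{n} t_i^{(2)}}{T^{(2)}} \right) \\
S(s_1, s_2) &= \mathrm{sim}(s_1, s_2) \times \mathrm{imp}(s_1, s_2)
\end{align*}
```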
6. Session Clustering
- Find the pairwise similarity values between all pairs of sessions; record only similarities > q
- Incrementally construct the similarity graph G_q
  - the vertices are the sessions, the edge weights are the session similarity values
  - no isolated vertices (discard outliers)
- Balanced Graph Partitioning
  - we used Metis [Karypis, Kumar]
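The graph-construction step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `build_similarity_graph`, the dictionary edge representation, and the toy set-overlap similarity in the usage lines are all assumptions. The balanced partitioning step (Metis in the slides) is not shown.

```python
from itertools import combinations


def build_similarity_graph(items, similarity, q):
    """Keep only edges whose similarity exceeds q; report isolated vertices.

    Hypothetical helper: `items` is a list of sessions (or users),
    `similarity` is any symmetric pairwise measure, `q` is the threshold.
    """
    edges = {}
    for a, b in combinations(range(len(items)), 2):
        weight = similarity(items[a], items[b])
        if weight > q:  # record only similarities > q
            edges[(a, b)] = weight
    # Vertices with no surviving edge are treated as outliers and discarded.
    connected = {v for pair in edges for v in pair}
    outliers = [v for v in range(len(items)) if v not in connected]
    return edges, outliers


# Toy usage with a set-overlap similarity (an assumption for illustration).
def overlap(x, y):
    return len(x & y) / len(x | y)


edges, outliers = build_similarity_graph([{1, 2, 3}, {2, 3, 4}, {9}], overlap, 0.3)
print(edges)     # {(0, 1): 0.5}
print(outliers)  # [2]
```

Computing every pair is quadratic in the number of sessions, which is consistent with the remark later in the deck that forming the similarity graph is slow.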
7. The Cluster Space
- Given: each session is assigned to one of k clusters (sets)
- So the sessions of a user are distributed among the k sets
- vector u = [u1 u2 ... uk]^T, where ui = number of sessions of the user belonging to cluster i
- Stage II: User Clustering
  - find pairwise similarity values using the extended Jaccard measure
  - partition the similarity graph
  - gives l user clusters and a set of outlier users
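A minimal sketch of the cluster-space mapping and the user similarity follows. It assumes the usual definition of the extended Jaccard measure, x.y / (|x|^2 + |y|^2 - x.y); the slides name the measure but do not spell it out. The function names and the zero-vector convention are hypothetical.

```python
def user_vector(session_cluster_ids, k):
    """Count how many of a user's sessions fall in each of the k session clusters."""
    u = [0] * k
    for cluster_id in session_cluster_ids:
        u[cluster_id] += 1
    return u


def extended_jaccard(x, y):
    """Extended Jaccard similarity: x.y / (|x|^2 + |y|^2 - x.y)."""
    dot = sum(a * b for a, b in zip(x, y))
    denom = sum(a * a for a in x) + sum(b * b for b in y) - dot
    # Assumption: two all-zero vectors are given similarity 0.
    return dot / denom if denom else 0.0


# A user with four sessions assigned to clusters 0, 2, 2 and 1 (k = 3).
u = user_vector([0, 2, 2, 1], 3)
print(u)                            # [1, 1, 2]
print(extended_jaccard(u, u))       # 1.0
print(extended_jaccard([1, 0], [0, 1]))  # 0.0
```

The resulting pairwise values can be thresholded into a user similarity graph in the same way as the session similarities.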
8. The Dataset: Sulekha.com
9. Dataset details
- Logs over a one-month period
- Raw log size: 184 MB
- 453,953 files accessed
- 37,753 sessions in all
- 23,310 sessions after some preprocessing/filtering
- 2,493 users
10. Results: Session Clusters
Cluster 1: interest in coffeehouse, contests
- (/,12)(/movies,6)(/contests,178)
- (/contests,142)
- (/coffeehouse,5)(/contests,183)
- (/contests,172)
- (/,10)(/contests,143)
Cluster 2: glance through home, articles
- (/,22)(/articles,22)
- (/,20)(/articles,20)
- (/,21)(/articles,21)
- (/,19)(/articles,19)
- (/,20)(/articles,19)
Cluster 3: interest in author, articles
- (/,148)(/authors,6)(/articles,77)
- (/authors,290)(/articles,290)
- (/authors,295)(/articles,295)
- (/,33)(/authors,90)(/articles,475)
- (/,32)(/authors,91)(/articles,425)
Cluster 4: read articles
- (/,39)(/articles,98)(/misc,17)(/articles,2649)
- (/,9)(/articles,2666)
- (/authors,26)(/articles,2561)
- (/misc,20)(/articles,77)(/misc,32)(/articles,43)(/authors,16)(/articles,2373.1)
11. Results: User Clusters
- user (128.194.xxx.xxx)
  - (/authors,3)(/articles,129)
  - (/authors,8)(/articles,8)
  - (/authors,80)(/articles,2141)
- user (209.30.xxx.xxx)
  - (/home,77)(/articles,111)(/authors,93)(/articles,629)(/misc,58)(/coffeehouse,75)(/wo-men,967)
  - (/articles,2627)
- user (171.68.xxx.xxx)
  - (/home,323)(/articles,24)(/authors,45)(/articles,1290)
- A user cluster: people who read the articles
12. Results: User Clusters
- user (152.170.xxx.xxx)
  - (/home,21)(/wo-men,1075)(/philosophy,52)
- user (209.244.xxx.xxx)
  - (/home,5)(/coffeehouse,94)(/wo-men,75)(/movies,75)(/wo-men,31)
  - (/home,52)(/philosophy,67)(/wo-men,955)(/philosophy,26)(/coffeehouse,382)(/biztech,298)(/philosophy,290)
  - (/home,17)(/coffeehouse,12)(/wo-men,15)(/personal,6)(/biztech,94)(/coffeehouse,2)(/philosophy,1093)
- A user cluster: people interested in wo-men, philosophy, coffeehouse
13. Results: User Clusters
- user (216.154.xxx.xxx)
  - (/coffeehouse,12)(/biztech,25)(/books,48)
  - (/coffeehouse,13)(/biztech,26)(/books,19)
- user (204.220.xxx.xxx)
  - (/coffeehouse,162)
  - (/coffeehouse,40)
- user (32.100.xxx.xxx)
  - (/coffeehouse,12)(/contests,12)
  - (/coffeehouse,43)(/contests,44)
- A user cluster: people interested in coffeehouse (bookmarked it!)
14. Result Visualization using CLUSION [Strehl, Ghosh 01]
[Figure: CLUSION visualizations, two panels: Sessions and Users]
15. Conclusions
- Segmentation: a basic pre-processing step for Web Mining
- Similarity measure and Cluster Space concept: applicable to clustering of sets of any data structure
- For certain websites, time spent on the pages matters
  - not handled by current commercial tools
- Outlier detection before clustering is important
- Results QA-ed by human subjects
  - Results for clusters and outliers at both levels were subjectively good
  - No good way to find cluster quality analytically
- Formation of the similarity graph is a slow process
16. Future Work
- Improve the present method by
  - using cluster seeds for cluster growing
  - using alternative clustering algorithms for each stage
  - studying the effect of thresholds and number of clusters on performance
  - studying the importance of the order of page visits
  - studying the importance of balanced clustering
17. Backup
18. Issues: Choice of Parameters
- Number of session clusters, k, should be chosen appropriately
- Thresholds for forming session and user similarity graphs
  - threshold value should be chosen after looking at the distribution of edge weights
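One way to look at the edge-weight distribution before fixing a threshold is to tabulate, for a few candidate values, how many edges survive and how many vertices would become isolated outliers. The sketch below is illustrative only: `threshold_report`, its arguments, and the toy weights are hypothetical, and the slides do not prescribe a specific procedure.

```python
def threshold_report(weights, n_vertices, candidates):
    """For each candidate threshold q, count kept edges and isolated vertices.

    `weights` maps vertex pairs (a, b) to similarity values (assumed layout).
    """
    report = []
    for q in candidates:
        kept = [pair for pair, w in weights.items() if w > q]
        connected = {v for pair in kept for v in pair}
        report.append((q, len(kept), n_vertices - len(connected)))
    return report


# Toy pairwise similarities among four sessions (made-up numbers).
weights = {(0, 1): 0.9, (1, 2): 0.4, (2, 3): 0.1}
for q, n_edges, n_isolated in threshold_report(weights, 4, [0.0, 0.3, 0.5]):
    print(q, n_edges, n_isolated)
# 0.0 3 0
# 0.3 2 1
# 0.5 1 2
```

A higher threshold gives a sparser graph but discards more sessions or users as outliers, so the choice trades graph size against coverage.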
19. Related Work
- Research in Web Mining
  - Extraction of navigational patterns [Spiliopoulou, Faulstich]
  - Ordering relationships [Mannila, Meek]
  - Surfing prediction [Pitkow, Pirolli]
  - Clustering web usage sessions [Fu, Sandhu, Shih]
20. Example
- Sessions
  - Session1: (a,8) (b,100) (d,8) (c,5) (e,23) (a,5)
  - Session2: (b,5) (d,12) (f,1) (a,7) (c,5)
- LCS pages: (b)(d)(c)
- Corresponding index and time sequences
  - Index1: (1)(2)(3), Time1: (100)(8)(5)
  - Index2: (0)(1)(4), Time2: (5)(12)(5)
- Similarity over each LCS page = min/max of the two times
  - Similarity on page b: 5/100 = 0.05
  - Similarity on page d: 8/12 = 0.67
  - Similarity on page c: 5/5 = 1.00
21. Example (contd.)
- The similarity component
  - (0.05 + 0.67 + 1.00)/3 = 0.57
- The importance component
  - Fraction of time spent in the LCS by Session1: 113/149 = 0.76
  - Fraction of time spent in the LCS by Session2: 22/30 = 0.73
  - The mean: (0.76 + 0.73)/2 = 0.75
- The overall similarity
  - 0.57 x 0.75 = 0.43
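The worked example can be reproduced with a short sketch. The function names are hypothetical, and one detail is an assumption: the slides do not say how ties between equally long common subsequences are broken (here both b, d, c and b, d, a have length 3). The backtracking rule below happens to select b, d, c, matching the example.

```python
def lcs_alignment(s1, s2):
    """Return index pairs (i, j) of one longest common subsequence of page ids."""
    n, m = len(s1), len(s2)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s1[i - 1][0] == s2[j - 1][0]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if s1[i - 1][0] == s2[j - 1][0]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif table[i - 1][j] >= table[i][j - 1]:  # tie-break: an assumption
            i -= 1
        else:
            j -= 1
    return pairs[::-1]


def session_similarity(s1, s2):
    """(average min/max time ratio on LCS pages) x (mean fraction of time in LCS).

    Assumes each session has a positive total time.
    """
    pairs = lcs_alignment(s1, s2)
    if not pairs:
        return 0.0
    ratios = []
    for i, j in pairs:
        t1, t2 = s1[i][1], s2[j][1]
        # Assumption: two zero times count as identical (ratio 1).
        ratios.append(min(t1, t2) / max(t1, t2) if max(t1, t2) else 1.0)
    similarity = sum(ratios) / len(ratios)
    frac1 = sum(s1[i][1] for i, _ in pairs) / sum(t for _, t in s1)
    frac2 = sum(s2[j][1] for _, j in pairs) / sum(t for _, t in s2)
    return similarity * (frac1 + frac2) / 2


session1 = [("a", 8), ("b", 100), ("d", 8), ("c", 5), ("e", 23), ("a", 5)]
session2 = [("b", 5), ("d", 12), ("f", 1), ("a", 7), ("c", 5)]
print(lcs_alignment(session1, session2))                 # [(1, 0), (2, 1), (3, 4)]
print(round(session_similarity(session1, session2), 2))  # 0.43
```

With a different tie-break the LCS b, d, a would be chosen and the similarity would change, so a real implementation would need to fix this rule explicitly.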
22. Issues: Session Resolution
- Generate coarse-resolution paths making use of the concept hierarchy of the website
- Reduces computation; increases interpretability of results

Original path: (/authors/ramesh_mahadevan.html,3) (/articles/rm_phattas.html,75) (/articles/rm_desidads.html,39)
Concept-level path: (/authors,3) (/articles,114)

Original path: (/authors/arun_sampath.html,109) (/philosophy/messages/1951.html,102) (/philosophy/messages/1953.html,46) (/,3) (/philosophy/messages/1954.html,69)
Concept-level path: (/authors,109) (/philosophy,148) (/,3) (/philosophy,69)
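The two examples above are consistent with a simple rule: map each page to its top-level directory and merge consecutive visits to the same concept by adding their times. The sketch below implements that rule as an assumption. The slides only say that the site's concept hierarchy is used, and the names `concept_of` and `coarsen` are hypothetical.

```python
def concept_of(path):
    """Map a page path to its top-level directory; the site root stays "/"."""
    parts = [p for p in path.split("/") if p]
    return "/" + parts[0] if parts else "/"


def coarsen(session):
    """Merge consecutive visits that fall under the same concept, summing times."""
    coarse = []
    for path, time_spent in session:
        concept = concept_of(path)
        if coarse and coarse[-1][0] == concept:
            coarse[-1] = (concept, coarse[-1][1] + time_spent)
        else:
            coarse.append((concept, time_spent))
    return coarse


original = [
    ("/authors/arun_sampath.html", 109),
    ("/philosophy/messages/1951.html", 102),
    ("/philosophy/messages/1953.html", 46),
    ("/", 3),
    ("/philosophy/messages/1954.html", 69),
]
print(coarsen(original))
# [('/authors', 109), ('/philosophy', 148), ('/', 3), ('/philosophy', 69)]
```

Only adjacent visits are merged, so the return to /philosophy after the home page stays a separate entry, as in the second example.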
23. Comments
- Results QA-ed by human subjects
  - Results for clusters and outliers at both levels were subjectively good
  - No good way to find cluster quality analytically
- Clustering algorithms for the two stages
  - Stage I: Graph partitioning works well for large sparse graphs, so it is desirable in this stage
  - Stage II: Since the space is not high-dimensional, any reasonable clustering algorithm should be adequate
- Cluster space
  - Gives a general framework for mapping any non-vector clustering problem to an equivalent vector clustering problem