Characterizing Visitors to a Website Across Multiple Sessions

Provided by: joydee

Learn more at: https://conferences.cs.umbc.edu

1
Characterizing Visitors to a Website Across Multiple Sessions
Arindam Banerjee, Joydeep Ghosh

NGDM Workshop, Nov 2002

2
Motivation

Why characterize or predict web user behavior?
Site-centric view: personalization, sticky websites
User-centric view: personal agents for information acquisition
Universalist approaches: PageRank, web metrics, ...

3
Clustering Users from Web Logs

Wide variety of web behavior → segment users based on surfing behavior as a first step to further analysis.
User = set of sessions
Session = sequence of (page ID, time spent on that page) tuples
How to cluster sets of sequences?

4
The Approach

Cluster Sessions
Session Similarity Measure
Session Similarity Graph
Outlier Detection
Graph Partitioning
Create a Cluster Space
Cluster users in this Space

5
A Similarity Measure for Sessions

Overlap between two sessions is represented by their longest common subsequence (LCS)
Obtain session similarity using the LCS and time information:
session similarity = (time similarity in LCS) x (importance of LCS)
The similarity component: average min-max similarity (shorter time / longer time) over the pages in the LCS
The importance component: average, over the two sessions, of the fraction of overall session time spent in the LCS
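
Written out, the measure described above might read as follows. The symbols are assumptions introduced here for illustration and do not appear on the slides: L is the LCS of sessions A and B, t_A(p) and t_B(p) are the times the two sessions spend on LCS page p, and T_A and T_B are the total session times.

```latex
\[
\mathrm{sim}(A,B) =
\underbrace{\frac{1}{|L|}\sum_{p \in L}
  \frac{\min\bigl(t_A(p),\, t_B(p)\bigr)}{\max\bigl(t_A(p),\, t_B(p)\bigr)}}_{\text{similarity component}}
\;\times\;
\underbrace{\frac{1}{2}\left(
  \frac{\sum_{p \in L} t_A(p)}{T_A} + \frac{\sum_{p \in L} t_B(p)}{T_B}
\right)}_{\text{importance component}}
\]
```

Both factors lie in [0, 1], so the product does too: two sessions score high only if they share a long common path and spend comparable time on each shared page.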

6
Session Clustering

Find the pairwise similarity values between all pairs of sessions; record only similarities greater than a threshold q
Incrementally construct the similarity graph Gq:
  the vertices are the sessions, the edge weights are the session similarity values
  no isolated vertices (discard outliers)
Balanced graph partitioning:
  we used METIS [Karypis, Kumar]
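
A minimal sketch of the graph-construction step, assuming any pairwise similarity function is supplied by the caller. The function name `build_similarity_graph` and the edge-dictionary representation are illustrative choices, not taken from the slides, and the partitioning step itself (METIS in the talk) is not shown.

```python
from itertools import combinations

def build_similarity_graph(items, similarity, threshold):
    """Build a thresholded similarity graph and separate out isolated vertices.

    Returns (edges, vertices, outliers), where edges maps an index pair
    (i, j) with i < j to its similarity value.
    """
    edges = {}
    # Compare every pair once; keep only edges heavier than the threshold.
    for i, j in combinations(range(len(items)), 2):
        w = similarity(items[i], items[j])
        if w > threshold:
            edges[(i, j)] = w
    # A vertex with no surviving edge is isolated and treated as an outlier.
    connected = {v for pair in edges for v in pair}
    outliers = [i for i in range(len(items)) if i not in connected]
    return edges, sorted(connected), outliers

# Toy usage with numbers standing in for sessions (hypothetical similarity).
toy_similarity = lambda a, b: 1.0 / (1.0 + abs(a - b))
edges, vertices, outliers = build_similarity_graph([1.0, 1.1, 5.0], toy_similarity, 0.5)
print(vertices, outliers)  # → [0, 1] [2]
```

The all-pairs loop is quadratic in the number of sessions, which is consistent with the talk's remark that forming the similarity graph is slow.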

7
The Cluster Space

Given: each session is assigned to one of k clusters (sets)
→ Sessions of a user are distributed among the k sets
User vector u = [u1 u2 ... uk]^T, where ui = number of sessions of the user belonging to cluster i
Stage II: User Clustering
  find pairwise similarity values using the extended Jaccard measure
  partition the similarity graph
Gives l user clusters and a set of outlier users
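
A small sketch of the cluster-space mapping and the user similarity. It assumes session clusters are labelled 0..k-1 and uses the extended Jaccard (Tanimoto) coefficient in its commonly stated form, x·y / (|x|² + |y|² − x·y); the slides name the measure but do not spell out the formula, and the function names are illustrative.

```python
from collections import Counter

def user_vector(session_clusters, k):
    """Count how many of a user's sessions fall in each of the k session clusters."""
    counts = Counter(session_clusters)
    return [counts.get(c, 0) for c in range(k)]

def extended_jaccard(x, y):
    """Extended Jaccard similarity between two non-negative count vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    denom = sum(a * a for a in x) + sum(b * b for b in y) - dot
    # The denominator is zero only when both vectors are all zeros.
    return dot / denom if denom else 0.0

# A user with five sessions assigned to clusters 0, 2, 2, 1, 2 (k = 4).
u = user_vector([0, 2, 2, 1, 2], 4)
print(u)                                 # → [1, 1, 3, 0]
print(extended_jaccard([1, 0], [1, 1]))  # → 0.5
```

Because every user becomes a fixed-length vector, any standard vector clustering method can be applied in the second stage.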

8
The Dataset: Sulekha.com
9
Dataset details

Logs over a one month period
Raw log size: 184 MB
453,953 files accessed
37,753 sessions in all
23,310 sessions after some preprocessing/filtering
2,493 users

10
Results: Session Clusters

Cluster 1: interest in coffeehouse, contests
- (/,12)(/movies,6)(/contests,178)
- (/contests,142)
- (/coffeehouse,5)(/contests,183)
- (/contests,172)
- (/,10)(/contests,143)

Cluster 2: glance through home, articles
- (/,22)(/articles,22)
- (/,20)(/articles,20)
- (/,21)(/articles,21)
- (/,19)(/articles,19)
- (/,20)(/articles,19)

Cluster 3: interest in author, articles
- (/,148)(/authors,6)(/articles,77)
- (/authors,290)(/articles,290)
- (/authors,295)(/articles,295)
- (/,33)(/authors,90)(/articles,475)
- (/,32)(/authors,91)(/articles,425)

Cluster 4: read articles
- (/,39)(/articles,98)(/misc,17)(/articles,2649)
- (/,9)(/articles,2666)
- (/authors,26)(/articles,2561)
- (/misc,20)(/articles,77)(/misc,32)(/articles,43)(/authors,16)(/articles,2373.1)
11
Results: User Clusters

user (128.194.xxx.xxx)
(/authors,3)(/articles,129)
(/authors,8)(/articles,8)
(/authors,80)(/articles,2141)
user (209.30.xxx.xxx)
(/home,77)(/articles,111)(/authors,93)(/articles,629)(/misc,58)(/coffeehouse,75)(/wo-men,967)
(/articles,2627)
user (171.68.xxx.xxx)
(/home,323)(/articles,24)(/authors,45)(/articles,1290)
A user cluster: people who read the articles

12
Results: User Clusters

user (152.170.xxx.xxx)
(/home,21)(/wo-men,1075)(/philosophy,52)
user (209.244.xxx.xxx)
(/home,5)(/coffeehouse,94)(/wo-men,75)(/movies,75)
(/wo-men,31)
(/home,52)(/philosophy,67)(/wo-men,955)(/philosophy,26)(/coffeehouse,382)(/biztech,298)(/philosophy,290)
(/home,17)(/coffeehouse,12)(/wo-men,15)(/personal,6)(/biztech,94)(/coffeehouse,2)(/philosophy,1093)
A user cluster: people interested in wo-men, philosophy, coffeehouse

13
Results: User Clusters

user (216.154.xxx.xxx)
(/coffeehouse,12)(/biztech,25)(/books,48)
(/coffeehouse,13)(/biztech,26)(/books,19)
user (204.220.xxx.xxx)
(/coffeehouse,162)
(/coffeehouse,40)
user (32.100.xxx.xxx)
(/coffeehouse,12)(/contests,12)
(/coffeehouse,43)(/contests,44)
A user cluster: people interested in coffeehouse; bookmarked it!

14
Result Visualization using CLUSION [Strehl, Ghosh 01]
[Figure panels: Sessions, Users]
15
Conclusions

Segmentation: a basic pre-processing step for Web mining
Similarity measure + cluster space concept: applicable to clustering sets of any data structure
For certain websites, time spent on the pages matters
  not handled by current commercial tools
Outlier detection before clustering is important
Results QA-ed by human subjects
  results for clusters and outliers at both levels were subjectively good
  no good way to find cluster quality analytically
Formation of the similarity graph is a slow process

16
Future Work

Improve the present method by
using cluster seeds for cluster growing
using alternative clustering algorithms for each
stage
studying the effect of thresholds, number of
clusters on performance
studying the importance of order of page-visits
studying the importance of balanced clustering

17
Backup
18
Issues: Choice of Parameters

Number of session clusters, k, should be chosen appropriately
Thresholds for forming the session and user similarity graphs
  threshold value should be chosen after looking at the distribution of edge weights
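
One hedged way to look at the edge-weight distribution before fixing a threshold is to tabulate how many candidate edges would survive at several cut-offs. The helper below is a hypothetical illustration, not part of the original method, and the weights are made-up values.

```python
def fraction_of_edges_kept(weights, thresholds):
    """For each candidate threshold, return the fraction of edge weights above it."""
    n = len(weights)
    if n == 0:
        return [0.0 for _ in thresholds]
    return [sum(w > t for w in weights) / n for t in thresholds]

# Illustrative pairwise similarity values (not from the Sulekha data).
weights = [0.1, 0.2, 0.6, 0.9]
print(fraction_of_edges_kept(weights, [0.0, 0.5, 0.95]))  # → [1.0, 0.5, 0.0]
```

A threshold near a sharp drop in this curve keeps the graph sparse while retaining the strongest similarities.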

19
Related Work

Research in Web mining
Extraction of navigational patterns [Spiliopoulou, Faulstich]
Ordering relationships [Mannila, Meek]
Surfing prediction [Pitkow, Pirolli]
Clustering web usage sessions [Fu, Sandhu, Shih]

20
Example

Sessions
Session1: (a,8) (b,100) (d,8) (c,5) (e,23) (a,5)
Session2: (b,5) (d,12) (f,1) (a,7) (c,5)
LCS pages: (b)(d)(c)
Corresponding index and time sequences (0-based positions)
Index1: (1)(2)(3), Time1: (100)(8)(5)
Index2: (0)(1)(4), Time2: (5)(12)(5)
Similarity on each LCS page = min/max of the two times
Similarity on page b = 5/100 = 0.05
Similarity on page d = 8/12 = 0.67
Similarity on page c = 5/5 = 1.00

21
Example (contd.)

The similarity component
(0.05 + 0.67 + 1.00)/3 = 0.57
The importance component
Fraction of time spent in the LCS by Session1 = 113/149 = 0.76
Fraction of time spent in the LCS by Session2 = 22/30 = 0.73
The mean = (0.76 + 0.73)/2 = 0.75
The overall similarity
0.57 x 0.75 = 0.43
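
The worked example can be checked with a short sketch. It assumes a standard dynamic-programming LCS over page IDs and positive page times; the function names are illustrative. When several LCSs of equal length exist (here (b)(d)(a) is also one), the slides do not say which is chosen, so the tie-breaking below is an assumption that happens to select (b)(d)(c).

```python
from typing import List, Tuple

Session = List[Tuple[str, float]]  # (page id, time spent on the page)

def lcs_alignment(s1: Session, s2: Session) -> Tuple[List[int], List[int]]:
    """Return 0-based positions in s1 and s2 of one LCS of their page ids."""
    n, m = len(s1), len(s2)
    # dp[i][j] = length of the LCS of the suffixes s1[i:] and s2[j:]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if s1[i][0] == s2[j][0]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    idx1, idx2 = [], []
    i = j = 0
    while i < n and j < m:
        if s1[i][0] == s2[j][0]:
            idx1.append(i)
            idx2.append(j)
            i += 1
            j += 1
        elif dp[i + 1][j] > dp[i][j + 1]:
            i += 1
        else:
            j += 1  # on ties, advance in s2 (an arbitrary, assumed choice)
    return idx1, idx2

def session_similarity(s1: Session, s2: Session) -> float:
    """(average min/max time ratio on LCS pages) x (mean fraction of time in LCS)."""
    idx1, idx2 = lcs_alignment(s1, s2)
    total1 = sum(t for _, t in s1)
    total2 = sum(t for _, t in s2)
    if not idx1 or total1 == 0 or total2 == 0:
        return 0.0
    t1 = [s1[i][1] for i in idx1]
    t2 = [s2[j][1] for j in idx2]
    # Two zero-length visits to the same page are treated as identical.
    ratios = [min(a, b) / max(a, b) if max(a, b) > 0 else 1.0
              for a, b in zip(t1, t2)]
    similarity = sum(ratios) / len(ratios)
    importance = (sum(t1) / total1 + sum(t2) / total2) / 2
    return similarity * importance

session1 = [("a", 8), ("b", 100), ("d", 8), ("c", 5), ("e", 23), ("a", 5)]
session2 = [("b", 5), ("d", 12), ("f", 1), ("a", 7), ("c", 5)]
print(lcs_alignment(session1, session2))                 # → ([1, 2, 3], [0, 1, 4])
print(round(session_similarity(session1, session2), 2))  # → 0.43
```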

22
Issues: Session Resolution

Generate coarse-resolution paths making use of the concept hierarchy of the website
Reduces computation; increases interpretability of results

Original path: (/authors/ramesh_mahadevan.html,3) (/articles/rm_phattas.html,75) (/articles/rm_desidads.html,39)
Concept-level path: (/authors,3) (/articles,114)

Original path: (/authors/arun_sampath.html,109) (/philosophy/messages/1951.html,102) (/philosophy/messages/1953.html,46) (/,3) (/philosophy/messages/1954.html,69)
Concept-level path: (/authors,109) (/philosophy,148) (/,3) (/philosophy,69)
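
The coarsening illustrated above can be sketched as follows, assuming the concept of a page is its top-level directory and that consecutive visits within the same concept are merged by summing their times. That rule is inferred from the two examples; the slides do not state it explicitly, and the function names are illustrative.

```python
def concept_of(path):
    """Map a URL path to its top-level section, e.g. /articles/x.html -> /articles."""
    parts = [p for p in path.split("/") if p]
    # Drop a trailing file name so that /index.html maps to the root "/".
    if parts and "." in parts[-1]:
        parts = parts[:-1]
    return "/" + parts[0] if parts else "/"

def coarsen(session):
    """Collapse a page-level session into a concept-level session."""
    out = []
    for path, seconds in session:
        concept = concept_of(path)
        if out and out[-1][0] == concept:
            # Same section as the previous visit: accumulate the time.
            out[-1] = (concept, out[-1][1] + seconds)
        else:
            out.append((concept, seconds))
    return out

original = [
    ("/authors/arun_sampath.html", 109),
    ("/philosophy/messages/1951.html", 102),
    ("/philosophy/messages/1953.html", 46),
    ("/", 3),
    ("/philosophy/messages/1954.html", 69),
]
print(coarsen(original))
# → [('/authors', 109), ('/philosophy', 148), ('/', 3), ('/philosophy', 69)]
```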
23
Comments

Results QA-ed by human subjects
  results for clusters and outliers at both levels were subjectively good
  no good way to find cluster quality analytically
Clustering algorithms for the two stages
  Stage I: graph partitioning works well for large sparse graphs, so it is desirable in this stage
  Stage II: since the space is not high-dimensional, any reasonable clustering algorithm should be adequate
Cluster space
  gives a general framework for mapping any non-vector clustering problem to an equivalent vector clustering problem