Title: Mining the Structure of User Activity using Cluster Stability
1Mining the Structure of User Activity using
Cluster Stability
- Jeffrey Heer, Ed H. Chi
- Palo Alto Research Center, Inc.
2002.04.13 SIAM Web Analytics Workshop
2Motivation
- Want to understand the composition of web user traffic.
- What are users' information goals?
- Leads to improved site design, content, and performance.
- Strategy: Content, Usage, and Topology
3User Session Clustering
- Cluster user sessions into common activities such as product browsing and job seeking.
- A number of approaches have been proposed (Shahabi97, Fu99, Banerjee01, and Heer01).
- These require specifying the number of clusters in advance or browsing a large cluster hierarchy.
- Can we automatically infer the structure of user activity?
4Overview
- System Description
- Clustering Method
- Stability Analysis
- Case Studies
- Discussion
5System Description
- Use web access logs and web site content to generate a user profile for each site visitor.
- How: Build a multi-featured vector space model of user activity (multi-modal clustering).
- Group user profiles into common activities like product browsing and job seeking.
- How: Apply clustering algorithms to user profiles.
6System Description
- Process Access Logs
- Crawl Web Site
- Build Document Model
- Extract User Sessions
- Build User Profiles
- Cluster Profiles
[Pipeline diagram: Web Crawl, Access Logs, Document Model, User Sessions, User Profiles, Clustered Profiles]
7Document Model
- Web site is crawled; relevant pages listed in web logs are retrieved.
- Retrieved data is represented as feature vectors:
  - Content: TF.IDF weighted keyword vector
  - URL: Tokenized and TF.IDF weighted
  - Inlinks: Column vectors in topology matrix
  - Outlinks: Row vectors in topology matrix
- These are concatenated to form a single multi-modal vector $P_d$ for each document.
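The four modalities above can be sketched as one sparse vector per document. This is a minimal illustration, not the authors' implementation: the function names, the `tf * log(N / df)` weighting variant, the URL tokenizer, and the binary link weights are all assumptions.

```python
import math
import re
from collections import Counter

def tfidf(token_lists):
    """One {term: tf * log(N / df)} dict per token list (one common TF.IDF variant)."""
    n = len(token_lists)
    df = Counter(t for tokens in token_lists for t in set(tokens))
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(tokens).items()}
            for tokens in token_lists]

def document_vectors(pages, links):
    """pages: {url: content tokens}; links: set of (source_url, target_url).

    Returns {url: sparse vector}. Keys are (modality, feature) pairs, so the
    four modalities sit side by side in one concatenated vector.
    """
    urls = sorted(pages)
    content = tfidf([pages[u] for u in urls])
    # Assumed tokenizer: split URLs on slashes, dots, underscores, hyphens.
    url_terms = tfidf([[t for t in re.split(r"[/._-]+", u) if t] for u in urls])
    vectors = {}
    for i, u in enumerate(urls):
        vec = {("content", t): w for t, w in content[i].items()}
        vec.update({("url", t): w for t, w in url_terms[i].items()})
        for src, dst in links:
            if dst == u:
                vec[("inlink", src)] = 1.0   # column u of the topology matrix
            if src == u:
                vec[("outlink", dst)] = 1.0  # row u of the topology matrix
        vectors[u] = vec
    return vectors
```

Keying features by modality keeps the modalities separable, which is convenient if they are later weighted differently.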
8User Sessions
- Sessions are extracted from web logs and represented by an attribute vector.
  - For path i: A→B→D, $s_i = \langle 1,1,0,1,0 \rangle$
  - (For a site with 5 documents $\langle A,B,C,D,E \rangle$)
- Experimented with various weightings for s, including viewing times and path position.
- Viewing times achieved highest accuracy in empirical studies.
  - A(10s)→B(20s)→D(15s), $s_i = \langle 10,20,0,15,0 \rangle$
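The two session encodings on this slide can be reproduced with a few lines. The function name and the choice to sum viewing time over repeat visits are assumptions made for illustration.

```python
def session_vector(path, documents, weighted=True):
    """path: list of (document, viewing_seconds); documents: ordered site documents."""
    index = {doc: i for i, doc in enumerate(documents)}
    vec = [0] * len(documents)
    for doc, seconds in path:
        if weighted:
            vec[index[doc]] += seconds  # assumption: repeat visits accumulate time
        else:
            vec[index[doc]] = 1         # binary "was visited" attribute
    return vec

docs = ["A", "B", "C", "D", "E"]
path = [("A", 10), ("B", 20), ("D", 15)]
print(session_vector(path, docs, weighted=False))  # [1, 1, 0, 1, 0]
print(session_vector(path, docs))                  # [10, 20, 0, 15, 0]
```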
9User Profiles
- User profiles are created by linearly combining
the document and session models
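The exact combination formula is not preserved in this transcript. One natural reading, assumed here, is a session-weighted sum of document vectors: each visited document's vector is scaled by its session weight and the results are added. The function and argument names are hypothetical.

```python
def user_profile(session, doc_vectors):
    """Session-weighted sum of document vectors (assumed form of the combination).

    session: {document: session weight, e.g. viewing time}
    doc_vectors: {document: {feature: value}}
    """
    profile = {}
    for doc, weight in session.items():
        for feature, value in doc_vectors[doc].items():
            profile[feature] = profile.get(feature, 0.0) + weight * value
    return profile

profile = user_profile(
    {"A": 10, "B": 20},
    {"A": {"denim": 1.0, "design": 2.0}, "B": {"design": 1.0}},
)
```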
10Clustering
- Similarity metric is a weighted cosine measure.
- Clustering is then done by recursive bisection, using K-Means to perform the bisections (Karypis00, Zhao01). The corresponding criterion function is [equation not preserved].
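A rough sketch of the two ideas on this slide follows. It assumes the weighted cosine is a weighted sum of per-modality cosines and that the largest cluster is the one bisected at each step. The seeding rule, the names, and the stopping rule are illustrative choices, and the sketch does not implement the criterion function from the cited work.

```python
import math

def cosine(a, b):
    """Cosine between two sparse {feature: value} vectors."""
    dot = sum(v * b.get(f, 0.0) for f, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def weighted_cosine(p, q, weights):
    """p, q: {modality: sparse vector}; weights: {modality: w}, assumed to sum to 1."""
    return sum(w * cosine(p.get(m, {}), q.get(m, {})) for m, w in weights.items())

def centroid(profiles):
    """Per-modality sum of the profiles; cosine ignores the overall scale."""
    center = {}
    for p in profiles:
        for m, vec in p.items():
            cm = center.setdefault(m, {})
            for f, v in vec.items():
                cm[f] = cm.get(f, 0.0) + v
    return center

def bisect(profiles, weights, iters=10):
    """2-means under weighted cosine, seeded with the first profile and the
    profile least similar to it (a deterministic, illustrative seeding)."""
    seed_a = profiles[0]
    seed_b = min(profiles, key=lambda p: weighted_cosine(seed_a, p, weights))
    centers = [seed_a, seed_b]
    groups = [list(profiles), []]
    for _ in range(iters):
        groups = [[], []]
        for p in profiles:
            closer_to_a = (weighted_cosine(p, centers[0], weights)
                           >= weighted_cosine(p, centers[1], weights))
            groups[0 if closer_to_a else 1].append(p)
        if not groups[0] or not groups[1]:
            break
        centers = [centroid(g) for g in groups]
    return groups

def recursive_bisection(profiles, k, weights):
    """Split the largest cluster in two until k clusters exist (or no split is possible)."""
    clusters = [list(profiles)]
    while len(clusters) < k:
        clusters.sort(key=len)
        largest = clusters.pop()
        left, right = bisect(largest, weights) if len(largest) > 1 else (largest, [])
        if not left or not right:
            clusters.append(largest)
            break
        clusters += [left, right]
    return clusters
```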
11User population breakdown
Keywords describing user groups
Frequent documents accessed by group
Detailed stats
12Clustering Evaluation
- Ran user study on www.xerox.com to evaluate effectiveness of method (Heer02).
- 15 tasks, 5 task categories (104 user traces)
- Using certain modalities and weighting schemes we were able to achieve accuracies as high as 99%!
- Found that page content and page viewing time significantly contribute to clustering accuracy.
13OK, Great, but
- In real-world applications the number of clusters is an undetermined variable.
- Want a method for automatically choosing the number of clusters.
- After review of literature, decided to apply a cluster stability technique recently proposed by BenHur02.
14Measuring Clustering Similarity
- For a given clustering of a data set X, define $C_{ij} = 1$ if $x_i$, $x_j$ are in the same cluster and $i \neq j$, and $0$ otherwise.
- Two clusterings can then be compared using a dot product.
- This dot product can be normalized to get a cosine metric.
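The comparison described above can be sketched directly from the definition of the co-membership matrix. The function names are illustrative, and both label lists are assumed to cover the same points in the same order.

```python
import math

def comembership(labels):
    """C[i][j] = 1 if points i and j share a cluster and i != j, else 0."""
    n = len(labels)
    return [[1 if i != j and labels[i] == labels[j] else 0 for j in range(n)]
            for i in range(n)]

def dot(c1, c2):
    """Element-wise dot product of two co-membership matrices."""
    return sum(a * b for r1, r2 in zip(c1, c2) for a, b in zip(r1, r2))

def clustering_similarity(labels1, labels2):
    """Dot product normalized into a cosine; 1.0 means identical partitions."""
    c1, c2 = comembership(labels1), comembership(labels2)
    denom = math.sqrt(dot(c1, c1) * dot(c2, c2))
    return dot(c1, c2) / denom if denom else 0.0

print(clustering_similarity([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0 (same partition, relabeled)
print(clustering_similarity([0, 0, 1, 1], [0, 1, 0, 1]))  # 0.0 (no pair kept together)
```

Because the matrix only records whether two points share a cluster, the measure does not depend on how clusters are numbered.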
15Cluster Stability
- for k = 2 to kmax
  - for i = 1 to n
    - S_i = subsample of data set X using sampling ratio f
    - C_i = cluster(S_i, k)
  - Perform pairwise comparisons of all C_i, generating a distribution of similarity values for the current k
- Analyze the resulting distributions to determine the most stable clusterings.
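A minimal sketch of the loop above, under two assumptions: each pairwise comparison is restricted to the points that both subsamples contain (the co-membership matrices must index the same points), and `gap_cluster` is only a toy one-dimensional stand-in for the real clustering step. All names here are illustrative.

```python
import itertools
import math
import random

def similarity(labels_a, labels_b):
    """Cosine of co-membership indicators over one shared list of points."""
    same_a = same_b = both = 0
    for i, j in itertools.combinations(range(len(labels_a)), 2):
        a = labels_a[i] == labels_a[j]
        b = labels_b[i] == labels_b[j]
        same_a += a
        same_b += b
        both += a and b
    denom = math.sqrt(same_a * same_b)
    return both / denom if denom else 0.0

def stability(data, cluster, k_max, n=15, f=0.8, seed=0):
    """Return {k: list of pairwise similarities between n subsample clusterings}."""
    rng = random.Random(seed)
    size = int(f * len(data))
    distributions = {}
    for k in range(2, k_max + 1):
        runs = []
        for _ in range(n):
            idx = rng.sample(range(len(data)), size)   # subsample S_i
            labels = cluster([data[i] for i in idx], k)  # C_i = cluster(S_i, k)
            runs.append(dict(zip(idx, labels)))
        sims = []
        for run_a, run_b in itertools.combinations(runs, 2):
            shared = sorted(set(run_a) & set(run_b))   # compare on common points only
            sims.append(similarity([run_a[i] for i in shared],
                                   [run_b[i] for i in shared]))
        distributions[k] = sims
    return distributions

def gap_cluster(values, k):
    """Toy 1-D clusterer for the demo only: cut at the k-1 largest gaps."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    cuts = set(sorted(range(1, len(order)),
                      key=lambda r: values[order[r]] - values[order[r - 1]],
                      reverse=True)[:k - 1])
    labels, current = [0] * len(values), 0
    for rank, i in enumerate(order):
        if rank in cuts:
            current += 1
        labels[i] = current
    return labels

data = [0.0, 0.1, 0.2, 0.3, 0.4, 10.0, 10.1, 10.2, 10.3, 10.4]
dist = stability(data, gap_cluster, k_max=3, n=5, f=0.8)
```

With two well-separated groups, every similarity for k = 2 comes out at 1, while larger k splits a group at arbitrary places and the similarities can drop.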
16Example
Stability Analysis
- Example using 4 Gaussians (BenHur02)
- Graph on right shows plot of the cumulative similarity distribution
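A cumulative similarity distribution like the one plotted can be computed from a list of pairwise similarity values. This is a small sketch; the function name and the evenly spaced thresholds are assumptions.

```python
def cumulative_distribution(similarities, steps=10):
    """Return (threshold, fraction of similarities <= threshold) pairs on [0, 1]."""
    if not similarities:
        return []
    n = len(similarities)
    return [(t / steps, sum(s <= t / steps for s in similarities) / n)
            for t in range(steps + 1)]

cdf = cumulative_distribution([0.5, 1.0])
```

For a stable k, most similarity values sit near 1, so the curve stays low until the threshold approaches 1.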
17Case Study 1 www.xerox.com
User Study 8/2001, 104 sessions; n = 15, f = 0.8, k = 2 to 10
18Case Study 2 guir.berkeley.edu
Nov. 1-16, 2001; 7700 sessions; n = 30, f = 0.8, k = 2 to 15
19Case Study 2 guir.berkeley.edu
n = 30, f = 0.8, k = 3 to 7
20Cluster Contents (guir, k = 5)
- Cluster 1: DENIM Web Design Tool
- Cluster 2: Research projects and publications
- Cluster 3: Quiz-Bowl Competition Site
- Cluster 4: CSCW (1 project and 1 course)
- Cluster 5: Random pubs and project JavaDoc
- At higher values of k, more concentrated clusters appear
  - Personal pages (faculty, students) cluster emerges
  - JavaDoc separates into its own cluster
21Discussion
- Stability method shows some utility, but results are far from conclusive; perhaps web data is not particularly structured?
- User Goals
  - Does the user have a specific goal?
- Web Site Structure
  - Does the web site support user goals?
- Task Structure
  - Level of generality
22Possible Cases
- User has task - Site supports task
- www.xerox.com study
- User has task - Site doesn't support it
- User w/o singular goals - Well designed site
- Possibly guir.berkeley.edu
- User w/o task - Poorly designed site
23The Future
- More actionable empirical data
- Need more users over a range of sites
- Larger user study already begun
- Alternative approaches
- Human supervision
- Augmented stability metric / criterion function
- Other clustering methods
- Fuzzy Clustering
24Questions? Suggestions?