Mining the Structure of User Activity using Cluster Stability

... Structure of User Activity using Cluster Stability. Jeffrey Heer, Ed H. Chi. Palo Alto Research Center, Inc. 2002.04.13 SIAM Web Analytics Workshop. Motivation ...

Learn more at: http://jheer.org

1
Mining the Structure of User Activity using
Cluster Stability
  • Jeffrey Heer, Ed H. Chi
  • Palo Alto Research Center, Inc.

2002.04.13 SIAM Web Analytics Workshop
2
Motivation
  • We want to understand the composition of web user
    traffic.
  • What are users' information goals?
  • Leads to improved site design, content, and
    performance
  • Strategy: Content, Usage, and Topology

3
User Session Clustering
  • Cluster user sessions into common activities such
    as product browsing and job seeking.
  • A number of approaches have been proposed
    (Shahabi97, Fu99, Banerjee01, and Heer01)
  • These require specifying the number of clusters
    in advance or browsing a large cluster hierarchy.
  • → Can we automatically infer the structure of
    user activity?

4
Overview
  • System Description
  • Clustering Method
  • Stability Analysis
  • Case Studies
  • Discussion

5
System Description
  • Use web access logs and web site content to
    generate a user profile for each site visitor.
  • How: Build a multi-featured vector space model of
    user activity (multi-modal clustering).
  • Group user profiles into common activities like
    product browsing and job seeking
  • How: Apply clustering algorithms to user profiles

6
System Description
  1. Process Access Logs
  2. Crawl Web Site
  3. Build Document Model
  4. Extract User Sessions
  5. Build User Profiles
  6. Cluster Profiles

Diagram labels: Web Crawl, Access Logs, Document Model, User Sessions, User Profiles, Clustered Profiles

7
Document Model
  • The web site is crawled, and relevant pages
    listed in the web logs are retrieved.
  • Retrieved data is represented as feature vectors:
  • Content: TF.IDF weighted keyword vector
  • URL: tokenized and TF.IDF weighted
  • Inlinks: column vectors in topology matrix
  • Outlinks: row vectors in topology matrix
  • These are concatenated to form a single
    multi-modal vector P_d for each document.
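
A minimal sketch of how the four modalities might be concatenated into one vector per document. The tokenization, the TF.IDF variant (raw term count times log inverse document frequency), the link-matrix orientation, and all function names are assumptions for illustration, not the system's actual implementation.

```python
import math
from collections import Counter

def tfidf_vectors(token_lists):
    """Return one TF.IDF vector per token list over a shared, sorted vocabulary."""
    vocab = sorted({t for tokens in token_lists for t in tokens})
    n_docs = len(token_lists)
    # Document frequency: how many token lists contain each term.
    df = Counter(t for tokens in token_lists for t in set(tokens))
    vectors = []
    for tokens in token_lists:
        tf = Counter(tokens)
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vectors

def document_vectors(contents, urls, topology):
    """Concatenate content, URL, inlink, and outlink features per document.

    Assumed convention: topology[i][j] == 1 means document i links to
    document j, so row i holds outlinks and column i holds inlinks.
    """
    content_vecs = tfidf_vectors([c.lower().split() for c in contents])
    url_vecs = tfidf_vectors([u.lower().strip("/").split("/") for u in urls])
    n = len(contents)
    docs = []
    for i in range(n):
        inlinks = [topology[j][i] for j in range(n)]
        outlinks = list(topology[i])
        docs.append(content_vecs[i] + url_vecs[i] + inlinks + outlinks)
    return docs

# Hypothetical three-page site.
contents = ["printer ink products", "jobs careers", "printer support"]
urls = ["/products/ink", "/jobs", "/products/support"]
topology = [[0, 1, 1],
            [0, 0, 0],
            [1, 0, 0]]
P = document_vectors(contents, urls, topology)
```

Each entry of `P` plays the role of P_d: content terms first, then URL tokens, then inlinks and outlinks.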

8
User Sessions
  • Sessions are extracted from web logs, and
    represented by an attribute vector
  • For path i: A→B→D, s_i = <1,1,0,1,0>
  • (For a site with 5 documents <A,B,C,D,E>)
  • Experimented with various weightings for s,
    including viewing-times and path position.
  • Viewing times achieved highest accuracy in
    empirical studies.
  • A(10s)→B(20s)→D(15s), s_i = <10,20,0,15,0>
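
The two session encodings above can be sketched as follows. The function name and the choice to sum viewing times over repeat visits to the same page are assumptions made for illustration.

```python
def session_vector(path, site_docs, view_times=None):
    """Map a click path onto a vector with one slot per site document.

    Without view_times the vector is binary (page visited or not);
    with view_times each slot holds the seconds spent on that page.
    """
    index = {doc: i for i, doc in enumerate(site_docs)}
    vec = [0] * len(site_docs)
    for step, doc in enumerate(path):
        if view_times is None:
            vec[index[doc]] = 1
        else:
            # Assumption: repeat visits accumulate their viewing time.
            vec[index[doc]] += view_times[step]
    return vec

site = ["A", "B", "C", "D", "E"]
binary = session_vector(["A", "B", "D"], site)
timed = session_vector(["A", "B", "D"], site, view_times=[10, 20, 15])
```

Here `binary` is `[1, 1, 0, 1, 0]` and `timed` is `[10, 20, 0, 15, 0]`, matching the two examples on the slide.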

9
User Profiles
  • User profiles are created by linearly combining
    the document and session models
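
One plausible reading of this linear combination is a sum of document vectors weighted by the session vector, so that pages viewed longer contribute more to the profile. The sketch below assumes that reading; the function name and toy numbers are illustrative only.

```python
def user_profile(session, doc_vectors):
    """Combine document vectors linearly, weighted by the session vector.

    session[d] is the weight of document d in this session (for example
    its viewing time) and doc_vectors[d] is that document's feature vector.
    """
    dim = len(doc_vectors[0])
    profile = [0.0] * dim
    for weight, doc_vec in zip(session, doc_vectors):
        for k in range(dim):
            profile[k] += weight * doc_vec[k]
    return profile

# Toy site with three documents described by two features each.
docs = [[1, 0], [0, 1], [1, 1]]
profile = user_profile([10, 20, 0], docs)
```

The unvisited third document has weight 0 and so contributes nothing, giving `[10.0, 20.0]`.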

10
Clustering
  • Similarity metric is a weighted cosine
    measure
  • Clustering is then done by recursive bisection,
    using K-Means to perform the bisections
    (Karypis00, Zhao01), each guided by a
    corresponding criterion function.
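
A weighted cosine measure over multi-modal profiles could be sketched as a weighted sum of per-modality cosines. The slice layout, the weights, and the function names below are assumptions for illustration; the exact weighting scheme used in the system is not given in this transcript.

```python
import math

def cosine(a, b):
    """Plain cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def weighted_cosine(a, b, modalities, weights):
    """Sum per-modality cosines, each scaled by its modality weight.

    modalities maps a modality name to its (start, end) slice in the
    profile vector; weights maps the same names to non-negative weights.
    """
    return sum(weights[name] * cosine(a[start:end], b[start:end])
               for name, (start, end) in modalities.items())

# Two profiles that agree on content but not on URL features.
modalities = {"content": (0, 2), "url": (2, 4)}
weights = {"content": 0.75, "url": 0.25}
sim = weighted_cosine([1, 0, 1, 0], [1, 0, 0, 1], modalities, weights)
```

With these weights the profiles score 0.75: full agreement on content, none on URL.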

11
User population breakdown
Keywords describing user groups
Frequent documents accessed by group
Detailed stats
12
Clustering Evaluation
  • Ran a user study on www.xerox.com to evaluate
    the effectiveness of the method (Heer02).
  • 15 tasks, 5 task categories (104 user traces)
  • Using certain modalities and weighting schemes we
    were able to achieve accuracies as high as 99%!
  • Found that page content and page viewing time
    significantly contribute to clustering accuracy.

13
OK, Great, but...
  • In real-world applications the number of clusters
    is an undetermined variable.
  • Want a method for automatically choosing the
    number of clusters.
  • After a review of the literature, we decided to
    apply a cluster stability technique recently
    proposed by (BenHur02).

14
Measuring Clustering Similarity
  • For a given clustering of a data set X, define
    C_ij = 1 if x_i, x_j are in the same cluster and i ≠ j,
    0 otherwise
  • Two clusterings can then be compared using a dot
    product
  • This dot product can be normalized to get a
    cosine metric
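
The comparison can be sketched directly from cluster labels without building the C matrices. The dot product counts ordered pairs (i, j), i ≠ j, that are co-clustered in both clusterings; the function names are illustrative, not from the original work.

```python
import math

def pair_dot(labels_a, labels_b):
    """Dot product of the two co-membership matrices C.

    Counts ordered pairs (i, j) with i != j that share a cluster in
    both labelings, which equals sum_ij C_a[i][j] * C_b[i][j].
    """
    n = len(labels_a)
    return sum(1 for i in range(n) for j in range(n)
               if i != j
               and labels_a[i] == labels_a[j]
               and labels_b[i] == labels_b[j])

def clustering_cosine(labels_a, labels_b):
    """Normalize the dot product into a cosine similarity in [0, 1]."""
    denom = math.sqrt(pair_dot(labels_a, labels_a) * pair_dot(labels_b, labels_b))
    return pair_dot(labels_a, labels_b) / denom if denom else 0.0

same = clustering_cosine([0, 0, 1, 1], [1, 1, 0, 0])     # same grouping, relabeled
partial = clustering_cosine([0, 0, 1, 1], [0, 0, 0, 1])  # groupings partly agree
```

Because only co-membership matters, renaming cluster labels leaves the score unchanged.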
15
Cluster Stability
  • for k = 2 to k_max
      • for i = 1 to n
          • S_i = subsample of data set X using sampling
            ratio f
          • C_i = cluster(S_i, k)
      • Perform pairwise comparisons of all C_i,
        generating a distribution of similarity values
        for the current k
  • Analyze the resulting distributions to determine
    the most stable clusterings.
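
A minimal, self-contained sketch of this loop. Two assumptions to note: plain K-Means with farthest-first seeding stands in for the bisecting K-Means used in the system, and each pair of clusterings is compared only on the points their subsamples share. All names are illustrative.

```python
import itertools
import math
import random

def kmeans(points, k, iters=20):
    """Plain K-Means with farthest-first seeding (a stand-in clusterer)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # leave the center in place if its cluster is empty
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels

def similarity(labels_a, labels_b):
    """Cosine between co-membership matrices of two labelings."""
    n = len(labels_a)
    def dot(u, v):
        return sum(1 for i in range(n) for j in range(n)
                   if i != j and u[i] == u[j] and v[i] == v[j])
    denom = math.sqrt(dot(labels_a, labels_a) * dot(labels_b, labels_b))
    return dot(labels_a, labels_b) / denom if denom else 0.0

def stability(points, k_max, n, f, seed=0):
    """Return {k: list of pairwise similarities over n subsampled clusterings}."""
    rng = random.Random(seed)
    size = int(f * len(points))
    result = {}
    for k in range(2, k_max + 1):
        runs = []
        for _ in range(n):
            idx = sorted(rng.sample(range(len(points)), size))
            labels = kmeans([points[i] for i in idx], k)
            runs.append(dict(zip(idx, labels)))
        sims = []
        for a, b in itertools.combinations(runs, 2):
            shared = sorted(a.keys() & b.keys())  # points present in both subsamples
            sims.append(similarity([a[i] for i in shared], [b[i] for i in shared]))
        result[k] = sims
    return result

# Toy data: two tight, well-separated groups of ten points each.
points = [(i * 0.1, 0.0) for i in range(10)] + [(100 + i * 0.1, 0.0) for i in range(10)]
scores = stability(points, k_max=3, n=5, f=0.8)
```

On this toy data every similarity for k = 2 equals 1, since each subsample recovers the same two groups; a distribution concentrated near 1 is the signature of a stable k.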

16
Example
Stability Analysis
  • Example using 4 Gaussians (BenHur02)
  • Graph on right shows plot of the cumulative
    similarity distribution

17
Case Study 1: www.xerox.com
User Study 8/2001, 104 sessions; n = 15, f = 0.8, k = 2 to 10
18
Case Study 2: guir.berkeley.edu
Nov. 1-16, 2001, 7700 sessions; n = 30, f = 0.8, k = 2 to 15
19
Case Study 2: guir.berkeley.edu
n = 30, f = 0.8, k = 3 to 7
20
Cluster Contents (guir, k = 5)
  • Cluster 1: DENIM Web Design Tool
  • Cluster 2: Research projects and publications
  • Cluster 3: Quiz-Bowl Competition Site
  • Cluster 4: CSCW (1 project, 1 course)
  • Cluster 5: Random pubs, project JavaDoc
  • At higher values of k, more concentrated clusters
    appear
  • Personal pages (faculty, students) cluster
    emerges
  • JavaDoc separates into its own cluster

21
Discussion
  • Stability method shows some utility, but results
    are far from conclusive. Perhaps web data is not
    particularly structured?
  • User Goals
  • Does the user have a specific goal?
  • Web Site Structure
  • Does the web site support user goals?
  • Task Structure
  • Level of generality

22
Possible Cases
  • User has task - Site supports task
  • www.xerox.com study
  • User has task - Site doesn't support it
  • User w/o singular goals - Well designed site
  • Possibly guir.berkeley.edu
  • User w/o task - Poorly designed site

23
The Future
  • More actionable empirical data
  • Need more users over a range of sites
  • Larger user study already begun
  • Alternative approaches
  • Human supervision
  • Augmented stability metric / criterion function
  • Other clustering methods
  • Fuzzy Clustering

24
Questions? Suggestions?