Title: Adaptive Web Sites: Automatically Synthesizing Web Pages
1Adaptive Web Sites: Automatically Synthesizing Web Pages
- Mike Perkowitz and Oren Etzioni
- www.cs.washington.edu/homes/map/adaptive/
2Adaptive Web Sites
- Web sites that automatically reconfigure their organization and presentation by learning from user access patterns.
- (Perkowitz & Etzioni, IJCAI-97)
3Adaptive Web Sites
- Individual Customization: the site learns that you like sports
- Group Transformation: the site learns that most sports lovers also read Tank McNamara and cross-links them
4Group Transformations
- Our approach: history-based
- Previously: simple transformations (Perkowitz & Etzioni, WWW6)
- Goal: change in view
5machines.hyperreal.org
6Drum Machine Samples
8Index Page Synthesis
- Find groups of related documents at the site and create new pages linking to those documents.
- Input: web site, access log
- Output: pages of links to related pages
9Questions
- What links are on the index page?
- How are the contents ordered?
- What is the title?
- How are links labeled?
- How do we make the index comprehensive?
10Outline
- Motivation
- Plausible approaches
- Clustering
- Frequent sets
- Our approach: Cluster Mining
- Algorithm: PageGather
- Evaluation
11Clustering
- Voorhees-86, Willett-88, Rasmussen-92
- Similarity metric over documents
- Cluster items close together, far from others
- Algorithms
- Hierarchical Agglomerative Clustering (HAC)
- K-means clustering
12Clustering
- Visit: set of pages accessed by an individual
- Document: page
- Similarity: co-occurrence in visits
- Cluster → index page contents
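The mapping above (documents are pages, similarity is co-occurrence in visits) can be illustrated with a minimal sketch. The function name `cooccurrence_counts` and the sample page paths are hypothetical, and raw pair counts are only one plausible similarity measure; the slides do not specify the exact metric fed to the clustering algorithms.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(visits):
    """Count, for every unordered pair of pages, the visits containing both."""
    pair_counts = Counter()
    for visit in visits:
        # Deduplicate and sort so each pair is keyed consistently as (a, b), a < b.
        for a, b in combinations(sorted(set(visit)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts

# Hypothetical visits: each is the set of pages one visitor accessed.
visits = [
    {"/a", "/b", "/c"},
    {"/a", "/b"},
    {"/b", "/c"},
    {"/d"},
]
counts = cooccurrence_counts(visits)
```

A clustering algorithm such as HAC or K-means would then treat pages with high co-occurrence as close together.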
13Clustering Problems
- Clustering induces a partition over data
- Clustering can be slow
14Frequent Sets
- Agrawal, Imielinski, Swami-93
- Set of transactions, each a basket of items
- Find all frequently-occurring itemsets
- Algorithm: Apriori
15Frequent Sets
- Visit: set of pages accessed by an individual
- Item: page
- Transaction: visit
- Frequent set → index page contents
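As a rough illustration of treating visits as transactions and pages as items, here is a small level-wise frequent-set search in the spirit of Apriori. It is a sketch, not the implementation used in the talk: the name `frequent_itemsets`, the absolute `min_support` count, and the sample data are all assumptions.

```python
from itertools import combinations

def frequent_itemsets(visits, min_support):
    """Return {frozenset(pages): count} for page sets in >= min_support visits."""
    visits = [frozenset(v) for v in visits]
    current = {frozenset([p]) for v in visits for p in v}  # size-1 candidates
    frequent = {}
    k = 1
    while current:
        # Count how many visits contain each candidate set.
        counts = {c: sum(1 for v in visits if c <= v) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Build size-(k+1) candidates; keep only those whose size-k
        # subsets are all frequent (the Apriori pruning step).
        k += 1
        current = {
            a | b
            for a in level for b in level
            if len(a | b) == k
            and all(frozenset(s) in level for s in combinations(a | b, k - 1))
        }
    return frequent

# Hypothetical visits (transactions) over pages (items).
visits = [{"/a", "/b", "/c"}, {"/a", "/b"}, {"/b", "/c"}, {"/a", "/b", "/c"}]
result = frequent_itemsets(visits, 3)
```

Lowering `min_support` admits more candidates at every level, which is the source of the running-time concern raised on the next slide.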
16Frequent Sets Problems
- Frequent Item Problem
- Finds many similar itemsets
- Low minimum frequency → high running time
17Idea: Cluster Mining
- Find only high-quality clusters
- Not a partition
- Clusters may overlap
18The PageGather Algorithm
- Graph-based representation
- Nodes: pages
- Edges: between P1 and P2 if P(P1|P2) and P(P2|P1) are both high
- Fast and accurate
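The graph construction described above might look like the following sketch. It estimates both conditional probabilities from visit counts and links two pages only when the smaller of the two clears a cutoff. The function name `build_graph`, the `threshold` parameter and its value, and the sample visits are assumptions for illustration; the slides only say the probabilities must be "high".

```python
from collections import Counter
from itertools import combinations

def build_graph(visits, threshold):
    """Adjacency sets: an edge joins a and b when both P(a|b) and P(b|a)
    (estimated from visit counts) are at least `threshold`."""
    page_counts = Counter()
    pair_counts = Counter()
    for visit in visits:
        pages = sorted(set(visit))
        page_counts.update(pages)
        pair_counts.update(combinations(pages, 2))
    graph = {page: set() for page in page_counts}
    for (a, b), both in pair_counts.items():
        p_a_given_b = both / page_counts[b]
        p_b_given_a = both / page_counts[a]
        if min(p_a_given_b, p_b_given_a) >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph

# Hypothetical visits; 0.5 is an arbitrary illustrative cutoff.
visits = [{"a", "b"}, {"a", "b"}, {"a", "b", "c"}, {"c", "d"}, {"c", "d"}, {"c"}]
graph = build_graph(visits, 0.5)
```

Requiring both directions to be high keeps a very popular page from being linked to everything that merely co-occurs with it.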
19[Figure: sample raw web server access-log entries dated 1997/07/03, feeding the pipeline Log → Visits → Co-occurrence → Graph → Clique/CC → New Page]
20PageGather
- Implement with Cliques or CCs
- Find all candidates, return best
- Clique: maximal cliques of size ? k
- Clique and CC versions comparable in time and
performance
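The two ways of extracting candidate clusters from the graph can be sketched as follows, assuming the graph is a dict mapping each page to the set of its neighbours. The function names and the example graph are hypothetical. The clique search is a plain Bron-Kerbosch enumeration without pivoting, chosen for brevity, and it applies no size bound, since the bound on the slide is not recoverable here.

```python
def connected_components(graph):
    """Return the connected components of an undirected graph as sets."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(component)
    return components

def maximal_cliques(graph):
    """Enumerate all maximal cliques (basic Bron-Kerbosch, no pivoting)."""
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates to extend it, x: already explored.
        if not p and not x:
            cliques.append(r)
            return
        for v in list(p):
            expand(r | {v}, p & graph[v], x & graph[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(graph), set())
    return cliques

# Hypothetical graph: a triangle a-b-c, a pendant edge c-d, and an isolated page e.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
    "e": set(),
}
components = connected_components(graph)
cliques = maximal_cliques(graph)
```

In this example the component containing a, b, c, and d is larger than any clique inside it, and the cliques {a, b, c} and {c, d} overlap on page c, which is allowed because cluster mining does not require a partition.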
21Experiments
- machines.hyperreal.org
- Site gets 1200 visitors/day (10k hits)
- Site contains 2500 distinct documents
- Training: a month of access data
- Testing: ten days of data
22Performance Metric
- Are index pages helpful to users?
- How well do clusters predict user navigation?
- Q(C): given that a user visits one page in cluster C, how likely is she to visit any other?
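One simple way to estimate a quantity like Q(C) from held-out visits is sketched below: among test visits that touch at least one page of C, take the fraction that touch at least two. This estimator is an assumption made for illustration; the slides do not give the exact formula, and the function name `cluster_quality` and the sample data are hypothetical.

```python
def cluster_quality(cluster, visits):
    """Fraction of visits touching `cluster` that include two or more of its pages."""
    cluster = set(cluster)
    touching = [set(v) & cluster for v in visits if set(v) & cluster]
    if not touching:
        return 0.0  # no test visit reached the cluster
    return sum(1 for hit in touching if len(hit) >= 2) / len(touching)

# Hypothetical test visits.
test_visits = [{"a", "b"}, {"a"}, {"c", "d"}, {"d"}, {"b", "c", "e"}]
q = cluster_quality({"a", "b", "c"}, test_visits)
```

Measuring on visits that were not used to build the clusters matches the training and testing split described on the Experiments slide.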
24Cluster Mining vs. Clustering
- PageGather using
- Clique ? 10 clusters 105 min
- HAC ? 10 clusters 48 hours
- K-means ? 10 clusters 335 min
- HAC ? 8 clusters 2155 min
- (threshold, less data, mining)
25Cluster Mining vs. Clustering
- PageGather using
- Clique ? 10 clusters 105 min
- HAC ? 10 clusters 48 hours
- K-means ? 10 clusters 335 min
- HAC ? 7 clusters 29308 min
- (threshold, less data, mining)
26Cluster Mining vs. Clustering
[Chart: Q for the top 10 clusters]
29PageGather vs. Frequent Sets
- PG/Clique ? 10 clusters 105 min
- Apriori ? 10 frequent sets 141 min
30PageGather vs. Frequent Sets
[Chart: Q for the top 10 clusters]
31Contributions
- Motivating problem: Web page synthesis
- Method: Cluster mining
- well suited for discovery of coherent sets
- comparison to clustering, frequent sets
- Algorithm: PageGather
- graph-based, fast and accurate
32Clique vs. Conn-component
[Chart: Q for the top 10 clusters]
33Clique vs. Conn-component
- Comparable accuracy
- Clique finds fewer, smaller clusters than CC
- Clique more accurate (at first)
- Comparable running time (in practice)
34Future Directions
- Meta-Information to improve coherence
- Conceptual clustering
- Improve coherence
- Naming pages
- Cluster mining to generate association rules