Title: ICS 278 Data Mining, Lecture 17: Web Log Mining
1. ICS 278 Data Mining, Lecture 17: Web Log Mining
- Padhraic Smyth
- Department of Information and Computer Science
- University of California, Irvine
2. Outline
- Basic concepts in Web log data analysis
- Predictive modeling of Web navigation behavior
- Markov modeling methods
- Analyzing search engine data
- E-commerce aspects of Web log mining
3. Introduction
- Useful to study human digital behavior; e.g. search engine data can be used for:
- Exploration: e.g. how many queries per session?
- Modeling: e.g. is there any time-of-day dependence?
- Prediction: e.g. which pages are relevant?
- Applications
- Understand social implications of Web usage
- Design of better tools for information access
- E-commerce applications
4. How our Web navigation is recorded
- Web logs
- Record activity between a client browser and a specific Web server
- Easily available
- Can be augmented with cookies (provide a notion of state)
- Search engine records
- Text of queries, which responses were clicked on, etc.
- Client-side browsing records
- Produced for research purposes as part of a study
- Automatically recorded by client-side software
- Harder to obtain, but much more accurate than server-side logs
- Other sources
- Web site registration, purchases, email, etc.
- ISP recording of Web browsing
5. Web Server Log Files
- Server Transfer Log
- Transactions between a browser and server are logged
- IP address, time of the request
- Method of the request (GET, HEAD, POST)
- Status code returned by the server
- Size in bytes of the transaction
- Referrer Log
- Where the request originated
- Agent Log
- Browser software making the request (e.g. a spider)
- Error Log
- Requests that resulted in errors (e.g. 404)
6. W3C Extended Log File Format
7. Example of Web log entries
- Apache web log:

205.188.209.10 - - [29/Mar/2002:03:58:06 -0800] "GET /sophal/whole5.gif HTTP/1.0" 200 9609 "http://www.csua.berkeley.edu/sophal/whole.html" "Mozilla/4.0 (compatible; MSIE 5.0; AOL 6.0; Windows 98; DigExt)"
216.35.116.26 - - [29/Mar/2002:03:59:40 -0800] "GET /alexlam/resume.html HTTP/1.0" 200 2674 "-" "Mozilla/5.0 (Slurp/cat; slurp@inktomi.com; http://www.inktomi.com/slurp.html)"
202.155.20.142 - - [29/Mar/2002:03:00:14 -0800] "GET /tahir/indextop.html HTTP/1.1" 200 3510 "http://www.csua.berkeley.edu/tahir/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
202.155.20.142 - - [29/Mar/2002:03:00:14 -0800] "GET /tahir/animate.js HTTP/1.1" 200 14261 "http://www.csua.berkeley.edu/tahir/indextop.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
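To work with such entries programmatically, each line can be split into fields. Below is a minimal Python sketch for the combined log format shown above; the regular expression and field names are illustrative assumptions, and real-world logs often need a more forgiving parser.

    import re

    # Pattern for the Apache "combined" format shown above (a sketch).
    LOG_PATTERN = re.compile(
        r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
        r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
        r'(?P<status>\d{3}) (?P<size>\S+) '
        r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
    )

    def parse_log_line(line):
        """Return a dict of fields for one combined-format entry, or None."""
        match = LOG_PATTERN.match(line)
        return match.groupdict() if match else None

    entry = parse_log_line(
        '205.188.209.10 - - [29/Mar/2002:03:58:06 -0800] '
        '"GET /sophal/whole5.gif HTTP/1.0" 200 9609 '
        '"http://www.csua.berkeley.edu/sophal/whole.html" '
        '"Mozilla/4.0 (compatible; MSIE 5.0; AOL 6.0; Windows 98; DigExt)"'
    )
    print(entry['ip'], entry['url'], entry['status'])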
8. Routine Server Log Analysis
- Most and least visited web pages
- Entry and exit pages
- Referrals from other sites or search engines
- Which keywords were searched
- How many clicks/page views a page received
- Error reports, such as broken links
9. Visualization of Web Log Data over Time
10. Server Log Analysis
11. Descriptive Summary Statistics
- Histograms, scatter plots, time-series plots
- Very important!
- Helps to understand the big picture
- Provides marginal context for any model-building
- Models describe aggregate behavior, not individuals
- Challenging for Web log data
- Examples
- Session lengths (e.g., power laws)
- Click rates as a function of time, content
12. L = number of page requests in a single session from visitors to www.ics.uci.edu over one week in November 2002 (robots removed)
13. Best fit of a simple power-law model: log P(L) = -a log L + b, or P(L) = b L^(-a)
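An exponent of this kind can be estimated by ordinary least squares in log-log space. A minimal Python sketch, assuming the session lengths are available as a list of integers (the function name and the simple least-squares fit are illustrative choices; more careful power-law estimators exist):

    import numpy as np

    def fit_power_law(lengths):
        """Least-squares fit of log P(L) = -a log L + b to an
        empirical session-length distribution (a sketch)."""
        values, counts = np.unique(lengths, return_counts=True)
        probs = counts / counts.sum()            # empirical P(L)
        log_l, log_p = np.log(values), np.log(probs)
        slope, intercept = np.polyfit(log_l, log_p, 1)
        return -slope, intercept                 # exponent a, intercept b

    # Usage: a, b = fit_power_law(session_lengths)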
15. Web data measurement issues
- Important to understand how the data is collected
- Web data is collected automatically via software logging tools
- Advantage
- No manual supervision required
- Disadvantage
- Data can be skewed (e.g. due to the presence of robot traffic)
- Important to identify robots (also known as crawlers, spiders)
16. A time-series plot of ICS Website data
Number of page requests per hour as a function of time, from the www.ics.uci.edu Web server logs during the first week of April 2002.
17. Robot / human identification
- Robot requests can be identified by classifying page requests using a variety of heuristics (a simple filter is sketched below)
- e.g. some robots self-identify in the server logs (or fetch robots.txt)
- Robots tend to explore the entire website in breadth-first fashion
- Humans tend to access web pages in depth-first fashion
- Tan and Kumar (2002) discuss more techniques
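A filter combining heuristics of this kind might look as follows. This is a sketch only: the keyword list and thresholds are illustrative assumptions, not the classifiers of Tan and Kumar (2002), and `session` is assumed to be a list of entries from the parser sketched earlier.

    def looks_like_robot(session):
        """Heuristic robot detector (a sketch; thresholds illustrative)."""
        agents = " ".join(e["agent"].lower() for e in session)
        if any(w in agents for w in ("bot", "crawler", "spider", "slurp")):
            return True                  # self-identified robots
        if any(e["url"].endswith("robots.txt") for e in session):
            return True                  # robots typically fetch robots.txt
        # Robots tend to sweep a site broadly: many distinct pages, few repeats.
        urls = [e["url"] for e in session]
        if len(urls) > 50 and len(set(urls)) / len(urls) > 0.95:
            return True
        return False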
18. Page requests, caching, and proxy servers
- In theory, a browser requests a page from a Web server and the request is processed
- In practice, there are complications:
- Other users
- Browser caching
- Dynamic addressing in the local network
- Proxy server caching
19. Page requests, caching, and proxy servers
A graphical summary of how page requests from an individual user can be masked at various stages between the user's local computer and the Web server.
20. Page requests, caching, and proxy servers
- Web server logs are therefore not ideal in terms of a complete and faithful representation of individual page views
- There are heuristics to try to infer the true actions of the user:
- Path completion (Cooley et al. 1999), sketched below
- e.g. if the link B -> F exists but C -> F does not, then the session ABCF can be interpreted as ABCBF
- Anderson et al. 2001 give more heuristics
- In the general case, it is hard to know what the user actually viewed
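A minimal sketch of the path-completion idea, assuming the site's link graph is available as a dictionary. The function and its backtracking rule are illustrative, not Cooley et al.'s exact algorithm:

    def complete_path(session, links):
        """Path completion (a sketch): if no link leads from the previous
        page to the next request, assume the user pressed 'back' until
        reaching a page that does link to it. `links` maps each page to
        the set of pages it links to."""
        completed = [session[0]]
        for page in session[1:]:
            stack = list(completed)
            while stack and page not in links.get(stack[-1], set()):
                stack.pop()
                if stack:
                    completed.append(stack[-1])   # record revisit via 'back'
            completed.append(page)
        return completed

    # The slide's example: link B -> F exists, C -> F does not.
    links = {"A": {"B"}, "B": {"C", "F"}, "C": set()}
    print(complete_path(list("ABCF"), links))   # ['A', 'B', 'C', 'B', 'F']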
21. Identifying individual users from Web server logs
- Useful to associate specific page requests with specific individual users
- IP address is most frequently used
- Disadvantages:
- One IP address can belong to several users
- Dynamic allocation of IP addresses
- Better to use cookies
- Information in the cookie can be accessed by the Web server to identify an individual user over time
- Actions by the same user during different sessions can be linked together
22. Identifying individual users from Web server logs
- Commercial websites use cookies extensively
- 97% of users have cookies enabled permanently on their browsers (source: Amazon.com, 2003)
- However:
- There are privacy issues, so implicit user cooperation is needed
- Cookies can be deleted / disabled
- Another option is to enforce user registration
- High reliability
- Can discourage potential visitors
23. Sessionizing
- Time oriented (robust)
- e.g., by gaps between requests: no more than 25 minutes between successive requests (a sketch of this rule follows below)
- Navigation oriented (good for short sessions and when timestamps are unreliable)
- Referrer is the previous page in the session, or
- Referrer is undefined but the request is within 10 secs, or
- A link exists from the previous page to the current page in the web site
24. Client-side data
- Advantages of collecting data at the client side:
- Direct recording of page requests (eliminates masking due to caching)
- Recording of all browser-related actions by a user (including visits to multiple websites)
- More reliable identification of individual users (e.g. by login ID for multiple users on a single computer)
- Preferred mode of data collection for studies of navigation behavior on the Web
- Companies like comScore and Nielsen use client-side software to track home computer users
25. Client-side data
- Statistics like time per session and page-view duration are more reliable in client-side data
- Some limitations:
- Statistics like page-view duration still cannot be totally reliable, e.g. the user might go to fetch coffee
- Needs explicit user cooperation
- Typically recorded on home computers, so may not reflect a complete picture of Web browsing behavior
- Web surfing data can also be collected at intermediate points like ISPs and proxy servers
- Can be used to create user profiles and targeted advertising
26. Early studies from 1995 to 1997
- The earliest studies on client-side data are Catledge and Pitkow (1995) and Tauscher and Greenberg (1997)
- In both studies, data was collected by logging Web browser commands
- The population consisted of faculty, staff and students
- Both studies found:
- Clicking on hypertext anchors was the most common action
- Using the back button was the second most common action
27. Early studies from 1995 to 1997
- High probability of page revisitation (0.58-0.61)
- This is a lower bound, because page requests prior to the start of the studies are not accounted for
- Humans are creatures of habit?
- Content of the pages changed over time?
- Strong recency effect (a page that is revisited is usually a page that was visited in the recent past)
- Correlates with back button usage
- Similar repetitive actions are found in telephone number dialing, etc.
28. The Cockburn and McKenzie study from 2002
- Previous studies are relatively old; the Web has changed dramatically in the past few years
- Cockburn and McKenzie (2002) provide a more up-to-date analysis
- Analyzed the daily history.dat files produced by the Netscape browser for 17 users over about 4 months
- The population studied consisted of faculty, staff and graduate students
- The study found revisitation rates higher than the '94 and '95 studies (0.81)
- The time window is three times that of the past studies
29. The Cockburn and McKenzie study from 2002
- Is the revisitation rate less biased than in the previous studies?
- Has human behavior changed from an exploratory mode to a utilitarian mode?
- The more pages a user visits, the more requests there are for new pages
- The most frequently requested page for each user can account for a relatively large fraction of his/her page requests
- Useful to see the scatter plot of the number of distinct pages requested per user versus the total pages requested
30. The Cockburn and McKenzie study from 2002
The total number of pages requested versus the page vocabulary size (number of distinct pages) for each of the 17 users in the Cockburn and McKenzie (2002) study (log-log plot).
31. The Cockburn and McKenzie study from 2002
Bar chart of the ratio of the number of page requests for the most frequent page to the total number of page requests, for each of the 17 users in the Cockburn and McKenzie (2002) study.
32. Outline
- Basic concepts in Web log data analysis
- Predictive modeling of Web navigation behavior
- Markov modeling methods
- Analyzing search engine data
- E-commerce aspects of Web log mining
33. Markov models for page prediction
- The general approach is to use a finite-state Markov chain
- Each state can be a specific Web page or a category of Web pages
- If we are only interested in the order of visits (and not in time), each new request can be modeled as a transition between states
- Issues:
- Self-transitions
- Time-independence
34. Markov models for page prediction
- For simplicity, consider an order-dependent, time-independent finite-state Markov chain with M states
- Let s be a sequence of observed states of length L, e.g. s = ABBCAABBCCBBAA with three states A, B and C; st is the state at position t (1 ≤ t ≤ L). In general,
  P(s) = P(s1) ∏(t=2..L) P(st | s1, ..., st-1)
- The first-order Markov assumption is P(st | s1, ..., st-1) = P(st | st-1)
- This provides a simple generative model for producing sequential data (a small numerical sketch follows below)
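Under the first-order assumption, the probability of a whole sequence factorizes into an initial term and transition terms. A sketch in Python; the three-state distributions below are made-up numbers for illustration:

    import numpy as np

    def sequence_log_prob(seq, init, T, state_index):
        """log P(s) = log P(s1) + sum_t log P(st | st-1) for a
        first-order Markov chain (a sketch)."""
        idx = [state_index[s] for s in seq]
        logp = np.log(init[idx[0]])
        for prev, cur in zip(idx[:-1], idx[1:]):
            logp += np.log(T[prev, cur])
        return logp

    # Three states A, B, C, as in the slide's example sequence.
    state_index = {"A": 0, "B": 1, "C": 2}
    init = np.array([0.5, 0.3, 0.2])            # illustrative numbers
    T = np.array([[0.6, 0.3, 0.1],
                  [0.2, 0.5, 0.3],
                  [0.3, 0.3, 0.4]])
    print(sequence_log_prob("ABBCAABBCCBBAA", init, T, state_index))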
35. Markov models for page prediction
- If we denote Tij = P(st = j | st-1 = i), we can define an M x M transition matrix
- Properties:
- Strong first-order assumption
- Simple way to capture sequential dependence
- If each page is a state and there are W pages, this requires O(W^2) entries; W can be of the order of 10^5 to 10^6 for a university CS department
- To alleviate this, we can cluster the W pages into M clusters, each assigned a state in the Markov model
- Clustering can be done manually, based on the directory structure of the Web server, or automatically using clustering techniques
36. Markov models for page prediction
- Tij = P(st = j | st-1 = i) represents the probability that an individual user's next request will be from category j, given that the current request was from category i
- We can add E, an end state, to the model
- e.g. for three categories plus an end state, T is a 4 x 4 matrix
- E denotes the end of one sequence and the start of a new sequence
37. Markov models for page prediction
- A first-order Markov model assumes that the next state is based only on the current state
- Limitations:
- Doesn't capture longer-term memory
- We can try to capture more memory with a kth-order Markov chain
- Limitation: the O(M^(k+1)) parameters require an inordinate amount of training data
38. Parameter estimation for Markov model transitions
- Smoothed parameter estimates of the transition probabilities are
  T̂ij = (nij + aij) / (ni + ai), where ni = Σj nij and ai = Σj aij
  with nij the observed number of i-to-j transitions and aij prior pseudo-counts
- If nij = 0 for some transition (i, j), then instead of a parameter estimate of 0 (the ML estimate) we get aij / (ni + ai), allowing prior knowledge to be incorporated
- If nij > 0, we get a smooth combination of the data-driven information (nij) and the prior (see the sketch below)
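A sketch of the smoothed estimate, assuming the transition counts and prior pseudo-counts are given as M x M numpy arrays:

    import numpy as np

    def smoothed_transition_matrix(counts, alpha):
        """MAP estimate with a Dirichlet prior (a sketch):
        T[i, j] = (n_ij + a_ij) / (n_i + a_i), so transitions never
        observed (n_ij = 0) still get a nonzero, prior-driven
        probability. `alpha` entries should be positive."""
        numer = counts + alpha
        return numer / numer.sum(axis=1, keepdims=True)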
39. Parameter estimation for Markov models
- One simple way to set the prior parameters:
- Consider alpha as the effective sample size
- Partition the states into two sets: set 1 containing all states directly linked to state i, and the remaining states in set 2
- Assign uniform probability r/K to each state in set 2 (all set 2 states are equally likely)
- The remaining probability mass (1 - r) can be either uniformly assigned among set 1 elements or weighted by some measure
- Prior probabilities into and out of E can be set based on our prior knowledge of how likely we think a user is to exit the site from a particular state
40. Predicting page requests with Markov models
- Deshpande and Karypis (2001) propose schemes to prune the kth-order Markov state space
- These provide systematic but modest improvements
- Another approach is to use empirical smoothing techniques that combine models of different orders, from 1 to k (Chen and Goodman 1996)
41. Mixtures of Markov Chains
- Cadez et al. (2003) and Sen and Hansen (2003) replace the first-order Markov chain with a mixture of first-order Markov chains:
  P(st | st-1) = Σk P(st | st-1, c = k) P(c = k)
- where c is a discrete-valued hidden variable taking K values, Σk P(c = k) = 1, and P(st | st-1, c = k) is the transition matrix for the kth mixture component
- One interpretation: user behavior consists of K different navigation behaviors, described by the K Markov chains (membership computation sketched below)
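Given the component parameters, the membership probability P(c = k | s) of an observed sequence follows from Bayes' rule. A sketch (the parameter layout is an assumption of this example; `seq_idx` is the sequence encoded as state indices):

    import numpy as np

    def component_posteriors(seq_idx, weights, inits, trans):
        """P(c = k | s) for a mixture of first-order Markov chains
        (a sketch). weights[k] = P(c = k); inits[k] and trans[k] are
        component k's initial distribution and transition matrix."""
        K = len(weights)
        log_joint = np.zeros(K)
        for k in range(K):
            lp = np.log(weights[k]) + np.log(inits[k][seq_idx[0]])
            for prev, cur in zip(seq_idx[:-1], seq_idx[1:]):
                lp += np.log(trans[k][prev, cur])
            log_joint[k] = lp
        log_joint -= log_joint.max()       # for numerical stability
        post = np.exp(log_joint)
        return post / post.sum()           # normalized P(c = k | s)

This same computation serves as the E-step when fitting the mixture with EM, as discussed on the later "Learning Problem" slide.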
42. Modeling Web Page Requests with Markov Chain Mixtures
- MSNBC Web logs
- 2 million individuals per day
- Different session lengths per individual
- A difficult visualization and clustering problem
- WebCanvas
- Uses mixtures of Markov chains to cluster individuals based on their observed sequences
- Software tool: EM mixture modeling + visualization
44. From Web logs to sequences

128.195.36.195, -, 3/22/00, 10:35:11, W3SVC, SRVR1, 128.200.39.181, 781, 363, 875, 200, 0, GET, /top.html, -,
128.195.36.195, -, 3/22/00, 10:35:16, W3SVC, SRVR1, 128.200.39.181, 5288, 524, 414, 200, 0, POST, /spt/main.html, -,
128.195.36.195, -, 3/22/00, 10:35:17, W3SVC, SRVR1, 128.200.39.181, 30, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.195.36.101, -, 3/22/00, 16:18:50, W3SVC, SRVR1, 128.200.39.181, 60, 425, 72, 304, 0, GET, /top.html, -,
128.195.36.101, -, 3/22/00, 16:18:58, W3SVC, SRVR1, 128.200.39.181, 8322, 527, 414, 200, 0, POST, /spt/main.html, -,
128.195.36.101, -, 3/22/00, 16:18:59, W3SVC, SRVR1, 128.200.39.181, 0, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:54:37, W3SVC, SRVR1, 128.200.39.181, 140, 199, 875, 200, 0, GET, /top.html, -,
128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 17766, 365, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:07, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 1061, 382, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:39, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:56:03, W3SVC, SRVR1, 128.200.39.181, 1081, 382, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:56:04, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:56:33, W3SVC, SRVR1, 128.200.39.181, 0, 262, 72, 304, 0, GET, /top.html, -,
128.200.39.17, -, 3/22/00, 20:56:52, W3SVC, SRVR1, 128.200.39.181, 19598, 382, 414, 200, 0, POST, /spt/main.html, -,
(Figure: the raw log entries above are reduced to one sequence of page-category indices per user, e.g. User 1: 3 3 3 3 1 3 1 1 1 3 3 3 2 2 3 2, with similar sequences shown for Users 2-5.)
45. Clusters of Finite State Machines
(Figure: three clusters, each drawn as a stochastic finite state machine over page categories A, B, D and end state E, with different transition structures.)
46. Learning Problem
- Assumptions:
- The data is generated by K different groups
- Each group is described by a stochastic finite state machine (SFSM)
- i.e., a Markov model with an end state
- Given:
- A set of sequences, of different lengths, from different users
- Learn:
- A mixture of K different stochastic finite state machines
- Solution:
- EM is straightforward: fractional counts of transitions (see the sketch below)
- Efficient and accurate; scales as O(KN)
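A sketch of the corresponding M-step: the E-step posteriors (the `component_posteriors` sketch shown earlier) become fractional transition counts. The small smoothing constant is an illustrative choice, and updating the initial-state probabilities works analogously:

    import numpy as np

    def em_m_step(sequences, posteriors, K, M):
        """M-step for a mixture of Markov chains (a sketch): each
        sequence contributes its transition counts to every component,
        weighted by E-step memberships ('fractional counts').
        posteriors[n, k] = P(c = k | sequence n)."""
        counts = np.zeros((K, M, M))
        for n, seq in enumerate(sequences):
            for prev, cur in zip(seq[:-1], seq[1:]):
                counts[:, prev, cur] += posteriors[n]   # fractional counts
        counts += 1e-3                                  # small smoothing prior
        trans = counts / counts.sum(axis=2, keepdims=True)
        weights = posteriors.sum(axis=0) / posteriors.shape[0]
        return weights, trans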
47. Experimental Methodology
- Model training
- Fit 2 types of models:
- Mixtures of histograms
- Mixtures of finite state machines
- Train on a full day's worth of MSNBC Web data
- Model evaluation
- One-step-ahead prediction on unseen test data
- Test sequences come from a different day of Web logs
- Score: negative log probability (predictive entropy)
50. Timing Results
51. WebCanvas
- Software tool for Web log visualization
- Uses Markov mixtures to cluster data for display
- In use by msnbc.com administrators at Microsoft
- Also being applied to non-Web data
- Model-based visualization
- Random sample of actual sequences
- Interactive tiled windows displayed for visualization
- More effective than:
- Planar graphs
- The traffic-flow movie in Microsoft Site Server v3.0
52. WebCanvas (Cadez, Heckerman, et al., 2003)
53. Insights from WebCanvas
- From msnbc.com site administrators:
- Significant heterogeneity of behavior
- Relatively focused activity of many users
- Typically only 1 or 2 categories of pages
- Many individuals do not enter via the main page
- Detected problems with the weather page
- Missing transitions (e.g., tech <-> business)
54. Extensions
- Adding time-dependence
- Adding time-between-clicks and time-of-day effects
- Uncategorized Web pages
- Coupling page content with sequence models
- Modeling switching behaviors
- Allowing users to switch between models
- Individualized weights (hierarchical Bayes)
- Update: the WebCanvas tool will be part of the 2004 SQL Server release
55-57. Prediction with Markov mixtures
P(st+1 | s1:t) = Σk P(st+1, k | s1:t)
              = Σk P(st+1 | k, s1:t) P(k | s1:t)
              = Σk P(st+1 | k, st) P(k | s1:t)
where P(st+1 | k, st) is the prediction of the kth component and P(k | s1:t) is the membership weight based on the sequence history.
=> Predictions are a convex combination of the K component transition matrices, with weights based on the sequence history (see the sketch below).
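The same derivation in code, reusing `component_posteriors` from the earlier sketch (parameter layout as before):

    import numpy as np

    def predict_next(seq_idx, weights, inits, trans):
        """One-step-ahead prediction under a Markov mixture (a sketch):
        P(s_{t+1} | s_{1:t}) = sum_k P(s_{t+1} | k, s_t) P(k | s_{1:t}),
        a convex combination of the K component transition rows,
        weighted by the membership posterior from the history."""
        post = component_posteriors(seq_idx, weights, inits, trans)
        last = seq_idx[-1]
        rows = np.stack([trans[k][last] for k in range(len(weights))])
        return post @ rows    # mixture-weighted next-state distribution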
58. Related Work
- Mixtures of Markov chains
- Special case: Poulsen (1990)
- General case: Ridgeway (1997), Smyth (1997)
- Clustering of Web page sequences
- Non-probabilistic approaches (Fu et al., 1999)
- Markov models for prediction
- Anderson et al. (IJCAI, 2001)
- Mixtures of Markov chains outperform other sequential models for page-request prediction
59. Predicting page requests with Markov models
- K can be chosen by evaluating out-of-sample predictive performance based on:
- Accuracy of prediction
- Log probability score
- Entropy
- Other variations:
- Sen and Hansen 2003
- Position-dependent Markov models (Anderson et al. 2001, 2002)
- Zukerman et al. 1999
60. Modeling Clickrate Data
- Data
- 200k Alexa users, client-side, over 24 hours
- URLs requested are ignored
- The goal is to build a time-series model that characterizes user click rates
65. Markov-Poisson Model
- Doubly stochastic process
- Locally constant Poisson rate
- Indexed by M Markov states
- Fit a model with M = 3 states:
- Absence of a Web session
- Web session with slow click rate (about a 1-minute rate)
- Web session with rapid click rate (about a 10-second rate)
- Used hierarchical Bayes on individuals
66. Hierarchical Bayes Model
(Figure: a population prior p(λ | θ) sits above the individual rate parameters λ1, ..., λi, ..., λN, each with its own observed data sets D1, D2, ...)
Individuals with little data get shrunk to the prior; individuals with a lot of data are more data-driven (a simplified numerical sketch follows below).
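The shrinkage behavior can be illustrated with the conjugate Gamma-Poisson case. This is a simplified stand-in for the full Markov-Poisson model, with made-up prior parameters; in the real model the population-level parameters would be estimated from all individuals.

    def shrunk_rate(n_clicks, t_hours, a, b):
        """Posterior-mean click rate under a Gamma(a, b) prior on a
        Poisson rate (a sketch). With little data the estimate shrinks
        toward the prior mean a / b; with much data it approaches the
        individual's empirical rate n / t."""
        return (a + n_clicks) / (b + t_hours)

    # Illustrative prior: mean rate a / b = 0.5 clicks per hour.
    print(shrunk_rate(n_clicks=2, t_hours=1.0, a=5.0, b=10.0))     # ~0.64
    print(shrunk_rate(n_clicks=200, t_hours=100.0, a=5.0, b=10.0)) # ~1.86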
68. Prediction with Hierarchical Bayes
(Figure: for a new individual, the rate λ is predicted by combining the population prior p(λ | θ), learned from the historical training data of individuals 1..N, with the new individual's first few clicks.)
69. Extensions
- Couple click rate with purchase behavior
- Markov-type model moving through different states:
- Product viewing
- Detailed product information
- Reviews
- Combine states with click rate and page content to predict p(purchase | data up to time t)
- Can these Bayesian techniques be scaled up?
70. Outline
- Basic concepts in Web log data analysis
- Predictive modeling of Web navigation behavior
- Markov modeling methods
- Analyzing search engine data
- E-commerce aspects of Web log mining
71. Analysis of Search Engine Query Logs

Study | Sample size | Search engine | Time period
Lau & Horvitz | 4,690 of 1 million queries | Excite | Sep 1997
Silverstein et al. | 1 billion queries | AltaVista | 6 weeks in Aug-Sep 1998
Spink et al. (series of studies) | 1 million queries per period | Excite | Sep 1997, Dec 1999, May 2001
Xie & O'Hallaron | 110,000 queries | Vivisimo | 35 days, Jan-Feb 2001
Xie & O'Hallaron | 1.9 million queries | Excite | 8 hrs in a day, Dec 1999
72. Main Results
- The average number of terms in a query ranges from a low of 2.2 to a high of 2.6
- The most common number of terms in a query is 2
- The majority of users don't refine their query
- The number of users who viewed only a single page of results increased from 29% (1997) to 51% (2001) (Excite)
- 85% of users viewed only the first page of search results (AltaVista)
- 45% of queries (2001) are about Commerce, Travel, Economy, or People (was 20% in 1997)
- Queries about adult content or entertainment decreased from about 20% (1997) to around 7% (2001)
73. Xie and O'Hallaron Study (2002)
- Query length distributions (bars) vs. a Poisson model (dots and lines)
- All four studies produced a generally consistent set of findings about user behavior in a search engine context:
- Most users view relatively few pages per query
- Most users don't use advanced search features
74. Power-law Characteristics of Common Queries
- The frequency f(r) of queries with rank r follows a power law (a straight line in log-log space)
- 110,000 queries from Vivisimo
- 1.9 million queries from Excite
- There are strong regularities in the patterns of behavior in how we search the Web
75. Outline
- Basic concepts in Web log data analysis
- Predictive modeling of Web navigation behavior
- Markov modeling methods
- Analyzing search engine data
- E-commerce aspects of Web log mining
76. The next few slides are from Ronny Kohavi, director of data mining and personalization at Amazon.com. His full set of slides, along with related papers on e-commerce and data mining, is available online at http://robotics.stanford.edu/ronnyk/ronnyk-bib.html
77. E-Commerce
- Page-request Web logs combined with:
- Purchase (market-basket) information
- User address information (if they make a purchase)
- Demographic information (can be purchased)
- Emails to/from the customer
- The main focus here is to increase revenue
- Data mining is widely used at online commerce companies like Amazon
- This is a very rich source of problems for data mining:
- What products should we advertise to this person?
- Can we do dynamic pricing?
- If a person buys X, should we also suggest Y?
- Who are our best customers?
- etc.
78. Combining Data Sources
- Comprehensive collections of US consumer and telephone data are available via the internet
- Multi-sourced databases:
- Demographic, socioeconomic, and lifestyle information
- Information on most U.S. households
- Contributors' files refreshed a minimum of 3-12 times per year
- Data sources include County Real Estate Property Records, U.S. Telephone Directories, Public Information, Motor Vehicle Registrations, Census Directories, Credit Grantors, Public Records and Consumer Data, Driver's Licenses, Voter Registrations, Product Registration Questionnaires, Catalogers, Magazines, Specialty Retailers, Packaged Goods Manufacturers, Accounts Receivable Files, and Warranty Cards
- Much of this data can be accessed in real time once a customer self-identifies
79. Map of Worldwide Revenue
Although Debenhams' online site only ships in the UK, we see some revenue from the rest of the world.
- UK: 98.8%
- US: 0.6%
- Australia: 0.1%
(Map legend: Low / Medium / High)
NOTE: About 50% of the non-UK orders are wedding list purchases
80. Online Consumer Demographics
- Results from Blue Martini:
- People who have a travel and entertainment credit card are 48% more likely to be online shoppers (27% for people with a premium credit card)
- People whose home was built after 1990 are 45% more likely to be online shoppers
- Households with income over 100K are 31% more likely to be online shoppers
- People under the age of 45 are 17% more likely to be online shoppers
81. Demographics: Income
- A higher household income means you are more likely to be an online shopper
82. Demographics: Credit Cards
- The more credit cards, the more likely you are to be an online shopper
83. Example Web Traffic
(Figure annotations:)
- Sept. 11: note the significant drop in human traffic, but not in bot traffic
- Weekends
- Internal performance bot
- Registration at search engine sites
84. Product Affinities at MEC

Product | Association | Lift | Confidence | Website Recommended Products
Orbit Sleeping Pad | Orbit Stuff Sack | 222 | 37% | Cygnet Sleeping Bag, Primus Stove, Aladdin 2 Backpack
Bambini Crewneck Sweater (Children's) | Bambini Tights (Children's) | 195 | 52% | Yeti Crew Neck Pullover (Children's), Beneficial Ts Organic Long Sleeve T-Shirt (Kids)
Silk Long Johns (Women's) | Silk Crew (Women's) | 304 | 73% | Micro Check Vee Sweater, Volant Pants, Composite Jacket
Cascade Entrant Overmitts | Polartec 300 Double Mitts | 51 | 48% | Windstopper Alpine Hat, Volant Pants, Tremblant 575 Vest (Women's)

- Minimum support for the associations is 80 customers
- Confidence: 37% of people who purchased the Orbit Sleeping Pad also purchased the Orbit Stuff Sack
- Lift: people who purchased the Orbit Sleeping Pad were 222 times more likely to purchase the Orbit Stuff Sack compared to the general population (a computational sketch follows below)
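These statistics follow directly from basket counts. A sketch, with hypothetical counts chosen to reproduce the first row of the table:

    def association_stats(n_total, n_x, n_y, n_xy):
        """Support, confidence, and lift for the rule X -> Y
        (a sketch): n_total baskets, n_x containing X, n_y
        containing Y, n_xy containing both."""
        support = n_xy / n_total
        confidence = n_xy / n_x                   # P(Y | X)
        lift = confidence / (n_y / n_total)       # P(Y | X) / P(Y)
        return support, confidence, lift

    # Hypothetical counts consistent with the Orbit row:
    print(association_stats(n_total=60000, n_x=100, n_y=100, n_xy=37))
    # -> support ~0.0006, confidence = 0.37, lift = 222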
85. Customer Locations Relative to Retail Stores
- Heavy purchasing areas away from retail stores can suggest new retail store locations
- There are no stores in several hot areas; MEC is building a store in Montreal right now
(Map of Canada with store locations.)
86. Building the Customer Signature
- Building a customer signature is a significant effort, but well worth it
- A signature summarizes customer or visitor behavior across hundreds of attributes, many of which are specific to the site
- Once a signature is built, it can be used to answer many questions
- The mining algorithms will pick the most important attributes for each question
- Example attributes computed:
- Total visits and sales
- Revenue by product family
- Revenue by month
- Customer state and country
- Recency, frequency, monetary value
- Latitude/longitude from the customer's postal code