Title: ICS 278 Data Mining, Lecture 17: Web Log Mining
1. ICS 278 Data Mining, Lecture 17: Web Log Mining
- Padhraic Smyth
- Department of Information and Computer Science
- University of California, Irvine
2. Outline
- Basic concepts in Web log data analysis
- Predictive modeling of Web navigation behavior
- Markov modeling methods
- Analyzing search engine data
- E-commerce aspects of Web log mining
3. Introduction
- Useful to study human digital behavior; e.g. search engine data can be used for:
- Exploration: e.g. how many queries per session?
- Modeling: e.g. is there any time-of-day dependence?
- Prediction: e.g. which pages are relevant?
- Applications
- Understand social implications of Web usage
- Design of better tools for information access
- E-commerce applications
4. How our Web navigation is recorded
- Web logs
- Record activity between a client browser and a specific Web server
- Easily available
- Can be augmented with cookies (provide a notion of state)
- Search engine records
- Text of queries, which responses were clicked on, etc.
- Client-side browsing records
- Produced for research purposes as part of a study
- Automatically recorded by client-side software
- Harder to obtain, but much more accurate than server-side logs
- Other sources
- Web site registration, purchases, email, etc.
- ISP recording of Web browsing
5. Web Server Log Files
- Server Transfer Log
- Transactions between a browser and server are logged
- IP address, time of the request
- Method of the request (GET, HEAD, POST)
- Status code returned by the server
- Size in bytes of the transaction
- Referrer Log
- Where the request originated
- Agent Log
- Browser software making the request (e.g. a spider)
- Error Log
- Requests that resulted in errors (e.g. 404)
6. W3C Extended Log File Format
7. Example of Web log entries
- Apache web log:

205.188.209.10 - - [29/Mar/2002:03:58:06 -0800] "GET /sophal/whole5.gif HTTP/1.0" 200 9609 "http://www.csua.berkeley.edu/sophal/whole.html" "Mozilla/4.0 (compatible; MSIE 5.0; AOL 6.0; Windows 98; DigExt)"
216.35.116.26 - - [29/Mar/2002:03:59:40 -0800] "GET /alexlam/resume.html HTTP/1.0" 200 2674 "-" "Mozilla/5.0 (Slurp/cat; slurp@inktomi.com; http://www.inktomi.com/slurp.html)"
202.155.20.142 - - [29/Mar/2002:03:00:14 -0800] "GET /tahir/indextop.html HTTP/1.1" 200 3510 "http://www.csua.berkeley.edu/tahir/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
202.155.20.142 - - [29/Mar/2002:03:00:14 -0800] "GET /tahir/animate.js HTTP/1.1" 200 14261 "http://www.csua.berkeley.edu/tahir/indextop.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
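To work with such entries programmatically, each line can be split into fields. Below is a minimal Python sketch for the combined log format shown above; the regular expression and field names are illustrative assumptions, and real-world logs often need a more forgiving parser.

    import re

    # Pattern for the Apache "combined" format shown above (a sketch).
    LOG_PATTERN = re.compile(
        r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
        r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
        r'(?P<status>\d{3}) (?P<size>\S+) '
        r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
    )

    def parse_log_line(line):
        """Return a dict of fields for one combined-format entry, or None."""
        match = LOG_PATTERN.match(line)
        return match.groupdict() if match else None

    entry = parse_log_line(
        '205.188.209.10 - - [29/Mar/2002:03:58:06 -0800] '
        '"GET /sophal/whole5.gif HTTP/1.0" 200 9609 '
        '"http://www.csua.berkeley.edu/sophal/whole.html" '
        '"Mozilla/4.0 (compatible; MSIE 5.0; AOL 6.0; Windows 98; DigExt)"'
    )
    print(entry['ip'], entry['url'], entry['status'])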
8. Routine Server Log Analysis
- Most and least visited web pages
- Entry and exit pages
- Referrals from other sites or search engines
- Which keywords were searched
- How many clicks/page views a page received
- Error reports, such as broken links
9. Visualization of Web Log Data over Time
10. Server Log Analysis
11. Descriptive Summary Statistics
- Histograms, scatter plots, time-series plots
- Very important!
- Helps to understand the big picture
- Provides marginal context for any model-building
- Models describe aggregate behavior, not individuals
- Challenging for Web log data
- Examples
- Session lengths (e.g., power laws)
- Click rates as a function of time, content
12. L = number of page requests in a single session from visitors to www.ics.uci.edu over one week in November 2002 (robots removed)
13. Best fit of a simple power-law model: log P(L) = -a log L + b, or P(L) = b L^(-a)
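An exponent of this kind can be estimated by ordinary least squares in log-log space. A minimal Python sketch, assuming the session lengths are available as a list of integers (the function name and the simple least-squares fit are illustrative choices; more careful power-law estimators exist):

    import numpy as np

    def fit_power_law(lengths):
        """Least-squares fit of log P(L) = -a log L + b to an
        empirical session-length distribution (a sketch)."""
        values, counts = np.unique(lengths, return_counts=True)
        probs = counts / counts.sum()            # empirical P(L)
        log_l, log_p = np.log(values), np.log(probs)
        slope, intercept = np.polyfit(log_l, log_p, 1)
        return -slope, intercept                 # exponent a, intercept b

    # Usage: a, b = fit_power_law(session_lengths)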
15. Web data measurement issues
- Important to understand how the data is collected
- Web data is collected automatically via software logging tools
- Advantage
- No manual supervision required
- Disadvantage
- Data can be skewed (e.g. due to the presence of robot traffic)
- Important to identify robots (also known as crawlers, spiders)
16. A time-series plot of ICS Website data
Number of page requests per hour as a function of time, from the www.ics.uci.edu Web server logs during the first week of April 2002.
17. Robot / human identification
- Robot requests can be identified by classifying page requests using a variety of heuristics (a simple filter is sketched below)
- e.g. some robots self-identify in the server logs (or fetch robots.txt)
- Robots tend to explore the entire website in breadth-first fashion
- Humans tend to access web pages in depth-first fashion
- Tan and Kumar (2002) discuss more techniques
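A filter combining heuristics of this kind might look as follows. This is a sketch only: the keyword list and thresholds are illustrative assumptions, not the classifiers of Tan and Kumar (2002), and `session` is assumed to be a list of entries from the parser sketched earlier.

    def looks_like_robot(session):
        """Heuristic robot detector (a sketch; thresholds illustrative)."""
        agents = " ".join(e["agent"].lower() for e in session)
        if any(w in agents for w in ("bot", "crawler", "spider", "slurp")):
            return True                  # self-identified robots
        if any(e["url"].endswith("robots.txt") for e in session):
            return True                  # robots typically fetch robots.txt
        # Robots tend to sweep a site broadly: many distinct pages, few repeats.
        urls = [e["url"] for e in session]
        if len(urls) > 50 and len(set(urls)) / len(urls) > 0.95:
            return True
        return False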
18. Page requests, caching, and proxy servers
- In theory, a browser requests a page from a Web server and the request is processed
- In practice, there are complications:
- Other users
- Browser caching
- Dynamic addressing in the local network
- Proxy server caching
19. Page requests, caching, and proxy servers
A graphical summary of how page requests from an individual user can be masked at various stages between the user's local computer and the Web server.
20. Page requests, caching, and proxy servers
- Web server logs are therefore not ideal in terms of a complete and faithful representation of individual page views
- There are heuristics to try to infer the true actions of the user:
- Path completion (Cooley et al. 1999), sketched below
- e.g. if the link B -> F exists but C -> F does not, then the session ABCF can be interpreted as ABCBF
- Anderson et al. 2001 give more heuristics
- In the general case, it is hard to know what the user actually viewed
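A minimal sketch of the path-completion idea, assuming the site's link graph is available as a dictionary. The function and its backtracking rule are illustrative, not Cooley et al.'s exact algorithm:

    def complete_path(session, links):
        """Path completion (a sketch): if no link leads from the previous
        page to the next request, assume the user pressed 'back' until
        reaching a page that does link to it. `links` maps each page to
        the set of pages it links to."""
        completed = [session[0]]
        for page in session[1:]:
            stack = list(completed)
            while stack and page not in links.get(stack[-1], set()):
                stack.pop()
                if stack:
                    completed.append(stack[-1])   # record revisit via 'back'
            completed.append(page)
        return completed

    # The slide's example: link B -> F exists, C -> F does not.
    links = {"A": {"B"}, "B": {"C", "F"}, "C": set()}
    print(complete_path(list("ABCF"), links))   # ['A', 'B', 'C', 'B', 'F']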
21. Identifying individual users from Web server logs
- Useful to associate specific page requests with specific individual users
- IP address is most frequently used
- Disadvantages:
- One IP address can belong to several users
- Dynamic allocation of IP addresses
- Better to use cookies
- Information in the cookie can be accessed by the Web server to identify an individual user over time
- Actions by the same user during different sessions can be linked together
22. Identifying individual users from Web server logs
- Commercial websites use cookies extensively
- 97% of users have cookies enabled permanently on their browsers (source: Amazon.com, 2003)
- However:
- There are privacy issues, so implicit user cooperation is needed
- Cookies can be deleted / disabled
- Another option is to enforce user registration
- High reliability
- Can discourage potential visitors
23. Sessionizing
- Time oriented (robust)
- e.g., by gaps between requests: no more than 25 minutes between successive requests (a sketch of this rule follows below)
- Navigation oriented (good for short sessions and when timestamps are unreliable)
- Referrer is the previous page in the session, or
- Referrer is undefined but the request is within 10 secs, or
- A link exists from the previous page to the current page in the web site
24. Client-side data
- Advantages of collecting data at the client side:
- Direct recording of page requests (eliminates masking due to caching)
- Recording of all browser-related actions by a user (including visits to multiple websites)
- More reliable identification of individual users (e.g. by login ID for multiple users on a single computer)
- Preferred mode of data collection for studies of navigation behavior on the Web
- Companies like comScore and Nielsen use client-side software to track home computer users
25. Client-side data
- Statistics like time per session and page-view duration are more reliable in client-side data
- Some limitations:
- Statistics like page-view duration still cannot be totally reliable, e.g. the user might go to fetch coffee
- Needs explicit user cooperation
- Typically recorded on home computers, so may not reflect a complete picture of Web browsing behavior
- Web surfing data can also be collected at intermediate points like ISPs and proxy servers
- Can be used to create user profiles and targeted advertising
26. Early studies from 1995 to 1997
- The earliest studies on client-side data are Catledge and Pitkow (1995) and Tauscher and Greenberg (1997)
- In both studies, data was collected by logging Web browser commands
- The population consisted of faculty, staff and students
- Both studies found:
- Clicking on hypertext anchors was the most common action
- Using the back button was the second most common action
27. Early studies from 1995 to 1997
- High probability of page revisitation (0.58-0.61)
- This is a lower bound, because page requests prior to the start of the studies are not accounted for
- Humans are creatures of habit?
- Content of the pages changed over time?
- Strong recency effect (a page that is revisited is usually a page that was visited in the recent past)
- Correlates with back button usage
- Similar repetitive actions are found in telephone number dialing, etc.
28. The Cockburn and McKenzie study from 2002
- Previous studies are relatively old; the Web has changed dramatically in the past few years
- Cockburn and McKenzie (2002) provide a more up-to-date analysis
- Analyzed the daily history.dat files produced by the Netscape browser for 17 users over about 4 months
- The population studied consisted of faculty, staff and graduate students
- The study found revisitation rates higher than the '94 and '95 studies (0.81)
- The time window is three times that of the past studies
29. The Cockburn and McKenzie study from 2002
- Is the revisitation rate less biased than in the previous studies?
- Has human behavior changed from an exploratory mode to a utilitarian mode?
- The more pages a user visits, the more requests there are for new pages
- The most frequently requested page for each user can account for a relatively large fraction of his/her page requests
- Useful to see the scatter plot of the number of distinct pages requested per user versus the total pages requested
30. The Cockburn and McKenzie study from 2002
The total number of pages requested versus the page vocabulary size (number of distinct pages) for each of the 17 users in the Cockburn and McKenzie (2002) study (log-log plot).
31. The Cockburn and McKenzie study from 2002
Bar chart of the ratio of the number of page requests for the most frequent page to the total number of page requests, for each of the 17 users in the Cockburn and McKenzie (2002) study.
32. Outline
- Basic concepts in Web log data analysis
- Predictive modeling of Web navigation behavior
- Markov modeling methods
- Analyzing search engine data
- E-commerce aspects of Web log mining
33. Markov models for page prediction
- The general approach is to use a finite-state Markov chain
- Each state can be a specific Web page or a category of Web pages
- If we are only interested in the order of visits (and not in time), each new request can be modeled as a transition between states
- Issues:
- Self-transitions
- Time-independence
34. Markov models for page prediction
- For simplicity, consider an order-dependent, time-independent finite-state Markov chain with M states
- Let s be a sequence of observed states of length L, e.g. s = ABBCAABBCCBBAA with three states A, B and C; st is the state at position t (1 ≤ t ≤ L). In general,
  P(s) = P(s1) ∏(t=2..L) P(st | s1, ..., st-1)
- The first-order Markov assumption is P(st | s1, ..., st-1) = P(st | st-1)
- This provides a simple generative model for producing sequential data (a small numerical sketch follows below)
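Under the first-order assumption, the probability of a whole sequence factorizes into an initial term and transition terms. A sketch in Python; the three-state distributions below are made-up numbers for illustration:

    import numpy as np

    def sequence_log_prob(seq, init, T, state_index):
        """log P(s) = log P(s1) + sum_t log P(st | st-1) for a
        first-order Markov chain (a sketch)."""
        idx = [state_index[s] for s in seq]
        logp = np.log(init[idx[0]])
        for prev, cur in zip(idx[:-1], idx[1:]):
            logp += np.log(T[prev, cur])
        return logp

    # Three states A, B, C, as in the slide's example sequence.
    state_index = {"A": 0, "B": 1, "C": 2}
    init = np.array([0.5, 0.3, 0.2])            # illustrative numbers
    T = np.array([[0.6, 0.3, 0.1],
                  [0.2, 0.5, 0.3],
                  [0.3, 0.3, 0.4]])
    print(sequence_log_prob("ABBCAABBCCBBAA", init, T, state_index))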
35. Markov models for page prediction
- If we denote Tij = P(st = j | st-1 = i), we can define an M x M transition matrix
- Properties:
- Strong first-order assumption
- Simple way to capture sequential dependence
- If each page is a state and there are W pages, this requires O(W^2) entries; W can be of the order of 10^5 to 10^6 for a university CS department
- To alleviate this, we can cluster the W pages into M clusters, each assigned a state in the Markov model
- Clustering can be done manually, based on the directory structure of the Web server, or automatically using clustering techniques
36. Markov models for page prediction
- Tij = P(st = j | st-1 = i) represents the probability that an individual user's next request will be from category j, given that the current request was from category i
- We can add E, an end state, to the model
- e.g. for three categories plus an end state, T is a 4 x 4 matrix
- E denotes the end of one sequence and the start of a new sequence
37. Markov models for page prediction
- A first-order Markov model assumes that the next state is based only on the current state
- Limitations:
- Doesn't capture longer-term memory
- We can try to capture more memory with a kth-order Markov chain
- Limitation: the O(M^(k+1)) parameters require an inordinate amount of training data
38. Parameter estimation for Markov model transitions
- Smoothed parameter estimates of the transition probabilities are
  T̂ij = (nij + aij) / (ni + ai), where ni = Σj nij and ai = Σj aij
  with nij the observed number of i-to-j transitions and aij prior pseudo-counts
- If nij = 0 for some transition (i, j), then instead of a parameter estimate of 0 (the ML estimate) we get aij / (ni + ai), allowing prior knowledge to be incorporated
- If nij > 0, we get a smooth combination of the data-driven information (nij) and the prior (see the sketch below)
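A sketch of the smoothed estimate, assuming the transition counts and prior pseudo-counts are given as M x M numpy arrays:

    import numpy as np

    def smoothed_transition_matrix(counts, alpha):
        """MAP estimate with a Dirichlet prior (a sketch):
        T[i, j] = (n_ij + a_ij) / (n_i + a_i), so transitions never
        observed (n_ij = 0) still get a nonzero, prior-driven
        probability. `alpha` entries should be positive."""
        numer = counts + alpha
        return numer / numer.sum(axis=1, keepdims=True)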
39. Parameter estimation for Markov models
- One simple way to set the prior parameters:
- Consider alpha as the effective sample size
- Partition the states into two sets: set 1 containing all states directly linked to state i, and the remaining states in set 2
- Assign uniform probability r/K to each state in set 2 (all set 2 states are equally likely)
- The remaining probability mass (1 - r) can be either uniformly assigned among set 1 elements or weighted by some measure
- Prior probabilities into and out of E can be set based on our prior knowledge of how likely we think a user is to exit the site from a particular state
40. Predicting page requests with Markov models
- Deshpande and Karypis (2001) propose schemes to prune the kth-order Markov state space
- These provide systematic but modest improvements
- Another approach is to use empirical smoothing techniques that combine models of different orders, from 1 to k (Chen and Goodman 1996)
41. Mixtures of Markov Chains
- Cadez et al. (2003) and Sen and Hansen (2003) replace the first-order Markov chain with a mixture of first-order Markov chains:
  P(st | st-1) = Σk P(st | st-1, c = k) P(c = k)
- where c is a discrete-valued hidden variable taking K values, Σk P(c = k) = 1, and P(st | st-1, c = k) is the transition matrix for the kth mixture component
- One interpretation: user behavior consists of K different navigation behaviors, described by the K Markov chains (membership computation sketched below)
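Given the component parameters, the membership probability P(c = k | s) of an observed sequence follows from Bayes' rule. A sketch (the parameter layout is an assumption of this example; `seq_idx` is the sequence encoded as state indices):

    import numpy as np

    def component_posteriors(seq_idx, weights, inits, trans):
        """P(c = k | s) for a mixture of first-order Markov chains
        (a sketch). weights[k] = P(c = k); inits[k] and trans[k] are
        component k's initial distribution and transition matrix."""
        K = len(weights)
        log_joint = np.zeros(K)
        for k in range(K):
            lp = np.log(weights[k]) + np.log(inits[k][seq_idx[0]])
            for prev, cur in zip(seq_idx[:-1], seq_idx[1:]):
                lp += np.log(trans[k][prev, cur])
            log_joint[k] = lp
        log_joint -= log_joint.max()       # for numerical stability
        post = np.exp(log_joint)
        return post / post.sum()           # normalized P(c = k | s)

This same computation serves as the E-step when fitting the mixture with EM, as discussed on the later "Learning Problem" slide.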
42. Modeling Web Page Requests with Markov Chain Mixtures
- MSNBC Web logs
- 2 million individuals per day
- Different session lengths per individual
- A difficult visualization and clustering problem
- WebCanvas
- Uses mixtures of Markov chains to cluster individuals based on their observed sequences
- Software tool: EM mixture modeling + visualization
44. From Web logs to sequences

128.195.36.195, -, 3/22/00, 10:35:11, W3SVC, SRVR1, 128.200.39.181, 781, 363, 875, 200, 0, GET, /top.html, -,
128.195.36.195, -, 3/22/00, 10:35:16, W3SVC, SRVR1, 128.200.39.181, 5288, 524, 414, 200, 0, POST, /spt/main.html, -,
128.195.36.195, -, 3/22/00, 10:35:17, W3SVC, SRVR1, 128.200.39.181, 30, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.195.36.101, -, 3/22/00, 16:18:50, W3SVC, SRVR1, 128.200.39.181, 60, 425, 72, 304, 0, GET, /top.html, -,
128.195.36.101, -, 3/22/00, 16:18:58, W3SVC, SRVR1, 128.200.39.181, 8322, 527, 414, 200, 0, POST, /spt/main.html, -,
128.195.36.101, -, 3/22/00, 16:18:59, W3SVC, SRVR1, 128.200.39.181, 0, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:54:37, W3SVC, SRVR1, 128.200.39.181, 140, 199, 875, 200, 0, GET, /top.html, -,
128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 17766, 365, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:07, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 1061, 382, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:39, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:56:03, W3SVC, SRVR1, 128.200.39.181, 1081, 382, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:56:04, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:56:33, W3SVC, SRVR1, 128.200.39.181, 0, 262, 72, 304, 0, GET, /top.html, -,
128.200.39.17, -, 3/22/00, 20:56:52, W3SVC, SRVR1, 128.200.39.181, 19598, 382, 414, 200, 0, POST, /spt/main.html, -,
(Figure: the raw log entries above are reduced to one sequence of page-category indices per user, e.g. User 1: 3 3 3 3 1 3 1 1 1 3 3 3 2 2 3 2, with similar sequences shown for Users 2-5.)
45. Clusters of Finite State Machines
(Figure: three clusters, each drawn as a stochastic finite state machine over page categories A, B, D and end state E, with different transition structures.)
46. Learning Problem
- Assumptions:
- The data is generated by K different groups
- Each group is described by a stochastic finite state machine (SFSM)
- i.e., a Markov model with an end state
- Given:
- A set of sequences, of different lengths, from different users
- Learn:
- A mixture of K different stochastic finite state machines
- Solution:
- EM is straightforward: fractional counts of transitions (see the sketch below)
- Efficient and accurate; scales as O(KN)
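A sketch of the corresponding M-step: the E-step posteriors (the `component_posteriors` sketch shown earlier) become fractional transition counts. The small smoothing constant is an illustrative choice, and updating the initial-state probabilities works analogously:

    import numpy as np

    def em_m_step(sequences, posteriors, K, M):
        """M-step for a mixture of Markov chains (a sketch): each
        sequence contributes its transition counts to every component,
        weighted by E-step memberships ('fractional counts').
        posteriors[n, k] = P(c = k | sequence n)."""
        counts = np.zeros((K, M, M))
        for n, seq in enumerate(sequences):
            for prev, cur in zip(seq[:-1], seq[1:]):
                counts[:, prev, cur] += posteriors[n]   # fractional counts
        counts += 1e-3                                  # small smoothing prior
        trans = counts / counts.sum(axis=2, keepdims=True)
        weights = posteriors.sum(axis=0) / posteriors.shape[0]
        return weights, trans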
47. Experimental Methodology
- Model training
- Fit 2 types of models:
- Mixtures of histograms
- Mixtures of finite state machines
- Train on a full day's worth of MSNBC Web data
- Model evaluation
- One-step-ahead prediction on unseen test data
- Test sequences come from a different day of Web logs
- Score: negative log probability (predictive entropy)
50. Timing Results
51. WebCanvas
- Software tool for Web log visualization
- Uses Markov mixtures to cluster data for display
- In use by msnbc.com administrators at Microsoft
- Also being applied to non-Web data
- Model-based visualization
- Random sample of actual sequences
- Interactive tiled windows displayed for visualization
- More effective than:
- Planar graphs
- The traffic-flow movie in Microsoft Site Server v3.0
52. WebCanvas (Cadez, Heckerman, et al., 2003)
53. Insights from WebCanvas
- From msnbc.com site administrators:
- Significant heterogeneity of behavior
- Relatively focused activity of many users
- Typically only 1 or 2 categories of pages
- Many individuals do not enter via the main page
- Detected problems with the weather page
- Missing transitions (e.g., tech <-> business)
54. Extensions
- Adding time-dependence
- Adding time-between-clicks and time-of-day effects
- Uncategorized Web pages
- Coupling page content with sequence models
- Modeling switching behaviors
- Allowing users to switch between models
- Individualized weights (hierarchical Bayes)
- Update: the WebCanvas tool will be part of the 2004 SQL Server release
55-57. Prediction with Markov mixtures
P(st+1 | s1:t) = Σk P(st+1, k | s1:t)
              = Σk P(st+1 | k, s1:t) P(k | s1:t)
              = Σk P(st+1 | k, st) P(k | s1:t)
where P(st+1 | k, st) is the prediction of the kth component and P(k | s1:t) is the membership weight based on the sequence history.
=> Predictions are a convex combination of the K component transition matrices, with weights based on the sequence history (see the sketch below).
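The same derivation in code, reusing `component_posteriors` from the earlier sketch (parameter layout as before):

    import numpy as np

    def predict_next(seq_idx, weights, inits, trans):
        """One-step-ahead prediction under a Markov mixture (a sketch):
        P(s_{t+1} | s_{1:t}) = sum_k P(s_{t+1} | k, s_t) P(k | s_{1:t}),
        a convex combination of the K component transition rows,
        weighted by the membership posterior from the history."""
        post = component_posteriors(seq_idx, weights, inits, trans)
        last = seq_idx[-1]
        rows = np.stack([trans[k][last] for k in range(len(weights))])
        return post @ rows    # mixture-weighted next-state distribution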
58. Related Work
- Mixtures of Markov chains
- Special case: Poulsen (1990)
- General case: Ridgeway (1997), Smyth (1997)
- Clustering of Web page sequences
- Non-probabilistic approaches (Fu et al., 1999)
- Markov models for prediction
- Anderson et al. (IJCAI, 2001)
- Mixtures of Markov chains outperform other sequential models for page-request prediction
59. Predicting page requests with Markov models
- K can be chosen by evaluating out-of-sample predictive performance based on:
- Accuracy of prediction
- Log probability score
- Entropy
- Other variations:
- Sen and Hansen 2003
- Position-dependent Markov models (Anderson et al. 2001, 2002)
- Zukerman et al. 1999
60. Modeling Clickrate Data
- Data
- 200k Alexa users, client-side, over 24 hours
- URLs requested are ignored
- The goal is to build a time-series model that characterizes user click rates
65. Markov-Poisson Model
- Doubly stochastic process
- Locally constant Poisson rate
- Indexed by M Markov states
- Fit a model with M = 3 states:
- Absence of a Web session
- Web session with slow click rate (about a 1-minute rate)
- Web session with rapid click rate (about a 10-second rate)
- Used hierarchical Bayes on individuals
66. Hierarchical Bayes Model
(Figure: a population prior p(λ | θ) sits above the individual rate parameters λ1, ..., λi, ..., λN, each with its own observed data sets D1, D2, ...)
Individuals with little data get shrunk to the prior; individuals with a lot of data are more data-driven (a simplified numerical sketch follows below).
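The shrinkage behavior can be illustrated with the conjugate Gamma-Poisson case. This is a simplified stand-in for the full Markov-Poisson model, with made-up prior parameters; in the real model the population-level parameters would be estimated from all individuals.

    def shrunk_rate(n_clicks, t_hours, a, b):
        """Posterior-mean click rate under a Gamma(a, b) prior on a
        Poisson rate (a sketch). With little data the estimate shrinks
        toward the prior mean a / b; with much data it approaches the
        individual's empirical rate n / t."""
        return (a + n_clicks) / (b + t_hours)

    # Illustrative prior: mean rate a / b = 0.5 clicks per hour.
    print(shrunk_rate(n_clicks=2, t_hours=1.0, a=5.0, b=10.0))     # ~0.64
    print(shrunk_rate(n_clicks=200, t_hours=100.0, a=5.0, b=10.0)) # ~1.86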
68. Prediction with Hierarchical Bayes
(Figure: for a new individual, the rate λ is predicted by combining the population prior p(λ | θ), learned from the historical training data of individuals 1..N, with the new individual's first few clicks.)
69. Extensions
- Couple click rate with purchase behavior
- Markov-type model moving through different states:
- Product viewing
- Detailed product information
- Reviews
- Combine states with click rate and page content to predict p(purchase | data up to time t)
- Can these Bayesian techniques be scaled up?
70. Outline
- Basic concepts in Web log data analysis
- Predictive modeling of Web navigation behavior
- Markov modeling methods
- Analyzing search engine data
- E-commerce aspects of Web log mining
71. Analysis of Search Engine Query Logs

Study | Sample size | Search engine | Time period
Lau & Horvitz | 4,690 of 1 million queries | Excite | Sep 1997
Silverstein et al. | 1 billion queries | AltaVista | 6 weeks in Aug-Sep 1998
Spink et al. (series of studies) | 1 million queries per period | Excite | Sep 1997, Dec 1999, May 2001
Xie & O'Hallaron | 110,000 queries | Vivisimo | 35 days, Jan-Feb 2001
Xie & O'Hallaron | 1.9 million queries | Excite | 8 hrs in a day, Dec 1999
72. Main Results
- The average number of terms in a query ranges from a low of 2.2 to a high of 2.6
- The most common number of terms in a query is 2
- The majority of users don't refine their query
- The number of users who viewed only a single page of results increased from 29% (1997) to 51% (2001) (Excite)
- 85% of users viewed only the first page of search results (AltaVista)
- 45% of queries (2001) are about Commerce, Travel, Economy, or People (was 20% in 1997)
- Queries about adult content or entertainment decreased from about 20% (1997) to around 7% (2001)
73. Xie and O'Hallaron Study (2002)
- Query length distributions (bars) vs. a Poisson model (dots and lines)
- All four studies produced a generally consistent set of findings about user behavior in a search engine context:
- Most users view relatively few pages per query
- Most users don't use advanced search features
74. Power-law Characteristics of Common Queries
- The frequency f(r) of queries with rank r follows a power law (a straight line in log-log space)
- 110,000 queries from Vivisimo
- 1.9 million queries from Excite
- There are strong regularities in the patterns of behavior in how we search the Web
75. Outline
- Basic concepts in Web log data analysis
- Predictive modeling of Web navigation behavior
- Markov modeling methods
- Analyzing search engine data
- E-commerce aspects of Web log mining
76. The next few slides are from Ronny Kohavi, director of data mining and personalization at Amazon.com. His full set of slides, along with related papers on e-commerce and data mining, is available online at http://robotics.stanford.edu/ronnyk/ronnyk-bib.html
77. E-Commerce
- Page-request Web logs combined with:
- Purchase (market-basket) information
- User address information (if they make a purchase)
- Demographic information (can be purchased)
- Emails to/from the customer
- The main focus here is to increase revenue
- Data mining is widely used at online commerce companies like Amazon
- This is a very rich source of problems for data mining:
- What products should we advertise to this person?
- Can we do dynamic pricing?
- If a person buys X, should we also suggest Y?
- Who are our best customers?
- etc.
78. Combining Data Sources
- Comprehensive collections of US consumer and telephone data are available via the internet
- Multi-sourced databases:
- Demographic, socioeconomic, and lifestyle information
- Information on most U.S. households
- Contributors' files refreshed a minimum of 3-12 times per year
- Data sources include County Real Estate Property Records, U.S. Telephone Directories, Public Information, Motor Vehicle Registrations, Census Directories, Credit Grantors, Public Records and Consumer Data, Driver's Licenses, Voter Registrations, Product Registration Questionnaires, Catalogers, Magazines, Specialty Retailers, Packaged Goods Manufacturers, Accounts Receivable Files, and Warranty Cards
- Much of this data can be accessed in real time once a customer self-identifies
79. Map of Worldwide Revenue
Although Debenhams' online site only ships in the UK, we see some revenue from the rest of the world.
- UK: 98.8%
- US: 0.6%
- Australia: 0.1%
(Map legend: Low / Medium / High)
NOTE: About 50% of the non-UK orders are wedding list purchases
80. Online Consumer Demographics
- Results from Blue Martini:
- People who have a travel and entertainment credit card are 48% more likely to be online shoppers (27% for people with a premium credit card)
- People whose home was built after 1990 are 45% more likely to be online shoppers
- Households with income over 100K are 31% more likely to be online shoppers
- People under the age of 45 are 17% more likely to be online shoppers
81. Demographics: Income
- A higher household income means you are more likely to be an online shopper
82. Demographics: Credit Cards
- The more credit cards, the more likely you are to be an online shopper
83. Example Web Traffic
(Figure annotations:)
- Sept. 11: note the significant drop in human traffic, but not in bot traffic
- Weekends
- Internal performance bot
- Registration at search engine sites
84. Product Affinities at MEC

Product | Association | Lift | Confidence | Website Recommended Products
Orbit Sleeping Pad | Orbit Stuff Sack | 222 | 37% | Cygnet Sleeping Bag, Primus Stove, Aladdin 2 Backpack
Bambini Crewneck Sweater (Children's) | Bambini Tights (Children's) | 195 | 52% | Yeti Crew Neck Pullover (Children's), Beneficial Ts Organic Long Sleeve T-Shirt (Kids)
Silk Long Johns (Women's) | Silk Crew (Women's) | 304 | 73% | Micro Check Vee Sweater, Volant Pants, Composite Jacket
Cascade Entrant Overmitts | Polartec 300 Double Mitts | 51 | 48% | Windstopper Alpine Hat, Volant Pants, Tremblant 575 Vest (Women's)

- Minimum support for the associations is 80 customers
- Confidence: 37% of people who purchased the Orbit Sleeping Pad also purchased the Orbit Stuff Sack
- Lift: people who purchased the Orbit Sleeping Pad were 222 times more likely to purchase the Orbit Stuff Sack compared to the general population (a computational sketch follows below)
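These statistics follow directly from basket counts. A sketch, with hypothetical counts chosen to reproduce the first row of the table:

    def association_stats(n_total, n_x, n_y, n_xy):
        """Support, confidence, and lift for the rule X -> Y
        (a sketch): n_total baskets, n_x containing X, n_y
        containing Y, n_xy containing both."""
        support = n_xy / n_total
        confidence = n_xy / n_x                   # P(Y | X)
        lift = confidence / (n_y / n_total)       # P(Y | X) / P(Y)
        return support, confidence, lift

    # Hypothetical counts consistent with the Orbit row:
    print(association_stats(n_total=60000, n_x=100, n_y=100, n_xy=37))
    # -> support ~0.0006, confidence = 0.37, lift = 222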
85. Customer Locations Relative to Retail Stores
- Heavy purchasing areas away from retail stores can suggest new retail store locations
- There are no stores in several hot areas; MEC is building a store in Montreal right now
(Map of Canada with store locations.)
86. Building the Customer Signature
- Building a customer signature is a significant effort, but well worth it
- A signature summarizes customer or visitor behavior across hundreds of attributes, many of which are specific to the site
- Once a signature is built, it can be used to answer many questions
- The mining algorithms will pick the most important attributes for each question
- Example attributes computed:
- Total visits and sales
- Revenue by product family
- Revenue by month
- Customer state and country
- Recency, frequency, monetary value
- Latitude/longitude from the customer's postal code