Title: Weblog Cleaning for Constructing Sequential Classifiers
1 Web-log Cleaning for Constructing Sequential Classifiers
- Qiang Yang
- Hong Kong University of Science and Technology
- Hong Kong
- T.Y. Li and Ke Wang
- Simon Fraser University, Canada
2 Web Usage Mining
- uplherc.upl.com - - [01/Aug/1995:00:08:52 -0400] "GET /shuttle/resources/orbiters/endeavour/index.html HTTP/1.0" 200 5052
- pm9.j51.com - - [01/Aug/1995:00:08:52 -0400] "GET /images/WORLD-logosmall.gif HTTP/1.0" 200 669
- 139.230.35.135 - - [01/Aug/1995:00:08:52 -0400] "GET /images/NASA-logosmall.gif HTTP/1.0" 200 786
[Figure: a user session, identified by IP, as a sequence of requested pages: A.html, B.html, C.html, D.html]
3 Web Logs are Available
- EPA (24 hours, Aug 29, 1995): 47,748 requests for 3,770 pages
- NASA log (Kennedy Space Center, 1 month, 1995): 1,569,898 visits on 15,429 pages
- Monash University (UM99, 50 days, 1998): 525,378 visits on 6,727 pages
4 Web Access Prediction
- Method: association-rule-based models
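5 Build Prediction Models
Association-rule-based predictive model
Example sessions: (A,B,C,D), (A,B,C,F), (A,B,E), (B,C), (B,C,D,G), (C,D)
1. Extract rules
2. Select rules
[Figure: current observations (items A, B, C) followed by a window of prediction of size m, whose content (?) is to be predicted]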
6 Moving Window Algorithm
- Current window: W = (W1, W2)
- W1: observation window
- W2: prediction window (size m)
- W slides from the beginning of the session to the end
(A, B, C, A, C, D, G) ?
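The sliding of W over a session can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `sliding_windows` and the fixed observation-window size `w1_size` are assumptions, since the slide fixes only the prediction-window size m.

```python
from typing import Iterator, List, Sequence, Tuple


def sliding_windows(
    session: Sequence[str], w1_size: int, m: int
) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (W1, W2) pairs as W slides from the start of a session to its end.

    w1_size is an assumed fixed size for the observation window W1;
    m is the size of the prediction window W2.
    """
    # Stop once there is no item left after W1 to predict.
    for start in range(len(session) - w1_size):
        w1 = list(session[start:start + w1_size])
        # W2 may be shorter than m near the end of the session.
        w2 = list(session[start + w1_size:start + w1_size + m])
        yield w1, w2


session = ["A", "B", "C", "A", "C", "D", "G"]
windows = list(sliding_windows(session, w1_size=2, m=2))
for w1, w2 in windows:
    print(w1, "->", w2)
```

With `w1_size=2` and `m=2`, the example session yields five window pairs, the first being W1 = (A, B) with W2 = (C, A).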
7 Association Rules
- LHS → RHS
- RHS restricted to one URL,
- but can be relaxed to more than one URL
- RHS: the most popular URL with the same LHS
- For each W1, if a rule applies and its RHS is in W2, then Success!
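A rough sketch of how such rules might be extracted and scored, using the example sessions from the Build Prediction Models slide. Several choices here are assumptions made for illustration only: the LHS is the full observation window of fixed size `w1_size`, the RHS is counted from the single item that follows the LHS, and `extract_rules` and `evaluate` are hypothetical names.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Sequence, Tuple

LHS = Tuple[str, ...]


def extract_rules(
    sessions: List[Sequence[str]], w1_size: int, min_support: int = 2
) -> Dict[LHS, Tuple[str, float]]:
    """Map each LHS to (RHS, confidence), where RHS is the most popular follower."""
    followers: Dict[LHS, Counter] = defaultdict(Counter)
    for s in sessions:
        for i in range(len(s) - w1_size):
            followers[tuple(s[i:i + w1_size])][s[i + w1_size]] += 1
    rules: Dict[LHS, Tuple[str, float]] = {}
    for lhs, counts in followers.items():
        rhs, support = counts.most_common(1)[0]
        if support >= min_support:  # keep only rules seen often enough
            rules[lhs] = (rhs, support / sum(counts.values()))
    return rules


def evaluate(
    rules: Dict[LHS, Tuple[str, float]], session: Sequence[str], w1_size: int, m: int
) -> Tuple[int, int]:
    """Return (successes, applications): a success is a rule whose RHS is in W2."""
    successes = applications = 0
    for start in range(len(session) - w1_size):
        w1 = tuple(session[start:start + w1_size])
        w2 = session[start + w1_size:start + w1_size + m]
        if w1 not in rules:
            continue  # no rule applies to this observation window
        applications += 1
        if rules[w1][0] in w2:
            successes += 1
    return successes, applications


training = [
    ["A", "B", "C", "D"], ["A", "B", "C", "F"], ["A", "B", "E"],
    ["B", "C"], ["B", "C", "D", "G"], ["C", "D"],
]
rules = extract_rules(training, w1_size=2, min_support=2)
result = evaluate(rules, ["A", "B", "C", "A", "C", "D", "G"], w1_size=2, m=2)
print(rules, result)
```

The ratio of successes to applications gives a precision estimate for the rule set on the test session.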
8 Rule-Representation Methods (min sup = 2)
- Subset
- A, C → C
- Substring
- BC → C
- Latest Substring
- C → C
- Subsequence
- Latest Subsequence
9 Rule Representation
- Subset rules
- LHS: a subset of items appearing in W1, with no order imposed
- Corresponds to traditional association rules
- Substring rules
- LHS: a substring in W1; items must be adjacent, but LHS can start anywhere in W1
- Latest-substring rules
- Substring rules where LHS must end with W1
- Also known as n-gram rules (n is a variable < |W1|)
- C4.5 Decision Trees
- Default rule: most popular item in web log
[Figure: tree with a root and nodes A, B, C; observation window W1 and prediction window W2 (?)]
10 Information Embedded In Rules
- Subset method: appearance in any order
- Subsequence method: appearance and order information
- Latest-subsequence method: appearance, order, and recency information
- Substring method: appearance, order, and adjacency information
- Latest-substring method: appearance, order, adjacency, and recency information
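The five representations differ only in how an LHS is matched against W1. The predicates below are a sketch under stated assumptions: the function names are hypothetical, and "latest subsequence" is interpreted, by analogy with latest substring, as a subsequence whose last item is the last item of W1.

```python
from typing import Sequence


def is_subset(lhs: Sequence[str], w1: Sequence[str]) -> bool:
    """Appearance only: every LHS item occurs somewhere in W1."""
    return set(lhs) <= set(w1)


def is_subsequence(lhs: Sequence[str], w1: Sequence[str]) -> bool:
    """Appearance + order: LHS items occur in W1 in the same order, gaps allowed."""
    remaining = iter(w1)
    return all(item in remaining for item in lhs)  # 'in' consumes the iterator


def is_latest_subsequence(lhs: Sequence[str], w1: Sequence[str]) -> bool:
    """Assumed meaning: a subsequence that ends at the last item of W1."""
    if not lhs or not w1 or lhs[-1] != w1[-1]:
        return False
    return is_subsequence(lhs[:-1], w1[:-1])


def is_substring(lhs: Sequence[str], w1: Sequence[str]) -> bool:
    """Appearance + order + adjacency: LHS occurs contiguously anywhere in W1."""
    k = len(lhs)
    return any(tuple(w1[i:i + k]) == tuple(lhs) for i in range(len(w1) - k + 1))


def is_latest_substring(lhs: Sequence[str], w1: Sequence[str]) -> bool:
    """Adds recency: LHS is a suffix of W1 (an n-gram ending at the latest item)."""
    k = len(lhs)
    return k <= len(w1) and tuple(w1[len(w1) - k:]) == tuple(lhs)


w1 = ("A", "B", "C")  # illustrative observation window
print(is_subset(("C", "A"), w1), is_subsequence(("C", "A"), w1))
print(is_substring(("A", "B"), w1), is_latest_substring(("A", "B"), w1))
```

Each stricter predicate accepts fewer LHS candidates, which is why the stricter representations carry more information per rule.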
11 Rule-Selection Criteria
- Among the rules whose LHS matches W1:
- Longest-Match Selection
- Select the rule whose left-hand side is the longest to apply
- Corresponds to using the strongest signature to predict
- Most Confident
- Select the rule with the highest confidence to apply
- Pessimistic Selection
- UCF(E, N) is the upper bound on the estimated error for a given confidence value, assuming a normal distribution of error
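The three criteria can be compared on a small rule set. The sketch below makes several assumptions that are not in the slides: rules are matched as latest substrings, the `Rule` fields and function names are hypothetical, and the pessimistic bound uses the textbook normal-approximation upper limit with z = 0.69 (the value commonly associated with C4.5's default 25% confidence level) as a stand-in for UCF(E, N).

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Rule:
    lhs: Tuple[str, ...]
    rhs: str
    support: int  # N: how often the LHS was observed
    errors: int   # E: how often the RHS did not follow

    @property
    def confidence(self) -> float:
        return 1.0 - self.errors / self.support


def pessimistic_error(errors: int, n: int, z: float = 0.69) -> float:
    """Upper bound on the error rate under a normal approximation (assumed form)."""
    f = errors / n
    spread = z * math.sqrt(f / n - f * f / n + z * z / (4 * n * n))
    return (f + z * z / (2 * n) + spread) / (1 + z * z / n)


def matching(rules: List[Rule], w1: Sequence[str]) -> List[Rule]:
    """Rules whose LHS is a suffix of W1 (latest-substring matching)."""
    return [r for r in rules
            if len(r.lhs) <= len(w1) and tuple(w1[len(w1) - len(r.lhs):]) == r.lhs]


def longest_match(rules: List[Rule], w1: Sequence[str]) -> Optional[Rule]:
    found = matching(rules, w1)
    return max(found, key=lambda r: len(r.lhs)) if found else None


def most_confident(rules: List[Rule], w1: Sequence[str]) -> Optional[Rule]:
    found = matching(rules, w1)
    return max(found, key=lambda r: r.confidence) if found else None


def pessimistic(rules: List[Rule], w1: Sequence[str]) -> Optional[Rule]:
    found = matching(rules, w1)
    if not found:
        return None
    return min(found, key=lambda r: pessimistic_error(r.errors, r.support))


rules = [
    Rule(("A", "B", "C"), "D", support=2, errors=1),   # confidence 50%
    Rule(("B", "C"), "E", support=20, errors=6),       # confidence 70%
    Rule(("C",), "F", support=4, errors=1),            # confidence 75%
]
w1 = ("A", "B", "C")
print(longest_match(rules, w1), most_confident(rules, w1), pessimistic(rules, w1))
```

On this made-up rule set the three criteria disagree: longest match prefers the most specific rule, most confident prefers the highest raw confidence, and pessimistic selection prefers the rule whose confidence is backed by the most evidence.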
12 Comparison Matrix
- Comparison criteria: precision, model size
13 Longest Match (NASA)
- Number of rules controlled by min support
- Latest-substring a clear winner
14 Most-confident (NASA)
- Again latest-substring a winner
- Drop off after 10,000 due to overfitting
15 Greedy-Dual-Size Frequency
- Cache replacement algorithm
- A key value K(p) is assigned to each cached object p
- Arlitt et al. USENIX 1998, Cao & Irani, 97
- K(p) = L + F(p) × C(p) / S(p)
- C(p): cost of loading a page (e.g., amount of time)
- S(p): size of a page
- F(p): frequency count of a page
- L: an inflation factor to reflect cache aging
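A minimal sketch of a cache that uses this key value for replacement. The class and method names are hypothetical, and the update policy shown (evict the object with the smallest K, then raise L to the evicted key) follows the usual description of Greedy-Dual-Size style algorithms rather than anything stated on the slide.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Entry:
    size: float
    cost: float
    freq: int
    key: float


class GDSFCache:
    def __init__(self, capacity: float) -> None:
        self.capacity = capacity
        self.used = 0.0
        self.inflation = 0.0  # L: rises as objects are evicted, aging older keys
        self.entries: Dict[str, Entry] = {}

    def _key(self, freq: int, cost: float, size: float) -> float:
        # K(p) = L + F(p) * C(p) / S(p)
        return self.inflation + freq * cost / size

    def access(self, name: str, size: float, cost: float) -> bool:
        """Return True on a cache hit, False on a miss."""
        entry = self.entries.get(name)
        if entry is not None:
            entry.freq += 1
            entry.key = self._key(entry.freq, entry.cost, entry.size)
            return True
        if size > self.capacity:
            return False  # the object can never fit, so it is not cached
        while self.used + size > self.capacity:
            victim = min(self.entries, key=lambda n: self.entries[n].key)
            self.inflation = self.entries[victim].key  # L takes the evicted key
            self.used -= self.entries[victim].size
            del self.entries[victim]
        self.entries[name] = Entry(size, cost, 1, self._key(1, cost, size))
        self.used += size
        return False


cache = GDSFCache(capacity=10)
cache.access("a", size=4, cost=1)
cache.access("b", size=4, cost=1)
cache.access("a", size=4, cost=1)   # hit: a's frequency and key increase
cache.access("c", size=4, cost=1)   # miss: b has the smallest key and is evicted
print(sorted(cache.entries), cache.inflation)
```

Because L only grows, a newly inserted object starts with a key at least as large as the last evicted one, so objects that stop being requested eventually fall to the bottom.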
16 Predicting future frequency using latest-substring + longest match
Session 1: O1 0.70, O2 0.90, O3 0.30, O4 0.11
Session 2: O1 0.60, O2 0.70, O3 0.20, O5 0.42
Session 3: O1 0.70, O2 0.90, O4 0.30, O5 0.33
Predicted Frequency
W1 = 0.70 + 0.60 + 0.70 = 2.00
W2 = 0.90 + 0.70 + 0.90 = 2.50
W3 = 0.30 + 0.20 = 0.50
W4 = 0.11 + 0.30 = 0.41
W5 = 0.42 + 0.33 = 0.75
- Ki = L + (Wi + Fi) × Ci / Si
- Wi: future frequency; Fi: past frequency
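The predicted-frequency arithmetic can be reproduced with a short sketch: each object's future frequency is the sum of the probabilities predicted for it across the active sessions, and that sum is added to the past frequency in the key value. The function names, and the past-frequency, cost, and size values in the usage line, are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def predicted_frequency(session_predictions: List[Dict[str, float]]) -> Dict[str, float]:
    """Sum, per object, the probabilities predicted in each active session."""
    totals: Dict[str, float] = defaultdict(float)
    for predictions in session_predictions:
        for obj, probability in predictions.items():
            totals[obj] += probability
    return dict(totals)


def key_value(inflation: float, future: float, past: float,
              cost: float, size: float) -> float:
    """K = L + (W + F) * C / S, with W the future and F the past frequency."""
    return inflation + (future + past) * cost / size


sessions = [
    {"O1": 0.70, "O2": 0.90, "O3": 0.30, "O4": 0.11},  # Session 1
    {"O1": 0.60, "O2": 0.70, "O3": 0.20, "O5": 0.42},  # Session 2
    {"O1": 0.70, "O2": 0.90, "O4": 0.30, "O5": 0.33},  # Session 3
]
w = predicted_frequency(sessions)
for obj in sorted(w):
    print(obj, round(w[obj], 2))

# Hypothetical past frequency, cost, and size for O1, for illustration only.
k1 = key_value(inflation=0.0, future=w["O1"], past=3, cost=1.0, size=2.0)
print(round(k1, 2))
```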
17 Hit Rate measures latency reduction
18 Rule Pruning: not all rules are useful!
- Suppose that we have two rules for testing case <B, C> → ?
- Rule 1: <A, B, C> → D (confidence 50%)
- Rule 2: <B, C> → E (confidence 70%)
- In general, rules form a hierarchy we call the Latest-Substring Index Tree (LSIT)
- Each rule is represented by a node in the LSIT
- The root of the LSIT represents the default rule
- The node representing the direct parent rule is the parent node of the node(s) representing the direct children rule(s)
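One way to picture the hierarchy is sketched below. It rests on an assumption the slides do not spell out: the direct parent of a rule is taken to be the rule whose LHS is the longest proper suffix of the child's LHS, with the empty LHS standing for the default rule at the root. The name `build_lsit` is hypothetical.

```python
from typing import Dict, Iterable, Tuple

LHS = Tuple[str, ...]
ROOT: LHS = ()  # the default rule has an empty left-hand side


def build_lsit(lhs_list: Iterable[LHS]) -> Dict[LHS, LHS]:
    """Map each rule's LHS to the LHS of its direct parent in the tree."""
    nodes = set(lhs_list) | {ROOT}
    parent: Dict[LHS, LHS] = {}
    for lhs in nodes:
        if lhs == ROOT:
            continue
        # Drop leading items one at a time; the first suffix that is itself
        # a rule is the direct parent. The empty suffix always matches ROOT.
        for k in range(1, len(lhs) + 1):
            suffix = lhs[k:]
            if suffix in nodes:
                parent[lhs] = suffix
                break
    return parent


tree = build_lsit([("A", "B", "C"), ("B", "C"), ("C",), ("D", "C"), ("E", "F")])
for child in sorted(tree):
    print(child, "->", tree[child])
```

Under this reading, Rule 1 (<A, B, C> → D) is a child of Rule 2 (<B, C> → E), so a pruning pass can compare each child against its parent.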
19 LSIT Example
20 LSIT Pruning
21 LSIT Evaluation
Experiments are based on NASA data: 1,569,898 visits on 15,429 pages
22 Conclusions
- Web-data mining requires extensive data cleaning
- Data cleaning involves cleaning not only the raw data but also the mined knowledge
- In our case, the rule set is also cleaned to yield better results