Title: Advanced Topics in Data Mining: Web Mining
1Advanced Topics in Data Mining: Web Mining
2Web Mining
3Web Mining
- Applications are being ported to the Web at a rapid pace
- On-line services, such as America Online (AOL) and CompuServe (merged into AOL), are anxious to know user access patterns, not just searches on the Web
- How does Amazon do it?
- Understanding Web user behavior is important
- It can improve Web page organization
- It can increase Web server performance
- It can exploit Web advertising
- It can increase business opportunity
4Amazon Web Page
Association Rules
5More Information Desired
- Collecting statistical information (page hits) only is insufficient, since
- The hit frequency of a page depends not only on its content but also on its location
- The number of users accessing a page is not available
- Information on which pages are accessed together is not available
- Data mining in the Web (Web Mining)
- Web Access Pattern Collection
- Web User Pattern Mining
6Web Access Pattern Collection
- Server-Based Data Collection
- Who is visiting a given Web site, and what are they doing?
- Agent-Based Data Collection
- Which Web sites has a particular user visited?
7Server-Based Data Collection
- Examine the logs collected by HTTPd
- Access Log (IP, Time, Access Data), Referred Log (A→B), Error Log
- We can combine some of them for our use if necessary
- Problems
- The use of proxy servers
- The effect of caching
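As an illustration of extracting the (IP, Time, Access Data) triple from an access log, here is a minimal sketch. It assumes the server writes the widely used Common Log Format; the slides do not say which format the logs use, and the sample line, the `LOG_RE` pattern and the `parse_line` helper are hypothetical names introduced only for this example.

```python
import re
from datetime import datetime

# Assumed layout: host ident user [time] "METHOD url protocol" status bytes
LOG_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) \S+'
)

def parse_line(line):
    """Return (host, timestamp, url) for one log line, or None if it does not match."""
    m = LOG_RE.match(line)
    if m is None:
        return None
    when = datetime.strptime(m.group("time"), "%d/%b/%Y:%H:%M:%S %z")
    return m.group("host"), when, m.group("url")

# Hypothetical sample line, not taken from the slides.
sample = '192.0.2.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
print(parse_line(sample))
```

Note that the host field only identifies the requesting machine, so requests arriving through a proxy or served from a cache are still misattributed or missing, as the Problems bullets point out.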
8Server-Based Data Collection
9Access Log
IP/Domain Name
Time
Access Data
10Referred Log
Caching
11Server-Based Data Collection
- Has to be done in accordance with technology advances
- The use of Active Server Pages (Session ID available)
- The use of proxy servers
- The effect of caching
- HTTPd 1.1
- Limitation
- Can only capture users' behavior while they are within this site
12Agent-Based Data Collection
- Understanding individual Web behavior needs client-based data collection
- Results are useful
- Better Personalized Service
- Improved Web Page Organization
- Better Pricing Policies
- Methods
- Applets can only read/write files in their source servers, a big security constraint
- Using Active Components (ActiveX Control) and PlugIns
- APCS (Access Pattern Collection Server)
13APCS
14APCS
15APCS
16APCS
17APCS
18Agent-Based Data Collection
- Very difficult to do for non-registered users in the current Web environment
- It has to be conducted with the users' consent
- Very dependent upon available Web technologies
19Web User Pattern Mining
- Web user pattern mining is to discover user access patterns in Web servers
- Pattern discovery and analysis tools
- Some existing Web tools provide mechanisms for reporting user activity in the servers
- Web Trends (http://www.webtrends.com.tw/)
- Open Market (http://www.openmarket.com/)
- Net.Genesis (http://www.netgen.com/)
20Path Traversal Patterns Mining
- Mining path traversal patterns in a distributed information-providing environment (WWW) where documents or objects are linked together (via hyperlinks) to facilitate interactive access
- The solution procedure consists of three steps
- Convert the original sequence of log data into a set of maximal forward references (MF)
- Filter out the effect of some backward references
- Backward references are mainly made for ease of traveling, so concentrate on mining meaningful user access sequences
- Some objects are visited because of their locations rather than their content
- Determine the frequent traversal patterns, i.e., large reference sequences, from the maximal forward references obtained
- Determine the maximal reference sequences from large reference sequences (Trivial)
21Step1 MF References
- Suppose the traversal log contains the following traversal path for a user
- A, B, C, D, C, B, E, G, H, G, W, A, O, U, O, V
When a backward reference occurs, a forward reference path terminates.
The set of maximal forward references is ABCD, ABEGH, ABEGW, AOU, AOV
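The splitting rule in this example can be sketched as follows. This is a minimal illustration that assumes each page is identified by a single-character string and that revisiting any page already on the current path counts as a backward reference; the function name is a hypothetical choice, not the authors' code.

```python
def maximal_forward_references(path):
    """Split one user's traversal path into maximal forward references."""
    results = []
    current = []            # forward path built so far
    moving_forward = True   # False right after a backward move
    for page in path:
        if page in current:
            # Backward reference: emit the path only if we had been moving forward.
            if moving_forward:
                results.append("".join(current))
            # Cut the path back to the revisited page.
            current = current[: current.index(page) + 1]
            moving_forward = False
        else:
            current.append(page)
            moving_forward = True
    if moving_forward and current:
        results.append("".join(current))
    return results

log = list("ABCDCBEGHGWAOUOV")
print(maximal_forward_references(log))
# ['ABCD', 'ABEGH', 'ABEGW', 'AOU', 'AOV']
```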
22Step1 Another Example
23Step1 Arrange Database
Encoding
24Step1 Database Reduction
Database Reduction
25Step2 Find Frequent Reference Sequences
- Two algorithms for finding Frequent Traversal Patterns (Frequent Reference Sequences, Frequent Consecutive Subsequences)
- Full-Scan (FS) Algorithm
- FS utilizes key ideas of the DHP algorithm
- Selective-Scan (SS) Algorithm
- SS reduces the number of database scans
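To make the target of Step 2 concrete, the sketch below counts frequent consecutive subsequences by brute force, level by level. It is deliberately not FS or SS: there is no hashing, candidate generation, or database reduction. The support threshold of 2 and the reuse of the five maximal forward references from the Step 1 example as the database are assumptions made only for illustration.

```python
from collections import Counter

def frequent_reference_sequences(mf_refs, min_support):
    """Return {sequence: support} for consecutive subsequences with enough support."""
    frequent = {}
    k = 1
    while True:
        counts = Counter()
        for ref in mf_refs:
            # Count each distinct length-k consecutive subsequence once per reference.
            counts.update({ref[i:i + k] for i in range(len(ref) - k + 1)})
        level = {s: c for s, c in counts.items() if c >= min_support}
        if not level:
            return frequent
        frequent.update(level)
        k += 1

mf = ["ABCD", "ABEGH", "ABEGW", "AOU", "AOV"]
large = frequent_reference_sequences(mf, min_support=2)
print(sorted(large, key=lambda s: (len(s), s)))
```

Stopping at the first empty level is safe because every consecutive piece of a frequent sequence is itself frequent.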
26Full-Scan (FS) Algorithm
Generate L1 Hash Table
Scan DB-1
27Generate L1 Hash Table
Scan DB-1
h(x,y) ( order of x ) 23 ( order of y )
mod 17
28Generate C2
29Generate L2 Reduce DB
Scan DB-2
30Generate L2 Reduce DB
Scan DB-2
31Generate C3, L3 Reduce DB
Scan DB-3
32Generate C4, L4 Reduce DB
Scan DB-4
33Selective-Scan (SS) Algorithm
Scan DB-3
34Step 3 Generate Frequent Traversal Patterns
Maximal Reference Sequences
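A maximal reference sequence is a large reference sequence that is not contained in any other large reference sequence. A minimal sketch of that filter follows; the input list is a hypothetical set of large reference sequences (the ones obtained from the Step 1 example under an assumed support threshold of 2), not data shown on this slide.

```python
def maximal_reference_sequences(large_seqs):
    """Keep only sequences that are not a consecutive part of another large sequence."""
    seqs = set(large_seqs)
    return sorted(s for s in seqs if not any(s != t and s in t for t in seqs))

# Hypothetical large reference sequences used only for illustration.
large = ["A", "B", "E", "G", "O", "AB", "BE", "EG", "AO", "ABE", "BEG", "ABEG"]
print(maximal_reference_sequences(large))
# ['ABEG', 'AO']
```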
35WAP-Mine Algorithm
- The key consideration is how to facilitate the tedious support counting and candidate generating operations in the mining procedure
- Given a Web Access Sequence database WAS and a support threshold ξ, mine the complete set of ξ-patterns of WAS
User ID Web Access Sequence
100 abdac
200 eaebcac
300 babfaec
400 afbacfc
WAS
36WAP-Mine Algorithm
(1) Scan WAS once, find all frequent-1 events
(2) Scan WAS again, construct a WAP-tree
(3) Recursively mine the WAP-tree using conditional search
Access patterns
37Find All Frequent-1 Events
Item Support Frequency
a 4
b 4
c 4
d 1
e 2
f 2
User ID Web Access Sequence
100 abdac
200 eaebcac
300 babfaec
400 afbacfc
Min_Sup = 75%
User ID Web Access Sequence Frequent Subsequence
100 abdac abac
200 eaebcac abcac
300 babfaec babac
400 afbacfc abacc
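The first scan can be sketched in a few lines. This is an illustrative sketch under the assumption that support is the number of sequences containing an event at least once and that the 75% threshold is rounded up to a whole number of sequences; the function names are hypothetical.

```python
import math
from collections import Counter

WAS = {100: "abdac", 200: "eaebcac", 300: "babfaec", 400: "afbacfc"}

def frequent_events(was, min_sup_ratio):
    """Events that occur in at least min_sup_ratio of the sequences."""
    counts = Counter()
    for seq in was.values():
        counts.update(set(seq))          # count each event once per sequence
    threshold = math.ceil(min_sup_ratio * len(was))
    return {e for e, c in counts.items() if c >= threshold}

def frequent_subsequences(was, frequent):
    """Drop non-frequent events from every sequence, keeping the order."""
    return {uid: "".join(e for e in seq if e in frequent) for uid, seq in was.items()}

freq = frequent_events(WAS, 0.75)
print(sorted(freq))                       # ['a', 'b', 'c']
print(frequent_subsequences(WAS, freq))
# {100: 'abac', 200: 'abcac', 300: 'babac', 400: 'abacc'}
```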
38WAP-Tree Construction
- Using frequent events to register all count
information for further mining
User ID Frequent Subsequence
100 abac
200 abcac
300 babac
400 abacc
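The counting part of the tree can be sketched as a prefix tree in which each node records how many frequent subsequences pass through it. This is a simplified sketch: the header table and node links that a WAP-tree uses for mining are omitted, and the class and function names are hypothetical.

```python
class WapNode:
    """One node of a simplified WAP-tree: an event label, a count, and children."""
    def __init__(self, event):
        self.event = event
        self.count = 0
        self.children = {}

def build_wap_tree(sequences):
    """Insert every frequent subsequence into a shared prefix tree."""
    root = WapNode(None)
    for seq in sequences:
        node = root
        for event in seq:
            if event not in node.children:
                node.children[event] = WapNode(event)
            node = node.children[event]
            node.count += 1
    return root

tree = build_wap_tree(["abac", "abcac", "babac", "abacc"])
print(tree.children["a"].count, tree.children["b"].count)   # 3 1
```

Three of the four subsequences start with a, so the branch under a carries count 3, while the branch under b holds only the subsequence of user 300.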
39Mining Web Access Patterns from WAP-Tree
Conditional Sequence Based on c
Sequence Count
aba 2
ab 1
abca 1
ab -1
baba 1
abac 1
aba -1
Sequence Count
aba 1
abca 1
baba 1
abac 1
Item Sup Frequency
a 4
b 4
c 2
Generate Web Access Patterns ac, bc
40Mining Web Access Patterns from WAP-Tree
Conditional Sequence Based on ac
Sequence Count
ab 3
bab 1
Sequence Count
ab 3
b 1
bab 1
b -1
Item Sup Frequency
a 4
b 4
Generate Web Access Patterns aac, bac
41Mining Web Access Patterns from WAP-Tree
Conditional Sequence Based on bac
Sequence Count
a 3
ba 1
Item Sup Frequency
a 4
b 1
Generate Web Access Patterns abac
42Mining Web Access Patterns from WAP-Tree
Conditional Sequence Based on abac
Sequence Count
a 4
No Web Access Patterns are Generated
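The patterns produced by the conditional search can be checked independently with a brute-force support count. The sketch below is not WAP-mine; it only tests whether a pattern occurs as a (not necessarily consecutive) subsequence of each access sequence in the WAS table, under the assumption that this is what support means here.

```python
def is_subsequence(pattern, seq):
    """True if the events of pattern appear in seq in order, gaps allowed."""
    it = iter(seq)
    return all(event in it for event in pattern)

def support(pattern, sequences):
    """Number of sequences that contain pattern as a subsequence."""
    return sum(is_subsequence(pattern, s) for s in sequences)

was = ["abdac", "eaebcac", "babfaec", "afbacfc"]
for p in ["ac", "bc", "aac", "bac", "abac"]:
    print(p, support(p, was))
```

With a 75% threshold on four sequences, a pattern needs a support of at least 3 to qualify.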
43Mining for Web Transactions
- To capture Web customer buying behavior
- It is not just a market basket transaction for the set of items bought by a customer in a single purchase (Association Rules)
- It is not just Web user travel patterns (Path Traversal Patterns)
- It is an extension of path traversal patterns
- Exploring the relationship between traveling and buying
44Mining for Web Transactions
Web Transaction
Algorithm WR (Web-transaction-Record)
Web Transaction Records <Path: a Set of Purchases>
Algorithm WTM, MTSPJ, MTSPC
Frequent Transaction Patterns
Web Transaction Association Rules
45Mining for Web Transactions
- Web-transaction-Record (WR) Algorithm
- Extract meaningful Web transaction records from the given Web transaction
- WTM (Web Transaction Mining) Algorithm
- Mining Web Transaction Patterns
- MTS (Maximal Transaction Segment) Algorithms are improved versions of WTM
46Mining for Web Transactions
47Mining for Web Transactions
48WTM Algorithm
- It joins the purchased itemsets to generate candidate transaction patterns
- WTM employs a two-level hash tree, called the Web transaction tree, to store candidate transaction patterns
- WTM hashes not only each item but also each purchase in the path
49WTM Algorithm
50Support Count
WT_ID Path Purchase
100 ABCE Bi1, Ci2, Ei4
100 ABFGH Bi1, Hi6
100 ASJL Si7, Li9
200 ABCE Bi1, Ci2, Ei4
200 ASJLQ Si7, Qi10
Path Purchase Support Count
AB Bi1 2
ABC Ci2 2
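The support counts in this example can be reproduced with a short sketch. Two assumptions are made here and are not stated on the slide: a label such as Bi1 is read as "item i1 purchased at node B", and support is counted once per Web transaction (WT_ID) rather than once per record. The record layout and function name are hypothetical.

```python
from collections import defaultdict

# (WT_ID, path, {node: purchased item}) as read from the table above.
records = [
    (100, "ABCE",  {"B": "i1", "C": "i2", "E": "i4"}),
    (100, "ABFGH", {"B": "i1", "H": "i6"}),
    (100, "ASJL",  {"S": "i7", "L": "i9"}),
    (200, "ABCE",  {"B": "i1", "C": "i2", "E": "i4"}),
    (200, "ASJLQ", {"S": "i7", "Q": "i10"}),
]

def support_counts(records):
    """Map (path up to the purchase node, purchase) to the number of distinct WT_IDs."""
    seen = defaultdict(set)
    for wt_id, path, purchases in records:
        for node, item in purchases.items():
            prefix = path[: path.index(node) + 1]   # path from the root to the node
            seen[(prefix, node + item)].add(wt_id)
    return {key: len(ids) for key, ids in seen.items()}

counts = support_counts(records)
print(counts[("AB", "Bi1")], counts[("ABC", "Ci2")])   # 2 2
```

Bi1 appears in three records but only in two Web transactions, which matches the count of 2 shown above.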
51WTM Algorithm
Support Count >= 2
C1
Path    Purchase  Sup.
AB      Bi1       3
ABC     Ci2       2
ABD     Di3       1
ABCE    Ei4       3
ABFG    Gi5       2
ABFGH   Hi6       1
AS      Si7       4
ASJ     Ji8       2
ASJL    Li9       2
ASJLQ   Qi10      2
T1
Path    Purchase  Sup.
AB      Bi1       3
ABC     Ci2       2
ABCE    Ei4       3
ABFG    Gi5       2
AS      Si7       4
ASJ     Ji8       2
ASJL    Li9       2
ASJLQ   Qi10      2
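Going from the candidate set C1 to the frequent set T1 is a plain threshold filter. The sketch below uses the rows of this slide and assumes the threshold means "support count of at least 2", which is what the surviving rows imply; the `prune` helper is a hypothetical name.

```python
# C1 rows as (path, purchase, support), copied from this slide.
C1 = [
    ("AB", "Bi1", 3), ("ABC", "Ci2", 2), ("ABD", "Di3", 1), ("ABCE", "Ei4", 3),
    ("ABFG", "Gi5", 2), ("ABFGH", "Hi6", 1), ("AS", "Si7", 4), ("ASJ", "Ji8", 2),
    ("ASJL", "Li9", 2), ("ASJLQ", "Qi10", 2),
]

def prune(candidates, min_support):
    """Keep the candidates whose support count reaches the threshold."""
    return [row for row in candidates if row[2] >= min_support]

T1 = prune(C1, min_support=2)
print(len(T1))   # 8
```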
52WTM Algorithm
Support Count >= 2
53WTM Algorithm
Support Count >= 2
C3
Path    Purchase      Sup.
ABCE    Bi1 Ci2 Ei4   2
T3
Path    Purchase      Sup.
ABCE    Bi1 Ci2 Ei4   2
54WTM Disadvantages
- WTM may generate a lot of unqualified candidate transaction patterns because it does not utilize the paths of frequent transaction patterns
- This will degrade the performance
55MTSPJ Algorithm
- Algorithm MTSPJ uses the maximal transaction segment, which contains frequent transaction patterns and the maximal path, to solve the unqualified candidate transaction pattern problem
- MTSPJ generates candidate transaction patterns only when the leaf node of the Web transaction tree is reached
56MTSPJ Algorithm
57MTSPJ Algorithm
Support Count >= 2
C1
Path    Purchase  Sup.
AB      Bi1       3
ABC     Ci2       2
ABCD    Di3       1
ABCE    Ei4       3
ABFG    Gi5       2
ABFGH   Hi6       1
AS      Si7       4
ASJ     Ji8       2
ASJL    Li9       2
ASJLQ   Qi10      2
T1
Path    Purchase  Sup.
AB      Bi1       3
ABC     Ci2       2
ABCE    Ei4       3
ABFG    Gi5       2
AS      Si7       4
ASJ     Ji8       2
ASJL    Li9       2
ASJLQ   Qi10      2
58MTSPJ Algorithm
C2
Path    Purchase   Sup.
ABC     Bi1 Ci2    2
ABCE    Bi1 Ei4    3
ABCE    Ci2 Ei4    2
ABFG    Bi1 Gi5    1
ASJ     Si7 Ji8    2
ASJL    Si7 Li9    2
ASJL    Ji8 Li9    1
ASJLQ   Si7 Qi10   2
ASJLQ   Ji8 Qi10   1
ASJLQ   Li9 Qi10   0
59MTSPJ Algorithm
C2
Path    Purchase   Sup.
ABC     Bi1 Ci2    2
ABCE    Bi1 Ei4    3
ABCE    Ci2 Ei4    2
ABFG    Bi1 Gi5    1
ASJ     Si7 Ji8    2
ASJL    Si7 Li9    2
ASJL    Ji8 Li9    1
ASJLQ   Si7 Qi10   2
ASJLQ   Ji8 Qi10   1
ASJLQ   Li9 Qi10   0
T2
Path    Purchase   Sup.
ABC     Bi1 Ci2    2
ABCE    Bi1 Ei4    3
ABCE    Ci2 Ei4    2
ASJ     Si7 Ji8    2
ASJL    Si7 Li9    2
ASJLQ   Si7 Qi10   2
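The ten rows of C2 are exactly the pairs of T1 purchases whose paths lie on one common traversal path, that is, where one path is a prefix of the other. The sketch below reproduces the table that way. It is only an observation about this example, written as a hypothetical `join_on_path` helper; it does not claim to be the MTSPJ procedure, which works on maximal transaction segments.

```python
from itertools import combinations

# T1 rows as (path, purchase), taken from the earlier T1 table.
T1 = [("AB", "Bi1"), ("ABC", "Ci2"), ("ABCE", "Ei4"), ("ABFG", "Gi5"),
      ("AS", "Si7"), ("ASJ", "Ji8"), ("ASJL", "Li9"), ("ASJLQ", "Qi10")]

def join_on_path(t1):
    """Pair two purchases when one path is a proper prefix of the other."""
    candidates = []
    for a, b in combinations(t1, 2):
        short, long_ = sorted([a, b], key=lambda row: len(row[0]))
        if long_[0] != short[0] and long_[0].startswith(short[0]):
            # The candidate lives on the longer path and holds both purchases.
            candidates.append((long_[0], (short[1], long_[1])))
    return candidates

C2 = join_on_path(T1)
print(len(C2))   # 10
```

Purchases on diverging branches, such as Ei4 on ABCE and Gi5 on ABFG, are never paired, which is why no such row appears in C2.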
60MTSPJ Algorithm
61MTSPC Algorithm
MTSPC utilizes the LC (Large Count) to Filter
Candidates
Support Count >= 2
C1
Path    Purchase  Sup.
AB      Bi1       3
ABC     Ci2       2
ABCD    Di3       1
ABCE    Ei4       3
ABFG    Gi5       2
ABFGH   Hi6       1
AS      Si7       4
ASJ     Ji8       2
ASJL    Li9       2
ASJLQ   Qi10      2
T1
Path    Purchase  Sup.
AB      Bi1       3
ABC     Ci2       2
ABCE    Ei4       3
ABFG    Gi5       2
AS      Si7       4
ASJ     Ji8       2
ASJL    Li9       2
ASJLQ   Qi10      2
62MTSPC Algorithm
Maximal Transaction Segment
Maximal Path  Item  LC
ASJLQ         Si7   1
ASJLQ         Ji8   1
ASJLQ         Li9   1
ASJLQ         Qi10  1
I = 4 > 1
I = 3 > 1 (K-1)
Maximal Transaction Segment
Maximal Path  Item  LC
ABFG          Bi1   1
ABFG          Gi5   1
I = 2 > 1
C2
Path    Purchase   Sup.
ASJ     Si7 Ji8    2
ASJL    Si7 Li9    2
ASJL    Ji8 Li9    1
ASJLQ   Si7 Qi10   2
ASJLQ   Ji8 Qi10   1
ASJLQ   Li9 Qi10   0
63MTSPC Algorithm
C2
Path    Purchase   Sup.
ABC     Bi1 Ci2    2
ABCE    Bi1 Ei4    3
ABCE    Ci2 Ei4    2
ABFG    Bi1 Gi5    1
ASJ     Si7 Ji8    2
ASJL    Si7 Li9    2
ASJL    Ji8 Li9    1
ASJLQ   Si7 Qi10   2
ASJLQ   Ji8 Qi10   1
ASJLQ   Li9 Qi10   0
T2
Path    Purchase   Sup.
ABC     Bi1 Ci2    2
ABCE    Bi1 Ei4    3
ABCE    Ci2 Ei4    2
ASJ     Si7 Ji8    2
ASJL    Si7 Li9    2
ASJLQ   Si7 Qi10   2
64MTSPC Algorithm
Maximal Transaction Segment
Maximal Path  Item  LC
ASJLQ         Si7   3
ASJLQ         Ji8   1
ASJLQ         Li9   1
ASJLQ         Qi10  1
T2
I = 3 > 2
Path    Purchase   Sup.
ABC     Bi1 Ci2    2
ABCE    Bi1 Ei4    3
ABCE    Ci2 Ei4    2
ASJ     Si7 Ji8    2
ASJL    Si7 Li9    2
ASJLQ   Si7 Qi10   2
I = 1 < 2
No Generations
65Mining for Web Transactions
- <ABCE: B1, E4> = 2
- <AB: B1> = 3
- We can derive <ABCE: B1 => E4>
- support_count(<ABCE: B1 => E4>) = 2
- confidence(<ABCE: B1 => E4>) = support_count(<ABCE: B1, E4>) / support_count(<AB: B1>) = 2/3
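The confidence of the derived rule follows from the two support counts listed on this slide. The sketch below assumes the usual association-rule definition, confidence = support of the whole pattern divided by support of the antecedent; the slide itself does not spell the formula out, and the function name is hypothetical.

```python
from fractions import Fraction

def confidence(support_pattern, support_antecedent):
    """Confidence of a rule as an exact fraction of two support counts."""
    return Fraction(support_pattern, support_antecedent)

# Pattern on ABCE with B1 and E4 has support 2; antecedent B1 on AB has support 3.
c = confidence(2, 3)
print(c, float(c))   # 2/3, about 0.667
```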
66Summary
- Data mining in the Web is an area of growing importance
- In particular, with the emergence of EC (electronic commerce)
- More and more applications will benefit from the knowledge gained from data mining
- Web Mining = Web Data Collection + Traditional Data Mining?
- Important Issues
- Incremental Web Mining