Title: Algorithms for Webpage Traversal Pattern Mining
1Algorithms for Webpage Traversal Pattern Mining
- Spring 2002
- CSE791 Final Project
- by Dalei Xing
2Highlights
- What webpage traversal pattern mining is, and its motivations.
- Problem description.
- Two algorithms for the two key steps in webpage traversal pattern mining:
- Finding Maximal Forward References (the MF algorithm).
- Finding Frequent Reference Sequences (the FS algorithm).
- Demo of a simple MF algorithm implementation.
3What is webpage traversal pattern mining?
- Association rule mining applied to people's internet browsing activities.
- In the WWW environment, users access information of interest and travel from one object to another via the links provided. We want to capture the regularities in these user access patterns.
4Motivations
- The WWW is becoming increasingly ubiquitous.
- Websites are becoming increasingly complex.
- Service providers and online businesses want to track user browsing habits to improve their services and increase profits. For example:
- More efficient access between highly correlated webpages.
- Better customer classification and behavior analysis.
5Problem Description
- Finding maximal forward references
- Data preprocessing.
- There are things we do not need.
- Generate a transactional database.
- Determine frequent reference sequences.
6Some Terms
- Traversal path: A, B, C, D, E, D, C, B, F, A, G, H, G, I, J, I, K
- Maximal forward references: ABCDE, ABF, AGH, AGIJ, AGIK
7Some Terms (Continued)
- Frequent reference sequence
- A frequent reference sequence is a reference sequence that occurs at least as many times as the minimum support requires. (Compare with frequent itemsets in the textbook.)
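The definition above can be illustrated with a minimal Python sketch of support counting. It assumes that a reference sequence is counted once for each maximal forward reference that contains it as a run of consecutive pages. The function names (`contains`, `support`, `is_frequent`) are illustrative and not taken from the project code; the sample database reuses the maximal forward references from the "Some Terms" slide.

```python
def contains(reference, sequence):
    """Return True if `sequence` occurs as consecutive pages inside `reference`."""
    k = len(sequence)
    return any(
        tuple(reference[i:i + k]) == tuple(sequence)
        for i in range(len(reference) - k + 1)
    )


def support(database, sequence):
    """Count the maximal forward references that contain `sequence`."""
    return sum(1 for reference in database if contains(reference, sequence))


def is_frequent(database, sequence, min_support):
    """A sequence is frequent if its support reaches the minimum support."""
    return support(database, sequence) >= min_support


# Maximal forward references from the example traversal path.
database = ["ABCDE", "ABF", "AGH", "AGIJ", "AGIK"]
print(support(database, "AG"))         # pages A then G, visited consecutively
print(is_frequent(database, "GI", 2))  # minimum support of 2 (an assumed value)
```

Requiring consecutive pages is what separates this from plain itemset counting: "AI" has no support here even though A and I both appear in AGIJ.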
8Algorithm MF
- Raw input data comes from log files, which can be collected at the client, proxy, or server level.
- MF is applied to each user's log entries to find that user's Maximal Forward References.
9Algorithm MF (Continued)
Input: A, B, C, D, E, D, C, B, F, A, G, H, G, I, J, I, K
Output: ABCDE, ABF, AGH, AGIJ, AGIK
10Algorithm MF (Continued)
- Step 1: Initialization
- Set the string S to null and the forward flag to true. Start scanning the database.
- Step 2: Forward reference
- Keep the flag set to true and keep appending the forward visits to S.
- Step 3: Backward reference
- If the flag is still true, write S to the result database as a maximal forward reference. Set the flag to false and discard the backward nodes from S until a new node is encountered, then go to Step 2.
- Steps 2 and 3 are iterated until the end of the log is reached.
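The steps above can be sketched in Python as follows. This is a minimal illustration, not the demo implementation from the project: the function name `maximal_forward_references` is hypothetical, a page is assumed to be a backward reference whenever it already appears on the current forward path, and the path left over at the end of the log is assumed to be written out as a final maximal forward reference.

```python
def maximal_forward_references(path):
    """Split one user's traversal path into maximal forward references."""
    results = []
    current = []    # the string S: pages on the current forward path
    forward = True  # the forward flag

    for page in path:
        if page in current:
            # Step 3: backward reference. Emit S only when the previous
            # move was forward, so a run of backward moves emits it once.
            if forward:
                results.append(tuple(current))
            # Discard the pages that come after the revisited page.
            current = current[: current.index(page) + 1]
            forward = False
        else:
            # Step 2: forward reference. Extend S with the new page.
            current.append(page)
            forward = True

    # End of log: the remaining forward path is also maximal (assumption).
    if forward and current:
        results.append(tuple(current))
    return results


path = "A,B,C,D,E,D,C,B,F,A,G,H,G,I,J,I,K".split(",")
print(["".join(ref) for ref in maximal_forward_references(path)])
# prints ['ABCDE', 'ABF', 'AGH', 'AGIJ', 'AGIK']
```

Pages are kept as tuple elements rather than concatenated characters, so the same sketch works when page identifiers are full URLs.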
11Finding Frequent k-Reference Sequences
- The idea of the Full-Scan (FS) algorithm
- The full-scan algorithm uses hashing and pruning, while resolving the discrepancy between traversal patterns and association rules. Although FS trims the transaction database as it proceeds to later passes, it still requires a scan of the database in each pass.
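The pass-by-pass structure can be sketched in Python as below. This is a simplified level-wise search, not the FS algorithm itself: it omits the hashing step and the database trimming, and keeps only the full scan per pass. It assumes that candidates of length k are built by joining two frequent sequences of length k-1 that overlap in k-2 consecutive pages, and that support is counted over consecutive pages. The function name `frequent_reference_sequences` is illustrative.

```python
from collections import Counter


def frequent_reference_sequences(database, min_support):
    """Return {sequence: support} for every frequent reference sequence."""
    refs = [tuple(ref) for ref in database]

    # Pass 1: count single pages, once per maximal forward reference.
    counts = Counter(page for ref in refs for page in set(ref))
    frequent = {(p,): c for p, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        # Join step: s and t combine when s without its first page
        # equals t without its last page.
        candidates = {
            s + (t[-1],)
            for s in frequent
            for t in frequent
            if s[1:] == t[:-1]
        }
        # Full scan of the database for this pass.
        counts = Counter()
        for ref in refs:
            windows = {ref[i:i + k] for i in range(len(ref) - k + 1)}
            counts.update(windows & candidates)
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result


database = ["ABCDE", "ABF", "AGH", "AGIJ", "AGIK"]
found = frequent_reference_sequences(database, min_support=2)
for seq in sorted(found, key=lambda s: (len(s), s)):
    print("".join(seq), found[seq])
```

With a minimum support of 2 (an assumed value), the longest frequent sequence in this small database is AGI, which appears in both AGIJ and AGIK.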
12MF Implementation Demo
13Conclusion
- Hopefully, the following key points were conveyed:
- Webpage access pattern mining is a special kind of association rule mining.
- The key steps are finding maximal forward references and frequent k-reference sequences.
14References
1. Ming-Syan Chen, Jong Soo Park, Philip S. Yu. Data Mining for Path Traversal Patterns in a Web Environment (1995).
2. Ming-Syan Chen, Jong Soo Park, Philip S. Yu. An Effective Hash-Based Algorithm for Mining Association Rules (1996).
3. Behzad Mortazavi-Asl. Discovering and Mining User Web-page Traversal Patterns (1999).
Questions?