Algorithms for Webpage Traversal Pattern Mining

Transcript and Presenter's Notes

1
Algorithms for Webpage Traversal Pattern Mining

Spring 2002
CSE791 Final Project
by Dalei Xing

2
Highlights

What webpage traversal pattern mining is, and its motivations.
Problem description.
Two algorithms for the two key steps in webpage traversal pattern mining:
  Finding Maximal Forward References (the MF algorithm).
  Finding Frequent Reference Sequences (the FS algorithm).
Demo of a simple MF algorithm implementation.

3
What is webpage traversal pattern mining?

Association rule mining applied to people's internet browsing activity.
In the WWW environment, users access information of interest and travel from one
object to another via the facilities provided (such as hyperlinks). We want to
capture the regularities in user access patterns.

4
Motivations

The WWW's ubiquity is increasing.
The complexity of websites is increasing.
Service providers and online businesses want to track user browsing habits to
improve their services and increase profits. For example:
More efficient access between highly correlated webpages.
Better customer classification and behavior analysis.

5
Problem Description

Find maximal forward references.
Data preprocessing: the raw logs contain information we do not need.
Generate a transactional database.
Determine frequent reference sequences.

6
Some Terms

Traversal path and maximal forward reference
Traversal path: A,B,C,D,E,D,C,B,F,A,G,H,G,I,J,I,K
Maximal forward references: ABCDE, ABF, AGH, AGIJ, AGIK

7
Some Terms (Continued)

Frequent reference sequence
A frequent reference sequence is a reference sequence that occurs at least as
many times as the minimum support threshold.
(Compare with frequent itemsets in the textbook.)
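
The definition above can be illustrated with a minimal sketch. It assumes that a reference sequence "occurs" in a maximal forward reference when it appears there as a consecutive run of pages; the slide does not spell out the containment rule, so this is an assumption. The function names and the tiny database (the five maximal forward references from the earlier example) are hypothetical and used only for illustration.

```python
def contains_consecutively(reference, sequence):
    """True if `sequence` appears as a consecutive run inside `reference`."""
    n, k = len(reference), len(sequence)
    return any(reference[i:i + k] == sequence for i in range(n - k + 1))


def support(database, sequence):
    """Number of maximal forward references that contain `sequence`."""
    return sum(contains_consecutively(ref, sequence) for ref in database)


def is_frequent(database, sequence, min_support):
    """A sequence is frequent when its support reaches the minimum support."""
    return support(database, sequence) >= min_support


# Hypothetical database: the maximal forward references from the example.
database = ["ABCDE", "ABF", "AGH", "AGIJ", "AGIK"]
print(support(database, "AG"))         # 3
print(is_frequent(database, "AG", 3))  # True
print(is_frequent(database, "AB", 3))  # False
```

In practice the database would hold the maximal forward references of many users, not just one path.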

8
Algorithm MF

The raw input data comes from log files, which can be collected at the client,
proxy, or server level.
MF is applied to the log of each user to find that user's maximal forward
references.

9
Algorithm MF (Continued)

Input: A,B,C,D,E,D,C,B,F,A,G,H,G,I,J,I,K
Output: ABCDE, ABF, AGH, AGIJ, AGIK

10
Algorithm MF (Continued)

Step 1: Initialization
Set S to null and the forward flag to true. Start scanning the database.
Step 2: Forward references
Set the flag to true and keep appending the forward visits to S.
Step 3: A backward reference occurs
Set the flag to false. Write the maximal forward reference to the result
database and discard the backward nodes from S until a new node is encountered,
then go to Step 2.
Steps 2 and 3 are repeated until the end of the log is reached.
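
The three steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the demo implementation from the presentation: the function name is hypothetical, one user's log is assumed to be an in-memory list of page identifiers, and a backward reference is detected as a visit to a page that is already in S.

```python
def maximal_forward_references(path):
    """Split one user's traversal path into maximal forward references."""
    current = []     # S in the slides: the forward path built so far
    forward = True   # the forward flag from Step 1
    results = []
    for page in path:
        if page in current:
            # Step 3: backward reference. Emit S only on the first step
            # back, then drop every page visited after the revisited one.
            if forward:
                results.append(list(current))
            del current[current.index(page) + 1:]
            forward = False
        else:
            # Step 2: forward reference, so extend S.
            current.append(page)
            forward = True
    # End of log: the path still being extended is also maximal.
    if forward and current:
        results.append(list(current))
    return results


log = list("ABCDEDCBFAGHGIJIK")
print(["".join(ref) for ref in maximal_forward_references(log)])
# ['ABCDE', 'ABF', 'AGH', 'AGIJ', 'AGIK']
```

The flag keeps a run of consecutive backward moves (E to D to C to B) from emitting the same reference more than once.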

11
Finding Frequent k-Reference Sequences

The Idea of the Full-Scan (FS) Algorithm
The full-scan algorithm essentially uses hashing and pruning while resolving
the discrepancy between traversal patterns and association rules. Although FS
trims the transaction database as it proceeds to later passes, it still
requires a scan of the database in each pass.
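
A simplified sketch of the level-wise, scan-per-pass idea is given below. It is not the FS algorithm itself: the hashing step is omitted, candidates are generated by a plain join of frequent (k-1)-sequences whose overlapping parts agree, and a sequence is assumed to be supported by a maximal forward reference only when it appears there as a consecutive run. The function name and the sample database are hypothetical.

```python
from collections import Counter


def frequent_reference_sequences(database, min_support):
    """Level-wise search; returns {sequence (as a tuple): support}."""
    database = [tuple(ref) for ref in database]
    frequent = {}
    candidates = None  # None means "count every page" on the first pass
    k = 1
    while database and (candidates is None or candidates):
        # One full scan of the (trimmed) database per pass.
        counts = Counter()
        for ref in database:
            windows = {ref[i:i + k] for i in range(len(ref) - k + 1)}
            if candidates is not None:
                windows &= candidates  # count candidate sequences only
            counts.update(windows)
        level = {seq: n for seq, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Join s and t when s without its first page equals t without its last.
        candidates = {s + t[-1:] for s in level for t in level
                      if s[1:] == t[:-1]}
        # Trim: references of length k cannot contain a (k+1)-sequence.
        database = [ref for ref in database if len(ref) > k]
        k += 1
    return frequent


refs = ["ABCDE", "ABF", "AGH", "AGIJ", "AGIK"]
for seq, n in sorted(frequent_reference_sequences(refs, 2).items()):
    print("".join(seq), n)
```

With a minimum support of 2, the longest frequent sequence found in this sample is AGI, which occurs in AGIJ and AGIK.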

12
MF Implementation Demo

13
Conclusion

Hopefully, the following key points have been conveyed:
Webpage access pattern mining is a special kind of association rule mining.
The key steps are finding maximal forward references and finding frequent
k-reference sequences.

14
References
1. Ming-Syan Chen, Jong Soo Park, Philip S. Yu. Data Mining for Path Traversal
Patterns in a Web Environment (1995).
2. Ming-Syan Chen, Jong Soo Park, Philip S. Yu. An Effective Hash-Based
Algorithm for Mining Association Rules (1996).
3. Behzad Mortazavi-Asl. Discovering and Mining User Web-page Traversal
Patterns (1999).

Questions?
