Title: Web Mining: A Bird's Eye View
1Web Mining: A Bird's Eye View
- Sanjay Kumar Madria
- Department of Computer Science
- University of Missouri-Rolla, MO 65401
- madrias_at_umr.edu
2Web Mining
- Web mining: data mining techniques to
automatically discover and extract information
from Web documents/services (Etzioni, 1996).
- Web mining research integrates research from
several research communities (Kosala and
Blockeel, July 2000) such as
- Database (DB)
- Information retrieval (IR)
- The sub-areas of machine learning (ML)
- Natural language processing (NLP)
3Mining the World-Wide Web
- WWW is a huge, widely distributed, global
information source for
- Information services: news, advertisements,
consumer information, financial management,
education, government, e-commerce, etc.
- Hyper-link information
- Access and usage information
- Web site contents and organization
4Mining the World-Wide Web
- Growing and changing very rapidly
- Broad diversity of user communities
- Only a small portion of the information on the
Web is truly relevant or useful to Web users
- How to find high-quality Web pages on a specified
topic?
- WWW provides rich sources for data mining
5Challenges on WWW Interactions
- Finding Relevant Information
- Creating knowledge from the available information
- Personalization of the information
- Learning about customers / individual users
- Web Mining can play an important Role!
6Web Mining is more challenging
- Searches for
- Web access patterns
- Web structures
- Regularity and dynamics of Web contents
- Problems
- The abundance problem
- Limited coverage of the Web: hidden Web sources,
majority of data in DBMSs
- Limited query interfaces: based on keyword-oriented
search
- Limited customization to individual users
- Dynamic and semistructured content
7Web Mining Subtasks
- Resource Finding
- Task of retrieving intended web-documents
- Information Selection and Pre-processing
- Automatic selection and pre-processing of specific
information from retrieved web resources
- Generalization
- Analysis
- Validation and / or interpretation of mined
patterns
8Web Mining Taxonomy
Web Mining
Web Content Mining
Web Usage Mining
Web Structure Mining
9Web Content Mining
- Discovery of useful information from web contents
/ data / documents
- Web data contents: text, image, audio, video,
metadata and hyperlinks
- Information Retrieval View (Structured and
Semi-Structured)
- Assist / improve information finding
- Filter information to users based on user profiles
- Database View
- Model data on the web
- Integrate them for more sophisticated queries
10Issues in Web Content Mining
- Developing intelligent tools for IR
- Finding keywords and key phrases
- Discovering grammatical rules and collocations
- Hypertext classification/categorization
- Extracting key phrases from text documents
- Learning extraction models/rules
- Hierarchical clustering
- Predicting (word) relationships
11Cont.
- Developing Web query systems
- WebOQL, XML-QL
- Mining multimedia data
- Mining images from satellites (Fayyad, et al., 1996)
- Mining images to identify small volcanoes on
Venus (Smyth, et al., 1996)
12Web Structure Mining
- To discover the link structure of the hyperlinks
at the inter-document level to generate a
structural summary about the Website and Web
pages
- Direction 1: based on the hyperlinks, categorizing
the Web pages and generating related information
- Direction 2: discovering the structure of the Web
document itself
- Direction 3: discovering the nature of the
hierarchy or network of hyperlinks in the Website
of a particular domain
13Web Structure Mining
- Finding authoritative Web pages
- Retrieving pages that are not only relevant, but
also of high quality, or authoritative on the
topic
- Hyperlinks can infer the notion of authority
- The Web consists not only of pages, but also of
hyperlinks pointing from one page to another
- These hyperlinks contain an enormous amount of
latent human annotation
- A hyperlink pointing to another Web page can be
considered as the author's endorsement of that
page
14Web Structure Mining
- Web pages categorization (Chakrabarti, et al.,
1998)
- Discovering micro-communities on the web
- Example: Clever system (Chakrabarti, et al.,
1999), Google (Brin and Page, 1998)
- Schema discovery in semistructured environments
15Web Usage Mining
- Web usage mining, also known as Web log mining
- applying mining techniques to discover interesting
usage patterns from the secondary data derived from
the interactions of users while surfing the web
16Web Usage Mining
- Applications
- Target potential customers for electronic
commerce
- Enhance the quality and delivery of Internet
information services to the end user
- Improve Web server system performance
- Identify potential prime advertisement locations
- Facilitate personalization/adaptive sites
- Improve site design
- Fraud/intrusion detection
- Predict users' actions (allows prefetching)
18Problems with Web Logs
- Identifying users
- Clients may have multiple streams
- Clients may access web from multiple hosts
- Proxy servers: many clients/one address
- Proxy servers: one client/many addresses
- Data not in log
- POST data (i.e., CGI request) not recorded
- Cookie data stored elsewhere
19Cont
- Missing data
- Pages may be cached
- Referring page requires client cooperation
- When does a session end?
- Use of forward and backward pointers
- Typically a 30 minute timeout is used
- Web content may be dynamic
- May not be able to reconstruct what the
user saw - Use of spiders and automated agents automatic
request we pages
20Cont
- Like most data mining tasks, web log mining
requires preprocessing
- To identify users
- To match sessions to other data
- To fill in missing data
- Essentially, to reconstruct the click stream
21Log Data - Simple Analysis
- Statistical analysis of users
- Length of path
- Viewing time
- Number of page views
- Statistical analysis of site
- Most common pages viewed
- Most common invalid URL
22Web Log Data Mining Applications
- Association rules
- Find pages that are often viewed together
- Clustering
- Cluster users based on browsing patterns
- Cluster pages based on content
- Classification
- Relate user attributes to patterns
23Web Logs
- Web servers have the ability to log all requests
- Web server log formats
- Most use the Common Log Format (CLF)
- The newer Extended Log Format allows
configuration of the log file
- Generate vast amounts of data
24Common Log Format
- Remotehost: browser hostname or IP
- Remote logname of user (almost always "-",
meaning "unknown")
- Authuser: authenticated username
- Date: date and time of the request
- Request: the exact request line from the client
- Status: the HTTP status code returned
- Bytes: the content-length of the response
25Server Logs
26Fields
- Client IP: 128.101.228.20
- Authenticated User ID: - -
- Time/Date: [10/Nov/1999:10:16:39 -0600]
- Request: "GET / HTTP/1.0"
- Status: 200
- Bytes: -
- Referrer: -
- Agent: "Mozilla/4.61 [en] (WinNT; I)"
27Web Usage Mining
- Commonly used approaches (Borges and Levene,
1999)
- Map the log data into relational tables before an
adapted data mining technique is performed
- Use the log data directly by utilizing special
pre-processing techniques
- Typical problems
- Distinguishing among unique users, server
sessions, episodes, etc. in the presence of
caching and proxy servers (McCallum, et al., 2000;
Srivastava, et al., 2000)
28Request
- Method: GET
- Other common methods are POST and HEAD
- URI: /
- This is the file that is being accessed. When a
directory is specified, it is up to the server to
decide what to return. Usually, it will be the
file named index.html or home.html
- Protocol: HTTP/1.0
29Status
- Status codes are defined by the HTTP protocol
- Common codes include
- 200: OK
- 3xx: some sort of redirection
- 4xx: some sort of client error
- 5xx: some sort of server error
31Web Mining Taxonomy
32Mining the World Wide Web
Web Content Mining
Web Structure Mining
Web Usage Mining
- Web Page Content Mining
- Web Page Summarization
- WebOQL (Mendelzon et al., 1998): Web structuring
query languages; can identify information within
given web pages
- (Etzioni et al., 1997): uses heuristics to
distinguish personal home pages from other web
pages
- ShopBot (Etzioni et al., 1997): looks for product
prices within web pages
General Access Pattern Tracking
Customized Usage Tracking
Search Result Mining
33Mining the World Wide Web
Web Content Mining
Web Structure Mining
Web Usage Mining
Web Page Content Mining
- Search Result Mining
- Search Engine Result Summarization
- Clustering Search Results (Leouski and Croft,
1996; Zamir and Etzioni, 1997)
- Categorizes documents using phrases in titles and
snippets
General Access Pattern Tracking
Customized Usage Tracking
34Mining the World Wide Web
Web Content Mining
Web Usage Mining
- Web Structure Mining
- Using Links
- PageRank (Brin et al., 1998)
- CLEVER (Chakrabarti et al., 1998)
- Use interconnections between web pages to give
weight to pages
- Using Generalization
- MLDB (1994)
- Uses a multi-level database representation of the
Web; counters (popularity) and link lists are
used for capturing structure
General Access Pattern Tracking
Search Result Mining
Web Page Content Mining
Customized Usage Tracking
35Mining the World Wide Web
Web Structure Mining
Web Content Mining
Web Usage Mining
Web Page Content Mining
Customized Usage Tracking
- General Access Pattern Tracking
- Web Log Mining (Zaïane, Xin and Han, 1998)
- Uses KDD techniques to understand general access
patterns and trends
- Can shed light on better structure and grouping
of resource providers
Search Result Mining
36Mining the World Wide Web
Web Usage Mining
Web Structure Mining
Web Content Mining
- Customized Usage Tracking
- Adaptive Sites (Perkowitz and Etzioni, 1997)
- Analyzes the access patterns of one user at a time.
- Web site restructures itself automatically by
learning from user access patterns.
General Access Pattern Tracking
Web Page Content Mining
Search Result Mining
37Web Content Mining
- Agent-based Approaches
- Intelligent Search Agents
- Information Filtering/Categorization
- Personalized Web Agents
- Database Approaches
- Multilevel Databases
- Web Query Systems
38Intelligent Search Agents
- Locating documents and services on the Web
- WebCrawler, AltaVista (http://www.altavista.com):
scan millions of Web documents and create an index
of words (too many irrelevant, outdated responses)
- MetaCrawler: mines robot-created indices
- ShopBot: retrieves product information from a
variety of vendor sites using only general
information about the product domain
39Intelligent Search Agents (Contd)
- Rely either on pre-specified domain information
about particular types of documents, or on hard
coded models of the information sources to
retrieve and interpret documents - Harvest
- FAQ-Finder
- Information Manifold
- OCCAM
- Parasite
- ILA (Internet Learning Agent): learns models of
various information sources and translates these
into its own concept hierarchy
40Information Filtering/Categorization
- Using various information retrieval techniques
and characteristics of open hypertext Web
documents to automatically retrieve, filter, and
categorize them. - HyPursuit uses semantic information embedded in
link structures and document content to create
cluster hierarchies of hypertext documents, and
structure an information space - BO (Bookmark Organizer) combines hierarchical
clustering techniques and user interaction to
organize a collection of Web documents based on
conceptual information
41Personalized Web Agents
- This category of Web agents learns user
preferences and discovers Web information sources
based on these preferences, and those of other
individuals with similar interests (using
collaborative filtering)
- WebWatcher
- PAINT
- Syskill & Webert
- GroupLens
- Firefly
- others
42Multiple Layered Web Architecture
(Diagram: layered architecture, from Layer0 at the bottom, through Layer1 with generalized descriptions, up to Layern with more generalized descriptions)
43Multilevel Databases
- At the higher levels, meta-data or generalizations
are
- extracted from lower levels
- organized in structured collections, e.g., a
relational or object-oriented database
- At the lowest level, semi-structured information
is
- stored in various Web repositories, such as
hypertext documents
44Multilevel Databases (Contd)
- (Han, et al.)
- use a multi-layered database where each layer is
obtained via generalization and transformation
operations performed on the lower layers
- (Khosla, et al.)
- propose the creation and maintenance of
meta-databases at each information-providing
domain and the use of a global schema for the
meta-database
45Multilevel Databases (Contd)
- (King, et al.)
- propose the incremental integration of a portion
of the schema from each information source,
rather than relying on a global heterogeneous
database schema - The ARANEUS system
- extracts relevant information from hypertext
documents and integrates these into higher-level
derived Web Hypertexts which are generalizations
of the notion of database views
46Multi-Layered Database (MLDB)
- A multiple layered database model
- based on semi-structured data hypothesis
- queried by NetQL using a syntax similar to the
relational language SQL - Layer-0
- An unstructured, massive, primitive, diverse
global information-base. - Layer-1
- A relatively structured, descriptor-like,
massive, distributed database obtained by data
analysis, transformation and generalization
techniques
- Tools to be developed for descriptor extraction
- Higher-layers
- Further generalization to form progressively
smaller, better structured, and less remote
databases for efficient browsing, retrieval, and
information discovery.
47Three major components in MLDB
- S (a database schema)
- outlines the overall database structure of the
global MLDB - presents a route map for data and meta-data
(i.e., schema) browsing - describes how the generalization is performed
- H (a set of concept hierarchies)
- provides a set of concept hierarchies which
assist the system in generalizing lower-layer
information to higher layers and in mapping queries
to appropriate concept layers for processing
- D (a set of database relations)
- the whole global information base at the
primitive information level (i.e., layer-0) - the generalized database relations at the
nonprimitive layers
48The General Architecture of WebLogMiner (a Global MLDB)
(Diagram: Site 1, Site 2, and Site 3 feed resource discovery (MLDB); concept hierarchies and generalized data form the higher layers; knowledge discovery (WLM) produces characteristic rules, discriminant rules, and association rules)
49Techniques for Web usage mining
- Construct a multidimensional view of the Weblog
database
- Perform multidimensional OLAP analysis to find
the top N users, top N accessed Web pages, most
frequently accessed time periods, etc.
- Perform data mining on Weblog records
- Find association patterns, sequential patterns,
and trends of Web accessing
- May need additional information, e.g., user
browsing sequences of the Web pages in the Web
server buffer
- Conduct studies to
- Analyze system performance; improve system design
by Web caching, Web page prefetching, and Web
page swapping
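To make the "top N" style of analysis concrete, here is a minimal sketch that counts page, user, and time-period frequencies from already-cleaned log records. The record layout and the sample values are assumptions for illustration.

```python
from collections import Counter

# Hypothetical pre-cleaned log records: (client_ip, url, hour_of_day)
records = [
    ("128.101.228.20", "/products/", 10),
    ("128.101.228.20", "/products/software/", 10),
    ("160.94.10.2",    "/products/", 22),
    ("160.94.10.2",    "/special-offer.html", 22),
    ("128.101.228.20", "/products/", 11),
]

N = 2
print("Top pages:", Counter(url for _, url, _ in records).most_common(N))
print("Top users:", Counter(ip for ip, _, _ in records).most_common(N))
print("Top hours:", Counter(hour for _, _, hour in records).most_common(N))
```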
50Web Usage Mining - Phases
- Three distinctive phases: preprocessing, pattern
discovery, and pattern analysis
- Preprocessing: the process of converting the raw
data into the data abstraction necessary for
applying the data mining algorithm
- Resources: server-side, client-side, proxy
servers, or databases
- Raw data: Web usage logs, Web page descriptions,
Web site topology, user registries, and
questionnaires
- Conversion: content converting, structure
converting, usage converting
51- User: the principal using a client to
interactively retrieve and render resources or
resource manifestations
- Page view: visual rendering of a Web page in a
specific client environment at a specific point
in time
- Click stream: a sequential series of page view
requests
- User session: a delimited set of user clicks
(click stream) across one or more Web servers
- Server session (visit): a collection of user
clicks to a single Web server during a user
session
- Episode: a subset of related user clicks that
occur within a user session
52- Content Preprocessing: the process of converting
text, image, scripts and other files into the
forms that can be used by usage mining
- Structure Preprocessing: the structure of a
Website is formed by the hyperlinks between page
views; structure preprocessing can be done by
parsing and reformatting the information
- Usage Preprocessing: the most difficult task in
usage mining; data cleaning techniques are applied
to eliminate the impact of irrelevant items on the
analysis results
53Pattern Discovery
- Pattern discovery is the key component of Web
mining; it brings together algorithms and
techniques from data mining, machine learning,
statistics, pattern recognition and related
research areas
- Separate subsections: statistical analysis,
association rules, clustering, classification,
sequential patterns, dependency modeling
54- Statistical Analysis: analysts may perform
different kinds of descriptive statistical
analyses based on different variables when
analyzing the session file; these are powerful
tools for extracting knowledge about visitors to a
Web site
55- Association Rules: refer to sets of pages that
are accessed together with a support value
exceeding some specified threshold
- Clustering: a technique to group together users
or data items (pages) with similar
characteristics
- It can facilitate the development and execution
of future marketing strategies
- Classification: the technique of mapping a data
item into one of several predefined classes, which
helps to establish a profile of users belonging to
a particular class or category
56Pattern Analysis
- Pattern Analysis: the final stage of Web usage
mining
- To eliminate the irrelevant rules or patterns and
to extract the interesting rules or patterns from
the output of the pattern discovery process
- Analysis methodologies and tools: query
mechanisms like SQL, OLAP, visualization, etc.
58WUM Pre-Processing
- Data Cleaning
- Removes log entries that are not needed for the
mining process
- Data Integration
- Synchronizes data from multiple server logs,
metadata
- User Identification
- Associates page references with different users
- Session/Episode Identification
- Groups users' page references into user sessions
- Page View Identification
- Path Completion
- Fills in page references missing due to browser
and proxy caching
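A minimal data-cleaning sketch, assuming log entries have already been parsed into dictionaries: it keeps only successful GET requests for actual page views and drops embedded objects (images, style sheets). The exact filtering rules are site-specific assumptions, not prescribed by the slides.

```python
# Hypothetical parsed log entries
entries = [
    {"url": "/index.html",   "method": "GET", "status": 200},
    {"url": "/img/logo.gif", "method": "GET", "status": 200},
    {"url": "/style.css",    "method": "GET", "status": 200},
    {"url": "/missing.html", "method": "GET", "status": 404},
    {"url": "/products/",    "method": "GET", "status": 200},
]

IGNORED_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js")   # assumed set

def keep(entry):
    """Keep only successful GET requests that represent page views."""
    return (entry["method"] == "GET"
            and entry["status"] == 200
            and not entry["url"].lower().endswith(IGNORED_SUFFIXES))

cleaned = [e for e in entries if keep(e)]
print([e["url"] for e in cleaned])   # ['/index.html', '/products/']
```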
59WUM Issues in User Session Identification
- A single IP address is used by many users
- Different IP addresses in a single session
- Missing cache hits in the server logs
(Diagram: many different users reach a Web server through a single proxy server; a single user reaches Web servers through an ISP server under changing IP addresses)
60User and Session Identification Issues
- Distinguish among different users to a site
- Reconstruct the activities of the users within
the site - Proxy servers and anonymizers
- Rotating IP addresses for connections through ISPs
- Missing references due to caching
- Inability of servers to distinguish among
different visits
61WUM Solutions
- Remote Agent
- A remote agent is implemented as a Java applet
- It is loaded into the client only once, when the
first page is accessed
- The subsequent requests are captured and sent
back to the server
- Modified Browser
- The source code of an existing browser can be
modified to gather user-specific data at the
client side
- Dynamic page rewriting
- When the user first submits a request, the
server returns the requested page rewritten to
include a session-specific ID
- Each subsequent request will supply this ID to
the server
- Heuristics
- Use a set of assumptions to identify user
sessions and find the missing cache hits in the
server log
63WUM Heuristics
- The session identification heuristics
- Timeout: if the time between page requests
exceeds a certain limit, it is assumed that the
user is starting a new session
- IP/Agent: each different agent type for an IP
address represents a different session
- Referring page: if the referring page file for a
request is not part of an open session, it is
assumed that the request is coming from a
different session
- Same IP-Agent/different sessions (Closest):
assigns the request to the session that is
closest to the referring page at the time of the
request
- Same IP-Agent/different sessions (Recent): in
the case where multiple sessions are the same
distance from a page request, assigns the request
to the session with the most recent referrer
access in terms of time
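The sketch below combines the timeout and IP/Agent heuristics in the simplest possible way: requests are grouped by (IP, agent), and a new session starts when the gap between consecutive requests from the same pair exceeds 30 minutes. The request tuples and field layout are assumptions for illustration.

```python
TIMEOUT = 30 * 60   # 30-minute timeout, in seconds

# Hypothetical requests: (ip, agent, unix_time, url), sorted by time
requests = [
    ("1.2.3.4", "Mozilla/4.61", 1000, "/"),
    ("1.2.3.4", "Mozilla/4.61", 1200, "/products/"),
    ("1.2.3.4", "Lynx/2.8",     1300, "/"),                    # different agent -> new user
    ("1.2.3.4", "Mozilla/4.61", 1200 + 40 * 60, "/faq.html"),  # gap > 30 min -> new session
]

sessions = []
last_seen = {}   # (ip, agent) -> (timestamp of last request, session index)

for ip, agent, ts, url in requests:
    key = (ip, agent)
    if key in last_seen and ts - last_seen[key][0] <= TIMEOUT:
        idx = last_seen[key][1]          # continue the open session
    else:
        sessions.append([])              # start a new session
        idx = len(sessions) - 1
    sessions[idx].append(url)
    last_seen[key] = (ts, idx)

print(sessions)   # [['/', '/products/'], ['/'], ['/faq.html']]
```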
64Cont.
- The path completion heuristics
- If the referring page file of a request is not
part of the previous page file of that session,
the user must have accessed a cached page
- The back-button method is used to reach a
cached page
- A constant view time is assigned to each of the
cached page files
70WUM Association Rule Generation
- Discovers the correlations between pages that are
most often referenced together in a single server
session
- Provides information such as
- What are the sets of pages frequently accessed
together by Web users?
- What page will be fetched next?
- What are the paths frequently followed by Web users?
- Association rule
- A => B (Support = 60%, Confidence = 80%)
- Example
- 50% of visitors who accessed the URLs
/infor-f.html and labo/infos.html also visited
situation.html
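As a small, self-contained illustration of support and confidence over server sessions (the session contents below are made up for the example):

```python
# Hypothetical server sessions: each is the set of pages viewed in one session
sessions = [
    {"/infor-f.html", "/labo/infos.html", "/situation.html"},
    {"/infor-f.html", "/labo/infos.html"},
    {"/infor-f.html", "/situation.html"},
    {"/labo/infos.html"},
]

antecedent = {"/infor-f.html", "/labo/infos.html"}
consequent = {"/situation.html"}

n_total = len(sessions)
n_ante = sum(1 for s in sessions if antecedent <= s)
n_both = sum(1 for s in sessions if (antecedent | consequent) <= s)

support = n_both / n_total    # fraction of all sessions containing both sides
confidence = n_both / n_ante  # fraction of antecedent sessions also containing the consequent

print(f"support = {support:.0%}, confidence = {confidence:.0%}")
# support = 25%, confidence = 50%
```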
71Associations Correlations
- Page associations from usage data
- User sessions
- User transactions
- Page associations from content data
- similarity based on content analysis
- Page associations based on structure
- link connectivity between pages
- => Obtain frequent itemsets
72Examples
- 60% of clients who accessed /products/ also
accessed /products/software/webminer.htm
- 30% of clients who accessed /special-offer.html
placed an online order in /products/software/
- (Example from the official IBM Olympics site)
- Badminton, Diving => Table Tennis
(confidence = 69.7%, support = 0.35%)
73WUM Clustering
- Groups together a set of items having similar
characteristics - User Clusters
- Discover groups of users exhibiting similar
browsing patterns
- Page recommendation
- A user's partial session is classified into a
single cluster
- The links contained in this cluster are
recommended
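A minimal sketch of user clustering on binary page-visit vectors, assuming scikit-learn is available; the page list, sessions, and choice of two clusters are illustrative assumptions, not part of the original material.

```python
import numpy as np
from sklearn.cluster import KMeans

pages = ["/", "/products/", "/products/software/", "/support/", "/jobs/"]

# Hypothetical user sessions as sets of visited pages
user_sessions = [
    {"/", "/products/", "/products/software/"},
    {"/products/", "/products/software/"},
    {"/", "/support/"},
    {"/support/", "/jobs/"},
]

# Binary visit matrix: one row per user session, one column per page
X = np.array([[1 if p in s else 0 for p in pages] for s in user_sessions])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)   # e.g. [0 0 1 1]: product-oriented vs. support-oriented users
```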
74Cont..
- Page clusters
- Discover groups of pages having related content
- Usage based frequent pages
- Page recommendation
- The links are presented based on how often URL
references occur together across user sessions
75Website Usage Analysis
- Why develop a Website usage / utilization
analysis tool?
- Knowledge about how visitors use a Website could
- Prevent disorientation and help designers place
important information/functions exactly where
visitors look for them and in the way users
need them
- Help build an adaptive Website server
76Clustering and Classification
- Clients who often access
/products/software/webminer.html tend to be from
educational institutions
- Clients who placed an online order for software
tend to be students in the 20-25 age group and
live in the United States
- 75% of clients who download software from
/products/software/demos/ visit between 7:00
and 11:00 pm on weekends
77Website Usage Analysis
- Discover user navigation patterns in using the
Website
- Establish an aggregated log structure as a
preprocessor to reduce the search space before the
actual log mining phase
- Introduce a model for Website usage pattern
discovery by extending the classical mining model,
and establish the processing framework of this
model
78Sequential Patterns Clusters
- 30% of clients who visited /products/software/
had done a search in Yahoo using the keyword
"software" before their visit
- 60% of clients who placed an online order for
WEBMINER placed another online order for
software within 15 days
79Website Usage Analysis
- The Website client-server architecture facilitates
recording user behavior at every step by
- submitting client-side log files to the server
when users use clear functions or exit
windows/modules
- The special design of local and universal
back/forward/clear functions makes users'
navigation patterns clearer to the designer by
- analyzing the local back/forward history and
incorporating it with the universal back/forward
history
80Website Usage Analysis
- What will be included in SUA
1. Identify and collect log data
2. Transfer the data to the server side and save it
in a structure suited for analysis
3. Prepare mined data by establishing a customized
aggregated log tree/frame
4. Use modifications of the typical data mining
methods, particularly an extension of a traditional
sequence discovery algorithm, to mine user
navigation patterns
81Website Usage Analysis
- Problems that need to be considered
- How to identify the log data when a user goes
through an uninteresting function/module
- What marks the end of a user session?
- Clients connect to the Website through proxy
servers
- Differences from common Web usage mining
- Client-side log files are available
- Log file format (Web log files follow the Common
Log Format specified as part of the HTTP protocol)
- No need for log file cleaning/filtering (which is
usually performed in the preprocessing step of Web
log mining)
82Web Usage Mining - Patterns Discovery Algorithms
- (Chen et al.) Design algorithms for path
traversal patterns, finding maximal forward
references and large reference sequences
83Path Traversal Patterns
- Procedure for mining traversal patterns
- (Step 1) Determine maximal forward references
from the original log data (Algorithm MF)
- (Step 2) Determine large reference sequences
(i.e., Lk, k >= 1) from the set of maximal forward
references (Algorithms FS and SS)
- (Step 3) Determine maximal reference sequences
from large reference sequences
- Focus is on Steps 1 and 2, devising algorithms for
the efficient determination of large reference
sequences
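A rough sketch of Step 1 in the spirit of algorithm MF (the published algorithm differs in its details): whenever a backward move is detected on a user's traversal path, the forward path accumulated so far is emitted as a maximal forward reference.

```python
def maximal_forward_references(path):
    """Split one traversal path into maximal forward references.

    A backward move (revisiting a page already on the current forward path)
    ends the current forward reference; traversal then resumes from the
    revisited page.  Simplified sketch, not the published Algorithm MF.
    """
    refs, forward = [], []
    moving_forward = True
    for page in path:
        if page in forward:                    # backward move detected
            if moving_forward and len(forward) > 1:
                refs.append(list(forward))     # emit the maximal forward reference
            forward = forward[:forward.index(page) + 1]
            moving_forward = False
        else:
            forward.append(page)
            moving_forward = True
    if moving_forward and len(forward) > 1:
        refs.append(forward)
    return refs

# Toy path A B C D C B E: the user backs up from D to B, then moves forward to E
print(maximal_forward_references(list("ABCDCBE")))
# [['A', 'B', 'C', 'D'], ['A', 'B', 'E']]
```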
84Determine large reference sequences
- Algorithm FS
- Utilizes the key ideas of algorithm DHP
- employs hashing and pruning techniques
- DHP is very efficient for the generation of
candidate itemsets, in particular for the large
two-itemsets, thus greatly alleviating the
performance bottleneck of the whole process
- Algorithm SS
- employs hashing and pruning techniques to reduce
both CPU and I/O costs - by properly utilizing the information in
candidate references in prior passes, is able to
avoid database scans in some passes, thus further
reducing the disk I/O cost
85Patterns Analysis Tools
- WebViz [pitkwa94]: provides appropriate tools
and techniques to understand, visualize, and
interpret access patterns
- OLAP techniques such as data cubes are proposed
for simplifying the analysis of usage statistics
from server access logs [dyreua et al.]
86Patterns Discovery and Analysis Tools
- The emerging tools for user pattern discovery use
sophisticated techniques from AI, data mining,
psychology, and information theory, to mine for
knowledge from collected data - (Pirolli et. al.) use information foraging theory
to combine path traversal patterns, Web page
typing, and site topology information to
categorize pages for easier access by users.
87(Contd)
- WEBMINER
- introduces a general architecture for Web usage
mining, automatically discovering association
rules and sequential patterns from server access
logs. - proposes an SQL-like query mechanism for querying
the discovered knowledge in the form of
association rules and sequential patterns. - WebLogMiner
- Web log is filtered to generate a relational
database - Data mining on web log data cube and web log
database
88WEBMINER
- SQL-like Query
- A framework for Web mining, the applications of
data mining and knowledge discovery techniques,
association rules and sequential patterns, to Web
data - Association rules using apriori algorithm
- 40% of clients who accessed the Web page with URL
/company/products/product1.html also accessed
/company/products/product2.html
- Sequential patterns using a modified Apriori
algorithm
- 60% of clients who placed an online order in
/company/products/product1.html also placed an
online order in /company/products/product4.html
within 15 days
89WebLogMiner
- Database construction from server log file
- data cleaning
- data transformation
- Multi-dimensional web log data cube construction
and manipulation - Data mining on web log data cube and web log
database
90Mining the World-Wide Web
- Design of a Web Log Miner
- Web log is filtered to generate a relational
database
- A data cube is generated from the database
- OLAP is used to drill down and roll up in the
cube
- OLAM is used for mining interesting knowledge
(Diagram: Web log -> (1) data cleaning -> database -> (2) data cube creation -> data cube -> (3) OLAP -> sliced and diced cube -> (4) data mining -> knowledge)
91Construction of Data Cubes (http://db.cs.sfu.ca/sections/publication/slides/slides.html)
(Diagram: a 3-D data cube with dimensions Amount (0-20K, 20-40K, 40-60K, 60K-, sum), Province (B.C., Prairies, Ontario, sum), and Discipline (Comp_Method, ..., sum), built from the database)
Each dimension contains a hierarchy of values
for one attribute. A cube cell stores aggregate
values, e.g., count, sum, max, etc. A sum cell
stores dimension summation values. Sparse-cube
technology and MOLAP/ROLAP integration. Chunk-based
multi-way aggregation and single-pass
computation.
92WebLogMiner Architecture
- Web log is filtered to generate a relational
database
- A data cube is generated from the database
- OLAP is used to drill down and roll up in the
cube
- OLAM is used for mining interesting knowledge
(Diagram: Web log -> (1) data cleaning -> database -> (2) data cube creation -> data cube -> (3) OLAP -> sliced and diced cube -> (4) data mining -> knowledge)
93 WEBSIFT
94What is WebSIFT?
- a Web Usage Mining framework that
- performs preprocessing
- performs knowledge discovery
- uses the structure and content information about
a Web site to automatically define a belief set.
95Overview of WebSIFT
- Based on WEBMINER prototype
- Divides the Web Usage Mining process into three
main parts
96Overview of WebSIFT
- Input
- Access
- Referrer and agent
- HTML files
- Optional data (e.g., registration data or remote
agent logs)
97Overview of WebSIFT
- Preprocessing
- uses input data to construct a user session file
- site files are used to classify pages of a site
- Knowledge discovery phase
- uses existing data mining techniques to generate
rules and patterns
- generation of general usage stats
98Information Filtering
- Links between pages provide evidence for
supporting the belief that those pages are
related. - Strength of evidence for a set pages being
related is proportional to the strength of the
topological connection between the set of pages. - Based on site content, can also look at content
similarity and by calculating distance between
pages.
99Information Filtering
100Information Filtering
- Uses two different methods to identify
interesting results from a list of discovered
frequent itemsets
101Information Filtering
- Method 1
- declare itemsets that contain pages not directly
connected to be interesting - corresponds to a situation where a belief that a
set of pages are related has no domain or
existing evidence but there is mined evidence. ?
called Beliefs with Mined Evidence algo (BME)
102Information Filtering
- Method 2
- Absence of itemsets => evidence against a belief
that pages are related
- Pages that have individual support above a
threshold but are not present together in larger
frequent itemsets => evidence against the pages
being related
- If domain evidence suggests that pages are related,
the absence of the frequent itemset can be
considered interesting; this is handled by the
Beliefs with Contradicting Evidence algorithm (BCE)
103Experimental Evaluation
- Performed on the web server of the U of MN Dept. of
Computer Science and Engineering web site
- Log spanned eight days in Feb 1999
- Physical size of log: 19.3 MB
- 102,838 entries
- After preprocessing: 43,158 page views (divided
among 10,609 user sessions)
- A support threshold of 0.1 was used to generate
693 frequent itemsets with a maximum set size of
six pages
- 178 unique pages represented in all the rules
- BCE and BME algorithms run on the frequent itemsets
104Experimental Evaluation
105Experimental Evaluation
106Future work
- Filtering frequent itemsets, sequential patterns
and clusters - Incorporate probabilities and fuzzy logic into
information filter - Future works include path completion
verification, page usage determination,
application of the pattern analysis results, etc.
107Link Analysis
108Link Analysis
- Finding patterns in graphs
- Bibliometrics: finding patterns in citation
graphs
- Sociometry: finding patterns in social networks
- Collaborative filtering: finding patterns in a
rank(person, item) graph
- Webometrics: finding patterns in web page links
109Web Link Analysis
- Used for
- ordering documents matching a user query ranking
- deciding what pages to add to a collection
crawling - page categorization
- finding related pages
- finding duplicated web sites
110Web as Graph
- Link graph
- node for each page
- directed edge (u,v) if page u contains a
hyperlink to page v
- Co-citation graph
- node for each page
- undirected edge (u,v) iff there exists a third page
w linking to both u and v
- Assumption
- a link from page A to page B is a recommendation of
page B by A
- if A and B are connected by a link, there is a
higher probability that they are on the same
topic
111Web structure mining
- HITS (Topic distillation)
- PageRank (Ranking web pages used by Google)
- Algorithm in Cyber-community
112HITS Algorithm--Topic Distillation on WWW
113HITS Method
- Hyperlink Induced Topic Search
- Kleinberg, 1998
- A simple approach by finding hubs and authorities
- View web as a directed graph
- Assumption: if document A has a hyperlink to
document B, then the author of document A thinks
that document B contains valuable information
114Main Ideas
- Concerned with the identification of the most
authoritative, or definitive, Web pages on a
broad topic
- Viewing the Web as a graph
- A purely link structure-based computation,
ignoring the textual content
115HITS Hubs and Authority
- Hub: a web page that links to a collection of
prominent sites on a common topic
- Authority: a web page on a broad topic that is
pointed to by hubs
- Mutually reinforcing relationship: a good authority
is a page that is pointed to by many good hubs,
while a good hub is a page that points to many
good authorities
116Hub-Authority Relations
117HITS Two Main Steps
- A sampling component, which constructs a focused
collection of several thousand web pages likely
to be rich in relevant authorities - A weight-propagation component, which determines
numerical estimates of hub and authority weights
by an iterative procedure - As the result, pages with highest weights are
returned as hubs and authorities for the research
topic
118HITS Root Set and Base Set
- Use the query term to collect a root set (S) of
pages from an index-based search engine (AltaVista)
- Expand the root set to a base set (T) by including all
pages linked to by pages in the root set and all
pages that link to a page in the root set (up to a
designated size cut-off)
- A typical base set contains roughly 1000-5000 pages
119Step 1 Constructing Subgraph
- 1.1 Creating a root set (S)
- Given a query string on a broad topic
- Collect the t highest-ranked pages for the query
from a text-based search engine
- 1.2 Expanding to a base set (T)
- Add the pages pointing to a page in the root set
- Add the pages pointed to by a page in the root set
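A schematic sketch of this base-set expansion, assuming the link structure is already available as in-memory dictionaries; in practice the root set comes from a search engine and the link information from a crawler or connectivity server.

```python
def build_base_set(root_set, out_links, in_links, d=50):
    """Expand a root set S into a base set T (HITS sampling step, sketch).

    out_links[p]: pages that p points to (all are added)
    in_links[p]:  pages pointing to p (at most d per root page, to bound growth)
    """
    base = set(root_set)
    for page in root_set:
        base.update(out_links.get(page, []))
        base.update(list(in_links.get(page, []))[:d])
    return base

# Toy link structure (assumed for illustration)
out_links = {"p1": ["p2", "p3"], "p2": ["p3"]}
in_links = {"p1": ["p4", "p5"], "p3": ["p1", "p2"]}
print(sorted(build_base_set({"p1"}, out_links, in_links)))
# ['p1', 'p2', 'p3', 'p4', 'p5']
```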
120Root Set and Base Set (Contd)
121Step 2 Computing Hubs and Authorities
- 2.1 Associating weights
- Authority weight x_p
- Hub weight y_p
- Set all values to a uniform constant initially
- 2.2 Updating weights
122Updating Authority Weight
Example:
x_p = y_q1 + y_q2 + y_q3
(the authority weight of p is the sum of the hub weights of all pages q1, q2, q3 that point to p)
123Updating Hub Weight
Example:
y_p = x_q1 + x_q2 + x_q3
(the hub weight of p is the sum of the authority weights of all pages q1, q2, q3 that p points to)
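Putting the two update rules together, here is a compact HITS-style iteration over a toy link graph; the graph, the fixed iteration count, and the normalization step are illustrative choices, not taken from the original slides.

```python
import math

# Toy directed web graph: page -> pages it links to (assumed for illustration)
links = {
    "h1": ["a1", "a2"],
    "h2": ["a1", "a2", "a3"],
    "h3": ["a2"],
    "a1": [], "a2": [], "a3": [],
}

pages = list(links)
auth = {p: 1.0 for p in pages}   # x_p
hub = {p: 1.0 for p in pages}    # y_p

for _ in range(20):              # converges rapidly in practice
    # x_p = sum of y_q over all pages q that point to p
    auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
    # y_p = sum of x_q over all pages q that p points to
    hub = {p: sum(auth[q] for q in links[p]) for p in pages}
    # normalize so the weights do not grow without bound
    na = math.sqrt(sum(v * v for v in auth.values())) or 1.0
    nh = math.sqrt(sum(v * v for v in hub.values())) or 1.0
    auth = {p: v / na for p, v in auth.items()}
    hub = {p: v / nh for p, v in hub.items()}

print(max(auth, key=auth.get), max(hub, key=hub.get))   # best authority a2, best hub h2
```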
124Flowchart
125Results
- All x- and y-values converge rapidly, so
termination of the iteration is guaranteed
- This can be proved mathematically
- Pages with the highest x-values are viewed as the
best authorities, while pages with the highest
y-values are regarded as the best hubs
126Implementation
- Search engine: AltaVista
- Root set: 200 pages
- Base set: 1000-5000 pages
- Convergence speed: very rapid, fewer than 20
iterations
- Running time: about 30 minutes
127HITS Advantages
- Weight computation is an intrinsic feature of the
collection of linked pages
- Provides a densely linked community of related
authorities and hubs
- Purely link-based computation once the root set has
been assembled, with no further regard to query
terms
- Provides surprisingly good search results for a
wide range of queries
128Drawbacks
- Limit on narrow topics
- Not enough authoritative pages
- Frequently returns resources for a more general
topic
- Adding a few edges can potentially change scores
considerably
- Topic drifting
- Appears when hubs discuss multiple topics
129Improved Work
- To improve precision
- Combining content with link information
- Breaking large hub pages into smaller units
- Computing relevance weights for pages
- To improve speed
- Building a Connectivity Server that provides
linkage information for all pages
130Web Structure Mining
- Page-Rank Method
- CLEVER Method
- Connectivity-Server Method
1311. Page-Rank Method
- Introduced by Brin and Page (1998)
- Mine hyperlink structure of web to produce
global importance ranking of every web page - Used in Google Search Engine
- Web search result is returned in the rank order
- Treats links like academic citations
- Assumption Highly linked pages are more
important than pages with a few links - A page has a high rank if the sum of the ranks of
its back-links is high
132Page Rank Computation
- Assume
- R(u): rank of a web page u
- F_u: set of pages that u points to
- B_u: set of pages that point to u
- N_u: number of links from u
- c: normalization factor
- E(u): vector of web pages used as a source of rank
- Page Rank computation
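A minimal power-iteration sketch of the rank computation. The slides' formulation uses a normalization factor c and a rank-source vector E(u); the widely used damping-factor variant below, R(u) = (1-d)/N + d * sum over v in B_u of R(v)/N_v, plays the same role with a uniform E. The toy graph and damping value are assumptions for illustration.

```python
def pagerank(out_links, d=0.85, iterations=50):
    """Simple power-iteration PageRank (damping-factor variant, sketch).

    out_links: page -> list of pages it links to.
    Dangling pages are handled by spreading their rank uniformly.
    """
    pages = list(out_links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - d) / n for p in pages}       # uniform source of rank (E)
        for u in pages:
            targets = out_links[u]
            if targets:
                share = d * rank[u] / len(targets)    # R(u)/N_u propagated over each link
                for v in targets:
                    new[v] += share
            else:                                     # dangling page: spread rank uniformly
                for v in pages:
                    new[v] += d * rank[u] / n
        rank = new
    return rank

# Toy graph (assumed): the heavily linked "home" page should rank highest
graph = {"home": ["a", "b"], "a": ["home"], "b": ["home"], "c": ["home"]}
ranks = pagerank(graph)
print(sorted(ranks.items(), key=lambda kv: -kv[1])[0])   # ('home', ...)
```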
133Page Rank Implementation
- Stanford WebBase project: a complete crawling and
indexing system with a current repository of 24
million web pages (old data)
- Store each URL as a unique integer and each
hyperlink by its integer IDs
- Remove dangling links by an iterative procedure
- Make an initial assignment of the ranks
- Propagate page ranks in an iterative manner
- Upon convergence, add the dangling links back and
recompute the rankings
134Page Rank Results
- Google utilizes a number of factors to rank the
search results: proximity, anchor text, page rank
- The benefits of PageRank are greatest for
underspecified queries; for example, a "Stanford
University" query using PageRank lists the
university home page first
135Page Rank Advantages
- Global ranking of all web pages regardless of
their content, based solely on their location in the
web graph structure
- Higher quality search results: central,
important, and authoritative web pages are given
preference
- Helps find representative pages to display for a
cluster center
- Other applications: traffic estimation, back-link
prediction, user navigation, personalized page
rank
- Mining the structure of the web graph is very useful
for various information retrieval tasks
136CLEVER Method
- CLientside EigenVector-Enhanced Retrieval
- Developed by a team of IBM researchers at IBM
Almaden Research Centre - Continued refinements of HITS
- Ranks pages primarily by measuring links between
them - Basic Principles Authorities, Hubs
- Good hubs point to good authorities
- Good authorities are referenced by good hubs
137Problems Prior to CLEVER
- Ignoring textual content leads to problems
caused by some features of the web
- HITS returns good resources for a more general
topic when query topics are narrowly focused
- HITS occasionally drifts when hubs discuss
multiple topics
- Pages from a single Web site often take over a
topic and use the same HTML template, therefore
pointing to a single popular site irrelevant to the
query topic
138CLEVER Solution
- Replacing the sums of Equation (1) and (2) of
HITS with weighted sums - Assign to each link a non-negative weight
- Weight depends on the query term and end point
- Extension 1: Anchor Text
- using text that surrounds hyperlink definitions
(hrefs) in Web pages, often referred to as anchor
text
- boost the weights of links that occur near
instances of query terms
139CLEVER Solution (Contd)
- Extension 2: Mini Hub Pagelets
- breaking large hubs into smaller units
- treat contiguous subsets of links as mini-hubs or
pagelets
- contiguous sets of links on a hub page are more
focused on a single topic than the entire page
140CLEVER The Process
- Starts by collecting a set of pages
- Gathers all pages linked from the initial set, plus
any pages linking to them
- Ranks results by counting links
- Links are noisy; it is not clear which pages are best
- Recalculate scores
- Pages with the most links are established as most
important; links transmit more weight
- Repeat the calculation a number of times until the
scores are refined
141CLEVER Advantages
- Used to populate categories of different subjects
with minimal human assistance - Able to leverage links to fill category with best
pages on web - Can be used to compile large taxonomies of topics
automatically - Emerging new directions Hypertext
classification, focused crawling, mining
communities
142Connectivity Server Method
- Server that provides linkage information for all
pages indexed by a search engine - In its base operation, server accepts a query
consisting of a set of one or more URLs and
return a list of all pages that point to pages in
(parents) and list of all pages that are pointed
to from pages in (children) - In its base operation, it also provides
neighbourhood graph for query set - Acts as underlying infrastructure, supports
search engine applications
143What's the Connectivity Server? (Contd)
Neighborhood Graph
144CONSERV Web Structure Mining
- Finding Authoritative Pages (search by topic)
- (pages that are high in quality and relevant to
the topic)
- Finding Related Pages (search by URL)
- (pages that address the same topic as the original
page, but are not necessarily semantically identical)
145CONSERV Finding Related Page
146CONSERV Companion Algorithm
- An extension to HITS algorithm
- Features
- Exploit not only links but also their order on a
page - Use link weights to reduce the influence of pages
that all reside on one host - Merge nodes that have a large number of duplicate
links - The base graph is structured to exclude
grandparent nodes but include nodes that share
child
147Companion Algorithm (Contd)
- Four steps
- 1. Build a vicinity graph for u
- 2. Remove duplicates and near-duplicates in the
graph
- 3. Compute link weights based on host-to-host
connections
- 4. Compute a hub score and an authority score for
each node in the graph; return the top-ranked
authority nodes
148Companion Algorithm (Contd)Building the
Vicinity Graph
- Set up parameters: B = number of parents of u, BF =
number of children per parent, F = number of children
of u, FB = number of parents per child
- Stoplist (pages that are unrelated to most
queries and have a very high in-degree)
- Procedure
- Go Back (B): choose parents (randomly)
- Back-Forward (BF): choose siblings (nearest)
- Go Forward (F): choose children (first)
- Forward-Back (FB): choose siblings (highest
in-degree)
149Companion Algorithm (Contd)Remove duplicate
- Two nodes are near-duplicates if each has more than
10 links and they have at least 95% of their
links in common
- Replace the two nodes with a node whose links are the
union of the links of the two nodes
- (mirror sites, aliases)
150Companion Algorithm (Contd)Assign edge (link)
weights
- A link on the same host has weight 0
- If there are k links from documents on a host to
a single document on a different host, each link has
an authority weight of 1/k
- If there are k links from a single document on a
host to a set of documents on a different host, give
each link a hub weight of 1/k
- (prevents a single host from having too much
influence on the computation)
151Companion Algorithm (Contd)Compute hub and
authority scores
- Extension of the HITS algorithm with edge weights
- Initialize all elements of the hub vector H to 1
- Initialize all elements of the authority vector A
to 1
- While the vectors H and A have not converged:
- For all nodes n in the vicinity graph N:
A[n] = sum over edges (n', n) of H[n'] x authority_weight(n', n)
- For all nodes n in N:
H[n] = sum over edges (n, n') of A[n'] x hub_weight(n, n')
- Normalize the H and A vectors
152CONSERV Cocitation Algorithm
- Two nodes are co-cited if they have a common
parent - The number of common parents of two nodes is
their degree of co-citation - Determine the related pages by looking for
sibling nodes with the highest degree of
co-citation - In some cases there is an insufficient level of
cocitation to provide meaningful results, chop
off elements of URL, restart algorithm. - e.g. A.com/X/Y/Z ? A.com/X/Y
153Comparative Study
- Page Rank
- (Google)
- Assigns initial rankings and retains them
independently of queries (fast)
- Works in the forward direction, from link to link
- Qualitative results
- Hub/Authority (CLEVER, C-Server)
- Assembles a different root set and prioritizes
pages in the context of the query
- Looks in both the forward and backward directions
- Qualitative results
- Query-independent gives an intrinsic quality
score to a page - Approach 1 larger number of hyperlinks pointing
to a page, the better the page - drawback?
- each link is equally important
- Approach 2 weight each hyperlink proportionally
to the quality of the page containing the
hyperlink
155Query-dependent Connectivity-Based Ranking
- Carriere and Kazman
- For each query, build a subgraph of the link
graph G limited to pages on the query topic
- Build the neighborhood graph
- A start set S of documents matching the query is given
by a search engine (200 documents)
- The set is augmented by its neighborhood, the set of
documents that either point to or are pointed to
by documents in S (limited to 50)
- Then rank based on in-degree
156Idea
- We desire pages that are relevant (in the
neighborhood graph) and authoritative
- As in PageRank, what matters is not only the in-degree
of a page p, but also the quality of the pages that
point to p; if more important pages point to p, then p
is more authoritative
- Key idea: good hub pages have links to good
authority pages
- Given a user query, compute a hub score and an
authority score for each document
- high authority score <=> relevant content
- high hub score <=> links to documents with
relevant content
157Improvements to Basic Algorithm
- Put weights on edges to reflect importance of
links, e.g., put higher