Title: Web Mining : A Bird
1Web Mining: A Bird's Eye View
- Sanjay Kumar Madria
- Department of Computer Science
- University of Missouri-Rolla, MO 65401
- madrias_at_umr.edu
2Web Mining
- Web mining - data mining techniques to
automatically discover and extract information
from Web documents/services (Etzioni, 1996).
- Web mining research integrates research from
several research communities (Kosala and
Blockeel, July 2000) such as
- Database (DB)
- Information retrieval (IR)
- The sub-areas of machine learning (ML)
- Natural language processing (NLP)
3Mining the World-Wide Web
- WWW is a huge, widely distributed, global
information source for
- Information services: news, advertisements,
consumer information, financial management,
education, government, e-commerce, etc.
- Hyper-link information
- Access and usage information
- Web Site contents and Organization
4Mining the World-Wide Web
- Growing and changing very rapidly
- Broad diversity of user communities
- Only a small portion of the information on the
Web is truly relevant or useful to Web users
- How to find high-quality Web pages on a specified
topic?
- WWW provides rich sources for data mining
5Challenges on WWW Interactions
- Finding Relevant Information
- Creating knowledge from Information available
- Personalization of the information
- Learning about customers / individual users
- Web Mining can play an important Role!
6Web Mining more challenging
- Searches for
- Web access patterns
- Web structures
- Regularity and dynamics of Web contents
- Problems
- The abundance problem
- Limited coverage of the Web: hidden Web sources,
majority of data in DBMS
- Limited query interface based on keyword-oriented
search
- Limited customization to individual users
- Dynamic and semistructured
7Web Mining Subtasks
- Resource Finding
- Task of retrieving intended web-documents
- Information Selection and Pre-processing
- Automatic selection and pre-processing of specific
information from retrieved web resources
- Generalization
- Automatic Discovery of patterns in web sites
- Analysis
- Validation and / or interpretation of mined
patterns
8Web Mining Taxonomy
Web Mining
Web Content Mining
Web Usage Mining
Web Structure Mining
9Web Content Mining
- Discovery of useful information from web contents
/ data / documents
- Web data contents: text, image, audio, video,
metadata and hyperlinks.
- Information Retrieval View (Structured and
Semi-Structured)
- Assist / Improve information finding
- Filtering Information to users on user profiles
- Database View
- Model Data on the web
- Integrate them for more sophisticated queries
10Issues in Web Content Mining
- Developing intelligent tools for IR
- Finding keywords and key phrases
- Discovering grammatical rules and collocations
- Hypertext classification/categorization
- Extracting key phrases from text documents
- Learning extraction models/rules
- Hierarchical clustering
- Predicting (words) relationship
11Cont.
- Developing Web query systems
- WebOQL, XML-QL
- Mining multimedia data
- Mining image from satellite (Fayyad, et al. 1996)
- Mining image to identify small volcanoes on
Venus (Smyth, et al. 1996)
12Web Structure Mining
- To discover the link structure of the hyperlinks
at the inter-document level to generate a
structural summary about the Website and Web
page.
- Direction 1: based on the hyperlinks,
categorizing the Web pages and generated
information.
- Direction 2: discovering the structure of the Web
document itself.
- Direction 3: discovering the nature of the
hierarchy or network of hyperlinks in the Website
of a particular domain.
13Web Structure Mining
- Finding authoritative Web pages
- Retrieving pages that are not only relevant, but
also of high quality, or authoritative on the
topic
- Hyperlinks can be used to infer the notion of authority
- The Web consists not only of pages, but also of
hyperlinks pointing from one page to another
- These hyperlinks contain an enormous amount of
latent human annotation
- A hyperlink pointing to another Web page
can be considered the author's endorsement of
the other page
14Web Structure Mining
- Web pages categorization (Chakrabarti, et al.,
1998)
- Discovering micro communities on the web
- Example: Clever system (Chakrabarti, et
al., 1999), Google (Brin and Page, 1998)
- Schema Discovery in Semistructured Environment
15Web Usage Mining
- Web usage mining also known as Web log mining
- mining techniques to discover interesting usage
patterns from the secondary data derived from the
interactions of the users while surfing the web
16Web Usage Mining
- Applications
- Target potential customers for electronic
commerce
- Enhance the quality and delivery of Internet
information services to the end user
- Improve Web server system performance
- Identify potential prime advertisement locations
- Facilitates personalization/adaptive sites
- Improve site design
- Fraud/intrusion detection
- Predict users actions (allows prefetching)
18Problems with Web Logs
- Identifying users
- Clients may have multiple streams
- Clients may access web from multiple hosts
- Proxy servers: many clients/one address
- Proxy servers: one client/many addresses
- Data not in log
- POST data (i.e., CGI request) not recorded
- Cookie data stored elsewhere
19Cont
- Missing data
- Pages may be cached
- Referring page requires client cooperation
- When does a session end?
- Use of forward and backward pointers
- Typically a 30 minute timeout is used
- Web content may be dynamic
- May not be able to reconstruct what the
user saw
- Use of spiders and automated agents that automatically
request web pages
20Cont
- Like most data mining tasks, web log
mining requires preprocessing
- To identify users
- To match sessions to other data
- To fill in missing data
- Essentially, to reconstruct the click stream
21Log Data - Simple Analysis
- Statistical analysis of users
- Length of path
- Viewing time
- Number of page views
- Statistical analysis of site
- Most common pages viewed
- Most common invalid URL
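The statistics listed above can be sketched in a few lines. This is only an illustration: the `sessions` data, the `site_statistics` name, and the choice to treat status 404 as an invalid URL are assumptions, not something the slides specify.

```python
from collections import Counter

# Hypothetical cleaned sessions: each is a list of (url, status) requests.
sessions = [
    [("/", 200), ("/products/", 200), ("/old.html", 404)],
    [("/", 200), ("/old.html", 404)],
    [("/products/", 200), ("/", 200), ("/missing.html", 404)],
]

def site_statistics(sessions):
    """Compute a few descriptive statistics over user sessions."""
    page_views = Counter()
    invalid = Counter()
    for session in sessions:
        for url, status in session:
            if status == 404:
                invalid[url] += 1      # request for a URL that does not exist
            else:
                page_views[url] += 1   # a successfully viewed page
    # Path length is taken here as the number of requests in a session.
    mean_path_length = sum(len(s) for s in sessions) / len(sessions)
    return {
        "mean_path_length": mean_path_length,
        "most_common_page": page_views.most_common(1)[0][0],
        "most_common_invalid_url": invalid.most_common(1)[0][0],
    }

stats = site_statistics(sessions)
```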
22Web Log Data Mining Applications
- Association rules
- Find pages that are often viewed together
- Clustering
- Cluster users based on browsing patterns
- Cluster pages based on content
- Classification
- Relate user attributes to patterns
23Web Logs
- Web servers have the ability to log all
requests
- Web server log formats
- Most use the Common Log Format (CLF)
- New, Extended Log Format allows
configuration of log file
- Generate vast amounts of data
24- Common Log Format
- Remotehost: browser hostname or IP
- Remote log name of user (almost
always "-", meaning "unknown")
- Authuser: authenticated username
- Date: date and time of the request
- "request": exact request line from client
- Status: the HTTP status code returned
- Bytes: the content-length of response
25Server Logs
26Fields
- Client IP: 128.101.228.20
- Authenticated User ID: - -
- Time/Date: 10/Nov/1999:10:16:39 -0600
- Request: "GET / HTTP/1.0"
- Status: 200
- Bytes: -
- Referrer: -
- Agent: "Mozilla/4.61 en (WinNT I)"
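A minimal sketch of splitting one log entry into the fields listed above. The regular expression, the `parse_entry` name, and the sample line (assembled from the field values on this slide, with the time written as 10/Nov/1999:10:16:39 -0600) are illustrative assumptions.

```python
import re

# Hypothetical pattern for a Common Log Format entry, optionally followed
# by quoted referrer and agent fields.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<logname>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)

def parse_entry(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    # A "-" in the bytes field means no size was recorded.
    entry["bytes"] = None if entry["bytes"] == "-" else int(entry["bytes"])
    return entry

sample = ('128.101.228.20 - - [10/Nov/1999:10:16:39 -0600] '
          '"GET / HTTP/1.0" 200 - "-" "Mozilla/4.61 en (WinNT I)"')
entry = parse_entry(sample)
```

Lines that do not match return `None`, so malformed entries can be counted or dropped during data cleaning rather than crashing the parser.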
27Web Usage Mining
- Commonly used approaches (Borges and Levene,
1999)
- Maps the log data into relational tables before
an adapted data mining technique is performed.
- Uses the log data directly by utilizing special
pre-processing techniques.
- Typical problems
- Distinguishing among unique users, server
sessions, episodes, etc. in the presence of
caching and proxy servers (McCallum, et al.,
2000; Srivastava, et al., 2000).
28Request
- Method: GET
- Other common methods are POST and HEAD
- URI: /
- This is the file that is being accessed. When a
directory is specified, it is up to the server to
decide what to return. Usually, it will be the
file named index.html or home.html
- Protocol: HTTP/1.0
29Status
- Status codes are defined by the HTTP
protocol.
- Common codes include
- 200 OK
- 3xx Some sort of Redirection
- 4xx Some sort of Client Error
- 5xx Some sort of Server Error
31Web Mining Taxonomy
32Mining the World Wide Web
Web Content Mining
Web Structure Mining
Web Usage Mining
- Web Page Content Mining
- Web Page Summarization
- WebOQL(Mendelzon et.al. 1998)
- Web Structuring query languages
- Can identify information within given web pages
- (Etzioni et.al. 1997) Uses heuristics to
distinguish personal home pages from other web
pages
- ShopBot (Etzioni et.al. 1997) Looks for product
prices within web pages
General Access Pattern Tracking
Customized Usage Tracking
Search Result Mining
33Mining the World Wide Web
Web Content Mining
Web Structure Mining
Web Usage Mining
Web Page Content Mining
- Search Result Mining
- Search Engine Result Summarization
- Clustering Search Result (Leouski and Croft,
1996, Zamir and Etzioni, 1997)
- Categorizes documents using phrases in titles and
snippets
General Access Pattern Tracking
Customized Usage Tracking
34Mining the World Wide Web
Web Content Mining
Web Usage Mining
- Web Structure Mining
- Using Links
- PageRank (Brin et al., 1998)
- CLEVER (Chakrabarti et al., 1998)
- Use interconnections between web pages to give
weight to pages.
- Using Generalization
- MLDB (1994)
- Uses a multi-level database representation of the
Web. Counters (popularity) and link lists are
used for capturing structure.
General Access Pattern Tracking
Search Result Mining
Web Page Content Mining
Customized Usage Tracking
35Mining the World Wide Web
Web Structure Mining
Web Content Mining
Web Usage Mining
Web Page Content Mining
Customized Usage Tracking
- General Access Pattern Tracking
- Web Log Mining (Zaïane, Xin and Han, 1998)
- Uses KDD techniques to understand general access
patterns and trends.
- Can shed light on better structure and grouping
of resource providers.
Search Result Mining
36Mining the World Wide Web
Web Usage Mining
Web Structure Mining
Web Content Mining
- Customized Usage Tracking
- Adaptive Sites (Perkowitz and Etzioni, 1997)
- Analyzes access patterns of each user at a time.
- Web site restructures itself automatically by
learning from user access patterns.
General Access Pattern Tracking
Web Page Content Mining
Search Result Mining
37Web Content Mining
- Agent-based Approaches
- Intelligent Search Agents
- Information Filtering/Categorization
- Personalized Web Agents
- Database Approaches
- Multilevel Databases
- Web Query Systems
38Intelligent Search Agents
- Locating documents and services on the Web
- WebCrawler, Alta Vista (http://www.altavista.com)
scan millions of Web documents and create an index
of words (too many irrelevant, outdated
responses)
- MetaCrawler mines robot-created indices
- Retrieve product information from a variety of
vendor sites using only general information about
the product domain
- ShopBot
39Intelligent Search Agents (Contd)
- Rely either on pre-specified domain information
about particular types of documents, or on hard
coded models of the information sources to
retrieve and interpret documents
- Harvest
- FAQ-Finder
- Information Manifold
- OCCAM
- Parasite
- Learn models of various information sources and
translate these into its own concept hierarchy
- ILA (Internet Learning Agent)
40Information Filtering/Categorization
- Using various information retrieval techniques
and characteristics of open hypertext Web
documents to automatically retrieve, filter, and
categorize them.
- HyPursuit: uses semantic information embedded in
link structures and document content to create
cluster hierarchies of hypertext documents, and
structure an information space
- BO (Bookmark Organizer): combines hierarchical
clustering techniques and user interaction to
organize a collection of Web documents based on
conceptual information
41Personalized Web Agents
- This category of Web agents learns user
preferences and discovers Web information sources
based on these preferences, and those of other
individuals with similar interests (using
collaborative filtering)
- WebWatcher
- PAINT
- Syskill & Webert
- GroupLens
- Firefly
- others
42Multiple Layered Web Architecture
More Generalized Descriptions
Layern
...
Generalized Descriptions
Layer1
Layer0
43Multilevel Databases
- At the higher levels, meta data or
generalizations are
- extracted from lower levels
- organized in structured collections, i.e.
relational or object-oriented database.
- At the lowest level, semi-structured information
is
- stored in various Web repositories, such as
hypertext documents
44Multilevel Databases (Contd)
- (Han, et. al.)
- use a multi-layered database where each layer is
obtained via generalization and transformation
operations performed on the lower layers
- (Khosla, et. al.)
- propose the creation and maintenance of
meta-databases at each information providing
domain and the use of a global schema for the
meta-database
45Multilevel Databases (Contd)
- (King, et. al.)
- propose the incremental integration of a portion
of the schema from each information source,
rather than relying on a global heterogeneous
database schema
- The ARANEUS system
- extracts relevant information from hypertext
documents and integrates these into higher-level
derived Web Hypertexts which are generalizations
of the notion of database views
46Multi-Layered Database (MLDB)
- A multiple layered database model
- based on semi-structured data hypothesis
- queried by NetQL using a syntax similar to the
relational language SQL
- Layer-0
- An unstructured, massive, primitive, diverse
global information-base.
- Layer-1
- A relatively structured, descriptor-like,
massive, distributed database by data analysis,
transformation and generalization techniques.
- Tools to be developed for descriptor extraction.
- Higher-layers
- Further generalization to form progressively
smaller, better structured, and less remote
databases for efficient browsing, retrieval, and
information discovery.
47Three major components in MLDB
- S (a database schema)
- outlines the overall database structure of the
global MLDB
- presents a route map for data and meta-data
(i.e., schema) browsing
- describes how the generalization is performed
- H (a set of concept hierarchies)
- provides a set of concept hierarchies which
assist the system to generalize lower layer
information to higher layers and map queries to
appropriate concept layers for processing
- D (a set of database relations)
- the whole global information base at the
primitive information level (i.e., layer-0)
- the generalized database relations at the
nonprimitive layers
48The General architecture of WebLogMiner(a Global
MLDB)
Generalized Data
Higher layers
Site 1
Concept Hierarchies
Site 2
Resource Discovery (MLDB)
Knowledge Discovery (WLM)
Site 3
Characteristic Rules Discriminant
Rules Association Rules
49Techniques for Web usage mining
- Construct multidimensional view on the Weblog
database
- Perform multidimensional OLAP analysis to find
the top N users, top N accessed Web pages, most
frequently accessed time periods, etc.
- Perform data mining on Weblog records
- Find association patterns, sequential patterns,
and trends of Web accessing
- May need additional information, e.g., user
browsing sequences of the Web pages in the Web
server buffer
- Conduct studies to
- Analyze system performance, improve system design
by Web caching, Web page prefetching, and Web
page swapping
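The multidimensional view and the top-N queries can be illustrated with a tiny in-memory cube. The records, the three chosen dimensions (user, page, hour), and the `roll_up` helper are assumptions made for this sketch; a real Weblog cube would be built with an OLAP engine.

```python
from collections import Counter

# Hypothetical Weblog records: (user, page, hour of day).
records = [
    ("u1", "/", 9), ("u1", "/a", 9), ("u2", "/", 10),
    ("u1", "/", 10), ("u3", "/a", 10), ("u2", "/", 10),
]

# Base cuboid: a request count for every (user, page, hour) cell.
cube = Counter(records)

def roll_up(cube, keep):
    """Aggregate cell counts over the dimensions not listed in `keep`."""
    rolled = Counter()
    for cell, count in cube.items():
        rolled[tuple(cell[i] for i in keep)] += count
    return rolled

top_users = roll_up(cube, keep=(0,)).most_common(2)      # top N users
top_pages = roll_up(cube, keep=(1,)).most_common(1)      # top N pages
busiest_hours = roll_up(cube, keep=(2,)).most_common(1)  # busiest periods
```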
50Web Usage Mining - Phases
- Three distinctive phases: preprocessing, pattern
discovery, and pattern analysis
- Preprocessing - process to convert the raw data
into the data abstraction necessary for
applying the data mining algorithm
- Resources: server-side, client-side, proxy
servers, or database.
- Raw data: Web usage logs, Web page descriptions,
Web site topology, user registries, and
questionnaires.
- Conversion: content converting, structure
converting, usage converting
51- User: the principal using a client to
interactively retrieve and render resources or
resource manifestations.
- Page view: visual rendering of a Web page in a
specific client environment at a specific point
in time
- Click stream: a sequential series of page view
requests
- User session: a delimited set of user clicks
(click stream) across one or more Web servers.
- Server session (visit): a collection of user
clicks to a single Web server during a user
session.
- Episode: a subset of related user clicks that
occur within a user session.
52- Content Preprocessing - the process of converting
text, image, scripts and other files into
forms that can be used by the usage mining.
- Structure Preprocessing - the structure of a
Website is formed by the hyperlinks between page
views; structure preprocessing can be done by
parsing and reformatting this information.
- Usage Preprocessing - the most difficult task in
the usage mining process; data cleaning
techniques are used to eliminate the impact of
irrelevant items on the analysis result.
53Pattern Discovery
- Pattern Discovery is the key component of
Web mining, which brings together algorithms and
techniques from data mining, machine learning,
statistics, pattern recognition, and other research
areas.
- Separate subsections: statistical analysis,
association rules, clustering, classification,
sequential patterns, dependency modeling.
54- Statistical Analysis - analysts may perform
different kinds of descriptive statistical
analyses based on different variables when
analyzing the session file; these are powerful
tools for extracting knowledge about visitors to
a Web site.
55- Association Rules - refers to sets of pages that
are accessed together with a support value
exceeding some specified threshold.
- Clustering: a technique to group together users
or data items (pages) with similar
characteristics.
- It can facilitate the development and execution
of future marketing strategies.
- Classification: the technique of mapping a data
item into one of several predefined classes, which
helps to establish a profile of users belonging to
a particular class or category.
56Pattern Analysis
- Pattern Analysis - final stage of the Web usage
mining.
- To eliminate the irrelevant rules or patterns and
to extract the interesting rules or patterns from
the output of the pattern discovery process.
- Analysis methodologies and tools: query
mechanisms like SQL, OLAP, visualization, etc.
58WUM Pre-Processing
- Data Cleaning
- Removes log entries that are not needed for
the mining process
- Data Integration
- Synchronize data from multiple server logs,
metadata
- User Identification
- Associates page references with different
users
- Session/Episode Identification
- Groups user's page references into user
sessions
- Page View Identification
- Path Completion
- Fills in page references missing due to
browser and proxy caching
59WUM Issues in User Session Identification
- A single IP address is used by many users
- Different IP addresses in a single session
- Missing cache hits in the server logs
-
Proxy server
different users
Web server
Single user
ISP server
Web server
60User and Session Identification Issues
- Distinguish among different users to a site
- Reconstruct the activities of the users within
the site
- Proxy servers and anonymizers
- Rotating IP addresses: connections through ISPs
- Missing references due to caching
- Inability of servers to distinguish among
different visits
61WUM Solutions
- Remote Agent
- A remote agent is implemented as a Java applet
- It is loaded into the client only once when the
first page is accessed
- The subsequent requests are captured and sent
back to the server
- Modified Browser
- The source code of the existing browser can
be modified to gain user specific data at the
client side
- Dynamic page rewriting
- When the user first submits the request, the
server returns the requested page rewritten to
include a session specific ID
- Each subsequent request will supply this ID to
the server
- Heuristics
- Use a set of assumptions to identify user
sessions and find the missing cache hits in the
server log
63WUM Heuristics
- The session identification heuristics
- Timeout: if the time between page requests
exceeds a certain limit, it is assumed that the
user is starting a new session
- IP/Agent: each different agent type for an IP
address represents a different session
- Referring page: if the referring page file for a
request is not part of an open session, it is
assumed that the request is coming from a
different session
- Same IP-Agent/different sessions (Closest):
assigns the request to the session that is
closest to the referring page at the time of the
request
- Same IP-Agent/different sessions (Recent): in
the case where multiple sessions are the same
distance from a page request, assigns the request
to the session with the most recent referrer
access in terms of time
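A minimal sketch of the first two heuristics (Timeout and IP/Agent) combined. The `sessionize` name, the tuple layout of a request, and the sample data are assumptions for illustration; the 30-minute default follows the typical timeout mentioned earlier in the slides. The referrer-based heuristics are not covered here.

```python
from collections import defaultdict

TIMEOUT_SECONDS = 30 * 60  # the commonly used 30-minute limit

def sessionize(requests, timeout=TIMEOUT_SECONDS):
    """Group (ip, agent, timestamp, url) requests into sessions.

    Requests sharing an IP address and agent are treated as one visitor;
    a gap longer than `timeout` seconds starts a new session.
    """
    by_visitor = defaultdict(list)
    for ip, agent, timestamp, url in sorted(requests, key=lambda r: r[2]):
        by_visitor[(ip, agent)].append((timestamp, url))

    sessions = []
    for visitor, hits in by_visitor.items():
        current = [hits[0][1]]
        for (prev_time, _), (time, url) in zip(hits, hits[1:]):
            if time - prev_time > timeout:
                sessions.append((visitor, current))  # close the old session
                current = []
            current.append(url)
        sessions.append((visitor, current))
    return sessions

# Hypothetical requests; timestamps are seconds since the start of the log.
requests = [
    ("1.1.1.1", "A", 0, "/"),
    ("1.1.1.1", "A", 600, "/a"),
    ("1.1.1.1", "B", 700, "/"),
    ("1.1.1.1", "A", 2401, "/b"),
]
sessions = sessionize(requests)
```

In the sample, the same IP address yields three sessions: two for agent A (split by the gap of more than 30 minutes) and one for agent B.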
64Cont.
- The path completion heuristics
- If the referring page file of a session is not
part of the previous page file of that session,
the user must have accessed a cached page
- The back button method is used to refer to a
cached page
- Assigns a constant view time for each
cached page file
70WUM Association Rule Generation
- Discovers the correlations between pages that are
most often referenced together in a single server
session
- Provides information on
- What are the sets of pages frequently accessed
together by Web users?
- What page will be fetched next?
- What are paths frequently accessed by Web users?
- Association rule
- A => B (Support 60%, Confidence 80%)
- Example
- 50% of visitors who accessed URLs
/infor-f.html and labo/infos.html also visited
situation.html
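Support and confidence for a rule such as A => B can be computed directly from sessions. The sketch below uses made-up sessions and hypothetical helper names; it illustrates the two measures only, not a full rule-generation algorithm such as Apriori.

```python
def support(sessions, pages):
    """Fraction of sessions that contain every page in `pages`."""
    pages = set(pages)
    return sum(pages <= set(s) for s in sessions) / len(sessions)

def confidence(sessions, antecedent, consequent):
    """Fraction of sessions containing the antecedent that also contain
    the consequent."""
    both = support(sessions, set(antecedent) | set(consequent))
    return both / support(sessions, antecedent)

# Hypothetical server sessions, each a list of requested pages.
sessions = [
    ["/a.html", "/b.html"],
    ["/a.html", "/b.html", "/c.html"],
    ["/a.html", "/c.html"],
    ["/b.html"],
    ["/a.html", "/b.html"],
]

rule_support = support(sessions, ["/a.html", "/b.html"])
rule_confidence = confidence(sessions, ["/a.html"], ["/b.html"])
```

Here three of the five sessions contain both pages, so the support of /a.html => /b.html is 3/5, and three of the four sessions containing /a.html also contain /b.html, so its confidence is 3/4.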
71Associations Correlations
- Page associations from usage data
- User sessions
- User transactions
- Page associations from content data
- similarity based on content analysis
- Page associations based on structure
- link connectivity between pages
- => Obtain frequent itemsets
72Examples
- 60% of clients who accessed /products/ also
accessed /products/software/webminer.htm.
- 30% of clients who accessed /special-offer.html
placed an online order in /products/software/.
- (Example from IBM official Olympics Site)
- Badminton, Diving => Table Tennis (a =
69.7%, s = 0.35%)
73WUM Clustering
- Groups together a set of items having similar
characteristics
- User Clusters
- Discover groups of users exhibiting similar
browsing patterns
- Page recommendation
- User's partial session is classified into a
single cluster
- The links contained in this cluster are
recommended
74Cont..
- Page clusters
- Discover groups of pages having related content
- Usage based frequent pages
- Page recommendation
- The links are presented based on how often URL
references occur together across user sessions
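One simple reading of usage-based page recommendation is to count how often pairs of URLs occur in the same session and recommend the most frequent companions of the current page. The function names and sessions below are hypothetical, and this co-occurrence count stands in for whatever clustering method a real system would use.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(sessions):
    """Count, for each unordered pair of pages, the sessions containing both."""
    counts = Counter()
    for session in sessions:
        for u, v in combinations(sorted(set(session)), 2):
            counts[(u, v)] += 1
    return counts

def recommend(page, counts, top_n=3):
    """Return up to `top_n` pages that most often co-occur with `page`."""
    scores = Counter()
    for (u, v), n in counts.items():
        if u == page:
            scores[v] += n
        elif v == page:
            scores[u] += n
    return [p for p, _ in scores.most_common(top_n)]

# Hypothetical user sessions.
sessions = [["/", "/a", "/b"], ["/", "/a"], ["/a", "/c"]]
counts = cooccurrence_counts(sessions)
suggested = recommend("/", counts)
```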
75Website Usage Analysis
- Why develop a Website usage / utilization
analysis tool?
- Knowledge about how visitors use a Website could
- Prevent disorientation and help designers
place important information/functions exactly
where the visitors look for it and in the way
users need it
- Build up an adaptive Website server
76Clustering and Classification
- Clients who often access
/products/software/webminer.html tend to be from
educational institutions.
- Clients who placed an online order for software
tend to be students in the 20-25 age group and
live in the United States.
- 75% of clients who download software from
/products/software/demos/ visit between 7:00
and 11:00 pm on weekends.
77Website Usage Analysis
- Discover user navigation patterns in using
Website
- Establish an aggregated log structure as a
preprocessor to reduce the search space before
the actual log mining phase
- Introduce a model for Website usage pattern
discovery by extending the classical mining
model, and establish the processing framework of
this model
78Sequential Patterns Clusters
- 30% of clients who visited /products/software/
had done a search in Yahoo using the keyword
software before their visit
- 60% of clients who placed an online order for
WEBMINER placed another online order for
software within 15 days
79Website Usage Analysis
- Website client-server architecture facilitates
recording user behaviors in every step by
- submitting client-side log files to the server
when users use clear functions or exit
window/modules
- The special design for local and universal
back/forward/clear functions makes the user's
navigation pattern clearer for the designer by
- analyzing local back/forward history and
incorporating it with universal back/forward
history
80Website Usage Analysis
- What will be included in SUA
1. Identify and collect log data
2. Transfer the data to server-side and save them
in a structure desired for analysis
3. Prepare mined data by establishing a
customized aggregated log tree/frame
4. Use modifications of the typical data mining
methods, particularly an extension of a
traditional sequence discovery algorithm, to mine
user navigation patterns
81Website Usage Analysis
- Problems to be considered
- How to identify the log data when a user goes
through an uninteresting function/module
- What marks the end of a user session?
- Clients connect to the Website through proxy
servers
- Differences between Website usage analysis and
common Web usage mining
- Client-side log files available
- Log file format (Web log files follow Common
Log Format specified as a part of HTTP protocol)
- Log file cleaning/filtering is not necessary
(it is usually performed in the preprocessing of
Web log mining)
82Web Usage Mining - Patterns Discovery Algorithms
- (Chen et. al.) Design algorithms for Path
Traversal Patterns, finding maximal forward
references and large reference sequences.
83Path Traversal Patterns
- Procedure for mining traversal patterns
- (Step 1) Determine maximal forward references
from the original log data (Algorithm MF)
- (Step 2) Determine large reference sequences
(i.e., Lk, k ≥ 1) from the set of maximal forward
references (Algorithm FS and SS)
- (Step 3) Determine maximal reference sequences
from large reference sequences
- Focus on Step 1 and 2, and devise algorithms for
the efficient determination of large reference
sequences
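Step 1 can be illustrated with a short sketch of the maximal-forward-reference idea: follow a user's traversal, and each time the user moves back to a page already on the current path, emit the forward path built so far. This is a simplified reading for a single traversal, not the exact Algorithm MF from the cited work; the function name and the sample traversal are illustrative assumptions.

```python
def maximal_forward_references(path):
    """Split one user's traversal into maximal forward references."""
    references = []
    current = []            # pages on the current forward path
    moving_forward = False
    for page in path:
        if page in current:
            # Backward move: emit the path only if we had been going forward.
            if moving_forward:
                references.append(list(current))
            # Drop everything after the revisited page.
            current = current[: current.index(page) + 1]
            moving_forward = False
        else:
            current.append(page)
            moving_forward = True
    if moving_forward and current:
        references.append(list(current))
    return references

# Hypothetical traversal: each letter is a page, in the order visited.
traversal = list("ABCDCBEGHGWAOUOV")
references = maximal_forward_references(traversal)
```

For this traversal the forward paths come out as ABCD, ABEGH, ABEGW, AOU and AOV; these would then feed the search for large reference sequences in Step 2.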
84Determine large reference sequences
- Algorithm FS
- Utilizes the key ideas of algorithm DHP
- employs hashing and pruning techniques
- DHP is very efficient for the generation of
candidate itemsets, in particular for the large
two-itemsets, thus greatly relieving the
performance bottleneck of the whole process
- Algorithm SS
- employs hashing and pruning techniques to reduce
both CPU and I/O costs
- by properly utilizing the information in
candidate references in prior passes, is able to
avoid database scans in some passes, thus further
reducing the disk I/O cost
85Patterns Analysis Tools
- WebViz (pitkwa94) provides appropriate tools
and techniques to understand, visualize, and
interpret access patterns.
- Proposes OLAP techniques such as data cubes for
the purpose of simplifying the analysis of usage
statistics from server access logs (dyreua et
al)
86Patterns Discovery and Analysis Tools
- The emerging tools for user pattern discovery use
sophisticated techniques from AI, data mining,
psychology, and information theory, to mine for
knowledge from collected data
- (Pirolli et. al.) use information foraging theory
to combine path traversal patterns, Web page
typing, and site topology information to
categorize pages for easier access by users.
87(Contd)
- WEBMINER
- introduces a general architecture for Web usage
mining, automatically discovering association
rules and sequential patterns from server access
logs.
- proposes an SQL-like query mechanism for querying
the discovered knowledge in the form of
association rules and sequential patterns.
- WebLogMiner
- Web log is filtered to generate a relational
database
- Data mining on web log data cube and web log
database
88WEBMINER
- SQL-like Query
- A framework for Web mining, the application of
data mining and knowledge discovery techniques,
association rules and sequential patterns, to Web
data
- Association rules using apriori algorithm
- 40% of clients who accessed the Web page with URL
/company/products/product1.html also accessed
/company/products/product2.html
- Sequential patterns using modified apriori
algorithm
- 60% of clients who placed an online order in
/company/products/product1.html also placed an
online order in /company/products/product4.html
within 15 days
89WebLogMiner
- Database construction from server log file
- data cleaning
- data transformation
- Multi-dimensional web log data cube construction
and manipulation
- Data mining on web log data cube and web log
database
90Mining the World-Wide Web
- Design of a Web Log Miner
- Web log is filtered to generate a relational
database
- A data cube is generated from the database
- OLAP is used to drill-down and roll-up in the
cube
- OLAM is used for mining interesting knowledge
[Figure: Web log -> (1 Data Cleaning) -> Database -> (2 Data Cube Creation) -> Data Cube -> (3 OLAP) -> Sliced and diced cube -> (4 Data Mining) -> Knowledge]
91Construction of Data Cubes (http://db.cs.sfu.ca/sections/publication/slides/slides.html)
[Figure: data cube with dimensions Amount (0-20K, 20-40K, 40-60K, 60K-, sum), Province (B.C., Prairies, Ontario, sum) and Discipline (Comp_Method, Database, ..., sum); labeled cell: All Amount, Comp_Method, B.C.]
- Each dimension contains a hierarchy of values
for one attribute
- A cube cell stores aggregate values, e.g.,
count, sum, max, etc.
- A sum cell stores dimension summation values.
- Sparse-cube technology and MOLAP/ROLAP
integration.
- Chunk-based multi-way aggregation and
single-pass computation.
92WebLogMiner Architecture
- Web log is filtered to generate a relational
database
- A data cube is generated from database
- OLAP is used to drill-down and roll-up in the
cube
- OLAM is used for mining interesting knowledge
[Figure: Web log -> (1 Data Cleaning) -> Database -> (2 Data Cube Creation) -> Data Cube -> (3 OLAP) -> Sliced and diced cube -> (4 Data Mining) -> Knowledge]
93 WEBSIFT
94What is WebSIFT?
- a Web Usage Mining framework that
- performs preprocessing
- performs knowledge discovery
- uses the structure and content information about
a Web site to automatically define a belief set.
95Overview of WebSIFT
- Based on WEBMINER prototype
- Divides the Web Usage Mining process into three
main parts
96Overview of WebSIFT
- Input
- Access
- Referrer and agent
- HTML files
- Optional data (e.g., registration data or remote
agent logs)
97Overview of WebSIFT
- Preprocessing
- uses input data to construct a user session file
- site files are used to classify pages of a site
- Knowledge discovery phase
- uses existing data mining techniques to generate
rules and patterns.
- generation of general usage stats
98Information Filtering
- Links between pages provide evidence for
supporting the belief that those pages are
related.
- Strength of evidence for a set of pages being
related is proportional to the strength of the
topological connection between the set of pages.
- Based on site content, can also look at content
similarity by calculating distance between
pages.
99Information Filtering
100Information Filtering
- Uses two different methods to identify
interesting results from a list of discovered
frequent itemsets
101Information Filtering
- Method 1
- declare itemsets that contain pages not directly
connected to be interesting
- corresponds to a situation where a belief that a
set of pages are related has no domain or
existing evidence but there is mined evidence
=> called the Beliefs with Mined Evidence (BME)
algorithm
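A rough sketch of the filter described in Method 1. It assumes "not directly connected" means that at least one pair of pages in the itemset has no hyperlink in either direction; that reading, the function name, and the sample data are assumptions, and this is not the BME algorithm as implemented in WebSIFT.

```python
from itertools import combinations

def itemsets_without_link_evidence(frequent_itemsets, links):
    """Keep itemsets with at least one pair of pages not directly linked.

    `links` maps each page to the set of pages it links to.
    """
    def linked(u, v):
        return v in links.get(u, set()) or u in links.get(v, set())

    return [
        itemset for itemset in frequent_itemsets
        if any(not linked(u, v) for u, v in combinations(itemset, 2))
    ]

# Hypothetical site structure and mined frequent itemsets.
links = {"/": {"/a", "/b"}, "/a": {"/c"}}
frequent_itemsets = [("/", "/a"), ("/a", "/b"), ("/", "/a", "/c")]
interesting = itemsets_without_link_evidence(frequent_itemsets, links)
```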
102Information Filtering
- Method 2
- Absence of itemsets => evidence against a belief
that pages are related.
- Pages that have individual support above a
threshold but are not present together in larger
frequent itemsets => evidence against the pages
being related.
- If domain evidence suggests that pages are
related => the absence of the frequent itemset
can be considered interesting. This is handled by
the Beliefs with Contradicting Evidence (BCE)
algorithm
103Experimental Evaluation
- Performed on web server of U of MN Dept of Comp
Sci Engg web site
- Log spanned eight days in Feb 1999
- Physical size of log: 19.3 MB
- 102,838 entries
- After preprocessing: 43,158 page views (divided
among 10,609 user sessions)
- Threshold of 0.1 for support used to generate
693 frequent itemsets with maximum set size of
six pages.
- 178 unique pages represented in all the rules.
- BCE and BME algorithms run on frequent itemsets.
104Experimental Evaluation
105Experimental Evaluation
106Future work
- Filtering frequent itemsets, sequential patterns
and clusters
- Incorporate probabilities and fuzzy logic into
information filter
- Other future work includes path completion
verification, page usage determination,
application of the pattern analysis results, etc.
107Link Analysis
108Link Analysis
- Finding patterns in graphs
- Bibliometrics: finding patterns in citation
graphs
- Sociometry: finding patterns in social networks
- Collaborative Filtering: finding patterns in
rank(person, item) graph
- Webometrics: finding patterns in web page links
109Web Link Analysis
- Used for
- ordering documents matching a user query (ranking)
- deciding what pages to add to a collection
(crawling)
- page categorization
- finding related pages
- finding duplicated web sites
110Web as Graph
- Link graph
- node for each page
- directed edge (u,v) if page u contains a
hyperlink to page v
- Co-citation graph
- node for each page
- undirected edge (u,v) iff there exists a third page w
linking to both u and v
- Assumption
- a link from page A to page B is a recommendation of
page B by A
- If A and B are connected by a link, there is a
higher probability that they are on the same
topic
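The two graph views can be built from a plain list of hyperlinks. The sketch below is illustrative only; the function names and the toy link list are assumptions.

```python
from collections import defaultdict
from itertools import combinations

def link_graph(hyperlinks):
    """Directed link graph: page -> set of pages it links to."""
    out = defaultdict(set)
    for u, v in hyperlinks:
        out[u].add(v)
    return dict(out)

def cocitation_graph(hyperlinks):
    """Undirected co-citation edges: (u, v) whenever a third page w
    links to both u and v."""
    edges = set()
    for w, targets in link_graph(hyperlinks).items():
        # Exclude w itself so that w is really a *third* page.
        for u, v in combinations(sorted(targets - {w}), 2):
            edges.add((u, v))
    return edges

# Hypothetical toy web: w cites a and b, x cites b and c.
hyperlinks = [("w", "a"), ("w", "b"), ("a", "c"), ("x", "b"), ("x", "c")]
graph = link_graph(hyperlinks)
cocited = cocitation_graph(hyperlinks)
```

Each undirected edge is stored once, as an alphabetically ordered pair.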
111Web structure mining
- HITS (Topic distillation)
- PageRank (Ranking web pages used by Google)
- Algorithm in Cyber-community
112HITS Algorithm--Topic Distillation on WWW
113HITS Method
- Hyperlink Induced Topic Search
- Kleinberg, 1998
- A simple approach by finding hubs and authorities
- View web as a directed graph
- Assumption: if document A has a hyperlink to
document B, then the author of document A thinks
that document B contains valuable information
114Main Ideas
- Concerned with the identification of the most
authoritative, or definitive, Web pages on a
broad topic
- Focused on only one topic
- Viewing the Web as a graph
- A purely link structure-based computation,
ignoring the textual content
115HITS Hubs and Authority
- Hub: a web page that links to a collection of prominent
sites on a common topic, i.e. to many
authoritative pages on a broad topic
- Authority: a web page pointed to by many hubs
- Mutually reinforcing relationship: a good authority
is a page that is pointed to by many good hubs,
while a good hub is a page that points to many
good authorities
116Hub-Authority Relations
117HITS Two Main Steps
- A sampling component, which constructs a focused
collection of several thousand web pages likely
to be rich in relevant authorities
- A weight-propagation component, which determines
numerical estimates of hub and authority weights
by an iterative procedure
- As a result, the pages with the highest weights are
returned as hubs and authorities for the search
topic
118HITS Root Set and Base Set
- Use the query terms to collect a root set (S) of
pages from an index-based search engine (AltaVista)
- Expand root set to base set (T) by including all
pages linked to by pages in root set and all
pages that link to a page in root set (up to a
designated size cut-off)
- Typical base set contains roughly 1000-5000 pages
119Step 1 Constructing Subgraph
- 1.1 Creating a root set (S)
- Given a query string on a broad topic
- Collect the t highest-ranked pages for the
query from a text-based search engine
- 1.2 Expanding to a base set (T)
- Add every page that points to a page in the root set
- Add every page pointed to by a page in the root set
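The expansion from root set to base set can be sketched as below. This is a minimal illustration under stated assumptions: `out_links` stands in for whatever link index is available, and the per-page cap `max_in_per_page` is a hypothetical way of applying the size cut-off mentioned earlier.

```python
def expand_to_base_set(root_set, out_links, max_in_per_page=50):
    """Return the base set T: the root set S, every page a root page
    links to, and up to max_in_per_page pages linking to each root page."""
    # Invert the link index so in-links can be looked up per page.
    in_links = {}
    for u, targets in out_links.items():
        for v in targets:
            in_links.setdefault(v, set()).add(u)

    base = set(root_set)
    for p in root_set:
        base |= out_links.get(p, set())                       # pages pointed to by p
        base |= set(sorted(in_links.get(p, set()))[:max_in_per_page])  # pages pointing to p
    return base

# Hypothetical toy link index: page -> set of pages it links to.
out_links = {
    "r1": {"a"}, "r2": {"b"},
    "h1": {"r1", "r2"}, "h2": {"r1"},
    "z": {"a"},          # z only shares a target with r1, so it stays out
}
base = expand_to_base_set({"r1", "r2"}, out_links)
capped = expand_to_base_set({"r1", "r2"}, out_links, max_in_per_page=1)
```

The cap keeps a very popular root page from flooding the base set with in-linking pages.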
120Root Set and Base Set (Contd)
121Step 2 Computing Hubs and Authorities
- 2.1 Associating weights
- Authority weight $x_p$ for each page p
- Hub weight $y_p$ for each page p
- Set all values to a uniform constant initially
- 2.2 Updating weights
122Updating Authority Weight
Example (pages q1, q2 and q3 link to p)
$x_p = y_{q_1} + y_{q_2} + y_{q_3}$
123Updating Hub Weight
Example (p links to pages q1, q2 and q3)
$y_p = x_{q_1} + x_{q_2} + x_{q_3}$
124Flowchart
125Results
- All x- and y-values converge rapidly, so
termination of the iteration is guaranteed
- Convergence can be proved mathematically
- Pages with the highest x-values are viewed as the
best authorities, while pages with the highest
y-values are regarded as the best hubs
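The weight-propagation step can be sketched in a few lines. This is a minimal illustration, assuming a fixed number of iterations in place of a convergence test and normalization to unit Euclidean length after each round; the function names and toy graph are hypothetical.

```python
import math

def normalize(weights):
    """Scale weights so that their squares sum to 1."""
    norm = math.sqrt(sum(v * v for v in weights.values())) or 1.0
    return {p: v / norm for p, v in weights.items()}

def hits(out_links, iterations=20):
    """Return (authority, hub) weights for a page -> targets mapping."""
    pages = set(out_links) | {v for ts in out_links.values() for v in ts}
    auth = {p: 1.0 for p in pages}   # x-values
    hub = {p: 1.0 for p in pages}    # y-values
    for _ in range(iterations):
        # Authority update: x_p = sum of y_q over pages q that link to p.
        new_auth = {p: 0.0 for p in pages}
        for q, targets in out_links.items():
            for p in targets:
                new_auth[p] += hub[q]
        # Hub update: y_p = sum of x_q over pages q that p links to.
        new_hub = {p: sum(new_auth[q] for q in out_links.get(p, ()))
                   for p in pages}
        auth, hub = normalize(new_auth), normalize(new_hub)
    return auth, hub

# Hypothetical toy graph: h1 and h2 act as hubs, a1 and a2 as authorities.
out_links = {"h1": {"a1", "a2"}, "h2": {"a1"}}
auth, hub = hits(out_links)
best_authority = max(auth, key=auth.get)
best_hub = max(hub, key=hub.get)
```

In this toy graph a1 ends up as the best authority because both hubs point to it, and h1 as the best hub because it points to both authorities.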
126Implementation
- Search engine: AltaVista
- Root set: 200 pages
- Base set: 1000-5000 pages
- Convergence speed: very rapid, fewer than 20 iterations
- Running time: about 30 minutes
127HITS Advantages
- Weight computation is an intrinsic feature of the
collection of linked pages
- Provides a densely linked community of related
authorities and hubs
- Pure link-based computation once the root set has
been assembled, with no further regard to query
terms
- Provides surprisingly good search results for a
wide range of queries
128Drawbacks
- Limits on narrow topics
- Not enough authoritative pages
- Frequently returns resources for a more general topic
- Adding a few edges can potentially change scores
considerably
- Topic drifting
- Appears when hubs discuss multiple topics
129Improved Work
- To improve precision
- Combining content with link information
- Breaking large hub pages into smaller units
- Computing relevance weights for pages
- To improve speed
- Building a Connectivity Server that
provides linkage information for all pages
130Web Structure Mining
- Page-Rank Method
- CLEVER Method
- Connectivity-Server Method
1311. Page-Rank Method
- Introduced by Brin and Page (1998)
- Mines the hyperlink structure of the web to produce
a global importance ranking of every web page
- Used in the Google search engine
- Web search results are returned in rank order
- Treats a link like an academic citation
- Assumption: highly linked pages are more
important than pages with few links
- A page has a high rank if the sum of the ranks of
its back-links is high
132Page Rank Computation
- Assume
- $R(u)$: rank of a web page u
- $F_u$: set of pages which u points to
- $B_u$: set of pages that point to u
- $N_u$: number of links from u
- $c$: normalization factor
- $E(u)$: vector of web pages as source of rank
- Page Rank Computation
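A reconstruction from the symbols above, assuming the standard PageRank formulation: the rank of u is the sum of the ranks of its back-links, each divided by that page's number of out-links, plus a contribution from the rank source E, all scaled by the normalization factor c.

```latex
R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v} + c\,E(u)
```

Each page v thus shares its rank equally among the $N_v$ pages it points to, which is why a page gains a high rank when the ranks of its back-links sum to a high value.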
133Page Rank Implementation
- Stanford WebBase project: a complete crawling and
indexing system with a repository of 24
million web pages (old data)
- Store each URL as a unique integer ID and each
hyperlink as a pair of integer IDs
- Remove dangling links by an iterative procedure
- Make initial assignment of the ranks
- Propagate page ranks in iterative manner
- Upon convergence, add the dangling links back and
recompute the rankings
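The iterative propagation step can be sketched as below. This is a simplified illustration, not the WebBase implementation: it assumes a uniform rank source with damping factor 0.85, runs a fixed number of iterations, and, instead of removing dangling links and adding them back, spreads the rank of a dangling page uniformly over all pages. The function name and toy graph are hypothetical.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Return a page -> rank mapping whose values sum to 1."""
    pages = sorted(set(out_links) | {v for ts in out_links.values() for v in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}           # initial assignment
    for _ in range(iterations):
        # Uniform rank source (an assumption of this sketch).
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for u in pages:
            targets = out_links.get(u, ())
            if targets:
                # u shares its rank equally among the pages it links to.
                share = damping * rank[u] / len(targets)
                for v in targets:
                    new_rank[v] += share
            else:
                # Dangling page: spread its rank over every page.
                for v in pages:
                    new_rank[v] += damping * rank[u] / n
        rank = new_rank
    return rank

# Hypothetical toy graph.
out_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
rank = pagerank(out_links)
top_page = max(rank, key=rank.get)
```

Page c receives links from three pages and comes out on top; page a follows closely because it is the only page c links to.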
134Page Rank Results
- Google utilizes a number of factors to rank the
search results
- proximity, anchor text, page rank
- The benefits of Page Rank are the greatest for
underspecified queries; for example, for the query
Stanford University, Page Rank lists the
university home page first
135Page Rank Advantages
- Global ranking of all web pages regardless of
their content, based solely on their location in
web graph structure
- Higher quality search results: central,
important, and authoritative web pages are given
preference
- Helps find representative pages to display for a
cluster center
- Other applications: traffic estimation, back-link
predictor, user navigation, personalized page
rank
- Mining structure of web graph is very useful for
various information retrieval tasks
136CLEVER Method
- CLientside EigenVector-Enhanced Retrieval
- Developed by a team of IBM researchers at IBM
Almaden Research Centre
- Continued refinements of HITS
- Ranks pages primarily by measuring links between
them
- Basic Principles: Authorities, Hubs
- Good hubs point to good authorities
- Good authorities are referenced by good hubs
137Problems Prior to CLEVER
- Ignoring textual content leads to problems
caused by some features of the web
- HITS returns good resources for a more general
topic when query topics are narrowly focused
- HITS occasionally drifts when hubs discuss
multiple topics
- Pages from a single Web site often take over a
topic; they often share the same HTML template and
therefore all point to a single popular site that is
irrelevant to the query topic
138CLEVER Solution
- Replacing the sums of Equation (1) and (2) of
HITS with weighted sums
- Assign to each link a non-negative weight
- Weight depends on the query term and end point
- Extension 1: Anchor Text
- using text that surrounds hyperlink definitions
(hrefs) in Web pages, often referred to as anchor
text
- boost the weights of links that occur
near instances of query terms
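One way to picture the anchor-text extension is a weighted authority update. The weighting rule used here (one plus the number of query-term matches in the text around the link) is an assumption chosen for illustration, as are the function names and the toy links; the text above states only that link weights are non-negative and boosted near query terms.

```python
def link_weight(anchor_window, query_terms):
    """Assumed rule: 1 + number of query-term occurrences near the href."""
    words = anchor_window.lower().split()
    matches = sum(words.count(term.lower()) for term in query_terms)
    return 1 + matches

def weighted_authority_scores(links, hub, query_terms):
    """One authority update in which each link contributes
    weight(link) * hub score of its source page."""
    auth = {}
    for source, target, window in links:
        contribution = link_weight(window, query_terms) * hub[source]
        auth[target] = auth.get(target, 0.0) + contribution
    return auth

# Hypothetical links: (source page, target page, text around the href).
links = [
    ("hub1", "pageA", "a guide to web mining tools"),
    ("hub1", "pageB", "photos from our holiday"),
    ("hub2", "pageA", "web mining tutorial"),
]
hub = {"hub1": 1.0, "hub2": 1.0}
auth = weighted_authority_scores(links, hub, ["mining"])
```

Links whose surrounding text mentions the query term count double here, so pageA is favoured over pageB even before any iteration.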
139CLEVER Solution (Contd)
- Extension 2: Mini Hub Pagelets
- breaking large hub into smaller units
- treat contiguous subsets of links as mini-hubs or
pagelets
- contiguous sets of links on a hub page are more
focused on a single topic than the entire page
140CLEVER The Process
- Starts by collecting a set of pages
- Gathers all pages linked from the initial set, plus any pages
linking to them
- Ranks result by counting links
- Links have noise, not clear which pages are best
- Recalculate scores
- Pages with most links are established as most
important, and their links transmit more weight
- Repeat the calculation a number of times until the scores are
refined
141CLEVER Advantages
- Used to populate categories of different subjects
with minimal human assistance
- Able to leverage links to fill category with best
pages on web
- Can be used to compile large taxonomies of topics
automatically
- Emerging new directions: hypertext
classification, focused crawling, mining
communities
142Connectivity Server Method
- Server that provides linkage information for all
pages indexed by a search engine
- In its base operation, the server accepts a query
consisting of a set of one or more URLs and
returns a list of all pages that point to pages in
the query set (parents) and a list of all pages that
are pointed to from pages in the query set (children)
- In its base operation, it also provides the
neighbourhood graph for the query set
- Acts as underlying infrastructure, supports
search engine applications
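The base operation can be sketched with two in-memory indexes. This is an illustration only: the class name `ConnectivityIndex`, its methods, and the toy links are hypothetical, and a real server would hold the indexes in a compressed on-disk form.

```python
class ConnectivityIndex:
    """Toy linkage index answering parent/child queries for a set of URLs."""

    def __init__(self, hyperlinks):
        self.children = {}   # page -> pages it points to
        self.parents = {}    # page -> pages pointing to it
        for u, v in hyperlinks:
            self.children.setdefault(u, set()).add(v)
            self.parents.setdefault(v, set()).add(u)

    def query(self, urls):
        """Return (parents, children) of the query set."""
        parents, children = set(), set()
        for url in urls:
            parents |= self.parents.get(url, set())
            children |= self.children.get(url, set())
        return parents, children

    def neighbourhood_graph(self, urls):
        """Query set plus parents and children, with the links among them."""
        parents, children = self.query(urls)
        nodes = set(urls) | parents | children
        edges = {(u, v) for u in nodes
                 for v in self.children.get(u, set()) if v in nodes}
        return nodes, edges

# Hypothetical links between pages p, q, r, s, t.
index = ConnectivityIndex([("p", "q"), ("q", "r"), ("s", "q"), ("r", "t")])
parents, children = index.query(["q"])
nodes, edges = index.neighbourhood_graph(["q"])
```

The neighbourhood graph for q leaves out page t, which is two links away from the query set.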
143Whats Connectivity Server (Contd)
Neighborhood Graph
144CONSERV Web Structure Mining
- Finding Authoritative Pages (Search by topic)
- (pages that are high in quality and relevant to
the topic)
- Finding Related Pages (Search by URL)
- (pages that address the same topic as the original
page, not necessarily semantically identical)
- Algorithms include Companion, Cocitation
145CONSERV Finding Related Page
146CONSERV Companion Algorithm
- An extension to HITS algorithm
- Features
- Exploit not only links but also their order on a
page
- Use link weights to reduce the influence of pages
that all reside on one host
- Merge nodes that have a large number of duplicate
links
- The base graph is structured to exclude
grandparent nodes but include nodes that share a
child
147Companion Algorithm (Contd)
- Four steps
- 1. Build a vicinity graph for u
- 2. Remove duplicates and near-duplicates in
graph.
- 3. Compute link weights based on host to host
connection
- 4. Compute a hub score and an authority score for
each node in the graph, return the top ranked
authority nodes.
148Companion Algorithm (Contd) Building the Vicinity Graph
- Set up parameters: B = number of parents of u, BF =
number of children per parent, F = number of children of
u, FB = number of parents per child
- Stoplist (pages that are unrelated to most
queries and have a very high in-degree)
- Procedure
- Go Back (B): choose parents (randomly)
- Back-Forward (BF): choose siblings (nearest)
- Go Forward (F): choose children (first)
- Forward-Back (FB): choose siblings (highest
in-degree)
149Companion Algorithm (Contd) Remove duplicates
- Two nodes are near-duplicates if each has more than
10 links and they have at least 95% of their
links in common
- Replace the two nodes with a node whose links are the
union of the links of the two nodes
- (mirror sites, aliases)
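The near-duplicate test and merge can be sketched as follows. This is a minimal illustration: it assumes "95% of their links in common" means the shared links make up at least 95% of each node's own links, and the function names and toy link sets are hypothetical.

```python
def near_duplicates(links_a, links_b, min_links=10, overlap=0.95):
    """True if both nodes have more than min_links links and the links
    they share make up at least `overlap` of each node's links."""
    if len(links_a) <= min_links or len(links_b) <= min_links:
        return False
    common = len(links_a & links_b)
    return common >= overlap * len(links_a) and common >= overlap * len(links_b)

def merge(links_a, links_b):
    """Links of the replacement node: the union of both link sets."""
    return links_a | links_b

# Hypothetical link sets: a mirror with one extra link, and two unrelated pages.
original = {f"p{i}" for i in range(20)}
mirror = original | {"extra"}
small = {f"p{i}" for i in range(5)}
other = {f"q{i}" for i in range(20)}

is_mirror = near_duplicates(original, mirror)
merged = merge(original, mirror)
```

The minimum-link condition keeps short pages, which can easily share all of their few links by chance, from being merged.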
150Companion Algorithm (Contd) Assign edge (link) weights
- A link between pages on the same host has weight 0
- If there are k links from documents on a host to
a single document on a different host, each link has an
authority weight of 1/k
- If there are k links from a single document on a
host to a set of documents on a different host, give
each link a hub weight of 1/k
- (prevents a single host from having too much
influence on the computation)
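The weighting rules can be sketched as below. This is an illustration under stated assumptions: hosts are taken from the URL with `urllib.parse.urlparse`, the input is a collection of distinct (source, target) URL pairs, and the function names and example URLs are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse

def host(url):
    """Host part of a URL, e.g. 'a.com' for 'http://a.com/1'."""
    return urlparse(url).netloc

def edge_weights(edges):
    """Return (authority_weight, hub_weight), each mapping edge -> weight."""
    cross = [(u, v) for u, v in edges if host(u) != host(v)]
    # k for authority weights: links from one host to one target document.
    auth_k = Counter((host(u), v) for u, v in cross)
    # k for hub weights: links from one source document to one host.
    hub_k = Counter((u, host(v)) for u, v in cross)

    auth_w, hub_w = {}, {}
    for u, v in edges:
        if host(u) == host(v):
            auth_w[(u, v)] = hub_w[(u, v)] = 0.0   # same-host links are ignored
        else:
            auth_w[(u, v)] = 1.0 / auth_k[(host(u), v)]
            hub_w[(u, v)] = 1.0 / hub_k[(u, host(v))]
    return auth_w, hub_w

# Hypothetical edges between two hosts.
a1, a2 = "http://a.com/1", "http://a.com/2"
bx, by = "http://b.com/x", "http://b.com/y"
edges = [(a1, bx), (a2, bx), (a1, by), (a1, a2)]
auth_w, hub_w = edge_weights(edges)
```

Two documents on a.com point to b.com/x, so each of those links carries an authority weight of 1/2, and a.com/1 points to two documents on b.com, so each of its links carries a hub weight of 1/2.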
151Companion Algorithm (Contd) Compute hub and authority scores
- Extension of the HITS algorithm with edge weights
- Initialize all elements of the hub vector H to 1
- Initialize all elements of the authority vector A
to 1
- While the vectors H and A have not converged
- For all nodes n in the vicinity graph N,
- $A[n] := \sum_{(n',n) \in edges(N)} H[n'] \times authority\_weight(n',n)$
- For all n in N,
- $H[n] := \sum_{(n,n') \in edges(N)} A[n'] \times hub\_weight(n,n')$
- Normalize the H and A vectors.
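A compact sketch of the weighted iteration follows. It assumes a fixed number of rounds in place of the convergence test and unit-length normalization; the function names, the toy edges and their weights are hypothetical.

```python
import math

def _normalize(vec):
    """Scale a score vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {n: v / norm for n, v in vec.items()}

def companion_scores(edges, authority_weight, hub_weight, iterations=20):
    """Return (H, A): hub and authority scores with per-edge weights."""
    nodes = {n for edge in edges for n in edge}
    H = {n: 1.0 for n in nodes}
    A = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # A[n] sums H over incoming edges, scaled by the authority weight.
        new_A = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_A[dst] += H[src] * authority_weight[(src, dst)]
        # H[n] sums A over outgoing edges, scaled by the hub weight.
        new_H = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_H[src] += new_A[dst] * hub_weight[(src, dst)]
        A, H = _normalize(new_A), _normalize(new_H)
    return H, A

# Hypothetical vicinity graph: s1 and s2 are on the same host and both
# link to x (authority weight 1/2 each); t links to y from its own host.
edges = [("s1", "x"), ("s2", "x"), ("t", "y")]
authority_weight = {("s1", "x"): 0.5, ("s2", "x"): 0.5, ("t", "y"): 1.0}
hub_weight = {edge: 1.0 for edge in edges}
H, A = companion_scores(edges, authority_weight, hub_weight)
```

With these weights x gains no advantage over y from being linked twice by the same host, which is the effect the edge weights are meant to achieve.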
152CONSERV Cocitation Algorithm
- Two nodes are co-cited if they have a common
parent
- The number of common parents of two nodes is
their degree of co-citation
- Determine the related pages by looking for
sibling nodes with the highest degree of
co-citation
- In some cases there is an insufficient level of
cocitation to provide meaningful results; chop
off elements of the URL and restart the algorithm.
- e.g. A.com/X/Y/Z → A.com/X/Y
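Counting common parents is straightforward once parent and child lookups are available. The sketch below is illustrative; the function name, the `top` parameter and the toy graph are assumptions.

```python
from collections import Counter

def related_by_cocitation(u, parents, children, top=10):
    """Rank the siblings of u by degree of co-citation, i.e. by the
    number of parents they share with u."""
    degree = Counter()
    for parent in parents.get(u, ()):
        for sibling in children.get(parent, ()):
            if sibling != u:
                degree[sibling] += 1
    return degree.most_common(top)

# Hypothetical graph: p1, p2 and p3 link to u; p4 does not.
children = {
    "p1": {"u", "a", "b"},
    "p2": {"u", "a"},
    "p3": {"u", "c"},
    "p4": {"a", "c"},
}
parents = {"u": {"p1", "p2", "p3"}}
related = related_by_cocitation("u", parents, children)
```

Page a shares two parents with u and is ranked first; p4 contributes nothing because it does not link to u.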
153Comparative Study
- Page Rank
- (Google)
- Assigns initial rankings and retains them
independently of queries (fast)
- In the forward direction from link to link
- Qualitative result
- Hub/Authority (CLEVER, C-Server)
- Assembles a different root set and prioritizes
pages in the context of the query
- Looks in the forward and backward directions
- Qualitative result
154Connectivity-Based Ranking
- Query-independent: gives an intrinsic quality
score to a page
- Approach 1: the larger the number of hyperlinks pointing
to a page, the better the page
- drawback: each link is treated as equally important
- Approach 2: weight each hyperlink proportionally
to the quality of the page containing the
hyperlink
155Query-dependent Connectivity-Based Ranking
- Carriere and Kazman
- For each query, build a subgraph of the link
graph G limited to pages on the query topic
- Build the neighborhood graph
- A start set S of documents matching the query, given
by a search engine (200)
- Set augmented by its neighborhood, the set of
documents that either point to or are pointed to
by documents in S (limit to 50)
- Then rank based on indegree
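The ranking step can be sketched as below. This is a minimal illustration that leaves out the size limits on the start set and neighbourhood; the function name and the toy edges are hypothetical.

```python
def rank_by_indegree(start_set, edges):
    """Build the neighbourhood graph of the start set and order its
    pages by in-degree within that graph (ties broken alphabetically)."""
    neighbourhood = set(start_set)
    for u, v in edges:
        # Keep documents that point to, or are pointed to by, the start set.
        if u in start_set or v in start_set:
            neighbourhood.update((u, v))
    indegree = {p: 0 for p in neighbourhood}
    for u, v in edges:
        if u in neighbourhood and v in neighbourhood:
            indegree[v] += 1
    return sorted(neighbourhood, key=lambda p: (-indegree[p], p))

# Hypothetical link graph; x and y are unrelated to the start set.
edges = [("a", "s1"), ("b", "s1"), ("s1", "s2"), ("s2", "c"), ("x", "y")]
ranking = rank_by_indegree({"s1", "s2"}, edges)
```

Only links inside the neighbourhood graph are counted, so a page that is popular on the web at large but off-topic gains nothing from its other in-links.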
156Idea
- We desire pages that are relevant (in the
neighborhood graph) and authoritative
- As in page rank, what matters is not only the in-degree of a page
p but also the quality of the pages that point to p.
If more important pages point to p, then p
is more authoritative
- Key idea: good hub pages have links to good
authority pages
- given a user query, compute a hub score and an
authority score for each document
- high authority score → relevant content
- high hub score → links to documents with
relevant content
157Impr