Web Mining: A Bird's Eye View

Web Mining: A Bird's Eye View. Sanjay Kumar Madria, Department of Computer Science, University of Missouri-Rolla, MO 65401, madrias_at_umr.edu

1
Web Mining: A Bird's Eye View
  • Sanjay Kumar Madria
  • Department of Computer Science
  • University of Missouri-Rolla, MO 65401
  • madrias_at_umr.edu

2
Web Mining
  • Web mining: the use of data mining techniques to
    automatically discover and extract information
    from Web documents/services (Etzioni, 1996).
  • Web mining research integrates research from
    several research communities (Kosala and
    Blockeel, July 2000), such as:
  • Database (DB)
  • Information retrieval (IR)
  • The sub-areas of machine learning (ML)
  • Natural language processing (NLP)

3
Mining the World-Wide Web
  • The WWW is a huge, widely distributed, global
    information source for:
  • Information services: news, advertisements,
    consumer information, financial management,
    education, government, e-commerce, etc.
  • Hyper-link information
  • Access and usage information
  • Web Site contents and Organization

4
Mining the World-Wide Web
  • Growing and changing very rapidly
  • Broad diversity of user communities
  • Only a small portion of the information on the
    Web is truly relevant or useful to Web users
  • How to find high-quality Web pages on a specified
    topic?
  • WWW provides rich sources for data mining

5
Challenges on WWW Interactions
  • Finding Relevant Information
  • Creating knowledge from the available information
  • Personalization of the information
  • Learning about customers / individual users
  • Web Mining can play an important Role!

6
Web Mining more challenging
  • Searches for
  • Web access patterns
  • Web structures
  • Regularity and dynamics of Web contents
  • Problems
  • The abundance problem
  • Limited coverage of the Web: hidden Web sources,
    majority of data in DBMSs
  • Limited query interface based on keyword-oriented
    search
  • Limited customization to individual users
  • Dynamic and semistructured

7
Web Mining Subtasks
  • Resource Finding
  • Task of retrieving intended web-documents
  • Information Selection and Pre-processing
  • Automatic selection and pre-processing of specific
    information from retrieved web resources
  • Generalization
  • Automatic Discovery of patterns in web sites
  • Analysis
  • Validation and / or interpretation of mined
    patterns

8
Web Mining Taxonomy
Web Mining
Web Content Mining
Web Usage Mining
Web Structure Mining
9
Web Content Mining
  • Discovery of useful information from web contents
    / data / documents
  • Web data contents: text, image, audio, video,
    metadata and hyperlinks.
  • Information Retrieval View (Structured,
    Semi-Structured)
  • Assist / Improve information finding
  • Filtering information for users based on user profiles
  • Database View
  • Model Data on the web
  • Integrate them for more sophisticated queries

10
Issues in Web Content Mining
  • Developing intelligent tools for IR
    - Finding keywords and key phrases
    - Discovering grammatical rules and collocations
    - Hypertext classification/categorization
    - Extracting key phrases from text documents
    - Learning extraction models/rules
    - Hierarchical clustering
    - Predicting (words) relationship

11
Cont.
  • Developing Web query systems
  • WebOQL, XML-QL
  • Mining multimedia data
    - Mining images from satellites (Fayyad, et al., 1996)
    - Mining images to identify small volcanoes on
      Venus (Smyth, et al., 1996)

12
Web Structure Mining
  • Discovers the link structure of the hyperlinks
    at the inter-document level in order to generate
    a structural summary of the Website and its Web
    pages.
  • Direction 1: based on the hyperlinks,
    categorizing the Web pages and the generated
    information.
  • Direction 2: discovering the structure of the Web
    document itself.
  • Direction 3: discovering the nature of the
    hierarchy or network of hyperlinks in the Website
    of a particular domain.

13
Web Structure Mining
  • Finding authoritative Web pages
  • Retrieving pages that are not only relevant, but
    also of high quality, or authoritative on the
    topic
  • Hyperlinks can be used to infer the notion of authority
  • The Web consists not only of pages, but also of
    hyperlinks pointing from one page to another
  • These hyperlinks contain an enormous amount of
    latent human annotation
  • A hyperlink pointing to another Web page can be
    considered the author's endorsement of
    that page
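
The endorsement idea above can be sketched as a hub/authority iteration in the style of HITS. This is only an illustration under stated assumptions, not the method of any system named in these slides: the toy link graph, the function name `hits`, and the fixed iteration count are all hypothetical.

```python
import math

def hits(links, iterations=50):
    """links maps each page to the list of pages it points to (toy input)."""
    pages = set(links) | {dst for targets in links.values() for dst in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page's authority is the sum of the hub scores of pages linking to it.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        # A page's hub score is the sum of the authority scores it links to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalize both score vectors so the values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# Hypothetical graph: pages "a" and "b" both endorse "c"; "c" links to "d".
toy_links = {"a": ["c"], "b": ["c"], "c": ["d"], "d": []}
hub, auth = hits(toy_links)
most_authoritative = max(auth, key=auth.get)
```

In this toy graph the page with two incoming endorsements ends up with the highest authority score.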

14
Web Structure Mining
  • Web pages categorization (Chakrabarti, et al.,
    1998)
  • Discovering micro communities on the web
  • - Examples: Clever system (Chakrabarti, et
    al., 1999), Google (Brin and Page, 1998)
  • Schema Discovery in Semistructured Environment

15
Web Usage Mining
  • Web usage mining is also known as Web log mining
  • It applies mining techniques to discover interesting usage
    patterns from the secondary data derived from the
    interactions of users while surfing the Web

16
Web Usage Mining
  • Applications
  • Target potential customers for electronic
    commerce
  • Enhance the quality and delivery of Internet
    information services to the end user
  • Improve Web server system performance
  • Identify potential prime advertisement locations
  • Facilitate personalization/adaptive sites
  • Improve site design
  • Fraud/intrusion detection
  • Predict users' actions (allows prefetching)

18
Problems with Web Logs
  • Identifying users
  • Clients may have multiple streams
  • Clients may access web from multiple hosts
  • Proxy servers: many clients/one address
  • Proxy servers: one client/many addresses
  • Data not in log
  • POST data (i.e., CGI request) not recorded
  • Cookie data stored elsewhere

19
Cont
  • Missing data
  • Pages may be cached
  • Referring page requires client cooperation
  • When does a session end?
  • Use of forward and backward pointers
  • Typically a 30 minute timeout is used
  • Web content may be dynamic
  • May not be able to reconstruct what the
    user saw
  • Use of spiders and automated agents, which
    request Web pages automatically

20
Cont
  • Like most data mining tasks, web log
    mining requires preprocessing
  • To identify users
  • To match sessions to other data
  • To fill in missing data
  • Essentially, to reconstruct the click stream

21
Log Data - Simple Analysis
  • Statistical analysis of users
  • Length of path
  • Viewing time
  • Number of page views
  • Statistical analysis of site
  • Most common pages viewed
  • Most common invalid URL
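
The site-level statistics listed above can be sketched with simple counters. This is a minimal illustration that assumes log entries have already been parsed into dictionaries; the keys `url` and `status`, the function name, and the sample entries are hypothetical.

```python
from collections import Counter

def simple_site_stats(entries, top_n=3):
    """Return (most viewed pages, most common invalid URL) from parsed entries."""
    # Count successful page requests (status 200) per URL.
    page_views = Counter(e["url"] for e in entries if e["status"] == 200)
    # Count requests that ended in 404 Not Found per URL.
    invalid = Counter(e["url"] for e in entries if e["status"] == 404)
    return page_views.most_common(top_n), invalid.most_common(1)

# Hypothetical parsed log entries.
entries = [
    {"url": "/", "status": 200},
    {"url": "/products/", "status": 200},
    {"url": "/", "status": 200},
    {"url": "/old-page.html", "status": 404},
    {"url": "/old-page.html", "status": 404},
    {"url": "/typo.html", "status": 404},
]
top_pages, top_invalid = simple_site_stats(entries)
```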

22
Web Log Data Mining Applications
  • Association rules
  • Find pages that are often viewed together
  • Clustering
  • Cluster users based on browsing patterns
  • Cluster pages based on content
  • Classification
  • Relate user attributes to patterns

23
Web Logs
  • Web servers have the ability to log all
    requests
  • Web server log formats
  • Most use the Common Log Format (CLF)
  • New, Extended Log Format allows
    configuration of log file
  • Generate vast amounts of data

24
  • Common Log Format
  • remotehost: browser hostname or IP
  • Remote logname: remote log name of the user (almost
    always "-", meaning "unknown")
  • authuser: authenticated username
  • date: date and time of the request
  • "request": the exact request line from the client
  • status: the HTTP status code returned
  • bytes: the content-length of the response

25
Server Logs
26
Fields
  • Client IP: 128.101.228.20
  • Authenticated User ID: - -
  • Time/Date: 10/Nov/1999:10:16:39 -0600
  • Request: "GET / HTTP/1.0"
  • Status: 200
  • Bytes: -
  • Referrer: -
  • Agent: "Mozilla/4.61 [en] (WinNT; I)"
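
A log line carrying the fields above can be split with a regular expression. The sketch below is one possible parser, not a reference implementation: the pattern, the function name `parse_log_line`, and the sample line are assumptions. The sample is assembled from the field values on this slide, with the usual timestamp colons and quoting assumed.

```python
import re

# Common Log Format, optionally followed by quoted referrer and agent fields.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<logname>\S+) (?P<authuser>\S+) '
    r'\[(?P<date>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)

def parse_log_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    # "-" in the bytes field means no size was logged.
    entry["bytes"] = None if entry["bytes"] == "-" else int(entry["bytes"])
    return entry

# Sample line built from the slide's values (punctuation assumed).
sample = ('128.101.228.20 - - [10/Nov/1999:10:16:39 -0600] '
          '"GET / HTTP/1.0" 200 - "-" "Mozilla/4.61 [en] (WinNT; I)"')
entry = parse_log_line(sample)
```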

27
Web Usage Mining
  • Commonly used approaches (Borges and Levene,
    1999)
    - Map the log data into relational tables before
      an adapted data mining technique is performed.
    - Use the log data directly by utilizing special
      pre-processing techniques.
  • Typical problems
    - Distinguishing among unique users, server
      sessions, episodes, etc. in the presence of
      caching and proxy servers (McCallum, et al.,
      2000; Srivastava, et al., 2000).

28
Request
  • Method: GET
  • Other common methods are POST and HEAD
  • URI: /
  • This is the file that is being accessed. When a
    directory is specified, it is up to the server to
    decide what to return. Usually, it will be the
    file named index.html or home.html
  • Protocol: HTTP/1.0

29
Status
  • Status codes are defined by the HTTP
    protocol.
  • Common codes include
  • 200 OK
  • 3xx Some sort of Redirection
  • 4xx Some sort of Client Error
  • 5xx Some sort of Server Error

31
Web Mining Taxonomy
32
Mining the World Wide Web
Web Content Mining
Web Structure Mining
Web Usage Mining
  • Web Page Content Mining
  • Web Page Summarization
  • WebOQL (Mendelzon et al., 1998)
  • Web Structuring query languages
  • Can identify information within given web pages
  • (Etzioni et al., 1997): uses heuristics to
    distinguish personal home pages from other web
    pages
  • ShopBot (Etzioni et al., 1997): looks for product
    prices within web pages

General Access Pattern Tracking
Customized Usage Tracking
Search Result Mining
33
Mining the World Wide Web
Web Content Mining
Web Structure Mining
Web Usage Mining
Web Page Content Mining
  • Search Result Mining
  • Search Engine Result Summarization
  • Clustering Search Result (Leouski and Croft,
    1996, Zamir and Etzioni, 1997)
  • Categorizes documents using phrases in titles and
    snippets

General Access Pattern Tracking
Customized Usage Tracking
34
Mining the World Wide Web
Web Content Mining
Web Usage Mining
  • Web Structure Mining
  • Using Links
  • PageRank (Brin et al., 1998)
  • CLEVER (Chakrabarti et al., 1998)
  • Use interconnections between web pages to give
    weight to pages.
  • Using Generalization
  • MLDB (1994)
  • Uses a multi-level database representation of the
    Web. Counters (popularity) and link lists are
    used for capturing structure.

General Access Pattern Tracking
Search Result Mining
Web Page Content Mining
Customized Usage Tracking
35
Mining the World Wide Web
Web Structure Mining
Web Content Mining
Web Usage Mining
Web Page Content Mining
Customized Usage Tracking
  • General Access Pattern Tracking
  • Web Log Mining (Zaïane, Xin and Han, 1998)
  • Uses KDD techniques to understand general access
    patterns and trends.
  • Can shed light on better structure and grouping
    of resource providers.

Search Result Mining
36
Mining the World Wide Web
Web Usage Mining
Web Structure Mining
Web Content Mining
  • Customized Usage Tracking
  • Adaptive Sites (Perkowitz and Etzioni, 1997)
  • Analyzes access patterns of each user at a time.
  • Web site restructures itself automatically by
    learning from user access patterns.

General Access Pattern Tracking
Web Page Content Mining
Search Result Mining
37
Web Content Mining
  • Agent-based Approaches
  • Intelligent Search Agents
  • Information Filtering/Categorization
  • Personalized Web Agents
  • Database Approaches
  • Multilevel Databases
  • Web Query Systems

38
Intelligent Search Agents
  • Locating documents and services on the Web
  • WebCrawler, Alta Vista (http://www.altavista.com)
    scan millions of Web documents and create index
    of words (too many irrelevant, outdated
    responses)
  • MetaCrawler mines robot-created indices
  • Retrieve product information from a variety of
    vendor sites using only general information about
    the product domain
  • ShopBot

39
Intelligent Search Agents (Contd)
  • Rely either on pre-specified domain information
    about particular types of documents, or on hard
    coded models of the information sources to
    retrieve and interpret documents
  • Harvest
  • FAQ-Finder
  • Information Manifold
  • OCCAM
  • Parasite
  • Learns models of various information sources and
    translates these into its own concept hierarchy
  • ILA (Internet Learning Agent)

40
Information Filtering/Categorization
  • Using various information retrieval techniques
    and characteristics of open hypertext Web
    documents to automatically retrieve, filter, and
    categorize them.
  • HyPursuit uses semantic information embedded in
    link structures and document content to create
    cluster hierarchies of hypertext documents, and
    structure an information space
  • BO (Bookmark Organizer) combines hierarchical
    clustering techniques and user interaction to
    organize a collection of Web documents based on
    conceptual information

41
Personalized Web Agents
  • This category of Web agents learns user
    preferences and discovers Web information sources
    based on these preferences, and those of other
    individuals with similar interests (using
    collaborative filtering)
  • WebWatcher
  • PAINT
  • Syskill & Webert
  • GroupLens
  • Firefly
  • others

42
Multiple Layered Web Architecture
More Generalized Descriptions
Layern
...
Generalized Descriptions
Layer1
Layer0
43
Multilevel Databases
  • At the higher levels, metadata or
    generalizations are:
  • extracted from lower levels
  • organized in structured collections, i.e.,
    relational or object-oriented databases
  • At the lowest level, semi-structured information
    is stored in various Web repositories, such as
    hypertext documents

44
Multilevel Databases (Contd)
  • (Han, et al.)
  • use a multi-layered database where each layer is
    obtained via generalization and transformation
    operations performed on the lower layers
  • (Khosla, et al.)
  • propose the creation and maintenance of
    meta-databases at each information providing
    domain and the use of a global schema for the
    meta-database

45
Multilevel Databases (Contd)
  • (King, et al.)
  • propose the incremental integration of a portion
    of the schema from each information source,
    rather than relying on a global heterogeneous
    database schema
  • The ARANEUS system
  • extracts relevant information from hypertext
    documents and integrates these into higher-level
    derived Web Hypertexts which are generalizations
    of the notion of database views

46
Multi-Layered Database (MLDB)
  • A multiple layered database model
  • based on semi-structured data hypothesis
  • queried by NetQL using a syntax similar to the
    relational language SQL
  • Layer-0
  • An unstructured, massive, primitive, diverse
    global information-base.
  • Layer-1
  • A relatively structured, descriptor-like,
    massive, distributed database by data analysis,
    transformation and generalization techniques.
  • Tools to be developed for descriptor extraction.
  • Higher-layers
  • Further generalization to form progressively
    smaller, better structured, and less remote
    databases for efficient browsing, retrieval, and
    information discovery.

47
Three major components in MLDB
  • S (a database schema)
  • outlines the overall database structure of the
    global MLDB
  • presents a route map for data and meta-data
    (i.e., schema) browsing
  • describes how the generalization is performed
  • H (a set of concept hierarchies)
  • provides a set of concept hierarchies which
    assist the system in generalizing lower-layer
    information to higher layers and in mapping queries
    to appropriate concept layers for processing
  • D (a set of database relations)
  • the whole global information base at the
    primitive information level (i.e., layer-0)
  • the generalized database relations at the
    nonprimitive layers

48
The General Architecture of WebLogMiner (a Global MLDB)
[Figure labels: Site 1, Site 2, Site 3; Concept Hierarchies; Generalized Data; Higher layers; Resource Discovery (MLDB); Knowledge Discovery (WLM); Characteristic Rules, Discriminant Rules, Association Rules]
49
Techniques for Web usage mining
  • Construct multidimensional view on the Weblog
    database
  • Perform multidimensional OLAP analysis to find
    the top N users, top N accessed Web pages, most
    frequently accessed time periods, etc.
  • Perform data mining on Weblog records
  • Find association patterns, sequential patterns,
    and trends of Web accessing
  • May need additional information, e.g., user
    browsing sequences of the Web pages in the Web
    server buffer
  • Conduct studies to
  • Analyze system performance, improve system design
    by Web caching, Web page prefetching, and Web
    page swapping

50
Web Usage Mining - Phases
  • Three distinct phases: preprocessing, pattern
    discovery, and pattern analysis
  • Preprocessing: the process of converting the raw data
    into the data abstraction necessary for
    applying the data mining algorithm
  • Resources: server-side, client-side, proxy
    servers, or database.
  • Raw data: Web usage logs, Web page descriptions,
    Web site topology, user registries, and
    questionnaires.
  • Conversion: content converting, structure
    converting, usage converting

51
  • User: the principal using a client to
    interactively retrieve and render resources or
    resource manifestations.
  • Page view: visual rendering of a Web page in a
    specific client environment at a specific point
    in time.
  • Click stream: a sequential series of page view
    requests.
  • User session: a delimited set of user clicks
    (click stream) across one or more Web servers.
  • Server session (visit): a collection of user
    clicks to a single Web server during a user
    session.
  • Episode: a subset of related user clicks that
    occur within a user session.

52
  • Content Preprocessing: the process of converting
    text, images, scripts and other files into
    forms that can be used by usage mining.
  • Structure Preprocessing: the structure of a
    Website is formed by the hyperlinks between page
    views; structure preprocessing can be done by
    parsing and reformatting this information.
  • Usage Preprocessing: the most difficult task in
    the usage mining process; data cleaning
    techniques are used to eliminate the impact of
    irrelevant items on the analysis result.

53
Pattern Discovery
  • Pattern discovery is the key component of
    Web mining; it draws together algorithms and
    techniques from data mining, machine learning,
    statistics, pattern recognition, and other research
    areas.
  • Separate subsections: statistical analysis,
    association rules, clustering, classification,
    sequential patterns, dependency modeling.

54
  • Statistical Analysis: analysts may perform
    different kinds of descriptive statistical
    analyses based on different variables when
    analyzing the session file. These are powerful
    tools for extracting knowledge about visitors to
    a Web site.

55
  • Association Rules: sets of pages that
    are accessed together with a support value
    exceeding some specified threshold.
  • Clustering: a technique to group together users
    or data items (pages) with similar
    characteristics.
  • It can facilitate the development and execution
    of future marketing strategies.
  • Classification: a technique to map a data item
    into one of several predefined classes, which
    helps to establish a profile of users belonging to
    a particular class or category.

56
Pattern Analysis
  • Pattern Analysis - final stage of the Web usage
    mining.
  • To eliminate the irrelevant rules or patterns and
    to extract the interesting rules or patterns from
    the output of the pattern discovery process.
  • Analysis methodologies and tools: query
    mechanisms like SQL, OLAP, visualization, etc.

58
WUM Pre-Processing
  • Data Cleaning
  • Removes log entries that are not needed for
    the mining process
  • Data Integration
  • Synchronize data from multiple server logs,
    metadata
  • User Identification
  • Associates page references with different
    users
  • Session/Episode Identification
  • Groups users' page references into user
    sessions
  • Page View Identification
  • Path Completion
  • Fills in page references missing due to
    browser and proxy caching
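
As one illustration of the data cleaning step above, requests for embedded resources are often dropped so that only page requests remain. The slides do not say which entries to remove, so the suffix list, the `url` key, and the function name below are assumptions.

```python
# Hypothetical list of suffixes treated as embedded resources, not page views.
IGNORED_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png", ".css", ".js")

def clean_log(entries):
    """Keep only entries whose URL does not look like an embedded resource."""
    kept = []
    for entry in entries:
        # Ignore any query string and compare case-insensitively.
        path = entry["url"].split("?", 1)[0].lower()
        if path.endswith(IGNORED_SUFFIXES):
            continue
        kept.append(entry)
    return kept

# Hypothetical parsed entries.
raw = [
    {"url": "/index.html"},
    {"url": "/images/logo.GIF"},
    {"url": "/style.css?v=2"},
    {"url": "/products/"},
]
cleaned = clean_log(raw)
```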

59
WUM Issues in User Session Identification
  • A single IP address is used by many users
  • Different IP addresses in a single session
  • Missing cache hits in the server logs

[Figure labels: proxy server, different users, Web server; single user, ISP server, Web server]
60
User and Session Identification Issues
  • Distinguish among different users to a site
  • Reconstruct the activities of the users within
    the site
  • Proxy servers and anonymizers
  • Rotating IP addresses connections through ISPs
  • Missing references due to caching
  • Inability of servers to distinguish among
    different visits

61
WUM Solutions
  • Remote Agent
  • A remote agent is implemented as a Java applet
  • It is loaded into the client only once, when the
    first page is accessed
  • The subsequent requests are captured and sent
    back to the server
  • Modified Browser
  • The source code of an existing browser can
    be modified to gather user-specific data at the
    client side
  • Dynamic page rewriting
  • When the user first submits a request, the
    server returns the requested page rewritten to
    include a session-specific ID
  • Each subsequent request will supply this ID to
    the server
  • Heuristics
  • Use a set of assumptions to identify user
    sessions and find the missing cache hits in the
    server log

63
WUM Heuristics
  • The session identification heuristics
  • Timeout: if the time between page requests
    exceeds a certain limit, it is assumed that the
    user is starting a new session
  • IP/Agent: each different agent type for an IP
    address represents a different session
  • Referring page: if the referring page file for a
    request is not part of an open session, it is
    assumed that the request is coming from a
    different session
  • Same IP-Agent/different sessions (Closest):
    assigns the request to the session that is
    closest to the referring page at the time of the
    request
  • Same IP-Agent/different sessions (Recent): in
    the case where multiple sessions are the same
    distance from a page request, assigns the request
    to the session with the most recent referrer
    access in terms of time
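
The first two heuristics above (timeout and IP/agent) can be sketched as follows. This is a simplified illustration that ignores the referrer-based heuristics; the tuple layout of a request, the function name, and the sample data are assumptions, and the 30-minute limit is the typical value mentioned earlier in the deck.

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # seconds; the commonly used 30-minute limit

def sessionize(requests, timeout=SESSION_TIMEOUT):
    """requests: iterable of (ip, agent, timestamp_in_seconds, url) tuples."""
    # IP/Agent heuristic: each (ip, agent) pair is treated as a separate visitor.
    by_visitor = defaultdict(list)
    for ip, agent, timestamp, url in requests:
        by_visitor[(ip, agent)].append((timestamp, url))

    sessions = []
    for visitor, hits in by_visitor.items():
        hits.sort()
        current = [hits[0]]
        for previous, hit in zip(hits, hits[1:]):
            # Timeout heuristic: a long gap between requests starts a new session.
            if hit[0] - previous[0] > timeout:
                sessions.append((visitor, [url for _, url in current]))
                current = []
            current.append(hit)
        sessions.append((visitor, [url for _, url in current]))
    return sessions

# Hypothetical requests: one IP address, two agents, one long pause.
requests = [
    ("192.0.2.1", "AgentA", 0, "/a.html"),
    ("192.0.2.1", "AgentA", 600, "/b.html"),
    ("192.0.2.1", "AgentA", 4000, "/c.html"),
    ("192.0.2.1", "AgentB", 100, "/a.html"),
]
sessions = sessionize(requests)
```

Here the first agent's requests split into two sessions because of the long gap, and the second agent forms a third session.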

64
Cont.
  • The path completion heuristics
  • If the referring page file of a request is not
    the previous page file of that session,
    the user must have accessed a cached page
  • The back button is assumed to have been used to
    return to the cached page
  • Assigns a constant view time to each
    cached page file
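
The path completion heuristic above can be sketched as a walk back through the pages already visited. This is a minimal illustration under stated assumptions: a session is a list of `(url, referrer)` pairs, the function name is hypothetical, and the constant view time for inferred pages is not modeled.

```python
def complete_path(session):
    """session: list of (url, referrer) pairs in request order.

    Returns the list of URLs with inferred back-button revisits inserted.
    """
    completed = []
    stack = []  # pages on the current forward browsing path
    for url, referrer in session:
        mismatch = stack and referrer is not None and referrer != stack[-1]
        if mismatch and referrer in stack:
            # The referrer is an earlier page: assume the back button was used
            # and the intermediate pages were served from the cache.
            while stack[-1] != referrer:
                stack.pop()
                completed.append(stack[-1])
        stack.append(url)
        completed.append(url)
    return completed

# Hypothetical session: the request for D has referrer A, although the
# previous logged page is C.
session = [("A", None), ("B", "A"), ("C", "B"), ("D", "A")]
path = complete_path(session)
```

For this session the sketch inserts the cached revisits to B and A before D.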

70
WUM Association Rule Generation
  • Discovers the correlations between pages that are
    most often referenced together in a single server
    session
  • Provide the information
  • What is the set of pages frequently accessed
    together by Web users?
  • What page will be fetched next?
  • What are paths frequently accessed by Web users?
  • Association rule
  • A => B (Support 60%,
    Confidence 80%)
  • Example
  • 50% of visitors who accessed URLs
    /infor-f.html and labo/infos.html also visited
    situation.html
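
Support and confidence for a rule over server sessions can be computed directly from counts. The sketch below is illustrative only: the session data and function names are hypothetical, and a session is modeled as the set of pages it contains.

```python
def count_sessions(sessions, pages):
    """Number of sessions that contain every page in `pages`."""
    pages = set(pages)
    return sum(1 for session in sessions if pages <= set(session))

def support(sessions, pages):
    """Fraction of all sessions that contain every page in `pages`."""
    return count_sessions(sessions, pages) / len(sessions)

def confidence(sessions, antecedent, consequent):
    """Fraction of sessions containing the antecedent that also contain the consequent."""
    with_antecedent = count_sessions(sessions, antecedent)
    if with_antecedent == 0:
        return 0.0
    both = count_sessions(sessions, set(antecedent) | set(consequent))
    return both / with_antecedent

# Hypothetical server sessions.
sessions = [
    {"/a.html", "/b.html"},
    {"/a.html", "/b.html", "/c.html"},
    {"/a.html"},
    {"/c.html"},
    {"/a.html", "/b.html"},
]
rule_support = support(sessions, {"/a.html", "/b.html"})          # 3 of 5 sessions
rule_confidence = confidence(sessions, {"/a.html"}, {"/b.html"})  # 3 of 4 sessions
```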

71
Associations Correlations
  • Page associations from usage data
  • User sessions
  • User transactions
  • Page associations from content data
  • similarity based on content analysis
  • Page associations based on structure
  • link connectivity between pages
  • => Obtain frequent itemsets

72
Examples
  • 60% of clients who accessed /products/ also
    accessed /products/software/webminer.htm.
  • 30% of clients who accessed /special-offer.html
    placed an online order in /products/software/.
  • (Example from IBM official Olympics Site)
  • Badminton, Diving => Table Tennis (a =
    69.7%, s = 0.35%)

73
WUM Clustering
  • Groups together a set of items having similar
    characteristics
  • User Clusters
  • Discover groups of users exhibiting similar
    browsing patterns
  • Page recommendation
  • A user's partial session is classified into a
    single cluster
  • The links contained in this cluster are
    recommended

74
Cont..
  • Page clusters
  • Discover groups of pages having related content
  • Usage based frequent pages
  • Page recommendation
  • The links are presented based on how often URL
    references occur together across user sessions
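
Usage-based page recommendation from co-occurrence counts can be sketched as below. This illustrates the general idea only, not the method of any specific system; the function names and sample sessions are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(sessions):
    """Count how often each unordered pair of pages appears in the same session."""
    counts = Counter()
    for session in sessions:
        for a, b in combinations(sorted(set(session)), 2):
            counts[(a, b)] += 1
    return counts

def recommend(counts, page, k=3):
    """Return up to k pages that most often co-occur with `page`."""
    scores = Counter()
    for (a, b), n in counts.items():
        if a == page:
            scores[b] += n
        elif b == page:
            scores[a] += n
    return [p for p, _ in scores.most_common(k)]

# Hypothetical user sessions.
sessions = [
    ["/home", "/products", "/order"],
    ["/home", "/products"],
    ["/home", "/faq"],
]
counts = cooccurrence(sessions)
top_link = recommend(counts, "/home", k=1)
```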

75
Website Usage Analysis
  • Why develop a Website usage / utilization
    analysis tool?
  • Knowledge about how visitors use a Website could
  • - Prevent disorientation and help designers
    place important information/functions exactly
    where visitors look for them and in the way users
    need them
  • - Help build an adaptive Website server

76
Clustering and Classification
  • Clients who often access
    /products/software/webminer.html tend to be from
    educational institutions.
  • Clients who placed an online order for software
    tend to be students in the 20-25 age group and
    live in the United States.
  • 75% of clients who download software from
    /products/software/demos/ visit between 7:00
    and 11:00 pm on weekends.

77
Website Usage Analysis
  • Discover user navigation patterns in using a
    Website
    - Establish an aggregated log structure as a
      preprocessor to reduce the search space before
      the actual log mining phase
    - Introduce a model for Website usage pattern
      discovery by extending the classical mining
      model, and establish the processing framework
      of this model

78
Sequential Patterns Clusters
  • 30% of clients who visited /products/software/
    had done a search in Yahoo using the keyword
    "software" before their visit
  • 60% of clients who placed an online order for
    WEBMINER placed another online order for
    software within 15 days

79
Website Usage Analysis
  • The Website client-server architecture facilitates
    recording user behavior at every step by
    - submitting client-side log files to the server
      when users use clear functions or exit a
      window/module
  • The special design of local and universal
    back/forward/clear functions makes users'
    navigation patterns clearer for the designer by
  • - analyzing the local back/forward history and
    incorporating it with the universal back/forward
    history

80
Website Usage Analysis
  • What will be included in SUA
    1. Identify and collect log data
    2. Transfer the data to the server side and save
       them in a structure desired for analysis
    3. Prepare mined data by establishing a
       customized aggregated log tree/frame
    4. Use modifications of the typical data mining
       methods, particularly an extension of a
       traditional sequence discovery algorithm, to
       mine user navigation patterns

81
Website Usage Analysis
  • Problems that need to be considered
  • - How to identify the log data when a user goes
    through an uninteresting function/module
  • - What marks the end of a user session?
  • - Clients connect to the Website through proxy servers
  • Differences between Website usage analysis and common
    Web usage mining
  • - Client-side log files are available
  • - Log file format (Web server log files commonly
    follow the Common Log Format)
  • - Log file cleaning/filtering (usually performed
    during the preprocessing step of Web log mining)
    is not necessary

82
Web Usage Mining - Patterns Discovery Algorithms
  • (Chen et al.) design algorithms for mining path
    traversal patterns: finding maximal forward
    references and large reference sequences.

83
Path Traversal Patterns
  • Procedure for mining traversal patterns
  • (Step 1) Determine maximal forward references
    from the original log data (Algorithm MF)
  • (Step 2) Determine large reference sequences
    (i.e., L_k, k ≥ 1) from the set of maximal forward
    references (Algorithm FS and SS)
  • (Step 3) Determine maximal reference sequences
    from large reference sequences
  • Focus on Step 1 and 2, and devise algorithms for
    the efficient determination of large reference
    sequences
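
Step 1 can be illustrated with a short sketch. The slides do not spell out Algorithm MF, so the code below follows the usual reading of a maximal forward reference: the forward path that has been built up at the moment the user first moves backward. The function name and the example traversal are illustrative assumptions.

```python
def maximal_forward_references(path):
    """Split one traversal path into its maximal forward references."""
    forward = []            # current forward path
    moving_forward = True   # False while the user is backtracking
    result = []
    for page in path:
        if page in forward:
            # Backward reference: emit the forward path once per backtrack run.
            if moving_forward:
                result.append(list(forward))
            # Truncate the path back to the revisited page.
            forward = forward[:forward.index(page) + 1]
            moving_forward = False
        else:
            forward.append(page)
            moving_forward = True
    # A traversal that ends while moving forward yields one last reference.
    if moving_forward and forward:
        result.append(list(forward))
    return result

# Illustrative traversal: letters stand for pages.
traversal = list("ABCDCBEGHGWAOUOV")
references = ["".join(r) for r in maximal_forward_references(traversal)]
```

The resulting forward references are then the input from which large reference sequences are counted in Step 2.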

84
Determine large reference sequences
  • Algorithm FS
  • Utilizes the key ideas of algorithm DHP
  • employs hashing and pruning techniques
  • DHP is very efficient for the generation of
    candidate itemsets, in particular for the large
    two-itemsets, thus greatly relieving the
    performance bottleneck of the whole process
  • Algorithm SS
  • employs hashing and pruning techniques to reduce
    both CPU and I/O costs
  • by properly utilizing the information in
    candidate references in prior passes, is able to
    avoid database scans in some passes, thus further
    reducing the disk I/O cost

85
Patterns Analysis Tools
  • WebViz [pitkwa94]: provides appropriate tools
    and techniques to understand, visualize, and
    interpret access patterns.
  • [dyreua et al]: proposes OLAP techniques such as
    data cubes for the purpose of simplifying the
    analysis of usage statistics from server access
    logs.

86
Patterns Discovery and Analysis Tools
  • The emerging tools for user pattern discovery use
    sophisticated techniques from AI, data mining,
    psychology, and information theory, to mine for
    knowledge from collected data
  • (Pirolli et al.) use information foraging theory
    to combine path traversal patterns, Web page
    typing, and site topology information to
    categorize pages for easier access by users.

87
(Contd)
  • WEBMINER
  • introduces a general architecture for Web usage
    mining, automatically discovering association
    rules and sequential patterns from server access
    logs.
  • proposes an SQL-like query mechanism for querying
    the discovered knowledge in the form of
    association rules and sequential patterns.
  • WebLogMiner
  • Web log is filtered to generate a relational
    database
  • Data mining on web log data cube and web log
    database

88
WEBMINER
  • SQL-like Query
  • A framework for Web mining: the application of
    data mining and knowledge discovery techniques
    (association rules and sequential patterns) to Web
    data
  • Association rules using the Apriori algorithm
  • 40% of clients who accessed the Web page with URL
    /company/products/product1.html also accessed
    /company/products/product2.html
  • Sequential patterns using a modified Apriori
    algorithm
  • 60% of clients who placed an online order in
    /company/products/product1.html also placed an
    online order in /company/products/product4.html
    within 15 days

89
WebLogMiner
  • Database construction from server log file
  • data cleaning
  • data transformation
  • Multi-dimensional web log data cube construction
    and manipulation
  • Data mining on web log data cube and web log
    database

90
Mining the World-Wide Web
  • Design of a Web Log Miner
  • Web log is filtered to generate a relational
    database
  • A data cube is generated from the database
  • OLAP is used to drill-down and roll-up in the
    cube
  • OLAM is used for mining interesting knowledge

[Figure: Web log -> (1) Data Cleaning -> Database -> (2) Data Cube Creation -> Data Cube -> (3) OLAP -> Sliced and diced cube -> (4) Data Mining -> Knowledge]
91
Construction of Data Cubes
(http://db.cs.sfu.ca/sections/publication/slides/slides.html)
[Figure: data cube with three dimensions. Amount: 0-20K, 20-40K, 40-60K, 60K-, sum. Province: B.C., Prairies, Ontario, sum. Discipline: Comp_Method, Database, ..., sum. Highlighted cell: All Amount, Comp_Method, B.C.]
  • Each dimension contains a hierarchy of values
    for one attribute
  • A cube cell stores aggregate values, e.g., count,
    sum, max, etc.
  • A sum cell stores dimension summation values.
  • Sparse-cube technology and MOLAP/ROLAP integration.
  • Chunk-based multi-way aggregation and single-pass
    computation.
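A small sketch of cube cells and "sum" cells, using the three dimensions named on this slide. The records are made up, and counting is used as the aggregate. Enumerating every roll-up per record is a simplification for illustration, not the chunk-based method mentioned above.

```python
from collections import Counter
from itertools import product

ALL = "sum"  # marker for a rolled-up dimension (the "sum" cells)

def build_cube(records, dims):
    """Count records for every combination of concrete value or ALL per dimension."""
    cube = Counter()
    for rec in records:
        choices = [(rec[d], ALL) for d in dims]
        for cell in product(*choices):
            cube[cell] += 1
    return cube

dims = ("amount", "province", "discipline")
records = [  # made-up rows
    {"amount": "0-20K", "province": "B.C.", "discipline": "Comp_Method"},
    {"amount": "0-20K", "province": "B.C.", "discipline": "Database"},
    {"amount": "20-40K", "province": "Ontario", "discipline": "Database"},
]
cube = build_cube(records, dims)

# Roll-up over discipline: all 0-20K records in B.C.
bc_low = cube[("0-20K", "B.C.", ALL)]
```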
92
WebLogMiner Architecture
  • Web log is filtered to generate a relational
    database
  • A data cube is generated from database
  • OLAP is used to drill-down and roll-up in the
    cube
  • OLAM is used for mining interesting knowledge

[Figure: Web log → (1) Data Cleaning → Database → (2) Data Cube Creation → Data Cube → (3) OLAP → Sliced and diced cube → (4) Data Mining → Knowledge]
93
WEBSIFT
94
What is WebSIFT?
  • a Web Usage Mining framework that
  • performs preprocessing
  • performs knowledge discovery
  • uses the structure and content information about
    a Web site to automatically define a belief set.

95
Overview of WebSIFT
  • Based on WEBMINER prototype
  • Divides the Web Usage Mining process into three
    main parts

96
Overview of WebSIFT
  • Input
  • Access
  • Referrer and agent
  • HTML files
  • Optional data (e.g., registration data or remote
    agent logs)

97
Overview of WebSIFT
  • Preprocessing
  • uses input data to construct a user session file
  • site files are used to classify pages of a site
  • Knowledge discovery phase
  • uses existing data mining techniques to generate
    rules and patterns.
  • generation of general usage stats

98
Information Filtering
  • Links between pages provide evidence for
    supporting the belief that those pages are
    related.
  • Strength of evidence for a set of pages being
    related is proportional to the strength of the
    topological connection between the set of pages.
  • Based on site content, can also look at content
    similarity by calculating the distance between
    pages.

99
Information Filtering
100
Information Filtering
  • Uses two different methods to identify
    interesting results from a list of discovered
    frequent itemsets

101
Information Filtering
  • Method 1
  • declare itemsets that contain pages not directly
    connected to be interesting
  • corresponds to a situation where a belief that a
    set of pages is related has no domain or
    existing evidence but there is mined evidence →
    called the Beliefs with Mined Evidence (BME) algorithm

102
Information Filtering
  • Method 2
  • Absence of itemsets → evidence against a belief
    that pages are related.
  • Pages that have individual support above a
    threshold but are not present together in larger
    frequent itemsets → evidence against the pages
    being related.
  • If domain evidence suggests that pages are related,
    the absence of the frequent itemset can be
    considered interesting. This is handled by the
    Beliefs with Contradicting Evidence (BCE) algorithm
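One possible reading of the two filters, sketched on toy data. Here "directly connected" is taken to mean a hyperlink in either direction, and the function names, inputs and support values are assumptions for illustration. This is not WebSIFT's implementation.

```python
from itertools import combinations

def bme(frequent_itemsets, links):
    """Itemsets containing at least one pair of pages with no direct link."""
    return [s for s in frequent_itemsets
            if any(frozenset(p) not in links for p in combinations(sorted(s), 2))]

def bce(page_support, frequent_itemsets, links, min_support):
    """Linked pairs of individually frequent pages that never co-occur in a frequent itemset."""
    result = []
    for pair in sorted(links, key=sorted):
        a, b = sorted(pair)
        individually_frequent = (page_support.get(a, 0) >= min_support
                                 and page_support.get(b, 0) >= min_support)
        together = any(pair <= s for s in frequent_itemsets)
        if individually_frequent and not together:
            result.append((a, b))
    return result

# Toy site: A-B and B-C are linked; A and C are not.
links = {frozenset({"A", "B"}), frozenset({"B", "C"})}
itemsets = [frozenset({"A", "B"}), frozenset({"A", "C"})]
page_support = {"A": 0.3, "B": 0.3, "C": 0.2}

mined_only = bme(itemsets, links)                          # used together, not linked
contradicted = bce(page_support, itemsets, links, 0.1)     # linked, not used together
```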

103
Experimental Evaluation
  • Performed on the web server log of the University of
    Minnesota Department of Computer Science and
    Engineering web site
  • Log spanned eight days in Feb 1999
  • Physical size of log: 19.3 MB
  • 102,838 entries
  • After preprocessing: 43,158 page views (divided
    among 10,609 user sessions)
  • Threshold of 0.1 for support used to generate
    693 frequent itemsets with maximum set size of
    six pages.
  • 178 unique pages represented in all the rules.
  • BCE and BME algos run on frequent itemsets.

104
Experimental Evaluation
105
Experimental Evaluation
106
Future work
  • Filtering frequent itemsets, sequential patterns
    and clusters
  • Incorporate probabilities and fuzzy logic into
    information filter
  • Future work includes path completion
    verification, page usage determination,
    application of the pattern analysis results, etc.

107
Link Analysis
108
Link Analysis
  • Finding patterns in graphs
  • Bibliometrics: finding patterns in citation
    graphs
  • Sociometry: finding patterns in social networks
  • Collaborative Filtering: finding patterns in
    rank(person, item) graph
  • Webometrics: finding patterns in web page links

109
Web Link Analysis
  • Used for
  • ordering documents matching a user query (ranking)
  • deciding what pages to add to a collection
    (crawling)
  • page categorization
  • finding related pages
  • finding duplicated web sites

110
Web as Graph
  • Link graph
  • node for each page
  • directed edge (u,v) if page u contains a
    hyperlink to page v
  • Co-citation graph
  • node for each page
  • undirected edge (u,v) iff there exists a third
    page w linking to both u and v
  • Assumptions
  • link from page A to page B is a recommendation of
    page B by A
  • If A and B are connected by a link, there is a
    higher probability that they are on the same
    topic
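The two graph definitions can be sketched as follows: the link graph is a list of directed (u, v) pairs, and the co-citation graph is derived from it. The edge list is a made-up example.

```python
from itertools import combinations

def cocitation_edges(link_edges):
    """Undirected edges {u, v} such that some third page w links to both u and v."""
    out_links = {}
    for w, target in link_edges:
        out_links.setdefault(w, set()).add(target)
    edges = set()
    for w, targets in out_links.items():
        # w must be a third page, so ignore any self-link of w
        for u, v in combinations(sorted(targets - {w}), 2):
            edges.add(frozenset((u, v)))
    return edges

# Directed link graph: (u, v) means page u contains a hyperlink to page v.
link_edges = [("w", "u"), ("w", "v"), ("x", "v"), ("x", "y")]
cocited = cocitation_edges(link_edges)
```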

111
Web structure mining
  • HITS (Topic distillation)
  • PageRank (Ranking web pages used by Google)
  • Algorithm in Cyber-community

112
HITS Algorithm--Topic Distillation on WWW
113
HITS Method
  • Hyperlink Induced Topic Search
  • Kleinberg, 1998
  • A simple approach by finding hubs and authorities
  • View web as a directed graph
  • Assumption: if document A has hyperlink to
    document B, then the author of document A thinks
    that document B contains valuable information

114
Main Ideas
  • Concerned with the identification of the most
    authoritative, or definitive, Web pages on a
    broad-topic
  • Focused on only one topic
  • Viewing the Web as a graph
  • A purely link structure-based computation,
    ignoring the textual content

115
HITS: Hubs and Authorities
  • Hub: a web page that links to a collection of
    prominent sites on a common topic
  • Authority: a prominent page on a broad topic, i.e.
    a web page pointed to by many hubs
  • Mutually reinforcing relationship: a good authority
    is a page that is pointed to by many good hubs,
    while a good hub is a page that points to many
    good authorities

116
Hub-Authority Relations
117
HITS Two Main Steps
  • A sampling component, which constructs a focused
    collection of several thousand web pages likely
    to be rich in relevant authorities
  • A weight-propagation component, which determines
    numerical estimates of hub and authority weights
    by an iterative procedure
  • As a result, pages with the highest weights are
    returned as hubs and authorities for the search
    topic

118
HITS: Root Set and Base Set
  • Use the query terms to collect a root set (S) of
    pages from an index-based search engine (AltaVista)
  • Expand root set to base set (T) by including all
    pages linked to by pages in root set and all
    pages that link to a page in root set (up to a
    designated size cut-off)
  • Typical base set contains roughly 1000-5000 pages

119
Step 1: Constructing Subgraph
  • 1.1 Creating a root set (S)
  • - Given a query string on a broad topic
  • - Collect the t highest-ranked pages for the
    query from a text-based search engine
  • 1.2 Expanding to a base set (T)
  • - Add the pages pointing to a page in the root set
  • - Add the pages pointed to by a page in the root set
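A minimal sketch of step 1.2, assuming the link structure is available as a list of (source, target) pairs. The per-page cut-off `max_in_links` and its default value are assumptions standing in for the "designated size cut-off".

```python
def expand_to_base_set(root_set, link_edges, max_in_links=50):
    """Root set plus pages it points to and (up to a cut-off) pages pointing into it."""
    base = set(root_set)
    in_links = {}
    for src, dst in link_edges:
        if src in root_set:
            base.add(dst)  # page pointed to by a root page
        if dst in root_set:
            in_links.setdefault(dst, []).append(src)
    for sources in in_links.values():
        base.update(sources[:max_in_links])  # cut-off on pages pointing to a root page
    return base

root = {"r1", "r2"}
edges = [("r1", "a"), ("b", "r1"), ("c", "r1"), ("d", "r1"), ("r2", "r1")]
base = expand_to_base_set(root, edges, max_in_links=2)
```

With the cut-off set to 2, only the first two pages pointing to r1 are added, so d stays outside the base set.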

120
Root Set and Base Set (Contd)
121
Step 2: Computing Hubs and Authorities
  • 2.1 Associating weights
  • - Authority weight $x_p$
  • - Hub weight $y_p$
  • - Set all values to a uniform constant initially
  • 2.2 Updating weights

122
Updating Authority Weight
Example (pages $q_1$, $q_2$, $q_3$ point to page p)
$x_p = y_{q_1} + y_{q_2} + y_{q_3}$
123
Updating Hub Weight
Example (page p points to pages $q_1$, $q_2$, $q_3$)
$y_p = x_{q_1} + x_{q_2} + x_{q_3}$
124
Flowchart
125
Results
  • All x- and y-values converge rapidly, so
    termination of the iteration is guaranteed
  • Convergence can be proved mathematically
  • Pages with the highest x-values are viewed as the
    best authorities, while pages with the highest
    y-values are regarded as the best hubs
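The iteration can be sketched as below, with x as authority weights and y as hub weights. The edge list is a toy example, and the fixed count of 20 iterations and the sum-of-squares normalization are assumptions of this sketch.

```python
import math

def hits(edges, iterations=20):
    """Return (authority, hub) weight dicts for a directed edge list."""
    nodes = {n for e in edges for n in e}
    x = dict.fromkeys(nodes, 1.0)  # authority weights
    y = dict.fromkeys(nodes, 1.0)  # hub weights
    for _ in range(iterations):
        # authority update: sum of hub weights of pages pointing to p
        x = {p: sum(y[q] for q, r in edges if r == p) for p in nodes}
        # hub update: sum of authority weights of pages p points to
        y = {p: sum(x[r] for q, r in edges if q == p) for p in nodes}
        for w in (x, y):  # normalize so the squared weights sum to 1
            norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
            for p in w:
                w[p] /= norm
    return x, y

edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h3", "a1")]
authority, hub = hits(edges)
best_authority = max(authority, key=authority.get)
best_hub = max(hub, key=hub.get)
```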

126
Implementation
  • Search engine: AltaVista
  • Root set: 200 pages
  • Base set: 1000-5000 pages
  • Convergence speed: very rapid, fewer than 20
    iterations
  • Running time: about 30 minutes

127
HITS Advantages
  • Weight computation is an intrinsic feature from
    collection of linked pages
  • Provides a densely linked community of related
    authorities and hubs
  • Pure link-based computation once the root set has
    been assembled, with no further regard to query
    terms
  • Provides surprisingly good search results for a
    wide range of queries

128
Drawbacks
  • Limits on narrow topics
  • Not enough authoritative pages
  • Frequently returns resources for a more general
    topic
  • Adding a few edges can potentially change scores
    considerably
  • Topic drifting
  • Appears when hubs discuss multiple topics
129
Improved Work
  • To improve precision
  • - Combining content with link information
  • - Breaking large hub pages into smaller units
  • - Computing relevance weights for pages
  • To improve speed
  • - Building a Connectivity Server that
    provides linkage information for all pages

130
Web Structure Mining
  • Page-Rank Method
  • CLEVER Method
  • Connectivity-Server Method

131
1. Page-Rank Method
  • Introduced by Brin and Page (1998)
  • Mine hyperlink structure of web to produce
    global importance ranking of every web page
  • Used in Google Search Engine
  • Web search result is returned in the rank order
  • Treats a link like an academic citation
  • Assumption: highly linked pages are more
    important than pages with few links
  • A page has a high rank if the sum of the ranks of
    its back-links is high

132
Page Rank Computation
  • Assume
  • $R(u)$: rank of a web page u
  • $F_u$: set of pages which u points to
  • $B_u$: set of pages that point to u
  • $N_u = |F_u|$: number of links from u
  • $C$: normalization factor
  • $E(u)$: vector over web pages serving as a source
    of rank
  • Page Rank Computation
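The equation is not present in this transcript. With the symbols defined above, the ranking described by Brin and Page is usually written in the following form; treat it as a reconstruction from those definitions rather than the slide's exact notation.

```latex
R(u) = C \sum_{v \in B_u} \frac{R(v)}{N_v} + C\,E(u)
```

Each page v that points to u passes on an equal share $R(v)/N_v$ of its own rank, and $E(u)$ adds rank from the source vector.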

133
Page Rank Implementation
  • Stanford WebBase project: a complete crawling and
    indexing system with a repository of 24
    million web pages at the time (old data)
  • Store each URL as unique integer and each
    hyperlink as integer IDs
  • Remove dangling links by iterative procedures
  • Make initial assignment of the ranks
  • Propagate page ranks in iterative manner
  • Upon convergence, add the dangling links back and
    recompute the rankings
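A compact sketch of the "initial assignment, then iterative propagation" steps. It assumes a uniform source of rank and a damping factor of 0.85, and it spreads the rank of dangling pages evenly instead of removing and re-adding them. All three choices are assumptions of this sketch, not the procedure described on the slide.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it points to."""
    pages = set(links) | {v for targets in links.values() for v in targets}
    n = len(pages)
    rank = dict.fromkeys(pages, 1.0 / n)  # initial assignment of the ranks
    for _ in range(iterations):
        new_rank = dict.fromkeys(pages, (1.0 - damping) / n)
        for u in pages:
            targets = links.get(u, [])
            if targets:
                share = damping * rank[u] / len(targets)
                for v in targets:
                    new_rank[v] += share  # propagate rank along each link
            else:
                # dangling page: spread its rank evenly (assumption, see lead-in)
                for v in pages:
                    new_rank[v] += damping * rank[u] / n
        rank = new_rank
    return rank

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
rank = pagerank(links)
top_page = max(rank, key=rank.get)
```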

134
Page Rank Results
  • Google utilizes a number of factors to rank the
    search results
  • proximity, anchor text, page rank
  • The benefits of Page Rank are greatest for
    underspecified queries; for example, a query for
    "Stanford University" ranked using Page Rank lists
    the university home page first

135
Page Rank Advantages
  • Global ranking of all web pages regardless of
    their content, based solely on their location in
    web graph structure
  • Higher quality search results central,
    important, and authoritative web pages are given
    preference
  • Help find representative pages to display for a
    cluster center
  • Other applications traffic estimation, back-link
    predictor, user navigation, personalized page
    rank
  • Mining the structure of the web graph is very
    useful for various information retrieval tasks

136
CLEVER Method
  • CLientside EigenVector-Enhanced Retrieval
  • Developed by a team of IBM researchers at IBM
    Almaden Research Centre
  • Continued refinements of HITS
  • Ranks pages primarily by measuring links between
    them
  • Basic Principles Authorities, Hubs
  • Good hubs point to good authorities
  • Good authorities are referenced by good hubs

137
Problems Prior to CLEVER
  • Ignoring textual content leads to problems
    caused by some features of the web
  • HITS returns good resources for a more general
    topic when query topics are narrowly focused
  • HITS occasionally drifts when hubs discuss
    multiple topics
  • Pages from a single Web site often take over a
    topic; they often use the same HTML template and
    therefore all point to a single popular site that
    is irrelevant to the query topic

138
CLEVER Solution
  • Replacing the sums of Equation (1) and (2) of
    HITS with weighted sums
  • Assign to each link a non-negative weight
  • Weight depends on the query term and end point
  • Extension 1: Anchor Text
  • uses the text that surrounds hyperlink definitions
    (hrefs) in Web pages, often referred to as anchor
    text
  • boosts the weights of links that occur
    near instances of query terms

139
CLEVER Solution (Contd)
  • Extension 2: Mini Hub Pagelets
  • breaking large hub into smaller units
  • treat contiguous subsets of links as mini-hubs or
    pagelets
  • contiguous sets of links on a hub page are more
    focused on single topic than the entire page

140
CLEVER The Process
  • Starts by collecting a set of pages
  • Gathers all pages linked from the initial set,
    plus any pages linking to them
  • Ranks result by counting links
  • Links have noise, not clear which pages are best
  • Recalculate scores
  • Pages with most links are established as most
    important, their links transmit more weight
  • Repeat the calculation a number of times until
    the scores are refined

141
CLEVER Advantages
  • Used to populate categories of different subjects
    with minimal human assistance
  • Able to leverage links to fill category with best
    pages on web
  • Can be used to compile large taxonomies of topics
    automatically
  • Emerging new directions: hypertext
    classification, focused crawling, mining
    communities

142
Connectivity Server Method
  • Server that provides linkage information for all
    pages indexed by a search engine
  • In its base operation, the server accepts a query
    consisting of a set of one or more URLs and
    returns a list of all pages that point to pages in
    the set (parents) and a list of all pages that are
    pointed to from pages in the set (children)
  • In its base operation, it also provides the
    neighbourhood graph for the query set
  • Acts as underlying infrastructure, supports
    search engine applications

143
What's Connectivity Server (Contd)
Neighborhood Graph
144
CONSERV: Web Structure Mining
  • Finding Authoritative Pages (Search by topic)
  • (pages that are high in quality and relevant to
    the topic)
  • Finding Related Pages (Search by URL)
  • (pages that address the same topic as the original
    page, not necessarily semantically identical)
  • Algorithms include Companion and Cocitation

145
CONSERV Finding Related Page
146
CONSERV Companion Algorithm
  • An extension to HITS algorithm
  • Features
  • Exploit not only links but also their order on a
    page
  • Use link weights to reduce the influence of pages
    that all reside on one host
  • Merge nodes that have a large number of duplicate
    links
  • The base graph is structured to exclude
    grandparent nodes but include nodes that share a
    child

147
Companion Algorithm (Contd)
  • Four steps
  • 1. Build a vicinity graph for u
  • 2. Remove duplicates and near-duplicates in
    graph.
  • 3. Compute link weights based on host to host
    connection
  • 4. Compute a hub score and an authority score for
    each node in the graph, return the top ranked
    authority nodes.

148
Companion Algorithm (Contd): Building the Vicinity Graph
  • Set up parameters: B = no. of parents of u, BF =
    no. of children per parent, F = no. of children of
    u, FB = no. of parents per child
  • Stoplist (pages that are unrelated to most
    queries and have a very high in-degree)
  • Procedure
  • Go Back (B): choose parents (randomly)
  • Back-Forward (BF): choose siblings (nearest)
  • Go Forward (F): choose children (first)
  • Forward-Back (FB): choose siblings (highest
    in-degree)

149
Companion Algorithm (Contd): Remove duplicates
  • Two nodes are near-duplicates if each has more than
    10 links and they have at least 95% of their
    links in common
  • Replace two nodes with a node whose links are the
    union of the links of the two nodes
  • (mirror sites, aliases)
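The near-duplicate test and the merge can be sketched as follows. "At least 95% of their links in common" is read here as each node sharing at least 95% of its own links with the other, which is an assumption. The helper names are hypothetical.

```python
def near_duplicates(links_a, links_b, min_links=10, min_overlap_pct=95):
    """True if both nodes have more than `min_links` links and share enough of them."""
    a, b = set(links_a), set(links_b)
    if len(a) <= min_links or len(b) <= min_links:
        return False
    common = len(a & b)
    # integer arithmetic avoids rounding issues at the 95% boundary
    return (common * 100 >= min_overlap_pct * len(a)
            and common * 100 >= min_overlap_pct * len(b))

def merge(links_a, links_b):
    """Links of the merged node: the union of both link sets."""
    return set(links_a) | set(links_b)

site = {f"p{i}" for i in range(20)}
mirror = (site - {"p0"}) | {"q"}   # 19 of 20 links shared
small = {"p1", "p2", "p3"}

is_mirror = near_duplicates(site, mirror)
merged = merge(site, mirror)
```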

150
Companion Algorithm (Contd): Assign edge (link) weights
  • Links between pages on the same host have weight 0
  • If there are k links from documents on a host to
    a single document on a different host, each link
    has an authority weight of 1/k
  • If there are k links from a single document on a
    host to a set of documents on a different host,
    give each link a hub weight of 1/k
  • (prevent a single host from having too much
    influence on the computation)
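The weighting rules above can be sketched like this, assuming edges are (source URL, target URL) pairs and that the host is the network location of the URL. The example URLs and function names are made up.

```python
from collections import Counter
from urllib.parse import urlsplit

def host(url):
    return urlsplit(url).netloc

def edge_weights(edges):
    """Return {edge: (authority_weight, hub_weight)} for (source_url, target_url) edges."""
    cross = [(s, d) for s, d in edges if host(s) != host(d)]
    into_doc = Counter((host(s), d) for s, d in cross)  # links from one host to one document
    from_doc = Counter((s, host(d)) for s, d in cross)  # links from one document to one host
    weights = {}
    for s, d in edges:
        if host(s) == host(d):
            weights[(s, d)] = (0.0, 0.0)  # same-host link carries no weight
        else:
            weights[(s, d)] = (1.0 / into_doc[(host(s), d)],
                               1.0 / from_doc[(s, host(d))])
    return weights

edges = [
    ("http://a.example/1", "http://b.example/x"),
    ("http://a.example/2", "http://b.example/x"),
    ("http://a.example/1", "http://b.example/y"),
    ("http://a.example/1", "http://a.example/2"),
]
weights = edge_weights(edges)
```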

151
Companion Algorithm (Contd): Compute hub and authority scores
  • Extension of the HITS algorithm with edge weights
  • Initialize all elements of the hub vector H to 1
  • Initialize all elements of the authority vector A
    to 1
  • While the vectors H and A have not converged
  • For all nodes n in the vicinity graph N,
  • $A[n] := \sum_{(n',n) \in \mathrm{edges}(N)} H[n'] \times \mathrm{authority\_weight}(n',n)$
  • For all n in N,
  • $H[n] := \sum_{(n,n') \in \mathrm{edges}(N)} A[n'] \times \mathrm{hub\_weight}(n,n')$
  • Normalize the H and A vectors.
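A sketch of the loop above in runnable form. The input is assumed to be a dict from (source, target) edges to (authority_weight, hub_weight) pairs. The convergence tolerance and iteration cap are assumptions, and the example weights are made up.

```python
import math

def weighted_hits(weights, tolerance=1e-9, max_iterations=100):
    """weights maps (source, target) to (authority_weight, hub_weight)."""
    nodes = {n for edge in weights for n in edge}
    hub = dict.fromkeys(nodes, 1.0)
    auth = dict.fromkeys(nodes, 1.0)
    for _ in range(max_iterations):
        new_auth = dict.fromkeys(nodes, 0.0)
        for (src, dst), (aw, _hw) in weights.items():
            new_auth[dst] += hub[src] * aw       # A[n] from hubs pointing to n
        new_hub = dict.fromkeys(nodes, 0.0)
        for (src, dst), (_aw, hw) in weights.items():
            new_hub[src] += new_auth[dst] * hw   # H[n] from authorities n points to
        for vec in (new_auth, new_hub):          # normalize both vectors
            norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
            for n in vec:
                vec[n] /= norm
        change = max(abs(new_auth[n] - auth[n]) + abs(new_hub[n] - hub[n])
                     for n in nodes)
        auth, hub = new_auth, new_hub
        if change < tolerance:
            break
    return auth, hub

weights = {("h1", "a1"): (1.0, 0.5), ("h1", "a2"): (1.0, 0.5), ("h2", "a1"): (1.0, 1.0)}
auth, hub = weighted_hits(weights)
top_authority = max(auth, key=auth.get)
top_hub = max(hub, key=hub.get)
```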

152
CONSERV Cocitation Algorithm
  • Two nodes are co-cited if they have a common
    parent
  • The number of common parents of two nodes is
    their degree of co-citation
  • Determine the related pages by looking for
    sibling nodes with the highest degree of
    co-citation
  • In some cases there is an insufficient level of
    cocitation to provide meaningful results; then chop
    off elements of the URL and restart the algorithm.
  • e.g. A.com/X/Y/Z → A.com/X/Y
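A minimal sketch of ranking siblings by degree of co-citation, assuming the link graph is a list of (parent, child) pairs. The URL-chopping fallback is not included, and the node names are made up.

```python
from collections import Counter

def related_by_cocitation(url, link_edges, top_k=3):
    """Rank siblings of `url` by the number of parents they share with it."""
    parents = {src for src, dst in link_edges if dst == url}
    degree = Counter()
    for src, dst in set(link_edges):  # de-duplicate repeated links
        if src in parents and dst != url:
            degree[dst] += 1
    return degree.most_common(top_k)

link_edges = [
    ("p1", "u"), ("p1", "s1"),
    ("p2", "u"), ("p2", "s1"), ("p2", "s2"),
    ("p3", "s2"),
]
related = related_by_cocitation("u", link_edges)
```

Here s1 shares two parents with u and s2 shares one; p3 does not count because it is not a parent of u.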

153
Comparative Study
  • Page Rank
  • (Google)
  • Assigns initial ranking and retains them
    independently from queries (fast)
  • In the forward direction from link to link
  • Qualitative result
  • Hub/Authority (CLEVER, C-Server)
  • Assembles different root set and prioritizes
    pages in the context of query
  • Looks forward and backward direction
  • Qualitative result

154
Connectivity-Based Ranking
  • Query-independent: gives an intrinsic quality
    score to a page
  • Approach 1: the larger the number of hyperlinks
    pointing to a page, the better the page
  • drawback?
  • each link is equally important
  • Approach 2: weight each hyperlink proportionally
    to the quality of the page containing the
    hyperlink

155
Query-dependent Connectivity-Based Ranking
  • Carrière and Kazman
  • For each query, build a subgraph of the link
    graph G limited to pages on query topic
  • Build the neighborhood graph
  • A start set S of documents matching query given
    by search engine (200)
  • Set augmented by its neighborhood, the set of
    documents that either point to or are pointed to
    by documents in S (limit to 50)
  • Then rank based on indegree

156
Idea
  • We desire pages that are relevant (in the
    neighborhood graph) and authoritative
  • As in PageRank, consider not only the in-degree of
    a page p but also the quality of the pages that
    point to p. If more important pages point to p,
    then p is more authoritative
  • Key idea: good hub pages have links to good
    authority pages
  • Given a user query, compute a hub score and an
    authority score for each document
  • high authority score ⇒ relevant content
  • high hub score ⇒ links to documents with
    relevant content

157
Impr