Overview of Web Mining - PowerPoint PPT Presentation

Title: Overview of Web Mining


1
Overview of Web Mining
  • 2000. 3.
  • Doheon Lee
  • School of Computer and Information
  • Chonnam National University
  • mailto: dhlee_at_chonnam.ac.kr

2
Table of Contents
  • What is Web Mining?
  • Resource Discovery
  • Information Extraction
  • Categorization
  • Clustering
  • Web Usage Mining
  • Case Studies (IBM and Semio)
  • Concluding Remarks

3
What is Web Mining?
  • Web mining can be defined as the automated
    discovery of useful information from World
    Wide Web documents (and services).

  • Web mining tasks: Resource Discovery, Information
    Extraction, Categorization, Clustering, Web Usage
    Mining
  • Cf. Web Content Mining vs. Web Usage Mining

4
Resource Discovery
  • Search engine
  • Automatic creation of searchable indices of Web
    documents
  • Lycos, WebCrawler, Alta Vista, ALIWEB, etc.
  • Meta search engine
  • It posts keyword queries to multiple searchable
    indices in parallel; it then collates and prunes
    the responses returned, aiming to provide users
    with a manageable amount of high-quality
    information.
  • MetaCrawler
  • Automatic text categorization technology
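
The collate-and-prune step of a meta search engine can be sketched as below. This is a minimal sketch and not MetaCrawler's actual method: the engine names, the result lists, and the reciprocal-rank scoring are all illustrative assumptions, and the parallel posting of queries is not modelled.

```python
from collections import defaultdict

def collate(results_by_engine, top_k=3):
    """Merge ranked URL lists from several engines and keep the top_k.

    Scoring is an assumed heuristic: each engine contributes 1/rank
    for a URL, so URLs returned by several engines rise to the top.
    """
    scores = defaultdict(float)
    for ranked_urls in results_by_engine.values():
        seen = set()
        for rank, url in enumerate(ranked_urls, start=1):
            if url in seen:          # ignore duplicates within one engine
                continue
            seen.add(url)
            scores[url] += 1.0 / rank
    # Sort by descending score, then by URL for a deterministic order.
    ordered = sorted(scores, key=lambda u: (-scores[u], u))
    return ordered[:top_k]           # pruning step

# Hypothetical responses from three searchable indices.
responses = {
    "engine_a": ["u1", "u2", "u3"],
    "engine_b": ["u2", "u1", "u4"],
    "engine_c": ["u2", "u5"],
}
merged = collate(responses)
```

Here `u2` comes first because all three hypothetical engines return it near the top; low-scoring URLs are pruned by the `top_k` cut.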

5
Resource Discovery (Contd)
  • Personalized Web Agents
  • Web agents learn user preferences and discover
    Web information sources based on these
    preferences, and those of other individuals with
    similar interests
  • WebWatcher, PAINT, Syskill & Webert, GroupLens,
    Firefly, etc.
  • Web Query Systems
  • W3QL: It combines structure queries based on the
    organization of hypertext documents, and content
    queries based on information retrieval techniques
  • WebLog: Logic-based query language
  • Lorel, UnQL: Query languages based on a labeled
    graph data model
  • TSIMMIS: It generates an integrated database
    representation from Web information.

6
Information Extraction
  • From Web documents
  • Harvest: It knows how to find author and title
    information in LaTeX documents, and how to strip
    position information from PostScript files
  • FAQ-Finder: The user poses a question in natural
    language and the text of the question is used to
    search the FAQ files for a matching question
  • From Web services
  • Internet Learning Agent (ILA): It extracts
    information such as phone numbers and e-mail
    addresses from the Internet server Whois and from
    the personnel directories of a dozen universities
  • ShopBot: It takes as input the address of a
    store's home page as well as knowledge about a
    product domain, and learns how to shop at the
    store.

7
ShopBot
  • Domain-independent comparison-shopping agent
  • It autonomously learns how to shop at different
    vendors.
  • It does not use full-fledged NLP; rather, it uses
    heuristic search, pattern matching, and inductive
    learning.
  • Phase 1: Learning phase
  • Starting from the root page of a store, it finds
    forms for searchable indices.
  • For each form, it applies test queries, and
    constructs vendor descriptions.
  • To analyze query result pages, it applies
    heuristic rules.
  • Phase 2: Shopping phase
  • Based on the vendor descriptions, it extracts
    product descriptions such as prices.

8
Categorization
  • Conventional text categorization
  • Support Vector Machines (SVM)
  • k-Nearest Neighbor Classifier
  • Neural Network Approaches
  • Linear Least Square Fit (LLSF) Mapping
  • Naïve Bayes Classifier
  • Limitations on applying to web categorization
  • Diverse Vocabulary
  • Hyperlinks
  • (Intra) Structural Characteristics
  • Cf. 87% accuracy on the Reuters data set is
    reduced to 32% accuracy on a Yahoo! document set.

9
Clustering
  • Grouping Web documents based on their semantic
    relationships (e.g. HyPursuit at MIT)
  • An algorithm starts with a set where each
    original document represents an independent
    cluster.
  • It iteratively reduces the number of clusters by
    merging the two most relevant clusters.
  • It uses pair-wise evaluation of component
    clusters to compute the relevance of two compound
    clusters: the relevance of the compound clusters
    is the minimal relevance between any of these
    pairs.
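
The merging procedure above can be sketched as follows. This is a minimal sketch under stated assumptions, not the HyPursuit implementation: the toy documents, the `common_terms` relevance function, and the stopping criterion (a target number of clusters) are all illustrative.

```python
from itertools import combinations

def cluster_relevance(c1, c2, rel):
    """Relevance of two clusters: the minimum over all cross-cluster pairs."""
    return min(rel(a, b) for a in c1 for b in c2)

def agglomerate(docs, rel, target):
    """Start with singleton clusters; merge the two most relevant
    clusters until only `target` clusters remain."""
    clusters = [frozenset([d]) for d in docs]
    while len(clusters) > target:
        c1, c2 = max(
            combinations(clusters, 2),
            key=lambda pair: cluster_relevance(pair[0], pair[1], rel),
        )
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return clusters

# Hypothetical documents represented as term sets.
docs = {
    "d1": {"web", "mining", "log"},
    "d2": {"web", "mining", "usage"},
    "d3": {"shop", "price", "agent"},
    "d4": {"shop", "price", "vendor"},
}

def common_terms(a, b):
    """Assumed relevance measure: number of terms two documents share."""
    return len(docs[a] & docs[b])

result = agglomerate(list(docs), common_terms, target=2)
```

With these toy inputs the two Web mining documents end up in one cluster and the two shopping documents in the other. Each pass re-evaluates every pair of clusters, which is simple to read but costly for large collections.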

10
Clustering (Contd)
  • Relevance between two documents
  • Content-Based
  • The number of common terms
  • Term frequency
  • Document size factor
  • Document frequency (hard to compute)
  • Link Structure-Based
  • The number of common ancestors
  • The number of common descendants
  • The number of direct paths between two documents
  • Cf. Shortest path between two documents
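
Two of the link structure-based measures above, common ancestors and common descendants, can be computed on a hyperlink graph as sketched below. The graph, the page names, and the choice to count transitive (not only direct) ancestors and descendants are assumptions for illustration; the slides do not say how the counts are combined into a single relevance score.

```python
def reachable(graph, start):
    """All nodes reachable from `start` by following edges (start excluded)."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    seen.discard(start)  # a cycle may lead back to the start page
    return seen

def reverse(graph):
    """Build the graph with every hyperlink direction flipped."""
    rev = {}
    for src, targets in graph.items():
        for dst in targets:
            rev.setdefault(dst, []).append(src)
    return rev

def link_relevance(graph, a, b):
    """Return (number of common ancestors, number of common descendants)."""
    rev = reverse(graph)
    common_desc = reachable(graph, a) & reachable(graph, b)
    common_anc = reachable(rev, a) & reachable(rev, b)
    return len(common_anc), len(common_desc)

# Hypothetical site: page -> pages it links to.
links = {
    "home": ["a", "b"],
    "a": ["c"],
    "b": ["c", "d"],
}
anc, desc = link_relevance(links, "a", "b")
```

In this toy graph, pages `a` and `b` share one ancestor (`home`) and one descendant (`c`).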

11
Web Usage Mining
  • Analysis of Web access logs, referral logs, and
    user profiles to obtain Web usage information.
  • Preprocessing
  • Data cleaning, user identification, actual path
    identification, transaction identification,
    session identification
  • Local caches and proxy servers make these steps
    difficult.
  • Pattern discovery
  • Association rules, sequential patterns,
    classification rules, clustering analysis
  • Analysis of discovered patterns
  • Visualization (WebViz), OLAP, query
    language (WEBMINER)
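
Session identification, one of the preprocessing steps above, is often approximated with an inactivity timeout. The sketch below assumes a 30-minute gap, identifies users by IP address, and uses a simplified log format of (user, timestamp, URL) tuples; all three are assumptions, and identifying users by IP address is exactly what local caches and proxy servers make unreliable.

```python
from collections import defaultdict

TIMEOUT = 30 * 60  # assumed inactivity gap, in seconds

def sessionize(log, timeout=TIMEOUT):
    """Split each user's requests into sessions at gaps longer than timeout.

    log: iterable of (user, unix_timestamp, url) tuples.
    Returns a list of (user, [urls in visit order]) pairs.
    """
    by_user = defaultdict(list)
    for user, ts, url in log:
        by_user[user].append((ts, url))

    sessions = []
    for user, hits in by_user.items():
        hits.sort()                      # order requests by time
        current = [hits[0][1]]
        for (prev_ts, _), (ts, url) in zip(hits, hits[1:]):
            if ts - prev_ts > timeout:   # long pause: close the session
                sessions.append((user, current))
                current = []
            current.append(url)
        sessions.append((user, current))
    return sessions

# Hypothetical, already-cleaned access log entries.
log = [
    ("10.0.0.1", 0, "/company/products"),
    ("10.0.0.1", 120, "/company/product1"),
    ("10.0.0.2", 60, "/company/special"),
    ("10.0.0.1", 4000, "/company/product2"),
]
sessions = sessionize(log)
```

The first client's request at second 4000 follows a pause of more than 30 minutes, so it starts a new session.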

12
Patterns in Web Usage
  • Association rules
  • 40% of clients who accessed the Web page with URL
    /company/product1 also accessed
    /company/product2.
  • 30% of clients who accessed /company/special
    placed an online order in /company/product1.
  • Sequential patterns
  • 30% of clients who visited /company/products had
    done a search in Yahoo within the past week on
    keyword w.
  • 60% of clients who placed an online order in
    /company/product1 also placed an online order in
    /company/product4 within 15 days.
  • Classification rules
  • Clients from state or government agencies who
    visit the site tend to be interested in the page
    /company/product1.
  • 50% of clients who placed an online order in
    /company/product2 were in the 20-25 age group
    and lived on the West Coast.
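
The percentage in an association rule such as the first one above is the rule's confidence: among the sessions that contain the first page, the fraction that also contain the second. A minimal sketch follows; the session data is invented so that the result matches the 40% example, and the function name is hypothetical.

```python
def confidence(sessions, antecedent, consequent):
    """Fraction of sessions containing `antecedent` that also contain `consequent`."""
    with_antecedent = [s for s in sessions if antecedent in s]
    if not with_antecedent:
        return 0.0
    hits = sum(consequent in s for s in with_antecedent)
    return hits / len(with_antecedent)

# Hypothetical sessions, each the set of pages one client accessed.
sessions = [
    {"/company/product1", "/company/product2"},
    {"/company/product1", "/company/product2", "/company/special"},
    {"/company/product1"},
    {"/company/product1", "/company/special"},
    {"/company/product1"},
    {"/company/special"},
]
conf = confidence(sessions, "/company/product1", "/company/product2")
```

Five sessions contain /company/product1 and two of them also contain /company/product2, which gives a confidence of 0.4. A full association-rule miner would also apply a minimum support threshold, which is omitted here.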

13
A General Architecture for Web Usage Mining
  • From R. Cooley et al., "Web Mining: Information
    and Pattern Discovery on the World Wide Web,"
    ICTAI '97

14
IBM Intelligent Miner for Text
  • Extract key information from text
  • Language identification based on a set of
    training documents in the languages
  • Feature extraction based on Information
    Quotient (IQ)
  • Names of people, organizations, places
  • Linguistically motivated heuristics that exploit
    typography and other regularities of languages
  • Multiword terms
  • Heuristics, which are based on a dictionary
    containing part-of-speech information for English
    words, apply simple pattern matching to find
    expressions that have a noun-phrase structure.
  • Abbreviations
  • Dates, currency amounts
  • Organize documents by subject
  • Hierarchical clustering based on lexical
    affinities
  • Cf. Overlap of single words vs. semantic analysis
  • Find the predominant themes in a collection of
    documents
  • Search for relevant documents using flexible
    queries
  • Boolean queries with wild cards, free text
    queries, hybrid queries

15
Semio's Automatic Taxonomy Building
  • Three groups of layers in Semio Taxonomy
  • Ontology: The highest levels of the directory.
    These levels are primarily containers for other
    categories, not for specific documents. The
    topmost level is provided by the directory owner,
    while subsequent levels are provided from the
    Semio Topic Library.
  • Taxonomy: Semio Builder automatically generates
    two levels of taxonomy structure using patented
    techniques based on computational semiotics.
  • Thesaurus: It contains "related to" links between
    concepts in the collection. Semio Builder
    automatically generates the "related to" links.

16
Concluding Remarks
  • Diverse types of Web Mining targets
  • Data preparation for Web Mining
  • Parallel and scalable Web Mining solutions
  • Capturing common operators