Overview of Web Mining - PowerPoint PPT Presentation

About This Presentation

Slides: 17

Provided by: csSung

Transcript and Presenter's Notes

Title: Overview of Web Mining

1
Overview of Web Mining

2000. 3.
Doheon Lee
School of Computer and Information
Chonnam National University
E-mail: dhlee_at_chonnam.ac.kr

2
Table of Contents

What is Web Mining?
Resource Discovery
Information Extraction
Categorization
Clustering
Web Usage Mining
Case Studies (IBM and Semio)
Concluding Remarks

3
What is Web Mining?

Web mining can be defined as the automated
discovery of useful information from World
Wide Web documents (and services).

Web Resource Discovery, Information
Extraction, Categorization, Clustering, Web Usage
Mining
Cf. Web Content Mining vs. Web Usage Mining

4
Resource Discovery

Search engine
Automatic creation of searchable indices of Web
documents
Lycos, WebCrawler, Alta Vista, ALIWEB, etc.
Meta search engine
It posts keyword queries to multiple searchable
indices in parallel; it then collates and prunes
the responses returned, aiming to provide users
with a manageable amount of high-quality
information
MetaCrawler
Automatic text categorization technology
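The meta search engine behaviour described above (query several indices in parallel, then collate and prune) can be sketched roughly as follows. This is only an illustration: the two in-memory indices, the example URLs, the scores, and the function names are all hypothetical stand-ins, not how MetaCrawler is actually implemented.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for remote searchable indices: keyword -> (url, score).
# A real meta search engine would send the query to remote services instead.
INDEX_A = {"mining": [("http://a.example/1", 0.9), ("http://shared.example/x", 0.4)]}
INDEX_B = {"mining": [("http://shared.example/x", 0.8), ("http://b.example/2", 0.2)]}


def query_index(index, keyword):
    """Look up one keyword in one (in-memory) index."""
    return index.get(keyword, [])


def meta_search(keyword, indices, top_k=2):
    """Query all indices in parallel, collate duplicates, prune to top_k."""
    with ThreadPoolExecutor(max_workers=len(indices)) as pool:
        responses = list(pool.map(lambda idx: query_index(idx, keyword), indices))

    # Collate: keep the best score seen for each URL.
    best = {}
    for response in responses:
        for url, score in response:
            best[url] = max(score, best.get(url, 0.0))

    # Prune: return only the top_k highest-scoring URLs.
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


results = meta_search("mining", [INDEX_A, INDEX_B])
```

Keeping the maximum score per URL is just one possible collation rule; summing or averaging scores across indices would be equally plausible choices.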

5
Resource Discovery (Cont'd)

Personalized Web Agents
Web agents learn user preferences and discover
Web information sources based on these
preferences, and those of other individuals with
similar interests
WebWatcher, PAINT, Syskill & Webert, GroupLens,
Firefly, etc.
Web Query Systems
W3QL: It combines structure queries based on the
organization of hypertext documents, and content
queries based on information retrieval techniques
WebLog: Logic-based query language
Lorel, UnQL: Query languages based on a labeled
graph data model
TSIMMIS: It generates an integrated database
representation from Web information.

6
Information Extraction

From Web documents
Harvest: It knows how to find author and title
information in LaTeX documents, and how to strip
position information from PostScript files
FAQ-Finder: The user poses a question in natural
language, and the text of the question is used to
search the FAQ files for a matching question
From Web services
Internet Learning Agent (ILA): It extracts
information such as phone numbers and e-mail
addresses from the Internet server Whois and from
the personnel directories of a dozen universities
ShopBot: It takes as input the address of a
store's home page as well as knowledge about a
product domain, and learns how to shop at the
store.

7
ShopBot

Domain-independent comparison-shopping agent
It autonomously learns how to shop at different
vendors.
It does not use full-fledged NLP; rather, it uses
heuristic search, pattern matching, and inductive
learning.
Phase 1: Learning phase
Starting from the root page of a store, it finds
forms for searchable indices.
For each form, it applies test queries, and
constructs vendor descriptions.
To analyze query result pages, it applies
heuristic rules.
Phase 2: Shopping phase
Based on the vendor descriptions, it extracts
product descriptions such as prices.
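A toy sketch of the two phases, assuming a "vendor description" is simply the line pattern that best fits a test query's result page. The candidate patterns, page texts, and function names are hypothetical, and the real ShopBot uses far richer heuristics and inductive learning than this.

```python
import re

# Hypothetical candidate line formats a vendor's result page might use.
CANDIDATE_PATTERNS = [
    re.compile(r"^(?P<name>.+?)\s+-\s+\$(?P<price>\d+\.\d{2})$"),
    re.compile(r"^(?P<name>.+?)\s+\|\s+USD\s+(?P<price>\d+\.\d{2})$"),
]


def learn_vendor_description(test_result_page):
    """Learning phase: pick the pattern that matches the most result lines."""
    lines = test_result_page.splitlines()
    return max(
        CANDIDATE_PATTERNS,
        key=lambda pattern: sum(1 for line in lines if pattern.match(line)),
    )


def shop(vendor_description, result_page):
    """Shopping phase: extract (name, price) pairs using the learned pattern."""
    products = []
    for line in result_page.splitlines():
        match = vendor_description.match(line)
        if match:
            products.append((match.group("name"), float(match.group("price"))))
    return products


# Learning phase on the result page of a test query.
test_page = "Results for 'modem'\nFast Modem | USD 49.99\nSlow Modem | USD 19.50"
description = learn_vendor_description(test_page)

# Shopping phase on the result page of a later, real query.
offers = shop(description, "Results for 'mouse'\nOptical Mouse | USD 12.00")
```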

8
Categorization

Conventional text categorization
Support Vector Machines (SVM)
k-Nearest Neighbor Classifier
Neural Network Approaches
Linear Least Square Fit (LLSF) Mapping
Naïve Bayes Classifier
Limitations on applying to web categorization
Diverse Vocabulary
Hyperlinks
(Intra) Structural Characteristics
Cf. 87% accuracy on the Reuters data set is
reduced to 32% accuracy on a Yahoo! document set.
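As an illustration of one of the conventional classifiers listed above, here is a minimal multinomial Naive Bayes sketch with Laplace smoothing. The training sentences, labels, and function names are made up for this example, and whitespace tokenization is an assumption.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_docs:
        words = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary


def classify(text, model):
    """Return the class with the highest log-posterior (Laplace smoothing)."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in class_counts.items():
        score = math.log(doc_count / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training data for illustration only.
training = [
    ("stock market shares rise", "finance"),
    ("bank profit and shares", "finance"),
    ("team wins the match", "sports"),
    ("player scores in final match", "sports"),
]
model = train_naive_bayes(training)
label = classify("shares of the bank", model)
```

A bag-of-words model like this uses neither hyperlinks nor document structure, which is one way to read the limitations listed above.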

9
Clustering

Grouping Web documents based on their semantic
relationships (e.g. HyPursuit at MIT)
An algorithm starts with a set where each
original document represents an independent
cluster.
It iteratively reduces the number of clusters by
merging the two most relevant clusters.
It uses pair-wise evaluation of component
clusters to compute the relevance of two compound
clusters: the relevance of the compound clusters
is the minimal relevance between any of these
pairs.
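The merging procedure above can be sketched as follows. The document relevance measure used here (Jaccard overlap of term sets) is an assumption chosen for brevity, not HyPursuit's actual measure, and the sample documents are invented.

```python
from itertools import combinations


def document_relevance(doc_a, doc_b):
    """Assumed relevance measure: Jaccard overlap of the two term sets."""
    terms_a, terms_b = set(doc_a.split()), set(doc_b.split())
    return len(terms_a & terms_b) / len(terms_a | terms_b)


def cluster_relevance(cluster_a, cluster_b, docs):
    """Relevance of two compound clusters: the minimum over all document pairs."""
    return min(
        document_relevance(docs[i], docs[j]) for i in cluster_a for j in cluster_b
    )


def agglomerate(docs, target_clusters):
    """Start with one cluster per document and repeatedly merge the best pair."""
    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > target_clusters:
        # Find the two clusters with the highest relevance to each other.
        a, b = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: cluster_relevance(clusters[p[0]], clusters[p[1]], docs),
        )
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters


# Invented, non-empty sample documents; clusters hold document indices.
docs = [
    "web mining usage logs",
    "web usage logs analysis",
    "neural network classifier",
    "naive bayes classifier network",
]
clusters = agglomerate(docs, target_clusters=2)
```

Taking the minimum over all pairs means two clusters are only merged when even their least similar members are relevant to each other.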

10
Clustering (Cont'd)

Relevance between two documents
Content-Based
The number of common terms
Term frequency
Document size factor
Document frequency (hard to compute)
Link Structure-Based
The number of common ancestors
The number of common descendants
The number of direct paths between two documents
Cf. Shortest path between two documents
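Two of the link structure-based measures (common ancestors and common descendants) can be computed on a small link graph as sketched below. The graph, page names, and function names are hypothetical, and "ancestor" is taken here to mean any page from which the given page can be reached.

```python
def reachable(graph, start):
    """Pages reachable from `start` by following edges (excluding `start`)."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    seen.discard(start)  # a cycle may lead back to the start page
    return seen


def reverse(graph):
    """Build the reversed graph: target page -> pages linking to it."""
    reversed_graph = {}
    for source, targets in graph.items():
        for target in targets:
            reversed_graph.setdefault(target, []).append(source)
    return reversed_graph


def link_relevance(graph, page_a, page_b):
    """Return (common ancestors, common descendants) counts for two pages."""
    back = reverse(graph)
    common_ancestors = reachable(back, page_a) & reachable(back, page_b)
    common_descendants = reachable(graph, page_a) & reachable(graph, page_b)
    return len(common_ancestors), len(common_descendants)


# Hypothetical site structure: page -> pages it links to.
LINKS = {
    "home": ["products", "about"],
    "products": ["p1", "p2"],
    "about": ["contact"],
    "p1": ["order"],
    "p2": ["order"],
}
counts = link_relevance(LINKS, "p1", "p2")
```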

11
Web Usage Mining

Analysis of Web access log, referral log, user
profiles to obtain Web usage information.
Preprocessing
Data cleaning, user identification, actual path
identification, transaction identification,
session identification
Local caches and proxy servers make these steps
difficult.
Pattern discovery
Association rules, sequential patterns,
classification rules, clustering analysis
Analysis of discovered patterns
Visualization (WebViz), OLAP, query
language (WEBMINER)
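A minimal sketch of the preprocessing step, covering data cleaning and session identification only. Identifying users by client address, the 30-minute timeout, and the list of ignored file suffixes are all assumptions made for this example; as noted above, caches and proxy servers make real logs much harder to handle.

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # seconds; an assumed threshold, not from the slides
IGNORED_SUFFIXES = (".gif", ".jpg", ".css")  # assumed "noise" requests


def sessionize(log_entries):
    """Clean the log, group requests by client, and split them into sessions."""
    by_client = defaultdict(list)
    for client, timestamp, url in log_entries:
        if url.endswith(IGNORED_SUFFIXES):
            continue  # data cleaning: skip embedded resources
        by_client[client].append((timestamp, url))

    sessions = []
    for client, requests in by_client.items():
        requests.sort()  # order each client's requests by time
        current, last_time = [], None
        for timestamp, url in requests:
            # A long gap between requests starts a new session.
            if last_time is not None and timestamp - last_time > SESSION_TIMEOUT:
                sessions.append((client, current))
                current = []
            current.append(url)
            last_time = timestamp
        if current:
            sessions.append((client, current))
    return sessions


# Hypothetical access log: (client address, time in seconds, requested URL).
LOG = [
    ("10.0.0.1", 0, "/index.html"),
    ("10.0.0.1", 5, "/logo.gif"),
    ("10.0.0.1", 60, "/company/product1"),
    ("10.0.0.2", 100, "/index.html"),
    ("10.0.0.1", 4000, "/company/product2"),
]
sessions = sessionize(LOG)
```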

12
Patterns in Web Usage

Association rules
40% of clients who accessed the Web page with URL
/company/product1 also accessed
/company/product2.
30% of clients who accessed /company/special
placed an online order in /company/product1.
Sequential patterns
30% of clients who visited /company/products had
done a search in Yahoo within the past week on
keyword w.
60% of clients who placed an online order in
/company/product1 also placed an online order in
/company/product4 within 15 days.
Classification rules
Clients from state or government agencies who
visit the site tend to be interested in the page
/company/product1.
50% of clients who placed an online order in
/company/product2 were in the 20-25 age group
and lived on the West Coast.
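The percentage in an association rule such as the first one above is the rule's confidence, which can be computed from per-client page sets as sketched here. The session data is invented so that the first rule comes out at 40%; the function names are hypothetical.

```python
def rule_confidence(sessions, antecedent, consequent):
    """Fraction of sessions containing `antecedent` that also contain `consequent`."""
    with_antecedent = [s for s in sessions if antecedent in s]
    if not with_antecedent:
        return 0.0
    both = sum(1 for s in with_antecedent if consequent in s)
    return both / len(with_antecedent)


def rule_support(sessions, antecedent, consequent):
    """Fraction of all sessions that contain both pages."""
    if not sessions:
        return 0.0
    both = sum(1 for s in sessions if antecedent in s and consequent in s)
    return both / len(sessions)


# Invented data: the set of pages each client accessed.
SESSIONS = [
    {"/company/product1", "/company/product2"},
    {"/company/product1", "/company/special"},
    {"/company/product1", "/company/product2", "/company/special"},
    {"/company/product1"},
    {"/company/product1", "/index"},
    {"/company/product2"},
]
confidence = rule_confidence(SESSIONS, "/company/product1", "/company/product2")
support = rule_support(SESSIONS, "/company/product1", "/company/product2")
```

Here 2 of the 5 clients who accessed /company/product1 also accessed /company/product2, giving a confidence of 0.4.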

13
A General Architecture for Web Usage Mining

From R. Cooley et al., "Web Mining: Information
and Pattern Discovery on the World Wide Web,"
ICTAI '97

14
IBM Intelligent Miner for Text

Extract key information from text
Language identification based on a set of
training documents in the languages
Feature extraction based on Information
Quotient (IQ)
Names of people, organizations, places
Linguistically motivated heuristics that exploit
typography and other regularities of languages
Multiword terms
Heuristics, which are based on a dictionary
containing part-of-speech information for English
words, use simple pattern matching to find
expressions that have noun phrase structures.
Abbreviations
Dates, currency amounts
Organize documents by subject
Hierarchical clustering based on lexical
affinities
Cf. Overlap of single words vs. semantic analysis
Find the predominant themes in a collection of
documents
Search for relevant documents using flexible
queries
Boolean queries with wild cards, free text
queries, hybrid queries

15
Semio's Automatic Taxonomy Building

Three groups of layers in Semio Taxonomy
Ontology: The highest levels of the directory.
These levels are primarily containers for other
categories, not for specific documents. The
topmost level is provided by the directory owner,
while subsequent levels are provided from the
Semio Topic Library.
Taxonomy: Semio Builder automatically generates
two levels of taxonomy structure using patented
techniques based on computational semiotics.
Thesaurus: It contains "related to" links between
concepts in the collection. Semio Builder
automatically generates "related to" links.

16
Concluding Remarks