Title: Mining Content and Structure on the Web
1. Mining Content and Structure on the Web
Bamshad Mobasher, School of CTI, DePaul University
2-4. Web Mining Taxonomy
(taxonomy diagrams spanning three slides)
5. Mining Search Engine Results
- Current Web search engines are a convenient source for mining, but:
  - they are keyword-based
  - they return too many answers, so the information must be filtered
  - they return low-quality answers
  - results are not customized or tailored to individuals
- Data mining will help:
  - query formulation: concept hierarchies, user profiles, etc.
  - better search primitives: user preferences/hints
  - categorization and filtering of results
  - linkage analysis: authoritative pages and clusters
  - Web-based query languages: XML, WebSQL, WebML
  - customization/personalization: content, usage, structure
6. Multi-Layer Warehousing of the Web (Han et al., 1997-1998)
- Meta-Web: a structure that summarizes the contents, structure, linkage, and access of the Web, and that evolves with the Web
  - Layer 0: the Web itself
  - Layer 1: the lowest layer of the Meta-Web
    - an entry: a Web page summary, including class, time, URL, contents, keywords, popularity, weight, links, etc.
  - Layer 2 and up: summaries/classifications/clusterings built in various ways and distributed for various applications
- The Meta-Web can be warehoused and incrementally updated
- Querying and mining can be performed on, or assisted by, the Meta-Web (a multi-layer digital library catalogue or yellow pages)
7. Multiple Layered Database Architecture
(architecture diagram)
8. Construction of the Multi-Layer Meta-Web
- XML facilitates structured and meta-information extraction
- Hidden Web: DB schema extraction and other meta-information
- Automatic classification of Web documents
  - based on Yahoo!, etc. as training sets; keyword-based correlation/classification analysis (IR/AI assistance)
- Automatic ranking of important Web pages
  - authoritative site recognition and clustering of Web pages
- Generalization-based multi-layer Meta-Web construction
  - with the assistance of clustering and classification analysis
9. Citation Analysis in Information Retrieval
- Citation analysis was studied in information retrieval (as bibliometrics) long before the WWW came on the scene
- Garfield's impact factor (1972)
  - provides a numerical assessment of journals based on journal citations
- Pinski and Narin (1976) proposed a significant variation, based on the observation that not all citations are equally important:
  - a journal is influential if, recursively, it is heavily cited by other influential journals
  - influence weight: the influence of a journal j is equal to the sum of the influence of all journals citing j, with the sum weighted by the amount that each cites j (see the recurrence below)
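One way to write this recurrence in LaTeX (notation ours; the slide gives only the prose definition): with C_{ij} the number of citations from journal i to journal j,

    w_j \;=\; \sum_i w_i \, \frac{C_{ij}}{\sum_k C_{ik}}

so each citing journal i distributes its influence w_i across the journals it cites, in proportion to how often it cites each one.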
10. Co-citation Analysis (from Garfield, 1998)
(http://165.123.33.33/eugene_garfield/papers/mapsciworld.html)
The Global Map of Science, based on co-citation clustering. The size of a circle represents the number of papers published in the area; the distance between circles represents the level of co-citation between the fields. By zooming in, deeper levels in the hierarchy can be exposed.
11. Co-citation Analysis (from Garfield, 1998)
Zooming in on biomedicine, specialties including cardiology, immunology, etc., can be viewed.
12. Discovery of Authoritative Pages
- PageRank method (Brin and Page, 1998)
  - ranks the "importance" of Web pages based on a model of a "random browser" (a sketch of the standard formulation appears after this list)
- Hypertext Induced Topic Search (HITS) (Kleinberg, 1998)
  - prominent authorities often do not endorse one another directly on the Web
  - hub pages have a large number of links to many relevant authorities
  - thus hubs and authorities exhibit a mutually reinforcing relationship
- Both the PageRank and hub/authority methodologies have been shown to provide qualitatively good search results for broad query topics on the WWW
  - incorporated into some search engines, such as Google
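The standard damped form of the PageRank recurrence (our addition; the slide only names the model) is, for a page p in a graph of N pages,

    PR(p) \;=\; \frac{1-d}{N} \;+\; d \sum_{q \to p} \frac{PR(q)}{\mathrm{outdeg}(q)}

where d (typically about 0.85) is the probability that the random browser follows a link rather than jumping to a random page.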
13. Hypertext Induced Topic Search
- Intuition behind the HITS algorithm:
  - authority comes from in-edges; being a good hub comes from out-edges
  - mutually reinforcing relationship:
    - better authority comes from the in-edges of good hubs
    - being a better hub comes from out-edges to good authorities
- A good authority is a page that is pointed to by many good hubs. A good hub is a page that points to many good authorities.
(figure: hubs on one side pointing to authorities on the other)
14. HITS Algorithm
- Let HUB[v] and AUTH[v] represent the hub and authority values associated with a vertex v
- Repeat until the HUB and AUTH vectors converge:
  - normalize HUB and AUTH
  - HUB[v] = Σ AUTH[u_i], over all u_i with Edge(v, u_i)
  - AUTH[v] = Σ HUB[w_i], over all w_i with Edge(w_i, v)
(figure: vertex v with in-neighbors w_1, ..., w_k feeding its AUTH score and out-neighbors u_1, ..., u_k feeding its HUB score)
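A minimal runnable sketch of these updates in Python (our illustration; the slide gives only the pseudocode above). The graph is assumed to be a dict mapping each page to the set of pages it links to, and a fixed iteration count stands in for a convergence test.

    import math

    def hits(graph, iterations=50):
        nodes = set(graph) | {u for targets in graph.values() for u in targets}
        hub = {v: 1.0 for v in nodes}
        auth = {v: 1.0 for v in nodes}
        for _ in range(iterations):
            # AUTH[v] = sum of HUB[w] over all w with an edge w -> v
            auth = {v: sum(hub[w] for w in nodes if v in graph.get(w, set()))
                    for v in nodes}
            # HUB[v] = sum of AUTH[u] over all u with an edge v -> u
            hub = {v: sum(auth[u] for u in graph.get(v, set())) for v in nodes}
            # normalize so the scores stay bounded
            a = math.sqrt(sum(x * x for x in auth.values())) or 1.0
            h = math.sqrt(sum(x * x for x in hub.values())) or 1.0
            auth = {v: x / a for v, x in auth.items()}
            hub = {v: x / h for v, x in hub.items()}
        return hub, auth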
15. HITS: Problems and Solutions
- Some edges are wrong (not recommendations):
  - multiple edges from the same author
  - automatically generated links
  - spam
  - Solution: weight edges to limit their influence
- Topic drift:
  - query: jaguar AND cars
  - result: pages about cars in general
  - Solution: analyze content and assign topic scores to nodes
16. Modified HITS Algorithm
- Let HUB[v] and AUTH[v] represent the hub and authority values associated with a vertex v
- Repeat until the HUB and AUTH vectors converge:
  - normalize HUB and AUTH
  - HUB[v] = Σ AUTH[u_i] · TopicScore(u_i) · Weight(v, u_i), over all u_i with Edge(v, u_i)
  - AUTH[v] = Σ HUB[w_i] · TopicScore(w_i) · Weight(w_i, v), over all w_i with Edge(w_i, v)
- The topic score is determined by a similarity measure between the query and the documents
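Only the two update lines of the earlier HITS sketch change; a hedged sketch of the modified updates follows, where topic_score and weight are assumed lookup tables keyed by node and by edge, standing in for the slide's TopicScore and Weight.

    def modified_update(graph, hub, auth, topic_score, weight, nodes):
        # AUTH[v] = sum of HUB[w] * TopicScore(w) * Weight(w, v) over edges w -> v
        new_auth = {v: sum(hub[w] * topic_score[w] * weight[(w, v)]
                           for w in nodes if v in graph.get(w, set()))
                    for v in nodes}
        # HUB[v] = sum of AUTH[u] * TopicScore(u) * Weight(v, u) over edges v -> u
        new_hub = {v: sum(new_auth[u] * topic_score[u] * weight[(v, u)]
                          for u in graph.get(v, set()))
                   for v in nodes}
        return new_hub, new_auth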
17. Agent-based Information Systems
- Agents facilitate access to multiple information sources
- Mediated architectures facilitate access to specialized collections
- Distributed (agent) architectures facilitate scalability
- Personal assistant agents (client-side) facilitate information management from multiple sources, based on a user profile
- Can be supervised or unsupervised
(figure: Basic Agent-based IR Architecture; queries and documents flow between multiple users, a layer of agents, and multiple information sources, with feedback returned to the agents)
18. Learning Interface Agents
- Add agents to the user interface and delegate tasks to them
- Use machine learning to improve performance:
  - learn user behavior and preferences
- Useful when:
  1) past behavior is a useful predictor of future behavior
  2) there is a wide variety of behaviors amongst users
- Examples:
  - mail clerk: sorts incoming messages into the right mailboxes
  - calendar manager: automatically schedules meeting times
  - personal news agents
  - portfolio manager agents
- Advantages:
  - less work for the user and the application writer
  - adaptive behavior
  - user and agent build a trust relationship gradually
19. Letizia: Autonomous Interface Agent (Lieberman, 1996)
(figure: Letizia observes the user's browsing, applies heuristics to build a user profile, and returns recommendations)
- Recommends Web pages during browsing, based on a user profile
- Learns the user profile using simple heuristics
- Passive observation; recommends on request
- Provides a relative ordering of link interestingness
- Assumes recommendations near the current page are more valuable than others
20. WebWatcher
- Dayne Freitag, Thorsten Joachims, Tom Mitchell (CMU)
- A "tour guide" agent for the WWW:
  - the user tells the agent what kind of information he/she is seeking
  - WebWatcher then accompanies the user while browsing the Web
  - it highlights hyperlinks that it believes will be of interest
  - its strategy for giving advice is learned from feedback on earlier tours
21. Another Example of Web Content Mining: WebACE (Mobasher, Boley, Gini, Han, 1998)
- WebACE is a client-side agent that automatically categorizes a set of documents (collected as part of a learned user profile), generates queries used to search for new related or similar documents, and filters the resulting documents to extract the set most closely related to the user profile
- The document categories are not given a priori
- The resulting document set can also be used to update the initial document categories
- Two new clustering algorithms, which provide significant improvement over traditional clustering algorithms, form the basis for the query generation and filtering components of the agent:
  - Association Rule Hypergraph Partitioning (ARHP)
  - Principal Direction Divisive Partitioning (PDDP)
22. WebACE Architecture
(architecture diagram: User Input feeds a Profile Creation Module; Clustering Modules produce Clusters; a Query Generator drives the Search Mechanism, followed by an optional Filter and an optional Cluster Updater)
23. Hypergraph-Based Clustering
- Construct a hypergraph from sets of related items:
  - each hyperedge represents a frequent itemset
  - the weight of each hyperedge can be based on characteristics of the frequent itemsets or association rules
- Recursively partition the hypergraph so that each partition contains only highly connected data items:
  - given a hypergraph G(V, E), find a k-way partitioning such that the weight of the hyperedges that are cut is minimized
  - the fitness of a partition is measured as the ratio of the weights of cut edges to the weights of uncut edges within the partition
  - the connectivity measures the percentage of edges within the partition with which a vertex is associated; it is used for filtering partitions
  - vertices from partial edges can be added back to clusters based on a user-specified overlap factor
24. Attaching Weights to Hyperedges
- The weight can be a function of the confidence of the rules contained in the frequent itemset
  - {Diaper, Milk, Beer} will be a hyperedge of 3 vertices, labeled with the specified weight
- The weight can be a function of the interest of the itemset
  - if interest > 1, the items appear together more often than would be expected under independence
  - Example: support(Diaper) = 0.1, support(Milk) = 0.5, support(Beer) = 0.2, support(Diaper, Milk, Beer) = 0.02
  - the expected joint support under independence is 0.1 × 0.5 × 0.2 = 0.01, so Interest(Diaper, Milk, Beer) = 0.02 / 0.01 = 2
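A minimal sketch of the interest measure (hypothetical helper, not WebACE code):

    def interest(joint_support, item_supports):
        # ratio of observed joint support to the product expected under independence
        expected = 1.0
        for s in item_supports:
            expected *= s
        return joint_support / expected

    # interest(0.02, [0.1, 0.5, 0.2])  ->  2.0 (up to floating-point rounding)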
25. Selection of Good Partitions
- Eliminate bad partitions using a fitness criterion:
  - the fitness measures the ratio of the weights of edges that lie within the partition to the weights of edges involving any vertex of the partition
- For each good partition, filter out vertices that are not highly connected to the rest of the vertices of the partition:
  - the connectivity of vertex v in partition C can be written as in the reconstruction after this list
  - the connectivity measures the percentage of edges within the partition with which the vertex is associated
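A LaTeX reconstruction of the two measures consistent with the prose above (notation ours): for a partition C with hyperedge weights w(e),

    \mathrm{fitness}(C) \;=\; \frac{\sum_{e \subseteq C} w(e)}{\sum_{|e \cap C| > 0} w(e)},
    \qquad
    \mathrm{connectivity}(v, C) \;=\; \frac{|\{\, e \mid e \subseteq C,\; v \in e \,\}|}{|\{\, e \mid e \subseteq C \,\}|}

i.e., fitness compares the weight of edges fully inside C to the weight of all edges touching C, and connectivity is the fraction of C's internal edges incident to v.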
26. What are the Transactions?
- Documents correspond to the items to be clustered, and words (features) correspond to transactions.
- Frequent itemsets then correspond to groups of documents in which many words occur in common. These groups are the hyperedges in the hypergraph of documents. (A small sketch of this inverted view follows.)
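A minimal sketch of the inverted view (illustrative names, not WebACE code): each word contributes one transaction, namely the set of documents containing it.

    def word_transactions(doc_words):
        # doc_words: {doc_id: iterable of words in that document}
        transactions = {}
        for doc, words in doc_words.items():
            for w in words:
                transactions.setdefault(w, set()).add(doc)
        return list(transactions.values())

    # Frequent itemsets mined over these transactions are sets of documents
    # that share many words; each such set becomes a hyperedge.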
27. Experiments with WebACE
- Data set: 185 documents in 10 categories
- The documents in each category were obtained by doing a keyword search for the category label using a standard search engine
- The word lists from all documents were filtered through a stop-list and then stemmed using a suffix-stripping algorithm
- Entropy was used as a measure of the goodness of the clusters (a sketch of the measure follows the category list):
  - when a cluster contains documents from only one category, its entropy is 0.0
  - when a cluster contains documents from many different categories, its entropy is higher
Categories: business capital (BC), intellectual property (IP), electronic commerce (EC), information systems (IS), affirmative action (AA), employee rights (ER), personnel management (PM), industrial partnership (IPT), manufacturing systems integration (MSI), materials processing (MP)
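The standard cluster-entropy definition matching these properties (our formulation; the slide gives only the prose): if p_{ij} is the fraction of documents in cluster C_j belonging to category i,

    E(C_j) \;=\; -\sum_i p_{ij} \log p_{ij},
    \qquad
    E_{\mathrm{total}} \;=\; \sum_j \frac{n_j}{n}\, E(C_j)

where n_j is the size of cluster C_j and n the total number of documents; a pure cluster has E(C_j) = 0.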
28. Feature Selection Criteria Used in the Experiments
- In E9, the features selected were the most frequent words, until an accumulated frequency of 25 was reached.
- In E10, the association rule algorithm (Apriori) was run on the words (using words as items). The features selected were the words remaining in the discovered frequent itemsets.
29. Entropy Comparison of Algorithms
Note: lower entropy indicates better cohesiveness of clusters.
30. Document Clusters in AutoClass
31. Document Clusters in ARHP
32. Applying ARHP to S&P Stock Data
- S&P 500 stock price movements from Jan. 1994 to Oct. 1996
- Daily movements form the transactions:
  Day 1: Intel-UP, Microsoft-UP, Morgan-Stanley-DOWN
  Day 2: Intel-DOWN, Microsoft-DOWN, Morgan-Stanley-UP
  Day 3: Intel-UP, Microsoft-DOWN, Morgan-Stanley-DOWN
  ...
- Frequent itemsets from the stock data:
  {Intel-UP, Microsoft-UP}
  {Intel-DOWN, Microsoft-DOWN, Morgan-Stanley-UP}
  {Morgan-Stanley-UP, MBNA-Corp-UP, Fed-Home-Loan-UP}
  ...
(A sketch of the transaction construction follows.)
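A short sketch of how such transactions could be built from price series (hypothetical helper, not from the slides):

    def daily_transactions(prices):
        # prices: {ticker: [price_day0, price_day1, ...]}, aligned by trading day
        num_days = len(next(iter(prices.values())))
        transactions = []
        for t in range(1, num_days):
            items = set()
            for ticker, series in prices.items():
                # flat days are treated as DOWN for simplicity
                direction = "UP" if series[t] > series[t - 1] else "DOWN"
                items.add(f"{ticker}-{direction}")
            transactions.append(items)
        return transactions  # one item set per trading day, mined with Apriori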
33. Clustering of S&P 500 Stock Data
Other clusters: Bank, Paper/Lumber, Motor/Machinery, Retail, Telecommunication, Tech/Electronics
34. Intelligent Web Assistant Agent (Mobasher, Lytinen, 1999-2000)