Title: Mining Content and Structure on the Web
1. Mining Content and Structure on the Web
Bamshad Mobasher, School of CTI, DePaul University
2-4. Web Mining Taxonomy
(taxonomy diagrams spanning three slides)
5. Mining Search Engine Results
- Current Web search engines are a convenient source for mining, but:
  - they are keyword-based
  - they return too many answers, so the information must be filtered
  - they return low-quality answers
  - results are not customized or tailored to individuals
- Data mining will help:
  - query formulation: concept hierarchies, user profiles, etc.
  - better search primitives: user preferences/hints
  - categorization and filtering of results
  - linkage analysis: authoritative pages and clusters
  - Web-based query languages: XML, WebSQL, WebML
  - customization/personalization: content, usage, structure
6. Multi-Layer Warehousing of the Web (Han et al., 1997-1998)
- Meta-Web: a structure that summarizes the contents, structure, linkage, and access of the Web, and that evolves with the Web
  - Layer 0: the Web itself
  - Layer 1: the lowest layer of the Meta-Web
    - an entry: a Web page summary, including class, time, URL, contents, keywords, popularity, weight, links, etc.
  - Layer 2 and up: summaries/classifications/clusterings built in various ways and distributed for various applications
- The Meta-Web can be warehoused and incrementally updated
- Querying and mining can be performed on, or assisted by, the Meta-Web (a multi-layer digital library catalogue or yellow pages)
7. Multiple Layered Database Architecture
(architecture diagram)
8. Construction of the Multi-Layer Meta-Web
- XML facilitates structured and meta-information extraction
- Hidden Web: DB schema extraction and other meta-information
- Automatic classification of Web documents
  - based on Yahoo!, etc. as training sets; keyword-based correlation/classification analysis (IR/AI assistance)
- Automatic ranking of important Web pages
  - authoritative site recognition and clustering of Web pages
- Generalization-based multi-layer Meta-Web construction
  - with the assistance of clustering and classification analysis
9. Citation Analysis in Information Retrieval
- Citation analysis was studied in information retrieval (as bibliometrics) long before the WWW came on the scene
- Garfield's impact factor (1972)
  - provides a numerical assessment of journals based on journal citations
- Pinski and Narin (1976) proposed a significant variation, based on the observation that not all citations are equally important:
  - a journal is influential if, recursively, it is heavily cited by other influential journals
  - influence weight: the influence of a journal j is equal to the sum of the influence of all journals citing j, with the sum weighted by the amount that each cites j (see the recurrence below)
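One way to write this recurrence in LaTeX (notation ours; the slide gives only the prose definition): with C_{ij} the number of citations from journal i to journal j,

    w_j \;=\; \sum_i w_i \, \frac{C_{ij}}{\sum_k C_{ik}}

so each citing journal i distributes its influence w_i across the journals it cites, in proportion to how often it cites each one.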
10. Co-citation Analysis (from Garfield, 1998)
(http://165.123.33.33/eugene_garfield/papers/mapsciworld.html)
The Global Map of Science, based on co-citation clustering. The size of a circle represents the number of papers published in the area; the distance between circles represents the level of co-citation between the fields. By zooming in, deeper levels in the hierarchy can be exposed.
11. Co-citation Analysis (from Garfield, 1998)
Zooming in on biomedicine, specialties including cardiology, immunology, etc., can be viewed.
12. Discovery of Authoritative Pages
- PageRank method (Brin and Page, 1998)
  - ranks the "importance" of Web pages based on a model of a "random browser" (a sketch of the standard formulation appears after this list)
- Hypertext Induced Topic Search (HITS) (Kleinberg, 1998)
  - prominent authorities often do not endorse one another directly on the Web
  - hub pages have a large number of links to many relevant authorities
  - thus hubs and authorities exhibit a mutually reinforcing relationship
- Both the PageRank and hub/authority methodologies have been shown to provide qualitatively good search results for broad query topics on the WWW
  - incorporated into some search engines, such as Google
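The standard damped form of the PageRank recurrence (our addition; the slide only names the model) is, for a page p in a graph of N pages,

    PR(p) \;=\; \frac{1-d}{N} \;+\; d \sum_{q \to p} \frac{PR(q)}{\mathrm{outdeg}(q)}

where d (typically about 0.85) is the probability that the random browser follows a link rather than jumping to a random page.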
13. Hypertext Induced Topic Search
- Intuition behind the HITS algorithm:
  - authority comes from in-edges; being a good hub comes from out-edges
  - mutually reinforcing relationship:
    - better authority comes from the in-edges of good hubs
    - being a better hub comes from out-edges to good authorities
- A good authority is a page that is pointed to by many good hubs. A good hub is a page that points to many good authorities.
(figure: hubs on one side pointing to authorities on the other)
14. HITS Algorithm
- Let HUB[v] and AUTH[v] represent the hub and authority values associated with a vertex v
- Repeat until the HUB and AUTH vectors converge:
  - normalize HUB and AUTH
  - HUB[v] = Σ AUTH[u_i], over all u_i with Edge(v, u_i)
  - AUTH[v] = Σ HUB[w_i], over all w_i with Edge(w_i, v)
(figure: vertex v with in-neighbors w_1, ..., w_k feeding its AUTH score and out-neighbors u_1, ..., u_k feeding its HUB score)
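A minimal runnable sketch of these updates in Python (our illustration; the slide gives only the pseudocode above). The graph is assumed to be a dict mapping each page to the set of pages it links to, and a fixed iteration count stands in for a convergence test.

    import math

    def hits(graph, iterations=50):
        nodes = set(graph) | {u for targets in graph.values() for u in targets}
        hub = {v: 1.0 for v in nodes}
        auth = {v: 1.0 for v in nodes}
        for _ in range(iterations):
            # AUTH[v] = sum of HUB[w] over all w with an edge w -> v
            auth = {v: sum(hub[w] for w in nodes if v in graph.get(w, set()))
                    for v in nodes}
            # HUB[v] = sum of AUTH[u] over all u with an edge v -> u
            hub = {v: sum(auth[u] for u in graph.get(v, set())) for v in nodes}
            # normalize so the scores stay bounded
            a = math.sqrt(sum(x * x for x in auth.values())) or 1.0
            h = math.sqrt(sum(x * x for x in hub.values())) or 1.0
            auth = {v: x / a for v, x in auth.items()}
            hub = {v: x / h for v, x in hub.items()}
        return hub, auth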
15. HITS: Problems and Solutions
- Some edges are wrong (not recommendations):
  - multiple edges from the same author
  - automatically generated links
  - spam
  - Solution: weight edges to limit their influence
- Topic drift:
  - query: jaguar AND cars
  - result: pages about cars in general
  - Solution: analyze content and assign topic scores to nodes
16. Modified HITS Algorithm
- Let HUB[v] and AUTH[v] represent the hub and authority values associated with a vertex v
- Repeat until the HUB and AUTH vectors converge:
  - normalize HUB and AUTH
  - HUB[v] = Σ AUTH[u_i] · TopicScore(u_i) · Weight(v, u_i), over all u_i with Edge(v, u_i)
  - AUTH[v] = Σ HUB[w_i] · TopicScore(w_i) · Weight(w_i, v), over all w_i with Edge(w_i, v)
- The topic score is determined by a similarity measure between the query and the documents
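Only the two update lines of the earlier HITS sketch change; a hedged sketch of the modified updates follows, where topic_score and weight are assumed lookup tables keyed by node and by edge, standing in for the slide's TopicScore and Weight.

    def modified_update(graph, hub, auth, topic_score, weight, nodes):
        # AUTH[v] = sum of HUB[w] * TopicScore(w) * Weight(w, v) over edges w -> v
        new_auth = {v: sum(hub[w] * topic_score[w] * weight[(w, v)]
                           for w in nodes if v in graph.get(w, set()))
                    for v in nodes}
        # HUB[v] = sum of AUTH[u] * TopicScore(u) * Weight(v, u) over edges v -> u
        new_hub = {v: sum(new_auth[u] * topic_score[u] * weight[(v, u)]
                          for u in graph.get(v, set()))
                   for v in nodes}
        return new_hub, new_auth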
17. Agent-based Information Systems
- Agents facilitate access to multiple information sources
- Mediated architectures facilitate access to specialized collections
- Distributed (agent) architectures facilitate scalability
- Personal assistant agents (client-side) facilitate information management from multiple sources, based on a user profile
- Can be supervised or unsupervised
(figure: Basic Agent-based IR Architecture; queries and documents flow between multiple users, a layer of agents, and multiple information sources, with feedback returned to the agents)
18. Learning Interface Agents
- Add agents to the user interface and delegate tasks to them
- Use machine learning to improve performance:
  - learn user behavior and preferences
- Useful when:
  1) past behavior is a useful predictor of future behavior
  2) there is a wide variety of behaviors amongst users
- Examples:
  - mail clerk: sorts incoming messages into the right mailboxes
  - calendar manager: automatically schedules meeting times
  - personal news agents
  - portfolio manager agents
- Advantages:
  - less work for the user and the application writer
  - adaptive behavior
  - user and agent build a trust relationship gradually
19. Letizia: Autonomous Interface Agent (Lieberman, 1996)
(figure: Letizia observes the user's browsing, applies heuristics to build a user profile, and returns recommendations)
- Recommends Web pages during browsing, based on a user profile
- Learns the user profile using simple heuristics
- Passive observation; recommends on request
- Provides a relative ordering of link interestingness
- Assumes recommendations near the current page are more valuable than others
20. WebWatcher
- Dayne Freitag, Thorsten Joachims, Tom Mitchell (CMU)
- A "tour guide" agent for the WWW:
  - the user tells the agent what kind of information he/she is seeking
  - WebWatcher then accompanies the user while browsing the Web
  - it highlights hyperlinks that it believes will be of interest
  - its strategy for giving advice is learned from feedback on earlier tours
21. Another Example of Web Content Mining: WebACE (Mobasher, Boley, Gini, Han, 1998)
- WebACE is a client-side agent that automatically categorizes a set of documents (collected as part of a learned user profile), generates queries used to search for new related or similar documents, and filters the resulting documents to extract the set most closely related to the user profile
- The document categories are not given a priori
- The resulting document set can also be used to update the initial document categories
- Two new clustering algorithms, which provide significant improvement over traditional clustering algorithms, form the basis for the query generation and filtering components of the agent:
  - Association Rule Hypergraph Partitioning (ARHP)
  - Principal Direction Divisive Partitioning (PDDP)
22. WebACE Architecture
(architecture diagram: User Input feeds a Profile Creation Module; Clustering Modules produce Clusters; a Query Generator drives the Search Mechanism, followed by an optional Filter and an optional Cluster Updater)
23. Hypergraph-Based Clustering
- Construct a hypergraph from sets of related items:
  - each hyperedge represents a frequent itemset
  - the weight of each hyperedge can be based on characteristics of the frequent itemsets or association rules
- Recursively partition the hypergraph so that each partition contains only highly connected data items:
  - given a hypergraph G(V, E), find a k-way partitioning such that the weight of the hyperedges that are cut is minimized
  - the fitness of a partition is measured as the ratio of the weights of cut edges to the weights of uncut edges within the partition
  - the connectivity measures the percentage of edges within the partition with which a vertex is associated; it is used for filtering partitions
  - vertices from partial edges can be added back to clusters based on a user-specified overlap factor
24. Attaching Weights to Hyperedges
- The weight can be a function of the confidence of the rules contained in the frequent itemset
  - {Diaper, Milk, Beer} will be a hyperedge of 3 vertices, labeled with the specified weight
- The weight can be a function of the interest of the itemset
  - if interest > 1, the items appear together more often than would be expected under independence
  - Example: support(Diaper) = 0.1, support(Milk) = 0.5, support(Beer) = 0.2, support(Diaper, Milk, Beer) = 0.02
  - the expected joint support under independence is 0.1 × 0.5 × 0.2 = 0.01, so Interest(Diaper, Milk, Beer) = 0.02 / 0.01 = 2
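A minimal sketch of the interest measure (hypothetical helper, not WebACE code):

    def interest(joint_support, item_supports):
        # ratio of observed joint support to the product expected under independence
        expected = 1.0
        for s in item_supports:
            expected *= s
        return joint_support / expected

    # interest(0.02, [0.1, 0.5, 0.2])  ->  2.0 (up to floating-point rounding)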
25. Selection of Good Partitions
- Eliminate bad partitions using a fitness criterion:
  - the fitness measures the ratio of the weights of edges that lie within the partition to the weights of edges involving any vertex of the partition
- For each good partition, filter out vertices that are not highly connected to the rest of the vertices of the partition:
  - the connectivity of vertex v in partition C can be written as in the reconstruction after this list
  - the connectivity measures the percentage of edges within the partition with which the vertex is associated
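A LaTeX reconstruction of the two measures consistent with the prose above (notation ours): for a partition C with hyperedge weights w(e),

    \mathrm{fitness}(C) \;=\; \frac{\sum_{e \subseteq C} w(e)}{\sum_{|e \cap C| > 0} w(e)},
    \qquad
    \mathrm{connectivity}(v, C) \;=\; \frac{|\{\, e \mid e \subseteq C,\; v \in e \,\}|}{|\{\, e \mid e \subseteq C \,\}|}

i.e., fitness compares the weight of edges fully inside C to the weight of all edges touching C, and connectivity is the fraction of C's internal edges incident to v.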
26. What are the Transactions?
- Documents correspond to the items to be clustered, and words (features) correspond to transactions.
- Frequent itemsets then correspond to groups of documents in which many words occur in common. These groups are the hyperedges in the hypergraph of documents. (A small sketch of this inverted view follows.)
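A minimal sketch of the inverted view (illustrative names, not WebACE code): each word contributes one transaction, namely the set of documents containing it.

    def word_transactions(doc_words):
        # doc_words: {doc_id: iterable of words in that document}
        transactions = {}
        for doc, words in doc_words.items():
            for w in words:
                transactions.setdefault(w, set()).add(doc)
        return list(transactions.values())

    # Frequent itemsets mined over these transactions are sets of documents
    # that share many words; each such set becomes a hyperedge.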
27. Experiments with WebACE
- Data set: 185 documents in 10 categories
- The documents in each category were obtained by doing a keyword search for the category label using a standard search engine
- The word lists from all documents were filtered through a stop-list and then stemmed using a suffix-stripping algorithm
- Entropy was used as a measure of the goodness of the clusters (a sketch of the measure follows the category list):
  - when a cluster contains documents from only one category, its entropy is 0.0
  - when a cluster contains documents from many different categories, its entropy is higher
Categories: business capital (BC), intellectual property (IP), electronic commerce (EC), information systems (IS), affirmative action (AA), employee rights (ER), personnel management (PM), industrial partnership (IPT), manufacturing systems integration (MSI), materials processing (MP)
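The standard cluster-entropy definition matching these properties (our formulation; the slide gives only the prose): if p_{ij} is the fraction of documents in cluster C_j belonging to category i,

    E(C_j) \;=\; -\sum_i p_{ij} \log p_{ij},
    \qquad
    E_{\mathrm{total}} \;=\; \sum_j \frac{n_j}{n}\, E(C_j)

where n_j is the size of cluster C_j and n the total number of documents; a pure cluster has E(C_j) = 0.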
28. Feature Selection Criteria Used in the Experiments
- In E9, the features selected were the most frequent words, until an accumulated frequency of 25 was reached.
- In E10, the association rule algorithm (Apriori) was run on the words (using words as items). The features selected were the words remaining in the discovered frequent itemsets.
29. Entropy Comparison of Algorithms
Note: lower entropy indicates better cohesiveness of clusters.
30. Document Clusters in AutoClass
31. Document Clusters in ARHP
32. Applying ARHP to S&P Stock Data
- S&P 500 stock price movements from Jan. 1994 to Oct. 1996
- Daily movements form the transactions:
  Day 1: Intel-UP, Microsoft-UP, Morgan-Stanley-DOWN
  Day 2: Intel-DOWN, Microsoft-DOWN, Morgan-Stanley-UP
  Day 3: Intel-UP, Microsoft-DOWN, Morgan-Stanley-DOWN
  ...
- Frequent itemsets from the stock data:
  {Intel-UP, Microsoft-UP}
  {Intel-DOWN, Microsoft-DOWN, Morgan-Stanley-UP}
  {Morgan-Stanley-UP, MBNA-Corp-UP, Fed-Home-Loan-UP}
  ...
(A sketch of the transaction construction follows.)
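A short sketch of how such transactions could be built from price series (hypothetical helper, not from the slides):

    def daily_transactions(prices):
        # prices: {ticker: [price_day0, price_day1, ...]}, aligned by trading day
        num_days = len(next(iter(prices.values())))
        transactions = []
        for t in range(1, num_days):
            items = set()
            for ticker, series in prices.items():
                # flat days are treated as DOWN for simplicity
                direction = "UP" if series[t] > series[t - 1] else "DOWN"
                items.add(f"{ticker}-{direction}")
            transactions.append(items)
        return transactions  # one item set per trading day, mined with Apriori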
33. Clustering of S&P 500 Stock Data
Other clusters: Bank, Paper/Lumber, Motor/Machinery, Retail, Telecommunication, Tech/Electronics
34. Intelligent Web Assistant Agent (Mobasher, Lytinen, 1999-2000)