1
Mining Content and Structure on the Web
Bamshad Mobasher School of CTI, DePaul University
2
Web Mining Taxonomy
3
Web Mining Taxonomy
4
Web Mining Taxonomy
5
Mining Search Engine Results
  • Current Web search engines: a convenient source for mining
  • keyword-based
  • return too many answers; need to filter information
  • low-quality answers
  • results not customized or tailored to individuals
  • Data mining will help:
  • query formulation: concept hierarchies, user profiles, etc.
  • better search primitives: user preferences/hints
  • categorization and filtering of results
  • linkage analysis: authoritative pages and clusters
  • Web-based query languages: XML, WebSQL, WebML
  • customization/personalization: content, usage, structure

6
Multi-Layer Warehousing of the Web (Han et al., 1997-1998)
  • Meta-Web: a structure which summarizes the contents, structure, linkage, and access of the Web, and which evolves with the Web
  • Layer 0: the Web itself
  • Layer 1: the lowest layer of the Meta-Web
  • an entry: a Web page summary, including class, time, URL, contents, keywords, popularity, weight, links, etc.
  • Layers 2 and up:
  • summary/classification/clustering in various ways, distributed for various applications
  • The Meta-Web can be warehoused and incrementally updated
  • Querying and mining can be performed on, or assisted by, the Meta-Web (a multi-layer digital library catalogue or yellow pages)

7
Multiple Layered Database Architecture
8
Construction of Multi-Layer Meta-Web
  • XML facilitates structured and meta-information extraction
  • Hidden Web: DB schema extraction and other meta-information
  • Automatic classification of Web documents
  • based on Yahoo!, etc. as training sets; keyword-based correlation/classification analysis (IR/AI assistance)
  • Automatic ranking of important Web pages
  • authoritative site recognition and clustering of Web pages
  • Generalization-based multi-layer Meta-Web construction
  • with the assistance of clustering and classification analysis

9
Citation Analysis in Information Retrieval
  • Citation analysis was studied in information retrieval long before the WWW came on the scene (bibliometrics)
  • Garfield's impact factor (1972)
  • provides a numerical assessment of journals in the Journal Citation Reports
  • Pinski and Narin (1976) proposed a significant variation, based on the observation that not all citations are equally important
  • A journal is influential if, recursively, it is heavily cited by other influential journals
  • Influence weight: the influence of a journal j is equal to the sum of the influence of all journals citing j, with the sum weighted by the amount that each cites j

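Written as an equation, the slide's recursive definition reads roughly as follows. This is a sketch: the notation and the question of how citation counts are normalized (Pinski and Narin normalize them so that the recursion has a stable fixed point) are assumptions, not stated on the slide.

```latex
% w_j: influence weight of journal j
% C_{ij}: number of citations from journal i to journal j (normalization assumed, see above)
w_j = \sum_{i} C_{ij} \, w_i
```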
10
Co-citation analysis (From Garfield 98)
(http://165.123.33.33/eugene_garfield/papers/mapsciworld.html)
The Global Map of Science, based on co-citation clustering. The size of a circle represents the number of papers published in the area; the distance between circles represents the level of co-citation between the fields. By zooming in, deeper levels in the hierarchy can be exposed.
11
Co-citation analysis (From Garfield 98)
Zooming in on biomedicine, specialties including
cardiology, immunology, etc., can be viewed.
12
Discovery of Authoritative Pages
  • Page-rank method (Brin and Page, 1998)
  • Ranks the "importance" of Web pages, based on a model of a "random browser"
  • Hypertext Induced Topic Search (HITS) (Kleinberg, 1998)
  • Prominent authorities often do not endorse one another directly on the Web
  • Hub pages have a large number of links to many relevant authorities
  • Thus hubs and authorities exhibit a mutually reinforcing relationship
  • Both the page-rank and hub/authority methodologies have been shown to provide qualitatively good search results for broad query topics on the WWW
  • Incorporated into some search engines, such as Google

13
Hypertext Induced Topic Search
  • Intuition behind the HITS algorithm:
  • Authority comes from in-edges; being a good hub comes from out-edges
  • Mutually reinforcing relationship:
  • Better authority comes from the in-edges of good hubs
  • Being a better hub comes from out-edges to good authorities

A good authority is a page that is pointed to by many good hubs. A good hub is a page that points to many good authorities.
[Figure: hub pages pointing to authority pages]
14
HITS Algorithm
  • Let HUB[v] and AUTH[v] represent the hub and authority values associated with a vertex v
  • Repeat until the HUB and AUTH vectors converge:
  • Normalize HUB and AUTH
  • HUB[v] = Σ AUTH[u_i], over all u_i with Edge(v, u_i)
  • AUTH[v] = Σ HUB[w_i], over all w_i with Edge(w_i, v)

[Figure: vertex v with in-edges from w_1 ... w_k, whose hub scores sum to AUTH[v], and out-edges to u_1 ... u_k, whose authority scores sum to HUB[v]]
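A minimal sketch of this iteration in Python. The dictionary-of-lists graph representation and the fixed iteration count (instead of an explicit convergence test) are simplifying assumptions, not part of the original algorithm description.

```python
import math

def hits(out_links, iterations=50):
    """Basic HITS iteration.
    out_links: dict mapping each page to the list of pages it links to."""
    pages = set(out_links) | {p for targets in out_links.values() for p in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # AUTH[v] = sum of HUB[w] over pages w that link to v
        new_auth = {p: 0.0 for p in pages}
        for w, targets in out_links.items():
            for v in targets:
                new_auth[v] += hub[w]
        # HUB[v] = sum of AUTH[u] over pages u that v links to
        new_hub = {p: sum(new_auth[u] for u in out_links.get(p, [])) for p in pages}
        # Normalize both vectors (L2 norm)
        for vec in (new_auth, new_hub):
            norm = math.sqrt(sum(x * x for x in vec.values())) or 1.0
            for p in vec:
                vec[p] /= norm
        hub, auth = new_hub, new_auth
    return hub, auth
```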
15
HITS Problems and Solutions
  • Some edges are wrong (not recommendations):
  • multiple edges from the same author
  • automatically generated
  • spam
  • Solution: weight edges to limit influence
  • Topic drift:
  • Query: "jaguar AND cars"
  • Result: pages about cars in general
  • Solution: analyze content and assign topic scores to nodes

16
Modified HITS Algorithm
  • Let HUB[v] and AUTH[v] represent the hub and authority values associated with a vertex v
  • Repeat until the HUB and AUTH vectors converge:
  • Normalize HUB and AUTH
  • HUB[v] = Σ AUTH[u_i] · TopicScore[u_i] · Weight(v, u_i), over all u_i with Edge(v, u_i)
  • AUTH[v] = Σ HUB[w_i] · TopicScore[w_i] · Weight(w_i, v), over all w_i with Edge(w_i, v)
  • The topic score is determined by a similarity measure between the query and the documents

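Under the same assumptions as the HITS sketch above, the modified update only changes how each contribution is accumulated: it is multiplied by the contributing node's topic score and by the edge weight. A sketch of a single un-normalized update step, where topic_score and edge_weight are assumed, precomputed inputs:

```python
def modified_hits_step(out_links, hub, topic_score, edge_weight):
    """One un-normalized update of the modified HITS scores.
    topic_score: dict page -> similarity of the page to the query.
    edge_weight: dict (src, dst) -> weight used to limit the influence of suspect edges."""
    new_auth = {p: 0.0 for p in hub}
    for w, targets in out_links.items():
        for v in targets:
            new_auth[v] += hub[w] * topic_score[w] * edge_weight[(w, v)]
    new_hub = {
        p: sum(new_auth[u] * topic_score[u] * edge_weight[(p, u)]
               for u in out_links.get(p, []))
        for p in hub
    }
    return new_hub, new_auth
```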
17
Agent-based Information Systems
  • Agents facilitate access to multiple information
    sources
  • Mediated architectures facilitate access to
    specialized collections
  • Distributed (agent) architectures facilitate
    scalability
  • Personal assistant agents (client-side)
    facilitate information management from multiple
    sources based on user profile
  • Can be supervised or unsupervised

[Figure: Basic Agent-based IR Architecture, in which multiple users exchange queries and feedback with agents, which in turn exchange queries and documents with multiple information sources]
18
Learning Interface Agents
  • Add agents to the user interface and delegate tasks to them
  • Use machine learning to improve performance
  • learn user behavior and preferences
  • Useful when:
  • 1) past behavior is a useful predictor of future behavior
  • 2) there is a wide variety of behaviors amongst users
  • Examples:
  • mail clerk: sorts incoming messages into the right mailboxes
  • calendar manager: automatically schedules meeting times?
  • personal news agents
  • portfolio manager agents
  • Advantages:
  • less work for user and application writer
  • adaptive behavior
  • user and agent build a trust relationship gradually

19
Letizia: Autonomous Interface Agent (Lieberman 96)
[Figure: Letizia observes the user's browsing, applies simple heuristics to build a user profile, and uses the profile to generate recommendations]
  • Recommends web pages during browsing based on
    user profile
  • Learns user profile using simple heuristics
  • Passive observation, recommend on request
  • Provides a relative ordering of link interestingness
  • Assumes recommendations near the current page are more valuable than others

20
WebWatcher
  • Dayne Freitag, Thorsten Joachims, Tom Mitchell
    (CMU)
  • A "tour guide" agent for the WWW
  • user tells agent what kind of information he/she
    is seeking
  • WebWatcher then accompanies user while browsing
    the web
  • highlights hyperlinks that it believes will be of
    interest
  • its strategy for giving advice is learned from
    feedback in earlier tours

21
Another Example of Web Content Mining: WebACE (Mobasher, Boley, Gini, Han, 1998)
  • WebACE is a client-side agent that automatically categorizes a set of documents (collected as part of a learned user profile), generates queries used to search for new related or similar documents, and filters the resulting documents to extract the set of documents most closely related to the user profile
  • The document categories are not given a priori
  • The resulting document set could also be used to update the initial document categories
  • Uses two new clustering algorithms, which provide significant improvement over traditional clustering algorithms and form the basis for the query generation and filtering components of the agent:
  • Association Rule Hypergraph Partitioning (ARHP)
  • Principal Direction Divisive Partitioning (PDDP)

22
WebACE Architecture
[Figure: WebACE architecture, with components User Input, Profile Creation Module, Clustering Modules, Clusters, Query Generator, Search Mechanism, Filter (optional), and Cluster Updater (optional)]
23
Hypergraph-Based Clustering
  • Construct a hypergraph from sets of related items
  • Each hyperedge represents a frequent itemset
  • Weight of each hyperedge can be based on the
    characteristics of frequent itemsets or
    association rules
  • Recursively partition hypergraph so that each
    partition contains only highly connected data
    items
  • Given a hypergraph G(V,E) we find a k-way
    partitioning such that the weight of the
    hyperedges that are cut is minimized
  • The fitness of a partition is measured in terms of the ratio of the weights of cut edges to the weights of uncut edges within the partition (see the sketch below)
  • The connectivity measures the percentage of edges within the partition with which a vertex is associated; it is used for filtering partitions
  • Vertices from partial edges can be added back to
    clusters based on a user-specified overlap factor

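A minimal sketch of the hyperedge construction and a fitness criterion for candidate partitions. It follows the fuller fitness definition given on the later "Selection of Good Partitions" slide (weight of edges within the partition relative to the weight of edges involving the partition); the data structures are assumptions, and the recursive min-cut partitioner itself is omitted.

```python
def build_hyperedges(frequent_itemsets, weights):
    """Turn each frequent itemset into a weighted hyperedge over its items.
    weights: dict mapping an itemset to its weight (e.g., derived from rule confidence)."""
    return [(frozenset(itemset), weights[itemset]) for itemset in frequent_itemsets]

def fitness(partition, hyperedges):
    """Ratio of the weight of edges lying entirely within the partition to the
    weight of edges involving any vertex of the partition (higher is better)."""
    within = touching = 0.0
    for edge, weight in hyperedges:
        if edge & partition:            # edge involves some vertex of the partition
            touching += weight
            if edge <= partition:       # edge lies entirely within the partition
                within += weight
    return within / touching if touching else 0.0
```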
24
Attaching Weights to Hyperedges
  • The weight can be a function of the confidence of the rules contained in the frequent itemsets
  • {Diaper, Milk, Beer} will be a hyperedge of 3 vertices labeled with the specified weight
  • The weight can also be a function of the interest of the itemset
  • if interest > 1, then the items appear together more often than would be expected under a random (independent) distribution
  • Example: support(Diaper) = 0.1, support(Milk) = 0.5, support(Beer) = 0.2, support(Diaper, Milk, Beer) = 0.02
  • Interest(Diaper, Milk, Beer) = 0.02 / (0.1 × 0.5 × 0.2) = 0.02 / 0.01 = 2

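A one-line check of the interest computation above, written as a small Python helper (the function name is only for illustration):

```python
from math import prod

def interest(itemset_support, item_supports):
    """Interest (lift): observed support divided by the support expected if the items were independent."""
    return itemset_support / prod(item_supports)

print(interest(0.02, [0.1, 0.5, 0.2]))  # 0.02 / 0.01 = 2.0
```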
25
Selection of Good Partitions
  • Eliminate bad partitions using a fitness criterion
  • The fitness measures the ratio of the weight of edges that lie within the partition to the weight of edges involving any vertex of the partition
  • For each good partition, filter out vertices that are not highly connected to the rest of the vertices of the partition
  • The connectivity of a vertex v in a partition C measures the percentage of edges within the partition with which the vertex is associated (a sketch follows)
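A minimal Python reading of the connectivity measure, matching the slide's verbal definition (the exact formula from the original ARHP paper may differ slightly; this is an assumption):

```python
def connectivity(vertex, partition, hyperedges):
    """Fraction of the partition's internal hyperedges that contain the given vertex."""
    internal_edges = [edge for edge, _ in hyperedges if edge <= partition]
    if not internal_edges:
        return 0.0
    return sum(1 for edge in internal_edges if vertex in edge) / len(internal_edges)
```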

26
What are the Transactions?
  • Documents correspond to items to be clustered and
    words (features) correspond to transactions.
  • Frequent item sets correspond to groups of
    documents in which many words occur in common.
    These groups are the hyperedges in the hypergraph
    of documents.

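A toy illustration of this transaction view (the example documents and words are made up): documents are the items, and each word contributes one transaction listing the documents in which it occurs.

```python
# Three tiny documents represented as sets of (stemmed, stop-listed) words.
docs = {
    "d1": {"web", "mine", "agent"},
    "d2": {"web", "mine", "cluster"},
    "d3": {"stock", "market"},
}

# Each word becomes a transaction over documents.
transactions = {}
for doc_id, words in docs.items():
    for word in words:
        transactions.setdefault(word, set()).add(doc_id)

print(transactions)
# A frequent itemset such as {d1, d2} (supported by the words "web" and "mine")
# becomes a hyperedge connecting those documents.
```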
27
Experiments with WebACE
  • Data set: 185 documents in 10 categories
  • The documents in each category were obtained by
    doing a keyword search for the category label
    using a standard search engine
  • The word lists from all documents were filtered
    through a stop-list and then stemmed using a
    suffix-stripping algorithm
  • Entropy is used as a measure of the goodness of the clusters:
  • When a cluster contains documents from only one
    category, the entropy value for the cluster is
    0.0
  • When a cluster contains documents from many
    different categories, the entropy value for the
    cluster is higher

The 10 categories: business capital (BC), intellectual property (IP), electronic commerce (EC), information systems (IS), affirmative action (AA), employee rights (ER), personnel management (PM), industrial partnership (IPT), manufacturing systems integration (MSI), materials processing (MP)
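A minimal sketch of a per-cluster entropy measure consistent with the description above (the exact formula and normalization used in the WebACE experiments are assumptions):

```python
import math

def cluster_entropy(category_labels):
    """Entropy of one cluster, given the category label of each document in it.
    0.0 when all documents share one category; larger for mixed clusters."""
    n = len(category_labels)
    counts = {}
    for label in category_labels:
        counts[label] = counts.get(label, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

print(cluster_entropy(["IP", "IP", "IP"]))        # 0.0, a pure cluster
print(cluster_entropy(["IP", "EC", "BC", "EC"]))  # > 0, a mixed cluster
```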
28
Feature Selection Criteria Used in the Experiments
  • In E9, features selected were most frequent words
    until accumulated frequency of 25 was reached.
  • In E10, the association rule algorithm (Apriori)
    was run on words (using words as items). Features
    selected were the words remaining in the frequent
    item sets discovered.

29
Entropy Comparison of Algorithms
Note: lower entropy indicates better cohesiveness of clusters.
30
Document Clusters in AutoClass
31
Document Clusters in ARHP
32
Applying ARHP to S&P 500 Stock Data
  • S&P 500 stock price movement from Jan. 1994 to Oct. 1996
  • Frequent itemsets from the stock data

Daily transactions:
Day 1: Intel-UP, Microsoft-UP, Morgan-Stanley-DOWN
Day 2: Intel-DOWN, Microsoft-DOWN, Morgan-Stanley-UP
Day 3: Intel-UP, Microsoft-DOWN, Morgan-Stanley-DOWN
. . .

Example frequent itemsets:
{Intel-UP, Microsoft-UP}
{Intel-DOWN, Microsoft-DOWN, Morgan-Stanley-UP}
{Morgan-Stanley-UP, MBNA-Corp-UP, Fed-Home-Loan-UP}
. . .
33
Clustering of S&P 500 Stock Data
Other clusters: Bank, Paper/Lumber, Motor/Machinery, Retail, Telecommunication, Tech/Electronics
34
Intelligent Web Assistant Agent (Mobasher, Lytinen, 1999-2000)