Detecting and Representing Relevant Web Deltas in WHOWEDA

1
Detecting and Representing Relevant Web Deltas in
WHOWEDA
  • Sanjay Kumar Madria
  • Department of Computer Science
  • University of Missouri-Rolla
  • madrias_at_umr.edu
  • Based on IEEE ICDCS '00 and IEEE TKDE (under minor
    revision)

2
Current Situation of W3
  • The Web allows information to change at any time
    and in any way
  • Two forms of change
  • Existence
  • Structure and content modification
  • A change leaves no trace of the previous document

3
Problems of Change Management
  • Problems
  • Detecting, Representing and Querying these
    changes
  • The problem is challenging
  • Typical database approaches that detect changes
    through triggering mechanisms are not usable
  • No access rights, no support for triggers
  • Information sources typically do not keep
    historical information in a format that is
    accessible to outside users

4
Applications
  • Provides the framework for
  • Web Site Administrator
  • Trend analysis and Mining
  • E-commerce
  • Customers of E-commerce Web Site
  • Competitive intelligence: product and price
    comparisons
  • Notification Services (with PDA)

5
Objectives
  • Web deltas: changes to web data
  • Detecting and representing relevant page-level
    web deltas
  • Changes that are relevant to the user's query,
    not arbitrary changes or web deltas
  • Restricted to page level
  • Detect those documents
  • which are added to the site
  • which are deleted from the site
  • which have undergone content or structural
    modification
  • Determine how these delta documents are related
    to one another and to other documents relevant
    to the user's query

6
Related Work
  • Lore (Stanford) change management (SIGMOD '97 and
    ICDE '98)
  • Contrast
  • OEM based, not applied to the Web
  • WebCQ (Georgia Tech)
  • Needs a set of URLs
  • No inter-document changes
  • HtmlDiff (AT&T)
  • Input: two versions
  • Output: marked-up copy that highlights changes
  • Contrast
  • Difficult to browse for large files
  • Ours is based on a query, not on any arbitrary
    change

7
Change Mgmt in DBMS
  • Two approaches
  • Snapshot collection at times t1, t2, ...
  • Snapshot deltas: D and ΔDs at times t1, t2, ...
  • Contrast: we use the snapshot delta approach, but
    with semi-structured data
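The snapshot delta approach can be illustrated with a minimal sketch. It assumes a snapshot is simply a dictionary mapping a URL to its content; the names `diff_snapshots` and `apply_delta` are hypothetical and are not taken from WHOWEDA.

```python
from typing import Dict, NamedTuple

Snapshot = Dict[str, str]  # assumption: URL -> page content


class Delta(NamedTuple):
    inserted: Dict[str, str]  # URLs present only in the new snapshot
    deleted: Dict[str, str]   # URLs present only in the old snapshot
    updated: Dict[str, str]   # URLs in both whose content differs (new content)


def diff_snapshots(old: Snapshot, new: Snapshot) -> Delta:
    """Compute the delta that turns the old snapshot into the new one."""
    inserted = {u: c for u, c in new.items() if u not in old}
    deleted = {u: c for u, c in old.items() if u not in new}
    updated = {u: c for u, c in new.items() if u in old and old[u] != c}
    return Delta(inserted, deleted, updated)


def apply_delta(old: Snapshot, delta: Delta) -> Snapshot:
    """Rebuild the newer snapshot from the older one plus the stored delta."""
    result = {u: c for u, c in old.items() if u not in delta.deleted}
    result.update(delta.inserted)
    result.update(delta.updated)
    return result


# Invented example data
t1 = {"/aids": "v1", "/cancer": "v1", "/diabetes": "v1"}
t2 = {"/aids": "v1", "/cancer": "v2", "/parkinsons": "v1"}
delta = diff_snapshots(t1, t2)
```

Storing `t1` and `delta` is enough to recover `t2`, which is the point of keeping deltas instead of full snapshots.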

8
Motivating Example
  • Assume that there is a web site at
    www.panacea.gov
  • It provides information related to drugs used for
    various diseases
  • Suppose that, on 15th January, a user wishes to
    find out periodically (every 30 days)
  • information related to side effects and uses of
    drugs used for various diseases and
  • changes to this information at the page level
    compared to its previous version

9
Structure of www.panacea.gov
  • www.panacea.gov contains a list of diseases
  • The link for each disease points to a web
    page containing a list of drugs used for
    prevention and cure of the disease
  • Hyperlinks associated with each drug point to
    documents containing a list of various issues
    related to that drug (description,
    manufacturers, clinical pharmacology, uses,
    side effects, etc.)
  • From the hyperlinks associated with each issue,
    one can retrieve details of these issues for a
    particular drug

10
A Snapshot as on 15th Jan
[Figure: site snapshot on 15th Jan.
 Diseases: AIDS, Cancer, Heart disease, Alzheimer's Disease,
 Diabetes, Impotence.
 Drugs: Indavir, Ritonavir, Hirudin, Niacin, Ibuprofen,
 Vasomax, Caverject.
 Link labels: Side effects, Uses.]
11
A Partial Snapshot as on 25th Jan
[Figure: partial site snapshot on 25th Jan.
 Node labels: www.panacea.gov, Cancer, Diabetes,
 Parkinson's Disease, Tolcapone.
 Link labels: Side effects, Uses.
 Annotations: update, New Link.]
12
A Partial Snapshot as on 30th Jan
[Figure: partial site snapshot on 30th Jan.
 Node labels: www.panacea.gov, Impotence, Caverject,
 Vasomax, Viagra.
 Link labels: Side effects, Uses.]
13
On 8th February
[Figure: partial site snapshot on 8th February.
 Node labels: www.panacea.gov, Heart disorder,
 Alzheimer's Disease, Hirudin, Niacin.
 Link labels: Side effects, Uses.]
14
A Snapshot as on 15th Feb
[Figure: site snapshot on 15th Feb.
 Diseases: AIDS, Alzheimer's Disease, Cancer, Heart disease,
 Parkinson's Disease, Impotence.
 Drugs: Indavir, Ritonavir, Hirudin, Niacin, Viagra,
 Vasomax, Caverject.]
15
Types of Changes
  • Insert Node
  • Delete Node
  • Update Node (update contents)
  • Insert Link: same as either insert node or
    update node
  • Delete Link: same as either delete node or
    update node
  • Update Link: same as update node

16
WHOWEDA Project
  • Key Objectives
  • Design a suitable data model to store web data,
    called WHOM (Warehouse of Object Model)
  • Development of web algebra and query language to
    extract and manipulate web data
  • Change Management of Web data
  • Development of knowledge discovery and web mining
    tools
  • Joint project with NTU, Singapore

17
Overview of WHOM
  • Collection of web tables
  • A web table is represented by a set of web tuples
    and a set of web schemas
  • Web tuple: a directed graph of nodes and links
    that satisfies a web schema
  • Nodes and links contain content, metadata and
    structural information associated with Web
    documents and hyperlinks
  • Tree representation (can handle XML)
  • Web algebra containing web operators to
    manipulate web tables
  • Global Coupling, Web Select, Web Join, etc.
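One way to picture the model above is the following minimal sketch. The class and field names are assumptions chosen for illustration, and the URLs, dates and link layout of the example tuple are invented; none of this is the actual WHOM schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Node:
    node_id: str        # identifier such as "a0"
    url: str            # metadata: location of the Web document
    last_modified: str  # metadata: last modification date


@dataclass(frozen=True)
class Link:
    source: str  # node_id of the document containing the hyperlink
    target: str  # node_id of the document the hyperlink points to
    label: str   # anchor text of the hyperlink


@dataclass
class WebTuple:
    """A directed graph of nodes and links."""
    nodes: List[Node]
    links: List[Link]

    def successors(self, node_id: str) -> List[str]:
        """Return ids of the nodes directly linked from node_id."""
        return [link.target for link in self.links if link.source == node_id]


# Hypothetical web tuple; every value below is invented for illustration.
drug_tuple = WebTuple(
    nodes=[
        Node("a0", "http://www.panacea.gov/", "2000-01-10"),
        Node("b0", "http://www.panacea.gov/aids", "2000-01-10"),
        Node("u0", "http://www.panacea.gov/aids/indavir", "2000-01-12"),
        Node("d0", "http://www.panacea.gov/aids/indavir/side", "2000-01-12"),
        Node("k0", "http://www.panacea.gov/aids/indavir/uses", "2000-01-12"),
    ],
    links=[
        Link("a0", "b0", "AIDS"),
        Link("b0", "u0", "Indavir"),
        Link("u0", "d0", "Side effects"),
        Link("u0", "k0", "Uses"),
    ],
)

# A web table is then a collection of such tuples.
drugs_table: List[WebTuple] = [drug_tuple]
```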

18
Step 1: Retrieving Snapshots of Web Data
Using Coupling Query Graph: Example
  • Suppose that, on 15th January, a user wishes to
    find out periodically (every 30 days) from the
    web site at www.panacea.gov
  • information related to side effects and uses of
    drugs used for various diseases
  • The result of the query is stored in the form of
    a web table

19
Pictorial Representation
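[Figure: coupling query graph. Node a (www.panacea.gov)
 links to node b (drug list); b is connected to node d
 (side effects) with label "1, 6" and to node k (uses)
 with label "1, 3".]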
20
Coupling Query
  • Set of node variables Xn: Xn = {a, b, d, k}
  • Each variable represents a set of Web documents
  • Set of link variables Xl: Xl = {-}
  • Each variable represents a set of hyperlinks
  • Set of predicates P defined over some of the node
    and link variables
  • P = {p1, p2, p3, p4}
  • p1(a) ≡ METADATA::a[url] EQUALS
    "www.panacea.gov"
  • p2(b) ≡ CONTENT::b[html.body.title]
    NON-ATTR-CONT "drug list"
  • p3(k) ≡ CONTENT::k[html.body.title]
    NON-ATTR-CONT "uses"
  • p4(d) ≡ CONTENT::d[html.body.title]
    NON-ATTR-CONT "side effects"

21
Coupling Query
  • Set of connectivities C defined over node and
    link variables
  • To specify hyperlink structure of the documents
  • Specify metadata, content or structural
    conditions
  • C = k1 AND k2 AND k3
  • k1 ≡ a<->b
  • k2 ≡ b<-{1,6}>d
  • k3 ≡ b<-{1,3}>k
  • Set of coupling query predicates Q
  • Conditions on execution of the query
  • Q = {q1}
  • q1(G) ≡ COUPLING_QUERY::G.polling_frequency
    EQUALS "30 days"
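A bounded connectivity such as k2 can be read as "d is reachable from b by following between 1 and 6 hyperlinks". The sketch below checks that condition on a small link graph. It is only an illustration under that reading: the function name `connected_within` and the example graph are hypothetical, and this is not the WHOWEDA query evaluator.

```python
from typing import Dict, List, Set


def connected_within(graph: Dict[str, List[str]], src: str, dst: str,
                     lo: int, hi: int) -> bool:
    """True if dst can be reached from src by following lo..hi links."""
    frontier: Set[str] = {src}
    for hops in range(1, hi + 1):
        # Nodes reachable in exactly `hops` links from src.
        frontier = {t for s in frontier for t in graph.get(s, [])}
        if hops >= lo and dst in frontier:
            return True
        if not frontier:
            break
    return False


# Invented link graph: a -> b -> u -> {d, k}
graph = {"a": ["b"], "b": ["u"], "u": ["d", "k"]}

k1_ok = connected_within(graph, "a", "b", 1, 1)  # a<->b
k2_ok = connected_within(graph, "b", "d", 1, 6)  # b<-{1,6}>d
k3_ok = connected_within(graph, "b", "k", 1, 3)  # b<-{1,3}>k
```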

22
Web Table Drugs (15th Jan)
[Figure: web tuples of table Drugs. Labels shown per tuple:
 (a0, b0, u0, d0, k0; AIDS, Indavir)
 (a0, b0, u1, d1, k1; AIDS, Ritonavir)
 (a0, b1, d2, k2; Cancer, Beta Carotene)
 (a0, b5, d12, k12; Alzheimer's Disease, Ibuprofen)]
23
Web Table Drugs (15th Jan)
24
Web Table New Drugs (15th Feb)
[Figure: web tuple of table New Drugs. Labels shown:
 (a0, b1, d2, k2; Cancer, Beta Carotene)]
25
Web Table New Drugs (15th Feb)
[Figure: web tuple of table New Drugs. Labels shown:
 (a0, b4, u7, d6, k7; Impotence, Caverject)]
26
Web Table New Drugs (15th Feb)
27
Storage of Web Objects
  • Warehouse node pool: distinct nodes; each node
    has a node-id and version-ids
  • Warehouse document pool: the actual documents
  • Web table pool
  • Table node pool: type identifier, name that the
    node or link represents in the schema, link-id,
    version-ids, URL of the node, target node-id,
    label, and link type of the link
  • Web tuple pool: ids of all the nodes and links
    belonging to a web tuple
  • Web schema pool: stores the web schema and
    coupling query

28
Step 2: Performing Web Join, Left and Right Outer
Web Join
  • Web Join
  • Combine two web tables by concatenating two web
    tuples whenever there exist joinable nodes
  • Two nodes are joinable if they are identical
  • Two nodes are identical if the URL and last
    modification date of the nodes are the same
  • The joined web tuple is stored in a different web
    table
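The joinability test and the join itself can be sketched as follows. This is a simplified illustration, not the algorithm from the slides: a node is reduced to a (URL, last-modification date) pair, a web tuple to a set of such nodes with links ignored, and a joined tuple to an (old, new) pair. All names and data are hypothetical.

```python
from typing import FrozenSet, List, Set, Tuple

Node = Tuple[str, str]      # assumption: (url, last_modified)
WebTuple = FrozenSet[Node]  # simplification: links are ignored


def joinable_nodes(old: List[WebTuple], new: List[WebTuple]) -> Set[Node]:
    """Nodes with the same URL and last-modification date in both tables."""
    return set().union(*old) & set().union(*new)


def web_join(old: List[WebTuple],
             new: List[WebTuple]) -> List[Tuple[WebTuple, WebTuple]]:
    """Pair every old tuple with every new tuple sharing an identical node."""
    return [(o, n) for o in old for n in new if o & n]


# Invented data: /aids is unchanged, every other page was modified.
old_table = [
    frozenset({("/aids", "jan-01"), ("/indavir", "jan-01")}),
    frozenset({("/cancer", "jan-01"), ("/beta", "jan-01")}),
]
new_table = [
    frozenset({("/aids", "jan-01"), ("/indavir", "feb-02")}),
    frozenset({("/cancer", "feb-03"), ("/beta", "feb-03")}),
]

shared = joinable_nodes(old_table, new_table)
joined = web_join(old_table, new_table)
```

Only the first pair of tuples joins, because `/aids` is the one node whose URL and modification date match in both versions.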

29
Web Join
  • Join web tables Drugs and New Drugs
  • Nodes which have not undergone any changes are
    the joinable nodes in these two web tables.
  • Content modified nodes, new nodes and deleted
    nodes cannot be joinable nodes

30
Joined web table
[Figure: joined web tuples (1) and (2).
 Tuple (1) labels: a0, b0, u0, d0, k0, AIDS, Indavir.
 Tuple (2) labels: a0, b0, u1, d1, k1, AIDS, Ritonavir.]
31
Joined Web Table
[Figure: joined web tuple (4).
 Labels: a0, b2, u3, d7, k4, Heart Disorder, Niacin;
 a0, u2, d3, k3, Heart Disease, Hirudin.]
32
Joined Table
[Figure: joined web tuple (6).
 Labels: a0, b2, u2, d3, k3, Heart Disease, Hirudin;
 a0, u2, d3, k3, Heart Disorder, Hirudin.]
33
Types of web tuples
  • Web tuples in which all the nodes are joinable
  • Results of joining two versions of web tuples
    that have remained unchanged during the transition
  • Web tuples in which
  • some of the nodes are joinable nodes
  • remaining nodes are the result of insertion,
    deletion or modification operations

34
Types of web tuples
  • Tuples in which
  • Some of the nodes are joinable nodes
  • Of the remaining nodes, some are the result of
    insertion, deletion or modification and
  • the others remained unchanged during the
    transition, but may be joinable with others

35
Algorithm for Computing joinable nodes
36
Algorithm of web join
37
Algorithm of web join (continued)
38
Outer Web Join
  • Web tuples that do not participate in the web
    join process (dangling web tuples) are absent
    from the joined web table
  • Outer web join enables us to identify them
  • Left outer web join
  • Right outer web join
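Dangling tuples can be found with a small helper, sketched here under the same simplifications as before: a node is a (URL, last-modification date) pair and a web tuple is a set of nodes. The helper name `dangling` and the data are hypothetical.

```python
from typing import FrozenSet, List, Tuple

Node = Tuple[str, str]      # assumption: (url, last_modified)
WebTuple = FrozenSet[Node]  # simplification: links are ignored


def dangling(table: List[WebTuple], other: List[WebTuple]) -> List[WebTuple]:
    """Tuples of `table` sharing no identical node with any tuple of `other`."""
    other_nodes = set().union(*other)
    return [t for t in table if not (t & other_nodes)]


# Invented data: /aids is unchanged, every other page was modified.
old_table = [
    frozenset({("/aids", "jan-01"), ("/indavir", "jan-01")}),
    frozenset({("/cancer", "jan-01"), ("/beta", "jan-01")}),
]
new_table = [
    frozenset({("/aids", "jan-01"), ("/indavir", "feb-02")}),
    frozenset({("/cancer", "feb-03"), ("/beta", "feb-03")}),
]

left_outer = dangling(old_table, new_table)   # old tuples that did not join
right_outer = dangling(new_table, old_table)  # new tuples that did not join
```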

39
Types of web tuples (Right Outer)
  • New web tuples which are added during the
    transition
  • These tuples contain some new nodes, and the
    contents of the remaining nodes have changed
  • Tuples in which all the nodes have undergone
    content modification
  • Tuples which existed before and in which some of
    the nodes are new and the contents of the
    remaining nodes have changed

40
Web Table New Drugs (15th Feb)
[Figure: web tuple of table New Drugs. Labels shown:
 (a0, b1, d2, k2; Cancer, Beta Carotene)]
41
Web Table New Drugs (15th Feb)
[Figure: web tuple of table New Drugs. Labels shown:
 (a0, b4, u7, d6, k7; Impotence, Caverject)]
42
Web Table New Drugs (15th Feb)
43
Types of web tuples (Left Outer)
  • Web tuples which are deleted during the
    transition
  • These tuples do not occur in the new web table
  • Tuples in which all the nodes have undergone
    content modification
  • Tuples in which some of the nodes are deleted and
    the contents of the remaining nodes have changed

44
Web Table Drugs (15th Jan)
[Figure: web tuples of table Drugs. Labels shown per tuple:
 (a0, b0, u0, d0, k0; AIDS, Indavir)
 (a0, b0, u1, d1, k1; AIDS, Ritonavir)]
45
Web Table Drugs (15th Jan)
46
Algorithm of outer web join
47
Algorithm of outer web join (continued)
48
Step 3: Generating Delta Web Tables
  • Input
  • Joined, left outer joined and right outer joined
    web tables
  • Output
  • Set of delta web tables

49
Delta Web Tables
  • Encapsulate the relevant changes that have
    occurred in the Web with respect to a user's
    query
  • Three types
  • Delta+ web table
  • Contains a set of tuples containing new nodes
    inserted during the transition
  • Delta- web table
  • Set of web tuples containing nodes removed during
    the transition
  • Delta-M web table
  • Set of web tuples representing the previous and
    current sets of modified nodes

50
Steps for Generation
  • Phase 1: Delta Nodes Identification Phase
  • Nodes which are added, deleted or modified during
    the transition are identified
  • Input: old and new versions of the web tables and
    a set of joinable nodes from the joined web table
  • Output
  • Nodes which exist in the new web table but not in
    the old web table are the new nodes
  • Nodes which exist in the old web table but not in
    the new one are the deleted nodes
  • Nodes which exist in both web tables but are
    not joinable are the nodes which have undergone
    content modification
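The three output rules of Phase 1 amount to set operations. The following is a minimal sketch, assuming each version of a web table has been flattened to a dictionary from node URL to last-modification date; the function name `delta_nodes` and the data are hypothetical.

```python
from typing import Dict, Set, Tuple


def delta_nodes(old: Dict[str, str],
                new: Dict[str, str]) -> Tuple[Set[str], Set[str], Set[str]]:
    """Return (added, deleted, modified) node URLs.

    old and new map a node URL to its last-modification date.
    """
    added = set(new) - set(old)      # in the new table only
    deleted = set(old) - set(new)    # in the old table only
    # In both tables but not joinable, i.e. the modification date differs.
    modified = {u for u in set(old) & set(new) if old[u] != new[u]}
    return added, deleted, modified


# Invented data
old_nodes = {"/aids": "jan-10", "/indavir": "jan-10", "/diabetes": "jan-05"}
new_nodes = {"/aids": "jan-10", "/indavir": "feb-01", "/viagra": "jan-30"}

added, deleted, modified = delta_nodes(old_nodes, new_nodes)
```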

51
Steps for Generation
  • Phase 2: Delta Tuples Identification Phase
  • Determines how the delta nodes are related to one
    another and how they are associated with those
    nodes which have remained unchanged
  • We identify those tuples which contain nodes
    which are added, deleted or modified during the
    transition
  • Input: joined, left outer joined and right outer
    joined web tables, sets of delta nodes
  • Output: sets of web tuples represented by Delta+,
    Delta- and Delta-M web tables
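The grouping performed in Phase 2 can be pictured with the sketch below. It is deliberately simplified: a tuple is a set of node URLs, all candidate tuples are scanned in one pass rather than table by table, and the function name `delta_tables` and the dictionary keys are hypothetical. It is not the Delta algorithm of the slides.

```python
from typing import Dict, FrozenSet, Iterable, List, Set

WebTuple = FrozenSet[str]  # simplification: a tuple is a set of node URLs


def delta_tables(tuples: Iterable[WebTuple], added: Set[str],
                 deleted: Set[str],
                 modified: Set[str]) -> Dict[str, List[WebTuple]]:
    """Group tuples by the kind of delta node they contain."""
    tables: Dict[str, List[WebTuple]] = {
        "delta_plus": [], "delta_minus": [], "delta_m": [],
    }
    for t in tuples:
        if t & added:
            tables["delta_plus"].append(t)   # contains an inserted node
        if t & deleted:
            tables["delta_minus"].append(t)  # contains a removed node
        if t & modified:
            tables["delta_m"].append(t)      # contains a modified node
    return tables


# Invented candidate tuples, e.g. gathered from the joined and outer joined tables
candidates = [
    frozenset({"/site", "/impotence", "/viagra"}),
    frozenset({"/site", "/diabetes"}),
    frozenset({"/site", "/aids", "/indavir"}),
    frozenset({"/site", "/cancer"}),
]

result = delta_tables(candidates, added={"/viagra"},
                      deleted={"/diabetes"}, modified={"/indavir"})
```

A tuple with no delta node, such as the last candidate, lands in none of the three tables.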

52
Phase 2 (Delta+ Web Table)
  • Scan joined and right outer joined web tables to
    identify web tuples containing nodes which are
    inserted during the transition
  • New nodes can occur in these tables
  • In the right outer joined table, if the remaining
    nodes in the tuple containing the new nodes are
    modified (hence not joinable)
  • In the joined web table, if some of the nodes in
    the tuple containing new nodes have remained
    unchanged and hence are joinable
  • These web tuples are stored in the Delta+ Web Table

53
Example (Right Outer Web Join)
54
Example (Joined Web Table)
[Figure: joined web tuple (4).
 Labels: a0, b2, u3, d7, k7, Heart Disorder, Niacin;
 a0, u2, d3, k3, Heart Disease, Hirudin.]
55
Delta+ Web Table
56
Phase 2 (Delta- Web Table)
  • Scan joined and left outer joined web tables to
    identify web tuples containing nodes which are
    deleted during the transition
  • Deleted nodes can occur in these tables only
    because
  • In the left outer joined table, if the remaining
    nodes in the tuple containing the deleted nodes
    are modified (hence not joinable)
  • In the joined web table, if some of the nodes in
    the tuple containing deleted nodes have remained
    unchanged and hence are joinable
  • These web tuples are stored in the Delta- Web Table

57
Example (Left Outer Web Join)
58
Example (Joined Web Table)
[Figure: joined web tuple (5).
 Labels: a0, b4, u7, u8, d6, k7, Impotence, Caverject.]
59
Delta- Web Table
60
Phase 2 (Delta-M Web Table)
  • Finally, nodes which are modified during the
    transition can be identified by inspecting all
    three web tables
  • Tuples in the left and right outer joined tables
    which do not contain any new or deleted node
    represent the old and new version of these nodes
    respectively
  • These tuples do not occur in the joined web table
    as all the nodes are modified
  • Tuples in left and right outer joined tables that
    contain modified nodes as well as inserted or
    deleted nodes
  • These modified nodes may not appear in the joined
    web table if no other joinable web tuples contain
    these modified nodes

61
Phase 2
  • Tuples in the joined web tables where some of the
    nodes represent the old and new version of these
    modified nodes
  • These web tuples are stored in Delta-M Web Table

62
Example (Right Outer Web Join)
63
Example (Left Outer Web Join)
64
Example (Joined web table)
[Figure: joined web tuples (1) and (2).
 Tuple (1) labels: a0, b0, u0, d0, k0, AIDS, Indavir.
 Tuple (2) labels: a0, b0, u1, d1, k1, AIDS, Ritonavir.]
65
Delta-M Web Table
[Figure: Delta-M web table, tuples (1) to (3).
 Tuple (1) labels: a0, b0, u0, d0, k0, AIDS, Indavir.
 Tuple (2) labels: a0, b0, u1, d1, k1, AIDS, Ritonavir.
 Tuple (3) labels: a0, b4, u7, u8, d6, k7, Impotence,
 Caverject.]
66
Delta-M Web Table
[Figure: Delta-M web table, tuples (4) and (5).
 Labels shown: a0, b2, u2, d3, k3, Heart Disease,
 Heart Disorder, Hirudin.]
67
Algorithm Delta
68
Algorithm Delta (continued)
69
Algorithm Delta (continued)
70
Algorithm of GenerateResult Tables
71
Algorithm of GenerateResult Tables (continued)
72
Algorithm for DeltasFromRightOuter
73
Algorithm for DeltasFromLeftOuter
74
Algorithm of DeltasFromJoin
75
Algorithm of DeltasFromJoin (continued)
76
Algorithm of CreateDeltaPlus
77
Future Work
  • Analytical and empirical studies of the
    algorithms for generating delta web tables
  • Mechanism to represent changes: modified, new or
    deleted nodes
  • Annotation on delta nodes
  • Extend to sub-page level
  • Query languages for querying the changes
  • Change notification service