Title: Detecting and Representing Relevant Web Deltas in WHOWEDA
1Detecting and Representing Relevant Web Deltas in
WHOWEDA
- Sanjay Kumar Madria
- Department of Computer Science
- University of Missouri-Rolla
- madrias_at_umr.edu
- Based on IEEE ICDCS'00 and IEEE TKDE (under minor revision)
2Current Situation of W3
- The Web allows information to change at any time and in any way
- Two forms of change
- Existence
- Structure and content modification
- A change leaves no trace of the previous document
3Problems of Change Management
- Problems
- Detecting, representing and querying these changes
- The problem is challenging
- Typical database approaches that detect changes through triggering mechanisms are not usable
- No access rights, no support for triggers
- Information sources typically do not keep historical information in a format that is accessible to the outside user
4Applications
- Provides the framework for
- Web Site Administrator
- Trend analysis and Mining
- E-commerce
- Customers of E-commerce Web Site
- Competitive intelligence: product and price comparisons
- Notification services (with PDA)
5Objectives
- Web deltas: changes to web data
- Detecting and representing relevant page-level web deltas
- Changes that are relevant to the user's query, not arbitrary changes or web deltas
- Restricted to page level
- Detect those documents
- which are added to the site
- which are deleted from the site
- which have undergone content or structural modification
- How these delta documents are related to one another and to other documents relevant to the user's query
6Related Work
- Lore (Stanford) change management (SIGMOD'97 and ICDE'98)
- Contrast
- OEM based, not applied on Web
- WebCQ (Georgia Tech)
- Needs a set of URLs.
- No interdocument changes
- Htmldiff (AT&T)
- Input: two versions
- Output: marked-up copy highlighting the changes
- Contrast
- Difficult to browse in the case of a large file
- Ours is based on a query, not on any arbitrary change
7Change Mgmt in DBMS
- Two Approaches
- Snapshot collection at times t1, t2, ...
- Snapshot deltas: D and ΔDs at times t1, t2, ...
- Contrast: we use the snapshot delta approach, but with semi-structured data
8Motivating Example
- Assume that there is a web site at www.panacea.gov
- Provides information related to drugs used for various diseases
- Suppose, on 15th January, a user wishes to find out periodically (every 30 days)
- information related to side effects and uses of drugs used for various diseases, and
- changes to this information at the page level compared to its previous version
9Structure of www.panacea.gov
- www.panacea.gov contains a list of diseases
- Each link of a particular disease points to a web page containing a list of drugs used for prevention and cure of the disease
- The hyperlink associated with each drug points to documents containing a list of various issues related to that drug (description, manufacturers, clinical pharmacology, uses, side effects, etc.)
- From the hyperlinks associated with each issue, one can retrieve details of these issues for a particular drug
10A Snapshot as on 15th Jan
[Figure: site graph of www.panacea.gov as on 15th Jan. Labels shown: AIDS, Cancer, Heart disease, Alzheimers Disease, Diabetes, Impotence; Indavir, Ritonavir, Hirudin, Niacin, Ibuprofen, Vasomax, Caverject; Uses and Side effects pages.]
11A Partial Snapshot as on 25th Jan
[Figure: partial site graph as on 25th Jan. Labels shown: www.panacea.gov, Cancer, Diabetes, Parkinsons Disease, Tolcapone, Uses, Side effects; annotations: update, New Link.]
12A Partial Snapshot as on 30th Jan
[Figure: partial site graph as on 30th Jan. Labels shown: www.panacea.gov, Impotence, Caverject, Vasomax, Viagra, Uses, Side effects.]
13On 8th February
[Figure: partial site graph on 8th February. Labels shown: www.panacea.gov, Heart disorder, Alzheimers Disease, Hirudin, Niacin, Uses, Side effects.]
14A Snapshot as on 15th Feb
[Figure: site graph as on 15th Feb. Labels shown: AIDS, Alzheimers Disease, Cancer, Heart disease, Parkinsons Disease, Impotence; Indavir, Ritonavir, Hirudin, Niacin, Viagra, Vasomax, Caverject.]
15Types of Changes
- Insert Node
- Delete Node
- Update Node (update contents)
- Insert Link: same as either insert node or update node
- Delete Link: same as either delete node or update node
- Update Link: same as update node
16WHOWEDA Project
- Key Objectives
- Design a suitable data model to store web data, called WHOM (Warehouse Object Model)
- Development of a web algebra and query language to extract and manipulate web data
- Change management of web data
- Development of knowledge discovery and web mining tools
- Joint project with NTU, Singapore
17Overview of WHOM
- Collection of web tables
- A set of web tuples and a set of web schemas represent a web table
- Web tuple: a directed graph containing nodes and links that satisfies a web schema
- Nodes and links contain content, metadata and structural information associated with Web documents and hyperlinks
- Tree representation (can handle XML)
- Web algebra containing web operators to manipulate web tables
- Global Coupling, Web Select, Web Join, etc.
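The WHOM bullets above can be pictured with a small data-structure sketch. This is only an illustration under stated assumptions, not the WHOWEDA implementation: the class names (`Node`, `Link`, `WebTuple`, `WebTable`), the page paths, and the timestamp strings are hypothetical, and web schemas are left out entirely. The node ids a0, b0, u0, d0, k0 are borrowed from the Drugs web table shown later in the deck.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Node:
    """A Web document: an id, some metadata, and a bit of content."""
    node_id: str
    url: str
    last_modified: str  # kept as an opaque string in this sketch
    title: str = ""


@dataclass(frozen=True)
class Link:
    """A hyperlink between two nodes of the same web tuple."""
    source_id: str
    target_id: str
    label: str = ""


@dataclass
class WebTuple:
    """A directed graph of nodes and links (schema checking omitted)."""
    nodes: Dict[str, Node] = field(default_factory=dict)
    links: List[Link] = field(default_factory=list)

    def add_node(self, node: Node) -> None:
        self.nodes[node.node_id] = node

    def add_link(self, source_id: str, target_id: str, label: str = "") -> None:
        if source_id not in self.nodes or target_id not in self.nodes:
            raise KeyError("both link endpoints must already be nodes of the tuple")
        self.links.append(Link(source_id, target_id, label))


@dataclass
class WebTable:
    """A named collection of web tuples."""
    name: str
    tuples: List[WebTuple] = field(default_factory=list)


# Hypothetical paths and timestamps; only the host name comes from the slides.
t = WebTuple()
for node_id, path, title in [
    ("a0", "", "Diseases"),
    ("b0", "/aids", "AIDS drug list"),
    ("u0", "/aids/indavir", "Indavir"),
    ("d0", "/aids/indavir/side-effects", "Side effects"),
    ("k0", "/aids/indavir/uses", "Uses"),
]:
    t.add_node(Node(node_id, "www.panacea.gov" + path, "15 Jan", title))
for src, dst in [("a0", "b0"), ("b0", "u0"), ("u0", "d0"), ("u0", "k0")]:
    t.add_link(src, dst)

drugs = WebTable("Drugs", [t])
```

Keeping nodes in a dictionary keyed by node id makes it cheap to look up a document when two versions of a table are compared later.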
18Step 1 Retrieving Snapshots of Web Data
Using Coupling Query Graph Example
- Suppose, on 15th January, a user wishes to find out periodically (every 30 days) from the web site at www.panacea.gov
- information related to side effects and uses of drugs used for various diseases
- The result of the query is stored in the form of a web table
19Pictorial Representation
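[Figure: coupling query graph. Node a (www.panacea.gov) links to node b (drug list); b connects to d (side effects) through a link path annotated 1, 6 and to k (uses) through a link path annotated 1, 3.]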
20Coupling Query
- Set of node variables Xn: Xn = {a, b, d, k}
- Each variable represents a set of Web documents
- Set of link variables Xl: Xl = {-}
- Each variable represents a set of hyperlinks
- Set of predicates P defined over some of the node and link variables
- P = {p1, p2, p3, p4}
- p1(a) ≡ METADATA::a[url] EQUALS "www.panacea.gov"
- p2(b) ≡ CONTENT::b[html.body.title] NON-ATTR-CONT "drug list"
- p3(k) ≡ CONTENT::k[html.body.title] NON-ATTR-CONT "uses"
- p4(d) ≡ CONTENT::d[html.body.title] NON-ATTR-CONT "side effects"
21Coupling Query
- Set of connectivities C defined over node and link variables
- To specify the hyperlink structure of the documents
- Specify metadata, content or structural conditions
- C ≡ k1 AND k2 AND k3
- k1 ≡ a<->b
- k2 ≡ b<-{1, 6}>d
- k3 ≡ b<-{1, 3}>k
- Set of coupling query predicates Q
- Conditions on execution of the query
- Q = {q1}
- q1(G) ≡ COUPLING_QUERY::G[polling_frequency] EQUALS "30 days"
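To make the pieces of the coupling query concrete, the sketch below encodes the predicates, connectivities and polling frequency as plain Python data and checks one candidate binding of a, b, d, k against a toy site. Several things here are assumptions rather than WHOWEDA semantics: NON-ATTR-CONT is approximated by a case-insensitive substring test on the title, a connectivity such as b<-{1, 6}>d is read as "d is reachable from b by a path of 1 to 6 links", a<->b is read as exactly one link, and the page ids, paths and titles of the toy site are hypothetical.

```python
def within_hops(graph, source, target, min_hops, max_hops):
    """True if target is reachable from source by a path of min_hops..max_hops links."""
    frontier = {source}
    for hops in range(1, max_hops + 1):
        # Expand one link further from every page reached so far.
        frontier = {nxt for cur in frontier for nxt in graph.get(cur, ())}
        if hops >= min_hops and target in frontier:
            return True
    return False


COUPLING_QUERY = {
    # One predicate per node variable (p1..p4 on the slide).
    "predicates": {
        "a": lambda page: page["url"] == "www.panacea.gov",
        "b": lambda page: "drug list" in page["title"].lower(),
        "k": lambda page: "uses" in page["title"].lower(),
        "d": lambda page: "side effects" in page["title"].lower(),
    },
    # (source variable, min links, max links, target variable) for k1..k3.
    "connectivities": [("a", 1, 1, "b"), ("b", 1, 6, "d"), ("b", 1, 3, "k")],
    # q1: how often the query is re-executed.
    "polling_frequency_days": 30,
}


def satisfies(binding, pages, graph, query=COUPLING_QUERY):
    """Check a binding (variable -> page id) against predicates and connectivities."""
    for var, predicate in query["predicates"].items():
        if not predicate(pages[binding[var]]):
            return False
    return all(
        within_hops(graph, binding[src], binding[dst], lo, hi)
        for src, lo, hi, dst in query["connectivities"]
    )


# Toy site (hypothetical ids, paths and titles).
pages = {
    "home": {"url": "www.panacea.gov", "title": "Diseases"},
    "aids": {"url": "www.panacea.gov/aids", "title": "AIDS drug list"},
    "ind": {"url": "www.panacea.gov/aids/indavir", "title": "Indavir"},
    "ind_se": {"url": "www.panacea.gov/aids/indavir/se", "title": "Indavir side effects"},
    "ind_uses": {"url": "www.panacea.gov/aids/indavir/uses", "title": "Indavir uses"},
}
graph = {"home": ["aids"], "aids": ["ind"], "ind": ["ind_se", "ind_uses"]}

good = {"a": "home", "b": "aids", "d": "ind_se", "k": "ind_uses"}
bad = {"a": "home", "b": "aids", "d": "ind_uses", "k": "ind_uses"}
print(satisfies(good, pages, graph), satisfies(bad, pages, graph))
```

Each binding that satisfies the query corresponds to one web tuple in the resulting web table.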
22Web Table Drugs (15th Jan)
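[Figure: web table Drugs (15th Jan). Web tuples shown: nodes a0, b0, u0, d0, k0 (labels AIDS, Indavir); nodes a0, b0, u1, d1, k1 (labels AIDS, Ritonavir); nodes a0, b1, d2, k2 (labels Cancer, Beta Carotene); nodes a0, b5, d12, k12 (labels Alzheimers Disease, Ibuprofen).]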
23Web Table Drugs (15th Jan)
24Web Table New Drugs (15th Feb)
[Figure: web table New Drugs (15th Feb). Web tuple shown: nodes a0, b1, d2, k2 (labels Cancer, Beta Carotene).]
25Web Table New Drugs (15th Feb)
[Figure: web table New Drugs (15th Feb). Web tuple shown: nodes a0, b4, u7, d6, k7 (labels Impotence, Caverject).]
26Web Table New Drugs (15th Feb)
27Storage of Web Objects
- Warehouse node pool: distinct nodes; each node has a node-id and version-ids
- Warehouse document pool: actual documents
- Web table pool
- Table node pool: type identifier (the name that the node or link represents in the schema), link-id, version-ids, URL of the node, target node-id, label, and link type of the link
- Web tuple pool: ids of all the nodes and links belonging to a web tuple
- Web schema pool: stores the web schema and coupling query
28Step 2 Performing Web Join, Left and Right Outer
Web Join
- Web Join
- Combine two web tables by concatenating two web tuples whenever there exist joinable nodes
- Two nodes are joinable if they are identical
- Two nodes are identical if the URL and last modification date of the nodes are the same
- The joined web tuple is stored in a different web table
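The join rule above can be sketched in a few lines. This is a minimal illustration, assuming a web tuple is just a list of node dictionaries with `url` and `last_modified` fields and that a joined tuple is represented as the pair of input tuples plus the keys they share; the helper names, paths and timestamps are hypothetical, and the deck's own join algorithm (slides 35 to 37) is not reproduced here.

```python
def node_key(node):
    """Two nodes are identical when URL and last modification date both match."""
    return (node["url"], node["last_modified"])


def web_join(old_table, new_table):
    """Pair every old tuple with every new tuple that shares a joinable node."""
    joined = []
    for t_old in old_table:
        keys_old = {node_key(n) for n in t_old}
        for t_new in new_table:
            shared = keys_old & {node_key(n) for n in t_new}
            if shared:
                # The joined web tuple is kept in a separate result table.
                joined.append((t_old, t_new, shared))
    return joined


def n(url, last_modified):
    return {"url": url, "last_modified": last_modified}


# Hypothetical paths and timestamps.
old_drugs = [[
    n("www.panacea.gov", "10 Jan"),
    n("www.panacea.gov/aids", "10 Jan"),
    n("www.panacea.gov/aids/indavir/se", "12 Jan"),
]]
new_drugs = [
    [
        n("www.panacea.gov", "10 Jan"),                  # unchanged: joinable
        n("www.panacea.gov/aids", "10 Jan"),             # unchanged: joinable
        n("www.panacea.gov/aids/indavir/se", "03 Feb"),  # modified: not joinable
    ],
    [n("www.panacea.gov/impotence/viagra", "30 Jan")],   # new: not joinable
]

joined_table = web_join(old_drugs, new_drugs)
```

Only the first new tuple joins with the old one; the modified side-effects page shares a URL with its old version but not a modification date, so it does not count as joinable.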
29Web Join
- Join web tables Drugs and New Drugs
- Nodes which have not undergone any changes are the joinable nodes in these two web tables
- Content-modified nodes, new nodes and deleted nodes cannot be joinable nodes
30Joined web table
[Figure: joined web table, tuples (1) and (2). Tuple (1): node ids a0, b0, u0, d0, k0; labels AIDS, Indavir. Tuple (2): node ids a0, b0, u1, d1, k1; labels AIDS, Ritonavir.]
31Joined Web Table
[Figure: joined web table, tuple (4). Node ids shown: a0, b2, u3, d7, k4 (labels Heart Disorder, Niacin) and a0, u2, d3, k3 (labels Heart Disease, Hirudin).]
32Joined Table
[Figure: joined web table, tuple (6). Node ids shown: a0, b2, u2, d3, k3 (labels Heart Disease, Hirudin) and a0, u2, d3, k3 (labels Heart Disorder, Hirudin).]
33Types of web tuples
- Web tuples in which all the nodes are joinable
- Results of joining two versions of web tuples that have remained unchanged during the transition
- Web tuples in which
- some of the nodes are joinable nodes
- the remaining nodes are the result of insertion, deletion or modification operations
34Types of web tuples
- Tuples in which
- Some of the nodes are joinable nodes
- Out of the remaining nodes, some are the result of insertion, deletion or modification, and
- the remaining ones remained unchanged during the transition, but may be joinable with others
35Algorithm for Computing joinable nodes
36Algorithm of web join
37Algorithm of web join (continued)
38Outer Web Join
- Web tuples that do not participate in the web join process (dangling web tuples) are absent from the joined web table
- Outer web join enables us to identify them
- Left outer web join
- Right outer web join
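A rough way to picture the two outer web joins is as the dangling tuples of each input table. The sketch below assumes a web tuple is a list of (url, last modification date) pairs and treats a tuple as dangling when it shares no identical node with any tuple of the other table; the function name, paths and timestamps are hypothetical, and this is not the deck's own outer web join algorithm (slides 46 and 47).

```python
def outer_web_joins(old_table, new_table):
    """Return (left_outer, right_outer): the dangling old and new web tuples."""

    def keys(web_tuple):
        # A node is identified by its (url, last_modified) pair.
        return set(web_tuple)

    left_outer = [
        t for t in old_table
        if not any(keys(t) & keys(u) for u in new_table)
    ]
    right_outer = [
        u for u in new_table
        if not any(keys(u) & keys(t) for t in old_table)
    ]
    return left_outer, right_outer


# Hypothetical paths and timestamps.
old_table = [
    [("p/aids", "10 Jan"), ("p/aids/indavir", "10 Jan")],
    [("p/alzheimers", "10 Jan"), ("p/alzheimers/ibuprofen", "10 Jan")],
]
new_table = [
    [("p/aids", "10 Jan"), ("p/aids/indavir", "03 Feb")],
    [("p/impotence", "30 Jan"), ("p/impotence/viagra", "30 Jan")],
]

left_outer, right_outer = outer_web_joins(old_table, new_table)
```

Here the Ibuprofen tuple survives only in the left outer result and the Viagra tuple only in the right outer result, while the AIDS tuples join through the unchanged `p/aids` node and appear in neither.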
39Types of web tuples (Right Outer)
- New web tuples which are added during the transition
- These tuples contain some new nodes, and the content of the remaining ones has changed
- Tuples in which all the nodes have undergone content modification
- Tuples which existed before and in which some of the nodes are new and the content of the remaining ones has changed
40Web Table New Drugs (15th Feb)
[Figure: web table New Drugs (15th Feb). Web tuple shown: nodes a0, b1, d2, k2 (labels Cancer, Beta Carotene).]
41Web Table New Drugs (15th Feb)
[Figure: web table New Drugs (15th Feb). Web tuple shown: nodes a0, b4, u7, d6, k7 (labels Impotence, Caverject).]
42Web Table New Drugs (15th Feb)
43Types of web tuples (Left Outer)
- Web tuples which are deleted during the transition
- These tuples do not occur in the new web table
- Tuples in which all the nodes have undergone content modification
- Tuples in which some of the nodes are deleted and the content of the remaining ones has changed
44Web Table Drugs (15th Jan)
[Figure: web table Drugs (15th Jan). Web tuples shown: nodes a0, b0, u0, d0, k0 (labels AIDS, Indavir); nodes a0, b0, u1, d1, k1 (labels AIDS, Ritonavir).]
45Web Table Drugs (15th Jan)
46Algorithm of outer web join
47Algorithm of outer web join (continued)
48Step 3 Generating Delta Web Tables
- Input
- Joined, left outer joined and right outer joined web tables
- Output
- Set of delta web tables
49Delta Web Tables
- Encapsulate the relevant changes that have occurred in the Web with respect to a user's query
- Three types
- Delta+ web table
- Set of web tuples containing new nodes inserted during the transition
- Delta- web table
- Set of web tuples containing nodes removed during the transition
- Delta-M web table
- Set of web tuples representing the previous and current sets of modified nodes
50Steps for Generation
- Phase 1: Delta Nodes Identification Phase
- Nodes which are added, deleted or modified during the transition are identified
- Input: old and new versions of the web tables and the set of joinable nodes from the joined web table
- Output
- Nodes which exist in the new web table but not in the old web table are the new nodes
- Nodes which exist in the old web table but not in the new one are the deleted nodes
- Nodes which exist in both web tables but are not joinable are the nodes which have undergone content modification
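The three output rules of Phase 1 map directly onto set operations. The sketch below is an illustration only, assuming each node is represented by a plain identifier (for example its URL) and that the set of joinable nodes has already been computed by the web join; the function name and the toy identifiers are hypothetical.

```python
def identify_delta_nodes(old_nodes, new_nodes, joinable_nodes):
    """Split nodes into new, deleted and modified sets.

    old_nodes, new_nodes: node identifiers present in the old and new web tables.
    joinable_nodes: identifiers of nodes found identical by the web join.
    """
    old_nodes, new_nodes = set(old_nodes), set(new_nodes)
    return {
        # In the new web table but not in the old one.
        "new": new_nodes - old_nodes,
        # In the old web table but not in the new one.
        "deleted": old_nodes - new_nodes,
        # In both web tables, yet not joinable: content was modified.
        "modified": (old_nodes & new_nodes) - set(joinable_nodes),
    }


# Toy identifiers, not taken from the actual web tables.
old = {"aids", "indavir", "ibuprofen"}
new = {"aids", "indavir", "viagra"}
joinable = {"aids"}

delta = identify_delta_nodes(old, new, joinable)
```

With this toy input, `viagra` is reported as new, `ibuprofen` as deleted, and `indavir` as modified because it is present in both versions but is not joinable.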
51Steps for Generation
- Phase 2: Delta Tuples Identification Phase
- Determines how the delta nodes are related to one another and how they are associated with those nodes which have remained unchanged
- We identify those tuples which contain nodes which are added, deleted or modified during the transition
- Input: joined, left outer joined and right outer joined web tables, and the sets of delta nodes
- Output: sets of web tuples represented by the Delta+, Delta- and Delta-M web tables
52Phase 2 (Delta+ Web Table)
- Scan the joined and right outer joined web tables to identify web tuples containing nodes which are inserted during the transition
- New nodes can occur in these tables
- In the right outer joined table, if the remaining nodes in the tuple containing the new nodes are modified (hence not joinable)
- In the joined web table, if some of the nodes in the tuple containing new nodes have remained unchanged and hence are joinable
- These web tuples are stored in the Delta+ web table
53Example (Right Outer Web Join)
54Example (Joined Web Table)
[Figure: joined web table, tuple (4). Node ids shown: a0, b2, u3, d7, k7 (labels Heart Disorder, Niacin) and a0, u2, d3, k3 (labels Heart Disease, Hirudin).]
55Delta+ Web Table
56Phase 2 (Delta- Web Table)
- Scan the joined and left outer joined web tables to identify web tuples containing nodes which are deleted during the transition
- Deleted nodes can occur in these tables only because
- In the left outer joined table, if the remaining nodes in the tuple containing the deleted nodes are modified (hence not joinable)
- In the joined web table, if some of the nodes in the tuple containing deleted nodes have remained unchanged and hence are joinable
- These web tuples are stored in the Delta- web table
57Example (Left Outer Web Join)
58Example (Joined Web Table)
[Figure: joined web table, tuple (5). Node ids shown: a0, b4, u7, d6, k7 (labels Impotence, Caverject) and a0, b4, u7, u8 (labels Impotence, Caverject).]
59Delta- Web Table
60Phase 2 (Delta-M Web Table)
- Finally, nodes which are modified during the transition can be identified by inspecting all three web tables
- Tuples in the left and right outer joined tables which do not contain any new or deleted node represent the old and new versions of these nodes respectively
- These tuples do not occur in the joined web table as all the nodes are modified
- Tuples in the left and right outer joined tables that contain modified nodes as well as inserted or deleted nodes
- These modified nodes may not appear in the joined web table if no other joinable web tuples contain these modified nodes
61Phase 2
- Tuples in the joined web table where some of the nodes represent the old and new versions of these modified nodes
- These web tuples are stored in the Delta-M web table
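Taken together, the Phase 2 slides describe three scans over the joined and outer joined web tables. The sketch below is a simplified reading of them, not the deck's Algorithm Delta: a web tuple is reduced to a set of node ids, a tuple goes into a delta table as soon as it contains at least one node of the matching kind, and the same tuple may therefore land in more than one delta table. The function name and the toy node ids are hypothetical.

```python
def generate_delta_tables(joined, left_outer, right_outer, new, deleted, modified):
    """Classify web tuples into Delta+, Delta- and Delta-M web tables.

    joined, left_outer, right_outer: lists of web tuples (sets of node ids).
    new, deleted, modified: sets of delta node ids from Phase 1.
    """

    def containing(tables, wanted):
        # Keep every tuple that has at least one node from the wanted set.
        return [t for table in tables for t in table if t & wanted]

    return {
        # New nodes can only show up in the joined and right outer joined tables.
        "delta_plus": containing([joined, right_outer], new),
        # Deleted nodes can only show up in the joined and left outer joined tables.
        "delta_minus": containing([joined, left_outer], deleted),
        # Modified nodes are looked for in all three tables.
        "delta_m": containing([joined, left_outer, right_outer], modified),
    }


# Toy node ids, not taken from the actual web tables.
joined = [{"a0", "b2", "u9"}, {"a0", "b4", "u5"}]
left_outer = [{"d0", "k0"}]    # old versions of fully modified tuples
right_outer = [{"d0", "k0"}]   # new versions of fully modified tuples

tables = generate_delta_tables(
    joined, left_outer, right_outer,
    new={"u9"}, deleted={"u5"}, modified={"d0", "k0"},
)
```

In this toy run the tuple with `u9` goes to Delta+, the tuple with `u5` goes to Delta-, and Delta-M receives the old and new versions of the fully modified tuple from the two outer joined tables.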
62Example (Right Outer Web Join)
63Example (Left Outer Web Join)
64Example (Joined web table)
[Figure: joined web table, tuples (1) and (2). Tuple (1): node ids a0, b0, u0, d0, k0; labels AIDS, Indavir. Tuple (2): node ids a0, b0, u1, d1, k1; labels AIDS, Ritonavir.]
65Delta-M Web Table
[Figure: Delta-M web table, tuples (1) to (3). Tuple (1): node ids a0, b0, u0, d0, k0; labels AIDS, Indavir. Tuple (2): node ids a0, b0, u1, d1, k1; labels AIDS, Ritonavir. Tuple (3): node ids a0, b4, u7, d6, k7, u8; labels Impotence, Caverject.]
66Delta-M Web Table
[Figure: Delta-M web table, tuples (4) and (5). Node ids shown: a0, b2, u2, d3, k3 (labels Heart Disease, Hirudin) and a0, u2, d3, k3 (labels Heart Disorder, Hirudin).]
67Algorithm Delta
68Algorithm Delta (continued)
69Algorithm Delta (continued)
70Algorithm of GenerateResult Tables
71Algorithm of GenerateResult Tables (continued)
72Algorithm for DeltasFromRightOuter
73Algorithm for DeltasFromLeftOuter
74Algorithm of DeltasFromJoin
75Algorithm of DeltasFromJoin (continued)
76Algorithm of CreateDeltaPlus
77Future Work
- Analytical and empirical studies of the algorithms for generating delta web tables
- Mechanism to represent changes: modified, new or deleted nodes
- Annotation on delta nodes
- Extend to sub-page level
- Query languages for querying the changes
- Change notification service