Title: Efficient Discovery of XML Data Redundancies

1
Efficient Discovery of XML Data Redundancies

Cong Yu and H. V. Jagadish
University of Michigan, Ann Arbor
VLDB 2006, Seoul, Korea
September 12th, 2006

2
Talk Outline

Motivating Example
A Comprehensive Notion of XML FD
XML Redundancy Discovery Algorithms
Experimental Evaluation
Conclusion

3
An Example XML Document
[Figure: example XML document tree. A warehouse contains state nodes, which contain store nodes (names Borders, Amazon, Borders). Each store holds a book with ISBN 269 and title DB; two books list authors (au) R.R. and J.G. at price 59.9, and the third is priced 51.1.]
4
Constraints on XML Data

An example constraint
For any two books, if they have the same ISBN,
then they have the same title.
Similar to Equality Generating Dependencies
(EGDs) [BV84] and Nested EGDs [YP04]

Condition Element(s)
Implication Element(s)
Target
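
The example constraint can be checked mechanically by comparing every pair of books. The sketch below is a minimal illustration, not the authors' algorithm; the flat `books` records and the `satisfies` helper are hypothetical names, with values taken from the example document.

```python
from itertools import combinations

# Hypothetical flat view of the three book elements in the example document.
books = [
    {"ISBN": "269", "title": "DB", "price": "59.9"},
    {"ISBN": "269", "title": "DB", "price": "51.1"},
    {"ISBN": "269", "title": "DB", "price": "59.9"},
]

def satisfies(records, condition, implication):
    """True if every pair agreeing on `condition` also agrees on `implication`."""
    for a, b in combinations(records, 2):
        same_condition = all(a[c] == b[c] for c in condition)
        same_implication = all(a[i] == b[i] for i in implication)
        if same_condition and not same_implication:
            return False
    return True

isbn_title = satisfies(books, ["ISBN"], ["title"])  # same ISBN, same title
isbn_price = satisfies(books, ["ISBN"], ["price"])  # same ISBN, different prices
```

Here the books play the role of the target, ISBN is the condition element, and title or price is the implication element.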
5
Data Redundancies

E.g., title is redundantly stored
Result of non-optimal design of the database
schema in the presence of constraints
Lead to
Update anomalies
Increased cost for data transfer and manipulation
Constraints are the properties of data
May not be known at the design phase

Goal
Efficiently Discover Redundancies From the XML
Database By Discovering Satisfied Constraints

7
Main Contributions

A comprehensive notion of XML FD
Capturing a semantically richer set of XML
constraints
Definition of XML data redundancy in terms of XML
FDs and XML Keys
Efficient algorithms for discovering FDs and data
redundancies from an XML database
Experimental Evaluation

8
Talk Outline

Motivating Example
A Comprehensive Notion of XML FD
XML Redundancy Discovery Algorithms
Experimental Evaluation
Conclusion

9
Backup slide: Example XML Constraints

Regular: condition and implication elements are
children of the target

[Figure: the example XML document tree]
10
Example XML Constraints

Hierarchical: condition and/or implication
elements can come from multiple hierarchies

Set elements: condition and/or implication
elements can involve set elements

FDs are used to describe constraints in
relational databases
A similar notion of FD is needed for XML
Challenges
Target is difficult to specify due to the
hierarchical structure
Set elements introduce new semantics

XML FDs need richer semantics!
13
Previous Notions

Path-Based Notion [LLL02, VLL04]
Example: /warehouse/state/store/book/ISBN →
/warehouse/state/store/book/title
Format: LHS → RHS
Semantics: for any two RHS nodes, same
(associated) LHS indicates same RHS
Tree-Tuple-Based Notion [AL04]
A tree tuple is a data tree with exactly one
data node for each schema element
Format: LHS → RHS
Semantics: for any two tree tuples, same LHS
indicates same RHS

14
Previous Notions, contd

Both capture hierarchical constraints
Neither can capture set constraints
/store/book/ISBN → /store/book/au
Violated under the previous notions
Satisfied if the two au nodes are treated as a single set
/store/book/title, /store/book/au → /store/book/ISBN
Undefined under the previous notions
Intuitive if the au nodes are treated as a single set

[Figure: a single store (name Borders) with one book: ISBN 269, title DB, au R.R., au J.G., price 59.9]
15
A New Comprehensive Notion

Generalized Tree Tuple
A data tree constructed around a pivot data node
(n_p)
The entire subtree rooted at n_p is kept
All ancestors of n_p and their attributes are
kept
Tuple Class C_P
The set of all generalized tree tuples whose
pivot nodes share the same path P (called the
pivot path)
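
As a rough illustration of the definition above, the following sketch builds a generalized tree tuple with Python's standard `xml.etree.ElementTree`. It is not the authors' implementation. It assumes that the "attributes" of an ancestor are its XML attributes plus its childless, non-repeated child elements (such as a store's name); that reading, the `generalized_tree_tuple` helper, and the reduced two-store document are assumptions made for this example.

```python
import copy
import xml.etree.ElementTree as ET

# Reduced version of the example document (two stores instead of three).
DOC = (
    "<warehouse><state>"
    "<store><name>Borders</name><book><ISBN>269</ISBN><title>DB</title>"
    "<au>R.R.</au><au>J.G.</au><price>59.9</price></book></store>"
    "<store><name>Amazon</name><book><ISBN>269</ISBN><title>DB</title>"
    "<price>51.1</price></book></store>"
    "</state></warehouse>"
)
root = ET.fromstring(DOC)

def generalized_tree_tuple(root, pivot):
    """Keep the pivot's whole subtree plus its ancestors and their attributes."""
    parent = {child: p for p in root.iter() for child in p}
    chain = [pivot]                      # pivot, parent, ..., root
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    chain.reverse()                      # root, ..., pivot

    new_root, attach_to = None, None
    for node, next_on_path in zip(chain, chain[1:]):
        clone = ET.Element(node.tag, dict(node.attrib))
        tags = [c.tag for c in node]
        for c in node:
            # Assumption: a childless, non-repeated child counts as an attribute.
            if c is not next_on_path and len(c) == 0 and tags.count(c.tag) == 1:
                clone.append(copy.deepcopy(c))
        if attach_to is None:
            new_root = clone
        else:
            attach_to.append(clone)
        attach_to = clone

    subtree = copy.deepcopy(pivot)       # entire subtree rooted at the pivot
    if attach_to is None:                # the pivot is the root itself
        return subtree
    attach_to.append(subtree)
    return new_root

# Tuple class for the pivot path /warehouse/state/store/book.
tuple_class = [generalized_tree_tuple(root, b)
               for b in root.findall("./state/store/book")]
```

Each tuple in `tuple_class` contains exactly one store (the pivot book's own), so a relative path such as `../name` is unambiguous within a tuple.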

16
Example Generalized Tree Tuple
[Figure: the example XML document tree with one node marked as the pivot]
17
Example Generalized Tree Tuple
[Figure: the example XML document tree with a different node marked as the pivot]
18
XML FD

LHS → RHS w.r.t. C_P
Semantics:
for any two generalized tree tuples t1, t2 in C_P,
if they share the same LHS, they have the same
RHS.
E.g., ./title, ./au → ./ISBN, w.r.t.
C_{/warehouse/state/store/book}
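
The semantics can be sketched by treating each repeatable element (here `au`) as a set value. This is an illustrative check under that assumption, not the paper's algorithm; the dictionary encoding of the tuples in `c_book` and the `holds` helper are hypothetical.

```python
# Hypothetical encoding of the three tuples in C_book: relative path -> value.
# The repeatable element ./au is stored as a set of author names.
c_book = [
    {"./ISBN": "269", "./title": "DB", "./au": {"R.R.", "J.G."}, "./price": "59.9"},
    {"./ISBN": "269", "./title": "DB", "./au": set(), "./price": "51.1"},
    {"./ISBN": "269", "./title": "DB", "./au": {"R.R.", "J.G."}, "./price": "59.9"},
]

def value(tree_tuple, path):
    """Return a hashable value; set elements are compared as whole sets."""
    v = tree_tuple[path]
    return frozenset(v) if isinstance(v, (set, list)) else v

def holds(tuple_class, lhs, rhs):
    """True if tuples that share the same LHS always have the same RHS."""
    seen = {}
    for t in tuple_class:
        left = tuple(value(t, p) for p in lhs)
        right = tuple(value(t, p) for p in rhs)
        if seen.setdefault(left, right) != right:
            return False
    return True

title_au_isbn = holds(c_book, ["./title", "./au"], ["./ISBN"])
isbn_au = holds(c_book, ["./ISBN"], ["./au"])
```

On the full three-book example, `./ISBN → ./au` fails because one book has no authors, while `./title, ./au → ./ISBN` holds.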

19
Repeatable Elements Are Special
[Figure: the example XML document tree, highlighting repeatable elements]
20
Essential Tuple Classes

Definition
Tuple classes with pivot paths that correspond
to repeatable schema elements
C_{/warehouse/state/store/book} is essential
C_{/warehouse/state/store/name} is not
Essential tuple classes can also express the XML
FDs that are expressible with non-essential
tuple classes
See paper for detailed proof

21
Backup slide: Structurally Redundant XML FDs

Definition: FDs where none of the paths in LHS
and RHS is a descendant of the pivot path
Their satisfaction on a data tree is mirrored by
other FDs
I.e., they are satisfied if and only if some
other FD is satisfied
See paper for detailed explanation

22
Backup slide: Interesting XML FD

RHS is not contained in LHS
C_P is an essential tuple class
RHS is a descendant of the pivot node
See paper for details

23
XML Key and Data Redundancy

Let attribute @key uniquely identify each node in
the entire data tree
LHS is an XML Key w.r.t. C_P when the database
satisfies the XML FD LHS → ./@key w.r.t. C_P
Similar to the relative key notion proposed in
[BDF01]
Data redundancy exists if the database
Satisfies the XML FD LHS → RHS w.r.t. C_P,
But LHS is not an XML Key w.r.t. C_P
⇒ RHS is redundantly stored.
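
The redundancy condition can be sketched directly from this definition. The code below is a minimal illustration on the book tuples of the running example; the row encoding and the helper names `fd_holds`, `is_key`, and `redundant` are hypothetical.

```python
# Book tuples from the running example, with @key identifying each node.
R_book = [
    {"@key": 6, "parent": 4, "ISBN": "269", "title": "DB", "price": "59.9"},
    {"@key": 13, "parent": 12, "ISBN": "269", "title": "DB", "price": "51.1"},
    {"@key": 20, "parent": 19, "ISBN": "269", "title": "DB", "price": "59.9"},
]

def fd_holds(rows, lhs, rhs):
    """True if rows that agree on lhs also agree on rhs."""
    seen = {}
    for row in rows:
        left = tuple(row[a] for a in lhs)
        right = tuple(row[a] for a in rhs)
        if seen.setdefault(left, right) != right:
            return False
    return True

def is_key(rows, lhs):
    """LHS is a key when it determines the unique @key attribute."""
    return fd_holds(rows, lhs, ["@key"])

def redundant(rows, lhs, rhs):
    """RHS is redundantly stored: the FD holds but LHS is not a key."""
    return fd_holds(rows, lhs, rhs) and not is_key(rows, lhs)

title_redundant = redundant(R_book, ["ISBN"], ["title"])
```

In this data `ISBN → title` holds while ISBN does not identify a single book, so title is flagged as redundantly stored.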

24
Talk Outline

Motivating Example
A Comprehensive Notion of XML FD
XML Redundancy Discovery Algorithms
Experimental Evaluation
Conclusion

25
Strategy

Discover satisfied XML FDs and Keys
Data redundancies can then be discovered based on
the definition
First, we need an efficient representation of the
XML data

26
Hierarchical Representation of XML Data

Each essential tuple class → a relation
Similar to nested relations [OY87, MNE96]
All relations together form a hierarchy
Tree tuples can be reconstructed by joining @key
with parent

R_state
@key  parent
2     root
3     root
18    root
...   ...

R_book
@key  parent  ISBN  title  price
6     4       269   DB     59.9
13    12      269   DB     51.1
20    19      269   DB     59.9

R_au
@key  parent  @text
10    6       R.R.
11    6       J.G.
24    20      R.R.
25    20      J.G.

R_store
@key  parent  name
4     3       Borders
12    3       Amazon
19    18      Borders
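
One way to produce such relations is a single pre-order walk over the document. The sketch below is an assumption-laden illustration: the `REPEATABLE` set stands in for schema knowledge, `@key` values are simple pre-order counters (so they differ from the keys on the slide), and `shred` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Assumed to be known from the schema: elements that may repeat.
REPEATABLE = {"state", "store", "book", "au"}

DOC = (
    "<warehouse><state>"
    "<store><name>Borders</name><book><ISBN>269</ISBN><title>DB</title>"
    "<au>R.R.</au><au>J.G.</au><price>59.9</price></book></store>"
    "<store><name>Amazon</name><book><ISBN>269</ISBN><title>DB</title>"
    "<price>51.1</price></book></store>"
    "</state></warehouse>"
)

def shred(root):
    """Map each repeatable element type to a relation with @key and parent."""
    relations = defaultdict(list)
    counter = 0

    def visit(node, parent_key):
        nonlocal counter
        counter += 1                      # pre-order number used as @key
        key = counter
        if node.tag in REPEATABLE:
            row = {"@key": key, "parent": parent_key}
            for child in node:
                # Non-repeatable leaf children become columns of this row.
                if child.tag not in REPEATABLE and len(child) == 0:
                    row[child.tag] = child.text
            if len(node) == 0:            # leaf element: keep its text
                row["@text"] = node.text
            relations["R_" + node.tag].append(row)
            parent_key = key
        for child in node:
            visit(child, parent_key)

    visit(root, "root")
    return dict(relations)

relations = shred(ET.fromstring(DOC))
```

Joining a row's `parent` with the `@key` of the relation one level up recovers the ancestor part of a tree tuple.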
27
Intra-Relation FDs

./ISBN → ./title, w.r.t. C_{/warehouse/state/store/book}

[Figure: the example XML document tree; ./ISBN and ./title both belong to R_book]
28
Inter-Relation FDs

../name, ./ISBN → ./price, w.r.t.
C_{/warehouse/state/store/book}

[Figure: the example XML document tree; ../name is present in R_store, while ./ISBN and ./price are present in R_book]
29
Overview of the Discovery Process

Only interested in minimal FDs
Bottom-Up
At each relation
Discover intra-relation FDs and Keys
Discover inter-relation FDs and Keys involving
descendant relations
Generate candidate inter-relation FDs and Keys
for examination at the parent level
Attribute Partition as the basic data structure

30
Attribute Partition

Groups tuples according to the attribute value
Π_price for C_book: {{t6, t20}, {t13}}
Π_@key for C_book: {{t6}, {t20}, {t13}}
Π_{price, @key} for C_book: {{t6}, {t20}, {t13}}
FD LHS → RHS w.r.t. C_P is satisfied iff
Π_{LHS ∪ RHS} = Π_LHS

R_book
@key  parent  ISBN  title  price
6     4       269   DB     59.9
13    12      269   DB     51.1
20    19      269   DB     59.9
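
The partition test can be sketched in a few lines. This is a naive illustration of the equality Π_{LHS ∪ RHS} = Π_LHS on the R_book rows above, not the optimized data structure used in the paper; `partition` and `fd_holds` are hypothetical helper names.

```python
from collections import defaultdict

R_book = [
    {"@key": 6, "parent": 4, "ISBN": "269", "title": "DB", "price": "59.9"},
    {"@key": 13, "parent": 12, "ISBN": "269", "title": "DB", "price": "51.1"},
    {"@key": 20, "parent": 19, "ISBN": "269", "title": "DB", "price": "59.9"},
]

def partition(rows, attrs):
    """Group tuple keys by their values on attrs; returns a set of groups."""
    groups = defaultdict(set)
    for row in rows:
        groups[tuple(row[a] for a in attrs)].add(row["@key"])
    return {frozenset(g) for g in groups.values()}

def fd_holds(rows, lhs, rhs):
    """LHS -> RHS holds iff adding RHS does not split any LHS group."""
    return partition(rows, lhs + rhs) == partition(rows, lhs)

pi_price = partition(R_book, ["price"])
pi_key = partition(R_book, ["@key"])
```

Because price groups t6 with t20 while @key separates every tuple, `price → @key` fails, which means price is not a key for C_book.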
31
Set Attribute Partition

Generated through refinement
Initialize Π_au for R_book to be {{t6, t13, t20}}
Compute Π_@text for R_au: {{t10, t24}, {t11, t25}}
Convert to parent: {{t6, t20}, {t6, t20}}
Refine Π_au using the partitions in Π_@text
Π_au for R_book becomes {{t6, t20}, {t13}}
Π_au can then be used as a normal partition

R_au
@key  parent  @text
10    6       R.R.
11    6       J.G.
24    20      R.R.
25    20      J.G.

R_book
@key  parent  ISBN  title  price
6     4       269   DB     59.9
13    12      269   DB     51.1
20    19      269   DB     59.9
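
The refinement steps can be sketched as follows. This is a simplified reading of the slide, assuming each author value splits the current groups into books that contain it and books that do not; the `refine` helper and the tuple encoding of R_au are hypothetical.

```python
from collections import defaultdict

# (@key, parent, @text) rows of R_au and the @key values of R_book.
R_au = [(10, 6, "R.R."), (11, 6, "J.G."), (24, 20, "R.R."), (25, 20, "J.G.")]
book_keys = {6, 13, 20}

def refine(partition, splitter):
    """Split every group into members inside and outside the splitter set."""
    result = []
    for group in partition:
        inside, outside = group & splitter, group - splitter
        result.extend(g for g in (inside, outside) if g)
    return result

# Partition R_au on @text, already converted to parent (book) keys.
parents_by_text = defaultdict(set)
for _key, parent, text in R_au:
    parents_by_text[text].add(parent)

# Start with all books in one group, then refine once per author value.
pi_au = [frozenset(book_keys)]
for parents in parents_by_text.values():
    pi_au = refine(pi_au, parents)
```

Two books end up in the same group only if every author value appears under both or under neither, which matches treating `au` as a set.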
32
Discovery Algorithms

DiscoverFD
Discover intra-relation FDs and Keys
Similar to existing relational algorithms
DiscoverXFD
Discover inter-relation FDs and Keys
Key component
Candidate inter-relation XML FD generation

33
Generating Candidate Inter-Relation FDs

Let P' be a parent relation of P
Parent satisfaction property
For LHS ∪ X → RHS w.r.t. C_P to hold for any
attribute set X in relation P', LHS ∪ {./parent} →
RHS w.r.t. C_P must hold
Child implication property
For LHS ∪ X → RHS w.r.t. C_P to be a non-trivial FD
for any attribute set X in relation P', LHS → RHS
w.r.t. C_P must not hold
An FD is a candidate inter-relation FD if it
satisfies both properties
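
Both properties can be tested inside the child relation alone, since the `parent` column stands in for every attribute of P'. The sketch below is an illustration on R_book, not the authors' candidate generator; `fd_holds` and `is_candidate` are hypothetical names.

```python
R_book = [
    {"@key": 6, "parent": 4, "ISBN": "269", "title": "DB", "price": "59.9"},
    {"@key": 13, "parent": 12, "ISBN": "269", "title": "DB", "price": "51.1"},
    {"@key": 20, "parent": 19, "ISBN": "269", "title": "DB", "price": "59.9"},
]

def fd_holds(rows, lhs, rhs):
    """True if rows that agree on lhs also agree on rhs."""
    seen = {}
    for row in rows:
        left = tuple(row[a] for a in lhs)
        right = tuple(row[a] for a in rhs)
        if seen.setdefault(left, right) != right:
            return False
    return True

def is_candidate(rows, lhs, rhs):
    """Candidate inter-relation FD: worth examining at the parent level."""
    # Parent satisfaction: even the parent key itself must make the FD hold.
    parent_satisfaction = fd_holds(rows, lhs + ["parent"], rhs)
    # Child implication: the FD must not already hold without parent attributes.
    child_implication = not fd_holds(rows, lhs, rhs)
    return parent_satisfaction and child_implication

isbn_price_candidate = is_candidate(R_book, ["ISBN"], ["price"])
isbn_title_candidate = is_candidate(R_book, ["ISBN"], ["title"])
```

`./ISBN → ./price` is passed up to R_store as a candidate, while `./ISBN → ./title` is not, because it already holds within R_book.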

34
Backup slide: Generating Partition Target

Example candidate FD
./ISBN → ./price w.r.t. C_book
We associate each FD with a Partition Target
(PT)
Specifying the inequalities that parent attribute
partitions must satisfy

R_book
@key  parent  ISBN  title  price
6     4       269   DB     59.9
13    12      269   DB     51.1
20    19      269   DB     59.9

Π_ISBN: {{t6, t13, t20}}
Π_price: {{t6, t20}, {t13}}
PT: t4 ≠ t12, t19 ≠ t12
35
Backup slide: Checking Partition Target

Candidate FD
./ISBN → ./price w.r.t. C_book
We check each parent attribute partition against
the PT to discover inter-relation FDs
We use various techniques to compactly represent
PT
See analysis in paper

R_store
@key  parent  name
4     3       Borders
12    3       Amazon
19    18      Borders

PT: t4 ≠ t12, t19 ≠ t12
Π_name: {{t4, t19}, {t12}}
../name → ./price w.r.t. C_book
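
The two backup slides can be tied together with a naive sketch: derive the PT as pairs of parent tuples that must be separated, then test a parent attribute against it. The paper describes compact PT representations; the pairwise set used here, and the helper names `partition_target` and `satisfies_target`, are simplifying assumptions.

```python
from itertools import combinations

R_book = [
    {"@key": 6, "parent": 4, "ISBN": "269", "title": "DB", "price": "59.9"},
    {"@key": 13, "parent": 12, "ISBN": "269", "title": "DB", "price": "51.1"},
    {"@key": 20, "parent": 19, "ISBN": "269", "title": "DB", "price": "59.9"},
]
R_store = [
    {"@key": 4, "parent": 3, "name": "Borders"},
    {"@key": 12, "parent": 3, "name": "Amazon"},
    {"@key": 19, "parent": 18, "name": "Borders"},
]

def partition_target(rows, lhs, rhs):
    """Pairs of parent keys that a parent partition must keep apart."""
    target = set()
    for a, b in combinations(rows, 2):
        same_lhs = all(a[x] == b[x] for x in lhs)
        same_rhs = all(a[y] == b[y] for y in rhs)
        if same_lhs and not same_rhs:
            target.add(frozenset((a["parent"], b["parent"])))
    return target

def satisfies_target(parent_rows, attrs, target):
    """True if attrs give different values to both members of every pair."""
    value = {r["@key"]: tuple(r[a] for a in attrs) for r in parent_rows}
    return all(len({value[k] for k in pair}) == 2 for pair in target)

pt = partition_target(R_book, ["ISBN"], ["price"])
name_ok = satisfies_target(R_store, ["name"], pt)
```

Since `name` separates store t12 from both t4 and t19, the store name is enough to complete the candidate FD on price.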
36
Talk Outline

Motivating Example
A Comprehensive Notion of XML FD
XML Redundancy Discovery Algorithms
Experimental Evaluation
Conclusion

37
Real Datasets

DBLP contains a fair amount of redundancy, as
noted earlier in [AL04] as well
10% redundancy in PIR (measured as the number of
redundant elements over the total number of
elements); schema modification reported to PIR

38
Scalability on XMark

Linear in terms of scale factor (# of elements)
even though exponential in theory
Orders of magnitude faster than direct
application of a state-of-the-art relational
discovery algorithm
The latter takes over 3 hours to run on XMark
scale factor 1

39
Related Work

XML Integrity Constraints (FDs and Keys)
[BDF01, LLL02, FS03]
XML Normal Form
[AL04, VLL04]
Nested Relation Normal Form
[OY87, MNE96]
Relational FD discovery
FUN, Dep-Miner, TANE, fdep, FastFDs

40
Backup slide: GORDIAN

Both use extensive pruning strategies based on
the properties of FDs
E.g., singleton pruning is adopted in both
GORDIAN is more aggressive since it only looks
for keys
Our algorithm is more comprehensive: it discovers
satisfied FDs in addition to keys

41
Conclusion

A comprehensive notion of XML FDs and Keys,
capturing set semantics
A system for detecting XML data redundancies
through the discovery of FDs and Keys
The system is practical for real datasets and
out-performs direct application of the best
available relational algorithm by orders of
magnitude.

42
Questions?
