Title: Efficient Discovery of XML Data Redundancies
1Efficient Discovery of XML Data Redundancies
- Cong Yu and H. V. Jagadish
- University of Michigan, Ann Arbor
- VLDB 2006, Seoul, Korea
- September 12th, 2006
2Talk Outline
- Motivating Example
- A Comprehensive Notion of XML FD
- XML Redundancy Discovery Algorithms
- Experimental Evaluation
- Conclusion
3An Example XML Document
[Figure: example XML data tree. A warehouse root contains state elements; each state contains store elements, each with a name (Borders, Amazon, Borders) and a book; each book has ISBN 269, title DB, and a price (59.9 or 51.1); two of the three books also have two au elements (R.R. and J.G.).]
4Constraints on XML Data
- An example constraint
- For any two books, if they have the same ISBN, then they have the same title.
- Similar to Equality Generating Dependencies (EGDs) [BV84] and Nested EGDs [YP04]
- Target: book; Condition Element(s): ISBN; Implication Element(s): title
5Data Redundancies
- E.g., title is redundantly stored
- Result of non-optimal design of the database schema in the presence of constraints
- Lead to
- Update anomalies
- Increased cost for data transfer and manipulation
- Constraints are the properties of data
- May not be known at the design phase
6- Goal
- Efficiently Discover Redundancies From the XML
Database By Discovering Satisfied Constraints
7Main Contributions
- A comprehensive notion of XML FD
- Capturing a semantically richer set of XML constraints
- Definition of XML data redundancy in terms of XML FDs and XML Keys
- Efficient algorithms for discovering FDs and data redundancies from an XML database
- Experimental Evaluation
8Talk Outline
- Motivating Example
- A Comprehensive Notion of XML FD
- XML Redundancy Discovery Algorithms
- Experimental Evaluation
- Conclusion
9Backup slide: Example XML Constraints
- Regular: condition and implication elements are children of target
[Figure: the example XML data tree from slide 3]
10Example XML Constraints
- Hierarchical: condition and/or implication elements can come from multiple hierarchies
[Figure: the example XML data tree from slide 3]
11Example XML Constraints, Contd
- Set elements: condition and/or implication elements can involve set elements
[Figure: the example XML data tree from slide 3]
12Functional Dependencies (FDs)
- FDs are used to describe constraints in relational databases
- A similar notion of FD is needed for XML
- Challenges
- Target is difficult to specify due to the hierarchical structure
- Set elements introduce new semantics
- XML FD needs richer semantics!
13Previous Notions
- Path Based Notion [LLL02, VLL04]
- Example: /warehouse/state/store/book/ISBN → /warehouse/state/store/book/title
- Format: LHS → RHS
- Semantics: for any two RHS nodes, same (associated) LHS indicates same RHS
- Tree Tuple Based Notion [AL04]
- A tree tuple is a data tree, with exactly one data node for each schema element
- Format: LHS → RHS
- Semantics: for any two tree tuples, same LHS indicates same RHS
14Previous Notions, contd
- Both capture hierarchical constraints
- Neither can capture set constraints
- /store/book/ISBN → /store/book/au
- Violated in previous notions
- Satisfied if the two au nodes are a single set
- /store/book/title, /store/book/au → /store/book/ISBN
- Undefined in previous notions
- Intuitive if au nodes are a single set
[Figure: a store (name Borders) with one book: ISBN 269, title DB, price 59.9, and two au elements (R.R., J.G.)]
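The two readings of /store/book/ISBN → /store/book/au can be sketched in Python. This is an illustration only: the dictionaries are a hypothetical flattening of two books from the example document, `tree_tuples` is a simplified stand-in for the node-by-node comparison of the earlier notions, and all function names are invented here rather than taken from the paper.

```python
from itertools import combinations

# Hypothetical flattening of two books from the example document.
books = [
    {"ISBN": "269", "title": "DB", "au": ["R.R.", "J.G."]},
    {"ISBN": "269", "title": "DB", "au": ["R.R.", "J.G."]},
]

def tree_tuples(book):
    """Earlier notions (simplified): one tuple per au node, so au is a single value."""
    return [{**book, "au": author} for author in book["au"]]

def generalized_tuple(book):
    """Set semantics: all au nodes of one book are compared as a single set."""
    return {**book, "au": frozenset(book["au"])}

def holds(tuples, lhs, rhs):
    """True if every pair of tuples that agrees on lhs also agrees on rhs."""
    return all(
        t1[rhs] == t2[rhs]
        for t1, t2 in combinations(tuples, 2)
        if all(t1[a] == t2[a] for a in lhs)
    )

per_node = holds([t for b in books for t in tree_tuples(b)], ["ISBN"], "au")
as_set = holds([generalized_tuple(b) for b in books], ["ISBN"], "au")
print(per_node, as_set)  # False True
```

Compared node by node, the same ISBN is paired with two different au values, so the dependency is violated; compared as sets, both books carry the same author set and the dependency is satisfied.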
15A New Comprehensive Notion
- Generalized Tree Tuple
- A data tree constructed around a pivot data node (n_p)
- Entire subtree rooted at n_p is kept
- All ancestors of n_p and their attributes are kept
- Tuple Class C_P
- The set of all generalized tree tuples whose pivot nodes share the same path P (called the pivot path)
16Example Generalized Tree Tuple
[Figure: the example XML data tree from slide 3, with one data node labeled "Pivot"]
17Example Generalized Tree Tuple
[Figure: the example XML data tree from slide 3, with one data node labeled "Pivot"]
18XML FD
- LHS → RHS w.r.t. C_P
- Semantics
- For any two generalized tree tuples t1, t2 in C_P, if they share the same LHS, they have the same RHS.
- E.g., ./title, ./au → ./ISBN, w.r.t. C_P with P = /warehouse/state/store/book
19Repeatable Elements Are Special
[Figure: the example XML data tree from slide 3]
20Essential Tuple Classes
- Definition
- Tuple classes with pivot paths that correspond to repeatable schema elements
- C_P with P = /warehouse/state/store/book is essential
- C_P with P = /warehouse/state/store/name is not
- Essential tuple classes can also express the XML FDs that are expressible with non-essential tuple classes
- See paper for detailed proof
21Backup slide: Structurally Redundant XML FDs
- Definition: FDs where none of the paths in LHS and RHS is a descendant of the pivot path
- Their satisfaction on a data tree is mirrored by other FDs
- I.e., they are satisfied if and only if some other FD is satisfied
- See paper for detailed explanation
22Backup slide: Interesting XML FD
- RHS is not contained in LHS
- C_P is an essential tuple class
- RHS is a descendant of the pivot node
- See paper for details
23XML Key and Data Redundancy
- Let attribute @key uniquely identify each node in the entire data tree
- LHS is an XML Key w.r.t. C_P when the database satisfies the XML FD LHS → ./@key w.r.t. C_P
- Similar to the relative key notion proposed in [BDF01]
- Data redundancy exists if the database
- Satisfies the XML FD LHS → RHS w.r.t. C_P,
- But LHS is not an XML Key w.r.t. C_P
- ⇒ RHS is redundantly stored.
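The definition can be tried out on the book data of the running example. The following is a minimal Python sketch, assuming the books are already available as flat rows; `fd_holds`, `is_key`, and `is_redundant` are hypothetical helper names, not the paper's algorithms.

```python
# Book rows from the running example (@key, ISBN, title, price).
R_book = [
    {"@key": 6, "ISBN": "269", "title": "DB", "price": "59.9"},
    {"@key": 13, "ISBN": "269", "title": "DB", "price": "51.1"},
    {"@key": 20, "ISBN": "269", "title": "DB", "price": "59.9"},
]

def fd_holds(rows, lhs, rhs):
    """LHS -> RHS holds if each LHS value combination maps to a single RHS value."""
    seen = {}
    for row in rows:
        lhs_value = tuple(row[a] for a in lhs)
        if seen.setdefault(lhs_value, row[rhs]) != row[rhs]:
            return False
    return True

def is_key(rows, lhs):
    """LHS is a key when LHS -> @key holds."""
    return fd_holds(rows, lhs, "@key")

def is_redundant(rows, lhs, rhs):
    """RHS is redundantly stored when LHS -> RHS holds but LHS is not a key."""
    return fd_holds(rows, lhs, rhs) and not is_key(rows, lhs)

print(is_redundant(R_book, ["ISBN"], "title"))  # True
print(is_redundant(R_book, ["@key"], "title"))  # False
```

ISBN determines title but three books share ISBN 269, so title is stored redundantly; @key also determines title, but it is a key, so that dependency signals no redundancy.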
24Talk Outline
- Motivating Example
- A Comprehensive Notion of XML FD
- XML Redundancy Discovery Algorithms
- Experimental Evaluation
- Conclusion
25Strategy
- Discover satisfied XML FDs and Keys
- Data redundancies can then be discovered based on the definition
- First, we need an efficient representation of the XML data
26Hierarchical Representation of XML Data
- Each essential tuple class → a relation
- Similar to nested relations [OY87, MNE96]
- All relations together form a hierarchy
- Tree tuples can be reconstructed by joining @key with parent

R_state
@key  parent
2     root
3     root
18    root
...   ...

R_store
@key  parent  name
4     3       Borders
12    3       Amazon
19    18      Borders

R_book
@key  parent  ISBN  title  price
6     4       269   DB     59.9
13    12      269   DB     51.1
20    19      269   DB     59.9

R_au
@key  parent  @text
10    6       R.R.
11    6       J.G.
24    20      R.R.
25    20      J.G.
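The join between @key and parent can be sketched in Python. This is only an illustration under simple assumptions: each relation is held in a dictionary keyed by @key, and `book_tuple` and the relative-path field names are hypothetical, chosen here for readability.

```python
# Relations copied from the slide; dictionary keys are the @key values.
R_store = {
    4: {"parent": 3, "name": "Borders"},
    12: {"parent": 3, "name": "Amazon"},
    19: {"parent": 18, "name": "Borders"},
}
R_book = {
    6: {"parent": 4, "ISBN": "269", "title": "DB", "price": "59.9"},
    13: {"parent": 12, "ISBN": "269", "title": "DB", "price": "51.1"},
    20: {"parent": 19, "ISBN": "269", "title": "DB", "price": "59.9"},
}
R_au = {
    10: {"parent": 6, "@text": "R.R."},
    11: {"parent": 6, "@text": "J.G."},
    24: {"parent": 20, "@text": "R.R."},
    25: {"parent": 20, "@text": "J.G."},
}

def book_tuple(book_key):
    """Rebuild the tuple around one book by joining @key with parent."""
    book = R_book[book_key]
    store = R_store[book["parent"]]  # R_book.parent = R_store.@key
    authors = frozenset(             # R_au.parent = R_book.@key
        au["@text"] for au in R_au.values() if au["parent"] == book_key
    )
    return {
        "state": store["parent"],
        "../name": store["name"],
        "./ISBN": book["ISBN"],
        "./title": book["title"],
        "./price": book["price"],
        "./au": authors,
    }

for key in R_book:
    print(key, book_tuple(key))
```

Keeping au in its own relation means a book with no authors (book 13 here) simply has no child rows, and the repeated au values come back as one set per book.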
27Intra-Relation FDs
- ./ISBN → ./title, w.r.t. C_P with P = /warehouse/state/store/book
[Figure: the example XML data tree from slide 3]
28Inter-Relation FDs
- ../name, ./ISBN → ./price, w.r.t. C_P with P = /warehouse/state/store/book
[Figure: the example XML data tree from slide 3, with the annotations "Present in R_store" and "Present in R_book"]
29Overview of the Discovery Process
- Only interested in minimal FDs
- Bottom-Up
- At each relation
- Discover intra-relation FDs and Keys
- Discover inter-relation FDs and Keys involving descendant relations
- Generate candidate inter-relation FDs and Keys for examination at the parent level
- Attribute Partition as the basic data structure
30Attribute Partition
- Groups tuples according to the attribute value
- Π_price for C_book: {{t6, t20}, {t13}}
- Π_@key for C_book: {{t6}, {t20}, {t13}}
- Π_{price, @key} for C_book: {{t6}, {t20}, {t13}}
- FD LHS → RHS w.r.t. C_P is satisfied iff
- Π_{LHS ∪ RHS} = Π_LHS

R_book
@key  parent  ISBN  title  price
6     4       269   DB     59.9
13    12      269   DB     51.1
20    19      269   DB     59.9
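The partition test can be sketched in a few lines of Python. This is a minimal illustration on the R_book rows above, assuming tuples are identified as t6, t13, and t20; `partition` and `fd_holds` are hypothetical names, and the paper's actual data structures may differ.

```python
from collections import defaultdict

# R_book rows from the slide, keyed by tuple identifier.
R_book = {
    "t6": {"@key": 6, "ISBN": "269", "title": "DB", "price": "59.9"},
    "t13": {"@key": 13, "ISBN": "269", "title": "DB", "price": "51.1"},
    "t20": {"@key": 20, "ISBN": "269", "title": "DB", "price": "59.9"},
}

def partition(rows, attrs):
    """Group tuple identifiers by their values on the given attributes."""
    groups = defaultdict(set)
    for tid, row in rows.items():
        groups[tuple(row[a] for a in attrs)].add(tid)
    return {frozenset(group) for group in groups.values()}

def fd_holds(rows, lhs, rhs):
    """LHS -> RHS is satisfied iff adding RHS does not split any LHS group."""
    return partition(rows, lhs + rhs) == partition(rows, lhs)

print(partition(R_book, ["price"]))
print(fd_holds(R_book, ["ISBN"], ["title"]))  # True
print(fd_holds(R_book, ["ISBN"], ["price"]))  # False
```

Because the partition on LHS ∪ RHS can only refine the partition on LHS, comparing the number of groups would be an equivalent test.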
31Set Attribute Partition
- Generated through refinement
- Step 1: initialize Π_au for R_book to be {{t6, t13, t20}}
- Step 2: compute Π_@text for R_au: {{t10, t24}, {t11, t25}}
- Step 3: convert to parent: {{t6, t20}, {t6, t20}}
- Step 4: refine Π_au using the partitions in Π_@text, giving Π_au for R_book: {{t6, t20}, {t13}}
- Π_au can then be used as a normal partition

R_au
@key  parent  @text
10    6       R.R.
11    6       J.G.
24    20      R.R.
25    20      J.G.

R_book
@key  parent  ISBN  title  price
6     4       269   DB     59.9
13    12      269   DB     51.1
20    19      269   DB     59.9
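The refinement steps can be sketched in Python as follows. This is an illustrative reading of the slide, not the paper's implementation: `refine` and `set_attribute_partition` are hypothetical names, and duplicate values within one set are ignored.

```python
from collections import defaultdict

# R_au rows from the slide.
R_au = [
    {"@key": 10, "parent": 6, "@text": "R.R."},
    {"@key": 11, "parent": 6, "@text": "J.G."},
    {"@key": 24, "parent": 20, "@text": "R.R."},
    {"@key": 25, "parent": 20, "@text": "J.G."},
]
book_keys = [6, 13, 20]

def refine(partition, group):
    """Split every block into the part inside `group` and the part outside it."""
    refined = []
    for block in partition:
        inside, outside = block & group, block - group
        refined.extend(part for part in (inside, outside) if part)
    return refined

def set_attribute_partition(parent_keys, child_rows, value_attr):
    # Start with all parent tuples in a single block.
    result = [frozenset(parent_keys)]
    # Partition the child tuples by value.
    by_value = defaultdict(set)
    for row in child_rows:
        by_value[row[value_attr]].add(row["@key"])
    parent_of = {row["@key"]: row["parent"] for row in child_rows}
    for child_group in by_value.values():
        # Convert each child group to the keys of its parents.
        parent_group = frozenset(parent_of[key] for key in child_group)
        # Refine the parent partition with the converted group.
        result = refine(result, parent_group)
    return set(result)

print(set_attribute_partition(book_keys, R_au, "@text"))
```

Two parents end up in the same block only if every child value is present in both or absent from both, which is what comparing their au sets for equality requires.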
32Discovery Algorithms
- DiscoverFD
- Discover intra-relation FDs and Keys
- Similar to existing relational algorithms
- DiscoverXFD
- Discover inter-relation FDs and Keys
- Key component
- Candidate inter-relation XML FD generation
33Generating Candidate Inter-Relation FDs
- Let P' be a parent relation of P
- Parent satisfaction property
- For LHS ∪ X → RHS w.r.t. C_P to hold for any attribute set X in relation P', LHS ∪ {./parent} → RHS w.r.t. C_P must hold
- Child implication property
- For LHS ∪ X → RHS w.r.t. C_P to be a non-trivial FD for any attribute set X in relation P', LHS → RHS w.r.t. C_P must not hold
- An FD is a candidate inter-relation FD if it satisfies both properties
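Both properties can be checked directly on the child relation. The following Python sketch is a simplified illustration on the R_book rows of the running example; `fd_holds` and `is_candidate` are hypothetical helpers, and the parent column stands in for ./parent.

```python
# R_book rows from the running example, including the parent column.
R_book = [
    {"@key": 6, "parent": 4, "ISBN": "269", "title": "DB", "price": "59.9"},
    {"@key": 13, "parent": 12, "ISBN": "269", "title": "DB", "price": "51.1"},
    {"@key": 20, "parent": 19, "ISBN": "269", "title": "DB", "price": "59.9"},
]

def fd_holds(rows, lhs, rhs):
    """LHS -> RHS holds if each LHS value combination maps to a single RHS value."""
    seen = {}
    for row in rows:
        lhs_value = tuple(row[a] for a in lhs)
        if seen.setdefault(lhs_value, row[rhs]) != row[rhs]:
            return False
    return True

def is_candidate(rows, lhs, rhs):
    """Candidate inter-relation FD: worth examining at the parent level."""
    # Parent satisfaction: no parent attribute can help unless the parent
    # identity itself already separates the conflicting tuples.
    parent_satisfaction = fd_holds(rows, lhs + ["parent"], rhs)
    # Child implication: if LHS -> RHS already holds, parent attributes add nothing.
    child_implication = not fd_holds(rows, lhs, rhs)
    return parent_satisfaction and child_implication

print(is_candidate(R_book, ["ISBN"], "price"))  # True
print(is_candidate(R_book, ["ISBN"], "title"))  # False
```

./ISBN → ./price fails within R_book but holds once the parent is added, so it is passed up for examination; ./ISBN → ./title already holds within R_book and is not a candidate.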
34Backup slide: Generating Partition Target
- Example candidate FD
- ./ISBN → ./price w.r.t. C_book
- We associate each FD with a Partition Target (PT)
- Specifying the inequalities that parent attribute partitions must satisfy

R_book
@key  parent  ISBN  title  price
6     4       269   DB     59.9
13    12      269   DB     51.1
20    19      269   DB     59.9

Π_ISBN = {{t6, t13, t20}}
Π_price = {{t6, t20}, {t13}}
PT = {t4 ≠ t12, t19 ≠ t12}
35Backup slide: Checking Partition Target
- Candidate FD
- ./ISBN → ./price w.r.t. C_book
- We check each parent attribute partition against the PT to discover inter-relation FDs
- We use various techniques to compactly represent PT
- See analysis in paper

R_store
@key  parent  name
4     3       Borders
12    3       Amazon
19    18      Borders

PT = {t4 ≠ t12, t19 ≠ t12}
Π_name = {{t4, t19}, {t12}}
⇒ ../name → ./price w.r.t. C_book
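Generating and checking a partition target can be sketched in Python. This is a naive pairwise illustration under the assumption that a PT is a set of parent pairs that must be separated; the compact PT representations mentioned on the slide are not modeled, and `partition_target` and `satisfies` are hypothetical names.

```python
from itertools import combinations

R_book = [
    {"@key": 6, "parent": 4, "ISBN": "269", "price": "59.9"},
    {"@key": 13, "parent": 12, "ISBN": "269", "price": "51.1"},
    {"@key": 20, "parent": 19, "ISBN": "269", "price": "59.9"},
]
R_store = [
    {"@key": 4, "parent": 3, "name": "Borders"},
    {"@key": 12, "parent": 3, "name": "Amazon"},
    {"@key": 19, "parent": 18, "name": "Borders"},
]

def partition_target(child_rows, lhs, rhs):
    """Parent pairs that a parent attribute must place in different groups."""
    target = set()
    for r1, r2 in combinations(child_rows, 2):
        same_lhs = all(r1[a] == r2[a] for a in lhs)
        if same_lhs and r1[rhs] != r2[rhs]:
            # These two tuples violate LHS -> RHS, so their parents must differ.
            target.add(frozenset((r1["parent"], r2["parent"])))
    return target

def satisfies(parent_rows, attr, target):
    """True if `attr` gives different values to both parents of every pair."""
    value = {row["@key"]: row[attr] for row in parent_rows}
    return all(len({value[key] for key in pair}) == 2 for pair in target)

pt = partition_target(R_book, ["ISBN"], "price")
print(pt)                               # pairs {4, 12} and {12, 19}
print(satisfies(R_store, "name", pt))   # True
```

The name attribute separates store 12 (Amazon) from stores 4 and 19 (Borders), so it satisfies both inequalities; an attribute that gave stores 4 and 12 the same value would fail the check.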
36Talk Outline
- Motivating Example
- A Comprehensive Notion of XML FD
- XML Redundancy Discovery Algorithms
- Experimental Evaluation
- Conclusion
37Real Datasets
- DBLP contains a fair amount of redundancy, as noted earlier in [AL04] as well
- 10% redundancy in PIR (measured as the number of redundant elements over the total number of elements); schema modification reported to PIR
38Scalability on XMark
- Linear in terms of scale factor (number of elements), even though exponential in theory
- Orders of magnitude faster than direct application of a state-of-the-art relational discovery algorithm
- The latter takes over 3 hours to run on XMark scale factor 1
39Related Work
- XML Integrity Constraints (FDs and Keys)
- BDF01, LLL02, FS03
- XML Normal Form
- AL04, VLL04
- Nested Relation Normal Form
- OY87, MNE96
- Relational FD discovery
- FUN, Dep-Miner, TANE, fdep, FastFDs
40Backup slide: GORDIAN
- Both use extensive pruning strategies based on the properties of FDs
- E.g., singleton pruning is adopted in both
- GORDIAN is more aggressive since it only looks for keys
- Our algorithm is more comprehensive: it discovers satisfied FDs in addition to keys
41Conclusion
- A comprehensive notion of XML FDs and Keys, capturing set semantics
- A system for detecting XML data redundancies through the discovery of FDs and Keys
- The system is practical for real datasets and outperforms direct application of the best available relational algorithm by orders of magnitude.
42Questions ?