Title: Schema Summarization
1Schema Summarization
- Cong Yu and H. V. Jagadish
- University of Michigan, Ann Arbor
- VLDB 2006, Seoul, Korea
- September 13th, 2006
2Many Databases Are Complex
Number of elements = tables + columns (relational) or elements + attributes (XML)
3Reactome Schema
4What's the Problem?
- Why are complex schemas difficult to deal with?
  - For data integration administrators (DIAs): difficult to grasp the major topics of a complex schema
  - For ordinary users: difficult to identify the small subset of relevant schema elements
- Can we avoid them?
  - Probably not: scientific databases are in fact getting more and more complex; MiMI is an example
5Existing Approaches
- Ignore the schema
  - Keyword-based search over relational and XML databases
- Guess the schema
  - Schema-Free XQuery, FleXPath, etc.
- Limitations
  - Provide imprecise (and sometimes incorrect) answers
  - No help in understanding the schema (and the database) itself
6Our Approach
- Summarize the schema
  - Represent the original complex schema with a simpler schema, i.e., a summary of the original schema
- Help users explore the schema via the summary
  - Illustrates the main topics of the database
  - Filters away irrelevant parts of the schema
Challenge: how to create a good summary?
7Talk Outline
- Motivation
- Background Definitions
- Desiderata of Schema Summary
- Efficient Schema Summarization
- Evaluation
- Conclusion and Related Work
8Schema
- A labeled, directed graph
- Nodes
  - Relational: table and column
  - Hierarchical: element and attribute
- Links
  - Structural links: parent/child constraints
  - Value links: inclusion constraints (key / foreign key)
[Figure: example schema graph with elements warehouse, state, store, authors, author, contact, book, isbn, price, title and attributes @name, @id, @address]
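As a minimal sketch of this definition, a schema can be held as a labeled, directed graph whose links carry a kind. The names `SchemaGraph`, `Link`, and `kind` are illustrative assumptions, and the edges below are a guess at part of the example figure, not a transcription of it.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Link:
    source: str
    target: str
    kind: str  # "structural" (parent/child) or "value" (key / foreign key)


@dataclass
class SchemaGraph:
    nodes: set = field(default_factory=set)
    links: list = field(default_factory=list)

    def add_link(self, source, target, kind):
        if kind not in ("structural", "value"):
            raise ValueError(f"unknown link kind: {kind}")
        self.nodes.update((source, target))
        self.links.append(Link(source, target, kind))

    def neighbors(self, node):
        """Elements directly connected to `node`, in either direction."""
        outgoing = {link.target for link in self.links if link.source == node}
        incoming = {link.source for link in self.links if link.target == node}
        return outgoing | incoming


# Assumed fragment of the example schema (edge layout is hypothetical).
schema = SchemaGraph()
for parent, child in [("warehouse", "state"), ("state", "store"),
                      ("store", "book"), ("book", "isbn"),
                      ("warehouse", "authors"), ("authors", "author")]:
    schema.add_link(parent, child, "structural")
# A value link: a book's author reference points at an author entry.
schema.add_link("book", "author", "value")

print(sorted(schema.neighbors("store")))
```

Keeping the link kind on the edge lets later steps treat structural and value links uniformly while still being able to tell them apart.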
9Schema Summary
- A schema itself, but
  - Fewer elements → simpler
  - Contains abstract elements and links
- Abstract element
  - Represents a group of original elements
- Abstract link
  - Connects at least one abstract element
[Figure: summary of the example schema with three elements: warehouse, author, book]
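One way to picture abstract elements and abstract links is to group original elements under a representative and lift the original links to the groups. The sketch below assumes a hypothetical grouping of the example schema; the function name `abstract_links` and the edge list are illustrative, not taken from the talk.

```python
def abstract_links(links, group_of):
    """Lift original links to summary links.

    links:    iterable of (source, target) pairs over original elements.
    group_of: dict mapping each original element to the summary element
              that represents it.
    Returns the set of links between distinct summary elements.
    """
    lifted = set()
    for source, target in links:
        a, b = group_of[source], group_of[target]
        if a != b:  # links inside one group are hidden by the abstract element
            lifted.add((a, b))
    return lifted


# Hypothetical original links and grouping for a three-element summary.
links = [("warehouse", "state"), ("state", "store"), ("store", "book"),
         ("book", "isbn"), ("warehouse", "authors"), ("authors", "author")]
group_of = {
    "warehouse": "warehouse", "state": "warehouse", "store": "warehouse",
    "book": "book", "isbn": "book",
    "authors": "author", "author": "author",
}

summary_links = abstract_links(links, group_of)
print(sorted(summary_links))
```

Here `warehouse`, `book`, and `author` each stand for a group, so every lifted link connects at least one abstract element.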
11What Makes a Good Schema Summary?
[Figure: the example warehouse schema graph next to candidate summaries drawn from warehouse, store, author, and book]
- Which one should be the summary?
12What Information Do We Need?
- A schema summary is not only a summary of the schema, but in fact also a summary of the database!
- Schema structure and data distribution
13Desired Properties of Schema Summary
- Small enough (in terms of number of elements) to comprehend: Summary Complexity
- Show elements in which users are more likely to be interested: Summary Importance
- Show elements that represent the entire database well: Summary Coverage
- Importance and Coverage calculation will need to consider both schema structure and data distribution
14Intuition Behind Importance
- Not all schema elements are created equal!
- First observation
  - More links, more important (schema)
- Second observation
  - More popular, more important (data)
[Figure: the example warehouse schema graph]
15Compute Summary Importance
- Schema element importance
  - W (neighbor weight): the percentage of ej's information that flows into e, estimated using relative cardinalities
- Summary importance
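The formulas from this slide did not survive in the text, so the following is only a sketch of a PageRank-style iteration consistent with the description: each element keeps a fraction `p` of its importance and passes the rest to its neighbors in proportion to the neighbor weights. The update rule, the value of `p`, the starting values, and the function name are all assumptions.

```python
def element_importance(initial, weight, p=0.8, tol=1e-9, max_iter=1000):
    """Iterate importance values until they stop changing.

    initial: dict element -> starting importance (e.g. its cardinality).
    weight:  dict (source, target) -> neighbor weight; the weights leaving
             each source are assumed to sum to 1.
    p:       assumed fraction of importance an element keeps per round.
    """
    imp = dict(initial)
    for _ in range(max_iter):
        nxt = {e: p * imp[e] for e in imp}
        for (src, dst), w in weight.items():
            nxt[dst] += (1 - p) * w * imp[src]
        if max(abs(nxt[e] - imp[e]) for e in imp) < tol:
            return nxt
        imp = nxt
    return imp


# Toy graph: "a" is linked to both "b" and "c"; "b" and "c" only to "a".
initial = {"a": 1.0, "b": 1.0, "c": 1.0}
weight = {("a", "b"): 0.5, ("a", "c"): 0.5, ("b", "a"): 1.0, ("c", "a"): 1.0}

imp = element_importance(initial, weight)
print({e: round(v, 3) for e, v in imp.items()})
```

In this toy case the better-connected element `a` ends up with twice the importance of `b` or `c`, while the total importance stays constant because each element redistributes exactly what it gives away.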
16Backup Slide: Neighbor Weight
- Reflects the relative importance of an element toward an element it directly connects to, among its neighbors
- Captures both element connectivity and data cardinality
[Figure: the example warehouse schema graph with links annotated by relative cardinalities (labels shown: 5/1, 2/1, 150/1, 100/1, 3/2, 3/1)]
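A hedged sketch of how such a weight could be derived: normalize the relative cardinalities of the links leaving an element so that they sum to 1. The normalization, the function name, and the numbers in the example are assumptions for illustration; they are not read off the figure.

```python
from collections import defaultdict


def neighbor_weights(rel_card):
    """Turn relative cardinalities into per-source neighbor weights.

    rel_card: dict (source, target) -> relative cardinality of the link
              as seen from `source`.
    Returns dict (source, target) -> share of source's outgoing total.
    """
    totals = defaultdict(float)
    for (src, _dst), rc in rel_card.items():
        totals[src] += rc
    return {(src, dst): rc / totals[src]
            for (src, dst), rc in rel_card.items()}


# Hypothetical cardinalities: a store links to 3 books and 1 state on average.
rel_card = {("store", "book"): 3.0, ("store", "state"): 1.0,
            ("book", "store"): 1.0}

weights = neighbor_weights(rel_card)
print(weights[("store", "book")], weights[("store", "state")])
```

An element with a single neighbor sends all of its weight there, which matches the idea that the weight captures both connectivity and cardinality.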
17Intuition Behind Coverage
- Important ≠ inclusion in the summary
- Elements can be too close to each other
- Two basic notions
  - Element affinity
  - Element coverage
[Figure: the example warehouse schema graph]
18Intuition Behind Coverage, cont'd
- Element affinity
  - Fewer hops, higher affinity
  - Higher relative cardinality, lower affinity
- Element coverage
  - Element affinity
  - Neighbor weight
[Figure: the example warehouse schema graph]
19Compute Summary Coverage
- Schema element affinity from ea to eb
- Schema element coverage of eb by ea
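The affinity formula itself is not in the extracted text. The sketch below only encodes the two stated intuitions (fewer hops and lower relative cardinality give higher affinity) under an assumed model: each hop contributes a factor of 1 / relative cardinality, a path's affinity is the product of its hops, and the affinity between two elements is that of the best path. The function name and the example numbers are hypothetical.

```python
import heapq


def affinities_from(start, rel_card):
    """Best-path affinity from `start` to every reachable element.

    rel_card: dict (source, target) -> relative cardinality, assumed >= 1
              so that every hop can only lower the affinity.
    """
    out = {}
    for (src, dst), rc in rel_card.items():
        if rc < 1:
            raise ValueError("this sketch assumes relative cardinalities >= 1")
        out.setdefault(src, []).append((dst, 1.0 / rc))

    best = {start: 1.0}
    heap = [(-1.0, start)]  # max-heap on affinity via negated values
    while heap:
        neg, node = heapq.heappop(heap)
        if -neg < best[node]:
            continue  # stale heap entry
        for nxt, factor in out.get(node, []):
            cand = -neg * factor
            if cand > best.get(nxt, 0.0):
                best[nxt] = cand
                heapq.heappush(heap, (-cand, nxt))
    return best


# Hypothetical cardinalities: many stores per state, many books per store.
rel_card = {("state", "store"): 5, ("store", "state"): 1,
            ("store", "book"): 100, ("book", "store"): 1,
            ("book", "isbn"): 1}

from_state = affinities_from("state", rel_card)
from_book = affinities_from("book", rel_card)
print(from_state["book"], from_book["state"])
```

Under this model affinity is asymmetric: a book is close to its state, but a state is far from any particular book.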
20What Makes a Good Schema Summary?
- Schema structure and data distribution determine summary importance and summary coverage
22Overview
- Inputs: database, schema, K
- (1) Annotating schema graph (computing statistics)
- (2.1) Calculating importance and (2.2) calculating coverage (algorithms MaxImportance and MaxCoverage)
  - List L of elements sorted by importance
  - Set of K elements with high coverage; set S of coverage domination pairs
- (3) Determine K summary elements (algorithm BalanceSummary)
- (4) Cluster original schema elements
- Output: balanced summary of size K
23Algorithm MaxImportance
- MaxImportance generates a summary of a given size k, maximizing summary importance
  - Compute steady-state element importance values
  - Sort and pick the top-k important elements
  - Compute assignments of remaining elements
- Complexity: O(N^2 + N log N)
- Convergence is proved in [MGR02].
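The last two steps can be sketched as follows, assuming importance values are already computed. How the remaining elements are assigned is not stated on the slide; assigning each one to the chosen element with the highest affinity toward it is an assumption, as are the function name and the example values.

```python
import heapq


def max_importance_summary(importance, affinity, k):
    """Pick the k most important elements and assign the rest to them.

    importance: dict element -> steady-state importance.
    affinity:   dict (summary_element, element) -> affinity; missing
                pairs count as 0.
    Returns (chosen, assignment), where assignment maps every element to
    the chosen element that represents it.
    """
    chosen = heapq.nlargest(k, importance, key=importance.get)
    assignment = {}
    for e in importance:
        if e in chosen:
            assignment[e] = e
        else:
            # Assumed rule: the closest chosen element represents e.
            assignment[e] = max(
                chosen, key=lambda s: affinity.get((s, e), 0.0))
    return chosen, assignment


# Hypothetical importance and affinity values.
importance = {"warehouse": 5.0, "book": 9.0, "author": 7.0,
              "isbn": 2.0, "state": 1.0}
affinity = {("book", "isbn"): 1.0, ("author", "isbn"): 0.1,
            ("warehouse", "state"): 0.5}

chosen, assignment = max_importance_summary(importance, affinity, k=3)
print(chosen, assignment["isbn"], assignment["state"])
```

Selecting with `heapq.nlargest` is equivalent to sorting and taking the first k, which is where the N log N term in the stated complexity would come from.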
24Algorithm MaxCoverage
- MaxCoverage generates a summary of a given size k, maximizing summary coverage in a heuristic way
  - Compute coverage dominance (bottom up with A/D pairs)
  - Eliminate elements being dominated
  - Compute summary coverage for all element sets of size k
  - Pick the set with highest coverage
- Complexity: O(kN2nk)
- See paper for details on coverage dominance
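A rough sketch of the enumeration step, assuming the dominated elements have already been identified. How a set's combined coverage is computed is not given here; crediting each element to the chosen element that covers it best is an assumption, as are the function name and the example values.

```python
from itertools import combinations


def max_coverage_summary(elements, coverage, dominated, k):
    """Try every k-element set of non-dominated elements.

    elements:  list of all schema elements.
    coverage:  dict (a, b) -> coverage of b by a; missing pairs count as 0.
    dominated: set of elements pruned by coverage domination.
    Returns the k-tuple with the highest combined coverage, or () if
    fewer than k candidates remain.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    candidates = [e for e in elements if e not in dominated]

    def summary_coverage(chosen):
        # Assumed rule: each element counts once, via its best coverer.
        return sum(max(coverage.get((s, e), 0.0) for s in chosen)
                   for e in elements)

    return max(combinations(candidates, k), key=summary_coverage, default=())


# Hypothetical coverage values; every element covers itself fully.
elements = ["warehouse", "store", "book", "isbn"]
coverage = {(e, e): 1.0 for e in elements}
coverage.update({("book", "isbn"): 0.9, ("warehouse", "store"): 0.8,
                 ("store", "book"): 0.3})

best = max_coverage_summary(elements, coverage, dominated={"isbn"}, k=2)
print(best)
```

Pruning matters because the number of candidate sets grows combinatorially with the number of surviving elements.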
25Backup Slide: Calculating K Elements with Maximum Combined Coverage
- Problem
  - K highest-coverage elements ≠ a set of K elements with highest combined coverage
  - Need to consider all K-element sets
- A pruning strategy
  - Heuristic approach
  - Coverage domination pair: ea dominates eb iff, for any summary with only eb, we can replace eb with ea and obtain a summary with higher coverage
  - We eliminate eb from consideration
26Backup Slide: Coverage Domination Pair
- Conditions
  - Let ej be the elements that eb covers better than ea, and ec be the element that covers ea best except ea itself
  - Ca is the total coverage ea has for all ej
  - Cb is the total coverage eb has for all ej
27Generate Balanced Summary
- No single optimal criterion to balance the two desired properties
- A heuristic approach
  - Pick elements in the order of their importance
  - Ignore elements that are dominated by elements already in the summary
- Works well in practice
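The heuristic above can be sketched as a single greedy pass. The inputs are assumed to be the importance-sorted list and the set of coverage domination pairs produced earlier; the function name `balance_summary` and the example data are illustrative.

```python
def balance_summary(ranked, dominates, k):
    """Greedily pick up to k elements in importance order.

    ranked:    elements sorted by decreasing importance.
    dominates: set of (a, b) pairs meaning a dominates b in coverage.
    An element is skipped if something already in the summary dominates it.
    """
    summary = []
    for e in ranked:
        if len(summary) == k:
            break
        if any((s, e) in dominates for s in summary):
            continue  # already well covered by a chosen element
        summary.append(e)
    return summary


# Hypothetical ranking and domination pairs.
ranked = ["book", "isbn", "author", "warehouse", "state"]
dominates = {("book", "isbn"), ("warehouse", "state")}

print(balance_summary(ranked, dominates, k=3))
```

Although `isbn` ranks second by importance, it is skipped because `book` already dominates it, so the slot goes to the next element that adds coverage.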
29Evaluation Strategies
- Observation
  - Comparing automatic summaries with summaries generated by human experts
  - In general, automatic summaries agree well with human experts (80%)
- An objective evaluation framework
  - Models query behavior based on schema exploration
  - Query discovery cost: the number of extra elements visited in order to construct a correct query from a query intention
30Query Discovery Cost Example
- Query intention: retrieve the ISBN of all books
- Query: for $b in doc()/state/store/book return $b/isbn
[Figure: two explorations of the example warehouse schema for this query, one with cost 3 and one with cost 5]
31Data Sets
32Summary Benefits
33Contributions of Schema Structure and Data Distribution
34Impact of Balancing Importance and Coverage
Percentages in parentheses show the reduction in savings
36Related Work
- First study on summarizing schemas
- Related to ER model abstraction
- Limitations of ER model abstraction
  - Does not reflect the data distribution
  - ER models may not be available and may be out-of-date
  - For most database schemas, structure or value links are semantics-free, so ER model abstraction methods are ineffective (tagging those links involves a significant amount of manual effort)
37Related Work, cont'd
- Summary element importance calculation is partially inspired by PageRank
- Summary element affinity calculation (used in summary coverage) is partially inspired by similar measurements in social network analysis
38Conclusions and Contributions
- Introduced the concept of schema summary
- Defined summary importance and summary coverage as desiderata of schema summary
- Emphasized both schema structure and data distribution as essential features for importance and coverage calculation
- Designed and implemented efficient schema summarization algorithms
- An objective evaluation framework
39Questions?