Title: Typing Semistructured Data
1. Typing Semistructured Data
By Keshava Reddy Kottapally, Goutham Chinnapolamada
Source: Serge Abiteboul, Dan Suciu, Peter Buneman, Data on the Web: From Relations to Semistructured Data and XML, Morgan Kaufmann Series, ISBN 1-55860-622-X, 1999
2. Typing Semistructured Data
- Introduction: schema for semistructured data
- Motivation for typing semistructured data
- Schema formalisms
  - First-order logic
  - Datalog
  - Graph simulations
- Extracting schemas from data
- Inferring schemas from queries
- Path constraints
3. What is semistructured data?
- Semistructured data has some structure, but is difficult to describe with a predefined, rigid schema
- Irregularity
- Continual evolution
- Structure that is implicit or unknown to the user
4. What is typing?
- Typing is about finding the structure of semistructured data
- Structuring semistructured data is still an area of much research activity
- Typing involves finding methods to provide schemas for semistructured data
- Typing for semistructured data (SSD) differs from typing for relational or object-oriented data and hence needs separate methods
5. Uses of typing SSD
- To optimize query evaluation
- Example
  - Original query
    - select X.title
    - from biblio._ X
    - where X.*.zip = 12345
  - Optimized form
    - select X.title
    - from biblio.book X
    - where X.address.zip = 12345
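The benefit of the rewrite can be illustrated with a small sketch. Everything here is an assumption made for illustration: the nested-dictionary data, the helper names `children`, `descendants`, `wildcard_query`, and `typed_query`, the reading of the original path as a wildcard descent, and the node-visit counter used as a rough cost measure. It is not the evaluation strategy of any particular system.

```python
# Hypothetical toy data: nested dicts and lists stand in for the biblio graph.
biblio = {
    "book": [
        {"title": "Data on the Web", "address": {"city": "Paris", "zip": "12345"}},
        {"title": "Other Book", "address": {"city": "Rome", "zip": "99999"}},
    ],
    "paper": [
        {"title": "Some Paper", "journal": "J", "year": "1999"},
    ],
}

def children(node):
    """Yield (label, child) pairs of a node; atomic values have no children."""
    if isinstance(node, dict):
        for label, value in node.items():
            for child in (value if isinstance(value, list) else [value]):
                yield label, child

def descendants(node, counter):
    """Yield the node and everything below it, counting each visited node."""
    counter[0] += 1
    yield node
    for _, child in children(node):
        yield from descendants(child, counter)

def wildcard_query(root, zip_code):
    """Untyped form: X ranges over any child of the root, and the zip edge
    may hang off any node at or below X."""
    counter, titles = [0], []
    for _, x in children(root):
        hit = any(label == "zip" and child == zip_code
                  for d in descendants(x, counter)
                  for label, child in children(d))
        if hit:
            titles.append(x["title"])
    return titles, counter[0]

def typed_query(root, zip_code):
    """Schema-guided form: only book objects, and only the path address.zip."""
    counter, titles = [0], []
    for x in root.get("book", []):
        counter[0] += 1
        address = x.get("address")
        if isinstance(address, dict):
            counter[0] += 1
            if address.get("zip") == zip_code:
                titles.append(x["title"])
    return titles, counter[0]

titles_w, visits_w = wildcard_query(biblio, "12345")
titles_t, visits_t = typed_query(biblio, "12345")
print(titles_w == titles_t, visits_t < visits_w)
```

On this toy data both forms return the same titles, while the schema-guided form inspects fewer nodes because it never descends into paper objects or into subtrees that the schema rules out.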
6. [Figure: schema graph for the biblio database, with class nodes (C1, C2, ...). Edge labels: biblio, book, paper, author, address, title, journal, year, first name, last name, street, city, zip. Leaves are of type string.]
7. Uses of typing continued...
- To facilitate the task of integrating several data sources
- To improve storage
  - Better clustering may reduce the number of page fetches, thus improving query performance
- To construct indexes
- To describe the database content to users and facilitate query formulation
- To proscribe certain updates
8. Two ways of typing
- Schema extraction
  - Given one particular data instance, find the most specific schema for it
  - With semistructured data we may specify the type after the database is populated
  - A data instance may have more than one type
- Schema inference
  - Find the most specific schema by analyzing the query
  - This process is similar to type inference in programming languages
9. The problem
- Given a database and a type, does the database conform to this type?
- Classification of objects
  - Which objects belong to each class?
  - Typing involves describing the structure of each class and its relationships with other classes
10. Difference between typing SSD and object databases
- Classes are defined less precisely. As a consequence, objects may belong to several classes
- Some objects may not belong to any class or may have properties that do not pertain to any class
- The typing may be approximate. For example, we may accept in a class an object that does not quite conform to the specification of that class.
11. Schema formalisms
- First-order logic
- Datalog
- Simulation
12. First-order logic
- Example: consider three kinds of objects in the database
- Root object(s) have
  - Outgoing edges labeled company to company objects and person to person objects
- Person objects have
  - Outgoing edges labeled name and position to string objects
  - Outgoing edges labeled worksfor to company objects
  - Incoming edges labeled manager and employee from company objects
- Company objects have
  - Outgoing edges labeled name and address to string objects
  - Outgoing edges labeled manager and employee to person objects
  - Incoming edges labeled worksfor from person objects
13. First-order logic continued...
- If
  - If an object has a-edges to strings and b-edges from c-objects, then it is a c-object.
  - (∃Y, Z (ref(X,a,Y) ∧ string(Y) ∧ c(Z) ∧ ref(Z,b,X))) → c(X)
- Only-if
  - Any c-object has some a-edges to strings and some b-edges from c-objects
  - (∃Y, Z (ref(X,a,Y) ∧ string(Y) ∧ c(Z) ∧ ref(Z,b,X))) ← c(X)
- If and only if
  - (∃Y, Z (ref(X,a,Y) ∧ string(Y) ∧ c(Z) ∧ ref(Z,b,X))) ↔ c(X)
- Consequence
  - c(X) ∧ ref(Z,b,X) → c(Z)
  - c(X) ∧ ref(X,a,Y) → string(Y)
  - c(X) ∧ ref(X,L,Y) ∧ L ≠ a ∧ L ≠ b → false
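As a rough illustration of the "if and only if" reading, the following sketch checks whether a proposed set of c-objects satisfies it on a small graph. The function name `satisfies_iff`, the triple encoding of `ref`, and the example graph are assumptions made for this sketch. Because Y and Z are independent, the existential splits into two separate tests.

```python
def satisfies_iff(refs, strings, c):
    """Check, for every node X, that X is in c exactly when X has an a-edge
    to a string and a b-edge coming from some object in c.
    refs is a set of (source, label, target) triples."""
    nodes = {n for (s, _, t) in refs for n in (s, t)}
    for x in nodes:
        has_a_string = any(s == x and l == "a" and t in strings
                           for (s, l, t) in refs)
        has_b_from_c = any(t == x and l == "b" and s in c
                           for (s, l, t) in refs)
        if (x in c) != (has_a_string and has_b_from_c):
            return False
    return True

# Hypothetical graph: n1 and n2 point to each other with b-edges
# and each has an a-edge to a string.
refs = {("n1", "a", "s1"), ("n1", "b", "n2"),
        ("n2", "a", "s2"), ("n2", "b", "n1")}
strings = {"s1", "s2"}

print(satisfies_iff(refs, strings, {"n1", "n2"}))
print(satisfies_iff(refs, strings, set()))
print(satisfies_iff(refs, strings, {"n1"}))
```

On this graph both the classification {n1, n2} and the empty classification satisfy the formula, while {n1} alone does not, so a recursive definition of this kind may admit more than one classification.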
14. Problem definition with first-order logic
- The previous questions on typing can be restated in terms of first-order logic
  - Does D satisfy T, noted D ⊨ T? That is, is there a model of T that coincides with D over the extensional predicates?
  - If D ⊨ T, what is the classification that is induced?
- First-order logic leads to very general typings, probably too general for what is needed in semistructured data
- It could also lead to undecidability or intractability
15. Datalog: a rule-based language
- Datalog allows us to state that if a conjunction of facts holds, then some new fact can be derived
- Datalog rules allow us to define classes by specifying what incoming and outgoing edges are required
- Example
  - r(X) :- ref(X, person, Y), p(Y), ref(X, company, Z), c(Z)
  - p(X) :- c(Y), ref(Y, manager, X), c(Z), ref(Z, employee, X), ref(X, worksfor, U), c(U), ref(X, name, N), string(N), ref(X, position, P), string(P)
  - c(X) :- p(Z), ref(Z, worksfor, X), ref(X, manager, M), p(M), ref(X, employee, E), p(E), ref(X, name, N), string(N), ref(X, address, A), string(A)
16. Fixpoint semantics
- Least fixpoint semantics
  - We start from an empty set of typing facts. Every rule body requires some typing fact, so nothing is derived. Hence, the empty set of facts is the least fixpoint for this program
- Greatest fixpoint semantics
  - Types the largest set of objects
  - The goal is to find the greatest fixpoint for a given data graph. The desired model is the greatest fixpoint containing D.
17. Fixpoint semantics continued...
- Consider the following data graph D
    &o1 { company: &o2 { name: &o5 "o2",
                         address: &o6 "Versailles",
                         manager: &o3,
                         employee: &o3, employee: &o4 },
          person: &o3 { name: &o7 "Francois",
                        position: &o8 "CEO",
                        worksfor: &o2 },
          person: &o4 { name: &o9 "Lucien",
                        position: &o10 "programmer",
                        worksfor: &o2 } }
- As facts:
  - ref(o1, company, o2), ref(o2, name, o5), etc.
  - string(o5), string(o6), etc.
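The translation from the data graph to `ref` and `string` facts can be sketched as follows. The dictionary encoding of D and the function name `to_facts` are assumptions made for this sketch: complex objects map to a list of labeled edges, atomic objects map to their string value.

```python
# Assumed encoding of the data graph D from the slide.
D = {
    "o1": [("company", "o2"), ("person", "o3"), ("person", "o4")],
    "o2": [("name", "o5"), ("address", "o6"), ("manager", "o3"),
           ("employee", "o3"), ("employee", "o4")],
    "o3": [("name", "o7"), ("position", "o8"), ("worksfor", "o2")],
    "o4": [("name", "o9"), ("position", "o10"), ("worksfor", "o2")],
    "o5": "o2", "o6": "Versailles", "o7": "Francois", "o8": "CEO",
    "o9": "Lucien", "o10": "programmer",
}

def to_facts(graph):
    """Return (refs, strings): refs is a set of (source, label, target)
    triples, strings is the set of oids holding a string value."""
    refs, strings = set(), set()
    for oid, content in graph.items():
        if isinstance(content, str):
            strings.add(oid)
        else:
            refs.update((oid, label, target) for label, target in content)
    return refs, strings

refs, strings = to_facts(D)
print(len(refs), sorted(strings))
```

Each edge becomes one `ref` fact and each atomic string object becomes one `string` fact, which is the extensional input that the typing rules work over.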
18. Deriving the greatest fixpoint
- The desired model M can be derived by starting from a model containing D and all possible typing facts. Let
  - J0 = D ∪ {r(o1), r(o2), r(o3), r(o4), p(o1), p(o2), p(o3), p(o4), c(o1), c(o2), c(o3), c(o4)}
- Deriving from J0 until a fixpoint is reached yields the desired model
  - M = J2 = J1 = D ∪ {r(o1), c(o2), p(o3), p(o4)}
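The iteration can be sketched in a few lines. To keep the example short, this sketch uses a simplified rule set that keeps only the outgoing-edge conditions of each class, which is an assumption and not the full rule set of the Datalog slide. The names `RULES`, `holds`, `step`, and `fixpoint` are likewise hypothetical.

```python
REFS = {
    ("o1", "company", "o2"), ("o1", "person", "o3"), ("o1", "person", "o4"),
    ("o2", "name", "o5"), ("o2", "address", "o6"), ("o2", "manager", "o3"),
    ("o2", "employee", "o3"), ("o2", "employee", "o4"),
    ("o3", "name", "o7"), ("o3", "position", "o8"), ("o3", "worksfor", "o2"),
    ("o4", "name", "o9"), ("o4", "position", "o10"), ("o4", "worksfor", "o2"),
}
STRINGS = {"o5", "o6", "o7", "o8", "o9", "o10"}
OBJECTS = ["o1", "o2", "o3", "o4"]

# Simplified rules (assumption): each class lists the outgoing edges it
# requires, as (label, class of the target).
RULES = {
    "r": [("person", "p"), ("company", "c")],
    "p": [("worksfor", "c"), ("name", "string"), ("position", "string")],
    "c": [("manager", "p"), ("employee", "p"),
          ("name", "string"), ("address", "string")],
}

def holds(cls, oid, facts):
    """string(oid) is extensional; r, p, c are looked up in the current facts."""
    if cls == "string":
        return oid in STRINGS
    return (cls, oid) in facts

def step(facts):
    """Apply every rule once: return the typing facts whose body holds in facts."""
    return {
        (cls, x)
        for cls, body in RULES.items()
        for x in OBJECTS
        if all(any(s == x and l == label and holds(target_cls, t, facts)
                   for (s, l, t) in REFS)
               for label, target_cls in body)
    }

def fixpoint(start):
    """Iterate step from start until nothing changes."""
    current = start
    while True:
        nxt = step(current)
        if nxt == current:
            return current
        current = nxt

least = fixpoint(set())                                      # start from no typing facts
greatest = fixpoint({(c, x) for c in RULES for x in OBJECTS})  # start from J0
print(least)
print(sorted(greatest))
```

Starting from no typing facts, nothing is ever derived. Starting from all typing facts, one step already removes everything except r(o1), c(o2), p(o3), and p(o4), and a second step confirms that this set is stable.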
19. Simulation
- The aim is to produce a schema graph for a data graph, whose semantics amounts to a listing of all permitted labels.
- A schema graph is similar to a data graph with the following changes
  - Labels can be alternations (like address | name | url) or the wildcard underscore (_)
  - Atomic values are type names, like string, int, float, etc.
  - Oids of complex objects are called classes, like Person, Company, etc.
20. [Figure: example data graph. Nodes: root r1; p1, p2, p3; c1, c2; s0 to s10; a1 to a7. Edge labels: person, company, manager (mgr), emp, worksfor, name, position, addr, phone, url, description, task, performance, salesrep, procurement, contact. Other labels and values shown: Smith, Mgr, Widget, Trent, Joe, 1997, 1998.]
21. Schema graph
[Figure: schema graph with nodes Root, Company, Person, Any, and String. Edge labels: company, person, employee, manager, worksfor, name | address | url, description, name | phone | position, and the wildcard _.]
22. Simulation continued...
- Simulation is defined as follows
- Given graphs G1 = (V1, E1) and G2 = (V2, E2), a relation R ⊆ V1 × V2 is a simulation if it satisfies
  - ∀l ∈ L, ∀x1, y1 ∈ V1, ∀x2 ∈ V2 ((x1 -l-> y1 ∧ x1 R x2) ⇒ ∃y2 ∈ V2 (y1 R y2 ∧ x2 -l-> y2))
- The rule says that every edge in G1 must have a corresponding edge in G2 under the simulation R
[Figure: an edge x1 -l-> y1 in G1 and the corresponding edge x2 -l-> y2 in G2, with x1 R x2 and y1 R y2.]
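The definition translates almost directly into a check. The following is a minimal sketch, assuming graphs are given as sets of (source, label, target) triples and the relation as a set of pairs; the function name `is_simulation` and the example graphs are hypothetical.

```python
def is_simulation(edges1, edges2, relation):
    """Return True if relation, a set of (x1, x2) pairs, is a simulation
    from graph 1 to graph 2."""
    for (x1, label, y1) in edges1:
        for (a, x2) in relation:
            if a != x1:
                continue
            # Need an edge x2 -label-> y2 in graph 2 such that y1 R y2.
            if not any(s == x2 and l == label and (y1, y2) in relation
                       for (s, l, y2) in edges2):
                return False
    return True

# Hypothetical graphs: two a-edges in G1 are both matched by one a-edge in G2.
g1 = {("x1", "a", "y1"), ("x1", "a", "z1")}
g2 = {("x2", "a", "y2")}

print(is_simulation(g1, g2, {("x1", "x2"), ("y1", "y2"), ("z1", "y2")}))
print(is_simulation(g1, g2, {("x1", "x2"), ("y1", "y2")}))
print(is_simulation(g1, g2, set()))
```

The second relation fails because the edge from x1 to z1 has no counterpart whose target is related to z1. The empty relation is trivially a simulation, which is why extra requirements on the roots are needed before a simulation says anything about conformance.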
23. Simulation continued...
- To define a simulation between a semistructured data instance and a schema graph, we add the following requirements
  - The roots must be in the simulation: r1 R r2, where r1 and r2 are the roots of the data graph and the schema graph. We say the simulation is rooted
  - Whenever x R y, if y is an atomic type (like string, int), then x must be an atomic node too and have a value of that type. We say the simulation is typed
24. Simulation continued...
    Data node             Schema node
    r1                    Root
    c1, c2                Company
    p1, p2, p3            Person
    s0, s1, s2, s3        string
    a1, a2, a3, a4, ...   Any
- The relation R defined by the example data graph and the given schema graph is a simulation
25. Back to the typing problem
- When does a data graph D conform to a schema graph S?
  - When there exists a rooted, typed simulation between the data and the schema
- Which objects belong to each class?
  - The principle is that oid o should belong to class c if o R c. In this way, a rooted simulation R will always classify all objects.
  - However, the classification need not be unique, which leads to finding the maximal simulation
26. [Figure: a data graph D and a schema graph S. Node names shown: o, b1, b2. Edge labels: book, author, title, publisher, year. Leaves are of type string.]
27. Maximal simulation
- G1 <_R G2 means that R is a simulation from G1 to G2
- Fact
  - If G1 <_R1 G2 and G1 <_R2 G2, then G1 <_(R1 ∪ R2) G2
  - For any data graph D conforming to some schema graph S, there is always a maximal simulation from D to S.
- Back to the problem: which objects belong to each class?
  - An object o belongs to some class c if o R c, where R is the maximal simulation between the OEM data and the schema graph
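One common way to compute a maximal simulation is to start from all candidate pairs and repeatedly delete pairs that violate the simulation condition. The sketch below follows that idea under several assumptions: graphs are sets of (source, label, target) triples, schema labels are either the wildcard `_` or alternations written with `|`, and the function names, the small data graph, and the schema graph are hypothetical.

```python
def maximal_simulation(data_edges, data_values, schema_edges, atomic_types):
    """data_values: oid -> Python value for atomic data nodes.
    atomic_types: schema node -> Python type for atomic schema nodes."""
    data_nodes = {n for (s, _, t) in data_edges for n in (s, t)}
    schema_nodes = {n for (s, _, t) in schema_edges for n in (s, t)}

    def typed_ok(x, y):
        # Typed condition: an atomic schema node only accepts atomic data
        # nodes whose value has the matching type.
        if y in atomic_types:
            return x in data_values and isinstance(data_values[x], atomic_types[y])
        return True

    def label_matches(schema_label, label):
        return schema_label == "_" or label in schema_label.split("|")

    relation = {(x, y) for x in data_nodes for y in schema_nodes if typed_ok(x, y)}
    changed = True
    while changed:
        changed = False
        for (x, y) in list(relation):
            for (s, label, t) in data_edges:
                if s != x:
                    continue
                # Every data edge out of x needs a matching schema edge out of y.
                if not any(s2 == y and label_matches(l2, label) and (t, t2) in relation
                           for (s2, l2, t2) in schema_edges):
                    relation.discard((x, y))
                    changed = True
                    break
    return relation

def classify(relation, data_values):
    """Classes of each complex object; atomic nodes have no outgoing edges,
    so the simulation condition alone never excludes them, and they are
    left out of the classification here."""
    classes = {}
    for (x, y) in relation:
        if x not in data_values:
            classes.setdefault(x, set()).add(y)
    return classes

data_edges = {("r1", "company", "c1"), ("r1", "person", "p1"),
              ("c1", "name", "s1"), ("c1", "manager", "p1"),
              ("p1", "name", "s2"), ("p1", "worksfor", "c1")}
data_values = {"s1": "Widget", "s2": "Smith"}
schema_edges = {("Root", "company", "Company"), ("Root", "person", "Person"),
                ("Company", "name|address|url", "String"),
                ("Company", "manager", "Person"), ("Company", "employee", "Person"),
                ("Person", "name|phone|position", "String"),
                ("Person", "worksfor", "Company")}

R = maximal_simulation(data_edges, data_values, schema_edges, {"String": str})
print(("r1", "Root") in R)          # rooted: the two roots are related
print(classify(R, data_values))
```

Because the union of two simulations is again a simulation, the order in which violating pairs are deleted does not affect the result, and each complex object ends up with exactly the classes that some simulation could assign to it.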