Typing Semistructured Data - PowerPoint PPT Presentation

About This Presentation

Title:

Typing Semistructured Data

Description:

Greatest fixpoint semantics. Typing the largest set of objects. The goal is to find the greatest fixpoint for a given data graph. ... – PowerPoint PPT presentation

Slides: 28

Provided by: joseemend

Learn more at: https://www.cs.nmsu.edu

Transcript and Presenter's Notes

Title: Typing Semistructured Data

1
Typing Semistructured Data
By Keshava Reddy Kottapally and Goutham
Chinnapolamada
Source: Serge Abiteboul, Dan Suciu, Peter
Buneman, Data on the Web: From Relations to
Semistructured Data and XML, Morgan Kaufmann
Series, ISBN 1-55860-622-X, 1999
2
Typing Semistructured Data

Introduction Schema for Semistructured data
Motivation for typing Semistructured data
Schema formalisms
First-order logic
Datalog
Graph simulations
Extracting schemas from data
Inferring schemas from queries
Path constraints

3
What is semistructured data..?

Semistructured data has some structure, but is
difficult to describe with a predefined, rigid
schema
Irregularity
Continual evolution
Structure that is implicit or unknown to the user

4
What is typing..?

Typing is about finding the structure of
semistructured data
The idea of structuring semistructured data is
still an area of much research activity
Typing involves finding methods to provide
schemas for semistructured data
Typing for SSD differ from those for relational
or object-oriented data and hence needs separate
methods

5
Uses of typing SSD

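To optimize query evaluation
Example
Original query
select X.title
from biblio._ X
where X.*.zip = 12345
Optimized form
select X.title
from biblio.book X
where X.address.zip = 12345
The schema tells us that zip is reachable only
through book and address, so the wildcards _ and *
can be replaced by explicit labels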
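
A minimal Python sketch of why the rewrite helps, under stated assumptions: the tiny `BIBLIO` instance, its titles and zip codes, and the helper names `reachable_values`, `original_query`, and `optimized_query` are all hypothetical and not part of the original slides. The original query has to look below every child of biblio along every path, while the optimized form follows only book, address, and zip.

```python
# Hypothetical data: each object maps an edge label to a list of children.
BIBLIO = {
    "book": [
        {"title": ["Book A"], "address": [{"city": ["X"], "zip": ["12345"]}]},
        {"title": ["Book B"], "address": [{"city": ["Y"], "zip": ["69000"]}]},
    ],
    "paper": [
        {"title": ["Paper C"], "journal": ["J"], "year": ["1998"]},
    ],
}

def reachable_values(node, label):
    """Yield every value reached by a path of any length ending in `label`."""
    if not isinstance(node, dict):
        return
    for edge, children in node.items():
        for child in children:
            if edge == label:
                yield child
            yield from reachable_values(child, label)

def original_query(db, zip_code):
    """select X.title from biblio._ X where X.*.zip = zip_code"""
    titles = []
    for children in db.values():            # wildcard _: any edge out of biblio
        for x in children:
            if zip_code in reachable_values(x, "zip"):   # X.*.zip
                titles.extend(x.get("title", []))
    return titles

def optimized_query(db, zip_code):
    """select X.title from biblio.book X where X.address.zip = zip_code"""
    titles = []
    for x in db.get("book", []):             # only book edges
        addresses = x.get("address", [])
        if any(zip_code in a.get("zip", []) for a in addresses):
            titles.extend(x.get("title", []))
    return titles
```

On this instance both functions return the same titles, but the optimized one never visits the paper objects or any subtree other than address.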

6
[Figure: schema graph of the biblio database. Edge
labels: biblio, book, paper, author, first name,
last name, address, street, city, zip, title,
journal, year. Leaf nodes are of type string.]
7
Uses of typing continued...

To facilitate the task of integrating several
data sources
To improve storage
Better clustering may reduce number of page
fetches, thus improving query performance
To construct indexes
To describe the database content to users and
facilitate query formulation
To proscribe certain updates

8
Two ways of typing..

Schema extraction
Given one particular data instance, finding the
most specific schema for it
With semistructured data we may specify the type
after the database is populated
A data instance may have more than one type
Schema inference
Finding the most specific schema by analyzing the
query
This process is similar to type inference in
programming languages

9
The problem

Given a database and a type,
does the database conform to this type?
Classification of objects
Which objects belong to each class..?
Typing involves description of the structure of
each class and its relationships with other
classes

10
Difference between typing SSD and Object Databases

Classes are defined less precisely. As a
consequence, objects may belong to several
classes
Some objects may not belong to any class or may
have properties that do not pertain to any class
The typing may be approximate. For example, we
may accept in a class an object that does not
quite conform to the specification of that class.

11
Schema formalisms

First-order logic
Datalog
Simulation

12
First-order logic

Example Consider three kinds of objects in the
database
Root object(s) have
Outgoing edges labeled company to company objects
and person to person objects
Person objects have
Outgoing edges labeled name and position to
string objects
Outgoing edges labeled worksfor to company
objects
Incoming edges labeled manager and employee from
company objects
Company objects have
Outgoing edges labeled name and address to string
objects
Outgoing edges labeled manager and employee to
person objects
Incoming edges labeled worksfor from person
objects

If
If an object has a-edges to strings and b-edges
from c-objects, then it is a c-object.
∃Y, Z (ref(X,a,Y) ∧ string(Y) ∧ c(Z) ∧
ref(Z,b,X)) → c(X)
Only-if
Any c-object has some a-edges to strings and
some b-edges from c-objects
∃Y, Z (ref(X,a,Y) ∧ string(Y) ∧ c(Z) ∧
ref(Z,b,X)) ← c(X)
If and only if
∃Y, Z (ref(X,a,Y) ∧ string(Y) ∧ c(Z) ∧
ref(Z,b,X)) ↔ c(X)
Consequence
c(X) ∧ ref(Z,b,X) → c(Z)
c(X) ∧ ref(X,a,Y) → string(Y)
c(X) ∧ ref(X,L,Y) ∧ L ≠ a ∧ L ≠ b → false
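
The three consequence constraints can be checked mechanically against a set of ref facts. The sketch below is only an illustration: the function name `violations`, the sample facts, and the object names o1, o2, o3, s1, s2 are hypothetical. It returns every ref fact that breaks one of the constraints for a proposed set of c-objects.

```python
def violations(refs, strings, c_objects):
    """Return the ref facts that violate a constraint for class c.

    refs      : set of (source, label, target) triples, i.e. ref(source, label, target)
    strings   : set of objects for which string(Y) holds
    c_objects : proposed set of objects for which c(X) holds
    """
    bad = []
    for (src, label, tgt) in refs:
        # c(X) and ref(Z,b,X) imply c(Z): b-edges into c-objects come from c-objects.
        if label == "b" and tgt in c_objects and src not in c_objects:
            bad.append((src, label, tgt))
        # c(X) and ref(X,a,Y) imply string(Y): a-edges of c-objects lead to strings.
        elif label == "a" and src in c_objects and tgt not in strings:
            bad.append((src, label, tgt))
        # c(X) and ref(X,L,Y) with L not in {a, b} is forbidden.
        elif src in c_objects and label not in ("a", "b"):
            bad.append((src, label, tgt))
    return bad

# Hypothetical facts used only to exercise the check.
REFS_OK = {("o1", "b", "o2"), ("o2", "a", "s1"), ("o1", "a", "s2")}
REFS_BAD = REFS_OK | {("o3", "b", "o1"), ("o2", "d", "s1")}
STRINGS = {"s1", "s2"}
C = {"o1", "o2"}
```

With `C` as the proposed class, `REFS_OK` produces no violations, while `REFS_BAD` is rejected because o3 is not a c-object and because o2 has an edge labeled d.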

14
Problem definition with first-order logic

The previous questions on typing can be restated
in terms of first-order logic
Does D satisfy T, noted D ⊨ T, that is, is there
a model of T that coincides with D over the
extensional predicates..?
If D ⊨ T, what is the classification that is
induced..?
First-order logic leads to very general typings,
probably too general for what is needed in
semistructured data
It could also lead to undecidability or
intractability

15
Datalog A rule-based language

Datalog allows us to state that if a conjunction
of facts holds, then some new fact can be derived
Datalog rules allow us to define classes by
specifying what incoming and outgoing edges are
required
Example
r(X) :- ref(X, person, Y), p(Y),
        ref(X, company, Z), c(Z)
p(X) :- c(Y), ref(Y, manager, X),
        c(Z), ref(Z, employee, X),
        ref(X, worksfor, U), c(U),
        ref(X, name, N), string(N),
        ref(X, position, P), string(P)
c(X) :- p(Z), ref(Z, worksfor, X),
        ref(X, manager, M), p(M),
        ref(X, employee, E), p(E),
        ref(X, name, N), string(N),
        ref(X, address, A), string(A)

16
Fixpoint semantics

Least fixpoint semantics
Starting from D alone, with no typing facts, no
rule can fire, because every rule body requires
at least one typing fact. Hence the least
fixpoint classifies no object at all
Greatest fixpoint semantics
Typing the largest set of objects
The goal is to find the greatest fixpoint for a
given data graph. The desired model is the
greatest fixpoint containing D.

Consider the following data graph D
o1 { company: o2 { name: o5 "o2",
                   address: o6 "Versailles",
                   manager: o3,
                   employee: o3, employee: o4 },
     person: o3 { name: o7 "Francois",
                  position: o8 "CEO",
                  worksfor: o2 },
     person: o4 { name: o9 "Lucien",
                  position: o10 "programmer",
                  worksfor: o2 } }
Represented as facts:
ref(o1, company, o2), ref(o2, name, o5), etc.
string(o5), string(o6), etc.

18
Deriving the greatest fixpoint

The desired model M can be derived by starting
from a model containing D and all possible typing
facts. Let
J0 = D ∪ {r(o1), r(o2), r(o3), r(o4),
p(o1), p(o2), p(o3), p(o4), c(o1), c(o2),
c(o3), c(o4)}
At each step, keep only the typing facts that
some rule derives from the current set. Repeating
this from J0 until a fixpoint is reached yields
the desired model
M = J2 = J1 = D ∪ {r(o1), c(o2), p(o3), p(o4)}
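
The difference between the two semantics can be sketched in a few lines of Python. This is an illustration only: it uses a deliberately smaller, hypothetical two-rule program and data graph (not the program or the graph from the slides), and the names `REFS`, `apply_rules`, `fixpoint`, and `all_typing_facts` are assumptions of the sketch.

```python
# Hypothetical data graph, given as ref(source, label, target) facts.
REFS = {
    ("o2", "employee", "o3"),
    ("o3", "worksfor", "o2"),
    ("o4", "worksfor", "o9"),   # o9 has no employee edge, so it cannot be a c-object
}

def apply_rules(refs, facts):
    """One derivation step: the typing facts the two rules derive from `facts`.

    p(X) :- ref(X, worksfor, Y), c(Y)
    c(X) :- ref(X, employee, Y), p(Y)
    """
    derived = set()
    for (x, label, y) in refs:
        if label == "worksfor" and ("c", y) in facts:
            derived.add(("p", x))
        if label == "employee" and ("p", y) in facts:
            derived.add(("c", x))
    return derived

def fixpoint(refs, start):
    """Iterate apply_rules from `start` until the set of facts stops changing.

    Starting from the empty set the sequence only grows; starting from all
    typing facts it only shrinks. Both are guaranteed to stabilize.
    """
    facts = set(start)
    while True:
        nxt = apply_rules(refs, facts)
        if nxt == facts:
            return facts
        facts = nxt

def all_typing_facts(refs):
    """Every class paired with every object: the starting point J0."""
    objects = {o for (x, _, y) in refs for o in (x, y)}
    return {(cls, o) for cls in ("p", "c") for o in objects}

least = fixpoint(REFS, set())
greatest = fixpoint(REFS, all_typing_facts(REFS))
```

Here `least` is empty, because each rule needs a typing fact before it can fire, while `greatest` keeps p(o3) and c(o2), which support each other. The fact p(o4) survives the first step but is dropped in the second, once c(o9) has been removed.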

19
Simulation

The aim is to produce a schema graph for a data
graph whose semantics lead to a listing of all
permitted labels.
A schema graph is similar to a data graph with
the following changes
Labels can be alternations (like address | name |
url) or the wildcard underscore (_)
Atomic values are type names, like string, int,
float, etc.
Oids of complex objects are called classes,
like Person, Company, etc.

20
[Figure: example data graph. Nodes: root r1;
objects p1, p2, p3 and c1, c2; atomic nodes s0 to
s10 (values include Smith, Mgr, Widget, Trent,
Joe); further nodes a1 to a7 (values include 1997,
1998). Edge labels: person, company, manager, mgr,
emp, worksfor, name, position, addr, phone, url,
description, task, performance, salesrep,
procurement, contact.]
21
Schema graph
[Figure: schema graph. Nodes: Root, Company,
Person, Any, String. Edge labels: company, person,
employee, manager, worksfor, name | address | url,
name | phone | position, description, _ (any
label).]
22

Simulation is defined as follows
Given graphs G1 = (V1, E1), G2 = (V2, E2), a
relation R ⊆ V1 × V2
is a simulation if it satisfies
∀l ∈ L, ∀x1, y1 ∈ V1, ∀x2 ∈ V2:
((x1, l, y1) ∈ E1 ∧ x1 R x2) ⇒
∃y2 ∈ V2 (y1 R y2 ∧ (x2, l, y2) ∈ E2)
where L is the set of edge labels
The rule says that every edge in G1 must have a
corresponding edge in G2 under the simulation

[Figure: an edge labeled l from x1 to y1 in G1 and
an edge labeled l from x2 to y2 in G2, with x1 R x2
and y1 R y2.]
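
The simulation condition translates almost literally into code. The following Python sketch is an illustration under stated assumptions: graphs are given as sets of (source, label, target) triples, the relation as a set of pairs, and the function name `is_simulation` and the small sample graphs are hypothetical.

```python
def is_simulation(edges1, edges2, rel):
    """Check that `rel` is a simulation from graph 1 to graph 2.

    edges1, edges2 : sets of (source, label, target) triples
    rel            : set of (x1, x2) pairs, x1 a node of graph 1, x2 of graph 2
    """
    for (x1, label, y1) in edges1:
        for (a, x2) in rel:
            if a != x1:
                continue
            # x1 R x2 holds, so x2 needs an edge with the same label
            # to some y2 such that y1 R y2.
            matched = any(
                src == x2 and lab == label and (y1, y2) in rel
                for (src, lab, y2) in edges2
            )
            if not matched:
                return False
    return True

# Hypothetical data graph G1 and schema graph G2.
G1 = {("r1", "person", "p1"), ("p1", "name", "s1")}
G2 = {("Root", "person", "Person"), ("Person", "name", "String")}
R = {("r1", "Root"), ("p1", "Person"), ("s1", "String")}
```

`is_simulation(G1, G2, R)` holds for this relation. Dropping the pair (s1, String) breaks it, because the name edge from p1 then has no counterpart whose target is related to s1.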
23

To define a simulation between a semistructured
data instance and a schema graph, we add the
following additional requirements
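The roots must be related by the simulation: r1 R
r2, where r1 is the root of the data graph and r2
is the root of the schema graph. We say the
simulation is rooted
Whenever x R y, if y is an atomic type (like
string, int), then x must be an atomic node too
and have a value of that type. We say the
simulation is typed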

Data node              Schema node
r1                     Root
c1, c2                 Company
p1, p2, p3             Person
s0, s1, s2, s3         String
a1, a2, a3, a4, ...    Any
The relation R defined by the example data graph
and the given schema graph is a simulation

25
Back to the typing problem.

When does a data graph D conform to a schema
graph S..?
When there exists a rooted, typed simulation
between the data and the schema
Which objects belong to each class..?
The principle is that oid o should belong to
class c if o R c. In this way, a rooted
simulation R will always classify all objects.
However, the classification need not be unique,
which leads to the search for a maximal simulation

26
[Figure: a data graph D with root o and a schema
graph S with classes b1 and b2. Edge labels shown:
book, author, title, publisher, year. Leaf nodes
are of type string.]
27
Maximal simulation

G1 <_R G2 means that R is a simulation from G1 to
G2
Fact
If G1 <_R1 G2 and G1 <_R2 G2, then
G1 <_(R1 ∪ R2) G2
For any data graph D conforming to some schema
graph S, there is always a maximal simulation
from D to S: the union of all simulations from D
to S.
Back to the problem: which objects belong to each
class?
An object o belongs to some class c if o R c,
where R is the maximal simulation between the OEM
data and the schema graph
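
One common way to compute a maximal typed simulation is by refinement: start from every pair allowed by the typing condition and repeatedly discard pairs that violate the simulation condition. The Python sketch below assumes this approach. The function names `maximal_simulation` and `classes_of`, the sample graphs, and the `VALUES` and `ATOMIC_TYPES` tables are hypothetical and are not taken from the slides.

```python
def maximal_simulation(data_edges, schema_edges, values, atomic_types):
    """Largest typed simulation from the data graph to the schema graph.

    data_edges, schema_edges : sets of (source, label, target) triples
    values       : maps atomic data nodes to their Python values
    atomic_types : maps atomic schema nodes to Python types, e.g. {"String": str}
    """
    data_nodes = {n for (s, _, t) in data_edges for n in (s, t)}
    schema_nodes = {n for (s, _, t) in schema_edges for n in (s, t)}

    def type_ok(x, c):
        # Typed condition: only atomic nodes of the right type map to atomic types.
        if c not in atomic_types:
            return True
        return x in values and isinstance(values[x], atomic_types[c])

    rel = {(x, c) for x in data_nodes for c in schema_nodes if type_ok(x, c)}

    def edge_matched(label, y1, x2):
        return any(s == x2 and lab == label and (y1, y2) in rel
                   for (s, lab, y2) in schema_edges)

    changed = True
    while changed:
        changed = False
        for (x1, x2) in list(rel):
            out_edges = [(lab, y1) for (s, lab, y1) in data_edges if s == x1]
            if not all(edge_matched(lab, y1, x2) for (lab, y1) in out_edges):
                rel.discard((x1, x2))   # pair violates the simulation condition
                changed = True
    return rel

def classes_of(rel, obj):
    """All schema nodes c with obj R c."""
    return sorted(c for (x, c) in rel if x == obj)

# Hypothetical data graph and schema graph.
DATA = {("r1", "person", "p1"), ("r1", "company", "c1"),
        ("p1", "name", "s1"), ("p1", "worksfor", "c1"),
        ("c1", "name", "s2"), ("c1", "manager", "p1")}
SCHEMA = {("Root", "person", "Person"), ("Root", "company", "Company"),
          ("Person", "name", "String"), ("Person", "worksfor", "Company"),
          ("Company", "name", "String"), ("Company", "manager", "Person")}
VALUES = {"s1": "Smith", "s2": "Widget"}
ATOMIC_TYPES = {"String": str}

R_MAX = maximal_simulation(DATA, SCHEMA, VALUES, ATOMIC_TYPES)
```

In this example the data conforms to the schema because the pair (r1, Root) survives in `R_MAX`, and each complex object ends up in exactly one class. Atomic nodes such as s1 have no outgoing edges, so the plain simulation condition also relates them to every complex class; only the typed condition restricts which nodes may map to String.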