Typing Semistructured Data

Provided by: joseemend
Learn more at: https://www.cs.nmsu.edu
1
Typing Semistructured Data
By Keshava Reddy Kottapally and Goutham
Chinnapolamada
Source: Serge Abiteboul, Dan Suciu, Peter
Buneman, Data on the Web: From Relations to
Semistructured Data and XML, Morgan Kaufmann
Series, ISBN 1-55860-622-X, 1999
2
Typing Semistructured Data
  • Introduction: schema for semistructured data
  • Motivation for typing Semistructured data
  • Schema formalisms
  • First-order logic
  • Datalog
  • Graph simulations
  • Extracting schemas from data
  • Inferring schemas from queries
  • Path constraints

3
What is semistructured data..?
  • Semistructured data has some structure, but is
    difficult to describe with a predefined, rigid
    schema
  • Irregularity
  • Continual evolution
  • Structure that is implicit or unknown to the user

4
What is typing..?
  • Typing is about finding the structure of
    semistructured data
  • The idea of structuring semistructured data is
    still an area of much research activity
  • Typing involves finding methods to provide
    schemas for semistructured data
  • Typing methods for SSD differ from those for
    relational or object-oriented data, so separate
    methods are needed

5
Uses of typing SSD
  • To optimize query evaluation
  • Example
  • Original query
  • select X.title
  • from biblio._ X
  • where X.*.zip = "12345"
  • Optimized form
  • select X.title
  • from biblio.book X
  • where X.address.zip = "12345"
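
To make the benefit concrete, here is a minimal Python sketch of the two query forms. The nested dictionaries, the titles, and the function names are all made up for illustration; they are not from the source. The first function scans every child of biblio and searches every path for a zip edge, while the second follows only the path that the schema says can lead to a zip.

```python
# Hypothetical bibliography instance: nested dicts stand in for the data graph.
biblio = {
    "book": [
        {"title": "Book A", "address": {"city": "X", "zip": "12345"}},
        {"title": "Book B", "address": {"city": "Y", "zip": "99999"}},
    ],
    "paper": [
        {"title": "Paper C", "journal": "J", "year": "1999"},
    ],
}


def reachable_values(node, label):
    """Yield every value reachable from node through any path ending in label."""
    if isinstance(node, dict):
        for key, child in node.items():
            if key == label:
                yield child
            yield from reachable_values(child, label)
    elif isinstance(node, list):
        for child in node:
            yield from reachable_values(child, label)


def original_query(db, zip_code):
    # Untyped form: visit every child of biblio and search every path for zip.
    return [x["title"]
            for children in db.values()
            for x in children
            if zip_code in reachable_values(x, "zip")]


def optimized_query(db, zip_code):
    # Typed form: only books have an address, and zip sits directly under it.
    return [x["title"]
            for x in db.get("book", [])
            if x.get("address", {}).get("zip") == zip_code]
```

Both functions return the same titles on this instance, but the optimized one never touches the paper objects and never explores paths that cannot reach a zip.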

6
[Figure: graph for the biblio example. Nodes C1 to C5; edge labels biblio, book, paper, title, author, first name, last name, address, street, city, zip, journal, year; leaf type string.]
7
Uses of typing continued...
  • To facilitate the task of integrating several
    data sources
  • To improve storage
  • Better clustering may reduce number of page
    fetches, thus improving query performance
  • To construct indexes
  • To describe the database content to users and
    facilitate query formulation
  • To proscribe certain updates

8
Two ways of typing..
  • Schema extraction
  • Given one particular data instance, finding the
    most specific schema for it
  • With semistructured data we may specify the type
    after the database is populated
  • A data instance may have more than one type
  • Schema inference
  • Finding the most specific schema by analyzing the
    query
  • This process is similar to type inference in
    programming languages

9
The problem
  • Given a database and a type,
  • does the database conform to this type?
  • Classification of objects
  • Which objects belong to each class..?
  • Typing involves description of the structure of
    each class and its relationships with other
    classes

10
Difference between typing SSD and Object Databases
  • Classes are defined less precisely. As a
    consequence, objects may belong to several
    classes
  • Some objects may not belong to any class or may
    have properties that do not pertain to any class
  • The typing may be approximate. For example, we
    may accept in a class an object that does not
    quite conform to the specification of that class.

11
Schema formalisms
  • First-order logic
  • Datalog
  • Simulation

12
First-order logic
  • Example Consider three kinds of objects in the
    database
  • Root object(s) have
  • Outgoing edges labeled company to company objects
    and person to person objects
  • Person objects have
  • Outgoing edges labeled name and position to
    string objects
  • Outgoing edges labeled worksfor to company
    objects
  • Incoming edges labeled manager and employee from
    company objects
  • Company objects have
  • Outgoing edges labeled name and address to string
    objects
  • Outgoing edges labeled manager and employee to
    person objects
  • Incoming edges labeled worksfor from person
    objects

13
  • If
  • If an object has a-edges to strings and b-edges
    from c-objects, then it is a c-object.
  • ∃Y, Z (ref(X,a,Y) ∧ string(Y) ∧ c(Z) ∧
    ref(Z,b,X)) → c(X)
  • Only-if
  • Any c-object has some a-edges to strings and
    some b-edges from c-objects
  • ∃Y, Z (ref(X,a,Y) ∧ string(Y) ∧ c(Z) ∧
    ref(Z,b,X)) ← c(X)
  • If and only if
  • ∃Y, Z (ref(X,a,Y) ∧ string(Y) ∧ c(Z) ∧
    ref(Z,b,X)) ↔ c(X)
  • Consequence
  • c(X) ∧ ref(Z,b,X) → c(Z)
  • c(X) ∧ ref(X,a,Y) → string(Y)
  • c(X) ∧ ref(X,L,Y) ∧ L ≠ a ∧ L ≠ b → false
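
The three consequences can be checked mechanically on a finite set of facts. The sketch below is only an illustration: the function name, the rule descriptions, and the tiny example graph are assumptions, not part of the source. It scans each ref edge once and reports every edge that breaks one of the three formulas.

```python
def check_consequences(ref, strings, c):
    """Return (rule, edge) pairs for every edge that breaks a consequence.

    ref     -- set of (source, label, target) edges
    strings -- set of oids that are string objects
    c       -- set of oids classified as c-objects
    """
    broken = []
    for (src, label, dst) in sorted(ref):
        # c(X) and ref(Z,b,X) imply c(Z)
        if dst in c and label == "b" and src not in c:
            broken.append(("b-edges into c come from c", (src, label, dst)))
        # c(X) and ref(X,a,Y) imply string(Y)
        if src in c and label == "a" and dst not in strings:
            broken.append(("a-edges from c reach strings", (src, label, dst)))
        # c(X) and ref(X,L,Y) with L other than a or b is forbidden
        if src in c and label not in ("a", "b"):
            broken.append(("c has only a- and b-edges", (src, label, dst)))
    return broken


# Hypothetical facts: two c-objects pointing at each other with b-edges.
ref = {("x1", "a", "s1"), ("x2", "a", "s2"),
       ("x1", "b", "x2"), ("x2", "b", "x1")}
strings = {"s1", "s2"}
c = {"x1", "x2"}

ok = check_consequences(ref, strings, c)
extra_label = check_consequences(ref | {("x1", "d", "s1")}, strings, c)
```

Here ok is empty, while extra_label reports the d-edge, since a c-object may only have a- and b-edges.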

14
Problem definition with first-order logic
  • The previous questions on typing can be restated
    in terms of first-order logic
  • Does D satisfy T, noted D ⊨ T, that is, is there
    a model of T that coincides with D over the
    extensional predicates..?
  • If D ⊨ T, what is the classification that is
    induced..?
  • First-order logic leads to very general typings,
    probably too general for what is needed in
    semistructured data
  • It could also lead to undecidability or
    intractability

15
Datalog A rule-based language
  • Datalog allows us to state that if a conjunction
    of facts holds, then some new fact can be derived
  • Datalog rules allow us to define classes by
    specifying what incoming and outgoing edges are
    required
  • Example
  • r(X) :- ref(X, person, Y), p(Y), ref(X, company,
    Z), c(Z)
  • p(X) :- c(Y), ref(Y, manager, X), c(Z), ref(Z,
    employee, X), ref(X, worksfor, U), c(U), ref(X,
    name, N), string(N), ref(X, position, P),
    string(P)
  • c(X) :- p(Z), ref(Z, worksfor, X), ref(X,
    manager, M), p(M), ref(X, employee, E), p(E),
    ref(X, name, N), string(N), ref(X, address, A),
    string(A)

16
Fixpoint semantics
  • Least fixpoint semantics
  • Starting from D with no typing facts, no rule can
    fire, because every rule body requires a typing
    fact. Hence, the least fixpoint assigns no object
    to any class
  • Greatest fixpoint semantics
  • Typing the largest set of objects
  • The goal is to find the greatest fixpoint for a
    given data graph. The desired model is the
    greatest fixpoint containing D.

17
  • Consider the following data graph D
  • o1 { company: o2 { name: o5 "O2",
  •        address: o6 "Versailles",
  •        manager: o3,
  •        employee: o3, employee: o4 },
  •      person: o3 { name: o7 "Francois",
  •        position: o8 "CEO",
  •        worksfor: o2 },
  •      person: o4 { name: o9 "Lucien",
  •        position: o10 "programmer",
  •        worksfor: o2 } }
  • ref(o1, company, o2), ref(o2, name, o5), etc.
  • string(o5), string(o6), etc.
18
Deriving the greatest fixpoint
  • The desired model M can be derived by starting
    from a model containing D and all possible typing
    facts. Let
  • J0 = D ∪ { r(o1), r(o2), r(o3), r(o4),
    p(o1), p(o2), p(o3), p(o4), c(o1), c(o2),
    c(o3), c(o4) }
  • Deriving from J0 until a fixpoint is reached
    yields the desired model
  • M = J2 = J1 = D ∪ { r(o1), c(o2), p(o3), p(o4) }
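
The downward iteration from J0 can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation. The rules are hard-coded in the holds function, and the person rule is simplified: it keeps the incoming employee edge but drops the incoming manager edge, an assumption made so that the sketch reproduces the model M listed above. All function and variable names are invented for the example.

```python
OIDS = ["o1", "o2", "o3", "o4"]
STRING = {"o5", "o6", "o7", "o8", "o9", "o10"}
REF = {
    ("o1", "company", "o2"), ("o1", "person", "o3"), ("o1", "person", "o4"),
    ("o2", "name", "o5"), ("o2", "address", "o6"), ("o2", "manager", "o3"),
    ("o2", "employee", "o3"), ("o2", "employee", "o4"),
    ("o3", "name", "o7"), ("o3", "position", "o8"), ("o3", "worksfor", "o2"),
    ("o4", "name", "o9"), ("o4", "position", "o10"), ("o4", "worksfor", "o2"),
}


def out(x, label):
    """Targets of the edges labeled `label` leaving x."""
    return [d for (s, l, d) in REF if s == x and l == label]


def inc(x, label):
    """Sources of the edges labeled `label` entering x."""
    return [s for (s, l, d) in REF if d == x and l == label]


def holds(cls, x, facts):
    """True if the (simplified) rule for cls derives cls(x) from facts."""
    def typed(c, ys):
        return any((c, y) in facts for y in ys)

    def strs(ys):
        return any(y in STRING for y in ys)

    if cls == "r":
        return typed("p", out(x, "person")) and typed("c", out(x, "company"))
    if cls == "p":  # simplified: no incoming manager edge required
        return (typed("c", inc(x, "employee")) and typed("c", out(x, "worksfor"))
                and strs(out(x, "name")) and strs(out(x, "position")))
    if cls == "c":
        return (typed("p", inc(x, "worksfor")) and typed("p", out(x, "manager"))
                and typed("p", out(x, "employee"))
                and strs(out(x, "name")) and strs(out(x, "address")))
    raise ValueError(cls)


def greatest_fixpoint():
    facts = {(c, x) for c in "rpc" for x in OIDS}   # J0: every typing fact
    while True:
        kept = {(c, x) for (c, x) in facts if holds(c, x, facts)}
        if kept == facts:                           # nothing removed: fixpoint
            return facts
        facts = kept


model = greatest_fixpoint()
```

The loop only ever removes facts, so it terminates after at most as many rounds as there are typing facts in J0. On this data the first round already yields r(o1), c(o2), p(o3), p(o4), and the second round confirms that nothing else can be removed.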

19
Simulation
  • The aim is to produce a schema graph for a data
    graph whose semantics lead to a listing of all
    permitted labels.
  • A schema graph is similar to a data graph with
    the following changes
  • Labels can be alternations (like address | name |
    url) or the wildcard underscore (_)
  • Atomic values are type names, like string, int,
    float, etc.
  • Oids of complex objects are called classes,
    like Person, Company, etc.

20
[Figure: example data graph. Nodes r1, p1 to p3, c1, c2, s0 to s10, a1 to a7; edge labels person, company, manager (mgr), emp, worksfor, name, position, addr, phone, url, description, task, performance, salesrep, procurement, contact; atomic values include Smith, Mgr, Widget, Trent, Joe, 1997, 1998.]
21
Schema graph
[Figure: schema graph. Nodes Root, Company, Person, Any, String; edge labels company, person, employee, manager, worksfor, name | address | url, name | phone | position, description, _.]
22
  • Simulation is defined as follows
  • Given graphs G1 = (V1, E1) and G2 = (V2, E2), a
    relation R ⊆ V1 × V2
  • is a simulation if it satisfies
  • ∀l ∈ L ∀x1, y1 ∈ V1 ∀x2 ∈ V2 ((x1, l, y1) ∈ E1 ∧
    x1 R x2 ⇒ ∃y2 ∈ V2 (y1 R y2 ∧ (x2, l, y2) ∈ E2))
  • The rule says that every edge in G1 must have a
    corresponding edge in G2 under the simulation

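[Figure: simulation diagram. An edge labeled l from x1 to y1 in G1 and an edge labeled l from x2 to y2 in G2, with x1 R x2 and y1 R y2.]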
23
  • To define a simulation between a semistructured
    data instance and a schema graph, we add the
    following additional requirements
  • The roots must be in the simulation: r R r′, where
    r is the data root and r′ the schema root. We say
    the simulation is rooted
  • Whenever x R y, if y is an atomic type (like
    string, int), then x must be an atomic node too
    and have a value of that type. We say the
    simulation is typed

24
  • Data node          Schema node
  • r1                 Root
  • c1, c2             Company
  • p1, p2, p3         Person
  • s0, s1, s2, s3     string
  • a1, a2, a3, a4, ...  Any
  • The relation R defined by the example data graph
    and the given schema graph is a simulation

25
Back to the typing problem.
  • When does a data graph D conform to a schema
    graph S..?
  • When there exists a rooted, typed simulation
    between the data and the schema
  • Which objects belong to each class..?
  • The principle is that oid o should belong to
    class c if o R c. In this way, a rooted
    simulation R will always classify all objects.
  • However, the classification need not be unique,
    which motivates looking for the maximal simulation

26
[Figure: data graph D and schema graph S. D: labels book, author, title, with object o. S: classes b1 and b2, labels book, publisher, year, title, author, with string leaves.]
27
Maximal simulation
  • G1 <R G2 means that R is a simulation from G1 to
    G2
  • Fact
  • If G1 <R1 G2 and G1 <R2 G2, then G1 <R1∪R2 G2
  • For any data graph D conforming to some schema
    graph S, there is always a maximal simulation
    from D to S.
  • Back to the problem: which objects belong to each
    class?
  • An object o belongs to some class c if o R c,
    where R is the maximal simulation between the OEM
    data and the schema graph
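
Because the union of two simulations is again a simulation, the maximal one can be found by starting from all candidate pairs and deleting pairs that violate the definition. The sketch below shows this refinement under simplifying assumptions that are not from the source: schema labels are plain labels (no alternations or wildcard), string is the only atomic type, and the small graphs are a made-up variant of the book example with classes b1 and b2.

```python
def maximal_simulation(data_nodes, data_edges, atomic_data,
                       schema_nodes, schema_edges, atomic_types):
    """Refine the full relation down to the largest typed simulation."""
    # Typed condition: an atomic schema type may only be paired with atomic data.
    R = {(x, y) for x in data_nodes for y in schema_nodes
         if y not in atomic_types or x in atomic_data}
    changed = True
    while changed:
        changed = False
        for (x1, x2) in list(R):
            for (s, label, y1) in data_edges:
                if s != x1:
                    continue
                # The data edge x1 -label-> y1 needs a schema edge x2 -label-> y2
                # with (y1, y2) still in R; otherwise the pair is removed.
                if not any(t == x2 and l2 == label and (y1, y2) in R
                           for (t, l2, y2) in schema_edges):
                    R.discard((x1, x2))
                    changed = True
                    break
    return R


def classes_of(R, oid):
    """All classes c with oid R c."""
    return {c for (x, c) in R if x == oid}


# Hypothetical data graph: root d with one book o that has a title and an author.
data_nodes = {"d", "o", "t1", "a1"}
data_edges = {("d", "book", "o"), ("o", "title", "t1"), ("o", "author", "a1")}
atomic_data = {"t1", "a1"}

# Hypothetical schema graph: b1 allows publisher and year, b2 does not.
schema_nodes = {"S", "b1", "b2", "string"}
schema_edges = {
    ("S", "book", "b1"), ("S", "book", "b2"),
    ("b1", "title", "string"), ("b1", "author", "string"),
    ("b1", "publisher", "string"), ("b1", "year", "string"),
    ("b2", "title", "string"), ("b2", "author", "string"),
}

R = maximal_simulation(data_nodes, data_edges, atomic_data,
                       schema_nodes, schema_edges, {"string"})
book_classes = classes_of(R, "o")
```

On this instance the book o ends up in both b1 and b2, which illustrates that the classification need not be unique. Note that leaf nodes have no outgoing edges, so the plain definition also relates them to complex classes; a stricter variant could filter those pairs out.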