Web Data and the Resurrection of Database Theory - PowerPoint PPT Presentation

About This Presentation
Title:

Web Data and the Resurrection of Database Theory

Description:

Web Data and the Resurrection of Database Theory Dan Suciu University of Washington – PowerPoint PPT presentation

Number of Views:181
Avg rating:3.0/5.0
Slides: 41
Provided by: homesCsWa1
Category:

less

Transcript and Presenter's Notes

Title: Web Data and the Resurrection of Database Theory


1
Web Data and the Resurrection of Database Theory
  • Dan Suciu
  • University of Washington

2
  • In theory there is no difference between theory
    and practice. In practice there is.

Jan L.A. van de Snepscheut September 12, 1953 -
February 23, 1994
3
Short History of Database Theory
  • The legendary beginnings, 1970-1971
  • Relational databases are the brainchild of a
    theoretician (Codd)
  • Heavily debated at the time (against CODASYL)
  • It took several years for the concept to be
    validated in practice

Theory driving the industry
4
Short History of Database Theory
  • The golden years (end of 70s, early 80s)
  • Relational theory
  • Functional dependencies
  • Query containment
  • Transactions
  • Access methods

Theory listening to the industry
5
Short History of Database Theory
  • Refined decadence (end of 80s, early 90s)
  • Descriptive complexity
  • Logic databases
  • Complex objects
  • Constraint databases

Divorce ?
6
Database MetatheoryAsking the Big Queries
  • Christos Papadimitriou, in PODS, 1995
  • Theory is inevitable CS is a science of
    the artificial, and its artifact is being
    changed by the very act of studying it

7
Is DB Theory in a Crisis Today ?
  • Industrys focus
  • one particular data model relational/SQL
  • one particular application (client-server)
  • Theorys focus is on Logic
  • New data models, query languages (query
    containment, complex objects, recursion)
  • New applications (incomplete information, query
    rewriting using views)

8
One Example of Unused Theory
  • Containment of conjunctive queries is NP complete
    Chandra and Merlin77
  • Dozens of extensions
  • With union and difference Sagiv and
    Yannakakis81
  • With order predicates Klug88, van den
    Meyden92
  • With complex objects Levy and Suciu97
  • With regular expressions Florescu, Levy and
    Suciu98

9
Query Containment
Q1 SELECT DISTINCT x.name, x.phone
FROM Person x, Person y, Person z WHERE
x.department y.department AND
x.manager z.manager
  • The query

Is minimized to
Q2 SELECT DISTINCT x.name, x.phone FROM
Person x
The following can be checked Q1 ? Q2 and Q1 ?
Q2 hence Q1Q2
Minimization not used by RDBMs today
10
Why Today Things Are Changing
  • Just one reason The Web
  • More precisely
  • A new data model
  • Semistructured data
  • XML syntax
  • New applications
  • Transformation
  • Integration

11
Web Data Management
  • Who creates the new rules
  • W3C working groups
  • Sometimes the industry
  • The new artifacts are not concepts, but standards
  • The double role of theory
  • Long term conceptualize/rationalize
  • E.g. keys for XML Buneman, Davidson, Fan, Hara,
    Tan01
  • Short term answer technical questions

12
Some Questions for Database Theory
  • XML publishing
  • Typechecking XML transformations
  • XML storage
  • Data distribution

13
application
XML Storage
application
object-relational
XML Typechecking
Integrate
XML Data
WEB (HTTP)
Warehouse
Transform
Warehouse
XML Publishing
application
XML Distribution
relational data
legacy data
14
XML Publishing
  • Today
  • Legacy data
  • fragmented into many flat relations
  • 3rd normal form
  • proprietary
  • XML data
  • nested
  • un-normalized
  • public (450 schemas at www.biztalk.org)

15
XML Publishing an Example
Legacy data in E/R
name
country
name
url
euSid
usSid
Eu-Stores
US-Stores
date
tax
Eu-Sales
US-Sales
date
Products
pid
name
priceUSD
16
XML Publishing an Example
  • XML view
  • ltallsalesgt
  • ltcountrygt ltnamegt France lt/namegt
  • ltstoregt ltnamegt Nicolas
    lt/namegt
  • ltproductgt
    ltnamegt Blanc de Blanc lt/namegt

  • ltsoldgt 10/10/2000 lt/soldgt

  • ltsoldgt 12/10/2000 lt/soldgt

  • lt/productgt

  • ltproductgtlt/productgt
  • lt/storegt.
  • lt/countrygt
  • lt/allsalesgt
  • In summary group by country store product

17
allsales
Output schema

country

name
store
?

PCDATA
name
product
url

PCDATA
PCDATA
name
sold
?
PCDATA
date
tax
PCDATA
PCDATA
18
XML Publishing
In SilkRoute Fernandez, Suciu, Tan 00
FROM EuStores S, EuSales L, Products P
WHERE S.euSid L.euSid AND L.pid P.pid
CONSTRUCT ltallsales()gt
ltcountry(S.country)gt ltnamegt
S.country lt/namegt ltstore(S.euSid)gt
ltnamegt S.name lt/namegt
ltproduct(P.pid)gt
ltnamegt P.name lt/namegt
ltpricegt P.priceUSD
lt/pricegt lt/productgt
lt/storegt lt/countrygt
ltallsalesgt
/ union.. /
. / union / FROM USStores S, EuSales
L, Products P WHERE S.usSid L.euSid AND
L.pid P.pid CONSTRUCT ltallsales()gt
ltcountry(USA)gt ltnamegt USA
lt/namegt ltstore(S.euSid)gt
ltnamegt S.name lt/namegt
lturlgt S.url lt/urlgt
ltproduct(P.pid)gt
ltnamegt P.name lt/namegt
ltpricegt P.priceUSD lt/pricegt
lttaxgt L.tax lt/taxgt
lt/productgt lt/storegt
lt/countrygt ltallsalesgt
19
Internal Representation
View Tree
Non-recursive datalog (SELECT DISTINCT )
allsales()
allsales()-

country(c) -EuStores(x,_,c), EuSales(x,y,_),
Products(y,_,_) country(USA) -
country(c)

store(c,x) - EuStores(x,_,c), EuSales(x,y,_),
Products(y,_,_) store(c,x) - USStores(x,_,_),
USSales(x,y,_), Products(y,_,_), cUSA
name(c)
store(c,x)
c

?
name(n)
product(c,x,y)
url(c,x,u)
url(c,x,u)-USStores(x,_,u), USSales(x,y,_),Produc
ts(y,_,_)
n
u

name(n)
sold(c,x,y,d)
n
date(c,x,y,d)
Tax(c,x,y,d,t)
d
t
Large query (x100 lines), large XML answer (x100
MB)
20
Users Ask Specific XML Queries
  • find names, urls of all stores who sold on
    1/1/2000 (in XML-QL / XQuery melange)

WHERE ltallsales/country/storegt
ltproduct/sold/dategt 1/1/2000 lt/gt
ltnamegt X lt/gt
lturlgt Y lt/gt lt/gt RETURN X , Y
Small query, small answer
21
Query Composition
View Tree
XML-QL Query Pattern
allsales
n1
country
n2
name(c)
n3
store
c
n4
product
url
name
n
u
Y
name(n)
X
sold
n5
n
Tax(c,x,y,d,t)
date
Z
d
t
1/1/2000
Evaluate the XML pattern(s) on the view tree,
combine all datalog rules
22
Query Composition
  • Result (in theory)

( SELECT S.name, S.url FROM USStores S, USSales
L, Products P WHERE S.usSidL.usSid AND
L.pidP.pid AND L.date1/1/2000)
UNION ( SELECT S2.name, S2.url FROM EUStores
S1, EUSales L1, Products P1
USStores S2, USSales L2, Products P2, WHERE
S1.usSidL1.usSid AND L1.pidP1.pid AND
L1.date1/1/2000 AND S2.usSidL2.usSid AND
L2.pidP1.pid AND S1.countryUSA AND
S1.euSid S2.usSid)
23
Complexity of XML Publishing
  • But in practice 5-7 times more joins !
  • Need query minimization
  • Could this be avoided ?
  • We thought hard and couldnt find a better way
  • Asked students to re-implement same problem
  • It is NP-hard !

24
XML Publishing Is NP-Hard
View Tree
customer
?
?
order
complaint
order()- Q1
complaint()- Q2
PCDATA
PCDATA
XML query
WHERE ltcustomergt ltordergt x lt/gt
ltcomplaintgt y lt/gt
lt/gtRETURN ( )
Q1 JOIN Q2
The composed SQL query is Minimizing it is NP
hard ! (can be shown)
25
Recent Advancements in Query Containment
  • Definition FOk First Order Logic with k
    variables
  • Fact If Q2 ? FOk and k is small, then Q1 ? Q2
    can be checked efficiently
  • Kolaitis, Vardi98, Vardi00, Chekuri,
    Ramajaran97

26
XML Publishing Finale
  • Prediction techniques based on FOk and/or query
    width will be deployed in XML publishing in the
    future
  • (perhaps under different names)

27
XML Typechecking
  • Purpose ensure that the generated XML conforms
    to the desired DTD (or XML Schema)
  • Two kinds
  • Dynamic typechecking
  • Easy lots of XML validating parsers available
  • Static typechecking
  • Hard need complex analysis of the XML generation
    program

28
XML Typechecking
  • XML generation programs
  • Publishing RDBMS ? XML (e.g.
    SilkRoute)
  • Transformation XML ? XML (e.g. XSL,
    Xquery)
  • Integration XML XML ? XML

This talk XML ? XML
29
The XML Typechecking Problem
  • Given an XML ? XML transformation f
  • Type Checking Problem
  • Given DTDs t1, t2, check ?D ?t1, f(D) ?t2

sometimes t1 any check ?D, f(D) ?t2
30
Todays Systems Try to DoType Inference
  • Type Inference Problem
  • Given DTD t1, find the DTD f(t1) f(D) D ?t1
  • Todays systems
  • Compute f(t1)
  • Check f(t1) ? t2 (which is possible)

sometimes t1 any compute f(any) check
f(any) ? t2
31
Theorys RoleSend a Warning
  • This approach fails in general !

But it may work OK in most practical cases...
32
Why XML Type Inference Fails
RETURN ltagt (FROM Employee x RETURN ltb/gt),
(FROM Employee x RETURN ltc/gt),
(FROM Employee x RETURN ltd/gt) lt/agt
  • Xquery f
  • Inferred (wrong) DTD f(any)

lt!ELEMENT a (b,c,d)gt
33
The Typechecking Problem in Theory and Practice
  • In practice, we care about typechecking
  • Question for theory is this possible ?
  • Positive result Milo, Suciu, Vianu, 2000
  • Decidable for k-pebble tree tansducers
  • Hence decidable for
  • Join-free XQuery
  • Simple XSLT programs
  • Negative result Alon, Milo, Neven, Suciu, Vianu
    2001
  • Undecidable for transformations with value joins

34
The Typechecking Finale
  • Prediction systems will continue to use type
    inference, but will never be as robust as type
    checking in programming languages
  • Need to understand well their applicability

35
XML Storage
  • Problem
  • Given a (large) XML data instance
  • Goal store/process it in a RDBMS
  • Problem find the relational schema !
  • Current approaches
  • Generic schema Florescu, Kossman 99
  • Derive schema from DTD Shanmungasudaram et al
    99
  • Derive schema from XML dataDeutsch, Fernandez,
    Suciu 99

36
The Theory of XML Storage
  • The simplest case flat, unique subelements
  • M
  • How do we cover all 1s most economically ?
  • R1(E2, E3, E4), R2(E1, E5, E9, E12),

Oid E1 E2 E3 E4 E5000
1 1 0 0 1 0
2 0 1 1 0 0
3 0 1 0 1 0
4 0 1 1 1 0
5 1 0 1 0 0
6 1 1 0 0 0

o10000000 0 1 0 0 0
37
The Theory of XML Storage
  • XML storage and matrix rank
  • M
  • Can store XML data in k relations ? rank(M)k
  • Conversely if rank(M)k ? what about storage ?

Oid E1 E2 E3 E4 E5000
1 1 0 0 1 0
2 0 1 1 1 0
3 0 1 1 1 0
4 0 1 1 1 0
5 1 1 0 0 0
6 1 1 0 0 0
7 0 0 0 1 ...

10000000 1 0 0 1 0
38
XML Storage Finale
  • Prediction we will see several clever XML
    storage techniques discovered in the near future

39
The Data Distribution
  • Many data consumers, many places to cache
  • Data can be replicated, transformed
  • How to transform it ? The view selection problem
  • Where to place it ? The data distribution problem.

Prediction no predictions here (too early)
40
ConclusionsResurrection of Database Theory
  • Is theory irrelevant ?
  • Papadimitriou, 95 wrong question to ask
  • Respect for practice only a recent development
    in human culture
  • Applicability pressure in CS annoying trend of
    last 10 years or so
  • Database theory are we in a revolution ?
  • The past researchers created artifacts for the
    industry
  • Today society (Web, W3C) is creating artifacts
    for researchers to study, improve

Prediction there will be no difference
betweentheory and practice
at least, in theory !
Write a Comment
User Comments (0)
About PowerShow.com