Title: Web Data and the Resurrection of Database Theory
1Web Data and the Resurrection of Database Theory
- Dan Suciu
- University of Washington
2- In theory there is no difference between theory
and practice. In practice there is.
Jan L.A. van de Snepscheut September 12, 1953 -
February 23, 1994
3Short History of Database Theory
- The legendary beginnings, 1970-1971
- Relational databases are the brainchild of a
theoretician (Codd) - Heavily debated at the time (against CODASYL)
- It took several years for the concept to be
validated in practice
Theory driving the industry
4Short History of Database Theory
- The golden years (end of 70s, early 80s)
- Relational theory
- Functional dependencies
- Query containment
- Transactions
- Access methods
Theory listening to the industry
5Short History of Database Theory
- Refined decadence (end of 80s, early 90s)
- Descriptive complexity
- Logic databases
- Complex objects
- Constraint databases
Divorce ?
6Database MetatheoryAsking the Big Queries
- Christos Papadimitriou, in PODS, 1995
- Theory is inevitable CS is a science of
the artificial, and its artifact is being
changed by the very act of studying it
7Is DB Theory in a Crisis Today ?
- Industrys focus
- one particular data model relational/SQL
- one particular application (client-server)
- Theorys focus is on Logic
- New data models, query languages (query
containment, complex objects, recursion) - New applications (incomplete information, query
rewriting using views)
8One Example of Unused Theory
- Containment of conjunctive queries is NP complete
Chandra and Merlin77 - Dozens of extensions
- With union and difference Sagiv and
Yannakakis81 - With order predicates Klug88, van den
Meyden92 - With complex objects Levy and Suciu97
- With regular expressions Florescu, Levy and
Suciu98
9Query Containment
Q1 SELECT DISTINCT x.name, x.phone
FROM Person x, Person y, Person z WHERE
x.department y.department AND
x.manager z.manager
Is minimized to
Q2 SELECT DISTINCT x.name, x.phone FROM
Person x
The following can be checked Q1 ? Q2 and Q1 ?
Q2 hence Q1Q2
Minimization not used by RDBMs today
10Why Today Things Are Changing
- Just one reason The Web
- More precisely
- A new data model
- Semistructured data
- XML syntax
- New applications
- Transformation
- Integration
11Web Data Management
- Who creates the new rules
- W3C working groups
- Sometimes the industry
- The new artifacts are not concepts, but standards
- The double role of theory
- Long term conceptualize/rationalize
- E.g. keys for XML Buneman, Davidson, Fan, Hara,
Tan01 - Short term answer technical questions
12Some Questions for Database Theory
- XML publishing
- Typechecking XML transformations
- XML storage
- Data distribution
13application
XML Storage
application
object-relational
XML Typechecking
Integrate
XML Data
WEB (HTTP)
Warehouse
Transform
Warehouse
XML Publishing
application
XML Distribution
relational data
legacy data
14XML Publishing
- Today
- Legacy data
- fragmented into many flat relations
- 3rd normal form
- proprietary
- XML data
- nested
- un-normalized
- public (450 schemas at www.biztalk.org)
15XML Publishing an Example
Legacy data in E/R
name
country
name
url
euSid
usSid
Eu-Stores
US-Stores
date
tax
Eu-Sales
US-Sales
date
Products
pid
name
priceUSD
16XML Publishing an Example
- XML view
- ltallsalesgt
- ltcountrygt ltnamegt France lt/namegt
- ltstoregt ltnamegt Nicolas
lt/namegt - ltproductgt
ltnamegt Blanc de Blanc lt/namegt -
ltsoldgt 10/10/2000 lt/soldgt -
ltsoldgt 12/10/2000 lt/soldgt -
- lt/productgt
-
ltproductgtlt/productgt - lt/storegt.
- lt/countrygt
- lt/allsalesgt
- In summary group by country store product
17allsales
Output schema
country
name
store
?
PCDATA
name
product
url
PCDATA
PCDATA
name
sold
?
PCDATA
date
tax
PCDATA
PCDATA
18XML Publishing
In SilkRoute Fernandez, Suciu, Tan 00
FROM EuStores S, EuSales L, Products P
WHERE S.euSid L.euSid AND L.pid P.pid
CONSTRUCT ltallsales()gt
ltcountry(S.country)gt ltnamegt
S.country lt/namegt ltstore(S.euSid)gt
ltnamegt S.name lt/namegt
ltproduct(P.pid)gt
ltnamegt P.name lt/namegt
ltpricegt P.priceUSD
lt/pricegt lt/productgt
lt/storegt lt/countrygt
ltallsalesgt
/ union.. /
. / union / FROM USStores S, EuSales
L, Products P WHERE S.usSid L.euSid AND
L.pid P.pid CONSTRUCT ltallsales()gt
ltcountry(USA)gt ltnamegt USA
lt/namegt ltstore(S.euSid)gt
ltnamegt S.name lt/namegt
lturlgt S.url lt/urlgt
ltproduct(P.pid)gt
ltnamegt P.name lt/namegt
ltpricegt P.priceUSD lt/pricegt
lttaxgt L.tax lt/taxgt
lt/productgt lt/storegt
lt/countrygt ltallsalesgt
19Internal Representation
View Tree
Non-recursive datalog (SELECT DISTINCT )
allsales()
allsales()-
country(c) -EuStores(x,_,c), EuSales(x,y,_),
Products(y,_,_) country(USA) -
country(c)
store(c,x) - EuStores(x,_,c), EuSales(x,y,_),
Products(y,_,_) store(c,x) - USStores(x,_,_),
USSales(x,y,_), Products(y,_,_), cUSA
name(c)
store(c,x)
c
?
name(n)
product(c,x,y)
url(c,x,u)
url(c,x,u)-USStores(x,_,u), USSales(x,y,_),Produc
ts(y,_,_)
n
u
name(n)
sold(c,x,y,d)
n
date(c,x,y,d)
Tax(c,x,y,d,t)
d
t
Large query (x100 lines), large XML answer (x100
MB)
20Users Ask Specific XML Queries
- find names, urls of all stores who sold on
1/1/2000 (in XML-QL / XQuery melange)
WHERE ltallsales/country/storegt
ltproduct/sold/dategt 1/1/2000 lt/gt
ltnamegt X lt/gt
lturlgt Y lt/gt lt/gt RETURN X , Y
Small query, small answer
21Query Composition
View Tree
XML-QL Query Pattern
allsales
n1
country
n2
name(c)
n3
store
c
n4
product
url
name
n
u
Y
name(n)
X
sold
n5
n
Tax(c,x,y,d,t)
date
Z
d
t
1/1/2000
Evaluate the XML pattern(s) on the view tree,
combine all datalog rules
22Query Composition
( SELECT S.name, S.url FROM USStores S, USSales
L, Products P WHERE S.usSidL.usSid AND
L.pidP.pid AND L.date1/1/2000)
UNION ( SELECT S2.name, S2.url FROM EUStores
S1, EUSales L1, Products P1
USStores S2, USSales L2, Products P2, WHERE
S1.usSidL1.usSid AND L1.pidP1.pid AND
L1.date1/1/2000 AND S2.usSidL2.usSid AND
L2.pidP1.pid AND S1.countryUSA AND
S1.euSid S2.usSid)
23Complexity of XML Publishing
- But in practice 5-7 times more joins !
- Need query minimization
- Could this be avoided ?
- We thought hard and couldnt find a better way
- Asked students to re-implement same problem
- It is NP-hard !
24XML Publishing Is NP-Hard
View Tree
customer
?
?
order
complaint
order()- Q1
complaint()- Q2
PCDATA
PCDATA
XML query
WHERE ltcustomergt ltordergt x lt/gt
ltcomplaintgt y lt/gt
lt/gtRETURN ( )
Q1 JOIN Q2
The composed SQL query is Minimizing it is NP
hard ! (can be shown)
25Recent Advancements in Query Containment
- Definition FOk First Order Logic with k
variables - Fact If Q2 ? FOk and k is small, then Q1 ? Q2
can be checked efficiently - Kolaitis, Vardi98, Vardi00, Chekuri,
Ramajaran97
26XML Publishing Finale
- Prediction techniques based on FOk and/or query
width will be deployed in XML publishing in the
future - (perhaps under different names)
27XML Typechecking
- Purpose ensure that the generated XML conforms
to the desired DTD (or XML Schema) - Two kinds
- Dynamic typechecking
- Easy lots of XML validating parsers available
- Static typechecking
- Hard need complex analysis of the XML generation
program
28XML Typechecking
- XML generation programs
- Publishing RDBMS ? XML (e.g.
SilkRoute) - Transformation XML ? XML (e.g. XSL,
Xquery) - Integration XML XML ? XML
This talk XML ? XML
29The XML Typechecking Problem
- Given an XML ? XML transformation f
- Type Checking Problem
- Given DTDs t1, t2, check ?D ?t1, f(D) ?t2
sometimes t1 any check ?D, f(D) ?t2
30Todays Systems Try to DoType Inference
- Type Inference Problem
- Given DTD t1, find the DTD f(t1) f(D) D ?t1
- Todays systems
- Compute f(t1)
- Check f(t1) ? t2 (which is possible)
sometimes t1 any compute f(any) check
f(any) ? t2
31Theorys RoleSend a Warning
- This approach fails in general !
But it may work OK in most practical cases...
32Why XML Type Inference Fails
RETURN ltagt (FROM Employee x RETURN ltb/gt),
(FROM Employee x RETURN ltc/gt),
(FROM Employee x RETURN ltd/gt) lt/agt
- Xquery f
- Inferred (wrong) DTD f(any)
-
lt!ELEMENT a (b,c,d)gt
33The Typechecking Problem in Theory and Practice
- In practice, we care about typechecking
- Question for theory is this possible ?
- Positive result Milo, Suciu, Vianu, 2000
- Decidable for k-pebble tree tansducers
- Hence decidable for
- Join-free XQuery
- Simple XSLT programs
- Negative result Alon, Milo, Neven, Suciu, Vianu
2001 - Undecidable for transformations with value joins
34The Typechecking Finale
- Prediction systems will continue to use type
inference, but will never be as robust as type
checking in programming languages - Need to understand well their applicability
35XML Storage
- Problem
- Given a (large) XML data instance
- Goal store/process it in a RDBMS
- Problem find the relational schema !
- Current approaches
- Generic schema Florescu, Kossman 99
- Derive schema from DTD Shanmungasudaram et al
99 - Derive schema from XML dataDeutsch, Fernandez,
Suciu 99
36The Theory of XML Storage
- The simplest case flat, unique subelements
- M
- How do we cover all 1s most economically ?
- R1(E2, E3, E4), R2(E1, E5, E9, E12),
Oid E1 E2 E3 E4 E5000
1 1 0 0 1 0
2 0 1 1 0 0
3 0 1 0 1 0
4 0 1 1 1 0
5 1 0 1 0 0
6 1 1 0 0 0
o10000000 0 1 0 0 0
37The Theory of XML Storage
- XML storage and matrix rank
- M
- Can store XML data in k relations ? rank(M)k
- Conversely if rank(M)k ? what about storage ?
Oid E1 E2 E3 E4 E5000
1 1 0 0 1 0
2 0 1 1 1 0
3 0 1 1 1 0
4 0 1 1 1 0
5 1 1 0 0 0
6 1 1 0 0 0
7 0 0 0 1 ...
10000000 1 0 0 1 0
38XML Storage Finale
- Prediction we will see several clever XML
storage techniques discovered in the near future
39The Data Distribution
- Many data consumers, many places to cache
- Data can be replicated, transformed
- How to transform it ? The view selection problem
- Where to place it ? The data distribution problem.
Prediction no predictions here (too early)
40ConclusionsResurrection of Database Theory
- Is theory irrelevant ?
- Papadimitriou, 95 wrong question to ask
- Respect for practice only a recent development
in human culture - Applicability pressure in CS annoying trend of
last 10 years or so - Database theory are we in a revolution ?
- The past researchers created artifacts for the
industry - Today society (Web, W3C) is creating artifacts
for researchers to study, improve
Prediction there will be no difference
betweentheory and practice
at least, in theory !