Title: MONADIC QUERIES over TREE-STRUCTURED DATA
1 MONADIC QUERIES over TREE-STRUCTURED DATA
- Georg Gottlob
- TU Wien, Oxford University
- Joint work with Christoph Koch, Robert Baumgartner, Marcus Herzog, and Reinhard Pichler
2 Talk Outline
- Semistructured data: HTML, XML
- Monadic Queries
- Monadic datalog over trees
- XPath
- Web information extraction (wrapping)
- Lixto
3 Strings, Trees, Graphs, Logic
A few well-known results
- Büchi: MSO = REG over strings
- Rabin: decidability of S2S
- Thatcher and Wright: MSO = REG over ranked trees (tree automata)
- Brüggemann-Klein/Wood/Murata: MSO = REG over unranked trees
- Fagin: ESO = NP
- Note: over graphs, ESO is NP-hard and MSO is hard for the Polynomial Hierarchy.
- Grädel/Immerman/Vardi: ESO(Horn) = Datalog = LFP = PTIME (on ordered structures)
- Courcelle: MSO in linear time on tree-like structures (treewidth < k)
- Clarke, Emerson, Pnueli, et al.: CTL, LTL
4 Web documents are trees!
- HTML: Hypertext Markup Language
- XML: Extensible Markup Language
- HTML, XML: context-free languages.
- Represent a document by its parse tree.
- Tags = vertex labels
- Labeled trees.
5 HTML Example
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<body>
<h1>People @ DBAI</h1>
<table border="1" cellpadding="3" cellspacing="1">
<tr>
<td>Georg Gottlob</td>
<td>gottlob@dbai.tuwien.ac.at</td>
<td>18420</td>
</tr>
<tr>
<td>Christoph Koch</td>
<td>koch@dbai.tuwien.ac.at</td>
<td>18449</td>
</tr>
</table>
</body>
</html>
Rendered page:
People @ DBAI
Georg Gottlob gottlob@ 18420
Christoph Koch koch@ 18449
8 XML Example
<paperDB>
<paper>
<author>
<chandra/>
<merlin/>
</author>
<title>Conjunctive Queries</title>
</paper>
</paperDB>
[Figure: parse tree with root paperDB, whose paper children each have an author child and a title child]
9 Ordered Trees as finite structures
Child-relation is a priori unordered
fc = first child
ns = next sibling
[Figure: the paper subtree encoded with fc and ns edges: paper -fc-> author -ns-> title; author -fc-> chandra -ns-> merlin; title -fc-> Conj. Queries]
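The fc/ns encoding can be sketched in a few lines of Python. This is only an illustration: the `Node` class, the `encode` function, and the pre-order node numbering are assumptions made for this example and do not come from the talk.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical ordered-tree node: a label plus an ordered child list."""
    label: str
    children: list = field(default_factory=list)


def encode(root):
    """Return (labels, firstchild, nextsibling) for the tree under root.

    Nodes are numbered in document (pre-)order, starting at 0.
    firstchild and nextsibling are sets of (node, node) pairs.
    """
    labels, firstchild, nextsibling = {}, set(), set()

    def visit(node):
        nid = len(labels)
        labels[nid] = node.label
        prev = None
        for child in node.children:
            cid = visit(child)
            if prev is None:
                firstchild.add((nid, cid))      # first child of nid
            else:
                nextsibling.add((prev, cid))    # right neighbour of prev
            prev = cid
        return nid

    visit(root)
    return labels, firstchild, nextsibling


paper = Node("paper", [
    Node("author", [Node("chandra"), Node("merlin")]),
    Node("title", [Node("Conj. Queries")]),
])
labels, firstchild, nextsibling = encode(paper)
```

Both relations are functional (each node has at most one first child and at most one next sibling), so each contains at most one pair per node.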
10 Core XPath
- simple location steps: paper/title
- location steps with explicit axes: paper/descendant::merlin
- qualifiers: paper[...]
- Boolean logic: ...[chandra and merlin and (not harel)]
Full XPath
- node set comparisons and operations
- order functions (first, last, position), etc.
- arithmetic and string operations
Implementations in the context of XSLT processors: Xalan, XT, MS Internet Explorer (IE6)
11 XPath Examples
/descendant::a/child::b
/descendant::a/child::b[descendant::c and not(following-sibling::d)]
/descendant::a/child::b[following-sibling::d]
[Figure: example trees with nodes labeled a, b, c, d, showing the nodes selected by each query]
12 Ordered Trees as finite structures
Child-relation is a priori unordered
fc = first child
ns = next sibling
$\tau_U = \langle \mathrm{firstchild}^2, \mathrm{nextsibling}^2, \mathrm{lastchild}^2, \mathrm{label}_a^1, \mathrm{root}^1, \mathrm{leaf}^1 \rangle$, $a \in \Sigma$
13 Monadic Queries over Trees
Select some nodes of a tree: unary query
$f : \mathrm{Trees} \to 2^{\mathrm{dom}}$
No joins or combinations of objects
Yardstick: Monadic Second Order Logic (MSO)
Example: select titles of articles authored by Chandra and Merlin
Two important applications
- Web Information Extraction (→ later)
- Monadic XML Queries
14 Monadic Datalog over Trees
Select titles of articles authored by Chandra and Merlin
15 Monadic Datalog over Trees
[Figure: fc/ns encoding of the document tree: paperDB -fc-> paper -ns-> paper -ns-> ...; paper -fc-> author -ns-> title; author -fc-> chandra -ns-> merlin; title -fc-> Conj. Queries]
paper(X) ← root(R), firstchild(R,X).
paper(X) ← paper(Y), nextsibling(Y,X).
output(X) ← paper(P), firstchild(P,A),
            firstchild(A,Z), label_chandra(Z),
            nextsibling(Z,V), label_merlin(V),
            nextsibling(A,T), firstchild(T,X).
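The program above can be traced by hand in Python. The sketch below is a direct, hand-written translation of the three rules, not a general datalog engine. The node numbering, the dictionaries `LABEL`, `FIRSTCHILD`, `NEXTSIBLING`, and the second paper (with author harel and a made-up title) are assumptions added for illustration.

```python
# Hypothetical document: paperDB with two papers, in fc/ns encoding.
LABEL = {0: "paperDB", 1: "paper", 2: "author", 3: "chandra", 4: "merlin",
         5: "title", 6: "Conj. Queries", 7: "paper", 8: "author",
         9: "harel", 10: "title", 11: "Another Title"}
FIRSTCHILD = {0: 1, 1: 2, 2: 3, 5: 6, 7: 8, 8: 9, 10: 11}   # node -> first child
NEXTSIBLING = {1: 7, 2: 5, 3: 4, 8: 10}                       # node -> next sibling


def evaluate(root):
    """Evaluate the paper/output rules; returns the set of output nodes."""
    fc, ns = FIRSTCHILD.get, NEXTSIBLING.get

    # paper(X) <- root(R), firstchild(R,X).
    # paper(X) <- paper(Y), nextsibling(Y,X).
    paper = set()
    x = fc(root)
    while x is not None:
        paper.add(x)
        x = ns(x)

    # output(X) <- paper(P), firstchild(P,A), firstchild(A,Z), label_chandra(Z),
    #              nextsibling(Z,V), label_merlin(V), nextsibling(A,T), firstchild(T,X).
    output = set()
    for p in paper:
        a = fc(p)
        z, t = fc(a), ns(a)
        v, x = ns(z), fc(t)
        if LABEL.get(z) == "chandra" and LABEL.get(v) == "merlin" and x is not None:
            output.add(x)
    return output


result = evaluate(root=0)
```

Because firstchild and nextsibling are functional, every variable in the rule body is determined by P, so the body is checked with a constant number of lookups per paper node.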
16 How expressive is monadic Datalog?
- It was known that
- Monadic Datalog ? ?1-MSO
- Full Datalog = P
Theorem [G., Koch 2002]
Over $\tau_U$, Monadic Datalog = MSO
A unary query is definable in MSO iff it is definable via a monadic datalog program.
17 Proof idea: Simulate Unranked Query Automata (UQA) by Neven and Schwentick in monadic Datalog
UQA = Unary MSO Queries [Neven, Schwentick 01]
20 Example: Even-query
Proof idea: Simulate Unranked Query Automata (UQA) by Neven and Schwentick in monadic Datalog
Up transition
[Figure: tree fragment whose nodes carry the states 0, 0, 1, 0]
qodd(X) ← 0(Y), lastchild(X,Y).
21 How complex is Monadic Datalog?
- Previously known facts on full Datalog over graphs:
- Data complexity of Datalog: P-complete (implicit in Vardi 88)
- Combined complexity: EXPTIME-complete (implicit in Vardi 88)
- Combined complexity of sirups: EXPTIME-complete (G., Papadimitriou 99)
Theorem [G., Koch 2002]
Monadic Datalog over $\tau_U$ has combined complexity O(|data| · |query|)
Data complexity: P-complete and linear-time.
22 Proof idea
1.) Transform datalog program + input tree in linear time into a ground propositional logic program
- Exploit functional dependencies: nextsibling(X,Y) has only a linear number of ground instances nextsibling(n_i, n_j), etc.
- Decouple independent atoms of rule bodies:
  p(X) ← q(X), r(Y), nextsibling(X,Z), s(Z).
  becomes
  p(X) ← q(X), r, nextsibling(X,Z), s(Z).
  r ← r(Y).
2.) Execute ground program in linear time by using well-known algorithms [Dowling, Gallier; Minoux]
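Step 2 can be illustrated with a counter-based forward-chaining procedure in the spirit of the linear-time Horn algorithms cited above. This is a minimal sketch, not the authors' implementation: the function name `horn_least_model`, the string encoding of ground atoms, and the three-node sibling chain n1, n2, n3 are assumptions for this example.

```python
from collections import defaultdict, deque


def horn_least_model(rules):
    """Least model of a ground Horn program.

    rules: list of (head, body) pairs; body is a list of atoms, facts have
    an empty body. Each body occurrence is decremented at most once, so the
    work is linear in the total size of the program.
    """
    missing = [len(set(body)) for _, body in rules]   # unsatisfied body atoms per rule
    watch = defaultdict(list)                         # atom -> rules with it in the body
    for i, (_, body) in enumerate(rules):
        for atom in set(body):
            watch[atom].append(i)

    true = set()
    queue = deque(head for (head, _), m in zip(rules, missing) if m == 0)
    while queue:
        atom = queue.popleft()
        if atom in true:
            continue
        true.add(atom)
        for i in watch[atom]:
            missing[i] -= 1
            if missing[i] == 0:          # whole body derived: fire the rule
                queue.append(rules[i][0])
    return true


# Ground version of  p(X) <- q(X), r, nextsibling(X,Z), s(Z).   r <- r(Y).
# over siblings n1 -> n2 -> n3; nextsibling atoms are resolved while grounding.
rules = [
    ("q(n1)", []), ("s(n2)", []), ("r(n3)", []),                  # facts
    ("r", ["r(n1)"]), ("r", ["r(n2)"]), ("r", ["r(n3)"]),        # r <- r(Y)
    ("p(n1)", ["q(n1)", "r", "s(n2)"]),
    ("p(n2)", ["q(n2)", "r", "s(n3)"]),
]
model = horn_least_model(rules)
```

Decoupling r(Y) into the propositional atom r keeps the number of ground rules linear: without it, the rule for p would need one ground instance per pair (X, Y).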
23 XPath
W3C standard; kernel of XSLT, XQuery, etc.
//paper[author[chandra and merlin]]/title
Unabbreviated syntax with explicit axes:
/descendant::paper[child::author[child::chandra and child::merlin]]/child::title
/descendant::chandra/following-sibling::merlin/ancestor::paper/child::title
24 Core XPath: A tree morphism problem
[Figure: a data tree next to the query tree with location steps: root -desc.-> chandra -foll-s.-> merlin -anc.-> paper -child-> title]
/descendant::chandra/following-sibling::merlin/ancestor::paper/child::title
25 Core XPath: A tree morphism problem
[Figure: the same data tree and query tree, with a "?" between them]
/descendant::chandra/nextsibling::merlin/ancestor::paper/child::title
27 Core XPath
- simple location steps: paper/title
- location steps with explicit axes: paper/descendant::merlin
- qualifiers: paper[...]
- Boolean logic: ...[chandra and merlin and (not harel)]
Full XPath
- node set comparisons and operations
- order functions (first, last), etc.
- arithmetic and string operations
Implementations: Xalan, XT, MS Internet Explorer 6 (IE6)
Complexity, efficiency? [G., Koch, Pichler, VLDB 02]
28 Exponential!
Document: <a><b/><b/></a>
Core XPath on Xalan and XT
Queries: a/b/parent::a/b/parent::a/b
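One way to see where the blow-up comes from is to model an evaluator that processes every context node separately and concatenates the results without removing duplicates. The sketch below is a toy model under that assumption; it is not the actual code of Xalan or XT, and the names `CHILDREN`, `PARENT`, and `final_context_size` are invented for the example.

```python
# Hypothetical in-memory model of the document <a><b/><b/></a>.
CHILDREN = {"a": ["b1", "b2"], "b1": [], "b2": []}
PARENT = {"b1": "a", "b2": "a"}


def final_context_size(k, dedupe):
    """Evaluate a/b followed by k repetitions of /parent::a/b.

    dedupe=False: results of each step are concatenated as plain lists
    (the naive strategy). dedupe=True: each intermediate result is a node set.
    Returns the number of entries in the final context.
    """
    normalise = (lambda nodes: sorted(set(nodes))) if dedupe else list
    context = normalise(CHILDREN["a"])                                  # a/b
    for _ in range(k):
        context = normalise(PARENT[n] for n in context)                 # /parent::a
        context = normalise(c for n in context for c in CHILDREN[n])    # /b
    return len(context)


naive = [final_context_size(k, dedupe=False) for k in range(5)]
as_sets = [final_context_size(k, dedupe=True) for k in range(5)]
```

In this model the naive context doubles with every /parent::a/b pair, while the set-based context stays at the two b nodes.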
29 Quadratic
Core XPath on Microsoft IE6: polynomial combined complexity, quadratic data complexity
30 Full XPath on IE6: Exponential combined complexity!
Exponential query complexity
31 Axes and regular expressions
Observation: All XPath axes can be expressed as regular expressions over the $\tau_U$-axes firstchild and nextsibling
child = firstchild.nextsibling*
parent = (nextsibling^-1)*.firstchild^-1
descendant = firstchild.(firstchild ∪ nextsibling)*
etc.
General definition of axis:
Relation definable via a regular expression (with inversion) from the primitive relations of $\tau_U$
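The three regular expressions can be checked on a small tree by treating relations as sets of pairs. The helper names `compose`, `inverse`, and `star`, as well as the six-node example tree (the paper subtree, numbered in pre-order), are assumptions for this sketch.

```python
def compose(r, s):
    """Relational composition r.s"""
    return {(x, z) for (x, y) in r for (y2, z) in s if y == y2}


def inverse(r):
    """Inverse relation r^-1"""
    return {(y, x) for (x, y) in r}


def star(r, nodes):
    """Reflexive-transitive closure r* over the given node set."""
    closure = {(n, n) for n in nodes}
    frontier = set(closure)
    while frontier:
        frontier = compose(frontier, r) - closure   # pairs reached one step later
        closure |= frontier
    return closure


# paper(0)[author(1)[chandra(2), merlin(3)], title(4)[Conj. Queries(5)]]
nodes = range(6)
firstchild = {(0, 1), (1, 2), (4, 5)}
nextsibling = {(2, 3), (1, 4)}

child = compose(firstchild, star(nextsibling, nodes))
parent = compose(star(inverse(nextsibling), nodes), inverse(firstchild))
descendant = compose(firstchild, star(firstchild | nextsibling, nodes))
```

The closure computation here is deliberately simple and not linear-time; it only serves to confirm that the expressions define the intended axes.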
33 Conjunctive queries with axes
CQ: conjunction of $\tau_U$-atoms and of atoms corresponding to derived axes
Example: nextsibling(X,Z), descendant(Z,U), ancestor(U,V), label_a(V), child(V,X), (firstchild.firstchild ∪ firstchild^-1)(U,X)
Theorem
Evaluating conjunctive queries with axes over trees is NP-complete (query complexity)
However: XPath is more akin to acyclic conjunctive queries!
34 Acyclic conjunctive queries with axes
Theorem
Evaluating acyclic conjunctive queries with axes over trees is feasible in time O(|data| · |query|)
Proof idea: translate the acyclic query into a monadic datalog program over $\tau_U$
Query: child(A,X), descendant(X,Y), descendant(Y,Z), label_b(Y), label_a(Z)
35 Acyclic conjunctive queries with axes
Ear: an atom containing an ear variable, i.e., a variable that otherwise occurs in monadic atoms only. An ear is definable as a unary MSO query and is thus expressible by a monadic datalog program.
Query: child(A,X), descendant(X,Y), descendant(Y,Z), label_b(Y), label_a(Z)
Here descendant(Y,Z) is an ear with ear variable Z.
36 Acyclic conjunctive queries with axes
Query: child(A,X), descendant(X,Y), descendant(Y,Z), label_b(Y), label_a(Z)
d(Y) ← firstchild(Y,Z), aa(Z).
aa(Z) ← label_a(Z).
aa(Z) ← aa(V), nextsibling(Z,V).
aa(Z) ← aa(V), firstchild(Z,V).
37 Acyclic conjunctive queries with axes
Query after replacing descendant(Y,Z), label_a(Z) by d(Y): child(A,X), descendant(X,Y), d(Y), label_b(Y)
38 Acyclic conjunctive queries with axes
Ear atom: continue eliminating ear atoms until the query is entirely monadic.
Query: child(A,X), descendant(X,Y), d(Y), label_b(Y)
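The four rules that replace the ear descendant(Y,Z), label_a(Z) by the monadic predicate d(Y) can be run as a small fixpoint computation. The sketch below uses a naive iteration (not the linear-time evaluation from the theorem), and the five-node tree and the function name `eval_d` are assumptions made for illustration.

```python
def eval_d(labels, firstchild, nextsibling, target="a"):
    """Compute d = { Y : Y has a descendant labeled `target` } via the aa rules."""
    # aa(Z) <- label_a(Z).
    aa = {n for n, lab in labels.items() if lab == target}
    changed = True
    while changed:
        changed = False
        # aa(Z) <- aa(V), nextsibling(Z,V).   aa(Z) <- aa(V), firstchild(Z,V).
        for (z, v) in firstchild | nextsibling:
            if v in aa and z not in aa:
                aa.add(z)
                changed = True
    # d(Y) <- firstchild(Y,Z), aa(Z).
    return {y for (y, z) in firstchild if z in aa}


# Hypothetical tree: r(0)[ b(1)[ c(2), a(3) ], b(4) ]
labels = {0: "r", 1: "b", 2: "c", 3: "a", 4: "b"}
firstchild = {(0, 1), (1, 2)}
nextsibling = {(1, 4), (2, 3)}

d = eval_d(labels, firstchild, nextsibling)
# Remaining monadic condition on Y in the query: d(Y) and label_b(Y).
y_candidates = {y for y in d if labels[y] == "b"}
```

Node 4 is labeled b but has no descendant labeled a, so only node 1 remains a candidate for Y.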
39 Acyclic Monadic Datalog with Axes
AMX-Datalog: monadic datalog programs whose rule bodies are acyclic and may contain arbitrary axes
Theorem
Evaluating AMX-Datalog programs over trees is feasible in time O(|data| · |program|)
Remarks
- Same bound for stratified AMX-Datalog
- AMX-Datalog expresses MSO over $\tau_U$ (both without and with stratification)
40 Core XPath in Linear Time
Corollary
Evaluating Core XPath queries over trees is feasible in time O(|data| · |query|)
Proof: Linear translation from Core XPath to stratified Monadic Datalog + axes
41 Core XPath in Linear Time
//paper[author[chandra and not merlin]]/title
output(X) ← root(R), descendant(R,P), label_paper(P), qual1(P), child(P,X), label_title(X).
qual1(X) ← child(X,Y), label_author(Y), qual2(Y).
qual2(X) ← child(X,Y), label_chandra(Y), not qual3(X).
qual3(X) ← child(X,M), label_merlin(M).
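The stratified program can be evaluated stratum by stratum with plain set operations: qual3 first, then the rules that negate it. The sketch below assumes a hypothetical two-paper document and uses a parent-to-children dictionary for the child axis; the names `run`, `lab`, and `has_child_in` are invented for this example.

```python
def run(labels, children, root):
    """Evaluate the output/qual1/qual2/qual3 rules; returns the output nodes."""
    def lab(name):
        return {n for n, l in labels.items() if l == name}

    def has_child_in(s):
        # { X : child(X,Y) for some Y in s }
        return {x for x, kids in children.items() if any(k in s for k in kids)}

    # Lower stratum: qual3(X) <- child(X,M), label_merlin(M).
    qual3 = has_child_in(lab("merlin"))
    # Upper stratum: qual2(X) <- child(X,Y), label_chandra(Y), not qual3(X).
    qual2 = has_child_in(lab("chandra")) - qual3
    # qual1(X) <- child(X,Y), label_author(Y), qual2(Y).
    qual1 = has_child_in(lab("author") & qual2)
    # output(X) <- root(R), descendant(R,P), label_paper(P), qual1(P),
    #              child(P,X), label_title(X).
    papers = (lab("paper") - {root}) & qual1   # every non-root node descends from R
    return {x for p in papers for x in children.get(p, []) if labels[x] == "title"}


# Hypothetical document: the first paper has authors chandra and merlin,
# the second paper has author chandra only.
labels = {0: "paperDB", 1: "paper", 2: "author", 3: "chandra", 4: "merlin",
          5: "title", 6: "paper", 7: "author", 8: "chandra", 9: "title"}
children = {0: [1, 6], 1: [2, 5], 2: [3, 4], 6: [7, 9], 7: [8]}

result = run(labels, children, root=0)
```

Only the title of the second paper is selected, because the author node of the first paper has a merlin child and is therefore excluded by "not qual3".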
42 Full XPath in Polynomial Time
Theorem [G., Koch, Pichler, VLDB 2002]
Evaluating full XPath queries over XML documents is feasible in polynomial time (combined complexity)
Proof: Extends the logic programming evaluation paradigm to all nasty features of full XPath.
Implementation (main memory): XML-Taskforce XPath
To our knowledge the only XPath system that scales.
43 Combined Complexity of XPath
[PODS 03, JACM 05]
44 Data and Query Complexity
- Theorem. XPath is in L (data complexity).
- Theorem. PF is L-hard under NC1-reductions (data complexity).
- Theorem. XPath w/o multiplication, concatenation is in L w.r.t. query complexity.
[Figure: data complexity diagram: PF is L-complete (NC1-reductions), XPath is in L]
45 Core XPath and CTL
Straightforward translation from Core XPath with vertical axes to CTL with past modalities. (On graphs with child relation; order independent!)
//paper[author[chandra and merlin]]/title
first normalize to
//title[parent::paper[author[chandra and merlin]]]
title ∧ EX^-1(paper ∧ EX(author ∧ EX chandra ∧ EX merlin))
Core XPath requires multimodal CTL X? , X? ,
etc.
46 General conjunctive queries with axes
We know they are NP-complete, but ...
Research programme
- Find interesting sets of axes for which CQs are tractable.
- Trace the tractability frontier, i.e., determine all maximal sets of axes for which CQs are tractable.
- Extend tractability results to datalog.
PODS 2004 [G., Koch, Schulz]: Solved for all XPath axes
47 Cyclic Query Example (from Computational Linguistics)
48 Complexity Results
(combined complexity)
(Partition of set of axes!)
49 Some simple tractability results
CQs with $\tau_U$-atoms and additional axis sets {child} or {child+, child*} can be answered in time O(|data| · |query|).
- Proof idea for {child}: cycles involving child are either
- unsatisfiable (easy to check), or
- rewritable in linear time into acyclic CQs
50 Proof idea for {child+, child*}
[Figure: data tree T with nodes labeled a, b, c, next to a cyclic query Q over the variables X (label a), Y (label b), Z (label c), U (label c)]
51 Proof idea for {child+, child*}
[Figure: every node of T is initially annotated with the full candidate set XYZU]
53 Proof idea for {child+, child*}
[Figure: candidate sets after pruning by label: X at the a node, Y at the b node, ZU at each c node]
U must have an ancestor labeled b!
55 Proof idea for {child+, child*}
[Figure: U now remains at a single c node; the other c nodes keep only Z]
Z must have U as descendant-or-self
57 Proof idea for {child+, child*}
[Figure: final annotation: X at the a node, Y at the b node, ZU at one c node]
Reduct(Q,T): locally arc-consistent!
Lemma: T ⊨ Q iff Reduct(Q,T) is well-labeled
58 Proof idea for {child+, child*}
[Figure: the same annotated tree, with the morphism from Q into T drawn in]
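The pruning steps on these slides resemble a standard arc-consistency computation. The sketch below illustrates that idea only: the tree shape, the query atoms, and the function name `reduct` are assumptions (the original figure is not recoverable), the loop is a naive fixpoint rather than a linear-time procedure, and the "well-labeled" test of the lemma is not implemented.

```python
def reduct(parent, labels, var_label, atoms):
    """Prune candidate nodes per variable until every atom is arc-consistent."""
    def ancestors(n):
        out = set()
        while n in parent:
            n = parent[n]
            out.add(n)
        return out

    holds = {
        "child+": lambda x, y: x in ancestors(y),            # x is a proper ancestor of y
        "child*": lambda x, y: x == y or x in ancestors(y),  # ancestor-or-self
    }
    # Start from label pruning: a variable may only map to nodes with its label.
    cand = {v: {n for n, l in labels.items() if l == lab}
            for v, lab in var_label.items()}
    changed = True
    while changed:
        changed = False
        for axis, x, y in atoms:
            keep_x = {n for n in cand[x] if any(holds[axis](n, m) for m in cand[y])}
            keep_y = {m for m in cand[y] if any(holds[axis](n, m) for n in cand[x])}
            if keep_x != cand[x] or keep_y != cand[y]:
                cand[x], cand[y] = keep_x, keep_y
                changed = True
    return cand


# Hypothetical data tree: a(0)[ c(1)[ c(4) ], b(2)[ c(3) ] ]
parent = {1: 0, 2: 0, 3: 2, 4: 1}
labels = {0: "a", 1: "c", 2: "b", 3: "c", 4: "c"}
# Hypothetical cyclic query over X, Y, Z, U.
var_label = {"X": "a", "Y": "b", "Z": "c", "U": "c"}
atoms = [("child+", "X", "Y"), ("child+", "X", "Z"),
         ("child+", "Y", "U"), ("child*", "Z", "U")]

cand = reduct(parent, labels, var_label, atoms)
```

In this example the atom child+(Y,U) removes U from the c nodes without a b ancestor, and child*(Z,U) then removes Z from the c nodes that do not have U as descendant-or-self.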
60 Web wrapping
Goal: Make web contents accessible to electronic data processing
WEB: HTML pages, layout
WRAPPER
Corporate EDP apps: structured data, databases, XML
Wrappers select, extract, annotate. Monadic datalog is ideally suited, but who wants to do it?
LiXto: a graphical wrapper generator for ELOG
61
<?xml version="1.0" encoding="UTF-8"?>
<document>
<record>
<number>409449118</number>
<item>98 Degrees - Notebook - New</item>
<picture/>
<price>2.99</price>
<currency></currency>
<bids>-</bids>
</record>
<record>
<number>413171469</number>
<item>Notebook - Compaq Presario 1207</item>
<price>730.00</price>
<currency>AU </currency>
...
62 Lixto Architecture
[Figure: architecture diagram with the components Visual Wrapper Generator, Web, and Example page(s)]
63 Elog Program for eBay pages
64 Expressive power of LiXto
Elog⁻: monadic kernel of Elog
Theorems [G., Koch, PODS 2002]
- Elog⁻ expresses monadic datalog
- All of Elog⁻ is graphically programmable via LiXto
Corollary
LiXto expresses all MSO wrapping tasks.
65 Comparison to other Wrapper Generators
- Lixto more powerful than regular path queries
- Lixto more powerful than HEL (Sahuguet, Azavant)
- → paper
66 The Lixto Suite
- Automated navigation to target pages
- Automated data extraction from target pages
- Automated data analysis, transformation and integration
- Automated data personalization
- Automated data delivery
Visual Wrapper
Transformation Server
67 Product Architecture
Transformation Server
LiXto Extraction Engine
68 Marketing Business Intelligence
Marketing Department
Oracle 9
Business Objects report
BI Tool
69 Major Customers of LiXto