Title: Deductive databases
1Deductive databases
- Toon Calders
- t.calders_at_tue.nl
2Motivation Deductive DB
- Motivation is two-fold
- add deductive capabilities to databases; the database contains
- facts (extensional relations)
- rules to generate derived facts (intensional relations)
- Database is a knowledge base
- Extend the querying
- Datalog allows for recursion
3Motivation Deductive DB
- Datalog as engine of deductive databases
- similarities with Prolog
- has facts and rules
- rules define -possibly recursive- views
- Semantics not always clear
- safety
- negation
- recursion
4Outline
- Syntax of the Datalog language
- Semantics of a Datalog program
- Relational algebra = safe Datalog with negation and without recursion
- Optimization techniques
- Conclusions
5Syntax of Datalog
- Datalog query/program
- facts: traditional relational tables
- rules: define intensional views
- Rules
- if-then rules
- can contain recursion
- can contain negations
- Semantics of program can be ambiguous
6Syntax of Datalog
- Example
- father(X,Y) :- person(X,m), parent(X,Y).
- grandson(X,Y) :- parent(Y,Z), parent(Z,X), person(X,m).
- hbrothers(X,Y) :- person(X,m), person(Y,m), parent(Z,X), parent(Z,Y).
7Syntax of Datalog
- Variables: X, Y
- Constants: m, f, rita, ...
- Positive literal: p(t1, ..., tn)
- p is the name of a relation (EDB or IDB)
- t1, ..., tn are constants or variables
- Negative literal: not p(t1, ..., tn)
- Rule: h :- l1, ..., ln
- h is a positive literal; l1, ..., ln are literals
- In Datalog: correct negation (in contrast to Prolog's negation by failure)
8Syntax of Datalog
- Rule can be recursive
- Arithmetic operations are considered special predicates
- A < B: smaller(A,B)
- A + B = C: plus(A,B,C)
9Outline
- Syntax of the Datalog language
- Semantics of a Datalog program
- non-recursive
- recursive datalog
- aggregation
- Relational algebra = safe Datalog with negation and without recursion
- Optimization techniques
- Conclusions
10Semantics of Non-Recursive Datalog Programs
- Ground instantiation of a rule h :- l1, ..., ln: replace every variable in the rule by a constant
- Example
- father(X,Y) :- person(X,m), parent(X,Y).
- instantiation
- father(toon,an) :- person(toon,m), parent(toon,an).
11Semantics of Non-Recursive Datalog Programs
- Let I be a set of facts
- The body of a rule instantiation R is satisfied by I if
- every positive literal in the body of R is in I
- for every negative literal not p(...) in the body of R, p(...) is not in I
- Example
- the body person(toon,m), parent(toon,an) is not satisfied by the facts given before
12Semantics of Non-Recursive Datalog Programs
- Let I be a set of facts
- R is a rule h :- l1, ..., ln
- Infer(R,I) = { h' | h' :- l1', ..., ln' is a ground instantiation of R and l1', ..., ln' is satisfied by I }
- For a set of rules R = {R1, ..., Rn}: Infer(R,I) = Infer(R1,I) ∪ ... ∪ Infer(Rn,I)
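A minimal Python sketch of Infer(R,I) for a single rule may make the definition concrete. The representation is an assumption made for illustration only: atoms are (predicate, arguments) pairs, variables are strings starting with an uppercase letter, a rule body is a list of (is_positive, atom) pairs, and the rule is assumed to be safe. The facts are hypothetical.

```python
def is_var(term):
    # Assumed convention: variables start with an uppercase letter.
    return isinstance(term, str) and term[:1].isupper()

def match(atom, fact, subst):
    """Extend subst so that atom becomes equal to fact; return None if impossible."""
    (pred, args), (fpred, fargs) = atom, fact
    if pred != fpred or len(args) != len(fargs):
        return None
    subst = dict(subst)
    for a, f in zip(args, fargs):
        if is_var(a):
            if subst.setdefault(a, f) != f:
                return None
        elif a != f:
            return None
    return subst

def ground(atom, subst):
    pred, args = atom
    return (pred, tuple(subst.get(a, a) for a in args))

def infer(rule, facts):
    """Heads of all ground instantiations of rule whose body is satisfied by facts."""
    head, body = rule
    substs = [{}]
    for positive, atom in body:          # bind variables through the positive literals
        if positive:
            substs = [s2 for s in substs for f in facts
                      if (s2 := match(atom, f, s)) is not None]
    return {ground(head, s) for s in substs
            if all(ground(atom, s) not in facts
                   for positive, atom in body if not positive)}

# Hypothetical facts, not the ones used on the slides.
facts = {("person", ("toon", "m")), ("person", ("an", "f")),
         ("parent", ("toon", "bernd")), ("parent", ("an", "bernd"))}
father_rule = (("father", ("X", "Y")),
               [(True, ("person", ("X", "m"))), (True, ("parent", ("X", "Y")))])
derived = infer(father_rule, facts)
```

With these facts only the instantiation X = toon, Y = bernd has a satisfied body, so a single father fact is derived.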
13Semantics of Non-Recursive Datalog Programs
- A rule h :- l1, ..., ln is in layer 1 if
- l1, ..., ln only involve extensional predicates
- A rule h :- l1, ..., ln is in layer i if
- for all 0 < j < i, it is not in layer j
- l1, ..., ln only involve predicates that are extensional or defined by rules in layers 1, ..., i-1
14Semantics of Non-Recursive Datalog Programs
- Let I0 be the facts in a Datalog program
- Let R1 be the rules at layer 1
- ...
- Let Rn be the rules at layer n
- I1 = I0 ∪ Infer(R1, I0)
- I2 = I1 ∪ Infer(R2, I1)
- ...
- In = In-1 ∪ Infer(Rn, In-1)
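The layer-by-layer computation can be sketched in Python for a two-layer program. The person and parent facts below are assumptions chosen for illustration, and the two rules are written directly as set comprehensions rather than through a general engine.

```python
# I0: hypothetical extensional facts.
person = {("alex", "m"), ("jan", "m"), ("toon", "m"),
          ("bernd", "m"), ("mattijs", "m"), ("an", "f")}
parent = {("alex", "toon"), ("jan", "an"), ("toon", "bernd"),
          ("toon", "mattijs"), ("an", "bernd"), ("an", "mattijs")}

# Layer 1: father(X,Y) :- person(X,m), parent(X,Y).
# The body uses extensional predicates only.
father = {(x, y) for (x, y) in parent if (x, "m") in person}

# Layer 2: grandfather(X,Z) :- father(X,Y), parent(Y,Z).
# The body uses father, which is complete once layer 1 has been evaluated.
grandfather = {(x, z) for (x, y) in father for (y2, z) in parent if y == y2}
```

Each layer is evaluated exactly once, because everything its rules refer to has already been computed.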
15Semantics of Non-Recursive Datalog Programs
- father(X,Y) :- person(X,m), parent(X,Y).
- grandfather(X,Z) :- father(X,Y), parent(Y,Z).
- hbrothers(X,Y) :- person(X,m), person(Y,m), parent(Z,X), parent(Z,Y).
16Semantics of Non-Recursive Datalog Programs
- father(X,Y) :- person(X,m), parent(X,Y).
- grandfather(X,Z) :- father(X,Y), parent(Y,Z).
- hbrothers(X,Y) :- person(X,m), person(Y,m), parent(Z,X), parent(Z,Y).
- Stratum 0: the database tables (facts)
17Semantics of Non-Recursive Datalog Programs
- father(X,Y) :- person(X,m), parent(X,Y).
- grandfather(X,Z) :- father(X,Y), parent(Y,Z).
- hbrothers(X,Y) :- person(X,m), person(Y,m), parent(Z,X), parent(Z,Y).
- Stratum 0: the database tables (facts)
- Stratum 1
- father: (alex, toon), (jan, an), (toon, bernd), (toon, mattijs)
- hbrothers: (bernd, mattijs), (mattijs, bernd), (mattijs, mattijs), (bernd, bernd)
18Semantics of Non-Recursive Datalog Programs
- father(X,Y) :- person(X,m), parent(X,Y).
- grandfather(X,Z) :- father(X,Y), parent(Y,Z).
- hbrothers(X,Y) :- person(X,m), person(Y,m), parent(Z,X), parent(Z,Y).
- Stratum 0: the database tables (facts)
- Stratum 1
- father: (alex, toon), (jan, an), (toon, bernd), (toon, mattijs)
- hbrothers: (bernd, mattijs), (mattijs, bernd), (mattijs, mattijs), (bernd, bernd)
- Stratum 2
- grandfather: (jan, mattijs), (jan, bernd), (alex, mattijs), (alex, bernd)
19Caveat Correct Negation
- Negation in Datalog ≠ negation in Prolog
- Prolog negation (Negation by failure)
- not(p(X)) is true if we fail to prove p(X)
- Datalog negation (Correct negation)
- not(p(X)) binds X to a value such that p(X) does
not hold.
20Caveat Correct Negation
- Example
- father(a,b). person(a). person(b).
- nfather(X) :- person(X), not( father(X,Y) ), person(Y).
- Datalog
- ?- nfather(X) → (a), (b)
- Prolog
- ?- nfather(X) → (b)
- ?- person(a), not(father(a,a)), person(a) → yes
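The two readings of the nfather rule can be contrasted in a small Python sketch. The function prolog_nfather is a hand-written simulation of left-to-right evaluation with negation by failure for this one rule only, not a Prolog interpreter.

```python
person = ["a", "b"]
father = {("a", "b")}

# Datalog (correct negation): X is an answer if SOME binding of Y
# satisfies the whole body person(X), not father(X,Y), person(Y).
datalog_answers = {x for x in person for y in person if (x, y) not in father}

def prolog_nfather():
    """Simulate left-to-right evaluation with negation by failure."""
    answers = []
    for x in person:                               # person(X)
        # not(father(X,Y)) is called with Y unbound: it fails as soon as
        # father(X,Y) is provable for any Y at all.
        if any(fx == x for (fx, _) in father):
            continue
        for _y in person:                          # person(Y)
            answers.append(x)
    return answers

prolog_answers = set(prolog_nfather())
```

Under the Datalog reading both a and b are answers, because father(a,a) does not hold; the simulated Prolog evaluation rejects a because father(a,b) is provable.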
21Caveat Correct Negation
- Prolog
- Order of the literals within a clause is important
- nfather(X) :- person(X), not( father(X,Y) ), person(Y).
- versus
- nfather(X) :- person(X), person(Y), not( father(X,Y) ).
- Order of the rules is important
- Datalog
- Order not important
- More declarative
22Caveat Correct Negation
- Difference is not fundamental
- Prolog
- nfather(X) :- person(X), not( father(X,Y) ).
- is equivalent to
- Datalog
- nfather(X) :- person(X), not( father_of_someone(X) ).
- father_of_someone(X) :- father(X,Y).
23Caveat Correct Negation
- Difference is not fundamental
- Many systems that claim to implement Datalog actually implement negation by failure.
- Debating whether or not this is correct is pointless: both perspectives are useful
- Check beforehand how an engine implements negation
- Throughout the course, in all exercises, in the exam, ..., we assume correct negation.
24Safety
- A rule can make no sense if variables appear in funny ways
- Examples
- S(x) :- R(y)
- S(x) :- not R(x)
- S(x) :- R(y), x < y
- In each of these cases the result is infinite even if the relation R is finite
25Safety
- Even when not leading to infinite relations, such Datalog programs can be domain-dependent.
- Example
- s(a,b). s(a,a). r(a). r(b).
- t(X) :- not(s(X,Y)), r(X).
- If domain is {a,b}
- only t(b) holds.
- If domain is {a,b,c}
- not only t(b), but also t(a) holds
- ( Ground instantiation: t(a) :- not(s(a,c)), r(a). )
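The domain dependence can be checked with a short Python sketch that evaluates the unsafe rule over an explicitly given domain. The function name t and the way the domain is passed in are illustrative choices.

```python
s = {("a", "b"), ("a", "a")}
r = {"a", "b"}

def t(domain):
    # t(X) :- not(s(X,Y)), r(X).
    # Y occurs only in the negative literal, so it ranges over the whole domain.
    return {x for x in r for y in domain if (x, y) not in s}

small = t({"a", "b"})
large = t({"a", "b", "c"})
```

Adding the constant c to the domain changes the answer although no fact mentions c, which is exactly what safety rules out.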
26Safety
- Therefore, we will only consider rules that are safe.
- A rule h :- l1, ..., ln is safe if
- every variable in the head of the rule also occurs in a non-arithmetic positive literal in the body
- every variable in a negative literal of the body also occurs in some positive literal of the body
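The two safety conditions translate into a short syntactic check. This sketch assumes atoms are (predicate, arguments) pairs with uppercase-initial variables, and that the caller passes only the non-arithmetic positive literals as positive; both conventions are illustrative. It also assumes that the positive literal required by the second condition is non-arithmetic.

```python
def is_var(term):
    # Assumed convention: variables start with an uppercase letter.
    return term[:1].isupper()

def variables(literals):
    return {t for (_, args) in literals for t in args if is_var(t)}

def is_safe(head, positive, negative):
    """head: one atom; positive: non-arithmetic positive body literals;
    negative: the atoms that occur under 'not' in the body."""
    bound = variables(positive)
    return variables([head]) <= bound and variables(negative) <= bound

# father(X,Y) :- person(X,m), parent(X,Y).
safe_father = is_safe(("father", ("X", "Y")),
                      [("person", ("X", "m")), ("parent", ("X", "Y"))], [])
# s(X) :- r(Y).
unsafe_head = is_safe(("s", ("X",)), [("r", ("Y",))], [])
# s(X) :- not r(X).
unsafe_neg = is_safe(("s", ("X",)), [], [("r", ("X",))])
# t(X) :- not s(X,Y), r(X).
unsafe_domain = is_safe(("t", ("X",)), [("r", ("X",))], [("s", ("X", "Y"))])
```

The last rule is rejected because Y appears only inside the negative literal.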
27Model-Theoretic Semantics
- A model M of a Datalog program is
- An instantiation of all intensional relations in the program
- That satisfies all rules in the program
- If the body of a ground instantiation of a rule holds in M, the head must hold as well
- Some models are special
28Model-Theoretic Semantics
- father(a,b).
- person(X) :- father(X,Y).
- person(Y) :- father(X,Y).
- M1: father = {(a,b)}; person = {a, b}
- M2: father = {(a,b), (b,a), (a,a)}; person = {a, b}
29Model-Theoretic Semantics
- A model is minimal if we cannot remove tuples and still have a model
- M1: father = {(a,b)}; person = {a, b}: minimal
- M2: father = {(a,b), (b,a), (a,a)}; person = {a, b}: not minimal
30Model-Theoretic Semantics
- For non-recursive, safe Datalog programs the semantics is well defined
- The model: all facts that can be derived from the program
- Closed-World Assumption: if a fact cannot be derived from the database, then it is not true
- Is a minimal model
31Model-Theoretic Semantics
- Minimal model is, however, not necessarily unique
- Example
- r(a).
- t(X) :- r(X), not s(X).
- minimal models
- M1: r = {a}; s = { }; t = {a}
- M2: r = {a}; s = {a}; t = { }
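That both models are minimal can be verified by brute force. In this sketch a model is a set of (predicate, constant) pairs and the rule t(X) :- r(X), not s(X) is hard-coded; the encoding and the helper names are illustrative.

```python
from itertools import combinations

def is_model(m):
    """m contains the fact r(a) and satisfies t(X) :- r(X), not s(X)."""
    if ("r", "a") not in m:
        return False
    constants = {c for (_, c) in m}
    return all(("t", x) in m
               for x in constants
               if ("r", x) in m and ("s", x) not in m)

def is_minimal(m):
    """m is a model and no proper subset of m is a model."""
    facts = sorted(m)
    proper_subsets = (set(c) for k in range(len(facts))
                      for c in combinations(facts, k))
    return is_model(m) and not any(is_model(sub) for sub in proper_subsets)

M1 = {("r", "a"), ("t", "a")}
M2 = {("r", "a"), ("s", "a")}
M3 = {("r", "a"), ("s", "a"), ("t", "a")}   # a model, but it contains M1
```

M1 and M2 are incomparable, so neither can be obtained from the other by removing tuples.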
32Outline
- Syntax of the Datalog language
- Semantics of a Datalog program
- non-recursive
- recursive datalog
- aggregation
- Relational algebra = safe Datalog with negation and without recursion
- Optimization techniques
- Conclusions
33Semantics of Recursive Datalog Programs
- g(a,b). g(b,c). g(a,d).
- reach(X,X) :- g(X,Y). reach(Y,Y) :- g(X,Y).
- reach(X,Y) :- g(X,Y).
- reach(X,Z) :- reach(X,Y), reach(Y,Z).
- Fixpoint of a set of rules R, starting with set of facts I
- repeat
- Old_I := I; I := I ∪ infer(R,I)
- until I = Old_I
- Always terminates (inflationary fixpoint)
34Semantics of Recursive Datalog Programs
- g(a,b). g(b,c). g(a,d).
- reach(X,X) :- g(X,Y). reach(Y,Y) :- g(X,Y).
- reach(X,Y) :- g(X,Y).
- reach(X,Z) :- reach(X,Y), reach(Y,Z).
- Step 0: reach = { }
- Step 1: reach = { (a,a), (b,b), (c,c), (d,d), (a,b), (b,c), (a,d) }
- Step 2: reach = { (a,a), (b,b), (c,c), (d,d), (a,b), (b,c), (a,d), (a,c) }
- Step 3: reach = { (a,a), (b,b), (c,c), (d,d), (a,b), (b,c), (a,d), (a,c) } STOP
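The naive fixpoint loop for the reach program can be sketched in Python as follows. The four rules are hard-coded in infer, and steps records the value of reach after each iteration that still adds something.

```python
g = {("a", "b"), ("b", "c"), ("a", "d")}

def infer(reach):
    """One application of all four reach rules to the current facts."""
    new = {(x, x) for (x, _) in g} | {(y, y) for (_, y) in g} | set(g)
    new |= {(x, z) for (x, y) in reach for (y2, z) in reach if y == y2}
    return new

def fixpoint():
    reach, steps = set(), [set()]      # steps[0] is Step 0
    while True:
        old = reach
        reach = reach | infer(reach)   # I := I ∪ infer(R, I)
        if reach == old:               # until I = Old_I
            return reach, steps
        steps.append(reach)

reach, steps = fixpoint()
```

The pair (a,c) first appears in the second iteration, once (a,b) and (b,c) are both available to the recursive rule.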
35Semantics of Recursive Datalog Programs
- Datalog without negation
- Always a unique minimal model.
- Semantics of recursive Datalog with negation is less clear.
- Example
- T(a).
- R(X) :- T(X), not S(X).
- S(X) :- T(X), not R(X).
- What about R(a)? S(a)?
36Semantics of Recursive Datalog Programs
- For some classes of Datalog queries with negation a natural semantics can still be defined
- Important class: stratified programs
- T depends on S if some rule with T in the head contains S, or (recursively) some predicate that depends on S, in the body.
- Stratified program: if T depends on (not S), then S cannot depend on T or (not T).
37Semantics of Recursive Datalog Programs
- The program
- T(a).
- R(X) :- T(X), not S(X).
- S(X) :- T(X), not R(X).
- is not stratified
- R depends negatively on S
- S depends negatively on R
- (Figure: dependency graph over R, T, S)
38Semantics of Recursive Datalog Programs
- g(a,b). g(b,c). g(a,d).
- reach(X,X) :- g(X,Y).
- reach(Y,Y) :- g(X,Y).
- reach(X,Y) :- g(X,Y).
- reach(X,Z) :- reach(X,Y), reach(Y,Z).
- node(X) :- g(X,Y).
- node(Y) :- g(X,Y).
- unreach(X,Y) :- node(X), node(Y), not reach(X,Y).
- (Figure: dependency graph over g, reach, node, unreach)
39Semantics of Recursive Datalog Programs
- If a program is stratified, the tables in the program can be partitioned into strata
- Stratum 0: all database tables.
- Stratum i: tables defined in terms of tables in stratum i and lower strata.
- If T depends on (not S), S is in a lower stratum than T.
40Semantics of Recursive Datalog Programs
- g(a,b). g(b,c). g(a,d).
- reach(X,X) :- g(X,Y).
- reach(Y,Y) :- g(X,Y).
- reach(X,Y) :- g(X,Y).
- reach(X,Z) :- reach(X,Y), reach(Y,Z).
- node(X) :- g(X,Y).
- node(Y) :- g(X,Y).
- unreach(X,Y) :- node(X), node(Y), not reach(X,Y).
- Stratum 0: g
- Stratum 1: reach, node
- Stratum 2: unreach
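One way to compute such a partition is to raise stratum numbers until all dependencies are respected. The sketch below assumes the program is summarised as (head, body predicate, is_negated) triples; the function name stratify and this encoding are illustrative, not a fixed algorithm from the slides.

```python
def stratify(edb, deps):
    """Return {predicate: stratum}, or None if the program is not stratified.

    deps holds (head, body_predicate, is_negated) triples, one per body literal.
    """
    preds = set(edb) | {h for h, _, _ in deps} | {b for _, b, _ in deps}
    stratum = {p: (0 if p in edb else 1) for p in preds}
    changed = True
    while changed:
        changed = False
        for head, body, negated in deps:
            # A negated dependency forces a strictly higher stratum.
            need = stratum[body] + (1 if negated else 0)
            if stratum[head] < need:
                if need > len(preds):
                    return None        # strata keep growing: negative cycle
                stratum[head] = need
                changed = True
    return stratum

reach_deps = [("reach", "g", False), ("reach", "reach", False),
              ("node", "g", False),
              ("unreach", "node", False), ("unreach", "reach", True)]
reach_strata = stratify({"g"}, reach_deps)

rs_deps = [("R", "T", False), ("R", "S", True),
           ("S", "T", False), ("S", "R", True)]
rs_strata = stratify({"T"}, rs_deps)
```

For the reach program this gives the strata listed above; for the R/S program the numbers never settle, which the function reports as None.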
41Semantics of Recursive Datalog Programs
- Semantics of a stratified program is given by
- First, compute the least fixpoint of all tables in Stratum 1. (Stratum 0 tables are fixed.)
- Then, compute the least fixpoint of tables in Stratum 2, then the lfp of tables in Stratum 3, and so on, stratum by stratum.
42Semantics of Recursive Datalog Programs
- Fixpoint of a set of rules R, starting with set of facts I
- repeat
- Old_I := I; I := I ∪ infer(R,I)
- until I = Old_I
- Fixpoint within one stratum always terminates
- Due to monotonicity within the strata
- Only positive dependence between tables within a stratum.
- Because the program is finite, the number of strata is finite as well
43Semantics of Recursive Datalog Programs
- Stratum 0: g(a,b). g(b,c). g(a,d).
- Stratum 1: node(a), node(b), node(c), node(d), reach(a,a), reach(b,b), reach(c,c), reach(d,d), reach(a,b), reach(b,c), ...
- Stratum 2
- unreach(b,a), unreach(c,a), ...
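The stratum-by-stratum evaluation of this program can be sketched directly in Python. The rules are hard-coded as set comprehensions; the point is only that reach is complete before the negated literal is evaluated.

```python
g = {("a", "b"), ("b", "c"), ("a", "d")}          # Stratum 0

# Stratum 1: least fixpoint of the positive rules for node and reach.
node = {x for (x, _) in g} | {y for (_, y) in g}
reach = {(x, x) for x in node} | set(g)
while True:
    new = reach | {(x, z) for (x, y) in reach for (y2, z) in reach if y == y2}
    if new == reach:
        break
    reach = new

# Stratum 2: reach is now fixed, so "not reach(X,Y)" is a plain lookup.
unreach = {(x, y) for x in node for y in node if (x, y) not in reach}
```

Evaluating unreach before reach has reached its fixpoint would wrongly include pairs such as (a,c).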
44Outline
- Syntax of the Datalog language
- Semantics of a Datalog program
- non-recursive
- recursive datalog
- aggregation
- Relational algebra = safe Datalog with negation and without recursion
- Optimization techniques
- Conclusions
45Aggregate Operators
- Degree(X, COUNT(<Y>)) :- g(X,Y).
- The < ... > notation in the head indicates grouping; the remaining arguments (X, in this example) are the GROUP BY fields.
- In order to apply such a rule, all of relation g must be available.
- Stratification with respect to use of < ... > is similar to negation.
46Aggregate Operators
- bi(X,Y) :- g(X,Y).
- bi(Y,X) :- g(X,Y).
- Degree(X, COUNT(<Y>)) :- bi(X,Y).
- (Figure: dependency graph over g, bi, degree)
47Aggregate Operators
- bi(X,Y) :- g(X,Y).
- bi(Y,X) :- g(X,Y).
- Degree(X, COUNT(<Y>)) :- bi(X,Y).
- Stratum 0: g
- Stratum 1: bi
- Stratum 2: degree
48Aggregate Operators
- bi(X,Y) :- g(X,Y).
- bi(Y,X) :- g(X,Y).
- Degree(X, COUNT(<Y>)) :- bi(X,Y).
- Stratum 0: g
- Stratum 1: bi
- Stratum 2: degree
- Compute stratum by stratum
- Assume strata 1, ..., k are fixed when computing stratum k+1
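A small Python sketch of the stratified evaluation with an aggregate follows. It assumes the aggregate counts the grouped Y values per X (node names cannot be summed), and the edge facts in g are hypothetical.

```python
from collections import Counter

g = {("a", "b"), ("b", "c"), ("a", "d")}      # Stratum 0 (hypothetical edges)

# Stratum 1: bi(X,Y) :- g(X,Y).  bi(Y,X) :- g(X,Y).
bi = g | {(y, x) for (x, y) in g}

# Stratum 2: group bi by its first column and count the Y values.
# bi is complete at this point, so every group is final.
degree = Counter(x for (x, _) in bi)
```

Because bi is a set, each neighbour is counted once per node, which matches the grouping semantics described above.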
49Aggregate Operators
- r(a,b). r(a,c). s(a,d).
- t(X, COUNT(<Y>)) :- r(X,Y).
- r(X,Y) :- t(X,Z), Z=2, s(X,Y).
50Aggregate Operators
- r(a,b). r(a,c). s(a,d).
- t(X, COUNT(<Y>)) :- r(X,Y).
- r(X,Y) :- t(X,Z), Z=2, s(X,Y).
- (Figure: dependency graph over t, r, s)
51Aggregate Operators
- r(a,b). r(a,c). s(a,d).
- t(X, COUNT(<Y>)) :- r(X,Y).
- r(X,Y) :- t(X,Z), Z=2, s(X,Y).
- (Figure: dependency graph over t, r, s)
- The program is aggregating over a moving target
- Step 1: t(a,2) is added
- Step 2: r(a,d) is added
- Step 3: t(a,3) is added and t(a,2) is no longer true; hence r(a,d) should not have been added
52Outline
- Syntax of the Datalog language
- Semantics of a Datalog program
- Relational algebra = safe Datalog with negation and without recursion
- Optimization techniques
- Conclusions
53RA Non-Recursive Datalog
- Every operator of RA can be simulated by non-recursive Datalog
- Project on the first attribute of the ternary relation r: query(A) :- r(A, B, C).
- Cartesian product of relations r1 and r2.
- query(X1, X2, ..., Xn, Y1, Y2, ..., Ym) :- r1(X1, X2, ..., Xn), r2(Y1, Y2, ..., Ym).
- Union of relations r1 and r2.
- query(X1, X2, ..., Xn) :- r1(X1, X2, ..., Xn). query(X1, X2, ..., Xn) :- r2(X1, X2, ..., Xn).
- Set difference of r1 and r2.
- query(X1, X2, ..., Xn) :- r1(X1, X2, ..., Xn), not r2(X1, X2, ..., Xn).
54RA Non-Recursive Datalog
- Every operator of RA can be simulated by non-recursive Datalog
- Result of our construction is always safe, and equivalent under the stratified semantics
- σ_{1=3}((π_1 R) × R)
- is translated to
- query1(A) :- R(A,B).
- query2(A,B,C) :- query1(A), R(B,C).
- result(A,B,A) :- query2(A,B,A).
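The equivalence of the algebra expression and its rule-by-rule translation can be checked on a small instance. The relation R below is hypothetical, and both sides are written as Python set comprehensions.

```python
R = {(1, 2), (2, 1), (3, 3)}                     # hypothetical binary relation

# Relational algebra, one operator at a time.
proj = {(a,) for (a, _) in R}                    # π_1 R
prod = {p + r for p in proj for r in R}          # (π_1 R) × R
ra_result = {t for t in prod if t[0] == t[2]}    # σ_{1=3}

# Datalog translation, one rule at a time.
query1 = {(a,) for (a, b) in R}                          # query1(A) :- R(A,B).
query2 = {(a, b, c) for (a,) in query1 for (b, c) in R}  # query2(A,B,C) :- query1(A), R(B,C).
result = {(a, b, c) for (a, b, c) in query2 if a == c}   # result(A,B,A) :- query2(A,B,A).
```

The repeated variable A in the head and body of the last rule plays the role of the selection condition 1=3.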
55RA Non-Recursive Datalog
- Every rule can be expressed by one RA expression
- Translate every atom separately
- Negation/arithmetic: use the complement construction
- Essential: safety
- Combine atoms with Cartesian product
- Do the joins with a selection
- Project on the relevant attributes
- Strata determine the order of evaluation
- Because there is no recursion, every rule is executed only once.
56RA Non-Recursive Datalog
- sister(X,Y) :- person(X,f), parent(Z,X), parent(Z,Y), not(X=Y).
- person(X,f): σ_{2=f} Person
- parent(Z,X) and parent(Z,Y): Parent
- not(X=Y): complement construction
- X comes from parent(Z,X) → π_2 Parent
- Y comes from parent(Z,Y) → π_2 Parent
- not(X=Y) → σ_{1≠2}(π_2 Parent × π_2 Parent)
57RA Non-Recursive Datalog
- sister(X,Y) :- person(X,f), parent(Z,X), parent(Z,Y), not(X=Y).
- π_{1,6} σ_{1=4, 3=5, 1=7, 6=8} (σ_{2=f} Person × Parent × Parent × σ_{1≠2}(π_2 Parent × π_2 Parent))
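The algebra expression for the sister rule can be evaluated step by step in Python and compared with a direct reading of the rule. The person and parent facts are hypothetical; column numbers in the comments are 1-based as in the expression, while the Python tuple indices are 0-based.

```python
from itertools import product

person = {("an", "f"), ("bernd", "m"), ("eva", "f")}            # hypothetical
parent = {("toon", "an"), ("toon", "bernd"), ("toon", "eva")}   # hypothetical

sel_person = {p for p in person if p[1] == "f"}                 # σ_{2=f} Person
p2 = {(c,) for (_, c) in parent}                                # π_2 Parent
neq = {a + b for a, b in product(p2, p2) if a[0] != b[0]}       # σ_{1≠2}(π_2 Parent × π_2 Parent)

# Columns: 1-2 Person, 3-4 Parent, 5-6 Parent, 7-8 complement relation.
big = {a + b + c + d for a, b, c, d in product(sel_person, parent, parent, neq)}
sel = {t for t in big
       if t[0] == t[3] and t[2] == t[4] and t[0] == t[6] and t[5] == t[7]}
sister_ra = {(t[0], t[5]) for t in sel}                         # π_{1,6}

# Direct reading of sister(X,Y) :- person(X,f), parent(Z,X), parent(Z,Y), not(X=Y).
sister_rule = {(x, y) for (z, x) in parent for (z2, y) in parent
               if z == z2 and (x, "f") in person and x != y}
```

The full Cartesian product is only for illustration; a real engine would push the selections into joins.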
58RA Non-Recursive Datalog
- Hence, the following two are equivalent in expressive power
- Safe Datalog with negation, without recursion or aggregation, under the stratified semantics
- Relational Algebra
- Every rule separately can be expressed by a relational algebra expression
- Makes it very suitable for implementation on top of a relational database
59Outline
- Syntax of the Datalog language
- Semantics of a Datalog program
- Relational algebra = Datalog with negation and without recursion
- Optimization techniques
- Conclusions
60Evaluation of Datalog Programs
- Running example
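- root(r). child(a,r). child(b,r). child(c,a).
- child(d,a). child(e,c). child(f,d). child(h,b).
- child(X,U) means that X is a child of U
- sg(X,Y) :- root(X), root(Y).
- sg(X,Y) :- child(X,U), sg(U,V), child(Y,V).
- (Figure: tree with root r; a and b below r; c and d below a; h below b; e below c; f below d)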
61Evaluation of Datalog Programs Issues
- Repeated inferences: when recursive rules are repeatedly applied in the naïve way, the same inferences are made in several iterations.
- Unnecessary inferences: if we just want to find sg of a particular node, say e, computing the fixpoint of the sg program and then selecting tuples with e in the first column is wasteful, in that we compute many irrelevant facts.
62Evaluation of Datalog Programs
- Running example
- Query: ?- sg(e,X)
- (r,r)
- (a,a), (b,b), (a,b), (b,a)
- (c,c), (c,d), (c,h), (d,c), (d,d), ...
- (e,e), (f,f), (e,f), (f,e)
- (Figure: the tree of the running example)
63Avoiding Repeated Inferences
- Seminaive fixpoint evaluation: avoid repeated inferences by requiring that at least one of the body facts was generated in the most recent iteration.
- For each recursive table P, use a table delta_P.
- Rewrite the program to use the delta tables.
- A second evaluation of the rule
- r(X,Y) :- s(X), t(Y), u(X,Z), v(Y,Z).
- only gives new tuples (X,Y) for ground instantiations in which at least one of the atoms is new.
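A sketch of seminaive evaluation for the sg program of the running example is given below. It assumes child(X,U) is read as "X is a child of U", the orientation the sg rules rely on. Because the recursive rule contains a single sg atom, feeding it only the delta table is enough; rules with several recursive atoms need one delta variant per atom.

```python
child = {("a", "r"), ("b", "r"), ("c", "a"), ("d", "a"),
         ("e", "c"), ("f", "d"), ("h", "b")}
root = {"r"}

def step(sg_part):
    """sg(X,Y) :- child(X,U), sg(U,V), child(Y,V), with sg restricted to sg_part."""
    return {(x, y) for (x, u) in child for (y, v) in child if (u, v) in sg_part}

def naive():
    sg = {(x, y) for x in root for y in root}
    while True:
        new = sg | step(sg)            # re-derives all old facts every round
        if new == sg:
            return sg
        sg = new

def seminaive():
    sg = {(x, y) for x in root for y in root}
    delta, rounds = set(sg), 0
    while delta:
        delta = step(delta) - sg       # only facts new in the previous round are joined
        sg |= delta
        rounds += 1
    return sg, rounds

sg, rounds = seminaive()
```

Both strategies compute the same relation; the seminaive loop simply never joins an sg fact a second time.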
64Avoiding Unnecessary Inferences
- Still, in the running example
- many unnecessary deductions when query is ?- sg(e,X)
- Compare with top-down
- as in Prolog
- only facts that are connected to the ultimate goal are being considered
65The Prolog Way
- sg(X,Y) :- root(X), root(Y).
- sg(X,Y) :- child(X,U), sg(U,V), child(Y,V).
- ?- sg(e,X).
- try root(e): FAIL
- try child(e,U)
- → U = c
- try sg(c,V)
- try root(c): FAIL
- try child(c,U)
- → U = a
- ...
- (Figure: the tree of the running example)
67Magic Sets Idea
- We want to do something similar for Datalog
- Idea: define a filter table that computes all relevant values and restricts the computation of sg(e,X).
- sg(X,Y) :- m(X), root(X), root(Y).
- sg(X,Y) :- m(X), child(X,U), sg(U,V), child(Y,V).
- m(X) :- m(Y), child(Y,X).
- m(e).
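The effect of the filter table can be sketched in Python for the query sg(e,X). As before, child(X,U) is assumed to mean "X is a child of U", and the rules are hard-coded as set comprehensions evaluated with a naive fixpoint.

```python
child = {("a", "r"), ("b", "r"), ("c", "a"), ("d", "a"),
         ("e", "c"), ("f", "d"), ("h", "b")}
root = {"r"}

# m(e).  m(X) :- m(Y), child(Y,X).
m = {"e"}
while True:
    new = m | {x for (y, x) in child if y in m}
    if new == m:
        break
    m = new

# sg(X,Y) :- m(X), root(X), root(Y).
sg = {(x, y) for x in root if x in m for y in root}
# sg(X,Y) :- m(X), child(X,U), sg(U,V), child(Y,V).
while True:
    new = sg | {(x, y) for (x, u) in child if x in m
                for (y, v) in child if (u, v) in sg}
    if new == sg:
        break
    sg = new

answers = {y for (x, y) in sg if x == "e"}
```

The filter m contains e and its ancestors only, so sg facts are derived only for first arguments on the path from e to the root: 8 facts here instead of the 18 in the full sg relation.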
68Magic Sets
- It is always possible to do this in such a way that bottom-up becomes as efficient as top-down!
- Different proposals exist in the literature
- how to introduce the magic filters
69Optimization Techniques
- Many other techniques exist as well
- Standard relational indexing techniques
- (partly) materializing intensional relations beforehand
- Trade-off: memory ↔ query-time performance
- (See also the OLAP-part for a similar technique)
- Different representations for relations
- BDD (Stanford)
70Outline
- Syntax of the Datalog language
- Semantics of a Datalog program
- Relational algebra = Datalog with negation and without recursion
- Optimization techniques
- Conclusions
71Conclusions
- Datalog adds deductive capabilities to databases
- extensional relations
- intensional relations
- Datalog without recursion, with negation
- safety requirement
- stratification
- equal in power to relational algebra
- Closed World Assumption
72Conclusions
- Datalog without Negation
- Always a unique minimal model
- Datalog with negation and recursion
- semantics not always clear
- stratified negation
- Evaluation of datalog queries
- without recursion: RA optimization
- with recursion
- semi-naive evaluation
- magic sets
73Conclusions
- Very nice idea, but
- Deductive databases did not make it as a database paradigm
- Yet, many ideas survived
- recursion in SQL
- And others may re-surface in future.
- Increasing need for adding meta-information in
databases