Title: Peter Boncz
1 Everything you always wanted to know about Updates in MonetDB/XQuery but were afraid to ask
Peter Boncz (CWI)
- Sjoerd Mullender: update actions
- Jens Teubner: XQUF parsing
- Niels Nes: logging
- Stefan Manegold: the rest
2 Overview
- XQuery Update Facility (XQUF)
  - semantics, the update tape
- Updatable XML storage in BATs
  - maintaining order in an array without O(N) cost
- Snapshot Isolation
  - why we want it, how we got it
- Concurrency Control
  - optimistic, with abort convoys
- Durability
  - physical logging
- Conclusion, Future Challenges
3 XQuery Update Facility (XQUF)
- January 2006, first proposal
- Internal primitives
  - upd:insertBefore, upd:insertAfter, upd:insertInto, upd:insertIntoAsLast, upd:insertAttributes, upd:delete, upd:replaceValue, upd:rename
- Pending update list concept
  - upd:applyUpdates
4 Example
insert
  <item id="id">
    <location>Brazil</location>
    <quantity>200</quantity>
    <name>XML in a nutshell</name>
    <payment>Credit Card, Personal check</payment>
    <shipping>Will ship internationally</shipping>
    <incategory category="category1"/>
  </item>
as last into
  fn:doc("xmark.xml")/site/regions/samerica
6 Semantics
let $root := doc("foo.xml")
for $i in (1,2,3)
return
  (do insert <x>{$i}</x> as first into $root,
   do insert <y>{$i}</y> as first into $root)
- ?
- We need to
  - define an execution order, and
  - enforce it
7 The Update Tape
- update sequence: ( int, node, node/str, node/str )
  - fn:delete() -> (DELETE, node, nil, nil)
  - fn:insert_() -> (INSERT, tgt-node, tgt-level, expr-node)
  - fn:set-attr() -> (ATTR, node, qn, val)
  - fn:unset-attr() -> (ATTR, node, qn, nil)
  - fn:set-text() -> (TEXT, node, val, nil)
  - fn:set-pi() -> (PI, node, ins-val, arg-val)
  - fn:set-comment() -> (COMMENT, node, val, nil)
- ( element construction ), which combines updates, will enforce the correct order of the update tape
- The Pathfinder compiler automatically inserts a call to fn:update(item) on the result of all update queries
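The tape idea can be sketched in a few lines of Python. This is an illustration only, not MonetDB/XQuery code: the tuple layout follows the (kind, node, arg, arg) shape listed above, while `TapeEntry`, `build_tape`, `apply_tape` and the `INSERT_FIRST` kind are hypothetical names, and a plain list stands in for the children of a node. The sketch assumes the tape is applied strictly in the order in which entries were recorded.

```python
from typing import NamedTuple, Optional, Union

Arg = Optional[Union[int, str]]

class TapeEntry(NamedTuple):
    kind: str          # hypothetical kinds: "INSERT_FIRST", "DELETE"
    node: str          # target node (a name stands in for a node id)
    arg1: Arg = None
    arg2: Arg = None

def build_tape():
    """Mimic the query from the Semantics slide: three iterations, each
    recording an <x> insert followed by a <y> insert, 'as first into' root."""
    tape = []
    for i in (1, 2, 3):
        tape.append(TapeEntry("INSERT_FIRST", "root", f"<x>{i}</x>"))
        tape.append(TapeEntry("INSERT_FIRST", "root", f"<y>{i}</y>"))
    return tape

def apply_tape(tape, children):
    """Apply entries strictly in tape order; children maps node -> child list."""
    for entry in tape:
        if entry.kind == "INSERT_FIRST":
            children[entry.node].insert(0, entry.arg1)
        elif entry.kind == "DELETE":
            children[entry.node].remove(entry.arg1)
        else:
            raise ValueError(f"unknown tape entry kind: {entry.kind}")
    return children

doc = apply_tape(build_tape(), {"root": []})
print(doc["root"])
```

Because the order is fixed by the tape, the outcome is deterministic: under this assumed order the last recorded insert ends up first in the child list.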
8 XPath Accelerator (SIGMOD 2002)
<a>
  <b>
    <c> <d/> <e/> </c>
  </b>
  <f>
    <g/>
    <h> <i/> <j/> </h>
  </f>
</a>

node  pre  post
a     0    9
b     1    3
c     2    2
d     3    0
e     4    1
f     5    8
g     6    4
h     7    7
i     8    5
j     9    6

Node-based relational encoding of XQuery's data model
9 XML Storage Revisited
node  pre  size  level  post
a     0    9     0      9
b     1    3     1      3
c     2    2     2      2
d     3    0     3      0
e     4    0     3      1
f     5    4     1      8
g     6    0     2      4
h     7    2     2      7
i     8    0     3      5
j     9    0     3      6

post = pre + size - level
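The relation post = pre + size - level can be checked mechanically on the example document. The following Python sketch (the function name `encode` and the nested-tuple tree representation are illustrative choices, not part of MonetDB/XQuery) walks the tree once, assigns pre, size, level and post, and asserts the identity for every node.

```python
# Example document from the XPath Accelerator slide: (tag, [children])
TREE = ("a", [
    ("b", [("c", [("d", []), ("e", [])])]),
    ("f", [("g", []), ("h", [("i", []), ("j", [])])]),
])

def encode(tree):
    """Return rows (tag, pre, size, level, post) in document order."""
    rows = []
    post_counter = 0

    def visit(node, level):
        nonlocal post_counter
        tag, children = node
        pre = len(rows)
        rows.append(None)               # reserve the slot at position pre
        for child in children:
            visit(child, level + 1)
        size = len(rows) - pre - 1      # number of descendants
        post = post_counter             # rank in postorder
        post_counter += 1
        rows[pre] = (tag, pre, size, level, post)

    visit(tree, 0)
    return rows

rows = encode(TREE)
for tag, pre, size, level, post in rows:
    # post need not be stored: it follows from the other three columns
    assert post == pre + size - level
    print(tag, pre, size, level, post)
```

Since post is derivable, storing pre, size and level is enough to recover the pre/post plane.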
10 Updates: Mission Impossible?
[Figure: the example document and its pre/post table, annotated with SIZE and PRE]
INSERT SUBTREE
- size(following) = O(N) -> killer (?)
11 XML Storage Revisited
- Allow holes
- Define logical pages

Original encoding:
pre  size  level
0    9     0
1    3     1
2    2     2
3    0     3
4    0     3
5    4     1
6    0     2
7    2     2
8    0     3
9    0     3

With holes:
pre  size  level
0    11    0
1    5     1
2    -1    null
3    null  null
4    2     2
5    0     3
6    0     3
7    4     1
8    0     2
9    2     2
10   0     3
11   0     3

With node ids:
rid  size  level  nid
0    11    0      N0
1    5     1      N1
2    -1    null   null
3    0     null   null
4    2     2      N2
5    0     3      N3
6    0     3      N4
7    4     1      N5
8    0     2      N6
9    2     2      N7
10   0     3      N8
11   0     3      N9
12 XML Storage Revisited
- Allow holes
- Define logical pages

Physical table (rid order):
rid  size  level  nid
0    11    0      N0
1    5     1      N1
2    -1    null   null
3    0     null   null
4    0     2      N6
5    2     2      N7
6    0     3      N8
7    0     3      N9
8    2     2      N2
9    0     3      N3
10   0     3      N4
11   4     1      N5

Page map:
logical page  physical page
0             0
1             2
2             1
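How a page map keeps document order without moving tuples can be sketched as follows. This is a minimal Python illustration, not the MonetDB implementation: the page size of 4 tuples is inferred from the 12 tuples and 3 pages on the slide, the function names are hypothetical, and the `(0, None, None)` filler for a fresh page of holes is an assumption.

```python
PAGE = 4  # tuples per page (assumed: 12 tuples on the slide form 3 pages)

# Physical rid table from the slide: position = rid, value = (size, level, nid)
RID_TABLE = [
    (11, 0, "N0"), (5, 1, "N1"), (-1, None, None), (0, None, None),
    (0, 2, "N6"), (2, 2, "N7"), (0, 3, "N8"), (0, 3, "N9"),
    (2, 2, "N2"), (0, 3, "N3"), (0, 3, "N4"), (4, 1, "N5"),
]

# Page map from the slide: index = logical page, value = physical page
page_map = [0, 2, 1]

def pre_to_rid(pre, page_map):
    """Logical position (document order) -> physical position."""
    return page_map[pre // PAGE] * PAGE + pre % PAGE

def rid_to_pre(rid, page_map):
    """Physical position -> logical position (document order)."""
    return page_map.index(rid // PAGE) * PAGE + rid % PAGE

def document_order(rid_table, page_map):
    return [rid_table[pre_to_rid(pre, page_map)]
            for pre in range(len(page_map) * PAGE)]

def insert_page(rid_table, page_map, logical_pos):
    """Append a fresh page of holes and splice it into the logical order.
    Existing rid tuples stay where they are; only the page map changes."""
    new_physical = len(rid_table) // PAGE
    rid_table.extend([(0, None, None)] * PAGE)   # assumed hole filler
    page_map.insert(logical_pos, new_physical)

print([nid for _, _, nid in document_order(RID_TABLE, page_map)])
```

Inserting a page touches the page map, whose length is the number of pages, rather than every following tuple. The sketch does not cover the size updates that ancestors of the insert point still need.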
13 XML Storage Revisited
- Update-friendly
  - rid-table is append-only
  - rid-tuples may be unused
  - rid: autoincrement column
- MonetDB
  - rid not stored but computed (virtual oid)
  - allows positional lookup/join
  - not stored -> no need to update it either
14 XML Storage Revisited
View in document order:
pre  size  level  nid
0    11    0      N0
1    5     1      N1
2    -1    null   null
3    0     null   null
4    2     2      N2
5    0     3      N3
6    0     3      N4
7    4     1      N5
8    0     2      N6
9    2     2      N7
10   0     3      N8
11   0     3      N9

- Updatable document collection
  - pf:add-doc(URI, docname, perc>0)
  - pf:add-doc(URI, docname, collname, perc>0)
  - pre = nid.leftfetchjoin(nid_rid).swizzle(map_pid)
- Read-only document collection
  - pf:add-doc(URI, docname, 0)
  - pf:add-doc(URI, docname, collname, 0)
  - NID = RID = PRE
  - pre = nid.leftfetchjoin(nid_rid).swizzle(map_pid) is FREE!!
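The two-step translation from a stable node id to a position in document order can be mimicked with plain lists. This Python sketch is an analogy, not MIL: `nid_rid` and `map_pid` reuse the names from the slide, their contents are read off the example tables, the page size of 4 is an assumption, and the `read_only` shortcut encodes the reading that all three numberings coincide in a read-only collection.

```python
PAGE = 4                                      # assumed tuples per page
map_pid = [0, 2, 1]                           # logical page -> physical page
nid_rid = [0, 1, 8, 9, 10, 11, 4, 5, 6, 7]    # index = nid number, value = rid

def swizzle(rid, map_pid):
    """Translate a physical rid into a logical pre via the page map."""
    return map_pid.index(rid // PAGE) * PAGE + rid % PAGE

def nid_to_pre(nid, nid_rid, map_pid, read_only=False):
    if read_only:
        return nid                  # NID = RID = PRE, so no lookup is needed
    rid = nid_rid[nid]              # positional lookup, like a fetch join
    return swizzle(rid, map_pid)

for nid in range(len(nid_rid)):
    print(f"N{nid} -> pre {nid_to_pre(nid, nid_rid, map_pid)}")
```

The result reproduces the nid column of the document-order view: N2 sits at pre 4 although it is stored at rid 8.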
17 Snapshot Isolation
- Versus 2-phase locking (2PL): full serializability
- Why not 2PL for XML?
  - lock semantics much more complex than in relational case (order matters!!)
  - node-level locking in staircase join?? (now 10 cycles/node)
- Why Snapshot Isolation?
  - great for read-queries, great for ll_scj (runs unmodified)
  - quite strong. Better than repeatable read. Oracle/Postgres do it.
- Problem with Snapshot Isolation
  - in XQuery, it is unknown at compile-time what to snapshot (fn:doc(..))
18 Snapshot Isolation
[Figure: Read Query 1, Read Query 2 and Update Query, each with its own view of the same table]
rid  size  level  nid
0    11    0      N0
1    5     1      N1
2    -1    null   null
3    0     null   null
4    0     2      N6
5    2     2      N7
6    0     3      N8
7    0     3      N9

- Isolation by shadow paging (copy-on-write mmap)
  - rid/pre delete/insert, attr-replace
  - touch one byte per physical page: *addr = *addr
  - MMU traps, OS replaces page by a copy
  - we would like to replace the master copy once, not all client copies
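The copy-on-write behaviour can be observed from Python's standard `mmap` module, whose `ACCESS_COPY` mode requests a private mapping. The sketch below is an illustration under that assumption and says nothing about how MonetDB sets up its mappings; the temporary file and the helper name `snapshot_demo` are made up for the example. Note that with a private mapping, pages that were never written may still reflect later changes to the file (POSIX leaves this unspecified), which is why touching a byte per page matters for isolation.

```python
import mmap
import os
import tempfile

def snapshot_demo():
    # Master copy: a small file standing in for a memory-mapped table.
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"MASTER--" * 512)          # 4096 bytes
        os.fsync(fd)
        # ACCESS_COPY: writes land in pages private to this mapping and are
        # never written back to the file.
        private = mmap.mmap(fd, 0, access=mmap.ACCESS_COPY)
        # Writing a byte back to itself is enough to force a private copy
        # of the page it lives on.
        private[0:1] = private[0:1]
        private[0:6] = b"UPDATE"
        private_view = bytes(private[0:8])
        private.close()
        os.lseek(fd, 0, os.SEEK_SET)
        master_view = os.read(fd, 8)
        return private_view, master_view
    finally:
        os.close(fd)
        os.remove(path)

private_view, master_view = snapshot_demo()
print(private_view, master_view)
```

The private view shows the update while the master file is unchanged, which is the property the update query relies on before commit.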
19 Snapshot Isolation
[Figure, step "Isolate-page": Read Query 1, Read Query 2 and Update Query over the rid/size/level/nid table]
20 Snapshot Isolation
[Figure, step "Isolate-page": Read Query 1, Read Query 2 and Update Query over the rid/size/level/nid table]
21 Snapshot Isolation
[Figure, step "Master-update": Read Query 1, Read Query 2 and Update Query over the rid/size/level/nid table]
22 Durability
- Masters become dirty
  - no time to flush them during query
  - log all changes to a WAL
  - log all tuples that changed entire pages
- Recovery
  - after a crash, we do not know whether dirty pages got saved
  - solution: overwrite tables with values from the WAL
- Checkpointing Thread
  - every 5 minutes, if many changes occurred, checkpoint
  - memory-mapped BATs are sync()-ed -> only dirty pages get written
  - checkpoint locks collection, halts query processing
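The recovery rule, overwriting table values with what the WAL says, works because such a replay is idempotent. The following Python sketch shows the idea with a JSON-lines log file; the record format, the file name and the functions `log_changes` and `recover` are hypothetical and much simpler than a real logger.

```python
import json
import os
import tempfile

def log_changes(wal_path, changes):
    """Append changed tuples to the WAL and force them to disk before the
    master table is touched."""
    with open(wal_path, "a", encoding="utf-8") as wal:
        for rid, row in changes:
            wal.write(json.dumps({"rid": rid, "row": row}) + "\n")
        wal.flush()
        os.fsync(wal.fileno())

def recover(table, wal_path):
    """Overwrite table slots with the logged values. Replaying a record that
    already reached disk changes nothing, so we need not know which dirty
    pages were saved before the crash."""
    with open(wal_path, encoding="utf-8") as wal:
        for line in wal:
            rec = json.loads(line)
            rid = rec["rid"]
            while len(table) <= rid:      # table is append-only: grow it
                table.append(None)
            table[rid] = rec["row"]
    return table

wal_path = os.path.join(tempfile.mkdtemp(), "wal.log")
log_changes(wal_path, [(1, [6, 1, "N1"]), (2, [0, 2, "N10"])])
restored = recover([[11, 0, "N0"], [5, 1, "N1"]], wal_path)
print(restored)
```

Running `recover` a second time on the restored table yields the same table, so a crash during recovery is harmless in this model.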
24 The Update Sequence
- Execute Query
  - build update tape
  - queries get isolated copies of a document (VM copy-on-write mmap)
- Prepare Intensional Updates
  - execute update tape
  - does not modify masters (except append-only tables)
- Commit Phase (locked phase per doc-collection)
  - precommit
  - detect conflicts (not the size-ancestors)
  - write WAL (globally locked)
  - read master-size-ancestors, use delta, log result
  - update master tables
  - isolate first! Only then update masters.
  - update index structures
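One reading of the commit-phase bullets is that size changes of ancestors are carried as deltas and applied to the current master values, so two transactions that grow different subtrees do not conflict on their common ancestors. The Python sketch below illustrates that reading only; `commit_sizes`, the dictionary layout and the delta values are hypothetical.

```python
def commit_sizes(master_size, ancestor_deltas, wal):
    """Inside the locked commit phase: read the current master size of each
    affected ancestor, add this transaction's delta, log the resulting
    absolute value, then update the master."""
    for rid, delta in sorted(ancestor_deltas.items()):
        new_size = master_size[rid] + delta
        wal.append(("size", rid, new_size))   # log the result, not the delta
        master_size[rid] = new_size

# rid -> size for the root, N1 and N5 of the example table
master = {0: 11, 1: 5, 11: 4}
wal = []

# Transaction A added two nodes below N1; transaction B added one below N5.
# Both started from the same snapshot and both change the root's size.
commit_sizes(master, {0: +2, 1: +2}, wal)
commit_sizes(master, {0: +1, 11: +1}, wal)
print(master, wal)
```

Had each transaction written back the absolute root size from its own snapshot, the second commit would have overwritten the first; with deltas both changes survive.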
25 Many more Issues Solved
- Indexing and Updates
  - Runtime QN -> NID mapping, with hash table
    - read-only: not a hash, but keep sorted, persistent
    - keep INS/DEL deltas to commit without changing the hash table
  - Runtime NID -> ATTR hash table
    - isolation loses you MonetDB dynamic hash table reuse
    - share an old copy, exploit append-mostly
- Concurrency: Updates <-> Checkpoint, Shredding <-> Query, Shredding <-> Updates
- Conflicting Updates
  - detect conflicting queries
    - look at RID page numbers and attr-IDs
  - reacting to conflicts
    - abort query, automatic restart
    - run CONVOY of 5 next update queries serially
- ACID properties on the Meta Level
  - Shredding a new doc into a collection <-> Query
  - Shredding a new doc into a collection <-> Update
  - Using a collection <-> Deleting/adding documents
  - Meta Querying <-> Deleting/adding documents
- Allocating New Pages and NIDS
  - Offload shredding interference with freelist
  - Unlocked access to private pages
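Optimistic conflict detection on RID page numbers and attribute ids amounts to intersecting write footprints. The Python sketch below illustrates that idea; the class `Txn`, the page size of 4 tuples and the return strings are hypothetical, size-only changes to ancestors are assumed to be excluded from `written_rids`, and the convoy of serially executed queries is not modelled.

```python
from dataclasses import dataclass, field

PAGE = 4  # assumed tuples per page

@dataclass
class Txn:
    name: str
    written_rids: set = field(default_factory=set)    # excludes size-only ancestor updates
    written_attrs: set = field(default_factory=set)

    def footprint(self):
        """Pages and attribute ids this transaction modified."""
        pages = {rid // PAGE for rid in self.written_rids}
        return pages, set(self.written_attrs)

def conflicts(txn, committed_since_snapshot):
    pages, attrs = txn.footprint()
    for other in committed_since_snapshot:
        o_pages, o_attrs = other.footprint()
        if pages & o_pages or attrs & o_attrs:
            return True
    return False

def try_commit(txn, committed_since_snapshot):
    """Commit if the footprint is disjoint from everything committed since
    this transaction took its snapshot; otherwise ask for a restart."""
    if conflicts(txn, committed_since_snapshot):
        return "abort-and-restart"
    committed_since_snapshot.append(txn)
    return "committed"

committed = []
print(try_commit(Txn("A", {4, 5}), committed))            # page 1
print(try_commit(Txn("B", {6}), committed))               # page 1 again
print(try_commit(Txn("C", {9}, {"a17"}), committed))      # page 2
print(try_commit(Txn("D", {0}, {"a17"}), committed))      # same attribute id
```

Page granularity is coarser than node granularity, so this check can report conflicts between updates that touch different nodes on the same page.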
26 Snapshot Isolation
- 2PL (): 375 transactions / 5 minutes = 1.2 transactions/sec
27 Conclusions
- It works! Reasonable/good performance!
  - transaction mgmt as a module extension outside a kernel works
  - identified VM primitives that databases really need
- Future work
  - Test on XML update benchmark TPoX (DB2: 700 trans/second)
  - Packed Memory Arrays: alternative for page remapping?
    - page remapping is technically O(N)
- Engineering
  - support for value-indexing (does PF support it already?)
  - asynchronous WAL writing to boost throughput
  - port MIL to C primitives; port C primitives to Monet5