Title: Dan Suciu
1From Searching Text to Querying XML Streams
- Dan Suciu
- www.cs.washington.edu/homes/suciu
2About Me
- Born 1957, Romania
- BS Bucharest, PhD University of Pennsylvania
- Now University of Washington (Seattle)
- My work is on semistructured data
- Book: Data on the Web: From Relations to Semistructured Data and XML
- Past/present projects
- XML-QL: precursor of XQuery
- XMill: the XML compressor
- XML Toolkit
3Motivation
- Text databases
- Studied over the past 15 years
- Traditional client/server model
- Struggled with lack of standard text syntax
- Recently, a new standard: XML
- Traditional client/server model in today's DBMSs
- New applications: stream processing
- This talk: processing streaming XML data
- My motivation: work on the XML Toolkit project
4Outline
- Background
- The XML stream processing problem
- Basic XML processing with automata
- Adapting automata to XML
- Stream indexes
- Conclusions
5Background: Relational Databases
- Structured, stored in tables
- Schema separate from data
- Queries precise, refer to schema and data (SQL)
AUTHOR
AID Name Country
44 Abiteboul FR
06 Buneman UK
62 Hull USA
12 Suciu USA
29 Vianu USA
BOOKS
ISBN Title Year Publisher
0201537710 Foundations of Databases 1995 AW
155860622X Data on the Web 1999 MK
WROTE
ISBN AID
0201537710 44
0201537710 62
0201537710 29
155860622X 44
155860622X 06
155860622X 12
Hard to publish, easy to query precisely
6Background: Text Databases
- Unstructured, stored in documents
- No schema, only data
- Queries imprecise, refer to data only (keywords)
Foundations of Databases, Abiteboul (FR), Hull (USA), Vianu (USA), Addison Wesley, 1995
Data on the Web, Abiteboul (FR), Buneman (UK), Suciu (USA), Morgan Kaufmann, 1999
Easy to publish, hard to query precisely
7Background: XML Data
- Semistructured
- Schema and data are together: self-describing
- Queries precise, refer to schema and data (SQL)
<bib>
  <book>
    <title> Foundations </title>
    <author> <name> Abiteboul </name> <country> FR </country> </author>
    <author> <name> Hull </name> <country> USA </country> </author>
    <author> <name> Vianu </name> <country> USA </country> </author>
    <publisher> Addison Wesley </publisher>
    <year> 1995 </year>
  </book>
</bib>
XML: easier to publish, easy to query precisely
8Background: XML Data
Data model: tree
[Figure: tree rooted at bib, with paper and book children; their children include title, author, journal, and publisher; author nodes have name and country children; leaf values include Data on the Web, Addison Wesley, Buneman, UK, Abiteboul, FR]
9Background: XML Data
- Querying with XPath (and XQuery)
- This talk: XPath queries restricted to
- tag
- /
- //
- *
- [ ]
- path = constant
10Background: XPath in One Slide
tag, /
/bib/book/author/name
//, *
Navigate partially known structure
/bib/book//name//zip
Conjunctive queries à la SQL
/bib/book[author/name=Abiteboul]
/bib/book/year1995 and authornameAbiteboul
and countryFR
This is precisely the region algebra; e.g., use proximal nodes [Navarro & Baeza-Yates 97]
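To make the XPath fragment concrete, here is a small sketch using Python's standard xml.etree.ElementTree module. It is an illustration only, not the system discussed in this talk: ElementTree implements just a subset of XPath and evaluates paths relative to the root element, so /bib/book/... is written ./book/... below. The sample document is a shortened version of the bibliography example.

```python
import xml.etree.ElementTree as ET

# Shortened bibliography example (assumed data, modeled on the slides).
BIB = (
    "<bib><book><title>Foundations of Databases</title>"
    "<author><name>Abiteboul</name><country>FR</country></author>"
    "<author><name>Hull</name><country>USA</country></author>"
    "<publisher>Addison Wesley</publisher><year>1995</year></book></bib>"
)

root = ET.fromstring(BIB)  # root is the <bib> element

# tag, / : a fully specified path
names = [e.text for e in root.findall("./book/author/name")]

# // : descendants at any depth
countries = [e.text for e in root.findall(".//country")]

# * : any single tag at that step
under_any = [e.tag for e in root.findall("./book/*/name")]

# [ ] with path = constant : a conjunctive-style filter
french = [a.find("name").text
          for a in root.findall("./book/author[country='FR']")]
books_1995 = root.findall("./book[year='1995']")
```

Here names lists both author names, and french keeps only the author whose country element equals FR.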
11Outline
- Background
- The XML stream processing problem
- Basic XML processing with automata
- Adapting automata to XML
- Stream indexes
- Conclusions
12Main Application: XML Packet Routing
- Selective Dissemination of Information [Altinel & Franklin 00], [Chan et al. 02]
- XML content routing [Snoeren et al. 01]
- SOAP message routing in application servers
13XML Packet Routing
<doc> <tag> value </tag> </doc>
<doc> <tag> value </tag> </doc>
<doc> <tag> value </tag> </doc>
14Output XML Streams
Input XML Stream
<bib> <book> ... </bib>
<bib> <book> ... </bib>
15The XML Stream Processing Problem
Given:
- A set of XPath expressions
- An incoming stream of XML documents
Decide: for each document, which expressions it matches
Hard:
- Large number of XPath expressions, e.g. 10^3 to 10^6
- Streaming XML data, high throughput, e.g. 5 MB/s
Easy:
- Shallow XML data, e.g. depth 20
- Short XPath expressions
16The Approaches
- Basic techniques
- NFA plus optimizations
- XFilter/YFilter [Altinel & Franklin 00]
- XTrie [Chan et al. 02]
- DFA
- XML Toolkit
- Beyond the obvious
- Stream indexes (XML Toolkit)
- Stream views
17Outline
- Background
- The XML stream processing problem
- Basic XML processing with automata
- Adapting automata to XML
- Stream indexes
- Conclusions
18From XPath to NFA
/catalog/productcategory"tools"/price
200/quantity //price
Extra processing needed to combine branches (not
in this talk)
19Basic NFA Evaluation
[Figure: a large workload of XPath expressions over /bib/book (with conditions on publisher, category, title, address, zip, and so on) is compiled into a single NFA. While the XML stream <bib> <book> ... </bib> is read, the STACK holds sets of active NFA states, e.g. {3, 66, 102, 4534, ...}, {2, 3, 543, 43, 254}, {1, 55, 99, ...}]
20Basic NFA Evaluation
- Properties
- Space: linear
- Throughput: decreases linearly
- Systems
- XFilter [Altinel & Franklin 00], YFilter
- XTrie [Chan et al. 02]
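The stack-based NFA evaluation can be sketched as follows. This is a minimal illustration under stated assumptions, not the XFilter, YFilter, or XTrie implementation: it handles only linear expressions built from tag, /, // and *, and the names compile_xpath, nfa_step, and match_stream are hypothetical. An NFA state is a pair (query index, number of steps matched); the stack holds one set of active states per open element.

```python
import io
import re
import xml.etree.ElementTree as ET

def compile_xpath(path):
    """Split a linear XPath such as '/a//b/*' into (axis, tag) steps."""
    return re.findall(r"(//|/)([^/]+)", path)

def nfa_step(queries, states, tag):
    """Advance every active NFA state on one start-tag."""
    nxt = set()
    for q, i in states:
        steps = queries[q]
        if i == len(steps):
            continue                      # accepting state: no outgoing edges
        axis, name = steps[i]
        if axis == "//":
            nxt.add((q, i))               # self-loop: skip any element
        if name in (tag, "*"):
            nxt.add((q, i + 1))           # consume this tag
    return frozenset(nxt)

def match_stream(xml_bytes, xpaths):
    """Return the subset of xpaths matched by the document."""
    queries = [compile_xpath(p) for p in xpaths]
    stack = [frozenset((q, 0) for q in range(len(queries)))]
    matched = set()
    parser = ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end"))
    for event, elem in parser:
        if event == "start":
            states = nfa_step(queries, stack[-1], elem.tag)
            stack.append(states)
            matched.update(xpaths[q] for q, i in states
                           if i == len(queries[q]))
        else:
            stack.pop()                   # leave the element: restore parent set
    return matched

doc = (b"<bib><book><author><name>Hull</name></author>"
       b"<year>1995</year></book></bib>")
workload = ["/bib/book/author/name", "//name", "/bib/paper",
            "/bib/*/year", "//author//zip"]
result = match_stream(doc, workload)
```

Each start-tag touches every active state, which is why the throughput of this approach degrades as the number of expressions grows.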
21Basic DFA Evaluation
[Figure: the same workload of XPath expressions over /bib/book is compiled into DFAs. While the XML stream <bib> <book> ... </bib> is read, the STACK holds a single DFA state per open element, e.g. 399, 552, 1]
22Basic DFA Evaluation
- Properties
- Throughput: constant!
- Space: GOOD QUESTION
- System
- XML Toolkit [University of Washington], http://xmltk.sourceforge.net
23Outline
- Background
- The XML stream processing problem
- Basic XML processing with automata
- Adapting automata to XML
- Stream indexes
- Conclusions
24The Size of the DFA
//a/b/a/a/b
[Figure: NFA with states 0 to 5 in a chain, with edges labeled a, b, a, a, b]
DFA for //P has 1 + |P| states [KMP]
25The Size of the DFA
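//a/*/*/*/b
[Figure: NFA with states 0 to 5 in a chain; the edge from 0 to 1 is labeled a and the edge from 4 to 5 is labeled b]
Size of DFA: exponential in the number of *'s (not a real concern)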
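The two single-expression cases can be checked with a small eager subset construction. This is a sketch under simplifying assumptions: it treats root-to-node paths as plain words over a two-tag alphabet {a, b}, supports only tag, /, // and *, and uses the hypothetical helper names compile_xpath, step, and eager_dfa_size.

```python
import re

def compile_xpath(path):
    """Split a linear XPath such as '//a/*/b' into (axis, tag) steps."""
    return re.findall(r"(//|/)([^/]+)", path)

def step(steps, states, tag):
    """NFA transition: state i means 'the first i steps are matched'."""
    nxt = set()
    for i in states:
        if i == len(steps):
            continue                  # accepting state has no outgoing edges
        axis, name = steps[i]
        if axis == "//":
            nxt.add(i)                # self-loop of the descendant axis
        if name in (tag, "*"):
            nxt.add(i + 1)
    return frozenset(nxt)

def eager_dfa_size(xpath, alphabet):
    """Count the DFA states reachable by the full subset construction."""
    steps = compile_xpath(xpath)
    start = frozenset([0])
    seen, todo = {start}, [start]
    while todo:
        current = todo.pop()
        for tag in alphabet:
            target = step(steps, current, tag)
            if target not in seen:
                seen.add(target)
                todo.append(target)
    return len(seen)

kmp_like = eager_dfa_size("//a/b/a/a/b", "ab")    # 6 = 1 + |P|
with_stars = eager_dfa_size("//a/*/*/*/b", "ab")  # 24: grows with the wildcards
```

With the wildcards, the DFA has to remember which of the last few tags were a, which is the source of the blow-up.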
26The Size of the DFA
- Theorem [GMOS 02]: The number of states in the DFA for one linear XPath expression P is at most k + |P| k s^m
- k = number of //
- s = size of the alphabet (number of tags)
- m = max number of * between two consecutive //
27Size of DFA Multiple Expressions
//section/table/footnote
//table/footnote
//section/figure/footnote
. . . . .
//abstract/footnote/table
The DFA is a trie: it has a linear number of states [Aho-Corasick]
28Size of DFA Multiple Expressions
//section//footnote
//table//footnote
//figure//footnote
. . . . .
//abstract//footnote
100 expressions
2^100 states !!
There is a theorem here too, but it's not useful
29Solution: Compute the DFA Lazily
- Also used in text searching
- But will it work for 10^6 XPath expressions?
- YES!
- For XPath it is provably effective, for two reasons
- XML data is not very deep
- The nesting structure in XML data tends to be predictable
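A lazy DFA can be sketched by memoizing the NFA transition: a DFA state is a set of NFA states, and a transition is computed only the first time the data actually requires it. The class below is an illustrative sketch, not the XML Toolkit code; LazyDFA, compile_xpath, and the sample workload are assumptions, and only linear expressions with tag, /, // and * are handled.

```python
import io
import re
import xml.etree.ElementTree as ET

def compile_xpath(path):
    """Split a linear XPath such as '//a//b' into (axis, tag) steps."""
    return re.findall(r"(//|/)([^/]+)", path)

class LazyDFA:
    def __init__(self, xpaths):
        self.queries = [compile_xpath(p) for p in xpaths]
        self.start = frozenset((q, 0) for q in range(len(xpaths)))
        self.table = {}               # (dfa_state, tag) -> dfa_state, on demand
        self.states = {self.start}    # DFA states materialized so far

    def next(self, state, tag):
        key = (state, tag)
        if key not in self.table:     # first visit: run the NFA step once
            nxt = set()
            for q, i in state:
                steps = self.queries[q]
                if i == len(steps):
                    continue
                axis, name = steps[i]
                if axis == "//":
                    nxt.add((q, i))
                if name in (tag, "*"):
                    nxt.add((q, i + 1))
            self.table[key] = frozenset(nxt)
            self.states.add(self.table[key])
        return self.table[key]

    def run(self, xml_bytes):
        """Return the indexes of the expressions matched by the document."""
        stack, hits = [self.start], set()
        events = ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end"))
        for event, elem in events:
            if event == "start":
                state = self.next(stack[-1], elem.tag)
                stack.append(state)
                hits.update(q for q, i in state if i == len(self.queries[q]))
            else:
                stack.pop()
        return hits

tags = ["section", "table", "figure", "abstract"]
dfa = LazyDFA([f"//{t}//footnote" for t in tags])
doc = (b"<document><section><table><footnote/></table>"
       b"<figure><footnote/></figure></section></document>")
hits = dfa.run(doc)
```

On this document only six DFA states are materialized, and a second run over similar data adds no new transitions, which corresponds to a warm-up phase followed by table lookups only.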
30Lazy DFA and Simple DTDs
- Document Type Definition (DTD)
- Part of the XML standard
- Will be replaced by XML Schema
- Example DTD
<!ELEMENT document (section)>
<!ELEMENT section ((section|abstract|table|figure)*)>
<!ELEMENT figure (table?,footnote)>
. . . . .
Definition: A DTD is simple if all cycles are self-loops
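The definition can be tested mechanically: view the DTD as a graph from each element to the elements its content model mentions, ignore self-loops, and check that what remains is acyclic. The sketch below assumes the DTD has already been reduced to such a dictionary (parsing the DTD syntax is out of scope), and is_simple is a hypothetical helper name.

```python
def is_simple(dtd):
    """True if every cycle in the DTD graph is a self-loop.

    dtd maps each element name to the set of element names that may
    appear directly inside it.
    """
    WHITE, GREY, BLACK = 0, 1, 2
    color = {}

    def visit(u):
        color[u] = GREY
        for v in dtd.get(u, ()):
            if v == u:
                continue                  # self-loops are allowed
            c = color.get(v, WHITE)
            if c == GREY:
                return False              # back edge: a longer cycle exists
            if c == WHITE and not visit(v):
                return False
        color[u] = BLACK
        return True

    return all(color.get(u, WHITE) != WHITE or visit(u) for u in dtd)

# Data-like DTD in the spirit of the example: only section refers to itself.
simple_dtd = {
    "document": {"section"},
    "section": {"section", "abstract", "table", "figure"},
    "figure": {"table", "footnote"},
    "table": set(), "abstract": set(), "footnote": set(),
}

# Text-like DTD: section and table contain each other.
mixed_dtd = {
    "document": {"section"},
    "section": {"section", "table"},
    "table": {"section", "table"},
}
```

The first graph passes the check, while the second fails because section and table form a cycle of length two.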
31Lazy DFA and Simple DTDs
Simple DTD
[Figure: DTD graph over the elements document, section, abstract, figure, table, and footnote]
32Lazy DFA and Simple DTDs
- Theorem [GMOS 02]: If the XML data has a simple DTD, then the lazy DFA has at most 1 + D(1+n)^d states
- n = max depth of XPath expressions
- D = size of the unfolded DTD
- d = max depth of self-loops in the DTD
Fact of life: data-like XML has simple DTDs
33Lazy DFA and Data Guides
- Non-simple DTDs are useless for the lazy DFA
- Everything may contain everything
<!ELEMENT document (section)>
<!ELEMENT section ((section|table|figure|abstract|footnote)*)>
<!ELEMENT table ((section|table|figure|abstract|footnote)*)>
<!ELEMENT figure ((section|table|figure|abstract|footnote)*)>
<!ELEMENT abstract ((section|table|figure|abstract|footnote)*)>
Fact of life: text-like XML has non-simple DTDs
34Lazy DFA and Data Guides
- Definition [Goldman & Widom 97]
- The data guide for an XML data instance is the trie of all its root-to-leaf paths
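A data guide can be computed in one pass over the document by collecting the distinct label paths from the root; each distinct path is one node of the trie. The sketch below uses xml.etree.ElementTree on a small made-up document, and data_guide is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

def data_guide(root):
    """Return the set of distinct root-to-node tag paths (the trie's nodes)."""
    paths = set()

    def walk(elem, prefix):
        path = prefix + (elem.tag,)
        paths.add(path)               # repeated paths collapse into one node
        for child in elem:
            walk(child, path)

    walk(root, ())
    return paths

xml_text = (
    "<document>"
    "<section><table/><figure/></section>"
    "<section><table/><section><table/></section></section>"
    "</document>"
)
root = ET.fromstring(xml_text)
guide = data_guide(root)
num_elements = sum(1 for _ in root.iter())
```

The document has 8 elements but its data guide has only 6 nodes, because repeated structures such as document/section/table are shared.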
35Lazy DFA and Data Guides
[Figure: an XML data tree (a document with nested section, table, and figure elements) shown next to its data guide, in which repeated paths are merged]
Fact of life: real XML data has a small data guide [Liefke & Suciu 00]
36Lazy DFA and Simple DTDs
- Theorem [GMOS 02]: If the XML data has a data guide with G nodes, then the number of states in the lazy DFA is at most 1 + G
- G = number of nodes in the data guide
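37Number of Lazy DFA States - SYNTHETIC Data
[Chart: number of lazy DFA states (log scale, 1 to 100000) for 10^3, 10^4, and 10^5 XPath expressions on the simple, prov, ebBPSS, protein, nasa, and treebank DTDs. Annotations: 4000 states; 3840000 states, G 350000]
38Number of Lazy DFA States - REAL Data
[Chart: number of lazy DFA states (log scale, 1 to 100000) for 10^3, 10^4, and 10^5 XPath expressions on the protein, nasa, and treebank data sets. Annotation: 95 states]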
39Number of States in the lazy DFA
                     Real XML data                Synthetic XML data
Data-style DTD       Theorem: lazy DFA is small   Theorem: lazy DFA is small
Document-style DTD   Theorem: lazy DFA is small   Fact: lazy DFA is HUGE
40Lazy DFA in the XML Toolkit
- The XML Toolkit uses a lazy DFA to process XML streams
- Warm-up phase, followed by very high throughput
41Throughput for 10^3, 10^4, 10^5, 10^6 XPath expressions
[Chart: throughput (log scale, 0.0001 MB/s to 100 MB/s) versus total input size (5 MB to 25 MB), with prob(*) = 10% and prob(//) = 10%. Parser: 10 MB/s. Lazy DFA: 5.4 MB/s]
42Summary of Lazy DFA and XML
- Linear XPath expressions
- Process with one lazy DFA
- XPath expressions with branches
- Process with Deterministic Pushdown Automata (ongoing work at the University of Washington)
43Outline
- Background
- The XML stream processing problem
- Basic XML processing with automata
- Adapting automata to XML
- Stream indexes
- Conclusions
44Stream IndeX (SIX)
- Main observation
- Parsing is major bottleneck
- Definition: The SIX of an XML document is a binary table of (begin, end) offsets
- Idea
- Use SIX to reduce amount of parsing
- Works well with (lazy) DFA
- Implemented in the XML toolkit
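To illustrate the idea, the sketch below builds a table of (tag, begin, end) offsets for a small document with a regular-expression tag scanner. It is a simplification rather than the XML Toolkit's binary format: it assumes well-formed input without comments, CDATA sections, or '>' inside attribute values, it takes end to be the offset just past the closing tag, and build_six is a hypothetical helper name.

```python
import re

# Matches opening, closing, and self-closing tags (simplified XML syntax).
TAG = re.compile(r"<(/?)([A-Za-z_][\w.-]*)[^<>]*?(/?)>")

def build_six(xml_text):
    """Return [(tag, begin, end)] in document order."""
    six, open_slots = [], []
    for m in TAG.finditer(xml_text):
        closing, tag, self_closing = m.groups()
        if closing:
            idx = open_slots.pop()            # matching open element
            name, begin, _ = six[idx]
            six[idx] = (name, begin, m.end())
        elif self_closing:
            six.append((tag, m.start(), m.end()))
        else:
            six.append((tag, m.start(), None))  # end is filled in on close
            open_slots.append(len(six) - 1)
    return six

text = "<bib><book><author>Hull</author></book></bib>"
six = build_six(text)

# A consumer that knows it can ignore <book> may resume right after it:
_, book_begin, book_end = six[1]
rest_after_book = text[book_end:]
```

When the automaton reaches a state from which no expression can match inside the current element, the end offset lets it jump over the whole subtree without parsing it.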
45Stream IndeX (SIX)
SIX
            beginOffset   endOffset
bib         0             1490124
book        3             409023
publisher   12            423
author      426           879
author      978           . . .
. . .
XML
<bib><book>
  <publisher> Addison-Wesley </publisher>
  <author> Serge Abiteboul </author>
  <author> <first-name> Rick </first-name> <last-name> Hull </last-name> </author>
  <author> Victor Vianu </author>
  <title> Foundations of Databases </title>
  <year> 1995 </year>
</book><book>
  <publisher> Freeman </publisher>
  <author> Jeffrey D. Ullman </author>
  <title> Principles of Database and Knowledge Base Systems </title>
  <year> 1998 </year>
</book> </bib>
46Stream IndeX (SIX)
[Figure: the SIX stream (e.g. carried with DIME) of (begin, end) offset pairs such as (0, 205), (30, 66), (72, 188), (90, 110), (95, 98), sent alongside the XML stream of <bib> <book> ... </bib> documents]
The SIX stream is about 6% of the data stream, and can be made MUCH smaller
48Stream Views
- Idea
- Given a workload of XPath expressions with branches
- Precompute some views for each document to speed up the entire workload
- Views -> header: has to be small
49Stream Views
Servers
Queries
/ab88c99 /ac99e00
/ab11c22e23
/ab33d44 e55 /ac66f77 /af34g56
Short circuit evaluation !
50Stream Views
- Views -> header (binary offsets)
[Figure: each XML document <bib> <book> ... </bib> in the stream is preceded by a header of binary offsets, e.g. 0, 30, 72]
100x speedup on a hit
Choosing the views: difficult problem
51Outline
- Background
- The XML stream processing problem
- Basic XML processing with automata
- Adapting automata to XML
- Stream indexes
- Conclusions
52Summary
- XML stream processing problem
- Fixed XPath queries, transient XML data
- Large number of queries
- High data throughput
- Relationship to text processing techniques
- Still regular expressions
- Still automata and lazy DFAs
- Different scale
- Techniques
- Lazy DFAs work for reasons specific to XML
- Stream indexes and views: ongoing research
53Future Work
- Handle branches in XPath expressions
- View selection for a given workload
- Network configuration
54Thank you !
- Links
- www.cs.washington.edu/homes/suciu
- www.cs.washington.edu/homes/suciu/XMLTK
- xmltk.sourceforge.net