1
From Searching Text to Querying XML Streams

Dan Suciu
www.cs.washington.edu/homes/suciu

2
About Me

Born 1957, Romania
BS Bucharest, PhD University of Pennsylvania
Now University of Washington (Seattle)
My work is on semistructured data
Book: Data on the Web: From Relations to Semistructured Data and XML
Past/present projects
XML-QL: precursor of XQuery
XMill: the XML compressor
XML Toolkit

3
Motivation

Text databases
Studied over the past 15 years
Traditional client/server model
Struggled with lack of standard text syntax
Recently, a new standard: XML
Traditional client/server in today's DBMSs
New applications: stream processing
This talk: processing streaming XML data
My motivation: work on the XML Toolkit project

4
Outline

Background
The XML stream processing problem
Basic XML processing with automata
Adapting automata to XML
Stream indexes
Conclusions

5
Background: Relational Databases

Structured, stored in tables
Schema separate from data
Queries: precise, refer to schema and data (SQL)

AUTHOR
AID  Name       Country
44   Abiteboul  FR
06   Buneman    UK
62   Hull       USA
12   Suciu      USA
29   Vianu      USA

BOOKS
ISBN        Title                     Year  Publisher
0201537710  Foundations of Databases  1995  AW
155860622X  Data on the Web           1999  MK

WROTE
ISBN        AID
0201537710  44
0201537710  62
0201537710  29
155860622X  44
155860622X  06
155860622X  12

Hard to publish, easy to query precisely
6
Background: Text Databases

Unstructured, stored in documents
No schema, only data
Queries: imprecise, refer to data only (keywords)

Foundations of Databases, Abiteboul (FR), Hull (USA), Vianu (USA), Addison Wesley, 1995
Data on the Web, Abiteboul (FR), Buneman (UK), Suciu (USA), Morgan Kaufmann, 1999
Easy to publish, hard to query precisely
7
Background: XML Data

Semistructured
Schema and data are together: self-describing
Queries: precise, refer to schema and data (SQL)

<bib>
  <book>
    <title> Foundations </title>
    <author> <name> Abiteboul </name> <country> FR </country> </author>
    <author> <name> Hull </name> <country> USA </country> </author>
    <author> <name> Vianu </name> <country> USA </country> </author>
    <publisher> Addison Wesley </publisher>
    <year> 1995 </year>
  </book>
</bib>
XML: easier to publish, easy to query precisely
8
Background: XML Data
Data model: tree
[Figure: tree rooted at bib, with book and paper children; inner nodes include title, author, journal, publisher, name and country; leaves include "Addison Wesley", "Data on the Web", "Buneman", "UK", "Abiteboul", "FR".]
9
Background: XML Data

Querying with XPath (and XQuery)
This talk: XPath queries restricted to
tag
/
//
[path=constant]

10
Background: XPath in One Slide
tag, /
/bib/book/author/name
//, *
Navigate partially known structure
/bib/book//name//zip
Conjunctive queries a la SQL
/bib/book[author/name=Abiteboul]

/bib/book/year1995 and authornameAbiteboul
and countryFR
This is precisely the region algebra, e.g. use proximal nodes [Navarro, Baeza-Yates 97]
11
Outline

Background
The XML stream processing problem
Basic XML processing with automata
Adapting automata to XML
Stream indexes
Conclusions

12
Main Application: XML Packet Routing

Selective Dissemination of Information [Altinel, Franklin 00], [Chan et al. 02]
XML content routing [Snoeren et al. 01]
SOAP Message routing in Application Servers

13
XML Packet Routing
[Figure: XML packets of the form <doc> <tag> value </tag> </doc>]
14
[Figure: an Input XML Stream of documents <bib> <book> ... </bib> and several Output XML Streams]
15
The XML Stream Processing Problem
Given:
A set of XPath expressions
An incoming stream of XML documents
Decide:
For each document, which expressions it matches
Hard:
Large number of XPath expressions, e.g. 10^3 - 10^6
Streaming XML data, high throughput, e.g. 5MB/s
Easy:
Shallow XML data, e.g. depth 20
Short XPath expressions
16
The Approaches

Basic techniques
NFA plus optimizations
XFilter/YFilter [Altinel, Franklin 00]
XTrie [Chan et al. 02]
DFA
XML Toolkit
Beyond the obvious
Stream indexes (XML Toolkit)
Stream views

17
Outline

Background
The XML stream processing problem
Basic XML processing with automata
Adapting automata to XML
Stream indexes
Conclusions

18
From XPath to NFA
/catalog/productcategory"tools"/price
200/quantity //price
Extra processing needed to combine branches (not
in this talk)
19
Basic NFA Evaluation
[Figure: a large workload of XPath expressions over /bib/book (publisher, category, title, address, zip, field, ...) is compiled into a single NFA. The input stream <bib> <book> ... </bib> is processed with a STACK whose entries are sets of NFA states, e.g. {3, 66, 102, 4534, ...}, {2, 3, 543, 43, 254}, {1, 55, 99, ...}.]
20
Basic NFA Evaluation

Properties: space linear; throughput decreases linearly (in the number of XPath expressions)
Systems
XFilter [Altinel, Franklin 00], YFilter
XTrie [Chan et al. 02]
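
The stack-of-state-sets evaluation can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the XFilter, YFilter or XTrie implementation: it handles linear expressions built from tag, *, / and // only, and the names parse_xpath, advance and match_document are made up for this sketch.

```python
import io
import xml.etree.ElementTree as ET

def parse_xpath(expr):
    """Split a linear XPath such as '/bib//name' into (axis, tag) steps.
    Assumes the expression starts with '/' or '//' and has no predicates."""
    steps, i = [], 0
    while i < len(expr):
        axis = "//" if expr.startswith("//", i) else "/"
        i += len(axis)
        j = i
        while j < len(expr) and expr[j] != "/":
            j += 1
        steps.append((axis, expr[i:j]))
        i = j
    return steps

def advance(states, tag, queries):
    """One NFA step: follow every transition enabled by an opening tag.
    An NFA state is a pair (query id, number of steps already matched)."""
    nxt = set()
    for q, pos in states:
        steps = queries[q]
        if pos == len(steps):
            continue                      # accepting state, nothing left to match
        axis, name = steps[pos]
        if axis == "//":
            nxt.add((q, pos))             # '//' may skip over this element
        if name == "*" or name == tag:
            nxt.add((q, pos + 1))
    return nxt

def match_document(xml_bytes, expressions):
    """Return the ids of the expressions matched by one XML document."""
    queries = [parse_xpath(e) for e in expressions]
    stack = [{(q, 0) for q in range(len(queries))}]
    matched = set()
    events = ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end"))
    for event, elem in events:
        if event == "start":              # push the successor state set
            states = advance(stack[-1], elem.tag, queries)
            matched.update(q for q, pos in states if pos == len(queries[q]))
            stack.append(states)
        else:                             # closing tag: restore the parent's set
            stack.pop()
    return matched
```

Each opening tag touches every active NFA state, which is why the per-element cost grows with the size of the workload.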

21
Basic DFA Evaluation
[Figure: the same workload of XPath expressions compiled into a DFA. The input stream <bib> <book> ... </bib> is processed with a STACK whose entries are single DFA states, e.g. 399, 552, 1.]
22
Basic DFA Evaluation

Properties: throughput constant! Space: GOOD QUESTION
System
XML Toolkit, University of Washington, http://xmltk.sourceforge.net

23
Outline

Background
The XML stream processing problem
Basic XML processing with automata
Adapting automata to XML
Stream indexes
Conclusions

24
The Size of the DFA
//a/b/a/a/b
[Figure: NFA with states 0 to 5 in a chain, with transitions labeled a, b, a, a, b]
DFA for //P has 1 + |P| states [KMP]
25
The Size of the DFA
//a/*/*/*/b
[Figure: NFA with states 0 to 5 in a chain, with transitions labeled a, *, *, *, b]
Size of DFA: exponential in the number of *'s (not a real concern)
26
The Size of the DFA

Theorem [GMOS02]: The number of states in the DFA for one linear XPath expression P is at most k + |P| k s^m, where
k = number of //
s = size of the alphabet (number of tags)
m = max number of * between two consecutive //
Size of DFA: Multiple Expressions
//section/table/footnote
//table/footnote
//section/figure/footnote
. . . . .
//abstract/footnote/table
DFA: trie, has linear number of states [Aho, Corasick]
28
Size of DFA: Multiple Expressions
//section//footnote
//table//footnote
//figure//footnote
. . . . .
//abstract//footnote
100 expressions
2^100 states !!
There is a theorem here too, but it's not useful
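
The blow-up can be reproduced on a small scale. The sketch below is an illustration, not code from the talk: it runs the textbook subset construction eagerly over n expressions of the form //t_i//footnote and counts the reachable DFA states. The function names and the (axis, tag) step encoding are assumptions made for this example.

```python
from collections import deque

def successor(state, tag, queries):
    """Subset-construction step: the set of NFA states reached on `tag`."""
    nxt = set()
    for q, pos in state:
        steps = queries[q]
        if pos == len(steps):
            continue                      # accepting state has no outgoing edges
        axis, name = steps[pos]
        if axis == "//":
            nxt.add((q, pos))             # self-loop contributed by '//'
        if name in ("*", tag):
            nxt.add((q, pos + 1))
    return frozenset(nxt)

def eager_dfa_size(queries, alphabet):
    """Count the DFA states reachable when every transition is precomputed."""
    start = frozenset((q, 0) for q in range(len(queries)))
    seen, todo = {start}, deque([start])
    while todo:
        state = todo.popleft()
        for tag in alphabet:
            nxt = successor(state, tag, queries)
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return len(seen)

def descendant_workload(n):
    """n expressions //t0//footnote ... //t(n-1)//footnote, plus their alphabet."""
    tags = [f"t{i}" for i in range(n)]
    queries = [[("//", t), ("//", "footnote")] for t in tags]
    return queries, tags + ["footnote"]
```

The DFA has to remember which subset of the tags t_i has already been seen on the current path, so the count is at least 2^n. With n = 100 an eager construction is out of reach.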
29
SolutionCompute the DFA Lazily

Also used in text searching
But will it work for 10^6 XPath expressions?
YES !
For XPath it is provably effective, for two
reasons
XML data is not very deep
The nesting structure in XML data tends to be
predictable
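
A minimal sketch of the lazy construction, assuming linear expressions given as lists of (axis, tag) steps: a DFA state is a set of NFA states, and a transition is computed and cached only the first time the input actually needs it. The class and function names are hypothetical, and this is not the XML Toolkit's code.

```python
class LazyDFA:
    def __init__(self, queries):
        # queries: list of step lists, e.g. [("/", "bib"), ("//", "name")]
        self.queries = queries
        self.start = frozenset((q, 0) for q in range(len(queries)))
        self.table = {}                   # (dfa_state, tag) -> dfa_state
        self.states = {self.start}        # DFA states materialized so far

    def step(self, state, tag):
        key = (state, tag)
        if key not in self.table:         # first visit: build this one edge
            nxt = set()
            for q, pos in state:
                steps = self.queries[q]
                if pos == len(steps):
                    continue
                axis, name = steps[pos]
                if axis == "//":
                    nxt.add((q, pos))
                if name in ("*", tag):
                    nxt.add((q, pos + 1))
            self.table[key] = frozenset(nxt)
            self.states.add(self.table[key])
        return self.table[key]            # later visits are one dictionary lookup

    def accepted(self, state):
        return {q for q, pos in state if pos == len(self.queries[q])}

def run(dfa, events):
    """events: ('start', tag) / ('end', tag) pairs, as a SAX parser emits them."""
    stack, matched = [dfa.start], set()
    for kind, tag in events:
        if kind == "start":
            stack.append(dfa.step(stack[-1], tag))
            matched |= dfa.accepted(stack[-1])
        else:
            stack.pop()
    return matched
```

Only the states reachable on paths that occur in the data are ever built. That is why shallow data with predictable nesting keeps the automaton small even when the eager DFA would be huge.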

30
Lazy DFA and Simple DTDs

Document Type Definition (DTD)
Part of the XML standard
Will be replaced by XML Schema
Example DTD

<!ELEMENT document (section)>
<!ELEMENT section ((section|abstract|table|figure)*)>
<!ELEMENT figure (table?,footnote)>
. . . . .
Definition: a DTD is simple if all cycles are self-loops
31
Lazy DFA and Simple DTDs
[Figure: graph of a simple DTD with nodes document, section, abstract, figure, table, footnote]
32
Lazy DFA and Simple DTDs

Theorem [GMOS02]: If the XML data has a simple DTD, then the lazy DFA has at most 1 + D(1+n)^d states, where
n = max depth of XPath expressions
D = size of the unfolded DTD
d = max depth of self-loops in the DTD

Fact of life: data-like XML has simple DTDs
33
Lazy DFA and Data Guides

Non-simple DTDs are useless for the lazy DFA
Everything may contain everything

<!ELEMENT document (section)>
<!ELEMENT section ((section|table|figure|abstract|footnote)*)>
<!ELEMENT table ((section|table|figure|abstract|footnote)*)>
<!ELEMENT figure ((section|table|figure|abstract|footnote)*)>
<!ELEMENT abstract ((section|table|figure|abstract|footnote)*)>
Fact of life: text-like XML has non-simple DTDs
34
Lazy DFA and Data Guides

Definition [Goldman, Widom 97]:
The data guide for an XML data instance is the trie of all its root-to-leaf paths
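
As an illustration of the definition, the sketch below builds the data guide of a small document as a nested dictionary and counts its nodes. The function names are made up for this example, and a production data guide would be built incrementally from the stream rather than from a parsed tree.

```python
import xml.etree.ElementTree as ET

def data_guide(root):
    """Return the data guide as a trie: tag -> guide of the children below it."""
    def merge(elem, node):
        child_node = node.setdefault(elem.tag, {})   # one trie node per distinct path
        for child in elem:
            merge(child, child_node)
    guide = {}
    merge(root, guide)
    return guide

def count_nodes(guide):
    """Number of nodes in the trie."""
    return sum(1 + count_nodes(sub) for sub in guide.values())

def count_elements(root):
    """Number of elements in the document, for comparison."""
    return sum(1 for _ in root.iter())
```

For a document with many repeated sections, tables and figures, count_nodes stays at the number of distinct label paths while count_elements grows with the data.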

35
Lazy DFA and Data Guides
[Figure: an XML data tree (document with nested section, table and figure elements) next to its data guide, which keeps one node per distinct root-to-node path]
Fact of life: real XML data has a small data guide [Liefke, S. 00]
36
Lazy DFA and Simple DTDs

Theorem [GMOS02]: If the XML data has a data guide with G nodes, then the number of states in the lazy DFA is at most 1 + G, where
G = number of nodes in the data guide
37
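Number of Lazy DFA States - SYNTHETIC Data
[Figure: bar chart of the number of lazy DFA states (log scale, 1 to 100000) for workloads of 10^3, 10^4 and 10^5 XPath expressions, over the DTDs simple, prov, ebBPSS, protein, nasa and treebank. Annotation: 4000 states.]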
38
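Number of Lazy DFA States - REAL Data
[Figure: bar chart of the number of lazy DFA states (log scale, 1 to 100000) for workloads of 10^3, 10^4 and 10^5 XPath expressions, over the data sets protein, nasa and treebank. Annotations: 95 states; 40000 states, G 350000.]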
39
Number of States in the lazy DFA
                     Real XML data                Synthetic XML data
Data-style DTD       Theorem: lazy DFA is small   Theorem: lazy DFA is small
Document-style DTD   Theorem: lazy DFA is small   Fact: lazy DFA is HUGE
40
Lazy DFA in the XML Toolkit

The XML toolkit uses a lazy DFA to process XML
streams
warm-up phase, followed by very high throughput

41
Throughput for 10^3, 10^4, 10^5, 10^6 XPath expressions
[Figure: throughput (log scale, 0.0001MB/s to 100MB/s) against total input size (5MB to 25MB), with prob(*) = 10%, prob(//) = 10%. Reference levels: parser 10MB/s, lazy DFA 5.4MB/s.]
42
Summary of Lazy DFA and XML

Linear XPath expressions
Process with one lazy DFA
XPath expressions with branches
Process with Deterministic Pushdown Automata (ongoing work at the University of Washington)

43
Outline

Background
The XML stream processing problem
Basic XML processing with automata
Adapting automata to XML
Stream indexes
Conclusions

44
Stream IndeX (SIX)

Main observation
Parsing is major bottleneck
Definition: the SIX of an XML document is a binary table of (begin, end) offsets
Idea
Use SIX to reduce amount of parsing
Works well with (lazy) DFA
Implemented in the XML toolkit
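
A rough sketch of the idea: build the (begin, end) table once, then use it to jump over elements a consumer is not interested in instead of parsing their content. The tag scanner below is deliberately simplified (no attributes, comments, CDATA or self-closing tags), the table is a Python list rather than a binary encoding, and build_six and visit are names invented for this illustration.

```python
import re

# Simplified tag grammar: <name> or </name> only.
TAG = re.compile(rb"<(/?)([A-Za-z_][\w.-]*)>")

def build_six(data):
    """Return [(begin, end)] byte offsets, one row per element, in document order."""
    six, stack = [], []
    for m in TAG.finditer(data):
        if not m.group(1):                 # opening tag: reserve a row
            stack.append(len(six))
            six.append((m.start(), None))
        else:                              # closing tag: fill in the end offset
            row = stack.pop()
            six[row] = (six[row][0], m.end())
    return six

def visit(data, six, skip):
    """Scan tags left to right, jumping over any element whose name is in `skip`.
    Returns the names of the opening tags that were actually looked at."""
    end_of = dict(six)                     # begin offset -> end offset
    seen, pos = [], 0
    while True:
        m = TAG.search(data, pos)
        if m is None:
            return seen
        name = m.group(2).decode()
        if not m.group(1) and name in skip:
            pos = end_of[m.start()]        # the SIX lets us skip the whole subtree
        else:
            if not m.group(1):
                seen.append(name)
            pos = m.end()
```

In combination with a DFA, the set of skippable tags would come from the automaton: when the state reached on an opening tag has no outgoing transitions, nothing below that element can match.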

45
Stream IndeX (SIX)
SIX
            beginOffset  endOffset
bib         0            1490124
book        3            409023
publisher   12           423
author      426          879
author      978          . . .
. . .
XML
<bib>
  <book>
    <publisher> Addison-Wesley </publisher>
    <author> Serge Abiteboul </author>
    <author> <first-name> Rick </first-name> <last-name> Hull </last-name> </author>
    <author> Victor Vianu </author>
    <title> Foundations of Databases </title>
    <year> 1995 </year>
  </book>
  <book>
    <publisher> Freeman </publisher>
    <author> Jeffrey D. Ullman </author>
    <title> Principles of Database and Knowledge Base Systems </title>
    <year> 1998 </year>
  </book>
</bib>
46
Stream IndeX (SIX)
[Figure: the XML stream of documents <bib> <book> ... </bib>, each accompanied by its SIX (e.g. DIME) as a list of (begin, end) offsets: (0, 205), (30, 66), (72, 188), (90, 110), (95, 98)]
The SIX stream is about 6% of the data stream, and can be made MUCH smaller
48
Stream Views

Idea
Given a workload of XPath expressions with
branches
Precompute some views for each document to speed
up the entire workload
views ? header has to be small

49
Stream Views
Servers
Queries
/ab88c99 /ac99e00
/ab11c22e23
/ab33d44 e55 /ac66f77 /af34g56

Short circuit evaluation !
50
Stream Views

Views ? header (binary offsets)

[Figure: each XML document <bib> <book> ... </bib> in the stream is preceded by a header of offsets (0, 30, 72)]
100x speedup on a hit
Choosing the views: difficult problem
51
Outline