Dan Suciu - PowerPoint PPT Presentation

1
From Searching Text to Querying XML Streams
  • Dan Suciu
  • www.cs.washington.edu/homes/suciu

2
About Me
  • Born 1957, Romania
  • BS Bucharest, PhD University of Pennsylvania
  • Now University of Washington (Seattle)
  • My work is on semistructured data
  • Book: Data on the Web: From Relations to
    Semistructured Data and XML
  • Past/present projects
    • XML-QL: precursor of XQuery
    • XMill: the XML compressor
    • XML Toolkit

3
Motivation
  • Text databases
    • Studied over the past 15 years
    • Traditional client/server model
    • Struggled with lack of a standard text syntax
  • Recently, a new standard: XML
    • Traditional client/server in today's DBMSs
    • New applications: stream processing
  • This talk: processing streaming XML data
  • My motivation: work on the XML Toolkit project

4
Outline
  • Background
  • The XML stream processing problem
  • Basic XML processing with automata
  • Adapting automata to XML
  • Stream indexes
  • Conclusions

5
Background: Relational Databases
  • Structured, stored in tables
  • Schema separate from data
  • Queries: precise, refer to schema and data (SQL)

AUTHOR
AID  Name       Country
44   Abiteboul  FR
06   Buneman    UK
62   Hull       USA
12   Suciu      USA
29   Vianu      USA

BOOKS
ISBN        Title                     Year  Publisher
0201537710  Foundations of Databases  1995  AW
155860622X  Data on the Web           1999  MK

WROTE
ISBN        AID
0201537710  44
0201537710  62
0201537710  29
155860622X  44
155860622X  06
155860622X  12

Hard to publish, easy to query precisely
6
Background: Text Databases
  • Unstructured, stored in documents
  • No schema, only data
  • Queries: imprecise, refer to data only (keywords)

Foundations of Databases, Abiteboul (FR), Hull
(USA), Vianu (USA), Addison Wesley, 1995
Data on the Web, Abiteboul (FR), Buneman (UK),
Suciu (USA), Morgan Kaufmann, 1999
Easy to publish, hard to query precisely
7
Background: XML Data
  • Semistructured
  • Schema and data are together: self-describing
  • Queries: precise, refer to schema and data (SQL)

<bib>
  <book>
    <title> Foundations </title>
    <author> <name> Abiteboul </name> <country> FR </country> </author>
    <author> <name> Hull </name> <country> USA </country> </author>
    <author> <name> Vianu </name> <country> USA </country> </author>
    <publisher> Addison Wesley </publisher>
    <year> 1995 </year>
  </book>
</bib>

XML: easier to publish, easy to query precisely
8
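Background: XML Data
Data model: tree
[Figure: the bib document drawn as a tree. The root bib has paper and book
children; the elements below them include title, author, journal, and
publisher (e.g. Addison Wesley, Data on the Web); each author has name and
country children (e.g. Buneman, UK; Abiteboul, FR)]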
9
Background: XML Data
  • Querying with XPath (and XQuery)
  • This talk: XPath queries restricted to
    • tag
    • /
    • //
    • *
    • [path=constant]

10
Background: XPath in One Slide
tag, /
    /bib/book/author/name
//, *: navigate partially known structure
    /bib/book//name/*/zip
Conjunctive queries à la SQL
    /bib/book[author/name=Abiteboul]
    /bib/book[year=1995 and author[name=Abiteboul and country=FR]]
This is precisely the region algebra. E.g. use
proximal nodes [Navarro & Baeza-Yates 97]
11
Outline
  • Background
  • The XML stream processing problem
  • Basic XML processing with automata
  • Adapting automata to XML
  • Stream indexes
  • Conclusions

12
Main Application: XML Packet Routing
  • Selective Dissemination of Information
    [Altinel & Franklin 00], [Chan et al. 02]
  • XML content routing [Snoeren et al. 01]
  • SOAP message routing in application servers

13
XML Packet Routing
<doc> <tag> value </tag> </doc>
<doc> <tag> value </tag> </doc>
<doc> <tag> value </tag> </doc>
14
[Figure: an input XML stream of documents <bib> <book> ... </bib> is routed
into several output XML streams]
15
The XML Stream Processing Problem
Given:
  • A set of XPath expressions
  • An incoming stream of XML documents
Decide:
  • For each document, which expressions it matches
Hard:
  • Large number of XPath expressions, e.g. 10^3 - 10^6
  • Streaming XML data, high throughput, e.g. 5MB/s
Easy:
  • Shallow XML data, e.g. depth 20
  • Short XPath expressions
16
The Approaches
  • Basic techniques
    • NFA plus optimizations
      • XFilter/YFilter [Altinel & Franklin 00]
      • XTrie [Chan et al. 02]
    • DFA
      • XML Toolkit
  • Beyond the obvious
    • Stream indexes (XML Toolkit)
    • Stream views

17
Outline
  • Background
  • The XML stream processing problem
  • Basic XML processing with automata
  • Adapting automata to XML
  • Stream indexes
  • Conclusions

18
From XPath to NFA
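[Figure: NFA for a branching XPath expression over /catalog/product, with
branches on category ("tools") and price (200), a /quantity step, and a
//price step]
Extra processing is needed to combine branches (not
in this talk)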
19
Basic NFA Evaluation
[Figure: a large workload of XPath expressions over /bib/book (filters on
publisher, category, title, address, zip, ...) is compiled into one NFA.
While the input stream <bib> <book> ... </bib> is processed, each STACK
entry holds a set of NFA states, e.g. {1,55,99,...}, {2,3,543,43,254},
{3,66,102,4534,...}]
20
Basic NFA Evaluation
  • Properties: space is linear; throughput decreases
    linearly (in the number of XPath expressions)
  • Systems
    • XFilter [Altinel & Franklin 00], YFilter
    • XTrie [Chan et al. 02]
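
The basic NFA evaluation can be sketched in a few lines. This is only an illustration under simplifying assumptions, not the XFilter, YFilter, or XTrie implementation: it handles linear expressions built from tag, *, / and // only, and the names parse_xpath and match_stream are made up for this sketch. Each stack entry is the set of active NFA states, written as (query index, steps matched so far).

```python
import io
import re
import xml.etree.ElementTree as ET


def parse_xpath(expr):
    """Split a linear XPath such as '/bib//name/*' into (axis, tag) steps."""
    return re.findall(r"(//|/)([^/]+)", expr)


def match_stream(xml_bytes, exprs):
    """Return the subset of exprs matched by at least one element of the document."""
    queries = [parse_xpath(e) for e in exprs]
    matched = set()
    # Bottom of the stack: every query is at its first step.
    stack = [{(q, 0) for q in range(len(queries))}]
    events = ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end"))
    for event, elem in events:
        if event == "end":
            stack.pop()  # leaving the element restores the parent's state set
            continue
        current = set()
        for q, i in stack[-1]:
            if i == len(queries[q]):
                continue  # accepting state, no outgoing transitions
            axis, name = queries[q][i]
            if name in ("*", elem.tag):
                current.add((q, i + 1))  # this element matches step i
                if i + 1 == len(queries[q]):
                    matched.add(exprs[q])
            if axis == "//":
                current.add((q, i))  # self-loop: // may skip this element
        stack.append(current)
    return matched
```

The work done per start tag grows with the size of the active state set, which is where the linear decrease in throughput comes from as the workload grows.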

21
Basic DFA Evaluation
[Figure: the same workload of XPath expressions over /bib/book is compiled
into a DFA. While the input stream <bib> <book> ... </bib> is processed,
each STACK entry holds a single DFA state, e.g. 1, 552, 399]
22
Basic DFA Evaluation
  • Properties: throughput is constant! Space:
    GOOD QUESTION
  • System
    • XML Toolkit [University of Washington],
      http://xmltk.sourceforge.net

23
Outline
  • Background
  • The XML stream processing problem
  • Basic XML processing with automata
  • Adapting automata to XML
  • Stream indexes
  • Conclusions

24
The Size of the DFA
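//a/b/a/a/b
[Figure: NFA with states 0-5; state 0 has a self-loop, and the transitions
0→1, 1→2, 2→3, 3→4, 4→5 are labeled a, b, a, a, b]
DFA for //P has 1 + |P| states [KMP]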
25
The Size of the DFA
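//a/*/*/*/b
[Figure: NFA with states 0-5; state 0 has a self-loop, and the transitions
0→1, 1→2, 2→3, 3→4, 4→5 are labeled a, *, *, *, b]
Size of DFA: exponential in the number of *s (not a real
concern)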
26
The Size of the DFA
  • Theorem [GMOS02]: the number of states in the DFA
    for one linear XPath expression P is at most
    k + |P| k s^m
  • k = number of //
  • s = size of the alphabet (number of tags)
  • m = max number of * between two consecutive //
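
To get a feel for the bound, here is a small worked computation. It reads the bound as k + |P| k s^m, and it makes two assumptions that the slide does not spell out: |P| is taken to be the number of steps in P, and m is counted over the run of steps following each // (including the last run). The function name eager_dfa_bound is made up for this sketch.

```python
import re


def eager_dfa_bound(expr, alphabet_size):
    """Evaluate k + |P| * k * s**m for a linear XPath expression.

    Assumptions of this sketch: |P| is the number of steps, and m is the
    largest number of '*' steps in any run of steps that follows a '//'.
    """
    steps = re.findall(r"(//|/)([^/]+)", expr)
    k = sum(1 for axis, _ in steps if axis == "//")
    if k == 0:
        raise ValueError("this reading of the bound assumes at least one //")
    m = 0
    stars = None  # stays None until the first // has been seen
    for axis, name in steps:
        if axis == "//":
            stars = 0  # a new run starts at every //
        if stars is not None and name == "*":
            stars += 1
            m = max(m, stars)
    return k + len(steps) * k * alphabet_size ** m
```

For //a/b/a/a/b there are no wildcards, so the bound is 1 + 5 = 6 whatever the alphabet, which agrees with the 1 + |P| states of the KMP-style DFA. For //a/*/*/*/b and 10 tags it is 1 + 5 * 10^3 = 5001.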
27
Size of DFA Multiple Expressions
//section/table/footnote
//table/footnote
//section/figure/footnote
. . . . .
//abstract/footnote/table
DFA is a trie: it has a linear number of states
[Aho & Corasick]
28
Size of DFA Multiple Expressions
//section//footnote
//table//footnote
//figure//footnote
. . . . .
//abstract//footnote
100 expressions
2^100 states !!
There is a theorem here too, but it's not useful
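
The blow-up is easy to reproduce on a small scale. The sketch below runs a textbook subset construction over a workload of the form //t0//footnote, //t1//footnote, ... and counts the reachable DFA states. The tag names t0, t1, ... and the function names are placeholders for this illustration, and the exact count depends on how accepting states are modelled; in this model it is 2^(n+1) - 1 for n expressions, so at least 2^n.

```python
import re
from collections import deque


def nfa_step(state, tag, queries):
    """One NFA move on a start tag; a state is a set of (query, position) pairs."""
    nxt = set()
    for q, i in state:
        if i == len(queries[q]):
            continue  # accepting state, no outgoing transitions
        axis, name = queries[q][i]
        if name in ("*", tag):
            nxt.add((q, i + 1))
        if axis == "//":
            nxt.add((q, i))  # self-loop of the // step
    return frozenset(nxt)


def eager_dfa_size(exprs, alphabet):
    """Count the DFA states reachable by subset construction (eager DFA)."""
    queries = [re.findall(r"(//|/)([^/]+)", e) for e in exprs]
    start = frozenset((q, 0) for q in range(len(queries)))
    seen = {start}
    todo = deque([start])
    while todo:
        state = todo.popleft()
        for tag in alphabet:
            nxt = nfa_step(state, tag, queries)
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return len(seen)


def blowup(n):
    """Eager DFA size for the n expressions //t0//footnote ... //t(n-1)//footnote."""
    tags = [f"t{i}" for i in range(n)]
    exprs = [f"//{t}//footnote" for t in tags]
    return eager_dfa_size(exprs, tags + ["footnote"])
```

Each DFA state has to remember which of the n tags have already been seen on the current path, hence one state per subset. Calling blowup(n) for n up to 10 or so is quick; at n = 100 the eager construction is hopeless.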
29
SolutionCompute the DFA Lazily
  • Also used in text searching
  • But will it work for 10^6 XPath expressions?
  • YES!
  • For XPath it is provably effective, for two
    reasons:
    • XML data is not very deep
    • The nesting structure in XML data tends to be
      predictable
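
A minimal sketch of the lazy idea, assuming linear expressions over tag, *, / and // only. It is not the XML Toolkit code, and the class and method names are invented for the illustration. DFA states are sets of NFA states, but a state and its transitions are only materialized the first time the input stream actually asks for them.

```python
import io
import re
import xml.etree.ElementTree as ET


class LazyDFA:
    def __init__(self, exprs):
        self.exprs = list(exprs)
        self.queries = [re.findall(r"(//|/)([^/]+)", e) for e in self.exprs]
        self.start = frozenset((q, 0) for q in range(len(self.exprs)))
        self.states = {self.start}  # DFA states materialized so far
        self.table = {}             # (state, tag) -> state, filled on demand

    def _expand(self, state, tag):
        """Compute one DFA transition from the underlying NFA."""
        nxt = set()
        for q, i in state:
            if i == len(self.queries[q]):
                continue  # accepting state, no outgoing transitions
            axis, name = self.queries[q][i]
            if name in ("*", tag):
                nxt.add((q, i + 1))
            if axis == "//":
                nxt.add((q, i))
        return frozenset(nxt)

    def next_state(self, state, tag):
        """Table lookup; the NFA is consulted only on a miss (warm-up)."""
        key = (state, tag)
        if key not in self.table:
            self.table[key] = self._expand(state, tag)
            self.states.add(self.table[key])
        return self.table[key]

    def run(self, xml_bytes):
        """Process one document; return the expressions it matches."""
        matched = set()
        stack = [self.start]
        events = ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end"))
        for event, elem in events:
            if event == "end":
                stack.pop()
                continue
            state = self.next_state(stack[-1], elem.tag)
            stack.append(state)
            for q, i in state:
                if i == len(self.queries[q]):
                    matched.add(self.exprs[q])
        return matched
```

Only tag sequences that occur along root-to-node paths in the data ever create a state, so shallow data with predictable nesting keeps the table small; once the table is warm, each start tag costs one dictionary lookup plus the scan for accepting states.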

30
Lazy DFA and Simple DTDs
  • Document Type Definition (DTD)
    • Part of the XML standard
    • Will be replaced by XML Schema
  • Example DTD

<!ELEMENT document (section*)>
<!ELEMENT section ((section|abstract|table|figure)*)>
<!ELEMENT figure (table?,footnote*)>
. . . . .
Definition: a DTD is simple if all cycles are loops
31
Lazy DFA and Simple DTDs
[Figure: graph of a simple DTD over the elements document, section,
abstract, figure, table, and footnote]
32
Lazy DFA and Simple DTDs
  • Theorem [GMOS02]: if the XML data has a simple
    DTD, then the lazy DFA has at most
    1 + D(1 + n)^d states
  • n = max depth of the XPath expressions
  • D = size of the unfolded DTD
  • d = max depth of self-loops in the DTD

Fact of life: data-like XML has simple DTDs
33
Lazy DFA and Data Guides
  • Non-simple DTDs are useless for the lazy DFA
  • Everything may contain everything

<!ELEMENT document (section*)>
<!ELEMENT section ((section|table|figure|abstract|footnote)*)>
<!ELEMENT table ((section|table|figure|abstract|footnote)*)>
<!ELEMENT figure ((section|table|figure|abstract|footnote)*)>
<!ELEMENT abstract ((section|table|figure|abstract|footnote)*)>
Fact of life: text-like XML has non-simple DTDs
34
Lazy DFA and Data Guides
  • Definition [Goldman & Widom 97]
  • The data guide for an XML data instance is the
    trie of all its root-to-leaf paths
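
As a concrete illustration of the definition, the sketch below builds the data guide of a small document as a nested dictionary, one node per distinct path of tags from the root. The function names data_guide and count_nodes are invented for the example, and text content and attributes are ignored.

```python
import xml.etree.ElementTree as ET


def data_guide(xml_text):
    """Return the trie of all root-to-leaf tag paths as nested dicts."""
    guide = {}

    def add(elem, node):
        # Reuse the child node if this tag was already seen under the same path.
        child = node.setdefault(elem.tag, {})
        for sub in elem:
            add(sub, child)

    add(ET.fromstring(xml_text), guide)
    return guide


def count_nodes(guide):
    """Number of nodes in the data guide."""
    return sum(1 + count_nodes(child) for child in guide.values())


example = (
    "<document>"
    "<section><table/><figure/></section>"
    "<section><section><table/></section><table/></section>"
    "</document>"
)
```

The example document has 8 elements but only 6 distinct paths (document, document/section, document/section/table, document/section/figure, document/section/section, document/section/section/table), so count_nodes(data_guide(example)) is 6. Repeated structure in the data does not grow the guide.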

35
Lazy DFA and Data Guides
[Figure: an XML data instance (a document with repeated, nested section,
table, and figure elements) shown next to its data guide, in which each
distinct root-to-leaf path appears once]
Fact of life: real XML data has a small data
guide [Liefke & S. 00]
36
Lazy DFA and Simple DTDs
  • Theorem [GMOS02]: if the XML data has a data
    guide with G nodes, then the number of states in
    the lazy DFA is at most 1 + G
  • G = number of nodes in the data guide
37
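Number of Lazy DFA States - SYNTHETIC Data
[Figure: bar chart, log scale from 1 to 100000, of the number of lazy DFA
states for 10^3, 10^4, and 10^5 XPath expressions on the data sets simple,
prov, ebBPSS, protein, nasa, and treebank. Annotation: 4000 states]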
38
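Number of Lazy DFA States - REAL Data
[Figure: bar chart, log scale from 1 to 100000, of the number of lazy DFA
states for 10^3, 10^4, and 10^5 XPath expressions on the data sets protein,
nasa, and treebank. Annotations: 95 states; 40000 states (G = 350000)]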
39
Number of States in the lazy DFA
                    Real XML data               Synthetic XML data
Data-style DTD      Theorem: lazy DFA is small  Theorem: lazy DFA is small
Document-style DTD  Theorem: lazy DFA is small  Fact: lazy DFA is HUGE
40
Lazy DFA in the XML Toolkit
  • The XML toolkit uses a lazy DFA to process XML
    streams
  • warm-up phase, followed by very high throughput

41
Throughput for 10^3, 10^4, 10^5, 10^6 XPath
expressions
[Figure: throughput (log scale, 0.0001MB/s to 100MB/s) against total input
size (5MB to 25MB), with prob(*) = 10%, prob(//) = 10%. Annotations: parser
10MB/s; lazy DFA 5.4MB/s]
42
Summary of Lazy DFA and XML
  • Linear XPath expressions
    • Process with one lazy DFA
  • XPath expressions with branches
    • Process with Deterministic Pushdown Automata
      (ongoing work at the University of Washington)

43
Outline
  • Background
  • The XML stream processing problem
  • Basic XML processing with automata
  • Adapting automata to XML
  • Stream indexes
  • Conclusions

44
Stream IndeX (SIX)
  • Main observation
    • Parsing is the major bottleneck
  • Definition: the SIX of an XML document is a binary
    table of (begin, end) offsets
  • Idea
    • Use the SIX to reduce the amount of parsing
  • Works well with the (lazy) DFA
  • Implemented in the XML Toolkit
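
To make the definition concrete, here is a minimal sketch that computes one (begin, end) pair per element. It is not the XML Toolkit's SIX format: it reports character offsets rather than a binary table, treats end as exclusive, and uses a deliberately naive tag scanner that assumes well-formed input with no comments, CDATA sections, or '<' and '>' inside attribute values. The names TAG and build_six are invented for the example.

```python
import re

# Matches <name ...>, </name> and <name ... />; a naive scanner, not a full XML parser.
TAG = re.compile(r"<(/?)([A-Za-z_][\w.\-]*)[^<>]*?(/?)>")


def build_six(xml_text):
    """Return [(tag, begin, end), ...] in document order, with end exclusive."""
    entries = []       # one [tag, begin, end] entry per element
    open_entries = []  # entries whose end tag has not been seen yet
    for m in TAG.finditer(xml_text):
        closing, name, self_closing = m.groups()
        if closing:
            # The end tag closes the most recently opened element.
            open_entries.pop()[2] = m.end()
        else:
            entry = [name, m.start(), m.end() if self_closing else None]
            entries.append(entry)
            if not self_closing:
                open_entries.append(entry)
    return [tuple(e) for e in entries]


text = "<bib><book><title>T</title></book><book/></bib>"
six = build_six(text)
```

With such a table at hand, a consumer that already knows it is not interested in an element (for example because the DFA has no active states left) can jump straight to that element's end offset instead of parsing what lies in between: text[begin:end] is exactly the element's serialized subtree.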

45
Stream IndeX (SIX)
SIX
           beginOffset  endOffset
bib        0            1490124
book       3            409023
publisher  12           423
author     426          879
author     978          . . .
. . .

XML
<bib>
  <book>
    <publisher> Addison-Wesley </publisher>
    <author> Serge Abiteboul </author>
    <author> <first-name> Rick </first-name> <last-name> Hull </last-name> </author>
    <author> Victor Vianu </author>
    <title> Foundations of Databases </title>
    <year> 1995 </year>
  </book>
  <book>
    <publisher> Freeman </publisher>
    <author> Jeffrey D. Ullman </author>
    <title> Principles of Database and Knowledge Base Systems </title>
    <year> 1998 </year>
  </book>
</bib>
46
Stream IndeX (SIX)
SIX (e.g. DIME)
[Figure: each XML packet <bib> <book> ... </bib> in the stream is
accompanied by its SIX, a list of (begin, end) offset pairs such as
(0, 205), (30, 66), (72, 188), (90, 110), (95, 98)]
The SIX stream is about 6% of the data stream, and
can be made MUCH smaller
48
Stream Views
  • Idea
    • Given a workload of XPath expressions with
      branches
    • Precompute some views for each document to speed
      up the entire workload
  • Views → header: has to be small

49
Stream Views
Servers
Queries
/a[b=88][c=99]   /a[c=99][e=00]
/a[b=11][c=22][e=23]
/a[b=33][d=44][e=55]   /a[c=66][f=77]   /a[f=34][g=56]

Short-circuit evaluation!
50
Stream Views
  • Views → header (binary offsets)

[Figure: each XML packet <bib> <book> ... </bib> in the stream carries a
header of binary offsets, e.g. 0, 30, 72]
100x speedup on a hit
Choosing the views: difficult problem
51
Outline
  • Background
  • The XML stream processing problem
  • Basic XML processing with automata
  • Adapting automata to XML
  • Stream indexes
  • Conclusions

52
Summary
  • XML stream processing problem
    • Fixed XPath queries, transient XML data
    • Large number of queries
    • High data throughput
  • Relationship to text processing techniques
    • Still regular expressions
    • Still automata and lazy DFAs
    • Different scale
  • Techniques
    • Lazy DFAs work for reasons specific to XML
    • Stream indexes and views: ongoing research

53
Future Work
  • Handle branches in XPath expressions
  • View selection for a given workload
  • Network configuration

54
Thank you !
  • Links
    • www.cs.washington.edu/homes/suciu
    • www.cs.washington.edu/homes/suciu/XMLTK
    • xmltk.sourceforge.net