Title: Yanlei Diao
1. Towards an Internet-Scale XML Dissemination Service
- Yanlei Diao
- Shariq Rizvi
- Michael J. Franklin
- EECS, U.C. Berkeley
2. Outline
- XML dissemination services
- System model
- Core techniques
- Status and conclusions
3. Applications of XML Dissemination
- News feeds via RSS (Really Simple Syndication)
  - My Yahoo!: updated headlines from BBC, CNet, NPR.
- Mobile services
  - Mobile operators connect content providers with millions of clients running a multitude of operating systems.
- Stock tickers
  - QuoteMedia: fast access to real-time and historical stock data.
- Online auctions
  - freebidingtools.com: create your own feed for your favorite eBay search.
- Network monitoring
  - Ganglia: a distributed monitoring system for clusters and grids.
4. YFilter: An XML Dissemination Service
- User queries: specifications of data interests, written in an XML query language.
- Data sources: continuously publish XML data items.
- The service: delivers to each user the XML data items that match her data interests; the delivered results are presented in a customized format.
5. ONYX: Large-Scale XML Dissemination
- ONYX
  - Operator Network using YFilter for XML Dissemination
- An overlay network of information brokers running YFilter.
- Underlying infrastructures
  - A dedicated network
  - Peer-to-peer
  - Collaboration among administrative domains
6. Design Space: Expressiveness
- Expressiveness: the data model and query language a service supports
- Subject-based
  - Messages: a subject label
  - Queries: a specific label or a wildcard
- Predicate-based
  - Messages: attribute-value pairs
  - Queries: a set of predicates
- XML filtering
  - Messages: XML
  - Queries: subset of XPath 1.0
- XML filtering and transformation
  - Messages: XML
  - Queries: subset of XQuery
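The first three levels of expressiveness can be contrasted with a small sketch. This is only an illustration: the helper names, the sample messages, and the use of Python's `xml.etree.ElementTree` (which supports a limited XPath subset) are assumptions, not part of the talk.

```python
import operator
import xml.etree.ElementTree as ET

# Hypothetical helper names; none of these come from the talk.
OPS = {"=": operator.eq, "<": operator.lt, ">": operator.gt}

def subject_match(query_label, message_label):
    """Subject-based: a query is one label, or "*" for any subject."""
    return query_label == "*" or query_label == message_label

def predicate_match(predicates, message):
    """Predicate-based: every (attribute, op, value) predicate must hold."""
    return all(
        attr in message and OPS[op](message[attr], value)
        for attr, op, value in predicates
    )

def xml_match(path, xml_text):
    """XML filtering: the query is a path over the message's structure."""
    root = ET.fromstring(xml_text)
    return root.find(path) is not None

msg = '<nitf><head><pubdata area="SF"/></head></nitf>'
print(subject_match("*", "stocks"))
print(predicate_match([("symbol", "=", "XYZ"), ("price", ">", 10)],
                      {"symbol": "XYZ", "price": 12.5}))
print(xml_match("./head/pubdata[@area='SF']", msg))
print(xml_match("./head/pubdata[@area='NY']", msg))
```

Each step up adds structure to both messages and queries: a label, then flat attribute-value pairs, then paths over nested elements. Transformation (the XQuery level) is left out of this sketch.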
7. Design Space: Why Distributed Processing?
- Privacy
  - Regulations, e.g., CA Senate Bill No. 1386.
  - Policies, e.g., customers' data stay behind the firewall.
- Locality of data interests
  - Disseminate regional data directly to local subscribers.
- Scalability
  - Data volume: up to thousands of messages per second; message size from 1 KB to 20 KB.
  - Query population: up to millions.
  - Frequency of query updates: from a daily basis to every few minutes.
  - Result volume: can amplify the input data volume by a large factor.
8. Related Systems
9. Content of the Paper
- Content-driven routing
  - Need to handle both structural and value-based constraints.
  - Leverage YFilter: NFA-based operator networks, distributed construction.
- Filtering power of routing (i.e., fraction of messages filtered)
  - Filtering power can be inherently limited.
  - Use query partitioning (if possible) to improve it.
- Distributed transformation
  - Currently either at the publisher's side or at the edge brokers.
  - Perform cascading message transformation during routing.
- Efficient XML transmission
  - Verbosity of XML, and XML parsing at each routing step.
  - Investigate different XML formats for XML transmission.
- Detailed architectural design
- Other optimization techniques
10. Outline
- XML dissemination services
- System model
- Core techniques
- Status and conclusions
11. Operations on Data/Query Flows
(Figure label: a transformation query)
12. System Tasks on Data/Query Planes
- Processing planes: query plane and data plane
- Table of system tasks by plane (columns: Query Plane, Data Plane; cell contents not preserved)
  - Content-driven routing
  - Incremental transformation
  - Final query processing
13. Outline
- XML dissemination services
- System model
- Core techniques
- Status and conclusions
14. Routing Table Design
- A routing table: a mapping from output links to routing queries.
  - A routing query: the data interests of the queries downstream from an output link.
  - Data interest of a query: XPath expressions, or the for and where clauses of FLWOR expressions.
- Routing table design
  - a canonical form of routing queries,
  - a representation of routing tables, and
  - an algorithm constructing them from a distributed query population.
- Two (conflicting) goals
  - High filtering power of routing
    - Fraction of messages filtered in routing.
  - High routing efficiency
    - Number of messages routed per second.
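The mapping from output links to routing queries, and the filtering-power metric, can be sketched as follows. The sketch makes several simplifying assumptions that are not in the talk: a routing query is a set of plain element paths (no predicates), a message is summarized as the set of element paths it contains, and "filtered" means the message is forwarded on no output link. All names are hypothetical.

```python
def route(routing_table, message_paths):
    """Return the output links whose routing query matches the message.

    routing_table: dict mapping link name -> set of path strings (a disjunction).
    message_paths: set of element paths present in the message.
    """
    return [
        link
        for link, routing_query in routing_table.items()
        if not routing_query.isdisjoint(message_paths)
    ]

def filtering_power(routing_table, messages):
    """Fraction of messages that are forwarded on no output link."""
    dropped = sum(1 for m in messages if not route(routing_table, m))
    return dropped / len(messages)

table = {
    "link_A": {"/nitf/head/pubdata"},
    "link_B": {"/nitf/body/tobject"},
}
stream = [
    {"/nitf", "/nitf/head", "/nitf/head/pubdata"},
    {"/rss", "/rss/channel"},
    {"/nitf", "/nitf/body", "/nitf/body/tobject"},
    {"/rss"},
]
print(route(table, stream[0]))
print(filtering_power(table, stream))
```

In this toy stream two of the four messages match no routing query, so half of the traffic is filtered. The tension between the two goals shows up in the representation: richer routing queries filter more but cost more to evaluate per message.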
15. YFilter Basics
- An XML filtering and transformation engine that processes multiple queries in a shared fashion.
- A Non-Deterministic Finite Automaton (NFA)-based operator network.
- Benefits for routing
  - Fast structure matching.
  - A small maintenance cost for query updates.
  - Extensibility for supporting new operators.
Q1: /nitf[head/pubdata[@edition.area="SF"]][.//tobject.subject[@tobject.subject.type="Stock"]]
Q2: /nitf[head/pubdata[@edition.area="SF"]][.//tobject.subject[@tobject.subject.matter="fishing"]]
- Y. Diao and M.J. Franklin. Query Processing for High-Volume XML Message Brokering. VLDB 2003.
- Y. Diao, et al. Path Sharing and Predicate Evaluation for High-Performance XML Filtering. TODS, Dec. 2003.
- YFilter v1.0 release: coming later this month!
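To make the idea of shared, NFA-based structure matching concrete, here is a minimal sketch in the spirit of that design. It is not YFilter's code: the class names are hypothetical, only the `/`, `//`, and `*` steps are handled, and predicates such as those in Q1 and Q2 are ignored. Queries with a common prefix share states, and a stack of active state sets follows the element nesting of the message.

```python
import io
import re
import xml.etree.ElementTree as ET

class State:
    def __init__(self, self_loop=False):
        self.children = {}          # tag or "*" -> next State
        self.desc = None            # state entered via "//" (loops on any tag)
        self.self_loop = self_loop
        self.accepts = set()        # ids of queries accepted in this state

class SharedNFA:
    def __init__(self):
        self.root = State()

    def add_query(self, qid, path):
        """Insert a path such as /a//b/*; common prefixes reuse states."""
        state = self.root
        for axis, tag in re.findall(r"(//?)([^/]+)", path):
            if axis == "//":
                if state.desc is None:
                    state.desc = State(self_loop=True)
                state = state.desc
            state = state.children.setdefault(tag, State())
        state.accepts.add(qid)

    @staticmethod
    def _closure(states):
        # Follow the epsilon transitions that model "//".
        return states | {s.desc for s in states if s.desc is not None}

    def match(self, xml_text):
        """Return the ids of all queries matched by the message."""
        matched = set()
        stack = [self._closure({self.root})]
        source = io.BytesIO(xml_text.encode())
        for event, elem in ET.iterparse(source, events=("start", "end")):
            if event == "end":
                stack.pop()         # backtrack to the parent's active set
                continue
            nxt = set()
            for s in stack[-1]:
                for key in (elem.tag, "*"):
                    if key in s.children:
                        nxt.add(s.children[key])
                if s.self_loop:
                    nxt.add(s)
            for s in nxt:
                matched |= s.accepts
            stack.append(self._closure(nxt))
        return matched

nfa = SharedNFA()
nfa.add_query("Q1", "/nitf/head/pubdata")
nfa.add_query("Q2", "/nitf/head/title")
nfa.add_query("Q3", "/nitf//subject")
nfa.add_query("Q4", "/rss/*")
doc = "<nitf><head><pubdata/></head><body><tobject><subject/></tobject></body></nitf>"
print(sorted(nfa.match(doc)))
```

Adding or removing a query touches only the states along its own path, which is one way to read the "small maintenance cost for query updates" benefit listed above.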
16. Our Solution
- Routing queries are a disjunction of path expressions
  - Each XPath expression (the equivalent of the for and where clauses of FLWOR expressions) is a routing query.
  - Multiple routing queries can be connected by or.
- Routing table representation
  - Merge routing queries into a single combined operator network.
- Construction algorithm
  - Map(): a user query → a routing query in the canonical form.
  - Collect(): routing queries sent from child brokers → a routing table.
  - Aggregate(): all the routing queries (at a node) → a new routing query.
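The three construction steps can be sketched under strong simplifications. Here a routing query is modeled as a frozenset of path strings (the disjunction), a user query is a dict whose "path" entry stands in for its for/where clauses, and the function names and data shapes are hypothetical rather than taken from ONYX.

```python
def map_query(user_query):
    """Map(): reduce a user query to a routing query in canonical form."""
    return frozenset({user_query["path"]})

def collect(queries_from_children):
    """Collect(): build a routing table from (link, routing query) pairs."""
    table = {}
    for link, routing_query in queries_from_children:
        table[link] = table.get(link, frozenset()) | routing_query
    return table

def aggregate(routing_table, local_queries=()):
    """Aggregate(): merge everything known at a node into one routing query."""
    combined = frozenset()
    for routing_query in routing_table.values():
        combined |= routing_query
    for user_query in local_queries:
        combined |= map_query(user_query)
    return combined

q1 = {"path": "/nitf/head/pubdata", "return": "title"}
q2 = {"path": "/nitf/body//tobject", "return": "."}
q3 = {"path": "/nitf/head/pubdata", "return": "."}

# Two edge brokers summarize their local users, then report to a parent.
rq_b1 = aggregate({}, [q1, q2])
rq_b2 = aggregate({}, [q3])
parent_table = collect([("link_to_B1", rq_b1), ("link_to_B2", rq_b2)])
rq_parent = aggregate(parent_table)
print(sorted(rq_parent))
```

The parent's routing query contains each distinct path once, even though two users asked for `/nitf/head/pubdata`. The return clauses never leave the edge brokers, since only data interests are needed for routing.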
17. An Example Scenario
18. Example (continued)
19. Sharing and Short-cut Evaluation
- Separate routing query representations: allow short-cut evaluation.
- Combined representation: sharing may sacrifice the short-cut evaluation strategy.
- Solution: dynamic pruning of the operator network at runtime
  - Each operator/NFA state has a static set of broker ids that it can reach.
  - System keeps a dynamic set of broker ids that have been reached.
  - YFilter execution is extended to prune the operator network using these sets.
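One way to picture the pruning rule is the sketch below. It assumes a much simpler operator network than YFilter's: a trie over child-axis paths, with each state annotated with the static set of broker ids reachable through it. A state is skipped when every broker it could reach is already in the dynamic "reached" set. Names and data layout are hypothetical.

```python
class Node:
    def __init__(self):
        self.children = {}      # tag -> Node
        self.targets = set()    # brokers whose routing query ends here
        self.reachable = set()  # static: brokers reachable via this state

def build(routes):
    """Build the combined network from (broker id, path) pairs."""
    root = Node()
    for broker, path in routes:
        node = root
        node.reachable.add(broker)
        for tag in path.strip("/").split("/"):
            node = node.children.setdefault(tag, Node())
            node.reachable.add(broker)
        node.targets.add(broker)
    return root

def route(root, doc, prune=True):
    """Return (brokers reached, number of states visited) for one message.

    doc is a nested (tag, [children]) tuple standing in for an XML message.
    """
    reached, visited = set(), 0

    def walk(state, elem):
        nonlocal visited
        tag, kids = elem
        nxt = state.children.get(tag)
        if nxt is None:
            return
        if prune and nxt.reachable <= reached:
            return              # nothing new can be learned below this state
        visited += 1
        reached.update(nxt.targets)
        for kid in kids:
            walk(nxt, kid)

    walk(root, doc)
    return reached, visited

routes = [("B1", "/nitf/head"), ("B1", "/nitf/body/tobject"), ("B2", "/nitf/body")]
network = build(routes)
message = ("nitf", [("head", []), ("body", [("tobject", [])])])
print(route(network, message, prune=False))
print(route(network, message, prune=True))
```

Both runs reach the same brokers, but the pruned run skips the `tobject` state because B1 was already reached via `/nitf/head`. This recovers the short-cut behavior while keeping the shared representation.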
20. Other Routing Considerations
- Content Generalization
  - Large routing tables can be a problem.
  - Introduce content generalization as an additional step in Collect() or Aggregate().
  - Generalization methods.
  - Trade off filtering power for routing (space) efficiency.
- Filtering Power of Routing
  - Fraction of messages filtered by routing.
  - Selectivity of the union of the user queries at the node.
  - Loss in precision in the routing queries representing this node.
  - If inherently low, partition the query population to improve it.
  - An exclusiveness pattern, e.g., /a/b[@id=?]
  - Identify a set of such patterns, and partition queries using them.
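A possible reading of pattern-based partitioning is sketched below. It assumes that an exclusiveness pattern such as `/a/b[@id=?]` marks a value that differs across queries and that any one message can satisfy for only one value, so queries can be grouped by that value. The regular expression, the function name, and the sample queries are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Assumed textual form of the pattern: /a/b[@id="<value>"] at the query start.
PATTERN = re.compile(r'^/a/b\[@id="([^"]*)"\]')

def partition(queries):
    """Group queries by the value they bind in the exclusiveness pattern.

    Queries that do not contain the pattern are grouped under None.
    """
    groups = defaultdict(list)
    for query in queries:
        m = PATTERN.match(query)
        groups[m.group(1) if m else None].append(query)
    return dict(groups)

queries = [
    '/a/b[@id="1"]/c',
    '/a/b[@id="2"]/d',
    '/a/b[@id="1"]//e',
    '/x/y',
]
parts = partition(queries)
for key, group in parts.items():
    print(key, group)
```

Under this reading, a broker that hosts only the partition for id "1" can filter every message whose id is anything else, so its filtering power no longer depends on how diverse the whole query population is.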
21. Status and Conclusions
- Queries bring intelligence to the network routing fabric.
- We present a detailed architectural design of ONYX.
- We address fundamental issues.
  - YFilter's NFA-based operator networks are good for routing!
  - Locality of data interests is key to filtering power!
- Status: YFilter release, XML transmission, other implementation underway.
- This is an area full of opportunities for optimization.
  - Improving routing efficiency.
  - Improving filtering power of routing.
  - Incremental message transformation.
  - Sharing among different processing tasks.
  - Schema-based optimization.
22. Questions