Title: XML Warehousing and Xyleme
1XML Warehousing and Xyleme
- S. Abiteboul
- INRIA and Xyleme
- Serge.Abiteboul_at_inria.fr
- December 2002
2Organization
- The context and motivations
- XML warehouse
- Xyleme: an XML warehouse
- Zooms on some aspects of the technology
- Scaling
- Mass storage of XML
- XML query processing
- Semantic integration
- Web page ranking
- Query subscription
- Xyleme the company, in very brief
3The context
- The Web and XML are dramatically changing the world of distributed information
4The Web of yesterday
- Protocol: HTTP
- Documents: HTML
- Millions of independent web sites and billions of documents
- Browsing and keyword search (full-text indexing)
- Publication of databases using forms
- Data management with the Web
- HTML is primarily for humans
- Data management applications on the Web
- Based on hand-made wrappers
- Expensive, incomplete, short-lived, not adapted to the Web's constant change
- No real support for distributed data management!
5What is changing
- Information used to live in islands and a lot of its value was wasted
- Different formats: relational, meta data, documents and text, data exchange formats
- A Web standard for data exchange, XML, is fixing it
- XML can capture a wide spectrum of information
- XML comes with a family of emerging standards: XML Schema, XSL/T, XQuery, domain-specific schemas
- Different computers, platforms, languages, applications
- Web services, e.g., SOAP, are fixing it
- SOAP allows ubiquitous computing on the Internet
- SOAP comes with a family of emerging standards: WSDL, UDDI
6What is changing
- XML and Web services provide uniform access to
information, independent of platform, system,
language, communication protocol and data format
- The dream for distributed data management
- The gathering, integration, consolidation,
analysis of distributed information become
feasible at a much lower cost
7(1) XML covers the information spectrum
[Figure: the information spectrum, organized by structured data, meta data and hierarchy. Examples placed on it: books, contracts, catalogs, bank accounts, emails, financial reports, insurance policies, economic analysis, derivatives, inventory, political analysis, insurance claims, financial news, sports news, resumes.]
8XML covers the information spectrum
- Very structured information such as databases
- Most DBMS now export in XML
- Semi-structured data such as data exchange formats (ASN.1, SGML), e.g., technical documentation
- Documents
- Meta-data: author, date, status
- Existing structure in them: chapter, section, table of contents and index
- Possibly tagging of elements in them (citations, lists)
- Links to other documents
- Meta data for unstructured data such as images and sound
- Plain text
XML
9XML's asset: the marriage of text and structure
- Labeled ordered trees where leaves are text
- Marriage of document and database worlds
- Marriage of full-text indexing (keyword search) and structure indexing (SQL-style query)
- Is it the ultimate data model? No
- Purely syntax: more semantics needed
- Is it OK for now? Definitely yes (because it is a standard)
10XML's asset: typing
- Applications need typing, and XML data can be typed if needed (DTD and XML Schema)
- Trees
- Logical granularity: neither page nor document level, but the piece of information that is needed
- Semantics and structure are in tags and paths
- product-table/product/reference
- product-table/product/price
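To make the "semantics in tags and paths" point concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The sample catalog, its values, and the helper name `values_at` are assumptions for illustration; only the two paths come from the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog; element names follow the paths on the slide.
CATALOG = """
<product-table>
  <product><reference>A-100</reference><price>19.90</price></product>
  <product><reference>B-200</reference><price>5.00</price></product>
</product-table>
"""

def values_at(xml_text, path):
    """Return the text of every element reached by `path`, starting at the root."""
    root = ET.fromstring(xml_text)
    # The first path step names the root element itself, so split it off
    # and search for the remaining steps below the root.
    root_tag, _, rest = path.partition("/")
    if root.tag != root_tag:
        return []
    return [element.text for element in root.findall(rest)]

references = values_at(CATALOG, "product-table/product/reference")
prices = values_at(CATALOG, "product-table/product/price")
```

The unit of access is the piece of information addressed by the path (a reference, a price), not the page or the whole document.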
11HTML
Hard: text and presentation
- Where is the data?
12XML
Easy: data and structure, semistructured (presentation elsewhere)
13(2) Web services and ubiquitous distributed
computing
- Possibility to activate a method on some remote web server
- Exchange information in XML: input and result are in XML
- Ubiquitous XML distributed computing infrastructure
- 2 main applications
- E-commerce
- Access to remote data
- With XML and Web services, it is possible
- To get information from virtually anywhere
- To provide information to virtually anywhere
14Accessing remote information
[Figure: an application using gene banks queries some data services that provide candidate genes, then uses some processing services. The gene banks are reached through heterogeneous formats, protocols, etc.]
15Same with web services
[Figure: an application using gene banks queries some data services that provide candidate genes and uses some processing services, all through the Web, with uniform access to information.]
16XML and Web services
- Exchange of information
- E-commerce, B2B, G2C
- Cooperative work
- Information brokers
- Web sites, portals
- Content publication in general
- Mediation mode: get the XML pages when needed
- Warehouse mode: load them in advance
17Advantages of a warehouse approach
- Allows for support of complex query processing with high performance
- Allows for complex analysis of the data
- Allows for enriching the information
- Allows for better monitoring of information
- Allows for versioning, archiving, temporal queries if needed
- Mediator approach is preferable or compulsory in some applications
- Supply chain
- Comparative shopping
- Typically for volatile information such as plane ticket prices
18XML warehouse
19Main functionalities
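[Figure: warehouse architecture. Components: Feeding, Repository, Enrichment, View and Integration, Exploitation. Interfaces: Admin GUI; User GUI for access, reporting and subscription; User GUI for editing and publication; APIs. Related uses: warehousing (data warehouse) and analysis (OLAP).]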
20Main functionalities(1) Feeding
- Loading from the Web (Internet and Intranet)
- Web search
- Web crawl
- Access Web data via forms or Web services
- Plug-ins to load from
- File systems, document management systems
- Databases, LDAP
- Newsgroups, emails
- Other applications
- Extraction and transformation
- XSLT or XQuery mappings for XML sources
- XML-izers to load data from other formats
- Monitoring of the feeding
21Main functionalities(1) Feeding continued
- User feeding
- Document editing
- Meta data editing
- Using WebDAV protocol
- Publication
- By GUI or from programs (SOAP-based API)
22Main functionalities(2) Repository
- Storage of massive volumes of XML (terabytes)
- Indexing of massive volumes of XML
- By structure
- By full-text
- Linguistic support: stemming, synonyms, etc.
- Very efficient XML query processing
- Importance ranking
- Monitoring of the warehouse (support for subscriptions)
- Access control and security
- Versioning, archiving
- Recovery
- No full transaction mechanism
23Main functionalities(3) Enrichment
- Global organization
- Global schema management
- Management of collections
- Incorporate domain ontologies and thesauri
- Document classification
- Cleaning by filtering out documents from collections, etc.
- Document enrichment
- Concept extraction and tagging
- Cleaning inside the document
- Summarization, etc.
- Relationships between documents
- Tables of contents
- Indexes
- Cross referencing, etc.
24Main functionalities(4) View and integration
- View management
- Document restructuring/mapping
- Schema to schema mapping
- Semantic integration
- Manual for complex ones and (semi-)automatic for simple ones
- Tools to analyze a set of schemas
- Tools to integrate them
- Processing of queries on the integration view
- Management of virtual data in a mediator style
25Functionalities(5) Exploitation
- Access to the warehouse
- Browsing
- Querying by keywords, XPath or XQuery
- Temporal queries
- Query subscription
- Reporting
- Generation of complex reports with pointers to documents, counts, abstracts
- Organized by collections, content, domains
- By GUI or from programs (Web service-based API)
26Admin: specify the lifecycle of information in the warehouse, starting from its acquisition
- Specify with parameters (in red) the documents to process
- Add, from a toolbox, some processing to apply (in pink)
- Specify when processing should be applied (in green)
27Specifying the enrichment
- What processing should be performed
- Applications that come with the system
- Arbitrary processing provided as Web services
- Interface of services
- XML input: the documents or collection of documents in the warehouse to be processed
- XML output: the result
- Where to plug the result
- Where to store the new documents (collections, names)
- Where to put enrichments in existing documents
- When to start the processing
- At the time the document is loaded
- At some later time, assuming some information has
already been gathered (dependencies)
28User queries and reporting
- FROM clause: choose the collections of interest
- WHERE clause: choose the criteria of selection
- SELECT clause: choose what to extract as a result
- PREFER clause: quantity of results, preference ranking and possible relaxation
- ORGANIZE clause: classify/group results for presentation and drilling
- STYLE clause: choose presentation style
29Example
- From collections: MuséeRodin, WebMuseum, LACMA
- Where: Art_Item/artist Name = Rodin
- Select: Name, Owner, Annotations
- Prefer
- Rodin in title page
- Owner is public or owner is in France
- Get first 20
- Organize as
- Art_Item/material: sculpture, painting, others
- Owner
- Present as
30Xyleme: An XML warehouse
Zooms on some aspects of the technology
31Xyleme: a dynamic XML warehouse
- Scaling
- Feeder
- E.g., loading millions of Web documents per day with a single PC, and scaling up with more machines
- Repository
- E.g., storing and indexing terabytes of XML (and other formats, e.g., PDF)
- Enrichment
- E.g., tools (together with a partner) for classification and concept extraction
- View and semantic integration
- E.g., a suite of tools for XML integration
- Exploitation
- E.g., access via SOAP and graphic interfaces
321. An architecture to scale
33The scaling
- Size of data: billions of XML documents
- Size of data and index: terabytes
- Number of customers
- thousands of simultaneous queries
- millions of subscriptions
- An architecture based on distribution
34Architecture
- Cluster of PCs
- Runs on Linux, written in C++ (also Solaris)
- Communications
- Local: CORBA (Orbacus)
- External: HTTP, SOAP
- Distribution between autonomous machines
35Functional architecture
[Figure: functional architecture, showing the Internet, the Web Interface, the Query Processor, and the Repository and Index Manager.]
36Architecture and scaling
[Figure: architecture and scaling, with machines connected to the Internet and to one another over Ethernet.]
372. Data Acquisition and Maintenance of Web pages
(internet or intranet)
38Crawl the Web
- Discover HTML/XML pages on the web (intranet or internet)
- Parse/load pages and follow links
- Manage metadata for the known pages
- Do this under bounded resources
- Network bandwidth
- Memory and disk resources
- Tested on the Internet in October 2001
- Millions of pages crawled per day on each crawler
- Up to 10 crawlers and close to 1 billion HTML/XML
pages discovered in a couple of months
39Optimization
Page Scheduling
- Optimization problem
- Decide which page to crawl or refresh next to optimize the quality of the warehouse
- Criteria
- Read important pages more often
- Based on customers' preferences
- Page importance can also be used to order query results
- Don't read a page that is probably up-to-date
- Uses an estimate of the change frequency for each page
- Advantages
- Have a fresh view of useful portions of
information
40Page scheduling
- Determine which page to read next
- minimize a particular cost function under some constraint (bandwidth of crawlers)
- The penalty for a page takes into account
- importance of the page (to be defined next)
- customer needs (obtained via pub/sub)
- staleness of the data
- penalty for being out of date
- penalty for aging
- The page scheduler fully controls the crawling
- vs. random crawling in classic search engines
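The slides do not give the actual cost function, so the following is only a sketch of the idea: rank pages by a penalty that grows with importance, customer interest and expected staleness, and read the top pages that fit in the crawl budget. The `Page` fields, the penalty formula and the sample values are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class Page:
    url: str
    importance: float       # link-based importance (assumed scale 0..1)
    customer_weight: float  # extra weight from subscriptions (hypothetical)
    change_rate: float      # estimated changes per day
    last_read: float        # time of the last read, in days

def penalty(page, now):
    """Illustrative penalty: expected missed changes, weighted by how much the page matters."""
    expected_changes = page.change_rate * (now - page.last_read)
    return (page.importance + page.customer_weight) * expected_changes

def schedule(pages, now, budget):
    """Return the URLs of the `budget` pages with the highest penalty."""
    ranked = sorted(pages, key=lambda p: penalty(p, now), reverse=True)
    return [p.url for p in ranked[:budget]]

pages = [
    Page("a", importance=0.9, customer_weight=0.0, change_rate=1.0, last_read=9.0),
    Page("b", importance=0.1, customer_weight=0.0, change_rate=1.0, last_read=0.0),
    Page("c", importance=0.5, customer_weight=0.5, change_rate=0.1, last_read=5.0),
]
next_reads = schedule(pages, now=10.0, budget=2)  # ["b", "a"]
```

In this toy run the unimportant page "b" is read first because it has not been read for ten days, while "c" changes so rarely that it is probably still up to date.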
41Page Importance
- Based on customers' criteria and on the link structure of the web
- Intuition: a page is important if many important pages reference it
- Fixpoint definition: importance vector Imp
- Proposed by IBM, used by search engines such as Google
- Link matrix: $M(i,j) = 1$ if page $i$ refers to page $j$ (0 otherwise)
- Outdegree of page $i$: $\mathrm{out}(i)$
- Initialization: $\mathrm{Imp}_0(k) = 1/N$
- Iteration: $\mathrm{Imp}_m(k) = \sum_i M(i,k)\,\mathrm{Imp}_{m-1}(i)/\mathrm{out}(i)$
- Imp is the limit
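The fixpoint iteration above can be sketched directly in Python. This is the plain off-line iteration from the slide, not the on-line algorithm developed by Xyleme. It assumes every page has at least one outgoing link and that the graph is such that the iteration converges; the three-page graph is a made-up example.

```python
def page_importance(links, iterations=50):
    """Iterate Imp_m(k) = sum_i M(i,k) * Imp_{m-1}(i) / out(i).

    links[i] is the list of pages that page i refers to, so M(i,k) = 1
    exactly when k is in links[i], and out(i) = len(links[i]).
    """
    n = len(links)
    imp = [1.0 / n] * n                      # Imp_0(k) = 1/N
    for _ in range(iterations):
        nxt = [0.0] * n
        for i, targets in enumerate(links):
            share = imp[i] / len(targets)    # Imp_{m-1}(i) / out(i)
            for k in targets:
                nxt[k] += share
        imp = nxt
    return imp

# Hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
importance = page_importance([[1, 2], [2], [0]])
```

For this graph the limit is (0.4, 0.2, 0.4): page 2 is referenced by both other pages, and page 0 inherits all of page 2's importance.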
42Page Importance
- Novel technology developed by Xyleme
- Patent pending
- On-line evaluation of page importance
- Uses far fewer resources
- Faster reaction to changes on the web
432. XML Repository
44Storing XML
- Document systems
- Good for keyword search
- No or inefficient support for structure search
- Relational store (e.g., Oracle 8i)
- Well adapted for some applications
- Strongly typed, tabular data: efficient
- Otherwise: too many joins, inefficient
- Object database store (e.g., Excelon) and native XML databases (e.g., Tamino)
- Same issues
- Xyleme: native XML storage
45Repository
- Goal
- minimize I/O for direct access and scanning
- efficient direct accesses both with full-text indexing and structure indexing
- good compaction but not at the cost of access
- Efficient storage of trees
- use fixed-length storage pages
- variable-length records inside a page
- Main issue: tree balancing
46Tree Balancing
[Figure: a tree split across Record 1, Record 2 and Record 3.]
47Tree Balancing
Large collections may use several records
483. Semantic Data Integration
49Classification
- Based on word occurrences in documents and statistical resources
- Classification by semantic domain
- Classification by language
- Use the XXX classifier
50Semantic Integration
- Web Heterogeneity
- Many possible types for data in a particular domain, many DTDs
- Semantic integration
- one abstract DTD for the domain
- gives the illusion that the system maintains a homogeneous database for this domain
- 1 domain = 1 abstract DTD
51Views
- Choose an abstract DTD for each domain
- For each concrete DTD in a domain, find how it relates to the abstract DTD using linguistic tools such as WordNet
- Provide relationships between paths in the concrete and abstract DTDs
- Possibly automatic, manual or hybrid
- With manual mapping, a domain expert may specify much more complex views
- Query processing: process queries on the abstract DTD
524. Query Processing
53Query Language
- Today: a mix of OQL and XQL
- Tomorrow: the future W3C standard
- Example
- select product/name, product/price
- from doc in catalogue,
- product in doc/product
- where product//components contains "flash"
- and product/description contains "camera"
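To show what the example query computes, here is a sketch that evaluates it naively in Python with `xml.etree.ElementTree`, one document at a time. It is not how Xyleme processes queries (that relies on indexes and distribution). The sample document, the function names, and the reading of "contains" as a case-insensitive substring test are assumptions.

```python
import xml.etree.ElementTree as ET

def contains(elements, word):
    """True if `word` occurs in the text found below any of `elements`."""
    return any(word in "".join(el.itertext()).lower() for el in elements)

def run_query(docs):
    """select product/name, product/price for products matching both conditions."""
    results = []
    for doc in docs:                                   # doc in catalogue
        root = ET.fromstring(doc)
        for product in root.findall("product"):        # product in doc/product
            # product//components: components at any depth below product
            has_flash = contains(product.findall(".//components"), "flash")
            # product/description: direct child only
            is_camera = contains(product.findall("description"), "camera")
            if has_flash and is_camera:
                results.append((product.findtext("name"), product.findtext("price")))
    return results

# Hypothetical document of the catalogue collection.
DOC = """
<catalogue>
  <product>
    <name>X10</name><price>199</price>
    <description>Compact camera</description>
    <parts><components>lens, flash</components></parts>
  </product>
  <product>
    <name>T2</name><price>30</price>
    <description>Tripod</description>
    <components>legs</components>
  </product>
</catalogue>
"""
matches = run_query([DOC])
```

Only the first product qualifies: its description mentions a camera and a `components` element nested below it mentions a flash.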
54Data Distribution
- Cluster of documents: physical collection of documents (? semantic domain)
- Distribution
- Storage machine
- in charge of a cluster of documents
- Index machine
- index for a cluster
55Step0 Indexing
- Standard inverted index
- word → documents that contain this word
- Xyleme index
- word → elements that contain this word
- (document, element identifier)
- Goal: more work can be performed without accessing data
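The difference between a document-level and an element-level inverted index can be sketched as follows. This is an illustration only: the function name, the sample document, and the use of document-order positions as element identifiers are assumptions, and only each element's own leading text is indexed.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def build_index(docs):
    """Map each word to the set of (doc_id, element_id) pairs whose own text contains it."""
    index = defaultdict(set)
    for doc_id, xml_text in docs.items():
        root = ET.fromstring(xml_text)
        # Element ids are simply positions in document order, starting at 1.
        for element_id, element in enumerate(root.iter(), start=1):
            for word in (element.text or "").lower().split():
                index[word].add((doc_id, element_id))
    return index

docs = {
    "d1": "<product><name>camera</name>"
          "<description>camera with flash</description></product>",
}
index = build_index(docs)

# Elements containing both words, found from the index alone.
both = index["camera"] & index["flash"]
```

Because postings point at elements rather than whole documents, the conjunction is resolved to element 3 (the description) without reading the document itself.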
56Step1 Localization
- Query on an abstract DTD
- Localization of machines that host concrete DTDs that will participate in the query
[Figure: a global query on the abstract DTD, e.g., catalogue/product/price, is relevant for machine 56 and machine 45; it becomes a union of local queries on those machines.]
57Step2 Optimization
- Algebraic rewriting
- Linear search strategy based on simple heuristics
- use in-memory indexes
- minimize communication
- Optimization of the global plan
- Optimization of the local plans
58Step3 Execution
- A plan usually consists of
- 1. parallel translation from abstract queries to concrete patterns on the relevant index machines
- 2. parallel index scans to identify the relevant elements for a concrete pattern
- 3. parallel construction of resulting elements
- 4. pipeline evaluation (i.e., no intermediate data structure)
- Note: step 2 requires smart indexes
59Abstract2Concrete
For catalogue/product/price, scan the relevant concrete patterns → d1//camera/price, d2/product/cost, d3/piano/price ...
For each concrete pattern, the local plan is optimized dynamically
For each concrete pattern, scan the element ids → 234, 177
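A small sketch of the abstract-to-concrete step, reusing the paths and element ids from the slide. The dictionary layout, the assignment of ids to patterns, and the function name `scan` are assumptions; a generator stands in for pipelined evaluation, since results are produced one at a time without building an intermediate structure.

```python
# Abstract path -> concrete patterns (patterns taken from the slide).
MAPPING = {
    "catalogue/product/price": [
        "d1//camera/price",
        "d2/product/cost",
        "d3/piano/price",
    ],
}

# Hypothetical index content: element ids matching each concrete pattern.
PATTERN_INDEX = {
    "d1//camera/price": [234],
    "d2/product/cost": [177],
}

def scan(abstract_path):
    """Yield (concrete pattern, element id) pairs, one concrete pattern at a time."""
    for pattern in MAPPING.get(abstract_path, []):
        for element_id in PATTERN_INDEX.get(pattern, []):
            yield pattern, element_id

hits = list(scan("catalogue/product/price"))
```

A pattern with no matching elements, such as `d3/piano/price` here, simply contributes nothing to the union.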
60Identifiers
- Essential for query processing
- Identifier: (preorder rank, postorder rank)
- $X$ is an ancestor of $Y$ $\iff$ $\mathrm{pre}(X) < \mathrm{pre}(Y)$ and $\mathrm{post}(X) > \mathrm{post}(Y)$
- E.g., $2 < 5$ and $4 > 2$ $\Rightarrow$ $(2,4)$ is an ancestor of $(5,2)$
[Figure: example tree with nodes A to G and a text leaf, each node annotated with its preorder and postorder rank.]
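The ancestor test on (preorder, postorder) identifiers can be sketched as below. The tree shape is a hypothetical one, chosen so that it reproduces the pair (2,4) and (5,2) from the example; it is not necessarily the tree drawn on the slide.

```python
def assign_ids(tree):
    """tree is (label, [children]); return {label: (preorder rank, postorder rank)}."""
    ids = {}
    counters = {"pre": 0, "post": 0}

    def visit(node):
        label, children = node
        counters["pre"] += 1          # rank on the way down
        pre = counters["pre"]
        for child in children:
            visit(child)
        counters["post"] += 1         # rank on the way back up
        ids[label] = (pre, counters["post"])

    visit(tree)
    return ids

def is_ancestor(x, y):
    """x and y are (pre, post) identifiers; no access to the tree is needed."""
    return x[0] < y[0] and x[1] > y[1]

# Hypothetical tree: A has children B, F, G; B has children C, D; D has child E.
TREE = ("A", [("B", [("C", []), ("D", [("E", [])])]), ("F", []), ("G", [])])
ids = assign_ids(TREE)
```

Here B gets (2,4) and E gets (5,2), so `is_ancestor(ids["B"], ids["E"])` holds by comparing four integers, which is why such identifiers make structural joins over index entries cheap.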
615. Change Control
62Change management
- Users are often interested in changes to the web
- Change monitoring
- query subscription
- Soon to come: version management
- representation and storage of changes
63Query Subscription
- Users subscribe to certain events such as
- Update of a particular page, a page in a given site
- Discovery of a new page containing some specific words
- Insertion of a particular element in some pages (new products in a catalog)
- Detection of illegal copies of selected documents
- Users may request to be notified
- Immediately at the time the event is detected
- Regularly, e.g., weekly
- After a certain number of event detections
64Examples
- subscription myPariscope
- What are the new movie entries on the Pariscope site?
- monitoring newMovies
- select URL
- where URL extends www.pariscope.fr/movies/
- and new(self)
- Manage the changes in the movies showing in Paris
- continuous delta Showing
- select ... from ... where
- when daily
- notify daily (send me a daily report)
65Atomic Events
- Loading of millions of pages/day
- Atomic event 46: URL matches pattern www.xyz.com/
- Atomic event 67: XML document contains the tag soccer
[Figure: during loading, a document d passes through the metadata manager, the HTML parser and the XML loader; the atomic events it raises (d/46, then d/46,67) are sent to complex event detection.]
66Complex Events
- Several millions of pages crawled per day
- Hundreds of millions of alerts raised
- Millions of subscriptions
- Complex event 12: 67 and 46 (XML document contains the tag soccer and URL matches pattern www.xyz.com/)
[Figure: the HTML parser and the XML loader feed complex event detection.]
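The relationship between atomic and complex events can be sketched as a conjunction test over the set of atomic events raised for one document. This is a naive illustration, not the scalable algorithm used by Xyleme; the class and method names are made up, and complex event 13 and atomic event 99 are invented for the example.

```python
from collections import defaultdict

class ComplexEventDetector:
    """A complex event fires when all of its atomic events were raised for a document."""

    def __init__(self):
        self.required = {}                  # complex id -> frozenset of atomic ids
        self.by_atomic = defaultdict(set)   # atomic id -> complex ids that use it

    def register(self, complex_id, atomic_ids):
        self.required[complex_id] = frozenset(atomic_ids)
        for atomic_id in atomic_ids:
            self.by_atomic[atomic_id].add(complex_id)

    def detect(self, raised):
        """Return the sorted complex events satisfied by the atomic events in `raised`."""
        raised = set(raised)
        # Only complex events sharing at least one raised atomic event are candidates.
        candidates = set()
        for atomic_id in raised:
            candidates |= self.by_atomic.get(atomic_id, set())
        return sorted(c for c in candidates if self.required[c] <= raised)

detector = ComplexEventDetector()
detector.register(12, [67, 46])   # tag soccer and URL pattern, as on the slide
detector.register(13, [46, 99])   # hypothetical second subscription
fired = detector.detect([46, 67])
```

Looking up candidates through the atomic events keeps the work per document proportional to the subscriptions it could possibly trigger, rather than to all registered subscriptions.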
67Notification Processing
- Very efficient/scalable algorithm for complex event detection
- Notifications by
- Email
- Web posting
- Web services in SOAP
- Millions of notifications/day
[Figure: complex event detection sends alerts to the notification processor, which sends out notifications.]
68Xyleme in short
- Spin-off of INRIA (National Research Institute)
- Technology developed in a research project of 60 man-years
- Creation of Xyleme SA in September 2000
- Now about 25 persons: 13 R&D, 4 services, 10 marketing, sales, admin.
- Customers include a press agency (AFP), newspaper groups (Moniteur, Le Monde), the national library (BNF)
- First round of capital in 2000 (SGAM, Viventures)
- Second round in 2002 (Deutsche Bank)
69Thank you
If you want to know more about Xyleme:
http://www.xyleme.com
Serge.Abiteboul_at_xyleme.com
Amir.Milo_at_xyleme.com