Title: XML Warehousing and Xyleme

1
XML Warehousing and Xyleme

S. Abiteboul
INRIA and Xyleme
Serge.Abiteboul_at_inria.fr
December 2002

2
Organization

The context and motivations
XML warehouse
Xyleme: an XML warehouse
Zooms on some aspects of the technology
Scaling
Mass storage of XML
XML query processing
Semantic integration
Web page ranking
Query subscription
Xyleme, the company, in brief

3
The context

The Web and XML are dramatically changing the
world of distributed information

4
The Web of yesterday

Protocol: HTTP
Documents: HTML
Millions of independent web sites and billions of
documents
Browsing and keyword search (full-text indexing)
Publication of databases using forms
Data management with the Web
HTML is primarily for humans
Data management applications on the Web
Based on hand-made wrappers
Expensive, incomplete, short-lived, not adapted
to the Web's constant change
No real support for distributed data management!

5
What is changing

Information used to live in islands and a lot of
its value was wasted
Different formats: relational, metadata,
documents and text, data exchange formats
A Web standard for data exchange, XML, is fixing
it
XML can capture all kinds of information across a
wide spectrum
XML comes with a family of emerging standards:
XML Schema, XSLT, XQuery, domain-specific
schemas
Different computers, platforms, languages,
applications
Web services, e.g., SOAP, are fixing it
SOAP allows ubiquitous computing on the Internet
SOAP comes with a family of emerging standards:
WSDL, UDDI

6
What is changing

XML and Web services provide uniform access to
information, independent of platform, system,
language, communication protocol and data format
The dream for distributed data management
The gathering, integration, consolidation,
analysis of distributed information become
feasible at a much lower cost

7
(1) XML covers the information spectrum
[Figure: kinds of information placed along a spectrum labeled Structured Data, Meta data, and Hierarchy: books, contracts, catalogs, bank accounts, emails, financial reports, insurance policies, economic analysis, derivatives, inventory, political analysis, insurance claims, financial news, sports news, resumes]
8
XML covers the information spectrum

Very structured information such as databases
Most DBMS now export in XML
Semi-structured data such as data exchange
formats (ASN.1, SGML), e.g., technical
documentation
Documents
Meta-data: author, date, status
Existing structure in them: chapter, section,
table of contents and index
Possibly tagging of elements in it (citation,
lists)
Links to other documents
Meta data for unstructured data such as images
and sound
Plain text

XML
9
XML's asset: the marriage of text and structure

labeled ordered trees where leaves are text
Marriage of document and database worlds
Marriage of full text indexing (keyword search)
and structure indexing (SQL-style query)
Is it the ultimate data model? No
Purely syntax: more semantics needed
Is it OK for now? Definitely yes (because it is a
standard)

10
XML's asset: typing

Applications need typing and XML data can be
typed if needed (DTD and XML schema)
Trees
Logical granularity: neither page nor document
level, but the piece of information that is
needed
Semantics and structure are in tags and paths
product-table/product/reference
product-table/product/price
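
The idea that semantics and structure live in tags and paths can be sketched with Python's standard library. The catalog content and the helper name `values_at` are assumptions made for illustration; only the two paths come from the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog; element names follow the paths on the slide.
CATALOG = """
<product-table>
  <product><reference>X100</reference><price>120</price></product>
  <product><reference>X200</reference><price>80</price></product>
</product-table>
"""

def values_at(xml_text, path):
    """Return the text of every element reached by a root-anchored path."""
    root = ET.fromstring(xml_text)
    root_tag, _, rest = path.partition("/")
    if root.tag != root_tag:
        return []          # the path does not start at this document's root
    if not rest:
        return [root.text or ""]
    return [el.text or "" for el in root.findall(rest)]

references = values_at(CATALOG, "product-table/product/reference")
prices = values_at(CATALOG, "product-table/product/price")
```

Each path addresses a piece of information inside the document, which is the granularity the slide argues for, rather than the whole page or document.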

11
HTML: hard
Text and presentation: where is the data?
12
XML: easy
Data and structure: semistructured (presentation
elsewhere)
13
(2) Web services and ubiquitous distributed
computing

Possibility to activate a method on some remote
web server
Exchange information in XML: input and result are
in XML
Ubiquitous XML distributed computing
infrastructure
2 main applications
E-commerce
Access to remote data
With XML and Web services, it is possible
To get information from virtually anywhere
To provide information to virtually anywhere

14
Accessing remote information
[Figure: an application using gene banks queries some data services that provide candidate genes and uses some processing services; the gene banks have heterogeneous formats, protocols, etc.]
15
Same with web services
[Figure: the same application using gene banks; the data services that provide candidate genes and the processing services are reached over the Web, with uniform access to information]
16
XML and Web services

Exchange of information
E-commerce, B2B, G2C
Cooperative work
Information brokers
Web sites, portals
Content publication in general
Mediation mode: get the XML pages when needed
Warehouse mode: load them in advance

17
Advantages of a warehouse approach

Allows for support of complex query processing
with high performance
Allows for complex analysis of the data
Allows for enriching the information
Allows for better monitoring of information
Allows for versioning, archiving, temporal
queries if needed
Mediator approach is preferable or compulsory in
some applications
Supply chain
Comparative shopping
Typically for volatile information such as plane
ticket price

18
XML warehouse
19
Main functionalities
[Figure: main functionalities. Labels: Admin GUI; User GUI (Access, Reporting, Sub); User GUI (Editing, Pub); View/Integration; Enrichment; Feeding; Exploitation; Repository; API; Warehousing (data warehouse); Analysis (OLAP)]
20
Main functionalities (1): Feeding

Loading from the Web (Internet and Intranet)
Web search
Web crawl
Access Web data via forms or Web services
Plug-ins to load from
File systems, document management systems
Data bases, LDAP
Newsgroup, emails
Other applications
Extraction and transformation
XSLT or XQuery mappings for XML sources
XML-izers to load data from other formats
Monitoring of the feeding

21
Main functionalities (1): Feeding, continued

User feeding
Document editing
Meta data editing
Using WebDAV protocol
Publication
By GUI or from programs (SOAP-based API)

22
Main functionalities (2): Repository

Storage of massive volume of XML (terabytes)
Indexing of massive volume of XML
By structure
By full-text
Linguistic support: stemming, synonyms, etc.
Very efficient XML query processing
Importance ranking
Monitoring of the warehouse (support for
subscriptions)
Access control and security
Versioning, archiving
Recovery
No full transaction mechanism

23
Main functionalities (3): Enrichment

Global organization
Global schema management
Management of collections
Incorporate domain ontologies and thesauri
Document classification
Cleaning by filtering out documents from
collections, etc.
Document enrichment
Concept extraction and tagging
Cleaning inside the document
Summarization, etc.
Relationships between documents
Tables of contents
Tables of index
Cross referencing, etc.

24
Main functionalities (4): View and integration

View management
Document restructuring/mapping
Schema to schema mapping
Semantic integration
Manual for complex ones and (semi-) automatic for
simple ones
Tools to analyze a set of schemas
Tools to integrate them
Processing for queries on integration view
Management of virtual data in a mediator style

25
Main functionalities (5): Exploitation

Access to the warehouse
Browsing
Querying by keywords, XPath or XQuery
Temporal queries
Query subscription
Reporting
Generation of complex reports with pointers to
documents, counts, abstracts
Organized by collections, content, domains
By GUI or from programs (Web service-based API)

26
Admin: specify the lifecycle of information in
the warehouse, starting from its acquisition

Specify with parameters (in red) documents to
process
Add from a toolbox, some processing to apply (in
pink)
Specify when processing should be applied (in
green)

27
Specifying the enrichment

What processing should be performed
Applications that come with the system
Arbitrary processing provided as Web services
Interface of services
XML input: the documents or collection of
documents in the warehouse to be processed
XML output: the result
Where to plug the result
Where to store the new documents (collections,
names)
Where to put enrichments in existing documents
When to start the processing
At the time the document is loaded
At some later time, assuming some information has
already been gathered (dependencies)

28
User queries and reporting
Choose the collections of interest: FROM clause
Choose the criteria of selection: WHERE clause
Choose what to extract as a result: SELECT clause
Quantity of results, preference ranking and
possible relaxation: PREFER clause
Classify/group results for presentation and
drilling: ORGANIZE clause
Choose presentation style: STYLE clause
29
Example

From collections MuséeRodin, WebMuseum, LACMA
Where Art_Item/artist Name = Rodin
Select Name, Owner, Annotations
Prefer
Rodin in title page
Owner is public or owner is in France
Get first 20
Organize as
Art_Item/material sculpture, painting, others
Owner
Present as

30
Xyleme: an XML warehouse
Zooms on some aspects of the technology
31
Xyleme: a dynamic XML warehouse

Scaling
Feeder
E.g., loading with a single PC millions of Web
documents per day and scale up with more
machines
Repository
E.g., storing and indexing of terabytes of XML
(and other formats, e.g., PDF)
Enrichment
E.g., tools (together with partners) for
classification and concept extraction
View and semantic integration
E.g., a suite of tools for XML integration
Exploitation
E.g., access via SOAP and graphic interfaces

32
1. An architecture to scale
33
The scaling

Size of data: billions of XML documents
Size of data and index: terabytes
Number of customers
thousands of simultaneous queries
millions of subscriptions
An architecture based on distribution

34
Architecture

Cluster of PCs
Runs on Linux and C++ (also Solaris)
Communications
local: Corba (Orbacus)
external: HTTP, SOAP
Distribution between autonomous machines

35
Functional architecture
[Figure: functional architecture. Labels: Internet; Web Interface; Query Processor; Repository and Index Manager]
36
Architecture and scaling
[Figure: architecture and scaling. Labels: Internet; Ethernet]
37
2. Data Acquisition and Maintenance of Web pages
(internet or intranet)
38
Crawl the Web

Discover HTML/XML pages on the web (intranet or
internet)
Parse/load pages and follow links
Manage metadata for the known pages
Do this under bounded resources
Network bandwidth
Memory and disk resources
Tested on the Internet in October 2001
Millions of pages crawled per day on each crawler
Up to 10 crawlers and close to 1 billion HTML/XML
pages discovered in a couple of months

39
Optimization
Page Scheduling

Optimization problem
Decide which page to crawl or refresh next to
optimize the quality of the warehouse
Criteria
Read important pages more often
Based on customers' preferences
Page importance can also be used to order query
results
Don't read a page that is probably up-to-date
Uses an estimate of the change frequency for each
page
Advantages
Have a fresh view of useful portions of
information

40
Page scheduling

Determine which page to read next
minimize a particular cost function under some
constraint (bandwidth of crawlers)
The penalty for a page takes into account
importance of the page (to be defined next)
customer needs (obtained via pub/sub)
staleness of the data
penalty for being out of date
penalty for aging
The page scheduler fully controls the crawling
vs. random crawling in classic search engines
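
A minimal sketch of penalty-driven scheduling follows. The slide does not give the cost function, so the `penalty` formula, its weights, and every field name below are assumptions chosen only to combine the listed ingredients (importance, customer needs, staleness); this is not Xyleme's scheduler.

```python
import heapq
from dataclasses import dataclass

@dataclass
class Page:
    url: str
    importance: float     # e.g. link-based page importance
    subscriptions: int    # customer interest (hypothetical proxy: subscription count)
    change_rate: float    # estimated changes per day
    last_read: float      # day of the last crawl

def penalty(page, now, w_importance=1.0, w_customer=0.5):
    """Hypothetical penalty: page weight times expected number of missed changes."""
    age = now - page.last_read
    expected_changes = page.change_rate * age
    weight = w_importance * page.importance + w_customer * page.subscriptions
    return weight * expected_changes

def schedule(pages, now, budget):
    """Pick the `budget` pages with the highest penalty to read next.

    `budget` stands in for the bandwidth constraint of the crawlers.
    """
    return heapq.nlargest(budget, pages, key=lambda p: penalty(p, now))

pages = [
    Page("a", importance=0.9, subscriptions=0, change_rate=1.0, last_read=9.0),
    Page("b", importance=0.1, subscriptions=0, change_rate=0.1, last_read=0.0),
    Page("c", importance=0.2, subscriptions=2, change_rate=0.5, last_read=8.0),
]
next_urls = [p.url for p in schedule(pages, now=10.0, budget=2)]
```

With these made-up numbers, page `b` is old but rarely changes and is unimportant, so it is skipped even though it was read longest ago.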

41
Page Importance

Based on customers' criteria and on the link
structure of the web
Intuition: a page is important if many important
pages reference it
Fixpoint definition: importance vector $\mathrm{Imp}$
Proposed by IBM; used by search engines such as
Google
Link matrix: $M(i,j) = 1$ if page $i$ refers to page $j$, and $0$ otherwise
Outdegree of page $i$: $\mathrm{out}(i)$
$\mathrm{Imp}_0(k) = 1/N$ (initialization)
$\mathrm{Imp}_m(k) = \sum_i M(i,k)\,\mathrm{Imp}_{m-1}(i)/\mathrm{out}(i)$
(iteration)
$\mathrm{Imp}$ is the limit of $\mathrm{Imp}_m$
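
The fixpoint iteration above can be sketched as a plain off-line loop. This illustrates the classic definition on the slide, not the on-line algorithm described on the next slide. It assumes that every page has at least one outgoing link, a case the slide does not discuss, and the adjacency-list input format is a choice made for this example.

```python
def page_importance(links, iterations=50):
    """Iterate Imp_m(k) = sum_i M(i,k) * Imp_{m-1}(i) / out(i).

    links[i] is the list of pages that page i refers to, so M(i,k) = 1
    exactly when k is in links[i], and out(i) = len(links[i]).
    """
    n = len(links)
    if any(len(targets) == 0 for targets in links):
        # Assumption of this sketch: no page without outgoing links.
        raise ValueError("every page needs at least one outgoing link")
    imp = [1.0 / n] * n                       # Imp_0(k) = 1/N
    for _ in range(iterations):
        nxt = [0.0] * n
        for i, targets in enumerate(links):
            share = imp[i] / len(targets)     # Imp_{m-1}(i) / out(i)
            for k in targets:
                nxt[k] += share
        imp = nxt
    return imp

# Hypothetical 3-page web: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
imp = page_importance([[1, 2], [2], [0]])
```

On this small graph the iteration settles with pages 0 and 2 equally important and page 1 half as important, since page 1 receives only half of page 0's importance.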

42
Page Importance

Novel technology developed by Xyleme
Patent pending
On-line evaluation of page importance
Uses far fewer resources
Faster reaction to changes on the web

43
2. XML Repository
44
Storing XML

Document systems
Good for keyword search
No or inefficient support for structure search
Relational store (e.g., Oracle 8i)
Well adapted for some applications
Strongly typed data and tables: efficient
Otherwise too many joins and inefficient
Object database store (e.g., eXcelon) and native
XML databases (e.g., Tamino)
Same issues
Xyleme: native XML storage

45
Repository

Goal
minimize I/O for direct access and scanning
efficient direct accesses both with full-text
indexing and structure indexing
good compaction but not at the cost of access
Efficient storage of trees
use fixed length storage pages
variable length records inside a page
Main issue: tree balancing

46
Tree Balancing
[Figure: a tree split across Record 1, Record 2, and Record 3]
47
Tree Balancing
Large collections may use several records
48
3. Semantic Data Integration
49
Classification

Based on word occurrences in document and
statistical resources
Classification by semantic domain
Classification by language
Use the XXX classifier

50
Semantic Integration

Web Heterogeneity
Many possible types for data in a particular
domain, many DTDs
Semantic Integration
one abstract DTD for the domain
gives the illusion that the system maintains a
homogeneous database for this domain
1 domain, 1 abstract DTD

51
Views

Choose an abstract DTD for each domain
For each concrete DTD in a domain, find how it
relates to the abstract DTD using linguistic
tools such as WordNet
Provide relationships between paths in the
concrete and abstract DTD
Possibly automatic, manual or hybrid
With manual mapping, a domain expert may specify
much more complex views
Query processing: process queries on the abstract
DTD

52
4. Query Processing
53
Query Language

Today: a mix of OQL and XQL
Tomorrow: the future W3C standard
Example
select product/name, product/price
from doc in catalogue,
     product in doc/product
where product//components contains "flash"
  and product/description contains "camera"
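
To make the semantics of the example query concrete, here is a rough Python equivalent over an in-memory catalogue. The sample documents, the helper `contains`, and the reading of `contains` as a case-insensitive substring test are assumptions for illustration; the Xyleme query processor works on indexes, not by scanning documents like this.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalogue with a single document.
CATALOGUE = ["""
<catalogue>
  <product>
    <name>C1</name><price>199</price>
    <description>Compact camera</description>
    <parts><components>flash, lens</components></parts>
  </product>
  <product>
    <name>P1</name><price>900</price>
    <description>Upright piano</description>
  </product>
</catalogue>
"""]

def contains(elements, word):
    """True if the text under any of the elements contains the word."""
    return any(word in "".join(e.itertext()).lower() for e in elements)

def run_query(docs):
    results = []
    for xml_text in docs:                        # from doc in catalogue
        doc = ET.fromstring(xml_text)
        for product in doc.findall("product"):   # product in doc/product
            if (contains(product.findall(".//components"), "flash")
                    and contains(product.findall("description"), "camera")):
                # select product/name, product/price
                results.append((product.findtext("name"),
                                product.findtext("price")))
    return results

rows = run_query(CATALOGUE)
```

The `//` step in `product//components` matches components at any depth below the product, which is why the sketch uses `.//components` while `description` is matched as a direct child.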

54
Data Distribution

Cluster of documents: physical collection of
documents (? semantic domain)
Distribution
Storage machine
in charge of a cluster of documents
Index machine
index for a cluster

55
Step 0: Indexing

Standard inverted index
word → documents that contain this word
Xyleme index
word → elements that contain this word
(document + element identifier)
Goal: more work can be performed without
accessing data
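
The difference between a document-level and an element-level inverted index can be sketched as follows. The function name, the sample document, and the use of the preorder position as element identifier are assumptions of this example, not the actual Xyleme index layout.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def build_index(docs):
    """Map each word to the (document id, element id) pairs that contain it.

    The element id is the element's preorder position in its document
    (1 for the root). Only the text directly inside an element is indexed.
    """
    index = defaultdict(set)
    for doc_id, xml_text in docs.items():
        root = ET.fromstring(xml_text)
        # root.iter() walks the tree in document (preorder) order.
        for elem_id, elem in enumerate(root.iter(), start=1):
            for word in (elem.text or "").lower().split():
                index[word].add((doc_id, elem_id))
    return index

docs = {
    "d1": "<product><name>camera</name>"
          "<description>digital camera with flash</description></product>",
}
index = build_index(docs)
camera_hits = sorted(index["camera"])
flash_hits = sorted(index["flash"])
```

Because each posting names an element and not just a document, a condition such as "description contains camera" can be narrowed down from the index alone, before any document is read.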

56
Step 1: Localization

Query on an abstract DTD
Localization of machines that host concrete DTDs
that will participate in the query

[Figure: a global query on the abstract DTD (catalogue/product/price, relevant for machine 56 and machine 45) becomes a union of local queries on the local machines]
57
Step 2: Optimization

Algebraic rewriting
Linear search strategy based on simple heuristics
use in memory indexes
minimize communication
Optimization of the global plan
Optimization of the local plans

58
Step 3: Execution

A plan usually consists of
1. parallel translation from abstract queries to
concrete patterns on the relevant index machines
2. parallel index scans to identify the relevant
elements for a concrete pattern
3. parallel construction of resulting elements
4. pipeline evaluation (i.e., no intermediate
data structure)
Note: step 2 requires smart indexes

59
Abstract2Concrete
for catalogue/product/price scan relevant
concrete pattern ? d1//camera/price ?
d2/product/cost ? d3/piano/price ...
For each concrete pattern, the local plan is
optimized dynamically
for each concrete pattern scan the element ids
? 234 ? 177
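
The translation from an abstract path to concrete patterns can be sketched as a table lookup followed by one index scan per pattern. The sketch assumes that the results of the concrete patterns are combined by union, as the localization slide suggests with "union of queries on local machines". The mapping table layout, the function names, and the `scan` callback are hypothetical.

```python
# Hypothetical mapping from an abstract path to (concrete DTD, pattern) pairs.
MAPPINGS = {
    "catalogue/product/price": [
        ("d1", "//camera/price"),
        ("d2", "product/cost"),
        ("d3", "piano/price"),
    ],
}

def abstract_to_concrete(abstract_path, mappings=MAPPINGS):
    """Return the concrete patterns registered for an abstract path."""
    return list(mappings.get(abstract_path, []))

def evaluate(abstract_path, scan, mappings=MAPPINGS):
    """Union of the element ids found for every concrete pattern.

    `scan(dtd, pattern)` stands in for an index scan on the machine that
    hosts the concrete DTD; it returns an iterable of element ids.
    """
    result = set()
    for dtd, pattern in abstract_to_concrete(abstract_path, mappings):
        result |= set(scan(dtd, pattern))
    return result

# Stand-in for the index machines: two patterns have matching elements.
fake_index = {("d1", "//camera/price"): [234], ("d2", "product/cost"): [177]}
ids = evaluate("catalogue/product/price",
               lambda dtd, pattern: fake_index.get((dtd, pattern), []))
```

In the real system the scans would run in parallel on the relevant index machines; the sequential loop here only shows how the pieces fit together.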
60
Identifiers

Essential for query processing
Identifier (preorder rank/postorder rank)
$X$ ancestor of $Y \iff \mathrm{pre}(X) < \mathrm{pre}(Y)$ and
$\mathrm{post}(X) > \mathrm{post}(Y)$
E.g., $2 < 5$ and $4 > 2 \Rightarrow (2,4)$ is an ancestor of $(5,2)$

[Figure: a tree with nodes A to G and text leaves, each node labeled with its preorder and postorder ranks]
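
The identifier scheme can be sketched in a few lines: number each node by preorder and postorder rank, then answer ancestor questions by comparing two pairs of integers. The nested-tuple tree format and the sample tree are assumptions of this example; the tree in the slide's figure is not reproduced.

```python
def assign_ranks(tree):
    """Return {label: (pre, post)} for a tree given as (label, [children])."""
    ranks = {}
    counters = {"pre": 0, "post": 0}

    def visit(node):
        label, children = node
        counters["pre"] += 1
        pre = counters["pre"]            # rank when the node is first reached
        for child in children:
            visit(child)
        counters["post"] += 1            # rank when the node is left
        ranks[label] = (pre, counters["post"])

    visit(tree)
    return ranks

def is_ancestor(x, y):
    """x, y are (pre, post) identifiers; True if x is an ancestor of y."""
    return x[0] < y[0] and x[1] > y[1]

# Hypothetical tree: A has children B and E; B has children C and D.
tree = ("A", [("B", [("C", []), ("D", [])]), ("E", [])])
ranks = assign_ranks(tree)
```

The test needs no access to the document itself, which is what lets structural joins run on index entries alone.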
61
5. Change Control
62
Change management

Users are often interested in changes to the web
Change monitoring
query subscription
Soon to come: version management
representation and storage of changes

63
Query Subscription

Users subscribe to certain events such as
Update of a particular page, a page in a given
site
Discovery of a new page containing some specific
words
Insertion of a particular element in some pages
(new products in a catalog)
Detection of illegal copies of selected documents
Users may request to be notified
Immediately at the time the event is detected
Regularly, e.g., weekly
After a certain number of event detections

64
Examples

subscription myPariscope
  what are the new movie entries in Pariscope site
  monitoring newMovies
    select URL
    where URL extends www.pariscope.fr/movies/
      and new(self)
  manage the changes in the movies showing in Paris
  continuous delta Showing
    select ... from ... where
    when daily
  notify daily
  send me a daily report

65
Atomic Events
Loading of millions of pages/day
Atomic event 46: URL matches pattern www.xyz.com/
Atomic event 67: XML document contains the tag soccer
[Figure: loading pipeline for a document d. Labels: metadata manager, HTML parser, XML loader, complex event detection; d is annotated d/46, then d/46,67]
66
Complex Events
Several millions of pages crawled per day
Hundreds of millions of alerts raised
Millions of subscriptions
Complex event 12 = 67 and 46 (XML document contains
the tag soccer and URL matches pattern
www.xyz.com/)
[Figure: HTML parser and XML loader feeding complex event detection]
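
A simple way to picture complex event detection is as a subset test: a complex event fires for a document when all of its atomic events were detected on that document. The sketch below uses an index from atomic events to the complex events that mention them. The class and method names are hypothetical, and this naive approach is not the scalable algorithm the slides refer to.

```python
from collections import defaultdict

class ComplexEventDetector:
    def __init__(self):
        self.complex_events = {}            # complex id -> frozenset of atomic ids
        self.by_atomic = defaultdict(set)   # atomic id -> complex ids using it

    def register(self, complex_id, atomic_ids):
        """Declare a complex event as a conjunction of atomic events."""
        atoms = frozenset(atomic_ids)
        self.complex_events[complex_id] = atoms
        for atomic_id in atoms:
            self.by_atomic[atomic_id].add(complex_id)

    def detect(self, detected_atomic_ids):
        """Return the complex events whose atomic events were all detected."""
        detected = set(detected_atomic_ids)
        candidates = set()
        for atomic_id in detected:
            candidates |= self.by_atomic.get(atomic_id, set())
        return sorted(c for c in candidates
                      if self.complex_events[c] <= detected)

detector = ComplexEventDetector()
detector.register(12, [67, 46])   # the complex event from the slide
detector.register(13, [67, 99])   # hypothetical second subscription
alerts = detector.detect([46, 67])
```

Only complex events that share at least one atomic event with the document are ever examined, which hints at why indexing subscriptions by atomic event matters when there are millions of them.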
67
Notification Processing

Very efficient/scalable algorithm for complex
event detection
Notifications by
Email
Web posting
Web services in SOAP

[Figure: complex event detection sends alerts to the notification processor, which sends notifications; millions of notifications/day]
68
Xyleme in short

Spin-off of INRIA (national research institute)
Technology developed in a research project of 60
man-years
Creation of Xyleme SA in September 2000
Now about 25 people: 13 R&D, 4 services, 10
marketing, sales, and admin.
Customers include a press agency (AFP), newspaper
groups (Moniteur, Le Monde), and the national library
(BNF)
First round of capital in 2000 (SGAM
Viventures).
Second round in 2002 (Deutsche Bank)