Part One XML and Databases - PowerPoint PPT Presentation


Slides: 31

Provided by: sou59


Transcript and Presenter's Notes

1
Part One: XML and Databases

Soumen Chakrabarti
CSE, IIT Bombay

2
Form and content

The Web today
HTML generated by hand, WYSIWYG editors,
webified databases
HTML specifies rendering for human reading
Screen scraping required to consolidate data
The Web in the future
Common interchange format (XML)
Concentrate on content, not form
Represent data class broader than relations

3
Role of databases

Contribute
Data storage and indexing
Query processing and optimization
Views, transformations, integration
Adopt
Search modalities
Content-based approximate search
Linguistic analysis

4
Features of semi-structured data

No explicit schema, or volatile schema
Schema size comparable to data size
Structure changes without notice
Heterogeneous, deeply nested, irregular
Has nature of documents rather than tables

5
Semi-structured data model example
[Figure: Object Exchange Model (OEM) graph. The root Bib (o1) is a complex object with paper, paper, and book edges to o12, o24, and o29, which are connected by references edges. Their children carry labels such as author, title, year, page, http, and publisher, and end in atomic objects such as 1997, Serge, Abiteboul, Victor, Vianu, 122, and 133.]
6
Syntax
{paper: {author: "Abiteboul",
         author: {firstname: "Victor",
                  lastname: "Vianu"},
         title: "Regular path queries",
         page: {first: 122, last: 133}}}
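
The fragment above can be mirrored in memory. The following is a minimal sketch, assuming Python and a hypothetical encoding (not Lore's actual storage format) in which a complex object is a list of (label, child) pairs and an atomic object is a plain value. The helper name `children` is an illustration only.

```python
# Hypothetical in-memory encoding of the OEM fragment above.
# A complex object is a list of (label, child) pairs, so a label may repeat;
# an atomic object is a plain Python value.
paper = [
    ("author", "Abiteboul"),                  # atomic author
    ("author", [("firstname", "Victor"),      # complex author
                ("lastname", "Vianu")]),
    ("title", "Regular path queries"),
    ("page", [("first", 122), ("last", 133)]),
]

def children(obj, label):
    """Return every child reached from obj by an edge carrying this label."""
    if not isinstance(obj, list):
        return []  # atomic objects have no outgoing edges
    return [child for (lab, child) in obj if lab == label]

authors = children(paper, "author")
# The two author objects have different types within the same collection.
kinds = ["complex" if isinstance(a, list) else "atomic" for a in authors]
last_page = children(children(paper, "page")[0], "last")
```

A list of pairs is used instead of a dictionary because the same label (here author) may occur more than once under one object.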
7
Some observations

Missing or additional attributes
Multiple attributes
Different types in different objects
Heterogeneous collections

8
Object IDs and references
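[Example and figure: three person objects named Jane, Mary, and John, with object IDs o555, o456, and o123. Mary has idref "o123 o555" (drawn as two children edges), and John has a mother reference to o456.]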
9
Names and acronyms

OEM (Object Exchange Model): a semi-structured
data model from Stanford, 1995
Lore: a system for storing data adhering to the
OEM
Lorel: a query language for Lore
XML (eXtensible Markup Language): a
simplification of SGML and a generalization of
HTML
XML-QL: a query language for XML

10
Lorel query examples
select Bib.paper.title from Bib.paper where Bib.paper.year 1995
Alternative
select X.title from Bib.paper X, Bib.(paper|book) Y
where Y.author.lastname? = "Ullman" and Y.reference = X
Navigating partially known structures
Transitive closure
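
A path expression such as Bib.paper.title or Bib.(paper|book) can be read as a sequence of label sets followed from the root. The sketch below assumes Python, a hypothetical list-of-pairs encoding of OEM objects, and made-up sample data. It illustrates the idea only and is not how Lore evaluates Lorel.

```python
def step(objs, labels):
    """Follow every edge whose label is in `labels` from each complex object."""
    out = []
    for obj in objs:
        if isinstance(obj, list):  # atomic objects have no outgoing edges
            out.extend(child for (lab, child) in obj if lab in labels)
    return out

def eval_path(root, path):
    """Evaluate a path given as a list of label sets, one set per step."""
    objs = [root]
    for labels in path:
        objs = step(objs, labels)
    return objs

# Hypothetical data: the second paper has no year, as allowed in OEM.
bib = [
    ("paper", [("title", "Paper A"), ("year", 1997)]),
    ("paper", [("title", "Paper B")]),
    ("book",  [("title", "Book C"), ("year", 1995)]),
]

paper_titles = eval_path(bib, [{"paper"}, {"title"}])         # Bib.paper.title
all_titles = eval_path(bib, [{"paper", "book"}, {"title"}])   # Bib.(paper|book).title
years = eval_path(bib, [{"paper", "book"}, {"year"}])
```

Objects that lack a label simply contribute nothing to the result, which is what makes navigation over partially known structure possible.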
11
XML-QL query examples
where Morgan Kaufmann a k in www.a.b.c/bib.xml construct a
where a in www.a.b.c/bib.xml construct al
12
XML storage in ternary relation
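[Figure: a tree with nodes o1 to o6 and edges labeled paper, year, title, author, and author, with leaf values including "The Calculus" and 1986, stored as a ternary relation.]
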
Too many joins
Label name storage redundant
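
The join cost can be made concrete with a small sketch, assuming Python's built-in sqlite3 module. The table names edge and leaf, the column names, and the assignment of values to o3 and o4 are assumptions made for illustration; the slide only shows the labels and values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Ternary relation: one row per labeled edge (source, label, destination).
conn.execute("CREATE TABLE edge (src TEXT, label TEXT, dst TEXT)")
# Hypothetical side table holding the values of atomic objects.
conn.execute("CREATE TABLE leaf (oid TEXT, val TEXT)")

conn.executemany("INSERT INTO edge VALUES (?, ?, ?)", [
    ("o1", "paper", "o2"),
    ("o2", "year", "o3"),
    ("o2", "title", "o4"),
    ("o2", "author", "o5"),
    ("o2", "author", "o6"),
])
conn.executemany("INSERT INTO leaf VALUES (?, ?)", [
    ("o3", "1986"),          # assumed: o3 is the year
    ("o4", "The Calculus"),  # assumed: o4 is the title
])

# "Titles of papers from 1986" needs one self-join per path step.
query = """
SELECT vt.val
FROM edge p
JOIN edge t  ON t.src = p.dst AND t.label = 'title'
JOIN edge y  ON y.src = p.dst AND y.label = 'year'
JOIN leaf vt ON vt.oid = t.dst
JOIN leaf vy ON vy.oid = y.dst
WHERE p.label = 'paper' AND vy.val = '1986'
"""
rows = conn.execute(query).fetchall()
```

Even this two-attribute lookup touches the edge table three times, and every row repeats a label string, which is the redundancy noted above.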

13
Storage optimization through mining

Inline common cases
Tolerate a few nulls

14
Schema extraction

Schema: a template for type/semantics
specification
Conformance
Does the data conform to a given schema?
Classification
If so, which objects belong to which
classes/types?
Applications
Storage and query optimization

15
Graph simulation

Given two edge-labeled graphs G1 and G2, a
simulation is a relation R between nodes such
that if (x1, x2) is in R, and (x1, a, y1) is in
G1, then there exists (x2, a, y2) in G2 (same
label) such that (y1,y2) is in R

[Figure: simulation R between graphs G1 and G2, matching an edge labeled a.]
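
The definition suggests a simple fixpoint computation: start from all node pairs and discard any pair that violates the condition until nothing changes. The following is a minimal sketch, assuming Python and graphs given as sets of (source, label, target) triples. The function name `max_simulation` and the tiny example graphs are hypothetical, and this naive loop is not tuned for large graphs.

```python
def max_simulation(g1, g2):
    """Return the largest simulation from g1 to g2.

    g1 and g2 are sets of (source, label, target) edges.
    """
    nodes1 = {n for (s, _, t) in g1 for n in (s, t)}
    nodes2 = {n for (s, _, t) in g2 for n in (s, t)}
    rel = {(x1, x2) for x1 in nodes1 for x2 in nodes2}
    changed = True
    while changed:
        changed = False
        for (x1, x2) in list(rel):
            for (s1, a, y1) in g1:
                if s1 != x1:
                    continue
                # x2 needs an edge with the same label to some y2 with (y1, y2) in rel.
                ok = any(s2 == x2 and b == a and (y1, y2) in rel
                         for (s2, b, y2) in g2)
                if not ok:
                    rel.discard((x1, x2))
                    changed = True
                    break
    return rel

g1 = {("x", "a", "y")}
g2 = {("u", "a", "v"), ("u", "b", "w")}
forward = max_simulation(g1, g2)   # u can mimic x
backward = max_simulation(g2, g1)  # x cannot mimic u: it has no b edge
```

Nodes without outgoing edges, such as y, are simulated by every node, because the condition only constrains edges that exist in G1.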
16
Upper and lower bound schema

Lower bound schema
Conformance: find a simulation R from schema S to data D
Classification: check if (c,x) is in R
Used in storage optimization
Upper bound schema (data guides)
Conformance: find a simulation R from data D to schema S
Classification: check if (x,c) is in R
Used in path index generation and query
optimization

17
Sample data
[Figure: sample data graph. The root r has employee edges to p1 through p8 and a company edge to c. Employees are linked to one another by manages and managedby edges, and each employee has a worksfor edge to c.]
18
Lower bound schema
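[Figure: lower bound schema. Root (r) has employee edges to the classes Bosses (p1, p4, p6) and Regulars (p2, p3, p5, p7, p8) and a company edge to Company (c). The classes are connected by manages and managedby edges, and both have worksfor edges to Company.]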
19
Storage using lower bound schema
Lower-bound schema
Store rest in overflow graph
20
Upper bound schema (DataGuides)
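[Figure: upper bound schema. Root (r) has an employee edge to Employees (p1, p2, p3, p4, p5, p6, p7, p8) and a company edge to Company (c). Edges labeled manages, managedby, and worksfor connect Employees, Bosses (p1, p4, p6), Regulars (p2, p3, p5, p7, p8), and Company (c).]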
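
One way to obtain such an upper bound schema is a subset construction over label paths, similar to determinizing an automaton. The sketch below assumes Python. The function name `build_dataguide` is hypothetical, and the specific manages and managedby pairs are invented for illustration, chosen only so that p1, p4, and p6 act as bosses as in the slides.

```python
from collections import deque

def build_dataguide(edges, root):
    """Map each set of data nodes reachable by some label path to its successors."""
    start = frozenset([root])
    guide = {}  # state (frozenset of data nodes) -> {label: next state}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state in guide:
            continue
        targets = {}
        for (src, label, dst) in edges:
            if src in state:
                targets.setdefault(label, set()).add(dst)
        guide[state] = {lab: frozenset(ts) for lab, ts in targets.items()}
        queue.extend(guide[state].values())
    return guide

people = ["p1", "p2", "p3", "p4", "p5", "p6", "p7", "p8"]
# Assumed management pairs (boss, regular); not given in the source.
manages = [("p1", "p2"), ("p1", "p3"), ("p4", "p5"), ("p6", "p7"), ("p6", "p8")]

edges = [("r", "employee", p) for p in people] + [("r", "company", "c")]
edges += [(p, "worksfor", "c") for p in people]
edges += [(b, "manages", e) for (b, e) in manages]
edges += [(e, "managedby", b) for (b, e) in manages]

guide = build_dataguide(edges, "r")
employees = guide[frozenset(["r"])]["employee"]
bosses = guide[employees]["managedby"]
regulars = guide[employees]["manages"]
```

Under these assumptions the construction yields five states: the root, Employees, Bosses, Regulars, and Company. Every label path in the data appears exactly once in the guide.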
21
Query optimization issues
Select x from A.B x where exists y in x.C y5
[Figure: sample data tree with nodes labeled A, B, C, and D and leaf values 4 and 5.]
22
What makes the problem difficult

Selectivity estimation
Index selection
Access cost models
Clustering choices

23
Part Two: Information Retrieval and Databases

Soumen Chakrabarti
CSE, IIT Bombay

24
Information retrieval (IR)

Search
Inverted index
Boolean match
Relevance ranking
Classification
Learn topics from examples
Clustering
Discover topics from a document collection
Never done inside a relational database

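[Figure: inverted index for the terms cat and dog. Each posting lists a document ID with word positions: D5 (3, 37, 50), D7 (9, 20), D7 (7, 90, 400), D20 (22, 533).]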
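
A positional inverted index and a Boolean AND match can be sketched in a few lines. This assumes Python, whitespace tokenization, and made-up documents. The names `build_index` and `boolean_and` are hypothetical and do not refer to any particular search engine.

```python
from collections import defaultdict

def build_index(docs):
    """Build term -> {doc_id: [word positions]} from a dict of doc_id -> text."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def boolean_and(index, *terms):
    """Return the IDs of documents that contain every given term."""
    postings = [set(index.get(t, {})) for t in terms]
    return set.intersection(*postings) if postings else set()

docs = {
    "D1": "the cat sat",
    "D2": "the dog chased the cat",
    "D3": "a dog barked",
}
index = build_index(docs)
both = boolean_and(index, "cat", "dog")
```

Keeping positions in the postings, as in the figure, is what later allows proximity operators such as NEAR to be answered from the index alone.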
25
Current style of loose integration

RDBMS provides hooks
Declare some columns as textual with keyword
index
Inserts, updates, and deletes trigger external
program, e.g., Verity search engine
Search engine maintains separate indices
Simple query rewriting to combine relational and
text-match where-clauses

26
Reasons

Space
BLOB vs. pure relational representation
Average English word is only 5 bytes
Time
Most text engines are resigned to a flexible (i.e.,
no) model for data consistency
Much faster read-only access than relational
database lookups

27
New features desired

Operations that are more complex than keyword
search can benefit from tighter coupling with
RDBMS
Approximate search is essential (Anand Rajaraman,
Amazon.com, SIGMOD 99)
Misspelled book titles and author names are common
Variants of an OEM edge label (author/writer/poet)
Similarity extends to structure as well
('Travolta' NEAR 'Cage' = 'Face/Off')

28
Case study: generalized LIKE

SQL has limited string matching constructs
like 'x%', '%x', '%x%'
x must be an exact match
Need a more lenient match
Applications: LDAP, IR
String edit distance is not suitable
Given query, order strings in database in
increasing order of edit distance and pick top 5
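
The ranking described in the last bullet can be sketched as follows, assuming Python and the standard Levenshtein edit distance (unit cost for insertion, deletion, and substitution). The function names and the sample strings are hypothetical. This is a linear scan over all strings, not an indexed method.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

def top_k(query, strings, k=5):
    """Order strings by increasing edit distance to the query and keep the first k."""
    return sorted(strings, key=lambda s: edit_distance(query, s))[:k]

names = ["Ullman", "Widom", "Ulmann", "Abiteboul"]
closest = top_k("Ulman", names, k=2)
```

Each query compares against every stored string, so the cost grows with the size of the database.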