Title: Part One XML and Databases
1Part One XML and Databases
- Soumen Chakrabarti
- CSE, IIT Bombay
2Form and content
- The Web today
- HTML generated by hand, WYSIWYG editors,
webified databases
- HTML specifies rendering for human reading
- Screen scraping required to consolidate data
- The Web in the future
- Common interchange format (XML)
- Concentrate on content, not form
- Represent a class of data broader than relations
3Role of databases
- Contribute
- Data storage and indexing
- Query processing and optimization
- Views, transformations, integration
- Adopt
- Search modalities
- Content-based approximate search
- Linguistic analysis
4Features of semi-structured data
- No explicit schema, or volatile schema
- Schema size comparable to data size
- Structure changes without notice
- Heterogeneous, deeply nested, irregular
- Has nature of documents rather than tables
5Semi-structured data model example
[Figure: Object Exchange Model (OEM) graph for a bibliography. The root complex object o1 (Bib) has two paper edges and one book edge leading to the complex objects o12, o24, and o29, which are linked to one another by references edges. Other edge labels include author, title, year, page, http, publisher, firstname, lastname, first, and last. Atomic objects at the leaves hold values such as Serge, Abiteboul, Victor, Vianu, 1997, 122, and 133.]
6Syntax
{ paper: { author: "Abiteboul",
           author: { firstname: "Victor", lastname: "Vianu" },
           title: "Regular path queries",
           page: { first: 122, last: 133 } } }
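A minimal Python sketch of how the object above might be held in memory. Because a label such as author can repeat, each complex object is modeled here as a list of (label, value) pairs rather than a dict. The helper name `children` and the variable names are hypothetical, chosen only for this illustration.

```python
# Assumption: a complex OEM object is a list of (label, value) pairs and an
# atomic object is a plain Python value. Labels may repeat (two authors).
paper = [
    ("author", "Abiteboul"),
    ("author", [("firstname", "Victor"), ("lastname", "Vianu")]),
    ("title", "Regular path queries"),
    ("page", [("first", 122), ("last", 133)]),
]
bib = [("paper", paper)]


def children(obj, label):
    """Return every value reachable from obj over an edge named label."""
    if not isinstance(obj, list):  # atomic objects have no outgoing edges
        return []
    return [value for (name, value) in obj if name == label]


authors = children(paper, "author")
first_pages = [
    first
    for page in children(paper, "page")
    for first in children(page, "first")
]
```

Note that the two author values have different types (a string and a complex object), which a fixed relational column could not hold directly.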
7Some observations
- Missing or additional attributes
- Multiple attributes
- Different types in different objects
- Heterogeneous collections
8Object IDs and references
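[Example: an XML fragment with three person elements named Jane, Mary, and John. Mary's children element carries idref="o123 o555", and John's person element carries mother="o456".]
[Figure: object o456 has two children edges, to o555 and o123, and a mother edge leads back to o456.]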
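The ID and reference mechanism can be sketched with Python's standard `xml.etree.ElementTree`. The markup below is a reconstruction for illustration only: the `people` wrapper element and the assignment of ids to Jane, Mary, and John are assumptions, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: which id belongs to which person, and the <people> wrapper,
# are guesses made for this sketch.
DOC = """
<people>
  <person id="o555"><name>Jane</name></person>
  <person id="o456"><name>Mary</name><children idref="o123 o555"/></person>
  <person id="o123" mother="o456"><name>John</name></person>
</people>
"""

root = ET.fromstring(DOC)
# Index every person by its id so references can be followed in one lookup.
by_id = {person.get("id"): person for person in root.iter("person")}


def name_of(oid):
    """Return the name of the person whose id is oid."""
    return by_id[oid].findtext("name")


def children_of(oid):
    """Follow the space-separated idref list on each children element."""
    names = []
    for child in by_id[oid].findall("children"):
        for ref in child.get("idref", "").split():
            names.append(name_of(ref))
    return names


mary_children = children_of("o456")
john_mother = name_of(by_id["o123"].get("mother"))
```

With references the data is a graph rather than a tree: following children from Mary and then mother from John returns to the same object.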
9Names and acronyms
- OEM (Object Exchange Model): a semi-structured
data model from Stanford, 1995
- Lore: a system for storing data adhering to the
OEM
- Lorel: a query language for Lore
- XML (eXtensible Markup Language): a
simplification of SGML and a generalization of
HTML
- XML-QL: a query language for XML
10Lorel query examples
select Bib.paper.title from Bib.paper where Bib.paper.year 1995
Alternative
select X.title from Bib.paper X, Bib.(paper|book) Y where Y.author.lastname? Ullman and Y.reference X
Navigating partially known structures
Transitive closure
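A rough Python sketch of the two ideas named above: path steps that allow alternative or optional labels, and transitive closure over one label. This is not Lorel itself. The graph encoding, the sample data, and the functions `eval_path` and `closure` are assumptions made for illustration.

```python
# Assumption: the graph maps a node id to a list of (label, target) edges,
# and atomic values live in a separate dict.
graph = {
    "bib": [("paper", "p1"), ("book", "b1")],
    "p1": [("author", "a1"), ("references", "b1")],
    "b1": [("author", "a2"), ("references", "p2")],
    "a1": [("lastname", "n1")],
}
values = {"n1": "Ullman", "a2": "Ullman"}


def eval_path(graph, start, steps):
    """Follow steps from start; each step is (set_of_labels, optional)."""
    current = {start}
    for labels, optional in steps:
        nxt = {
            target
            for node in current
            for (label, target) in graph.get(node, [])
            if label in labels
        }
        # An optional step may also be skipped, so keep the current nodes.
        current = nxt | current if optional else nxt
    return current


def closure(graph, starts, label):
    """Nodes reachable from starts over zero or more edges named label."""
    seen, stack = set(starts), list(starts)
    while stack:
        node = stack.pop()
        for (edge_label, target) in graph.get(node, []):
            if edge_label == label and target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


# Roughly Bib.(paper|book).author.lastname? followed by a value test.
ends = eval_path(
    graph, "bib",
    [({"paper", "book"}, False), ({"author"}, False), ({"lastname"}, True)],
)
matches = sorted(n for n in ends if values.get(n) == "Ullman")
cited = closure(graph, {"p1"}, "references")
```

The optional step lets the same query match an author stored as a plain string and an author stored as a complex object with a lastname edge.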
11XML-QL query examples
where Morgan Kaufmann a k in www.a.b.c/bib.xml construct a
where a in www.a.b.c/bib.xml construct al
12XML storage in ternary relation
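[Figure: object o1 has a paper edge to o2; o2 has year, title, and two author edges leading to objects o3 to o6; the atomic values shown are The Calculus and 1986. Each edge is stored as one (source, label, destination) tuple.]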
- Too many joins
- Label name storage redundant
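A small sketch of the ternary-relation layout using Python's built-in `sqlite3`, to make the "too many joins" point concrete. The table and column names, the separate `atom` table for leaf values, and the assignment of o3 and o4 to year and title are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One row per edge: (source object, label, destination object).
conn.execute("CREATE TABLE edge (src TEXT, label TEXT, dst TEXT)")
# Assumption: atomic values are kept in a second table keyed by object id.
conn.execute("CREATE TABLE atom (oid TEXT PRIMARY KEY, value TEXT)")
conn.executemany(
    "INSERT INTO edge VALUES (?, ?, ?)",
    [
        ("o1", "paper", "o2"),
        ("o2", "year", "o3"),
        ("o2", "title", "o4"),
        ("o2", "author", "o5"),
        ("o2", "author", "o6"),
    ],
)
conn.executemany(
    "INSERT INTO atom VALUES (?, ?)",
    [("o3", "1986"), ("o4", "The Calculus")],
)

# Titles of papers from 1986: every path step costs one more self-join.
rows = conn.execute(
    """
    SELECT t_val.value
    FROM edge AS p
    JOIN edge AS y ON y.src = p.dst AND y.label = 'year'
    JOIN atom AS y_val ON y_val.oid = y.dst
    JOIN edge AS t ON t.src = p.dst AND t.label = 'title'
    JOIN atom AS t_val ON t_val.oid = t.dst
    WHERE p.src = 'o1' AND p.label = 'paper' AND y_val.value = '1986'
    """
).fetchall()
titles = [row[0] for row in rows]
```

Even this two-attribute lookup touches the edge table three times, and the label strings paper, year, title, and author are repeated in every row that uses them.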
13Storage optimization through mining
- Inline common cases
- Tolerate a few nulls
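One way to read the two bullets above, sketched in Python: labels that occur in most objects become columns of a regular table (with nulls where they are missing), and rare labels go to an overflow edge list. The threshold, the single-valued dict encoding of objects, and both function names are assumptions, not the method of any particular system.

```python
from collections import Counter


def choose_inlined_labels(objects, min_fraction=0.75):
    """Pick labels present in at least min_fraction of the objects."""
    counts = Counter(label for obj in objects for label in obj)
    return sorted(
        label
        for label, count in counts.items()
        if count / len(objects) >= min_fraction
    )


def split_storage(objects, inlined):
    """Return (table rows with None for missing values, overflow edges)."""
    rows, overflow = [], []
    for oid, obj in enumerate(objects):
        rows.append(tuple(obj.get(label) for label in inlined))
        overflow.extend(
            (oid, label, value)
            for label, value in obj.items()
            if label not in inlined
        )
    return rows, overflow


# Hypothetical sample data: four papers with slightly different attributes.
papers = [
    {"title": "A", "year": 1995, "author": "X"},
    {"title": "B", "year": 1996, "author": "Y", "http": "u"},
    {"title": "C", "author": "Z"},
    {"title": "D", "year": 1997, "author": "W"},
]
inlined = choose_inlined_labels(papers)
rows, overflow = split_storage(papers, inlined)
```

Here year is inlined even though one paper lacks it, which costs a single null but saves a join for the other three.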
14Schema extraction
- Schema: a template for type/semantics
specification
- Conformance
- Does the data conform to a given schema?
- Classification
- If so, which objects belong to which
classes/types?
- Applications
- Storage and query optimization
15Graph simulation
- Given two edge-labeled graphs G1 and G2, a
simulation is a relation R between nodes such
that if (x1, x2) is in R, and (x1, a, y1) is in
G1, then there exists (x2, a, y2) in G2 (same
label) such that (y1,y2) is in R
[Figure: a simulation R between the nodes of G1 and G2; an edge labeled a leads to y1 in G1.]
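The definition above can be turned into a simple fixpoint computation: start with every pair of nodes and repeatedly discard pairs that violate the condition. The Python sketch below computes the largest simulation this way. It is a naive illustration and not an efficient algorithm; the graph encoding, the sample graphs, and the function name are assumptions.

```python
def maximal_simulation(g1, g2):
    """Largest relation R such that every edge of x1 is matched from x2.

    Each graph maps a node to a list of (label, target) edges.
    """
    nodes1 = set(g1) | {t for edges in g1.values() for (_, t) in edges}
    nodes2 = set(g2) | {t for edges in g2.values() for (_, t) in edges}
    rel = {(x1, x2) for x1 in nodes1 for x2 in nodes2}
    changed = True
    while changed:
        changed = False
        for (x1, x2) in list(rel):
            for (a, y1) in g1.get(x1, []):
                matched = any(
                    b == a and (y1, y2) in rel
                    for (b, y2) in g2.get(x2, [])
                )
                if not matched:
                    # x2 cannot mimic the edge (x1, a, y1), so drop the pair.
                    rel.discard((x1, x2))
                    changed = True
                    break
    return rel


# Hypothetical example: g1 is a small schema, g2 is a small data graph.
g1 = {"Root": [("employee", "Emp")], "Emp": [("worksfor", "Comp")]}
g2 = {
    "r": [("employee", "p1"), ("employee", "p2")],
    "p1": [("worksfor", "c")],
}
rel = maximal_simulation(g1, g2)
```

In this example p2 is not related to Emp because it has no worksfor edge, while Comp is related to every node because it has no outgoing edges to match.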
16Upper and lower bound schema
- Lower bound schema
- Conformance: find a simulation R from schema S to data D
- Classification: check if (c, x) is in R
- Used in storage optimization
- Upper bound schema (data guides)
- Conformance: find a simulation R from data D to schema S
- Classification: check if (x, c) is in R
- Used in path index generation and query
optimization
17Sample data
[Figure: sample data graph. The root r has eight employee edges, to p1 through p8, and one company edge to c. Five manages edges and five managedby edges connect employees, and eight worksfor edges lead from the employees to c.]
18Lower bound schema
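[Figure: lower-bound schema graph with four classes: Root {r}, Bosses {p1, p4, p6}, Regulars {p2, p3, p5, p7, p8}, and Company {c}. Root has employee edges to the two employee classes and a company edge to Company. Bosses and Regulars are linked by manages and managedby edges, and worksfor edges lead to Company.]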
19Storage using lower bound schema
Lower-bound schema
Store rest in overflow graph
20Upper bound schema (DataGuides)
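[Figure: upper-bound schema (DataGuide) with five classes: Root {r}, Employees {p1, p2, p3, p4, p5, p6, p7, p8}, Bosses {p1, p4, p6}, Regulars {p2, p3, p5, p7, p8}, and Company {c}. Root has an employee edge to Employees and a company edge to Company. Employees, Bosses, and Regulars are linked by manages and managedby edges, and worksfor edges lead to Company.]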
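A DataGuide of this kind can be sketched as a determinization of the data graph: each schema node is the set of data objects reachable by some label path from the root. The Python sketch below applies this to data shaped like the sample graph. The individual manages and managedby pairs, the graph encoding, and the function name are assumptions made for illustration.

```python
def build_dataguide(graph, root):
    """Map each reachable target set to its outgoing {label: target set}."""
    start = frozenset([root])
    guide, work = {}, [start]
    while work:
        state = work.pop()
        if state in guide:
            continue
        out = {}
        for node in state:
            for (label, target) in graph.get(node, []):
                out.setdefault(label, set()).add(target)
        guide[state] = {label: frozenset(ts) for label, ts in out.items()}
        work.extend(guide[state].values())
    return guide


# Assumption: which boss manages which regular employee is invented here.
manages = {"p1": ["p2", "p3"], "p4": ["p5"], "p6": ["p7", "p8"]}
employees = ["p%d" % i for i in range(1, 9)]
graph = {"r": [("employee", p) for p in employees] + [("company", "c")]}
for p in employees:
    graph[p] = [("worksfor", "c")]
for boss, staff in manages.items():
    for person in staff:
        graph[boss].append(("manages", person))
        graph[person].append(("managedby", boss))

guide = build_dataguide(graph, "r")
all_employees = guide[frozenset(["r"])]["employee"]
bosses = guide[all_employees]["managedby"]
regulars = guide[all_employees]["manages"]
```

Every label path in the data appears exactly once in the resulting guide, which is what makes it usable as a path index. In the worst case the number of target sets can grow exponentially with the data, so this sketch is only suitable for small or regular graphs.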
21Query optimization issues
Select x from A.B x where exists y in x.C y5
[Figure: data tree with nodes labeled A, B, C, and D; the C leaves hold the values 4 and 5.]
22What makes the problem difficult
- Selectivity estimation
- Index selection
- Access cost models
- Clustering choices
23Part Two Information Retrieval and Databases
- Soumen Chakrabarti
- CSE, IIT Bombay
24Information retrieval (IR)
- Search
- Inverted index
- Boolean match
- Relevance ranking
- Classification
- Learn topics from examples
- Clustering
- Discover topics from a document collection
- Never done inside a relational database
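[Figure: inverted index with positional postings for the terms cat and dog. The postings shown are D5: 3, 37, 50; D7: 9, 20; D7: 7, 90, 400; D20: 22, 533.]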
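A minimal Python sketch of a positional inverted index and a Boolean AND match over it. The sample documents, the whitespace tokenizer, and the function names are assumptions for illustration; a real engine would add stemming, compression, and ranking.

```python
from collections import defaultdict


def build_index(docs):
    """Map each word to {doc_id: [positions of the word in that doc]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            index[word][doc_id].append(position)
    return index


def boolean_and(index, *terms):
    """Return the sorted ids of documents that contain every term."""
    postings = [set(index.get(term, {})) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []


# Hypothetical documents.
docs = {
    "D1": "the cat sat",
    "D2": "the dog chased the cat",
    "D3": "dog days",
}
index = build_index(docs)
both = boolean_and(index, "cat", "dog")
```

Keeping word positions in the postings is what later allows phrase and proximity (NEAR) queries without rereading the documents.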
25Current style of loose integration
- RDBMS provides hooks
- Declare some columns as textual with keyword
index
- Inserts, updates, and deletes trigger external
program, e.g., Verity search engine
- Search engine maintains separate indices
- Simple query rewriting to combine relational and
text-match where-clauses
26Reasons
- Space
- BLOB vs. pure relational representation
- Average English word is only 5 bytes
- Time
- Most text engines settle for a flexible (i.e.,
no) model of data consistency
- Much faster read-only access than relational
database lookups
27New features desired
- Operations that are more complex than keyword
search can benefit from tighter coupling with
RDBMS
- Approximate search is essential (Anand Rajaraman,
Amazon.com, SIGMOD 99)
- Misspelling book title, author name common
- Variants of an OEM edge label (author/writer/poet)
- Similarity extends to structure as well
(Travolta NEAR Cage Face/Off)
28Case study: generalized like
- SQL has limited string matching constructs
- like x, x, x
- x must be exact match
- Need more lenient match
- Applications: LDAP, IR
- String edit distance is not suitable
- Given query, order strings in database in
increasing order of edit distance and pick top 5
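For reference, the edit-distance ranking described in the last bullet can be sketched in Python as follows. It uses the standard Levenshtein dynamic program; the function names and sample strings are hypothetical. The sketch computes a distance against every stored string, which hints at one reason the approach may scale poorly.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(
                min(
                    prev[j] + 1,               # delete ca
                    cur[j - 1] + 1,            # insert cb
                    prev[j - 1] + (ca != cb),  # substitute or match
                )
            )
        prev = cur
    return prev[-1]


def top_k(query, strings, k=5):
    """Order strings by increasing edit distance to query; keep the top k."""
    return sorted(strings, key=lambda s: (edit_distance(query, s), s))[:k]


closest = top_k("pascal", ["nascent", "rascal", "pascal"], k=2)
```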
29Sliding-window matching
[Figure: 3-grams of three strings. nascent: nas, asc, sce, cen, ent. pascal: pas, asc, sca, cal. rascal: ras, asc, sca, cal.]
- Given a query, scan to get a set of 3-grams
- Similarity of a string in the database to the query =
number of shared 3-grams
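The sliding-window measure can be sketched in a few lines of Python. The function names and the tie-breaking rule are assumptions; the strings are the ones from the slide.

```python
def trigrams(s):
    """Set of 3-grams obtained by sliding a window of width 3 over s."""
    return {s[i:i + 3] for i in range(len(s) - 2)}


def rank(query, strings, k=5):
    """Score each string by its number of 3-grams shared with the query."""
    q = trigrams(query)
    scored = [(len(q & trigrams(s)), s) for s in strings]
    # Highest score first; ties are broken alphabetically (an assumption).
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:k]


ranking = rank("pascal", ["rascal", "nascent", "pascal"])
```

Unlike edit distance, this score can be served from an ordinary index from 3-grams to strings, so only strings that share at least one 3-gram with the query need to be examined.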
30Issues
- Minimally disruptive architecture
- Low storage overheads
- Fast query processing
- Good selectivity estimates
- Combining with other predicates for ranking
- Efficiently handling updates