Title: Quarrying Unfamiliar DataSpaces
1Quarrying Unfamiliar DataSpaces
- Bill Howe
- David Maier
- Nick Rayner
- Sponsored by the NSF ITR Program 2001-2006
- In collaboration with Antonio Baptista, Paul
Turner, Yinlong Zhang, Sergey Frolov, and the
entire CORIE Environmental Science Team at OGI
2Dataspaces
- DataSpace (DS)
- Autonomous, heterogeneous data sources
- grouped by an identifiable scope
- with respect to a set of requirements
- DataSpace Support Platform (DSSP)
- A collection of best effort services
- Catalog and Browse
- Search and Query
- Workflow (Events, Actions, and Monitoring)
- Integrity checks/guarantees
From Databases to Dataspaces: A New Abstraction for Information Management, Michael Franklin, Alon Halevy, David Maier, SIGMOD Record, December 2005.
3Dataspaces vs. Databases
- Databases
- Single Schema
- Centralized Administration
- Structured Query
- Strict Integrity Constraints
- Dataspaces
- Data Coexistence
- Autonomous Sources
- Search, Browse, Approximate Answer
- Best Effort Guarantees
4Dataspaces vs. Semantic Web
- Dataspaces
- No ontology
- Probably no inferencing
- Pay as you go
- Semantic Web
- Autonomous agents crawling richly described autonomous data sources, which are related formally via ontologies
5Dataspace Timeline
Figure: a timeline along a "time, scope" axis, running from insular, application-specific databases to autonomous agents crawling richly described data sources integrated via an ontology.
6Example Scientific Data Repository
- Atmospheric forcings, river forcings, global ocean forcings
- Sensor data
- Ocean simulation results
- Configuration and log files
- Annotations
- Data products
salinity
/anim-sal_estuary_7.gif
7Example Pharmacology
RxNav Interface developed by the National
Library of Medicine
8Dataspace Timeline
Figure: utility plotted against time and scope. Labels shown: Insular Databases, Federated Databases, Data Integration Tools, Dataspaces, Quarry and related tools, RDF/OWL, Semantic Web.
9Unfamiliar Dataspaces
- No schema is available
- No query workload is available
- Browse is the dominant interaction
- keys, ids, URIs not directly useful
- properties and values carry the meaning
- Goal: maximize return on effort when working with an unfamiliar dataspace
10Green Field Tools for Unfamiliar Dataspaces
- Goal: a working, extensible application with the least possible effort
- We need at least
- a Data Model
- Lowest Common Denominator
- minimal modeling decisions
- an API
- easy to use for domain experts
- uniformly efficient
11Outline
- Dataspaces
- Data Models for Dataspaces
- Quarry Data Model
- Quarry Storage
- Quarry API
- Experimental Results
- Extensions
- Related Work
12Data Models
Figure: data models placed on two axes, Modeling Costs (Low to High) and Application Costs (Low to High). Models shown: RDFS/OWL, XML, RDF, Object Models, Hypertext, Document/Text, NetCDF, HDF, etc., Relational.
Expressive power is measured in terms of structure, operations, and constraints.
13Quarry Data Model
- resource, property, value
- (subject, predicate, object) if you prefer
- no intrinsic distinction between literal values and resource values
- no explicit types or classes
- no variables (no inference)
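The data model above can be sketched in a few lines of Python. This is only an illustration, assuming every part of a triple is an untyped string; the `Triple` name and the last sample row are hypothetical, and the other rows are taken from the triple store example later in these slides.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    """One (resource, property, value) fact; all three parts are plain strings."""
    resource: str
    prop: str
    value: str

# Values are untyped strings: nothing marks "7" as a literal or "estuary"
# as a resource, and there are no classes or schema declarations.
triples = [
    Triple("101", "region", "estuary"),
    Triple("101", "variable", "salt"),
    Triple("101", "depth", "7"),
    Triple("101", "path", "/iso_e_s_7.gif"),
]

# Hypothetical row: a string used as a value above can also be described
# as a resource, without any declaration that it is one.
triples.append(Triple("estuary", "kind", "region"))
```

Because there is no literal/resource distinction, the same lookup code works whether a value is a number, a path, or the identifier of another resource.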
14Example Pharmacology
Concept
Relationship
Atom
15Example Scientific Data Repository
/anim-sal_estuary_7.gif
16Outline
- Dataspaces
- Data Models for Dataspaces
- Quarry Data Model
- Quarry Storage
- Quarry API
- Experimental Results
- Extensions
- Related Work
17Some Storage Models
- Schema dependent storage (RDFS)
- We assume schema is unavailable
- Indexed Triple Store
- Logically, one large table of (s, p, o) triples
- Physically, multiple indices for various access patterns
- Property Tables
- Some properties get their own (s, o) extents (basically isomorphic to a pso index)
- Selection of properties depends on query workload
18A Simple Idea
- Signatures
- resources expressing the same properties clustered together
- Posit that |Signature| << |Resource| (far fewer signatures than resources)
- Queries evaluated over Signature Extents
19Triple Store
A Query in RDQL
select ?p where (?r, <s:region>, <s:estuary>), (?r, <s:variable>, <s:salt>), (?r, <s:depth>, <s:7>), (?r, <s:path>, ?p)
Triples
rsrc  property  value
101   depth     7
336   variable  temp
101   path      /iso_e_s_7.gif
101   variable  salt
843   channel   north
843   variable  salt
336   path      /trans_s_t.gif
843   path      /trans_n_s.gif
336   channel   south
101   region    estuary
and in SQL
SELECT r.rsrc, p.value AS path
FROM Triples r, Triples v, Triples d, Triples p
WHERE r.property = 'region' AND r.value = 'estuary'
  AND v.property = 'variable' AND v.value = 'salt'
  AND d.property = 'depth' AND d.value = '7'
  AND p.property = 'path'
  AND r.rsrc = v.rsrc AND v.rsrc = d.rsrc AND d.rsrc = p.rsrc
One join per condition
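To make the self-join concrete, here is a small runnable sketch using Python's built-in `sqlite3` module. The table layout (`rsrc`, `property`, `value`, all text) is an assumption, and the query includes the value tests needed to match the RDQL query above.

```python
import sqlite3

# The ten sample triples from the slide.
ROWS = [
    ("101", "depth", "7"), ("336", "variable", "temp"),
    ("101", "path", "/iso_e_s_7.gif"), ("101", "variable", "salt"),
    ("843", "channel", "north"), ("843", "variable", "salt"),
    ("336", "path", "/trans_s_t.gif"), ("843", "path", "/trans_n_s.gif"),
    ("336", "channel", "south"), ("101", "region", "estuary"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Triples (rsrc TEXT, property TEXT, value TEXT)")
conn.executemany("INSERT INTO Triples VALUES (?, ?, ?)", ROWS)

# One alias of Triples per query condition, joined on the resource id.
QUERY = """
SELECT r.rsrc, p.value AS path
FROM Triples r, Triples v, Triples d, Triples p
WHERE r.property = 'region'   AND r.value = 'estuary'
  AND v.property = 'variable' AND v.value = 'salt'
  AND d.property = 'depth'    AND d.value = '7'
  AND p.property = 'path'
  AND r.rsrc = v.rsrc AND v.rsrc = d.rsrc AND d.rsrc = p.rsrc
"""
result = conn.execute(QUERY).fetchall()
print(result)  # [('101', '/iso_e_s_7.gif')]
```

Each extra condition in the query adds another alias and another join, which is the cost the slide points at.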
20Triple Store
select ?p where (?r, <s:region>, <s:estuary>), (?r, <s:variable>, <s:salt>), (?r, <s:depth>, <s:7>), (?r, <s:path>, ?p)
SELECT rsrc,
  MAX(CASE WHEN property = 'region' THEN value END) AS region,
  MAX(CASE WHEN property = 'variable' THEN value END) AS variable,
  MAX(CASE WHEN property = 'depth' THEN value END) AS depth,
  MAX(CASE WHEN property = 'path' THEN value END) AS path
FROM Triples
GROUP BY rsrc
HAVING MAX(CASE WHEN property = 'region' THEN value END) = 'estuary'
  AND MAX(CASE WHEN property = 'variable' THEN value END) = 'salt'
  AND MAX(CASE WHEN property = 'depth' THEN value END) = '7'
but can't exploit indexes
21Property Tables
select ?p where (?r, <s:region>, <s:estuary>), (?r, <s:variable>, <s:salt>), (?r, <s:depth>, <s:7>), (?r, <s:path>, ?p)
depth
rsrc  value
101   7
region
rsrc  value
101   estuary
variable
rsrc  value
336   temp
101   salt
843   salt
path
rsrc  value
101   /iso_e_s_7.gif
336   /trans_s_t.gif
843   /trans_n_s.gif
channel
rsrc  value
843   north
336   south
select p.value
from region r, variable v, depth d, path p
where r.value = 'estuary' and v.value = 'salt' and d.value = 7
  and r.rsrc = v.rsrc and v.rsrc = d.rsrc and d.rsrc = p.rsrc
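The property-table layout can also be sketched directly in Python. This is a simplified illustration rather than the storage scheme of any particular system: it gives every property its own (rsrc, value) extent, whereas the slides note that only some properties would be chosen, based on the query workload. The function names are hypothetical.

```python
from collections import defaultdict

ROWS = [
    ("101", "depth", "7"), ("336", "variable", "temp"),
    ("101", "path", "/iso_e_s_7.gif"), ("101", "variable", "salt"),
    ("843", "channel", "north"), ("843", "variable", "salt"),
    ("336", "path", "/trans_s_t.gif"), ("843", "path", "/trans_n_s.gif"),
    ("336", "channel", "south"), ("101", "region", "estuary"),
]

def build_property_tables(triples):
    """Split one triple list into one (rsrc, value) extent per property."""
    tables = defaultdict(list)
    for rsrc, prop, value in triples:
        tables[prop].append((rsrc, value))
    return dict(tables)

def select_resources(tables, prop, value):
    """Resources whose property `prop` has the given value."""
    return {rsrc for rsrc, v in tables.get(prop, []) if v == value}

tables = build_property_tables(ROWS)
# The join on rsrc becomes a set intersection across property tables.
matches = (select_resources(tables, "region", "estuary")
           & select_resources(tables, "variable", "salt")
           & select_resources(tables, "depth", "7"))
paths = sorted(v for rsrc, v in tables["path"] if rsrc in matches)
print(paths)  # ['/iso_e_s_7.gif']
```

The query still touches one table per condition, so the number of joins is unchanged; only the size of each input shrinks.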
22Signature Tables
select ?p where (?r, <s:region>, <s:estuary>), (?r, <s:variable>, <s:salt>), (?r, <s:depth>, <s:7>), (?r, <s:path>, ?p)
S1: variable, channel, path
rsrc  variable  channel  path
336   temp      south    /trans_s_t.gif
843   salt      north    /trans_n_s.gif
S2: depth, region, variable, path
rsrc  depth  region   variable  path
101   7      estuary  salt      /iso_e_s_7.gif
select path from S2 where region = 'estuary' and variable = 'salt' and depth = 7
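A runnable sketch of the same query over signature tables, again using `sqlite3`. Storing every column as text is an assumption made here for simplicity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One table per signature; each row is a complete resource description.
conn.execute("CREATE TABLE S1 (rsrc TEXT, variable TEXT, channel TEXT, path TEXT)")
conn.execute(
    "CREATE TABLE S2 (rsrc TEXT, depth TEXT, region TEXT, variable TEXT, path TEXT)"
)
conn.executemany("INSERT INTO S1 VALUES (?, ?, ?, ?)", [
    ("336", "temp", "south", "/trans_s_t.gif"),
    ("843", "salt", "north", "/trans_n_s.gif"),
])
conn.execute("INSERT INTO S2 VALUES (?, ?, ?, ?, ?)",
             ("101", "7", "estuary", "salt", "/iso_e_s_7.gif"))

# S1 has no region or depth column, so only S2 can satisfy the query.
rows = conn.execute(
    "SELECT path FROM S2 "
    "WHERE region = 'estuary' AND variable = 'salt' AND depth = '7'"
).fetchall()
print(rows)  # [('/iso_e_s_7.gif',)]
```

The four-way join disappears: all conditions are evaluated against a single row of a single table.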
23Choosing a Storage Model
- Sources of information
- A priori knowledge (schema)
- Query workload (learning)
- The data (mining)
24Computing Signatures
Input triples (rsrc, prop, value), unsorted:
r0  p0  v(0,0)
r2  p1  v(2,1)
r0  p2  v(0,2)
r0  p1  v(0,1)
r1  p3  v(1,3)
r1  p1  v(1,1)
r2  p3  v(2,3)
External Sort (by rsrc, then prop):
r0  p0  v(0,0)
r0  p1  v(0,1)
r0  p2  v(0,2)
r1  p1  v(1,1)
r1  p3  v(1,3)
r2  p1  v(2,1)
r2  p3  v(2,3)
Nest (one row per resource: properties, values, signature hash):
r0  p0, p1, p2  v(0,0), v(0,1), v(0,2)  hash(S0)
r1  p1, p3      v(1,1), v(1,3)          hash(S1)
r2  p1, p3      v(2,1), v(2,3)          hash(S1)
25Computing Signatures
Nested rows:
r0  p0, p1, p2  v(0,0), v(0,1), v(0,2)  hash(S0)
r1  p1, p3      v(1,1), v(1,3)          hash(S1)
r2  p1, p3      v(2,1), v(2,3)          hash(S1)
signatures
signature   sighash
p0, p1, p2  hash(S0)
p1, p3      hash(S1)
hash(S0)
rsrc  p0      p1      p2
r0    v(0,0)  v(0,1)  v(0,2)
hash(S1)
rsrc  p1      p3
r1    v(1,1)  v(1,3)
r2    v(2,1)  v(2,3)
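The sort, nest, and hash steps can be sketched as follows. This is an illustration under several assumptions: an in-memory `sorted` call stands in for the external sort, MD5 is an arbitrary choice of signature hash, and each resource is assumed to have at most one value per property. The function names are hypothetical.

```python
import hashlib
from itertools import groupby

TRIPLES = [
    ("r0", "p0", "v(0,0)"), ("r2", "p1", "v(2,1)"), ("r0", "p2", "v(0,2)"),
    ("r0", "p1", "v(0,1)"), ("r1", "p3", "v(1,3)"), ("r1", "p1", "v(1,1)"),
    ("r2", "p3", "v(2,3)"),
]

def sighash(props):
    """Stable hash of a sorted property tuple (illustrative hash choice)."""
    return hashlib.md5("|".join(props).encode()).hexdigest()[:8]

def compute_signatures(triples):
    # Sort by resource, then property, so each resource's triples are adjacent.
    ordered = sorted(triples)
    signatures = {}  # sighash -> tuple of property names
    extents = {}     # sighash -> list of rows (rsrc, value, value, ...)
    # Nest: fold each resource's triples into one row.
    for rsrc, group in groupby(ordered, key=lambda t: t[0]):
        rows = list(group)
        props = tuple(p for _, p, _ in rows)
        values = tuple(v for _, _, v in rows)
        h = sighash(props)
        signatures[h] = props
        extents.setdefault(h, []).append((rsrc,) + values)
    return signatures, extents

signatures, extents = compute_signatures(TRIPLES)
for h, props in signatures.items():
    print(props, extents[h])
```

Resources r1 and r2 express the same properties, so they hash to the same signature and land in the same extent.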
26Outline
- Dataspaces
- Data Models for Dataspaces
- Quarry Storage
- Quarry API
- Experimental Results
- Extensions
- Related Work
27Quarry API
- /2004/2004-001//anim-tem_estuary_bottom.gif
- aggregate = bottom, animation = isotem, day = 001, directory = images, plottype = isotem, region = estuary, runid = 2004-001, year = 2004
- /2004/2004-001//amp_plume_2d.gif
- day = 001, directory = images, plottype = 2d, region = plume, runid = 2004-001, year = 2004
28Quarry API Describe
- Describe(r)
- Property, Value pairs describing resource r
Describe(/2005-002//anim-sal_plume_5.gif)
year=2005, day=002, runid=2005-002, anim, region=plume, variable=salt, depth=5, plottype=isoline
Describe(/2005-002//anim-sal_channel_transects.gif)
year=2005, day=002, runid=2005-002, anim, channel=plume, variable=salt, plottype=transect
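A minimal sketch of Describe over a plain list of triples. The `describe` function and the second resource are hypothetical; the first resource's pairs are transcribed from the example above, leaving out the bare `anim` entry, which has no value on the slide.

```python
R1 = "/2005-002//anim-sal_plume_5.gif"
TRIPLES = [
    (R1, "year", "2005"), (R1, "day", "002"), (R1, "runid", "2005-002"),
    (R1, "region", "plume"), (R1, "variable", "salt"), (R1, "depth", "5"),
    (R1, "plottype", "isoline"),
    ("/other.gif", "year", "2004"),  # hypothetical second resource
]

def describe(triples, r):
    """Return the (property, value) pairs attached to resource r."""
    return [(p, v) for rsrc, p, v in triples if rsrc == r]

description = describe(TRIPLES, R1)
print(dict(description)["plottype"])  # isoline
```

With signature tables, the same call would read a single row from the extent that holds the resource.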
29Quarry API Values
- Values(B, p)
- Unique values of property p associated with any
resource that satisfies B
Values(var=salt, day): 1, 2, 3, 4, 5, 6, 7
30Quarry API Properties
- Properties(B)
- The set of properties that describe any resource
satisfying B
GetProperties(variable=salt): plottype, year, region, depth, channel, ...
GetProperties(plottype=isoline): region, depth, year, ...
31Quarry API
- Applications use sequences of Prop and Val calls
to explore the Dataspace
32Quarry API
- Root level: all unique properties
- Below a property p: all unique values of the parent property
- Below a value v: all properties of resources satisfying p=v
Every path from a root represents a conjunctive query
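The browse tree can be walked by alternating the two calls, as in the following sketch. It evaluates the calls naively over a list of triples; the sample data and the function names (`properties`, `values`) are hypothetical, and a path is represented as a list of (property, value) pairs.

```python
TRIPLES = [
    ("f1", "variable", "salt"), ("f1", "region", "estuary"), ("f1", "depth", "7"),
    ("f2", "variable", "salt"), ("f2", "channel", "north"),
    ("f3", "variable", "temp"), ("f3", "channel", "south"),
]

def _matching(triples, path):
    """Resources satisfying every (property, value) pair on the path."""
    facts = set(triples)
    resources = {r for r, _, _ in triples}
    return {r for r in resources if all((r, p, v) in facts for p, v in path)}

def properties(triples, path):
    """Properties of any resource matching the path (a Properties(B) sketch)."""
    keep = _matching(triples, path)
    return sorted({p for r, p, _ in triples if r in keep})

def values(triples, path, prop):
    """Unique values of prop over resources matching the path (Values(B, p))."""
    keep = _matching(triples, path)
    return sorted({v for r, p, v in triples if r in keep and p == prop})

root = properties(TRIPLES, [])                         # all unique properties
level1 = values(TRIPLES, [], "variable")               # values of 'variable'
level2 = properties(TRIPLES, [("variable", "salt")])   # given variable=salt
level3 = values(TRIPLES, [("variable", "salt")], "channel")
print(root, level1, level2, level3)
```

Each step extends the path by one property=value pair, so the path itself is the conjunctive query being built.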
33Expressiveness
- Incomparable with most RDF Query Languages
- Unique properties not usually supported by others
- We're limited to queries of the form
?s LANGUAGECODE en . ?s DESCRIPTIONTYPE 2 . ?s UMLSAUI A3711025 . ?s string Sodium_lactate_0.16_molar_infusion .
34Quarry Query Processing
- Props(B)
- B = (region=estuary and day=136 and variable=salt)
- let cover = {region, day, variable}
- Ans = {}
- for Sig in Signatures
-   if cover is a subset of Sig
-     if exists tup (tup in Extent(Sig) and B(tup))
-       Ans = Ans U Sig
select rho from Extent(Sig1) where B limit 1
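The Props(B) loop above can be written out as runnable Python. This is a sketch over in-memory extents, assuming each signature is a tuple of property names and each extent is a list of dicts; the sample data is hypothetical.

```python
# Hypothetical signature extents: property tuple -> rows as dicts.
EXTENTS = {
    ("day", "depth", "region", "variable"): [
        {"day": "136", "depth": "7", "region": "estuary", "variable": "salt"},
    ],
    ("channel", "day", "region", "variable"): [
        {"channel": "north", "day": "100", "region": "estuary", "variable": "salt"},
    ],
    ("channel", "variable"): [
        {"channel": "south", "variable": "salt"},
    ],
}

def props(extents, B):
    """Props(B): union of signatures holding at least one tuple that satisfies B."""
    cover = set(B)
    answer = set()
    for sig, rows in extents.items():
        if not cover <= set(sig):
            continue  # signature lacks a queried property; skip its extent
        # any() stops at the first match, like "limit 1" in the SQL probe.
        if any(all(row[p] == v for p, v in B.items()) for row in rows):
            answer |= set(sig)
    return answer

B = {"region": "estuary", "day": "136", "variable": "salt"}
print(sorted(props(EXTENTS, B)))  # ['day', 'depth', 'region', 'variable']
```

Signatures that do not cover the query are rejected without reading any data, and a covering signature needs only one matching tuple to contribute all of its properties.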
35Quarry Query Processing
- Val(B, rho)
- B = (region=estuary and day=136 and variable=salt)
- let cover = {region, day, variable}
- Ans = {}
- for Sig in Signatures
-   if cover is a subset of Sig
-     for tuple in Extent(Sig)
-       if B(tuple)
-         insert tuple[rho] in Ans
select rho from Extent(Sig1) where B union select rho from Extent(Sig2) where B
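One way to realize Val(B, rho) is to generate a UNION query over the covering signature tables, as sketched below with `sqlite3`. The table names, sample rows, and the `val` helper are hypothetical. The sketch also requires rho itself to be a column of each covering signature, which the pseudocode leaves implicit.

```python
import sqlite3

# Hypothetical signature tables: table name -> (columns, rows).
SIGS = {
    "sig1": (("rsrc", "region", "variable", "day"),
             [("a", "estuary", "salt", "136"), ("b", "plume", "salt", "140")]),
    "sig2": (("rsrc", "region", "variable", "day", "depth"),
             [("c", "estuary", "salt", "137", "7"),
              ("d", "estuary", "salt", "136", "5")]),
    "sig3": (("rsrc", "channel", "variable"), [("e", "north", "salt")]),
}

def val(conn, sigs, B, rho):
    """Val(B, rho): unique values of rho over tuples that satisfy B."""
    needed = set(B) | {rho}
    covering = [name for name, (cols, _) in sigs.items() if needed <= set(cols)]
    if not covering:
        return []
    # Column and table names come from our own catalog, so they are trusted here.
    where = " AND ".join(f"{p} = ?" for p in B) or "1 = 1"
    # UNION (not UNION ALL) removes duplicate values across signatures.
    sql = " UNION ".join(
        f"SELECT {rho} FROM {name} WHERE {where}" for name in covering
    )
    params = list(B.values()) * len(covering)
    return sorted(row[0] for row in conn.execute(sql, params))

conn = sqlite3.connect(":memory:")
for name, (cols, rows) in SIGS.items():
    conn.execute(f"CREATE TABLE {name} ({', '.join(c + ' TEXT' for c in cols)})")
    conn.executemany(
        f"INSERT INTO {name} VALUES ({', '.join('?' * len(cols))})", rows
    )

days = val(conn, SIGS, {"region": "estuary", "variable": "salt"}, "day")
print(days)  # ['136', '137']
```

The generated statement has one branch per covering signature, so its size grows with the number of signatures that cover the query.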
36Experimental Results
- YARS: Yet Another RDF Store
- Several B-Tree indexes to support
- spo, po -> s, os -> p, etc.
- Reports of YARS outperforming Redland and Sesame
- 3M triples, single term queries
- We looked at multi-term queries
?s <p0> <o0> . ?s <p1> <o1> . ... ?s <pn> <on> .
37Experimental Results Queries
3.6M triples, 606k resources, 149 signatures
38Frequent YARS Access Plan
?s LANGUAGECODE en . ?s DESCRIPTIONTYPE 2 . ?s UMLSAUI A3711025 . ?s string Sodium_lactate_0.16_molar_infusion .
- po -> s lookup: UMLSAUI A3711025 -> <s>
- spo lookup: <s> DESCRIPTIONTYPE 2
- spo lookup: <s> LANGUAGECODE en
- spo lookup: <s> string Sodium
Choice of first lookup can be important
39YARS Plan Speed
Plot: time (s) against cardinality of first join
40Scaling Up
- Queries covered by many signatures can be
inefficient
SELECT orig_code FROM sig1 WHERE va_class_name = 'DE820'
UNION SELECT orig_code FROM sig2 WHERE va_class_name = 'DE820'
UNION SELECT orig_code FROM sig3 WHERE va_class_name = 'DE820'
UNION SELECT orig_code FROM sig4 WHERE va_class_name = 'DE820'
41Scaling Up
- Merge: S1(a,b,c) and S2(a,b,d) -> S12(a,b,c,d)
- pad with nulls
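A sketch of the merge step, assuming signatures are tuples of property names and extents are lists of value tuples in the same order (the resource id column is left out for brevity). The `merge_signatures` name is hypothetical.

```python
def merge_signatures(sig1, rows1, sig2, rows2):
    """Merge two signature extents into one wider table, padding with None."""
    merged_sig = tuple(sorted(set(sig1) | set(sig2)))
    merged_rows = []
    for sig, rows in ((sig1, rows1), (sig2, rows2)):
        for row in rows:
            record = dict(zip(sig, row))
            # Properties absent from the source signature become None (NULL).
            merged_rows.append(tuple(record.get(p) for p in merged_sig))
    return merged_sig, merged_rows

S1 = ("a", "b", "c")
S2 = ("a", "b", "d")
sig12, rows12 = merge_signatures(S1, [(1, 2, 3)], S2, [(4, 5, 6)])
print(sig12)   # ('a', 'b', 'c', 'd')
print(rows12)  # [(1, 2, 3, None), (4, 5, None, 6)]
```

Merging trades fewer UNION branches for sparser tables, so it pays off mainly when the merged signatures share most of their properties.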
42Scaling Up
- Extract
- Find commonly accessed property sets and materialize them separately
- S1(a,b,c), S2(a,b,d) -> S1(a,b,c), S2(a,b,d), Sab(a,b)
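The extract step can be sketched as a projection over every signature that covers the chosen property set. The `extract` helper and the sample rows are hypothetical; the original extents are left in place, as on the slide.

```python
def extract(extents, common):
    """Materialize the projection onto `common` from every covering signature."""
    common = tuple(common)
    projected = []
    for sig, rows in extents.items():
        if not set(common) <= set(sig):
            continue  # this signature lacks one of the common properties
        idx = [sig.index(p) for p in common]
        projected.extend(tuple(row[i] for i in idx) for row in rows)
    return common, projected

EXTENTS = {
    ("a", "b", "c"): [(1, 2, 3), (7, 8, 9)],
    ("a", "b", "d"): [(4, 5, 6)],
}
sab, sab_rows = extract(EXTENTS, ("a", "b"))
print(sab_rows)  # [(1, 2), (7, 8), (4, 5)]
```

Queries that touch only a and b can then read Sab alone instead of a union over S1 and S2.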
43Related Work
- RDF: Redland, Sesame, Jena, YARS, Forth, KAON
- Primarily Indexed Triple Stores
- Path Indexes: Lorel, DataGuides
- Data Mining for Structure
- Ding, Wilkinson, Sayers, Kuno at HP Labs: Application-specific Schema Design for Storing Large RDF Datasets
45Data Management Solutions
Figure: data management solutions placed on two axes, Administrative Proximity (Near to Far) and Semantic Integration (Low to High). Labels shown: Web Search, Virtual Organization, Enterprise Portal, Ontology, Data Integration System, Desktop Search, Scientific Repository, DBMS.
Diagram adapted, with permission, from Figure 1 in the paper From Databases to Dataspaces: A New Abstraction for Information Management, Michael Franklin, Alon Halevy, David Maier, SIGMOD Record, December 2005.
46Query Languages
- RQL
- RDQL
- RDFQL
- RxPath
- N3
- SeRQL
- Triple
- Versa
47Facts
- Environmental Observation and Forecasting System
- 7.5M triples describing 1M files
- Integrated Pharmacological Database
- 23M triples describing 0.6M concepts
48Dataspace Components
- Catalog and Browse
- Search and Query
- Global Query
- Structured Query
- Provenance Query
- Continuous Query (Monitoring)
- Local Store and Index
- Discovery
- Source Extension
49Scaling Way Up
50Integrity Constraints and Normalization
51Growing a Query Language
- Desc(k)
- Prop(B)
- Val(B, p)
52Pharmacological Database
- Signature <=> ptty in 85% of the cases