Title: Emergent Semantics: Towards Self-Organizing Scientific Metadata
1Emergent Semantics: Towards Self-Organizing Scientific Metadata
- Bill Howe, David Maier
- Oregon Health and Science University
2- The file anim-sal_estuary_7.gif is a data
product derived from the output of the ELCIRC
simulation program run for the period January
8-15 2002. The image shows salinity (practical
salinity units) in the estuary region of the
domain. It's actually an animation, where each
frame is a horizontal slice 7 meters below the
mean sea level. There are 96 frames, each
representing 15 minutes.
program ELCIRC
simStart 1/8/02
simEnd 1/15/02
region estuary
variable salinity
timesteps 96
plottype animation
3Environmental Observation and Forecasting System
- Daily forecasts and 1000s of ad hoc hindcasts
- One simulation involves 20k files
- inputs, parameters, outputs, derived data
products
- This scale mandates
- query access rather than simple filesystem browsing
- automation everywhere
4Tasks
- Collect metadata.
- Organize collected metadata.
- Publish organized metadata for querying.
5Challenges
- Metadata is scattered
- in file paths
- within file headers
- in nearby files
- Metadata requirements change frequently
- new simulation codes
- new data product types
- new users, internal and external
[Figure: the file /anim-sal_estuary_7.gif annotated with its descriptors: Depth 7, Variable Salinity, Type Animation, Region Estuary]
6Obvious Solution
- Data Managers work with Domain Experts
- design a relational schema, load data, test, repeat
- But
- Large up-front cost to DB design
- Slow return on investment
- Use cases unknown
- Significant change is anticipated
- DB languages/APIs not necessarily within scientists' skill set
[Figure labels: file, data product, region]
7Alternative Solution Steps 1-3
- Harvest metadata via simple collection scripts written by the domain experts
- Use RDF as a schema-independent metadata representation
- Use RDBMS technology for storage and management
[Figure: filesystem -> 1. collection scripts -> 2. rdf -> 3. db]
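A collection script of the kind described in step 1 can be very small. The sketch below is only an illustration: it assumes a hypothetical file naming convention of the form `<plottype>-<variable>_<region>_<depth>.<ext>` (suggested by, but not confirmed by, the example file name in these slides), and the names `FILENAME_PATTERN`, `EXPANSIONS`, and `collect` are invented for this example. It turns one file path into (subject, property, object) triples.

```python
import re
from pathlib import PurePosixPath

# Assumed naming convention (hypothetical): <plottype>-<variable>_<region>_<depth>.<ext>
FILENAME_PATTERN = re.compile(
    r"^(?P<plottype>[a-z]+)-(?P<variable>[a-z]+)_(?P<region>[a-z]+)_(?P<depth>\d+)\.\w+$"
)

# Assumed expansions of abbreviations that appear in file names.
EXPANSIONS = {"anim": "animation", "sal": "salinity"}


def collect(path):
    """Return (subject, property, object) triples harvested from one file path."""
    name = PurePosixPath(path).name
    match = FILENAME_PATTERN.match(name)
    if match is None:
        # Files that do not follow the convention contribute no metadata.
        return []
    subject = "file://" + path.lstrip("/")
    return [
        (subject, prop, EXPANSIONS.get(value, value))
        for prop, value in match.groupdict().items()
    ]


if __name__ == "__main__":
    for triple in collect("/forecasts/2003-184/images/anim-sal_estuary_7.gif"):
        print(triple)
```

The script knows nothing about the database schema; it only emits triples, which is what keeps the interface to the data managers narrow.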
8A Narrower Interface
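[Figure, wide interface: filesystem and rich schema, connected through SQL statements, database APIs, load strategies, and data formats/models]
[Figure, narrow interface: filesystem and generic schema, connected through collection scripts and RDF triples]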
9Generic RDF Schema
subject                                                   property           object
file://forecasts/2003-184/images/anim-sal_estuary_7.gif   property:region    estuary
file://forecasts/2003-184/images/anim-sal_estuary_7.gif   property:variable  salt
file://forecasts/2003-184/images/anim-sal_estuary_7.gif   property:plottype  animation
file://forecasts/2003-184/images/anim-sal_estuary_7.gif   property:source    file://forecasts/2003-184/run/1_salt.63
10Is Generic RDF Good Enough?
- Find files with region, plottype, and variable
descriptors
    SELECT r.subject AS file, r.object AS region,
           p.object AS plottype, v.object AS variable
    FROM statements r, statements p, statements v
    WHERE r.subject = p.subject
      AND p.subject = v.subject
      AND r.property = 'property:region'
      AND p.property = 'property:plottype'
      AND v.property = 'property:variable'
A 3-way self-join!
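To see what this query costs in practice, it can be run against a toy `statements` table. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the file names, the sample triples, and the use of bare property names (`region` rather than a prefixed form) are assumptions made for the example, and SQLite merely stands in for whatever RDBMS is actually used.

```python
import sqlite3

# Sample triples (hypothetical): fileB.gif lacks a 'variable' descriptor.
TRIPLES = [
    ("fileA.gif", "region", "estuary"),
    ("fileA.gif", "plottype", "animation"),
    ("fileA.gif", "variable", "salt"),
    ("fileB.gif", "region", "far"),
    ("fileB.gif", "plottype", "isoline"),
]

# One alias of the statements table per requested descriptor.
QUERY = """
SELECT r.subject AS file, r.object AS region,
       p.object AS plottype, v.object AS variable
FROM statements r, statements p, statements v
WHERE r.subject = p.subject
  AND p.subject = v.subject
  AND r.property = 'region'
  AND p.property = 'plottype'
  AND v.property = 'variable'
"""


def find_described_files(triples):
    """Load triples into a generic statements table and run the self-join query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE statements (subject TEXT, property TEXT, object TEXT)"
    )
    conn.executemany("INSERT INTO statements VALUES (?, ?, ?)", triples)
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(find_described_files(TRIPLES))
```

Only `fileA.gif` is returned, because the join silently drops any file missing one of the three descriptors. Each additional descriptor adds another alias of `statements` to the query.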
11Decomposed Data
- So we can query the RDF directly, but
- no grouping structures to aid query formulation and processing.
- Automatically infer groupings from the RDF data, observing that related files often share signatures.
- Let users impose groupings using a web interface (like views)
[Figure: filesystem -> db holding triples such as ... <isofar.gif, type, isoline>, <isofar.gif, region, far>, <animsal.gif, timesteps, 10>, <animsal.gif, var, salt>, ... with inferred groupings labeled plot and animation]
12Alternative Solution Steps 4-6
- Partition descriptors into equivalence classes based on file signatures
- Expose signatures via the web to facilitate browsing and querying
- Recompute signature extents as new metadata is integrated
[Figure: db -> 4. partition data -> 5. publish to the web -> website -> 6. query and browse via profiles]
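One way a signature could aid query formulation is by generating the self-join automatically, so that users pick a set of properties instead of writing SQL. The sketch below is a guess at how that might look, not the mechanism used in the system described here: `profile_query` and `files_with_signature` are hypothetical helpers, the `statements(subject, property, object)` table layout is taken from the generic RDF schema slide, and SQLite is used only because it ships with Python.

```python
import sqlite3


def profile_query(properties):
    """Build a pivot query with one column per property in the given set."""
    props = sorted(properties)
    if not props:
        raise ValueError("a signature needs at least one property")
    aliases = [f"t{i}" for i in range(len(props))]
    # Double any embedded quotes so property names are safe as column labels.
    columns = ", ".join(
        '{}.object AS "{}"'.format(alias, prop.replace('"', '""'))
        for alias, prop in zip(aliases, props)
    )
    tables = ", ".join(f"statements {alias}" for alias in aliases)
    conditions = [f"{alias}.property = ?" for alias in aliases]
    conditions += [f"{aliases[0]}.subject = {alias}.subject" for alias in aliases[1:]]
    sql = (
        f"SELECT {aliases[0]}.subject AS file, {columns} "
        f"FROM {tables} WHERE {' AND '.join(conditions)}"
    )
    return sql, props


def files_with_signature(conn, properties):
    """Return one row per file that carries every property in the set."""
    sql, params = profile_query(properties)
    return conn.execute(sql, params).fetchall()
```

Note that this returns every file whose property set includes the requested properties, so files with a larger signature also appear. Restricting the result to an exact signature match would need an extra condition.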
13- The set of properties defined for a particular
file
14Signatures
- A file's signature is just the set of properties used to describe it.
- If signatures were fixed, we might derive a relational schema from them. Instead, we need to respond to changes
[Figure: 4. partition data: db -> find signatures -> compute signature extents]
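The two boxes in this step, finding signatures and computing their extents, amount to grouping files by the set of properties asserted about them. A minimal sketch, assuming triples are available as Python tuples (the function name `compute_signature_extents` is made up for this example, and the sample triples are the ones shown on the Decomposed Data slide):

```python
from collections import defaultdict


def compute_signature_extents(triples):
    """Map each signature (a frozenset of properties) to the set of files having it."""
    properties_by_file = defaultdict(set)
    for subject, prop, _obj in triples:
        properties_by_file[subject].add(prop)

    extents = defaultdict(set)
    for subject, properties in properties_by_file.items():
        # Files with identical property sets fall into the same equivalence class.
        extents[frozenset(properties)].add(subject)
    return dict(extents)


if __name__ == "__main__":
    sample = [
        ("isofar.gif", "type", "isoline"),
        ("isofar.gif", "region", "far"),
        ("animsal.gif", "timesteps", "10"),
        ("animsal.gif", "var", "salt"),
    ]
    for signature, files in compute_signature_extents(sample).items():
        print(sorted(signature), sorted(files))
```

Because the grouping is recomputed from the triples each time, nothing has to be migrated by hand when a file gains or loses a property.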
15Example: Consolidate Files with Similar Signatures
- Modify schema (DM)
- Transfer tuples from A to B (DM)
- Modify collection programs
- Modify extraction routines (DE)
- Modify Internal organization (DE)
- Modify SQL statements (DM)
16Alternative
- Change two lines in a collection script (DE)
- Assert(fileA, animation, )
- Assert(fileA, plottype, animation)
- Assert(fileB, plottype, animation)
- Reload data (Automatic)
- Recompute Signatures (Automatic)
- Republish data (Automatic)
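The effect of such a script change can be traced in a few lines. The sketch below reads the slide as replacing the `Assert(fileA, animation, )` line with `Assert(fileA, plottype, animation)`; that reading is an assumption, as is the empty string used for the object the slide leaves blank. `TripleStore`, `run_collection`, `old_script`, and `new_script` are hypothetical names, and republishing is not modeled.

```python
from collections import defaultdict


class TripleStore:
    """Toy in-memory stand-in for the database of RDF statements."""

    def __init__(self):
        self.triples = set()

    def assert_fact(self, subject, prop, obj):
        self.triples.add((subject, prop, obj))

    def signature_extents(self):
        properties = defaultdict(set)
        for subject, prop, _obj in self.triples:
            properties[subject].add(prop)
        extents = defaultdict(set)
        for subject, signature in properties.items():
            extents[frozenset(signature)].add(subject)
        return dict(extents)


def run_collection(script):
    """Reload from scratch with the given script, then recompute signatures."""
    store = TripleStore()
    script(store.assert_fact)
    return store.signature_extents()


def old_script(assert_fact):
    assert_fact("fileA", "animation", "")  # object left blank, as on the slide
    assert_fact("fileB", "plottype", "animation")


def new_script(assert_fact):
    assert_fact("fileA", "plottype", "animation")
    assert_fact("fileB", "plottype", "animation")


if __name__ == "__main__":
    print(run_collection(old_script))  # two signatures, one file each
    print(run_collection(new_script))  # one signature covering both files
```

With the old script the two files land in different signatures. After the edit they share one, and no schema change, tuple transfer, or SQL rewrite is involved.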
17Benefits
- Narrow interface between data creators and data managers
- Metadata exploitable prior to finalizing a thorough schema
- Derived schema can adapt to changing requirements automatically
- Profiles constitute emergent semantics: meaning is assigned after data is collected.