Title: RDF as a Lingua Franca: Key Architectural Strategies
1 RDF as a Lingua Franca: Key Architectural Strategies
- David Booth, Ph.D.
- Cleveland Clinic (contractor)
- Semantic Technology Conference
- 15-June-2009
- Latest version of these slides: http://dbooth.org/2009/stc/
2 About the speaker
- Senior Software Architect, Cleveland Clinic's SemanticDB project
- Senior research architect, HP Software
- W3C GRDDL standard
- W3C Fellow 2002-2005
- W3C Web Services Architecture document
- W3C WSDL 2.0 standard
- AT&T Bell Labs
- Ph.D. Computer Science, UCLA
3 Outline
- Part 1: The Problem
- Babelization
- SOA and RDF
- Part 2: Architectural Strategies
- RDF message semantics
- GRDDL transformations from XML to RDF
- REST-based SPARQL endpoints
- Semantic Data Federation
- Named graphs
- Monotonicity
- Part 3: Example: Cleveland Clinic SemanticDB
4 PART 1: The Problem
5 Problem 1: Babelization
- Proliferation of data models (XML schemas, etc.)
- Parsing issues influence data models
- No consistent semantics
- Data chaos
Tower of Babel, Abel Grimmer (1570-1619)
6 Problem 2: Integration complexity
- Many data producers, many data consumers
- Producers and consumers interact in complex ways
- Tight coupling hampers independent versioning . . .
7 Problem 3: Client/service versioning
- Need to version clients and services independently
- Data models evolve
- There is no such thing as "the" data model
- There are several slightly different but related models
8 RDF and SOA
- RDF can help
- Bridge vocabularies / data formats
- Looser data coupling
- Consistent semantics across applications
- SOA can help
- Looser process coupling
- How?
9 PART 2: Architectural Strategies
10 1. RDF message semantics
- Interface contract can specify RDF, regardless of serialization
- RDF pins the semantics
11 But Web services use XML!
- XML is well known and used
- Existing apps may require specific XML or other formats that cannot be changed
- How can we gain the benefits of RDF message semantics while still accommodating XML?
12 Custom XML serializations of RDF
- Recall: RDF is syntax independent
- Specifies info model -- not syntax!
- Can be serialized in any agreed-upon way
- Therefore:
- Can view existing XML formats as custom serializations of RDF!
- How? GRDDL . . .
13 What is GRDDL?
- "Gleaning Resource Descriptions from Dialects of Languages"
- W3C standard
- Permits RDF to be "gleaned" from XML
- XML document or schema specifies GRDDL transformation
- GRDDL transformation produces RDF from XML document
- Transformation is typically written in XSLT
14 2. GRDDL transformations from XML to RDF
- Therefore:
- Same XML document can be consumed by:
- Legacy XML app
- RDF app
- App interface contract can specify RDF
- Serializations can vary
- Semantics are pinned by RDF
- Helps bridge XML and RDF worlds
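As a rough illustration of gleaning RDF from an existing XML message, the sketch below uses Python's standard library in place of the XSLT that a GRDDL transformation would typically use. The XML element names, the http://example.org/ namespace, and the helper names glean_triples and to_ntriples are assumptions made for this example; they are not part of GRDDL or of any system described in these slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical legacy XML message; element names are illustrative only.
XML_DOC = """
<patient id="Patient123">
  <name>Jane Doe</name>
  <bloodPressure>150/96</bloodPressure>
</patient>
"""

BASE = "http://example.org/"  # assumed namespace for minted URIs


def glean_triples(xml_text):
    """Return a set of (subject, predicate, object) triples gleaned from the XML.

    Each child element of the root becomes one triple whose object is the
    element's text content, treated as a plain literal.
    """
    root = ET.fromstring(xml_text)
    subject = BASE + root.attrib["id"]
    triples = set()
    for child in root:
        triples.add((subject, BASE + child.tag, (child.text or "").strip()))
    return triples


def to_ntriples(triples):
    """Serialize the triples as N-Triples lines (minimal literal escaping only)."""
    lines = []
    for s, p, o in sorted(triples):
        escaped = o.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'<{s}> <{p}> "{escaped}" .')
    return "\n".join(lines)


triples = glean_triples(XML_DOC)
print(to_ntriples(triples))
```

The legacy XML app keeps reading XML_DOC unchanged, while an RDF app works from the gleaned triples, so both consume the same document.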
15 Bridging XML and RDF
[Diagram labels: Service; Normalize to RDF; XML/other; Core App Processing; Client; Serialize as XML/other/RDF]
- Input: Accept whatever formats are required
- Use GRDDL to transform XML to RDF
- Output: Serialize to whatever formats are required
- Generate XML/other directly (or even RDF!), or
- SPARQL query can generate specific view first
16 3. REST-based SPARQL endpoints
[Diagram labels: Consumer; Producer; HTTP; SPARQL; RDF]
17 What is REST?
- REST: Representational State Transfer
- Architectural style
- Identified by Roy Fielding in PhD thesis
- Based on uniform interface
- HTTP GET, PUT, POST, DELETE
18 Why REST?
- HTTP is ubiquitous
- Simpler than SOAP-based Web services (WS)
- Looser process coupling
- Easier to change/version the process flow
19 What is SPARQL?
- W3C standard
- Query language for RDF
- Modeled after SQL
- SELECT ...
- WHERE ...
20 Why SPARQL?
- RDF gives looser data coupling
- Insulates consumers from internal model changes
- Inferencing can transform data to consumer's desired model
- One endpoint supports multiple consumer needs
- Each consumer gets what it wants
- Simpler interface for consumers
- Uniform SPARQL interface instead of a different set of parameters for each REST endpoint
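A minimal sketch of the uniform interface: under the SPARQL protocol, a consumer sends the query text as the "query" parameter of an HTTP GET. The endpoint URL, the ex: vocabulary, and the helper name build_sparql_request are hypothetical. The request is only constructed here, not sent.

```python
from urllib.parse import urlencode, urlsplit, parse_qs
from urllib.request import Request

ENDPOINT = "http://example.org/sparql"  # hypothetical endpoint URL

# Hypothetical vocabulary; any consumer-specific query could go here.
QUERY = """
PREFIX ex: <http://example.org/>
SELECT ?patient
WHERE { ?patient ex:highBloodPressure true }
"""


def build_sparql_request(endpoint, query):
    """Build (but do not send) an HTTP GET request carrying a SPARQL query."""
    url = endpoint + "?" + urlencode({"query": query})
    headers = {"Accept": "application/sparql-results+xml"}
    return Request(url, headers=headers, method="GET")


request = build_sparql_request(ENDPOINT, QUERY)
print(request.get_method(), request.full_url)
```

Every consumer uses the same single parameter; only the query text differs, which is what lets one endpoint serve many consumer needs.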
21 4. Semantic Data Federation
[Diagram labels: A1, A2, A3, B1, B2, C1, C2, X, Z; Adapters; Semantic Data Federation; SPARQL]
- Get data from multiple sources
- Provide data to consumers
- Model transformation, caching, etc.
- Conceptual component -- not necessarily a
separate service
22 Key features of semantic data federation
- REST-based SPARQL endpoint
- Client gets just the data it wants
- Support for a variety of data sources
- E.g., SQL, SPARQL(!), etc.
- Easy to add a new data source adapter, e.g., HTTP
- Caching
- Not multiple masters
- Inferencing
- Provides loose coupling at both data and process levels
23 Why inferencing?
- Allows new data sources to be more readily connected to existing data
- Allows new output vocabularies to be more readily supported in response to client needs
- Easier versioning with both clients and data sources
- Inferencing can help bridge across versions
24 Data source adapters
[Diagram labels: Semantic Data Federation; SPARQL; Adapters]
- Responsible for:
- Mechanics of getting the data
- Transforming from native format to RDF
- May involve custom code or reusable tools
- E.g., Gloze performs XML <--> RDF lift/drop
25 Add a new data source
[Diagram labels: Semantic Data Federation; SPARQL; Adapter; Adapter; Data Source]
- Strategy:
- Adapter transforms native format to corresponding RDF
- Not directly to hub ontology!
- Bridging rules transform to hub ontology
26 Adding a new output vocabulary
[Diagram labels: Semantic Data Federation; SPARQL; Adapter; Data Source; Client]
- Strategy:
- Bridging rules transform from hub ontologies to new output vocabulary
- Client can query using desired vocabulary
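The two bridging steps (source vocabulary to hub ontology, then hub ontology to a client's output vocabulary) can be sketched with simple predicate-mapping rules. This is a deliberately simplified stand-in for real inferencing: the reg:, hub:, and client: vocabularies, the data, and the function names are all hypothetical.

```python
# Triples are (subject, predicate, object) tuples; all names are hypothetical.
SOURCE_TRIPLES = {
    ("ex:Patient123", "reg:sysBP", 150),
    ("ex:Patient123", "reg:diaBP", 96),
}

# Bridging rules: a predicate in one vocabulary maps to its equivalent in another.
SOURCE_TO_HUB = {
    "reg:sysBP": "hub:systolicPressure",
    "reg:diaBP": "hub:diastolicPressure",
}
HUB_TO_CLIENT = {
    "hub:systolicPressure": "client:systolic",
    "hub:diastolicPressure": "client:diastolic",
}


def apply_bridging_rules(triples, rules):
    """Return the input triples plus inferred ones; existing triples are never removed."""
    inferred = {(s, rules[p], o) for s, p, o in triples if p in rules}
    return triples | inferred


def query(triples, predicate):
    """Return sorted (subject, object) pairs for one predicate."""
    return sorted((s, o) for s, p, o in triples if p == predicate)


# Adapter output is bridged to the hub, then to the client's vocabulary.
hub_view = apply_bridging_rules(SOURCE_TRIPLES, SOURCE_TO_HUB)
client_view = apply_bridging_rules(hub_view, HUB_TO_CLIENT)
print(query(client_view, "client:systolic"))
```

Because the adapter emits RDF that corresponds to the native format and the rules live separately, adding a source or an output vocabulary means adding a rule table rather than rewriting adapters.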
27 5. Named graphs
- Different queries require different subsets of data
- Entire data may be too big to process all at once
- So . . .
- Sets of RDF data can be bundled as named graphs
- Query strategy can pull in only the named graphs that are needed, i.e., a working set
- Graphs can be freely merged
- Contents can overlap
28 Using named graphs for data subsets
- Examples:
- Specific longitudinal data across patients
- Detailed data for each surgical event
- Data on a particular group of patients
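A minimal sketch of assembling a working set, assuming each named graph is modeled as a Python set of triples keyed by graph name. The graph names, triples, and the working_set helper are hypothetical.

```python
# Each named graph is a set of (subject, predicate, object) triples.
# Graph names and contents are hypothetical.
NAMED_GRAPHS = {
    "graph:bp-longitudinal": {
        ("ex:Patient123", "ex:systolic", 150),
        ("ex:Patient456", "ex:systolic", 118),
    },
    "graph:surgery-event-1": {
        ("ex:Surgery1", "ex:patient", "ex:Patient123"),
    },
    "graph:cohort-a": {
        ("ex:Patient123", "ex:memberOf", "ex:CohortA"),
        ("ex:Patient123", "ex:systolic", 150),  # overlaps with bp-longitudinal
    },
}


def working_set(graphs, names):
    """Merge only the requested named graphs into one set of triples."""
    merged = set()
    for name in names:
        merged |= graphs[name]
    return merged


ws = working_set(NAMED_GRAPHS, ["graph:bp-longitudinal", "graph:cohort-a"])
print(len(ws))
```

Set union makes the merge safe when graph contents overlap: the shared triple appears once, and the surgical-event graph is never loaded.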
29 6. Monotonicity
- Monotonicity: Old conclusions remain true when new facts are added
- System design choice, not automatic
- Without monotonicity:
- Data change invalidates everything downstream
- System is more tightly coupled
- Different components must be versioned in lock step
- With monotonicity:
- New data can be added freely
- Easier versioning
- More robust
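A minimal sketch of the monotonic design choice, assuming facts are stored as timestamped assertions in an append-only set. The MonotonicStore class and its method names are hypothetical; the patient, property, and timestamps mirror the example on the following slide.

```python
from datetime import datetime


class MonotonicStore:
    """Append-only fact store: facts can be added but never changed or removed."""

    def __init__(self):
        self._facts = set()

    def add(self, subject, predicate, value, at):
        """Record that (subject, predicate, value) held at time `at`."""
        self._facts.add((subject, predicate, value, at))

    def holds(self, subject, predicate, value, at):
        """Return True if this exact timestamped assertion has been recorded."""
        return (subject, predicate, value, at) in self._facts


store = MonotonicStore()
t1 = datetime(2007, 8, 23, 12, 22)
t2 = datetime(2007, 8, 24, 16, 5)

store.add("Patient123", "highBloodPressure", True, t1)
conclusion_before = store.holds("Patient123", "highBloodPressure", True, t1)

# A later, contradictory-looking reading is added as a new fact.
store.add("Patient123", "highBloodPressure", False, t2)
conclusion_after = store.holds("Patient123", "highBloodPressure", True, t1)

print(conclusion_before, conclusion_after)
```

Because each assertion carries its own timestamp, the newer reading does not overwrite the older one, so anything derived from the earlier fact stays valid.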
30 Monotonicity is valuable, but not free!
- Data models can be simpler without monotonicity
- Engineering trade-off
- Non-monotonic design:
- Patient123 highBloodPressure true
- Monotonic design:
- Patient123 highBloodPressure true at 12:22 PM 23-Aug-2007
- Patient123 highBloodPressure false at 04:05 PM 24-Aug-2007
- How to get the best of both worlds?
31 Distilling data to simplify queries
- Detailed raw data can be distilled into simpler assertion sets
- Easier for specific queries
- Example raw data:
- Patient123 BP 150/96 at 12:22 PM 23-Aug-2007
- Patient123 BP 155/97 at 06:32 PM 23-Aug-2007
- Patient123 BP 155/97 at 06:32 PM 23-Aug-2007
- Distilled for 23-Aug-2007:
- Patient123 highBloodPressure true
- Meaning: Patient123 had high blood pressure at some time
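The distillation step can be sketched as a function that reads raw timestamped readings and emits a simpler assertion set for one day, labeled with a graph name that carries the date as context. The 140/90 threshold is an assumption made for this example (the slides do not define "high"), and the graph naming scheme and the distill function are hypothetical. The two distinct readings from the slide are used as input.

```python
from datetime import date, datetime

# Raw readings: (patient, systolic, diastolic, time of measurement).
RAW_READINGS = [
    ("Patient123", 150, 96, datetime(2007, 8, 23, 12, 22)),
    ("Patient123", 155, 97, datetime(2007, 8, 23, 18, 32)),
]

# ASSUMPTION: illustrative threshold only; the slides do not define "high".
SYSTOLIC_LIMIT = 140
DIASTOLIC_LIMIT = 90


def distill(readings, day):
    """Return (graph_name, triples) summarizing high blood pressure on `day`.

    The graph name acts as context: a triple in it means "at some time
    on this day", which is looser than the raw timestamped data.
    """
    graph_name = f"graph:distilled-{day.isoformat()}"
    triples = set()
    for patient, systolic, diastolic, when in readings:
        is_high = systolic >= SYSTOLIC_LIMIT or diastolic >= DIASTOLIC_LIMIT
        if when.date() == day and is_high:
            triples.add((patient, "highBloodPressure", True))
    return graph_name, triples


name, distilled = distill(RAW_READINGS, date(2007, 8, 23))
print(name, distilled)
```

RAW_READINGS is left untouched, so the raw data remains available for queries that the distilled graph cannot answer.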
32 Using named graphs for distilled data
- Distilled data:
- Easier for specific queries
- Less general than raw data
- May involve information loss
- Named graph can act as context
- Semantics are qualified (or loosened)
- E.g., named graph for 23-Aug-2007 indicates Patient123 had high blood pressure at some time
- SPARQL update language (SPARUL) will make named graphs easy to create from queries
- Raw data should also be kept (in separate named graphs)
33 Adding named graphs for distilled data
[Diagram labels: Named graphs of distilled data; Raw data]
- Is obese
- Had high blood pressure prior to admission
- Has condition X
34 Abandoning unneeded named graphs
[Diagram labels: Named graphs of distilled data; Raw data]
- Unneeded named graphs can be ignored
- And eventually discarded
35 Summary of monotonicity strategy
- Don't change data!
- Create new named graphs instead
- Use named graphs to compartmentalize data
- But if you must change data:
- Use named graphs to limit downstream impact
- Only regenerate those that are affected
- Retain both raw data and distilled data (in separate named graphs)
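One way to limit downstream impact when raw data must change is to record which raw graphs each distilled graph was derived from, then regenerate only the dependents of what changed. The sketch below assumes such a dependency map is maintained; the graph names and the affected_graphs helper are hypothetical.

```python
# Which raw named graphs each distilled named graph was derived from.
# All graph names are hypothetical.
DEPENDS_ON = {
    "graph:distilled-bp-2007-08-23": {"graph:raw-bp"},
    "graph:distilled-obesity": {"graph:raw-weight", "graph:raw-height"},
    "graph:distilled-condition-x": {"graph:raw-labs"},
}


def affected_graphs(depends_on, changed_raw):
    """Return the distilled graphs that must be regenerated after a raw change."""
    return sorted(
        name for name, sources in depends_on.items() if sources & changed_raw
    )


to_regenerate = affected_graphs(DEPENDS_ON, {"graph:raw-weight"})
print(to_regenerate)
```

Distilled graphs that do not depend on the changed raw graph are left alone, so consumers of those graphs see no change.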
36 Summary of architectural strategies
- RDF message semantics
- GRDDL transformations from XML to RDF
- REST-based SPARQL endpoints
- Semantic Data Federation
- Named graphs
- Monotonicity
37 PART 3: Example: Cleveland Clinic SemanticDB
38 SemanticDB Project
- Applies semantic web technology to:
- Clinical research
- Outcomes reporting
- Quality reporting
- Sponsored by Cleveland Clinic's Heart and Vascular Institute
39 Cleveland Clinic SemanticDB Project
[Diagram labels: User interfaces; Ontologies; Cyc natural language processing; Patient Data Entry; Natural language query; SPARQL interface; Structured query; Semantic wiki; Data-source adaptors; Instance data; Patient-centric systems; Patient registry; Genetic patient registry; Tagged literature, e.g., PUBMED; . . .]
40 More information
- Cleveland Clinic SemanticDB project: http://www.w3.org/2001/sw/sweo/public/UseCases/ClevelandClinic/
- RDF and SOA: http://dbooth.org/2007/rdf-and-soa/
- SPARQL: http://jena.sourceforge.net/ARQ/Tutorial/
- GRDDL: http://www.w3.org/TR/grddl-primer/
41 Questions?