Title: Analytical and Data Services Guidelines
1 Analytical and Data Services Guidelines
- Architecture/VCDE Workspaces Joint Face to Face, February 1st-2nd, 2006
Scott Oster, Ohio State University, oster_at_bmi.osu.edu
Patrick McConnell, Duke Comprehensive Cancer Center, patrick.mcconnell_at_duke.edu
2 Overview
- Overview of Data and Analytical Services
- Distinction between Analytical Tool and Analytical Service
- Metadata definition and usage
- Current UML model for service metadata
- Need for harmonization
- Plan to reach consensus
- Leveraging existing data standards in caBIG
- De facto standards into UML
- Bridging caDSR and GME
- Namespace issues (existing standards)
- Connecting CDEs and Schema types
3 caBIG Services
[Figure: caBIG services diagram. Labels: Analytical Service, Grid Data Service, Grid-Enabled Client, Grid Portal, Tools 1-4, Research Center (two instances), NCICB]
4 caBIG Services
- Data Services
- Data services present an object view of data sources
- Objects exposed as data services comply with common data elements registered in the caDSR/EVS, and are transported as XML using schema types registered in GME
- Currently query only (no update, insert, or delete)
- Analytical Services
- Analytical Services are base Globus services
- Required to be strongly typed with respect to input and output
- Analytical service input and output objects conform to registered classes in caDSR and schema types registered in GME
- Graphical tool to automatically create source code, configuration files, and build process for new analytical services
- Input and output parameters can be discovered from GME
5 Analytical Tool vs. Analytical Service
- Analytical services provide data back to the grid
- Analytical tools only consume data from the grid
- Examples
- caWorkbench
- RProteomics
6 Analytical Service Guidelines
- Inputs and outputs (parameters) defined by
- Objects with metadata registered in caDSR and
- Objects with XML Schema defined
- Parameters defined as objects, not simple data elements
- a.k.a. no Java primitives
- Provide service-level metadata, the structure of which is defined in the caDSR
- Internal (non-API) classes do not need to be registered in the caDSR
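As a rough illustration of the "parameters defined as objects" guideline, the sketch below shows a service operation whose inputs and output are all objects rather than bare numbers or strings. It is written in Python for brevity, not in the Java used by caGrid, and every name in it (`Spectrum`, `IntensityThreshold`, `PeakList`, `find_peaks`) is hypothetical and stands in for classes that would be registered in caDSR.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Spectrum:
    """Hypothetical input object; stands in for a class registered in caDSR."""
    mz: List[float]
    intensity: List[float]


@dataclass(frozen=True)
class IntensityThreshold:
    """Hypothetical parameter object; wraps a value instead of passing a bare float."""
    value: float


@dataclass(frozen=True)
class PeakList:
    """Hypothetical output object returned to the grid."""
    mz: List[float]


def find_peaks(spectrum: Spectrum, threshold: IntensityThreshold) -> PeakList:
    """Service operation: every parameter and the return value is an object."""
    kept = [
        m for m, i in zip(spectrum.mz, spectrum.intensity)
        if i >= threshold.value
    ]
    return PeakList(mz=kept)
```

Wrapping the threshold in its own class looks heavy for a single number, but it gives the parameter a named type that could carry registered metadata, which a primitive cannot.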
7 Analytical Tool Guidelines
- Inputs defined by
- Objects with metadata registered in caDSR and
- Objects with XML Schema defined
- No output types need be defined in the caDSR
- No service-level metadata need be provided
- Internal (non-API) classes do not need to be registered in the caDSR
8 Analytical service and tool open questions
- Tools that are provided as an API in a programming language
- Example: Q5
- Should tools be a dead end for data?
- Many tools can output well-defined, standards-based objects
- Example: caWorkbench
- Many tools can abstract analyses into services
- Example: VISDA
- Should analytical service method signatures be reviewed and harmonized?
- Issue raised in interoperability review of RProteomics
- Promote interoperability, plug-and-play analytics
- Provides context by which to evaluate parameter CDEs
9 caBIG Service Description
- Client and service APIs are object oriented, and operate over well-defined and curated data types
- Objects are defined in UML and converted into Administered Components, which are in turn registered in the Cancer Data Standards Repository (caDSR)
- Object definitions draw from vocabulary registered in the Enterprise Vocabulary Services (EVS), and their relationships are thus semantically described
- XML serializations of objects adhere to XML schemas registered in the Global Model Exchange (GME)
- All data in caGrid travel between services, and between clients and services, as XML documents that conform to well-defined schemas stored in GME
10 Current Metadata
- Metadata and Registry Services
- Support for Advertisement and Discovery processes
- Metadata and registry services maintain metadata associated with data and analytical services
- All services register information to an Index Service
- Services can be discovered using semantics of their data types
- Three types of Service Metadata
- Common Metadata describes generic information about the Cancer Center providing the service
- Data Service Metadata describes the data exposed using terminology and objects from caDSR/EVS
- Analytical Service Metadata describes the supported operations and their inputs and outputs using terminology and objects from caDSR/EVS
11 The need for more service-level metadata
- Why?
- Find the service you want (discovery)
- Help understand what a service does (extension of advertisement)
- Types of fields
- Name
- Description with concept
- Keywords
- For high-precision calculations: operating system, hardware
- Contact information
- Method signatures
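To make the field list concrete, here is a minimal sketch of a service-level metadata record and a keyword-based discovery helper. It is not the VCDE proposed model; the class name, field names, and the `discover` function are assumptions chosen only to mirror the bullets above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ServiceMetadata:
    """Hypothetical record mirroring the field types listed on the slide."""
    name: str
    description: str
    concept: str                             # assumed: identifier of a concept describing the service
    keywords: List[str] = field(default_factory=list)
    operating_system: Optional[str] = None   # relevant for high-precision calculations
    hardware: Optional[str] = None
    contact: str = ""
    method_signatures: List[str] = field(default_factory=list)


def discover(registry: List[ServiceMetadata], keyword: str) -> List[ServiceMetadata]:
    """Return the services whose keywords include `keyword` (case-insensitive)."""
    wanted = keyword.lower()
    return [s for s in registry if wanted in (k.lower() for k in s.keywords)]
```

A client could call `discover(registry, "proteomics")` to narrow a list of advertised services before inspecting their method signatures.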
12 VCDE proposed model for service-level metadata
13 VCDE proposed model for service-level metadata (cont.)
14 Service-level metadata next steps
- Form a cross-cutting working group
- Evaluate two models, use cases
- Get input from caGrid team
- Propose model to VCDE, Architecture, caGrid
15 Bringing existing biomedical standards to caBIG
- There is a wealth of existing standards in the biomedical field
- The great thing about standards is that there are so many to choose from
- The problem with standards is that there are so many to choose from
- MAGE-OM/MAGE-ML, BioPAX, mzXML, etc.
- Most standards based on XML Schema
- Or alternate non-UML encodings: RDF, OWL, Protégé, etc.
- Translating XML Schema to well-defined object models in UML is not trivial
- Passing standards-based XML across the grid using the caGrid infrastructure has not been explored
16 Converting from XML Schema to caBIG UML
- Names of classes and attributes fixed by schema (if you actually want to follow the schema)
- Plurals, poor semantics, contain parent name, etc.
- caGrid requires specific namespace to enter GME
- The namespace is probably already defined in the schema
- Extension of simple types (e.g. extending String)
- XML Schema allows such extension, caDSR does not
- Elements can contain both values (text) and sub-elements
- Examples: XHTML, PubMed abstracts
- caCORE SDK compatibility
- id attributes, Collection
- Elements can contain text and have attributes
- Basically an extension of String, but also with attributes
- XML Schema intentionally very hierarchical
- End up with a bunch of empty classes
- XML Schema constructs not supported by UML and/or caDSR
- Example: choice
- Many simple types do not exist in the caDSR
- Duration, int versus integer, etc.
- Collections of primitives
- Cannot model in caDSR with primitive type
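Two of the cases above (an element that has both text and attributes, and an element that mixes text with sub-elements) can be seen in a short sketch using Python's standard `xml.etree.ElementTree`. The element names, the `units` attribute, and the `Measurement` class are hypothetical; the wrapper-class approach is one possible way to model such an element, not a solution proposed in these slides.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Measurement:
    """Hypothetical class for an element that is 'a String with attributes'."""
    value: str   # the element's text content
    units: str   # the element's attribute


def parse_measurement(xml_text: str) -> Measurement:
    """Map text-plus-attribute content onto a class, since a bare string
    has nowhere to keep the attribute."""
    elem = ET.fromstring(xml_text)
    return Measurement(value=(elem.text or "").strip(), units=elem.get("units", ""))


def flatten_mixed(xml_text: str) -> str:
    """Collapse mixed content (text interleaved with sub-elements) to plain
    text. The markup is lost, which is why such content is hard to model."""
    return "".join(ET.fromstring(xml_text).itertext())
```

For example, `parse_measurement('<mass units="Da">445.12</mass>')` keeps both the value and its unit, while `flatten_mixed` on an abstract containing inline markup returns only the running text.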
17 Potential solutions: XSD->XMI
- Preface: XMI->XSD is much easier
- You can even do this with Enterprise Architect
- HyperModel: XSD->UML, UML->XSD
- De facto standard for XSD->UML conversion
- Plugin to Eclipse
- Freely available, but not open source
- XMIGenerator: XSD->UML
- Developed at Duke to address some deficiencies in HyperModel
- Standalone, command-line based application
- Open source, freely available
- XSD->Java->UML
- Many tools to do this, but you will get many artifacts in the UML
18 XSD->JAXB->Java->EA->UML (mzXML)
19 XSD->HyperModel->XMI (mzXML)
20 XSD->HyperModel->XMI (pepXML)
21 XSD->XMIGenerator->XMI (mzXML)
22 Discussion from breakout yesterday
23 Existing Mapping from caDSR to GME
- In caDSR, each project (application) will have its own Classification Scheme (e.g. caCORE). A Classification Scheme may define a subproject, which is represented as a Classification Scheme Item (CSI) (e.g. caBIO). In caGrid 0.5, each CSI had its own schema.
- Each XML schema will be published into the caGrid GME service. As the caDSR ensures semantic interoperability, the GME ensures programmatic data exchange (syntactic) interoperability.
24 From caDSR to GME (cont.)
- The caGrid 0.5 recommendation for assigning schema namespaces for caBIG objects is shown below
- For example
- gme://caTIES.caBIG/3.0/edu.upmc.opi.cabig.caties.document.domain
- This provides a coarse-grained, rule-based mapping from caDSR to GME
caDSR: <Classification Scheme>.<Context>/<Classification Scheme Version>/<Classification Scheme Item>
GME: <domain>/<version>/<name>
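The rule-based mapping can be sketched as a pair of small functions: one that assembles a namespace from the four caDSR parts, and one that splits a namespace back into the three GME parts. The function names are hypothetical, and the `gme://` prefix is an assumption based on the example on this slide.

```python
from typing import Tuple

GME_PREFIX = "gme://"  # assumed scheme prefix, taken from the slide's example


def gme_namespace(classification_scheme: str, context: str,
                  version: str, scheme_item: str) -> str:
    """Build <Classification Scheme>.<Context>/<Version>/<Classification Scheme Item>."""
    return f"{GME_PREFIX}{classification_scheme}.{context}/{version}/{scheme_item}"


def parse_gme_namespace(namespace: str) -> Tuple[str, str, str]:
    """Split a namespace into the GME (domain, version, name) triple."""
    if not namespace.startswith(GME_PREFIX):
        raise ValueError(f"not a GME namespace: {namespace!r}")
    parts = namespace[len(GME_PREFIX):].split("/")
    if len(parts) != 3:
        raise ValueError(f"expected domain/version/name, got: {namespace!r}")
    domain, version, name = parts
    return domain, version, name
```

Called with `"caTIES"`, `"caBIG"`, `"3.0"`, and `"edu.upmc.opi.cabig.caties.document.domain"`, `gme_namespace` reproduces the example namespace shown on the slide.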
25 Connecting the caDSR and GME
- Some applications will need to work at both the CDE level and the XML level
- Examples: workflow engine, translational query system, etc.
- There is no defined link between
- A CDE and an XML element
- A CDE and an XML element or attribute
- Names different
- Attribute or element?
- What about Collection associations?
26 Potential solutions
- Change the caDSR
- Provide a link from each CDE and attribute to the location in the XSD
- Change the GME
- Provide a link from each element/attribute in the XSD to the caDSR
- Provide a mapping service
- Given a context and CDE, give me the XSD element/attribute
- Given an element/attribute and context, give me the CDE
- Likely we should start a cross-cutting working group to address the problem
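The third option, a mapping service, amounts to a two-way lookup keyed by context. The sketch below is an in-memory illustration only; the class name, method names, and the path notation (`@` marking an attribute) are assumptions, and a real service would still need a defined source for the mappings.

```python
from typing import Dict, Optional, Tuple


class CdeXsdMapping:
    """Hypothetical in-memory mapping between CDEs and XSD elements/attributes."""

    def __init__(self) -> None:
        self._to_xsd: Dict[Tuple[str, str], str] = {}
        self._to_cde: Dict[Tuple[str, str], str] = {}

    def register(self, context: str, cde: str, xsd_path: str) -> None:
        """Record one CDE <-> XSD location pair within a context."""
        self._to_xsd[(context, cde)] = xsd_path
        self._to_cde[(context, xsd_path)] = cde

    def xsd_for(self, context: str, cde: str) -> Optional[str]:
        """Given a context and CDE, return the XSD element/attribute, if known."""
        return self._to_xsd.get((context, cde))

    def cde_for(self, context: str, xsd_path: str) -> Optional[str]:
        """Given an element/attribute and context, return the CDE, if known."""
        return self._to_cde.get((context, xsd_path))
```

Storing the XSD location as a path lets the same lookup cover both elements and attributes, even when the XML name differs from the CDE name.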