Title: Architecture of caGrid 1.0
1. Architecture of caGrid 1.0
- caBIG Annual Meeting, April 9th-11th, 2006
2. Agenda
- caGrid Overview
- High-level Architecture
- Component Designs
- Metadata Management
- Service Infrastructure and Tools
- Data Service Enhancements
- Security Infrastructure Enhancements
- Object Identifiers
- Workflow Middleware
- caGrid 1.0 Portal/Monitor
3. What is caGrid?
- Development project of the Architecture Workspace, aimed at helping define and implement Gold Compliance
- No requirements on implementation technology will be necessary for Gold compliance
- Specifications will be created defining requirements for interoperability
- caGrid provides core infrastructure and tooling to provide a way to achieve Gold compliance
- Gold compliance creates the G in caBIG
- Gold -> Grid -> connecting Silver Systems
4. What is Grid?
- A lot of different things to a lot of different people
- Evolution of distributed computing to support sciences and engineering
- Some common themes prevail
- Sharing of resources (computational, storage, data, etc.)
- Secure Access (global authentication, local authorization, policies, trust, etc.)
- Open Standards
- Virtualization
- "The real and specific problem that underlies the Grid concept is coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations."
- I. Foster, C. Kesselman, S. Tuecke. International J. Supercomputer Applications, 15(3), 2001.
- A good general overview can be found here: http://gridcafe.web.cern.ch/gridcafe/
5. caGrid Overview
- Requirements
- Support scientific requirements: use cases from the cancer research community
- Support functional requirements: identifiers, workflow, query, etc.
- Support non-functional requirements: security, reliability, performance, etc.
- Principles
- Driven by cancer research community requirements
- caBIG Principles
- Open Source, Open Access, Open Development
- Federated
- Syntactic and Semantic Interoperability
- Services-Oriented Architecture
- Metadata driven and implements Virtualization
- Standards based
6. caGrid Conceptual View
7. caGrid Components
- Leverage existing technologies
- caDSR, EVS, Mobius GME: common data elements, controlled vocabularies, schema management
- Globus Toolkit (currently version 4.0.1)
- Core grid services infrastructure
- Service deployment, service registry, invocation, base security infrastructure
- Additional Core Infrastructure
- Higher-level security services
- Grid service access to metadata components (caDSR, GME, etc.)
- Workflow, Identifier services
- Service Provider Tooling (Introduce)
- Graphical service development and configuration environment
- Abstractions from service infrastructure for Data and Analytical services
- Deployment wizards
- Client Tooling
- High-level APIs for interacting with core components and services
- Graphical Tools
8. caGrid Metadata Infrastructure
- Cancer Data Standards Repository (caDSR)
- caBIG projects register their data models as Common Data Elements (CDEs), which are semantically harmonized and then centrally stored and managed in the caDSR
- Enterprise Vocabulary Services (EVS)
- EVS is a set of services and resources that address the need for controlled vocabulary
- Thesaurus: a biomedical thesaurus
- Metathesaurus: based on NLM's Unified Medical Language System Metathesaurus, supplemented with additional cancer-centric vocabulary
- Global Model Exchange (GME)
- GME is a DNS-like data definition registry and exchange service that is responsible for storing and linking together data models in the form of XML schema.
- Globus Information Services
- The Globus Information Services infrastructure provides a generic framework for aggregation of service metadata, a registry of running Grid services, and a dynamic data-generating and indexing node, suitable for use in a hierarchy or federation of services
9. caGrid Data Description Infrastructure
- Client and service APIs are object oriented, and operate over well-defined and curated data types
- Objects are defined in UML and converted into ISO/IEC 11179 Administered Components, which are in turn registered in the Cancer Data Standards Repository (caDSR)
- Object definitions draw from vocabulary registered in the Enterprise Vocabulary Services (EVS), and their relationships are thus semantically described
- XML serializations of objects adhere to XML schemas registered in the Global Model Exchange (GME)
10. caGrid Advertisement and Discovery
- All services register their service location and metadata information to an Index Service
- The Index Service subscribes to the standardized metadata and aggregates its contents
- Clients can discover services using a discovery API which facilitates inspection of data types
- Leveraging semantic information in EVS (from which service metadata is drawn), services can be discovered by the semantics of their data types
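The register-then-discover flow on this slide can be sketched in a few lines of Java. This is only an illustration under stated assumptions: `IndexSketch`, `ServiceEntry`, the example URLs, and the concept strings are hypothetical names, not the caGrid discovery API, and an in-memory list stands in for the Index Service.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for an Index Service. All names here are illustrative.
class IndexSketch {
    // One advertised service: its location plus the concepts of its data types.
    static final class ServiceEntry {
        final String url;
        final Set<String> concepts;

        ServiceEntry(String url, Set<String> concepts) {
            this.url = url;
            this.concepts = concepts;
        }
    }

    private final List<ServiceEntry> entries = new ArrayList<>();

    // Advertisement: a service registers its location and metadata.
    void register(String url, Set<String> concepts) {
        entries.add(new ServiceEntry(url, concepts));
    }

    // Discovery: return the URLs of services whose metadata names the concept.
    List<String> discoverByConcept(String concept) {
        List<String> urls = new ArrayList<>();
        for (ServiceEntry entry : entries) {
            if (entry.concepts.contains(concept)) {
                urls.add(entry.url);
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        IndexSketch index = new IndexSketch();
        index.register("https://example.org/GeneService", Set.of("Gene", "Protein"));
        index.register("https://example.org/ImageService", Set.of("Image"));
        System.out.println(index.discoverByConcept("Gene"));
    }
}
```

A real index would aggregate metadata documents rather than bare strings; the sketch only shows why semantic annotations make type-based discovery a simple lookup.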
11. Introduce Goals
- A framework which enables fast and easy creation of caGrid-compatible services, whether they are data, analytical, custom, or core services.
- Provide easy-to-use graphical service authoring tools.
- Hide all grid-ness from the developers so that they can concentrate on the domain-specific implementation.
- Utilize best practice layered grid service architectures.
- Handle all service architecture requirements of the caGrid.
- Strong service interface data typing
- Metadata and service registration
- Grid security integration
12. Addressing the Requirements
- Grid Services
- We will fulfill the requirement that each operation is available through grid services by using the Globus 4.0 toolkit along with best-practice extensions.
- Tool providers will describe the grid service interface they wish to provide.
- We will use the Globus toolkit during creation, registration, discovery, and invocation of these service operations as grid services.
- Clients will use the operation through a grid service interface and will not need to be aware of any implementation-specific details of the grid service.
13. Addressing the Requirements (cont)
- Strongly typed interfaces
- Introduce will enable schema extraction from a GME so that the WSDL, beans, and service metadata can be automatically populated, and the service will use strongly typed and publicly accessible data types.
- Data model schema, and any referenced schemas, are extracted and placed in the schema directory of the analytical build skeleton
- Types can be used at service build time to generate the language binding objects and create strongly-typed grid service interfaces
14. Addressing the Requirements (cont)
- Providing Client Side Object API
- Globus build process will automatically generate a client-side object-oriented API
- We will generate a wrapper for this API which matches the service interface to make a clean mapping from client to service.
- This wrapper will handle auto-boxing of the parameters into document literal form.
15. Model Driven Architecture
- Introduce uses an XML document to describe the data types used, service interface, metadata, and security requirements
- The Introduce Toolkit uses this document for service generation and synchronization
16. Introduce Portal Service Creation
- Populate required variables for service creation
- Name: published service name
- Creation Directory: directory in which to create the skeleton
- Package: the Java package you wish to use for your service
- Namespace Domain: the namespace to be used to define the service interface and types
17. Introduce Service Creation Architecture
18. Created Skeleton Layout
[Figure: skeleton directory layout; legend: generated, built, developer's contribution]
19. Introduce Service Modification Tool
- Graphical tool to automatically create source code, configuration files, and build process for new analytical services
- Developer defines the operations of the service and just has to focus on their implementation
- Generated service is caBIG compliant in its mechanisms to register, advertise, and secure
20. Introduce Service Modification Tool (cont.)
- Input and output parameters can be discovered from GME or caDSR
- Schema types can be automatically downloaded and configured as operation parameters
- Specified types are used to create necessary Java Objects using Axis/Globus behind the scenes
21. Skeleton Synchronization Process
22. Command Line Tools
- Creation tool: enables the easy creation of a new service skeleton
- Will prompt for input on name, package, directory, namespace, and service extension types
- Can take in a properties file instead
- Can be called with ant: ant createServiceSkeleton
- Sync tools: set of tools which enable the modification of caGrid compliant grid services
- Keeps mirror of service interface in a configuration/state document and regenerates the grid service skeleton upon changes
- Creates/Modifies
- service interface
- wsdl
- client/server implementation
- build files
- Security and registration configuration files
- Can be called with ant: ant resync
- Creates a backup of service for user
23. Introduce Summary
- Requirements
- Basic strongly typed grid requirements plus semantically interoperable caBIG requirements
- Architecture
- Grid service framework which is encapsulated and layered on Globus
- Introduce Toolkit
- Enables easy development of caBIG compliant grid services
- Introduce Service Layout
- Simple grid service layout making it easy to locate and manage pieces of the implementation
24. Data Service Overview
- Specialization of caGrid grid services to expose data through a common query interface
- Present an object view of data sources
- Exposed objects are registered in caDSR and their XML representation in GME
- Queries made with CQL Query objects
- Results returned as objects (or identifiers) nested in a CQL Query Result Set
25. Data Service Query Language
- Specifies a target object (result) type and selects the instances which satisfy the specified properties and nested object properties
- Allows path navigation
- Provides logical grouping
- Provides name/predicate/value filtering on properties of objects
- Recursively defined
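A minimal sketch of what "recursively defined" means for such a query language, in Java. It is not the CQL schema: the `Criterion`, `Attribute`, `Group`, and `Association` types and the predicate names `EQUAL_TO` and `CONTAINS` are assumptions made for illustration, and objects are modeled as plain maps rather than registered domain classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of a recursively defined object query.
class CqlSketch {
    interface Criterion {
        boolean matches(Map<String, Object> obj);
    }

    // name/predicate/value filter on one property of the object.
    static final class Attribute implements Criterion {
        final String name;
        final String predicate;
        final String value;

        Attribute(String name, String predicate, String value) {
            this.name = name;
            this.predicate = predicate;
            this.value = value;
        }

        public boolean matches(Map<String, Object> obj) {
            Object actual = obj.get(name);
            if (actual == null) {
                return false;
            }
            switch (predicate) {
                case "EQUAL_TO":
                    return actual.toString().equals(value);
                case "CONTAINS":
                    return actual.toString().contains(value);
                default:
                    throw new IllegalArgumentException("Unsupported predicate: " + predicate);
            }
        }
    }

    // Logical grouping: AND (requireAll) or OR over child criteria.
    static final class Group implements Criterion {
        final boolean requireAll;
        final List<Criterion> children;

        Group(boolean requireAll, List<Criterion> children) {
            this.requireAll = requireAll;
            this.children = children;
        }

        public boolean matches(Map<String, Object> obj) {
            for (Criterion child : children) {
                boolean matched = child.matches(obj);
                if (requireAll && !matched) return false;
                if (!requireAll && matched) return true;
            }
            return requireAll;
        }
    }

    // Path navigation: apply a nested criterion to the object under a role name.
    static final class Association implements Criterion {
        final String role;
        final Criterion nested;

        Association(String role, Criterion nested) {
            this.role = role;
            this.nested = nested;
        }

        @SuppressWarnings("unchecked")
        public boolean matches(Map<String, Object> obj) {
            Object child = obj.get(role);
            return child instanceof Map && nested.matches((Map<String, Object>) child);
        }
    }

    // Select the instances of the target type that satisfy the criterion.
    static List<Map<String, Object>> select(List<Map<String, Object>> instances, Criterion criterion) {
        List<Map<String, Object>> selected = new ArrayList<>();
        for (Map<String, Object> instance : instances) {
            if (criterion.matches(instance)) {
                selected.add(instance);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        Map<String, Object> tp53 = Map.of("symbol", "TP53", "chromosome", Map.of("name", "17"));
        Map<String, Object> brca2 = Map.of("symbol", "BRCA2", "chromosome", Map.of("name", "13"));
        Criterion query = new Group(true, List.of(
                new Attribute("symbol", "CONTAINS", "TP"),
                new Association("chromosome", new Attribute("name", "EQUAL_TO", "17"))));
        System.out.println(select(List.of(tp53, brca2), query));
    }
}
```

Because `Group` and `Association` both hold further `Criterion` values, a query can nest to any depth, which is the property the last bullet refers to.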
26. Data Service Interface
public CQLQueryResultsType processQuery(CQLQueryType query)
- A Data Provider's only responsibility is to implement CQL over their local data resource
- A default implementation will be provided for caCORE SDK created systems
- caGrid provides a grid service implementation to invoke the provider's CQL implementation
- Service provides all features necessary for compliance, such as advertisement of data service metadata, and security integration
27. Data Service Query Scenario
- Client builds a CQL Query
- CQL Query is serialized and submitted to the Grid Data Service
- Grid Data Service deserializes the CQL Query Object and processes it
- Data Source is queried by the Grid Data Service
- Grid Data Service builds a CQL Result Set
- Result Set is serialized and returned to the client
- Client deserializes result set
- Result set is iterated with client tools to retrieve objects
28. Federated and Aggregated Query
- Componentized library being developed to facilitate limited federated and aggregated queries
- An extension language is used to describe distributed queries
- Library creates and executes a Query Plan for the distributed query, using multiple CQL queries to targeted data services
- Initial Prototype (built on caGrid 0.5)
- Demonstration in a later breakout session
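As a rough illustration of a query plan that issues several queries to targeted data services and aggregates the results, consider the Java sketch below. The names `FederatedQuerySketch` and `execute` are hypothetical, each data service is reduced to a function from a query string to a list of result strings, and the plan is simply an ordered map from service name to sub-query; the actual library and its extension language are not shown in the slides.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: a "plan" maps each targeted service to its sub-query.
class FederatedQuerySketch {
    // Runs every step of the plan in order and concatenates the results.
    static List<String> execute(Map<String, Function<String, List<String>>> services,
                                Map<String, String> plan) {
        List<String> merged = new ArrayList<>();
        for (Map.Entry<String, String> step : plan.entrySet()) {
            Function<String, List<String>> service = services.get(step.getKey());
            if (service == null) {
                throw new IllegalArgumentException("Unknown data service: " + step.getKey());
            }
            merged.addAll(service.apply(step.getValue()));
        }
        return merged;
    }

    public static void main(String[] args) {
        // Stand-ins for remote data services; real ones would be grid calls.
        Map<String, Function<String, List<String>>> services = new LinkedHashMap<>();
        services.put("siteA", query -> List.of("siteA result for " + query));
        services.put("siteB", query -> List.of("siteB result for " + query));

        Map<String, String> plan = new LinkedHashMap<>();
        plan.put("siteA", "Gene where symbol = TP53");
        plan.put("siteB", "Gene where symbol = TP53");

        System.out.println(execute(services, plan));
    }
}
```

The sketch runs steps sequentially and does no joining; a real planner would also decide ordering, parallelism, and how partial results feed later sub-queries.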
29. Data Service Client Tooling
- APIs provided to discover available data services on the grid based on client-defined criteria (such as exposed data models and concepts)
- Object-Oriented API for building queries, querying a given data service, and processing the results
- Client tools available to iterate query result sets
- Object iterator deserializes XML into registered objects
- XML iterator simply returns XML documents
30. caGrid Security Overview
- caGrid 0.5 Security
- Focus was on getting the basic building blocks in place to facilitate the secure deployment of reference implementations in the caBIG test bed.
- caGrid 1.0 Security
- Focus is on getting the basic building blocks in place to facilitate the secure deployment of multi-institutional production applications.
- Leveraging/Extending work done in the Grid, caBIG, and other related communities to provide an enterprise solution.
- caGrid implementation efforts are not focused on policy but on providing the building blocks that the community can use for implementing policy.
31. caGrid 1.0 Security Overview
- What areas will the security infrastructure focus on?
- Federated Identity Management
- Trust Management
- Authorization
- caGrid Integration
32. Identity Management and Federation - Overview
- Major focus of the caBIG Security Evaluation White Paper
- Enable users to use their institution-provided identity for authenticating to the Grid
- Users should be able to authenticate to the Grid using their institution's existing mechanisms
[Image taken from the caBIG Security Evaluation White Paper]
33. Federated Identity Management
- Dorian
- Successor of GUMS, redesigned to address the Federated Identity Management requirement.
- WSRF compliant Grid service
- Manages users' Grid credentials
- Enables users to authenticate and create grid proxies via their institution's Identity Provider (IdP)
- Internal Dorian IdP allows unaffiliated users or small institutions access to the grid.
- Internal Certificate Authority / ability to integrate with existing Certificate Authorities
- Administration Interface
- Configurable/Extensible User Policies
- User Management
- Trusted IdP Management
- Full Client API / Administrative Portal
34. Federated Identity Management
- Dorian Current Status
- Initial service implementation completed
- Initial administration portal completed
- Integrated with Ohio State University IdP
- Dorian Whitepaper accepted
- CBMS2006
- Identity Management and Federation WG
- Review and Shape Dorian
- Standardize SAML Assertions
35. Trust Management
- PKI (Public Key Infrastructure) is the cornerstone of grid security: grid users authenticate themselves to the grid via certificates
- Grid Services accept certificates as valid credentials if they are signed by trusted authorities
- Grid services must be configured to trust certificate authorities
- Current Approach (Globus, caGrid 0.5)
- Service Container and/or Service can be configured by specifying a trusted CA certificates directory in the server/service configuration directory
- Drawbacks
- Hard for grid administrators to manage
- Every time a new trusted authority comes online, all the services in the grid must be reconfigured to trust that authority
36. Trust Management
- Grid Trust Service (GTS)
- WSRF Grid Service
- Provides support for managing trusted Certificate Authorities
- Administrators register and manage certificate authorities and CRLs with GTS
- Client tools synchronize the Globus Trust Framework with GTS
- Globus authenticates against the current trust fabric
- GTS Current Status
- Prototype development almost completed
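The synchronization idea behind GTS can be illustrated with a small Java sketch. It assumes, for illustration only, that a trust store is a set of CA subject names and that the central service publishes the current set; `TrustSyncSketch` and `sync` are hypothetical names, and real synchronization would deal with certificates and CRLs rather than strings.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: bring a local trusted-CA set in line with a published one.
class TrustSyncSketch {
    // Drops authorities no longer published, adds new ones.
    // Returns true if the local set changed.
    static boolean sync(Set<String> localTrusted, Set<String> published) {
        boolean removed = localTrusted.retainAll(published);
        boolean added = localTrusted.addAll(published);
        return removed || added;
    }

    public static void main(String[] args) {
        Set<String> local = new TreeSet<>(Set.of("CN=Retired CA", "CN=Site A CA"));
        Set<String> published = Set.of("CN=Site A CA", "CN=Site B CA");
        boolean changed = sync(local, published);
        System.out.println(changed + " " + local);
    }
}
```

With this arrangement a new authority is registered once centrally and each service picks it up on its next sync, instead of every service being reconfigured by hand.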
37. Authorization
- Authorization is finding out if the person, once identified, is permitted to have the resource. Authorization is equivalent to checking the guest list at an exclusive party, or checking for your ticket when you go to the opera.
- In caGrid we want to empower service providers to make authorization decisions
- We will provide a callout mechanism which service providers can implement for providing custom authorization
- Allows authorization decisions to be made locally
- Authorization Working Group
- Evaluating authorization use cases in caGrid
- Identify authorization mechanisms for caBIG
- Discuss, resolve (if possible), and document political/logistical/technical issues with the authorization mechanisms identified
- Identify infrastructure needed to support the authorization mechanisms identified
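A callout mechanism of this kind might look roughly like the Java sketch below. The interface name `AuthorizationCallout`, its method signature, and the example identity string are assumptions made for illustration; the slides do not define the actual callout API.

```java
import java.util.Set;

// Hypothetical sketch of a provider-implemented authorization callout.
class AuthorizationSketch {
    // The service provider supplies the decision logic; the container calls it.
    interface AuthorizationCallout {
        boolean isAuthorized(String callerIdentity, String operation);
    }

    // Stand-in for the container: consult the callout before running the operation.
    static String invoke(AuthorizationCallout callout, String callerIdentity, String operation) {
        if (!callout.isAuthorized(callerIdentity, operation)) {
            throw new SecurityException(callerIdentity + " may not call " + operation);
        }
        return "executed " + operation;
    }

    public static void main(String[] args) {
        // Example local policy: anyone may query, only listed curators may update.
        Set<String> curators = Set.of("/O=Example/CN=alice");
        AuthorizationCallout localPolicy =
                (caller, operation) -> !operation.equals("update") || curators.contains(caller);

        System.out.println(invoke(localPolicy, "/O=Example/CN=bob", "query"));
        System.out.println(invoke(localPolicy, "/O=Example/CN=alice", "update"));
    }
}
```

The point of the design is that the decision stays with the provider: the infrastructure authenticates the caller, and the provider's own policy code decides what that caller may do.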
38. caGrid Service Integration
- Streamline all aspects of security into the service development and deployment process
- Introduce Toolkit
- Development Toolkit for building and deploying caGrid 1.0 Services
- Security on Service is configured and managed through Introduce
39. caGrid's Identifier Services Framework
- Identifier
- Naming of individual Data-Objects
- Globally Unique Name for each Data-Object
- Services
- Create/modify/delete name-object bindings
- Resolve name to data-object
- Framework
- Provide for Trust Fabric -> Binding Integrity
- Policy-driven Administration -> Curator Model
- Fully Integrated with caGrid's Architecture and Implementation
40. Why (Standardized) Data-Object Identifiers?
- Efficiency
- Passing by reference vs. by value (Data-Object can be many Mbytes)
- Data-Object equality test through String comparison (inequality test is not a requirement)
- Consistency
- Standardized way of referencing objects
- Standard identifier -> data-object resolution mechanism
- Meta-data binding to standard object reference
- Well-known primary/foreign key for (distributed) JOINs
- Name for policy expression for data-object access
- Name for audit entries about data-object related activities
- Possible correlation of all of the above
41. Data-Object Identifier Properties
- Identifier is a String
- Identifier is a forever globally unique name for a single Data-Object
- Identifier can be (globally) resolved to its associated Data-Object
- Data-Objects are immutable, almost immutable, or mutable
- Identifier value is a meaningless, opaque string for the consumer
- Resolution information is embedded in the Identifier Name
- Only meaningful for components related to the resolution service
- Identifier is a Uniform Resource Identifier (URI)
- The URI scheme will be made completely transparent to identifier-producing applications and consumers.
- bigid - at least until we have learned more about its usage (and to avoid distracting scheme-choice discussions)
42. Identifier Usage Model
43. Naming Authority, Identifier Curator, Data Owner and Identifier User
- Naming Authority (NA)
- Guards integrity of identifier namespace bindings
- Maintains identifier to data-object endpoint mapping
- Conceptually equivalent to caDSR
- Identifier Curator/Administrator
- Understands semantics/access of the data owner's objects
- Trusted by NA to administer binding for certain identifiers
- Administers identifier to data-object endpoint binding
- Data Owner
- Provides access to data-objects through endpoint-references
- Identifier User/Consumer
- Trusts an NA for certain identifier bindings
- Uses 2-step resolution to obtain data-object (identifier -> endpoint -> data-object)
- (In-)Directly trusts Data Owner for data-object integrity
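The 2-step resolution (identifier to endpoint, then endpoint to data-object) can be sketched in Java as follows. Everything here is a simplifying assumption: `IdentifierResolutionSketch` is a hypothetical name, the Naming Authority and the Data Owner endpoints are in-memory maps, identifiers are arbitrary opaque strings, and data-objects are plain strings.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of two-step identifier resolution.
class IdentifierResolutionSketch {
    // Naming Authority view: identifier -> endpoint reference.
    private final Map<String, String> bindings = new HashMap<>();
    // Data Owner view: endpoint reference -> (identifier -> data-object).
    private final Map<String, Map<String, String>> endpoints = new HashMap<>();

    // Curator role: bind an identifier to the endpoint that serves its object.
    void bind(String identifier, String endpoint, String dataObject) {
        bindings.put(identifier, endpoint);
        endpoints.computeIfAbsent(endpoint, key -> new HashMap<>()).put(identifier, dataObject);
    }

    // Consumer role: step 1 asks the Naming Authority, step 2 asks the Data Owner.
    Optional<String> resolve(String identifier) {
        String endpoint = bindings.get(identifier);
        if (endpoint == null) {
            return Optional.empty();
        }
        Map<String, String> store = endpoints.get(endpoint);
        return Optional.ofNullable(store == null ? null : store.get(identifier));
    }

    public static void main(String[] args) {
        IdentifierResolutionSketch resolver = new IdentifierResolutionSketch();
        resolver.bind("opaque-0001", "https://example.org/dataservice", "<Gene symbol=\"TP53\"/>");
        System.out.println(resolver.resolve("opaque-0001").orElse("unresolved"));
        System.out.println(resolver.resolve("opaque-9999").orElse("unresolved"));
    }
}
```

Keeping the two maps separate mirrors the trust split on the slide: the consumer trusts the Naming Authority for the binding and the Data Owner for the object itself.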
44. Identifier Services Framework Requirements
- Fully integrate with caGrid Architecture and Implementation
- WS-Interface specifications and implementations
- Naming Authority, Identifier Curator and Data Owner Services
- In practice, co-location option of Curator/Data- or NA/Curator/Data Services makes sense
- Java APIs to accommodate co-located functionality
- Abstract as much as possible of framework intrinsics, resolution, and naming schema from identifier producers and consumers
- Ideally it should be a transparent infrastructure service
- Support (secure) Data-Object migration, replication, caching
- All requirements for truly distributed deployment
- Solid Trust Fabric for Identifier Administration and Resolution
- Success stands or falls with integrity of the underlying framework
- Leverage existing Identifier framework implementation
- where possible and where it makes sense (Handle System, LSID)
45. Identifier Services Framework Next Steps
- High Level Architecture and Design Document
- Implementation Design Document
- Implementation of WS-Applications, Java APIs, and Libraries
- Standardized identifier attribute extensions for data-object schemas
- Documentation and Tutorials
- caBIO volunteered as first Identifier Producing Application
- Excellent first application from a requirements perspective
- Experience will feed directly back into identifier service development
- Data Owners and Curator applications will have easy-to-use APIs to associate standardized Identifiers with their data objects, and to enable global resolution of those Identifiers to the data-objects
- Client applications will have many options to use Identifiers: to specify search queries, in results from queries, to retrieve associated data-objects, in Workflow directives, in papers/articles, in cross-references
46. What is workflow?
- High-level scripting for frequent tasks
- Often automates a manually driven sequence
- Bioinformatics pipeline
- A common pattern that motivates this work
- Canonical pattern for service workflows
- Receive input message and trigger to start
- Declare variables, all local to the workflow
- Invoke services, assign variables, loops, etc.
- Return final results.
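The canonical pattern listed on this slide can be sketched in Java, purely as an analogy to what a workflow script does. `WorkflowSketch` and `run` are hypothetical names, and each "service" is a local function on strings standing in for a remote service invocation.

```java
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical sketch of the receive / assign / invoke / return pattern.
class WorkflowSketch {
    static String run(String input, List<UnaryOperator<String>> services) {
        // Receive the input message and declare a workflow-local variable.
        String variable = input;
        // Invoke each service in turn, assigning its output back to the variable.
        for (UnaryOperator<String> service : services) {
            variable = service.apply(variable);
        }
        // Return the final result.
        return variable;
    }

    public static void main(String[] args) {
        // Stand-ins for analytic services; real steps would be remote calls.
        UnaryOperator<String> removeBackground = data -> data + " -> background removed";
        UnaryOperator<String> normalize = data -> data + " -> normalized";
        System.out.println(run("spectrum", List.of(removeBackground, normalize)));
    }
}
```

A workflow engine adds what this loop lacks: persistence of state, recovery after failures, and parallel or conditional branches.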
47. caGrid Workflow - High Level Objectives
- Compose workflows integrating data and analytic services in flexible patterns
- Flexible control-flow patterns: loops, conditionals, iteration over collections
- Type-safety: verifying data-type correctness of arguments passed between services
- Robustness: recover and continue long-running workflows after failures
- Usability and integration: specify workflows in graphical interfaces and scripted textual form
- Record data provenance of workflow results
48. BPEL Workflow specifics
- BPEL: Business Process Execution Language
- Described in an XML document
- Work done via Service invocations
- Variables hold data
- copied from outputs to input
- XPath used to select data
- Looping, conditionals, parallel flows
- Event-driven message exchanges allowed
- Dynamic service discovery (to be developed)
49. Basic BPEL Workflow Model
[Figure: Receive Inputs, Assign args, Invoke Service (Analytic Service), Assign results, Send results]
50. BPEL Workflow Model: Pipelines (sequences)
[Figure: Receive Inputs, Assign results, Send results]
51. BPEL Workflow Model: Parallelism (flows)
[Figure: Receive Inputs, Send results]
52. BPEL Workflow Model: Conditional (select)
[Figure: Receive Inputs, Select, Send results]
53. BPEL Workflow Model: Loops (while)
[Figure: Receive Inputs, while, Assign results, Send results]
54. Service-oriented Science via caGrid workflow
- Workflow script: fetch data from data service in Chicago; perform step 1 using service at Duke; perform step 2 using service at OSU
[Figure labels: Analytic service @ duke.edu, Analytic service @ osu.edu, Workflow Results]
55. caGrid workflow implementation
[Figure labels: <BPEL Workflow Doc>, <Workflow Inputs>, BPEL Engine, Analytic service @ duke.edu, Analytic service @ osu.edu, <Workflow Results>]
- Each workflow is also a service
- Enacted by BPEL Engine
- Typically runs like a script (synchronous)
- Other powerful models are possible
56. Workflow components in caGrid 1.0
[Figure labels: Grid-Enabled Client, Analytical Service, Tool 1, Tool 2, Research Center, NCICB, Grid Data Service, Tool 3, Tool 4, BPEL Engine, Workflow State, Grid Portal, Workflow Manager, Workflow Document, Workflow Storage]
57. ActiveBPEL Engine Interface
58. The future of caBIG workflow
- Workflow management service
- Optimized data flow
- Pass data directly from service to service
- Provenance
- Tracking your workflow
- Grid cache
- Storing intermediate results
- Manipulate data by reference (via identifiers)
59. Accessing caGrid workflow - Design
[Figure labels: BPEL Workflow Doc, Workflow inputs, BPEL Engine, Workflow Mgmt Service, Analytic service @ duke.edu, Analytic service @ osu.edu, Workflow Results]
- Workflow management service
- Sharing workflows
- Get workflow status
60. Workflow demo overview
[Figure: demo workflow diagram. Labels: CQL, Data Service, Argonne, Duke, OSU; processing steps interpolate, removeBG, denoise, align, normalize, plot; multiplicity labels 5x and 10x]
61. caGrid Monitoring Portal - Overview
- Purpose of caGrid Portal
- To provide a high level view of the status of services on the grid based on collected information.
- caGrid Portal provides
- Geographic map of grid nodes (cancer centers)
- Cancer Center information
- Grid services information
- Search facility
- Extensible framework for further development
62. caGrid Monitoring Portal - Scope
- What is not in scope?
- Portal does not replace data curation and management
- Portal is not a grid data service
- Portal does not facilitate altering of data on the grid
- What is in scope?
- Provides a near-real-time view of the grid
- Information cached for functional and performance reasons
- Local cache periodically refreshed
- Provides extensible framework
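One way to picture a cached, near-real-time view is the Java sketch below. It is an assumption-laden illustration, not the portal's design: `StatusCacheSketch` is a hypothetical name, the grid is represented by a supplier of service-name to status strings, and the cache refreshes lazily on read once the interval has elapsed rather than with a background timer.

```java
import java.util.Map;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical sketch: a status snapshot that is reloaded once it is too old.
class StatusCacheSketch {
    private final Supplier<Map<String, String>> source;
    private final long refreshIntervalMillis;
    private final LongSupplier clock;
    private Map<String, String> snapshot;
    private long loadedAt;

    StatusCacheSketch(Supplier<Map<String, String>> source,
                      long refreshIntervalMillis,
                      LongSupplier clock) {
        this.source = source;
        this.refreshIntervalMillis = refreshIntervalMillis;
        this.clock = clock;
    }

    // Returns the cached view, reloading it when missing or past the interval.
    synchronized Map<String, String> view() {
        long now = clock.getAsLong();
        if (snapshot == null || now - loadedAt >= refreshIntervalMillis) {
            snapshot = source.get();
            loadedAt = now;
        }
        return snapshot;
    }

    public static void main(String[] args) {
        StatusCacheSketch cache = new StatusCacheSketch(
                () -> Map.of("ExampleDataService", "UP"),
                60_000L,
                System::currentTimeMillis);
        System.out.println(cache.view());
    }
}
```

Readers never touch the grid directly, which keeps page loads fast and is why the view is described as near-real-time rather than live.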
63. caGrid Monitoring Portal
- Map
- caGrid nodes
- Location
- Service availability indication
- caGrid Level RSS feed
64. caGrid Monitoring Portal
- Center List and Details
- List
- Provides a listing of all cancer centers participating in caGrid
- Details
- Address
- Contacts
- caGrid Services
- Center level RSS feed
65. caGrid Monitoring Portal
- Service List and Service Details
- List
- A categorized list of all service instances currently known to the portal
- Details
- An uptime graph
- Service functions and information from WSDL
- Contact information
- Service level RSS feed
66. caGrid Monitoring Portal
67. caGrid Monitoring Portal
- Basic and Advanced Search
- Provides a mechanism to search caGrid service metadata
- Basic search looks for text in metadata elements
- Advanced search allows for filtering on a per-element basis
68. caGrid Monitoring Portal
69. caGrid Monitoring Portal
- Requirements document and meta-data model continue to be available for review and feedback until April 17
- http://gforge.nci.nih.gov/plugins/scmcvs/cvsweb.php/cagrid-1-0/Documentation/docs/portal/portal-requirements.pdf?rev=1.1&cvsroot=cagrid-1-0
- http://tinyurl.com/kgfdc
- http://gforge.nci.nih.gov/plugins/scmcvs/cvsweb.php/cagrid-1-0/Documentation/docs/portal/portal-requirements.odt?rev=1.1&cvsroot=cagrid-1-0
- http://tinyurl.com/ew3de
70. caGrid 1.0 Timeline
71. Project Resources and Communication
- caGrid 1.0 GForge Home
- Feature Requests
- Bug Reports
- Discussion Forums
- Public Wiki
- Downloads / Source Repository
- http://gforge.nci.nih.gov/projects/cagrid-1-0/
- caGrid Users Mailing List
- https://list.nih.gov/archives/cagrid_users-l.html
- cagrid_users-l@list.nih.gov
- Architecture Workspace
- Community direction from Working Groups
- Report out and feedback during WS calls
72. caGrid Team
- Ohio State University - Department of BioMedical Informatics (http://bmi.osu.edu/)
- Dave Ervin
- Shannon Hastings
- Tahsin Kurc
- Stephen Langella
- Scott Oster
- Joel Saltz
- Argonne National Lab / University of Chicago (http://www.globus.org)
- William Allcock
- Jarek Gawor
- Ravi Madduri
- Frank Siebenlist
- Michael Wilde
- Duke University
- A. Jamie Cuticchia
- Patrick McConnell
- Georgetown University
- Colin Freas
- Paul A. Kennedy
- Chad La Joie
- SAIC (http://www.saic.com)
- Manav Kher
- Booz Allen Hamilton (http://www.bah.com)
- Arumani Manisundaram
- Michael Keller
- Reechik Chatterjee
73. caGrid Demo
- See a live demonstration of several caGrid components: Tuesday, 12:30pm
77. Proposed Analytical Service Metadata