Title: caBIG Data Structures
1caBIG Data Structures
- CS584 Lecture on 4/6/2007
Patrick McConnell, Duke Comprehensive Cancer Center
patrick.mcconnell_at_duke.edu
2Agenda
- caBIG background (5 min, 8 slides)
- Goals, program structure, organizations
- caTRIP background (5 min, 6 slides)
- Background, use cases, architecture
- caBIG compatibility (30 min, 21 slides, demonstration)
- Interoperability, compatibility, syntactics, and semantics
- Building caBIG compatible systems (10 min, 7 slides)
- Interoperability, compatibility, syntactics, and semantics
- caGrid (10 min, 8 slides)
- Background, service creation, metadata
- caTRIP demonstration (10 min, 2 slides, demo)
- Demonstration
- Discussion/questions (5 min throughout)
3caBIG Background
- Goals, program structure, organizations
4caBIG backgroundBiomedical information tsunami
- overwhelming volume of data
- multitude of sources
5caBIG backgroundInformatics tower of Babel
- Each cancer research community speaks its own scientific dialect
- Integration critical to achieve promise of molecular medicine
6caBIG backgroundGoals and principles
- 50 Cancer Centers are working towards a common goal of integrated data, tools, and methodologies to accelerate cancer research through an initiative of the National Cancer Institute Center for Bioinformatics (NCICB): the cancer Biomedical Informatics Grid (caBIG)
- The goal of caBIG is to create a virtual web of interconnected data, individuals, and organizations which will redefine how
- research is conducted
- care is provided
- patients / participants interact with the biomedical research enterprise
- The principles driving caBIG are
- Open Source
- Open Access
- Open Development
- Federated Model
7caBIG backgroundcaBIG facilitates sharing
8(No Transcript)
9caBIG backgroundWorkspaces
DOMAIN WORKSPACE 1: Clinical Trial Management Systems
addresses the need for consistent, open and comprehensive tools for clinical trials management.
DOMAIN WORKSPACE 2: Integrative Cancer Research
provides tools and systems to enable integration and sharing of information.
DOMAIN WORKSPACE 3: Tissue Banks and Pathology Tools
provides for the integration, development, and implementation of tissue and pathology tools.
DOMAIN WORKSPACE 4: Imaging
provides for the sharing and analysis of in vivo imaging data.
CROSS CUTTING WORKSPACE 1: Vocabularies and Common Data Elements
responsible for evaluating, developing, and integrating systems for vocabulary and ontology content, standards, and software systems for content delivery
CROSS CUTTING WORKSPACE 2: Architecture
developing architectural standards and architecture necessary for other workspaces.
10caBIG backgroundCommunities
9Star Research; Albert Einstein; Ardais; Argonne National Laboratory; Burnham Institute; California Institute of Technology-JPL; City of Hope; Clinical Trial Information Service (CTIS); Cold Spring Harbor; Columbia University-Herbert Irving; Consumer Advocates in Research and Related Activities (CARRA); Dartmouth-Norris Cotton; Data Works Development; Department of Veterans Affairs; Drexel University; Duke University; EMMES Corporation; First Genetic Trust; Food and Drug Administration; Fox Chase; Fred Hutchinson; GE Global Research Center; Georgetown University-Lombardi; IBM; Indiana University; Internet 2; Jackson Laboratory; Johns Hopkins-Sidney Kimmel; Lawrence Berkeley National Laboratory; Massachusetts Institute of Technology; Mayo Clinic; Memorial Sloan Kettering; Meyer L. Prentis-Karmanos; New York University; Northwestern University-Robert H. Lurie
Ohio State University-Arthur G. James/Richard Solove; Oregon Health and Science University; Roswell Park Cancer Institute; St Jude Children's Research Hospital; Thomas Jefferson University-Kimmel; Translational Genomics Research Institute; Tulane University School of Medicine; University of Alabama at Birmingham; University of Arizona; University of California Irvine-Chao Family; University of California, San Francisco; University of California-Davis; University of Chicago; University of Colorado; University of Hawaii; University of Iowa-Holden; University of Michigan; University of Minnesota; University of Nebraska; University of North Carolina-Lineberger; University of Pennsylvania-Abramson; University of Pittsburgh; University of South Florida-H. Lee Moffitt; University of Southern California-Norris; University of Vermont; University of Wisconsin; Vanderbilt University-Ingram; Velos; Virginia Commonwealth University-Massey; Virginia Tech; Wake Forest University; Washington University-Siteman; Wistar; Yale University
11caBIG background Duke's role in caBIG
- Pankaj Agarwal
- Bob Annechiarico
- Bill Banks
- Vijaya Chadaram
- Jamie Cuticchia
- Raj Dash
- Mohammad Farid
- Seth Fehrs
- Patrick McConnell
- Salvatore Mungal
- Mark Peedin
- CALGB
- CCR
- Coalition of Cooperative Groups
- Dana Farber
- Georgetown
- Mayo
- Oregon Health Sciences University
- Integrative Cancer Research
- Workspace participant
- RProteomics developer
- caTRIP developer
- Architecture
- Workspace participant
- caGrid developer
- caGrid scientific liaison
- Guide to Mentors
- Vocabularies and Common Data Elements
- Workspace participant
- Guide to Mentors
- Clinical Trials Management Systems
- Workspace participant
- C3PR developer
- CTMS Interoperability architect
- C3D developer
- Tissue Banking and Pathology Tools
- Workspace participant
12The Cancer Translational Research Informatics Platform (caTRIP)
- Background, use cases, architecture
13caTRIPWho is involved?
- Duke Bioinformatics
- Jamie Cuticchia (PI)
- Patrick McConnell (lead architect)
- Duke Information Systems
- Bob Annechiarico (PM)
- Wilma Stanley (developer)
- Mark Peedin (developer)
- Mohamad Farid (DBA)
- Jeff Allred (IT manager)
- Duke Pathology
- Raj Dash (domain expert)
- Chris Hubbard (developer)
- Duke Oncology
- Kelley Marcom (domain expert)
- Gretchen Kimmick (domain expert)
- Kimberly Blackwell (domain expert)
- Lee Wilke (domain expert)
- Duke CALGB
- Kimberly Johnson (DataMart liaison)
- SemanticBits
- Ram Chilukuri (lead developer)
- Srini Akkala (developer)
- Sanjeev Agarwal (developer)
- 5 AM Solutions
- Bill Mason (developer)
- NCI
- Julie Klemm (ICR WS lead)
- Carl Shaefer (NCI rep)
- Subha Madhavan (caIntegrator PM)
- BAH
- Curtis Lockshin
- Mehul Shah (tech support)
Managers and Architects
Software Developers
Database Developers and IT
NCI/BAH
Domain Experts
14caTRIP What is translational research?
- Bench-to-Bedside
- Wikipedia (the source of all knowledge): Translational medicine is a branch of medical research that attempts to more directly connect basic research to patient care.
- Basic research occurs in the lab
- Patient care occurs in the clinic
- Translational research broadened: Translational medicine can also have a much broader definition, referring to the development and application of new technologies in a patient driven environment, where the emphasis is on early patient testing and evaluation.
- Facilitate the interaction between basic research and clinical medicine, particularly in clinical trials.
15caTRIP Initial focus
- Our initial focus will be on connecting existing data systems, including basic science data, to enhance patient care
- Initial problem scenario: outcomes analysis
- Use data from existing patients to inform the treatment of another patient
- Leverage clinical, pathology, tissue, and basic science data
- Scenario: Patient A enters the clinic. What treatments were applied with success on other patients with similar characteristics (race, sex, symptoms, pathology results, adverse events, biomarkers)?
16caTRIP Broadened focus scientific use cases
- Find available tumor tissue
- What are all the tissue specimens from her2/neu positive patients that have a primary tumor in the breast and are BRCA1 positive?
- Find factors of survival
- What are all the ER positive patients that have survived breast cancer after radiation treatment?
- Find patients for trials
- What are all the patients that are triple negative (ER, PR, and HER2/NEU negative)?
- Determine the distribution of disease factors over time
- Does a change in pathology biomarkers over time contribute to recurrence or death?
- Determine correlation of factors pre and post surgery
- Does a change in ER or PR status before and after surgery correlate with other factors?
- Find pathology reports of interest
- Show me all of the pathology reports for Her2/Neu positive patients with a lobular carcinoma.
17caTRIP Connecting disparate data systems
[Diagram: caTRIP connecting disparate data systems; labels include MRN]
- CAE: Pathology Biomarkers
- Tumor Registry: Diagnosis, Treatment, Recurrence, Follow-up
- caTissue CORE: Tissue Bank
- caTIES: Pathology Reports
- caIntegrator: SNP Data
18caTRIP Architecture overview
[Architecture diagram; labels as extracted]
- Components: Distributed Query Engine, GUI, Domain Grid Services, Core Grid Services
- Interactions: query, authenticate, discover, authorize
- Grid services: CAE, caTissue CORE, CGEMS SNP, caTIES, TR, IdPService, GridGrouper, IndexService
- Duke: caTIES, TR, caTissue CORE, CAE, caIntegrator, Domain Controller, Illumina, MAW3, Tumor Registry
19caBIG Compatibility
- Interoperability, compatibility, syntactics, and
semantics
20caBIG compatibility Interoperability defined
Courtesy Charlie Mead
- ability of a system to access and use the parts
or equipment of another system
Semantic interoperability
Syntactic interoperability
21caBIG compatibility How does this apply to caBIG?
- Connect scientists and practitioners through a shareable and interoperable infrastructure
- Develop standard rules and a common language to more easily share information (compatibility guidelines)
- Build or adapt tools for collecting, analyzing, integrating, and disseminating information associated with cancer research and care.
- The cancer community is united in its mission to eliminate suffering and death due to cancer. It is now connected by caBIG.
22caBIG compatibility What is compatibility in
caBIG?
- The four areas of the caBIG compatibility guidelines
- Information Models: Individual types of data are rarely collected or presented in isolation. Rather, they are assembled into a contextual environment that includes closely and more distantly associated data and information. These associations and relationships can be presented in the form of an information model.
- CDEs: Data that is collected on a given study or trial must be defined and described such that remote users of that data can understand what it means. These metadata descriptions are referred to as data elements.
- Vocabularies and Ontologies: Biomedical information includes a substantial body of specialized concepts that are represented by terms. Agreement upon the basic concepts, terms and definitions that are inherent in all biomedical information is essential for achieving semantic interoperability.
- Programming and Messaging Interfaces: Computer programs and the people who write them are able to access resources from other programs through programming and messaging interfaces. Each of these interfaces responds to a particular syntax for its communications. Agreement upon standards for these interfaces is necessary to overcome barriers to syntactic interoperability.
23caBIG compatibility Levels of compatibility
- The four levels of the caBIG™ compatibility guidelines
- Legacy: Implies no interoperability with an external system or resource. A system that was designed without awareness of or prior to the availability of these compatibility guidelines, and which does not meet any of the requirements for interoperability.
- Bronze: Classifies the minimum requirements that must be met to achieve a basic degree of interoperability.
- Silver: A rigorous set of requirements that, when met, significantly reduce the barrier to use of a resource by a remote party who was not involved in the development of that resource.
- Gold: Currently being defined by caBIG. Is expected to provide for a formalized grid architecture and data standards that will enable standardized advertising, discovery, and use of all federated caBIG resources.
24caBIG compatibility caBIG compatibility
guidelines
Syntactic
Semantic
Semantic Syntactic
25caBIG compatibility Syntactic interoperability
- The solution for syntactic interoperability in caBIG at the silver level of compatibility is for all systems to provide an Object Oriented Application Programmer Interface (API).
- Object Oriented Interfaces can be implemented in many programming languages.
- This interface can be connected to the caGrid so that the local data repository is globally accessible in a language independent way.
- The interface is described by an information model, which acts as the junction between the syntactic components and the semantic components.
26caBIG compatibility Programming and messaging
interfaces
- Types of APIs
- Client APIs in a programming language
- Messaging APIs via a messaging protocol
- Types of systems
- Data services provide access to an information model
- Query method
- Associations are traversable
- Analytical services provide methods to manipulate data
- Hybrid services provide methods to manipulate information models
- Analytical tools are consumers of silver-compatible data, but don't produce it
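A data service, as listed above, offers a query method over an information model whose associations can be traversed. The following is a minimal Python sketch of that idea only. The Participant and Diagnosis classes, the query() helper, and the sample records are hypothetical illustrations and are not the caGrid or caCORE API.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Diagnosis:
    histology: str


@dataclass
class Participant:
    mrn: str
    # Association: one participant has zero or more diagnoses.
    diagnoses: List[Diagnosis] = field(default_factory=list)


def query(objects: List[Participant],
          predicate: Callable[[Participant], bool]) -> List[Participant]:
    """Return the objects of the target class that satisfy the predicate."""
    return [obj for obj in objects if predicate(obj)]


# Hypothetical sample data standing in for a local data repository.
participants = [
    Participant("MRN1", [Diagnosis("lobular carcinoma")]),
    Participant("MRN2", [Diagnosis("ductal carcinoma")]),
    Participant("MRN3", []),
]

# The predicate traverses the Participant -> Diagnosis association.
lobular = query(
    participants,
    lambda p: any(d.histology == "lobular carcinoma" for d in p.diagnoses),
)
print([p.mrn for p in lobular])
```

The point of the sketch is that the client asks for objects of a target class and constrains them through associated objects, rather than writing SQL against the underlying tables.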
27caBIG compatibility Programming and messaging
interfaces details
Levels: Legacy, Bronze, Silver, Gold
- Legacy: No programmatic interfaces to the system are available. Only local data files in a custom format can be read. Data transfer mechanisms implemented only on an ad hoc basis.
- Bronze: Programmatic access to data from an external resource is possible.
- Silver: Well-described APIs provide access to data in the form of data objects. Standards-based electronic data formats are supported for both input to and output from the system. Standards-based messaging protocols are supported wherever messaging is relevant.
- Gold: All features of Silver, plus: Service-oriented components produce or consume resources in the form of grid services. Interoperable with data grid architecture to be defined by caBIG.
Examples (in level order, as extracted): Executables; Proprietary API/data format; JavaDocs; XML, ASN.1; SOAP, CORBA; Globus; caGrid-based services
28caBIG compatibility caTRIP API
Hyperlinks to caTRIP API docs
29caBIG compatibility caTRIP grid service WSDL
Hyperlinks to caTRIP API WSDL
30caBIG compatibility caTRIP grid service WSDL
Hyperlinks to caTRIP FQP UML
31caBIG compatibility Semantic interoperability
- The solution for semantic interoperability lies in object oriented UML design of the service, an unambiguous description of elements within the system and storage of the description in a publicly accessible repository (metadata).
- UML model
- Use of publicly accessible terminologies/vocabularies/ontologies (EVS-NCI Thesaurus)
- Use of publicly accessible metadata repository (caDSR)
32caBIG compatibility Common data element (CDE)
details
Levels: Legacy, Bronze, Silver, Gold
- Legacy: No structured metadata is recorded.
- Bronze: Data element descriptions have sufficient detail for a subject matter expert to unambiguously interpret. Data elements are built using controlled terminology. Metadata is stored and publicized in an electronic format that is separate from the resource that is being described.
- Silver: Common Data Elements (CDEs) built from controlled terminologies and according to practices validated by the VCDE workspace are used throughout. CDEs are registered as ISO/IEC 11179 metadata components in the cancer Data Standards Repository (caDSR).
- Gold: All features of Silver, plus: Common Data Elements (CDEs) designated as caBIG Standards by the VCDE workspace are used. Metadata is advertised and discoverable via the caBIG grid services registry.
Examples (in level order, as extracted): Free-text pathology reports; GeneOntology from GO website; NCI Thesaurus; GeneOntology registered in EVS; NCI Thesaurus
33caBIG compatibility Metadata stored in caDSR
Enterprise Vocabulary Services
- Storage of Metadata
- caDSR: cancer Data Standards Repository
- Common Data Elements (CDEs)
- Enable end-users to access information about data and services without having to access human developers
- Fusion of UML models and Concepts/Definitions
34caBIG compatibility caTRIP CDEs
Hyperlinks to caTRIP CDEs
35caBIG compatibility Vocabulary/terminology
details
Levels: Legacy, Bronze, Silver, Gold
- Legacy: Free text used throughout for data collection.
- Bronze: Use of publicly accessible controlled vocabularies as well as local terminologies. Terminologies must include definitions of terms that meet caBIG VCDE workspace guidelines.
- Silver: Terminologies reviewed and validated by the caBIG Vocabulary/Common Data Element (VCDE) Workspace used for all relevant data collection fields.
- Gold: All features of Silver, plus: Full adoption of caBIG terminology standards as approved by the VCDE Workspace.
Examples (in level order, as extracted): Free-text pathology reports; GeneOntology from GO website; NCI Thesaurus; GeneOntology registered in EVS; NCI Thesaurus
36caBIG compatibility Publicly accessible terminologies
Enterprise Vocabulary Services
- Controlled vocabulary resources for the cancer research community
- Vocabulary Products and Services
- NCI Thesaurus
- NCI Metathesaurus
- External Vocabularies
- NCI Thesaurus: controlled vocabulary source for metadata
- Has excellent coverage of cancer terminology
- Expands based on needs for additional terminology
- Based on concepts rather than terms
- Each concept has a unique identifier or CUI with definitions and synonyms
- Housed by the Enterprise Vocabulary Service (EVS)
- LexBIG
- a caBIG-funded vocabulary server to enable a Federated Vocabulary environment.
37caBIG compatibility caTRIP CDEs
Hyperlinks to a caTRIP concept
38caBIG compatibility Information model (UML)
details
Levels: Legacy, Bronze, Silver, Gold
- Legacy: No model describing the system is available in electronic format.
- Bronze: Diagrammatic representation of the information model is available in electronic format.
- Silver: Information models are defined in UML as class diagrams and are reviewed and validated by the VCDE workspace.
- Gold: All features of Silver, plus: Information models are harmonized across the caBIG Domain Workspaces.
Examples: Database diagram
39caBIG compatibility Domain information modeling
- A Domain Information Model is a representation of our understanding of an area of knowledge.
- Domain Information Models consist of Classes that represent things in the real world
- Classes contain attributes that are characteristics of different instances of things in the real world.
- Relationships between the classes are described by associations and indicated by lines with directionality and cardinality
- Each class plus attribute creates one Common Data Element (CDE)
40caBIG compatibility Tumor Registry model
Diagnosis
Participant
Collaborative Staging
Follow up and Recurrence
Hyperlinks to caTRIP UML
Treatment
41Building caBIG Compatible Systems
42Building caBIG compatible systemsSteps for
creating an analytical system
- Step 1 model and register metadata
- Model the domain objects
- Register metadata
- Step 2 implement the analytical system
- Implement an interface
- Map data objects to existing inputs
- Plug-in analytics
- Step 3 create the data service
- Create an XML Schema
- Use the caGrid 1.0 Introduce toolkit to create a service
- Configure the service
- Deploy
- Step 4 invoke the service
- Java-based client
- Use caTRIP
43Building caBIG compatible systemsSteps for
creating a data system
- Step 1 model and register metadata
- Model the domain objects
- Register metadata
- Step 2 implement the information system
- Model the databases (via scripts or EA)
- Build the database
- Generate Java beans
- Create Hibernate mappings
- Jar it all up
- Step 3 create the data service
- Create an XML Schema
- Use the caGrid 1.0 Introduce toolkit to create a service
- Configure the service
- Deploy
- Step 4 invoke the service
- Java-based client
- Use caTRIP
44Building caBIG compatible systemsN-tier
architecture
[N-tier architecture diagram; labels as extracted]
- Index Service (advertise)
- Distributed Query Engine
- CQL Query
- caGrid Data Service
- caCORE SDK
- CQL Engine
- domain model
- Object-relational mapping
- database
45Building caBIG Compatible SystemscaCORE SDK
Vocabularies
Info Model
Common Data Elements
Messaging Interfaces/ API
46caBIG compatibility Mapping UML to CDEs
47caBIG compatibility Mapping UML to CDEs example
- Class: Gene
- Attribute: entrezGeneID
- Datatype: String
- Created Data Element: Gene Entrez Gene Genomic Identifier java.lang.String
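The example above pairs a UML class with one of its attributes and a datatype to form a single data element. A rough Python sketch of that "one class plus one attribute gives one CDE" rule follows. The DataElement type, the data_elements() helper, and the second attribute (symbol) are hypothetical; only Gene.entrezGeneID with java.lang.String comes from the slide, and the real mapping is done with caBIG tooling and the caDSR rather than code like this.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class DataElement:
    uml_class: str
    attribute: str
    datatype: str


def data_elements(model: Dict[str, Dict[str, str]]) -> List[DataElement]:
    """Create one data element per (class, attribute) pair in the model."""
    return [
        DataElement(cls, attr, dtype)
        for cls, attrs in model.items()
        for attr, dtype in attrs.items()
    ]


# Hypothetical model fragment: class name -> {attribute name: datatype}.
# "symbol" is an invented attribute added only to show a second element.
model = {
    "Gene": {
        "entrezGeneID": "java.lang.String",
        "symbol": "java.lang.String",
    }
}

cdes = data_elements(model)
for cde in cdes:
    print(cde.uml_class, cde.attribute, cde.datatype)
```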
48caBIG compatibility Use SIW to designate
existing CDEs
49caGrid
- Background, service creation, metadata
50caGridWhat is caGrid?
- What is Grid?
- Evolution of distributed computing to support sciences and engineering
- Sharing of resources (computational, storage, data, etc.)
- Secure Access (global authentication, local authorization, policies, trust, etc.)
- Open Standards
- Virtualization
- What is caGrid?
- Development project of Architecture Workspace
- Helping define and implement Gold Compliance
- Implementation of Grid technology
- Leverages open standards, community open source projects
- No requirements on implementation technology necessary for compliance
- Specifications will be created defining requirements for interoperability
- caGrid provides core infrastructure and tooling to provide a way to achieve Gold compliance
- Gold compliance creates the G in caBIG
- Gold > Grid > connecting Silver Systems
51caGridMetadata infrastructure goals
- Support strongly typed grid
- Syntactic and Semantic interoperability
- Programmatic!
- Smooth transition from Application to Grid and back
- Leverage wealth of existing metadata
- Enable service Advertisement and Discovery
52caGridService development process
- Service developers first create a service using a simple wizard to specify information (target directory, type of service, service name, etc.)
- Next, developers locate the data types they will use for inputs or outputs
- Can be discovered from the caDSR, GME, file system, etc.
- Operations are then defined that take some number of the data types as input, and produce some number as output
- Metadata and Service Properties can be added and configured
- The service's security can be completely configured
- Some or all of these steps may be automatically handled by extensions
53caGrid Introduce
- GUI for creating and manipulating a grid service
- Provides means of simple creation of service skeleton that a developer can then implement, build, and deploy
- Automatic code generation of complete caBIG compliant grid service which is configured to provide
- Advertisement
- Standard Metadata
- Security
- Complete Client API
54caGridSteps for creating a data system
- Step 1 model and register metadata
- Model the domain objects
- Register metadata
- Step 2 implement the information system
- Model the databases (via scripts or EA)
- Build the database
- Generate Java beans
- Create Hibernate mappings
- Jar it all up
- Step 3 create the data service
- Create an XML Schema
- Use the caGrid 1.0 Introduce toolkit to create a service
- Configure the service
- Deploy
- Step 4 invoke the service
- Java-based client
- Use caTRIP
55caGridSteps for creating an analytical system
- Step 1 model and register metadata
- Model the domain objects
- Register metadata
- Step 2 implement the analytical system
- Implement an interface
- Map data objects to existing inputs
- Plug-in analytics
- Step 3 create the data service
- Create an XML Schema
- Use the caGrid 1.0 Introduce toolkit to create a service
- Configure the service
- Deploy
- Step 4 invoke the service
- Java-based client
- Use caTRIP
56caGridcaGrid data description infrastructure
- Client and service APIs are object oriented, and operate over well-defined and curated data types
- Objects are defined in UML and converted into ISO/IEC 11179 Administered Components, which are in turn registered in the Cancer Data Standards Repository (caDSR)
- Object definitions draw from controlled terminology and vocabulary registered in the Enterprise Vocabulary Services (EVS), and their relationships are thus semantically described
- XML serialization of objects adheres to XML schemas registered in the Global Model Exchange (GME)
57caGridMetadata services
- Cancer Data Standards Repository (caDSR)
- caBIG projects register their data models as Common Data Elements (CDEs), which are semantically harmonized and then centrally stored and managed in the caDSR
- The caDSR grid service provides
- Model discovery and traversal
- caGrid standard metadata generation capabilities
- Enterprise Vocabulary Services (EVS)
- EVS is a set of services and resources that address the need for controlled vocabulary
- The EVS grid service provides
- Query access to the data semantics and controlled vocabulary managed by the EVS
- Global Model Exchange (GME)
- GME is a DNS-like data definition registry and exchange service that is responsible for storing and linking together data models in the form of XML schema.
- The GME grid service provides
- Access to the authoritative structural representation of data types on the grid
- Globus Information Services: Index Service
- The Globus Information Services infrastructure provides a generic framework for aggregation of service metadata, a registry of running Grid services, and a dynamic data-generating and indexing node, suitable for use in a hierarchy or federation of services
- The Index grid service provides
- Yellow and white pages for the grid
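The "yellow and white pages" role of the Index Service can be pictured as a registry that services advertise to and clients search. Below is a toy in-memory sketch in Python under that assumption. The IndexRegistry class, its method names, the metadata layout, and the example.org addresses are all hypothetical and do not reflect the Globus Index Service interface.

```python
from typing import Dict, List


class IndexRegistry:
    """Toy in-memory registry; not the Globus Index Service API."""

    def __init__(self) -> None:
        self._services: Dict[str, dict] = {}

    def advertise(self, url: str, metadata: dict) -> None:
        """A running service registers its address and metadata."""
        self._services[url] = metadata

    def lookup(self, url: str) -> dict:
        """White pages: metadata for a service whose address is known."""
        return self._services[url]

    def discover(self, domain_class: str) -> List[str]:
        """Yellow pages: addresses of services exposing a given class."""
        return sorted(
            url for url, md in self._services.items()
            if domain_class in md.get("classes", [])
        )


registry = IndexRegistry()
registry.advertise("https://example.org/catissue",
                   {"center": "Duke", "classes": ["TissueSpecimen"]})
registry.advertise("https://example.org/registry",
                   {"center": "Duke", "classes": ["Diagnosis", "Treatment"]})

print(registry.discover("TissueSpecimen"))
print(registry.lookup("https://example.org/registry")["center"])
```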
58caGridcaGrid production environment
59The Cancer Translational Research Informatics Platform (caTRIP)
60caTRIP Clinical and research scenarios
- Clinical scenario for demonstration
- A patient enters the clinic and is diagnosed with a lobular carcinoma
- The Her2/Neu biomarker test comes back positive
- What are the treatments and outcomes of other patients with similar characteristics?
- Query for diagnosis date, treatment, treatment date, survival, recurrence, and BRCA1 and BRCA2 status
- Look for treatments given with success and correlation between BRCA status in case test should be ordered
- Research scenario for demonstration
- Is there a correlation between recurrence, mortality, histologic grade, and Her2/Neu status for breast cancer patients diagnosed with lobular carcinoma?
- Query caTRIP for recurrence type, date of death, histologic grade, and Her2/Neu status for patients diagnosed with lobular carcinoma
- Correlation is determined in Microsoft Excel
- Investigate gene biomarkers that correlate with a Her2/Neu status of negative and survival
- Query caTRIP for all available tissue to order for microarray experiments
- Query sharing
- What are all the triple negative patients?
61caTRIP Why the Simple GUI?
- What are all the tissue specimens from her2/neu
positive patients that have a primary tumor in
the breast and are BRCA1 positive?
caTissue CORE
CAE
Participant Medical Record Number
CGEMS
Tumor Registry
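The question on this slide spans four systems that share only the Participant Medical Record Number. One way to picture the distributed query is: each service answers its own part, and the results are joined on MRN. The Python sketch below illustrates that join with invented data. The result shapes, the specimens_for() helper, and all identifiers are hypothetical, and this is not how DCQL or the caTRIP query engine is actually implemented.

```python
from typing import Dict, List, Set

# Hypothetical per-service answers, keyed by medical record number (MRN).
cae_her2_positive: Set[str] = {"MRN1", "MRN2", "MRN4"}        # CAE
registry_breast_primary: Set[str] = {"MRN1", "MRN2", "MRN3"}  # Tumor Registry
cgems_brca1_positive: Set[str] = {"MRN2", "MRN4"}             # CGEMS
catissue_specimens: List[Dict[str, str]] = [                  # caTissue CORE
    {"specimen_id": "S-10", "mrn": "MRN1"},
    {"specimen_id": "S-11", "mrn": "MRN2"},
    {"specimen_id": "S-12", "mrn": "MRN2"},
    {"specimen_id": "S-13", "mrn": "MRN4"},
]


def specimens_for(specimens: List[Dict[str, str]],
                  *mrn_sets: Set[str]) -> List[str]:
    """Keep specimens whose MRN appears in every per-service result set."""
    matching = set.intersection(*mrn_sets)
    return [s["specimen_id"] for s in specimens if s["mrn"] in matching]


result = specimens_for(
    catissue_specimens,
    cae_her2_positive,
    registry_breast_primary,
    cgems_brca1_positive,
)
print(result)
```

Only MRN2 appears in all three criteria sets here, so only that patient's specimens are returned.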
62Discussion/questions
63Backup Slides
64CTMS Interoperability Project
- Goals, scope, BRIDG, architecture, demo
65CTMSi A collaborative effort
- 11 Organizations
- Booz Allen Hamilton
- Dana-Farber
- Duke University
- Ekagra
- Harvard University
- Mayo Clinic
- NCICB
- Nortel Government Solutions
- Northwestern University
- ScenPro
- SemanticBits
- 8 Locations
- Maryland
- Minnesota
- Virginia
- Georgia
- Massachusetts
- 35 Team Members / 5 Applications
- Cancer Central Clinical Participant Registry (C3PR)
- Cancer Central Clinical Database (C3D)
- Patient Study Calendar (PSC)
- caXchange LabViewer and the Clinical Trials Object Model (CTOM)
- Cancer Adverse Events Reporting System (caAERS)
- 8 Roles
- Analysts
- Architects
- Developers
- Project Director
- Project Manager
- Project Sponsor
- Project Tech Leads
- Subject Matter Experts
66CTMSi Credits
- Project Director
- Meg Gronvall (BAH)
- Charles N. Mead, M.D. (BAH)
- NCICB CTMS Lead
- Christo Andonyadis, D.Sc. (NCICB)
- Project Manager
- Edmond Mulaire (SemanticBits)
- Project Architects
- Patrick McConnell (Duke)
- Niket Parikh (BAH)
- Analysts
- Smita Hastak (ScenPro)
- Wendy Ver Hoef (ScenPro)
- Subject Matter Experts
- Project Technical Leads
- Ram Chilukuri (SemanticBits)
- Charles Griffin (Ekagra)
- Vinay Kumar (SemanticBits)
- Stephen Reckford (Nortel Government Solutions)
- Rhett Sutphin (Northwestern)
- Sean Whitaker (Northwestern)
- caAERS: Ram Chilukuri (SemanticBits), Krikor Krumlian (Akaza Research), Vinay Kumar (SemanticBits), Rhett Sutphin (Northwestern), Kulasekaran Sethumadhavan (SemanticBits), Sujith Thayylithodi (SemanticBits)
- caGrid: Manav Kher (SemanticBits), Vinay Kumar (SemanticBits), Joshua Phillips (SemanticBits)
- caXchange (Lab Viewer/CTOM): Charles Griffin (Ekagra), Smita Hastak (ScenPro), Mukesh Mediratta (Ekagra), Kunal Modi (Ekagra), Wendy Ver Hoef
67CTMSi Goal
Lab Results
Participant Registration
Patient Scheduling
Adverse Events
Clinical Trials DB
68CTMSi BRIDG extract
Labs
Subject
AdverseEvents
Eligibility
Study
Site
69(No Transcript)
70CTMSi Architectural overview
Authentication, Trust, Authorization
Messages
caXchange
caGrid
Enterprise Service Bus
InboundBindingComponent
OutboundBindingComponent
Routing Rules
GTS
Dorian
Grid Grouper
71CTMSi Demonstration
72Service Metadata All Services
- Common Service Metadata
- Provided by all services
- Details the service's capabilities, operations, contact information, hosting research center
- Service operations' inputs and outputs defined in terms of structure and semantics extracted from caDSR and EVS
- Majority auto-generated by Introduce
73Service Metadata Service Security
- Service Security Metadata
- Provided by all services
- Details the service's requirements on communication channel for each operation
- Can be used by client to programmatically negotiate an acceptable means of communication
- For example: Does operation X allow anonymous clients, or are credentials required?
- Auto-generated by Introduce
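The anonymous-versus-credentials question above can be answered by a client that reads the security metadata before calling an operation. A minimal Python sketch of that check follows. The metadata layout, the operation names ("describe", "query"), and the choose_channel() helper are hypothetical and are not the format Introduce generates.

```python
from typing import Dict, Optional

# Hypothetical security metadata: one entry per service operation.
SECURITY_METADATA: Dict[str, Dict[str, bool]] = {
    "describe": {"anonymous_allowed": True},
    "query": {"anonymous_allowed": False},
}


def choose_channel(metadata: Dict[str, Dict[str, bool]],
                   operation: str,
                   credentials: Optional[str] = None) -> str:
    """Pick a communication mode the service will accept for this operation."""
    if metadata[operation]["anonymous_allowed"]:
        # Anonymous access is acceptable, so no credentials are presented.
        return "anonymous"
    if credentials is None:
        raise PermissionError(f"operation {operation!r} requires credentials")
    return "authenticated"


print(choose_channel(SECURITY_METADATA, "describe"))
print(choose_channel(SECURITY_METADATA, "query", credentials="grid-cert"))
```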
74Service Metadata Data Service
- Data Service Metadata
- Provided by all data services
- Describes the Domain Model being exposed, in terms of a UML model linked to semantics
- Provides information needed to formulate the Object-Oriented Query
- As with common metadata, data types defined in terms of structure and semantics extracted from caDSR and EVS
- Auto-generated by Introduce
75caTRIP in-depth ArchitectureSecurity
authorization
User Grid Certificate
Grid Data Service
authentication
User Credentials
SAML Assertion
Dorian
CSM
Trust Fabric
caGrid Authentication Service
backenddata
GridGrouper
Duke Authentication Plugin
Duke Domain ControllerNT Security
76caTRIP in-depth Data sharingChallenges in data
sharing
- Building data-oriented systems
- Duke requires IRB approval to gain access to identifiable data
- We worked around this by leveraging people already on IRB protocols
- Deidentifying data
- Data is owned by different groups across the cancer center
- Traditional deidentification: data manager deidentifies an entire dataset then throws away the key
- Distributed deidentification: trusted service provider (TSP) deidentifies discrete values
- Traditional approach is not scalable: requires a middle-man
- IRB approval required for distributed approach because it deviates from traditional deidentification (at Duke)
77caTRIP in-depth Data sharingDistributed
deidentification
[Diagram labels: Secure connection; MRN3 sent, GHI789 returned; Trusted Service Provider (has IRB approval to store identifiable data); two requesting parties (each has IRB approval to see identifiable data)]
Trusted Service Provider mapping table, before and after the MRN3 request (DEID values are randomly generated):
PHI    DEID
MRN1   ABC123
MRN2   DEF456
MRN3   GHI789
. . .  . . .
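The mapping shown above suggests how a trusted service provider could deidentify discrete values: it stores a PHI-to-DEID table, generates a random identifier the first time it sees a value, and returns the same identifier on every later request. A minimal Python sketch under those assumptions follows. The class name, method names, and the six-character identifier format are hypothetical, and a real deployment would add the secure connection, authorization, and persistent storage that the slide implies.

```python
import secrets
import string
from typing import Dict


class TrustedServiceProvider:
    """Toy TSP that maps identifiable values to random identifiers."""

    def __init__(self) -> None:
        # PHI value (e.g. an MRN) -> randomly generated deidentified value.
        self._phi_to_deid: Dict[str, str] = {}

    def deidentify(self, phi_value: str) -> str:
        """Return a stable random identifier for one discrete PHI value."""
        if phi_value not in self._phi_to_deid:
            self._phi_to_deid[phi_value] = self._new_identifier()
        return self._phi_to_deid[phi_value]

    def _new_identifier(self) -> str:
        alphabet = string.ascii_uppercase + string.digits
        used = set(self._phi_to_deid.values())
        while True:
            candidate = "".join(secrets.choice(alphabet) for _ in range(6))
            if candidate not in used:  # never hand out the same DEID twice
                return candidate


tsp = TrustedServiceProvider()
first = tsp.deidentify("MRN3")
second = tsp.deidentify("MRN3")   # a second data owner asks about the same MRN
other = tsp.deidentify("MRN1")
print(first == second, first != other)
```

Because every data owner receives the same identifier for the same MRN, records can still be joined across services without any of them exposing the MRN itself.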
78caTRIP in-depth ArchitectureSimple GUI
configuration
Service A
Service B
TissueSpecimen
SpecimenCollectionGroup
BreastCancerBiomarkers
Target
ClinicalReport
Linking Object Join Condition
Associated Classes
ParticipantMedicalIdentifier
Filter Object
Association Direction
SpecimenCharacteristics
Foreign Association inbound Paths
Associated Object Tree
Linking Object Join Condition
Join Condition CDE ex. MRN
Target
Association Direction
Service A
Service B
Foreign Association Outbound Path
Foreign Association
79caTRIP in-depth ArchitecturecaBIG compatibility
- Challenge
- Silver-compatibility is in some ways (and for good reason) stringent
- Grid technologies were still in development (caGrid 1.0 is now released)
- caTRIP is a silver-compatible application (in theory)
- Compatibility submission package completed
- Going through review now for silver-compatible data services
- caTRIP leverages caCORE technologies
- Common Security Module (CSM) provides authorization
- caCORE-SDK provides tooling to create Java classes from UML (XMI), XML schemas, and Castor mappings
- caTRIP leverages caGrid technologies
- Index Service provides advertisement and discovery
- Authentication Service provides
- Dorian helps provide authentication
- GTS provides trust fabrics
80 Next steps
- Aggregate data from multiple services of the same type
- Scenario: caTissue Suite deployed at 13 cancer centers
- Add datasets and data types
- CTMS, population sciences, basic science, etc.
- Add analytical services
- Integrate with workflow
- Add visualization components
- Enhanced reporting
- Automate Excel pivot table
- Data mining results
- Enhanced querying
- Asynchronous, parallel querying
- Querying multiple deployed distributed query services
- Continue refinement of user interface
- Synchronization of advanced and simple GUI
- Additional usability features
81 caGrid: caBIG Resources
- caBIG Website: http://cabig.cancer.gov/index.asp
- caBIG Compatibility Guidelines: https://cabig.nci.nih.gov/compatibility_guidelines_documentation/
- Cancer Common Ontologic Representation Environment (caCORE): http://ncicb.nci.nih.gov/NCICB/infrastructure/cacore_overview
- Enterprise Vocabulary Services (EVS): http://ncicb.nci.nih.gov/NCICB/infrastructure/cacore_overview/vocabulary
- Cancer Data Standards Repository (caDSR): http://ncicb.nci.nih.gov/NCICB/infrastructure/cacore_overview/cadsr
- caCORE Software Developers Kit (caCORE SDK): http://ncicb.nci.nih.gov/NCICB/infrastructure/cacoresdk
- caCORE Training: http://ncicb.nci.nih.gov/NCICB/training/cadsr_training
- Model Driven Architecture: http://www.omg.org/mda/
- UML Modeling: http://www.sparxsystems.com.au/UML_Tutorial.htm
82 caTRIP: Why can't I just write DCQL?
- What are all the tissue specimens from HER2/neu positive patients that have a primary tumor in the breast and are BRCA1 positive?
<DCQLQuery xmlns="http://caGrid.caBIG/1.0/gov.nih.nci.cagrid.dcql">
  <TargetObject name="edu.wustl.catissuecore.domainobject.impl.TissueSpecimenImpl" serviceURL="http://152.16.96.114/wsrf/services/cagrid/CaTissueCore">
    <Association name="edu.wustl.catissuecore.domainobject.impl.SpecimenCollectionGroupImpl" roleName="specimenCollectionGroup">
      <Association name="edu.wustl.catissuecore.domainobject.impl.ClinicalReportImpl" roleName="clinicalReport">
        <Association name="edu.wustl.catissuecore.domainobject.impl.ParticipantMedicalIdentifierImpl" roleName="participantMedicalIdentifier">
          <Group logicRelation="AND">
            <ForeignAssociation>
              <JoinCondition>
                <LeftJoin>
                  <Object>edu.wustl.catissuecore.domainobject.impl.ParticipantMedicalIdentifierImpl</Object>
                  <Property>medicalRecordNumber</Property>
                </LeftJoin>
                <RightJoin>
                  <Object>edu.duke.catrip.cae.domain.general.ParticipantMedicalIdentifier</Object>
                  <Property>medicalRecordNumber</Property>
                </RightJoin>
              </JoinCondition>
              <ForeignObject name="edu.duke.catrip.cae.domain.general.ParticipantMedicalIdentifier" serviceURL="http://152.16.96.114/wsrf/services/cagrid/CAE">
                <Association name="edu.duke.catrip.cae.domain.general.Participant" roleName="participant">
[Query annotations: Select tissue; Foreign Join w/ CAE: HER2/NEU Positive; Foreign Join w/ Tumor Registry: Primary Site Breast; Foreign Join w/ CGEMS: BRCA1 Positive]
83 caTRIP: Distributed query engine
[Diagram: the Distributed Query Engine receives a DCQL query and sends a CQL query to each of three caGrid data services, each backed by its own database. Each service returns data objects to the engine, which returns data objects to the caller.]
84 CTMSi: BRIDG dynamic modeling
- Process flow
- Storyboards
- Scenarios
- Use cases
- Text UML activity diagrams
- Links to static structures
- Interaction diagrams (?)
- Sequence diagrams
- Collaboration diagrams (UML 2.0)
85 CTMSi: Patient registration message
[Diagram labels: caAERS Grid Service, JMS OUT Queue, ESB, Router, JMS IN Queue, PSC Grid Service]
86 caBIG compatibility: CDE Browser
87 caBIG compatibility: CDE Browser permissible values
88 caBIG compatibility: NCI Thesaurus
[Screenshot labels: Concept Code, Relationships, Preferred Name, Definition, Synonyms]
89 caGrid: caGrid community involvement
- caGrid itself provides no real data or analysis to caBIG
- It's the enabling infrastructure which allows the community to do so
- Community members add value to the grid as applications, services, and processes (for example, shared workflows)
- caGrid provides the necessary core services, APIs, and tooling
- The real value of the grid comes from bringing this information to the end user
- Data Services expose data to the grid in a unified way
- Analytical Services expose analytical operations to the grid
- Community members develop end user applications which consume the resources provided by the grid
90 caGrid: caGrid exposing silver systems
- Object Oriented APIs and data resources are developed using Object types and information models registered in the caDSR
- These silver systems are grid-enabled by defining a grid service interface that defines the functionality to be exposed to the grid
- The grid service interface uses the same Object types as the existing system, but leverages a platform and language neutral representation (XML) of them
- The grid service implementation maps service invocations to API calls or queries into the existing system
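The wrapping pattern described on this slide can be sketched in a few lines. Everything here is a stand-in invented for illustration: `Gene`, `GeneApi`, `GeneGridService`, and the XML layout are not caGrid or caCORE types. The sketch only shows an existing object API being called by a service layer that returns the same objects in an XML form.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, asdict


@dataclass
class Gene:
    """Hypothetical domain object of the existing system."""
    symbol: str
    name: str


class GeneApi:
    """Stand-in for the existing object-oriented API."""

    def __init__(self, genes):
        self._genes = list(genes)

    def find_by_symbol(self, symbol):
        return [g for g in self._genes if g.symbol == symbol]


def to_xml(obj):
    """Language-neutral representation: one element per object, one attribute per field."""
    element = ET.Element(type(obj).__name__)
    for field, value in asdict(obj).items():
        element.set(field, str(value))
    return element


class GeneGridService:
    """Hypothetical service interface: maps an invocation to an API call."""

    def __init__(self, api):
        self._api = api

    def query(self, symbol):
        results = ET.Element("Results")
        for gene in self._api.find_by_symbol(symbol):
            results.append(to_xml(gene))
        return ET.tostring(results, encoding="unicode")


api = GeneApi([Gene("BRCA1", "breast cancer 1"), Gene("TP53", "tumor protein p53")])
service = GeneGridService(api)
response = service.query("BRCA1")
```

The existing API is left untouched; only the thin service class knows about XML.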
91 caGrid: Federated Query Processor
- Provides a mechanism to perform basic distributed aggregations and joins of queries over multiple data services
- As caGrid data services all use a uniform query language, CQL, the Federated Query Infrastructure can be used to express queries over any combination of caGrid data services
- Federated queries are expressed with a query language, DCQL, which is an extension to CQL to express such concepts as joins, aggregations, and target services
- Implemented as a stateful grid service, queries may be executed asynchronously and results retrieved at a later time
- Supports secure deployments wherein result ownership is enforced
- Coupled with semantic discovery capabilities of caGrid, provides a powerful framework for data discovery, mining, and integration
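The idea of a join across data services can be illustrated with two in-memory lists standing in for services. This is a sketch under stated assumptions, not the caGrid implementation: `query_service`, `federated_query`, the sample rows, and the field names are hypothetical, and the real processor works with CQL and DCQL rather than Python dictionaries.

```python
def query_service(rows, **criteria):
    """Stand-in for sending one query to one data service."""
    return [r for r in rows if all(r.get(k) == v for k, v in criteria.items())]


def federated_query(target_rows, foreign_rows, join_property, **foreign_criteria):
    """Keep target objects that join to a matching object on the foreign service."""
    # Step 1: run the foreign part of the query on the foreign service.
    foreign_matches = query_service(foreign_rows, **foreign_criteria)
    # Step 2: collect the join values that the foreign service returned.
    join_values = {m[join_property] for m in foreign_matches}
    # Step 3: keep target objects whose join property is among those values.
    return [t for t in target_rows if t[join_property] in join_values]


# Hypothetical contents of two services, keyed by a deidentified MRN.
TISSUE = [
    {"specimen": "T-1", "medicalRecordNumber": "ABC123"},
    {"specimen": "T-2", "medicalRecordNumber": "DEF456"},
    {"specimen": "T-3", "medicalRecordNumber": "GHI789"},
]
CAE = [
    {"medicalRecordNumber": "ABC123", "her2neu": "positive"},
    {"medicalRecordNumber": "DEF456", "her2neu": "negative"},
    {"medicalRecordNumber": "GHI789", "her2neu": "positive"},
]

result = federated_query(TISSUE, CAE, "medicalRecordNumber", her2neu="positive")
specimens = [r["specimen"] for r in result]
```

Here `specimens` contains the tissue specimens whose patients are HER2/neu positive according to the second service; the join is made on the shared medical record number.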
92 caGrid: Data service common query language
- Specifies a target object (result) type and selects the instances which satisfy the specified properties and nested object properties
- Allows path navigation
- Provides logical grouping
- Provides name/predicate/value filtering on properties of objects
- Recursively defined
- Ability to return full Objects, Set of attributes, count of results, or distinct attribute values
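The recursive structure described above (attribute filters, association navigation, logical groups) can be sketched as a small evaluator over in-memory objects. The dictionary-based query format and the `matches` function are hypothetical and exist only for this illustration; they are not the CQL schema, and LIKE is simplified here to a prefix match.

```python
def matches(obj, node):
    """Recursively test one object against one query node."""
    kind = node["kind"]
    if kind == "attribute":
        # Name/predicate/value filtering on a property of the object.
        actual = obj.get(node["name"])
        if node["predicate"] == "EQUAL_TO":
            return actual == node["value"]
        if node["predicate"] == "LIKE":
            # Simplification: treat LIKE as "starts with".
            return isinstance(actual, str) and actual.startswith(node["value"])
        raise ValueError("unsupported predicate: " + node["predicate"])
    if kind == "association":
        # Path navigation: follow the role name to the nested object.
        nested = obj.get(node["role"])
        return nested is not None and matches(nested, node["child"])
    if kind == "group":
        # Logical grouping of any mix of attributes, associations, and groups.
        results = [matches(obj, child) for child in node["children"]]
        return all(results) if node["relation"] == "AND" else any(results)
    raise ValueError("unknown node kind: " + kind)


genes = [
    {"symbol": "BRCA1", "taxon": {"scientificName": "Homo sapiens"}},
    {"symbol": "BRCA2", "taxon": {"scientificName": "Mus musculus"}},
    {"symbol": "TP53", "taxon": {"scientificName": "Homo sapiens"}},
]

query = {
    "kind": "group",
    "relation": "AND",
    "children": [
        {"kind": "attribute", "name": "symbol", "predicate": "LIKE", "value": "BRCA"},
        {
            "kind": "association",
            "role": "taxon",
            "child": {
                "kind": "attribute",
                "name": "scientificName",
                "predicate": "EQUAL_TO",
                "value": "Homo sapiens",
            },
        },
    ],
}

selected = [g["symbol"] for g in genes if matches(g, query)]  # full-object style result
count = len(selected)                                         # count-of-results style result
```

Because every node is evaluated by the same function, groups can nest inside associations and associations inside groups to any depth.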
93 caGrid: Example CQL query
Return all Genes with a symbol beginning with BRCA that have an associated Taxon with a scientificName equal to Homo sapiens
<CQLQuery xmlns="http://CQL.caBIG/1/gov.nih.nci.cagrid.CQLQuery">
  <Target name="gov.nih.nci.cabio.domain.Gene">
    <Group logicRelation="AND">
      <Attribute name="symbol" predicate="LIKE" value="BRCA"/>
      <Association roleName="taxon" name="gov.nih.nci.cabio.domain.Taxon">
        <Attribute name="scientificName" predicate="EQUAL_TO" value="Homo sapiens"/>
      </Association>
    </Group>
  </Target>
</CQLQuery>
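A query document of this shape can also be assembled programmatically instead of typed by hand. The sketch below uses only Python's standard `xml.etree.ElementTree`; the helper `build_gene_query` is hypothetical, and the element and attribute names are simply copied from the slide rather than checked against the CQL schema.

```python
import xml.etree.ElementTree as ET

# Namespace as shown on the slide.
CQL_NS = "http://CQL.caBIG/1/gov.nih.nci.cagrid.CQLQuery"


def _tag(name):
    """Qualify an element name with the CQL namespace."""
    return "{%s}%s" % (CQL_NS, name)


def build_gene_query(symbol_pattern, species):
    """Build the Gene/Taxon query from the slide for the given values."""
    query = ET.Element(_tag("CQLQuery"))
    target = ET.SubElement(query, _tag("Target"), {"name": "gov.nih.nci.cabio.domain.Gene"})
    group = ET.SubElement(target, _tag("Group"), {"logicRelation": "AND"})
    ET.SubElement(
        group,
        _tag("Attribute"),
        {"name": "symbol", "predicate": "LIKE", "value": symbol_pattern},
    )
    taxon = ET.SubElement(
        group,
        _tag("Association"),
        {"roleName": "taxon", "name": "gov.nih.nci.cabio.domain.Taxon"},
    )
    ET.SubElement(
        taxon,
        _tag("Attribute"),
        {"name": "scientificName", "predicate": "EQUAL_TO", "value": species},
    )
    return query


# Emit the namespace as the default xmlns rather than a generated prefix.
ET.register_namespace("", CQL_NS)
xml_text = ET.tostring(build_gene_query("BRCA", "Homo sapiens"), encoding="unicode")
```

Building the tree through an XML library guarantees matching tags and quoted attribute values, which hand-written queries easily get wrong.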
94 caBIG compatibility: Metadata and concepts example