Title: TM
1 caBIG ICR Face-to-Face: CDE Creation Lessons
Learned
Craig Street / Vishal Nayak
Biomedical Informatics Facility, University of
Pennsylvania
caBIG ICR Face-to-Face May 2-3, 2005
2 GENE CDE Focus Group
- A focus group was formed within the Genome
Annotation SIG and was charged with the
responsibility of creating relevant CDEs.
- Consisting of Craig Street, Rakesh Nagarajan,
Juli Klemm, and Vishal Nayak (with input from
Harold Riethman and Baris Suzek).
- It was decided that the developers are experts in
their domain and know best how to model their
data. We opted not to be restrictive in our
modeling strategy and created CDEs that were as
inclusive as possible.
- It was decided to focus initially on the modeling
of Genomic Identifiers.
3 Genomic Identifier Modeling Strategy
- Recommendations to Developers
- Polled Developers of ICR Projects to Determine
What Identifiers are Currently Being Used in
their Systems
- Compiled List of Required Genomic Identifiers
(i.e., DNA or its RNA or Protein Product) based
on Survey
- Each ICR Project's Object Model must utilize AT
LEAST ONE of these Defined Genomic Identifier
CDEs for each UML Class with any Genomic
Identifier Attributes.
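
The last recommendation lends itself to a mechanical check. The following is a minimal sketch, assuming a hypothetical in-memory description of a project's object model and a hypothetical set of approved identifier CDE names; none of these names or structures come from the caDSR or the caCORE SDK.

```python
from dataclasses import dataclass, field

# Hypothetical names standing in for the defined Genomic Identifier CDEs.
APPROVED_IDENTIFIER_CDES = {"geneEnsemblIdentifier", "geneRefSeqIdentifier"}


@dataclass
class UmlAttribute:
    name: str
    is_genomic_identifier: bool = False  # assumed flag, set by the modeler


@dataclass
class UmlClass:
    name: str
    attributes: list = field(default_factory=list)


def classes_missing_approved_cde(model, approved=APPROVED_IDENTIFIER_CDES):
    """Return names of classes that have genomic identifier attributes
    but use none of the approved identifier CDEs."""
    missing = []
    for cls in model:
        id_attrs = [a.name for a in cls.attributes if a.is_genomic_identifier]
        # Classes without any genomic identifier attribute are exempt.
        if id_attrs and not approved.intersection(id_attrs):
            missing.append(cls.name)
    return missing


# Illustrative model: only "Probe" breaks the rule.
model = [
    UmlClass("Gene", [UmlAttribute("geneEnsemblIdentifier", True),
                      UmlAttribute("localGeneId", True)]),
    UmlClass("Probe", [UmlAttribute("vendorProbeId", True)]),
    UmlClass("Experiment", [UmlAttribute("title")]),
]
print(classes_missing_approved_cde(model))
```

A class may still carry its own local identifiers; the rule only asks that at least one approved CDE sits alongside them.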
4 Genomic Identifier Modeling Strategy
- How should the genomic identifiers be modeled?
- (a) Should the identifiers be made object classes
AND the type of identifier the property?
- OR
- (b) Should the identifiers be modeled as a
property of the Gene, mRNA, Protein object
classes with the type of identifier as a property
qualifier?
- (b) was adopted as it is the more intuitive and
commonly used modeling strategy. This will be the
recommended modeling strategy and allows for the
harmonization of CDEs based on immutable EVS
concept codes.
- This presentation focuses on the lessons learned
from the recommended modeling strategy. The
alternate strategies will be dealt with at a
later stage based on need.
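
The difference between options (a) and (b) can be made concrete with a small sketch. It assumes a simplified, hypothetical record with an object class, a property, and an optional property qualifier; the concept names used for option (a) are one possible reading of the question, not names taken from the caDSR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DataElementConcept:
    """Simplified, hypothetical stand-in for a CDE's naming parts."""
    object_class: str
    prop: str
    property_qualifier: Optional[str] = None

    def label(self) -> str:
        # Qualifier, when present, is read in front of the property.
        parts = [self.object_class, self.property_qualifier, self.prop]
        return " ".join(p for p in parts if p)


# Option (a): the identifier is the object class, its type is the property.
option_a = DataElementConcept(object_class="Genomic Identifier", prop="Ensembl")

# Option (b), the adopted one: the identifier is a property of Gene,
# and the identifier type is a property qualifier.
option_b = DataElementConcept(object_class="Gene", prop="Identifier",
                              property_qualifier="Ensembl")

print(option_a.label())
print(option_b.label())
```

Under option (b) the Gene, mRNA, and Protein object classes stay the anchor, and each new identifier type only adds a qualifier concept.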
5 First Step: Creating definitions for the various
concepts
- Came up with a list of comprehensive and precise
definitions for all the elements (object classes,
properties, qualifiers) from disparate sources.
- Obtained consensus from the members of the Genome
Annotation SIGs and the V/CDE workspace
facilitators on these definitions.
- These definitions are used in the Documentation
and Description tags for the UML classes and
attributes respectively.
- Many of the Genomic Identifiers were NOT present
as concepts in the EVS. Some of these definitions
could be used to define these concepts.
6 First Step: Creating definitions for the various
concepts
- Lessons Learned
- When creating definitions de novo, it is quicker
and more acceptable to take each definition from
a single source rather than assembling it ad hoc
from a variety of sources. This is true
especially if these concepts have to be entered
into the EVS.
- If the definitions have to replace or be cited as
alternate definitions to existing EVS concepts,
there should be a good reason why the existing
definitions are insufficient, or an explanation
of how the new definitions will enhance or
complement the existing ones.
7 Second Step: UML Modeling with Enterprise
Architect
8 Second Step: UML Model
- The UML model we created is NOT an information
model for a system.
- It is just a way (the preferred way) of entering
the Genomic Identifier CDEs into the caDSR.
- We created three classes (Gene, MessengerRNA and
Protein) and have the preferred genomic
identifiers as attributes.
- The UML model created some confusion
- To map or not to map, that is the question.
9 First Way to Model
10 Second Way to Model
11 Which Way Should be Adopted?
- WE WILL STICK TO CDE CREATION FOR NOW!
- Mapping the identifiers is too complicated to be
handled by a partial model like this.
- The first way of modeling is sufficient to put
the CDEs into the caDSR.
- The aim of this exercise is to provide developers
with pre-existing Genomic Identifier CDEs that
are expected to be commonly reused, as opposed
to mapping.
12 Third Step: Creating the UML Model with
Enterprise Architect
- This involved creation of classes and attributes
and their associated tags and adding them to the
Logical Diagram.
- Care was taken to follow the Semantic Connector
Best Naming Practices as closely as possible.
There was some debate on whether there should be
separate Ensembl_gene, Ensembl_transcript and
Ensembl_protein OR just Ensembl as a property
qualifier (similarly for RefSeq). It was decided
to just have Ensembl AND RefSeq.
- The class names were title cased, and camel case
was used in attribute names to separate the
property from its qualifiers, so that the
semantic connector can distinguish between the
different concepts.
- Documentation tags were created for the UML
classes and Description tags were created for the
attributes.
- An effort was made to create ALL the required
tags, including the concept code tags, in order
to reduce the number of iterations through the
semantic connector needed for it to identify the
concept codes for the individual concepts.
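
The naming convention matters because a tool can only recover the individual concepts from a name if the word boundaries are marked. The sketch below illustrates that idea with a regular expression; the attribute name is hypothetical, and this is not the semantic connector's actual algorithm.

```python
import re


def split_camel_case(name):
    """Split a camelCase or TitleCase name into its component words.

    Runs of capitals (e.g. "RNA") are kept together as one word.
    """
    return re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)


# Hypothetical attribute name: object class, qualifier, property.
print(split_camel_case("geneEnsemblIdentifier"))
# Class name from the model, with an acronym at the end.
print(split_camel_case("MessengerRNA"))
```

A name written entirely in lower case would come back as a single token, which is why consistent casing reduces the manual matching left for later iterations.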
13 Third Step: Creating the UML Model with
Enterprise Architect
- The length of the tag value field is sometimes
NOT sufficient to hold the complete definition.
- Added only the truncated definition in the
tag-value field and added the complete definition
as a NOTE for the tag. When the .xmi is
generated, this is translated to a <UML:Comment>
field.
- All the Genomic Identifiers were indicated to be
of type java.lang.String. The data type objects
are included in the model. The value domains can
be curated once the CDEs are in the caDSR.
- The xmi file is generated.
- NEWS FLASH! UML models created from a different
version of EA might not generate a perfect xmi.
The compatibilities have to be worked out.
14 Fourth Step: Installation of the caCORE SDK
- To run the xmi through the semantic connector,
the caCORE SDK (with the functional
semantic-connector tool) has to be installed.
- caCORE SDK 1.0.2 was successfully installed.
15 Fifth Step: Semantic Connection
16 WAIT!!!!!!
- Before running the xmi file through the semantic
connector, it has to be checked whether it is a
valid MDR XMI file.
- The xmi generated by EA is NOT a valid MDR XMI
file!!!!
- There is a solution: the ant fix-ea tool
generously supplied with the caCORE SDK toolkit.
- Modified the fix-ea tool and the
deploy.properties file to fix the generated xmi.
- There is one more thing that needed to be done:
the <UML:Comment> tags and the truncated EA tag
values! Also, there are these mysterious NOTES
tags appended to some of the EA definitions.
- Considered adding a function to the FixEAXMI
class to fix these problems, but this is too
specific an issue. Am considering writing one at
a later stage, if needed.
- Corrected them manually; not sure if this is a
best practice!
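
Before correcting values by hand, it can help to list which tag values were probably cut off. The sketch below flags tag values that reach a given length limit. It assumes a simplified, hypothetical XML layout (a namespace-free `TaggedValue` element with `tag` and `value` attributes) and a caller-supplied limit; the real EA export and its field length should be checked before relying on it.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified sample; a real EA xmi uses namespaces
# and a richer structure.
SAMPLE = f"""<Model>
  <TaggedValue tag="description" value="A short definition."/>
  <TaggedValue tag="documentation" value="{'X' * 20}"/>
</Model>"""


def possibly_truncated(xml_text, max_len):
    """Return the tag names whose value is at least max_len characters,
    i.e. values that may have been cut off at the field limit."""
    root = ET.fromstring(xml_text)
    return [
        tv.get("tag")
        for tv in root.iter("TaggedValue")
        if len(tv.get("value", "")) >= max_len
    ]


# max_len=20 is only chosen to fit the sample above.
print(possibly_truncated(SAMPLE, max_len=20))
```

The check only narrows down where to look; each flagged value still has to be compared against the complete definition.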
17 NOT YET!!
- A number of identifiers do not have associated
EVS tags.
- Created a list of all the CDEs with all the
associated EVS concepts for the object classes,
properties and qualifiers and sent it out to
the group and the EVS team. The concepts that
are not present in the EVS were highlighted.
- Getting the terms into the EVS can be a major
bottleneck. This should have been recognized and
we should have contacted the EVS team at an
earlier stage.
- The concepts are in the EVS currently but they
are not visible!
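
The list described above, with the missing concepts highlighted, can be produced by a short script. This is a minimal sketch under stated assumptions: the CDE names are hypothetical, the codes are placeholders rather than real EVS concept codes, and which concepts are marked missing is purely illustrative.

```python
# Each hypothetical CDE lists its object class, property and qualifier.
cdes = {
    "Gene Ensembl Identifier": ["Gene", "Identifier", "Ensembl"],
    "Protein RefSeq Identifier": ["Protein", "Identifier", "RefSeq"],
}

# Placeholder codes, NOT real EVS concept codes.
# None marks a concept assumed (for illustration) to be absent.
concept_codes = {
    "Gene": "CODE-1",
    "Protein": "CODE-2",
    "Identifier": "CODE-3",
    "Ensembl": None,
    "RefSeq": None,
}


def missing_concepts(cdes, concept_codes):
    """Map each CDE to the concepts that have no concept code yet."""
    report = {}
    for cde, concepts in cdes.items():
        missing = [c for c in concepts if concept_codes.get(c) is None]
        if missing:
            report[cde] = missing
    return report


for cde, missing in missing_concepts(cdes, concept_codes).items():
    print(f"{cde}: missing {', '.join(missing)}")
```

Running such a report as soon as the concept list exists would surface the EVS gaps early, which is the lesson drawn on this slide.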
18 THIS IS WHAT THE NEXT STEPS SHOULD LOOK LIKE
(WITH THE CURRENT SDK)
- Run the ant task semantic-connector, which will
generate a report associating the classes,
properties and their qualifiers with immutable
EVS concept codes.
- Distribute the report to the EVS team (Nicole
Thomas) and the Genome Annotation SIG to ensure
everyone's satisfaction.
- Modify the xmi file using the ant task
semantic-connector.
- Submit the xmi file to NCICB for UML Loading once
ALL the concepts are tagged.
- The CDEs will then be loaded on to the caDSR
staging server.
- The CDEs can be further curated using the CDE
curation tool. This includes adding the Value
Domains.
- When the CDEs are approved, they are loaded on to
the caDSR production server.
19 BUT WE TOOK A SHORTCUT
- Since the number of CDEs was small and the EVS
concepts cannot be seen as of now, George
Komatsoulis and his team generously offered to
manually annotate them and load them into the
caDSR staging server.
- The CDEs were loaded into the caDSR staging
server last evening!
- http://cdebrowser-stage.nci.nih.gov/
20 ISSUES THAT NEED TO BE RESOLVED
- The context and the classification schema into
which these CDEs are to be loaded have still not
been addressed. (I see they have been loaded into
the caCORE context and the caBIG classification
schema.)
- The question of whether it is acceptable to come
up with MORE COMPREHENSIVE definitions from
various sources for new EVS concepts, as opposed
to citing a single, reliable source, also needs
to be discussed.
21 SUMMARY
- Establishing effective communication and reaching
a consensus on some very contentious issues has
been a major bottleneck in this process.
- The scope of this process has also been
uncertain. The question of whether the list of
Genomic Identifier CDEs has to be more
comprehensive has been hotly debated. We decided
to opt for more flexibility and power to the
developers.
22 ACKNOWLEDGMENTS
- Rakesh Nagarajan, Juli Klemm, Harold Riethman and
Baris Suzek for their invaluable input in forming
the modeling strategy.
- The EVS team for creating the new EVS concepts at
short notice.
- The V/CDE team (George Komatsoulis, Hong Dang,
Tommie Curtis, Brian Davis, Mike Keller) for
providing us all the expertise and resources for
the CDE creation exercise.
- Niket Parikh from BAH for help with EA issues.
- All members of the Genome Annotation SIG for
their valuable contributions.
23 QUESTIONS???