About This Presentation

Title:

TM

Description:

Craig Street/Vishal Nayak. Biomedical Informatics Facility. University ... Also, there are these mysterious #NOTES# tags appended to some of the EA definitions. ... – PowerPoint PPT presentation

Slides: 24

Provided by: michael1399

Transcript and Presenter's Notes

Title: TM

1
TM
caBIG ICR Face-to-Face: CDE Creation Lessons Learned
Craig Street/Vishal Nayak
Biomedical Informatics Facility, University of Pennsylvania
caBIG ICR Face-to-Face, May 2-3, 2005
2
GENE CDE Focus Group

A focus group was formed within the Genome
Annotation SIG and was charged with the
responsibility of creating relevant CDEs.
Consisting of Craig Street, Rakesh Nagarajan,
Juli Klemm, and Vishal Nayak (with input from
Harold Riethman and Baris Suzek).
It was decided that the developers are experts in
their domain and know best how to model their
data. We opted not to be restrictive in our
modeling strategy and created CDEs that were as
inclusive as possible.
It was decided to focus initially on the modeling
of Genomic Identifiers.

3
Genomic Identifier Modeling Strategy

Recommendations to Developers
Polled Developers of ICR Projects to Determine
What Identifiers are Currently Being Used in
their Systems
Compiled List of Required Genomic Identifiers
(i.e., DNA or its RNA or Protein Product) based
on Survey
Each ICR Project's Object Model must utilize AT
LEAST ONE of these Defined Genomic Identifier
CDEs for each UML Class with any Genomic
Identifier Attributes.
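
The "AT LEAST ONE" rule above can be checked mechanically. The following is a minimal sketch, assuming a model is summarized as a mapping from UML class name to the set of its genomic identifier attributes. The CDE attribute names and the example classes are hypothetical placeholders, not the list compiled from the survey.

```python
# Hypothetical names standing in for the defined Genomic Identifier CDEs.
DEFINED_IDENTIFIER_CDES = frozenset({"ensemblIdentifier", "refSeqIdentifier"})


def classes_missing_defined_identifier(model):
    """Return the classes that have genomic identifier attributes
    but use none of the defined Genomic Identifier CDEs."""
    missing = []
    for class_name, identifier_attrs in model.items():
        # Classes without any genomic identifier attribute are exempt.
        if identifier_attrs and not (set(identifier_attrs) & DEFINED_IDENTIFIER_CDES):
            missing.append(class_name)
    return sorted(missing)


# Hypothetical model summary: class name -> genomic identifier attributes.
example_model = {
    "Gene": {"ensemblIdentifier", "localGeneId"},  # uses a defined CDE
    "Probe": {"vendorProbeId"},                    # only a local identifier
    "Experiment": set(),                           # no genomic identifiers
}

print(classes_missing_defined_identifier(example_model))
```

A class may still carry its own local identifiers. The rule only asks that at least one defined CDE sits alongside them.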

4
Genomic Identifier Modeling Strategy

How should the genomic identifiers be modeled?
(a) Should the identifiers be made object classes
AND the type of identifier the property?
OR
(b) Should the identifiers be modeled as a
property to the Gene, mRNA, Protein object
classes with the type of identifier as a property
qualifier?
(b) was adopted as it is the more intuitive and
commonly used modeling strategy. This will be the
recommended modeling strategy and allows for the
harmonization of CDEs based on immutable EVS
concept codes.
This presentation focuses on the lessons learned
from the recommended modeling strategy. The
alternate strategies will be dealt with at a
later stage based on need.
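
The difference between options (a) and (b) may be easier to see in code. This is only an illustrative sketch. The attribute names (ensemblIdentifier, refSeqIdentifier), the GenomicIdentifier class and the example value are assumptions made for illustration, not names taken from the actual model.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class Gene:
    """Strategy (b): identifiers are properties of the Gene object class,
    and the identifier type (Ensembl, RefSeq) qualifies the property."""
    ensemblIdentifier: Optional[str] = None  # hypothetical attribute name
    refSeqIdentifier: Optional[str] = None   # hypothetical attribute name


@dataclass
class GenomicIdentifier:
    """Strategy (a), for contrast: the identifier is the object class
    and the type of identifier is its property."""
    identifierType: str
    value: str


def to_strategy_a(gene: Gene) -> List[GenomicIdentifier]:
    """Re-express a strategy (b) object as strategy (a) records."""
    suffix = "Identifier"
    records = []
    for f in fields(gene):
        value = getattr(gene, f.name)
        if value is not None and f.name.endswith(suffix):
            qualifier = f.name[: -len(suffix)]
            # "ensembl" -> "Ensembl", "refSeq" -> "RefSeq"
            records.append(
                GenomicIdentifier(qualifier[0].upper() + qualifier[1:], value)
            )
    return records


gene = Gene(ensemblIdentifier="EXAMPLE-0001")  # placeholder, not a real accession
print(to_strategy_a(gene))
```

Both forms carry the same information. Under (b), each qualified property on Gene, mRNA or Protein becomes its own CDE.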

5
First Step: Creating definitions for the various
concepts

Came up with a list of comprehensive and precise
definitions for all the elements (object classes,
properties, qualifiers) from disparate sources.
Obtained consensus from the members of the Genome
Annotation SIGs and the V/CDE workspace
facilitators on these definitions.
These definitions are used in the Documentation
and Description tags for the UML classes and
attributes respectively.
Many of the Genomic Identifiers were NOT present
as concepts in the EVS. Some of these definitions
could be used to define these concepts.

6
First Step: Creating definitions for the various
concepts

Lessons Learned
While creating definitions de novo, it is quicker
and more acceptable to get the definition from a
single source rather than creating them ad hoc
from a variety of sources. This is true
especially if these concepts have to be entered
into the EVS.
If the definitions have to replace or be cited as
alternate definitions to existing EVS concepts,
there should be a good reason why the existing
definitions are insufficient or how the new
definitions will enhance/complement the existing
ones.

7
Second Step: UML Modeling with Enterprise
Architect
8
Second Step: UML Model

The UML model we created is NOT an information
model for a system.
It is just a way (the preferred way) of entering
the Genomic Identifier CDEs into the caDSR.
We created three classes (Gene, MessengerRNA and
Protein) with the preferred genomic identifiers
as attributes.
The UML model created some confusion: to map or
not to map, that is the question.

9
First Way to Model
10
Second Way to Model
11
Which Way Should be Adopted?

WE WILL STICK TO CDE CREATION FOR NOW!
Mapping the identifiers is too complicated to be
handled by a partial model like this.
The first way of modeling is sufficient to put
the CDEs into the caDSR.
The aim of this exercise is to provide
pre-existing Genomic Identifier CDEs to
developers that are anticipated to be reused
commonly as opposed to mapping.

12
Third Step: Creating the UML Model with
Enterprise Architect

This involved creation of classes and attributes
and their associated tags and adding them to the
Logical Diagram.
Care was taken to follow the Semantic Connector
Best Naming Practices as closely as possible.
There was some debate on whether there should be
separate Ensembl_gene, Ensembl_transcript and
Ensembl_protein OR just Ensembl as a property
qualifier (similarly for RefSeq). It was decided
to just have Ensembl AND RefSeq.
The class names were title cased and camel case
was used to distinguish the object class and
property qualifiers to allow the semantic
connector to distinguish between the different
concepts.
Documentation tags were created for the UML
classes and Description tags were created for the
attributes.
An effort was made to create ALL the required
tags, including the concept code tags in order to
reduce the number of iterations through the
semantic connector for it to identify the concept
codes for the individual concepts.
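
The casing convention above matters because a tool can only separate the concepts in a name if the word boundaries are recoverable. The sketch below shows one way such a split could work. It is an assumption about the general technique, not a description of how the semantic connector actually tokenizes names, and split_name is a hypothetical helper.

```python
import re

# Matches, in order of preference: an acronym run (e.g. "RNA"),
# a capitalized or lowercase word (e.g. "Messenger", "ensembl"), or digits.
_TOKEN = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")


def split_name(name):
    """Split a title-cased or camel-cased UML name into lowercase tokens."""
    return [token.lower() for token in _TOKEN.findall(name)]


print(split_name("MessengerRNA"))
print(split_name("ensemblIdentifier"))
```

A splitter like this would break a qualifier such as "RefSeq" into two tokens. Supplying the concept code tags up front, as described above, avoids depending on how the names are split.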

13
Third Step: Creating the UML Model with
Enterprise Architect

The length of the tag value field is sometimes
NOT sufficient to hold the complete definition.
Added only the truncated definition in the
tag-value field and added the complete definition
as a NOTE for the tag. When the .xmi is
generated, this is translated to a <UML:Comment>
field.
All the Genomic Identifiers were indicated to be
of type java.lang.String. The data type objects
are included in the model. The value domains can
be curated once the CDEs are in the caDSR.
The xmi file is generated.
NEWS FLASH! UML models created from a different
version of EA might not generate a perfect xmi.
The compatibilities have to be worked out.
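
The truncation workaround described on this slide amounts to splitting each definition into a short tag value and a full-length note. Here is a minimal sketch. The 255-character limit is an assumed figure for illustration and is not taken from the EA documentation, and split_definition is a hypothetical helper.

```python
MAX_TAG_VALUE_LENGTH = 255  # assumed limit; the real EA field length may differ


def split_definition(definition, limit=MAX_TAG_VALUE_LENGTH):
    """Return (tag_value, note).

    tag_value is what fits in the tag-value field. note holds the complete
    definition when truncation was needed, otherwise None.
    """
    if len(definition) <= limit:
        return definition, None
    return definition[:limit], definition


short_value, short_note = split_definition("A short definition.")
long_value, long_note = split_definition("x" * 300)
print(len(long_value), long_note is not None)
```

Keeping the complete text in the note means the truncated tag value can later be restored from it.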

14
Fourth Step: Installation of the caCORE SDK

To run the xmi through the semantic connector,
the caCORE SDK (with the functional
semantic-connector tool) has to be installed.
caCORE SDK 1.0.2 was successfully installed.

15
Fifth Step: Semantic Connection
16
WAIT!!!!!!

Before running the xmi file through the semantic
connector, it has to be checked whether it is a
valid MDR XMI file.
The xmi generated by EA is NOT a valid MDR XMI
file!!!!
There is a solution: the ant fix-ea tool
generously supplied with the caCORE SDK toolkit.
Modified the fix-ea tool and the
deploy.properties file to fix the generated xmi.
There is one more thing that needed to be done:
the <UML:Comment> tags and the truncated EA tag
values! Also, there are these mysterious #NOTES#
tags appended to some of the EA definitions.
Considered adding a function to the FixEAXMI
class to fix these problems, but this is too
specific an issue. Am considering writing one at
a later stage, if needed.
Corrected them manually. Not sure if this is a
best practice!
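
The manual correction could in principle be scripted. The sketch below rests on several assumptions that were not verified against a real EA export: that tagged values appear as UML:TaggedValue elements with a value attribute, that the namespace URI is "omg.org/UML1.3", and that the text following the #NOTES# marker is the complete note. It does not handle the <UML:Comment> fields, and restore_tag_values is a hypothetical helper, not part of the fix-ea tool.

```python
import xml.etree.ElementTree as ET

UML_NS = "omg.org/UML1.3"  # assumed namespace URI for the UML prefix
NOTES_MARKER = "#NOTES#"

# Tiny hand-written sample, not an actual EA export.
SAMPLE = """<XMI xmlns:UML="omg.org/UML1.3">
  <UML:TaggedValue tag="description" value="A truncated defin#NOTES#A truncated definition, now complete."/>
  <UML:TaggedValue tag="documentation" value="Short definition."/>
</XMI>"""


def restore_tag_values(xmi_text):
    """Replace values carrying a #NOTES# marker with the note text after it.

    If nothing follows the marker, the text before it is kept instead.
    Values without the marker are left untouched.
    """
    root = ET.fromstring(xmi_text)
    for tagged_value in root.iter("{%s}TaggedValue" % UML_NS):
        value = tagged_value.get("value", "")
        head, marker, note = value.partition(NOTES_MARKER)
        if marker:
            tagged_value.set("value", note.strip() or head)
    return root


cleaned = restore_tag_values(SAMPLE)
values = [tv.get("value") for tv in cleaned.iter("{%s}TaggedValue" % UML_NS)]
print(values)
```

A script like this only makes sense once the export format has been checked by hand for a given EA version, which is in line with the caution about version compatibility.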

17
NOT YET!!

A number of identifiers do not have associated
EVS tags.
Created a list of all the CDEs with all the
associated EVS concepts for the object classes,
properties and qualifiers and sent them out to
the group and the EVS team. The EVS concepts that
are not present in the EVS were highlighted.
Getting the terms into the EVS can be a major
bottleneck. This should have been recognized and
we should have contacted the EVS team at an
earlier stage.
The concepts are in the EVS currently but they
are not visible!
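
The list sent to the group and the EVS team, with the absent concepts highlighted, can be thought of as a simple lookup for missing concept codes. A minimal sketch follows. The CDE names, concept names and codes are placeholders invented for illustration and are not real EVS concept codes.

```python
def find_untagged(cde_concepts):
    """Return {cde_name: [concept names without an EVS concept code]}.

    cde_concepts maps each CDE name to a dict of concept name -> concept
    code, where None (or an empty string) means the concept is not in the EVS.
    """
    missing = {}
    for cde_name, concepts in cde_concepts.items():
        absent = sorted(name for name, code in concepts.items() if not code)
        if absent:
            missing[cde_name] = absent
    return missing


# Placeholder data: neither the names nor the codes come from the EVS.
example = {
    "Gene ensemblIdentifier": {
        "Gene": "PLACEHOLDER-1",
        "Identifier": "PLACEHOLDER-2",
        "Ensembl": None,
    },
    "Protein refSeqIdentifier": {
        "Protein": "PLACEHOLDER-3",
        "Identifier": "PLACEHOLDER-2",
        "RefSeq": "PLACEHOLDER-4",
    },
}

print(find_untagged(example))
```

Running a check like this as soon as the identifier list is drafted would surface the missing concepts early, which is the point at which the EVS team should be contacted.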

18
THIS IS WHAT THE NEXT STEPS SHOULD LOOK LIKE
(WITH THE CURRENT SDK)

Running the ant task semantic-connector that will
generate a report associating the classes,
properties and their qualifiers with immutable
EVS concept codes.
Distribute the report to the EVS team (Nicole
Thomas) and the Genome Annotation SIG to ensure
everyone's satisfaction.
Modify the xmi file using the ant task
semantic-connector.
Submit the xmi file to NCICB for UML Loading once
ALL the concepts are tagged.
The CDEs will then be loaded on to the caDSR
staging server.
The CDEs can be further curated using the CDE
curation tool. This includes adding the Value
Domains.
When the CDEs are approved, they are loaded on to
the caDSR production server.

19
BUT WE TOOK A SHORTCUT

Since the number of CDEs was small and the EVS
concepts cannot be seen as of now, George
Komatsoulis and his team generously offered to
manually annotate them and load them into the
caDSR staging server.
The CDEs were loaded into the caDSR staging
server last evening!
http://cdebrowser-stage.nci.nih.gov/

20
ISSUES THAT NEED TO BE RESOLVED

The context and the classification schema into
which these CDEs are to be loaded have still not
been addressed. (I see they have been loaded into
the caCORE context and the caBIG classification
schema.)
The question of whether it is acceptable to come
up with MORE COMPREHENSIVE definitions from
various sources for new EVS concepts as opposed
to citing a single, reliable source also needs to
be talked about.

21
SUMMARY

Establishing effective communication and reaching
a consensus on some very contentious issues has
been a major bottleneck in this process.
The scope of this process has also been under a
cloud. The question of whether the list of
Genomic Identifier CDEs has to be more
comprehensive has been hotly debated. We decided
to opt for more flexibility and power to the
developers.

22
ACKNOWLEDGMENTS

Rakesh Nagarajan, Juli Klemm, Harold Riethman and
Baris Suzek for their invaluable input in forming
the modeling strategy.
The EVS team for creating the new EVS concepts at
short notice.
The V/CDE team (George Komatsoulis, Hong Dang,
Tommie Curtis, Brian Davis, Mike Keller) for
providing us all the expertise and resources for
the CDE creation exercise.
Niket Parikh from BAH for help with EA issues.
All members of the Genome Annotation SIG for
their valuable contributions.

23
QUESTIONS???
