1
caBIG ICR Face-to-Face: CDE Creation Lessons Learned
Craig Street/Vishal Nayak
Biomedical Informatics Facility, University of Pennsylvania
caBIG ICR Face-to-Face, May 2-3, 2005
2
GENE CDE Focus Group
  • A focus group was formed within the Genome
    Annotation SIG and was charged with the
    responsibility of creating relevant CDEs.
  • The group consisted of Craig Street, Rakesh
    Nagarajan, Juli Klemm, and Vishal Nayak (with
    input from Harold Riethman and Baris Suzek).
  • It was decided that the developers are experts in
    their domain and know best how to model their
    data. We opted not to be restrictive in our
    modeling strategy and created CDEs that were as
    inclusive as possible.
  • It was decided to focus initially on the modeling
    of Genomic Identifiers.

3
Genomic Identifier Modeling Strategy
  • Recommendations to Developers
  • Polled Developers of ICR Projects to Determine
    What Identifiers are Currently Being Used in
    their Systems
  • Compiled List of Required Genomic Identifiers
    (i.e., DNA or its RNA or Protein Product) based
    on Survey
  • Each ICR project's object model must use AT
    LEAST ONE of these defined Genomic Identifier
    CDEs for each UML class that has any genomic
    identifier attributes.
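
The at-least-one rule can be checked mechanically. The following is a minimal sketch in Python, assuming a model is represented as a mapping from class name to attribute names. The attribute names and the approved list are hypothetical placeholders, not the actual list compiled from the survey.

```python
# Hypothetical set of attribute names backed by approved Genomic Identifier CDEs.
APPROVED_IDENTIFIERS = {"ensemblIdentifier", "refSeqIdentifier"}


def classes_missing_approved_identifier(model, genomic_attrs):
    """Return classes that carry genomic identifier attributes
    but use none of the approved CDE-backed identifiers.

    model: dict mapping class name -> set of attribute names
    genomic_attrs: set of attribute names regarded as genomic identifiers
    """
    missing = []
    for class_name, attrs in sorted(model.items()):
        identifiers = attrs & genomic_attrs
        # Classes without any genomic identifier are outside the rule.
        if identifiers and not (identifiers & APPROVED_IDENTIFIERS):
            missing.append(class_name)
    return missing


# Made-up project model for illustration only.
example_model = {
    "Gene": {"ensemblIdentifier", "symbol"},
    "Protein": {"localProteinId"},
    "Sample": {"name"},
}
example_genomic_attrs = APPROVED_IDENTIFIERS | {"localProteinId"}
violations = classes_missing_approved_identifier(example_model, example_genomic_attrs)
```

In this made-up model only the Protein class violates the rule: it has a genomic identifier but none from the approved list, while Sample has no genomic identifier at all and is ignored.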

4
Genomic Identifier Modeling Strategy
  • How should the genomic identifiers be modeled?
  • (a) Should the identifiers be made object classes
    AND the type of identifier the property?
  • OR
  • (b) Should the identifiers be modeled as a
    property to the Gene, mRNA, Protein object
    classes with the type of identifier as a property
    qualifier?
  • (b) was adopted as it is the more intuitive and
    commonly used modeling strategy. This will be the
    recommended modeling strategy and allows for the
    harmonization of CDEs based on immutable EVS
    concept codes.
  • This presentation focuses on the lessons learned
    from the recommended modeling strategy. The
    alternate strategies will be dealt with at a
    later stage based on need.
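
Option (b) can be pictured as a small data structure. This is a sketch only, assuming each CDE is described by an object class, a property, and a property qualifier. The naming scheme (`attribute_name`, `long_name`) is an illustrative assumption and not the naming used in the caDSR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdentifierCDE:
    object_class: str         # Gene, MessengerRNA or Protein
    qualifier: str            # type of identifier, e.g. Ensembl or RefSeq
    prop: str = "Identifier"  # the property shared by every identifier CDE

    def attribute_name(self):
        """camelCase attribute name built from qualifier + property (assumed scheme)."""
        return self.qualifier[0].lower() + self.qualifier[1:] + self.prop

    def long_name(self):
        """Readable name built from the three concepts (assumed scheme)."""
        return f"{self.object_class} {self.qualifier} {self.prop}"


# Three object classes and two qualifiers yield six CDEs,
# all built from the same small set of reusable concepts.
cdes = [
    IdentifierCDE(object_class, qualifier)
    for object_class in ("Gene", "MessengerRNA", "Protein")
    for qualifier in ("Ensembl", "RefSeq")
]
```

Because the qualifier is a separate concept, the same Ensembl or RefSeq concept is reused across all three object classes, which is what makes harmonization on concept codes possible.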

5
First Step: Creating definitions for the various
concepts
  • Came up with a list of comprehensive and precise
    definitions for all the elements (object classes,
    properties, qualifiers) from disparate sources.
  • Obtained consensus from the members of the Genome
    Annotation SIGs and the V/CDE workspace
    facilitators on these definitions.
  • These definitions are used in the Documentation
    and Description tags for the UML classes and
    attributes respectively.
  • Many of the Genomic Identifiers were NOT present
    as concepts in the EVS. Some of these definitions
    could be used to define these concepts.

6
First Step: Creating definitions for the various
concepts
  • Lessons Learned
  • While creating definitions de novo, it is quicker
    and more acceptable to get the definition from a
    single source rather than creating them ad hoc
    from a variety of sources. This is true
    especially if these concepts have to be entered
    into the EVS.
  • If the definitions have to replace or be cited as
    alternate definitions to existing EVS concepts,
    there should be a good reason why the existing
    definitions are insufficient or how the new
    definitions will enhance/complement the existing
    ones.

7
Second Step: UML Modeling with Enterprise
Architect
8
Second Step: UML Model
  • The UML model we created is NOT an information
    model for a system.
  • It is just a way (the preferred way) of entering
    the Genomic Identifier CDEs into the caDSR.
  • We created three classes (Gene, MessengerRNA and
    Protein) with the preferred genomic identifiers
    as attributes.
  • The UML model created some confusion:
  • To map or not to map, that is the question.

9
First Way to Model
10
Second Way to Model
11
Which Way Should be Adopted?
  • WE WILL STICK TO CDE CREATION FOR NOW!
  • Mapping the identifiers is too complicated to be
    handled by a partial model like this.
  • The first way of modeling is sufficient to put
    the CDEs into the caDSR.
  • The aim of this exercise is to provide
    developers with pre-existing Genomic Identifier
    CDEs that are expected to be commonly reused,
    not to map identifiers.

12
Third Step: Creating the UML Model with
Enterprise Architect
  • This involved creation of classes and attributes
    and their associated tags and adding them to the
    Logical Diagram.
  • Care was taken to follow the Semantic Connector
    Best Naming Practices as closely as possible.
    There was some debate on whether there should be
    separate Ensembl_gene, Ensembl_transcript and
    Ensembl_protein OR just Ensembl as a property
    qualifier (similarly for RefSeq). It was decided
    to just have Ensembl AND RefSeq.
  • The class names were title cased and camel case
    was used to distinguish the object class and
    property qualifiers to allow the semantic
    connector to distinguish between the different
    concepts.
  • Documentation tags were created for the UML
    classes and Description tags were created for the
    attributes.
  • An effort was made to create ALL the required
    tags, including the concept code tags in order to
    reduce the number of iterations through the
    semantic connector for it to identify the concept
    codes for the individual concepts.
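
The value of consistent casing is that each name can be broken back into its component concepts. The sketch below illustrates the idea in Python. It is not the semantic connector's actual implementation, which the slides do not describe, and the attribute names are illustrative.

```python
import re


def split_camel_case(name):
    """Split a camelCase or TitleCase name into lowercase words.

    Runs of capitals (such as RNA) are kept together as one word.
    """
    words = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [word.lower() for word in words]


# Class names are title cased, attribute names are camel cased.
class_words = split_camel_case("MessengerRNA")
attribute_words = split_camel_case("refSeqIdentifier")
```

Each resulting word can then be looked up as a candidate concept, which is why a name that does not follow the casing convention is harder to match.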

13
Third Step: Creating the UML Model with
Enterprise Architect
  • The length of the tag value field is sometimes
    NOT sufficient to hold the complete definition.
  • Added only the truncated definition in the
    tag-value field and added the complete definition
    as a NOTE for the tag. When the .xmi is
    generated, this is translated to a <UML:Comment>
    field.
  • All the Genomic Identifiers were indicated to be
    of type java.lang.String. The data type objects
    are included in the model. The value domains can
    be curated once the CDEs are in the caDSR.
  • The xmi file is generated.
  • NEWS FLASH! UML models created from a different
    version of EA might not generate a perfect xmi.
    The compatibilities have to be worked out.
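
The truncate-and-annotate workaround can be sketched as follows. The 255-character limit is a hypothetical value chosen for illustration; the slides do not state the actual length of the EA tag value field.

```python
# Hypothetical limit; the real EA tag value field length is not given in the slides.
MAX_TAG_VALUE_LENGTH = 255


def split_definition(definition, limit=MAX_TAG_VALUE_LENGTH):
    """Return (tag_value, note) for a definition.

    If the definition fits, it is used as the tag value and no note is needed.
    Otherwise the tag value holds the truncated text and the note holds
    the complete definition.
    """
    if len(definition) <= limit:
        return definition, None
    return definition[:limit], definition
```

Keeping the complete text in the note means nothing is lost, but it also means the exported file carries two versions of the same definition that have to be reconciled later.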

14
Fourth Step: Installation of the caCORE SDK
  • To run the xmi through the semantic connector,
    the caCORE SDK (with the functional
    semantic-connector tool) has to be installed.
  • caCORE SDK 1.0.2 was successfully installed.

15
Fifth Step: Semantic Connection
16
WAIT!!!!!!
  • Before running the xmi file through the semantic
    connector, it has to be checked whether it is a
    valid MDR XMI file.
  • The xmi generated by EA is NOT a valid MDR XMI
    file!!!!
  • There is a solution: the ant fix-ea tool
    generously supplied with the caCORE SDK toolkit.
  • Modified the fix-ea tool and the
    deploy.properties file to fix the generated xmi.
  • There is one more thing that needed to be fixed:
    the <UML:Comment> tags and the truncated EA tag
    values! Also, there are these mysterious #NOTES#
    tags appended to some of the EA definitions.
  • Considered adding a function to the FixEAXMI
    class to fix these problems, but this is too
    specific an issue. Am considering writing one at
    a later stage, if needed.
  • Corrected them manually; not sure if this is a
    best practice!
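
The manual correction could in principle be scripted. The sketch below is an assumption-laden illustration in Python: it assumes tagged values appear as elements with `tag` and `value` attributes, that the text following the `#NOTES#` marker is the complete definition, and that the namespace URI is the one used in the toy sample. A real EA export may be laid out differently, and the comment elements are not handled here.

```python
import xml.etree.ElementTree as ET

UML_NS = "omg.org/UML1.3"   # assumed namespace URI, matching the toy sample below
NOTES_MARKER = "#NOTES#"


def restore_full_definitions(xmi_text):
    """Replace 'truncated#NOTES#full text' tag values with the full text.

    Returns the parsed root element and the number of values changed.
    """
    root = ET.fromstring(xmi_text)
    changed = 0
    for tagged_value in root.iter(f"{{{UML_NS}}}TaggedValue"):
        value = tagged_value.get("value", "")
        if NOTES_MARKER in value:
            _truncated, note = value.split(NOTES_MARKER, 1)
            tagged_value.set("value", note.strip())
            changed += 1
    return root, changed


# Toy document for illustration; not taken from a real export.
SAMPLE_XMI = """<XMI xmlns:UML="omg.org/UML1.3">
  <UML:TaggedValue tag="description"
      value="A unique ident#NOTES#A unique identifier for a gene."/>
  <UML:TaggedValue tag="documentation" value="A short definition."/>
</XMI>"""

fixed_root, changed_count = restore_full_definitions(SAMPLE_XMI)
fixed_values = [tv.get("value") for tv in fixed_root.iter(f"{{{UML_NS}}}TaggedValue")]
```

A script like this only makes sense if the same problem recurs across models; for a handful of CDEs, manual correction followed by a review of the result is probably cheaper.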

17
NOT YET!!
  • A number of identifiers do not have associated
    EVS tags.
  • Created a list of all the CDEs with all the
    associated EVS concepts for the object classes,
    properties and qualifiers and sent them out to
    the group and the EVS team. The EVS concepts that
    are not present in the EVS were highlighted.
  • Getting the terms into the EVS can be a major
    bottleneck. This should have been recognized and
    we should have contacted the EVS team at an
    earlier stage.
  • The concepts are in the EVS currently but they
    are not visible!
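
The list sent to the group and the EVS team amounts to a simple report of concepts without concept codes. A minimal sketch in Python, assuming each CDE is recorded as a list of (concept name, concept code) pairs; the codes and the choice of which concepts are missing are made up for illustration.

```python
def find_missing_concepts(cdes):
    """Return the sorted names of concepts that have no EVS concept code.

    cdes: dict mapping CDE name -> list of (concept_name, concept_code_or_None)
    """
    missing = set()
    for concepts in cdes.values():
        for concept_name, concept_code in concepts:
            if concept_code is None:
                missing.add(concept_name)
    return sorted(missing)


# Placeholder codes (CODE_A, ...) stand in for real EVS concept codes.
example_cdes = {
    "Gene Ensembl Identifier": [
        ("Gene", "CODE_A"), ("Ensembl", None), ("Identifier", "CODE_B"),
    ],
    "Protein RefSeq Identifier": [
        ("Protein", "CODE_C"), ("RefSeq", None), ("Identifier", "CODE_B"),
    ],
}
to_request = find_missing_concepts(example_cdes)
```

Producing this list as soon as the definitions are agreed would let the request to the EVS team go out before the UML modeling starts.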

18
THIS IS WHAT THE NEXT STEPS SHOULD LOOK LIKE
(WITH THE CURRENT SDK)
  • Running the ant task semantic-connector that will
    generate a report associating the classes,
    properties and their qualifiers with immutable
    EVS concept codes.
  • Distribute the report to the EVS team (Nicole
    Thomas) and the Genome Annotation SIG to ensure
    everyone's satisfaction.
  • Modify the xmi file using the ant task
    semantic-connector.
  • Submit the xmi file to NCICB for UML Loading once
    ALL the concepts are tagged.
  • The CDEs will then be loaded on to the caDSR
    staging server.
  • The CDEs can be further curated using the CDE
    curation tool. This includes adding the Value
    Domains.
  • When the CDEs are approved, they are loaded on to
    the caDSR production server.

19
BUT WE TOOK A SHORTCUT
  • Since the number of CDEs was small and the EVS
    concepts cannot be seen as of now, George
    Komatsoulis and his team generously offered to
    manually annotate them and load them into the
    caDSR staging server.
  • The CDEs were loaded into the caDSR staging
    server last evening!
  • http://cdebrowser-stage.nci.nih.gov/

20
ISSUES THAT NEED TO BE RESOLVED
  • The context and the classification schema into
    which these CDEs are to be loaded have still not
    been addressed. (I see they have been loaded into
    the caCORE context and the caBIG classification
    schema).
  • The question of whether it is acceptable to come
    up with MORE COMPREHENSIVE definitions from
    various sources for new EVS concepts as opposed
    to citing a single, reliable source also needs to
    be talked about.

21
SUMMARY
  • Establishing effective communication and reaching
    a consensus on some very contentious issues has
    been a major bottleneck in this process.
  • The scope of this process has also been under a
    cloud. The question of whether the list of
    Genomic Identifier CDEs has to be more
    comprehensive has been hotly debated. We decided
    to opt for more flexibility and power to the
    developers.

22
ACKNOWLEDGMENTS
  • Rakesh Nagarajan, Juli Klemm, Harold Riethman and
    Baris Suzek for their invaluable input in forming
    the modeling strategy.
  • The EVS team for creating the new EVS concepts at
    short notice.
  • The V/CDE team (George Komatsoulis, Hong Dang,
    Tommie Curtis, Brian Davis, Mike Keller) for
    providing us all the expertise and resources for
    the CDE creation exercise.
  • Niket Parikh from BAH for help with EA issues.
  • All members of the Genome Annotation SIG for
    their valuable contributions.

23
QUESTIONS???