Title: Adapting an Existing Data Service to be caBIG Silver-level Compliant
Peter Hussey, LabKey Software, Inc., Seattle, WA, USA
Contact: peter_at_labkey.com
Abstract
The National Cancer Institute's caBIG initiative
aims for interoperability of bioinformatics
applications. caBIG envisions that this will be
achieved by encouraging all applications to
implement a standard programming interface and to
register their terms and data objects with a
centralized service. The required programming
interface is essentially defined in terms of the
behavior of applications built using the caCORE
Software Development Kit (SDK). The caCORE SDK is
designed and documented for building a new
application from scratch. Little is documented on
how one might achieve caBIG silver-level
compliance in an application not built with the
caCORE SDK. This poster describes the caCORE SDK
development and build process and how the LabKey
team changed it to work with their existing
proteomics platform software. The LabKey/CPAS
solution creates a parallel web application that
supports the caBIG programming interface and
accesses LabKey/CPAS data through a SQL View
layer.
caCORE SDK Development Process
There are three phases in the caCORE development
paradigm:
- Create the UML model elements using the UML
  modeling tool. This is a painstaking task for any
  moderately complex real-world application. The
  application object model is essentially specified
  twice: as a UML Class model and as a UML Data
  model. The Class model corresponds to the objects
  in the application that a developer will
  ultimately use to access the data service. The
  Data model describes the implementation of those
  classes in a relational database. In most cases
  there is a single SQL table that corresponds to a
  single Class object. The data objects are linked
  together through a set of specific relationships
  and attribute values that must all match exactly,
  but are each specified and visible on separate
  property dialogs within Enterprise Architect.
  (Note: the 4.0 SDK has added a very useful
  validation step to the build process that should
  make it much easier to track down and fix
  inconsistencies and omissions in the UML models
  than what the LabKey/CPAS team experienced.)
  Figure 2 shows a small subset of the LabKey/CPAS
  UML model in a diagram that combines some of the
  class elements and the data elements.
- Register the classes and attributes of the UML
  model objects with NCI's Enterprise Vocabulary
  Services (EVS) and the Cancer Data Standards
  Repository (caDSR). The common data element
  identifiers resulting from this step are
  incorporated into the class model objects as
  additional tagged values.
- Run the SDK build process, creating three runtime
  entities from the model (Figure 2).
Challenges in Adapting an Existing Application to caCORE
Most large-scale, team-built applications are not
designed using an application generator approach.
LabKey/CPAS is one such application. Yet
LabKey/CPAS still needs to participate in the
interoperability of caBIG. For these situations,
the caCORE SDK can be used to generate a web
application that runs in parallel to an existing
application and exposes a caBIG silver-compliant
programming interface over the data managed by
the non-caCORE application. The main
prerequisite for this architecture is that the
data to expose is held in a relational database.
We also made the major simplification that the
caCORE-generated web application would expose
read-only interfaces, which is allowed and
appropriate for caBIG compliance. Within this
simplified target, we still encountered
difficulties around the following:
- SQL schema implementation differences from
  caCORE. The caCORE SDK makes several assumptions
  regarding the database schema that may not be
  true for an existing application:
  - A class in the object model to be exposed
    corresponds 1-to-1 with a table in the SQL schema.
  - The object identifier maps to a single integer
    primary key in the corresponding relational
    table.
  - A relationship between Class objects corresponds
    to a foreign key in the SQL tables.
- Security integration. An existing application
  will likely have some security implementation
  that logically should extend to the caBIG
  interface. The caCORE SDK, however, discusses
  only the implementation of security in a new
  application, not integration with an existing
  security model.
Introduction
In 2007, the LabKey/CPAS development team set out
to achieve caBIG silver-level compliance for
the MS2 proteomics data managed by CPAS, our
application used by several large cancer center
clients. Achieving compliance proved difficult
because caBIG compliance for a data service is
defined in terms of the behavior of applications
built with the caCORE SDK. LabKey/CPAS was not
designed or built with any reference to the
caCORE SDK. The caBIG compliance guidelines
suggest that building an application with the
caCORE SDK is just one possible implementation
of silver compliance. We found, however, no
precise definition of or test for caBIG silver
compliance, and in particular no statement of
which queries a silver-compliant service needs to
support. Our challenge became finding a way to
incorporate the caCORE runtime architecture into
our existing application with minimal impact on
existing code.
The LabKey/CPAS Solution
LabKey/CPAS resolved these challenges through the
creation of a SQL View layer. In our solution,
the UML Data model describes a virtual schema
that lives in a database schema named cabig. We
then created a set of SQL views with the same
names and same columns as the UML Data model. The
caCORE-generated web application interacts with
these views as if they were tables. The web
application cannot tell the difference. Under the
covers, the view layer passes the queries through
to the original base tables (managed by the
non-caCORE application) and fixes up the
differences along the way. We wrapped the cabig
view definition scripts into a new module of
LabKey/CPAS and included a small set of UI
changes that configure and test caBIG access
for a given folder.
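The view-layer idea can be sketched in a few lines. This is a minimal illustration, not the LabKey/CPAS scripts: it uses Python's built-in sqlite3 module, and the table, view, and column names (ms2_runs_base, cabig_ms2run, and so on) are hypothetical. SQLite has no named schemas of the kind used in the poster, so a cabig_ prefix stands in for the cabig schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical base table owned by the existing application.
    CREATE TABLE ms2_runs_base (
        run        INTEGER PRIMARY KEY,
        descr      TEXT,
        fasta_file TEXT
    );
    INSERT INTO ms2_runs_base VALUES (1, 'Sample A', 'human.fasta');
    INSERT INTO ms2_runs_base VALUES (2, 'Sample B', 'mouse.fasta');

    -- View exposing the names an (assumed) UML Data model expects.
    -- The generated application would query this as if it were a table.
    CREATE VIEW cabig_ms2run AS
        SELECT run        AS id,
               descr      AS description,
               fasta_file AS fastafilename
        FROM ms2_runs_base;
""")

# A client of the view never sees the base table's names.
rows = conn.execute(
    "SELECT id, description FROM cabig_ms2run ORDER BY id"
).fetchall()
print(rows)
```

Because the client only depends on the view's names, the base table can be renamed or restructured as long as the view definition is updated to match.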
The SDK build process creates three runtime
entities from the model:
- Database definition scripts, in the form of SQL
  CREATE TABLE commands.
- A web application that implements the UML Class
  model and can translate requests for objects into
  SQL commands.
- A set of programming interface libraries that
  enable applications to query, insert, update, and
  delete application objects over several different
  communication channels, including local Java
  applications and web service calls.
The caCORE Application Paradigm
The caCORE SDK is based on a software development
paradigm that starts with an abstract model of
the entities represented in a particular
application. Real-world examples of such entities
include identified peptides in an MS2 run or
microarray test results. Entities are usually
related to other entities in known ways. For
example, a single MS2 run entity must have one or
more FASTA databases and may have zero, one, or
more identified peptides. Generally, the
interesting entities in an application are those
stored in the database. There is often a close
correspondence between a row (record) in a SQL
table in the database used by an application and
an instance (single entity) of a class of similar
entities to be exposed by the application. The
caCORE SDK architecture is based largely on the
1-to-1 correspondence between an application
class and a SQL table.
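The 1-to-1 correspondence between a class and a table can be illustrated with a small sketch. It assumes hypothetical ms2run and peptide tables and matching Python dataclasses; the real caCORE SDK generates Java classes and Hibernate mappings, which this does not attempt to reproduce.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class MS2Run:
    id: int
    description: str


@dataclass
class Peptide:
    id: int
    sequence: str
    run_id: int  # relationship to MS2Run, stored as a foreign key


conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One table per class, one integer primary key per table.
    CREATE TABLE ms2run (id INTEGER PRIMARY KEY, description TEXT);
    CREATE TABLE peptide (
        id       INTEGER PRIMARY KEY,
        sequence TEXT,
        run_id   INTEGER REFERENCES ms2run(id)
    );
    INSERT INTO ms2run VALUES (1, 'Sample A');
    INSERT INTO peptide VALUES (10, 'LVNELTEFAK', 1), (11, 'AEFVEVTK', 1);
""")


def peptides_for_run(conn, run_id):
    """Return one Peptide instance per row that references the given run."""
    cur = conn.execute(
        "SELECT id, sequence, run_id FROM peptide WHERE run_id = ? ORDER BY id",
        (run_id,),
    )
    return [Peptide(*row) for row in cur]


# One row in ms2run becomes one MS2Run instance.
run = MS2Run(*conn.execute(
    "SELECT id, description FROM ms2run WHERE id = 1"
).fetchone())
peptides = peptides_for_run(conn, run.id)
print(run, peptides)
```

Each row maps to exactly one instance, and the run-to-peptide relationship is carried by a single foreign key column, which are the assumptions an existing schema may not satisfy.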
The view layer solves the issues described above:
- Security Integration. Since data access in
  LabKey/CPAS is granted on a folder-by-folder
  basis, we wanted to enable or disable caBIG
  access by folder. We added a single true/false
  caBIGPublished column to our existing
  core.Containers table. This bit is turned on and
  off by the Publish button accessible on a
  project's Permissions page. The corresponding
  Containers view in the cabig schema includes the
  restriction WHERE caBIGPublished = true. All of
  the other view definitions in the cabig schema
  include an inner join to the cabig.Containers
  view. As a result, the caBIG interface sees only
  data in those containers that have been
  published.
- Data Model Compliance. Most of the underlying
  CPAS tables have a single integer primary key,
  but a few had two-column integer keys. To meet
  caCORE's requirement for a single-column key,
  the SQL View definition combines the two columns
  into one value with a sum:
  SELECT ((4294967296 * op.propertyid) + op.objectid) AS id, ...
  As a second example, the PeptidesData table in
  CPAS is used to store score values from different
  search engines in generically named ScoreX
  columns. For caBIG, we chose to represent the
  scores for different engines as different objects
  (preserving the 1-to-1 paradigm). We handled this
  difference in the view layer by creating a view
  per search engine, with the appropriate filter.
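The two fixes described above, the published-container filter and the single-column key, can be sketched together. This is an illustration under assumptions rather than the actual cabig view scripts: it runs on SQLite through Python's sqlite3 module, the table layout and sample rows are invented, and a cabig_ prefix stands in for the cabig schema. Only the caBIGPublished flag, the inner join to the Containers view, and the 4294967296 multiplier come from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical base tables managed by the existing application.
    CREATE TABLE containers (
        entityid       INTEGER PRIMARY KEY,
        name           TEXT,
        cabigpublished INTEGER NOT NULL DEFAULT 0  -- true/false flag
    );
    CREATE TABLE objectproperty (
        objectid   INTEGER,
        propertyid INTEGER,
        container  INTEGER REFERENCES containers(entityid),
        value      TEXT,
        PRIMARY KEY (objectid, propertyid)         -- two-column key
    );
    INSERT INTO containers VALUES (1, 'Published folder', 1),
                                  (2, 'Private folder', 0);
    INSERT INTO objectproperty VALUES (7, 3, 1, 'visible'),
                                      (8, 3, 2, 'hidden');

    -- Only published containers are visible to the caBIG interface.
    CREATE VIEW cabig_containers AS
        SELECT entityid, name FROM containers WHERE cabigpublished = 1;

    -- Every other view joins to cabig_containers, and the two-column key
    -- is folded into one integer: propertyid * 2^32 + objectid.
    CREATE VIEW cabig_objectproperty AS
        SELECT (4294967296 * op.propertyid) + op.objectid AS id,
               op.value
        FROM objectproperty op
        INNER JOIN cabig_containers c ON c.entityid = op.container;
""")

rows = conn.execute("SELECT id, value FROM cabig_objectproperty").fetchall()

# The original key parts can be recovered from the combined id.
propertyid, objectid = divmod(rows[0][0], 2**32)
print(rows, propertyid, objectid)
```

The combined id is unique only while objectid stays below 2^32 and the product fits in the database's integer type, so this encoding suits 32-bit key columns.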
[Diagram labels: Search Application, Script apps, caCORE API, Client API]
Figure 2. The caCORE SDK build process
caCORE Runtime Architecture
In a software application based on the caCORE
design, developers write web pages and
program-to-program applications using the API
generated by the SDK build process. The web
application handles both read and write access to
the underlying SQL database in order to support
the creation and management of application
objects.
At the core of the generated caCORE web
application is Hibernate, an open source
middleware layer for mapping Java programming
objects into SQL table objects and vice versa
(Figure 3). The caCORE SDK build process
translates the UML model into configuration files
that allow Hibernate to construct complex queries
by translating relationships between objects into
SQL JOIN constructs. Hibernate allows
programmers to issue database queries in a simple
Query By Example (QBE) format. The use of Hibernate
in the caCORE runtime yields several benefits:
- It avoids mixing SQL commands into application
  code, a common source of bugs in web database
  applications.
- It is highly configurable, allowing the developer
  to tune the way Hibernate translates object
  access into SQL.
- It supports a standardized Hibernate Query
  Language (HQL) that looks like SQL but works
  unchanged across all supported relational
  databases, allowing the developer to issue more
  complex queries than can be expressed via the
  standard QBE mechanism.
The caCORE SDK allows a developer or analyst to
turn application model knowledge into a
working web database application that would
otherwise be very difficult and expensive to
build from scratch.
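To make the Query By Example idea concrete, the toy function below turns a partially filled example into a parameterized SQL query. It is not Hibernate's API and does not reflect how Hibernate generates SQL; build_qbe_query, the peptide table, and its columns are all hypothetical, and the sketch only shows the principle that populated fields become equality criteria.

```python
def build_qbe_query(table, columns, example):
    """Translate an example dict into (sql, params).

    Fields set to None are treated as "not specified" and ignored;
    every other field becomes an equality test joined with AND.
    """
    unknown = set(example) - set(columns)
    if unknown:
        raise ValueError(f"unknown columns: {sorted(unknown)}")

    criteria = {col: val for col, val in example.items() if val is not None}
    sql = f"SELECT {', '.join(columns)} FROM {table}"
    if criteria:
        # Values are passed as parameters, never spliced into the SQL text.
        sql += " WHERE " + " AND ".join(f"{col} = ?" for col in criteria)
    return sql, list(criteria.values())


# "Find peptides with charge 2", leaving the sequence unspecified.
sql, params = build_qbe_query(
    "peptide",
    ("id", "sequence", "charge"),
    {"sequence": None, "charge": 2},
)
print(sql, params)
```

The caller describes what it wants by filling in fields, and no SQL appears in application code, which is the benefit the first bullet above describes.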
[Diagram labels: SQL database, cabig views]
Figure 4. The caCORE implementation for CPAS
Conclusion
Our efforts to adapt our existing comprehensive
proteomics application to achieve caBIG silver
compliance proved successful once we decided on
the basic approach of running the caCORE
SDK-generated web application in parallel to
LabKey/CPAS. In our design, the SDK-generated
application accesses the relational data through
a set of views that handle some of the tricky
mapping and security problems. The views also act
as a buffer between the underlying base tables
and the web application, allowing names to change
in one place without affecting the other. In the
future, it will be relatively easy for LabKey to
expand the scope of our caBIG interface to
incorporate any data managed by LabKey Server. In
fact, putting the data into LabKey may well be
the fastest way for a developer to achieve silver
compliance for a data service, while at the same
time gaining many of the data analysis and
management features that are built into the
LabKey platform.
One of the design goals of the caCORE
architecture is to create an interoperability
standard that is not tied to a single programming
language. So in the caCORE development paradigm,
the developer describes objects and their
relationships in Unified Modeling Language
(UML). UML is a high-level, primarily graphical
approach to defining a programming project. UML
is implemented by a number of tools including
Enterprise Architect and ArgoUML, the two tools
supported by the current caCORE SDK (version
4.0). UML modeling, however, is only partly
standardized. It is difficult, for example, to
transfer a model between tools without losing
information in the transfer.
Figure 3. The caCORE runtime architecture