Title: Collaborative Computing Project N
1. Collaborative Computing Project N
A Data Model for Macromolecular NMR
Rasmus Fogh, University of Cambridge
John Ionides, European Bioinformatics Institute
Wayne Boucher, Azara Inc.
Ernest Laue, University of Cambridge
Funded by the UK Biotechnology and Biological Sciences Research Council
2. CCPN activities
CCPN is a non-profit body serving the NMR community
- Workshops and seminars
- New analysis and processing software
- Distribution of software
- Data Model for NMR
Source code distributed for free to non-commercial users
3. Contents of this talk
- What is the problem?
- How do we propose to solve it?
4. Problem: Exchanging data
- Many different programs, all with different data formats
- Conversion programs are messy
- Exchange is one-way only
- You cannot switch back and forth between programs
Solution: common data format for exchange
5. Problem: Data harvesting
- More and more data must be deposited in databases
- All data are on computer but cannot be transferred directly
- Entering data in the deposition system is laborious and often skipped
Solution: common data format to carry all data along
6. Problem: Closed Software
- Software is mostly closed and/or proprietary
- You rarely have source code access
- Writing new routines requires having your own program
- Writing macros ties you to one single program
Partial solution: common data format for uniform data access
7. Summary
Problems are:
- Data exchange
- Data harvesting
- Access to program data
Solution is: CCPN Common Data Model
Once the format is implemented, that is all you will need to know (the data model is part of the machinery)
8. The CCPN Solution
- One central data model,
- supporting several data formats,
- with APIs (Application Program Interfaces) for several languages,
- using automatic generation of code and documentation.
9. Solution should be
- All-comprehensive, but simple and understandable
- Human readable, but precisely defined and checkable on computer
- Stable, but easy to modify and maintain
10. Contents, part 2
- Data Formats
- Data Model
- Application Program Interface
- Framework and code generation
11. How we work
- Small core group for development
- All contributions welcome
- Widest possible consultation with:
  - NMR program developers
  - Instrument manufacturers
  - Databases (BioMagResBank, PDB, ...)
12. Data Format
- A precise specification of how to write data to a file
  E.g. PDB: ATOM 132 1HB TYR 108 A
- Data formats can be hard to change once made
- You can convert between formats provided the underlying data model is the same
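The point about conversion can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `Atom` record, the attribute names, and both file layouts are hypothetical, and the text line is a simplified whitespace-separated form, not the real fixed-column PDB layout. Because both formats carry the same fields, conversion in either direction loses nothing.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Atom:
    """Hypothetical in-memory record: the data model shared by both formats."""
    serial: int
    name: str
    residue_name: str
    residue_number: int


def to_text_line(atom: Atom) -> str:
    # Simplified whitespace-separated line (not the real PDB column layout).
    return f"ATOM {atom.serial} {atom.name} {atom.residue_name} {atom.residue_number}"


def from_text_line(line: str) -> Atom:
    _, serial, name, residue_name, residue_number = line.split()
    return Atom(int(serial), name, residue_name, int(residue_number))


def to_xml(atom: Atom) -> str:
    # Hypothetical XML layout: one element with one attribute per field.
    element = ET.Element("atom", {
        "serial": str(atom.serial),
        "name": atom.name,
        "residueName": atom.residue_name,
        "residueNumber": str(atom.residue_number),
    })
    return ET.tostring(element, encoding="unicode")


def from_xml(text: str) -> Atom:
    element = ET.fromstring(text)
    return Atom(
        int(element.get("serial")),
        element.get("name"),
        element.get("residueName"),
        int(element.get("residueNumber")),
    )


# Text -> model -> XML -> model -> text round trip.
atom = from_text_line("ATOM 132 1HB TYR 108")
round_tripped = from_xml(to_xml(atom))
```

Both readers and writers go through `Atom`, so adding a third format would need only one new reader/writer pair rather than a converter for every pair of formats.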
13. Target Data Formats
- XML
    <residue name="TYR" SEQID="108">
      <atom name="HB1"> </atom>
    </residue>
- NMR-STAR
- SQL database
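As a rough illustration of why XML is a convenient target, the fragment shown on this slide can be read with Python's standard library alone. The element and attribute names are taken from the slide example; nothing here is meant to reflect the actual CCPN XML schema.

```python
import xml.etree.ElementTree as ET

# The XML fragment from the slide, written out as a string.
fragment = '<residue name="TYR" SEQID="108"><atom name="HB1"></atom></residue>'

residue = ET.fromstring(fragment)
residue_name = residue.get("name")
sequence_id = int(residue.get("SEQID"))
atom_names = [atom.get("name") for atom in residue.findall("atom")]

print(residue_name, sequence_id, atom_names)  # prints: TYR 108 ['HB1']
```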
14. Data Model
What kind of data are there and how do they relate? E.g.
- Each crosspeak belongs to one spectrum.
- Each spectrum has a temperature.
- Each spectrum corresponds to one sample.
- Each sample has a pH.
Not straightforward to make (e.g. pH titration series).
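The four example relations can be written down as a small Python sketch. The class and attribute names are hypothetical and chosen only to mirror the slide; the temperature unit (kelvin) and the sample values are also assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(eq=False)
class Sample:
    name: str
    ph: float  # each sample has a pH


@dataclass(eq=False)
class Spectrum:
    name: str
    temperature: float  # each spectrum has a temperature (kelvin assumed)
    sample: Sample      # each spectrum corresponds to one sample
    crosspeaks: List["CrossPeak"] = field(default_factory=list)

    def new_crosspeak(self, position: Tuple[float, ...]) -> "CrossPeak":
        # Creating peaks through the spectrum keeps both link directions in sync.
        peak = CrossPeak(position=tuple(position), spectrum=self)
        self.crosspeaks.append(peak)
        return peak


@dataclass(eq=False)
class CrossPeak:
    position: Tuple[float, ...]
    spectrum: Spectrum = field(repr=False)  # each crosspeak belongs to one spectrum


sample = Sample("sample1", ph=6.5)
spectrum = Spectrum("HSQC", temperature=298.0, sample=sample)
peak = spectrum.new_crosspeak((8.21, 120.5))

# Navigate from a crosspeak to the pH of the sample it was measured on.
ph_for_peak = peak.spectrum.sample.ph
```

The sketch also shows where such a model gets difficult: with a single `ph` attribute per sample, a pH titration series would need either one sample object per titration point or a different place to store the pH.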
15. Data Model cont.
If data models differ, conversion cannot help.
E.g. 1 assignment per dimension:
27 ALA HN
26 LYS HA
Multiple assignments per dimension: ???
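A short Python sketch of this mismatch, under stated assumptions: the two dictionary layouts are hypothetical stand-ins for two programs' data models, and the second candidate assignment (31 GLY HN) is invented purely for illustration. Converting from the restrictive model to the permissive one always works; the reverse has no faithful answer once a dimension carries more than one assignment.

```python
from typing import Dict, List, Tuple

Assignment = Tuple[int, str, str]  # (residue number, residue name, atom name)

# Model A: exactly one assignment per dimension (the example from the slide).
peak_a: Dict[int, Assignment] = {1: (27, "ALA", "HN"), 2: (26, "LYS", "HA")}

# Model B: any number of assignments per dimension.
# The second entry for dimension 1 is an invented, ambiguous candidate.
peak_b: Dict[int, List[Assignment]] = {
    1: [(27, "ALA", "HN"), (31, "GLY", "HN")],
    2: [(26, "LYS", "HA")],
}


def a_to_b(peak: Dict[int, Assignment]) -> Dict[int, List[Assignment]]:
    # Always possible: wrap each single assignment in a list.
    return {dim: [assignment] for dim, assignment in peak.items()}


def b_to_a(peak: Dict[int, List[Assignment]]) -> Dict[int, Assignment]:
    # Only possible when every dimension happens to have exactly one assignment.
    result = {}
    for dim, candidates in peak.items():
        if len(candidates) != 1:
            raise ValueError(
                f"dimension {dim} has {len(candidates)} assignments; "
                "model A cannot represent this"
            )
        result[dim] = candidates[0]
    return result


converted = a_to_b(peak_a)
try:
    b_to_a(peak_b)
    conversion_failed = False
except ValueError:
    conversion_failed = True
```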
16. Data Model Scope
For instance: raw and processed data, crosspeaks, assignment, structure, molecules, intermediate assignment information, notes and labels, MQ and reduced-dimensionality spectra, J- and other splittings, titration series, LC-NMR, molecules in exchange, ...
17. API (Application Program Interface)
- Programs interact with all data through the API
- Clearly defined set of subroutine calls
- File storage can change: SQL, XML, NMR-STAR
- What is in memory can change: is a shift stored, or calculated when needed?
- Subroutine call remains the same:
    myshift = myatom.getChemicalShift()
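The stored-versus-calculated point can be sketched in Python as follows. Only the call `getChemicalShift()` comes from the slide; the two classes, their constructors, and the averaging rule are hypothetical and are not meant to describe how the CCPN API works internally.

```python
class StoredShiftAtom:
    """Hypothetical implementation that keeps the shift in memory."""

    def __init__(self, name, shift):
        self.name = name
        self._shift = shift

    def getChemicalShift(self):
        return self._shift


class CalculatedShiftAtom:
    """Hypothetical implementation that derives the shift on demand,
    here as the mean of the assigned peak positions (an assumed rule)."""

    def __init__(self, name, peak_positions):
        self.name = name
        self._peak_positions = list(peak_positions)

    def getChemicalShift(self):
        return sum(self._peak_positions) / len(self._peak_positions)


# The calling code is identical for both implementations.
shifts = []
for myatom in (StoredShiftAtom("HN", 8.25), CalculatedShiftAtom("HN", [8.0, 8.5])):
    myshift = myatom.getChemicalShift()
    shifts.append(myshift)
```

A program written against the call alone keeps working if the implementation behind it is swapped, which is the property the slide asks of the API.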
18. Data Model Specification
- One single master model
- Modelling in UML (Unified Modeling Language), a software industry standard with many tools available.
- APIs, I/O routines, documentation, (editors, ...) autogenerated from the UML master model
- Data formats: XML, NMR-STAR, SQL (...)
- APIs for Python, C, (Java, Fortran, ...)
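To give a feel for what autogeneration from a master model means, here is a toy Python sketch. It is an illustration under heavy assumptions: a plain dictionary stands in for the UML model, the generated getter/setter naming is invented, and the real CCPN framework is certainly more elaborate than this.

```python
# A plain dictionary standing in for the master model (assumption: the real
# model is held in UML, not in a Python dictionary).
MODEL = {
    "Sample": {"name": str, "ph": float},
    "Spectrum": {"name": str, "temperature": float},
}


def generate_class_source(class_name, attributes):
    """Return Python source for a class with type-checked getters and setters."""
    params = ", ".join(attributes)
    lines = [
        f"class {class_name}:",
        '    """Generated from the model; do not edit by hand."""',
        f"    def __init__(self, {params}):",
    ]
    for attr in attributes:
        lines.append(f"        self.set{attr.capitalize()}({attr})")
    for attr, attr_type in attributes.items():
        type_name = attr_type.__name__
        lines += [
            f"    def get{attr.capitalize()}(self):",
            f"        return self._{attr}",
            f"    def set{attr.capitalize()}(self, value):",
            f"        if not isinstance(value, {type_name}):",
            f"            raise TypeError('{attr} must be {type_name}')",
            f"        self._{attr} = value",
        ]
    return "\n".join(lines)


# Generate and load one class per model entry.
namespace = {}
for class_name, attributes in MODEL.items():
    exec(generate_class_source(class_name, attributes), namespace)

Sample = namespace["Sample"]
sample = Sample("sample1", 6.5)
```

Changing an entry in `MODEL` and regenerating updates every accessor and its validity check at once, which is the maintenance benefit of keeping one master model.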
20. Current Status
- Version 1 framework and autogeneration
- Version 1 Python API
- XML input/output
- Core data model (spectra, crosspeaks, assignment, PDB format data)
To be presented for detailed comments in November at EBI, Hinxton
21. For Details See
http://www.bio.cam.ac.uk/nmr/ccp