Title: Enabling Collaboration
1 Enabling Collaboration
http://www.pdb.org/
info@rcsb.org
The PDB is supported by funds from the NSF, DOE, and two units of the NIH: the NIGMS and NLM.
2 Current PDB Management
- Research Collaboratory for Structural Bioinformatics
- Director
- Helen M. Berman, Rutgers University
- Co-Directors
- Phil Bourne, UCSD/SDSC
- Gary Gilliland, CARB/NIST
- John Westbrook, Rutgers University
- Although the R in RCSB stands for "Research", the PDB is purely a production service project
3 PDB Deposition and Distribution Sites
In place
Planned
- SDSC, Rutgers, NIST, BMRB
- Cambridge Crystallographic Data Centre, UK
- National University of Singapore, Singapore
- Osaka University, Japan
- Universidade Federal de Minas Gerais, Brazil
- Max-Delbrück-Center, Germany
4 Some Goals Related to Collaboration
- Seamless data exchange among structure determination applications, databases, and deposition pathways
- Timely distribution and enabling complete analysis of macromolecular structure data by remote Internet users and applications
5 Requirements
- Consensus on well-defined metadata specifications for all exchanged data
- Well-integrated software supporting the data specifications
- APIs to deliver data
6 Data Sharing Nightmare
7 Ontology Content
- http://deposit.pdb.org/mmcif/
- NMR
- Modeling
- Crystallization
- Symmetry
- Image data
- BIOSYNC
- Data harvesting/pipelining
- PDB data exchange
- Including structural genomics extensions
8 What's Driving Data Definition
- IUCr-sponsored community effort (1989 onward)
- Data harvesting
- Data management and exchange for PDB
- High-throughput structure determination and structural genomics (via International Task Forces)
- Data deposited will be at the level of a journal materials and methods section
- Each data item must be carefully defined in the PDB exchange data dictionary
- Additional description of X-ray, NMR experiments and new items describing protein production (3x increase in content scope)
9 Elements of Ontology Metadata
- Data attributes
- Definition
- Examples
- Data type (primitive type/regular expression patterns)
- Range or allowed values
- Classes
- Categories
- Subcategories
- Category groups
- Associations
- Parent-child relationships
- Interdependencies/exclusivity
- Methods
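The metadata elements listed above can be sketched as a small data structure. This is only an illustration, not the PDB exchange dictionary's own machinery: the class name `ItemDefinition`, its fields, and the pattern and range chosen for `_cell.length_a` are assumptions made for the example.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class ItemDefinition:
    """Hypothetical dictionary entry for one data attribute."""
    name: str                        # attribute name, e.g. "_cell.length_a"
    category: str                    # class (category) the attribute belongs to
    definition: str                  # human-readable definition
    pattern: str                     # regular expression describing the data type
    minimum: Optional[float] = None  # lower bound of the allowed range, if any
    maximum: Optional[float] = None  # upper bound of the allowed range, if any

    def validate(self, value: str) -> bool:
        """Return True if value matches the type pattern and the range."""
        if re.fullmatch(self.pattern, value) is None:
            return False
        if self.minimum is None and self.maximum is None:
            return True
        number = float(value)  # a range is only meaningful for numeric types
        if self.minimum is not None and number < self.minimum:
            return False
        if self.maximum is not None and number > self.maximum:
            return False
        return True


# Illustrative definition; the pattern and bound are assumptions.
LENGTH_A = ItemDefinition(
    name="_cell.length_a",
    category="cell",
    definition="Unit-cell length a, in angstroms.",
    pattern=r"[0-9]+(\.[0-9]+)?",
    minimum=0.0,
)

print(LENGTH_A.validate("101.362"))
print(LENGTH_A.validate("abc"))
```

Associations (parent-child links between attributes) and methods are left out to keep the sketch short.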
10 Ontology Representation
- Macromolecular Crystallographic Information Framework
- XML Schema Mapping
- Other emerging ontology representations: DAML+OIL, RDFS, OWL
- Many difficult issues
- resolving semantic ambiguities, encoding meaning
- integrating multiple views and controlled vocabularies
- separation of primary and derived information
- supporting rapid evolution and changing scientific perspectives
- including detailed process modeling
11 Incremental Data Pipeline
Target Tracking
Incremental Assembly
12 Current Collection and Integration Strategy (If it's not electronic, you probably won't get it)
- Collect status data on the progress of each target
- Collect bits of output from each program step
- Work with software developers to optimize this data component
- Merge the data from each step into a common representation
- Use an editing tool to enter remaining data and check results
- Make all data files available in the representation of the exchange dictionary (beta -- ftp://beta.rcsb.org)
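The "merge the data from each step" item can be sketched as follows. This is a minimal illustration under assumptions: the function name `merge_steps`, the step names, and the sample values are hypothetical, and a real pipeline would need a policy for resolving conflicts rather than just reporting them.

```python
def merge_steps(steps):
    """Merge per-step item dictionaries into one record.

    steps is a list of (step_name, items) pairs in pipeline order.
    The first value seen for an item is kept; a later, different value
    is recorded as a conflict for someone to check by hand.
    """
    merged = {}
    conflicts = []
    for step_name, items in steps:
        for key, value in items.items():
            if key in merged and merged[key] != value:
                conflicts.append((key, merged[key], value, step_name))
            else:
                merged[key] = value
    return merged, conflicts


# Hypothetical output captured from two program steps.
STEPS = [
    ("data_collection", {"_cell.length_a": "101.362",
                         "_diffrn.ambient_temp": "100"}),
    ("refinement", {"_cell.length_a": "101.362",
                    "_refine.ls_R_factor_R_work": "0.198"}),
]

merged, conflicts = merge_steps(STEPS)
print(sorted(merged))
print(conflicts)
```

Items still missing after the merge are the ones the editing tool would be used to enter.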
13 Software Integration Tools
http://deposit.pdb.org/software
http://deposit.pdb.org/mmcif
- Standalone data input tool creates and edits files in PDB exchange format
- PDB validation suite checks data in exchange format
- Tools to extract and translate data from program output files into exchange format
- Format exchange (PDB, XML conversion tools)
- C, C++, and Perl tools to parse and manage mmCIF
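To give a sense of what parsing mmCIF involves, here is a deliberately small Python sketch that reads only single-line `_category.item value` pairs. It is not one of the tools listed above. The function name `parse_simple_items` and the sample text are assumptions, and loops, multi-line text fields, and the finer points of mmCIF quoting are ignored (shell-style quoting from `shlex` is used as an approximation).

```python
import shlex


def parse_simple_items(text):
    """Collect '_category.item value' pairs from mmCIF-like text."""
    items = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("_"):
            continue  # skip data_ headers, loop_ keywords, and blank lines
        tokens = shlex.split(line)  # keeps quoted values as one token
        if len(tokens) == 2:
            items[tokens[0]] = tokens[1]
    return items


# Illustrative input; the entry ID is a placeholder.
SAMPLE = """
data_RCSB000000
_cell.entry_id   RCSB000000
_cell.length_a   101.362
_symmetry.space_group_name_H-M 'P 21 21 2'
"""

items = parse_simple_items(SAMPLE)
print(items["_cell.length_a"])
print(items["_symmetry.space_group_name_H-M"])
```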
14 Typical Project Deposition Data Flow
Target Selection
Crystal Production
Protein Production
Project Database
Structure Determination
Merged Project Data
Exchange Dictionary
PDB Deposition
15 Target Registration Database (a new form of sharing)
TargetDB: http://targetdb.pdb.org/
- All targets downloadable in XML (17,950 targets)
- Targets downloaded from 13 centers weekly
- Target search by sequence (FASTA), project target ID, project site, status (selected, cloned, expressed, in PDB), update date, protein name, source organism
- Report output in HTML, FASTA, and XML
- Integrates sequences from PDB entries (41,000 sequences including 700 pre-release sequences)
- Provides links to related sequence databases
- Open to all Structural Genomics projects
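The search-and-report behavior described above can be illustrated with a short sketch over in-memory records. Nothing here reflects how TargetDB is implemented: the record layout, the function names `search_targets` and `to_fasta`, and the sample targets are all hypothetical.

```python
def search_targets(targets, status=None, lab=None):
    """Return the targets that match every criterion that is not None."""
    hits = []
    for target in targets:
        if status is not None and status not in target["status"]:
            continue
        if lab is not None and target["lab"] != lab:
            continue
        hits.append(target)
    return hits


def to_fasta(targets):
    """Render targets as FASTA: a '>' header line, then the sequence."""
    lines = []
    for target in targets:
        lines.append(f">{target['id']} {target['lab']}")
        lines.append(target["sequence"])
    return "\n".join(lines)


# Made-up records for illustration only.
TARGETS = [
    {"id": "EX0001", "lab": "Center A",
     "status": ["Selected", "Cloned"], "sequence": "MKTAYIAKQR"},
    {"id": "EX0002", "lab": "Center B",
     "status": ["Selected"], "sequence": "MSDNELLKKA"},
]

report = to_fasta(search_targets(TARGETS, status="Cloned"))
print(report)
```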
16 Application Level Distribution
- CORBA specification adopted by OMG in February 2001
- Based on the PDB exchange data ontology
- Provides high performance access
- Direct access to binary data structures
- Broad granularity of access (individual atoms to biological assemblies)
17 CORBA Implementation
- OpenMMS provides a Java-only toolkit that creates XML, CORBA, and relational DB representations of the PDB data ontology
- Allows programmers to more easily create efficient, high-performance, and robust applications that use PDB data
- Provides database-to-database interoperability
- C++ server under development
- Code and examples available at http://openmms.sdsc.edu/
18 API Development
- EJB
- 60 entities developed
- LSID
- In collaboration with I3C
- Coarse-grained SOAP access to PDB and TargetDB data
- SOAP API
- Fine-grained SOAP access modeled on the CORBA specification
- Reuses C++ CORBA server
- Direct SQL
- Problems: large investment in robust production support for potentially short-lived technology
19 Access
- PDB SDSC Access Site
- http://www.pdb.org/
- PDB Deposition Sites
- http://autodep.ebi.ac.uk/
- http://pdbdep.protein.osaka-u.ac.jp/adit/
- http://pdb.rutgers.edu/adit/
- PDB Software Download Site
- http://pdb.rutgers.edu/software/
- PDB mmCIF Resource Site
- http://pdb.rutgers.edu/mmcif/
- mmCIF Beta Data Site
- ftp://beta.rcsb.org/pub/pdb/uniformity/data/mmCIF/
20 PDB Project Team
Director: Helen M. Berman (Rutgers)
Co-Directors: John Westbrook (Rutgers), Phil Bourne (UCSD/SDSC), Gary Gilliland (NIST)
Rutgers: Anthony Adelakun, Kyle Burkhardt, Li Chen, Sharon Cousin, Zukang Feng, Lisa Iype, Shri Jain, Jessica Marvin, Rose Oughtred, Gnanesh Patel, Tania Rose Posa, Suzanne Richman, Bohdan Schneider (Prague), Olivera Tosic, Rosalina Valera, Christine Zardecki
NIST: T.N. Bhat, Phoebe Fagan, Veerasamy Ravichandran, Michael Tung, Greg Vasquez, Padma Priya Paragi Vedanthi
UCSD/SDSC: David Archbell, Peter Arzberger, Bryan Banister, Tammy Battistuz, Wolfgang F. Bluhm, Eliot Clingman, Nita Deshpande, Ward Fleri, Douglas S. Greer, David Padilla, Thomas Solomon, David Stoner, Peggy Wagner
21 Sequence Target DTD
TargetDB - http://targetdb.pdb.org
<!ELEMENT targets (target)>
<!ELEMENT target (id, lab, date, status, sequence, name?, url, remark)>
<!-- required data items -->
<!-- any lab specified id -->
<!ELEMENT id (#PCDATA)>
<!-- any lab specified id -->
<!ELEMENT lab (#PCDATA)>
<!-- most recent update. format: YYYY-MM-DD -->
<!ELEMENT date (#PCDATA)>
<!-- status. One or more of the following descriptive terms:
     Selected, Cloned, Expressed, Soluble, Purified, Crystallized,
     Diffraction-quality Crystals, Diffraction NMR, Assigned HSQC,
     Crystal Structure, NMR Structure, In PDB, Work Stopped, Other -->
<!ELEMENT status (#PCDATA)>
<!-- protein sequence in IUPAC 1-letter codes -->
<!ELEMENT sequence (#PCDATA)>
<!-- optional data items -->
<!-- any lab-specified name for the protein -->
<!ELEMENT name (#PCDATA)>
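A document following the DTD above could be read with Python's standard `xml.etree.ElementTree` module, as in this sketch. The sample record (ID, lab, sequence, URL) is made up, the function name `read_targets` is hypothetical, and splitting the status on commas is an assumption, since the DTD comment does not say how multiple terms are separated.

```python
import xml.etree.ElementTree as ET

# Made-up record; the optional <name> element is omitted on purpose.
SAMPLE = """<targets>
  <target>
    <id>EX0001</id>
    <lab>Example Center</lab>
    <date>2003-01-15</date>
    <status>Selected, Cloned</status>
    <sequence>MKTAYIAKQR</sequence>
    <url>http://example.org/EX0001</url>
    <remark>illustrative record</remark>
  </target>
</targets>"""


def read_targets(xml_text):
    """Return one dictionary per <target>, keyed by child element name."""
    root = ET.fromstring(xml_text)
    records = []
    for target in root.findall("target"):
        record = {child.tag: (child.text or "").strip() for child in target}
        # Assumption: multiple status terms are comma-separated.
        record["status"] = [term.strip()
                            for term in record.get("status", "").split(",")
                            if term.strip()]
        records.append(record)
    return records


records = read_targets(SAMPLE)
print(records[0]["id"], records[0]["status"])
```

`ElementTree` does not validate against a DTD, so this only reads the structure; checking conformance would need a validating parser.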
22 Diverse Delivery Options using a Common Data Dictionary
CRYST1  101.362  114.722   45.591  90.00  90.00  90.00 P 21 21 2    20
<mmCIFGROUP.cell_group>
  <mmCIFCATEGORY.cell>
    <mmCIFcell entry_id="RCSB000000">
      <length_a>101.362</length_a>
      <length_b>114.722</length_b>
      <length_c>45.591</length_c>
      <angle_alpha>90.00</angle_alpha>
      <angle_beta>90.00</angle_beta>
      <angle_gamma>90.00</angle_gamma>
      <Z_PDB>20</Z_PDB>
    </mmCIFcell>
  </mmCIFCATEGORY.cell>
</mmCIFGROUP.cell_group>
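The two renderings above carry the same cell data, so one can be derived from the other. The sketch below slices a CRYST1 record by the fixed columns of the PDB file format and emits an XML element using the item names from the fragment above. The function names are hypothetical, the wrapper elements are omitted, and the plain `cell` tag is a simplification of the element name shown on the slide.

```python
import xml.etree.ElementTree as ET

CRYST1 = ("CRYST1  101.362  114.722   45.591"
          "  90.00  90.00  90.00 P 21 21 2    20")

XML_ITEMS = ("length_a", "length_b", "length_c",
             "angle_alpha", "angle_beta", "angle_gamma", "Z_PDB")


def parse_cryst1(record):
    """Slice a CRYST1 record using the PDB format's fixed columns."""
    line = record.ljust(70)  # pad so short records slice safely
    return {
        "length_a": line[6:15].strip(),
        "length_b": line[15:24].strip(),
        "length_c": line[24:33].strip(),
        "angle_alpha": line[33:40].strip(),
        "angle_beta": line[40:47].strip(),
        "angle_gamma": line[47:54].strip(),
        "space_group": line[55:66].strip(),
        "Z_PDB": line[66:70].strip(),
    }


def cell_to_xml(cell, entry_id):
    """Serialize the cell items as one XML element (simplified tag name)."""
    element = ET.Element("cell", {"entry_id": entry_id})
    for item in XML_ITEMS:
        ET.SubElement(element, item).text = cell[item]
    return ET.tostring(element, encoding="unicode")


cell = parse_cryst1(CRYST1)
xml_text = cell_to_xml(cell, "RCSB000000")
print(xml_text)
```

Values are kept as strings so that the precision written in the source record is preserved in the XML.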