Title: Overview and Progress of VBI Proteomics Database Project
Xianfeng Jeff Chen, Chengdong Zhang, Ronald
Kenyon, Dana Eckart, Oswald Crasta, and Bruno
Sobral
Abstract VBI is part of the Administrative
Resource Center for the Biodefense Proteomics
Research Program, funded through the NIH/NIAID
proteomics initiative with the goal of improving
defense against bioterrorism and
emerging/reemerging infectious diseases. The VBI
project team has prototyped a publicly accessible
database system and web interface at
http://proteinbank.vbi.vt.edu, which supports
proteomics data management and data
communication. The backend of this website is an
Oracle-based relational database with multiple
schemas modeled on four data types:
administration, 2D gel, mass spectrometry, and
yeast two-hybrid system.
Three phases of database design have been
initiated: (1) a process-oriented design for the
datasets generated by each PRC site, (2)
normalization and consolidation of the data into
a single schema to remove information redundancy,
and (3) a multi-layered database architecture
with physical, logical, and stored application
layers as the final production instance. The
mature VBI production proteomics database will
therefore apply a generic key-value-pair design
as the physical layer, the process-oriented
design implemented as views/materialized views as
the logical layer, and stored procedures for
backend data processing as the application layer
(Fig. 6).
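The three layers described above can be illustrated with a minimal sketch. It assumes SQLite in place of Oracle, and the table, view, attribute names, and sample values are all hypothetical. Because SQLite has no stored procedures, a plain Python function stands in for the application layer. This is not the project's actual schema, only one way a key-value-pair table can be exposed through a process-oriented view.

```python
import sqlite3

# Assumption: SQLite stands in for Oracle; all names and values are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Physical layer: one generic key-value-pair table.
    CREATE TABLE kv_store (
        entity_id INTEGER NOT NULL,
        attr_key  TEXT    NOT NULL,
        attr_val  TEXT,
        PRIMARY KEY (entity_id, attr_key)
    );

    -- Logical layer: a process-oriented view that pivots the pairs
    -- back into one row per 2D gel spot.
    CREATE VIEW gel_spot AS
    SELECT entity_id AS spot_id,
           MAX(CASE WHEN attr_key = 'gel_name'
                    THEN attr_val END) AS gel_name,
           MAX(CASE WHEN attr_key = 'iso_point'
                    THEN CAST(attr_val AS REAL) END) AS iso_point,
           MAX(CASE WHEN attr_key = 'mass_kda'
                    THEN CAST(attr_val AS REAL) END) AS mass_kda
    FROM kv_store
    GROUP BY entity_id;
""")


def load_spot(conn, spot_id, attributes):
    """Application-layer stand-in for a stored procedure that loads one spot."""
    conn.executemany(
        "INSERT INTO kv_store (entity_id, attr_key, attr_val) VALUES (?, ?, ?)",
        [(spot_id, key, str(value)) for key, value in attributes.items()],
    )


# Illustrative values only, not project data.
load_spot(conn, 1, {"gel_name": "gel_A", "iso_point": 5.5, "mass_kda": 42.0})
load_spot(conn, 2, {"gel_name": "gel_A", "iso_point": 7.25, "mass_kda": 18.5})

rows = conn.execute(
    "SELECT spot_id, gel_name, iso_point, mass_kda FROM gel_spot ORDER BY spot_id"
).fetchall()
print(rows)
```

In this arrangement a new attribute needs only a new key in the physical table and a new column in the view, which is the usual motivation for a generic physical layer. In Oracle the view could be materialized to avoid repeating the pivot on every query.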
Introduction The National Institute of Allergy
and Infectious Diseases (NIAID), part of the
National Institutes of Health (NIH), has awarded
an $8.7 million contract to Social & Scientific
Systems, Inc. (SSS), with subcontracts to the
Virginia Bioinformatics Institute (VBI) and
Georgetown University (GU), to serve as
Administrative Resource Centers (ARC) for the
seven recently awarded Proteomics Research
Centers (PRCs) (Fig. 1). The contract supports
NIAID's goals to improve defense against
bioterrorism and emerging/reemerging infectious
diseases. Under this contract, the ARC team will
support the PRCs by developing and maintaining a
publicly accessible database and website that
contain the data and technology protocols
generated by each PRC site. SSS will take the
lead in project coordination, reagent/LIMS
management, and protocol/SOP development and
maintenance. GU will build data analysis tools
for the seven PRCs. VBI will build a centralized
relational database system to manage the
proteomics datasets generated by the PRCs
(Fig. 2).
Fig. 6. Three database instances with three
design logics at three phases.
Fig. 5. Three design logics and three phases of
VBI proteomics database design.
Three schemas have been deployed so far for
testing the database functionality. The first
is the administrative schema, which supports user
profiles, query history, user data uploading, and
document/file management. The second, called
Data_Repository, supports downloading of public
2D gel and MS proteomics data. The third stores
PRC and internal 2D gel and MS data in a highly
decomposed fashion (Fig. 7). Sample test datasets
from various sources are converted into a
standard format, then normalized and decomposed
into relational form for uploading into the
database (Fig. 8). Sample MS datasets from four
organisms (Escherichia coli, Mycobacterium
smegmatis, Saccharomyces cerevisiae, and Homo
sapiens) and 2D gel datasets from Arabidopsis
thaliana have been loaded for database
functionality testing. A total of 12 MS datasets
from these organisms, with 13,772 protein hits,
have been populated into this database schema. A
total of six 2D gel datasets, with a list of
2,936 spots, have been loaded into the
development database schema.
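The conversion step described above, where a flat dataset is decomposed into relational form before uploading, might look roughly like the following sketch. The tab-delimited layout, column names, accession placeholders, and scores are assumptions made for illustration. The actual standard format used by the project is not specified here.

```python
import csv
import io

# Hypothetical flat export: one line per protein hit, with the dataset and
# organism details repeated on every line. Values are placeholders.
FLAT_TEXT = (
    "dataset\torganism\taccession\tscore\n"
    "ms_run_01\tEscherichia coli\tACC_0001\t87.5\n"
    "ms_run_01\tEscherichia coli\tACC_0002\t64.0\n"
    "ms_run_02\tSaccharomyces cerevisiae\tACC_0003\t91.25\n"
)


def decompose(flat_text):
    """Split flat rows into organism, dataset, and protein-hit tables."""
    organisms = {}  # organism name -> organism_id
    datasets = {}   # dataset name  -> (dataset_id, organism_id)
    hits = []       # (dataset_id, accession, score)
    reader = csv.DictReader(io.StringIO(flat_text), delimiter="\t")
    for row in reader:
        # Store each organism once and reuse its id afterwards.
        organism_id = organisms.setdefault(row["organism"], len(organisms) + 1)
        # Store each dataset once, linked to its organism by id.
        if row["dataset"] not in datasets:
            datasets[row["dataset"]] = (len(datasets) + 1, organism_id)
        dataset_id = datasets[row["dataset"]][0]
        hits.append((dataset_id, row["accession"], float(row["score"])))
    return organisms, datasets, hits


organisms, datasets, hits = decompose(FLAT_TEXT)
print(organisms)
print(datasets)
print(hits)
```

Each organism and dataset name is stored once and referenced by id from the hit rows, which is the redundancy removal that normalization aims for.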
Fig. 2. Responsibilities of administrative
centers.
Fig. 1. NIAID proteomics research program.
Progress of NIAID-funded VBI proteomics database
project The VBI project team has prototyped a
proteomics database with a web interface to
facilitate storage, visualization, and analysis
of proteomics datasets generated by the PRCs
(Fig. 3). The system employs two data storage
systems, an Oracle-based relational database and
a networked file server, to facilitate data
queries and quick data downloading/uploading
(Fig. 4). The system currently contains prototype
functionality for account/document management,
data query, data downloading, and data
uploading/submission.
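One plausible division of labor between the two storage systems is sketched below: bulk files live on the file server, while the database keeps searchable metadata and the file location. This is an assumption about how such a pairing commonly works, not a description of the VBI implementation. SQLite and a temporary directory stand in for Oracle and the networked file server, and the table name, columns, and checksum step are hypothetical.

```python
import hashlib
import sqlite3
import tempfile
from pathlib import Path


def store_file(conn, file_root, name, payload):
    """Write the bytes to the file store and record metadata in the database."""
    digest = hashlib.sha256(payload).hexdigest()
    path = Path(file_root) / name
    path.write_bytes(payload)
    conn.execute(
        "INSERT INTO data_file (name, path, size_bytes, sha256) VALUES (?, ?, ?, ?)",
        (name, str(path), len(payload), digest),
    )
    return digest


def fetch_file(conn, name):
    """Look up the path in the database, then read the bytes from the file store."""
    row = conn.execute(
        "SELECT path, sha256 FROM data_file WHERE name = ?", (name,)
    ).fetchone()
    if row is None:
        raise KeyError(name)
    payload = Path(row[0]).read_bytes()
    # Assumed integrity check: compare against the checksum stored at upload.
    if hashlib.sha256(payload).hexdigest() != row[1]:
        raise ValueError("checksum mismatch for " + name)
    return payload


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE data_file ("
    "name TEXT PRIMARY KEY, path TEXT NOT NULL, "
    "size_bytes INTEGER NOT NULL, sha256 TEXT NOT NULL)"
)

with tempfile.TemporaryDirectory() as file_root:
    store_file(conn, file_root, "run_01.txt", b"example spectrum data\n")
    round_trip = fetch_file(conn, "run_01.txt")

print(round_trip)
```

Queries run against the small metadata table, and a download reads the file directly from disk, so large raw files never pass through the relational database.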
Fig. 8. Data flow of proteomics data management,
data reformatting, and data standardization.
Fig. 7. An example schema of the VBI 2D gel and
mass spectrometry database, modeled using Erwin
based on the Pedro class diagram.
Reference Chris F. Taylor, Norman W. Paton, Kevin
L. Garwood, Paul D. Kirby, David A. Stead,
Zhikang Yin, Eric W. Deutsch, Laura Selway, Janet
Walker, Isabel Riba-Garcia, Shabaz Mohammed,
Michael J. Deery, Julie A. Howard, Tom Dunkley,
Ruedi Aebersold, Douglas B. Kell, Kathryn S.
Lilley, Peter Roepstorff, John R. Yates III, Andy
Brass, Alistair J.P. Brown, Phil Cash, Simon J.
Gaskell, Simon J. Hubbard and Stephen G. Oliver
(2003) A systematic approach to modeling,
capturing, and disseminating proteomics
experimental data. Nature Biotechnology 21,
247-254.
Acknowledgement We thank NIH/NIAID and
SSS for the subcontract awarded to Dr. Bruno
Sobral for this VBI proteomics database project.
Fig. 4. VBI computing system architecture.
Fig. 3. VBI proteomics database web interface.
The backend of this dynamically generated,
queryable interface is an Oracle-based relational
database with multiple schemas. At this point
they are modeled on the data types of
administration, 2D gel, mass spectrometry (MS),
and yeast two-hybrid system (Y2H) in a
process-oriented fashion based on the Pedro class
diagram (Taylor et al., 2003). Three database
instances have been created as our database
development environment: development,
test/stage, and production (Fig. 5).