Title: Proteomics Database Discussion
1 Proteomics Database Discussion
- Raju Kucherlapati
- Harvard Medical School
2 Proteomics Information Lifecycle
Analytical Chemistry Lab
Collaboration
Publicly Available Data
Site Portal / LIMS Environment
Collaboration Data Management System
Experimental Data Repositories
Protein Identification Algorithm
- Goals
- Make proteomic data widely available so that it
can be leveraged in future studies
- Goals
- Enable the laboratory to multitask while organizing
and securing data for each individual
collaboration
- Collect required annotations and associate
annotations with samples and instrument files
- Appropriately store instrument files
- Facilitate protein informatics processes
- Facilitate communication between outside
investigators and the laboratory
- Goals
- Integrate data produced at different sites into a
unified schema
- Potentially enforce the minimum annotation set
required to fulfill the collaboration analysis
objective
- Potentially provide an environment for analysis
across all collaboration datasets
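As a rough illustration of what enforcing a minimum annotation set could look like, the sketch below checks a submitted record against a list of required fields. The field names and the `missing_annotations` helper are hypothetical and chosen only for illustration; the actual minimum set would be defined by the collaboration.

```python
from typing import Dict, List, Sequence

# Hypothetical minimum annotation set (an assumption, not the collaboration's real list).
REQUIRED_ANNOTATIONS = ("sample_id", "site", "organism", "instrument", "raw_file")


def missing_annotations(
    record: Dict[str, str],
    required: Sequence[str] = REQUIRED_ANNOTATIONS,
) -> List[str]:
    """Return the required annotation names that are absent or blank in a record."""
    return [name for name in required if not str(record.get(name, "")).strip()]


# Example submission with one blank field and one field missing entirely.
submission = {
    "sample_id": "S-001",
    "site": "Site A",
    "organism": "Homo sapiens",
    "instrument": "",
}

problems = missing_annotations(submission)
print(problems)
```

A record would be accepted into the unified schema only when the returned list is empty; otherwise the list tells the submitting site what still needs to be supplied.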
Reference Data Repositories
- Goals
- Provide the means of collecting and disseminating
proteomics knowledge
3 Proteomics Information Lifecycle: System Examples
Analytical Chemistry Lab
Collaboration
Publicly Available Data
Site Portal / LIMS Environment
Collaboration Data Management System
Experiment Repositories
Protein Identification Algorithm
PRIDE is an example of this type of database.
- The HPCGG leverages its custom-built Gateway for
Integrated Genomics-Proteomics Applications and
Data (GIGPAD) system to meet this need.
- For protein identification we leverage SEQUEST.
This collaboration is currently planning to
leverage a customized version of caLIMS for this
purpose.
Reference Data Repositories
- Examples
- BIND
- Swiss-Prot
- Protein Data Bank
4 Key Chokepoint in Today's Flow
Analytical Chemistry Lab
Collaboration
Publicly Available Data
Site Portal / LIMS Environment
Collaboration Data Management System
Experiment Repositories
Protein Identification Algorithm
High-throughput versions of these algorithms rely
on sequence databases, which are known to be far
from complete.
Reference Data Repositories
Proteins that are not adequately represented in
the sequence databases may never flow across this
link
5 Deficiencies in Current Sequence Databases
- What's Missing
- SNP-induced changes within proteins
- Other polymorphism-induced changes in proteins
- Post-translational modifications (PTMs)
Sequence Identification Databases (contents based
on protein backbone sequences)
While it is possible to add specific instances of
these items to the database, you have to know
what you are looking for, and in the case of PTMs
there is a risk of increasing the false positive
rate.
We are likely failing to pull a large number of
proteins out of our instrument files because of
deficiencies in our sequence databases
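The toy sketch below illustrates both points under heavy simplification. It matches on peptide mass only, whereas real identification algorithms such as SEQUEST score fragment spectra. The two-peptide database, the variant peptide, and all function names are hypothetical; the residue masses are standard monoisotopic values. A peptide carrying a single amino acid substitution finds no candidate in a backbone-only database, while allowing a variable modification on every eligible residue multiplies the number of forms that must be searched, which is where the added false positive risk comes from.

```python
from typing import List

# Monoisotopic residue masses in daltons for the residues used in this toy example.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "D": 115.02694,
    "K": 128.09496, "E": 129.04259,
}
WATER = 18.01056  # mass of the water added to a free peptide


def peptide_mass(sequence: str) -> float:
    """Monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[residue] for residue in sequence) + WATER


def candidates(observed_mass: float, database: List[str], tolerance: float = 0.01) -> List[str]:
    """Database peptides whose mass is within the tolerance of the observed mass."""
    return [p for p in database if abs(peptide_mass(p) - observed_mass) <= tolerance]


def variable_mod_forms(sequence: str, sites: str = "STY") -> int:
    """Number of forms to search if each eligible residue may or may not be modified."""
    eligible = sum(1 for residue in sequence if residue in sites)
    return 2 ** eligible


# Hypothetical backbone-only sequence database.
database = ["GASPVTK", "LDEAGSK"]

# The reference peptide is found; its V-to-A variant (as a SNP might produce) is not.
reference_hits = candidates(peptide_mass("GASPVTK"), database)
variant_hits = candidates(peptide_mass("GASPATK"), database)

# Allowing one variable modification on S/T/Y grows two entries into six forms.
expanded_forms = sum(variable_mod_forms(p) for p in database)

print(reference_hits, variant_hits, expanded_forms)
```

The variant peptide is present in the sample but invisible to the search, and the only way to recover it here is to already know the substitution and add it to the database.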
6 The Good News and the Challenge
- We expect more robust sequence databases will
become available
- SNP and polymorphism data are being collected in
structured databases today
- PTM data are beginning to be collected and cataloged
- Protein identification algorithms are an area of
continuing research
In the (potentially not too distant) future, we
expect that more robust sequence databases will
become available and that these databases will be
dynamic, consistently improving as base genetic
and proteomic data grow. But are we well
positioned to leverage these databases to
reanalyze what will then be historical datasets?
7 Information Loss within the Current Flow
Analytical Chemistry Lab
Collaboration
Publicly Available Data
Custom Interface
Data Interchange Format (mzXML for example)
Site Portal / LIMS Environment
Collaboration Data Management System
Experiment Repositories
Protein Identification Algorithm
- Potential Points of Loss
- Full set of annotations ultimately required for
analysis not identified in advance and structured
in LIMS
- Instrument files not retained (should not be an
issue with GIGPAD)
- Potential Points of Loss
- Raw files
- Vendor-specific information in instrument files
- Tuning and calibration information
- Annotations not transferred through interface
- Potential Points of Loss
- Raw files
- Vendor-specific information in instrument files
- Tuning and calibration information
- Collaboration-specific or other annotations not
adequately represented in the Data Interchange
Format
We need to ensure that data stored in these
systems can be reanalyzed. We are very happy to
see that David States's solution includes movement
of raw files to the caLIMS environment.
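One way to make this kind of loss visible is to compare what a system holds against what actually crosses an interface. The sketch below is a minimal illustration with made-up field names. `INTERCHANGE_FIELDS` is a hypothetical stand-in for whatever a given interchange format can carry and is not a description of the mzXML schema.

```python
from typing import Dict, Set

# Hypothetical set of fields the interchange format can represent (an assumption).
INTERCHANGE_FIELDS = {"sample_id", "instrument", "scan_count"}


def export_record(record: Dict[str, str], carried: Set[str] = INTERCHANGE_FIELDS) -> Dict[str, str]:
    """Keep only the fields that the interchange format is able to carry."""
    return {key: value for key, value in record.items() if key in carried}


def lost_fields(source: Dict[str, str], exported: Dict[str, str]) -> Set[str]:
    """Fields present at the source that did not survive the export."""
    return set(source) - set(exported)


# Hypothetical LIMS record holding more than the interchange format can express.
lims_record = {
    "sample_id": "S-001",
    "instrument": "LCQ",
    "scan_count": "4200",
    "tuning_file": "tune_2006_01.bin",
    "collaboration_protocol": "P-17",
}

exported = export_record(lims_record)
dropped = sorted(lost_fields(lims_record, exported))
print(dropped)
```

Running such a check at each hand-off would show which annotations need either a place in the interchange format or a side channel, such as shipping the raw file alongside.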
8 Problems Caused by the Size of Instrument Files
- File sizes involved are large
- Raw file sizes vary considerably depending on
concentration and number of proteins in samples,
but for frame of reference:
- LCQ centroid mode: 10 MB/hour/instrument
- Profile mode can approach the FT low-res mode in
data rate
- FT profile low-res mode: 150 MB/hour/instrument
- Movement of files across networks is difficult
- Storage requirements are extensive
- Trend in proteomics is for newer instrumentation
to produce larger files
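To put the two reference data rates in perspective, the arithmetic below projects them to a yearly volume per instrument. The rates come from the slide; the duty cycle (20 hours per day, 300 days per year) is purely an assumption for illustration, and gigabytes are taken as 1000 MB.

```python
# Data rates quoted above, in MB per hour per instrument.
RATES_MB_PER_HOUR = {
    "LCQ centroid": 10,
    "FT profile low-res": 150,
}


def yearly_volume_gb(
    rate_mb_per_hour: float,
    hours_per_day: float,
    days_per_year: float,
    instruments: int = 1,
) -> float:
    """Projected yearly raw data volume in GB (1 GB = 1000 MB)."""
    return rate_mb_per_hour * hours_per_day * days_per_year * instruments / 1000


# Assumed duty cycle; real acquisition schedules will differ.
HOURS_PER_DAY = 20
DAYS_PER_YEAR = 300

yearly = {
    mode: yearly_volume_gb(rate, HOURS_PER_DAY, DAYS_PER_YEAR)
    for mode, rate in RATES_MB_PER_HOUR.items()
}

for mode, gigabytes in yearly.items():
    print(f"{mode}: {gigabytes:.0f} GB per instrument per year")
```

Under these assumptions a single instrument produces roughly 60 GB per year in LCQ centroid mode and roughly 900 GB per year in FT profile low-res mode, before any replication or backup.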
9 Potential Solutions: Two Different Directions
- Facilitate movement of instrument files
- Develop means of ensuring instrument data files
can move to a researcher who wants to
algorithmically analyze them
- Low-tech solutions may play a role (literally
physically mailing DVDs or hard drives)
- Peer-to-peer networking technology such as
BitTorrent may be able to help us distribute
network load
- Facilitate movement of algorithms to data
- Allow raw instrument files to remain physically
dispersed at the sites that created them but
ensure remote reanalysis is possible
- Data grid technologies may be able to help us if
we move in this direction
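Why mailing hard drives can be a serious option becomes clearer with a back-of-the-envelope comparison. The sketch below estimates network transfer time for a dataset and compares it with a fixed courier time. The dataset size, link speeds, 70% link efficiency, and 48-hour courier figure are all assumptions picked for illustration.

```python
def transfer_hours(size_gb: float, link_mbit_per_s: float, efficiency: float = 0.7) -> float:
    """Hours needed to move size_gb over a link, given an assumed usable fraction."""
    megabits = size_gb * 8000  # 1 GB = 1000 MB = 8000 Mbit
    seconds = megabits / (link_mbit_per_s * efficiency)
    return seconds / 3600


def faster_option(size_gb: float, link_mbit_per_s: float, courier_hours: float) -> str:
    """Name the quicker route for a dataset: 'network' or 'courier'."""
    return "network" if transfer_hours(size_gb, link_mbit_per_s) < courier_hours else "courier"


# Assumed values: a 500 GB dataset, a 48-hour courier, and two link speeds.
DATASET_GB = 500
COURIER_HOURS = 48

fast_link_hours = transfer_hours(DATASET_GB, 100)
slow_link_hours = transfer_hours(DATASET_GB, 10)

print(f"100 Mbit/s: {fast_link_hours:.1f} h -> {faster_option(DATASET_GB, 100, COURIER_HOURS)}")
print(f" 10 Mbit/s: {slow_link_hours:.1f} h -> {faster_option(DATASET_GB, 10, COURIER_HOURS)}")
```

With these numbers the faster link beats the courier and the slower one does not, which is the trade-off that also motivates leaving the files in place and moving the algorithms instead.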
10 Informatics Overview