Proteomics Database Discussion - PowerPoint PPT Presentation

1
Proteomics Database Discussion

Raju Kucherlapati
Harvard Medical School

2
Proteomics Information Lifecycle
Analytical Chemistry Lab
Collaboration
Publicly Available Data
Site Portal / LIMS Environment
Collaboration Data Management System
Experimental Data Repositories
Protein Identification Algorithm

Goals
Make proteomic data widely available so that it
can be leveraged in future studies

Goals
Enable Laboratory to multitask while organizing
and securing data for each individual
collaboration
Collect required annotations and associate
annotations with samples and instrument files
Appropriately store instrument files
Facilitate Protein Informatics Processes
Facilitate communication between outside
investigators and the laboratory

Goals
Integrate data produced at different sites into a
unified schema
Potentially enforce minimum annotation set
required to fulfill collaboration analysis
objective.
Potentially provide an environment for analysis
across all collaboration datasets
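
The goal of enforcing a minimum annotation set could be approached with a simple completeness check at submission time. The following is only a minimal sketch: the field names in REQUIRED_ANNOTATIONS and the function missing_annotations are hypothetical and would be replaced by whatever set the collaboration agrees on.

```python
# Hypothetical minimum annotation set; the real set would be
# defined by the collaboration, not by this sketch.
REQUIRED_ANNOTATIONS = {"sample_id", "organism", "instrument", "acquisition_date"}


def missing_annotations(record, required=REQUIRED_ANNOTATIONS):
    """Return the sorted required keys that are absent or empty in a record."""
    missing = []
    for key in required:
        value = record.get(key)
        # Treat None and blank strings as "not provided".
        if value is None or not str(value).strip():
            missing.append(key)
    return sorted(missing)


# Example: one field left blank and one never supplied.
record = {"sample_id": "S-001", "organism": "Homo sapiens", "instrument": ""}
print(missing_annotations(record))
```

A submission would be accepted only when the returned list is empty, so gaps are caught before the sample and its instrument files reach the repository.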

Reference Data Repositories

Goals
Provide the means of collecting and disseminating
proteomics knowledge

3
Proteomics Information Lifecycle: System Examples
Analytical Chemistry Lab
Collaboration
Publicly Available Data
Site Portal / LIMS Environment
Collaboration Data Management System
Experiment Repositories
Protein Identification Algorithm
PRIDE is an example of this type of database.

The HPCGG leverages its custom-built Gateway for
Integrated Genomics-Proteomics Applications and
Data (GIGPAD) system to meet this need.
For protein identification, we leverage SEQUEST.

This collaboration is currently planning to
leverage a customized version of caLIMS for this
purpose.
Reference Data Repositories

Examples
BIND
Swiss-Prot
Protein Data Bank

4
Key Chokepoint in Today's Flow
Analytical Chemistry Lab
Collaboration
Publicly Available Data
Site Portal / LIMS Environment
Collaboration Data Management System
Experiment Repositories
Protein Identification Algorithm
High-throughput versions of these algorithms rely
on sequence databases, which are known to be far
from complete
Reference Data Repositories
Proteins that are not adequately represented in
the sequence databases may never flow across this
link
5
Deficiencies in Current Sequence Databases

What's Missing
SNP-induced changes within proteins
Other polymorphism-induced changes in proteins
Post-translational modifications (PTMs)

Sequence Identification Databases (Contents Based
on Protein Backbone Sequences)
While it is possible to add specific instances of
these items to the database, you have to know
what you are looking for, and in the case of PTMs
there is a risk of increasing the false positive
rate.
We are likely failing to pull a large number of
proteins out of our instrument files because of
deficiencies in our sequence databases.
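
One way to see why adding PTMs can raise the false positive rate is to count how quickly the candidate list grows. The sketch below is only an illustration under stated assumptions: it treats a single variable modification (phosphorylation on S, T, or Y is used as the example), lets every eligible residue be either modified or unmodified, and uses a made-up peptide string. The function name count_modified_forms is hypothetical and is not part of any search engine.

```python
def count_modified_forms(peptide, modifiable="STY"):
    """Count distinct forms of a peptide when each eligible residue
    may independently carry one variable modification or not."""
    sites = sum(1 for residue in peptide if residue in modifiable)
    # Each eligible site doubles the number of candidate forms.
    return 2 ** sites


# Illustrative peptide string (not taken from any real dataset).
peptide = "SAMPLETYR"
print(peptide, "has", count_modified_forms(peptide), "candidate forms")
```

Every additional candidate form is another chance for a spectrum to match by coincidence, which is the risk the slide describes.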
6
The Good News and the Challenge

We expect more robust sequence databases will
become available
SNP and polymorphism data are being collected in
structured databases today
PTM data are beginning to be collected and cataloged
Protein identification algorithms are an area of
continuing research

In the (potentially not too distant) future, we
expect that more robust sequence databases will
become available and that these databases will be
dynamic, consistently improving as base genetic
and proteomic data grows. But are we well
positioned to leverage these databases to
reanalyze what will then be historical datasets?
7
Information Loss within the Current Flow
Analytical Chemistry Lab
Collaboration
Publicly Available Data
Custom Interface
Data Interchange Format (mzXML for example)
Site Portal / LIMS Environment
Collaboration Data Management System
Experiment Repositories
Protein Identification Algorithm

Potential Points of Loss
Full set of annotations ultimately required for
analysis not identified in advance and structured
in LIMS
Instrument files not retained (should not be an
issue with GIGPAD)

Potential Points of Loss
Raw files
Vendor-specific information in instrument files.
Tuning and calibration information
Annotations not transferred through interface

Potential Points of Loss
Raw files
Vendor-specific information in instrument files.
Tuning and calibration information
Collaboration-specific or other annotations not
adequately represented in the Data Interchange
Format

We need to ensure that data stored in these
systems can be reanalyzed. We are very happy to
see that David States' solution includes movement
of raw files to the caLIMS environment.
8
Problems Caused by the Size of Instrument Files

File sizes involved are large
Raw file sizes vary considerably depending on
concentration and number of proteins in samples,
but as a frame of reference:
LCQ centroid mode: 10 MB/hour/instrument
Profile mode can approach the FT low-res mode in
data rate
FT profile low-res mode: 150 MB/hour/instrument
Movement of files across networks difficult
Storage requirements extensive
Trend in proteomics is for newer instrumentation
to produce larger files.
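
The data rates above can be turned into a rough storage estimate. This is a back-of-the-envelope sketch only: the two rates come from this slide, while the instrument count, hours of acquisition per day, and number of days are assumed values chosen for illustration, and 1 GB is taken as 1000 MB. The function name estimate_storage_gb is hypothetical.

```python
def estimate_storage_gb(rate_mb_per_hour, instruments, hours_per_day, days):
    """Estimate raw-file storage in GB (1 GB = 1000 MB) for a given
    per-instrument data rate and acquisition schedule."""
    total_mb = rate_mb_per_hour * instruments * hours_per_day * days
    return total_mb / 1000


# Rates quoted on the slide (MB/hour/instrument).
RATES = {"LCQ centroid": 10, "FT profile low-res": 150}

# Assumed schedule: one instrument running around the clock for a year.
for mode, rate in RATES.items():
    gb = estimate_storage_gb(rate, instruments=1, hours_per_day=24, days=365)
    print(f"{mode}: {gb:.1f} GB per instrument-year")
```

Under these assumptions the two modes differ by a factor of 15 in storage and in the volume that has to cross the network, which is why the acquisition mode matters as much as the number of instruments.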

9
Potential Solutions: Two Different Directions

Facilitate movement of instrument files
Develop means of ensuring instrument data files
can move to a researcher who wants to
algorithmically analyze them
Low-tech solutions may play a role (literally
physically mailing DVDs or hard drives)
Peer-to-peer networking technology such as
BitTorrent may be able to help us distribute
network load
Facilitate movement of algorithms to data
Allow raw instrument files to remain physically
dispersed at the sites that created them but
ensure remote reanalysis is possible.
Data grid technologies may be able to help us if
we move in this direction
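
Whichever way instrument files travel (mailed drives or peer-to-peer transfer), the receiving site needs a way to confirm that a large file arrived intact. The sketch below shows one possible approach, loosely in the spirit of the per-piece hashes that peer-to-peer protocols use: the file is hashed in fixed-size pieces so that a damaged region can be located and only that piece re-sent. The function names, the 1 MiB piece size, and the choice of SHA-256 are assumptions for illustration, not part of any system described here.

```python
import hashlib


def piece_hashes(path, piece_size=1024 * 1024):
    """Return a list of SHA-256 hex digests, one per fixed-size piece."""
    hashes = []
    with open(path, "rb") as fh:
        while True:
            piece = fh.read(piece_size)
            if not piece:
                break
            hashes.append(hashlib.sha256(piece).hexdigest())
    return hashes


def damaged_pieces(path, manifest, piece_size=1024 * 1024):
    """Return indices of pieces that are missing, extra, or differ
    from the manifest produced at the originating site."""
    actual = piece_hashes(path, piece_size)
    count = max(len(actual), len(manifest))
    return [
        i for i in range(count)
        if i >= len(actual) or i >= len(manifest) or actual[i] != manifest[i]
    ]
```

The originating lab would compute the manifest with piece_hashes and send it alongside the file; the recipient runs damaged_pieces with the same piece size and requests only the pieces whose indices are returned.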