Proteomics Database Discussion


1
Proteomics Database Discussion
  • Raju Kucherlapati
  • Harvard Medical School

2
Proteomics Information Lifecycle
Analytical Chemistry Lab
Collaboration
Publicly Available Data
Site Portal / LIMS Environment
  • Goals
  • Enable Laboratory to multitask while organizing
    and securing data for each individual
    collaboration
  • Collect required annotations and associate
    annotations with samples and instrument files
  • Appropriately store instrument files
  • Facilitate Protein Informatics Processes
  • Facilitate communication between outside
    investigators and the laboratory
Protein Identification Algorithm
Collaboration Data Management System
  • Goals
  • Integrate data produced at different sites into a
    unified schema
  • Potentially enforce minimum annotation set
    required to fulfill collaboration analysis
    objective.
  • Potentially provide an environment for analysis
    across all collaboration datasets
Experimental Data Repositories
  • Goals
  • Make proteomic data widely available so that it
    can be leveraged in future studies
Reference Data Repositories
  • Goals
  • Provide the means of collecting and disseminating
    proteomics knowledge
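
The "minimum annotation set" goal can be pictured as a check that runs before a sample and its instrument file are accepted into the collaboration system. The sketch below is only an illustration under stated assumptions: the field names and the `missing_annotations` helper are hypothetical and are not taken from caLIMS, GIGPAD, or any other system named in these slides.

```python
# Hypothetical minimum annotation set; a real set would be agreed per collaboration.
REQUIRED_ANNOTATIONS = ("sample_id", "organism", "instrument", "raw_file")


def missing_annotations(record, required=REQUIRED_ANNOTATIONS):
    """Return the required annotation names that are absent or blank in record."""
    missing = []
    for name in required:
        value = record.get(name)
        if value is None or not str(value).strip():
            missing.append(name)
    return missing


# Example record with one blank field and one field left out entirely.
record = {"sample_id": "S-001", "organism": "Homo sapiens", "instrument": ""}
print(missing_annotations(record))
```

A submission would be held back until the returned list is empty, which keeps incomplete records from reaching the shared schema.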

3
Proteomics Information Lifecycle: System Examples
Analytical Chemistry Lab
Collaboration
Publicly Available Data
Site Portal / LIMS Environment
  • The HPCGG leverages its custom-built Gateway for
    Integrated Genomics-Proteomics Applications and
    Data (GIGPAD) system to meet this need.
Protein Identification Algorithm
  • For protein identification, we leverage SEQUEST.
Collaboration Data Management System
  • This collaboration is currently planning to
    leverage a customized version of caLIMS for this
    purpose.
Experiment Repositories
  • PRIDE is an example of this type of database.
Reference Data Repositories
  • Examples
  • BIND
  • Swiss-Prot
  • Protein Data Bank

4
Key Chokepoint in Today's Flow
Analytical Chemistry Lab
Collaboration
Publicly Available Data
Site Portal / LIMS Environment
Collaboration Data Management System
Experiment Repositories
Reference Data Repositories
Protein Identification Algorithm
High-throughput versions of these algorithms rely
on sequence databases, which are known to be far
from complete.
Proteins that are not adequately represented in
the sequence databases may never flow across this
link.

5
Deficiencies in Current Sequence Databases
  • What's Missing
  • SNP-induced changes within proteins
  • Other polymorphism-induced changes in proteins
  • Post-translational modifications (PTMs)

Sequence Identification Databases (contents based
on protein backbone sequences)
While it is possible to add specific instances of
these items to the database, you have to know
what you are looking for, and in the case of PTMs
there is a risk of increasing the false-positive
rate.
We are likely failing to pull a large number of
proteins out of our instrument files because of
deficiencies in our sequence databases.
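
A toy example may make the chokepoint concrete. The sketch below digests a made-up reference protein with a trypsin-like rule and matches an observed peptide by precursor mass alone, which is far simpler than what SEQUEST or any production search engine does. The sequences, the 0.01 Da tolerance, and all function names are assumptions chosen for illustration; only the monoisotopic residue masses are standard values.

```python
import re

# Monoisotopic residue masses in daltons for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def tryptic_peptides(protein):
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]


def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def matches(observed_mass, database_peptides, tol=0.01):
    """Database peptides whose mass lies within tol daltons of the observation."""
    return [p for p in database_peptides
            if abs(peptide_mass(p) - observed_mass) <= tol]


def modified_form_count(peptide, sites="STY"):
    """Number of candidate forms if each listed residue may or may not be modified."""
    return 2 ** sum(peptide.count(aa) for aa in sites)


reference_db = tryptic_peptides("GASPVTKLEDFR")   # made-up reference protein
observed = peptide_mass("LENFR")                  # sample carries a D-to-N substitution
print(matches(observed, reference_db))            # variant absent from the database
print(matches(observed, tryptic_peptides("GASPVTKLENFR")))  # variant added
print(modified_form_count("GASPVTK"))             # candidates with optional S/T/Y PTMs
```

The variant peptide finds no match until its sequence is added to the database, and allowing an optional modification on every S, T, or Y doubles the candidate forms per site, which is the source of the false-positive risk noted above.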
6
The Good News and the Challenge
  • We expect more robust sequence databases will
    become available
  • SNP and polymorphism data are being collected in
    structured databases today
  • PTM data are beginning to be collected and cataloged
  • Protein identification algorithms are an area of
    continuing research

In the (potentially not too distant) future, we
expect that more robust sequence databases will
become available and that these databases will be
dynamic, consistently improving as base genetic
and proteomic data grow. But are we well
positioned to leverage these databases to
reanalyze what will then be historical datasets?

7
Information Loss within the Current Flow
Analytical Chemistry Lab
Collaboration
Publicly Available Data
Custom Interface
Data Interchange Format (mzXML for example)
Site Portal / LIMS Environment
Collaboration Data Management System
Experiment Repositories
Protein Identification Algorithm
  • Potential Points of Loss (Site Portal / LIMS Environment)
  • Full set of annotations ultimately required for
    analysis not identified in advance and structured
    in the LIMS
  • Instrument files not retained (should not be an
    issue with GIGPAD)
  • Potential Points of Loss (Custom Interface)
  • Raw files
  • Vendor-specific information in instrument files
  • Tuning and calibration information
  • Annotations not transferred through the interface
  • Potential Points of Loss (Data Interchange Format)
  • Raw files
  • Vendor-specific information in instrument files
  • Tuning and calibration information
  • Collaboration-specific or other annotations not
    adequately represented in the Data Interchange
    Format
We need to ensure that data stored in these
systems can be reanalyzed. We are very happy to
see that David States's solution includes movement
of raw files to the caLIMS environment.

8
Problems Caused by the Size of Instrument Files
  • File sizes involved are large
  • Raw file sizes vary considerably depending on the
    concentration and number of proteins in samples,
    but for a frame of reference:
  • LCQ centroid mode: 10 MB/hour/instrument
  • Profile mode can approach the FT low-res mode in
    data rate
  • FT profile low-res mode: 150 MB/hour/instrument
  • Movement of files across networks is difficult
  • Storage requirements are extensive
  • The trend in proteomics is for newer instrumentation
    to produce larger files.
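
To put the two quoted data rates in perspective, the following back-of-the-envelope sketch converts them into yearly storage and network transfer time. The duty cycle (24 hours a day, 365 days a year), the decimal units (1 GB = 1000 MB), and the fully utilized 100 Mbit/s link are assumptions made for illustration, not figures from the slides.

```python
# Data rates quoted on the slide, in MB per hour per instrument.
RATES_MB_PER_HOUR = {"LCQ centroid": 10, "FT profile low-res": 150}


def yearly_storage_gb(rate_mb_per_hour, hours_per_day=24, days=365, instruments=1):
    """Storage in GB for one year of acquisition (assumed continuous operation)."""
    return rate_mb_per_hour * hours_per_day * days * instruments / 1000


def transfer_hours(size_gb, link_mbit_per_s):
    """Hours needed to move size_gb over a link running at its full nominal rate."""
    megabits = size_gb * 8000  # 1 GB = 8000 Mbit in decimal units
    return megabits / link_mbit_per_s / 3600


for mode, rate in RATES_MB_PER_HOUR.items():
    size = yearly_storage_gb(rate)
    print(f"{mode}: {size:.1f} GB/year, "
          f"{transfer_hours(size, 100):.1f} h at 100 Mbit/s")
```

Under these assumptions one instrument produces about 87.6 GB per year in LCQ centroid mode and about 1314 GB per year in FT profile low-res mode; moving the latter over the assumed link would take roughly 29 hours even at full utilization.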

9
Potential Solutions: Two Different Directions
  • Facilitate movement of instrument files
  • Develop means of ensuring instrument data files
    can move to a researcher who wants to
    algorithmically analyze them
  • Low-tech solutions may play a role (literally
    physically mailing DVDs or hard drives)
  • Peer-to-peer networking technology such as
    BitTorrent may be able to help us distribute
    network load
  • Facilitate movement of algorithms to data
  • Allow raw instrument files to remain physically
    dispersed at the sites that created them but
    ensure remote reanalysis is possible.
  • Data grid technologies may be able to help us if
    we move in this direction
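
Whichever way instrument files travel, whether on mailed drives or over a peer-to-peer network, the receiving site needs a way to confirm that a large file arrived intact. The sketch below splits a file's bytes into fixed-size pieces and hashes each one, similar in spirit to the per-piece SHA-1 hashes that BitTorrent uses. The function names, the piece size, and the in-memory stand-in for a raw file are assumptions made for illustration.

```python
import hashlib


def piece_hashes(data, piece_size):
    """SHA-1 hex digest of each consecutive piece_size-byte piece of data."""
    return [hashlib.sha1(data[i:i + piece_size]).hexdigest()
            for i in range(0, len(data), piece_size)]


def damaged_pieces(data, expected_hashes, piece_size):
    """Indices of pieces whose hash is missing or differs from the manifest."""
    actual = piece_hashes(data, piece_size)
    count = max(len(actual), len(expected_hashes))
    return [i for i in range(count)
            if i >= len(actual)
            or i >= len(expected_hashes)
            or actual[i] != expected_hashes[i]]


# Stand-in for a raw instrument file: 1024 bytes split into four 256-byte pieces.
original = bytes(range(256)) * 4
manifest = piece_hashes(original, 256)

received = bytearray(original)
received[300] ^= 0xFF  # simulate one corrupted byte in transit
print(damaged_pieces(bytes(received), manifest, 256))
```

Only the damaged pieces need to be sent again, which matters when a single file runs to gigabytes.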

10
Informatics Overview