Title: Use Cases for a Proteomics Data Repository
1 Use Cases for a Proteomics Data Repository
Our Experiences with PRIDE: The PRoteomics IDEntifications database
PRIDE: A Data Repository and Data Transfer Format for Protein and Peptide Identifications and Supporting Evidence
Phil Jones, EBI, Hinxton, Cambridgeshire, UK. pjones_at_ebi.ac.uk
2 Requirements Overview: What needs to be considered?
- Nature of likely queries: how will the repository be interrogated?
- What is the nature of the response that the user querying the repository will require?
- Which kinds of proteomics data should be included in the repository?
- How will submission of data to the repository be promoted / encouraged?
- How will the repository meet common standards for the exchange of data, and what are the advantages of doing so?
- What level of detail should the repository include? Major data storage and efficiency concerns are connected with this.
3 What kinds of Questions will be Asked? Search Criteria
- Likely queries may be a combination of any of the following:
  - Literature reference (by author / title / keywords etc.)
  - Protein ID
  - Protein family / domain / other classification
  - Peptide sequence
  - Species
  - Developmental stage / age
  - Tissue / organ / cell type
  - Sub-cellular component
  - Disease / pathological state
  - Genotype / phenotype
  - Environmental conditions (of organism under analysis)
  - Sample processing method
  - Instrument type / parameters
  - Search engine / parameters
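The combination of criteria listed above can be pictured as a conjunction of optional field filters. The following is a minimal sketch under stated assumptions, not PRIDE's actual search code: the `Experiment` record, its field names, the `matches` and `search` helpers, and the placeholder accessions and protein IDs are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Experiment:
    """Hypothetical, heavily simplified repository record."""
    accession: str
    species: str
    tissue: str
    protein_ids: frozenset
    instrument: str


def matches(exp, **criteria):
    """Return True if the record satisfies every supplied criterion."""
    for name, wanted in criteria.items():
        value = getattr(exp, name)
        if isinstance(value, frozenset):
            # Multi-valued field: the wanted value must be one of the members.
            if wanted not in value:
                return False
        elif value != wanted:
            return False
    return True


def search(experiments, **criteria):
    """Return the records that match all criteria, in their original order."""
    return [e for e in experiments if matches(e, **criteria)]


# Placeholder data, invented purely for illustration.
EXPERIMENTS = [
    Experiment("EXP-1", "Homo sapiens", "liver", frozenset({"IPI-A", "IPI-B"}), "ion trap"),
    Experiment("EXP-2", "Homo sapiens", "brain", frozenset({"IPI-B"}), "MALDI-TOF"),
    Experiment("EXP-3", "Mus musculus", "liver", frozenset({"IPI-C"}), "ion trap"),
]

hits = search(EXPERIMENTS, species="Homo sapiens", protein_ids="IPI-B")
print([e.accession for e in hits])
```

Treating every criterion as optional keeps a single entry point for simple and combined queries; a real repository would push the same filters down into the database rather than scan records in memory.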
4 What kinds of Questions will be Asked? Search Space
- Require common controlled vocabularies / ontologies to define the search space:
  - Species: NCBI Tax ID, ITIS
  - Tissue / organ / cell type: MeSH, Plant Ontology, cell.obo
  - Sub-cellular component: GO
  - Disease: MeSH
  - Genotype: GO
  - Phenotype: MGI's Mammalian Phenotype Ontology
  - Sample processing: PSI Ontology
  - MS instrument: PSI Ontology
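One way to picture how controlled vocabularies define the search space is a lookup from each annotation field to the vocabularies accepted for it. The sketch below copies the field-to-vocabulary pairs from the list above; the dictionary keys, the `validate_annotation` helper, and its messages are hypothetical and only illustrate the idea.

```python
# Field-to-vocabulary pairs taken from the slide; key names are hypothetical.
VOCABULARIES = {
    "species": ("NCBI Taxonomy", "ITIS"),
    "tissue": ("MeSH", "Plant Ontology", "cell.obo"),
    "subcellular_component": ("GO",),
    "disease": ("MeSH",),
    "genotype": ("GO",),
    "phenotype": ("Mammalian Phenotype Ontology",),
    "sample_processing": ("PSI Ontology",),
    "ms_instrument": ("PSI Ontology",),
}


def validate_annotation(field_name, vocabulary, term_id):
    """Return a list of problems; an empty list means the annotation is acceptable.

    Only the pairing of field and vocabulary is checked here. Whether the
    term identifier really exists in that vocabulary is out of scope.
    """
    problems = []
    allowed = VOCABULARIES.get(field_name)
    if allowed is None:
        problems.append(f"unknown field: {field_name}")
    elif vocabulary not in allowed:
        problems.append(f"{vocabulary} is not an accepted vocabulary for {field_name}")
    if not term_id.strip():
        problems.append("empty term identifier")
    return problems


# 9606 is the NCBI Taxonomy ID for Homo sapiens.
print(validate_annotation("species", "NCBI Taxonomy", "9606"))
# GO is listed for sub-cellular component, not for disease.
print(validate_annotation("disease", "GO", "GO:0005739"))
```

Restricting each field to named vocabularies is what makes queries comparable across submitters: two laboratories annotating the same tissue end up with the same identifier rather than two free-text spellings.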
5 What kinds of response will the typical user expect?
- Need to define what will be returned to the user querying the database and the format of such a return.
- Data format:
  - Machine readable data formats (e.g. XML)
  - Human readable data formats
  - Graphical display, e.g. visualisation of spectra, gel images
  - Display of statistics or compact summary of data
- Data content:
  - Details of tissue, sample prep, other experimental parameters
  - Predicted protein identifications and appropriate scores
  - Predicted peptide identifications and appropriate scores
  - Predicted post-translational modifications
  - Links to references in the literature
6 Controlling Data Volume: How detailed do you want to go?
- Raw MS data would quickly swell to TB in magnitude.
- Peak lists will certainly involve GB of data initially. They can be expected to swell to TB, but perhaps at a more controllable and sustainable rate.
- Massive data sets create problems for both storage and efficiency of data retrieval.
- Raw data optionally stored by the submitter, e.g. on an FTP server, and linked to from the repository?
7 Data formats for submission and inter-repository exchange
- As well as allowing submission of data, the flexibility to exchange data with external proteomics repositories would also be desirable.
- A successful model of collaborative effort to achieve this is the PSI initiative for the exchange of protein interaction data using the PSI MI XML format, with major protein-protein interaction databases being involved:
  - BIND
  - DIP
  - Hybrigenics
  - IntAct
  - MINT
  - MIPS interaction tables
- The ability to exchange MI data is now being extended by the IMEx consortium with the following aims:
  - creation of a consistent body of public data
  - avoidance of redundant curation
8 Data formats for submission and inter-repository exchange
- The PSI General Proteomics Standards Workgroup is developing:
  - MIAPE: Minimum Information about a Proteomics Experiment
  - PSI Object Model
  - The PSI / GPS Ontology (working name PSI-ont), based upon the MGED ontology
  - Data exchange formats:
    - mzData (MS instrument output / peak lists)
    - mzIdent (peptide and protein identifications)
9 How has PRIDE tackled these problems?
PRIDE is a multi-faceted project offering:
- An XML schema for transfer of proteomics protein identification data.
- A relational database implementation for the data repository using OJB, allowing the use of most currently available RDBMSs. A central data repository is being set up at the EBI. However, the intention is to implement a network of federated databases across the community that can exchange data; these need not necessarily be PRIDE installations.
- Secure upload of proteomics data in the PRIDE XML schema format. (Future developments: upload and download using the mzData XML schema and the mzIdent XML schema.)
- The ability to search the repository and download results in PRIDE XML or HTML format. (Future developments: download in alternative XML formats.)
- Following release, PRIDE will become open source and freely available.
10 The PRIDE Data Model
11 PRIDE Security and Data Availability
Security measures for data privacy, traceability and group access:
- Before data can be uploaded, the user needs to register. All personal data is encrypted in the database.
- All experimental data is linked to the person who has uploaded it.
- Data can be marked as private. The person uploading the data may give a date upon which the data becomes public.
- Group access to private data can be granted by creating a 'collaboration'. Other users can then apply to join the collaboration. Each application is validated by the person who created the collaboration. The collaboration concept will allow geographically separated laboratories to share data via PRIDE before it is publicly available.
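The access rules described above (private data, an optional release date, and collaborations validated by their creator) can be sketched as a small visibility check. This is an illustrative model only, not the PRIDE implementation: the `Collaboration` and `Submission` classes, their fields, and `can_view` are hypothetical names, and the rule that the collaboration creator can always view is an assumption.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional, Set


@dataclass
class Collaboration:
    owner: str                                   # the user who created it
    members: Set[str] = field(default_factory=set)
    pending: Set[str] = field(default_factory=set)

    def apply(self, user):
        """Record a request to join; it has no effect until approved."""
        self.pending.add(user)

    def approve(self, approver, user):
        """Only the creator may validate an application."""
        if approver != self.owner:
            raise PermissionError("only the creator can validate applications")
        if user not in self.pending:
            raise ValueError(f"{user} has not applied")
        self.pending.discard(user)
        self.members.add(user)


@dataclass
class Submission:
    submitter: str                               # every data set is linked to its uploader
    private: bool = False
    public_from: Optional[date] = None           # optional release date
    collaboration: Optional[Collaboration] = None


def can_view(sub, user, today):
    """Decide whether `user` may see `sub` on the given day."""
    if not sub.private:
        return True
    if sub.public_from is not None and today >= sub.public_from:
        return True                              # release date has been reached
    if user == sub.submitter:
        return True
    collab = sub.collaboration
    # Assumption: the collaboration creator counts as a member.
    return collab is not None and (user == collab.owner or user in collab.members)


collab = Collaboration(owner="alice")
sub = Submission("alice", private=True, public_from=date(2005, 6, 1), collaboration=collab)
collab.apply("bob")
print(can_view(sub, "bob", date(2005, 1, 1)))    # application not yet validated
collab.approve("alice", "bob")
print(can_view(sub, "bob", date(2005, 1, 1)))    # now a validated member
```

Keeping applications in a separate `pending` set mirrors the two-step process on the slide: applying to join grants nothing until the creator validates the request.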
12 PRIDE Data Curation and Support
- Consistent use of protein identifiers, encouraging (but not mandating) use of IPI.
- Removing data uploaded in error.
- Ensuring consistent use of ontologies / controlled vocabularies for annotation of data:
  - Species (NCBI Taxonomy IDs)
  - Tissue (MeSH IDs)
  - Sub-cellular location (GO)
  - MS related ontologies (currently being developed as part of PSI)
- Providing a point of contact between investigators (personal information is not given out without permission).
13 PRIDE Post Translational Modifications
- Provides capacity to store details of post-translational modifications.
- Will use RESID PTM IDs for naturally occurring PTMs and the UNIMOD database for PTMs arising as artefacts of MS.
- (Negotiations have taken place to allow the RESID curators to annotate UNIMOD, possibly allowing a single comprehensive PTM database ID to be used.)
- Data that can be stored includes:
  - a link to the peptide that the PTM was found in
  - reference database name and PTM ID
  - mono-isotopic mass delta value
  - mean mass delta value
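The four stored items listed above map naturally onto a small record type. The sketch below is an assumption about shape, not the PRIDE schema: the `PTMRecord` class, its field names, the `modified_mass` helper, and the placeholder peptide identifier are hypothetical. The example values are for oxidation, quoted from memory as UNIMOD entry 35, and should be checked against the database before reuse.

```python
from dataclasses import dataclass

REFERENCE_DATABASES = ("RESID", "UNIMOD")


@dataclass(frozen=True)
class PTMRecord:
    peptide_id: str          # link to the peptide that the PTM was found in
    reference_db: str        # reference database name
    ptm_id: str              # identifier within that database
    mono_mass_delta: float   # mono-isotopic mass delta, in Da
    mean_mass_delta: float   # mean (average) mass delta, in Da

    def __post_init__(self):
        if self.reference_db not in REFERENCE_DATABASES:
            raise ValueError(f"unsupported reference database: {self.reference_db}")


def modified_mass(unmodified_mass, ptms, monoisotopic=True):
    """Add the chosen mass delta of every PTM to an unmodified peptide mass."""
    if monoisotopic:
        return unmodified_mass + sum(p.mono_mass_delta for p in ptms)
    return unmodified_mass + sum(p.mean_mass_delta for p in ptms)


# Illustrative values only; "PEP-1" is a made-up peptide identifier.
oxidation = PTMRecord("PEP-1", "UNIMOD", "35", 15.994915, 15.9994)
print(modified_mass(1000.0, [oxidation]))
print(modified_mass(1000.0, [oxidation], monoisotopic=False))
```

Storing both deltas lets the same record serve instruments and search engines that work with mono-isotopic masses as well as those that report average masses.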
14 PRIDE Web Application: Data Upload
15 PRIDE Web Application: Data Search
16 PRIDE Web Application: Search Results (HTML)
17 PRIDE Future Directions
- mzData compatibility for both import and export, early 2005.
- mzIdent compatibility for import and export shortly after the first release version of mzIdent becomes available.
- Improved search facilities, including boolean search on multiple fields.
- Development of a PRIDE curation tool.
- Set up a peak list re-analysis pipeline to provide up-to-date protein identifications using the latest version of IPI for all data sets.
- Negotiations are under way to include protein identifications from the HBPP, HPPP and HLPP projects in PRIDE.
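As a rough illustration of what boolean search on multiple fields could look like, the sketch below composes small predicates with AND, OR and NOT. It is not a description of the planned PRIDE feature: the helper names (`field_equals`, `all_of`, `any_of`, `negate`) and the dictionary-based records are hypothetical.

```python
def field_equals(name, value):
    """Predicate: the record's field `name` equals `value`."""
    return lambda record: record.get(name) == value


def all_of(*predicates):
    """Boolean AND over any number of predicates."""
    return lambda record: all(p(record) for p in predicates)


def any_of(*predicates):
    """Boolean OR over any number of predicates."""
    return lambda record: any(p(record) for p in predicates)


def negate(predicate):
    """Boolean NOT of a single predicate."""
    return lambda record: not predicate(record)


# Placeholder records, invented for illustration.
RECORDS = [
    {"id": "EXP-1", "species": "Homo sapiens", "tissue": "liver"},
    {"id": "EXP-2", "species": "Homo sapiens", "tissue": "brain"},
    {"id": "EXP-3", "species": "Mus musculus", "tissue": "liver"},
]

# species = Homo sapiens AND (tissue = liver OR tissue = brain) AND NOT id = EXP-2
query = all_of(
    field_equals("species", "Homo sapiens"),
    any_of(field_equals("tissue", "liver"), field_equals("tissue", "brain")),
    negate(field_equals("id", "EXP-2")),
)
print([r["id"] for r in RECORDS if query(r)])
```

Because every combinator returns another predicate, queries of arbitrary depth can be built from the same few pieces.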
18 PRIDE Acknowledgements
- Lennart Martens
- Samuel Kerrien
- Antony Quinn
- Mark Rijnbeek
- Kai Runte
- Chris Taylor
- Henning Hermjakob
- Weimin Zhu
- Rolf Apweiler