Title: Data Mining
1Data Mining and Analysis Tools: Administrative Resource at GU-PIR
- Cathy Wu, Peter McGarvey, Raja Mazumder
- June 13-14, 2005
- Blacksburg, VA
2- Objectives
- Functional Analysis of PRC Proteomic Data
- Public Dissemination of Analysis Results and Tools
- Features
- Home: http://www.proteomicsresource.org/
- Hosted at: http://pir.georgetown.edu/proteomics/
- Protein Master Catalog
- Complete Proteomes
- Data Mining and Analysis Tools
3Data Integration at Admin Resource
Master Catalog and Complete Proteomes at GU-PIR
Protein ID and Peptide/Protein Sequence Mapping
Integrated Data at VBI
Data Exchange Format, Controlled Vocabulary, and Ontology
Multiple Data Types from Proteomics Research Centers
4Complete Proteomes
- Connect to Master Catalog
- Direct Report Retrieval
- Sequence Search (Check Box)
- Text Search - Over 60 Fields (Input Box)
- Display Option
- Save Option
Demo
5Text Search Summary Table
6Master Catalog
- Link to Experimental Data at VBI
- Text and Sequence Search
- Direct Report Retrieval
- UniProtKB Protein Report
- iProClass Protein Report
- BioThesaurus Name Report
- PIRSF Family Report
- UniRef50 Cluster Report
Demo
7UniProtKB Report (I)
ID and Accession
Name and Taxon
Bibliography
8UniProtKB Report (II)
Sequence Feature
Demo
9BioThesaurus Report
- Gene and Protein Name Mapping
- Search Synonyms
- Resolve Name Ambiguity
- Underlying ID Mapping
10Analysis of Homologous Proteins
- Fully Curated/Characterized Protein Family: PIRSF Family Report
- No Family Classification
- UniRef50 Cluster Report: Automated Clustering at 50% Identity
- Related Sequences: Pre-computed BLAST Sequence Neighbors at e-10 or Top 300 Hits
- Conserved Hypothetical Family: SEED Genome Context
11PIRSF Family Report
Curated Protein Family Information
Demo
12UniRef50 Cluster Report
No PIRSF
13Related Sequences
No PIRSF
- Pre-computed BLAST Sequence Neighbors
- At e-10 or Top 300 Hits
Demo
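The neighbor selection described on this slide can be illustrated with a minimal sketch. It assumes the rule means "keep hits whose E-value is at or below the cutoff, best first, and never more than 300"; the slide does not spell out how the two criteria combine, so that reading is an assumption. The Hit type, the select_neighbors function, and the sample identifiers are hypothetical and are not part of the PIR pipeline.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    subject_id: str  # hypothetical identifier of the neighbor sequence
    evalue: float    # BLAST E-value for the query/subject pair


def select_neighbors(hits, max_evalue=1e-10, max_hits=300):
    """Keep hits at or below the E-value cutoff, best first, capped at max_hits.

    Assumption: the cutoff and the hit cap are applied together.
    """
    passing = [h for h in hits if h.evalue <= max_evalue]
    passing.sort(key=lambda h: h.evalue)  # smallest E-value first
    return passing[:max_hits]


# Made-up hits for illustration only.
example_hits = [
    Hit("seqA", 1e-50),
    Hit("seqB", 3e-12),
    Hit("seqC", 2e-5),  # fails the E-value cutoff
]
neighbors = select_neighbors(example_hits)
print([h.subject_id for h in neighbors])
```

Pre-computing this list once per protein is what lets a report page show sequence neighbors without running BLAST at request time.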
14SEED Genome Context
- Genome Context for Conserved Hypothetical Protein Family
- Access via PIRSF Curation Interface
15Data Mining and Analysis Tools
- Current Tools
- Iterative Text and Sequence Searches
- Peptide Match
- Gene/Protein ID Mapping
- Prototypes Developed
- iProXpress for Protein Function and Pathway Analysis
- RLIMS-P for Literature Mining of Protein Phosphorylation Site
- CUPID for Core/Unique Protein Identification in Related Proteomes
- To Be Developed
- iProMSid for Protein Identification from Tandem MS
- Immunoinformatics for Epitope Prediction in Pathogen Proteomes
16Peptide Match
- Exact Peptide Sequence Match
- Search against UniProtKB or Selected Complete Proteomes
- Example: UniProtKB Search MRPGQGLTEITCRILEGLKPV
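An exact peptide match amounts to a substring search over every sequence in the chosen database. The sketch below shows the idea on an in-memory dictionary. The peptide_match function and the toy protein entries are hypothetical; the toy sequences are invented and are not UniProtKB records, and the real tool's indexing strategy is not described in the slides.

```python
def peptide_match(peptide, proteins):
    """Return (protein_id, 1-based start) for every exact occurrence of peptide."""
    peptide = peptide.strip().upper()
    if not peptide:
        return []
    matches = []
    for protein_id, sequence in proteins.items():
        start = sequence.find(peptide)
        while start != -1:
            matches.append((protein_id, start + 1))
            # Continue one residue further so overlapping matches are found.
            start = sequence.find(peptide, start + 1)
    return matches


# Invented sequences for illustration; the first embeds the slide's example peptide.
toy_proteins = {
    "toy_protein_1": "AAAMRPGQGLTEITCRILEGLKPVKKK",
    "toy_protein_2": "MKTAYIAKQRQISFVKSHFSRQ",
}
hits = peptide_match("MRPGQGLTEITCRILEGLKPV", toy_proteins)
print(hits)
```

A linear scan like this is enough for a handful of proteomes; a production service searching all of UniProtKB would presumably use an index instead.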
17ID Mapping
- Gene/Protein ID Mapping to UniProtKB Accession Number
- Over 20 Database Identifiers, e.g., NCBI GI, PDB ID
- Example: GI to UniProtKB Mapping
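Conceptually, ID mapping is a lookup from a source identifier to one or more UniProtKB accessions, with unmapped identifiers reported back to the user. The following is a rough sketch of that lookup. The map_ids function and the mapping table are hypothetical, and every identifier in it is a placeholder rather than a real GI number or accession.

```python
def map_ids(source_ids, mapping):
    """Split source IDs into mapped (id -> sorted accessions) and unmapped."""
    mapped, unmapped = {}, []
    for source_id in source_ids:
        accessions = mapping.get(source_id)
        if accessions:
            mapped[source_id] = sorted(accessions)
        else:
            unmapped.append(source_id)
    return mapped, unmapped


# Placeholder identifiers, NOT real GI numbers or UniProtKB accessions.
gi_to_uniprot = {
    "gi_0001": {"ACC_A"},
    "gi_0002": {"ACC_B", "ACC_C"},  # one source ID may map to several accessions
}
mapped, unmapped = map_ids(["gi_0001", "gi_0002", "gi_9999"], gi_to_uniprot)
print(mapped)
print(unmapped)
```

Returning the unmapped identifiers separately makes it easy to see which inputs need a second look, for example obsolete or mistyped IDs.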
18iProXpress: Expression Analysis
- Global Bioinformatics Analysis of Proteins
- Protein Mapping
- Sequence Analysis and Data Mining
- Display of Protein Information Matrix
- Categorization and Visualization for Function/Pathway Discovery
19iProXpress Prototype
- Display of Protein Information Matrix
- Categorization and Visualization for Function/Pathway Discovery
Demo
20RLIMS-P: Rule-based Literature Mining System - Phosphorylation
- Functionality
- Identify Phosphorylation-Related MEDLINE Abstracts: 96% recall
- Extract Phosphorylation Information (Substrate, Kinase, Site): 98% Precision for Substrate/Site Extraction
- Applications
- UniProtKB Site Feature Annotation
- Proteomics Protein Identification
http://pir.georgetown.edu/iprolink/rlimsp
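For readers less familiar with the recall and precision figures quoted for RLIMS-P, here is a small sketch of how the two measures are computed from a set of system outputs and a gold-standard set. The precision_recall function and the toy abstract IDs are hypothetical and are not the RLIMS-P evaluation data.

```python
def precision_recall(predicted, relevant):
    """Return (precision, recall) for predicted items against a gold standard."""
    predicted, relevant = set(predicted), set(relevant)
    true_positives = len(predicted & relevant)
    # Precision: fraction of what the system returned that is correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: fraction of the gold standard that the system found.
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Toy abstract IDs for illustration only.
relevant = {"a1", "a2", "a3", "a4"}   # abstracts a curator marked as relevant
predicted = {"a1", "a2", "a3", "a9"}  # abstracts the system retrieved
p, r = precision_recall(predicted, relevant)
print(p, r)
```

High recall matters most for the retrieval step, where missing a relevant abstract is costly, while high precision matters most for the extraction step, where wrong substrate or site annotations would propagate into databases.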
21CUPID: Core and Unique Protein Identification System
- Homology Analysis of Related Proteomes
- Genus/Species/Strain Levels
- Pathogenic vs. Non-Pathogenic
Demo
22iProMSid: Protein Identification
- Protein Identification from Tandem Mass Spectrometry
- Comprehensive Sequence Library (including Compressed Genomic Data)
- Reanalysis Pipeline for MS Peak List Data
23Immunoinformatics: Epitope Prediction
- T-Cell Epitope Prediction in Pathogen Proteomes
- Programs
- MHCPred: MHC class I and II
- NetMHC: MHC class I
- ProPred: MHC class II
- NetChop: Proteasome Cleavage
24Acknowledgments
- Funding
- NIH (NIAID, NHGRI, NCI), NSF, Air Force, DOD, DCWASA
- PIR Team
- Executive Team: Cathy Wu (Co-PI), Winona Barker, Peter McGarvey (Project Manager, IWG), Hongzhan Huang, Lai-Su Yeh, Darran Natale
- Protein Science Team: Anastasia Nikolskaya, Zhang-Zhi Hu, Raja Mazumder, CR Vinayaka, Sona Vasudevan, Xianying Wei, Robert Ledley, Cecilia Arighi, Xin Yuan, Christina Fang, Vincent Hermoso
- Informatics Team: Baris Suzek (IWG), Leslie Arminski, Sehee Chung, Hsing-Kuo Hua, Yongxing Chen, Jing Zhang, Jess Cannata
- Student: Natalia Petrova
- UniProt Consortium: EBI, SIB, PIR
25Begin demo
http://pir.georgetown.edu/proteomics/
26Proteome text search
27iProClass Link
29Link to SF Page
30Link to Taxonomy
Proteome Text Search
31Proteome: 5624 Seq; RefSeq: 5618 Seq; UniProt: 5622 Seq
Proteome text search
HIS2_BACAN
32View Statistics: Pathway, Function Category, Process Category
33CellActivation
35Core and Unique Protein IDentification System (CUPID)
- Identification of strain-, species-, and genus-specific proteins
36Pilot Project Aims
- Develop a robust method to identify unique proteins that can be used to develop strain-, species-, and genus-specific diagnostic probes
- The method would include rigorous computational and manual analysis of suspected homologs
- Allow identification of unique proteins from any organism
37Methodology
- Computer-assisted manual curation
- All efforts are made to detect homologs. If no homologs can be found, the protein is considered unique
- Algorithm involves pairwise comparisons (BLAST and reciprocal BLAST) followed by manual curation
- Evidence tagging allows selection of an appropriate unique set
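The BLAST and reciprocal-BLAST step with evidence tagging can be pictured with a minimal sketch. It assumes the pairwise searches have already been run and reduced to two lookup tables of significant hits. The tag_proteins function, the three tag names, and the toy identifiers are all hypothetical; the slides do not give CUPID's actual tags or thresholds, and the manual curation step is not modeled.

```python
def tag_proteins(proteins, forward_hits, reverse_hits):
    """Assign an evidence tag to each query protein.

    forward_hits: query protein -> subjects it hits in other proteomes
    reverse_hits: subject protein -> proteins it hits when searched back
    """
    tags = {}
    for protein in proteins:
        subjects = forward_hits.get(protein, [])
        if not subjects:
            # No homolog detected: candidate for the unique set.
            tags[protein] = "candidate_unique"
        elif any(protein in reverse_hits.get(s, []) for s in subjects):
            # At least one subject hits the query back: reciprocal evidence.
            tags[protein] = "reciprocal_homolog"
        else:
            # Hits exist but none is reciprocated: flag for manual review.
            tags[protein] = "one_way_hit_review"
    return tags


# Toy identifiers for illustration only.
proteins = ["p1", "p2", "p3"]
forward_hits = {"p1": ["q1"], "p2": ["q2"]}
reverse_hits = {"q1": ["p1"], "q2": ["other"]}
tags = tag_proteins(proteins, forward_hits, reverse_hits)
print(tags)
```

Keeping a tag per protein, rather than a yes/no flag, is what lets a user later choose a stricter or looser unique set without rerunning the comparisons.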
38Selecting An Organism
Genus specific
Select
39Strain-, Species-, and Genus-Specific Proteins
Additional flexibility in the results page
Summary results
40List Of Unique Proteins
41Genes With Cellular Homologs
Present in the genus Orthopoxvirus but not present in any other Poxviridae
- Camelpox virus (CMPV)
- Cowpox virus (CPV)
- Ectromelia virus (EPV)
- Monkeypox virus (MPV)
- Rabbitpox virus (RPV)
- Vaccinia (Copenhagen), VAC
- Variola-India 1967 (smallpox virus), VAV
42Salient Features Of The Algorithm
- Provides sets of unique proteins at the strain, species, and genus level
- Uses a rigorous algorithm that includes manual verification
- Uses different parameters for short sequences
- Allows retrieval of unique and core unique proteins at different taxonomic levels
- Provides the identity of the nearest non-self neighbor
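Retrieval of unique and core unique proteins at a chosen taxonomic level can be sketched with set operations, once each proteome has been reduced to a set of homolog-group labels. The definitions used here are assumptions made for illustration: "core" means present in every in-group proteome, "unique" means absent from every out-group proteome, and "core unique" means both. The core_and_unique function and the toy labels are hypothetical.

```python
def core_and_unique(families_by_proteome, in_group):
    """Return (core, unique, core_unique) family sets for the in-group.

    Assumes in_group names at least one proteome present in the table.
    """
    in_sets = [families_by_proteome[name] for name in in_group]
    out_sets = [
        families
        for name, families in families_by_proteome.items()
        if name not in in_group
    ]
    core = set.intersection(*in_sets)          # shared by all in-group members
    outside = set().union(*out_sets)           # anything seen outside the group
    unique = set().union(*in_sets) - outside   # seen only inside the group
    core_unique = core - outside               # shared by all and seen only inside
    return core, unique, core_unique


# Toy homolog-group labels for illustration only.
families = {
    "genusA_sp1": {"f1", "f2", "f3"},
    "genusA_sp2": {"f1", "f2", "f4"},
    "genusB_sp1": {"f2", "f5"},
}
core, unique, core_unique = core_and_unique(families, ["genusA_sp1", "genusA_sp2"])
print(sorted(core), sorted(unique), sorted(core_unique))
```

Changing the in_group list from all members of a genus to a single species or strain gives the corresponding species-level or strain-level sets from the same table.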
43Conclusion
- The unique sets of proteins could be used as potential diagnostic, therapeutic, or vaccine targets
- At the computational level, protein comparisons are more sensitive than DNA comparisons
- Manuscript submitted