Data Mining - PowerPoint PPT Presentation


Title: Data Mining


1
Data Mining and Analysis Tools
Administrative Resource @ GU-PIR
  • Cathy Wu, Peter McGarvey, Raja Mazumder
  • June 13-14, 2005
  • Blacksburg, VA

2
  • Objectives:
      • Functional Analysis of PRC Proteomic Data
      • Public Dissemination of Analysis Results and Tools
  • Features:
      • Home: http://www.proteomicsresource.org/
      • Hosted at: http://pir.georgetown.edu/proteomics/
      • Protein Master Catalog
      • Complete Proteomes
      • Data Mining and Analysis Tools

3
Data Integration at Admin Resource
Master Catalog and Complete Proteomes at GU-PIR
Protein ID and Peptide/Protein Sequence Mapping
Integrated Data at VBI
Data Exchange Format, Controlled Vocabulary and Ontology
Multiple Data Types from Proteomics Research Centers
4
Complete Proteomes
  • Connect to Master Catalog
  • Direct Report Retrieval
  • Sequence Search (Check Box)
  • Text Search - Over 60 Fields (Input Box)
  • Display Option
  • Save Option

Demo
5
Text Search Summary Table
6
  • Link to Experimental Data at VBI
  • Text and Sequence Search
  • Direct Report Retrieval
  • UniProtKB Protein Report
  • iProClass Protein Report
  • BioThesaurus Name Report
  • PIRSF Family Report
  • UniRef50 Cluster Report

Master Catalog
Demo
7
UniProtKB Report (I)
ID and Accession
Name and Taxon
Bibliography
8
UniProtKB Report (II)
Sequence Feature
Demo
9
BioThesaurus Report
  • Gene and Protein Name Mapping
  • Search Synonyms
  • Resolve Name Ambiguity
  • Underlying ID Mapping

10
Analysis of Homologous Proteins
  • Fully Curated/Characterized Protein Family: PIRSF Family Report
  • No Family Classification:
      • UniRef50 Cluster Report: Automated Clustering at 50% Identity
      • Related Sequences: Pre-computed BLAST Sequence Neighbors at e10 or Top 300 Hits
  • Conserved Hypothetical Family: SEED Genome Context

11
PIRSF Family Report
Curated Protein Family Information
Demo
12
UniRef50 Cluster Report
No PIRSF
13
Related Sequences
No PIRSF
  • Pre-computed BLAST Sequence Neighbors
  • At e10 or Top 300 Hits

Demo
14
SEED Genome Context
  • Genome Context for Conserved Hypothetical
    Protein Family
  • Access via PIRSF Curation Interface

15
Data Mining and Analysis Tools
  • Current Tools:
      • Iterative Text and Sequence Searches
      • Peptide Match
      • Gene/Protein ID Mapping
  • Prototypes Developed:
      • iProXpress for Protein Function and Pathway Analysis
      • RLIMS-P for Literature Mining of Protein Phosphorylation Sites
      • CUPID for Core/Unique Protein Identification in Related Proteomes
  • To Be Developed:
      • iProMSid for Protein Identification from Tandem MS
      • Immunoinformatics for Epitope Prediction in Pathogen Proteomes

16
Peptide Match
  • Exact Peptide Sequence Match
  • Search against UniProtKB or Selected Complete
    Proteomes
  • Example: UniProtKB Search for MRPGQGLTEITCRILEGLKPV
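
The exact-match idea on this slide can be illustrated with a minimal Python sketch. It is not the PIR Peptide Match implementation: the function name `peptide_match`, the toy protein IDs, and the flanking residues are assumptions made up for illustration. Only the query peptide comes from the slide.

```python
def peptide_match(peptide, proteins):
    """Return (protein_id, 1-based start) for every exact occurrence of peptide.

    `proteins` maps a protein identifier to its amino-acid sequence.
    Overlapping occurrences are reported as well.
    """
    peptide = peptide.upper()
    hits = []
    for protein_id, sequence in proteins.items():
        sequence = sequence.upper()
        start = sequence.find(peptide)
        while start != -1:
            hits.append((protein_id, start + 1))
            start = sequence.find(peptide, start + 1)
    return hits


# Hypothetical toy "proteome": IDs and flanking residues are invented.
toy_proteome = {
    "PROT_A": "MKT" + "MRPGQGLTEITCRILEGLKPV" + "AAG",
    "PROT_B": "MSLNNTRPGQGLTEIT",  # contains only a fragment, so no exact match
}

hits = peptide_match("MRPGQGLTEITCRILEGLKPV", toy_proteome)
print(hits)
```

A linear scan like this is adequate for a single proteome. Searching all of UniProtKB would normally call for an index, which this sketch does not attempt.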

17
ID Mapping
  • Gene/Protein ID Mapping to UniProtKB Accession
    Number
  • Over 20 Database Identifiers (e.g., NCBI GI, PDB ID)
  • Example: GI to UniProtKB Mapping
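
Conceptually, ID mapping is a lookup from a (source database, identifier) pair to a UniProtKB accession. The sketch below assumes a small in-memory table. The table contents are placeholders rather than real database entries, and `map_ids` is a hypothetical helper, not part of any PIR API.

```python
# Hypothetical mapping table: (source database, identifier) -> UniProtKB accession.
# Every identifier and accession below is a placeholder.
ID_TO_UNIPROT = {
    ("GI", "0000001"): "X00001",
    ("GI", "0000002"): "X00002",
    ("PDB", "0XYZ"): "X00001",
}


def map_ids(db, identifiers, table=ID_TO_UNIPROT):
    """Split identifiers into those with a known accession and those without."""
    mapped, unmapped = {}, []
    for identifier in identifiers:
        accession = table.get((db, identifier))
        if accession is None:
            unmapped.append(identifier)
        else:
            mapped[identifier] = accession
    return mapped, unmapped


mapped, unmapped = map_ids("GI", ["0000001", "0000002", "9999999"])
print(mapped)
print(unmapped)
```

Returning the unmapped identifiers separately keeps failed lookups visible instead of silently dropping them.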

18
iProXpress: Expression Analysis
  • Global Bioinformatics Analysis of Proteins
  • Protein Mapping
  • Sequence Analysis and Data Mining
  • Display of Protein Information Matrix
  • Categorization and Visualization for Function/Pathway Discovery

19
iProXpress Prototype
  • Display of Protein Information Matrix
  • Categorization and Visualization for
    Function/Pathway Discovery

Demo
20
RLIMS-P: Rule-based Literature Mining System - Phosphorylation
  • Functionality:
      • Identify Phosphorylation-Related MEDLINE Abstracts (96% Recall)
      • Extract Phosphorylation Information (Substrate, Kinase, Site) (98% Precision for Substrate/Site Extraction)
  • Applications:
      • UniProtKB Site Feature Annotation
      • Proteomics Protein Identification
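
For readers unfamiliar with the two metrics quoted above, the following sketch shows how precision and recall are computed from evaluation counts. The counts are invented for illustration and are not RLIMS-P results.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    predicted = true_pos + false_pos
    relevant = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / relevant if relevant else 0.0
    return precision, recall


# Made-up counts: 48 correct extractions, 2 spurious ones, 12 missed.
precision, recall = precision_recall(true_pos=48, false_pos=2, false_neg=12)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

High precision means few spurious extractions, and high recall means few missed items. The slide reports recall for abstract retrieval and precision for substrate/site extraction, which are separate tasks.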

http://pir.georgetown.edu/iprolink/rlimsp
21
CUPID: Core and Unique Protein Identification System
  • Homology Analysis of Related Proteomes
  • Genus/Species/Strain Levels
  • Pathogenic vs. Non-Pathogenic

Demo
22
iProMSid: Protein Identification
  • Protein Identification from Tandem Mass
    Spectrometry
  • Comprehensive Sequence Library (including
    Compressed Genomic Data)
  • Reanalysis Pipeline for MS Peak List Data

23
Immunoinformatics: Epitope Prediction
  • T-Cell Epitope Prediction in Pathogen Proteomes
  • Programs:
      • MHCPred: MHC class I and II
      • NetMHC: MHC class I
      • ProPred: MHC class II
      • NetChop: Proteasome Cleavage

24
Acknowledgments
  • Funding:
      • NIH (NIAID, NHGRI, NCI), NSF, Air Force, DOD, DCWASA
  • PIR Team:
      • Executive Team: Cathy Wu (Co-PI), Winona Barker, Peter McGarvey (Project Manager, IWG), Hongzhan Huang, Lai-Su Yeh, Darren Natale
      • Protein Science Team: Anastasia Nikolskaya, Zhang-Zhi Hu, Raja Mazumder, CR Vinayaka, Sona Vasudevan, Xianying Wei, Robert Ledley, Cecilia Arighi, Xin Yuan, Christina Fang, Vincent Hermoso
      • Informatics Team: Baris Suzek (IWG), Leslie Arminski, Sehee Chung, Hsing-Kuo Hua, Yongxing Chen, Jing Zhang, Jess Cannata
      • Student: Natalia Petrova
  • UniProt Consortium: EBI, SIB, PIR

25
Begin demo
http://pir.georgetown.edu/proteomics/
26
Proteome text search
27
iProClass Link
29
Link to SF Page
30
Link to Taxonomy
Proteome Text Search
31
Proteome: 5624 Seq; RefSeq: 5618 Seq; UniProt: 5622 Seq
Proteome text search
HIS2_BACAN
32
View Statistics: Pathway, Function Category, Process Category
33
CellActivation
35
Core and Unique Protein IDentification System
(CUPID)
  • Identification of strain-, species- and
    genus-specific proteins

36
Pilot Project Aims
  • Develop a robust method to identify unique
    proteins that can be utilized to develop strain-,
    species- and genus-specific diagnostic probes.
  • The method would include rigorous computational
    and manual analysis of suspected homologs
  • Allow identification of unique proteins from any
    organism

37
Methodology
  • Computer assisted manual curation
  • All efforts are made to detect homologs. If no
    homologs can be found then the protein is
    considered unique
  • Algorithm involves pairwise comparisons (BLAST
    and reciprocal BLAST) followed by manual curation
  • Evidence tagging allows selection of appropriate
    unique set
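
The automated part of this workflow can be sketched as follows, assuming the BLAST and reciprocal BLAST searches have already been run and filtered into hit tables. The function name, the input layout, and the toy identifiers are assumptions for illustration, not CUPID's actual code. The manual curation and evidence-tagging steps are not modeled.

```python
def split_unique_and_candidates(target_proteins, forward_hits, reverse_hits):
    """Separate proteins with no detected homolog from those needing curation.

    forward_hits: target protein -> set of outgroup proteins it hit (BLAST).
    reverse_hits: outgroup protein -> set of target proteins it hit
                  (reciprocal BLAST).
    A protein is provisionally unique only if neither direction found a hit.
    """
    hit_in_reverse = set()
    for targets in reverse_hits.values():
        hit_in_reverse.update(targets)

    unique, candidates = [], []
    for protein in target_proteins:
        if forward_hits.get(protein) or protein in hit_in_reverse:
            candidates.append(protein)  # suspected homolog: manual curation
        else:
            unique.append(protein)
    return unique, candidates


# Toy data with invented identifiers.
targets = ["T1", "T2", "T3"]
forward = {"T1": {"O9"}, "T2": set()}
reverse = {"O7": {"T2"}}

unique, candidates = split_unique_and_candidates(targets, forward, reverse)
print(unique, candidates)
```

Checking both search directions reflects the slide's statement that all efforts are made to detect homologs before a protein is called unique.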

38
Selecting An Organism
Genus specific
Select
39
Strain-, Species- and Genus-Specific Proteins
Additional flexibility in the results page
Summary results
40
List Of Unique Proteins
41
Genes With Cellular Homologs
Present in the genus Orthopoxvirus but not present in any other Poxviridae:
  • Camelpox virus (CMPV)
  • Cowpox virus (CPV)
  • Ectromelia virus (EPV)
  • Monkeypox virus (MPV)
  • Rabbitpox virus (RPV)
  • Vaccinia (Copenhagen) (VAC)
  • Variola-India 1967 (smallpox virus) (VAV)
42
Salient Features Of The Algorithm
  • Provides sets of unique proteins at the strain,
    species, and genus level
  • Uses a rigorous algorithm that includes manual
    verification
  • Uses different parameters for short sequences
  • Allows retrieval of unique and core unique
    proteins at different taxonomic levels
  • Provides the identity of the nearest non-self
    neighbor
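
One way to picture retrieval of unique and core unique proteins at a chosen taxonomic level is with set operations. This sketch rests on two assumptions the slides do not spell out: that proteins have already been grouped into homolog families, and that "core unique" means present in every member of the group and absent outside it. The family labels and organism names are placeholders.

```python
def unique_and_core_unique(group, outgroup):
    """Return (unique, core_unique) family sets for a taxonomic group.

    group, outgroup: dict mapping organism name -> set of family labels.
    `group` must contain at least one organism.
    unique: found in at least one group member and in no outgroup organism.
    core_unique: found in every group member and in no outgroup organism.
    """
    in_any_member = set.union(*group.values())
    in_all_members = set.intersection(*group.values())
    in_outgroup = set().union(*outgroup.values())
    return in_any_member - in_outgroup, in_all_members - in_outgroup


# Placeholder data: two members of a genus and one organism outside it.
genus = {
    "member_1": {"fam1", "fam2", "fam3"},
    "member_2": {"fam1", "fam2", "fam4"},
}
others = {"outsider": {"fam2", "fam5"}}

unique, core_unique = unique_and_core_unique(genus, others)
print(sorted(unique), sorted(core_unique))
```

Changing which organisms go into `group` and `outgroup` moves the analysis between the strain, species, and genus levels.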

43
Conclusion
  • The unique sets of proteins could be used as
    potential diagnostic, therapeutic or vaccine
    targets
  • At the computational level, protein comparisons
    are more sensitive than DNA comparisons
  • Manuscript submitted