Title: Anastasia Nikolskaya
1 COMPLEMENTING GENE ONTOLOGY WITH PIRSF
CLASSIFICATION-BASED PROTEIN ONTOLOGY
- Anastasia Nikolskaya
- PIR (Protein Information Resource)
- Georgetown University Medical Center
- www.uniprot.org, http://pir.georgetown.edu/
2 Why Protein Classification?
- Automatic annotation of protein sequences based on protein families (propagation of annotation)
- Systematic correction of annotation errors
- Protein name standardization in UniProt
- Functional predictions for uncharacterized protein families
3 PIRSF Classification System
- PIRSF: a network structure with hierarchies from superfamilies to subfamilies that reflects evolutionary relationships of full-length proteins
- Definitions
  - Basic unit: homeomorphic family
  - Homologous (common ancestry): inferred by sequence similarity
  - Homeomorphic: full-length sequence similarity and common domain architecture
  - Network structure: flexible number of levels with varying degrees of sequence conservation
- Advantages
  - Annotation of both generic biochemical and specific biological functions
  - Accurate propagation of annotation and development of standardized protein nomenclature and ontology
4 Levels of protein classification
5 PIRSF Classification System
A protein may be assigned to only one
homeomorphic family, which may have zero or more
child nodes and zero or more parent nodes. Each
homeomorphic family may have as many domain
superfamily parents as its members have domains.
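The constraints on this slide can be sketched as a small data structure. This is a minimal illustration, not PIR's implementation: the class names, the `level` labels, and all identifiers such as `FAM_A` and `DOM_1` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of the classification network (hypothetical structure)."""
    node_id: str
    level: str  # "domain_superfamily", "family" (homeomorphic), or "subfamily"
    parents: set = field(default_factory=set)
    children: set = field(default_factory=set)


class Network:
    def __init__(self):
        self.nodes = {}      # node_id -> Node
        self.family_of = {}  # protein accession -> homeomorphic family id

    def add_node(self, node_id, level):
        self.nodes[node_id] = Node(node_id, level)

    def link(self, parent_id, child_id):
        # A node may have zero or more parents and zero or more children.
        self.nodes[parent_id].children.add(child_id)
        self.nodes[child_id].parents.add(parent_id)

    def assign(self, protein, family_id):
        # A protein may be assigned to only one homeomorphic family.
        if self.nodes[family_id].level != "family":
            raise ValueError("proteins are assigned at the homeomorphic family level")
        current = self.family_of.get(protein)
        if current not in (None, family_id):
            raise ValueError(f"{protein} already belongs to {current}")
        self.family_of[protein] = family_id


# Hypothetical example: a two-domain family with two domain superfamily parents.
net = Network()
for node_id, level in [
    ("DOM_1", "domain_superfamily"),
    ("DOM_2", "domain_superfamily"),
    ("FAM_A", "family"),
    ("FAM_B", "family"),
    ("SUB_A1", "subfamily"),
]:
    net.add_node(node_id, level)
net.link("DOM_1", "FAM_A")
net.link("DOM_2", "FAM_A")
net.link("FAM_A", "SUB_A1")
net.assign("PROT_X", "FAM_A")
```

Because a family can have several parents, the result is a network rather than a strict tree; assigning `PROT_X` to `FAM_B` as well would raise `ValueError`.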
6 PIRSF Classification System
- SF500001: stimulates trophoblast migration
- SF500002: stimulates proliferation of prostate cancer cells
- SF500003: anti-proliferative and pro-apoptotic effects on cancer cells
- SF500004: inhibitor of IGF
- SF500005: stimulates bone formation
- SF500006: inhibitor of IGF-II
7 Creation and curation of PIRSFs
- Computer-Generated (Uncurated) Clusters (36,000 PIRSFs)
- Preliminary Curation (5,000 PIRSFs)
  - Membership
  - Signature Domains
- Full Curation (1,300 PIRSFs)
  - Family Name with evidence tag
  - Description, Bibliography
Workflow diagram (labels only; arrows not preserved):
- UniProt proteins
- New proteins
- Automatic Procedure
- Unassigned proteins
- Automatic clustering
- Preliminary Homeomorphic Families
- Orphans
- Map domains on Families
- Automatic placement
- Merge/split clusters
- Add/remove members
- Computer-assisted Manual Curation
- Curated Homeomorphic Families
- Name, refs, abstract, domain arch.
- Final Homeomorphic Families
- Protein name rule/site rule
- Create hierarchies (superfamilies/subfamilies)
- Build and test HMMs
8 PIRSF-Based Protein Annotation in UniProt
UniProt is developing protein name standards and guidelines. Classification of proteins into
families provides a convenient and accurate mechanism to propagate curated information to
individual protein members.
- Rule-based annotation system using curated PIRSFs
  - Site Rules (PIRSR): position-specific site features (active sites, binding sites, modified sites, other functional sites)
  - Name Rules (PIRNR): transfer the name from the PIRSF to individual proteins (defining a subgroup if necessary)
    - Protein name (may differ from family name), synonyms, acronyms
    - EC number
    - Misnomers
    - GO terms (homeomorphic family-based, propagatable GO annotation)
    - Function
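The name-rule idea (transfer a curated name from a family to its members, optionally restricted to a subgroup) can be sketched as follows. This is an illustration under stated assumptions, not the PIRNR format: the `NameRule` fields, the `has_motif` condition, and all identifiers and names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class NameRule:
    rule_id: str        # hypothetical rule identifier
    family_id: str      # curated family the rule is attached to
    protein_name: str   # name to propagate (may differ from the family name)
    ec_number: Optional[str] = None
    # Optional predicate defining a subgroup of the family (assumption).
    condition: Optional[Callable[[dict], bool]] = None


def apply_name_rules(protein, rules):
    """Return the annotation from the first matching rule, or None.

    Subgroup-specific rules should be listed before the general family rule.
    """
    for rule in rules:
        if rule.family_id != protein["family_id"]:
            continue
        if rule.condition is not None and not rule.condition(protein):
            continue
        return {
            "name": rule.protein_name,
            "ec": rule.ec_number,
            "evidence": rule.rule_id,
        }
    return None


# Hypothetical rules: one subgroup rule, then the general rule for the family.
rules = [
    NameRule("RULE_A_sub", "FAM_A", "Example protein, motif-containing form",
             condition=lambda p: p.get("has_motif", False)),
    NameRule("RULE_A", "FAM_A", "Example protein"),
]

with_motif = apply_name_rules({"family_id": "FAM_A", "has_motif": True}, rules)
without_motif = apply_name_rules({"family_id": "FAM_A"}, rules)
unclassified = apply_name_rules({"family_id": "FAM_Z"}, rules)
```

Recording the rule identifier alongside the propagated name keeps each annotation traceable to the curated family it came from.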
9 PIRSF-Based Protein Ontology
- PIRSF family hierarchy is based on evolutionary relationships
- Standardized PIRSF family names
- Network structure (a DAG, directed acyclic graph) for the PIRSF family classification system
10 PIRSF to GO Mapping
- PIRSF to GO mapping provides a link between GO concepts and protein objects
- Mapped 5500 curated PIRSF homeomorphic families and subfamilies to the GO hierarchy
DynGO viewer (Hongfang Liu, University of Maryland)
- Superimpose GO and PIRSF hierarchies
- Bidirectional display (GO-centric or PIRSF-centric views)
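The two views can be thought of as two indexes over the same mapping table. The sketch below is not DynGO code; the function name and all identifiers are placeholders chosen for illustration.

```python
from collections import defaultdict


def build_views(pairs):
    """Build PIRSF-centric and GO-centric indexes from (pirsf_id, go_id) pairs."""
    pirsf_to_go = defaultdict(set)
    go_to_pirsf = defaultdict(set)
    for pirsf_id, go_id in pairs:
        pirsf_to_go[pirsf_id].add(go_id)
        go_to_pirsf[go_id].add(pirsf_id)
    return dict(pirsf_to_go), dict(go_to_pirsf)


# Placeholder identifiers, not real mappings.
pairs = [
    ("PIRSF_A", "GO:TERM_1"),
    ("PIRSF_A", "GO:TERM_2"),
    ("PIRSF_B", "GO:TERM_1"),
]
pirsf_view, go_view = build_views(pairs)
```

Here `pirsf_view["PIRSF_A"]` lists the GO terms for one family, while `go_view["GO:TERM_1"]` lists every family mapped to one GO term.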
11 Protein Ontology Can Complement GO
- Expanding a node
  - Identification of GO subtrees that need expansion if GO concepts are too broad
  - 67% of curated PIRSF families and subfamilies map to GO leaf nodes
  - Among these, 2209 PIRSFs have shared GO leaf nodes (many PIRSFs to 1 GO leaf)
  - Example: PIRSF001969 vs. PIRSF018239 and PIRSF036495 (high- vs. low-affinity IGF binding)
- Identification of missing GO nodes
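One way to find GO leaf nodes that may need expansion is to look for leaves shared by several PIRSFs. The sketch below assumes a simple parent-to-children map for GO and a PIRSF-to-GO map; the function name and the GO identifiers are placeholders, and mapping the three PIRSFs from the example to one placeholder leaf only mirrors the "many PIRSFs to 1 GO leaf" situation described above.

```python
from collections import defaultdict


def shared_leaf_nodes(go_children, pirsf_to_go):
    """Return GO leaf terms that more than one PIRSF maps to.

    go_children: GO id -> list of child GO ids (missing or empty means leaf).
    pirsf_to_go: PIRSF id -> set of GO ids.
    """
    by_leaf = defaultdict(set)
    for pirsf_id, go_ids in pirsf_to_go.items():
        for go_id in go_ids:
            if not go_children.get(go_id):  # no children: a leaf term
                by_leaf[go_id].add(pirsf_id)
    return {go_id: sorted(ids) for go_id, ids in by_leaf.items() if len(ids) > 1}


# Placeholder GO ids; the real GO terms are not given on the slide.
go_children = {"GO:PARENT": ["GO:LEAF_1", "GO:LEAF_2"], "GO:LEAF_1": []}
pirsf_to_go = {
    "PIRSF001969": {"GO:LEAF_1"},
    "PIRSF018239": {"GO:LEAF_1"},
    "PIRSF036495": {"GO:LEAF_1"},
    "PIRSF_OTHER": {"GO:LEAF_2"},   # hypothetical family
    "PIRSF_BROAD": {"GO:PARENT"},   # hypothetical family mapped to a non-leaf
}
candidates = shared_leaf_nodes(go_children, pirsf_to_go)
```

Each key in `candidates` is a leaf term where functionally distinct families collapse onto one concept, which makes it a candidate for adding child terms.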
12 Protein Ontology Can Complement GO
Identification of Missing GO Nodes (higher levels)
13 Protein Ontology Can Complement GO
- Linking function, biological process, and cellular component through a protein object, based on protein annotations
- Mechanism to examine the relationships between the three GO ontologies based on the shared annotations at different protein family levels
- Example: molecular function "estrogen receptor activity" and biological process "signal transduction", "estrogen receptor signaling pathway"
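The linking mechanism can be sketched by counting which terms from two GO ontologies are annotated to the same protein family. This is an illustration only: the function name, the family identifiers, and the second family's annotations are hypothetical, and only the term names come from the example above.

```python
from collections import Counter
from itertools import product


def cross_ontology_pairs(family_annotations,
                         left="molecular_function",
                         right="biological_process"):
    """Count (left term, right term) pairs that share a protein family."""
    counts = Counter()
    for terms in family_annotations.values():
        left_terms = sorted(terms.get(left, ()))
        right_terms = sorted(terms.get(right, ()))
        for pair in product(left_terms, right_terms):
            counts[pair] += 1
    return counts


# Hypothetical families; term names follow the estrogen receptor example.
family_annotations = {
    "FAM_ER": {
        "molecular_function": {"estrogen receptor activity"},
        "biological_process": {"signal transduction",
                               "estrogen receptor signaling pathway"},
    },
    "FAM_ER_LIKE": {
        "molecular_function": {"estrogen receptor activity"},
        "biological_process": {"signal transduction"},
    },
}
links = cross_ontology_pairs(family_annotations)
```

Running the same count at the superfamily, family, and subfamily levels would show how the association between terms changes with the granularity of the protein object.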
14 PIRSF Protein Classification: a link between GO and protein objects
- Annotation quality
  - Annotation of biological function of whole proteins
  - Annotation of uncharacterized hypothetical proteins
  - Correction of annotation errors and underannotations
- Standardization of protein names
- PIRSF to GO mapping provides a link between GO sub-ontologies and protein objects
15 PIRSF-based Protein Ontology Can Complement GO
- Identification of GO subtrees that need expansion if GO concepts are too broad
- Comprehensive classification of related protein families in PIRSF can help in identification of missing GO nodes when entire groups of PIRSF superfamilies or families cannot be mapped to existing GO terms
- Mechanism to examine the relationships between the three GO ontologies (molecular function, biological process, and cellular component), as well as between GO concepts, based on the shared annotations at different protein family levels
16 Acknowledgements
- Hongfang Liu, University of Maryland
- Judith Blake, The Jackson Laboratory
- Dr. Cathy Wu, Director
- Protein Classification team
  - Dr. Winona Barker; Dr. Lai-Su Yeh; Dr. Anastasia Nikolskaya
  - Dr. Darren Natale; Dr. Zhangzhi Hu; Dr. Raja Mazumder
  - Dr. CR Vinayaka; Dr. Xianying Wei; Dr. Sona Vasudevan
- Informatics team
  - Dr. Hongzhan Huang; Baris Suzek, M.S.; Sehee Chung, M.S.
  - Dr. Leslie Arminski; Dr. Hsing-Kuo Hua; Yongxing Chen, M.S.
  - Jing Zhang, M.S.; Amar Kalelkar
- Students
  - Christina Fang; Vincent Hormoso; Natalia Petrova; Jorge Castro-Alvear
http://pir.georgetown.edu/
PIR Team
UniProt (SwissProt, TrEMBL, PIR)
www.uniprot.org