Title: Protein Function Prediction Based on Domain Content
1Protein Function Prediction Based on Domain
Content
Ankita Sarangi School of Informatics,
IUB Capstone Presentation, May 11,
2009 Advisor Yuzhen Ye
2What information can be used for function
annotation?
- Sequence based approaches
- Protein A has function X, and protein B is a
homolog (ortholog) of protein A Hence B has
function X - Structure-based approaches
- Protein A has structure X, and X has specific
structural features Hence Xs function sites are
used to assign function to the Protein A - Motif-based approaches (sequence motifs, 3D
motifs) - A group of genes have function X and they all
have motif Y protein A has motif Y Hence
protein As function might be related to X - Guilt-by-association
- Gene A has function X and gene B is often
associated with gene A, B might have function
related to X - Associations
- Domain fusion, phylogenetic profiling, PPI, etc.
3Function Annotation Using Protein Domains
- A protein domain is a part of protein sequence
and structure that can evolve, function, and
exist independently of the rest of the protein
chain. - Each domain forms a compact three-dimensional
structure and often can be independently folded. - Many proteins consist of several structural
domains. - Among relevant sequence features of a protein,
domains occupy a key position. They are
sequential and structural motifs found
independently in different proteins, in different
combinations, and as such seem to be the building
blocks of proteins
4Supra-domains
- However, it is also known that certain sets of
independent domains are frequently found
together, which may indicate functional
cooperation. - Supra- Domains A supra-domain is defined as a
domain combination in a particular
N-to-C-terminal orientation that occurs in at
least two different domain architectures in
different proteins with (i) different types of
domains at the N and C-terminal end of the
combination or (ii) different types of domains
at one end and no domain at the other. - A type of Supra-domain are ones whose activity is
created at the interface between the two domains
of a protein - (Ref JMB, 2004, 336809823)
- We may make mistakes if we do function prediction
based on individual domains - We know proteins that have domain A and B have
function F, what about proteins having domain A
or domain B only?
5My Work
- A survey of mis-annotation based on single
domains - We are interested to know how serious this
problem is in the current annotation system - There is no systematic survey on this so far
- Function annotation using domain patterns (domain
combinations) instead of individual domains - Utilize the relationship of the predicted
functions (as shown in the GO directed acyclic
graph of functions) - Provide a web-tool and visualization of the
predicted functions and their relationship with
domain patterns
6Protein Domains SUPERFAMILY
- SUPERFAMILY is a database of structural and
functional protein annotations for all completely
sequenced organisms. - The SUPERFAMILY web site and database provides
protein domain assignments, at the SCOP
'superfamily' and 'family' levels, for the
predicted protein sequences in over 900 organisms - We made a local copy of this database
7Protein Function GO
- The GENE ONTOLOGY(GO) project is a collaborative
effort to address the need for consistent
descriptions of gene products in different
databases. - Consists of three structured, controlled
vocabularies (ontologies) that describe gene
products in terms of their associated biological
processes, cellular components and molecular
functions in a species-independent manner.
8A Survey of Mis-Annotations Using Single Domains
We looked at several supra-domains listed in this
paper Supra-domains Evolutionary Units Larger
than Single Protein Domains Voget etal. J. Mol.
Biol. (2004) 336, 809823
9Finding Functions for Domains and Domain
Combinations
- Superfamily
- Use the SCOP ID of the domains to obtain Gene
Identifiers associated with the supra-domains as
well as their individual domains - UniProt ID mapping file
- (is a tab-delimited table, which includes
mappings for 20 different sequence identifier
types(example ENSGALP000, AN7518.2, Afu1g003,
gi41409236refNP_962072.1)and - gene_association.goa_uniprot
- (GO assignments for the UniProt KnowledgeBase
(UniProtKB)) - To obtain Swiss prot ID, GO ID
- Find Gene Ontology functions that are associated
with - proteins which contain both the domains and the
individual domains
10Supra-domain Example 1
(SCOP ID - 63380)
(SCOP ID 52343)
The N-terminal domain binds FAD and the
C-terminal domain binds NADPH. The FAD acts as an
intermediate in electron transfer between NADPH
and substrate, and this domain combination is
used by many different enzymes
11Function Annotation for Proteins with
Supra-Domains and Single Domains
100 IEA
12Function Example GO0016491 ((oxidoreductase
activity))
- 10 proteins with Supra domains annotated to
GO0016491---- 2491proteins with Supra domains - 3 proteins with Riboflavin Synthase domain-like
annotated to GO0016491 --- 42 proteins with
Riboflavin Synthase domain-like - 1 protein with reductase-like, C-terminal
NADP-linked domain annotated to GO0016491--- 47
protein with reductase-like, C-terminal
NADP-linked domain - Specific proteins searched and presence and
absence of the combined domain was confirmed
along with GO ID as well as annotation evidence
which was found to be Inferred Electronic
Annotation
13Example GO0016491
- Supra Domains Riboflavin Synthase domain-like,
Ferredoxin reductase-like, C-terminal NADP-linked
domain - Protein Name Oxidoreductase FAD-binding domain
protein - Gene Ontology Biological Process GO005514
is_a child of GO0008152 - molecular function GO0016491
- PFAM domains PF00970. FAD_binding
- PF00175 NAD_binding
- Evidence IEA (Inferred Electronic Annotation)
- Proteins A4FHX1 , A1UCP3, A4T5V2, A3PWD0
,Q1BCA1 - Protein Name Sulfide dehydrogenase
(Flavoprotein) subunit SudA sulfide dehydrogenase
(Flavoprotein) subunit SudB - Gene Ontology Biological Process GO005514
is_a child of GO0008152 - molecular function GO0016491
- PFAM Domains PF00175. NAD_binding
- PF07992. Pyr_redox (FAD_pyr_nucl-diS_OxRdtase.
) - Evidence IEA (Inferred Electronic Annotation)
- Proteins Q2J1U9, Q13CJ3, Q5PB24
- Protein Name Dihydroorotate dehydrogenase
electron transfer subunit, putative - Gene Ontology Biological Process GO005514
is_a child of GO0008152 - molecular function GO0016491
- PFAM domains PF00970. FAD_binding
- Riboflavin Synthase domain-like
- Protein Name Putative uncharacterized protein
- Gene Ontology Biological Process GO005514
is_a child of GO0008152 - molecular function GO0016491
- PFAM Domains PF07992 - Pyr_redox (Q0A5G3)
- OR
- PF08021. FAD_binding (A4FEM2, A1WVX7 )
- Evidence IEA (Inferred Electronic Annotation)
- Proteins Q0A5G3, A4FEM2, A1WVX7
- Ferredoxin reductase-like, C-terminal NADP-linked
domain -
- Protein Name Protein-P-II uridylyltransferase
- Gene Ontology Biological Process GO0008152 is
a parent of GO0006807 - molecular function GO0016491
- PFAM PF01966 - NAD Binding
- Evidence IEA (Inferred Electronic Annotation)
- Protein Q6MLQ2
- Ref http//www.uniprot.org/uniprot/
14Supra-domain Example 2
PreATP-grasp domain (SCOP ID 52440)
Glutathione synthetase ATP-binding domain-like
(SCOP ID 56059)
Lots of different enzymes forming carbonnitrogen
bonds have this combination of domains. Both
domains contribute to substrate binding and the
active site, and the C-terminal domain binds ATP
as well as the other substrate
15Function Annotation for Proteins with
Supra-Domains and Single Domains
75 IEA
16RESULTS
- Functional annotations were found to be shared by
proteins having the Supra-domains as well as the
single domains. - The percentage of proteins having Supra-domains
were much higher than single domains. - Since, both domains are required for the function
of the protein, the functions assigned to single
domain proteins may be said to be mis-annotated. - This study gave us motivation of developing a
computational tool for function annotation based
on domain combinations (domain patterns) instead
of individual domains
17A Novel Computational Tool for Function
Annotation Using Domain Content
- Utilize the relationship of the predicted
functions (as shown in the GO directed acrylic
graph of functions) - Provide a web-tool and visualization of the
predicted functions and their relationship with
domain patterns
18Probabilistic Approach
- Functional annotation term F (in this case a Gene
Ontology) and a domain set D. The probability
that a protein exhibiting D would possess F is
modeled as - P(FD)P(DF)P(F)/P(D)
- (i.e., posterior probability of a function given
a set of domains D P(DF), P(F), and P(D) can be
learned from proteins with known functions) - Ref Predicting protein function from domain
content Forslund et alBioinformatics, Vol. 24
no. 15 2008, pages 16811687
19Datasets Used
- Gene Ontology database
- gene_association.goa_uniprot
- Swisspfam
20Methodology
- For an input domain pattern (pfam domains)
- All the Pfam pattern containing the given pattern
are extracted (e.g., if input domain pattern is A
B, all the domain patterns that contain this
domain pattern will be considered, such as A B
C, etc) - GO function associated with all the domain
patterns are extracted - Calculate the probability using
P(FD)P(DF)P(F)/P(D) - number of proteins that occurs with the domain
pattern possessing the function - If the percentage probabilities lie close to one
another than the parent GO function is found and
a diagram depicting a sum of the distance of the
parent from the two children is printed
otherwise the GO terms that have P(FD) gt 0.9
MaxP(FD) are extracted - Summary graph providing all the GO functions
associated with the pattern search
21Example Domain Input PF03446 PF00393
22Example Domain Input PF02801
23Conclusion
- A survey of annotation based on single domains
- Function annotation using domain patterns (domain
combinations) instead of individual domains - To DO
- Do a more thorough survey with the annotation
studies of single domains - Define all the relationships between the GO IDs
in the Summary Graph - Refine and test the computational tool.
24Acknowledgements
- I would like to thank
- Dr. Yuzhen Ye
- Faculty of the Department of Bioinformatics
- Drs. Dalkilic, Kim, Hahn, Radivojac, Tang
-
- Linda Hostetter and Rachel Lawmaster
- Family and Friends