Protein Function Prediction Based on Domain Content


1
Protein Function Prediction Based on Domain
Content
Ankita Sarangi
School of Informatics, IUB
Capstone Presentation, May 11, 2009
Advisor: Yuzhen Ye
2
What information can be used for function
annotation?

Sequence-based approaches
Protein A has function X, and protein B is a
homolog (ortholog) of protein A. Hence B has
function X.
Structure-based approaches
Protein A has structure X, and X has specific
structural features. Hence X's functional sites
are used to assign function to protein A.
Motif-based approaches (sequence motifs, 3D
motifs)
A group of genes have function X and they all
have motif Y; protein A has motif Y. Hence
protein A's function might be related to X.
Guilt-by-association
Gene A has function X and gene B is often
associated with gene A, so B might have a
function related to X.
Associations: domain fusion, phylogenetic
profiling, protein-protein interaction (PPI), etc.

3
Function Annotation Using Protein Domains

A protein domain is a part of protein sequence
and structure that can evolve, function, and
exist independently of the rest of the protein
chain.
Each domain forms a compact three-dimensional
structure and often can be independently folded.
Many proteins consist of several structural
domains.
Among relevant sequence features of a protein,
domains occupy a key position. They are
sequential and structural motifs found
independently in different proteins, in different
combinations, and as such seem to be the building
blocks of proteins.

4
Supra-domains

However, it is also known that certain sets of
independent domains are frequently found
together, which may indicate functional
cooperation.
Supra-domains: A supra-domain is defined as a
domain combination in a particular
N-to-C-terminal orientation that occurs in at
least two different domain architectures in
different proteins with (i) different types of
domains at the N- and C-terminal ends of the
combination or (ii) different types of domains
at one end and no domain at the other.
One type of supra-domain is a combination whose
activity is created at the interface between the
two domains of a protein.
(Ref: J. Mol. Biol. (2004) 336, 809-823)
We may make mistakes if we predict function
based on individual domains.
If proteins that have both domain A and domain B
have function F, what about proteins that have
only domain A or only domain B?
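
The supra-domain definition on this slide can be read as a small search over domain architectures. The sketch below is one possible reading, not the procedure used in the cited paper: it assumes each architecture is a tuple of domain names in N-to-C order, considers only adjacent two-domain combinations, and treats "different types of domains" as two non-missing neighbours with different names. The function name `find_supra_domains` and the toy domain names are made up for illustration.

```python
from collections import defaultdict
from itertools import combinations


def find_supra_domains(architectures):
    """Return ordered domain pairs that satisfy the supra-domain definition.

    `architectures` is an iterable of tuples of domain names in N-to-C order.
    """
    # For every adjacent ordered pair, record (architecture, left neighbour,
    # right neighbour); None marks "no domain on that side".
    contexts = defaultdict(set)
    for arch in set(architectures):
        for i in range(len(arch) - 1):
            left = arch[i - 1] if i > 0 else None
            right = arch[i + 2] if i + 2 < len(arch) else None
            contexts[(arch[i], arch[i + 1])].add((arch, left, right))

    def differ(x, y):
        # Two real (non-missing) domains of different types.
        return x is not None and y is not None and x != y

    supra = set()
    for pair, seen in contexts.items():
        for (a1, l1, r1), (a2, l2, r2) in combinations(seen, 2):
            if a1 == a2:
                continue  # the two contexts must come from different architectures
            # Case (i): different domains at both ends of the combination.
            both_ends = differ(l1, l2) and differ(r1, r2)
            # Case (ii): different domains at one end, no domain at the other.
            one_end = ((l1 is None and l2 is None and differ(r1, r2)) or
                       (r1 is None and r2 is None and differ(l1, l2)))
            if both_ends or one_end:
                supra.add(pair)
                break
    return supra


# Toy architectures with made-up domain names.
toy = [("X", "A", "B", "Y"), ("Z", "A", "B", "W"), ("A", "C"), ("A", "C", "D")]
print(find_supra_domains(toy))
```

On the toy input only the A B combination qualifies: it is flanked by different domains on both sides in two architectures, whereas A C is seen once alone and once with a single C-terminal neighbour, which satisfies neither case.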

5
My Work

A survey of mis-annotation based on single
domains
We are interested in how serious this
problem is in the current annotation system.
There has been no systematic survey of this so far.
Function annotation using domain patterns (domain
combinations) instead of individual domains
Utilize the relationships among the predicted
functions (as shown in the GO directed acyclic
graph of functions)
Provide a web tool and visualization of the
predicted functions and their relationship with
domain patterns

6
Protein Domains: SUPERFAMILY

SUPERFAMILY is a database of structural and
functional protein annotations for all completely
sequenced organisms.
The SUPERFAMILY web site and database provide
protein domain assignments, at the SCOP
'superfamily' and 'family' levels, for the
predicted protein sequences in over 900 organisms.
We made a local copy of this database.

7
Protein Function: GO

The Gene Ontology (GO) project is a collaborative
effort to address the need for consistent
descriptions of gene products in different
databases.
It consists of three structured, controlled
vocabularies (ontologies) that describe gene
products in terms of their associated biological
processes, cellular components and molecular
functions in a species-independent manner.

8
A Survey of Mis-Annotations Using Single Domains
We looked at several supra-domains listed in this
paper: "Supra-domains: Evolutionary Units Larger
than Single Protein Domains", Vogel et al., J. Mol.
Biol. (2004) 336, 809-823
9
Finding Functions for Domains and Domain
Combinations

SUPERFAMILY
Use the SCOP IDs of the domains to obtain the gene
identifiers associated with the supra-domains as
well as with their individual domains.
UniProt ID mapping file
(a tab-delimited table that includes mappings for
20 different sequence identifier types, e.g.
ENSGALP000, AN7518.2, Afu1g003,
gi|41409236|ref|NP_962072.1) and
gene_association.goa_uniprot
(GO assignments for the UniProt KnowledgeBase
(UniProtKB))
Used to obtain the Swiss-Prot IDs and GO IDs.
Find the Gene Ontology functions that are associated
with proteins that contain both domains and with
proteins that contain only one of the individual
domains.
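
The lookup described on this slide can be sketched as a join between domain assignments and GO annotations. This is an illustrative sketch, not the scripts used in the project. It assumes the annotation file follows the tab-delimited GAF layout (protein ID in column 2, GO ID in column 5, evidence code in column 7), and that domain assignments have already been loaded into a dictionary. The protein IDs P1 to P4, the shortened seven-column lines, and the function names are hypothetical; the two SCOP IDs are the ones from Supra-domain Example 1.

```python
from collections import Counter


def parse_gaf_line(line):
    """Return (protein_id, go_id, evidence_code), or None for comments/blank lines."""
    if line.startswith("!") or not line.strip():
        return None
    cols = line.rstrip("\n").split("\t")
    # GAF columns (1-based): 2 = DB object ID, 5 = GO ID, 7 = evidence code.
    return cols[1], cols[4], cols[6]


def go_counts_by_group(domains_by_protein, gaf_lines, domain_a, domain_b):
    """Count distinct (protein, GO term) annotations per domain-content group."""
    groups = {"both": Counter(), "a_only": Counter(), "b_only": Counter()}
    seen = set()
    for line in gaf_lines:
        record = parse_gaf_line(line)
        if record is None:
            continue
        protein, go_id, _evidence = record
        if (protein, go_id) in seen:
            continue  # the same annotation can appear with several references
        seen.add((protein, go_id))
        domains = domains_by_protein.get(protein, set())
        has_a, has_b = domain_a in domains, domain_b in domains
        if has_a and has_b:
            groups["both"][go_id] += 1
        elif has_a:
            groups["a_only"][go_id] += 1
        elif has_b:
            groups["b_only"][go_id] += 1
    return groups


# Hypothetical proteins; real GAF lines carry more columns than shown here.
domains_by_protein = {
    "P1": {"63380", "52343"},
    "P2": {"63380", "52343"},
    "P3": {"63380"},
    "P4": {"52343"},
}
gaf_lines = [
    "!gaf-version: 2.0\n",
    "UniProtKB\tP1\tg1\t\tGO:0016491\tGO_REF:0000002\tIEA\n",
    "UniProtKB\tP2\tg2\t\tGO:0016491\tGO_REF:0000002\tIEA\n",
    "UniProtKB\tP3\tg3\t\tGO:0016491\tGO_REF:0000002\tIEA\n",
    "UniProtKB\tP4\tg4\t\tGO:0008152\tGO_REF:0000002\tIEA\n",
]
groups = go_counts_by_group(domains_by_protein, gaf_lines, "63380", "52343")
print(groups)
```

Keeping the three groups separate makes it easy to see which GO terms are shared between proteins with the full combination and proteins with only one of the two domains.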

10
Supra-domain Example 1
Riboflavin synthase domain-like (SCOP ID 63380)
Ferredoxin reductase-like, C-terminal NADP-linked
domain (SCOP ID 52343)
The N-terminal domain binds FAD and the
C-terminal domain binds NADPH. The FAD acts as an
intermediate in electron transfer between NADPH
and substrate, and this domain combination is
used by many different enzymes.
11
Function Annotation for Proteins with
Supra-Domains and Single Domains
100% IEA
12
Function Example: GO:0016491 (oxidoreductase
activity)

10 proteins with supra-domains annotated to
GO:0016491 --- 2491 proteins with supra-domains
3 proteins with Riboflavin synthase domain-like
annotated to GO:0016491 --- 42 proteins with
Riboflavin synthase domain-like
1 protein with Ferredoxin reductase-like, C-terminal
NADP-linked domain annotated to GO:0016491 --- 47
proteins with Ferredoxin reductase-like, C-terminal
NADP-linked domain
The specific proteins were looked up, and the
presence or absence of the combined domain was
confirmed, along with the GO ID and the annotation
evidence, which was found to be Inferred from
Electronic Annotation (IEA).
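
A simple way to quantify how much of an annotation set rests on electronic inference is to compute the share of records carrying the IEA evidence code. The sketch below is only an illustration: `evidence_share` is a made-up helper, the three accessions are the single-domain proteins reported as IEA for GO:0016491 on the example slide that follows, and the two extra records are hypothetical.

```python
def evidence_share(annotations, code="IEA"):
    """Percentage of (protein, GO ID, evidence code) records carrying `code`."""
    annotations = list(annotations)
    if not annotations:
        return 0.0
    hits = sum(1 for _protein, _go, evidence in annotations if evidence == code)
    return 100.0 * hits / len(annotations)


# Single-domain proteins listed for GO:0016491, all reported as IEA.
single_domain = [
    ("Q0A5G3", "GO:0016491", "IEA"),
    ("A4FEM2", "GO:0016491", "IEA"),
    ("A1WVX7", "GO:0016491", "IEA"),
]
print(evidence_share(single_domain))

# Two hypothetical experimentally supported records (IDA) lower the share.
mixed = single_domain + [
    ("HYPOTHETICAL_1", "GO:0016491", "IDA"),
    ("HYPOTHETICAL_2", "GO:0016491", "IDA"),
]
print(evidence_share(mixed))
```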

13
Example: GO:0016491

Supra-domains: Riboflavin synthase domain-like,
Ferredoxin reductase-like, C-terminal NADP-linked
domain

Protein name: Oxidoreductase FAD-binding domain
protein
Gene Ontology: biological process GO:0055114
(is_a child of GO:0008152)
molecular function GO:0016491
Pfam domains: PF00970 (FAD_binding),
PF00175 (NAD_binding)
Evidence: IEA (Inferred from Electronic Annotation)
Proteins: A4FHX1, A1UCP3, A4T5V2, A3PWD0, Q1BCA1

Protein name: Sulfide dehydrogenase
(flavoprotein) subunit SudA; sulfide dehydrogenase
(flavoprotein) subunit SudB
Gene Ontology: biological process GO:0055114
(is_a child of GO:0008152)
molecular function GO:0016491
Pfam domains: PF00175 (NAD_binding),
PF07992 (Pyr_redox; FAD_pyr_nucl-diS_OxRdtase)
Evidence: IEA (Inferred from Electronic Annotation)
Proteins: Q2J1U9, Q13CJ3, Q5PB24

Protein name: Dihydroorotate dehydrogenase
electron transfer subunit, putative
Gene Ontology: biological process GO:0055114
(is_a child of GO:0008152)
molecular function GO:0016491
Pfam domains: PF00970 (FAD_binding)

Single domain: Riboflavin synthase domain-like
Protein name: Putative uncharacterized protein
Gene Ontology: biological process GO:0055114
(is_a child of GO:0008152)
molecular function GO:0016491
Pfam domains: PF07992 (Pyr_redox) in Q0A5G3,
or PF08021 (FAD_binding) in A4FEM2 and A1WVX7
Evidence: IEA (Inferred from Electronic Annotation)
Proteins: Q0A5G3, A4FEM2, A1WVX7

Single domain: Ferredoxin reductase-like,
C-terminal NADP-linked domain
Protein name: Protein-P-II uridylyltransferase
Gene Ontology: biological process GO:0008152
(a parent of GO:0006807)
molecular function GO:0016491
Pfam: PF01966 (NAD Binding)
Evidence: IEA (Inferred from Electronic Annotation)
Protein: Q6MLQ2
Ref: http://www.uniprot.org/uniprot/

14
Supra-domain Example 2
PreATP-grasp domain (SCOP ID 52440)
Glutathione synthetase ATP-binding domain-like
(SCOP ID 56059)
Many different enzymes that form carbon-nitrogen
bonds have this combination of domains. Both
domains contribute to substrate binding and the
active site, and the C-terminal domain binds ATP
as well as the other substrate.
15
Function Annotation for Proteins with
Supra-Domains and Single Domains
75% IEA
16
RESULTS

Functional annotations were found to be shared by
proteins having the supra-domains as well as by
proteins having only the single domains.
The percentage of proteins having supra-domains
was much higher than that of proteins having
single domains.
Since both domains are required for the function
of the protein, the functions assigned to
single-domain proteins may be mis-annotations.
This study motivated us to develop a
computational tool for function annotation based
on domain combinations (domain patterns) instead
of individual domains.

17
A Novel Computational Tool for Function
Annotation Using Domain Content

Utilize the relationships among the predicted
functions (as shown in the GO directed acyclic
graph of functions)
Provide a web-tool and visualization of the
predicted functions and their relationship with
domain patterns

18
Probabilistic Approach

Consider a functional annotation term F (in this
case a Gene Ontology term) and a domain set D. The
probability that a protein exhibiting D possesses
F is modeled as
$P(F \mid D) = P(D \mid F)\,P(F) / P(D)$
(i.e., the posterior probability of a function F
given a set of domains D; $P(D \mid F)$, $P(F)$, and
$P(D)$ can be learned from proteins with known
functions)
Ref: Predicting protein function from domain
content. Forslund et al., Bioinformatics, Vol. 24
no. 15, 2008, pages 1681-1687
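
As a worked illustration of the formula on this slide, the three terms on the right-hand side can be estimated by counting over a set of proteins with known functions. The sketch below is a minimal example under stated assumptions, not the cited method's implementation: it treats a protein as "exhibiting D" when it contains every domain in D, uses plain relative frequencies with no smoothing, and runs on made-up proteins with placeholder domain names A, B and C.

```python
def posterior(proteins, domain_set, go_term):
    """Estimate P(F|D) = P(D|F) * P(F) / P(D) from annotated proteins.

    `proteins` is a list of (set_of_domains, set_of_go_terms) pairs.
    """
    n = len(proteins)
    n_f = sum(1 for _doms, gos in proteins if go_term in gos)
    n_d = sum(1 for doms, _gos in proteins if domain_set <= doms)
    n_df = sum(1 for doms, gos in proteins
               if domain_set <= doms and go_term in gos)
    if n_f == 0 or n_d == 0:
        return 0.0  # nothing to learn from
    p_d_given_f = n_df / n_f   # P(D|F)
    p_f = n_f / n              # P(F)
    p_d = n_d / n              # P(D)
    return p_d_given_f * p_f / p_d


# Made-up training set: 8 proteins with toy domains and GO terms.
proteins = [
    ({"A", "B"}, {"GO:0016491"}),
    ({"A", "B", "C"}, {"GO:0016491"}),
    ({"A", "B"}, set()),
    ({"A"}, {"GO:0016491"}),
    ({"A"}, set()),
    ({"A"}, set()),
    ({"B"}, set()),
    ({"C"}, {"GO:0008152"}),
]
print(posterior(proteins, {"A", "B"}, "GO:0016491"))
print(posterior(proteins, {"A"}, "GO:0016491"))
```

With relative-frequency estimates the expression simplifies to the number of proteins with both D and F divided by the number of proteins with D, so the toy data give 2/3 for the combination A B and 1/2 for domain A alone.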

19
Datasets Used

Gene Ontology database
gene_association.goa_uniprot
Swisspfam

20
Methodology

For an input domain pattern (Pfam domains):
All the Pfam patterns containing the given pattern
are extracted (e.g., if the input domain pattern is
A B, all the domain patterns that contain this
domain pattern are considered, such as A B
C, etc.)
The GO functions associated with all of these
domain patterns are extracted
The probability is calculated using
$P(F \mid D) = P(D \mid F)\,P(F) / P(D)$
Number of proteins that occur with the domain
pattern and possess the function
If the percentage probabilities lie close to one
another, then the parent GO function is found and
a diagram depicting the sum of the distances of
the parent from the two children is printed;
otherwise the GO terms that have
$P(F \mid D) > 0.9 \cdot \max P(F \mid D)$ are extracted
A summary graph provides all the GO functions
associated with the pattern search
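
The steps on this slide can be sketched end to end on toy data. This is a hedged illustration, not the tool's code. It assumes a pattern "contains" the query when it includes all of the query's domains regardless of order, estimates P(F|D) as the fraction of matching proteins that carry the term, compares only the two highest-scoring terms, and uses a hypothetical `tolerance` for "close to one another", since the slide gives no value. The toy GO edges mirror the parent/child statements on the example slide, with the oxidation-reduction term written as GO:0055114, which is an assumption about the intended ID.

```python
from collections import Counter, deque


def go_probabilities(annotated, query):
    """P(F|D) by counting; `annotated` holds (frozenset_of_domains, set_of_go_ids)."""
    query = frozenset(query)
    matching = [gos for pattern, gos in annotated if query <= pattern]
    if not matching:
        return {}
    counts = Counter(go for gos in matching for go in gos)
    return {go: c / len(matching) for go, c in counts.items()}


def select_terms(probs, factor=0.9):
    """Keep the terms with P(F|D) > factor * max P(F|D)."""
    if not probs:
        return []
    cutoff = factor * max(probs.values())
    return sorted(go for go, p in probs.items() if p > cutoff)


def distances_to_ancestors(term, parents):
    """Breadth-first distances from `term` to itself and every ancestor."""
    dist = {term: 0}
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, ()):
            if parent not in dist:
                dist[parent] = dist[current] + 1
                queue.append(parent)
    return dist


def closest_common_parent(a, b, parents):
    """Common ancestor minimising the summed distance from both terms."""
    da, db = distances_to_ancestors(a, parents), distances_to_ancestors(b, parents)
    common = set(da) & set(db)
    if not common:
        return None
    best = min(common, key=lambda t: (da[t] + db[t], t))
    return best, da[best] + db[best]


def summarize(probs, parents, tolerance=0.05):
    """Report a common parent when the top two scores are close, else top terms."""
    top = sorted(probs, key=probs.get, reverse=True)[:2]
    if len(top) == 2 and abs(probs[top[0]] - probs[top[1]]) <= tolerance:
        found = closest_common_parent(top[0], top[1], parents)
        if found is not None:
            return ("common_parent", found)
    return ("top_terms", select_terms(probs))


# Toy data: placeholder domains A, B, C and an assumed two-edge GO fragment.
parents = {"GO:0055114": ["GO:0008152"], "GO:0006807": ["GO:0008152"]}
annotated = [
    (frozenset({"A", "B"}), {"GO:0055114"}),
    (frozenset({"A", "B", "C"}), {"GO:0055114", "GO:0006807"}),
    (frozenset({"A"}), {"GO:0006807"}),
]
probs = go_probabilities(annotated, {"A", "B"})
print(summarize(probs, parents))
print(summarize({"GO:0055114": 0.80, "GO:0006807": 0.78}, parents))
```

In the first call the two scores are far apart, so only the dominant term is reported; in the second they are nearly equal, so the shared parent and the summed distance to it are returned instead.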

21
Example: Domain Input PF03446 PF00393
22
Example: Domain Input PF02801
23
Conclusion

A survey of annotation based on single domains
Function annotation using domain patterns (domain
combinations) instead of individual domains
To do
Do a more thorough survey of annotations
based on single domains
Define all the relationships between the GO IDs
in the Summary Graph
Refine and test the computational tool

24
Acknowledgements