Protein Function Prediction Based on Domain Content

1
Protein Function Prediction Based on Domain
Content
Ankita Sarangi, School of Informatics, IUB
Capstone Presentation, May 11, 2009
Advisor: Yuzhen Ye
2
What information can be used for function
annotation?
  • Sequence-based approaches
    • Protein A has function X, and protein B is a
      homolog (ortholog) of protein A; hence B has
      function X
  • Structure-based approaches
    • Protein A has structure X, and X has specific
      structural features; hence X's functional sites
      are used to assign function to protein A
  • Motif-based approaches (sequence motifs, 3D
    motifs)
    • A group of genes have function X and they all
      have motif Y; protein A has motif Y; hence
      protein A's function might be related to X
  • Guilt-by-association
    • Gene A has function X and gene B is often
      associated with gene A, so B might have a
      function related to X
    • Associations: domain fusion, phylogenetic
      profiling, protein-protein interaction (PPI),
      etc.

3
Function Annotation Using Protein Domains
  • A protein domain is a part of a protein's sequence
    and structure that can evolve, function, and
    exist independently of the rest of the protein
    chain.
  • Each domain forms a compact three-dimensional
    structure and can often be folded independently.
  • Many proteins consist of several structural
    domains.
  • Among the relevant sequence features of a protein,
    domains occupy a key position. They are
    sequential and structural motifs found
    independently in different proteins, in different
    combinations, and as such seem to be the building
    blocks of proteins.

4
Supra-domains
  • However, it is also known that certain sets of
    independent domains are frequently found
    together, which may indicate functional
    cooperation.
  • Supra-domains: a supra-domain is defined as a
    domain combination in a particular
    N-to-C-terminal orientation that occurs in at
    least two different domain architectures in
    different proteins with (i) different types of
    domains at the N- and C-terminal ends of the
    combination or (ii) different types of domains
    at one end and no domain at the other.
  • In one type of supra-domain, the activity is
    created at the interface between the two domains
    of a protein
  • (Ref: J. Mol. Biol. 2004, 336, 809-823)
  • We may make mistakes if we predict function
    based on individual domains
  • We know that proteins with both domain A and
    domain B have function F; what about proteins
    with only domain A or only domain B?

5
My Work
  • A survey of mis-annotation based on single
    domains
    • We want to know how serious this problem is in
      the current annotation system
    • There has been no systematic survey of this so
      far
  • Function annotation using domain patterns (domain
    combinations) instead of individual domains
    • Utilize the relationships among the predicted
      functions (as shown in the GO directed acyclic
      graph of functions)
    • Provide a web tool and a visualization of the
      predicted functions and their relationship with
      domain patterns

6
Protein Domains: SUPERFAMILY
  • SUPERFAMILY is a database of structural and
    functional protein annotations for all completely
    sequenced organisms.
  • The SUPERFAMILY web site and database provide
    protein domain assignments, at the SCOP
    'superfamily' and 'family' levels, for the
    predicted protein sequences in over 900 organisms.
  • We made a local copy of this database.

7
Protein Function: GO
  • The Gene Ontology (GO) project is a collaborative
    effort to address the need for consistent
    descriptions of gene products in different
    databases.
  • It consists of three structured, controlled
    vocabularies (ontologies) that describe gene
    products in terms of their associated biological
    processes, cellular components and molecular
    functions in a species-independent manner.

8
A Survey of Mis-Annotations Using Single Domains
We looked at several supra-domains listed in this
paper:
Vogel et al., "Supra-domains: Evolutionary Units
Larger than Single Protein Domains", J. Mol. Biol.
(2004) 336, 809-823
9
Finding Functions for Domains and Domain
Combinations
  • SUPERFAMILY
    • Use the SCOP IDs of the domains to obtain the
      gene identifiers associated with the
      supra-domains as well as with their individual
      domains
  • UniProt ID mapping file (a tab-delimited table
    that includes mappings for 20 different sequence
    identifier types, for example ENSGALP000,
    AN7518.2, Afu1g003, gi|41409236|ref|NP_962072.1)
    and gene_association.goa_uniprot (GO assignments
    for the UniProt KnowledgeBase (UniProtKB))
    • Used to obtain the Swiss-Prot ID and GO ID
  • Find the Gene Ontology functions that are
    associated with proteins that contain both
    domains and with proteins that contain only the
    individual domains
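
The comparison in the last step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in the study: the function name `annotation_rates`, the protein IDs P1 to P5, and the in-memory dictionaries are hypothetical stand-ins for data that would really be parsed from SUPERFAMILY, the UniProt ID mapping file, and gene_association.goa_uniprot.

```python
from typing import Dict, Set, Tuple


def annotation_rates(
    domains: Dict[str, Set[str]],
    go_terms: Dict[str, Set[str]],
    dom_a: str,
    dom_b: str,
    go_id: str,
) -> Dict[str, Tuple[int, int]]:
    """For proteins with both domains, only A, or only B, return
    (number annotated with go_id, number of proteins in the group)."""
    groups = {"both": [], "only_a": [], "only_b": []}
    for protein, doms in domains.items():
        has_a, has_b = dom_a in doms, dom_b in doms
        if has_a and has_b:
            groups["both"].append(protein)
        elif has_a:
            groups["only_a"].append(protein)
        elif has_b:
            groups["only_b"].append(protein)

    result = {}
    for name, members in groups.items():
        # Proteins without any GO record count as not annotated.
        annotated = sum(1 for p in members if go_id in go_terms.get(p, set()))
        result[name] = (annotated, len(members))
    return result


# Hypothetical toy data: protein -> SCOP superfamily IDs, protein -> GO IDs.
toy_domains = {
    "P1": {"63380", "52343"},
    "P2": {"63380", "52343"},
    "P3": {"63380"},
    "P4": {"52343"},
    "P5": {"52343"},
}
toy_go = {"P1": {"GO:0016491"}, "P3": {"GO:0016491"}, "P4": set()}

rates = annotation_rates(toy_domains, toy_go, "63380", "52343", "GO:0016491")
for group, (annotated, total) in rates.items():
    print(group, annotated, "of", total)
```

Keeping the raw counts rather than a single percentage makes it easy to see when a group is too small for its rate to mean much.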

10
Supra-domain Example 1
(SCOP ID 63380)
(SCOP ID 52343)
The N-terminal domain binds FAD and the
C-terminal domain binds NADPH. The FAD acts as an
intermediate in electron transfer between NADPH
and substrate, and this domain combination is
used by many different enzymes.
11
Function Annotation for Proteins with
Supra-Domains and Single Domains
100% IEA
12
Function Example: GO:0016491 (oxidoreductase
activity)
  • 10 proteins with supra-domains annotated to
    GO:0016491; 2491 proteins with supra-domains
  • 3 proteins with Riboflavin synthase domain-like
    annotated to GO:0016491; 42 proteins with
    Riboflavin synthase domain-like
  • 1 protein with Ferredoxin reductase-like,
    C-terminal NADP-linked domain annotated to
    GO:0016491; 47 proteins with Ferredoxin
    reductase-like, C-terminal NADP-linked domain
  • Specific proteins were looked up to confirm the
    presence or absence of the combined domain, the
    GO ID, and the annotation evidence, which was
    found to be IEA (Inferred from Electronic
    Annotation)

13
Example: GO:0016491
  • Supra-domain: Riboflavin synthase domain-like +
    Ferredoxin reductase-like, C-terminal NADP-linked
    domain
    • Protein name: Oxidoreductase FAD-binding domain
      protein
      Gene Ontology: biological process GO:0055114,
      is_a child of GO:0008152; molecular function
      GO:0016491
      Pfam domains: PF00970 (FAD_binding), PF00175
      (NAD_binding)
      Evidence: IEA (Inferred from Electronic
      Annotation)
      Proteins: A4FHX1, A1UCP3, A4T5V2, A3PWD0, Q1BCA1
    • Protein name: Sulfide dehydrogenase
      (flavoprotein) subunit SudA; sulfide
      dehydrogenase (flavoprotein) subunit SudB
      Gene Ontology: biological process GO:0055114,
      is_a child of GO:0008152; molecular function
      GO:0016491
      Pfam domains: PF00175 (NAD_binding), PF07992
      (Pyr_redox; FAD_pyr_nucl-diS_OxRdtase)
      Evidence: IEA (Inferred from Electronic
      Annotation)
      Proteins: Q2J1U9, Q13CJ3, Q5PB24
    • Protein name: Dihydroorotate dehydrogenase
      electron transfer subunit, putative
      Gene Ontology: biological process GO:0055114,
      is_a child of GO:0008152; molecular function
      GO:0016491
      Pfam domains: PF00970 (FAD_binding)
  • Single domain: Riboflavin synthase domain-like
    • Protein name: Putative uncharacterized protein
      Gene Ontology: biological process GO:0055114,
      is_a child of GO:0008152; molecular function
      GO:0016491
      Pfam domains: PF07992 (Pyr_redox) in Q0A5G3, or
      PF08021 (FAD_binding) in A4FEM2 and A1WVX7
      Evidence: IEA (Inferred from Electronic
      Annotation)
      Proteins: Q0A5G3, A4FEM2, A1WVX7
  • Single domain: Ferredoxin reductase-like,
    C-terminal NADP-linked domain
    • Protein name: Protein-P-II uridylyltransferase
      Gene Ontology: biological process GO:0008152,
      a parent of GO:0006807; molecular function
      GO:0016491
      Pfam domains: PF01966 (NAD binding)
      Evidence: IEA (Inferred from Electronic
      Annotation)
      Protein: Q6MLQ2
  • Ref: http://www.uniprot.org/uniprot/

14
Supra-domain Example 2
PreATP-grasp domain (SCOP ID 52440)
Glutathione synthetase ATP-binding domain-like
(SCOP ID 56059)
Many different enzymes that form carbon-nitrogen
bonds have this combination of domains. Both
domains contribute to substrate binding and the
active site, and the C-terminal domain binds ATP
as well as the other substrate.
15
Function Annotation for Proteins with
Supra-Domains and Single Domains
75% IEA
16
RESULTS
  • Functional annotations were found to be shared by
    proteins having the supra-domains as well as by
    proteins having only the single domains.
  • The percentage of proteins carrying the
    annotation was much higher for proteins with
    supra-domains than for proteins with single
    domains.
  • Since both domains are required for the function
    of the protein, the functions assigned to
    single-domain proteins may be mis-annotations.
  • This survey motivated us to develop a
    computational tool for function annotation based
    on domain combinations (domain patterns) instead
    of individual domains.

17
A Novel Computational Tool for Function
Annotation Using Domain Content
  • Utilize the relationships among the predicted
    functions (as shown in the GO directed acyclic
    graph of functions)
  • Provide a web tool and a visualization of the
    predicted functions and their relationship with
    domain patterns
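
One way the relationships in the GO graph could be used is to find the closest parent shared by two predicted terms, together with the summed distance from that parent to each term. The sketch below is a minimal illustration, not the tool's implementation: the `parents` dictionary (child term to its is_a parents), the term labels, and both function names are hypothetical.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple


def ancestor_distances(term: str, parents: Dict[str, List[str]]) -> Dict[str, int]:
    """Breadth-first walk up the is_a edges; returns the shortest
    distance from `term` to itself and to each of its ancestors."""
    dist = {term: 0}
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, []):
            if parent not in dist:
                dist[parent] = dist[current] + 1
                queue.append(parent)
    return dist


def closest_common_parent(
    a: str, b: str, parents: Dict[str, List[str]]
) -> Optional[Tuple[str, int]]:
    """Return (shared ancestor, summed distance to a and b) for the
    shared ancestor with the smallest sum, or None if there is none."""
    da = ancestor_distances(a, parents)
    db = ancestor_distances(b, parents)
    common = set(da) & set(db)
    if not common:
        return None
    # Ties are broken alphabetically so the result is deterministic.
    best = min(common, key=lambda t: (da[t] + db[t], t))
    return best, da[best] + db[best]


# Hypothetical toy graph: two sibling terms under "mid", which is under "root".
toy_parents = {"child_a": ["mid"], "child_b": ["mid"], "mid": ["root"]}
print(closest_common_parent("child_a", "child_b", toy_parents))
```

Because GO is a directed acyclic graph rather than a tree, a term can have several parents; the breadth-first walk handles that case by keeping the shortest distance to each ancestor.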

18
Probabilistic Approach
  • Given a functional annotation term F (in this
    case a Gene Ontology term) and a domain set D,
    the probability that a protein exhibiting D
    possesses F is modeled as
  • P(F|D) = P(D|F) P(F) / P(D)
  • (i.e., the posterior probability of a function
    given a set of domains D; P(D|F), P(F), and P(D)
    can be learned from proteins with known
    functions)
  • Ref: Forslund et al., "Predicting protein
    function from domain content", Bioinformatics,
    Vol. 24 no. 15, 2008, pages 1681-1687
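
As a small worked example of the formula above, the three probabilities can be estimated by counting over a set of proteins with known functions. This is a sketch under assumptions, not the method of the cited paper: the function name `posterior`, the toy proteins, and the choice to treat a domain set as an unordered set are all illustrative.

```python
from typing import List, Set, Tuple

# Each protein is a pair: (set of domains, set of GO terms).
Protein = Tuple[Set[str], Set[str]]


def posterior(proteins: List[Protein], pattern: Set[str], go_id: str) -> float:
    """Estimate P(F|D) = P(D|F) P(F) / P(D) from simple counts."""
    n = len(proteins)
    n_f = sum(1 for doms, gos in proteins if go_id in gos)
    n_d = sum(1 for doms, gos in proteins if pattern <= doms)
    n_df = sum(1 for doms, gos in proteins if pattern <= doms and go_id in gos)
    if n == 0 or n_f == 0 or n_d == 0:
        return 0.0  # nothing to learn from
    p_f = n_f / n            # P(F)
    p_d = n_d / n            # P(D)
    p_d_given_f = n_df / n_f  # P(D|F)
    return p_d_given_f * p_f / p_d


# Hypothetical training set with domains A, B, C and one function F.
toy_proteins = [
    ({"A", "B"}, {"F"}),
    ({"A", "B", "C"}, {"F"}),
    ({"A", "B"}, set()),
    ({"A"}, {"F"}),
]
p_pair = posterior(toy_proteins, {"A", "B"}, "F")
p_single = posterior(toy_proteins, {"A"}, "F")
print(round(p_pair, 3), round(p_single, 3))
```

With plain counts the expression reduces to the fraction of proteins containing D that are annotated with F, which is a useful sanity check on any implementation.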

19
Datasets Used
  • Gene Ontology database
  • gene_association.goa_uniprot
  • Swisspfam

20
Methodology
  • For an input domain pattern (Pfam domains):
  • All Pfam domain patterns containing the given
    pattern are extracted (e.g., if the input domain
    pattern is A B, all the domain patterns that
    contain it are considered, such as A B C)
  • The GO functions associated with all of these
    domain patterns are extracted
  • The probability is calculated using
    P(F|D) = P(D|F) P(F) / P(D)
    • based on the number of proteins with the domain
      pattern that possess the function
  • If the probabilities lie close to one another,
    then the parent GO function is found and a
    diagram depicting the sum of the distances of the
    parent from the two children is printed;
    otherwise the GO terms that have
    P(F|D) > 0.9 × max P(F|D) are extracted
  • A summary graph provides all the GO functions
    associated with the pattern search
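
Two of the steps above, extracting the patterns that contain the input pattern and keeping the GO terms within 90% of the best probability, can be sketched as follows. The function names and toy values are hypothetical, and the sketch assumes that "contains" means the input pattern appears as a contiguous, ordered run of domains; the slides do not say whether order or adjacency matters.

```python
from typing import Dict, List, Sequence


def contains_pattern(architecture: Sequence[str], pattern: Sequence[str]) -> bool:
    """True if `pattern` occurs as a contiguous, ordered run in `architecture`
    (an assumption about what 'contains' means)."""
    n, m = len(architecture), len(pattern)
    return any(
        list(architecture[i:i + m]) == list(pattern) for i in range(n - m + 1)
    )


def select_terms(posteriors: Dict[str, float], ratio: float = 0.9) -> List[str]:
    """Keep the terms whose probability exceeds ratio * the maximum probability."""
    if not posteriors:
        return []
    cutoff = ratio * max(posteriors.values())
    return sorted(term for term, p in posteriors.items() if p > cutoff)


# Hypothetical domain architectures, using the A B C notation from the slide.
architectures = [["A", "B"], ["A", "B", "C"], ["B", "A"], ["A", "C", "B"]]
matches = [arch for arch in architectures if contains_pattern(arch, ["A", "B"])]
print(matches)

# Hypothetical probabilities for three terms.
kept = select_terms({"term_x": 0.8, "term_y": 0.75, "term_z": 0.3})
print(kept)
```

If order and adjacency should not matter, `contains_pattern` could be replaced by a subset test on sets of domains, which would also accept the last two toy architectures.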

21
Example: Domain Input PF03446 PF00393
22
Example: Domain Input PF02801
23
Conclusion
  • A survey of annotation based on single domains
  • Function annotation using domain patterns (domain
    combinations) instead of individual domains
  • To do:
    • Carry out a more thorough survey of the
      annotation of single domains
    • Define all the relationships between the GO IDs
      in the summary graph
    • Refine and test the computational tool

24
Acknowledgements
  • I would like to thank
  • Dr. Yuzhen Ye
  • Faculty of the Department of Bioinformatics
  • Drs. Dalkilic, Kim, Hahn, Radivojac, Tang
  • Linda Hostetter and Rachel Lawmaster
  • Family and Friends