Title:

Functional annotation

Description:

Functional annotation: Datasources. Konstantinos Mavrommatis, Head of Omics group, DOE-JGI, Kmavrommatis_at_lbl.gov


Title: Functional annotation

1
Functional annotation: Datasources
Konstantinos Mavrommatis, Head of Omics group, DOE-JGI
Kmavrommatis_at_lbl.gov
2
Outline

Genome annotation (Functional)
How do we know it is correct?
How do we do it?
Data collections
Protein families
Pathway collections

3
Genome annotation: the process of identifying the
locations and functions of coding sequences.

cobalamin biosynthetic enzyme, cobalt-precorrin-4
methyltransferase (CbiF)
molecular/enzymatic (methyltransferase)
Reaction (methylation)
Substrate (cobalt-precorrin-4)
Ligand (S-adenosyl-L-methionine)
metabolic (cobalamin biosynthesis)
physiological (maintenance of healthy nerve and
red blood cells, through B12).
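
The levels of function listed for CbiF can be kept in a single record, so that each level is filled in or left empty on its own. The following is a minimal sketch, not a format used by any annotation pipeline: the class `FunctionalAnnotation` and its field names are hypothetical and chosen here only to mirror the list above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FunctionalAnnotation:
    """One gene product described at several levels of function (hypothetical layout)."""
    product: str
    molecular: Optional[str] = None
    reaction: Optional[str] = None
    substrate: Optional[str] = None
    ligand: Optional[str] = None
    metabolic: Optional[str] = None
    physiological: Optional[str] = None

    def missing_levels(self):
        """Return the names of the levels that have not been annotated."""
        return [name for name, value in vars(self).items() if value is None]


# Values taken from the CbiF example on the slide
cbif = FunctionalAnnotation(
    product="cobalt-precorrin-4 methyltransferase (CbiF)",
    molecular="methyltransferase",
    reaction="methylation",
    substrate="cobalt-precorrin-4",
    ligand="S-adenosyl-L-methionine",
    metabolic="cobalamin biosynthesis",
    physiological="maintenance of healthy nerve and red blood cells, through B12",
)

# A made-up, partially annotated gene for contrast
partial = FunctionalAnnotation(product="hypothetical protein", molecular="methyltransferase")
print(partial.missing_levels())
```

A record like `partial` makes it explicit which levels of function are still unknown, rather than hiding them behind a single product name.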

4
Functional annotation helps make sense out of
nonsense
5
Function prediction is mainly based on homology
detection

Homology
implies a common evolutionary origin,
not retention of similarity in any of their
properties.
Homology ≠ similarity of function.
Function transfer by homology

6
Function transfer based on homology is error prone
Punta & Ofran. PLoS Comput Biol. 2008
7
Limits in transfer of annotation based on homology
Punta & Ofran. PLoS Comput Biol. 2008
8
If no similarity is detected use alternative
methods to predict function

Subcellular localization
Gene context
Special sequence motifs and features

9
Annotation should make sense in the context of
the cell metabolism
[Figure: model pathway (Substrate A, Substrate B, Substrate C, Substrate D; Enzyme 2) compared with the genome annotation]
10
Annotation should make sense. Missing genes may
be present.
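
One way to act on this is to compare the steps a pathway needs against the functions the genome annotation provides, and treat each unmatched step as a candidate missing gene. The sketch below assumes a three-enzyme pathway and made-up gene identifiers; `missing_steps` is a hypothetical helper, not part of any annotation system.

```python
def missing_steps(pathway_steps, annotated_functions):
    """Return the pathway steps that no annotated gene covers, in pathway order."""
    annotated = set(annotated_functions)
    return [step for step in pathway_steps if step not in annotated]


# Hypothetical pathway: Substrate A -> B -> C -> D needs three enzymes
pathway = ["Enzyme 1", "Enzyme 2", "Enzyme 3"]

# Hypothetical genome annotation: gene identifier -> predicted function
genome = {"gene_0001": "Enzyme 1", "gene_0002": "Enzyme 3"}

gaps = missing_steps(pathway, genome.values())
print(gaps)
```

A gap flagged this way is a prompt to look again (for example among unannotated genes near the other pathway genes) before concluding that the organism lacks the step.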
11
Genome annotation: the process of identifying the
locations and functions of coding sequences.

Helps prediction
Is error prone.
Has to make sense.

12
There are multiple datasources to help organize
information and facilitate annotation

Sequence databases
Protein classification databases
Specialized databases

13
Primary databases store raw information from
various sources

EMBL/GenBank/DDBJ (http://www.ncbi.nlm.nih.gov/, http://www.ebi.ac.uk/embl)
Archive containing all sequences from all sources
GenBank/UniProt contain translations of
sequences.

Year   Base pairs        Sequences
2004   44,575,745,176    40,604,319
2005   56,037,734,462    52,016,762
2006   69,019,290,705    64,893,747
2007   83,874,179,730    80,388,382
2008   99,116,431,942    98,868,465
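
The growth figures are easier to read as ratios. The sketch below uses only the numbers from the slide; the variable names are chosen here for illustration.

```python
# (year, base pairs, sequences) as given on the slide
growth = [
    (2004, 44_575_745_176, 40_604_319),
    (2005, 56_037_734_462, 52_016_762),
    (2006, 69_019_290_705, 64_893_747),
    (2007, 83_874_179_730, 80_388_382),
    (2008, 99_116_431_942, 98_868_465),
]

first, last = growth[0], growth[-1]

# Fold increase between the first and last year in the table
bp_fold = last[1] / first[1]
seq_fold = last[2] / first[2]

# Average entry length in base pairs for each year
avg_length = {year: bp / seqs for year, bp, seqs in growth}

print(f"base pairs grew {bp_fold:.2f}-fold, sequences {seq_fold:.2f}-fold")
print(f"average entry length: {avg_length[2004]:.0f} bp (2004), {avg_length[2008]:.0f} bp (2008)")
```

Over these four years both counts more than doubled, and the number of sequences grew faster than the number of base pairs, so the average entry became shorter.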
14
Primary databases accumulate errors in sequences
and annotations

In the sequences themselves
Sequencing errors.
Cloning vector sequences.
In the annotations
Inaccuracies, omissions, and even mistakes.
Inconsistencies between some fields.
Redundancy.

15
IMG uses RefSeq as its primary source
16
Protein families use different methods to
classify proteins

COG/KOG
Pfam
TIGRfam
KEGG Orthologs
InterPro

17
What are COGs/KOGs? How much can I trust them?
>gnl|COG2723 COG2723, BglB, Beta-glucosidase/6-phospho-beta-glucosidase/beta-galactosidase [Carbohydrate transport and metabolism]. Length = 460
Score = 388 bits (998), Expect = e-132
Identities = 176/503 (34%), Positives = 251/503 (49%), Gaps = 75/503 (14%)

Query   4  SFPKSFRFGWSQAGFQSEMGTPGSEDPNTDWYVWVHDPENIASGLVSGDLPEHGPGYWGL  63
Sbjct   3  KFPKDFLWGGATAAFQVEGAWNEDGKGPSDWDVWVHDE--IPGRLVSGDPPEEASDFYHR  60

Query  64  YRMFHDNAVKMGLDIARINVEWSRIFPKPMPDPPQGNVEVKGNDVLAVHVDENDLKRLDE  123
Sbjct  61  YKEDIALAKEMGLNAFRTSIEWSRIFPNGDGGEV--------------------------  94

Query 124  AANQEAVRHYREIFSDLKARGIHFILNFYHWPLPLWVHDPIRVRKGDLSGPTGWLDVKTV  183
Sbjct  95  --NEKGLRFYDRLFDELKARGIEPFVTLYHFDLPLWLQKPYG----------GWENRETV  142

Query 184  INFARFAAYTAWKFDDLADEYSTMNEPNVVHSNGYMWVKSGFPPSYLNFELSRRVMVNLI  243
Sbjct 143  DAFARYAATVFERFGDKVKYWFTFNEPNVVVELGYL--YGGHPPGIVDPKAAYQVAHHML  200

Query 244  QAHARAYDAVKAISKK-PIGIIYANSSFTPLTDK--DAKAVELAEYDSRWIFFDAIIKGE  300
Sbjct 201  LAHALAVKAIKKINPKGKVGIILNLTPAYPLSDKPEDVKAAENADRFHNRFFLDAQVKGE  260

Query 301  --------------LMGVTRDDL----KGRLDWIGVNYYSRTVVKLIGEKSYVSIPGYGY  342
Sbjct 261  YPEYLEKELEENGILPEIEDGDLEILKENTVDFIGLNYYTPSRVK---AAEPRYVSGYGP  317

Reciprocal best hit (bidirectional best hit)
BLAST best hit (unidirectional best hit)
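
The difference between a unidirectional and a bidirectional best hit can be shown with a few lines of code. This is a minimal sketch under stated assumptions: hits are given as `(query, subject, bit_score)` tuples, the best hit is the one with the highest bit score, and the identifiers and scores in the usage example are made up. It is not the procedure used to build COGs.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    hits: iterable of (query, subject, bit_score) tuples.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Made-up search results between genome A and genome B
a_vs_b = [("a1", "b1", 388.0), ("a1", "b2", 120.0), ("a2", "b1", 200.0)]
b_vs_a = [("b1", "a1", 380.0), ("b2", "a1", 118.0)]

rbh = reciprocal_best_hits(a_vs_b, b_vs_a)
print(rbh)
```

Here `a2` has `b1` as its best hit, but `b1` points back to `a1`, so the `a2`/`b1` pair is only a unidirectional best hit and is left out.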
18
Pfam families are based on the detection of domains
http://pfam.sanger.ac.uk
HMMs built from protein alignments: local (for domains)
or global (covering the whole protein)
19
TIGRfam

Full length alignments.
Domain alignments.
Equivalogs: families of proteins with specific
function.
Superfamilies: families of homologous genes.
HMMs

http://www.tigr.org/TIGRFAMs/
20
How can we search Pfam and TIGRfam?

GA (gathering method): search threshold used to build
the full alignment.
TC (trusted cutoff): lowest sequence score and
domain score of a match in the full alignment.
NC (noise cutoff): highest sequence score and
domain score of a match not in the full alignment.

[Figure: score scale marking the trusted cutoff, gathering cutoff, noise cutoff, and hits to other models]
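
The three cutoffs divide the score scale into zones, and a hit can be labeled by the zone its bit score falls in. The sketch below is an illustration only: `classify_hit` and its labels are hypothetical, and the cutoff values in the usage example are invented, not the real thresholds of any Pfam or TIGRfam model.

```python
def classify_hit(score, ga, tc, nc):
    """Label a bit score relative to a model's gathering, trusted and noise cutoffs."""
    if not (tc >= ga >= nc):
        raise ValueError("expected trusted >= gathering >= noise cutoff")
    if score >= tc:
        return "trusted"    # at or above the lowest score in the full alignment
    if score >= ga:
        return "accepted"   # passes the gathering threshold
    if score > nc:
        return "gray zone"  # above every known non-member, below the threshold
    return "noise"          # no better than known non-members


# Invented cutoffs for a hypothetical model
cutoffs = {"ga": 22.0, "tc": 25.0, "nc": 21.5}

for score in (30.0, 23.0, 21.8, 11.2):
    print(score, classify_hit(score, **cutoffs))
```

With these invented cutoffs, a hit scoring 11.2 bits is labeled as noise even if its E-value looks small.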
Query:       BChl_A  [M=357]
Accession:   PF02327.12
Description: Bacteriochlorophyll A protein
Scores for complete sequences (score includes all domains):
   --- full sequence ---   --- best 1 domain ---    -#dom-
    E-value  score  bias    E-value  score  bias    exp  N  Sequence                Description
    ------- ------ -----    ------- ------ -----   ---- --  --------                -----------
    0.00014   11.2   0.0    0.00024   10.5   0.0    1.2  1  tr|E0STV9|E0STV9_IGNAA  Glycoside hydrolase family 1

Domain annotation for each sequence (and alignments):
>> tr|E0STV9|E0STV9_IGNAA  Glycoside hydrolase family 1 OS=Ignisphaera aggregans (strain DSM)
   #    score  bias  c-Evalue  i-Evalue hmmfrom  hmm to    alifrom  ali to    envfrom  env to     acc
 ---   ------ ----- --------- --------- ------- -------    ------- -------    ------- -------    ----
   1 !   10.5   0.0   1.1e-05   0.00024     217     273 ..     255     307 ..     240     321 ..  0.84

  Alignments for each domain:
  == domain 1    score: 10.5 bits;  conditional E-value: 1.1e-05
                  BChl_A 217 fshagsgvvdsisrwaelfpveklnkpasveagfrsdsqgievkvdgelpgvsvdag 273
  tr|E0STV9|E0STV9_IGNAA 255 FSKKPIGIVESIASWIPLREGDR----EAAEKGFRYNLWPIEVAVNGYLDDVYRDDL 307
21
InterPro: composite pattern databases

To simplify sequence analysis, the family
databases are being integrated to create a
unified annotation resource: InterPro
Release 30.0 (Dec10) contains 21178 entries
Central annotation resource, with pointers to its
satellite dbs

http://www.ebi.ac.uk/interpro/
22
KEGG orthology
E-value < 10^-5; rank ≤ 5; ≥ 70% of query length; ≥ 30% identity
Xizeng Mao et al. Bioinformatics 21 (2005) 3787-3793
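
The thresholds on this slide can be applied as a simple filter over similarity hits to KEGG genes. This sketch reads them as: E-value below 1e-5, hit rank no worse than 5, alignment covering at least 70% of the query, and at least 30% identity. That reading, the `Hit` record, the function name and the example values are all assumptions made for illustration; this is not the IMG or KEGG implementation.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One similarity hit of a query gene against a KEGG gene (hypothetical record)."""
    ko: str                # KO term of the matched gene (placeholder labels below)
    evalue: float
    rank: int              # 1 = best hit for this query
    query_coverage: float  # percent of the query covered by the alignment
    identity: float        # percent identity


def passes_ko_thresholds(hit, max_evalue=1e-5, max_rank=5,
                         min_coverage=70.0, min_identity=30.0):
    """Return True if the hit meets all four thresholds."""
    return (hit.evalue < max_evalue
            and hit.rank <= max_rank
            and hit.query_coverage >= min_coverage
            and hit.identity >= min_identity)


# Made-up hits for a single query gene
hits = [
    Hit("KO_A", 1e-80, 1, 95.0, 62.0),  # passes
    Hit("KO_B", 1e-12, 2, 55.0, 41.0),  # fails: covers too little of the query
    Hit("KO_C", 1e-3, 3, 88.0, 33.0),   # fails: E-value too high
    Hit("KO_A", 1e-30, 6, 90.0, 45.0),  # fails: rank worse than 5
]

accepted = [hit.ko for hit in hits if passes_ko_thresholds(hit)]
print(accepted)
```

Keeping the four thresholds as keyword arguments makes it easy to see how sensitive the assignments are to each one.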
23
ENZYME
24
Pathway collections: KEGG

Contains information about biochemical pathways,
and protein interactions.

http://www.kegg.com
25
Pathway collections: MetaCyc
26
Functional annotation
http://imgweb.jgi-psf.org/img_er_v260/doc/img_er_ann.pdf
27
RNA structural and functional annotation are
coupled

SILVA alignments of rRNAs are used to generate
models
Covariance models for each RNA class are used to
predict genes

28
There is a plethora of specialized databases that
one needs to search
http://www.oxfordjournals.org/nar/database/c
29
In most cases databases are interconnected, but
not all databases are updated regularly.
Changes of annotation in one database are not
reflected in others.
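
Because a change in one database may not reach the others, it can help to compare what each source says about the same gene. The sketch below flags genes whose product names disagree after trivial normalization. The function name, database labels, gene identifiers and product names are made up, and a real comparison would need synonym handling that this sketch does not attempt.

```python
def conflicting_annotations(records):
    """Return the genes whose product names differ between databases.

    records: {gene_id: {database_name: product_name}}
    """
    conflicts = {}
    for gene, by_database in records.items():
        # Compare names case-insensitively and ignoring surrounding spaces
        names = {name.strip().lower() for name in by_database.values()}
        if len(names) > 1:
            conflicts[gene] = by_database
    return conflicts


# Made-up records for two genes in two hypothetical databases
records = {
    "gene_0001": {"db_one": "Beta-glucosidase", "db_two": "beta-glucosidase "},
    "gene_0002": {"db_one": "hypothetical protein", "db_two": "methyltransferase"},
}

conflicts = conflicting_annotations(records)
print(sorted(conflicts))
```

Only `gene_0002` is reported: the two entries for `gene_0001` differ in capitalization and spacing but not in content.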
30
There are multiple datasources to help organize
information and facilitate annotation