Title: Functional annotation
1 Functional annotation Datasources
Konstantinos Mavrommatis, Head of Omics group, DOE-JGI
Kmavrommatis_at_lbl.gov
2Outline
- Genome annotation (Functional)
- How do we know it is correct?
- How do we do it?
- Data collections
- Protein families
- Pathway collections
3Genome annotation The process of identifying the
locations and functions of coding sequences.
- cobalamin biosynthetic enzyme, cobalt-precorrin-4
methyltransferase (CbiF)
- molecular/enzymatic (methyltransferase)
- Reaction (methylation)
- Substrate (cobalt-precorrin-4)
- Ligand (S-adenosyl-L-methionine)
- metabolic (cobalamin biosynthesis)
- physiological (maintenance of healthy nerve and
red blood cells, through B12).
4Functional annotation helps make sense out of
nonsense
5Function prediction is mainly based on homology
detection
- Homology
- implies a common evolutionary origin.
- does not imply retention of similarity in any of their
properties.
- Homology ≠ similarity of function.
- Function transfer by homology
6Function transfer based on homology is error-prone
Punta & Ofran. PLoS Comput Biol. 2008
7Limits in transfer of annotation based on homology
Punta & Ofran. PLoS Comput Biol. 2008
8If no similarity is detected use alternative
methods to predict function
- Subcellular localization
- Gene context
- Special sequence motifs and features
9Annotation should make sense in the context of
the cell metabolism
[Figure: model pathway (Substrate A, Substrate B, Substrate C, Substrate D; Enzyme 2) compared with the genome annotation]
10Annotation should make sense. Missing genes may
be present.
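The check implied by these two slides can be sketched as a comparison between a model pathway and the set of enzymes found in the genome annotation. This is a minimal illustration only: the pathway layout, the names "Enzyme 1" and "Enzyme 3", and the function `missing_steps` are assumptions made for the example, not part of any real pathway database.

```python
# Hypothetical model pathway: (substrate, product, enzyme) for each step.
# Only the substrate names and "Enzyme 2" come from the slide; the other
# enzyme names are placeholders invented for this sketch.
MODEL_PATHWAY = [
    ("Substrate A", "Substrate B", "Enzyme 1"),
    ("Substrate B", "Substrate C", "Enzyme 2"),
    ("Substrate C", "Substrate D", "Enzyme 3"),
]


def missing_steps(pathway, annotated_enzymes):
    """Return the pathway steps whose enzyme is absent from the annotation."""
    return [step for step in pathway if step[2] not in annotated_enzymes]


# Enzymes that the (hypothetical) genome annotation reports.
genome_annotation = {"Enzyme 1", "Enzyme 3"}

gaps = missing_steps(MODEL_PATHWAY, genome_annotation)
for substrate, product, enzyme in gaps:
    # A gap flanked by annotated steps suggests a gene that was missed
    # or mis-annotated, rather than a truly absent function.
    print(f"No gene annotated for {enzyme} ({substrate} -> {product})")
```

A step reported here is a candidate for a targeted search (for example with a relaxed similarity threshold), not proof that the gene is absent.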
11Genome annotation The process of identifying the
locations and functions of coding sequences.
- Helps prediction
- Is error prone.
- Has to make sense.
12There are multiple datasources to help organize
information and facilitate annotation
- Sequence databases
- Protein classification databases
- Specialized databases
13Primary databases store raw information from
various sources
- EMBL/GenBank/DDBJ (http://www.ncbi.nlm.nih.gov/, http://www.ebi.ac.uk/embl)
- Archive containing all sequences from all sources
- GenBank/UniProt contain translations of
sequences.
Year   Base pairs        Sequences
2004   44,575,745,176    40,604,319
2005   56,037,734,462    52,016,762
2006   69,019,290,705    64,893,747
2007   83,874,179,730    80,388,382
2008   99,116,431,942    98,868,465
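As a rough illustration of how fast the archive grows, the year-over-year ratio can be computed directly from the figures listed above. The variable and function names are chosen for this sketch only.

```python
# (year, base pairs, sequences) exactly as listed on the slide.
GROWTH = [
    (2004, 44_575_745_176, 40_604_319),
    (2005, 56_037_734_462, 52_016_762),
    (2006, 69_019_290_705, 64_893_747),
    (2007, 83_874_179_730, 80_388_382),
    (2008, 99_116_431_942, 98_868_465),
]


def yearly_growth(rows):
    """Return {year: base pairs divided by the previous year's base pairs}."""
    return {cur[0]: cur[1] / prev[1] for prev, cur in zip(rows, rows[1:])}


ratios = yearly_growth(GROWTH)
overall = GROWTH[-1][1] / GROWTH[0][1]

for year, ratio in ratios.items():
    print(f"{year}: {ratio:.2f}x the base pairs of the previous year")
print(f"2004-2008 overall: {overall:.2f}x")
```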
14Primary databases accumulate errors in sequences
and annotations
- In the sequences themselves
- Sequencing errors.
- Cloning vector sequences.
- In the annotations
- Inaccuracies, omissions, and even mistakes.
- Inconsistencies between some fields.
- Redundancy.
15IMG uses RefSeq as its primary source
16Protein families use different methods to
classify proteins
- COG/KOG
- Pfam
- TIGRfam
- KEGG Orthologs
- InterPro
17What are COGs/KOGs? How much can I trust them?
>gnl|COG2723 COG2723, BglB, Beta-glucosidase/6-phospho-beta-glucosidase/beta-galactosidase [Carbohydrate transport and metabolism].
Length = 460
Score = 388 bits (998), Expect = e-132
Identities = 176/503 (34%), Positives = 251/503 (49%), Gaps = 75/503 (14%)

Query   4 SFPKSFRFGWSQAGFQSEMGTPGSEDPNTDWYVWVHDPENIASGLVSGDLPEHGPGYWGL 63
Sbjct   3 KFPKDFLWGGATAAFQVEGAWNEDGKGPSDWDVWVHDE--IPGRLVSGDPPEEASDFYHR 60

Query  64 YRMFHDNAVKMGLDIARINVEWSRIFPKPMPDPPQGNVEVKGNDVLAVHVDENDLKRLDE 123
Sbjct  61 YKEDIALAKEMGLNAFRTSIEWSRIFPNGDGGEV-------------------------- 94

Query 124 AANQEAVRHYREIFSDLKARGIHFILNFYHWPLPLWVHDPIRVRKGDLSGPTGWLDVKTV 183
Sbjct  95 --NEKGLRFYDRLFDELKARGIEPFVTLYHFDLPLWLQKPYG----------GWENRETV 142

Query 184 INFARFAAYTAWKFDDLADEYSTMNEPNVVHSNGYMWVKSGFPPSYLNFELSRRVMVNLI 243
Sbjct 143 DAFARYAATVFERFGDKVKYWFTFNEPNVVVELGYL--YGGHPPGIVDPKAAYQVAHHML 200

Query 244 QAHARAYDAVKAISKK-PIGIIYANSSFTPLTDK--DAKAVELAEYDSRWIFFDAIIKGE 300
Sbjct 201 LAHALAVKAIKKINPKGKVGIILNLTPAYPLSDKPEDVKAAENADRFHNRFFLDAQVKGE 260

Query 301 --------------LMGVTRDDL----KGRLDWIGVNYYSRTVVKLIGEKSYVSIPGYGY 342
Sbjct 261 YPEYLEKELEENGILPEIEDGDLEILKENTVDFIGLNYYTPSRVK---AAEPRYVSGYGP 317
- Reciprocal best hit: bidirectional best hit
- Blast best hit: unidirectional best hit
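The difference between a unidirectional (BLAST) best hit and a reciprocal (bidirectional) best hit can be sketched as below. This is a minimal illustration under stated assumptions: hits are given as (query, subject, bit score) tuples, and the identifiers `a1`, `b1`, etc. and both function names are hypothetical; this is not the procedure used to build COGs.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject (unidirectional best hit).

    `hits` is an iterable of (query, subject, bit_score) tuples.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs that are each other's best hit."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)


# Toy search results between genome A and genome B (invented scores).
a_vs_b = [("a1", "b1", 388.0), ("a1", "b2", 120.0), ("a2", "b1", 95.0)]
b_vs_a = [("b1", "a1", 380.0), ("b2", "a1", 118.0)]

pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

In this toy data `a2` has `b1` as its best hit, but `b1` prefers `a1`, so only the `a1`/`b1` pair is reported as reciprocal.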
18Pfam families are based on the detection of domains
http://pfam.sanger.ac.uk
HMMs built from protein alignments, either local (covering a domain)
or global (covering the whole protein)
19TIGRfam
- Full length alignments.
- Domain alignments.
- Equivalogs: families of proteins with a specific
function.
- Superfamilies: families of homologous genes.
- HMMs
http://www.tigr.org/TIGRFAMs/
20How can we search Pfam and TIGRfam?
- GA (gathering method): search threshold used to build
the full alignment.
- TC (trusted cutoff): lowest sequence score and
domain score of a match in the full alignment.
- NC (noise cutoff): highest sequence score and
domain score of a match not in the full alignment.
[Figure labels: Trusted cutoff, Gathering cutoff, Noise cutoff, Hits to other models]
Query:       BChl_A  [M=357]
Accession:   PF02327.12
Description: Bacteriochlorophyll A protein
Scores for complete sequences (score includes all domains):
   --- full sequence ---   --- best 1 domain ---    -#dom-
    E-value  score  bias    E-value  score  bias    exp  N  Sequence               Description
    ------- ------ -----    ------- ------ -----   ---- --  --------               -----------
    0.00014   11.2   0.0    0.00024   10.5   0.0    1.2  1  tr|E0STV9|E0STV9_IGNAA Glycoside hydrolase family 1

Domain annotation for each sequence (and alignments):
>> tr|E0STV9|E0STV9_IGNAA  Glycoside hydrolase family 1 OS=Ignisphaera aggregans (strain DSM)
   #    score  bias  c-Evalue  i-Evalue hmmfrom  hmm to    alifrom  ali to    envfrom  env to     acc
 ---   ------ ----- --------- --------- ------- -------    ------- -------    ------- -------    ----
   1 !   10.5   0.0   1.1e-05   0.00024     217     273 ..     255     307 ..     240     321 .. 0.84

  Alignments for each domain:
  == domain 1  score: 10.5 bits;  conditional E-value: 1.1e-05
                  BChl_A 217 fshagsgvvdsisrwaelfpveklnkpasveagfrsdsqgievkvdgelpgvsvdag 273
  tr|E0STV9|E0STV9_IGNAA 255 FSKKPIGIVESIASWIPLREGDR----EAAEKGFRYNLWPIEVAVNGYLDDVYRDDL 307
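How a hit score relates to the three cutoffs can be sketched as a simple classification. This is only an illustration: the cutoff values below are invented placeholders (they are not the curated BChl_A thresholds), and the class and function names are assumptions for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cutoffs:
    gathering: float  # GA: threshold used to build the full alignment
    trusted: float    # TC: lowest score of a match in the full alignment
    noise: float      # NC: highest score of a match not in the full alignment


def classify_score(score, cutoffs):
    """Place a bit score relative to the trusted, gathering and noise cutoffs."""
    if score >= cutoffs.trusted:
        return "trusted"
    if score >= cutoffs.gathering:
        return "above gathering cutoff"
    if score > cutoffs.noise:
        return "between noise and gathering cutoffs"
    return "noise"


# Placeholder thresholds, NOT taken from the real Pfam model.
example_cutoffs = Cutoffs(gathering=25.0, trusted=25.3, noise=24.9)

# 11.2 bits is the full-sequence score reported in the search output above.
print(classify_score(11.2, example_cutoffs))
```

With any realistic cutoffs, a full-sequence score of 11.2 bits would be expected to fall below the noise cutoff, which is why a glycoside hydrolase matching a bacteriochlorophyll model should not be trusted.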
21InterPro. Composite pattern databases
- To simplify sequence analysis, the family
databases are being integrated to create a
unified annotation resource: InterPro
- Release 30.0 (Dec10) contains 21178 entries
- Central annotation resource, with pointers to its
satellite dbs
http://www.ebi.ac.uk/interpro/
22KEGG orthology
E-value < 10^-5, rank 5, 70% query length, 30%
identity
Xizeng Mao et al. Bioinformatics 21 (2005) 3787-3793
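The thresholds on this slide can be read as a filter on similarity-search hits. The sketch below assumes the direction of each comparison (rank at most 5, at least 70% of the query length aligned, at least 30% identity), since the operators are not legible in the slide text. The `Hit` fields, the function names and the KO labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    ko: str           # KO term of the subject (placeholder labels below)
    evalue: float     # E-value of the hit
    rank: int         # 1 = best hit for this query
    identity: float   # percent identity over the alignment
    aligned_len: int  # number of query residues covered by the alignment


def passes(hit, query_len, max_evalue=1e-5, max_rank=5,
           min_coverage=70.0, min_identity=30.0):
    """Apply the slide's thresholds; comparison directions are assumed."""
    coverage = 100.0 * hit.aligned_len / query_len
    return (hit.evalue < max_evalue
            and hit.rank <= max_rank
            and coverage >= min_coverage
            and hit.identity >= min_identity)


def assign_ko(hits, query_len):
    """Return the KO term of the best-ranked passing hit, or None."""
    passing = [h for h in hits if passes(h, query_len)]
    if not passing:
        return None
    return min(passing, key=lambda h: h.rank).ko


# Invented hits for a 400-residue query.
hits = [
    Hit("KO_A", 1e-50, 1, 45.0, 300),  # 75% coverage, passes every filter
    Hit("KO_B", 1e-3, 2, 60.0, 390),   # fails the E-value threshold
    Hit("KO_C", 1e-20, 3, 25.0, 380),  # fails the identity threshold
]
print(assign_ko(hits, query_len=400))
```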
23ENZYME
24Pathway collections: KEGG
- Contains information about biochemical pathways,
and protein interactions.
http://www.kegg.com
25Pathway collections: MetaCyc
26Functional annotation
http://imgweb.jgi-psf.org/img_er_v260/doc/img_er_ann.pdf
27RNA structural and functional annotation are
coupled
- SILVA alignments of rRNAs are used to generate
models
- Covariance models for each RNA class are used to
predict genes
28There is a plethora of specialized databases that
one needs to search
http://www.oxfordjournals.org/nar/database/c
29In most cases databases are interconnected, but
not all databases are updated regularly.
Changes of annotation in one database are not
reflected in others.
30There are multiple datasources to help organize
information and facilitate annotation
- Sequence databases
- Contain sequences deposited by various sources
- Protein classification databases
- Utilize sequence homology or other criteria to
group together proteins
- COG, Pfam, TIGRfam, InterPro, KO terms
- Specialized databases
- Start by searching for available resources
31Questions?
- Genome annotation (Functional)
- How do we know it is correct?
- How do we do it?
- Data collections
- Protein families
- Pathway collections