Title: Benchmarking Orthology in Eukaryotes
1Benchmarking Orthology in Eukaryotes
- 12-01-2004 Nijmegen
- Tim Hulsen
2Summary
- (1) An introduction to orthology
- (2) Orthology determination methods
- (3) Benchmarking
- co-expression
- conservation of co-expression
- SwissProt name
- (4) Conclusions
3An introduction to orthology
(from http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Orthology.html)
4Orthology determination methods
- Orthology databases/methods
- COG/KOG
- Inparanoid
- OrthoMCL
- Inclusiveness
- one-to-one/one-to-many/many-to-many
- organisms
- Best bidirectional hit/Phylogenetic trees
5Benchmarking orthology
- Quality of orthology is difficult to test: no gold standard
- Orthologs should have highly similar functions
- Measuring conservation of function
- functional annotation
- co-expression
- domain structure
6Benchmarked orthology determination methods
- BBH: Best Bidirectional Hit
- KOG: euKaryotic Orthologous Groups
- INP: INPARANOID
- MCL: OrthoMCL
- Z1H: all pairs with Z > 100
- COM: Comics Phylogenetic Tree Method
- EQN: Equal SwissProt Names
7Data set used
- Protein World: all proteins in all available (SPTREMBL) proteomes compared to each other
- Smith-Waterman with Z-value statistics
- 100 randomized shuffles to test significance of SW score
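- Shuffle example: original (O.) MFTGQEYHSV; shuffle 1. GQHMSVFTEY; shuffle 2. YMSHQFTVGE; etc.
[Figure: number of seqs versus SW score; rnd = randomized, ori = original; an original score 5 SD above the randomized scores → Z = 5]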
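The shuffle test on this slide can be sketched as follows. This is only a minimal illustration under stated assumptions: the scoring scheme (match/mismatch/gap values) is a toy stand-in, since the slides do not give the substitution matrix or gap penalties used for Protein World, and shuffling only the second sequence is also an assumption. The names `sw_score` and `z_value` are hypothetical.

```python
import random
import statistics


def sw_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment score with a toy scoring scheme (assumption)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            score = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(score)
            best = max(best, score)
        prev = cur
    return best


def z_value(query, target, n_shuffles=100, seed=0):
    """Z = (original score - mean of shuffled scores) / SD of shuffled scores."""
    rng = random.Random(seed)
    original = sw_score(query, target)
    shuffled_scores = []
    for _ in range(n_shuffles):
        residues = list(target)
        rng.shuffle(residues)  # shuffling keeps composition and length fixed
        shuffled_scores.append(sw_score(query, "".join(residues)))
    mean = statistics.mean(shuffled_scores)
    sd = statistics.pstdev(shuffled_scores)
    if sd == 0:
        # Degenerate case: all shuffles scored the same.
        return float("inf") if original > mean else 0.0
    return (original - mean) / sd
```

Because every shuffle has the same amino acid composition and length as the original sequence, the resulting Z-value is measured against a background that already accounts for both.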
8Data set used
- Z-value compensates for
- bias in amino acid composition
- sequence length
- Proteomes used
- Human 28,508 proteins
- Mouse 20,877 proteins
- → 28,508 × 20,877 = 595,161,516 pairs
9BBH method
- Easiest method: best bidirectional hit
- Human protein (1) → SW → best hit in mouse (2)
- Mouse protein (2) → SW → best hit in human (3)
- If 3 equals 1, the human and mouse protein are considered to be orthologs
- 12,817 human-mouse orthologous pairs (12,817 human, 12,817 mouse proteins)
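The three steps above can be sketched in a few lines. This is an illustration, not the code used for the benchmark: it assumes the pairwise scores are already available as dictionaries keyed by (query, target), and it breaks ties by keeping the first best hit seen. The function names are hypothetical.

```python
def best_hits(scores):
    """Map each query to its highest-scoring target.

    `scores` maps (query, target) -> similarity score (e.g. an SW score).
    Ties keep the first hit encountered (assumption).
    """
    best = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def bidirectional_best_hits(human_vs_mouse, mouse_vs_human):
    """Return (human, mouse) pairs that are each other's best hit."""
    human_best = best_hits(human_vs_mouse)
    mouse_best = best_hits(mouse_vs_human)
    return sorted(
        (h, m) for h, m in human_best.items() if mouse_best.get(m) == h
    )
```

Each protein has at most one best hit, so the result is strictly one-to-one, which is why the human and mouse protein counts on the slide are equal.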
10KOG method
- KOG: euKaryotic Orthologous Groups
- Eukaryotic version of COG, Clusters of Orthologous Groups
- COG method
- All-vs-all seq. comparison (BLAST)
- Detect and collapse obvious paralogs
- Sp1-Sp1, Sp2-Sp2, Sp1-Sp2, etc. for other species → determine BBHs
- E(Hs-Hs) < E(BBH) → paralogs; E(Mm-Mm) < E(BBH) → paralogs
11KOG method
- Detect triangles of best hits
- Merge triangles with a common side to form COGs
- Case-by-case manual analysis, examination of
large COGs (might be split up)
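The triangle step can be illustrated with a small graph sketch. It assumes that best-hit relations between proteins of different species are already available as undirected edges, and it leaves out paralog collapsing and the manual curation step entirely. The function names are hypothetical, and this is not the COG pipeline itself.

```python
from itertools import combinations


def find_triangles(bbh_edges):
    """Return all triangles (three mutually connected proteins) in a best-hit graph.

    Edges are assumed to link proteins from different species, so every
    triangle spans three species.
    """
    neighbours = {}
    for a, b in bbh_edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    triangles = set()
    for a, b in bbh_edges:
        for c in neighbours[a] & neighbours[b]:
            triangles.add(frozenset((a, b, c)))
    return triangles


def merge_triangles(triangles):
    """Merge triangles that share a side (two proteins) into groups."""
    triangles = list(triangles)
    parent = list(range(len(triangles)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    side_owner = {}
    for i, tri in enumerate(triangles):
        for side in combinations(sorted(tri), 2):
            if side in side_owner:
                parent[find(i)] = find(side_owner[side])
            else:
                side_owner[side] = i
    groups = {}
    for i, tri in enumerate(triangles):
        groups.setdefault(find(i), set()).update(tri)
    return sorted(sorted(g) for g in groups.values())
```

A pair of proteins connected by a best hit but not part of any triangle ends up in no group, which matches the requirement that a group needs support from at least three species.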
12KOG method
- KOG method mainly the same as COG method; special attention for eukaryotic multidomain structure
- Group orthologies: many-to-many
- Cognitor: assign a KOG to each protein
- (mouse not yet in KOG)
- 810,697 human-mouse orthologous pairs (20,478 human, 15,640 mouse proteins)
Tatusov et al., The COG database: an updated version includes eukaryotes, BMC Bioinformatics. 2003 Sep 11;4(1):41
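The large pair count follows from the many-to-many group definition: a group with h human and m mouse members contributes h × m pairs. A minimal sketch, assuming groups are given as lists of (species, protein) tuples; the group labels and protein identifiers below are made up for illustration.

```python
from itertools import product


def pairs_from_groups(groups):
    """Expand ortholog groups into all human-mouse pairs (many-to-many)."""
    pairs = set()
    for members in groups.values():
        humans = [p for species, p in members if species == "human"]
        mice = [p for species, p in members if species == "mouse"]
        pairs.update(product(humans, mice))
    return pairs


# Hypothetical toy groups: 2 x 3 pairs from the first, 1 x 1 from the second.
toy_groups = {
    "group_A": [("human", "h1"), ("human", "h2"),
                ("mouse", "m1"), ("mouse", "m2"), ("mouse", "m3")],
    "group_B": [("human", "h3"), ("mouse", "m4")],
}
toy_pairs = pairs_from_groups(toy_groups)
```

Here three human and four mouse proteins already yield seven pairs, so a few large groups can dominate the total.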
13INP method
- All-vs-all followed by a number of extra steps to add in-paralogs → many-to-many relations possible
- 54,553 human-mouse orthologous pairs (19,504 human, 17,030 mouse proteins)
Remm et al., Automatic clustering of orthologs and in-paralogs from pairwise species comparisons, J Mol Biol. 2001 Dec 14;314(5):1041-52
14MCL method
- All-vs-all BLASTP → determine orthologs + recent paralogs → use Markov clustering to determine ortholog groups
- 7,322 human-mouse orthologous pairs (6,332 human, 6,115 mouse proteins)
Li et al., OrthoMCL: identification of ortholog groups for eukaryotic genomes, Genome Res. 2003 Sep;13(9):2178-89
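To give an idea of what Markov clustering does, here is a toy version of the expansion/inflation loop in pure Python. It is a sketch under assumptions, not OrthoMCL or the MCL program: the input is a small symmetric similarity matrix, the inflation value of 2.0 and the fixed iteration count are arbitrary choices, and the names `mcl` and `normalise_columns` are hypothetical.

```python
def normalise_columns(m):
    """Scale every column so that it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]


def mcl(adjacency, inflation=2.0, iterations=50):
    """Toy Markov clustering on a symmetric weighted adjacency matrix."""
    n = len(adjacency)
    # Add self-loops, then turn similarities into transition probabilities.
    m = [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalise_columns(m)
    for _ in range(iterations):
        # Expansion: square the matrix (random walks of length two).
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: raise entries to a power and renormalise,
        # which strengthens strong links and weakens weak ones.
        m = normalise_columns([[v ** inflation for v in row] for row in m])
    # Each row that still carries weight lists the members of one cluster.
    clusters = set()
    for row in m:
        members = frozenset(j for j, v in enumerate(row) if v > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)
```

The dense O(n^3) matrix product is only workable for tiny examples; real proteome-scale runs need a sparse implementation.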
15Z1H method
- All human-mouse pairs with Z > 100 in Protein World set are considered to be orthologs
- 290,176 human-mouse orthologous pairs (19,055 human, 16,149 mouse proteins)
16COM method
- Human proteome compared against all 9 eukaryotic proteomes in Protein World
- Selection: Z > 20, RH > 0.5 QL
- 24,263 groups
- Tree scanning: Hs-Mm 85,848 pairs, Hs-Dm 55,934 pairs, etc.
17COM method
- Example: BMP6 (Bone Morphogenetic Protein 6)
- → 5 Hs-Mm orthologous relations defined
18EQN method
- Consider all Hs-Mm pairs with equal SwissProt names to be orthologous
- e.g. ANDR_HUMAN ↔ ANDR_MOUSE
- Used as benchmark later on
- 5,214 Hs-Mm orthologous pairs (5,214 human, 5,214 mouse proteins)
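The EQN rule can be sketched as a match on the part of the SwissProt entry name before the species suffix. This is a minimal illustration; it assumes entry names of the form NAME_SPECIES, and the function name is hypothetical.

```python
def equal_name_pairs(human_entries, mouse_entries):
    """Pair entries whose SwissProt name prefix (before '_') is identical."""
    mouse_by_prefix = {}
    for entry in mouse_entries:
        prefix, _, species = entry.rpartition("_")
        if species == "MOUSE":
            mouse_by_prefix[prefix] = entry
    pairs = []
    for entry in human_entries:
        prefix, _, species = entry.rpartition("_")
        if species == "HUMAN" and prefix in mouse_by_prefix:
            pairs.append((entry, mouse_by_prefix[prefix]))
    return pairs
```

Since an entry name is unique within a species, each prefix matches at most one protein on either side, which is why the EQN set is strictly one-to-one.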
19Benchmarking through co-expression
- Comparison of expression profiles of each orthologous gene pair
- Using GeneLogic Expressor data set
organism  samples  fragments  tissue categories  SNOMED tissue categories
human        3269      44792                115                        15
mouse         859      36701                 25                        12
20Expression tissue categories
HUMAN                         MOUSE
 1 Blood vessel                1 Blood vessel
 2 Cardiovascular system       2 Cardiovascular system
 3 Digestive organs            3 Digestive organs
 4 Digestive system            4 Digestive system
 5 Endocrine gland             -
 6 Female genital system       5 Female genital system
 7 Hematopoietic system        6 Hematopoietic system
 8 Integumentary system        7 Integumentary system
 9 Male genital system         8 Male genital system
10 Musculoskeletal system      9 Musculoskeletal system
11 Nervous system             10 Nervous system
12 Product of conception       -
13 Respiratory system         11 Respiratory system
14 Topographic region          -
15 Urinary tract              12 Urinary tract
21Co-expression calculation
- Calculation of the correlation coefficient:
$$ r = \frac{N\sum xy - (\sum x)(\sum y)}{\sqrt{\left(N\sum x^2 - (\sum x)^2\right)\left(N\sum y^2 - (\sum y)^2\right)}} $$
- Measured over the 12 corresponding SNOMED tissue categories
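The correlation coefficient can be computed directly from the sums in the formula. A minimal sketch, assuming each gene's expression profile is a list with one value per corresponding tissue category (N = 12 here); the function name is hypothetical.

```python
import math


def correlation(x, y):
    """Pearson correlation coefficient r of two equally long expression profiles."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("profiles must be non-empty and of equal length")
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    if denominator == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return numerator / denominator
```

A value near 1 means the two genes rise and fall together across tissues; a value near -1 means one is high where the other is low.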
22Co-expression example 1
High correlation: 0.914167
23Co-expression example 2
Low correlation: -0.935731
24Benchmarking through co-expression
25Benchmarking through conservation of co-expression
- Human: Gene A, Gene B; co-expression Cab (-1 < corr. < 1)
- Mouse: Gene A, Gene B; Cab > Cab
- Co-expression calculated over 115 tissues in human, 25 in mouse
- All-vs-all: human 40,678 chip fragments, mouse 29,910 chip fragments
- → Increases probability that A and B are involved in the same process
26Benchmarking through conservation of co-expression
- Gene Ontology (GO) database: hierarchical system of function and location descriptions
- Orthologs are in same functional category when they are in the same 4th-level GO Biological Process class
27Benchmarking through conservation of co-expression
28Benchmarking through SwissProt name
- How many of the predicted orthologous relations have equal SwissProt names (EQN set in other benchmarks)
- Pro: reliable because checked by hand
- Con: assumes only one-to-one relationships are possible
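This benchmark amounts to asking which fraction of a method's predicted pairs also occurs in the EQN set. A minimal sketch, assuming both sets are given as (human, mouse) name pairs and that all predicted pairs form the denominator; the slides do not state the exact normalisation, so that choice is an assumption. The function name is hypothetical.

```python
def fraction_with_equal_names(predicted_pairs, eqn_pairs):
    """Fraction of predicted orthologous pairs that are also in the EQN set."""
    predicted = set(predicted_pairs)
    if not predicted:
        return 0.0
    return len(predicted & set(eqn_pairs)) / len(predicted)
```

A many-to-many method is penalised by construction: when one human protein is paired with several mouse proteins, at most one of those pairs can have equal names.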
29Benchmarking through SwissProt name
(ALL: if all possible human-mouse pairs (or a random fraction) were orthologs)
30Conclusions
- Hard to point out the best orthology determination method
- In most cases: less = better, more = worse
- Method that should be used depends on the research question: do you need few reliable orthologies or many less reliable orthologies?
- Future directions: look at conservation of domain structure as a benchmark
31Credits
- Martijn Huynen
- Peter Groenen
- Comics Group
- Gert Vriend
- Rest of CMBI
- Organon Bioinf. Group