Title: A QuickStart Guide to Using PhyloFacts
1 A Quick-Start Guide to Using PhyloFacts
2 Overview
- Background
- Browsing the library and reading PhyloFacts books
- Submitting sequences for functional and structural classification
- Database queries
3 Background
Retrieve paper
http://phylogenomics.berkeley.edu/phylofacts/
4 Background
- Simple overview of different webservers (how to use)
- Detailed description of PhyloFacts construction and recommended use and interpretation
5 Homology-based functional annotations are fraught with systematic error
- Gilks et al., "Modeling the percolation of annotation errors in a database of protein sequences," Bioinformatics, 2002.
- Galperin and Koonin, "Sources of Systematic Error in Functional Annotation of Genomes," In Silico Biology, 1998.
- Brenner, "Errors in Genome Annotation," Trends Genet., 1999.
- Brown and Sjölander, "Functional Classification using Phylogenomic Inference," PLoS Computational Biology, 2006.
6 Structural phylogenomic inference of protein function addresses these errors
7 Phylogenomic library construction
- Cluster genome into global homology groups
8 Types of PhyloFacts books
- Global homology: sequences that share a common domain architecture
  - Alignable over entire length
  - Homologs retrieved using FlowerPower
- Domain: sequences that contain a structural domain
  - Seeded using a PDB structure or SCOP domain
- Conserved region: sequences that share a region of similarity
  - Correspondence to structure unknown
- Motif: short regions (typically <50 aa) conserved for functional reasons
9 Proteins are composed of modular structural domains which are found in different domain architectures
[Figure: example domains, Leucine-Rich Repeat (LRR) and Toll-Interleukin Receptor (TIR) domain]
PhyloFacts Global Homology books include only those sequences that can be predicted to share the same domain architecture (series of structural domains). These are more suitable for predicting function. PhyloFacts Domain books model individual domains that may be found in different domain architectures; they thus include sequences with different overall folds and functions.
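The difference between the two book types can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sequence identifiers, their domain assignments, the "Kinase" domain, and both function names are hypothetical and are not taken from PhyloFacts itself.

```python
from collections import defaultdict

# Hypothetical sequences, each annotated with its ordered series of domains.
# Identifiers and domain assignments are illustrative, not PhyloFacts data.
sequences = {
    "seqA": ["TIR", "LRR"],
    "seqB": ["TIR", "LRR"],
    "seqC": ["TIR"],
    "seqD": ["LRR", "Kinase"],
}

def group_by_architecture(seqs):
    """Global-homology-style grouping: identical ordered series of domains."""
    groups = defaultdict(list)
    for name, domains in seqs.items():
        groups[tuple(domains)].append(name)
    return {arch: sorted(names) for arch, names in groups.items()}

def group_by_domain(seqs):
    """Domain-style grouping: every sequence that contains the domain."""
    groups = defaultdict(list)
    for name, domains in seqs.items():
        for domain in set(domains):
            groups[domain].append(name)
    return {domain: sorted(names) for domain, names in groups.items()}

by_arch = group_by_architecture(sequences)
by_domain = group_by_domain(sequences)

print(by_arch[("TIR", "LRR")])  # ['seqA', 'seqB']
print(by_domain["TIR"])         # ['seqA', 'seqB', 'seqC']
```

Grouping by architecture keeps seqC apart from seqA and seqB, whereas grouping by the TIR domain alone pools all three, which is why a domain-level group can mix sequences with different overall folds and functions.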
10 KEGG Orthology Group K00002 spans five domain architectures
- Group 1: Zinc-binding dehydrogenase (all cellular organisms)
  - Domains: ADH_N, ADH_zinc_N
- Group 2: Iron-binding dehydrogenase (all cellular organisms)
- Group 3: Cofactor-binding domain of zinc-binding dehydrogenase (Bacteria/Eukarya)
  - Domains: ADH_zinc_N
- Group 4: Sequences of unknown function (Halobacterium)
  - Domains: ADH_zinc_N, PF02894
- Group 5: Aldo-keto reductase (Bacteria/Eukarya)
11 Summary
- Each book in PhyloFacts contains
  - a multiple sequence alignment
  - one or more phylogenetic trees
  - hidden Markov models for each subfamily and family
  - predicted PFAM domains
  - predicted trans-membrane helices
  - predicted subfamilies
  - homologous solved 3D structures
  - predicted functional residues
  - GO annotations and evidence codes
  - UniProt definitions
  - links to literature
  - links to genome databases and other external resources
- Graphical user interfaces to view
  - Multiple sequence alignment
  - Phylogenetic tree(s)
  - 3D structure
- PhyloFacts is an encyclopedia of protein families across the Tree of Life
- The majority of PhyloFacts books represent proteins sharing a common domain architecture
- The second largest fraction are based on protein structures and structural domains
- Functional annotation of a sequence included in a PhyloFacts book is enabled by examination of the sequence in its evolutionary context
- New sequences can be classified to PhyloFacts families and subfamilies using the Sequence Search page
- Results include functional classification, prediction of 3D structure and detection of remote homologs
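For readers who want to handle book contents programmatically, the per-book items listed in this summary could be mirrored by a simple record type. The sketch below is an assumption throughout: the class name `Book`, its field names, and the `subfamily_of` helper are hypothetical and do not describe the PhyloFacts schema or any PhyloFacts API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Book:
    """Hypothetical record mirroring a subset of the per-book contents listed in the summary."""
    name: str
    book_type: str  # e.g. "Global homology", "Domain", "Conserved region", "Motif"
    alignment: List[str] = field(default_factory=list)             # aligned sequences
    trees: List[str] = field(default_factory=list)                 # e.g. Newick strings
    subfamilies: Dict[str, List[str]] = field(default_factory=dict)  # subfamily -> member IDs
    pfam_domains: List[str] = field(default_factory=list)          # predicted PFAM domains
    go_annotations: Dict[str, str] = field(default_factory=dict)   # GO term -> evidence code

    def subfamily_of(self, seq_id: str) -> Optional[str]:
        """Return the subfamily containing seq_id, or None if it is not in the book."""
        for subfamily, members in self.subfamilies.items():
            if seq_id in members:
                return subfamily
        return None

# Illustrative usage with made-up identifiers.
book = Book(
    name="example_book",
    book_type="Domain",
    subfamilies={"sub1": ["seqA", "seqB"], "sub2": ["seqC"]},
)
print(book.subfamily_of("seqC"))  # sub2
print(book.subfamily_of("seqZ"))  # None
```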