Title: Annotation transfer by homology
2Annotation transfer by homology
- Status quo approach to protein function prediction
  - Given a gene (or protein) of unknown function
  - Run BLAST to find homologs
  - Identify the top BLAST hit(s)
  - If the score is significant, transfer the annotation
  - If resources permit, predict domains using PFAM or CDD
- Problems
  - Approach fails completely for 30% of genes
  - Of those with annotations, only 3% have any supporting experimental evidence
  - 97% have had functions predicted by homology alone
  - High error rate
Based on analysis of >300K proteins in the UniProt database
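The status quo procedure above can be sketched in a few lines. This is a minimal illustration, assuming the BLAST hits have already been parsed into records; the `Hit` type, the `transfer_annotation` name, and the 1e-5 significance cutoff are hypothetical choices, not taken from the slides.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Hit:
    subject_id: str
    evalue: float
    annotation: Optional[str]  # None if the database entry is unannotated


def transfer_annotation(hits: Sequence[Hit],
                        max_evalue: float = 1e-5) -> Optional[str]:
    """Return the annotation of the top (lowest E-value) hit if it is significant.

    The cutoff is an assumed value for illustration only.
    """
    if not hits:
        return None  # no homologs found: the approach fails for this gene
    top = min(hits, key=lambda h: h.evalue)
    if top.evalue > max_evalue:
        return None  # top hit is not significant
    return top.annotation


hits = [Hit("hit_a", 1e-30, "kinase"), Hit("hit_b", 1e-3, "transporter")]
print(transfer_annotation(hits))
```

Note that nothing in this sketch checks whether the transferred annotation has experimental support, which is how homology-only annotations can propagate errors.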
3Tomato Cf-2 Bioinformatics Analysis
Domain fusion and fission events complicate function prediction by homology, especially for very common domains (e.g., LRR regions). Domain structure analysis (e.g., PFAM) is often critical.
Tomato Cf-2 (GI:1587673): Dixon, Jones, Keddie, Thomas, Harrison and Jones JDG, Cell (1996)
Top BLAST hit in Arabidopsis is an RLK!
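One way to guard against this kind of error is to compare domain architectures before transferring an annotation. The sketch below assumes each protein has already been reduced to an ordered list of domain names (for example from a PFAM scan); the function names and the simplified domain lists are illustrative assumptions, not curated data for Cf-2 or any specific Arabidopsis RLK.

```python
from typing import Sequence, Set


def shares_architecture(query_domains: Sequence[str],
                        hit_domains: Sequence[str]) -> bool:
    """True only if both proteins have the same ordered list of domains."""
    return list(query_domains) == list(hit_domains)


def extra_domains(query_domains: Sequence[str],
                  hit_domains: Sequence[str]) -> Set[str]:
    """Domains present in the hit but absent from the query."""
    return set(hit_domains) - set(query_domains)


# Simplified, illustrative architectures (assumption, not database output):
# an LRR-only protein versus an LRR protein with an added kinase domain.
lrr_only = ["LRR", "LRR", "LRR"]
lrr_kinase = ["LRR", "LRR", "LRR", "Pkinase"]

print(shares_architecture(lrr_only, lrr_kinase))  # architectures differ
print(extra_domains(lrr_only, lrr_kinase))        # the kinase domain
```

A top hit that carries domains the query lacks is a signal that its annotation should not be transferred wholesale.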
4Errors due to domain shuffling
(sic)
5Error presumably due to non-orthology of database
hits used for annotation
6Phylogenetic analysis suggests it's more likely a Biogenic Amine GPCR
7Human neutral sphingomyelinase
or bacterial isochorismate synthase?
8Database annotation errors
- Main sources of annotation errors
  - Domain shuffling
  - Gene duplication (failure to discriminate between orthologs and paralogs)
  - Existing database annotation errors
(Figure labels: Propagation of existing database annotation errors; Errors in gene structure; Contamination; Other)
Galperin and Koonin, "Sources of systematic error in functional annotation of genomes: domain rearrangement, non-orthologous gene displacement and operon disruption," In Silico Biol. 1998
9Phylogenomic inference
Eisen, "Phylogenomics: Improving Functional Predictions for Uncharacterized Genes by Evolutionary Analysis," Genome Research 1998
Sjölander, "Phylogenomic inference of protein molecular function: advances and challenges," Bioinformatics 2004
10Piet Hein, Grooks
11There is nothing more difficult to take in hand,
more perilous to conduct, or more uncertain in
its success, than to take the lead in the
introduction of a new order of things. Because
the innovator has for enemies all those who have
done well under the old conditions, and lukewarm
defenders in those who may do well under the new.
This coolness arises partly from the
incredulity of men, who do not readily believe
in new things until they have had a long
experience of them.
12Construction of genome-scale phylogenomic
libraries
Cluster genome into global homology groups
13Berkeley Universal Proteome Phylogenomic Explorer
9,707 protein family books and 708K HMMs and
expanding daily
http://phylogenomics.berkeley.edu/UniversalProteome
14Protein fold prediction
12% identity
VirB4 TrwB structure (1E9RA)
Active site
15Example Book: Voltage-gated K+ channels
16
- SCI-PHY subfamilies supported by ML tree, and also consistent with subtype and phylogenetic distribution
- (Only one branch of ML tree displayed)
17GO annotations for Shal subfamily
18Database queries
Look up protein family books based on the
annotations associated with any sequence. Queries
can be based on GO biological process, PFAM
domains, UniProt accession numbers, etc.
19Key algorithms in PhyloFacts library construction
What clustering methods are appropriate for
inference of protein function?
What alignment methods are accurate?
How to mask?
What tree methods to use?
How to root a tree?
Can we define functional subfamilies
automatically?
20Fraction superposable positions drops with
evolutionary divergence
21FlowerPower
- Clustering global (or glocal) homologs
- Minimize profile drift
- Improved alignment accuracy
Nandini Krishnamurthy, Ph.D.
22Step 1: Construct SearchDB
Q = query. Construct SearchDB using PSI-BLAST against target database
23Step 2: Select and align core set.
Inclusion criteria: E-value 1e-10; bi-directional coverage
MUSCLE multiple alignment (Edgar, 2003)
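The two inclusion criteria can be expressed as a simple filter. This is a sketch under stated assumptions: the slide gives the E-value 1e-10 but not the comparison operator or the coverage threshold, so the `<=` test and the 0.7 minimum coverage below are hypothetical, as are the function names.

```python
def coverage(start: int, end: int, length: int) -> float:
    """Fraction of a sequence covered by an aligned region (1-based, inclusive)."""
    return (end - start + 1) / length


def passes_core_criteria(evalue: float,
                         q_start: int, q_end: int, q_len: int,
                         s_start: int, s_end: int, s_len: int,
                         max_evalue: float = 1e-10,
                         min_coverage: float = 0.7) -> bool:
    """Require a significant E-value and good coverage of BOTH sequences.

    min_coverage is an assumed value; the slides do not state the threshold.
    """
    both_covered = (coverage(q_start, q_end, q_len) >= min_coverage
                    and coverage(s_start, s_end, s_len) >= min_coverage)
    return evalue <= max_evalue and both_covered


# A hit that aligns over most of both proteins is kept ...
print(passes_core_criteria(1e-20, 1, 90, 100, 5, 95, 100))
# ... while a hit covering only a small part of a much longer protein is not.
print(passes_core_criteria(1e-20, 1, 90, 100, 1, 90, 400))
```

Requiring coverage in both directions is what restricts the core set to global homologs rather than proteins that merely share one domain with the query.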
24Step 3: Run SCI-PHY to identify subfamilies and build subfamily HMMs (SHMMs)
BETE subfamily identification: Sjölander 1998
SHMM construction: Brown et al, 2004
25Step 4: SHMMs compete for sequences from SearchDB. Sequences meeting criteria are aligned to their closest SHMM.
26Step 5: Run SCI-PHY on extended alignment to identify new subfamilies and construct SHMMs.
27Iterate until convergence
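The control flow of Steps 4 through 6 is a recruit-and-repeat loop that stops when a round adds no new sequences. The sketch below shows only that loop structure. The `accepts` callback stands in for SHMM scoring and the inclusion criteria, and the integer "sequences" in the usage example are a toy stand-in; none of these names come from FlowerPower itself.

```python
from typing import Callable, Hashable, Iterable, Set


def iterate_until_convergence(core: Iterable[Hashable],
                              search_db: Iterable[Hashable],
                              accepts: Callable[[Set[Hashable], Hashable], bool],
                              max_rounds: int = 50) -> Set[Hashable]:
    """Grow a cluster from a core set until no further members are recruited.

    accepts(members, candidate) is a placeholder for "candidate meets the
    criteria against models built from the current members".
    """
    members = set(core)
    candidates = list(search_db)
    for _ in range(max_rounds):
        recruited = {s for s in candidates
                     if s not in members and accepts(members, s)}
        if not recruited:
            break  # converged: the cluster did not change in this round
        members |= recruited
    return members


# Toy example: a candidate is accepted if it is within distance 1 of a member.
def near(members, s):
    return any(abs(m - s) <= 1 for m in members)


print(sorted(iterate_until_convergence([5], [1, 2, 3, 5, 6, 7, 9], near)))
```

In the toy run, 6 is recruited in the first round and 7 in the second; 3 and 9 are never close enough to any member, which mirrors how strict criteria limit profile drift.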
28Comparing FlowerPower, BLAST, PSI-BLAST and UCSC T2K
Test: Clustering global homologs
Agreement at domain structure determined by PFAM. SCOP used to cluster PFAM domains into structural equivalence classes.
29Subfamily Classification In PHYlogenomics
(SCI-PHY)
Nandini Krishnamurthy, Ph.D., Duncan Brown
- Agglomerative clustering
- Input: MSA
- Initialize: construct profile (1) for each row in MSA
- While (clusters > 1)
  - Join closest (2) pair of clusters
  - Re-estimate profile (1)
  - Compute encoding cost (3) for this stage
- /* cut tree using minimum encoding cost */
- (1) Use Dirichlet mixture densities
- (2) Distance function: relative entropy
Detection of critical positions
Sjölander, K. "Phylogenetic inference in protein superfamilies: Analysis of SH2 domains," Proceedings of Conference Intelligent Systems for Molecular Biology (ISMB) 1998
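The clustering loop above can be sketched as follows. This is a simplified illustration, not the SCI-PHY implementation: profiles are smoothed with a flat pseudocount instead of Dirichlet mixture densities, the distance is a symmetrized relative entropy summed over columns, and the encoding-cost cut is omitted because it depends on the Dirichlet mixture prior. All function names are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def column_profile(residues, pseudocount=1.0):
    """Amino acid distribution for one column, with a flat pseudocount
    (a stand-in for Dirichlet mixture smoothing)."""
    counts = Counter(r for r in residues if r in ALPHABET)
    total = sum(counts.values()) + pseudocount * len(ALPHABET)
    return [(counts[a] + pseudocount) / total for a in ALPHABET]


def build_profile(rows):
    """One distribution per alignment column."""
    return [column_profile(col) for col in zip(*rows)]


def relative_entropy(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))


def profile_distance(prof_a, prof_b):
    """Symmetrized relative entropy, summed over columns."""
    return sum(relative_entropy(p, q) + relative_entropy(q, p)
               for p, q in zip(prof_a, prof_b))


def agglomerate(msa):
    """Return the list of merged clusters (as row indices), in merge order."""
    clusters = [[i] for i in range(len(msa))]
    merges = []
    while len(clusters) > 1:
        # Re-estimate a profile for every current cluster.
        profiles = [build_profile([msa[i] for i in c]) for c in clusters]
        # Find the closest pair of clusters.
        _, a, b = min((profile_distance(profiles[a], profiles[b]), a, b)
                      for a in range(len(clusters))
                      for b in range(a + 1, len(clusters)))
        merged = sorted(clusters[a] + clusters[b])
        merges.append(merged)
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges


print(agglomerate(["ACDE", "ACDE", "WYWY", "WYWF"]))
```

Each entry in the returned list corresponds to one stage of the algorithm; a full implementation would score every stage with the encoding cost and cut at the minimum.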
30Subfamilies identified using minimum encoding
cost principles
- Each stage of the algorithm defines a different set of alignments, one for each cluster (subfamily).
- Find the point during the clustering where the encoding cost of the alignments is minimal. This defines the subfamily decomposition.
N = number of sequences. S = number of subfamilies. n_c,1 ... n_c,S are the amino acids aligned by subfamilies 1 through S at column c. Θ represents the Dirichlet mixture prior.
31SCI-PHY analysis of selected GPCRs
Venter et al, "The sequence of the human genome," (2001) Science.
Sjölander, "Phylogenomic inference of protein molecular function: advances and challenges," (2004) Bioinformatics
32Key residue prediction using subfamily and
family-wide conservation analysis
Elizabeth Hua-Mei Kellogg, Ryan Ritterson, Nandini Krishnamurthy
Parker JS, Roe SM, Barford D., EMBO J., 2004; Tanaka Hall, T., Structure 2005; Rivas et al, 2005
(Figure residue labels: D, RD, E, YAH)
33Function Prediction Using HMMs
7TM GPCR
ABC Transporter
Amidohydrolase
ATPase
Family
34Subfamily HMM construction
- At completely conserved positions, and subfamily gapped positions: Use match state distributions estimated for general (family) HMM.
- At other positions
  - Estimate Dirichlet mixture density posterior for each subfamily at each position separately.
  - Use Dirichlet density posteriors to weight contributions from other subfamilies.
  - Compute amino acid distribution using weighted counts and standard Dirichlet procedure.
Error
Brown et al, "Subfamily HMMs in functional genomics," (2005) Pacific Symposium on Biocomputing
35Subfamily HMMs increase the separation between
true and false positives
- 515 unique SCOP folds
- PFAM full MSAs
- Scored against Astral PDB90
1.5% error rate in subfamily classification using top-scoring SHMM
36SATCHMO: Simultaneous Alignment and Tree Construction using Hidden Markov mOdels
Xia Jiang, Nandini Krishnamurthy, Duncan Brown, Michael Tung, Jake Gunn-Glanville, Bob Edgar
Edgar, R., and Sjölander, K., "SATCHMO: Sequence Alignment and Tree Construction using Hidden Markov models," Bioinformatics. 2003 Jul 22;19(11):1404-11
37SATCHMO motivation
- Structural divergence within a superfamily means that
  - Multiple sequence alignment (MSA) is hard
  - Alignable positions vary according to degree of divergence
- Current MSA methods are not designed to handle this variability
  - Assume globally alignable, all columns (e.g., ClustalW)
    - Over-aligns, i.e., aligns regions that are not superposable
  - or identify and align only highly conserved positions (e.g., SAM software with HMM "surgery")
- Challenge
  - Different degrees of alignability in different sequence pairs, different regions
  - Masking protocols are lossy: loop regions may be variable across the family but may be critical for function!
38SATCHMO algorithm
- Input: unaligned sequences
- Initialize: a profile HMM is constructed for each sequence.
- While (clusters > 1)
  - Use profile-profile scoring to select clusters to join
  - Align clusters to each other, keeping columns fixed
  - Analyze joint MSA to predict which positions appear to be structurally similar; these are retained, the remainder are masked.
  - Construct a profile HMM for the new masked MSA
- Output: Tree and MSA
43Alignment of proteins with different overall folds
44Assessing sequence alignment with respect to
structural alignment
Xia Jiang, Duncan Brown, Nandini Krishnamurthy
45Future work: Interactive specificity position identification
- Enable users to select subtrees for analysis
- Identify positions conserved within each subtree, but which differentiate the two
- Plot over MSA and on structure (if available)
Donald and Shakhnovich, NAR 2005
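The second bullet can be illustrated with a strict version of the test: a column qualifies if each subtree is perfectly conserved there and the two conserved residues differ. This is a minimal sketch assuming both subtrees come from the same MSA (equal-length rows); the function name is hypothetical, and a practical tool would likely tolerate near-conservation rather than require identity.

```python
from typing import List, Sequence


def specificity_positions(group_a: Sequence[str],
                          group_b: Sequence[str]) -> List[int]:
    """0-based columns conserved within each group but different between them.

    Columns containing a gap ('-') in either group are skipped.
    """
    positions = []
    columns = zip(zip(*group_a), zip(*group_b))
    for col, (col_a, col_b) in enumerate(columns):
        res_a, res_b = set(col_a), set(col_b)
        conserved = len(res_a) == 1 and len(res_b) == 1
        if conserved and res_a != res_b and "-" not in res_a | res_b:
            positions.append(col)
    return positions


subtree_1 = ["ACDK", "ACEK"]
subtree_2 = ["ACDR", "ACER"]
print(specificity_positions(subtree_1, subtree_2))
```

In the toy alignment, the first two columns are conserved family-wide, the third varies within both subtrees, and only the last column (K versus R) separates the two subtrees.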
46Major challenge: Phylogenetic uncertainty
Given A (gene tree of unknown function), gene trees B and C (characterized function). Predict function for A.
(Figure: three alternative tree topologies for A, B and C)
Problem: use three phylogenetic tree methods, get 3 or more trees! Change the MSA, you also change the tree.
Need: Better simulation studies, benchmark datasets
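One simple way to make this disagreement visible is to tally what each tree implies for A. The sketch below is an illustration only, not a method proposed in the slides: it assumes each tree has already been reduced to a single predicted function for A (or `None` when A does not group with one characterized gene), and the function name and labels are hypothetical.

```python
from collections import Counter
from typing import Dict, Optional, Sequence


def prediction_support(predictions: Sequence[Optional[str]]) -> Dict[str, float]:
    """Fraction of trees supporting each predicted function for A.

    A prediction of None (tree does not resolve A's closest characterized
    relative) is counted under the label 'unresolved'.
    """
    votes = Counter(p if p is not None else "unresolved" for p in predictions)
    total = sum(votes.values())
    return {label: n / total for label, n in votes.items()}


# Three tree methods, three different answers.
print(prediction_support(["function_of_B", "function_of_C", None]))
```

Equal support for every outcome, as in this example, is the situation the slide warns about: the prediction depends entirely on which tree method (or which MSA) was used.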
47http://phylogenomics.berkeley.edu
Berkeley Phylogenomics Group
PI: Kimmen Sjölander
Nandini Krishnamurthy, Ph.D., Duncan Brown, Sriram Sankararaman, Xia Jiang, Jake Gunn-Glanville
Lead programmer and web administrator: Dan Kirshner
This work is supported in part by a Presidential Early Career Award for Scientists and Engineers from the NSF, and by an R01 from the NHGRI (NIH).