Annotation transfer by homology - PowerPoint PPT Presentation

Slides: 48

Provided by: kimmens6

Learn more at: http://archive.dimacs.rutgers.edu

Transcript and Presenter's Notes

1
(No Transcript)
2
Annotation transfer by homology

Status quo approach to protein function prediction
Given a gene (or protein) of unknown function:
Run BLAST to find homologs
Identify the top BLAST hit(s)
If the score is significant, transfer the annotation
If resources permit, predict domains using PFAM or CDD
Problems
Approach fails completely for 30% of genes
Of those with annotations, only 3% have any supporting experimental evidence
97% have had functions predicted by homology alone
High error rate

Based on analysis of >300K proteins in the UniProt database
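
The status quo procedure above can be sketched in a few lines. This is a minimal illustration only: the `Hit` record, the `transfer_annotation` function and the `max_evalue` cutoff of 1e-5 are assumptions made for the example, not part of the slides or of any BLAST API. Hits are assumed to have been parsed already from BLAST output.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hit:
    """One parsed BLAST hit (hypothetical record, not a BLAST data type)."""
    subject_id: str
    evalue: float
    annotation: str


def transfer_annotation(hits: List[Hit], max_evalue: float = 1e-5) -> Optional[str]:
    """Return the top hit's annotation if its E-value is significant, else None."""
    if not hits:
        return None
    best = min(hits, key=lambda h: h.evalue)  # top hit = lowest E-value
    return best.annotation if best.evalue <= max_evalue else None


hits = [
    Hit("subject_1", 1e-40, "receptor-like kinase"),
    Hit("subject_2", 1e-12, "LRR protein"),
]
print(transfer_annotation(hits))
print(transfer_annotation([Hit("subject_3", 0.5, "hypothetical protein")]))
```

Nothing in this sketch checks domain architecture or orthology, which is how the errors described on the following slides arise.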
3
Tomato Cf-2 Bioinformatics Analysis
Domain fusion and fission events complicate function prediction by homology, especially for very common domains (e.g., LRR regions).
Domain structure analysis (e.g., PFAM) is often critical.
Tomato Cf-2 (GI:1587673): Dixon, Jones, Keddie, Thomas, Harrison and Jones JDG, Cell (1996)
Top BLAST hit in Arabidopsis is an RLK!
4
Errors due to domain shuffling
(sic)
5
Error presumably due to non-orthology of database
hits used for annotation
6
Phylogenetic analysis suggests it's more likely a biogenic amine GPCR
7
Human neutral sphingomyelinase
or bacterial isochorismate synthase?
8
Database annotation errors

Main sources of annotation errors
Domain shuffling
Gene duplication (failure to discriminate between
orthologs and paralogs)
Existing database annotation errors

Propagation of existing database annotation errors
Errors in gene structure
Contamination
Other
Galperin and Koonin, "Sources of systematic error in functional annotation of genomes: domain rearrangement, non-orthologous gene displacement and operon disruption," In Silico Biol. 1998
9
Phylogenomic inference
Eisen, "Phylogenomics: Improving Functional Predictions for Uncharacterized Genes by Evolutionary Analysis," Genome Research 1998
Sjölander, "Phylogenomic inference of protein molecular function: advances and challenges," Bioinformatics 2004
10
Piet Hein, Grooks
11
There is nothing more difficult to take in hand,
more perilous to conduct, or more uncertain in
its success, than to take the lead in the
introduction of a new order of things. Because
the innovator has for enemies all those who have
done well under the old conditions, and lukewarm
defenders in those who may do well under the new.
This coolness arises partly from the
incredulity of men, who do not readily believe
in new things until they have had a long
experience of them.
12
Construction of genome-scale phylogenomic
libraries
Cluster genome into global homology groups
13
Berkeley Universal Proteome Phylogenomic Explorer
9,707 protein family books and 708K HMMs and
expanding daily
http://phylogenomics.berkeley.edu/UniversalProteome
14
Protein fold prediction
12% identity
VirB4 TrwB structure (1E9RA)
Active site
15
Example book: Voltage-gated K+ channels
16

SCI-PHY subfamilies supported by ML tree, and
also consistent with subtype and phylogenetic
distribution
(only one branch of ML tree displayed)

17
GO annotations for Shal subfamily
18
Database queries
Look up protein family books based on the
annotations associated with any sequence. Queries
can be based on GO biological process, PFAM
domains, UniProt accession numbers, etc.
19
Key algorithms in PhyloFacts library construction
What clustering methods are appropriate for
inference of protein function?
What alignment methods are accurate?
How to mask?
What tree methods to use?
How to root a tree?
Can we define functional subfamilies
automatically?
20
Fraction superposable positions drops with
evolutionary divergence
21
FlowerPower

Clustering global (or glocal) homologs
Minimize profile drift
Improved alignment accuracy

Nandini Krishnamurthy, Ph.D.
22
Step 1: Construct SearchDB
Q = query. Construct SearchDB using PSI-BLAST against the target database.
23
Step 2: Select and align core set.
Inclusion criteria: E-value 1e-10; bi-directional coverage
MUSCLE multiple alignment (Edgar, 2003)
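
The inclusion test in Step 2 can be illustrated as below. This is a sketch under stated assumptions: the function names and the `min_cov` value of 0.7 are hypothetical (the slide gives no coverage threshold), and the E-value 1e-10 is treated as an upper bound. "Bi-directional coverage" is read here as requiring the aligned region to span a large fraction of both the query and the hit, which helps restrict the cluster to global homologs.

```python
def bidirectional_coverage(q_start, q_end, q_len, s_start, s_end, s_len):
    """Fraction of the query and of the subject covered by the aligned region.

    Coordinates are 1-based and inclusive, as in typical BLAST tabular output.
    """
    q_cov = (q_end - q_start + 1) / q_len
    s_cov = (s_end - s_start + 1) / s_len
    return q_cov, s_cov


def passes_inclusion(evalue, q_cov, s_cov, max_evalue=1e-10, min_cov=0.7):
    """True if the hit is significant and covers both sequences well.

    min_cov is an assumed threshold for illustration only.
    """
    return evalue <= max_evalue and q_cov >= min_cov and s_cov >= min_cov


# A 100-residue query aligned over 90 residues to a 300-residue hit:
q_cov, s_cov = bidirectional_coverage(1, 90, 100, 11, 100, 300)
print(passes_inclusion(1e-20, q_cov, s_cov))  # hit is covered only 30%, so rejected
```

A one-directional check would accept this hit, even though the match covers only a single domain of a much longer protein.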
24
Step 3: Run SCI-PHY to identify subfamilies and build subfamily HMMs (SHMMs)
BETE subfamily identification: Sjölander 1998
SHMM construction: Brown et al., 2004
25
Step 4: SHMMs compete for sequences from SearchDB. Sequences meeting criteria are aligned to their closest SHMM.
26
Step 5: Run SCI-PHY on the extended alignment to identify new subfamilies and construct SHMMs.
27
Iterate until convergence
28
Comparing FlowerPower, BLAST, PSI-BLAST and UCSC T2K
Test: Clustering global homologs
Agreement at domain structure determined by PFAM.
SCOP used to cluster PFAM domains into structural equivalence classes.
29
Subfamily Classification In PHYlogenomics
(SCI-PHY)
Nandini Krishnamurthy, Ph.D.; Duncan Brown

Agglomerative clustering
Input: MSA
Initialize: construct a profile for each row in the MSA (using Dirichlet mixture densities)
While (number of clusters > 1):
Join the closest pair of clusters (distance function: relative entropy)
Re-estimate the profile (using Dirichlet mixture densities)
Compute the encoding cost for this stage
/* cut tree using minimum encoding cost */

Detection of critical positions
Sjölander, K., "Phylogenetic inference in protein superfamilies: Analysis of SH2 domains," Proceedings of the Conference on Intelligent Systems for Molecular Biology (ISMB) 1998
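
The agglomerative loop can be sketched as follows. This is an illustration, not the SCI-PHY implementation: a flat pseudocount stands in for the Dirichlet mixture densities, the relative entropy is symmetrized and summed over columns (the slide says only "relative entropy"), and all function names are made up for the example.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def profile(rows, pseudocount=1.0):
    """Per-column amino acid distributions for aligned rows (gaps ignored).

    A flat pseudocount is an assumed stand-in for a Dirichlet mixture prior.
    """
    prof = []
    for c in range(len(rows[0])):
        counts = Counter(r[c] for r in rows if r[c] in ALPHABET)
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        prof.append([(counts[a] + pseudocount) / total for a in ALPHABET])
    return prof


def relative_entropy(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))


def distance(prof_a, prof_b):
    """Symmetrized relative entropy summed over alignment columns."""
    return sum(relative_entropy(p, q) + relative_entropy(q, p)
               for p, q in zip(prof_a, prof_b))


def agglomerate(msa):
    """Return every stage of the clustering, from singletons to one cluster."""
    clusters = [[i] for i in range(len(msa))]
    stages = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        # Re-estimate a profile for each current cluster.
        profs = [profile([msa[i] for i in c]) for c in clusters]
        pairs = ((a, b) for a in range(len(clusters))
                 for b in range(a + 1, len(clusters)))
        a, b = min(pairs, key=lambda ab: distance(profs[ab[0]], profs[ab[1]]))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
        stages.append([list(c) for c in clusters])
    return stages


msa = ["ACDE", "ACDE", "WYWY", "WYWF"]
for stage in agglomerate(msa):
    print(stage)
```

Each entry of `stages` is one candidate subfamily decomposition. Choosing among them is the job of the encoding cost.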
30
Subfamilies identified using minimum encoding
cost principles

Each stage of the algorithm defines a different
set of alignments, one for each cluster
(subfamily).
Find the point during the clustering where the
encoding cost of the alignments is minimal. This
defines the subfamily decomposition.

N = number of sequences. S = number of subfamilies. n_{c,1}, ..., n_{c,S} are the amino acids aligned by subfamilies 1 through S at column c. Θ represents the Dirichlet mixture prior.
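
The formula itself did not survive in this transcript, so the sketch below assumes one plausible form consistent with the definitions above: N log2 S bits to assign each sequence to a subfamily, plus the negative log2 probability of each subfamily's column counts under the prior. A single symmetric Dirichlet with parameter `alpha` stands in for the Dirichlet mixture, and the function names are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def log2_marginal(counts, alpha=0.1):
    """log2 P(counts | symmetric Dirichlet(alpha)) for one column of one subfamily.

    Assumption: one Dirichlet component replaces the mixture prior.
    """
    a_total = alpha * len(ALPHABET)
    n_total = sum(counts.values())
    ll = math.lgamma(a_total) - math.lgamma(a_total + n_total)
    for a in ALPHABET:
        ll += math.lgamma(alpha + counts.get(a, 0)) - math.lgamma(alpha)
    return ll / math.log(2)


def encoding_cost(msa, partition, alpha=0.1):
    """Assumed cost: N * log2(S) minus the summed log2 column likelihoods."""
    cost = len(msa) * math.log2(len(partition))
    for members in partition:
        for c in range(len(msa[0])):
            counts = Counter(msa[i][c] for i in members if msa[i][c] in ALPHABET)
            cost -= log2_marginal(counts, alpha)
    return cost


def best_partition(msa, partitions, alpha=0.1):
    """Pick the clustering stage with the minimum encoding cost."""
    return min(partitions, key=lambda p: encoding_cost(msa, p, alpha))


msa = ["ACDEFG"] * 4 + ["WYWYWY"] * 4
singletons = [[i] for i in range(8)]
two_groups = [[0, 1, 2, 3], [4, 5, 6, 7]]
one_group = [list(range(8))]
print(best_partition(msa, [singletons, two_groups, one_group]))
```

On this toy alignment the two-subfamily stage is cheapest. Splitting further costs more assignment bits without making the columns any more uniform, and merging everything forces two residues into each column.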
31
SCI-PHY analysis of selected GPCRs
Venter et al., "The sequence of the human genome," (2001) Science
Sjölander, "Phylogenomic inference of protein molecular function: advances and challenges," (2004) Bioinformatics
32
Key residue prediction using subfamily and
family-wide conservation analysis
Elizabeth Hua-Mei Kellogg, Ryan Ritterson, Nandini Krishnamurthy
Parker JS, Roe SM, Barford D., EMBO J., 2004; Tanaka Hall, T., Structure 2005; Rivas et al., 2005
[Figure labels: D, RD E, YAH]
33
Function Prediction Using HMMs
7TM GPCR
ABC Transporter
Amidohydrolase
ATPase
Family
34
Subfamily HMM construction

At completely conserved positions and subfamily gapped positions: use match state distributions estimated for the general (family) HMM.
At other positions:
Estimate the Dirichlet mixture density posterior for each subfamily at each position separately.
Use the Dirichlet density posteriors to weight contributions from other subfamilies.
Compute the amino acid distribution using weighted counts and the standard Dirichlet procedure.

Error
Brown et al., "Subfamily HMMs in functional genomics," (2005) Pacific Symposium on Biocomputing
35
Subfamily HMMs increase the separation between
true and false positives

515 unique SCOP folds
PFAM full MSAs
Scored against Astral PDB90

1.5% error rate in subfamily classification using top-scoring SHMM
36
SATCHMO Simultaneous Alignment and Tree
Construction using Hidden Markov mOdels
Xia Jiang, Nandini Krishnamurthy, Duncan Brown, Michael Tung, Jake Gunn-Glanville, Bob Edgar
Edgar, R., and Sjölander, K., "SATCHMO: Sequence Alignment and Tree Construction using Hidden Markov models," Bioinformatics. 2003 Jul 22;19(11):1404-11
37
SATCHMO motivation

Structural divergence within a superfamily means that:
Multiple sequence alignment (MSA) is hard
The set of alignable positions varies according to the degree of divergence
Current MSA methods are not designed to handle this variability. They either:
Assume sequences are globally alignable across all columns (e.g., ClustalW), which over-aligns, i.e., aligns regions that are not superposable
or identify and align only highly conserved positions (e.g., SAM software with HMM surgery)
Challenge
Different degrees of alignability in different sequence pairs and in different regions
Masking protocols are lossy: loop regions may be variable across the family but may be critical for function!

38
SATCHMO algorithm

Input: unaligned sequences
Initialize: a profile HMM is constructed for each sequence.
While (number of clusters > 1):
Use profile-profile scoring to select clusters to join
Align clusters to each other, keeping columns fixed
Analyze the joint MSA to predict which positions appear to be structurally similar; these are retained, the remainder are masked.
Construct a profile HMM for the new masked MSA
Output: tree and MSA

39
(No Transcript)
40
(No Transcript)
41
(No Transcript)
42
(No Transcript)
43
Alignment of proteins with different overall folds
44
Assessing sequence alignment with respect to
structural alignment
Xia Jiang, Duncan Brown, Nandini Krishnamurthy
45
Future work: Interactive specificity position identification

Enable users to select subtrees for analysis
Identify positions conserved within each
subtree, but which differentiate the
two
Plot over MSA and on structure (if available)

Donald and Shakhnovich, NAR 2005
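
A minimal sketch of the position-identification step, assuming the strictest reading of "conserved": a column counts only if each subtree shows a single residue there and the two residues differ. The function name and the exclusion of gap columns are choices made for this example, not the group's method.

```python
def specificity_positions(subtree_a, subtree_b):
    """Columns invariant within each subtree but different between the two.

    subtree_a and subtree_b are lists of aligned rows taken from the same MSA.
    Gap-containing columns are skipped (an assumption made for this sketch).
    """
    positions = []
    for c in range(len(subtree_a[0])):
        col_a = {row[c] for row in subtree_a}
        col_b = {row[c] for row in subtree_b}
        if "-" in col_a or "-" in col_b:
            continue
        if len(col_a) == 1 and len(col_b) == 1 and col_a != col_b:
            positions.append(c)
    return positions


subtree_a = ["ACDK", "ACEK"]
subtree_b = ["ACDR", "ACER"]
print(specificity_positions(subtree_a, subtree_b))  # 0-based column indices
```

A practical tool would likely relax strict invariance to a conservation score, and would map the reported columns onto the MSA display and the structure.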
46
Major challenge: Phylogenetic uncertainty
Given A (gene tree of unknown function) and gene trees B and C (characterized function), predict function for A.
[Figure: alternative tree topologies relating A, B and C]
Problem: use three phylogenetic tree methods, get 3 or more trees! Change the MSA and you also change the tree.
Need: better simulation studies, benchmark datasets
47
http://phylogenomics.berkeley.edu
Berkeley Phylogenomics Group
PI: Kimmen Sjölander
Nandini Krishnamurthy, Ph.D.; Duncan Brown; Sriram Sankararaman; Xia Jiang; Jake Gunn-Glanville
Lead programmer and web administrator: Dan Kirshner
This work is supported in part by a Presidential Early Career Award for Scientists and Engineers from the NSF, and by an R01 from the NHGRI (NIH).
