Annotation transfer by homology

2
Annotation transfer by homology
  • Status quo approach to protein function
    prediction
  • Given a gene (or protein) of unknown function
  • Run BLAST to find homologs
  • Identify the top BLAST hit(s)
  • If the score is significant, transfer the
    annotation
  • If resources permit, predict domains using PFAM
    or CDD
  • Problems
  • Approach fails completely for 30% of genes
  • Of those with annotations, only 3% have any
    supporting experimental evidence
  • 97% have had functions predicted by homology
    alone
  • High error rate

Based on analysis of >300K proteins in the
UniProt database
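The status quo workflow above can be sketched in a few lines. This is only an illustration: the `Hit` record, the `transfer_annotation` function and the `max_evalue` default are hypothetical names and values, and the hits are assumed to have been parsed from BLAST output already.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hit:
    """One parsed BLAST hit (hypothetical record, not a BLAST data structure)."""
    subject_id: str
    evalue: float
    annotation: Optional[str]  # None if the database entry has no annotation


def transfer_annotation(hits: List[Hit], max_evalue: float = 1e-5) -> Optional[str]:
    """Copy the annotation of the top hit if its E-value is significant.

    The threshold is an assumed value chosen for illustration.
    """
    if not hits:
        return None
    top = min(hits, key=lambda h: h.evalue)  # top hit = lowest E-value
    if top.evalue > max_evalue:
        return None
    return top.annotation


hits = [
    Hit("subject_1", 1e-40, "receptor-like kinase"),
    Hit("subject_2", 1e-12, "disease resistance protein"),
]
print(transfer_annotation(hits))
```

Nothing in this procedure checks domain architecture, orthology, or the evidence behind the copied annotation, which is where the problems listed above come from.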
3
Tomato Cf-2 Bioinformatics Analysis
Domain fusion and fission events complicate
function prediction by homology, especially for
very common domains (e.g., LRR regions).
Domain structure analysis (e.g., PFAM) is often
critical.
Tomato Cf-2 (GI:1587673): Dixon, Jones, Keddie,
Thomas, Harrison and Jones JDG, Cell (1996)
Top BLAST hit in Arabidopsis is an RLK!
4
Errors due to domain shuffling
(sic)
5
Error presumably due to non-orthology of database
hits used for annotation
6
Phylogenetic analysis suggests it's more
likely a Biogenic Amine GPCR
7
Human neutral sphingomyelinase
or bacterial isochorismate synthase?
8
Database annotation errors
  • Main sources of annotation errors
  • Domain shuffling
  • Gene duplication (failure to discriminate between
    orthologs and paralogs)
  • Existing database annotation errors

Propagation of existing database annotation errors
Errors in gene structure
Contamination
Other
Galperin and Koonin, "Sources of systematic error
in functional annotation of genomes: domain
rearrangement, non-orthologous gene displacement
and operon disruption," In Silico Biol. 1998
9
Phylogenomic inference
Eisen, "Phylogenomics: Improving Functional
Predictions for Uncharacterized Genes by
Evolutionary Analysis," Genome Research 1998
Sjölander, "Phylogenomic inference of
protein molecular function: advances and
challenges," Bioinformatics 2004
10
Piet Hein, Grooks
11
There is nothing more difficult to take in hand,
more perilous to conduct, or more uncertain in
its success, than to take the lead in the
introduction of a new order of things. Because
the innovator has for enemies all those who have
done well under the old conditions, and lukewarm
defenders in those who may do well under the new.
This coolness arises partly from the
incredulity of men, who do not readily believe
in new things until they have had a long
experience of them.
12
Construction of genome-scale phylogenomic
libraries
Cluster genome into global homology groups
13
Berkeley Universal Proteome Phylogenomic Explorer
9,707 protein family books and 708K HMMs and
expanding daily
http://phylogenomics.berkeley.edu/UniversalProteome
14
Protein fold prediction
12% identity
VirB4 TrwB structure (1E9RA)
Active site
15
Example book: Voltage-gated K+ channels
16
  • SCI-PHY subfamilies supported by ML tree, and
    also consistent with subtype and phylogenetic
    distribution
  • (only one branch of ML tree displayed)

17
GO annotations for Shal subfamily
18
Database queries
Look up protein family books based on the
annotations associated with any sequence. Queries
can be based on GO biological process, PFAM
domains, UniProt accession numbers, etc.
19
Key algorithms in PhyloFacts library construction
What clustering methods are appropriate for
inference of protein function?
What alignment methods are accurate?
How to mask?
What tree methods to use?
How to root a tree?
Can we define functional subfamilies
automatically?
20
Fraction superposable positions drops with
evolutionary divergence
21
FlowerPower
  • Clustering global (or glocal) homologs
  • Minimize profile drift
  • Improved alignment accuracy

Nandini Krishnamurthy, Ph.D.
22
Step 1: Construct SearchDB
Q = query. Construct SearchDB using PSI-BLAST
against target database
23
Step 2: Select and align core set.
Inclusion criteria: E-value 1e-10, bi-directional
coverage. MUSCLE multiple alignment (Edgar,
2003)
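A minimal sketch of the two inclusion criteria follows. It assumes that the 1e-10 figure is an upper bound on the E-value and that "bi-directional coverage" means the aligned region must span most of both the query and the hit. The slide gives no coverage threshold, so `min_coverage=0.7` is a placeholder, and the function names are hypothetical.

```python
def coverage(start: int, end: int, seq_len: int) -> float:
    """Fraction of a sequence covered by an aligned region (1-based, inclusive)."""
    return (end - start + 1) / seq_len


def passes_inclusion(evalue, q_span, q_len, h_span, h_len,
                     max_evalue=1e-10, min_coverage=0.7):
    """Apply the E-value cutoff plus coverage of both query and hit.

    min_coverage is an assumed value; the slide does not state one.
    """
    q_cov = coverage(q_span[0], q_span[1], q_len)
    h_cov = coverage(h_span[0], h_span[1], h_len)
    return evalue <= max_evalue and q_cov >= min_coverage and h_cov >= min_coverage


# A global homolog: the alignment spans most of both sequences.
print(passes_inclusion(1e-20, (1, 95), 100, (5, 100), 110))
# A shared domain inside a much longer protein: strong E-value, poor hit coverage.
print(passes_inclusion(1e-30, (1, 95), 100, (200, 294), 900))
```

Requiring coverage in both directions is what keeps a hit that shares only one domain with the query out of the core set, even when its E-value is excellent.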
24
Step 3: Run SCI-PHY to identify subfamilies and
build subfamily HMMs (SHMMs)
BETE subfamily identification: Sjölander
1998. SHMM construction: Brown et al, 2004
25
Step 4: SHMMs compete for sequences from
SearchDB. Sequences meeting criteria are aligned
to their closest SHMM.
26
Step 5: Run SCI-PHY on extended alignment to
identify new subfamilies and construct SHMMs.
27
Iterate until convergence
28
Comparing FlowerPower, BLAST, PSI-BLAST and
UCSC T2K
Test: Clustering global homologs
Agreement at domain structure determined by PFAM.
SCOP used to cluster PFAM domains into
structural equivalence classes.
29
Subfamily Classification In PHYlogenomics
(SCI-PHY)
Nandini Krishnamurthy, Ph.D., Duncan Brown
  • Agglomerative clustering
  • Input: MSA
  • Initialize: construct a profile for each row in
    the MSA
  • While (clusters > 1)
  • Join closest pair of clusters
  • Re-estimate profile
  • Compute encoding cost for this stage
  • /* cut tree using minimum encoding cost */
  • Profiles: use Dirichlet mixture densities
  • Distance function: relative entropy

Detection of critical positions
Sjölander, K. "Phylogenetic inference in protein
superfamilies: Analysis of SH2 domains,"
Proceedings of the Conference on Intelligent Systems for
Molecular Biology (ISMB) 1998
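The agglomerative loop on this slide can be sketched as follows. This is a simplified illustration, not the SCI-PHY implementation: a flat pseudocount stands in for the Dirichlet mixture densities, the relative entropy is symmetrised and summed over columns, gap characters are ignored when counting, and all function names are hypothetical.

```python
import math
from itertools import combinations

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def profile(rows, pseudo=0.1):
    """Per-column residue distributions for a cluster of aligned rows.

    A flat pseudocount is an assumed stand-in for Dirichlet mixture densities.
    """
    prof = []
    for c in range(len(rows[0])):
        counts = {a: sum(r[c] == a for r in rows) for a in ALPHABET}
        total = sum(counts.values()) + pseudo * len(ALPHABET)
        prof.append({a: (counts[a] + pseudo) / total for a in ALPHABET})
    return prof


def relative_entropy(p, q):
    return sum(p[a] * math.log2(p[a] / q[a]) for a in ALPHABET)


def distance(prof_a, prof_b):
    """Symmetrised relative entropy between two profiles, summed over columns."""
    return sum(relative_entropy(p, q) + relative_entropy(q, p)
               for p, q in zip(prof_a, prof_b))


def agglomerate(msa):
    """Return the partition at every stage, from singletons to one cluster."""
    clusters = [[row] for row in msa]
    stages = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        profs = [profile(c) for c in clusters]  # re-estimate profiles
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: distance(profs[ij[0]], profs[ij[1]]))
        merged = clusters[i] + clusters[j]      # join the closest pair
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        stages.append([list(c) for c in clusters])
    return stages


stages = agglomerate(["AAAA", "AAAC", "WWWW", "WWWY"])
for partition in stages:
    print(partition)
```

Each entry of `stages` is one candidate subfamily decomposition. Choosing among them is the job of the encoding cost, which this sketch does not compute.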
30
Subfamilies identified using minimum encoding
cost principles
  • Each stage of the algorithm defines a different
    set of alignments, one for each cluster
    (subfamily).
  • Find the point during the clustering where the
    encoding cost of the alignments is minimal. This
    defines the subfamily decomposition.

N: number of sequences. S: number of
subfamilies. n_{c,1}, ..., n_{c,S} are the amino acids
aligned by subfamilies 1 through S at column c.
Θ represents the Dirichlet mixture prior.
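The slide's formula is not reproduced in the transcript, so the sketch below assumes a form consistent with the variable definitions above: cost = N log2 S minus the sum, over columns c and subfamilies s, of log2 P(n_{c,s} | prior). A single symmetric Dirichlet (with an assumed `alpha=0.1`) stands in for the Dirichlet mixture prior, and the function names are hypothetical.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def log2_dirichlet_marginal(counts, alpha=0.1):
    """log2 probability of a count vector under a symmetric Dirichlet prior.

    One Dirichlet component is an assumed simplification of the mixture prior.
    """
    a_total = alpha * len(counts)
    n_total = sum(counts)
    log_p = math.lgamma(a_total) - math.lgamma(a_total + n_total)
    for n in counts:
        log_p += math.lgamma(alpha + n) - math.lgamma(alpha)
    return log_p / math.log(2)


def encoding_cost(partition):
    """Assumed form: N*log2(S) - sum over columns and subfamilies of log2 P(counts)."""
    n_seqs = sum(len(sub) for sub in partition)
    n_cols = len(partition[0][0])
    cost = n_seqs * math.log2(len(partition))
    for c in range(n_cols):
        for sub in partition:
            counts = [sum(row[c] == a for row in sub) for a in ALPHABET]
            cost -= log2_dirichlet_marginal(counts)
    return cost


group_a = ["AAAAAA", "AAAAAA"]
group_w = ["WWWWWW", "WWWWWW"]
lumped = encoding_cost([group_a + group_w])
split = encoding_cost([group_a, group_w])
singles = encoding_cost([[row] for row in group_a + group_w])
print(lumped, split, singles)
```

In this toy case the two-subfamily partition is cheaper than both lumping everything together and splitting into singletons, which is the behaviour the minimum encoding cost cut relies on.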
31
SCI-PHY analysis of selected GPCRs
Venter et al, "The sequence of the human genome,"
Science (2001). Sjölander, "Phylogenomic
inference of protein molecular function: advances
and challenges," Bioinformatics (2004)
32
Key residue prediction using subfamily and
family-wide conservation analysis
Elizabeth Hua-Mei Kellogg, Ryan Ritterson, Nandini
Krishnamurthy
Parker JS, Roe SM, Barford D., EMBO J.,
2004. Tanaka Hall, T., Structure 2005. Rivas et
al, 2005
(Figure labels: D, RD E, YAH)
33
Function Prediction Using HMMs
7TM GPCR
ABC Transporter
Amidohydrolase
ATPase
Family
34
Subfamily HMM construction
  1. At completely conserved positions and subfamily
    gapped positions: use match state distributions
    estimated for the general (family) HMM.
  2. At other positions:
    • Estimate the Dirichlet mixture density posterior for
      each subfamily at each position separately.
    • Use the Dirichlet density posteriors to weight
      contributions from other subfamilies.
    • Compute the amino acid distribution using weighted
      counts and the standard Dirichlet procedure.

Error
Brown et al, "Subfamily HMMs in functional
genomics," Pacific Symposium on Biocomputing (2005)
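The last step of the construction, combining weighted counts with a Dirichlet prior, can be illustrated for a single alignment column. This sketch takes the weights for the other subfamilies as given inputs (on the slide they come from the Dirichlet density posteriors) and uses one Dirichlet component rather than a mixture. The function name and the two-letter alphabet in the example are hypothetical.

```python
def subfamily_distribution(own, others, weights, alpha):
    """Posterior-mean residue distribution for one subfamily at one column.

    own:     residue counts observed in this subfamily
    others:  list of residue-count dicts from the other subfamilies
    weights: one weight per entry in `others` (assumed to be precomputed)
    alpha:   Dirichlet pseudocounts, one per residue in the alphabet
    """
    unnormalised = {}
    for aa in alpha:
        borrowed = sum(w * o.get(aa, 0.0) for o, w in zip(others, weights))
        unnormalised[aa] = own.get(aa, 0.0) + borrowed + alpha[aa]
    total = sum(unnormalised.values())
    return {aa: v / total for aa, v in unnormalised.items()}


alpha = {"A": 0.5, "S": 0.5}  # toy two-letter alphabet for illustration
dist = subfamily_distribution({"A": 4}, [{"S": 4}], [0.25], alpha)
print(dist)
```

A weight near zero leaves the subfamily's own counts dominant, while a weight near one makes the column behave like the family-wide distribution.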
35
Subfamily HMMs increase the separation between
true and false positives
  • 515 unique SCOP folds
  • PFAM full MSAs
  • Scored against Astral PDB90

1.5% error rate in subfamily classification using
top-scoring SHMM
36
SATCHMO: Simultaneous Alignment and Tree
Construction using Hidden Markov mOdels
Xia Jiang, Nandini Krishnamurthy, Duncan
Brown, Michael Tung, Jake Gunn-Glanville, Bob Edgar
Edgar, R., and Sjölander, K., "SATCHMO: Sequence
Alignment and Tree Construction using Hidden
Markov models," Bioinformatics. 2003 Jul
22;19(11):1404-11
37
SATCHMO motivation
  • Structural divergence within a superfamily means
    that
  • Multiple sequence alignment (MSA) is hard
  • The set of alignable positions varies with the degree of
    divergence
  • Current MSA methods are not designed to handle this
    variability
  • They either assume all columns are globally alignable (e.g.
    ClustalW)
  • which over-aligns, i.e. aligns regions that are not
    superposable
  • or identify and align only highly conserved
    positions (e.g., SAM software with HMM
    surgery)
  • Challenge
  • Different degrees of alignability in different
    sequence pairs and different regions
  • Masking protocols are lossy: loop regions may be
    variable across the family but may be critical
    for function!

38
SATCHMO algorithm
  • Input: unaligned sequences
  • Initialize: construct a profile HMM for each
    sequence.
  • While (clusters > 1)
  • Use profile-profile scoring to select clusters to
    join
  • Align clusters to each other, keeping columns
    fixed
  • Analyze joint MSA to predict which positions
    appear to be structurally similar; these are
    retained, the remainder are masked.
  • Construct a profile HMM for the new masked MSA
  • Output: tree and MSA
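The masking step in the loop above can be illustrated with a deliberately simple stand-in. The slide does not say how structurally similar positions are predicted, so this sketch substitutes a gap-fraction rule: a column is retained only if at most `max_gap_fraction` of the rows have a gap there. The rule, its threshold and the function names are assumptions for illustration, not SATCHMO's actual criterion.

```python
def retained_columns(msa, max_gap_fraction=0.5):
    """Indices of columns kept after masking (gap-fraction stand-in rule)."""
    n_rows = len(msa)
    keep = []
    for c in range(len(msa[0])):
        gaps = sum(row[c] == "-" for row in msa)
        if gaps / n_rows <= max_gap_fraction:
            keep.append(c)
    return keep


def apply_mask(msa, keep):
    """Project every row of the MSA onto the retained columns."""
    return ["".join(row[c] for c in keep) for row in msa]


joint_msa = ["MK-LV", "MKALV", "M--LV", "MR--V"]
keep = retained_columns(joint_msa)
print(keep)
print(apply_mask(joint_msa, keep))
```

Because masking is recomputed after every join, closely related clusters can keep columns that are later dropped once more divergent sequences enter the alignment.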

43
Alignment of proteins with different overall folds
44
Assessing sequence alignment with respect to
structural alignment
Xia Jiang, Duncan Brown, Nandini Krishnamurthy
45
Future work: Interactive specificity position
identification
  • Enable users to select subtrees for analysis
  • Identify positions conserved within each
    subtree, but which differentiate the
    two
  • Plot over MSA and on structure (if available)

Donald and Shakhnovich, NAR 2005
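The second bullet above describes a simple column test, sketched here for two user-selected subtrees. The strict rule used (every sequence in a subtree shares the same residue, and the two subtrees disagree) is one possible reading of "conserved"; a real tool would probably tolerate some variation. Function names are hypothetical.

```python
def conserved_residue(rows, col):
    """Return the residue shared by all rows at this column, else None."""
    residues = {row[col] for row in rows}
    if len(residues) != 1:
        return None
    (residue,) = residues
    return None if residue == "-" else residue


def specificity_positions(subtree_a, subtree_b):
    """Columns conserved within each subtree but different between the two."""
    positions = []
    for col in range(len(subtree_a[0])):
        res_a = conserved_residue(subtree_a, col)
        res_b = conserved_residue(subtree_b, col)
        if res_a and res_b and res_a != res_b:
            positions.append((col, res_a, res_b))
    return positions


subtree_a = ["MKDLE", "MKDIE"]
subtree_b = ["MRDLQ", "MRDVQ"]
print(specificity_positions(subtree_a, subtree_b))
```

The returned column indices (0-based here) are what would then be plotted over the MSA or mapped onto a structure.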
46
Major challenge Phylogenetic uncertainty
Given: A (gene tree of unknown function) and
gene trees B and C (characterized
function). Predict function for A.
(Figure: alternative tree topologies relating A, B and C)
Problem: use three phylogenetic tree methods, get
3 or more trees! Change the MSA, you also change
the tree. Need: better simulation studies,
benchmark datasets
47
http://phylogenomics.berkeley.edu
Berkeley Phylogenomics Group. PI: Kimmen
Sjölander. Nandini Krishnamurthy, Ph.D., Duncan
Brown, Sriram Sankararaman, Xia Jiang, Jake
Gunn-Glanville. Lead programmer and web
administrator: Dan Kirshner
This work is supported in part by a Presidential
Early Career Award for Scientists and Engineers
from the NSF, and by an R01 from the NHGRI
(NIH).