Title: Global Classification of (Plant) Proteins across Multiple Species
1Global Classification of (Plant) Proteins across
Multiple Species
- Kerr Wall
- Jim Leebens-Mack
- Naomi Altman
- Victor Albert
- Dawn Field
- Hong Ma
- Claude dePamphilis
2Global Classification of Proteins
- The protein classification problem
- A method for global classification
- Bootstrap support for global classification
- Structure within clusters
- Structure between clusters
- Results from complete proteome classification:
Arabidopsis, Oryza and Populus
3The protein classification problem
- Genomic sequence can be translated into protein
sequence, but
- The function of most proteins is unknown.
- Protein classification is used to
- infer protein folding structure
- infer protein function
- infer evolutionary relationships
4Similarity of Protein Sequence
- FFHPLECEPTLQMGFHSDQIS-VAA---AGPS--VNNN---
- FFHPLDCGPTLQMGYPSDSLTAEAAASVAGPS--C--S---
- FFHPLECEPTLQIGYQPDPIT-VAA---AGPS--VN-NYMP
- FFHPIECEPTLQMGYQQDQIT-VAAA--AGPSMTMN-S---
- FFQHIECEPTLHIGYQPDQIT-VAA---AGPS--MN-NYMQ
- FFHPLECEPTLQIGYQHDQIT-IAA---PGPS--VS-NYMP
- Each row represents a different protein.
- Each letter represents an amino acid.
- Each "-" represents a gap: a position that is missing in
this sequence but occupied in a
different protein in this set.
- In closely related proteins, the distance between
proteins is the number of mismatches.
- In distantly related species, the sequences are
given a score, often a measure of how likely it is that a
random sequence matches as well (e.g. BLAST
E-value).
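The mismatch count for closely related proteins can be sketched in a few lines of Python. This is only an illustration, not the method used in the talk: the function name `mismatch_distance` is hypothetical, and the handling of gaps (a gap facing a residue counts as a mismatch, a gap facing a gap is ignored) is an assumption.

```python
def mismatch_distance(seq_a, seq_b, gap="-"):
    """Count mismatching columns between two rows of one alignment.

    Assumption (not from the slides): a gap facing a residue counts as a
    mismatch, and columns where both rows have a gap are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    mismatches = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap and b == gap:
            continue  # no information in a gap-gap column
        if a != b:
            mismatches += 1
    return mismatches


# First and third rows of the alignment shown on the slide.
row1 = "FFHPLECEPTLQMGFHSDQIS-VAA---AGPS--VNNN---"
row3 = "FFHPLECEPTLQIGYQPDPIT-VAA---AGPS--VN-NYMP"
distance = mismatch_distance(row1, row3)
```

A count like this is only meaningful when the rows come from the same alignment, which is why the function refuses rows of different lengths.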
5Inferring Evolutionary Relationships
- Main methods
  - statistical phylogeny based on sequence
  alignment and evolutionary models
    - requires a high degree of sequence similarity
    - good alignments use slow algorithms and often
    lots of manual intervention
  - manual curation
    - requires a large amount of manual intervention
    - can incorporate sequence, folding structure
    and function.
- These methods are good for 100s of genes.
6Global Classification of Proteins
Arabidopsis 26,207
Rice 57,915
Poplar 45,555
Total 129,677
Our goal: The joint classification of all known
plant proteins using a scaffold derived from
the 3 completely sequenced species
7A method for global classification
- Clustering based on a similarity (or distance)
matrix is commonly used.
- Our similarity matrix is 129,677 x 129,677, so we
need
  - A quick method for computing distance (BLAST
  E-values are often used; we use
  -log(E-value) as the similarity measure)
  - A quick method for clustering (sparse matrix
  computations are often used).
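As a rough sketch of how E-values might be turned into a sparse similarity structure, the Python below keeps only pairs that pass a cutoff and stores -log10(E-value) for them. The function names, the `(query, subject, evalue)` tuple layout and the cap of 200 for E-values reported as 0 are all assumptions made for illustration; the 1e-5 cutoff is the one quoted later in the talk.

```python
import math


def similarity_from_evalue(evalue, max_similarity=200.0):
    """Return -log10(E-value), capped at max_similarity.

    The cap is a hypothetical choice so that an E-value reported as 0
    still yields a finite similarity.
    """
    if evalue <= 0.0:
        return max_similarity
    return min(-math.log10(evalue), max_similarity)


def build_sparse_similarity(hits, evalue_cutoff=1e-5):
    """Build a dict-of-dicts similarity matrix from (query, subject, evalue) hits.

    Pairs whose E-value is worse than the cutoff are simply not stored,
    which is what keeps a 129,677 x 129,677 matrix manageable.
    """
    sim = {}
    for query, subject, evalue in hits:
        if evalue > evalue_cutoff:
            continue
        score = similarity_from_evalue(evalue)
        row = sim.setdefault(query, {})
        # Keep the best score if the same pair is reported more than once.
        row[subject] = max(score, row.get(subject, 0.0))
    return sim


example_hits = [("a", "b", 1e-10), ("a", "c", 1e-3), ("b", "a", 0.0)]
example_sim = build_sparse_similarity(example_hits)
```

Storing only the pairs that pass the cutoff means memory grows with the number of significant hits rather than with the square of the number of proteins.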
8TribeMCL Clustering Algorithm
Predicted protein sequences from the fully
sequenced genomes of Arabidopsis thaliana
Columbia (26,207) and Oryza sativa japonica
(57,915) were downloaded from TIGR. Populus
trichocarpa (45,555) was downloaded from JGI. All
sequences were blasted against each other using
BLASTp 2.4 with an E-value cutoff of 1 x 10^-5. The
TribeMCL package was used to predict putative
protein families at low, medium, and high
(I = 1.2, 3, 5) stringencies. The results are
stored at
http://www.floralgenome.org/cgi-bin/tribedb/tribe.cgi
12TribeMCL Method (Enright, Van Dongen and Ouzounis,
2002)
- Similarity is measured by -log10(BLAST E-value)
- Clustering is done by the MCL method
13MCL Algorithm (van Dongen, 2000)
- Suppose S is the similarity matrix.
- Normalize the rows of S to sum to 1.
- Raise each entry to the power r > 1 (r is the
stringency) and renormalize to obtain S(r).
- Take a Markov step: replace S by the product S(r)S(r).
- Iterate to convergence.
It is very fast because low similarities are
truncated to zero and sparse matrix methods can
then be used.
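The steps listed on this slide can be sketched in plain Python as follows. This is a small dense illustration under stated assumptions, not the MCL or TribeMCL implementation: the function names, the convergence tolerance, and the way clusters are read off (connected components of the non-negligible entries of the converged matrix) are all choices made here for the example, and a real run on 129,677 proteins would need sparse matrices with truncation of small entries.

```python
def normalize_rows(m):
    """Scale each row to sum to 1 (all-zero rows are left as zeros)."""
    out = []
    for row in m:
        total = sum(row)
        out.append([x / total if total else 0.0 for x in row])
    return out


def inflate(m, r):
    """Raise every entry to the power r and renormalize the rows."""
    return normalize_rows([[x ** r for x in row] for row in m])


def expand(m):
    """One Markov step: the matrix product of m with itself."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mcl(similarity, r=2.0, max_iter=100, tol=1e-9):
    """Alternate inflation and expansion until the matrix stops changing."""
    m = normalize_rows(similarity)
    for _ in range(max_iter):
        new = expand(inflate(m, r))
        change = max(abs(a - b) for ra, rb in zip(new, m) for a, b in zip(ra, rb))
        m = new
        if change < tol:
            break
    return m


def clusters_from(m, eps=1e-6):
    """Group nodes joined by entries larger than eps (a simple union-find)."""
    label = list(range(len(m)))

    def find(i):
        while label[i] != i:
            label[i] = label[label[i]]  # path halving
            i = label[i]
        return i

    for i, row in enumerate(m):
        for j, value in enumerate(row):
            if value > eps:
                label[find(i)] = find(j)
    groups = {}
    for i in range(len(m)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Toy similarity matrix: two tight groups joined by one weak link (2 <-> 3).
toy = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0.1, 0, 0],
    [0, 0, 0.1, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
]
toy_clusters = clusters_from(mcl(toy, r=2.0))
```

In this toy case the weak link between the two groups is driven towards zero by repeated inflation, so the two groups come out as separate clusters; raising r makes the procedure cut more aggressively.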
14A Heuristic for MCL
- We take a random walk on the graph described by
the similarity matrix
- BUT
- After each step we weaken the links between
distant nodes and strengthen the links between
nearby nodes.
Graphic from van Dongen, 2000
15Cluster pattern at convergence as a function of r
[Figure: similarity matrix; panels labeled r = 2.0, 2.6, 2.8 and 2.9]
Small groups break apart first. The pattern is
quite robust to changes in the similarity of the
green region.
16Cluster pattern at convergence as a function of r
[Figure: similarity matrix; panels labeled r = 2.0, 2.6, 2.8 and 3.1; figure labels 16, 40, 60, 50]
At r = 3.6 all units separate.
The additional similarity indicated by pink has a
profound effect.
17Cluster pattern at convergence as a function of r
[Figure: similarity matrix; panels labeled r = 2.0, 2.6, 2.7 and 2.8]
More strongly connecting the background
disrupts the pattern until r = 2.7, after which we
quickly cycle through the pattern (r = 2.9 turns the
center group into singletons and r = 3.0 turns
everything into singletons).
18Cluster pattern at convergence as a function of r
[Figure: similarity matrix; panels labeled r = 2.0, 2.1 and 2.3]
Weakening the within-cluster similarity
accelerates the breakdown into singletons.
19Cluster pattern at convergence as a function of r
[Figure: similarity matrix; panels labeled r = 2.0 and 2.3]
Strengthening the background while weakening
the within-cluster similarity makes it difficult
to pick out the clusters.
20Some Summary Statistics for the Clusters
Protein Set                    Number of Proteins   Number of Clusters at r = 3 (% of proteins)   Percent of Singletons
Arabidopsis                    26,207               11,467 (44)                                   69
Arabidopsis + Rice             84,122               28,175 (33)                                   68
Arabidopsis + Rice + Poplar    129,677              35,873 (28)                                   67
22Singletons
Cluster ATH Rice Poplar
ATH 30 - -
Rice 17 25 -
Poplar 12 24 15
25Comparing Tribes to Phylogenetic Trees from
Sequence Alignment
Tribes for large gene families show some, but not
complete, correspondence to inferred phylogenetic
relationships. Tribes with MADS genes formed at
low, medium and high stringencies are mapped
onto a recently published Arabidopsis MADS gene
phylogeny (Martinez-Castilla and Alvarez-Buylla,
2003).
26Comparisons with curated gene families
- Added tribe information to TAIR's gene families
- www.floralgenome.org/cgi-bin/tair/tair.cgi
- E.g. Cytochrome P450
28Bootstrap Support for Clusters
- To determine the stability of the clusters, we
need some type of perturbation of the system. We
use the 0.632 jackknife instead of the
bootstrap (as we want a set of unique proteins).
- We clustered 100 samples, each a random selection
of 63.2% of the proteins.
- We count 1 for each tribe each time all the
genes in the tribe selected for the
sample are clustered together.
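The counting rule above can be sketched as below. This is an illustration under assumptions, not the code used for the talk: `jackknife_support` and `cluster_fn` are hypothetical names, `cluster_fn` stands in for a full BLAST plus TribeMCL run on the sampled proteins, and a tribe with no member in a given sample is assumed to earn no count for that sample.

```python
import random


def jackknife_support(proteins, tribes, cluster_fn,
                      n_samples=100, fraction=0.632, seed=0):
    """Count, per tribe, how often its sampled members stay together.

    proteins   : list of protein identifiers
    tribes     : list of collections of identifiers (the original tribes)
    cluster_fn : callable taking a list of identifiers and returning a
                 partition of them (a list of lists); a placeholder for
                 the real clustering pipeline
    """
    rng = random.Random(seed)
    proteins = list(proteins)
    sample_size = round(fraction * len(proteins))
    support = [0] * len(tribes)
    for _ in range(n_samples):
        # Sampling without replacement gives a set of unique proteins.
        sample = set(rng.sample(proteins, sample_size))
        cluster_of = {}
        for idx, cluster in enumerate(cluster_fn(sorted(sample))):
            for protein in cluster:
                cluster_of[protein] = idx
        for t, tribe in enumerate(tribes):
            selected = [p for p in tribe if p in sample]
            # Assumption: tribes with no sampled member get no count.
            if selected and len({cluster_of[p] for p in selected}) == 1:
                support[t] += 1
    return support


ids = [f"p{i}" for i in range(10)]
always_together = jackknife_support(ids, [set(ids)], lambda s: [list(s)], n_samples=20)
never_together = jackknife_support(ids, [set(ids)], lambda s: [[p] for p in s], n_samples=20)
```

Dividing each count by the number of samples gives a support value between 0 and 1 that can be read much like bootstrap support on a tree.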
31From Tribes to Phylogenetics
- Within each tribe of 3 or more proteins we can do
hierarchical clustering using the similarity
matrix (Harlow, Gogarten and Ragan, 2004) or form
a careful alignment and build a phylogenetic tree.
- We can also form SuperTribes by clustering the
tribes. Because we still have a large set of
objects to cluster, we continue to use MCL.
- Within a SuperTribe, we can do hierarchical
clustering.
- The SuperTribe for the MADS family shown earlier
includes all the MADS sequences.
32Single Linkage TribeMCL
- Define the distance between tribes as the
smallest pairwise E-value.
- Use TribeMCL on the resulting similarity matrix.
- Use hierarchical clustering within supertribes.
[Diagram labels: Single Linkage TribeMCL; hierarchical clustering or phylogenetic trees]
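The single-linkage rule for tribes can be sketched as follows. The data layout is an assumption made for the example: tribes are a dict from a tribe name to its member proteins, and pairwise E-values are stored in a dict keyed by `frozenset({a, b})`. The function name is hypothetical, and tribe pairs with no recorded hit between any of their members are simply left out.

```python
from itertools import combinations


def single_linkage_evalues(tribes, evalues):
    """Smallest pairwise E-value between every pair of tribes.

    tribes  : dict mapping tribe name -> list of protein identifiers
    evalues : dict mapping frozenset({protein_a, protein_b}) -> E-value
    Returns a dict mapping (tribe_a, tribe_b) -> smallest E-value found.
    """
    linkage = {}
    for (name_a, members_a), (name_b, members_b) in combinations(tribes.items(), 2):
        best = None
        for a in members_a:
            for b in members_b:
                e = evalues.get(frozenset((a, b)))
                if e is not None and (best is None or e < best):
                    best = e
        if best is not None:
            linkage[(name_a, name_b)] = best
    return linkage


example_tribes = {"T1": ["a", "b"], "T2": ["c"], "T3": ["d"]}
example_evalues = {frozenset(("a", "c")): 1e-8, frozenset(("b", "c")): 1e-20}
example_linkage = single_linkage_evalues(example_tribes, example_evalues)
```

The resulting tribe-to-tribe E-values can then be converted to similarities and clustered again in the same way as the protein-level matrix.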
33Floral Genome Project and Plant
Protein Classification
34Use of the Global Classification
- Project goal is to understand the evolution of
flowers.
- Data has been collected to various degrees of
intensity on 15 non-model species across the
phylogeny of flowering plants and merged with
data from other projects.
- PlantTribes will be used to assist in placing
these proteins into families to infer
evolutionary relationships.
35And many thanks to
- Kerr Wall FGP Bioinformatics (PSU)
- Claude dePamphilis FGP PI (PSU)
- Jim Leebens-Mack FGP Project Director (PSU)
- Hong Ma FGP co-PI (PSU)
- Victor Albert collaborator (U. Oslo)
- Dawn Field collaborator (Oxford U.)
- And FGP collaborators at PSU, UFL and Cornell.
- And especially
- NSF Plant Genome Research Program