Global Classification of (Plant) Proteins across Multiple Species

1
Global Classification of (Plant) Proteins across
Multiple Species

Kerr Wall
Jim Leebens-Mack
Naomi Altman
Victor Albert
Dawn Field
Hong Ma
Claude dePamphilis

2
Global Classification of Proteins

The protein classification problem
A method for global classification
Bootstrap support for global classification
Structure within clusters
Structure between clusters
Results from complete proteome classification:
Arabidopsis, Oryza and Populus

3
The protein classification problem

Genomic sequence can be translated into protein
sequence, but the function of most proteins is
unknown.
Protein classification is used to:
infer protein folding structure
infer protein function
infer evolutionary relationships

4
Similarity of Protein Sequence

FFHPLECEPTLQMGFHSDQIS-VAA---AGPS--VNNN---
FFHPLDCGPTLQMGYPSDSLTAEAAASVAGPS--C--S---
FFHPLECEPTLQIGYQPDPIT-VAA---AGPS--VN-NYMP
FFHPIECEPTLQMGYQQDQIT-VAAA--AGPSMTMN-S---
FFQHIECEPTLHIGYQPDQIT-VAA---AGPS--MN-NYMQ
FFHPLECEPTLQIGYQHDQIT-IAA---PGPS--VS-NYMP
Each row represents a different protein.
Each letter represents an amino acid.
Each "-" represents a gap: a position that is
missing in this sequence but occupied in a
different protein in this set.
In closely related proteins, the distance between
proteins is the number of mismatches.
In distantly related species, the sequences are
given a score, often a measure of how likely it is
that a random sequence matches as well (e.g. the
BLAST E-value).
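
The mismatch count for closely related proteins can be sketched in a few lines. This is only an illustration: the function name `mismatch_distance` and the choice of whether a gap against a residue counts as a mismatch are assumptions, not something the slides specify. The two example rows are the first and third sequences of the alignment above.

```python
def mismatch_distance(seq_a, seq_b, count_gaps=True):
    """Count aligned positions at which two sequences differ.

    count_gaps is an assumption of this sketch: when False, any column
    containing a gap character "-" is skipped instead of being counted.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    distance = 0
    for a, b in zip(seq_a, seq_b):
        if not count_gaps and "-" in (a, b):
            continue  # ignore columns where either sequence has a gap
        if a != b:
            distance += 1
    return distance


# Rows 1 and 3 of the alignment shown on this slide.
row1 = "FFHPLECEPTLQMGFHSDQIS-VAA---AGPS--VNNN---"
row3 = "FFHPLECEPTLQIGYQPDPIT-VAA---AGPS--VN-NYMP"
print(mismatch_distance(row1, row3))
print(mismatch_distance(row1, row3, count_gaps=False))
```

Skipping gap columns gives a smaller distance; which convention is appropriate depends on how the alignment was built.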

5
Inferring Evolutionary Relationships

Main methods:
Statistical phylogeny based on sequence
alignment and evolutionary models
-requires a high degree of sequence similarity
-good alignments use slow algorithms and often
lots of manual intervention
Manual curation
-requires a large amount of manual intervention
-can incorporate sequence, folding structure
and function
These methods work well for hundreds of genes.

6
Global Classification of Proteins

Very high throughput

Arabidopsis   26,207
Rice          57,915
Poplar        45,555
Total         129,677

Our goal: the joint classification of all known
plant proteins using a scaffold derived from
the 3 completely sequenced species.

7
A method for global classification

Clustering based on a similarity (or distance)
matrix is commonly used.
Our similarity matrix is 129,677 x 129,677, so we
need:
-a quick method for computing distance (BLAST
E-values are often used; we use -log(E-value)
as the similarity measure)
-a quick method for clustering (sparse matrix
computations are often used)
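
As a minimal sketch of the similarity measure named on this slide, the function below turns a BLAST E-value into -log10(E-value). The function name and the `cap` used when BLAST reports an E-value of exactly 0 are assumptions made for the example; the slides do not say how that case was handled.

```python
import math


def evalue_to_similarity(evalue, cap=200.0):
    """Convert a BLAST E-value to the similarity -log10(E-value).

    cap is a hypothetical ceiling: an E-value of 0 would otherwise give
    an infinite similarity, so it is mapped to this value instead.
    """
    if evalue < 0:
        raise ValueError("E-values cannot be negative")
    if evalue == 0.0:
        return cap
    return min(cap, -math.log10(evalue))


for e in (1e-5, 1e-50, 0.0):
    print(e, evalue_to_similarity(e))
```

Smaller E-values give larger similarities, so a hit exactly at a 1e-5 cutoff has a similarity of about 5.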

8
TribeMCL Clustering Algorithm
Predicted protein sequences from the fully
sequenced genomes of Arabidopsis thaliana
Columbia (26,207) and Oryza sativa japonica
(57,915) were downloaded from TIGR. Populus
trichocarpa (45,555) was downloaded from JGI. All
sequences were blasted against each other using
BLASTp 2.4 with an E-value cutoff of 1 x 10^-5.
The TribeMCL package was used to predict putative
protein families at low, medium, and high
(I = 1.2, 3, 5) stringencies. The results are
stored at
http://www.floralgenome.org/cgi-bin/tribedb/tribe.cgi
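
The all-against-all step produces a list of pairwise hits that has to be filtered by the E-value cutoff before clustering. The sketch below assumes the hits are in BLAST's standard 12-column tab-separated format, with the E-value in the 11th column; the slides do not state which output format was used, and the function name and the protein identifiers in the example are hypothetical.

```python
def read_blast_hits(lines, max_evalue=1e-5):
    """Keep the best (smallest) E-value per query/subject pair.

    Assumes 12-column tab-separated BLAST output: query id, subject id,
    then nine other fields, with the E-value in column 11 (index 10).
    """
    best = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if evalue > max_evalue:
            continue  # weaker than the cutoff
        pair = (query, subject)
        if pair not in best or evalue < best[pair]:
            best[pair] = evalue
    return best


# Hypothetical hits, for illustration only.
example = [
    "protA\tprotB\t80.0\t100\t20\t0\t1\t100\t1\t100\t1e-30\t150",
    "protA\tprotB\t60.0\t50\t20\t0\t1\t50\t1\t50\t1e-8\t60",
    "protA\tprotC\t30.0\t40\t28\t0\t1\t40\t1\t40\t0.5\t20",
]
print(read_blast_hits(example))
```

A dictionary keyed by pairs is a natural sparse representation: pairs with no hit below the cutoff are simply absent.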
12
TribeMCL Method (Enright, Van Dongen and Ouzounis,
2002)

Similarity is measured by
-log10(BLAST E-value).
Clustering is done by the MCL method.

13
MCL Algorithm (van Dongen, 2000)

Suppose S is the similarity matrix.
Normalize the rows of S to sum to 1.
Raise each entry to the power r > 1 (r is the
stringency) and renormalize, giving S(r).
Take a Markov step: replace S with S(r)S(r).
Iterate to convergence.

It is very fast because low similarities are
truncated to zero and sparse matrix methods can
then be used.
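
The steps on this slide can be sketched in plain Python for a small dense matrix. This is an illustration under stated assumptions, not the TribeMCL implementation: it follows the row normalization described here, uses a hypothetical relative pruning threshold in place of real sparse-matrix handling, and reads clusters off the converged matrix by grouping rows that share the same set of non-negligible columns. The toy similarity matrix is invented, and every node is assumed to have a positive self-similarity.

```python
def normalize_rows(matrix):
    """Scale each row so that it sums to 1."""
    result = []
    for row in matrix:
        total = sum(row)
        result.append([value / total for value in row])
    return result


def inflate(matrix, r, prune=1e-8):
    """Raise entries to the power r, drop relatively tiny ones, renormalize."""
    result = []
    for row in matrix:
        powered = [value ** r for value in row]
        cutoff = prune * max(powered)  # hypothetical truncation rule
        result.append([v if v >= cutoff else 0.0 for v in powered])
    return normalize_rows(result)


def markov_step(matrix):
    """Replace the matrix by its square (a two-step random walk)."""
    n = len(matrix)
    return [[sum(matrix[i][k] * matrix[k][j] for k in range(n))
             for j in range(n)] for i in range(n)]


def mcl(similarity, r=2.0, max_iter=100, tol=1e-9):
    """Return clusters as sorted lists of node indices."""
    m = normalize_rows(similarity)
    for _ in range(max_iter):
        nxt = markov_step(inflate(m, r))
        change = max(abs(a - b)
                     for row_a, row_b in zip(nxt, m)
                     for a, b in zip(row_a, row_b))
        m = nxt
        if change < tol:
            break
    groups = {}
    for node, row in enumerate(m):
        # Nodes whose walks end on the same columns share a cluster.
        key = tuple(j for j, value in enumerate(row) if value > 1e-3)
        groups.setdefault(key, []).append(node)
    return sorted(groups.values())


# Toy example: two tight groups of three nodes, weakly linked.
strong, weak = 1.0, 0.01
toy = [[strong if (i < 3) == (j < 3) else weak for j in range(6)]
       for i in range(6)]
print(mcl(toy, r=2.0))
```

At realistic sizes the dense matrix product is the bottleneck, which is why truncating low similarities to zero and using sparse storage matters.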
14
A Heuristic for MCL

We take a random walk on the graph described by
the similarity matrix, but after each step we
weaken the links between distant nodes and
strengthen the links between nearby nodes.

Graphic from van Dongen, 2000
15
Cluster pattern at convergence as a function of r
[Figure: similarity matrix and cluster patterns at r = 2.0, 2.6, 2.8 and 2.9]
Small groups break apart first. The pattern is
quite robust to changes in the similarity of the
green region.
16
Cluster pattern at convergence as a function of r
[Figure: similarity matrix and cluster patterns at r = 2.0, 2.6, 2.8 and 3.1; figure labels 16, 40, 60, 50]
At r = 3.6 all units separate.
The additional similarity indicated by pink has a
profound effect.
17
Cluster pattern at convergence as a function of r
[Figure: similarity matrix and cluster patterns at r = 2.0, 2.6, 2.7 and 2.8]
More strongly connecting the background
disrupts the pattern until r = 2.7, after which we
quickly cycle through the pattern (r = 2.9 turns
the center group into singletons and r = 3.0 turns
everything into singletons).
18
Cluster pattern at convergence as a function of r
[Figure: similarity matrix and cluster patterns at r = 2.0, 2.1 and 2.3]
Weakening the within-cluster similarity
accelerates the breakdown into singletons.
19
Cluster pattern at convergence as a function of r
[Figure: similarity matrix and cluster patterns at r = 2.0 and 2.3]
Strengthening the background while weakening
the within-cluster similarity makes it difficult
to pick out the clusters.
20
Some Summary Statistics for the Clusters
Protein Set                   Number of Proteins   Number of Clusters at r = 3 (% of proteins)   Percent of Singletons
Arabidopsis                   26,207               11,467 (44)                                   69
Arabidopsis + Rice            84,122               28,175 (33)                                   68
Arabidopsis + Rice + Poplar   129,677              35,873 (28)                                   67
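
Statistics like these can be computed directly from a clustering. The sketch below rests on two readings of the table that are assumptions: that the parenthesized figure is the number of clusters as a percentage of the number of proteins (11,467 / 26,207 is about 44%), and that the last column is the share of clusters that are singletons (it cannot be a share of proteins, since 69% of 26,207 exceeds the cluster count). The function name and the toy clusters are hypothetical.

```python
def cluster_summary(clusters):
    """Summarize a clustering given as a list of lists of protein ids."""
    n_proteins = sum(len(cluster) for cluster in clusters)
    n_clusters = len(clusters)
    n_singletons = sum(1 for cluster in clusters if len(cluster) == 1)
    return {
        "proteins": n_proteins,
        "clusters": n_clusters,
        # clusters as a percentage of proteins (assumed meaning of "(44)")
        "clusters_pct_of_proteins": 100.0 * n_clusters / n_proteins,
        # singletons as a percentage of clusters (assumed meaning)
        "singleton_pct_of_clusters": 100.0 * n_singletons / n_clusters,
    }


# Invented clustering, for illustration only.
toy_clusters = [["a", "b", "c"], ["d"], ["e"], ["f", "g"]]
print(cluster_summary(toy_clusters))
print(round(100.0 * 11467 / 26207))  # check against the Arabidopsis row
```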
22
Singletons
Cluster   ATH   Rice   Poplar
ATH       30    -      -
Rice      17    25     -
Poplar    12    24     15
25
Comparing Tribes to Phylogenetic Trees from
Sequence Alignment
Tribes for large gene families show some, but not
complete, correspondence to inferred phylogenetic
relationships. Tribes with MADS genes formed at
low, medium and high stringencies are mapped onto
a recently published Arabidopsis MADS gene
phylogeny (Martinez-Castilla & Alvarez-Buylla
2003).
26
Comparisons with curated gene families

Added tribe information to TAIR's gene families:
www.floralgenome.org/cgi-bin/tair/tair.cgi
E.g. Cytochrome P450

28
Bootstrap Support for Clusters

To determine the stability of the clusters, we
need some type of perturbation of the system. We
use the 0.632 jackknife instead of the
bootstrap (as we want a set of unique proteins).
We clustered 100 samples, each a random selection
of 63.2% of the proteins.
We count 1 for each tribe each time all the
genes in the tribe that were selected for the
sample are clustered together.
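
The resampling scheme on this slide can be sketched as follows. The clustering routine is passed in as `cluster_fn`, a hypothetical stand-in for a full BLAST plus MCL run on the sampled proteins. One further assumption: a tribe with no members in a given sample is simply skipped for that sample, and a tribe with a single sampled member counts as clustered.

```python
import random


def jackknife_support(proteins, tribes, cluster_fn,
                      n_samples=100, fraction=0.632, seed=0):
    """Count, per tribe, the samples in which its sampled members stay together.

    cluster_fn(sample) must return a list of clusters (lists of protein
    ids); it is a placeholder for re-running the real clustering.
    """
    rng = random.Random(seed)
    size = round(fraction * len(proteins))
    counts = [0] * len(tribes)
    for _ in range(n_samples):
        sample = rng.sample(proteins, size)  # without replacement
        label = {}
        for cluster_id, cluster in enumerate(cluster_fn(sample)):
            for protein in cluster:
                label[protein] = cluster_id
        for i, tribe in enumerate(tribes):
            kept = [p for p in tribe if p in label]
            if kept and len({label[p] for p in kept}) == 1:
                counts[i] += 1
    return counts


def by_first_letter(sample):
    """Toy clustering: group protein ids by their first character."""
    groups = {}
    for protein in sample:
        groups.setdefault(protein[0], []).append(protein)
    return list(groups.values())


proteins = ["a1", "a2", "a3", "b1", "b2", "b3"]
tribes = [["a1", "a2", "a3"], proteins]
print(jackknife_support(proteins, tribes, by_first_letter, n_samples=20))
```

Sampling without replacement is what distinguishes this from an ordinary bootstrap: every protein appears at most once in each sample.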

31
From Tribes to Phylogenetics

Within each tribe of 3 or more proteins we can do
hierarchical clustering using the similarity
matrix (Harlow, Gogarten & Ragan, 2004), or form
a careful alignment and build a phylogenetic tree.
We can also form SuperTribes, by clustering the
tribes. Because we still have a large set of
objects to cluster, we continue to use MCL.
Within a SuperTribe, we can do hierarchical
clustering.
The SuperTribe for the MADS family shown earlier
includes all the MADS sequences

32
Single Linkage TribeMCL

Define the distance between tribes as the
smallest pairwise E-value.
Use TribeMCL on the resulting similarity matrix.
Use hierarchical clustering within supertribes.
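
The single-linkage distance between tribes can be sketched as a reduction over the protein-level E-values: each pair of tribes is assigned the smallest E-value found between any of their members. The function name, the dictionary layout and the identifiers in the example are assumptions made for illustration.

```python
def tribe_evalues(pair_evalues, tribe_of):
    """Smallest pairwise E-value between each pair of distinct tribes.

    pair_evalues maps (protein, protein) to an E-value; tribe_of maps
    each protein to a tribe id. Both layouts are hypothetical.
    """
    best = {}
    for (a, b), evalue in pair_evalues.items():
        ta, tb = tribe_of[a], tribe_of[b]
        if ta == tb:
            continue  # within-tribe hits do not link tribes
        key = (min(ta, tb), max(ta, tb))
        if key not in best or evalue < best[key]:
            best[key] = evalue
    return best


# Invented hits between two tribes, for illustration only.
pair_evalues = {("p1", "q1"): 1e-10, ("p2", "q1"): 1e-20, ("p1", "p2"): 1e-50}
tribe_of = {"p1": 0, "p2": 0, "q1": 1}
print(tribe_evalues(pair_evalues, tribe_of))
```

The resulting tribe-level E-values can then be converted to similarities and clustered in the same way as the protein-level ones.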

[Figure: Single Linkage TribeMCL, followed by hierarchical clustering or phylogenetic trees]
33
Floral Genome Project and Plant Protein
Classification
34
Use of the Global Classification

The project goal is to understand the evolution
of flowers.
Data have been collected at varying degrees of
intensity on 15 non-model species across the
phylogeny of flowering plants and merged with
data from other projects.
PlantTribes will be used to assist in placing
these proteins into families to infer
evolutionary relationships.

35
And many thanks to