The use of 4grams for Protein Classification and Sequence Comparison

Slides: 22

Provided by: dror2

Learn more at: http://www.cs.cmu.edu

Transcript and Presenter's Notes

1
The use of 4-grams for Protein Classification and
Sequence Comparison
Dror Tobi, ShannChing Chen, Ivet Bahar
2
The 4-gram Concept
Example 4-grams: QLIR, AASD, FGTY
4-gram: a short sequence of four amino acids
3
Representation of Sequence(s) as 4-gram Vector(s)
Three steps:

1. Calculating 4-gram frequencies in the examined DB
2. Calculating 4-gram frequencies for a given sequence or a given family of sequences
3. Creating a 4-gram vector using a weight function


4
1. Calculating 4-gram frequencies in DB
As a reference DB we chose Swiss-Prot. A table of
the number of occurrences of each 4-gram was
created.
The table enables us to calculate the database
frequency of each 4-gram i.
5
2. Calculating 4-gram frequencies of a sequence
(or family)
The 4-gram frequencies for a given sequence or a
family of sequences are calculated using a hash
table. Each 4-gram is entered into the hash table,
from which the 4-gram family frequency is
calculated.
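
As an illustration of this step, here is a minimal sketch in Python, where a `Counter` plays the role of the hash table. The function names are hypothetical, and two assumptions are made that the slide does not state: 4-grams are taken from overlapping windows, and the family frequency is the average count per family member.

```python
from collections import Counter

def count_4grams(sequence, n=4):
    """Count every overlapping n-gram in one amino-acid sequence (assumed windowing)."""
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))

def family_frequencies(sequences, n=4):
    """Average number of times each n-gram appears per family member (assumed definition)."""
    totals = Counter()
    for seq in sequences:
        totals.update(count_4grams(seq, n))
    return {gram: count / len(sequences) for gram, count in totals.items()}

# First ten residues of P03579, used only as sample input.
counts = count_4grams("MPYTINSPSQ")
toy_family = family_frequencies(["AAAAA", "AAAA"])
```

A sequence of length L yields L - 3 overlapping 4-grams, so the ten-residue sample produces seven entries.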
6
3. The 4-gram weight function
The weight Wi of 4-gram i is defined in terms of
the average number of times 4-gram i appears in
family f
(no important contribution)
7
Building a 4-gram Vector (contd)
A 4-gram vector of length k is built from the k
4-grams with the highest Wi values. These
4-grams are referred to as the k most
discriminative 4-grams. The k most discriminative
4-grams are selected using a heap data structure.
Vector elements 1, 2, ..., k:
Identity    Weight
xxxx1       w1
xxxx5       w5
xxxx9       w9
xxxx1050    w1050
xxxx1001    w1001
The vector elements are sorted according to their
4-gram identity using the quicksort algorithm.
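
The selection and sorting described on this slide can be sketched as follows. This is an illustration only: `build_vector` is a hypothetical name, the weights are made-up numbers rather than values of the slide's weight function, and Python's built-in `sorted` (Timsort) stands in for quicksort.

```python
import heapq

def build_vector(weights, k):
    """Keep the k highest-weight 4-grams, then sort them by 4-gram identity."""
    # heapq.nlargest uses a heap internally to pick the k largest weights.
    top_k = heapq.nlargest(k, weights.items(), key=lambda item: item[1])
    # Tuples sort by their first element, i.e. the 4-gram string.
    return sorted(top_k)

# Made-up weights for illustration only.
weights = {"QLIR": 2.5, "AASD": 0.3, "FGTY": 1.7, "SPSQ": 0.9}
vector = build_vector(weights, k=2)
```

Sorting by identity means two vectors can later be compared in a single pass over both lists.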
8
Comparing two Vectors
Vector similarity is measured by the cosine of
the angle between the two vectors
(Figure: two 4-gram vectors separated by the angle α)
Vector 1: xxxx1 w1, xxxx5 w5, xxxx9 w9, xxxx1050 w1050, xxxx1001 w1001
Vector 2: xxxx5 w5, xxxx6 w6, xxxx9 w9, xxxx1056 w1056, xxxx1001 w1001
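
A minimal sketch of the cosine comparison, assuming each vector is stored as a list of (4-gram, weight) pairs sorted by 4-gram identity. The function name and the sample weights are hypothetical. Only 4-grams present in both vectors contribute to the dot product.

```python
import math

def cosine(vec_a, vec_b):
    """Cosine between two sparse vectors given as identity-sorted (gram, weight) lists."""
    dot, i, j = 0.0, 0, 0
    # Merge-style walk: advance whichever list has the smaller 4-gram.
    while i < len(vec_a) and j < len(vec_b):
        gram_a, gram_b = vec_a[i][0], vec_b[j][0]
        if gram_a == gram_b:
            dot += vec_a[i][1] * vec_b[j][1]
            i += 1
            j += 1
        elif gram_a < gram_b:
            i += 1
        else:
            j += 1
    norm_a = math.sqrt(sum(w * w for _, w in vec_a))
    norm_b = math.sqrt(sum(w * w for _, w in vec_b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Made-up vectors sharing one 4-gram (FGTY).
vec_1 = [("AASD", 1.0), ("FGTY", 2.0)]
vec_2 = [("FGTY", 2.0), ("QLIR", 1.0)]
similarity = cosine(vec_1, vec_2)
```

Because both lists are sorted, the walk touches each element at most once, so the cost is linear in the vector length k.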
9
EC4 family classification
EC4 test: 1769 families (*), containing a total of
10,919 enzymes and defined at level 4 of the EC
classification (at Expasy), were considered. A
4-gram vector (model, probe vector) was built for
each EC4 family. The cosine between the probe
vector for a given EC4 family and the 4-gram
vector of each sequence in Swiss-Prot was
calculated. All sequences were rank-ordered based
on their cosine values.
(*) Out of a total of 4000 in SWISS-PROT release
27.7, excluding families that do not contain any
sequences.
10
Success Definition
Success is defined as the percentage of family
members having a cosine value higher than that of
any non-family sequence in the Swiss-Prot DB.
Example: a family (F00X) that has five members,
F001-F005.
A case of 80% success (family members are colored
blue on the slide):
F001 0.567, F003 0.456, F005 0.354, F002 0.333,
P0 SD 0.301, F004 0.255, ..
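
The success measure can be checked on the slide's own example with a short sketch. The function name is hypothetical, and the single non-family entry is renamed `OTHER` here as a placeholder identifier.

```python
def success_rate(ranked, family):
    """Percentage of family members scoring above the best non-family sequence."""
    best_outsider = max(
        (score for name, score in ranked if name not in family),
        default=float("-inf"),
    )
    hits = sum(1 for name, score in ranked if name in family and score > best_outsider)
    return 100.0 * hits / len(family)

# Cosine values from the slide; "OTHER" is a placeholder for the non-family hit.
ranked = [
    ("F001", 0.567), ("F003", 0.456), ("F005", 0.354),
    ("F002", 0.333), ("OTHER", 0.301), ("F004", 0.255),
]
family = {"F001", "F002", "F003", "F004", "F005"}
rate = success_rate(ranked, family)
```

Four of the five members rank above the best non-family sequence, which gives the 80% quoted on the slide.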
11
EC4 Initial Results
12
EC 1.14.12.3: a case of failure
EC 1.14.12.3 is a family of four proteins. When
we tested this family against Swiss-Prot no
family member had a higher cosine value than the
highest cosine value of non-family members.
EC 1.14.12.3 Phylogenetic tree

THIS DIOXYGENASE SYSTEM CONSISTS OF FOUR
PROTEINS: THE TWO SUBUNITS OF THE HYDROXYLASE
COMPONENT (BEDC1 AND BEDC2), A FERREDOXIN (BEDB)
AND A FERREDOXIN REDUCTASE (BEDA).

13
Sequence homogeneity is a prerequisite for
successful 4-gram classification
(Figure labels: Sub Family, Family vector, Sub Family)
14
Preliminary Conclusions

4-gram classification is a fast way to
classify/cluster sequences: 120,000 comparisons
take 4 min on a regular desktop.
Sequence homogeneity within a family is a
prerequisite for successful classification.
The EC classification groups enzymes
according to their function, which does not
necessarily correlate with classification based
upon sequence similarity.

15
Use of 4-grams in Sequence Search
The 4-gram vector as is measures sequence
identity and can therefore easily detect close
sequences (>55% identity). But what about
sequences with low sequence identity (30-55%)?
16
Case of P03579 / P03581
43.6% identity; global alignment score: 414

P03579    1 MPYTINSPSQFVYLSSAYADPVQLINLCTNALGNQFQTQQARTTVQQQFADAWKPVPSMT  60
P03581    1 MAYSIPTPSQLVYFTENYADYIPFVNRLINARSNSFQTQSGRDELREILIKSQVSVVSPI  60

P03579   61 VRFPASD-FYVYRYNSTLDPLITALLNSFDTRNRIIEVDNQPAPNTTEIVNATQRVDDAT 119
P03581   61 SRFPAEPAYYIYLRDPSISTVYTALLQSTDTRNRVIEVENSTNVTTAEQLNAVRRTDDAS 120

P03579  120 VAIRASINNLANELVRGTGMFNQAGFETASGLVW--TTTPAT- 159
P03581  121 TAIHNNLEQLLSLLTNGTGVFNRTSFESASGLTWLVTTTPRTA 163

Cos(P03579, P03581) = 0.04
17
Improving Sensitivity using homology 4-grams
P03579 MPYTINSPSQFVYLSSAY
P03581 MAYSIPTPSQLVYFTENY
Identity 4-gram: SPSQ
Homology 4-grams: APSQ, NPSQ, TPSQ, SPSK
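
One way to picture homology 4-grams is as single-residue variants of an identity 4-gram. The sketch below is a guess at the idea, not the authors' procedure: the slide does not say how similar residues are chosen, so the `SIMILAR` table is a toy assumption picked to cover the slide's example, and the function name is hypothetical.

```python
def homology_4grams(gram, similar):
    """Generate variants of a 4-gram by substituting one residue with a similar one."""
    variants = set()
    for pos, residue in enumerate(gram):
        for substitute in similar.get(residue, ""):
            variants.add(gram[:pos] + substitute + gram[pos + 1:])
    variants.discard(gram)  # the identity 4-gram itself is not a homology 4-gram
    return sorted(variants)

# Toy similarity table (assumption): S ~ A, N, T and Q ~ K.
SIMILAR = {"S": "ANT", "Q": "K"}
variants = homology_4grams("SPSQ", SIMILAR)
```

With this toy table the output contains the four homology 4-grams shown on the slide, plus variants at the third position (for example SPTQ) that the slide does not list. Among them is TPSQ, which occurs in P03581 at the position aligned with SPSQ in P03579.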
18
Including homology in vector comparison
(Figure: the identity vector and the homology vector of the query sequence, at angles αi and αh to the vector of the unknown sequence)
Score = cos(αi) + λ cos(αh)
19
4-gram Search Results
20
Correlation between cosine value and sequence
alignment % identity
21
Conclusions
The use of homology 4-grams improves detection of
distant sequences (30-55% sequence
identity). The 4-gram based method also seems
suitable for sequence search. After
precalculating each sequence's 4-gram vector, it
is possible to compare two sequences with time
complexity of O(1), that is, independent of
sequence length for a fixed vector length k.
