Title: The use of 4-grams for Protein Classification and Sequence Comparison
1 The use of 4-grams for Protein Classification and
Sequence Comparison
Dror Tobi, ShannChing Chen, Ivet Bahar
2 The 4-gram Concept
4-gram: a short sequence of four amino acids
Examples: QLIR, AASD, FGTY
3 Representation of Sequence(s) as 4-gram Vector(s)
Three steps:
- Calculating 4-gram frequencies in the examined DB
- Calculating 4-gram frequencies for a given sequence or a given family of sequences
- Creating a 4-gram vector using a weight function
4 Step 1. Calculating 4-gram frequencies in the DB
As a reference DB we chose Swiss-Prot. A table of the number of occurrences of each 4-gram was created.
The table enables us to calculate the database frequency of 4-gram i.
[Equation for the database frequency not preserved]
5 Step 2. Calculating 4-gram frequencies of a sequence (or family)
The 4-gram frequencies for a given sequence or a family of sequences are calculated using a hash table. Each 4-gram is entered into the hash table, from which the 4-gram family frequency is calculated.
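The counting step can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes that 4-grams are taken as overlapping windows shifted one residue at a time, and the function names `count_4grams` and `family_average_counts` are hypothetical. Python's `Counter` plays the role of the hash table.

```python
from collections import Counter


def count_4grams(sequence):
    """Count every 4-gram in one sequence (assumes overlapping windows)."""
    return Counter(sequence[i:i + 4] for i in range(len(sequence) - 3))


def family_average_counts(sequences):
    """Average number of times each 4-gram appears per sequence of a family."""
    totals = Counter()
    for seq in sequences:
        totals.update(count_4grams(seq))
    return {gram: n / len(sequences) for gram, n in totals.items()}
```

For a single sequence, `family_average_counts([seq])` reduces to the plain 4-gram counts of that sequence.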
6 Step 3. The 4-gram weight function
[Equation for the weight W_i not preserved]
The equation uses the average number of times 4-gram i appears in family f.
(no important contribution)
7 Building a 4-gram Vector (contd)
A 4-gram vector of length k is built from the k 4-grams with the highest W_i values. These 4-grams are referred to as the k most discriminative 4-grams. The selection of the k most discriminative 4-grams is done using a heap data structure.
Position  Identity   Weight
1         xxxx1      w1
2         xxxx5      w5
...       xxxx9      w9
...       xxxx1050   w1050
k         xxxx1001   w1001
The vector elements are sorted according to their 4-gram identity using the quicksort algorithm.
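A minimal sketch of this construction, assuming the weights have already been computed and are supplied as a dictionary mapping each 4-gram to its weight. The function name `build_vector` is hypothetical. `heapq.nlargest` does the heap-based selection. Python's built-in `sorted` (Timsort) stands in for quicksort here, which changes the algorithm but not the resulting order.

```python
import heapq


def build_vector(weights, k):
    """Return the k highest-weight 4-grams as (identity, weight) pairs,
    sorted by 4-gram identity."""
    # Heap-based selection of the k most discriminative 4-grams.
    top_k = heapq.nlargest(k, weights.items(), key=lambda item: item[1])
    # Sort by identity so that two vectors can later be compared by a merge.
    return sorted(top_k)
```

Sorting by identity is what makes a linear-time comparison of two vectors possible.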
8 Comparing two Vectors
Vector similarity is measured by the cosine of the angle α between the two vectors.
Vector 1              Vector 2
xxxx1     w1          xxxx5     w5
xxxx5     w5          xxxx6     w6
xxxx9     w9          xxxx9     w9
xxxx1050  w1050       xxxx1056  w1056
xxxx1001  w1001       xxxx1001  w1001
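The comparison can be sketched as a merge over two identity-sorted vectors. This is an illustrative sketch, not the authors' code: it assumes each vector is a list of (identity, weight) pairs sorted by identity, and it uses the standard definition of the cosine, the dot product divided by the product of the vector norms. Only 4-grams present in both vectors contribute to the dot product.

```python
import math


def cosine(vec_a, vec_b):
    """Cosine between two 4-gram vectors given as identity-sorted
    lists of (identity, weight) pairs."""
    dot = 0.0
    i = j = 0
    # Walk both sorted lists in parallel; shared identities add to the dot product.
    while i < len(vec_a) and j < len(vec_b):
        id_a, w_a = vec_a[i]
        id_b, w_b = vec_b[j]
        if id_a == id_b:
            dot += w_a * w_b
            i += 1
            j += 1
        elif id_a < id_b:
            i += 1
        else:
            j += 1
    norm_a = math.sqrt(sum(w * w for _, w in vec_a))
    norm_b = math.sqrt(sum(w * w for _, w in vec_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_a * norm_b)
```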
9 EC4 family classification
EC4 test: 1769 families (*), containing a total of 10,919 enzymes and defined at level 4 of the EC classification (at Expasy), were considered. A 4-gram vector (model, or probe vector) was built for each EC4 family. The cosine between the probe vector of a given EC4 family and the 4-gram vector of each sequence in Swiss-Prot was calculated. All sequences were rank-ordered by their cosine values.
(*) Out of a total of 4000 in SWISS-PROT release 27.7, excluding families that do not contain any sequences.
10 Success Definition
% success is defined as the percentage of family members having a cosine value higher than that of any non-family sequence in the Swiss-Prot DB.
Example for a family (F00X) that has five members, F001-F005: a case of 80% success. Four of the five family members rank above the first non-family sequence.
Sequence  Cosine
F001      0.567
F003      0.456
F005      0.354
F002      0.333
P0 SD     0.301   (non-family)
F004      0.255
..
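The success measure can be sketched as below, assuming the sequence identifiers are already ranked by decreasing cosine. The function name `percent_success` and the identifier `NONFAM` in the usage example are hypothetical.

```python
def percent_success(ranked_ids, family_ids):
    """Percentage of family members ranked above the best-scoring
    non-family sequence; ranked_ids is sorted by decreasing cosine."""
    family = set(family_ids)
    hits = 0
    for seq_id in ranked_ids:
        if seq_id not in family:
            break  # first non-family sequence reached
        hits += 1
    return 100.0 * hits / len(family)


# Ranking from the example above, with a placeholder non-family identifier.
ranking = ["F001", "F003", "F005", "F002", "NONFAM", "F004"]
members = ["F001", "F002", "F003", "F004", "F005"]
print(percent_success(ranking, members))  # 80.0
```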
11 EC4 Initial Results
12 EC 1.14.12.3: a case of failure
EC 1.14.12.3 is a family of four proteins. When we tested this family against Swiss-Prot, no family member had a cosine value higher than the highest cosine value among non-family members.
EC 1.14.12.3 phylogenetic tree
- THIS DIOXYGENASE SYSTEM CONSISTS OF FOUR PROTEINS: THE TWO SUBUNITS OF THE HYDROXYLASE COMPONENT (BEDC1 AND BEDC2), A FERREDOXIN (BEDB) AND A FERREDOXIN REDUCTASE (BEDA).
13 Sequence homogeneity is a prerequisite for successful 4-gram classification
[Figure labels: Sub Family, Family vector, Sub Family]
14 Preliminary Conclusions
- 4-gram classification is a fast way to classify/cluster sequences: 120,000 comparisons take 4 min on a regular desktop.
- Sequence homogeneity within a family is a prerequisite for successful classification.
- The EC classification classifies enzymes according to their function, which does not necessarily correlate with classification based upon sequence similarity.
15 Uses of 4-grams in Sequence Search
The 4-gram vector, as is, measures sequence identity and can therefore easily detect close sequences (>55% identity). But what about sequences with low sequence identity (30-55%)?
16 Case of P03579 / P03581
43.6% identity; global alignment score: 414

P03579    1 MPYTINSPSQFVYLSSAYADPVQLINLCTNALGNQFQTQQARTTVQQQFADAWKPVPSMT  60
P03581    1 MAYSIPTPSQLVYFTENYADYIPFVNRLINARSNSFQTQSGRDELREILIKSQVSVVSPI  60

P03579   61 VRFPASD-FYVYRYNSTLDPLITALLNSFDTRNRIIEVDNQPAPNTTEIVNATQRVDDAT 119
P03581   61 SRFPAEPAYYIYLRDPSISTVYTALLQSTDTRNRVIEVENSTNVTTAEQLNAVRRTDDAS 120

P03579  120 VAIRASINNLANELVRGTGMFNQAGFETASGLVW--TTTPAT- 159
P03581  121 TAIHNNLEQLLSLLTNGTGVFNRTSFESASGLTWLVTTTPRTA 163

cos(P03579, P03581) = 0.04
17 Improving Sensitivity using homology 4-grams
P03579  MPYTINSPSQFVYLSSAY
P03581  MAYSIPTPSQLVYFTENY
Identity 4-gram: SPSQ
Homology 4-grams: APSQ, NPSQ, TPSQ, SPSK
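One way to generate homology 4-grams is sketched below. The slide does not state the rule that was actually used, so this sketch assumes a homology 4-gram differs from the identity 4-gram by a single substitution with a similar residue. The similarity table `SIMILAR` is hypothetical and was chosen only so that the example above is reproduced; a real table would more plausibly be derived from a substitution matrix.

```python
# Hypothetical similarity groups, chosen only to reproduce the example above.
SIMILAR = {"S": "ANT", "Q": "K"}


def homology_4grams(gram, similar=SIMILAR):
    """All 4-grams that differ from `gram` by one substitution
    with a residue listed as similar."""
    variants = []
    for pos, residue in enumerate(gram):
        for sub in similar.get(residue, ""):
            variants.append(gram[:pos] + sub + gram[pos + 1:])
    return variants
```

With this table, `homology_4grams("SPSQ")` includes APSQ, NPSQ, TPSQ and SPSK, along with the variants obtained by substituting the second S.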
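18 Including homology in vector comparison
The query sequence is represented by an identity vector and a homology vector. α_i is the angle between the unknown sequence's vector and the identity vector; α_h is the angle between the unknown sequence's vector and the homology vector.
Score = cos(α_i) + λ cos(α_h)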
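A minimal sketch of the combined score, assuming the two cosine terms are summed with the homology term scaled by λ. Vectors are stored here as dictionaries mapping each 4-gram to its weight. The function names and the default `lam=0.5` are placeholders, since the slide gives no value for λ.

```python
import math


def cosine(u, v):
    """Cosine between two sparse vectors stored as {4-gram: weight} dicts."""
    dot = sum(w * v[g] for g, w in u.items() if g in v)
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


def combined_score(identity_vec, homology_vec, unknown_vec, lam=0.5):
    """Score = cos(alpha_i) + lam * cos(alpha_h); lam is an arbitrary placeholder."""
    return (cosine(identity_vec, unknown_vec)
            + lam * cosine(homology_vec, unknown_vec))
```

An unknown sequence that shares no identity 4-gram with the query can still score above zero through the homology term.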
19 4-gram Search Results
20 Correlation between cosine value and Sequence
alignment identity
21 Conclusions
The use of homology 4-grams improves detection of distant sequences (30-55% sequence identity).
The 4-gram based method also seems to be suitable for sequence search.
After precalculating each sequence's 4-gram vector, it is possible to compare two sequences with a time complexity of O(1) (for a fixed vector length k).