Title: W-metric for protein sequence comparison in SCOP database
Susana Vinga(1), Rodrigo Gouveia-Oliveira(1), Jonas S. Almeida(1,2)
svinga_at_itqb.unl.pt rodrigo_at_itqb.unl.pt
1. Biomathematics Group, ITQB/UNL - R. Qta Grande-6, 2780-156 Oeiras, Portugal.
2. Dept. Biometry & Epidemiology, Medical Univ. South Carolina, 135 Cannon Street, Suite 303, P.O. Box 250835, Charleston, SC 29425, USA
Abstract
Alignment-free metrics for sequence comparison were recently reviewed by the authors [1], but have not until now been the object of a comparative quantitative study. To complement the existing word composition methods, we propose a novel W-metric between two proteins based on their amino acid frequency differences, weighted by a scoring matrix, e.g., Blosum50 or PAM250. In this way, both amino acid composition and mutational rate information are included in the dissimilarity calculation, with the aim of improving classification accuracy. Receiver Operating Characteristic (ROC) curves are used to assess and compare the accuracy of these classification schemes. All algorithms are tested on the Structural Classification of Proteins (SCOP) database. For additional results with other datasets and more details, see also reference [2].

Availability
All Matlab code used to generate the data is available upon request from the authors. Additional material available at
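
The W-metric described in the abstract can be sketched in a few lines. The version below is a minimal illustration in Python rather than the authors' Matlab code, and it assumes one plausible reading of the definition: the quadratic form (f_X - f_Y)^T W (f_X - f_Y), where f_X and f_Y are amino acid frequency vectors and W is the scoring matrix. The function names (`frequencies`, `w_metric`, `identity_matrix`) and the two-letter toy matrix are hypothetical; a real run would supply the full 20x20 Blosum50 or PAM250 matrix as `weights`.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def frequencies(seq, alphabet=AMINO_ACIDS):
    """Relative frequency of each alphabet symbol in seq (other symbols ignored)."""
    counts = Counter(c for c in seq.upper() if c in alphabet)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no symbols from the alphabet")
    return [counts[a] / total for a in alphabet]


def w_metric(seq_x, seq_y, weights, alphabet=AMINO_ACIDS):
    """Assumed form of the W-metric: (fx - fy)^T W (fx - fy).

    weights is a square list-of-lists indexed in the same order as alphabet.
    """
    fx = frequencies(seq_x, alphabet)
    fy = frequencies(seq_y, alphabet)
    diff = [a - b for a, b in zip(fx, fy)]
    n = len(alphabet)
    return sum(diff[i] * weights[i][j] * diff[j]
               for i in range(n) for j in range(n))


def identity_matrix(n):
    """With W = identity the quadratic form is the squared Euclidean distance."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    # Hypothetical 2x2 weights over a toy alphabet "AC", for illustration only.
    toy_weights = [[2.0, -1.0], [-1.0, 2.0]]
    print(w_metric("AACA", "CCAC", toy_weights, alphabet="AC"))
    print(w_metric("MKVLA", "MKVLG", identity_matrix(20)))
```

Passing the identity matrix recovers the plain squared Euclidean distance between composition vectors, which makes the role of the scoring matrix explicit: off-diagonal entries let differences in one amino acid interact with differences in another.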
Protein test dataset: SCOP database
The Structural Classification of Proteins (SCOP) database provides a detailed and reliable description of protein structure relationships and homology, including Protein Data Bank (PDB) entries. Its hierarchical organization into four levels or groups, family (fa), superfamily (sf), class fold (cf) and class (cl), allows each metric to be studied at different levels of similarity. The protein dataset used to test the metrics was PDB40, which contains sequences with less than 40% sequence identity to each other.
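
To make the evaluation concrete, the sketch below shows how pairs of proteins could be labelled as related at a given SCOP level and how a distance could then be scored by the area under its ROC curve. It is an illustration in Python, not the authors' procedure. It assumes each protein carries a SCOP-style identifier of the form class.fold.superfamily.family (for example "b.1.1.4"); the identifiers, distances and function names used here are hypothetical.

```python
LEVELS = ("cl", "cf", "sf", "fa")  # class, class fold, superfamily, family


def related_at(sccs_a, sccs_b, level):
    """True if two identifiers agree on every field down to the given level."""
    depth = LEVELS.index(level) + 1
    return sccs_a.split(".")[:depth] == sccs_b.split(".")[:depth]


def shared_level(sccs_a, sccs_b):
    """Deepest level shared by both identifiers, or None if classes differ."""
    deepest = None
    for level, a, b in zip(LEVELS, sccs_a.split("."), sccs_b.split(".")):
        if a != b:
            break
        deepest = level
    return deepest


def roc_auc(distances, labels):
    """Area under the ROC curve for a dissimilarity score.

    Computed as the fraction of (related, unrelated) pairs in which the
    related pair has the smaller distance; ties count as one half.
    """
    pos = [d for d, rel in zip(distances, labels) if rel]
    neg = [d for d, rel in zip(distances, labels) if not rel]
    if not pos or not neg:
        raise ValueError("need at least one related and one unrelated pair")
    wins = sum(1.0 if p < n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical identifiers and distances, for illustration only.
    pairs = [("b.1.1.4", "b.1.1.4"), ("b.1.1.4", "b.1.1.1"),
             ("b.1.1.4", "b.1.2.1"), ("b.1.1.4", "a.4.5.1")]
    distances = [0.10, 0.40, 0.35, 0.80]
    labels = [related_at(a, b, "sf") for a, b in pairs]
    print(labels, roc_auc(distances, labels))
```

Repeating the labelling for each of the four levels gives one ROC curve per level for each metric, so that metrics can be compared at different levels of similarity.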
FIGURE: SCOP/ASTRAL database hierarchical classification of proteins. Example: classification of Fibroblast growth factor receptor (FGFR2) at each of the four levels.
References
[1] S. Vinga and J. S. Almeida, Alignment-free sequence comparison - a review. Bioinformatics, 19:513-523, 2003.
[2] S. Vinga, R. Gouveia-Oliveira and J. S. Almeida, Comparative evaluation of word composition distances for the recognition of SCOP relationships. Bioinformatics, (accepted).
Acknowledgments
S. Vinga and J. S. Almeida gratefully acknowledge financial support from grants SFRH/BD/3134/2000 and SAPIENS/34794/99 from Fundação para a Ciência e a Tecnologia (FCT) of the Portuguese Ministério da Ciência e do Ensino Superior. R. Gouveia-Oliveira gratefully acknowledges grant QLK2-CT-2000-01020 (EURIS) from the European Commission.