Title: Identification of specificity-determining positions in protein alignments
1. Identification of specificity-determining positions in protein alignments
- Mikhail Gelfand
- Research and Training Center "Bioinformatics"
- Institute for Information Transmission Problems, RAS
- ECCB 2005, Madrid
2. Motivation
- Large protein families with general function assigned by homology, but not much functional information
- Much less structural data; not many structures with substrates, cofactors, etc.
- Some specificity assignments from comparative genomics
=> Search for specificity-determining positions in alignments:
- identification of functional sites
- prediction of specificity
- understanding and eventually re-design of function
3. Specificity (of transporters) from comparative genomics: three examples. Example 1. New specificities in a little-studied family
[Figure legend: S-box (rectangle frame); MetJ (circle frame); LYS-element (circles); Tyr-T-box (rectangles). Figure label: malate/lactate]
4. Example 2. Misleading homology: the PnuC family of transporters
[Figure labels: the THI elements; the RFN elements]
5. Example 3. A nightmare: the NiCoT family of nickel-cobalt transporters
6. SDP (Specificity-Determining Position)
- Alignment position that is conserved within groups of proteins having the same specificity (specificity groups) but differs between them
An SDP is not equivalent to a functionally important position
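7. Measure of specificity: mutual information
$$ I_p = \sum_{i=1}^{N} \sum_{a} f_p(a,i) \log \frac{f_p(a,i)}{f_p(a)\, f(i)} $$
- f_p(a,i): count of amino acid a in group i at position p, divided by the total number of sequences
- f_p(a): frequency of amino acid a in position p
- f(i): fraction of proteins in group i
(N is the number of specificity groups; the sum over a runs over amino acids.)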
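The three frequencies listed above are the ingredients of the mutual information between residue identity and group label at one alignment column. The following is a minimal sketch, not the SDPpred implementation: the function name `mutual_information`, the toy columns, and the choice to treat every character (including a gap) as an ordinary symbol are assumptions made for illustration.

```python
import math
from collections import Counter

def mutual_information(column, groups):
    """Mutual information between residue and group label for one column.

    column: residues at one alignment position, one per sequence
    groups: specificity-group label of each sequence, in the same order
    """
    n = len(column)
    joint = Counter(zip(column, groups))  # counts of (residue a, group i)
    residue_counts = Counter(column)      # counts of residue a in the column
    group_counts = Counter(groups)        # number of sequences in group i
    info = 0.0
    for (a, i), c in joint.items():
        f_ai = c / n
        f_a = residue_counts[a] / n
        f_i = group_counts[i] / n
        info += f_ai * math.log(f_ai / (f_a * f_i))
    return info

# Toy data (hypothetical): two groups of three sequences each.
groups = ["g1", "g1", "g1", "g2", "g2", "g2"]
sdp_like = mutual_information("AAAGGG", groups)   # conserved within groups, differs between them
invariant = mutual_information("AAAAAA", groups)  # conserved everywhere
```

In this toy case the group-specific column reaches the maximum possible value for two equal groups (log 2), while the invariant column scores zero, which illustrates why a conserved position is not automatically an SDP.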
8. Taking into account the structure of the phylogenetic tree: random shuffling and linear regression
[Formula not recovered from the slide: linear regression, with a quantity minimized (-> min)]
=> positions that are more specific than expected given the tree
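The random-shuffling part of this slide can be sketched as follows: the residues of a column are permuted among the sequences, the mutual information is recomputed for each permutation, and the observed value is expressed as a Z-score against that null distribution. This is a hedged sketch only; the function names, the number of shuffles, and the fixed seed are assumptions, and the linear-regression step named in the slide title is not reproduced here because its formula did not survive in the transcript.

```python
import math
import random
from collections import Counter

def mutual_information(column, groups):
    """Mutual information between residue and group label for one column."""
    n = len(column)
    joint = Counter(zip(column, groups))
    residue_counts = Counter(column)
    group_counts = Counter(groups)
    return sum(
        (c / n) * math.log((c / n) / ((residue_counts[a] / n) * (group_counts[i] / n)))
        for (a, i), c in joint.items()
    )

def shuffled_z_score(column, groups, n_shuffles=1000, seed=0):
    """Z-score of the observed mutual information against shuffled columns."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = mutual_information(column, groups)
    residues = list(column)
    samples = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)  # keeps column composition, breaks group association
        samples.append(mutual_information(residues, groups))
    mean = sum(samples) / n_shuffles
    var = sum((s - mean) ** 2 for s in samples) / (n_shuffles - 1)
    sd = math.sqrt(var)
    return (observed - mean) / sd if sd > 0 else 0.0

groups = ["g1"] * 6 + ["g2"] * 6
z_specific = shuffled_z_score("AAAAAAGGGGGG", groups)
z_mixed = shuffled_z_score("AGAGAGAGAGAG", groups)
```

A column that separates the two groups perfectly gets a large positive Z-score, while a column whose residues are spread evenly over both groups scores below the shuffled mean.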
9. Smoothing: pseudocounts and similarity between amino acid residues
- m(a→b): amino acid substitution matrix
- n(a,i): count of amino acid a at position i
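One common way to combine observed counts with a substitution matrix is to add, for each residue, pseudocounts proportional to the observed counts of similar residues. The sketch below assumes that form; the exact formula used on the slide is not available in the transcript, and the weight `kappa`, the three-letter alphabet, and the substitution probabilities are all hypothetical values chosen for illustration.

```python
def smoothed_frequencies(counts, subst, kappa=0.5):
    """Smooth residue counts with substitution-based pseudocounts.

    counts: observed counts n(a) for one column (or one group in a column)
    subst:  subst[b][a] plays the role of m(b -> a); each row sums to 1
    kappa:  weight of the pseudocounts (assumed parameter)
    """
    alphabet = list(subst)
    total = sum(counts.get(a, 0) for a in alphabet)
    if total == 0:
        raise ValueError("no observed residues to smooth")
    smoothed = {}
    for a in alphabet:
        pseudo = sum(counts.get(b, 0) * subst[b][a] for b in alphabet)
        # Rows of subst sum to 1, so all pseudocounts together add up to `total`.
        smoothed[a] = (counts.get(a, 0) + kappa * pseudo) / (total * (1 + kappa))
    return smoothed

# Hypothetical toy matrix: I and L are similar to each other, D is not.
toy_subst = {
    "I": {"I": 0.60, "L": 0.35, "D": 0.05},
    "L": {"I": 0.35, "L": 0.60, "D": 0.05},
    "D": {"I": 0.10, "L": 0.10, "D": 0.80},
}
freqs = smoothed_frequencies({"I": 4}, toy_subst)
```

With only isoleucine observed, the smoothed frequency of leucine ends up higher than that of aspartate, so a group with I and a group with L look less different than a group with I and a group with D.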
10. Automated threshold setting: the Bernoulli estimator
- Are 5 SDPs with Z-score > 12 better than 10 SDPs with Z-score > 9?
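The question on this slide can be turned into a computation: for each candidate cutoff k, estimate how unlikely it is to see at least k positions with Z-scores that high by chance, and keep the k that is least likely. The sketch below assumes a standard normal null for the Z-scores and a binomial upper tail as the probability being minimized; both are assumptions of this illustration rather than details confirmed by the slide, and the function name `bernoulli_cutoff` is hypothetical.

```python
import math

def bernoulli_cutoff(z_scores):
    """Return (k, p): the number of top positions to report and its tail probability."""
    z = sorted(z_scores, reverse=True)
    n = len(z)
    best_k, best_p = 0, 1.0
    for k in range(1, n + 1):
        # Chance that a single position reaches the k-th highest Z-score under N(0, 1).
        q = 0.5 * math.erfc(z[k - 1] / math.sqrt(2))
        # Chance that at least k of the n positions reach it.
        tail = sum(
            math.comb(n, j) * q**j * (1 - q) ** (n - j) for j in range(k, n + 1)
        )
        if tail < best_p:
            best_k, best_p = k, tail
    return best_k, best_p

# Toy scores (hypothetical): three clear outliers over a background of 47 positions.
toy_scores = [12.0, 11.0, 10.0] + [2.0 - 4.0 * i / 46 for i in range(47)]
k_star, p_star = bernoulli_cutoff(toy_scores)
```

For long alignments the binomial terms are better accumulated in log space; the direct sum is used here only to keep the sketch short.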
11. Other similar techniques
- Evolutionary trace (Lichtarge et al., 1996, 1997): needs structure; gradual construction of group-specific consensus
- Evolutionary rate shifts (DIVERGE, Gu et al., 2002): positions with group-specific evolutionary rate
- Surface patches of slowly evolving residues (Rate4Site, Pupko et al., 2002): needs structure
- PCA in the sequence space (Casari et al., 1995)
- Correlated mutations (Pazos and Valencia, 2002)
- Prediction of functional sub-types (Hannenhalli and Russell, 2000): relative entropy of HMM profiles for groups
12. SDPpred: Web interface
Input: multiple alignment of proteins divided into specificity groups
AQP
sp|Q9L772|AQPZ_BRUME
-------------------------------------mlnklsaeffgtfwlvfggcgsa
ilaa--afp-------elgigflgvalafgltvltmayavggisg--ghfnpavslgltv
iiilgsts------------------------------slap------------------
qlwlfwvaplvgavigaiiwkgllgrd---------------------------------
------
sp|P48838|AQPZ_ECOLI
-------------------------------------mfrklaaecfgtfwlvfggcgsa
vlaa--gfp-------elgigfagvalafgltvltmafavghisg--ghfnpavtiglwa
lvihgatd------------------------------kfap------------------
qlwffwvvpivggiiggliyrtllekrd--------------------------------
------
tr|Q92ZW9
-------------------------------------mfkklcaeflgtcwlvlggcgsa
vlas--afp-------qvgigllgvsfafgltvltmaytvggisg--ghfnpavslglav
iiilgsth------------------------------rrvp------------------
qlwlfwiaplfgaaiagivwksvgeefrpvd-----------------------------
------
GLP
sp|P11244|GLPF_ECOLI
----------------------------msqt---stlkgqciaeflgtglliffgvgcv
aalkvag---------a-sfgqweisviwglgvamaiyltagvsg--ahlnpavtialwl
glilaltd------------------------------dgn--------------g-vpr
-flvplfgpivgaivgafayrkligrhlpcdicvveek--etttpseqkasl--------
------
sp|P44826|GLPF_HAEIN
----------------------------mdks-----lkancigeflgtalliffgvgcv
[alignment truncated on the slide]
13. SDPpred: Output
- Alignment of the family with the SDPs highlighted (Alignment view)
- Detailed description of each SDP (List of SDPs)
- Plot of probabilities used by the Bernoulli estimator to set the cutoff (Probability plot view)
14. Transcription factors from the LacI family
- Training set: 459 sequences,
- average length: 338 amino acids,
- 85 specificity groups
44 SDPs:
10 residues contact NPF (analog of the effector)
7 residues in the effector contact zone (5 Å < d_min < 10 Å)
6 residues in the intersubunit contacts
5 residues in the intersubunit contact zone (5 Å < d_min < 10 Å)
7 residues contact the operator sequence
6 residues in the operator contact zone (5 Å < d_min < 10 Å)
LacI from E. coli
15. SDP clusters at the subunit contact region
[Figure labels: Cluster I; Effector; Cluster II; DNA operator]
LacI (lactose repressor) from E. coli (1jwl)
16. Overall statistics (LacI of E. coli)
- Total: 348 amino acids
- 44 SDPs
[Chart categories:]
Non-contacting residues (distance to the DNA, effector, or the other subunit > 10 Å)
Contact zone (may be functional)
Contacting residues (distance to the DNA, effector, or the other subunit < 5 Å)
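The three residue classes used in these statistics follow from the minimal distance between a residue and its partner (DNA, effector, or the other subunit). A minimal sketch of that classification is given below; the function names and the toy coordinates are hypothetical, and the handling of distances exactly equal to 5 Å or 10 Å is an assumption, since the slides do not specify the boundary cases.

```python
import math

def min_distance(atoms_a, atoms_b):
    """Smallest distance (in Å) between any atom of A and any atom of B."""
    return min(math.dist(p, q) for p in atoms_a for q in atoms_b)

def classify_residue(d_min, contact=5.0, zone=10.0):
    """Assign a residue to one of the three classes used in the statistics."""
    if d_min < contact:
        return "contacting"
    if d_min <= zone:  # boundary handling is an assumption of this sketch
        return "contact zone"
    return "non-contacting"

# Toy coordinates (hypothetical): one ligand atom and three single-atom "residues".
ligand = [(0.0, 0.0, 0.0)]
residues = {
    "res_near": [(3.0, 0.0, 0.0)],
    "res_mid": [(0.0, 7.0, 0.0)],
    "res_far": [(0.0, 0.0, 12.0)],
}
classes = {name: classify_residue(min_distance(atoms, ligand))
           for name, atoms in residues.items()}
```

In practice the atom lists would come from a structure file of the complex; only the distance thresholds are taken from the slides.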
17. Membrane channels of the MIP family
- Training set: 17 sequences,
- average length: 280 amino acids,
- 2 specificity groups:
- aquaporins and glyceroaquaporins
21 SDPs:
8 residues contact glycerol (substrate) (d_min < 5 Å)
8 residues oriented to the channel
5 residues in the contacts with other subunits
GlpF from E. coli
18. Two SDP clusters at the contact of subunits forming the tetramer
[Figure labels: Cluster II; Cluster I; 20Leu, 24Ile, 108Tyr of one subunit, 193Ser of another subunit; Glu43; Substrate (glycerol); Subunit I]
GlpF (glycerol facilitator) from E. coli (1fx8)
19. Overall statistics (GlpF from E. coli)
- Total: 281 amino acids
- 21 SDPs
[Chart categories:]
Non-contacting residues (distance to the substrate or another subunit > 10 Å)
Contact zone (may be functional)
Contacting residues (distance to the substrate or another subunit < 5 Å)
20. Isocitrate/isopropylmalate dehydrogenases: combinations of specificities towards substrate and cofactor
- IDH catalyzes the oxidation of isocitrate to α-ketoglutarate and CO2 (TCA cycle), using either NAD or NADP as a cofactor, in organisms from prokaryotes to higher eukaryotes
- IMDH catalyzes the oxidative decarboxylation of 3-isopropylmalate into 2-oxo-4-methylvalerate (leucine biosynthesis) in prokaryotes and fungi; the cofactor is NAD
[Tree labels: Eukaryota; Archaea, Bacteria, Eukaryota; Mitochondria; Archaea, Bacteria]
21. Selecting specificity groups
1. By substrate: all IDHs vs. all IMDHs
2. By cofactor: all NAD-dependent vs. all NADP-dependent
3. Four groups
[Tree labels, repeated in each of the three tree panels: IDH (NADP) type II; IDH (NAD); IMDH (NAD); IDH (NADP) type I]
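The three groupings differ only in which annotation is used as the group label, so the same alignment can be partitioned three ways before SDP prediction. The following sketch illustrates this with made-up records; the sequence identifiers, the record layout, and the helper names are hypothetical.

```python
from collections import defaultdict

# Hypothetical annotation records: (sequence id, enzyme, cofactor, IDH type or None).
records = [
    ("seq1", "IDH", "NADP", "I"),
    ("seq2", "IDH", "NADP", "II"),
    ("seq3", "IDH", "NAD", None),
    ("seq4", "IMDH", "NAD", None),
]

def group_by(records, label):
    """Partition sequence ids by the label computed for each record."""
    groups = defaultdict(list)
    for rec in records:
        groups[label(rec)].append(rec[0])
    return dict(groups)

def four_group_label(rec):
    """Label combining enzyme, cofactor and, where present, the IDH type."""
    _, enzyme, cofactor, idh_type = rec
    base = f"{enzyme} ({cofactor})"
    return f"{base} type {idh_type}" if idh_type else base

by_substrate = group_by(records, lambda r: r[1])  # grouping 1
by_cofactor = group_by(records, lambda r: r[2])   # grouping 2
four_groups = group_by(records, four_group_label) # grouping 3
```

Running the SDP search once per grouping is what makes it possible to separate substrate-specific from cofactor-specific positions.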
22. Predicted SDPs
[Panel captions:]
Most SDPs near the substrate
SDPs near the substrate and the cofactor
SDPs near the substrate, the cofactor and the other subunit
23. SDPs, the cofactor and the substrate
Substrate (isocitrate)
100Lys, 104Thr, 105Thr, 107Val, 337Ala, 341Thr: substrate-specific and four-group SDPs, functionally not characterized
Cofactor (NADP): nicotinamide nucleotide, adenine nucleotide
344Lys, 345Tyr, 351Val: cofactor-specific SDPs, known determinants of specificity to cofactor
NADP-dependent IDH from E. coli (1ai2)
24. SDPs predicted for different groupings
[Diagram with three sets: substrate-specific SDPs, cofactor-specific SDPs, four groups. The assignment of individual residues to the sets is not recoverable from the extracted text.]
Residues shown: 208Arg, 337Ala, 100Lys, 300Ala, 105Thr, 341Thr, 229His, 154Glu, 103Leu, 233Ile, 97Val, 158Asp, 115Asn, 305Asn, 308Tyr, 98Ala, 155Asn, 231Gly, 327Asn, 287Gln, 344Lys, 164Glu, 345Tyr, 351Val, 241Phe, 38Gly, 40Asp, 104Thr, 107Val, 152Phe, 161Ala, 232Asn, 245Gly, 323Ala, 31Tyr, 36Gly, 162Gly, 45Met
Color code:
- Contacts cofactor
- Contacts substrate AND cofactor
- Contacts substrate
- Contacts substrate AND the other subunit
- Contacts the other subunit
25. Overview
- Transcription factors: contacts with the cofactor and the DNA
- Transporters: contacts with the substrate
- Enzymes: contacts with the substrate and the cofactor
- And all: contacts between subunits
26. Protein-DNA interactions
Entropy at aligned sites (blue plots) and the number of contacts (red: heavy atoms in a base pair at a distance < cutoff from a protein atom)
[Plot panels: CRP, PurR, IHF, TrpR]
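The entropy curve in these plots is a per-column quantity of an alignment of binding sites: low entropy means a conserved base pair. A minimal sketch of that calculation follows; the function names and the toy sites are hypothetical, and the contact counts shown in red on the slide would have to come from a protein-DNA structure, which is not modeled here.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (in bits) of the symbols in one aligned column."""
    n = len(column)
    return sum((c / n) * math.log2(n / c) for c in Counter(column).values())

def site_entropies(sites):
    """Entropy at each position of a set of equal-length aligned sites."""
    return [column_entropy(col) for col in zip(*sites)]

# Toy aligned sites (hypothetical): only the third position varies.
toy_sites = ["TGTGA", "TGTGA", "TGAGA", "TGCGA"]
entropies = site_entropies(toy_sites)
```

Comparing such a profile with the number of protein contacts per base pair is the correlation the slide refers to.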
27. The observed correlation does not depend on the distance cutoff
28. CRP/FNR family of regulators
29. Correlation between contacting nucleotides and amino acid residues
- CooA in Desulfovibrio spp.
- CRP in Gamma-proteobacteria
- HcpR in Desulfovibrio spp.
- FNR in Gamma-proteobacteria
Contacting residues: REnnnR. TG: 1st arginine; GA: glutamate and 2nd arginine
DD  COOA  ALTTEQLSLHMGATRQTVSTLLNNLVR
DV  COOA  ELTMEQLAGLVGTTRQTASTLLNDMIR
EC  CRP   KITRQEIGQIVGCSRETVGRILKMLED
YP  CRP   KXTRQEIGQIVGCSRETVGRILKMLED
VC  CRP   KITRQEIGQIVGCSRETVGRILKMLEE
DD  HCPR  DVSKSLLAGVLGTARETLSRALAKLVE
DV  HCPR  DVTKGLLAGLLGTARETLSRCLSRMVE
EC  FNR   TMTRGDIGNYLGLTVETISRLLGRFQK
YP  FNR   TMTRGDIGNYLGLTVETISRLLGRFQK
VC  FNR   TMTRGDIGNYLGLTVETISRLLGRFQK
Signals (in the order of the list above):
CooA  TGTCGGCnnGCCGACA
CRP   TTGTGAnnnnnnTCACAA
HcpR  TTGTgAnnnnnnTcACAA
FNR   TTGATnnnnATCAA
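The correlation on this slide can be checked directly on the sequence fragments listed above by reading out the three columns of the REnnnR pattern for each regulator. The sketch below uses a subset of the rows; the column indices are read off the E. coli CRP row and are an assumption of this illustration, as are the dictionary layout and the function name.

```python
# Recognition-helix fragments copied from the slide (a subset of the rows).
helices = {
    ("DD", "CooA"): "ALTTEQLSLHMGATRQTVSTLLNNLVR",
    ("EC", "CRP"): "KITRQEIGQIVGCSRETVGRILKMLED",
    ("DD", "HcpR"): "DVSKSLLAGVLGTARETLSRALAKLVE",
    ("EC", "FNR"): "TMTRGDIGNYLGLTVETISRLLGRFQK",
}

# 0-based columns of R, E and the second R in the REnnnR pattern,
# as located in the CRP row (assumed to hold for the whole fragment alignment).
CONTACT_COLUMNS = (14, 15, 19)

def contact_residues(fragment):
    """Residues found at the three contacting columns of a fragment."""
    return "".join(fragment[i] for i in CONTACT_COLUMNS)

contacts = {name: contact_residues(seq) for name, seq in helices.items()}
```

CRP and HcpR keep all three residues and share the TGTGA-type signal; FNR lacks the first arginine and its signal lacks the TG dinucleotide; CooA lacks the glutamate and the second arginine and its signal differs at the GA positions.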
30. The correlation holds for other factors in the family
31. Plans and perspectives. Protein-DNA interactions
LacI family of transcriptional regulators (each branch represents a subfamily)
32. ... and their signals
1605 regulators from 189 genomes, forming 302 groups of orthologs and binding 2518 sites
33. Plans and perspectives. Experimental verification
- A new family of Ni/Co transporters
- No structural data
- Specificity predicted by comparative genomics
- Predicted SDPs form several clusters in the alignment and are located on the same sides of alpha-helices
- Mutational analysis
34. Terminators of translation in prokaryotes / decoding of stop codons. Specificity of RF1 (UAG, UAA) and RF2 (UGA, UAA)
Fragment of the alignment (117 pairs). SDPs are shown by black boxes above the alignment.
35. Interesting positions: invariant, SDPs, variable rate.
36. SDPs and invariant positions: two decoding sites?
37. Plans and perspectives
- Use of 3D structures, when available. Identification of functional sites as spatial clusters of SDPs and conserved positions
- Automated identification of specificity groups based on the analysis of the phylogenetic tree
- Protein-DNA interactions
- Identification of protein-protein contact surfaces
38. Publications
- N.J.Oparina, O.V.Kalinina, M.S.Gelfand, L.L.Kisselev (2005) Common and specific amino acid residues in the prokaryotic polypeptide release factors RF1 and RF2: possible functional implications. Nucleic Acids Research 33 (in press).
- O.V.Kalinina, A.A.Mironov, M.S.Gelfand, A.B.Rakhmaninova (2004) Automated selection of positions determining functional specificity of proteins by comparative analysis of orthologous groups in protein families. Protein Science 13: 443-456.
- O.V.Kalinina, P.S.Novichkov, A.A.Mironov, M.S.Gelfand, A.B.Rakhmaninova (2004) SDPpred: a tool for prediction of amino acid residues that determine differences in functional specificity of homologous proteins. Nucleic Acids Research 32: W424-W428.
- O.V.Kalinina, M.S.Gelfand, A.A.Mironov, A.B.Rakhmaninova (2003) Amino acid residues forming specific contacts between subunits in tetramers of the membrane channel GlpF. Biophysics (Moscow) 48: S141-S145.
- L.A.Mirny, M.S.Gelfand (2002) Using orthologous and paralogous proteins to identify specificity determining residues in bacterial transcription factors. Journal of Molecular Biology 321: 7-20.
- L.Mirny, M.S.Gelfand (2002) Structural analysis of conserved base-pairs in protein-DNA complexes. Nucleic Acids Research 30: 1704-1711.
- http://math.belozersky.msu.ru/psn/
39. Acknowledgements
- Leonid Mirny (Harvard, MIT)
- Olga Kalinina
- Andrei A. Mironov
- Alexandra B. Rakhmaninova
- Dmitry Rodionov
- Olga Laikova
- Howard Hughes Medical Institute
- Ludwig Institute of Cancer Research
- Russian Fund of Basic Research
- Russian Academy of Sciences, programs "Molecular and Cellular Biology" and "Origin and Evolution of the Biosphere"