Methods of Protein Structure Alignment

1
Methods of Protein Structure Alignment

David Hoksza
Charles University in Prague, Department of Software Engineering, Czech Republic

2
Presentation Outline

Biological background
Protein databases
Protein structure
Similarity measures
Algorithms

3
Terminology

DNA (deoxyribonucleic acid)
sequence of nucleotides (A, C, G, T)
double-helix
RNA (ribonucleic acid)
single-stranded sequence of nucleotides (A, C, G, U)
messenger RNA (mRNA)
transfer RNA (tRNA)
ribosomal RNA (rRNA)
Proteins
molecules
translated from mRNA in ribosomes
sequence of amino acids (20 AAs)
each amino acid coded by a codon (triplet of nucleotides)
genetic code
central dogma: DNA → RNA → protein
transcription (DNA → RNA)
translation (RNA → protein)
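
The flow from DNA to protein described on this slide can be sketched in a few lines. This is only an illustration: the input is assumed to be the coding (sense) strand, and `CODON_TABLE` is a deliberately tiny, hypothetical subset of the standard genetic code that covers just the codons used in the example.

```python
# Minimal sketch of DNA -> mRNA -> protein.
# CODON_TABLE is a small subset of the standard genetic code (assumption:
# only the codons needed for the example below are listed).
CODON_TABLE = {
    "AUG": "Met",  # also the usual start codon
    "UUU": "Phe",
    "GGC": "Gly",
    "GCU": "Ala",
    "AAA": "Lys",
    "UAA": None,   # stop codon
}


def transcribe(coding_dna: str) -> str:
    """Return the mRNA for a coding (sense) DNA strand: T becomes U."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> list[str]:
    """Read codons (triplets) from position 0 until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]  # KeyError outside the subset
        if amino_acid is None:
            break
        protein.append(amino_acid)
    return protein


mrna = transcribe("ATGTTTGGCAAATAA")
print(mrna)             # AUGUUUGGCAAAUAA
print(translate(mrna))  # ['Met', 'Phe', 'Gly', 'Lys']
```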

4
Protein Similarity

Interactions between proteins determine biological functions
The function of a protein derives from its three-dimensional structure
similar proteins (many common amino acids at corresponding positions) have similar structures
→ similar proteins have similar functions
similar proteins have a common ancestor
Identifying a protein sequence
→ finding similar proteins
→ getting a clue to its function

5
Protein Databases

Finding similar proteins
even among different species
Prominent non-structural databases
GenBank
EMBL (European Molecular Biology Laboratory Data Library)
DDBJ (DNA Data Bank of Japan)
UniProt
Swiss-Prot, TrEMBL (translated EMBL), PIR (Protein Information Resource)
Prominent structural databases
PDB (Protein Data Bank)
SCOP (Structural Classification of Proteins)
ASTRAL Compendium (for Sequence and Structure Analysis)

(diagram labels: not moderated, moderated)

6
Databases Growth

7
Databases Growth (PDB)

8
Levels of complexity of protein structure

Primary structure
linear sequence of amino acids
Secondary structure
local three-dimensional segments folded into specific repeated structures
alpha helices, beta sheets (strands)
Tertiary structure
the atomic coordinates: spatial relations among the secondary structure elements
Quaternary structure
arrangement of multiple polypeptide chains

9
Protein structure

Amino acids differ in their side chains (R-groups)
Connection: peptide bonds
Protein sequence → sequence of rigid planes
Degrees of freedom
Planes
R-groups

3D conformation described by dihedral angles
Only α-carbons (Cα) are usually considered
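
A dihedral angle is defined by four consecutive points and measures the rotation about the bond between the middle two. The sketch below uses the common atan2 formulation; the function name `dihedral_deg` and the tuple-based vector helpers are illustrative choices, not part of the presentation. For a protein backbone, φ is usually taken over the atoms C(i-1), N, Cα, C and ψ over N, Cα, C, N(i+1).

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral_deg(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) about the p1-p2 axis."""
    b1, b2, b3 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    n1, n2 = cross(b1, b2), cross(b2, b3)  # normals of the two planes
    y = math.sqrt(dot(b2, b2)) * dot(b1, n2)
    x = dot(n1, n2)
    return math.degrees(math.atan2(y, x))


# cis arrangement: all four points on the same side, angle 0
print(dihedral_deg((1, 0, 0), (0, 0, 0), (0, 1, 0), (1, 1, 0)))
# trans arrangement: magnitude 180
print(abs(dihedral_deg((1, 0, 0), (0, 0, 0), (0, 1, 0), (-1, 1, 0))))
```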

10
Protein structure cont.

SSE: Secondary Structural Elements
Repetitive structures arising from hydrogen bonds
Alpha helices
the i-th amino acid is hydrogen-bonded to the (i+4)-th amino acid
φ and ψ angles are constant
3.6 peptide units per turn
Beta sheets (strands)
multiple strands connected to each other by hydrogen bonds
parallel/antiparallel
Motifs
Combinations (second form) of SSEs
beta ribbon, beta-barrel, beta-hairpin, helix-loop-helix, Greek key, ...

11
Similarity measures

RMSD: Root Mean Square Deviation/Difference/Distance
Summarizes the distances between aligned residue pairs
Evaluates the quality of a matching (superposition)
cRMSD (coordinate RMSD)
inter-molecular: distances between matched residues of the two superposed structures
dRMSD (distance RMSD)
intra-molecular: compares residue-to-residue distances within each structure
elastic similarity score
disregarding outliers
fragmented dRMSD
aimed at recognizing similar substructures
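
The two basic RMSD variants can be written down directly. This is a minimal sketch that assumes the residue correspondence is already given (point i of one structure is matched to point i of the other) and, for `crmsd`, that the structures have already been superposed; finding the optimal superposition is a separate step not shown here. Function names are illustrative.

```python
import math


def crmsd(a, b):
    """Coordinate RMSD between matched points of two superposed structures."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty point lists of equal length")
    total = sum(math.dist(p, q) ** 2 for p, q in zip(a, b))
    return math.sqrt(total / len(a))


def drmsd(a, b):
    """Distance RMSD: compares intra-structure distances, so no superposition is needed."""
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("need two point lists of equal length >= 2")
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += (math.dist(a[i], a[j]) - math.dist(b[i], b[j])) ** 2
    return math.sqrt(total / (n * (n - 1) / 2))


a = [(0, 0, 0), (1, 0, 0), (1, 1, 0)]
b = [(1, 0, 0), (2, 0, 0), (2, 1, 0)]  # same shape, shifted by 1 along x
print(crmsd(a, b))  # sensitive to the shift
print(drmsd(a, b))  # invariant to the shift
```

The example illustrates the design difference: a rigid shift changes cRMSD unless the structures are superposed first, while dRMSD depends only on internal distances.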

12
Algorithms

Goals
Alignment
direct similarity
Classification
indirect similarity
Methods
Incremental extension
extending initial partial alignment
Dynamic programming
dynamic programming matrix of (usually) distances
Indexing
features are extracted and indexed by
trees
geometric hashing

13
Algorithms - Incremental Extension

DALI
Elastic similarity score
Matrix of inter-residue distances
Similar proteins → similar inter-residue distances → similar distance matrices
Contact pattern (CP): submatrix of fixed size (hexapeptides)
Similar pairs of CPs are stored and one is used as a seed
Monte Carlo optimization is used to extend the existing alignment
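
The distance-matrix idea can be illustrated with a short sketch. It assumes the form of the elastic similarity score commonly cited for DALI, in which each residue pair contributes (θ − |dA − dB| / d*) · exp(−(d*/α)²) with d* the mean of the two distances; the parameter values θ = 0.2 and α = 20 Å are the usually quoted ones and should be checked against the original publication. The helper names, the `contact_pattern` window extraction, and the assumption that residue i of A is aligned to residue i of B are all illustrative.

```python
import math

THETA = 0.2   # similarity threshold (assumed value)
ALPHA = 20.0  # decay length of the distance envelope in angstroms (assumed value)


def distance_matrix(coords):
    """All pairwise distances within one structure (e.g. between C-alpha atoms)."""
    return [[math.dist(p, q) for q in coords] for p in coords]


def contact_pattern(dmat, i, j, size=6):
    """Fixed-size submatrix pairing fragment i..i+size-1 with fragment j..j+size-1."""
    return [row[j:j + size] for row in dmat[i:i + size]]


def elastic_score(da, db):
    """Sum of elastic terms over all aligned residue pairs (i, j)."""
    n = len(da)
    score = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                score += THETA
                continue
            mean = (da[i][j] + db[i][j]) / 2.0  # assumes distinct residues never coincide
            envelope = math.exp(-((mean / ALPHA) ** 2))
            score += (THETA - abs(da[i][j] - db[i][j]) / mean) * envelope
    return score


a = distance_matrix([(0, 0, 0), (3, 0, 0)])
b = distance_matrix([(0, 0, 0), (6, 0, 0)])
print(elastic_score(a, a))  # identical matrices: every term is positive
print(elastic_score(a, b))  # large relative deviation: off-diagonal terms turn negative
```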

CE (Combinatorial Extension)
AFP: Aligned Fragment Pair (constant-length portions of local structure)
Fragmented dRMSD
AFPs are joined based on three different distance measures
Several paths are computed and the best of them is optimized by Smith-Waterman (on the distance matrix)

14
Algorithms - Dynamic Programming

SAP
Double dynamic programming
View: vector of distances to other residues
Optimal alignments (Smith-Waterman) are computed between pairs of views and used to fill the final DP matrix → final alignment
PROSUP
Identification of seed fragments
Expand seed fragments to initial alignments
Apply DP (Needleman-Wunsch) to refine initial alignments
Evaluate refined alignments
STRUCTAL
Evaluated as the best structural alignment algorithm in Kolodny, R., Koehl, P., Levitt, M. Comprehensive evaluation of protein structure alignment methods: scoring by geometric measures. J. Mol. Biol. 346 (2005), 1173-88.
Initial alignment based on several rules (match beginnings of structures, match ends, match pairs based on sequence alignment, ...)
Refine the alignments by DP (Needleman-Wunsch)
Exposure weighting
Position-dependent gap penalties
TALI
Incorporates torsion angles → mutual distances form the DP matrix → Smith-Waterman
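
Several of the methods above refine an alignment by running dynamic programming over a residue-by-residue score matrix. The following is a generic Needleman-Wunsch sketch over a precomputed matrix, not the implementation of any of the named tools; the linear gap penalty and the toy scoring function `1 - |a_i - b_j|` on one-dimensional features are assumptions made for the example.

```python
def needleman_wunsch(score, gap=-1.0):
    """Global alignment over a score matrix score[i][j].

    Returns (total score, list of aligned index pairs).
    """
    n, m = len(score), len(score[0])
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(dp[i - 1][j - 1] + score[i - 1][j - 1],  # match i with j
                           dp[i - 1][j] + gap,                      # gap in second structure
                           dp[i][j - 1] + gap)                      # gap in first structure
    # Trace back from the bottom-right corner to recover the matched pairs.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if dp[i][j] == dp[i - 1][j - 1] + score[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif dp[i][j] == dp[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return dp[n][m], pairs


# Toy per-residue features (hypothetical values) and a similarity derived from them.
a = [1.0, 2.0, 3.0]
b = [1.0, 3.0]
score = [[1.0 - abs(x - y) for y in b] for x in a]
print(needleman_wunsch(score, gap=-0.5))  # (1.5, [(0, 0), (2, 1)])
```

Replacing the boundary penalties with zeros and clamping each cell at zero would turn this into the local (Smith-Waterman) variant mentioned for SAP and TALI.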

15
Algorithms - Indexing - Trees

PSIST
Features
sliding window of size w
distance
angle
Suffix-tree

PSI
Representation
SSE approximation
amino acids → centers of mass
defining local neighborhood for each SSE
features (9 dimensions)
distances in triplets
angles between pairs in triplets
R-tree

16
Algorithms - Indexing - Hashing

ProGreSS
Combines structure and sequence
Features (sliding window of size w)
structure
curvature (dim w)
torsion angles (dim w)
Haar wavelet transformation → dim dt
normalization → [0,1]^dt space
sequence
row of a scoring matrix (PAM, BLOSUM, ...) → scoring vector → chaining scoring vectors → dim 20w
Haar wavelet transformation → dim dq
normalization → [0,1]^dq space
SSE type
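
The Haar wavelet step used for dimensionality reduction can be sketched as follows. This is a generic Haar decomposition using the averaging convention (pairwise averages and half-differences), which is an assumption; ProGreSS may use a differently normalized variant. The window values are made up for the example.

```python
def haar_step(values):
    """One level: pairwise averages and pairwise half-differences."""
    half = len(values) // 2
    averages = [(values[2 * i] + values[2 * i + 1]) / 2 for i in range(half)]
    details = [(values[2 * i] - values[2 * i + 1]) / 2 for i in range(half)]
    return averages, details


def haar_transform(values):
    """Full decomposition: [overall average, coarsest details, ..., finest details]."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    current = list(values)
    out = []
    while len(current) > 1:
        current, details = haar_step(current)
        out = details + out  # coarser details go in front of finer ones
    return current + out


window = [4, 6, 10, 12]          # e.g. a feature sampled over a window of size w = 4
coeffs = haar_transform(window)
print(coeffs)                    # [8.0, -3.0, -1.0, -1.0]
print(coeffs[:2])                # keeping the first d coefficients reduces the dimension
```

Because the leading coefficients capture the coarse shape of the window, truncating to the first few of them gives a lower-dimensional feature vector that can then be normalized and indexed.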

CTSS
Theory of differential geometry
Structure represented as a 3D spline
Features
torsion angle
curvature
SSE type
Pairwise comparison
Smith-Waterman
scoring matrix based on features, computed for each pair

17
End

18
PDB record

HEADER HYDROLASE(O-GLYCOSYL) 25-JAN-94 149L
TITLE CONSERVATION OF SOLVENT-BINDING SITES
..
REMARK 1
REMARK 1 REFERENCE 1
REMARK 1 AUTH M.MATSUMURA,W.J.BECKTEL,
..
REMARK 2 RESOLUTION. 2.60 ANGSTROMS.
REMARK 3
REMARK 3 REFINEMENT.
REMARK 3 PROGRAM TNT
REMARK 3 AUTHORS TRONRUD,TEN EYCK
...
SEQRES 1 A 164 MET ASN LEU PHE GLU MET LEU ARG
SEQRES 2 A 164 ARG LEU LYS ILE TYR LYS ASP THR
..
HELIX 1 H1 LEU A 3 GLU A 11 1 9
HELIX 2 H2 LEU A 39 ILE A 50 1 12
..
HELIX 10 H10 PRO A 143 THR A 155 1 13
SHEET 1 A 3 TYR A 18 LYS A 19 0
SHEET 2 A 3 TYR A 25 ILE A 27 -1 N THR A 26 O TYR A 18
..
SHEET 3 A 3 HIS A 31 THR A 34 -1 N HIS A 31 O ILE A 27
TURN 1 T1 ASP A 20 GLY A 23
..
ATOM 1 N MET A 1 29.360 -4.880 38.742 1.00 65.91 N
ATOM 2 CA MET A 1 29.892 -6.057 38.096 1.00 60.68 C
ATOM 3 C MET A 1 30.674 -5.673 36.863 1.00 56.33 C
..
ATOM 302 CG PRO A 37 51.531 -30.219 18.738 1.00 78.60 C
ATOM 303 CD PRO A 37 52.005 -28.775 18.641 1.00 78.61 C
ATOM 304 N SER A 38 53.483 -28.405 22.129 1.00 70.92 N
ATOM 305 CA SER A 38 54.604 -28.517 23.043 1.00 67.86 C
..
ATOM 1309 OXT LEU A 164 25.719 -18.888 43.195 1.00 25.30 O
TER 1310 LEU A 164
..
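
Structure alignment methods typically start from the Cα coordinates in the ATOM records. In an actual PDB file these records are fixed-width, so fields are read by column position rather than by splitting on whitespace. The sketch below assumes the standard column layout of the legacy PDB format (serial in columns 7-11, atom name 13-16, residue name 18-20, chain 22, residue number 23-26, x/y/z in 31-54); the sample lines are the first three ATOM records of the slide re-spaced into that layout, and the function names are illustrative.

```python
def parse_atom(line):
    """Parse one fixed-column ATOM record into a dict."""
    return {
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
    }


def ca_trace(lines):
    """Coordinates of all C-alpha atoms, in file order."""
    atoms = (parse_atom(l) for l in lines if l.startswith("ATOM"))
    return [atom["xyz"] for atom in atoms if atom["name"] == "CA"]


SAMPLE_LINES = [
    "ATOM      1  N   MET A   1      29.360  -4.880  38.742  1.00 65.91           N",
    "ATOM      2  CA  MET A   1      29.892  -6.057  38.096  1.00 60.68           C",
    "ATOM      3  C   MET A   1      30.674  -5.673  36.863  1.00 56.33           C",
]

print(parse_atom(SAMPLE_LINES[1]))
print(ca_trace(SAMPLE_LINES))
```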

19
Similarity Measures (primary structure)

two strings of amino acids
Hamming distance
sequences of equal length
number of non-identical positions
edit distance
minimal number of insert/update/delete operations needed to convert one sequence into the other
weighted edit distance
takes into account the probability of substituting one letter for another
scoring (substitution) matrices
PAM, BLOSUM, ...
different costs for opening/extending a gap
global/local alignment
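
The two unweighted sequence measures on this slide are short enough to show in full. This is a minimal sketch with unit costs for every operation; the weighted variants would replace the mismatch cost with an entry from a substitution matrix and use separate gap-opening and gap-extension costs. The example strings are one-letter codes chosen for illustration.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a, b):
    """Minimal number of insert/update/delete operations turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # keep or update
        prev = curr
    return prev[-1]


print(hamming("MNLFE", "MNIFD"))         # 2
print(edit_distance("MNLFEM", "MNFEM"))  # 1
```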