Methods of Protein Structure Alignment - PowerPoint PPT Presentation


1
Methods of Protein Structure Alignment
  • David Hoksza
  • Charles University in Prague Department of
    Software Engineering Czech Republic

2
Presentation Outline
  • Biological background
  • Protein databases
  • Protein structure
  • Similarity measures
  • Algorithms

3
Terminology
  • DNA (deoxyribonucleic acid)
  • sequence of nucleotides (A, C, G, T)
  • double-helix
  • RNA (ribonucleic acid)
  • single-stranded sequence of nucleotides (A, C, G, U)
  • messenger RNA (mRNA)
  • transfer RNA (tRNA)
  • ribosomal RNA (rRNA)
  • Proteins
  • molecules
  • translated from mRNA in ribosomes
  • sequence of amino acids (20 AAs)
  • each amino acid is encoded by a codon (triplet of nucleotides)
  • genetic code
  • DNA → RNA → protein (central dogma: transcription, then translation)

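
The DNA → RNA → protein flow on this slide can be illustrated with a minimal Python sketch. The codon table below is deliberately partial (a handful of the 64 codons of the standard genetic code), and the function names `transcribe` and `translate` are illustrative, not taken from the slides.

```python
# Partial codon table (standard genetic code); only a few of the 64 codons.
CODON_TABLE = {
    "AUG": "Met", "UUU": "Phe", "UUC": "Phe", "GGU": "Gly", "GGC": "Gly",
    "AAA": "Lys", "GAA": "Glu", "UGG": "Trp",
    "UAA": None, "UAG": None, "UGA": None,  # stop codons
}


def transcribe(coding_dna):
    """Simplified transcription: coding-strand DNA to mRNA (T becomes U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """Read codons from the first AUG until a stop codon is reached."""
    start = mrna.find("AUG")
    if start < 0:
        return []
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        if codon not in CODON_TABLE:
            raise KeyError(f"codon {codon} is not in the partial table")
        amino_acid = CODON_TABLE[codon]
        if amino_acid is None:  # stop codon ends translation
            break
        protein.append(amino_acid)
    return protein


mrna = transcribe("ATGTTTGGCAAATGGTAA")
print(mrna, translate(mrna))
```

A full implementation would carry all 64 codons; the partial table is only there to keep the sketch short.
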
4
Protein Similarity
  • Interaction of proteins determines biological
    functions
  • The function of a protein is derived from its
    three-dimensional structure
  • similar proteins (many common amino acids in
    corresponding positions) have similar structure
  • → similar proteins have similar functions
  • similar proteins have a common ancestor
  • Identifying protein sequence
  • → finding similar proteins
  • → getting a clue to the function

5
Protein Databases
  • Finding similar proteins
  • even among different species
  • Prominent non-structural databases
  • GenBank
  • EMBL (European Molecular Biology Laboratory Data Library)
  • DDBJ (DNA Data Bank of Japan)
  • UniProt
  • Swiss-Prot + TrEMBL (translated EMBL) + PIR
    (Protein Information Resource)
  • Prominent structural databases
  • PDB (Protein Data Bank)
  • SCOP (Structural Classification of Proteins)
  • ASTRAL Compendium (for Sequence and Structure
    Analysis)

(Slide annotations: not moderated; moderated)
6
Databases Growth
7
Databases Growth (PDB)
8
Levels of complexity of protein structure
  • Primary structure
  • linear sequences of amino acids
  • Secondary structure
  • local three-dimensional segments which are folded
    into specific repeated structures
  • alpha helices, beta sheets (strands)
  • Tertiary structure
  • the atomic coordinates - spatial relations among
    the secondary structure elements
  • Quaternary structure
  • multiple polypeptide chains

9
Protein structure
  • Amino acids differ in their side chains
    (R-groups)
  • Connection: peptide bonds
  • Protein sequence → sequence of rigid planes
  • Degrees of freedom
  • Planes
  • R-groups
  • 3D conformation described by dihedral angles
  • Usually only α-carbons (Cα) are considered
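
Since the 3D conformation is described by dihedral angles, a short sketch of how one such angle is computed from four consecutive points may help. It uses the common atan2 formulation over the three bond vectors; the function names are illustrative and the points in the usage lines are made up, not taken from a real structure.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral_degrees(p0, p1, p2, p3):
    """Signed angle between plane (p0, p1, p2) and plane (p1, p2, p3)."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1 = _cross(b1, b2)  # normal of the first plane
    n2 = _cross(b2, b3)  # normal of the second plane
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


# Made-up points: the fourth point is rotated a quarter turn about the z axis.
print(dihedral_degrees((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```

For backbone φ and ψ the four points would be consecutive backbone atoms; applied to four consecutive α-carbons the same function gives a Cα pseudo-dihedral.
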

10
Protein structure cont.
  • SSE: Secondary Structure Elements
  • Repetitive structures arising from H bonds
  • Alpha helices
  • the i-th amino acid is H-bonded to the (i+4)-th
    amino acid
  • φ and ψ angles are constant
  • 3.6 peptide units per turn
  • Beta sheets (strands)
  • multiple strands connected to each other by H
    bonds
  • parallel/antiparallel
  • Motifs
  • Combinations (second form) of SSEs
  • beta ribbon, beta-barrel, beta-hairpin,
    helix-loop-helix, Greek key, ...

11
Similarity measures
  • RMSD: Root Mean Square Deviation/Difference/Distance
  • Summarizes partial distances of aligned residue
    pairs
  • Evaluates quality of a matching (superposition)
  • cRMSD (coordinate RMSD)
  • inter-residue: distances between matched residues
    of the two superimposed structures
  • dRMSD (distance RMSD)
  • intra-residue: differences between corresponding
    residue-residue distances within each structure
  • elastic similarity score
  • disregarding outliers
  • fragmented dRMSD
  • aimed at recognizing similar substructures
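
The two RMSD variants above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: `crmsd` expects the two structures to be already superimposed (no optimal rotation is computed here), and `drmsd` averages over the unordered residue pairs i < j, which is one of several normalizations in use. The function names are illustrative.

```python
import math


def crmsd(coords_a, coords_b):
    """RMSD over matched positions; assumes A and B are already superimposed."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty, equally long coordinate lists")
    total = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))


def drmsd(coords_a, coords_b):
    """RMSD between the two intra-structure distance matrices (pairs i < j)."""
    n = len(coords_a)
    if n != len(coords_b) or n < 2:
        raise ValueError("need two equally long lists with at least 2 points")
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            diff = (math.dist(coords_a[i], coords_a[j])
                    - math.dist(coords_b[i], coords_b[j]))
            total += diff * diff
    return math.sqrt(total / (n * (n - 1) / 2))


a = [(0, 0, 0), (1, 0, 0), (0, 1, 0)]
b = [(5, 5, 5), (6, 5, 5), (5, 6, 5)]  # same shape as a, only translated
print(crmsd(a, b), drmsd(a, b))
```

The usage lines show the practical difference: a pure translation leaves dRMSD at zero because internal distances are unchanged, whereas cRMSD is large until the structures are superimposed.
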

12
Algorithms
  • Goals
  • Alignment
  • direct similarity
  • Classification
  • indirect similarity
  • Methods
  • Incremental extension
  • extending initial partial alignment
  • Dynamic programming
  • dynamic programming matrix of (usually) distances
  • Indexing
  • extracted features are indexed by
  • trees
  • geometric hashing

13
Algorithms - Incremental Extension
  • DALI
  • Elastic similarity score
  • Matrix of inter-residue distances
  • Similar proteins → similar inter-residue
    distances → similar distance matrices
  • Contact pattern (CP): submatrix of fixed size
    (hexapeptides)
  • Similar pairs of CPs are stored and one is used
    as a seed
  • Monte Carlo optimization is used to extend the
    alignment built so far
  • CE
  • AFP: Aligned Fragment Pair (constant-length
    portions of local structure)
  • Fragmented dRMSD
  • Joining of AFPs based on three different distance
    measures
  • Several paths are computed and the best one is
    optimized by Smith-Waterman (on the distance
    matrix)
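
To make the DALI bullets above more concrete, here is a hedged Python sketch of an elastic similarity score computed from two intra-structure distance matrices. It assumes the commonly cited form of the score, (θ - |dA - dB| / d*) · exp(-(d*/α)²) with d* the mean of the two distances, and the defaults θ = 0.2 and α = 20 Å; neither the formula nor the constants appear on the slides, so treat them as assumptions. The seeding and Monte Carlo extension steps are not shown.

```python
import math


def distance_matrix(coords):
    """Intra-structure matrix of pairwise distances (e.g. between C-alphas)."""
    return [[math.dist(p, q) for q in coords] for p in coords]


def elastic_score(dist_a, dist_b, theta=0.2, alpha=20.0):
    """Sum of elastic terms over all aligned residue pairs (i, j).

    dist_a and dist_b must be equally sized and already in aligned order
    (residue i of A is matched with residue i of B). theta and alpha are
    assumed defaults, not values given in the slides.
    """
    n = len(dist_a)
    total = 0.0
    for i in range(n):
        for j in range(n):
            mean = 0.5 * (dist_a[i][j] + dist_b[i][j])
            if i == j or mean == 0.0:
                total += theta  # diagonal term: a residue matched with itself
                continue
            deviation = abs(dist_a[i][j] - dist_b[i][j]) / mean
            envelope = math.exp(-((mean / alpha) ** 2))  # down-weights far pairs
            total += (theta - deviation) * envelope
    return total


straight = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
bent = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]
print(elastic_score(distance_matrix(straight), distance_matrix(straight)))
print(elastic_score(distance_matrix(straight), distance_matrix(bent)))
```

Because the deviation is relative to the mean distance, the score tolerates larger absolute differences between distant residue pairs, which is the "elastic" behaviour the slide refers to.
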

14
Algorithms - Dynamic Programming
  • SAP
  • Double dynamic programming
  • View: vector of distances to other residues
  • Between pairs of views, optimal alignments
    (Smith-Waterman) are computed, which are used to
    fill the final DP matrix → final alignment
  • PROSUP
  • Identification of seed fragments
  • Expand seed fragments to initial alignments
  • Apply DP (Needleman-Wunsch) to refine initial
    alignments
  • Evaluate refined alignments
  • STRUCTAL
  • Evaluated as the best structural alignment algorithm in
    Kolodny, R., Koehl, P., Levitt, M. Comprehensive
    evaluation of protein structure alignment
    methods: scoring by geometric measures. J. Mol.
    Biol. 346 (2005), 1173-88.
  • Initial alignment based on several rules (match
    beginning of structures, match ends, match pairs
    based on sequence alignment, ...)
  • Refine the alignments by DP (Needleman-Wunsch)
  • Exposure weighting
  • Position dependent gap penalties
  • TALI
  • Incorporates torsion angles → mutual distances
    form the DP matrix → Smith-Waterman
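
Several methods on this slide refine an alignment with Needleman-Wunsch. The sketch below shows global dynamic programming over a residue-by-residue similarity matrix and a traceback that returns the matched index pairs. How each method fills that matrix (distances, views, torsion angles) is method-specific and not shown; the linear gap penalty and the toy feature values are simplifying assumptions.

```python
def needleman_wunsch(score, gap=-1.0):
    """Global alignment over score[i][j] with a linear gap penalty.

    Returns (best total score, list of matched (i, j) index pairs).
    """
    n = len(score)
    m = len(score[0]) if n else 0
    # best[i][j]: best score aligning the first i residues of A with the first j of B
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = i * gap
    for j in range(1, m + 1):
        best[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(best[i - 1][j - 1] + score[i - 1][j - 1],
                             best[i - 1][j] + gap,
                             best[i][j - 1] + gap)
    # Traceback from the bottom-right corner.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if best[i][j] == best[i - 1][j - 1] + score[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif best[i][j] == best[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best[n][m], pairs


# Toy per-residue features; similarity is 2 for equal features, -1 otherwise.
features_a = [1, 2, 3, 4]
features_b = [1, 3, 4]
similarity = [[2 if a == b else -1 for b in features_b] for a in features_a]
print(needleman_wunsch(similarity))
```

Smith-Waterman, used by other methods on the slides, differs mainly in clamping cell values at zero and starting the traceback from the highest-scoring cell, which yields a local rather than global alignment.
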

15
Algorithms - Indexing - Trees
  • PSIST
  • Features
  • sliding window of size w
  • distance
  • angle
  • Suffix-tree
  • PSI
  • Representation
  • SSE approximation
  • amino acids → centers of mass
  • defining local neighborhood for each SSE
  • features (9 dimensions)
  • distances in triplets
  • angles between pairs in triplets
  • R-tree

16
Algorithms - Indexing - Hashing
  • ProGreSS
  • Combines structure and sequence
  • Features (sliding window of size w)
  • structure
  • curvature (dim w)
  • torsion angles (dim w)
  • Haar wavelet transformation → dim d_t
  • normalization → [0,1]^d_t space
  • sequence
  • row of a scoring matrix (PAM, BLOSUM, ...) →
    scoring vector → chaining scoring vectors → dim
    20w
  • Haar wavelet transformation → dim d_q
  • normalization → [0,1]^d_q space
  • SSE type
  • CTSS
  • Theory of differential geometry
  • Structure 3D spline
  • Features
  • torsion angle
  • curvature
  • SSE type
  • Pairwise comparison
  • Smith-Waterman
  • scoring matrix
  • based on features
  • For each pair
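
The ProGreSS feature pipeline on this slide (window features, Haar wavelet transformation, normalization into a unit cube) can be illustrated with a generic sketch. It is not the ProGreSS implementation: it assumes a window length that is a power of two, an orthonormal Haar decomposition, dimensionality reduction by keeping the first (coarsest) coefficients, and min-max normalization per dimension. All names and the sample values are illustrative.

```python
import math


def haar_transform(values):
    """Full orthonormal Haar decomposition; length must be a power of two."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    coeffs = [float(v) for v in values]
    length = n
    root2 = math.sqrt(2.0)
    while length > 1:
        half = length // 2
        sums = [(coeffs[2 * i] + coeffs[2 * i + 1]) / root2 for i in range(half)]
        diffs = [(coeffs[2 * i] - coeffs[2 * i + 1]) / root2 for i in range(half)]
        coeffs[:length] = sums + diffs  # coarse part first, details after it
        length = half
    return coeffs


def reduce_dimension(values, dims):
    """Keep only the first (coarsest) dims Haar coefficients."""
    return haar_transform(values)[:dims]


def normalize_unit_cube(vectors):
    """Min-max scale every dimension of a set of vectors into [0, 1]."""
    dims = len(vectors[0])
    low = [min(v[k] for v in vectors) for k in range(dims)]
    high = [max(v[k] for v in vectors) for k in range(dims)]
    return [[(v[k] - low[k]) / (high[k] - low[k]) if high[k] > low[k] else 0.0
             for k in range(dims)]
            for v in vectors]


# Made-up curvature-like window features (window size w = 4), reduced to 2 dims.
windows = [[0.1, 0.2, 0.2, 0.4], [0.9, 0.8, 0.7, 0.9], [0.5, 0.5, 0.1, 0.3]]
reduced = [reduce_dimension(w, 2) for w in windows]
print(normalize_unit_cube(reduced))
```

Keeping the leading coefficients preserves the coarse shape of each window while shrinking the vector that has to be indexed.
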

17
End
18
PDB record
  • HEADER HYDROLASE(O-GLYCOSYL) 25-JAN-94 149L
  • TITLE CONSERVATION OF SOLVENT-BINDING SITES
  • ..
  • REMARK 1
  • REMARK 1 REFERENCE 1
  • REMARK 1 AUTH M.MATSUMURA,W.J.BECKTEL,
  • ..
  • REMARK 2 RESOLUTION. 2.60 ANGSTROMS.
  • REMARK 3
  • REMARK 3 REFINEMENT.
  • REMARK 3 PROGRAM TNT
  • REMARK 3 AUTHORS TRONRUD,TEN EYCK
  • ...
  • SEQRES 1 A 164 MET ASN LEU PHE GLU MET LEU ARG
  • SEQRES 2 A 164 ARG LEU LYS ILE TYR LYS ASP THR
  • ..
  • HELIX 1 H1 LEU A 3 GLU A 11 1 9
  • HELIX 2 H2 LEU A 39 ILE A 50 1 12
  • ..
  • HELIX 10 H10 PRO A 143 THR A 155 1 13
  • SHEET 1 A 3 TYR A 18 LYS A 19 0
  • SHEET 2 A 3 TYR A 25 ILE A 27 -1 N THR A 26 O TYR A 18
  • ..
  • SHEET 3 A 3 HIS A 31 THR A 34 -1 N HIS A 31 O ILE A 27
  • TURN 1 T1 ASP A 20 GLY A 23
  • ..
  • ATOM 1 N MET A 1 29.360 -4.880 38.742 1.00 65.91 N
  • ATOM 2 CA MET A 1 29.892 -6.057 38.096 1.00 60.68 C
  • ATOM 3 C MET A 1 30.674 -5.673 36.863 1.00 56.33 C
  • ..
  • ATOM 302 CG PRO A 37 51.531 -30.219 18.738 1.00 78.60 C
  • ATOM 303 CD PRO A 37 52.005 -28.775 18.641 1.00 78.61 C
  • ATOM 304 N SER A 38 53.483 -28.405 22.129 1.00 70.92 N
  • ATOM 305 CA SER A 38 54.604 -28.517 23.043 1.00 67.86 C
  • ..
  • ATOM 1309 OXT LEU A 164 25.719 -18.888 43.195 1.00 25.30 O
  • TER 1310 LEU A 164
  • ..
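
Since most alignment methods work on α-carbon coordinates, a small Python sketch of pulling them out of ATOM records may be useful. PDB files are fixed-column text, so the parser slices by column position rather than splitting on whitespace. The `format_atom` helper is hypothetical: it exists only to build correctly padded sample lines (with a minimal set of fields), because the spacing of the record shown on this slide did not survive transcription.

```python
def format_atom(serial, name, res_name, chain, res_seq, x, y, z):
    """Hypothetical helper: build a fixed-width ATOM line up to the z column."""
    return (f"ATOM  {serial:5d}  {name:<3s} {res_name:>3s} {chain}{res_seq:4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")


def parse_ca_atoms(lines):
    """Return (residue name, chain, residue number, (x, y, z)) per C-alpha."""
    atoms = []
    for line in lines:
        # Columns 13-16 hold the atom name (0-based slice 12:16).
        if line[:6] == "ATOM  " and line[12:16].strip() == "CA":
            atoms.append((
                line[17:20].strip(),   # residue name, columns 18-20
                line[21],              # chain identifier, column 22
                int(line[22:26]),      # residue sequence number, columns 23-26
                (float(line[30:38]),   # x, columns 31-38
                 float(line[38:46]),   # y, columns 39-46
                 float(line[46:54])),  # z, columns 47-54
            ))
    return atoms


# Coordinates copied from the ATOM lines of the record above.
sample = [
    format_atom(1, "N", "MET", "A", 1, 29.360, -4.880, 38.742),
    format_atom(2, "CA", "MET", "A", 1, 29.892, -6.057, 38.096),
    format_atom(305, "CA", "SER", "A", 38, 54.604, -28.517, 23.043),
]
for atom in parse_ca_atoms(sample):
    print(atom)
```

For real files an established parser is usually the safer choice; the slicing above only illustrates which columns carry the coordinates.
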

19
Similarity Measures (primary structure)
  • two strings of amino-acids
  • Hamming distance
  • sequences of equal length
  • number of non-identical positions
  • edit distance
  • minimal number of operations insert/update/delete
    to convert one sequence to the other
  • weighted edit distance
  • takes into account probability of updating one
    letter to the other
  • scoring (substitution) matrices
  • PAM, BLOSUM, ...
  • different costs for opening/extending a gap
  • global/local alignment
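
The first two sequence measures on this slide are easy to state in code. The sketch below implements Hamming distance and unit-cost edit distance in Python; the weighted variant would replace the constant substitution cost with a lookup in a scoring matrix such as PAM or BLOSUM, which is not shown here. The sequences in the usage lines are made up.

```python
def hamming_distance(a, b):
    """Number of positions at which two equally long sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(1 for x, y in zip(a, b) if x != y)


def edit_distance(a, b):
    """Minimal number of insert/update/delete operations turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # update x to y (free if equal)
            ))
        previous = current
    return previous[-1]


print(hamming_distance("MNLF", "MNIF"))
print(edit_distance("MNLFEM", "MNFEML"))
```

Only two rows of the dynamic programming table are kept, since each row depends only on the one before it.
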