Title: Bioinformatics course
1 Homology vs. similarity
- Homology: evolutionary relation of two proteins with similar biological function that indicates common ancestry.
- Sequence similarity is observable; homology is a hypothesis based on observation.
- We want to know whether two sequences are truly homologous because this will enable us to draw conclusions about their probable structure and function.
- Sequence similarity can be global (overall sequence similarity) or local (motifs); both are used in distinguishing probable homologues.
2 Basics of Database Search
- How to find similar sequences (query sequence versus database):
  1. Use a similarity matrix (substitution matrix) to score the similarity of amino acids.
  2. Generate all possible alignments and calculate a score for each alignment.
  3. The optimal alignment is the alignment with the highest score.
- Exhaustive enumeration and scoring of all alignments is not feasible.
- The number of all possible alignments for two sequences of length n grows exponentially with n.
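The infeasibility of exhaustive enumeration can be made concrete with a small count. The sketch below assumes one particular definition of "alignment": every column is a residue pair, a gap in the first sequence, or a gap in the second, and alignments that differ only in the order of adjacent gaps are counted separately. The function name `count_alignments` is illustrative and not part of the course material.

```python
def count_alignments(n: int, m: int) -> int:
    """Count gapped alignments of two sequences of lengths n and m.

    Assumed recurrence: the last column is a pair, a gap in the first
    sequence, or a gap in the second sequence, so
    f(i, j) = f(i-1, j-1) + f(i-1, j) + f(i, j-1), with f(i, 0) = f(0, j) = 1.
    """
    prev = [1] * (m + 1)  # row for i = 0
    for _ in range(n):
        cur = [1]  # f(i, 0) = 1
        for j in range(1, m + 1):
            cur.append(prev[j - 1] + prev[j] + cur[j - 1])
        prev = cur
    return prev[m]


if __name__ == "__main__":
    for length in (1, 2, 3, 10, 100):
        print(length, count_alignments(length, length))
```

Under this definition the count for two sequences of length 100 already exceeds 10^70, which is why practical methods rely on dynamic programming or heuristics instead of enumeration.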
3 Most widely used algorithms
- Two basic types of algorithms
- Needleman-Wunsch algorithm [1, 2]
  - Global algorithm which gives an overall best-fit alignment of the entire sequences.
  - Rigorous algorithm that finds the optimal solution.
  - Requires a tremendous amount of computing power.
  - Not sensitive for highly diverged sequences.
- Smith-Waterman algorithm [3]
  - Local alignment procedure which tries to find a subsequence (or several short subsequences) of high similarity.
Ref:
[1] Needleman, S.B. & Wunsch, C.D. 1970. J. Mol. Biol. 48, 443-453.
[2] Gotoh, O. 1982. J. Mol. Biol. 162, 705-708.
[3] Smith, T.F. & Waterman, M.S. 1981. J. Mol. Biol. 147, 195-197.
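The difference between the global and the local procedure is easiest to see in the dynamic programming recurrences. The following is a minimal sketch, not the published implementations: it assumes a linear gap penalty (the `gap` parameter) and a toy match/mismatch scoring function `simple_score`, both chosen for illustration, and it returns only the optimal score, not the alignment itself.

```python
def global_score(a, b, score, gap=-2):
    """Needleman-Wunsch style optimal global alignment score."""
    prev = [j * gap for j in range(len(b) + 1)]  # aligning a prefix of b to nothing
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            cur.append(max(
                prev[j - 1] + score(a[i - 1], b[j - 1]),  # pair a[i-1] with b[j-1]
                prev[j] + gap,                            # gap in b
                cur[j - 1] + gap,                         # gap in a
            ))
        prev = cur
    return prev[-1]


def local_score(a, b, score, gap=-2):
    """Smith-Waterman style optimal local alignment score."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            cell = max(
                0,                                        # start a new local alignment
                prev[j - 1] + score(a[i - 1], b[j - 1]),
                prev[j] + gap,
                cur[j - 1] + gap,
            )
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best


def simple_score(x, y):
    """Toy scoring function (assumption): +1 for identity, -1 otherwise."""
    return 1 if x == y else -1


if __name__ == "__main__":
    print(global_score("TTTGATTACATTT", "CCCGATTACACCC", simple_score))
    print(local_score("TTTGATTACATTT", "CCCGATTACACCC", simple_score))
```

The only structural differences are the initialisation (gap penalties versus zeros), the extra `0` option in each cell, and where the final score is read (last cell versus best cell). In practice `simple_score` would be replaced by a lookup in a substitution matrix.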
4 How to derive score parameters for sequence alignments
- Substitution matrices
  - General idea: from pairwise alignments of proteins, derive the probabilities $p_{ab}$ for exchanging amino acids (or nucleotides) a and b, and compare them to the expected probability in a random model R, $p_R = q_a q_b$.
- Questions
  - (a) How to score an alignment?
  - (b) What is a statistically valid set of protein sequences?
  - (c) What is a good random model?
  - (d) Different pairs of proteins have diverged from a common ancestor by different amounts (e.g. compare homologs in human and mouse to homologs in human and E. coli).
  - (e) How to find the best alignment?
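The comparison between the observed exchange probability and the random model is usually expressed as a log-odds score. A sketch in the notation of this slide (the base of the logarithm and any scaling factor are conventions not fixed here):

```latex
s(a,b) = \log \frac{p_{ab}}{q_a \, q_b}
```

The score is positive when a and b are found aligned more often than expected by chance, zero when they occur exactly at the chance frequency, and negative otherwise.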
5 Examples of substitution matrices
- Dayhoff PAM matrices
  - Dayhoff, Schwartz, Orcutt (1978). A model of evolutionary change in proteins. In: Dayhoff, Atlas of Protein Sequence and Structure, Vol. 5, NBRF, Washington, pp. 345-352.
- BLOSUM matrices
  - Henikoff, Henikoff (1992). Amino acid substitution matrices from protein blocks. PNAS 89, 10915-10919.
- In both cases S(i,j) is determined (in a simplified way) from the exchange counts and the amino acid counts:
  - $N_{exch}(i,j)$ is the number of occurrences of amino acid i being substituted with j.
  - $N_i$ and $N_j$ are the numbers of occurrences of the individual amino acids i and j.
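The exact expression for S(i,j) is not preserved in these notes. As an illustration of how exchange counts and amino acid counts can be combined, the sketch below assumes a BLOSUM-like log-odds form, log2(observed pair frequency / expected pair frequency), and derives the residue counts from the pair counts. The function name `log_odds_scores` and the two-letter toy alphabet are hypothetical.

```python
import math
from collections import defaultdict


def log_odds_scores(pair_counts):
    """Turn unordered pair counts N_exch(i, j) into log-odds scores.

    pair_counts maps a tuple (i, j) with i <= j to a positive count.
    Assumption: residue counts N_i are taken from the pairs themselves.
    """
    total_pairs = sum(pair_counts.values())
    residue_counts = defaultdict(float)
    for (i, j), n in pair_counts.items():
        residue_counts[i] += n  # each pair contributes two residues
        residue_counts[j] += n
    total_residues = 2 * total_pairs

    scores = {}
    for (i, j), n in pair_counts.items():
        observed = n / total_pairs
        f_i = residue_counts[i] / total_residues
        f_j = residue_counts[j] / total_residues
        # An unordered pair (i, j) with i != j can arise in two orders.
        expected = f_i * f_j if i == j else 2 * f_i * f_j
        scores[(i, j)] = math.log2(observed / expected)
    return scores


if __name__ == "__main__":
    toy_counts = {("A", "A"): 40, ("A", "B"): 20, ("B", "B"): 40}  # invented data
    for pair, value in log_odds_scores(toy_counts).items():
        print(pair, round(value, 2))
```

In the toy data identical pairs are over-represented and therefore score positive, while the A/B exchange is rarer than chance and scores negative.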
6 Parameters of the substitution matrices
- Both matrices have a parameter attached; it characterizes the selection of proteins used to calculate the matrix.
- PAM n
  - Point accepted mutation: one PAM unit is equivalent to an average change of 1% of all amino acids.
  - First use closely related sequences, then extrapolate using a model of evolution.
  - PAM250 is widely used.
- BLOSUM m
  - Use only ungapped, aligned regions of protein families (BLOCKS).
  - The sequences from each block are clustered: two sequences are placed in the same cluster if their sequence identity is larger than a cutoff (m%).
  - Smaller values of m mean more diverse sequences.
  - BLOSUM62 is widely used.
7 PAM: point accepted mutation
- PAM scores are derived from alignments of closely related sequences, i.e., proteins whose function is known to be the same (hemoglobin, cytochrome c, ribosomal proteins, RNase A, ...) from many organisms. The original PAM scoring matrix was derived by Margaret Dayhoff, a pioneer in sequence analysis.
- Numbers may be expressed in terms of time-dependent probability matrices P(t). One PAM unit is the time required to achieve an average change of 1% in the amino acid positions. The original aim was to relate observed changes to the evolutionary distance between organisms, as reflected by the geological record. Thus PAM units may be expressed in millions of years of evolution.
- PAM250 describes a more diverse (more diverged) set of sequences than PAM100.
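One common way to move from closely related sequences to larger PAM distances is to treat the 1-PAM mutation probability matrix as a Markov transition matrix and raise it to the n-th power. The sketch below illustrates only that extrapolation step on an invented two-letter alphabet; `PAM1_TOY`, `mat_mul` and `mat_pow` are hypothetical names, and a real PAM matrix is 20 x 20 and is converted to log odds afterwards.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def mat_pow(m, n):
    """Raise a square matrix to the non-negative integer power n."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    base = [row[:] for row in m]
    while n > 0:
        if n & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        n >>= 1
    return result


# Invented 1-PAM matrix: each residue has a 1% chance of changing per PAM unit.
PAM1_TOY = [[0.99, 0.01],
            [0.01, 0.99]]

if __name__ == "__main__":
    for n in (1, 100, 250):
        print(n, [round(p, 4) for p in mat_pow(PAM1_TOY, n)[0]])
```

After many PAM units the rows approach the background frequencies, which is why high-numbered PAM matrices are the ones suited to distantly related sequences.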
8 PAM250 substitution matrix expressed as log odds
9 Similarity score is the sum of the matrix elements in an alignment
- Residue pairs with scores above 0 replace each other more often in related sequences than in random sequences. This is an indication that both residues can carry out similar functions (similar size, hydrogen bonding, etc.). A score exactly equal to zero indicates amino acid pairs that are found as alternatives at exactly the frequency predicted by chance. Residue pairs with scores less than 0 replace each other less often than in random sequences, which might indicate that these residues are not functionally equivalent.
- The score of an alignment is calculated as the sum of the substitution matrix elements in this alignment. This can be derived by assuming that all positions in a protein sequence are independent. Then the odds for the alignment are given by the product of the odds for the individual aligned pairs. In practice one takes logarithms and deals with sums rather than products.
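Written out, the step from products to sums looks as follows. This is a sketch using the symbols $p_{ab}$ and $q_a$ from the slide on score parameters; the label M for the model of related sequences and the index k over alignment columns are notational assumptions.

```latex
\frac{P(x, y \mid M)}{P(x, y \mid R)}
  = \prod_{k} \frac{p_{x_k y_k}}{q_{x_k} \, q_{y_k}}
\quad\Longrightarrow\quad
S = \log \frac{P(x, y \mid M)}{P(x, y \mid R)}
  = \sum_{k} s(x_k, y_k),
\qquad
s(a,b) = \log \frac{p_{ab}}{q_a \, q_b}
```

Each $s(a,b)$ is one element of the log-odds substitution matrix, so the alignment score is obtained by adding up matrix entries column by column.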
10 Example
- Scoring the similarity between two sequences with PAM250
- For example, two sequences from the EGF domain of rabbit and pig fertilin:
  QNCNN
  EKCHN
- S = 2 + 1 + 12 + 2 + 2 = 19
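The same sum can be reproduced in a few lines. This sketch hard-codes only the five PAM250 entries used in the example (Q/E = 2, N/K = 1, C/C = 12, N/H = 2, N/N = 2); the names `PAM250_SUBSET` and `alignment_score` are illustrative, and a real application would load the full 20 x 20 matrix.

```python
# Only the PAM250 entries needed for this example; keys are unordered pairs.
PAM250_SUBSET = {
    frozenset("QE"): 2,
    frozenset("NK"): 1,
    frozenset("C"): 12,   # C aligned with C
    frozenset("NH"): 2,
    frozenset("N"): 2,    # N aligned with N
}


def alignment_score(seq_a, seq_b, matrix):
    """Sum the substitution scores over the columns of an ungapped alignment."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    return sum(matrix[frozenset((a, b))] for a, b in zip(seq_a, seq_b))


if __name__ == "__main__":
    print(alignment_score("QNCNN", "EKCHN", PAM250_SUBSET))  # prints 19
```

Using `frozenset` keys makes the lookup symmetric, so Q over E and E over Q give the same score, as in a symmetric substitution matrix.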
11 BLOSUM: the BLOcks SUbstitution Matrix
- Scores are derived from alignments of distantly related sequences, without regard to function.
- Should give a better substitution matrix for more distantly related sequences than the PAM matrices. Also, because PAM is limited to proteins of known function for its derivation, more sequences contribute to the BLOSUM numbers (better statistics).
- The sequence alignments are taken from the BLOCKS database; the number in the matrix name is the cutoff value for the diversity of the sequences.
- BLOSUM62 (sequences more than 62% identical are clustered together) is drawn from a less diverse sequence alignment than BLOSUM35 (where sequences more than 35% identical are clustered together).
12 Using the BLOSUM62 log odds matrix to score an alignment: add the numbers in the matrix
If you have a D above a W in an alignment, the score is -4. For F to W, the score is 1. Every time F matches F, or D matches D, add 6.
13 Other scoring matrices
- Gaston Gonnet and coworkers derived a matrix much like PAM250 by using pairwise alignments of all the sequences known in 1992, in an iterative fashion starting with alignments based on PAM250. They noted that their results were different when they used closely related sequence alignments vs. more distantly related ones.
- Identity matrix: sort of the original, but only useful if it is scored according to the frequency of occurrence of amino acids in the database.
14 Concepts of protein structure prediction
- Why is there a need for protein structure prediction?
  - The sequence of a protein is easily available.
  - The determination of 3D structures is still a slow process.
- Energy-based methods
  - The free energy of the protein in the native state is minimal (Anfinsen experiment).
  - Ab initio structure prediction is still an unsolved problem: the holy grail of computational biology.
- Knowledge-based methods
  - Parameters are extracted from currently known 3D structures.
  - Examples:
    - secondary structure prediction
    - fold recognition methods (threading)
  - Knowledge-based force field terms are added to the free energy term.
15 EXPLOSION OF GENOMIC DATA
- Gene sequences from genome projects far outnumber the experimentally determined 3D structures.
- Prediction of 3D structures of proteins is a necessity.
16 Tertiary structure prediction methods
- Search for similar gene or protein sequences
- Align sequences to find common motifs that correlate with structure and function
- Predict the 3D structure and function of a new protein by
  - Ab initio structure prediction
  - Comparative homology modeling
  - Protein fold recognition (protein threading)
17 Tertiary Structure Prediction
- Ab initio modeling: from sequence to 3D structure
- Two approaches
  - (1) purely energy based
  - (2) prediction of secondary structure and short- and long-range contacts
- Approach (1) uses a force field (molecular mechanics) method and molecular dynamics (or other global minimization algorithms) to find the native state of a protein.
- Molecular mechanics computes the molecular energy based on classical (Newtonian) mechanics and considers molecules as atoms bonded with elastic bonds.
- Molecular Energy = Bond Energy + Angle Energy + Torsion Energy + Electrostatic Energy + Hydrogen Bond Energy + Solvation Energy + SS Bridge Energy
18 Energy terms (see handouts)
- Bond energy (bond length, bond angle)
- Torsion angle energy (torsion motions around bonds, improper dihedrals)
- Electrostatics
- Van der Waals
- Hydrogen bond
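The energy terms listed above can be illustrated with the functional forms used by many classical force fields. The specific forms (harmonic bonds, cosine torsions, Coulomb electrostatics, 12-6 Lennard-Jones for Van der Waals), the units and all parameter names below are assumptions for illustration; the course handouts may define the terms differently.

```python
import math

COULOMB_CONSTANT = 332.06  # kcal * Angstrom / (mol * e^2), commonly used value


def bond_energy(r, r0, k):
    """Harmonic bond stretch: zero at the equilibrium length r0."""
    return k * (r - r0) ** 2


def torsion_energy(phi, barrier, periodicity, phase=0.0):
    """Cosine torsion term; phi and phase in radians."""
    return 0.5 * barrier * (1.0 + math.cos(periodicity * phi - phase))


def coulomb_energy(q1, q2, r, dielectric=1.0):
    """Electrostatic energy of two point charges (in units of e) at distance r."""
    return COULOMB_CONSTANT * q1 * q2 / (dielectric * r)


def lennard_jones(r, epsilon, sigma):
    """12-6 Lennard-Jones term for Van der Waals interactions."""
    ratio6 = (sigma / r) ** 6
    return 4.0 * epsilon * (ratio6 ** 2 - ratio6)


if __name__ == "__main__":
    # Invented parameters, only to show how the terms add up.
    total = (bond_energy(1.54, 1.53, 300.0)
             + torsion_energy(math.radians(60.0), 2.0, 3)
             + coulomb_energy(0.4, -0.4, 3.0)
             + lennard_jones(3.8, 0.1, 3.4))
    print(round(total, 3))
```

The molecular energy is the sum of such terms over all bonds, angles, torsions and atom pairs; energy minimization or molecular dynamics then searches for conformations that lower this sum.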
19 (2) Prediction of secondary structure and long-range contacts
- Secondary structure prediction: derive propensity values of residues from statistical analysis of residues in known secondary structures.
- More sophisticated methods: neural networks, combined prediction from MSA and HMM.
- Long-range contacts
  - Tree-determinant residues
  - Motifs
  - Correlated mutations
20 Comparative Homology Modeling
- 3D structures of proteins come in families and superfamilies
  - E.g. SCOP: http://scop.mrc-lmb.cam.ac.uk/scop/
  - Families: high sequence identities (> 35%), same functional residues
  - Superfamilies: similar 3D fold, some common functional motifs
  - No universal definition of superfamilies
  - Folds: similar 3D fold
- Rule of thumb: if two proteins have an alignment with a sequence identity > 30%, they have the same fold.
- More sophisticated methods for fold recognition: 3D profiles or threading
- Steps
  - for a target sequence, find a homologous PDB template structure,
  - make an optimum alignment between the target and template sequences,
  - generate the tertiary structure of the target using the template geometry.
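The rule of thumb above needs a sequence identity computed from an alignment. A minimal sketch, assuming identity is measured over the columns in which neither sequence has a gap (other denominators are also in use) and that gaps are written as "-"; the function names are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identical residues over the gap-free columns of an alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != gap and b != gap]
    if not columns:
        return 0.0
    matches = sum(a == b for a, b in columns)
    return 100.0 * matches / len(columns)


def passes_rule_of_thumb(aligned_a, aligned_b, threshold=30.0):
    """Apply the > 30% identity rule of thumb for sharing a fold."""
    return percent_identity(aligned_a, aligned_b) > threshold


if __name__ == "__main__":
    target = "QNCNN"
    template = "EKCHN"
    print(percent_identity(target, template), passes_rule_of_thumb(target, template))
```

The five-residue example is only there to exercise the code; identity values from very short alignments are not a reliable indication of a shared fold.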
21 Additional considerations
- What is the secondary structure?
- Is it homologous to other protein sequences?
- Is it homologous to other protein structures?
- What is the best sequence alignment between your target protein and homologous PDB structures?
- Examine the regions of insertions and deletions. Are they located in the loop regions? On the surface? Is the region hydrophobic or hydrophilic?
- The PDB template might have functional sites and established motifs. Does your target sequence have the same features?
- If disulphide bridges are present in the PDB template, are the cysteine residues aligned?
22 Basics of Secondary Structure Prediction
- Propensity values for secondary structures of amino acids
- Statistical analysis of the occurrence of amino acids in regular secondary structures in a database of representative proteins
- Assignment of secondary structure
- X denotes one of the 20 amino acids, X = Ala, Val, ...
- $n_{\alpha X}$: number of amino acids X in α-helical regions
- $N_{\alpha}$: number of all amino acids in α-helical regions
- $N_X$: number of amino acids X in the database
- $N_{tot}$: total number of all amino acids in the database
- Frequency of amino acid X in α-helical regions: $f_{\alpha X} = n_{\alpha X} / N_X$
- Average frequency of α-helices in all proteins: $\langle f_{\alpha} \rangle = N_{\alpha} / N_{tot}$
- Propensity values: $P_{\alpha X} = f_{\alpha X} / \langle f_{\alpha} \rangle$
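As a worked illustration of the counts defined above, the sketch below computes helix propensities from a toy database. It assumes the propensity of amino acid X is the fraction of X found in helices divided by the overall helix fraction, i.e. (n_αX / N_X) / (N_α / N_tot), and that secondary structure is given as a string with "H" marking helical residues. The function name and the example data are invented.

```python
from collections import Counter


def helix_propensities(sequences, structures, helix_symbol="H"):
    """Return {amino acid: helix propensity} from paired sequence/structure strings."""
    n_x = Counter()        # N_X: occurrences of each amino acid
    n_alpha_x = Counter()  # n_alphaX: occurrences of each amino acid in helices
    n_alpha = 0            # N_alpha: all residues in helices
    n_tot = 0              # N_tot: all residues
    for seq, ss in zip(sequences, structures):
        if len(seq) != len(ss):
            raise ValueError("sequence and structure strings must match in length")
        for aa, state in zip(seq, ss):
            n_x[aa] += 1
            n_tot += 1
            if state == helix_symbol:
                n_alpha_x[aa] += 1
                n_alpha += 1
    if n_alpha == 0:
        raise ValueError("no helical residues in the database")
    average_helix_frequency = n_alpha / n_tot
    return {aa: (n_alpha_x[aa] / n_x[aa]) / average_helix_frequency for aa in n_x}


if __name__ == "__main__":
    toy_sequences = ["AAGA", "GGAG"]    # invented data
    toy_structures = ["HHCH", "CCHC"]   # H = helix, C = other
    print(helix_propensities(toy_sequences, toy_structures))
```

A propensity above 1 marks a residue that is enriched in helices (a helix former); a value well below 1 marks a residue that tends to avoid them.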
23 Methods for prediction
- Classic method (Chou & Fasman, 1985)
  - Simplified rules:
    - Separate amino acids into groups of helix (β-strand) formers and breakers.
    - Search for clusters of formers (four helix formers out of six contiguous residues; three β-strand formers out of five residues).
    - Extend the segments in both directions until a tetrapeptide of breakers is found.
  - Later improvements
- Garnier-Osguthorpe-Robson (GOR) method
  - The influence of the residue at position j on the secondary structure in the neighborhood of the residue is included.
  - The main effect is statistically found in the range j-8 < i < j+8.
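The nucleation step of the classic method (four helix formers out of six contiguous residues) can be sketched as a sliding window. The set of helix formers below is an assumption chosen for illustration, not the classification from the lecture, and the extension step up to a tetrapeptide of breakers is deliberately left out.

```python
# Illustrative set of helix formers (assumption; real tables are derived
# from propensity values).
HELIX_FORMERS = frozenset("EAMLKQ")


def helix_nuclei(sequence, formers=HELIX_FORMERS, window=6, min_formers=4):
    """Return start indices of windows containing enough helix formers."""
    starts = []
    for i in range(len(sequence) - window + 1):
        segment = sequence[i:i + window]
        if sum(aa in formers for aa in segment) >= min_formers:
            starts.append(i)
    return starts


if __name__ == "__main__":
    print(helix_nuclei("GGGAEALKGGG"))
```

For β-strands the same function could be called with a set of strand formers, `window=5` and `min_formers=3`, following the rule quoted on the slide.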
24 Improvements of the methods
- Major improvements
  - larger databases
  - multiple sequence alignments
  - neural network methods
  - consensus prediction
  - meta servers
25 Secondary Structure Prediction Servers
- APSSP2: www.imtech.res.in/raghava/apssp2/
  - Advanced Protein Secondary Structure Prediction Server, GPS Raghava, Bioinformatics Center, Chandigarh
- PSIPRED: bioinf.cs.ucl.ac.uk/psipred/index.html
  - The PSIPRED Protein Structure Prediction Server, D. T. Jones, Department of Computer Science, University College London, UK
- PROF: www.aber.ac.uk/phiwww/prof/
  - University of Wales, Aberystwyth, Computational Biology Group
- PredictProtein: cubic.bioc.columbia.edu/predictprotein/
  - The PredictProtein server, B. Rost, Columbia University, NY
- SAM-T02sec: www.cse.ucsc.edu/research/compbio/HMM-apps/T02-query.html
  - HMM methods, K. Karplus, UCSC
- JPRED: www.compbio.dundee.ac.uk/www-jpred/
  - A consensus method for protein secondary structure prediction
  - G. Barton, University of Dundee