Bioinformatics course - PowerPoint PPT Presentation

Provided by: werner6

1
Homology vs. similarity

Homology: evolutionary relation of two proteins
with similar biological function that indicates
common ancestry.
Sequence similarity is observable; homology is a
hypothesis based on observation.
We want to know whether two sequences are truly
homologous because this will enable us to draw
conclusions about their probable structure and
function.
Sequence similarity can be global (overall
sequence similarity) or local (motifs); both help
in distinguishing probable homologues.

2
Basics of Database Search

How to find similar sequences (query sequence
versus database)
1. Use a similarity matrix (substitution
matrix) to score the similarity of amino acids.
2. Generate all possible alignments and
calculate a score for each alignment
3. The optimal alignment is the alignment
with the highest score.
Exhaustive enumeration and scoring of all
alignments is not feasible: the number of
possible alignments for two sequences of length n
grows exponentially with n.
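
To get a feeling for why exhaustive enumeration fails, the sketch below evaluates one commonly quoted estimate for the number of global alignments of two sequences of length n: the binomial coefficient C(2n, n), roughly 2^(2n) / sqrt(pi n). That this is the formula shown on the slide is an assumption, and the function names are illustrative.

```python
import math


def count_alignments(n: int) -> int:
    """Binomial coefficient C(2n, n), a commonly quoted count of global
    alignments of two length-n sequences (assumed estimate)."""
    return math.comb(2 * n, n)


def stirling_estimate(n: int) -> float:
    """Stirling approximation 2^(2n) / sqrt(pi * n) of C(2n, n)."""
    return 2.0 ** (2 * n) / math.sqrt(math.pi * n)


if __name__ == "__main__":
    # The count explodes long before n reaches typical protein lengths.
    for n in (10, 50, 100, 300):
        print(n, f"{count_alignments(n):.3e}", f"{stirling_estimate(n):.3e}")
```

Even for two short peptides of 10 residues the count already exceeds one hundred thousand, which is why dynamic programming is used instead of enumeration.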
3
Most widely used algorithms

Two basic types of algorithms
Needleman-Wunsch algorithm [1, 2]
Global algorithm which gives an overall best-fit
alignment of the entire sequences.
Rigorous algorithm that finds the optimal solution.
Requires a tremendous amount of computing power.
Not sensitive for highly diverged sequences.
Smith-Waterman algorithm [3]
Local alignment procedure which tries to find a
subsequence (or several short subsequences) of
high similarity.
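
The difference between the two algorithms can be sketched with score-only dynamic programming. This is a minimal illustration, assuming a linear gap penalty and a toy match/mismatch score; the function names and parameter values are not from the slides, and a real search would use a substitution matrix and affine gaps.

```python
def simple_score(x, y):
    """Toy scoring function (assumption): +1 for a match, -1 otherwise."""
    return 1 if x == y else -1


def needleman_wunsch_score(a, b, score=simple_score, gap=-2):
    """Optimal global alignment score with a linear gap penalty."""
    rows, cols = len(a) + 1, len(b) + 1
    dp = [[0] * cols for _ in range(rows)]
    # Leading gaps are penalised in a global alignment.
    for i in range(1, rows):
        dp[i][0] = i * gap
    for j in range(1, cols):
        dp[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score(a[i - 1], b[j - 1]),  # align pair
                dp[i - 1][j] + gap,                            # gap in b
                dp[i][j - 1] + gap,                            # gap in a
            )
    return dp[-1][-1]


def smith_waterman_score(a, b, score=simple_score, gap=-2):
    """Best local alignment score: cells are floored at zero and the
    maximum over the whole table is returned."""
    rows, cols = len(a) + 1, len(b) + 1
    dp = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            dp[i][j] = max(
                0,
                dp[i - 1][j - 1] + score(a[i - 1], b[j - 1]),
                dp[i - 1][j] + gap,
                dp[i][j - 1] + gap,
            )
            best = max(best, dp[i][j])
    return best
```

The only changes from the global to the local version are the zero floor, the unpenalised borders, and taking the table maximum instead of the last cell.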

Ref: [1] Needleman, S.B. & Wunsch, C.D. 1970. J.
Mol. Biol. 48, 443-453. [2] Gotoh, O. 1982. J.
Mol. Biol. 162, 705-708. [3] Smith, T.F. &
Waterman, M.S. 1981. J. Mol. Biol. 147, 195-197.
4
How to derive score parameters for sequence
alignments

Substitution matrices
General idea:
from pairwise alignments of proteins derive the
probabilities $p_{ab}$ of exchanging amino acids (or
nucleotides) a and b, and compare them to the
expected probability in a random model R, $p^{R} = q_a q_b$.
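
A common way to turn this comparison into a score is the log-odds ratio. The symbol $s(a,b)$ is introduced here for illustration; it is the standard form, not necessarily the exact expression used in the lecture.

```latex
s(a,b) \;=\; \log \frac{p_{ab}}{p^{R}} \;=\; \log \frac{p_{ab}}{q_a\, q_b}
```

A positive value means the pair is seen more often in related proteins than expected by chance, a negative value less often.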
Questions
(a) How to score an alignment ?
(b) What is a statistically valid set of protein
sequences ?
(c) What is a good random model ?
(d) Different pairs of proteins have diverged
from their common ancestor by different amounts
(e.g. compare homologs in human and mouse to
homologs in human and E. coli).
(e) How to find the best alignment?

5
Examples of Substitution matrices

Dayhoff PAM Matrices
Dayhoff, Schwartz, Orcutt (1978). A model of
evolutionary change in proteins. In Dayhoff (ed.),
Atlas of Protein Sequence and Structure, Vol. 5,
NBRF, Washington, pp. 345-352.
BLOSUM matrices
Henikoff, Henikoff (1992). Amino acid
substitution matrices from protein blocks. PNAS
89, 10915-10919.
In both cases S(i,j) is determined (in a
simplified way) from the following counts:
Nexch(i,j) is the number of observed substitutions
of amino acid i with j; Ni and Nj are the numbers of
occurrences of the individual amino acids i and j.

6
Parameters of the substitution matrices

Both families of matrices have a parameter attached,
which characterizes the selection of proteins
used to calculate the matrices.
PAM n
point accepted mutation: one PAM unit is
equivalent to an average change of 1% of all
amino acids
first use closely related sequences, then expand
using a model of evolution
widely used: PAM250
BLOSUM m
use only ungapped, aligned regions of protein
families (BLOCKS)
the sequences from each block are clustered: two
sequences are in the same cluster if their sequence
identity is larger than a cutoff (m%)
smaller values of m mean more diverse sequences
BLOSUM62 is widely used

7
PAM: point accepted mutation

PAM scores are derived from alignments of
closely related sequences, i.e., proteins whose
function is known to be the same (hemoglobin,
cytochrome c, ribosomal proteins, RNase A...)
from many organisms. The original PAM scoring
matrix was derived by Margaret Dayhoff, a pioneer
in sequence analysis.
Numbers may be expressed in terms of
time-dependent probability matrices P(t). One
PAM unit is the time required to achieve an
average change of 1% in the amino acid positions.
The original aim was to relate observed changes
to the evolutionary distance between organisms,
as reflected by the geological record. Thus PAM
units may be expressed in millions of years of
evolution.
PAM250 corresponds to more diverged sequences
than PAM100.
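
Higher PAM numbers are usually described as repeated application of the 1-PAM probability matrix, i.e. PAM n is the n-th matrix power of PAM1. The sketch below shows this on a made-up 3-letter alphabet; the matrix values are hypothetical and chosen only so that each residue stays unchanged with probability 0.99.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
        for i in range(size)
    ]


def mat_pow(matrix, power):
    """Raise a square matrix to a non-negative integer power."""
    size = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(power):
        result = mat_mul(result, matrix)
    return result


# Hypothetical 1-PAM transition matrix for a toy 3-letter alphabet.
# Each row sums to 1; the diagonal 0.99 means a 1% chance of change.
PAM1_TOY = [
    [0.990, 0.007, 0.003],
    [0.006, 0.990, 0.004],
    [0.002, 0.008, 0.990],
]

pam100 = mat_pow(PAM1_TOY, 100)
pam250 = mat_pow(PAM1_TOY, 250)

if __name__ == "__main__":
    # Probability that the first residue type is still itself.
    print(round(pam100[0][0], 3), round(pam250[0][0], 3))
```

The diagonal shrinks as n grows, which is the sense in which PAM250 describes more diverged sequences than PAM100.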

8
PAM 250 Substitution matrix expressed as log odds
9
Similarity score is the sum of the matrix
elements in an alignment

Residue pairs with scores above 0 replace each
other more often in related sequences than in
random sequences. This is an indication that both
residues can carry out similar functions (similar
size, hydrogen bonding, etc). A score exactly
equal to zero indicates amino acid pairs that are
found as alternatives at exactly the frequency
predicted by chance. Residue pairs with scores
less than 0 replace each other less often than in
random sequences and might be an indication that
these residues are not functionally equivalent.
The score of an alignment is calculated as the
sum of the substitution matrix elements in this
alignment. This can be derived by assuming that
all positions in a protein sequence are
independent. Then the odds for the alignment are
given by the product of the odds of the individual
aligned pairs. In practice one takes logarithms and
deals with sums rather than products.
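
Written out, the step from products to sums looks as follows. The notation is introduced here for illustration: $a_i$ and $b_i$ are the residues aligned at position $i$, $p_{ab}$ is the probability of the pair in related sequences, and $q_a$ is the background frequency of residue $a$.

```latex
\frac{P(\text{alignment} \mid \text{related})}{P(\text{alignment} \mid R)}
  \;=\; \prod_{i} \frac{p_{a_i b_i}}{q_{a_i}\, q_{b_i}}
\qquad\Longrightarrow\qquad
S \;=\; \sum_{i} s(a_i, b_i),
\quad s(a,b) = \log \frac{p_{ab}}{q_a\, q_b}
```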

10
Example

Scoring the distance between two sequences with
PAM250
(for example, two sequences from the EGF domain
of rabbit and pig fertilin)
QNCNN
EKCHN
S = 2 + 1 + 12 + 2 + 2 = 19
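
The same sum can be reproduced in a few lines. This is a minimal sketch: the dictionary holds only the five PAM250 entries needed for these residue pairs (2, 1, 12, 2, 2), and the names are illustrative.

```python
# Only the PAM250 entries required for this example.
PAM250_SUBSET = {
    ("Q", "E"): 2,
    ("N", "K"): 1,
    ("C", "C"): 12,
    ("N", "H"): 2,
    ("N", "N"): 2,
}


def pair_score(a, b, matrix):
    """Look up a score in a symmetric matrix stored as one triangle."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]


def alignment_score(seq1, seq2, matrix):
    """Sum the matrix elements over an ungapped alignment."""
    if len(seq1) != len(seq2):
        raise ValueError("ungapped alignment needs equal-length sequences")
    return sum(pair_score(a, b, matrix) for a, b in zip(seq1, seq2))


if __name__ == "__main__":
    print(alignment_score("QNCNN", "EKCHN", PAM250_SUBSET))
```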

11
BLOSUM: the BLOcks SUbstitution Matrix

Scores are derived from alignments of distantly
related sequences, without regard to function.
Should give a better substitution matrix for more
distantly related sequences than the PAM
matrices. Also, as PAM is limited to proteins of
known function for its derivation, more
sequences contribute to the BLOSUM numbers
(better statistics).
The sequence alignments are taken from the BLOCKS
database, with the numerical value derived from
the cutoff value for the diversity of the
sequences.
BLOSUM62 (clustering cutoff 62% identity) is
drawn from a less diverse sequence alignment than
BLOSUM35 (clustering cutoff 35% identity).

12
Using the BLOSUM62 log odds matrix to score an
alignment: add the numbers in the matrix
If you have a D above a W in an alignment, the
score is -4. For F to W, the score is 1.
Every time F matches F, or D matches D, add 6.
13
Other scoring matrices

Gaston Gonnet and coworkers derived a matrix
much like PAM250 by using pairwise alignments of
all the sequences known in 1992, in an iterative
fashion starting with alignments based on PAM250.
They noted that their results were different when
they used closely related sequence alignments vs.
more distantly related ones.
Identity matrix: sort of the original scoring scheme,
but only useful if it is scored according to the frequency
of occurrence of amino acids in the database.

14
Concepts of protein structure prediction

Why is there a need for protein structure
prediction ?
the sequence of a protein is easily available
the determination of 3D structures is still a
slow process
energy based methods
free energy of the protein in the native state
is minimal
Anfinsen experiment
ab initio structure prediction is still an
unsolved problem
holy grail of computational biology
knowledge-based methods
parameters are extracted from currently known 3D
structures
examples
secondary structure prediction
fold recognition methods (threading)
knowledge based force field terms are added to
free energy term

15
EXPLOSION OF GENOMIC DATA

Gene sequences from genome projects far
outnumber the experimentally determined 3D
structures
Prediction of 3D structures of proteins is a
necessity

16
Tertiary structure prediction methods
Search for Similar Gene or Protein Sequences
Align sequences to find common motifs that
correlate with structure and functions
Predict the 3D Structure and Function of a New
Protein by
Ab initio Structure Prediction
Comparative Homology Modeling
Protein Fold Recognition (protein threading)
17
Tertiary Structure Prediction

Ab initio modeling: from sequence to 3D structure
Two approaches
(1) purely energy based
(2) prediction of secondary structure and short- and
long-range contacts
Approach (1) uses a force field (molecular mechanics)
method and molecular dynamics (or other global
minimization algorithms) to find the native state
of a protein.
Molecular mechanics computes the molecular energy
based on classical (Newtonian) mechanics and
considers molecules as atoms bonded with elastic
bonds.
Molecular Energy = Bond Energy + Angle Energy +
Torsion Energy + Electrostatic Energy + Hydrogen
Bond Energy + Solvation Energy + SS Bridge
Energy

18
Energy terms (see handouts)

Bond Energy (Bond length, Bond Angle)
Torsion angle energy (torsion motions around
bonds, improper dihedrals)
Electrostatics
Van der Waals
Hydrogen bond
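
The handouts are not part of this transcript. As a rough illustration only, the sketch below uses functional forms that are typical of molecular mechanics force fields (harmonic bonds and angles, a cosine torsion term, Coulomb and Lennard-Jones 12-6 interactions). The exact forms, parameter names and the Coulomb constant are assumptions, not the lecturer's definitions.

```python
import math

# Approximate Coulomb constant in kcal*Angstrom/(mol*e^2); assumed units.
COULOMB_CONSTANT = 332.06


def bond_energy(r, r0, k):
    """Harmonic bond stretching around the equilibrium length r0."""
    return k * (r - r0) ** 2


def angle_energy(theta, theta0, k):
    """Harmonic bending around the equilibrium angle theta0 (radians)."""
    return k * (theta - theta0) ** 2


def torsion_energy(phi, barrier, periodicity, phase):
    """Periodic torsion term: 0.5 * V_n * (1 + cos(n * phi - phase))."""
    return 0.5 * barrier * (1.0 + math.cos(periodicity * phi - phase))


def coulomb_energy(q1, q2, r, dielectric=1.0):
    """Electrostatic energy of two point charges at distance r."""
    return COULOMB_CONSTANT * q1 * q2 / (dielectric * r)


def lennard_jones(r, epsilon, sigma):
    """12-6 van der Waals term with well depth epsilon and zero crossing at sigma."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 ** 2 - sr6)
```

A total energy in this picture is simply the sum of such terms over all bonds, angles, torsions and non-bonded atom pairs.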

(2) Prediction of secondary structure and long-range
contacts
Secondary structure prediction: derive propensity
values of residues from statistical analysis of
residues in known secondary structures
More sophisticated methods: neural networks,
combined prediction from MSA and HMM
Long range contacts
Tree-determinant residues
Motifs
Correlated mutations

20
Comparative Homology Modeling

3D structures of proteins come in families and
superfamilies
E.g. SCOP http://scop.mrc-lmb.cam.ac.uk/scop/
families: sequence identities high (> 35%), same
functional residues
superfamilies: similar 3D fold, some common
functional motifs
No universal definition of superfamilies
folds: similar 3D fold
Rule of thumb: if two proteins have an alignment
with a sequence identity > 30%, they have the same
fold.
More sophisticated methods for fold recognition:
3D profiles or threading
Steps
- for a target sequence find a homologous PDB
template structure,
- make an optimum alignment between the target
and template sequences,
- generate the tertiary structure of the
target using the template geometry.
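
The rule of thumb above can be checked on an existing target/template alignment with a few lines of code. This is a minimal sketch: it assumes identity is measured over aligned, non-gap columns (several other conventions exist), and the function names and the default cutoff argument are illustrative.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of identical residues over columns without gaps."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # skip insertions and deletions
        compared += 1
        if x == y:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


def likely_same_fold(aligned_a, aligned_b, cutoff=30.0):
    """Apply the > 30% identity rule of thumb to an alignment."""
    return percent_identity(aligned_a, aligned_b) > cutoff
```

For short alignments a high identity can arise by chance, so the rule is best applied to alignments that cover most of the target.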

21
Additional considerations
What is the secondary structure? Is it homologous
to other protein sequences? Is it homologous to
other protein structures? What is the best
sequence alignment between your target protein
and homologous PDB structures? Examine the
regions of insertions and deletions. Are they
located in the loop regions? On the surface? Is
the region hydrophobic or hydrophilic? The PDB
template might have functional sites and
established motifs. Does your target sequence have
the same features? If disulphide bridges are
present in the PDB template, are cysteine
residues aligned?
22
Basics of Secondary Structure Prediction

Propensity values for secondary structures of
amino acids
Statistical analysis of the occurrence of amino
acids in regular secondary structures of a
database of representative proteins
assignment of secondary structure
X denotes one of the 20 aa, X = Ala, Val, ...
$n_{\alpha}^{X}$: number of aa X in α-helical regions
$N_{\alpha}$: number of all amino acids in α-helical
regions
$N_{X}$: number of aa X in the database
$N_{tot}$: total number of all amino acids in the database

frequency of amino acid X in α-helical
regions: $f_{\alpha}^{X} = n_{\alpha}^{X} / N_{X}$
average frequency of α-helices in all
proteins: $\langle f_{\alpha} \rangle = N_{\alpha} / N_{tot}$
Propensity values: $P_{\alpha}^{X} = f_{\alpha}^{X} / \langle f_{\alpha} \rangle$
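
A propensity table of this kind can be computed from any residue-by-residue secondary structure assignment. The sketch below assumes the propensity of X is the helical fraction of residue X divided by the overall helical fraction of the database, and that helix positions are labelled "H" in a state string; both the labelling and the function name are illustrative.

```python
from collections import Counter


def helix_propensities(sequence, states, helix="H"):
    """Return {residue: propensity} with
    propensity = (n_helix_X / N_X) / (N_helix / N_tot)."""
    if len(sequence) != len(states):
        raise ValueError("sequence and state string must have equal length")
    n_tot = len(sequence)
    n_x = Counter(sequence)  # occurrences of each residue type
    n_helix_x = Counter(aa for aa, s in zip(sequence, states) if s == helix)
    n_helix = sum(n_helix_x.values())
    if n_helix == 0:
        raise ValueError("no helical positions in the data")
    average_helix = n_helix / n_tot
    return {aa: (n_helix_x[aa] / n_x[aa]) / average_helix for aa in n_x}
```

Values above 1 mark residues that are enriched in helices relative to the database average, values below 1 mark residues that are depleted.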
23
Methods for prediction

Classic method (Chou-Fasman, 1985)
simplified rules
separate amino acids into groups of helix
(β-strand) formers and breakers
search for clusters of formers (four helix formers out
of six contiguous residues; three β-strand formers out of
five residues); extend the segments in both
directions until a tetrapeptide of breakers is
found
later improvements
Garnier-Osguthorpe-Robson (GOR) method
influence of the residue at position j on the secondary
structure in the neighborhood of the residue is
included
main effect is statistically found in the range
j-8 < i < j+8
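
The simplified helix rule (four formers out of six residues, then extension until a tetrapeptide of breakers) can be sketched as follows. The former and breaker sets are illustrative placeholders, not the published propensity tables, and trimming the nucleus to its first and last former is a choice made here to keep the example tidy.

```python
# Illustrative residue classes only (assumption, not the published tables).
HELIX_FORMERS = frozenset("AELMQK")
HELIX_BREAKERS = frozenset("GP")


def _all_breakers(tetra, breakers):
    """True only for a full tetrapeptide made entirely of breakers."""
    return len(tetra) == 4 and all(aa in breakers for aa in tetra)


def find_helix_segments(seq, formers=HELIX_FORMERS, breakers=HELIX_BREAKERS,
                        window=6, min_formers=4):
    """Return half-open (start, end) index pairs of predicted helices."""
    segments = []
    n = len(seq)
    i = 0
    while i + window <= n:
        hits = [k for k in range(i, i + window) if seq[k] in formers]
        if len(hits) < min_formers:
            i += 1
            continue
        # Nucleus trimmed to the first and last former inside the window.
        start, end = hits[0], hits[-1] + 1
        lower = segments[-1][1] if segments else 0
        # Extend left until the four residues just outside are all breakers
        # (or the previous segment / sequence start is reached).
        while start > lower and not _all_breakers(seq[max(start - 4, 0):start], breakers):
            start -= 1
        # Extend right with the same stopping rule.
        while end < n and not _all_breakers(seq[end:end + 4], breakers):
            end += 1
        segments.append((start, end))
        i = end
    return segments
```

With fewer than four residues left on one side, no breaker tetrapeptide can be found there, so this sketch extends the segment to the sequence end on that side.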

24
Improvements of the methods
major improvements: larger databases, multiple
sequence alignments, neural network
methods, consensus prediction, meta servers
25
Secondary Structure Prediction Servers

APSSP2 www.imtech.res.in/raghava/apssp2/
Advanced Protein Secondary Structure Prediction
Server, GPS Raghava, Bioinformatics Center,
Chandigarh
PSIPRED bioinf.cs.ucl.ac.uk/psipred/index.html
The PSIPRED Protein Structure Prediction Server,
D. T. Jones, Department of Computer Science,
University College London, UK.
PROF www.aber.ac.uk/phiwww/prof/
University of Wales, Aberystwyth, Computational
Biology Group.
PredictProtein cubic.bioc.columbia.edu/predictprotein/
The PredictProtein server, B. Rost, Columbia
University, NY.
SAM-T02sec www.cse.ucsc.edu/research/compbio/HMM-apps/T02-query.html
HMM methods, K. Karplus, UCSC
JPRED www.compbio.dundee.ac.uk/www-jpred/
A consensus method for protein secondary
structure prediction
G. Barton, University of Dundee