Bioinformatics course - PowerPoint PPT Presentation

Provided by: werner6


1
Homology vs. similarity
  • Homology: evolutionary relation of two proteins
    with similar biological function that indicates
    common ancestry.
  • Sequence similarity is observable; homology is a
    hypothesis based on observation.
  • We want to know whether two sequences are truly
    homologous because this will enable us to draw
    conclusions about their probable structure and
    function.
  • Sequence similarity can be global (overall
    sequence similarity) or local (motifs); both help
    in distinguishing probable homologues.

2
Basics of Database Search
  • How to find similar sequences (query sequence
    versus database):
  • 1. Use a similarity matrix (substitution
    matrix) to score the similarity of amino acids.
  • 2. Generate all possible alignments and
    calculate a score for each alignment.
  • 3. The optimal alignment is the alignment
    with the highest score.
  • Exhaustive enumeration and scoring of all
    alignments is not feasible.

Number of all possible alignments for two
sequences of length n: (formula not preserved in
this transcript)
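
The count itself is not legible in this transcript. A commonly quoted figure, assumed here and not necessarily the slide's own formula, is the central binomial coefficient C(2n, n), which grows roughly as 2^(2n) / sqrt(pi n). The sketch below (function names are illustrative) shows how quickly that number gets out of hand.

```python
import math


def count_alignments(n: int) -> int:
    """Central binomial coefficient C(2n, n), assumed here as the alignment count."""
    return math.comb(2 * n, n)


def stirling_estimate(n: int) -> float:
    """Stirling-type approximation 2^(2n) / sqrt(pi * n) of C(2n, n)."""
    return 4.0 ** n / math.sqrt(math.pi * n)


# Exact count next to the approximation for a few sequence lengths.
for n in (5, 10, 50, 100):
    print(n, count_alignments(n), f"{stirling_estimate(n):.3e}")
```

Already for two sequences of length 100 the count exceeds 10^58, which is why dynamic programming or heuristics are used instead of enumeration.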
3
Most widely used algorithms
  • Two basic types of algorithms:
  • Needleman-Wunsch algorithm [1,2]
  • Global algorithm which gives an overall best-fit
    alignment of the entire sequences.
  • Rigorous algorithm that finds the optimal solution.
  • Requires a large amount of computing power when
    applied to whole databases (time and memory grow
    with the product of the two sequence lengths).
  • Not sensitive for highly diverged sequences.
  • Smith-Waterman algorithm [3]
  • Local alignment procedure which tries to find a
    subsequence (or several short subsequences) of
    high similarity.

Ref:
[1] Needleman, S.B. & Wunsch, C.D. 1970. J. Mol. Biol. 48, 443-453.
[2] Gotoh, O. 1982. J. Mol. Biol. 162, 705-708.
[3] Smith, T.F. & Waterman, M.S. 1981. J. Mol. Biol. 147, 195-197.
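
The difference between the two algorithm types can be sketched with one dynamic-programming routine. This is a minimal illustration, not the slide author's code: it assumes a simple match/mismatch score and a linear gap penalty (all three values are arbitrary) instead of a real substitution matrix, and it returns only the optimal score, not the alignment.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Optimal alignment score by dynamic programming.

    local=False: global (Needleman-Wunsch style), whole sequences aligned.
    local=True:  local (Smith-Waterman style), best-scoring subsequences.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps are penalized.
        for i in range(1, rows):
            H[i][0] = i * gap
        for j in range(1, cols):
            H[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(H[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                       H[i - 1][j] + gap,     # gap in b
                       H[i][j - 1] + gap)     # gap in a
            if local:
                cell = max(cell, 0)           # a local alignment may restart
            H[i][j] = cell
            best = max(best, cell)
    return best if local else H[-1][-1]


query, subject = "XXXGATTACAYYY", "ZZGATTACAWW"
print("global:", align_score(query, subject))
print("local: ", align_score(query, subject, local=True))
```

The local score picks out the shared GATTACA core, while the global score is dragged down by the unrelated flanking residues.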
4
How to derive score parameters for sequence
alignments
  • Substitution matrices
  • General idea:
  • From pairwise alignments of proteins, derive the
    probabilities $p_{ab}$ for exchanging amino acids
    (or nucleotides) a and b, and compare them to the
    expected probability in a random model R,
    $p_R = q_a q_b$.
  • Questions
  • (a) How to score an alignment?
  • (b) What is a statistically valid set of protein
    sequences?
  • (c) What is a good random model?
  • (d) Different pairs of proteins have diverged
    from their common ancestor by different amounts
    (e.g. compare homologs in human and mouse to
    homologs in human and E. coli).
  • (e) How to find the best alignment?

5
Examples of Substitution matrices
  • Dayhoff PAM matrices
  • Dayhoff, Schwartz & Orcutt (1978). A model of
    evolutionary change in proteins. In Dayhoff (ed.),
    Atlas of Protein Sequence and Structure, Vol. 5,
    NBRF, Washington, pp. 345-352.
  • BLOSUM matrices
  • Henikoff & Henikoff (1992). Amino acid
    substitution matrices from protein blocks. PNAS
    89, 10915-10919.
  • In both cases S(i,j) is determined (in a
    simplified way) from the quantities below (the
    formula itself is not preserved in this
    transcript).
  • $N_{exch}(i,j)$ is the number of observed
    substitutions of amino acid i by j; $N_i$ and
    $N_j$ are the numbers of occurrences of the
    individual amino acids i and j.
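
One common way to turn such counts into a score is a log-odds ratio, in the spirit of comparing the observed exchange probability with the product of the background frequencies. The sketch below is an assumption, not a reconstruction of the slide's formula: the function name, the normalization by a total pair count, and the log base are all illustrative, and it ignores refinements such as the factor of 2 for unordered pairs.

```python
import math


def log_odds_score(n_exch_ij, n_pairs_total, n_i, n_j, n_total, base=2.0):
    """Log-odds score from raw counts (simplified, illustrative).

    n_exch_ij     observed i<->j exchanges
    n_pairs_total all observed aligned pairs
    n_i, n_j      occurrences of amino acids i and j
    n_total       all amino acids counted
    """
    p_ij = n_exch_ij / n_pairs_total   # observed exchange frequency
    q_i = n_i / n_total                # background frequency of i
    q_j = n_j / n_total                # background frequency of j
    return math.log(p_ij / (q_i * q_j), base)


# Hypothetical counts: the pair is seen more often than chance predicts.
print(round(log_odds_score(10, 100, 50, 50, 200), 3))
```

A positive value means the exchange is seen more often than expected by chance, zero means exactly as often, and a negative value means less often.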

6
Parameters of the substitution matrices
  • Both matrix families carry a parameter that
    characterizes the selection of proteins used to
    calculate the matrices.
  • PAM n
  • Point accepted mutation: one PAM unit is
    equivalent to an average change of 1% of all
    amino acids.
  • First use closely related sequences, then expand
    using a model of evolution.
  • Widely used: PAM250.
  • BLOSUM m
  • Use only ungapped, aligned regions of protein
    families (BLOCKS).
  • The sequences from each block are clustered: two
    sequences are in the same cluster if their
    sequence identity is larger than a cutoff (m%).
  • Smaller values of m mean more diverse sequences.
  • BLOSUM62 is widely used.

7
PAM: point accepted mutation
  • PAM scores are derived from alignments of
    closely related sequences, i.e., proteins whose
    function is known to be the same (hemoglobin,
    cytochrome c, ribosomal proteins, RNase A, ...)
    from many organisms. The original PAM scoring
    matrix was derived by Margaret Dayhoff, a pioneer
    in sequence analysis.
  • Numbers may be expressed in terms of
    time-dependent probability matrices P(t). One
    PAM unit is the time required to achieve an
    average change of 1% in the amino acid positions.
    The original aim was to relate observed changes
    to the evolutionary distance between organisms,
    as reflected by the geological record. Thus PAM
    units may be expressed in millions of years of
    evolution.
  • PAM250 corresponds to more diverged sequences
    than PAM100; higher-numbered matrices are obtained
    by extrapolating from closely related sequences
    with the model of evolution.
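
The extrapolation to larger PAM distances can be sketched as repeated multiplication of a one-unit mutation probability matrix, P(n) = P(1)^n. The example below is only an illustration: it uses a made-up two-letter alphabet with a 1% change per unit (`PAM1_TOY` is hypothetical), not Dayhoff's 20x20 matrix.

```python
def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(M, n):
    """Raise a square matrix to the n-th power by repeated multiplication."""
    size = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(size)]
              for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, M)
    return result


# Toy one-unit matrix: each residue changes with probability 0.01.
PAM1_TOY = [[0.99, 0.01],
            [0.01, 0.99]]

pam100 = mat_pow(PAM1_TOY, 100)
pam250 = mat_pow(PAM1_TOY, 250)
print("P(unchanged) at 100 units:", round(pam100[0][0], 4))
print("P(unchanged) at 250 units:", round(pam250[0][0], 4))
```

Because multiple substitutions can hit the same position, the probability of finding a residue unchanged falls more slowly than a naive "n percent changed" reading would suggest.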

8
PAM 250 Substitution matrix expressed as log odds
9
Similarity score is the sum of the matrix
elements in an alignment
  • Residue pairs with scores above 0 replace each
    other more often in related sequences than in
    random sequences. This is an indication that both
    residues can carry out similar functions (similar
    size, hydrogen bonding, etc). A score exactly
    equal to zero indicates amino acid pairs that are
    found as alternatives at exactly the frequency
    predicted by chance. Residue pairs with scores
    less than 0 replace each other less often than in
    random sequences and might be an indication that
    these residues are not functionally equivalent.
  • The score of an alignment is calculated as the
    sum of the substitution matrix elements in this
    alignment. This can be derived by assuming that
    all positions in a protein sequence are
    independent. Then the odds for the alignment are
    given by the product of the odds of the
    individual aligned pairs.
  • In practice one takes logarithms and deals with
    sums rather than products.
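
The slide's own formula is not preserved here. Under the independence assumption, and using $p_{ab}$ for the exchange probability and $q_a q_b$ for the random model as introduced earlier, the standard form would be the following (the index k over aligned positions and the symbol s(a,b) are notation assumed for this sketch):

```latex
\frac{P(\text{alignment} \mid \text{related})}{P(\text{alignment} \mid R)}
  = \prod_{k} \frac{p_{a_k b_k}}{q_{a_k}\, q_{b_k}},
\qquad
S = \sum_{k} s(a_k, b_k),
\quad
s(a, b) = \log \frac{p_{ab}}{q_a\, q_b}
```

Taking the logarithm turns the product of odds into the sum S of log-odds matrix elements.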

10
Example
  • Scoring the similarity of two sequences with
    PAM250
  • (for example, two sequences from the EGF domain
    of rabbit and pig fertilin)
  • QNCNN
  • EKCHN
  • S = 2 + 1 + 12 + 2 + 2 = 19
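
The same sum can be sketched in code. The dictionary below holds only the five PAM250 entries needed for this example, reading the slide's score line as the per-position values 2, 1, 12, 2 and 2; the function names are illustrative, and a full matrix would be loaded from a file in practice.

```python
# PAM250 entries for the five aligned pairs in this example only.
PAM250_SUBSET = {
    ("Q", "E"): 2,
    ("N", "K"): 1,
    ("C", "C"): 12,
    ("N", "H"): 2,
    ("N", "N"): 2,
}


def pair_score(a, b, matrix):
    """Look up a symmetric substitution score for residues a and b."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]


def alignment_score(seq1, seq2, matrix):
    """Sum the matrix elements over an ungapped alignment."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(a, b, matrix) for a, b in zip(seq1, seq2))


print(alignment_score("QNCNN", "EKCHN", PAM250_SUBSET))  # 19
```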

11
BLOSUM: the BLOcks SUbstitution Matrix
  • Scores are derived from alignments of distantly
    related sequences, without regard to function.
  • Should give a better substitution matrix for more
    distantly related sequences than the PAM
    matrices. Also, as PAM is limited to proteins of
    known function for its derivation, more
    sequences contribute to the BLOSUM numbers
    (better statistics).
  • The sequence alignments are from the BLOCKS
    database, with the numerical value derived from
    the cutoff value for the diversity of the
    sequences.
  • BLOSUM62 (clustering cutoff: > 62% identity) is
    drawn from less diverse sequence alignments than
    BLOSUM35 (clustering cutoff: > 35% identity).

12
Using the BLOSUM62 log odds matrix to score an
alignment: add the numbers in the matrix
If you have a D above a W in an alignment, the
score is -4. For F to W, the score is 1.
Every time F matches F, or D matches D, add 6.
13
Other scoring matrices
  • Gaston Gonnet and coworkers derived a matrix
    much like PAM250 by using pairwise alignments of
    all the sequences known in 1992, in an iterative
    fashion starting with alignments based on PAM250.
    They noted that their results were different when
    they used closely related sequence alignments vs.
    more distantly related ones.
  • Identity matrix: in a sense the original scoring
    scheme, but only useful if it is scored according
    to the frequency of occurrence of amino acids in
    the database.

14
Concepts of protein structure prediction
  • Why is there a need for protein structure
    prediction ?
  • the sequence of a protein is easily available
  • the determination of 3D structures is still a
    slow process
  • energy based methods
  • free energy of the protein in the native state
    is minimal
  • Anfinsen experiment
  • ab initio structure prediction is still an
    unsolved problem
  • holy grail of computational biology
  • knowledge-based methods
  • parameters are extracted from currently known 3D
    structures
  • examples
  • secondary structure prediction
  • fold recognition methods (threading)
  • knowledge based force field terms are added to
    free energy term

15
EXPLOSION OF GENOMIC DATA
  • Gene sequences from genome projects far
    outnumber the experimentally determined 3D
    structures
  • Prediction of 3D structures of proteins is a
    necessity

16
Tertiary structure prediction methods
  • Search for similar gene or protein sequences
  • Align sequences to find common motifs that
    correlate with structure and function
  • Predict the 3D structure and function of a new
    protein by:
  • ab initio structure prediction
  • comparative homology modeling
  • protein fold recognition (protein threading)
17
Tertiary Structure Prediction
  • Ab initio modeling: from sequence to 3D structure
  • Two approaches:
  • (1) purely energy based
  • (2) prediction of secondary structure and
    short/long range contacts
  • Approach (1) uses a force field (molecular
    mechanics) method and molecular dynamics (or other
    global minimization algorithms) to find the native
    state of a protein.
  • Molecular mechanics computes the molecular energy
    based on classical (Newtonian) mechanics and
    considers molecules as atoms connected by elastic
    bonds.
  • Molecular Energy = Bond Energy + Angle Energy +
    Torsion Energy + Electrostatic Energy + Hydrogen
    Bond Energy + Solvation Energy + SS Bridge
    Energy

18
Energy terms (see handouts)
  • Bond Energy (Bond length, Bond Angle)
  • Torsion angle energy (torsion motions around
    bonds, improper dihedrals)
  • Electrostatics
  • Van der Waals
  • Hydrogen bond

19
  • (2) Prediction of secondary structure and
    long-range contacts
  • Secondary structure prediction: derive propensity
    values of residues from statistical analysis of
    residues in known secondary structures
  • More sophisticated methods: neural networks,
    combined prediction from MSA and HMM
  • Long-range contacts:
  • Tree-determinant residues
  • Motifs
  • Correlated mutations

20
Comparative Homology Modeling
  • 3D structures of proteins come in families and
    superfamilies
  • E.g. SCOP: http://scop.mrc-lmb.cam.ac.uk/scop/
  • Families: high sequence identities (> 35%), same
    functional residues
  • Superfamilies: similar 3D fold, some common
    functional motifs
  • No universal definition of superfamilies
  • Folds: similar 3D fold
  • Rule of thumb: if two proteins have an alignment
    with a sequence identity > 30%, they have the same
    fold.
  • More sophisticated methods for fold recognition:
    3D profiles or threading
  • Steps:
  • - for a target sequence, find a homologous PDB
    template structure,
  • - make an optimum alignment between the target
    and template sequences,
  • - generate the tertiary structure of the
    target using the template geometry.
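
The rule of thumb can be checked on an existing pairwise alignment with a few lines of code. This is a minimal sketch with assumed conventions: identity is counted over columns where neither sequence has a gap (other definitions divide by the alignment length or the shorter sequence), "-" marks a gap, and the function names are illustrative.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identical residues over the gap-free columns of an alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue              # skip columns containing a gap
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


def likely_same_fold(aligned_a: str, aligned_b: str, cutoff: float = 30.0) -> bool:
    """Apply the > 30% identity rule of thumb to an alignment."""
    return percent_identity(aligned_a, aligned_b) > cutoff


print(percent_identity("QNCNN", "EKCHN"))   # 40.0
print(likely_same_fold("QNCNN", "EKCHN"))   # True
```

On fragments this short the cutoff says very little; the rule is meant for alignments that span a substantial part of both proteins.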

21
Additional considerations
  • What is the secondary structure?
  • Is it homologous to other protein sequences?
  • Is it homologous to other protein structures?
  • What is the best sequence alignment between your
    target protein and homologous PDB structures?
  • Examine the regions of insertions and deletions.
    Are they located in the loop regions? On the
    surface? Is the region hydrophobic or hydrophilic?
  • The PDB template might have functional sites and
    established motifs. Does your target sequence have
    the same features?
  • If disulphide bridges are present in the PDB
    template, are the cysteine residues aligned?
22
Basics of Secondary Structure Prediction
  • Propensity values of amino acids for secondary
    structures
  • Statistical analysis of the occurrence of amino
    acids in regular secondary structures in a
    database of representative proteins
  • Assignment of secondary structure
  • X denotes one of the 20 amino acids, X = Ala,
    Val, ...
  • $n_{\alpha X}$: number of amino acids X in
    α-helical regions
  • $N_\alpha$: number of all amino acids in
    α-helical regions
  • $N_X$: number of amino acids X in the database;
    $N_{tot}$: total number of all amino acids in the
    database

Frequency of amino acid X in α-helical regions:
$f_{\alpha X} = n_{\alpha X} / N_X$
Average frequency of α-helices in all proteins:
$\langle f_\alpha \rangle = N_\alpha / N_{tot}$
Propensity values:
$P_{\alpha X} = f_{\alpha X} / \langle f_\alpha \rangle$
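
A minimal sketch of the propensity calculation, assuming the usual Chou-Fasman-style definition: the fraction of residues X found in helices, divided by the overall fraction of helical residues, built from the four counts defined above. The one-letter helix symbol "H", the function name and the toy data are assumptions, and the database must contain at least one helical residue.

```python
from collections import Counter


def helix_propensities(sequences, structures, helix_symbol="H"):
    """Helix propensity per amino acid from sequences and per-residue states."""
    n_helix_x = Counter()   # residues X observed in helices
    n_x = Counter()         # residues X observed anywhere
    for seq, states in zip(sequences, structures):
        for aa, state in zip(seq, states):
            n_x[aa] += 1
            if state == helix_symbol:
                n_helix_x[aa] += 1
    n_tot = sum(n_x.values())
    n_helix = sum(n_helix_x.values())
    avg_helix = n_helix / n_tot          # average helix frequency
    return {aa: (n_helix_x[aa] / n_x[aa]) / avg_helix for aa in n_x}


# Toy database: one sequence with its secondary structure (H helix, C coil).
propensities = helix_propensities(["AAGG"], ["HHHC"])
for aa, value in sorted(propensities.items()):
    print(aa, round(value, 3))
```

Values above 1 mark residues that occur in helices more often than average, values below 1 mark residues that occur there less often.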
23
Methods for prediction
  • Classic method (Chou & Fasman, 1974)
  • simplified rules
  • separate amino acids into groups of helix
    (β-strand) formers and breakers
  • search for clusters of formers (four helix
    formers out of six contiguous residues; three
    β-strand formers out of five residues); extend
    the segments in both directions until a
    tetrapeptide of breakers is found
  • later improvements
  • Garnier-Osguthorpe-Robson (GOR) method
  • influence of the residue at position j on the
    secondary structure in the neighborhood of the
    residue is included
  • main effect is statistically found in the range
    j-8 < i < j+8
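
The nucleation step of the classic rules (four helix formers in a window of six residues) can be sketched as a sliding-window search. This shows only that one step, not the full method: extension and breaker handling are left out, the function name is illustrative, and the set of helix formers is passed in by the caller, with the set in the example chosen purely for illustration.

```python
def nucleation_sites(sequence, formers, window=6, min_formers=4):
    """Start positions of windows with at least min_formers former residues."""
    sites = []
    for start in range(len(sequence) - window + 1):
        segment = sequence[start:start + window]
        if sum(aa in formers for aa in segment) >= min_formers:
            sites.append(start)
    return sites


# Illustrative former set; a real run would use tabulated propensities.
EXAMPLE_HELIX_FORMERS = set("AELM")

print(nucleation_sites("GGAELMGG", EXAMPLE_HELIX_FORMERS))  # [0, 1, 2]
```

The same function covers the β-strand rule by calling it with a strand-former set, `window=5` and `min_formers=3`.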

24
Improvements of the methods
  • Major improvements:
  • larger databases
  • multiple sequence alignments
  • neural network methods
  • consensus prediction
  • meta servers
25
Secondary Structure Prediction Servers
  • APSSP2: www.imtech.res.in/raghava/apssp2/
  • Advanced Protein Secondary Structure Prediction
    Server, GPS Raghava, Bioinformatics Center,
    Chandigarh
  • PSIPRED: bioinf.cs.ucl.ac.uk/psipred/index.html
  • The PSIPRED Protein Structure Prediction Server,
    D. T. Jones, Department of Computer Science,
    University College London, UK.
  • PROF: www.aber.ac.uk/phiwww/prof/
  • University of Wales, Aberystwyth, Computational
    Biology Group.
  • PredictProtein:
    cubic.bioc.columbia.edu/predictprotein/
  • The PredictProtein server, B. Rost, Columbia
    University, NY.
  • SAM-T02sec:
    www.cse.ucsc.edu/research/compbio/HMM-apps/T02-query.html
  • HMM methods, K. Karplus, UCSC
  • JPRED: www.compbio.dundee.ac.uk/www-jpred/
  • A consensus method for protein secondary
    structure prediction, G. Barton, University of
    Dundee