LSM2104 Part II Lectures 3 and 4 - PowerPoint PPT Presentation

1 / 51
About This Presentation
Title:

LSM2104 Part II Lectures 3 and 4

Description:

Derivation of PAM and BLOSUM and Scoring Matrices. Basic Introduction to Dynamic Programming Techniques ... MSA ( Gupta, Kececioglu & Schaffer, 1995) MACAW ( Schu ... – PowerPoint PPT presentation

Number of Views:42
Avg rating:3.0/5.0
Slides: 52
Provided by: everestB
Category:
Tags: lectures | lsm2104 | macaw | part

less

Transcript and Presenter's Notes

Title: LSM2104 Part II Lectures 3 and 4


1
LSM2104Part IILectures 3 and 4
  • Tan Tin Wee
  • Dept of Biochemistry
  • NUS

2
Topics
  • Derivation of PAM and BLOSUM and Scoring Matrices
  • Basic Introduction to Dynamic Programming
    Techniques
  • Details of Needleman-Wunsch and Smith-Waterman
    Algoriths
  • FASTA and BLAST Heuristics
  • Database Similarity Searches and Statistics of
    the E Value
  • Multiple Sequence Alignments

3
PAM
  • PAM (Percent Accepted Mutations) similarity
    matrix
  • PAM matrices pertaining to protein sequences are
    constructed using amino acid similarities in
    evolutionarily related sequences. They show
    probability scores of replacement of amino acids
    by each other based on natural mutation rates in
    related protein families. Hence, these matrices
    are sometimes called "substitution matrices".
  • A score above zero assigned to two amino acids
    indicates that these two replace each other more
    often than expected by chance alone. ie., they
    are functionally exchangeable.
  • A negative score (below zero) indicates that the
    two amino acids are rarely interchangeable. eg.,
    a basic amino acid for an acidic one or one with
    an aromatic side chain for one with aliphatic
    side chain.
  • The number 250 in PAM250 corresponds to an
    average of 250 amino acid replacements per 100
    residues from a data set of 71 aligned sequences
    Ref Dayhoff, M, Schwartz, RM, Orcutt, BC (1978)
    A model of evolutionary change in proteins. in
    Atlas of Protein Sequence and Structure, vol 5,
    sup. 3, pp 345-352. M. Dayhoff ed., National
    Biomedical Research Foundation, Silver Spring,
    MD.. The higher the matrix number, the farther
    the evolutionary distance between the compared
    sequences

4
PAM
  • Margaret Dayhoff et al (1970s)
  • First to assemble sequences into protein seq
    atlas families and superfamilies based on seq
    similarity
  • Tables of frequency of changes/mutations observed
    in the sequences of a family derived.
  • Percent amino acid mutations accepted by
    evolutionary selection or PAM Tables derived.
  • Shows probability that one amino acid change into
    any other in these families
  • Shows which amino acids are most conserved at the
    corresponding position in two sequences.
  • Assess the probabilities of residue replacement
    in related proteins
  • Model of molecular evolution
  • 1 PAM average change in 1 of all amino acid
    possibilities
  • 100 PAMs does not mean every residue is changed
  • Time is not correlated with PAM ie. Different
    families of proteins evolve at different rates

5
BLOSUM
  • BLOSUM (Blocks Substitution Matrix) matrix
  • These are substitution matrices derived from the
    observed frequencies of amino acid replacements
    in highly conserved regions of ungapped local
    alignments. The data for the substitution scores
    in these matrices come from about 2000 blocks of
    aligned sequence segments characterizing more
    than 500 groups of related proteins Ref
    Henikoff, S., and Henikoff, J. G. (1992) Proc.
    Natl. Acad. Sci. USA 89 10915-10919
  • The BLAST server from NCBI and the search servers
    from EBI use different versions of the BLOSUM
    matrix for protein similarity searches and
    alignments.

6
BLOSUM Scoring Matrices
  • Block Substitution Matrix
  • Henikoff and Henikoff PNAS 1992
  • Number indicate percent identity within set eg.
    BLOSUM62 means 62 identity
  • Finds short highly similar sequences

7
Use of these matrices
  • PAM, BLOSUM, and other matrices
  • Used as scoring matrices for sequence alignment
    programs

8
Elements of Dynamic Programming
  • Dynamic programming is a computational method
    that is used in bioinformatics and computational
    biology to align two protein or nucleic acid
    sequences
  • Every pair of characters in the two sequences are
    compared pairwise
  • O(MN)
  • Dynamic programming algorithms proven
    mathematically to generate the best or OPTIMAL
    alignment between two sequences under a given set
    of conditions
  • http//bioinformatics.weizmann.ac.il/courses/BCG/l
    ectures/02_pairwise/2.4tools/index.html
  • http//www.sbc.su.se/per/molbioinfo2001/dynprog/d
    ynamic.html

9
Example of Dynamic programming algorithm
  • Needleman-Wunsch techniques. For this example,
    the two sequences to be globally aligned are G A
    A T T C A G T T A (sequence 1) G G A T C G A
    (sequence 2)
  • So M 11 and N 7 (the length of sequence 1
    and sequence 2, respectively)
  • A simple scoring scheme is assumed where
  • Si,j 1 if the residue at position i of sequence
    1 is the same as the residue at position j of
    sequence 2 (match score) otherwise
  • Si,j 0 (mismatch score)
  • w 0 (gap penalty)

10
  • Three steps in dynamic programming
  • Initialization
  • Matrix fill (scoring)
  • Traceback (alignment)
  • Good detailed description
  • http//www.sbc.su.se/per/molbioinfo2001/dynprog/a
    dv_dynamic.html

11
Needleman-Wunsch
  • S.B. Needleman and C.D. Wunsch. A general method
    applicable to the search for similarities in the
    amino acid sequence of two proteins. J. Mol.
    Biol., 48443--453, 1970
  • General Algorithm for Sequence Comparison
  • Compuationally expensive O(M,N) where M and N are
    lengths of sequences being pairwise compared
  • To Calculate the alignment score S(i,j), you only
    need to enumerate and score all ways in which one
    aligned pair can be added to a shorter alignment
    to produce an alignment of the first I residues
    of sequence A and the first j residues of
    sequence B

12
  • All possible pairs are calculated in a two-D
    array
  • All possible alignments are represented by
    pathways through this array, the shortest being
    the most optimal
  • Global alignments every residue of the two
    sequences A and B is involved in the calculation.
  • Misses out on specific motifs or active sites
    homology, or domains within a longer sequence
    which match each other, where other parts dont
    match

13
Step 1 assign similarity values
  • Assign Similarity Scores
  • A numerical value/score is assigned to every cell
    in the 2D array, based on the scoring matrix.
  • The similarity scores e.g. unitary soring matrix,
    PAM, BLOSUM etc.

14
Step 2 for each cell trace back all possible
pathways back to the start allowing for indels,
giving cell the value of the pathway with the max
score
  • Score pathways through the 2D Array
  • For each cell, calculate the maximum score for an
    alignment ending at that point.
  • Compute Cumulative score by adding scores in a
    path through the matrix
  • Search subrow and subcolumn for the highest score
  • Include Gap penalty and Gap extension penalty
  • The best optimal alignment is the pathway with
    the highest score
  • Max match is the largest number of amino acids of
    one protein that can be matched with those of
    another protein while allowing for gaps

15
Step 3 traceback
  • Construct the alignment or pathway back from the
    highest scoring cell to give the highest scoring
    alignment
  • Examples of implementations GAP (in GCG package)

16
  • N-W statistical significance
  • Probability of obtaining the same result of the
    maximum score from a pair of random sequences
  • Estimated using
  • Form pairs of random sequences by randomly
    drawing one member from each set e.g. with the
    same amino acid composition as the real sequences
  • If the value calculated for real proteins is
    significantly different from that of the random
    proteins, then it is likely that the alignment is
    due to the specific sequences rather than
    accidentally due to their composition.

17
Smith Waterman Algorithm
  • Smith, T. F., Waterman, M. S., J. Mol. Biol.
    (1981) 147195-197
  • Based on Needleman Wunsch Algorithm
  • Local Alignment instead of N-W global alignment
  • Examples of implementations BESTFIT (in GCG
    package)
  • S-M Algo compares segments of all possible
    lengths (local alignments) rather than entire
    length (global alignments) in the NW algo and
    chooses whichever optimises the similarity score
    of these local alignments

18
  • Comparing two sequences A (a1 a2 a3 ... an) and
    B (b1 b2 b3 ... bm)
  • HijmaxHi-1, j-1 s(ai,bj), maxHi-k,j -Wk,
    maxHi, j-l -Wl, 0
  • Hij is the maximum similarity of two segments
    ending in ai and bj respectively
  • Including the alternative that the similarity
    score be zero in the expression for Hij allows
    the local alignment to restart at any pair of
    aligned residues. In addition, it makes the
    calculation much more sensitive to the precise
    match and mismatch scores and gap penalties.

19
  • For every cell, the SW algo computes all possible
    paths leadng to it, including paths of any
    length, and not necessarily leading from the
    beginning of the sequence (unlike the global
    alignment).
  • So paths can be of any length, starting anywhere
    in the sequences being compared, and can contain
    insertions or deletions.
  • Relies on gap penalties to inhibit extensions

20
Forming the optimal alignment diagonal path
  • For every residue in the query sequence
  • Align with next residue of the db target sequence
    Previous score plus similarity score for the
    two residues
  • Deletion score is previous score minus gap
    penalty and gap size penalty query seq with gap
  • Insertion score is previous score minus gap
    penalty and gap size penalty db target with gap
  • Stop when score is zero.
  • Choose whichever above is highest

21
SW alignment
  • Score in each cell is max possible score for an
    alignment of any length (not necessarily from the
    beginning unlike the NW) ending at that cell
  • Trace pathway back form the highest scoring cell
  • This cell can be anywhere in the 2D array
  • Align the highest scoring segment
  • http//www.maths.tcd.ie/lily/pres2/

22
NW vs SW
  • Global vs local alignment
  • NW Alignment score for a pair of residues must be
    gt 0 while SW can be or
  • No gap penalty required for NW to work properly,
    but SW needs gap penalty
  • Score cannot decrease between two cells of a
    pathway for NW, but for SW, in getting to an
    alignment, score can increase, decrease or stay
    level between two cells of a pathway

23
BLAST and FASTA
  • Heuristic approximations of N-W and S-M
    algorithms
  • Reduce computation and increase speed
  • NW and SM exhaustive, guarantee optimal
  • BLAST is statistically sound. But it is not
    mathematically guaranteed to produce the best
    optimal alignment

24
FASTA
  • Pearson, W. R., Lipman, D. J., P.N.A.S. (1988)
    852444-2448
  • 1. Lookup Table
  • http//acer.gen.tcd.ie/amclysag/fasta.html
  • FastA is a family of heuristic algorithms
    developed by William Pearson of the University of
    Virginia. FastA lies between BLAST and
    Smith-Waterman in both accuracy and speed. An
    optimized FastA option makes use of
    Smith-Waterman for part of the alignment process.
    The FastA family includes DNA to DNA, protein to
    protein, and translation searches.

25
FASTA
  • http//workbench.sdsc.edu
  • Rapid Global Alignment
  • Allows alignments to shift frames
  • LALIGN a FASTA derivative for local alignments
  • Works for internal repeats missed by FASTA
    because of gaps (now there is gapped BLAST
    BLAST 2.0)

26
Basic Local Alignment Search Tool BLAST
  • Altschul, S. F., Gish, W., Miller, W., Myers, E.
    W., Lipman, D. J., J. Mol. Biol. (1990)
    215403-410
  • Strategy is to Optimise Maximal Segment Pair
    (MSP) score
  • Heuristic modification of the SW dynamic
    programming algo
  • Ungapped alignments only
  • BLAST1 uses PAM120 matrix for protein alignments
  • DNA alignment score matrix 5 for match and 4
    for mismatch

27
BLAST is heuristic
  • BLAST (Basic Local Alignment Search Tool) is a
    heuristic algorithm first released in 1990 by the
    National Center for Biotechnology Information
    (NCBI). The original BLAST compares very short
    segments of the query and database sequences in
    search of alignments that exceed a particular
    score. No insertions or deletions are considered
    during this process. If the required score is
    exceeded, BLAST investigates whether the aligned
    segments can be lengthened to produce a
    higher-scoring alignment. In the last two years,
    new and improved versions of BLAST have been
    released. These newer releases consider
    insertions and deletions to some extent producing
    better results. BLAST offers several search
    types DNA to DNA, protein to protein, and
    translation searches.

28
MSP max segment pair
  • Similarity Score for a segment is the Sum of
    Scores for individual pairs of residues.
  • MSP is highest scoring segment
  • MSP is locally max if score cannot be increased
    by extending or shortening both segments
  • Search for all locally maximal MSPs above some
    cutoff/threshold.
  • Word pair of fixed length w
  • Look for scores above threshold T
  • Scan target db sequence for word of length w with
    score geT - Find word hits
  • Extend word hits to see if it is part of a longer
    segment with the segment pair with score at least
    S where S is highest MSP score at which chance
    similarities are likely to appear

29
Algorithm
  • Compile a list of high scoring words of length w
    (w4 for proteins and 12 for nt) for query
    sequence
  • Scan for word hits of score greater than
    threshold T against db target sequence
  • Extend word hit in both directions to find High
    Scoring Pairs with scores greater than S

30
Step 1 compile list of high scoring words
  • Generate All words of w length w-mers in query
    sequence
  • Look for w-mers with score at least T when
    matched with target sequence
  • Speed of program is dependent on list generated
    in a time proportional to the length of the list
    unlike SW or NW which is proportional to the
    product of the lengths.

31
Step 2. Scanning Phase
  • Scan DB
  • Search long sequences for occurrences of words in
    the list
  • Use deterministic finite automaton or finite
    state machine at approx 0.5M residues a sec

32
Step 3. Extending Hits
  • Extend w-mers
  • Stop when score falls a certain distance below
    highest so far.ie the highest score for shorter
    or no extensions.
  • http//acer.gen.tcd.ie/amclysag/blast.html for
    DNA

33
Other more advanced programs
  • See http//www.paracel.com/faq/faq_algorithm_prime
    r.html
  • Gapped Blast (BLAST 2.0) extend words from no-gap
    to gapped, can generate gapped alignments
  • PSI Blast Position specific iterated BLASTie.
    Use gapped BLAST, generate a profile from
    multiple iterations. This profile is used instead
    of the input and distance matrix

34
Limitations to BLAST for sequence database
searching
  • Needs islands of strong similarity
  • Limits on scoring and penalty values
  • Blastx, tblastn and tblastx use 6 frame
    translation but miss sequences with frameshifts
  • Finds and reports only local alignments

35
BLAST rules of thumb
  • For short aas 20-40, 50 identity happens by
    chance
  • If A homologus to B, B to C then A and C are
    homologous
  • Similarity can be present even if there is
    absence of homology e.g. low complexity
    transmembrane and coiled coil regions.

36
BLAST significance
  • S (lambdaS-lnK)/ln2
  • Lambda and K are associated with the scoring
    system
  • S with a given E, is sigificant if it is greater
    than N/E where N is the size of the search space.

37
Significance
  • A common and simple test to determine if the
    alignment of two sequences is statistically
    significant is to carry out a simple permutation
    test. This consists of
  • Randomly rearrange the order of one or both
    sequences
  • Align the permuted sequences
  • Record the score for this alignment
  • Repeat steps 1-3 a large number of times.

38
  • http//helix.biology.mcmaster.ca/721/outline2/node
    40.htmlSECTION00630000000000000000
  • For protein sequences Doolittle's rule of thumb
    is that greater than 25 identity will suggest
    homology, less than 15 is doubtful and for those
    cases between 15-25 identity a strong
    statistical argument is required. Personally, I
    would prefer the statistical test in all cases,
    since they are easy to do and things such as
    internal repeats and unusual amino acid
    compositions can sometimes confuse the picture.

39
  • Gumbel Extreme Value Distribution
  • Make many random protein sequences of varying
    lengths
  • Locally align two sequences of a given length
    with the same scoring matrix and gap penalty
  • Repeat for many pairs of sequences of approx same
    length
  • Plot distribution of scores
  • See variation from normal distribution curve
    skew the Gumbel extreme value distribution

40
  • Probability that a score S between two unrelated
    sequence is equal to or greater than a value x,
    P(Sgtx) is given by P 1- exp (-Kmn e -lambdax)
  • Where m and n are sequence lengths, and lambda is
    scaling factor for the scoring matrix used,
  • K is a constant that depends on the scoring
    matrix and the gap penalty combination used.
  • For significant matches, we are looking for v.
    low value of P, less than 0.01 to 0.05.
  • Calculate another value E, the expect value for
    the alignment score that depends on how we found
    P. E is used by BLAST.
  • For BLOSUM62, gap 11/-1, lambda 0.3 and K0.1
    approx

41
Sequence scrambling method shuffling
  • Align two sequences and obtain optimal score S
  • Scramble one of the sequences thousands of times
    (N) and align it with the first sequence and
    obtain a distribution of scores, scrambled
    sequences should preserve same composition
  • Fit the scores into an extreme value distribution
    and calculate lambda and K
  • Calculate P for probability that one of the
    scrambled pairs will exceed optimal score S
  • Calculate E(expect value), which is usually P x
    the number of sequence pairs compared. Eg. If P
    10-6, and N 10,000 then E10-2
  • E is the number we want to be lt0.01 to 0.05

42
Tips on how to assess statistical significance
  • http//www.ncbi.nlm.nih.gov/Education/BLASTinfo/Bl
    ast_output.html

43
  • Extreme-Value Distribution Probability density
    function for distribution of local alignment
    score.
  • Probability density function
  • Several illustrative visual examples may be found
    in
  • Karlin, Samuel and Altschul, Stephen,
    Applications and statistics for multiple
    high-scoring segments in molecular sequences
    Proc. Natl. Acad. Sci. USA Vol. 90, pp.
    5873-5877, June 1993.

44
  • Take an example (ChenXin)
  • Suppose the genome of B.pseudomallei has some
    base pairs that could code for 10,000,000 amino
    acids. Now, we want to know if this genome could
    code for a very short sequence "MF". What would
    be the possibility of finding a sequence "MF" in
    the sequence coded by the B.Pse... genome?
  • Math have it 1 - ( (1- (1/20)(1/20) )
    9,999,999 )
  • Since we have 20 amino acids, the possibility of
    getting a sequence of "MF" is (1/20)(1/20). The
    possibility of the sequence is not "MF" is
    1-(1/20)(1/20). In the genome, there are
    9,999,999 possible starting point of a length-two
    sequence. So, the possibility that they are all
    not "MF" is (1- (1/20)(1/20) ) 9,999,999. So
    the possibility of finding a "MF" is 1 - ( (1-
    (1/20)(1/20) ) 9,999,999 ). If you know well
    about math, you will know this value is nearly
    1(100) which means you are very, very likely to
    find the sequence.
  • The sequence "MF" is arbitrarily dictated by me
    which have no biological significance. That means
    the existence of "MF" is because of chance (by
    distribution), not because of the evolutionary
    pressure. The sequence we are interested are
    those developed and fixed through the
    evolutionary process by the force of natural
    selection (survival of the fittest). So, we wish
    to calculate how possible the match is because of
    chance (by distribution).
  • This is the E value. So the smaller the E value
    is (The less possible the match can occur by
    distribution), the more biological significance
    it have (survived through selection).

45
Multiple Sequence Alignments
  • NW algorithm can be extended from pairwise to gt 2
    sequences. Matrix become n dimensional.
  • O(MN) to O(Nm)
  • just 100 nucleotides from 5 species, this is
  • 100e5 10,000,000,000 operations
  • Eg. Like alignment of 2 seqs 100,000 nt long
  • Most popular is Higgins and Sharp (1989) Clustal

46
4 Steps in CLustal
  • All pairwise similarity scores calculated using
    Rapid alignment methods
  • Create a similarity matrix and to cluster the
    sequences based on this similarity using a
    cluster algorithm
  • Create an alignment of clusters via a consensus
    method.
  • Create a progressive multiple alignment by
    sequentially aligning groups of sequences
    according to their branching order in the
    clustering.
  • ClustalW a local alignment algorithm
  • ClustalV the older global alignment method

47
(No Transcript)
48
  • MSA ( Gupta, Kececioglu Schaffer, 1995)
  • MACAW ( Schuler, Altschul Lipman, 1991).
  • a multiple dimensional dot plot and then look for
    dots that are common to each group ( Vingron
    Argos 1991).)

49
  • Need to construct alignment and phylogeny at the
    same time.
  • Eg TREEALIGN
  • When all is said and done, people will still find
    that the alignments produced by the programs can
    be improved by a judicious and critical
    examination by eye. Spending time to slowly and
    carefully re-examine your alignments by hand is
    recommended
  • http//helix.biology.mcmaster.ca/721/outline2/node
    37.html

50
FastDNAML Phylogenetic analysis Construction of
a Phylogenetic tree http//www.bic.nus.edu.sg/ bi
ogrid/phylo/cascon.html
51
Major Milestones in bioinformatics
  • http//www.ncbi.nlm.nih.gov/Education/BLASTinfo/mi
    lestones.html
  • 1962 Pauling's theory of molecular evolution
  • 1965 Margaret Dayhoff's Atlas of Protein
    Sequences
  • 1970 Needleman-Wunsch algorithm
  • 1977 DNA sequencing and software to analyze it
    (Staden)
  • 1981 Smith-Waterman algorithm developed
  • 1981 The concept of a sequence motif (Doolittle)
  • 1982 GenBank Release 3 made public
  • 1982 Phage lambda genome sequenced
  • 1983 Sequence database searching algorithm
    (Wilbur-Lipman)
  • 1985 FASTP/FASTN fast sequence similarity
    searching
  • 1988 National Center for Biotechnology
    Information (NCBI) created at NIH/NLM
  • 1988 EMBnet network for database distribution
  • 1990 BLAST fast sequence similarity searching
  • 1991 EST expressed sequence tag sequencing
  • 1993 Sanger Centre, Hinxton, UK
  • 1994 EMBL European Bioinformatics Institute,
    Hinxton, UK
  • 1995 First bacterial genomes completely sequenced
  • 1996 Yeast genome completely sequenced
Write a Comment
User Comments (0)
About PowerShow.com