BCB 444/544

BCB 444/544 F07 ISU Dobbs #6 - Scoring Matrices & Alignment Stats.

1
BCB 444/544
  • Lecture 6
  • Finish Dynamic Programming
  • Scoring Matrices
  • Alignment Statistics
  • 6_Aug31

2
Required Reading (before lecture)
  • Mon Aug 27 - for Lecture 4
  • Pairwise Sequence Alignment
  • Chp 3 - pp 31-41
  • Wed Aug 29 - for Lecture 5
  • Dynamic Programming
  • Eddy (2004) What is Dynamic Programming? Nature
    Biotechnol 22:909
  • http://www.nature.com/nbt/journal/v22/n7/abs/nbt0704-909.html
  • Thurs Aug 30 - Lab 2
  • Databases, ISU Resources & Pairwise Sequence
    Alignment
  • Fri Aug 31 - for Lecture 6
  • Scoring Matrices & Alignment Statistics
  • Chp 3 - pp 41-49

3
Announcements
  • Fri Aug 31 - Revised notes for Lecture 5 posted
    online
  • Changes? mainly re-ordering, symbols, color
    "coding"
  • Mon Sept 3 - NO CLASSES AT ISU (Labor Day)!! -
    Enjoy!!
  • Tues Sept 4 - Lab 2 Exercise Writeup Due by 5 PM
    (or sooner!)
  • Send via email to Pete Zaback
    petez_at_iastate.edu
  • (HW2 assignment will be posted online)
  • Fri Sept 14 - HW2 Due by 5 PM (or sooner!)
  • Fri Sept 21 - Exam 1

4
Chp 3 - Sequence Alignment
  • SECTION II: SEQUENCE ALIGNMENT
  • Xiong Chp 3
  • Pairwise Sequence Alignment
  • ✓ Evolutionary Basis
  • ✓ Sequence Homology versus Sequence Similarity
  • ✓ Sequence Similarity versus Sequence Identity
  • Methods - cont
  • Scoring Matrices
  • Statistical Significance of Sequence Alignment

Adapted from Brown and Caragea, 2007, with some
slides from Altman, Fernandez-Baca, Batzoglou,
Craven, Hunter, Page.
5
Methods
  • ✓ Global and Local Alignment
  • ✓ Alignment Algorithms
  • ✓ Dot Matrix Method
  • Dynamic Programming Method - cont
  • Gap penalties
  • DP for Global Alignment
  • DP for Local Alignment
  • Scoring Matrices
  • Amino acid scoring matrices
  • PAM
  • BLOSUM
  • Comparisons between PAM & BLOSUM
  • Statistical Significance of Sequence Alignment

6
Sequence Homology vs Similarity
  • Homologous sequences - sequences that share a
    common evolutionary ancestry
  • Similar sequences - sequences that have a high
    percentage of aligned residues with similar
    physicochemical properties
  • (e.g., size, hydrophobicity, charge)
  • IMPORTANT
  • Sequence homology
  • An inference about a common ancestral
    relationship, drawn when two sequences share a
    high enough degree of sequence similarity
  • Homology is qualitative
  • Sequence similarity
  • The direct result of observation from a sequence
    alignment
  • Similarity is quantitative; it can be described
    using percentages

7
Goal of Sequence Alignment
  • Find the best pairing of 2 sequences, such that
    there is maximum correspondence between residues
  • DNA: 4-letter alphabet (+ gap)
  • TTGACAC
  • TTTACAC
  • Proteins: 20-letter alphabet (+ gap)
  • RKVA-GMA
  • RKIAVAMA
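
Because similarity can be described using percentages, the two example alignments above can be summarized as percent identity. The following is a minimal sketch; the function name `percent_identity` and the choice to divide by the full alignment length (gap columns included) are assumptions, and other denominators are also in common use.

```python
def percent_identity(row_a: str, row_b: str) -> float:
    """Percent of alignment columns holding the same residue in both rows.

    Assumption: the denominator is the full alignment length, so a column
    containing a gap ('-') counts as a non-identical column.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    identical = sum(1 for a, b in zip(row_a, row_b) if a == b and a != "-")
    return 100.0 * identical / len(row_a)


dna = percent_identity("TTGACAC", "TTTACAC")        # 6 of 7 columns identical
protein = percent_identity("RKVA-GMA", "RKIAVAMA")  # 5 of 8 columns identical
print(f"DNA: {dna:.1f}%  protein: {protein:.1f}%")
```

Percent identity only counts exact matches; a percent similarity would additionally count substitutions between residues with similar physicochemical properties.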

8
Statement of Problem
  • Given
  • 2 sequences
  • Scoring system for evaluating match (or mismatch)
    of two characters
  • Penalty function for gaps in sequences
  • Find: optimal pairing of sequences that
  • Retains the order of characters
  • Introduces gaps where needed
  • Maximizes total score

9
Avoiding Random Alignments with a Scoring
Function
  • Introducing too many gaps generates nonsense
    alignments
  • s--e-----qu---en--ce
  • sometimesquipsentice
  • Need to distinguish between alignments that occur
    due to homology and those that occur by chance
  • Define a scoring function that rewards matches
    (+) and penalizes mismatches (-) and gaps (-)

Scoring Function (S), e.g.:
Match: α = 1   Mismatch: β = 1   Gap: γ = 0
S = α(matches) - β(mismatches) - γ(gaps)
Note: I changed symbols & colors on this slide!
10
Not All Mismatches are the Same
  • Some amino acids are more "exchangeable" than
    others (physicochemical properties are similar)
  • e.g., Ser & Thr are more similar than Trp & Ala
  • Substitution matrix can be used to introduce
    "mismatch costs" for handling different types of
    substitutions
  • Mismatch costs are not usually used in aligning
    DNA or RNA sequences, because no substitution is
    "better" than any other (in general)

11
Substitution Matrix
  • s(a,b) corresponds to score of aligning character
    a with character b
  • Match scores are often calculated based on
    frequency of mutations in very similar sequences
  • (more details later)
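
As an illustration of s(a,b), here is a minimal sketch that stores a substitution matrix as a dictionary keyed by character pairs. The helper name `make_identity_matrix` and the scores (1 for a match, -1 for a mismatch) are assumptions chosen for the example; they are not values from a published matrix such as PAM or BLOSUM.

```python
from itertools import product


def make_identity_matrix(alphabet: str, match: int = 1, mismatch: int = -1) -> dict:
    """Build s(a, b) for every ordered pair of characters in `alphabet`."""
    return {(a, b): (match if a == b else mismatch)
            for a, b in product(alphabet, repeat=2)}


s = make_identity_matrix("ACGT")
print(s[("A", "A")], s[("A", "G")])  # score of a match, then of a mismatch
```

A protein matrix would have the same shape with 20 x 20 entries, and different mismatch scores for different residue pairs.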

12
Global vs Local Alignment
  • Global alignment
  • Finds best possible alignment across entire
    length of 2 sequences
  • Aligned sequences assumed to be generally similar
    over entire length
  • Local alignment
  • Finds local regions with highest similarity
    between 2 sequences
  • Aligns these without regard for rest of sequence
  • Sequences are not assumed to be similar over
    entire length

13
Global vs Local Alignment - example
1 = CTGTCGCTGCACG
2 = TGCCGTG
Which is better?
14
Global vs Local Alignment: Which should be used
when?
  • It is critical to choose correct method!
  • Global Alignment vs Local Alignment?
  • Shout out the answers!! Which should we use for:
  • Searching for conserved motifs in DNA or protein
    sequences?
  • Aligning two closely related sequences with
    similar lengths?
  • Aligning highly divergent sequences?
  • Generating an extended alignment of closely
    related sequences?
  • Generating an extended alignment of closely
    related sequences with very different lengths?
  • Hmmm - we'll work on that
  • Excellent!

15
Alignment Algorithms
  • 3 major methods for pairwise sequence alignment
  • Dot matrix analysis
  • Dynamic programming
  • Word or k-tuple methods (later, in Chp 4)

16
Dot Matrix Method (Dot Plots)
  • Place 1 sequence along top row of matrix
  • Place 2nd sequence along left column of matrix
  • Plot a dot each time there is a match between an
    element of row sequence and an element of column
    sequence
  • For proteins, usually use more sophisticated
    scoring schemes than "identical match"
  • Diagonal lines indicate areas of match
  • Contiguous diagonal lines reveal alignment;
    "breaks" = gaps (indels)

17
Interpretation of Dot Plots
  • When comparing 2 sequences
  • Diagonal lines of dots indicate regions of
    similarity between 2 sequences
  • Reverse diagonals (perpendicular to diagonal)
    indicate inversions
  • What do such patterns mean when comparing a
    sequence with itself (or its reverse complement)?
  • e.g. Reverse diagonals crossing diagonals (X's)
    indicate palindromes
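
The dot matrix method can be sketched in a few lines. This is a minimal version that marks identical characters only (the "identical match" scheme, without the more sophisticated scoring usually used for proteins); the function name `dot_plot` and the use of the two sequences from the dynamic programming example later in this lecture are assumptions.

```python
def dot_plot(top: str, left: str) -> list[str]:
    """One text row per character of `left`; '*' marks an identical pair."""
    return ["".join("*" if a == b else "." for a in top) for b in left]


# One sequence along the top row, the other along the left column.
for row in dot_plot("CTCGCAGC", "CATTCAC"):
    print(row)

# Comparing a sequence with itself always gives a full main diagonal.
for row in dot_plot("ACG", "ACG"):
    print(row)
```

Real dot plot programs usually add a sliding window and a threshold to suppress isolated random dots; that filtering is left out here.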

Exploring Dot Plots
18
Dynamic Programming
For pairwise sequence alignment
Idea: Display one sequence above another with
spaces inserted in both to reveal similarity
  • C A T - T C A - C
  • C - T C G C A G C

19
Global Alignment Scoring
CTGTCG-CTGCACG
-TGC-CG-TG----
Reward for matches: α   Mismatch penalty: β
Space/gap penalty: γ
Score = αw - βx - γy
w = # matches, x = # mismatches, y = # spaces
Note: I changed symbols & colors on this slide!
20
Global Alignment Scoring
Reward for matches: 10   Mismatch penalty: -2
Space/gap penalty: -5
  C   T   G   T   -   C   G   C   T   G   C
  -   T   G   C   C   G   -   -   T   G   -
 -5  10  10  -2  -5  -2  -5  -5  10  10  -5
Total: 11
Note: I changed symbols & colors on this slide!
We could have done better!!
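
The scoring of a given alignment can be checked mechanically. The sketch below is illustrative only: the function name `alignment_score` is an assumption, and the first alignment is the one implied by the eleven column scores listed above. The second example reuses the "nonsense" alignment from the scoring-function slide to show why gaps need a cost.

```python
def alignment_score(row_a: str, row_b: str,
                    match: int, mismatch: int, gap: int) -> int:
    """Sum per-column scores; `mismatch` and `gap` are given as the
    (usually negative) values that get added to the total."""
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            total += gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


# Reward 10 per match, -2 per mismatch, -5 per space.
print(alignment_score("CTGT-CGCTGC", "-TGCCG--TG-", 10, -2, -5))

# The gap-riddled alignment scores well when gaps are free ...
nonsense = ("s--e-----qu---en--ce", "sometimesquipsentice")
print(alignment_score(*nonsense, match=1, mismatch=-1, gap=0))
# ... and poorly once each gap column costs something.
print(alignment_score(*nonsense, match=1, mismatch=-1, gap=-1))
```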
21
Alignment Algorithms
  • Global: Needleman-Wunsch (NW)
  • Local: Smith-Waterman (SW)
  • Both NW and SW use dynamic programming
  • Variations
  • Gap penalty functions
  • Scoring matrices

22
Dynamic Programming - Key Idea
  • The score of the best possible alignment that
    ends at a given pair of positions (i, j) is equal
    to
  • the score of best alignment ending just
    previous to those two positions (i.e., ending at
    i-1, j-1)
  • PLUS
  • the score for aligning xi and yj

23
Global Alignment: DP Problem Formulation &
Notations
  • Given two sequences (strings)
  • X = x1 x2 ... xN of length N (e.g., x = AGC, N = 3)
  • Y = y1 y2 ... yM of length M (e.g., y = AAAC, M = 4)
  • Construct a matrix with (N+1) x (M+1) elements,
    where
  • S(i,j) = Score of best alignment of
    x[1..i] = x1 x2 ... xi with y[1..j] = y1 y2 ... yj

Which means: Score of best alignment of a prefix
of X and a prefix of Y
24
Dynamic Programming - 4 Steps
  1. Define score of optimum alignment, using
    recursion
  2. Initialize and fill in a DP matrix for storing
    optimal scores of subproblems, by solving
    smallest subproblems first (bottom-up approach)
  3. Calculate score of optimum alignment(s)
  4. Trace back through matrix to recover optimum
    alignment(s) that generated optimal score

25
1- Define Score of Optimum Alignment using
Recursion
Define
Initial conditions
Recursive definition: For 1 ≤ i ≤ N, 1 ≤ j ≤ M
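
For reference, the standard Needleman-Wunsch recursion with a linear gap penalty can be written as follows. This is a sketch in the notation of this lecture, assuming s(a,b) is the substitution score and γ > 0 is the penalty per space (γ = 5 in the worked example); the symbol γ is an assumption.

```latex
\begin{align*}
S(0,0) &= 0, \qquad S(i,0) = -i\gamma, \qquad S(0,j) = -j\gamma \\
S(i,j) &= \max
\begin{cases}
S(i-1,j-1) + s(x_i, y_j) & \text{align } x_i \text{ with } y_j \\
S(i-1,j) - \gamma & \text{align } x_i \text{ with a space} \\
S(i,j-1) - \gamma & \text{align } y_j \text{ with a space}
\end{cases}
\qquad 1 \le i \le N,\ 1 \le j \le M
\end{align*}
```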
26
2- Initialize & Fill in DP Matrix for Storing
Optimal Scores of Subproblems
  • Construct sequence vs sequence matrix

Recursion
Initialization
27
2- cont Fill in DP Matrix
  • Fill in from 0,0 to N,M (row by row),
    calculating best possible score for each
    alignment including residues at i,j
  • Keep track of dependencies of scores (in a
    pointer matrix).

28
3- Calculate Score S(N,M) of Optimum Alignment
- for Global Alignment
  • What happens in last step in alignment of x1..i
    to y1..j?
  • 1 of 3 cases applies

29
Example
30
Fill in the matrix
        λ    C    T    C    G    C    A    G    C
   λ    0   -5  -10  -15  -20  -25  -30  -35  -40
   C        10    5
   A
   T
   T
   C
   A
   C
10 for match, -2 for mismatch, -5 for space
31
Calculate score of optimum alignment
        λ    C    T    C    G    C    A    G    C
   λ    0   -5  -10  -15  -20  -25  -30  -35  -40
   C   -5   10    5    0   -5  -10  -15  -20  -25
   A  -10    5    8    3   -2   -7    0   -5  -10
   T  -15    0   15   10    5    0   -5   -2   -7
   T  -20   -5   10   13    8    3   -2   -7   -4
   C  -25  -10    5   20   15   18   13    8    3
   A  -30  -15    0   15   18   13   28   23   18
   C  -35  -20   -5   10   13   28   23   26   33
10 for match, -2 for mismatch, -5 for space
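
The matrix above can be reproduced with a short program. This is a minimal sketch of the fill step only, assuming the top sequence is CTCGCAGC and the left sequence is CATTCAC; the function name `nw_matrix` and its parameter names are assumptions.

```python
def nw_matrix(top: str, left: str,
              match: int = 10, mismatch: int = -2, gap: int = -5) -> list[list[int]]:
    """Fill the global-alignment score matrix row by row.

    Rows follow `left`, columns follow `top`; S[r][c] is the best score for
    aligning the first r characters of `left` with the first c of `top`.
    """
    rows, cols = len(left) + 1, len(top) + 1
    S = [[0] * cols for _ in range(rows)]
    for c in range(1, cols):            # top row: leading spaces only
        S[0][c] = c * gap
    for r in range(1, rows):            # left column: leading spaces only
        S[r][0] = r * gap
    for r in range(1, rows):
        for c in range(1, cols):
            pair = match if left[r - 1] == top[c - 1] else mismatch
            S[r][c] = max(S[r - 1][c - 1] + pair,   # diagonal: pair two characters
                          S[r - 1][c] + gap,        # vertical: space in top sequence
                          S[r][c - 1] + gap)        # horizontal: space in left sequence
    return S


S = nw_matrix("CTCGCAGC", "CATTCAC")
for row in S:
    print(" ".join(f"{v:4d}" for v in row))
print("optimal score:", S[-1][-1])
```

The optimal global score is the value in the lower right corner.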
32
4- Trace back through matrix to recover optimum
alignment(s) that generated the optimal score
  • How? "Repeat" alignment calculations in reverse
    order, starting from position with highest
    score and following path, position by position,
    back through matrix
  • Result? Optimal alignment(s) of sequences

33
Traceback - for Global Alignment
  • Start in lower right corner & trace back to upper
    left
  • Each arrow introduces one character at end of
    sequence alignment
  • A horizontal move puts a gap in left sequence
  • A vertical move puts a gap in top sequence
  • A diagonal move uses one character from each
    sequence

34
Traceback to Recover Alignment
        λ    C    T    C    G    C    A    G    C
   λ    0   -5  -10  -15  -20  -25  -30  -35  -40
   C   -5   10    5    0   -5  -10  -15  -20  -25
   A  -10    5    8    3   -2   -7    0   -5  -10
   T  -15    0   15   10    5    0   -5   -2   -7
   T  -20   -5   10   13    8    3   -2   -7   -4
   C  -25  -10    5   20   15   18   13    8    3
   A  -30  -15    0   15   18   13   28   23   18
   C  -35  -20   -5   10   13   28   23   26   33
Can have >1 optimum alignment; this example has 2
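
A traceback can be sketched as follows. The code refills the matrix and then walks back from the lower right corner, re-checking which of the three moves produced each score. It returns only one optimum; the tie-breaking order (diagonal, then vertical, then horizontal) and the function name `global_align` are assumptions.

```python
def global_align(top, left, match=10, mismatch=-2, gap=-5):
    """Needleman-Wunsch fill plus traceback; returns (score, top_row, left_row)."""
    rows, cols = len(left) + 1, len(top) + 1
    S = [[0] * cols for _ in range(rows)]
    for c in range(1, cols):
        S[0][c] = c * gap
    for r in range(1, rows):
        S[r][0] = r * gap
        for c in range(1, cols):
            pair = match if left[r - 1] == top[c - 1] else mismatch
            S[r][c] = max(S[r - 1][c - 1] + pair, S[r - 1][c] + gap, S[r][c - 1] + gap)

    # Trace back from the lower right corner to the upper left corner.
    r, c, out_top, out_left = rows - 1, cols - 1, [], []
    while r > 0 or c > 0:
        pair = match if r > 0 and c > 0 and left[r - 1] == top[c - 1] else mismatch
        if r > 0 and c > 0 and S[r][c] == S[r - 1][c - 1] + pair:
            out_top.append(top[c - 1])       # diagonal: one character from each
            out_left.append(left[r - 1])
            r, c = r - 1, c - 1
        elif r > 0 and S[r][c] == S[r - 1][c] + gap:
            out_top.append("-")              # vertical: gap in top sequence
            out_left.append(left[r - 1])
            r -= 1
        else:
            out_top.append(top[c - 1])       # horizontal: gap in left sequence
            out_left.append("-")
            c -= 1
    return S[-1][-1], "".join(reversed(out_top)), "".join(reversed(out_left))


score, top_row, left_row = global_align("CTCGCAGC", "CATTCAC")
print(score)
print(top_row)
print(left_row)
```

Collecting every optimum instead of one would mean following all moves that satisfy the equality test at each cell, not just the first.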
35
Local Alignment Motivation
  • To "ignore" stretches of non-coding DNA
  • Non-coding regions (if "non-functional") are more
    likely to contain mutations than coding regions
  • Local alignment between two protein-encoding
    sequences is likely to be between two exons
  • To locate protein domains or motifs
  • Proteins with similar structures and/or similar
    functions but from different species (for
    example), often exhibit local sequence
    similarities
  • Local sequence similarities may indicate
    functional modules

Non-coding = "not encoding protein"
Exons = "protein-encoding" parts of genes, vs
Introns = "intervening sequences" - segments of
eukaryotic genes that "interrupt" exons
Introns are transcribed into RNA, but are later
removed by RNA processing & are not translated
into protein
36
Local Alignment Example
Sequences: g g t c t g a g and a a a c g a
Match: 2   Mismatch or space: -1
Best local alignment (c t g a with c - g a):
g g t c t g a g
a a a c - g a
Score: 5
37
Local Alignment Algorithm
  • S(i,j) = Score for optimally aligning a suffix
    of x1..i with a suffix of y1..j
  • Initialize top row & leftmost column of matrix
    with "0"
  • Recall: for Global Alignment,
  • S(i,j) = Score for optimally aligning a
    prefix of X with a prefix of Y
  • Initialize top row & leftmost column of matrix
    with gap penalties
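
For comparison with the global case, the standard Smith-Waterman recursion with a linear gap penalty can be sketched as follows. As before, s(a,b) is the substitution score and γ > 0 is an assumed symbol for the penalty per space; the extra 0 lets an alignment start anywhere.

```latex
\begin{align*}
S(i,0) &= 0, \qquad S(0,j) = 0 \\
S(i,j) &= \max
\begin{cases}
0 & \text{start a new local alignment} \\
S(i-1,j-1) + s(x_i, y_j) & \text{align } x_i \text{ with } y_j \\
S(i-1,j) - \gamma & \text{align } x_i \text{ with a space} \\
S(i,j-1) - \gamma & \text{align } y_j \text{ with a space}
\end{cases}
\end{align*}
```

The score of the best local alignment is the largest entry anywhere in the matrix, not the lower right corner.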

38
Traceback - for Local Alignment
     λ  C  T  C  G  C  A  G  C
  λ  0  0  0  0  0  0  0  0  0
  C  0  1  0  1  0  1  0  0  1
  A  0  0  0  0  0  0  2  0  0
  T  0  0  1  0  0  0  0  1  0
  T  0  0  1  0  0  0  0  0  0
  C  0  1  0  2  0  1  0  0  1
  A  0  0  0  0  1  0  2  0  0
  C  0  1  0  1  0  2  0  1  1
1 for a match, -1 for a mismatch, -5 for a space
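
A minimal Smith-Waterman sketch makes the differences from the global case concrete: scores are floored at 0, and the traceback starts at the highest-scoring cell and stops at the first 0. The function name `local_align`, the tie-breaking order, and the choice to report the first maximum found are assumptions.

```python
def local_align(top, left, match=1, mismatch=-1, gap=-5):
    """Smith-Waterman fill plus traceback; returns (score, top_part, left_part)."""
    rows, cols = len(left) + 1, len(top) + 1
    S = [[0] * cols for _ in range(rows)]       # top row and left column stay 0
    best, best_pos = 0, (0, 0)
    for r in range(1, rows):
        for c in range(1, cols):
            pair = match if left[r - 1] == top[c - 1] else mismatch
            S[r][c] = max(0, S[r - 1][c - 1] + pair,
                          S[r - 1][c] + gap, S[r][c - 1] + gap)
            if S[r][c] > best:
                best, best_pos = S[r][c], (r, c)

    # Trace back from the highest-scoring cell until a 0 is reached.
    r, c = best_pos
    out_top, out_left = [], []
    while r > 0 and c > 0 and S[r][c] > 0:
        pair = match if left[r - 1] == top[c - 1] else mismatch
        if S[r][c] == S[r - 1][c - 1] + pair:
            out_top.append(top[c - 1])
            out_left.append(left[r - 1])
            r, c = r - 1, c - 1
        elif S[r][c] == S[r - 1][c] + gap:
            out_top.append("-")
            out_left.append(left[r - 1])
            r -= 1
        else:
            out_top.append(top[c - 1])
            out_left.append("-")
            c -= 1
    return best, "".join(reversed(out_top)), "".join(reversed(out_left))


# Matrix above: 1 for a match, -1 for a mismatch, -5 for a space.
print(local_align("CTCGCAGC", "CATTCAC"))
# Earlier example: match 2, mismatch or space -1.
print(local_align("ggtctgag", "aaacga", match=2, mismatch=-1, gap=-1))
```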
39
Some Results re: Alignment Algorithms (for
ComS, CprE & Math types!)
  • Most pairwise sequence alignment problems can be
    solved in O(mn) time
  • Space requirement can be reduced to O(m+n), while
    keeping run-time fixed [Myers88]
  • Highly similar sequences can be aligned in O(dn)
    time, where d measures the distance between the
    sequences [Landau86]
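
The space saving can be illustrated with a simpler trick than the cited linear-space algorithm: if only the optimal score is needed, two rows of the matrix suffice. The sketch below is that two-row variant, not the Myers88 method itself (which also recovers the alignment); the function name `nw_score_two_rows` is an assumption.

```python
def nw_score_two_rows(top, left, match=10, mismatch=-2, gap=-5):
    """Optimal global score using O(len(top)) extra space and O(mn) time."""
    prev = [c * gap for c in range(len(top) + 1)]    # row 0 of the matrix
    for r in range(1, len(left) + 1):
        curr = [r * gap]                             # first column of row r
        for c in range(1, len(top) + 1):
            pair = match if left[r - 1] == top[c - 1] else mismatch
            curr.append(max(prev[c - 1] + pair,      # diagonal
                            prev[c] + gap,           # vertical
                            curr[c - 1] + gap))      # horizontal
        prev = curr                                  # discard the older row
    return prev[-1]


print(nw_score_two_rows("CTCGCAGC", "CATTCAC"))
```

Because earlier rows are discarded, this version cannot trace back on its own; recovering the alignment in linear space needs a divide-and-conquer step on top of it.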

40
"Scoring" or "Substitution" Matrices
  • 2 major types for amino acids: PAM & BLOSUM
  • PAM = Point Accepted Mutation
  • relies on "evolutionary model" based on
    observed differences in alignments of closely
    related proteins
  • BLOSUM = BLOck SUbstitution Matrix
  • based on aa substitutions observed in blocks
    of conserved sequences within evolutionarily
    divergent proteins

41
PAM Matrix
  • PAM = Point Accepted Mutation
  • relies on "evolutionary model" based on observed
    differences in closely related proteins
  • Model includes defined rate for each type of
    sequence change
  • Suffix number (n) reflects amount of "time"
    passed: rate of expected mutation if n% of amino
    acids had changed
  • PAM1 - for less divergent sequences (shorter
    time)
  • PAM250 - for more divergent sequences (longer
    time)

42
BLOSUM Matrix
  • BLOSUM = BLOck SUbstitution Matrix
  • based on aa substitutions observed in blocks
    of conserved sequences within evolutionarily
    divergent proteins
  • Doesn't rely on a specific evolutionary model
  • Suffix number (n) reflects expected similarity:
    average % aa identity in the MSA from which the
    matrix was generated
  • BLOSUM45 - for more divergent sequences
  • BLOSUM62 - for less divergent sequences

43
Statistical Significance of Sequence Alignment
44
Affine Gap Penalty Functions
  • Gap penalty = h + gk
  • where
  • k = length of gap
  • h = gap opening penalty
  • g = gap extension penalty

Can also be solved in O(nm) time using dynamic
programming
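
One common way to get the O(nm) bound with affine gaps is to keep three matrices, one per "last column type" (often attributed to Gotoh). The sketch below computes the optimal global score only. The function name, the default values h = 4 and g = 1, and the match/mismatch scores are assumptions; h and g are positive penalties as in the formula above.

```python
NEG_INF = float("-inf")


def affine_global_score(x, y, match=10, mismatch=-2, h=4, g=1):
    """Optimal global score when a gap of length k costs h + g*k."""
    n, m = len(x), len(y)
    # M: last column pairs x[i] with y[j]; X: x[i] opposite a space;
    # Y: y[j] opposite a space.
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = -(h + g * i)          # one leading gap of length i
    for j in range(1, m + 1):
        Y[0][j] = -(h + g * j)          # one leading gap of length j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if x[i - 1] == y[j - 1] else mismatch
            M[i][j] = pair + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            # Opening a gap costs h + g; extending an open gap costs g.
            X[i][j] = max(M[i - 1][j] - h - g, Y[i - 1][j] - h - g, X[i - 1][j] - g)
            Y[i][j] = max(M[i][j - 1] - h - g, X[i][j - 1] - h - g, Y[i][j - 1] - g)
    return max(M[n][m], X[n][m], Y[n][m])


# With h = 0 the affine penalty reduces to a linear penalty of g per space.
print(affine_global_score("CTCGCAGC", "CATTCAC", h=0, g=5))
# One gap of length 2 (cost 4 + 2) beats two separate gaps (cost 5 + 5).
print(affine_global_score("ACGT", "AT", h=4, g=1))
```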