BCB 444/544

BCB 444/544 F07 ISU Dobbs #6 - Scoring Matrices & Alignment Stats

1
BCB 444/544

Lecture 6
Finish Dynamic Programming
Scoring Matrices
Alignment Statistics
6_Aug31

2
Required Reading (before lecture)

Mon Aug 27 - for Lecture 4
Pairwise Sequence Alignment
Chp 3 - pp 31-41
Wed Aug 29 - for Lecture 5
Dynamic Programming
Eddy, What is Dynamic Programming? 2004 Nature Biotechnol 22:909
http://www.nature.com/nbt/journal/v22/n7/abs/nbt0704-909.html
Thurs Aug 30 - Lab 2
Databases, ISU Resources & Pairwise Sequence
Alignment
Fri Aug 31 - for Lecture 6
Scoring Matrices & Alignment Statistics
Chp 3 - pp 41-49

3
Announcements

Fri Aug 31 - Revised notes for Lecture 5 posted
online
Changes? mainly re-ordering, symbols, color
"coding"
Mon Sept 3 - NO CLASSES AT ISU (Labor Day)!! -
Enjoy!!
Tues Sept 4 - Lab 2 Exercise Writeup Due by 5 PM
(or sooner!)
Send via email to Pete Zaback
petez_at_iastate.edu
(HW2 assignment will be posted online)
Fri Sept 14 - HW2 Due by 5 PM (or sooner!)
Fri Sept 21 - Exam 1

4
Chp 3- Sequence Alignment

SECTION II: SEQUENCE ALIGNMENT
Xiong: Chp 3
Pairwise Sequence Alignment
✓ Evolutionary Basis
✓ Sequence Homology versus Sequence Similarity
✓ Sequence Similarity versus Sequence Identity
Methods - cont
Scoring Matrices
Statistical Significance of Sequence Alignment

Adapted from Brown and Caragea, 2007, with some
slides from Altman, Fernandez-Baca, Batzoglou,
Craven, Hunter, Page.
5
Methods

✓ Global and Local Alignment
✓ Alignment Algorithms
✓ Dot Matrix Method
Dynamic Programming Method - cont
Gap penalties
DP for Global Alignment
DP for Local Alignment
Scoring Matrices
Amino acid scoring matrices
PAM
BLOSUM
Comparisons between PAM & BLOSUM
Statistical Significance of Sequence Alignment

6
Sequence Homology vs Similarity

Homologous sequences - sequences that share a
common evolutionary ancestry
Similar sequences - sequences that have a high
percentage of aligned residues with similar
physicochemical properties
(e.g., size, hydrophobicity, charge)

IMPORTANT
Sequence homology
An inference about a common ancestral
relationship, drawn when two sequences share a
high enough degree of sequence similarity
Homology is qualitative
Sequence similarity
The direct result of observation from a sequence
alignment
Similarity is quantitative; it can be described
using percentages

7
Goal of Sequence Alignment

Find the best pairing of 2 sequences, such that
there is maximum correspondence between residues
DNA: 4-letter alphabet (+ gap)
TTGACAC
TTTACAC
Proteins: 20-letter alphabet (+ gap)
RKVA-GMA
RKIAVAMA
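
One simple way to put a number on "correspondence between residues" is the percentage of identical columns in an alignment. The sketch below applies that to the two example alignments on this slide. The function name `percent_identity` and the choice of alignment length as the denominator are assumptions made for illustration; other definitions of percent identity are in use.

```python
def percent_identity(row1: str, row2: str) -> float:
    """Percentage of alignment columns in which both rows hold the same residue."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    # A column counts as identical only if it holds the same residue (not a gap).
    identical = sum(1 for a, b in zip(row1, row2) if a == b and a != "-")
    return 100.0 * identical / len(row1)


dna = percent_identity("TTGACAC", "TTTACAC")        # 6 of 7 columns identical
protein = percent_identity("RKVA-GMA", "RKIAVAMA")  # 5 of 8 columns identical
print(f"DNA: {dna:.1f}%  protein: {protein:.1f}%")
```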

8
Statement of Problem

Given:
2 sequences
Scoring system for evaluating match (or mismatch)
of two characters
Penalty function for gaps in sequences
Find: Optimal pairing of sequences that
Retains the order of characters
Introduces gaps where needed
Maximizes total score

9
Avoiding Random Alignments with a Scoring
Function

Introducing too many gaps generates nonsense
alignments
s--e-----qu---en--ce
sometimesquipsentice
Need to distinguish between alignments that occur
due to homology and those that occur by chance
Define a scoring function that rewards matches
(+) and penalizes mismatches (-) and gaps (-)

Scoring Function (S), e.g.
Match: $\alpha = 1$   Mismatch: $\beta = 1$   Gap: $\gamma = 0$
$S = \alpha(\text{matches}) - \beta(\text{mismatches}) - \gamma(\text{gaps})$
Note: I changed symbols & colors on this slide!
10
Not All Mismatches are the Same

Some amino acids are more "exchangeable" than
others (physicochemical properties are similar)
e.g., Ser & Thr are more similar than Trp & Ala
Substitution matrix can be used to introduce
"mismatch costs" for handling different types of
substitutions
Mismatch costs are not usually used in aligning
DNA or RNA sequences, because no substitution is
"better" than any other (in general)

11
Substitution Matrix

s(a,b) corresponds to score of aligning character
a with character b
Match scores are often calculated
based on frequency of mutations in very similar
sequences
(more details later)
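
A substitution matrix can be pictured as a symmetric lookup table. The sketch below is only an illustration: the handful of entries are BLOSUM62 values quoted from memory and should be checked against a published copy of the matrix, and the names `PARTIAL_BLOSUM62` and `s` are chosen here to mirror the s(a,b) notation.

```python
# A few BLOSUM62 entries, quoted from memory for illustration only;
# verify against a published matrix before relying on them.
PARTIAL_BLOSUM62 = {
    ("S", "S"): 4, ("T", "T"): 5, ("W", "W"): 11, ("A", "A"): 4,
    ("S", "T"): 1, ("W", "A"): -3,
}


def s(a: str, b: str) -> int:
    """Score for aligning residue a with residue b (symmetric lookup)."""
    if (a, b) in PARTIAL_BLOSUM62:
        return PARTIAL_BLOSUM62[(a, b)]
    return PARTIAL_BLOSUM62[(b, a)]


ser_thr = s("S", "T")  # physicochemically similar pair
trp_ala = s("A", "W")  # dissimilar pair, found via the symmetric entry
print(ser_thr, trp_ala)
```

With these entries the Ser/Thr substitution scores higher than Trp/Ala, which is the point made on the previous slide.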

12
Global vs Local Alignment

Global alignment
Finds best possible alignment across entire
length of 2 sequences
Aligned sequences assumed to be generally similar
over entire length

Local alignment
Finds local regions with highest similarity
between 2 sequences
Aligns these without regard for rest of sequence
Sequences are not assumed to be similar over
entire length

13
Global vs Local Alignment - example
Sequence 1: CTGTCGCTGCACG
Sequence 2: TGCCGTG
Which is better?
14
Global vs Local Alignment: Which should be used
when?

It is critical to choose correct method!
Global Alignment vs Local Alignment?
Shout out the answers!! Which should we use for?
Searching for conserved motifs in DNA or protein
sequences?
Aligning two closely related sequences with
similar lengths?
Aligning highly divergent sequences?
Generating an extended alignment of closely
related sequences?
Generating an extended alignment of closely
related sequences with very different lengths?
Hmmm - we'll work on that
Excellent!

15
Alignment Algorithms

3 major methods for pairwise sequence alignment
Dot matrix analysis
Dynamic programming
Word or k-tuple methods (later, in Chp 4)

16
Dot Matrix Method (Dot Plots)

Place 1 sequence along top row of matrix
Place 2nd sequence along left column of matrix
Plot a dot each time there is a match between an
element of row sequence and an element of column
sequence
For proteins, usually use more sophisticated
scoring schemes than "identical match"
Diagonal lines indicate areas of match
Contiguous diagonal lines reveal alignment;
"breaks" = gaps (indels)
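
The plotting rule above can be sketched in a few lines using the simplest scheme, identical match only. The function name `dot_matrix` and the text rendering ('*' for a dot, '.' for no dot) are assumptions made here; the two sequences are the ones used in the fill-in example later in this lecture.

```python
def dot_matrix(top: str, left: str) -> list[str]:
    """One text row per residue of `left`; '*' marks an identical residue in `top`."""
    return ["".join("*" if a == b else "." for b in top) for a in left]


plot = dot_matrix("CTCGCAGC", "CATTCAC")
print("  CTCGCAGC")
for residue, row in zip("CATTCAC", plot):
    print(residue, row)
```

Runs of '*' along a diagonal correspond to the diagonal lines described on the slide.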

17
Interpretation of Dot Plots

When comparing 2 sequences
Diagonal lines of dots indicate regions of
similarity between 2 sequences
Reverse diagonals (perpendicular to diagonal)
indicate inversions
What do such patterns mean when comparing a
sequence with itself (or its reverse complement)?
e.g. Reverse diagonals crossing diagonals (X's)
indicate palindromes

Exploring Dot Plots
18
Dynamic Programming
For Pairwise sequence alignment
Idea: Display one sequence above another with
spaces inserted in both to reveal similarity

C A T - T C A - C
C - T C G C A G C

19
Global Alignment Scoring
CTGTCG-CTGCACG
-TGC-CG-TG----
Reward for matches: $\alpha$   Mismatch penalty: $\beta$   Space/gap penalty: $\gamma$
Score $= \alpha w - \beta x - \gamma y$
$w$ = # matches, $x$ = # mismatches, $y$ = # spaces
Note: I changed symbols & colors on this slide!
20
Global Alignment Scoring
Reward for matches: 10   Mismatch penalty: -2   Space/gap penalty: -5
   C   T   G   T   C   G   -   C   T   G   C
   -   T   G   C   -   C   G   -   T   G   -
  -5  10  10  -2  -5  -2  -5  -5  10  10  -5
Total = 11
Note: I changed symbols & colors on this slide!
We could have done better!!
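
The per-column scores on this slide can be reproduced with a small scoring function. This is a minimal sketch: the name `alignment_score` is made up, and the two alignment rows are written out as inferred from the column scores listed on the slide.

```python
def alignment_score(row1, row2, match=10, mismatch=-2, gap=-5):
    """Sum the column scores of an alignment given as two equal-length rows."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            total += gap        # space/gap penalty
        elif a == b:
            total += match      # reward for a match
        else:
            total += mismatch   # mismatch penalty
    return total


score = alignment_score("CTGTCG-CTGC", "-TGC-CG-TG-")
print(score)
```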
21
Alignment Algorithms

Global: Needleman-Wunsch
Local: Smith-Waterman
Both NW and SW use dynamic programming
Variations
Gap penalty functions
Scoring matrices

22
Dynamic Programming - Key Idea

The score of the best possible alignment that
ends at a given pair of positions (i, j) is equal
to
the score of the best alignment ending just
previous to those two positions (i.e., ending at
(i-1, j-1))
PLUS
the score for aligning $x_i$ and $y_j$

23
Global Alignment DP Problem Formulation
Notations

Given two sequences (strings)
$X = x_1 x_2 \ldots x_N$ of length $N$   (e.g., $x$ = AGC, $N = 3$)
$Y = y_1 y_2 \ldots y_M$ of length $M$   (e.g., $y$ = AAAC, $M = 4$)
Construct a matrix with $(N+1) \times (M+1)$ elements,
where
$S(i,j)$ = Score of best alignment of
$x_{1..i} = x_1 x_2 \ldots x_i$ with $y_{1..j} = y_1 y_2 \ldots y_j$

Which means: Score of best alignment of a prefix
of X and a prefix of Y
24
Dynamic Programming - 4 Steps

Define score of optimum alignment, using
recursion
Initialize and fill in a DP matrix for storing
optimal scores of subproblems, by solving
smallest subproblems first (bottom-up approach)
Calculate score of optimum alignment(s)
Trace back through matrix to recover optimum
alignment(s) that generated optimal score

25
1- Define Score of Optimum Alignment using
Recursion
Define
Initial conditions
Recursive definition: for $1 \le i \le N$, $1 \le j \le M$
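
The three items named on this slide can be written out as follows. This is a standard Needleman-Wunsch formulation, given as a sketch consistent with the S(i,j) and s(a,b) notation used in this lecture; it assumes a linear gap penalty, written here as $d$ (a symbol introduced for this block; $d = 5$ in the worked example that follows).

```latex
\begin{aligned}
&\text{Define: } S(i,j) = \text{score of the best alignment of } x_1 \ldots x_i \text{ with } y_1 \ldots y_j \\[4pt]
&\text{Initial conditions: } S(0,0) = 0, \qquad S(i,0) = -i\,d, \qquad S(0,j) = -j\,d \\[4pt]
&\text{Recursive definition, for } 1 \le i \le N,\ 1 \le j \le M: \\
&\qquad S(i,j) = \max
\begin{cases}
S(i-1,\,j-1) + s(x_i, y_j) & x_i \text{ aligned with } y_j \\
S(i-1,\,j) - d & x_i \text{ aligned with a gap} \\
S(i,\,j-1) - d & y_j \text{ aligned with a gap}
\end{cases}
\end{aligned}
```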
26
2- Initialize & Fill in DP Matrix for Storing
Optimal Scores of Subproblems

Construct sequence vs sequence matrix

Recursion
Initialization
27
2- cont Fill in DP Matrix

Fill in from (0,0) to (N,M) (row by row),
calculating best
possible score for each alignment including
residues at (i,j)
Keep track of dependencies of scores (in a
pointer matrix).

28
3- Calculate Score S(N,M) of Optimum Alignment
- for Global Alignment

What happens in the last step in the alignment of $x_{1..i}$
to $y_{1..j}$?
1 of 3 cases applies: $x_i$ aligns to $y_j$, $x_i$ aligns to a gap, or $y_j$ aligns to a gap

29
Example
30
Fill in the matrix
        λ    C    T    C    G    C    A    G    C
   λ    0   -5  -10  -15  -20  -25  -30  -35  -40
   C
   A
   T
   T
   C
   A
   C
(λ = empty prefix; first entries filled in: 10, 5)
10 for match, -2 for mismatch, -5 for space
31
Calculate score of optimum alignment
        λ    C    T    C    G    C    A    G    C
   λ    0   -5  -10  -15  -20  -25  -30  -35  -40
   C   -5   10    5    0   -5  -10  -15  -20  -25
   A  -10    5    8    3   -2   -7    0   -5  -10
   T  -15    0   15   10    5    0   -5   -2   -7
   T  -20   -5   10   13    8    3   -2   -7   -4
   C  -25  -10    5   20   15   18   13    8    3
   A  -30  -15    0   15   18   13   28   23   18
   C  -35  -20   -5   10   13   28   23   26   33
10 for match, -2 for mismatch, -5 for space
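
The matrix on this slide can be reproduced with a short fill routine. This is a minimal sketch under the slide's scoring (10 for match, -2 for mismatch, -5 for space); the function name `nw_matrix` and its parameter names are made up for illustration, and penalties are passed as negative numbers that get added.

```python
def nw_matrix(top, left, match=10, mismatch=-2, gap=-5):
    """Fill the global-alignment DP matrix; rows follow `left`, columns follow `top`."""
    rows, cols = len(left) + 1, len(top) + 1
    S = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):           # top row: only gaps so far
        S[0][j] = S[0][j - 1] + gap
    for i in range(1, rows):           # leftmost column: only gaps so far
        S[i][0] = S[i - 1][0] + gap
    for i in range(1, rows):           # fill row by row
        for j in range(1, cols):
            pair = match if left[i - 1] == top[j - 1] else mismatch
            S[i][j] = max(
                S[i - 1][j - 1] + pair,  # diagonal: align two residues
                S[i - 1][j] + gap,       # vertical: gap in top sequence
                S[i][j - 1] + gap,       # horizontal: gap in left sequence
            )
    return S


S = nw_matrix("CTCGCAGC", "CATTCAC")
for row in S:
    print(" ".join(f"{v:4d}" for v in row))
```

The score of the optimum global alignment is the bottom-right entry, `S[-1][-1]`.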
32
4- Trace back through matrix to recover optimum
alignment(s) that generated the optimal score

How? "Repeat" alignment calculations in reverse
order, starting from the position with the highest
score and following the path, position by position,
back through the matrix
Result? Optimal alignment(s) of sequences

33
Traceback - for Global Alignment

Start in lower right corner & trace back to upper
left
Each arrow introduces one character at end of
sequence alignment
A horizontal move puts a gap in left sequence
A vertical move puts a gap in top sequence
A diagonal move uses one character from each
sequence

34
Traceback to Recover Alignment
        λ    C    T    C    G    C    A    G    C
   λ    0   -5  -10  -15  -20  -25  -30  -35  -40
   C   -5   10    5    0   -5  -10  -15  -20  -25
   A  -10    5    8    3   -2   -7    0   -5  -10
   T  -15    0   15   10    5    0   -5   -2   -7
   T  -20   -5   10   13    8    3   -2   -7   -4
   C  -25  -10    5   20   15   18   13    8    3
   A  -30  -15    0   15   18   13   28   23   18
   C  -35  -20   -5   10   13   28   23   26   33
Can have >1 optimum alignment; this example has 2
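
The traceback can be sketched as a recursive walk from the lower right corner that follows every move consistent with the stored scores, so ties produce more than one alignment. This is an illustrative sketch, not the lecture's own code; `nw_all_alignments` and `walk` are made-up names, and the scoring (10 / -2 / -5) is the one used in this example.

```python
def nw_all_alignments(top, left, match=10, mismatch=-2, gap=-5):
    """Return (optimal score, list of all optimal global alignments)."""
    n, m = len(left), len(top)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for j in range(1, m + 1):
        S[0][j] = j * gap
    for i in range(1, n + 1):
        S[i][0] = i * gap
        for j in range(1, m + 1):
            pair = match if left[i - 1] == top[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + pair, S[i - 1][j] + gap, S[i][j - 1] + gap)

    def walk(i, j):
        """Yield (top_row, left_row) for every optimal path from (0,0) to (i,j)."""
        if i == 0 and j == 0:
            yield "", ""
            return
        if i > 0 and j > 0:
            pair = match if left[i - 1] == top[j - 1] else mismatch
            if S[i][j] == S[i - 1][j - 1] + pair:     # diagonal move
                for a, b in walk(i - 1, j - 1):
                    yield a + top[j - 1], b + left[i - 1]
        if i > 0 and S[i][j] == S[i - 1][j] + gap:    # vertical move: gap in top
            for a, b in walk(i - 1, j):
                yield a + "-", b + left[i - 1]
        if j > 0 and S[i][j] == S[i][j - 1] + gap:    # horizontal move: gap in left
            for a, b in walk(i, j - 1):
                yield a + top[j - 1], b + "-"

    return S[n][m], list(walk(n, m))


score, alignments = nw_all_alignments("CTCGCAGC", "CATTCAC")
for top_row, left_row in alignments:
    print(top_row)
    print(left_row)
    print()
```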
35
Local Alignment Motivation

To "ignore" stretches of non-coding DNA
Non-coding regions (if "non-functional") are more
likely to contain mutations than coding regions
Local alignment between two protein-encoding
sequences is likely to be between two exons
To locate protein domains or motifs
Proteins with similar structures and/or similar
functions but from different species (for
example), often exhibit local sequence
similarities
Local sequence similarities may indicate
functional modules

Non-coding = "not encoding protein"
Exons = "protein-encoding" parts of genes, vs Introns =
"intervening sequences": segments of
eukaryotic genes that "interrupt" exons
Introns are transcribed into RNA, but are later
removed by RNA processing & are not translated
into protein
36
Local Alignment Example
Sequence 1: g g t c t g a g
Sequence 2: a a a c g a
Match: 2   Mismatch or space: -1
Best local alignment:
g g t c t g a g
a a a c - g a
Score = 5
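
The score of 5 can be checked with a small Smith-Waterman sketch. It assumes the two sequences on this slide are ggtctgag and aaacga and uses the slide's scoring (match 2, mismatch or space -1); the function name `smith_waterman` and its return format are made up for illustration, and when several cells tie for the best score it simply reports the first one found.

```python
def smith_waterman(x, y, match=2, mismatch=-1, gap=-1):
    """Return (best local score, aligned piece of x, aligned piece of y)."""
    S = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            pair = match if x[i - 1] == y[j - 1] else mismatch
            # Scores never drop below 0: a local alignment may start anywhere.
            S[i][j] = max(0, S[i - 1][j - 1] + pair, S[i - 1][j] + gap, S[i][j - 1] + gap)
            if S[i][j] > best:
                best, best_pos = S[i][j], (i, j)

    # Trace back from the highest-scoring cell until a 0 is reached.
    i, j = best_pos
    out_x, out_y = [], []
    while i > 0 and j > 0 and S[i][j] > 0:
        pair = match if x[i - 1] == y[j - 1] else mismatch
        if S[i][j] == S[i - 1][j - 1] + pair:
            out_x.append(x[i - 1])
            out_y.append(y[j - 1])
            i, j = i - 1, j - 1
        elif S[i][j] == S[i - 1][j] + gap:
            out_x.append(x[i - 1])
            out_y.append("-")
            i -= 1
        else:
            out_x.append("-")
            out_y.append(y[j - 1])
            j -= 1
    return best, "".join(reversed(out_x)), "".join(reversed(out_y))


best, piece_x, piece_y = smith_waterman("ggtctgag", "aaacga")
print(best)
print(piece_x)
print(piece_y)
```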
37
Local Alignment Algorithm

$S(i,j)$ = Score for optimally aligning a suffix
of $x_1 \ldots x_i$ with a suffix of $y_1 \ldots y_j$
Initialize top row & leftmost column of matrix
with "0"

Recall: for Global Alignment,
$S(i,j)$ = Score for optimally aligning a
prefix of X with a prefix of Y
Initialize top row & leftmost column with gap
penalty

38
Traceback - for Local Alignment
        λ    C    T    C    G    C    A    G    C
   λ    0    0    0    0    0    0    0    0    0
   C    0    1    0    1    0    1    0    0    1
   A    0    0    0    0    0    0    2    0    0
   T    0    0    1    0    0    0    0    1    0
   T    0    0    1    0    0    0    0    0    0
   C    0    1    0    2    0    1    0    0    1
   A    0    0    0    0    1    0    2    0    0
   C    0    1    0    1    0    2    0    1    1
1 for a match, -1 for a mismatch, -5 for a space
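
The local-alignment matrix on this slide can be regenerated by adding a floor of 0 to the global recursion and initializing the borders with 0. A minimal sketch, with made-up names (`sw_matrix`, `starts`), using the slide's scoring of 1 / -1 / -5; every cell holding the maximum is a possible starting point for a local traceback.

```python
def sw_matrix(top, left, match=1, mismatch=-1, gap=-5):
    """Fill the local-alignment DP matrix; rows follow `left`, columns follow `top`."""
    S = [[0] * (len(top) + 1) for _ in range(len(left) + 1)]  # borders stay 0
    for i in range(1, len(left) + 1):
        for j in range(1, len(top) + 1):
            pair = match if left[i - 1] == top[j - 1] else mismatch
            S[i][j] = max(0, S[i - 1][j - 1] + pair, S[i - 1][j] + gap, S[i][j - 1] + gap)
    return S


S = sw_matrix("CTCGCAGC", "CATTCAC")
best = max(max(row) for row in S)
# All (row, column) cells holding the best score.
starts = [(i, j) for i, row in enumerate(S) for j, v in enumerate(row) if v == best]
print(best, starts)
```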
39
Some Results re: Alignment Algorithms (for
ComS, CprE & Math types!)

Most pairwise sequence alignment problems can be
solved in O(mn) time
Space requirement can be reduced to O(m+n), while
keeping run-time fixed [Myers88]
Highly similar sequences can be aligned in O(dn)
time, where d measures the distance between the
sequences [Landau86]
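
To see why less memory is enough for the score: each row of the DP matrix depends only on the row above it, so two rows suffice. The sketch below computes just the global alignment score this way. It does not recover the alignment itself (the cited linear-space result does that with an additional divide-and-conquer step, not shown), and the function name `nw_score_two_rows` is made up.

```python
def nw_score_two_rows(x, y, match=10, mismatch=-2, gap=-5):
    """Global alignment score, keeping only two rows of the DP matrix in memory."""
    if len(y) > len(x):
        x, y = y, x                       # keep the shorter sequence along the row
    prev = [j * gap for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        curr = [i * gap] + [0] * len(y)
        for j in range(1, len(y) + 1):
            pair = match if x[i - 1] == y[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + pair, prev[j] + gap, curr[j - 1] + gap)
        prev = curr                       # the older row is no longer needed
    return prev[-1]


score = nw_score_two_rows("CTCGCAGC", "CATTCAC")
print(score)
```

Run time is still proportional to m times n; only the memory use changes.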

40
"Scoring" or "Substitution" Matrices

2 Major types for Amino Acids: PAM & BLOSUM
PAM = Point Accepted Mutation
relies on "evolutionary model" based on
observed differences in alignments of closely
related proteins
BLOSUM = BLOck SUbstitution Matrix
based on aa substitutions observed in blocks
of conserved sequences within evolutionarily
divergent proteins

41
PAM Matrix

PAM = Point Accepted Mutation
relies on "evolutionary model" based on observed
differences in closely related proteins
Model includes defined rate for each type of
sequence change
Suffix number (n) reflects amount of "time"
passed: rate of expected mutation if n% of amino
acids had changed
PAM1 - for less divergent sequences (shorter
time)
PAM250 - for more divergent sequences (longer
time)
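
One way to picture how a larger suffix corresponds to more elapsed "time": in the Dayhoff model (stated here as background, not taken from the slide), the substitution probabilities for PAM-n are obtained by multiplying the PAM1 probability matrix by itself n times. The sketch uses a hypothetical two-letter alphabet with made-up probabilities, purely to show the mechanics; real PAM matrices are 20 x 20 and are further converted to log-odds scores.

```python
def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def mat_pow(M, n):
    """Multiply M by itself n times, starting from the identity matrix."""
    size = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, M)
    return result


# Hypothetical "PAM1" for a two-letter alphabet: 1% chance of change per step.
TOY_PAM1 = [[0.99, 0.01],
            [0.01, 0.99]]
toy_pam250 = mat_pow(TOY_PAM1, 250)
print(toy_pam250[0])
```

After many steps the probability of still seeing the original letter drops well below its one-step value, which is why higher-numbered PAM matrices suit more divergent sequences.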

42
BLOSUM Matrix

BLOSUM = BLOck SUbstitution Matrix
based on aa substitutions observed in blocks
of conserved sequences within evolutionarily
divergent proteins
Doesn't rely on a specific evolutionary model
Suffix number (n) reflects expected similarity:
average % aa identity in the MSA from which the
matrix was generated
BLOSUM45 - for more divergent sequences
BLOSUM62 - for less divergent sequences

43
Statistical Significance of Sequence Alignment
44
Affine Gap Penalty Functions