BCB 444544 F07 ISU Dobbs - PowerPoint PPT Presentation

Title:

BCB 444544 F07 ISU Dobbs

Description:

BCB 444/544 F07 ISU Dobbs #8 - Finish DP, Scoring Matrices, Stats & BLAST. 2 ... Initialize and fill in a DP matrix for storing optimal scores of subproblems, by ... – PowerPoint PPT presentation


1
BCB 444/544
  • Lecture 8
  • Finish Dynamic Programming
  • Global vs Local Alignment
  • Scoring Matrices & Alignment Statistics
  • BLAST
  • 8_Sept7

2
Required Reading (before lecture)
  • ✓ Last week - for Lectures 4-7
  • Pairwise Sequence Alignment, Dynamic Programming,
    Global vs Local Alignment, Scoring Matrices,
    Statistics
  • Xiong: Chp 3
  • Eddy: What is Dynamic Programming? 2004 Nature
    Biotechnol 22:909
  • http://www.nature.com/nbt/journal/v22/n7/abs/nbt0704-909.html
  • ✓ Wed Sept 5 - for Lecture 7 & Lab 3
  • Database Similarity Searching: BLAST (nope,
    more DP)
  • Chp 4 - pp 51-62
  • Fri Sept - for Lecture 8 (will finish on
    Monday)
  • BLAST variations; BLAST vs FASTA
  • Chp 4 - pp 51-62

3
Assignments & Announcements
  • ✓ Tues Sept 4 - Lab 2 Exercise Writeup due by 5
    PM. Send via email to Pete Zaback,
    petez_at_iastate.edu
  • (For now, no late penalty - just send ASAP)
  • ✓ Wed Sept 5 - Notes for Lecture 5 posted online
  • HW2 posted online, sent via email &
    handed out in class
  • Fri Sept 14 - HW2 Due by 5 PM
  • Fri Sept 21 - Exam 1

4
Chp 3 - Sequence Alignment
  • SECTION II: SEQUENCE ALIGNMENT
  • Xiong: Chp 3
  • Pairwise Sequence Alignment
  • ✓ Evolutionary Basis
  • ✓ Sequence Homology versus Sequence Similarity
  • ✓ Sequence Similarity versus Sequence Identity
  • Methods - cont
  • Scoring Matrices
  • Statistical Significance of Sequence Alignment

Adapted from Brown and Caragea, 2007, with some
slides from Altman, Fernandez-Baca, Batzoglou,
Craven, Hunter, Page.
5
Methods
  • ✓ Global and Local Alignment
  • ✓ Alignment Algorithms
  • ✓ Dot Matrix Method
  • Dynamic Programming Method - cont
  • Gap penalties
  • DP for Global Alignment
  • DP for Local Alignment
  • Scoring Matrices
  • Amino acid scoring matrices
  • PAM
  • BLOSUM
  • Comparisons between PAM & BLOSUM
  • Statistical Significance of Sequence Alignment

6
Dynamic Programming - 4 Steps
  • Define score of optimal alignment, using
    recursion
  • Initialize and fill in a DP matrix for storing
    optimal scores of subproblems, by solving
    smallest subproblems first (bottom-up approach)
  • Calculate score of optimal alignment(s)
  • Trace back through matrix to recover optimal
    alignment(s) that generated optimal score

7
1- Define Score of Optimal Alignment using
Recursion
Define
  • Match Reward
  • Mismatch Penalty
  • Gap penalty

Recursive definition: For 1 ≤ i ≤ N, 1 ≤ j ≤ M,
score(xi,yj) = Match Reward or Mismatch Penalty;
each space costs the Gap penalty
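
The recursion can be written out explicitly as below. This is a sketch under assumed notation: the names s(x_i, y_j) for the match/mismatch score and g for the gap penalty are chosen here for illustration, with penalties taken as negative numbers that are added to the score.

```latex
% Assumed notation: s = match/mismatch score, g = gap penalty (g < 0)
\[
S(i,j) = \max
\begin{cases}
S(i-1,\,j-1) + s(x_i, y_j) & \text{line up } x_i \text{ with } y_j \\
S(i-1,\,j) + g             & \text{line up } x_i \text{ with a space} \\
S(i,\,j-1) + g             & \text{line up } y_j \text{ with a space}
\end{cases}
\qquad 1 \le i \le N,\; 1 \le j \le M
\]
\[
s(x_i, y_j) =
\begin{cases}
\text{Match Reward}     & \text{if } x_i = y_j \\
\text{Mismatch Penalty} & \text{otherwise}
\end{cases}
\qquad
S(0,0) = 0,\quad S(i,0) = i\,g,\quad S(0,j) = j\,g
\]
```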
8
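2- Initialize & Fill in DP Matrix for Storing
Optimal Scores of Subproblems
  • Construct sequence vs sequence matrix
  • Fill in from 0,0 to N,M (row by row),
    calculating best possible score for each
    alignment ending at residues at i,j

[Figure: DP matrix indexed 0, 1, ..., N along one axis and 0, 1, ..., M along the other, with S(0,0) = 0 in the upper left corner, S(i,j) in the interior, and S(N,M) in the lower right corner]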
9
How do we calculate S(i,j)? i.e., Score for
alignment of x1..i to y1..j?
  • 1 of 3 cases → optimal score for this subproblem

10
Specific Example
Note: I changed sequences on this slide (to
match the rest of DP example)
Scoring Consequence?
Case 1: Line up xi with yj → Match Bonus
    x: C - T C G C A
    y: C A T - T C A
Case 2: Line up xi with space → Space Penalty
    x: C - T C G C - A
    y: C A T - T C A -
Case 3: Line up yj with space → Space Penalty
    x: C - T C G C A -
    y: C A T - T C - A
11
Ready? Fill in DP Matrix
Keep track of dependencies of scores (in a
pointer matrix)
[Figure: DP matrix from S(0,0) = 0 to S(N,M); cell S(i,j) depends on three neighboring cells]
  • S(i-1,j-1) + Match Reward or Mismatch Penalty
  • S(i-1,j) + Gap penalty
  • S(i,j-1) + Gap penalty
12
Fill in the DP matrix !!
           C    T    C    G    C    A    G    C
      0   -5  -10  -15  -20  -25  -30  -35  -40
 C        10    5
 A
 T
 T
 C
 A
 C

10 for match, -2 for mismatch, -5 for space
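
A minimal Python sketch of the fill step, using the scoring on this slide (10 for match, -2 for mismatch, -5 for space) and the two sequences from the worked example. The function name `nw_fill` and the row/column layout (left sequence on rows, top sequence on columns) are choices made for this illustration.

```python
def nw_fill(top, left, match=10, mismatch=-2, gap=-5):
    """Fill the global-alignment DP matrix.

    S[i][j] holds the best score for aligning left[:i] with top[:j].
    """
    rows, cols = len(left) + 1, len(top) + 1
    S = [[0] * cols for _ in range(rows)]
    # Top row and leftmost column: alignments against nothing but spaces.
    for j in range(1, cols):
        S[0][j] = j * gap
    for i in range(1, rows):
        S[i][0] = i * gap
    # Fill row by row, so the three needed neighbors are always ready.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if left[i - 1] == top[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sub,  # diagonal
                          S[i - 1][j] + gap,      # from above
                          S[i][j - 1] + gap)      # from the left
    return S


S = nw_fill("CTCGCAGC", "CATTCAC")
for row in S:
    print(" ".join(f"{v:4d}" for v in row))
```

The first printed row reproduces the 0, -5, ..., -40 initialization, and the lower right cell holds the optimal global score.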
13
3- Calculate Score S(N,M) of Optimal Alignment -
for Global Alignment
[Figure: filled DP matrix; top sequence C T C G C A G C]
10 for match, -2 for mismatch, -5 for space
15
4- Trace back through matrix to recover optimal
alignment(s) that generated the optimal score
  • How? "Repeat" alignment calculations in reverse
    order, starting from the position with the highest
    score and following the path, position by position,
    back through matrix
  • Result? Optimal alignment(s) of sequences

16
Traceback - for Global Alignment
  • Start in lower right corner & trace back to upper
    left
  • Each arrow introduces one character at end of
    alignment
  • A horizontal move puts a gap in left sequence
  • A vertical move puts a gap in top sequence
  • A diagonal move uses one character from each
    sequence

17
Traceback to Recover Alignment
[Figure: DP matrix with traceback arrows; top sequence C T C G C A G C]
Can have > 1 optimal alignment; this example has 2
18
Traceback to Recover Alignment
[Figure: DP matrix with traceback arrows; top sequence C T C G C A G C]
Where did red arrows come from?
19
Traceback to Recover Alignment
[Figure: DP matrix with traceback arrows; top sequence C T C G C A G C]
10 for match, -2 for mismatch, -5 for space
  • Where did 33 come from? Match = 10, so 33 - 10
    = 23
  • Must have come from diagonal
  • Where did 23 come from? (Not a match)
  • Left? 28 - 5 = 23. Diag? 13 - 2 = 11. Top? 8 - 5 = 3.
    Only Left gives 23

20
Traceback to Recover Alignment
[Figure: DP matrix with traceback arrows; top sequence C T C G C A G C]
10 for match, -2 for mismatch, -5 for space
  • Where did 8 come from?
  • Two possibilities: 13 - 5 = 8 or 10 - 2 = 8
  • Then, follow both paths

21
Traceback to Recover Alignment
[Figure: DP matrix with traceback path 1; top sequence C T C G C A G C]
Path 1 (top with left):
C with C
- with A
T with T
C with -
G with T
C with C
A with A
G with -
C with C
Great - but what are the alignments?
22
Traceback to Recover Alignment
[Figure: DP matrix with traceback path 2; top sequence C T C G C A G C]
Path 2 (top with left):
C with C
- with A
T with T
C with T
G with -
C with C
A with A
G with -
C with C
Great - but what are the alignments?
23
What are the 2 Global Alignments with Optimal
Score 33?
Top:  C T C G C A G C
Left: C A T T C A C
Alignment 2 (top sequence):
    C - T C G C A G C
  • A horizontal move puts a gap in left sequence
  • A vertical move puts a gap in top sequence
  • A diagonal move uses one character from each
    sequence

24
What are the 2 Global Alignments with Optimal
Score 33?
Top:  C T C G C A G C
Left: C A T T C A C
Alignment 2:
    C - T C G C A G C
    C A T T - C A - C
  • A horizontal move puts a gap in left sequence
  • A vertical move puts a gap in top sequence
  • A diagonal move uses one character from each
    sequence
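
The traceback rules above can be sketched in Python as follows. This is an illustrative implementation, not the lecture's own code: the function names are made up here, and the traceback follows every tied predecessor so that all optimal alignments are reported.

```python
def nw_fill(top, left, match, mismatch, gap):
    """DP matrix with the left sequence on rows, top sequence on columns."""
    S = [[0] * (len(top) + 1) for _ in range(len(left) + 1)]
    for j in range(1, len(top) + 1):
        S[0][j] = j * gap
    for i in range(1, len(left) + 1):
        S[i][0] = i * gap
        for j in range(1, len(top) + 1):
            sub = match if left[i - 1] == top[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sub,
                          S[i - 1][j] + gap,
                          S[i][j - 1] + gap)
    return S


def all_optimal_alignments(top, left, match=10, mismatch=-2, gap=-5):
    """Return (optimal score, list of (aligned top, aligned left))."""
    S = nw_fill(top, left, match, mismatch, gap)
    found = []

    def walk(i, j, top_cols, left_cols):
        # Columns are collected from the end of the alignment backwards.
        if i == 0 and j == 0:
            found.append(("".join(reversed(top_cols)),
                          "".join(reversed(left_cols))))
            return
        if i > 0 and j > 0:
            sub = match if left[i - 1] == top[j - 1] else mismatch
            if S[i][j] == S[i - 1][j - 1] + sub:
                # Diagonal move: one character from each sequence.
                walk(i - 1, j - 1, top_cols + [top[j - 1]], left_cols + [left[i - 1]])
        if j > 0 and S[i][j] == S[i][j - 1] + gap:
            # Horizontal move: gap in left sequence.
            walk(i, j - 1, top_cols + [top[j - 1]], left_cols + ["-"])
        if i > 0 and S[i][j] == S[i - 1][j] + gap:
            # Vertical move: gap in top sequence.
            walk(i - 1, j, top_cols + ["-"], left_cols + [left[i - 1]])

    walk(len(left), len(top), [], [])
    return S[-1][-1], found


score, alignments = all_optimal_alignments("CTCGCAGC", "CATTCAC")
print("Optimal score:", score)
for a_top, a_left in alignments:
    print(a_top)
    print(a_left)
    print()
```

Following every tie keeps the sketch short, but the number of optimal paths can grow quickly for long sequences, so practical tools usually report just one.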

25
Check Traceback?
[Figure: DP matrix with traceback arrows; top sequence C T C G C A G C]
  • A horizontal move puts a gap in left sequence
  • A vertical move puts a gap in top sequence
  • A diagonal move uses one character from each
    sequence

26
Local Alignment Motivation
  • To "ignore" stretches of non-coding DNA
  • Non-coding regions (if "non-functional") are more
    likely to contain mutations than coding regions
  • Local alignment between two protein-encoding
    sequences is likely to be between two exons
  • To locate protein domains or motifs
  • Proteins with similar structures and/or similar
    functions but from different species (for
    example), often exhibit local sequence
    similarities
  • Local sequence similarities may indicate
    functional modules

Non-coding = "not encoding protein". Exons =
"protein-encoding" parts of genes, vs Introns =
"intervening sequences" - segments of
eukaryotic genes that "interrupt" exons.
Introns are transcribed into RNA, but are later
removed by RNA processing & are not translated
into protein.
27
Local Alignment Example
Match = 2; Mismatch or space = -1
Best local alignment (aligned region: C T G A with C - G A):
    G G T C T G A G
    A A A C - G A
Score = 5
28
Local Alignment Algorithm
  • S(i,j) = Score for optimally aligning a suffix
    of x1..i with a suffix of y1..j
  • Initialize top row & leftmost column of matrix
    with "0"
  • Recall: for Global Alignment,
  • S(i,j) = Score for optimally aligning a
    prefix of X with a prefix of Y
  • Initialize top row & leftmost column with
    cumulative gap penalties
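
A minimal Python sketch of the local-alignment variant (the Smith-Waterman method named later in the lecture). Compared with the global version it makes two changes: the top row and leftmost column start at 0, and any cell that would go negative is reset to 0. Function names and the tie-breaking order in the traceback are choices made for this illustration.

```python
def smith_waterman(x, y, match=2, mismatch=-1, gap=-1):
    """Return (best local score, cells holding that score, matrix H).

    x runs along the columns and y along the rows.
    """
    rows, cols = len(y) + 1, len(x) + 1
    H = [[0] * cols for _ in range(rows)]  # top row / left column stay 0
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if y[i - 1] == x[j - 1] else mismatch
            H[i][j] = max(0,                      # start a new local alignment
                          H[i - 1][j - 1] + sub,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            best = max(best, H[i][j])
    ends = [(i, j) for i in range(rows) for j in range(cols)
            if best > 0 and H[i][j] == best]
    return best, ends, H


def local_traceback(H, x, y, i, j, match=2, mismatch=-1, gap=-1):
    """Trace back from cell (i, j) until a cell with score 0 is reached."""
    ax, ay = [], []
    while H[i][j] > 0:
        sub = match if y[i - 1] == x[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            ax.append(x[j - 1]); ay.append(y[i - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i][j - 1] + gap:
            ax.append(x[j - 1]); ay.append("-"); j -= 1
        else:
            ax.append("-"); ay.append(y[i - 1]); i -= 1
    return "".join(reversed(ax)), "".join(reversed(ay))


best, ends, H = smith_waterman("GGTCTGAG", "AAACGA")
print(best, local_traceback(H, "GGTCTGAG", "AAACGA", *ends[0]))

# Second example from the lecture: 1 for match, -1 mismatch, -5 space.
best2, ends2, _ = smith_waterman("CTCGCAGC", "CATTCAC", match=1, mismatch=-1, gap=-5)
print(best2, len(ends2))
```

The traceback starts from a highest-scoring cell anywhere in the matrix rather than from the lower right corner, which is what lets the alignment cover only part of each sequence.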

29
Filling in DP Matrix for Local Alignment
[Figure: local-alignment DP matrix; top sequence C T C G C A G C]
1 for a match, -1 for a mismatch, -5 for a space
30
Traceback - for Local Alignment
[Figure: local-alignment DP matrix with traceback arrows; top sequence C T C G C A G C]
1 for a match, -1 for a mismatch, -5 for a space
31
What are the 4 Local Alignments with Optimal
Score 2?
Top:  C T C G C A G C
Left: C A T T C A C
32
Some Results re: Alignment Algorithms (for
ComS, CprE & Math types!)
  • Most pairwise sequence alignment problems can be
    solved in O(mn) time
  • Space requirement can be reduced to O(m+n), while
    keeping run-time fixed [Myers88]
  • Highly similar sequences can be aligned in O(dn)
    time, where d measures the distance between the
    sequences [Landau86]
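
To illustrate the space reduction, here is a hedged Python sketch that computes only the optimal global score while keeping two rows of the matrix in memory. It does not recover the alignment itself; doing that in linear space needs a divide-and-conquer scheme, which is not shown here. The function name is made up for this example.

```python
def nw_score_two_rows(top, left, match=10, mismatch=-2, gap=-5):
    """Optimal global alignment score using O(len(top)) memory."""
    prev = [j * gap for j in range(len(top) + 1)]   # row 0 of the matrix
    for i in range(1, len(left) + 1):
        curr = [i * gap] + [0] * len(top)           # column 0 of row i
        for j in range(1, len(top) + 1):
            sub = match if left[i - 1] == top[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + sub,  # diagonal
                          prev[j] + gap,      # from above
                          curr[j - 1] + gap)  # from the left
        prev = curr                            # older rows are discarded
    return prev[-1]


print(nw_score_two_rows("CTCGCAGC", "CATTCAC"))
```

Run time is still proportional to m × n, since every cell is computed once; only the storage changes.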

33
Affine Gap Penalty Functions
  • Affine Gap Penalties = Differential Gap
    Penalties, used to reflect cost differences
    between opening a gap and extending an existing
    gap
  • Total Gap Penalty is linear function of gap
    length
  • W = γ + δ × (k - 1)
  • where γ = gap opening penalty
  • δ = gap extension penalty
  • k = length of gap
  • Sometimes, a Constant Gap Penalty is used, but it
    is usually less realistic than the Affine Gap
    Penalty

Can also be solved in O(nm) time using DP
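
As a small worked example of the total gap penalty formula, the sketch below evaluates W for a few gap lengths. The opening and extension values (11 and 1) are hypothetical numbers picked for illustration, and penalties are treated as positive costs.

```python
def affine_gap_penalty(k, gap_open, gap_extend):
    """Total penalty W for a gap of length k: opening + extension * (k - 1)."""
    if k < 1:
        raise ValueError("gap length k must be at least 1")
    return gap_open + gap_extend * (k - 1)


# Hypothetical penalties: opening = 11, extension = 1.
for k in (1, 2, 4):
    print(k, affine_gap_penalty(k, gap_open=11, gap_extend=1))
```

With these values, one gap of length 4 costs 14, while four separate gaps of length 1 cost 44, which is the point of charging more to open a gap than to extend one.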
34
Methods
  • ✓ Global and Local Alignment
  • ✓ Alignment Algorithms
  • ✓ Dot Matrix Method
  • ✓ Dynamic Programming Method - cont
  • Gap penalties
  • DP for Global Alignment
  • DP for Local Alignment
  • Scoring Matrices
  • Amino acid scoring matrices
  • PAM
  • BLOSUM
  • Comparisons between PAM & BLOSUM
  • Statistical Significance of Sequence Alignment

35
"Scoring" or "Substitution" Matrices
  • 2 Major types for Amino Acids: PAM & BLOSUM
  • PAM = Point Accepted Mutation
  • relies on "evolutionary model" based on
    observed differences in alignments of closely
    related proteins
  • BLOSUM = BLOck SUbstitution Matrix
  • based on aa substitutions observed in blocks
    of conserved sequences within evolutionarily
    divergent proteins

36
PAM Matrix
  • PAM = Point Accepted Mutation
  • relies on "evolutionary model" based on observed
    differences in closely related proteins
  • Model includes defined rate for each type of
    sequence change
  • Suffix number (n) reflects amount of "time"
    passed: rate of expected mutation if n% of amino
    acids had changed
  • PAM1 - for less divergent sequences (shorter
    time)
  • PAM250 - for more divergent sequences (longer
    time)

37
BLOSUM Matrix
  • BLOSUM = BLOck SUbstitution Matrix
  • based on aa substitutions observed in blocks
    of conserved sequences within evolutionarily
    divergent proteins
  • Doesn't rely on a specific evolutionary model
  • Suffix number (n) reflects expected similarity:
    average % aa identity in the MSA from which the
    matrix was generated
  • BLOSUM45 - for more divergent sequences
  • BLOSUM62 - for less divergent sequences

38
PAM250 vs BLOSUM62
  • See Text
  • Fig 3.5: PAM250
  • Fig 3.6: BLOSUM62
  • Usually only 1/2 of matrix is displayed (it is
    symmetric)
  • Here
  • s(a,b) corresponds to score of aligning character
    a with character b

39
Which is Better? PAM or BLOSUM
  • PAM matrices
  • derived from evolutionary model
  • often used in reconstructing phylogenetic trees -
    but, not very good for highly divergent sequences
  • BLOSUM matrices
  • based on direct observations
  • more "realistic" - and outperform PAM matrices in
    terms of accuracy in local alignment

40
Which Type of Matrix Should You Use?
  • Several other types of matrices available
  • Gonnet & Jones-Taylor-Thornton
  • very robust in tree construction
  • "Best" matrix depends on task
  • different matrices for different applications
  • ADVICE: if unsure, try several different
    matrices & choose the one that gives best
    alignment result

41
Sequence Alignment Statistics
  • Distribution of similarity scores in sequence
    alignment is not a simple "normal" distribution
  • "Gumbel extreme value distribution" - a highly
    skewed distribution with a long tail

42
How to Assess Statistical Significance of an
Alignment?
  • Compare score of an alignment with distribution
    of scores of alignments for many "randomized"
    (shuffled) versions of the original sequence
  • If score is in extreme margin, then unlikely due
    to random chance
  • P-value = probability of getting a score at least
    this high by random chance (lower P is better)
  • P = 10^-5 to 10^-50: sequences have clear homology
  • P > 10^-1: no better than random

Check out PRSS (Probability of Random
Shuffles): http://www.ch.embnet.org/software/PRSS_form.html
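
The shuffling idea can be sketched in Python as below. This is a simplified illustration, not how PRSS itself works: it reports a plain empirical P-value (the fraction of shuffles scoring at least as high as the original) rather than fitting an extreme value distribution. The scoring values, the number of trials, and the test sequence are all assumptions made for the example.

```python
import random


def local_score(x, y, match=1, mismatch=-1, gap=-2):
    """Best local alignment score, keeping only two rows in memory."""
    prev = [0] * (len(x) + 1)
    best = 0
    for c in y:
        curr = [0]
        for j in range(1, len(x) + 1):
            sub = match if c == x[j - 1] else mismatch
            curr.append(max(0, prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap))
        best = max(best, max(curr))
        prev = curr
    return best


def shuffle_p_value(x, y, trials=200, seed=0):
    """Return (observed score, empirical P-value from shuffling y)."""
    rng = random.Random(seed)          # fixed seed so runs are repeatable
    observed = local_score(x, y)
    letters = list(y)
    hits = 0
    for _ in range(trials):
        rng.shuffle(letters)           # same composition, random order
        if local_score(x, "".join(letters)) >= observed:
            hits += 1
    # Adding 1 to both counts keeps the estimate from being exactly 0.
    return observed, (hits + 1) / (trials + 1)


seq = "ACGTACGTTGCAAGCT"               # made-up test sequence
observed, p = shuffle_p_value(seq, seq)
print(observed, p)
```

With only 200 shuffles the smallest P-value this can report is about 0.005, so estimating the very small values quoted above requires a fitted distribution rather than raw counts.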
43
Chp 4 - Database Similarity Searching
  • SECTION II: SEQUENCE ALIGNMENT
  • Xiong: Chp 4
  • Database Similarity Searching
  • Unique Requirements of Database Searching
  • Heuristic Database Searching
  • Basic Local Alignment Search Tool (BLAST)
  • FASTA
  • Comparison of FASTA and BLAST
  • Database Searching with Smith-Waterman Method

44
Exhaustive vs Heuristic Methods
  • Exhaustive - tests every possible solution
  • guaranteed to give best answer
  • (identifies optimal solution)
  • can be very time/space intensive!
  • e.g., Dynamic Programming
  • as in Smith-Waterman algorithm
  • Heuristic - does NOT test every possibility
  • no guarantee that answer is best
  • (but, often can identify optimal
    solution)
  • sacrifices accuracy (potentially) for speed
  • uses "rules of thumb" or "shortcuts"
  • e.g., BLAST & FASTA

45
Today's Lab: focus on BLAST (Basic Local
Alignment Search Tool)
  • STEPS
  • Create list of every possible "word" (e.g., 3-11
    letters) from query sequence
  • Search database to identify sequences that
    contain matching words
  • Score match of word with sequence, using a
    substitution matrix
  • Extend match (seed) in both directions, while
    calculating alignment score at each step
  • Continue extension until score drops below a
    threshold (due to mismatches)
  • Contiguous aligned segment pair (no gaps) is
    called High Scoring Segment Pair (HSP)
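
The steps above can be sketched as a toy seed-and-extend search in Python. This is a heavily simplified illustration and not BLAST's actual implementation: it only finds exact word matches, scores with a plain match/mismatch scheme instead of a substitution matrix, and stops extending once the running score falls more than `drop` below the best score seen so far. All names, parameter values, and sequences are assumptions made for the example.

```python
def find_seeds(query, subject, w=3):
    """Return (query_pos, subject_pos) pairs where a length-w word matches exactly."""
    index = {}
    for i in range(len(query) - w + 1):
        index.setdefault(query[i:i + w], []).append(i)
    return [(i, j)
            for j in range(len(subject) - w + 1)
            for i in index.get(subject[j:j + w], [])]


def extend_one_way(query, subject, qi, sj, step, start, match, mismatch, drop):
    """Ungapped extension in one direction; returns (best score, columns kept)."""
    score = best = start
    kept = length = 0
    while 0 <= qi < len(query) and 0 <= sj < len(subject):
        score += match if query[qi] == subject[sj] else mismatch
        length += 1
        if score > best:
            best, kept = score, length     # remember the best stopping point
        elif best - score > drop:
            break                          # score fell too far: stop extending
        qi += step
        sj += step
    return best, kept


def extend_seed(query, subject, i, j, w=3, match=1, mismatch=-1, drop=2):
    """Extend a seed at (i, j) in both directions; return (score, query part, subject part)."""
    seed_score = w * match
    right_best, right = extend_one_way(query, subject, i + w, j + w, +1,
                                       seed_score, match, mismatch, drop)
    total, left = extend_one_way(query, subject, i - 1, j - 1, -1,
                                 right_best, match, mismatch, drop)
    q0, s0, length = i - left, j - left, left + w + right
    return total, query[q0:q0 + length], subject[s0:s0 + length]


query, subject = "TTGATTACAGG", "CCGATTACACC"   # made-up sequences
seeds = find_seeds(query, subject)
print(seeds)
print(extend_seed(query, subject, *seeds[0]))
```

Several overlapping seeds extend to the same segment pair here, so a fuller version would also merge or deduplicate the resulting HSPs.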

46
Lab 3: focus on BLAST (Basic Local Alignment
Search Tool)
  • BLAST Results?
  • Original version of BLAST?
  • List of HSPs = Maximum Scoring Pairs
  • More recent, improved version of BLAST?
  • Allows gaps: Gapped Alignment
  • How? Allows score to drop below threshold
    (but only temporarily)

47
BLAST - a few details
  • Developed by Stephen Altschul at NCBI in 1990
  • Word length?
  • Typically 3 aa for protein sequence
  • 11 nt for DNA sequence
  • Substitution matrix?
  • Default is BLOSUM62
  • Can change under Algorithm Parameters
  • Choose other BLOSUM or PAM matrices
  • Stop-Extension Threshold?
  • Typically 22 for proteins
  • 20 for DNA

48
BLAST - Statistical Significance?
  • 1. E-value: E = m × n × P
  • m = total number of residues in database
  • n = number of residues in query sequence
  • P = probability that an HSP is result of random
    chance
  • lower E-value, less likely to result from random
    chance, thus higher significance
  • 2. Bit Score: S'
  • normalized score, to account for differences in
    sequence length & size of database
  • 3. Low Complexity Masking
  • remove repeats that confound scoring
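
A tiny worked example of the E-value formula on this slide. The database size, query length, and probability below are hypothetical numbers chosen only to show the arithmetic.

```python
def e_value(m, n, p):
    """E = m x n x P: expected number of chance hits at this score."""
    return m * n * p


# Hypothetical values: 1,000,000 residues in the database, a 100-residue
# query, and P = 1e-10 for the HSP.
print(e_value(1_000_000, 100, 1e-10))   # about 0.01
```

Because E scales with m, the same HSP becomes less significant when it is found in a larger database.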

49
BLAST - a Family of Programs: Several different
BLAST "flavors"
  • BLASTN -
  • BLASTP -
  • BLASTX -
  • TBLASTN -
  • TBLASTX -