BCB 444/544 - PowerPoint PPT Presentation


Title: BCB 444/544


1
BCB 444/544
  • Lecture 5
  • Dynamic Programming
  • 5_Aug29

2
Required Reading (before lecture)
  • Mon Aug 27 - for Lecture 4
  • Pairwise Sequence Alignment
  • Chp 3 - pp 31-41
  • Wed Aug 29 - for Lecture 5
  • Dynamic Programming
  • Eddy, "What is Dynamic Programming?"
  • 2004 Nature Biotechnol 22:909
  • Thurs Aug 30 - Lab 2
  • Databases, ISU Resources & Pairwise Sequence
    Alignment
  • Fri Aug 31 - for Lecture 6
  • Scoring Matrices and Alignment Statistics
  • Chp 3 - pp 41-49

3
Review: Chp 2 - Biological Databases
  • Xiong Chp 2
  • Introduction to Biological Databases
  • What is a Database?
  • Types of Databases
  • Biological Databases
  • Pitfalls of Biological Databases
  • Information Retrieval from Biological
    Databases

4
Types of Databases
  • 3 Major types of electronic databases
  • Flat files - simple text files
  • no organization to facilitate retrieval
  • Relational - data organized as tables
    ("relations")
  • shared features among tables allow rapid search
  • Object-oriented - data organized as "objects"
  • objects associated hierarchically

5
Examples of Biological Databases
  • 1- Primary
  • DNA sequences
  • GenBank - USA
  • European Molecular Biology Lab - EMBL
  • DNA Data Bank of Japan - DDBJ
  • Structures (Protein, DNA, RNA)
  • PDB - Protein Data Bank
  • NDB - Nucleic Acid Data Bank

6
Examples of Biological Databases
  • 2- Secondary
  • Protein sequences
  • Swiss-Prot, TrEMBL, PIR
  • these recently combined into UniProt
  • 3- Specialized
  • Species-specific (or "taxonomic" specific)
  • Flybase, WormBase, AceDB, PlantDB
  • Molecule-specific, disease-specific
  • See http://www.oxfordjournals.org/nar/database/c/

7
SUMMARY 2- Biological Databases
BEWARE!
Who was that Icelandic fellow?
8
Chp 3- Sequence Alignment
  • SECTION II SEQUENCE ALIGNMENT
  • Xiong Chp 3
  • Pairwise Sequence Alignment
  • Evolutionary Basis
  • Sequence Homology versus Sequence Similarity
  • Sequence Similarity versus Sequence Identity
  • Methods
  • Scoring Matrices
  • Statistical Significance of Sequence Alignment

Adapted from Brown and Caragea, 2007, with some
slides from Altman, Fernandez-Baca, Batzoglou,
Craven, Hunter, Page.
9
Motivation for Sequence Alignment
  • "Sequence comparison lies at the heart of
    bioinformatics analysis." Jin Xiong
  • Sequence comparison is important for drawing
    functional & evolutionary inferences re new
    genes/proteins
  • Pairwise sequence alignment is fundamental; it
    is used to:
  • Search for common patterns of characters
  • Establish pair-wise correspondence between
    related sequences
  • Pairwise sequence alignment is the basis for:
  • Database searching (e.g., BLAST)
  • Multiple sequence alignment (MSA)

10
Homology
  • Homology has a very specific meaning in
    evolutionary & computational biology - the term is
    often used incorrectly
  • For us:
  • Homology = similarity due to descent from a
    common evolutionary ancestor
  • But, HOMOLOGY ≠ SIMILARITY
  • When 2 sequences share a sufficiently high degree
    of sequence similarity (or identity), we may
    infer that they are homologous
  • We can infer homology from similarity (can't
    prove it!)

11
Orthologs vs Paralogs
  • 2 types of homologous sequences
  • Orthologs - "same genes" in different species
  • result of common ancestry
  • corresponding proteins have "same" functions
  • (e.g., human α-globin & mouse α-globin)
  • Paralogs - "similar genes" within a species
  • result of gene duplication events
  • proteins may (or may not) have similar functions
  • (e.g., human α-globin & human β-globin)

[Figure: gene tree. A is the parent gene. Speciation leads to B & C; duplication leads to C'. B and C are orthologous; C and C' are paralogous.]
12
Sequence Homology vs Similarity
  • Homologous sequences - sequences that share a
    common evolutionary ancestry
  • Similar sequences - sequences that have a high
    percentage of aligned residues with similar
    physicochemical properties
  • (e.g., size, hydrophobicity, charge)
  • IMPORTANT
  • Sequence homology
  • An inference about a common ancestral
    relationship, drawn when two sequences share a
    high enough degree of sequence similarity
  • Homology is qualitative
  • Sequence similarity
  • The direct result of observation from a sequence
    alignment
  • Similarity is quantitative; it can be described
    using percentages

13
Sequence Similarity vs Identity
  • For nucleotide sequences (DNA & RNA), sequence
    similarity and identity have the "same" meaning
  • Two DNA sequences can share a high degree of
    sequence identity (or similarity) -- means the
    same thing
  • Drena's opinion: Always use "identity" when
    making quantitative comparisons re DNA or RNA
    sequences (to avoid confusion!)
  • For protein sequences, sequence similarity and
    identity have different meanings
  • Identity = % of exact matches between two aligned
    sequences
  • Similarity = % of aligned residues that share
    similar characteristics (e.g., physicochemical
    characteristics, structural propensities,
    evolutionary profiles)
  • Drena's opinion: Always use "identity" when
    making quantitative comparisons re protein
    sequences (to avoid confusion!)
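
As a rough illustration of "identity" as a quantitative measure, the sketch below computes percent identity over the columns of an already-aligned pair. The function name `percent_identity` and the choice of denominator (total alignment columns, gaps included) are assumptions made here; other conventions divide by the shorter sequence length or by ungapped columns only.

```python
def percent_identity(row_a: str, row_b: str) -> float:
    """Percent of alignment columns that hold two identical residues.

    Assumption: the denominator is the number of alignment columns,
    including columns that contain a gap ('-').
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    if not row_a:
        raise ValueError("alignment is empty")
    # A column counts as identical only if both residues match and neither is a gap.
    matches = sum(1 for a, b in zip(row_a, row_b) if a == b and a != "-")
    return 100.0 * matches / len(row_a)


# Illustrative aligned pairs (one DNA, one protein with a gap).
print(round(percent_identity("TTGACAC", "TTTACAC"), 1))  # 6 of 7 columns match
print(percent_identity("RKVA-GMA", "RKIAVAMA"))          # 5 of 8 columns match
```

Computing "similarity" for proteins would additionally need a rule for which non-identical residue pairs count as similar, which is what a substitution matrix supplies.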

14
What is Sequence Alignment?
  • Given 2 sequences of letters, and a scoring
    scheme for evaluating matching letters, find an
    optimal pairing of letters in one sequence to
    letters of other sequence.

Align:
1 THIS IS A RATHER LONGER SENTENCE THAN THE NEXT.
2 THIS IS A SHORT SENTENCE.
[Figure: two candidate alignments of sentence 2 against sentence 1, differing in where the spaces (gaps) are inserted.]
Is one of these alignments "optimal"? Which is
better?
15
Goal of Sequence Alignment
  • Find the best pairing of 2 sequences, such that
    there is maximum correspondence between residues
  • DNA: 4-letter alphabet (+ gap)
  • TTGACAC
  • TTTACAC
  • Proteins: 20-letter alphabet (+ gap)
  • RKVA-GMA
  • RKIAVAMA

16
Statement of Problem
  • Given:
  • 2 sequences
  • Scoring system for evaluating match (or mismatch)
    of two characters
  • Penalty function for gaps in sequences
  • Find: Optimal pairing of sequences that:
  • Retains the order of characters
  • Introduces gaps where needed
  • Maximizes total score

17
Types of Sequence Variation
  • Sequences can diverge from a common ancestor
    through various types of mutations:
  • Substitutions: ACGA → AGGA
  • Insertions: ACGA → ACCGA
  • Deletions: ACGA → AGA
  • Insertions or deletions ("indels") result in gaps
    in alignments
  • Substitutions result in mismatches
  • No change → match

18
Gaps
  • Indels of various sizes can occur in one sequence
    relative to the other
  • e.g., corresponding to a shortening of the
    polypeptide chain in a protein

19
Avoiding Random Alignments with a Scoring
Function
  • Introducing too many gaps generates nonsense
    alignments
  • s--e-----qu---en--ce
  • sometimesquipsentice
  • Need to distinguish between alignments that occur
    due to homology and those that occur by chance
  • Define a scoring function that accounts for
    mismatches and gaps

Scoring Function (F), e.g.:
  Match: w = 1
  Mismatch: x = 0
  Gap: y = -1
  F = w(# matches) + x(# mismatches) + y(# gaps)
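
A minimal sketch of such a scoring function, applied to the "nonsense" alignment of "sequence" against "sometimesquipsentice" from this slide. The function name `score_alignment` and its parameter names are assumptions; the default weights are the example values given above (match 1, mismatch 0, gap -1).

```python
def score_alignment(row_a, row_b, match=1, mismatch=0, gap=-1):
    """Sum per-column scores: F = w(#matches) + x(#mismatches) + y(#gaps)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    score = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            score += gap        # column contains a gap
        elif a == b:
            score += match      # identical characters
        else:
            score += mismatch   # substitution
    return score


# 8 matching letters and 12 gap columns: 8*1 + 12*(-1)
print(score_alignment("s--e-----qu---en--ce", "sometimesquipsentice"))  # -4
```

Every letter of "sequence" finds a match, yet the gap penalty drives the total below zero, which is how the scoring function flags the alignment as implausible.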
20
Not All Mismatches are the Same
  • Some amino acids are more "exchangeable" than
    others (physicochemical properties are similar)
  • e.g., Ser & Thr are more similar than Trp & Ala
  • Substitution matrix can be used to introduce
    "mismatch costs" for handling different types of
    substitutions
  • Mismatch costs are not usually used in aligning
    DNA or RNA sequences, because no substitution is
    "better" than any other (in general)

21
Substitution Matrix
  • s(a,b) corresponds to score of aligning character
    a with character b
  • Match scores are often calculated
  • based on frequency of mutations in very similar
    sequences
  • (more details later)
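
To make the idea of s(a,b) concrete, here is a small sketch of a symmetric lookup. All numeric values are hypothetical toy scores invented for illustration; they are not taken from PAM or BLOSUM, and the names `TOY_SCORES`, `DEFAULT_MATCH` and `DEFAULT_MISMATCH` are assumptions.

```python
# Hypothetical toy scores, invented for illustration (not PAM or BLOSUM values).
DEFAULT_MATCH = 5
DEFAULT_MISMATCH = -1
TOY_SCORES = {
    frozenset(("S", "T")): 1,   # Ser/Thr: physicochemically similar, mild score
    frozenset(("W", "A")): -3,  # Trp/Ala: dissimilar, stronger penalty
}


def s(a, b):
    """Score for aligning character a with character b (symmetric in a and b)."""
    key = frozenset((a, b))     # unordered pair, so s(a, b) == s(b, a)
    if key in TOY_SCORES:
        return TOY_SCORES[key]
    return DEFAULT_MATCH if a == b else DEFAULT_MISMATCH


print(s("S", "T"), s("T", "S"))  # same score in either order
print(s("W", "A"))               # less exchangeable pair scores lower
```

Keying on an unordered pair is one simple way to guarantee symmetry; a real matrix would list a score for every pair of the 20 amino acids.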

22
Methods
  • Global and Local Alignment
  • Alignment Algorithms
  • Dot Matrix Method
  • Dynamic Programming Method
  • Gap penalties
  • DP for Global Alignment
  • DP for Local Alignment
  • Scoring Matrices
  • Amino acid scoring matrices
  • PAM
  • BLOSUM
  • Comparisons between PAM & BLOSUM
  • Statistical Significance of Sequence Alignment

23
Global vs Local Alignment
  • Global alignment
  • Finds best possible alignment across entire
    length of 2 sequences
  • Aligned sequences assumed to be generally similar
    over entire length
  • Local alignment
  • Finds local regions with highest similarity
    between 2 sequences
  • Aligns these without regard for rest of sequence
  • Sequences are not assumed to be similar over
    entire length

24
Global vs Local Alignment - example
S = CTGTCGCTGCACG
T = TGCCGTG
Which is better?
25
Global vs Local Alignment: Which should be used
when?
  • Both are important
  • but it is critical to use right method for a
    given task!
  • Global alignment
  • Good for aligning closely related sequences of
    similar length
  • Not good for divergent sequences or sequences
    with different lengths
  • Local Alignment
  • Good for searching for conserved patterns
    (domains or motifs) in DNA or protein sequences
  • Not good for generating an alignment of closely
    related sequences
  • Global and local alignments are fundamentally
    similar; they differ only in the optimization
    strategy used to align similar residues

26
Alignment Algorithms
  • 3 major methods for pairwise sequence alignment
  • Dot matrix analysis
  • Dynamic programming
  • Word or k-tuple methods (later, in Chp 4)

27
Dot Matrix Method (Dot Plots)
  • Place 1 sequence along top row of matrix
  • Place 2nd sequence along left column of matrix
  • Plot a dot each time there is a match between an
    element of row sequence and an element of column
    sequence
  • For proteins, usually use more sophisticated
    scoring schemes than "identical match"
  • Diagonal lines indicate areas of match
  • Contiguous diagonal lines reveal alignment;
    "breaks" = gaps (indels)

28
Interpretation of Dot Plots
  • When comparing 2 sequences
  • Diagonal lines of dots indicate regions of
    similarity between 2 sequences
  • Reverse diagonals (perpendicular to diagonal)
    indicate inversions
  • What do similar patterns mean when comparing a
    sequence with itself (reverse complement)?
  • e.g. Reverse diagonals crossing diagonals (X's)
    indicate palindromes

Exploring Dot Plots
29
Dot Matrix Variations
  • Compare 2 sequences
  • Identify matching regions
  • Identities for DNA seqs
  • Similarities for protein seqs
  • Compare sequence with itself
  • Identify repeated regions
  • Identify inverted repeats
  • Identify palindromes
  • For long sequences?
  • Too many dots! Noisy!
  • Instead of per "residue," plot one dot per
    "window" of n matching residues to reduce noise

30
Strengths & Weaknesses of Dot Plots
  • Strengths
  • Fast and easy
  • Allows direct visual identification of regions of
    similarity
  • Repeats, inversions, etc. are readily apparent
  • Displays all possible matches
  • Weaknesses
  • Doesn't generate full alignment - user must
    "connect the diagonals"
  • No statistical assessment of quality of alignment
    (score)
  • Impractical and noisy for long sequences
  • Difficult to scale up to multiple alignment
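
The dot matrix method, including the windowing trick for reducing noise, can be sketched in a few lines. The function names `dot_plot` and `render` are assumptions, and so is the window rule used here (a dot is placed only where `window` consecutive residues match exactly); real dot-plot tools often allow a mismatch threshold within the window.

```python
def dot_plot(seq_top, seq_left, window=1):
    """Return the set of (row, col) cells where a window of residues matches.

    Assumption: a dot requires an exact match of `window` consecutive
    residues starting at seq_left[row] and seq_top[col].
    """
    dots = set()
    for r in range(len(seq_left) - window + 1):
        for c in range(len(seq_top) - window + 1):
            if seq_left[r:r + window] == seq_top[c:c + window]:
                dots.add((r, c))
    return dots


def render(seq_top, seq_left, dots):
    """Draw the matrix as text: '*' for a dot, '.' for an empty cell."""
    lines = ["  " + seq_top]
    for r, residue in enumerate(seq_left):
        cells = "".join("*" if (r, c) in dots else "." for c in range(len(seq_top)))
        lines.append(residue + " " + cells)
    return "\n".join(lines)


top, left = "CTGTCGCTGCACG", "TGCCGTG"
print(render(top, left, dot_plot(top, left, window=1)))  # noisy: every single match
print(render(top, left, dot_plot(top, left, window=3)))  # only 3-residue matches remain
```

With `window=1` every single-residue match is plotted; raising the window removes most chance matches, at the cost of missing short or imperfect diagonals.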

31
Dynamic Programming
For pairwise sequence alignment
Idea: Display one sequence above another with
spaces inserted in both to reveal similarity
  • A: C A T - T C A - C
  • B: C - T C G C A G C

32
Global alignment: Scoring
CTGTCG-CTGCACG
-TGC-CG-TG----
Reward for matches = α
Mismatch penalty = β
Space/gap penalty = γ
Score = αw - βx - γy
w = # matches, x = # mismatches, y = # spaces
33
Global alignment: Scoring
Reward for matches = 10
Mismatch penalty = 2
Space/gap penalty = 5
  C   T   G   T   C   G   -   C   T   G   C
  -   T   G   C   -   C   G   -   T   G   -
 -5  10  10  -2  -5  -2  -5  -5  10  10  -5
Total = 11
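
The per-column arithmetic on this slide can be checked with a short sketch. The function name `column_scores` is an assumption, and the gapped rows `CTGTCG-CTGC` and `-TGC-CG-TG-` are a reading of the 11-column alignment that is consistent with the listed column scores; the weights are the slide's (match +10, mismatch -2, space -5).

```python
def column_scores(row_a, row_b, match=10, mismatch=-2, space=-5):
    """Return the score contributed by each alignment column."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    scores = []
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            scores.append(space)      # one residue aligned to a space
        elif a == b:
            scores.append(match)
        else:
            scores.append(mismatch)
    return scores


scores = column_scores("CTGTCG-CTGC", "-TGC-CG-TG-")
print(scores)       # [-5, 10, 10, -2, -5, -2, -5, -5, 10, 10, -5]
print(sum(scores))  # 11
```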
34
Optimum Alignment
  • Score of an alignment is a measure of its quality
  • Optimum alignment problem: Given a pair of
    sequences X and Y, find an alignment (global or
    local) with maximum score

35
Alignment algorithms
  • Global: Needleman-Wunsch
  • Local: Smith-Waterman
  • Both NW and SW use dynamic programming
  • Variations
  • Gap penalty functions
  • Scoring matrices

36
Dynamic Programming (DP)
  • As computer science concept - formalized in early
    1950's by Bellman at RAND Corporation
  • "Frequently, however, there are only a
    polynomial number of subproblems... If we keep
    track of the solution to each subproblem solved,
    and simply look up the answer when needed, we
    obtain a polynomial-time algorithm."
  • ---- Aho, Hopcroft, Ullman
  • Reported to biologists for sequence alignment
    problems by Needleman & Wunsch, 1970

37
Key Idea
  • Score of the best possible alignment that ends
    at a given pair of positions (i,j) in two
    sequences is the score of the best alignment
    previous to those two positions PLUS the score
    for aligning those two positions

Next best alignment = previous best + local best
38
Problem Formulation and Notations
  • Given two sequences (strings)
  • X = x1 x2 ... xN of length N (e.g., x = AGC, N = 3)
  • Y = y1 y2 ... yM of length M (e.g., y = AAAC, M = 4)
  • Construct a matrix with (N+1) x (M+1) elements,
    where
  • S(i,j) = score of best alignment of
    x[1..i] = x1 x2 ... xi with y[1..j] = y1 y2 ... yj

39
Dynamic Programming 4 Components
  1. Recursive definition for optimal score
  2. Matrix for storing optimal scores of subproblems
  3. Bottom-up approach for filling the matrix, by
    solving smallest subproblems first
  4. Traceback of path through matrix to recover the
    optimal alignment(s) that gave the optimal score

40
Global Alignment Algorithm
41
Calculating Score of Optimum Alignment
S(i,j) satisfies the following relationships:
Initial conditions
Recursive definition: For 1 ≤ i ≤ N, 1 ≤ j ≤ M
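
One standard way to write these relationships is sketched below. It assumes a linear penalty of γ per space (γ > 0, a symbol chosen here) and a substitution score s(a,b) giving the match reward or mismatch penalty; the three branches correspond to the three ways the last column of an alignment can look.

```latex
\begin{aligned}
S(0,0) &= 0, \qquad S(i,0) = -i\,\gamma, \qquad S(0,j) = -j\,\gamma, \\[4pt]
S(i,j) &= \max
\begin{cases}
S(i-1,\,j-1) + s(x_i, y_j) & \text{($x_i$ aligns to $y_j$)} \\
S(i-1,\,j) - \gamma        & \text{($x_i$ aligns to a gap)} \\
S(i,\,j-1) - \gamma        & \text{($y_j$ aligns to a gap)}
\end{cases}
\qquad 1 \le i \le N,\ 1 \le j \le M .
\end{aligned}
```

As a quick check with match +10, mismatch -2 and γ = 5: if x1 = y1 = C, then S(1,1) = max(0 + 10, -5 - 5, -5 - 5) = 10.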
42
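Computing the best current score
[Figure: score matrix with one axis indexed 0, 1, ..., N and the other 0, 1, ..., M. Initialization sets S(0,0) = 0; recursion fills each S(i,j); the final entry is S(N,M).]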
43
What happens at the last step in the alignment of
x1..i to y1..j?
1 of 3 cases:
yj aligns to a gap
xi aligns to a gap
xi aligns to yj
44
DP Implementation - 3 steps
  1. Construct sequence vs sequence matrix and fill in
    from 0,0 to N,M, the best possible scores for
    alignments including the residues at i,j. Also,
    keep track of dependencies of scores (in a
    pointer matrix).
  2. For a global alignment of the sequences, find the
    score S(N,M)
  3. Trace back through pointer matrix to get the
    optimal alignment. Do this position by position
    to retrieve alignment of all residues of
    sequences, including gaps (i.e., repeat alignment
    calculations in reverse order, following path
    back through matrix, starting from S(N,M) for a
    global alignment).
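
The three steps above can be sketched as a compact global-alignment routine. This is an illustrative sketch rather than the lecture's own code: the function name `needleman_wunsch`, the pointer labels, and the tie-breaking order (diagonal, then up, then left) are assumptions, and a linear gap penalty is used with the scores from this lecture's worked example (match +10, mismatch -2, space -5).

```python
def needleman_wunsch(x, y, match=10, mismatch=-2, gap=-5):
    """Global alignment by dynamic programming with a linear gap penalty."""
    n, m = len(x), len(y)
    # Step 1: score matrix S and pointer matrix ptr, both (n+1) x (m+1).
    S = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):            # first column: x[:i] against all gaps
        S[i][0], ptr[i][0] = i * gap, "up"
    for j in range(1, m + 1):            # first row: y[:j] against all gaps
        S[0][j], ptr[0][j] = j * gap, "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = S[i - 1][j - 1] + (match if x[i - 1] == y[j - 1] else mismatch)
            up = S[i - 1][j] + gap       # x[i-1] aligned to a space
            left = S[i][j - 1] + gap     # y[j-1] aligned to a space
            best = max(diag, up, left)
            S[i][j] = best
            # Assumed tie-break: prefer diagonal, then up, then left.
            ptr[i][j] = "diag" if best == diag else ("up" if best == up else "left")
    # Step 2: the global score sits in the last cell.
    score = S[n][m]
    # Step 3: trace back from (n, m) to (0, 0), building the rows in reverse.
    row_x, row_y = [], []
    i, j = n, m
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "diag":
            row_x.append(x[i - 1]); row_y.append(y[j - 1]); i -= 1; j -= 1
        elif move == "up":
            row_x.append(x[i - 1]); row_y.append("-"); i -= 1
        else:
            row_x.append("-"); row_y.append(y[j - 1]); j -= 1
    return score, "".join(reversed(row_x)), "".join(reversed(row_y))


score, ax, ay = needleman_wunsch("CATTCAC", "CTCGCAGC")
print(score)  # 33
print(ax)     # CAT-TCA-C
print(ay)     # C-TCGCAGC
```

Storing one pointer per cell recovers a single optimal alignment; when several moves tie, a different tie-break would return a different alignment with the same score.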

45
Example
Case 1: Line up xi with yj (markers i-1, i on x; j-1, j on y)
  x: C A T T C A C
  y: C - T T C A G
Case 2: Line up xi with space (markers i-1, i on x; j on y)
  x: C A T T C A - C
  y: C - T T C A G -
Case 3: Line up yj with space (marker i on x; j-1, j on y)
  x: C A T T C A C -
  y: C - T T C A - G
46
         λ    C    T    C    G    C    A    G    C
    λ    0   -5  -10  -15  -20  -25  -30  -35  -40
    C
    A
    T
    T
    C
    A
    C
First entries computed: 10, 5
10 for match, -2 for mismatch, -5 for space
47
         λ    C    T    C    G    C    A    G    C
    λ    0   -5  -10  -15  -20  -25  -30  -35  -40
    C   -5   10    5    0   -5  -10  -15  -20  -25
    A  -10    5    8    3   -2   -7    0   -5  -10
    T  -15    0   15   10    5    0   -5   -2   -7
    T  -20   -5   10   13    8    3   -2   -7   -4
    C  -25  -10    5   20   15   18   13    8    3
    A  -30  -15    0   15   18   13   28   23   18
    C  -35  -20   -5   10   13   28   23   26   33
Traceback can yield both optimal alignments
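
To illustrate how a traceback can recover more than one optimal alignment, the sketch below follows every tied move instead of a single pointer. The function name `all_optimal_alignments` is an assumption; the sequences and scores are those of the matrix above (rows CATTCAC, columns CTCGCAGC, +10 / -2 / -5).

```python
def all_optimal_alignments(x, y, match=10, mismatch=-2, gap=-5):
    """Return the optimal global score and every alignment that achieves it."""
    n, m = len(x), len(y)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = i * gap
    for j in range(1, m + 1):
        S[0][j] = j * gap

    def sub(i, j):
        # Score for aligning x[i-1] with y[j-1].
        return match if x[i - 1] == y[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(S[i - 1][j - 1] + sub(i, j),
                          S[i - 1][j] + gap,
                          S[i][j - 1] + gap)

    results = []

    def walk(i, j, rev_x, rev_y):
        # rev_x / rev_y hold the alignment rows built so far, in reverse order.
        if i == 0 and j == 0:
            results.append((rev_x[::-1], rev_y[::-1]))
            return
        # Follow every predecessor whose score explains S[i][j].
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + sub(i, j):
            walk(i - 1, j - 1, rev_x + x[i - 1], rev_y + y[j - 1])
        if i > 0 and S[i][j] == S[i - 1][j] + gap:
            walk(i - 1, j, rev_x + x[i - 1], rev_y + "-")
        if j > 0 and S[i][j] == S[i][j - 1] + gap:
            walk(i, j - 1, rev_x + "-", rev_y + y[j - 1])

    walk(n, m, "", "")
    return S[n][m], results


score, alignments = all_optimal_alignments("CATTCAC", "CTCGCAGC")
print(score)  # 33
for row_x, row_y in alignments:
    print(row_x)
    print(row_y)
    print()
```

The number of co-optimal alignments can grow very quickly for longer or more repetitive sequences, so exhaustive enumeration like this is only practical for small examples.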
48
Affine Gap Penalty Functions
  • Gap penalty = h + gk
  • where
  • k = length of gap
  • h = gap opening penalty
  • g = gap continuation penalty

Can also be solved in O(nm) time using dynamic
programming
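
A small sketch of the affine penalty itself (not of the full O(nm) alignment algorithm): each maximal run of spaces in a row is charged h + g·k. The function names `affine_gap_penalty` and `total_gap_penalty` and the example values h = 5, g = 1 are assumptions for illustration.

```python
import re


def affine_gap_penalty(k, h, g):
    """Penalty for one gap of length k: opening cost h plus g per position."""
    if k < 1:
        raise ValueError("gap length must be at least 1")
    return h + g * k


def total_gap_penalty(row, h, g):
    """Sum the affine penalty over every maximal run of '-' in an aligned row."""
    return sum(affine_gap_penalty(len(run.group()), h, g)
               for run in re.finditer(r"-+", row))


row = "-TGC-CG-TG----"                   # gaps of length 1, 1, 1 and 4
print(total_gap_penalty(row, h=5, g=1))  # 3 * (5 + 1) + (5 + 4) = 27
print(total_gap_penalty(row, h=0, g=5))  # h = 0 reduces to a linear penalty: 7 * 5 = 35
```

With h > 0, one long gap is cheaper than several short gaps of the same total length, which reflects the view that a single indel event can insert or delete many residues at once.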