Algorithms for Pairwise Sequence Alignment - PowerPoint PPT Presentation

1 / 59

About This Presentation

Title:

Algorithms for Pairwise Sequence Alignment

Description:

D. Inversion. Influences on Variation. Rates of mutations are influenced by: ... While building the table, keep track of where optimal score came from, reverse arrows ... – PowerPoint PPT presentation

Number of Views:118

Avg rating:3.0/5.0

Slides: 60

Provided by: HYG1

Category:

more less

Transcript and Presenter's Notes

Title: Algorithms for Pairwise Sequence Alignment

1
Algorithms for Pairwise Sequence Alignment

Craig A. Struble, Ph.D.
Marquette University

2
Overview

Pairwise Sequence Alignment
Dynamic Programming Solution
Global Alignment
Local Alignment

3
Goals

Define the pairwise sequence alignment problem
Understand the difference between global and
local alignment
Understand the significance of pairwise sequence
alignment

4
Pairwise Sequence Alignment

Problem
Given two sequences (DNA or AA), line them up
in a biologically meaningful way.

HEAGAWGHE-E
HEAGAWGHEE
PAWHEAE
P-A--W-HEAE
5
Origins Of Similar Sequences
2
a
duplication
1
a1
a2
speciation
duplication
a1
a2
a2
a1
Species 2
Species 1
2
2
1
1
Transfer
Convergence
6
Why is comparing sequences important?

One of the fundamental phenomena explored by
bioinformatics, around which many tools are built
Databases, data selection, etc.
Researchers compare sequences in order to
infer the function of genes
infer the structure of genes and gene products
infer the evolutionary history of genes and
organisms
identify variation responsible for disease and
other complex phenotypes

7
Why is this a challenging problem?

Similar sequences contain variation
Sequences mutate over time
Mutations are spontaneous changes in sequence
caused by replication (or other) errors. Mutation
rates vary, and can be influenced by many
factors.
Sequence data contains errors
Sequencing techniques are imperfect

8
Four Basic Types of Mutations

Substitution

C. Insertion
Thr Tyr Leu Leu ACC TAT TTG CTG
Thr Tyr Leu Leu ACC TAT TTG CTG
ACC TCT TTG CTG Thr Ser Leu Leu
ACC TAC TTT GCT G-- Thr Tyr Phe Ala
B. Deletion
D. Inversion
Thr Tyr Leu Leu ACC TAT TTG CTG
Thr Tyr Leu Leu ACC TAT TTG CTG
ACC TTT ATG CTG Thr Phe Met Leu
ACC TAT TGC TG- Thr Tyr Cys
9
Influences on Variation

Rates of mutations are influenced by
Substitution class (transition/transversion)
Coding site (synonymous/nonsynonymous)
Length of insertion/deletion
Codon usage bias
Nucleotide consist (GC content)
Stability fate of variation depends upon
Drift
Selection (positive Darwinian/purifying, sexual,
artificial)
Other mutations (reversions are not uncommon)

10
Homology vs. Similarity

Homology is a discrete state pertaining to
relatedness - two genes are homologues if and
only if they share a commone gene ancestor
Orthologues in different organisms, a result of
speciation
Paralogues in the same organism, a result of
gene duplication
Homologues may have the same, similar, or
different functions
Similarity is a continuous state describing the
degree of to which two homologues share
characteristics
Generally a percentage
Distance estimates are also estimates of
similarity

11
Kinds of Alignments

The local alignment includes only regions of
identity (or strong similarity). The favors
finding conserved regions.
The global alignment is stretched over the entire
sequence length, including as many matches as
possible.

12
When do you choose local vs. global?

Choose local alignment when
DNA sequences encode genes with introns
Amino acid sequences encoding proteins
Choose a global alignment when
Sequences can be seen to be very similar
Similar regions are in the same order and
orientation

13
Methods Of Sequence Alignment

Dot matrix analysis
Dynamic programming algorithms
Word or k-tuple methods
BLAST, FASTA
Discussed later in the semester

14
Dot Matrix Analysis

Visualization of sequence similarity
First technique to use on pairs of sequences
Insertions/deletions
Inverted repeats
Does not show actual alignment
Optimal alignment not obvious

15
Simple Dot Matrix Example

For sequences
ATGCGTCGTT
ATCCGCGAT

A T G C G T C G T T
A T C C G C G A T

Steps
Arrange sequences on a matrix
Place a dot anywhere nucleotides match
Diagonal stretches (here indicated by a line) are
areas of alignment
More than one area of alignment can appear

16
DNA sequence matrix Noisy

Sequence alignment of 2 long DNA sequences
Many random matches make it difficult or
impossible to find areas of alignment
Using a window stringency setting, we can
eliminate some of the noise

17
DNA sequence matrix Less noisy

To decrease noise of random matches, a window of
11 nucleotides was defined, and a dot placed when
at least 7 matches occur
Window 11, Stringency 7
Some diagonal lines begin to appear

18
DNA sequence matrix Less noisy

To decrease noise of random matches, a window of
23 nucleotides was defined, and a dot placed when
at least 15 matches occur
Window 23, Stringency 15
A clear diagonal line appears, indicating an area
of alignment
A few other areas are still apparent - probably
long random matches

19
Protein sequence matrix Noisy

Sequence comparison of amino acid sequence (same
gene as previous example)
Window 1, stringency 1
To decrease noise due to random matches,
conditions can be tightened

20
Protein sequence matrix Less noisy

Same sequence comparison, tighter analysis
conditions
Window 3, stringency 2
A single aligned region is visible, with a number
of areas of random matches

21
Evidence of repeats in a DNA sequence
Window 1, stringency 1
Window 23, stringency 7
22
Programs for Dot Matrix Analysis

DNA Strider (Macintosh)
Dotter (Unix/Linux, X-Windows)
In the lab
COMPARE, DOTPLOT in GCG
PLALIGN (FASTA)
Plots alignments found by DP method
Dotlet
http//www.isrec.isb-sib.ch/java/dotlet/Dotlet.htm
l

23
Optimal Sequence Alignments

Example
Which one is better?

HEAGAWGHEE
PAWHEAE
HEAGAWGHE-E
HEAGAWGHE-E
P-A--W-HEAE
--P-AW-HEAE
24
Scoring

To compare two sequence alignments, calculate a
score
Scoring matrix
Provide a score for each match/mismatch
Sometimes a mismatch is acceptable
PAM, BLOSUM are two classes of scoring matrices
Gap penalty
Initiating a gap
Gap extension penalty
Extending a gap

25
Scoring Matrix Example

Gap penalty -8
Gap extension -4

HEAGAWGHE-E
--P-AW-HEAE
(-8) (-4) (-1) 5 15 (-8) 10 6
(-8) 6 13
HEAGAWGHE-E
Exercise Calculate for
P-A--W-HEAE
26
Formal Description

Problem PairSeqAlign
Input Two sequences x,y
Scoring matrix s
Gap penalty d
Gap extension penalty e
Output The optimal sequence alignment

27
How Difficult Is This?

Consider two sequences of length n
There are
possible global alignments, and we need to find
an optimal one from amongst those!

28
So what?

So at n 20, we have over 120 billion possible
alignments
We want to be able to align much, much longer
sequences
Some proteins have 1000 amino acids
Genes can have several thousand base pairs

29
Dynamic Programming

General algorithmic development technique
Reuses the results of previous computations
Store intermediate results in a table for reuse
Look up in table for earlier result to build from

30
Global Alignment

Needleman-Wunsch 1970
Idea Build up optimal alignment from optimal
alignments of subsequences
Three ways to align x1..i with y1..j

Extend both strings at the same time
xi already aligned, align yj with a gap
IGAxi LGVyj
AIG Axi GVyj--
GAxi-- SLG Vyj
yj already aligned, align xi with a gap
31
Global Alignment

Notation
xi ith letter of string x
yj jth letter of string y
x1..i Prefix of x from letters 1 through I
F matrix of optimal scores
F(i,j) represents optimal score lining up x1..i
with y1..j
d gap penalty
s scoring matrix

32
Global Alignment

The work is to build up F
Initialize F(0,0) 0, F(i,0) id, F(0,j)jd
Fill from top left to bottom right using the
recursive relation

33
Global Alignment
yj aligned to gap
Move ahead in both
s(xi,yj)
d
d
xi aligned to gap
While building the table, keep track of where
optimal score came from, reverse arrows
34
Example
35
Completed Table
36
Traceback

Trace arrows back from the lower right to top
left
Diagonal both
Up upper gap
Left lower gap

HEAGAWGHE-E --P-AW-HEAE
37
Summary

Uses recursion to fill in intermediate results
table
Uses O(nm) space and time
O(n2) algorithm
Feasible for moderate sized sequences, but not
for aligning whole genomes.

38
Local Alignment

Smith-Waterman (1981)
Another dynamic programming solution

39
Example
40
Traceback
Start at highest score and traceback to first 0
AWGHE AW-HE
41
Summary

Similar to global alignment algorithm
For this to work, expected match with random
sequence must have negative score.
Behavior is like global alignment otherwise
Similar extensions for repeated and overlap
matching
Care must be given to gap penalties to maintain
O(nm) time complexity

42
Scoring Matrices

Substitutions
Models of substitutions
PAM
BLOSUM
Gap penalties

43
DNA
44
Transitional and Transversional Nucleotide
Substitutions

? ? are rates of transitional and
transversional substitutions, respectively
Generally, ? gt ?
Possible substitutions (total 16)
Identical (freq O) 4
Transitions (P) 4
Transversions (Q) 8
Giving us
p P Q
R P/Q
R is usually between 0.5 and 2 for nuclear
genes, higher for mitochondrial genes (up to 15)

45
Synonymous and Non-synonymous substitutions
Non-synonymous
Synonymous
Thr Tyr Leu Leu ACC TAT TTG CTG
Thr Tyr Leu Leu ACC TAT TTG CTG
ACC TCT TTG CTG Thr Ser Leu Leu
ACC TAC TTG CTG Thr Tyr Leu Leu

Synonymous substitutions more likely to occur
Preserve AA

46
Categories of Amino Acids
Grouped according to properties of side chain
47
Amino Acid Substitutions

Tend to preserve chemical similarity
Tend to preserve structure
Tend to preserve function
More frequent in non-functional domains

48
Models of Substitution

Percept Accepted Mutation (PAM)
Dayhoff 1978
Accepted Mutation changes accepted by natural
selection
PAM1 represents evolutionary divergence where 1
of amino change
Blocks Amino Acid Substitution Matrices (BLOSUM)
Henikoff and Henikoff 1992
Observed AA substitutions in conserved AA blocks
Maximum level of identity, BLOSUM62 represents
62 identity

49
PAM

Markov model

Probability of transitioning from one state to
another
pstpts
S
T
C

State for amino acid
P
50
PAM

Assumes substitutions are independent
pxy is calculated from observations
1572 changes in 71 groups of proteins
Organized into phylogenetic trees
Changes counted
Divided by normalizing factor
The probabilities are stored in a matrix
Probability form
PAM1 represents 10 my evolutionary distance
PAMN is derived from PAM1N because Markov Model
is used

51
PAM1 for DNA
0.00333
0.00333
0.00333
0.99
52
BLOSUM

2000 conserved amino acid patterns
blocks ungapped patterns
3-60 AA long
gt500 families of related proteins
Software
MOTIF (H. Smith et al. 1990)
PROTOMAT (Henikoff and Henikoff)

53
Computing BLOSUM Scores

Consider all pairs (dont know ancestor)
fAA3216 fAL4 fAS4 fLS1
Calculate frequency of occurrence
qAAfAA/(fAAfALfASfLS) 0.4
Calculate expected frequency of being in a pair
pA(qAAqAS/2qAL/2)0.66
Calculate expected frequency of a pair
eAApApA0.44
Matrix entry for pair
mAA qAA/eAA 0.9

A
L
A
S
A
A
54
Log Odds Scoring

Each of the previous matrices are converted to
log odds matrices
DP algorithm based on addition
log(xy)log(x)log(y)
Compares real occurrence with random occurence.
BLOSUM
sAAlog2(qAA/eAA) 2 -0.304 (will be rounded)
PAM1 DNA (uniform)
sCT log2(pCMCT / pCpT)
log2(0.25 0.00333/ 0.252)
-6.23

55
The PAM250 Matrix
Each matrix value is calculated by first dividing
the frequency of change, for each amino acid
pair, in related proteins separated by one step
in an evolutionary tree by the probability of a
chance alignment based on the frequency of the
amino acids. The ratios are expressed as
logarithms to the base 10 (approx. 1/3 bit
values).

Note
High values on diagonal
High values for similar groups

56
The Blosum62 Matrix
Each entry is the actual frequency of occurrence
of the amino acid pair in the blocks database,
clustered at the 62 level, divided by the
expected probability of occurrence. The expected
value is calculated from the frequency of
occurrence of each of the two individual amino
acids in the blocks database,and provides a
measure of a chance alignment of the two amino
acids. The actual/expected ratio is expressed as
a log odds. A zero score means that the frequency
of the amino acid pair in the database was as
expected by chance, a positive score that the
pair was found more often than by chance, and a
negative score that the pair was found less often
than by chance.
57
Selecting Matrices

PAM
Mutational model of evolution
Tracks evolutionary origins of proteins/sequences
Use lower numbers for evolutionarily close
sequences, higher numbers for distance sequences
BLOSUM
No model of evolution, conserved AA motifs
Designed to find conserved domains
Similar sequences, use higher numbers,
Divergent sequences, use lower numbers.

58
GAP Penalties

Recall d is gap opening penalty, e is gap
extension penalty
Total gap penalty wxde(x-1)
In order to make things work properly, need
affine gap function (Smith et al. 1981)
wx dx
Any affine function works
For the linear function above, e d
Typical gap penalties (Mount p.113)
BLOSUM50 d15, e8-15
PAM250 d15, e5-15

59
Next Time