Sequence Similarity - PowerPoint PPT Presentation

About This Presentation

Title:

Sequence Similarity

Description:

PSI-BLAST. Given a sequence query x, and database D ... BLAST Variants. BLASTN genomic sequences. BLASTP proteins ... Score of the global map is maximum ... – PowerPoint PPT presentation

Number of Views:26

Avg rating:3.0/5.0

Slides: 43

Provided by: root

Learn more at: https://web.stanford.edu

Category:

more less

Transcript and Presenter's Notes

Title: Sequence Similarity

1
Sequence Similarity
2
The Viterbi algorithm for alignment

Compute the following matrices (DP)
M(i, j) most likely alignment of x1xi with
y1yj ending in state M
I(i, j) most likely alignment of x1xi with
y1yj ending in state I
J(i, j) most likely alignment of x1xi with
y1yj ending in state J
M(i, j) log( Prob(xi, yj) )
max M(i-1, j-1) log(1-2?),
I(i-1, j) log(1-?),
J(i, j-1) log(1-?)
I(i, j) max M(i-1, j) log ?,
I(i-1, j) log ?

log(1 2?)
M P(xi, yj)
log Prob(xi, yj)
log(1 ?)
log(1 ?)
log ?
log ?
I P(xi)
J P(yj)
log ?
log ?
3
One way to view the state paths State M

y1
yn
x1

xm
4
State I

y1
yn
x1

xm
5
State J

y1
yn
x1

xm
6
Putting it all together
States I(i, j) are connected with states J and M
(i-1, j) States J(i, j) are connected with
states I and M (i-1, j) States M(i, j) are
connected with states J and I (i-1, j-1)

y1
yn
x1

xm
7
Putting it all together
States I(i, j) are connected with states J and M
(i-1, j) States J(i, j) are connected with
states I and M (i-1, j) States M(i, j) are
connected with states J and I (i-1, j-1) Optimal
solution is the best scoring path from top-left
to bottom-right corner This gives the likeliest
alignment according to our HMM

y1
yn
x1

xm
8
Yet another way to represent this model
Ix
Ix
BEGIN
END
Iy
Iy
Mx1
Mxm
Sequence X
We are aligning, or threading, sequence Y through
sequence X Every time yj lands in state xi, we
get substitution score s(xi, yj) Every time yj
is gapped, or some xi is skipped, we pay gap
penalty
9
From this model, we can compute additional
statistics

P(xi yj x, y) The probability that positions
i, j align, given that sequences x and y align
P(xi yj x, y) ?a alignmentP(a x, y) 1(xi
yj in a)
We will not cover the details, but
this quantity can also be
calculated with DP

log(1 2?)
M P(xi, yj)
log Prob(xi, yj)
log(1 ?)
log(1 ?)
log ?
log ?
I P(xi)
J P(yj)
log ?
log ?
10
Fast database search BLAST

(Basic Local Alignment Search Tool)
Main idea
Construct a dictionary of all the words in the
query
Initiate a local alignment for each word match
between query and DB
Running Time O(MN)
However, orders of magnitude faster than
Smith-Waterman

query
DB
11
BLAST ? Original Version

Dictionary
All words of length k (11 nucl. 4 aa)
Alignment initiated between words of alignment
score ? T (typically T k)
Alignment
Ungapped extensions until score below
statistical threshold
Output
All local alignments with score gt statistical
threshold

query

scan
DB
query
12
PSI-BLAST

Given a sequence query x, and database D
Find all pairwise alignments of x to sequences in
D
Collect all matches of x to y with some minimum
significance
Construct position specific matrix M
Each sequence y is given a weight so that many
similar sequences cannot have much influence on a
position (Henikoff Henikoff 1994)
Using the matrix M, search D for more matches
Iterate 14 until convergence

Profile M
13
BLAST Variants

BLASTN genomic sequences
BLASTP proteins
BLASTX translated genome versus proteins
TBLASTN proteins versus translated genomes
TBLASTX translated genome versus translated
genome
PSIBLAST iterated BLAST search
http//www.ncbi.nlm.nih.gov/BLAST

14
Multiple Sequence Alignments
15
Protein Phylogenies

Proteins evolve by both duplication and species
divergence

16
(No Transcript)
17
(No Transcript)
18
Definition

Given N sequences x1, x2,, xN
Insert gaps (-) in each sequence xi, such that
All sequences have the same length L
Score of the global map is maximum
A faint similarity between two sequences becomes
significant if present in many
Multiple alignments can help improve the pairwise
alignments

19
Scoring Function Sum Of Pairs

Definition Induced pairwise alignment
A pairwise alignment induced by the multiple
alignment
Example
x AC-GCGG-C
y AC-GC-GAG
z GCCGC-GAG
Induces
x ACGCGG-C x AC-GCGG-C y AC-GCGAG
y ACGC-GAC z GCCGC-GAG z GCCGCGAG

20
Sum Of Pairs (contd)

Heuristic way to incorporate evolution tree

Human
Mouse
Duck
Chicken

Weighted SOP
S(m) ?kltl wkl s(mk, ml)
wkl weight decreasing with distance

21
A Profile Representation
- A G G C T A T C A C C T G T
A G C T A C C A - - - G C A G
C T A C C A - - - G C A G C
T A T C A C G G C A G C T A
T C G C G G A 1 1
.8 C .6 1 .4 1 .6
.2 G 1 .2 .2 .4
1 T .2 1 .6 .2 - .2
.8 .4 .8 .4

Given a multiple alignment M m1mn
Replace each column mi with profile entry pi
Frequency of each letter in ?
gaps
Optional gap openings, extensions, closings
Can think of this as a likelihood of each
letter in each position

22
Multiple Sequence Alignments

Algorithms

23
Multidimensional DP

Generalization of Needleman-Wunsh
S(m) ?i S(mi)
(sum of column scores)
F(i1,i2,,iN) Optimal alignment up to (i1, ,
iN)
F(i1,i2,,iN) max(all neighbors of
cube)(F(nbr)S(nbr))

24
Multidimensional DP

Example in 3D (three sequences)
7 neighbors/cell

F(i,j,k) max F(i-1,j-1,k-1)S(xi, xj,
xk), F(i-1,j-1,k )S(xi, xj, -
), F(i-1,j ,k-1)S(xi, -, xk), F(i-1,j
,k )S(xi, -, - ), F(i ,j-1,k-1)S( -,
xj, xk), F(i ,j-1,k )S( -, xj,
xk), F(i ,j ,k-1)S( -, -, xk)
25
Multidimensional DP

Running Time
Size of matrix LN
Where L length of each sequence
N number of sequences
Neighbors/cell 2N 1
Therefore O(2N LN)

26
Multidimensional DP

How do gap states generalize?
VERY badly!
Require 2N states, one per combination of
gapped/ungapped sequences
Running time O(2N ? 2N ? LN) O(4N LN)

Running Time
Size of matrix LN
Where L length of each sequence
N number of sequences
Neighbors/cell 2N 1
Therefore O(2N LN)

Y
YZ
XY
XYZ
Z
X
XZ
27
Progressive Alignment
x
pxy
y
z
pxyzw
pzw
w

When evolutionary tree is known
Align closest first, in the order of the tree
In each step, align two sequences x, y, or
profiles px, py, to generate a new alignment with
associated profile presult
Weighted version
Tree edges have weights, proportional to the
divergence in that edge
New profile is a weighted average of two old
profiles

28
Progressive Alignment
x
y
Example Profile (A, C, G, T, -) px (0.8, 0.2,
0, 0, 0) py (0.6, 0, 0, 0, 0.4) s(px, py)
0.80.6s(A, A) 0.20.6s(C, A)
0.80.4s(A, -) 0.20.4s(C, -) Result pxy
(0.7, 0.1, 0, 0, 0.2) s(px, -) 0.81.0s(A, -)
0.21.0s(C, -) Result px- (0.4, 0.1, 0, 0,
0.5)
z
w

When evolutionary tree is known
Align closest first, in the order of the tree
In each step, align two sequences x, y, or
profiles px, py, to generate a new alignment with
associated profile presult
Weighted version
Tree edges have weights, proportional to the
divergence in that edge
New profile is a weighted average of two old
profiles

29
Progressive Alignment
x
y
?
z
w

When evolutionary tree is unknown
Perform all pairwise alignments
Define distance matrix D, where D(x, y) is a
measure of evolutionary distance, based on
pairwise alignment
Construct a tree
Align on the tree

30
Heuristics to improve alignments

Iterative refinement schemes
A-based search
Consistency
Simulated Annealing

31
Iterative Refinement

One problem of progressive alignment
Initial alignments are frozen even when new
evidence comes
Example
x GAAGTT
y GAC-TT
z GAACTG
w GTACTG

Frozen!
Now clear correct y GA-CTT
32
Iterative Refinement

Algorithm (Barton-Stenberg)
For j 1 to N,
Remove xj, and realign to x1xj-1xj1xN
Repeat 4 until convergence

z
x
y
33
Iterative Refinement

Example align (x,y), (z,w), (xy, zw)
x GAAGTTA
y GAC-TTA
z GAACTGA
w GTACTGA
After realigning y
x GAAGTTA
y G-ACTTA 3 matches
z GAACTGA
w GTACTGA

Variant Refinement on a tree tree
partitioning
34
Iterative Refinement

Example align (x,y), (z,w), (xy, zw)
x GAAGTTA
y GAC-TTA
z GAACTGA
w GTACTGA
After realigning y
x GAAGTTA
y G-ACTTA 3 matches
z GAACTGA
w GTACTGA

35
Iterative Refinement

Example not handled well
x GAAGTTA
y1 GAC-TTA
y2 GAC-TTA
y3 GAC-TTA
z GAACTGA
w GTACTGA

Realigning any single yi changes nothing

36
Some Resources

http//www.ncbi.nlm.nih.gov/BLAST
BLAST PSI-BLAST
http//www.ebi.ac.uk/clustalw/
CLUSTALW most widely used
http//phylogenomics.berkeley.edu/cgi-bin/muscle/i
nput_muscle.py
MUSCLE most scalable
http//probcons.stanford.edu/
PROBCONS most accurate

37
MUSCLE at a glance

Fast measurement of all pairwise distances
between sequences
DDRAFT(x, y) defined in terms of common k-mers
(k3) O(N2 L logL) time
Build tree TDRAFT based on DDRAFT, with a
hierarchical clustering method (UPGMA)
Progressive alignment over TDRAFT, resulting in
multiple alignment MDRAFT
Measure distances D(x, y) based on MDRAFT
Build tree T based on D
Progressive alignment over T, to build M
Iterative refinement for many rounds, do
Tree Partitioning Split M on one branch and
realign the two resulting profiles
If new alignment M has better sum-of-pairs score
than previous one, accept

38
PROBCONS Probabilistic Consistency-based
Multiple Alignment of Proteins
xi
yj
MATCH
INSERT X
INSERT Y
?
yj
39
A pair-HMM model of pairwise alignment
x
ABRACA-DABRA AB-ACARDI---
y

Parameterizes a probability distribution, P(A),
over all possible alignments of all possible
pairs of sequences
Transition probabilities gap penalties
Emission probabilities substitution matrix

40
Computing Pairwise Alignments
P(a)
aviterbi
P(a x, y)

The Viterbi algorithm
conditional distribution P(a x, y) reflects
models uncertainty over the correct alignment
of x and y
identifies highest probability alignment,
aviterbi, in O(L2) time
Caveat the most likely alignment is not the most
accurate
Alternative find the alignment of maximum
expected accuracy

41
The Lazy-Teacher Analogy