LING124 Sequence alignment

Provided by: hahn7

1
LING124 Sequence alignment
  • September 23, 2008

2
Class outline
  • Sequence comparison
  • String
  • Vector
  • Sequence alignment
  • Dynamic programming (DP)
  • String alignment using DP algorithm
  • Exercise

3
Problem
  • Suppose a speaker made a mistake and said k r I p
  • b r I k
  • k r I b
  • k r I s p
  • r I p

4
Alignment
  • Identify the corresponding points

k r I p : b r I k
k r I p : k r I b
k r I p : k r I s p
k r I p : r I p

5
How?
  • Find the alignment such that point-wise
    comparison between the two gives you the minimal
    distance
  • For strings, constituent symbols will be compared
  • For speech sounds, each sound will be represented by a sequence of
    vectors (e.g. vectors of cepstral coefficients), and the
    constituent vectors will be compared

6
Levenshtein distance
  • Minimum number of edit operations required to
    derive one string from another
  • Substitution
  • Insertion
  • Deletion

7
Euclidean distance
  • X = <x1, x2, ..., xn>
  • Y = <y1, y2, ..., yn>
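The formula itself did not survive on this slide. The standard definition is the square root of the summed squared differences between corresponding components of X and Y. A minimal Python sketch under that assumption (the function name `euclidean` is illustrative, not from the slides):

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    # Sum the squared component-wise differences, then take the square root.
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


print(euclidean([0.0, 0.0], [3.0, 4.0]))  # 5.0
```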

8
A naive approach
  • List all possible alignments
  • Calculate the distance for each aligned pair
  • Identify the aligned pair with the minimal
    distance
  • What is wrong with this approach?
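One likely answer is that the number of candidate alignments grows very quickly with sequence length. The sketch below counts them, assuming each alignment is a path through the grid built from diagonal, up and left moves (the same three moves used in the trace-back slides). The name `count_alignments` is hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_alignments(n, m):
    """Count alignments of sequences of lengths n and m, where each step is
    a diagonal move (match/substitution), an up move or a left move."""
    if n == 0 or m == 0:
        return 1  # only one path is left: all ups or all lefts
    return (count_alignments(n - 1, m - 1)
            + count_alignments(n - 1, m)
            + count_alignments(n, m - 1))


for n, m in [(3, 4), (10, 10), (20, 20)]:
    print(n, m, count_alignments(n, m))
```

Even for the short strings in these slides (lengths 3 and 4) there are already 129 paths to enumerate, which is what motivates the dynamic programming approach.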

9
Dynamic programming
  • Break the problem into smaller overlapping
    sub-problems
  • Find the optimal answer to each sub-problem
  • Combine the optimal answers for the sub-problems
    to find the optimal answer for the whole problem

10
String alignment
  • Suppose
  • Two strings
  • X = k r I p
  • Y = r I p
  • Edit costs
  • ins = del = sub = 1
  • (n+1) × (m+1) matrix
  • n = length of Y
  • m = length of X

11
String alignment (2)
  • Fill in the costs
  • C(i, 0) = i (0 ≤ i ≤ n)
  • C(0, j) = j (0 ≤ j ≤ m)
  • For i in 1 to n
  • For j in 1 to m
  • sub = 0 if y_i = x_j
  • sub = 1 if y_i ≠ x_j
  • C(i, j) = the minimum of
  • C(i-1, j-1) + sub
  • C(i-1, j) + del
  • C(i, j-1) + ins
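The fill step can be sketched in Python as follows. This is a minimal illustration, assuming unit costs as on the previous slide, rows indexed by Y and columns by X; the function name `fill_costs` and its parameter names are not from the slides.

```python
def fill_costs(x, y, ins=1, dele=1, sub_cost=1):
    """Return the (n+1) x (m+1) cost matrix C, with n = len(y), m = len(x)."""
    n, m = len(y), len(x)
    C = [[0] * (m + 1) for _ in range(n + 1)]
    # First column and first row: i deletions or j insertions.
    for i in range(n + 1):
        C[i][0] = i * dele
    for j in range(m + 1):
        C[0][j] = j * ins
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Substitution is free when the two symbols are identical.
            sub = 0 if y[i - 1] == x[j - 1] else sub_cost
            C[i][j] = min(C[i - 1][j - 1] + sub,   # diagonal
                          C[i - 1][j] + dele,      # up
                          C[i][j - 1] + ins)       # left
    return C


C = fill_costs("krIp", "rIp")
for row in C:
    print(row)
print("distance:", C[-1][-1])  # distance: 1
```

The bottom-right cell holds the Levenshtein distance between the two strings.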

12
String alignment (3)
  • Find the shortest path from bottom-right to
    top-left
  • Trace back
  • Start from bottom-right
  • Move towards top-left
  • Identify adjacent cells
  • Move to the one with the shortest distance
  • Define trace-back priority in case of tie-score
    (e.g. left > up > diagonal)

13
String alignment (4)
  • Write the string backwards from right to left as
    you trace back
  • If you reached cell(i,j) by moving diagonally
  • Prefix Y with Yi
  • Prefix X with Xj
  • If you reached cell(i,j) by moving up
  • Prefix Y with Yi
  • Prefix X with _
  • If you reached cell(i,j) by moving left
  • Prefix Y with _
  • Prefix X with Xj
  • X = k r I p
  • Y = _ r I p
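The trace-back can be sketched in Python as below. This is one possible reading, not necessarily the procedure used in class: it assumes unit costs, and at each cell it only considers moves that actually account for that cell's cost, then breaks ties with the given priority. The names `fill_costs` and `align` are illustrative.

```python
def fill_costs(x, y):
    """Fill the (n+1) x (m+1) matrix with unit costs (ins = del = sub = 1)."""
    n, m = len(y), len(x)
    C = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        C[i][0] = i
    for j in range(m + 1):
        C[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if y[i - 1] == x[j - 1] else 1
            C[i][j] = min(C[i - 1][j - 1] + sub,
                          C[i - 1][j] + 1,
                          C[i][j - 1] + 1)
    return C


def align(x, y, priority=("diagonal", "up", "left")):
    """Trace back from bottom-right to top-left; return aligned (X, Y)."""
    C = fill_costs(x, y)
    i, j = len(y), len(x)
    out_x, out_y = "", ""
    while i > 0 or j > 0:
        # Collect the moves that explain the cost stored in cell (i, j).
        moves = set()
        if i > 0 and j > 0:
            sub = 0 if y[i - 1] == x[j - 1] else 1
            if C[i][j] == C[i - 1][j - 1] + sub:
                moves.add("diagonal")
        if i > 0 and C[i][j] == C[i - 1][j] + 1:
            moves.add("up")
        if j > 0 and C[i][j] == C[i][j - 1] + 1:
            moves.add("left")
        # Tie-break: take the first allowed move in priority order.
        move = next(p for p in priority if p in moves)
        if move == "diagonal":
            out_y, out_x = y[i - 1] + out_y, x[j - 1] + out_x
            i, j = i - 1, j - 1
        elif move == "up":
            out_y, out_x = y[i - 1] + out_y, "_" + out_x
            i -= 1
        else:  # left
            out_y, out_x = "_" + out_y, x[j - 1] + out_x
            j -= 1
    return out_x, out_y


ax, ay = align("krIp", "rIp")
print(" ".join(ax))  # k r I p
print(" ".join(ay))  # _ r I p
```

Restricting the choice to moves that reproduce the cell's cost is a deliberate design choice: simply stepping to the smallest neighbouring value can leave the optimal path when a tie involves a left or up neighbour.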

14
Exercise
  • Suppose
  • X = brick, Y = big
  • Edit costs ins = del = sub = 1
  • Trace-back priority diagonal > up > left
  • Create the matrix
  • Fill in the cells with Levenshtein distance
  • Retrace to derive the alignment

15
Feature-vector alignment
  • Similar to string alignment except
  • Point-wise comparison is typically based on
    Euclidean distance
  • Various methods to reduce the time and space
    complexity
  • More on this in class on Dynamic Time Warping
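
As a rough preview, the same matrix-filling idea can be applied to sequences of vectors. The sketch below uses the basic Dynamic Time Warping recurrence, which is an assumption here since the slides defer the details to a later class: each cell adds the Euclidean distance between two vectors to the cheapest of its three neighbouring cells. The names `euclidean` and `dtw_cost` are illustrative, and no complexity-reducing tricks are included.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def dtw_cost(xs, ys):
    """Accumulated cost of the best warping path between two sequences
    of feature vectors (basic DTW, no window or step constraints)."""
    n, m = len(ys), len(xs)
    inf = float("inf")
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(ys[i - 1], xs[j - 1])  # point-wise comparison
            D[i][j] = d + min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
    return D[n][m]


xs = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
ys = [(0.0, 0.0), (0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]  # first frame repeated
print(dtw_cost(xs, ys))  # 0.0
```

The full matrix costs time and space proportional to n × m, which is the cost the complexity-reducing methods mentioned above aim to cut.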

Figure from Salvador and Chan (2004)