CSE182-L4: Scoring matrices, Dictionary Matching - PowerPoint PPT Presentation

Description:

Title: L3: Blast: Keyword match basics Author: Vineet Bafna Last modified by: Vineet Bafna Created Date: 10/4/2005 8:32:21 AM Document presentation format – PowerPoint PPT presentation

Slides: 31

Provided by: Vine84

Learn more at: https://cseweb.ucsd.edu

Transcript and Presenter's Notes

1
CSE182-L4 Scoring matrices, Dictionary Matching
2
Class Mailing List

fa05_182_at_cs.ucsd.edu
To subscribe, send email to
fa05_182-subscribe_at_cs.ucsd.edu
You can subscribe from the course web page
Use the list for all course-related queries and
discussions.

3
Silly Quiz

Name a famous Bioinformatics Researcher

Name a famous Bioinformatics Researcher who is a
woman

4
Scoring DNA

DNA has structure.

5
DNA scoring matrices

So far, we considered a simple match/mismatch
criterion.
The nucleotides can be grouped into Purines (A,G)
and Pyrimidines (C,T).
Nucleotide substitutions within a group
(transitions) are more likely than those across a
group (transversions)
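
One way to turn the transition/transversion distinction into a scoring rule is sketched below. The numeric values (`match=1`, `transition=-1`, `transversion=-2`) and the function names are illustrative assumptions, not values from the lecture; the only requirement taken from the slide is that transitions are penalized less than transversions.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def dna_score(a, b, match=1, transition=-1, transversion=-2):
    """Score one aligned pair of nucleotides (the values are illustrative)."""
    if a == b:
        return match
    # A substitution within one group is a transition, across groups a transversion.
    same_group = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return transition if same_group else transversion

def score_ungapped(s, t):
    """Sum the pair scores of two equal-length, gap-free sequences."""
    return sum(dna_score(a, b) for a, b in zip(s, t))
```

For example, `score_ungapped("ACGT", "GCGA")` adds one transition (A/G), two matches, and one transversion (T/A).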

6
Scoring proteins

Scoring protein sequence alignments is a much
more complex task than scoring DNA
Not all substitutions are equal
Problem was first worked on by Pauling and
collaborators
In the 1970s, Margaret Dayhoff created the first
similarity matrices.
One size does not fit all
Homologous proteins which are evolutionarily
close should be scored differently than proteins
that are evolutionarily distant
Different proteins might evolve at different
rates and we need to normalize for that

7
PAM 1 distance

Two sequences are 1 PAM apart if they differ in 1%
of the residues.

[Figure: 1% mismatch]

PAM1(a,b) = Pr[residue b substitutes residue a,
when the sequences are 1 PAM apart]

8
PAM1 matrix

Align many proteins that are very similar
Is this a problem?
PAM1 distance is the probability of a
substitution when 1% of the residues have changed
Estimate the frequency P_{b|a} of residue a being
substituted by residue b.
S(a,b) = log10( P_ab / (P_a P_b) ) = log10( P_{b|a} / P_b )

9
PAM 1
10
PAM distance

Two sequences are 1 PAM apart when they differ in
1% of the residues.
When are 2 sequences 2 PAMs apart?

[Figure: 2 PAM]
11
Higher PAMs

PAM2(a,b) = Σ_c PAM1(a,c) · PAM1(c,b)
PAM2 = PAM1 · PAM1 (Matrix multiplication)
PAM250 = PAM1 · PAM249 = PAM1^250

12
Note: This is not the score matrix. What happens
as you keep increasing the power?
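
The matrix powers can be explored numerically with a small sketch. The 3-letter alphabet and the entries of `PAM1` below are made up for illustration (a real PAM1 matrix is 20 x 20 and estimated from alignments); rows are indexed by the original residue a and columns by the replacement b, so each row sums to 1.

```python
def mat_mul(X, Y):
    """Multiply two square matrices given as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(M, p):
    """Compute M**p by repeated squaring."""
    n = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = M
    while p > 0:
        if p & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        p >>= 1
    return result

# Hypothetical 1-PAM substitution probabilities on a toy 3-letter alphabet:
# 99% of the mass is on the diagonal, i.e. 1% of residues change.
PAM1 = [
    [0.990, 0.008, 0.002],
    [0.004, 0.990, 0.006],
    [0.003, 0.007, 0.990],
]

pam2 = mat_mul(PAM1, PAM1)        # PAM2 = PAM1 . PAM1
pam250 = mat_pow(PAM1, 250)       # PAM250 = PAM1^250
pam5000 = mat_pow(PAM1, 5000)     # a very high power, for comparison
```

In this toy example the rows of `pam5000` are nearly identical: after enough steps the probability of ending at residue b no longer depends on the starting residue a, which suggests why very high PAM powers carry little information for scoring.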
13
Scoring using PAM matrices

Suppose we know that two sequences are 250 PAMs
apart.
S(a,b) = log10( P_ab / (P_a P_b) ) = log10( P_{b|a} / P_b )
       = log10( PAM250(a,b) / P_b )
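
As a minimal sketch of the log-odds formula, the function below takes any substitution probability matrix `M` (for example a PAM250 matrix, with `M[i][j]` the probability that residue i is replaced by residue j) and background frequencies, and returns the scores. The 2-letter alphabet and the numbers in the example are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import math

def log_odds_scores(M, background, alphabet):
    """Return S(a,b) = log10(M(a,b) / P_b) for every ordered pair (a, b)."""
    scores = {}
    for i, a in enumerate(alphabet):
        for j, b in enumerate(alphabet):
            scores[(a, b)] = math.log10(M[i][j] / background[j])
    return scores

# Toy example: two residues X and Y with equal background frequency.
toy_matrix = [[0.9, 0.1],
              [0.1, 0.9]]
toy_scores = log_odds_scores(toy_matrix, [0.5, 0.5], "XY")
```

Here conserving X is more likely than chance (0.9 against 0.5), so `toy_scores[("X", "X")]` is positive, while replacing X by Y is less likely than chance, so `toy_scores[("X", "Y")]` is negative.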

14
BLOSUM series of Matrices

Henikoff & Henikoff: Sequence substitutions in
evolutionarily distant proteins do not seem to
follow the PAM distributions
A more direct method based on hand-curated
multiple alignments of distantly related proteins
from the BLOCKS database.
BLOSUM60: Merge all proteins that have greater
than 60% identity. Then, compute the substitution
probability.
In practice BLOSUM62 seems to work very well.
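
The merge step can be sketched as follows. This is only an illustration of grouping aligned sequences by percent identity; the single-linkage rule, the function names, and the example block are assumptions, and the subsequent counting of substitutions between clusters is not shown.

```python
def percent_identity(s, t):
    """Percentage of aligned positions at which two equal-length sequences agree."""
    if len(s) != len(t):
        raise ValueError("sequences in a block must have equal length")
    same = sum(1 for a, b in zip(s, t) if a == b)
    return 100.0 * same / len(s)

def cluster_block(seqs, threshold=60.0):
    """Merge sequences linked by more than `threshold` percent identity."""
    clusters = []
    for s in seqs:
        merged, rest = [s], []
        for c in clusters:
            # s joins every existing cluster that contains a close relative.
            if any(percent_identity(s, t) > threshold for t in c):
                merged.extend(c)
            else:
                rest.append(c)
        clusters = rest + [merged]
    return clusters

# Hypothetical block of three aligned segments.
clusters = cluster_block(["ACDEF", "ACDEG", "WYWYW"])
```

The first two segments are 80% identical and end up in one cluster, so they would contribute as a single sequence when substitution frequencies are counted.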

15
PAM vs. BLOSUM

What is the correspondence?
PAM1 Blosum1
PAM2 Blosum2
Blosum62
PAM250 Blosum100

16
Dictionary Matching, R.E. matching, and position
specific scoring
17
Dictionary Matching
dictionary: 1: POTATO  2: POTASSIUM  3: TASTE
database: P O T A S T P O T A T O

Q: Given k words (s_i has length l_i), and a
database of size n, find all matches to these
words in the database string.
How fast can this be done?

18
Dict. Matching & string matching

How fast can you do it, if you only had one word
of length m?
Trivial algorithm: O(nm) time
Pre-processing O(m), Search O(n) time.
Dictionary matching
Trivial algorithm: (l1+l2+l3+...)·n
Using a keyword tree: lp·n (lp is the length of
the longest pattern)
Aho-Corasick: O(n) after preprocessing O(l1+l2+...)
We will consider the most general case

19
Direct Algorithm
P O P O P O T A S T P O T A T O
P O T A T O
P O T A T O
P O T A T O
P O T A T O
P O T A T O

Observations
When we mismatch, we (should) know something
about where the next match will be.
When there is a mismatch, we (should) know
something about other patterns in the dictionary
as well.

20
The Trie Automaton

Construct an automaton A from the dictionary
A[v,x] describes the transition from node v to a
node w upon reading x.
A[u,T] = v, and A[u,S] = w
Special root node r
Some nodes are terminal, and labeled with the
index of the dictionary word.
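
A minimal sketch of building the trie automaton is shown below. The representation (a list of per-node dictionaries, node 0 as the root r, nodes numbered in order of creation) is an assumption made for illustration.

```python
def build_trie(words):
    """Return (A, terminal): A[v][x] is the node reached from v upon reading x."""
    A = [{}]        # node 0 is the root r
    terminal = {}   # node -> 1-based index of the dictionary word ending there
    for idx, word in enumerate(words, start=1):
        v = 0
        for x in word:
            if x not in A[v]:
                A.append({})             # create a new node
                A[v][x] = len(A) - 1
            v = A[v][x]
        terminal[v] = idx
    return A, terminal

A, terminal = build_trie(["POTATO", "POTASSIUM", "TASTE"])
```

With this numbering, node 4 spells POTA and branches on T (towards POTATO) and on S (towards POTASSIUM), which corresponds to the node u described above.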

[Figure: keyword tree for the dictionary 1: POTATO, 2: POTASSIUM, 3: TASTE, with root r, nodes u, v, w, and terminal nodes labeled 1, 2, 3]
21
An O(lp·n) algorithm for keyword matching

Start with the first position in the db, and the
root node.
If successful transition
    Increment current pointer
    Move to a new node
    If terminal node, report success
Else
    Retract current pointer
    Increment start pointer
    Move to root and repeat
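
The steps above can be sketched as follows. The trie representation and the function names are assumptions; the loop over `start` plays the role of the start pointer, and `cur` is the current pointer that is retracted after each failure.

```python
def build_trie(words):
    """Return (A, terminal): A[v][x] is the child of node v on character x."""
    A, terminal = [{}], {}
    for idx, word in enumerate(words, start=1):
        v = 0
        for x in word:
            if x not in A[v]:
                A.append({})
                A[v][x] = len(A) - 1
            v = A[v][x]
        terminal[v] = idx
    return A, terminal

def keyword_tree_search(words, db):
    """Walk the trie from every start position: at most lp steps per start."""
    A, terminal = build_trie(words)
    hits = []
    for start in range(len(db)):          # increment start pointer
        v, cur = 0, start                 # move to root, retract current pointer
        while cur < len(db) and db[cur] in A[v]:
            v = A[v][db[cur]]             # successful transition
            cur += 1
            if v in terminal:             # terminal node: report a match
                hits.append((start, terminal[v]))
    return hits

hits = keyword_tree_search(["POTATO", "POTASSIUM", "TASTE"], "POTASTPOTATO")
```

Each start position costs at most lp transitions, which gives the lp·n bound.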

22
Illustration
P O T A S T P O T A T O
[Figure: keyword tree with the current node v and terminal node 1 marked]
23
Idea for improving the time

Suppose we have partially matched pattern i
(indicated by l, and c), but fail subsequently.
If some other pattern j is to match
Then prefix(pattern j) = suffix(first c-l
characters of pattern i)

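[Figure: database P O T A S T P O T A T O with pointers l and c; pattern i (P O T A S S I U M) is partially matched, and pattern j (T A S T E) aligns with a suffix of the matched part. Dictionary: 1: POTATO, 2: POTASSIUM, 3: TASTE]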
24
Improving speed of dictionary matching

Every node v corresponds to a string s_v that is a
prefix of some pattern.
Define F[v] to be the node u such that s_u is the
longest proper suffix of s_v
If we fail to match at v, we should jump to F[v],
and commence matching from there
Let lp[v] = |s_u|
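
One common way to compute the failure nodes is a breadth-first traversal of the trie, sketched below. The slide does not say how the failure nodes are computed, so the construction, the trie representation, and the names are assumptions.

```python
from collections import deque

def build_trie(words):
    """Return A, where A[v][x] is the child of node v on character x (root is 0)."""
    A = [{}]
    for word in words:
        v = 0
        for x in word:
            if x not in A[v]:
                A.append({})
                A[v][x] = len(A) - 1
            v = A[v][x]
    return A

def failure_links(A):
    """Return (F, lp): F[v] is the failure node of v and lp[v] is the length of its string."""
    F = [0] * len(A)
    depth = [0] * len(A)
    queue = deque()
    for child in A[0].values():           # depth-1 nodes fail to the root
        depth[child] = 1
        queue.append(child)
    while queue:
        v = queue.popleft()
        for x, w in A[v].items():
            depth[w] = depth[v] + 1
            u = F[v]
            while u != 0 and x not in A[u]:
                u = F[u]                  # try shorter and shorter suffixes
            F[w] = A[u].get(x, 0)
            queue.append(w)
    lp = [depth[F[v]] for v in range(len(A))]
    return F, lp

A = build_trie(["POTATO", "POTASSIUM", "TASTE"])
F, lp = failure_links(A)
```

With nodes numbered in order of creation, node 7 spells POTAS and its failure node spells TAS (node 14), so `lp[7]` is 3.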

[Figure: keyword tree with numbered nodes (1 to 11)]
25
An O(n) alg. for keyword matching

Start with the first position in the db, and the
root node.
If successful transition
    Increment current pointer
    Move to a new node
    If terminal node, report success
Else (if at root)
    Increment current pointer
    Move start pointer
    Move to root
Else
    Move start pointer forward
    Move to failure node
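
The steps above can be sketched as a compact Aho-Corasick style search. The construction of the failure nodes and all names are assumptions. One deviation from the slide is flagged in the comments: to report dictionary words that end inside a longer partial match, the sketch walks the chain of failure nodes after each transition, which is simple but can cost more than O(n) in the worst case.

```python
from collections import deque

def build_automaton(words):
    """Return (A, terminal, F): trie transitions, terminal labels, failure nodes."""
    A, terminal, F = [{}], {}, [0]
    for idx, word in enumerate(words, start=1):
        v = 0
        for x in word:
            if x not in A[v]:
                A.append({})
                F.append(0)
                A[v][x] = len(A) - 1
            v = A[v][x]
        terminal[v] = idx
    queue = deque(A[0].values())          # depth-1 nodes keep F = root
    while queue:
        v = queue.popleft()
        for x, w in A[v].items():
            u = F[v]
            while u != 0 and x not in A[u]:
                u = F[u]
            F[w] = A[u].get(x, 0)
            queue.append(w)
    return A, terminal, F

def aho_corasick_search(words, db):
    A, terminal, F = build_automaton(words)
    hits, v, c = [], 0, 0                 # start pointer is implicit: l = c - depth(v)
    while c < len(db):
        x = db[c]
        if x in A[v]:                     # successful transition
            v = A[v][x]
            c += 1
            u = v
            while u != 0:                 # simplification: scan suffixes for matches
                if u in terminal:
                    idx = terminal[u]
                    hits.append((c - len(words[idx - 1]), idx))
                u = F[u]
        elif v == 0:                      # failure at the root
            c += 1
        else:                             # failure elsewhere: jump to failure node
            v = F[v]
    return hits

hits = aho_corasick_search(["POTATO", "POTASSIUM", "TASTE"], "POTASTPOTATO")
```

On the example database the search fails at POTAS, jumps to the node for TAS without moving the current pointer backwards, and later reports POTATO at position 6.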

26
Illustration
P O T A S T P O T A T O
[Figure: pointers l and c on the database, and the keyword tree for POTATO, POTASSIUM, and TASTE with the current node v]
27
Time analysis

In each step, either c is incremented, or l is
incremented
Neither pointer is ever decremented (lp[v] < c-l).
l and c do not exceed n
Total time < 2n

[Figure: pointers l and c on the database P O T A S T P O T A T O]
28
Blast: Putting it all together

Input: Query of length m, database of size n
Select word-size, scoring matrix, gap penalties,
E-value cutoff

29
Blast: Steps

Generate an automaton of all query keywords.
Scan database using a Dictionary Matching
algorithm (O(n) time). Identify all hits.
Extend each hit using a variant of local
alignment algorithm. Use the scoring matrix and
gap penalties.
For each alignment with score S, compute the
bit-score, E-value, and the P-value. Sort
according to increasing E-value until the cut-off
is reached.
Output results.
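
The first three steps can be illustrated with a heavily simplified sketch. Several assumptions are made here that are not in the slides: a hash table of query words stands in for the automaton, extension is ungapped with a plain match/mismatch score instead of a scoring matrix and gap penalties, and no bit-score or E-value is computed. All names and parameter values are hypothetical.

```python
def query_words(query, w):
    """Map every length-w word of the query to its start positions."""
    words = {}
    for i in range(len(query) - w + 1):
        words.setdefault(query[i:i + w], []).append(i)
    return words

def find_hits(query, db, w):
    """Return (query position, database position) for every shared word."""
    words = query_words(query, w)
    hits = []
    for j in range(len(db) - w + 1):
        for i in words.get(db[j:j + w], []):
            hits.append((i, j))
    return hits

def extend_ungapped(query, db, i, j, w, match=1, mismatch=-1):
    """Extend hit (i, j) in both directions without gaps, keeping the best score."""
    def walk(qi, dj, step):
        best = score = best_len = length = 0
        while 0 <= qi < len(query) and 0 <= dj < len(db):
            score += match if query[qi] == db[dj] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            qi += step
            dj += step
        return best, best_len
    right, rlen = walk(i + w, j + w, +1)
    left, llen = walk(i - 1, j - 1, -1)
    # Return the score and (query start, database start, aligned length).
    return w * match + left + right, (i - llen, j - llen, w + llen + rlen)

query, db = "GATTACA", "CCGATTACCA"
hits = find_hits(query, db, 3)
best = extend_ungapped(query, db, 0, 2, 3)
```

In this toy run the word GAT hits at query position 0 and database position 2, and extension grows it to the 6-character match GATTAC.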

30
Protein Sequence Analysis

What can you do if BLAST does not return a hit?
Sometimes, homology (evolutionary similarity)
exists at very low levels of sequence similarity.

A: Accept hits at higher P-value.
This increases the probability that the sequence
similarity is a chance event.
How can we get around this paradox?
Reformulated Q: Suppose two sequences B, C have
the same level of sequence similarity to sequence
A. If A & B are related in function, can we assume
that A & C are? If not, how can we distinguish?
