Title: CSE182-L4: Scoring matrices, Dictionary Matching
1CSE182-L4 Scoring matrices, Dictionary Matching
2Class Mailing List
- fa05_182_at_cs.ucsd.edu
- To subscribe, send email to
- fa05_182-subscribe_at_cs.ucsd.edu
- You can subscribe from the course web page
- Use the list for all course-related queries and discussions.
3Silly Quiz
- Name a famous Bioinformatics Researcher
- Name a famous Bioinformatics Researcher who is a
woman
4Scoring DNA
5DNA scoring matrices
- So far, we considered a simple match/mismatch criterion.
- The nucleotides can be grouped into purines (A, G) and pyrimidines (C, T).
- Nucleotide substitutions within a group (transitions) are more likely than those across groups (transversions).
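The grouping above can be turned into a small scoring function. This is only a sketch: the numeric scores (`match`, `transition`, `transversion`) and the function names are illustrative assumptions, not values from the lecture.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def dna_score(a, b, match=1, transition=-1, transversion=-2):
    """Score one pair of nucleotides (the default scores are assumed, not canonical)."""
    a, b = a.upper(), b.upper()
    if a == b:
        return match
    # A substitution within the same chemical group is a transition.
    same_group = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return transition if same_group else transversion

def score_ungapped(s, t):
    """Sum the pair scores of two equal-length sequences aligned without gaps."""
    if len(s) != len(t):
        raise ValueError("sequences must have the same length")
    return sum(dna_score(a, b) for a, b in zip(s, t))
```

Penalizing transversions more heavily than transitions reflects the observation that transitions are the more likely substitution.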
6Scoring proteins
- Scoring protein sequence alignments is a much more complex task than scoring DNA
- Not all substitutions are equal
- Problem was first worked on by Pauling and collaborators
- In the 1970s, Margaret Dayhoff created the first similarity matrices.
- One size does not fit all
- Homologous proteins which are evolutionarily close should be scored differently than proteins that are evolutionarily distant
- Different proteins might evolve at different rates and we need to normalize for that
7PAM 1 distance
- Two sequences are 1 PAM apart if they differ in 1% of the residues.
[Figure: two aligned sequences with 1% mismatch]
- PAM1(a,b) = Pr[residue b substitutes residue a, when the sequences are 1 PAM apart]
8PAM1 matrix
- Align many proteins that are very similar
- Is this a problem?
- PAM1 distance is the probability of a substitution when 1% of the residues have changed
- Estimate the frequency $P_{b|a}$ of residue a being substituted by residue b.
- $S(a,b) = \log_{10}\frac{P_{ab}}{P_a P_b} = \log_{10}\frac{P_{b|a}}{P_b}$
9PAM 1
10PAM distance
- Two sequences are 1 PAM apart when they differ in 1% of the residues.
- When are 2 sequences 2 PAMs apart?
[Figure: sequences 2 PAM apart]
11Higher PAMs
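- $\mathrm{PAM2}(a,b) = \sum_c \mathrm{PAM1}(a,c) \cdot \mathrm{PAM1}(c,b)$
- $\mathrm{PAM2} = \mathrm{PAM1} \cdot \mathrm{PAM1}$ (matrix multiplication)
- $\mathrm{PAM250} = \mathrm{PAM1} \cdot \mathrm{PAM249} = \mathrm{PAM1}^{250}$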
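The matrix-power view of higher PAMs can be sketched in a few lines of plain Python. The 3-letter matrix `TOY_PAM1` is a made-up example (real PAM matrices are 20 x 20 and estimated from alignments), and `mat_mul` / `mat_pow` are hypothetical helper names.

```python
def mat_mul(X, Y):
    """Multiply two square matrices given as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(M, power):
    """Compute M**power by repeated squaring."""
    n = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = [row[:] for row in M]
    while power > 0:
        if power & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power >>= 1
    return result

# Assumed toy transition matrix: each residue stays unchanged with probability 0.99.
TOY_PAM1 = [
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
    [0.005, 0.005, 0.990],
]

pam2 = mat_pow(TOY_PAM1, 2)      # PAM2 = PAM1 . PAM1
pam250 = mat_pow(TOY_PAM1, 250)  # PAM250 = PAM1^250
```

Each row of the result is still a probability distribution. For this toy matrix, the rows drift toward the uniform distribution as the power grows.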
12Note: This is not the score matrix. What happens
as you keep increasing the power?
13Scoring using PAM matrices
- Suppose we know that two sequences are 250 PAMs apart.
- $S(a,b) = \log_{10}\frac{P_{ab}}{P_a P_b} = \log_{10}\frac{P_{b|a}}{P_b} = \log_{10}\frac{\mathrm{PAM250}(a,b)}{P_b}$
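As a sketch of the log-odds formula, the snippet below turns a transition matrix and background frequencies into scores. The two-letter alphabet, the matrix `toy_pam_n`, and the frequencies `toy_background` are invented for illustration; `log_odds_scores` is a hypothetical name.

```python
import math

def log_odds_scores(pam_n, background):
    """Return S(a,b) = log10(PAM_n(a,b) / P_b) for every ordered pair (a, b)."""
    letters = sorted(background)
    return {(a, b): math.log10(pam_n[a][b] / background[b])
            for a in letters for b in letters}

# Assumed toy values: pam_n[a][b] is the probability that a is replaced by b.
toy_pam_n = {
    "X": {"X": 0.8, "Y": 0.2},
    "Y": {"X": 0.4, "Y": 0.6},
}
# Background frequencies chosen so that P_a * PAM_n(a,b) is symmetric in a and b.
toy_background = {"X": 2 / 3, "Y": 1 / 3}

S = log_odds_scores(toy_pam_n, toy_background)
```

A positive score means the pair occurs more often than expected by chance, and a negative score means it occurs less often.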
14BLOSUM series of Matrices
- Henikoff & Henikoff: Sequence substitutions in evolutionarily distant proteins do not seem to follow the PAM distributions
- A more direct method based on hand-curated multiple alignments of distantly related proteins from the BLOCKS database.
- BLOSUM60: Merge all proteins that have greater than 60% identity. Then, compute the substitution probability.
- In practice, BLOSUM62 seems to work very well.
15PAM vs. BLOSUM
- What is the correspondence?
- PAM1 Blosum1
- PAM2 Blosum2
- Blosum62
- PAM250 Blosum100
16Dictionary Matching, R.E. matching, and position
specific scoring
17Dictionary Matching
Dictionary: 1: POTATO, 2: POTASSIUM, 3: TASTE
Database: P O T A S T P O T A T O
- Q: Given k words (word s_i has length l_i), and a database of size n, find all matches to these words in the database string.
- How fast can this be done?
18Dict. Matching & string matching
- How fast can you do it, if you only had one word of length m?
- Trivial algorithm: O(nm) time
- Pre-processing O(m), search O(n) time.
- Dictionary matching
- Trivial algorithm: (l1 + l2 + l3) n
- Using a keyword tree: lp n (lp is the length of the longest pattern)
- Aho-Corasick: O(n) after preprocessing O(l1 + l2 + ..)
- We will consider the most general case
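For reference, the trivial dictionary-matching algorithm can be sketched as below. The function name and the (start position, word index) output format are assumptions made for this example.

```python
def naive_dictionary_match(words, database):
    """Try every word at every database position; O((l1 + ... + lk) * n) time."""
    hits = []
    for index, word in enumerate(words, start=1):
        for start in range(len(database) - len(word) + 1):
            if database[start:start + len(word)] == word:
                hits.append((start, index))  # 0-based start, 1-based word index
    return sorted(hits)

dictionary = ["POTATO", "POTASSIUM", "TASTE"]
hits = naive_dictionary_match(dictionary, "POTASTPOTATO")
```

On the lecture's example only POTATO occurs in the database, starting at 0-based position 6.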
19Direct Algorithm
P O P O P O T A S T P O T A T O
P O T A T O
  P O T A T O
    P O T A T O
      P O T A T O
        P O T A T O
- Observations
- When we mismatch, we (should) know something about where the next match will be.
- When there is a mismatch, we (should) know something about other patterns in the dictionary as well.
20The Trie Automaton
- Construct an automaton A from the dictionary
- A[v,x] describes the transition from node v to a node w upon reading x.
- A[u,T] = v, and A[u,S] = w
- Special root node r
- Some nodes are terminal, and labeled with the index of the dictionary word.
[Figure: keyword tree for the dictionary 1: POTATO, 2: POTASSIUM, 3: TASTE, with root r, nodes u, v, w, and terminal nodes labeled 1, 2, 3]
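A minimal sketch of building the trie automaton follows. The representation is an assumption made for this example: nodes are integers, the root r is node 0, `A[v]` is a dict from characters to child nodes, and `terminal` maps a node to its dictionary index.

```python
def build_trie(words):
    """Return (A, terminal) where A[v][x] is the node reached from v on reading x."""
    A = [{}]       # node 0 is the root r
    terminal = {}  # terminal node -> 1-based index of the dictionary word
    for index, word in enumerate(words, start=1):
        v = 0
        for x in word:
            if x not in A[v]:
                A.append({})          # create a new node
                A[v][x] = len(A) - 1
            v = A[v][x]
        terminal[v] = index
    return A, terminal

A, terminal = build_trie(["POTATO", "POTASSIUM", "TASTE"])

# Walk down "POTA" to reach the branching node, called u in the slide.
u = 0
for x in "POTA":
    u = A[u][x]
```

From `u`, both `A[u]["T"]` and `A[u]["S"]` are defined, matching the transitions A[u,T] = v and A[u,S] = w.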
21An O(lpn) algorithm for keyword matching
- Start with the first position in the db, and the root node.
- If successful transition
  - Increment current pointer
  - Move to a new node
  - If terminal node: success
- Else
  - Retract current pointer
  - Increment start pointer
  - Move to root and repeat
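The steps above can be sketched as follows. The trie representation (integer nodes, root 0, dict-based transitions) and the function names are assumptions made for this example.

```python
def build_trie(words):
    """Build a keyword tree: A[v][x] is the child of node v on character x."""
    A, terminal = [{}], {}
    for index, word in enumerate(words, start=1):
        v = 0
        for x in word:
            if x not in A[v]:
                A.append({})
                A[v][x] = len(A) - 1
            v = A[v][x]
        terminal[v] = index
    return A, terminal

def trie_scan(words, database):
    """Restart from the root at every start position; O(lp * n) time overall."""
    A, terminal = build_trie(words)
    hits = []
    for start in range(len(database)):   # start pointer
        v, current = 0, start            # current pointer is retracted to start
        while current < len(database) and database[current] in A[v]:
            v = A[v][database[current]]  # successful transition
            current += 1
            if v in terminal:
                hits.append((start, terminal[v]))
    return hits

hits = trie_scan(["POTATO", "POTASSIUM", "TASTE"], "POTASTPOTATO")
```

From each start position the scan follows at most lp transitions, which gives the lp n bound.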
22Illustration
Database: P O T A S T P O T A T O
[Figure: scanning the database with the keyword tree]
23Idea for improving the time
- Suppose we have partially matched pattern i (indicated by l and c), but fail subsequently. If some other pattern j is to match, then a prefix of pattern j = a suffix of the first c-l characters of pattern i.
[Figure: pointers l and c on the database]
Database: P O T A S T P O T A T O
Pattern i: P O T A S S I U M
Pattern j: T A S T E
Dictionary: 1: POTATO, 2: POTASSIUM, 3: TASTE
24Improving speed of dictionary matching
- Every node v corresponds to a string s_v that is a prefix of some pattern.
- Define F[v] to be the node u such that s_u is the longest proper suffix of s_v
- If we fail to match at v, we should jump to F[v], and commence matching from there
- Let lp[v] = |s_u|
[Figure: keyword tree with nodes numbered 1 to 11]
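To make the definition of F[v] concrete, the sketch below computes it directly from the definition, identifying each node v with its string s_v. This brute-force approach and the name `failure_links` are assumptions chosen for readability. An efficient implementation would compute the links by a breadth-first traversal of the tree instead.

```python
def failure_links(words):
    """Map each non-root node string s_v to s_u, where u = F[v]."""
    # Every prefix of every pattern is a node; the empty string is the root.
    nodes = {word[:k] for word in words for k in range(len(word) + 1)}
    F = {}
    for s in nodes:
        if s == "":
            continue
        # Try proper suffixes from longest to shortest; the root always qualifies.
        for k in range(1, len(s) + 1):
            if s[k:] in nodes:
                F[s] = s[k:]
                break
    return F

F = failure_links(["POTATO", "POTASSIUM", "TASTE"])
lp = {s: len(u) for s, u in F.items()}  # lp[v] = |s_u|
```

For example, the node for POTAS fails to the node for TAS, so a mismatch after POTAS resumes matching inside TASTE.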
25An O(n) alg. For keyword matching
- Start with the first position in the db, and the root node.
- If successful transition
  - Increment current pointer
  - Move to a new node
  - If terminal node: success
- Else (if at root)
  - Increment current pointer
  - Move start pointer
  - Move to root
- Else
  - Move start pointer forward
  - Move to failure node
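A compact sketch of the failure-link search, in the style of Aho-Corasick, is given below. Several details are assumptions that go beyond the slide: integer nodes with root 0, failure links computed breadth-first, and an `out` list per node that also collects words ending at its failure node. That last addition reports patterns that are suffixes of longer partial matches.

```python
from collections import deque

def build_automaton(words):
    """Return (A, F, out): transitions, failure links, and words ending at each node."""
    A, out = [{}], [[]]
    for index, word in enumerate(words, start=1):
        v = 0
        for x in word:
            if x not in A[v]:
                A.append({})
                out.append([])
                A[v][x] = len(A) - 1
            v = A[v][x]
        out[v].append((index, len(word)))
    F = [0] * len(A)                 # depth-1 nodes fail to the root
    queue = deque(A[0].values())
    while queue:                     # breadth-first: shallower nodes are done first
        v = queue.popleft()
        for x, w in A[v].items():
            u = F[v]
            while u and x not in A[u]:
                u = F[u]
            F[w] = A[u].get(x, 0)
            out[w] = out[w] + out[F[w]]
            queue.append(w)
    return A, F, out

def search(words, database):
    """Scan the database once, reporting (start position, word index) pairs."""
    A, F, out = build_automaton(words)
    hits, v = [], 0
    for c, x in enumerate(database):  # the current pointer c only moves forward
        while v and x not in A[v]:
            v = F[v]                  # failure: the start pointer moves forward
        v = A[v].get(x, 0)
        for index, length in out[v]:
            hits.append((c - length + 1, index))
    return hits

hits = search(["POTATO", "POTASSIUM", "TASTE"], "POTASTPOTATO")
```

The start pointer is implicit here: it equals the current position minus the depth of the current node, and it only increases when a failure link is followed.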
26Illustration
P O T A S T P O T A T O
[Figure: pointers l and c on the database, and the keyword tree for POTATO, POTASSIUM, TASTE]
27Time analysis
- In each step, either c is incremented, or l is incremented
- Neither pointer is ever decremented (lp[v] < c-l).
- l and c do not exceed n
- Total time < 2n
[Figure: pointers l and c on the database P O T A S T P O T A T O]
28Blast: Putting it all together
- Input: Query of length m, database of size n
- Select word-size, scoring matrix, gap penalties, E-value cutoff
29Blast Steps
- Generate an automaton of all query keywords.
- Scan database using a Dictionary Matching algorithm (O(n) time). Identify all hits.
- Extend each hit using a variant of local alignment algorithm. Use the scoring matrix and gap penalties.
- For each alignment with score S, compute the bit-score, E-value, and the P-value. Sort according to increasing E-value until the cut-off is reached.
- Output results.
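The first two steps (keyword generation and hit finding) can be sketched as below. This is a simplification under stated assumptions: it uses a hash table of exact query words in place of the automaton, it does not extend hits or compute statistics, and the names `query_words` and `find_seeds` are hypothetical.

```python
def query_words(query, w):
    """Map each length-w word of the query to the positions where it occurs."""
    words = {}
    for i in range(len(query) - w + 1):
        words.setdefault(query[i:i + w], []).append(i)
    return words

def find_seeds(query, database, w=3):
    """Return (query position, database position) pairs of exact word hits."""
    words = query_words(query, w)
    seeds = []
    for j in range(len(database) - w + 1):
        for i in words.get(database[j:j + w], []):
            seeds.append((i, j))
    return seeds

seeds = find_seeds("TASTE", "POTASTPOTATO", w=3)
```

Each seed would then be handed to the extension step, which is omitted here.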
30Protein Sequence Analysis
- What can you do if BLAST does not return a hit?
- Sometimes, homology (evolutionary similarity) exists at very low levels of sequence similarity.
- A: Accept hits at higher P-value.
- This increases the probability that the sequence similarity is a chance event.
- How can we get around this paradox?
- Reformulated Q: Suppose two sequences B, C have the same level of sequence similarity to sequence A. If A & B are related in function, can we assume that A & C are? If not, how can we distinguish?