CSE182-L4: Scoring matrices, Dictionary Matching - PowerPoint PPT Presentation

About This Presentation
Description:

Title: L3: Blast: Keyword match basics Author: Vineet Bafna Last modified by: Vineet Bafna Created Date: 10/4/2005 8:32:21 AM Document presentation format – PowerPoint PPT presentation

Slides: 31
Provided by: Vine84
Learn more at: https://cseweb.ucsd.edu
Transcript and Presenter's Notes

1
CSE182-L4 Scoring matrices, Dictionary Matching
2
Class Mailing List
  • fa05_182_at_cs.ucsd.edu
  • To subscribe, send email to
  • fa05_182-subscribe_at_cs.ucsd.edu
  • You can subscribe from the course web page
  • Use the list for all course-related queries and
    discussions.

3
Silly Quiz
  • Name a famous Bioinformatics Researcher
  • Name a famous Bioinformatics Researcher who is a
    woman

4
Scoring DNA
  • DNA has structure.

5
DNA scoring matrices
  • So far, we considered a simple match/mismatch
    criterion.
  • The nucleotides can be grouped into Purines (A,G)
    and Pyrimidines (C,T).
  • Nucleotide substitutions within a group
    (transitions) are more likely than those across a
    group (transversions)
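
A scoring function that separates transitions from transversions can be sketched as follows. This is a minimal illustration only: the three numeric scores are assumed values chosen for the example, not values given in the lecture, and the function names are hypothetical.

```python
# Illustrative transition/transversion scoring for DNA.
# The numeric scores below are assumptions, not values from the slides.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

MATCH, TRANSITION, TRANSVERSION = 1, -1, -2  # hypothetical values


def dna_score(x: str, y: str) -> int:
    """Score one pair of aligned nucleotides."""
    if x == y:
        return MATCH
    # A substitution within one group is a transition.
    same_group = ({x, y} <= PURINES) or ({x, y} <= PYRIMIDINES)
    return TRANSITION if same_group else TRANSVERSION


def score_alignment(s: str, t: str) -> int:
    """Sum the pair scores of two equal-length, ungapped sequences."""
    if len(s) != len(t):
        raise ValueError("sequences must have equal length")
    return sum(dna_score(a, b) for a, b in zip(s, t))
```

With these assumed values, a mismatch such as A/G is penalized less than a mismatch such as A/C, which is the only property the sketch is meant to show.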

6
Scoring proteins
  • Scoring protein sequence alignments is a much
    more complex task than scoring DNA
  • Not all substitutions are equal
  • Problem was first worked on by Pauling and
    collaborators
  • In the 1970s, Margaret Dayhoff created the first
    similarity matrices.
  • One size does not fit all
  • Homologous proteins which are evolutionarily
    close should be scored differently than proteins
    that are evolutionarily distant
  • Different proteins might evolve at different
    rates and we need to normalize for that

7
PAM 1 distance
  • Two sequences are 1 PAM apart if they differ in 1%
    of the residues.

1% mismatch
  • PAM1(a,b) = Pr[residue b substitutes residue a,
    when the sequences are 1 PAM apart]

8
PAM1 matrix
  • Align many proteins that are very similar
  • Is this a problem?
  • PAM1 distance is the probability of a
    substitution when 1% of the residues have changed
  • Estimate the frequency $P_{b|a}$ of residue a being
    substituted by residue b.
  • $S(a,b) = \log_{10}\frac{P_{ab}}{P_a P_b} = \log_{10}\frac{P_{b|a}}{P_b}$

9
PAM 1
10
PAM distance
  • Two sequences are 1 PAM apart when they differ in
    1% of the residues.
  • When are 2 sequences 2 PAMs apart?

2 PAM
11
Higher PAMs
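  • $\mathrm{PAM2}(a,b) = \sum_c \mathrm{PAM1}(a,c) \cdot \mathrm{PAM1}(c,b)$
  • $\mathrm{PAM2} = \mathrm{PAM1} \cdot \mathrm{PAM1}$ (Matrix multiplication)
  • $\mathrm{PAM250} = \mathrm{PAM1} \cdot \mathrm{PAM249} = \mathrm{PAM1}^{250}$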
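
The matrix-power relation can be tried out on a toy example. The sketch below uses a made-up 3-letter alphabet in which each row changes with total probability 1%; the matrix entries are assumptions for illustration, not real PAM1 values, and the helper names are hypothetical.

```python
def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(M, k):
    """Compute M**k by repeated squaring."""
    n = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = [row[:] for row in M]
    while k > 0:
        if k & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        k >>= 1
    return result


# Toy "PAM1" over 3 letters (assumed numbers): row a holds Pr[a -> b],
# and each row keeps its own letter with probability 0.99.
TOY_PAM1 = [
    [0.990, 0.006, 0.004],
    [0.006, 0.990, 0.004],
    [0.005, 0.005, 0.990],
]

toy_pam2 = mat_pow(TOY_PAM1, 2)      # PAM2 = PAM1 * PAM1
toy_pam250 = mat_pow(TOY_PAM1, 250)  # PAM250 = PAM1 ** 250
```

Every row of the result is still a probability distribution, so the powers remain substitution probabilities rather than scores.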

12
Note: This is not the score matrix. What happens
as you keep increasing the power?
13
Scoring using PAM matrices
  • Suppose we know that two sequences are 250 PAMs
    apart.
  • $S(a,b) = \log_{10}\frac{P_{ab}}{P_a P_b} = \log_{10}\frac{P_{b|a}}{P_b} = \log_{10}\frac{\mathrm{PAM250}(a,b)}{P_b}$
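
The log-odds formula can be applied entry by entry to turn a substitution-probability matrix into a score matrix. The sketch below uses a toy 3-letter matrix and background frequencies that are assumptions for illustration (they stand in for PAM250 and the residue frequencies P_b); the function name is hypothetical.

```python
import math


def log_odds_scores(pam, background):
    """Return S with S[a][b] = log10(pam[a][b] / background[b])."""
    n = len(pam)
    return [[math.log10(pam[a][b] / background[b]) for b in range(n)]
            for a in range(n)]


# Toy substitution probabilities (row a holds Pr[a -> b]) and toy
# background frequencies; both are assumed numbers, not real PAM data.
toy_pam = [
    [0.6, 0.3, 0.1],
    [0.3, 0.6, 0.1],
    [0.2, 0.2, 0.6],
]
toy_background = [0.4, 0.4, 0.2]

toy_scores = log_odds_scores(toy_pam, toy_background)
```

In this toy case, pairs that occur more often than chance get a positive score and pairs that occur less often get a negative one.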

14
BLOSUM series of Matrices
  • Henikoff & Henikoff: Sequence substitutions in
    evolutionarily distant proteins do not seem to
    follow the PAM distributions
  • A more direct method based on hand-curated
    multiple alignments of distantly related proteins
    from the BLOCKS database.
  • BLOSUM60: Merge all proteins that have greater
    than 60% identity. Then, compute the substitution
    probability.
  • In practice, BLOSUM62 seems to work very well.

15
PAM vs. BLOSUM
  • What is the correspondence?
  • PAM1 Blosum1
  • PAM2 Blosum2
  • Blosum62
  • PAM250 Blosum100

16
Dictionary Matching, R.E. matching, and position
specific scoring
17
Dictionary Matching
Dictionary: 1: POTATO, 2: POTASSIUM, 3: TASTE
Database: P O T A S T P O T A T O
  • Q: Given k words (s_i has length l_i), and a
    database of size n, find all matches to these
    words in the database string.
  • How fast can this be done?

18
Dict. Matching & string matching
  • How fast can you do it, if you only had one word
    of length m?
  • Trivial algorithm: O(nm) time
  • Pre-processing O(m), Search O(n) time.
  • Dictionary matching
  • Trivial algorithm: (l_1 + l_2 + l_3 + ...)n
  • Using a keyword tree: l_p n (l_p is the length of
    the longest pattern)
  • Aho-Corasick: O(n) after preprocessing O(l_1 + l_2 + ...)
  • We will consider the most general case
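
For reference, the trivial dictionary-matching algorithm can be sketched as below: every word is compared against every database position. The function name and the (start, word index) output format are assumptions made for this example.

```python
def naive_dictionary_match(words, db):
    """Try every word at every database position.

    Returns (start, word_index) pairs with 0-based start positions and
    1-based word indices. Worst-case time is (l_1 + l_2 + ...) * n.
    """
    hits = []
    for start in range(len(db)):
        for idx, word in enumerate(words, start=1):
            if db.startswith(word, start):
                hits.append((start, idx))
    return hits


DICTIONARY = ["POTATO", "POTASSIUM", "TASTE"]
hits = naive_dictionary_match(DICTIONARY, "POTASTPOTATO")
```

On the database POTASTPOTATO the only hit is word 1 (POTATO) starting at position 6.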

19
Direct Algorithm
P O P O P O T A S T P O T A T O
P O T A T O
P O T A T O
P O T A T O
P O T A T O
P O T A T O
  • Observations
  • When we mismatch, we (should) know something
    about where the next match will be.
  • When there is a mismatch, we (should) know
    something about other patterns in the dictionary
    as well.

20
The Trie Automaton
  • Construct an automaton A from the dictionary
  • A[v,x] describes the transition from node v to a
    node w upon reading x.
  • A[u,T] = v, and A[u,S] = w
  • Special root node r
  • Some nodes are terminal, and labeled with the
    index of the dictionary word.

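[Figure: keyword tree for the dictionary 1: POTATO, 2: POTASSIUM, 3: TASTE, with root r, nodes u, v, w (v and w are reached from u on T and S), and terminal nodes labeled 1, 2, 3]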
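
One possible way to build such a trie automaton is sketched below. The representation is an assumption made for the example: nodes are integers, node 0 plays the role of the root r, `goto[v]` is a dict standing in for the transitions A[v,x], and `terminal` maps a terminal node to the index of its dictionary word.

```python
def build_trie(words):
    """Build a keyword tree.

    goto[v][x] is the node reached from node v upon reading x;
    terminal[v] is the 1-based index of the word that ends at node v.
    """
    goto = [{}]      # node 0 is the root
    terminal = {}
    for idx, word in enumerate(words, start=1):
        v = 0
        for ch in word:
            if ch not in goto[v]:
                goto.append({})
                goto[v][ch] = len(goto) - 1
            v = goto[v][ch]
        terminal[v] = idx
    return goto, terminal


goto, terminal = build_trie(["POTATO", "POTASSIUM", "TASTE"])

# Walk P-O-T-A from the root to reach the node the slide calls u.
u = 0
for ch in "POTA":
    u = goto[u][ch]
```

From u the automaton has exactly two outgoing transitions, on T (toward POTATO) and on S (toward POTASSIUM).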
21
An O(l_p n) algorithm for keyword matching
  • Start with the first position in the db, and the
    root node.
  • If successful transition:
      • Increment current pointer
      • Move to a new node
      • If terminal node: success
  • Else:
      • Retract current pointer
      • Increment start pointer
      • Move to root; repeat
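
The procedure above can be sketched in Python as follows. The trie representation (integer nodes, a dict per node, node 0 as root) and the function names are assumptions made for this example.

```python
def build_trie(words):
    """goto[v][x] is the child of node v on character x; node 0 is the root."""
    goto, terminal = [{}], {}
    for idx, word in enumerate(words, start=1):
        v = 0
        for ch in word:
            if ch not in goto[v]:
                goto.append({})
                goto[v][ch] = len(goto) - 1
            v = goto[v][ch]
        terminal[v] = idx
    return goto, terminal


def trie_scan(words, db):
    """Restart from the root at every start position: O(l_p * n) time."""
    goto, terminal = build_trie(words)
    hits = []
    for start in range(len(db)):       # increment start pointer
        v, cur = 0, start              # move to root, retract current pointer
        while cur < len(db) and db[cur] in goto[v]:
            v = goto[v][db[cur]]       # successful transition
            cur += 1                   # increment current pointer
            if v in terminal:          # terminal node: success
                hits.append((start, terminal[v]))
    return hits


hits = trie_scan(["POTATO", "POTASSIUM", "TASTE"], "POTASTPOTATO")
```

Each start position costs at most l_p transitions, because no path in the trie is longer than the longest pattern.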

22
Illustration
[Figure: the trie automaton scanning the database P O T A S T P O T A T O]
23
Idea for improving the time
  • Suppose we have partially matched pattern i
    (indicated by l, and c), but fail subsequently.
    If some other pattern j is to match,
  • then prefix(pattern j) = suffix(first c-l
    characters of pattern i)

Database: P O T A S T P O T A T O (pointers l and c mark the partial match)
Pattern i: P O T A S S I U M
Pattern j: T A S T E
Dictionary: 1: POTATO, 2: POTASSIUM, 3: TASTE
24
Improving speed of dictionary matching
  • Every node v corresponds to a string s_v that is a
    prefix of some pattern.
  • Define F[v] to be the node u such that s_u is the
    longest proper suffix of s_v
  • If we fail to match at v, we should jump to F[v],
    and commence matching from there
  • Let lp[v] = |s_u|

[Figure: keyword tree with nodes numbered 1 through 11]
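
The failure nodes can be computed with a breadth-first pass over the keyword tree, which is the usual Aho-Corasick construction. The sketch below is one way to do it under assumed conventions: integer nodes with node 0 as root, `fail[v]` standing in for the failure node of v, and `depth[v]` equal to the length of the string at node v, so that lp for v is `depth[fail[v]]`.

```python
from collections import deque


def build_trie(words):
    """goto[v][x] is the child of node v on character x; node 0 is the root."""
    goto = [{}]
    for word in words:
        v = 0
        for ch in word:
            if ch not in goto[v]:
                goto.append({})
                goto[v][ch] = len(goto) - 1
            v = goto[v][ch]
    return goto


def build_failure_links(goto):
    """fail[v] is the node for the longest proper suffix of s_v in the trie."""
    fail = [0] * len(goto)
    depth = [0] * len(goto)
    queue = deque()
    for child in goto[0].values():     # depth-1 nodes fail to the root
        depth[child] = 1
        queue.append(child)
    while queue:
        v = queue.popleft()
        for ch, w in goto[v].items():
            depth[w] = depth[v] + 1
            f = fail[v]
            while f and ch not in goto[f]:
                f = fail[f]            # try shorter and shorter suffixes
            fail[w] = goto[f].get(ch, 0)
            queue.append(w)
    return fail, depth


def node_for(goto, s):
    """Follow the characters of s from the root (s must be in the trie)."""
    v = 0
    for ch in s:
        v = goto[v][ch]
    return v


goto = build_trie(["POTATO", "POTASSIUM", "TASTE"])
fail, depth = build_failure_links(goto)
```

For this dictionary the node for POTAS fails to the node for TAS, so a mismatch after POTAS can continue matching TASTE without rereading the database.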
25
An O(n) alg. for keyword matching
  • Start with the first position in the db, and the
    root node.
  • If successful transition:
      • Increment current pointer
      • Move to a new node
      • If terminal node: success
  • Else (if at root):
      • Increment current pointer
      • Move start pointer
      • Move to root
  • Else:
      • Move start pointer forward
      • Move to failure node
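
A compact version of this scan, in the style of the standard Aho-Corasick algorithm, is sketched below. Several details are assumptions of the example rather than part of the slides: nodes are integers with node 0 as root, the start pointer l is kept implicit (it equals c minus the depth of the current node), and each node carries an `out` list so that words ending inside a longer partial match are also reported.

```python
from collections import deque


def build_automaton(words):
    """Return (goto, fail, out) for the keyword tree of the given words."""
    goto, out = [{}], [[]]
    for idx, word in enumerate(words, start=1):
        v = 0
        for ch in word:
            if ch not in goto[v]:
                goto.append({})
                out.append([])
                goto[v][ch] = len(goto) - 1
            v = goto[v][ch]
        out[v].append((idx, len(word)))
    fail = [0] * len(goto)
    queue = deque(goto[0].values())
    while queue:                        # breadth-first, so fail[v] is ready
        v = queue.popleft()
        for ch, w in goto[v].items():
            f = fail[v]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[w] = goto[f].get(ch, 0)
            out[w] = out[w] + out[fail[w]]
            queue.append(w)
    return goto, fail, out


def aho_corasick(words, db):
    """Report (start, word_index) hits in a single left-to-right scan."""
    goto, fail, out = build_automaton(words)
    hits, v = [], 0
    for c, ch in enumerate(db):
        while v and ch not in goto[v]:
            v = fail[v]                 # move to failure node
        v = goto[v].get(ch, 0)          # transition, or stay at the root
        for idx, length in out[v]:
            hits.append((c - length + 1, idx))
    return hits


hits = aho_corasick(["POTATO", "POTASSIUM", "TASTE"], "POTASTPOTATO")
```

The current pointer never moves backward, and every failure jump strictly shortens the current partial match, which is why the scan stays linear in the database size apart from the time spent reporting hits.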

26
Illustration
[Figure: keyword tree for POTATO, POTASSIUM, and TASTE, with pointers l and c marked on the database P O T A S T P O T A T O]
27
Time analysis
  • In each step, either c is incremented, or l is
    incremented
  • Neither pointer is ever decremented (lp[v] <
    c-l).
  • l and c do not exceed n
  • Total time < 2n

[Figure: pointers l and c on the database P O T A S T P O T A T O]
28
Blast: Putting it all together
  • Input: Query of length m, database of size n
  • Select word-size, scoring matrix, gap penalties,
    E-value cutoff

29
Blast Steps
  1. Generate an automaton of all query keywords.
  2. Scan database using a Dictionary Matching
    algorithm (O(n) time). Identify all hits.
  3. Extend each hit using a variant of local
    alignment algorithm. Use the scoring matrix and
    gap penalties.
  4. For each alignment with score S, compute the
    bit-score, E-value, and the P-value. Sort
    according to increasing E-value until the cut-off
    is reached.
  5. Output results.
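
Step 4 can be illustrated with the standard Karlin-Altschul formulas, which are not spelled out on the slide: bit score S' = (λS - ln K) / ln 2, E-value E = m·n·2^(-S'), and P-value P = 1 - e^(-E). The sketch below assumes these formulas; λ and K depend on the scoring matrix and gap penalties, and the values used in the usage example are made-up numbers chosen to keep the arithmetic easy, as are the function names and the hit labels.

```python
import math


def bit_score(raw_score, lam, K):
    """Normalize a raw alignment score S into bits."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def e_value(bits, m, n):
    """Expected number of chance hits with at least this bit score."""
    return m * n * 2.0 ** (-bits)


def p_value(e):
    """Probability of at least one chance hit: 1 - exp(-E)."""
    return -math.expm1(-e)


def report(hits, lam, K, m, n, cutoff):
    """Keep hits with E-value <= cutoff, sorted by increasing E-value."""
    rows = []
    for name, raw in hits:
        bits = bit_score(raw, lam, K)
        e = e_value(bits, m, n)
        if e <= cutoff:
            rows.append((name, bits, e, p_value(e)))
    return sorted(rows, key=lambda row: row[2])


# Usage with assumed parameters: lam = ln 2 and K = 1 make bits equal
# to the raw score, and m = n = 32 gives a search space of 1024.
rows = report([("x", 10), ("y", 12), ("z", 5)],
              lam=math.log(2), K=1.0, m=32, n=32, cutoff=2.0)
```

With these toy parameters, hit z has an E-value of about 32 and is dropped by the cutoff, while y is listed before x because its E-value is smaller.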

30
Protein Sequence Analysis
  • What can you do if BLAST does not return a hit?
  • Sometimes, homology (evolutionary similarity)
    exists at very low levels of sequence similarity.
  • A: Accept hits at higher P-value.
  • This increases the probability that the sequence
    similarity is a chance event.
  • How can we get around this paradox?
  • Reformulated Q: Suppose two sequences B, C have
    the same level of sequence similarity to sequence
    A. If A & B are related in function, can we assume
    that A & C are? If not, how can we distinguish?