CSE182-L5: Scoring matrices Dictionary Matching - PowerPoint PPT Presentation

About This Presentation
Title:

CSE182-L5: Scoring matrices Dictionary Matching

Description:

Trivial algorithm O(nm) time. Pre-processing O(m), Search O(n) time. Dictionary matching ... or for their enzymatic activity are conserved in both structure ... – PowerPoint PPT presentation

Number of Views:51
Avg rating:3.0/5.0
Slides: 64
Provided by: vineet50
Learn more at: https://cseweb.ucsd.edu
Category:

less

Transcript and Presenter's Notes

Title: CSE182-L5: Scoring matrices Dictionary Matching


1
CSE182-L5 Scoring matrices Dictionary Matching
2
Scoring DNA
  • DNA has structure.

3
DNA scoring matrices
  • So far, we considered a simple match/mismatch
    criterion.
  • The nucleotides can be grouped into Purines (A,G)
    and Pyrimidines.
  • Nucleotide substitutions within a group
    (transitions) are more likely than those across a
    group (transversions)

4
Scoring proteins
  • Scoring protein sequence alignments is a much
    more complex task than scoring DNA
  • Not all substitutions are equal
  • Problem was first worked on by Pauling and
    collaborators
  • In the 1970s, Margaret Dayhoff created the first
    similarity matrices.
  • One size does not fit all
  • Homologous proteins which are evolutionarily
    close should be scored differently than proteins
    that are evolutionarily distant
  • Different proteins might evolve at different
    rates and we need to normalize for that

5
PAM 1 distance
  • Two sequences are 1 PAM apart if they differ in 1
    of the residues.

1 mismatch
  • PAM1(a,b) Prresidue b substitutes residue a,
    when the sequences are 1 PAM apart

6
PAM1 matrix
  • Align many proteins that are very similar
  • Is this a problem?
  • PAM1 distance is the probability of a
    substitution when 1 of the residues have changed
  • Estimate the frequency Pba of residue a being
    substituted by residue b.
  • S(a,b) log10(Pab/PaPb) log10(Pba/Pb)

7
PAM 1
8
PAM distance
  • Two sequences are 1 PAM apart when they differ in
    1 of the residues.
  • When are 2 sequences 2 PAMs apart?

2 PAM
9
Higher PAMs
  • PAM2(a,b) ?c PAM1(a,c). PAM1 (c,b)
  • PAM2 PAM1 PAM1 (Matrix multiplication)
  • PAM250
  • PAM1PAM249
  • PAM1250

10
PAM250 based scoring matrix
  • S250(a,b) log10(Pab/PaPb) log10(PAM250(ba)/Pb
    )

11
Scoring using PAM matrices
  • Suppose we know that two sequences are 250 PAMs
    apart.
  • S(a,b) log10(Pab/PaPb) log10(Pba/Pb)
    log10(PAM250(a,b)/Pb)
  • How does it help?
  • S250(A,V) gtgt S1(A,V)
  • Scoring of hum vs. Dros should be using a higher
    PAM matrix than scoring hum vs. mus.
  • An alignment with a smaller identity could
    still have a higher score and be more significant

hum
mus
dros
12
BLOSUM series of Matrices
  • Henikoff Henikoff Sequence substitutions in
    evolutionarily distant proteins do not seem to
    follow the PAM distributions
  • A more direct method based on hand-curated
    multiple alignments of distantly related proteins
    from the BLOCKS database.
  • BLOSUM60 Merge all proteins that have greater
    than 60. Then, compute the substitution
    probability.
  • In practice BLOSUM62 seems to work very well.

13
PAM vs. BLOSUM
  • What is the correspondence?
  • PAM1 Blosum1
  • PAM2 Blosum2
  • Blosum62
  • PAM250 Blosum100

14
The last step in Blast
  • We have discussed
  • Alignments
  • Db filtering using keywords
  • E-values and P-values
  • Scoring matrices
  • The last step Database filtering requires us to
    scan a large sequence fast for matching keywords

15
Dictionary Matching, R.E. matching, and position
specific scoring
16
Keyword search
  • Recall In BLAST, we get a collection of keywords
    from the query sequence, and identify all db
    locations with an exact match to the keyword.
  • Question Given a collection of strings
    (keywords), find all occrrences in a database
    string where they keyword might match.

17
Dictionary Matching
1POTATO 2POTASSIUM 3TASTE
P O T A S T P O T A T O
database
dictionary
  • Q Given k words (si has length li), and a
    database of size n, find all matches to these
    words in the database string.
  • How fast can this be done?

18
Dict. Matching string matching
  • How fast can you do it, if you only had one word
    of length m?
  • Trivial algorithm O(nm) time
  • Pre-processing O(m), Search O(n) time.
  • Dictionary matching
  • Trivial algorithm (l1l2l3)n
  • Using a keyword tree, lpn (lp is the length of
    the longest pattern)
  • Aho-Corasick O(n) after preprocessing O(l1l2..)
  • We will consider the most general case

19
Direct Algorithm
P O P O P O T A S T P O T A T O
P O T A T O
P O T A T O
P O T A T O
P O T A T O
P O T A T O
  • Observations
  • When we mismatch, we (should) know something
    about where the next match will be.
  • When there is a mismatch, we (should) know
    something about other patterns in the dictionary
    as well.

20
The Trie Automaton
  • Construct an automaton A from the dictionary
  • Av,x describes the transition from node v to a
    node w upon reading x.
  • Au,T v, and Au,S w
  • Special root node r
  • Some nodes are terminal, and labeled with the
    index of the dictionary word.

1POTATO 2POTASSIUM 3TASTE
v
u
1
r
S
2
w
3
21
An O(lpn) algorithm for keyword matching
  • Start with the first position in the db, and the
    root node.
  • If successful transition
  • Increment current pointer
  • Move to a new node
  • If terminal node success
  • Else
  • Retract current pointer
  • Increment start pointer
  • Move to root repeat

22
Illustration
P O T A S T P O T A T O
v
1
S
23
Idea for improving the time
  • Suppose we have partially matched pattern i
    (indicated by l, and c), but fail subsequently.
    If some other pattern j is to match
  • Then prefix(pattern j) suffix first c-l
    characters of pattern(i))

c
l
P O T A S T P O T A T O
P O T A S S I U M
Pattern i
T A S T E
1POTATO 2POTASSIUM 3TASTE
Pattern j
24
Improving speed of dictionary matching
  • Every node v corresponds to a string sv that is a
    prefix of some pattern.
  • Define Fv to be the node u such that su is the
    longest suffix of sv
  • If we fail to match at v, we should jump to Fv,
    and commence matching from there
  • Let lpv su

2
3
4
5
1
S
11
6
7
9
10
8
25
An O(n) alg. For keyword matching
  • Start with the first position in the db, and the
    root node.
  • If successful transition
  • Increment current pointer
  • Move to a new node
  • If terminal node success
  • Else (if at root)
  • Increment current pointer
  • Mv start pointer
  • Move to root
  • Else
  • Move start pointer forward
  • Move to failure node

26
Illustration
P O T A S T P O T A T O
l
c
1
P
O
T
A
T
O
v
T
S
U
I
S
M
A
S
E
T
27
Time analysis
  • In each step, either c is incremented, or l is
    incremented
  • Neither pointer is ever decremented (lpv lt
    c-l).
  • l and c do not exceed n
  • Total time lt 2n

l
c
P O T A S T P O T A T O
28
Blast Putting it all together
  • Input Query of length m, database of size n
  • Select word-size, scoring matrix, gap penalties,
    E-value cutoff

29
Blast Steps
  1. Generate an automaton of all query keywords.
  2. Scan database using a Dictionary Matching
    algorithm (O(n) time). Identify all hits.
  3. Extend each hit using a variant of local
    alignment algorithm. Use the scoring matrix and
    gap penalties.
  4. For each alignment with score S, compute the
    bit-score, E-value, and the P-value. Sort
    according to increasing E-value until the cut-off
    is reached.
  5. Output results.

30
Protein Sequence Analysis
  • What can you do if BLAST does not return a hit?
  • Sometimes, homology (evolutionary similarity)
    exists at very low levels of sequence similarity.
  • A Accept hits at higher P-value.
  • This increases the probability that the sequence
    similarity is a chance event.
  • How can we get around this paradox?
  • Reformulated Q suppose two sequences B,C have
    the same level of sequence similarity to sequence
    A. If A B are related in function, can we assume
    that A C are? If not, how can we distinguish?

31
Silly Quiz
32
Silly Quiz
33
Protein sequence motifs
  • Premise
  • The sequence of a protein sequence gives clues
    about its structure and function.
  • Not all residues are equally important in
    determining function.
  • Suppose we knew the key residues of a family. If
    our query matches in those residues, it is a
    member. Otherwise, it is not.
  • How can we identify these key residues?

34
Prosite
  • In some cases the sequence of an unknown protein
    is too distantly related to any protein of known
    structure to detect its resemblance by overall
    sequence alignment. However, relationships can be
    revealed by the occurrence in its sequence of a
    particular cluster of residue types, which is
    variously known as a pattern, motif, signature or
    fingerprint. These motifs arise because specific
    region(s) of a protein which may be important,
    for example, for their binding properties or for
    their enzymatic activity are conserved in both
    structure and sequence. These structural
    requirements impose very tight constraints on the
    evolution of this small but important portion(s)
    of a protein sequence. The use of protein
    sequence patterns or profiles to determine the
    function of proteins is becoming very rapidly one
    of the essential tools of sequence analysis. Many
    authors ( 3,4) have recognized this reality.
    Based on these observations, we decided in 1988,
    to actively pursue the development of a database
    of regular expression-like patterns, which would
    be used to search against sequences of unknown
    function.
  • Kay Hofmann ,Philipp Bucher, Laurent Falquet and
    Amos Bairoch
  • The PROSITE database, its status in 1999

35
Basic idea
  • It is a heuristic approach. Start with the
    following
  • A collection of sequences with the same function.
  • Region/residues known to be significant for
    maintaining structure and function.
  • Develop a pattern of conserved residues around
    the residues of interest
  • Iterate for appropriate sensitivity and
    specificity

36
EX Zinc Finger domain
37
Proteins containing zf domains
How can we find a motif corresponding to a zf
domain
38
From alignment to regular expressions
ALRDFATHDDF SMTAEATHDSI
ECDQAATHEAS
ATH-DE
  • Search Swissprot with the resulting pattern
  • Refine pattern to eliminate false positives
  • Iterate

39
The sequence analysis perspective
  • Zinc Finger motif
  • C-x(2,4)-C-x(3)-LIVMFYWC-x(8)-H-x(3,5)-H
  • 2 conserved C, and 2 conserved H
  • How can we search a database using these motifs?
  • The motif is described using a regular
    expression. What is a regular expression?

40
Regular Expressions
  • Concise representation of a set of strings over
    alphabet ?.
  • Described by a string over
  • R is a r.e. if and only if

41
Regular Expression
  • Q Let ?A,C,E
  • Is (AC)EEC a regular expression?
  • (AC)?
  • AC..E?
  • Q When is a string s in a regular expression?
  • R (AC)EEC
  • Is CEEC in R?
  • AEC?
  • ACEE?

42
Regular Expression Automata
  • Every R.E can be expressed by an automaton (a
    directed graph) with the following properties
  • The automaton has a start and end node
  • Each edge is labeled with a symbol from ?, or ?
  • Suppose R is described by automaton A
  • S ? R if and only if there is a path from start
    to end in A, labeled with s.

43
Examples Regular Expression Automata
  • (AC)EEC

C
A
E
E
start
end
C
44
Constructing automata from R.E
?
  • R ?
  • R ?, ? ? ?
  • R R1 R2
  • R R1 R2
  • R R1

?
?
?
?
?
45
Regular Expression Matching
  • Given a database D, and a regular expression R,
    is a substring of D in R?
  • Is there a string Dl..c that is accepted by the
    automaton of R?
  • Simpler Q Is D1..c accepted by the automaton
    of R?

46
Alg. For matching R.E.
  • If D1..c is accepted by the automaton RA
  • There is a path labeled D1Dc that goes from
    START to END in RA

?
D1
D2
Dc
47
Alg. For matching R.E.
  • If D1..c is accepted by the automaton RA
  • There is a path labeled D1Dc that goes from
    START to END in RA
  • There is a path labeled D1..Dc-1 from START
    to node u, and a path labeled Dc from u to the
    END

u
D1 .. Dc-1
Dc
48
D.P. to match regular expression
u
?
v
  • Define
  • Au,? Automaton node reached from u after
    reading ?
  • Eps(u) set of all nodes reachable from node u
    using epsilon transitions.
  • Nc subset of nodes reachable from START node
    after reading D1..c
  • Q when is v ? Nc

?
u
Eps(u)
49
D.P. to match regular expression
  • Q when is v ? Nc?
  • A If for some u ? Nc-1, w Au,Dc,
  • v ? w Eps(w)

50
Algorithm
51
The final step
  • We have answered the question
  • Is D1..c accepted by R?
  • Yes, if END ? Nc
  • We need to answer
  • Is Dl..c (for some l, and some c) accepted by R

52
Profiles versus regular expressions
  • Regular expressions are intolerant to an
    occasional mis-match.
  • The Union operation (IVL) does not quantify the
    relative importance of I,V,L. It could be that V
    occurs in 80 of the family members.
  • Profiles capture some of these ideas.

53
Profiles
  • Start with an alignment of strings of length m,
    over an alphabet A,
  • Build an A X m matrix F(fki)
  • Each entry fki represents the frequency of symbol
    k in position i

0.71
0.14
0.28
0.14
54
Scoring Profiles
Scoring Matrix
i
k
fki
s
55
Psi-BLAST idea
  • Multiple alignments are important for capturing
    remote homology.
  • Profile based scores are a natural way to handle
    this.
  • Q What if the query is a single sequence.
  • A Iterate
  • Find homologs using Blast on query
  • Discard very similar homologs
  • Align, make a profile, search with profile.

56
Psi-BLAST speed
  • Two time consuming steps.
  • Multiple alignment of homologs
  • Searching with Profiles.
  • Does the keyword search idea work?

57
Databases of Motifs
  • Functionally related proteins have sequence
    motifs.
  • The sequence motifs can be represented in many
    ways, and different biological databases capture
    these representations
  • Collection of sequences (SMART)
  • Multiple alignments (BLOCKS)
  • Profiles (Pfam (HMMs)/Impala))
  • Regular Expressions (Prosite)
  • Different representations must be queried in
    different ways

58
Databases of protein domains
59
Pfam
http//pfam.wustl.edu/ Also at Sanger
60
PROSITE
http//us.expasy.org/prosite/
61
(No Transcript)
62
BLOCKS
63
(No Transcript)
Write a Comment
User Comments (0)
About PowerShow.com