Combinatorial Pattern Matching

About This Presentation

Title:

Combinatorial Pattern Matching

Description:

... Needleman-Wunsch global alignment algorithm 1981: Smith-Waterman local alignment algorithm 1985: FASTA 1990: BLAST (basic local alignment search tool) ... – PowerPoint PPT presentation

Number of Views:159

Avg rating:3.0/5.0

Slides: 103

Provided by: mch116

Learn more at: https://www.cs.brandeis.edu

Category:

more less

Transcript and Presenter's Notes

Title: Combinatorial Pattern Matching

1
Combinatorial Pattern Matching
2
Outline

Hash Tables
Repeat Finding
Exact Pattern Matching
Keyword Trees
Suffix Trees
Heuristic Similarity Search Algorithms
Approximate String Matching
Filtration
Comparing a Sequence Against a Database
Algorithm behind BLAST
Statistics behind BLAST
PatternHunter and BLAT

3
Outline- CHANGES

Hash Tables CHANGE- start from an
example-modify
Heuristic Similarity Search Algorithms _SHOW
FILTERED DOT-MATRIX

4
Genomic Repeats

Example of repeats
ATGGTCTAGGTCCTAGTGGTC
Motivation to find them
Genomic rearrangements are often associated with
repeats
Trace evolutionary secrets
Many tumors are characterized by an explosion of
repeats

5
Genomic Repeats

The problem is often more difficult
ATGGTCTAGGACCTAGTGTTC
Motivation to find them
Genomic rearrangements are often associated with
repeats
Trace evolutionary secrets
Many tumors are characterized by an explosion of
repeats

6
l-mer Repeats

Long repeats are difficult to find
Short repeats are easy to find (e.g., hashing)
Simple approach to finding long repeats
Find exact repeats of short l-mers (l is usually
10 to 13)
Use l-mer repeats to potentially extend into
longer, maximal repeats

7
l-mer Repeats (contd)

There are typically many locations where an l-mer
is repeated
GCTTACAGATTCAGTCTTACAGATGGT
The 4-mer TTAC starts at locations 3 and 17

8
Extending l-mer Repeats

GCTTACAGATTCAGTCTTACAGATGGT
Extend these 4-mer matches
GCTTACAGATTCAGTCTTACAGATGGT
Maximal repeat TTACAGAT

9
Maximal Repeats

To find maximal repeats in this way, we need ALL
start locations of all l-mers in the genome
Hashing lets us find repeats quickly in this
manner

10
Hashing What is it?

What does hashing do?
For different data, generate a unique integer
Store data in an array at the unique integer
index generated from the data
Hashing is a very efficient way to store and
retrieve data

11
Hashing Definitions

Hash table array used in hashing
Records data stored in a hash table
Keys identifies sets of records
Hash function uses a key to generate an index to
insert at in hash table
Collision when more than one record is mapped to
the same index in the hash table

12
Hashing Example

Where do the animals eat?
Records each animal
Keys where each animal eats

13
Hashing DNA sequences

Each l-mer can be translated into a binary string
(A, T, C, G can be represented as 00, 01, 10, 11)
After assigning a unique integer per l-mer it is
easy to get all start locations of each l-mer in
a genome

14
Hashing Maximal Repeats

To find repeats in a genome
For all l-mers in the genome, note the start
position and the sequence
Generate a hash table index for each unique l-mer
sequence
In each index of the hash table, store all genome
start locations of the l-mer which generated that
index
Extend l-mer repeats to maximal repeats

15
Hashing Collisions

Dealing with collisions
Chain all start locations of l-mers (linked
list)

16
Hashing Summary

When finding genomic repeats from l-mers
Generate a hash table index for each l-mer
sequence
In each index, store all genome start locations
of the l-mer which generated that index
Extend l-mer repeats to maximal repeats

17
Pattern Matching

What if, instead of finding repeats in a genome,
we want to find all sequences in a database that
contain a given pattern?
This leads us to a different problem, the Pattern
Matching Problem

18
Pattern Matching Problem

Goal Find all occurrences of a pattern in a text
Input Pattern p p1pn and text t t1tm
Output All positions 1lt i lt (m n 1) such
that the n-letter substring of t starting at i
matches p
Motivation Searching database for a known pattern

19
Exact Pattern Matching A Brute-Force Algorithm

PatternMatching(p,t)
n ? length of pattern p
m ? length of text t
for i ? 1 to (m n 1)
if titin-1 p
output i

20
Exact Pattern Matching An Example
GCAT

PatternMatching algorithm for
Pattern GCAT
Text CGCATC

CGCATC
GCAT
CGCATC
GCAT
CGCATC
GCAT
CGCATC
GCAT
CGCATC
21
Exact Pattern Matching Running Time

PatternMatching runtime O(nm)
Probability-wise, its more like O(m)
Rarely will there be close to n comparisons in
line 4
Better solution suffix trees
Can solve problem in O(m) time
Conceptually related to keyword trees

22
Keyword Trees Example

Keyword tree
Apple

23
Keyword Trees Example (contd)

Keyword tree
Apple
Apropos

24
Keyword Trees Example (contd)

Keyword tree
Apple
Apropos
Banana

25
Keyword Trees Example (contd)

Keyword tree
Apple
Apropos
Banana
Bandana

26
Keyword Trees Example (contd)

Keyword tree
Apple
Apropos
Banana
Bandana
Orange

27
Keyword Trees Properties

Stores a set of keywords in a rooted labeled tree
Each edge labeled with a letter from an alphabet
Any two edges coming out of the same vertex have
distinct labels
Every keyword stored can be spelled on a path
from root to some leaf

28
Keyword Trees Threading (contd)

Thread appeal
appeal

29
Keyword Trees Threading (contd)

Thread appeal
appeal

30
Keyword Trees Threading (contd)

Thread appeal
appeal

31
Keyword Trees Threading (contd)

Thread appeal
appeal

32
Keyword Trees Threading (contd)

Thread apple
apple

33
Keyword Trees Threading (contd)

Thread apple
apple

34
Keyword Trees Threading (contd)

Thread apple
apple

35
Keyword Trees Threading (contd)

Thread apple
apple

36
Keyword Trees Threading (contd)

Thread apple
apple

37
Multiple Pattern Matching Problem

Goal Given a set of patterns and a text, find
all occurrences of any of patterns in text
Input k patterns p1,,pk, and text t t1tm
Output Positions 1 lt i lt m where substring of t
starting at i matches pj for 1 lt j lt k
Motivation Searching database for known multiple
patterns

38
Multiple Pattern Matching Straightforward
Approach

Can solve as k Pattern Matching Problems
Runtime
O(kmn)
using the PatternMatching algorithm k times
m - length of the text
n - average length of the pattern

39
Multiple Pattern Matching Keyword Tree Approach

Or, we could use keyword trees
Build keyword tree in O(N) time N is total
length of all patterns
With naive threading O(N nm)
Aho-Corasick algorithm O(N m)

40
Keyword Trees Threading

To match patterns in a text using a keyword tree
Build keyword tree of patterns
Thread the text through the keyword tree

41
Keyword Trees Threading (contd)

Threading is complete when we reach a leaf in
the keyword tree
When threading is complete, weve found a
pattern in the text

42
Suffix TreesCollapsed Keyword Trees

Similar to keyword trees, except edges that form
paths are collapsed
Each edge is labeled with a substring of a text
All internal edges have at least two outgoing
edges
Leaves labeled by the index of the pattern.

43
Suffix Tree of a Text

Suffix trees of a text is constructed for all its
suffixes

ATCATG TCATG CATG ATG
TG G
Keyword Tree
Suffix Tree
44
Suffix Tree of a Text

Suffix trees of a text is constructed for all its
suffixes

ATCATG TCATG CATG ATG
TG G
Keyword Tree
Suffix Tree
How much time does it take?
45
Suffix Tree of a Text

Suffix trees of a text is constructed for all its
suffixes

ATCATG TCATG CATG ATG
TG G
Keyword Tree
Suffix Tree
quadratic
Time is linear in the total size of all suffixes,
i.e., it is quadratic in the length of the text
46
Suffix Trees Advantages

Suffix trees of a text is constructed for all its
suffixes
Suffix trees build faster than keyword trees

ATCATG TCATG CATG ATG
TG G
Keyword Tree
Suffix Tree
quadratic
linear (Weiner suffix tree algorithm)
47
Use of Suffix Trees

Suffix trees hold all suffixes of a text
i.e., ATCGC ATCGC, TCGC, CGC, GC, C
Builds in O(m) time for text of length m
To find any pattern of length n in a text
Build suffix tree for text
Thread the pattern through the suffix tree
Can find pattern in text in O(n) time!
O(n m) time for Pattern Matching Problem
Build suffix tree and lookup pattern

48
Pattern Matching with Suffix Trees

SuffixTreePatternMatching(p,t)
Build suffix tree for text t
Thread pattern p through suffix tree
if threading is complete
output positions of all p-matching leaves in
the tree
else
output Pattern does not appear in text

49
Suffix Trees Example
50
Multiple Pattern Matching Summary

Keyword and suffix trees are used to find
patterns in a text
Keyword trees
Build keyword tree of patterns, and thread text
through it
Suffix trees
Build suffix tree of text, and thread patterns
through it

51
Approximate vs. Exact Pattern Matching

So far all weve seen exact pattern matching
algorithms
Usually, because of mutations, it makes much more
biological sense to find approximate pattern
matches
Biologists often use fast heuristic approaches
(rather than local alignment) to find approximate
matches

52
Heuristic Similarity Searches

Genomes are huge Smith-Waterman quadratic
alignment algorithms are too slow
Alignment of two sequences usually has short
identical or highly similar fragments
Many heuristic methods (i.e., FASTA) are based on
the same idea of filtration
Find short exact matches, and use them as seeds
for potential match extension
Filter out positions with no extendable matches

53
Dot Matrices

Dot matrices show similarities between two
sequences
FASTA makes an implicit dot matrix from short
exact matches, and tries to find long diagonals
(allowing for some mismatches)

54
Dot Matrices (contd)

Identify diagonals above a threshold length
Diagonals in the dot matrix indicate exact
substring matching

55
Diagonals in Dot Matrices

Extend diagonals and try to link them together,
allowing for minimal mismatches/indels
Linking diagonals reveals approximate matches
over longer substrings

56
Approximate Pattern Matching Problem

Goal Find all approximate occurrences of a
pattern in a text
Input A pattern p p1pn, text t t1tm, and
k, the maximum number of mismatches
Output All positions 1 lt i lt (m n 1) such
that titin-1 and p1pn have at most k
mismatches (i.e., Hamming distance between
titin-1 and p lt k)

57
Approximate Pattern Matching A Brute-Force
Algorithm

ApproximatePatternMatching(p, t, k)
n ? length of pattern p
m ? length of text t
for i ? 1 to m n 1
dist ? 0
for j ? 1 to n
if tij-1 ! pj
dist ? dist 1
if dist lt k
output i

58
Approximate Pattern Matching Running Time

That algorithm runs in O(nm).
Landau-Vishkin algorithm O(kn)
We can generalize the Approximate Pattern
Matching Problem into a Query Matching
Problem
We want to match substrings in a query to
substrings in a text with at most k mismatches
Motivation we want to see similarities to some
gene, but we may not know which parts of the gene
to look for

59
Query Matching Problem

Goal Find all substrings of the query that
approximately match the text
Input Query q q1qw,
text t t1tm,
n (length of matching
substrings),
k (maximum number of
mismatches)
Output All pairs of positions (i, j) such that
the
n-letter substring of q starting
at i approximately matches the
n-letter substring of t starting
at j,
with at most k mismatches

60
Approximate Pattern Matching vs Query Matching
61
Query Matching Main Idea

Approximately matching strings share some
perfectly matching substrings.
Instead of searching for approximately matching
strings (difficult) search for perfectly matching
substrings (easy).

62
Filtration in Query Matching

We want all n-matches between a query and a text
with up to k mismatches
Filter out positions we know do not match
between text and query
Potential match detection find all matches of
l-tuples in query and text for some small l
Potential match verification Verify each
potential match by extending it to the left and
right, until (k 1) mismatches are found

63
Filtration Match Detection

If x1xn and y1yn match with at most k
mismatches, they must share an l-tuple that is
perfectly matched, with l ?n/(k 1)?
Break string of length n into k1 parts, each
each of length ?n/(k 1)?
k mismatches can affect at most k of these k1
parts
At least one of these k1 parts is perfectly
matched

64
Filtration Match Detection (contd)

Suppose k 3. We would then have
ln/(k1)n/4
There are at most k mismatches in n, so at the
very least there must be one out of the k1 l
tuples without a mismatch

1l
l 12l
2l 13l
3l 1n
1
2
k
k 1
65
Filtration Match Verification

For each l -match we find, try to extend the
match further to see if it is substantial

Extend perfect match of length l until we find an
approximate match of length n with k mismatches
query
text
66
Filtration Example
k 0 k 1 k 2 k 3 k 4 k 5
l -tuple length n n/2 n/3 n/4 n/5 n/6
Shorter perfect matches required
Performance decreases
67
Local alignment is to slow

Quadratic local alignment is too slow while
looking for similarities between long strings
(e.g. the entire GenBank database)

68
Local alignment is to slow

Quadratic local alignment is too slow while
looking for similarities between long strings
(e.g. the entire GenBank database)

69
Local alignment is to slow

Quadratic local alignment is too slow while
looking for similarities between long strings
(e.g. the entire GenBank database)

70
Local alignment is to slow

Quadratic local alignment is too slow while
looking for similarities between long strings
(e.g. the entire GenBank database)
Guaranteed to find the optimal local alignment
Sets the standard for sensitivity

71
Local alignment is to slow

Quadratic local alignment is too slow while
looking for similarities between long strings
(e.g. the entire GenBank database)
Basic Local Alignment Search Tool
Altschul, S., Gish, W., Miller, W., Myers, E.
Lipman, D.J.
Journal of Mol. Biol., 1990
Search sequence databases for local alignments to
a query

72
BLAST

Great improvement in speed, with a modest
decrease in sensitivity
Minimizes search space instead of exploring
entire search space between two sequences
Finds short exact matches (seeds), only
explores locally around these hits

73
What Similarity Reveals

BLASTing a new gene
Evolutionary relationship
Similarity between protein function
BLASTing a genome
Potential genes

74
BLAST algorithm

Keyword search of all words of length w from the
query of length n in database of length m with
score above threshold
w 11 for DNA queries, w 3 for proteins
Local alignment extension for each found keyword
Extend result until longest match above threshold
is achieved
Running time O(nm)

75
BLAST algorithm (contd)
keyword
Query KRHRKVLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVL
KIFLENVIRD
GVK 18 GAK 16 GIK 16 GGK 14 GLK 13 GNK 12 GRK
11 GEK 11 GDK 11
Neighborhood words
neighborhood score threshold (T 13)
extension
Query 22 VLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLK
60 DN G IR L GK I L E
RGK Sbjct 226 IIKDNGRGFSGKQIRNLNYGIGLKVIADLV-EK
HRGIIK 263
High-scoring Pair (HSP)
76
Original BLAST

Dictionary
All words of length w
Alignment
Ungapped extensions until score falls below some
statistical threshold
Output
All local alignments with score gt threshold

77
Original BLAST Example
A C G A A G T A A G G T C
C A G T

w 4
Exact keyword match of GGTC
Extend diagonals with mismatches until score is
under 50
Output result
GTAAGGTCC
GTTAGGTCC

C T G A T C C T G G A T T
G C G A
From lectures by Serafim Batzoglou (Stanford)
78
Gapped BLAST Example
A C G A A G T A A G G T C
C A G T

Original BLAST exact keyword search, THEN
Extend with gaps around ends of exact match until
score lt threshold
Output result
GTAAGGTCCAGT
GTTAGGTC-AGT

C T G A T C C T G G A T T
G C G A
From lectures by Serafim Batzoglou (Stanford)
79
Incarnations of BLAST

blastn Nucleotide-nucleotide
blastp Protein-protein
blastx Translated query vs. protein database
tblastn Protein query vs. translated database
tblastx Translated query vs. translated
database (6 frames each)

80
Incarnations of BLAST (contd)

PSI-BLAST
Find members of a protein family or build a
custom position-specific score matrix
Megablast
Search longer sequences with fewer differences
WU-BLAST (Wash U BLAST)
Optimized, added features

81
Assessing sequence similarity

Need to know how strong an alignment can be
expected from chance alone
Chance relates to comparison of sequences that
are generated randomly based upon a certain
sequence model
Sequence models may take into account
GC content
Poly-A tails
Junk DNA
Codon bias
Etc.

82
BLAST Segment Score

BLAST uses scoring matrices (d) to improve on
efficiency of match detection
Some proteins may have very different amino acid
sequences, but are still similar
For any two l-mers x1xl and y1yl
Segment pair pair of l-mers, one from each
sequence
Segment score Sli1 d(xi, yi)

83
BLAST Locally Maximal Segment Pairs

A segment pair is maximal if it has the best
score over all segment pairs
A segment pair is locally maximal if its score
cant be improved by extending or shortening
Statistically significant locally maximal segment
pairs are of biological interest
BLAST finds all locally maximal segment pairs
with scores above some threshold
A significantly high threshold will filter out
some statistically insignificant matches

84
BLAST Statistics

Threshold Altschul-Dembo-Karlin statistics
Identifies smallest segment score that is
unlikely to happen by chance
matches above q has mean E(q) Kmne-lq K is a
constant, m and n are the lengths of the two
compared sequences
Parameter l is positive root of
S x,y in A(pxpyed(x,y)) 1, where px and py are
frequenceies of amino acids x and y, and A is the
twenty letter amino acid alphabet

85
P-values

The probability of finding b HSPs with a score S
is given by
(e-EEb)/b!
For b 0, that chance is
e-E
Thus the probability of finding at least one HSP
with a score S is
P 1 e-E

86
Sample BLAST output

Blast of human beta globin protein against zebra
fish

Score E
Sequences producing significant alignments
(bits) Value
gi18858329refNP_571095.1 ba1 globin Danio
rerio gtgi147757... 171 3e-44
gi18858331refNP_571096.1 ba2 globin
SIdZ118J2.3 Danio rer... 170 7e-44
gi37606100embCAE48992.1 SIbY187G17.6 (novel
beta globin) D... 170 7e-44
gi31419195gbAAH53176.1 Ba1 protein Danio
rerio 168 3e-43
ALIGNMENTS
gtgi18858329refNP_571095.1 ba1 globin Danio
rerio
Length 148
Score 171 bits (434), Expect 3e-44
Identities 76/148 (51), Positives 106/148
(71), Gaps 1/148 (0)
Query 1 MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWT
QRFFESFGDLSTPDAVMGNPK 60
MV T EA LWGKNDEG AL R
LVYPWTQRF FGLSP AMGNPK
Sbjct 1 MVEWTDAERTAILGLWGKLNIDEIGPQALSRCLIVYPWT
QRYFATFGNLSSPAAIMGNPK 60

87
Sample BLAST output (contd)

Blast of human beta globin DNA against human DNA

Score E
Sequences producing significant alignments
(bits) Value
gi19849266gbAF487523.1 Homo sapiens gamma A
hemoglobin (HBG1... 289 1e-75
gi183868gbM11427.1HUMHBG3E Human gamma-globin
mRNA, 3' end 289 1e-75
gi44887617gbAY534688.1 Homo sapiens A-gamma
globin (HBG1) ge... 280 1e-72
gi31726embV00512.1HSGGL1 Human messenger RNA
for gamma-globin 260 1e-66
gi38683401refNR_001589.1 Homo sapiens
hemoglobin, beta pseud... 151 7e-34
gi18462073gbAF339400.1 Homo sapiens haplotype
PB26 beta-glob... 149 3e-33
ALIGNMENTS
gtgi28380636refNG_000007.3 Homo sapiens beta
globin region (HBB_at_) on chromosome 11
Length 81706
Score 149 bits (75), Expect 3e-33
Identities 183/219 (83)
Strand Plus / Plus
Query 267 ttgggagatgccacaaagcacctggatgatctcaagg
gcacctttgcccagctgagtgaa 326

88
Timeline

1970 Needleman-Wunsch global alignment algorithm
1981 Smith-Waterman local alignment algorithm
1985 FASTA
1990 BLAST (basic local alignment search tool)
2000s BLAST has become too slow in genome vs.
genome comparisons - new faster algorithms
evolve!
Pattern Hunter
BLAT

89
PatternHunter faster and even more sensitive

BLAST matches short consecutive sequences
(consecutive seed)
Length k
Example (k 11)
11111111111
Each 1 represents a match

PatternHunter matches short non-consecutive
sequences (spaced seed)
Increases sensitivity by locating homologies that
would otherwise be missed
Example (a spaced seed of length 18 w/ 11
matches)
111010010100110111
Each 0 represents a dont care, so there can be
a match or a mismatch

90
Spaced seeds

Example of a hit using a spaced seed

How does this result in better sensitivity?
91
Why is PH better?

PatternHunter

BLAST redundant hits

This results in gt 1 hit and creates clusters of
redundant hits
This results in very few redundant hits
92
Why is PH better?

BLAST may also miss a hit
GAGTACTCAACACCAACATTAGTGGGCAATGGAAAAT
GAATACTCAACAGCAACATCAATGGGCAGCAGAAAAT

9 matches
In this example, despite a clear homology, there
is no sequence of continuous matches longer than
length 9. BLAST uses a length 11 and because of
this, BLAST does not recognize this as a
hit! Resolving this would require reducing the
seed length to 9, which would have a damaging
effect on speed
93
Advantage of Gapped Seeds
11 positions
11 positions
10 positions
94
Why is PH better?

Higher hit probability
Lower expected number of random hits

95
Use of Multiple Seeds

Basic Searching Algorithm
Select a group of spaced seed models
For each hit of each model, conduct extension to
find a homology.

96
Another method BLAT

BLAT (BLAST-Like Alignment Tool)
Same idea as BLAST - locate short sequence hits
and extend

97
BLAT vs. BLAST Differences

BLAT builds an index of the database and scans
linearly through the query sequence, whereas
BLAST builds an index of the query sequence and
then scans linearly through the database
Index is stored in RAM which is memory intensive,
but results in faster searches

98
BLAT Fast cDNA Alignments

Steps
1. Break cDNA into 500 base chunks.
2. Use an index to find regions in genome similar
to each chunk of cDNA.
3. Do a detailed alignment between genomic
regions and cDNA chunk.
4. Use dynamic programming to stitch together
detailed alignments of chunks into detailed
alignment of whole.

99
BLAT Indexing

An index is built that contains the positions of
each k-mer in the genome
Each k-mer in the query sequence is compared to
each k-mer in the index
A list of hits is generated - positions in cDNA
and in genome that match for k bases

100
Indexing An Example

Here is an example with k 3
Genome cacaattatcacgaccgc
3-mers (non-overlapping) cac aat tat cac gac cgc
Index aat 3 gac 12
cac 0,9 tat 6
cgc 15
cDNA (query sequence) aattctcac
3-mers (overlapping) aat att ttc tct ctc tca cac
0 1 2 3 4 5 6
Hits aat 0,3
cac 6,0
cac 6,9
clump cacAATtatCACgaccgc

Multiple instances map to single index
Position of 3-mer in query, genome
101
However

BLAT was designed to find sequences of 95 and
greater similarity of length gt40 may miss more
divergent or shorter sequence alignments

102
PatternHunter and BLAT vs. BLAST

PatternHunter is 5-100 times faster than Blastn,
depending on data size, at the same sensitivity
BLAT is several times faster than BLAST, but best
results are limited to closely related sequences