Title: MUMmer: fast alignment of large-scale DNA and protein sequences
1. MUMmer: fast alignment of large-scale DNA and protein sequences
- Presented by Arthur Loder
- Course: CSE 497, Computational Issues in Molecular Biology
- Date: February 17, 2004
2. MUMmer's Significance
- MUMmer is a system for rapidly aligning entire genomes or very large protein sequences
- Input: 2 strings; Output: base-to-base alignment
- What distinguishes MUMmer from previous algorithms?
- Can align very large sequences (millions of nucleotides long)
3. Complete Genome Alignment
- What it means to align complete genomes
- Human chromosome 1: roughly 200 million base pairs
- Previous alignment algorithms run space-efficiently, in O(n) space
- But their O(n²) time complexity is unacceptably slow
- We would like to align using a global dynamic programming algorithm (Needleman/Wunsch), but that is infeasible at this scale; we need a shortcut
4. MUMmer vs. BLAST
- Instead, use a technique somewhat similar to BLAST (find similar regions between strings)
- BLAST: for comparing 1 known string to a large set of unknown strings
- MUMmer: for aligning 2 very similar known strings
5. You know you've made it when...
6. Overview: MUMs, Part 1
- String A
- CTTCAGCTAGCTAGTCCCTAGCTAATCAAGAGACTAGGATCAAGGCTAG
SCTGAAGTGCACCAGGTTGCAATCCATCTATGAGCGATCGAATCGGATCG
AGTCGAGCTAGCTAAGCTAGCTAGGGAGTCCAAAGACTGCGGATGCGAGT
CGAGGCTTTAGAGCTAGCTAGCGCGATCGAGGCTAGCTATGCTAGCTATC
ATCGCAAGCTAGCTGAGTCGCGATCGGGCGTAGCGATGTCTCTAGTCTCT
AGTCGAGCTGATCGTAGCTAGTAATGTATCCATCTACTCTAGTAGATCGA
TTAGTCGATCGATGCTAGATCGGATCGAGTCGAGATCGATGGAGTCGAGA
TCGATCTAATCTATCTCTAAATGGAGCGA
- String B
- GCATCGTAGGCTGAGGCTTCGAGGCTAGTCGATGCTAGGTTGCAATCCA
TCTATGAGCGATCGAATCGGATCGAGTCGAGCTAGCTAAGCTAGCTAGGG
AGTCCAAACTCGCAAAGCTAGTGATCGATCGATATCGATTCGATCGGTGT
CGCGATCGGGCGTAGCGATGTCTCTAGTCTCTAGTCGAGCTGATCGTAGC
TAGTAATGTATCATAGCTAATCGCACTACTACGATGCGATCTCTAGTCGA
TCTATCTCGGCTTCGATCGTA
- How can we align these without Needleman/Wunsch?
7. Overview: MUMs, Part 2
- String A and String B (repeated verbatim from the previous slide; in the original slides the large exact matches are highlighted)
- Easier with the large exact matches highlighted?
8. Overview: MUMs, Part 3
- Any optimal global alignment will probably use these two subsequences as anchors
- This is the shortcut needed to calculate a global alignment quickly on large sequences
- Very intuitive in the alignment process, but only a heuristic
9. Overview: MUMs, Part 4
- MUM = Maximal Unique Match
- MUMs occur exactly once in each sequence (a brute-force check of this definition is sketched below)
- Repeat sequences are ignored
- MUMs are found efficiently using a suffix tree data structure (to be explained later)
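The definition can be checked directly by brute force. The sketch below is only an illustration of the definition (a match that occurs exactly once in each string and cannot be extended in either direction), not the suffix-tree method MUMmer actually uses; the function names and the min_len default are illustrative.

    # Brute-force MUM check, straight from the definition; for illustration only
    # (MUMmer finds MUMs with a suffix tree in linear time).
    def occurrences(s, sub):
        count, pos = 0, s.find(sub)
        while pos != -1:                       # count overlapping occurrences too
            count += 1
            pos = s.find(sub, pos + 1)
        return count

    def find_mums(a, b, min_len=4):
        mums = []                              # (start in a, start in b, length)
        for i in range(len(a)):
            for length in range(min_len, len(a) - i + 1):
                sub = a[i:i + length]
                if occurrences(b, sub) == 0:
                    break                      # longer extensions cannot appear in b either
                if occurrences(a, sub) != 1 or occurrences(b, sub) != 1:
                    continue                   # not yet unique; a longer match might be
                j = b.find(sub)
                left = i > 0 and j > 0 and a[i - 1] == b[j - 1]
                right = (i + length < len(a) and j + length < len(b)
                         and a[i + length] == b[j + length])
                if not left and not right:     # maximal: cannot extend either way
                    mums.append((i, j, length))
        return mums

    print(find_mums("ctgacctgaacgt", "acctgaatcg"))   # reports the shared "acctgaa" block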
10. Overview: Choosing MUMs
- Once we have anchors, we need to choose which ones to use in the alignment
- All MUMs
- MUMs used in the alignment (a subset)
11. Overview: Closing the Gaps
- After choosing and aligning the anchors, what next?
- Close the gaps
- Use dynamic programming (Smith-Waterman) to extend the MUMs
- The gap regions are smaller, so they can be computed quickly
- Implicit assumption: the sequences are very similar
12. 3 Phases of MUMmer
- 3 phases:
- 1) Obtaining MUMs (via suffix trees)
- 2) Choosing MUMs (via longest increasing subsequences)
- 3) Gap closing (via dynamic programming, Smith-Waterman)
- Composed of previously known algorithms, packaged to form a unique algorithm
13. Critiquing MUMmer's Output
- Sample output:
- Sequence A: ACTGC_TGAC_CTA
- Sequence B: ACC_CA_GGCTCG_
- MUMmer best case: the same alignment as Needleman/Wunsch
- MUMmer worst case: a sub-optimal alignment
- At least it is computable, whereas Needleman/Wunsch is not at this scale
14. Phase 1: Obtaining MUMs
- via suffix trees
- Edward M. McCreight (1976). A Space-Economical Suffix Tree Construction Algorithm. http://doi.acm.org/10.1145/321941.321946
15. Suffix Trees: Outline
- I. Suffix Trees
- A. Motivations
- B. Tries
- C. Suffix Trees
- D. How MUMmer utilizes suffix trees
16. Tries
- The term "trie" comes from "retrieval"
- Introduced in the 1960s by Fredkin
- Suffix trees are a type of trie
- Uses:
- Quickly search large text via preprocessing
- Used for regular expressions, longest common substring, automatic command completion, etc.
17. Non-Compact Trie Example
- 5 strings encoded: BIG, BIGGER, BILL, GOOD, GOSH
- Every edge represents a symbol of the alphabet
18. Implementation of Tries
- Use a linked list
- Include pointers to the sibling and the first child, as in the sketch below
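A minimal sketch of this linked-list layout, assuming a first-child / next-sibling representation; the class and field names are illustrative.

    # First-child / next-sibling trie node, as described on the slide.
    class TrieNode:
        def __init__(self, symbol=None):
            self.symbol = symbol        # edge label leading into this node
            self.first_child = None     # leftmost child
            self.next_sibling = None    # next node sharing the same parent
            self.is_word_end = False

    def insert(root, word):
        node = root
        for ch in word:
            child = node.first_child
            while child is not None and child.symbol != ch:
                child = child.next_sibling
            if child is None:                       # no edge for ch yet: add one
                child = TrieNode(ch)
                child.next_sibling = node.first_child
                node.first_child = child
            node = child
        node.is_word_end = True

    root = TrieNode()
    for w in ("BIG", "BIGGER", "BILL", "GOOD", "GOSH"):
        insert(root, w)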
19. Compacting Tries, Part 1
- Method 1: trim chains leading to leaves
- Compact trie for the strings BIG, BIGGER, BILL, GOOD, GOSH
20. Compacting Tries, Part 2
- Method 2: Patricia tries
- Before: one edge per character
- Now: unary nodes are collapsed
21. Suffix Trie
- A normal trie, but the input strings are the suffixes of the text (sketch below)
- Assume a text string t_1...t_n
- Q: How many leaves does the tree have?
- A: The tree has n leaves
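A tiny sketch that builds a suffix trie by inserting every suffix, using nested dicts rather than the linked-list layout above; with a unique terminator appended, each suffix ends at its own leaf.

    # Build an (uncompressed) suffix trie as nested dicts and count its leaves.
    def suffix_trie(text):
        root = {}
        for i in range(len(text)):       # insert the suffix text[i:]
            node = root
            for ch in text[i:]:
                node = node.setdefault(ch, {})
        return root

    def count_leaves(node):
        return 1 if not node else sum(count_leaves(child) for child in node.values())

    trie = suffix_trie("ababc" + "$")    # "$" is a terminator not in the alphabet
    print(count_leaves(trie))            # 6 leaves: one per suffix of "ababc$"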
22. Suffix Tree
- First, compact the suffix trie
- Next, collapse the unary nodes, as in the sketch below
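Continuing the dict-based sketch above, a small pass that collapses chains of unary nodes into string-labeled edges; illustrative only (a real suffix tree stores index pairs instead, as the next slide explains).

    # Collapse unary chains of a dict-based suffix trie into string-labeled edges.
    def suffix_trie(text):
        root = {}
        for i in range(len(text)):
            node = root
            for ch in text[i:]:
                node = node.setdefault(ch, {})
        return root

    def compact(node):
        tree = {}
        for ch, child in node.items():
            label = ch
            while len(child) == 1:                  # unary chain: absorb the next edge
                (nxt_ch, nxt_child), = child.items()
                label += nxt_ch
                child = nxt_child
            tree[label] = compact(child)
        return tree

    print(compact(suffix_trie("ababc$")))
    # {'ab': {'abc$': {}, 'c$': {}}, 'b': {'abc$': {}, 'c$': {}}, 'c$': {}, '$': {}}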
23. Suffix Trees: Decreasing Storage
- Rather than storing strings on the edges, store a pair of indices (x, y), where x is the beginning of the substring and y is its end (tiny example below)
- Storage becomes O(n)
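A two-line illustration of the index-pair idea, using 0-based Python indices with an inclusive end rather than the slide's exact (x, y) convention.

    S = "ababc$"
    edge = (2, 5)                  # an edge labeled S[2..5] == "abc$" costs O(1) space
    print(S[edge[0]:edge[1] + 1])  # the label is recovered from S on demand -> abc$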
24. Suffix Tree Algorithms
- The first linear-time algorithm was given by Weiner (1973)
- McCreight developed a more space-efficient algorithm (1976)
- Both original papers have a reputation for being difficult to understand
25. McCreight's Algorithm, Part 1
- Algorithm M:
- "Maps a finite string S of characters into an auxiliary index to S in the form of a digital search tree T whose paths are the suffixes of S, and whose terminal nodes correspond uniquely to positions within S."
- A Space-Economical Suffix Tree Construction Algorithm. Edward M. McCreight.
- http://doi.acm.org/10.1145/321941.321946
26. McCreight's Algorithm, Part 2
- S = ababc
- Definitions (checked in the sketch below):
- suf_i is the suffix of S beginning at character position i
- head_i is the longest prefix of suf_i which is also a prefix of suf_j for some j < i
- tail_i is suf_i minus head_i
- suf_3 = ?
- abc
- head_3 = ?
- ab
- tail_3 = ?
- abc minus ab = c
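These definitions can be checked mechanically; here is a small sketch (using 1-based positions to match the slide) that computes suf_i, head_i, and tail_i for S = ababc.

    # Compute suf_i, head_i, and tail_i (1-based i) directly from the definitions.
    def suf(S, i):
        return S[i - 1:]

    def head(S, i):
        s, best = suf(S, i), ""
        for j in range(1, i):                   # compare against earlier suffixes suf_j, j < i
            t, k = suf(S, j), 0
            while k < len(s) and k < len(t) and s[k] == t[k]:
                k += 1                          # length of the common prefix with suf_j
            best = max(best, s[:k], key=len)
        return best

    def tail(S, i):
        return suf(S, i)[len(head(S, i)):]

    S = "ababc"
    print(suf(S, 3), head(S, 3), tail(S, 3))    # abc ab c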
27. McCreight's Algorithm, Part 3
- Builds the suffix tree by adding suf_i to Tree_{i-1}
- Initially, Tree_1 contains only suf_1 (the entire string)
- To obtain Tree_2, add suf_2 to Tree_1
- Continue until you have added suf_n to Tree_{n-1}
- Tree_n is the final suffix tree
28. McCreight's Algorithm, Part 4
- Adding a suffix (going from Tree_2 to Tree_3)
- suf_3 = abc, head_3 = ab, tail_3 = c
29. Suffix Trees: Complexity
- Adding a non-terminal node and a new arc corresponding to tail_i takes at most constant time
- If head_i could also be found in at most constant time, the construction would run in time linear in n, the length of the string S
- This is achieved using suffix links (see the paper for details)
30. Finding MUMs from Suffix Trees
31. Finding MUMs from Suffix Trees, Part 2
32. Finding MUMs from Suffix Trees, Part 3
33. Phase 2: Choosing MUMs for Alignment
- via the longest increasing subsequence (LIS)
- Gusfield, D. (1997). Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology.
34. Motivation for Choosing MUMs
- Q: Why can't we use all MUMs for the alignment?
- A: Because MUMs can cross each other, we can only choose an increasing set of MUMs
- Problem: given a set of MUMs, how do we choose the optimal sequence?
35. Choosing MUMs (Continued)
- The configuration can be uniquely represented:
- P = 1, 2, 3, 4, 6, 7, 5
- LIS(P) = 1, 2, 3, 4, 6, 7
- Determining the optimal sequence of MUMs reduces to finding the LIS of P
36. IS: Definition
- Increasing subsequence: values (strictly) increase from left to right
- Sequence P = 4, 2, 1, 5, 8, 6, 9, 10
- Examples of two increasing subsequences:
- 4, 5, 9 or 2, 5, 6, 9, 10
37. DS: Definition
- Decreasing subsequence: values decrease from left to right
- Sequence P = 4, 2, 1, 5, 8, 6, 9, 10
- Examples? <insert class participation here>
- 4, 2, 1; 4, 2; 4, 1; 2, 1; or 8, 6
38. Covers: Definition, Part 1
- Cover of P: a set of decreasing subsequences of P that together contain all the numbers of P
- P = 7, 3, 4, 8, 6, 2, 1
- Some possible covers?
- {7, 3} {4} {8} {6, 2, 1}
- OR
- {7, 3, 2, 1} {4} {8, 6}
- And others
39. Covers: Definitions, Part 2
- Size of a cover: the number of decreasing subsequences it contains
- Smallest cover: a cover with minimum size
- If I is an increasing subsequence of P with length equal to the size of a cover of P, call it C, then I is a longest increasing subsequence of P and C is a smallest cover of P
- Why?
40. Covers: Relation to LIS
- If I is an increasing subsequence of P with length equal to the size of a cover of P, call it C, then I is a longest increasing subsequence of P and C is a smallest cover of P. Why?
- Because no increasing subsequence can contain more than one number from each decreasing subsequence in a cover
41. Covers: Examples
- P = 7, 3, 4, 8, 6, 2, 1
- Two possible covers:
- {7, 3} {4} {8} {6, 2, 1}
- {7, 3, 2, 1} {4} {8, 6}
- What is the size of the smallest cover?
- 3 (no cover can contain fewer than 3 decreasing subsequences)
- How many elements are in the LIS?
- 3
42. Covers: Examples Continued
- P = 7, 3, 4, 8, 6, 2, 1
- For a particular cover, say
- {7, 3, 2, 1} {4} {8, 6}
- You can only choose one element from each decreasing subsequence; otherwise the chosen subsequence would not be increasing
- Example:
- IS = 3, 4, 6; to add an element, you would need to choose from a subsequence from which you have already chosen
43. Greedy Cover Algorithm, Part 1
- To create a smallest cover, use the Greedy Cover algorithm:
- Start from the left of the sequence P
- Examine each number
- Place the number at the end of the left-most decreasing subsequence it can extend
- If none exists, make a new decreasing subsequence (to the far right)
44. Greedy Cover Algorithm, Part 2
- Example: P = 6, 3, 5, 1, 9, 7, 2, 10, 4 (a code sketch follows the trace)
- {6}
- {6, 3}
- {6, 3} {5}
- {6, 3, 1} {5}
- {6, 3, 1} {5} {9}
- {6, 3, 1} {5} {9, 7}
- {6, 3, 1} {5, 2} {9, 7}
- {6, 3, 1} {5, 2} {9, 7} {10}
- {6, 3, 1} {5, 2} {9, 7, 4} {10}
- {6, 3, 1} {5, 2} {9, 7, 4} {10} (smallest cover)
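A short sketch of the greedy rule traced above, using a plain linear scan over the subsequences for clarity (a binary search over their tails gives the O(n log n) bound mentioned later).

    # Greedy cover: append each number to the left-most decreasing subsequence
    # whose last element is larger; otherwise start a new subsequence on the right.
    def greedy_cover(P):
        cover = []
        for x in P:
            for subseq in cover:
                if subseq[-1] > x:          # x extends this decreasing subsequence
                    subseq.append(x)
                    break
            else:
                cover.append([x])           # no subsequence can take x: open a new one
        return cover

    print(greedy_cover([6, 3, 5, 1, 9, 7, 2, 10, 4]))
    # [[6, 3, 1], [5, 2], [9, 7, 4], [10]]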
45. Obtaining the LIS from a Smallest Cover
- LIS algorithm:
- Let i = the number of subsequences in the greedy cover
- Set I to the empty list
- Choose any element x in subsequence i and place it at the front of list I
- While i > 1:
- Scan from the top of subsequence (i - 1) and find the first element y smaller than x
- Set x = y and i = i - 1
- Place x at the front of list I
46. Obtaining the LIS: Example
- P = 6, 3, 5, 1, 9, 7, 2, 10, 4
- Smallest cover: {6, 3, 1} {5, 2} {9, 7, 4} {10}
- Written as columns:
- 6  5  9  10
- 3  2  7
- 1     4
- Number of subsequences: i = 4
- i = 4, x = 10, I = 10
- i = 3, x = 9, I = 9, 10
- i = 2, x = 5, I = 5, 9, 10
- i = 1, x = 3, I = 3, 5, 9, 10
- P = 6, 3, 5, 1, 9, 7, 2, 10, 4
- LIS = 3, 5, 9, 10 (reproduced by the sketch below)
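A sketch of the backtracking step, fed the smallest cover from this slide as a literal; it reproduces the example.

    # Recover a longest increasing subsequence from a smallest cover by walking the
    # cover right to left, always picking an element smaller than the previous pick.
    def lis_from_cover(cover):
        lis = [cover[-1][0]]                       # any element of the rightmost subsequence
        for subseq in reversed(cover[:-1]):
            for y in subseq:                       # scan from the top of subsequence i - 1
                if y < lis[0]:
                    lis.insert(0, y)               # place y at the front of the list
                    break
        return lis

    cover = [[6, 3, 1], [5, 2], [9, 7, 4], [10]]   # smallest cover of P from the slide
    print(lis_from_cover(cover))                   # [3, 5, 9, 10]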
47. How MUMmer Utilizes LIS
48. Obtaining the LIS: Complexity Analysis
- The greedy cover can be found in O(n log n) (sketch below)
- The LIS is then found from the greedy cover in O(n)
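The O(n log n) bound comes from keeping only the tail (last element) of each decreasing subsequence and binary-searching over those tails, which stay in increasing order, instead of scanning; a minimal sketch.

    import bisect

    # O(n log n) greedy cover: the tails of the decreasing subsequences are kept in
    # increasing order, so a binary search finds the left-most one that x can extend.
    def greedy_cover_size(P):
        tails = []
        for x in P:
            k = bisect.bisect_right(tails, x)   # left-most tail strictly greater than x
            if k == len(tails):
                tails.append(x)                 # start a new subsequence on the right
            else:
                tails[k] = x                    # x becomes the new tail of subsequence k
        return len(tails)                       # size of smallest cover == length of LIS

    print(greedy_cover_size([6, 3, 5, 1, 9, 7, 2, 10, 4]))   # 4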
49. Phase 3: Closing the Gaps
- via Smith-Waterman
- Smith, T. and Waterman, M. (1981). Identification of Common Molecular Subsequences. Journal of Molecular Biology, 147, 195-197.
- http://citeseer.ist.psu.edu/smith81identification.html
50. Closing the Gaps
- After the global MUM alignment is found, we need to close the local gaps
- Gap: an interruption in the MUM alignment
- Types of gaps (see the classification sketch below):
- 1) SNP: Single Nucleotide Polymorphism
- 2) Insertion
- 3) Highly polymorphic region
- 4) Repeat
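A rough sketch of how the gap between two consecutive anchor MUMs might be sorted into these categories; the function name and length threshold are illustrative, repeats are left to separate post-processing, and none of this is MUMmer's actual decision logic.

    # Rough classification of the gap between two consecutive MUM anchors.
    # gap_a / gap_b are the unaligned substrings of genomes A and B between the anchors.
    def classify_gap(gap_a, gap_b, large=5000):
        if len(gap_a) == 1 and len(gap_b) == 1:
            return "SNP"                          # one mismatched base flanked by exact matches
        if len(gap_a) == 0 or len(gap_b) == 0:
            return "simple insertion"             # sequence present in only one genome
        if max(len(gap_a), len(gap_b)) > large:
            return "large polymorphic region"     # re-run MUMmer with a smaller minimum MUM length
        return "polymorphic region"               # align directly with Smith-Waterman-style DP

    print(classify_gap("G", "T"))                 # SNP
    print(classify_gap("", "ACCTGATT"))           # simple insertion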
51. Types of Gaps: Examples
52. SNP Processing
- SNP: Single Nucleotide Polymorphism
- SNPs in human DNA appear to be associated with many health issues (genetic disease)
- Q: How can we detect SNPs using MUMmer?
- A: By looking between MUMs; SNPs are surrounded by matching subsequences
53. Inserts: Transpositions
- Large gaps in one genome but not the other
- Transpositions: subsequences that were deleted from one location and inserted elsewhere
- Detected during post-processing
- How?
- They are part of the MUM alignment
54. Simple Inserts
- Simple inserts: subsequences that appear in only one genome
- They do not appear in the MUM alignment
55. Polymorphic Regions
- Gaps in the alignment caused by sequences with large numbers of differences
- If the regions are small enough, align them using dynamic programming (Smith-Waterman), as in the sketch below
- This gives an optimal alignment for pre-specified insertion and mutation costs
- If the regions are too large, recursively call the MUMmer algorithm
- Q: What must change when running MUMmer again?
- A: The minimum MUM size must be smaller
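The sketch referenced above: a textbook Smith-Waterman score computation for a small gap region; the scoring parameters are illustrative, not MUMmer's.

    # Textbook Smith-Waterman local alignment score for a small gap region.
    def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
        rows, cols = len(a) + 1, len(b) + 1
        H = [[0] * cols for _ in range(rows)]
        best = 0
        for i in range(1, rows):
            for j in range(1, cols):
                diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
                H[i][j] = max(0,                     # a local alignment may restart anywhere
                              diag,                  # align a[i-1] with b[j-1]
                              H[i - 1][j] + gap,     # a[i-1] against a gap in b
                              H[i][j - 1] + gap)     # b[j-1] against a gap in a
                best = max(best, H[i][j])
        return best

    print(smith_waterman("ACCTGATT", "ACTGAT"))      # best local alignment score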
56. Repeat Processing
- Repeat sequences do not appear in the MUM alignment
- It only includes sequences that appear exactly once in each genome
- In a sense, repeat sequences can fake out the alignment
57. Finding Inversions
- Professor Lopresti mentioned a method for finding inversions last class
- Q: How can we identify inversions using MUMmer?
- A: By running it in both directions on the gap regions, as sketched below
58. Results: MUMmer 1.0 Storage
- 12 bytes per leaf node (suffix tree)
- 24 bytes per internal node (suffix tree)
- 1 byte for each base in the genome
- Generous upper bound: 37 bytes per base
- Therefore, comparing two 100 Mbp genomes would require less than 8 gigabytes of memory (arithmetic below)
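The arithmetic behind the bound, as a quick check (decimal gigabytes assumed).

    bytes_per_base = 37                          # generous upper bound from the slide
    total_bases = 2 * 100_000_000                # two 100 Mbp genomes
    print(bytes_per_base * total_bases / 1e9)    # 7.4 GB, under the 8 GB quoted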
59. Improvements: MUMmer 2.0
- New suffix tree algorithm (Kurtz)
- At most 20 bytes per base pair (compared to 37)
- The query string is streamed against the suffix tree, cutting down suffix tree storage
60. Improvements: MUMmer 3.0
- New graphical modules
- Search operations are optimal or near-optimal
- Code rewrite (modular design)
- Open source
61. MUMmer: Conclusion
62. In Conclusion
- MUMmer allows alignment of sequences that are too long to be aligned using previous algorithms
- It utilizes suffix trees, the LIS, and Smith-Waterman to obtain its results
- It outputs a base-to-base alignment of two sequences
63. MUMmer: Discussion
64. Discussion Questions
- What if MUMs overlap? What if some are longer than others? How does the LIS take this into account?
- Are repeats really ignored?
- Should repeats be ignored? Aren't they part of the global alignment? The goal is to obtain the same optimal alignment as Needleman/Wunsch, which does not ignore repeats.
65. Demonstration
- http://www.tigr.org/tigr-scripts/CMR2/webmum/mumplot