Alignment of whole genomes using suffix trees

1
Alignment of whole genomes using suffix trees

Mahshid Shakiba
Nov 17, 2004

IFT 6299, University of Montreal
2
Outline

Motivation
MUMmer
  Algorithms
  Observations
MUMmer 2
  Algorithms
  NUCmer
  PROmer
  Observations
MUMmer 3

3
Motivation

We need to:
  Align the entire genomes of two closely related organisms (millions of bp)
  Compare sequence assemblies at different stages of a project
  Compare the results of different assembly algorithms
Current algorithms:
  Are ineffective at aligning very long DNA sequences
  Cannot detect large-scale changes

4
Application

MUMmer: a system to align and compare very large DNA and protein sequences in linear time
History:
  MUMmer 1.0, 1999
  MUMmer 2.1, 2002
  MUMmer 3.0, 2004
Created at TIGR (The Institute for Genomic Research)
Website: http://www.tigr.org/software/mummer/

5
MUMmer 1.0

Assumption: the two DNA sequences are closely related
Inputs: two DNA sequences and the length of the shortest MUM (Maximal Unique Match) to report
Output: a base-to-base alignment of the input sequences that highlights SNPs, large inserts, significant repeats, transpositions and reversals
Techniques: suffix trees, LIS (Longest Increasing Subsequence), Smith-Waterman alignment algorithm

6
MUMmer

Algorithm

1. Construct a suffix tree for genomes A and B
2. Find and sort MUMs
3. Match MUMs
4. Close gaps
[Figure labels: MUMs, gaps]
7
Suffix tree

Definition: a compact representation of all possible suffixes of an input string S
Can be built in O(m) time and space, where m = |S|
Searching for a substring X takes O(n) time, where n = |X|

8
Suffix tree
Suffixes of gaaccgacct:
1 gaaccgacct
2 aaccgacct
3 accgacct
4 ccgacct
5 cgacct
6 gacct
7 acct
8 cct
9 ct
10 t
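As a rough illustration of the two slides above, the sketch below lists the suffixes of the example string and answers substring queries by walking a naive suffix trie. It is only a teaching aid: the trie is built in O(m^2) time, not with the linear-time suffix tree construction the slides refer to, and the function names are made up for this example.

```python
def suffixes(s):
    """Return (position, suffix) pairs, with 1-based positions as on the slide."""
    return [(i + 1, s[i:]) for i in range(len(s))]


def build_suffix_trie(s):
    """Naive O(m^2) suffix trie stored as nested dicts (NOT a linear-time suffix tree)."""
    root = {}
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node.setdefault(ch, {})
    return root


def contains(trie, x):
    """Walk the trie along x; takes at most len(x) steps, independent of the text length."""
    node = trie
    for ch in x:
        if ch not in node:
            return False
        node = node[ch]
    return True


text = "gaaccgacct"
trie = build_suffix_trie(text)
found = contains(trie, "ccga")      # "ccga" is a substring of the text
missing = contains(trie, "gat")     # "gat" is not
```

The query cost depends only on the length of X, which is the property the O(n) search bound describes.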
9
Maximal Unique Match

Sequences in genomes A and B that:
  occur exactly once in A and exactly once in B
  are not contained in any larger such sequence
Genome A: tcgatcGACGATCGCAGCATAAcgact
Genome B: gcattaGACGATCGCAGCATAAtcca
(The MUM shared by A and B is shown in upper case.)
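To make the definition concrete, here is a brute-force sketch that applies it literally to the two example genomes. It assumes a quadratic scan is acceptable for tiny inputs; it is not how MUMmer finds MUMs (that uses the suffix tree), and `find_mums` and `count_occurrences` are names invented for this example.

```python
def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, including overlapping ones."""
    count, start = 0, text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def find_mums(a, b, min_len):
    """Return (start_in_a, start_in_b, sequence) for every MUM of length >= min_len.

    Positions are 0-based. Brute force, for illustration only.
    """
    a, b = a.lower(), b.lower()
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Skip matches that could be extended to the left (not maximal).
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend to the right as far as the two genomes agree.
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            match = a[i:i + k]
            # Keep the match only if it is unique in both genomes.
            if (k >= min_len
                    and count_occurrences(a, match) == 1
                    and count_occurrences(b, match) == 1):
                mums.append((i, j, match))
    return mums


genome_a = "tcgatcGACGATCGCAGCATAAcgact"
genome_b = "gcattaGACGATCGCAGCATAAtcca"
mums = find_mums(genome_a, genome_b, min_len=10)
```

With a minimum length of 10, the only match reported for these two strings is the 16-base upper-case region.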
10
Finding, sorting MUMs

MUM: an internal node with a leaf from each genome in its subtree
With a single scan of the suffix tree, find all MUMs
Sort the MUMs by their position in genome A

11
Matching MUMs
Select the longest consistent set of MUMs, i.e. those that occur in the same order in A and B.
[Figure: MUMs numbered 1 to 7 in genomes A and B, before and after selection; the selected set is MUMs 1, 2, 4, 5 and 7.]
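The selection step can be sketched as a longest increasing subsequence (LIS) computation: with the MUMs sorted by their position in A, the longest run whose positions in B also increase is a consistent set. The version below is the textbook O(n log n) LIS and ignores MUM lengths and overlaps, which a real aligner may need to take into account. The positions in the usage example are hypothetical and are not taken from the figure.

```python
from bisect import bisect_left


def longest_increasing_subsequence(values):
    """Return the indices of one longest strictly increasing subsequence of values."""
    tails = []        # tails[k]: smallest last value of an increasing run of length k + 1
    tails_idx = []    # index in values of that last value
    prev = [-1] * len(values)   # back-pointers used to rebuild the run
    for i, v in enumerate(values):
        k = bisect_left(tails, v)
        if k == len(tails):
            tails.append(v)
            tails_idx.append(i)
        else:
            tails[k] = v
            tails_idx[k] = i
        prev[i] = tails_idx[k - 1] if k > 0 else -1
    # Walk the back-pointers from the end of the longest run.
    chain = []
    i = tails_idx[-1] if tails_idx else -1
    while i != -1:
        chain.append(i)
        i = prev[i]
    return chain[::-1]


# Hypothetical MUMs, already sorted by position in A; each value is the MUM's position in B.
b_positions = [10, 300, 150, 400, 520, 900, 610]
kept = longest_increasing_subsequence(b_positions)
```

MUMs whose index is not in `kept` are out of order relative to the rest and would be treated as rearrangements rather than as part of the main alignment.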
12
Closing Gaps

Gaps: interruptions in the MUM alignment
Types of gaps:
  SNP (Single Nucleotide Polymorphism)
  Insertion
  Highly polymorphic region
  Repeat
Closing methods:
  Repeat the procedure using a shorter minimum length for MUMs (long gaps)
  Use Smith-Waterman alignment (short gaps)
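For the short-gap case, a minimal Smith-Waterman sketch is shown below. The scoring values (match +2, mismatch -1, linear gap -1) are assumptions chosen for the example, not the parameters used by MUMmer, and the function name is illustrative.

```python
def smith_waterman(s, t, match=2, mismatch=-1, gap=-1):
    """Local alignment of s and t; returns (best_score, aligned_s, aligned_t)."""
    rows, cols = len(s) + 1, len(t) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    # Fill the score matrix; scores never drop below zero (local alignment).
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        step = match if s[i - 1] == t[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + step:
            top.append(s[i - 1])
            bottom.append(t[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(s[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(t[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


# Two short gap regions that differ by a single base (an SNP-like gap).
score, aligned_a, aligned_b = smith_waterman("acgtacgt", "acgaacgt")
```

Because the matrix has one cell per pair of bases, this is only practical for short gaps, which is consistent with using shorter MUMs for the long ones.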

13
MUMmer Output
14
Observations

Aligning 2 highly homologous strains of M. tuberculosis (4.4 million bp):
  Time: 5 s for suffix tree construction, 45 s for sorting MUMs, 5 s for Smith-Waterman alignments.
  Longest MUM: 24,563 bp; 249 MUMs > 5000 bp; > 90% identical.
Aligning 2 cousin bacteria, M. genitalium (580 kbp) and M. pneumoniae (816 kbp):
  Time: 6.5 s for suffix tree construction, 0.02 s for finding the LIS, 116 s for alignments.
  Longest MUM: 281 bp; 16 MUMs > 100 bp; < 50% identical.

15
Observations

Aligning 2 syntenic sequences from human chromosome 12 and mouse chromosome 6 (225 kbp):
  Time: 29 s in total, 1.6 s for the suffix tree.
  Longest MUM: 117 bp; 10 MUMs > 50 bp.

16
MUMmer 2

Problems with MUMmer 1:
  Aligns only DNA sequences
  Needs a lot of memory
  Cannot align incomplete genomes
Solution: MUMmer 2
  3 times faster than MUMmer 1
  Requires 1/3 of the space
  Can align protein sequences and incomplete genomes
  Parallel alignment

17
MUMmer 2

Algorithm
Uses only 20 bytes per bp (MUMmer 1: 38 bytes)
Build a suffix tree for the shorter sequence
Find MUMs by streaming the second sequence against the suffix tree
Cluster the matches

18
Streaming algorithm
[Figure: the streaming string ...atgtcc... matched against the suffix tree for the string atgtgtgtc.]
Use suffix links to find the start of the next match.
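The role of suffix links during streaming can be sketched with a suffix automaton, a structure related to the suffix tree that is much shorter to write down. This is a swapped-in technique for illustration, not the suffix tree code MUMmer uses. For each position of the streamed string, the sketch reports the length of the longest match against the indexed string that ends there. On a mismatch it follows suffix links instead of restarting from the root. Class and function names are invented for this example.

```python
class SuffixAutomaton:
    """Suffix automaton of a text, built online; link[] plays the role of suffix links."""

    def __init__(self, text):
        self.next = [{}]      # outgoing transitions of each state
        self.link = [-1]      # suffix link of each state
        self.length = [0]     # length of the longest string in each state
        last = 0
        for ch in text:
            cur = self._new_state(self.length[last] + 1)
            p = last
            while p != -1 and ch not in self.next[p]:
                self.next[p][ch] = cur
                p = self.link[p]
            if p == -1:
                self.link[cur] = 0
            else:
                q = self.next[p][ch]
                if self.length[p] + 1 == self.length[q]:
                    self.link[cur] = q
                else:
                    # Split state q so that the shorter strings get their own state.
                    clone = self._new_state(self.length[p] + 1)
                    self.next[clone] = dict(self.next[q])
                    self.link[clone] = self.link[q]
                    while p != -1 and self.next[p].get(ch) == q:
                        self.next[p][ch] = clone
                        p = self.link[p]
                    self.link[q] = clone
                    self.link[cur] = clone
            last = cur

    def _new_state(self, length):
        self.next.append({})
        self.link.append(-1)
        self.length.append(length)
        return len(self.next) - 1


def stream_matches(automaton, query):
    """For each query position, length of the longest indexed substring ending there."""
    state, matched, out = 0, 0, []
    for ch in query:
        # On a mismatch, shorten the current match via suffix links.
        while state != 0 and ch not in automaton.next[state]:
            state = automaton.link[state]
            matched = automaton.length[state]
        if ch in automaton.next[state]:
            state = automaton.next[state][ch]
            matched += 1
        else:
            matched = 0
        out.append(matched)
    return out


index = SuffixAutomaton("atgtgtgtc")
lengths = stream_matches(index, "atgtcc")
```

For the slide's strings, the match grows to "atgt", falls back to "tgt" through a suffix link so that "tgtc" can be matched, and then drops to a single "c".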
19
Cluster MUMs

Purpose: align an unfinished assembly whose contigs may need rearrangement
Cluster MUMs that are close enough to each other
Find the Longest Increasing Subsequence within each cluster
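One simple way to read "close enough" is sketched below: matches are sorted by position in A, and a match joins the current cluster when the gap to the previous match is small in both sequences. The single `max_gap` threshold and the tuple layout are assumptions made for this example; they are not MUMmer's actual clustering parameters.

```python
def cluster_matches(matches, max_gap=100):
    """Group matches into clusters of nearby matches.

    matches: list of (start_in_a, start_in_b, length) tuples.
    A match joins the current cluster if it starts within max_gap bases of the
    end of the previous match in both A and B; otherwise it opens a new cluster.
    """
    clusters = []
    for m in sorted(matches):
        if clusters:
            prev = clusters[-1][-1]
            gap_a = m[0] - (prev[0] + prev[2])
            gap_b = abs(m[1] - (prev[1] + prev[2]))
            if gap_a <= max_gap and gap_b <= max_gap:
                clusters[-1].append(m)
                continue
        clusters.append([m])
    return clusters


# Hypothetical matches: two pairs that sit far apart and in a different order in B.
example = [(0, 1000, 50), (60, 1055, 40), (5000, 200, 30), (5040, 235, 25)]
clusters = cluster_matches(example)
```

Each cluster can then be ordered on its own, so a contig that was placed elsewhere or rearranged does not break the ordering of the other clusters.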

20
NUCmer

Multiple-contig alignment program
Uses MUMmer 2
Can:
  Compare assemblies at different stages of a project
  Compare unfinished genomes to a closely related genome (speeds up the finishing step)
  Compare the outputs of 2 different assembly programs

21
NUCmer

Inputs: 2 multi-FASTA files
Output: alignment of every contig in the first file to every sequence in the second file
Algorithm:
  Create a map of all contig positions within each file
  Concatenate the contigs in each file
  Run MUMmer to find MUMs
  Map the matches back to the separate contigs
  Cluster MUMs
22
PROmer

Protein-based alignment program
Input: 2 multi-FASTA files
Technique:
  Translate DNA into amino acids in all 6 reading frames
  Map each protein to its DNA sequence
  Concatenate all potential proteins
  Run MUMmer, cluster MUMs based on DNA coordinates
  Examine series of consecutive, consistent matches
23
Observation

Aligning P. yoelii (5x coverage) and P. falciparum (8x coverage), size 25 Mb:
  PROmer time: < 1 h
  BLAST time: weeks
> 70% of human chromosome 14 is a duplication of part of chromosome 2
Aligning E. coli (4.7 Mb) and V. cholerae (3 Mb) on a 1 GHz desktop computer:
  MUMmer 1: 74 s, 293 MB
  MUMmer 2: 27 s, 100 MB

24
MUMmer 3.0

Requires 25% less memory
Open source
Results can be viewed with a graphical UI
Aligning human vs. human genome:
  Computer: Sun SPARC, Solaris OS, 64 GB, 950 MHz
  Size: 2,839 Mbp
  Time: suffix tree, 4.7 h; memory, 4 GB; query, 101.5 h; total, 4.5 days

25
References

Kurtz et al. (2004) Versatile and open software for comparing large genomes, Genome Biology, 5:R12
Delcher et al. (2002) Fast algorithms for large-scale genome alignment and comparison, Nucleic Acids Res., 30, 2478-2483
Delcher et al. (1999) Alignment of whole genomes, Nucleic Acids Res., 27, 2369-2376
Gusfield, D. (1997) Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology
http://neo.bu.edu/kasif/be777/HarvardFeb2002.ppt
http://theorie.informatik.uni-ulm.de/Lehre/SS4/CCG/MUMmer.ppt
http://www.sna.csie.ndhu.edu.tw/lung/seminar/whole.ppt