Alignment of whole genomes using suffix trees

1
Alignment of whole genomes using suffix trees
  • Mahshid Shakiba
  • Nov 17, 2004

IFT 6299, University of Montreal
2
Outline
  • Motivation
  • MUMmer
  • Algorithms
  • Observations
  • MUMmer 2
  • Algorithms
  • NUCmer
  • PROmer
  • Observations
  • MUMmer 3

3
Motivation
  • We need to:
  • Align the entire genomes of two closely related
    organisms (millions of bp)
  • Compare sequence assemblies at different stages
    of a project
  • Compare the results of different assembly
    algorithms
  • Current algorithms:
  • Are ineffective at aligning very long DNA
    sequences
  • Cannot detect large-scale changes

4
Application
  • MUMmer, a system to align and compare very large
    DNA and protein sequences in linear time
  • History:
  • MUMmer 1.0, 1999
  • MUMmer 2.1, 2002
  • MUMmer 3.0, 2004
  • Created at TIGR (The Institute for Genomic Research)
  • Website: http://www.tigr.org/software/mummer/

5
MUMmer 1.0
  • Assumption: the two DNA sequences are closely
    related
  • Inputs: two DNA sequences and the length of the
    shortest MUM (Maximal Unique Match)
  • Output: a base-to-base alignment of the input
    sequences that highlights SNPs, large inserts,
    significant repeats, transpositions and reversals
  • Techniques: suffix trees, LIS (Longest
    Increasing Subsequence), Smith-Waterman
    alignment algorithm

6
MUMmer
  • Algorithm:
    1. Construct a suffix tree for genomes A and B
    2. Find and sort the MUMs
    3. Match the MUMs
    4. Close the gaps between the MUMs
7
Suffix tree
  • Definition: a compact representation of all
    possible suffixes of an input string S
  • Can be built in O(m) time and space, where m = |S|
  • Searching for a substring X takes O(n) time,
    where n = |X|

8
Suffix tree
Suffixes of gaaccgacct:
   1  gaaccgacct
   2  aaccgacct
   3  accgacct
   4  ccgacct
   5  cgacct
   6  gacct
   7  acct
   8  cct
   9  ct
  10  t
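
The suffix list above can be turned into a searchable structure. The following is a minimal sketch, assuming an uncompressed suffix trie stored as nested dictionaries. It is not the linear-time, path-compressed suffix tree that MUMmer uses (building it this way costs O(m^2) time and space), and the names build_suffix_trie and contains are hypothetical. It only illustrates why a substring search takes time proportional to the length of the pattern.

```python
def build_suffix_trie(text):
    """Insert every suffix of text + '$' into a nested-dict trie.

    Illustration only: this takes O(m^2) time and space, unlike a real
    suffix tree, which compresses paths and is built in O(m).
    """
    text += "$"  # unique terminator so that no suffix is a prefix of another
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["start"] = start + 1  # 1-based suffix number, as on the slide
    return root


def contains(trie, pattern):
    """Walk the trie one character at a time: O(n) for a pattern of length n."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("gaaccgacct")
print(contains(trie, "ccga"))  # "ccga" is a substring of the text
print(contains(trie, "gat"))   # "gat" is not
```

Every substring of the text is a prefix of some suffix, which is why following the pattern from the root is enough to decide membership.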
9
Maximal Unique Match
  • Sequences in genomes A and B that:
  • occur exactly once in A and exactly once in B
  • are not contained in any larger such sequence
  • Example (the MUM is shown in uppercase):
  • Genome A: tcgatcGACGATCGCAGCATAAcgact
  • Genome B: gcattaGACGATCGCAGCATAAtcca
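
The definition can be checked directly on the two example genomes. The sketch below is a brute-force search, assuming a quadratic scan over all start positions rather than the suffix-tree method MUMmer uses. The function names find_mums and count_occurrences and the min_len parameter are hypothetical.

```python
def count_occurrences(text, pattern):
    """Count possibly overlapping occurrences of pattern in text."""
    count, pos = 0, text.find(pattern)
    while pos != -1:
        count += 1
        pos = text.find(pattern, pos + 1)
    return count


def find_mums(a, b, min_len):
    """Return (pos_a, pos_b, length) for each maximal unique match >= min_len."""
    a, b = a.lower(), b.lower()
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            # Left-maximal: skip pairs that could be extended to the left.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Right-maximal: extend while the characters agree.
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            if k < min_len:
                continue
            match = a[i:i + k]
            # Unique: the match occurs exactly once in each genome.
            if count_occurrences(a, match) == 1 and count_occurrences(b, match) == 1:
                mums.append((i, j, k))
    return mums


genome_a = "tcgatcGACGATCGCAGCATAAcgact"
genome_b = "gcattaGACGATCGCAGCATAAtcca"
print(find_mums(genome_a, genome_b, min_len=10))
```

With min_len=10 this reports a single match of 16 bases starting at 0-based offset 6 in both genomes, which is the uppercase region.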

10
Finding, sorting MUMs
  • MUM: an internal node with exactly one leaf from
    each genome in its subtree
  • With a single scan of the suffix tree, find all
    MUMs
  • Sort the MUMs by their position in genome A.

11
Matching MUMs
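[Figure: MUMs numbered 1-7 in genome A, connected to their positions in genome B.]
Select the longest consistent set of MUMs that occur in
the same order in A and B.
[Figure: after selection, MUMs 1, 2, 4, 5 and 7 remain in both A and B.]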
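
Selecting the longest set of MUMs that keep the same order in both genomes is a longest increasing subsequence (LIS) problem: number the MUMs by their order in A, list those numbers in the order they appear in B, and take the LIS. The sketch below assumes the standard O(n log n) method based on binary search. It ignores MUM lengths and overlaps, which a real implementation may weigh, and the order used for genome B is a hypothetical example chosen so that the result is 1, 2, 4, 5, 7.

```python
from bisect import bisect_left


def longest_increasing_subsequence(seq):
    """Return one longest strictly increasing subsequence of seq in O(n log n)."""
    tails = []       # tails[k]: smallest last value of an increasing run of length k + 1
    tails_idx = []   # index in seq of each tails[k]
    prev = [-1] * len(seq)  # back-pointers used to rebuild the subsequence
    for i, value in enumerate(seq):
        k = bisect_left(tails, value)
        if k == len(tails):
            tails.append(value)
            tails_idx.append(i)
        else:
            tails[k] = value
            tails_idx[k] = i
        prev[i] = tails_idx[k - 1] if k > 0 else -1
    # Follow the back-pointers from the end of the longest run.
    result = []
    i = tails_idx[-1] if tails_idx else -1
    while i != -1:
        result.append(seq[i])
        i = prev[i]
    return result[::-1]


# Hypothetical example: MUMs are numbered by their order in genome A and
# listed here in the order in which they appear in genome B.
order_in_b = [1, 2, 6, 4, 5, 3, 7]
print(longest_increasing_subsequence(order_in_b))
```

MUMs 3 and 6 are out of order in this example, so they are dropped from the alignment backbone.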
12
Closing Gaps
  • Gaps: interruptions in the MUM alignment
  • Types of gaps:
  • SNP (single nucleotide polymorphism)
  • Insertion
  • Highly polymorphic region
  • Repeat
  • Closing methods:
  • Repeat the procedure using a shorter minimum
    length for MUMs (long gaps)
  • Use Smith-Waterman alignments (short gaps)
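
For short gaps, a local alignment of the two gap sequences can be computed with Smith-Waterman dynamic programming. The sketch below assumes a linear gap penalty and illustrative scores (match = 2, mismatch = -1, gap = -1). These values and the function name smith_waterman are assumptions, not MUMmer's actual scoring parameters.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return (best local score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("CCGATTACACC", "TTGATTACATT"))
```

The table has len(a) x len(b) cells, which is why this step is reserved for short gaps and long gaps are handled by searching for shorter MUMs instead.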

13
MUMmer Output
14
Observations
  • Align 2 highly homologous strains of
    M. tuberculosis, 4.4 million bp.
  • Time: 5 s suffix tree construction, 45 s sorting
    MUMs, 5 s Smith-Waterman alignments.
  • Longest MUM: 24,563 bp; 249 MUMs > 5000 bp; > 90%
    identical
  • Align 2 cousin bacteria, M. genitalium (580 kbp)
    and M. pneumoniae (816 kbp)
  • Time: 6.5 s suffix tree, 0.02 s finding LIS, 116
    s alignments.
  • Longest MUM: 281 bp; 16 MUMs > 100 bp; < 50%
    identical

15
Observations
  • Align 2 syntenic sequences from human chromosome
    12 and mouse chromosome 6 (225 kbp).
  • Time: 29 s in total, 1.6 s for the suffix tree.
  • Longest MUM: 117 bp; 10 MUMs > 50 bp

16
MUMmer 2
  • Problems with MUMmer 1:
  • Aligns only DNA sequences
  • Needs lots of memory
  • Cannot align incomplete genomes
  • Solution: MUMmer 2
  • 3 times faster than MUMmer 1
  • Requires 1/3 of the space
  • Can align protein sequences and incomplete genomes
  • Parallel alignment

17
MUMmer 2
  • Algorithm:
  • Use only 20 bytes per bp (MUMmer 1: 38 bytes)
  • Build the suffix tree for the shorter sequence only
  • Find MUMs by streaming the second sequence
    against the suffix tree
  • Cluster the matches

18
Streaming algorithm
Streaming String
...atgtcc...
Suffix Tree for String
atgtgtgtc
[Figure: suffix tree for atgtgtgtc with leaves numbered 1-10 and edge labels such as c, t, gt and gtc; the streaming string is matched against the tree.]
Use suffix links to find the start of the next match.
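
The streaming idea can be sketched with the two strings from this slide. The example below assumes a naive variant: it indexes the reference as an uncompressed suffix trie and restarts from the root for every position of the streaming string. MUMmer 2 avoids that restart by following suffix links, which this sketch does not implement, and the function names are hypothetical.

```python
def build_suffix_trie(text):
    """Nested-dict trie holding every suffix of text (illustration only)."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def matching_lengths(reference, query):
    """For each query position, the length of the longest match into reference."""
    trie = build_suffix_trie(reference)
    lengths = []
    for start in range(len(query)):
        node, length = trie, 0
        # Restart from the root; suffix links would let a real
        # implementation continue from where the previous match ended.
        for ch in query[start:]:
            if ch not in node:
                break
            node = node[ch]
            length += 1
        lengths.append(length)
    return lengths


print(matching_lengths("atgtgtgtc", "atgtcc"))  # prints [4, 4, 3, 2, 1, 1]
```

Only the reference string is indexed, so the memory cost depends on the indexed sequence alone, while the streamed sequence is read once.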
19
Cluster MUMs
  • Needed to align an unfinished assembly, which may
    require rearrangement
  • Cluster MUMs that are close enough to each other
  • Find the Longest Increasing Subsequence
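
A minimal sketch of the clustering step, assuming that "close enough" means the gap to the previous match is at most max_gap in both genomes. The threshold, the single-pass strategy and the name cluster_matches are assumptions for illustration. The slide does not state the exact criterion MUMmer 2 uses.

```python
def cluster_matches(matches, max_gap):
    """Group matches (pos_a, pos_b, length) separated by at most max_gap.

    Simplified single pass: each match is compared only with the last
    match of the most recent cluster, after sorting by position in A.
    """
    clusters = []
    for match in sorted(matches):
        pos_a, pos_b, _ = match
        if clusters:
            last_a, last_b, last_len = clusters[-1][-1]
            gap_a = pos_a - (last_a + last_len)
            gap_b = abs(pos_b - (last_b + last_len))
            if gap_a <= max_gap and gap_b <= max_gap:
                clusters[-1].append(match)
                continue
        clusters.append([match])
    return clusters


# Hypothetical matches: two near the start of A, two around position 5000.
example = [(0, 0, 20), (30, 32, 25), (5000, 200, 30), (5040, 240, 20)]
for cluster in cluster_matches(example, max_gap=100):
    print(cluster)
```

Each cluster can then be ordered independently, which is what allows rearranged pieces of an unfinished assembly to be aligned.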

20
NUCmer
  • Multi-contig alignment program
  • Uses MUMmer 2
  • Can:
  • Compare assemblies at different stages of a project
  • Compare unfinished genomes to a closely related
    genome (speeds up the finishing step)
  • Compare the outputs of 2 different assembly programs

21
NUCmer
  • Inputs: 2 multi-FASTA files
  • Output: alignment of every contig in the first
    file to every sequence in the second file
  • Algorithm:
  • Create a map of all contig positions within each
    file
  • Concatenate the contigs in each file
  • Run MUMmer to find MUMs
  • Map the matches back to the separate contigs
  • Cluster MUMs
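
The position map and the map-back step can be sketched as follows. This assumes contigs are joined with a single "N" separator and that start offsets are kept in a sorted list for binary search. The separator choice and the names concatenate_contigs and map_back are hypothetical and not taken from NUCmer itself.

```python
from bisect import bisect_right


def concatenate_contigs(contigs, separator="N"):
    """Join (name, sequence) pairs into one string and record each start offset."""
    names, starts, pieces, offset = [], [], [], 0
    for name, seq in contigs:
        names.append(name)
        starts.append(offset)
        pieces.append(seq)
        offset += len(seq) + len(separator)
    return separator.join(pieces), names, starts


def map_back(position, names, starts):
    """Convert a position in the concatenated string to (contig name, offset)."""
    index = bisect_right(starts, position) - 1
    return names[index], position - starts[index]


# Hypothetical contigs standing in for the records of a multi-FASTA file.
contigs = [("contig1", "ACGT"), ("contig2", "GGCC"), ("contig3", "TT")]
joined, names, starts = concatenate_contigs(contigs)
print(joined, starts)
print(map_back(6, names, starts))
```

A match that starts on a separator or spans two contigs would need extra handling, which this sketch leaves out.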

22
PROmer
  • Protein-based alignment program
  • Input: 2 multi-FASTA files
  • Technique:
  • Translate DNA into amino acids (AA) in all 6
    reading frames
  • Map each protein to its DNA sequence
  • Concatenate all potential proteins
  • Run MUMmer, cluster MUMs based on DNA coordinates
  • Examine a series of consecutive, consistent
    matches
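
The six-frame translation step can be sketched as below, assuming the standard genetic code and uppercase A/C/G/T input, with any codon containing another symbol translated to "X". The frame labels (+1 to +3, -1 to -3) and the function names are illustrative choices, not PROmer's own conventions.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order for each codon position.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Translate both strands at offsets 0, 1 and 2."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for frame, protein in six_frame_translation("ATGGCCTAA").items():
    print(frame, protein)
```

Keeping the frame label with each translation is what makes it possible to map a protein match back to DNA coordinates later.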

23
Observation
  • Align P. yoelii (5x coverage) and P. falciparum (8x
    coverage), size 25 Mb
  • PROmer time: < 1 h
  • BLAST time: weeks
  • > 70% of human chromosome 14 is a duplication of
    part of chromosome 2
  • Align E. coli (4.7 Mb) and V. cholerae (3 Mb) on a 1
    GHz desktop computer
  • MUMmer 1: 74 s, 293 MB
  • MUMmer 2: 27 s, 100 MB
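
As a quick arithmetic check, the E. coli vs. V. cholerae figures above can be compared with the earlier "3 times faster" and "1/3 of the space" claims. The variable names are illustrative.

```python
# Figures reported on this slide for E. coli vs. V. cholerae.
mummer1 = {"seconds": 74, "megabytes": 293}
mummer2 = {"seconds": 27, "megabytes": 100}

speedup = mummer1["seconds"] / mummer2["seconds"]
memory_ratio = mummer1["megabytes"] / mummer2["megabytes"]
print(f"speed-up: {speedup:.2f}x, memory reduction: {memory_ratio:.2f}x")
# prints: speed-up: 2.74x, memory reduction: 2.93x
```

Both ratios are close to 3, consistent with the summary given for MUMmer 2.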

24
MUMmer 3.0
  • Requires 25% less memory
  • Open source
  • View results with a graphical UI
  • Align human vs. human genome:
  • Computer: Sun SPARC, Solaris OS, 64 GB, 950 MHz
  • Size: 2,839 Mbp
  • Time: suffix tree, 4.7 h; memory, 4 GB; query,
    101.5 h; total, 4.5 days

25
References
  • Kurtz et al. (2004) Versatile and open software
    for comparing large genomes, Genome Biology,
    5, R12
  • Delcher et al. (2002) Fast algorithms for
    large-scale genome alignment and comparison,
    Nucleic Acids Res., 30, 2478-2483
  • Delcher et al. (1999) Alignment of whole genomes,
    Nucleic Acids Res., 27, 2369-2376
  • Gusfield, D. (1997) Algorithms on Strings, Trees
    and Sequences: Computer Science and Computational
    Biology
  • http://neo.bu.edu/kasif/be777/HarvardFeb2002.ppt
  • http://theorie.informatik.uni-ulm.de/Lehre/SS4/CCG/MUMmer.ppt
  • http://www.sna.csie.ndhu.edu.tw/lung/seminar/whole.ppt