Whole Genome Alignment

1
Whole Genome Alignment
  • MUMmer and Alignment
  • October 2nd, 2007
  • Adam M Phillippy
  • amp_at_cs.umd.edu

2
Goal of WGA
  • For two genomes, A and B, find a mapping from
    each position in A to its corresponding position
    in B

41 bp genome
3
Not so fast...
  • Genome A may have insertions, deletions,
    translocations, inversions, duplications or SNPs
    with respect to B (sometimes all of the above)

4
Visualization
  • How can we visualize alignments?
  • With an identity plot
  • XY plot
  • Let x = position in genome A
  • Let y = similarity of A_x to the corresponding
    position in B
  • Plot the identity function
  • This can reveal islands of conservation, e.g.
    exons
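A minimal sketch of how the y values of an identity plot could be computed, assuming the two sequences are already aligned and padded with a gap character. The function name, the window size and the "-" gap symbol are illustrative choices, not part of the original slides.

```python
def identity_profile(aligned_a, aligned_b, window=10, gap="-"):
    """Return (x, percent identity) points for consecutive alignment windows.

    x is the position in ungapped genome A at which each window starts.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    profile = []
    a_pos = 0  # number of A bases seen before the current window
    for start in range(0, len(aligned_a), window):
        cols_a = aligned_a[start:start + window]
        cols_b = aligned_b[start:start + window]
        # A column counts as identical only if both bases agree and neither is a gap.
        matches = sum(1 for a, b in zip(cols_a, cols_b) if a == b and a != gap)
        profile.append((a_pos, 100.0 * matches / len(cols_a)))
        a_pos += sum(1 for a in cols_a if a != gap)
    return profile
```

Plotting the returned pairs as an XY curve gives the identity plot; windows with high values correspond to the conserved islands mentioned above.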

5
Identity plot example
6
WGA visualization
  • How can we visualize whole genome alignments?
  • With an alignment dot plot
  • N x M matrix
  • Let i = position in genome A
  • Let j = position in genome B
  • Fill cell (i,j) if A_i shows similarity to B_j
  • A perfect alignment between A and B would
    completely fill the positive diagonal
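The dot plot construction above can be sketched as follows. "Similarity" is taken here to mean an exact shared k-mer on the forward strand, which is an assumption made for brevity; the function names and the default k are hypothetical.

```python
def dot_plot(a, b, k=3):
    """Return the set of cells (i, j) where the k-mer starting at A[i] equals the k-mer at B[j]."""
    index = {}
    for j in range(len(b) - k + 1):
        index.setdefault(b[j:j + k], []).append(j)
    cells = set()
    for i in range(len(a) - k + 1):
        for j in index.get(a[i:i + k], []):
            cells.add((i, j))
    return cells


def render(cells, n, m):
    """Draw the plot as text with A on the x axis and B on the y axis (origin at bottom left)."""
    return "\n".join(
        "".join("*" if (i, j) in cells else "." for i in range(n))
        for j in reversed(range(m))
    )
```

Comparing a sequence with itself fills the positive diagonal. Detecting inversions would also require matching against the reverse complement, which this sketch leaves out.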

7
Translocation
Inversion
Insertion
http://mummer.sourceforge.net/manual/AlignmentTypes.pdf
9
Global vs. Local
  • Global pairwise alignment
  • ...AAGCTTGGCTTAGCTGCTAGGGTAGGCTTGGG...
  • ...AAGCTGGGCTTAGTTGCTAG..TAGGCTTTGG...
  • Whole genome alignment
  • Often impossible to represent as a global
    alignment
  • We will assume a set of local alignments
  • This works great for draft sequence

10
Global vs. Local
11
Alignment tools
  • Whole genome alignment
  • MUMmer (nucmer)
  • Developed, supported and available at TIGR
  • LAGAN, AVID
  • VISTA identity plots
  • Multiple genome alignment
  • MGA, MLAGAN, DIALIGN, MAVID
  • Multiple alignment
  • Muscle, ClustalW
  • Local sequence alignment
  • BLAST, FASTA, Vmatch

open source
12
MUMmer
  • Maximal Unique Matcher (MUM)
  • match
  • exact match of a minimum length
  • maximal
  • cannot be extended in either direction without a
    mismatch
  • unique
  • occurs only once in both sequences (MUM)
  • occurs only once in a single sequence (MAM)
  • occurs one or more times in either sequence (MEM)
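The three definitions can be made concrete with a brute-force sketch. It runs in quadratic time and is only meant to illustrate the terms; MUMmer itself uses a suffix tree. The function names are hypothetical, and "a single sequence" in the MAM definition is assumed here to mean the reference.

```python
def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, including overlapping ones."""
    count, start = 0, text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def maximal_matches(ref, qry, min_len=3):
    """Return (ref_pos, qry_pos, length) for every maximal exact match of at least min_len."""
    matches = []
    for i in range(len(ref)):
        for j in range(len(qry)):
            if ref[i] != qry[j]:
                continue
            if i > 0 and j > 0 and ref[i - 1] == qry[j - 1]:
                continue  # extendable to the left, so not maximal
            length = 0
            while (i + length < len(ref) and j + length < len(qry)
                   and ref[i + length] == qry[j + length]):
                length += 1  # extend right until a mismatch or a sequence end
            if length >= min_len:
                matches.append((i, j, length))
    return matches


def classify(ref, qry, match):
    """Label a maximal match as MUM, MAM or MEM by how often its string occurs."""
    i, _, length = match
    seq = ref[i:i + length]
    in_ref = count_occurrences(ref, seq)
    in_qry = count_occurrences(qry, seq)
    if in_ref == 1 and in_qry == 1:
        return "MUM"
    if in_ref == 1:
        return "MAM"  # assumption: unique in the reference only
    return "MEM"
```

For example, with ref "AAAC" and qry "AAAT" the match "AAA" is a MUM, while the shorter maximal match "AA" occurs twice in each sequence and is therefore only a MEM.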

13
Fee Fi Fo Fum, is it a MAM, MEM or MUM?
MUM: maximal unique match
MAM: maximal almost-unique match
MEM: maximal exact match
R
Q
14
Seed and extend
  • How can we make MUMs BIGGER?
  • Find MUMs
  • using a suffix tree
  • Cluster MUMs
  • using size, gap and distance parameters
  • Extend clusters
  • using modified Smith-Waterman algorithm

15
Seed and extend
FIND all MUMs
CLUSTER consistent MUMs
EXTEND alignments
R
Q
16
Suffix Tree for atgtgtgtc
Drawing credit: Art Delcher
17
Clustering
cluster length = Σ m_i (sum of the match lengths in the cluster)
gap distance = C
indel factor = |B - A| / B or |B - A|
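A simplified sketch of the clustering step, assuming matches are chained greedily in reference order. This is not MUMmer's actual clustering code. The parameter names and values are hypothetical, and the gap and indel measures are one plausible reading of the quantities A, B and C in the figure.

```python
def cluster_matches(matches, max_gap=10, max_indel=5, min_cluster=50):
    """Group (ref_pos, qry_pos, length) matches into clusters of consistent neighbours.

    Two consecutive matches are joined when the gap between them is at most
    max_gap in both sequences and the two gaps differ by at most max_indel.
    Clusters whose summed match length is below min_cluster are dropped.
    """
    clusters, current = [], []
    for match in sorted(matches):
        if current:
            prev_i, prev_j, prev_len = current[-1]
            i, j, _ = match
            gap_r = i - (prev_i + prev_len)  # unmatched bases in the reference
            gap_q = j - (prev_j + prev_len)  # unmatched bases in the query
            consistent = (0 <= gap_r <= max_gap and 0 <= gap_q <= max_gap
                          and abs(gap_q - gap_r) <= max_indel)
            if not consistent:
                clusters.append(current)
                current = []
        current.append(match)
    if current:
        clusters.append(current)
    return [c for c in clusters if sum(length for _, _, length in c) >= min_cluster]
```

With the illustrative input `[(0, 0, 20), (25, 27, 20), (50, 52, 30), (500, 100, 40)]` the first three matches form one cluster of total length 70, and the isolated fourth match is discarded as too short.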
18
Extending
R
score 70
Q
19
Banded alignment
0
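The slide gives only the title, so the following is a generic sketch of banded dynamic programming rather than MUMmer's modified Smith-Waterman. It computes a global alignment score while filling only cells within `band` of the main diagonal. The scoring values and the function name are illustrative assumptions.

```python
def banded_score(a, b, band=3, match=1, mismatch=-1, gap=-1):
    """Global alignment score of a and b using only cells with |i - j| <= band."""
    n, m = len(a), len(b)
    if abs(n - m) > band:
        raise ValueError("band too narrow to reach the final cell")
    neg = float("-inf")  # marks cells outside the band
    prev = [neg] * (m + 1)
    for j in range(min(m, band) + 1):
        prev[j] = j * gap  # first row: leading gaps in a
    for i in range(1, n + 1):
        cur = [neg] * (m + 1)
        for j in range(max(0, i - band), min(m, i + band) + 1):
            if j == 0:
                cur[j] = i * gap  # first column: leading gaps in b
                continue
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap
            left = cur[j - 1] + gap
            cur[j] = max(diag, up, left)
        prev = cur
    return prev[m]
```

Restricting the matrix to a band reduces the work from O(nm) to roughly O(n × band), which is reasonable when the two sequences are expected to differ by only a few indels.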
20
MUMmer suite
  • mummer
  • exact matching
  • nucmer
  • DNA multi-FastA input
  • whole genome alignment
  • promer
  • DNA multi-FastA input
  • whole genome alignment
  • run-mummer1
  • FastA input
  • global alignment
  • run-mummer3
  • FastA input w/ draft
  • whole genome alignment
  • exact-tandems
  • FastA input
  • exact tandem repeats
  • NUCmer / PROmer utilities
  • mapview
  • alignment plotter
  • draft sequence mapping
  • mummerplot
  • alignment visualization
  • show-coords
  • alignment summary
  • delta-filter
  • alignment filter
  • show-aligns
  • pairwise alignments
  • System utilities
  • gnuplot
  • xfig

21
mummer
  • Primary uses
  • exact matching (seeding)
  • dot plotting
  • Pros
  • very efficient: O(n) time and space
  • 17 bytes per bp of reference sequence
  • E. coli K12 vs. E. coli O157:H7 (5 Mbp each)
  • 17 seconds using 77 MB RAM
  • multi-FastA input
  • Cons
  • exact matches only

22
nucmer
  • Primary uses
  • whole genome alignment and analysis
  • draft sequence alignment
  • Pros
  • multi-FastA inputs
  • well suited for genome and contig mapping
  • convenient helper utilities
  • show-coords, delta-filter, mummerplot
  • Cons
  • lower sensitivity than BLAST (with default
    parameters)

23
WGA example
  • Yersinia pestis CO92 vs. Yersinia pestis KIM
  • High nucleotide similarity, 99.86%
  • Two strains of the same species
  • Extensive genome shuffling
  • Global alignment will not work
  • Highly repetitive
  • Will confuse local alignment (e.g. BLAST)

24
COMMAND: whole genome alignment
  • nucmer -maxmatch CO92.fasta KIM.fasta
  • -maxmatch Find maximal exact matches (MEMs)
  • delta-filter -m out.delta > out.delta.m
  • -m Many-to-many mapping
  • -1 One-to-one mapping
  • show-coords -r out.delta.m > out.coords
  • -r Sort alignments by reference position
  • mummerplot --large --fat out.delta.m
  • --large Large plot
  • --fat Nice layout for multi-fasta files
  • --x11 Default, draw using x11 (--postscript,
    --png)
  • requires gnuplot

26
show-coords output
  • S1: start of the alignment region in the
    reference sequence
  • E1: end of the alignment region in the reference
    sequence
  • S2: start of the alignment region in the query
    sequence
  • E2: end of the alignment region in the query
    sequence
  • LEN 1: length of the alignment region in the
    reference sequence
  • LEN 2: length of the alignment region in the
    query sequence
  • IDY: percent identity of the alignment
  • SIM: percent similarity of the alignment
  • STP: percent of stop codons in the alignment
  • LEN R: length of the reference sequence
  • LEN Q: length of the query sequence
  • COV R: percent alignment coverage in the
    reference sequence
  • COV Q: percent alignment coverage in the query
    sequence
  • FRM: reading frame for the reference and query
    sequence alignments respectively
  • TAGS: the reference and query FastA IDs
    respectively.
  • All output coordinates and lengths are relative
    to the forward strand
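A small sketch of how the length and coverage columns relate to the coordinates. It assumes 1-based inclusive coordinates and that a reverse-strand match appears with S2 greater than E2; the class and property names are hypothetical, and the numbers in the usage note are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    s1: int     # S1
    e1: int     # E1
    s2: int     # S2
    e2: int     # E2
    len_r: int  # LEN R
    len_q: int  # LEN Q

    @property
    def len1(self):
        """LEN 1: aligned length in the reference (inclusive coordinates assumed)."""
        return abs(self.e1 - self.s1) + 1

    @property
    def len2(self):
        """LEN 2: aligned length in the query."""
        return abs(self.e2 - self.s2) + 1

    @property
    def cov_r(self):
        """COV R: percent of the reference sequence covered by this alignment."""
        return 100.0 * self.len1 / self.len_r

    @property
    def cov_q(self):
        """COV Q: percent of the query sequence covered by this alignment."""
        return 100.0 * self.len2 / self.len_q

    @property
    def is_reverse(self):
        """True when the query coordinates run backwards (assumed reverse-strand match)."""
        return self.s2 > self.e2
```

For instance, `Alignment(1, 1000, 2000, 1001, 5000, 4000)` describes a 1000 bp reverse-strand match covering 20% of the reference and 25% of the query.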

27
Comparative assembly
  • Assembly
  • Orient and place sequencing reads
  • Using overlaps and mate-pair information
  • Scaffolding
  • Order and orient draft contigs
  • Using mate-pair information and experimental
    validation
  • Comparative assembly and scaffolding
  • Orient and place reads and contigs
  • Using a reference genome and alignment mapping
  • e.g. AMOScmp (nucmer)

28
Comparative assembly

mate-pairs
physical map
reference genome
homology map

29
Comparative assembly: caveats
Finished
Un-finished
A
B
A
B
A
B
30
COMMAND: nucmer read/contig mapping
  • nucmer -maxmatch REF.fasta QRY.fasta
  • -maxmatch Find maximal exact matches (MEMs)
  • REF Reference sequence (genome)
  • QRY Query sequence to be mapped (reads,
    contigs)
  • delta-filter -q out.delta > out.delta.q
  • -q Best one-to-one mapping for each query
  • show-coords -rcl out.delta.q > out.coords
  • -r Sort alignments by reference
  • -c Display alignment coverage percentage
  • -l Display sequence length

31
Arachne vs. CA, Drosophila virilis: multiple CA
contigs (Y) mapping to a single Arachne contig (X)
9kb insertion
5kb translocation
32
Gene reassembly
  • nucmer -maxmatch genes.fasta reads.fasta
  • delta-filter -q out.delta > out.delta.q
  • show-coords -THqcl out.delta.q > out.coords
  • Define matching criteria
  • identity, coverage, max gap size
  • Assemble matching reads
  • AMOS, minimus, hawkeye
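The "define matching criteria" step could look like the sketch below, which selects reads by identity and query coverage from the tab-delimited coords file. The column order (S1, E1, S2, E2, LEN 1, LEN 2, IDY, LEN R, LEN Q, COV R, COV Q, reference tag, query tag) is an assumption about the `-THqcl` output and should be checked against the MUMmer version in use. The thresholds, function names and sample rows are hypothetical, and the max gap size criterion is not handled.

```python
def parse_coords(lines):
    """Yield one dict per alignment row of headerless, tab-delimited coords output."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 13:
            continue  # skip blank or unexpected lines
        yield {
            "idy": float(fields[6]),    # assumed position of IDY
            "cov_q": float(fields[10]),  # assumed position of COV Q
            "ref": fields[11],
            "query": fields[12],
        }


def matching_reads(rows, min_identity=95.0, min_query_cov=90.0):
    """Return the sorted IDs of reads with at least one alignment passing both thresholds."""
    return sorted({
        row["query"] for row in rows
        if row["idy"] >= min_identity and row["cov_q"] >= min_query_cov
    })


# Made-up rows for illustration only, not real MUMmer output.
sample = [
    "1\t800\t1\t800\t800\t800\t99.50\t1200\t800\t66.67\t100.00\tgeneA\tread1",
    "300\t500\t1\t201\t201\t201\t88.00\t1200\t750\t16.75\t26.80\tgeneA\tread2",
]
selected = matching_reads(parse_coords(sample))
```

The selected read IDs would then be passed to the assembler. In practice the input lines would come from the out.coords file rather than an in-memory list.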

33
References
  • Documentation
  • http://mummer.sourceforge.net
  • publication listing
  • http://mummer.sourceforge.net/manual
  • thorough documentation
  • http://mummer.sourceforge.net/examples
  • walkthroughs
  • Email
  • mummer-help (at) lists.sourceforge.net
  • mummer-users (at) lists.sourceforge.net

35
Acknowledgements
Art Delcher
Steven Salzberg
Mihai Pop
Stefan Kurtz
Mike Schatz