BioPerl: MUMmer - PowerPoint PPT Presentation

Provided by: Nike73
Transcript and Presenter's Notes

1
BioPerl: MUMmer
  • Jason Switzer
  • Joshua Wu
  • Aimee Seufer

2
Agenda
  • Background
  • BioPerl
  • MUMmer
  • Algorithm/Data Structure
  • Suffix Trees (implicit vs explicit)
  • Examples
  • Limitations for BioPerl
  • Bio::AlignIO::mummer

3
What is BioPerl?
  • A project (developed by volunteer engineers)
  • Goal: collect computational methods routinely
    used in bioinformatics
  • Tools for computational molecular biology
  • Bioinformatics toolkit for format conversion,
    report processing, data manipulation, sequence
    analysis, batch processing and more
  • Open source
  • http://bioperl.org/
  • Collection of modules (1450)

4
Example
5
Example
6
What is MUMmer?
  • MUMmer: named after the maximal unique match (MUM)
  • Definition: a suffix tree algorithm designed to
    find maximal exact matches of some minimum
    length between two input sequences.
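The definition above can be illustrated with a brute-force sketch. This is not how MUMmer works internally (it uses a suffix tree); the function names, the 0-based positions and the test sequences are all assumptions made for illustration. A match is treated as maximal when it cannot be extended to the left or right, and as unique when its substring occurs exactly once in each sequence.

```python
def maximal_exact_matches(ref, qry, min_len):
    """Return (ref_pos, qry_pos, length) for every maximal exact match.

    Brute force for illustration only; positions are 0-based.
    """
    matches = []
    for i in range(len(ref)):
        for j in range(len(qry)):
            if ref[i] != qry[j]:
                continue
            # Skip pairs that can be extended to the left: not left-maximal.
            if i > 0 and j > 0 and ref[i - 1] == qry[j - 1]:
                continue
            # Extend to the right for as long as the characters agree.
            k = 0
            while (i + k < len(ref) and j + k < len(qry)
                   and ref[i + k] == qry[j + k]):
                k += 1
            if k >= min_len:
                matches.append((i, j, k))
    return matches


def unique_matches(ref, qry, matches):
    """Keep matches whose substring occurs exactly once in each sequence."""
    def occurrences(text, pattern):
        return sum(text.startswith(pattern, p) for p in range(len(text)))

    return [
        (i, j, k)
        for i, j, k in matches
        if occurrences(ref, ref[i:i + k]) == 1
        and occurrences(qry, ref[i:i + k]) == 1
    ]


ref, qry = "acgtacgga", "ttacggc"
all_matches = maximal_exact_matches(ref, qry, min_len=3)
print(all_matches)                            # [(0, 2, 3), (3, 1, 5)]
print(unique_matches(ref, qry, all_matches))  # [(3, 1, 5)]
```

The match "acg" is dropped by the uniqueness filter because it occurs twice in the reference, while "tacgg" occurs once in each sequence.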

7
Achievements
  • Suffix tree: a very efficient data structure
      • constructed and searched in linear time
      • ideal for large-scale pattern matching
  • Memory usage depends only on the reference
    sequence
  • Finds the maximal unique matches in the data set
  • How efficient? It finds all 20 base pair maximal
    exact matches between two 5 million base pair
    bacterial genomes in 20 seconds, using 90 MB of
    RAM, on a typical 1.7 GHz Linux desktop computer

8
Why Suffix Trees?
  • "Suffix trees are widely used in the computer
    field... Recent improvements in the method have
    cut the memory requirement to 17 bytes per
    letter, which brings the method to the verge of
    practicality for bioinformatics applications"
    -- Nat Goodman (Genome Technology).
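As a rough worked example, the quoted 17 bytes per letter can be set against the 5 million base pair genomes mentioned on the Achievements slide. This assumes the suffix tree is built over a single reference genome and that the quoted figure applies to it; neither is stated in the slides, and the function name is hypothetical.

```python
BYTES_PER_LETTER = 17          # figure from the quotation above
REFERENCE_LENGTH = 5_000_000   # assumed: one 5 million bp reference genome


def suffix_tree_megabytes(n_letters, bytes_per_letter=BYTES_PER_LETTER):
    """Estimated suffix tree size in MB (1 MB = 1,000,000 bytes)."""
    return n_letters * bytes_per_letter / 1_000_000


print(suffix_tree_megabytes(REFERENCE_LENGTH))  # 85.0
```

Under these assumptions the estimate is 85 MB, the same order of magnitude as the 90 MB of RAM reported for the bacterial genome comparison.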

9
Introduction
  • Any string of length m can be decomposed into m
    suffixes, and these suffixes can be stored in a
    suffix tree.
  • Setup time: O(m), where m is the length of the
    string
  • Search time: O(n), where n is the length of the
    pattern
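A minimal sketch of the idea, using an uncompressed suffix trie built from nested dictionaries. This naive construction takes O(m^2) time and space; the O(m) setup time needs a compressed suffix tree and a linear-time builder such as Ukkonen's algorithm, which is not shown. The search, however, does take one step per pattern character. Function names are assumptions for illustration.

```python
def build_suffix_trie(text):
    """Insert every suffix of text into a nested-dict trie (naive, O(m^2))."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            # Reuse the child for ch if it exists, otherwise create it.
            node = node.setdefault(ch, {})
    return root


def contains(trie, pattern):
    """Follow one edge per pattern character: O(n) for pattern length n."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("cagctcctga")
print(contains(trie, "ctcc"))  # True
print(contains(trie, "gat"))   # False
```

Every substring of the text is a prefix of some suffix, which is why a walk from the root answers substring queries.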

10
(No Transcript)
11
(No Transcript)
12
Sample input: Homo sapiens
  • cagctcctgagactgctggcatgaaggggagccgtgccctcctgctggtg
    gccctcaccctgttctgcatctgccggatggccacaggggaggacaacga
    tgagtttttcatggacttcctgcaaacactactggtggggaccccagagg
    agctctatgaggggaccttgggcaagtacaatgtcaacgaagatgccaag
    gcagcaatgactgaactcaagtcctgcagagatggcctgcagccaatgca
    caaggcggagctggtcaagctgctggtgcaagtgctgggcagtcaggacg
    gtgcctaagtggacctcagacatggctcagccataggacctgccacacaa
    gcagccgtggacacaacgcccactaccacctcccacatggaaatgtatcc
    tcaaaccgtttaatcaataa

13
Sample result
14
Sample input 2: plants
  • EARPIVVGPPPPLSGGLPGTENSDQARDGTLPYTKDRFYLQPLPPTEAAQ
    RAKVSASEILNVKQFIDRKAWPSLQNDLRLRASYLRYDLKTVISAKPKDE
    KKSLQELTSKLFSSIDNLDHAAKIKSPTEAEKYYGQTVSNINEVLAKLG

15
Sample output
16
Comparisons: Homo sapiens
17
Chicken
18
Sample input: Chicken
  • RVKRVWPLVIRTVIAGYNLYRAIKKK

19
Chicken
20
Investigation
  • Explicit suffix trees require more space than
    implicit suffix trees on real data.
  • Implicit trees should be preferred when storage
    must be kept small
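One way to see the size difference is to count leaves. The sketch below assumes the usual textbook definitions: an explicit suffix tree is built on the text plus a unique terminator, so every suffix ends at its own leaf, while an implicit suffix tree has no terminator, so a suffix that is a prefix of a longer suffix ends inside the tree. The terminator "$", the function name and the example string are assumptions for illustration.

```python
def leaf_count(text):
    """Number of leaves in the suffix tree of text.

    A suffix ends at its own leaf unless it is a prefix of a longer suffix.
    """
    suffixes = [text[i:] for i in range(len(text))]
    return sum(
        1
        for s in suffixes
        if not any(len(t) > len(s) and t.startswith(s) for t in suffixes)
    )


text = "xabxa"
implicit_leaves = leaf_count(text)        # no terminator
explicit_leaves = leaf_count(text + "$")  # "$" assumed absent from text
print(implicit_leaves, explicit_leaves)   # 3 6
```

In "xabxa" the suffixes "xa" and "a" are prefixes of longer suffixes, so the implicit tree has three leaves, while the explicit tree has one leaf for each of its six suffixes.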

21
Limitations on development
  • No unified output format for all tools
  • Tools available: mummer, repeat-match,
    exact-tandems, gaps, mgaps, nucmer, promer,
    run-mummer1, run-mummer3, show-aligns,
    show-coords, show-snps, show-tiling
  • Variable command line options
  • Poor documentation
  • Not user-friendly

22
Difficulties With BioPerl
  • Extensive Framework
      • everything from I/O utilities to BLAST
  • Aging Codebase
      • lots of copy-and-paste
      • old coding techniques data
  • Few Developers
      • few core developers maintaining the toolkit
      • few people understand the
  • Uncommon Perl Practices
      • much derision over practices such as tied
        handles and AUTOLOAD

23
Common BioPerl Objects
  • Bio::SeqIO
      • reads/writes sequence files (e.g. GenBank)
      • fully symmetric converter (between various
        formats)
      • lots of documentation
  • Bio::Seq
      • stores various sequences (Bio::Seq::RichSeq)
      • used primarily with Bio::SeqIO
  • Bio::AlignIO
      • reads/writes alignment files (e.g. FASTA)
      • not fully symmetric (between formats)
      • significantly less documentation
  • Bio::LocatableSeq
      • stores locatable sequence data (within
        another sequence)
      • used primarily with Bio::AlignIO

24
Example - Input Data
25
Example - Code
26
THANK YOU FOR LISTENING