"Working with thousands of jigsaw puzzles"

Provided by: guylaine5
Learn more at: http://www.hawaii.edu
1
Software Infrastructure for High-Performance
Metagenomics Annotation
  • "Working with thousands of jigsaw puzzles"

by Brandon Sutherlin
2
What is Metagenomics?
  • Metagenomics (Environmental Genomics or
    Community Genomics) is the study of genomes
    recovered from environmental samples without the
    need to culture the organisms
  • Metagenomic data are processed with
    bioinformatics tools

3
What is Bioinformatics?
  • There is no universally accepted definition of
    bioinformatics.
  • Two definitions:
  • Classical bioinformatics: any use of computers
    to store, compare, retrieve, analyze, simulate or
    predict the composition or the structure of
    biologically based molecules (DNA, RNA, protein,
    etc.)
  • New bioinformatics: bioinformatics derives
    knowledge from computer analysis of biological
    data. These data can consist of the information
    stored in the genetic code, but also experimental
    results from various sources, patient statistics,
    and scientific literature.
  • Other fields are closely related and incorporated
    in the new bioinformatics (eco-informatics,
    medical informatics, quantitative biology, etc.)

4
Why is Metagenomics Important?
  • All reasons lead to more knowledge.
  • Organisms can be studied directly in their
    environments, bypassing the need to isolate each
    species
  • There are significant advantages for viral
    metagenomics, because cultivating the appropriate
    host is often difficult
  • Genomic information has advanced research in a
    diverse array of fields, including forensic
    science and biomedical research

5
Whole Genome Shotgun Sequencing for Metagenomics
  • One genome: random genome fragmentation, then
    genome assembly using overlaps
  • Multiple genomes: random fragmentation of the
    genomes, then assembly of the genomes using
    overlaps
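
The fragment-then-assemble idea on this slide can be illustrated with a toy sketch. It is not a real assembler: it assumes error-free reads from a single strand, and the greedy "merge the largest overlap first" rule is a textbook heuristic, not something the slides specify. The function names and the example sequence are made up for illustration.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two fragments that share the largest overlap."""
    # Drop duplicates and fragments fully contained in another fragment.
    contigs = [f for f in dict.fromkeys(fragments)
               if not any(f != g and f in g for g in fragments)]
    while True:
        best = (0, None, None)  # (overlap length, left index, right index)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            return contigs  # nothing left to merge
        merged = contigs[i] + contigs[j][n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]


# Three overlapping reads from one made-up 15-base "genome".
reads = ["CGTTAGC", "ATGGCGTA", "CGTACGTT"]
contigs = greedy_assemble(reads)
```

With reads drawn from several genomes, the same function would be expected to return several contigs, one per group of reads that overlap each other, which is the situation the "multiple genomes" half of the slide describes.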
6
Many projects, many fragments
  • Many different projects are now completed or in
    progress
  • Examples:
  • Prokaryote:
  • Sargasso Sea (Venter et al. 2004): 1.6 billion
    base pairs generated, estimated to come from 1800
    genomic species
  • Viral:
  • Marine water (Breitbart et al. 2002): Mission Bay
    and Scripps Pier. 873 sequences for Mission Bay
    and 1061 for Scripps Pier, of which more than
    65% and 73%, respectively, were unknown

7
Many projects, many fragments
  • Three years after the Marine Water project, most
    of its sequences are still unique, even though
    GenBank has more than doubled in size.
  • All of the metagenome projects have generated
    enormous amounts of data that still cannot be
    assembled or annotated.

8
Bioinformatics
  • What bioinformatics tools are available now? The
    focus is on viral Metagenomics
  • Assembly
  • Annotation
  • Diversity and structure prediction

9
Focus on Annotation
  • Tools and methods available
  • BLAST (Basic Local Alignment Search Tool): the
    most popular tool for annotating the fragments
  • Problems:
  • A large portion of the sequences have no hits
    in the databases
  • An initial BLAST search may offer clues rather
    than an annotation
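
To make the "no hits" problem concrete, here is a minimal sketch that separates reads with a convincing hit from reads without one. It assumes BLAST+ tabular output (`-outfmt 6`) with its 12 default columns; the E-value cut-off of 1e-5, the function names, and the sample rows are illustrative assumptions, not values from the project.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(tabular_text: str) -> dict:
    """Return the lowest-E-value hit for each query in tabular BLAST output."""
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        hit["evalue"] = float(hit["evalue"])
        query = hit["qseqid"]
        if query not in best or hit["evalue"] < best[query]["evalue"]:
            best[query] = hit
    return best


def split_by_evalue(query_ids, hits, max_evalue=1e-5):
    """Partition query IDs into those with a convincing hit and the rest."""
    annotated, unknown = [], []
    for q in query_ids:
        hit = hits.get(q)
        if hit is not None and hit["evalue"] <= max_evalue:
            annotated.append(q)
        else:
            unknown.append(q)
    return annotated, unknown


# Made-up rows: read1 has two hits, read2 a weak one, read3 none at all.
sample = (
    "read1\tphageA\t98.5\t200\t3\t0\t1\t200\t50\t249\t1e-50\t350\n"
    "read1\tphageB\t80.0\t150\t30\t0\t1\t150\t10\t159\t2e-10\t120\n"
    "read2\tphageC\t60.0\t50\t20\t0\t1\t50\t5\t54\t0.5\t30\n"
)
annotated, unknown = split_by_evalue(["read1", "read2", "read3"], best_hits(sample))
```

Reads that land in `unknown` are the ones a follow-up search against other databases would need to pick up.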

10
Our Specific Application
  • Progressive comparisons
  • Based on criteria (E-value) selected by the
    user, the input sequence can be compared to other
    databases and/or searched again with a translated
    BLAST
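
The progressive comparison can be sketched as a loop over search steps that stops at the first hit meeting the user's E-value criterion. The `search` callable, the step list, and the database names below are hypothetical stand-ins; a real run would call BLAST where the sketch looks up a canned value.

```python
from typing import Callable, Optional

# Hypothetical search callable: (sequence, program, database) -> best E-value,
# or None when the search returns no hit.
SearchFn = Callable[[str, str, str], Optional[float]]


def progressive_compare(sequence, steps, search: SearchFn, max_evalue=1e-5):
    """Try each (program, database) step in order until one hit is good enough."""
    tried = []
    for program, database in steps:
        evalue = search(sequence, program, database)
        tried.append((program, database, evalue))
        if evalue is not None and evalue <= max_evalue:
            return (program, database, evalue), tried
    return None, tried  # every step failed the criterion


# Canned results standing in for real BLAST runs (made-up values).
CANNED = {("blastn", "db_a"): 0.2, ("blastn", "db_b"): None,
          ("tblastx", "db_a"): 1e-20}


def fake_search(sequence, program, database):
    return CANNED.get((program, database))


steps = [("blastn", "db_a"), ("blastn", "db_b"), ("tblastx", "db_a")]
accepted, tried = progressive_compare("ACGTACGT", steps, fake_search)
```

Here the nucleotide searches fail the criterion, so the sequence falls through to the translated search (`tblastx`), which is the "BLAST with translation" case from the slide.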

11
Our Specific Application
[Figure: diagram of the application; recoverable label: DB8]
12
Our Specific Application (contd)
[Figure: diagram of the application; recoverable label: DB8]
  • The process tree can be as deep as the user
    desires
  • The nodes do not communicate directly with each
    other.
  • Later versions may have more intelligent clients
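
One way to picture such a process tree is a recursive structure in which each node is one search and its children receive the sequences that node could not annotate. This routing rule is an assumption made for the sketch; the slides do not spell it out. Nodes never talk to each other here: every result goes into one shared `results` mapping owned by the caller. `Node`, `run_tree`, and the database labels are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    program: str                  # e.g. "blastn" or "tblastx"
    database: str                 # placeholder database label
    max_evalue: float = 1e-5      # user-selected criterion for this node
    children: list["Node"] = field(default_factory=list)


def run_tree(node, sequences, search, results=None):
    """Run `node`, then hand sequences without a good hit to each child."""
    results = {} if results is None else results
    leftover = []
    for seq_id in sequences:
        evalue = search(seq_id, node.program, node.database)
        if evalue is not None and evalue <= node.max_evalue:
            results.setdefault(seq_id, []).append(
                (node.program, node.database, evalue))
        else:
            leftover.append(seq_id)
    for child in node.children:
        run_tree(child, leftover, search, results)
    return results


def depth(node):
    """Number of levels in the tree rooted at `node`."""
    return 1 + max((depth(c) for c in node.children), default=0)


# A three-level tree and canned search results (all values made up).
tree = Node("blastn", "DB1", children=[
    Node("blastn", "DB2"),
    Node("tblastx", "DB1", children=[Node("tblastx", "DB8")]),
])
CANNED_TREE = {("s1", "blastn", "DB1"): 1e-30, ("s2", "blastn", "DB2"): 1e-8,
               ("s2", "tblastx", "DB1"): 1e-6, ("s3", "tblastx", "DB8"): 1e-12}
tree_results = run_tree(tree, ["s1", "s2", "s3"],
                        lambda s, p, d: CANNED_TREE.get((s, p, d)))
```

Because the structure is recursive, adding another level only means nesting another `Node` in a `children` list.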

13
Executing the Application
  • Challenges
  • Many sequences, many large databases
  • And they are getting larger
  • Therefore we need to run this in parallel on
    multiple hosts (each with its local storage)
  • A bunch of workstations in a lab
  • Cluster
  • A bunch of PCs over the internet

14
Master-Worker Approach
  • The current version has bidirectional
    communication between an omniscient master and
    its faithful sheep (clients/nodes).
  • A French graduate student is developing
    algorithms for efficient (optimal?) database
    distribution among the workers (in Dr. Casanova's
    lab)
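
A minimal sketch of a master-worker exchange over sockets follows. It is not the project's protocol: the newline-delimited JSON messages, the message fields, and the function names are assumptions chosen to keep the example short, and `len` stands in for the real search a worker would run. Everything runs on localhost with threads playing the workers.

```python
import json
import queue
import socket
import socketserver
import threading


def make_master(tasks):
    """Create a TCP master that hands out tasks and collects results."""
    pending = queue.Queue()
    for task in tasks:
        pending.put(task)
    results, lock = {}, threading.Lock()

    class Handler(socketserver.StreamRequestHandler):
        def handle(self):
            for line in self.rfile:              # one JSON message per line
                msg = json.loads(line)
                if msg["type"] == "result":      # worker -> master
                    with lock:
                        results[msg["task"]] = msg["value"]
                try:
                    reply = {"type": "task", "task": pending.get_nowait()}
                except queue.Empty:
                    reply = {"type": "stop"}
                self.wfile.write((json.dumps(reply) + "\n").encode())
                if reply["type"] == "stop":      # master -> worker: no work left
                    return

    server = socketserver.ThreadingTCPServer(("127.0.0.1", 0), Handler)
    server.daemon_threads = True
    return server, results


def worker(address, compute):
    """Connect to the master, then loop: receive a task, send back its result."""
    with socket.create_connection(address) as sock, sock.makefile("rw") as stream:
        message = {"type": "ready"}
        while True:
            stream.write(json.dumps(message) + "\n")
            stream.flush()
            reply = json.loads(stream.readline())
            if reply["type"] == "stop":
                return
            message = {"type": "result", "task": reply["task"],
                       "value": compute(reply["task"])}


def run_demo():
    """Start one master and three workers on localhost; return the results."""
    server, results = make_master(["ACGT", "GG", "TTTAA"])
    threading.Thread(target=server.serve_forever, daemon=True).start()
    workers = [threading.Thread(target=worker, args=(server.server_address, len))
               for _ in range(3)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    server.shutdown()
    server.server_close()
    return results
```

The master is the only party that knows the full task list and all results, and workers pull work when they are free, so faster hosts naturally take more tasks.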

15
Example Process Tree
16
The Nitty Gritty
  • The project was developed in Python
  • Why?
  • The configuration files are XML
  • Standardization
  • Parsing support
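
As an illustration of the parsing support Python offers for XML, here is a small sketch using the standard library's `xml.etree.ElementTree`. The element and attribute names (`pipeline`, `node`, `program`, `database`, `evalue`) are a made-up schema, not the project's actual configuration format.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration: nested <node> elements describe the process tree.
CONFIG = """\
<pipeline>
  <node program="blastn" database="DB1" evalue="1e-5">
    <node program="tblastx" database="DB8" evalue="1e-3"/>
  </node>
</pipeline>
"""


def parse_node(element):
    """Turn one <node> element and its nested <node> children into a dict."""
    return {
        "program": element.get("program"),
        "database": element.get("database"),
        "evalue": float(element.get("evalue")),
        "children": [parse_node(child) for child in element.findall("node")],
    }


def load_config(xml_text):
    """Parse the XML text and return the list of top-level nodes."""
    root = ET.fromstring(xml_text)
    return [parse_node(node) for node in root.findall("node")]


config_tree = load_config(CONFIG)
```

Nesting in the XML maps directly onto nesting in the parsed structure, which suits a tree whose depth is up to the user.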

17
How Well does it Work?
  • It is working as designed on localhost
  • Ready to be tested on a small cluster
  • Is the need met?
  • Dr. Poisson's real-life example, The 600
  • What about MHPCC resources?

18
My Supervisors for This Project
  • Allow me to introduce Guylaine Poisson Ph.D. and
    Henri Casanova Ph.D.
  • Both outstanding mentors, U.H. faculty, and
    foreign nationals

19
Future Application Development
  • ICS 675
  • GUI
  • Checkpoint
  • Research
  • Release to the Community

20
What Did I Learn?
  • Python
  • Sockets and Network Programming
  • BLAST
  • Mac OS X
  • Time Management

21
Gratitude in No Particular Order
  • Dr. Poisson and Dr. Casanova
  • Dr. Brown
  • MHPCC
  • H.R. People
  • All of You

22
Questions or Comments?