Title: Working with thousands of jigsaw puzzles
1 Software Infrastructure for High-Performance
Metagenomics Annotation
- "Working with thousands of jigsaw puzzles"
by Brandon Sutherlin
2 What is Metagenomics?
- Metagenomics (also called Environmental Genomics or
Community Genomics) is the study of genomes
recovered from environmental samples without the
need to culture the organisms first
- Metagenomic data are processed using bioinformatics
tools
3 What is Bioinformatics?
- There is no universally accepted definition of
bioinformatics.
- 2 definitions
- Classical Bioinformatics: Any use of computers
to store, compare, retrieve, analyze, simulate or
predict the composition or the structure of
biologically based molecules (DNA, RNA, protein,
etc.)
- New Bioinformatics: Bioinformatics derives
knowledge from computer analysis of biological
data. These can consist of the information stored
in the genetic code, but also experimental
results from various sources, patient statistics,
and scientific literature.
- Other fields are closely related and incorporated
in the new bioinformatics (eco-informatics,
medical informatics, quantitative biology, etc.)
4 Why is Metagenomics Important?
- All reasons lead to more knowledge.
- Organisms can be studied directly in their
environments, bypassing the need to isolate each
species
- There are significant advantages for viral
metagenomics, because of difficulties cultivating
the appropriate host
- Genomic information has advanced research in a
diverse array of fields, including forensic
science and biomedical research
5 Whole Genome Shotgun Sequencing for Metagenomics
- One genome: random genome fragmentation, then
genome assembly using overlaps
- Multiple genomes: random fragmentation of the
genomes, then assembly of the genomes using overlaps
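The "assembly using overlaps" step can be illustrated with a toy greedy merger. This is only a sketch of the general idea, not the method used by any real assembler or by this project: the function names, the `min_len` parameter, and the fragments are made up for illustration, and real reads also need error tolerance, reverse complements, and handling of fragments contained inside others.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is also a prefix of b.

    Returns 0 when the longest such overlap is shorter than min_len.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the pair of contigs with the largest overlap."""
    contigs = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while True:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]


# One genome: the fragments collapse into a single contig.
one_genome = greedy_assemble(["ATGCGTAC", "GTACGTTA", "GTTAGC"])

# Multiple genomes: fragments from unrelated genomes stay in separate contigs.
two_genomes = greedy_assemble(
    ["ATGCGTAC", "GTACGTTA", "GTTAGC", "CCCAAA", "AAATTT"]
)
```

With a mixed sample, the number of contigs that remain depends on how many fragments share no overlap, which is one reason metagenomic data are harder to assemble than a single genome.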
6 Many projects, many fragments
- Many different projects are now completed or in
progress
- Examples
- Prokaryote
- Sargasso Sea (Venter et al. 2004): 1.6 billion
base pairs generated, estimated to come from 1800
genomic species
- Viral
- Marine water (Breitbart et al. 2002): Mission Bay
and Scripps Pier. 873 sequences for Mission
Bay and 1061 for Scripps Pier, with more than 65%
and 73% of them unknown, respectively
7 Many projects, many fragments
- Three years after the Marine Water project, most
of the sequences are still unique, despite the fact
that GenBank has more than doubled in size.
- All of the metagenome projects have generated
enormous amounts of data that still cannot be
assembled or annotated.
8 Bioinformatics
- What bioinformatics tools are available now? The
focus is on viral metagenomics
- Assembly
- Annotation
- Diversity and structure prediction
9 Focus on Annotation
- Tools and methods available
- BLAST (Basic Local Alignment Search Tool): The
most popular tool for the annotation of the
fragments
- Problems
- A large portion of the sequences do not have hits
in the databases
- An initial BLAST may offer clues, not annotation
10 Our Specific Application
- Progressive Comparisons
- Based on criteria (E-value) selected by the
user, the input sequence can be compared to other
databases and/or BLASTed again with translation
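The E-value decision can be sketched as follows. This is a hedged illustration, not the project's code: it assumes BLAST results in the standard 12-column tabular layout (query ID in column 1, E-value in column 11), the function names and the `1e-5` cutoff are hypothetical, and the sample rows are invented.

```python
import csv
import io


def best_evalues(tabular_text):
    """Map each query ID to its smallest E-value in tabular BLAST output."""
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, evalue = row[0], float(row[10])
        best[query] = min(evalue, best.get(query, float("inf")))
    return best


def split_by_evalue(query_ids, best, cutoff=1e-5):
    """Separate queries with an acceptable hit from those to compare further."""
    accepted = [q for q in query_ids if best.get(q, float("inf")) <= cutoff]
    forwarded = [q for q in query_ids if q not in accepted]
    return accepted, forwarded


# Invented sample rows: seq1 has a strong hit, seq2 a weak one, seq3 none.
SAMPLE = "\n".join([
    "seq1\tphageA\t98.5\t200\t3\t0\t1\t200\t10\t209\t1e-50\t380",
    "seq1\tphageB\t80.0\t150\t30\t0\t1\t150\t5\t154\t2e-10\t90.1",
    "seq2\tphageC\t35.0\t60\t39\t0\t1\t60\t1\t60\t0.5\t28.0",
])

best = best_evalues(SAMPLE)
accepted, forwarded = split_by_evalue(["seq1", "seq2", "seq3"], best)
```

Sequences in `forwarded` are the ones that would go on to another database or to a translated search.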
11 Our Specific Application
[Diagram not recoverable; surviving label: DB8]
12 Our Specific Application (contd)
[Diagram not recoverable; surviving label: DB8]
- The process tree can be as deep as the user
desires
- The nodes do not communicate directly with each
other.
- Later versions may have more intelligent clients
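A process tree of arbitrary depth can be modeled as a recursive structure. The sketch below is an assumption about how such a tree might be walked, not the project's implementation: the `Node` fields, the rule "stop when a hit meets the cutoff, otherwise try every child", and the `search` callback (which stands in for a real BLAST run) are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    database: str   # database searched at this node
    program: str    # e.g. "blastn" or "blastx"
    cutoff: float   # user-selected E-value threshold
    children: list = field(default_factory=list)


def run_tree(sequence, node, search):
    """Walk the tree from node; return (database, program, E-value) per visit.

    search(sequence, database, program) returns the best E-value,
    or None when there is no hit at all.
    """
    evalue = search(sequence, node.database, node.program)
    visited = [(node.database, node.program, evalue)]
    if evalue is not None and evalue <= node.cutoff:
        return visited  # acceptable hit: no need to go deeper
    for child in node.children:
        visited.extend(run_tree(sequence, child, search))
    return visited


# Hypothetical two-level tree and canned search results for one sequence.
tree = Node("viral_nt", "blastn", 1e-5, [
    Node("viral_protein", "blastx", 1e-3),
    Node("nr", "tblastx", 1e-3),
])
CANNED = {("viral_nt", "blastn"): 0.2, ("viral_protein", "blastx"): 1e-8}


def fake_search(sequence, database, program):
    return CANNED.get((database, program))


trace = run_tree("ACGT", tree, fake_search)
```

Here a single driver walks the tree, which matches the point that the nodes never talk to each other directly.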
13 Executing the Application
- Challenges
- Many sequences, many large databases
- And they are getting larger
- Therefore we need to run this in parallel on
multiple hosts (each with its local storage)
- A bunch of workstations in a lab
- Cluster
- A bunch of PCs over the internet
14 Master-Worker Approach
- Current version has bidirectional communication
between an omniscient master and its faithful
sheep (clients/nodes).
- A French graduate student is developing
algorithms for efficient (optimal?) database
distribution among the workers (in Dr. Casanova's
lab)
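The master-worker pattern can be sketched in a few lines. This is a simplified stand-in, not the project's code: threads and in-memory queues replace sockets and separate hosts, the naive round-robin routing is a placeholder for the database-distribution algorithms mentioned above, and `worker`, `run_master`, and the `search` callback are hypothetical names.

```python
import queue
import threading


def worker(name, inbox, outbox, search):
    """Process (sequence_id, database) jobs until the master sends None."""
    while True:
        job = inbox.get()
        if job is None:  # sentinel: no more work
            break
        sequence_id, database = job
        outbox.put((name, sequence_id, database, search(sequence_id, database)))


def run_master(jobs, worker_dbs, search):
    """Route each job to a worker that stores the database locally.

    jobs: list of (sequence_id, database) pairs
    worker_dbs: {worker name: set of databases stored on that worker}
    """
    # Check up front that every requested database lives on some worker.
    holders = {}
    for _, database in jobs:
        names = sorted(n for n, dbs in worker_dbs.items() if database in dbs)
        if not names:
            raise ValueError(f"no worker holds database {database!r}")
        holders[database] = names

    outbox = queue.Queue()
    inboxes = {name: queue.Queue() for name in worker_dbs}
    threads = [
        threading.Thread(target=worker, args=(name, inboxes[name], outbox, search))
        for name in worker_dbs
    ]
    for t in threads:
        t.start()

    # Naive round-robin among the workers that hold each database.
    for i, (sequence_id, database) in enumerate(jobs):
        names = holders[database]
        inboxes[names[i % len(names)]].put((sequence_id, database))
    for inbox in inboxes.values():
        inbox.put(None)
    for t in threads:
        t.join()

    # All workers have finished, so the outbox holds every result.
    results = []
    while not outbox.empty():
        results.append(outbox.get())
    return results
```

The master is the only party that knows which worker holds which database, so workers only ever receive jobs they can serve from local storage.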
15 Example Process Tree
16 The Nitty Gritty
- The project was developed in Python
- Why?
- The configuration files are XML
- Standardization
- Parsing support
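The "parsing support" point can be shown with Python's standard `xml.etree.ElementTree` module. The configuration schema below is invented for illustration (the element name `node` and the attributes `database`, `program`, and `evalue` are assumptions, as are the database names); the project's real file format is not shown in these slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration: nested <node> elements describe a process tree.
CONFIG = """
<pipeline>
  <node database="viral_nt" program="blastn" evalue="1e-5">
    <node database="viral_protein" program="blastx" evalue="1e-3"/>
    <node database="nr" program="tblastx" evalue="1e-3"/>
  </node>
</pipeline>
"""


def parse_node(element):
    """Convert one <node> element, and its nested nodes, into a dict."""
    return {
        "database": element.get("database"),
        "program": element.get("program"),
        "evalue": float(element.get("evalue")),
        "children": [parse_node(child) for child in element.findall("node")],
    }


def load_tree(xml_text):
    """Return the list of top-level nodes described by the configuration."""
    root = ET.fromstring(xml_text.strip())
    return [parse_node(node) for node in root.findall("node")]


config_tree = load_tree(CONFIG)
```

Because XML nests naturally, a tree of any depth needs no extra syntax in the file and only one recursive function in the parser.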
17 How Well does it Work?
- It is working as designed on localhost
- Ready to be tested on a small cluster
- Is the need met?
- Dr. Poisson's real-life example, The 600
- What about MHPCC resources?
18 My Supervisors for This Project
- Allow me to introduce Guylaine Poisson Ph.D. and
Henri Casanova Ph.D.
- Both outstanding mentors, U.H. faculty, and
foreign nationals
19 Future Application Development
- ICS 675
- GUI
- Checkpoint
- Research
- Release to the Community
20 What Did I Learn?
- Python
- Sockets and Network Programming
- BLAST
- Mac OS X
- Time Management
21 Gratitude in No Particular Order
- Dr. Poisson and Dr. Casanova
- Dr. Brown
- MHPCC
- H.R. People
- All of You
22 Questions or Comments?