Title: Agenda
1 Integrated Microbial Genomes (IMG) System
A Case Study in Biological Data Management
- Different views on biological data management (VLDB 2004 Panel on Biological Data Management)
- Computer Scientists
- Source of problems for database research
- Publication in database papers
- Prototypes
- Biologists
- Vehicle for rapid data analysis
- Publication in biology papers
- Immediate solutions
Victor M. Markowitz, Frank Korzeniewski, Krishna Palaniappan, Ernest Szeto
Biological Data Management Technology Center, Lawrence Berkeley National Lab
Nikos C. Kyrpides, Natalia N. Ivanova
Microbial Genome Analysis Program, Joint Genome Institute
2 Biological Data Management Problem
- Effective data analysis
- involves combining data from multiple sources
- single data type: data generation and collection
- multiple data types: data association
- in the context of inherently imprecise data
3 Background: Microbial Genomes
- Applications
- Healthcare, environmental cleanup, agriculture,
industrial processes, alternative energy
production
4 Microbial Genome Data Analysis Context
5 Data Analysis Example: Occurrence Profiles
- Key Challenges
- Representing abstract concepts with experimental data
- Specifying individual and composite operations
- Data coherence, completeness, integration
6 Microbial Genomes: Data Generation and Collection
- Process
- Raw data
- Small DNA sequence fragments
- Assembled sequence fragments (contigs)
- Complete (one contiguous) sequence
- Interpreted data
- Gene prediction (models)
- Functional prediction (annotations)
- Expert data validation (cleaning)
- Expert annotations
- Key Challenges
- Diversity of data sources
- Differences in models, depth/breadth of annotations
- Consistency of the data transformation process
7 Data Transformation Process Example
Microbial Genome Annotation Pipeline (ORNL)
[Figure: pipeline diagram; surviving labels: Sequence Data Files, Fetch, ORF Calling, Preliminary Functional Annotation, Annotation Data Files, Post]
8 Microbial Genomes: Data Association
- Key Challenges
- Data quality/precision for different types of data, sources
- Transience of identifiers, relationships
9 Biological Data Management Problem Revisited
- Effective data analysis involves
- combining data from multiple sources
- in the context of inherently imprecise data
- while addressing
- Data quality
- Data semantics, precision, integrity, provenance
- System quality
- Comprehensibility, performance, reliability, scalability
- Development strategy
- Choice of technologies
- Devising (cost, time) effective solutions
10 Needed: System Development Framework
[Figure: system development framework diagram; surviving label: Deploy System]
11 Requirement Analysis Example: IMG Data Analysis
Find unique genes in a genome of interest G0 with respect to related genomes G1, ..., Gk
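The requirement above can be read as a set-difference query. The following is a minimal sketch, not the IMG implementation: it assumes every gene has already been assigned to a homolog cluster, and the input layout, the function name `unique_genes`, and the toy identifiers are all hypothetical. How IMG actually decides that two genes are related is not stated in these slides.

```python
from typing import Dict, Iterable, List

# Hypothetical input shape: genome name -> {gene id -> homolog cluster id}.
GenomeGenes = Dict[str, Dict[str, str]]


def unique_genes(genomes: GenomeGenes, target: str, related: Iterable[str]) -> List[str]:
    """Return genes of `target` whose cluster occurs in none of the `related` genomes."""
    # Collect every cluster seen in any related genome.
    seen_elsewhere = set()
    for name in related:
        seen_elsewhere.update(genomes[name].values())
    # Keep target genes whose cluster is absent from that set.
    return sorted(
        gene for gene, cluster in genomes[target].items()
        if cluster not in seen_elsewhere
    )


# Toy data, invented for illustration only.
toy = {
    "genome_0": {"g0_a": "c1", "g0_b": "c2", "g0_c": "c3"},
    "genome_1": {"g1_a": "c1"},
    "genome_2": {"g2_a": "c2", "g2_b": "c4"},
}
result = unique_genes(toy, "genome_0", ["genome_1", "genome_2"])
print(result)  # ['g0_c']
```

In this toy case only `g0_c` survives, because its cluster `c3` appears in neither related genome. The result depends entirely on how clusters are assigned, which is where the imprecision of the underlying data enters.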
12 Data Model Abstraction
- Motivation
- Adds precision
- Allows reasoning in an established framework
- Analogies to traditional data domain
- Biological data modeling
- Data warehouse concepts
- Proven technology for large scale biological data management applications
- Data Structure
- Multidimensional data space
- Gene, genome, function/pathway
- Operations
- Multidimensional space selections, projections, aggregations
- Slice and dice, roll-up, drill-down analogies
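The data warehouse analogy can be made concrete with a small sketch. It assumes a flat list of (gene, genome, function) facts; the `Fact` type, the field names, and the toy values are hypothetical and are not taken from the IMG schema.

```python
from collections import Counter
from typing import List, NamedTuple


class Fact(NamedTuple):
    """One cell of the gene x genome x function space (hypothetical layout)."""
    gene: str
    genome: str
    function: str


# Toy facts, invented for illustration only.
facts: List[Fact] = [
    Fact("g1", "genome_A", "transport"),
    Fact("g2", "genome_A", "transport"),
    Fact("g3", "genome_A", "motility"),
    Fact("g4", "genome_B", "transport"),
]

# Slice: fix one dimension (genome) to a single value.
slice_a = [f for f in facts if f.genome == "genome_A"]

# Projection: keep only the function dimension of the slice.
functions_in_a = sorted({f.function for f in slice_a})

# Roll-up: aggregate away the gene dimension, counting genes per (genome, function).
rollup = Counter((f.genome, f.function) for f in facts)

# Drill-down: go back from an aggregated cell to the genes behind it.
drill = [f.gene for f in facts if (f.genome, f.function) == ("genome_A", "transport")]

print(functions_in_a)                      # ['motility', 'transport']
print(rollup[("genome_A", "transport")])   # 2
print(drill)                               # ['g1', 'g2']
```

Each operation is a selection, projection, or aggregation over the same fact list, which is the sense in which the slide's analogy to slice and dice, roll-up, and drill-down holds.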
13 Data Model Abstraction Example: IMG Data Model
14 Data Model Abstraction Example: IMG Operations
[Figure: IMG operations diagram; dimension labels: Genes, Genomes, Functions/Pathways]
15 Data Analysis Example: Searching for Unique Genes
[Screenshot annotations: "parasite in horses"; "Causes human disease in tropical areas (melioidosis)"]
16 Identifying Unique Genes of Interest
Genes involved in adherence and invasion
17 Exploring Unique Gene Details
18 Summary
- Needed
- Effective solutions for academic biological data management
- Employing appropriate technologies and methods
- Developed within (time, cost) constraints
- IMG Case Study
- System development process framework essential for
- Continuously evolving content
- aiming at coherence, completeness
- Developing meaningful data analysis tools
- Clarity of methods, parameters, results
- Metric for success
- Community adoption and support
- Increase in analysis productivity and value
19 Summary
- Biological Data Management in Academic Settings
- Problems discussed in numerous forums since 1990
- Tools, techniques: poorly understood and used
- Potential Causes
- Biologists have been ineffective in the care and feeding of databases that now extends to poor maintenance of genomics databases (American Academy of Microbiology Report, 2002)
- Computer scientists in pursuit of insignificant or misunderstood problems (Bio Data Management Workshop, 2003)
- Have little interest in tedious, repetitive data management tasks
- Diminished responsibility for biological databases ... is correlated with lack of enthusiasm for funding these efforts (AAM Report, 2002)
- Poor industry support