Title: What do we need the data for
1. What do we need the data for?
- A rigorous definition of needs will drive the infrastructure requirements
- The types of data will impact the infrastructure requirements; the types of data are driven by the requirements
- Draft sequence vs. completely finished sequence
- Requires different tools for support
- Our guess is that we need completely finished sequence to validate models
- What do the biologists think?
- Will this change over time?
2. Guiding Principles
- Need a new paradigm on data ownership
- Policies should be established up front
- Data owned by worldwide community
- Hierarchies of data
- The full trace data is not required to be released, only the summary data
- Need to decide on an archival policy
- Is it easier to regenerate data than to go back to the trace data?
- Treat integration of data as a separate problem
- Conceptually centralized integration repository
3. Guiding Principles
- Need to define data interfaces
- Up front (before the program is announced) we need:
- The box
- XML? OIL? XML Schema?
- Pick one; the box itself is less important
- Allows us to be internally consistent
- Does not constrain internal representations
4. Guiding Principles
- Need to define data interfaces
- Up front (before the program is announced) we need:
- The mechanism for filling the box
- Structures need to be able to evolve over time in an organized, but fast, way
- Can leverage existing tools, infrastructure, and standards being developed by the individual communities
5. Guiding Principles
- Need to have translator capabilities intrinsic in the infrastructure
- Allows us to tie in to external data
- Increases value to community at large
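
To make the "box" and translator ideas concrete, here is a minimal sketch, assuming XML is the interchange format that gets picked. The record fields, element names, and function names are hypothetical illustrations and are not proposed anywhere in these slides. The point is only that the internal representation (here a plain dict) stays unconstrained, while a thin translator maps it to and from the shared format.

```python
import xml.etree.ElementTree as ET

# Hypothetical internal representation: a plain dict.
# The field names are illustrative assumptions, not a proposed standard.
internal_record = {
    "id": "gene0001",
    "organism": "example organism",
    "sequence": "ATGGCC",
}


def to_interchange(record):
    """Translate an internal record into an XML interchange string."""
    root = ET.Element("record", attrib={"id": record["id"]})
    for field in ("organism", "sequence"):
        child = ET.SubElement(root, field)
        child.text = record[field]
    return ET.tostring(root, encoding="unicode")


def from_interchange(xml_text):
    """Translate an XML interchange string back into an internal record."""
    root = ET.fromstring(xml_text)
    record = {"id": root.get("id")}
    for child in root:
        record[child.tag] = child.text
    return record


xml_text = to_interchange(internal_record)
round_tripped = from_interchange(xml_text)
```

A group with a different internal representation would write its own pair of translators against the same interchange format, which is what allows tying in to external data without forcing anyone to change internal structures.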
6. Guiding Principle
- Success of the project will be judged by how well it is accepted by, and serves, the community at large, including groups beyond the walls of DOE
7. Where do we need investments?
- Integrated databases
- New and improved algorithms
- Need to leverage tools and intellectual output of SciDAC and other efforts in:
- Collaborative computing environments
- Scientific visualization
8. What are the showstoppers?
- Integration (first priority)
- Lack of algorithms
- Current algorithms aren't necessarily applicable
- Integrated data offers lots of opportunities for improved accuracy / new algorithms
- Data analysis
- Specialized data mining algorithms
9. Recommendations
- Address data integration problems now
- Make high performance computing resources available for computational biology
- Develop tools that allow biologists to perform inference
- Ability to frame questions in an intuitive way
- Comparison and analysis capabilities
- Example based queries
10. Recommendations
- Realize that a lot of this is the application of existing CS / Math / Stats techniques, and does not necessarily require research in these disciplines
- There is interesting CS work here, just not all of it is research (although some is)
- Will not get funded under CS grants
- Impact on who should be working on the problems
- The National Labs are good at this interdisciplinary type of work
11. A New Synthesis between Computing and Biology: Laying a Foundation for Understanding Higher Levels of Biological Complexity
Ecological Processes and Populations
Tissue and Organismal Physiology
Cellular Developmental Processes
Functional and Structural BioComplexity
Biochemical Pathways Processes
Function-Structure Relationships
Gene Regulation Pathways
Comprehensive Genome-based Analysis
Gene Expression Networks
Comparative Protein Analysis
Phylogeny Reconstruction
Comparative Sequence Analysis
Genome Comparisons and Synteny
Protein Structure Modeling
Gene Structure Prediction
Protein Sequence Prediction
Gene and Feature Identification
Genome Assembly
Computing and Information Requirements
12. Biological Data