What do we need the data for - PowerPoint PPT Presentation

Provided by: teren90
Transcript and Presenter's Notes

1
What do we need the data for
  • Rigorous needs definition will drive the
    infrastructure requirements
  • The types of data will impact the infrastructure
    requirements; the types of data are themselves
    driven by the requirements
  • Draft sequence vs. completely finished sequence
  • Requires different tools for support
  • Our guess is we need completely finished sequence
    to validate models
  • What do the biologists think?
  • Will this change over time?

2
Guiding Principles
  • Need a new paradigm on data ownership
  • Policies should be established up front
  • Data owned by worldwide community
  • Hierarchies of data
  • Not all of the trace data needs to be released,
    only the summary data
  • Need to decide on archival policy
  • Is it easier to regenerate data than to go back
    to trace data?
  • Treat integration of data as a separate problem
  • Conceptually centralized integration repository

3
Guiding Principles
  • Need to define data interfaces
  • Up front (before the program is announced) we
    need
  • The box
  • XML? OIL? XML Schema?
  • Pick one, box is less important
  • Allows us to be internally consistent
  • Does not constrain internal representations
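
One way to read the "box" on this slide is as a shared exchange format that every group writes to and reads from, while keeping whatever internal representation it prefers. The sketch below is only an illustration in Python using XML. The element names (`record`, `organism`, `sequence`), the `status` attribute, and the `GeneRecord` class are hypothetical and do not come from the presentation.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


# Hypothetical internal representation; each group is free to choose its own.
@dataclass
class GeneRecord:
    organism: str
    sequence: str
    finished: bool  # True for finished sequence, False for draft


# Assumed minimal set of fields that every exchanged record must carry.
REQUIRED_FIELDS = ("organism", "sequence")


def to_box(rec: GeneRecord) -> str:
    """Serialize an internal record into the shared XML exchange format."""
    status = "finished" if rec.finished else "draft"
    root = ET.Element("record", status=status)
    ET.SubElement(root, "organism").text = rec.organism
    ET.SubElement(root, "sequence").text = rec.sequence
    return ET.tostring(root, encoding="unicode")


def from_box(xml_text: str) -> GeneRecord:
    """Parse the exchange format, rejecting records that miss required fields."""
    root = ET.fromstring(xml_text)
    if root.tag != "record":
        raise ValueError(f"unexpected root element: {root.tag}")
    values = {}
    for name in REQUIRED_FIELDS:
        child = root.find(name)
        if child is None or not child.text:
            raise ValueError(f"missing required field: {name}")
        values[name] = child.text
    return GeneRecord(
        organism=values["organism"],
        sequence=values["sequence"],
        finished=root.get("status") == "finished",
    )
```

Only `to_box` and `from_box` touch the exchange format, so a group could swap `GeneRecord` for any other internal structure without affecting its partners.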

4
Guiding Principles
  • Need to define data interfaces
  • Up front (before the program is announced) we
    need
  • The mechanism for filling the box
  • Structures need to be able to evolve over time in
    an organized, but fast way
  • Can leverage existing tools and infrastructure
    and standards being developed by the individual
    communities

5
Guiding Principles
  • Need to have translator capabilities intrinsic in
    the infrastructure
  • Allows us to tie in to external data
  • Increases value to community at large
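
Translator capabilities built into the infrastructure could take the form of a registry of small converters, one per external format, that all produce the same common record. The Python sketch below assumes a hypothetical common record with `id` and `sequence` keys. The FASTA parser is deliberately simplified, and the CSV column names are assumptions.

```python
import csv
import io
from typing import Callable, Dict, List

# Hypothetical common record shape used inside the infrastructure.
Record = Dict[str, str]

# Registry mapping an external format name to its translator function.
TRANSLATORS: Dict[str, Callable[[str], List[Record]]] = {}


def translator(fmt: str):
    """Decorator that registers a translator for one external format."""
    def register(func):
        TRANSLATORS[fmt] = func
        return func
    return register


@translator("fasta")
def from_fasta(text: str) -> List[Record]:
    """Simplified FASTA reader: '>' starts a record, other lines are sequence."""
    records: List[Record] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            records.append({"id": header[0] if header else "", "sequence": ""})
        elif records:
            records[-1]["sequence"] += line
    return records


@translator("csv")
def from_csv(text: str) -> List[Record]:
    """Reads CSV with assumed 'id' and 'sequence' columns."""
    reader = csv.DictReader(io.StringIO(text))
    return [{"id": row["id"], "sequence": row["sequence"]} for row in reader]


def translate(fmt: str, text: str) -> List[Record]:
    """Convert external data in the named format into common records."""
    if fmt not in TRANSLATORS:
        raise ValueError(f"no translator registered for format: {fmt}")
    return TRANSLATORS[fmt](text)
```

Adding support for another external source then means registering one more function, without touching the code that consumes the common records.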

6
Guiding Principle
  • Success of the project will be judged by how well
    the project is both accepted by and serves the
    community at large, including those groups
    beyond the walls of DOE

7
Where do we need investments
  • Integrated databases
  • New and improved algorithms
  • Need to leverage tools and intellectual output of
    SciDAC and other efforts in:
  • Collaborative computing environments
  • Scientific visualization

8
What are the showstoppers?
  • Integration (first priority)
  • Lack of algorithms
  • Current algorithms aren't necessarily applicable
  • Integrated data offers lots of opportunities for
    improved accuracy / new algorithms
  • Data analysis
  • Specialized data mining algorithms

9
Recommendations
  • Address data integration problems now
  • Make high performance computing resources
    available for computational biology
  • Develop tools that allow biologists to perform
    inference
  • Ability to frame questions in an intuitive way
  • Comparison and analysis capabilities
  • Example based queries
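
An example-based query lets a biologist hand the system a sample and ask for the stored items most like it. The Python sketch below illustrates the idea with one simple, assumed similarity measure: Jaccard overlap of k-mers (length-k substrings). The function names, the choice of measure, and the toy database are illustrative assumptions, not something specified in the presentation.

```python
from typing import Dict, List, Set, Tuple


def kmers(seq: str, k: int = 3) -> Set[str]:
    """Return the set of all length-k substrings of a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity: size of intersection over size of union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def query_by_example(
    example: str,
    database: Dict[str, str],
    top_n: int = 3,
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Rank stored sequences by k-mer similarity to the example sequence."""
    example_kmers = kmers(example, k)
    scored = [
        (name, jaccard(example_kmers, kmers(seq, k)))
        for name, seq in database.items()
    ]
    # Highest similarity first; ties keep their original database order.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]


# Toy database with made-up sequences, for illustration only.
toy_db = {"a": "ATGCATGC", "b": "TTTTTTTT", "c": "ATGCATGA"}
print(query_by_example("ATGCATGC", toy_db, top_n=2))
```

A production system would use a domain-appropriate similarity measure and an index rather than a linear scan, but the interface ("here is an example, show me similar things") stays the same.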

10
Recommendations
  • Realize that a lot of this is the application of
    existing CS / Math / Stats techniques, and does
    not necessarily require research in these
    disciplines
  • There is interesting CS work here, just not all
    of it is research (although some is)
  • Will not get funded under CS grants
  • Impact on who should be working on the problems
  • The National Labs are good at this
    interdisciplinary type of work

11
A New Synthesis between Computing and Biology: Laying a Foundation for Understanding Higher Levels of Biological Complexity
  • Ecological Processes and Populations
  • Tissue and Organismal Physiology
  • Cellular Developmental Processes
  • Functional and Structural BioComplexity
  • Biochemical Pathways Processes
  • Function-Structure Relationships
  • Gene Regulation Pathways
  • Comprehensive Genome-based Analysis
  • Gene Expression Networks
  • Comparative Protein Analysis
  • Phylogeny Reconstruction
  • Comparative Sequence Analysis
  • Genome Comparisons and Synteny
  • Protein Structure Modeling
  • Gene Structure Prediction
  • Protein Sequence Prediction
  • Gene and Feature Identification
  • Genome Assembly
Computing and Information Requirements

12
Biological Data