Information Intensive Breakout

Transcript and Presenter's Notes

1
Information Intensive Breakout 3
  • Jim Myers, Debbie Gracio, Ron Taylor, Larry Rahn,
    Ray Bair, Tom Allison, Karlo Berket, Lawrence
    Buja,

2
Summary
  • Social barriers to collaboratory efforts.
  • How can we help Virtual Organizations (VOs),
    both in setting up and in maintaining their
    information store after a VO goes away?
  • A loosely coupled federation of databases is
    seen emerging, tied together through exposed
    web services and shared ontologies, for
    information sharing.
  • Importance of template and workflow storage.
  • Examples taken from biology and combustion
    communities.

3
  • Journal literature is a hyper-compression of
    data. We need to agree to move to a less
    compressed form that supports metadata
    annotation and explicit tagging of data items.
  • How does a community get to a level where
    knowledge, rather than data, is hosted?
  • Program managers strategically need to encourage
    standards for data formats and data conventions
    where they make sense. It is pure chaos when
    everyone has their own data format, and
    interoperability becomes the bottleneck.
  • Biology has a different funding model (vs., say,
    climate modeling): only the individual lab is
    extracting the value. This is the model today,
    but it is evolving.

4
  • Some experimental raw data is valuable and
    cannot necessarily be recovered by repeating
    the experiment.
  • Model data can be re-run over and over again
    (perhaps improving in the process).
  • The community needs to be involved in the
    electronic standards that support our
    programmatic efforts. Today we search against
    publications rather than against data sets
    publicly available within databases, owing to
    the lack of standards in experimental
    descriptions.
  • The combustion community is stalled because it
    does not know how to share data: there are
    cultural issues around sharing data, as well as
    format issues. Changes will need to occur in the
    community to support better progress toward the
    mission needs of DOE.
  • Basic Energy Sciences needs to become a stronger
    supporter of this; we need to make the
    value-based argument.

5
  • Need to build trust within the science community
    so that the CS community can respond to the
    science community's needs and changes around
    formats and next-generation tools for working
    with the data.
  • In the bio community it is very difficult to
    capture the context of an experiment; there has
    been success with microarrays, and work toward
    standards in proteomics is under way.
  • Is there a general base of information that can
    be pulled out to facilitate capture across
    multiple data sources? Build standards for that
    base, then evolve the standards over time.
  • Experience with the data is very important before
    a standard can be defined. Workshops within
    community organizations in the biology area are
    starting to work towards standards.
  • A portal provides a mechanism to share
    information. What is the future of this
    infrastructure that will allow us to continue
    building communities?

6
  • Portal usage is based on community needs: group
    spaces vs. federated databases (e.g., molecular
    biology).
  • A strawman portal interface is stood up by a
    virtual organization with unique identifiers to
    databases. The virtual org won't own the data,
    which may be archived in a variety of locations
    that expose multiple web-based services that can
    work on the data.
  • Central web service registry (BioMoby based?).
    For biology, perhaps hosted at NCBI. A sketch of
    the registry idea follows this list.
  • Hypothesis: stand-up of a VO is a key
    capability, with branding and control over
    policies, but with little investment in
    services, hardware, etc. The VO would create an
    instance of a portal to have a presence within
    its own community, for little investment.
  • Concept of mingling data across databases: the
    analogy is a SourceForge for data rather than
    for program development. Groups can use common
    web services through local registries.
  • Data sources must have a value independent of
    the technology; there is lasting value in the
    original data. There is a need to maintain the
    provenance of the metadata, data, and model
    results.
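
A minimal sketch, in Python, of the registry idea
above: services are registered by the data types
they consume and produce, so a portal can discover
them without owning the data. This assumes a
BioMoby-style typed catalog; the class names, type
labels, and endpoint URL are hypothetical, not a
real BioMoby API.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class ServiceRecord:
        name: str         # human-readable service name
        endpoint: str     # URL of the exposed web service
        input_type: str   # ontology term for data consumed
        output_type: str  # ontology term for data produced

    class Registry:
        def __init__(self):
            self._services: list[ServiceRecord] = []

        def register(self, record: ServiceRecord) -> None:
            self._services.append(record)

        def find(self, input_type: str,
                 output_type: str) -> list[ServiceRecord]:
            # Discovery by data type: which services accept
            # what I have and produce what I need?
            return [s for s in self._services
                    if s.input_type == input_type
                    and s.output_type == output_type]

    # A VO portal could then locate a sequence-analysis
    # service without hard-coding its location:
    registry = Registry()
    registry.register(ServiceRecord(
        name="sequence-similarity",
        endpoint="https://example.org/services/blast",  # hypothetical
        input_type="DNASequence",
        output_type="AlignmentReport"))
    for svc in registry.find("DNASequence", "AlignmentReport"):
        print(svc.name, svc.endpoint)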

7
  • Strawman concept: a VO written to support
    multiple services, using a similar look and
    feel. Component based, with a dynamic model for
    how it operates, based on policy. Each component
    communicates its own semantics so that
    developers and users don't worry about
    implementations.
  • We can assume that there is a translator between
    different data sources and between different
    ontologies (which may be a no-op); a sketch
    follows this list.
  • Assembly and propagation of ontologies are
    required to make electronic data items usable as
    knowledge. An ontology translation process is
    needed to tie ontologies together.
  • The biology community has multiple groups
    building ontologies. A community-based ontology
    is the end product that all agree to work
    towards. Individual labs need to participate in
    ontology creation and tie data items in their
    local databases to appropriate nodes in the
    appropriate ontologies.
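
A minimal sketch of the translator assumption
above: a common interface for mapping terms
between ontologies, with an identity (no-op)
implementation for sources that already agree. The
class names and terms are illustrative only.

    class OntologyTranslator:
        def translate(self, term: str) -> str:
            raise NotImplementedError

    class NoOpTranslator(OntologyTranslator):
        # When two data sources share an ontology,
        # translation is the identity function.
        def translate(self, term: str) -> str:
            return term

    class TableTranslator(OntologyTranslator):
        # A lookup-table mapping between two ontologies;
        # unknown terms pass through unchanged.
        def __init__(self, mapping: dict[str, str]):
            self._mapping = mapping

        def translate(self, term: str) -> str:
            return self._mapping.get(term, term)

    # Example: tying a local lab vocabulary to a
    # community ontology (terms are made up).
    local = TableTranslator({"gene_chip": "microarray"})
    print(local.translate("gene_chip"))          # microarray
    print(NoOpTranslator().translate("assay"))   # assay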

8
  • Want to be able to define an ontology that works
    for me, and then extend it by adding other
    ontologies that add meaning to my data.
  • Want a suite of software tools that encodes
    assumptions, hypotheses, and results into data
    and then connects with collaborators to share
    knowledge in a usable form.
  • Journals are moving towards standardized data
    formats within the bio community. Journals are
    requiring deposit into public databases, e.g.,
    NCBI, EBI, NIST. How will people access the
    experimental data in such archives?
  • The NIST model requires review/curation of data
    prior to archival, which then also supports
    review for the journal. Data is then hosted in
    XML form within the NIST database. This is
    implemented for one specific community today;
    NIST wants to extend it to more communities.
  • Projected bio future: a huge number of databases
    available on the web will be data mined and
    knowledge mined via web services that are
    available to access each database, with the
    services placed in a centralized registry.
    Published ontologies will be used within
    databases, interpretation of terms will be based
    on ontologies, and a loose federation of
    databases will adopt unique identifiers (e.g.,
    the Life Sciences Identifier). Data items will
    be exchanged across the web, tagged with
    metadata; a sketch of such identified, tagged
    items follows this list.
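
A minimal sketch of the identifier and tagging
ideas above. A Life Sciences Identifier is a URN
of the form
urn:lsid:<authority>:<namespace>:<object>[:<revision>];
the parser and the example identifier below are
illustrative, not a real LSID library.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass(frozen=True)
    class LSID:
        authority: str
        namespace: str
        object_id: str
        revision: Optional[str] = None

    def parse_lsid(urn: str) -> LSID:
        parts = urn.split(":")
        if len(parts) < 5 or parts[:2] != ["urn", "lsid"]:
            raise ValueError(f"not an LSID: {urn}")
        return LSID(parts[2], parts[3], parts[4],
                    parts[5] if len(parts) > 5 else None)

    # A data item exchanged across the web, tagged with
    # metadata keyed to published ontology terms (the
    # identifier and terms below are made up):
    item_id = parse_lsid("urn:lsid:example.org:genbank:AB000001")
    data_item = {
        "id": item_id,
        "metadata": {"organism": "E. coli",
                     "assay": "microarray"},
    }
    print(data_item["id"].authority)  # example.org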

9
  • The importance of a globally unique identifier
    for data items.
  • Microarray public repositories will allow data
    mining to correlate your research data with that
    of other labs. Data mining will be extremely
    valuable, but each database must be designed to
    support it, and correlated ontologies are
    important to support mining across multiple
    databases.
  • Allowing the creation and storage of templates,
    for example for workflows, is important.
  • Capturing workflow would also allow the use of
    algorithms to provide quality control and
    validation; a sketch of template storage with QC
    hooks follows this list.
  • A group's definition of the queries to be posed,
    the ontologies required to support knowledge
    discovery across communities, and the need to
    share workflow definitions and templates of
    procedures: this is a starting point for the
    development of data policies.
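
A minimal sketch of template and workflow storage
with quality-control hooks, as raised above: a
workflow is a stored, named sequence of steps, and
each step can carry a validation check. All names
here are hypothetical.

    from dataclasses import dataclass, field
    from typing import Any, Callable

    @dataclass
    class Step:
        name: str
        run: Callable[[Any], Any]  # the step's computation
        # QC hook; receives the step's output
        validate: Callable[[Any], bool] = lambda _: True

    @dataclass
    class WorkflowTemplate:
        name: str
        steps: list[Step] = field(default_factory=list)

        def execute(self, data: Any) -> Any:
            for step in self.steps:
                data = step.run(data)
                if not step.validate(data):  # QC/validation
                    raise ValueError(
                        f"QC failed at step '{step.name}'")
            return data

    # A stored template can be shared and re-applied
    # across labs:
    template = WorkflowTemplate("normalize-and-threshold", [
        Step("normalize", lambda xs: [x / max(xs) for x in xs]),
        Step("threshold", lambda xs: [x for x in xs if x > 0.1],
             validate=lambda xs: len(xs) > 0),
    ])
    print(template.execute([2.0, 5.0, 10.0]))  # [0.2, 0.5, 1.0]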