Preserving Scientific Data

1
Preserving Scientific Data
  • Jamie Shiers, Information Technology Department,
    CERN, Geneva, Switzerland

2
Agenda
  • Motivation for preserving scientific data:
    examples from a range of sciences
  • Volume of data involved and related issues
  • Some concrete archiving examples from Particle
    Physics
  • Remaining challenges
  • Conclusions

3
Motivation
  • Climate data: in an era when climate change is
    hotly debated, the motivations appear clear
  • Medical data: important for understanding issues
    such as historical pandemics, cross-species
    diseases etc. (avian flu, HIV, ...)
  • Cosmological data: plays a vital role in our
    evolving understanding of the Universe; the
    astrophysics community has an explicit policy
    (data is made public after 1 year; data volume
    doubles each year)
  • Particle Physics data: similar arguments. Will
    we ever be able to build accelerators similar to
    those of today? If we lose this data, what of
    our scientific heritage? We need to look at old
    data for a signal that should have been seen
    (this has happened several times)

4
  • Standard Cosmology: good model from 0.01 sec
    after the Big Bang, supported by considerable
    observational evidence
  • Elementary Particle Physics: from the Standard
    Model into the unknown, towards energies of 1 TeV
    and beyond (the Terascale)
  • Towards Quantum Gravity: from the unknown into
    the unknown...
  • http://www.damtp.cam.ac.uk/user/gr/public/bb_history.html

5
Issues
  • How much data is involved?
  • Preserving the bits
  • Understanding the bits

6
How much data is involved?
  • In 1998, the following estimates were made
    regarding the data from LEP (1989-2000) that
    should be kept:

Experiment   Analysis dataset   Reconstructable dataset
ALEPH        250 GB             1-2 TB
DELPHI       2-6 TB (only one figure given)
L3           500 GB             5 TB
OPAL         300 GB             1-2 TB
  • By today's standards, these data volumes are
    trivial
  • Even though the total volume of data at the LHC
    is much, much higher, the data that must be kept
    beyond the life of the machine (2007 to 2020)
    will be easily handled by then
  • The LHC will generate some 15 PB of data per year!
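
To put these figures side by side, here is a rough sketch of the arithmetic. It assumes decimal units (1 TB = 1000 GB, 1 PB = 1000 TB) and counts the single DELPHI figure once; the names `LEP_ESTIMATES_TB`, `lep_total_tb` and `lhc_year_vs_lep` are illustrative and not part of the talk.

```python
# Estimates in TB as (low, high) ranges, transcribed from the table above.
# Assumptions: decimal units (1 TB = 1000 GB, 1 PB = 1000 TB), and the
# single DELPHI figure is counted once because its column is not specified.
LEP_ESTIMATES_TB = {
    "ALEPH": [(0.25, 0.25), (1.0, 2.0)],
    "DELPHI": [(2.0, 6.0)],
    "L3": [(0.5, 0.5), (5.0, 5.0)],
    "OPAL": [(0.3, 0.3), (1.0, 2.0)],
}

LHC_TB_PER_YEAR = 15 * 1000  # "some 15 PB of data per year"


def lep_total_tb():
    """Return the (low, high) total of all LEP estimates in TB."""
    ranges = [r for per_experiment in LEP_ESTIMATES_TB.values() for r in per_experiment]
    low = sum(lo for lo, _ in ranges)
    high = sum(hi for _, hi in ranges)
    return low, high


def lhc_year_vs_lep():
    """Return the (smallest, largest) ratio of one LHC year to the LEP total."""
    low, high = lep_total_tb()
    return LHC_TB_PER_YEAR / high, LHC_TB_PER_YEAR / low


if __name__ == "__main__":
    low, high = lep_total_tb()
    print(f"LEP total to keep: {low:.2f} to {high:.2f} TB")
    ratio_min, ratio_max = lhc_year_vs_lep()
    print(f"One LHC year is about {ratio_min:.0f} to {ratio_max:.0f} times larger")
```

Under these assumptions the LEP total comes to roughly 10 to 16 TB, so a single year of LHC running is on the order of a thousand times larger than the whole LEP archive.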

7
The LHC machine - Overview
[Figure: overview of the LHC machine]

8
The size of HEP detectors
[Figure: the ATLAS and CMS detectors shown alongside Bld. 40 for scale]

9
Understanding the bits
  • In the mid-1990s, 10-year-old data from the JADE
    collaboration at the PETRA accelerator at DESY
    was successfully re-analysed
  • A sub-set of the data was found abandoned in an
    office corner. The programs to read the data were
    in an obsolete language and were unusable. The
    data format was proprietary (but de-codable).
  • This provided valuable input into the LEP data
    archive
  • Data format: will this be readable in 5 / 10 /
    100 years? 1000?
  • Programs: languages / operating systems /
    hardware platforms have very short life-spans
    with respect to an archive
  • Metadata: essential to understand what the data
    means
  • The best solution to date is a so-called "museum"
    system, but this is still a very short-term
    solution with respect to even Einstein, let alone
    Tycho Brahe, Kepler and Newton
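
One common way to keep a description of the data next to the data itself is a plain-text "sidecar" file. The sketch below is only an illustration of that idea, not the approach used for the LEP archive: the function names, the `.meta.json` suffix and the field layout are all hypothetical, and it relies only on the Python standard library.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_sidecar(data_path: Path, format_description: str, fields: dict) -> dict:
    """Collect what a future reader needs: the bits (size, checksum),
    the format, and the meaning and units of each field."""
    return {
        "file": data_path.name,
        "size_bytes": data_path.stat().st_size,
        "sha256": sha256_of(data_path),
        "format": format_description,
        "fields": fields,
    }


def write_sidecar(data_path: Path, sidecar: dict) -> Path:
    """Write the description as human-readable JSON next to the data file."""
    target = data_path.with_name(data_path.name + ".meta.json")
    target.write_text(json.dumps(sidecar, indent=2, sort_keys=True), encoding="utf-8")
    return target
```

A call might look like `write_sidecar(p, build_sidecar(p, "little-endian float32 triples", {"px": {"unit": "GeV", "meaning": "momentum, x component"}}))`, where the format string and field names are made up for illustration. Such a file records what is known at writing time; it does not solve the harder problem of capturing detailed knowledge of the experiment.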

10
Preserving the bits
  • Lifetimes of Particle Physics experiments are
    extremely long! Currently measured in decades
  • Ironically, one of the solutions proposed for the
    LEP data archive (the then-current proposal for
    the LHC) was later abandoned for technical /
    commercial reasons
  • This necessitated a triple migration: of 300 TB
    of data between storage media, of the same data
    from one data format to another, and of the
    accompanying processing codes
  • In the end, the exercise took around 2 months per
    100TB of data migrated, as well as a significant
    amount of effort (1 FTE / 100TB) and hardware
    resources
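
The rates quoted above can be turned into a back-of-the-envelope estimate. This is a sketch under stated assumptions: batches are migrated one after another rather than in parallel, a month is an average calendar month, 1 TB is 10^12 bytes, and "1 FTE / 100TB" is carried through in the units given on the slide. The names `MigrationRate` and `estimate_migration` are illustrative.

```python
from dataclasses import dataclass

# Average calendar month in seconds (an assumption, used only for throughput).
SECONDS_PER_MONTH = 365.25 / 12 * 24 * 3600


@dataclass(frozen=True)
class MigrationRate:
    months_per_100tb: float = 2.0  # "around 2 months per 100TB"
    fte_per_100tb: float = 1.0     # "1 FTE / 100TB"


def estimate_migration(volume_tb: float, rate: MigrationRate = MigrationRate()) -> dict:
    """Scale the quoted rates linearly to a given volume (sequential batches)."""
    units = volume_tb / 100.0
    seconds_per_100tb = rate.months_per_100tb * SECONDS_PER_MONTH
    return {
        "months": units * rate.months_per_100tb,
        "fte": units * rate.fte_per_100tb,
        # Sustained rate implied by moving 100 TB (1 TB = 1e12 bytes) in that time.
        "throughput_mb_per_s": 100.0 * 1e12 / seconds_per_100tb / 1e6,
    }


if __name__ == "__main__":
    result = estimate_migration(300)
    print(f"{result['months']:.0f} months, {result['fte']:.0f} FTE units, "
          f"about {result['throughput_mb_per_s']:.0f} MB/s sustained")
```

For the 300 TB mentioned above this gives about 6 months if the batches run sequentially, with an implied sustained rate of roughly 19 MB/s.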

11
Outstanding Issues
  • There are no data formats, programming languages,
    computing hardware or operating systems with
    lifetimes that can be guaranteed beyond the short
    term
  • Virtual machine technology may extend an
    environment's (see above) natural life, perhaps
    doubling it
  • Reducing the data into a much simplified and
    widely-used format can have significant
    advantages, but only allows restricted analyses
    to be performed
  • Preserving the detailed knowledge of the
    experimental apparatus is beyond current
    technology: it would require extreme discipline
    on the part of the researchers as well as major
    advances in the understanding and description of
    metadata

12
Conclusions
  • As long as advances in storage capacity continue,
    there are no significant issues related to the
    volume of scientific data that must be kept
  • Periodic migration between different types of
    storage media must be foreseen
  • Specific storage formats must also be catered
    for: this can require much more significant
    (time-consuming and expensive) migrations
  • By far the biggest problem concerns understanding
    the data: there is currently no clear solution
    in this domain

13
References
  • LEP Data archive
    • 1997: http://s.web.cern.ch/s/sticklan/www/archive/
    • 2002: http://mgt-focus.web.cern.ch/mgt-focus/Focus25/maggim.pdf
    • 2003: http://cern.ch/pfeiffer/LEP-Data-Archive/proposal/ProposalForTheLEPDataArchive.html
    • http://tenchini.home.cern.ch/tenchini/Status_Archiving_6_Mar_2003.pdf
  • Lisbon workshop
    • http://cern.ch/knobloch/talks/CernCodataLisbon.ppt
    • http://www.erpanet.org/events/2003/lisbon/LisbonReportFinal.pdf
  • COMPASS / HARP data migrations
    • http://storageconference.org/2003/papers/06-Lubeck-Overview.pdf
    • http://www.slac.stanford.edu/econf/C0303241/proc/papers/THKT001.PDF
    • http://indico.cern.ch/getFile.py/access?contribId=448&sessionId=24&resId=1&materialId=paper&confId=0

14
Acknowledgements
  • The following people provided material and / or
    pointers for this talk (knowingly or otherwise):
  • LEP Data Archive coordinators
    • David Stickland, David.Stickland_at_cern.ch (L3)
    • Andreas Pfeiffer, Andreas.Pfeiffer_at_cern.ch
    • Marcello Maggi, Marcello.Maggi_at_ba.infn.it (ALEPH)
  • COMPASS / HARP migrations
    • Andrea Valassi, Andrea.Valassi_at_cern.ch
  • ERPANET/CODATA Workshop
    • Jürgen Knobloch, Juergen.Knobloch_at_cern.ch

15
The End