Title: Data provenance in astronomy
1 Data provenance in astronomy
- Bob Mann
- Wide-Field Astronomy Unit, University of
- Data and databases in astronomy
- Case Study: UK Infrared Deep Sky Survey
- Conclusions
4 Astronomers observe across the whole electromagnetic spectrum
- Galaxy images look different across the spectrum, due to
  - Inherent angular resolution of the telescope
  - Different emission processes
5 Astronomical data: original form
- Different detector technologies used across the spectrum, yielding different types of data, e.g.
  - Ultraviolet/optical/infrared
    - Image: array of pixel values
  - X-ray
    - Event list: positions, arrival times, energies of all detected photons
  - Radio
    - Interferometric visibilities: sparse Fourier transform of a region of the sky
6 Astronomical data: final form
- Most research done using catalogue data
  - i.e. tables of attributes of detected sources, mainly discrete sources (stars, galaxies, etc.)
- Data compression
  - Catalogue is a few per cent of the image data volume
- Amenable to representation in relational DB
  - Natural indexing by location in sky
- ... but original data products (images, spectra, event lists) sometimes needed
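To make the relational representation concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table, column names and sample rows are invented for illustration, and a plain composite index on position stands in for whatever spatial indexing scheme a real archive would use.

```python
import sqlite3

# Hypothetical catalogue: one row per detected source (all names are illustrative).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE source ("
    " source_id INTEGER PRIMARY KEY,"
    " ra_deg REAL NOT NULL,"    # right ascension in degrees
    " dec_deg REAL NOT NULL,"   # declination in degrees
    " j_mag REAL,"
    " src_class TEXT)"
)
# Index by sky position so that positional searches avoid a full table scan.
conn.execute("CREATE INDEX idx_source_pos ON source (dec_deg, ra_deg)")

conn.executemany(
    "INSERT INTO source VALUES (?, ?, ?, ?, ?)",
    [
        (1, 150.10, 2.20, 17.3, "star"),
        (2, 150.12, 2.21, 19.8, "galaxy"),
        (3, 210.50, -1.00, 18.1, "star"),
    ],
)


def box_search(conn, ra_min, ra_max, dec_min, dec_max):
    """Return ids of sources in a small RA/Dec box (ignores RA wrap-around at 0/360)."""
    cur = conn.execute(
        "SELECT source_id FROM source"
        " WHERE dec_deg BETWEEN ? AND ? AND ra_deg BETWEEN ? AND ?"
        " ORDER BY source_id",
        (dec_min, dec_max, ra_min, ra_max),
    )
    return [row[0] for row in cur]


nearby = box_search(conn, 150.0, 150.2, 2.0, 2.5)
print(nearby)
```

A rectangular box is the simplest positional query; a cone search around a given position would need spherical geometry on top of this.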
7 Astronomical databases
- Telescope archives
  - Heterogeneous collections of raw data files from all observations taken
  - Download data for reduction and analysis
- Sky survey archives
  - Homogeneous data and pipeline reduction
  - Science Archive: do science on DB
- Bibliographic archives: scans of journals
8 Astronomical data processing
- Data reduction
  - Remove instrumental signatures from raw data and produce science-ready data
  - Software packages written for specific instruments
- Data analysis
  - Derive scientific results from science-ready data products, e.g. statistical analyses
  - Some astro-specific packages/environments, e.g. IRAF
  - Some use of programming languages
    - Fortran, C/C++, Python, Java
  - Some use of commercial packages
    - e.g. Interactive Data Language (IDL)
- Data and databases in astronomy
- Case Study: UKIDSS
  - Introduction to UKIDSS
  - Data life-cycle in UKIDSS
  - Provenance in UKIDSS
- Conclusions
10 UK Infrared Deep Sky Survey
- Set of five infrared sky surveys
- Covering 1/6 of the sky
- From large/shallow to very small/very deep
- See www.ukidss.org
- Observations 2005-2012 using Wide Field Camera (WFCAM) on UK Infrared Telescope (UKIRT) in
11 UKIDSS data life-cycle (1)
- Summit of Mauna Kea
  - Data acquired from 4 WFCAM detectors
  - Summit pipeline: instrument health
  - Data written to LTO tape in NDF format
  - Tapes couriered to Cambridge weekly
- Cambridge
  - Raw data converted from NDF to FITS
  - Data reduction pipeline run on nightly basis: 100Gb/night
  - Remove instrumental signatures, combine images, detect and classify objects, calibrate positions
12 UKIDSS data life-cycle (2)
- Edinburgh
  - Ingest data from Cambridge: catalogues into RDBMS, image metadata into RDBMS, images on disk
  - Combine data from multiple nights: generate new catalogues from stacked images
  - Prepare release databases for WFCAM Science Archive (WSA): see http://surveys.roe.ac.uk/wsa
- Users worldwide
  - Extract raw images from Cambridge
  - Extract images and catalogues in FITS files from Edinburgh
  - Run queries on catalogues and image metadata in WSA
13 Provenance in UKIDSS
- Why is provenance important in UKIDSS?
- What provenance information is recorded?
- How will this be used? ... and by whom?
- And is this adequate?
14 Importance of provenance
- Much UKIDSS science is rare object search
Objects with these colours would be very unusual and possibly very interesting. Are they real? Need ability to trace back to reduced image within which object was detected, maybe back to raw image.
[Figure axes: ratio of fluxes in H and K bands; ratio of fluxes in J and H bands]
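A rare-object search of this kind amounts to selecting sources whose flux ratios fall in a sparsely populated part of the plot. The sketch below illustrates the idea only: the direction of each ratio (J/H and H/K), the threshold values and the sample fluxes are all assumptions made for the example, not values from the survey.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    source_id: int
    flux_j: float
    flux_h: float
    flux_k: float


def flux_ratios(d):
    """Return the (J/H, H/K) flux ratios, i.e. the two plot axes (assumed orientation)."""
    return d.flux_j / d.flux_h, d.flux_h / d.flux_k


def is_unusual(d, jh_min, hk_max):
    """Flag a detection whose ratios lie in a hypothetical 'unusual' corner of the plot."""
    jh, hk = flux_ratios(d)
    return jh > jh_min and hk < hk_max


# Invented fluxes in arbitrary units.
detections = [
    Detection(1, 1.0, 1.2, 1.3),
    Detection(2, 3.0, 1.0, 2.5),
    Detection(3, 0.9, 1.0, 1.1),
]
candidates = [d.source_id for d in detections if is_unusual(d, jh_min=2.0, hk_max=0.5)]
print(candidates)
```

Every candidate returned by such a cut still has to be checked against the images it came from, which is where the provenance trail becomes essential.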
15 Structure of a FITS file
Header composed of 80-character ASCII records
Data units can be images or tables
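The header layout can be illustrated without any astronomy library. The sketch below builds a minimal primary header by hand and splits it back into records. It relies on two facts from the FITS standard: records are 80 ASCII characters long, and headers (like data units) are padded to whole 2880-byte blocks, with an END record closing the header. The helper names are invented for this example, and real work would normally use an established FITS library instead.

```python
CARD_LEN = 80      # each header record is 80 ASCII characters
BLOCK_LEN = 2880   # headers and data units occupy whole 2880-byte blocks


def card(keyword, value, comment=""):
    """Format one record: keyword in columns 1-8, '= ' in 9-10, value ending in column 30."""
    text = f"{keyword:<8}= {value:>20}"
    if comment:
        text += f" / {comment}"
    return text.ljust(CARD_LEN)


def make_header(cards):
    """Join the records, append END, and pad with spaces to a whole block."""
    text = "".join(cards) + "END".ljust(CARD_LEN)
    padding = (-len(text)) % BLOCK_LEN
    return (text + " " * padding).encode("ascii")


def split_records(raw):
    """Split raw header bytes into 80-character records, stopping at END."""
    records = []
    for start in range(0, len(raw), CARD_LEN):
        record = raw[start:start + CARD_LEN].decode("ascii")
        if record.rstrip() == "END":
            break
        records.append(record)
    return records


header = make_header([
    card("SIMPLE", "T", "conforms to the FITS standard"),
    card("BITPIX", "8", "bits per data value"),
    card("NAXIS", "0", "no primary data array"),
])
records = split_records(header)
print(len(header), len(records))
```

A header with NAXIS set to 0 describes an empty primary data array, the arrangement used when all the data live in extensions.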
16 FITS header records
- Almost all records of the form KEYWORD = value / COMMENT
- Some standard keywords defined, but considerable freedom to define new ones
  - Relevant metadata for particular instruments
- Amongst standard set is HISTORY
  - Format: HISTORY free text
- Provenance information can be stored in a series of HISTORY records
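As a rough illustration of these two record forms, the sketch below splits a record into keyword, value and comment, and treats HISTORY as free text. It is deliberately simplified: it does not handle a "/" inside a quoted string value or continued long strings, and the sample records are invented, not taken from a UKIDSS file.

```python
def parse_card(card):
    """Split one 80-character record into (keyword, value, comment).

    Simplified sketch: ignores quoted strings containing '/' and CONTINUE records.
    """
    keyword = card[:8].strip()
    if keyword in ("HISTORY", "COMMENT") or card[8:10] != "= ":
        # Commentary records: everything after column 8 is free text.
        return keyword, card[8:].strip(), ""
    value, _, comment = card[10:].partition("/")
    return keyword, value.strip(), comment.strip()


# Invented sample records, padded to the fixed record length.
cards = [
    "EXPTIME =                 10.0 / exposure time in seconds".ljust(80),
    "HISTORY flat-field correction applied".ljust(80),
    "HISTORY sky subtraction applied".ljust(80),
]
parsed = [parse_card(c) for c in cards]

# The HISTORY records, read in order, form a simple provenance trail.
history = [value for keyword, value, _ in parsed if keyword == "HISTORY"]
print(parsed[0])
print(history)
```

Because HISTORY text has no fixed structure, any tool that reads it has to know the conventions of the pipeline that wrote it.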
17 UKIDSS FITS files (1)
- Raw image files
  - Primary header: telescope/instrument set-up, observing conditions, target, observational parameters
  - Primary data array: empty
  - Extensions: (header, data) pairs for each of four detectors; header has detector-specific metadata, data is compressed image
  - Header keywords defined in Interface Control Document between Hawaii and Cambridge
18 UKIDSS FITS files (2)
- Reduced image files
  - Primary header and data array: metadata propagated from raw data file
  - Headers of extensions include HISTORY records for data reduction steps run at Cambridge, e.g.
    - HISTORY 20060615 173002
    - HISTORY Id cir_stage1.c,v 1.11 2005/12/15 144404 jim Exp
    - HISTORY 20060615 173104
    - HISTORY Id cir_qblkmed.c,v 1.9 2005/08/12 143519 jim Exp
    - HISTORY 20060615 173236
    - HISTORY Id cir_xtalk.c,v 1.5 2005/10/17 145850 jim Exp
    - HISTORY 20060615 200158
    - HISTORY Id cir_arith.c,v 1.8 2005/02/25 101455 jim Exp
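The HISTORY records shown above can be turned into a list of reduction steps with a few lines of Python. This is a sketch under two assumptions that the slide does not state explicitly: that each date/time record belongs to the Id record that follows it, and that the Id text is a version-control identifier of the form "file,v revision ...". The patterns tolerate an optional colon in case punctuation was lost from the records as displayed.

```python
import re

# Text following the HISTORY keyword, as displayed on the slide.
HISTORY = [
    "20060615 173002",
    "Id cir_stage1.c,v 1.11 2005/12/15 144404 jim Exp",
    "20060615 173104",
    "Id cir_qblkmed.c,v 1.9 2005/08/12 143519 jim Exp",
    "20060615 173236",
    "Id cir_xtalk.c,v 1.5 2005/10/17 145850 jim Exp",
    "20060615 200158",
    "Id cir_arith.c,v 1.8 2005/02/25 101455 jim Exp",
]

STAMP = re.compile(r"^(\d{8})\s+([\d:]+)$")        # date and time of a record
RCS_ID = re.compile(r"Id:?\s+(\S+),v\s+(\S+)")     # source file and revision


def reduction_steps(history):
    """Pair each date/time record with the Id record that follows it (assumed layout)."""
    steps = []
    stamp = None
    for text in history:
        text = text.strip()
        m = STAMP.match(text)
        if m:
            stamp = " ".join(m.groups())
            continue
        m = RCS_ID.search(text)
        if m:
            steps.append({"stamp": stamp, "module": m.group(1), "revision": m.group(2)})
            stamp = None
    return steps


steps = reduction_steps(HISTORY)
for step in steps:
    print(step["stamp"], step["module"], step["revision"])
```

The result is an ordered list of which module, at which revision, touched the image, which is the kind of information needed when checking whether an odd detection is a processing artefact.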
19 UKIDSS FITS files (3)
- Catalogue files
  - Primary header: metadata propagated from raw image
  - Primary data array: empty
  - Headers of extensions include metadata for catalogue generation process: invocations of software modules in HISTORY records, with parameter values in separate records
  - Header keywords for both reduced images and catalogues are defined in an Interface Control Document between Cambridge and Edinburgh
20 User access to provenance info
- All header records from all FITS files ingested into WSA, except HISTORY records
- So, users can track provenance through queries against WSA, and can get HISTORY records by downloading files
- Hopefully enough to determine whether an unusual object is real, but is this good enough?
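Tracking provenance through queries might look roughly like the following. This is only a sketch: the three tables, their columns and the file names are hypothetical and do not reproduce the actual WSA schema. The point is that a detection can be joined back to the reduced image it was found in, and from there to the raw image.

```python
import sqlite3

# Hypothetical, much simplified schema (not the real WSA tables).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE raw_image (raw_id INTEGER PRIMARY KEY, file_name TEXT);
    CREATE TABLE reduced_image (
        image_id INTEGER PRIMARY KEY,
        file_name TEXT,
        raw_id INTEGER REFERENCES raw_image(raw_id));
    CREATE TABLE detection (
        source_id INTEGER PRIMARY KEY,
        image_id INTEGER REFERENCES reduced_image(image_id));
    INSERT INTO raw_image VALUES (10, 'raw_0010.fits');
    INSERT INTO reduced_image VALUES (20, 'reduced_0020.fits', 10);
    INSERT INTO detection VALUES (1, 20);
    INSERT INTO detection VALUES (2, 20);
    """
)


def trace(conn, source_id):
    """Return (reduced image file, raw image file) for a detection, or None if unknown."""
    return conn.execute(
        "SELECT r.file_name, w.file_name"
        " FROM detection d"
        " JOIN reduced_image r ON r.image_id = d.image_id"
        " JOIN raw_image w ON w.raw_id = r.raw_id"
        " WHERE d.source_id = ?",
        (source_id,),
    ).fetchone()


print(trace(conn, 1))
```

With the file names in hand, the user can download the files themselves to read the HISTORY records that were not ingested into the database.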
21 Recap: Astronomical data processing
- Data reduction
  - Remove instrumental signatures from raw data and produce science-ready data
  - Software packages written for specific instruments
- Data analysis
  - Derive scientific results from science-ready data products, e.g. statistical analyses
  - Some astro-specific packages/environments, e.g. IRAF
  - Some use of programming languages
    - Fortran, C/C++, Python, Java
  - Some use of commercial packages
    - e.g. Interactive Data Language (IDL)
22 Provenance in data analysis: two main problems
- Less controlled software environment
  - Little bits of code written for a specific analysis, not tried and tested pipeline modules
- Use of data from many sources
  - UKIDSS/WSA is state-of-the-art for provenance
  - Many (esp. older) data resources not so good
  - Provenance of combined dataset only as good as provenance of worst constituent dataset?
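One low-overhead way an ad hoc analysis script could record its own provenance is sketched below. It is an illustration of the idea only, not a practice described in the talk: the record layout, the field names and the example inputs are all assumptions. Each input is identified by a SHA-256 checksum of its contents, alongside the parameters and a label for the version of the analysis code.

```python
import hashlib
import json
import platform
from datetime import datetime, timezone


def provenance_record(inputs, parameters, code_version):
    """Build a small provenance record for one run of an analysis script.

    inputs: mapping of dataset name -> raw bytes of that dataset (illustrative).
    """
    return {
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "python": platform.python_version(),
        "code_version": code_version,
        "parameters": parameters,
        "inputs": {
            name: hashlib.sha256(data).hexdigest()
            for name, data in sorted(inputs.items())
        },
    }


# Invented example: one input file, one selection parameter, a hypothetical version label.
record = provenance_record(
    inputs={"catalogue.fits": b"example bytes"},
    parameters={"jh_min": 2.0},
    code_version="analysis.py r1",
)
print(json.dumps(record, indent=2))
```

Saving such a record next to the results gives a later reader something to check a combined dataset against, even when some of the source archives record little provenance themselves.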
23 Does this matter?
- Provenance information for data analysis is recorded in the journal paper (sort of)
- Improving links between online literature and data sources
- Increasing importance of large sky surveys with well controlled environments
- Moving more of the data analysis from the user's desktop to the data centre
- Modern sky survey systems record and publish extensive provenance for data reduction
- Very little provenance recorded from data analysis, except description in journal paper
- More could surely be done, but would researchers support the overhead of doing so?
- Improvements as more analysis in data centre
- Could/should we be doing more?