Title: SJC CGW2
1 Chemistry research data in the modern age: A clear need for curation expertise
Simon Coles, School of Chemistry, University of Southampton, U.K.
s.j.coles_at_soton.ac.uk
2 A Contentious and Changing World
3 Data Generation
Data Collection
Synthesis
Publication
Data Workup
Data Processing
4 Data Types
RAW data (G bytes)
DERIVED data (M bytes)
Lab / Institution
Subject Repository / Data Centre / Public Domain
RESULTS data (k bytes)
5 Incentives and Drivers
- Chemists don't think about their data!
- They need to understand that their data is valuable and has a use beyond that of an immediate gain, before they will consider curation issues.
- So what are the incentives and drivers?
- Data Management
- Data Deluge
- Publishing Data
- Validation, Assessment and Peer Review
- Re-analysing Data
- Data Reuse and Derivative Studies
- Publishing and Funding Mandates
6 Curation Incentives - Data Management, Deluge and Publishing
Data from experiments conducted as recently as six months ago might be suddenly deemed important, but those researchers may never find those numbers, or if they did, might not know what those numbers meant.
Lost in some research assistant's computer, the data are often irretrievable or an undecipherable string of digits.
To vet experiments, correct errors, or find new breakthroughs, scientists desperately need better ways to store and retrieve research data.
Data from Big Science is easier to handle, understand and archive. Small Science is horribly heterogeneous and far more vast. In time Small Science will generate 2-3 times more data than Big Science.
"Lost in a Sea of Science Data", S. Carlson, The Chronicle of Higher Education (23/06/2006)
7 Curation Incentives - Data Management, Deluge and Publishing
2,000,000
30,000,000
450,000
8 Curation Incentives - Data Management, Deluge and Publishing
9 Separating Data from Interpretations
Intellect and Interpretation (Journal article, report, etc.)
Underlying data (Institutional data repository)
10 The eCrystals Data Repository
An Institutional Repository
http://ecrystals.chem.soton.ac.uk
11 The Repository for the Laboratory
Create new compound
Deposit
Add experiment data and metadata
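The three actions on this slide could be modelled roughly as follows. This is only a sketch under assumptions: the class, field and method names are hypothetical and do not describe the actual eCrystals software or its interface, and the ordering (create, add data and metadata, then deposit) is one reading of the slide.

```python
from dataclasses import dataclass, field


@dataclass
class CompoundRecord:
    """Hypothetical in-memory record for one compound (not the eCrystals schema)."""
    name: str
    metadata: dict = field(default_factory=dict)
    files: list = field(default_factory=list)
    deposited: bool = False

    def add_experiment_data(self, filename, **metadata):
        """Attach a data file name and any descriptive metadata to the record."""
        if self.deposited:
            raise ValueError("record already deposited")
        self.files.append(filename)
        self.metadata.update(metadata)

    def deposit(self):
        """Mark the record as deposited; refuse records with no data attached."""
        if not self.files:
            raise ValueError("nothing to deposit: add experiment data first")
        self.deposited = True
        return self


# Illustrative usage; the file name and metadata key are made up.
record = CompoundRecord(name="example compound")        # create new compound
record.add_experiment_data("frames.raw",                # add experiment data
                           instrument="diffractometer")  # ... and metadata
record.deposit()                                        # deposit
```

Refusing to deposit an empty record is a design choice of this sketch: it keeps data and metadata together at the point of deposit rather than leaving them to be filled in later.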
12 Curation Incentives - Validation and Peer Review
13 Curation Incentives - Raw Data Re-analysis
Good data
Difficult data
You never know when data might have to be revisited, or when new innovations will allow re-interpretation!
14 Curation Incentives - Funding and/or publishing mandates
- Mandates to store / make data available
- RCUK statement
15 Curation Incentives - Derivative Science
- Starting points for new science
- Derivation of knowledgebases
16 Curation Issues
- Need to engage stakeholders throughout the whole research data lifecycle:
- instrument manufacturers,
- scientists,
- archivists,
- librarians,
- subject repositories,
- data centres,
- publishers,
- funders,
- data miners and information providers
17 Curation Issues
- File formats, complexity and specialisation
- Data corruption and bit rot
- Quantity of data
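Data corruption and bit rot are usually detected rather than prevented: a checksum recorded when the data is stored is recomputed later and compared. A minimal sketch of such a fixity check follows, assuming SHA-256 and a plain dictionary as the manifest. The function names and manifest layout are illustrative assumptions and are not taken from any repository software mentioned in these slides.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(directory):
    """Map each file under `directory` (by relative path) to its checksum."""
    root = Path(directory)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(directory, manifest):
    """Return relative paths that are missing or whose checksum has changed."""
    root = Path(directory)
    damaged = []
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            damaged.append(rel_path)
    return damaged
```

In use, `build_manifest` would be run once when a dataset is stored and the result kept alongside it; running `verify_manifest` periodically then lists any files that have silently changed or disappeared. Reading in chunks keeps memory use flat even for the G byte raw files.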
18 Curation Issues
- File formats, complexity and specialisation
- Data corruption and bit rot
- Quantity of data
- Future proofing
- Technology developments
- eScience
19 Curation Issues
- File formats, complexity and specialisation
- Data corruption and bit rot
- Quantity of data
- Catering for a whole community
20 Curation Issues
- File formats, complexity and specialisation
- Data corruption and bit rot
- Quantity of data
- Catering for a whole community
- What data is worth storing?
- Estimated that the real cost of a crystal structure is 75 - 100 (200)
- But what about the cost of producing the crystal?
- Priceless!
- The crystal was synthesised in a specialised laboratory, by highly trained researchers under a specific research program
- A laboratory, researcher or scheme of work is a transient or evolving entity
- As much data as possible must be acquired and future-proofed whilst the analyst has the substance to hand
21 Curation Issues
- File formats, complexity and specialisation
- Data corruption and bit rot
- Quantity of data
- Catering for a whole community
- What data is worth storing?
- Provenance, workflow and rights protection
22 Curation Issues
- File formats, complexity and specialisation
- Data corruption and bit rot
- Quantity of data
- Catering for a whole community
- What data is worth storing?
- Provenance, workflow and protection of rights
- Available expertise, library/information services structure
- Cost and policy
- Business models
- Subject librarian model - working closely with practitioners
- New funding/structure models to support open data as OA takes off
- Working group to assess the volume and diversity of research data
- JISC funded survey - Cost of preserving research data
- Commercialisation of knowledge derived from collections of data