Title: Adding Value to Data
1 Data Repositories and JISC Repository Landscape
Mahendra Mahey, Repositories Research Officer, Repositories Research Team, UKOLN
GRADE Project Meeting (all partners), Edinburgh, 30 October 2006.
UKOLN is supported by
This work is licensed under a Creative Commons Licence: Attribution-ShareAlike 2.0
2 Data Repositories Landscape
(Diagram showing Institutions and Data Centres, annotated with "?" marks)
3 JISC Funds
- Data Centres
- MIMAS
- AHDS
- UK Data Archive
- EDINA
These also receive funding from Research Councils UK
4 JISC Information Environment Architecture
(Idealised) Technical Infrastructure for Services
Andy Powell, 2005
5 Institutional Repositories Holding Research Data
- Very few around the world are doing this, and are they up to the job?
- Versioning
- Authentication at individual asset level
- Other methods are being used: informal, ad hoc, with lots of data slipping through the net
- Do repositories offer a better way to do this? Different data types lead to problems with existing software
- Data cluster projects
- eBank
- SPECTRa
- GRADE
- CLADDIER
- ARROW DART
- The idea of linking papers to the underlying data of experiments and research is very appealing: stORe project and Open Access!
- Can handle some (orphaned) data but not all; still a role for data centres
6 Data Centres
- Have been storing data for years and predate the trendy "r" word; they are the experts
- They can teach institutions many lessons
- A lot of mystery and suspicion between Data Centres and Institutions; communication and dialogue are needed between the two, and across disciplines
- Time and money saving?
- Data centres argue that being subject-specific is a good thing; rationalising?
- Storing and curation has become a science in its own right, e.g. bioinformatics
- Offer
- Databases
- Web access
- Tools to explore the information
- Systems to capture the information
- Service centres
- Custodianship, acquisition and ownership
- Depend on the good will of the community
- Add value, service and organisation; require lots of money to continue
7 Data Centre Infrastructure Can be Complex!
Reactome
EMBL-Bank: DNA sequences
EnsEMBL: Genome Annotation
UniProt: Protein Sequences
Array-Express: Microarray Expression Data
EMSD: Macromolecular Structure Data
IntAct: Protein Interactions
8 Institutional and Data Centre practice exist
(Diagram. Components: laboratory repository; institutional data repositories; aggregator services; publishers (peer-reviewed journals, conference proceedings, etc.); presentation services / portals. Labelled activities: deposit; validation; search, harvest; publication; data discovery, linking, citation; data analysis, transformation, mining, modelling.)
9 DRP Projects
- GRADE
- R4L
- SPECTRa
- CLADDIER
- stORe
- eBank
Data Cluster
Meetings
Road Map Required
Briefing Paper
Workshop
Interviews and Surveys
Road Map for Digital Repository / Preservation
Projects Focusing on Data
06/09 Call
10 UKOLN - Data Repositories Research (Consultancy)
- To define how institutions (collectively and individually) and scientific data centres can together effectively achieve:
- Preservation
- Access: Managed and Open
- Reuse: Data Citation, Data Mining and Reinterpretation
- To identify the mechanisms, business processes and good practice by which these functions can be achieved
- To facilitate dialogue between data centres, institutions and other key players and to define a collaborative way forward
Dr Liz Lyon
11 Identifying and defining inter-relationships
- Socio-cultural, organisational, legal
- Technical interoperability
- Roles and responsibilities
- Access
- Preservation
- Re-use
- See briefing paper produced for workshop
12 Socio-cultural, organisational, political and legal issues
- highly diverse in awareness
- practice and skills
- need to understand the full spectrum of research practice
- workflows and associated data flows
- both within and between disciplines/sub-disciplines
13 Hierarchy of Drivers
- Level 0: deliver project.
- Level 1: meet good scientific practice.
- Level 2: support own science.
- Level 3: employer's requirements.
- Level 4: funder's requirements.
- Level 5: public policy requirements.
Slide from Mark Thorley, NERC
14 RC UK - Funding Body
15 Socio-legal conclusions
- Use a questionnaire and send it to data centres; disciplines will be different
- Promote interoperability through the use of metadata standards. Resource discovery standards should be promoted and developed by learned societies (membership arms) and subject communities, discipline by discipline (not by data curators): bottom up rather than top down. Education: recognise the very wide range of understanding amongst disciplines of the value of data curation; centres/IRs/archives need to go out and promote why they exist and why they should be used. Focus at community level.
- Each research council should have a written, meaty data policy, disseminated and policed.
- Legal issues: the JISC legal centre is valuable, but there is a lack of clarity and guidance where law exists on the use of digital objects, IP etc. Clarity of law is needed, with guidance on how best to interpret it: straightforward answers to straightforward questions. Model licences for use, interpretation, confidentiality, disclosure.
- Academics and data centres need to be told the differences between data banks/data centres etc. and IRs. IRs have not had enough institutional buy-in yet.
- JISC could investigate why subject repositories are more successful than IRs. JISC policy should reflect what is happening on the ground.
- JISC should help sell IRs better
16 Technical Interoperability
- Federation models
- interoperability and inter-relationships between
repositories
17 Open Access
- Good thing, but are the tools up to the job?
- OAI-PMH
- Dublin Core
- Use METS as packaging standard, momentum building?
- Papers not data
- For data, do these map to other metadata schemas that have been developed, or to extensions to DC?
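As a rough illustration of the tools named on this slide, the sketch below builds an OAI-PMH ListRecords request for unqualified Dublin Core (the oai_dc metadata prefix) and reads the Dublin Core elements out of a response. The endpoint URL, the sample response and the helper names list_records_url and dc_fields are assumptions made up for this example; they do not describe any particular repository.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical repository endpoint, used only for illustration.
BASE_URL = "https://repository.example.org/oai"

# XML namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


# Invented response fragment standing in for what a harvester might receive.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repository.example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example dataset</dc:title>
          <dc:type>Dataset</dc:type>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def dc_fields(xml_text):
    """Return one dict per record, mapping DC element name to its values."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        fields = {}
        for el in rec.iterfind(".//oai_dc:dc/*", NS):
            name = el.tag.split("}", 1)[1]  # drop the "{namespace}" prefix
            fields.setdefault(name, []).append(el.text)
        records.append(fields)
    return records
```

In this sketch a dataset is described by the same handful of free-text elements as a paper, which is one way to read the "papers not data" concern: anything data-specific would need another schema or an extension to DC.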
18 Federation
- Monolithic solutions fail
- Aggregation of institutional repositories is
essential
Data Centres View
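One possible reading of aggregation, as opposed to a single monolithic store, is a thin index built over records that stay in their home repositories. The sketch below is only an assumption about what that could look like: the record shape (identifier and title), the repository names and the aggregate function are all hypothetical.

```python
def aggregate(harvests):
    """Merge per-repository record lists into one index keyed by identifier.

    `harvests` maps a repository name to a list of records, each a dict
    with "identifier" and "title" keys (a hypothetical record shape).
    The data stays where it is; the index only notes who holds what.
    """
    index = {}
    for repo, records in harvests.items():
        for rec in records:
            entry = index.setdefault(
                rec["identifier"], {"title": rec["title"], "held_by": []}
            )
            entry["held_by"].append(repo)
    return index


# Invented example: two institutional repositories, one shared dataset.
harvests = {
    "repo-a": [{"identifier": "ds-1", "title": "Survey 2005"}],
    "repo-b": [
        {"identifier": "ds-1", "title": "Survey 2005"},
        {"identifier": "ds-2", "title": "Spectra batch"},
    ],
}
index = aggregate(harvests)
```

Here index["ds-1"]["held_by"] lists both repositories, so a search service can point users at every copy without any one system having to hold all the data.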
19 Technical
- Need to define what is meant by semantics of structured data and publish guidelines at the levels of metadata and classification/subject areas/factual names, with agreed conventions layered on top, e.g. identifiers.
- Application profiles: who should be the keeper of those definitions (e.g. registries), and who funds and owns them?
- Scientists concentrate on narrow areas, but the connections are to other, wider areas
- Time series data are different: how to discover and use them? It is more difficult to define discovery metadata for time series. Data might not be logically the same.
- Data curation responsibility at institutional level/data centre: data curation requires specialisms, and data centres could feed this expertise back to institutions; a flow of expertise from Data Centres to institutions is needed
- Invitations to work in a data centre for a week are happening in Australia
- A mixed economy of organisational responsibility is inevitable; some federation will be there
- How to express quality: a role for provenance and audit as a means to express quality, also ranking and annotation
- Curation of data is of more interest to scientists than interoperability as a means of marketing/selling it.
20 Roles, Rights and Responsibilities
- Scientist: Creation and use of data.
- Data centre: Curation of and access to data.
- User: Use of 3rd party data.
- Funder: Set / react to public policy drivers.
- Publisher: Maintain integrity of the scientific record.
From Mark Thorley, NERC
21 Roles and Responsibilities
- Individual scientists to deposit data of an acceptable quality, using domain standards
- Re-users should acknowledge where data came from and, if appropriate, improve the quality of the data.
- Institutions should have policies that mandate data deposit in an appropriate place, not necessarily an IR.
- Publishers/journals/editors should mandate open deposit of data.
- Curators, who collect, describe and connect data: idea of a community proxy role, defining standards for the domain, working in and with the scientists
- Funders should enforce their data deposit policies where possible
- Funders should recognise the emerging need for new infrastructure and provide appropriate funding for this infrastructure and for the resulting actions
- Users and funders should feed back views on the data stored to the data centre manager
- Click-use licence says that if you enhance the data you must give it back, but how can the data centre police that policy? Versioning is an issue here.
- Value of good-enough versus completely comprehensive descriptions (Graham C)
- Who is responsible for ownership of the data to make changes? If there are multiple versions, the last one is not necessarily the best
- Competitive views: risk of sabotage of other groups' work is possible.
- Who checks the provenance of anything new? Curators?
22 Small Science vs Big Science
Data from Big Science is easier to handle, understand and archive. Small Science is horribly heterogeneous and far more vast. In time Small Science will generate 2-3 times more data than Big Science.
"Lost in a Sea of Science Data", S. Carlson, The Chronicle of Higher Education (23/06/2006)
23 Dataset publishing
- Re-examine concept of Dataset Publishing (Callahan, Johnson, and Shelley 1996)
- analogous to publishing papers
- rewards for publishing datasets (e.g. promotion, RAE)
- procedures (e.g. standards to use, peer review), resources to manage procedures
- Should minimise time and effort required
- need tools to assist in creation, maintenance and dissemination of dataset descriptions
- Means of putting into a public/community
- "Deposit" and "Share" are too cosy
- to publicate, to issue
- Terms of access and use
- Open?
- Privilege of membership
- Payment of money
Taken from Peter Burnhill
24 Spatial is Special
- Why?
- Geo research data not deposited; lots of data slipping through the net, not falling under RC remit; data being lost or shared informally; may be a case for a national repository?
- Fears about legality of resources, e.g. OS data; researchers really want to share in a big way
- Should data be deposited in Data Centres?
- Academics not comfortable about sharing on a larger scale?
- IRs not geared up to handle data?
- DSpace does not allow editing of metadata
- Problem with the ISO standard used for geo data (ISO 19115) and DC
- Mapping done, further work needed; from wing mirror to Smart Car?
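To make the ISO 19115 and DC problem concrete, here is a minimal crosswalk sketch. It is not the mapping produced by the project: the field names on the ISO side are simplified, and the CROSSWALK table, the to_dublin_core function and the sample record are assumptions invented for illustration.

```python
# Illustrative crosswalk only: simplified ISO 19115 field names on the left,
# Dublin Core element names on the right.
CROSSWALK = {
    "title": "title",
    "abstract": "description",
    "topicCategory": "subject",
}


def to_dublin_core(iso_record):
    """Map a simplified ISO 19115-style record to Dublin Core.

    Returns the DC dict plus a sorted list of fields with no DC target.
    """
    dc = {}
    unmapped = []
    for field, value in iso_record.items():
        if field == "geographicBoundingBox":
            # DC offers a single coverage element, so the four
            # coordinates are flattened into one string.
            dc["coverage"] = (
                "westlimit={west}; eastlimit={east}; "
                "southlimit={south}; northlimit={north}"
            ).format(**value)
        elif field in CROSSWALK:
            dc[CROSSWALK[field]] = value
        else:
            unmapped.append(field)
    return dc, sorted(unmapped)


# Invented sample record.
record = {
    "title": "Example survey",
    "abstract": "Elevation survey of a test area.",
    "topicCategory": "elevation",
    "geographicBoundingBox": {"west": -3.4, "east": -3.0, "south": 55.8, "north": 56.0},
    "lineage": "Digitised from paper maps",
    "referenceSystem": "EPSG:27700",
}
dc, unmapped = to_dublin_core(record)
```

In this example lineage and referenceSystem have nowhere to go, and the bounding box survives only as a string. That kind of loss is one reason a mapping can exist and still need further work.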
25 Responsibility of Data Providers
- Responsibility of publicly funded research to share data
- "Free Our Data" Guardian work
- INSPIRE work
26 GRADE's input
- Important that GRADE inputs into this work, as it will set the direction of research and focus on GEOSPATIAL DATA Repository work
- Interviews held with Rebecca and David
27 DRP Projects
- GRADE
- R4L
- SPECTRa
- CLADDIER
- stORe
- eBank
Data Cluster
Meetings
Road Map Required
Briefing Paper
Workshop
Interviews and Surveys
Road Map for Digital Repository / Preservation
Projects Focusing on Data
06/09 Call
28 We need your input!
l.lyon_at_ukoln.ac.uk m.mahey_at_ukoln.ac.uk