Title: The Role of Libraries in Data Curation
1The Role of Libraries in Data Curation
Or: How do we even get started?
John MacColl, European Director, RLG Partnership
9 June 2010
2What I want to talk about
- The importance of data
- Institutional vs domain solutions
- Skills needs
- Our project
- Reward structures
3The importance of data
4It's the data, stupid
- astronomers are just as likely to point a
software query tool at a digital sky survey as to
point a telescope at the stars (The Economist,
Feb 2010)
- "It's like the invention of the telescope,"
Franco Moretti, a Stanford professor of English
and comparative literature, says of Google Books.
"All of a sudden, an enormous amount of matter
becomes visible." (The Chronicle, "The humanities
go Google", May 28 2010)
5DataVerse (Gary King, 2007)
- Data sometimes exist on individual researchers'
Web sites, without professional backups, off-site
replication, plans for format conversion and
migration, or professional cataloging.
6Pious hopes (Carole Palmer)
- 60 archive generated or collected data (no
offsite backup)
- 61 expect to keep more than 10 years
7Data lost, and data never born (U Wisconsin
Summary Report of the Research Data Management
Study Group (2009))
- In some cases, inadequate storage capacity is
leading to loss of data, forcing some researchers
to discard data from past experiments in order to
make room for current ones, or to avoid certain
types of experiments and research altogether.
8Data and their uses
Freely available
Locked away
Embargoed
Shared with collaborators
Secondary artifacts: statistical and pattern
analyses, subset extractions, visualisations,
simulations, discovery environments,
transformations
Primary data: sensory, numeric, digitised,
geospatial, etc
Ancillary data: questionnaires, fieldnotes, lab
notebooks, data dictionaries, annotations,
lecture notes, etc
9Don't try this at home?
10Institutional vs domain solutions
11Blue Ribbon Task Force on Sustainable Digital
Preservation and Access on aggregation
- Creating economies of scale among archives when
possible is always desirable, and may be critical
when the materials under stewardship require
particular kinds of expertise that are scarce.
This is the case for much scientific data.
12Qualified gravitational pull (Green and Gutmann)
- Most institutional repositories do not and
cannot offer support for managing dataset formats
over time. Policies for long-term stewardship
vary among institutions, but many have developed
a sliding scale of preservation promises.
13Oxford University, Research data management
services: findings of the consultation with
service providers (September 2008)
14Cornell DataStaR: a staging repository
15Datasets in Cornell IR
16Monash approach (institutional) (Treloar)
17U Wisconsin proposal
- Solutions comprised solely of expensive
technology will fail, because of the underlying
need to establish long-lasting cultural stability
within and between the research, library, and IT
communities on campus.
18Curation responsibilities (Carlson, The
Chronicle, 2006)
Data from Big Science is easier to handle,
understand and archive. Small Science is
horribly heterogeneous and far more vast. In time
Small Science will generate 2-3 times more data
than Big Science.
big science data
domain?
institution?
small science data
19Experiments & failures
- NSF DataNet Data Conservancy project. $20m
awarded. Led by JHU. Includes social sciences.
- U. Va. Mellon grant $870k. Programmers and
archivists. Includes Stanford, Yale and Hull. To
create a model for digital collection management
that can be easily shared among research
libraries.
- UKRDS ?
20Meanwhile
21Specialist data archives
22Skills needs
23Is this possible (Gabridge)?
- libraries can develop existing liaisons with
interest, passion, and strong analytical skills,
or they can recruit domain experts and teach
them about excellent information science
practices.
24ARL study Scott Brandt
25Our project
26Joint OCLC Research-LIBER
- Binghamton
- Brigham Young
- Cambridge
- Leeds
- Melbourne
- Nijmegen
- Oxford
27Deliverables
- Desk research
- Case studies
- Interviews with researchers
- Report and recommendations
28Project Aim
- It has been frequently asserted in the
literature on data curation that new service
roles are emerging for research libraries.
This project will seek to test this hypothesis by
considering the data curation requirements of a
number of recently completed research projects in
a sample group of North American and European
universities.
29Method
- Each university partner will produce two or
three case studies of projects in which data has
been generated, and consider the data curation
implications of these. The project will conclude
with an assessment of the potential role of the
research library in general in relation to such
datasets, based on the examples of good practice
discovered via the case studies.
30Project Approach
- The proposed project will adopt a bottom-up
approach and be grounded in the realities of data
storage and preservation behaviour as exemplified
in a number of real instances
31Scale again
- We consider that the question of how to arrive
at an articulation between the institutional
library and domain or funder data archives is one
of the most urgent requirements in this area, and
the project will explore it carefully.
32Environments & data
33Timescapes (Leeds)
34Nyman/Jones Archive (Leeds)
35The Australian Women's Register (Melbourne)
36Life Patterns (Melbourne)
37Incremental Project (Cambridge)
38What do we expect?
- Not a great deal!
- Need to adjust our timescales?
- Signs of progress?
- Indications of favourable organisational
frameworks?
- Indications of favourable policies?
- A taking of stock
39Reward structures
40Day's understatement
41Being excited about being cited (DataVerse, King)
- Articles with accessible data are cited twice
as often as otherwise equivalent articles that do
not provide data access.
- Articles in journals with replication policies
that make data available are cited thrice as
frequently as otherwise equivalent articles
without accessible data.
42Library neutrality (Steinhart, 2007)?
- There is ample evidence that even when
appropriate data repositories exist for a
particular discipline, researchers often fail to
take full advantage of them. This lack of
participation in data sharing and archival
activities suggests an opportunity for academic
libraries to provide a much-needed service.
43Thinning the library
- No longer just about capture of outputs at the
endpoint
- The library has to be involved in the whole
process of research and scholarship, throughout
its lifecycle
- This involves thinning out the library
- Rethinking the point of engagement
- The library becomes engineering
- and people
44Ten Questions to Begin a Conversation With Your
Faculty About Data Curation (Witt & Carlson)
- What is the story of your data?
- What form and format are the data in?
- What is the expected lifespan of your data?
- How could your data be used, reused, and
repurposed?
- How large is your dataset, and what is its rate
of growth?
- Who are potential audiences for your data?
- Who owns the data?
- Does the dataset include any sensitive
information?
- What publications or discoveries have resulted
from the data?
- How should the data be made accessible?
45Repositories at present are the wrong model
(Green and Gutmann)
- repositories position themselves at or near the
end of the scientific research life cycle. Their
goal is less to partner with researchers or with
domain-specific repositories throughout the
research life cycle than to garner the value of
the institution's productivity.
46Appraisal (Cornell)
- The archivist can no longer wait passively at
the end of the life cycle for records to arrive
at the archives when their creators no longer
wanted them or were dead (Cook 2000).
47Discussion!
48Next up
- Lunch and then
- 100
- Framing Libraries and the Environment
- Lorcan Dempsey, OCLC Research
- Buckingham