Curating Social Science Data: The View from ICPSR

1
Curating Social Science Data: The View from ICPSR
  • Myron Gutmann
  • Digital Curation Conference
  • December 12, 2007

2
To Start...
  • Curation is important, probably the most
    important thing we do
  • Curation is the process that prepares data for
    use, for re-use, and for long-term preservation
  • I'm going to concentrate on preparation, and how
    it supports use, re-use, or preservation
    themselves
  • Standards are critical!
  • What are we doing with social science data at
    ICPSR?
  • At the end, I'm going to talk a bit about the
    future, and what I think that means for data
    curation

3
ICPSR in a Minute
  • Membership organization at University of Michigan
  • Founded 1962
  • 6,000 studies, 500,000 data files
  • Social science and more, with an increasing level
    of confidential data
  • Funds from 600 members, government agencies and
    sponsors, research grants
  • Part of a global network of social science data
    archives, with formal organizations in the U.S.
    and Europe

4
Data in Quantitative Social Science
  • Since the 1940s:
      • Surveys and polls: large and small, private and
        public
      • Administrative microdata and tabulated data
        (Census and other official statistical agencies)
  • Data innovations from the 1960s and 1970s:
      • Longitudinal data from surveys
      • Longitudinal data from linked historical sources
      • Magnetic tape storage eased sharing and
        preservation

5
Dramatic Changes Since the 1990s
  • Most recently: transactional data from cell
    phones, on-line games, blogs, EZ-Tag, etc.
  • Retrospectively harmonized data:
      • IPUMS (Census data: U.S. and international)
      • Surveys, including NHIS, CPES, Fertility...
  • Expanding the science to include:
      • Biomedical and environmental data
      • Bibliography as data
  • Web-based enclave data access

Onnela et al. in PNAS, May 2007
6
Data Preservation and Sharing: Social Science
Practices
  • Longer and better sharing than in many sciences
  • Heterogeneous tradition
  • Strong in some disciplines, weak in others
  • Function of the use of large datasets in research
    and graduate instruction
  • Supported by NSF, NIH, NIJ Policies, Census
  • DDI Metadata Standard
  • Strong Curation Practices

7
Minding our Knitting (or what I tell young data
archivists)
  • Curation matters, to ensure reuse and document
    provenance - Quality Control is essential
  • Our scientific authority is critical, as
    established by a large number of reports that
    mention ICPSR (and other major data archives) as
    key infrastructure
  • We must protect the human subjects of research
    who are the basis of what we do
  • Social science and technology are constantly
    changing

8
Curation is Easy, Isn't It?
  • Identification, appraisal, acquisition are the
    first part of the curation process
  • Curation process first invented in the 1960s;
    clearly in need of reinvention
  • Is curation manual or not? ICPSR employs a mix of
    manual and automated processes, with ever more
    automation and adherence to standards

9
Knowing the Business Process Means Knowing how to
Get the Work Done
10
Metadata Matters: Critical Curation Needs
  • Data Management: Variable-level Metadata
      • Validity
      • Confidentiality
      • Usability
  • Data Discovery and Authority: Study-level Metadata
      • Discovery Planning
      • Provenance
      • Versioning
  • Preservation Planning: Preservation Metadata

11
Validity and Confidentiality
  • Critical task: are the data valid and respondents
    anonymous?
  • Validity
      • Compare producer's documentation with data
      • Identify questionable data
      • Discuss correction with producer
      • Document any changes
  • Confidentiality
      • Disclosure risk assessment and reduction
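
The first two validity steps on this slide (compare the producer's documentation with the data, identify questionable data) can be sketched in a few lines of Python. This is only an illustration, not ICPSR's actual tooling: the codebook layout, the variable names, and the function `find_questionable_values` are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical codebook: each variable maps to the set of codes that the
# producer's documentation lists as valid.
Codebook = Dict[str, Set[int]]


def find_questionable_values(
    rows: List[Dict[str, int]], codebook: Codebook
) -> List[Tuple[int, str, int]]:
    """Return (row index, variable, value) for each value not in the codebook."""
    problems = []
    for i, row in enumerate(rows):
        for variable, value in row.items():
            valid_codes = codebook.get(variable)
            # Undocumented variables are skipped in this sketch; a fuller
            # check would probably flag them as well.
            if valid_codes is not None and value not in valid_codes:
                problems.append((i, variable, value))
    return problems


# Made-up example data, for illustration only.
codebook = {"sex": {1, 2, 9}, "employed": {0, 1, 9}}
rows = [
    {"sex": 1, "employed": 1},
    {"sex": 3, "employed": 0},  # 3 is not a documented code for "sex"
    {"sex": 2, "employed": 7},  # 7 is not a documented code for "employed"
]

for index, variable, value in find_questionable_values(rows, codebook):
    print(f"row {index}: {variable}={value} is not in the codebook")
```

The flagged values are only candidates for discussion with the producer; whether to correct them, and how to document the change, remains a curatorial decision.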

12
Usability
  • Critical task: How easily can the data be used?
  • Solution 1: Variable-level metadata
      • Create complete variable-level metadata
        (including question text or equivalent)
      • Create complete traditional codebooks and
        on-line documentation
  • Solution 2: On-line Delivery and Analysis
      • Effective on-line delivery systems since 2001
      • On-line analysis systems for many users
  • Quality Control with managerial sign-off

13
Discovery: The Problem
14
Discovery: Solutions
  • Critical task: How easily can the data be found?
  • Process
  • Study-level metadata creation
  • High quality is critical
  • Producer-supplied metadata is not sufficient
  • Metadata must be effectively exposed, and
    therefore standards-based
  • Participation in multi-institution portals,
    including DataPASS (US) and CESSDA (Europe)

15
Provenance, Version, Preservation
  • Critical task: Data users need to know what they
    are getting, to achieve replication, and the
    archive needs to know preservation history and
    future needs
  • Process must provide:
      • Provenance metadata (where did it come from?)
      • Version metadata (when did it change?)
      • Preservation metadata (when should we review and
        possibly change?) and practice (another story)
  • The Challenge: How much detail is required?
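
One way to picture the three kinds of metadata named on this slide is a single record that answers each question in turn. The sketch below is a minimal illustration under stated assumptions: the class names, field names, and example values are hypothetical and do not reflect ICPSR's systems or any metadata standard.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class VersionEntry:
    """Version metadata: when did the data change, and why?"""
    version: str
    changed_on: date
    note: str


@dataclass
class StudyRecord:
    """Hypothetical record combining provenance, version, and preservation metadata."""
    study_id: str
    source: str  # provenance: where did it come from?
    versions: List[VersionEntry] = field(default_factory=list)
    next_review: Optional[date] = None  # preservation: when should we review?

    def add_version(self, version: str, changed_on: date, note: str) -> None:
        self.versions.append(VersionEntry(version, changed_on, note))

    def current_version(self) -> Optional[str]:
        # The most recently added entry is treated as the current version.
        return self.versions[-1].version if self.versions else None

    def review_due(self, today: date) -> bool:
        return self.next_review is not None and today >= self.next_review


# Made-up example values, for illustration only.
record = StudyRecord(
    study_id="example-0001",
    source="Deposited by the original investigator",
    next_review=date(2012, 12, 12),
)
record.add_version("1.0", date(2007, 12, 12), "Initial release")
record.add_version("1.1", date(2008, 3, 1), "Corrected variable labels")

print(record.current_version())
print(record.review_due(date(2013, 1, 1)))
```

How many fields such a record needs is exactly the open question the slide raises; this sketch deliberately keeps the detail to a minimum.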

16
Research Curation for the Future
17
Great Science in the Future will Require
Massively Complex Data
  • Many Science Domains
      • Social science, biomedical, environmental,
        geographical, psychological, engineering,
        transactional
  • Examples
      • National Children's Study
      • Chicago Neighborhoods
  • Distributed Sources and Replication
  • Heterogeneous, not Homogeneous Metadata

18
Challenges to Curation, Preservation, and Access in
the Future
  • Massively complex data means:
      • No single repository
      • No single harmonization scheme
      • Greater risk of disclosure and confidentiality
        breach
  • Data Aren't All Digital or Quantitative
      • Qualitative, including Video and Audio
      • Biological Samples
      • Bibliography
  • Living Studies that change frequently

19
Curation for the Future
  • Technology that allows disparate data and
    researchers to talk, even when they don't know
    the same language
  • Ability to build new dynamic communities around
    new groupings of data and questions
  • Engage technologists and domain scientists to
    build new systems together

gutmann_at_umich.edu