Title: Curating Social Science Data: The View from ICPSR
1. Curating Social Science Data: The View from ICPSR
- Myron Gutmann
- Digital Curation Conference
- December 12, 2007
2. To Start...
- Curation is important, probably the most important thing we do
- Curation is the process that prepares data for use, for re-use, and for long-term preservation
- I'm going to concentrate on preparation, and how it supports use, re-use, and preservation, rather than on those activities themselves
- Standards are critical!
- What are we doing with social science data at ICPSR?
- At the end, I'm going to talk a bit about the future, and what I think that means for data curation
3. ICPSR in a Minute
- Membership organization at the University of Michigan
- Founded 1962
- 6,000 studies; 500,000 data files
- Social science and more, with an increasing level of confidential data
- Funds from 600 members, government agencies and sponsors, and research grants
- Part of a global network of social science data archives, with formal organizations in the U.S. and Europe
4. Data in Quantitative Social Science
- Since the 1940s
- Surveys and polls: large and small, private and public
- Administrative microdata and tabulated data (Census and other official statistical agencies)
- Data innovations from the 1960s and 1970s
- Longitudinal data from surveys
- Longitudinal data from linked historical sources
- Magnetic tape storage eased sharing and preservation
5. Dramatic Changes Since the 1990s
- Most recently: transactional data from cell phones, on-line games, blogs, EZ-Tag, etc.
- Retrospectively harmonized data
- IPUMS (Census data: U.S. and international)
- Surveys, including NHIS, CPES, Fertility...
- Expanding the science to include
- Biomedical and environmental data
- Bibliography as data
- Web-based enclave data access
Onnela in PNAS, May 2007
6. Data Preservation and Sharing: Social Science Practices
- Longer and better sharing than in many sciences
- Heterogeneous tradition
- Strong in some disciplines, weak in others
- Function of use of large datasets in research and graduate instruction
- Supported by NSF, NIH, NIJ policies, Census
- DDI Metadata Standard
- Strong Curation Practices
7. Minding our Knitting (or what I tell young data archivists)
- Curation matters, to ensure reuse and document provenance
- Quality Control is essential
- Our scientific authority is critical, as established by a large number of reports that mention ICPSR (and other major data archives) as key infrastructure
- We must protect the human subjects of research who are the basis of what we do
- Social science and technology are constantly changing
8. Curation is Easy, Isn't it?
- Identification, appraisal, and acquisition are the first part of the curation process
- Curation process first invented in the 1960s; clearly in need of reinvention
- Is curation manual or not? ICPSR employs a mix of manual and automated processes, with ever more automation and adherence to standards
9. Knowing the Business Process Means Knowing how to Get the Work Done
10. Metadata Matters: Critical Curation Needs
- Data Management: Variable-level Metadata
- Validity
- Confidentiality
- Usability
- Data Discovery and Authority: Study-level Metadata
- Discovery Planning
- Provenance
- Versioning
- Preservation Planning: Preservation Metadata
11. Validity and Confidentiality
- Critical task: are the data valid and respondents anonymous?
- Validity
- Compare producer's documentation with data
- Identify questionable data
- Discuss correction with producer
- Document any changes
- Confidentiality
- Disclosure risk assessment and reduction
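The two checks on this slide can be illustrated with a short Python sketch. It is not ICPSR's actual procedure. The codebook layout, the variable names, the sample records, and the threshold `k` are all hypothetical, and the disclosure check is a simple k-anonymity-style count used only to show the idea of flagging rare combinations of identifying values.

```python
from collections import Counter

# Hypothetical codebook: variable name -> set of documented codes.
CODEBOOK = {
    "sex": {1, 2},
    "region": {1, 2, 3, 4},
}

# Hypothetical records, as they might arrive from a data producer.
RECORDS = [
    {"sex": 1, "region": 1},
    {"sex": 2, "region": 1},
    {"sex": 1, "region": 1},
    {"sex": 2, "region": 9},  # 9 is not a documented code for "region"
]


def undocumented_values(records, codebook):
    """Validity: list (row index, variable, value) for values absent from the codebook."""
    problems = []
    for i, row in enumerate(records):
        for var, allowed in codebook.items():
            if row.get(var) not in allowed:
                problems.append((i, var, row.get(var)))
    return problems


def risky_combinations(records, quasi_identifiers, k=2):
    """Confidentiality: combinations of identifying values shared by fewer than k records."""
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in records)
    return {combo: n for combo, n in counts.items() if n < k}


if __name__ == "__main__":
    print(undocumented_values(RECORDS, CODEBOOK))
    print(risky_combinations(RECORDS, ["sex", "region"]))
```

Questionable values found this way would still be discussed with the producer and any change documented, as the slide describes; the code only identifies candidates.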
12. Usability
- Critical task: How easily can the data be used?
- Solution 1: Variable-level metadata
- Create complete variable-level metadata (including question text or equivalent)
- Create complete traditional codebooks and on-line documentation
- Solution 2: On-line Delivery and Analysis
- Effective on-line delivery systems since 2001
- On-line analysis systems for many users
- Quality Control with managerial sign-off
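As a rough illustration of Solution 1, the sketch below checks whether a variable's metadata is complete and renders a plain-text codebook entry from it. The field names (`name`, `label`, `question`, `codes`) and the sample variable are assumptions made for this example; they are not taken from DDI or from ICPSR's systems.

```python
# Fields this sketch treats as required for "complete" variable-level metadata.
REQUIRED_FIELDS = ("name", "label", "question", "codes")

# Hypothetical variable-level metadata for one variable.
VARIABLES = [
    {
        "name": "sex",
        "label": "Sex of respondent",
        "question": "What is your sex?",
        "codes": {1: "Male", 2: "Female"},
    },
]


def missing_fields(variable):
    """Return the required fields that are absent or empty for one variable."""
    return [f for f in REQUIRED_FIELDS if not variable.get(f)]


def codebook_entry(variable):
    """Render one variable as a traditional plain-text codebook entry."""
    lines = [
        f"{variable['name']}: {variable['label']}",
        f"  Question: {variable['question']}",
    ]
    for code, meaning in sorted(variable["codes"].items()):
        lines.append(f"  {code} = {meaning}")
    return "\n".join(lines)


if __name__ == "__main__":
    for var in VARIABLES:
        if not missing_fields(var):
            print(codebook_entry(var))
```

Keeping the question text next to the codes is the point of the first sub-bullet: the same record can feed both a printed codebook and on-line documentation.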
13. Discovery: The Problem
14. Discovery: Solutions
- Critical task: How easily can the data be found?
- Process
- Study-level metadata creation
- High quality is critical
- Producer-supplied metadata is not sufficient
- Metadata must be effectively exposed, and therefore standards-based
- Participation in multi-institution portals, including DataPASS (US) and CESSDA (Europe)
15. Provenance, Version, Preservation
- Critical task: Data users need to know what they are getting, to achieve replication, and the archive needs to know preservation history and future needs
- Process must provide
- Provenance metadata (where did it come from?)
- Version metadata (when did it change?)
- Preservation metadata (when should we review and possibly change?) and practice (another story)
- The Challenge: How much detail is required?
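One way to picture the three kinds of metadata is as fields on a single per-study record. The Python sketch below is only an illustration under that assumption. The class names, field names, and sample values are hypothetical, and a real archive would likely record far more detail, which is exactly the challenge the slide raises.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class VersionEntry:
    """Version metadata: one change to a study's files."""
    version: str
    changed_on: date
    note: str


@dataclass
class StudyRecord:
    """Hypothetical per-study record combining the three kinds of metadata."""
    study_id: str
    source: str        # provenance: where did it come from?
    next_review: date  # preservation: when should we review and possibly change?
    versions: list = field(default_factory=list)  # version: when did it change?

    def add_version(self, version, changed_on, note):
        self.versions.append(VersionEntry(version, changed_on, note))

    def latest_version(self):
        """Return the most recently changed version entry, or None if there is none."""
        return max(self.versions, key=lambda v: v.changed_on, default=None)

    def review_due(self, today):
        """True once the preservation review date has been reached."""
        return today >= self.next_review


if __name__ == "__main__":
    record = StudyRecord("study-0001", "deposited by producer", date(2010, 1, 1))
    record.add_version("1.0", date(2007, 1, 15), "initial release")
    record.add_version("1.1", date(2007, 6, 1), "corrected undocumented codes")
    print(record.latest_version().version)
    print(record.review_due(date(2009, 12, 31)))
```

A user citing `study_id` plus the version string can say exactly which files an analysis used, which is what replication needs; the review date serves the archive rather than the user.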
16. Research Curation for the Future
17. Great Science in the Future will Require Massively Complex Data
- Many Science Domains
- Social science, biomedical, environmental, geographical, psychological, engineering, transactional
- Examples
- National Children's Study
- Chicago Neighborhoods
- Distributed Sources and Replication
- Heterogeneous, not Homogeneous Metadata
18. Challenges to Curation, Preservation, and Access in the Future
- Massively Complex data means
- No single repository
- No single harmonization scheme
- Greater risk of disclosure and confidentiality breach
- Data Aren't All Digital or Quantitative
- Qualitative, including Video and Audio
- Biological Samples
- Bibliography
- Living Studies that change frequently
19. Curation for the Future
- Technology that allows disparate data and researchers to talk, even when they don't know the same language
- Ability to build new dynamic communities around new groupings of data and questions
- Engage technologists and domain scientists to build new systems together
gutmann_at_umich.edu