Title: Geoffrey Greenwell, IHSN/PARIS21
1 Development of Microdata Anonymization Tools by the IHSN
- Geoffrey Greenwell, IHSN/PARIS21
- IASSIST Conference
- Tampere, Finland, May 2009
- Olivier Dupriez, World Bank
- Francois Fonteneau, IHSN/PARIS21
- Mark McConaghy, DFID
2 About IHSN
- International Household Survey Network
- A network of international agencies
- Based in Paris at the OECD at PARIS21
- A coordinating mechanism to:
- Improve quality and use of household survey data in developing countries
- Harmonize international recommendations for survey design, data analysis, etc.
- Produce and disseminate international good practices
3 Accelerated Data Program
- Implementing the IHSN tools in the countries
- Technical and financial support to establish national data archives (in > 50 countries)
- Many datasets documented (DDI)
- Improved access to data by researchers, but not yet satisfactory. We can measure demand through the NADA
- Need to anonymize data remains the most frequently expressed concern and obstacle to data access.
- The ADP has provided some guidance, but there is a lack of simple and intuitive tools and guidelines available to ADP countries.
4 ADP/IHSN in the world
5 Setting up Catalogs
6 Focus Nigeria
- Effects of data availability on MDG: halving the population without sustainable access to safe drinking water.
- Providing robust estimates to inform policy makers and sector monitoring.
- Water and Sanitation Sector: workshop with WHO/UNICEF
7 Effects of Data Availability
- Nigeria and the MDG: rural access to improved water source
8 Resistance in the countries
- Nigeria Statistics Law: the Statistical Act of 2007 obliges microdata release after due anonymization. The legal framework exists.
- Willing institution (the NBS in Nigeria)
- However, current anonymization strategies are limited to removal of direct identifiers.
- Other countries are unable to articulate a proper policy for dissemination and tend to use confidentiality as a barrier to mask political resistance or inertia.
- IHSN anonymization tools will be a way to deal with both real ethical concerns and political resistance.
9 Better use of survey data
- Lots of survey data remain under-exploited because they are not accessible to researchers/users
- Obstacles
- Technical
- Psychological
- Financial: support by many sponsors
- Legal
- Ethical
- Political
IHSN data documentation and cataloguing tools and guidelines
IHSN Dissemination Policy Guidelines
Missing piece: SDC tools
10 Anonymize Process
- Direct identifiers, which are variables such as
names, addresses, or identity card numbers. They
permit direct identification of a respondent but
are not needed for statistical or research
purposes, and should thus be removed from the
published dataset.
- Indirect identifiers, which are characteristics
that may be shared by several respondents, and
whose combination could lead to the
re-identification of one of them. For example,
the combination of variables such as district of
residence, age, sex, and profession would be
identifying if only one individual of that
particular sex, age and profession lived in that
particular district. Such variables are needed
for statistical purposes, and should thus not be
removed from the published data files.
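
The district/age/sex/profession example can be sketched as a count of how many respondents share each combination of indirect identifiers. This is a minimal illustration only, not the IHSN plug-in code: the function names, the threshold `k`, and the toy records are all hypothetical. Note that a combination that is unique in the sample is not necessarily unique in the population.

```python
from collections import Counter

def key_counts(records, keys):
    """Count how many records share each combination of indirect identifiers."""
    return Counter(tuple(r[k] for k in keys) for r in records)

def sample_uniques(records, keys, k=2):
    """Return the records whose key combination occurs fewer than k times."""
    counts = key_counts(records, keys)
    return [r for r in records if counts[tuple(r[v] for v in keys)] < k]

# Hypothetical toy data, invented for illustration only.
records = [
    {"district": "A", "age": 34, "sex": "F", "profession": "teacher"},
    {"district": "A", "age": 34, "sex": "F", "profession": "teacher"},
    {"district": "A", "age": 51, "sex": "M", "profession": "doctor"},
    {"district": "B", "age": 34, "sex": "F", "profession": "teacher"},
]
KEYS = ["district", "age", "sex", "profession"]

at_risk = sample_uniques(records, KEYS, k=2)
for r in at_risk:
    print(r)
```

Records flagged this way are candidates for recoding or suppression of one of the indirect identifiers, rather than removal of the variables themselves.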
11 Defining the problem
Once all direct identifiers have been removed,
we can still have a disclosure problem: the
indirect identifiers remain. The IHSN
anonymization tools approach this problem by
building on a great deal of technical work
undertaken by experts in the field. The IHSN
hosted an expert meeting in October 2008 to
present its tools and acknowledges the work done by:
- University of Manchester
- ISTAT (Italian Statistics)
- Cornell University
- ICPSR
12 Developing SDC tools
- Building on existing work
- Not an integrated software package
- A collection of specialized tools for:
- Measuring the risk
- Reducing the risk
- Assessing the information loss
- 12 plug-ins developed in C that interface with SPSS, Stata, or directly with a server (Windows/Linux)
- Need to be thoroughly tested
13 12 Plug-ins
- The µ-Argus risk for weighted sample
- Re-identification rate to individual risk threshold
- Individual risk to household risk
- L-diversity for unweighted data
- SUDA2 DIS-sample data
- Kanon Micro-aggregation
- Local recoding
- Fixed length micro aggregation
- Noise Addition
- PRAM (Post Randomization Method)
- Rank Swapping
- Sampling
- Risk measures (intruder scenarios): What does the intruder know?
- Risk reduction: What does the intruder want?
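
As an illustration of one of the listed risk-reduction methods, post randomization (PRAM) can be sketched as follows. This is a hedged, minimal sketch and not the IHSN plug-in: the function `pram`, the transition matrix `P`, and the toy variable are hypothetical. Each published category is drawn at random from a row of a transition matrix indexed by the original category.

```python
import random

def pram(values, transition, rng):
    """Replace each category by a random draw from its transition row.

    transition[a][b] is the probability that original category a is
    published as category b; each row is expected to sum to 1.
    """
    out = []
    for v in values:
        row = transition[v]
        cats = list(row)
        weights = [row[c] for c in cats]
        out.append(rng.choices(cats, weights=weights, k=1)[0])
    return out

# Hypothetical transition matrix: keep the category with probability 0.9.
P = {
    "urban": {"urban": 0.9, "rural": 0.1},
    "rural": {"urban": 0.1, "rural": 0.9},
}

rng = random.Random(2009)  # fixed seed so the run can be replicated
original = ["urban"] * 6 + ["rural"] * 4
perturbed = pram(original, P, rng)
print(perturbed)
```

Because the transition probabilities are known to the data producer, analysts can in principle correct aggregate estimates for the perturbation, while an intruder cannot be sure that any single published value is the true one.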
14 Measuring Disclosure Risk
Based on CENEX Handbook on Statistical Disclosure
Control Version 1.01
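
The step from individual risk to household risk can be sketched as below. This is a simplified illustration under stated assumptions, not the µ-Argus estimator or the IHSN plug-in: individual risk is approximated naively as one over the sum of sampling weights in a record's key cell, and household risk is taken as the probability that at least one member is re-identified, treating members as independent. All names and data are hypothetical.

```python
from collections import defaultdict

def individual_risk(records, keys, weight="weight"):
    """Naive risk per record: 1 / (sum of sampling weights in its key cell)."""
    totals = defaultdict(float)
    for r in records:
        totals[tuple(r[k] for k in keys)] += r[weight]
    return [1.0 / totals[tuple(r[k] for k in keys)] for r in records]

def household_risk(records, risks, hh="hh_id"):
    """Risk that at least one member is re-identified: 1 - prod(1 - r_i)."""
    safe = defaultdict(lambda: 1.0)
    for r, risk in zip(records, risks):
        safe[r[hh]] *= 1.0 - risk
    return {h: 1.0 - p for h, p in safe.items()}

# Hypothetical toy data, invented for illustration only.
people = [
    {"hh_id": 1, "district": "A", "sex": "F", "weight": 2.0},
    {"hh_id": 1, "district": "A", "sex": "M", "weight": 4.0},
    {"hh_id": 2, "district": "A", "sex": "F", "weight": 2.0},
]

risks = individual_risk(people, ["district", "sex"])
hh_risks = household_risk(people, risks)
print(risks)
print(hh_risks)
```

Household risk is never lower than the highest individual risk in the household, which is why hierarchical (household) files usually need stronger protection than person-level files.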
15 Reducing disclosure risk
Based on CENEX Handbook on Statistical Disclosure
Control Version 1.01
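
One risk-reduction method for continuous variables, fixed-size micro-aggregation, can be sketched as follows. This is a minimal univariate sketch, not the IHSN plug-in: the function name, the group size `k`, and the income values are hypothetical. Values are sorted, cut into groups of `k`, and each value is replaced by its group mean.

```python
def microaggregate(values, k=3):
    """Replace each value by the mean of its group of k rank-adjacent values."""
    if k < 1:
        raise ValueError("k must be at least 1")
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    n = len(order)
    start = 0
    while start < n:
        end = start + k
        # Fold a short tail into the current group so no group has fewer than k members.
        if n - end < k:
            end = n
        group = order[start:end]
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            out[i] = mean
        start = end
    return out

# Hypothetical toy incomes, invented for illustration only.
incomes = [120, 95, 400, 110, 105, 380, 90]
result = microaggregate(incomes, k=3)
print(result)
```

Each published value is then shared by at least `k` records, and the overall total of the variable is preserved.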
16 Measuring Information Loss
Based on CENEX Handbook on Statistical Disclosure
Control Version 1.01
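
A simple way to compare a variable before and after anonymization can be sketched as below. These three measures are illustrative assumptions only; they are not necessarily the measures used in the handbook or in the IHSN tools, and the names and data are hypothetical.

```python
from statistics import mean, pstdev

def information_loss(original, anonymized):
    """Compare a numeric variable before and after anonymization."""
    if len(original) != len(anonymized):
        raise ValueError("both series must have the same length")
    n = len(original)
    return {
        # Shift in the mean of the variable.
        "mean_shift": abs(mean(anonymized) - mean(original)),
        # Ratio of standard deviations (1.0 means the spread is preserved);
        # assumes the original variable is not constant.
        "sd_ratio": pstdev(anonymized) / pstdev(original),
        # Average absolute change per record.
        "mean_abs_change": sum(abs(a - o) for o, a in zip(original, anonymized)) / n,
    }

# Hypothetical toy data: the second series is the first micro-aggregated in pairs.
before = [10.0, 20.0, 30.0, 40.0]
after = [15.0, 15.0, 35.0, 35.0]

loss = information_loss(before, after)
print(loss)
```

In this toy case the mean is unchanged while the spread shrinks, which is the typical trade-off: lower disclosure risk is bought with some loss of variability.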
17 Developing SDC tools: Proposal
- In Stata (SPSS, SAS) using C plug-ins
- Stata version 9 or later
- Log file for easy replication of procedure
- Informative output
- Or command-line (plug-ins with data server)
- Why Stata (SPSS/SAS)?
- Because most countries use and know these packages
- Can use all tabulation and analysis functions
18 Beta Interface
19 Target use
- Large, imperfect datasets in under-resourced countries
- For use by official data producers in developing countries (IHSN objective)
- Relevant for other users as well
- Free to all; public source code
20 Work Program for 2009
- Testing, calibrating and documenting
- Cornell, IHSN, selected countries
- Development/implementation of training and TA program
- Detailed documentation and guidelines
- Reference manual and training materials
- Possibly launched before end of the year (IHSN website)
- Participation of others welcome
21 IHSN
- Adding to the tools to facilitate data access in developing countries
- Tools
- Metadata Editor
- CDROM/HTML developer
- Web Based National Data Archives
- Question Bank
- Guidelines
- Data Dissemination
- Documentation Guide
- Survey Quality Assessment Framework
22 The End
Thank you.