Issues in Designing a Confidentiality Preserving Model Server by Philip M Steel

Slides: 31
Provided by: Statistica77
Learn more at: https://unece.org
Transcript and Presenter's Notes


1
Issues in Designing a Confidentiality Preserving
Model Server by Philip M Steel and Arnold Reznek
2
Talk Outline
  • Background
  • Basic design
  • Description of operation
  • Confidentiality outline
  • Constraints on universe formation
  • Other constraints
  • Summary

3
Background
  • PUBLIC remote access to confidential data
  • Restriction of queries and responses rather than
    registering and monitoring the user
  • Current Population Survey (CPS), employment and
    economic well-being demographic supplement
  • Software development by Synectics
  • HTML, MySQL and PHP to develop the query; SAS as
    the statistical package run against the data

4
Risk Model for Microdata
  • Intruder has access to record linkage software
    and identified data sources
  • Disclosure occurs if the intruder is successful
    in linking his identified data with the published
    microdata

5
Risk Model for a Model Server
  • Intruder has access to record linkage software
    and identified data sources
  • Intruder uses model server to reconstruct
    microdata for both the variables overlapping his
    data sources and a sensitive variable
  • Disclosure occurs if the intruder is successful
    in linking his identified data with the
    reconstructed microdata and has valid estimate of
    a sensitive characteristic or value

6
Basic Design Choice
  • Enable: choose which functions will operate
      • Must construct a friendly interface
      • Limited to the procedures developed
      • Safe from unknown code
  • Disable: choose which functions will not operate
      • User free to program within disabling constraints
      • No limit on complexity
      • Must be monitored (human, program or mix)
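
The contrast between the two design choices can be sketched in a few lines. This is only an illustration: the procedure names and both function names are hypothetical and do not come from the presentation.

```python
# Hypothetical procedure names, chosen only for illustration.
ENABLED = {"regression", "crosstab"}         # "enable" design: allowlist
DISABLED = {"print_records", "export_data"}  # "disable" design: denylist

def allowed_under_enable(procedure: str) -> bool:
    """Only procedures that were explicitly developed may run."""
    return procedure in ENABLED

def allowed_under_disable(procedure: str) -> bool:
    """Anything may run unless it was explicitly switched off."""
    return procedure not in DISABLED

unknown = "user_written_macro"
print(allowed_under_enable(unknown))   # False: safe from unknown code
print(allowed_under_disable(unknown))  # True: hence the need for monitoring
```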

7
Operation
  • User visits web site, chooses data set, explores
    data, chooses geography, analysis type
  • User chooses population, constructs model,
    selects output
  • Web site constructs code to send behind firewall
  • Code checked and run against data at Census
  • Results checked and returned to user

8-15
(No Transcript)
16
Structure of Confidentiality Rules
  • Data preparation
  • Data exploration
  • Model universe formation
  • Model Statement
  • Model Output

17
Data exploration rules
  • Users may request tables for categorical
    variables and numeric recodes up to e1
    dimensions. (start: e1 = 4, including geo)
  • Users may transform numeric recodes using a
    limited set of functions: log, root, square.
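
A minimal sketch of how these two exploration rules might be checked, assuming the starting value on the slide reads e1 = 4 and that geography counts as one dimension. The function names and the variable names in the usage lines are hypothetical.

```python
import math

E1 = 4  # assumed starting value for e1; geography counts as a dimension

# The three transformations named on the slide.
TRANSFORMS = {
    "log": math.log,
    "root": math.sqrt,
    "square": lambda x: x * x,
}

def table_request_ok(variables, geography=None, e1=E1):
    """A table may have at most e1 dimensions, geography included."""
    dims = len(set(variables)) + (1 if geography else 0)
    return 1 <= dims <= e1

def transform(name, value):
    """Apply a transformation only if it is on the permitted list."""
    if name not in TRANSFORMS:
        raise ValueError(f"transformation {name!r} is not permitted")
    return TRANSFORMS[name](value)

print(table_request_ok(["age", "sex", "race"], geography="state"))        # True
print(table_request_ok(["age", "sex", "race", "educ"], geography="state"))  # False
print(transform("square", 3))                                             # 9
```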

18
Universe formation: Categorical variables
  • Example: Hispanic heads of household with a
    college degree.
  • Conditions: X1 = H, X2 = 1, X3 = 5 (table cell)
  • Implication: Data preparation must support safe
    lower dimensional tables

19
Universe formation rules: Categorical variables
  • Limit on the number of categorical variables
    (u1 = 3)
  • Minimum on the size of the universe selected
    (u2 = 75)
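
The two categorical rules could be enforced roughly as follows, reading the slide's parameters as u1 = 3 and u2 = 75. This is a sketch under those assumptions; the function name, record layout, variable names and codes are all hypothetical.

```python
U1 = 3   # assumed limit on the number of categorical conditions
U2 = 75  # assumed minimum size of the selected universe

def form_universe(records, conditions, u1=U1, u2=U2):
    """Select records matching every condition, subject to both limits."""
    if len(conditions) > u1:
        raise ValueError(f"at most {u1} categorical conditions are allowed")
    universe = [r for r in records
                if all(r.get(var) == val for var, val in conditions.items())]
    if len(universe) < u2:
        raise ValueError(f"universe has fewer than {u2} records")
    return universe

# Made-up records, not CPS data.
records = ([{"hispanic": "H", "head_of_household": 1, "education": 5}] * 80
           + [{"hispanic": "N", "head_of_household": 1, "education": 5}] * 120)
universe = form_universe(
    records, {"hispanic": "H", "head_of_household": 1, "education": 5})
print(len(universe))  # 80
```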

20
Universe formation: Numeric variables
  • Example: Families in poverty
  • Condition: Family income < 18,500. Or Family
    income < 18,501?
  • Implication: Rounding or pre-assigned cutpoints.
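
To see why the two nearly identical conditions matter, consider the sketch below, which reads them as "income < 18,500" and "income < 18,501". The incomes, the cutpoint list and the function names are made up for illustration.

```python
import bisect

# Made-up family incomes, not CPS data.
incomes = [9_800, 14_200, 18_500, 21_000, 40_000]

def universe(threshold):
    """Families with income strictly below a user-chosen threshold."""
    return [x for x in incomes if x < threshold]

# Two almost identical universes differ by exactly one family, so
# comparing results from both isolates that family's record.
isolated = [x for x in universe(18_501) if x not in universe(18_500)]
print(isolated)  # [18500]

# Hypothetical pre-assigned cutpoints: both requests are snapped down
# to the same cutpoint, so the two universes become indistinguishable.
CUTPOINTS = [10_000, 20_000, 30_000]

def snap(threshold):
    i = bisect.bisect_right(CUTPOINTS, threshold) - 1
    return CUTPOINTS[max(i, 0)]

print(snap(18_500), snap(18_501))  # 10000 10000
```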

21
Universe formation rules: Numeric variables
  • Users will select categorical variables first
  • Numeric variables can be used only at
    pre-assigned cutpoints.
  • The number of observations in the whole CPS
    universe between cutpoints shall be at least u3
    for every numeric variable. (start: u3 = 80)

22
Universe formation rules (cont)
  • If a cutpoint is used in universe formation then
    the difference in the size of the model universe
    obtained by incrementing the cutpoint up or down
    cannot be less than u4. (start: u4 = 4)
  • The universe for the model must have at least u2
    observations. (start: u2 = 75)
  • There will be no cutpoints above the 97th
    percentile of nonzero points or the last half
    percentile of all points.
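
A rough sketch of the two cutpoint rules, reading the starting values as u3 = 80 and u4 = 4. The slides do not say how cutpoints are placed, so the greedy placement, the "value <= cutpoint" convention and the function names are all assumptions; the percentile cap is left out.

```python
U3 = 80  # assumed starting value: observations required between cutpoints
U4 = 4   # assumed starting value: minimum change in universe size

def choose_cutpoints(values, u3=U3):
    """Greedy pass over the sorted data: place a cutpoint once at least
    u3 observations have accumulated, never splitting tied values, and
    leave at least u3 observations above the last cutpoint."""
    ordered = sorted(values)
    cutpoints, count = [], 0
    for i, v in enumerate(ordered):
        count += 1
        last_of_tie = i + 1 == len(ordered) or ordered[i + 1] != v
        remaining = len(ordered) - (i + 1)
        if count >= u3 and last_of_tie and remaining >= u3:
            cutpoints.append(v)
            count = 0
    return cutpoints

def cutpoint_use_ok(universe_values, cutpoints, index, u4=U4):
    """Moving the chosen cutpoint one step up or down must change the
    size of the model universe by at least u4 observations."""
    def size(c):
        return sum(1 for v in universe_values if v <= c)
    here = size(cutpoints[index])
    neighbours = [cutpoints[j] for j in (index - 1, index + 1)
                  if 0 <= j < len(cutpoints)]
    return all(abs(size(c) - here) >= u4 for c in neighbours)

whole = list(range(1000))             # made-up data, not CPS values
cuts = choose_cutpoints(whole, u3=100)
print(len(cuts), cuts[0], cuts[-1])   # 9 99 899
```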

23
Model statement rules
  • At most m1 variables may be used in the model
    statement (start: m1 = 20)
  • Dummy variables must distinguish at least m2
    observations (start: m2 = 20)
  • No interaction term may involve more than 4
    variables. (m3 = 4)
  • No model involving 3 or more variables can be
    fully interacted. (m4 = 3)
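
One way these four rules might be checked against a model statement is sketched below, reading the parameters as m1 = 20, m2 = 20, m3 = 4 and m4 = 3. The representation of a model as a list of terms (each a tuple of variable names) is an assumption, as is the reading of "fully interacted" as "every possible main effect and interaction among the model's variables is present".

```python
from itertools import combinations

M1, M2, M3, M4 = 20, 20, 4, 3  # assumed starting values

def is_fully_interacted(terms):
    """True if every non-empty subset of the variables appears as a term."""
    variables = sorted({v for t in terms for v in t})
    present = {frozenset(t) for t in terms}
    return all(frozenset(c) in present
               for r in range(1, len(variables) + 1)
               for c in combinations(variables, r))

def check_model(terms, dummy_counts=None):
    """Return a list of rule violations; an empty list means no violation.
    dummy_counts maps a dummy variable to the number of observations
    it distinguishes."""
    problems = []
    variables = {v for t in terms for v in t}
    if len(variables) > M1:
        problems.append("too many variables")
    if any(len(set(t)) > M3 for t in terms):
        problems.append("interaction term too large")
    if len(variables) >= M4 and is_fully_interacted(terms):
        problems.append("model is fully interacted")
    for name, n in (dummy_counts or {}).items():
        if n < M2:
            problems.append(f"dummy {name} distinguishes too few observations")
    return problems

# Hypothetical variable names.
print(check_model([("age",), ("sex",), ("educ",), ("age", "sex")]))  # []
```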

24
Model Output
  • Residuals will be based on synthetic data
  • Limit on the number of significant digits?
  • R² cannot be 1?
  • Rules for other diagnostics

25
Synthetic Residuals
  • Users may see synthetic bar charts or
    distributions and synthetic 2-way plots.
  • Synthetic data must be generated from fixed
    random number starts and topcoded (and bottom
    coded where appropriate) at 4 standard deviations
    from the mean. 
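
A minimal sketch of the synthetic-residual rule. The slides do not say how the synthetic values are generated, so drawing from a normal distribution matched to the mean and standard deviation of the real residuals is an assumption, as are the function name and the seed value.

```python
import random
import statistics

def synthetic_residuals(residuals, seed=12345, k=4.0):
    """Draw one synthetic value per real residual from a normal
    distribution with the same mean and standard deviation, using a
    fixed seed, then top- and bottom-code at k standard deviations."""
    mean = statistics.fmean(residuals)
    sd = statistics.pstdev(residuals)
    rng = random.Random(seed)  # fixed start: repeated requests give the same draw
    lo, hi = mean - k * sd, mean + k * sd
    return [min(max(rng.gauss(mean, sd), lo), hi) for _ in residuals]

real = [-2.0, -1.0, 0.0, 1.0, 2.0] * 20  # made-up residuals
first = synthetic_residuals(real)
second = synthetic_residuals(real)
print(first == second)  # True
```

The fixed seed matters because a fresh random draw on every request would let a user average many synthetic outputs and move closer to the real residuals.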

26
Data preparation
  • The topcode for numeric data needs to be
    calculated
  • Cutpoints must be determined
  • Separate lists of variables for exploration,
    universe formation, dependent and independent
    variables, model estimation
  • Standard recodes added
  • Inference from the collection of all 4-way
    categorical tables checked

27
Major Hurdles
  • Implementing facility for dummy variables
  • Presentation of geographic options
  • Implementing synthetic residuals
  • Architecture for differing variable roles

28
Future development
  • Relaxation of top codes
  • Implementation of model variance estimation (NSO
    weighting)
  • Introduction of new dataset
  • Introduction of new statistical procedures
  • Facility to add contextual data or merge files
  • Use of non-sampled data

29
Overview
  • Avoids (as much as possible) tests which accept
    or reject a user's choice.
  • Restricts the dimension of the data access.
  • Has some flexibility in setting system
    confidentiality parameters.
  • Changes the intruder model.
  • Introduces a modification of k-anonymity.

30
  • My thanks to Jerry Reiter, Laura Zayatz and
    Stephen Wenck
  • http://204.52.186.190/
  • Contact: philip.m.steel_at_census.gov