Title: Creating Something from Nothing: Synthetic and Dummy files
1 Creating Something from Nothing: Synthetic and Dummy files
- Bo Wandschneider
- University of Guelph
- Chuck Humphrey
- University of Alberta
DLI Training Ottawa, May, 2003
2 Outline
- Types of data files
- Implications for analysis
- Where do we get access?
- Which file is appropriate?
- Providing service with synthetic files
- NPHS: an exercise
- SLID: an exercise
3 Types of Data Files
- Microdata
- Confidential Microdata Products
- Master Files
- Share Files
- Public Access Microdata Products
- Public Use Anonymized Microdata (PUMFs)
- Synthetic Files
4 Microdata Products
- Microdata
- raw data organized in a file where the records
(lines) are observations of a specific unit of
analysis and the entries on each line are the
values of variables
- requires some form of processing or analysis to
be used
5 Microdata Products
- Microdata - SCF Example
- 00001103100002560700000002560700033700000000
0000000000000000000000000000000000000000000
00000000000000000000002594400648101946323310
000000000909222012000000000002220232111000000000
0000003000000000000000002228233411412190638749500
575211004600132 000021031000000000000000000000
0000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000
000000000016630000000000608244322000000000006320
000000000000000000000000000000000000000311612111
1435481500777500570033004300110
00003103100000000000000000000000000000000000
0000000000000000000000000000000000000000000
00000000000000000000000000000000000000016630
000000000405211122000000000004320206261110000636
0000003000000000000000002228213411436491600778500
570033004200085 000041031000002080000000002080
0000000005750005220000000000000025740000000
0000000367100314900052200000000000000575100
000000575145510000000000608244322000000000005320
220101021000575000522300000000000000000224022341
1431251000774500571622361600065
00005103100001805000000001805000000000028800
0261000000000000000000000000000000000549000
28800026100000000117901977800246301731522210
000000000505222012000000000004320000001011000288
0002611000000000000000001246123411411440748739500
575011021600046 000061031000001500000000001500
0000000000000000000000000000000000000000000
0000000000000000000000000000000000000150000
000000150025510000000001010245012000000000006310
000000000000000000000000000000000000000312326341
1431071300773500571612004300094
00007103100000000000000000000000000000000000
0000002540000000000000000000000000002540002
54000000000000000000000254000000000254041520
000000000103402012000000000002220121134000000000
0000003000000000000000002269233411436491600778500
570033004200041 000081031000008400000000008400
0000000000000000000000000000000000000000000
0000000000000000000000000000000000000840000
085800754225510000000000808233012000000000003320
000000000000000000000000000000000000000311813341
1411210848739500575211004600055
00009103100002600000000002600000000000028700
0156000000000000000879000000000000001322001
16600015600000000000002732200433502298722310
000000000708234222000000000006420000001012000287
0001561000000000000000001248113411431400300774500
564512071600060 000101031000000000000000000000
0001570000000000000050430000000000000000000
0000000504300254100250200000000000000520000
000000520046520000000000406223122000000000006420
000000000000000000000200000000000000000437621341
1436491600778500570033004400076
00011103100000000000000000000000000000000000
0000000000000000000000000000000000000000000
00000000000000000000000000000000000000016630
000000000203412131000000000004620000000000000000
0000000000000000000000003119213411435481500777500
570033004500040 000121031000000991000000000991
0000000000000000000000000000000000000000000
0000000000000000000000000000000000000099100
000000099125510000000000203433221000000000004330
000000000000000000000000000000000000000311712131
1432231400773500571222004300244
00013103100002771600000002771600000000028800
0000000000000000000000000000000000000288000
28800000000000000000002800400624302176122210
000000000707222012000000000003310034071100000288
0000001000000000000000001226163411411431138739500
575211004600156 000141031000010000000000010000
0000000006000000000000000000000000000000000
0000000060000060000000000000000000001060000
068600991423310000000000404222012000000000004330
077001011000600000522100000000000000000126012341
1411440636719500573012221600148
00015103100000075000000000075000000000000000
0370000000000000000000000000000000000370000
00000037000000000000000112000000000112025510
000000000808233132000000000006330323511032001126
0003703000000000000000002245223411411261318529500
575222004600132 000161031000007012000000007012
0001650000000000000000000000000030820000000
0000000308200308200000000000000000001025900
135600890325410000000000708244322000000000005310
000000000000000000000000000000000000000311812341
1421320320439500573522171600111
00017103100000202700000000202700000000000000
0000000000000000000000000000000000
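A minimal sketch of the processing that a raw file like the one above needs before it can be analysed: each variable occupies fixed columns, and a layout (normally supplied by the codebook or the SPSS/SAS command file) says where each variable starts and ends. The layout, the variable names, and the toy records below are hypothetical and are not the SCF record layout.

```python
# Hypothetical layout: variable name -> (start, end) columns,
# 0-based with the end column exclusive. A real layout comes from
# the survey's codebook or command file.
LAYOUT = {
    "case_id": (0, 5),
    "province": (5, 7),
    "age": (7, 9),
    "income": (9, 16),
}


def parse_record(line, layout=LAYOUT):
    """Slice one fixed-width line into named integer variables."""
    return {name: int(line[start:end]) for name, (start, end) in layout.items()}


# Toy records invented for illustration (not taken from any survey file).
raw_lines = [
    "0000135420025607",
    "0000248370000000",
]

records = [parse_record(line) for line in raw_lines]
for rec in records:
    print(rec)
```

SPSS and SAS command files do the same job with their own column specifications; the point is only that the digits mean nothing until a layout is applied.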
6 Confidential Microdata
- Master Files
- These files contain the fullness of detail
captured about the unit of observation. The
information in these files can identify the
individual who provided the original information,
so the files are considered confidential.
7 Confidential Microdata
8 Confidential Microdata
- Master File - Personal identifiers
9 Confidential Microdata
- Master File - Geography (SLID)
10 Confidential Microdata
- Master File - Fullness of Data (NPHS)
11 Confidential Microdata
- Master File - Fullness of Data
12 Confidential Microdata
- Master File - Fullness of Data (SLID)
13 Confidential Microdata
- Master File - Fullness of Data
14 Confidential Microdata
- Share Files
- These are confidential files in which the
respondents have signed a consent form permitting
Statistics Canada to allow access to their
information for approved research.
- Used with NPHS and NLSCY
15 Public Access Microdata
- Anonymized Microdata
- These microdata are specially prepared to
minimize the possibility of disclosing or
identifying any of the cases or observations
- The original data from the master file are
edited to create a public use microdata file
16 Public Access Microdata
- Steps in Anonymizing Microdata
- remove all personal identification
information (names, addresses, etc.)
- include only gross levels of geography
- collapse detailed information into a smaller
number of general categories
- suppress the values of a variable
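The anonymizing steps listed above can be sketched on a single record. This is an illustration only: the field names, age bands, and choice of suppressed variable are assumptions made for the example, not Statistics Canada's actual rules.

```python
# Hypothetical master-file record; every field name and value is invented.
master_record = {
    "name": "Jane Doe",
    "address": "12 Main St",
    "postal_code": "A1A 1A1",
    "province": "Ontario",
    "age": 47,
    "occupation_code": 4121,
}

IDENTIFIERS = {"name", "address"}      # step 1: removed outright
SUPPRESSED = {"occupation_code"}       # step 4: values suppressed
AGE_BANDS = [(0, 24, "under 25"), (25, 44, "25-44"), (45, 64, "45-64")]


def age_group(age):
    """Step 3: collapse single years of age into broad categories."""
    for low, high, label in AGE_BANDS:
        if low <= age <= high:
            return label
    return "65 and over"


def anonymize(record):
    """Apply the four steps to one record and return a new dict."""
    public = {k: v for k, v in record.items() if k not in IDENTIFIERS}
    public.pop("postal_code", None)    # step 2: keep only gross geography
    public["age_group"] = age_group(public.pop("age"))
    for name in SUPPRESSED:
        public[name] = None            # suppressed values become missing
    return public


print(anonymize(master_record))
```

The original record is left unchanged; the public version keeps the province and an age group but loses everything that could point to one person.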
17 Public Access Microdata
- Statistics Canada PUMFs
- only available for select social surveys that
undergo a review by the Data Release Committee,
an internal Statistics Canada committee
- no enterprise public use microdata
18 Public Access Microdata
- Statistics Canada PUMFs
- almost all are cross-sectional, that is,
represent data collected at one point in time
- longitudinal data are difficult to anonymize
while maintaining any useful information
19 Public Access Microdata
- PUMFs - Personal identifiers
20 Public Access Microdata
21 Public Access Microdata
22 Public Access Microdata
23 Public Access Microdata
- Synthetic Files
- These microdata do not contain real cases. They
contain pseudo-cases that give aggregate results
close to those from the real cases
24 Public Access Microdata
- Synthetic Files
- They are prepared so that analysis runs for the
master file can be developed without any risk of
disclosing or identifying actual cases
25 Public Access Microdata
- Synthetic Files
- The results are not to be reported; they are
strictly to be used to prepare analyses of master
files
- Usually associated with longitudinal files
26 Public Access Microdata
- Steps in creating Synthetic Files
- Observations are transformed
- No records actually exist
- Keep fullness of detail
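The slides do not describe how the observations are transformed, so the following is only a toy illustration of the idea, not Statistics Canada's method. It assumes one very simple transformation: each variable is shuffled independently across cases. Every variable keeps its full detail and its one-variable totals, but values are no longer tied to the case they came from. All data are invented.

```python
import random


def shuffle_columns(records, seed=0):
    """Return pseudo-records built by permuting each variable independently.

    Each variable keeps exactly the same set of values, so one-variable
    totals and frequencies are unchanged, but relationships between
    variables are broken.
    """
    rng = random.Random(seed)
    names = list(records[0])
    columns = {}
    for name in names:
        values = [rec[name] for rec in records]
        rng.shuffle(values)
        columns[name] = values
    return [{name: columns[name][i] for name in names}
            for i in range(len(records))]


# Toy "real" cases, invented for illustration.
real = [
    {"age": 23, "income": 18000, "province": "ON"},
    {"age": 41, "income": 52000, "province": "AB"},
    {"age": 67, "income": 31000, "province": "QC"},
    {"age": 35, "income": 44000, "province": "ON"},
]

synthetic = shuffle_columns(real)
print(sum(r["income"] for r in real), sum(r["income"] for r in synthetic))
```

A shuffled row can still coincide with a real one by chance, and cross-variable results from such a file are meaningless, which is one way to see why synthetic results are for testing setups and not for reporting.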
27 Public Access Microdata
- Synthetic Files - NPHS example
28 Public Access Microdata
- Synthetic Files - NPHS 1999 general file
29 Public Access Microdata
- Synthetic Files - NPHS 1999
30 Public Access Microdata
- Synthetic Files - NPHS 1999
31 Implications for Analysis
- What are the implications of doing analysis with
these different types of microdata files?
32 Implications for Analysis
- Master File
- All observations
- Has the most variables with the most detail
- Lots of geography and personal characteristics
- Little grouping or capping of categories
33 Implications for Analysis
- Master File
- Restricted access: only available to authorized
Statistics Canada employees, including deemed
employees
34 Implications for Analysis
- Master File
- Includes linkage variables across files within a
study, e.g., NLSCY linkage among the files for
different units of analysis (kids, parents,
teachers)
35 Implications for Analysis
- Public Use Microdata (PUMF)
- Suppressed observations
- Suppressed variables removed from the file
- Suppressed content
- Gross geography
- Collapsed categories
- Capped values
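One of these implications can be shown with a few lines of arithmetic. When values are capped, any statistic that depends on the upper tail changes. The incomes and the cap below are invented for illustration and do not come from any survey.

```python
from statistics import mean

# Invented incomes; the last case is an extreme value.
master_incomes = [28000, 35000, 41000, 52000, 400000]

CAP = 100000  # hypothetical top-code applied when the public file is built

# Every value above the cap is replaced by the cap itself.
pumf_incomes = [min(value, CAP) for value in master_incomes]

print("master mean:", mean(master_incomes))
print("PUMF mean:  ", mean(pumf_incomes))
```

The capped mean is lower than the master-file mean because the extreme case has been pulled down to the cap; the median is unaffected here. Collapsed categories and gross geography have analogous effects on what can be estimated.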
36 Implications for Analysis
- Public Use Microdata (PUMF)
- Licensed product: users agree to certain terms of
use
- No linkage to multiple units of analysis, with a
few exceptions (GSS Time Use and Family)
37 Implications for Analysis
- Synthetic Files
- Looks like a duck and quacks like a duck, but
it isn't a duck or any other type of fowl.
38 Implications for Analysis
- Synthetic Files
- Looks like master files
- Lots of observations
- Lots of variables
- Little grouping or capping of categories
- Lots of geographic detail
39 Synthetic Files
- Precautions
- Results not authentic but close in the
aggregate
- Use for testing analysis setups only
- Still need the master files for publishable
results
40 Where do we get Access?
- Master File
- Restricted access governed under the Statistics
Act
- Remote Job Submission
- Research Data Centres
- Apply to SSHRC for peer review of the proposal
and to STC for security clearance
41 Where do we get Access?
- Public Use Microdata Files (PUMF)
- Get from DLI
- Analyze wherever is convenient
- Can use a variety of analysis software, including
SAS, SPSS, Stata, HLM, LISREL, etc.
- SLIDRET sans data
42 Where do we get Access?
- Synthetic Files
- Author Divisions may create them
- Most relevant for new panel data, but not
exclusively; e.g., the Census has potential
- NPHS synthetic files on DLI FTP site
43 Where do we get Access?
- Synthetic files
- SLID, WES, YITS coming?
- Do we need to encourage them?
- Work with them locally
- Build SAS and SPSS setups
44 Which File is Appropriate?
- 1st stop is still the PUMF
- This file has the easiest access for us
- Probably meets the needs of most clients
- Not as administratively burdensome as synthetic
or master file
- Perfect for clients just looking for data, e.g.,
for courses in quantitative analysis
45 Which File is Appropriate?
- If more detail is needed, refer to the Master
File Documentation (similar to Synthetic File
Documentation)
- Make them aware that the cost of use is higher,
both in terms of accessibility and analytical
requirements
- Interest most likely to come from grad students
and experienced researchers
46 Which File is Appropriate?
- Download the synthetic files from DLI
- Make them aware of the problems with synthetic
files: RESULTS ARE NOT PUBLISHABLE
- Encourage them to submit an application for RDC
access, since there is a time lag
47 Which File is Appropriate?
48 Which File is Appropriate?
- Some of you may work with a client using synthetic
files before passing her/him off to an RDC
49 Services for Synthetic Files
- DLI Contacts can provide four basic services with
synthetic files.
- Build SPSS and SAS system files from the raw
synthetic data files that are distributed through
DLI
- Provide information about the use of Remote Job
Submission (a.k.a. Remote Access) and RDCs
50 Services for Synthetic Files
- Assist with finding variables in the synthetic
files
- Provide instruction about ways of capturing SPSS
or SAS code from dummy analysis runs with the
synthetic files. It is this code that is then
submitted to STC through remote job submission.
51 Services for Synthetic Files
- 1. Building SPSS and SAS system files for
synthetic data
- The NPHS synthetic data are distributed as a raw
ASCII file with accompanying command files for
SPSS and SAS
- Separate synthetic data files exist for the
master file setup and for bootstrapping analysis
52 Services for Synthetic Files
- 1. Building SPSS and SAS system files for
synthetic data
- The synthetic data for the 2000-2001 NPHS have
4,138 variables and 17,276 fabricated cases.
Creating the SPSS and SAS system files from this
file is not difficult, but it does take time.
DLI Contacts may wish to create these products
for their patrons.
53 Services for Synthetic Files
- 2. Information about Remote Job Submission (RJS)
- The author divisions supporting RJS have
established their own guidelines and have
different operating procedures. Not all
divisions supporting longitudinal surveys
currently support RJS.
- Therefore, there is a need to track down this
information for our patrons.
54 Services for Synthetic Files
- 2. Information about Remote Job Submission (RJS)
- For example, the sources for information about
RJS include the Centre for Education Statistics
- http://www.statcan.ca/english/edu/rda/index.htm
56 Services for Synthetic Files
- 2. Information about Remote Job Submission (RJS)
- Where do you find this information?
- Ask the DLI Team via the DLI List
- The EAC has asked for a description of RJS on the
DLI website, which should be on the DLI Team's
to-do list
57 Services for Synthetic Files
- 2. Information about Research Data Centres
- The collection of master files available through
RDCs is listed on the STC website for RDCs
- Each RDC has its own website describing its
services
- http://www.statcan.ca/english/rdc/index.htm
59 Services for Synthetic Files
- 3. Data Reference for the content of the
synthetic files
- Helping researchers identify variables over
longitudinal files is an important service
- Need to keep the unit of analysis straight
- Need to understand the mnemonic naming convention
for variables over cycles
- Develop indexing aids for you and your patrons
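An indexing aid of the kind mentioned above could be as simple as a table that lines up the same concept across cycles. The sketch below assumes a hypothetical naming convention in which one character of each variable name encodes the cycle; the position of that character and all of the variable names are invented, so check the survey's own documentation for the real convention.

```python
from collections import defaultdict

# Hypothetical convention: the 4th character of a variable name encodes
# the cycle and the remaining characters identify the concept.
CYCLE_POSITION = 3  # 0-based index of the cycle character


def concept_key(name):
    """Variable name with the cycle character masked out."""
    return name[:CYCLE_POSITION] + "_" + name[CYCLE_POSITION + 1:]


def build_index(variable_names):
    """Map each concept to {cycle character: variable name}."""
    index = defaultdict(dict)
    for name in variable_names:
        index[concept_key(name)][name[CYCLE_POSITION]] = name
    return dict(index)


# Invented variable names, not taken from any codebook.
variables = ["XXA4HGT", "XXA6HGT", "XXA8HGT", "XXB4SMK", "XXB8SMK"]

index = build_index(variables)
for concept, by_cycle in sorted(index.items()):
    print(concept, by_cycle)
```

An index like this also makes gaps visible: in the toy list, the second concept has no variable for cycle "6", which is the kind of thing a patron needs to know before writing code for a remote job.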
60 Services for Synthetic Files
- 4. Provide helpful tips for preserving the code
from dummy analysis runs in SPSS and SAS
- Researchers will run analyses on the synthetic
file to generate the code that they will
subsequently email for Remote Job Submission
- Providing information about how to do this easily
will be helpful to your patrons
61 An Example Using the NPHS
- Let's look at an example of these four services
using the synthetic files from the NPHS,
2000-2001.
62 An Example Using SLID
- Let's look at an example of a dummy file using
SLIDRET, a retrieval system developed to extract
data from the cycles of the SLID. A data-less
version of SLIDRET is available through DLI to
help identify variables for RJS.
63 Location of Slides and Exercises
- http://drc.uoguelph.ca/DATA/WKSHPS/IASSIST2003