1
Creating Something from Nothing: Synthetic and Dummy Files
  • Bo Wandschneider
  • University of Guelph
  • Chuck Humphrey
  • University of Alberta

DLI Training, Ottawa, May 2003
2
Outline
  • Types of data files
  • Implications for analysis
  • Where do we get access
  • Which file is appropriate
  • Providing service with synthetic files
  • NPHS: an exercise
  • SLID: an exercise

3
Types of Data Files
  • Microdata
  • Confidential Microdata Products
  • Master Files
  • Share Files
  • Public Access Microdata Products
  • Public Use Anonymized Microdata Files (PUMFs)
  • Synthetic Files

4
Microdata Products
  • Microdata
  • raw data organized in a file where the records or
    lines in the file are observations of a specific
    unit of analysis and the information on the lines
    are the values of variables
  • requires some form of processing or analysis to
    be used
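
As a minimal sketch of what "processing" means here, the Python below slices fixed-width lines into named variables. The column layout, variable names, and sample lines are hypothetical and chosen only for illustration; a real file's layout comes from its codebook or accompanying command files.

```python
# Hypothetical layout: (variable name, start column, end column),
# 0-based with the end column exclusive. Not taken from any real codebook.
LAYOUT = [
    ("record_id", 0, 5),
    ("province", 5, 7),
    ("income", 7, 14),
]


def parse_record(line, layout=LAYOUT):
    """Slice one fixed-width line into a dict of integer variable values."""
    return {name: int(line[start:end]) for name, start, end in layout}


# Two made-up raw lines, one observation (unit of analysis) per line.
raw_lines = [
    "00001100025607",
    "00002350000000",
]

records = [parse_record(line) for line in raw_lines]
for record in records:
    print(record)
```

Each line is one observation and each column range is one variable, which is why the raw digits are unreadable until a layout is applied.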

5
Microdata Products
  • Microdata - SCF Example
  • 00001103100002560700000002560700033700000000
    0000000000000000000000000000000000000000000
    00000000000000000000002594400648101946323310
    000000000909222012000000000002220232111000000000
    0000003000000000000000002228233411412190638749500
    575211004600132 000021031000000000000000000000
    0000000000000000000000000000000000000000000
    0000000000000000000000000000000000000000000
    000000000016630000000000608244322000000000006320
    000000000000000000000000000000000000000311612111
    1435481500777500570033004300110
    00003103100000000000000000000000000000000000
    0000000000000000000000000000000000000000000
    00000000000000000000000000000000000000016630
    000000000405211122000000000004320206261110000636
    0000003000000000000000002228213411436491600778500
    570033004200085 000041031000002080000000002080
    0000000005750005220000000000000025740000000
    0000000367100314900052200000000000000575100
    000000575145510000000000608244322000000000005320
    220101021000575000522300000000000000000224022341
    1431251000774500571622361600065
    00005103100001805000000001805000000000028800
    0261000000000000000000000000000000000549000
    28800026100000000117901977800246301731522210
    000000000505222012000000000004320000001011000288
    0002611000000000000000001246123411411440748739500
    575011021600046 000061031000001500000000001500
    0000000000000000000000000000000000000000000
    0000000000000000000000000000000000000150000
    000000150025510000000001010245012000000000006310
    000000000000000000000000000000000000000312326341
    1431071300773500571612004300094
    00007103100000000000000000000000000000000000
    0000002540000000000000000000000000002540002
    54000000000000000000000254000000000254041520
    000000000103402012000000000002220121134000000000
    0000003000000000000000002269233411436491600778500
    570033004200041 000081031000008400000000008400
    0000000000000000000000000000000000000000000
    0000000000000000000000000000000000000840000
    085800754225510000000000808233012000000000003320
    000000000000000000000000000000000000000311813341
    1411210848739500575211004600055
    00009103100002600000000002600000000000028700
    0156000000000000000879000000000000001322001
    16600015600000000000002732200433502298722310
    000000000708234222000000000006420000001012000287
    0001561000000000000000001248113411431400300774500
    564512071600060 000101031000000000000000000000
    0001570000000000000050430000000000000000000
    0000000504300254100250200000000000000520000
    000000520046520000000000406223122000000000006420
    000000000000000000000200000000000000000437621341
    1436491600778500570033004400076
    00011103100000000000000000000000000000000000
    0000000000000000000000000000000000000000000
    00000000000000000000000000000000000000016630
    000000000203412131000000000004620000000000000000
    0000000000000000000000003119213411435481500777500
    570033004500040 000121031000000991000000000991
    0000000000000000000000000000000000000000000
    0000000000000000000000000000000000000099100
    000000099125510000000000203433221000000000004330
    000000000000000000000000000000000000000311712131
    1432231400773500571222004300244
    00013103100002771600000002771600000000028800
    0000000000000000000000000000000000000288000
    28800000000000000000002800400624302176122210
    000000000707222012000000000003310034071100000288
    0000001000000000000000001226163411411431138739500
    575211004600156 000141031000010000000000010000
    0000000006000000000000000000000000000000000
    0000000060000060000000000000000000001060000
    068600991423310000000000404222012000000000004330
    077001011000600000522100000000000000000126012341
    1411440636719500573012221600148
    00015103100000075000000000075000000000000000
    0370000000000000000000000000000000000370000
    00000037000000000000000112000000000112025510
    000000000808233132000000000006330323511032001126
    0003703000000000000000002245223411411261318529500
    575222004600132 000161031000007012000000007012
    0001650000000000000000000000000030820000000
    0000000308200308200000000000000000001025900
    135600890325410000000000708244322000000000005310
    000000000000000000000000000000000000000311812341
    1421320320439500573522171600111
    00017103100000202700000000202700000000000000
    0000000000000000000000000000000000

6
Confidential Microdata
  • Master Files
  • These files contain the fullness of detail
    captured about the unit of observation. The
    information in these files can identify the
    individual who provided the original information
    and, therefore, are considered confidential.

7
Confidential Microdata
  • Master File - Example

8
Confidential Microdata
  • Master File - Personal identifiers

9
Confidential Microdata
  • Master File - Geography (SLID)

10
Confidential Microdata
  • Master File - Fullness of Data (NPHS)

11
Confidential Microdata
  • Master File - Fullness of Data

12
Confidential Microdata
  • Master File - Fullness of Data (SLID)

13
Confidential Microdata
  • Master File - Fullness of Data

14
Confidential Microdata
  • Share Files
  • these are confidential files in which the
    respondents have signed a consent form permitting
    Statistics Canada to allow access to their
    information for approved research.
  • Used with NPHS and NLSCY

15
Public Access Microdata
  • Anonymized Microdata
  • these microdata are specially prepared to
    minimize the possibility of disclosing or
    identifying any of the cases or observations
  • the original data from the master file are
    edited to create a public use microdata file

16
Public Access Microdata
  • Steps in Anonymizing Microdata
  • removal of all personal identification
    information (names, addresses, etc)
  • include only gross levels of geography
  • collapse detailed information into a smaller
    number of general categories
  • suppress the values of a variable
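
The four steps above can be sketched on a single toy record. This is an illustration only: the field names, the age bands, and the choice of which variable to suppress are assumptions, not the rules Statistics Canada applies.

```python
def anonymize(record):
    """Return a public-use copy of one toy master-file record."""
    public = dict(record)  # work on a copy; the master record is untouched

    # Step 1: remove personal identification information.
    for key in ("name", "address"):
        public.pop(key, None)

    # Step 2: keep only gross geography (province stays, city goes).
    public.pop("city", None)

    # Step 3: collapse detailed values into a few general categories.
    age = public.pop("age")
    if age < 25:
        public["age_group"] = "under 25"
    elif age < 65:
        public["age_group"] = "25-64"
    else:
        public["age_group"] = "65+"

    # Step 4: suppress the value of a variable judged too revealing.
    public["occupation"] = None

    return public


# A made-up master record (all values hypothetical).
master = {
    "name": "A. Person",
    "address": "1 Main St",
    "city": "Guelph",
    "province": "ON",
    "age": 42,
    "occupation": "statistician",
}

public = anonymize(master)
print(public)
```

Each step trades detail for protection, which is why the resulting public file supports less fine-grained analysis than the master file.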

17
Public Access Microdata
  • Statistics Canada PUMFs
  • only available for select social surveys that
    undergo a review of the Data Release Committee,
    an internal Statistics Canada committee
  • no enterprise public use microdata

18
Public Access Microdata
  • Statistics Canada PUMFs
  • almost all are cross-sectional, that is,
    represent data collected at one point in time
  • longitudinal data are difficult to anonymize
    while maintaining any useful information

19
Public Access Microdata
  • PUMFs - personal identifiers

20
Public Access Microdata
  • PUMFs - gross geography

21
Public Access Microdata
  • PUMFs - collapsed data

22
Public Access Microdata
  • PUMFs - suppressed data

23
Public Access Microdata
  • Synthetic Files
  • These microdata do not contain real cases; they
    are pseudo-cases that provide aggregate results
    close to those of the real cases

24
Public Access Microdata
  • Synthetic Files
  • They are prepared so that researchers can set up
    and test analysis runs for the master file without
    any risk of disclosing or identifying real cases

25
Public Access Microdata
  • Synthetic Files
  • The results are not to be reported; they are
    strictly to be used to prepare analyses of master
    files
  • Usually associated with longitudinal files

26
Public Access Microdata
  • Steps in creating Synthetic Files
  • Observations are transformed
  • No records actually exist
  • Keep fullness of detail

27
Public Access Microdata
  • Synthetic Files - NPHS example

28
Public Access Microdata
  • Synthetic Files - NPHS 1999 general file

29
Public Access Microdata
  • Synthetic Files - NPHS 1999

30
Public Access Microdata
  • Synthetic Files - NPHS 1999

31
Implications for Analysis
  • What are the implications in doing analysis with
    these different types of microdata files?

32
Implications for Analysis
  • Master File
  • All observations
  • Has the most variables with the most detail
  • Lots of geography and personal characteristics
  • Little grouping or capping of categories

33
Implications for Analysis
  • Master File
  • Restricted access: only available to authorized
    Statistics Canada employees, which includes
    deemed employees

34
Implications for Analysis
  • Master File
  • Includes linkage variables across files within a
    study, e.g., NLSCY linkage among the files for
    different units of analysis (kids, parents,
    teachers)

35
Implications for Analysis
  • Public Use Microdata (PUMF)
  • Suppressed observations
  • Suppressed variables removed from the file
  • Suppressed content
  • Gross geography
  • Collapsed categories
  • Capped values

36
Implications for Analysis
  • Public Use Microdata (PUMF)
  • Licensed product: users agree to certain terms of use
  • No linkage to multiple units of analysis, with a
    few exceptions (GSS Time Use and Family)

37
Implications for Analysis
  • Synthetic Files
  • "Looks like a duck and quacks like a duck," but
    it isn't a duck or any other type of fowl.

38
Implications for Analysis
  • Synthetic Files
  • Looks like master files
  • Lots of observations
  • Lots of variables
  • Little grouping or capping of categories
  • Lots of geographic detail

39
Synthetic Files
  • Precautions
  • Results not authentic but close in the
    aggregate
  • Use for testing analysis setups only
  • Still need the master files for publishable
    results

40
Where do we get Access?
  • Master File
  • Restricted access governed under the Statistics
    Act
  • Remote Job Submission
  • Research Data Centres
  • Apply to SSHRC for peer review of the proposal
    and to STC for security clearance

41
Where do we get Access?
  • Public Use Microdata Files (PUMF)
  • Get from DLI
  • Analyze wherever is convenient
  • Can use a variety of analysis software, including
    SAS, SPSS, Stata, HLM, LISREL, etc.
  • SLIDRET without data

42
Where do we get Access?
  • Synthetic Files
  • Author Divisions may create it
  • Most relevant when dealing with new Panel Data,
    but not necessarily, e.g., the Census has
    potential
  • NPHS synthetic files on DLI FTP site

43
Where do we get Access?
  • Synthetic files
  • SLID, WES, YITS: coming?
  • Do we need to encourage them?
  • Work with locally
  • Build SAS and SPSS setups

44
Which File is Appropriate?
  • 1st stop is still the PUMF
  • This file has the easiest access for us
  • Probably meets the needs of most clients
  • Not as administratively burdensome as synthetic
    or master file
  • Perfect for clients just looking for data, e.g.,
    for courses in quantitative analysis

45
Which File is Appropriate?
  • If more detail is needed, refer to the Master
    File Documentation (similar to Synthetic File
    Documentation)
  • Make them aware that the cost of use is higher,
    both in terms of accessibility and analytical
    requirements
  • Interest most likely to come from grad students
    and experienced researchers

46
Which File is Appropriate?
  • Download the Synthetic files from DLI
  • Make them aware of problems with synthetic files:
    RESULTS ARE NOT PUBLISHABLE
  • Encourage them to submit an application for RDC
    access; there is a time lag

47
Which File is Appropriate?
  • RDC

48
Which File is Appropriate?
  • Some of you may work with a client using synthetic
    files before passing her/him off to an RDC

49
Services for Synthetic Files
  • DLI Contacts can provide four basic services with
    synthetic files.
  • Build SPSS and SAS system files from the raw
    synthetic data files that are distributed through
    DLI
  • Provide information about the use of Remote Job
    Submission (a.k.a. Remote Access) and RDCs

50
Services for Synthetic Files
  • Assist with finding variables in the synthetic
    files
  • Provide instruction about ways of capturing SPSS
    or SAS code from dummy analysis runs with the
    synthetic files. It is this code that is then
    submitted to STC through remote job submission.

51
Services for Synthetic Files
  • 1. Building SPSS and SAS system files for
    synthetic data
  • The NPHS synthetic data are distributed as a raw
    ASCII file with accompanying command files for
    SPSS and SAS
  • Separate synthetic data files exist for the
    master file setup and for bootstrapping analysis

52
Services for Synthetic Files
  • 1. Building SPSS and SAS system files for
    synthetic data
  • The synthetic data for the 2000-2001 NPHS have
    4,138 variables and 17,276 fabricated cases.
    Creating the SPSS and SAS system files from this
    file is not difficult, but it does take time.
    DLI Contacts may wish to create these products
    for their patrons.

53
Services for Synthetic Files
  • 2. Information about Remote Job Submission (RJS)
  • The author divisions supporting RJS have
    established their own guidelines and have
    different operating procedures. Not all
    divisions supporting longitudinal surveys
    currently support RJS.
  • Therefore, there is a need to track down this
    information for our patrons.

54
Services for Synthetic Files
  • 2. Information about Remote Job Submission (RJS)
  • For example, the sources for information about
    RJS include the Centre for Education Statistics
  • http://www.statcan.ca/english/edu/rda/index.htm

56
Services for Synthetic Files
  • 2. Information about Remote Job Submission (RJS)
  • Where do you find this information?
  • Ask the DLI Team via the DLI List
  • The EAC has asked for a description of RJS on the
    DLI website, which should be on the DLI Team's
    to-do list

57
Services for Synthetic Files
  • 2. Information about Research Data Centres
  • The collection of master files available through
    RDCs is listed on the STC website for RDCs
  • Each RDC has its own website describing its
    services
  • http://www.statcan.ca/english/rdc/index.htm

59
Services for Synthetic Files
  • 3. Data Reference for the content of the
    synthetic files
  • Helping researchers identify variables over
    longitudinal files is an important service
  • Need to keep the unit of analysis straight
  • Need to understand the mnemonic naming convention
    for variables over cycles
  • Develop indexing aids for you and your patrons
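
An indexing aid of this kind can be as simple as grouping variable names that differ only in their cycle code. The sketch below assumes a made-up convention in which the fourth character of a name identifies the cycle; the variable names are invented for illustration, so check the survey documentation for the real convention before relying on anything like this.

```python
from collections import defaultdict


def build_index(variable_names, cycle_pos=3):
    """Group names that differ only in the character at cycle_pos.

    cycle_pos is the 0-based position of the (assumed) cycle code.
    """
    index = defaultdict(list)
    for name in variable_names:
        # Replace the cycle character with a placeholder to get a stem.
        stem = name[:cycle_pos] + "n" + name[cycle_pos + 1:]
        index[stem].append(name)
    return dict(index)


# Invented variable names; "4", "6", "8" stand for hypothetical cycles.
names = ["HWC4_HT", "HWC6_HT", "SMC4_1", "HWC8_HT", "SMC6_1"]

index = build_index(names)
for stem, members in index.items():
    print(stem, members)
```

A listing like this lets a patron see at a glance in which cycles a given measure was collected.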

60
Services for Synthetic Files
  • 4. Provide helpful tips for preserving the code
    from dummy analysis runs in SPSS and SAS
  • Researchers will run analyses on the synthetic
    file to generate the code that they will
    subsequently email for Remote Job Submission
  • Providing information about how to do this easily
    will be helpful to your patrons

61
An Example Using the NPHS
  • Let's look at an example of these four services
    using the synthetic files from the NPHS,
    2000-2001.

62
An Example Using SLID
  • Let's look at an example of a dummy file using
    SLIDRET, a retrieval system developed to extract
    data from the cycles of the SLID. A data-less
    version of SLIDRET is available through DLI to
    help identify variables for RJS.

63
Location of Slides and Exercises
  • http://drc.uoguelph.ca/DATA/WKSHPS/IASSIST2003