NORC Data Enclave

Provided by: norc

Transcript and Presenter's Notes

Title: NORC Data Enclave


1
NORC Data Enclave
  • Module 2
  • Metadata for researchers

2
Overview
  • What is metadata and why is it important
  • What can metadata do for you?
  • Guide to creating metadata
  • Creating metadata in the enclave

3
What is metadata?
4
Metadata and the survey life cycle
  • A survey is not a static process
  • It evolves dynamically over time and involves
    many players
  • It extends to aggregate data to reach decision
    makers
  • Metadata is crucial to capture knowledge

5
Importance of metadata
  • Imagine a world without metadata.
  • Users would say
  • I can't find the right data! How do I get access?
  • Where is the report / questionnaire /
    methodology?
  • I don't understand this survey / file / variable
  • I can't merge the files
  • How do I weight the data?
  • My results don't match the report, I can't
    reproduce the same results
  • Are these things comparable?
  • I didn't know someone had done this research before!
  • Sounds familiar?
  • Metadata is an answer to a researcher's
    frustrations
  • Producers and archivists are making efforts to
    improve metadata but similarly, metadata must
    also be captured by researchers (Life Cycle!)

6
When to capture metadata?
  • Metadata must be captured at the time the event
    occurs!
  • Documenting after the fact leads to considerable
    loss of information
  • This is true for producers and researchers

7
Metadata and the Replication standard
  • Replication standard
  • Gary King, Harvard, 1995
  • The only way to understand and evaluate an
    empirical analysis fully is to know the exact
    process by which the data were generated
  • A replication dataset includes all information
    necessary to replicate empirical results
  • Metadata is crucial to meet the standard
  • Composed of documentation and structured metadata
  • Undocumented data is useless
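
One way a researcher might capture the process while it happens is to log every run. The following is a minimal Stata sketch, not an enclave requirement. The log file name, the seed, the version number, and the regression are illustrative assumptions, and the auto dataset shipped with Stata stands in for real survey data.

```stata
* Sketch: record the analysis as it runs so it can be replicated later.
* The log file name, seed, and model are illustrative assumptions.
version 10                      // interpret commands as Stata 10 would
set more off
capture log close               // close any log left open by an earlier run
log using "replication_sketch.log", replace text

set seed 20070601               // fix the random-number seed
sysuse auto, clear              // example dataset shipped with Stata
describe, short                 // record which data were used
regress price mpg weight        // the analysis step itself
display "Run on `c(current_date)' at `c(current_time)' with Stata `c(stata_version)'"

log close
```

The resulting log holds the commands, the output, and the date of the run in one file that can be stored next to the replication dataset.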

8
What can Metadata do for you?
9
What can metadata do for you?
  • Facilitate publication of results and increase
    visibility of your work
  • Integrate research results into the survey
    knowledge
  • Facilitate reporting, citations, etc.
  • Capture research process (replication standard!)
  • Facilitate reusability / extend the research
  • Compare results
  • Outcome
  • Makes your life easier and everyone happy

10
Guide to Creating Good Metadata
11
Capturing and using metadata
  • Starts with good practices
  • Needs to be complemented with tools
  • In the enclave environment
  • Follow common guidelines and good practices
  • File and variable naming conventions
  • Code documentation
  • Good statistical methods
  • Take advantage of the collaboratory
  • Use the tools at your disposal
  • Exchange ideas with others
  • Express yourself using blog, wiki, shared
    document, etc.
  • Explore available resources

12
Coding and naming conventions (1)
  • Give meaningful names to files
  • Avoid spaces in names, don't use upper case
  • Version your files (capture progress)
  • Use middle extensions
  • Include metadata in the name
  • Not too good
  • report.doc, notes.txt
  • myfile.dta, table2.xls
  • reg.do, test.do, results.
  • Better
  • byu_atp_final_report_200607.doc
  • byu_results_200706.dta,
    byu_enterprise_by_project_success.xls
  • income_regression_mode.v200706.do
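
A naming convention is easier to follow when the script builds the name itself. The following Stata sketch assumes a project prefix and content tag chosen for illustration ("byu" and "results" mirror the slide). The date stamp is taken from the system clock, and the auto dataset stands in for real data.

```stata
* Sketch: build file names that carry project, content, and version.
* The prefix and content tag are hypothetical values for illustration.
local project "byu"
local content "results"

* Year and month of the current date, in the form 200706 for June 2007
local stamp : display %tdCCYYNN date(c(current_date), "DMY")

sysuse auto, clear                               // stand-in for real data
save "`project'_`content'_`stamp'.dta", replace  // lower case, no spaces
```

Because the stamp is generated rather than typed, each monthly run produces a new file instead of overwriting the previous version.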

13
Coding and naming conventions (2)
  • Give meaningful names to variables
  • Not too good
  • tmp3, ag_exp2, v324
  • Better
  • valid_enterprise, agricultural_expenditure, s1q3
  • Comments, comments, comments!!
  • Make sure to include lots of comments in your
    source code
  • This is the best time to capture knowledge!
  • It also promotes replicability and will help you
    in a few months when you try to remember what you
    did
  • Share source code, use peer review
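
In Stata, much of this documentation can be stored in the dataset itself. The sketch below renames two cryptic variables and attaches labels and notes. The data values, label texts, and the meaning assigned to tmp3 are invented for illustration.

```stata
* Sketch: replace cryptic names with meaningful ones and document them.
* Values, labels, and the meaning of tmp3 are illustrative assumptions.
clear
input tmp3 ag_exp2
1 1200
0 450
end

rename tmp3 valid_enterprise
rename ag_exp2 agricultural_expenditure

label variable valid_enterprise "Enterprise passed validation checks"
label variable agricultural_expenditure "Agricultural expenditure"

* Value labels document what each code means
label define yesno 0 "No" 1 "Yes"
label values valid_enterprise yesno

* Notes travel with the dataset and are displayed by the notes command
notes agricultural_expenditure: Renamed from ag_exp2; units not yet confirmed.
label data "Illustrative enterprise data with documented variables"

describe
notes
```

Labels and notes are saved with the .dta file, so the next user sees them without needing the original do-file.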

14
Not so good code example
    local mypath "c:\data\anonymization\"
    global data_in "`mypath'Demohh1000.dta"
    global data_out "`mypath'Demohh1000.out.dta"
    global threshold 0.8
    cd "`mypath'"
    set more off
    use "$data_in", clear
    tempfile temp
    gen fk=1
    gen wi=weight
    collapse (sum) fk wi, by(town province marstat sex age)
    gen pk=fk/wi
    gen qk=1-pk
    gen rk = (pk/qk) * log(1/pk) if fk==1
    replace rk = (pk/(qk^2)) * ((pk*log(pk))+qk) if fk==2
    replace rk = (pk/(2*(qk^3))) * (qk*(3*qk-2) - (2*pk^2)*log(pk)) if fk==3
    #delimit ;
    replace rk = (pk/fk) * (1 + (qk/(fk+1))
        + ((2*qk^2) / ((fk+1)*(fk+2)))
        + ((6*qk^3) / ((fk+1)*(fk+2)*(fk+3)))
        + ((24*qk^4) / ((fk+1)*(fk+2)*(fk+3)*(fk+4)))
        + ((120*qk^5) / ((fk+1)*(fk+2)*(fk+3)*(fk+4)*(fk+5)))
        + ((720*qk^6) / ((fk+1)*(fk+2)*(fk+3)*(fk+4)*(fk+5)*(fk+6)))
        + ((5040*qk^7) / ((fk+1)*(fk+2)*(fk+3)*(fk+4)*(fk+5)*(fk+6)*(fk+7))))
        if fk>3;

15
Better code example
    /**
     * Computes the disclosure risk at individual level
     * @author John Anonymous (janon@example.org)
     * @version 2007.06
     * References
     * - micro-Argus 4.1 manual, p27-25
     */
    // Configuration
    local mypath "C:\data\anonymization\"
    global data_in "`mypath'Demohh1000.dta"
    global data_out "`mypath'Demohh1000.out.dta"
    global threshold 0.8
    // Initialize
    cd "`mypath'"
    set more off

16
Better code example (cont.)
    // Compute frequencies
    gen fk=1
    gen wi=weight
    // Group individuals by re-identification variables
    collapse (sum) fk wi, by(town province marstat sex age)
    gen pk=fk/wi
    gen qk=1-pk
    // Compute risk if cell frequency is 1
    gen rk = (pk/qk) * log(1/pk) if fk==1
    // Compute risk if cell frequency is 2
    replace rk = (pk/(qk^2)) * ((pk*log(pk))+qk) if fk==2
    // Compute risk if cell frequency is 3
    replace rk = (pk/(2*(qk^3))) * (qk*(3*qk-2) - (2*pk^2)*log(pk)) if fk==3
    // Compute risk if cell frequency is greater than 3
    // (series approximation)
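
The transcript of this slide stops at the comment for cell frequencies above 3. The uncommented example on the earlier slide writes that case as a sum of seven terms. As a hedged sketch only, the same sum could be built with a loop, which is shorter and easier to check. The two input rows are made-up values, and the helper variables term and series are names introduced here for illustration.

```stata
* Sketch: series approximation of the risk for cell frequencies above 3,
* using the first seven terms as in the uncommented example.
* The two input rows are made-up values for illustration only.
clear
input fk wi
4 40
6 120
end
gen double pk = fk / wi
gen double qk = 1 - pk
gen double rk = .

* Term k equals k! * qk^k / ((fk+1)*(fk+2)*...*(fk+k)).
* Each term is the previous one multiplied by k*qk/(fk+k).
gen double term = 1
gen double series = 1
forvalues k = 1/7 {
    quietly replace term = term * `k' * qk / (fk + `k')
    quietly replace series = series + term
}

replace rk = (pk / fk) * series if fk > 3
drop term series
list fk wi pk rk
```

Changing the upper bound of the loop changes the number of terms, which is harder to do safely in the hand-written version.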

17
Creating Metadata in the Enclave
18
Using the SharePoint based portal
Use the HELP!
  • But be aware that
  • Not all documented features are available
  • Some functions require administrative access

19
Editing content
20
Organizing work and exchanging ideas
  • Use the enclave announcements, tasks / to-do list, and
    calendar to distribute and organize the research work
  • Use the discussion groups to exchange ideas,
    submit questions, etc.
21
Using the blog to capture research events
  • Research is an iterative, evolving process
  • Capturing ideas and milestones is crucial
  • Personal logs have often been used in the past
  • Blogs are today's version of them

22
Using the wiki to capture research knowledge
  • Familiar with Wikipedia?
  • A wiki is a shared web site but does not require
    programming skills to maintain
  • Multiple authors can add, remove, and edit
    content (mass authoring).
  • Knowledge grows over time based on community
    contributions
  • Pages automatically link to other pages on
    related topics

23
Sharing and tagging files
  • Take advantage of the shared documents facility
    to make information available to others
  • Documents, papers, etc.
  • Scripts, programs
  • Tables
  • Organize documents by keyword/topics
  • Take advantage of the enclave search function

24
Report data quality issues!
  • A survey is not perfect, problems are always
    detected during research
  • Data issues
  • Invalid codes, missing values, files that cannot be
    merged, missing files or variables, inconsistent
    results, bad distributions, etc.
  • Metadata / Documentation issues
  • Undocumented variables or codes, discrepancies
    between docs and data, the post-processing /
    cleaning / quality assurance black box, etc.
  • Reporting this is crucial for other researchers
    and for the producer
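
Several of these issues can be detected with a few lines of code before they are reported. The Stata sketch below uses the auto dataset shipped with Stata as a stand-in for survey data. The specific checks (valid codes for foreign, uniqueness of make) are illustrative assumptions, and misstable requires a more recent Stata release than the one current when these slides were written.

```stata
* Sketch: simple checks that surface data issues worth reporting.
* Uses the auto dataset shipped with Stata as a stand-in for survey data.
sysuse auto, clear

misstable summarize              // which variables have missing values
codebook, problems               // report potential problems in the data

* Invalid codes: foreign should only take the values 0 and 1
capture assert inlist(foreign, 0, 1)
if _rc display "Invalid codes found in foreign"

* Merge keys: the identifier must be unique before files can be merged
capture isid make
if _rc display "make does not uniquely identify observations"

* Undocumented variables: list those without a variable label
foreach v of varlist _all {
    local lbl : variable label `v'
    if `"`lbl'"' == "" display "No label: `v'"
}
```

The output of such a script, posted to the wiki or a discussion group, gives the producer a concrete and reproducible description of each problem.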