Title: NORC Data Enclave
1. NORC Data Enclave
- Module 2
- Metadata for researchers
2. Overview
- What is metadata and why is it important?
- What can metadata do for you?
- Guide to creating metadata
- Creating metadata in the enclave
3. What is metadata?
4. Metadata and the survey life cycle
- A survey is not a static process
- It evolves dynamically over time and involves many players
- It extends to aggregate data to reach decision makers
- Metadata is crucial to capture knowledge
5. Importance of metadata
- Imagine a world without metadata.
- Users would say
  - I can't find the right data! How do I get access?
  - Where is the report / questionnaire / methodology?
  - I don't understand this survey / file / variable
  - I can't merge the files
  - How do I weight the data?
  - My results don't match the report, I can't reproduce the same results
  - Are these things comparable?
  - I didn't know someone did this research before!
- Sound familiar?
- Metadata is an answer to a researcher's frustrations
- Producers and archivists are making efforts to improve metadata, but metadata must also be captured by researchers (Life Cycle!)
6. When to capture metadata?
- Metadata must be captured at the time the event occurs!
- Documenting after the fact leads to considerable loss of information
- This is true for producers and researchers
7. Metadata and the Replication standard
- Replication standard
- Gary King, Harvard, 1995
- The only way to understand and evaluate an empirical analysis fully is to know the exact process by which the data were generated
- A replication dataset includes all information necessary to replicate empirical results
- Metadata is crucial to meet the standard
- Composed of documentation and structured metadata
- Undocumented data is useless
8. What can Metadata do for you?
9. What can metadata do for you?
- Facilitate publication of results and increase visibility of your work
- Integrate research results into the survey knowledge
- Facilitate reporting, citations, etc.
- Capture the research process (replication standard!)
- Facilitate reusability / extend the research
- Compare results
- Outcome
- Makes your life easier and everyone happy
10. Guide to Creating Good Metadata
11. Capturing and using metadata
- Starts with good practices
- Needs to be complemented with tools
- In the enclave environment
- Follow common guidelines and good practices
- File and variable naming conventions
- Code documentation
- Good statistical methods
- Take advantage of the collaboratory
- Use the tools at your disposal
- Exchange ideas with others
- Express yourself using the blog, wiki, shared documents, etc.
- Explore available resources
12. Coding and naming conventions (1)
- Give meaningful names to files
- Avoid spaces in names, don't use upper case
- Version your files (capture progress)
- Use middle extensions
- Include metadata in the name
- Not too good
  - report.doc, notes.txt
  - myfile.dta, table2.xls
  - reg.do, test.do, results.
- Better
  - byu_atp_final_report_200607.doc
  - byu_results_200706.dta, byu_enterprise_by_project_success.xls
  - income_regression_mode.v200706.do
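The naming rules on this slide can also be applied from inside a script, so that output files are named the same way every time. A minimal Stata sketch, assuming a hypothetical project tag `byu` and content tag `results` (borrowed from the examples above) and a YYYYMM version stamp taken from the current date; the macro names are illustrative, not an enclave requirement:

```stata
// Illustrative macro names; the tags are assumptions, not enclave rules
local project "byu"
local content "results"

// Version stamp as YYYYMM, taken from today's date
local version : display %tdCCYYNN date(c(current_date), "DMY")

// Lower-case name, no spaces, with the metadata in the name itself
local outfile "`project'_`content'_`version'.dta"
display "`outfile'"
```

The resulting name could then be passed to `save` so that each run writes a dated file rather than overwriting an undated one.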
13. Coding and naming conventions (2)
- Give meaningful names to variables
- Not too good
  - tmp3, ag_exp2, v324
- Better
  - valid_enterprise, agricultural_expenditure, s1q3
- Comments, comments, comments!!
- Make sure to include lots of comments in your source code
- This is the best time to capture knowledge!
- It also promotes replicability and will help you in a few months when you try to remember what you did
- Share source code, use peer review
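In Stata, a meaningful name can be paired with a variable label and a note, so that the knowledge travels with the data file. A minimal sketch on made-up data; the meanings assigned to `v324` and `tmp3` below are assumptions for illustration only:

```stata
// Toy data, made up for illustration
clear
input v324 tmp3
1 100
0 250
end

// Replace cryptic names with meaningful ones
rename v324 valid_enterprise
rename tmp3 agricultural_expenditure

// Attach labels and a note (the wording is hypothetical)
label variable valid_enterprise "Enterprise record passed validation (1 = yes)"
label variable agricultural_expenditure "Agricultural expenditure"
notes agricultural_expenditure: Renamed from tmp3; units still to be confirmed.

// Show the documented variables and the attached notes
describe
notes
```

Labels and notes are stored in the .dta file itself, so they remain available to anyone who opens the data later.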
14. Not so good code example
    local mypath "c:\data\anonymization\"
    global data_in = "`mypath'" + "\" + "Demohh1000.dta"
    global data_out = "`mypath'" + "\" + "Demohh1000.out.dta"
    global threshold 0.8
    cd "`mypath'"
    set more off
    use "$data_in", clear
    tempfile temp
    gen fk = 1
    gen wi = weight
    collapse (sum) fk wi, by(town province marstat sex age)
    gen pk = fk/wi
    gen qk = 1-pk
    gen rk = (pk/qk) * log(1/pk) if fk==1
    replace rk = (pk/(qk^2)) * ((pk*log(pk))+qk) if fk==2
    replace rk = (pk/(2*(qk^3))) * (qk*(3*qk-2) - (2*pk^2)*log(pk)) if fk==3
    #delimit ;
    replace rk = (pk/fk) * (1 + (qk/(fk+1)) + ((2*qk^2) / ((fk+1)*(fk+2)))
        + ((6*qk^3) / ((fk+1)*(fk+2)*(fk+3)))
        + ((24*qk^4) / ((fk+1)*(fk+2)*(fk+3)*(fk+4)))
        + ((120*qk^5) / ((fk+1)*(fk+2)*(fk+3)*(fk+4)*(fk+5)))
        + ((720*qk^6) / ((fk+1)*(fk+2)*(fk+3)*(fk+4)*(fk+5)*(fk+6)))
        + ((5040*qk^7) / ((fk+1)*(fk+2)*(fk+3)*(fk+4)*(fk+5)*(fk+6)*(fk+7)))) if fk>3 ;
    #delimit cr
15. Better code example
    /**
     * Computes the disclosure risk at individual level
     *
     * @author John Anonymous (janon@example.org)
     * @version 2007.06
     * References:
     * - micro-Argus 4.1 manual, p27-25
     */
    // Configuration
    local mypath "C:\data\anonymization\"
    global data_in = "`mypath'" + "\" + "Demohh1000.dta"
    global data_out = "`mypath'" + "\" + "Demohh1000.out.dta"
    global threshold 0.8
    // Initialize
    cd "`mypath'"
    set more off
    // Load the input data
    use "$data_in", clear
16. Better code example (cont.)
    // Compute frequencies
    gen fk = 1
    gen wi = weight
    // Group individuals by re-identification variables
    collapse (sum) fk wi, by(town province marstat sex age)
    gen pk = fk/wi
    gen qk = 1 - pk
    // Compute risk if cell frequency is 1
    gen rk = (pk/qk) * log(1/pk) if fk == 1
    // Compute risk if cell frequency is 2
    replace rk = (pk/(qk^2)) * ((pk*log(pk)) + qk) if fk == 2
    // Compute risk if cell frequency is 3
    replace rk = (pk/(2*(qk^3))) * (qk*(3*qk-2) - (2*pk^2)*log(pk)) if fk == 3
    // Compute risk if cell frequency is greater than 3 (series approximation)
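The series approximation for cells with frequency greater than 3 could also be written as a short documented loop instead of one long expression. A minimal Stata sketch, assuming the same variable names (fk, wi, pk, qk, rk) and the same seven-term truncation as the earlier slide; the toy data and the helper variables `term` and `series` are made up for illustration:

```stata
// Toy cells, made up for illustration: fk = sample count, wi = summed weights
clear
input fk wi
4 20
5 50
end
gen double pk = fk/wi
gen double qk = 1 - pk
gen double rk = .

// Sum of k! * qk^k / ((fk+1)*...*(fk+k)) for k = 0..7, built term by term:
// each term is the previous one multiplied by k * qk / (fk + k)
gen double term = 1
gen double series = 1
forvalues k = 1/7 {
    replace term = term * `k' * qk / (fk + `k')
    replace series = series + term
}

// Risk for cells with frequency greater than 3
replace rk = (pk/fk) * series if fk > 3
drop term series
list fk pk rk
```

Building each term from the previous one keeps the code short and makes the truncation point (7 here) a single number that is easy to document or change.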
17. Creating Metadata in the Enclave
18. Using the SharePoint-based portal
- Use the HELP!
- But be aware that
  - Not all documented functionalities are available
  - Some functions require administrative access
19. Editing content
20. Organizing work and exchanging ideas
- Use the enclave announcements, tasks / to-do list, and calendar to distribute and organize the research work
- Use the discussion groups to exchange ideas, submit questions, etc.
21. Using the blog to capture research events
- Research is an iterative, evolving process
- Capturing ideas and milestones is crucial
- Personal logs have often been used in the past
- Blogs are today's version of them
22. Using the wiki to capture research knowledge
- Familiar with Wikipedia?
- A wiki is a shared web site that does not require programming skills to maintain
- Multiple authors can add, remove, and edit content (mass authoring)
- Knowledge grows over time based on community contributions
- Pages automatically link to each other by topic
23. Sharing and tagging files
- Take advantage of the shared documents facility to make information available to others
- Documents, papers, etc.
- Scripts, programs
- Tables
- Organize documents by keyword/topic
- Take advantage of the enclave search function
24. Report data quality issues!
- A survey is not perfect; problems are always detected during research
- Data issues
  - Invalid codes, missing values, files that cannot be merged, missing files or variables, inconsistent results, bad distributions, etc.
- Metadata / documentation issues
  - Undocumented variables or codes, discrepancies between docs and data, the post-processing/cleaning/quality assurance black box, etc.
- Reporting this is crucial for other researchers and for the producer
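Before reporting a data issue, it helps to quantify it. A minimal Stata sketch on made-up data, assuming (hypothetically) that the documentation lists 1 and 2 as the only valid codes for sex; the variable names are illustrative:

```stata
// Toy data, made up for illustration
clear
input id sex age
1 1 34
2 2 .
3 9 51
end

// Count missing values in age
count if missing(age)
local n_missing = r(N)

// Count codes outside the documented set (assumed here: 1 and 2)
count if !inlist(sex, 1, 2) & !missing(sex)
local n_invalid = r(N)

// One-line summary that can be pasted into an issue report
display "age: `n_missing' missing value(s); sex: `n_invalid' invalid code(s)"
```

Counts like these give the producer something concrete to check against the documentation.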