GEOGRAPHY and WEIGHTS IN THE NLS

Title: NLS Sample Design Adjustments
Author: nramser
Last modified by: olsen
Created Date: 5/26/2004 6:11:31 PM
Document presentation format: On-screen Show

Slides: 54
Provided by: nra62
Learn more at: http://streaming.osu.edu

1
GEOGRAPHY and WEIGHTS IN THE NLS
  • By
  • Randall Olsen

2
The Plan for this Module
  • Basics of geographic data
  • Geography and sampling
  • Level of detail available in the NLS and GIS data
  • How you get geographic data
  • Weighting
  • Correct standard errors and geo-variables
  • Using GIS and family-level data to enhance your
    analysis plan

3
Basics of Geographic Data
  • Four Census Regions: this is the finest level of
    geography available for the original cohorts without
    going to a Census Research Data Center.
  • States and counties
  • Census tracts: size varies, thousands of persons
  • Block groups: neighborhoods (almost)

4
Counties
  • Finest level routinely available in NLSY79 and
    NLSY97
  • About 3,100 in the U.S.: Texas (254), Delaware (3),
    Georgia (159). FIPS codes designate them.
  • Extensive socio-economic and demographic data
    available at county level (née City-County Data
    Book); you merge them in using FIPS codes
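
The FIPS merge described above can be sketched in a few lines. This is only an illustration: the respondent ids, the field names (`state_fips`, `county_fips`, `poverty_rate`) and the poverty rates are made up, and the only real convention used is that a county key is the 2-digit state FIPS code followed by the 3-digit county FIPS code.

```python
def fips5(state_fips, county_fips):
    """Build the 5-digit county key: 2-digit state code + 3-digit county code."""
    return f"{int(state_fips):02d}{int(county_fips):03d}"


def merge_county_data(respondents, county_table):
    """Left-join county attributes onto respondent records by FIPS key."""
    merged = []
    for r in respondents:
        key = fips5(r["state_fips"], r["county_fips"])
        attrs = county_table.get(key)
        # Keep every respondent; flag the ones with no county match.
        row = dict(r, fips=key, matched=attrs is not None)
        row.update(attrs or {})
        merged.append(row)
    return merged


# Hypothetical respondent records (ids and locations are made up).
respondents = [
    {"id": 1, "state_fips": 48, "county_fips": 201},
    {"id": 2, "state_fips": 10, "county_fips": 3},
    {"id": 3, "state_fips": 13, "county_fips": 999},  # no such key in the table
]

# Hypothetical county-level file; poverty rates are placeholders, not real data.
county_table = {
    "48201": {"poverty_rate": 0.15},
    "10003": {"poverty_rate": 0.10},
}

merged = merge_county_data(respondents, county_table)
for row in merged:
    print(row["id"], row["fips"], row["matched"], row.get("poverty_rate"))
```

Keeping unmatched respondents with a `matched` flag, rather than dropping them, makes it easy to check how many records failed to merge.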

5
Census Tracts
  • In 2000, the entire U.S. was partitioned into tracts
  • Size and population vary, but a tract can contain
    several thousand people in urban areas or a few
    hundred in rural areas
  • Using these data requires clearance from BLS

6
Block Group
  • In urban areas, these consist of groups of blocks
    (did you guess?)
  • This is the finest level of aggregation for
    Census geography (building blocks for
    reapportionment)
  • What non-rural folk think of as a neighborhood,
    except for those near the boundary of the block group

7
Census Regions
8
Census Tracts: NYC Metro Area
9
Census Tracts in Wyoming
10
Sampling - Original Cohorts
  • Original Cohorts were drawn from an experimental CPS
    sample frame in the 1960s. Title 13 confidentiality
    restrictions prohibit release of geographic data
    below Census Region.
  • We recently geocoded these data (latitude and
    longitude), and one may use them at a Census
    Research Data Center.
  • The exact sampling structure was kept secret, à la
    Raiders of the Lost Ark. Details may exist in a
    musty file in Suitland, MD, although the latitude
    and longitude data allow one to reverse engineer
    the sampling.

11
Sampling for the NLSYs: Multiple Stages
  • U.S. divided into Primary Sampling Units (PSUs):
    major metropolitan areas, counties or groups of
    counties (rural areas)
  • Selection probability is proportional to the
    population of interest; large cities are always chosen
  • Dividing PSUs into groups can ensure the correct
    fraction of rural and suburban areas is chosen; this
    can reduce the sampling variance relative to a
    simple random sample (SRS)
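
To see why large cities are always chosen under selection proportional to size, here is a minimal sketch of systematic probability-proportional-to-size (PPS) selection. It is one common way to draw such a sample, not necessarily the procedure used for the NLSYs; the PSU names and population counts are hypothetical.

```python
import random


def pps_systematic(sizes, n, rng):
    """Pick PSUs with probability proportional to size (systematic PPS).

    sizes: dict mapping PSU name -> size measure (e.g. eligible population).
    n:     number of selection points laid along the cumulated sizes.
    A PSU at least as large as the sampling interval (total / n) is hit by
    at least one point for every random start, so it is chosen with certainty.
    """
    total = float(sum(sizes.values()))
    step = total / n
    start = rng.random() * step            # random start in [0, step)
    points = [start + i * step for i in range(n)]

    selected = []
    upper = 0.0                            # running cumulative size
    idx = 0                                # next selection point to place
    for psu, size in sizes.items():
        upper += size
        while idx < n and points[idx] < upper:
            if psu not in selected:        # a big PSU can be hit more than once
                selected.append(psu)
            idx += 1
    return selected


# Hypothetical PSUs: one large metro area and five small county groups.
sizes = {"BigMetro": 500, "A": 100, "B": 100, "C": 100, "D": 100, "E": 100}
print(pps_systematic(sizes, 4, random.Random(1)))
```

With a total size of 1,000 and four selection points, the interval is 250, so the hypothetical `BigMetro` (size 500) is selected no matter where the random start falls.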

12
Next Stages
  • Select tracts or block groups: list them in order by
    income or ethnic composition and pick every nth
    one, which ensures an even distribution over the
    ordered characteristics. Again, this can reduce
    sampling variance relative to an SRS. Segments of
    streets are selected within block groups.
  • List all addresses in selected segments, then
    randomly select units for a screening interview to
    identify eligible persons
  • This process generates area clusters of nearly
    contiguous respondents
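
The "order the list, then pick every nth one" step can be sketched as follows. The tract records and the `median_income` field are hypothetical; the point is only that a random start plus a fixed skip spreads the sample evenly over the ordering variable.

```python
import random


def systematic_sample(units, key, every_nth, rng):
    """Sort units by `key`, then take every nth one from a random start."""
    ordered = sorted(units, key=key)
    start = rng.randrange(every_nth)       # random start in 0 .. every_nth - 1
    return ordered[start::every_nth]


# Hypothetical tracts with made-up median incomes, listed in arbitrary order.
tracts = [{"tract": i, "median_income": 20000 + 1000 * i} for i in range(20)]
random.Random(7).shuffle(tracts)

sample = systematic_sample(
    tracts, key=lambda t: t["median_income"], every_nth=5, rng=random.Random(0)
)
for t in sample:
    print(t["tract"], t["median_income"])
```

Because the list is sorted before sampling, the selected tracts cannot all come from one end of the income distribution, which is the implicit stratification the slide refers to.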

13
Examples of PSU Clustering
  • NLSY97: 100 PSUs in cross-sectional sample. 100
    PSUs in minority oversample.
  • NLSY79: 102 PSUs in cross-sectional sample. 100
    PSUs in oversample. 38 PSUs in military
    oversample.
  • We average about 50 respondents per PSU. The
    effect of clustering on statistical properties
    increases with the size of the cluster and the degree
    to which variables are correlated within cluster.
  • Correlations within clusters increase the
    sampling variance and usually overcome the
    advantages of stratification.

14
  • PSUs in NLSY79: initial screening (done in 1978),
    over-yield
  • PSUs in NLSY97: initial screening (done in 1997,
    screen and go), under-yield

15
Geographic Detail in NLS
  • States and counties available in Geocode release
  • Zipcode data kept at BLS and CHRR
  • Census tracts and block group identifiers at CHRR
    and BLS
  • Latitude and longitude (accurate to about 50
    feet) at CHRR

16
Geocode: How You Get It
  • You need to apply to BLS (see Web site)
  • Describe how you plan to use the data
  • If BLS approves you, CHRR sends you a CD
  • You need to return the CD when finished, and you
    are subject to audit and legal liabilities if you
    violate the terms of your agreement with BLS. BLS
    performs many audits; keep yourself in
    compliance.

17
Geocode: How You Use It
  • You use the state and county codes to merge in
    the data you need
  • Use standard FIPS codes
  • There is a variable indicating when R is in a
    central city (this was done using zipcodes;
    before 1998, missing values indicate zips that are
    not unambiguously central or non-central)
  • Data merge is a do-it-yourself project

18
Zipcode Data
  • CD is at BLS or CHRR and is not released
  • The CD has Zipcodes, but matching and merging in
    the data you need is a do-it-yourself project
  • You can have CHRR create a variable you need,
    with BLS approval
  • Zipcode centroid can be used as rough location of
    respondent for simple distance calculations
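
A simple distance calculation from a zipcode centroid might look like the sketch below, which uses the standard haversine great-circle formula. The coordinates are made-up examples, and the Earth radius of 3,958.8 miles is the usual mean-radius approximation.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, an approximation


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


# Hypothetical zipcode centroid and hypothetical hospital location.
zip_centroid = (40.00, -83.02)
hospital = (40.05, -82.95)

print(round(haversine_miles(*zip_centroid, *hospital), 2), "miles")
```

Since the centroid only approximates where the respondent lives, distances computed this way are rough, as the slide notes.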

19
Fine Level Location
  • Modern Geographic Information Systems data use
    latitude and longitude as the basis for linking
    data
  • We geocode respondent addresses with latitude and
    longitude, sometimes with GPS units (all years
    except 1980)
  • We place R within about 50 feet
  • Opportunities to extend analysis abound

20
Distance from R to:
  • Fast food restaurants
  • Employers
  • Doctors' offices
  • Hospitals
  • Freeways
  • Schools, public and private
  • Post offices
  • Banks
  • Bus stops
  • Train stations
  • State licensed day care centers
  • Drug seizures and prices
  • Air quality measures
  • Toxic waste sites

21
Data at Tract and Block Group Level
  • Based on Decennial Census Long Form or American
    Community Survey (recent years)
  • Ethnicity and Color of people in area
  • Average income, poverty rate, dispersion in
    income, housing attributes
  • Population density, education, employment rates

22
Other Sensitive Data for Analysis
  • CHRR maintains the names of employers for each
    respondent in each round
  • With BLS approval we can identify persons working
    for a particular sort of employer or match in
    employer characteristics
  • The guiding principle is that these specialized
    extracts must not give you the ability to
    re-identify the respondent

23
Ideas using detailed geography
  • Does proximity to fast-food restaurants now and
    in the past correlate with BMI?
  • Does current and past air quality have a
    relationship to the incidence of asthma?
  • Does proximity to health care correlate with
    health outcomes?
  • Is local income inequality related to health?

24
  • Respondent location is generally chosen by the
    respondent. This problem of endogenous location
    may be attenuated or solved using locational
    attributes at either the screening or age-15
    locations, which reflect primarily parental choice,
    not respondent choice.
  • These past locational attributes can be used
    as either regressors or instrumental variables
    (IV). IV creates a variable that stands in for
    a regressor that is correlated with the error
    term.

25
  • Some respondent choices may be endogenous to an
    outcome, such as smoking and the birth weight of
    one's infant. One could use the incidence of
    smoking by one's peers in the original PSU (or by
    one's siblings) as an instrumental variable.
  • Peer smoking reflects shared socio-economic
    forces, but the weight of R's baby is unlikely to
    have an effect on the smoking behavior of R's peers.
  • Need to avoid weak instruments, that is,
    instruments that do not explain much of the
    variation in the variable they stand in for.
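
The logic of instrumental variables can be illustrated with simulated data. Everything here is an assumption made for the illustration: the data-generating process, the true coefficient of 2.0 and the variable names. With a single instrument z and a single endogenous regressor x, the IV estimate is cov(z, y) / cov(z, x).

```python
import random


def cov(a, b):
    """Sample covariance (divides by n, which is fine for this illustration)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)


def simulate(n=20000, beta=2.0, seed=0):
    """Simulate y = beta*x + error, where x and the error share a confounder u."""
    rng = random.Random(seed)
    z, x, y = [], [], []
    for _ in range(n):
        zi = rng.gauss(0, 1)               # instrument: moves x, unrelated to u
        ui = rng.gauss(0, 1)               # unobserved confounder
        xi = zi + ui + rng.gauss(0, 1)     # regressor is correlated with u
        yi = beta * xi + 3.0 * ui + rng.gauss(0, 1)
        z.append(zi)
        x.append(xi)
        y.append(yi)
    return z, x, y


z, x, y = simulate()
ols = cov(x, y) / cov(x, x)   # biased upward because x is correlated with the error
iv = cov(z, y) / cov(z, x)    # uses only the variation in x that comes from z
print("OLS slope:", round(ols, 2), " IV slope:", round(iv, 2))
```

In this setup the instrument explains about a third of the variance of x. If it explained very little, cov(z, x) would be close to zero and the IV estimate would be unstable, which is the weak-instrument problem noted above.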

26
Using Fine-level Geography
  • Make application to BLS
  • CHRR can often create the variable for you if it
    does not threaten re-identification
  • Rounding data reduces precision and reduces
    threat of re-identification of tract, block group
    or zipcode
  • Do the analysis at BLS or CHRR

27
DIFFUSION OF THE SAMPLES
PSU Clusters in original NLSY97 Sample
But this clustering has broken down over time.
Here is where people live as of Round 6 in NLSY97.
28
PSU Clusters in original NLSY79 Sample 12,000
By Round 20 in NLSY79 Sample there is even more
geographic dispersion. 9,000
29
Example of Segment Clustering
  • In the NLSY97, one cluster of respondents was
    picked from the Lower East Side of Manhattan and
    another from around Yankee Stadium.

30
Implications of Sample Design for Routine Data Use
  • All NLS samples contain oversamples of Blacks, and
    the NLSYs also oversample Hispanics. The NLSY79
    oversamples of poor whites and military members
    have been discontinued.
  • NLSY looks different from a Simple Random Sample
  • Clusters of Rs may share unobservable
    characteristics

31
Weighting
  • Weight summary statistics to describe population
  • For regressions, Gauss-Markov rules
  • OLS is BLUE under standard conditions, including
    correct specification of the model
  • Model heterogeneity does not call for weighted
    regression but rather weighting the various
    regression coefficients

32
Weighting: Horn of the Dilemma
33
Using Weights
  • Weights for the NLSY97 Round 7 range from a high
    of 1,785,202 to a low of 90,060 (two implied
    decimal places)
  • One respondent represents from 900 to 17,852
    people; the average is about 2,500
  • Zero weights indicate the person was not interviewed
  • NLSY97 and NLSY79 have single-round weights
    representing the population in 1997 and 1978, not
    immigrants since screening
  • NLSY97 has weights for the cross section (no
    oversamples) as well as panel weights
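
"Two implied decimal places" means the stored integer must be divided by 100 to get the number of people a respondent represents. A minimal sketch, using the high and low weights quoted above (the function names are illustrative):

```python
def persons_represented(raw_weight):
    """Convert a stored weight with two implied decimal places to persons."""
    return raw_weight / 100.0


def was_interviewed(raw_weight):
    """A zero weight marks a respondent who was not interviewed that round."""
    return raw_weight > 0


for raw in (1785202, 90060, 0):
    print(raw, persons_represented(raw), was_interviewed(raw))
```

Dividing by 100 does not change weighted means or regressions, since only relative weights matter there, but it does matter when estimating population totals.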

34
A NLSY79 Example From 1994
  • Blacks and Hispanics on average have lower wages
    than whites (see WeightingWageData.Sas).
  • Unweighted:
  • Mean wage: $12.50 per hour
  • Median wage: $10.15 per hour
  • Weighted with the 1994 sample weight (R50804.00) to
    correct for oversampling:
  • Mean wage: $13.60 per hour
  • Median wage: $11.10 per hour
  • Weighting increases the average wage by roughly $1.00
    per hour
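
The mechanics behind the example above can be shown with toy numbers. These are not NLSY79 data: the wages and weights below are invented so that the lower-wage cases carry smaller weights, as happens when lower-wage groups are oversampled.

```python
def weighted_mean(values, weights):
    """Mean of values, each counted in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def weighted_median(values, weights):
    """Smallest value at which the cumulated weight reaches half the total."""
    half = sum(weights) / 2.0
    running = 0.0
    for v, w in sorted(zip(values, weights)):
        running += w
        if running >= half:
            return v


# Toy data: three oversampled low-wage cases, two high-wage cases.
wages = [8.0, 9.0, 10.0, 14.0, 16.0]
weights = [1, 1, 1, 3, 3]

print("unweighted mean:", sum(wages) / len(wages))
print("weighted mean:  ", weighted_mean(wages, weights))
print("weighted median:", weighted_median(wages, weights))
```

Respondents with zero weight (not interviewed) drop out of both statistics automatically, since they add nothing to either sum.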

35
How Do I Weight Multiple Years?
  • NLS has a custom weighting program that provides
    users with the ability to go beyond weighting
    just a single round
  • Web version: http://www.nlsinfo.org/web-investigator.
    Allows you to weight a set of survey rounds.
  • PC-SAS version: allows you to use the code that
    runs the web version on your own PC. Enables you
    to weight any set of respondent ids. This allows
    you to take into account event history data and
    item non-response. This is a powerful tool.

36
Web Version
37
PC-SAS Custom Weight Program
  • Contact NLS User Services. They will send you a
    pair of PC-SAS programs, a set of data files and
    an input file. Jay Zagorsky at CHRR will help
    you.
  • You must be comfortable making minor
    modifications to SAS programs and must have SAS
    installed on your computer.
  • Program takes as input a sorted list of ids, one
    id per line. Program produces same output as web
    version
  • This program allows you to weight data from an
    event history or other complex designs

38
Clustering Standard Errors
  • NLS has numerous clusters of respondents who are
    alike: same person in different rounds, siblings,
    people in same neighborhood
  • Clustering means observations are not all
    independent (not i.i.d.): heterogeneity across
    persons and families plus spatial correlation
  • PSU clustering is more of a problem than family
    clustering for variances: $d.e. = 1 + \rho(k - 1)$
    (adjust s.e. by its square root), where $\rho$ is the
    intra-cluster correlation and $k$ is the cluster
    size. Large $k$ produces problems: clusters are
    larger than families. But same person in different
    rounds means large $\rho$.
39
Clustering Standard Errors (cont.)
  • If intra-cluster correlations are high, the number
    of effective observations approaches the number of
    clusters, not the number of observations
  • OLS is still consistent and unbiased, but you must
    use GLS for correct standard errors
  • Design effects in regressions are perhaps better
    described as misspecification effects as the
    intracluster correlation is due to unobserved
    variables affecting the cluster

40
Household Clustering
  • NLSY97: 4,027 respondents came from homes that
    had multiple respondents.
  • There were six homes that each provided five
    respondents.
  • NLSY79: 5,914 respondents came from homes that
    had multiple respondents.
  • There were four homes that each provided six
    respondents.
  • Data on siblings allows us to separate effects of
    household versus individual characteristics
  • For original cohorts, refer to multiple
    respondent file to detect parents and children
    across cohorts and siblings both within and
    across cohorts

41
Effect of Clustering on Std Errors
  • NLSY79: explain log of male hourly wage
  • Regress log hourly wages on race, age, education,
    AFQT score and marital status. Details are in
    WageData.sas
  • $Y = XB + u_i + v_{ij} + w_{ijk} + z_{ijkt}$
  • $u_i$ is the error component for PSU $i$, $v_{ij}$
    is the component for PSU $i$ and family $j$,
    $w_{ijk}$ is the component for PSU $i$, family $j$
    and person $k$, and $z_{ijkt}$ is the idiosyncratic
    error for that person in round $t$

42
OLS Results From SAS
  • Results using OLS with SAS. Note high T-values.

Variable     Coefficient   T Value
Constant      5.21         133
Black        -0.16          18
Hispanic     -0.04           4.4
Age           0.02          18
High Grade    0.06          38
AFQT          0.001         16
Married       0.19          26

43
How To Fix Problem
  • There are at least two statistical packages
    designed to fix the clustering problem.
  • Sudaan (www.rti.org/sudaan) is a special purpose
    package designed to fix clustering issues.
    Integrates with SAS.
  • Stata (www.stata.com) is a general purpose
    statistical program. To adjust for clustering:
    for means, use the svyset command; for
    regressions, use robust cluster (Huber-White).
  • No clustering data available for Original Cohorts
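
To show what a cluster adjustment does, here is a bare-bones sketch of a cluster-robust (Huber-White style) standard error for a sample mean, the simplest possible "regression" with only a constant. It omits the finite-sample corrections that packages such as Stata and Sudaan apply, so it will not reproduce their output exactly; the data are toy values.

```python
import math
from collections import defaultdict


def naive_se_of_mean(values):
    """Standard error of the mean treating every observation as independent."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values)) / n


def cluster_se_of_mean(values, clusters):
    """Standard error of the mean that sums residuals within each cluster first."""
    n = len(values)
    mean = sum(values) / n
    cluster_sums = defaultdict(float)
    for v, c in zip(values, clusters):
        cluster_sums[c] += v - mean        # residuals in a cluster move together
    return math.sqrt(sum(s ** 2 for s in cluster_sums.values())) / n


# Toy data: two PSUs whose members are identical within PSU.
values = [1.0, 1.0, 1.0, 3.0, 3.0, 3.0]
psu = ["a", "a", "a", "b", "b", "b"]

print("naive SE:    ", round(naive_se_of_mean(values), 3))
print("clustered SE:", round(cluster_se_of_mean(values, psu), 3))
```

With perfect within-cluster correlation and clusters of three, the clustered standard error is larger by a factor of sqrt(3), the square root of the design effect 1 + 1 × (3 − 1).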

44
OLS Results From Sudaan
  • Here we correct for the survey's clustering on
    PSU (not on person or family)

Variable     Coefficient   SAS's T Value   Sudaan's T Value
Constant      5.21         133             83
Black        -0.16          18              8.8
Hispanic     -0.04           4.4            1.5
Age           0.02          18             13.4
High Grade    0.06          38             18.7
AFQT          0.001         16              8.3
Married       0.19          26             15.8

45
What Happened?
  • Adjusting for clustering using Sudaan resulted in
    most of the T-values falling by half. Most are
    still highly significant.
  • The Hispanic variable, which was considered
    highly significant with the SAS results (Pr <
    0.0001), is no longer statistically
    significant (Pr < 0.15) by most commonly used
    levels. (The problem is more severe with clustered
    characteristics.)

46
What Steps Are Needed To Adjust?
  • First, get geocode clearance. You need this
    clearance to access replicate and PSU data.
  • Second, extract all variables for your research
    plus the replicate and PSU values.
  • NLSY79: The PSU variable is R02191.45, titled
    "Stratum Number For Primary Sampling Units", and
    the replicate variable is R02191.46, titled
    "Within Stratum Replicate Of Primary Sampling
    Unit". PSU = 10*R02191.45 + R02191.46
  • NLSY97: The PSU variable is R13082.00, titled
    "PRIMARY SAMPLING UNIT (CODED)". The replicate
    variable is not released. Set replicate = 1 in
    your work.

47
What Steps Are Needed To Adjust?
  • Third, sort your data set by replicate and PSU.
  • Fourth, run Sudaan. We used the following
    command:
  • Proc Regress
  • data="C:\Documents and Settings\All Users\Desktop\ClusteringandWeighting\WageData.dbs"
  • filetype=ascii design=wr DEFT1 est_no=24000;
  • weight _ONE_;
  • nest REPLICAT PSU / MISSUNIT;
  • Model Ln_Pay = Black Hispanic Age HGC AFQT Marry;

48
Small Extension
  • The SAS file we used to create the previous
    example is called WageData.sas.
  • What happens when we add one more explanatory
    variable, height in inches?
  • Adding this variable investigates if taller
    people earn higher wages.
  • The created variable height is already part of
    the SAS data set.

49
Extension Results
  • Using SAS, the OLS regression results show
    height's coefficient is 0.004 and the t-value is
    3.34.
  • In simple language, this means each extra inch of
    height is associated with a 0.4% increase in
    hourly wages. The 3.34 T-value shows the
    coefficient is significant at the 99.9% level,
    suggesting height and wages are
    definitely related.
  • Using Sudaan to take clustering into account
    lowers the T-value to 2.0. Sudaan computes the
    height coefficient's significance level at 95%.
    Hence, adjusting for clustering means we no
    longer have almost complete statistical certainty
    in the relationship.
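
The significance levels quoted for the height coefficient can be checked approximately from the t-values alone. This sketch uses the normal approximation to the t distribution, which is an assumption: it is reasonable for a large sample, but a design-based test uses fewer degrees of freedom, so Sudaan's exact p-values will differ somewhat.

```python
import math


def two_sided_p(t_value):
    """Two-sided p-value for a t-value under the normal approximation."""
    return math.erfc(abs(t_value) / math.sqrt(2))


def percent_effect(log_coefficient):
    """Percent change in the outcome implied by a coefficient in a log-wage model."""
    return 100.0 * (math.exp(log_coefficient) - 1.0)


print("SAS t = 3.34    p =", round(two_sided_p(3.34), 4))
print("Sudaan t = 2.0  p =", round(two_sided_p(2.0), 4))
print("percent effect of 0.004:", round(percent_effect(0.004), 2))
```

A t-value of 3.34 falls below the 0.001 threshold and a t-value of 2.0 falls just below 0.05, matching the 99.9% and 95% levels in the text.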

50
What If You Do Not Have Sudaan (or Stata)?
  • One method of getting roughly similar results is
    to add extra geographic variables that track
    each PSU's characteristics to the regression.
  • Using just SAS, we reran the wage function and
    included, for each respondent's 1979 location,
    percent black, percent Hispanic, median income,
    whether the respondent resided in an SMSA of 2
    million people, and dummies for USA regions (see
    the file named WageDataPlusGeoVariables.sas).
  • Note we get results much like Sudaan's just by using
    location characteristics that are 20 years old.

51
Result of Adding Geographic Variables
  • The left 3 columns are the original wage
    equation. The right 2 columns are the results
    after adding the geographic variables. Like
    Sudaan regressions, adding the extra geographic
    indicators dramatically lowers the T-statistics.

Variable     Original Coefficient   Orig. T Value   New Coefficient   New T Value
Constant      5.21                  133              4.42             39.7
Black        -0.16                   18             -0.22             10.4
Hispanic     -0.04                    4.4           -0.065             3.1
Age           0.02                   18              0.02              9.5
High Grade    0.06                   38              0.06             18.2
I.Q.          0.001                  16              0.0015            9.2
Married       0.19                   26              0.22             15.1

52
Children of NLSY YAG
  • Children had already diffused by their teen years
    relative to their mothers
  • Clustering is more subtle, includes kin networks
  • Appropriateness of Sudaan and other routines is
    more problematic
  • The two alternatives are a complex random effects
    model or using geographic descriptors to explain
    the error components responsible for the design
    effect problem

53
Bottom Line
  • GIS systems have become essential to social
    scientists. The NLSYs have a lot of data on
    location, but use is restricted.
  • The oversamples and clustering of the NLSYs
    require you to think carefully about the impact
    of heterogeneity, weighting and clustering on
    your analysis. Weighting is not usually correct
    except when estimating univariate population
    moments.
  • Using geographic descriptors as regressors
    attenuates design effects.
  • For original cohorts, geography is very limited.