Title: GEOGRAPHY and WEIGHTS IN THE NLS
1GEOGRAPHY and WEIGHTS IN THE NLS
2The Plan for this Module
- Basics of geographic data
- Geography and sampling
- Level of detail available in the NLS and GIS data
- How you get geographic data
- Weighting
- Correct standard errors and geo-variables
- Using GIS and family-level data to enhance your
analysis plan
3Basics of Geographic Data
- Four Census Regions: this is the finest level of geography available for the original cohorts without going to a Census Research Data Center.
- States and counties
- Census tracts: size varies, thousands of persons
- Block groups: neighborhoods (almost)
4Counties
- Finest level routinely available in NLSY79 and NLSY97
- About 3,100 in the U.S.: Texas (254), Delaware (3), Georgia (159). FIPS codes designate them.
- Extensive socio-economic and demographic data available at county level (née City-County Data Book); you merge them in using FIPS codes
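A minimal sketch of a FIPS-based county merge, written in Python rather than SAS. The respondent records, the county file layout and the income figures are all made up for illustration; only the key construction (2-digit state code followed by 3-digit county code) is standard FIPS practice.

```python
# Hypothetical respondent records: (respondent_id, state_fips, county_fips)
respondents = [
    (1, 48, 201),   # Harris County, Texas
    (2, 10, 3),     # New Castle County, Delaware
    (3, 13, 121),   # Fulton County, Georgia
    (4, 48, 999),   # deliberately has no matching county record
]

# Hypothetical county-level file keyed by 5-digit FIPS code (values are invented)
county_data = {
    "48201": {"median_income": 50000},
    "10003": {"median_income": 60000},
    "13121": {"median_income": 55000},
}


def fips5(state, county):
    """Build the 5-digit county key: zero-padded state (2) + county (3)."""
    return f"{int(state):02d}{int(county):03d}"


def merge_county(respondents, county_data):
    """Left-merge county attributes onto respondents; unmatched rows get None."""
    merged = []
    for rid, state, county in respondents:
        key = fips5(state, county)
        merged.append({"id": rid, "fips": key, "county": county_data.get(key)})
    return merged


merged = merge_county(respondents, county_data)
unmatched = [row["id"] for row in merged if row["county"] is None]
```

Checking the unmatched list after every merge is worthwhile: dropped leading zeros in state or county codes are a common reason for silent non-matches.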
5Census Tracts
- In 2000 all of the U.S. was partitioned into tracts
- Size and population vary, but a tract can contain several thousand people in urban areas or a few hundred in rural areas
- Using these data requires clearance from BLS
6Block Group
- In urban areas, consist of groups of blocks (did you guess?)
- This is the finest level of aggregation for Census geography (building blocks for reapportionment)
- What non-rural folk think of as a neighborhood, except for those near the boundary of the block group
7Census Regions
8Census Tracts NYC Metro Area
9Census Tracts in Wyoming
10Sampling - Original Cohorts
- Original Cohorts drawn from experimental CPS sample frame in 1960s. Title 13 confidentiality restrictions prohibit release of geographic data below Census Region
- We recently geocoded these data (latitude and longitude) and one may use them at a Census Research Data Center
- The exact sampling structure was kept secret à la Raiders of the Lost Ark. Details may exist in a musty file in Suitland, MD, although the latitude and longitude data allow one to reverse engineer the sampling
11Sampling for the NLSYs: Multiple stages
- U.S. divided into Primary Sampling Units (PSUs): major metropolitan areas, counties, or groups of counties (rural areas)
- Selection probability proportional to population of interest; large cities always chosen
- Dividing PSUs into groups (strata) can ensure that the correct fraction of rural and suburban areas is chosen; this can reduce the sampling variance relative to a simple random sample (SRS)
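Why large cities are "always chosen" under probability-proportional-to-size selection can be sketched as follows. This is a generic textbook calculation, not the actual NLSY selection routine; the PSU names and population counts are invented.

```python
def pps_inclusion_probs(sizes, n):
    """Inclusion probabilities when n units are drawn with probability
    proportional to size. A unit whose size is at least the sampling interval
    (total size / draws left) is a certainty selection with probability 1."""
    certain = set()
    while len(certain) < n:
        rest = {u: s for u, s in sizes.items() if u not in certain}
        interval = sum(rest.values()) / (n - len(certain))
        new = {u for u, s in rest.items() if s >= interval}
        if not new:
            break
        certain |= new
    rest_total = sum(s for u, s in sizes.items() if u not in certain)
    slots = n - len(certain)
    return {
        u: 1.0 if u in certain else slots * s / rest_total
        for u, s in sizes.items()
    }


# Hypothetical PSU populations (in thousands); draw n = 2 PSUs
psu_sizes = {"BigCity": 800, "B": 100, "C": 50, "D": 30, "E": 20}
probs = pps_inclusion_probs(psu_sizes, 2)
```

Here the sampling interval is 1000 / 2 = 500, so "BigCity" (800) is selected with certainty and the remaining draw is spread over the other PSUs in proportion to size.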
12Next Stages
- Select tracts or block groups: list them in order by income or ethnic composition and pick every nth one, which ensures an even distribution over the ordered characteristics. Again, this can reduce sampling variance relative to an SRS.
- Segments of streets selected within block groups.
- List all addresses in selected segments; randomly select units for a screening interview to identify eligible persons
- This process generates area clusters of nearly contiguous respondents
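The "list in order and pick every nth one" step is systematic sampling. A minimal sketch, assuming a hypothetical list of 100 tracts ranked by income (the field names are illustrative):

```python
import random


def systematic_sample(units, sort_key, n, rng=None):
    """Order units by sort_key, then take every k-th unit from a random start."""
    rng = rng or random.Random()
    ordered = sorted(units, key=sort_key)
    step = len(ordered) / n          # sampling interval, may be fractional
    start = rng.random() * step      # random start in [0, step)
    last = len(ordered) - 1
    return [ordered[min(int(start + i * step), last)] for i in range(n)]


# Hypothetical frame: 100 tracts in arbitrary order, each with an income rank
rng = random.Random(0)
tracts = [{"tract": i, "income_rank": i} for i in range(100)]
rng.shuffle(tracts)

sample = systematic_sample(tracts, lambda t: t["income_rank"], 10, rng)
deciles = [t["income_rank"] // 10 for t in sample]
```

Because the frame is sorted before selection, the ten selected tracts fall one in each income decile, which is the even spread over the ordering variable that the slide describes.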
13Examples of PSU Clustering
- NLSY97: 100 PSUs in cross-sectional sample. 100 PSUs in minority oversample.
- NLSY79: 102 PSUs in cross-sectional sample. 100 PSUs in oversample. 38 PSUs in military oversample.
- We average about 50 respondents per PSU. The effect of clustering on statistical properties increases with the size of the cluster and the degree to which variables are correlated within cluster.
- Correlations within clusters increase the sampling variance and usually overcome the advantages of stratification.
14
- PSUs in NLSY79: initial screening (done in 1978), over yield
- PSUs in NLSY97: initial screening (done in 1997, screen and go), under yield
15Geographic Detail in NLS
- States and counties available in Geocode release
- Zipcode data kept at BLS and CHRR
- Census tract and block group identifiers at CHRR and BLS
- Latitude and longitude (accurate to about 50 feet) at CHRR
16Geocode: How you get it
- You need to apply to BLS (see Web site)
- Describe how you plan to use the data
- If BLS approves you, CHRR sends you a CD
- You need to return the CD when finished, and you are subject to audit and legal liabilities if you violate the terms of your agreement with BLS. BLS performs many audits; keep yourself in compliance.
17Geocode: How you use it
- You use the state and county codes to merge in the data you need
- Use standard FIPS codes
- There is a variable indicating when R is in a central city (this was done using zipcodes; before 1998, missing values show zips that are not unambiguously central/non-central)
- Data merge is a do-it-yourself project
18Zipcode Data
- CD is at BLS or CHRR and is not released
- The CD has Zipcodes, but matching and merging in the data you need is a do-it-yourself project
- You can have CHRR create a variable you need, with BLS approval
- Zipcode centroid can be used as a rough location of the respondent for simple distance calculations
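A simple distance calculation between two centroids can use the haversine (great-circle) formula. The sketch below is generic; the coordinates are arbitrary points, not actual Zipcode centroids or respondent locations.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


# Two arbitrary points one degree of latitude apart (roughly 69 miles)
d = haversine_miles(40.0, -83.0, 41.0, -83.0)
```

Centroid distances are only as precise as the Zipcode area is small, so they suit "rough" measures such as distance to the nearest hospital, not block-level proximity.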
19Fine Level Location
- Modern Geographic Information Systems data use latitude and longitude as the basis for linking data
- We geocode respondent addresses with latitude and longitude, sometimes with GPS units (all years except 1980)
- We place R within about 50 feet
- Opportunities to extend analysis abound
20Distance from R to
- Fast food restaurants
- Employers
- Doctors' offices
- Hospitals
- Freeways
- Schools, public and private
- Post offices
- Banks
- Bus stops
- Train stations
- State licensed day care centers
- Drug seizures and prices
- Air quality measures
- Toxic waste sites
21Data at Tract and Block Group Level
- Based on Decennial Census Long Form or American Community Survey (recent years)
- Ethnicity and color of people in area
- Average income, poverty rate, dispersion in income, housing attributes
- Population density, education, employment rates
22Other Sensitive Data for Analysis
- CHRR maintains the names of employers for each respondent in each round
- With BLS approval we can identify persons working for a particular sort of employer or match in employer characteristics
- The guiding principle is that these specialized extracts must not give you the ability to re-identify the respondent
23Ideas using detailed geography
- Does proximity to fast-food restaurants now and in the past correlate with BMI?
- Does current and past air quality have a relationship to the incidence of asthma?
- Does proximity to health care correlate with health outcomes?
- Is local income inequality related to health?
24
- Respondent location is generally chosen by the respondent. This problem of endogenous location may be attenuated or solved by using locational attributes at either the screening location or the age-15 location, which reflect primarily parental choice, not respondent choice.
- These past locational attributes can be used as either regressors or instrumental variables (IV). IV creates a variable that stands in for a regressor that is correlated with the error term.
25
- Some respondent choices may be endogenous to an outcome, such as smoking and the birth weight of one's infant. One could use the incidence of smoking by one's peers in the original PSU (or by one's siblings) as an instrumental variable.
- Peer smoking reflects shared socio-economic forces, but the weight of R's baby is unlikely to have an effect on the smoking behavior of R's peers.
- Need to avoid weak instruments, that is, instruments that do not explain much of the variation in the variable they stand in for.
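To make the IV idea concrete, here is a sketch on simulated data (not NLS data). The variable roles are assumptions for illustration: z is the instrument (think peer smoking in the original PSU), x is the endogenous regressor, u is an unobserved factor that affects both x and y, and the true effect of x on y is set to 2. The simple one-instrument estimator is cov(z, y) / cov(z, x).

```python
import random


def cov(a, b):
    """Sample covariance of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)


def ols_slope(x, y):
    return cov(x, y) / cov(x, x)


def iv_slope(z, x, y):
    """Single-instrument IV estimate of the effect of x on y."""
    return cov(z, y) / cov(z, x)


def first_stage_r2(z, x):
    """Share of the variation in x explained by z; low values flag a weak instrument."""
    return cov(z, x) ** 2 / (cov(z, z) * cov(x, x))


rng = random.Random(42)
z, x, y = [], [], []
for _ in range(20000):
    zi = rng.gauss(0, 1)                     # instrument, unrelated to u
    ui = rng.gauss(0, 1)                     # unobserved confounder
    xi = 0.8 * zi + ui + rng.gauss(0, 1)     # endogenous regressor
    yi = 2.0 * xi - 3.0 * ui + rng.gauss(0, 1)
    z.append(zi); x.append(xi); y.append(yi)

ols = ols_slope(x, y)        # biased because x is correlated with u
iv = iv_slope(z, x, y)       # close to the true effect of 2
r2 = first_stage_r2(z, x)    # instrument strength
```

In this setup OLS is pulled well below 2 by the confounder while the IV estimate recovers it; if the 0.8 in the first stage were shrunk toward zero, the instrument would become weak and the IV estimate unstable.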
26Using Fine-level Geography
- Make application to BLS
- CHRR can often create the variable for you if it does not threaten re-identification
- Rounding data reduces precision and reduces the threat of re-identification of tract, block group or zipcode
- Do the analysis at BLS or CHRR
27DIFFUSION OF THE SAMPLES: PSU Clusters in original NLSY97 Sample
But this clustering has broken down over time.
Here is where people live as of Round 6 in NLSY97
28PSU Clusters in original NLSY79 Sample 12,000
By Round 20 in NLSY79 Sample there is even more
geographic dispersion. 9,000
29Example of Segment Clustering
- In the NLSY97, one cluster of respondents was picked from the Lower East Side of Manhattan and another from around Yankee Stadium.
30Implications of Sample Design for Routine Data Use
- All NLS samples contain oversamples of Blacks, and the NLSYs oversample Hispanics. The NLSY79 oversamples of poor whites and military members have been discontinued.
- NLSY looks different from a Simple Random Sample
- Clusters of Rs may share unobservable characteristics
31Weighting
- Weight summary statistics to describe population
- For regressions, Gauss-Markov rules
- OLS is BLUE under standard conditions, including correct specification of the model
- Model heterogeneity does not call for weighted regression but rather weighting the various regression coefficients
32Weighting Horn of the Dilemma
33Using Weights
- Weights for the NLSY97 Round 7 range from a high of 1,785,202 to a low of 90,060, with two implied decimal places
- One respondent represents from 900 to 17,852 people; the average is about 2,500
- Zero weights indicate person not interviewed
- NLSY97 and NLSY79 have single-round weights representing the population in 1997 and 1978, not immigrants since screening
- NLSY97 has weights for the cross section (no oversamples) as well as panel weights
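The "two implied decimal places" convention means the stored integer must be divided by 100 before it is read as a count of people. A small sketch using the two extreme weights quoted above plus a zero weight (the list itself is illustrative):

```python
def weight_to_persons(raw_weight):
    """Convert a stored weight with two implied decimals to persons represented."""
    return raw_weight / 100


raw_weights = [1785202, 90060, 0]   # highest, lowest, and a non-interview

# Zero weights mark respondents not interviewed in the round, so drop them
interviewed = [w for w in raw_weights if w > 0]
persons = [weight_to_persons(w) for w in interviewed]
```

Forgetting the division inflates any weighted total by a factor of 100, although weighted means and medians are unaffected because the factor cancels.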
34A NLSY79 Example From 1994
- Blacks and Hispanics on average have lower wages than whites (see WeightingWageData.Sas).
- Unweighted: mean wage $12.50 per hour, median wage $10.15 per hour
- Weighted with the 1994 sample weight (R50804.00) to correct for oversampling: mean wage $13.60 per hour, median wage $11.10 per hour
- Weighting increases the average wage by roughly $1.00 per hour
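The mechanics behind the weighted and unweighted figures can be sketched with a toy data set. The five wages and weights below are invented (they are not the NLSY79 values); lower-wage respondents are given smaller weights to mimic an oversampled group.

```python
def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def weighted_median(values, weights):
    """Smallest value at which the cumulative weight reaches half the total."""
    half = sum(weights) / 2
    running = 0.0
    for v, w in sorted(zip(values, weights)):
        running += w
        if running >= half:
            return v


# Invented hourly wages; oversampled low-wage cases carry weight 1, others weight 3
wages = [8, 9, 10, 14, 20]
weights = [1, 1, 1, 3, 3]

unweighted_mean = sum(wages) / len(wages)      # 12.2
w_mean = weighted_mean(wages, weights)         # 129 / 9
w_median = weighted_median(wages, weights)     # 14
```

As in the NLSY79 example, down-weighting the oversampled low-wage group raises both the mean and the median toward the population values.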
35How Do I Weight Multiple Years?
- NLS has a custom weighting program that provides users with the ability to go beyond weighting just a single round
- Web version: http://www.nlsinfo.org/web-investigator. Allows you to weight a set of survey rounds.
- PC-SAS version: allows you to use the code that runs the web version on your own PC. Enables you to weight any set of respondent ids. This allows you to take into account event history data and item non-response. This is a powerful tool.
36Web Version
37PC-SAS Custom Weight Program
- Contact NLS User Services. They will send you a pair of PC-SAS programs, a set of data files and an input file. Jay Zagorsky at CHRR will help you.
- You must be comfortable making minor modifications to SAS programs and must have SAS installed on your computer.
- Program takes as input a sorted list of ids, one id per line. Program produces the same output as the web version
- This program allows you to weight data from an event history or other complex designs
38Clustering Standard Errors
- NLS has numerous clusters of respondents who are alike: same person in different rounds, siblings, people in same neighborhood
- Clustering means observations are not all independent (not i.i.d.): heterogeneity across persons and families plus spatial correlation
- PSU clustering is more of a problem than family clustering for variances: the design effect is d.e. $= 1 + \rho(k - 1)$, where $\rho$ is the intra-cluster correlation and $k$ is the cluster size (adjust s.e. by its square root). Large $k$ produces problems: PSU clusters are larger than families. But the same person in different rounds means large $\rho$.
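A quick numerical feel for the design-effect formula. The cluster size of 50 comes from the average respondents per PSU quoted earlier; every value of rho below is an assumed, illustrative number, not an estimate from NLS data.

```python
import math


def design_effect(rho, k):
    """Variance inflation from clustering: 1 + rho * (k - 1)."""
    return 1 + rho * (k - 1)


def se_inflation(rho, k):
    """Factor by which naive standard errors should be multiplied."""
    return math.sqrt(design_effect(rho, k))


def effective_n(n, rho, k):
    """Number of independent observations the clustered sample is worth."""
    return n / design_effect(rho, k)


psu_deff = design_effect(0.05, 50)     # small rho, large cluster
family_deff = design_effect(0.30, 2)   # larger rho, tiny cluster
panel_deff = design_effect(0.80, 10)   # same person over 10 rounds
```

With these assumed values the PSU design effect (3.45) exceeds the family one (1.3) even though the within-family correlation is six times larger, and repeated rounds on the same person are worse still (8.2).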
39Clustering Standard Errors (cont.)
- If intra-cluster correlations are high, the number of effective observations approaches the number of clusters, not the number of observations
- OLS is still consistent and unbiased, but you must use GLS for correct standard errors
- Design effects in regressions are perhaps better described as misspecification effects, as the intracluster correlation is due to unobserved variables affecting the cluster
40Household Clustering
- NLSY97: 4,027 respondents came from homes that had multiple respondents.
- There were six homes that each provided five respondents.
- NLSY79: 5,914 respondents came from homes that had multiple respondents.
- There were four homes that each provided six respondents.
- Data on siblings allow us to separate effects of household versus individual characteristics
- For original cohorts, refer to the multiple respondent file to detect parents and children across cohorts and siblings both within and across cohorts
41Effect of Clustering on Std Errors
- NLSY79: explain the log of male hourly wage
- Regress log hourly wages on race, age, education, AFQT score and marital status. Details are in WageData.sas
- $Y = XB + u_i + v_{ij} + w_{ijk} + z_{ijkt}$
- $u_i$ is the error for PSU $i$, $v_{ij}$ is the component for PSU $i$ and family $j$, $w_{ijk}$ is the component for PSU $i$, family $j$ and person $k$, and $z_{ijkt}$ is idiosyncratic
42OLS Results From SAS
- Results using OLS with SAS. Note high T-values.
Variable     Coefficient   T Value
Constant     5.21          133
Black        -0.16         18
Hispanic     -0.04         4.4
Age          0.02          18
High Grade   0.06          38
AFQT         0.001         16
Married      0.19          26
43How To Fix Problem
- There are at least two statistical packages designed to fix the clustering problem.
- Sudaan (www.rti.org/sudaan) is a special-purpose package designed to fix clustering issues. Integrates with SAS.
- Stata (www.stata.com) is a general-purpose statistical program. To adjust for clustering for means, use the svyset command; for regression, use robust cluster (Huber-White).
- No clustering data available for Original Cohorts
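What a Huber-White cluster adjustment does can be sketched for a one-regressor model. This is a bare-bones illustration on simulated data, not Sudaan's or Stata's implementation: it omits the finite-sample corrections that packages typically apply, and the data-generating values (60 PSUs of 25 people, a PSU-level regressor, a shared PSU error) are assumptions.

```python
import math
import random
from collections import defaultdict


def ols_with_cluster_se(x, y, cluster):
    """Simple regression of y on x; returns slope, naive SE, cluster-robust SE."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    xd = [xi - xbar for xi in x]
    sxx = sum(d * d for d in xd)
    slope = sum(d * (yi - ybar) for d, yi in zip(xd, y)) / sxx
    intercept = ybar - slope * xbar
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]

    # Naive SE: treats all n residuals as independent
    se_iid = math.sqrt(sum(e * e for e in resid) / (n - 2) / sxx)

    # Cluster-robust SE: sum x-deviation * residual within each cluster first,
    # so correlated errors inside a PSU are not counted as independent evidence
    score = defaultdict(float)
    for d, e, g in zip(xd, resid, cluster):
        score[g] += d * e
    se_cluster = math.sqrt(sum(s * s for s in score.values())) / sxx
    return slope, se_iid, se_cluster


rng = random.Random(1)
x, y, psu = [], [], []
for g in range(60):
    xg = rng.gauss(0, 1)          # regressor shared by everyone in the PSU
    ug = rng.gauss(0, 1)          # unobserved PSU-level error component
    for _ in range(25):
        x.append(xg)
        y.append(1.0 + 0.5 * xg + ug + rng.gauss(0, 1))
        psu.append(g)

slope, se_iid, se_cluster = ols_with_cluster_se(x, y, psu)
```

The slope is the same under both approaches; only the standard error changes, and it grows most for regressors that vary little within clusters.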
44OLS Results From Sudaan
- Here we correct for the survey's clustering on PSU (not on person or family)
Variable     Coefficient   SAS's T Value   Sudaan's T Value
Constant     5.21          133             83
Black        -0.16         18              8.8
Hispanic     -0.04         4.4             1.5
Age          0.02          18              13.4
High Grade   0.06          38              18.7
AFQT         0.001         16              8.3
Married      0.19          26              15.8
45What Happened?
- Adjusting for clustering using Sudaan resulted in most of the T-values falling by about half. Most are still highly significant.
- The Hispanic variable, which was considered highly significant with the SAS results (Pr < 0.0001), is no longer statistically significant (Pr < 0.15) by most commonly used levels. (The problem is more severe with clustered characteristics.)
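The link between the T-values and the quoted probabilities can be checked with a two-sided p-value. The sketch uses the normal approximation, which is reasonable for samples of this size; the t-values are the Hispanic entries from the SAS and Sudaan results.

```python
import math


def two_sided_p(t):
    """Two-sided p-value for a t-statistic under the normal approximation."""
    return math.erfc(abs(t) / math.sqrt(2))


p_sas = two_sided_p(4.4)      # Hispanic, unadjusted OLS
p_sudaan = two_sided_p(1.5)   # Hispanic, adjusted for PSU clustering
```

The unadjusted value is far below 0.0001, while the cluster-adjusted value falls between 0.10 and 0.15, which is why the coefficient no longer passes the usual 5 or 10 percent thresholds.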
46What Steps Are Needed To Adjust?
- First, get geocode clearance. You need this clearance to access replicate and PSU data.
- Second, extract all variables for your research plus the replicate and PSU values.
- NLSY79: The PSU variable is R02191.45, titled "Stratum Number For Primary Sampling Units", and the replicate variable is R02191.46, titled "Within Stratum Replicate Of Primary Sampling Unit". PSU = 10*R02191.45 + R02191.46
- NLSY97: The PSU variable is R13082.00, titled "PRIMARY SAMPLING UNIT (CODED)". The replicate variable is not released. Set replicate = 1 in your work.
47What Steps Are Needed To Adjust?
- Third, sort your data set by replicate and PSU.
- Fourth, run Sudaan. We used the following command:
    Proc Regress
      data="C:\Documents and Settings\All Users\Desktop\ClusteringandWeighting\WageData.dbs"
      filetype=ascii design=wr DEFT1 est_no=24000;
    weight _ONE_;
    nest REPLICAT PSU / MISSUNIT;
    Model Ln_Pay = Black Hispanic Age HGC AFQT Marry;
48Small Extension
- The SAS file we used to create the previous example is called WageData.sas.
- What happens when we add one more explanatory variable, height in inches?
- Adding this variable investigates whether taller people earn higher wages.
- The created variable height is already part of the SAS data set.
49Extension Results
- Using SAS, the OLS regression results show that height's coefficient is 0.004 and the t-value is 3.34.
- In simple language, this means each extra inch of height is associated with a 0.4 percent increase in hourly wages. The 3.34 T-value shows the coefficient is significant at the 99.9 percent level, suggesting height and wages are definitely related.
- Using Sudaan to take clustering into account lowers the T-value to 2.0. Sudaan computes the height coefficient's significance level at 95 percent. Hence, adjusting for clustering means we no longer have almost complete statistical certainty in the relationship.
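Because the dependent variable is the log of hourly pay, a coefficient is read as an approximate proportional change. A short sketch of the conversion for the 0.004 height coefficient (the function name is illustrative):

```python
import math


def pct_change_from_log_coef(beta):
    """Exact percent change in the wage for a one-unit increase in the regressor
    when the dependent variable is the natural log of the wage."""
    return 100 * (math.exp(beta) - 1)


height_effect = pct_change_from_log_coef(0.004)   # about 0.4 percent per inch
```

For small coefficients the exact figure and the shortcut of multiplying by 100 nearly coincide; the gap widens for larger coefficients such as the 0.19 on marital status.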
50What If You Do Not Have Sudaan (or Stata)?
- One method of getting roughly similar results is to add extra geographic variables, which track each PSU's characteristics, to the regression.
- Using just SAS, we reran the wage function and included, for each respondent's 1979 location, percent black, percent Hispanic, median income, whether the respondent resided in an SMSA of 2 million people, and dummies for U.S. regions (see the file named WageDataPlusGeoVariables.sas).
- Note that we get results much like Sudaan's just using location characteristics that are 20 years old.
51Result of Adding Geographic Variables
- The left 3 columns are the original wage
equation. The right 2 columns are the results
after adding the geographic variables. Like
Sudaan regressions, adding the extra geographic
indicators dramatically lowers the T-statistics.
Variable     Original Coefficient   Orig. T Value   New Coefficient   New T Value
Constant     5.21                   133             4.42              39.7
Black        -0.16                  18              -0.22             10.4
Hispanic     -0.04                  4.4             -0.065            3.1
Age          0.02                   18              0.02              9.5
High Grade   0.06                   38              0.06              18.2
AFQT         0.001                  16              0.0015            9.2
Married      0.19                   26              0.22              15.1
52Children of NLSY YAG
- Children had already diffused by their teen years relative to their mothers
- Clustering is more subtle, includes kin networks
- Appropriateness of Sudaan and other routines more problematic
- The two alternatives are a complex random effects model or using geographic descriptors to explain the error components responsible for the design effect problem
53Bottom Line
- GIS systems have become essential to social scientists. The NLSYs have a lot of data on location, but use is restricted.
- The oversamples and clustering of the NLSYs require you to think carefully about the impact of heterogeneity, weighting and clustering on your analysis. Weighting is not usually correct except when estimating univariate population moments.
- Using geographic descriptors as regressors attenuates design effects.
- For original cohorts, geography is very limited.