Title: A Stata program for calibration weighting
1 A Stata program for calibration weighting
- John D'Souza
- National Centre for Social Research
2 Outline
- Description of calibration
  - Adjust selection weights so that the weighted sample exactly matches the population
  - Generalizes post-stratification
  - Several methods: linear, logistic
- SAS, GenStat
- A new Stata program
- Limitations and extensions
3 Sampling
- Selection weights: $d_k = 1/P(\text{person } k \text{ is chosen})$
- Sample frame variables $X_{k1}, \dots, X_{kJ}$ with known population totals $P_1, \dots, P_J$
- Horvitz-Thompson estimator of $P_i$ (approximate equality only):
  - $\sum_k d_k X_{ki} \approx P_i$ for $i = 1, 2, \dots, J$
- Calibration: adjust $d_k$ to get calibration weights $w_k$, giving exact equality:
  - $\sum_k w_k X_{ki} = P_i$ for $i = 1, 2, \dots, J$
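To make the gap between the Horvitz-Thompson totals and the known totals concrete, here is a minimal sketch in Python, used purely for illustration and not part of the Stata program. The sample rows, selection weights and population totals are all invented.

```python
# Toy sample, all values invented: selection weight d_k, then boy, girl, FSM.
sample = [
    (100.0, 1, 0, 0),
    (120.0, 0, 1, 1),
    (80.0, 1, 0, 1),
    (150.0, 0, 1, 0),
]
pop_totals = [200.0, 250.0, 190.0]  # assumed known totals P_i (hypothetical)


def weighted_totals(weights, rows):
    """Return sum_k weight_k * X_ki for each variable i."""
    n_vars = len(rows[0])
    return [sum(w * row[i] for w, row in zip(weights, rows)) for i in range(n_vars)]


d = [row[0] for row in sample]
X = [row[1:] for row in sample]
ht_totals = weighted_totals(d, X)
gaps = [est - p for est, p in zip(ht_totals, pop_totals)]

for name, est, p, gap in zip(["boy", "girl", "FSM"], ht_totals, pop_totals, gaps):
    print(f"{name:5s} estimate={est:7.1f} population={p:7.1f} gap={gap:+.1f}")
```

Calibration replaces the weights `d` with weights `w` for which every gap is exactly zero.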
4 Example: School Census
- Variables include
  - Age, Gender, Ethnic Group, Exam results
  - Type of School, Region
  - Pupil's Free School Meal (FSM) eligibility
- We calibrate to J variables, e.g.
  - Boy (binary)
  - Girl (binary)
  - Region (e.g. four categories)
  - FSM eligibility (binary)
- J = 1 + 1 + (4 - 1) + 1 = 6
5 Special case: post-stratification
- Simplest case
  - One categorical variable
  - Easy to deal with (post-stratification)
  - svyset , poststrata() postweight()
- More general case
  - Several variables (categorical and numerical)
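For the one-variable case, the adjustment is a simple ratio within each category. The sketch below shows that arithmetic in Python for illustration only; the weights and category totals are invented, and `poststratify` is a hypothetical helper, not a Stata command.

```python
from collections import defaultdict


def poststratify(d, group, pop_totals):
    """Scale selection weights so each group's weights sum to its population total."""
    weighted_n = defaultdict(float)
    for dk, g in zip(d, group):
        weighted_n[g] += dk
    return [dk * pop_totals[g] / weighted_n[g] for dk, g in zip(d, group)]


d = [100.0, 120.0, 80.0, 150.0]          # hypothetical selection weights
sex = ["boy", "girl", "boy", "girl"]     # the single categorical variable
P = {"boy": 200.0, "girl": 250.0}        # hypothetical population totals

w = poststratify(d, sex, P)
for g in P:
    total = sum(wk for wk, gk in zip(w, sex) if gk == g)
    print(g, round(total, 6))
```

With several calibration variables there is no single ratio per case, which is why the general problem needs the distance-minimizing approach.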
6 Deville and Särndal (1992)
- Minimize the distance between w and d subject to the J calibration constraints
- Linear calibration: minimize
  - $\sum_{k \in S} (w_k - d_k)^2 / d_k$
  - Involves solving J simultaneous linear equations
- Logistic calibration: minimize
  - $\sum_{k \in S} \left( w_k \log(w_k/d_k) - w_k + d_k \right)$
  - Involves solving J simultaneous non-linear equations
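A short Lagrange-multiplier argument shows where the J equations come from. The notation $x_k = (X_{k1}, \dots, X_{kJ})^\top$, $P = (P_1, \dots, P_J)^\top$ and the multiplier vector $\lambda$ are introduced here for the sketch and do not appear on the slides.

```latex
\begin{align*}
\text{Linear:}\quad
  L(w,\lambda) &= \sum_{k \in S} \frac{(w_k - d_k)^2}{d_k}
                  - 2\lambda^\top\Big(\sum_{k \in S} w_k x_k - P\Big), \\
  \frac{\partial L}{\partial w_k} = 0
  &\;\Rightarrow\; w_k = d_k\,(1 + x_k^\top \lambda), \\
  \text{constraints}
  &\;\Rightarrow\; \Big(\sum_{k \in S} d_k x_k x_k^\top\Big)\lambda
                   = P - \sum_{k \in S} d_k x_k . \\[1ex]
\text{Logistic:}\quad
  L(w,\lambda) &= \sum_{k \in S} \big(w_k \log(w_k/d_k) - w_k + d_k\big)
                  - \lambda^\top\Big(\sum_{k \in S} w_k x_k - P\Big), \\
  \frac{\partial L}{\partial w_k} = 0
  &\;\Rightarrow\; w_k = d_k \exp(x_k^\top \lambda), \\
  \text{constraints}
  &\;\Rightarrow\; \sum_{k \in S} d_k \exp(x_k^\top \lambda)\, x_k = P .
\end{align*}
```

The linear case is a J by J linear system in $\lambda$; the logistic case is non-linear in $\lambda$ and has to be solved iteratively.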
7 GenStat, SAS, Stata
- GenStat and SAS
  - Methods: linear, logistic and bounded
  - Estimation: GenStat gives SEs
  - SAS handles categorical variables directly; in GenStat, enter them as indicator variables
- Stata
  - Post-stratification (calibration to one categorical variable); gives SEs
  - No routine for general calibration
8 A new Stata program
- Typical syntax:
    matrix M = (10000, 10000, 3000, 4000, 3000, 8000)
    calibrate , entrywt(w1) exitwt(w2) poptot(M) ///
        marginals(boy girl FSM ireg1-ireg3) ///
        method(linear) print(final)
- The entries of M follow the order of marginals(): 10,000 boys, 10,000 girls, 3,000 FSM, then the three region totals
- Variables boy, girl and FSM are binary
- Categorical variable region (4 categories) is turned into 4 binary indicator variables; only 3 are entered in the syntax, because the fourth is collinear with the others
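To show the arithmetic behind linear calibration, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the code of the `calibrate` program: the data are invented, `solve` and `linear_calibrate` are hypothetical helpers, and the linear system is solved by Gaussian elimination rather than by an explicit matrix inverse.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular system: check for collinear marginals")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def linear_calibrate(d, X, P):
    """Return w_k = d_k * (1 + x_k . lam), where lam solves the J linear equations."""
    J = len(P)
    # T = sum_k d_k x_k x_k'
    T = [[sum(dk * x[i] * x[j] for dk, x in zip(d, X)) for j in range(J)] for i in range(J)]
    ht = [sum(dk * x[i] for dk, x in zip(d, X)) for i in range(J)]
    lam = solve(T, [P[i] - ht[i] for i in range(J)])
    return [dk * (1.0 + sum(l * xi for l, xi in zip(lam, x))) for dk, x in zip(d, X)]


# Toy data, all invented. Columns of X: boy, girl, FSM.
d = [100.0, 120.0, 80.0, 150.0, 90.0, 110.0]
X = [(1, 0, 0), (0, 1, 1), (1, 0, 1), (0, 1, 0), (1, 0, 0), (0, 1, 1)]
P = [300.0, 350.0, 320.0]

w = linear_calibrate(d, X, P)
totals = [sum(wk * x[i] for wk, x in zip(w, X)) for i in range(len(P))]
print([round(t, 6) for t in totals])
```

Linear calibration places no bounds on the ratio of exit to entry weights, so with less well-behaved data some weights can come out negative.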
9 Output
Variable  Pop total  Weighted (entrywt)  Weighted (exitwt)  R
boy       10000      9619.7188           10000              .21373408
girl      10000      10380.281           10000              .13733883
FSM       3000       2915.4929           3000               .04710333
ireg1     4000       4056.3379           4000               -.19511394
ireg2     3000       3197.1831           3000               -.24808005
ireg3     8000       8507.042            8000               -.2391432
10 Options
- Options available to
  - Control amount of output/graphs
  - Set max number of iterations/tolerance
- Methods
  - linear, logistic, bounded linear and nonresp
  - blinear sets bounds for $w_k/d_k$ (GenStat and SAS have something very similar)
  - nonresp adjusts for non-response (see below)
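The iteration and tolerance settings matter for the non-linear methods. As a sketch of why, the Python below solves the equations for the distance used in logistic calibration, where the weights take the form d_k * exp(x_k . lam), by Newton's method. This is an assumption about how such equations can be solved, not a description of the Stata program's internals; the data, `solve`, `exp_calibrate`, `max_iter` and `tol` are all hypothetical.

```python
import math


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def exp_calibrate(d, X, P, max_iter=50, tol=1e-8):
    """Newton iteration for weights of the form w_k = d_k * exp(x_k . lam)."""
    J = len(P)
    lam = [0.0] * J
    for n_pass in range(1, max_iter + 1):
        w = [dk * math.exp(sum(l * xi for l, xi in zip(lam, x))) for dk, x in zip(d, X)]
        resid = [P[i] - sum(wk * x[i] for wk, x in zip(w, X)) for i in range(J)]
        if max(abs(r) for r in resid) < tol:
            return w, n_pass
        # Jacobian of the weighted totals with respect to lam: sum_k w_k x_k x_k'
        T = [[sum(wk * x[i] * x[j] for wk, x in zip(w, X)) for j in range(J)] for i in range(J)]
        step = solve(T, resid)
        lam = [l + s for l, s in zip(lam, step)]
    raise RuntimeError("no convergence within max_iter passes")


# Toy data, all invented. Columns of X: boy, girl, FSM.
d = [100.0, 120.0, 80.0, 150.0, 90.0, 110.0]
X = [(1, 0, 0), (0, 1, 1), (1, 0, 1), (0, 1, 0), (1, 0, 0), (0, 1, 1)]
P = [300.0, 350.0, 320.0]

w, n_pass = exp_calibrate(d, X, P)
print(n_pass, [round(wk, 3) for wk in w])
```

Weights of this form are always positive, which is one reason to prefer a non-linear method over the linear one.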
11 Limitations (1)
1. Solves the equations by finding a matrix inverse
  - Won't work if J is large
2. Can have problems with singular or nearly singular matrices
3. Iterative methods (logistic, blinear) won't always converge
- No obvious solution to problem 1; problems 2 and 3 are usually down to problems with the data
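One data problem that produces a singular matrix is entering a redundant set of indicators. The Python sketch below, with invented rows and weights, enters boy, girl and all four region indicators, and checks that the matrix linear calibration would need to invert sends a non-zero vector to zero, so no inverse exists.

```python
# Columns: boy, girl, ireg1, ireg2, ireg3, ireg4 (toy rows, all invented).
X = [
    (1, 0, 1, 0, 0, 0),
    (0, 1, 0, 1, 0, 0),
    (1, 0, 0, 0, 1, 0),
    (0, 1, 0, 0, 0, 1),
    (1, 0, 0, 1, 0, 0),
    (0, 1, 1, 0, 0, 0),
]
d = [100.0, 120.0, 80.0, 150.0, 90.0, 110.0]  # hypothetical selection weights
J = len(X[0])

# T = sum_k d_k x_k x_k'
T = [[sum(dk * x[i] * x[j] for dk, x in zip(d, X)) for j in range(J)] for i in range(J)]

# boy + girl = 1 and ireg1 + ... + ireg4 = 1 on every row, so x_k . c = 0 for:
c = [1, 1, -1, -1, -1, -1]
Tc = [sum(T[i][j] * c[j] for j in range(J)) for i in range(J)]
print(Tc)
```

Dropping one indicator from the region set removes the dependency, because each remaining column then carries information the others do not.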
12 Limitations (2)
- We need to recode categorical variables (SAS doesn't)
  - Stata: tab region, gen(ireg)
- More complicated (e.g. two-phase) problems aren't handled directly
  - Need a bit of syntax to handle this
  - Other packages can handle this directly
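The recoding step turns one categorical variable into one 0/1 column per category, named with a common stub. A minimal Python sketch of that transformation, with the hypothetical helper `make_indicators` and invented region codes:

```python
def make_indicators(values, stub):
    """Return one 0/1 list per distinct value, keyed stub1, stub2, ... in sorted order."""
    levels = sorted(set(values))
    return {
        f"{stub}{i}": [1 if v == level else 0 for v in values]
        for i, level in enumerate(levels, start=1)
    }


region = [1, 1, 2, 1, 3, 1, 3, 2, 1]  # invented region codes
indicators = make_indicators(region, "ireg")

for name, column in indicators.items():
    print(name, column)
```

Each row has exactly one indicator equal to 1, which is why one column of the set must be left out of the calibration when another complete set of indicators is already entered.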
13 Extensions: Standard errors
- Calibration weights are often incorrectly treated as selection weights
    calibrate , entrywt(w1) exitwt(w2) poptot(M) ///
        marginals(boy girl FSM ireg1-ireg3)
    calibmean , selwt(w1) calibwt(w2) yvar(y) ///
        marginals(boy girl FSM ireg1-ireg3) ///
        psu(school) designops (strata(region))
- This generalizes Stata's poststrata() option
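The reason the distinction matters can be seen from the standard linearization result for calibration estimators (Deville and Särndal, 1992). It is stated here for a total and as background only; whether calibmean computes exactly this is an assumption. The symbols $x_k$, $\hat{B}$ and $e_k$ are introduced for the sketch.

```latex
\begin{align*}
\hat{B} &= \Big(\sum_{k \in S} d_k x_k x_k^\top\Big)^{-1} \sum_{k \in S} d_k x_k y_k,
\qquad e_k = y_k - x_k^\top \hat{B}, \\
\hat{V}\Big(\sum_{k \in S} w_k y_k\Big)
  &\approx \hat{V}_{\text{design}}\Big(\sum_{k \in S} w_k e_k\Big).
\end{align*}
```

Here $\hat{V}_{\text{design}}$ is the usual design-based variance estimator (PSUs, strata) applied to the residuals $e_k$ rather than to $y_k$. Treating $w_k$ as selection weights amounts to using $y_k$ in place of $e_k$, which ignores the variance explained by the calibration variables.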
14 Extension: Method nonresp (1)
- Example
  - Select schools, then classes, then pupils
  - Assume all schools respond; pupils might not
- Variables available on responders (pop totals available)
  - Gender, Exam results, FSM, Region
- Variables on non-responders (pop totals not available)
  - PTratio: Pupil-teacher ratio
  - topset: Is pupil in the top set?
15 Extension: Method nonresp (2)
        serial  region  topset  outc  sex  FSM
        --------------------------------------
    1.  1001    1       1       0     .    .
    2.  1002    1       0       1     1    0
    3.  1003    2       0       0     .    .
    4.  1004    1       0       1     1    1
    5.  1005    3       1       0     .    .
        --------------------------------------
    6.  1006    1       0       1     0    1
    7.  1007    3       1       1     1    0
    8.  1008    2       1       0     .    .
    9.  1009    1       0       1     1    0
16 Extension: Method nonresp (3)
- Population totals unknown, but variables are available on all the sample (including non-responders)
    calibrate , entrywt(w1) exitwt(w2) poptot(M) ///
        marginals(boy girl FSM ireg1-ireg3) ///
        method(nonresp) outc(outc) ///
        svars(PTratio topset)
- Responders weighted to pop totals on marginals and to selected sample totals on svars (Lundström and Särndal, 2005)
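The two kinds of target can be assembled as follows. This Python sketch uses the nine example records from the listing on slide 15, with an invented selection weight of 100 for every pupil and an invented FSM population total. It only builds the targets and the responder data; it does not claim to reproduce what method(nonresp) does internally.

```python
# (serial, region, topset, outc, sex, FSM); None marks values missing for non-responders.
rows = [
    (1001, 1, 1, 0, None, None),
    (1002, 1, 0, 1, 1, 0),
    (1003, 2, 0, 0, None, None),
    (1004, 1, 0, 1, 1, 1),
    (1005, 3, 1, 0, None, None),
    (1006, 1, 0, 1, 0, 1),
    (1007, 3, 1, 1, 1, 0),
    (1008, 2, 1, 0, None, None),
    (1009, 1, 0, 1, 1, 0),
]
d = [100.0] * len(rows)   # hypothetical selection weights
fsm_pop_total = 300.0     # hypothetical population total for the marginal FSM

# svars target: selection-weighted total over the WHOLE selected sample.
topset_target = sum(dk * r[2] for dk, r in zip(d, rows))

# Only responders (outc == 1) receive calibration weights.
responders = [(dk, r) for dk, r in zip(d, rows) if r[3] == 1]
d_resp = [dk for dk, _ in responders]
X_resp = [(r[5], r[2]) for _, r in responders]   # columns: FSM, topset

targets = [fsm_pop_total, topset_target]
entry_totals = [sum(dk * x[i] for dk, x in zip(d_resp, X_resp)) for i in range(2)]
print("targets:", targets)
print("responder totals before calibration:", entry_totals)
```

Any of the calibration methods can then be applied to `d_resp` and `X_resp` with `targets` in place of a vector of pure population totals.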
17 Conclusions
- We've found the program can handle many practical problems
- Easy to calculate SEs (but theory assumes no non-response)
- Method nonresp isn't available in many packages
- We don't have to calibrate to population totals
  - E.g. calibrate Wave n+1 of a survey to totals from Wave n
  - Calibrate one sample to look like another
18 Questions
19 References
- Deville, J.-C. and Särndal, C.-E. 1992. Calibration estimators in survey sampling. Journal of the American Statistical Association 87: 376-382.
  - Background and theory behind calibration
- Lundström, S. and Särndal, C.-E. 2005. Estimation in Surveys with Nonresponse. Wiley.
  - Deals with non-response
- Singh, A.C. and Mohl, C.A. 1996. Understanding calibration estimators in survey sampling. Survey Methodology 22: 107-115.
  - Discusses several methods of doing bounded calibration