Title: SiStaN in Brief
A balanced sampling approach for multiway stratification design for small area estimation
Piero Demetrio Falorsi - Paolo Righi
ISTAT
Index
- The issue of multivariate-multidomain sampling strategy
- The proposed sampling strategy
- Balanced sample for multiway stratification
- Modified GREG estimator
- The algorithm for the sample size definition
- Application fields and experiments
1. The issue of multivariate-multidomain sampling strategy
When planning the sampling strategy for a survey that must produce estimates for several domains (defined by non-nested partitions of the population), one issue is to define the sample size so that the sampling errors of the domain estimates of several parameters stay below given thresholds. The sampling strategy proposed here deals with multivariate-multidomain surveys in which the overall sample size must satisfy budget constraints. The standard solution, a stratification obtained by cross-classifying the domain variables, is often not feasible because the number of strata can exceed the overall sample size. Moreover, even if the overall sample size is large enough to cover all the strata, the resulting allocation could lead to an inefficient design.
1. The issue of multivariate-multidomain sampling strategy
- Population
- Planned and actual sample with cross-classification stratification
1. The issue of multivariate-multidomain sampling strategy
- Example: Business Structural Statistics
- 36,000 cross-classification strata
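A minimal sketch of why cross-classification grows so quickly compared with controlling only the margins. The partition sizes below are hypothetical and chosen for illustration only; they are not the actual partitions of the Business Structural Statistics survey.

```python
def count_cells(levels_per_partition):
    """Return (cross-classified strata, marginal domains) for the given partitions."""
    cross = 1
    for m in levels_per_partition:
        cross *= m                      # every combination of categories is a stratum
    marginal = sum(levels_per_partition)  # multi-way design controls only the margins
    return cross, marginal

# Hypothetical numbers of domains in three non-nested partitions.
levels = [20, 24, 75]
cross, marginal = count_cells(levels)
print(f"cross-classified strata: {cross}, marginal domains: {marginal}")
```

With a cross-classified design at least one unit per stratum is needed, whereas a multi-way design only has to fix the sample size in each marginal domain.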
Standard strategy
1. The issue of multivariate-multidomain sampling strategy
- The standard solution for obtaining planned domains adopts a cross-stratified sampling design built by combining the domains
- Consequences
  - when the population size in many strata is small, the stratification scheme could be inefficient
  - if the different partitions into domains of interest are not nested, the allocation of the sample in the cross-classified strata may differ substantially from the optimal allocation for the domains of a given partition
  - the sample size needed to cover all strata could be too large for the survey's economic constraints
  - in surveys repeated over time, statistical burden may arise if some strata contain only a few units in the population.
1. The issue of multivariate-multidomain sampling strategy
- One possible solution is multi-way stratification
- Several sophisticated solutions have been proposed to keep the sample size under control in all the categories of the stratifying variables without using a cross-classification design. These methods are generally referred to as multi-way stratification techniques, and have been developed under two main approaches:
  - Latin squares or Latin lattices schemes (Bryant et al., 1960; Jessen, 1970): independence among rows and columns is assumed, and these methods work only if all the cross-strata exist in the population.
  - Controlled rounding problems solved via linear programming (Causey et al., 1985; Sitter and Skinner, 1994): these methods are computationally very complex, do not always reach a solution, and the inclusion probabilities (both simple and joint) cannot be computed immediately.
- The main weaknesses of these approaches derive from their computational complexity and from the fact that a solution is not always reached.
2. The proposed sampling strategy
- The aim of this work is to define a sampling strategy that is optimal with regard to both the sampling scheme and the estimator used, by exploiting the available auxiliary information in both phases
  - Define a probabilistic sampling method
  - Realize a multiway stratification based on balanced sampling, controlling the sample size of the marginal domains
  - Use a modified GREG estimator
  - Define the sample allocation, aiming at controlling the sampling errors on the margins, using a variance estimator that jointly takes into account the regression model underlying the GREG estimator and the balanced sampling design
- The strategy may take into account a simple (Fay-Herriot) small area estimator
- The proposed overall sampling strategy is easy to implement, and software has been developed for each phase
- It can be extended to different contexts (considering the anticipated variance or the use of indirect small area estimators)
- It makes it possible to develop a sampling strategy for small area estimation that considers the sampling and estimation phases jointly
2. The proposed sampling strategy
- Notation
- Denote with
  - U the population of size N
  - Ub the b-th partition of U into Mb domains Ubd, b = 1, ..., B, d = 1, ..., Mb
  - the value of the r-th (r = 1, ..., R) variable of interest in the k-th population unit
  - the domain membership indicator
  - n the overall fixed sample size
  - the r-th parameter of interest
3. Balanced sampling and multi-way stratification
- Balanced sampling is a class of designs using auxiliary information.
- Its properties have been studied under
  - the model-based approach (Royall and Herson, 1973; Valliant et al., 2000)
  - the design-based approach (Deville and Tillé, 2004, 2005).
- In the following we consider the design-based or model-assisted approach
3. Balanced sampling and multi-way stratification
- Let us define the sampling design p(.) with given inclusion probabilities as a design which assigns a probability p(s) to each sample s such that every unit is included with its prescribed inclusion probability, the sample being described by a vector of sample indicators.
- Let a vector of Q auxiliary variables be known for each unit in the population. The sampling design p(s) is said to be balanced with respect to the Q auxiliary variables if and only if it satisfies the balancing equations: the weighted sample total of each auxiliary variable equals its population total, the sample weight of a unit being the reciprocal of its inclusion probability
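The slide's formulas did not survive extraction. In a commonly used notation (the symbols below are assumptions, not necessarily the authors' own), the design condition and the balancing equations can be written as:

```latex
% Assumed notation: \pi_k inclusion probability of unit k,
% \lambda_k sample indicator of unit k (1 if k is in s, 0 otherwise),
% \mathbf{z}_k vector of the Q auxiliary variables, d_k = 1/\pi_k sample weight.
E_p(\lambda_k) = \sum_{s \ni k} p(s) = \pi_k , \qquad k \in U ,
\qquad\qquad
\sum_{k \in s} d_k \, \mathbf{z}_k
  = \sum_{k \in s} \frac{\mathbf{z}_k}{\pi_k}
  = \sum_{k \in U} \mathbf{z}_k .
```

The left-hand side of the second relation is the Horvitz-Thompson estimator of the auxiliary totals, so a balanced sample reproduces those totals without error.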
3. Balanced sampling and multi-way stratification
- A multi-way stratification design can be seen as a special case of balanced design in which, for unit k, the auxiliary variable vector is the vector of indicators of membership in the domains of the different partitions, multiplied by the unit's inclusion probability
- The z vector, in this case, is this product of the inclusion probability and the membership indicators
- The balancing equations ensure that, for each selected sample s, the size of the subsample falling in each domain is a non-random quantity, equal to the sum of the inclusion probabilities of the population units in that domain
3. Balanced sampling and multi-way stratification
- For multiway stratification the balancing equations state that the number of sampled units in each domain equals the planned sample size for the d-th domain of the b-th partition
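A worked form of the multi-way case, under assumed symbols (the membership indicator and the domain sample size are written here as \delta_k(bd) and n_{bd}; the original slide notation is not recoverable):

```latex
% Assumed notation: \delta_k(bd) = 1 if unit k belongs to domain U_{bd}, 0 otherwise.
\mathbf{z}_k = \pi_k \, \boldsymbol{\delta}_k , \qquad
\boldsymbol{\delta}_k = \bigl(\delta_k(11), \dots, \delta_k(bd), \dots, \delta_k(B M_B)\bigr)' ,
\\[4pt]
\sum_{k \in s} \frac{\pi_k \, \delta_k(bd)}{\pi_k}
  = \sum_{k \in s} \delta_k(bd)
  = \sum_{k \in U} \pi_k \, \delta_k(bd)
  = n_{bd} ,
\qquad b = 1, \dots, B , \; d = 1, \dots, M_b ,
\\[4pt]
\sum_{d=1}^{M_b} n_{bd} = \sum_{k \in U} \pi_k = n
\quad \text{for every partition } b .
```

The last line follows because each unit belongs to exactly one domain of each partition, so the planned domain sizes of any partition must add up to the same overall sample size.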
3. Balanced sampling and multi-way stratification
- A relevant drawback of balanced sampling has long been the lack of a general procedure for selecting a multivariate balanced random sample.
- Deville and Tillé (2004) proposed a sample selection method (the cube method) that draws balanced samples for a large set of auxiliary variables and for different vectors of inclusion probabilities.
- A free macro for selecting balanced samples from large data sets (SAS or R routine) may be downloaded from http://www.insee.fr/fr/nom_df_met/outils_stat/cube/accueil_cube.htm
- Deville and Tillé (2000) show that with our specification of the auxiliary vectors the balancing equations can be satisfied exactly, while in general the balancing equations are only approximately satisfied
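To make the idea concrete, here is a minimal sketch of the flight phase of the cube method applied to a toy two-way stratification. It is not the INSEE macro: the function name, the tolerances, and the toy population (3 rows by 4 columns, 5 units per cell, uniform probabilities) are assumptions for illustration, and no landing phase is implemented.

```python
import numpy as np

def flight_phase(pi, A, rng, tol=1e-9):
    """Random walk that keeps A @ pi constant until entries reach 0 or 1.

    pi : (N,) inclusion probabilities strictly between 0 and 1
    A  : (Q, N) matrix whose k-th column is z_k / pi_k
    """
    pi = np.asarray(pi, dtype=float).copy()
    while True:
        free = np.flatnonzero((pi > tol) & (pi < 1 - tol))
        if free.size == 0:
            break
        # Any u with A[:, free] @ u = 0 leaves the balancing equations untouched.
        _, sv, vt = np.linalg.svd(A[:, free], full_matrices=True)
        if int(np.sum(sv > 1e-8)) == free.size:
            break  # no null direction left; a landing phase would be needed
        u = vt[-1]
        p = pi[free]
        up, down = u > tol, u < -tol
        # Largest steps along +u and -u that stay inside the unit cube.
        lam1 = min(np.min((1 - p[up]) / u[up], initial=np.inf),
                   np.min(p[down] / -u[down], initial=np.inf))
        lam2 = min(np.min(p[up] / u[up], initial=np.inf),
                   np.min((1 - p[down]) / -u[down], initial=np.inf))
        # Moving by +lam1*u with probability lam2/(lam1+lam2) keeps E(pi) fixed.
        if rng.random() < lam2 / (lam1 + lam2):
            p = p + lam1 * u
        else:
            p = p - lam2 * u
        pi[free] = np.clip(p, 0.0, 1.0)
    pi[pi <= tol] = 0.0
    pi[pi >= 1 - tol] = 1.0
    return pi

# Toy population: 3 row domains x 4 column domains, 5 units in each cell.
rows = np.repeat(np.arange(3), 20)
cols = np.tile(np.repeat(np.arange(4), 5), 3)
# With z_k = pi_k * delta_k, the columns z_k / pi_k are the membership indicators.
A = np.vstack([rows == r for r in range(3)] +
              [cols == c for c in range(4)]).astype(float)
pi = np.full(60, 0.2)  # planned sizes: 4 per row domain, 3 per column domain
s = flight_phase(pi, A, np.random.default_rng(0))
print("row sizes:", A[:3] @ s, "column sizes:", A[3:] @ s)
```

In this two-way toy case the walk is expected to end on a 0/1 vector whose marginal counts equal the planned sizes; with more partitions or general auxiliary variables some entries may remain fractional and a landing phase is required.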
4. Modified GREG estimator
- In the context of multivariate estimation, the r-th parameter of interest is defined
- The modified GREG estimator is obtained through a specific domain weight
- A superpopulation working model is assumed
Variance of the Horvitz-Thompson estimator with balanced sampling
4. Modified GREG estimator variance
- Deville and Tillé (2005) proposed an approximate expression for the variance of the HT estimator for the overall domain
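The formula itself is missing from the extracted slide. One commonly quoted form of the Deville and Tillé (2005) approximation is sketched below; the symbols are assumptions, and the slide may have used a different but equivalent parametrisation.

```latex
% Assumed notation: y_k value of the study variable, \mathbf{z}_k balancing variables,
% Q number of balancing variables, \pi_k inclusion probabilities.
V_{app}\bigl(\hat{Y}_{HT}\bigr)
  = \frac{N}{N-Q} \sum_{k \in U} \left( \frac{1}{\pi_k} - 1 \right) e_k^{2} ,
\qquad
e_k = y_k - \mathbf{z}_k' \mathbf{B} ,
\\[4pt]
\mathbf{B} =
  \left[ \sum_{k \in U} \left( \frac{1}{\pi_k} - 1 \right) \mathbf{z}_k \mathbf{z}_k' \right]^{-1}
  \sum_{k \in U} \left( \frac{1}{\pi_k} - 1 \right) \mathbf{z}_k \, y_k .
```

The approximation depends only on first-order inclusion probabilities: the variance is driven by the residuals of a weighted regression of the study variable on the balancing variables.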
4. Modified GREG estimator variance
- Starting from the result by Deville (2005), it is possible to derive an approximate expression of the variance of the modified GREG estimator under balanced sampling
5. The algorithm for the sample size definition
- In order to calculate the inclusion probabilities it is necessary to fix the sample size for each domain so that the constraints on the sampling errors are met
- Considering each marginal partition separately would give a different set of inclusion probabilities for each of them
- In our methodology we calculate a single set of inclusion probabilities through a two-step procedure
  - Optimisation (calculation of the optimal probabilities)
  - Calibration (calculation of the working probabilities)
5. The algorithm for the sample size definition
- Optimisation: the inclusion probabilities (sample size and domain allocation) are calculated with the aim of minimizing the expected sampling errors on several domains and estimates
  - Multi-domain
  - Multi-variable
- The problem is expressed as a system whose solution can be obtained through the Chromy algorithm (the one used in the allocation software MAUSS, which can be downloaded from www.istat.it)
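The multivariate, multi-domain problem needs an iterative solver such as the Chromy algorithm. As a much simpler illustration of how an error threshold translates into a sample size, the sketch below handles a single variable in a single domain under simple random sampling without replacement. It is not the Chromy algorithm and not MAUSS; the function name and the numbers are hypothetical.

```python
import math

def min_sample_size(N_d, mean, sd, cv_max):
    """Smallest n such that the CV of the sample mean does not exceed cv_max.

    Assumes simple random sampling without replacement in a domain of N_d units,
    where Var(sample mean) = (1/n - 1/N_d) * sd**2.
    """
    v_max = (cv_max * mean) ** 2            # largest admissible variance of the mean
    n = sd ** 2 / (v_max + sd ** 2 / N_d)   # solves (1/n - 1/N_d) * sd^2 = v_max
    return min(N_d, math.ceil(n))

# Hypothetical domain: 500 units, mean 100, standard deviation 80, target CV 10%.
n_d = min_sample_size(500, 100.0, 80.0, 0.10)
print(n_d)
```

Repeating this for every variable and every domain of every partition gives conflicting requirements, which is why a joint optimisation over all constraints is used in practice.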
5. The algorithm for the sample size definition
- Calibration: the optimal inclusion probabilities lead to non-integer values for the domain sample sizes
  - Rounding of the expected domain sample sizes to the next integer
  - Calculation of the working probabilities nearest to the optimal ones
- The problem is expressed as a system whose solution is obtained by means of the Newton algorithm (with some changes), the same one used in the calibration software Genesees (which can be downloaded from www.istat.it)
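A minimal sketch of the calibration idea, assuming iterative proportional fitting (raking) in place of the Newton algorithm mentioned on the slide. Raking is a swapped-in technique: it rescales the starting probabilities so that they sum to the integer domain sizes of every partition, but it is not the Genesees procedure, and it does not by itself keep probabilities below 1. All names and numbers are hypothetical.

```python
import numpy as np

def rake_probabilities(pi0, partitions, targets, n_iter=200, tol=1e-10):
    """Rescale pi0 so that its sums over each domain match integer targets.

    pi0        : (N,) starting inclusion probabilities
    partitions : list of (N,) integer arrays with the domain label of each unit
    targets    : list of arrays with the integer sample size of each domain;
                 every partition must have the same total and no empty domain
    """
    pi = np.asarray(pi0, dtype=float).copy()
    for _ in range(n_iter):
        max_gap = 0.0
        for labels, target in zip(partitions, targets):
            target = np.asarray(target, dtype=float)
            current = np.bincount(labels, weights=pi, minlength=target.size)
            max_gap = max(max_gap, float(np.max(np.abs(current - target))))
            pi *= (target / current)[labels]   # proportional adjustment per domain
        if max_gap < tol:
            break
    return pi

# Toy population: 3 row domains x 4 column domains, 5 units in each cell.
rows = np.repeat(np.arange(3), 20)
cols = np.tile(np.repeat(np.arange(4), 5), 3)
working = rake_probabilities(np.full(60, 0.2), [rows, cols],
                             [[5, 4, 3], [3, 3, 3, 3]])
print(np.bincount(rows, weights=working), np.bincount(cols, weights=working))
```

The resulting working probabilities can then be passed to a balanced sampling routine, since their domain sums are integers.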
6. Application fields and experiments: Artificial data
- Population: contingency table
- Variable for the allocation and estimation model
6. Application fields and experiments: Artificial data
- Compared sampling designs and expected CV
6. Application fields and experiments
- Real data
- A simulation on real enterprise data (N = 10,392) has been carried out to evaluate the effects of the planned sample size for small domains of estimation (Falorsi et al., 2006)
  - U1 partition: regions (20 domains)
  - U2 partition: economic activity by size class (24 domains)
  - Cross-classification strata containing population units: 360
  - Variables of interest: value added and labour cost
  - the sample sizes of the U1 and U2 partitions have been planned separately by means of a compromise allocation
  - the 2 allocations guarantee a CV of 34.5% for U1 and 8.7% for U2 with regard to the variable number of employees (assumed known at the sampling stage)
  - the overall sample size is n = 360
6. Application fields and experiments: Real data
- The experiment examines a situation typical of many real survey contexts, in which the overall sample size n is fixed and the marginal sample sizes are determined by a fairly simple rule: a compromise between the Allocation Proportional to Population size (APP) and the uniform allocation across the domains of a given partition
- The probabilities of both designs for the U1 and U2 partitions have been obtained as the solution of the calibration problem, with the initial probabilities set to the same value for every unit
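The exact compromise rule is not given on the slide. The sketch below assumes an equal-weight mix of proportional and uniform shares, followed by largest-remainder rounding so that the domain sizes add up to n; the mixing weight alpha, the function name, and the domain sizes are all hypothetical.

```python
import numpy as np

def compromise_allocation(N_d, n, alpha=0.5):
    """Mix proportional (weight alpha) and uniform (weight 1 - alpha) allocation."""
    N_d = np.asarray(N_d, dtype=float)
    share = alpha * N_d / N_d.sum() + (1 - alpha) / N_d.size
    raw = n * share
    n_d = np.floor(raw).astype(int)
    # Largest-remainder rounding: hand the missing units to the biggest fractions.
    short = int(n - n_d.sum())
    order = np.argsort(raw - n_d)[::-1]
    n_d[order[:short]] += 1
    return n_d

# Hypothetical partition with three domains and an overall sample of 360 units.
alloc = compromise_allocation([1000, 3000, 6000], 360)
print(alloc, alloc.sum())
```

Small domains receive more than their proportional share and large domains less, which is the usual motivation for this kind of compromise; the sketch does not check that a domain's allocation stays below its population size.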
7. Extension to the Fay-Herriot Model