Statistical Matching and Imputation of Survey
Data with the Package StatMatch for the R
Environment
- Marcello D'Orazio (madorazi_at_istat.it)
- UNECE - Work Session on Statistical Data Editing
- Ljubljana, Slovenia, 9-11 May 2011
What is Statistical Matching?
Statistical Matching (also called data fusion or synthetic
matching) consists of a series of methods to
integrate two or more data sources referring to
the same target population.
Basic SM framework:
- X variables are in common
- Y and Z are NOT jointly observed
- The chance of observing the same unit in A and B
is close to zero
[Diagram: source A contains Y and X; source B contains X and Z]
Objectives of Statistical Matching
- micro: derive a synthetic data set with X, Y
and Z (e.g. A filled in with Z)
- macro: estimation of parameters, e.g. correlation
coefficients or frequencies
Objectives of SM   Approaches
                   Parametric   Nonparametric   Mixed
Macro              yes          yes             -
Micro              yes          yes             yes
The package StatMatch for the R environment
StatMatch provides R functions to apply some
Statistical Matching methods. It generalizes and
optimizes the code provided with the
monograph on SM by D'Orazio et al. (2006).
The first version of StatMatch (version 0.4) was
released on CRAN (Comprehensive R Archive
Network) in 2008. Version 1.0.1 was released at the
beginning of 2011; it significantly improves on the
functionality of the previous version (0.8,
released in 2009).
http://cran.at.r-project.org/web/packages/StatMatch/index.html
The package is available for MS Windows (32 and 64 bit),
Linux and Mac.
Functions in StatMatch
Five main groups of functions:
- functions to perform nonparametric SM at micro
level by means of hot deck imputation
(NND.hotdeck, RANDwNND.hotdeck, rankNND.hotdeck)
- a function to perform mixed SM at macro or micro
level for continuous variables (mixed.mtc)
- functions to integrate data from complex sample
surveys through calibration of weights, as
proposed by Renssen (1998) (harmonize.x and
comb.samples)
- functions to explore uncertainty on the
contingency table Y x Z (Frechet.bounds.cat and
Fbwidths.by.x)
- other functions to compute distances (gower.dist
and maximum.dist), to create the synthetic data
set (create.fused), etc.
SM via hot deck imputation
NND.hotdeck(): nearest neighbour distance hot deck
- many distance functions
- imputation classes
- constrained or unconstrained
RANDwNND.hotdeck(): random hot deck and some variants
- random hot deck in classes
- random hot deck in moving classes
- it is possible to use weights
rankNND.hotdeck(): nearest neighbour with
distance computed on the percentage points
of the empirical cumulative distribution function of X
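To make the nearest neighbour idea concrete, here is a minimal sketch of unconstrained distance hot deck. It is written in plain Python for illustration and is not the StatMatch code: the function names (nnd_hotdeck, fuse), the toy records and the choice of Manhattan distance are all assumptions made for this example.

```python
def nnd_hotdeck(recipients, donors, match_vars):
    """Return, for each recipient, the index of its closest donor.

    Hypothetical helper: Manhattan distance on the matching variables,
    unconstrained (a donor may be used more than once). Ties go to the
    first donor found; a real implementation might break them at random.
    """
    donor_ids = []
    for rec in recipients:
        best_id, best_dist = None, float("inf")
        for i, don in enumerate(donors):
            dist = sum(abs(rec[v] - don[v]) for v in match_vars)
            if dist < best_dist:
                best_id, best_dist = i, dist
        donor_ids.append(best_id)
    return donor_ids


def fuse(recipients, donors, donor_ids, z_vars):
    """Copy the matched donors' Z values onto the recipients."""
    fused = []
    for rec, i in zip(recipients, donor_ids):
        row = dict(rec)
        row.update({z: donors[i][z] for z in z_vars})
        fused.append(row)
    return fused


# Toy data (invented): A observes (x1, x2, y); B observes (x1, x2, z).
A = [{"x1": 25, "x2": 1, "y": 1200},
     {"x1": 60, "x2": 3, "y": 2100}]
B = [{"x1": 58, "x2": 3, "z": "owner"},
     {"x1": 27, "x2": 1, "z": "tenant"},
     {"x1": 40, "x2": 2, "z": "owner"}]

ids = nnd_hotdeck(A, B, ["x1", "x2"])
fused_A = fuse(A, B, ids, ["z"])
```

The two steps mirror the split between finding donors and building the synthetic data set. The constrained variant would additionally limit how often each donor can be used, which turns the search into an assignment problem.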
Mixed SM methods
mixed.mtc(): mixed SM methods for continuous variables
- consists of two steps:
  - step 1: fits the regression models Y vs. X and Z vs. X
  - step 2: fills A with units chosen by means of
    constrained distance hot deck computed on
    intermediate and live values of Y and Z
- two methods to estimate the regression
parameters (ML and Moriarity and Scheuren, 2001)
- possibility of introducing auxiliary
information about the correlation coefficient between Y and Z
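The two-step logic can be sketched as follows. This is a deliberately simplified illustration in plain Python, not the procedure implemented in mixed.mtc(): it assumes a single X variable, uses predicted values without residuals as the intermediate values, measures closeness with an unscaled squared Euclidean distance, and matches without constraints. The names ols and nearest and the toy data are invented for the example.

```python
def ols(xs, ys):
    """Least-squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    return my - b * mx, b


def nearest(point, candidates):
    """Index of the candidate closest to point (squared Euclidean)."""
    def dist(j):
        return ((point[0] - candidates[j][0]) ** 2
                + (point[1] - candidates[j][1]) ** 2)
    return min(range(len(candidates)), key=dist)


# Toy data (invented): A observes (x, y); B observes (x, z).
A = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]
B = [(1.5, 9.0), (2.5, 7.1), (3.5, 4.9), (0.5, 11.0)]

# Step 1: regression of Y on X (from A) and of Z on X (from B).
a_y, b_y = ols([x for x, _ in A], [y for _, y in A])
a_z, b_z = ols([x for x, _ in B], [z for _, z in B])

# A records carry a live y and an intermediate z;
# B records carry an intermediate y and a live z.
A_pts = [(y, a_z + b_z * x) for x, y in A]
B_pts = [(a_y + b_y * x, z) for x, z in B]

# Step 2: each A record receives the live z of its nearest B record.
matches = [nearest(p, B_pts) for p in A_pts]
fused = [(x, y, B[j][1]) for (x, y), j in zip(A, matches)]
```

Because the matching here is unconstrained, nothing stops two records of A from receiving the same donor; a constrained hot deck, as named on the slide, would rule that out.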
SM of data from complex sample surveys
Renssen's (1998) approach is based on a series of
calibration steps of the survey weights of A and
B and, if available, C (C may contain Y and Z,
or X, Y and Z).
harmonize.x(): harmonizes the
joint/marginal distribution of the X variables in
A and B
comb.samples(): estimates the
contingency table Y vs. Z, using the
auxiliary information in C when available
- Conditional Independence Assumption
- incomplete two-way stratification
- synthetic two-way stratification
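Under the Conditional Independence Assumption, the Y vs. Z table can be estimated from the two observed tables as P(y, z) = sum over x of P(x) P(y|x) P(z|x). The sketch below shows only this arithmetic, in plain Python with invented counts. It ignores survey weights and calibration, so it is not how comb.samples() works internally; cia_table and all the data are assumptions for illustration.

```python
def cia_table(p_x, tab_xy, tab_xz):
    """Estimate P(Y=y, Z=z) = sum_x P(x) * P(y|x) * P(z|x).

    p_x    : dict x -> P(x), assumed already harmonized between A and B
    tab_xy : dict x -> {y: count}, observed in A
    tab_xz : dict x -> {z: count}, observed in B
    """
    y_levels = sorted({y for row in tab_xy.values() for y in row})
    z_levels = sorted({z for row in tab_xz.values() for z in row})
    est = {(y, z): 0.0 for y in y_levels for z in z_levels}
    for x, px in p_x.items():
        n_xy = sum(tab_xy[x].values())
        n_xz = sum(tab_xz[x].values())
        for y in y_levels:
            for z in z_levels:
                p_y = tab_xy[x].get(y, 0) / n_xy  # P(y | x) from A
                p_z = tab_xz[x].get(z, 0) / n_xz  # P(z | x) from B
                est[(y, z)] += px * p_y * p_z
    return est


# Toy inputs (invented counts).
p_x = {"a": 0.5, "b": 0.5}
tab_xy = {"a": {"y1": 30, "y2": 10}, "b": {"y1": 10, "y2": 30}}
tab_xz = {"a": {"z1": 20, "z2": 20}, "b": {"z1": 40, "z2": 10}}

est = cia_table(p_x, tab_xy, tab_xz)
```

The estimate is only as good as the assumption: when Y and Z are related beyond what X explains, auxiliary information such as a third source C is needed.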
Exploring uncertainty due to the basic SM framework
Frechet.bounds.cat(): derives the uncertainty
bounds for the frequencies in the contingency
table Y vs. Z, starting from the marginal
tables X vs. Y and X vs. Z
Fbwidths.by.x(): explores how the various possible subsets of the X
variables contribute to reducing the
uncertainty on the cells of Y vs. Z
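The bounds in question are the Fréchet bounds applied within each class of X and then averaged over X: the lower bound for a cell is the sum over x of P(x) max(0, P(y|x) + P(z|x) - 1), and the upper bound is the sum over x of P(x) min(P(y|x), P(z|x)). The following plain Python sketch computes them for invented probabilities; frechet_bounds is a hypothetical helper, not the package function.

```python
def frechet_bounds(p_x, p_y_x, p_z_x):
    """Conditional Frechet bounds for every cell of the Y vs. Z table.

    p_x   : dict x -> P(x)
    p_y_x : dict x -> {y: P(y | x)}
    p_z_x : dict x -> {z: P(z | x)}
    Returns dict (y, z) -> (lower, upper).
    """
    y_levels = sorted({y for row in p_y_x.values() for y in row})
    z_levels = sorted({z for row in p_z_x.values() for z in row})
    bounds = {}
    for y in y_levels:
        for z in z_levels:
            low = sum(px * max(0.0, p_y_x[x][y] + p_z_x[x][z] - 1.0)
                      for x, px in p_x.items())
            up = sum(px * min(p_y_x[x][y], p_z_x[x][z])
                     for x, px in p_x.items())
            bounds[(y, z)] = (low, up)
    return bounds


# Toy inputs (invented probabilities).
p_x = {"a": 0.5, "b": 0.5}
p_y_x = {"a": {"y1": 0.75, "y2": 0.25}, "b": {"y1": 0.25, "y2": 0.75}}
p_z_x = {"a": {"z1": 0.5, "z2": 0.5}, "b": {"z1": 0.8, "z2": 0.2}}

bounds = frechet_bounds(p_x, p_y_x, p_z_x)

# The conditional-independence value always lies inside the bounds.
cia_y1_z1 = sum(px * p_y_x[x]["y1"] * p_z_x[x]["z1"]
                for x, px in p_x.items())
```

The width of each interval (upper minus lower) measures how much the data leave undetermined about that cell; comparing widths across different choices of X is the idea behind exploring subsets of matching variables.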
Computational Efficiency
All the functions in StatMatch are based on R
code, and there are no calls to external
code (compiled C or Fortran).
"Interpreted languages (Matlab, R, Python, Lisp) are fun ...
but slow. Compiled languages (machine code,
assembly, FORTRAN, C, Java) are fast but are
work (no fun)" Mizera (2006)
Hot deck method     StatMatch function   Match vars   Imp. classes   Process time (secs)   Notes
Unconstrained NND   NND.hotdeck          4            36             1282                  dist.fun="Gower"
Constrained NND     NND.hotdeck          4            36             1446                  dist.fun="Gower", constr.alg="relax"
Random hot deck     RANDwNND.hotdeck     4            36             1936                  dist.fun="Gower", cut.don="exact", k=10
Artificial data: A contains 14,000 obs., B about
54,000 obs. PC with CPU Pentium IV 3GHz,
3GB RAM, MS Windows XP Prof. (SP 3, 32 bit).
Warning! "Although abusing R was not proved to
be addictive, it should be noted that it often
leads to harder stuff" Mizera (2006)
Thank you for your attention!
Some References
D'Orazio, M. (2009) StatMatch: Statistical Matching. R package version 1.0.1. http://CRAN.R-project.org/package=StatMatch
D'Orazio, M., Di Zio, M., and Scanu, M. (2006) Statistical Matching: Theory and Practice. Wiley and Sons, Chichester.
Mizera, I. (2006) Graphical Exploratory Analysis Using Halfspace Depth. Presentation at useR!, The R User Conference 2006, Wien, 15-17 June 2006.
Moriarity, C., and Scheuren, F. (2001) Statistical matching: a paradigm for assessing the uncertainty in the procedure. Journal of Official Statistics, 17, pp. 407-422.
Renssen, R.H. (1998) Use of statistical matching techniques in calibration estimation. Survey Methodology, 24, pp. 171-183.