1
Statistical Matching and Imputation of Survey
Data with the Package StatMatch for the R
Environment
  • Marcello D'Orazio (madorazi_at_istat.it)
  • UNECE - Work Session on Statistical Data Editing
  • Ljubljana, Slovenia, 9-11 May 2011

2
UNECE Work Session on Statistical Data Editing
What is Statistical Matching?
Statistical Matching (SM, also called data fusion or
synthetic matching) consists of a series of methods to
integrate two or more data sources referring to
the same target population. Basic SM framework:
  • X variables are in common
  • Y and Z are NOT jointly observed
  • The chance of observing the same unit in A and B
    is close to zero

source A: Y, X
source B: X, Z


Ljubljana, 9-11 May 2011
3
Objectives of Statistical Matching
  • micro: derive a synthetic data set with X, Y
    and Z (A filled in with Z: Y, X, Z)
  • macro: estimation of parameters, e.g. the correlation
    coefficient between Y and Z, or the frequencies of
    the contingency table Y vs. Z

Objectives of SM   Approaches
                   Parametric   Nonparametric   Mixed
Macro              yes          yes
Micro              yes          yes             yes
4
The package StatMatch for the R environment
StatMatch provides R functions to apply some
Statistical Matching methods. It is a generalization and
optimization of the code provided with the
monograph about SM by D'Orazio et al. (2006).
  • The first version of StatMatch (version 0.4) was
    released on CRAN (Comprehensive R Archive
    Network) in 2008.
  • Version 1.0.1 was released at the beginning of 2011;
    it significantly improves on the functionality of
    the previous version (0.8, released in 2009).
  • http://cran.at.r-project.org/web/packages/StatMatch/index.html
  • Package available for MS Windows (32 and 64 bit),
    Linux, Mac
5
Functions in StatMatch
  • Five main groups of functions
  • functions to perform nonparametric SM at micro
    level by means of hot deck imputation
    (NND.hotdeck, RANDwNND.hotdeck, rankNND.hotdeck)
  • a function to perform mixed SM at macro or micro
    level for continuous variables (mixed.mtc)
  • functions to integrate data from complex sample
    surveys through calibration of weights as
    proposed by Renssen (1998) (harmonize.x and
    comb.samples)
  • functions to explore uncertainty on the
    contingency table Y x Z (Frechet.bounds.cat and
    Fbwidths.by.x)
  • other functions to compute distances (gower.dist
    and maximum.dist), to create the synthetic data
    set (create.fused), etc.
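To make the distance computation concrete, here is a minimal Python sketch of Gower's dissimilarity for records that mix numeric and categorical variables. It is only an illustration of the general idea, not the code of gower.dist; the function name, the dict-based record layout and the `ranges` argument are assumptions made for this example.

```python
def gower_distance(rec, don, ranges):
    """Gower dissimilarity between two records given as dicts.

    `ranges` maps each numeric variable to its range (max - min) over the
    pooled data; variables not listed in `ranges` are treated as categorical.
    """
    total = 0.0
    for var in rec:
        if var in ranges:
            # Numeric variable: absolute difference scaled by the range.
            if ranges[var] > 0:
                total += abs(rec[var] - don[var]) / ranges[var]
        else:
            # Categorical variable: 0 if the categories agree, 1 otherwise.
            total += 0.0 if rec[var] == don[var] else 1.0
    # Average the per-variable contributions.
    return total / len(rec)


a = {"age": 30, "sex": "F", "region": "north"}
b = {"age": 50, "sex": "F", "region": "south"}
d = gower_distance(a, b, ranges={"age": 40})
```

Each variable contributes a value between 0 and 1, so the result is also between 0 and 1 regardless of the measurement units of the numeric variables.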

6
SM via hot deck imputation
  • NND.hotdeck(): nearest neighbour distance hot
    deck
    - many distance functions
    - imputation classes
    - constrained or unconstrained
  • RANDwNND.hotdeck(): random hot deck
    and some variants
    - random hot deck in classes
    - random hot deck in moving classes
    - it is possible to use weights
  • rankNND.hotdeck(): nearest neighbour with
    distance computed on the percentage points
    of the empirical cumulative distribution
    function of X
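As an illustration of the first method, the following Python sketch performs an unconstrained nearest neighbour hot deck within imputation classes. It is not the NND.hotdeck implementation: the function name, the argument names, the Manhattan distance and the toy records are all assumptions chosen to keep the example short.

```python
def nnd_hotdeck(recipients, donors, match_vars, z_var, class_var=None):
    """Unconstrained nearest neighbour hot deck (illustrative only).

    For each recipient record, find the donor with the smallest Manhattan
    distance on `match_vars` (searching only donors in the same imputation
    class when `class_var` is given) and copy the donor's `z_var` value.
    """
    fused = []
    for rec in recipients:
        pool = [d for d in donors
                if class_var is None or d[class_var] == rec[class_var]]
        if not pool:
            raise ValueError("no donor available in this imputation class")
        best = min(pool,
                   key=lambda d: sum(abs(rec[v] - d[v]) for v in match_vars))
        # Unconstrained: the same donor may be reused for several recipients.
        fused.append({**rec, z_var: best[z_var]})
    return fused


# Source A (recipient) observes X and Y; source B (donor) observes X and Z.
A = [{"sex": "F", "age": 31, "y": 1200},
     {"sex": "M", "age": 58, "y": 2100}]
B = [{"sex": "F", "age": 29, "z": 300},
     {"sex": "F", "age": 60, "z": 650},
     {"sex": "M", "age": 55, "z": 500}]

fused = nnd_hotdeck(A, B, match_vars=["age"], z_var="z", class_var="sex")
```

The result is A filled in with Z, i.e. a synthetic data set containing X, Y and Z. A constrained variant would additionally limit how many times each donor can be used, which turns the search into an assignment problem.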
7
Mixed SM methods
  • mixed.mtc(): mixed SM methods for continuous
    variables, consisting of two steps
    - fit regression models of Y vs. X and
      Z vs. X
    - fill A with units chosen by means of
      constrained distance hot deck computed on
      intermediate and live (observed) values of Y and Z
  • two methods to estimate the regression
    parameters (ML and Moriarity and Scheuren,
    2001)
  • possibility of introducing auxiliary
    information about the correlation coefficient
    between Y and Z
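The two steps can be sketched in Python as follows. This is a heavily simplified reading, not the mixed.mtc algorithm: it assumes a single X variable, conditional independence of Y and Z given X, an unscaled Manhattan distance, and a brute-force search over donor assignments that is only feasible for tiny files. All function and variable names are hypothetical.

```python
import itertools
import random
import statistics


def fit_line(x, y):
    """Least squares fit of y = a + b * x; returns (a, b, residual sd)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    return a, b, statistics.pstdev(resid)


def mixed_match(A, B, seed=0):
    """Two-step mixed matching sketch: regression, then constrained match."""
    if len(B) < len(A):
        raise ValueError("need at least as many donors as recipients")
    rng = random.Random(seed)
    # Step 1: regress Y on X in A and Z on X in B.
    ay, by, sy = fit_line([r["x"] for r in A], [r["y"] for r in A])
    az, bz, sz = fit_line([r["x"] for r in B], [r["z"] for r in B])
    # Intermediate values: regression prediction plus a random residual.
    z_int = [az + bz * r["x"] + rng.gauss(0, sz) for r in A]
    y_int = [ay + by * r["x"] + rng.gauss(0, sy) for r in B]

    def dist(i, j):
        # Compare A's (observed y, intermediate z) with
        # B's (intermediate y, observed z).
        return abs(A[i]["y"] - y_int[j]) + abs(z_int[i] - B[j]["z"])

    # Step 2: constrained match, each donor used at most once, minimising
    # the total distance (exhaustive search, tiny examples only).
    best = min(itertools.permutations(range(len(B)), len(A)),
               key=lambda perm: sum(dist(i, j) for i, j in enumerate(perm)))
    return [{**A[i], "z": B[j]["z"], "donor": j} for i, j in enumerate(best)]


A = [{"x": 1, "y": 2.1}, {"x": 2, "y": 3.9}, {"x": 3, "y": 6.2}]
B = [{"x": 1, "z": 9.8}, {"x": 2, "z": 8.1},
     {"x": 3, "z": 6.0}, {"x": 4, "z": 4.2}]
fused = mixed_match(A, B)
```

The imputed Z values are always observed ("live") values taken from B; the regression step only guides which donor is chosen. In practice the variables would be put on a common scale before computing distances, and an assignment algorithm would replace the exhaustive search.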

8
SM of data from complex sample surveys
Renssen's (1998) approach is based on a series of
calibration steps of the survey weights of A and
B and, if available, C (C may contain Y and Z,
or X, Y and Z).
  • harmonize.x(): harmonizes the
    joint/marginal distribution of the X variables in
    A and B
  • comb.samples(): estimates the
    contingency table Y vs. Z, using the
    auxiliary information in C when available
    - Conditional Independence Assumption
    - incomplete two-way stratification
    - synthetic two-way stratification
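A minimal Python sketch of the idea, for the simplest case only: one categorical X, weights harmonized by post-stratification (a special case of calibration), and the Y vs. Z table estimated under the Conditional Independence Assumption without a third sample C. The function names, the record layout and the totals are assumptions for illustration, not the harmonize.x or comb.samples interfaces.

```python
from collections import defaultdict


def poststratify(sample, x_totals):
    """Rescale weights so the weighted count of each X category
    matches the corresponding entry of x_totals."""
    current = defaultdict(float)
    for r in sample:
        current[r["x"]] += r["w"]
    return [{**r, "w": r["w"] * x_totals[r["x"]] / current[r["x"]]}
            for r in sample]


def cia_table(A, B, x_totals):
    """Estimate the Y x Z totals assuming Y and Z independent given X:
    N(y, z) = sum over x of N_A(x, y) * N_B(x, z) / N(x)."""
    n_xy, n_xz = defaultdict(float), defaultdict(float)
    for r in A:
        n_xy[(r["x"], r["y"])] += r["w"]
    for r in B:
        n_xz[(r["x"], r["z"])] += r["w"]
    table = defaultdict(float)
    for (xa, y), wy in n_xy.items():
        for (xb, z), wz in n_xz.items():
            if xa == xb:
                table[(y, z)] += wy * wz / x_totals[xa]
    return dict(table)


# Hypothetical population totals for X and two small weighted samples.
x_totals = {"young": 100.0, "old": 100.0}
A = [{"x": "young", "y": "low", "w": 30.0}, {"x": "young", "y": "high", "w": 20.0},
     {"x": "old", "y": "low", "w": 10.0}, {"x": "old", "y": "high", "w": 40.0}]
B = [{"x": "young", "z": "no", "w": 40.0}, {"x": "young", "z": "yes", "w": 10.0},
     {"x": "old", "z": "no", "w": 25.0}, {"x": "old", "z": "yes", "w": 25.0}]

A_h = poststratify(A, x_totals)
B_h = poststratify(B, x_totals)
table = cia_table(A_h, B_h, x_totals)
```

After harmonization both samples reproduce the same X totals, so the two sets of conditional distributions can be combined consistently. When a sample C with Y and Z is available, the CIA estimate would serve only as a starting point to be adjusted.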
9
Exploring uncertainty due to SM basic framework
  • Frechet.bounds.cat(): derives the uncertainty
    bounds for frequencies in the contingency
    table Y vs. Z, starting from the marginal
    tables X vs. Y and X vs. Z
  • Fbwidths.by.x(): explores how the various
    possible subsets of the X
    variables contribute to reducing the
    uncertainty on the cells of Y vs. Z
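The bounds in question are the Fréchet bounds. A small Python sketch, assuming a single categorical X and relative frequencies given as nested dicts (this layout and the function name are illustrative choices, not the Frechet.bounds.cat interface), shows how conditioning on X narrows the interval for one cell of Y vs. Z.

```python
def frechet_bounds(p_x, p_y_given_x, p_z_given_x, y, z):
    """Bounds for P(Y=y, Z=z), ignoring X and conditioning on X.

    Returns ((low, up) without X, (low, up) using X).
    """
    p_y = sum(p_x[x] * p_y_given_x[x][y] for x in p_x)
    p_z = sum(p_x[x] * p_z_given_x[x][z] for x in p_x)
    # Unconditional Frechet bounds from the two marginals only.
    unconditional = (max(0.0, p_y + p_z - 1.0), min(p_y, p_z))
    # Conditional bounds: apply the same inequality within each X
    # category and average over the distribution of X.
    low = sum(p_x[x] * max(0.0, p_y_given_x[x][y] + p_z_given_x[x][z] - 1.0)
              for x in p_x)
    up = sum(p_x[x] * min(p_y_given_x[x][y], p_z_given_x[x][z])
             for x in p_x)
    return unconditional, (low, up)


p_x = {"a": 0.5, "b": 0.5}
p_y_given_x = {"a": {"y1": 0.9, "y2": 0.1}, "b": {"y1": 0.2, "y2": 0.8}}
p_z_given_x = {"a": {"z1": 0.8, "z2": 0.2}, "b": {"z1": 0.1, "z2": 0.9}}

uncond, cond = frechet_bounds(p_x, p_y_given_x, p_z_given_x, "y1", "z1")
```

The conditional interval is never wider than the unconditional one; the more strongly X predicts Y and Z, the narrower it becomes, which is the quantity one would compare across subsets of the X variables.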
10
Computational Efficiency
All the functions in StatMatch are based on R
code, and there are no calls to external
code (compiled C or Fortran).
"Interpreted languages (Matlab, R, Python, Lisp) are fun ...
but slow. Compiled languages (machine code,
assembly, FORTRAN, C, Java) are fast but are
work (no fun)." Mizera (2006)

Hot deck method     StatMatch function   Match vars   Imp. class.   Process time (secs)   Notes
Unconstrained NND   NND.hotdeck          4            36            1282                  dist.fun="Gower"
Constrained NND     NND.hotdeck          4            36            1446                  dist.fun="Gower", constr.alg="relax"
Random hot deck     RANDwNND.hotdeck     4            36            1936                  dist.fun="Gower", cut.don="exact", k=10

Artificial data: A contains 14,000 obs.; about
54,000 obs. in B. PC with CPU Pentium IV 3GHz,
3GB RAM, MS Windows XP Prof. (SP 3, 32 bit)
11
"Warning! Although abusing R was not proved to
be addictive, it should be noted that it often
leads to harder stuff." Mizera (2006)
Thank you for your attention!
12
Some References
  • D'Orazio, M. (2009) StatMatch: Statistical
    Matching. R package version 1.0.1.
    http://CRAN.R-project.org/package=StatMatch
  • D'Orazio, M., Di Zio, M., and Scanu, M. (2006)
    Statistical Matching: Theory and Practice.
    Wiley and Sons, Chichester.
  • Mizera, I. (2006) Graphical Exploratory Analysis
    Using Halfspace Depth. Presentation at useR!,
    The R User Conference 2006, Wien, 15-17 June 2006.
  • Moriarity, C., and Scheuren, F. (2001) Statistical
    matching: a paradigm for assessing the uncertainty
    in the procedure. Journal of Official Statistics,
    17, pp. 407-422.
  • Renssen, R.H. (1998) Use of statistical matching
    techniques in calibration estimation. Survey
    Methodology, 24, pp. 171-183.
