SurveyDataWeighting: Generating Sample Weights for a Dataset
This package generates weights for a sample dataset such that the weighted sums of dataset columns match a set of targets. For example, you might want to weight a dataset so that it matches known totals of benefit receipt, known numbers of households in each region of a country, or both.
A commercial product, Calmar, is available for this and widely used, but there are many advantages in having a version that you can easily embed in a simulation program. It can be very useful for producing forecasts, for example; see the papers by Reed and Stark and by Creedy below.
The routine calculates a set of weights that are closest, in some sense, to an initial set of weights, such that, when summed, the weighted data hits the `target_populations`. Output is a NamedTuple with a vector of weights and some information on how the routine converged. The paper by Merz has a good discussion of how to lay out the dataset.
```julia
function do_reweighting(
    ;
    data,                                 # either an AbstractMatrix or e.g. a DataFrame
    initial_weights    :: AbstractVector, # a column
    target_populations :: AbstractVector, # a row
    functiontype       :: DistanceFunctionType,
    upper_multiple     = 0.0,
    lower_multiple     = 0.0,
    tol                = 10^(-10),
    max_iterations     = 100 )
```
See the tests for simple examples, based on examples from the Creedy paper.
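For illustration, a minimal call might look like the sketch below. The numbers are invented (not from the Creedy paper), and the `weights` field on the returned NamedTuple is assumed from the description above rather than taken from the package docs.

```julia
using SurveyDataWeighting

# 4 observations × 2 targets; each column is a variable with a known
# population total that we want the weighted sums to hit.
data = [ 1.0  1.0 ;
         0.0  1.0 ;
         1.0  0.0 ;
         1.0  1.0 ]
initial_weights    = [ 1.0, 1.0, 1.0, 1.0 ]
target_populations = [ 3.3, 2.7 ]    # desired weighted column totals

res = do_reweighting(
    data               = data,
    initial_weights    = initial_weights,
    target_populations = target_populations,
    functiontype       = constrained_chi_square,
    upper_multiple     = 2.0,
    lower_multiple     = 0.25 )

# weighted column sums should now hit the targets ...
@assert data' * res.weights ≈ target_populations
# ... and each new weight stays within the allowed ratio of its old value
@assert all( 0.25 .<= res.weights ./ initial_weights .<= 2.0 )
```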
The form of 'closeness' used is determined by the `functiontype` parameter, of enumerated type `DistanceFunctionType`. See the Creedy and the Deville and Sarndal papers on these. Notes:
- `chi_square`: minimising the squared difference between old and new weights; this can produce negative weights (the formula is written out after this list);
- `constrained_chi_square` usually works best: this produces squared-difference weights that are at most `ru` times the original weight and at least `rl` times the original (the `upper_multiple` and `lower_multiple` parameters);
- the other measures are taken from the Deville and Sarndal paper and pass simple tests, but sometimes fail to converge in real-world situations; whether this is because of something inherent or some mistake I've made, I'm unsure;
- I believe Calmar implements different measures; see also D’Souza.
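For reference, the plain chi-square measure minimises the Deville and Sarndal squared distance between the new weights w_k and the initial weights d_k, subject to the calibration constraints; this is just the standard formulation from that paper, restated here for convenience:

```math
\min_{w}\ \sum_{k=1}^{K} \frac{(w_k - d_k)^2}{2\,d_k}
\quad \text{subject to} \quad
\sum_{k=1}^{K} w_k\, x_{kj} = t_j, \qquad j = 1, \dots, J
```

where x_kj is observation k's value for target variable j and t_j is the j-th element of `target_populations`. The constrained variants additionally require rl·d_k ≤ w_k ≤ ru·d_k.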
Functions and Data Structures
- `SurveyDataWeighting.DistanceFunctionType`
- `SurveyDataWeighting.compute_f_and_hessian`
- `SurveyDataWeighting.do_chi_square_reweighting`
- `SurveyDataWeighting.do_reweighting`
- `SurveyDataWeighting.newton!`
`SurveyDataWeighting.DistanceFunctionType` — Type

Possible distance types; see Creedy.
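The exact set of enum values depends on the package version; a standard way to list them at the REPL (plain Julia, nothing package-specific) is:

```julia
using SurveyDataWeighting

# `instances` works for any Julia enum
for t in instances( DistanceFunctionType )
    println( t )
end
```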
`SurveyDataWeighting.compute_f_and_hessian` — Method

Internal use only: the function called by `newton!`.
`SurveyDataWeighting.do_chi_square_reweighting` — Method

This is a route-one approach to chi-square reweighting. The iterative main method should produce identical results when `functiontype = chi_square`. This is kept here mainly for testing. Note that the weights can be negative. See the Creedy papers.
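For the unconstrained chi-square case there is a closed-form solution (the linear calibration estimator in Deville and Sarndal). A minimal sketch of that algebra, under the data layout assumed above; this is not necessarily how the package implements it:

```julia
# Closed-form chi-square reweighting: w = d .* (1 .+ X*λ), where λ solves
# (X'DX)λ = t − X'd with D = Diagonal(d). A sketch of the algebra only.
function chi_square_sketch(
        X :: AbstractMatrix,   # K×J data matrix
        d :: AbstractVector,   # K-vector of initial weights
        t :: AbstractVector )  # J-vector of target populations
    A = X' * (d .* X)          # J×J system matrix X'DX
    λ = A \ (t .- X' * d)      # Lagrange multipliers
    return d .* (1.0 .+ X * λ) # new weights; note these can go negative
end
```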
`SurveyDataWeighting.do_reweighting` — Method

Makes a weights vector which weights the matrix `data` so that, when summed, the column totals match `target_populations`. See the Creedy paper for `function_type`. If using one of the constrained types, the output weights should be no more than `ru` times the initial weight and no less than `rl` times it. Returns a NamedTuple with the weights and some extra information on convergence.

- `data` : K×J matrix, where K is the number of observations and J is the number of constraints; see Merz (1994) for a good discussion of how to lay out the dataset;
- `initial_weights` : K-length vector;
- `target_populations` : J-length vector;
- `upper_multiple`/`lower_multiple` : max/min acceptable values of the ratio final_weight/initial_weight (for the constrained distance functions);
- `tol` : tolerance for the root finder;
- `max_iterations` : for controlling convergence.

Note: `chi_square` is just there for checking purposes; use `do_chi_square_reweighting` if that's all you need.
`SurveyDataWeighting.newton!` — Method

A very simple implementation of Newton's method. `func` here both evaluates the function and creates the Hessian.
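The shape of that interface might look something like the following sketch (hypothetical names; the package's actual `newton!` will differ in detail):

```julia
# A minimal Newton iteration in the style the docstring describes:
# `func` returns both the residual vector and the Hessian at x.
function newton_sketch( func, x0; tol = 1e-10, max_iterations = 100 )
    x = copy( x0 )
    for iter in 1:max_iterations
        f, hessian = func( x )        # evaluate residuals and Hessian together
        if maximum( abs.( f )) < tol  # converged?
            return ( x = x, iterations = iter, converged = true )
        end
        x .-= hessian \ f             # Newton step: solve hessian * step = f
    end
    return ( x = x, iterations = max_iterations, converged = false )
end
```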
TODO
- Chase up and add different closeness measures, e.g. the entropy measure I remember from an old Atkinson and Gomulka working paper, and whatever I can find elsewhere.
Bibliography
Creedy, John. “Survey Reweighting for Tax Microsimulation Modelling.” Treasury Working Paper Series. New Zealand Treasury, September 2003. http://ideas.repec.org/p/nzt/nztwps/03-17.html.
Creedy, John, and Ivan Tuckwell. “Reweighting the New Zealand Household Economic Survey for Tax Microsimulation Modelling.” Treasury Working Paper Series. New Zealand Treasury, December 2003. https://ideas.repec.org/p/nzt/nztwps/03-33.html.
Deville, Jean-Claude, and Carl-Erik Sarndal. “Calibration Estimators in Survey Sampling.” Journal of the American Statistical Association 87, no. 418 (1992): 376–82.
Merz, Joachim. "Microdata Adjustment by the Minimum Information Loss Principle." SSRN Scholarly Paper. Rochester, NY: Social Science Research Network, 1 July 1994. https://papers.ssrn.com/abstract=1417310.
D’Souza, John. "A Stata Program for Calibration Weighting." United Kingdom Stata Users’ Group Meetings 2010. Stata Users Group, 17 September 2010. https://ideas.repec.org/p/boc/usug10/02.html.
Reed, Howard, and Graham Stark. "Tackling Child Poverty Delivery Plan - Forecasting Child Poverty in Scotland." Scottish Government, 9 March 2018. http://www.gov.scot/Publications/2018/03/2911/0.