Apologies in advance if I'm doing this wrong. I didn't see anything in the FAQ about announcing things, but from searching the forum it appears people sometimes do and no one seems to mind, so...
I wrote a simple hotdeck program. Probably the only thing interesting about it is that it will select donor rows in proportion to their sample weights. I don't know how often this matters for most people, but I often work with datasets where weights can range from 1 to 2,000, and in those cases it makes a big difference whether or not your hotdeck handles weights. There are 3 existing community-contributed commands (hotdeck, whotdeck, and hotdeckvar) that perform hotdecks, but none of them handles sample weights as far as I can tell.
A poorly formatted copy of the help file is below, and I have uploaded the ado, help, and an example to github: https://github.com/johne13/wtd_hotdeck
Let me note that this is the first public version, so you should of course be cautious in using it for any real production work. That said, I've used it in production work a couple of times, as have some colleagues, and it seems to work like it's supposed to. Even so, I'm sure there are plenty of bugs and edge cases yet to be discovered, so just be aware of that.
Comments, advice, and wisecracks are much appreciated!
Title
wtd_hotdeck -- Hotdeck (or statistical match) imputation that selects donor rows in proportion to their survey or sample weights
Syntax
wtd_hotdeck varlist(min=1) [, options]
options            Description
--------------------------------------------------------------------------------------------------------------------------
Main
  cells(varlist)   (optional) Categorical-style variables that define the cells
  weight(varname)  (optional) Survey- or sample-type weights
  seed(#)          (optional, default=0) A positive integer will be used to set the seed, zero means no seed is set
  verbose(#)       (optional, default=0) A non-zero value will cause intermediate variables to be retained
--------------------------------------------------------------------------------------------------------------------------
Description
This is a fairly standard hotdeck program with the possibly interesting feature of allowing the use of frequency- or
survey-style weights. If provided, the donor rows are sampled in proportion to the weights, which may be either integers
or floats. If multiple variables are imputed to a row, then all values will be selected from the same donor row.
Note that donors and recipients are defined internally based on missing values in varlist. Rows with no missing values in
varlist are defined as donors, and rows with any missing values are defined as recipients. Also note that missing values
are replaced or over-written by the hotdeck, so it may be helpful to explicitly store the original values for later
comparisons.
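Since the imputed values overwrite the missing ones, one way to keep the originals around is sketched below. This is only an illustration, assuming the NMIHS example data used later in this help file; the names was_missing, childsex_orig, and birthwgt_orig are hypothetical and are not created by wtd_hotdeck.

```stata
* Sketch only: keep copies of the original values before imputing
webuse nmihs, clear
replace birthwgt = . if runiform() < 0.20
replace childsex = . if birthwgt == .

* Flag recipient rows and clone the variables that will be overwritten
generate byte was_missing = missing(childsex, birthwgt)
clonevar childsex_orig = childsex
clonevar birthwgt_orig = birthwgt

wtd_hotdeck childsex birthwgt, cells(marital) weight(finwgt)

* Imputed values (recipients) next to observed values (donors)
summarize birthwgt if was_missing == 1
summarize birthwgt if was_missing == 0
```

The flag is created before the hotdeck because afterwards the recipients can no longer be identified from missing values.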
This program is offered for free and "as is", with no guarantees except "your money back for any reason". It has mainly
been tested with Stata 12 (MacOS) and Stata 15 (Windows 10). Since it is essentially just a specialized sorting
program, it will likely work with any semi-recent version of Stata (or your money back, of course).
Options
cells(varlist) These variables define the cells of the hotdeck. The user is responsible for checking that each cell
contains a sufficient number of donors and no checking is done by this program. The variables in "cells" are used
internally for sorting and will generally be of the categorical type, but any variable type is allowed (e.g. if you
have a float variable that only has five unique values, that should be fine).
weight(varname) These may be of frequency- or survey-type and can be integers or floats.
seed(#) Set to a positive integer in order to ensure reproducible results. The positive integer becomes the input for an
internal "set seed" command. If the seed is set to zero (the default value) or is not specified, then no seed is set
internally and Stata will use the system value of seed, whatever that happens to be.
verbose(#) If verbose is set to 1, a number of intermediate variables (beginning with "_") are retained at program
termination. This is mainly for debugging or curiosity.
Brief example
Start with the NMIHS data, then randomly set 20% of childsex & birthwgt to missing
. webuse nmihs
. keep finwgt marital age childsex birthwgt
. replace birthwgt = . if runiform() < 0.20
. replace childsex = . if birthwgt == .
. gen over25 = age > 25
. preserve
Impute childsex & birthwgt using cells based on age & marital status
. wtd_hotdeck childsex birthwgt, cells(marital over25) weight(finwgt)
Continuing the example...
Note that wtd_hotdeck does not check that all of your cells have enough donor observations, so you should always check
this manually. One simple way is to just tab the donor cells.
. table marital over25 if ~missing(childsex,birthwgt)
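For a slightly more detailed check, the donor and recipient counts can be attached to each row and the thin cells listed. This is a minimal sketch, meant to be run on data that still contain the missing values (that is, before the hotdeck); the variable names and the threshold of 5 donors are arbitrary assumptions for illustration.

```stata
* Sketch only: count donors and recipients in each cell
generate byte is_donor = !missing(childsex, birthwgt)
bysort marital over25: egen n_donors = total(is_donor)
bysort marital over25: egen n_recips = total(!is_donor)

* One line per cell, showing only cells with fewer than 5 donors
egen byte first_in_cell = tag(marital over25)
list marital over25 n_donors n_recips if first_in_cell & n_donors < 5
```

An empty listing means every cell has at least 5 donors; what counts as "enough" is left to the user's judgment.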
It can be interesting to check how much the weights matter. If you try the short example below, you are likely to find
that the weights matter substantially, although there will be some random variation with each run (if no seed is set).
. restore, preserve
. sum child birthwgt [w=finwgt] // before hotdeck
. qui: wtd_hotdeck childsex birthwgt, cells(marital over25)
. sum child birthwgt [w=finwgt] // after un-weighted hotdeck
. restore, preserve
. qui: wtd_hotdeck childsex birthwgt, cells(marital over25) weight(finwgt)
. sum child birthwgt [w=finwgt] // after weighted hotdeck
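Continuing the same session, the seed() and verbose() options might be tried as follows. This is a sketch; the seed value 12345 is arbitrary, and the expectation that the two seeded runs agree rests on the description of seed() above rather than on anything verified here.

```stata
* Two runs with the same seed are expected to give the same imputations
restore, preserve
qui: wtd_hotdeck childsex birthwgt, cells(marital over25) weight(finwgt) seed(12345)
sum childsex birthwgt [w=finwgt]
restore, preserve
qui: wtd_hotdeck childsex birthwgt, cells(marital over25) weight(finwgt) seed(12345)
sum childsex birthwgt [w=finwgt]

* Keep the intermediate variables (names beginning with "_") and inspect them
restore, preserve
qui: wtd_hotdeck childsex birthwgt, cells(marital over25) weight(finwgt) verbose(1)
describe _*
```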
Author
John R Eiler
U.S. Dept of the Treasury
first.last at treasury.gov
Acknowledgements
Rachel Costello, Portia DeFillippes
Also see
hotdeck, whotdeck, hotdeckvar -- These are community-contributed commands that can be used for a hotdeck imputation. All
three can be installed with "ssc install" and include excellent help files. None of them allow sample weights as far as I
can tell.
Stata's mi -- Stata's mi command is very powerful and offers many alternative imputation approaches, but no option to do a
simple hotdeck, weighted or unweighted, to the best of my knowledge.
SAS's proc surveyimpute -- It appears that SAS offers a weighted hotdeck via the command "proc surveyimpute
method=hotdeck(selection=weighted);". I have not used this command and hence have not compared results to wtd_hotdeck.