  • Announcement: wtd_hotdeck -- a community-contributed program to hotdeck with sample weights

    Apologies in advance if I'm doing this wrong. I didn't see anything in the FAQ about announcing things, but from searching the forum it appears that people sometimes do and no one seems to mind, so...

    I wrote a simple hotdeck program. Probably the only interesting thing about it is that it selects donor rows in proportion to their sample weights. I don't know how often this matters for most people, but I often work with datasets where weights range from 1 to 2,000, and in those cases it makes a big difference whether or not your hotdeck handles weights. There are 3 existing community-contributed commands that perform hotdecks (hotdeck, whotdeck, and hotdeckvar), but as far as I can tell none of them handles sample weights.

    A poorly formatted copy of the help file is below, and I have uploaded the ado, help, and an example to github: https://github.com/johne13/wtd_hotdeck

    Let me note that this is the first public version, so you should of course be cautious about using it for any real production work. That said, I've used it in production work a couple of times, as have some colleagues, and it seems to work as intended. I'm sure there are still plenty of bugs and edge cases to be discovered, so just be aware of that.

    Comments, advice, and wisecracks are much appreciated!


    Title
    wtd_hotdeck -- Hotdeck (or statistical match) imputation that selects donor rows in proportion to their survey or sample weights

    Syntax
    wtd_hotdeck varlist(min=1) [, options]

    options            Description
    --------------------------------------------------------------------------------------------------------------------------
    Main
      cells(varlist)   (optional) Categorical-style variables that define the cells
      weight(varname)  (optional) Survey- or sample-type weights
      seed(#)          (optional, default=0) A positive integer will be used to set the seed, zero means no seed is set
      verbose(#)       (optional, default=0) A non-zero value will cause intermediate variables to be retained
    --------------------------------------------------------------------------------------------------------------------------

    Description

    This is a fairly standard hotdeck program with the possibly interesting feature of allowing the use of frequency- or
    survey-style weights. If provided, the donor rows are sampled in proportion to the weights, which may be either integers
    or floats. If multiple variables are imputed to a row, then all values will be selected from the same donor row.

    Note that donors and recipients are defined internally based on missing values in varlist. Rows with no missing values in
    varlist are defined as donors, and rows with any missing values are defined as recipients. Also note that missing values
    are replaced or over-written by the hotdeck, so it may be helpful to explicitly store the original values for later
    comparisons.
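
    As a minimal sketch of storing the original values, assuming the variable names from the brief example below
    (childsex and birthwgt). The new names recipient, childsex_orig, and birthwgt_orig are only illustrative and are
    not created by wtd_hotdeck.

    ```stata
    * Flag recipients: rows with a missing value in any variable to be imputed
    generate byte recipient = missing(childsex, birthwgt)

    * Keep copies of the original values so they can be compared after the hotdeck
    clonevar childsex_orig = childsex
    clonevar birthwgt_orig = birthwgt
    ```

    After the hotdeck, a command such as "tabulate childsex if recipient" would then show only the imputed values.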

    This program is offered for free and "as is", with no guarantees except "your money back for any reason". It has mainly
    been tested with Stata 12 (MacOS) and Stata 15 (Windows 10). Since it is essentially just a specialized sorting
    program, it will likely work with any semi-recent version of Stata (or your money back, of course).

    Options

    cells(varlist) These variables define the cells of the hotdeck. The user is responsible for checking that each cell
    contains a sufficient number of donors; this program does no such checking. The variables in "cells" are used
    internally for sorting and will generally be of the categorical type, but any variable type is allowed (e.g. if you
    have a float variable that only has five unique values, that should be fine).

    weight(varname) These may be of frequency- or survey-type and can be integers or floats.

    seed(#) Set to a positive integer in order to ensure reproducible results. The positive integer becomes the input for an
    internal "set seed" command. If the seed is set to zero (the default value) or is not specified, then no seed is set
    internally and Stata will use the system value of seed, whatever that happens to be.
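
    A short sketch of a reproducible call, assuming the variables from the brief example below. The seed values 2468
    and 12345 are arbitrary choices for illustration. Note that the brief example also draws random numbers when it
    creates the artificial missing values, so a fully reproducible run would need a "set seed" before that step as well.

    ```stata
    * Seed the random draw that creates the artificial missing values (2468 is arbitrary)
    set seed 2468
    replace birthwgt = . if uniform() < 0.20
    replace childsex = . if birthwgt == .

    * Seed the donor selection through the seed() option (12345 is arbitrary)
    wtd_hotdeck childsex birthwgt, cells(marital over25) weight(finwgt) seed(12345)
    ```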

    verbose(#) If verbose is set to 1, a number of intermediate variables (beginning with "_") are retained at program
    termination. This is mainly for debugging or curiosity.

    Brief example

    Start with the NMIHS data, then randomly set 20% of childsex & birthwgt to missing

    . webuse nmihs
    . keep finwgt marital age childsex birthwgt
    . replace birthwgt = . if uniform() < 0.20
    . replace childsex = . if birthwgt == .
    . gen over25 = age > 25
    . preserve

    Impute childsex & birthwgt using cells based on age & marital status

    . wtd_hotdeck childsex birthwgt, cells(marital over25) weight(finwgt)

    Continuing the example...

    Note that wtd_hotdeck does not check that all of your cells have enough donor observations, so you should always check
    this manually. One simple way is to tabulate the donor cells, using the data as they were before the hotdeck filled in
    the missing values (hence the restore).

    . restore, preserve
    . table marital over25 if ~missing(childsex,birthwgt)
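
    If there are many cells, a listing of only the thin ones may be easier to read than a full table. The following is a
    rough sketch, meant to be run on the data before imputation. The names donor, n_donors, n_recipients, and cell_tag
    are made up for this illustration, and the cutoff of 30 donors is arbitrary rather than a recommendation.

    ```stata
    * A donor is a row with no missing values in the variables to be imputed
    generate byte donor = !missing(childsex, birthwgt)

    * Count donors and recipients within each cell
    bysort marital over25: egen n_donors = total(donor)
    bysort marital over25: egen n_recipients = total(!donor)

    * Tag one row per cell and list the cells with fewer than 30 donors
    egen byte cell_tag = tag(marital over25)
    list marital over25 n_donors n_recipients if cell_tag & n_donors < 30, noobs
    ```

    A cell that shows recipients but zero donors is the case to worry about most.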

    It can be interesting to check how much the weights matter. If you try the short example below, you are likely to find
    that the weights matter substantially, although there will be some random variation with each run (if no seed is set).

    . restore, preserve
    . sum child birthwgt [w=finwgt] // before hotdeck

    . qui: wtd_hotdeck childsex birthwgt, cells(marital over25)
    . sum child birthwgt [w=finwgt] // after un-weighted hotdeck

    . restore, preserve
    . qui: wtd_hotdeck childsex birthwgt, cells(marital over25) weight(finwgt)
    . sum child birthwgt [w=finwgt] // after weighted hotdeck

    Author

    John R Eiler
    U.S. Dept of the Treasury
    first.last at treasury.gov

    Acknowledgements

    Rachel Costello, Portia DeFillippes

    Also see

    hotdeck, whotdeck, hotdeckvar -- These are community-contributed commands that can be used for a hotdeck imputation. All
    three can be installed with "ssc install" and include excellent help files. None of them allow sample weights as far as I
    can tell.

    Stata's mi -- Stata's mi command is very powerful and offers many alternative imputation approaches, but no option to do a
    simple hotdeck, weighted or unweighted, to the best of my knowledge.

    SAS's proc surveyimpute -- It appears that SAS offers a weighted hotdeck via the command "proc surveyimpute
    method=hotdeck(selection=weighted);". I have not used this command and hence have not compared results to wtd_hotdeck.



  • #2
    Let me add a cautionary note on weighted hotdecking based on an article I just found:

    Andridge, Rebecca R., and Roderick J. Little. “The Use of Sample Weights in Hot Deck Imputation.” Journal of Official Statistics 25, no. 1 (2009): 21-36.

    https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3117228/

    I have only skimmed through it quickly but it seems like the gist of it is that weighted hotdecking could be biased if the weights are correlated with the probability of missing-ness. I guess this counts as common sense to an econometrician, but is worth spelling out in detail nonetheless. The article also provides a nice reading list for hotdecking and weighted hotdecking in general btw.

    My main use case is to completely impute a set of variables from one data set to another, which should satisfy MCAR (although this type of imputation is certainly not without other issues ;-), and my practical experience to date is that the weighted hotdeck performs much better than the unweighted one. Conversely, for the standard survey non-response situation, this paper makes it clear that weighted hotdecking might not be a good idea (or at least not the naive version of a weighted hotdeck that I have implemented here).
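
    A rough way to look at whether the weights are related to missingness is sketched below. This is only a quick diagnostic and is not a method taken from the paper. It assumes the variable names from the help file example (childsex, birthwgt, finwgt), and is_missing is a made-up name.

    ```stata
    * Flag rows that would be recipients, then compare the weights across the two groups
    generate byte is_missing = missing(childsex, birthwgt)
    tabstat finwgt, by(is_missing) statistics(n mean median)

    * Does the weight predict missingness? A clearly non-zero coefficient is a warning sign
    logit is_missing finwgt
    ```

    In the help file example the missing values are set at random, so no relationship should show up there beyond noise.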

    I'll plan to put this info into the next version of the help file (if there is one), but this is your warning for now. ;-)



    • #3
      I occasionally check the github stats and there seems to be a small but steady dribble of traffic there from statalist. Almost certainly not enough to bother with putting it on SSC, but I would love to hear any feedback (positive or negative) from folks who have tried it out. If you have tried it, feel free to respond in this thread or just email me at my work address in the help file, or at eiler13 at google's mail service. Thanks!
