Inverse Probability Weights

John Adler

Join Date: Apr 2017
Posts: 173

17 Aug 2018, 04:35

I have a longitudinal dataset of individuals across three waves, in which I examine the effect of unemployment on health for mothers. There is some evidence that the results may be biased by attrition, and thus that the effects of unemployment on health may be underestimated. To address this issue I will use inverse probability weighting (IPW).

To determine whether the baseline characteristics of mothers are associated with the probability of leaving the sample, I create a binary variable equal to one if a respondent was in wave 1 and in wave 2 or 3 (or both), and zero if she was in wave 1 and no other wave. This gives a complete attrition rate of 44%.

Code:


. capture drop insampm

. generate insampm = 0

. recode insampm 0 = 1 if has_y0_questionnaire==1 &  has_y5_questionnaire==1 | has_y0_questionnaire==1 & has_y10_questionnaire==1 | has_y0_questionnaire==1 & has_y5_questionnaire==1 & has_y10_questionnaire==1


. tab insampm

    insampm |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      1,464       44.28       44.28
          1 |      1,842       55.72      100.00
------------+-----------------------------------
      Total |      3,306      100.00
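
For reference, the same stayer indicator can be built in a single generate statement. The following is only a sketch on simulated data: the variable names follow the post, but the values (and the 50% follow-up rates) are made up for illustration.

```stata
* Simulated stand-in data; only the variable names come from the post
clear
set seed 12345
set obs 1000
generate byte has_y0_questionnaire  = 1
generate byte has_y5_questionnaire  = runiform() < 0.5
generate byte has_y10_questionnaire = runiform() < 0.5

* Stayer = present at baseline and in at least one follow-up wave
generate byte insampm = has_y0_questionnaire == 1 & (has_y5_questionnaire == 1 | has_y10_questionnaire == 1)

tabulate insampm
```

Because & binds more tightly than | in Stata, the recode condition above selects the same mothers; its third clause (present in all three waves) is already covered by the first two.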

First, I estimate the probability of being a stayer using a binary logit model with a range of baseline control variables: education, marital status, receipt of social assistance, age and own employment. Standard errors are clustered at the mother’s baseline location area. The specification below mirrors the way I model the core analysis in this paper.

Code:


. logit insampm i.cown_education_y0 i.cmaritalstatus_y0 i.cmedical_card_y0 i.cemployment_y0 i.cord_age_y0, cluster ( addres
> s_current_county_2002 )

note: 1.cown_education_y0 != 0 predicts failure perfectly
      1.cown_education_y0 dropped and 3 obs not used

note: 5.cemployment_y0 != 0 predicts failure perfectly
      5.cemployment_y0 dropped and 6 obs not used

note: 6.cown_education_y0 omitted because of collinearity
Iteration 0:   log pseudolikelihood = -1983.1518  
Iteration 1:   log pseudolikelihood = -1839.8105  
Iteration 2:   log pseudolikelihood = -1839.0974  
Iteration 3:   log pseudolikelihood = -1839.0964  
Iteration 4:   log pseudolikelihood = -1839.0964  

Logistic regression                             Number of obs     =      2,919
                                                Wald chi2(18)     =    1417.40
                                                Prob > chi2       =     0.0000
Log pseudolikelihood = -1839.0964               Pseudo R2         =     0.0726

                                                       (Std. Err. adjusted for 30 clusters in address_current_county_2002)
--------------------------------------------------------------------------------------------------------------------------
                                                         |               Robust
                                                 insampm |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
---------------------------------------------------------+----------------------------------------------------------------
                                       cown_education_y0 |
                                           No schooling  |          0  (empty)
                               Primary school education  |  -1.980647   .8746838    -2.26   0.024    -3.694996   -.2662986
                                  Some secondary school  |  -.3748552   .2317244    -1.62   0.106    -.8290268    .0793163
                           Complete secondary education  |  -.3227909   .1922295    -1.68   0.093    -.6995538     .053972
 Some third level education at college, university, RTC  |  -.5120794   .2396614    -2.14   0.033    -.9818071   -.0423517
Complete third level education at college, university..  |          0  (omitted)
                                                         |
                                       cmaritalstatus_y0 |
                                             Cohabiting  |  -.3362515   .3143212    -1.07   0.285    -.9523098    .2798068
                                               Divorced  |  -1.057085   .6657199    -1.59   0.112    -2.361872    .2477024
                                                Widowed  |  -1.542918   1.233189    -1.25   0.211    -3.959924    .8740873
                                   Single/Never married  |  -.3289091    .257969    -1.27   0.202    -.8345191    .1767009
                                                         |
                                        cmedical_card_y0 |
                                                    Yes  |  -.1179747   .1656617    -0.71   0.476    -.4426656    .2067163
                                                         |
                                          cemployment_y0 |
                                             Unemployed  |    .075984   .3919981     0.19   0.846    -.6923183    .8442862
Unable to work owing to permanent sickness or disabil..  |  -.4583487   .5561027    -0.82   0.410     -1.54829    .6315926
                                      At school/student  |   .9783511   .3391637     2.88   0.004     .3136025      1.6431
                        Seeking work for the first time  |          0  (empty)
                                               Employed  |   .2686171   .1191097     2.26   0.024     .0351663    .5020679
                                          Self Employed  |   .4014955    .419458     0.96   0.338     -.420627    1.223618
                                                         |
                                             cord_age_y0 |
                                                  20-23  |   .2899182   .2319089     1.25   0.211     -.164615    .7444513
                                                  24-27  |   .5287781    .307094     1.72   0.085    -.0731151    1.130671
                                                  28-32  |   1.025553    .339614     3.02   0.003     .3599222    1.691185
                                                   33 +  |   1.257928   .3210913     3.92   0.000      .628601    1.887256
                                                         |
                                                   _cons |  -.3087511   .4024602    -0.77   0.443    -1.097559    .4800564
--------------------------------------------------------------------------------------------------------------------------

. predict p_insampm, pr
(387 missing values generated)

The inverse of this predicted probability is then used as a weight in the outcome analysis, so that mothers with a lower probability of being a stayer receive a higher weight, to compensate for similar mothers who are missing. This follows Wooldridge (2007), an archived Statalist post (https://www.stata.com/statalist/arch.../msg00999.html) and "12.2 Estimating IP weights via modeling", p. 12 of Causal Inference by Hernan and Robins, https://cdn1.sph.harvard.edu/wp-cont...s_v2.17.18.pdf (worked examples in Stata can be found here: https://www.hsph.harvard.edu/miguel-...nference-book/).

Code:

. gen w=.
(3,306 missing values generated)

. 
. replace w=1/p_insampm if insampm==1
(1,701 real changes made)

. 
. replace w=1/(1-p_insampm) if insampm==0
(1,218 real changes made)

. 
. summarize w

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
           w |      2,919    1.998182    .7833788   1.096622   6.406883

My first question is: is the above approach to building this weight correct?

Also, my analysis uses random effects estimators of the effect of unemployment on health (as informed by Hausman tests and the literature) for mothers who appear in the first wave and at least one other wave. I intend to re-run this analysis with the above weights, to compare against results that are re-weighted to better represent those individuals who were more likely to leave. However, it seems almost impossible to estimate a random effects model with weights.

xtregre2 estimates a random effects model with weights. It is an update to Kevin McKinney's rfregk (https://ideas.repec.org/c/boc/bocode/s456514.html)

However, xtregre2 accepts only aweights, does not allow factor variables, and does not support the alternative variance estimators. It also causes my number of observations to fall when I use it, and I'm not sure why. Searching the archives, I found that someone also mentioned gllamm here: https://www.stata.com/statalist/arch.../msg00716.html.

Can anyone please advise me as to whether my approach to building a weight above is correct, as well as my intention to apply it, and how exactly I can do this in a random effects regression?
Best,

John

Wooldridge, Jeffrey M. "Inverse probability weighted estimation for general missing data problems." Journal of Econometrics 141.2 (2007): 1281-1301.

Tags: panel data, random effects, regression, syntax, weighting

Steve Samuels

Join Date: Mar 2014

Posts: 1786
#2

17 Aug 2018, 08:32

I'm not an expert in this area, but I think that you need a separate weight for each occasion. Imagine that you had only the first two waves. As there are missing values in the second, you'd need a weight based on the response probability for that wave. Add the third period and you need to estimate a different weight. This is standard practice, I believe; see, e.g., this link (https://dsdr-kb.psc.isr.umich.edu/answer/1007).

Possible complications (I don't know how or whether they should affect the weight construction): response at wave 2 could predict dropout at wave 3, whereas dropout at wave 2 could predict response at wave 3. A Google search on "longitudinal attrition weight" turns up many references. This is about the limit of my knowledge, so I'll stop here.
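
A minimal sketch of what "a separate weight for each occasion" might look like, on simulated data. Everything here is an assumption for illustration: x stands in for the baseline covariates, r2 and r3 are hypothetical participation indicators for waves 2 and 3, and each wave's participation is modeled from baseline only, which ignores the complications mentioned above.

```stata
* Simulated baseline sample; all names and parameter values are hypothetical
clear
set seed 2018
set obs 2000
generate id = _n
generate x = rnormal()
generate byte r2 = runiform() < invlogit(0.4 + 0.6*x)
generate byte r3 = runiform() < invlogit(0.1 + 0.6*x)

* One participation model and one weight per follow-up wave
foreach w in 2 3 {
    logit r`w' x
    predict double p`w', pr
    * Only mothers observed at that wave receive a weight for that wave
    generate double ipw`w' = 1/p`w' if r`w' == 1
}

summarize ipw2 ipw3
```

Each weight would then be used only when analyzing the measurements from its own wave; the weights are not added together.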

Good luck!

Last edited by Steve Samuels; 17 Aug 2018, 09:29.

Steve Samuels
Statistical Consulting

Stata 14.2
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#3

17 Aug 2018, 09:50

Sorry, by "response(s)" I meant "outcomes" or "measurements".

John Adler

Join Date: Apr 2017

Posts: 173
#4

18 Aug 2018, 05:19

Dear Steve,

Thank you very much for the link you provided, which of course makes sense in the context of what you suggest. I will update my analysis accordingly.

I appreciate that you can't help me in applying the weights I develop in the context of my random effects regression, but I was wondering if you had any thoughts on the syntax I am using to create my weight, as supplied above?

As my data are gathered over three waves, my intention is to use the above Stata code to create a weight for appearing in wave 1 but not 2, a weight for appearing in wave 2 but not 3, and a weight for appearing in wave 1 but not 3. I would use these weights to analyze attrition for each combination of waves individually, and then combine the three to analyse attrition across all three waves. So it's important to ensure that the syntax is doing what I think it is doing.

All the best,

John
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#5

18 Aug 2018, 20:03

Unfortunately, I don't understand your weighting project as described in your last paragraph, for example what you mean by "a weight for appearing in wave 1 but not 3" and how you would use such a weight.
Random effects and weights: Mixed effects (me) models accept different kinds of weights and estimate random effects. Perhaps you can fit a panel model with melogit or meprobit.
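
As a rough illustration of the melogit suggestion, here is a sketch of a random-intercept logit with observation-level probability weights. The data are simulated, and id, wave, x, y and wt are hypothetical names; wt is only a placeholder for a real wave-specific attrition weight. How such weights should be scaled in a multilevel model is a separate question that this sketch does not address.

```stata
* Simulated panel: 300 mothers observed at 3 waves (all values made up)
clear
set seed 4321
set obs 300
generate id = _n
generate u = rnormal()
expand 3
bysort id: generate wave = _n
generate x = rnormal()
generate byte y = runiform() < invlogit(-0.5 + 0.8*x + u)

* Placeholder for an attrition weight; a real one would come from a participation model
generate double wt = 1 + runiform()

* Random intercept for each mother, pweights applied to the observations
melogit y x [pweight = wt] || id:
```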

Last edited by Steve Samuels; 18 Aug 2018, 20:06.

John Adler

Join Date: Apr 2017

Posts: 173
#6

20 Aug 2018, 09:52

Dear Steve,

Thank you for your feedback, I think the mixed effects approach is an excellent idea. Actually, I am now considering applying this approach as my core analysis instead of the random effects approach I was taking earlier.

Do you know if I need to do a Hausman test to compare mixed with fixed effects, or anything else like that? Or is it enough to say that random effects models do not readily extend to the application of weights, so, to support greater comparability between weighted and unweighted results, linear mixed-effects models are used in the core analysis and then re-weighted with inverse probability weighting, to provide an analysis of the dataset that emphasises the experience of attriting mothers to a greater degree?

Similarly, you mentioned that mixed effects models estimate random effects, but I had heard somewhere that they combine fixed and random effects. Could you explain this?

Thank you again,

Jonathan
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#7

20 Aug 2018, 12:22

Sorry, I'm not expert enough in these models to comment knowledgeably; I couldn't even say what a Hausman Test is.

Last edited by Steve Samuels; 20 Aug 2018, 12:25.


John Adler

Join Date: Apr 2017
Posts: 173

25 Aug 2018, 04:01

Thank you very much for your feedback here and elsewhere. We seem to be interacting a lot on the list at the moment, so I hope I haven't become too annoying!

I just wanted to return to the intuition and syntax for my inverse probability weight and to see if this made sense to you.

I want to create a weight that is the inverse probability of being a stayer in the sample (having not attrited). So I run a logistic regression of being in the sample on some things that should influence that, create a predicted probability, and then generate a new variable set to missing. I then replace that variable with one over the probability of being in the sample if the individual is in the sample, and with one over one minus the probability of being in the sample if the individual is not in the sample, in order to create my inverse probability weight.

To put it bluntly, is what I am describing above the same thing that I am doing below? I am new to weighting and not confident in my syntax.

Similarly, thank you for the advice and the link you provided (https://dsdr-kb.psc.isr.umich.edu/answer/1007) on creating weights for each wave, but I am finding it hard to determine the best way to apply this in Stata. Should I create a weight for attrition in each wave and then literally add these weights together, i.e. pweightwave1 + pweightwave2 + pweightwave3, or do I do something along the lines of:

Code:

mixed binary_health_y psum_unemployed_total_cont_y own_education_y maritalstatus_y medical_card_y employment_y ord_age_y [aw=pweightwave1] [aw=pweightwave2] [aw=pweightwave3]  if insampm == 1, cluster ( address_current_county_2002 )

Very best,

John

Code:

 
 capture drop insampm  
generate insampm = 0

recode insampm 0 = 1 if has_y0_questionnaire==1 &  has_y5_questionnaire==1 | has_y0_questionnaire==1 & has_y10_questionnaire==1 | has_y0_questionnaire==1 & has_y5_questionnaire==1 & has_y10_questionnaire==1


. tab insampm

    insampm |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      1,464       44.28       44.28
          1 |      1,842       55.72      100.00
------------+-----------------------------------
      Total |      3,306      100.00

.

Code:



. logit insampm i.cown_education_y0 i.cmaritalstatus_y0 i.cmedical_card_y0 i.cemployment_y0 i.cord_age_y0, cluster ( address_current_county_2002 )

note: 1.cown_education_y0 != 0 predicts failure perfectly
      1.cown_education_y0 dropped and 3 obs not used

note: 5.cemployment_y0 != 0 predicts failure perfectly
      5.cemployment_y0 dropped and 6 obs not used

note: 6.cown_education_y0 omitted because of collinearity
Iteration 0:   log pseudolikelihood = -1983.1518  
Iteration 1:   log pseudolikelihood = -1839.8105  
Iteration 2:   log pseudolikelihood = -1839.0974  
Iteration 3:   log pseudolikelihood = -1839.0964  
Iteration 4:   log pseudolikelihood = -1839.0964  

Logistic regression                             Number of obs     =      2,919
                                                Wald chi2(18)     =    1417.40
                                                Prob > chi2       =     0.0000
Log pseudolikelihood = -1839.0964               Pseudo R2         =     0.0726

                                                          (Std. Err. adjusted for 30 clusters in address_current_county_2002)
-----------------------------------------------------------------------------------------------------------------------------
                                                            |               Robust
                                                    insampm |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
------------------------------------------------------------+----------------------------------------------------------------
                                          cown_education_y0 |
                                              No schooling  |          0  (empty)
                                  Primary school education  |  -1.980647   .8746838    -2.26   0.024    -3.694996   -.2662986
                                     Some secondary school  |  -.3748552   .2317244    -1.62   0.106    -.8290268    .0793163
                              Complete secondary education  |  -.3227909   .1922295    -1.68   0.093    -.6995538     .053972
    Some third level education at college, university, RTC  |  -.5120794   .2396614    -2.14   0.033    -.9818071   -.0423517
Complete third level education at college, university, RTC  |          0  (omitted)
                                                            |
                                          cmaritalstatus_y0 |
                                                Cohabiting  |  -.3362515   .3143212    -1.07   0.285    -.9523098    .2798068
                                                  Divorced  |  -1.057085   .6657199    -1.59   0.112    -2.361872    .2477024
                                                   Widowed  |  -1.542918   1.233189    -1.25   0.211    -3.959924    .8740873
                                      Single/Never married  |  -.3289091    .257969    -1.27   0.202    -.8345191    .1767009
                                                            |
                                           cmedical_card_y0 |
                                                       Yes  |  -.1179747   .1656617    -0.71   0.476    -.4426656    .2067163
                                                            |
                                             cemployment_y0 |
                                                Unemployed  |    .075984   .3919981     0.19   0.846    -.6923183    .8442862
  Unable to work owing to permanent sickness or disability  |  -.4583487   .5561027    -0.82   0.410     -1.54829    .6315926
                                         At school/student  |   .9783511   .3391637     2.88   0.004     .3136025      1.6431
                           Seeking work for the first time  |          0  (empty)
                                                  Employed  |   .2686171   .1191097     2.26   0.024     .0351663    .5020679
                                             Self Employed  |   .4014955    .419458     0.96   0.338     -.420627    1.223618
                                                            |
                                                cord_age_y0 |
                                                     20-23  |   .2899182   .2319089     1.25   0.211     -.164615    .7444513
                                                     24-27  |   .5287781    .307094     1.72   0.085    -.0731151    1.130671
                                                     28-32  |   1.025553    .339614     3.02   0.003     .3599222    1.691185
                                                      33 +  |   1.257928   .3210913     3.92   0.000      .628601    1.887256
                                                            |
                                                      _cons |  -.3087511   .4024602    -0.77   0.443    -1.097559    .4800564
-----------------------------------------------------------------------------------------------------------------------------

. predict p_insampm, pr
(387 missing values generated)

Code:


. gen w=.
(3,306 missing values generated)

. replace w=1/p_insampm if insampm==1
(1,701 real changes made)

. replace w=1/(1-p_insampm) if insampm==0
(1,218 real changes made)

end of do-file

. summarize w

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
           w |      2,919    1.998182    .7833788   1.096622   6.406883


Steve Samuels

Join Date: Mar 2014

Posts: 1786
#9

25 Aug 2018, 20:58

I'm not going to comment on your code, because I've already said that what you are trying is not the right way to do attrition weighting. Your classification of "in the sample" amounts to "in the sample at the second or the third wave (or both)". In a panel study, one needs to analyze the measurements at each wave. What is needed is a weight to account for non-participation at each wave. Further, when you attempt to assign the inverse of the 1-prob weight to people who missed the last two waves, you are mixing up attrition weighting with inverse probability of treatment weighting (IPTW). I suggest a Google search for "attrition weighting" to get an idea of accepted practice in this area. I'm sorry that I can't help you further. Good luck!
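
To make the distinction concrete, here is a small sketch on simulated data that contrasts the two constructions. The names x, stay, w_att and w_iptw are hypothetical, and a single stay/leave indicator is used only to keep the example short; it does not replace the wave-by-wave weights described above.

```stata
* Simulated baseline sample with a single stay/leave indicator (values made up)
clear
set seed 99
set obs 3000
generate x = rnormal()
generate byte stay = runiform() < invlogit(0.2 + 0.7*x)

logit stay x
predict double p, pr

* Attrition weight: leavers contribute no outcome data, so only stayers are weighted
generate double w_att = 1/p if stay == 1

* IPTW-style construction from the original post: both groups receive a weight
generate double w_iptw = cond(stay == 1, 1/p, 1/(1 - p))

* The stayers' attrition weights should sum to roughly the baseline sample size
summarize w_att
display "Sum of attrition weights: " r(sum) "   baseline N: " _N
```

In the attrition version each stayer stands in for herself plus similar mothers who left, which is why the weights add up to approximately the size of the baseline sample.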

Last edited by Steve Samuels; 25 Aug 2018, 21:40.

Sousan Hamwi

Join Date: Jun 2021

Posts: 1
#10

08 Sep 2021, 07:10

Dear John,
I know it's been a long time since you posted this, but I'm currently having a very similar problem and I was wondering if you eventually managed to find the answer! Thanks!

I want to create a weight that is the inverse probability of being a stayer in the sample (having not attrited). So I run a logistic regression of being in the sample on some things that should influence that, create a predicted probability, and then generate a new variable set to missing. I then replace that variable with one over the probability of being in the sample if the individual is in the sample, and with one over one minus the probability of being in the sample if the individual is not in the sample, in order to create my inverse probability weight.

To put it bluntly, is what I am describing above the same thing that I am doing below? I am new to weighting and not confident in my syntax.
