Is my approach to considering the probability of attrition in panel data reasonable?

John Adler

Join Date: Apr 2017
Posts: 173

13 Mar 2018, 15:47

I have a panel of mothers surveyed at 3 time points.

I would like to determine whether, and in what way, the mothers who left the panel after year 1 differ from the mothers who stayed in the survey for all three years.

I create the variable below to flag mothers who did not have a questionnaire (and thus were not in the survey) at either of the last 2 time points. A mother who had a questionnaire at either of these time points is recorded as not having attrited.

Code:

generate leftsamp=.

replace leftsamp = 1 if has_y5_questionnaire == 0 & has_y10_questionnaire == 0

replace leftsamp = 0 if has_y5_questionnaire == 1 | has_y10_questionnaire == 1


. tab leftsamp

   leftsamp |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        617       55.99       55.99
          1 |        485       44.01      100.00
------------+-----------------------------------
      Total |      1,102      100.00

Next, I created the tables below to put together some descriptive statistics on leavers and stayers.

Code:


. tab no_cigs_cons_more0_y0 if gender == 0 & leftsamp == 1

     Do you |
    consume |
more than 0 |
ciagarettes |
     a day? |      Freq.     Percent        Cum.
------------+-----------------------------------
         No |        329       68.68       68.68
        Yes |        150       31.32      100.00
------------+-----------------------------------
      Total |        479      100.00

.
.
. tab no_cigs_cons_more0_y0 if gender == 0 & leftsamp == 0

     Do you |
    consume |
more than 0 |
ciagarettes |
     a day? |      Freq.     Percent        Cum.
------------+-----------------------------------
         No |        502       82.03       82.03
        Yes |        110       17.97      100.00
------------+-----------------------------------
      Total |        612      100.00


* I check for significance in the table

. tab no_cigs_cons_more0_y0 leftsamp if gender == 0, column row nokey chi2 lrchi2 V exact gamma taub

    Do you |
   consume |
 more than |
         0 |
ciagarette |       leftsamp
  s a day? |         0          1 |     Total
-----------+----------------------+----------
        No |       502        329 |       831
           |     60.41      39.59 |    100.00
           |     82.03      68.68 |     76.17
-----------+----------------------+----------
       Yes |       110        150 |       260
           |     42.31      57.69 |    100.00
           |     17.97      31.32 |     23.83
-----------+----------------------+----------
     Total |       612        479 |     1,091
           |     56.10      43.90 |    100.00
           |    100.00     100.00 |    100.00

          Pearson chi2(1) =  26.3475   Pr = 0.000
 likelihood-ratio chi2(1) =  26.2048   Pr = 0.000
               Cramér's V =   0.1554
                    gamma =   0.3508  ASE = 0.063
          Kendall's tau-b =   0.1554  ASE = 0.030
           Fisher's exact =                 0.000
   1-sided Fisher's exact =                 0.000

I take this to suggest an association between the two variables, so I estimate the direction and size of the effect as below:

Code:


. logit leftsamp no_cigs_cons_more0_y0 if gender==0, cluster ( address_current_county_2002 )

Iteration 0:   log pseudolikelihood = -748.09659  
Iteration 1:   log pseudolikelihood = -734.99542  
Iteration 2:   log pseudolikelihood = -734.99419  
Iteration 3:   log pseudolikelihood = -734.99419  

Logistic regression                             Number of obs     =      1,091
                                                Wald chi2(1)      =       5.75
                                                Prob > chi2       =     0.0165
Log pseudolikelihood = -734.99419               Pseudo R2         =     0.0175

                    (Std. Err. adjusted for 30 clusters in address_current_county_2002)
---------------------------------------------------------------------------------------
                      |               Robust
             leftsamp |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
----------------------+----------------------------------------------------------------
no_cigs_cons_more0_y0 |   .7326973   .3055789     2.40   0.016     .1337737    1.331621
                _cons |  -.4225424   .1634234    -2.59   0.010    -.7428464   -.1022383
---------------------------------------------------------------------------------------

. margins if gender==0, dydx( no_cigs_cons_more0_y0 ) post

Average marginal effects                        Number of obs     =      1,091
Model VCE    : Robust

Expression   : Pr(leftsamp), predict()
dy/dx w.r.t. : no_cigs_cons_more0_y0

---------------------------------------------------------------------------------------
                      |            Delta-method
                      |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
----------------------+----------------------------------------------------------------
no_cigs_cons_more0_y0 |   .1760942   .0695651     2.53   0.011     .0397491    .3124394
---------------------------------------------------------------------------------------

. estimates store logitmod

. estimates table logitmod, star stats(N r2 r2_a)

------------------------------
    Variable |   logitmod    
-------------+----------------
no_cig~e0_y0 |  .17609424*    
-------------+----------------
           N |       1091    
          r2 |                
        r2_a |                
------------------------------
legend: * p<0.05; ** p<0.01; *** p<0.001

.

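Based on these results, I assume that mothers who smoke were about 17.6 percentage points more likely to leave the sample after wave one (the average marginal effect of .176 is a change in probability, not a 17% relative increase), and I report this in my paper.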
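As a rough check on the size of that effect, the predicted probabilities can be recovered by hand from the logit coefficients in the output above. This is only a sketch: the coefficient values are copied from the log, and the scalar names are made up for illustration. Because the smoking indicator is entered without the i. prefix, margins treats it as continuous and reports an averaged derivative, so its figure need not match the discrete difference computed here exactly.

```stata
* Coefficients copied from the logit output above (scalar names are illustrative)
scalar b_smoke = .7326973
scalar b_cons  = -.4225424

* Predicted probability of leaving the sample for non-smokers and smokers
scalar p_non   = invlogit(b_cons)
scalar p_smoke = invlogit(b_cons + b_smoke)

* Absolute difference (percentage points), relative increase, and odds ratio
display "Pr(left | non-smoker) = " %6.4f p_non
display "Pr(left | smoker)     = " %6.4f p_smoke
display "Difference (pp)       = " %6.2f 100*(p_smoke - p_non)
display "Relative increase (%) = " %6.1f 100*(p_smoke/p_non - 1)
display "Odds ratio            = " %6.3f exp(b_smoke)
```

With a single binary predictor, the two probabilities should reproduce the row percentages in the cross-tabulation (39.59 and 57.69). That is a gap of about 18 percentage points, or a relative increase of roughly 46%.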

But can anybody tell me if this is a reasonable approach? I cluster at the area that the mother lives in. I suppose I could also include some of her baseline characteristics, as I used these in an earlier analysis of whether mothers smoked and of employment change; I just wasn't sure how they fit here.
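For concreteness, a version of the model with baseline characteristics might look like the sketch below. The covariate names mother_age_y0 and educ_level_y0 are hypothetical placeholders, not variables confirmed to exist in my data; the outcome, the smoking indicator and the cluster variable are the ones used above.

```stata
* Hypothetical baseline covariates: mother_age_y0 (continuous), educ_level_y0 (categorical)
logit leftsamp i.no_cigs_cons_more0_y0 c.mother_age_y0 i.educ_level_y0 ///
    if gender == 0, vce(cluster address_current_county_2002)

* With the i. prefix, margins reports the discrete change from 0 to 1,
* averaged over the estimation sample
margins, dydx(no_cigs_cons_more0_y0)
```

Entering the smoking indicator as a factor variable makes margins report the change in the predicted probability of leaving between non-smokers and smokers, holding the other covariates at their observed values.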

Happy to hear anyone's thoughts on this approach.

Kindest regards,

John

Last edited by John Adler; 13 Mar 2018, 16:19.

Tags: attrition, missing data, panel, panel data, syntax
