  • In panel data regression do respondents in only 1 wave still contribute to the analysis?

    I've been asked to conduct a linear probability fixed effects analysis of children in panel data. There are three waves of data, and it was suggested that the sample be restricted to children who appear in the first wave and at least one other wave (wave two, wave three, or both).

    I understand technically how to do this, as below:

    Code:
    xtreg child_hours_slept_y control_var1 control_var2 if wave1==1 &  wave2==1 | wave1==1 & wave3==1 | wave1==1 & wave2==1 & wave3==1, cluster (child_location) fe robust
    So I've created a sample of children who appear in at least 2 waves, and I analyse this sample in several regressions (waves 1, 2 and 3; waves 1 and 2; waves 2 and 3; waves 1 and 3) by dropping the waves below as necessary.

    Code:
    drop if year == 3
    
    drop if year == 2
    
    drop if year == 1

    Here the sample is children in wave 1 and at least one other wave and no-one new joins the data after recruitment into wave 1. But, what I want to know is, in a linear probability fixed effects model, are children still contributing to the analysis even when they appear in only one wave?

    So if I have a child who appears in wave 1 and 3 and I analyse wave 1 and 2 (by dropping wave 3 as above before running my regression), according to the sample inclusion rules this child will still be included in the analysis:

    - as above

    Code:
    if wave1==1 &  wave2==1 | wave1==1 & wave3==1 | wave1==1 & wave2==1 & wave3==1
    but will they be dropped from the regression? Or will their wave 1 data still be included in the analysis and in what way?

    i.e. in a random effects model even if people drop out they provide a cohort effect, is the linear probability model doing likewise here?

    I am very confused,

    Thank you

    Jack

  • #2
    In a fixed effects regression, children who participated in only one wave will not contribute to the regression coefficients. If you run the model both including them and after dropping them, you will see that the slope coefficients are identical (the reported _cons can shift, because it is computed from the means of the estimation sample). You will also see a difference in sigma_u and rho.
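
A minimal sketch of that check, assuming Stata's bundled nlswork example panel rather than the data in this thread (the variables ln_wage, age and tenure and the helper n_used are stand-ins, not the poster's variables):

```stata
* Sketch only: nlswork is a stand-in panel, not the thread's data
webuse nlswork, clear
xtset idcode year

* Fit on everyone, then count each woman's observations in the estimation sample
xtreg ln_wage age tenure, fe
matrix b_all = e(b)
bysort idcode: egen n_used = total(e(sample))

* Refit after excluding groups that contribute a single observation
xtreg ln_wage age tenure if n_used > 1, fe
matrix b_multi = e(b)

* Slopes should agree; _cons, sigma_u and rho are free to differ
matrix list b_all
matrix list b_multi
```

Comparing the two coefficient vectors side by side makes it easy to see which quantities the single-observation groups can and cannot move.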



    • #3
      Thank you for the explanation!

      I ran a linear probability fixed effects regression on the sample of children who are in Wave 1 and at least one other wave (2, 3, or both), and then a second regression in which I drop wave 2 so that the analysis covers waves 1 and 3 only. The number of groups is in the 500 range in the first analysis and in the 400 range when I restrict the regression to complete cases, i.e. children in Wave 1 AND Wave 3:

      Code:
       
       drop if year == 2

      So I was wondering whether, in panel data, the fixed effects linear probability regression does not remove from the estimation sample individuals whose outcome does not vary. Does the larger number of groups indicate that the linear probability model makes use of both within and between variation? Or are you telling me that, although I have a higher number of groups in my first analysis, the extra children are not contributing to the regression coefficients, i.e. they get dropped? If so, why is the number of observations higher, and why is avg. obs per group less than 2, which suggests that some children are contributing only one wave of data?


      Here is the first analysis

      Code:
      . xtreg child_hours_slept_y control_var1 control_var2 if wave1 ==1 & wave2==1 | wave1==1 & wave3==1 | wave1==1 & wave2==1 & wave3==1, cluster (location) fe robust 
      
      Fixed-effects (within) regression               Number of obs      =      1005
      Group variable: id                              Number of groups   =       576
      
      R-sq:  within  = 0.0031                         Obs per group: min =         1
             between = 0.0029                                        avg =       1.7
             overall = 0.0024                                        max =         2
      
                                                      F(2,28)            =      2.04
      corr(u_i, Xb)  = -0.0206                        Prob > F           =    0.1490
      
                                           (Std. Err. adjusted for 29 clusters in location)
      -------------------------------------------------------------------------------------
                          |               Robust
      child_hours_slept_y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      --------------------+----------------------------------------------------------------
             control_var1 |  -.0068594   .0051416    -1.33   0.193    -.0173915    .0036727
             control_var2 |    .004084    .004422     0.92   0.364    -.0049741     .013142
                    _cons |   .6968944   .1008319     6.91   0.000     .4903497    .9034391
      --------------------+----------------------------------------------------------------
                  sigma_u |  .37366609
                  sigma_e |  .35506241
                      rho |  .52551234   (fraction of variance due to u_i)
      -------------------------------------------------------------------------------------
      
      .


      Here is the complete case:

      Code:
      
      . xtreg child_hours_slept_y control_var1 control_var2 if wave1==1 &  wave3==1, cluster (location) fe robust
      
      Fixed-effects (within) regression               Number of obs      =       869
      Group variable: id                              Number of groups   =       440
      
      R-sq:  within  = 0.0031                         Obs per group: min =         1
             between = 0.0029                                        avg =       2.0
             overall = 0.0024                                        max =         2
      
                                                      F(2,28)            =      2.04
      corr(u_i, Xb)  = -0.0233                        Prob > F           =    0.1491
      
                                           (Std. Err. adjusted for 29 clusters in location)
      -------------------------------------------------------------------------------------
                          |               Robust
          binary_health_y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      --------------------+----------------------------------------------------------------
             control_var1 |  -.0068594   .0051424    -1.33   0.193    -.0173932    .0036744
             control_var2 |    .004084   .0044227     0.92   0.364    -.0049755    .0131434
                    _cons |   .7100747   .1016855     6.98   0.000     .5017813    .9183681
      --------------------+----------------------------------------------------------------
                  sigma_u |  .33797811
                  sigma_e |  .35506241
                      rho |  .47536375   (fraction of variance due to u_i)
      -------------------------------------------------------------------------------------
      
      .


      • #4
        The wave 1 only participants do contribute to the estimation of sigma_u, so they are counted when Stata reports the number of observations and number of groups at the top of the regression table. But they have no effect on the coefficients: if you were to re-run the same model excluding them, you would see that the coefficients do not change.



        • #5
          Hi Clyde,

          So does that mean that if I were to run a regression on children who met my inclusion criterion (i.e. in at least 2 of 3 waves) and then a complete case analysis that I would be effectively running the exact same analysis twice? I had hoped to do a complete case analysis as a robustness check on my original analysis but now I'm concerned that this doesn't make any sense to do... I had heard linear probability models described as a recommended alternative to logistic models when outcomes are uncommon (RefA) and that logistic regression will fail to estimate an impact if treatment status perfectly predicts the outcome unlike linear probability models, the prevalence rate of the outcome is very high or low (RefB) so I had assumed that this estimator was closer to random effects in that within variation wasn't necessary for data to stay in the analysis, am I completely misunderstanding everything?


          Thank you,

          Jack


          References:

          RefA: Oddo, V. M., Nicholas, L. H., Bleich, S. N., & Jones-Smith, J. C. (2016). The impact of changing economic conditions on overweight risk among children in California from 2008 to 2012. J Epidemiol Community Health, 70(9), 874-880. https://www-ncbi-nlm-nih-gov.ucd.idm...ubmed/27251405 Page 5


          RefB: Deke, J. (2014). Using the linear probability model to estimate impacts on binary outcomes in randomized controlled trials (No. 62a1477e274d429faf7e0c71ba1204b2). Mathematica Policy Research. https://www.hhs.gov/ash/oah/sites/de...pm-tabrief.pdf Page 3



          • #6
            "So does that mean that if I were to run a regression on children who met my inclusion criterion (i.e. in at least 2 of 3 waves) and then a complete case analysis that I would be effectively running the exact same analysis twice?"
            I don't understand how this question relates to what we have been discussing. A complete case analysis means an analysis using only participants who were present in all three waves.

            Let's review the analyses you describe in #1. You start with an analysis in which participants in at least two of the three waves are included. Some of these will be waves 1 and 2, some 1 and 3, and some 2 and 3, and some 1, 2, and 3. If you now drop the year 2 observations, you will be left with 2 observations for each participant who was in waves 1 and 3, or 1, 2, and 3, but only a single observation for those who were in waves 1 and 2 or 2 and 3 only. In this analysis the coefficients will all be the same as they would be in an analysis of only those who participated in waves 1 and 3 or in waves 1, 2, and 3. However, the reported numbers of observations and groups will differ, and sigma_u may differ as well. So it is not the exact same analysis. But if you are thinking of it as a robustness check on the original model, it does seem rather pointless as there your focus would primarily be on the coefficients, and there is no possibility of the coefficients coming out different.
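
To make the bookkeeping in that paragraph concrete, here is a small sketch on made-up data. The toy values and the helper variable n_waves are hypothetical; only the names id and year follow the thread.

```stata
* Toy long-format panel: four hypothetical children, each seen in at least two waves
clear
input id year y x
1 1 0 2
1 2 1 3
1 3 1 5
2 1 1 4
2 2 0 1
3 1 0 2
3 3 1 6
4 2 1 3
4 3 0 2
end

* Dropping wave 2 leaves children 2 and 4 with a single observation each
drop if year == 2
bysort id: gen n_waves = _N
tabulate n_waves

* Children with n_waves == 1 stay in the data but carry no within-child variation
list id year n_waves, sepby(id)
```

Restricting a later -xtreg, fe- call with `if n_waves == 2` would give the sample of children who have both wave 1 and wave 3.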

            "I had heard linear probability models described as a recommended alternative to logistic models when outcomes are uncommon (RefA)"
            Well, they are an alternative. I'm not sure I would recommend it, but it's an alternative. It is known that logistic regression coefficients can be upwardly biased (in magnitude) when the outcome is either very rare or nearly universal. But in those same circumstances, linear probability models sometimes produce predicted probabilities outside the [0,1] interval. So I'm not sure that's better!
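
A small illustration of the out-of-range problem, using ten invented observations with a rare outcome (the data and the variable names x, y and p_hat are hypothetical and chosen only to make the point visible):

```stata
* Toy data: the outcome equals 1 only for the two largest values of x
clear
set obs 10
generate x = _n
generate y = (x >= 9)

* Linear probability model fitted by OLS
regress y x
predict p_hat, xb

* Fitted "probabilities" below 0 or above 1
count if p_hat < 0 | p_hat > 1
list x y p_hat
```

With these values the fitted line is negative for the three smallest values of x, so three predictions fall below zero.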

            "and that logistic regression will fail to estimate an impact if treatment status perfectly predicts the outcome unlike linear probability models,"
            That is true. But, again, in this situation you may find that your linear probability model predicts probabilities < 0 or > 1. In this situation you might want to look at -firthlogit-, which fits a logistic model but uses penalized maximum likelihood estimation, which overcomes the limitations of the logistic model that you are worried about.

            "so I had assumed that this estimator was closer to random effects in that within variation wasn't necessary for data to stay in the analysis"
            I don't understand the connection you are drawing here with random effects.



            • #7
              Thank you for explaining this. I guess I am confused because a fixed effects logit would cut the number of observations down to those who changed across waves or had data in 2 waves, and a linear probability model doesn't seem to do that, i.e. it always has a greater number of observations than a logit. That is why I was drawing a connection with the random effects model: I wasn't sure why the lpm and logit differ and what bearing that would have on my analysis, so I guessed that the lpm had some characteristics similar to the random effects estimator. This was strengthened by my reading of those references, which seemed to suggest (to me) that the lpm would allow for a greater number of observations than a logit model. I'm afraid I still don't understand why the number of observations differs between the lpm and the logit, or the relevance of this for my analysis. Would you be able to explain this to me?



              • #8
                In -xtlogit, fe- any group that has no variation in outcome is dropped completely--that means any group with just one observation is dropped completely--hence a reduction in the number of observations reported. In -xtreg, fe- they are not completely dropped: they still influence the estimation of sigma_u, but not the coefficients. That is the difference.
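
The difference in reported sample sizes can be seen with a sketch like the following, which assumes Stata's bundled union example panel (the variables union, age and grade belong to that dataset, not to the data in this thread):

```stata
* Sketch only: union is a stand-in panel with a binary outcome
webuse union, clear
xtset idcode year

* Linear probability model with fixed effects: every group stays in e(sample)
xtreg union age grade, fe
scalar N_lpm = e(N)

* Conditional fixed effects logit: groups with no outcome variation are dropped
xtlogit union age grade, fe
scalar N_logit = e(N)

display "obs used by xtreg, fe:   " N_lpm
display "obs used by xtlogit, fe: " N_logit
```

The -xtlogit, fe- output also prints a note reporting how many groups and observations were dropped because the outcome was all positive or all negative.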



                • #9
                  Hi Clyde,

                  OK, that makes sense, thank you. Could you help me to understand sigma_u? I've read the documentation on fixed and random effects here: https://www.stata.com/manuals13/xtxtreg.pdf but can't seem to get an intuitive understanding. It's the between-subject standard deviation, so it's the average difference between individuals, if I'm correct? How then do the groups with no variation influence sigma_u but not the coefficients?

                  Thank you for your time,

                  Jack



                  • #10
                    Well, although this is not how the calculations are actually done, and it is not a fully accurate picture, you can think of sigma_u as being something like this: in each group calculate the mean outcome. Now calculate the standard deviation of those means. Sigma_u is related to that standard deviation. So even a group with a single observation has a mean outcome, which can be included in the sigma_u calculation.

                    The reason groups with only one observation don't influence the coefficients is that the coefficient estimates in a fixed-effects model are purely estimates of within-group effects. If you have only one observation in a group, then there is no variation in either the dependent or the independent variables within that group. So it is not possible to estimate a within-group effect of the independent variables on the dependent variable. Hence these groups are uninformative about within-group effects, and they do not influence the coefficients.
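
The same argument can be written out with the usual within transformation. The notation below is the standard textbook form and is an assumption here, not something taken from the posts: $T_i$ is the number of observations for group $i$, and bars denote group means.

```latex
\begin{align*}
y_{it} &= \alpha + x_{it}'\beta + u_i + e_{it}, \\
y_{it} - \bar{y}_i &= (x_{it} - \bar{x}_i)'\beta + (e_{it} - \bar{e}_i), \\
\hat{\beta} &= \Big(\sum_i \sum_{t=1}^{T_i} \tilde{x}_{it}\tilde{x}_{it}'\Big)^{-1}
               \sum_i \sum_{t=1}^{T_i} \tilde{x}_{it}\tilde{y}_{it},
\qquad \tilde{x}_{it} = x_{it} - \bar{x}_i,\quad \tilde{y}_{it} = y_{it} - \bar{y}_i, \\
T_i = 1 \;&\Rightarrow\; \bar{y}_i = y_{i1},\ \bar{x}_i = x_{i1}
\;\Rightarrow\; \tilde{y}_{i1} = 0,\ \tilde{x}_{i1} = 0, \\
\hat{u}_i &= \bar{y}_i - \hat{\alpha} - \bar{x}_i'\hat{\beta}.
\end{align*}
```

A group with $T_i = 1$ adds zero to both sums in $\hat{\beta}$, so it cannot move the slopes. Its $\hat{u}_i$ is still well defined, which is roughly why it can enter the calculation of sigma_u.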

