
  • Understanding the random effects regression conceptually.

    I’m confused by exactly how a random effects regression works in panel data using Stata. Take for example a logit regression in panel data, assuming that there are three waves of data, as follows:

    Code:
    xtlogit DV IV, re nolog   // in Stata syntax the dependent variable comes first
    Conceptually, I understand that in a fixed effects regression, Stata focuses on a single person and determines whether that individual's dependent variable changed across the waves of the panel, repeating this for every qualifying individual in the sample. This often leads to small sample sizes.

    However, in the case of a random effects regression, the sample that Stata uses in the analysis is generally much larger, which leads me to believe that individuals are treated differently in this kind of regression.

    Are individual waves in the panel added together to make one larger wave in which the effect of the independent variable on the dependent variable is considered? I would then assume that the wave in which an observation was collected is no longer considered, and that the focus switches to just the relationship between the IV and the DV, regardless of wave. Basically, that the data are treated as one large single-wave sample in a random effects regression, much like in a simple linear OLS regression of non-panel data.

    But is this the case? Can somebody explain to me what Stata is doing in a random effects regression, and how this translates into the sample that one ends up with at the end of this regression?

    Very best,

    J

  • #2
    Well, you're very close to right. You can think of a random-effects logistic model as being like a single logistic regression of the DV on the IV, except that each individual has a "customized" intercept (constant term). And, these customized intercepts are assumed to come from a normal distribution, whose variance is estimated from the data. It is not really analogous to OLS regression, but it is quite analogous to -xtreg, re-. So it does not have to remove from the estimation sample individuals whose outcome does not vary. This is why the -re- sample may be larger than the -fe- sample.
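
    For intuition, here is a minimal simulation sketch of that data-generating process. Every variable name and parameter value below is hypothetical, chosen only to illustrate the structure:

    Code:
    * Each person draws one Gaussian intercept u_i, shared across all waves
    clear
    set seed 12345
    set obs 500                          // 500 hypothetical individuals
    generate id = _n
    generate u = rnormal(0, 1.5)         // person-level random intercept
    expand 3                             // three waves per person
    bysort id: generate wave = _n
    generate x = rnormal()               // a time-varying IV
    generate y = runiform() < invlogit(0.5 + 0.8*x + u)
    xtset id wave
    xtlogit y x, re nolog                // should roughly recover 0.8 and 1.5
    No individual is dropped here even when y never changes for them: the random intercepts are integrated out rather than conditioned away.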

    (Tangent: it is interesting that your experience is that the -fe- samples are usually much smaller. In my work, most of the data sets to which I have occasion to apply -xtlogit, fe- are data sets that, by design, have variation in the outcome variable for every group. In fact, usually, by design, there is exactly one observation per group with outcome = 1 and all the rest have outcome = 0. In such data sets, no individuals end up being omitted from the estimation sample. So my experience has been the opposite of yours: the sample sizes are usually the same, and if they are not, it is a red flag that the data may contain errors. But you are probably working in a very different context from mine.)



    • #3
      Thank you, Clyde, for a clear explanation of the above.

      You are correct; my research mostly focuses on opportunistic cohorts of panel data, where respondents can (and often do) leave the sample before we would like them to, questionnaires change across waves, and so on.

      Based on this, to extend the sample size question a little further: in a purely hypothetical panel dataset with three waves, where, say, 400 respondents were examined in wave 1, 200 of these respondents were re-examined in wave 2, and 100 of those were re-examined in wave 3, does the above explanation suggest that a random effects logit regression across the three waves would report a sample size of 700, assuming no missing values?

      At the same time, would I be right in assuming that a fixed effects regression would report a sample size of 400 individuals, again assuming no missing values?

      And could this difference be attributed to the fact that a fixed effects regression focuses on the individual level, i.e. on at most 400 individual respondents, whereas the random effects regression treats each wave-specific measurement of the DV as a unique data point, thus providing 700 examples of the effect of the IV on the DV?

      Many thanks,

      J



      • #4
        Based on this, to extend the sample size question a little further: in a purely hypothetical panel dataset with three waves, where, say, 400 respondents were examined in wave 1, 200 of these respondents were re-examined in wave 2, and 100 of those were re-examined in wave 3, does the above explanation suggest that a random effects logit regression across the three waves would report a sample size of 700, assuming no missing values?
        Correct.

        At the same time, would I be right in assuming that a fixed effects regression would report a sample size of 400 individuals, again assuming no missing values?
        Not necessarily. In fact, I think the largest possible sample size would be 200 because anybody with only one observation is necessarily dropped. So it seems to me that anybody who didn't make it into the second wave would be lost in that analysis. Once you have two observations, you might remain in the sample without being in the third--but it would depend on whether your outcome variable changes across those two observations or not.
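
        To make that concrete, here is a toy sketch with hypothetical data showing who survives the -fe- estimation while -re- keeps everyone:

        Code:
        * Toy data: who contributes to -xtlogit, fe- vs -xtlogit, re-?
        clear
        input id wave y x
        1 1 0 1
        2 1 0 1
        2 2 0 2
        3 1 0 1
        3 2 1 2
        4 1 1 1
        4 2 0 2
        end
        xtset id wave
        xtlogit y x, fe nolog   // id 1 (one wave) and id 2 (y never varies)
                                // are dropped: Number of groups = 2
        xtlogit y x, re nolog   // all 7 observations from all 4 people used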

        And could this difference be attributed to the fact that a fixed effects regression focuses on the individual level, i.e. on at most 400 individual respondents, whereas the random effects regression treats each wave-specific measurement of the DV as a unique data point, thus providing 700 examples of the effect of the IV on the DV?
        I guess. I'm not entirely comfortable with the "focus" language, since the algorithm, and even the software, lack minds. But it is true that the fixed-effects analysis is a within-person change analysis, whereas the random effects model is a sample-wide view of the DV-IV relationship.



        • #5
          Thank you Clyde,

          Apologies for my error; in the fixed effects model the largest sample would of course be 200. I think I understand the relationship much more clearly now; it is always helpful to have someone to discuss these things with. My final question, then, has to do with reporting the sample size once the analysis is complete.


          In my own data, I have a panel dataset across three waves (year 0, year 5, and year 10) which includes the health outcomes and behaviors of respondents. I do the following to prepare for my analysis:


          Code:
          
          * Reshape from wide (e.g. health_y0, health_y5, health_y10) to long form
          reshape long health_y current_county_y binary_health_y bmi_y binbmi_overweight_y binbmi_underweight_y binbmi_obese_y ord_bmi_y own_education_y medical_card_y employment_y binary_employment_y maritalstatus_y binmartatus_y age_y ord_age_y psum_unemployed_total_cont_y, i(id) j(year)
          
          * Declare the panel structure
          xtset id year
          
          * Generate an indicator variable for each wave (yr1, yr2, yr3)
          quietly tab1 year, gen(yr)
          Then I run the following regression:


          Code:
          * Random-effects logit restricted to the gender==0 subsample
          xtlogit binary_health_y psum_unemployed_total_cont_y i.own_educatin_y /*i.ord_age_y*/ i.binmartatus_y i.medical_card_y if gender==0, re nolog
          estimates store random
          estimates table random, star stats(N r2 r2_a)
          
          * Average marginal effect of the continuous unemployment measure
          margins if gender==0, dydx(psum_unemployed_total_cont_y) post
          * Export to Word (-outreg2- is a user-written command from SSC)
          outreg2 using test.doc, word replace ctitle(Marginal effects)

          Which provides the following output


          Code:
          
          . 
          . xtlogit binary_health_y psum_unemployed_total_cont_y i.own_educatin_y /*i.ord_age_y*/ i.binmartatus_y i.medical_card_y if gender==0, re nolog
          
          Random-effects logistic regression              Number of obs     =      1,989
          Group variable: id                              Number of groups  =      1,041
          
          Random effects u_i ~ Gaussian                   Obs per group:
                                                                        min =          1
                                                                        avg =        1.9
                                                                        max =          3
          
          Integration method: mvaghermite                 Integration pts.  =         12
          
                                                          Wald chi2(4)      =      62.44
          Log likelihood  = -1081.9807                    Prob > chi2       =     0.0000
          
          ----------------------------------------------------------------------------------------------
                       binary_health_y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
          -----------------------------+----------------------------------------------------------------
          psum_unemployed_total_cont_y |   .0194575   .0154244     1.26   0.207    -.0107738    .0496888
                                       |
                        own_educatin_y |
                    Secondary or less  |  -.6687584   .1741464    -3.84   0.000    -1.010079   -.3274378
                                       |
                         binmartatus_y |
                              Married  |   .5906219   .1771985     3.33   0.001     .2433193    .9379245
                                       |
                        medical_card_y |
                                  Yes  |  -.7591776     .18805    -4.04   0.000    -1.127749   -.3906063
                                 _cons |   1.246088   .2373629     5.25   0.000     .7808651    1.711311
          -----------------------------+----------------------------------------------------------------
                              /lnsig2u |   .9444658   .2049658                      .5427402    1.346191
          -----------------------------+----------------------------------------------------------------
                               sigma_u |   1.603571   .1643386                       1.31176    1.960296
                                   rho |   .4387143   .0504716                      .3434162    .5387581
          ----------------------------------------------------------------------------------------------
          LR test of rho=0: chibar2(01) = 81.27                  Prob >= chibar2 = 0.000

          Then I use the following to determine how many unique individuals were included in the regression:


          Code:
          egen tag=tag(id)
          count if e(sample)==1 & tag==1
          count if tag==1
          
          Which provides the figure: 1,018

          My question is which figure is normally reported as the "N" in the results section of the analysis: the "Number of obs", the "Number of groups", or the number of unique individuals included in the regression?

          Apologies that this question has grown legs; it is something I have been struggling to understand over the past few days.

          Kindest regards,

          John



          • #6
            Code:
            egen tag=tag(id)
            count if e(sample)==1 & tag==1
            count if tag==1
            
            Which provides the figure: 1,018
            That is wrong. The tagged observation for a given id may not itself be in the estimation sample, due to, say, missing values on a model variable. Consequently, this way of reckoning the number of people in the estimation sample can miss people who are in it, just not in the particular observation that was tagged. If you want to do something like this, restrict the tagging to the estimation sample:
            Code:
            egen tag = tag(id) if e(sample)
            count if tag == 1
            
            // OR,  IF YOU HAVE, OR INSTALL NICK COX'S -distinct.ado-
            // FROM STATA JOURNAL
            distinct id if e(sample)

            But you don't need to do any of that. You can get the number of people directly from the -xtlogit, re- output. There it tells you that the number of observations is 1,989 and the number of groups (and, in your data, a group is a person) is 1,041. It is these two numbers that I would report. I would even say it as "The logistic regression analysis was carried out on a sample of 1,989 observations obtained from 1,041 distinct people."
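
            And if you prefer to pull those numbers programmatically rather than reading them off the log, -xtlogit- stores both in e():

            Code:
            display "Observations: " e(N)        // 1,989 in the output above
            display "Groups (people): " e(N_g)   // 1,041 in the output above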



            • #7
              Clyde,

              Thank you for that. I have always been unsure whether it is acceptable to report the number of groups as the number of persons in an analysis in the context of a random effects logit model in panel data.

              This is due to our discussion above, where a dataset may exist across three waves with 400 respondents examined in wave 1, 200 of these same respondents re-examined in wave 2, and 100 of them re-examined in wave 3. In that case the random effects logit regression across the three waves would report 700 observations, providing there were no missing data, which is a sample size larger than the total of 400 respondents recruited to the dataset.

              Would this present a problem, or is it acceptable to report a sample size in this manner? I apologize if the accepted approach in the literature is painfully clear to you; I'm just afraid of someone jumping up at a conference and asking why the sample size for my regression is much larger than the number of individuals recruited to the dataset!

              Thank you again for your input,

              Kindest regards,

              J



              • #8
                Well, in any longitudinal study, the number of observations will be larger than the number of individuals providing data. What is perhaps unusual about your particular study is that the attrition between waves 1 and 2 is very high, so that in a random effects model, a great deal of the analysis rests on people for whom we have only a single observation.

                Nevertheless, I have been reporting results in the way I suggested for a long time, and I don't think it causes any problems or misunderstanding. I suppose that if I were faced with the kind of severe attrition that you have here, and given that there are only three waves, I might go a bit farther and also report the number of people contributing to the analysis in each wave:

                Code:
                * The time variable created by your reshape is -year- (0, 5, 10)
                distinct id if e(sample) & year == 0
                distinct id if e(sample) & year == 5
                distinct id if e(sample) & year == 10
                But if you are facing word limits, I wouldn't hesitate to omit this addition. Just looking at the 1,989 observations and 1,041 people, it is already obvious that the average number of observations per person is less than 2. Providing counts at each wave makes the attrition a bit more visible and adds some information, but probably nothing essential in a summary report.

