
  • Pooled OLS with year interaction vs. OLS by year

    I'm trying to estimate predicted test scores for students with different disability types, using longitudinal data. I have two options (below): option 1 is to run a regression year by year and use margins to predict the scores for each year; option 2 is to run a single regression in which every covariate is interacted with year. The goal is to plot the predicted scores over time.

    Option 1
    Code:
    foreach i in 2009 2010 2011 2012 2015 2016 2017 2018 2019 {
        reg test_score i.DISAB i.race i.female i.FRPL if year==`i'
        margins i.DISAB
        est store m`i'
    }


    Option 2
    Code:
    reg test_score (i.DISAB i.race i.female i.FRPL)##i.year
    margins i.DISAB#year
    marginsplot,  xdim(year)



    The end results from the two options are very similar (maybe with tiny differences in 2011 & 2012 for Disab=0).

    I'd like to know if the two specifications are equivalent in achieving the goal (i.e., plotting the predicted score over time).

    Given they produce very similar results (although not exactly the same), is one more correct than the other? Is there any violation of the assumptions required for OLS here?

    Thank you!



  • #2
    Assuming that 2009, 2010, 2011, 2012, 2015, 2016, 2017, 2018, 2019 is the complete list of years in the data (no 2013 or 2014? nothing before 2009 or after 2019?), the two model formulations are exactly equivalent. The results should not just be "very similar"; the predicted values should be identical, except possibly for minor rounding errors. (The standard errors, however, will, in general, not agree.) Neither is more correct than the other, but, as I have just said, the question should not arise in the first place.
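
    A possible explanation for the tiny differences reported in #1, offered as a sketch rather than a diagnosis: in the interacted model, -margins i.DISAB#year- averages the predictions over the whole pooled sample for every year, whereas the loop averages only over the observations from that year. If the mix of race, sex, or FRPL status shifts from year to year, the plotted margins can differ slightly even though the coefficients match. Assuming that is the cause, the over() option should reproduce the year-by-year margins from the single model:

    Code:
    reg test_score (i.DISAB i.race i.female i.FRPL)##i.year
    * average within each year's own observations, as the by-year loop does
    margins DISAB, over(year)
    marginsplot, xdimension(year)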

    Is there any violation of the assumptions required for OLS here?
    It depends on your research goal, and also on some aspects of the data. If your goal is to simply create a separate model that is valid for one, and only one year, then this might be a reasonable approach. But that is usually not what people try to do with longitudinal data. Rather, usually one wants to find a more general model that applies, perhaps with slight modification, across many years, or estimate time trends or things like that. These more common goals cannot be accomplished by creating separate, unrelated yearly models. They would require instead the use of some longitudinal regression model.
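
    As one illustration of what such a longitudinal model could look like, here is a minimal sketch using a random intercept for each student. The identifier student_id is hypothetical (the thread never names one), and a random-intercept model is only one of several reasonable choices, not necessarily the right one for these data:

    Code:
    * student_id is an assumed name for the student identifier
    mixed test_score i.DISAB##i.year i.race i.female i.FRPL || student_id:
    margins DISAB#year
    marginsplot, xdimension(year)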

    As for other requirements for OLS, do not let yourself be distracted by non-issues like normality of residuals or heteroskedasticity--much ink is needlessly spilled and time wasted fretting over these things that either do not matter at all or are easily dealt with. The key requirement for OLS, and the one that I least often see people express reservations about, is that the linear relationship of the regression model is a correct specification of the real world data generating process. Whether that is true for these particular variables I wouldn't know. It might require some trial and error to explore different approaches (adding interactions or non-linear terms or using a wholly non-linear model).
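
    For instance, heteroskedasticity can usually be handled with robust standard errors, and a specification question can be probed by adding a term and testing it. The sketch below is only illustrative; the DISAB-by-female interaction is an arbitrary choice for the example, not something suggested by the data:

    Code:
    * robust (Huber/White) standard errors instead of the default ones
    reg test_score i.DISAB##i.female i.race i.FRPL i.year, vce(robust)
    * joint test of the added interaction terms
    testparm i.DISAB#i.female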



    • #3
      Thank you so much, Clyde! This is so helpful.

      The standard errors, however, will, in general, not agree.
      That's exactly the other question I had in mind - which would give you the "correct" standard errors? I think the pooled OLS interacted with time would give you a smaller SE (due to a larger sample size).



      • #4
        which would give you the "correct" standard errors?
        Your intuition that the standard errors are smaller in the interaction model is right. Both are correct in their own way. A standard error is not just an attribute of the data, it is an attribute of the combination of the data and the analysis applied to the data. By definition, it is the standard deviation of the sampling distribution of the coefficient (or whatever statistic is being estimated) when calculated using that analysis. Since the results come from two different analyses (and, actually, in a sense, from two different data sets) you can't expect the standard errors to come out the same.

        So I think it comes back to what I said in the second paragraph of #2. If your goal is to construct separate, unrelated predictive models for each year, then the standard errors from the separate regressions would apply. If your goal is to construct a grand model that applies across years, then the standard errors from the interaction model apply. Also, if your goal is prediction, not modeling of associations or causal modeling, then the standard errors of the coefficients are not directly important. Rather, focus on the standard errors of the predictions that you get with -predict, stdp-.
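
        A minimal sketch of that last step, run after the interacted regression from #1; the new variable names (yhat, se_yhat, lb, ub) are arbitrary:

        Code:
        reg test_score (i.DISAB i.race i.female i.FRPL)##i.year
        * linear prediction and its standard error
        predict yhat, xb
        predict se_yhat, stdp
        * 95% confidence limits for each prediction
        gen lb = yhat - invttail(e(df_r), 0.025)*se_yhat
        gen ub = yhat + invttail(e(df_r), 0.025)*se_yhat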



        • #5
          Thank you for putting this in such an elegant way!!
