
  • Doubt about absorb() in reghdfe

    Hi everyone!

    I am currently working with a rather big dataset (135 million observations) and I need to run a few regressions. I think it is best to explain my question through a reproducible example. My main goal is to obtain the equivalent of the coefficients from the following regression:
    Code:
    clear all
    sysuse auto
    
    reghdfe price i.mpg, noabsorb vce(robust)
    However, if I try this approach on the real data I run out of memory (it would need ~90GB of memory, which I do not have access to). To circumvent this problem, I thought about doing the following:
    Code:
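    levelsof mpg, local(levels)    // store the distinct values of mpg in local levels
    foreach l of local levels {
        gen byte mpg_`l' = (mpg == `l')    // indicator: 1 if mpg equals this level, else 0
    }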
    reghdfe price mpg_14, absorb(mpg_15-mpg_41) vce(robust)
    I then do the same for each level of mpg (excluding mpg_12, the base level). As expected, this approach yields the same coefficients. That said, the computational burden remains significant, and I would like to optimize the code even further. To do so, I tried the following:
    Code:
    gen int mpg_aux = .
    replace mpg_aux = mpg if mpg!=12 & mpg!=14
    
    reghdfe price mpg_14, absorb(mpg_aux) vce(robust)
    This should be significantly faster - and the syntax is also substantially better. However, although this last specification should be equivalent to the former two (as far as I know), it omits the coefficient of interest due to collinearity, thus being useless to me.

    Can any of you explain to me why collinearity is only a problem in this last specification? Is there any way to circumvent this while retaining efficiency?

    Best regards,
    Pedro

  • #2
    By your definition of mpg_aux, mpg_aux is missing whenever mpg == 14, so all observations with mpg == 14 are excluded from the estimation of -reghdfe price mpg_14, absorb(mpg_aux) vce(robust)-. In the observations that -reghdfe- does include, mpg_14 is, by its definition, always zero. Since it is always zero, it is collinear with everything and is therefore omitted.
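
    The point above can be checked directly on the auto data. The sketch below first counts the usable observations with mpg_14 == 1 (there are none, because mpg_aux is missing for all of them). It then tries one possible workaround that is not from this thread: instead of setting mpg_aux to missing, pool the base level and the level of interest into a single absorbed group. The variable name mpg_grp is hypothetical.
    Code:
    clear all
    sysuse auto

    gen byte mpg_14 = (mpg == 14)

    * Original definition: mpg_aux is missing when mpg is 12 or 14
    gen int mpg_aux = mpg if mpg != 12 & mpg != 14

    * Every observation with mpg_14 == 1 has mpg_aux missing, so this reports 0
    count if mpg_14 == 1 & !missing(mpg_aux)

    * Assumed workaround: keep mpg == 12 and mpg == 14 in the sample,
    * sharing one absorbed group (coded 12 here; any single code would do)
    gen int mpg_grp = cond(inlist(mpg, 12, 14), 12, mpg)
    reghdfe price mpg_14, absorb(mpg_grp) vce(robust)
    Under this grouping the coefficient on mpg_14 is the mean of price at mpg == 14 minus the mean at mpg == 12, which should match the 14.mpg coefficient from the i.mpg regression. The standard errors may not match exactly, since -reghdfe- drops singleton groups and may apply different degrees-of-freedom adjustments, so it is worth comparing both on a small sample first.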

    As for the original problem of exceeding available memory, how many different values does your "mpg" variable have? If it's in the millions, you are simply not going to be able to do this regression in Stata. But let me ask you: why would you even want to do a regression with millions of explanatory variables? Even if it ran smoothly, how would you read the hundreds of thousands of pages (screens) of output it would generate? And even if you did that, no human brain would be able to make any sense out of that mass of results. So if that's what's going on, I would suggest you need to rethink your plan.



    • #3
      Thank you very much for the fast response, Clyde!

      By your definition of mpg_aux, mpg_aux is missing whenever mpg == 14, so all observations with mpg == 14 are excluded from the estimation of -reghdfe price mpg_14, absorb(mpg_aux) vce(robust)-. In the observations that -reghdfe- does include, mpg_14 is, by its definition, always zero. Since it is always zero, it is collinear with everything and is therefore omitted.
      I see! I assumed reghdfe would simply not create a category for the missing values, rather than drop those observations from the sample. Thank you very much!

      As for the original problem of exceeding available memory, how many different values does your "mpg" variable have? If it's in the millions, you are simply not going to be able to do this regression in Stata. But let me ask you: why would you even want to do a regression with millions of explanatory variables? Even if it ran smoothly, how would you read the hundreds of thousands of pages (screens) of output it would generate? And even if you did that, no human brain would be able to make any sense out of that mass of results. So if that's what's going on, I would suggest you need to rethink your plan.
      In the original problem, my "mpg" is the hour of each observation, so it takes only 24 values.
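
      With so few levels, one option for the point estimates alone is to skip the regression: in a regression on a full set of indicators for a single categorical variable, each coefficient equals that level's mean of the outcome minus the base level's mean. A minimal sketch on the auto data, assuming only the coefficients are needed (it does not produce the robust standard errors) and using the hypothetical names mean_price and coef:
      Code:
      clear all
      sysuse auto

      * Mean of price within each level of mpg (replaces the data in memory)
      collapse (mean) mean_price = price, by(mpg)

      * The base level is the smallest value of mpg, as with i.mpg
      sort mpg
      gen coef = mean_price - mean_price[1]
      list mpg coef
      The coef column should reproduce the i.mpg coefficients from the first regression, with 0 for the base level.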

      Thank you yet again!
