  • Mixed results between reghdfe and pooled ols with fixed effects like industry, country and year

    Hello everyone

    I am new to this platform

    I am currently working with a panel dataset in Stata 18 and have some questions.

    General info: my data cover 86 firms across 4 countries over 10 years, for 860 observations in total. I ran: reghdfe depvar indvar, absorb(firmid year industry country) cluster(firmid)

    In addition, I ran OLS regressions with industry, country, and year fixed effects.

    I want to compare the OLS and reghdfe results, because OLS gave me a significant relationship between ESG and financial performance, while reghdfe gave me an insignificant one.


    My questions are:
    1) Is the regression specified correctly as reghdfe depvar indvar, absorb(firmid year industry country) cluster(firmid), given my panel data?
    2) Is it correct to call the OLS regression "Pooled OLS", or is it a fixed-effects regression done manually?
    3) For a fair comparison between the models, should I include vce(robust) or vce(cluster firmid) in the OLS regression, or should I run reg Y X plus the control variables and fixed-effect indicators without either option?



    Your advice and expertise would mean a lot as I navigate this step in my analysis! Thank you in advance for your help.

    Best regards,
    Martin

  • #2
    1. Assuming that each firmid operates only in one industry and one country at all times, it is unnecessary to absorb the industry and country variables, as the firmid variable already carries all of their information. On the other hand, it does no harm to have them there. As to the correctness of your -reghdfe- command, in this situation I believe you might need to cluster your vce at the industry level. In general*, you want the cluster variable to be the highest level such that observations are independent across that level but correlated within. That would, in your case, be the country level. But since you only have four countries, you cannot cluster at that level. You don't say how many industries you have, but if it is, say, 20 or more, I would cluster at the industry level, otherwise stick with firmid.

    2. This question is unclear. If you are referring to your -reghdfe- command, no it is not appropriate to call it pooled OLS. If you are referring to some other regression you did, you would have to show the actual command for anyone to answer you. That said, a pooled OLS with indicator variables for the fixed effects is equivalent to a fixed-effects regression--the method of calculating the coefficients differs, but the results are always the same either way.

    3. If your purpose in doing an OLS with indicators model in addition to your -reghdfe- model is to verify that they give the same results, then you have to cluster the vce the same way in both of them. In my response to your question 1) I challenged whether -cluster(firmid)- is appropriate here, but whatever you end up doing in -reghdfe- you should do the same clustering in your other model.

    *This is just a general-purpose rule of thumb that will be correct in the most frequently encountered situations. But, as already noted, the number of entities at each level also plays a role in deciding which level to cluster at. And there are other considerations as well, such as the level at which an intervention is applied. But for purely observational data such as I imagine you have, this rule will apply.


    • #3
      Hi Clyde, and thank you for your response. I appreciate it.

      1) Your assumption is right: each firmid operates in only one industry and one country at all times. If it is unnecessary to absorb the industry and country variables because firmid already carries all of their information, how should the reghdfe command look in my case?

      1.1) I need to control for industry and country. I have 7 unique industries with 9 observations per firm. Firmid in my case is the company name, so you are saying that firmid "already carries all of their information"?

      1.2) Referring to what you said: since I cannot cluster at the country level because I have too few countries, is the reghdfe command not appropriate for my data?
      If not, which regression method is better suited when I need to control for industry, country, and year fixed effects, and why?

      2) My apologies. No, I was not referring to reghdfe there. I want to use OLS as additional testing / a robustness check, and I am not sure whether my thesis should describe this OLS regression with fixed-effect indicators as "Pooled OLS" or as "fixed effects done manually". Here is the command I used for OLS: reg Y X [control variables] i.Industry i.Country i.Year

      3) OK, so just to confirm: I have to use the same fixed effects and the same clustering strategy in both models, which allows a fair comparison of coefficients and standard errors?

      3.1) Does that mean I should also cluster on industry (in addition to firmid) in the OLS regression?

      Thank you again for the response

      Best, Martin
      Last edited by Martin Johanneson; 18 May 2025, 15:58.


      • #4
        1.1 With each firm working in a single industry and a single country, inclusion of the firmid effects (whether as i.firmid or in -absorb()-) automatically adjusts ("controls") for the effects of industry and country and any other attribute of the firm that does not change over time (even if you have not measured it!). So you can just omit country and industry from the list of variables in the regressions and the -absorb()- options. Now, this approach does not give you estimates of the effects at the industry and country level: it is mathematically impossible to get those in any fixed-effects model while including firmid. If you need estimates of the effects at the industry and country level, you will have to use something else. Consider using -xtreg, cre- (if you have the most recent version of Stata) or -xthybrid- (if you don't) for that.

        1.2 The -reghdfe- command is relevant regardless of what level you cluster at. The difference between -reghdfe- and Stata's built-in -xtreg, fe- command, in earlier versions of Stata, is that -reghdfe- allows you to absorb multiple fixed effects, whereas -xtreg, fe- only allowed a single fixed effect to be specified. If you have the current version of Stata, however, even this difference has gone away because -xtreg, fe- now accepts an -absorb()- option as well. There are differences in the internal workings of those commands, and one might be more efficient than the other in very large data sets, but they should always give the same results when given the same model. And your data set is not large enough for any speed differences to be noticeable.

        2. This is a terminological issue and I don't feel qualified to comment on it. Different disciplines may have different ideas about what constitutes a robustness check. To my way of seeing things, as an epidemiologist, -reghdfe dv iv covariates, absorb(firmid year) vce(cluster ind)-, and -regress dv iv covariates, absorb(firmid year) vce(cluster ind)-, and -regress dv iv covariates i.firmid i.year, vce(cluster ind)- are all the exact same model. They differ in the internal workings of the calculations but will always produce the same results. So I wouldn't consider this a robustness check. To me a robustness check means using an altogether different approach, often one that will not produce the same results, and demonstrating that you nevertheless get results that are close to the original results. To me a robustness check is another approach that relies on different, but plausible, underlying assumptions from your original analysis. But that may be viewed differently in your discipline. Perhaps somebody from econometrics on this Forum will jump in and offer an opinion on this. If not, I suggest you consult with a colleague or supervisor in your own environment who has more experience than you and can probably provide a quick, simple answer.

        3. Yes, a "fair comparison" of two models would require using the same fixed effects and the same clustering strategy in both models.
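
        The points in 1.1 and 2 can be checked on simulated data. The following do-file is a minimal sketch, not the poster's data: the variable names (y, x, alpha), the seed, and the data-generating process are all assumptions made for illustration. It uses only built-in commands; if the community-contributed -reghdfe- is installed, -reghdfe y x, absorb(firmid year) vce(cluster firmid)- should reproduce the same coefficient on x.

        ```stata
        * Simulated balanced panel: 86 firms x 10 years (hypothetical data)
        clear
        set seed 12345
        set obs 86
        generate firmid   = _n
        * industry and country are fixed within firm by construction
        generate industry = mod(firmid, 7) + 1
        generate country  = mod(firmid, 4) + 1
        * unobserved, time-invariant firm effect
        generate alpha    = rnormal()
        expand 10
        bysort firmid: generate year = 2013 + _n
        generate x = 0.5*alpha + rnormal()
        generate y = 1 + 0.3*x + alpha + 0.1*(year - 2013) + rnormal()
        xtset firmid year

        * (a) Firm and year effects via the within estimator
        xtreg y x i.year, fe vce(cluster firmid)
        scalar b_fe = _b[x]

        * (b) The same model with explicit firm indicators
        regress y x i.firmid i.year, vce(cluster firmid)
        scalar b_lsdv = _b[x]

        * (c) Adding industry and country: the redundant indicators are
        *     collinear with the firm indicators and are flagged as omitted
        regress y x i.firmid i.industry i.country i.year, vce(cluster firmid)
        scalar b_all = _b[x]

        * The coefficient on x is the same in (a), (b), and (c)
        assert reldif(b_fe, b_lsdv) < 1e-6
        assert reldif(b_fe, b_all)  < 1e-6
        ```

        The point estimates should agree across the three commands. The clustered standard errors can differ slightly between (a) and (b), because the two commands apply different degrees-of-freedom corrections for the firm effects, so a comparison of standard errors is cleanest when the same command family is used for both models.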


        • #5
          Aha, I see

          Thank you very much, Clyde, your input is very valuable. Yeah, do you perhaps know someone with an econometrics background who can jump into the conversation?

          Since it is imperative to control for Industry and Country in my case, does that mean that Pooled OLS with fixed variables and clustering with firmid should be my main model?

          Best regards,
          Martin




          • #6
            Yeah, do you perhaps know someone with an econometrics background who can jump into the conversation?
            For the most part, Statalist works best when people don't address their questions to specific people. There are numerous economists/econometricians who follow the list regularly. The title of your post is sufficiently informative that they will likely take a look to see what this is all about. I'd say the chances are excellent that one of them will choose to respond. It is, of course, a weekend now, so this process may be delayed for a day, but it is likely to happen.

            Since it is imperative to control for Industry and Country in my case, does that mean that Pooled OLS with fixed variables and clustering with firmid should be my main model?
            If by Pooled OLS with fixed variables and clustering with firmid you mean -regress DV IV covariates i.year i.firmid i.industry i.country, vce(cluster firmid)- or -regress DV IV covariates, absorb(year firmid industry country) vce(cluster firmid)-, the answer is no. The industry and country variables will be dropped by these commands just as they would by -xtreg, fe- or -reghdfe-. There is no substantive difference between using -regress ... i.firmid ...- and using -reghdfe ..., absorb(firmid) ...-. Both of them are fixed effects models and both of them are estimating the same model. They differ in the particular calculations used to get the results, but they always produce the same results (with the possible exception of rounding errors in the very far decimal places).

            Moreover, these models that include firmid, do adjust for industry and country, even though industry and country are not explicitly included in the model. They do not give explicit estimates for effects at the industry and country level, but those are not needed in order to adjust for industry and country level effects. You need to clarify in your mind whether your concern is adjusting ("controlling", to use your word, though I think it inappropriate to use when speaking of observational data) for effects of industry and country, or whether you actually want to estimate industry and country level effects. If what you need is to adjust for them, then all you need in the model is the firm-level fixed effects--the adjustment for industry and country comes along automatically and "for free." If what you need are estimates of industry and country level effects, then you cannot get them from any fixed effects model (unless you omit the firm level effects--which strikes me as a very dubious thing to do). In #4, in response to question 1.1 of #3, I mentioned two Stata commands for hybrid fixed and random effects models that will enable you to get estimates of industry and country level effects while still adjusting for firm level effects.

            Finally, I think it would be helpful to also clarify some terminology. A model is an equation or set of equations describing relationships among variables. The kinds of models that we see here usually involve some parameters, which might, for example, appear as coefficients in equations, whose values we are interested in estimating. You have a model that looks like DV_it = constant + b1*IV_it + b_i*firm_i + b_t*year_t + error_it, where i represents firm and t represents time. Separate from the model is the question of the computational method used to calculate estimates of b1, the b_i's and the b_t's (and perhaps additional parameters that are sometimes of interest such as the error variance and intraclass correlation). OLS is one computational method. The -xtreg, fe- command uses a different, although closely related, computational method. And the -reghdfe- command uses yet another computational method. But all three computational methods are applied to the same model. A truly different model would include different variables (adding some you don't have or omitting some you do, or both), or might impose different constraints on the error distribution, or might transform some of the variables (logarithms, quadratic terms, or lots of other possibilities). The commands -xtreg, cre- and -xthybrid- I have referred to, by contrast, would estimate different models, models in which the between-firm effects of industry and country would be mathematically possible to estimate. (The models estimated by -xtreg, cre- and -xthybrid- are closely related but slightly different. -xthybrid- also offers a -cre- option which will cause it to estimate the same model as -xtreg, cre-.)


            • #7
              Aha, okay. Thank you very much. This forum is new to me, so I was not sure how this works

              I ran the regression as reg Y X + controlling variables i.Year, i.Industry i.Country, vce (cluster Firmid), and it worked. Year, Industry, and Country were not omitted; I got coefficients for each year, each industry, and each country.
              Maybe I am using the term "Pooled OLS" wrong, and it should simply be called OLS?
              (I did not use the absorb() option this time, as I want to control for Industry and Country.)

              Hope this clarifies some things

              Best,
              Martin


              • #8
                I ran the regression as reg Y X + controlling variables i.Year, i.Industry i.Country, vce (cluster Firmid), and it worked. Year, Industry, and Country were not omitted; I got coefficients for each year, each industry, and each country.
                Yes, that would happen with that model. BUT you have not adequately accounted for firm-level effects with this model. You have clustered the errors at the firmid level, which is better than nothing, but you have not introduced firm-level effects into the model. It is the absence of the firm-level effects that has enabled you to estimate industry and country effects--but those are, for most purposes, far less important than having firm-level fixed effects in the model.

                And I feel like I have not succeeded in getting across the message about adjusting ("controlling") for country and industry effects because on the one hand you say you "wanted to control for" them, but you instead created a strange model that enables you to estimate their effects at the expense of almost completely ignoring firm-level effects. I don't know how to say it more clearly, so I'll just say it again. -reghdfe Y X "controlling variables", absorb(Firmid Year)- does adjust ("control") for country and industry effects. And it also fully accounts for firm-level effects, which your new model does not do.

                This model that you have used, -reg Y X + controlling variables i.Year, i.Industry i.Country, vce (cluster Firmid)-, is, in fact, a "pooled OLS" regression. It is pooled precisely because you do not include firm-level effects despite having firm-year panel data. It will produce different results from the other models we have been talking about in this thread (unless the firm effects actually don't matter at all).
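
                A minimal sketch of why the two approaches can disagree, using simulated data. Everything here is an assumption made for illustration: the variable names, the seed, and in particular the design choice that the unobserved firm effect (alpha) is correlated with the regressor x. The true coefficient on x in this simulation is 0.3.

                ```stata
                * Simulated balanced panel: 86 firms x 10 years (hypothetical data)
                clear
                set seed 12345
                set obs 86
                generate firmid   = _n
                generate industry = mod(firmid, 7) + 1
                generate country  = mod(firmid, 4) + 1
                * unobserved firm effect, correlated with x below
                generate alpha    = rnormal()
                expand 10
                bysort firmid: generate year = 2013 + _n
                generate x = 0.5*alpha + rnormal()
                generate y = 1 + 0.3*x + alpha + 0.1*(year - 2013) + rnormal()
                xtset firmid year

                * Pooled OLS: industry, country, and year indicators, no firm effects
                regress y x i.industry i.country i.year, vce(cluster firmid)
                scalar b_pooled = _b[x]

                * Fixed effects: firm and year effects
                xtreg y x i.year, fe vce(cluster firmid)
                scalar b_fe = _b[x]

                display "pooled OLS: " b_pooled "   firm FE: " b_fe
                ```

                In this design the pooled estimate should come out noticeably larger than the fixed-effects estimate, because x picks up part of the omitted firm effect. Real data need not behave this way; the size and sign of the gap depend on how the firm effects relate to the regressors.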


                • #9
                  Originally posted by Clyde Schechter:

                  And I feel like I have not succeeded in getting across the message about adjusting ("controlling") for country and industry effects because on the one hand you say you "wanted to control for" them, but you instead created a strange model that enables you to estimate their effects at the expense of almost completely ignoring firm-level effects. I don't know how to say it more clearly, so I'll just say it again. -reghdfe Y X "controlling variables", absorb(Firmid Year)- does adjust ("control") for country and industry effects. And it also fully accounts for firm-level effects, which your new model does not do.

                  Wow, I am so sorry, I did not get that until just now. Let me break it down to make sure I got it right:

                  1) Since the fe -approach does account for Industry and Country as it turns out, does that mean I can choose whether I use the xtreg, fe or the reghdfe?
                  - 1.1) The methodology is different between xtreg fe and reghdfe, but the results from both approaches will still be the same regardless, correct?

                  I was just afraid that the fe command did not take industry and country into account, because according to the literature I need to "adjust" for them. I realize now that the estimates/coefficients for industry and country are not relevant to present in my study. I just wanted to show the reader that industry and country have been adjusted for.
                  Do you see my point?



                  2) Since I have firm-year panel data, is it important to adjust for firm-level effects as well? Is that why we cluster on firmid?
                  If not, how do we actually adjust for firm-level effects: by running the fe command?




                  • #10
                    Hi again
                    In regard to what you said regarding firm level, in the Pooled OLS, I do control for firm level. I have control variables such as Firm Size defined as the natural logarithm of total assets, Leverage (defined as the ratio of total debt to total assets), ROA (Net Income/ lagged total assets), Market to Book ratio, Sales Growth, Cash flow from operations (CFO) etc.

                    When I ran the -reg Y X + controlling variables i.Year, i.Industry i.Country, vce (cluster Firmid) - the controlling variables include the variables I have mentioned above, in addition to Board Size, Board Diversity, and GDP growth

                    Does that mean the OLS method in that sense should be the main model, or should fe /reghdfe be the main model?

                    Looking forward to hearing from you

                    Best,
                    Martin


                    • #11
                      1) Since the fe -approach does account for Industry and Country as it turns out, does that mean I can choose whether I use the xtreg, fe or the reghdfe?
                      - 1.1) The methodology is different between xtreg fe and reghdfe, but the results from both approaches will still be the same regardless, correct?
                      Correct.

                      2) Since I have firm-year panel data, is it important to adjust for firm-level effects as well? Is that why we cluster on firmid?
                      If not, how do we actually adjust for firm-level effects: by running the fe command?
                      Yes, you should adjust for firm-level effects as well. The -xtreg, fe- command does this. The easy way to see it is to remember that -xtreg DV IV ..., fe ...- after -xtset firmid year- is exactly equivalent to -regress DV IV ... i.firmid, ...-, where the inclusion of firmid is shown explicitly. The two commands use different formulas but produce the same results. In the case of -xtreg, fe-, the calculation first computes the mean of every variable within the group of observations belonging to the same firm, then subtracts those means from the actual values and performs an ordinary regression on the differences. Removing the firm means in this way removes the influence of firmid from the equations.
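
                      The de-meaning calculation described above can be reproduced by hand. This is a minimal sketch on simulated data; the variable names, seed, and data-generating process are assumptions for illustration, and the model deliberately has a single regressor and no year effects to keep the arithmetic visible.

                      ```stata
                      * Simulated balanced panel: 86 firms x 10 years (hypothetical data)
                      clear
                      set seed 12345
                      set obs 86
                      generate firmid   = _n
                      generate industry = mod(firmid, 7) + 1
                      generate alpha    = rnormal()
                      expand 10
                      bysort firmid: generate year = 2013 + _n
                      generate x = 0.5*alpha + rnormal()
                      generate y = 1 + 0.3*x + alpha + rnormal()
                      xtset firmid year

                      * Built-in within estimator
                      xtreg y x, fe
                      scalar b_fe = _b[x]

                      * Same calculation by hand: subtract firm means, then run OLS
                      egen double ybar = mean(y), by(firmid)
                      egen double xbar = mean(x), by(firmid)
                      generate double y_w = y - ybar
                      generate double x_w = x - xbar
                      regress y_w x_w, noconstant
                      scalar b_within = _b[x_w]
                      assert reldif(b_fe, b_within) < 1e-6

                      * A firm-constant variable equals its own firm mean, so
                      * de-meaning turns it into a column of zeros
                      egen double indbar = mean(industry), by(firmid)
                      assert abs(industry - indbar) < 1e-6
                      ```

                      The last two lines show why industry and country need not appear in the model: after the firm means are removed, nothing is left of any variable that is constant within firm. The standard errors from the hand-rolled regression are not directly usable, since -regress- does not know that 86 firm means were estimated.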

                      In regard to what you said regarding firm level, in the Pooled OLS, I do control for firm level. I have control variables such as Firm Size defined as the natural logarithm of total assets, Leverage (defined as the ratio of total debt to total assets), ROA (Net Income/ lagged total assets), Market to Book ratio, Sales Growth, Cash flow from operations (CFO) etc.
                      When I ran the -reg Y X + controlling variables i.Year, i.Industry i.Country, vce (cluster Firmid) - the controlling variables include the variables I have mentioned above, in addition to Board Size, Board Diversity, and GDP growth
                      That is adjusting for selected firm-level attributes, but it does not fully adjust for firm-level effects. There almost certainly are other aspects of a firm that influence the IV, DV, and their relationship and those are not accounted for in the pooled OLS. Those variables may account for most of it, but it is still incomplete. It may be that in this particular situation, as a practical matter, those variables account for so large a part of the firm-level influences that the incompleteness is too small to worry about. But, in principle, this approach does not fully adjust for firm-level influences.

                      Does that mean the OLS method in that sense should be the main model, or should fe /reghdfe be the main model?
                      I would use fe or reghdfe as the main model. If you feel there is some additional value to the pooled OLS model that includes the numerous selected firm-level variables, then including it as a secondary analysis would make sense. That's not so much a statistical question as an economic/pedagogic one that I'm not qualified to pass judgment on.


                      • #12
                        1) Excellent

                        2) Okay, then it seems like xtreg, fe is the way to go. Thank you once again for explaining in detail how time-invariant factors like industry and country work.

                        3) Perfect, then I will add the OLS as a secondary test. The goal is to compare different methods to see whether results are robust or not

                        Follow-up questions in regards to xtreg, fe

                        Descriptive statistics and correlation matrix
                        With descriptive statistics and a correlation matrix, do I have to:
                        1) run the fixed-effects regression (xtreg, fe) first,
                        2) and then produce the descriptive statistics with vce(cluster firmid) or vce(robust)?
                        Or is it okay to just run the summarize command on X, Y, and the control variables, without clustered or robust standard errors?


                        3) For the correlation matrix, do I have to add vce(cluster firmid) or vce(robust), or is that not necessary?


                        VCE clustered for firm or Vce robust
                        For the fixed effect model, which vce should I apply to adjust for heteroskedasticity and autocorrelation?
                        Vce (cluster firmid) or vce rob for robust standard errors?




                        Thank you in advance

                        Best,
                        Martin


                        • #13
                          The vce() options refer to the modifications that are made to the calculation of standard errors. Those are inferential, not descriptive, statistics. They are never to be used with descriptive statistics. (Moreover, the commands that calculate descriptive statistics in Stata don't even have any -vce()- option, because it would be inappropriate.)

                          For descriptive statistics, you should rely on commands like -summarize- or -tabstat- or -tab-. These are typically done on the full data sample. Now, if you have observations in the data set that include missing values, those observations do not participate in the regression calculations. So the sample that was used for the regression may not be the full sample. You can get descriptive statistics for the actual regression sample by running -estat summarize- immediately following the regression (xtreg, reghdfe) command. If they are appreciably different from the descriptive statistics for the entire sample, you should report that fact and have an additional table showing the descriptive statistics for the sample that actually participated in the regression calculations.

                          As for the correlation table, what do you plan to do with that? Are you writing this up for a thesis/dissertation? Or do you intend to publish it in a journal? Or are you just making slides for a presentation? In any case, vce(cluster ) and vce(robust) are not available for the correlation command and would not be appropriate.

                          For the fixed effect model, which vce should I apply to adjust for heteroskedasticity and autocorrelation?
                          Vce (cluster firmid) or vce rob for robust standard errors?
                          I would use vce(cluster firmid). It is likely that there is non-independence of error terms within firms, and simple robust standard errors do not correct for that.
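
                          The workflow above can be sketched as follows. The data are simulated and every variable name is an assumption for illustration; 20 values of x are set to missing on purpose so that the regression sample is smaller than the full sample.

                          ```stata
                          * Simulated balanced panel: 86 firms x 10 years (hypothetical data)
                          clear
                          set seed 12345
                          set obs 86
                          generate firmid = _n
                          generate alpha  = rnormal()
                          expand 10
                          bysort firmid: generate year = 2013 + _n
                          generate x = 0.5*alpha + rnormal()
                          generate y = 1 + 0.3*x + alpha + 0.1*(year - 2013) + rnormal()
                          * introduce 20 missing values so the samples differ
                          replace x = . if year == 2014 & firmid <= 20
                          xtset firmid year

                          * Inferential step: this is where vce() belongs
                          xtreg y x i.year, fe vce(cluster firmid)
                          assert e(N) == 840

                          * Descriptive statistics: full sample, then regression sample
                          summarize y x
                          summarize y x if e(sample)
                          tabstat y x if e(sample), statistics(n mean sd min max) columns(statistics)

                          * Correlation matrix on the regression sample (no vce() option exists)
                          correlate y x if e(sample)
                          ```

                          Restricting with -if e(sample)- is an alternative to -estat summarize- that also works with -tabstat- and -correlate-, which makes it convenient when one table of descriptives and one correlation matrix must describe exactly the observations used in the regression.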


                          • #14
                            There's a way to have the best of both worlds. It's called the Mundlak approach, correlated random effects, or a hybrid approach. To account for firm fixed effects, it's enough to include the firm-specific time averages of each time-varying variable. Then you can include industry and country fixed effects to see if the coefficients mean something to you. There's a Stata command, but I just do it by hand. Let xj be the time-varying explanatory variables and zj the time-constant variables.

                            Code:
                            egen x1bar = mean(x1), by(firmid)
                            egen x2bar = mean(x2), by(firmid)
                            ...
                            egen xKbar = mean(xK), by(firmid)
                            reg y x1 ... xK x1bar ... xKbar i.industry i.country i.year, vce(cluster firmid)
                            The coefficients on x1 ... xK will be the same as using firm fixed effects. You get coefficients on time-constant variables, including industry and country dummies.

                            The clustering issue is a tricky one. As Clyde says, cluster at least at the firm level -- which is a very different question from the estimation problem.

                            In my recent paper in Empirical Economics with Leslie Papke (2023), we show that a joint test on x1bar, ..., xKbar is a test of whether firm fixed effects are needed. If they're significant, then they should be left in, and you are using firm FEs.
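
                            A runnable version of the by-hand approach and the joint test, as a minimal sketch. The data are simulated with two time-varying regressors; the variable names, seed, and coefficients are assumptions for illustration only. The panel is balanced, which is the setting in which the coefficients on x1 and x2 should match the firm fixed-effects estimates.

                            ```stata
                            * Simulated balanced panel: 86 firms x 10 years (hypothetical data)
                            clear
                            set seed 12345
                            set obs 86
                            generate firmid   = _n
                            generate industry = mod(firmid, 7) + 1
                            generate country  = mod(firmid, 4) + 1
                            generate alpha    = rnormal()
                            expand 10
                            bysort firmid: generate year = 2013 + _n
                            generate x1 = 0.5*alpha + rnormal()
                            generate x2 = -0.3*alpha + rnormal()
                            generate y  = 1 + 0.3*x1 - 0.2*x2 + alpha + 0.1*(year - 2013) + rnormal()
                            xtset firmid year

                            * Firm-specific time averages of the time-varying regressors
                            egen double x1bar = mean(x1), by(firmid)
                            egen double x2bar = mean(x2), by(firmid)

                            * Mundlak / correlated random effects regression by pooled OLS
                            regress y x1 x2 x1bar x2bar i.industry i.country i.year, vce(cluster firmid)
                            scalar b1_cre = _b[x1]
                            scalar b2_cre = _b[x2]

                            * Joint test on the time averages
                            test x1bar x2bar

                            * Compare with firm and year fixed effects
                            xtreg y x1 x2 i.year, fe vce(cluster firmid)
                            assert reldif(b1_cre, _b[x1]) < 1e-6
                            assert reldif(b2_cre, _b[x2]) < 1e-6
                            ```

                            Unlike the fixed-effects regression, the pooled regression also reports coefficients on the industry and country indicators. With an unbalanced panel the time averages would need to be computed over the estimation sample for the comparison to hold.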


                            • #15
                              Okay, no vce rob nor vce cluster firm id as it is not allowed and should never be used with descriptive statistics. Perfect, thank you!

                              Originally posted by Clyde Schechter: "As for the correlation table, what do you plan to do with that? Are you writing this up for a thesis/dissertation? Or do you intend to publish it in a journal? Or are you just making slides for a presentation? In any case, vce(cluster ) and vce(robust) are not available for the correlation command and would not be appropriate."

                              I am writing a master's thesis, and I have seen that the existing literature and previous master's theses include a correlation matrix and briefly describe it.
                              Do you think it is not necessary to report it? I have also read and heard that we don't need to report VIFs, but only mention that a VIF test was used to check whether the control variables suffer from multicollinearity. Is that true as well?