  • Unevenly spaced panel and time fixed effects

    Hi Stata users.

    I have gathered data for the years 1960, 1970, 1980, 1990, 1995 and 2005. I have two questions about this:

    1) Since the time points are unevenly spaced, would this be a problem in panel data analysis?

    2) When I include i.year in my fe model, the time dummies are highly statistically significant, but my main predictors lose significance.
    But if I use a log-log specification, the time dummies are no longer significant. In both cases testparm i.year shows that they are jointly significant.

    What should I do?

    Thanks

  • #2
    1) Since the time points are unevenly spaced, would this be a problem in panel data analysis?
    That depends on what specific analysis you want to do. If you wanted to fit models with autoregressive correlation structure, you won't be able to do that with irregularly spaced data. And analyses involving lags and leads of variables will be problematic as well. But for most other kinds of panel data analysis there will be no problem.
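
    As a rough illustration of the point about lags (a sketch only; id, year, y and x are placeholder variable names, not taken from the original post): declaring the panel with the calendar year as the time variable is enough for fixed-effects estimation, while the lag operator looks for the observation one year earlier, which does not exist in data spaced 5 or 10 years apart.

    ```stata
    * Assumed variable names: id (panel unit), year (1960, 1970, ..., 2005), y, x
    xtset id year

    * Fixed-effects estimation does not rely on equal spacing of the time points
    xtreg y x i.year, fe vce(cluster id)

    * L.x refers to x at year - 1, which is never observed here,
    * so the lagged variable is missing for every observation
    generate x_lag = L.x
    count if missing(x_lag)
    ```

    If the count equals the number of observations, any model that needs L.x would have no usable data, which is the kind of problem described above.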

    2) When I include i.year in my fe model, the time dummies are highly statistically significant, but my main predictors lose significance.
    But if I use a log-log specification, the time dummies are no longer significant. In both cases testparm i.year shows that they are jointly significant.
    Unless you are specifically including the time dummies for the purposes of testing hypotheses about shocks occurring in specific years or some other strictly-time related hypothesis, you should ignore the statistical significance of the time variables--it is meaningless. If, as is usually the case, the time-variables are in there just for the purpose of adjusting for time-dependent shocks to the outcome, you shouldn't even waste time looking at their p-values, jointly or individually.

    As for what is happening with your main variables, the issue is which model is the better model for the data. Two things figure into that. The first is whether the science and theory in your discipline have established that the relationship is directly linear or, alternatively, that the log-log relationship is directly linear. If there is a good theoretical reason to believe one or the other (without reference to your data and results), then you go with that. If there is nothing of that nature to go by, then you need to see which model is a better fit to your data. Statistical significance has nothing to do with that and, for purposes of model selection, you should not even be looking at p-values. The first approach would be graphical. It may be plainly obvious whether the relationship is linear, or log-log, or something altogether different. If it is not obvious, and if we are talking about OLS regression, you might want to pick the model with the better adjusted R2. For other types of regressions, you might look at other measures of fit (e.g., for logistic models, which one has the better Hosmer-Lemeshow statistic?).


    • #3
      Dear Clyde,

      Thank you for your answer and advice. Since I am pretty inexperienced with Stata, I have read a lot of your posts here and have learned quite a lot.

      So, should I include time dummies in my model to adjust for time, regardless of their p-values and the conclusion based on testparm i.year? I am not specifically testing hypotheses about shocks in specific years.

      There aren't many econometric studies in my field of research, so I am trying to find the best model for my data.

      I have tested pooled, fe and re models.
      The F test suggests that the fe model is better, and the LM test that the pooled model is preferable to the re model. The Hausman test was invalid, but xtoverid suggested using the fe model.
      So in the end I concluded that the fe model is best for my data.

      What graphical methods and measures of fit should I look at in order to choose between two fe models (one with a linear specification and the other with a log-log specification)?

      Thanks once again for your insightful remarks.


      • #4
        So, should I include time dummies in my model to adjust for time, regardless of their p-values and the conclusion based on testparm i.year? I am not specifically testing hypotheses about shocks in specific years.
        Probably so. Certainly, the p-values and the results of -testparm- have no bearing whatsoever on this question. If your outcome is one that is subject to substantial time-specific shocks, then you will want to include time indicators to adjust for that. Most economic variables are subject to such shocks. But if your variable is one that tends to be very stable over time, then including them would be unnecessary.
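
        A minimal sketch of the two specifications being discussed, with placeholder names (id, year, y, x1 and x2 are assumptions, not the poster's variables). The joint test appears only because it came up in the thread; as noted above, its result is not what decides whether the indicators belong in the model.

        ```stata
        xtset id year

        * Without time indicators
        xtreg y x1 x2, fe vce(cluster id)
        estimates store no_time

        * With time indicators to absorb year-specific shocks
        xtreg y x1 x2 i.year, fe vce(cluster id)
        estimates store with_time

        * Joint Wald test of all year indicators
        testparm i.year

        * Coefficients and standard errors of both fits side by side
        estimates table no_time with_time, se
        ```

        Comparing the stored estimates shows how much the coefficients on the main predictors move once year-specific shocks are adjusted for.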

        What graphical methods and measures of fit should I look at in order to choose between two fe models (one with a linear specification and the other with a log-log specification)?
        The simplest and first approach would be to do a scatterplot of your outcome and predictor variables. Then re-do the scatterplot using log scales on each axis separately, and then on both axes. Compare the graphs to see which looks most like a linear fit. (Note: if the ranges of the x and y variables are narrow, you won't see much difference, and it also means that there isn't much difference between the models. But if either variable has a wide range the differences among these graphs should be quite evident.)

        The other issue that arises is that the relationship can be obscured by confounding variables. So if your scatterplots above don't show much or look uninterpretable, you can do both the y vs x and log y vs log x regressions and then look at scatterplots of residuals vs fitted (-help rvfplot-)
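
        The two steps above might be sketched as follows. The names id, year, y and x are placeholders, and both variables are assumed to be strictly positive so that log scales and logs are defined. To my knowledge -rvfplot- is documented as a postestimation command for -regress-, so for a fixed-effects fit the sketch builds the residual-versus-fitted plot by hand from -predict-; treat that substitution as an assumption rather than the only way to do it.

        ```stata
        * Step 1: raw scatterplots on linear and log axes
        scatter y x, name(lin_lin, replace)
        scatter y x, xscale(log) name(lin_log, replace)
        scatter y x, yscale(log) name(log_lin, replace)
        scatter y x, xscale(log) yscale(log) name(log_log, replace)
        graph combine lin_lin lin_log log_lin log_log

        * Step 2: residuals vs fitted values for the linear fe model
        xtset id year
        xtreg y x, fe
        predict double fit_lin, xbu
        predict double res_lin, e
        scatter res_lin fit_lin, yline(0) name(rvf_lin, replace)

        * Step 2 again for the log-log fe model
        generate double ln_y = ln(y)
        generate double ln_x = ln(x)
        xtreg ln_y ln_x, fe
        predict double fit_log, xbu
        predict double res_log, e
        scatter res_log fit_log, yline(0) name(rvf_log, replace)
        ```

        A residual plot with no visible curvature or fanning around the zero line points toward the better-behaved specification.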


        • #5
          Dear Clyde,

          I analyzed the scatterplots, and the log-log model appears to be the right one.

          Below are my fe models without and with time dummies.


          Code:
          . xtreg loggedTotaltra loggedx1 loggedx4 loggedx5 loggedx14 loggedx19, fe vce(robust)
          
          Fixed-effects (within) regression               Number of obs      =       105
          Group variable: id                              Number of groups   =        26
          
          R-sq:  within  = 0.5479                         Obs per group: min =         1
                 between = 0.8620                                        avg =       4.0
                 overall = 0.7516                                        max =         6
          
                                                          F(5,25)            =     18.03
          corr(u_i, Xb)  = -0.8469                        Prob > F           =    0.0000
          
                                              (Std. Err. adjusted for 26 clusters in id)
          ------------------------------------------------------------------------------
                       |               Robust
          loggedTota~a |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
          -------------+----------------------------------------------------------------
              loggedx1 |  -.4356533   .1488156    -2.93   0.007    -.7421447    -.129162
              loggedx4 |  -.0841455   .0387963    -2.17   0.040    -.1640479   -.0042431
              loggedx5 |   .1589569   .1405931     1.13   0.269    -.1306001    .4485138
             loggedx14 |  -.0431455   .0241568    -1.79   0.086    -.0928975    .0066064
             loggedx19 |   .7000899   .1811639     3.86   0.001     .3269759    1.073204
                 _cons |   6.158521   1.705606     3.61   0.001      2.64576    9.671283
          -------------+----------------------------------------------------------------
               sigma_u |  .48036987
               sigma_e |  .16317569
                   rho |  .89654927   (fraction of variance due to u_i)
          ------------------------------------------------------------------------------
          
          . xtreg loggedTotaltra i.Year loggedx1 loggedx4 loggedx5 loggedx14 loggedx19, fe vce(robust)
          
          Fixed-effects (within) regression               Number of obs      =       105
          Group variable: id                              Number of groups   =        26
          
          R-sq:  within  = 0.7645                         Obs per group: min =         1
                 between = 0.8768                                        avg =       4.0
                 overall = 0.8243                                        max =         6
          
                                                          F(10,25)           =    107.05
          corr(u_i, Xb)  = -0.0773                        Prob > F           =    0.0000
          
                                              (Std. Err. adjusted for 26 clusters in id)
          ------------------------------------------------------------------------------
                       |               Robust
          loggedTota~a |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
          -------------+----------------------------------------------------------------
                  Year |
                 1970  |   .0895917   .0656703     1.36   0.185    -.0456589    .2248423
                 1980  |   .0505849   .1247495     0.41   0.689    -.2063415    .3075114
                 1990  |   .1018121   .1545087     0.66   0.516    -.2164044    .4200287
                 1995  |  -.1656287   .1599647    -1.04   0.310    -.4950822    .1638248
                 2005  |  -.2077271   .1859161    -1.12   0.274    -.5906285    .1751743
                       |
              loggedx1 |  -.1757685   .1086577    -1.62   0.118    -.3995534    .0480163
              loggedx4 |   .0394131   .0615269     0.64   0.528     -.087304    .1661302
              loggedx5 |   .0576127    .119531     0.48   0.634     -.188566    .3037914
             loggedx14 |  -.0170065    .024242    -0.70   0.489    -.0669338    .0329207
             loggedx19 |   .5558698   .1676673     3.32   0.003     .2105526     .901187
                 _cons |   5.487333   1.540105     3.56   0.002     2.315427     8.65924
          -------------+----------------------------------------------------------------
               sigma_u |  .19444681
               sigma_e |  .12197195
                   rho |  .71762985   (fraction of variance due to u_i)
          ------------------------------------------------------------------------------
          
          .

          There are a couple of things I'm confused about, though.

          1) How should time dummies be interpreted? Would there be a different model for every year (with different intercepts)?

          2) If I'm not mistaken, many experienced users on this forum consider the p-value not to be very important when deciding which variables to include in or exclude from a model. But from reading some articles and books I was under the impression that the p-value should be above 1.96 for a predictor to be significant. I am a little confused now about which predictors I should keep in my model. As you can see, a couple of them now have a p-value below the threshold. On what criteria should including or excluding predictors be based?


          Thank you once again.




          • #6
            1) How should time dummies be interpreted? Would there be a different model for every year (with different intercepts)?
            The time indicators ("dummies") should be interpreted the same way they would be in any other regression model. They represent shocks to the outcome in the corresponding years. If you wish to think of them as providing a model with a different intercept in each year, that would be a valid interpretation as well.
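
            Written out (the notation here is an assumption for illustration, not from the thread), with unit effects \alpha_i, year effects \gamma_t and 1960 as the omitted base year, the second model in #5 has the form:

            ```latex
            \log y_{it} = \alpha_i + \gamma_t + \sum_{k} \beta_k \log x_{k,it} + \varepsilon_{it},
            \qquad t \in \{1960, 1970, 1980, 1990, 1995, 2005\}, \quad \gamma_{1960} = 0
            ```

            The intercept for unit i in year t is \alpha_i + \gamma_t, while the slopes \beta_k are common to all years. Each coefficient reported under Year is a \gamma_t, that is, the shift in the outcome in that year relative to 1960.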

            2) If I'm not mistaken, many experienced users on this forum consider the p-value not to be very important when deciding which variables to include in or exclude from a model. But from reading some articles and books I was under the impression that the p-value should be above 1.96 for a predictor to be significant. I am a little confused now about which predictors I should keep in my model. As you can see, a couple of them now have a p-value below the threshold. On what criteria should including or excluding predictors be based?
            p-values are always between 0 and 1; they can never be 1.96 or above. I think the 1.96 figure you are thinking of is a z-statistic, which corresponds to a two-sided p-value of 0.05, a value that is commonly used as a threshold for statistical significance.
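
            The correspondence can be checked with Stata's built-in distribution functions. The sketch below also assumes 25 degrees of freedom for the t statistics in #5, inferred from the F(5,25) line and the 26 clusters in that output.

            ```stata
            * Two-sided p-value for a standard normal statistic of 1.96 (about 0.05)
            display 2*(1 - normal(1.96))

            * Critical value for a two-sided 5% test with 25 degrees of freedom
            display invttail(25, 0.025)

            * Two-sided p-value for t = -2.93 with 25 degrees of freedom
            display 2*ttail(25, abs(-2.93))
            ```

            With only 25 degrees of freedom the 5% critical value is somewhat larger than 1.96, and the last line should come out close to the p-value reported for loggedx1 in the first table.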

            The whole problem of variable selection in models is complicated and controversial, largely because there is no ideal solution to it. I have strongly held opinions that p-values should never be used for this purpose. I'm not going to explain that here because it is lengthy. But I think that even among people who believe in using p-values for variable selection, those who actually understand what they are doing would agree that you should never use the p-values for selecting individual variables if those variables are part of a natural group. So, for example, if you want to make a decision about including time indicators in your model and base it on p-values, you should do it based on the p-value of the joint test of significance of all of those variables. Picking out those particular time indicators that happen to have p < 0.05 is just plain wrong. Variables that come in groups can have synergistic or interfering effects that cause the p-values of the individual variables to over- or under-estimate the contribution of that variable to the model. So you need to treat them as a unit. See -help testparm-. Now, there are occasions where one might include some time indicators and not others. For example, in some contexts, you might want to include indicators for those years that had a recession, and only those years. That could be very sensible: but it is not p-value based.


            • #7
              Dear Clyde,

              as you noticed, I made a mistake: when I wrote the value 1.96 I meant the z-statistic, not the p-value.

              The reason I asked about variable selection and p-values is a dilemma I have about my main predictors. How do I justify including variables x4, x5 and x14 in my model, for example, if their p-values are large when I include time dummies? Are there other criteria I should be looking at?

              Thank you very much for your clarifications.


              • #8
                Dejan:
                hunting for "the best model" (if that means the one with the highest number of statistically significant predictors) is not the way to go: give a fair and true view of the data generating process instead (the literature in your research field can help you out in this respect) and, more important, stay away from any stepwise approach.
                Kind regards,
                Carlo
                (Stata 19.0)


                • #9
                  How do I justify including variables x4, x5 and x14 in my model, for example, if their p-values are large when I include time dummies? Are there other criteria I should be looking at?
                  How did you justify putting them in the model in the first place, before adding the time indicators? Whatever the justification was, it remains valid no matter what the p-values turn out to be.


                  • #10
                    Dear Carlo and Clyde,

                    I did not use a stepwise approach (based on what I have read on this forum). Since I had almost 30 potential predictors, I did an EFA and, based on that (and on theory in my field), selected the five variables that are included in my model. I hope that's a better approach than stepwise.

                    Now, do I still report the coefficients and p-values of my predictors in the final model?

                    Sorry for many questions.


                    • #11
                      I hope that's a better approach than stepwise.
                      Yes, that's much better.

                      Now, do I still report the coefficients and p-values of my predictors in the final model?
                      Yes.


                      • #12
                        Dejan:
                        happy to read that you stayed away from -stepwise-.
                        As far as EFA is concerned, if it has been led by the literature in your field it is justifiable (actually, I find it difficult to be more specific on this point, as you don't tell us whether you performed, say, 30 univariate correlations between each predictor and the regressand, or something else).
                        I'm not sure I understand you correctly about predictors and p-values:
                        - if you mean the coefficients and p-values of the final model (that is, the regression outcome), the answer is a trivial yes (actually, I do not think that was the sense of your question);
                        - if you refer to the EFA results, you may want to consider providing them as supplemental material (assuming that you're going to prepare a submission to a technical journal in your research field).
                        I agree with the approach you followed to select the -fe- specification via -xtoverid-, even though I doubt that Stata reported an F-test comparing the -fe- specification with pooled OLS if you imposed non-default standard errors.
                        The critical issue with your data seems to be the sample size, which is pretty limited for a panel data regression.
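
                        On the -xtoverid- point above, a minimal sketch of how such a check is typically run, with placeholder names (id, year, y, x1 and x2 are assumptions). -xtoverid- is a user-written command from SSC that is run after a random-effects fit. Some versions may not accept factor-variable notation such as i.year, so the sketch leaves the year indicators out; check its help file before relying on that.

                        ```stata
                        * xtoverid is user-written; install it once from SSC
                        ssc install xtoverid

                        xtset id year

                        * Random-effects fit with cluster-robust standard errors
                        xtreg y x1 x2, re vce(cluster id)

                        * Robust test of the additional orthogonality conditions imposed by -re-
                        xtoverid
                        ```

                        A rejection is usually read as evidence against the random-effects assumptions and therefore in favour of -fe-.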

                        PS: crossed in the cyberspace with Clyde's (more efficient) helpful reply.
                        Kind regards,
                        Carlo
                        (Stata 19.0)


                        • #13
                          Dear Clyde,

                          thank you for explanations on these issues. I am very grateful for your posts.


                          Dear Carlo,

                          my question was, indeed, trivial, since I am still under the impression that a large p-value will somehow disqualify my predictors (it seems that my lectures on this subject were misleading, unfortunately).

                          Yes, my sample size is small, but data in my field are very difficult to gather, and only a small part of the data can actually be used in a panel data regression. This sample size is the best I could manage. Do you think that would be problematic for panel analysis?


                          • #14
                            Dejan:
                            - the chance of getting an n-sized sample is simply a matter of fact; if data collection is tricky in your research field, that's it. In all likelihood, other researchers have complained about the very same thing (or will do so in the future). Things are obviously different if researchers "make up" their sample using only those observations with complete values (complete case analysis), which is often misleading;
                            - the p-value and the inference machinery are influenced by sample size. Besides, blessed or blasted as it may be, the p-value is informative, significant or not. I would say that focusing on significant p-values only (and forgetting about all the rest) is a quite widespread and unfortunate misconception of the inferential approach. Hence, don't worry about the p-values if you gave a fair and true view of the data generating process; instead, try to give an explanation or educated guess as to why a coefficient failed to reach a significant p-value (or, better, a tighter confidence interval).
                            Kind regards,
                            Carlo
                            (Stata 19.0)


                            • #15
                              Dear Carlo,

                              thank you for the further clarification. I will do as you and Clyde suggested.
                              Can you recommend a book or article on this issue (model and variable selection, and the overemphasis on p-values)? It would be really helpful.
