  • Xtserial (N versus T) theoretical question

    Hi all,

I posted a question earlier about saving results from xtserial (from the Stata Journal), but I now have a different, more theoretical question about xtserial, so I thought I should make a second post (I hope this is not considered cross-posting; that was not my intention).

I've got panel data that I'm running an OLS regression on, clustering by id. I run a few different versions of my base regression (regress logthpp logprevcpn, cluster(id)), narrowing my panel down and adding dummy variables. At its largest, my sample has N = 77 (as in 77 different groups I am clustering by; I have over 2,000 observations) and T = 70. Since my N and T are relatively similar in size, is this something I should worry about? In particular, as I narrow my sample, my T = 40 at its lowest but my N = 10. Here, when N is small and T is larger, should I worry about clustering?

    Thanks for any thoughts.


  • #2
    Kate:
the first comment on your query is: why go with (pooled) OLS when Stata offers -xt- commands for both N>T and T>N panel datasets?
That said, since you decided to go with (pooled) OLS, you should -cluster- your standard errors; otherwise Stata will treat your observations as independent (which is not the case, due to the panel structure of your dataset).
    Kind regards,
    Carlo
    (Stata 19.0)



    • #3
With N (id) = 10 it makes no sense to cluster, especially with such a large number of periods. Also, with at least 40 periods you should worry about the time-series properties of your panels: it is difficult to assume that T is fixed.
Are your panels stationary? You could introduce time dummies. You should also have a look at the user-written programme -xtscc- to model the residuals; you can inspect it first with ssc describe xtscc. It works for both balanced and unbalanced panels (a sketch follows below).
You say that you have whittled down your data. I assume that your time periods are without gaps.
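For reference, a minimal sketch of obtaining and running -xtscc- (the id and year panel variables and the lag length are my assumptions, not from the thread):

Code:
ssc describe xtscc                    // inspect the package description first
ssc install xtscc                     // install from SSC
xtset id year                         // declare the panel structure (assumed variable names)
xtscc logthpp logprevcpn, fe lag(4)   // within estimator with Driscoll-Kraay standard errors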



      • #4
Originally posted by Carlo Lazzaro:
Kate:
the first comment on your query is: why go with (pooled) OLS when Stata offers -xt- commands for both N>T and T>N panel datasets?
That said, since you decided to go with (pooled) OLS, you should -cluster- your standard errors; otherwise Stata will treat your observations as independent (which is not the case, due to the panel structure of your dataset).
Carlo (#2), I've never made proper use of the -xt- commands before; I've just read about them in the manual. My understanding is that xtreg runs a GLS random-effects regression unless I specify otherwise. A fixed-effects model better suits my purposes (I'm looking at how the learning curves of companies over a five-year period differed based on previous manufacturing experience with that particular product), so in this case I'd run a fixed-effects xtreg model, specifying vce(cluster id), and this, as far as I can tell, would run an OLS regression. I don't think I fully understand the difference between this and regress y x, cluster(id). Is there something to do with N>T and T>N at -xt- here that I'm not seeing?



        • #5
Originally posted by Eric de Souza:
With N (id) = 10 it makes no sense to cluster, especially with such a large number of periods. Also, with at least 40 periods you should worry about the time-series properties of your panels: it is difficult to assume that T is fixed.
Eric (#3), the time period is without gaps for most of my regressions (I run the following three regressions for my whole dataset, where there are a few missing periods, and then again for just a subgroup, where there are no missing periods). I have not checked whether the panels are stationary; do you think this could be a potential issue? I'm running multiple versions of this regression, e.g.:

          1. (Basic) reg logthpp logcpn
          2. (Basic plus varying models of the product) reg logthpp logcpn B12 B24 B36
          3. (Basic w/models & time dummies) reg logthpp logcpn B12 B24 B36 y1941 y1942 y1943 y1944

I hadn't been too worried about time fixed effects, but, as per Carlo's #2, I'm thinking of using xtreg, fe, which would account for time fixed effects, no?

As for xtscc, spatial correlation does seem like it might be an issue (one company's learning affecting another company's learning). Is there a way to check for spatial autocorrelation to justify running a different regression that accounts for it?



          • #6
Kate (#5): Where did you find a reference to spatial autocorrelation? My concern is temporal autocorrelation. You have at least 40 time periods for each panel. Therefore, you should worry about correlation over time. In the "standard" panel data model, one assumes that the number of time periods (T) is small and fixed whereas the number of "individuals" (N) can increase: the statistical properties are based on this. If N is large relative to T, clustering takes care of some of the temporal correlation effects. But with only 10 individuals and a large number of time periods, you cannot cluster. One way out is either to model the temporal correlation in your model (dynamic effects) or to model the temporal correlation in the residuals. The latter is what you can do with xtscc. A sketch of both routes follows below.
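To make the two routes concrete, a hedged Stata sketch (the id and year variables and the lag length are assumptions, not from the thread):

Code:
xtset id year
* route 1: model the dynamics directly with a lagged dependent variable
* (with T of 40-70 the Nickell bias of the within estimator is fairly small)
xtreg logthpp L.logthpp logprevcpn, fe
* route 2: keep the mean model static and handle serial correlation
* in the residuals via Driscoll-Kraay standard errors
xtscc logthpp logprevcpn, fe lag(4)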



            • #7
Eric (#6), ah, I see. Apologies for bringing up spatial autocorrelation; I'm not sure where I got the idea that you were commenting on that.

So, what you're saying is that even using xtreg, fe, the estimates will likely be off because of the relative size of N and T, and clustering should not be done when N is significantly smaller than T. Hence the need for xtscc to account for temporal correlation in the residuals?



              • #8
Kate (#7): The estimates will not be off, but the standard errors will be: hence clustering or a substitute. Clustering corrects the standard errors but requires a "large" number of clusters, whereas you have only ten.



                • #9
                  Eric (#8). Thank you! That makes absolute sense.



                  • #10
As a follow-up: using xtreg or xtscc, what do I do about dummy variables I had been interested in but that are now zeroed out because of the within-estimator structure?

Specifically, of the firms I am looking at, some made product X previously and others were novice producers. In my original OLS regression I had accounted for this with dummies, e.g.:

                    Code:
                    reg logthpp logcpn legacy legxcpn
How would I account for this in a within-estimator model? Or am I missing something that makes this no longer relevant?

                    Thanks for all the help already.



                    • #11
                      Kate:
running a -fe- panel-data regression with -xtreg- means coding:
                      Code:
xtreg <depvar> <indepvars> <controls>, fe vce(cluster panelid) // provided that there are enough clusters
You can also compare the results of a -fe- regression performed via OLS (with panel dummies) and via -xtreg-:
                      Code:
                      use "https://www.stata-press.com/data/r16/nlswork.dta"
                      . regress ln_wage c.age##c.age i.idcode if idcode<=3, vce(cluster idcode)
                      
                      Linear regression                               Number of obs     =         39
                                                                      F(1, 2)           =          .
                                                                      Prob > F          =          .
                                                                      R-squared         =     0.7407
                                                                      Root MSE          =     .19867
                      
                                                       (Std. Err. adjusted for 3 clusters in idcode)
                      ------------------------------------------------------------------------------
                                   |               Robust
                           ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                      -------------+----------------------------------------------------------------
                               age |   .2512762    .103677     2.42   0.136    -.1948099    .6973623
                                   |
                       c.age#c.age |  -.0037603   .0015603    -2.41   0.138    -.0104736     .002953
                                   |
                            idcode |
                                2  |  -.4231615   .0288023   -14.69   0.005    -.5470877   -.2992353
                                3  |  -.6126416   .0625166    -9.80   0.010    -.8816288   -.3436544
                                   |
                             _cons |   -1.82398   1.588179    -1.15   0.370    -8.657361      5.0094
                      ------------------------------------------------------------------------------
                      
                      . xtreg ln_wage c.age##c.age if idcode<=3, fe vce(cluster idcode)
                      
                      Fixed-effects (within) regression               Number of obs     =         39
                      Group variable: idcode                          Number of groups  =          3
                      
                      R-sq:                                           Obs per group:
                           within  = 0.6382                                         min =         12
                           between = 0.8744                                         avg =       13.0
                           overall = 0.2765                                         max =         15
                      
                                                                      F(2,2)            =       3.83
                      corr(u_i, Xb)  = -0.2473                        Prob > F          =     0.2070
                      
                                                       (Std. Err. adjusted for 3 clusters in idcode)
                      ------------------------------------------------------------------------------
                                   |               Robust
                           ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                      -------------+----------------------------------------------------------------
                               age |   .2512762   .1007559     2.49   0.130    -.1822416     .684794
                                   |
                       c.age#c.age |  -.0037603   .0015163    -2.48   0.131    -.0102844    .0027638
                                   |
                             _cons |  -2.189815   1.575348    -1.39   0.299    -8.967992    4.588361
                      -------------+----------------------------------------------------------------
                           sigma_u |  .31366066
                           sigma_e |  .19867104
                               rho |  .71367959   (fraction of variance due to u_i)
                      ------------------------------------------------------------------------------
                      
                      .
As you can see, the shared coefficients are identical (whereas the standard errors and related statistics differ).

This has nothing to do with the N and T dimensions. However, when T=N or T>N, other -xt- commands are preferred (-xtgls-, -xtregar-).

                      As an aside, you will find https://www.stata.com/bookstore/micr...metrics-stata/ useful for self-learning purposes.
                      Kind regards,
                      Carlo
                      (Stata 19.0)



                      • #12
Indeed, the dummy variables are zeroed out by FE. The only way around it is to use correlated random effects, also known as the Mundlak estimator (a sketch follows below).
I will not be returning to this page today because I have an event (online, on account of COVID) going on the whole day.
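A minimal sketch of the correlated random effects (Mundlak) device, assuming id/year panel identifiers and the variable names used earlier in the thread: add the panel means of the time-varying regressors to a random-effects regression, which leaves time-invariant dummies such as legacy estimable.

Code:
xtset id year
bysort id: egen mean_logcpn = mean(logcpn)   // panel mean of the time-varying regressor
* RE plus panel means mimics the within estimator for the time-varying slope
* while keeping the time-invariant legacy dummy in the model
xtreg logthpp logcpn mean_logcpn legacy, re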



                        • #13
Originally posted by Kate Pryce:
As a follow-up: using xtreg or xtscc, what do I do about dummy variables I had been interested in but that are now zeroed out because of the within-estimator structure?

Specifically, of the firms I am looking at, some made product X previously and others were novice producers. In my original OLS regression I had accounted for this with dummies, e.g.:

Code:
reg logthpp logcpn legacy legxcpn
How would I account for this in a within-estimator model? Or am I missing something that makes this no longer relevant?

Thanks for all the help already.
                          It depends on what you mean by "accounting for them". If it means controlling for them, the fixed effects already take care of that. But if you are interested in level differences of the dependent variable between two groups and want to have estimates for that, then fixed effects is not the way to go, as Eric de Souza already said. It is possible, however, to interact independent variables with indicator variables to see if the effect is different between groups.
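As a sketch of that last point (variable names assumed from earlier posts): the level effect of legacy is absorbed by the firm fixed effects, but its interaction with a time-varying regressor is still identified.

Code:
xtreg logthpp c.logcpn##i.legacy, fe
* 1.legacy is omitted (collinear with the fixed effects);
* c.logcpn#1.legacy estimates the slope difference between legacy and novice firms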



                          • #14
Originally posted by Wouter Wakker:

                            It depends on what you mean by "accounting for them". If it means controlling for them, the fixed effects already take care of that. But if you are interested in level differences of the dependent variable between two groups and want to have estimates for that, then fixed effects is not the way to go, as Eric de Souza already said. It is possible, however, to interact independent variables with indicator variables to see if the effect is different between groups.
Wouter (#13), ah, yes, great idea. I shall interact an independent variable with my legacy variable (my indicator variable) to see if there is an effect: I could multiply my legacy dummy by cumulative production and then add that legacyxcump term to the model. Thank you! As you said, I am indeed interested in level differences between the two groups, so, in the end, I'm thinking I might have to take Eric's (#12) excellent suggestion and make use of correlated random effects.



                            • #15
Carlo (#11), thank you for the OLS-versus-xtreg example! That was incredibly helpful. Thanks particularly for the link to the book, too; I'm going to check my library for it today.

Now that I've looked more closely at the example you provided, I'm thinking I could include the interaction Wouter suggested as follows:

                              c.cump##i.legacy

From what I've been reading, this should create an interaction term between my continuous variable cump (cumulative production) and my indicator variable legacy, along with their main effects?
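For reference, c.cump##i.legacy is factor-variable shorthand for the two main effects plus their interaction; in the within model (dependent variable assumed from earlier posts) it would look like this:

Code:
xtreg logthpp c.cump##i.legacy, fe
* expands to: cump, 1.legacy (omitted under fe), and c.cump#1.legacy
* the interaction is continuous x indicator, not itself a new indicator variable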
