
  • Dates got omitted due to collinearity in fixed-effects regression model

    Hello,

    I ran a regression to examine what affects the decision to sell a stock on the stock market. The regression includes time and investor fixed effects. When I run it, two dates are omitted because of collinearity (December 29, 1995 and November 22, 1996). I am not sure what to do or how to solve this. I am not even sure whether it is a problem at all.

    Here is a brief explanation of the variables used in the regression:
    "sell": Dummy variable taking the value of 1 if a sale took place and zero otherwise
    "gain": Dummy variable taking the value of 1 if the stock was sold with a gain
    "ybeta": CAPM Beta of the stock
    "retp": positive part of the return: max(return;0)
    "retm": negative part of the return: min(return;0)
    "mini": dummy variable taking the value of 1 if the stock is at its minimum in the last 30 days
    "maxi": dummy variable taking the value of 1 if the stock is at its maximum in the last 30 days
    "sqrhp": square root of the time the stock is held (measured in days)
    "logprc": logarithm of the purchase price
    "dec": dummy variable taking the value of 1 if the stock is sold/held in December


    Maybe someone can help me understand the note and how to deal with it.


    Code:
    . xtreg sell i.gain##c.ybeta retp retm mini maxi sqrhp logprc i.dec##i.gain i.bdate, fe vce(cluster investor)
    note: 1304.bdate omitted because of collinearity.
    note: 1533.bdate omitted because of collinearity.
    
    Fixed-effects (within) regression               Number of obs     =    754,554
    Group variable: investor                        Number of groups  =     43,005
    
    R-squared:                                      Obs per group:
         Within  = 0.0312                                         min =          1
         Between = 0.0636                                         avg =       17.5
         Overall = 0.0473                                         max =     13,212
    
                                                    F(1498,43004)     =          .
    corr(u_i, Xb) = 0.1032                          Prob > F          =          .
    
                              (Std. err. adjusted for 43,005 clusters in investor)
    ------------------------------------------------------------------------------
                 |               Robust
            sell | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
          1.gain |   .0708707   .0043097    16.44   0.000     .0624236    .0793178
           ybeta |   .0106603   .0014711     7.25   0.000      .007777    .0135437
                 |
    gain#c.ybeta |
              1  |   .0201162   .0023042     8.73   0.000        .0156    .0246324
                 |
            retp |  -.0030423   .0031748    -0.96   0.338     -.009265    .0031804
            retm |  -.0557309   .0078385    -7.11   0.000    -.0710945   -.0403673
            mini |    .057538   .0030438    18.90   0.000      .051572     .063504
            maxi |   .1050872   .0033335    31.52   0.000     .0985535    .1116209
           sqrhp |  -.0004804   .0001556    -3.09   0.002    -.0007853   -.0001755
          logprc |   .0001387   .0007885     0.18   0.860    -.0014068    .0016842
           1.dec |   -.172695   .0119841   -14.41   0.000     -.196184    -.149206
                 |
        dec#gain |
            1 1  |  -.1198417   .0051324   -23.35   0.000    -.1299012   -.1097821
                 |
           bdate |
             44  |   .2201969   .0366147     6.01   0.000     .1484314    .2919624
             45  |   .0785358   .1512041     0.52   0.603    -.2178272    .3748987
    Remark: I did not post the whole regression output as I have around 1,500 dates. I just included the first ones (see last lines of the output)



  • #2
    No issue; these two date indicators do not pick up sufficient variation.

    A few remarks:

    - You may want to check out the community contributed command summclust. Cluster-robust inference should not be too much of a problem for you as you have over 43000 investors in your dataset, but still, you may want to compute CV3 standard errors (and not the default Stata CV1 standard errors) for a nice robustness check. You can read the following insightful paper: https://arxiv.org/pdf/2205.03288.pdf.

    - You may also want to take a look at community-contributed commands probitfe and logitfe, given that your dependent variable is binary. Compare the marginal effects of these two commands with the command you have run now, see if you get similar inference.

    - You have logged the purchase price, changing the interpretation of its coefficient. Is this common in finance literature?

    That's it for me



    • #3
      By the way, regarding my second point, Jeff Wooldridge posted on this, so I would much rather follow his advice than mine if I were you! Professor Wooldridge posted here: https://www.statalist.org/forums/for...nel-regression.

      The only difference is that in that post the poster had T=3, whereas you have T=1500. The issue of standard errors would remain, however.



      • #4
        Jana:
        as an aside to Maxence's helpful replies, I notice that your Within Rsq is very low (0.0312).
        This result might be caused by an overall limited variation in time-varying variables.
        I'd double-check whether your model is correctly specified.
        Kind regards,
        Carlo
        (Stata 19.0)



        • #5
          Thanks for the fast replies! I have a few follow-up questions:

          1.) Maxence: I am not sure whether I got your first point. Do you mean that it is not good that I use standard errors that are clustered at investor level?

          2.) I know that my dependent variable is binary and that logit or probit models would therefore be better, but I use OLS to make the coefficients easier to interpret. I also wanted to run a logit model as a robustness test, but with xtlogit I could not include "vce(cluster investor)". I read the post from Professor Wooldridge that you linked, but I don't really understand how I could implement such a regression model. Could someone explain it in other words? Would I use logitfe and then just include all my variables?

          Maxence: The variable definitions follow other related papers, which logged the purchase price, so I did the same.

          Carlo: Other papers using the same dataset and a similar regression model also reported such low R2. Hence, I do not think that this should be a big problem.


          Thanks a lot!



          • #6
            Jana:
            fine with the low within Rsq if this result is frequently reported with the same dataset and a similar regression specification.
            However, as each researcher is accountable for her/his own regression model (regardless of what others did), I would double-check the functional form of the regressand of your linear probability model:
            Code:
            . use "https://www.stata-press.com/data/r17/nlswork.dta"
            (National Longitudinal Survey of Young Women, 14-24 years old in 1968)
            
            . xtreg nev_mar c.age##c.age, fe vce(cluster idcode)
            
            Fixed-effects (within) regression               Number of obs     =     28,494
            Group variable: idcode                          Number of groups  =      4,710
            
            R-squared:                                      Obs per group:
                 Within  = 0.2466                                         min =          1
                 Between = 0.0912                                         avg =        6.0
                 Overall = 0.1158                                         max =         15
            
                                                            F(2,4709)         =     824.00
            corr(u_i, Xb) = -0.0305                         Prob > F          =     0.0000
            
                                         (Std. err. adjusted for 4,710 clusters in idcode)
            ------------------------------------------------------------------------------
                         |               Robust
                 nev_mar | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
            -------------+----------------------------------------------------------------
                     age |  -.1337085   .0037181   -35.96   0.000    -.1409978   -.1264192
                         |
             c.age#c.age |    .001919   .0000573    33.49   0.000     .0018067    .0020314
                         |
                   _cons |    2.40826   .0577433    41.71   0.000     2.295056    2.521464
            -------------+----------------------------------------------------------------
                 sigma_u |  .35134613
                 sigma_e |  .23340767
                     rho |  .69380538   (fraction of variance due to u_i)
            ------------------------------------------------------------------------------
            
            . predict fitted, xb
            (24 missing values generated)
            
            . g sq_fitted=fitted^2
            (24 missing values generated)
            
            . xtreg nev_mar fitted sq_fitted , fe vce(cluster idcode)
            
            Fixed-effects (within) regression               Number of obs     =     28,494
            Group variable: idcode                          Number of groups  =      4,710
            
            R-squared:                                      Obs per group:
                 Within  = 0.2649                                         min =          1
                 Between = 0.0890                                         avg =        6.0
                 Overall = 0.1194                                         max =         15
            
                                                            F(2,4709)         =     829.60
            corr(u_i, Xb) = -0.0460                         Prob > F          =     0.0000
            
                                         (Std. err. adjusted for 4,710 clusters in idcode)
            ------------------------------------------------------------------------------
                         |               Robust
                 nev_mar | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
            -------------+----------------------------------------------------------------
                  fitted |   .0183942   .0766724     0.24   0.810    -.1319196     .168708
               sq_fitted |   1.635666   .1274138    12.84   0.000     1.385876    1.885457
                   _cons |   .1007479   .0089692    11.23   0.000      .083164    .1183318
            -------------+----------------------------------------------------------------
                 sigma_u |  .35249962
                 sigma_e |  .23055292
                     rho |  .70038633   (fraction of variance due to u_i)
            ------------------------------------------------------------------------------
            
            . test sq_fitted
            
             ( 1)  sq_fitted = 0
            
                   F(  1,  4709) =  164.80
                        Prob > F =    0.0000
            
            .
            As the -test- outcome reaches statistical significance, the model is misspecified.
            Kind regards,
            Carlo
            (Stata 19.0)



            • #7
              1. The community-contributed summclust command computes an adjustment to cluster-robust standard errors; you can still specify that you want to cluster at the investor level.

              2. If you run logit or probit, you need to compute marginal effects. The marginal effects will then have an easy interpretation. For standard errors with these models, you could check out the boottest command with the score option.
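
A minimal sketch of what computing marginal effects looks like, using the nlswork example data already used in #6 rather than the poster's data. It fits a pooled logit, which ignores the investor and time fixed effects discussed in this thread, so it only illustrates the -margins- step. The matrix name ame is introduced here for illustration:

```stata
webuse nlswork, clear
* pooled logit for illustration only: no fixed effects are included
logit nev_mar c.age##c.age, vce(cluster idcode)
* average marginal effect of age on Pr(nev_mar = 1)
margins, dydx(age)
* keep the estimate for later inspection
matrix ame = r(b)
matrix list ame
```

The number reported by -margins- is on the probability scale, so it can be set next to the average marginal effect from the linear probability model.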



              • #8
                Returning to the question posed in #1, I'm not sure about the conclusion in #2 that the omission of extra bdate indicators is OK. Looking at your variables, it seems that the bdate indicators are by definition colinear with the variable dec. This kind of colinearity arises whenever one has both time fixed effects and another variable which indicates a subset of the dates.

                Whether this is a problem or not depends on whether you need to know the effect of variable dec, or whether it is just included to reduce omitted variable bias (a "control" variable whose effect is not directly of interest.) If you need to estimate the effect of dec to accomplish your research goals, you are in trouble. You are in trouble because where there are colinear variables, there is an unidentified model. You get results only because Stata identifies the model by constraining one or more coefficients to zero (i.e. omits one or more variables). And the problem is that you will get different results for the dec effect depending on which time variable(s) get omitted. In other words, in a model where dec is colinear with the time effects, it is mathematically impossible to estimate an effect of dec (nor of any of the time indicators, though this usually doesn't matter.) Whatever coefficient you get for dec is nothing other than an artifact of which time indicator(s) get dropped.

                If you need to estimate the effect of dec for your research goals, you must either omit the time fixed effects, or (and this would be peculiar, and probably especially frowned upon in finance/economics) use random effects for the bdate variable.

                If, however, you were just trying to adjust ("control") for the effects of dec, then there is no problem. While the coefficients you get for dec and the bdate indicators are arbitrary, the coefficients of the other variables, the ones in which you have a real interest, are not affected. Also, if you do -predict- or -margins- or other things, the results of those commands are also unaffected by the way in which the colinearity of dec and the bdate indicators are broken. And just as firm indicators will automatically adjust for industry level effects in a fixed effects model, the bdate indicators automatically adjust for dec effects in a model with time fixed effects. You can, in fact, omit dec from your -xtreg, fe- command and dec effects will be properly handled by the bdate variables instead.
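
The claim about the other coefficients can be checked on the nlswork example data. This is a sketch, not the poster's model: late is a hypothetical indicator for a subset of the survey years and plays the role of dec, while year plays the role of bdate. The scalar names b_with and b_without are also introduced here.

```stata
webuse nlswork, clear
xtset idcode year
* hypothetical indicator for a subset of the years (stands in for dec)
generate byte late = year >= 80

* model with the redundant indicator: Stata notes an omitted term
xtreg ln_wage tenure i.late i.year, fe vce(cluster idcode)
scalar b_with = _b[tenure]

* model without it: only the base year is left out, with no note
xtreg ln_wage tenure i.year, fe vce(cluster idcode)
scalar b_without = _b[tenure]

display "tenure, with late:    " %10.7f b_with
display "tenure, without late: " %10.7f b_without
```

The two displayed tenure coefficients should agree up to numerical precision, because both specifications span the same set of year effects.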



                • #9
                  Dec is only included as a control variable. However, if I omit dec from my regression, then instead of the two dates (1304.bdate and 1533.bdate) only one date (1533.bdate) is omitted because of collinearity. I do not get why this date is still omitted. This result confuses me, maybe someone has an idea why this date is still omitted?



                  • #10
                    Jana:
                    if you omit -dec-, Stata omits one date only to save your analysis from the so-called dummy variable trap (see https://en.wikipedia.org/wiki/Dummy_...le_(statistics)).
                    Kind regards,
                    Carlo
                    (Stata 19.0)



                    • #11
                      But normally if I include fixed effects, I do not get a note that one category is omitted. The first value often just gets omitted without that note. This is why I was confused.



                      • #12
                        To post #7:
                        How exactly would I implement a logit regression with time- and investor-fixed effects and standard errors clustered at investor-level. xtlogit does not work, right? And using i.bdate and i.investor is not feasible as I have so many investors



                        • #13
                          Originally posted by Jana He:
                          To post #7:
                          How exactly would I implement a logit regression with time- and investor-fixed effects and standard errors clustered at investor-level. xtlogit does not work, right? And using i.bdate and i.investor is not feasible as I have so many investors
                          Give the community contributed commands probitfe and logitfe (NOT xtlogit) a go, but still beware, as Professor Wooldridge highlighted, that these methods are not foolproof. Your 1500 time periods do help though.
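
An alternative that uses only official commands, and that is not suggested in the thread, is the conditional logit: -clogit- with group() conditions out the panel effects in the same way as -xtlogit, fe- and accepts vce(cluster). A sketch on the union example data, with idcode standing in for investor and year for bdate. Whether this is computationally practical with about 1,500 date indicators is an assumption to check:

```stata
webuse union, clear
* conditional fixed-effects logit: idcode effects are conditioned out,
* year effects enter as indicators, standard errors clustered on idcode
clogit union i.not_smsa i.south i.year, group(idcode) vce(cluster idcode)
```

Because the panel effects are never estimated, the coefficients are log odds ratios, and average marginal effects on the unconditional probability are not available from this model.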



                          • #14
                            However, if I omit dec from my regression, then instead of the two dates (1304.bdate and 1533.bdate) only one date (1533.bdate) is omitted because of collinearity. I do not get why this date is still omitted.

                            ...

                            normally if I include fixed effects, I do not get a note that one category is omitted. The first value often just gets omitted without that note.
                            Your intuition is correct. In a proper model with no colinear variables, you should have only the base value of bdate omitted (and Stata does not give any warnings about that), and no others. So, somehow something else in your data is colinear with the date variables. The colinearity with dec was obvious from its definition. It is by no means obvious why any of the other variables would have such a colinearity. It might result from a "conspiracy" of missing values. That is, perhaps there is no colinearity in the entire data sample, but, after observations with a missing value on any variable mentioned in the -xtreg- command are removed, the remaining data exhibits a colinearity.

                            In any event, as your model does not have many variables, probably the simplest way to find what is causing it is to rerun the regression leaving out one variable at a time to see which restricted version of the model finally leads to no excess omissions of bdate values.
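
One further check, sketched with the variable names from #1: a date indicator is absorbed by the investor fixed effects if every investor observed on that date is observed on no other date in the estimation sample. That is only one possible explanation, not something this thread confirms. The helper variables insample, tag_id_date, and n_dates are names introduced here.

```stata
* refit the model from #1 and mark its estimation sample
quietly xtreg sell i.gain##c.ybeta retp retm mini maxi sqrhp logprc ///
    i.dec##i.gain i.bdate, fe vce(cluster investor)
generate byte insample = e(sample)

* number of distinct dates per investor within the estimation sample
egen byte tag_id_date = tag(investor bdate) if insample
bysort investor: egen n_dates = total(tag_id_date)

* for each flagged date: observations on that date, and how many of
* them belong to investors who also appear on some other date
foreach d in 1304 1533 {
    display as text _n "bdate == `d'"
    count if insample & bdate == `d'
    count if insample & bdate == `d' & n_dates > 1
}
```

If the second count for a date is zero, its indicator does not vary within any investor, which would explain why it is omitted regardless of which other regressors are in the model.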
                            Last edited by Clyde Schechter; 02 Jan 2023, 08:53.



                            • #15
                              I followed your advice, Clyde, and left out one independent variable at a time, but in every regression I still get the note that one date is omitted. I am unsure how to deal with this. What can I do?

