  • Drop collinear variables with factor variables and interactions

    Dear all,

    I want to drop collinear variables from my list of independent variables in the case of factor variables and interactions.

    I am working with confidential data: I can access only a small sample myself, and the Statistical Office runs my do-files on the larger sample for me. Because I use the margins command after my regressions, and margins does not work with collinear variables, I cannot rely on Stata to omit the collinear variables. And since I do not run the do-files myself, I cannot check which variables are collinear and drop them manually.

    I know that there is the option forcedrop for the _rmcoll command. However, this option is not allowed for factor variables and interactions, which I do have in my regression:

    Code:
    reg y i.post_reform##ib999.Event i.DFAge* i.DYear*, r
    I regress the dependent variable y on the interaction of a post-reform dummy and an event variable (= 998 two years before the event year, = 999 one year before, = 1000 in the event year, = 1001 one year after, = 1002 two years after). In addition, I include age and year dummies. Some of these age and year dummies are collinear.
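
    The coding of Event can be made concrete with a minimal sketch on a toy dataset. The names year and event_year are hypothetical, not taken from the actual data; an offset such as 1000 keeps all values nonnegative, which Stata requires of factor variables.

    Code:
    * Toy data: year and event_year are hypothetical variable names
    clear
    set obs 5
    gen year = 2008 + _n
    gen event_year = 2011
    * 1000 in the event year, 999 one year before, 1001 one year after, ...
    gen Event = 1000 + (year - event_year)
    list year event_year Event
    * ib999.Event then uses the year before the event as the base category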

    Ideally, I would like to use _rmcoll to identify the collinear variables and remove them from the r(varlist) that the command returns. I would then run the regression with only the non-collinear independent variables.

    Code:
    _rmcoll i.post_reform##ib999.Event i.DFAge* i.DYear*
    I suspect that I have these collinearity problems because of the small sample size and relatively many independent variables. This might not be the case for my larger sample. Still, I was wondering if any of you know a clever way of omitting the collinear variables in case of factor variables and interactions?

    Thank you very much!
    Leonie

  • #2
    I am not sure I get the reasoning behind what you want to do. But you can obtain the list using reghdfe from SSC, authored by Sergio Correia. Here is an example:

    Code:
    webuse grunfeld, clear
    *GENERATE COLLINEAR VARIABLES
    gen colvar1= 1942.year
    gen colvar2= 1954.year
    *ABSORB THE COLLINEAR VARIABLES LEAVING INDICATORS IN REGRESSION
    reghdfe invest mvalue i.year, absorb(colvar1 colvar2 company)
    *RETRIEVE VARIABLE NAMES FROM MATRIX e(b)
    local names: colnames e(b)
    *OMITTED VARIABLES HAVE THE PREFIX "o."
    local included = ustrregexra("`names'","[0-9]+o\.[a-z]+", "",.)
    local omitted: list names- included
    display "`included'"
    display "`omitted'"
    Results:

    Code:
    . reghdfe invest mvalue i.year, absorb(colvar1 colvar2 company)
    note: 1942bn.year is probably collinear with the fixed effects (all partialled-out values are close to zero; tol = 1.0e-09)
    note: 1954bn.year is probably collinear with the fixed effects (all partialled-out values are close to zero; tol = 1.0e-09)
    (MWFE estimator converged in 3 iterations)
    note: 1942.year omitted because of collinearity
    note: 1954.year omitted because of collinearity
    
    HDFE Linear regression                            Number of obs   =        200
    Absorbing 3 HDFE groups                           F(  18,    170) =       8.07
                                                      Prob > F        =     0.0000
                                                      R-squared       =     0.8808
                                                      Adj R-squared   =     0.8604
                                                      Within R-sq.    =     0.4607
                                                      Root MSE        =    81.0287
    
    ------------------------------------------------------------------------------
          invest |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          mvalue |   .1799679   .0206334     8.72   0.000     .1392372    .2206986
                 |
            year |
           1936  |   -38.1056   37.04158    -1.03   0.305    -111.2263    35.01509
           1937  |  -66.31173   38.60245    -1.72   0.088    -142.5136    9.890143
           1938  |   -20.3273   36.35156    -0.56   0.577    -92.08588    51.43128
           1939  |  -59.30898   37.04449    -1.60   0.111    -132.4354    13.81745
           1940  |  -36.24558   37.29061    -0.97   0.332    -109.8579    37.36671
           1941  |  -1.500652   37.07777    -0.04   0.968     -74.6928     71.6915
           1942  |          0  (omitted)
           1943  |  -6.731505   36.72012    -0.18   0.855    -79.21764    65.75463
           1944  |  -9.464895   36.83487    -0.26   0.798    -82.17755    63.24776
           1945  |  -26.44723   37.32046    -0.71   0.480    -100.1184    47.22398
           1946  |   -.739807   37.65736    -0.02   0.984    -75.07607    73.59646
           1947  |   34.82738   36.51991     0.95   0.342    -37.26352    106.9183
           1948  |   47.09646    36.4475     1.29   0.198    -24.85151    119.0444
           1949  |    28.6898   36.49548     0.79   0.433    -43.35289    100.7325
           1950  |   29.79853   36.66158     0.81   0.417    -42.57203    102.1691
           1951  |   36.74273   37.68058     0.98   0.331    -37.63936    111.1248
           1952  |   52.86058   37.95357     1.39   0.166    -22.06041    127.7816
           1953  |   64.20592   39.56961     1.62   0.107    -13.90515     142.317
           1954  |          0  (omitted)
                 |
           _cons |  -50.16154   27.86184    -1.80   0.074    -105.1613    4.838206
    ------------------------------------------------------------------------------
    
    Absorbed degrees of freedom:
    -----------------------------------------------------+
     Absorbed FE | Categories  - Redundant  = Num. Coefs |
    -------------+---------------------------------------|
         colvar1 |         2           0           2     |
         colvar2 |         2           1           1     |
         company |        10           1           9    ?|
    -----------------------------------------------------+
    ? = number of redundant parameters may be higher
    
     
    . display "`included'"
    mvalue 1935b.year 1936.year 1937.year 1938.year 1939.year 1940.year 1941.year  1943.year 1944.year 1945.year 1946.year 1947.year 1948.year 1949.year 1950.y
    > ear 1951.year 1952.year 1953.year  _cons
    
    .
    . display "`omitted'"
    1942o.year 1954o.year
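
    The regular expression above targets names of the form 1942o.year. As a minimal sketch of an alternative that should also handle interaction terms, one can loop over the column names and test each one for the omit operator. The sketch uses the auto dataset and plain regress as stand-ins, so it is not the reghdfe workflow itself; the locals included and omitted merely mirror the ones above.

    Code:
    sysuse auto, clear
    * Foreign cars have no rep78 of 1 or 2, so some interaction terms are omitted
    quietly regress price i.foreign##i.rep78 mpg
    local names : colnames e(b)
    local included
    local omitted
    foreach term of local names {
        * Skip the constant so the list can be reused as a varlist
        if "`term'" == "_cons" continue
        * The omit operator shows up as "o." somewhere in the term
        if strpos("`term'", "o.") > 0 {
            local omitted `omitted' `term'
        }
        else {
            local included `included' `term'
        }
    }
    display "`included'"
    display "`omitted'"

    Because a period in a column name only ever follows an operator, checking for "o." should not misfire on variable names that merely contain the letter o.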


    • #3
      Dear Andrew,

      Thank you very much for your quick response.
      I do not know which age and year dummies will be collinear beforehand. In addition, I run the same regression for several sample restrictions and different age and year dummies are collinear for different specifications.
      Do I understand reghdfe correctly that I would need to specify the collinear variables in absorb()? I would not be able to do so, since I do not know which variables are collinear before running reg or _rmcoll.

      Best,
      Leonie


      • #4
        It depends on what you want to drop: the indicators, or the regressors that are collinear with the indicators. You absorb what you want to keep. I guess the bottom line is that, without seeing the data, it is difficult to know which variables will be collinear with the indicators. However, if you know all the variable names, you can write code that extracts the list of variables to omit before running the main regression.


        • #5
          Sorry Andrew, I do not understand your last sentence. Could you explain this in more detail?
          Thank you!


          • #6
            It makes sense to absorb the indicators, as you cannot identify the coefficients of the regressors that are collinear with the indicators. As long as you know which models you will run and which variables you want to consider, and provided that the third party has access to reghdfe, you can give them a do-file that first runs a regression with reghdfe to pick out the variables to use. With the example in #2, assume that we do not know that colvar1 and colvar2 are collinear. Our do-file will look like the following:

            Code:
            webuse grunfeld, clear
            *GENERATE COLLINEAR VARIABLES
            gen colvar1= 1942.year
            gen colvar2= 1954.year
            
            *START HERE
            local regressors mvalue kstock colvar1 colvar2
            qui reghdfe invest  `regressors', absorb(i.year) nocons
            local names: colnames e(b)
            *OMITTED VARIABLES HAVE THE PREFIX "o."
            local included = ustrregexra("`names'","o\.[A-Za-z0-9\_]+", "",.)
            regress invest `included' i.year
            
            *COMPARE WITH
            regress invest i.year `regressors'
            Results:

            Code:
            . local included = ustrregexra("`names'","o\.[A-Za-z0-9\_]+", "",.)
            
            .
            . regress invest `included' i.year
            
                  Source |       SS           df       MS      Number of obs   =       200
            -------------+----------------------------------   F(21, 178)      =     37.84
                   Model |  7646972.23        21  364141.535   Prob > F        =    0.0000
                Residual |  1712971.69       178  9623.43646   R-squared       =    0.8170
            -------------+----------------------------------   Adj R-squared   =    0.7954
                   Total |  9359943.92       199  47034.8941   Root MSE        =    98.099
            
            ------------------------------------------------------------------------------
                  invest |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
            -------------+----------------------------------------------------------------
                  mvalue |   .1167978   .0063313    18.45   0.000     .1043037    .1292919
                  kstock |   .2197066   .0322961     6.80   0.000     .1559741    .2834391
                         |
                    year |
                   1936  |  -17.21234   43.92514    -0.39   0.696    -103.8934    69.46868
                   1937  |  -34.49127   44.01535    -0.78   0.434    -121.3503    52.36778
                   1938  |  -28.44276   43.92396    -0.65   0.518    -115.1215    58.23594
                   1939  |  -56.24304   43.95138    -1.28   0.202    -142.9759    30.48978
                   1940  |  -30.50473   43.96242    -0.69   0.489    -117.2593    56.24987
                   1941  |  -2.627113   43.98457    -0.06   0.952    -89.42542    84.17119
                   1942  |  -1.422156   44.06149    -0.03   0.974    -88.37227    85.52795
                   1943  |  -21.80127   44.07228    -0.49   0.621    -108.7727    65.17013
                   1944  |  -22.11735   44.06456    -0.50   0.616    -109.0735    64.83881
                   1945  |  -33.59647   44.07984    -0.76   0.447    -120.5828    53.38984
                   1946  |  -7.028064   44.12025    -0.16   0.874    -94.09412      80.038
                   1947  |  -5.246123   44.47173    -0.12   0.906    -93.00578    82.51353
                   1948  |  -3.919472   44.72107    -0.09   0.930    -92.17119    84.33224
                   1949  |  -28.79332   44.94258    -0.64   0.523    -117.4822    59.89552
                   1950  |  -28.35409    45.0507    -0.63   0.530    -117.2563     60.5481
                   1951  |  -11.67194   45.09191    -0.26   0.796    -100.6554    77.31157
                   1952  |  -5.613218   45.53045    -0.12   0.902    -95.46214     84.2357
                   1953  |   2.448996   46.12231     0.05   0.958     -88.5679    93.46589
                   1954  |  -12.31488   46.98088    -0.26   0.794     -105.026    80.39629
                         |
                   _cons |  -23.57497   31.25408    -0.75   0.452    -85.25117    38.10124
            ------------------------------------------------------------------------------
            
            .
            .
            .
            . *COMPARE WITH
            
            .
            
            . regress invest i.year `regressors'
            note: colvar1 omitted because of collinearity
            note: colvar2 omitted because of collinearity
            
                  Source |       SS           df       MS      Number of obs   =       200
            -------------+----------------------------------   F(21, 178)      =     37.84
                   Model |  7646972.23        21  364141.535   Prob > F        =    0.0000
                Residual |  1712971.69       178  9623.43646   R-squared       =    0.8170
            -------------+----------------------------------   Adj R-squared   =    0.7954
                   Total |  9359943.92       199  47034.8941   Root MSE        =    98.099
            
            ------------------------------------------------------------------------------
                  invest |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
            -------------+----------------------------------------------------------------
                    year |
                   1936  |  -17.21234   43.92514    -0.39   0.696    -103.8934    69.46868
                   1937  |  -34.49127   44.01535    -0.78   0.434    -121.3503    52.36778
                   1938  |  -28.44276   43.92396    -0.65   0.518    -115.1215    58.23594
                   1939  |  -56.24304   43.95138    -1.28   0.202    -142.9759    30.48978
                   1940  |  -30.50473   43.96242    -0.69   0.489    -117.2593    56.24987
                   1941  |  -2.627113   43.98457    -0.06   0.952    -89.42542    84.17119
                   1942  |  -1.422156   44.06149    -0.03   0.974    -88.37227    85.52795
                   1943  |  -21.80127   44.07228    -0.49   0.621    -108.7727    65.17013
                   1944  |  -22.11735   44.06456    -0.50   0.616    -109.0735    64.83881
                   1945  |  -33.59647   44.07984    -0.76   0.447    -120.5828    53.38984
                   1946  |  -7.028064   44.12025    -0.16   0.874    -94.09412      80.038
                   1947  |  -5.246123   44.47173    -0.12   0.906    -93.00578    82.51353
                   1948  |  -3.919472   44.72107    -0.09   0.930    -92.17119    84.33224
                   1949  |  -28.79332   44.94258    -0.64   0.523    -117.4822    59.89552
                   1950  |  -28.35409    45.0507    -0.63   0.530    -117.2563     60.5481
                   1951  |  -11.67194   45.09191    -0.26   0.796    -100.6554    77.31157
                   1952  |  -5.613218   45.53045    -0.12   0.902    -95.46214     84.2357
                   1953  |   2.448996   46.12231     0.05   0.958     -88.5679    93.46589
                   1954  |  -12.31488   46.98088    -0.26   0.794     -105.026    80.39629
                         |
                  mvalue |   .1167978   .0063313    18.45   0.000     .1043037    .1292919
                  kstock |   .2197066   .0322961     6.80   0.000     .1559741    .2834391
                 colvar1 |          0  (omitted)
                 colvar2 |          0  (omitted)
                   _cons |  -23.57497   31.25408    -0.75   0.452    -85.25117    38.10124
            ------------------------------------------------------------------------------
            
            .
            Last edited by Andrew Musau; 22 Oct 2020, 12:40.
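
            A possible variant for factor variables and interactions, sketched under assumptions and not tested on the data from #1: _rmcoll has an expand option that should post level-specific terms to r(varlist), with omitted terms flagged by the o. operator (check help _rmcoll before relying on this). The local names all and keep are made up for illustration.

            Code:
            webuse grunfeld, clear
            gen colvar1 = 1942.year
            gen colvar2 = 1954.year

            * expand asks _rmcoll to list each factor level separately in r(varlist)
            _rmcoll mvalue kstock colvar1 colvar2 i.year, expand
            local all `r(varlist)'

            * Keep only the terms that do not carry the omit operator
            local keep
            foreach term of local all {
                if strpos("`term'", "o.") == 0 {
                    local keep `keep' `term'
                }
            }
            display "`keep'"
            regress invest `keep'

            Which member of a collinear set gets flagged depends on how the variables are ordered, and whether margins behaves as intended with such a level-by-level list is something to verify on the small sample first.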


            • #7
              Thank you very much Andrew for the detailed explanation. I will try this out.
