  • How to account for year fixed effect in LASSO

    Dear Statalisters,

    I am currently using the lasso2 command from lassopack, developed by Ahrens, Hansen and Schaffer (documentation: https://statalasso.github.io/docs/lassopack/), for variable selection. Specifically, I am applying lasso2 to identify which factors are most relevant in explaining CEO compensation, using US data from 2000 to 2014.
    I declared the data as a panel with firm id and year before calling lasso2. The dependent variable is CEO compensation (in logs), and there are 27 regressors on the right-hand side, each lagged one period. I understand that the "fe" option makes lasso2 account for firm fixed effects. However, I am puzzled about year fixed effects, because I obtain different results depending on whether I include (1) i.year or (2) a dummy for each year in the regression.

    I use the "fe" and "lic(aic)" options, so that lasso2 chooses the penalty level by AIC, and obtain the following results when trying to account for year fixed effects:
    1. adding "i.year" to the list of regressors:
      - The number of regressors chosen by lasso2 is 26 (out of 27). This result is not helpful, since almost all the variables are selected
      - Penalty level is 1.323
    2. adding a list of year dummies, dumyear1-dumyear15 (sample period 2000-2014), to the list of regressors:
      - The number of regressors chosen by lasso2 is 21 (out of 27)
      - Penalty level is much higher: 16.31

    I am really confused by these results. I expected i.year to be equivalent to including a dummy for each year in the regression.

    I also tried the options notpen(i.year) and partial(i.year), so that the year dummies are not penalised when calling lasso2, but I got the following error messages:

    3. using partial(i.year): syntax error - 0.year in partial(.) but not in list of regressors
    r(198);

    4. using notpen(i.year): internal _lassopath error - unpenalized 2002.year missing from selected vars
    set tolzero(.) or other tolerances smaller or use partial(.) option
    r(499);

    I did include i.year (or the full set of year dummies) in the list of regressors, so I do not understand why I get these error messages. I would greatly appreciate any help in resolving this.
    Many thanks for your help.

    Below is a sample of my data. I have 27 regressors in total but cannot include them all because of the linesize limit. In the sample, ceo_totalpay is the CEO's total compensation (the dependent variable), gvkey is the firm id, year is the financial year, and the remaining variables are the independent variables.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float(ceo_totalpay ceo_tenure ceo_ownership ceo_directorship ceo_age ceo_age_sq ceo_duality ceo_gender ln_size tobinq roa leverage free_cashflow rd capex sales_growth ln_firmage ln_segment merger) double gvkey float year
     7.743223  .6931472  .0000905406         0 56 3136 0 0   7.38426 1.5691023   .07805158   .326074  -.02208356  .02687224  .04476306     .2298486  2.772589 2.0794415 0 1034 2000
     7.218372 1.0986123 .00011408457         0 57 3249 0 0  7.779052 1.1174034  .035978958  .4437609  .029121887  .03625511 .035668083    .08236733  2.833213   1.94591 0 1034 2001
     7.556269 1.3862944  .0004853922         0 58 3364 0 0  7.739326  .8305851   .06212613  .3900251   .14331414  .02920776  .03238679    .26233295  2.890372   1.94591 0 1034 2002
     7.973105  1.609438   .001487662         0 59 3481 0 0  7.753309  .9630332   .04531338   .350821   .10936305 .027146727 .018297164    .05405026  2.944439   1.94591 0 1034 2003
     7.933617  1.792216  .0024018176         0 60 3600 1 0  7.602822 1.0058173   .03324264  .3501948   .15286067   .0406549  .02460573    .03252562  2.995732 2.0794415 0 1034 2004
     7.618988 1.9463015   .002395355         0 61 3721 1 0  7.392268  1.384189   .05913269 .25667757   .13513224  .01659251  .02398633            . 3.0445225 1.7917595 0 1034 2005
     8.047317  .9168385    .00322704 1.0986123 51 2601 0 1  7.160974 1.1174711 -.004286718 .24145354 -.015307399  .10887969  .04696526    .10491597  3.135494  1.609438 1 1034 2007
     7.939509 1.4859537            0  .6931472 51 2601 0 1  7.458848 1.6002488   .13146882 .08863158   .11346654          0  .04507452    .07843047  3.367296 2.0794415 0 1076 2011
      8.30934  .6497534   .000649168         . 70 4900 1 1  7.502699 1.5544815   .14043573 .07806594 -.002483934          0  .03589385    .09809002 3.4011974 2.0794415 0 1076 2012
     8.319598 1.0698934  .0009713119         . 71 5041 1 1  7.510527 1.5500143   .12147653 .07810085   .11668974          0  .03182233   .005418458  3.433987 2.1972246 0 1076 2013
     8.667651 1.2755923   .001827806 1.0986123 72 5184 0 1  7.806633  1.403948   .07163662  .2466913  -.04355994          0 .019360203    .21954766  3.465736 2.1972246 0 1076 2014
     8.983843 1.0986123  .0001799805         0 45 2025 1 1  9.634513  5.338754   .21344067 .10179913           .  .08839898          .    .04312545 4.1431346   1.94591 0 1078 2000
     9.694727 1.3862944 .00023363395         0 46 2116 1 1 10.056055 4.3312244    .1518659  .3128733           .  .12482397          .     .1847334  4.158883   1.94591 0 1078 2001
    10.109052  1.609438  .0003439838         0 47 2209 1 1 10.096547  3.137679   .16244793 .26475123   .04616797   .0688192  .05343961    .08593158 4.1743875   1.94591 1 1078 2002
     9.193542 1.7917595  .0003323995         0 48 2304 1 1 10.192993 3.2396975    .1558944 .22420397  .034286458  .06863891  .04666761     .1128604  4.189655   1.94591 1 1078 2003
    9.3351965 1.9463015  .0003605419         0 49 2401 1 1    10.267  3.031784   .15487362 .23570412   .05243976 .068680264  .04489905 -.0000276923  4.204693 1.7917595 1 1078 2004
     9.606682  2.079784  .0004261123         0 50 2500 1 1 10.279908 2.5880184   .16592997  .2276335   .06536151  .06308271  .04143593    .13250965 4.2195077 1.7917595 0 1078 2005
      9.96103  2.197529    .00274604 1.0986123 51 2601 1 1  10.49621   2.68126    .1343412  .3430501    .0533049  .11800682 .036978595   .008458167 4.2341065 1.7917595 0 1078 2006
     10.27352  2.302859   .003312108 1.0986123 52 2704 1 1 10.589458  2.743693   .13908334  .3075421   .04053475 .063092455  .04170343    .15295723  4.248495 1.7917595 0 1078 2007
    10.130786 2.3983934   .003275916  .6931472 53 2809 1 1 10.655356  2.541137    .1526925 .29140773   .07309864  .06567938   .0303571    .13943355   4.26268 1.7917595 1 1078 2008
    end

    Here is my Stata code:
    Code:
     
    local x ceo_tenure ceo_ownership ceo_directorship ceo_age ceo_age_sq ceo_duality ceo_gender ln_size tobinq roa leverage free_cashflow rd capex sales_growth ln_firmage ln_segment merger
    
    * calling lasso2 using i.year option to account for year fixed effect:
    qui eststo: lasso2 ceo_totalpay `x' i.year,  fe lic(aic) displayall postest
    
    * calling lasso2 using dummy for each year to account for year fixed effect:
    tab year, gen(dumyear)
    qui eststo: lasso2 ceo_totalpay `x' dumyear1-dumyear15,  fe lic(aic) displayall postest
    
    * in an attempt not to penalise year I apply the below syntax and got the message as in (3):
    qui eststo: lasso2 `ceo_totalpay `x' i.year, partial(i.year) fe lic(aic) displayall postest
    
    *or the below syntax and got the error message as in (4)
    qui eststo: lasso2 `ceo_totalpay `x' i.year,  notpen(i.year) fe lic(aic) displayall postest

  • #2
    On the difference between i.year and using manually generated dummies (dumyear1-dumyear15) when using the lasso:

    First note that if you use the "i." prefix, Stata automatically omits one category, the base category. By default this is the lowest level of the variable.

    If you use OLS/regress, it doesn't make a difference whether you use "i.year" or "dumyear1-dumyear15", because Stata will drop one year dummy in either case. Including all dummies/categories alongside the constant would lead to perfect collinearity.

    In contrast to OLS, the lasso can deal with perfectly collinear variables, since it doesn't rely on the full rank condition. If you use the lasso with "i.year", Stata still drops the base category, whereas with "dumyear1-dumyear15" the lasso does include all dummies. Hence the difference in results.
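
    To see what Stata does with each specification before any lasso is run, here is a minimal sketch on toy data (one row per year for 2000-2014, not your data). It only uses built-in commands; the local names base_dropped and all_levels are just illustrative.

```stata
* Toy data: one observation per year, 2000-2014 (not the poster's data)
clear
set obs 15
generate year = 1999 + _n

* i.year: the lowest level is flagged as the base (marker "b"),
* and estimation commands omit it
fvexpand i.year
local base_dropped `r(varlist)'
display "`base_dropped'"

* ibn.year: no level is treated as an omitted base,
* so every year keeps its own indicator
fvexpand ibn.year
local all_levels `r(varlist)'
display "`all_levels'"

* Manual dummies: tabulate creates dumyear1-dumyear15, one per year
tabulate year, generate(dumyear)
describe dumyear*
```

    Comparing the two displayed lists shows which year "i.year" treats as the base, and that the manual dummies correspond to the "ibn.year" case.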

    You have two options:

    1) In many panel data applications, you expect all year effects to matter and want to always include them. This can be done by not penalizing the year effect. To this end, use the options partial(i.year) or notpen(i.year) of lasso2/rlasso/cvlasso.

    2) If you want to let the lasso select which years matter, use "ibn.year" (which forces Stata to generate a dummy for each year) or include dumyear1-dumyear15. That is, include one dummy for every year. This way, the lasso will implicitly select the base category.
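
    As a rough sketch of option 2, the call below penalizes a full set of dummies via "ibn.". It assumes lassopack is installed (ssc install lassopack) and that the example file is reachable online; the AK91 example data stand in for your panel, with yob playing the role of year and pob the role of the firm id.

```stata
* Assumes lassopack is installed and internet access is available
clear
use https://statalasso.github.io/dta/AK91.dta
xtset pob

* Option 2: one penalized dummy per year of birth (no base dropped by Stata);
* the lasso decides which dummies stay in, with lambda chosen by AIC
lasso2 lnwage c.educ#c.educ#i.qob ibn.yob, fe lic(aic)
```

    For the model in #1, the analogous call would replace "i.year" with "ibn.year", which should match what you get from dumyear1-dumyear15.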

    On your error message: One issue is that you have an accent ( `) before the dependent variable in the last two estimations. However, the main issue appears to be an internal bug that occurs when the model is estimated for the selected lambda value in the 2nd step. Thank you very much for alerting us to this problem; I will investigate.

    In the meantime there is a quick workaround. You can pass on the lambda value of your choice and re-estimate the model like so:

    Code:
    clear
    use https://statalasso.github.io/dta/AK91.dta
    xtset pob
    
    lasso2 lnwage c.educ#c.educ#i.qob i.yob, fe partial(i.yob)
    local lambda = e(laicc)
    lasso2 lnwage c.educ#c.educ#i.qob i.yob, fe partial(i.yob) lam(`lambda')
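
    Since the question uses lic(aic) rather than AICc, the same two-step workaround can be sketched with the AIC-minimizing lambda. This assumes lasso2 stores that value as e(laic), by analogy with e(laicc) above, and that lassopack is installed; the AK91 data again stand in for the poster's panel.

```stata
* Assumes lassopack is installed and internet access is available
clear
use https://statalasso.github.io/dta/AK91.dta
xtset pob

* Step 1: estimate the full lasso path with the year dummies partialled out
lasso2 lnwage c.educ#c.educ#i.qob i.yob, fe partial(i.yob)

* Step 2: pick up the lambda that minimizes AIC
* (e(laic) is assumed to be the AIC counterpart of e(laicc))
local lambda = e(laic)
display "AIC-minimizing lambda: `lambda'"

* Step 3: re-estimate at that single lambda value
lasso2 lnwage c.educ#c.educ#i.qob i.yob, fe partial(i.yob) lambda(`lambda')
```

    For the CEO compensation model, the same pattern would use ceo_totalpay, the regressor list and i.year, with partial(i.year), after xtset gvkey year.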
    --
    Tag me or email me for ddml/pdslasso/lassopack/pystacked related questions. I don't check Statalist.



    • #3
      Thank you very much, Achim Ahrens, for taking the time to answer my question about the difference between i.year and a dummy for each year. I have also tried your suggested workaround and the results are fantastic. Please keep us posted when the bug is fixed. Many thanks.

      Regards,

      Linh
