
  • Estimates Restore with Lasso - "no observations" error

    Hi everyone, I am trying to do in-sample and out-of-sample prediction with lasso. The following steps worked perfectly fine with linear regression but not with lasso, so I wonder why this is the case and how I can solve it.

    Step 1. Generate data
    matrix def c = J(8,8,0)
    matrix list c
    forvalues i = 1/8 {
        forvalues j = 1/8 {
            local toaddin = 0.5^(abs(`i' - `j'))
            matrix c[`i',`j'] = `toaddin'
        }
    }

    set seed 12345
    set obs 20
    drawnorm x1 x2 x3 x4 x5 x6 x7 x8, cov(c)
    gen e = rnormal(0,1)
    gen y = 3*x1 + 1.5*x2 + 2*x5 + e



    Step 2. Save the model
    lasso2 y x1 x2 x3 x4 x5 x6 x7 x8
    lasso2, lic(ebic)
    estimates store regression3


    Note: in-sample prediction, as follows, works perfectly fine:
    predict yhat_lasso, residual lic(ebic)
    gen yhat_lasso_2 = yhat_lasso^2
    summ yhat_lasso_2
    local yhat_lasso_2_mean = r(mean)
    di `yhat_lasso_2_mean'


    Step 3. Generate new data for out-of-sample prediction
    drop _all
    set seed 12345
    set obs 1000
    drawnorm x1 x2 x3 x4 x5 x6 x7 x8, cov(c)
    gen e = rnormal(0,1)
    gen y = 3*x1 + 1.5*x2 + 2*x5 + e


    Step 4: Try to do out-of-sample prediction
    set seed 12345
    estimates restore regression3
    ereturn list
    predict yhat_lasso, residual lic(ebic)


    Here is where the error message "no observations" comes out.
    The ereturn list command works well, which means the regression results have been successfully restored. So I am unsure why I can't predict again here.



    Thanks everyone who tries to help!



  • #2
    It appears to me that lasso2 does not support out-of-sample prediction.

    The output of help lasso2 makes no mention of out-of-sample prediction techniques, and it suggests that predict works by re-estimating the model in the original sample; I confirmed this by using trace to look "behind the scenes". The failure message you received arises because, as a careful look at the output of your second ereturn list shows, e(sample) was apparently not restored when estimates restore realized that the dataset had changed substantially.
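
    You can check this yourself; a minimal sketch, reusing the stored name regression3 from your post:
    Code:
    estimates restore regression3
    count if e(sample)   // returns 0 here: no observations are flagged as the estimation sample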

    You might contact one of the authors at the email address given for support in
    Code:
    net describe lassopack, from(http://fmwww.bc.edu/RePEc/bocode/l)
    and ask whether they can confirm this.

    I will note that Stata 16 includes its own lasso command, documented in the Stata Lasso Reference Manual PDF included in the Stata 16 installation and accessible through Stata's Help menu.
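
    For reference, a minimal sketch of the built-in command (syntax per the Stata 16 Lasso manual; blasso is just an arbitrary store name, and postselection requests predictions from the post-selection OLS coefficients):
    Code:
    lasso linear y x1 x2 x3 x4 x5 x6 x7 x8
    estimates store blasso
    * ... replace the data in memory with the new sample ...
    estimates restore blasso
    predict yhat_blasso, postselection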
    Last edited by William Lisowski; 21 Oct 2019, 19:51.



    • #3
      Hi Elisha, thanks for your report, which helps us to improve lassopack.

      First of all, let me clarify that lassopack supports out-of-sample prediction.

      The reason for the issue you experience is that, as William rightly points out, e(sample) (i.e., the estimation sample that is stored after estimation) gets lost when you use drop _all. This affects any other Stata command in the same way:

      Code:
      *Step 1. Generate data
      clear all
      
      matrix def c = J(8,8,0)
      matrix list c
      forvalues i = 1/8 {
          forvalues j = 1/8 {
              local toaddin = 0.5^(abs(`i' - `j'))
              matrix c[`i',`j'] = `toaddin'
          }
      }
      
      set seed 12345
      set obs 200
      drawnorm x1 x2 x3 x4 x5 x6 x7 x8, cov(c)
      gen e = rnormal(0,1)
      gen y = 3*x1 + 1.5*x2 + 2*x5 + e
      
      *Step 2. Save the model
      reg y x1 x2 x3 x4 x5 x6 x7 x8
      estimates store regression1
      ereturn list // NOTE that e(sample) is there
      
      *Step 3. Generate new data for out-of-sample prediction
      drop _all
      set seed 12345
      set obs 1000
      drawnorm x1 x2 x3 x4 x5 x6 x7 x8, cov(c)
      gen e = rnormal(0,1)
      gen y = 3*x1 + 1.5*x2 + 2*x5 + e
      
      estimates restore regression1
      ereturn list // NOTE that e(sample) is missing

      lasso2's predict command requires e(sample) to exist; otherwise it fails. This is in contrast to regress, whose predict also works without e(sample). I simply hadn't anticipated the case where e(sample) doesn't exist when you use predict. My bad... but I also think that you probably don't need to structure your code in this way.
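
      To see the contrast, continue the regress example above; its predict still works even though e(sample) is gone (a minimal sketch using the names from the code above):
      Code:
      predict yhat_ols, xb   // succeeds: predict after regress does not need e(sample)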

      How about this approach:

      Code:
      *Step 1. Generate data
      clear all
      
      matrix def c = J(8,8,0)
      matrix list c
      forvalues i = 1/8 {
          forvalues j = 1/8 {
              local toaddin = 0.5^(abs(`i' - `j'))
              matrix c[`i',`j'] = `toaddin'
          }
      }
      
      set seed 12345
      set obs 1200
      drawnorm x1 x2 x3 x4 x5 x6 x7 x8, cov(c)
      gen e = rnormal(0,1)
      gen y = 3*x1 + 1.5*x2 + 2*x5 + e
      
      
      *Step 2. Estimate using first 1000 obs only
      reg y x1 x2 x3 x4 x5 x6 x7 x8 if _n <= 1000
      
      lasso2 y x1 x2 x3 x4 x5 x6 x7 x8 if _n <= 1000
      lasso2, lic(ebic) postres
      
      
      *Step 3: insample prediction
      predict yhat_lasso_in if _n <= 1000, residual
      
      *Step 4: out-of-sample prediction
      predict yhat_lasso_out if _n > 1000, residual

      In my view, this is a bit cleaner, but you might have reasons for structuring the code the way you did.
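
      If you then want the out-of-sample fit, you can compute an RMSE directly from the residual predictions, following the squared-residual approach in your own post (a minimal sketch):
      Code:
      gen resid_sq_out = yhat_lasso_out^2   // predict ..., residual stored residuals, not fitted values
      summarize resid_sq_out
      display "out-of-sample RMSE: " sqrt(r(mean))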

      Finally, just a quick side note: if you haven't done so already, make sure you install the latest version of lassopack. The latest version is currently only available via GitHub (we are waiting for it to be uploaded to SSC).

      I hope this helps.
      http://statalasso.github.io/



      • #4
        Thank you William and Achim! Your help is much appreciated.



        • #5
          A follow-up question:
          I have now modified my code and it works well. However, I've noticed that the location of ols in the predict line matters. In my Version 1, I added ols to a separate predict line; in Version 2, I added ols on the same line as lasso2. These two approaches give different RMSEs. I was wondering, what is the reason for this difference?

          Code:
          //* Version 1 *//
          *in-sample
          set seed 10101
          lasso2 y x1 x2 x3 x4 x5 x6 x7 x8 if in_sample==1
          lasso2, lic(aic) postres 
          predict yhat_postlasso, lic(ebic) ols
          gen residual_postlasso2 = (yhat_postlasso - y)^2
          summarize residual_postlasso2 if in_sample==1
          local mean_postlasso_in = r(mean)
          local rmse_postlasso_in = sqrt(`mean_postlasso_in'*20/13)
          display `rmse_postlasso_in' //.84318197
          *out-of-sample 
          summarize residual_postlasso2 if in_sample==0  
          local mean_postlasso_out = r(mean)
          local rmse_postlasso_out = sqrt(`mean_postlasso_out'*1000/993)
          display `rmse_postlasso_out' //1.5339668
          
          
          //* Version 2 *//
          set seed 10101
          lasso2 y x1 x2 x3 x4 x5 x6 x7 x8 if in_sample==1, lic(ebic) ols
          local mylambda2 = e(lebic) 
          predict yhat_postlasso, lambda(`mylambda2')
          gen residual_postlasso2 = (yhat_postlasso - y)^2
          summarize residual_postlasso2 if in_sample==1
          local mean_postlasso_in = r(mean)
          local rmse_postlasso_in = sqrt(`mean_postlasso_in')
          display `rmse_postlasso_in' //.93564476
          *out-of-sample 
          summarize residual_postlasso2 if in_sample==0  
          local mean_postlasso_out = r(mean)
          local rmse_postlasso_out = sqrt(`mean_postlasso_out')
          display `rmse_postlasso_out' //1.1081077



          • #6
            Thanks again for the excellent question.

            Version 1 obtains the coefficient path using the lasso, picks the optimal lambda based on AIC/EBIC, and then uses post-lasso.

            Version 2 obtains the coefficient path for the post-lasso and picks the optimal lambda based on AIC/EBIC.

            I would probably go with Version 1, which seems more natural to me, but I have no theoretical justification to prefer one over the other.
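
            In code terms, mirroring the commands from #5 (using EBIC throughout for clarity):
            Code:
            * Version 1: lasso coefficient path; lambda chosen by EBIC; then post-lasso OLS
            lasso2 y x1 x2 x3 x4 x5 x6 x7 x8 if in_sample==1
            lasso2, lic(ebic) postres
            predict yhat_v1, lic(ebic) ols

            * Version 2: EBIC applied along the post-lasso (OLS) path directly
            lasso2 y x1 x2 x3 x4 x5 x6 x7 x8 if in_sample==1, lic(ebic) ols
            local mylambda = e(lebic)
            predict yhat_v2, lambda(`mylambda')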
            http://statalasso.github.io/
