  • Training sample for lasso and then making out-of-sample predictions

    Dear All

    I have created a loop to generate forecasts using a lasso regression that (I hope) are considered out of sample.

    Can anyone please confirm whether these forecasts are out of sample, in the sense that these simple lines of code will only use observations up to the year 1996 to train the model and will then generate out-of-sample forecasts for the period after 1996? Do I have to condition "predict" on sample 2?
    The model is estimated using the lasso command in Stata 16.

    My code is:

    Code:
    gen sample=.
    replace sample=1 if year<1996  // training sample
    replace sample=2  if sample==.   // evaluation sample
    
    quietly lasso linear BUS s1 s2 s3 s4 s5 if sample == 1
    estimates store lasso
    
    estimate restore lasso
    predict BUShat, postselection // out of sample forecasts
    Last edited by Mike Kraft; 12 Sep 2020, 02:44. Reason: a typo in the title then* but could not edit

  • #2
    Out-of-sample simply means that the predictions correspond to observations outside the estimation sample.

    Code:
    gen sample=.
    replace sample=1 if year<1996 // training sample
    replace sample=2 if sample==. // evaluation sample
    quietly lasso linear BUS s1 s2 s3 s4 s5 if sample == 1
    estimates store lasso
    estimate restore lasso
    predict BUShat, postselection // out of sample forecasts
    Here, predict will create predictions for all observations in the dataset. Part of the observations (year<1996) will be in-sample and the remainder (year>=1996) will be out-of-sample.
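
    To see which observations fall on each side, the estimation sample can be flagged with e(sample) right after fitting. The sketch below is only an illustration on the Grunfeld example data, not the poster's data; the 1945 cutoff, the seed, and the names insample and investhat are assumptions made for the example.

    ```stata
    * Illustrative sketch: Grunfeld data stand in for the poster's dataset
    webuse grunfeld, clear
    * Fit on the training years only; rseed() fixes the cross-validation folds
    quietly lasso linear invest mvalue kstock if year < 1945, rseed(12345)
    * e(sample) is 1 for observations used in estimation and 0 otherwise
    gen byte insample = e(sample)
    * Without an if qualifier, predict fills in every observation
    predict investhat, postselection
    tabulate insample
    * Restrict to the out-of-sample part when evaluating the forecasts
    summarize investhat if !insample
    ```

    Only the rows with insample equal to 0 are out-of-sample predictions; the rest are fitted values from the training period.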



    • #3
      Thank you so much Andrew.

      What about the following to have only out-of-sample predictions?!

      Code:
      gen sample=.
      replace sample=1 if year<1996  // training sample
      replace sample=2  if sample==.   // evaluation sample
      
      quietly lasso linear BUS s1 s2 s3 s4 s5 if sample == 1
      estimates store lasso
      
      estimate restore lasso
      predict BUShat, postselection // producing forecasts (BUShat before 1996 are in sample and BUShat from 1996 onwards are out of sample)
      replace BUShat=. if sample==1 // to keep only the out-of-sample forecasts



      • #4
        Yes, that's fine. But you could also do it directly using predict.

        Code:
        predict BUShat if sample==2, postselection

        Tip: It is better and more efficient to code a binary variable 0/1. Your code above can be reduced to the following 3 lines:

        Code:
        gen sample = year<1996
        quietly lasso linear BUS s1 s2 s3 s4 s5 if sample
        predict BUShat if !sample, postselection
        Last edited by Andrew Musau; 12 Sep 2020, 05:00.



        • #5
          Thanks a lot.
          I have the following comments:

          1- Using if sample==2 works and so I do not need to replace prediction=. if sample==1.

          However, I do not know why the results did not look the same when I used your shortened code, i.e. the BUShat values are different.


          2- Most importantly for me, I now understand that sample 1 is used to train the model and that predictions are then made for sample 2. Does this mean that all coefficients are obtained from estimating the model on sample 1, and that new forecasts are then produced using new data for s1-s5 as they become available during sample 2? Or are the model coefficients also updated over time? I am sorry, but I am still unsure.

          3- As for the sample window, does this also mean that estimation is done recursively, in the sense that more data are used each time a forecast is made beyond 2016? If so, how can I have a fixed window where I add one new observation and drop an older one each time a new forecast is made?

          Thanks



          • #6
            Dear fellow Stata users,

            Quick check: I have some unclean (AMR) data in this format:
            LabNo Test Results
            201514 SENSITIVITY clindamycin 3+,cephalexim 2+,Cyprofloxaxin3+.
            201514 RESISTANCE Tetracyclin,erythromycin,septrin,cloxacillin,Amoxy clav
            203658 SENSITIVITY NO PATHOGEN ISOLATED AT 37 0c for 48hrs
            203819 Salmonella Typhi H Positive
            203918 Salmonella Typhi H negative
            204089 SENSITIVITY levofloxacin 3+,ciprofloxacin 3+,genta 3+,Amoxicillin 3+,Norbactin 3+
            204197 SENSITIVITY LEVOFLOXACIN 3+ NORBACTIN 3+,GENTAMYCIN 3+,CIPROFLOXACIN 3+
            204197 RESISTANCE AMPICILLIN,AMOXYCILLIN
            Now, I want to split the Results column into one new column per value it contains, with each new cell derived from the trailing integer plus the + sign (e.g. 3+) in the current cell value,
            where 3+ = "S", +++ = "S", 2+ = "I", ++ = "I", + = "I" when the Test column = "SENSITIVITY", and the result is "R" when the Test column = "RESISTANCE".
            Example: I expect to get the following output:

            expected resultant output table
            LabNo Test Value clindamycin cephalexim Cyprofloxaxin levofloxacin NORBACTIN GENTAMYCIN AMPICILLIN AMOXYCILLIN
            201514 SENSITIVITY clindamycin 3+,cephalexim 2+,Cyprofloxaxin3+. S I S
            201514 RESISTANCE Tetracyclin,erythromycin,septrin,cloxacillin,Amoxy clav R R R
            203658 SENSITIVITY NO PATHOGEN ISOLATED AT 37 0c for 48hrs
            203819 Salmonella Typhi H Positive
            203918 Salmonella Typhi H negative
            204089 SENSITIVITY levofloxacin 3+,ciprofloxacin 3+,genta 3+,Amoxicillin 3+,Norbactin 3+ S S S S S
            204197 SENSITIVITY LEVOFLOXACIN 3+ NORBACTIN 3+,GENTAMYCIN 3+,CIPROFLOXACIN 3+ S S S S
            204197 RESISTANCE AMPICILLIN,AMOXYCILLIN R R
            Any idea from somebody on how to go about this? please help a brother here.
            Thanks.



            • #7
              Shadrack Muema, I think you probably need to create a post for that as it is not related to the question here. I can see that it is your first post to Stata, so welcome to the forum.



              • #8

                1- Using if sample==2 works and so I do not need to replace prediction=. if sample==1.

                However, I do not know why when I used your summarized code, the results did not look the same. i.e. the BUShat are different.
                I think your saved estimates are messing up the procedure. Add -estimates clear- before comparing the methods.

                Code:
                estimates clear
                webuse grunfeld, clear
                gen sample=.
                replace sample=1 if time<6  // training sample
                replace sample=2  if sample==.   // evaluation sample
                
                quietly lasso linear invest mvalue kstock if sample == 1
                estimates store lasso
                
                estimate restore lasso
                predict investhat, postselection // out of sample forecasts
                replace investhat=. if sample==1
                
                gen sample2= time<6
                quietly lasso linear invest mvalue kstock if sample2
                predict investhat2 if !sample2, postselection 
                assert investhat==investhat2
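
                A further possible source of differences, offered as an assumption about the setup in #5 rather than a diagnosis: by default, lasso selects lambda by cross-validation with randomly assigned folds, so two separate runs may select different variables unless the seed is fixed. A minimal sketch using the rseed() option, with an arbitrary seed value and illustrative variable names:

                ```stata
                * Illustrative sketch: same model fitted twice with the same seed
                webuse grunfeld, clear
                gen byte train = time < 6
                * rseed() makes the cross-validation folds reproducible
                quietly lasso linear invest mvalue kstock if train, rseed(12345)
                predict hat_a if !train, postselection
                quietly lasso linear invest mvalue kstock if train, rseed(12345)
                predict hat_b if !train, postselection
                * Identical folds give identical selection and identical predictions
                assert hat_a == hat_b
                ```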

                2- Most importantly for me, now I understand that sample 1 is used to train the model, and then predictions are made for sample 2. Does this exactly mean that all coefficients are obtained from the estimation of the model using sample 1 and then new forecasts are produced using new data for s1-s5 as they become available during sample 2 ?
                Yes.


                3- As for the sample window, does this also mean that estimation is done recursively in the sense that more data are used each time the forecast is made beyond 2016? and if this is the case, how can I have a fixed window where I add one new observation and drop an older one each time a new forecast is made?
                So far, your code is static. What you want can be programmed given that you define your windows.



                • #9
                  You are super. Thanks Andrew.
                  As for 3, any ideas on how to adjust the code so that

                  a) Out of sample forecasts are produced using an increasing window with data in sample 1 being used as an initial window (i.e. rolling but increasing window)
                  or
                  b) Out of sample forecasts are produced using a fixed window in a rolling fashion (i.e. rolling but fixed window)

                  The sample starts from 1987.

                  Hope I get help with that



                  • #10
                    Assume that my windows are each of length 1 year. So my initial sample at time 1 predicts the outcome at time 2 and the sample at time 2 predicts the outcome at time 3. All predictions in this case are therefore out-of-sample.


                    Code:
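                    webuse grunfeld, clear
                    gen prediction = .
                    levelsof year, local(years)
                    forval i = 1/`=wordcount("`years'")' {
                        // capture suppresses errors, e.g. in the last iteration, where there is no following year to predict
                        qui capture {
                            // train on the i-th year only
                            quietly lasso linear invest mvalue kstock if year == `=word("`years'", `i')'
                            // predict the following year, which lies outside the estimation sample
                            predict investhat`i' if year == `=word("`years'", `=`i'+1')', postselection
                            replace prediction = investhat`i' if !missing(investhat`i')
                        }
                    }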



                    • #11
                      Thanks a lot. Yes, having the sample at time 1 forecast the outcome at time 2 and the sample at time 2 forecast the outcome at time 3 is what I was asking for. But I do have the following issue with the window in your code:

                      Does this mean that you use only a one-year window to make predictions?

                      What I meant in #9 is that in (a) the window should be increasing, for example starting with 100 years and then growing by one year each time a forecast is made, while in (b) it is a fixed window of, say, 100 years that is rolling, so we add one new year and drop the oldest year, and so on.

                      I am not sure which one of these is consistent with your code in #10. If neither, I would appreciate it if the code could also be amended to do both (a) and (b) (as two separate pieces of code).

                      Thanks.



                      • #12
                        Dear All
                        I would greatly appreciate it if someone could also help with my questions in post #11 regarding the windows. I am bringing this up again as it has not received a response in the last 24 hours.
                        Thanks
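
                        The two window schemes described in #9 and #11 can be sketched roughly as follows. This is not code from the thread: it assumes the Grunfeld example data, a 5-year initial (or fixed) window, one-step-ahead forecasts, and a hypothetical local named rolling that switches between (a) and (b). It also ignores the panel structure of the example data, as the code in #10 does.

                        ```stata
                        * Illustrative sketch: window length, seed and names are assumptions
                        webuse grunfeld, clear
                        local width   = 5     // years in the initial (a) or fixed (b) window
                        local rolling = 0     // 0 = expanding window (a), 1 = fixed rolling window (b)

                        summarize year, meanonly
                        local first = r(min)
                        local last  = r(max)
                        local t0    = `first' + `width'   // first year to be forecast

                        gen double prediction = .
                        forvalues t = `t0'/`last' {
                            * (a) always starts at the first year; (b) keeps the latest `width' years
                            local start = cond(`rolling', `t' - `width', `first')
                            * re-estimate on the window that ends the year before t
                            quietly lasso linear invest mvalue kstock ///
                                if inrange(year, `start', `t' - 1), rseed(12345)
                            * forecast year t only, which is outside the estimation window
                            quietly predict double tmp if year == `t', postselection
                            quietly replace prediction = tmp if year == `t'
                            drop tmp
                        }
                        ```

                        Setting rolling to 1 drops the oldest year each time a new one is added, so every model is fitted on exactly width years. In both cases the coefficients are re-estimated before each forecast, unlike the static code earlier in the thread.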
