
  • How can I calculate out-of-sample SE for a linear regression?

    I feel like I'm missing something really obvious. I ran a linear regression on a subset of my data:

    regress logprice lone_ticket i.sectiontype_encoded i.gamedate_enc if insample==1

    It's easy to pull lots of summary statistics describing how well this regression fits within the sample, but I can't seem to get the same information for the remainder of the data (i.e., where insample==0). All I want is to figure out how much worse the fit is out of sample. Standard error or R^2 would be adequate.

  • #2
    I don't know what you mean by standard error--I can't see how that statistic fits in here. Perhaps you mean root mean squared error?

    Anyway, the -predict- command will get you linear predictions and residuals both in and out of sample. You can get your statistics from there. Here's an example:

    Code:
    sysuse auto, clear
    
    regress price mpg if foreign == 0          // fit only on the "in sample" subset (domestic cars)
    
    predict xb, xb                             // linear prediction, filled in for every observation
    predict e, resid                           // residuals, also filled in for every observation
    gen e2 = e*e
    
    corr xb price if foreign != 0              // prediction vs. outcome, out of sample only
    display "Out of sample R^2 = " %05.3f =`r(rho)'^2
    
    summ e2 if foreign != 0, meanonly          // mean squared error over the out-of-sample observations
    display "Out of sample RMSE = " =sqrt(`r(mean)')

    • #3
      Thanks, this solves my immediate problem. And you're right, I was looking for RMSE -- sorry for the confusion.

      Out of curiosity, is there a standard/preferred way to do cross-validation and out-of-sample tests of predictors in Stata? Or is the usual approach to calculate your own statistics based on the -predict- command?

      • #4
        I'm not aware of any, but I can't say I've ever looked. Whenever I've needed to do this, I've always just calculated my own statistics, in ways similar to what I showed you. There may well be user-written Stata commands that do this, but I don't know about them. Other Forum members may know more about this and might respond. Also you might try using Stata's -search- command to see if it turns up anything relevant.
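
        For what it's worth, a hand-rolled k-fold cross-validation only takes a few lines with -predict-. Here is a minimal sketch using the auto data again (the seed, the choice of 5 folds, and the variable names are purely illustrative):

        Code:
        sysuse auto, clear
        
        * randomly assign each observation to one of 5 folds
        set seed 12345
        gen fold = ceil(5*runiform())
        
        gen double cv_e2 = .
        forvalues k = 1/5 {
            regress price mpg if fold != `k'                   // fit on everything except fold k
            predict double xb`k', xb
            replace cv_e2 = (price - xb`k')^2 if fold == `k'   // squared error on the held-out fold
            drop xb`k'
        }
        
        summ cv_e2, meanonly
        display "Cross-validated RMSE = " sqrt(r(mean))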

        • #5
          Hi Clyde,

          With regard to your calculation of the out-of-sample RMSE, I was a bit confused and would just like to clarify. Firstly, when calculating the RMSE, why did you not divide by the error degrees of freedom?

          And if you did want to divide by the error degrees of freedom, would it be more accurate to divide the sum of the squared errors by the error degrees of freedom and take the square root of that? Because if you use r(mean) and then divide it by the error degrees of freedom, would you be calculating the mean twice?

          Lastly, when calculating the out-of-sample RMSE, how do you know whether the error degrees of freedom refer to the entire sample or just to the subset implied by foreign != 0?

          Code:
          sysuse auto, clear
          
          regress price mpg if foreign == 0
          
          predict xb, xb
          predict e, resid
          gen e2 = e*e
          
          corr xb price if foreign != 0
          display "Out of sample R^2 = " %05.3f =`r(rho)'^2
          
          summ e2 if foreign != 0, meanonly
          display "Out of sample RMSE = " =sqrt(r(sum)/e(df_r))
          // how do you know if the error degrees of freedom are across the entire sample or just the subset?

          Also, why did you not divide the sum of the squared errors by the degrees of freedom? If you were to do this, would it be more accurate to use r(sum) as opposed to r(mean)?
          Last edited by Donough Lawlor; 18 Dec 2018, 04:24.

          • #6
            Donough, if you ask the questions you asked in reverse order, you will find that they answer themselves.

            How do we know how many degrees of freedom the out-of-sample error has? Certainly e(df_r) cannot be the correct degrees of freedom, because that is the in-sample number of observations minus the number of in-sample estimated parameters. But here we have a different number of observations, the out-of-sample number of observations, and we are not losing any degrees of freedom to estimation, because the estimation was done on the other, in-sample, fraction of the data.

            In short, Clyde divides by the number of out-of-sample observations, because the concept of "degrees of freedom" is not well defined for out-of-sample calculations.
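
            To make this concrete, here is Clyde's auto example from #2, trimmed to the RMSE part, with e(df_r) and the out-of-sample count displayed for comparison:

            Code:
            sysuse auto, clear
            
            regress price mpg if foreign == 0      // estimation on the in-sample (domestic) cars only
            predict e, resid
            gen e2 = e*e
            
            display "e(df_r) = " e(df_r)           // in-sample N minus in-sample estimated parameters
            
            count if foreign != 0                  // number of out-of-sample observations
            display "N out of sample = " r(N)
            
            summ e2 if foreign != 0, meanonly      // r(mean) = r(sum)/N_out, so there is no double averaging
            display "Out of sample RMSE = " sqrt(r(mean))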

            • #7
              Thanks for the clarification, Joro.

              Apologies for framing the questions in reverse order and thereby giving the answers as well. I did it that way in the hope of confirming whether my approach was correct and, if it wasn't, that someone could point out the right way forward, perhaps while outlining where I went wrong, to give me a deeper understanding.

              Your answer was great and clarified everything. I had seen a colleague divide by e(df_r) to get the out-of-sample RMSE. However, as you rightly pointed out, that quantity is based on the in-sample observations, which would lead to an incorrectly calculated out-of-sample RMSE.

              Thanks for clarifying that Clyde divides by the number of out-of-sample observations. That is predominantly what I was looking for clarification on.
