
  • RMSE: why does Stata adjust for degrees of freedom?

    I need to calculate RMSE a number of times, where I have saved data and various predictions. In checking the stats, I noticed my own calculations and Stata's results (via cnsreg) were off. It seems Stata adjusts for the degrees of freedom in all RMSE calculations. I find this odd, as the textbooks I've looked at do not adjust for degrees of freedom. For example, Wooldridge writes "This is essentially the sample standard deviation of the forecast errors (without any degrees of freedom adjustment)", and Greene also implies no adjustment. Is this a well known issue? Is it normally such a small effect that people don't mind? Are the textbooks out of date? I can find Stata forum posts confidently asserting Stata's approach is correct/normal, but nothing acknowledging the discrepancy between econometrics textbooks and Stata's implementation.

    My own code was running cnsreg with a constraint of 1 on the only RHS variable (the model prediction), and without a constant, as a quick way of calculating the RMSE. That approach means cnsreg actually adds one to the sample size, as there is one constraint.
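
    A minimal sketch of that approach, assuming the outcome variable is y and the saved model prediction is yhat (hypothetical names):

    Code:
    constraint 1 yhat = 1                        // fix the coefficient on the prediction at 1
    cnsreg y yhat, constraints(1) noconstant     // no constant, so the residuals are y - yhat
    display e(rmse)                              // Stata's reported Root MSE for this fit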

    Is there a way, other than coding the whole calculation, to opt for an unadjusted RMSE? And is this adjustment standard outside of my favourite textbooks?

    Many thanks.

  • #2
    I don’t know off the top of my head, but have you tried running the same problem in two different programs, e.g. Stata and SPSS? Or taken a textbook problem that includes the data needed to replicate the example and seen what Stata says? I’d want to make sure there is a discrepancy first, and that you aren’t misunderstanding things.
    -------------------------------------------
    Richard Williams, Notre Dame Dept of Sociology
    StataNow Version: 19.5 MP (2 processor)

    EMAIL: [email protected]
    WWW: https://academicweb.nd.edu/~rwilliam/



    • #3
      Originally posted by Paul Clist View Post
      It seems Stata adjusts for the degrees of freedom in all RMSE calculations.
      This is not a Stata thing! There are different ways to estimate the error variance, but let's first consider the linear regression model before presenting these approaches. We have:

      \[
      y = X\beta + u,
      \]
      with \(n\) observations and \(k\) parameters (including the intercept). We can define the vector of OLS residuals as:

      \[
      e = y - X\hat{\beta},
      \]

      which is an \(n \times 1\) vector. The quadratic form:

      \[
      e'e = \sum_{i=1}^n e_i^2
      \]

      is the sum of squared residuals (SSR). Now, because the population error variance \(\sigma^2 = \mathbb{E}[u_i^2]\) is unknown, we need to estimate it. There are two common approaches:

      (1) Maximum likelihood estimate (MLE):

      \[
      \hat{\sigma}^2_{\text{MLE}} = \frac{e'e}{n},
      \]
      which divides by \(n\), but is biased downward.

      (2) Unbiased estimate:

      \[
      s^2 = \frac{e'e}{n-k}.
      \]
      This divides by \(n-k\) (degrees of freedom), correcting the bias. Under the classical OLS assumptions,

      \[
      \frac{e'e}{\sigma^2} \sim \chi^2_{n-k},
      \]

      so that

      \[
      \mathbb{E}\!\left[\frac{e'e}{\sigma^2}\right] = n-k.
      \]

      Thus dividing \(e'e\) by \(n-k\) gives an unbiased estimator of \(\sigma^2\). The Root Mean Squared Error (RMSE) is then defined as

      \[
      \text{RMSE} = \sqrt{\frac{e'e}{n-k}} = \sqrt{s^2}.
      \]

      For large \(n\), the difference between dividing by \(n\) and by \(n-k\) is negligible.
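
      For concreteness, here is a quick sketch (not part of the original reply) that computes both estimates from the stored results of regress, where e(rss) is the sum of squared residuals, e(N) is \(n\), and e(df_r) is \(n-k\):

      Code:
      sysuse auto, clear
      quietly regress mpg weight displacement
      display "MLE (divide by n):        " sqrt(e(rss) / e(N))
      display "unbiased (divide by n-k): " sqrt(e(rss) / e(df_r)) "  (this is e(rmse))"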

      Originally posted by Paul Clist View Post
      Is there a way, other than coding the whole calculation, to opt for an unadjusted RMSE?
      Given that the RMSE is an estimate of the standard deviation of the regression residuals, just predict the regression residuals and compute their standard deviation.


      Code:
      sysuse auto, clear
      regress mpg weight displacement
      predict res, res
      sum res
      di `r(sd)'
      Result:

      Code:
      . regress mpg weight displacement
      
            Source |       SS           df       MS      Number of obs   =        74
      -------------+----------------------------------   F(2, 71)        =     66.79
             Model |  1595.40969         2  797.704846   Prob > F        =    0.0000
          Residual |  848.049768        71  11.9443629   R-squared       =    0.6529
      -------------+----------------------------------   Adj R-squared   =    0.6432
             Total |  2443.45946        73  33.4720474   Root MSE        =    3.4561
      
      ------------------------------------------------------------------------------
               mpg | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
      -------------+----------------------------------------------------------------
            weight |  -.0065671   .0011662    -5.63   0.000    -.0088925   -.0042417
      displacement |   .0052808   .0098696     0.54   0.594    -.0143986    .0249602
             _cons |   40.08452    2.02011    19.84   0.000     36.05654    44.11251
      ------------------------------------------------------------------------------
      
      .
      . predict res, res
      
      .
      . sum res
      
          Variable |        Obs        Mean    Std. dev.       Min        Max
      -------------+---------------------------------------------------------
               res |         74    4.51e-09     3.40839  -6.965423    13.8371
      
      .
      . di `r(sd)'
      3.4083896



      • #4
        Thank you both. For future readers...

        1. Yes, it is possible I've misunderstood.

        2. I'm interested in the RMSE because of a paper called 'Disguising Lies—Image Concerns and Partial Lying in Cheating Games' by Kiryl Khalmetski and Dirk Sliwka, published in American Economic Journal: Microeconomics in 2019. They calculate the RMSE manually in Mathematica without adjusting for the degrees of freedom (code below). There, where k=5 (not shown), they merely average the squared error in predictions over six points. The model has 2 parameters, but they don't adjust for these.

        [Attachment: ks code.png (screenshot of the authors' Mathematica calculation)]


        To me, this appears to be in line with Greene and Wooldridge, but not in line with Stata or Andrew.

        3. I worked out a simple workaround to give the unadjusted RMSE:
        Code:
        local rmse =  e(rmse) * sqrt(e(df_r) / e(N))
        This rescales Stata's Root MSE using the residual degrees of freedom Stata thinks you have (e(df_r)) and the sample size (e(N)), which undoes the adjustment.
        (Andrew provides another approach above, but that doesn't seem to lead to the same result as I wish to get; see the code below to compare.)

        Code:
        sysuse auto, clear
        regress mpg weight displacement
        predict res, res                     // residuals
        sum res
        di `r(sd)'                           // Andrew's suggestion: SD of the residuals
        
        gen se = res^2
        sum se
        di sqrt(`r(mean)')                   // unadjusted RMSE: root of the average squared residual
        di e(rmse) * sqrt(e(df_r) / e(N))    // Root MSE rescaled by the workaround above
        4. I imagine this is yet another example of the same statistical term being used to mean different things in stats/econometrics, and people (or me at least) being unaware of the variety of definitions.

        Many thanks for the replies, and hope this helps someone.






        • #5
          Richard - I forgot to post that the unadjusted RMSE appears to be used in at least some R packages, e.g. metrics just uses y and yhat: https://www.r-bloggers.com/2021/07/h...ror-rmse-in-r/
          But you can't tempt me to venture into SPSS to find out their approach!



          • #6
            A detail that reconciles any apparent contradiction here is that summarize also uses an unbiased estimator of the variance: its divisor is the sample size MINUS 1.

            This is documented, but here is a simple demonstration. Consider the values 1 2 3 4 5: the mean is 3, the deviations from the mean are -2 -1 0 1 2, their squares are 4 1 0 1 4, and so the sum of squared deviations is 10. If you use maximum likelihood you divide by 5 and get 2 for the variance; if you use the unbiased estimator you divide by 4 and get 2.5. To estimate the SD you take the square root in either case.

            Here is the whole kit and caboodle, especially because I can't do most square roots in my head.

            Code:
            . clear 
            
            . set obs 5 
            number of observations (_N) was 0, now 5
            
            . gen x = _n
            
            . list 
            
                 +---+
                 | x |
                 |---|
              1. | 1 |
              2. | 2 |
              3. | 3 |
              4. | 4 |
              5. | 5 |
                 +---+
            
            . su x, d 
            
                                          x
            -------------------------------------------------------------
                  Percentiles      Smallest
             1%            1              1
             5%            1              2
            10%            1              3       Obs                   5
            25%            2              4       Sum of Wgt.           5
            
            50%            3                      Mean                  3
                                    Largest       Std. Dev.      1.581139
            75%            4              2
            90%            5              3       Variance            2.5
            95%            5              4       Skewness              0
            99%            5              5       Kurtosis            1.7
            
            . 
            . mata
            ------------------------------------------------- mata (type end to exit) --------------------
            : x = (1::5)
            
            : sum((x :- 3):^2) / 5
              2
            
            : sum((x :- 3):^2) / 4
              2.5
            
            : sqrt(2)
              1.414213562
            
            : sqrt(2.5)
              1.58113883
            
            : end
            Tactical tip: To try out simple examples, create simple variables in Stata -- or simple vectors in Mata.

            Technical detail: The square root of the variance, even calculated this way, is not an unbiased estimator of the SD, although sloppy texts will claim or imply that to be true, yet the bias is usually slight for reasonable sample sizes (and that circularly defines "reasonable" and "slight").
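
            A quick hedged check of that point, assuming normally distributed data: for a normal sample of size n, E(s) = sigma * sqrt(2/(n-1)) * Gamma(n/2) / Gamma((n-1)/2), which is a little below sigma. For the n = 5 example above:

            Code:
            * expected value of the sample SD relative to the true SD, normal data, n = 5
            display sqrt(2/(5-1)) * exp(lngamma(5/2) - lngamma((5-1)/2))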



            • #7
              Code:
              sysuse auto, clear
              regress mpg weight displacement
              predict res, res
              predict yhat, xb
              sum res
              di `r(sd)'
              
              gen se = res^2
              sum se
              di sqrt(`r(mean)')
              di e(rmse) * sqrt(e(df_r) / e(N))
              rmse mpg yhat , raw
              
              capture drop rmse
              g rmse = (mpg - yhat)^2 
              summ rmse
              di sqrt(r(sum) / 74)



              • #8
                Originally posted by Nick Cox View Post
                A detail that reconciles any apparent contradiction here is that summarize also uses an unbiased estimator for variance, namely it uses in the divisor the sample size MINUS 1
                Good catch, Nick! That should have been apparent to me, since it follows the same convention as the calculation of the RMSE.
