  • Stata and Mata give different results for OLS

    I am confused about why Stata's OLS (regress) and the Mata code would give me different results for the same data; it seems one of them must be incorrect.
    See the following dataset and the output from Stata's regress and from the Mata code provided in the Stata Blog post by David Drukker: https://blog.stata.com/2016/01/12/pr...nd-using-mata/.
    Code:
    clear
    input double(y x1 x2 x3 x4 x5 x6)
    -.053244516 .74 -1.56 .95 -.18 -1.46 -.109749
      .06378974 .74 -1.56 .95 -.18 -1.46 -.109749
     .021148825 .74 -1.56 .95 -.18 -1.46 -.109749
    -.055137638 .74 -1.56 .95 -.18 -1.46 -.109749
     -.04604394 .74 -1.56 .95 -.18 -1.46 -.109749
     .005379249 .74 -1.56 .95 -.18 -1.46 -.109749
    -.003599284 .74 -1.56 .95 -.18 -1.46 -.109749
    -.041651525 .74 -1.56 .95 -.18 -1.46 -.109749
    -.029081209 .74 -1.56 .95 -.18 -1.46 -.109749
      .11075136 .74 -1.56 .95 -.18 -1.46 -.109749
    -.004830927 .74 -1.56 .95 -.18 -1.46 -.109749
    -.036290001 .74 -1.56 .95 -.18 -1.46 -.109749
     .052939385 .74 -1.56 .95 -.18 -1.46 -.109749
     .010752792 .74 -1.56 .95 -.18 -1.46 -.109749
    -.017558312 .74 -1.56 .95 -.18 -1.46 -.109749
    -.092018902 .74 -1.56 .95 -.18 -1.46 -.109749
    -.007434979 .74 -1.56 .95 -.18 -1.46 -.109749
    -.005221944 .74 -1.56 .95 -.18 -1.46 -.109749
    -.005797118 .74 -1.56 .95 -.18 -1.46 -.109749
     .013793322 .74 -1.56 .95 -.18 -1.46 -.109749
     .063281447 .74 -1.56 .95 -.18 -1.46 -.109749
      .00873368 .74 -1.56 .95 -.18 -1.46 -.109749
    -.026145279 .74 -1.56 .95 -.18 -1.46 -.109749
    -.001409444 .74 -1.56 .95 -.18 -1.46 -.109749
     .017094433 .74 -1.56 .95 -.18 -1.46 -.109749
    -.021337137 .74 -1.56 .95 -.18 -1.46 -.109749
    -.081678033 .74 -1.56 .95 -.18 -1.46 -.109749
              0 .74 -1.56 .95 -.18 -1.46 -.109749
    -.023657016 .74 -1.56 .95 -.18 -1.46 -.109749
    -.025943395 .74 -1.56 .95 -.18 -1.46 -.109749
    -.021053409 .74 -1.56 .95 -.18 -1.46 -.109749
    -.018648559 .74 -1.56 .95 -.18 -1.46 -.109749
     .033170536 .74 -1.56 .95 -.18 -1.46 -.109749
    -.058127742 .74 -1.56 .95 -.18 -1.46 -.109749
     .010582109 .74 -1.56 .95 -.18 -1.46 -.109749
     .020438667 .74 -1.56 .95 -.18 -1.46 -.109749
     .006299234 .74 -1.56 .95 -.18 -1.46 -.109749
    -.007017573 .74 -1.56 .95 -.18 -1.46 -.109749
     .027920596 .74 -1.56 .95 -.18 -1.46 -.109749
     .038655818 .74 -1.56 .95 -.18 -1.46 -.109749
    -.034916267 .74 -1.56 .95 -.18 -1.46 -.109749
     .037139546 .74 -1.56 .95 -.18 -1.46 -.109749
     .064247444 .74 -1.56 .95 -.18 -1.46 -.109749
     .009569451 .74 -1.56 .95 -.18 -1.46 -.109749
     .032088313 .74 -1.56 .95 -.18 -1.46 -.109749
      .00925543 .74 -1.56 .95 -.18 -1.46 -.109749
              0 .74 -1.56 .95 -.18 -1.46 -.109749
     .014764607 .74 -1.56 .95 -.18 -1.46 -.109749
    -.050643735 .74 -1.56 .95 -.18 -1.46 -.109749
    -.050325084 .74 -1.56 .95 -.18 -1.46 -.109749
    -.021739986 .74 -1.56 .95 -.18 -1.46 -.109749
     -.04284016 .74 -1.56 .95 -.18 -1.46 -.109749
    -.033662669 .74 -1.56 .95 -.18 -1.46 -.109749
    
    end
    OLS results

    Code:
    reg y x1-x6
          Source |       SS           df       MS      Number of obs   =        53
    -------------+----------------------------------   F(0, 52)        =      0.00
           Model |           0         0           .   Prob > F        =         .
        Residual |  .077875744        52   .00149761   R-squared       =    0.0000
    -------------+----------------------------------   Adj R-squared   =    0.0000
           Total |  .077875744        52   .00149761   Root MSE        =     .0387
    
    ------------------------------------------------------------------------------
               y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
              x1 |          0  (omitted)
              x2 |          0  (omitted)
              x3 |          0  (omitted)
              x4 |          0  (omitted)
              x5 |          0  (omitted)
              x6 |          0  (omitted)
           _cons |  -.0046275   .0053157    -0.87   0.388    -.0152943    .0060392
    ------------------------------------------------------------------------------
    Mata code

    Code:
    . myregress11 y x1-x6
    ------------------------------------------------------------------------------
                 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
              x1 |          0  (omitted)
              x2 |   .0029664   .0034075     0.87   0.388    -.0038713     .009804
              x3 |          0  (omitted)
              x4 |          0  (omitted)
              x5 |          0  (omitted)
              x6 |          0  (omitted)
           _cons |          0  (omitted)
    ------------------------------------------------------------------------------

  • #2
    Your regressors are perfectly collinear, hence the X'X matrix is singular and the computed parameters are not individually meaningful. Since the regressors are all multiples of the constant regressor, all are omitted except one, and which one is kept is essentially arbitrary. Note, by the way, that .0029664*(-1.56) = -.00462758.
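
    As a quick check of both points, a minimal sketch in Stata:
    Code:
    display .0029664*(-1.56)    // reproduces the constant reported by regress
    _rmcoll x1-x6               // flags the regressors that are collinear with the constant
    display "`r(varlist)'"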

    There is nothing incorrect about the code; the problem is your dataset.

    I can't give much more detail as to why regress does not keep the same regressor, since the code is in the built-in (hence hidden) command _regress. In Mata, invsym returns a generalized inverse, and here the second regressor is kept (apparently the constant regressor with the largest magnitude is kept, which is probably related to implementation details of invsym). The _regress command probably does some more work.
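
    To see the generalized-inverse behaviour in isolation, here is a tiny Mata sketch on a made-up singular matrix (not your data):
    Code:
    mata
    A = (1, 2 \ 2, 4)        // singular: the second column is twice the first
    invsym(A)                // generalized inverse; the dropped column's row and column are zero
    diag0cnt(invsym(A))      // counts the zeroed diagonal entries, i.e. the omitted columns
    end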

    The manual (section "Methods and formulas" of regress) seems to explain what happens: "Let X denote the matrix of observations on the right-hand-side variables, y the vector of observations on the left-hand-side variable. Define A as X'X and a as X'y. The coefficient vector b is defined as A^-1 a. Although not shown in the notation, unless hascons is specified, A and a are accumulated in deviation form and the constant is calculated separately. This comment applies to all statistics listed below.".

    I am not sure what the sentence about accumulating in deviation form really means in practice, but if you run "regress y x*, hascons", you will get the same result as Drukker's Mata code (except there is no _cons at all in the output, but it would be omitted anyway).
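
    To check the equivalence yourself, a minimal sketch (output not reproduced here):
    Code:
    regress y x1-x6, hascons
    myregress11 y x1-x6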

    Jean-Claude Arbaut
    Last edited by Jean-Claude Arbaut; 03 May 2019, 01:46.



    • #3
      Just to elaborate on

      Originally posted by Jean-Claude Arbaut View Post
      Note by the way that .0029664*(-1.56) = -.00462758.
      Jean-Claude has computed the predicted values, y-hat, here. These are the same in both cases. Also, since all regressors are constants, i.e., have no variance, the predicted values are just the mean of y.

      Code:
      . mean y
      
      Mean estimation                     Number of obs    =      53
      
      --------------------------------------------------------------
                   |       Mean   Std. Err.     [95% Conf. Interval]
      -------------+------------------------------------------------
                 y |  -.0046275   .0053157     -.0152943    .0060392
      --------------------------------------------------------------
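
      One way to confirm that the fitted values from regress are all equal to this mean, as a minimal sketch:
      Code:
      quietly regress y x1-x6
      predict double yhat, xb
      summarize yhat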
      Best
      Daniel



      • #4
        Melissa Martin --

        It is, however, the case that one can get different results from OLS in Mata and Stata, depending on floating point arithmetic. I know this because a long time ago I asked a question about this and got a very enlightening response from Bill Gould (StataCorp) about it. It is in the archives here.

        Matthew J. Baker



        • #5
          Thanks for your answers. My main concern is not to prove the accuracy of either code; I am rather interested in making the Mata code behave like the regress command. I would therefore appreciate help in modifying Drukker's code so that regress and his code behave exactly the same. The setting that I am using is really prone to generating incorrect results due to the mentioned problem with Drukker's Mata code.



          • #6
            Originally posted by Meliesa Martin View Post
            The setting that I am using is really prone to generating incorrect results due to the mentioned problem with Drukker's Mata code.
            No, it is not the case that the results from myregressall are in any way incorrect. They simply differ from the results from regress.

            Originally posted by Jean-Claude Arbaut View Post
            There is nothing incorrect with the code, your dataset is.
            The collinearity in your dataset means that there is not a single "correct" set of parameter estimates for your model given your data. The results from regress and the results from myregressall are both correct and in fact produce identical fitted values, which shows that they produce the same minimum of the sum of squared errors. If the myregressall output were to include the same model summary that appears above the coefficient estimates in the regress output you would find them to be identical, with a zero value for r-squared telling you that your independent variables add nothing to simply taking the average of the dependent variable.
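
            As a minimal sketch of that equivalence, using the coefficients from the two outputs above (the variable names here are just for illustration):
            Code:
            generate double fit_regress = -.0046275      // constant-only fit from regress
            generate double fit_mata    = .0029664*x2    // x2-only fit from the Mata code
            summarize fit_regress fit_mata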



            • #7
              Originally posted by Meliesa Martin View Post
              I am rather interested in making the Mata code behave like the regress command.
              Since regress calls _regress, and since the latter is built-in, i.e., compiled C code, there is no way to really be sure that any Mata code will behave exactly like regress; sorry. If you want regress, why do you not use regress?

              Bill Gould has a Mata class implementation of linear regression in his excellent Mata Book; you can download the code here. For your example, it gives the exact same results as regress.

              Best
              Daniel



              • #8
                daniel klein It might sound as if I am ungrateful for what you have contributed; believe me, I am really thankful. But I need more help here. Can you please reproduce the code from Bill Gould's book here to show how his Mata code produces the same results on the data I have posted?



                • #9
                  Here is what is going on: to estimate the regression coefficients when there is a constant term, Stata first centers the variables x1-xn by subtracting their means. This does not change the slope coefficients, but the constant must then be recovered separately. The tricky part is getting the covariance matrix right.
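
                  Concretely, writing xbar and ybar for the means, the slopes come from the centered cross products and the constant is recovered as b_cons = ybar - xbar*b_slopes; that is exactly what the line b[K] = my-mx*b[1::k] in the function below does.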

                  In the code from Bill Gould's book, everything is in linreg2.mata, in a class that handles all details. Here I reproduce just the formulas for the model with a constant regressor.

                  The function below returns what is computed by mywork in myregress11.ado: coefficients (b), covariance (V), and degrees of freedom (df).

                  Code:
                  mata
                  // returns, through its arguments, the coefficients (b), the covariance
                  // matrix (V), and the residual degrees of freedom (df) for a regression
                  // of y on X plus a constant, computed in deviation (mean-centered) form
                  function reg(X,y,b,V,df) {
                      N = rows(X)
                      k = cols(X)
                      K = k+1                            // number of coefficients, including the constant

                      mx = mean(X)
                      my = mean(y)

                      SS = quadcrossdev(X,mx,X,mx)       // centered cross products X'X
                      St = quadcrossdev(X,mx,y,my)\0     // centered X'y, with a 0 appended for the constant

                      if (K==1) {                        // no regressors: constant-only model
                          SSinv = 1/N
                          b = my
                          yhat = J(N,1,my)
                      } else {
                          m = invsym(SS)                 // generalized inverse; collinear columns are zeroed
                          v = -mx*m
                          SSinv = J(K,K,.)               // assemble the full inverse, constant block included
                          SSinv[|1,1\k,k|] = m
                          SSinv[|K,1\K,k|] = v
                          SSinv[|1,K\k,K|] = v'
                          SSinv[K,K] = 1/N-mx*v'
                          b = SSinv*St
                          b[K] = my-mx*b[1::k]           // recover the constant from the means
                          yhat = X*b[1::k]:+b[K]
                      }
                      e = y-yhat
                      df = N-K+diag0cnt(SSinv)           // add back one df per omitted regressor
                      s2 = quadcross(e,e)/df
                      V = s2*SSinv
                  }
                  end
                  Here is how to apply this function to your dataset:

                  Code:
                  mata
                  y = st_data(.,"y")
                  X = st_data(.,"x*")
                  reg(X,y,b=.,V=.,df=.)
                  b
                  V
                  df
                  end
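
                  To turn the returned covariance matrix into standard errors, a short Mata follow-up (a sketch) could be:
                  Code:
                  mata
                  se = sqrt(diagonal(V))     // standard errors; zero where a regressor was omitted
                  (b, se)                    // coefficients alongside their standard errors
                  end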
                  The integration of this function with myregress11.ado is left as an exercise. Since you didn't explain what you really want and why regress or linreg2.mata won't do, I don't want to guess what options would be useful to you (if you want "nocons", you will have to replicate all of linreg2.mata, which does not make any sense: just use it). Besides, with the help of David Drukker's excellent blog entries on writing estimation commands, you should be able to do it yourself.

                  Hope this helps

                  Jean-Claude Arbaut
                  Last edited by Jean-Claude Arbaut; 03 May 2019, 15:44.

