  • Stata and Mata give different results for OLS

    I am confused about why Stata's OLS (regress) and the Mata code would give me different results for the same data; it seems one of them must be incorrect.
    See the following dataset and the output from Stata's regress and from the Mata code provided in the Stata Blog post by David Drukker: https://blog.stata.com/2016/01/12/pr...nd-using-mata/.
    Code:
    clear
    input double(y x1 x2 x3 x4 x5 x6)
    -.053244516 .74 -1.56 .95 -.18 -1.46 -.109749
      .06378974 .74 -1.56 .95 -.18 -1.46 -.109749
     .021148825 .74 -1.56 .95 -.18 -1.46 -.109749
    -.055137638 .74 -1.56 .95 -.18 -1.46 -.109749
     -.04604394 .74 -1.56 .95 -.18 -1.46 -.109749
     .005379249 .74 -1.56 .95 -.18 -1.46 -.109749
    -.003599284 .74 -1.56 .95 -.18 -1.46 -.109749
    -.041651525 .74 -1.56 .95 -.18 -1.46 -.109749
    -.029081209 .74 -1.56 .95 -.18 -1.46 -.109749
      .11075136 .74 -1.56 .95 -.18 -1.46 -.109749
    -.004830927 .74 -1.56 .95 -.18 -1.46 -.109749
    -.036290001 .74 -1.56 .95 -.18 -1.46 -.109749
     .052939385 .74 -1.56 .95 -.18 -1.46 -.109749
     .010752792 .74 -1.56 .95 -.18 -1.46 -.109749
    -.017558312 .74 -1.56 .95 -.18 -1.46 -.109749
    -.092018902 .74 -1.56 .95 -.18 -1.46 -.109749
    -.007434979 .74 -1.56 .95 -.18 -1.46 -.109749
    -.005221944 .74 -1.56 .95 -.18 -1.46 -.109749
    -.005797118 .74 -1.56 .95 -.18 -1.46 -.109749
     .013793322 .74 -1.56 .95 -.18 -1.46 -.109749
     .063281447 .74 -1.56 .95 -.18 -1.46 -.109749
      .00873368 .74 -1.56 .95 -.18 -1.46 -.109749
    -.026145279 .74 -1.56 .95 -.18 -1.46 -.109749
    -.001409444 .74 -1.56 .95 -.18 -1.46 -.109749
     .017094433 .74 -1.56 .95 -.18 -1.46 -.109749
    -.021337137 .74 -1.56 .95 -.18 -1.46 -.109749
    -.081678033 .74 -1.56 .95 -.18 -1.46 -.109749
              0 .74 -1.56 .95 -.18 -1.46 -.109749
    -.023657016 .74 -1.56 .95 -.18 -1.46 -.109749
    -.025943395 .74 -1.56 .95 -.18 -1.46 -.109749
    -.021053409 .74 -1.56 .95 -.18 -1.46 -.109749
    -.018648559 .74 -1.56 .95 -.18 -1.46 -.109749
     .033170536 .74 -1.56 .95 -.18 -1.46 -.109749
    -.058127742 .74 -1.56 .95 -.18 -1.46 -.109749
     .010582109 .74 -1.56 .95 -.18 -1.46 -.109749
     .020438667 .74 -1.56 .95 -.18 -1.46 -.109749
     .006299234 .74 -1.56 .95 -.18 -1.46 -.109749
    -.007017573 .74 -1.56 .95 -.18 -1.46 -.109749
     .027920596 .74 -1.56 .95 -.18 -1.46 -.109749
     .038655818 .74 -1.56 .95 -.18 -1.46 -.109749
    -.034916267 .74 -1.56 .95 -.18 -1.46 -.109749
     .037139546 .74 -1.56 .95 -.18 -1.46 -.109749
     .064247444 .74 -1.56 .95 -.18 -1.46 -.109749
     .009569451 .74 -1.56 .95 -.18 -1.46 -.109749
     .032088313 .74 -1.56 .95 -.18 -1.46 -.109749
      .00925543 .74 -1.56 .95 -.18 -1.46 -.109749
              0 .74 -1.56 .95 -.18 -1.46 -.109749
     .014764607 .74 -1.56 .95 -.18 -1.46 -.109749
    -.050643735 .74 -1.56 .95 -.18 -1.46 -.109749
    -.050325084 .74 -1.56 .95 -.18 -1.46 -.109749
    -.021739986 .74 -1.56 .95 -.18 -1.46 -.109749
     -.04284016 .74 -1.56 .95 -.18 -1.46 -.109749
    -.033662669 .74 -1.56 .95 -.18 -1.46 -.109749
    
    end
    OLS results

    Code:
    reg y x1-x6
          Source |       SS           df       MS      Number of obs   =        53
    -------------+----------------------------------   F(0, 52)        =      0.00
           Model |           0         0           .   Prob > F        =         .
        Residual |  .077875744        52   .00149761   R-squared       =    0.0000
    -------------+----------------------------------   Adj R-squared   =    0.0000
           Total |  .077875744        52   .00149761   Root MSE        =     .0387
    
    ------------------------------------------------------------------------------
               y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
              x1 |          0  (omitted)
              x2 |          0  (omitted)
              x3 |          0  (omitted)
              x4 |          0  (omitted)
              x5 |          0  (omitted)
              x6 |          0  (omitted)
           _cons |  -.0046275   .0053157    -0.87   0.388    -.0152943    .0060392
    ------------------------------------------------------------------------------
    Mata code

    Code:
    . myregress11 y x1-x6
    ------------------------------------------------------------------------------
                 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
              x1 |          0  (omitted)
              x2 |   .0029664   .0034075     0.87   0.388    -.0038713     .009804
              x3 |          0  (omitted)
              x4 |          0  (omitted)
              x5 |          0  (omitted)
              x6 |          0  (omitted)
           _cons |          0  (omitted)
    ------------------------------------------------------------------------------

  • #2
    Your regressors are perfectly collinear, hence the X'X matrix is singular and the computed parameters are not individually meaningful. Since the regressors are all multiples of the constant regressor, all are omitted except one, and which one is kept is essentially arbitrary. Note, by the way, that .0029664*(-1.56) = -.00462758.
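
    As a quick check of both points, a minimal sketch in Stata:
    Code:
    display .0029664*(-1.56)    // reproduces the constant reported by regress
    _rmcoll x1-x6               // flags the regressors that are collinear with the constant
    display "`r(varlist)'"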

    There is nothing incorrect about the code; the problem is your dataset.

    I can't give much more detail as to why regress does not keep the same regressor, since the code is in the built-in (hence hidden) command _regress. In Mata, invsym returns a generalized inverse, and here the second regressor is kept (apparently the constant regressor with the largest magnitude is kept, which is probably related to implementation details of invsym). The _regress command probably does some more work.
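
    To see the generalized-inverse behaviour in isolation, here is a tiny Mata sketch on a made-up singular matrix (not your data):
    Code:
    mata
    A = (1, 2 \ 2, 4)        // singular: the second column is twice the first
    invsym(A)                // generalized inverse; the dropped column's row and column are zero
    diag0cnt(invsym(A))      // counts the zeroed diagonal entries, i.e. the omitted columns
    end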

    The manual (section "Methods and formulas" of regress) seems to explain what happens: "Let X denote the matrix of observations on the right-hand-side variables, y the vector of observations on the left-hand-side variable. Define A as X'X and a as X'y. The coefficient vector b is defined as A^-1 a. Although not shown in the notation, unless hascons is specified, A and a are accumulated in deviation form and the constant is calculated separately. This comment applies to all statistics listed below.".

    I am not sure what the sentence about accumulating in deviation form really means in practice, but if you run "regress y x*, hascons", you will get the same result as Drukker's Mata code (except there is no _cons at all in the output, but it would be omitted anyway).
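
    To check the equivalence yourself, a minimal sketch (output not reproduced here):
    Code:
    regress y x1-x6, hascons
    myregress11 y x1-x6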

    Jean-Claude Arbaut
    Last edited by Jean-Claude Arbaut; 03 May 2019, 01:46.



    • #3
      Just to elaborate on

      Originally posted by Jean-Claude Arbaut View Post
      Note by the way that .0029664*(-1.56) = -.00462758.
      Jean-Claude has computed the predicted values, y-hat, here. These are the same in both cases. Also, since all regressors are constants, i.e., have no variance, the predicted values are just the mean of y.

      Code:
      . mean y
      
      Mean estimation                     Number of obs    =      53
      
      --------------------------------------------------------------
                   |       Mean   Std. Err.     [95% Conf. Interval]
      -------------+------------------------------------------------
                 y |  -.0046275   .0053157     -.0152943    .0060392
      --------------------------------------------------------------
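
      One way to confirm that the fitted values from regress are all equal to this mean, as a minimal sketch:
      Code:
      quietly regress y x1-x6
      predict double yhat, xb
      summarize yhat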
      Best
      Daniel



      • #4
        Melissa Martin --

        It is, however, the case that one can get different results from OLS in Mata and Stata, depending on floating point arithmetic. I know this because a long time ago I asked a question about this and got a very enlightening response from Bill Gould (StataCorp) about it. It is in the archives here.

        Matthew J. Baker



        • #5
          Thanks for your answers. My main concern is not to prove the accuracy of either code; I am rather interested in making the Mata code behave like the regress command. I would therefore appreciate help in modifying Drukker's code so that regress and his code behave exactly the same. The setting that I am using is really prone to generating incorrect results due to the mentioned problem with Drukker's Mata code.



          • #6
            Originally posted by Meliesa Martin View Post
            The setting that I am using is really prone to generating incorrect results due to the mentioned problem with Drukker's Mata code.
            No, it is not the case that the results from myregressall are in any way incorrect. They simply differ from the results from regress.

            Originally posted by Jean-Claude Arbaut View Post
            There is nothing incorrect with the code, your dataset is.
            The collinearity in your dataset means that there is not a single "correct" set of parameter estimates for your model given your data. The results from regress and the results from myregressall are both correct and in fact produce identical fitted values, which shows that they produce the same minimum of the sum of squared errors. If the myregressall output were to include the same model summary that appears above the coefficient estimates in the regress output you would find them to be identical, with a zero value for r-squared telling you that your independent variables add nothing to simply taking the average of the dependent variable.
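
            As a minimal sketch of that equivalence, using the coefficients from the two outputs above (the variable names here are just for illustration):
            Code:
            generate double fit_regress = -.0046275      // constant-only fit from regress
            generate double fit_mata    = .0029664*x2    // x2-only fit from the Mata code
            summarize fit_regress fit_mata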



            • #7
              Originally posted by Meliesa Martin View Post
              I am rather interested in making the Mata code behave like the regress command.
              Since regress calls _regress, and since the latter is built-in, i.e., compiled C code, there is no way to really be sure that any Mata code will behave exactly like regress; sorry. If you want regress, why do you not use regress?

              Bill Gould has a Mata class implementation of linear regression in his excellent Mata Book; you can download the code here. For your example, it gives the exact same results as regress.

              Best
              Daniel



              • #8
                daniel klein It might sound as if I am ungrateful for what you have contributed; believe me, I am really thankful. But I need more help here. Can you please reproduce the code from Bill Gould's book here to show how his Mata code produces the same results on the data I have posted?



                • #9
                  Here is what is going on: to estimate the regression coefficients when there is a constant term, Stata first centers the variables x1-xn by subtracting their means. This does not change the slope coefficients, but the constant must then be recovered separately. The tricky part is getting the covariance matrix right.
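
                  Concretely, writing xbar and ybar for the means, the slopes come from the centered cross products and the constant is recovered as b_cons = ybar - xbar*b_slopes; that is exactly what the line b[K] = my-mx*b[1::k] in the function below does.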

                  In the code from Bill Gould's book, everything is in linreg2.mata, in a class that handles all details. Here I reproduce just the formulas for the model with a constant regressor.

                  The function below returns what is computed by mywork in myregress11.ado: coefficients (b), covariance (V), and degrees of freedom (df).

                  Code:
                  mata
                  // returns, through its arguments, the coefficients (b), the covariance
                  // matrix (V), and the residual degrees of freedom (df) for a regression
                  // of y on X plus a constant, computed in deviation (mean-centered) form
                  function reg(X,y,b,V,df) {
                      N = rows(X)
                      k = cols(X)
                      K = k+1                            // number of coefficients, including the constant

                      mx = mean(X)
                      my = mean(y)

                      SS = quadcrossdev(X,mx,X,mx)       // centered cross products X'X
                      St = quadcrossdev(X,mx,y,my)\0     // centered X'y, with a 0 appended for the constant

                      if (K==1) {                        // no regressors: constant-only model
                          SSinv = 1/N
                          b = my
                          yhat = J(N,1,my)
                      } else {
                          m = invsym(SS)                 // generalized inverse; collinear columns are zeroed
                          v = -mx*m
                          SSinv = J(K,K,.)               // assemble the full inverse, constant block included
                          SSinv[|1,1\k,k|] = m
                          SSinv[|K,1\K,k|] = v
                          SSinv[|1,K\k,K|] = v'
                          SSinv[K,K] = 1/N-mx*v'
                          b = SSinv*St
                          b[K] = my-mx*b[1::k]           // recover the constant from the means
                          yhat = X*b[1::k]:+b[K]
                      }
                      e = y-yhat
                      df = N-K+diag0cnt(SSinv)           // add back one df per omitted regressor
                      s2 = quadcross(e,e)/df
                      V = s2*SSinv
                  }
                  end
                  Here is how to apply this function to your dataset:

                  Code:
                  mata
                  y = st_data(.,"y")
                  X = st_data(.,"x*")
                  reg(X,y,b=.,V=.,df=.)
                  b
                  V
                  df
                  end
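
                  To turn the returned covariance matrix into standard errors, a short Mata follow-up (a sketch) could be:
                  Code:
                  mata
                  se = sqrt(diagonal(V))     // standard errors; zero where a regressor was omitted
                  (b, se)                    // coefficients alongside their standard errors
                  end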
                  The integration of this function with myregress11.ado is left as an exercise. Since you didn't explain what you really want and why regress or linreg2.mata won't do, I don't want to guess what options would be useful to you (if you want "nocons", you will have to replicate all of linreg2.mata, which does not make any sense: just use it). Besides, with the help of David Drukker's excellent blog entries on writing estimation commands, you should be able to do it yourself.

                  Hope this helps

                  Jean-Claude Arbaut
                  Last edited by Jean-Claude Arbaut; 03 May 2019, 15:44.

