
  • under the -reg- hood


    Apologies if this is a basic question, but can someone explain what -reg- is doing? I had been assuming that the estimates that -reg- reports are simply the result of matrix multiplication, but that seems not to be the case.

    Under what conditions do the estimates reported by -reg- match matrix multiplication? I realize this is an odd hand-picked example, but I'm just trying to understand what -reg- does.

    Code:
    * make a data set
    clear
    set seed 999
    set obs 400
    gen y  = rnormal()
    gen x1 = ln(_n + 1000)
    gen x2 = x1^2
    gen x3 = x1^3
    noi corr *    /* yes, all RHS vars are very highly correlated, as we'd expect */
    * matrix multiplication method of getting betas
    gen con = 1
    mkmat con x1 x2 x3, matrix(x)
    mkmat y
    matrix B = inv(x'*x)*(x'*y)
    matrix list B    /* compare this ... */
    * ols
    reg y x1 x2 x3   /* ... with this */



  • #2
    gen x1 = ln(_n + 1000)
    gen x2 = x1^2
    gen x3 = x1^3
    OLS inverts a matrix, so highly collinear variables are problematic for a homemade implementation. regress handles the collinearity in some way that allows you to include quadratic, cubic, and higher-order terms. Some other estimators, e.g., glm, will drop a variable to break the collinearity.

    Under what conditions do the estimates reported by -reg- match matrix multiplication?
    If the design matrix has full column rank, that is, if its columns are linearly independent. The sketch below shows such a case.
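    A minimal sketch of a case where the two match (my assumption: replace the collinear polynomials from #1 with independent draws, so the full-rank condition holds comfortably):

    Code:
    * sketch: independent regressors, so x'x is well conditioned
    clear
    set seed 999
    set obs 400
    gen y  = rnormal()
    gen x1 = rnormal()
    gen x2 = rnormal()
    gen x3 = rnormal()
    gen con = 1
    mkmat con x1 x2 x3, matrix(x)
    mkmat y
    matrix B = inv(x'*x)*(x'*y)
    matrix list B    /* these betas ... */
    reg y x1 x2 x3   /* ... should match these to reporting precision */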
    Last edited by Andrew Musau; 21 Feb 2021, 15:50.



    • #3
      It is the very high correlations among the x variables that are causing the discrepancy. Consequently, the X'X matrix is almost singular. There are different ways to invert a matrix, and when the matrix is nearly singular, they can produce different results, due both to rounding/truncation and to the numerical instability of some inversion algorithms in this situation. Internal to the -reg- command, Stata sends these matrices to Mata and uses Mata's matrix inversion method, which is of higher precision than Stata's inv() function.

      So yes, under the hood, Stata is using matrix inversion, but not the same matrix inversion algorithm you used.
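      One way to see just how close to singular X'X is here is its condition number; a small Mata sketch using the data from #1 (cond() is Mata's condition-number function):

      Code:
      mata
      X   = st_data(., "x1 x2 x3")
      X   = X, J(rows(X), 1, 1)    // append the constant
      XpX = quadcross(X, X)
      cond(XpX)                    // enormous: X'X is numerically near-singular
      end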



      • #4
        Andrew and Clyde have pointed out that there are different algorithms to invert a matrix. There are also different ways to get the textbook X'X. Mata has a quadcross() function that does this in quad (i.e., double-double) precision. Also, regress (probably) applies these calculations to the de-meaned data. You can read more about that (and plenty of other interesting stuff) in Bill Gould's The Mata Book. The supplementary material has a class-based Mata implementation of linear regression (linreg2.mata) that is probably close to how regress works internally. I do not think that regress technically uses Mata. The underlying _regress is listed as built-in; the code is probably implemented in C.
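        A rough sketch of the de-meaning point (an assumed illustration, not regress's actual code path): compute the slopes from mean-deviated data, where the cross-product is much better conditioned, then back out the intercept.

        Code:
        mata
        y  = st_data(., "y")
        X  = st_data(., "x1 x2 x3")
        my = mean(y)                 // mean of y
        mX = mean(X)                 // row vector of column means
        Xd = X :- mX                 // de-meaned regressors
        b  = invsym(quadcross(Xd, Xd)) * quadcross(Xd, y :- my)
        a  = my - mX*b               // recover the intercept
        b \ a
        end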
        Last edited by daniel klein; 22 Feb 2021, 01:40.



        • #5
          Ah, that makes sense: -reg- uses Mata, which is of higher precision than Stata's inv() matrix inversion function.



          • #6
            I'm a little confused by this question and this discussion. Notwithstanding the "Methods and formulas" section of the Stata manual, I don't believe that Stata actually inverts the matrix; some version of the "sweep" algorithm is much more likely. If you want to read about regression algorithms, here are some citations (the third is the least theoretical and most practice-oriented):

            Kennedy, W. J., Jr., and J. E. Gentle (1980), Statistical Computing, Marcel Dekker.
            Thisted, R. A. (1988), Elements of Statistical Computing, Chapman and Hall.
            Press, W. H., et al. (2007), Numerical Recipes, 3rd ed., Cambridge University Press.
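            For the curious, here is a toy Mata sketch of the sweep operator (my own illustration, not a claim about what regress actually runs): sweeping the first p pivots of the augmented matrix [X'X, X'y \ y'X, y'y] leaves the coefficient vector in the top-right block and the residual sum of squares in the bottom-right cell. Note there are no pivot-tolerance checks, so on the collinear data from #1 it is as fragile as inv(); real implementations guard the pivots.

            Code:
            mata
            real matrix sweep(real matrix A, real scalar k)
            {
                real matrix B
                real scalar d, bik, i

                B = A
                d = B[k,k]
                B[k,.] = B[k,.] :/ d
                for (i=1; i<=rows(B); i++) {
                    if (i != k) {
                        bik = B[i,k]
                        B[i,.] = B[i,.] - bik :* B[k,.]
                        B[i,k] = -bik / d
                    }
                }
                B[k,k] = 1/d
                return(B)
            }

            y = st_data(., "y")
            X = st_data(., "x1 x2 x3")
            X = X, J(rows(X), 1, 1)
            M = quadcross((X, y), (X, y))      // [X'X, X'y \ y'X, y'y]
            for (k=1; k<=cols(X); k++) M = sweep(M, k)
            M[1..cols(X), cols(M)]             // OLS coefficients
            end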



            • #7
              I do not know what regress does, as StataCorp has not made the source code public; that is why I chose to be vague in #2.

              regress handles the collinearity in some way to allow you to have quadratic, cubic and higher order terms.
              However, I tend to agree with you, Rich, that the divergence in estimates is likely not due to the precision of the inversion method. Using David Drukker's implementation of the OLS algorithm from this Stata blog with the example in #1, you see that the algorithm drops one of the variables to break the collinearity. I would be interested if anyone has a Mata algorithm that uses inversion and produces exactly the same results as regress.

              Code:
              * make a data set
              clear
              set seed 999
              set obs 400
              gen y  = rnormal()
              gen x1 = ln(_n + 1000)
              gen x2 = x1^2
              gen x3 = x1^3
              
              mata
              y    = st_data(., "y")           // outcome
              X    = st_data(., "x1 x2 x3")    // regressors
              n    = rows(X)
              X    = X,J(n,1,1)                // append the constant
              XpX  = quadcross(X, X)           // X'X in quad precision
              XpXi = invsym(XpX)               // generalized inverse: zeroes out collinear columns
              b    = XpXi*quadcross(X, y)
              b
              end
              
              regress y x1 x2 x3
              Result:

              Code:
              : b    = XpXi*quadcross(X, y)
              
              : b
                                1
                  +----------------+
                1 |   25.37120317  |
                2 |  -7.012076157  |
                3 |   .4843253213  |
                4 |             0  |
                  +----------------+
              
              : end
              --------------------------------------------------------------------------------------
              
              .
              . regress y x1 x2 x3
              
                    Source |       SS           df       MS      Number of obs   =       400
              -------------+----------------------------------   F(3, 396)       =      1.44
                     Model |  4.67519272         3  1.55839757   Prob > F        =    0.2315
                  Residual |  429.449514       396  1.08446847   R-squared       =    0.0108
              -------------+----------------------------------   Adj R-squared   =    0.0033
                     Total |  434.124707       399  1.08803185   Root MSE        =    1.0414
              
              ------------------------------------------------------------------------------
                         y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
              -------------+----------------------------------------------------------------
                        x1 |  -1298.321   10929.67    -0.12   0.906    -22785.75     20189.1
                        x2 |    179.992   1544.121     0.12   0.907    -2855.707    3215.691
                        x3 |  -8.320985   72.70972    -0.11   0.909    -151.2663    134.6243
                     _cons |   3122.863   25785.13     0.12   0.904    -47569.99    53815.72
              ------------------------------------------------------------------------------
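              For comparison, one way to sidestep explicit inversion entirely is Mata's qrsolve(), which returns the least-squares solution via QR decomposition (again just a sketch; whether regress does anything similar is not public):

              Code:
              mata
              y = st_data(., "y")
              X = st_data(., "x1 x2 x3")
              X = X, J(rows(X), 1, 1)
              b_qr = qrsolve(X, y)    // least squares via QR, no explicit inverse
              b_qr
              end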
              Last edited by Andrew Musau; 22 Feb 2021, 13:31.
