  • Warning message of Variance matrix is nonsymmetric or highly singular

    In the regression below I get this warning. I can run the first-stage regression separately and check whether the instrument is weak, but in the first-stage output from this command I can't see any standard errors, t statistics, or p-values. Is this a normal-looking setup, or am I doing something wrong?

    This is the command I'm using for the following regression result:

    Code:
    ivregress 2sls DEPVAR (endogenous=IV) male ismarried wasmarried age age2 age3 age4 black asian hispanic lths hsdegree somecollege , cluster(county) first
    Code:
    First-stage regressions
    -----------------------
    Warning: Variance matrix is nonsymmetric or highly singular.
    
                                                       Number of obs   = 1,221,477
                                                       No. of clusters =       413
                                                       F(0, 1221462)   =         .
                                                       Prob > F        =         .
                                                       R-squared       =    0.0287
                                                       Adj R-squared   =    0.0287
                                                       Root MSE        =    0.1198
    
    -----------------------------------------------------------------------------------
                      |               Robust
            endogenous | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
    ------------------+----------------------------------------------------------------
                 male |  -.0021216          .        .       .            .           .
            ismarried |   -.006381          .        .       .            .           .
           wasmarried |  -.0050979          .        .       .            .           .
                  age |   .0078539          .        .       .            .           .
                 age2 |  -.0002958          .        .       .            .           .
                 age3 |   4.79e-06          .        .       .            .           .
                 age4 |  -2.75e-08          .        .       .            .           .
                black |   .0016079          .        .       .            .           .
                asian |   .0061996          .        .       .            .           .
             hispanic |   .0180494          .        .       .            .           .
                 lths |  -.0098576          .        .       .            .           .
             hsdegree |  -.0048408          .        .       .            .           .
          somecollege |  -.0021989          .        .       .            .           .
                      |
    INSTRUMENT VARIABLE |  -3.45e-06          .        .       .            .           .
                      |
                _cons |  -.0298293          .        .       .            .           .
    -----------------------------------------------------------------------------------
    
    
    Instrumental variables 2SLS regression            Number of obs   =  1,221,477
                                                      Wald chi2(14)   =    2078.25
                                                      Prob > chi2     =     0.0000
                                                      R-squared       =     0.0011
                                                      Root MSE        =     .10483
    
                                   (Std. err. adjusted for 413 clusters in the county)
    ------------------------------------------------------------------------------
                 |               Robust
    DEPVAR | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
    -------------+----------------------------------------------------------------
       endogenous |   .0287666   .0115837     2.48   0.013     .0060629    .0514703
            male |  -.0053925   .0002373   -22.73   0.000    -.0058575   -.0049274
       ismarried |   .0011665   .0002201     5.30   0.000     .0007351    .0015979
      wasmarried |   .0023514   .0003723     6.32   0.000     .0016216    .0030811
             age |  -.0012788   .0006861    -1.86   0.062    -.0026234    .0000658
            age2 |    .000064   .0000281     2.27   0.023     8.84e-06    .0001192
            age3 |  -1.12e-06   4.92e-07    -2.28   0.022    -2.09e-06   -1.59e-07
            age4 |   7.03e-09   3.12e-09     2.26   0.024     9.21e-10    1.31e-08
           black |  -.0001923   .0002559    -0.75   0.452    -.0006938    .0003092
           asian |  -.0035667   .0010158    -3.51   0.000    -.0055577   -.0015758
        hispanic |  -.0048981   .0004398   -11.14   0.000    -.0057601   -.0040362
            lths |   .0028831   .0008705     3.31   0.001      .001177    .0045893
        hsdegree |    .002897   .0003899     7.43   0.000     .0021328    .0036612
     somecollege |    .003127   .0003549     8.81   0.000     .0024314    .0038226
           _cons |   .0151984   .0060315     2.52   0.012     .0033768      .02702

  • #2
    I am going to make the guess that the variables age2, age3, and age4 are the square, cube, and fourth power of age. If so, they are the most likely source of your problem. The names of the variables suggest that you are dealing with a sample of adult humans. So their ages will probably range between, say, 18 and 90 or something similar. When you then take the squares, cubes, and fourth powers of those, the resulting variables are extremely highly correlated, as the following toy data illustrate:

    Code:
    . clear
    
    . set obs 50
    Number of observations (_N) was 0, now 50.
    
    .
    . gen age = rnormal(50, 15)
    
    . summ age
    
        Variable |        Obs        Mean    Std. dev.       Min        Max
    -------------+---------------------------------------------------------
             age |         50    46.89612    13.26637   11.50563   76.89632
    
    .
    . forvalues i = 2/4 {
      2.     gen age`i' = age^`i'
      3. }
    
    .
    . corr age*
    (obs=50)
    
                 |      age     age2     age3     age4
    -------------+------------------------------------
             age |   1.0000
            age2 |   0.9797   1.0000
            age3 |   0.9405   0.9889   1.0000
            age4 |   0.8971   0.9647   0.9928   1.0000
    
    
    .
    end of do-file
    With variables this highly correlated, the covariance matrix that Stata needs to invert to estimate the regressions is going to be very close to singular. This means that the results of your analysis are not reliable.
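    One rough way to put a number on "very close to singular" is the condition number of the correlation matrix, that is, the ratio of its largest to its smallest eigenvalue. The sketch below is only an illustration on simulated data: the uniform age range of 18 to 90, the seed, and the names C, V, lambda, and cond_raw are all assumptions made for this example, not anything taken from your data.

```stata
* Toy data: hypothetical adult ages, roughly uniform on [18, 90]
clear
set seed 12345
set obs 1000
generate double age = 18 + 72*runiform()
forvalues i = 2/4 {
    generate double age`i' = age^`i'
}

* Correlation matrix of age and its powers
quietly correlate age age2 age3 age4
matrix C = r(C)

* Eigenvalues of C, returned in descending order in the row vector lambda
matrix symeigen V lambda = C

* Condition number: largest eigenvalue divided by smallest
scalar cond_raw = lambda[1,1] / lambda[1,4]
display "condition number (raw powers): " cond_raw
```

    The larger this ratio, the closer the matrix is to being singular, and the more rounding error can creep in when it is inverted.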

    Is there a compelling reason for using so many powers of age in your model? It is not uncommon to use age and age squared to capture U-shaped relationships. But why do you need more powers of age than that? Does a model using only age, or only age and age squared, fit your data poorly?

    Anyway, if this is, in fact, the source of the problem, there are two potential solutions:

    1. Get rid of age3 and age4.
    2. Or, if you really need them, first mean-center age, and then use that and its powers instead of the raw age variables. Using the toy data from above, you can see that the correlation matrix you get is much better behaved when you mean-center:
    Code:
    . summ age, meanonly
    
    . gen mc_age = age - r(mean)
    
    . forvalues i = 2/4 {
      2.     gen mc_age`i' = mc_age^`i'
      3. }
    
    .
    . corr mc_age*
    (obs=50)
    
                 |   mc_age  mc_age2  mc_age3  mc_age4
    -------------+------------------------------------
          mc_age |   1.0000
         mc_age2 |   0.0905   1.0000
         mc_age3 |   0.8225  -0.0786   1.0000
         mc_age4 |  -0.0371   0.9288  -0.2646   1.0000


    • #3
      Very grateful for such on-point direction and much-needed guidance. After removing the age3 and age4 variables, I got results that look like the normal setup, with all the t statistics and p-values back. And, honestly, I don't need the age3 and age4 variables. Much obliged!
