  • Test statistics and p-values different in SEM linear regression vs. OLS

    Greetings,

    I'm running Stata 15.1 on macOS and working with aggregate time-series data. The dependent variables are indexes of political attitudes for different political subgroups (e.g., white Democrats, white Republicans). I'm interested in testing whether a specific exogenous variable has a stronger effect on one group's attitudes than on the other's. To this end, I specified two linear models with the 'sem' command, one for each of the subgroups of interest. I then used the 'test' command to see whether the standardized beta coefficient in model 1 (white Democrats) is stronger than the coefficient in model 2. While doing this, however, I noticed that the test statistics in the SEM models differ from those in conventional OLS (i.e., using the 'reg' command). The upshot is that variables that are marginally significant or insignificant (at the 95% confidence level) in the OLS models achieve statistical significance in the SEM models. To illustrate this, here are the results from the SEM:

    Code:
    . sem (whdem5_policydiscrim1<-whdem5_policydiscrim1L1 media blkracial_pct anes_whdem_boomerX_epol  policy_spending3    consume
    > r_sentiment2 whdem2_policymood1) if  year < 1996, stand
    (9 observations with missing values excluded)
    
    Endogenous variables
    
    Observed:  whdem5_policydiscrim1
    
    Exogenous variables
    
    Observed:  whdem5_policydiscrim1L1 media blkracial_pct anes_whdem_boomerX_epol policy_spending3 consumer_sentiment2
    whdem2_policymood1
    
    Fitting target model:
    
    Iteration 0:   log likelihood = -516.36592  
    Iteration 1:   log likelihood = -516.36592  
    
    Structural equation model                       Number of obs     =         40
    Estimation method  = ml
    Log likelihood     = -516.36592
    
    
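    -----------------------------------------------------------------------------------------------
                                 |                 OIM
    Standardized                 |      Coef.  Std. Err.        z   P>|z|      [95% Conf. Interval]
    -----------------------------+-----------------------------------------------------------------
    Structural                   |
      whdem5_policydiscrim1      |
        whdem5_policydiscrim1L1  |   .6394075   .0840223     7.61   0.000      .4747269    .8040882
        media                    |   .3522813   .1699684     2.07   0.038      .0191493    .6854133
        blkracial_pct            |   .1872766    .160491     1.17   0.243       -.12728    .5018332
        anes_whdem_boomerX_epol  |  -.1627897   .1832061    -0.89   0.374      -.521867    .1962876
        policy_spending3         |   .3268914    .173965     1.88   0.060     -.0140736    .6678565
        consumer_sentiment2      |   .0982929   .0846609     1.16   0.246     -.0676393    .2642252
        whdem2_policymood1       |   .4730785   .1544848     3.06   0.002      .1702939    .7758631
        _cons                    |  -4.837008   2.321672    -2.08   0.037     -9.387403   -.2866138
    -----------------------------+-----------------------------------------------------------------
    var(e.whdem5_policydiscrim1) |   .1698421   .0374261                       .1102748    .2615861
    -----------------------------------------------------------------------------------------------
    LR test of model vs. saturated: chi2(0)   =      0.00, Prob > chi2 =      .
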
    As you can see, the p-value for the media variable in the above model (for group 1) is 0.038.

    Now here are the results from using the 'reg' command:

    Code:
    . regress whdem5_policydiscrim1 L.whdem5_policydiscrim1 media blkracial_pct   anes_whdem_boomerX_epol    policy_spending3    cons
    > umer_sentiment2 whdem2_policymood1  if whdem2_policymood1!=. & year < 1996  , beta
    
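          Source |       SS           df       MS      Number of obs   =        40
    -------------+----------------------------------   F(7, 32)        =     22.34
           Model |  1036.04676         7   148.00668   Prob > F        =    0.0000
        Residual |  211.965001        32  6.62390627   R-squared       =    0.8302
    -------------+----------------------------------   Adj R-squared   =    0.7930
           Total |  1248.01176        39  32.0003016   Root MSE        =    2.5737

    -----------------------------------------------------------------------------------------
    whdem5_policydiscrim1   |      Coef.  Std. Err.        t   P>|t|                     Beta
    ------------------------+----------------------------------------------------------------
    whdem5_policydiscrim1   |
      L1.                   |   .6418976   .1061421     6.05   0.000                 .6394075
                            |
    media                   |    4.62368   2.518701     1.84   0.076                 .3522813
    blkracial_pct           |    .519159   .4989768     1.04   0.306                 .1872766
    anes_whdem_boomerX_epol |  -4.352578   5.486595    -0.79   0.433                -.1627897
    policy_spending3        |   2.483215   1.489468     1.67   0.105                 .3268914
    consumer_sentiment2     |   .0526498   .0508577     1.04   0.308                 .0982929
    whdem2_policymood1      |   .4584689   .1709626     2.68   0.011                 .4730785
    _cons                   |  -27.01819   15.10765    -1.79   0.083                        .
    -----------------------------------------------------------------------------------------
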
    As you can see, the p-value for the media variable is 0.076. I recognize that SEM is using z-statistics while OLS is using t-statistics, but I'm not sure why this would result in different p-values. Either way, which test results are more reliable here? Thank you for your help!

  • #2
    My understanding is that, to get the same results from "sem" and "regress", you need the same degrees-of-freedom adjustment (sem typically uses the large-sample adjustment and reports z statistics, while regress uses the small-sample adjustment and reports t statistics) and the same type of variance-covariance estimator (by default, sem uses the observed information matrix, OIM, while regress uses the conventional OLS estimator). You can take a look here: https://www.stata.com/meeting/german...y19_Langer.pdf
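
    The small-sample versus large-sample point can be checked by hand from the output in the first post. This is only a rough sketch: it assumes the statistics printed there for media (t = 1.84 on 32 residual degrees of freedom from regress, z = 2.07 from sem, N = 40). The sqrt(N/(N-k)) rescaling applies to unstandardized ML coefficients, so the standardized sem results should be expected to match only approximately.

    ```stata
    * Two-sided p-value for the regress t statistic on 32 residual df
    display 2*ttail(32, 1.84)
    * Two-sided p-value for the sem z statistic, using the standard normal
    display 2*normal(-2.07)
    * ML divides the residual variance by N rather than N-k,
    * so the test statistic grows by roughly sqrt(40/32)
    display 1.84*sqrt(40/32)
    ```

    The first two values should come out close to the 0.076 and 0.038 reported above, and the third is roughly 2.06, close to the z of 2.07 from sem.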

    • #3
      There is a much bigger problem here than a slight change in the p-values! Look at the coefficient of media. It's not even in the same ballpark in the two models. The same is true of several of the other coefficients. This has nothing to do with t vs z or df adjustments. These cannot possibly be the same model.

      The most obvious place to look for trouble is that the OLS model contains a lagged version of the outcome variable as a predictor, created with the built-in L. time-series operator. The SEM model instead incorporates a variable whose name suggests it is also the lag of the outcome variable, but it is not calculated "on the fly" with the L. operator; it is a homebrew lag variable. My first hunch is that the homebrew lag variable is incorrectly calculated, which is a common error when working with longitudinal data. As the code by which it was created is not shown, I can't say anything more specific than that. But I would suggest that O.P. start by comparing the two variables to see if they differ.
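
      A minimal sketch of that comparison, assuming the data are already tsset (as the L. operator in the regress call requires); lag_check is a hypothetical variable name introduced only for this check:

      ```stata
      * Show the current time-series settings; L. needs the data to be tsset
      tsset
      * Build the lag on the fly with the time-series operator
      generate double lag_check = L.whdem5_policydiscrim1
      * Summarize how the homebrew lag and the operator-based lag differ
      compare whdem5_policydiscrim1L1 lag_check
      * List any years in the estimation sample where the two disagree
      list year whdem5_policydiscrim1 whdem5_policydiscrim1L1 lag_check ///
          if whdem5_policydiscrim1L1 != lag_check & year < 1996
      ```

      If compare reports the two variables as equal in every observation and list shows nothing, the lag variable is not the source of the discrepancy.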

      If that does not turn out to be the source of the problem, I recommend that O.P. post back and show example data that reproduce this problem (be sure to include all the variables necessary for the regression in the example).
