
  • Interpretation of coefficients for grouped regression

    I used the "mixed" command to test the relationship between x and y. I found that in the total sample, the coefficient of x is 0.03, and it is significant. However, in Subsample 1, the coefficient of x is -0.003, and it is non-significant. In Subsample 2, the coefficient of x is 0.025, and it is also non-significant. How to explain it?

  • #2
    Originally posted by Yuhan HU View Post
    How can I explain this?
    What is the question you're trying to answer overall? Depending upon what that question is, you might not need to explain this.

    In the meantime, try including an interaction term for x and the subsample categories and fit an omnibus model using the total sample that way.
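    A sketch of that suggestion (y, x, group, and id are placeholder names, not the poster's variables):
    Code:
    * Sketch: one omnibus model with a group-by-x interaction instead of
    * separate fits per subsample (placeholder variable names)
    mixed y c.x##i.group || id:, nolog
    * The c.x#i.group coefficient tests whether the slope of x differs by group.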

    If you're fitting models to separate subsamples because of a concern over heteroskedasticity, then you can specify separate residual variance estimates for each subsample using the residuals() option to mixed when fitting the omnibus model.
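    A sketch of that variant, again with placeholder names:
    Code:
    * Sketch: omnibus model with a separate residual variance per subsample
    mixed y c.x##i.group || id:, residuals(independent, by(group)) nolog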

    Comment


    • #3
      Yuhan:
      see the following toy-example:
      Code:
      . use "C:\Program Files\Stata19\ado\base\a\auto.dta"
      (1978 automobile data)
      
      . regress price c.trunk if foreign==0
      
            Source |       SS           df       MS      Number of obs   =        52
      -------------+----------------------------------   F(1, 50)        =      7.62
             Model |  64723188.5         1  64723188.5   Prob > F        =    0.0080
          Residual |   424471612        50  8489432.24   R-squared       =    0.1323
      -------------+----------------------------------   Adj R-squared   =    0.1150
             Total |   489194801        51  9592054.92   Root MSE        =    2913.7
      
      ------------------------------------------------------------------------------
             price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
      -------------+----------------------------------------------------------------
             trunk |   261.6024   94.74388     2.76   0.008     71.30376    451.9011
             _cons |   2213.787   1454.712     1.52   0.134    -708.0877    5135.662
      ------------------------------------------------------------------------------
      
      . regress price c.trunk if foreign==1
      
            Source |       SS           df       MS      Number of obs   =        22
      -------------+----------------------------------   F(1, 20)        =      2.42
             Model |  15592366.1         1  15592366.1   Prob > F        =    0.1353
          Residual |   128770847        20  6438542.33   R-squared       =    0.1080
      -------------+----------------------------------   Adj R-squared   =    0.0634
             Total |   144363213        21   6874438.7   Root MSE        =    2537.4
      
      ------------------------------------------------------------------------------
             price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
      -------------+----------------------------------------------------------------
             trunk |   267.8601   172.1257     1.56   0.135    -91.18787     626.908
             _cons |   3328.642   2036.949     1.63   0.118    -920.3602    7577.644
      ------------------------------------------------------------------------------
      
      . regress price c.trunk
      
            Source |       SS           df       MS      Number of obs   =        74
      -------------+----------------------------------   F(1, 72)        =      7.89
             Model |  62747229.9         1  62747229.9   Prob > F        =    0.0064
          Residual |   572318166        72  7948863.42   R-squared       =    0.0988
      -------------+----------------------------------   Adj R-squared   =    0.0863
             Total |   635065396        73  8699525.97   Root MSE        =    2819.4
      
      ------------------------------------------------------------------------------
             price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
      -------------+----------------------------------------------------------------
             trunk |   216.7482   77.14554     2.81   0.006     62.96142     370.535
             _cons |   3183.504   1110.728     2.87   0.005     969.3088    5397.699
      ------------------------------------------------------------------------------
      
      .
      Kind regards,
      Carlo
      (Stata 19.0)

      Comment


      • #4
        Originally posted by Carlo Lazzaro View Post
        Yuhan:
        see the following toy-example: . . .
        Thanks a lot, Carlo! I see you did not control for the foreign variable in the total sample. After controlling for it, we may get a different coefficient on trunk. I guess it would be roughly equal to the average of the coefficients from the two subsamples?
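        One way to check this in the same toy data (a sketch; output not shown):
        Code:
        * Sketch: pooled model controlling for foreign; compare the trunk
        * coefficient with the two subsample estimates (261.6 and 267.9)
        sysuse auto, clear
        regress price c.trunk i.foreign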

        Comment


        • #5
          Originally posted by Joseph Coveney View Post
          What is the question you're trying to answer overall? . . .
          Thanks a lot, Joseph! That may be a solution. However, I am still wondering whether it is normal that the coefficient of x in the total sample is significant while in the subsamples it is not, the directions of the coefficients differ, and the total-sample coefficient does not fall between the two subsample coefficients.

          Comment


          • #6
            Originally posted by Yuhan HU View Post
            . . . the total-sample coefficient does not fall between the two subsample coefficients . . .
            It's hard to say without seeing your data and knowing what else is in your model. For example, does your model have covariates, and do their distributions differ between subsamples? (The model obviously has the random-effects variable, and possibly a time variable or other repeated-measurement index.)
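            One quick check along those lines (a sketch; x1, x2, and group are placeholder names):
            Code:
            * Sketch: compare covariate distributions across subsamples
            tabstat x1 x2, by(group) statistics(mean sd n)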

            Comment


            • #7
              Originally posted by Joseph Coveney View Post
              . . . does your model have covariates, and do their distributions differ between subsamples? . . .
              Thanks for your reply, Joseph! This is my code for the total sample:

              mixed y c.cage c.cage#c.cage cohort c.cohort#c.cage c.cohort#c.cohort c.cohort#c.cohort#c.cage i.education i.income i.education#c.cage i.education#c.cohort i.income#c.cohort i.health i.rural2011 i.ragender i.missing i.work || ID2: c.cage, covariance(unstructured) nolog

              And this is the code for the subsamples (by gender):
              mixed y c.cage c.cage#c.cage cohort c.cohort#c.cage c.cohort#c.cohort c.cohort#c.cohort#c.cage i.education i.income i.education#c.cage i.education#c.cohort i.income#c.cohort i.health i.rural2011 i.missing i.work if gender==1 || ID2: c.cage, covariance(unstructured) nolog
              mixed y c.cage c.cage#c.cage cohort c.cohort#c.cage c.cohort#c.cohort c.cohort#c.cohort#c.cage i.education i.income i.education#c.cage i.education#c.cohort i.income#c.cohort i.health i.rural2011 i.missing i.work if gender==0 || ID2: c.cage, covariance(unstructured) nolog

              I focus on the coefficient of the interaction term of education and cohort.
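              Following the suggestion in #2, the gender difference in that coefficient could instead be tested in one omnibus model. A minimal sketch using the variable names from the total-sample command, with most covariates omitted for brevity:
              Code:
              * Sketch: the i.ragender#i.education#c.cohort terms test whether the
              * education-by-cohort coefficient differs by gender (covariates omitted)
              mixed y c.cage c.cage#c.cage i.ragender##i.education##c.cohort ///
                  || ID2: c.cage, covariance(unstructured) nolog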

              Comment


              • #8
                Originally posted by Yuhan HU View Post
                I focus on the coefficient of the interaction term of education and cohort.
                Both of those predictors are involved in interaction terms with other predictors. One, cohort, is in an interaction term with itself as well as in a three-way interaction with itself and centered age (c.cohort#c.cohort#c.cage). Other predictors that are involved in interaction terms with these two are themselves in interaction terms with each other. The model is fitted to data from an observational study, and at least a few predictors look liable to selection and other forms of bias. The two coefficients from the models fitted to the subsample data are not statistically significantly different from zero, which indicates that the estimates lack the precision needed to conclude much of anything about their location.

                In short, I would not be surprised that the coefficients behave as you report.

                Comment


                • #9
                  Originally posted by Joseph Coveney View Post
                  . . . The two coefficients from the models fitted to the subsample data are not statistically significantly different from zero, which indicates that the estimates lack the precision needed to conclude much of anything about their location. . . .
                  Thanks for your reply, Joseph! I also had a model that did not include the interaction term of education and age. The code for the total sample is:

                  mixed y c.cage c.cage#c.cage cohort c.cohort#c.cage c.cohort#c.cohort c.cohort#c.cohort#c.cage i.education i.income i.health i.rural2011 i.ragender i.missing i.work || ID2: c.cage, covariance(unstructured) nolog

                  The subsample is:
                  mixed y c.cage c.cage#c.cage cohort c.cohort#c.cage c.cohort#c.cohort c.cohort#c.cohort#c.cage i.education i.income i.health i.rural2011 i.missing i.work if gender==1 || ID2: c.cage, covariance(unstructured) nolog
                  mixed y c.cage c.cage#c.cage cohort c.cohort#c.cage c.cohort#c.cohort c.cohort#c.cohort#c.cage i.education i.income i.health i.rural2011 i.missing i.work if gender==0 || ID2: c.cage, covariance(unstructured) nolog

                  I found that the coefficients of education are significant in all three models, but the total-sample coefficient does not fall between the two subsample coefficients. Does that make sense?

                  Comment


                  • #10
                    Originally posted by Yuhan HU View Post
                    . . . the total-sample coefficient does not fall between the two subsample coefficients. Does that make sense?
                    Yes, it's not uncommon under your circumstances.

                    Here, taking Stata's auto dataset as an example for convenience, and just arbitrarily picking the first (numeric) variable, price, and regressing it on the remaining variables, first with the i.foreign variable included and then in the subsamples foreign == 1 and foreign == 0, you'll find that nearly half (four of nine) of the regression coefficients behave in this way.
                    See the code below (do-file and log file attached for your convenience).
                    Code:
                    version 19
                    
                    clear *
                    
                    quietly sysuse auto
                    
                    * Recode missing values of rep78 so that no observations are dropped
                    quietly replace rep78 = 1 if mi(rep78)
                    
                    * Frame to collect one row of coefficients per fitted model
                    frame create Coefficients str20 foreign double(mpg rep78 headroom trunk ///
                        weight length turn displacement gear_ratio)
                    
                    * Build the list of coefficient expressions; the _b[] terms are left
                    * unexpanded here and evaluated by -frame post- after each regression
                    foreach var of varlist mpg-gear_ratio {
                        local coe `coe' (_b[`var'])
                    }
                    
                    * Pooled model with the foreign indicator included
                    regress price c.(mpg-gear_ratio) i.foreign
                    frame post Coefficients ("i.foreign") `coe'
                    
                    * Separate models for the two subsamples
                    regress price c.(mpg-gear_ratio) if !foreign
                    frame post Coefficients ("domestic") `coe'
                    
                    regress price c.(mpg-gear_ratio) if foreign
                    frame post Coefficients ("foreign") `coe'
                    
                    * List all coefficients, then flag each variable whose pooled estimate
                    * does not fall between the two subsample estimates
                    cwf Coefficients
                    list, noobs abbreviate(20)
                    foreach var of varlist mpg-gear_ratio {
                        list foreign `var' if !inrange(`var'[1], min(`var'[2], `var'[3]), ///
                            max(`var'[2], `var'[3])), noobs abbreviate(20)
                    }
                    
                    exit
                    And that's just picking any old handy observational dataset.

                    I'm not sure what all has been said on the list before, but you might try Googling ("simpson's" OR "lord's") AND site:statalist.org and see whether it would help.
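                    For intuition, here is a minimal made-up illustration of Simpson's paradox (simulated data, unrelated to your dataset): the pooled slope is positive even though both within-group slopes are negative.
                    Code:
                    * Sketch: simulated Simpson's paradox
                    clear
                    set seed 12345
                    set obs 100
                    generate byte group = _n > 50
                    generate double x = runiform() + 2*group    // groups differ in x location
                    generate double y = 3*group - 0.5*x + rnormal(0, 0.2)
                    regress y x                  // pooled slope is positive
                    by group, sort: regress y x  // each within-group slope is negative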
                    Last edited by Joseph Coveney; 09 Aug 2025, 18:19. Reason: First numeric

                    Comment


                    • #11
                      Yuhan:
                      . . . I guess it would be roughly equal to the average of the coefficients from the two subsamples?
                      My previous toy-example shows that statistical significance is also a matter of sample size.
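                      A minimal simulated sketch of that point (made-up data; same data-generating process, different n):
                      Code:
                      * Sketch: identical slope, different sample sizes, different standard errors
                      clear
                      set seed 2025
                      set obs 1000
                      generate double x = rnormal()
                      generate double y = 0.2*x + rnormal()
                      regress y x          // n = 1,000: small standard error for x
                      regress y x in 1/30  // n = 30: much larger standard error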
                      Kind regards,
                      Carlo
                      (Stata 19.0)

                      Comment


                      • #12
                        Originally posted by Yuhan HU View Post
                        mixed y . . . i.missing . . .
                        Oh, by the way, I noticed but forgot to ask about that predictor.

                        If it's what I suspect, then I hope that you've seen at least this and this thread, for example.

                        Comment
