
  • Interpretation of coefficients for grouped regression

    I used the "mixed" command to test the relationship between x and y. I found that in the total sample, the coefficient of x is 0.03, and it is significant. However, in Subsample 1, the coefficient of x is -0.003, and it is non-significant. In Subsample 2, the coefficient of x is 0.025, and it is also non-significant. How to explain it?

  • #2
    Originally posted by Yuhan HU View Post
    How can I explain this?
    What is the question you're trying to answer overall? Depending upon what that question is, you might not need to explain this.

    In the meantime, try including an interaction term for x and the subsample categories and fit an omnibus model using the total sample that way.
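    A sketch of that suggestion (y, x, group, and id are placeholder names, not the poster's variables):
    Code:
    * Sketch: one omnibus model with a group-by-x interaction instead of
    * separate fits per subsample (placeholder variable names)
    mixed y c.x##i.group || id:, nolog
    * The c.x#i.group coefficient tests whether the slope of x differs by group.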

    If you're fitting models to separate subsamples because of a concern over heteroskedasticity, then you can specify separate residual variance estimates for each subsample using the residuals() option to mixed when fitting the omnibus model.
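    A sketch of that variant, again with placeholder names:
    Code:
    * Sketch: omnibus model with a separate residual variance per subsample
    mixed y c.x##i.group || id:, residuals(independent, by(group)) nolog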

    Comment


    • #3
      Yuhan:
      see the following toy-example:
      Code:
      . use "C:\Program Files\Stata19\ado\base\a\auto.dta"
      (1978 automobile data)
      
      . regress price c.trunk if foreign==0
      
            Source |       SS           df       MS      Number of obs   =        52
      -------------+----------------------------------   F(1, 50)        =      7.62
             Model |  64723188.5         1  64723188.5   Prob > F        =    0.0080
          Residual |   424471612        50  8489432.24   R-squared       =    0.1323
      -------------+----------------------------------   Adj R-squared   =    0.1150
             Total |   489194801        51  9592054.92   Root MSE        =    2913.7
      
      ------------------------------------------------------------------------------
             price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
      -------------+----------------------------------------------------------------
             trunk |   261.6024   94.74388     2.76   0.008     71.30376    451.9011
             _cons |   2213.787   1454.712     1.52   0.134    -708.0877    5135.662
      ------------------------------------------------------------------------------
      
      . regress price c.trunk if foreign==1
      
            Source |       SS           df       MS      Number of obs   =        22
      -------------+----------------------------------   F(1, 20)        =      2.42
             Model |  15592366.1         1  15592366.1   Prob > F        =    0.1353
          Residual |   128770847        20  6438542.33   R-squared       =    0.1080
      -------------+----------------------------------   Adj R-squared   =    0.0634
             Total |   144363213        21   6874438.7   Root MSE        =    2537.4
      
      ------------------------------------------------------------------------------
             price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
      -------------+----------------------------------------------------------------
             trunk |   267.8601   172.1257     1.56   0.135    -91.18787     626.908
             _cons |   3328.642   2036.949     1.63   0.118    -920.3602    7577.644
      ------------------------------------------------------------------------------
      
      . regress price c.trunk
      
            Source |       SS           df       MS      Number of obs   =        74
      -------------+----------------------------------   F(1, 72)        =      7.89
             Model |  62747229.9         1  62747229.9   Prob > F        =    0.0064
          Residual |   572318166        72  7948863.42   R-squared       =    0.0988
      -------------+----------------------------------   Adj R-squared   =    0.0863
             Total |   635065396        73  8699525.97   Root MSE        =    2819.4
      
      ------------------------------------------------------------------------------
             price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
      -------------+----------------------------------------------------------------
             trunk |   216.7482   77.14554     2.81   0.006     62.96142     370.535
             _cons |   3183.504   1110.728     2.87   0.005     969.3088    5397.699
      ------------------------------------------------------------------------------
      
      .
      Kind regards,
      Carlo
      (Stata 19.0)

      Comment


      • #4
        Originally posted by Carlo Lazzaro View Post
        Yuhan:
        see the following toy-example: . . .
        Thanks a lot, Carlo! I see you did not control for the foreign variable in the total sample. After controlling for it, we may get a different coefficient on trunk. I guess it would be roughly equal to the average of the coefficients from the two subsamples?
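        One way to check this in the same toy data (a sketch; output not shown):
        Code:
        * Sketch: pooled model controlling for foreign; compare the trunk
        * coefficient with the two subsample estimates (261.6 and 267.9)
        sysuse auto, clear
        regress price c.trunk i.foreign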

        Comment


        • #5
          Originally posted by Joseph Coveney View Post
          What is the question you're trying to answer overall? . . .
          Thanks a lot, Joseph! That may be a solution. However, I am still wondering whether it is normal that the coefficient of x in the total sample is significant while in the subsamples it is not, the directions of the coefficients differ, and the total-sample coefficient does not fall between the two subsample coefficients.

          Comment


          • #6
            Originally posted by Yuhan HU View Post
            . . . the total-sample coefficient does not fall between the two subsample coefficients . . .
            It's hard to say without seeing your data and knowing what else is in your model. For example, does your model have covariates, and do their distributions differ between subsamples? (The model obviously has the random-effects variable, and possibly a time variable or other repeated-measurement index.)
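            One quick check along those lines (a sketch; x1, x2, and group are placeholder names):
            Code:
            * Sketch: compare covariate distributions across subsamples
            tabstat x1 x2, by(group) statistics(mean sd n)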

            Comment


            • #7
              Originally posted by Joseph Coveney View Post
              . . . does your model have covariates, and do their distributions differ between subsamples? . . .
              Thanks for your reply, Joseph! This is my code for the total sample:

              mixed y c.cage c.cage#c.cage cohort c.cohort#c.cage c.cohort#c.cohort c.cohort#c.cohort#c.cage i.education i.income i.education#c.cage i.education#c.cohort i.income#c.cohort i.health i.rural2011 i.ragender i.missing i.work || ID2: c.cage, covariance(unstructured) nolog

              And this is the code for the subsamples (by gender):
              mixed y c.cage c.cage#c.cage cohort c.cohort#c.cage c.cohort#c.cohort c.cohort#c.cohort#c.cage i.education i.income i.education#c.cage i.education#c.cohort i.income#c.cohort i.health i.rural2011 i.missing i.work if gender==1 || ID2: c.cage, covariance(unstructured) nolog
              mixed y c.cage c.cage#c.cage cohort c.cohort#c.cage c.cohort#c.cohort c.cohort#c.cohort#c.cage i.education i.income i.education#c.cage i.education#c.cohort i.income#c.cohort i.health i.rural2011 i.missing i.work if gender==0 || ID2: c.cage, covariance(unstructured) nolog

              I focus on the coefficient of the interaction term of education and cohort.
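              Following the suggestion in #2, the gender difference in that coefficient could instead be tested in one omnibus model. A minimal sketch using the variable names from the total-sample command, with most covariates omitted for brevity:
              Code:
              * Sketch: the i.ragender#i.education#c.cohort terms test whether the
              * education-by-cohort coefficient differs by gender (covariates omitted)
              mixed y c.cage c.cage#c.cage i.ragender##i.education##c.cohort ///
                  || ID2: c.cage, covariance(unstructured) nolog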

              Comment


              • #8
                Originally posted by Yuhan HU View Post
                I focus on the coefficient of the interaction term of education and cohort.
                Both of those predictors are involved in interaction terms with other predictors. One, cohort, is in an interaction term with itself as well as in a three-way interaction with itself and centered age (c.cohort#c.cohort#c.cage). Other predictors that are involved in interaction terms with these two are themselves in interaction terms with each other. The model is fitted to data from an observational study, and at least a few predictors look liable to selection and other forms of bias. The two coefficients from the models fitted to the subsample data are not statistically significantly different from zero, which indicates that the estimates lack the precision needed to conclude much of anything about their location.

                In short, I would not be surprised that the coefficients behave as you report.

                Comment


                • #9
                  Originally posted by Joseph Coveney View Post
                  . . . The two coefficients from the models fitted to the subsample data are not statistically significantly different from zero, which indicates that the estimates lack the precision needed to conclude much of anything about their location. . . .
                  Thanks for your reply, Joseph! I also had a model that did not include the interaction term of education and age. The code for the total sample is:

                  mixed y c.cage c.cage#c.cage cohort c.cohort#c.cage c.cohort#c.cohort c.cohort#c.cohort#c.cage i.education i.income i.health i.rural2011 i.ragender i.missing i.work || ID2: c.cage, covariance(unstructured) nolog

                  The subsample is:
                  mixed y c.cage c.cage#c.cage cohort c.cohort#c.cage c.cohort#c.cohort c.cohort#c.cohort#c.cage i.education i.income i.health i.rural2011 i.missing i.work if gender==1 || ID2: c.cage, covariance(unstructured) nolog
                  mixed y c.cage c.cage#c.cage cohort c.cohort#c.cage c.cohort#c.cohort c.cohort#c.cohort#c.cage i.education i.income i.health i.rural2011 i.missing i.work if gender==0 || ID2: c.cage, covariance(unstructured) nolog

                  I found that the coefficients of education are significant in all three models, but the total-sample coefficient does not fall between the two subsample coefficients. Does that make sense?

                  Comment


                  • #10
                    Originally posted by Yuhan HU View Post
                    . . . the total-sample coefficient does not fall between the two subsample coefficients. Does that make sense?
                    Yes, it's not uncommon under your circumstances.

                    Here, taking Stata's auto dataset as an example for convenience, and just arbitrarily picking the first (numeric) variable, price, and regressing it on the remaining variables, first with the i.foreign variable included and then in the subsamples foreign == 1 and foreign == 0, you'll find that nearly half (four of nine) of the regression coefficients behave in this way.
                    See the code below (do-file and log file attached for your convenience).
                    Code:
                    version 19
                    
                    clear *
                    
                    quietly sysuse auto
                    
                    * Recode missing values of rep78 so that no observations are dropped
                    quietly replace rep78 = 1 if mi(rep78)
                    
                    * Frame to collect one row of coefficients per fitted model
                    frame create Coefficients str20 foreign double(mpg rep78 headroom trunk ///
                        weight length turn displacement gear_ratio)
                    
                    * Build the list of coefficient expressions; the _b[] terms are left
                    * unexpanded here and evaluated by -frame post- after each regression
                    foreach var of varlist mpg-gear_ratio {
                        local coe `coe' (_b[`var'])
                    }
                    
                    * Pooled model with the foreign indicator included
                    regress price c.(mpg-gear_ratio) i.foreign
                    frame post Coefficients ("i.foreign") `coe'
                    
                    * Separate models for the two subsamples
                    regress price c.(mpg-gear_ratio) if !foreign
                    frame post Coefficients ("domestic") `coe'
                    
                    regress price c.(mpg-gear_ratio) if foreign
                    frame post Coefficients ("foreign") `coe'
                    
                    * List all coefficients, then flag each variable whose pooled estimate
                    * does not fall between the two subsample estimates
                    cwf Coefficients
                    list, noobs abbreviate(20)
                    foreach var of varlist mpg-gear_ratio {
                        list foreign `var' if !inrange(`var'[1], min(`var'[2], `var'[3]), ///
                            max(`var'[2], `var'[3])), noobs abbreviate(20)
                    }
                    
                    exit
                    And that's just picking any old handy observational dataset.

                    I'm not sure what all has been said on the list before, but you might try Googling ("simpson's" OR "lord's") AND site:statalist.org and see whether it would help.
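                    For intuition, here is a minimal made-up illustration of Simpson's paradox (simulated data, unrelated to your dataset): the pooled slope is positive even though both within-group slopes are negative.
                    Code:
                    * Sketch: simulated Simpson's paradox
                    clear
                    set seed 12345
                    set obs 100
                    generate byte group = _n > 50
                    generate double x = runiform() + 2*group    // groups differ in x location
                    generate double y = 3*group - 0.5*x + rnormal(0, 0.2)
                    regress y x                  // pooled slope is positive
                    by group, sort: regress y x  // each within-group slope is negative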
                    Last edited by Joseph Coveney; 09 Aug 2025, 18:19. Reason: First numeric

                    Comment


                    • #11
                      Yuhan:
                      . . . I guess it would be roughly equal to the average of the coefficients from the two subsamples?
                      My previous toy-example shows that statistical significance is also a matter of sample size.
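                      A minimal simulated sketch of that point (made-up data; same data-generating process, different n):
                      Code:
                      * Sketch: identical slope, different sample sizes, different standard errors
                      clear
                      set seed 2025
                      set obs 1000
                      generate double x = rnormal()
                      generate double y = 0.2*x + rnormal()
                      regress y x          // n = 1,000: small standard error for x
                      regress y x in 1/30  // n = 30: much larger standard error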
                      Kind regards,
                      Carlo
                      (Stata 19.0)

                      Comment


                      • #12
                        Originally posted by Yuhan HU View Post
                        mixed y . . . i.missing . . .
                        Oh, by the way, I noticed but forgot to ask about that predictor.

                        If it's what I suspect, then I hope that you've seen at least this and this thread, for example.

                        Comment
