  • #16
    Dana, another slightly different spin on what Clyde is saying is to think of the hypothesis as a logical construct.

    When you write b1=b2=b3, you very clearly view this as the logical construct (b1=b2) AND (b2=b3) AND (b1=b3). However, any one of these three elements in the logical construct is automatically implied by the other two. So if you prefer, you can view the issue not as a hypothesis-testing or linear-algebra issue, but as a logical issue. For example:
    IF (b1=b2) AND (b1=b3) THEN (b2=b3). The third one is unavoidably logically implied by the combination of the first two.

    Therefore, if I try to test such an implied constraint, Stata automatically drops it and correctly states, in the numerator degrees of freedom of the F test, that in fact I am testing only two constraints:

    Code:
    . sysuse auto, clear
    (1978 Automobile Data)
    
    . reg price mpg headroom weight, noheader
    ------------------------------------------------------------------------------
           price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             mpg |  -56.19416   85.07654    -0.66   0.511     -225.874    113.4856
        headroom |  -675.5962   392.3504    -1.72   0.090    -1458.115     106.922
          weight |   2.061945   .6586383     3.13   0.003      .748332    3.375557
           _cons |   3158.306   3617.449     0.87   0.386    -4056.468    10373.08
    ------------------------------------------------------------------------------
    
    . test (mpg=headroom) (headroom=weight) (mpg=weight)
    
     ( 1)  mpg - headroom = 0
     ( 2)  headroom - weight = 0
     ( 3)  mpg - weight = 0
           Constraint 2 dropped
    
           F(  2,    70) =    1.68
                Prob > F =    0.1946



    • #17
      Originally posted by Clyde Schechter:
      Well, having read previous posts on -suest-, you are presumably aware that you can't do this because -suest- does not support -xtreg-.

      However, you can get the comparison you are looking for as follows. I will assume that IV1, IV2 and CV1 are continuous variables, and that CV2 and CV3 are discrete, to illustrate the approach. If that is not the case, you will need to modify the code accordingly.

      Code:
       xtreg DV i.S##(c.(IV1 IV2 CV1) i.(CV2 CV3)), fe vce(cluster panelvar)
       lincom 1.S#c.IV1 + 1.S#c.IV2
      The -xtreg- command above uses an interaction between S and all of the predictors of the model, thus completely emulating two separate subset regressions. Since the 1.S#whatever terms represent differences between the S = 0 and S = 1 coefficients, the -lincom- command calculates the difference between the S = 0 and S = 1 values of _b[IV1] +_b[IV2].

      If you are not familiar with the i. and c. prefixes and the ## operator, read -help fvvarlist- so you will learn about one of Stata's very best features!
      Dear Clyde Schechter,

      Can you explain this result for me? I want to examine the effect of caregiving on caregivers' health. I have two groups of caregivers, those with higher HH income and those with lower HH income, and I want to test whether the effects of caregiving are the same for the two groups. In my setup, I treat caregiving as an endogenous variable, with Z1 and Z2 as potential instrumental variables; X1, X2, and X3 are exogenous variables (X1 is continuous).

      health: 0 if good health, 1 otherwise
      care: 1 if providing care, 0 otherwise
      income: 1 if higher HH income, 0 otherwise

      I first run separate models for the two groups:
      Code:
      ivregress 2sls health c.X1 i.X2 i.X3 (care=Z1 Z2) if income==1, robust
      ivregress 2sls health c.X1 i.X2 i.X3 (care=Z1 Z2) if income==0, robust
      Results for separate models
      Code:
      // For high income group
      Instrumental variables (2SLS) regression          Number of obs   =      2,099
                                                        Wald chi2(14)   =      27.73
                                                        Prob > chi2     =     0.0154
                                                        R-squared       =     0.0038
                                                        Root MSE        =     .35127
      
      ----------------------------------------------------------------------------------
                       |               Robust
                health |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
      -----------------+----------------------------------------------------------------
                   care |   .1859373   .0875488     2.12   0.034     .0143448    .3575298
      
      // For low income group
      Instrumental variables (2SLS) regression          Number of obs   =      1,956
                                                        Wald chi2(14)   =      43.80
                                                        Prob > chi2     =     0.0001
                                                        R-squared       =     0.0029
                                                        Root MSE        =     .37884
      
      ----------------------------------------------------------------------------------
                       |               Robust
                health |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
      -----------------+----------------------------------------------------------------
                   care |   .3692195   .1358493     2.72   0.007     .1029598    .6354793
      The results show a negative and significant effect of caregiving on health for both groups, and the effect of caregiving is much larger for those with lower HH income, so I would expect the equality-of-coefficients test to show a significant difference between the low- and high-income groups in the impact of caregiving on health.

      My specification is as follows:
      Code:
      ivregress 2sls health i.income##(c.X1 i.X2 i.X3) (care care#i.income=Z1 Z1#i.income Z2 Z2#i.income), robust
      Results
      Code:
      Instrumental variables (2SLS) regression          Number of obs   =      4,055
                                                        Wald chi2(29)   =      76.54
                                                        Prob > chi2     =     0.0000
                                                        R-squared       =     0.0050
                                                        Root MSE        =     .36483
      
       ----------------------------------------------------------------------------------------
                              |               Robust
                       health |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
       -----------------------+----------------------------------------------------------------
                         care |   .1859373   .0875488     2.12   0.034     .0143448    .3575298
                              |
                  care#income |
                         1 0  |   .1832823   .1616163     1.13   0.257    -.1334799    .5000444
                         1 1  |          0  (omitted)
      In contrast to my expectation, the results show no significant difference between the two groups in the effect of caregiving on health. Do you think my model specification is correct?

      Thanks
      Last edited by Duong Le; 31 Oct 2020, 21:48.



      • #18
        Caveat: I do not use instrumental variables models in my work and have only a superficial understanding of them. So there may be more to this than I can comment on.

        But I do notice that in the interaction model, the coefficient of care is .1859373, which, since the 1 1 cell of care#income is the omitted base here, is the effect of care in the income = 1 group. Adding the interaction coefficient .1832823 gives .3692196, which is, except for a minor rounding error in the last decimal place, exactly what you found as the coefficient of care in the low-income group when you modeled it separately. So everything seems to be quite in order.
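
        By the way, this sum can be obtained directly in Stata with -lincom-, which also supplies a standard error and confidence interval for it. A sketch, to be run after the interaction model from #17; the coefficient name 1.care#0.income is my assumption based on the output shown there, and running the estimation command with the -coeflegend- option will display the exact names:

        Code:
        lincom _b[care] + _b[1.care#0.income]

        The point estimate should match the coefficient of care from the separate low-income model.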

        As for why these results do not accord with your expectations, I cannot say. First, it depends on what the basis for your expectations is. Second, you seem to be very concerned with statistical significance here. As you may know, I am among those who most strongly support the American Statistical Association's recommendation that the concept no longer be used. See https://www.tandfonline.com/doi/full...5.2019.1583913 for the "executive summary" and https://www.tandfonline.com/toc/utas20/73/sup1 for all 43 supporting articles. Or https://www.nature.com/articles/d41586-019-00857-9 for the tl;dr.

        The estimated difference between the two effects is given by your care#income interaction coefficient, and it has a confidence interval that ranges from about -.13 to about +.50. So that covers a wide range, from slightly favorable effects on health to pretty strongly unfavorable. To me the only conclusion is that your data simply do not permit a precise enough estimate to draw any useful conclusion. The usual reasons for that need to be considered: how good are your measurements? Perhaps they are far too noisy. Indeed, the fact that you used a 0/1 indicator of good vs poor health as your outcome variable suggests to me that you entered the boxing ring with one hand tied behind your back. For that matter, how valid is your measure of caregiver status? It may be that eliding distinctions of part-time vs full-time, or care-giving as an occupation vs care-giving to a family member, or other things like that has degraded the information your data could convey. Finally, I do not really know how the use of instrumental variables impacts this kind of thing, but I have the impression that it adds additional uncertainty.



        • #19
          Dear Professor Clyde Schechter,

          Thank you for your insightful comments. It's always great to read your comments and explanations, and I will keep your suggestions for my models in mind. In addition, the way you interpreted the confidence interval is really interesting: you wrote that the "confidence interval ... ranges from about -.13 to about +.50. So that covers a wide range from slightly favorable effects on health to pretty strongly unfavorable". Could you please explain the intuition behind that interpretation?

          Thank you.



          • #20
            My intuition behind the explanation comes directly from what you wrote about your model. The outcome variable is a dichotomous variable: 0 = good health, 1 = bad health. The predicted values of your model can therefore be interpreted as estimates of the probability of poor health. So if the interaction effect were really at the lower confidence limit of -0.13, it would correspond to roughly a 13 percentage point lower probability of poor health compared to what one would see in the high income/providing care group if there were no interaction at all, so I would call that a slightly favorable effect on health. At the other end, +0.50, we are talking about roughly a 50 percentage point higher probability of poor health, which I think most people would agree is pretty strongly unfavorable.



            • #21
              Dear Professor Clyde Schechter,

              Thanks for your detailed explanation. So is it true that the signs (negative or positive) of the confidence limits do not matter for the direction of the health effect, and that they only tell us whether 0 falls within the range of the CI?

              Best,



              • #22
                So is it true that the signs (negative or positive) of the confidence limits do not matter for the direction of the health effect, and that they only tell us whether 0 falls within the range of the CI?
                No, that's a terrible way to think about confidence intervals. That's treating confidence intervals as a verbose way to express statistical significance (or lack thereof). The American Statistical Association has recommended that the concept of statistical significance be abandoned. See https://www.tandfonline.com/doi/full...5.2019.1583913 for the "executive summary" and https://www.tandfonline.com/toc/utas20/73/sup1 for all 43 supporting articles. Or https://www.nature.com/articles/d41586-019-00857-9 for the tl;dr.

                But even if you still want to use the concept of statistical significance, a confidence interval is so much richer than that. In fact, probably the least important thing about a confidence interval is whether it contains zero. A confidence interval is a range of values that are consistent with the data. Statistical analyses are done to answer questions about the real world. So the true value of a confidence interval comes from seeing what its endpoints mean in real-world practical terms. That gives you a much richer understanding of your results than a verdict on statistical significance.



                • #23
                  Thank you, Professor Clyde Schechter. I have learned a lot about CIs and other things today.

                  I wish you all the best!



                  • #24
                    Dear Professor Clyde Schechter,

                    I am sorry to bother you again, but I would like to seek your advice once more. As you may recall from my post in #17, caregivers with lower HH income have worse health than their counterparts with higher HH income. To explore the difference in health between those two groups, I am thinking of two hypotheses: caregivers with lower HH income experience worse health because 1) they devote more care hours to their parents than do caregivers with higher HH income; and 2) their care recipients have worse health (and thus need more intensive care) than the care recipients of caregivers with higher HH income.

                    To test the two hypotheses, I did the following: 1) I chose two alternative outcomes to replace the health outcome, namely log(1+care hours) - defined as the log of 1 plus the number of hours of care that the caregiver provides to his or her parents - and long-term care certificate (1 if the care recipient has been certified as needing long-term care, 0 otherwise); 2) I regressed the two new outcomes (in separate models for each outcome) on the caregiving indicator and other covariates (note that I used OLS instead of 2SLS because the two new variables and the IVs in #17 (Z1 and Z2) could be determined simultaneously); and 3) I conducted a Chow-type test of the equality of the caregiving coefficients across the two models. My intuition is that if the coefficients of caregiving status in the two models are significant and larger for caregivers with lower HH income than for caregivers with higher HH income, then I may conclude that care hours and the health of care recipients play a role in my study. However, I am not sure whether this approach actually tests the two hypotheses. Your advice is highly appreciated, as always. My Stata commands are as follows:

                    Code:
                    // For ln(1+care hour) outcome
                    gen ln_carehour = ln(1 + carehour)
                    
                    qui reg ln_carehour care c.X1 i.X2 i.X3 if income==1
                    est store m1
                    
                    qui reg ln_carehour care c.X1 i.X2 i.X3 if income==0
                    est store m2
                    
                    suest m1 m2, vce(cluster id)
                    test [m1_mean]care = [m2_mean]care
                    
                    // For long-term care certificate
                    qui reg long_term_care care c.X1 i.X2 i.X3 if income==1
                    est store m3
                    
                    qui reg long_term_care care c.X1 i.X2 i.X3 if income==0
                    est store m4
                    
                    suest m3 m4, vce(cluster id)
                    test [m3_mean]care = [m4_mean]care
                    Thank you.



                    • #25
                      These models don't seem quite right. If I understand your new variables correctly, both of them will always be zero in any observation where the person is not a caregiver. In light of that, it does not make sense to me to include the variable care (which, according to your earlier posts is an indicator for whether or not the person is a caregiver) in the model. The outcome variables are only meaningful when care = 1. And your new hypotheses are not about the effect of being a caregiver: they are about why income modifies the effect of being a caregiver. So I think the variable care does not belong in these models. Rather, I think you want to see whether these new outcome variables differ across the income groups. So that looks something like this:

                      Code:
                      reg ln_1_plus_care_hours i.income c.X1 i.X2 i.X3 if care == 1
                      reg long_term_care i.income c.X1 i.X2 i.X3 if care == 1
                      By the way, it is not immediately clear whether you should include X1, X2, and X3 in these models. Even if they are variables that are needed for the earlier analyses of the health outcome variable, they may or may not be appropriate for these different outcomes. You will have to give that some thought and make a decision about that.

                      I'm wondering why you chose to specify the care hours as ln(1+x). I'm guessing you did that because you wanted to log-transform the variable (OK, but why? What did you see in the data that suggested you would need to do that?) and then realized that you had a lot of zeroes. Well, when you have zeroes, ln(1+x) is just a kludge and really should be avoided unless you can explicitly justify ln(1+x) as opposed to, say, ln(347.382+x) or ln(0.000001+x), etc., all of which would likely produce very different results for your analyses. When you have zeroes your first instinct should be to simply avoid logarithms altogether. Without knowing why you were inclined to do a log-transform in the first place, I can't make an emphatic recommendation for what would be better, but often a square-root transform is better (if there are no negative values, as I assume is true here), or a cube root. You may also note that if you restrict your analysis to care == 1 observations, I suspect the zeroes will go away, which may make it feasible to just do good old ln(x) (assuming there is still a reason to transform in the first place).
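
                      For concreteness, the alternatives mentioned above might be created like this (a sketch; carehour is assumed to be the raw hours variable, and the new variable names are hypothetical):

                      Code:
                      gen sqrt_hours = sqrt(carehour)                     // square root; fine when carehour >= 0
                      gen cbrt_hours = sign(carehour)*abs(carehour)^(1/3) // cube root; also handles negative values
                      gen ln_hours = ln(carehour) if care == 1            // plain log once the zero-hour observations are excluded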



                      • #26
                        Dear Professor Clyde Schechter,

                        Thank you so much for your insightful comments and advice. Please see below for my responses
                        If I understand your new variables correctly, both of them will always be zero in any observation where the person is not a caregiver
                        Technically, you are right. However, it may not necessarily be true in the case of the long-term care certificate, because in my data there can be cases where someone provides care for their parents even though the parents have not been certified as needing long-term care. To obtain a long-term care certificate, individuals take an assessment held by the local authorities (I am sorry for not providing this information in #24), and they receive a certificate if their health problems reach a level that requires long-term care.

                        It is true in the case of care hours that non-caregivers will have zero hours, so your suggested regression model in #25 is very helpful. As for long-term care need, do you think my models in #24 make sense? Or do you have any other advice?

                        By the way, it is not immediately clear whether you should include X1, X2, and X3 in these models
                        Thanks for that. I will think about this issue.

                        I'm wondering why you chose to specify the care hours as ln(1+x). [...]
                        I have read several topics in which you explained the use of logarithms, and I agree with you on this matter and will take your suggestions into account. In addition, one thing that comes to mind is that care hours are measured as count data with lots of zeroes (for those who do not provide care), so I am thinking of using Poisson regression with the raw measure of care hours as the dependent variable. What would you think about that?

                        Thank you so much.



                        • #27
                          As for long-term care need, do you think my models in #24 make sense? Or do you have any other advice?
                          Given your explanation about the certificates, the models for long-term care need in #24 look sensible.

                          I am thinking of using Poisson regression with the raw measure of care hours as the dependent variable. What would you think about that?
                          That's a good idea. In light of what sounds like an excess of zeroes, consider using the -vce(robust)- option.
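
                          Concretely, that might look like the following (a sketch using the variable names from the earlier posts; whether X1, X2, and X3 belong in this model is, as noted in #25, a separate decision):

                          Code:
                          poisson carehour i.income c.X1 i.X2 i.X3, vce(robust)

                          With -vce(robust)-, the Poisson point estimates remain consistent for the conditional mean even when the outcome is not truly Poisson-distributed, so the zeroes and any overdispersion do not invalidate the inference.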



                          • #28
                            Dear Professor Clyde Schechter,

                            Thanks a lot for your comments and suggestions. I will keep them in mind.



                            • #29
                              Hello Statalisters, I tried my best to find my answer in the previous posts in this thread, but I could not; maybe I missed it or overlooked it. My specific question is: I want to know whether the CSR coefficient differs significantly between CEOs above and below age 50.5. Please tell me which test fulfills this requirement.

                              Code:
                              reg wroa1 CSR12_net_adj1 LOG_AT_US wsale_gr wche_at wbook_td i.fyear i.ff48i if CEO age>50.5, cluster(GVKEY1)
                              reg wroa1 CSR12_net_adj1 LOG_AT_US wsale_gr wche_at wbook_td i.fyear i.ff48i if CEO age<=50.5, cluster(GVKEY1)

                              your response is highly appreciated



                              • #30
                                CEO age is not a valid variable name, as it contains a space, so your code will throw an error.

                                Otherwise, posts #2 and #3 of this topic explain two techniques for doing this; the only difference is that you are not including fixed effects, so you can just use regress instead of xtreg or areg.
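
                                Applied to the commands in #29, the interaction technique might look like this (a sketch; ceo_age is a hypothetical corrected variable name, and I am assuming the non-factor covariates are all continuous):

                                Code:
                                gen byte older = ceo_age > 50.5 if !missing(ceo_age)
                                reg wroa1 i.older##(c.CSR12_net_adj1 c.LOG_AT_US c.wsale_gr c.wche_at c.wbook_td i.fyear i.ff48i), cluster(GVKEY1)

                                The coefficient on 1.older#c.CSR12_net_adj1 is then the difference in the CSR coefficient between the two age groups, together with a test of whether that difference is zero.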

