estimate the "impact" of the time-invariant variable on the change of DV in a random effect model

Vincent Li

Join Date: Dec 2016

Posts: 57
#1

23 May 2025, 04:19

Hi!

I'm working with a dataset that includes 82 children, each with data collected at three time points: pre, po1, and po2. My goal is to determine whether the number of prompts given at the pre period can predict the change in emotion scores from pre to po1 and from pre to po2. Additionally, I am considering wsl and age as covariates, which are constant over time.

The regression models have been constructed:
Score_po1=a0+b0*score_pre+b1*prompts+b2*age+b3*wsl +e
Score_po2=a0+b0*score_pre+b1*prompts+b2*age+b3*wsl +e

However, I'm wondering whether a random-effects model would be more suitable for the data. Initially, I fit this model:

Code:

xtreg CEtotal i.period prompt_t1 wsl age, i(indi_num) re vce(robust)

the output is:

Random-effects GLS regression                   Number of obs     =        240
Group variable: indi_num                        Number of groups  =         80

R-squared:                                      Obs per group:
     Within  = 0.1044                                         min =          3
     Between = 0.4183                                         avg =        3.0
     Overall = 0.3437                                         max =          3

                                                Wald chi2(5)      =      83.13
corr(u_i, X) = 0 (assumed)                      Prob > chi2       =     0.0000

                              (Std. err. adjusted for 80 clusters in indi_num)
------------------------------------------------------------------------------
             |               Robust
     CEtotal | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
      period |
          1  |     1.1375   .3634737     3.13   0.002     .4251046    1.849895
          2  |     1.4625    .421989     3.47   0.001     .6354168    2.289583
             |
   prompt_t1 |  -.0938059   .1043262    -0.90   0.369    -.2982815    .1106697
         wsl |   .1166733   .0174754     6.68   0.000     .0824222    .1509245
         age |   1.304953   .2735377     4.77   0.000     .7688289    1.841077
       _cons |  -10.36433   2.828021    -3.66   0.000    -15.90714   -4.821508
-------------+----------------------------------------------------------------
     sigma_u |  2.3870557
     sigma_e |  2.2636948
         rho |  .52650628   (fraction of variance due to u_i)
------------------------------------------------------------------------------

However, prompt_t1 is a time-invariant variable, and I don't think this model can estimate what I want.

Then I tried:

Code:

xtreg CEtotal i.period##c.prompt_t1 wsl age, i(indi_num) re vce(robust)

The outcome is:

Random-effects GLS regression                   Number of obs     =        240
Group variable: indi_num                        Number of groups  =         80

R-squared:                                      Obs per group:
     Within  = 0.1377                                         min =          3
     Between = 0.4183                                         avg =        3.0
     Overall = 0.3516                                         max =          3

                                                Wald chi2(7)      =      86.91
corr(u_i, X) = 0 (assumed)                      Prob > chi2       =     0.0000

                                    (Std. err. adjusted for 80 clusters in indi_num)
------------------------------------------------------------------------------------
                   |               Robust
           CEtotal | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------------+----------------------------------------------------------------
            period |
                1  |   1.592238   .4144295     3.84   0.000      .779971    2.404505
                2  |    1.85533   .4734211     3.92   0.000     .9274419    2.783219
                   |
         prompt_t1 |   .0539184   .1135493     0.47   0.635    -.1686342     .276471
                   |
period#c.prompt_t1 |
                1  |  -.2377714   .1043024    -2.28   0.023    -.4422003   -.0333425
                2  |  -.2054014   .0979767    -2.10   0.036    -.3974323   -.0133706
                   |
               wsl |   .1166733   .0175506     6.65   0.000     .0822748    .1510718
               age |   1.304953   .2747142     4.75   0.000      .766523    1.843383
             _cons |  -10.64685   2.855855    -3.73   0.000    -16.24422   -5.049477
-------------------+----------------------------------------------------------------
           sigma_u |  2.3959142
           sigma_e |  2.2354428
               rho |  .53460734   (fraction of variance due to u_i)
------------------------------------------------------------------------------------

Next, I want to test whether the "coefficient difference" for prompt_t1 is significantly related to the score change from pre to po1, but -margins- cannot be used for an interaction that includes a continuous variable. I also tried

Code:

test prompt_t1+1.period#prompt_t1 = prompt_t1

but I don't think this is a correct approach. The output is:

( 1) 1.period#c.prompt_t1 = 0

chi2( 1) = 5.20
Prob > chi2 = 0.0226

To sum up, there are two questions:
1. Is a random effect model necessary?
2. If it is, how can I obtain the coefficient for the prompt (a continuous, time-invariant variable) and its significance regarding the score changes from pre to po1 and from pre to po2?

Thank you!
Clyde Schechter

Join Date: Apr 2014

Posts: 30050
#2

23 May 2025, 09:13

Reading #1 as a whole, it appears that you are most interested in whether the prompt variable modifies the evolution of CEtotal across time periods. Your random effects regression with interaction -xtreg CEtotal i.period##c.prompt_t1 wsl age, i(indi_num) re vce(robust)- is a proper specification for that question. You do not need the -margins- command to identify the extent (or the "significance") of that interaction: that comes directly from the interaction coefficients in the -xtreg- output itself. If you want a joint test of the effect modification at both time periods, you can do that with -test 1.period#c.prompt_t1 2.period#c.prompt_t1-. The test statistics in the -xtreg- output for those two coefficients give you separate tests for each period.

The main objection that can be raised to this approach is that random effects models can provide inconsistent estimates due to correlation of regressors with the error term (i.e. incomplete adjustment for confounding variables). If the value of the prompt variable was randomly assigned, then this is not a problem. But if this is observational data, then you would probably be better off using -xtreg, cre- (if you are using the current version of Stata) or -xthybrid- (available from SSC, if you are using an older Stata version).
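
As a rough sketch of the tests described above, using the variable names from #1 and assuming period is coded 0/1/2 with pre as the base level, the sequence might look like this. The -lincom- and -margins- lines are optional extras, not something #1 asked for.

```stata
* Sketch only: variable names follow #1; assumes period is coded 0/1/2
xtreg CEtotal i.period##c.prompt_t1 wsl age, i(indi_num) re vce(robust)

* Joint test that prompt_t1 modifies the change at either post period
test 1.period#c.prompt_t1 2.period#c.prompt_t1

* Does the modification differ between po1 and po2?
lincom 2.period#c.prompt_t1 - 1.period#c.prompt_t1

* Slope of CEtotal on prompt_t1 within each period
margins period, dydx(prompt_t1)
```

Each interaction coefficient is the difference between the prompt_t1 slope in that period and the slope at pre, so the -margins- output should reproduce the main-effect coefficient at period 0 and the main effect plus the interaction at periods 1 and 2.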
Vincent Li

Join Date: Dec 2016

Posts: 57
#3

25 May 2025, 23:06

Thank you very much, Clyde. I mistakenly confused the interpretations of interaction between categorical#categorical and categorical#continuous. I've now clarified that.

I appreciate your suggestions. Currently, my Stata 17 MP version does not support -xtreg, cre-, so I opted to use -xthybrid-. Since interaction terms are not allowed in -xthybrid-, I executed the following commands:

Code:

xi: gen per_pro=period*prompt_t1
xi: xthybrid CEtotal i.period prompt_t1 per_pro wsl age, cre c(indi_num) vce(robust) se t p

The outcome is :

The variable 'prompt_t1' does not vary sufficiently within clusters
and will not be used to create additional regressors.
[0% of the total variance in 'prompt_t1' is within clusters]
The variable 'wsl' does not vary sufficiently within clusters
and will not be used to create additional regressors.
[0% of the total variance in 'wsl' is within clusters]
The variable 'age' does not vary sufficiently within clusters
and will not be used to create additional regressors.
[0% of the total variance in 'age' is within clusters]

Correlated random effects model. Family: gaussian. Link: identity.

+-----------------------------------+
| Variable | model |
|----------------------+------------|
| CEtotal | |
| R__wsl | 0.1167 |
| | 0.0173 |
| | 6.75 |
| | 0.0000 |
| R__age | 1.3050 |
| | 0.2707 |
| | 4.82 |
| | 0.0000 |
| R__prompt_t1 | 0.0089 |
| | 0.1128 |
| | 0.08 |
| | 0.9371 |
| W__per_pro | -0.1027 |
| | 0.0483 |
| | -2.13 |
| | 0.0334 |
| W___Iperiod_1 | 1.3339 |
| | 0.3685 |
| | 3.62 |
| | 0.0003 |
| W___Iperiod_2 | 1.8553 |
| | 0.4664 |
| | 3.98 |
| | 0.0001 |
| B___Iperiod_2 | (omitted) |
| | |
| | |
| | |
| D___Iperiod_1 | (omitted) |
| | |
| | |
| | |
| D__per_pro | (omitted) |
| | |
| | |
| | |
| _cons | -10.5607 |
| | 2.8162 |
| | -3.75 |
| | 0.0002 |
|----------------------+------------|
| var(_cons[indi_num])| |
| _cons | 5.3888 |
| | 0.9335 |
| | 5.77 |
| | 0.0000 |
|----------------------+------------|
| var(e.CEtotal)| |
| _cons | 4.9410 |
| | 0.6130 |
| | 8.06 |
| | 0.0000 |
|----------------------+------------|
| Statistics | |
| ll | -590.3362 |
| chi2 | 84.0782 |
| p | 0.0000 |
| aic | 1198.6724 |
| bic | 1229.9981 |
+-----------------------------------+
Legend: b/se/t/p
Level 1: 240 units. Level 2: 80 units.

One issue is that the interaction term per_pro has only one coefficient, while it is expected to have two—one for period 1 and another for period 2.

I also tried:

Code:

xi:gen per_pro=i.period*prompt_t1

But Stata said

i.period _Iperiod_0-2 (naturally coded; _Iperiod_0 omitted)
i.period*pro~t1 _IperXpromp_# (coded as above)
variable per_pro already defined

Are there alternative ways to incorporate i.period#c.prompt_t1 into the -xthybrid- model?

Thanks again!
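
For reference, a single product period*prompt_t1 treats period as a linear 0/1/2 score, which forces the po2 interaction to be exactly twice the po1 interaction and so yields only one coefficient. A minimal sketch of building one interaction variable per post period by hand follows; the names per1_pro and per2_pro are hypothetical, and the check uses plain -xtreg, re- rather than -xthybrid-.

```stata
* Sketch only: assumes period is coded 0/1/2 as described in #1
* per1_pro and per2_pro are hypothetical names for the hand-built interactions
gen per1_pro = (period == 1) * prompt_t1
gen per2_pro = (period == 2) * prompt_t1

* Should reproduce the i.period##c.prompt_t1 estimates shown in #1
xtreg CEtotal i.period prompt_t1 per1_pro per2_pro wsl age, i(indi_num) re vce(robust)
```

The two generated variables can then be entered like any other regressors in place of per_pro.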
Clyde Schechter

Join Date: Apr 2014

Posts: 30050
#4

26 May 2025, 10:00

Please post example data, using the -dataex- command to do that. Run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
Jeff Wooldridge

Join Date: Apr 2014

Posts: 2148
#5

26 May 2025, 12:08

A few comments.
1. The regressions that you show, which include the initial period score, score_pre, are not consistent with the RE estimation, which excludes this variable. This is potentially very important, depending on how prompts was generated. If it was randomized, there is no need to control for score_pre, but doing so can help a lot with precision because it seems likely that score_pre is a good predictor of the score in later periods. And if prompts is not randomly assigned, but correlated with score_pre, then score_pre should be included.
2. Assuming you are estimating the same model, you're in a case where pooled OLS and random effects are the same. Here, "pooled" means across the two post periods. You've shown two separate regressions but you can also obtain all estimates using one regression by interacting all variables with the period dummy variables.
3. Here are two commands that will produce identical estimates:

Code:

reg score i.period#(c.prompts c.score_pre c.age c.wsl) i.period, vce(cluster id)
xtreg score i.period#(c.prompts c.score_pre c.age c.wsl) i.period, re vce(cluster id)

These allow all coefficients to be different across the two periods, which is what your separate regressions are doing. But you get all the estimates at once. So it doesn't matter whether you use pooled OLS or RE: they are numerically the same because none of the x variables, except the period dummies, change over time. See my working paper "Two-Way Fixed Effects, the Two-Way Mundlak Estimator, and Difference-in-Differences Estimators."

4. If you want to impose constant coefficients on the controls -- which is what your original RE command does -- use

Code:

reg score i.period#c.prompts c.score_pre c.age c.wsl i.period, vce(cluster id)
xtreg score i.period#c.prompts c.score_pre c.age c.wsl i.period, re vce(cluster id)

Again, these estimates will be the same. You're still allowing the coefficient on prompts to vary. It is easy to use the test or lincom commands to test if they are equal.

Edit: There is no reason to use xthybrid or the cre option with xtreg, as these will just reproduce the estimates above. Only if an x is truly time varying will the estimates differ.
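
A minimal sketch of the equality test mentioned in point 4, under the assumption that the estimation sample holds only the two post periods, coded 1 and 2, and using the same placeholder names as the commands above (score, prompts, score_pre, age, wsl, id):

```stata
* Sketch only: placeholder variable names as in the commands above;
* assumes period takes the values 1 and 2 in the estimation sample
reg score i.period#c.prompts c.score_pre c.age c.wsl i.period, vce(cluster id)

* Test whether the prompts slope is the same in both periods
test 1.period#c.prompts = 2.period#c.prompts

* Estimate the difference in slopes with a confidence interval
lincom 2.period#c.prompts - 1.period#c.prompts
```

Because prompts enters only through the interaction, each period gets its own slope coefficient, so the comparison is a contrast between two reported coefficients.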
Clyde Schechter

Join Date: Apr 2014

Posts: 30050
#6

26 May 2025, 13:06

Sorry, my recommendation to use -xtreg, cre- or -xthybrid- was based on thinking that there were some time-varying regressors in the model. On re-reading #1, I see that it was clear there that all of the regressors are time-invariant, but somehow I missed that. I apologize for leading O.P. astray here.

Vincent Li

Join Date: Dec 2016
Posts: 57

Yesterday, 07:04

Thank you very much, Jeff and Clyde! I will read Jeff's working paper this weekend.

Taking Clyde's suggestion, I used -dataex- to create an example dataset:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input int indi_num float CEtotal byte angrycetotal float CEtotal_01 byte(angrycetotal_01 period prompt_t1) int wsl byte age
 1 16 4 16 4 0  2 113 8
 2 14 4 14 4 0  0 110 7
 3 13 3 13 3 0  1 120 8
 4  3 0  3 0 0  4  60 7
 5 11 4 11 4 0 11  55 8
 6 13 4 13 4 0  0 116 6
 7 10 2 10 2 0  0 116 6
 8 12 2 12 2 0  0 113 6
 9  5 0  5 0 0 14  75 7
10  3 0  3 0 0  0  92 8
11  6 2  6 2 0  7  50 9
12 12 3 12 3 0  0  92 8
13  8 3  8 3 0  1  89 8
14 15 3 15 3 0  1  92 8
15  9 3  9 3 0  0 100 5
16 14 2 14 2 0  1  89 8
17 16 4 16 4 0  2 113 7
18 11 2 11 2 0  1  98 6
19 13 2 13 2 0 14  98 7
20  8 2  8 2 0  3 130 6
21 12 2 12 2 0  1  98 7
22 15 4 15 4 0  0 126 7
23 14 4 14 4 0  0 107 6
24  0 0  0 0 0  8  64 7
25  6 2  6 2 0  7  79 7
26 13 2 13 2 0  0  89 7
27 11 1 11 1 0  1 104 6
28 16 4 16 4 0  0  95 8
29 13 2 13 2 0  . 110 6
30 12 1 12 1 0  0 100 6
31  4 0  4 0 0  1  95 6
32 14 2 14 2 0  0 107 6
33  9 2  9 2 0  1  73 7
34 12 4 12 4 0  2 116 7
35 14 2 14 2 0  0 100 7
36 15 3 15 3 0  0  92 6
37  6 0  6 0 0  0 102 9
38  6 2  6 2 0  3  55 8
39 13 2 13 2 0  2  98 6
40  1 0  1 0 0  .  84 5
41  9 1  9 1 0  1 123 6
42  2 0  2 0 0  1  92 5
43  3 2  3 2 0  0  55 5
44 11 1 11 1 0  3 102 5
45 13 2 13 2 0 14  86 8
46 12 2 12 2 0  1 110 7
47 12 2 12 2 0  0  89 7
48 11 3 11 3 0  2  98 7
49  6 2  6 2 0  2  95 7
50 14 3 14 3 0  0 107 8
51  6 2  6 2 0 10  81 7
52  2 0  2 0 0  0  60 7
53  9 1  9 1 0  0  95 6
54 14 2 14 2 0  0  98 8
55 14 2 14 2 0  7 102 8
56  3 1  3 1 0  0 102 6
57  6 0  6 0 0  0  92 8
58 13 2 13 2 0  0  84 9
59  7 2  7 2 0  4 116 6
60  8 1  8 1 0  0 110 6
61  5 0  5 0 0  1  89 8
62 11 2 11 2 0  2  81 6
63 13 2 13 2 0  0 107 7
64 15 4 15 4 0  8 107 6
65 15 4 15 4 0  0 110 6
66 12 2 12 2 0  1 133 6
67  6 1  6 1 0  0  92 7
68  7 2  7 2 0  0  73 9
69 11 2 11 2 0  0 123 5
70  2 0  2 0 0  0 113 5
71  6 1  6 1 0  1 136 5
72  7 2  7 2 0  0 104 6
73  9 2  9 2 0  0 107 7
74 11 4 11 4 0  1 110 6
75  6 0  6 0 0  0  92 7
76 12 4 12 4 0  0 107 7
77 11 4 11 4 0  0 133 8
78  6 2  6 2 0  2 102 4
79 13 4 13 4 0  1  95 8
80 13 3 13 3 0  0 104 8
81  6 1  6 1 0  1 117 4
82  4 0  4 0 0  2 111 4
 1 14 2 16 4 1  2 113 8
 2 14 3 14 4 1  0 110 7
 3 14 3 13 3 1  1 120 8
 4  3 1  3 0 1  4  60 7
 5  8 1 11 4 1 11  55 8
 6 12 3 13 4 1  0 116 6
 7 14 3 10 2 1  0 116 6
 8 16 4 12 2 1  0 113 6
 9  1 0  5 0 1 14  75 7
10 11 2  3 0 1  0  92 8
11  3 0  6 2 1  7  50 9
12 15 3 12 3 1  0  92 8
13 12 2  8 3 1  1  89 8
14 13 2 15 3 1  1  92 8
15  7 1  9 3 1  0 100 5
16 13 3 14 2 1  1  89 8
17 16 4 16 4 1  2 113 7
18 11 3 11 2 1  1  98 6
end

Within the dataset, CEtotal and angrycetotal can serve as dependent variables, while CEtotal_01 and angrycetotal_01 represent the pretest scores. The period variable includes values 0, 1, and 2, corresponding to pre, po1, and po2; however, the example dataset only contains values 0 and 1. Additionally, prompt_t1, wsl, and age are continuous between-subject variables.

One question is, my code:

Code:

 
 xtreg CEtotal i.period##c.prompt_t1 CEtotal_01 wsl age, i(indi_num) re vce(robust)

is a bit different from Jeff's code:

Code:

xtreg score i.period#c.prompts c.score_pre c.age c.wsl i.period, re vce(cluster id)

Will including c.prompt_t1 in the model affect the estimation or mislead me in answering my question?

Another question concerns the difference between a cross-sectional OLS regression and a pooled regression or random-effects model for longitudinal data.
Specifically, if I treat the data as cross-sectional, the example dataset is as follows:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float(CEtotal CEtotal_01) int wsl byte(age prompt_t1)
16 16 113 8  2
14 14 110 7  0
13 13 120 8  1
 3  3  60 7  4
11 11  55 8 11
13 13 116 6  0
10 10 116 6  0
12 12 113 6  0
 5  5  75 7 14
 3  3  92 8  0
 6  6  50 9  7
12 12  92 8  0
 8  8  89 8  1
15 15  92 8  1
 9  9 100 5  0
14 14  89 8  1
16 16 113 7  2
11 11  98 6  1
13 13  98 7 14
 8  8 130 6  3
12 12  98 7  1
15 15 126 7  0
14 14 107 6  0
 0  0  64 7  8
 6  6  79 7  7
13 13  89 7  0
11 11 104 6  1
16 16  95 8  0
13 13 110 6  .
12 12 100 6  0
 4  4  95 6  1
14 14 107 6  0
 9  9  73 7  1
12 12 116 7  2
14 14 100 7  0
15 15  92 6  0
 6  6 102 9  0
 6  6  55 8  3
13 13  98 6  2
 1  1  84 5  .
 9  9 123 6  1
 2  2  92 5  1
 3  3  55 5  0
11 11 102 5  3
13 13  86 8 14
12 12 110 7  1
12 12  89 7  0
11 11  98 7  2
 6  6  95 7  2
14 14 107 8  0
 6  6  81 7 10
 2  2  60 7  0
 9  9  95 6  0
14 14  98 8  0
14 14 102 8  7
 3  3 102 6  0
 6  6  92 8  0
13 13  84 9  0
 7  7 116 6  4
 8  8 110 6  0
 5  5  89 8  1
11 11  81 6  2
13 13 107 7  0
15 15 107 6  8
15 15 110 6  0
12 12 133 6  1
 6  6  92 7  0
 7  7  73 9  0
11 11 123 5  0
 2  2 113 5  0
 6  6 136 5  1
 7  7 104 6  0
 9  9 107 7  0
11 11 110 6  1
 6  6  92 7  0
12 12 107 7  0
11 11 133 8  0
 6  6 102 4  2
13 13  95 8  1
13 13 104 8  0
 6  6 117 4  1
 4  4 111 4  2
14 16 113 8  2
14 14 110 7  0
14 13 120 8  1
 3  3  60 7  4
 8 11  55 8 11
12 13 116 6  0
14 10 116 6  0
16 12 113 6  0
 1  5  75 7 14
11  3  92 8  0
 3  6  50 9  7
15 12  92 8  0
12  8  89 8  1
13 15  92 8  1
 7  9 100 5  0
13 14  89 8  1
16 16 113 7  2
11 11  98 6  1
end

I then constructed a model that excludes the period variable, using the scores from po1 as the dependent variable and the scores from pre as a covariate:
Score_po1=a0+b0*score_pre+b1*prompt_t1+b2*age+b3*wsl +e

In this model, the "impact" of prompt_t1 on po1 scores, while controlling for pre scores, is estimated. This impact represents the change in po1 scores associated with a one-unit change in prompt_t1, assuming that pre scores remain constant.
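
A minimal sketch of that cross-sectional regression, assuming it is run on the long-format example data by keeping only the po1 rows (period == 1), with CEtotal_01 as the pretest score. The vce(robust) option is an assumption, not part of the equation above.

```stata
* Sketch only: long-format example data, po1 rows only;
* CEtotal_01 is the pretest score, vce(robust) is an assumed choice
regress CEtotal CEtotal_01 prompt_t1 age wsl if period == 1, vce(robust)
```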

Is this the same as, or does it have similar implications to, the "impact" of prompt_t1 in the pooled regression or random effects model? Specifically, does it indicate the correlation between changes in prompt_t1 and the score difference between pre and po1?

Code:

 
 xtreg score i.period#c.prompts c.score_pre c.age c.wsl i.period, re vce(cluster id)

What information might be overlooked when analyzing longitudinal data with a cross-sectional model? In this situation, period is the only time-varying variable; is it acceptable to use a regression without a time variable?

I have discussed this with others but still cannot get it clear, which is frustrating.

Thank you!


Vincent Li

Join Date: Dec 2016

Posts: 57
#8

Yesterday, 07:33

Oh, a stupid question. The effects of some time-invariant variables will be addressed when using a time fixed effects model.