Dear all,
I have an unbalanced panel data set with N (110 companies) > T (5 years). I first ran a pooled OLS regression (-regress-) and then panel regressions (-xtreg-), comparing the results as a robustness check. My model is as follows:
ROA = c.Var1_c##i.Industry Var2 Var3 Var4 Var5 i.Year, with Var1 being compensation to the CEO, Var2-5 being control variables, and Industry being a categorical variable (coded 1 to 10 for the different industries).
First, I winsorized my data at the 5th and 95th percentiles to deal with outliers. I checked the OLS assumptions and, as a consequence, transformed some variables (for linearity) and mean-centered my key independent variable (to reduce multicollinearity with the interaction term). As one would expect, I do have heteroscedasticity (-estat hettest-) and autocorrelation (with -gen time = _n-; -tsset time-; and -dwstat-) in my data.
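To make the preprocessing steps above concrete, here is a minimal sketch on simulated data. The variable names (comp, roa, comp_w, comp_c) and all numbers are hypothetical stand-ins, not the actual variables, and the winsorizing is done with built-in commands rather than any particular user-written package.

```stata
* Simulated stand-in for the real data; all names and values are hypothetical
clear
set seed 12345
set obs 500
generate double comp = exp(rnormal(3, 1))               // skewed stand-in for CEO compensation
generate double roa  = 0.05 + 0.0004*comp + rnormal(0, 0.03)

* Winsorize comp at the 5th and 95th percentiles using built-in commands
_pctile comp, percentiles(5 95)
scalar p5  = r(r1)
scalar p95 = r(r2)
generate double comp_w = min(max(comp, scalar(p5)), scalar(p95))

* Mean-center the winsorized variable before building an interaction with it
summarize comp_w, meanonly
generate double comp_c = comp_w - r(mean)

* Pooled OLS followed by the Breusch-Pagan / Cook-Weisberg heteroskedasticity test
regress roa comp_c
estat hettest
```

Values below the 5th percentile are replaced by the 5th percentile and values above the 95th by the 95th, so no observations are dropped.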
Question 1: How do I account for autocorrelation AND heteroscedasticity in pooled OLS? I understand that for the former I can use -prais ..., corc-, and for the latter -regress ..., vce(robust)-, but I have failed to find a combined method.
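For orientation, the sketch below runs -regress- with vce(robust) next to a cluster-robust variant on a simulated panel. Using vce(cluster id) in pooled OLS is an assumption brought in for illustration, not something confirmed in this thread. Clustered standard errors allow for both heteroskedasticity and arbitrary correlation within a cluster. All variable names and numbers are hypothetical.

```stata
* Simulated firm-year panel; all names and values are hypothetical
clear
set seed 2024
set obs 100                                   // 100 hypothetical firms
generate id = _n
generate double u = rnormal(0, 0.03)          // firm-specific component, induces within-firm correlation
expand 5                                      // 5 yearly observations per firm
bysort id: generate year = 2013 + _n
generate double x = rnormal()
generate double y = 0.05 + 0.01*x + u + rnormal(0, 0.015*(1 + abs(x)))   // heteroskedastic noise

* Heteroskedasticity-robust standard errors only
regress y x, vce(robust)
scalar se_robust = _se[x]

* Standard errors robust to heteroskedasticity and within-firm correlation
regress y x, vce(cluster id)
scalar se_cluster = _se[x]

display "robust SE: " se_robust "   cluster SE: " se_cluster
```

The point estimates are identical in both runs; only the standard errors differ.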
See the result of my pooled OLS regression below:
Question 2: Would you consider this an appropriate model? Am I missing something?
For my panel regressions I used the same winsorized/transformed data.
Question 3: Is this considered normal, or would one take the original (for some part) non-linear, non-normally-distributed data?
I then ran panel regressions (-xtreg, fe/re-) and tested for autocorrelation (-xtserial ..., output-, without the categorical variables and the interaction term) and heteroscedasticity (-xttest3-). Having confirmed that both exist in my panel data, I accounted for them by running -xtreg ..., re vce(cluster Company_ID)- after the Hausman test.
Question 4: Is using - ,re vce(cluster Company_ID)- correct in order to account for both, or should I conduct FGLS (-xtgls ..., p(h) c(ar1)-) or PCSE analyses (-xtpcse ..., het c(ar1)-)?
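The three alternatives named in Question 4 can be run side by side as in the sketch below. It uses a simulated balanced panel with hypothetical variable names (id, year, x, y), so it only illustrates the syntax and says nothing about which estimator suits the real data. The abbreviated options from the question are written out in full here.

```stata
* Simulated balanced panel; all names and values are hypothetical
clear
set seed 4321
set obs 50                                    // 50 hypothetical firms
generate id = _n
generate double u = rnormal(0, 0.03)          // firm-specific effect
expand 6                                      // 6 yearly observations per firm
bysort id: generate year = 2013 + _n
xtset id year
generate double x = rnormal()
generate double y = 0.05 + 0.01*x + u + rnormal(0, 0.02)

* Random effects with cluster-robust standard errors
xtreg y x, re vce(cluster id)

* FGLS with panel-level heteroskedasticity and a common AR(1) coefficient
xtgls y x, panels(heteroskedastic) corr(ar1)

* Prais-Winsten estimates with panel-corrected standard errors, heteroskedasticity only
xtpcse y x, hetonly correlation(ar1)
```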
Question 5: Would you consider this an appropriate approach? Am I missing something?
Given that Var1_c has the same sign and similar significance in pooled OLS and random effects, I would conclude that the results obtained from pooled OLS seem reasonable and accept/reject my hypothesis on that basis.
Question 6: Would this be a correct way to do this?
Thank you very much for bearing with me so long. I am looking forward to your answers.
Best regards,
Pietro
Output of the pooled OLS regression:
. regress ROA_new c.Var1_c##ib6.IndustryRank Var2 Var3 Var4 Var5 i.Year, vce(robust)
Linear regression Number of obs = 472
F(28, 443) = 34.84
Prob > F = 0.0000
R-squared = 0.4664
Root MSE = .03384
-----------------------------------------------------------------------------------------
| Robust
ROA_new | Coef. Std. Err. t P>|t| [95% Conf. Interval]
------------------------+----------------------------------------------------------------
Var1_c | .0004192 .0001773 2.37 0.018 .0000709 .0007676
|
IndustryRank |
Communication Services | .0129088 .0072345 1.78 0.075 -.0013093 .027127
Consumer Discretionary | .0188844 .0071505 2.64 0.009 .0048314 .0329375
Consumer Staples | .0078738 .0148976 0.53 0.597 -.0214049 .0371525
Financials | -.0030436 .0073608 -0.41 0.679 -.0175101 .0114228
Health Care | .0181798 .0069513 2.62 0.009 .0045182 .0318415
Information Technology | .0339861 .0076653 4.43 0.000 .0189212 .049051
Materials | .0007409 .0056497 0.13 0.896 -.0103627 .0118444
Real Estate | -.0062964 .0064746 -0.97 0.331 -.0190212 .0064284
Utilities | -.0269126 .0061301 -4.39 0.000 -.0389604 -.0148648
|
IndustryRank#c.Var1_c |
Communication Services | -.000053 .0002856 -0.19 0.853 -.0006143 .0005083
Consumer Discretionary | .0001062 .0002586 0.41 0.682 -.0004021 .0006145
Consumer Staples | .0004539 .0005145 0.88 0.378 -.0005572 .0014651
Financials | -.0002145 .0001982 -1.08 0.280 -.0006041 .0001751
Health Care | .0003999 .000239 1.67 0.095 -.0000699 .0008697
Information Technology | .0004263 .000286 1.49 0.137 -.0001358 .0009884
Materials | .0004491 .0002949 1.52 0.129 -.0001305 .0010288
Real Estate | .000397 .000238 1.67 0.096 -.0000708 .0008648
Utilities | -.0001756 .000237 -0.74 0.459 -.0006415 .0002902
|
Var2 | -.0339362 .0150897 -2.25 0.025 -.0635926 -.0042799
Var3 | .0479924 .0242645 1.98 0.049 .0003044 .0956803
Var4 | -.0126286 .0016375 -7.71 0.000 -.0158468 -.0094103
Var5 | .0003169 .0011124 0.28 0.776 -.0018693 .0025032
|
Year |
2015 | .001048 .005721 0.18 0.855 -.0101957 .0122917
2016 | .001855 .0052554 0.35 0.724 -.0084736 .0121835
2017 | .0068407 .0051132 1.34 0.182 -.0032084 .0168898
2018 | .0058702 .0051116 1.15 0.251 -.0041758 .0159161
2019 | .0056374 .0054482 1.03 0.301 -.0050702 .016345
|
_cons | .2496132 .0273422 9.13 0.000 .1958766 .3033498
-----------------------------------------------------------------------------------------
Output of the panel regressions and tests:
. xtset Company_ID Year
panel variable: Company_ID (unbalanced)
time variable: Year, 2014 to 2019, but with gaps
delta: 1 unit
. xtreg ROA_new c.Var1_c##ib6.IndustryRank Var2 Var3 Var4 Var5 i.Year, re
Random-effects GLS regression Number of obs = 472
Group variable: Company_ID Number of groups = 106
R-sq: Obs per group:
within = 0.2685 min = 1
between = 0.3871 avg = 4.5
overall = 0.4329 max = 6
Wald chi2(28) = 187.42
corr(u_i, X) = 0 (assumed) Prob > chi2 = 0.0000
-----------------------------------------------------------------------------------------
ROA_new | Coef. Std. Err. z P>|z| [95% Conf. Interval]
------------------------+----------------------------------------------------------------
Var1_c | .00071 .0001607 4.42 0.000 .000395 .001025
|
IndustryRank |
Communication Services | .0097942 .0136338 0.72 0.473 -.0169275 .036516
Consumer Discretionary | .0130758 .0124651 1.05 0.294 -.0113553 .0375069
Consumer Staples | .0087058 .0202608 0.43 0.667 -.0310046 .0484163
Financials | -.0081966 .0166936 -0.49 0.623 -.0409154 .0245223
Health Care | .0134478 .0135422 0.99 0.321 -.0130943 .0399899
Information Technology | .0324283 .0149377 2.17 0.030 .0031509 .0617057
Materials | -.0025767 .0128767 -0.20 0.841 -.0278146 .0226613
Real Estate | -.0099463 .0232338 -0.43 0.669 -.0554838 .0355912
Utilities | -.0306672 .0206443 -1.49 0.137 -.0711293 .009795
|
IndustryRank#c.Var1_c |
Communication Services | -.0000123 .0003093 -0.04 0.968 -.0006185 .0005938
Consumer Discretionary | -.0001672 .0002001 -0.84 0.403 -.0005595 .000225
Consumer Staples | -.0006347 .0002995 -2.12 0.034 -.0012218 -.0000476
Financials | -.000613 .0003204 -1.91 0.056 -.0012409 .000015
Health Care | -.0004573 .0002939 -1.56 0.120 -.0010333 .0001188
Information Technology | -.0002168 .0003547 -0.61 0.541 -.000912 .0004783
Materials | .0002684 .0002352 1.14 0.254 -.0001924 .0007293
Real Estate | .0000822 .0006748 0.12 0.903 -.0012404 .0014048
Utilities | -.000399 .0003905 -1.02 0.307 -.0011645 .0003664
|
Var2 | -.0373825 .0139206 -2.69 0.007 -.0646665 -.0100985
Var3 | .0421203 .0217148 1.94 0.052 -.00044 .0846806
Var4 | -.0106307 .0025466 -4.17 0.000 -.0156219 -.0056396
Var5 | -.0007089 .0008282 -0.86 0.392 -.0023321 .0009143
|
Year |
2015 | .0009376 .0024598 0.38 0.703 -.0038835 .0057588
2016 | .0003204 .0024665 0.13 0.897 -.0045139 .0051546
2017 | .0038309 .0024417 1.57 0.117 -.0009546 .0086165
2018 | .0013017 .0025037 0.52 0.603 -.0036054 .0062088
2019 | -.0014532 .0027073 -0.54 0.591 -.0067594 .0038531
|
_cons | .2299991 .0409079 5.62 0.000 .149821 .3101771
------------------------+----------------------------------------------------------------
sigma_u | .0357301
sigma_e | .01435012
rho | .86110172 (fraction of variance due to u_i)
-----------------------------------------------------------------------------------------
. est store re1
. xtreg ROA_new c.Var1_c##ib6.IndustryRank Var2 Var3 Var4 Var5 i.Year, fe
note: 1.IndustryRank omitted because of collinearity
note: 2.IndustryRank omitted because of collinearity
note: 3.IndustryRank omitted because of collinearity
note: 4.IndustryRank omitted because of collinearity
note: 5.IndustryRank omitted because of collinearity
note: 7.IndustryRank omitted because of collinearity
note: 8.IndustryRank omitted because of collinearity
note: 9.IndustryRank omitted because of collinearity
note: 10.IndustryRank omitted because of collinearity
Fixed-effects (within) regression Number of obs = 472
Group variable: Company_ID Number of groups = 106
R-sq: Obs per group:
within = 0.2722 min = 1
between = 0.2660 avg = 4.5
overall = 0.3108 max = 6
F(19,347) = 6.83
corr(u_i, Xb) = 0.0466 Prob > F = 0.0000
-----------------------------------------------------------------------------------------
ROA_new | Coef. Std. Err. t P>|t| [95% Conf. Interval]
------------------------+----------------------------------------------------------------
Var1_c | .0007539 .00017 4.43 0.000 .0004195 .0010883
|
IndustryRank |
Communication Services | 0 (omitted)
Consumer Discretionary | 0 (omitted)
Consumer Staples | 0 (omitted)
Financials | 0 (omitted)
Health Care | 0 (omitted)
Information Technology | 0 (omitted)
Materials | 0 (omitted)
Real Estate | 0 (omitted)
Utilities | 0 (omitted)
|
IndustryRank#c.Var1_c |
Communication Services | .0002298 .0003752 0.61 0.541 -.0005083 .0009678
Consumer Discretionary | -.0002387 .0002111 -1.13 0.259 -.0006538 .0001765
Consumer Staples | -.0007517 .0003146 -2.39 0.017 -.0013706 -.0001329
Financials | -.0006464 .0003789 -1.71 0.089 -.0013916 .0000988
Health Care | -.0006811 .0003347 -2.03 0.043 -.0013393 -.0000228
Information Technology | -.0003157 .0004111 -0.77 0.443 -.0011243 .0004928
Materials | .0002191 .0002482 0.88 0.378 -.0002691 .0007072
Real Estate | .0000297 .0007433 0.04 0.968 -.0014321 .0014916
Utilities | -.0004243 .0004024 -1.05 0.292 -.0012158 .0003672
|
Var2 | -.048388 .0168456 -2.87 0.004 -.0815203 -.0152556
Var3 | .0406569 .0238885 1.70 0.090 -.0063276 .0876414
Var4 | -.0106254 .0056596 -1.88 0.061 -.0217569 .000506
Var5 | -.000814 .0008762 -0.93 0.354 -.0025374 .0009094
|
Year |
2015 | .0008862 .0024982 0.35 0.723 -.0040273 .0057997
2016 | .0002122 .0025832 0.08 0.935 -.0048685 .005293
2017 | .0035253 .0025814 1.37 0.173 -.0015518 .0086025
2018 | .0008915 .0028079 0.32 0.751 -.0046312 .0064143
2019 | -.0017244 .0031846 -0.54 0.589 -.0079881 .0045392
|
_cons | .2380737 .0926853 2.57 0.011 .055778 .4203694
------------------------+----------------------------------------------------------------
sigma_u | .03773385
sigma_e | .01435012
rho | .87364723 (fraction of variance due to u_i)
-----------------------------------------------------------------------------------------
F test that all u_i=0: F(105, 347) = 22.42 Prob > F = 0.0000
. est store fe1
. xttest3
Modified Wald test for groupwise heteroskedasticity
in fixed effect regression model
H0: sigma(i)^2 = sigma^2 for all i
chi2 (106) = 5.2e+31
Prob>chi2 = 0.0000
. xtserial ROA_new c.Var1_c##ib6.IndustryRank Var2 Var3 Var4 Var5 i.Year, output
factor-variable and time-series operators not allowed
r(101);
. xtserial ROA_new Var1_c Var2 Var3 Var4 Var5, output
Linear regression Number of obs = 364
F(5, 95) = 12.17
Prob > F = 0.0000
R-squared = 0.2013
Root MSE = .01611
(Std. Err. adjusted for 96 clusters in Company_ID)
------------------------------------------------------------------------------
| Robust
D.ROA_new | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
Var1_c |
D1. | .0004306 .000123 3.50 0.001 .0001865 .0006747
|
Var2 |
D1. | -.0765561 .0178377 -4.29 0.000 -.1119683 -.0411439
|
Var3 |
D1. | .0629007 .0284957 2.21 0.030 .0063295 .1194718
|
Var4 |
D1. | -.0152056 .0053296 -2.85 0.005 -.0257861 -.0046251
|
Var5 |
D1. | -.0006843 .0006922 -0.99 0.325 -.0020586 .00069
------------------------------------------------------------------------------
Wooldridge test for autocorrelation in panel data
H0: no first-order autocorrelation
F( 1, 90) = 11.909
Prob > F = 0.0009
. hausman fe1 re1
---- Coefficients ----
| (b) (B) (b-B) sqrt(diag(V_b-V_B))
| fe1 re1 Difference S.E.
-------------+----------------------------------------------------------------
Var1_c | .0007539 .00071 .0000439 .0000555
IndustryRank#|
c.Var1_c |
1 | .0002298 -.0000123 .0002421 .0002125
2 | -.0002387 -.0001672 -.0000715 .0000671
3 | -.0007517 -.0006347 -.000117 .0000963
4 | -.0006464 -.000613 -.0000334 .0002023
5 | -.0006811 -.0004573 -.0002238 .0001601
7 | -.0003157 -.0002168 -.0000989 .0002079
8 | .0002191 .0002684 -.0000494 .0000795
9 | .0000297 .0000822 -.0000525 .0003116
10 | -.0004243 -.000399 -.0000253 .0000971
Var2 | -.048388 -.0373825 -.0110055 .0094863
Var3 | .0406569 .0421203 -.0014635 .0099562
Var4 | -.0106254 -.0106307 5.29e-06 .0050543
Var5 | -.000814 -.0007089 -.0001051 .0002861
Year |
2015 | .0008862 .0009376 -.0000515 .0004362
2016 | .0002122 .0003204 -.0001081 .0007677
2017 | .0035253 .0038309 -.0003056 .0008378
2018 | .0008915 .0013017 -.0004102 .0012713
2019 | -.0017244 -.0014532 -.0002713 .001677
------------------------------------------------------------------------------
b = consistent under Ho and Ha; obtained from xtreg
B = inconsistent under Ha, efficient under Ho; obtained from xtreg
Test: Ho: difference in coefficients not systematic
chi2(19) = (b-B)'[(V_b-V_B)^(-1)](b-B)
= 12.71
Prob>chi2 = 0.8533
. xtreg ROA_new c.Var1_c##ib6.IndustryRank Var2 Var3 Var4 Var5 i.Year, re vce(cluster Company_ID)
Random-effects GLS regression Number of obs = 472
Group variable: Company_ID Number of groups = 106
R-sq: Obs per group:
within = 0.2685 min = 1
between = 0.3871 avg = 4.5
overall = 0.4329 max = 6
Wald chi2(28) = 313.61
corr(u_i, X) = 0 (assumed) Prob > chi2 = 0.0000
(Std. Err. adjusted for 106 clusters in Company_ID)
-----------------------------------------------------------------------------------------
| Robust
ROA_new | Coef. Std. Err. z P>|z| [95% Conf. Interval]
------------------------+----------------------------------------------------------------
Var1_c | .00071 .0002914 2.44 0.015 .0001389 .001281
|
IndustryRank |
Communication Services | .0097942 .0152693 0.64 0.521 -.020133 .0397215
Consumer Discretionary | .0130758 .0124774 1.05 0.295 -.0113794 .0375311
Consumer Staples | .0087058 .0248933 0.35 0.727 -.0400842 .0574959
Financials | -.0081966 .0115438 -0.71 0.478 -.030822 .0144289
Health Care | .0134478 .0150852 0.89 0.373 -.0161187 .0430143
Information Technology | .0324283 .0149935 2.16 0.031 .0030415 .061815
Materials | -.0025767 .0120635 -0.21 0.831 -.0262206 .0210673
Real Estate | -.0099463 .0086922 -1.14 0.253 -.0269827 .0070901
Utilities | -.0306672 .012636 -2.43 0.015 -.0554333 -.005901
|
IndustryRank#c.Var1_c |
Communication Services | -.0000123 .0005877 -0.02 0.983 -.0011642 .0011395
Consumer Discretionary | -.0001672 .0004009 -0.42 0.677 -.0009529 .0006185
Consumer Staples | -.0006347 .0003247 -1.95 0.051 -.0012711 1.69e-06
Financials | -.000613 .0003028 -2.02 0.043 -.0012065 -.0000194
Health Care | -.0004573 .0003238 -1.41 0.158 -.0010919 .0001773
Information Technology | -.0002168 .0003364 -0.64 0.519 -.0008762 .0004425
Materials | .0002684 .0004278 0.63 0.530 -.00057 .0011069
Real Estate | .0000822 .0003729 0.22 0.826 -.0006487 .0008131
Utilities | -.000399 .0003621 -1.10 0.270 -.0011087 .0003107
|
Var2 | -.0373825 .0152058 -2.46 0.014 -.0671854 -.0075796
Var3 | .0421203 .0229078 1.84 0.066 -.0027781 .0870188
Var4 | -.0106307 .0026749 -3.97 0.000 -.0158735 -.005388
Var5 | -.0007089 .0009975 -0.71 0.477 -.0026639 .0012461
|
Year |
2015 | .0009376 .0021478 0.44 0.662 -.003272 .0051473
2016 | .0003204 .0021163 0.15 0.880 -.0038274 .0044682
2017 | .0038309 .0028211 1.36 0.174 -.0016983 .0093602
2018 | .0013017 .0031098 0.42 0.676 -.0047934 .0073969
2019 | -.0014532 .0034477 -0.42 0.673 -.0082105 .0053041
|
_cons | .2299991 .0442903 5.19 0.000 .1431917 .3168064
------------------------+----------------------------------------------------------------
sigma_u | .0357301
sigma_e | .01435012
rho | .86110172 (fraction of variance due to u_i)
-----------------------------------------------------------------------------------------

[...] the total cost of a given health care programme does follow a gamma distribution, which is positively skewed with a long right tail, as some patients need longer-than-average therapies and/or may experience adverse events that are expensive to manage. That said, by ruling out the so-called outliers, you are effectively altering your original dataset, and nobody can tell you the direction and magnitude of the bias that this imposes on your analysis. In addition, normality is a weak requirement that applies to the distribution of the residuals only (and is oftentimes an oversold one).
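A small simulation can illustrate the point about winsorizing a right-skewed variable. This is only a sketch: the gamma parameters (shape 2, scale 500) and the variable names are hypothetical, chosen so that the true mean is 1000.

```stata
* Simulated cost data; parameters and names are hypothetical
clear
set seed 987
set obs 10000
generate double cost = rgamma(2, 500)         // shape 2, scale 500: true mean is 2*500 = 1000
summarize cost, meanonly
scalar mean_raw = r(mean)

* Winsorize at the 5th and 95th percentiles with built-in commands
_pctile cost, percentiles(5 95)
scalar p5  = r(r1)
scalar p95 = r(r2)
generate double cost_w = min(max(cost, scalar(p5)), scalar(p95))
summarize cost_w, meanonly
scalar mean_w = r(mean)

display "raw mean: " mean_raw "   winsorized mean: " mean_w
```

Because the right tail is much longer than the left, pulling in both tails at symmetric percentiles lowers the mean here, so the winsorized variable no longer describes the original cost distribution.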