Hi everyone,
I spend a reasonable amount of time analysing panel data using the -xtreg- command, and have been looking for a command which summarises data across the panels adequately.
I've managed to find a solution that sort-of does what I want, however my solution is hardly elegant, and quite cumbersome when I've used more recent datasets and more complex models - which is why I wanted to pose my problem to the Statalisters in case anyone on here has a neater solution.
If I explain my problem with a hypothetical dataset that highlights (in an extreme way) the problem I have:
So this dataset has 3 variables: caseno (a patient ID variable), score (a test score), and visit (a coding of which study visit the patient's score belongs to). The data is obviously therefore in long format.
There are 100 patients in the dataset, and 5 visits at which a score may have been observed. In this fictional dataset, all patients had a baseline observation (at visit 1), and for each subsequent visit, 20 patients had a missing observation. Crucially, no patient has more than one missing observation.
A quick summary therefore shows the following:
I ran a fixed-effects panel regression, to test how the score changes over the visits, giving the following output:
What is not immediately obvious from this model is the huge amount of discarded data. Stata drops patients in a casewise fashion, therefore the above estimate for _b[visit] is derived from only 20 of the 100 patients in the sample. There is little information in the table above to hint that this has occurred.
To check which patients ended up contributing to this beta, I thought to type:
But as you can see, this does not show the underlying issue – it doesn’t show which observations were excluded from the estimation of _b[visit].
The only way I've found to effectively summarise which patients are used in the xtreg fully is to reshape the data, and then run -tabstat-, with the -casewise- option, as follows:
This highlights how little data is actually used in the final estimation - just under one quarter of the actual observations.
In the datasets I analyse there are many variables with missing values in multiple variables at different visits, meaning that there is a high risk that many of the models will only use a small number of observations to generate the regression estimates. This information is therefore very useful to me, but relatively awkward to obtain, and not easy to spot from the initial regression output.
My question is this: Should I need to run several models like this in a given dataset, is there a simpler way to get a tabulation of the number of observations used in an -xtreg- model without having to reshape the data, and then run -tabstat, casewise-, reshape back, run the next model, etc?
This same issue applies to all panel models, as far as I can tell - I just use -xtreg, fe- as an example.
Any help would be greatly appreciated.
Many thanks,
Matthew Parkes
I spend a reasonable amount of time analysing panel data using the -xtreg- command, and have been looking for a command which summarises data across the panels adequately.
I've managed to find a solution that sort-of does what I want, however my solution is hardly elegant, and quite cumbersome when I've used more recent datasets and more complex models - which is why I wanted to pose my problem to the Statalisters in case anyone on here has a neater solution.
If I explain my problem with a hypothetical dataset that highlights (in an extreme way) the problem I have:
So this dataset has 3 variables: caseno (a patient ID variable), score (a test score), and visit (a coding of which study visit the patient's score belongs to). The data is obviously therefore in long format.
There are 100 patients in the dataset, and 5 visits at which a score may have been observed. In this fictional dataset, all patients had a baseline observation (at visit 1), and for each subsequent visit, 20 patients had a missing observation. Crucially, no patient has more than one missing observation.
A quick summary therefore shows the following:
Code:
. tabstat score, by(visit) statistics(n mean sd) Summary for variables: score by categories of: visit visit | N mean sd ---------+------------------------------ 1 | 100 24.74 13.61121 2 | 80 34.35 13.84736 3 | 80 47.575 16.17137 4 | 80 53.075 14.2986 5 | 80 64.2 15.27089 ---------+------------------------------ Total | 420 43.83333 20.34959
I ran a fixed-effects panel regression, to test how the score changes over the visits, giving the following output:
Code:
. xtreg score visit, i(caseno) fe Fixed-effects (within) regression Number of obs = 420 Group variable: caseno Number of groups = 100 R-sq: within = 0.5466 Obs per group: min = 4 between = 0.0840 avg = 4.2 overall = 0.4837 max = 5 F(1,319) = 384.50 corr(u_i, Xb) = -0.0080 Prob > F = 0.0000 ------------------------------------------------------------------------------ score | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- visit | 9.826471 .5011256 19.61 0.000 8.840542 10.8124 _cons | 15.28978 1.620849 9.43 0.000 12.10087 18.47868 -------------+---------------------------------------------------------------- sigma_u | 7.2268207 sigma_e | 14.610197 rho | .19657481 (fraction of variance due to u_i) ------------------------------------------------------------------------------ F test that all u_i=0: F(99, 319) = 1.02 Prob > F = 0.4476
What is not immediately obvious from this model is the huge amount of discarded data. Stata drops patients in a casewise fashion, therefore the above estimate for _b[visit] is derived from only 20 of the 100 patients in the sample. There is little information in the table above to hint that this has occurred.
To check which patients ended up contributing to this beta, I thought to type:
Code:
. tabstat score if e(sample), by(visit) statistics(n mean sd) Summary for variables: score by categories of: visit visit | N mean sd ---------+------------------------------ 1 | 100 24.74 13.61121 2 | 80 34.35 13.84736 3 | 80 47.575 16.17137 4 | 80 53.075 14.2986 5 | 80 64.2 15.27089 ---------+------------------------------ Total | 420 43.83333 20.34959 ----------------------------------------
The only way I've found to effectively summarise which patients are used in the xtreg fully is to reshape the data, and then run -tabstat-, with the -casewise- option, as follows:
Code:
. reshape wide score, i(study_id) j(visit) . tabstat score1 - score5, casewise statistics(n mean sd) stats | score1 score2 score3 score4 score5 ---------+-------------------------------------------------- N | 20 20 20 20 20 mean | 23.65 30.45 47.85 56.85 62.75 sd | 13.12801 13.18881 16.88124 12.36836 15.12709 ------------------------------------------------------------
This highlights how little data is actually used in the final estimation - just under one quarter of the actual observations.
In the datasets I analyse there are many variables with missing values in multiple variables at different visits, meaning that there is a high risk that many of the models will only use a small number of observations to generate the regression estimates. This information is therefore very useful to me, but relatively awkward to obtain, and not easy to spot from the initial regression output.
My question is this: Should I need to run several models like this in a given dataset, is there a simpler way to get a tabulation of the number of observations used in an -xtreg- model without having to reshape the data, and then run -tabstat, casewise-, reshape back, run the next model, etc?
This same issue applies to all panel models, as far as I can tell - I just use -xtreg, fe- as an example.
Any help would be greatly appreciated.
Many thanks,
Matthew Parkes
Comment