Casewise summaries of panel data, following estimation command

Matthew Parkes

Join Date: Sep 2014

Posts: 4
#1


01 Sep 2014, 05:35

Hi everyone,

I spend a reasonable amount of time analysing panel data using the -xtreg- command, and have been looking for a command which summarises data across the panels adequately.

I've managed to find a solution that sort of does what I want. However, it is hardly elegant, and it becomes quite cumbersome with more recent datasets and more complex models, which is why I wanted to pose my problem to the Statalisters in case anyone here has a neater solution.

Let me explain my problem with a hypothetical dataset that highlights it (in an extreme way).

The dataset has 3 variables: caseno (a patient ID variable), score (a test score), and visit (a code for the study visit to which the patient's score belongs). The data are therefore in long format.

There are 100 patients in the dataset, and 5 visits at which a score may have been observed. In this fictional dataset, all patients had a baseline observation (at visit 1), and for each subsequent visit, 20 patients had a missing observation. Crucially, no patient has more than one missing observation.

A quick summary therefore shows the following:

Code:

. tabstat score, by(visit) statistics(n mean sd)

Summary for variables: score
     by categories of: visit

   visit |         N      mean        sd
---------+------------------------------
       1 |       100     24.74  13.61121
       2 |        80     34.35  13.84736
       3 |        80    47.575  16.17137
       4 |        80    53.075   14.2986
       5 |        80      64.2  15.27089
---------+------------------------------
   Total |       420  43.83333  20.34959

I ran a fixed-effects panel regression, to test how the score changes over the visits, giving the following output:

Code:

. xtreg score visit, i(caseno) fe

Fixed-effects (within) regression               Number of obs      =       420
Group variable: caseno                          Number of groups   =       100

R-sq:  within  = 0.5466                         Obs per group: min =         4
       between = 0.0840                                        avg =       4.2
       overall = 0.4837                                        max =         5

                                                F(1,319)           =    384.50
corr(u_i, Xb)  = -0.0080                        Prob > F           =    0.0000

------------------------------------------------------------------------------
       score |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       visit |   9.826471   .5011256    19.61   0.000     8.840542     10.8124
       _cons |   15.28978   1.620849     9.43   0.000     12.10087    18.47868
-------------+----------------------------------------------------------------
     sigma_u |  7.2268207
     sigma_e |  14.610197
         rho |  .19657481   (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0:     F(99, 319) =     1.02             Prob > F = 0.4476

What is not immediately obvious from this model is the huge amount of discarded data. Stata drops patients in a casewise fashion, therefore the above estimate for _b[visit] is derived from only 20 of the 100 patients in the sample. There is little information in the table above to hint that this has occurred.

To check which patients ended up contributing to this beta, I thought to type:

Code:

. tabstat score if e(sample), by(visit) statistics(n mean sd)

Summary for variables: score
     by categories of: visit

   visit |         N      mean        sd
---------+------------------------------
       1 |       100     24.74  13.61121
       2 |        80     34.35  13.84736
       3 |        80    47.575  16.17137
       4 |        80    53.075   14.2986
       5 |        80      64.2  15.27089
---------+------------------------------
   Total |       420  43.83333  20.34959
----------------------------------------

But as you can see, this does not show the underlying issue – it doesn’t show which observations were excluded from the estimation of _b[visit].

The only way I've found to fully summarise which patients are used in the -xtreg- model is to reshape the data and then run -tabstat- with the -casewise- option, as follows:

Code:

. reshape wide score, i(caseno) j(visit)

. tabstat score1-score5, casewise statistics(n mean sd)

   stats |    score1    score2    score3    score4    score5
---------+--------------------------------------------------
       N |        20        20        20        20        20
    mean |     23.65     30.45     47.85     56.85     62.75
      sd |  13.12801  13.18881  16.88124  12.36836  15.12709
------------------------------------------------------------

This highlights how little data is actually used in the final estimation - just under one quarter of the actual observations.

In the datasets I analyse, many variables have missing values at different visits, so there is a high risk that many of the models will use only a small number of observations to generate the regression estimates. This information is therefore very useful to me, but relatively awkward to obtain, and not easy to spot from the initial regression output.

My question is this: Should I need to run several models like this in a given dataset, is there a simpler way to get a tabulation of the number of observations used in an -xtreg- model without having to reshape the data, and then run -tabstat, casewise-, reshape back, run the next model, etc?

This same issue applies to all panel models, as far as I can tell - I just use -xtreg, fe- as an example.

Any help would be greatly appreciated.

Many thanks,

Matthew Parkes

Last edited by Matthew Parkes; 01 Sep 2014, 05:40.
Phil Schumm

Join Date: Mar 2014

Posts: 169
#2

01 Sep 2014, 06:12

Originally posted by Matthew Parkes

Stata drops patients in a casewise fashion, therefore the above estimate for _b[visit] is derived from only 20 of the 100 patients in the sample.

Note that in your example above, all of the available data (420 observations spread over 100 individuals, as indicated in the output) are being used to fit the model (casewise deletion is not occurring here). However, in general, if you want to see what observations were used in the model just fit, you may use the e(sample) function, e.g.:
Code:

gen used = e(sample)

The variable used will then contain 1 for observations that were used by the estimation command, and 0 for those that were not.
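
Once that flag exists, it can be tabulated directly. The following is only a minimal sketch, assuming the long-format data from post #1 (caseno, score, visit) are in memory; the names nused and onepp are hypothetical and chosen purely for illustration.

```stata
* Sketch only: assumes caseno, score and visit from post #1 are in memory
xtreg score visit, i(caseno) fe

* Flag the observations that the estimation command actually used
generate byte used = e(sample)

* Observations used versus not used at each visit
tabulate visit used

* Number of used observations per patient (nused is a hypothetical name)
egen nused = total(used), by(caseno)

* Tag one row per patient, then tabulate how many patients
* contributed each number of observations
egen byte onepp = tag(caseno)
tabulate nused if onepp
```

The last table is one way to see at a glance how many patients contributed a full set of visits and how many contributed fewer.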
Matthew Parkes

Join Date: Sep 2014

Posts: 4
#3

01 Sep 2014, 06:18

Hi Phil.

You're absolutely right - I'd assumed that Stata employed casewise deletion with xtreg, but having looked into things a little further, it appears not. My mistake!

The e(sample) function is really useful, but because -xtreg- doesn't do casewise deletion, it doesn't let me quickly see how many complete cases I have for a given set of variables. The only way I can see to get that is to do as I suggested earlier: reshape, run -tabstat, casewise-, and reshape back. Is there a simpler way to do this without reshaping the data?

Thanks for your help.
Phil Schumm

Join Date: Mar 2014

Posts: 169
#4

01 Sep 2014, 06:29

There are a number of ways you might do this without reshaping the data, but with a small to moderate number of time points, the command xtdescribe is very useful. For example, assuming you have already xtset the data, you could use
Code:

xtdes if e(sample)

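
As one of the other ways of counting complete cases without reshaping, here is a rough sketch. It assumes the variables from post #1 (caseno, score, visit) and 5 planned visits; the names nscore, complete and pickone are hypothetical.

```stata
* Sketch only: assumes caseno, score and visit as in post #1, with 5 planned visits

* Count the non-missing scores each patient has (nscore is a hypothetical name)
egen nscore = count(score), by(caseno)

* A patient is a complete case if a score was observed at all 5 visits
generate byte complete = (nscore == 5)

* Summaries restricted to complete cases, with the data still in long format
tabstat score if complete, by(visit) statistics(n mean sd)

* Number of complete-case patients, counting one tagged row per patient
egen byte pickone = tag(caseno)
count if pickone & complete
```

On the example data this should correspond to the reshaped -tabstat, casewise- summary, since both keep only patients with a score at every visit. With several variables, the count would need to be taken over rows where all of them are non-missing.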
Matthew Parkes

Join Date: Sep 2014

Posts: 4
#5

01 Sep 2014, 06:43

That's exactly what I'm looking for. Perfect.

I'd used -xtdescribe- and -xtsum-, but not tried -if e(sample)-, which makes all the difference.

Thanks again for your help.
