Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Casewise summaries of panel data, following estimation command

    Hi everyone,

    I spend a reasonable amount of time analysing panel data using the -xtreg- command, and have been looking for a command which summarises data across the panels adequately.

    I've managed to find a solution that sort-of does what I want, however my solution is hardly elegant, and quite cumbersome when I've used more recent datasets and more complex models - which is why I wanted to pose my problem to the Statalisters in case anyone on here has a neater solution.

    If I explain my problem with a hypothetical dataset that highlights (in an extreme way) the problem I have:

    So this dataset has 3 variables: caseno (a patient ID variable), score (a test score), and visit (a coding of which study visit the patient's score belongs to). The data is obviously therefore in long format.

    There are 100 patients in the dataset, and 5 visits at which a score may have been observed. In this fictional dataset, all patients had a baseline observation (at visit 1), and for each subsequent visit, 20 patients had a missing observation. Crucially, no patient has more than one missing observation.




    A quick summary therefore shows the following:

    Code:
    . tabstat score, by(visit) statistics(n mean sd)
    
    Summary for variables: score
         by categories of: visit
    
       visit |         N      mean        sd
    ---------+------------------------------
           1 |       100     24.74  13.61121
           2 |        80     34.35  13.84736
           3 |        80    47.575  16.17137
           4 |        80    53.075   14.2986
           5 |        80      64.2  15.27089
    ---------+------------------------------
       Total |       420  43.83333  20.34959



    I ran a fixed-effects panel regression, to test how the score changes over the visits, giving the following output:


    Code:
    .         xtreg score visit, i(caseno) fe
    
    Fixed-effects (within) regression               Number of obs      =       420
    Group variable: caseno                          Number of groups   =       100
    
    R-sq:  within  = 0.5466                         Obs per group: min =         4
           between = 0.0840                                        avg =       4.2
           overall = 0.4837                                        max =         5
    
                                                    F(1,319)           =    384.50
    corr(u_i, Xb)  = -0.0080                        Prob > F           =    0.0000
    
    ------------------------------------------------------------------------------
           score |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
           visit |   9.826471   .5011256    19.61   0.000     8.840542     10.8124
           _cons |   15.28978   1.620849     9.43   0.000     12.10087    18.47868
    -------------+----------------------------------------------------------------
         sigma_u |  7.2268207
         sigma_e |  14.610197
             rho |  .19657481   (fraction of variance due to u_i)
    ------------------------------------------------------------------------------
    F test that all u_i=0:     F(99, 319) =     1.02             Prob > F = 0.4476




    What is not immediately obvious from this model is the huge amount of discarded data. Stata drops patients in a casewise fashion, therefore the above estimate for _b[visit] is derived from only 20 of the 100 patients in the sample. There is little information in the table above to hint that this has occurred.

    To check which patients ended up contributing to this beta, I thought to type:
    Code:
    . tabstat score if e(sample), by(visit) statistics(n mean sd)
    
    Summary for variables: score
         by categories of: visit
    
       visit |         N      mean        sd
    ---------+------------------------------
           1 |       100     24.74  13.61121
           2 |        80     34.35  13.84736
           3 |        80    47.575  16.17137
           4 |        80    53.075   14.2986
           5 |        80      64.2  15.27089
    ---------+------------------------------
       Total |       420  43.83333  20.34959
    ----------------------------------------
    But as you can see, this does not show the underlying issue – it doesn’t show which observations were excluded from the estimation of _b[visit].

    The only way I've found to effectively summarise which patients are used in the xtreg fully is to reshape the data, and then run -tabstat-, with the -casewise- option, as follows:


    Code:
    . reshape wide score, i(study_id) j(visit)
    
    . tabstat score1 - score5, casewise statistics(n mean sd)
    
       stats |    score1    score2    score3    score4    score5
    ---------+--------------------------------------------------
           N |        20        20        20        20        20
        mean |     23.65     30.45     47.85     56.85     62.75
          sd |  13.12801  13.18881  16.88124  12.36836  15.12709
    ------------------------------------------------------------

    This highlights how little data is actually used in the final estimation - just under one quarter of the actual observations.

    In the datasets I analyse there are many variables with missing values in multiple variables at different visits, meaning that there is a high risk that many of the models will only use a small number of observations to generate the regression estimates. This information is therefore very useful to me, but relatively awkward to obtain, and not easy to spot from the initial regression output.



    My question is this: Should I need to run several models like this in a given dataset, is there a simpler way to get a tabulation of the number of observations used in an -xtreg- model without having to reshape the data, and then run -tabstat, casewise-, reshape back, run the next model, etc?

    This same issue applies to all panel models, as far as I can tell - I just use -xtreg, fe- as an example.

    Any help would be greatly appreciated.


    Many thanks,


    Matthew Parkes
    Last edited by Matthew Parkes; 01 Sep 2014, 05:40.

  • #2
    Originally posted by Matthew Parkes View Post
    Stata drops patients in a casewise fashion, therefore the above estimate for _b[visit] is derived from only 20 of the 100 patients in the sample.
    Note that in your example above, all of the available data (420 observations spread over 100 individuals, as indicated in the output) are being used to fit the model (casewise deletion is not occurring here). However, in general, if you want to see what observations were used in the model just fit, you may use the e(sample) function, e.g.:
    gen used = e(sample)
    The variable used will then contain 1 for observations that were used by the estimation command, and 0 for those that were not.

    Comment


    • #3
      Hi Phil.

      You're absolutely right - I'd assumed that Stata employed casewise deletion with xtreg, but having looked into things a little further, it appears not. My mistake!

      The e(sample) function is really useful, but it doesn't allow me to quickly see how many complete cases I have for a given set of variables, particularly since xtreg doesn't do casewise deletion - the only way I think I can see how many complete cases I have, is to do as I suggested earlier, namely reshape, run -tabstat, casewise-, and reshape back. Is there a simpler way to do this, without having to reshape the data?

      Thanks for your help.

      Comment


      • #4
        There are a number of ways you might do this without reshaping the data, but with a small to moderate number of time points, the command xtdescribe is very useful. For example, assuming you have already xtset the data, you could use
        xtdes if e(sample)

        Comment


        • #5
          That's exactly what I'm looking for. Perfect.

          I'd used -xtdescribe- and -xtsum-, but not tried -if e(sample)-, which makes all the difference.

          Thanks again for your help.

          Comment

          Working...
          X