Descriptive statistics on a specific subsample of the data, following a regression or by themselves?

John Adler

Join Date: Apr 2017
Posts: 173

16 Jun 2018, 07:19

I have a panel dataset of mothers across several waves, and I would like to create some descriptive statistics for those mothers who appear in a certain sample at certain waves.

In particular, for mothers who appear in at least two waves, I would like to describe the percentage that report having a certain outcome at each wave.

To do this I do the following:

Code:

tab overweight_wave1 if has_wave1_questionnaire==1 &  has_wave2_questionnaire==1 | has_wave1_questionnaire==1 & has_wave3_questionnaire==1 | has_wave1_questionnaire==1 & has_wave2_questionnaire==1 & has_wave3_questionnaire==1

tab overweight_wave2 if has_wave1_questionnaire==1 &  has_wave2_questionnaire==1 | has_wave1_questionnaire==1 & has_wave3_questionnaire==1 | has_wave1_questionnaire==1 & has_wave2_questionnaire==1 & has_wave3_questionnaire==1

tab overweight_wave3 if has_wave1_questionnaire==1 &  has_wave2_questionnaire==1 | has_wave1_questionnaire==1 & has_wave3_questionnaire==1 | has_wave1_questionnaire==1 & has_wave2_questionnaire==1 & has_wave3_questionnaire==1

That is, if a respondent appeared in wave 1 and wave 2, in wave 1 and wave 3, or in all three waves, I want her to be part of my analysis.

i.e. if a mother appeared in at least 2 out of the 3 waves I want to include her.

The variable "overweight" is measured in each of the three waves, so I repeat the above for each wave in which it is recorded.
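
The three-part condition above can be written more compactly. Stata evaluates & before |, so the condition reduces to "in wave 1, and in wave 2 or wave 3". The following is only a sketch, assuming the wide-format variables shown above; in_sample, n_waves and in_two_plus are hypothetical names introduced here for illustration. Note that the condition as written requires wave 1, so a mother seen only in waves 2 and 3 is not included. If "at least two of the three waves" is what is wanted, the second indicator counts waves instead.

```stata
* Hypothetical indicator: same logic as the condition used in the tab commands above
* (wave 1 together with wave 2 or wave 3)
gen byte in_sample = has_wave1_questionnaire == 1 & (has_wave2_questionnaire == 1 | has_wave3_questionnaire == 1)

* Alternative (an assumption about the intent): any two of the three waves
gen byte n_waves = (has_wave1_questionnaire == 1) + (has_wave2_questionnaire == 1) + (has_wave3_questionnaire == 1)
gen byte in_two_plus = n_waves >= 2

* See how the two definitions differ before choosing one
tab in_sample in_two_plus

* Tabulate the outcome at each wave within the chosen sample
foreach w in 1 2 3 {
    tab overweight_wave`w' if in_sample
}
```

Storing the condition in one variable keeps the three tabulations consistent and makes it easy to reuse the same sample definition in later commands.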

I then take the results, shown below:

Code:

tab overweight_wave1 if has_wave1_questionnaire==1 &  has_wave2_questionnaire==1 | has_wave1_questionnaire==1 & has_wave3_questionnaire==1 | has_wave1_questionnaire==1 & has_wave2_questionnaire==1 & has_wave3_questionnaire==1

    Binary BMI |
    Overweight |      Freq.     Percent        Cum.
---------------+-----------------------------------
Not Overweight |        368       70.36       70.36
    Overweight |        155       29.64      100.00
---------------+-----------------------------------
         Total |        523      100.00

tab overweight_wave2 if has_wave1_questionnaire==1 &  has_wave2_questionnaire==1 | has_wave1_questionnaire==1 & has_wave3_questionnaire==1 | has_wave1_questionnaire==1 & has_wave2_questionnaire==1 & has_wave3_questionnaire==1

    Binary BMI |
    Overweight |      Freq.     Percent        Cum.
---------------+-----------------------------------
Not Overweight |        223       46.85       46.85
    Overweight |        253       53.15      100.00
---------------+-----------------------------------
         Total |        476      100.00

tab overweight_wave3 if has_wave1_questionnaire==1 &  has_wave2_questionnaire==1 | has_wave1_questionnaire==1 & has_wave3_questionnaire==1 | has_wave1_questionnaire==1 & has_wave2_questionnaire==1 & has_wave3_questionnaire==1

    Binary BMI |
    Overweight |      Freq.     Percent        Cum.
---------------+-----------------------------------
Not Overweight |        130       48.69       48.69
    Overweight |        137       51.31      100.00
---------------+-----------------------------------
         Total |        267      100.00

I then add the percentage overweight from these results to the descriptive statistics table, to describe the percentage of this sample of 614 who were overweight at each wave.

Does this seem like a reasonable approach?

Someone suggested that, since I will be running a regression on these panel data anyway (after they have been transformed to the long format), another way to produce the descriptive statistics for the estimation sample would be to first run my xtreg regression and then type the tabulations after it:

Code:


. xtreg overweight_y total_unemployment_y i.schooling_y i.married_y i.socialass_y i.working_y i.age_y if has_y0_questionnaire==1 &  has_y5_questionnaire==1 | has_y0_questionnaire==1 & has_y10_questionnaire==1 | has_y0_questionnaire==1 & has_y5_questionnaire==1 & has_y10_questionnaire==1, cluster (current_county_y1) re robust


Random-effects GLS regression                   Number of obs     =      1,133
Group variable: id                              Number of groups  =        556

R-sq:                                           Obs per group:
     within  = 0.1143                                         min =          1
     between = 0.0367                                         avg =        2.0
     overall = 0.0460                                         max =          3

                                                Wald chi2(21)     =          .
corr(u_i, X)   = 0 (assumed)                    Prob > chi2       =          .

                                                                    (Std. Err. adjusted for 28 clusters in current_county_y1)
-----------------------------------------------------------------------------------------------------------------------------
                                                            |               Robust
                                        binbmi_overweight_y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
------------------------------------------------------------+----------------------------------------------------------------
                               psum_unemployed_total_cont_y |   .0035115   .0034119     1.03   0.303    -.0031758    .0101987
                                                            |
                                            own_education_y |
                                              No schooling  |          0  (empty)
                                  Primary school education  |          0  (omitted)
                                     Some secondary school  |  -.0563388    .348275    -0.16   0.871    -.7389453    .6262676
                              Complete secondary education  |   .0156475   .3483944     0.04   0.964     -.667193     .698488
    Some third level education at college, university, RTC  |   .0435395   .3826468     0.11   0.909    -.7064344    .7935134
Complete third level education at college, university, RTC  |  -.0777283   .3348961    -0.23   0.816    -.7341126     .578656
                                                            |
                                            maritalstatus_y |
                                                Cohabiting  |   .0125737   .0442492     0.28   0.776    -.0741531    .0993004
                                                 Separated  |   .2100415   .0979599     2.14   0.032     .0180437    .4020393
                                                  Divorced  |  -.0461317   .1568389    -0.29   0.769    -.3535304    .2612669
                                                   Widowed  |   .0103922   .1495897     0.07   0.945    -.2827982    .3035826
                                      Single/Never married  |  -.1006385   .0886542    -1.14   0.256    -.2743976    .0731206
                                                            |
                                             medical_card_y |
                                                       Yes  |    .109604   .0362536     3.02   0.003     .0385482    .1806598
                                                            |
                                               employment_y |
                                                Unemployed  |   .0937273   .0693045     1.35   0.176     -.042107    .2295615
  Unable to work owing to permanent sickness or disability  |   .2762914   .1568829     1.76   0.078    -.0311934    .5837762
                                         At school/student  |  -.0367244   .1006047    -0.37   0.715    -.2339059    .1604572
                           Seeking work for the first time  |  -.0047052   .1996473    -0.02   0.981    -.3960066    .3865962
                                                  Employed  |   -.078526   .0319799    -2.46   0.014    -.1412054   -.0158466
                                             Self Employed  |   .0551793   .1316321     0.42   0.675    -.2028147    .3131734
                                                            |
                                                  ord_age_y |
                                                     20-23  |    .345725   .1389647     2.49   0.013     .0733592    .6180909
                                                     24-27  |   .4427777   .1610612     2.75   0.006     .1271035    .7584519
                                                     28-32  |   .4197833   .1499522     2.80   0.005     .1258823    .7136843
                                                      33 +  |   .5414033   .1478685     3.66   0.000     .2515864    .8312202
                                                            |
                                                      _cons |  -.0447299   .3548179    -0.13   0.900    -.7401603    .6507005
------------------------------------------------------------+----------------------------------------------------------------
                                                    sigma_u |  .34267108
                                                    sigma_e |  .34640219
                                                        rho |  .49458548   (fraction of variance due to u_i)
-----------------------------------------------------------------------------------------------------------------------------

. * followed by:

. tab overweight_wave1 if e(sample)

    Binary BMI |
    Overweight |      Freq.     Percent        Cum.
---------------+-----------------------------------
Not Overweight |        760       72.38       72.38
    Overweight |        290       27.62      100.00
---------------+-----------------------------------
         Total |      1,050      100.00

. tab overweight_wave2 if e(sample)

    Binary BMI |
    Overweight |      Freq.     Percent        Cum.
---------------+-----------------------------------
Not Overweight |        475       48.03       48.03
    Overweight |        514       51.97      100.00
---------------+-----------------------------------
         Total |        989      100.00

. tab overweight_wave3 if e(sample)

    Binary BMI |
    Overweight |      Freq.     Percent        Cum.
---------------+-----------------------------------
Not Overweight |        315       49.22       49.22
    Overweight |        325       50.78      100.00
---------------+-----------------------------------
         Total |        640      100.00

My concern is that, by doing this after a regression with a lot of covariates, I am losing mothers who have missing covariates, because I am no longer focusing purely on the overweight variable.

Thus the percentages may be under- or overstated. They no longer reflect every mother from the sample of 614 who has a value for overweight at a given wave. Instead, they reflect only those mothers who have a value for overweight at that wave and who are also not excluded because of a missing value on any of the included covariates.
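
One way to see how much the covariates shrink the sample is to save the estimation-sample flag and inspect it directly. The following is only a sketch, assuming it is run on the long-format data immediately after the xtreg command above, with id as the panel identifier (as in the regression output) and the covariate names taken from that command; in_est and tag_id are hypothetical names introduced here.

```stata
* Save the estimation-sample flag before another estimation command overwrites e(sample)
gen byte in_est = e(sample)

* Tag one row per mother among the rows used in the regression, then count mothers
egen byte tag_id = tag(id) if in_est
count if tag_id == 1

* Report how many values are missing on the outcome and on each covariate
misstable summarize overweight_y total_unemployment_y schooling_y married_y socialass_y working_y age_y
```

Comparing the count of mothers with the 614 in the descriptive sample shows how many are dropped, and the misstable output indicates which variables are responsible.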

Which of my approaches do you think is best?

Tags: descriptive, panel data, regression, syntax

William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

16 Jun 2018, 19:07

In work I've participated in, we've reported the descriptive statistics for the observations on which the models are built. I believe that is the common practice.

I note that the tabulations you show following your xtreg have inflated numbers of observations, because each individual appears 2 or 3 times in the dataset. I think you may have wanted something more like

Code:

tab overweight_y wave, column
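
A minimal sketch of how that tabulation could be restricted to the observations used in the model, assuming it is run on the long-format data immediately after the xtreg command and that the wave identifier is called wave (an assumed name, as in the line above):

```stata
* Assumes long-format data with one row per mother per wave
* wave is an assumed variable name for the wave identifier
* e(sample) equals 1 for rows used by the most recent estimation command
tab overweight_y wave if e(sample), column
```

Because each row is one mother at one wave, the column percentages give the share overweight at each wave without counting any mother more than once per wave.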