
  • Descriptive statistics on a specific subsample of the data, following a regression or by themselves?

    I have a panel dataset of mothers across several waves, and I would like to create descriptive statistics for the mothers who appear in a certain sample at certain waves.

    In particular, for mothers who appear in at least two waves, I would like to describe the percentage who report a certain outcome at each wave.

    To do this I do the following:

    Code:
    tab overweight_wave1 if has_wave1_questionnaire==1 &  has_wave2_questionnaire==1 | has_wave1_questionnaire==1 & has_wave3_questionnaire==1 | has_wave1_questionnaire==1 & has_wave2_questionnaire==1 & has_wave3_questionnaire==1
    
    tab overweight_wave2 if has_wave1_questionnaire==1 &  has_wave2_questionnaire==1 | has_wave1_questionnaire==1 & has_wave3_questionnaire==1 | has_wave1_questionnaire==1 & has_wave2_questionnaire==1 & has_wave3_questionnaire==1
    
    tab overweight_wave3 if has_wave1_questionnaire==1 &  has_wave2_questionnaire==1 | has_wave1_questionnaire==1 & has_wave3_questionnaire==1 | has_wave1_questionnaire==1 & has_wave2_questionnaire==1 & has_wave3_questionnaire==1
    In other words, a respondent is included in my analysis if she appeared in wave 1 and wave 2, in wave 1 and wave 3, or in all three waves.

    That is, a mother is included if she appeared in wave 1 and in at least one later wave, so in at least 2 of the 3 waves.

    The variable "overweight" is measured in each of the three waves, so I repeat the above for each wave in which it is recorded.
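    The three-part condition above can be written more compactly, because in Stata `&` binds more tightly than `|`. Below is a minimal sketch, meant to be run as a do-file on made-up toy data; the `id` variable, the flag names `in_sample` and `in_sample_long`, and all values are hypothetical and only there to illustrate the idea.

```stata
* Toy wide data: one row per mother (all values are invented)
clear
input id has_wave1_questionnaire has_wave2_questionnaire has_wave3_questionnaire overweight_wave1 overweight_wave2 overweight_wave3
1 1 1 1 0 1 1
2 1 1 0 0 0 .
3 1 0 1 1 . 1
4 1 0 0 1 . .
5 0 1 1 . 1 0
end

* Original long condition, stored once as a 0/1 flag
generate byte in_sample_long = has_wave1_questionnaire==1 & has_wave2_questionnaire==1 | ///
    has_wave1_questionnaire==1 & has_wave3_questionnaire==1 | ///
    has_wave1_questionnaire==1 & has_wave2_questionnaire==1 & has_wave3_questionnaire==1

* Shorter equivalent: wave 1 plus at least one later wave
generate byte in_sample = has_wave1_questionnaire==1 & ///
    (has_wave2_questionnaire==1 | has_wave3_questionnaire==1)

* Check that both flags agree on every row
assert in_sample == in_sample_long

* One tabulation per wave, all using the same flag
foreach w of numlist 1/3 {
    tabulate overweight_wave`w' if in_sample
}
```

    Storing the condition in a flag once means the same subsample definition is reused in every command, so it cannot drift between tabulations. Note that mother 5 in the toy data (waves 2 and 3 only) is excluded under either version of the condition.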

    The results are below:

    Code:
    tab overweight_wave1 if has_wave1_questionnaire==1 &  has_wave2_questionnaire==1 | has_wave1_questionnaire==1 & has_wave3_questionnaire==1 | has_wave1_questionnaire==1 & has_wave2_questionnaire==1 & has_wave3_questionnaire==1
    
        Binary BMI |
        Overweight |      Freq.     Percent        Cum.
    ---------------+-----------------------------------
    Not Overweight |        368       70.36       70.36
        Overweight |        155       29.64      100.00
    ---------------+-----------------------------------
             Total |        523      100.00
    
    tab overweight_wave2 if has_wave1_questionnaire==1 &  has_wave2_questionnaire==1 | has_wave1_questionnaire==1 & has_wave3_questionnaire==1 | has_wave1_questionnaire==1 & has_wave2_questionnaire==1 & has_wave3_questionnaire==1
    
        Binary BMI |
        Overweight |      Freq.     Percent        Cum.
    ---------------+-----------------------------------
    Not Overweight |        223       46.85       46.85
        Overweight |        253       53.15      100.00
    ---------------+-----------------------------------
             Total |        476      100.00
    
    tab overweight_wave3 if has_wave1_questionnaire==1 &  has_wave2_questionnaire==1 | has_wave1_questionnaire==1 & has_wave3_questionnaire==1 | has_wave1_questionnaire==1 & has_wave2_questionnaire==1 & has_wave3_questionnaire==1
    
        Binary BMI |
        Overweight |      Freq.     Percent        Cum.
    ---------------+-----------------------------------
    Not Overweight |        130       48.69       48.69
        Overweight |        137       51.31      100.00
    ---------------+-----------------------------------
             Total |        267      100.00
    I then add the percentage overweight from each of these tables to my descriptive statistics table, to describe the share of this sample of 614 mothers who were overweight at each wave.
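    Note that the denominators in the three tables differ (523, 476 and 267), because the overweight variable is missing for some mothers in some waves. A rough sketch of how the percentage and its own N could be pulled out per wave is below. It assumes the overweight variables are coded 0/1, and it uses made-up toy data with a hypothetical `id` variable and a hypothetical `in_sample` flag.

```stata
* Toy wide data: one row per mother (all values are invented)
clear
input id has_wave1_questionnaire has_wave2_questionnaire has_wave3_questionnaire overweight_wave1 overweight_wave2 overweight_wave3
1 1 1 1 0 1 1
2 1 1 0 0 0 .
3 1 0 1 1 . 1
4 1 0 0 1 . .
5 0 1 1 . 1 0
end

* Wave 1 plus at least one later wave
generate byte in_sample = has_wave1_questionnaire==1 & ///
    (has_wave2_questionnaire==1 | has_wave3_questionnaire==1)

* For a 0/1 variable the mean is the proportion overweight;
* r(N) counts only mothers with a non-missing value in that wave
foreach w of numlist 1/3 {
    quietly summarize overweight_wave`w' if in_sample
    display "Wave `w': N = " r(N) ", percent overweight = " %5.2f 100*r(mean)
}
```

    Reporting the N next to each percentage makes it visible that each wave's figure rests on a different number of mothers.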

    Does this seem like a reasonable approach?

    Someone suggested that, since I will be running a regression on these panel data anyway (after they have been transformed to long format), another way to get descriptive statistics for the estimation sample would be to run my xtreg regression first and then, after it, type:

    Code:
    
    . xtreg overweight_y total_unemployment_y i.schooling_y i.married_y i.socialass_y i.working_y i.age_y if has_y0_questionnaire==1 &  has_y5_questionnaire==1 | has_y0_questionnaire==1 & has_y10_questionnaire==1 | has_y0_questionnaire==1 & has_y5_questionnaire==1 & has_y10_questionnaire==1, cluster (current_county_y1) re robust
    
    
    Random-effects GLS regression                   Number of obs     =      1,133
    Group variable: id                              Number of groups  =        556
    
    R-sq:                                           Obs per group:
         within  = 0.1143                                         min =          1
         between = 0.0367                                         avg =        2.0
         overall = 0.0460                                         max =          3
    
                                                    Wald chi2(21)     =          .
    corr(u_i, X)   = 0 (assumed)                    Prob > chi2       =          .
    
                                                                        (Std. Err. adjusted for 28 clusters in current_county_y1)
    -----------------------------------------------------------------------------------------------------------------------------
                                                                |               Robust
                                            binbmi_overweight_y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    ------------------------------------------------------------+----------------------------------------------------------------
                                   psum_unemployed_total_cont_y |   .0035115   .0034119     1.03   0.303    -.0031758    .0101987
                                                                |
                                                own_education_y |
                                                  No schooling  |          0  (empty)
                                      Primary school education  |          0  (omitted)
                                         Some secondary school  |  -.0563388    .348275    -0.16   0.871    -.7389453    .6262676
                                  Complete secondary education  |   .0156475   .3483944     0.04   0.964     -.667193     .698488
        Some third level education at college, university, RTC  |   .0435395   .3826468     0.11   0.909    -.7064344    .7935134
    Complete third level education at college, university, RTC  |  -.0777283   .3348961    -0.23   0.816    -.7341126     .578656
                                                                |
                                                maritalstatus_y |
                                                    Cohabiting  |   .0125737   .0442492     0.28   0.776    -.0741531    .0993004
                                                     Separated  |   .2100415   .0979599     2.14   0.032     .0180437    .4020393
                                                      Divorced  |  -.0461317   .1568389    -0.29   0.769    -.3535304    .2612669
                                                       Widowed  |   .0103922   .1495897     0.07   0.945    -.2827982    .3035826
                                          Single/Never married  |  -.1006385   .0886542    -1.14   0.256    -.2743976    .0731206
                                                                |
                                                 medical_card_y |
                                                           Yes  |    .109604   .0362536     3.02   0.003     .0385482    .1806598
                                                                |
                                                   employment_y |
                                                    Unemployed  |   .0937273   .0693045     1.35   0.176     -.042107    .2295615
      Unable to work owing to permanent sickness or disability  |   .2762914   .1568829     1.76   0.078    -.0311934    .5837762
                                             At school/student  |  -.0367244   .1006047    -0.37   0.715    -.2339059    .1604572
                               Seeking work for the first time  |  -.0047052   .1996473    -0.02   0.981    -.3960066    .3865962
                                                      Employed  |   -.078526   .0319799    -2.46   0.014    -.1412054   -.0158466
                                                 Self Employed  |   .0551793   .1316321     0.42   0.675    -.2028147    .3131734
                                                                |
                                                      ord_age_y |
                                                         20-23  |    .345725   .1389647     2.49   0.013     .0733592    .6180909
                                                         24-27  |   .4427777   .1610612     2.75   0.006     .1271035    .7584519
                                                         28-32  |   .4197833   .1499522     2.80   0.005     .1258823    .7136843
                                                          33 +  |   .5414033   .1478685     3.66   0.000     .2515864    .8312202
                                                                |
                                                          _cons |  -.0447299   .3548179    -0.13   0.900    -.7401603    .6507005
    ------------------------------------------------------------+----------------------------------------------------------------
                                                        sigma_u |  .34267108
                                                        sigma_e |  .34640219
                                                            rho |  .49458548   (fraction of variance due to u_i)
    -----------------------------------------------------------------------------------------------------------------------------
    
    . * followed by:
    
    . tab overweight_wave1 if e(sample)
    
        Binary BMI |
        Overweight |      Freq.     Percent        Cum.
    ---------------+-----------------------------------
    Not Overweight |        760       72.38       72.38
        Overweight |        290       27.62      100.00
    ---------------+-----------------------------------
             Total |      1,050      100.00
    
    . tab overweight_wave2 if e(sample)
    
        Binary BMI |
        Overweight |      Freq.     Percent        Cum.
    ---------------+-----------------------------------
    Not Overweight |        475       48.03       48.03
        Overweight |        514       51.97      100.00
    ---------------+-----------------------------------
             Total |        989      100.00
    
    . tab overweight_wave3 if e(sample)
    
        Binary BMI |
        Overweight |      Freq.     Percent        Cum.
    ---------------+-----------------------------------
    Not Overweight |        315       49.22       49.22
        Overweight |        325       50.78      100.00
    ---------------+-----------------------------------
             Total |        640      100.00

    My concern is that, by basing the sample on a regression with many covariates, I lose mothers who have a missing value on any covariate, because the sample is no longer defined by the overweight variable alone.

    The percentages may therefore be under- or overstated. They no longer describe the mothers from the sample of 614 who have an overweight measurement at a given wave; instead they describe only those mothers who have an overweight measurement at that wave and who are also not excluded by a missing value on any included covariate.
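    One way to see how many observations the covariates remove is to save e(sample) as a variable and cross it with an outcome-only flag. The sketch below is only illustrative: it uses made-up long-format toy data, a single hypothetical covariate `unemployment_y`, hypothetical flag names `in_est` and `has_outcome`, and plain `regress` standing in for the xtreg model (e(sample) is set in the same way after any estimation command).

```stata
* Toy long data: one row per mother per wave (all values are invented)
clear
input id wave overweight_y unemployment_y
1 1 0 5.1
1 2 1 6.0
1 3 1 .
2 1 0 4.2
2 2 0 4.8
3 1 1 .
3 3 1 7.3
4 1 0 3.9
4 2 1 5.5
4 3 0 6.1
end

* Stand-in model; rows with a missing covariate are dropped automatically
regress overweight_y unemployment_y

* Keep the estimation sample as a variable so it survives later commands
generate byte in_est = e(sample)

* Rows that would be used if only the outcome had to be non-missing
generate byte has_outcome = !missing(overweight_y)

* Off-diagonal cell = rows with an outcome that the covariates removed
tabulate has_outcome in_est

* Percent overweight by wave under each sample definition
tabulate overweight_y wave if has_outcome, column
tabulate overweight_y wave if in_est, column
```

    If the two by-wave tables give similar percentages, the choice of sample matters little for the descriptive table; if they differ noticeably, that is worth reporting.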

    Which of my approaches do you think is best?

  • #2
    In work I've participated in, we've reported the descriptive statistics for the observations on which the models are built. I believe that is the common practice.

    I note that the tabulations you show following your xtreg have inflated numbers of observations, because each individual appears 2 or 3 times in the dataset. I think you may have wanted something more like
    Code:
    tab overweight_y wave, column
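    To illustrate why the counts are inflated: in long format a wide variable such as overweight_wave1 is repeated on every row of the same mother, so tabulating it counts her once per wave she appears in. A small sketch on made-up toy data follows; the `id` and `wave` variables, the `one_per_mother` tag and all values are hypothetical.

```stata
* Toy long data: overweight_wave1 is constant within each mother
clear
input id wave overweight_y overweight_wave1
1 1 0 0
1 2 1 0
1 3 1 0
2 1 0 0
2 2 0 0
3 1 1 1
3 3 1 1
end

* Counts 7 rows, although there are only 3 mothers
tabulate overweight_wave1

* Tag exactly one row per mother, then count each mother once
egen one_per_mother = tag(id)
tabulate overweight_wave1 if one_per_mother

* Or tabulate the long-format outcome by wave, with column percentages
tabulate overweight_y wave, column
```

    After an estimation command, adding `if e(sample)` to the last tabulate would restrict it to the observations the model actually used.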
