I have a panel dataset of mothers observed across several waves, and I would like to produce descriptive statistics for the mothers who appear in a particular sample at particular waves.
In particular, for mothers who appear in at least two waves, I would like to report the percentage who have a certain outcome at each wave.
To do this I run the following:
Code:
tab overweight_wave1 if has_wave1_questionnaire==1 & has_wave2_questionnaire==1 | has_wave1_questionnaire==1 & has_wave3_questionnaire==1 | has_wave1_questionnaire==1 & has_wave2_questionnaire==1 & has_wave3_questionnaire==1

tab overweight_wave2 if has_wave1_questionnaire==1 & has_wave2_questionnaire==1 | has_wave1_questionnaire==1 & has_wave3_questionnaire==1 | has_wave1_questionnaire==1 & has_wave2_questionnaire==1 & has_wave3_questionnaire==1

tab overweight_wave3 if has_wave1_questionnaire==1 & has_wave2_questionnaire==1 | has_wave1_questionnaire==1 & has_wave3_questionnaire==1 | has_wave1_questionnaire==1 & has_wave2_questionnaire==1 & has_wave3_questionnaire==1
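The condition says that a respondent is part of my analysis if she appeared in wave 1 and wave 2, in wave 1 and wave 3, or in all three waves, i.e. if a mother appeared in at least 2 of the 3 waves I want to include her.
The variable "overweight" is measured in each of the three waves, so I repeat the command once for each wave in which it is recorded.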
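The inclusion rule could also be written once and reused. The following is only a minimal sketch, assuming the data are still in wide format; `in_sample` is a hypothetical flag name, not a variable in my dataset. It relies on the fact that the third clause of the original condition (all three waves) is already covered by the first two, so the rule reduces to "wave 1 and at least one of waves 2 and 3". Note that, written this way, a mother seen only in waves 2 and 3 is not included.

```stata
* in_sample is a hypothetical name introduced for illustration.
* Logically equivalent to the long if-condition used above:
* wave 1 questionnaire present AND (wave 2 OR wave 3 present).
gen byte in_sample = has_wave1_questionnaire==1 & ///
    (has_wave2_questionnaire==1 | has_wave3_questionnaire==1)

* Number of mothers who satisfy the rule
count if in_sample

* One tabulation per wave, all on the same set of mothers
foreach w in 1 2 3 {
    tab overweight_wave`w' if in_sample
}
```

Because the flag is built from `==1` comparisons, mothers with a missing questionnaire indicator get 0 rather than missing, so `if in_sample` should select the same rows as the original condition.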
The results are below:
Code:
tab overweight_wave1 if has_wave1_questionnaire==1 & has_wave2_questionnaire==1 | has_wave1_questionnaire==1 & has_wave3_questionnaire==1 | has_wave1_questionnaire==1 & has_wave2_questionnaire==1 & has_wave3_questionnaire==1

    Binary BMI |
    Overweight |      Freq.     Percent        Cum.
---------------+-----------------------------------
Not Overweight |        368       70.36       70.36
    Overweight |        155       29.64      100.00
---------------+-----------------------------------
         Total |        523      100.00

tab overweight_wave2 if has_wave1_questionnaire==1 & has_wave2_questionnaire==1 | has_wave1_questionnaire==1 & has_wave3_questionnaire==1 | has_wave1_questionnaire==1 & has_wave2_questionnaire==1 & has_wave3_questionnaire==1

    Binary BMI |
    Overweight |      Freq.     Percent        Cum.
---------------+-----------------------------------
Not Overweight |        223       46.85       46.85
    Overweight |        253       53.15      100.00
---------------+-----------------------------------
         Total |        476      100.00

tab overweight_wave3 if has_wave1_questionnaire==1 & has_wave2_questionnaire==1 | has_wave1_questionnaire==1 & has_wave3_questionnaire==1 | has_wave1_questionnaire==1 & has_wave2_questionnaire==1 & has_wave3_questionnaire==1

    Binary BMI |
    Overweight |      Freq.     Percent        Cum.
---------------+-----------------------------------
Not Overweight |        130       48.69       48.69
    Overweight |        137       51.31      100.00
---------------+-----------------------------------
         Total |        267      100.00
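I then add the percentage overweight from these results to my descriptive statistics table, to describe the percentage of this sample of 614 who were overweight at each wave.
Does this seem like a reasonable approach?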
Someone suggested that, since I will be running a regression on these panel data anyway (after reshaping them to long format), another way to get descriptive statistics for the estimation sample would be to run my xtreg regression first and then type the following after it:
Code:
. xtreg overweight_y total_unemployment_y i.schooling_y i.married_y i.socialass_y i.working_y i.age_y if has_y0_questionnaire==1 & has_y5_questionnaire==1 | has_y0_questionnaire==1 & has_y10_questionnaire==1 | has_y0_questionnaire==1 & has_y5_questionnaire==1 & has_y10_questionnaire==1, cluster (current_county_y1) re robust

Random-effects GLS regression                   Number of obs     =      1,133
Group variable: id                              Number of groups  =        556

R-sq:                                           Obs per group:
     within  = 0.1143                                         min =          1
     between = 0.0367                                         avg =        2.0
     overall = 0.0460                                         max =          3

                                                Wald chi2(21)     =          .
corr(u_i, X)   = 0 (assumed)                    Prob > chi2       =          .

                                                   (Std. Err. adjusted for 28 clusters in current_county_y1)
-----------------------------------------------------------------------------------------------------------------------------
                                                            |               Robust
                                        binbmi_overweight_y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
------------------------------------------------------------+----------------------------------------------------------------
                               psum_unemployed_total_cont_y |   .0035115   .0034119     1.03   0.303    -.0031758    .0101987
                                                            |
                                            own_education_y |
                                               No schooling |          0  (empty)
                                   Primary school education |          0  (omitted)
                                      Some secondary school |  -.0563388    .348275    -0.16   0.871    -.7389453    .6262676
                               Complete secondary education |   .0156475   .3483944     0.04   0.964     -.667193     .698488
     Some third level education at college, university, RTC |   .0435395   .3826468     0.11   0.909    -.7064344    .7935134
 Complete third level education at college, university, RTC |  -.0777283   .3348961    -0.23   0.816    -.7341126     .578656
                                                            |
                                            maritalstatus_y |
                                                 Cohabiting |   .0125737   .0442492     0.28   0.776    -.0741531    .0993004
                                                  Separated |   .2100415   .0979599     2.14   0.032     .0180437    .4020393
                                                   Divorced |  -.0461317   .1568389    -0.29   0.769    -.3535304    .2612669
                                                    Widowed |   .0103922   .1495897     0.07   0.945    -.2827982    .3035826
                                       Single/Never married |  -.1006385   .0886542    -1.14   0.256    -.2743976    .0731206
                                                            |
                                             medical_card_y |
                                                        Yes |    .109604   .0362536     3.02   0.003     .0385482    .1806598
                                                            |
                                               employment_y |
                                                 Unemployed |   .0937273   .0693045     1.35   0.176     -.042107    .2295615
   Unable to work owing to permanent sickness or disability |   .2762914   .1568829     1.76   0.078    -.0311934    .5837762
                                          At school/student |  -.0367244   .1006047    -0.37   0.715    -.2339059    .1604572
                            Seeking work for the first time |  -.0047052   .1996473    -0.02   0.981    -.3960066    .3865962
                                                   Employed |   -.078526   .0319799    -2.46   0.014    -.1412054   -.0158466
                                              Self Employed |   .0551793   .1316321     0.42   0.675    -.2028147    .3131734
                                                            |
                                                  ord_age_y |
                                                      20-23 |    .345725   .1389647     2.49   0.013     .0733592    .6180909
                                                      24-27 |   .4427777   .1610612     2.75   0.006     .1271035    .7584519
                                                      28-32 |   .4197833   .1499522     2.80   0.005     .1258823    .7136843
                                                       33 + |   .5414033   .1478685     3.66   0.000     .2515864    .8312202
                                                            |
                                                      _cons |  -.0447299   .3548179    -0.13   0.900    -.7401603    .6507005
------------------------------------------------------------+----------------------------------------------------------------
                                                    sigma_u |  .34267108
                                                    sigma_e |  .34640219
                                                        rho |  .49458548   (fraction of variance due to u_i)
-----------------------------------------------------------------------------------------------------------------------------

* followed by:

. tab overweight_wave1 if e(sample)

    Binary BMI |
    Overweight |      Freq.     Percent        Cum.
---------------+-----------------------------------
Not Overweight |        760       72.38       72.38
    Overweight |        290       27.62      100.00
---------------+-----------------------------------
         Total |      1,050      100.00

. tab overweight_wave2 if e(sample)

    Binary BMI |
    Overweight |      Freq.     Percent        Cum.
---------------+-----------------------------------
Not Overweight |        475       48.03       48.03
    Overweight |        514       51.97      100.00
---------------+-----------------------------------
         Total |        989      100.00

. tab overweight_wave3 if e(sample)

    Binary BMI |
    Overweight |      Freq.     Percent        Cum.
---------------+-----------------------------------
Not Overweight |        315       49.22       49.22
    Overweight |        325       50.78      100.00
---------------+-----------------------------------
         Total |        640      100.00
My concern is that a regression with a lot of covariates drops mothers who have missing covariate values, because the sample is no longer defined by the overweight variable alone.
The percentages may therefore be under- or overstated: they no longer describe the mothers from the sample of 614 who have an overweight value at each wave, but only those who have an overweight value at each wave and are also not excluded by a missing value on any of the included covariates.
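One way to see how much this matters would be to compare the estimation sample with the sample defined by the wave rule alone. The sketch below is an assumption about how that check could look, not something I have run: it is meant to be typed straight after the xtreg command, with the data in long format, and `in_est`, `in_rule`, `tag_est` and `tag_rule` are hypothetical names. It also assumes, as in the command above, that `overweight_y` is the outcome and `id` is the panel identifier. Since e(sample) marks person-wave rows in long data, the sketch tags one row per mother so that mothers rather than rows are counted.

```stata
* Run immediately after the xtreg command, data in long format.
* All four new variable names are hypothetical.
gen byte in_est = e(sample)

* Rows that satisfy the wave rule and have a non-missing outcome,
* regardless of whether any covariate is missing
gen byte in_rule = has_y0_questionnaire==1 & ///
    (has_y5_questionnaire==1 | has_y10_questionnaire==1) & ///
    !missing(overweight_y)

* Person-wave rows dropped only because a covariate is missing
count if in_rule & !in_est

* Tag one row per mother within each sample, then count mothers
egen byte tag_est  = tag(id) if in_est
egen byte tag_rule = tag(id) if in_rule
count if tag_est == 1
count if tag_rule == 1
```

If the two counts of mothers are close, the choice between the two approaches should make little difference to the percentages; a large gap would suggest that missing covariates are changing who is described.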
Which of these two approaches do you think is better?