Hi all,
Using my own data, I am currently working on a replication of the following article: Currie, J., Duque, V., & Garfinkel, I. (2015). The Great Recession and mothers' health. The Economic Journal, 125(588) (which can be found here: http://onlinelibrary.wiley.com/doi/1...12239/abstract).
I have a panel dataset across three waves (year 0, year 5, and year 10) which includes the health outcomes and behaviors of respondents. I do the following in order to prepare for my analysis:
Code:
// SET DIRECTORY AND BUILD THE DATASET
cd "$mainpath"
quietly do 3_waves_import_dataset_building_and_analysis_13_02_18_pre_transsem.do

// RESHAPE FROM WIDE TO LONG: one row per person per wave
reshape long bin_residence_y psum_unemployed_total_cont_y household_income_y health_y    ///
    current_county_y binary_health_y bmi_y binbmi_overweight_y binbmi_underweight_y      ///
    binbmi_obese_y ord_bmi_y any_tobacco_y only_cigarettes_y occ_cigarettes_y            ///
    reg_cigarettes_y no_cigs_cons_deflated_y no_cigs_cons_more10_y triedstopsmoking_y    ///
    times_quit_cigarettes_y smokeintention_y smokeintention_binary_y lastdrank_y         ///
    usually_drink_y days_drink_y drink_count_y prescribed_medication_y bin_mild_ex_y     ///
    mild_exercise_y bin_moderate_ex_y no_activity_y strenuous_exercise_y                 ///
    bin_strenous_ex_y moderate_exercise_y residence_y accommodation_y home_owner_y       ///
    bin_home_owner_y health_insurance_y own_education_y age_y ord_age_y medical_card_y   ///
    employment_y binary_employment_y binmartatus_y, i(id) j(year)

xtset id year               // declare the panel structure: person id, wave year
quietly tab1 year, gen(yr)  // wave dummies
drop if year == 10          // keep only year 0 and year 5
Looking at the Excel file that this dataset is built from, I know that the number of people who reported their self-rated health (binary_health_y) is 1,095 in year 0 and 558 in year 5. As I want to look at change in health across waves, my understanding is that I can look at a maximum of 558 people's self-rated health, since at most 558 people could have information in both waves.
However, when I run the following regression of binary self-rated health, good or bad (binary_health_y), on the total number of people unemployed in the county in which a person lives (psum_unemployed_total_cont_y):
Code:
xtlogit binary_health_y psum_unemployed_total_cont_y
I get the following results:
Code:
. xtlogit binary_health_y psum_unemployed_total_cont_y

Fitting comparison model:

Iteration 0:   log likelihood = -973.25989
Iteration 1:   log likelihood = -973.24175
Iteration 2:   log likelihood = -973.24175

Fitting full model:

tau =  0.0     log likelihood = -973.24175
tau =  0.1     log likelihood = -969.80706
tau =  0.2     log likelihood =  -966.3578
tau =  0.3     log likelihood = -962.98739
tau =  0.4     log likelihood = -959.84848
tau =  0.5     log likelihood = -957.20783
tau =  0.6     log likelihood = -955.57269
tau =  0.7     log likelihood = -956.02716

Iteration 0:   log likelihood = -955.57204
Iteration 1:   log likelihood = -946.86439
Iteration 2:   log likelihood =  -946.5548
Iteration 3:   log likelihood = -946.55444
Iteration 4:   log likelihood = -946.55444

Random-effects logistic regression              Number of obs     =      1,613
Group variable: id                              Number of groups  =      1,077

Random effects u_i ~ Gaussian                   Obs per group:
                                                              min =          1
                                                              avg =        1.5
                                                              max =          2

Integration method: mvaghermite                 Integration pts.  =         12

                                                Wald chi2(1)      =       0.02
Log likelihood  = -946.55444                    Prob > chi2       =     0.8855

----------------------------------------------------------------------------------------------
             binary_health_y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-----------------------------+----------------------------------------------------------------
psum_unemployed_total_cont_y |  -.0071684   .0497788    -0.14   0.885    -.1047331    .0903964
                       _cons |   1.384735   .4005664     3.46   0.001     .5996391     2.16983
-----------------------------+----------------------------------------------------------------
                    /lnsig2u |   1.187308   .2524483                      .6925184    1.682097
-----------------------------+----------------------------------------------------------------
                     sigma_u |   1.810592   .2285404                      1.413769    2.318798
                         rho |   .4991151   .0631119                      .3779334    .6204009
----------------------------------------------------------------------------------------------
LR test of rho=0: chibar2(01) = 53.37                  Prob >= chibar2 = 0.000
In this output, the number of groups is reported as 1,077. It is my understanding that the number of groups is equal to the sample size, so why is this number so much larger than the true number of observations that I see in the Excel file?
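To try to see what is going on, I also thought about looking at the participation patterns directly, along these lines (just a sketch, restricted to the two variables in the regression above):
Code:
// participation patterns (e.g. "11", "1.", ".1") for the estimation sample
xtdescribe if !missing(binary_health_y, psum_unemployed_total_cont_y)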
What I want to know is how many people answered the questionnaire at both time points. Is there a way that I can find the true number of people who provided information at both time points, so that I can compare it with this output?
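For example, would something along these lines give me that count? (A rough sketch using my own variable names; I am not sure it is the standard way to do it.)
Code:
// after dropping year 10, each id has a row for year 0 and a row for year 5;
// count people with a non-missing outcome in both remaining waves
bysort id: egen n_waves = total(!missing(binary_health_y))
count if n_waves == 2 & year == 0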
Relatedly, these numbers are very small, much smaller than I had originally hoped for my dataset. Is there an alternative method of analysis I can make use of in order to avoid losing so much data? For example, maybe I could run a regression for each wave independently of the others and avoid a panel data approach altogether.
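What I have in mind there is something like two separate cross-sectional regressions (again only a sketch):
Code:
// one cross-sectional logit per wave instead of a panel model
logit binary_health_y psum_unemployed_total_cont_y if year == 0
logit binary_health_y psum_unemployed_total_cont_y if year == 5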
Or am I worrying too much? With random effects instead of fixed effects, and a linear probability model (which I intend to use later for easier interpretation of the results), is it OK to treat the "Number of groups" as the number of observations in the tables of journal articles and papers? And do we not have to worry about double counting the unique individuals who answered questionnaires across waves and may therefore appear twice?
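For reference, the linear probability model I have in mind would be something along these lines (a sketch; I am unsure whether clustering the standard errors by person is the right choice here):
Code:
// random-effects linear probability model, standard errors clustered by person
xtreg binary_health_y psum_unemployed_total_cont_y, re vce(cluster id)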
I'm starting to feel that there may be some theoretical underpinning to random-effects estimation that I'm missing, possibly because I come from such a strongly "fixed effects" background: some theoretical reason why, even though a person may be counted more than once, this is OK because each time they are providing relevant data.
Any feedback would be greatly appreciated,
Best regards,
John