
  • Difference in the sample size reported by xtlogit and the sample size seen in the data.

    Hi all,

    I am currently working, in my own data, on a replication of the following article: Currie, J., Duque, V., & Garfinkel, I. (2015). The Great Recession and mothers' health. The Economic Journal, 125(588). (Available here: http://onlinelibrary.wiley.com/doi/1...12239/abstract).

    I have a panel dataset across three waves (year 0, year 5, and year 10) which includes the health outcomes and behaviors of respondents. I do the following to prepare for my analysis:


    Code:
    // SET DIRECTORY
    cd "$mainpath"
    
    // BUILD THE DATASET
    quietly do 3_waves_import_dataset_building_and_analysis_13_02_18_pre_transsem.do
    
    // RESHAPE FROM WIDE (one row per person) TO LONG (one row per person-wave)
    reshape long bin_residence_y psum_unemployed_total_cont_y household_income_y health_y ///
        current_county_y binary_health_y bmi_y binbmi_overweight_y binbmi_underweight_y ///
        binbmi_obese_y ord_bmi_y any_tobacco_y only_cigarettes_y occ_cigarettes_y ///
        reg_cigarettes_y no_cigs_cons_deflated_y no_cigs_cons_more10_y triedstopsmoking_y ///
        times_quit_cigarettes_y smokeintention_y smokeintention_binary_y lastdrank_y ///
        usually_drink_y days_drink_y drink_count_y prescribed_medication_y bin_mild_ex_y ///
        mild_exercise_y bin_moderate_ex_y no_activity_y strenuous_exercise_y ///
        bin_strenous_ex_y moderate_exercise_y residence_y accommodation_y home_owner_y ///
        bin_home_owner_y health_insurance_y own_education_y age_y ord_age_y medical_card_y ///
        employment_y binary_employment_y binmartatus_y, i(id) j(year)
    
    // DECLARE THE PANEL STRUCTURE
    xtset id year
    
    // GENERATE WAVE DUMMIES (yr1-yr3)
    quietly tab1 year, gen(yr)
    
    // KEEP ONLY YEARS 0 AND 5
    drop if year == 10
    Looking at the Excel file from which this dataset is built, I know that the number of people who reported their self-rated health (binary_health_y) is 1,095 in year 0 and 558 in year 5. As I want to look at change in health across waves, my understanding is that I can look at at most 558 people's self-rated health, since at most 558 people could have information in both waves.
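
    As a quick cross-check of these per-wave counts in Stata (rather than in the Excel file), I believe something along these lines should work after the reshape:

    Code:
    // non-missing self-rated health responses per wave
    tab year if !missing(binary_health_y)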

    However, when I run the following regression of binary self-rated health, good or bad (binary_health_y), on the total number of people unemployed in the county in which a person lives (psum_unemployed_total_cont_y),

    Code:
    xtlogit binary_health_y psum_unemployed_total_cont_y
    I get the following results:
    Code:
     
    . xtlogit binary_health_y psum_unemployed_total_cont_y
     
    Fitting comparison model:
     
    Iteration 0:   log likelihood = -973.25989
    Iteration 1:   log likelihood = -973.24175
    Iteration 2:   log likelihood = -973.24175
     
    Fitting full model:
     
    tau =  0.0     log likelihood = -973.24175
    tau =  0.1     log likelihood = -969.80706
    tau =  0.2     log likelihood =  -966.3578
    tau =  0.3     log likelihood = -962.98739
    tau =  0.4     log likelihood = -959.84848
    tau =  0.5     log likelihood = -957.20783
    tau =  0.6     log likelihood = -955.57269
    tau =  0.7     log likelihood = -956.02716
     
    Iteration 0:   log likelihood = -955.57204
    Iteration 1:   log likelihood = -946.86439
    Iteration 2:   log likelihood =  -946.5548
    Iteration 3:   log likelihood = -946.55444
    Iteration 4:   log likelihood = -946.55444
     
    Random-effects logistic regression              Number of obs     =      1,613
    Group variable: id                              Number of groups  =      1,077
     
    Random effects u_i ~ Gaussian                   Obs per group:
                                                                  min =          1
                                                                  avg =        1.5
                                                                  max =          2
     
    Integration method: mvaghermite                 Integration pts.  =         12
     
                                                    Wald chi2(1)      =       0.02
    Log likelihood  = -946.55444                    Prob > chi2       =     0.8855
     
    ----------------------------------------------------------------------------------------------
                 binary_health_y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -----------------------------+----------------------------------------------------------------
    psum_unemployed_total_cont_y |  -.0071684   .0497788    -0.14   0.885    -.1047331    .0903964
                           _cons |   1.384735   .4005664     3.46   0.001     .5996391     2.16983
    -----------------------------+----------------------------------------------------------------
                        /lnsig2u |   1.187308   .2524483                      .6925184    1.682097
    -----------------------------+----------------------------------------------------------------
                         sigma_u |   1.810592   .2285404                      1.413769    2.318798
                             rho |   .4991151   .0631119                      .3779334    .6204009
    ----------------------------------------------------------------------------------------------
    LR test of rho=0: chibar2(01) = 53.37                  Prob >= chibar2 = 0.000
    The number of groups is reported as 1,077.

    It is my understanding that the number of groups is equal to the sample size, so why is this number so much larger than the number of observations that I see in the Excel file?
    What I want to know is how many people answered the questionnaire at both time points. Is there a way to find the true number of people who provided information at both waves, so that I can compare?

    Relatedly, these numbers are very small, much smaller than I had originally hoped for with my dataset. Is there an alternative method of analysis I could use to avoid losing so much data? For example, perhaps I could run a linear regression for each wave independently of the others and avoid a panel data approach altogether.
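
    To illustrate what I have in mind, something along these lines, fitting one cross-sectional linear probability model per wave:

    Code:
    // separate linear probability models, one per wave
    regress binary_health_y psum_unemployed_total_cont_y if year == 0
    regress binary_health_y psum_unemployed_total_cont_y if year == 5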

    Or am I worrying too much? With random effects instead of fixed effects, and with a linear probability model (which I intend to use later for easier interpretation of the results), is it OK to treat the "Number of groups" as the number of observations in the tables of journal articles and papers? Or do we have to worry about double counting unique individuals, who may be included twice because they answered the questionnaire in two waves?

    I'm starting to feel that there may be some theoretical underpinning to random-effects estimation that I'm missing, possibly because I come from such a strongly "fixed effects" background: some theoretical reason why it is fine for a person to be counted more than once, because each time they are providing relevant data.

    Any feedback would be greatly appreciated,

    Best regards,

    John
    Last edited by John Adler; 16 Feb 2018, 04:38. Reason: A sort of "P.S." of further thoughts begins at "Or am I worrying too much?".

  • #2
    John:
    as far as your first question is concerned, Stata is telling you that the number of persons included in -xtlogit- is 1,077 out of 1,095 (probably due to missing values).
    Roughly half of them have observed values for both waves: that's why Stata is telling you that the average number of observations per person is 1.5.
    You can rely on -e(sample)- after -xtlogit- to check how many persons were included in your regression.
    Kind regards,
    Carlo
    (Stata 18.0 SE)



    • #3
      Dear Carlo,

      Thank you for your help; that makes sense. I wonder what the norm is in analyses with random-effects models: when creating tables describing results, should one report the number of persons with observed values for multiple waves, or is it appropriate just to report the number of persons included in the -xtlogit-? I admit to being quite new to random effects, as I primarily use fixed effects, where the interest is in a person who has experienced some change across waves. Does this differ for random effects?

      Kindest regards,

      Jonathan



      • #4
        John:
        what I usually do is report the original sample size, along with the missing values per wave.
        A further step might be to consider imputing the missing values, so that you can restore the original sample size and present both analyses, that is, the one performed on the complete cases and the one performed on the imputed data.
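
        A very rough sketch of the imputation route via chained equations might look like the following; the choice of imputation model and of predictors needs careful thought (and whether imputing the outcome itself is sensible depends on your missing-data mechanism), so please treat this as a placeholder rather than a recommendation:

        Code:
        // declare the mi structure and the panel structure
        mi set mlong
        mi xtset id year
        
        // register the variable(s) with missing values to be imputed
        mi register imputed binary_health_y
        
        // impute the binary outcome with a logit model (predictors here are placeholders)
        mi impute chained (logit) binary_health_y = psum_unemployed_total_cont_y i.year, add(20) rseed(12345)
        
        // re-fit the random-effects logit on the imputed datasets
        mi estimate: xtlogit binary_health_y psum_unemployed_total_cont_y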
        Kind regards,
        Carlo
        (Stata 18.0 SE)



        • #5
          Carlo,

          Thank you so much for your feedback, which was very informative. You mentioned in #2 above that I can rely on -e(sample)- after -xtlogit- to check how many persons were included in my regression. I have since been trying to put this into practice, but without much luck. Could you show me how this might be done for my regression above?

          Best,

          John



          • #6
            John:
            you're right.
            My reply in #2 was too broad.

            In the following toy example, -e(sample)- will give you the number of observations, whereas -egen- with the -tag()- function, coupled with -e(sample)-, will give you the number of persons included in the regression model.

            Code:
            . use "http://www.stata-press.com/data/r15/union.dta", clear
            
            . xtlogit union age i.black
            
            Fitting comparison model:
            
            Iteration 0:   log likelihood =  -13864.23 
            Iteration 1:   log likelihood = -13733.133 
            Iteration 2:   log likelihood = -13732.189 
            Iteration 3:   log likelihood = -13732.189 
            
            Fitting full model:
            
            tau =  0.0     log likelihood = -13732.189
            tau =  0.1     log likelihood = -13091.844
            tau =  0.2     log likelihood = -12566.896
            tau =  0.3     log likelihood = -12135.376
            tau =  0.4     log likelihood = -11775.922
            tau =  0.5     log likelihood = -11474.491
            tau =  0.6     log likelihood = -11224.734
            tau =  0.7     log likelihood = -11030.919
            tau =  0.8     log likelihood =  -10920.54
            
            Iteration 0:   log likelihood = -11031.029 
            Iteration 1:   log likelihood = -10624.936 
            Iteration 2:   log likelihood = -10602.567 
            Iteration 3:   log likelihood = -10601.881 
            Iteration 4:   log likelihood = -10601.881  (backed up)
            Iteration 5:   log likelihood = -10601.877 
            Iteration 6:   log likelihood = -10601.877 
            
            Random-effects logistic regression              Number of obs     =     26,200
            Group variable: idcode                          Number of groups  =      4,434
            
            Random effects u_i ~ Gaussian                   Obs per group:
                                                                          min =          1
                                                                          avg =        5.9
                                                                          max =         12
            
            Integration method: mvaghermite                 Integration pts.  =         12
            
                                                            Wald chi2(2)      =     111.47
            Log likelihood  = -10601.877                    Prob > chi2       =     0.0000
            
            ------------------------------------------------------------------------------
                   union |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
            -------------+----------------------------------------------------------------
                     age |    .019754   .0036005     5.49   0.000     .0126972    .0268109
                 1.black |   .9363116   .1023713     9.15   0.000     .7356675    1.136956
                   _cons |  -3.278563   .1284967   -25.51   0.000    -3.530412   -3.026714
            -------------+----------------------------------------------------------------
                /lnsig2u |   1.787467   .0467545                       1.69583    1.879104
            -------------+----------------------------------------------------------------
                 sigma_u |   2.444238   .0571395                      2.334774    2.558835
                     rho |   .6448825   .0107072                      .6236295     .665579
            ------------------------------------------------------------------------------
            LR test of rho=0: chibar2(01) = 6260.62                Prob >= chibar2 = 0.000
            
            . count if e(sample)==1
              26,200
            
            . egen tag=tag(idcode)
            
            . count if e(sample)==1 & tag==1
              4,434
            
            .
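
            Coming back to your original question, the same -egen- machinery can count how many persons have the outcome observed in both of your waves. A sketch with the variable names from your #1 (untested on your data; it checks the outcome only, so missing values in the regressor would need the same treatment):

            Code:
            // tag one row per person
            egen tag = tag(id)
            
            // per person, count the waves with non-missing self-rated health
            bysort id: egen n_waves = total(!missing(binary_health_y))
            
            // persons with self-rated health observed in both years 0 and 5
            count if n_waves == 2 & tag == 1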
            Kind regards,
            Carlo
            (Stata 18.0 SE)



            • #7
              Dear Carlo,

              Thank you so much for your response, which is greatly appreciated; I have implemented the above in my analysis and found it very informative. Is there a similar approach with which one could determine how many unique individuals have been included in the regression model? For example, in a panel data analysis, the count above would treat a person who has been measured across three waves as three entries in the regression model. If I wanted to count such a person as "1 person", regardless of how many waves they appear in, what approach would I take?

              Thank you again,

              John



              • #8
                John:
                elaborating a bit on my previous example (which is based on: use "http://www.stata-press.com/data/r15/union.dta"), you may want to try:
                Code:
                count if tag==1
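                This gives the number of unique persons in the dataset as a whole; if you want only the unique persons actually included in your last regression, combine it with -e(sample)-, as in #6:
                Code:
                count if tag==1 & e(sample)==1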
                Kind regards,
                Carlo
                (Stata 18.0 SE)
