
  • Difference in the sample size reported by xtlogit and the sample size seen in the data.

    Hi all,

    I am currently working, in my own data, on a replication of the following article: Currie, J., Duque, V., & Garfinkel, I. (2015). The Great Recession and mothers' health. The Economic Journal, 125(588). (Available here: http://onlinelibrary.wiley.com/doi/1...12239/abstract).

    I have a panel dataset across three waves (year 0, year 5, and year 10) which includes the health outcomes and behaviors of respondents. I do the following to prepare for my analysis:


    Code:
    // SET DIRECTORY
    cd "$mainpath"
    
    // BUILD THE DATASET
    quietly do 3_waves_import_dataset_building_and_analysis_13_02_18_pre_transsem.do
    
    // RESHAPE FROM WIDE (one row per person) TO LONG (one row per person-wave)
    reshape long bin_residence_y psum_unemployed_total_cont_y household_income_y health_y ///
        current_county_y binary_health_y bmi_y binbmi_overweight_y binbmi_underweight_y ///
        binbmi_obese_y ord_bmi_y any_tobacco_y only_cigarettes_y occ_cigarettes_y ///
        reg_cigarettes_y no_cigs_cons_deflated_y no_cigs_cons_more10_y triedstopsmoking_y ///
        times_quit_cigarettes_y smokeintention_y smokeintention_binary_y lastdrank_y ///
        usually_drink_y days_drink_y drink_count_y prescribed_medication_y bin_mild_ex_y ///
        mild_exercise_y bin_moderate_ex_y no_activity_y strenuous_exercise_y ///
        bin_strenous_ex_y moderate_exercise_y residence_y accommodation_y home_owner_y ///
        bin_home_owner_y health_insurance_y own_education_y age_y ord_age_y medical_card_y ///
        employment_y binary_employment_y binmartatus_y, i(id) j(year)
    
    // DECLARE THE PANEL STRUCTURE
    xtset id year
    
    // GENERATE WAVE DUMMIES (yr1-yr3)
    quietly tab1 year, gen(yr)
    
    // KEEP ONLY YEARS 0 AND 5
    drop if year == 10
    Looking at the Excel file from which this dataset is built, I know that the number of people who reported their self-rated health (binary_health_y) is 1,095 in year 0 and 558 in year 5. As I want to look at change in health across waves, my understanding is that I can look at at most 558 people's self-rated health, since at most 558 people could have information in both waves.
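
    As a quick cross-check of these per-wave counts in Stata (rather than in the Excel file), I believe something along these lines should work after the reshape:

    Code:
    // non-missing self-rated health responses per wave
    tab year if !missing(binary_health_y)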

    However, when I run the following regression of binary self-rated health, good or bad (binary_health_y), on the total number of people unemployed in the county in which a person lives (psum_unemployed_total_cont_y),

    Code:
    xtlogit binary_health_y psum_unemployed_total_cont_y
    I get the following results:
    Code:
     
    . xtlogit binary_health_y psum_unemployed_total_cont_y
     
    Fitting comparison model:
     
    Iteration 0:   log likelihood = -973.25989
    Iteration 1:   log likelihood = -973.24175
    Iteration 2:   log likelihood = -973.24175
     
    Fitting full model:
     
    tau =  0.0     log likelihood = -973.24175
    tau =  0.1     log likelihood = -969.80706
    tau =  0.2     log likelihood =  -966.3578
    tau =  0.3     log likelihood = -962.98739
    tau =  0.4     log likelihood = -959.84848
    tau =  0.5     log likelihood = -957.20783
    tau =  0.6     log likelihood = -955.57269
    tau =  0.7     log likelihood = -956.02716
     
    Iteration 0:   log likelihood = -955.57204
    Iteration 1:   log likelihood = -946.86439
    Iteration 2:   log likelihood =  -946.5548
    Iteration 3:   log likelihood = -946.55444
    Iteration 4:   log likelihood = -946.55444
     
    Random-effects logistic regression              Number of obs     =      1,613
    Group variable: id                              Number of groups  =      1,077
     
    Random effects u_i ~ Gaussian                   Obs per group:
                                                                  min =          1
                                                                  avg =        1.5
                                                                  max =          2
     
    Integration method: mvaghermite                 Integration pts.  =         12
     
                                                    Wald chi2(1)      =       0.02
    Log likelihood  = -946.55444                    Prob > chi2       =     0.8855
     
    ----------------------------------------------------------------------------------------------
                 binary_health_y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -----------------------------+----------------------------------------------------------------
    psum_unemployed_total_cont_y |  -.0071684   .0497788    -0.14   0.885    -.1047331    .0903964
                           _cons |   1.384735   .4005664     3.46   0.001     .5996391     2.16983
    -----------------------------+----------------------------------------------------------------
                        /lnsig2u |   1.187308   .2524483                      .6925184    1.682097
    -----------------------------+----------------------------------------------------------------
                         sigma_u |   1.810592   .2285404                      1.413769    2.318798
                             rho |   .4991151   .0631119                      .3779334    .6204009
    ----------------------------------------------------------------------------------------------
    LR test of rho=0: chibar2(01) = 53.37                  Prob >= chibar2 = 0.000
    The number of groups is reported as 1,077.

    It is my understanding that the number of groups is equal to the sample size, so why is this number so much larger than the number of observations that I see in the Excel file?
    What I want to know is how many people answered the questionnaire at both time points. Is there a way to find the true number of people who provided information at both waves, so that I can compare?

    Relatedly, these numbers are very small, much smaller than I had originally hoped for with my dataset. Is there an alternative method of analysis I could use to avoid losing so much data? For example, perhaps I could run a linear regression for each wave independently of the others and avoid a panel data approach altogether.
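
    To illustrate what I have in mind, something along these lines, fitting one cross-sectional linear probability model per wave:

    Code:
    // separate linear probability models, one per wave
    regress binary_health_y psum_unemployed_total_cont_y if year == 0
    regress binary_health_y psum_unemployed_total_cont_y if year == 5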

    Or am I worrying too much? With random effects instead of fixed effects, and with a linear probability model (which I intend to use later for easier interpretation of the results), is it OK to treat the "Number of groups" as the number of observations in the tables of journal articles and papers? Or do we have to worry about double counting unique individuals, who may be included twice because they answered the questionnaire in two waves?

    I'm starting to feel that there may be some theoretical underpinning to random-effects estimation that I'm missing, possibly because I come from such a strongly "fixed effects" background: some theoretical reason why it is fine for a person to be counted more than once, because each time they are providing relevant data.

    Any feedback would be greatly appreciated,

    Best regards,

    John
    Last edited by John Adler; 16 Feb 2018, 04:38. Reason: A sort of "P.S." of further thoughts begins at "Or am I worrying too much?".

  • #2
    John:
    as far as your first question is concerned, Stata is telling you that the number of persons included in -xtlogit- is 1,077 out of 1,095 (probably due to missing values).
    Roughly half of them have observed values for both waves: that's why Stata is telling you that the average number of observations per person is 1.5.
    You can rely on -e(sample)- after -xtlogit- to check how many persons were included in your regression.
    Kind regards,
    Carlo
    (Stata 18.0 SE)



    • #3
      Dear Carlo,

      Thank you for your help; that makes sense. I wonder what the norm is in analyses with random-effects models: when creating tables describing results, should one report the number of persons with observed values for multiple waves, or is it appropriate just to report the number of persons included in the -xtlogit-? I admit to being quite new to random effects, as I primarily use fixed effects, where the interest is in a person who has experienced some change across waves. Does this differ for random effects?

      Kindest regards,

      Jonathan



      • #4
        John:
        what I usually do is report the original sample size, along with the missing values per wave.
        A further step might be to consider imputing the missing values, so that you can restore the original sample size and present both analyses, that is, the one performed on the complete cases and the one performed on the imputed data.
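
        A very rough sketch of the imputation route via chained equations might look like the following; the choice of imputation model and of predictors needs careful thought (and whether imputing the outcome itself is sensible depends on your missing-data mechanism), so please treat this as a placeholder rather than a recommendation:

        Code:
        // declare the mi structure and the panel structure
        mi set mlong
        mi xtset id year
        
        // register the variable(s) with missing values to be imputed
        mi register imputed binary_health_y
        
        // impute the binary outcome with a logit model (predictors here are placeholders)
        mi impute chained (logit) binary_health_y = psum_unemployed_total_cont_y i.year, add(20) rseed(12345)
        
        // re-fit the random-effects logit on the imputed datasets
        mi estimate: xtlogit binary_health_y psum_unemployed_total_cont_y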
        Kind regards,
        Carlo
        (Stata 18.0 SE)



        • #5
          Carlo,

          Thank you so much for your feedback, which was very informative. You mentioned in #2 above that I can rely on -e(sample)- after -xtlogit- to check how many persons were included in my regression. I have since been trying to put this into practice, but without much luck. Could you show me how this might be done for my regression above?

          Best,

          John



          • #6
            John:
            you're right.
            My reply in #2 was too broad.

            In the following toy example, -e(sample)- will give you the number of observations, whereas -egen- with the -tag()- function, coupled with -e(sample)-, will give you the number of persons included in the regression model.

            Code:
            . use "http://www.stata-press.com/data/r15/union.dta", clear
            
            . xtlogit union age i.black
            
            Fitting comparison model:
            
            Iteration 0:   log likelihood =  -13864.23 
            Iteration 1:   log likelihood = -13733.133 
            Iteration 2:   log likelihood = -13732.189 
            Iteration 3:   log likelihood = -13732.189 
            
            Fitting full model:
            
            tau =  0.0     log likelihood = -13732.189
            tau =  0.1     log likelihood = -13091.844
            tau =  0.2     log likelihood = -12566.896
            tau =  0.3     log likelihood = -12135.376
            tau =  0.4     log likelihood = -11775.922
            tau =  0.5     log likelihood = -11474.491
            tau =  0.6     log likelihood = -11224.734
            tau =  0.7     log likelihood = -11030.919
            tau =  0.8     log likelihood =  -10920.54
            
            Iteration 0:   log likelihood = -11031.029 
            Iteration 1:   log likelihood = -10624.936 
            Iteration 2:   log likelihood = -10602.567 
            Iteration 3:   log likelihood = -10601.881 
            Iteration 4:   log likelihood = -10601.881  (backed up)
            Iteration 5:   log likelihood = -10601.877 
            Iteration 6:   log likelihood = -10601.877 
            
            Random-effects logistic regression              Number of obs     =     26,200
            Group variable: idcode                          Number of groups  =      4,434
            
            Random effects u_i ~ Gaussian                   Obs per group:
                                                                          min =          1
                                                                          avg =        5.9
                                                                          max =         12
            
            Integration method: mvaghermite                 Integration pts.  =         12
            
                                                            Wald chi2(2)      =     111.47
            Log likelihood  = -10601.877                    Prob > chi2       =     0.0000
            
            ------------------------------------------------------------------------------
                   union |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
            -------------+----------------------------------------------------------------
                     age |    .019754   .0036005     5.49   0.000     .0126972    .0268109
                 1.black |   .9363116   .1023713     9.15   0.000     .7356675    1.136956
                   _cons |  -3.278563   .1284967   -25.51   0.000    -3.530412   -3.026714
            -------------+----------------------------------------------------------------
                /lnsig2u |   1.787467   .0467545                       1.69583    1.879104
            -------------+----------------------------------------------------------------
                 sigma_u |   2.444238   .0571395                      2.334774    2.558835
                     rho |   .6448825   .0107072                      .6236295     .665579
            ------------------------------------------------------------------------------
            LR test of rho=0: chibar2(01) = 6260.62                Prob >= chibar2 = 0.000
            
            . count if e(sample)==1
              26,200
            
            . egen tag=tag(idcode)
            
            . count if e(sample)==1 & tag==1
              4,434
            
            .
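
            Coming back to your original question, the same -egen- machinery can count how many persons have the outcome observed in both of your waves. A sketch with the variable names from your #1 (untested on your data; it checks the outcome only, so missing values in the regressor would need the same treatment):

            Code:
            // tag one row per person
            egen tag = tag(id)
            
            // per person, count the waves with non-missing self-rated health
            bysort id: egen n_waves = total(!missing(binary_health_y))
            
            // persons with self-rated health observed in both years 0 and 5
            count if n_waves == 2 & tag == 1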
            Kind regards,
            Carlo
            (Stata 18.0 SE)



            • #7
              Dear Carlo,

              Thank you so much for your response, which is greatly appreciated; I have implemented the above in my analysis and found it very informative. Is there a similar approach with which one could determine how many unique individuals have been included in the regression model? For example, in a panel data analysis, the count above would treat a person who has been measured across three waves as three entries in the regression model. If I wanted to count such a person as "1 person", regardless of how many waves they appear in, what approach would I take?

              Thank you again,

              John



              • #8
                John:
                elaborating a bit on my previous example (which is based on: use "http://www.stata-press.com/data/r15/union.dta"), you may want to try:
                Code:
                count if tag==1
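                This gives the number of unique persons in the dataset as a whole; if you want only the unique persons actually included in your last regression, combine it with -e(sample)-, as in #6:
                Code:
                count if tag==1 & e(sample)==1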
                Kind regards,
                Carlo
                (Stata 18.0 SE)
