  • Choice and Implementation of Regression Technique for Pooled data

    Hello Everyone

    I am trying to run regressions of the form given below for the first time:

    Y_imy = β0 + β1*Var_i + γ'X_i + θ_my + ε_imy

    The dependent variable Y_imy is the health outcome for an individual i observed in a given month m and year y.

    Var_i is the primary independent variable of interest (age), and I am interested in determining whether Var exceeding a certain threshold has an impact on the dependent variable. X_i is a vector of dummies for the individual representing marital status, education status, gender, etc.

    θ_my is a set of fixed time-period effects for a given month and year.
    The dependent variable Y is a binary (0/1) health outcome, and the independent variables are the demographic variables, principally age, used to predict it.

    I was thinking of using a pooled OLS regression or a fixed-effects regression for the above analysis and wanted to understand whether that would be the best approach for tackling the problem.
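
    If it helps to fix ideas, something along these lines is roughly what I had in mind (just an untested sketch; the variable names are placeholders, not my actual variables):
    Code:
    * untested sketch -- y, var and the controls are placeholder names
    * i.month#i.year expands to one dummy per month-year cell (the fixed time effects)
    regress y c.var i.marital i.educ i.gender i.month#i.year, vce(robust)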

    I am very new to Stata and would appreciate any help or references to sources that I could use for estimating the above regression.

    Thanks and Regards

    Alex


  • #2
    Alex:
    welcome to this forum.
    Please use CODE delimiters to share what you typed and what Stata gave you back (see the FAQ on this and other posting-related topics). Thanks.
    That said, you may want to take a look at the -regress- and -xtreg- entries in the Stata .pdf manual.
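    Just to give you the flavour of the syntax (placeholder variable names only, not tailored to your data):
    Code:
    * pooled OLS with robust standard errors (placeholder names)
    regress y x1 x2 i.year, vce(robust)
    * panel fixed effects (requires declaring a panel identifier first)
    xtset panelid
    xtreg y x1 x2 i.year, fe vce(cluster panelid)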
    Kind regards,
    Carlo
    (Stata 18.0 SE)

    Comment


    • #3
      Just as a side note to Carlo's helpful advice, and considering your DV is binary, you may also want to check -xtlogit-.
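      A minimal sketch of what that might look like, assuming (hypothetically) that your data had repeated observations per unit identified by a variable such as pid:
      Code:
      * sketch only -- assumes a panel identifier pid, which repeated cross-sections will not have
      xtset pid
      xtlogit y x1 x2 i.year, re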
      Best regards,

      Marcos

      Comment


      • #4
        Hi Carlo and Marcos,

        Thanks for your helpful reply. I had gone through the -xtreg- entry in the Stata manual. However, it seems to be more suited to panel data than to the pooled cross-sectional data I am working with, so I wanted to understand which options I should use in this case. I have attached a small example data set below; my full data set has more than a million rows, so I will have enough data for every month and year. I want to run regressions of the form in my earlier post on these data while including a fixed time effect by month and year. What options should I use? Would generating a set of time dummies for three years and twelve months, for a total of 36 month-year periods, work (something along the lines of the sketch after the data example below)?

        Please excuse me if my question appears a bit basic; I am a complete beginner with Stata and would greatly appreciate any help.
        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input float(health_ind mother_age mother_edind marital_ind) int dob_yy byte dob_mm
        0 29 4 1 2014  1
        0 32 1 1 2013 10
        0 34 3 1 2014  1
        0 31 2 1 2014  1
        0 27 2 1 2013  7
        0 23 1 2 2013  2
        0 20 2 2 2013  7
        0 18 1 2 2013  3
        0 37 3 1 2014  1
        1 34 3 1 2013  7
        1 32 2 2 2012  4
        0 30 4 1 2013  7
        0 37 3 1 2012  5
        0 17 1 2 2012  3
        0 23 2 1 2012  3
        0 26 3 2 2012 10
        0 29 2 1 2014  1
        0 30 2 1 2012  5
        0 25 3 1 2013 12
        1 31 2 1 2012  8
        0 38 . 1 2014  1
        0 29 2 2 2013  8
        0 30 3 1 2013 12
        0 20 2 2 2012 10
        0 20 2 2 2013 12
        1 34 2 1 2014  1
        0 36 4 1 2013  9
        0 30 3 1 2013  8
        0 20 1 2 2013  3
        0 19 1 2 2013  7
        1 24 2 2 2012  9
        0 20 2 2 2014  1
        1 24 3 1 2013  2
        0 33 3 1 2013  9
        0 21 2 2 2013  6
        0 18 1 2 2013  1
        0 27 2 1 2012  7
        0 36 1 1 2013  6
        0 36 4 1 2012  8
        1 24 2 1 2013  6
        0 33 3 1 2012  1
        1 28 2 1 2014  1
        1 23 2 2 2013 12
        0 27 2 1 2014  1
        0 30 3 1 2012  2
        1 32 3 1 2012  7
        1 26 3 1 2012  9
        0 24 2 2 2013 12
        1 25 3 1 2013  6
        0 24 3 1 2013 11
        0 28 4 1 2013 11
        0 29 3 1 2014  1
        0 27 4 1 2013 10
        0 38 . 1 2012  8
        0 24 2 1 2012 10
        1 24 2 2 2013  1
        0 31 3 2 2013  4
        0 29 2 2 2013  9
        0 27 4 1 2013 11
        0 41 3 1 2013  1
        0 31 2 1 2013  3
        0 31 2 1 2013  8
        0 31 2 1 2012  2
        1 43 2 1 2012  6
        0 22 2 2 2012  4
        0 21 2 1 2013 10
        1 37 3 2 2014  1
        0 38 3 1 2012 10
        0 26 2 1 2012 12
        0 29 4 2 2013  6
        0 33 3 1 2014  1
        0 23 . 1 2012  5
        0 20 2 1 2013 12
        1 27 2 1 2012  2
        1 28 2 2 2012  2
        0 37 . 2 2013  1
        0 23 2 2 2012 12
        0 20 . 2 2013  8
        1 25 1 1 2013  1
        1 36 3 1 2012  4
        0 23 2 2 2012 12
        0 31 2 1 2012 11
        0 28 3 1 2013  5
        0 39 . 2 2012  2
        1 25 . 2 2013  1
        1 34 2 2 2013  6
        1 37 3 1 2012  9
        0 22 2 1 2012  1
        0 19 2 2 2012  6
        0 22 3 1 2012  6
        0 40 1 1 2013 10
        0 32 2 1 2012  6
        1 35 . 1 2012  4
        0 37 3 2 2012  2
        0 29 3 1 2012  3
        0 35 2 1 2012  9
        0 34 2 1 2012 12
        0 21 2 2 2013  6
        0 31 1 2 2014  1
        0 32 3 1 2014  1
        end
        label values mother_edind edu
        label def edu 1 "<High School", modify
        label def edu 2 "High School/GED", modify
        label def edu 3 "College", modify
        label def edu 4 "College+", modify
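
        What I was considering is something along these lines (an untested sketch): combine year and month into a single monthly date with -ym()- and let factor-variable notation create the month-year dummies, rather than generating 36 dummies by hand.
        Code:
        * untested sketch: one dummy per month-year cell via a combined monthly date
        gen mdate = ym(dob_yy, dob_mm)
        format mdate %tm
        regress health_ind c.mother_age i.mother_edind i.marital_ind i.mdate, vce(robust)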

        Comment


        • #5
          Alex:
          thanks for providing further details.
          However, since your data excerpt does not provide an -id- for the units included in your dataset (whereas a time variable is provided), I cannot say whether you have repeated cross-sectional or panel data.
          Assuming that you have repeated cross-sectional data (i.e., different units measured at different points in time), you should probably consider -logit- or -logistic- (as Marcos pointed out, your regressand seems to be categorical 0/1, so there is no room for -regress-, which requires a continuous regressand instead):
          Code:
          . g id=_n
          . logit health_ind mother_age i.mother_edind marital_ind dob_mm i.dob_yy
          
          note: 4.mother_edind != 0 predicts failure perfectly
                4.mother_edind dropped and 8 obs not used
          
          Iteration 0:   log likelihood = -47.236152 
          Iteration 1:   log likelihood = -43.689615 
          Iteration 2:   log likelihood = -43.553499 
          Iteration 3:   log likelihood = -43.552846 
          Iteration 4:   log likelihood = -43.552846 
          
          Logistic regression                             Number of obs     =         84
                                                          LR chi2(7)        =       7.37
                                                          Prob > chi2       =     0.3917
          Log likelihood = -43.552846                     Pseudo R2         =     0.0780
          
          ----------------------------------------------------------------------------------
                health_ind |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
          -----------------+----------------------------------------------------------------
                mother_age |   .0791846   .0531739     1.49   0.136    -.0250343    .1834036
                           |
              mother_edind |
          High School/GED  |   1.649684   1.164436     1.42   0.157    -.6325696    3.931937
                  College  |   1.270483   1.208148     1.05   0.293    -1.097444     3.63841
                 College+  |          0  (empty)
                           |
               marital_ind |   .2764639   .6216595     0.44   0.657    -.9419664    1.494894
                    dob_mm |  -.1353358   .0862361    -1.57   0.117    -.3043555    .0336839
                           |
                    dob_yy |
                     2013  |   .1431512   .6144185     0.23   0.816    -1.061087    1.347389
                     2014  |  -.9886959   .8740855    -1.13   0.258    -2.701872    .7244802
                           |
                     _cons |  -4.262752   2.317531    -1.84   0.066     -8.80503    .2795259
          ----------------------------------------------------------------------------------
          
          .
          Kind regards,
          Carlo
          (Stata 18.0 SE)

          Comment


          • #6
            Hi Carlo:

            Thanks a lot for your help on my question. You guessed correctly that I have repeated cross-sectional data.

            I'll proceed with the -logit- regression method. One small question I wanted to ask: since I want to add fixed time effects at the month and year level, should I create dummies for all the months too?
            This would lead to 11 month dummies and 2 year dummies. I see that you have dob_mm entered as a single variable in the regression. I wanted to understand the right approach for capturing the fixed effects for the 36 time periods (3 years x 12 months).
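            For instance, would interacting the two factor variables, along these lines (untested), give the full set of 36 month-by-year effects?
            Code:
            * untested: i.dob_mm#i.dob_yy should expand to one dummy per month-year cell
            logit health_ind mother_age i.mother_edind marital_ind i.dob_mm#i.dob_yy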

            Best Regards

            Alex

            Comment


            • #7
              Alex:
              my bad, I overlooked that dob_mm was a categorical variable.
              Hence, elaborating a bit on your data excerpt:
              Code:
              . g id=_n
              . logit health_ind mother_age i.mother_edind marital_ind i.dob_mm i.dob_yy
              
              note: 4.mother_edind != 0 predicts failure perfectly
                    4.mother_edind dropped and 8 obs not used
              
              note: 3.dob_mm != 0 predicts failure perfectly
                    3.dob_mm dropped and 6 obs not used
              
              note: 5.dob_mm != 0 predicts failure perfectly
                    5.dob_mm dropped and 3 obs not used
              
              note: 10.dob_mm != 0 predicts failure perfectly
                    10.dob_mm dropped and 7 obs not used
              
              note: 11.dob_mm != 0 predicts failure perfectly
                    11.dob_mm dropped and 2 obs not used
              
              Iteration 0:   log likelihood =  -41.28243 
              Iteration 1:   log likelihood = -37.483049 
              Iteration 2:   log likelihood = -37.346245 
              Iteration 3:   log likelihood = -37.345562 
              Iteration 4:   log likelihood = -37.345562 
              
              Logistic regression                             Number of obs     =         66
                                                              LR chi2(13)       =       7.87
                                                              Prob > chi2       =     0.8517
              Log likelihood = -37.345562                     Pseudo R2         =     0.0954
              
              ----------------------------------------------------------------------------------
                    health_ind |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
              -----------------+----------------------------------------------------------------
                    mother_age |   .0584523   .0581291     1.01   0.315    -.0554786    .1723832
                               |
                  mother_edind |
              High School/GED  |   1.136505     1.2723     0.89   0.372    -1.357157    3.630168
                      College  |     .75351    1.30818     0.58   0.565    -1.810476    3.317496
                     College+  |          0  (empty)
                               |
                   marital_ind |  -.1707325   .7173884    -0.24   0.812    -1.576788    1.235323
                               |
                        dob_mm |
                            2  |   .2530097   1.241003     0.20   0.838    -2.179311    2.685331
                            3  |          0  (empty)
                            4  |   .3974027   1.462063     0.27   0.786    -2.468188    3.262993
                            5  |          0  (empty)
                            6  |   .0403314   1.150109     0.04   0.972     -2.21384    2.294503
                            7  |  -.1301078   1.288359    -0.10   0.920    -2.655244    2.395029
                            8  |  -.9250612   1.524987    -0.61   0.544    -3.913982    2.063859
                            9  |   .2963607   1.269578     0.23   0.815    -2.191966    2.784688
                           10  |          0  (empty)
                           11  |          0  (empty)
                           12  |  -1.690895   1.436328    -1.18   0.239    -4.506046    1.124257
                               |
                        dob_yy |
                         2013  |   .0707898   .7233872     0.10   0.922    -1.347023    1.488603
                         2014  |  -.8799477   1.206099    -0.73   0.466    -3.243857    1.483962
                               |
                         _cons |  -2.829852   2.547678    -1.11   0.267    -7.823209    2.163504
              ----------------------------------------------------------------------------------
              
              . testparm(i.dob_mm)
              
               ( 1)  [health_ind]2.dob_mm = 0
               ( 2)  [health_ind]4.dob_mm = 0
               ( 3)  [health_ind]6.dob_mm = 0
               ( 4)  [health_ind]7.dob_mm = 0
               ( 5)  [health_ind]8.dob_mm = 0
               ( 6)  [health_ind]9.dob_mm = 0
               ( 7)  [health_ind]12.dob_mm = 0
              
                         chi2(  7) =    3.14
                       Prob > chi2 =    0.8722
              
              . testparm(i.dob_yy)
              
               ( 1)  [health_ind]2013.dob_yy = 0
               ( 2)  [health_ind]2014.dob_yy = 0
              
                         chi2(  2) =    0.64
                       Prob > chi2 =    0.7267
              
              -testparm- tests the joint statistical significance of your time variables; for both of them there is no evidence of an effect.
              Kind regards,
              Carlo
              (Stata 18.0 SE)

              Comment
