  • Choice and Implementation of Regression Technique for Pooled data

    Hello Everyone

    I am trying to run regressions of the form given below for the first time:

    Y_imy = β0 + β1*Var_i + γ'X_i + θ_my + ε_imy

    The dependent variable Y_imy is the health outcome for an individual i observed in a given month m and year y.

    Var_i is the primary independent variable of interest (age), and I am interested in determining whether Var exceeding a certain threshold has an impact on the dependent variable. X_i is a vector of dummies for the individual representing marital status, education status, gender, etc.

    θ_my is a set of fixed time-period effects for a given month and year.
    The dependent variable Y is a binary (0/1) health outcome, and the independent variables are the demographic variables, principally age, used to predict it.

    I was thinking of using a pooled OLS regression or a fixed-effects regression for the above analysis and wanted to understand whether that would be the best approach for tackling the problem.
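
    If it helps to fix ideas, something along these lines is roughly what I had in mind (just an untested sketch; the variable names are placeholders, not my actual variables):
    Code:
    * untested sketch -- y, var and the controls are placeholder names
    * i.month#i.year expands to one dummy per month-year cell (the fixed time effects)
    regress y c.var i.marital i.educ i.gender i.month#i.year, vce(robust)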

    I am very new to Stata and would appreciate any help or references to sources that I could use for estimating the above regression.

    Thanks and Regards

    Alex


  • #2
    Alex:
    welcome to this forum.
    Please use CODE delimiters to share what you typed and what Stata gave you back (see the FAQ on this and other posting-related topics). Thanks.
    That said, you may want to take a look at the -regress- and -xtreg- entries in the Stata .pdf manual.
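    Just to give you the flavour of the syntax (placeholder variable names only, not tailored to your data):
    Code:
    * pooled OLS with robust standard errors (placeholder names)
    regress y x1 x2 i.year, vce(robust)
    * panel fixed effects (requires declaring a panel identifier first)
    xtset panelid
    xtreg y x1 x2 i.year, fe vce(cluster panelid)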
    Kind regards,
    Carlo
    (Stata 18.0 SE)

    Comment


    • #3
      Just as a side note to Carlo's helpful advice, and considering your DV is binary, you may also want to check -xtlogit-.
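      A minimal sketch of what that might look like, assuming (hypothetically) that your data had repeated observations per unit identified by a variable such as pid:
      Code:
      * sketch only -- assumes a panel identifier pid, which repeated cross-sections will not have
      xtset pid
      xtlogit y x1 x2 i.year, re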
      Best regards,

      Marcos

      Comment


      • #4
        Hi Carlo and Marcos,

        Thanks for your helpful reply. I had gone through the -xtreg- entry in the Stata manual. However, it seems to be more suited to panel data than to the pooled cross-sectional data I am working with, so I wanted to understand which options I should use in this case. I have attached a small example data set below; my full data set has more than a million rows, so I will have enough data for every month and year. I want to run regressions of the form in my earlier post on these data while including a fixed time effect by month and year. What options should I use? Would generating a set of time dummies for three years and twelve months, for a total of 36 month-year periods, work (something along the lines of the sketch after the data example below)?

        Please excuse me if my question appears a bit basic; I am a complete beginner with Stata and would greatly appreciate any help.
        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input float(health_ind mother_age mother_edind marital_ind) int dob_yy byte dob_mm
        0 29 4 1 2014  1
        0 32 1 1 2013 10
        0 34 3 1 2014  1
        0 31 2 1 2014  1
        0 27 2 1 2013  7
        0 23 1 2 2013  2
        0 20 2 2 2013  7
        0 18 1 2 2013  3
        0 37 3 1 2014  1
        1 34 3 1 2013  7
        1 32 2 2 2012  4
        0 30 4 1 2013  7
        0 37 3 1 2012  5
        0 17 1 2 2012  3
        0 23 2 1 2012  3
        0 26 3 2 2012 10
        0 29 2 1 2014  1
        0 30 2 1 2012  5
        0 25 3 1 2013 12
        1 31 2 1 2012  8
        0 38 . 1 2014  1
        0 29 2 2 2013  8
        0 30 3 1 2013 12
        0 20 2 2 2012 10
        0 20 2 2 2013 12
        1 34 2 1 2014  1
        0 36 4 1 2013  9
        0 30 3 1 2013  8
        0 20 1 2 2013  3
        0 19 1 2 2013  7
        1 24 2 2 2012  9
        0 20 2 2 2014  1
        1 24 3 1 2013  2
        0 33 3 1 2013  9
        0 21 2 2 2013  6
        0 18 1 2 2013  1
        0 27 2 1 2012  7
        0 36 1 1 2013  6
        0 36 4 1 2012  8
        1 24 2 1 2013  6
        0 33 3 1 2012  1
        1 28 2 1 2014  1
        1 23 2 2 2013 12
        0 27 2 1 2014  1
        0 30 3 1 2012  2
        1 32 3 1 2012  7
        1 26 3 1 2012  9
        0 24 2 2 2013 12
        1 25 3 1 2013  6
        0 24 3 1 2013 11
        0 28 4 1 2013 11
        0 29 3 1 2014  1
        0 27 4 1 2013 10
        0 38 . 1 2012  8
        0 24 2 1 2012 10
        1 24 2 2 2013  1
        0 31 3 2 2013  4
        0 29 2 2 2013  9
        0 27 4 1 2013 11
        0 41 3 1 2013  1
        0 31 2 1 2013  3
        0 31 2 1 2013  8
        0 31 2 1 2012  2
        1 43 2 1 2012  6
        0 22 2 2 2012  4
        0 21 2 1 2013 10
        1 37 3 2 2014  1
        0 38 3 1 2012 10
        0 26 2 1 2012 12
        0 29 4 2 2013  6
        0 33 3 1 2014  1
        0 23 . 1 2012  5
        0 20 2 1 2013 12
        1 27 2 1 2012  2
        1 28 2 2 2012  2
        0 37 . 2 2013  1
        0 23 2 2 2012 12
        0 20 . 2 2013  8
        1 25 1 1 2013  1
        1 36 3 1 2012  4
        0 23 2 2 2012 12
        0 31 2 1 2012 11
        0 28 3 1 2013  5
        0 39 . 2 2012  2
        1 25 . 2 2013  1
        1 34 2 2 2013  6
        1 37 3 1 2012  9
        0 22 2 1 2012  1
        0 19 2 2 2012  6
        0 22 3 1 2012  6
        0 40 1 1 2013 10
        0 32 2 1 2012  6
        1 35 . 1 2012  4
        0 37 3 2 2012  2
        0 29 3 1 2012  3
        0 35 2 1 2012  9
        0 34 2 1 2012 12
        0 21 2 2 2013  6
        0 31 1 2 2014  1
        0 32 3 1 2014  1
        end
        label values mother_edind edu
        label def edu 1 "<High School", modify
        label def edu 2 "High School/GED", modify
        label def edu 3 "College", modify
        label def edu 4 "College+", modify
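
        What I was considering is something along these lines (an untested sketch): combine year and month into a single monthly date with -ym()- and let factor-variable notation create the month-year dummies, rather than generating 36 dummies by hand.
        Code:
        * untested sketch: one dummy per month-year cell via a combined monthly date
        gen mdate = ym(dob_yy, dob_mm)
        format mdate %tm
        regress health_ind c.mother_age i.mother_edind i.marital_ind i.mdate, vce(robust)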

        Comment


        • #5
          Alex:
          thanks for providing further details.
          However, since your data excerpt does not provide an -id- for the units included in your dataset (whereas a time variable is provided), I cannot say whether you have repeated cross-sectional or panel data.
          Assuming that you have repeated cross-sectional data (i.e., different units measured at different points in time), you should probably consider -logit- or -logistic- (as Marcos pointed out, your regressand seems to be categorical 0/1, so there is no room for -regress-, which requires a continuous regressand instead):
          Code:
          . g id=_n
          . logit health_ind mother_age i.mother_edind marital_ind dob_mm i.dob_yy
          
          note: 4.mother_edind != 0 predicts failure perfectly
                4.mother_edind dropped and 8 obs not used
          
          Iteration 0:   log likelihood = -47.236152 
          Iteration 1:   log likelihood = -43.689615 
          Iteration 2:   log likelihood = -43.553499 
          Iteration 3:   log likelihood = -43.552846 
          Iteration 4:   log likelihood = -43.552846 
          
          Logistic regression                             Number of obs     =         84
                                                          LR chi2(7)        =       7.37
                                                          Prob > chi2       =     0.3917
          Log likelihood = -43.552846                     Pseudo R2         =     0.0780
          
          ----------------------------------------------------------------------------------
                health_ind |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
          -----------------+----------------------------------------------------------------
                mother_age |   .0791846   .0531739     1.49   0.136    -.0250343    .1834036
                           |
              mother_edind |
          High School/GED  |   1.649684   1.164436     1.42   0.157    -.6325696    3.931937
                  College  |   1.270483   1.208148     1.05   0.293    -1.097444     3.63841
                 College+  |          0  (empty)
                           |
               marital_ind |   .2764639   .6216595     0.44   0.657    -.9419664    1.494894
                    dob_mm |  -.1353358   .0862361    -1.57   0.117    -.3043555    .0336839
                           |
                    dob_yy |
                     2013  |   .1431512   .6144185     0.23   0.816    -1.061087    1.347389
                     2014  |  -.9886959   .8740855    -1.13   0.258    -2.701872    .7244802
                           |
                     _cons |  -4.262752   2.317531    -1.84   0.066     -8.80503    .2795259
          ----------------------------------------------------------------------------------
          
          .
          Kind regards,
          Carlo
          (Stata 18.0 SE)

          Comment


          • #6
            Hi Carlo:

            Thanks a lot for your help on my question. You guessed correctly that I have repeated cross-sectional data.

            I'll proceed with the -logit- regression method. One small question I wanted to ask: since I want to add fixed time effects at the month and year level, should I create dummies for all the months too?
            This would lead to 11 month dummies and 2 year dummies. I see that you have dob_mm entered as a single variable in the regression. I wanted to understand the right approach for capturing the fixed effects for the 36 time periods (3 years x 12 months).
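            For instance, would interacting the two factor variables, along these lines (untested), give the full set of 36 month-by-year effects?
            Code:
            * untested: i.dob_mm#i.dob_yy should expand to one dummy per month-year cell
            logit health_ind mother_age i.mother_edind marital_ind i.dob_mm#i.dob_yy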

            Best Regards

            Alex

            Comment


            • #7
              Alex:
              my bad, I overlooked that dob_mm was a categorical variable.
              Hence, elaborating a bit on your data excerpt:
              Code:
              . g id=_n
              . logit health_ind mother_age i.mother_edind marital_ind i.dob_mm i.dob_yy
              
              note: 4.mother_edind != 0 predicts failure perfectly
                    4.mother_edind dropped and 8 obs not used
              
              note: 3.dob_mm != 0 predicts failure perfectly
                    3.dob_mm dropped and 6 obs not used
              
              note: 5.dob_mm != 0 predicts failure perfectly
                    5.dob_mm dropped and 3 obs not used
              
              note: 10.dob_mm != 0 predicts failure perfectly
                    10.dob_mm dropped and 7 obs not used
              
              note: 11.dob_mm != 0 predicts failure perfectly
                    11.dob_mm dropped and 2 obs not used
              
              Iteration 0:   log likelihood =  -41.28243 
              Iteration 1:   log likelihood = -37.483049 
              Iteration 2:   log likelihood = -37.346245 
              Iteration 3:   log likelihood = -37.345562 
              Iteration 4:   log likelihood = -37.345562 
              
              Logistic regression                             Number of obs     =         66
                                                              LR chi2(13)       =       7.87
                                                              Prob > chi2       =     0.8517
              Log likelihood = -37.345562                     Pseudo R2         =     0.0954
              
              ----------------------------------------------------------------------------------
                    health_ind |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
              -----------------+----------------------------------------------------------------
                    mother_age |   .0584523   .0581291     1.01   0.315    -.0554786    .1723832
                               |
                  mother_edind |
              High School/GED  |   1.136505     1.2723     0.89   0.372    -1.357157    3.630168
                      College  |     .75351    1.30818     0.58   0.565    -1.810476    3.317496
                     College+  |          0  (empty)
                               |
                   marital_ind |  -.1707325   .7173884    -0.24   0.812    -1.576788    1.235323
                               |
                        dob_mm |
                            2  |   .2530097   1.241003     0.20   0.838    -2.179311    2.685331
                            3  |          0  (empty)
                            4  |   .3974027   1.462063     0.27   0.786    -2.468188    3.262993
                            5  |          0  (empty)
                            6  |   .0403314   1.150109     0.04   0.972     -2.21384    2.294503
                            7  |  -.1301078   1.288359    -0.10   0.920    -2.655244    2.395029
                            8  |  -.9250612   1.524987    -0.61   0.544    -3.913982    2.063859
                            9  |   .2963607   1.269578     0.23   0.815    -2.191966    2.784688
                           10  |          0  (empty)
                           11  |          0  (empty)
                           12  |  -1.690895   1.436328    -1.18   0.239    -4.506046    1.124257
                               |
                        dob_yy |
                         2013  |   .0707898   .7233872     0.10   0.922    -1.347023    1.488603
                         2014  |  -.8799477   1.206099    -0.73   0.466    -3.243857    1.483962
                               |
                         _cons |  -2.829852   2.547678    -1.11   0.267    -7.823209    2.163504
              ----------------------------------------------------------------------------------
              
              . testparm(i.dob_mm)
              
               ( 1)  [health_ind]2.dob_mm = 0
               ( 2)  [health_ind]4.dob_mm = 0
               ( 3)  [health_ind]6.dob_mm = 0
               ( 4)  [health_ind]7.dob_mm = 0
               ( 5)  [health_ind]8.dob_mm = 0
               ( 6)  [health_ind]9.dob_mm = 0
               ( 7)  [health_ind]12.dob_mm = 0
              
                         chi2(  7) =    3.14
                       Prob > chi2 =    0.8722
              
              . testparm(i.dob_yy)
              
               ( 1)  [health_ind]2013.dob_yy = 0
               ( 2)  [health_ind]2014.dob_yy = 0
              
                         chi2(  2) =    0.64
                       Prob > chi2 =    0.7267
              
              -testparm- tests the joint statistical significance of your time variables; for both of them there is no evidence of an effect.
              Kind regards,
              Carlo
              (Stata 18.0 SE)

              Comment
