Fixing a sample so observations between specifications are the same to enable comparisons

Chris Boulis

Join Date: Feb 2019

Posts: 368
#1

05 Nov 2022, 22:49

In my regressions, I add variables across five specifications in which the last contains all the variables in my model. As each specification is based on different samples, I cannot accurately compare results between them. I understand one way of addressing this is to 'fix' the sample to the observations in the final (5th) specification as it contains all the variables added across the previous specifications. Doing so, I understand, will ensure the samples in each are the same, therefore, allowing more accurate comparisons of results across specifications.

To do this, I thought of generating a new variable, which equals the variables in the final specification and adding this new variable to each of the first four specifications to ensure the samples are the same. I'm not sure, but would it look something like:

Code:

generate fixed2 = faith2 + at3 + attend_diff + hgage1 + hgage2 + agediff + esbrd1 + esbrd2 + child + linc

then adding

Code:

if fixed2 == 1

to the first four specifications. (I've not addressed missing values yet). Some guidance on approach/code is appreciated.

Here's an example of my data:

Code:

input byte(faith2 at3) float attend_diff int(hgage1 hgage2) byte(agediff esbrd1 esbrd2) float child byte linc
2 0 0 49 48  1 3 1 2 10
2 0 0 50 49  1 1 1 2  9
2 0 0 51 50  1 1 1 2 10
2 0 0 52 51  1 1 1 2 10
6 0 0 48 38 10 1 1 3  9
6 0 0 49 39 10 1 1 3  9
6 0 0 50 40 10 1 1 3  9
6 0 0 51 41 10 1 1 3  8
1 0 0 20 22  2 2 3 .  9
1 0 0 30 23  7 . 1 . 11
1 0 0 31 24  7 1 1 . 11
1 0 0 32 25  7 1 3 . 11
1 0 0 33 26  7 1 1 . 11
1 0 0 34 27  7 1 1 . 10
1 0 0 35 28  7 1 1 .  .
1 0 0 36 29  7 1 1 . 11
1 0 0 37 30  7 1 3 . 11
1 0 0 38 31  7 1 2 . 11
1 0 0 39 32  7 1 3 . 11
3 0 0 47 44  3 1 1 . 11
3 0 0 48 45  3 1 1 . 11
3 0 0 49 46  3 1 1 . 11
3 0 0 50 47  3 1 1 . 11
3 0 0 51 48  3 1 1 . 11
3 0 0 52 49  3 1 1 . 11
3 0 0 53 50  3 1 1 . 11
3 0 0 54 51  3 1 1 . 11
end

I'm using panel data.
Stata v.15.1.

Last edited by Chris Boulis; 05 Nov 2022, 22:51.
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17851
#2

06 Nov 2022, 02:28

Chris:
I fail to get what you're after.
When an observation is missing any of the variables included in -fixed2-, -fixed2- is missing as well.
Therefore, the -e(sample)- of your panel data regression will be reduced accordingly.
That said, you're seemingly dealing with an unbalanced panel dataset.
Why not live with it and use postestimation commands to test the resulting coefficients?
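
To illustrate the point about missing values, here is a minimal sketch using the variable names from #1. It shows why a sum is a poor sample flag and how a complete-case indicator might be built instead. The name complete is hypothetical, and the flag only captures missingness in the listed covariates; an estimation command may drop observations for other reasons too.

Code:

* A sum is missing whenever any component is missing, and it will
* rarely equal 1, so -if fixed2 == 1- does not select complete cases.
generate fixed2 = faith2 + at3 + attend_diff + hgage1 + hgage2 + agediff + esbrd1 + esbrd2 + child + linc
count if missing(fixed2)

* Hypothetical alternative: 1 if no listed covariate is missing, 0 otherwise.
generate byte complete = !missing(faith2, at3, attend_diff, hgage1, hgage2, agediff, esbrd1, esbrd2, child, linc)
tabulate complete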

Last edited by Carlo Lazzaro; 06 Nov 2022, 02:30.

Kind regards,
Carlo
(Stata 19.0)
Chris Boulis

Join Date: Feb 2019

Posts: 368
#3

06 Nov 2022, 02:49

Thank you for your reply, Carlo Lazzaro. Yes, I have an unbalanced panel dataset, and I was living with the declining number of observations (due to missing values) as I added more variables to each specification. However, I received a comment suggesting that I fix the sample so I can measure the additional explanatory power of the extra controls added in each specification. Do you have an idea how best to address this?
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17851
#4

06 Nov 2022, 03:02

Chris:
the main point with this nuisance is to make sure that the functional form of the regressand is correctly specified.
In addition, if each specification is based on a different sample, I fail to see a panel dataset then (unless specification means panel in the jargon of your research field).
My initial thought was that you were dealing with an unbalanced panel (but I did not find -panelid- and -timevar- in your data excerpt).
In addition, if you're using -xtreg, fe-, the -fe- estimator will wipe out all the time-invariant variables, reducing the number of coefficients (and the explanatory power of the related variables).

Kind regards,
Carlo
(Stata 19.0)

Chris Boulis

Join Date: Feb 2019
Posts: 368
#5

06 Nov 2022, 03:46

Hi Carlo Lazzaro. I refer to different sample sizes due to missing values, so the comment I received was that in this case it is not accurate to compare results between these specifications. Here are my specifications and the sample size associated with each (as you can see, I'm using the Cox proportional hazards model):

Code:

stcox i.faith2 // (57,095)
stcox i.faith2 i.at3 c.attend_diff // (37,491)
stcox i.faith2 i.at3 c.attend_diff c.hgage1 c.hgage2 c.agediff i.esbrd1 i.esbrd2 // (22,907)
stcox i.faith2 i.at3 c.attend_diff c.hgage1 c.hgage2 c.agediff i.esbrd1 i.esbrd2 i.educc* // (22,894)
stcox i.faith2 i.at3 c.attend_diff c.hgage1 c.hgage2 c.agediff i.esbrd1 i.esbrd2 i.educc* c.child c.linc // (20,690)

Updated sample of my panel data including id (couple id) and wave (HILDA dataset):

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float couple byte(wave faith2 at3) float attend_diff int(hgage1 hgage2) byte(agediff esbrd1 esbrd2) float child byte linc
 1  1  2 0 0 49 48  1 3 1 2 10
 1  2  2 0 0 50 49  1 1 1 2  9
 1  3  2 0 0 51 50  1 1 1 2 10
 1  4  2 0 0 52 51  1 1 1 2 10
 2  1  6 0 0 48 38 10 1 1 3  9
 2  2  6 0 0 49 39 10 1 1 3  9
 2  3  6 0 0 50 40 10 1 1 3  9
 2  4  6 0 0 51 41 10 1 1 3  8
 8 12 11 . 1 50 47  3 1 1 0 12
 8 13 11 . 1 51 48  3 1 1 0 12
 8 14 11 . 1 52 49  3 1 1 0 12
 8 15 11 . 1 53 50  3 1 1 0 12
 8 16 11 . 1 54 51  3 1 3 0 12
 8 17 11 . 1 55 52  3 1 1 0 12
 8 18  . . 1 56 53  3 1 1 0 12
 8 19  . . 1 57 54  3 1 1 0 12
10  1  3 0 0 47 44  3 1 1 . 11
10  2  3 0 0 48 45  3 1 1 . 11
10  3  3 0 0 49 46  3 1 1 . 11
10  4  3 0 0 50 47  3 1 1 . 11
10  5  3 0 0 51 48  3 1 1 . 11
10  6  3 0 0 52 49  3 1 1 . 11
10  7  3 0 0 53 50  3 1 1 . 11
10  8  3 0 0 54 51  3 1 1 . 11
end

I hope this helps clarify a few things.

Carlo Lazzaro

Join Date: Apr 2014

Posts: 17851
#6

06 Nov 2022, 04:01

Chris:
thanks for clarifying.
Assuming that the proportional hazards requirement holds for all your models, why not keep it simpler and use -estat ic- to compare their goodness of fit?
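
A minimal sketch of that suggestion, using two of the specifications from #5 and assuming the data have already been -stset-:

Code:

stcox i.faith2 i.at3 c.attend_diff
estat ic    // reports N, log likelihoods, df, AIC and BIC

stcox i.faith2 i.at3 c.attend_diff c.hgage1 c.hgage2 c.agediff i.esbrd1 i.esbrd2
estat ic
* AIC and BIC are comparable only when both models are fit to the same
* observations, so check that the reported N matches before comparing.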

Kind regards,
Carlo
(Stata 19.0)
Chris Boulis

Join Date: Feb 2019

Posts: 368
#7

06 Nov 2022, 04:29

Ok thanks Carlo Lazzaro, I'll give that a try and will post back on the outcome.

Hemanshu Kumar

Join Date: Mar 2015
Posts: 1548
#8

06 Nov 2022, 08:22

Chris Boulis: like most estimation commands, stcox sets e(sample), which is available after you run the command. You can use it to get a comparable sample across specifications.

So you could do this, for instance:

Code:

* Fit the fullest (5th) specification first and flag its estimation sample
stcox i.faith2 i.at3 c.attend_diff c.hgage1 c.hgage2 c.agediff i.esbrd1 i.esbrd2 i.educc* c.child c.linc
gen byte in_sample = e(sample)
* Refit the smaller specifications on that same sample
stcox i.faith2 if in_sample
stcox i.faith2 i.at3 c.attend_diff if in_sample
stcox i.faith2 i.at3 c.attend_diff c.hgage1 c.hgage2 c.agediff i.esbrd1 i.esbrd2 if in_sample
stcox i.faith2 i.at3 c.attend_diff c.hgage1 c.hgage2 c.agediff i.esbrd1 i.esbrd2 i.educc* if in_sample
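
Once every specification is restricted to in_sample, the stored results can be lined up to confirm that N is identical and to gauge what the added controls contribute. A minimal sketch, reusing in_sample from the code above; the names m4 and m5 are hypothetical, and the likelihood-ratio test assumes the models are nested and fit with default (non-robust, non-clustered) standard errors.

Code:

stcox i.faith2 i.at3 c.attend_diff c.hgage1 c.hgage2 c.agediff i.esbrd1 i.esbrd2 i.educc* if in_sample
estimates store m4
stcox i.faith2 i.at3 c.attend_diff c.hgage1 c.hgage2 c.agediff i.esbrd1 i.esbrd2 i.educc* c.child c.linc if in_sample
estimates store m5

* N should be the same in both rows if the sample is truly fixed
estimates stats m4 m5

* Joint test of the controls added in the 5th specification (child, linc)
lrtest m4 m5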

Chris Boulis

Join Date: Feb 2019

Posts: 368
#9

06 Nov 2022, 16:41

Hi Hemanshu Kumar. Thank you. I appreciate your suggestion and advice, and for making me aware of e(sample). It worked! Now the estimates for each specification are based on the same sample as the 5th specification (20,690), as noted in #5. What a wonderfully neat solution. Thank you so much, Hemanshu.