  • Fixing a sample so observations between specifications are the same to enable comparisons

    In my regressions, I add variables across five specifications in which the last contains all the variables in my model. As each specification is based on different samples, I cannot accurately compare results between them. I understand one way of addressing this is to 'fix' the sample to the observations in the final (5th) specification as it contains all the variables added across the previous specifications. Doing so, I understand, will ensure the samples in each are the same, therefore, allowing more accurate comparisons of results across specifications.

    To do this, I thought of generating a new variable, which equals the variables in the final specification and adding this new variable to each of the first four specifications to ensure the samples are the same. I'm not sure, but would it look something like:
    Code:
    generate fixed2 = faith2 + at3 + attend_diff + hgage1 + hgage2 + agediff + esbrd1 + esbrd2 + child + linc
    then adding
    Code:
    if fixed2 == 1
    to the first four specifications. (I've not addressed missing values yet). Some guidance on approach/code is appreciated.

    Here's an example of my data:
    Code:
    input byte(faith2 at3) float attend_diff int(hgage1 hgage2) byte(agediff esbrd1 esbrd2) float child byte linc
     2 0 0 49 48  1 3 1 2 10
     2 0 0 50 49  1 1 1 2  9
     2 0 0 51 50  1 1 1 2 10
     2 0 0 52 51  1 1 1 2 10
     6 0 0 48 38 10 1 1 3  9
     6 0 0 49 39 10 1 1 3  9
     6 0 0 50 40 10 1 1 3  9
     6 0 0 51 41 10 1 1 3  8
     1 0 0 20 22  2 2 3 .  9
     1 0 0 30 23  7 . 1 . 11
     1 0 0 31 24  7 1 1 . 11
     1 0 0 32 25  7 1 3 . 11
     1 0 0 33 26  7 1 1 . 11
     1 0 0 34 27  7 1 1 . 10
     1 0 0 35 28  7 1 1 .  .
     1 0 0 36 29  7 1 1 . 11
     1 0 0 37 30  7 1 3 . 11
     1 0 0 38 31  7 1 2 . 11
     1 0 0 39 32  7 1 3 . 11
     3 0 0 47 44  3 1 1 . 11
     3 0 0 48 45  3 1 1 . 11
     3 0 0 49 46  3 1 1 . 11
     3 0 0 50 47  3 1 1 . 11
     3 0 0 51 48  3 1 1 . 11
     3 0 0 52 49  3 1 1 . 11
     3 0 0 53 50  3 1 1 . 11
     3 0 0 54 51  3 1 1 . 11
    end
    I'm using panel data.
    Stata v.15.1.
    Last edited by Chris Boulis; 05 Nov 2022, 22:51.

  • #2
    Chris:
    I fail to get what you're after.
    When an observation is missing any one of the variables summed in -fixed2-, -fixed2- is missing for that observation as well.
    Therefore, the -e(sample)- of your panel data regression will be reduced accordingly.
    That said, you're seemingly dealing with an unbalanced panel dataset.
    Why not live with it and use postestimation commands to test the resulting coefficients?
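
    A minimal sketch of the point about missing values, assuming the variable names from #1. A sum such as -fixed2- is missing whenever any component is missing, and it will rarely equal 1, so -if fixed2 == 1- would not pick out the intended rows. A complete-case indicator is one way to flag them. The name -complete- is hypothetical, and the flag may not match an estimation sample exactly, because an estimation command can drop observations for other reasons.

    ```stata
    * hypothetical flag: 1 if every listed variable is nonmissing, 0 otherwise
    generate byte complete = !missing(faith2, at3, attend_diff, hgage1, hgage2, ///
        agediff, esbrd1, esbrd2, child, linc)

    * how many observations have complete data on all ten variables
    tabulate complete
    ```

    Adding -if complete- to each specification would then restrict every model to the same rows.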
    Last edited by Carlo Lazzaro; 06 Nov 2022, 02:30.
    Kind regards,
    Carlo
    (Stata 19.0)

    • #3
      Thank you for your reply, Carlo Lazzaro. Yes, I have an unbalanced panel dataset, and I was living with the number of observations declining (due to missing values) as I added more variables to each specification. However, I received a comment suggesting I fix the sample so I can measure the additional explanatory power of the extra controls added to each specification. Do you have an idea how best to address this?

      • #4
        Chris:
        the main issue here is whether the functional form of the regressand is correctly specified.
        In addition, if each specification is based on a different sample, I fail to see a panel dataset then (unless specification means panel in the jargon of your research field).
        My initial thought was that you were dealing with an unbalanced panel (but I did not find -panelid- and -timevar- in your data excerpt).
        In addition, if you go with -xtreg, fe-, the -fe- estimator will wipe out all the time-invariant variables, reducing the number of coefficients (and the explanatory power of the related variables).
        Kind regards,
        Carlo
        (Stata 19.0)

        • #5
          Hi Carlo Lazzaro. By different samples I mean different sample sizes caused by missing values; the comment I received was that, in this case, comparisons of results between these specifications are not accurate. Here are my specifications and the sample size associated with each (as you can see, I'm using the Cox proportional hazards model):
          Code:
          stcox i.faith2 // (57,095)
          stcox i.faith2 i.at3 c.attend_diff // (37,491)
          stcox i.faith2 i.at3 c.attend_diff c.hgage1 c.hgage2 c.agediff i.esbrd1 i.esbrd2 // (22,907)
          stcox i.faith2 i.at3 c.attend_diff c.hgage1 c.hgage2 c.agediff i.esbrd1 i.esbrd2 i.educc* // (22,894)
          stcox i.faith2 i.at3 c.attend_diff c.hgage1 c.hgage2 c.agediff i.esbrd1 i.esbrd2 i.educc* c.child c.linc // (20,690)
          Updated sample of my panel data including id (couple id) and wave (HILDA dataset):

          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input float couple byte(wave faith2 at3) float attend_diff int(hgage1 hgage2) byte(agediff esbrd1 esbrd2) float child byte linc
           1  1  2 0 0 49 48  1 3 1 2 10
           1  2  2 0 0 50 49  1 1 1 2  9
           1  3  2 0 0 51 50  1 1 1 2 10
           1  4  2 0 0 52 51  1 1 1 2 10
           2  1  6 0 0 48 38 10 1 1 3  9
           2  2  6 0 0 49 39 10 1 1 3  9
           2  3  6 0 0 50 40 10 1 1 3  9
           2  4  6 0 0 51 41 10 1 1 3  8
           8 12 11 . 1 50 47  3 1 1 0 12
           8 13 11 . 1 51 48  3 1 1 0 12
           8 14 11 . 1 52 49  3 1 1 0 12
           8 15 11 . 1 53 50  3 1 1 0 12
           8 16 11 . 1 54 51  3 1 3 0 12
           8 17 11 . 1 55 52  3 1 1 0 12
           8 18  . . 1 56 53  3 1 1 0 12
           8 19  . . 1 57 54  3 1 1 0 12
          10  1  3 0 0 47 44  3 1 1 . 11
          10  2  3 0 0 48 45  3 1 1 . 11
          10  3  3 0 0 49 46  3 1 1 . 11
          10  4  3 0 0 50 47  3 1 1 . 11
          10  5  3 0 0 51 48  3 1 1 . 11
          10  6  3 0 0 52 49  3 1 1 . 11
          10  7  3 0 0 53 50  3 1 1 . 11
          10  8  3 0 0 54 51  3 1 1 . 11
          end
          I hope this helps clarify a few things.

          • #6
            Chris:
            thanks for clarifying.
            Assuming that the proportional hazards requirement holds for all your models, why not keep it simpler and use -estat ic- to compare their goodness of fit?
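
            A rough sketch of that comparison, assuming the data are already -stset- and using the first two specifications from #5. The stored names m1 and m2 are hypothetical. -estimates stats- reports the same AIC and BIC as -estat ic-, side by side, along with the number of observations behind each fit. Information criteria are only comparable between models fitted to the same observations, so the observation counts should match before the AIC/BIC columns are read against each other.

            ```stata
            * fit and store two of the specifications from #5
            stcox i.faith2
            estimates store m1

            stcox i.faith2 i.at3 c.attend_diff
            estimates store m2

            * one table with observations, log likelihood, df, AIC and BIC per model
            estimates stats m1 m2
            ```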
            Kind regards,
            Carlo
            (Stata 19.0)

            • #7
              Ok thanks Carlo Lazzaro, I'll give that a try and will post back on the outcome.

              • #8
                Chris Boulis: like most estimation commands, stcox sets e(sample), which is available after you run the command. It marks the observations used in the estimation, so you can use it to keep a comparable sample across specifications.

                So you could do this, for instance:

                Code:
                stcox i.faith2 i.at3 c.attend_diff c.hgage1 c.hgage2 c.agediff i.esbrd1 i.esbrd2 i.educc* c.child c.linc
                gen byte in_sample = e(sample)
                stcox i.faith2 if in_sample
                stcox i.faith2 i.at3 c.attend_diff if in_sample
                stcox i.faith2 i.at3 c.attend_diff c.hgage1 c.hgage2 c.agediff i.esbrd1 i.esbrd2 if in_sample
                stcox i.faith2 i.at3 c.attend_diff c.hgage1 c.hgage2 c.agediff i.esbrd1 i.esbrd2 i.educc* if in_sample
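
                With the sample fixed this way, nested specifications can also be compared formally. The following is only a sketch: it assumes the data are -stset-, that -in_sample- was created as above, and that the models are fitted without -vce(robust)- or -vce(cluster)-, which -lrtest- does not allow. The stored names spec2 and spec3 are hypothetical.

                ```stata
                * refit two nested specifications on the common sample and store them
                stcox i.faith2 i.at3 c.attend_diff if in_sample
                estimates store spec2

                stcox i.faith2 i.at3 c.attend_diff c.hgage1 c.hgage2 c.agediff i.esbrd1 i.esbrd2 if in_sample
                estimates store spec3

                * check that both fits report the same number of observations
                estimates stats spec2 spec3

                * likelihood-ratio test of spec2 (restricted) against spec3
                lrtest spec2 spec3
                ```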

                • #9
                  Hi Hemanshu Kumar. Thank you. I appreciate your advice and your making me aware of e(sample) - it worked! Now the estimates for each specification are based on the same sample as the 5th specification (20,690), as noted in #5. What a wonderfully neat solution. Thank you so much, Hemanshu.
