  • How to get out-of-sample predictions for specific subgroups?

    I have a mixed effects logistic regression model:

    Code:
    quietly melogit y i.x1 i.x2 || x3:
    Variable x1 is coded 0/1. I create the predicted probabilities for both values of x1:

    Code:
    margins x1
    Then I obtain the predicted probabilities for each observation included in the model:

    Code:
    predict probhat if e(sample)
    summarize probhat
    To make out-of-sample predictions, I load my second dataset with the same variables:

    Code:
    use "C:\file path\newdata.dta", clear
    Now I can get the predicted probabilities for each observation in the new dataset:

    Code:
    predict probhat_new
    summarize probhat_new
    My question is: How do I get what the 'margins' command created for the original dataset, but for the new dataset?

    When I try this:
    Code:
    margins x1
    Stata tells me this:
    Code:
    e(sample) does not identify the estimation sample
    I also tried to recreate the original 'margins' output by calculating the mean of probhat for each value of x1, hoping that the same approach would then give me out-of-sample predicted probabilities for each subgroup:

    Code:
    summarize probhat if x1== 0, meanonly
    scalar mean_probhat_x1_0 = r(mean)
    
    gen mean_probhat=.
    replace mean_probhat = mean_probhat_x1_0 if x1== 0
    summarize mean_probhat
    However, the mean based on this code is different from the mean for x1==0 based on the 'margins' command.


  • #2
    margins will use the e(b) and e(V) matrices from the estimation, which you do not have. I think you'll have to make the calculations manually, which probably won't be fun with melogit but I have never done so.


    • #3
      I cannot completely follow, but here is a possible suggestion: load your first dataset and create an indicator (dummy) that equals 1 for every observation. Then append the second dataset (if possible) and estimate your model on the original data only (e.g., by adding if indicator==1 to your estimation command). Then see what you get for your out-of-sample predictions.



        • #5
          Originally posted by George Ford
          margins will use the e(b) and e(V) matrices from the estimation, which you do not have. I think you'll have to make the calculations manually, which probably won't be fun with melogit but I have never done so.
          George Ford: How would I use the 'predict' command to calculate what the 'margins x1' command produces with a vanilla logit model (rather than a mixed effects model using melogit)?

          I tried the following, but the last couple lines of code don't produce the same results as the 'margins x1' line of code (I'm not sure why):

          Code:
          quietly logit y i.x1 i.x2 i.x3
          
          margins x1
          
          predict ps if e(sample)
          bysort x1: summarize ps
          ----------------------------------------------------------

          Rich Goldstein: Is this what you're suggesting?

          Code:
          use "C:\dataset1.dta", clear
          
          gen indicator=1
          
          append using "C:\dataset2.dta"
          
          quietly logit y i.x1 i.x2 i.x3 if indicator==1
          
          margins x1
          Now, the last line of code provides the average marginal outcomes for both values of x1, using all observations included in the logit model. However, I don't know how to get the average marginal outcomes for both values of x1 for all observations *not* included in the model (i.e. for all observations from dataset2).
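
          One possible route, offered as a sketch rather than a tested recipe: margins accepts an if qualifier together with its noesample option, which tells it not to restrict the computation to the estimation sample. The example below uses the lbw data as a stand-in for dataset1 and dataset2, and the 0/1 variable indicator is hypothetical. In the appended data above, indicator is missing for the dataset2 rows, so the qualifier there would be if missing(indicator).

          ```stata
          * Stand-in data: a random half plays the role of dataset1 (assumption)
          webuse lbw, clear
          set seed 12022023
          gen byte indicator = runiformint(0,1)

          * Estimate on the "dataset1" rows only
          quietly logit low age lwt i.race smoke ptl ht ui if indicator==1

          * Default: margins over the estimation sample
          margins race

          * Margins over the held-out rows only
          margins race if indicator==0, noesample
          ```

          It is worth checking the second set of margins against predictions computed by hand before relying on it.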


          • #6
            It's easy enough to mark the estimation sample so that margins can proceed; see

            Code:
            ssc describe erepost
            However, you should verify whether margins does the calculations correctly if you do so.

            Code:
            webuse lbw, clear
            set seed 12022023
            gen sample= runiformint(0,1)
            keep if sample
            logit low age lwt i.race smoke ptl ht ui
            margins race
            
            frame create outsample
            frame outsample{
                webuse lbw, clear
                set seed 12022023
                gen sample= runiformint(0,1)
                keep if !sample
                *USE ALL OBSERVATIONS IN FRAME
                gen insample= 1
                erepost, esample(insample)
                margins race
            }
            frame drop outsample
            Results:

            Code:
            . logit low age lwt i.race smoke ptl ht ui
            
            Iteration 0:  Log likelihood = -61.958741  
            Iteration 1:  Log likelihood = -51.239685  
            Iteration 2:  Log likelihood = -50.900119  
            Iteration 3:  Log likelihood = -50.898507  
            Iteration 4:  Log likelihood = -50.898507  
            
            Logistic regression                                     Number of obs =     95
                                                                    LR chi2(8)    =  22.12
                                                                    Prob > chi2   = 0.0047
            Log likelihood = -50.898507                             Pseudo R2     = 0.1785
            
            ------------------------------------------------------------------------------
                     low | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
            -------------+----------------------------------------------------------------
                     age |  -.0895979   .0527077    -1.70   0.089    -.1929031    .0137073
                     lwt |  -.0077514   .0101248    -0.77   0.444    -.0275956    .0120929
                         |
                    race |
                  Black  |   1.058808   .7043676     1.50   0.133     -.321727    2.439343
                  Other  |   .8293784    .601216     1.38   0.168    -.3489833     2.00774
                         |
                   smoke |   1.045063   .5563525     1.88   0.060    -.0453677    2.135494
                     ptl |   .1034313    .512202     0.20   0.840    -.9004663    1.107329
                      ht |   2.024702   1.032164     1.96   0.050     .0016979    4.047706
                      ui |     1.8056   .7169437     2.52   0.012     .4004162    3.210784
                   _cons |   1.108562     1.6743     0.66   0.508    -2.173005     4.39013
            ------------------------------------------------------------------------------
            
            . 
            . margins race
            
            Predictive margins                                          Number of obs = 95
            Model VCE: OIM
            
            Expression: Pr(low), predict()
            
            ------------------------------------------------------------------------------
                         |            Delta-method
                         |     Margin   std. err.      z    P>|z|     [95% conf. interval]
            -------------+----------------------------------------------------------------
                    race |
                  White  |   .2699205   .0629268     4.29   0.000     .1465863    .3932547
                  Black  |   .4688104   .1200475     3.91   0.000     .2335216    .7040993
                  Other  |   .4224025   .0864041     4.89   0.000     .2530535    .5917514
            ------------------------------------------------------------------------------
            
            . 
            . 
            . 
            . frame create outsample
            
            . 
            . frame outsample{
            . 
            .     webuse lbw, clear
            (Hosmer & Lemeshow data)
            . 
            .     set seed 12022023
            . 
            .     gen sample= runiformint(0,1)
            . 
            .     keep if !sample
            (95 observations deleted)
            . 
            .     *USE ALL OBSERVATIONS IN FRAME
            . 
            .     gen insample= 1
            . 
            .     erepost, esample(insample)
            . 
            .     margins race
            warning: cannot perform check for estimable functions.
            
            Predictive margins                                          Number of obs = 94
            Model VCE: OIM
            
            Expression: Pr(low), predict()
            
            ------------------------------------------------------------------------------
                         |            Delta-method
                         |     Margin   std. err.      z    P>|z|     [95% conf. interval]
            -------------+----------------------------------------------------------------
                    race |
                  White  |   .2910254   .0598138     4.87   0.000     .1737925    .4082583
                  Black  |   .4754966   .1122323     4.24   0.000     .2555253    .6954679
                  Other  |   .4322994   .0807988     5.35   0.000     .2739367    .5906621
            ------------------------------------------------------------------------------
            . 
            . }
            Last edited by Andrew Musau; 01 Dec 2023, 18:30.


            • #7
              Thank you, Andrew Musau, this seems to work.

              How would I go about verifying whether margins does the calculations correctly for the second dataset?


              • #8
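                This example, which uses the -atmeans- option, may help. With -atmeans-, margins evaluates the prediction with the other covariates fixed at their means; the default instead keeps the observed covariate values and averages the resulting predictions. Check the documentation for details.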

                Code:
                webuse lbw, clear
                logit low age i.race
                margins race, atmeans
                assert e(sample)
                sum age
                replace age=r(mean)
                
                levelsof race, local(rlevs)
                foreach num of numlist `rlevs'{
                    replace race = `num'
                    predict prob`num', pr
                    sum prob`num'
                }
                Results:

                Code:
                . margins race, atmeans
                
                Adjusted predictions                                       Number of obs = 189
                Model VCE: OIM
                
                Expression: Pr(low), predict()
                At: age    =  23.2381 (mean)
                    1.race = .5079365 (mean)
                    2.race = .1375661 (mean)
                    3.race = .3544974 (mean)
                
                ------------------------------------------------------------------------------
                             |            Delta-method
                             |     Margin   std. err.      z    P>|z|     [95% conf. interval]
                -------------+----------------------------------------------------------------
                        race |
                      white  |   .2448658   .0444639     5.51   0.000     .1577182    .3320134
                      black  |   .4059021   .0972933     4.17   0.000     .2152106    .5965935
                      other  |   .3643626   .0592512     6.15   0.000     .2482323    .4804929
                ------------------------------------------------------------------------------
                
                .
                . assert e(sample)
                
                .
                . sum age
                
                    Variable |        Obs        Mean    Std. dev.       Min        Max
                -------------+---------------------------------------------------------
                         age |        189     23.2381    5.298678         14         45
                
                .
                . replace age=r(mean)
                variable age was byte now float
                (189 real changes made)
                
                .
                .
                .
                . levelsof race, local(rlevs)
                1 2 3
                
                .
                . foreach num of numlist `rlevs'{
                  2.
                .     replace race = `num'
                  3.
                .     predict prob`num', pr
                  4.
                .     sum prob`num'
                  5.
                . }
                (93 real changes made)
                
                    Variable |        Obs        Mean    Std. dev.       Min        Max
                -------------+---------------------------------------------------------
                       prob1 |        189    .2448658           0   .2448658   .2448658
                (189 real changes made)
                
                    Variable |        Obs        Mean    Std. dev.       Min        Max
                -------------+---------------------------------------------------------
                       prob2 |        189    .4059021           0   .4059021   .4059021
                (189 real changes made)
                
                    Variable |        Obs        Mean    Std. dev.       Min        Max
                -------------+---------------------------------------------------------
                       prob3 |        189    .3643626           0   .3643626   .3643626
                
                .
                Last edited by Andrew Musau; 02 Dec 2023, 11:47.


                • #9
                  Thank you so much!

                  Using the code you provided, I was able to replicate the output of "margins x1, atmeans" for both datasets. It all checks out.

                  I'll also look into the documentation to replicate the default 'margins x1' output.
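
                  For reference, the default output of margins (without atmeans) is a set of predictive margins: every observation is assigned each level of the factor in turn, the other covariates keep their observed values, and the predicted probabilities are averaged. This is also why averaging predict's output within the observed subgroups does not reproduce it. A minimal sketch of that calculation, using the lbw data and the hypothetical variable names race_obs and pm1 to pm3:

                  ```stata
                  webuse lbw, clear
                  logit low age i.race
                  margins race

                  * Keep a copy of the observed values so they can be restored
                  clonevar race_obs = race

                  levelsof race_obs, local(rlevs)
                  foreach num of local rlevs {
                      * Counterfactual: treat every observation as level `num'
                      replace race = `num'
                      predict pm`num' if e(sample), pr
                      * The mean of these predictions is the predictive margin
                      summarize pm`num', meanonly
                      display "race = `num': " %9.7f r(mean)
                  }

                  * Restore the observed values
                  replace race = race_obs
                  drop race_obs
                  ```

                  In a second dataset the same loop would run without the if e(sample) qualifier, since e(sample) does not identify any observations there. After melogit, predict by default includes estimates of the random effects, so matching margins would probably require predict's marginal option; treat that as an assumption to check against the melogit postestimation documentation.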
