How to get out-of-sample predictions for specific subgroups?

Gate Nucht

Join Date: Jan 2017

Posts: 20
#1

29 Nov 2023, 10:30

I have a mixed effects logistic regression model:

Code:

quietly melogit y i.x1 i.x2 || x3:

Variable x1 is coded 0/1. I create the predicted probabilities for both values of x1:

Code:

margins x1

Then I obtain the predicted probabilities for each observation included in the model:

Code:

predict probhat if e(sample)
summarize probhat

To make out-of-sample predictions, I load my second dataset with the same variables:

Code:

use "C:\file path\newdata.dta", clear

Now I can get the predicted probabilities for each observation in the new dataset:

Code:

predict probhat_new
summarize probhat_new

My question is: How do I get what the 'margins' command created for the original dataset, but for the new dataset?

When I try this:

Code:

margins x1

Stata tells me this:

Code:

e(sample) does not identify the estimation sample

I also tried to recreate the original 'margins' output by calculating the mean of probhat for each value of x1, hoping that I could use the same approach to get out-of-sample subgroup predicted probabilities:

Code:

summarize probhat if x1== 0, meanonly
scalar mean_probhat_x1_0 = r(mean)
gen mean_probhat=.
replace mean_probhat = mean_probhat_x1_0 if x1== 0
summarize mean_probhat

However, the mean based on this code is different from the mean for x1==0 based on the 'margins' command.
Tags: logit, margins, out-of-sample prediction, predict, regression
George Ford

Join Date: Aug 2014

Posts: 3177
#2

01 Dec 2023, 12:44

margins will use the e(b) and e(V) matrices from the estimation, which you do not have. I think you'll have to make the calculations manually, which probably won't be fun with melogit but I have never done so.
Rich Goldstein

Join Date: Mar 2014

Posts: 4485
#3

01 Dec 2023, 12:49

I cannot completely follow, but I have a possible suggestion: load your first set of data and create an indicator (dummy) that is 1 for each observation in the data; then append the second data set (if possible) and estimate your model on only the original data (e.g., appending if "indicator"==1 to your command) - then see what you get for your out-of-sample predictions
Gate Nucht

Join Date: Jan 2017

Posts: 20
#4

01 Dec 2023, 15:13

Originally posted by George Ford:

margins will use the e(b) and e(V) matrices from the estimation, which you do not have. I think you'll have to make the calculations manually, which probably won't be fun with melogit but I have never done so.

George Ford: How would I use the 'predict' command to calculate what the 'margins x1' command produces with a vanilla logit model (rather than a mixed effects model using 'melogit')?

I tried the following, but the last couple lines of code do not produce the same results as the 'margins x1' line of code (I'm not sure why):

Code:

quietly logit y i.x1 i.x2 i.x3
margins x1
predict ps if e(sample)
bysort x1: summarize ps

--------------------------------------------

Rich Goldstein: Is this what you're suggesting?

Code:

cls
use "C:\dataset1.dta", clear
gen indicator=1
append using "C:\dataset2.dta"
quietly logit y i.x1 i.x2 i.x3 if indicator==1
margins x1

Now, the last line of code provides the average marginal outcomes for both values of x1, using all observations included in the logit model. However, I don't know how to get the average marginal outcomes for both values of x1 for all observations *not* included in the model (i.e., for all observations from dataset2).
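
One possible route, offered as a minimal sketch and not tested with melogit: margins accepts an if qualifier together with the noesample option, which tells it not to restrict its computations to the estimation sample. The example below uses the lbw data and a randomly generated holdout indicator (insample is a made-up name) as stand-ins for dataset1 and dataset2. Whether the reported standard errors suit the purpose is something to check against the margins documentation.

```stata
* Sketch: predictive margins for held-out observations via -noesample-
webuse lbw, clear
set seed 12022023

* Hypothetical holdout flag: 1 = used for estimation, 0 = held out
generate byte insample = runiformint(0,1)

* Fit the model on the estimation half only
logit low age lwt i.race smoke ptl ht ui if insample

* Default: margins computed over the estimation sample
margins race

* Margins computed over the held-out observations only
margins race if !insample, noesample
```

With the appended data above, the analogue would be something like margins x1 if indicator != 1, noesample, run after fitting the model with if indicator==1.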

Andrew Musau

Join Date: Oct 2014
Posts: 10254
#6

01 Dec 2023, 18:23

It's easy enough to mark the estimation sample so that margins can proceed; see

Code:

ssc describe erepost

However, you should verify that margins does the calculations correctly if you do so.

Code:

webuse lbw, clear
set seed 12022023
gen sample= runiformint(0,1)
keep if sample
logit low age lwt i.race smoke ptl ht ui
margins race

frame create outsample
frame outsample{
    webuse lbw, clear
    set seed 12022023
    gen sample= runiformint(0,1)
    keep if !sample
    *USE ALL OBSERVATIONS IN FRAME
    gen insample= 1
    erepost, esample(insample)
    margins race
}
frame drop outsample

Res.:

Code:

. logit low age lwt i.race smoke ptl ht ui

Iteration 0:  Log likelihood = -61.958741  
Iteration 1:  Log likelihood = -51.239685  
Iteration 2:  Log likelihood = -50.900119  
Iteration 3:  Log likelihood = -50.898507  
Iteration 4:  Log likelihood = -50.898507  

Logistic regression                                     Number of obs =     95
                                                        LR chi2(8)    =  22.12
                                                        Prob > chi2   = 0.0047
Log likelihood = -50.898507                             Pseudo R2     = 0.1785

------------------------------------------------------------------------------
         low | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |  -.0895979   .0527077    -1.70   0.089    -.1929031    .0137073
         lwt |  -.0077514   .0101248    -0.77   0.444    -.0275956    .0120929
             |
        race |
      Black  |   1.058808   .7043676     1.50   0.133     -.321727    2.439343
      Other  |   .8293784    .601216     1.38   0.168    -.3489833     2.00774
             |
       smoke |   1.045063   .5563525     1.88   0.060    -.0453677    2.135494
         ptl |   .1034313    .512202     0.20   0.840    -.9004663    1.107329
          ht |   2.024702   1.032164     1.96   0.050     .0016979    4.047706
          ui |     1.8056   .7169437     2.52   0.012     .4004162    3.210784
       _cons |   1.108562     1.6743     0.66   0.508    -2.173005     4.39013
------------------------------------------------------------------------------

. 
. margins race

Predictive margins                                          Number of obs = 95
Model VCE: OIM

Expression: Pr(low), predict()

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
        race |
      White  |   .2699205   .0629268     4.29   0.000     .1465863    .3932547
      Black  |   .4688104   .1200475     3.91   0.000     .2335216    .7040993
      Other  |   .4224025   .0864041     4.89   0.000     .2530535    .5917514
------------------------------------------------------------------------------

. 
. 
. 
. frame create outsample

. 
. frame outsample{
. 
.     webuse lbw, clear
(Hosmer & Lemeshow data)
. 
.     set seed 12022023
. 
.     gen sample= runiformint(0,1)
. 
.     keep if !sample
(95 observations deleted)
. 
.     *USE ALL OBSERVATIONS IN FRAME
. 
.     gen insample= 1
. 
.     erepost, esample(insample)
. 
.     margins race
warning: cannot perform check for estimable functions.

Predictive margins                                          Number of obs = 94
Model VCE: OIM

Expression: Pr(low), predict()

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
        race |
      White  |   .2910254   .0598138     4.87   0.000     .1737925    .4082583
      Black  |   .4754966   .1122323     4.24   0.000     .2555253    .6954679
      Other  |   .4322994   .0807988     5.35   0.000     .2739367    .5906621
------------------------------------------------------------------------------
. 
. }

Last edited by Andrew Musau; 01 Dec 2023, 18:30.

Gate Nucht

Join Date: Jan 2017

Posts: 20
#7

02 Dec 2023, 09:13

Thank you, Andrew Musau, this seems to work.

How would I go about verifying whether margins does the calculations correctly for the second dataset?

Andrew Musau

Join Date: Oct 2014
Posts: 10254
#8

02 Dec 2023, 11:22

This may help: the code below reproduces margins with the -atmeans- option by hand. The default, without -atmeans-, uses the observed values of the covariates. Check the documentation for details.

Code:

webuse lbw, clear
logit low age i.race
margins race, atmeans
assert e(sample)
sum age
replace age=r(mean)

levelsof race, local(rlevs)
foreach num of numlist `rlevs'{
    replace race = `num'
    predict prob`num', pr
    sum prob`num'
}

Res.:

Code:

. margins race, atmeans

Adjusted predictions                                       Number of obs = 189
Model VCE: OIM

Expression: Pr(low), predict()
At: age    =  23.2381 (mean)
    1.race = .5079365 (mean)
    2.race = .1375661 (mean)
    3.race = .3544974 (mean)

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
        race |
      white  |   .2448658   .0444639     5.51   0.000     .1577182    .3320134
      black  |   .4059021   .0972933     4.17   0.000     .2152106    .5965935
      other  |   .3643626   .0592512     6.15   0.000     .2482323    .4804929
------------------------------------------------------------------------------

.
. assert e(sample)

.
. sum age

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
         age |        189     23.2381    5.298678         14         45

.
. replace age=r(mean)
variable age was byte now float
(189 real changes made)

.
.
.
. levelsof race, local(rlevs)
1 2 3

.
. foreach num of numlist `rlevs'{
  2.
.     replace race = `num'
  3.
.     predict prob`num', pr
  4.
.     sum prob`num'
  5.
. }
(93 real changes made)

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       prob1 |        189    .2448658           0   .2448658   .2448658
(189 real changes made)

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       prob2 |        189    .4059021           0   .4059021   .4059021
(189 real changes made)

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       prob3 |        189    .3643626           0   .3643626   .3643626

.

Last edited by Andrew Musau; 02 Dec 2023, 11:47.

Gate Nucht

Join Date: Jan 2017

Posts: 20
#9

02 Dec 2023, 12:53

Thank you so much!

Using the code you provided, I was able to replicate the output from 'margins x1, atmeans' for both datasets. It all checks out.

I'll also look into the documentation to replicate the default 'margins x1' output.
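
For the default output, a sketch of the same loop without fixing age at its mean may be enough. As I understand it, margins race sets race to each level in turn for every observation, leaves the other covariates at their observed values, and averages the predicted probabilities. That would also explain why bysort x1: summarize ps gives different numbers, since it averages only within each observed subgroup. The names race_orig and pm1, pm2, pm3 are made up for illustration.

```stata
* Sketch: replicate the default -margins race- (predictive margins) by hand
webuse lbw, clear
logit low age i.race
margins race

* Keep a copy of the observed values so they can be restored afterwards
generate race_orig = race

* Set race to each level for ALL observations, predict, and average
levelsof race, local(rlevs)
foreach num of local rlevs {
    replace race = `num'
    predict pm`num', pr
    summarize pm`num', meanonly
    display "race = `num': " %9.7f r(mean)
}

* Restore the observed values of race
replace race = race_orig
```

The displayed means should line up with the Margin column from margins race. After melogit, predict and margins may also differ in how they treat the random effects by default, so it is worth checking the melogit postestimation documentation before comparing the two.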