Confidence interval for mean of predicted probabilities following a binary logistic regression

Mikkel Andersen

Join Date: Mar 2017

Posts: 26
#1

04 Mar 2017, 12:40

How does one compute a confidence interval for the mean of the predicted probabilities following a binary logistic regression? Should I use, for example, some kind of bootstrapping approach or the margins command?

To be more specific, I have a very large sample of individuals from the same country but from different geographic regions, and the regions differ widely in their number of observations in my sample. Using this sample, I run a binary logistic regression and predict the individual probabilities of the outcome. Then I find the mean of the predicted probabilities for each geographic region. The problem is how to calculate the confidence interval for the mean of the predicted probabilities in each geographic region.

In the situation described above, I have sample data. Should the confidence interval for the mean of the predicted probabilities be calculated differently if I instead have data on all individuals in the country? (Some argue that it is still relevant to talk about sampling error in this situation, since one could view the population as a sample from some kind of super population.)
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#2

04 Mar 2017, 14:22

The -margins- command does this, if you have used factor-variable notation in your logistic regression. So, roughly speaking:

Code:

logistic outcome i.region other_covariates
margins region

Note that what you will get here is not exactly what you described. This use of -margins- will give you the average adjusted predicted probabilities, taking into account the observed distributions of all of the covariates in the model. This is what is usually of interest.

But if you really want predicted probabilities that are calculated using only region-specific observations, then the -margins- command must be changed to -margins, over(region)-. Bear in mind that these latter results are not adjusted to a common distribution of other covariates, so they may not be directly comparable to each other.
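
As a rough illustration of the difference between the two variants, here is a minimal sketch. It assumes the lbw example data that ships with Stata (webuse lbw), with race standing in for region; the data set and variable names are chosen for illustration only and are not taken from the original question.

Code:

* Illustration only: race plays the role of region
webuse lbw, clear
logistic low age lwt i.race

* Adjusted predictions: each race level is applied in turn to the whole sample
margins race

* Group-specific means: predictions are averaged within each race group only
margins, over(race)

The first -margins- call averages over the covariate distribution of the full sample, while the second averages only over the observations that actually belong to each group, so the two sets of margins will generally differ.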

To learn more about the -margins- command, I recommend starting with https://www3.nd.edu/~rwilliam/stats/Margins01.pdf, which covers the basics of the command in a particularly clear way. Then you can learn more from the -margins- section of the online user manuals.

Last edited by Clyde Schechter; 04 Mar 2017, 14:25.

Marcos Almeida

Join Date: Apr 2014
Posts: 4047
#3

04 Mar 2017, 15:36

Clyde Schechter, following your example, I decided to check whether averaging the predicted values would give the same means and similar CIs as margins, over(region).

In the example below, the means are exactly the same, but the CIs are much narrower under -ci means-. I gather the difference is due to the delta method used by margins, but I wonder whether the -ci means- CIs should be dismissed as incorrect, or whether they are "valid" as well, simply as an alternative strategy for estimating the CI.

Code:

. webuse lbw
(Hosmer & Lemeshow data)

.  logistic low age lwt i.race

Logistic regression                             Number of obs     =        189
                                                LR chi2(4)        =      12.00
                                                Prob > chi2       =     0.0174
Log likelihood = -111.33847                     Pseudo R2         =     0.0511

------------------------------------------------------------------------------
         low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .9747731   .0324118    -0.77   0.442      .913273    1.040415
         lwt |   .9857717   .0064287    -2.20   0.028     .9732518    .9984526
             |
        race |
      black  |   2.727539   1.358248     2.01   0.044     1.027764    7.238499
      other  |   1.558693   .5614898     1.23   0.218     .7693628    3.157838
             |
       _cons |   3.685916   3.942892     1.22   0.223     .4528972    29.99793
------------------------------------------------------------------------------

 
. margins, over(race)

Predictive margins                              Number of obs     =        189
Model VCE    : OIM

Expression   : Pr(low), predict()
over         : race

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        race |
      white  |   .2395833   .0429128     5.58   0.000     .1554759    .3236908
      black  |   .4230769   .0933667     4.53   0.000     .2400816    .6060722
      other  |   .3731343   .0582432     6.41   0.000     .2589798    .4872889
------------------------------------------------------------------------------

. predict mypred, p

. by race, sort : ci means mypred

----------------------------------------------------------------------------------------------------------------------------
-> race = white

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
-------------+---------------------------------------------------------------
      mypred |         96    .2395833    .0075385        .2246176    .2545491

----------------------------------------------------------------------------------------------------------------------------
-> race = black

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
-------------+---------------------------------------------------------------
      mypred |         26    .4230769    .0264062        .3686923    .4774616

----------------------------------------------------------------------------------------------------------------------------
-> race = other

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
-------------+---------------------------------------------------------------
      mypred |         67    .3731343    .0100175        .3531337     .393135

I tested it again, this time with a linear regression model. The means were again exactly the same with predict and margins, but the CIs differed once more, with the delta method again giving the wider intervals.

Code:

.  regress lwt age i.race

      Source |       SS           df       MS      Number of obs   =       189
-------------+----------------------------------   F(3, 185)       =      8.05
       Model |  20285.7689         3  6761.92296   Prob > F        =    0.0000
    Residual |  155464.115       185  840.346566   R-squared       =    0.1154
-------------+----------------------------------   Adj R-squared   =    0.1011
       Total |  175749.884       188  934.839806   Root MSE        =    28.989

------------------------------------------------------------------------------
         lwt |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   1.079484   .4080005     2.65   0.009     .2745522    1.884416
             |
        race |
      black  |   17.72765   6.506647     2.72   0.007     4.890882    30.56442
      other  |  -9.967319   4.679671    -2.13   0.034     -19.1997   -.7349376
             |
       _cons |   105.8296    10.3432    10.23   0.000     85.42383    126.2354
------------------------------------------------------------------------------

. margins, over(race)

Predictive margins                              Number of obs     =        189
Model VCE    : OLS

Expression   : Linear prediction, predict()
over         : race

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        race |
      white  |   132.0521    2.95865    44.63   0.000     126.2151    137.8891
      black  |   146.8077   5.685158    25.82   0.000     135.5916    158.0238
      other  |   120.0299   3.541537    33.89   0.000     113.0429    127.0168
------------------------------------------------------------------------------

. predict mypred2, xb

. by race, sort : ci means mypred2

----------------------------------------------------------------------------------------------------------------------------
-> race = white

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
-------------+---------------------------------------------------------------
     mypred2 |         96    132.0521    .6230184        130.8152    133.2889

----------------------------------------------------------------------------------------------------------------------------
-> race = black

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
-------------+---------------------------------------------------------------
     mypred2 |         26    146.8077    1.081526        144.5802    149.0351

----------------------------------------------------------------------------------------------------------------------------
-> race = other

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
-------------+---------------------------------------------------------------
     mypred2 |         67    120.0299    .5981947        118.8355    121.2242

The results struck me as puzzling. I wonder whether both methods should be considered trustworthy, or just the delta method.

Thanks in advance.

Best regards,

Marcos

Jeph Herrin

Join Date: Apr 2014
Posts: 335
#4

04 Mar 2017, 16:19

Originally posted by Marcos Almeida

I think that with -ci- you would want to get the SE for each predicted outcome and use the inverse of its square as a weight. The predicted values have uncertainty, but -ci- treats them as if they were measured without error. I did this with the linear model and got confidence intervals that are narrower than the unweighted -ci- ones, and so still much narrower than those from -margins-.

Code:

. predict se, stdp

. gen invvar=1/(se^2)

. bys race: ci means xb [aw=invvar]

----------------------------------------------------------------------------------------------------------------------------------------------------------------------
-> race = white

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
-------------+---------------------------------------------------------------
          xb |         96    131.6041    .4992147         130.613    132.5952

----------------------------------------------------------------------------------------------------------------------------------------------------------------------
-> race = black

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
-------------+---------------------------------------------------------------
          xb |         26    146.4072    .9520795        144.4463     148.368

----------------------------------------------------------------------------------------------------------------------------------------------------------------------
-> race = other

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
-------------+---------------------------------------------------------------
          xb |         67    119.8806    .5263877        118.8297    120.9316

hth,
Jeph

Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#5

05 Mar 2017, 05:38

Jeph Herrin Thank you for the reply and the advice.

I see you got narrow CIs with predict se, stdp, even narrower than those I got with the "default" predict command.

I also noticed that, in your example, the means are slightly different, contrary to the examples I shared, where the means are exactly the same with predict and margins.

Now, wishing to stick with the "most accurate one", should there be such a thing, I cannot help but speculate that all these different CIs would eventually converge to the same point under asymptotic conditions...

Surely I must be missing some important aspect!

Best regards,

Marcos
Mikkel Andersen

Join Date: Mar 2017

Posts: 26
#6

05 Mar 2017, 06:39

First of all, thank you all for your insightful comments and suggestions.

However, I do not include geographic region as a covariate in my model, since I do not wish to include regional fixed effects in my model. I only want the predicted probabilities for each individual to reflect their observed values on the covariates at the individual level.
Thus, how do I calculate a confidence interval for the mean of the predicted probabilities for each geographic region? Can I just use the margins command with an if qualifier, e.g. margins if region == 345? And should the confidence interval be calculated using the default delta method or the vce(unconditional) option?

If the margins command is not appropriate for my problem, do you think I should use some kind of bootstrapping approach to obtain valid confidence intervals? For example, I could replicate the binary logistic regression 1,000 times and each time calculate and save the mean of the predicted probabilities for each region. Finally, I would take the 2.5th and 97.5th percentiles of the distribution of calculated means for a geographic region and use these as the lower and upper limits of the confidence interval for that region. One downside to this solution is probably that the width of the confidence interval won't reflect the number of observations from that geographic region in the sample. But maybe one can find a way to take that into account.
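
The bootstrap idea described above could be sketched roughly as follows. This is only an illustration under stated assumptions: it uses the lbw example data (webuse lbw) with race standing in for region, the program name meanpred is hypothetical, and the model is refitted and the predictions averaged on the same resampled data, which may not match a design where estimation and prediction samples differ.

Code:

* Illustration only: race plays the role of region, and race is not in the model
webuse lbw, clear

capture program drop meanpred
program define meanpred, rclass
    * refit the model on the current bootstrap sample
    logit low age lwt
    tempvar p
    predict double `p', pr
    * store the mean predicted probability within each race group
    forvalues r = 1/3 {
        summarize `p' if race == `r', meanonly
        return scalar m`r' = r(mean)
    }
end

* 1,000 replications; the seed is arbitrary and only makes the run reproducible
bootstrap m1=r(m1) m2=r(m2) m3=r(m3), reps(1000) seed(12345): meanpred

* percentile-based confidence intervals (2.5th and 97.5th percentiles)
estat bootstrap, percentile

Because whole individuals are resampled, both the fitted coefficients and the composition of each group change from replicate to replicate in this sketch.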
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#7

05 Mar 2017, 06:51

I gather that Clyde's advice in #2, margins with the over() option, points clearly towards what you want, at least as far as the mean values are concerned. A question remains regarding the CIs but, considering that margins gave the widest ones, it seems to be the most conservative approach.

That said, a description of the data set, with commands and output, as recommended in the FAQ, would surely attract more insightful replies.

Best regards,

Marcos
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#8

05 Mar 2017, 12:50

You can still use -margins, over(race)- even though race is not a variable in the model. [By contrast, you cannot use -margins race- without i.race in the model.] It will give you mean predicted values for each race group, with the predictions based on whatever variables are in the model but averaged only within the individual race groups, which I gather is what you are looking for.
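
A minimal sketch of this, again assuming the lbw example data with race standing in for region (an illustration, not the original poster's data). The vce(robust) option on the model is added here only because, as far as I know, -margins- needs a robust or cluster-robust VCE before it will accept vce(unconditional):

Code:

* Illustration only: race is deliberately left out of the model
webuse lbw, clear
logistic low age lwt, vce(robust)

* Mean predicted probability within each race group (delta-method SEs)
margins, over(race)

* Same means, but with SEs that also treat the covariates as sampled
margins, over(race) vce(unconditional)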
Mikkel Andersen

Join Date: Mar 2017

Posts: 26
#9

06 Mar 2017, 06:56

Again, thank you for your comments.

Basically, my code is as follows:

Code:

logit y x z
predict yhat

Or

Code:

logit y x z, vce(cluster id)
predict yhat

My data cover many individuals across five years. However, I only estimate my model on data from the most recent year (a simple cross-section). Afterwards, I predict the individual probabilities for all individuals in all five years.

I have looked at -margins, over(region)-, and it seems like a sound approach. However, I have some questions regarding it:

1. Is it possible to compare the means of the predicted probabilities between two regions in the same year, i.e. to test whether the difference in the means is statistically significant? Is it also possible to adjust for multiple comparisons (e.g. Bonferroni) if I want to compare more than two regions at the same time?

2. Should I use the default delta method for the standard errors, or should I use the vce(unconditional) option? And what kind of statistical uncertainty does the default delta method handle? In principle, I have data on all individuals in the country in each year, but one could also view this as a sample from a super population. Also, the observations from years other than the most recent one were not part of the estimation sample.

3. Do the confidence intervals take into account whether the individuals in a given region have a relatively low, medium or high probability of the outcome? I have read that the statistical uncertainty is higher for an individual whose covariate values are very different from the covariate means. Individuals with atypical values would probably also differ in predicted probability from individuals with more typical values.

4. How does -over(region)- differ from the subpopulation option?

Thank you in advance.
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#10

06 Mar 2017, 07:07

Since some of your questions regarding the CI estimations are practically the same concerns I shared in #3, I hope "we both" get further advice on that.

That said, I will address item 4 of your questions in #9:

According to this lecture:

Specifying over is equivalent to running margins on subpopulations

Hopefully that helps.

Best regards,

Marcos