Recovering p-values from multiple bootstrapped iterations of a logit or glm regression

Jack Needleman

Join Date: May 2014

Posts: 3
#1

Recovering p-values from multiple bootstrapped iterations of a logit or glm regression

02 Mar 2023, 19:14

The CMS models for choosing comorbidities to include in a risk adjustment model for a quality measure are based on bootstrapping the regression 1000 times and choosing comorbidities (coded as clinical conditions with 0/1 coding of the presence of the condition for each patient, there are approximately 100 clinical conditions) for the final risk adjustment model if the p-value for the estimated coefficient on the comorbidity is statistically significant 90% of the time. Regardless of the wisdom of this strategy, I need to duplicate this methodology in data I have. In order to duplicate it, I need to set a seed, bootstrap the relevant regression, which will be either a logit for 0/1 outcome, or glm with log transform for cost outcomes, and recover the p-values for each covariate in the model for each iteration, so I can count if it is <0.05 in 90% of the cases. I am not sure how to bootstrap with replacement and save the p-values for each covariate from each iteration in a file or matrix so I can examine them. Any suggestions on how to code this would be appreciated.
Tags: None

Mike Lacy

Join Date: Apr 2014
Posts: 2416

02 Mar 2023, 19:58

Stata's built-in -bootstrap- command takes bootstrap samples (i.e., with replacement, by definition), executes an estimation command of your choice, and is able to save whatever result expressions you request from each estimation command into a Stata data file. Useful documentation to examine would include -help bootstrap- and -help return list-. Below you'll find an example that illustrates some technique. In this example, some of the bootstrap iterations don't produce valid regression results, so your 90% would presumably be based on the number of valid regression results. (That might well not be a problem in your situation.)

Code:

// Define a program to feed to the -bootstrap- command
cap prog drop myprog
prog myprog, rclass
    logit foreign weight price headroom // nonsense, I know
    tempname T // using a temp matrix is cleaner
    mat `T' = r(table) // r(table) is left behind after the regression command
    // put those p-values into the return list
    return scalar weightp = `T'[4,1]  
    return scalar pricep = `T'[4,2]
    return scalar headroomp = `T'[4,3]
end
//
clear
sysuse auto
// check out what ends up in r(table), for illustration
 logit foreign weight price headroom
mat list r(table)
//
bootstrap wp = r(weightp) pp = r(pricep) hp = r(headroomp), ///
   reps(100) saving("YourOutPutFile.dta", replace): myprog
clear
use "YourOutPutFile.dta"
desc
count if wp < 0.05
count if pp < 0.05
count if hp < 0.05

Comment

Jack Needleman

Join Date: May 2014

Posts: 3
#3

04 Mar 2023, 15:58

Thanks, Mike. Helpful. Will review the documentation more closely. Our problem is we need to collect the p values for all covariates in the regression, so a vector rather than a scalar, unless we loop through the results and create a scalar for each covariate in the regression. Any thoughts on which of these would be preferred?
Comment
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#4

05 Mar 2023, 08:04

I don't think I understand what's different about your situation than my example, and why an extended version of it won't work well for you. In my example, the p-value for *each* covariate is returned, and what I did could in principle be extended to as many covariates as you like by adding as many -return scalar ..- statements as needed and expanding the expression list given to -bootstrap-.

Perhaps the issue is that extending my example is a nuisance in your situation and you need something more automated? In that case, yes, you could make -myprog- loop through all the columns of r(table) and put p-values with varnames for each covariate into the return list:

Code:

// Define a more automated program to feed to the -bootstrap- command cap prog drop myprog prog myprog, rclass logit foreign weight price headroom // nonsense, I know tempname T // using a temp matrix is cleaner mat `T' = r(table) // r(table) is left behind after the regression command // put those p-values into the return list local nvar = colsof(`T') - 1 local varnames: colnames `T' forval i = 1/`nvar' { local var: word `i' of `varnames' return scalar p_`var' = T[4,`i'] } end

You could then have some code to create a local macro with an appropriate expression list for -bootstrap- out of the items in the return list of -myprog-.

It's also possible, as you are thinking, to have -myprog- return a vector with all the p-values. In place of the -forval- loop returning each p-value in -myprog-, you could do this:

Code:

mat `T' = `T'[4, 1..`nvar'] return matrix P = `T'

However, this might not be desirable. -bootstrap- doesn't accept vectors or matrices in its expression list, so you'd have to (I think) use the -bsample- command in a loop to take repeated bootstrap samples, create your own matrix that accumulated all the vectors across your bootstrapping repetitions, and do your own "housekeeping" instead of letting -bootstrap- do it. I can't think of an easy way to do that, but perhaps someone else can. For me, it would be easier to take the approach outlined above, but maybe that's not workable in your situation.
1 like
Comment
Jack Needleman

Join Date: May 2014

Posts: 3
#5

05 Mar 2023, 12:22

I am a little shaken by the 87 covariates I'm trying to analyze, and was thinking pulling the whole vector of p values might be preferred. But a look through the covariates to get the scalar will work. We'll try that. Thanks for hanging in there on this.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30118
#6

05 Mar 2023, 14:23

If you want to do this with a matrix of the probabilities of all the variables you can. Although, ultimately, to count the number of p < 0.05 hits for each variable you have to get it into a data set* and then loop over the variables. But this illustrates the approach, with changes to Mike Lacy's code shown in bold face:

Code:

clear* // Define a program to feed to the -bootstrap- command cap prog drop myprog // NOT rclass ANY MORE prog myprog logit foreign weight price headroom // nonsense, I know tempname T // using a temp matrix is cleaner mat `T' = r(table) // r(table) is left behind after the regression command // BUILD A MATRIX OF THE p-VALUES--MUST NOT USE A TEMPNAME FOR THIS ONE matrix P = nullmat(P) \ `T'[4, 1..3] end // sysuse auto // check out what ends up in r(table), for illustration logit foreign weight price headroom mat list r(table) bootstrap, noisily reps(100) seed(12345678) : myprog // P-VALUES HAVE BEEN SAVED IN MATRIX P. PUT THEM INTO MEMORY AS A DATA SET clear svmat P, names(col) // COUNT p-VALUES < 0.05 foreach v of varlist _all { quietly count if `v' < 0.05 display "`v': `r(N)'" }

Note: For reasons I cannot even begin to fathom, if the -noisily- option is omitted from the -bootstrap- command, the -bootstrap- breaks and carries out no valid regressions. (If somebody has seen a problem like this before, I'd really appreciate an explanation of what is going on here.) But it runs just fine with -noisily-. I suppose since you have to do 1,000 reps instead of 100 all that output is going to be annoying. So perhaps best to stay with the solution in #4. But that's your call.

* I would guess there is some clever way to use Mata instead of Stata matrices that would enable you to count up the p < 0.05 results for each variable without first moving the results to a data set. But I don't know what that might be.

Finally, I do appreciate O.P.'s acknowledgment that the wisdom of this entire strategy is questionable. I would not have been so polite about that observation.

Last edited by Clyde Schechter; 05 Mar 2023, 14:32.
1 like
Comment
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#7

06 Mar 2023, 08:28

Clyde Schechter offers a clever use of -bootstrap- I've never seen: Taking advantage of the conveniences it offers, but *not* using it to actually manage the data that's accumulated.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30118
#8

06 Mar 2023, 18:50

As nobody responded to my query about why the absence of -noisily- caused -bootstrap- to break, I contacted Stata Technical Support about this. I received the following (excerpted) reply from Gustavo Sanchez, who credits Jeff Pitblado with the key insight:

...When the expression list is composed of (or assumed to be) -_b-,
-bootstrap- and other replication method commands call

. set coeftabresult off

before entering the replication loop. This provides for faster
bootstrapping since Stata does not go to the effort of composing and
posting -r(table)- when an estimation command is run quietly.
See -help set coeftabresults-.

The -noisily- option forces the estimation command to report the
coefficient table for each replication, so -r(table)- consequently is
composed and posted. No error in your program, and -bootstrap- has
enough information in the replicates to compute and report its results.

Without the -noisily- option, your program attempts to pull the 4th row
out of a matrix that did not get posted, exits with an error since `T'
does not have a 4th row, -bootstrap- posts missing values for each
replicate and fails to compute any results.

We have -simulate- and replication methods like -bootstrap- call

. set coeftabresult off

when working with -_b- (and nothing else) to prevent Stata from doing
extra work.

We acknowledge that this is very technical/subtle and plan to add a
technical note to the documentation that explains this.

In the mean time, if you specify something other than -_b-, then
-bootstrap- will leave -set coeftabresult- unchanged for the
replications since you could specify any legal expression following an
estimation command, including elements of -r(table)-. For example,

. bootstrap (e(N)), reps(100) seed(12345678) : myprog

And I have verified that, indeed, this workaround of including an expression list in the -bootstrap- command enables it to run without -noisily-.
Comment

Announcement