Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Recovering p-values from multiple bootstrapped iterations of a logit or glm regression

    The CMS models for choosing comorbidities to include in a risk adjustment model for a quality measure are based on bootstrapping the regression 1000 times and choosing comorbidities (coded as clinical conditions with 0/1 coding of the presence of the condition for each patient, there are approximately 100 clinical conditions) for the final risk adjustment model if the p-value for the estimated coefficient on the comorbidity is statistically significant 90% of the time. Regardless of the wisdom of this strategy, I need to duplicate this methodology in data I have. In order to duplicate it, I need to set a seed, bootstrap the relevant regression, which will be either a logit for 0/1 outcome, or glm with log transform for cost outcomes, and recover the p-values for each covariate in the model for each iteration, so I can count if it is <0.05 in 90% of the cases. I am not sure how to bootstrap with replacement and save the p-values for each covariate from each iteration in a file or matrix so I can examine them. Any suggestions on how to code this would be appreciated.

  • #2
    Stata's built-in -bootstrap- command takes bootstrap samples (i.e., with replacement, by definition), executes an estimation command of your choice, and is able to save whatever result expressions you request from each estimation command into a Stata data file. Useful documentation to examine would include -help bootstrap- and -help return list-. Below you'll find an example that illustrates some technique. In this example, some of the bootstrap iterations don't produce valid regression results, so your 90% would presumably be based on the number of valid regression results. (That might well not be a problem in your situation.)
    Code:
    // Define a program to feed to the -bootstrap- command
    cap prog drop myprog
    prog myprog, rclass
        logit foreign weight price headroom // nonsense, I know
        tempname T // using a temp matrix is cleaner
        mat `T' = r(table) // r(table) is left behind after the regression command
        // put those p-values into the return list
        return scalar weightp = `T'[4,1]  
        return scalar pricep = `T'[4,2]
        return scalar headroomp = `T'[4,3]
    end
    //
    clear
    sysuse auto
    // check out what ends up in r(table), for illustration
     logit foreign weight price headroom
    mat list r(table)
    //
    bootstrap wp = r(weightp) pp = r(pricep) hp = r(headroomp), ///
       reps(100) saving("YourOutPutFile.dta", replace): myprog
    clear
    use "YourOutPutFile.dta"
    desc
    count if wp < 0.05
    count if pp < 0.05
    count if hp < 0.05

    Comment


    • #3
      Thanks, Mike. Helpful. Will review the documentation more closely. Our problem is we need to collect the p values for all covariates in the regression, so a vector rather than a scalar, unless we loop through the results and create a scalar for each covariate in the regression. Any thoughts on which of these would be preferred?

      Comment


      • #4
        I don't think I understand what's different about your situation than my example, and why an extended version of it won't work well for you. In my example, the p-value for *each* covariate is returned, and what I did could in principle be extended to as many covariates as you like by adding as many -return scalar ..- statements as needed and expanding the expression list given to -bootstrap-.

        Perhaps the issue is that extending my example is a nuisance in your situation and you need something more automated? In that case, yes, you could make -myprog- loop through all the columns of r(table) and put p-values with varnames for each covariate into the return list:
        Code:
        // Define a more automated program to feed to the -bootstrap- command
        cap prog drop myprog
        prog myprog, rclass
            logit foreign weight price headroom // nonsense, I know
            tempname T // using a temp matrix is cleaner
            mat `T' = r(table) // r(table) is left behind after the regression command
            // put those p-values into the return list
            local nvar = colsof(`T') - 1
            local varnames: colnames `T'
            forval i = 1/`nvar' {
               local var: word `i' of `varnames'
               return scalar p_`var' = T[4,`i']
            }
        end
        You could then have some code to create a local macro with an appropriate expression list for -bootstrap- out of the items in the return list of -myprog-.

        It's also possible, as you are thinking, to have -myprog- return a vector with all the p-values. In place of the -forval- loop returning each p-value in -myprog-, you could do this:
        Code:
        mat `T' = `T'[4, 1..`nvar']
        return matrix P = `T'
        However, this might not be desirable. -bootstrap- doesn't accept vectors or matrices in its expression list, so you'd have to (I think) use the -bsample- command in a loop to take repeated bootstrap samples, create your own matrix that accumulated all the vectors across your bootstrapping repetitions, and do your own "housekeeping" instead of letting -bootstrap- do it. I can't think of an easy way to do that, but perhaps someone else can. For me, it would be easier to take the approach outlined above, but maybe that's not workable in your situation.

        Comment


        • #5
          I am a little shaken by the 87 covariates I'm trying to analyze, and was thinking pulling the whole vector of p values might be preferred. But a look through the covariates to get the scalar will work. We'll try that. Thanks for hanging in there on this.

          Comment


          • #6
            If you want to do this with a matrix of the probabilities of all the variables you can. Although, ultimately, to count the number of p < 0.05 hits for each variable you have to get it into a data set* and then loop over the variables. But this illustrates the approach, with changes to Mike Lacy's code shown in bold face:

            Code:
            clear*
            // Define a program to feed to the -bootstrap- command
            cap prog drop myprog // NOT rclass ANY MORE
            prog myprog
                logit foreign weight price headroom // nonsense, I know
                tempname T // using a temp matrix is cleaner
                mat `T' = r(table) // r(table) is left behind after the regression command
                // BUILD A MATRIX OF THE p-VALUES--MUST NOT USE A TEMPNAME FOR THIS ONE
                matrix P = nullmat(P) \ `T'[4, 1..3]
            end
            //
            
            sysuse auto
            // check out what ends up in r(table), for illustration
             logit foreign weight price headroom
            mat list r(table)
            
            bootstrap, noisily reps(100) seed(12345678) : myprog
            
            //  P-VALUES HAVE BEEN SAVED IN MATRIX P.  PUT THEM INTO MEMORY AS A DATA SET
            clear
            svmat P, names(col)
            
            //  COUNT p-VALUES < 0.05
            foreach v of varlist _all {
                quietly count if `v' < 0.05
                display "`v': `r(N)'"
            }
            Note: For reasons I cannot even begin to fathom, if the -noisily- option is omitted from the -bootstrap- command, the -bootstrap- breaks and carries out no valid regressions. (If somebody has seen a problem like this before, I'd really appreciate an explanation of what is going on here.) But it runs just fine with -noisily-. I suppose since you have to do 1,000 reps instead of 100 all that output is going to be annoying. So perhaps best to stay with the solution in #4. But that's your call.

            * I would guess there is some clever way to use Mata instead of Stata matrices that would enable you to count up the p < 0.05 results for each variable without first moving the results to a data set. But I don't know what that might be.

            Finally, I do appreciate O.P.'s acknowledgment that the wisdom of this entire strategy is questionable. I would not have been so polite about that observation.
            Last edited by Clyde Schechter; 05 Mar 2023, 14:32.

            Comment


            • #7
              Clyde Schechter offers a clever use of -bootstrap- I've never seen: Taking advantage of the conveniences it offers, but *not* using it to actually manage the data that's accumulated.

              Comment


              • #8
                As nobody responded to my query about why the absence of -noisily- caused -bootstrap- to break, I contacted Stata Technical Support about this. I received the following (excerpted) reply from Gustavo Sanchez, who credits Jeff Pitblado with the key insight:
                ...When the expression list is composed of (or assumed to be) -_b-,
                -bootstrap- and other replication method commands call

                . set coeftabresult off

                before entering the replication loop. This provides for faster
                bootstrapping since Stata does not go to the effort of composing and
                posting -r(table)- when an estimation command is run quietly.
                See -help set coeftabresults-.

                The -noisily- option forces the estimation command to report the
                coefficient table for each replication, so -r(table)- consequently is
                composed and posted. No error in your program, and -bootstrap- has
                enough information in the replicates to compute and report its results.

                Without the -noisily- option, your program attempts to pull the 4th row
                out of a matrix that did not get posted, exits with an error since `T'
                does not have a 4th row, -bootstrap- posts missing values for each
                replicate and fails to compute any results.

                We have -simulate- and replication methods like -bootstrap- call

                . set coeftabresult off

                when working with -_b- (and nothing else) to prevent Stata from doing
                extra work.

                We acknowledge that this is very technical/subtle and plan to add a
                technical note to the documentation that explains this.

                In the mean time, if you specify something other than -_b-, then
                -bootstrap- will leave -set coeftabresult- unchanged for the
                replications since you could specify any legal expression following an
                estimation command, including elements of -r(table)-. For example,

                . bootstrap (e(N)), reps(100) seed(12345678) : myprog
                And I have verified that, indeed, this workaround of including an expression list in the -bootstrap- command enables it to run without -noisily-.

                Comment

                Working...
                X