Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Speeding up Bootstrapping for Confidence Interval with High (>1000) repetitions

    Hello all,

    I am a first-time poster, so please bear with me.

    Context:
    I am conducting an analysis where I have 13 sites, and am conducting a risk factor analysis of tuberculosis (TB) on the outcome of COPD, adjusting for just 7 variables. Unfortunately, although the dataset that I've been given is quite large (12k participants), among the 13 sites, the amount of reported TB cases can be very low, in some cases <10. Therefore, to get the site-specific 95% CI, my advisor (who is an R man, hence this posting), suggested I bootstrap the mixed-effects regression I'm using, with the reps equal to the site N. That all sounded great, but when I test out my code even for 11 reps, let alone the nearly 2,000 I'll have to do for some sites, it takes at least 20 minutes. Part of what I think is slowing down the bootstrapping is that for some of the sites, there are repetitions that fail (a small red x displays instead of the dot charting rep progress), I think because there were not enough tb cases in the random sample.

    I would ideally like to get the OR and 95% CI from each of the covariates, but if push came to shove, I would settle simply for the TB variable.

    I am using Stata 14.2 I/C and the melogit package.

    Code:
    bootstrap if site==6,reps(1748) seed(1989): melogit copd tb age sex bmi pack educ fuel || site: ,or
    estat bootstrap, percentile
    Primary issue: What can I do to speed up the code?
    Secondary Issue: What is the best way to pull out the 5th and 95th values for the ORs that I need? I'm currently using estat bootstrap, percentile, but that also generates the following error, breaking up my dofile.

    equation [var(_cons[site_city] not found
    r(303);
    N.B. I shortened names of variables above to have it all on one line, site=site_city

    I think this is because the random intercept for the site variable is not being stored through estat, but I also haven't been able to find where it might be in the Stata file on melogit, if it is stored at all.

    Thank you for the time, I hope to hear from you soon!

    Best,
    Reuben
    Last edited by Reuben Mathew; 15 Jun 2018, 05:58.

  • #2
    Re the speed problem: My guess is that in your failed repetitions, -melogit- is going to a huge number of iterations before failing to converge to a result. -help maximize- shows that the default maximum number of iterations is 16,000. Using the -noisily- option on -bootstrap- will let you see if this is happening. If it is, you could set the maximum number of iterations to some reasonable but smaller number, say 50 or so, e.g.
    Code:
    bootstrap ........ : melogit ....., iterate(50)
    "Reasonable" here would mean some number of iterations such that any instance of -melogit- that is likely to eventually converge will converge by that number. My experience is that if a command doesn't converge in a reasonable number of iterations, it's never going to converge anyway.
    Your practice of using a number of repetitions equal to the sample size for a particular site is new to me. I've done quite a bit of bootstrapping, and have never heard of that as a standard choice for the number of bootstrap reps. I think the more standard choice is to pick some fixed number, say 1,000, and stick with that. Perhaps there is something I'm ignorant about here, though.

    Beyond this first and easy fix, it's possible to speed up boostrapping by parallelization (-ssc describe parallel-). Limiting the number of iterations could speed up your work by a factor of 10 or more, but parallelization might only double the speed, so the first suggestion is the most important.

    Comment


    • #3
      Hello Mike, thanks so much for the insight, my bootstrap is progressing at a much faster speed!

      As for the bootstrapping=site N aspect, I wasn't familiar with it too, but I think the theory behind it is that the resulting bootstrapped CI gives a more realistic estimate of what the CI and thus variance would be for that particular sample size, while the 2000 repetition number might be more useful for seeing a more full estimate range. The problem I was trying to avoid before was a normal assumption in the variance estimates but with that gone we can stick to more closely representing what the data gives us.

      Comment


      • #4
        Re number of reps, it sounds like you might be conflating number of reps and size of the sample used in each rep. Yes, using a bootstrap at the observed sample size is what you would want, but that (controlled by the size() option on bootstrap, not the reps) is by default set at _N.

        Comment


        • #5
          Mike Lacy, a follow-up question to this part of your previous post:

          Originally posted by Mike Lacy View Post
          Re the speed problem: My guess is that in your failed repetitions, -melogit- is going to a huge number of iterations before failing to converge to a result. -help maximize- shows that the default maximum number of iterations is 16,000. Using the -noisily- option on -bootstrap- will let you see if this is happening. If it is, you could set the maximum number of iterations to some reasonable but smaller number, say 50 or so, e.g.
          Code:
          bootstrap ........ : melogit ....., iterate(50)
          "Reasonable" here would mean some number of iterations such that any instance of -melogit- that is likely to eventually converge will converge by that number. My experience is that if a command doesn't converge in a reasonable number of iterations, it's never going to converge anyway.
          I am now also trying to reduce the speed of bootstrapping with melogit. For starters I'm running a model without controls, just including my dependent and independent variables.

          Dependent Variable:
          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input float childcare
          0
          0
          1
          1
          1
          0
          1
          1
          1
          1
          1
          0
          0
          0
          1
          0
          0
          0
          1
          1
          1
          0
          0
          1
          0
          1
          0
          1
          1
          0
          0
          1
          0
          1
          1
          1
          0
          0
          0
          1
          0
          0
          0
          1
          1
          1
          1
          1
          0
          1
          0
          1
          .
          1
          0
          1
          0
          0
          0
          .
          1
          0
          0
          1
          0
          1
          1
          1
          1
          1
          1
          0
          1
          0
          1
          0
          0
          1
          1
          1
          0
          1
          0
          0
          0
          1
          1
          1
          1
          0
          1
          0
          1
          1
          1
          1
          1
          1
          0
          1
          end
          label values childcare childcare
          label def childcare 0 "Not provided", modify
          label def childcare 1 "Provided", modify

          Independent Variable:
          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input float p_regime
          1
          1
          1
          1
          1
          1
          1
          1
          1
          1
          1
          1
          1
          1
          1
          1
          1
          1
          1
          1
          1
          1
          1
          1
          1
          1
          1
          1
          1
          1
          2
          2
          2
          2
          2
          2
          2
          2
          2
          2
          2
          2
          2
          2
          2
          2
          2
          2
          2
          2
          2
          2
          2
          2
          2
          2
          2
          2
          2
          2
          2
          2
          2
          2
          2
          2
          2
          2
          2
          2
          2
          2
          2
          2
          2
          2
          2
          2
          2
          2
          2
          2
          2
          2
          2
          2
          2
          2
          2
          2
          2
          2
          2
          2
          2
          2
          2
          2
          2
          2
          end
          label values p_regime p_regime
          label def p_regime 1 "Pre-K only", modify
          label def p_regime 2 "Both", modify
          My command is

          Code:
           bootstrap, reps(100) seed(1): melogit childcare i.p_regime || p_regime:
          ... and it's taking 20 minutes to run.

          Now my question is, how do I determine whether / how many iterations are being run? Not sure what you meant by using the noisy option on bootstrap. I'm working with Stata 15, if that matters.

          Comment


          • #6
            The number of iterations used by an iterative are reported in a format like this:

            Code:
            melogit childcare i.p_regime || p_regime:
            Fitting fixed-effects model:
            
            Iteration 0:   log likelihood = -12329.5485  
            Iteration 1:   log likelihood = -11998.5268  
            ...
            When you run a command with the -bootstrap- prefix, it hides the iteration log and results that would otherwise be displayed at each repetition, so that would be why you aren't seeing it. -help bootstrap- shows that one of the options for bootstrap is -noisily-, which will "unhide" that material, e.g.:
            Code:
            bootstrap, reps(100) seed(1) noisily: melogit childcare i.p_regime || p_regime:

            Comment


            • #7
              Thank you for this! So yes, in several repetitions melogit is indeed running through 16000 iterations before not achieving conversions. In your first post, you showed how to limit the iterations and suggested the number 50. Where does that number come from?

              I ask because I went through all the repetitions, and the highest number of iterations when they converged successfully was 64. So I limited the iterations to 65. The results I got were substantially different to when I had not limited the iterations. Obviously this is more of a statistics question rather than a coding question, but I would like to know the reasoning behind your suggested 50 iterations.

              Comment


              • #8
                I got the number 50 out of the "clear blue sky," as the idiom goes <grin>. The way you chose 65 is the kind of thing I would have done, although I might have chosen 100 to be safe. However, when you say "the results are substantially different," I'm not sure what you mean, and would need more information about exactly what you did and exactly which results where "substantially different."

                If you are *keeping* results from repetitions that don't converge in 65 iterations, yes, that would cause funny bootstrap results. I don't recall right now if -bootstrap- is going to by default keep results where your -melogit- does not converge in 50 iterations. I'm thinking it does. So, in situations like this, I have put my estimation command inside a small wrapper program, and detected whether the estimation command converged by checking the e(converged) flag in the ereturn list. In my situation, though, I was only keeping a few things (e.g., a few slopes) from my estimation routines, so it was easy to build a small program that only kept the stuff I wanted and return that stuff as missing otherwise. In sketch form, I *think* I have done something that could be sketched like this:

                Code:
                capture prog MyProg
                prog MyProg, rclass
                SomeEstimationCommand, iterate(50)
                if (e(converged) == 0 {
                  matrix B = .
                }
                else {  // converged OK
                   matrix B = e(b)
                }
                return scalar b1 = B[1,1]
                return scalar b2 = B[1,2]
                ...whatever other results you want to return
                end
                bootstrap, b1 = r(b1) b2 = r(b2) .... : MyProg
                I don't have any good way to test this, and it may be a clumsy way to do something more easily accomplished otherwise, so I hope someone else will jump in to comment or suggest something better for you.

                Comment

                Working...
                X