Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • analyzing multiple subsamples

    I have a big dataset, and I wish to create say 1000 samples with replacement (multiple bsample (s)), give each sample an id and then analyze each of the samples separately (like in a loop) and save the result for each of these iterations. While I have figured the analysis part, I am still struggling with creating these multiple subsamples with replacement. I will really appreciate any advice on this.

  • #2
    It sounds like you are trying to carry out some analysis and bootstrap it with 1000 reps. If that is correct, you don't have to do the sampling and looping yourself. the -bootstrap- command will do it for you. See the online help for the -bootstrap- command. It may reduce to something as simple as:

    Code:
    bootstrap, saving(myresults, replace)  reps(1000): your analysis command here
    If that isn't what you need, post back with more details.

    Comment


    • #3
      It sounds to me like bootstrap might be of assistance, see help bootstrap for more details.

      Added in edit: Clyde is clearly one step ahead of me this morning.

      Comment


      • #4
        Thank you Clyde and William.
        So I want to get the prevalence of HIV positive individuals by age (9 age groups) after adjusting for weight. I tried to use the svy bootstrap but I struggled with the exp_list. I could not figure out how to get the stored row proportions so I ended up doing the following:

        svyset [pweight=weight]
        svy: tab age HIV, row
        estpost svy: tab age HIV, row
        esttab, cell("b(f(9))") varlabels(`e(labels)') eqlabels(`e(eqlabels)')
        matrix A= e(row)
        matrix list A
        matrix x=A[1, 11..19]
        putexcel A1=matrix(x) using booknew, sheet("sheet1")
        end

        As you Clyde have mentioned I need to repeat this 1000 times using say samples of size 500 each

        Comment


        • #5
          So, what you need to do is wrap your command in a short program, because -bootstrap- will not work with an explicit mention of svy:.

          Code:
          capture program drop myprogram
          program define myprogram
               svy:tab age HIV
               exit
          end
          
          bootstrap, saving(myresults, replace) reps(1000) size(500): myprogram
          
          use myresults, clear
          The program myprogram stores the results of svy:tab in e(b), and bootstrap automatically picks that up without your having to specify anything. You may want to rename the variables that -bootstrap- gives to something more mnemonic. Note that in most applications of bootstrap, the sample size is the same as the size of the entire data set. If you want to override that to specifically use samples of 500, the above code does that. For the more conventional approach, just omit the -size(500)- option.

          Comment


          • #6
            Much much appreciated Clyde! But it seems that bootstrap even ignores svy within my program and calculates its own "percentages". So "svy: tab age HIV, row" and "svy: tab age HIV, col" and svy: tab age HIV" all return the same results which I guess here are column percent
            .

            Comment


            • #7
              Removed after #8 posted, mine was indeed a naive solution.
              Last edited by William Lisowski; 28 Dec 2015, 12:31.

              Comment


              • #8
                Sorry, I should have been clearer. -boostrap- picks up the entries in e(b), which are in all cases the cell proportions. The -row- and -col- options of svy: tab affect what is displayed on the screen, but not what is stored in e(b). So you need to calculate those row percentages yourself. This can be done in the myresults file, but it is probably easier to do it in the program and then pick up the returned values in the bootstrap command. I don't know how your HIV variable is coded. But let's assume it is 0 for no and 1 for yes. Then _b[p11] will be the cell proportion for age group 1, HIV neg, _b[p12] will be the cell proportion for age group 1 HIV positive. So you can do this:

                Code:
                capture program drop myprogram
                program define myprogram, rclass
                     svy:tab age HIV 
                     forvalues i = 1/9 {
                          return scalar prev`i' = _b[p`i'2]/(_b[p`i'2] + _b[p`i'1])
                     }
                     exit
                end
                
                bootstrap r(prev1) r(prev2) r(prev3) r(prev4) r(prev5) ///
                    r(prev6) r(prev7) r(prev8) r(prev9),  ///
                    saving(myresults, replace) reps(1000) size(500): myprogram
                
                use myresults, clear
                
                // AND IF YOU WOULD LIKE BETTER VARIABLE NAMES IN MY RESULTS
                rename _bs_* prev*


                Comment


                • #9

                  1. Your original svyset statement surprises me, because it omits any reference to strata or primary sampling units. Were there none? Please describe the population and the sampling design. What are the population and sample sizes? Were any post-stratification adjustments applied?

                  2. Clyde's program will give good results with non-survey, non-weighted, data. However with weighted survey data, it can yield biased estimates. The reason: the survey weights in each bootstrap replicate will not add to the population totals (and post-stratified weights would not add to control totals). Standard errors will also be biased for some designs (Kolenikov, 2010, p. 178).

                  The proper approach is to construct bootstrap replicates with specially constructed weights. To do this, use Stas Kolenikov's bsweights program (Kolenikov, 2010), which you can locate by typing.
                  Code:
                  findit bsweights
                  After creating the replicates, write a svyset statement appropriate for replicate weights; follow by svy: tabulate.

                  3. bsweights is not simple. If you are not very familiar with survey replication methods, I suggest that you don't bootstrap at all. Use a svyset statement that reflects the original design, then depend on Stata's default linearization method to estimate standard errors.

                  Knowing more about your design and study might lead me to modify these suggestions, so please give details.

                  Reference:

                  Kolenikov, Stanislav. 2010. Resampling variance estimation for complex survey data. Stata Journal 10, no. 2: 165-199.
                  http://www.stata-journal.com/article.html?article=st0187
                  Last edited by Steve Samuels; 28 Dec 2015, 15:41.
                  Steve Samuels
                  Statistical Consulting
                  [email protected]

                  Stata 14.2

                  Comment


                  • #10
                    Steve Samuels

                    Thanks for this post. I was not aware of this difficulty, but what you say makes perfect sense. I suppose this is (at least part of) the reason that -bootstrap- does not support use of the -svy:- prefix.

                    Comment


                    • #11
                      Steve Samuels

                      Many thanks for your very informative post.
                      I am using Demographic and Health Survey data: http://www.dhsprogram.com/publicatio...nd-manuals.cfm
                      These are nationally representative household-based cross-sectional surveys, and you are right the svyset command is actually as follows: svyset psu [pw=weight], strata(sample_domain).
                      I can actually avoid the whole bootstrapping thing if I can figure out how to repeat the following script 1000 times and saving the result in one excel sheet

                      bsample 500
                      svyset psu [pw=weight], strata(sample_domain)
                      svy: tab age HIV, row
                      estpost svy: tab age HIV, row
                      esttab, cell("b(f(9))") varlabels(`e(labels)') eqlabels(`e(eqlabels)')
                      matrix A= e(row)
                      matrix list A
                      matrix x=A[1, 11..19]
                      putexcel A1=matrix(x) using booknew, sheet("sheet1")
                      end

                      However, I will also look into the replicate weight as using the svyset would be easier to execute.

                      Hiam

                      Comment


                      • #12
                        Latest update:

                        I have been trying different things. I first tried to create replicates and I have used the following script:

                        bsweights wgtnew, n(3) reps(60) balanced dots replace
                        bs4rw, rweights(wgtnew*): tab age ob03a, row [pw=finalwgt]

                        The number of stratas is 6 and the number of PSUs is 12008
                        This did not take me far as I received a warning that the first-order balance was not achieved

                        I also tried the following:

                        forvalues i=1/1000 {
                        use test
                        bsample 1000
                        svyset psu [pw=ob05], strata(pv023)
                        svy: tab age ob03a, row
                        estpost svy: tab age ob03a, row
                        esttab, cell("b(f(9)) lb ub") varlabels(`e(labels)') eqlabels(`e(eqlabels)')
                        matrix A= e(row)
                        matrix list A
                        matrix x=A[1, 11..19]
                        putexcel A'i'=matrix(x) using booknew, sheet("sheet1")
                        }

                        Here STATA refused to reopen the dataset since "the data in memory will be lost"

                        Comment


                        • #13

                          Thanks for the additional information, but you didn't answer all my questions about your study. The number 12,008 looks more like the number of respondents in a DHS survey, rather than the number of PSUs, which are area units. See some good advice from Emma Slaymaker at: http://www.stata.com/statalist/archi.../msg00906.html and from Stas Kolewnikov at: http://www.stata.com/statalist/archi.../msg00877.html

                          As I said, bweights is not a simple command. Study again the description of the n() option.

                          The goal of your analysis is get good estimates of prevalence and good standard errors. The default linearization method will do both. Your bsample attempt, like Clyde's program, will do neither: it will not produce good prevalence estimates because the sampled data sets do not respect the weight structure of the original sample. The standard errors will be completely wrong (much too small), because you resample individuals, whereas the correct unit for resampling is the PSU.

                          Here is code that will give good estimates of prevalences and standard errors. Any resampling results that you generate will have to be close to these.

                          Code:
                          svyset psu [pw=ob05], strata(pv023)
                          svy: tab  age ob03a, row se
                          Last edited by Steve Samuels; 29 Dec 2015, 07:50.
                          Steve Samuels
                          Statistical Consulting
                          [email protected]

                          Stata 14.2

                          Comment


                          • #14
                            Many thanks for your reply. The number of psus is 1267 and the number of strata is 6
                            So now I have the following code:

                            svyset psu [pw=weight], strata(pv023)
                            bsweights wgtnew, n(-1) reps(100) dots replace
                            svyset [pw=weight], vce(bootstrap) bsrweight(wgtnew*)
                            svy:tab age ob03a, row

                            How do I get the row percentages generated from each individual iteration?

                            Comment


                            • #15
                              I can also use the following but I need to input an expression list after bs4rw

                              bs4rw, rweights(wgtnew*) saving(myresults, replace): tab age ob03a [iw=ob05], row

                              Comment

                              Working...
                              X