  • Bootstrapping with frequency weights

    I'm trying to bootstrap from data that represents a frequency table. To simplify, say x is a variable and n is its frequency, and say I have data called original, which summarizes the distribution as follows:

    x n
    1 40
    2 30
    3 30

    So the original data summarizes 100 cases with x=(1,2,3) in proportions 40:30:30.

    What I'd like to do is generate another dataset representing the distribution of 100 cases drawn at random, with replacement, from the distribution described by the original data. Or actually I'd like to do that 200 times and stack the results. I'm open to different ways of representing the results, but they might look something like this:

    sample x n
    1 1 42
    1 2 32
    1 3 26
    2 1 35
    2 2 30
    2 3 35
    ....
    200 1 34
    200 2 27
    200 3 39

    bsample 3, weight(n) doesn't do this, and neither does bsample 100, weight(n).

    Many thanks for any suggestions.



  • #2
    Paul:
    some years ago, I started a thread on a similar topic http://statalist.1588530.n2.nabble.c...td3743854.html
    I do hope that the related replies can be helpful.
    Kind regards,
    Carlo
    (Stata 18.0 SE)


    • #3
      "expand n" then bootstrap? Each bootstrap sample will be a little off from the original frequencies, but on average the samples will reproduce them.
      Last edited by ben earnhart; 24 Jan 2015, 09:16.
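
      A minimal sketch of the "expand n, then resample" idea, using the toy table from the first post. The variable name sample, the seed, and the tempfile-stacking loop are illustrative choices, not part of the original suggestion, and the approach assumes n is small enough for the expanded data to fit in memory:

```stata
clear
set obs 3
generate x = _n
generate n = cond(x == 1, 40, 30)   // toy frequencies 40:30:30
expand n                            // one row per case (100 rows here)
drop n
tempfile cases
save `cases'

clear
tempfile stacked
save `stacked', emptyok             // empty file to accumulate results

set seed 12345
forvalues s = 1/200 {
    use `cases', clear
    bsample                         // draw _N rows with replacement
    contract x, freq(n)             // collapse back to a frequency table
    generate sample = `s'
    append using `stacked'
    save `stacked', replace
}
sort sample x
order sample x n
list in 1/9, noobs sepby(sample)
```

      A category that happens to receive no draws in a given replication is simply absent from that replication's rows.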


      • #4
        n can be pretty big (outside my toy example). I'm not sure "expand n" is the way to go.


        • #5
          Paul: does the following code achieve the sort of thing that you're after? Writing the code brought home to me that one has to make assumptions about how and where the randomness comes in. I've written something that works directly on your frequency table, even though the random process generating the distribution across categories is presumably occurring at some individual unit level (which you've then summarised).

          Code:
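          clear all
          set obs 600                      // 3 categories * 200 replications
          generate id = _n
          egen cat = seq(), from(1) to(3)  // category: 1, 2, 3, 1, 2, 3, ...
          egen rep = seq(), block(3)       // replication: 1, 1, 1, 2, 2, 2, ...
          list in 1/21, noobs sepby(rep)
          set seed 12345
          generate catprop = .
          // perturb the first two counts with normal noise (SD = 5 cases)
          replace catprop = round(40 + 100*rnormal(0, .05)) if cat == 1
          replace catprop = round(30 + 100*rnormal(0, .05)) if cat == 2
          // the third count is whatever remains of the 100 cases
          bysort rep (cat): replace catprop = 100 - catprop[_n-1] - catprop[_n-2] if cat == 3
          sort id
          list in 1/30, noobs sepby(rep)
          tabulate cat, summarize(catprop)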


          • #6
            I don't think you need to make any assumptions. I just want to repeatedly simulate 100 draws from a multinomial distribution with values x=(1,2,3) in proportions 40:30:30, summarized in the stated form.


            • #7
              Paul: I don't think you quite took my remarks in the constructive spirit in which they were intended, though I concede the remarks may not have been entirely clear. I'll try again. You refer to a summary table of frequencies and associated proportions. Underlying the table is presumably a sample of units (let's call them persons), each of which has an associated categorical outcome value (1, 2, or 3). With access to the underlying (unit-record) data on the persons, I think one can sample from a multinomial distribution using methods such as those set out at, e.g., http://en.wikipedia.org/wiki/Multino...l_distribution . My point was that I don't know how one goes about this when one only has the summary table. That led to my code -- which did at least produce something looking like what you said you wanted. Are there literature references on related (re)sampling problems that you might point us to?


              • #8
                Thanks! I do recognize the constructive spirit of your comments -- so sorry if that didn't come across.

                I just don't think you need unit-level data to estimate and sample from a multinomial distribution. The frequencies alone are adequate to do that.

                Since originally posting, I've come across Stata's rmultinom() function, and Buis' suggestions for using the uniform() function to simulate multinomial data. I can work with those functions, but I was sort of hoping for something as simple and elegant as the bootstrap, bsample, sample, or gsample command.

                Best,
                Paul


                • #9
                  Stephen, I just realized that you will be familiar with the actual problem that motivated my question. Consider the situation where (x1,x2) is a range of incomes and n is the number of households with incomes in that range. Then it is common to see an income distribution summarized like this:

                  x1 x2 n
                  0 9999 4000
                  10000 19999 6600
                  20000 29999 11340
                  etc.

                  Now there are a variety of methods for estimating summary statistics such as the Gini coefficient -- including your favored approach of fitting the generalized beta distribution or something similar.

                  What I'm trying to do is estimate a confidence interval for the Gini by using the bootstrap. Make sense?


                  • #10
                    P.S. And I'd like to generate the bootstrap samples using only the multinomial frequencies from the original data. I don't want to assume anything further about the underlying distribution, which may or may not be generalized beta etc.


                    • #11
                      I've come across Stata's rmultinom() function
                      Would you give a more precise reference please? I can't find this (which may be my obtuseness). I know of binomial functions in Stata and Mata, but not that multinomial one. [If I had found one, I wouldn't have written the code in post #5!] Maarten's recommendations to use uniform() are in effect implementing the sort of algorithm in the Wikipedia article I cited, I think. I've used that approach in other work (data generation for Monte Carlo analysis; working with unit record data, however, not a table of frequencies.)

                      But thanks for setting out more clearly what you actually want to do. Got it! However, I am now led to ask whether your cart is a bit before your horse. You have grouped data (published US Census data?), so there are serious issues to consider about the estimation of inequality indices per se, let alone their sampling variances. Which estimator is "best" to use also depends on how much "information" you have, including e.g. the mean within each band, and what you know about the top interval (typically open). Yes, parametric estimators are one way to go. (BTW they are not my favoured approach necessarily; it depends what one is trying to do.) Indeed, I like non-parametrically estimated indices, which are what you want. One of my favourite articles on this is:
                      Cowell, F.A. and Mehta, F. 1982. The estimation and interpolation of inequality measures. Review of Economic Studies 49 (2): 273-290. This also reviews previous literature (about placing bounds, and getting point estimates of inequality indices like the Gini). In short, won't your resampling design also depend on which estimator you use? Whatever, Cowell and Mehta also refer to SEs and CIs in their empirical section (see also footnote 15 re methods). Given your samples are likely large, won't a linearization formula for the SE work as well as bootstrapping?


                      • #12
                        Sorry, I misspoke. rmultinom() is a function in R. The closest thing in the Stata environment is rdiscrete(), but that's actually part of Mata. So I'm starting to think that the type of resampling I want to do will be a bit of work -- not excessive, but not as easy as just invoking the bootstrap command.
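
                        A rough sketch of how Mata's rdiscrete() could be used for this, working from the frequency table alone. The names R, draws, counts, and res are illustrative; the sketch assumes the proportions computed as n/N are accepted by rdiscrete() as summing to 1, and that an N x R matrix of draws fits in memory (for large N one would draw one replication at a time):

```stata
clear
set obs 3
generate x = _n
generate n = cond(x == 1, 40, 30)      // toy frequencies 40:30:30

mata:
rseed(12345)
x = st_data(., "x")
n = st_data(., "n")
N = sum(n)                              // cases per replication
p = n / N                               // multinomial probabilities
k = rows(p)
R = 200                                 // number of replications
draws = rdiscrete(N, R, p)              // N x R matrix of category indices 1..k
counts = J(k, R, .)
for (j = 1; j <= k; j++) {
    counts[j, .] = colsum(draws :== j)  // count of category j in each replication
}
res = J(R * k, 3, .)                    // long layout: sample, x, n
for (s = 1; s <= R; s++) {
    i = (s - 1) * k
    res[|i + 1, 1 \ i + k, 3|] = (J(k, 1, s), x, counts[., s])
}
stata("clear")                          // drop the original table from memory
st_addobs(R * k)
st_store(., st_addvar("double", ("sample", "x", "n")), res)
end

list in 1/9, noobs sepby(sample)
```

                        Each column of draws is one replication of N independent draws, so each column of counts is one multinomial sample.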

                        Regarding the estimation of inequality: as you might imagine there is a larger project beyond my question about the bootstrap. I can share that with Stephen separately, outside of Statalist.


                        • #13
                          Interesting problem, and one I'm not familiar with. I do have a couple of thoughts, perhaps way off the mark. If you have census data, then, presumably, there is no sampling error, though there might be measurement error. But the re-sampling approach would estimate non-existent sampling error, unless you want to take a super-population modeling approach. If, on the other hand, you have sample survey data, then why not base confidence intervals on the sample design?
                          Steve Samuels
                          Statistical Consulting

                          Stata 14.2


                          • #14
                            Unfortunately income distributions are estimated from samples, not populations. I'm not sure what I can do about the sampling design without unit-level data or a published design effect.


                            • #15
                              Originally posted by Stephen Jenkins:
                              Maarten's recommendations to use uniform() are in effect implementing the sort of algorithm in the Wikipedia article I cited, I think.
                              That is correct.

                              For those following this thread: the complete reference to that Stata tip is: M.L. Buis (2007), "Stata tip 48: Discrete uses for uniform()", The Stata Journal, 7(3), pp. 434-435. It can be freely downloaded here: http://www.stata-journal.com/article...article=pr0032

                              Note that I wrote that tip when runiform() was still called uniform(). The code in that article still works (StataCorp does a great job in ensuring that old code continues to work on newer versions of Stata), but if I were to write such code now I would replace all occurrences of uniform() with runiform().
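
                              For the toy example earlier in the thread (proportions 40:30:30, 100 cases, 200 samples), the runiform() idea might be sketched as follows. This is only an illustration in the spirit of the tip, not the code from the article; the cutoffs .40 and .70 are the cumulative proportions, and the variable names are made up for the example:

```stata
clear
set seed 12345
set obs 20000                              // 200 samples x 100 cases
generate sample = ceil(_n / 100)           // sample identifier 1..200
generate u = runiform()                    // one uniform draw per case
generate x = cond(u < .40, 1, cond(u < .70, 2, 3))
contract sample x, freq(n) zero            // frequency table per sample
list in 1/9, noobs sepby(sample)
```

                              This still creates one observation per simulated case, so it shares the memory concern raised in post #4 when n is large.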
                              ---------------------------------
                              Maarten L. Buis
                              University of Konstanz
                              Department of history and sociology
                              box 40
                              78457 Konstanz
                              Germany
                              http://www.maartenbuis.nl
                              ---------------------------------
