  • Survey Data: weighted frequencies with subpop vs. "keep if"

    I'm working with survey data. The survey design includes pweights and strata, as well as poststratification variables (both poststrata and postweights). I'm finding that if I report frequency tables using the full design in svyset, the weighted percentages differ depending on whether I use the subpop() option or just drop observations with 'keep if'. If I set only pweights and strata, however, I get the same percentages with 'keep if' and subpop().

    My understanding had been that subpop() should be used for correct standard errors, but that it shouldn't affect the point estimates. That seems to be the case when I use the incomplete survey design, but not when the full design is set. Can anybody explain what might be happening here?

    Many thanks.

  • #2
    It would help if you showed your code and output or gave a reproducible example. My understanding about estimates and standard errors is the same as yours.
    -------------------------------------------
    Richard Williams, Notre Dame Dept of Sociology
    Stata Version: 17.0 MP (2 processor)

    EMAIL: [email protected]
    WWW: https://www3.nd.edu/~rwilliam


    • #3
      Does the difference occur if you use only the original pweights and not the post-stratification weights? To find out, you'll have to run
      Code:
      svyset, clear
      and then do the new svyset.
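As a rough sketch of that check, on made-up toy data (the names stratum, grp, keepme, and pwt are hypothetical and stand in for your own variables), the comparison might look like this:

```stata
* Toy data for illustration only; none of these names come from the original post
clear
input stratum grp keepme
1 1 1
1 2 0
1 1 1
1 2 1
2 1 0
2 2 1
2 1 1
2 2 1
end
gen pwt = 10

* Drop any earlier design settings, then declare pweights and strata only
svyset, clear
svyset [pweight = pwt], strata(stratum)

* Same subgroup two ways: subpop() versus an if expression
svy, subpop(keepme): tabulate grp
svy: tabulate grp if keepme == 1
```

If the two tables show the same percentages here but not under the full design, that points to the post-stratification variables as the source of the difference.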

      You should know that I prefer to engage only with posters who follow Statalist etiquette and use their real full names as user-names. See: http://www.statalist.org/forums/register. You can request a change of your user-name by clicking the "Contact" button at the bottom right of the page. I hope that you will do so, but if not, perhaps others will continue the discussion.

      Steve

      Steve Samuels
      Consulting Statistician
      18 Cantine's Island
      Saugerties NY 12477 USA
      [email protected]
      Phone: 845-246-0774
      Steve Samuels
      Statistical Consulting
      [email protected]

      Stata 14.2


      • #4
        Richard, it turns out that for post-stratified data, any attempt to exclude a category with an if expression will fail. Stata will just reweight the remaining data to represent the entire post-stratified population. I thank Joy Wang of StataCorp for pointing this out.

        In the example below, postweight() specifies the population total for each gender to be 60. So, even if part of the sample is excluded with an if expression, the estimated total for each gender will still be 60. Only the version of svy tab with the subpop() option gets it right. Moral: never use an if expression for a post-stratified survey analysis.

        I should just note that combining stratification, especially post-stratification, with subpopulation analysis risks bias if the subpopulation is not one of the post-strata. The reason is that weights intended for analysis of the whole population may not apply to the subpopulation. The example below also shows this. See Levy and Lemeshow, 2008, Sampling of Populations, Wiley, Section 6.4, p. 151 for another illustration.



        Code:
        clear
        input gender cat
        1 1
        1 1
        1 2
        1 2
        2 1
        2 1
        2 2
        2 2
        2 2
        2 2
        2 2
        2 2
        end
        
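        * Every observation gets the same sampling weight
        gen pwt = 12
        * Population total for each gender poststratum
        gen postwt = 60

        svyset [pweight = pwt], poststrata(gender) postweight(postwt)
        tab gender cat, col
        * Full sample
        svy: tab gender, count
        * Subpopulation via subpop(): rows with cat != 1 stay in the design
        svy, subpop(if cat==1): tab gender, count
        * Subsetting via if: the remaining rows are reweighted to the full totals
        svy: tab gender if cat==1, count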
        with abbreviated results

        Code:
        . svy: tab gender, count
        
        ----------------------
           gender |      count
        ----------+-----------
                1 |         60
                2 |         60
                  |
            Total |        120
        ----------------------
         
        . svy, subpop(if cat==1): tab gender, count
        ----------------------
           gender |      count
        ----------+-----------
                1 |         30
                2 |         15
                  |
            Total |         45
        ----------------------
        
        . svy: tab gender if cat==1,count
        ----------------------
           gender |      count
        ----------+-----------
                1 |         60
                2 |         60
                  |
            Total |        120
        ----------------------
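The counts above can be reproduced by hand. What follows is only a sketch, assuming the adjustment rescales the weights within each poststratum so that they sum to the postweight() total; the variable names sumw, adjwt, sumw_if, and adjwt_if are hypothetical and not part of the original example.

```stata
* Rebuild the 12-row example: rows 1-4 are gender 1, rows 5-12 are gender 2
clear
set obs 12
gen gender = cond(_n <= 4, 1, 2)
gen cat = cond(inlist(_n, 1, 2, 5, 6), 1, 2)
gen pwt = 12
gen postwt = 60

* Full-sample adjustment: weights in each poststratum are scaled to sum to postwt
bysort gender: egen double sumw = total(pwt)
gen double adjwt = pwt * postwt / sumw

* subpop() analogue: sum the full-sample adjusted weights over cat == 1
tabstat adjwt if cat == 1, by(gender) statistic(sum)

* "if" analogue: the adjustment is recomputed on the retained rows only
keep if cat == 1
bysort gender: egen double sumw_if = total(pwt)
gen double adjwt_if = pwt * postwt / sumw_if
tabstat adjwt_if, by(gender) statistic(sum)
```

Under that assumption, the full-sample adjusted weights are 15 per observation for gender 1 (60/48 x 12) and 7.5 for gender 2 (60/96 x 12), so the two cat==1 rows in each gender sum to 30 and 15. After dropping rows, each gender has only two observations left, each is scaled up to 30, and the totals return to 60 and 60.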
        Last edited by Steve Samuels; 23 Jul 2014, 16:21.
        Steve Samuels
        Statistical Consulting
        [email protected]

        Stata 14.2


        • #5
          Thanks Steve. Austin Nichols went on at some length once about why you should always use subpop rather than if. I didn't really understand him, but he is a lot smarter than I am about these things so I figured I should mindlessly trust him.

          Just to be clear -- like the original poster, I thought that if you used if instead of subpop, the significance tests could get screwed up but the point estimates were still ok. But if I follow you, you are saying that even the point estimates can be wrong if the data are post-stratified.

          The thing I find sort of annoying about this -- suppose I had a million cases but I only wanted to analyze a subpopulation of 1,000. I still have to keep all those other 999,000 cases around. I've never actually had this happen to me, but if it ever does I hope it doesn't slow things down to a crawl.
          -------------------------------------------
          Richard Williams, Notre Dame Dept of Sociology
          Stata Version: 17.0 MP (2 processor)

          EMAIL: [email protected]
          WWW: https://www3.nd.edu/~rwilliam


          • #6
            Richard, you do follow: with post-stratified data, an if expression will correctly exclude a subgroup only if that subgroup is one of the post-strata. Even then, the standard errors will still be wrong.
            Steve Samuels
            Statistical Consulting
            [email protected]

            Stata 14.2


            • #7
              Richard,

              I agree with your last paragraph. We have that situation quite frequently: analyzing a data set that has millions of records when we are only really interested in, say, 100,000 or fewer. I have found that the difference in standard errors between using subpop and keep if is very small (e.g., 4th significant digit or so). For what we are doing, this isn't such a big problem and it sure beats having to analyze the entire data set all the time. However, I'm sure there are those who can't or won't tolerate even that level of error.

              Regards,
              Joe


              • #8
                My experience too is that the differences in standard errors are usually quite small, but I don't know if that is always the case. Steve's point about how the point estimates could be wrong has further put the fear of God in me. Here is a post from Austin that shows how you could create subsamples if you are careful. I'd be afraid I would screw it up, but if you have frequent experience with such situations maybe it is worth playing around with:

                http://www.stata.com/statalist/archi.../msg00810.html
                -------------------------------------------
                Richard Williams, Notre Dame Dept of Sociology
                Stata Version: 17.0 MP (2 processor)

                EMAIL: [email protected]
                WWW: https://www3.nd.edu/~rwilliam


                • #9
                  Hi Steve,

                  I'm a bit late to this post, but hopefully it might catch your eye. I wanted to ask you a follow-up question regarding your statement:

                  "I should just note that combining stratification, especially post-stratification, with subpopulation analysis risks bias if the subpopulation is not one of the post-strata. The reason is that weights intended for analysis of the whole population may not apply to the subpopulation"

                  My basic question is: If my subpopulation is not one of the post-strata used to weight my data, what is a good strategy to avoid bias if I need to use the subpop command?

                  Details:
                  I am using pooled 2011-2014 CPS ASEC (March Supplemental) to look, for example, at the % of working parents with private sector jobs. I have flagged everyone in the CPS who is a working parent (F_sample), and I am using the --subpop-- command with this variable F_sample, i.e.: subpop(F_sample). However, your comment alerted me to potential complication with the subpop command if there is post-stratification.

                  CPS documentation explains that in the development of the final weight, a second stage ratio adjustment factor was computed using coverage ratios based on poststrata defined on characteristics including large age group, sex, race, Hispanic origin, housing tenure, large city, urban, or rural residence, and geographic region of residence.

                  As you'll notice, neither employment nor parental status is one of these poststrata, so I am now concerned that using the subpop command could create bias. Do you have any recommendations on what to do about this, or could you recommend any readings that would help me figure out what to do?

                  Thank you,
                  Kim


                  • #10
                    This issue arises from time to time on the list. A thorough discussion of the problem can be found in

                    West, B.T., Berglund, P. and Heeringa, S.G. 2008. A closer examination of subpopulation analysis of complex-sample survey data. The Stata Journal 8, Number 4, pp. 520–531,

                    available for free on the Stata Journal archive.

                    From a didactic point of view, the same issues are covered at perhaps a more accessible level in section 4.5 of

                    Heeringa, Steven, Brady T. West, and Patricia A. Berglund. 2010. Applied survey data analysis. Boca Raton: Taylor & Francis.

                    I highly recommend this book to anyone who routinely analyzes complex survey data.

                    The issue was originally dealt with at length by Leslie Kish in his 1965 and 1987 books, both of which are cited in West et al. It is not uncommon to find that when one does it "right", results don't differ much, if at all, from when one does it "wrong." But there are instances, covered in detail in the material cited above, where it does matter, and there are cases where it is perfectly ok to subset the data. I leave a fuller discussion to the authors cited above.

                    Richard T. Campbell
                    Emeritus Professor of Biostatistics and Sociology
                    University of Illinois at Chicago


                    • #11
                      This is better asked as a new question, with a link to this discussion.
                      Steve Samuels
                      Statistical Consulting
                      [email protected]

                      Stata 14.2
