  • svyset poststratification vs weight generated via survwgt post

    Dear all,
    I am puzzled about the use of post-stratification weights with svyset. I computed a design weight, which is the inverse of each individual's inclusion probability in the sample, standardized to a mean of 1. Then I computed a final weight to match a known population distribution with

    Code:
    survwgt post designweight, by(poststratum) totvar(poststratumpopulation) gen(final_weight)
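
    For reference, the design weight above was built roughly like this (a minimal sketch; incl_prob is a stand-in name for the inclusion-probability variable in my data):

    Code:
    * design weight: inverse of the inclusion probability, rescaled to a mean of 1
    gen double designweight = 1/incl_prob
    summarize designweight, meanonly
    replace designweight = designweight/r(mean)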

    I compared two options for the analysis, which give different results for the standard errors:

    Code:
    svyset psu [pweight=designweight], poststrata(poststratum) postweight(poststratumpopulation) vce(linearized) singleunit(missing)
    
    Number of strata   =  1        Number of obs     =    2564
    Number of PSUs     = 17        Population size   =   24250
    N. of poststrata   = 14        Design df         =      16
    
    --------------------------------------------------------------
                 |             Linearized
                 |       Mean   Std. Err.     [95% Conf. Interval]
    -------------+------------------------------------------------
             age |   20.14886   .1186727      19.89728    20.40043
    --------------------------------------------------------------

    Code:
    svyset psu [pweight=final_weight], vce(linearized) singleunit(missing)
    
    Number of strata   =  1        Number of obs     =    2564
    Number of PSUs     = 17        Population size   = 24045.5
    Design df          = 16
    
    --------------------------------------------------------------
                 |             Linearized
                 |       Mean   Std. Err.     [95% Conf. Interval]
    -------------+------------------------------------------------
             age |   20.14893   .2864927       19.5416    20.75627
    --------------------------------------------------------------


    I would love to have the smaller standard errors, but which option is valid? Obviously there are some missing values in the “age” variable here (n = 20 / 0.8%).

    Many thanks for your attention!
    Christian Meyer


  • #2
    Maybe my English was incomprehensible, so I would like to reformulate my question:

    Common practice is to analyze sample surveys with final weights, which often combine a base weight adjusting for unequal selection probabilities with a post-stratification adjustment matching the distribution of a characteristic that is known for the population. The SE estimates differ when I use Stata's svy commands with the base weight and specify the post-stratification in svyset, compared with using only the final weight. From the Stata documentation and the Levy and Lemeshow (2008) book I learned that post-stratification may reduce the SE. Is there any explanation for why this is different when the post-stratification is folded into the final weight?

    If the question is still incomprehensible or silly, feedback is also very welcome!

    Thank you

    Christian Meyer

    Last edited by Christian Meyer; 17 May 2016, 01:39.



    • #3
      Thanks for posting this, Christian, and thanks for the clarification. Your English is fine, but your first post was difficult to read because you did not paste commands and results between CODE delimiters, as requested in the Advice on Posting.

      I can reproduce the discrepancy you observe. I believe it to be a bug in svyset, and I've reported it to technical support. I know of four Stata commands that can do post-stratification:
      1. ipfweight by Michael Bergmann (SSC)
      2. ipfraking by Stas Kolenikov (findit)
      3. survwgt post by Nick Winter (SSC); his survwgt rake gives identical results
      4. Stata's own svyset with its poststrata() and postweight() options
      With the first three, one supplies the generated weight to svyset.

      Below is code that demonstrates the problem. The first three commands produce standard errors from svy mean that are similar. Stata's standard error is about two-thirds the size of the others.

      Code:
      sysuse auto, clear
      gen str2 mkr = substr(make,1,2)  // new psu variable
      
      /* ipfweight */
      ipfweight foreign, gen(finalwt0) ///
          startwgt(turn) ///
          val(200 100) ///
          maxiter(100) tol(1)
      svyset mkr [pw=finalwt0]
      svy: tab foreign, count col
      svy: mean mpg
      estimates store ipfweight
      
      /* ipfraking */
      gen _c = 1
      matrix total_foreign = (200,100)
      matrix colnames total_foreign = _c:0 _c:1
      matrix rownames total_foreign = foreign
      ipfraking [pw=turn], ctotal(total_foreign) gen(finalwt1) tol(1)
      svyset mkr [pw=finalwt1]
      svy: tab foreign, count
      svy: mean mpg
      estimates store ipfraking
      
      /* survwgt post */
      gen tfor = cond(foreign,200,100)
      survwgt post turn, by(foreign) totvar(tfor) gen(finalwt2)
      svyset mkr [pw=finalwt2]
      svy: tab foreign, count
      svy: mean mpg
      estimates store survwgt
      
      /* svyset */
      svyset mkr [pw=turn], poststrata(foreign) ///
          postweight(tfor)
      svy: tab foreign, count
      svy: mean mpg
      estimates store svyset
      
      estimates table ///
          ipfweight ipfraking survwgt svyset, se
      Results of svy mean:
      Code:
      ------------------------------------------------------------------
          Variable | ipfweight    ipfraking     survwgt       svyset    
      -------------+----------------------------------------------------
               mpg |  21.189439    21.189439    22.897672    20.697549  
                   |  .93894393    .93894391    1.1014303    .63185368  
      ------------------------------------------------------------------
                                                            legend: b/se
      Last edited by Steve Samuels; 17 May 2016, 11:57.
      Steve Samuels
      Statistical Consulting
      [email protected]

      Stata 14.2



      • #4
        Thank you very much for this brilliant illustration!

        Just one speculation about a possible reason: strictly speaking, probability weights should be the inverse of the sampling fraction, and that is the base weight or design weight. A post-stratification weight may compensate for non-response (or coverage), which may or may not be perceived as a random (selection) mechanism. The Levy and Lemeshow book (p. 168) explains that post-stratification may lead to smaller SE and confidence-interval estimates. So maybe using the final weight as a pweight is the problem, leading to the more conservative estimates.

        It might also be plausible because the variance of the base weight is usually smaller than that of the final weight.



        • #5
          The reduced standard error is not a bug. After svyset with the post-stratification options, Stata not only computes new weights (as the other commands do) but also modifies the variance calculations by considering each observation's difference from its post-stratum mean. It is these modifications that lead to the reduced standard error you observed. See pp. 1486-1487 in the manual entry for mean and p. 191 of the SVY manual.
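
          To sketch the idea (my own rendering of the usual post-stratified linearization, so treat the notation as an approximation rather than a quotation from the manual): with post-strata $k = 1, \dots, L_P$, known population counts $M_k$, and estimated counts $\widehat{M}_k = \sum_{j \in P_k} w_j$, the post-stratified weight and total are

          $$ w_j^P = \frac{M_{k(j)}}{\widehat{M}_{k(j)}}\, w_j, \qquad \widehat{Y}^P = \sum_j w_j^P y_j , $$

          and in the linearized variance each observation contributes not $y_j$ itself but (approximately) its deviation from its post-stratum mean,

          $$ e_j \approx \frac{M_{k(j)}}{\widehat{M}_{k(j)}} \left( y_j - \frac{\widehat{Y}_{k(j)}}{\widehat{M}_{k(j)}} \right), $$

          so between-post-stratum variation is removed from the variance. Approaches that only pass the final weight to svyset use $w_j^P$ for the point estimate but leave $y_j$ unchanged in the variance formula, which is why their standard errors can be larger.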
          Last edited by Steve Samuels; 18 May 2016, 05:13.
          Steve Samuels
          Statistical Consulting
          [email protected]

          Stata 14.2



          • #6
            What is the command to perform the cointegration test of Pesaran et al. (2001)?

            What is the command to perform the causality test of Toda and Yamamoto?



            • #7
              please



              • #8
                nahousse



                • #9
                  Dear Nahousse, welcome to Statalist. Unfortunately, you have interrupted a discussion about survey analysis. Please do not post here again or at the end of any other unrelated discussion. Start a new topic on the General Forum page (there's a + Topic button), but you are unlikely to get help unless you first read and then follow the instructions at http://www.statalist.org/forums/help#adviceextras and at http://www.statalist.org/forums/help, especially #12.
                  Last edited by Steve Samuels; 18 May 2016, 05:31.
                  Steve Samuels
                  Statistical Consulting
                  [email protected]

                  Stata 14.2



                  • #10
                    Correction: The results table I showed above did not come from the code above, but from another do-file whose log file I mis-numbered. That do-file used rep78 as the PSU and also tried a different set of control totals. I apologize for the confusion.

                    The code above yields the following estimated standard errors:
                    Code:
                    ------------------------------------------------------------------
                        Variable | ipfweight    ipfraking     survwgt       svyset    
                    -------------+----------------------------------------------------
                             mpg |  21.189439    21.189439    22.897672    22.897672  
                                 |  .93894393    .93894391    1.1014303    .89704639  
                    ------------------------------------------------------------------
                                                                          legend: b/se
                    Thus this version does not reproduce Christian's observation that svyset with the post* options yields much smaller standard errors. For me the moral is: don't try to draw conclusions from small data sets. I'll try a much larger sample and report back.
                    Last edited by Steve Samuels; 18 May 2016, 11:43.
                    Steve Samuels
                    Statistical Consulting
                    [email protected]

                    Stata 14.2



                    • #11
                      But even in your corrected table the effect is still there. An explanation for the more extreme example I reported might be the following: my design weights were based on variables similar to the one I used for post-stratification.

                      The design weight was composed of three levels of inclusion probabilities: vocational schools stratified by profession group, classes stratified by year of training, and students with no stratification. So the overall inclusion probability was p(school)*p(class)*p(student), and the base weight was its inverse. For post-stratification I used a variable differentiating year of training by profession group. Weighting by the base weight alone already came close to the population distribution, but I would like an exact match with the population. I presume that the size of the effect we are discussing depends on the "correlation" between the design weight and the post-stratification weight component.
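
                      A minimal sketch of that construction, with hypothetical variable names (p_school, p_class, and p_student for the stage-wise inclusion probabilities; year_training and profession for the post-stratification cells):

                      Code:
                      * overall inclusion probability and base (design) weight
                      gen double p_incl = p_school*p_class*p_student
                      gen double baseweight = 1/p_incl
                      * post-stratification cells: year of training x profession group
                      egen poststratum = group(year_training profession)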

                      So it seems to be an advantage to use post-stratification, although I am not sure whether this makes sense from a statistical point of view.



                      • #12
                        I agree that svyset with the post* options should be the first choice for post-stratification on a single set of control totals. I've pointed out one likely contributing reason for the reduction in standard errors compared with those from the other commands: Stata modifies the variance formulas.

                        Unfortunately, in practice there are usually multiple sets of control totals, so one has to turn to the other commands. Among Stata solutions, I should also have mentioned John D'Souza's calibrate (followed by calibest) at SSC, which can control for means and other characteristics of quantitative variables within categories.
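
                        To illustrate, raking the auto example above to two margins with ipfraking might look roughly like this. This is only a sketch: the rep78 totals are made up (chosen to sum to 300, like the foreign totals), and I am assuming that ctotal() accepts more than one matrix and that an if qualifier is allowed.

                        Code:
                        * second margin: rep78, with hypothetical control totals
                        matrix total_rep = (10, 25, 120, 90, 55)
                        matrix colnames total_rep = _c:1 _c:2 _c:3 _c:4 _c:5
                        matrix rownames total_rep = rep78
                        
                        * rake to both margins at once (rep78 has a few missing values)
                        ipfraking [pw=turn] if !missing(rep78), ctotal(total_foreign total_rep) gen(rakedwt) tol(1)
                        svyset mkr [pw=rakedwt]
                        svy: mean mpg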

                        Some recent references on weighting that have caught my eye:

                        Special Issue: 'Weighting: Practical Issues and "How to" Approach', Survey Methods: Insights from the Field (July 2015)

                        Buskirk, T. D., & Kolenikov, S. (2015). Finding Respondents in the Forest: A Comparison of Logistic Regression and Random Forest Models for Response Propensity Weighting and Stratification. Survey Methods: Insights from the Field (SMIF).

                        Kolenikov, S. (2014). Calibrating survey data using iterative proportional fitting (raking). The Stata Journal, 14(1), 22-59.



                        Last edited by Steve Samuels; 19 May 2016, 07:38.
                        Steve Samuels
                        Statistical Consulting
                        [email protected]

                        Stata 14.2



                        • #13
                          Thank you for this most informative post.

                          I think I have data of approximately the same type, and the same problem as Christian.

                          My survey data:
                          • A stratified random sample of approximately 42,000 observations, stratified on organization
                          • Proportional allocation, i.e. the sampling fraction nh / Nh is the same for all strata h
                          • Poststrata defined by organization x age x gender - I have the population distribution over the poststrata, but for practical reasons could not sample within poststrata.
                          My objective:
                          • I wish to handle non-response and random over-/under-representation of different poststrata by post-stratifying on organization x age x gender, to improve precision and reduce possible sources of bias. Remark: I treat non-response as random (at least within the poststrata).
                          I have computed a final weight final_weight in three steps (a sketch of these steps follows below):
                          1. For each category of my "poststratification matrix", a weight calculated as ProportionPopulation / ProportionSample (the resulting weight has a sum of nSample and a sample mean of 1).
                          2. The weight is truncated to a maximum value of 5.
                          3. The resulting weight is then rescaled to a mean of 1.
                          This weight is already in my data. Since the sample referred to is the realized sample of respondents, the final weight also adjusts for non-response bias.
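
                          A minimal sketch of those three steps, with hypothetical variable names (poststratum for the organization x age x gender cell, pop_prop for the known population proportion of that cell):

                          Code:
                          * step 1: ProportionPopulation / ProportionSample within each post-stratum
                          bysort poststratum: gen double cellsize = _N
                          gen double samp_prop = cellsize/_N
                          gen double final_weight = pop_prop/samp_prop
                          
                          * step 2: truncate at a maximum of 5
                          replace final_weight = 5 if final_weight > 5 & !missing(final_weight)
                          
                          * step 3: rescale to a sample mean of 1
                          summarize final_weight, meanonly
                          replace final_weight = final_weight/r(mean)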

                          I believe that this weight is a correct total weight, including both a design weight and a post-stratification adjustment. It is not based on, but closely resembles, the post-stratification weight computed for the European Social Survey data (even though the ESS team uses a raking procedure): https://www.europeansocialsurvey.org...ing_data_1.pdf

                          My main question: How do I implement poststratification in my data and analysis using Stata?

                          I believe I have these alternatives:

                          1. Treat the post-stratification weight final_weight as a design weight (as if I had sampled on the poststrata with proportional allocation and equal non-response in all poststrata):

                          Code:
                          svyset psu [pweight=final_weight], strata(post_strata_var) vce(linearized) singleunit(missing)
                          2. Specify the design weight and the post-stratification adjustment separately:

                          Code:
                          svyset psu [pweight= design_weight], strata(organization) poststrata(poststratum_var) postweight(poststratumpopulation_var) vce(linearized) singleunit(missing)

                          Can I use these alternatives, and which one is more correct?

                          Setting aside the question of truncating weights for a moment, I believed that I could treat stratification and post-stratification as statistically the same and hence use alternative 1 above. However, Steve's post (#5 above) seems to imply that only alternative 2 is correct?


                          Supplementary Problem 1: Keep the original final weight
                          I would prefer to keep my final_weight and implement it in Stata - possibly partitioned into a design weight and a post-stratification adjustment. Is this possible?
                          (I have speculated that it might be possible to "go backwards" and specify a design weight, including truncation, that would result in a final post-stratified weight in Stata matching my final_weight variable, but so far I have found no solution.)

                          Supplementary Problem 2: Truncated weights
                          I have truncated weights. Is it possible to implement this in Stata's svyset command or in any of the ado-packages such as survwgt?

                          Supplementary Problem 3: Scale of the weight
                          Can I scale the final weight, including the post-stratification adjustment, in Stata so that the weights in the sample sum to the sample size?
                          This is done, e.g., in the European Social Survey data (see page 4, bottom, in http://www.europeansocialsurvey.org/...umentation.pdf).
                          However, the combined final weight variable does not seem to be accessible in Stata.


                          Best Regards, Lars



                          • #14
                            Can someone kindly help with the last comment on this post?

