  • Post-stratification weights, calibrated weights, and sampling design weights: How to combine them?

    Dear all,

    I want to calculate weighted means of variable x and don't know how to combine the weights provided in the data set with post-stratification weights that I calculated on my own.

    I am working with cross-sectional individual-level survey data in Stata 15.

    The data set comes with two different weights: (i) a sampling design weight that accounts for unequal selection probabilities of the sample units (the inverse of the probability of being in the sample) and (ii) calibrated weights that also consider calibration margins based on gender and region.

    Because the age distribution in the sample is not the same as the age distribution in the population, I want to further apply post-stratification weights considering the age structure (in addition to gender and region) when calculating the weighted means of x.

    I know that I could calculate post-stratification weights by dividing the share of each gender-region-age group in the population (N) by the share of the same gender-region-age group in the sample (n) and then use these weights as pweights (pweight = N/n) when calculating means.
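A minimal sketch of one common way to combine the two pieces, assuming the adjustment is applied on top of the design weight: compute each cell's design-weighted sample share, divide the population share by it, and multiply the design weight by that factor. The variable names (cell, dweight, popshare) and all numbers are made up for illustration.

```stata
* Toy data: two poststratification cells (hypothetical names and values)
clear
input cell dweight popshare x
1 1 0.40 10
1 2 0.40 12
1 1 0.40 14
2 4 0.60 20
end

* Design-weighted share of each cell in the sample
egen double w_cell = total(dweight), by(cell)
egen double w_all  = total(dweight)
gen double wshare  = w_cell / w_all

* Poststratification factor and combined weight
gen double psfactor = popshare / wshare
gen double finalwt  = dweight * psfactor

* Weighted cell shares should now match the population shares
egen double f_cell = total(finalwt), by(cell)
egen double f_all  = total(finalwt)
assert reldif(f_cell / f_all, popshare) < 1e-6

* Weighted mean of x with the combined weight
mean x [pweight = finalwt]
```

If every design weight equals 1, the factor reduces to the population share divided by the unweighted sample share, i.e. the N/n ratio described above.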

    My question is: How do I combine these weights with the calibrated weights provided in the sample? Or do I need to combine them with the sampling design weights somehow?

    I do have information on strata, psu and ssu - but (i) this information is missing for 1/3 of my observations and (ii) I do not know how this information relates to my problem.
    The information on the share of gender-region-age groups in the population (N) comes from census data.

    I know this is a very specific problem, but if you could at least lead me to some applied readings on "combining sampling design weights with post-stratification weights", I would be very grateful.

    Best regards,
    Stephanie

    *----------------------------------------------------------

    I also tried the following:

    1. Collapse the dataset using the calibrated weights provided in the dataset:

    Code:
     collapse (mean) x [pweight = calibrated_weight], by(age)   // calibrated_weight: placeholder name for the calibrated weight variable
    2. Merge shares of age groups from census data

    3. Calculate means manually


    Code:
    
    gen help = x * N //  e.g. mean of x in age group 30-34 * the share of 30-34-year-olds in the population
    egen x_mean = total(help)
    drop help

    But (i) I am not sure if this is a valid option and (ii) it makes it hard to compare the means with and without the post-stratification weights so I am not that happy with this approach.

  • #2
    Do the calibration weights control for any variables besides gender and region? If yes, what are those variables?
    Last edited by Steve Samuels; 15 Jul 2018, 12:08. Reason: Waiting for answer to question
    Steve Samuels
    Statistical Consulting
    [email protected]

    Stata 14.2



    • #3
      Hi Steve,
      Thank you for your response.

      No, the data manual says that the calibration weights only consider GENDER and (within-country) REGIONS as their calibration margins. I want to add AGE myself. The data manual also makes a vague statement that "the calibrated weights are as close as possible to the original sampling design weights".

      Do you need any more information?

      Kind regards,
      Stephanie

      Some more information regarding strata, psu, and ssu (in case this is relevant): The dataset includes several countries. The weights are calculated for each country separately. Each country has its own sampling design. Some countries use simple random sampling, others multistage methods. I think that is why strata, psu, and ssu are missing for 1/3 of my data. The values for strata, psu, and ssu are either missing for a whole country (if it has a simple sampling design?) or not missing at all (if it has a multistage sampling design?). I can only guess here, as the manual gives no further information.



      • #4
        In the meantime, I stumbled upon the program sreweight by Daniele Pacifico. Here is the link to an article in the Stata Journal: https://www.stata-journal.com/articl...article=st0322


        As far as I understand, it generates calibrated weights (nweight) based on the original survey weights (the sampling design weights?) and external totals (my shares from census data?). Would that be an option?

        Thank you!



        • #5
          I was going to suggest just such a solution: Abandon the original calibration weights and redo the calibration/poststratification yourself. Stata 15 now has built-in calibration via the regress() and rake() options of svyset. See the SVY manual entry for calibration.


          However, with large enough sample cell sizes (n~50 or n~25) and only three control variables, the other methods are overkill and may not be optimal. You're likely to get smaller standard errors if you poststratify with the poststrata() and postweight() options of svyset. That's my first recommendation, but by all means try other methods and compare. Let us know what you find.
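A minimal sketch of the poststrata()/postweight() route on toy data. The variable names (psu, dweight, agegrp, popcount) and the numbers are hypothetical; postweight() expects a variable holding each poststratum's population size, constant within the poststratum.

```stata
* Toy data: two age groups with known population counts (made-up values)
clear
input psu dweight agegrp popcount x
1 10 1 60 5
2 10 1 60 7
3 20 2 40 9
4 20 2 40 11
end

* Design weight plus poststratification on age group
svyset psu [pweight = dweight], poststrata(agegrp) postweight(popcount)

* Poststratified mean with design-based standard error
svy: mean x
```

With a real gender-region-age classification, the poststratum variable would be the three-way cell identifier rather than agegrp alone.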

          Steve Samuels
          Statistical Consulting
          [email protected]

          Stata 14.2



          • #6
            To continue: you need to construct your svyset statement.

            Your survey sounds similar in design to the European Social Survey (ESS). (One difference, at least, is that the poststratification weight in ESS balances for age, gender, education, and region.) In ESS, sampling from lists of individuals, households, and addresses is permitted; presumably, systematic sampling is employed. For countries with such sampling frames, the PSU is therefore the individual, household, or address. The same will apply to your survey. I guess that the entire country is the single stratum in such instances. If so,
            Code:
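            * Assumes the survey's stratum variable is named strata and is missing only for list-frame countries
            gen stratum = cond(missing(strata), 1, strata)
            * One stratum id per country-stratum combination
            egen superstrat = group(country stratum)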
            Then the stratum variable for svyset should be superstrat.
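A rough sketch of how the full svyset statement might then be put together, on toy data. It assumes that missing strata/psu marks a list-frame country and that a person identifier (here id) can serve as the PSU there; all variable names and values are hypothetical.

```stata
* Toy data: country 1 has a list frame (no strata/psu), country 2 is multistage
clear
input country strata psu id dweight
1 . . 1 2
1 . . 2 2
1 . . 3 2
2 1 1 4 5
2 1 1 5 5
2 1 2 6 5
2 2 3 7 5
2 2 4 8 5
end

* Single stratum per list-frame country, original strata elsewhere
gen stratum = cond(missing(strata), 1, strata)
egen superstrat = group(country stratum)

* PSU: the individual in list-frame countries, the sampled cluster elsewhere
egen psu_c = group(country psu)
gen double psu_all = cond(missing(psu), -id, psu_c)   // negative ids cannot clash with cluster ids

svyset psu_all [pweight = dweight], strata(superstrat)
svydescribe
```

The poststrata() and postweight() options can be added to the same svyset line once the population counts are merged in.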

            Again, I'm guessing based on what you have said so far. I can't say much more without access to the study documents.

            Good luck!



            Steve Samuels
            Statistical Consulting
            [email protected]

            Stata 14.2



            • #7
              Thank you Steve - I will play around a bit today and then get back to you! Best, Stephanie



              • #8
                Dear Steve (or other Statalist members),

                I have now run sreweight and it worked really well. I tried to replicate the original calibration weights and got pretty close.

                However, I have two more questions regarding the procedure:

                On page 11 in his article on sreweight, Pacifico writes the following footnote:

                "As the code shows, the original estimate of the population size is also included in the last row of the vector (e(N_pop)). Hence, only five out of the six new totals related to the population size by age group need to be included in the vector."

                He is referring to his vector of population totals, in which he only included 5 of his 6 age-group totals.

                I played around a little while trying to replicate the original calibrated weights in my dataset and found that I get exactly the same results (i) if I include all gender and region totals in my vector of totals and (ii) if I include all gender totals and all region totals BUT ONE. This worked even though I DID NOT add the whole population size to the vector.

                My two questions are:
                1. Why is there no difference in my weights if I leave out one of the region totals in the vector of calibration margins?
                2. When I calculate my own calibration weights in which I also want to consider age, do I have to also leave out one of the age totals?

                Here is, once again, the link to the Stata Journal paper in which Pacifico describes sreweight: https://www.stata-journal.com/articl...article=st0322

                Let me know if you need any more information to answer my questions.

                Thanks a lot!
                Stephanie



                • #9
                  Three questions:
                  • What do you mean by "all gender and region totals"? 1) A vector in which each entry is the total for one gender-region combination (e.g. number of females in region 1), or 2) two vectors, one of gender totals and one of region totals?

                  • For exactly what combinations of age, gender, and region do you have totals?

                  • Do you have any persons with missing age (or gender or region)? If so, then before running sreweight, redistribute them to one of the other categories. This can be a little tricky; you might want to enlist Stata's MI capabilities.

                  To answer your questions:

                  1. My guess is that when you leave out a region, sreweight fills in the last region from the total of the gender counts
                  2. No, don't leave out an age category. sreweight will apparently tolerate marginal totals that do not add up to the same number, but I think it worthwhile to fix the numbers yourself.
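The arithmetic behind that guess can be sketched as follows: the gender totals and the region totals both sum to the same population size, so any one region total is implied by the rest. The numbers are made up, and this says nothing about what sreweight does internally.

```stata
* Hypothetical marginal totals; both margins sum to the same population size
clear
matrix gender = (480, 520)
matrix region = (300, 450, 250)

* Population size implied by the gender margin
scalar poptotal = gender[1,1] + gender[1,2]

* The omitted (third) region total follows from the other two
scalar last_region = poptotal - region[1,1] - region[1,2]
display last_region
```

The same redundancy holds for an age margin, which is why checking that all margins add up to one population total is worth doing before calibrating.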

                  I recommended in Post #5 that if you have the three-way totals and if sample cell sizes are large enough, you use the poststrata() and postweight() options in svyset (see note below).

                  Other comments:

                  A. If you have three-way totals, but some sample cells are sparse, combine some small neighboring cells first (a pain) or go to B.
                  B. If you have all two-way totals, feed sreweight three vectors of two-way totals: gender-region, gender-age, region-age.
                  C. If you have only one two-way total (e.g. gender-age), then feed sreweight two vectors: one with the known two-way totals and one with the one-way totals.
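To decide between A and B, it helps to look at the sample sizes of the three-way cells first. A small sketch on simulated data; the variable names, the 25-observation threshold, and the data themselves are assumptions for illustration only.

```stata
* Simulated sample with hypothetical gender, region, and age-group codes
clear
set obs 200
set seed 12345
gen gender = 1 + (runiform() < 0.5)
gen region = ceil(3 * runiform())
gen agegrp = ceil(4 * runiform())

* Three-way cell identifier and the number of sample observations in each cell
egen pscell = group(gender region agegrp)
bysort pscell: gen n_cell = _N

* Flag cells that fall below the chosen minimum size
gen byte sparse = n_cell < 25
tabulate pscell sparse
```

Cells flagged as sparse are the candidates for merging with a neighboring cell, or a reason to fall back on marginal totals.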


                  Note: In this post, users noticed that estimated standard errors for means with poststrata() and postweight() were smaller than those estimated after other reweighting commands had been used. Was this a bug? I investigated and discovered that this is an expected benefit of the algorithm that Stata uses.
                  Last edited by Steve Samuels; 25 Jul 2018, 12:15.
                  Steve Samuels
                  Statistical Consulting
                  [email protected]

                  Stata 14.2



                  • #10
                    Dear Steve,

                    I ended up calculating my post-stratification weights with survwgt rake by Nicholas Winter. I used that package because (i) some of the age-gender-region cells included small numbers of observations, and (ii) sreweight did not converge for some of the countries observed. I managed to replicate the original calibration weights using gender and region totals, then I calculated my own weights using gender, region, and age totals.

                    Thanks a lot for your advice!
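For readers unfamiliar with raking, here is a hand-rolled sketch of the idea in plain Stata: the weights are rescaled to each set of marginal totals in turn until all margins match. This is not survwgt's code or syntax; the variable names, the totals, and the fixed 50 iterations are assumptions for illustration.

```stata
* Toy data with a design weight and two margins (hypothetical values)
clear
input gender agegrp dweight
1 1 10
1 1 10
1 2 10
2 1 10
2 2 10
2 2 10
end

* Hypothetical population totals for each margin (both sum to 100)
gen double tot_gender = cond(gender == 1, 40, 60)
gen double tot_age    = cond(agegrp == 1, 55, 45)

* Raking: alternate between matching the gender and the age totals
gen double rakewt = dweight
forvalues iter = 1/50 {
    egen double s = total(rakewt), by(gender)
    quietly replace rakewt = rakewt * tot_gender / s
    drop s
    egen double s = total(rakewt), by(agegrp)
    quietly replace rakewt = rakewt * tot_age / s
    drop s
}

* Weighted margins after raking
table gender, contents(sum rakewt)
table agegrp, contents(sum rakewt)
```

Only marginal totals are needed, so sparse three-way cells are not an obstacle, which fits the reason given above for choosing a raking approach. The table syntax shown is the Stata 15 form.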
                    Best,
                    Stephanie

