
  • Weighting in Stata when weight variable accounts for both sample-based and population-based corrections?

    Hello,

    I am using national survey data (specifically, the UK LCF) for a regression analysis. The data contain a variable, weighta, described as follows:
    [Attached image: weight.PNG]

    I really don't know which of the following Stata specifications (from "help weight") for weighting is suitable to apply to my variable weighta. Can you please help me?
    [Attached image: Capture.PNG]

    Here you can see what the values of my weighta variable look like. They are not integers, which makes me doubt that the fweight specification in Stata would fit, but I am still not sure about any of the others.
    [Attached image: Capture1.PNG]

    I also need to apply my variable "weighta" in the following regression:

    logit repair sav_rate inc_a age fam HighEd1 price_ratio ib1.d2001 d2002 d2003 d2004 d2005 d2006 d2007 d2008 d2009 d2010 d2011 d2012 d2013 d2014 d2015 d2016 d2017 d2018

    where the d20** variables are year dummies.

    Once I have worked out which Stata weight specification I need, e.g. [pweight = weighta], I am not sure where to put it within the regression command. At the end? Next to the household-specific variables like "age" and "fam" (i.e. the size of the household)?

    I have also run into an example by typing "help svy" in the Stata console, but it seems hard to interpret. In particular, I did not understand what the "psuid" variable stands for.

    Thank you in advance!


    Best,

    Linda

    Last edited by Linda Luciani; 04 Apr 2021, 03:51.

  • #2
    These weights should be dealt with as -pweight-s in Stata.

    To use them in a regression, put [pweight = weighta] after all of the regression variables and after any -if- or -in- restrictions. If you also specify options for the regression command, the weight expression goes before the comma that introduces the options.
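
    As a rough sketch of that placement, using the variable names from the question: the -if- condition and the -or- option below are purely illustrative assumptions, included only to show where the weight expression sits relative to them.

    ```stata
    * Sketch only: variable list copied from the question above.
    * The -if- restriction (age >= 18) and the -or- option are hypothetical,
    * added just to show the ordering: varlist, if/in, [weight], comma, options.
    * The /// line continuations work in a do-file, not at the command line.
    logit repair sav_rate inc_a age fam HighEd1 price_ratio            ///
        ib1.d2001 d2002 d2003 d2004 d2005 d2006 d2007 d2008 d2009      ///
        d2010 d2011 d2012 d2013 d2014 d2015 d2016 d2017 d2018          ///
        if age >= 18 [pweight = weighta], or
    ```

    With pweights, Stata reports robust standard errors automatically, so there is no need to add vce(robust) yourself.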

    I have also run into an example by typing "help svy" in the Stata console, but it seems hard to interpret. In particular, I did not understand what the "psuid" variable stands for.
    An alternative approach to specifying weights in the command is to -svyset- the data and then use the -svy:- prefix with the command. The -svyset- command, in addition to enabling you to specify the pweight variable, also lets you provide information about stratification and primary or higher order sampling units if any. psuid refers to the variable which identifies the primary sampling units in your data. You would have to refer to the documentation provided by the curators of the survey itself to know which variable that might be. The same is true for the strata and any higher order sampling units. Most, but not all, large-scale surveys use both strata and one or more levels of sampling units, though regrettably, public use data often omits this information out of concern about data privacy. This is unfortunate, because without that information standard errors, test statistics, confidence intervals, and p-values will all be incorrect. (This information, however, is not needed to get unbiased estimates--only the sampling weights are required for that.)
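
    A minimal sketch of the -svyset- approach, assuming only the weight variable is available. Here weighta comes from the question, while psuid and strataid are just the placeholder names from the "help svy" example and would need to be replaced by whatever the survey documentation identifies, if anything.

    ```stata
    * Weights only: _n declares each observation as its own sampling unit.
    * Point estimates use the weights; standard errors ignore any clustering
    * or stratification in the actual design.
    svyset _n [pweight = weighta]

    svy: logit repair sav_rate inc_a age fam HighEd1 price_ratio       ///
        ib1.d2001 d2002 d2003 d2004 d2005 d2006 d2007 d2008 d2009      ///
        d2010 d2011 d2012 d2013 d2014 d2015 d2016 d2017 d2018

    * If PSU and stratum identifiers were available (hypothetical names),
    * the declaration would instead look like:
    * svyset psuid [pweight = weighta], strata(strataid)
    ```

    The weights-only declaration should give the same coefficients as putting [pweight = weighta] directly in the -logit- command; the full declaration is what would change the standard errors.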



    • #3
      Dear Mr Clyde Schechter,

      Thank you for an incredibly exhaustive answer!
      Indeed, I could see that in the regression example from "help svy" one is asked to specify the following:
      1. Pweight : finalwgt
      2. Strata 1: strataid
      3. SU 1: psuid
      However, I don't have any variables analogous to "strataid" and "psuid" in my dataset, I guess for the privacy reasons you mention. I only have "weighta", which is analogous to "finalwgt" in the example.
      The technical report does discuss at length how they handle stratification and standard errors. I include a paragraph here for illustration:
      [Attached image: 222.PNG]

      So my question is: do you think it might be possible for me to recover Strata 1 and SU 1 from the information provided in the technical report, or should I just accept that I cannot do inference on my regression results?
      If the latter is the case, is there any other way to get around it?

      Thank you so much for your help!

      Best,
      Linda
      Last edited by Linda Luciani; 05 Apr 2021, 02:30.



      • #4
        I don't think it will be possible to identify the strata and PSUs from that formula. At most, you might be able to figure out how many strata there are and how many PSUs each contains, though, frankly, it would be an enormously complicated task. I think that if the technical report doesn't tell you where the strata and PSU variables are in the data, they are deliberately hiding them.



        • #5
          Dear Mr Clyde Schechter,
          you are exactly right: they give me a region/age classification and the number of PSUs, but it would be hard to match that with the actual data.
          I can see from some of the tables they report that standard errors computed with the full method can differ by up to 20% from the simple ones... so I guess my inference will be a purely "applied-theory" exercise.

          Thank you very much for your precious help!

          Best,
          Linda
