  • svy versus [pw=...] for regression analysis

    Hello everybody,
    I have household budget survey data, and the only available weight is a probability weight (pweight). When I run a regression, I see that this command

    regress HEALTHRT i.MB02 i.AGEGROUP2 [pweight = HA10] if HA02==2008

    gives different results from these

    svyset [pw=HA10]
    regress HEALTHRT i.MB02 i.AGEGROUP2 if HA02==2008

    The coefficients differ. The p-values also differ, not in which terms are significant but in their values.
    Which method is more appropriate?
    I saw a previous post, but it discussed subpopulations and I did not understand it.

    Many thanks,
    Lina


  • #2
    Neither is correct if you have data from a multistage sample, so please describe the sampling design in detail. Was a different sample taken in each year? Quote from the study documentation or link to it, and we'll then construct the correct svyset statement and advise about subpop().

    In addition, we can't see the difference you are talking about. Please follow the directions of FAQ 12, which, in part, are to 1) show not only commands, but all the results from those commands; and 2) put commands and results between CODE delimiters, which are explained there. As it is, your second regress command had no svy: prefix, so it ignored the (possibly incorrect) svyset statement.
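
    For readers unfamiliar with subpop(), here is a minimal sketch of restricting a survey regression to a subgroup. It uses the auto data shipped with Stata rather than the HBS file. The svyset line is a placeholder assumption and not the correct design for the HBS; rep78 and foreign only stand in for the weight and the year indicator.

    ```stata
    * Illustration only: auto data, with rep78 standing in for a sampling weight
    sysuse auto, clear
    svyset _n [pw = rep78]

    * Restrict estimation to one subgroup with subpop() rather than an if qualifier
    svy, subpop(if foreign == 1): regress price weight mpg
    ```

    Unlike an if qualifier, subpop() keeps the observations outside the subgroup in the variance calculation, which matters once strata and PSUs are declared.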
    Last edited by Steve Samuels; 18 Nov 2015, 06:38.
    Steve Samuels
    Statistical Consulting

    Stata 14.2



    • #3
      Thanks for this. Well, my dataset has the expenditures on health and goods since 2008.

      I have a variable (HA10) that gives the sample weight and it is the only information that I have about the sampling design.
      HEALTHRT is the expenditure on health care, MB02 the gender of the household head, AGEGROUP2 the age group of the household head, MARRITAL the marital status, education the education level, tworegions a dummy variable for two regions that I created from the dataset, ME01_AGGREGATED the SES, Household_size the household size, QUI_RT_CONSUM the consumption quintile, and HA02 the survey year. I want to analyze the dataset year by year, from 2008 to 2014.
      Typing
      Code:
      svyset [pw=HA10]
      regress HEALTHRT i.MB02 i.AGEGROUP2 i.MARRITAL i.education i.tworegions i.ME01_AGGREGATED i.Household_size i.QUI_RT_CONSUM if HA02==2013
      I get the results shown in pic1.
      While typing
      Code:
      regress HEALTHRT i.MB02 i.AGEGROUP2 i.MARRITAL i.education i.tworegions i.ME01_AGGREGATED i.Household_size i.QUI_RT_CONSUM [pweight = HA10] if HA02==2013
      I get the results shown in pic2.

      Could you help me please?



      • #4
        Just svysetting the data isn't enough. You have to use the svy: prefix on commands, e.g.

        Code:
        svy: reg y x1 x2
        I think that will resolve the discrepancies. If not, post your output again. It is better to use code tags when doing so. See pt 12 of the FAQ.
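
        To make the point concrete, here is a minimal sketch on the auto data shipped with Stata, not the HBS file. rep78 stands in for a sampling weight purely for illustration, and the svyset line declares weights only, as in the original question.

        ```stata
        * Illustration only: rep78 stands in for a sampling weight
        sysuse auto, clear
        svyset [pw = rep78]

        * Plain regress ignores the svyset declaration, so this fit is unweighted
        regress price weight mpg

        * Weighted fit through the weight qualifier
        regress price weight mpg [pw = rep78]

        * Weighted fit through the svy: prefix, which uses the svyset weights
        svy: regress price weight mpg
        ```

        The second and third commands should give the same coefficients. Their standard errors can differ slightly, because the two commands apply different small-sample adjustments.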
        -------------------------------------------
        Richard Williams, Notre Dame Dept of Sociology
        StataNow Version: 19.5 MP (2 processor)

        WWW: https://www3.nd.edu/~rwilliam



        • #5

          Thanks for using CODE delimiters, but FAQ 12 asked that you put results, not just commands, between CODE delimiters. It also asks that you not post photos. They block the screen and, in your case, are too fuzzy to be readable.

          Do you want to quote standard errors, confidence intervals, or p-values? Do you plan on comparing different years? Different countries? If the answer is "yes" to any of these questions, knowing something of the design is crucial.

          Your naming of the HA10 and HA02 variables was enough to identify your survey as an EU Household Budget Survey (HBS). I found some overall description in these two documents (I didn't look for more).

          2008:

          http://ec.europa.eu/eurostat/documen...2-db5d330f27aa

          2010:

          http://ec.europa.eu/eurostat/documen...4-757d4342015f


          These indicate that the design varied from country to country, but you should be able to pick out _your_ country (if you are studying a single country) and get a description that way. See Annex 2 (p. 31+) of the 2008 document and Appendix 2 (p. 52+) of the 2010 document. Also of some importance is the "effective" sample size in Table 2, page 8 of the 2010 document. So report, verbatim, what those sections say about your country.
          Last edited by Steve Samuels; 18 Nov 2015, 08:42.
          Steve Samuels
          Statistical Consulting

          Stata 14.2



          • #6
            Dear friends,

            First of all, please accept my apologies for the attached photos. I didn't know how to put the results between CODE delimiters. Do you mean copy and paste? I'm really sorry for this.
            Mr Williams was right: the results are the same when I use svy: before the regression. As for the HBS, thank you for the material, but I couldn't work out how to use it in my regression analysis.

            Many thanks for your support and your immediate response.

            Best,

            Lina



            • #7
              Lina:
              as far as code delimiters are concerned, here is an excerpt from FAQ #12 (that you're recommended to read):
              Stata code (i.e. the exact commands issued) is very much easier to read if presented as such. Click on the “Toggle Advanced Editor” button (an underlined A) in the area above where you enter text for posts and, in the menu that appears, click on the # button to insert [CODE] and [/CODE] mark-up. Write your code between, paying particular attention to linebreaks and indentation. Or just insert those mark-ups manually before, or indeed after, you insert your code.
              Kind regards,
              Carlo
              (Stata 19.0)



              • #8
                thank you!



                • #9
                  I asked you to share the documentation so that I might advise you on what to do. Without the information I requested, I can only say that standard errors, confidence intervals, and p-values from svy: regress may all be invalid.
                  Last edited by Steve Samuels; 18 Nov 2015, 11:30.
                  Steve Samuels
                  Statistical Consulting

                  Stata 14.2



                  • #10
                    The best practical discussion I've found on correctly specifying -svyset- is here:

                    http://siteresources.worldbank.org/I...quityFINAL.pdf

                    However, there are cases where the survey documentation is quite sparse, making it very challenging to completely specify the sampling.
                    __________________________________________________ __
                    Assistant Professor, Department of Biostatistics and Epidemiology
                    School of Public Health and Health Sciences
                    University of Massachusetts- Amherst



                    • #11
                      In the absence of PSU and Stratum identifiers, I would use the 2010 DEFFs to modify e(V). It's only a rough fix for standard errors, but probably better than doing nothing.

                      Code:
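                      * Drop any earlier definition of udeff so that it can be redefined
                      cap program drop udeff
                      * udeff: multiply the current e(V) by a design effect and repost b and V
                      program define udeff, eclass
                          args deff
                          matrix b = e(b)
                          * Inflate every variance and covariance by the supplied DEFF
                          matrix V = `deff'*e(V)
                          ereturn post b V
                      end
                      
                      * Demonstration on the auto data; rep78 only stands in for a weight
                      sysuse auto, clear
                      svyset _n [pw = rep78]
                      
                      * Standard error and test before the adjustment
                      svy: reg price weight mpg
                      di _se[weight]
                      test weight
                      
                      /* Apply DEFF = 2 */
                      udeff 2
                      * Standard error and test after the adjustment
                      di _se[weight]
                      test weight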
                      Last edited by Steve Samuels; 19 Nov 2015, 06:21.
                      Steve Samuels
                      Statistical Consulting

                      Stata 14.2



                      • #12
                        Thank you for your help.
                        As for the documentation of the weights, here is the guidance from the authority that carried out the survey:

                        " For the survey features estimation, the characteristics of each individual as well as these of each household were multiplied with a reducing coefficient that created as the product of these 3 factors:
                        1. the reverse probability of choosing this individual (this is equal to the reverse probability of choosing the specific household)
                        2. the reverse of the ratio of response of household in this strata
                        3. a correction coefficient. "

                        Is this helpful? I find it even more complicated.

                        What happens with heteroskedasticity if I use svy: or pweights in my regression?



                        • #13
                          It is complicated, but it is not what I asked for. I asked for a description of the sampling design in your country and linked to specific places where you might find it. (It might also be in the document you quoted.) The regressions you've done so far assume that households were selected at random from a list. That's not necessarily true, as many studies select geographic areas first, then households within those areas. This is called multistage sampling. The larger areas are "clusters" of households, known as "primary sampling units" or PSUs, and it is variation between PSUs that is the primary determinant of standard errors. If your survey had multiple stages, but PSUs are not identified in your data, then you cannot write down a correct svyset statement and will have standard errors that are much too small. That is why it is important that you report the sampling design pertinent to your country, especially the "Design Effect" (DEFF) on p. 52 of the 2010 document. You can look up the definition of the DEFF in the Stata survey manual.
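
                          For reference, the design effect is usually defined as the ratio of the design-based variance of an estimate to the variance the same estimate would have under simple random sampling with the same number of observations. The notation below is an illustrative assumption, not taken from the survey documentation; $\hat\theta$ is any estimate, such as a regression coefficient.

                          ```latex
                          \mathrm{DEFF}(\hat\theta)
                            = \frac{\widehat{V}_{\text{design}}(\hat\theta)}{\widehat{V}_{\text{srs}}(\hat\theta)},
                          \qquad
                          \mathrm{SE}_{\text{adj}}(\hat\theta)
                            = \sqrt{\mathrm{DEFF}} \times \mathrm{SE}_{\text{srs}}(\hat\theta)
                          ```

                          So multiplying e(V) by a DEFF of 2, as in the udeff example in #11, multiplies each standard error by about 1.41.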

                          To answer your question about heteroskedasticity: survey regression works with heteroskedastic data. The standard error computation does not assume a constant residual SD.

                          So I'm now curious: what is the country, or what are the countries, you are studying?
                          Steve Samuels
                          Statistical Consulting

                          Stata 14.2
