
  • svy commands for survey data sets

    Hello,

    I am currently working with a complex survey dataset that uses a stratified random sampling design.
    I would like to understand whether analysing it without the svy settings in Stata would lead to a wrong analysis, or whether it would yield the same results as when the sampling design is taken into account (using the svy commands in Stata).

    Hoping to get a reply soon.

    Thanks

    Deeksha

  • #2
    Most svywhatever commands have been subsumed under svy: whatever. The originals, like svyreg and svylogit, may or may not work, so go with svy: whatever to get the up-to-date version. You don't say which version of Stata you are using, but under Stata 13, "help svylogit", for example, lists the commands that have been subsumed.

    Even if you are working from old documentation that specifies the old commands, you are best off adapting to the newer approach to get it right.
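
    For illustration, here is a minimal sketch of the newer syntax on simulated data. The variable names (psu_id, stratum_id, samp_wt, age, sick) are made up for this example and do not come from any real survey; substitute the design variables named in your own dataset's documentation. The design is declared once with svyset, and each estimation command is then prefixed with svy:.

    ```stata
    * Simulated data: 3 strata, 10 PSUs per stratum, 20 people per PSU (all hypothetical)
    clear
    set seed 1
    set obs 30
    generate psu_id = _n
    generate stratum_id = ceil(_n/10)
    expand 20
    generate samp_wt = 1 + runiform()
    generate age = 20 + floor(40*runiform())
    generate byte sick = runiform() < invlogit(-3 + 0.05*age)

    * Declare the design once
    svyset psu_id [pweight=samp_wt], strata(stratum_id)

    * Newer syntax: the ordinary command with the svy: prefix
    * (the old-style equivalents would have been svymean and svylogit)
    svy: mean age
    svy: logit sick age
    ```

    The svyset settings are saved with the dataset, so if you save the data after svyset, later svy: commands pick up the design without you having to repeat it.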

    Quote from help:

    These commands have been replaced by the svy prefix command in Stata version 9 (and later).
    The old command names continue to work.
    svygnbreg      svy: gnbreg
    svyheckman     svy: heckman
    svyheckprob    svy: heckprob
    svyintreg      svy: intreg
    svyivreg       svy: ivreg
    svylogit       svy: logit
    svymean        svy: mean
    svymlogit      svy: mlogit
    svynbreg       svy: nbreg
    svyologit      svy: ologit
    svyoprobit     svy: oprobit
    svypoisson     svy: poisson
    svyprobit      svy: probit
    svyprop        svy: proportion
    svyratio       svy: ratio
    svyregress     svy: regress
    svytab         svy: tabulate
    svytotal       svy: total
    Last edited by ben earnhart; 07 Jan 2015, 01:33.


    • #3
      I'm fairly new to complex survey data analysis myself, but here's what I think the effects are.

      The effects of using svy depend on your data:
      first, on the sample design;
      second, on how variation is related to your strata and clusters.

      Stratification leads to a sample that is more representative of the population on the characteristic that you stratify on than a simple random sample (SRS).
      So, for instance, if your sample is stratified into urban and rural, your sample will have an urban-rural split that is similar to the population's, whereas in an SRS normal sampling variability can lead to samples with larger shares of urban or rural than the population.
      Stratification makes population estimates more efficient, depending a bit on how the stratification variable relates to your variable of interest.
      Using svy to analyse stratified data is likely to lead to a downward correction of the standard errors, so a result that is not significant with 'normal' analysis might be significant after correcting for stratification with svy.
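
      The stratification point can be sketched with simulated data. Everything here is an assumption made for illustration: two equally sized strata (a hypothetical urban indicator), an outcome y whose mean differs by 10 between strata, and no weights or clustering. The same mean is estimated twice, once ignoring the strata and once declaring them with svyset.

      ```stata
      * Hypothetical stratified sample: 500 urban and 500 rural observations
      clear
      set seed 12345
      set obs 1000
      generate byte urban = mod(_n, 2)
      generate y = rnormal(50 + 10*urban, 5)

      * Ignore the design
      mean y
      matrix V_srs = e(V)

      * Declare the strata (each observation is its own sampling unit)
      svyset _n, strata(urban)
      svy: mean y
      matrix V_str = e(V)

      display "SE ignoring strata: " sqrt(V_srs[1,1])
      display "SE using strata:    " sqrt(V_str[1,1])
      ```

      Because the strata differ on y, the variance within strata is smaller than the overall variance, and the svy standard error comes out smaller. If the stratification variable were unrelated to y, the two standard errors would be about the same.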

      Clustering violates the assumption of independence of observations. For face-to-face surveys, you usually use a multistage cluster sample, in which you first select villages and then people within villages. This violates the independence assumption because, in general, people who live in the same village are more alike than people from different villages. This means there is unobserved heterogeneity in your sample.
      In a significance test you examine how likely it is to draw a sample with a value at least as far removed from H0 as the value in your sample.
      Let's say you test a sample of students from school A and find their average IQ is 123. You know that samples vary, so just because the sample has an average IQ of 123 doesn't mean the average of the entire school is 123. If you drew lots of samples from the same population, they would have different mean IQ scores: most would lie around the population mean, and some would lie further away. Together these sample means form the sampling distribution; its mean is the population mean and its standard deviation is the standard error.
      To determine whether a sample with an average IQ of 123 is enough ground to say that students at school A are very bright, you can do a null-hypothesis significance test (NHST), in which you determine how likely it is to draw a sample with an average IQ of 123 or more if the average IQ in the school is 100 (H0).
      For this you calculate the sampling distribution based on the normal variation of samples of that size.
      But how do you know what the 'normal variation' is? Well, usually you don't. You use the standard deviation of your sample, divided by the square root of the sample size, as an estimate of the standard error of the sampling distribution.
      This is where your cluster sample can become a problem: because the people in a cluster may be more similar to each other than people in an SRS, you are likely to underestimate the standard error.
      Note that this is only a problem if people in your clusters are more similar than people across clusters.
      If you do not correct for clustering, you may underestimate the standard error and thus overestimate the significance of your results.
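
      A rough simulation of the clustering problem, under assumptions chosen purely for illustration: 50 hypothetical villages with 20 people each, a village-level effect with standard deviation 10, and individual noise with standard deviation 10. With those settings the intraclass correlation is about 0.5, so the common approximation deff = 1 + (m - 1)*rho, with m the cluster size, suggests a design effect on the order of 10.

      ```stata
      * Hypothetical cluster sample: 50 villages, 20 people per village
      clear
      set seed 2015
      set obs 50
      generate village = _n
      * shared village effect makes people within a village alike
      generate u = rnormal(0, 10)
      expand 20
      generate y = 100 + u + rnormal(0, 10)

      * Treat the data as if it were a simple random sample
      mean y
      matrix V_naive = e(V)

      * Declare villages as the primary sampling units
      svyset village
      svy: mean y
      matrix V_clu = e(V)

      * Design effects (DEFF and DEFT) for the svy estimate
      estat effects

      display "SE ignoring clusters: " sqrt(V_naive[1,1])
      display "SE using clusters:    " sqrt(V_clu[1,1])
      ```

      The point estimate is the same in both runs; only the standard error changes, and here the one that ignores the villages is too small.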

      It is not possible to predict how the svy corrections will work out for your data. I would recommend running your analysis both with the normal command and with its svy: version, so that you can see what the corrections are doing.
      If this shows that controlling for clustering/stratification leads to substantial changes in the significance levels of your estimates, I think you should stick with the svy estimates.
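
      One way to put the two sets of results side by side is to store each and tabulate them together. This is only a sketch on simulated data; the variables psu, stratum, x and y and the stored names naive and design are all invented for the example.

      ```stata
      * Hypothetical design: 4 strata, 10 PSUs per stratum, 25 people per PSU
      clear
      set seed 7
      set obs 40
      generate psu = _n
      generate stratum = ceil(_n/10)
      * PSU-level components shared by everyone in the same PSU
      generate u = rnormal(0, 2)
      generate xc = rnormal()
      expand 25
      generate x = xc + rnormal()
      generate y = 1 + 0.5*x + u + rnormal()

      * 'Normal' analysis
      regress y x
      estimates store naive

      * Design-based analysis
      svyset psu, strata(stratum)
      svy: regress y x
      estimates store design

      * Coefficients and standard errors side by side
      estimates table naive design, b(%9.3f) se(%9.3f)
      ```

      With no sampling weights the coefficients are identical and only the standard errors differ; once pweights are declared in svyset, the coefficients can change as well.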

      A great book on this topic is
      Heeringa, S. G., West, B. T., & Berglund, P. A. (2010). Applied survey data analysis. CRC Press.

      The website has example code: http://www.isr.umich.edu/src/smp/asda/


      • #4
        "Stratification leads to a sample that is more representative of the population on the characteristic that you stratify on than a simple random sample (SRS)."
        Not necessarily. It depends on how you do the stratification. In fact, sometimes the purpose of stratification is to allow for over-sampling of small subpopulations so that sub-population specific results can be estimated with reasonable precision, while also enabling unbiased estimation of overall population results.

        I think that what you are referring to in the quote is post-stratification, which adjusts results from a (not necessarily stratified) sample to reflect known distributions of attributes in the population. That is a different matter.


        • #5
          Evelyn -- nice write-up of why to use the svy commands, but unless I'm wrong, the original poster is already sold on the advantages of taking the sampling methodology into account.

          What I *think* s/he was asking about was whether to use the old commands, whereby clustering was handled by a separate command (svylogit, svyreg, svymean, etc.), or whether to use the new syntax, whereby you precede the normal command with "svy:" (svy: logit, svy: reg, svy: mean, etc.). There is plenty of out-of-date documentation for datasets floating around out there using the old syntax.

          AFAIK, the old syntax should get the same results as the newer syntax, but you might as well move on to the newer stuff just in case.


          • #6
            Thanks Clyde for pointing out this important omission.
            Stratification can indeed be used to purposefully oversample a group within the population.
            You then need weights to correct estimates for this - something svy can incorporate but I didn't address.
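
            A small sketch of the oversampling case, with invented numbers: a population of 100,000 in which a minority group (10,000 people, mean outcome 80) is sampled as heavily as the majority (90,000 people, mean outcome 50), with 500 drawn from each. The weight pw is the number of population members each sampled person stands for. The population mean under these assumptions is 0.9*50 + 0.1*80 = 53.

            ```stata
            * Hypothetical sample with the minority group oversampled
            clear
            set seed 42
            set obs 1000
            * group 1 = majority, group 2 = oversampled minority
            generate byte group = cond(_n <= 500, 1, 2)
            generate y = rnormal(cond(group == 1, 50, 80), 10)
            * sampling weight = population size of the group / sample size of the group
            generate pw = cond(group == 1, 90000/500, 10000/500)

            * Unweighted mean: pulled toward the oversampled group
            mean y
            matrix b_unw = e(b)

            * Weighted, stratified estimate of the population mean
            svyset _n [pweight=pw], strata(group)
            svy: mean y
            matrix b_wt = e(b)

            * Group-specific means, which the oversampling was meant to support
            svy: mean y, over(group)
            ```

            The unweighted mean lands roughly midway between the two group means, while the weighted svy estimate comes back close to the population value.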

            Stratification with the aim to get a more representative sample can happen both before and after data collection.


            • #7
              On my "wish list" for 2015: a newly published book on survey analysis in Stata...
              Best regards,

              Marcos


              • #8
                Marcos, why the need for a newly published book?
                What are you missing from the current selection?
                http://www.stata.com/bookstore/survey-statistics/


                • #9
                  Far be it from me to speak on behalf of Marcos, but I think that he meant a Stata Press textbook on survey data analysis (despite having all the textbooks listed via the link Evelyn reported, I would welcome it, too).
                  Kind regards,
                  Carlo
                  (Stata 19.0)


                  • #10
                    Dear Evelyn, thank you very much for the suggestion. I'll certainly check the book. Judging by the back cover, it mostly uses Stata (version 10?). I shall include this book in next month's purchases, I guess.

                    Dear Carlo, thank you again. You guessed exactly what I meant: something like "Survey Data Analysis Using Stata".

                    Well, you see, I provided the title - ; ) - and did some wishful thinking!

                    Kind regards,

                    Marcos

                    • #11
                      By the way, I have Stata 13 and have only been using Stata since Stata 12. Is there much difference in survey data analysis (in terms of commands, options, graphics and the like) between Stata 10 and Stata 13?

                      Thanks in advance for the information.

                      Best,

                      Marcos

                      • #12
                        Doing some homework in order to compare improvements and differences, I found two articles from a single source (UCLA), both on survey data analysis:

                        one using Stata 11

                        http://www.ats.ucla.edu/stat/stata/s...d_svy_stata11/

                        and the other using Stata 13

                        http://www.ats.ucla.edu/stat/stata/s...ed_svy_stata13

                        Best regards,

                        Marcos


                        • #13
                          Doing some homework in order to compare improvements and differences, I found two articles from a single source (UCLA), both on survey data analysis:

                          one using Stata 11

                          http://www.ats.ucla.edu/stat/stata/s...d_svy_stata11/

                          and the other using Stata 13

                          http://www.ats.ucla.edu/stat/stata/s...d_svy_stata13/

                          Best regards,

                          Marcos
