  • Correct setup for "svyset" command in Stata 15

    Dear Statalisters:

    I am facing the following issue.

    I am using data from a multi-stage survey conducted by the Global Entrepreneurship Monitor (GEM), which describes the data in the following terms: “Our Adult Population Survey (APS) looks at the characteristics, motivations and ambitions of individuals starting businesses, as well as social attitudes towards entrepreneurship”.

    The GEM collects nationally representative samples of the adult population in a number of countries. To the best of my understanding, the countries are not randomly selected. Instead, the inclusion of certain countries is an administrative decision and/or the result of such past decisions. Within each country, further stratified and clustered sampling is done. In the final stage, a random sample of individuals is selected within country, within strata, within clusters.

    However, when GEM reports the data, no information regarding the specifics is provided. Instead, the GEM provides data on the individuals surveyed (a unique ID for the individual) in all countries surveyed (along with a country ID) and the final weights for each individual. The method for sample design weights is described here: http://gem-consortium.ns-client.xyz/wiki/1175

    My situation is the following. I want to estimate a random effect model (for country) using Stata’s “svy:” prefix for the data for any given year.

    The data would look like this:
    country_ID   year  respondent_ID  weight_a  var1      var2  var3
    Netherlands  2011  1              0.784929  Retired,  57    27
    Netherlands  2011  2              1.081878  Retired,  100   81
    Netherlands  2011  3              1.081878  Retired,  28    92
    Netherlands  2011  4              1.081878  Not work  37    6
    Belgium      2011  5              0.75417   Full: fu  73    58
    Belgium      2011  6              0.75417             76    72
    Belgium      2011  7              0.75417   Full: fu  92    14
    Belgium      2011  8              0.75417   Full: fu  22    92
    France       2011  9              0.939495  Full: fu  53    96
    France       2011  10             0.909229  Homemake  90    66
    France       2011  11             1.021805  Retired,  1     82
    France       2011  12             1.058208  Full: fu  13    19
    France       2011  13             0.815568  Retired,  59    83
    France       2011  14             1.001615  Retired,  20    60

    Specifically, I am not sure how to proceed with the “svyset” command.

    I have tried the following (where "R[country_ID]" is a latent variable that stands for the country “random effect”):
    Code:
    svyset respondent_ID [pweight= weight_a ], strata(country_ID)
    then:
    Code:
    svy: gsem (var3 <- var1 var2 R[country_ID], family(gaussian) link(identity))
    (Side note: I have my reasons, which are not directly related to the issue at hand, for wanting to use gsem)


    However, I get an error message:

    survey final weights not allowed with multilevel models; a final weight variable was svyset using the [pw=exp] syntax, but multilevel models require that each stage-level weight variable is svyset using the stage's corresponding weight() option
    I have scoured the web (and Statalist in particular) searching for a solution. A couple of people have posted a solution that seems reasonable. For example:
    Code:
    gen country_weights = 1
    
    svyset country_ID, weight(country_weights) || respondent_ID, weight(weight_a)
    This comes from:

    https://www.statalist.org/forums/for...-data-question

    https://www.statalist.org/forums/for...-in-stata-13-1

    I find it strange to set up “country_ID” as the Stage 1 PSU because, as explained above, no sampling occurred at this stage. Countries were selected a priori for other reasons. That is why I thought it made sense to treat them as “strata”. I can convince myself that the last svyset is correct by considering the meaning of “country_weights = 1”. If the weights are inversely proportional to the probability of being selected, then a weight of 1 implies a selection probability of 1, which is precisely the case when the countries were selected a priori.

    Can someone please shed some light as to which of these is the correct “svyset” command?
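
    To spell out the reasoning behind “country_weights = 1”, here is a small sketch in my own notation (the symbols are assumptions for illustration, not taken from the GEM documentation): \pi_c is the probability that country c is included, w_c is the stage 1 weight, and w_{i|c} is the weight of respondent i within country c. If every country is included with certainty, the stage 1 weight drops out and the final weight is just the within-country weight, which is the role weight_a would play in the second svyset.

    ```latex
    \[
      \pi_c = 1
      \;\Longrightarrow\;
      w_c = \frac{1}{\pi_c} = 1,
      \qquad
      w_{ic} = w_c \cdot w_{i \mid c} = w_{i \mid c}.
    \]
    ```

    This only shows that the two-stage declaration reproduces the supplied final weights; it does not settle whether treating countries as PSUs is appropriate for variance estimation.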

    Thank you in advance.
    Last edited by Arkangel Cordero; 03 Mar 2021, 13:22.

  • #2
    Did I break some cardinal rule in posting my question? Or is the question a dumb one? Any feedback either way will be appreciated.



    • #3
      Originally posted by Arkangel Cordero
      Did I break some cardinal rule in posting my question? Or is the question a dumb one? Any feedback either way will be appreciated.
      No cardinal rules were broken. It's probably more a combination of 1) people on Statalist have real lives and don't live solely to answer questions on the forum, so expecting a response immediately is not reasonable, and 2) this is a complex question, and it looks like it requires some specialist expertise in survey weights, which not everyone has.
      Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

      When presenting code or results, please use the code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.



      • #4
        Hi Weiwen:

        Thank you for your response and advice. I was just unsure as to the merits of my question. I apologize for not using dataex before; I am following your advice now.

        To recap:

        I am using data from a multi-stage survey conducted by the Global Entrepreneurship Monitor (GEM), which describes the data in the following terms: “Our Adult Population Survey (APS) looks at the characteristics, motivations and ambitions of individuals starting businesses, as well as social attitudes towards entrepreneurship”.

        The GEM collects nationally representative samples of the adult population in a number of countries. To the best of my understanding, the countries are not randomly selected. Instead, the inclusion of certain countries is an administrative decision and/or the result of such past decisions. Within each country, further stratified and clustered sampling is done. In the final stage, a random sample of individuals is selected within country, within strata, within clusters.

        However, when GEM reports the data, no information regarding the specifics is provided. Instead, the GEM provides data on the individuals surveyed (a unique ID for the individual) in all countries surveyed (along with a country ID) and the final weights for each individual. The method for sample design weights is described here: http://gem-consortium.ns-client.xyz/wiki/1175

        Here is a snippet of the data using dataex:

        Code:
        * dataex  teayyopp age gender setid country weight_a
        * Example using dataex
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input byte(teayyopp age gender) double setid int country double weight_a
        0 28 2 1162002037 1 1.2240842389020523
        0 66 2 1161041973 1  .6321595854033221
        0 42 1 1161103240 1  1.080803148373842
        0 80 1 1161043779 1  .6908092312309552
        0 64 2 1161045474 1  1.172785127969912
        0  . 2 1161043966 1  .6321595854033221
        0 64 1 1161044176 1 1.0385648684953017
        1 60 1 1161032507 1  .8179595225715345
        0 55 1 1161044313 1 1.0376199539857618
        0 50 2 1161111980 1  1.164631073417762
        0 40 2 1161044694 1 1.3773787917732125
        0 45 2 1161045347 1  .9325321525170057
        0 68 1 1161045398 1  .6908092312309552
        0 54 1 1161045428 1 1.0513856127205818
        0 59 1 1161046163 1 1.2221943058467821
        0 23 2 1161046165 1 1.3338850805978524
        0 47 2 1161075912 1  1.126789150273802
        0 33 1 1162047563 1  1.177621273027142
        0 64 2 1161047087 1  .9780854104256947
        0 69 2 1161047206 1  .8373483440740934
        0 33 1 1162003106 1 1.2627191940190121
        0 44 2 1161047902 1 1.3274417959133125
        0 55 2 1161011584 1  .9780854104256947
        0 45 1 1161074080 1 1.1993917351200423
        1 19 2 1162061077 1 1.0591103261894117
        0 46 2 1161007963 1  .7507734412762663
        0 56 2 1161049355 1 1.2164679851085323
        0 74 2 1161049563 1  .6411883728475521
        0 34 1 1162014649 1 1.5130231745531926
        0 38 1 1161049913 1  .9704133617121607
        1 66 1 1161050287 1  .7368901382541844
        0 62 1 1161031297 1  .8021522085857914
        0 20 2 1162019737 1 1.0591103261894117
        0 54 2 1161050965 1 1.0103009612501517
        1 30 2 1162046415 1 1.0351781687685817
        1 69 1 1161043267 1  .7368901382541844
        0 26 1 1162025810 1 1.0254740912210918
        0 64 1 1161007878 1  .8505421842791906
        0 66 2 1161052937 1  .6321595854033221
        0 35 1 1161053225 1  .9704133617121607
        0 56 2 1161053257 1  .6983096677427892
        0 66 2 1161053606 1  .6879737359910212
        0 50 2 1161053863 1 1.0103009612501517
        0 43 2 1161036355 1  .8546381606247415
        0 58 1 1161054222 1 1.0385648684953017
        0 54 2 1161107827 1  1.125240159605462
        0 64 2 1161054893 1  .8440575312073275
        0 21 2 1161055172 1 1.2049563148278222
        0 40 1 1161055485 1  .7529948201505994
        0  . 2 1161038839 1 1.0103009612501517
        0 35 1 1161008917 1  1.070918334411092
        1 34 1 1162000350 1  1.168272772334822
        0 53 2 1161062318 1  1.125240159605462
        0 62 2 1161055932 1 1.2164679851085323
        0 53 2 1161056046 1  .7507734412762663
        0 50 2 1161068950 1 1.2059634655517522
        0 77 1 1161056461 1  .7486202206286103
        0 60 2 1161056809 1  .6983096677427892
        0 19 2 1161081609 1 1.2484862303698723
        0 52 1 1161100186 1  1.132632779992362
        0 32 2 1162007910 1  .8057594850798915
        0 22 1 1162031781 1 1.0268404506051219
        0 70 1 1161059260 1  .7368901382541844
        0 37 1 1162006184 1 1.3166472330499424
        0 75 2 1161095732 1  .6879737359910212
        0 66 2 1161046735 1   .587013330572969
        0 23 1 1162006125 1 1.2571268568986322
        0 78 2 1161060771 1   .587013330572969
        1 20 2 1162059678 1 1.0245126213845517
        0 47 2 1161120951 1  .7507734412762663
        0 59 1 1161024647 1  .8021522085857914
        0 62 1 1161046306 1  .8021522085857914
        0 75 1 1161059886 1  .6500096082222752
        0 27 1 1162036637 1  .8349585799663535
        0 24 1 1162006709 1  .8445792690361905
        0 54 1 1161055922 1  1.145200237820082
        0 59 1 1161063746 1 1.0385648684953017
        0 47 1 1161063975 1  .7022708889159323
        0 59 2 1161064264 1  .8802975053755105
        0 33 1 1162012339 1 1.2627191940190121
        0 48 2 1162006059 1  1.125240159605462
        1 44 1 1162013863 1  .9704133617121607
        0 64 1 1161065419 1 1.0385648684953017
        0 47 2 1161065543 1  .9325321525170057
        0 62 1 1161019877 1  .9697046350404888
        1  . 2 1161094703 1 1.2059634655517522
        0 56 1 1161067427 1 1.0376199539857618
        0 55 1 1161068033 1  .8821668227450665
        0 64 2 1161026034 1  .8440575312073275
        0 39 2 1162014577 1  1.138006958241562
        1 36 1 1162005370 1  .9704133617121607
        0 46 2 1161054121 1 1.0103009612501517
        0 52 2 1161069997 1  .9325321525170057
        0 63 2 1161070648 1  1.172785127969912
        0 20 1 1162060317 1 1.0412415591977418
        0 27 1 1162013078 1  1.168272772334822
        0 53 2 1161071079 1  1.164631073417762
        0  . 2 1162057781 1 1.0103009612501517
        0 62 1 1161102167 1  .8179595225715345
        1 49 1 1161043409 1 1.0513856127205818
        end
        label values teayyopp LABG
        label def LABG 0 "No", modify
        label def LABG 1 "Yes", modify
        label values age AGE
        label values gender GENDER
        label def GENDER 1 "Male", modify
        label def GENDER 2 "Female", modify
        label values country COUNTRY
        label def COUNTRY 1 "United States", modify
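
        With the example above loaded, a quick look at the respondent ID and the final weights might be sketched as follows. This is only an illustration using the variables in the dataex listing; it assumes nothing about how GEM actually constructed weight_a.

        ```stata
        * Sketch only: basic checks on the ID and the final weights
        duplicates report setid      // is setid unique per respondent?

        * n, mean, sum, min, and max of the final weight within each country
        tabstat weight_a, by(country) statistics(n mean sum min max)
        ```

        If the mean weight within a country is close to 1, the weights are scaled to the sample size rather than to the population size, which would affect how "Population size" in svy output should be read.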


        I have used svyset in 2 different ways:
        #1)
        Code:
        svyset setid [pweight= weight_a ], strata(country)
        svy: gsem (teayyopp <- age gender R[country] , family(bernoulli) link(probit))
        This leads to the error message below:
        Code:
        (running gsem on estimation sample)
        survey final weights not allowed with multilevel models;
            a final weight variable was svyset using the [pw=exp] syntax, but multilevel models require that each stage-level weight variable is svyset using the stage's corresponding weight() option
        an error occurred when svy executed gsem
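
        As far as I can tell from the message, the complaint is about combining final weights with a multilevel model, not about the declaration itself. A minimal sketch of that contrast, assuming a single-level probit with the R[country] term dropped (this is a different model from the one I want, shown only to isolate the source of the error):

        ```stata
        * Sketch: same declaration as in #1, but a single-level estimator
        svyset setid [pweight = weight_a], strata(country)

        * No latent R[country] term, so final weights should be acceptable here
        svy: probit teayyopp age gender
        ```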
        #2) I also tried:
        Code:
        gen x = 1
        svyset country, weight(x) || setid, weight(weight_a)
        svy: gsem (teayyopp <- age gender R[country] , family(bernoulli) link(probit))
        This seems to work:


        Code:
        (running gsem on estimation sample)
        
        Survey: Generalized structural equation model
        
        Number of strata   =         1                  Number of obs     =    192,528
        Number of PSUs     =        65                  Population size   = 192,421.69
                                                        Design df         =         64
        Response       : teayyopp
        Family         : Bernoulli
        Link           : probit
        
         ( 1)  [teayyopp]R[country] = 1
        ---------------------------------------------------------------------------------
                        |             Linearized
                        |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
        ----------------+----------------------------------------------------------------
        teayyopp        |
                    age |  -.0097973   .0007836   -12.50   0.000    -.0113627    -.008232
                 gender |   -.235161   .0238639    -9.85   0.000    -.2828346   -.1874874
                        |
             R[country] |          1  (constrained)
                        |
                  _cons |  -.6977571   .0668339   -10.44   0.000    -.8312732    -.564241
        ----------------+----------------------------------------------------------------
         var(R[country])|   .0781716   .0139937                      .0546685    .1117792
        ---------------------------------------------------------------------------------
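
        One thing the header above makes visible is that the 65 countries are being counted as PSUs within a single stratum, so the design degrees of freedom are 65 - 1 = 64. A short sketch for inspecting this, assuming it is run immediately after the svy: gsem call in #2 (e(N_psu), e(N_strata), and e(df_r) are results that svy estimation commands store):

        ```stata
        * Sketch: run right after the svy: gsem call in #2
        display "PSUs minus strata = " (e(N_psu)-e(N_strata))
        display "design df stored  = " e(df_r)

        * Describe the declared design: strata, PSUs, and observations per stage
        svydescribe
        ```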

        However, I don't know whether the svyset command in the last instance is correct, given that countries are not randomly selected, so country is a cluster rather than a sampled PSU. In other words, which countries are surveyed is not determined by random sampling but decided ahead of time on administrative grounds. So my question is whether the svyset command below is correct:

        Code:
        gen x = 1
        svyset country, weight(x) || setid, weight(weight_a)
        Last edited by Arkangel Cordero; 03 Mar 2021, 19:41.
