Correct setup for "svyset" command in Stata 15

Arkangel Cordero

Join Date: Apr 2020
Posts: 32

03 Mar 2021, 13:17

Dear Statalisters:

I am facing the following issue.

I am using data from a multi-stage survey conducted by the Global Entrepreneurship Monitor (GEM), which describes the data in the following terms: “Our Adult Population Survey (APS) looks at the characteristics, motivations and ambitions of individuals starting businesses, as well as social attitudes towards entrepreneurship”.

The GEM collects nationally representative samples of the adult population in a number of countries. To the best of my understanding, the countries are not randomly selected. Instead, the inclusion of certain countries is an administrative decision and/or the result of such past decisions. Within each country, further stratified and clustered sampling is done. In the final stage, a random sample of individuals within country, within strata, within clusters is selected.

However, when GEM reports the data, no information about the specifics of the sampling design (the strata and clusters within each country) is provided. Instead, the GEM provides data on the individuals surveyed (a unique ID for the individual) in all countries surveyed (along with a country ID) and the final weights for each individual. The method for sample design weights is described here: http://gem-consortium.ns-client.xyz/wiki/1175

My situation is the following. I want to estimate a model with a country-level random effect using Stata’s “svy:” prefix on the data for any given year.

The data would look like this:

country_ID	year	respondent_ID	weight_a	var1	var2	var3
Netherlands	2011	1	0.784929	Retired,	57	27
Netherlands	2011	2	1.081878	Retired,	100	81
Netherlands	2011	3	1.081878	Retired,	28	92
Netherlands	2011	4	1.081878	Not work	37	6
Belgium	2011	5	0.75417	Full: fu	73	58
Belgium	2011	6	0.75417		76	72
Belgium	2011	7	0.75417	Full: fu	92	14
Belgium	2011	8	0.75417	Full: fu	22	92
France	2011	9	0.939495	Full: fu	53	96
France	2011	10	0.909229	Homemake	90	66
France	2011	11	1.021805	Retired,	1	82
France	2011	12	1.058208	Full: fu	13	19
France	2011	13	0.815568	Retired,	59	83
France	2011	14	1.001615	Retired,	20	60

Specifically, I am not sure how to proceed with the “svyset” command.

I have tried the following (where "R[country_ID]" is a latent variable that stands for the country “random effect”):

Code:

svyset respondent_ID [pweight= weight_a ], strata(country_ID)

then:

Code:

svy: gsem (var2 <- var1 var3 R[country_ID], family(gaussian) link(identity))

(Side note: I have my reasons, which are not directly related to the issue at hand, for wanting to use gsem)

However, I get an error message:

“survey final weights not allowed with multilevel models; a final weight variable was svyset using the [pw=exp] syntax, but multilevel models require that each stage-level weight variable is svyset using the stage's corresponding weight() option”

I have scoured the web (and Statalist in particular) searching for a solution. A couple of people have posted a solution that seems reasonable. For example:

Code:

gen country_weights = 1

svyset country_ID, weight(country_weights) || respondent_ID, weight(weight_a)

This comes from:

https://www.statalist.org/forums/for...-data-question

https://www.statalist.org/forums/for...-in-stata-13-1

I find it strange to set up “country_ID” as the stage 1 PSU because, as explained above, no sampling occurred at this stage: countries were selected a priori for other reasons. That is why I thought it made sense to treat them as “strata”. I can convince myself that the last svyset is correct by considering the meaning of “country_weights = 1”. If the weights are inversely proportional to the probability of being selected, then a weight of 1 implies a selection probability of 1, which is precisely the case when the countries were selected a priori. Can someone please shed some light on which of these is the correct “svyset” command?
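
To spell out the arithmetic behind that reasoning, here is a minimal sketch. It assumes that the overall weight is the product of the stage-level weights and that weight_a can stand in for the within-country weight. The three rows are copied from the listing above, and implied_final is a hypothetical name used only for this check. It illustrates the weight-of-1 argument only and does not confirm that this design matches how GEM actually sampled.

Code:

* Toy data: three respondents taken from the listing above
clear
input str12 country_ID respondent_ID double weight_a
"Netherlands" 1 0.784929
"Belgium"     5 0.75417
"France"      9 0.939495
end
* Stage 1: countries were fixed a priori, so assume selection probability 1
gen double country_weights = 1
* Implied overall weight = stage 1 weight x stage 2 weight (assumption)
gen double implied_final = country_weights * weight_a
* With a stage 1 weight of 1, the implied overall weight equals weight_a
assert implied_final == weight_a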

Thank you in advance.

Last edited by Arkangel Cordero; 03 Mar 2021, 13:22.

Arkangel Cordero

Join Date: Apr 2020

Posts: 32
#2

03 Mar 2021, 17:04

Did I break some cardinal rule in posting my question? Or is the question a dumb one? Any feedback either way will be appreciated.

Weiwen Ng

Join Date: Jun 2015

Posts: 1241
#3

03 Mar 2021, 17:29

Originally posted by Arkangel Cordero:

Did I break some cardinal rule in posting my question? Or is the question a dumb one? Any feedback either way will be appreciated.

No cardinal rules were broken. It's probably more a combination of 1) people on Statalist have real lives and don't live solely to answer questions on the forum, so expecting a response immediately is not reasonable, and 2) this is a complex question, and it looks like it requires some specialist expertise in survey weights, which not everyone has.

Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.
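
For instance, a call along these lines would produce a pasteable example. This is only a sketch: the variable names are taken from the listing in the first post and stand in for whichever variables matter to the question, and the installation step is only needed on Stata versions that do not already ship dataex.

Code:

* Install dataex from SSC only if Stata cannot already find it
capture which dataex
if _rc ssc install dataex
* Write the first 20 observations of the chosen variables as -input- code
dataex country_ID respondent_ID weight_a var1 var2 var3 in 1/20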

When presenting code or results, please use the code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.

Arkangel Cordero

Join Date: Apr 2020
Posts: 32

03 Mar 2021, 19:35

Hi Weiwen:

Thank you for your response and advice. I was just unsure as to the merits of my question. I apologize for not using dataex before; I am following your advice now.

To recap:

I am using data from a multi-stage survey conducted by the Global Entrepreneurship Monitor (GEM), which describes the data in the following terms: “Our Adult Population Survey (APS) looks at the characteristics, motivations and ambitions of individuals starting businesses, as well as social attitudes towards entrepreneurship”.

The GEM collects nationally representative samples of the adult population in a number of countries. To the best of my understanding, the countries are not randomly selected. Instead, the inclusion of certain countries is an administrative decision and/or the result of such past decisions. Within each country, further stratified and clustered sampling is done. In the final stage, a random sample of individuals within country, within strata, within clusters is selected.

However, when GEM reports the data, no information about the specifics of the sampling design (the strata and clusters within each country) is provided. Instead, the GEM provides data on the individuals surveyed (a unique ID for the individual) in all countries surveyed (along with a country ID) and the final weights for each individual. The method for sample design weights is described here: http://gem-consortium.ns-client.xyz/wiki/1175

Here is a snippet of the data using dataex:

Code:

* dataex  teayyopp age gender setid country weight_a
* Example using dataex
* Example generated by -dataex-. To install: ssc install dataex
clear
input byte(teayyopp age gender) double setid int country double weight_a
0 28 2 1162002037 1 1.2240842389020523
0 66 2 1161041973 1  .6321595854033221
0 42 1 1161103240 1  1.080803148373842
0 80 1 1161043779 1  .6908092312309552
0 64 2 1161045474 1  1.172785127969912
0  . 2 1161043966 1  .6321595854033221
0 64 1 1161044176 1 1.0385648684953017
1 60 1 1161032507 1  .8179595225715345
0 55 1 1161044313 1 1.0376199539857618
0 50 2 1161111980 1  1.164631073417762
0 40 2 1161044694 1 1.3773787917732125
0 45 2 1161045347 1  .9325321525170057
0 68 1 1161045398 1  .6908092312309552
0 54 1 1161045428 1 1.0513856127205818
0 59 1 1161046163 1 1.2221943058467821
0 23 2 1161046165 1 1.3338850805978524
0 47 2 1161075912 1  1.126789150273802
0 33 1 1162047563 1  1.177621273027142
0 64 2 1161047087 1  .9780854104256947
0 69 2 1161047206 1  .8373483440740934
0 33 1 1162003106 1 1.2627191940190121
0 44 2 1161047902 1 1.3274417959133125
0 55 2 1161011584 1  .9780854104256947
0 45 1 1161074080 1 1.1993917351200423
1 19 2 1162061077 1 1.0591103261894117
0 46 2 1161007963 1  .7507734412762663
0 56 2 1161049355 1 1.2164679851085323
0 74 2 1161049563 1  .6411883728475521
0 34 1 1162014649 1 1.5130231745531926
0 38 1 1161049913 1  .9704133617121607
1 66 1 1161050287 1  .7368901382541844
0 62 1 1161031297 1  .8021522085857914
0 20 2 1162019737 1 1.0591103261894117
0 54 2 1161050965 1 1.0103009612501517
1 30 2 1162046415 1 1.0351781687685817
1 69 1 1161043267 1  .7368901382541844
0 26 1 1162025810 1 1.0254740912210918
0 64 1 1161007878 1  .8505421842791906
0 66 2 1161052937 1  .6321595854033221
0 35 1 1161053225 1  .9704133617121607
0 56 2 1161053257 1  .6983096677427892
0 66 2 1161053606 1  .6879737359910212
0 50 2 1161053863 1 1.0103009612501517
0 43 2 1161036355 1  .8546381606247415
0 58 1 1161054222 1 1.0385648684953017
0 54 2 1161107827 1  1.125240159605462
0 64 2 1161054893 1  .8440575312073275
0 21 2 1161055172 1 1.2049563148278222
0 40 1 1161055485 1  .7529948201505994
0  . 2 1161038839 1 1.0103009612501517
0 35 1 1161008917 1  1.070918334411092
1 34 1 1162000350 1  1.168272772334822
0 53 2 1161062318 1  1.125240159605462
0 62 2 1161055932 1 1.2164679851085323
0 53 2 1161056046 1  .7507734412762663
0 50 2 1161068950 1 1.2059634655517522
0 77 1 1161056461 1  .7486202206286103
0 60 2 1161056809 1  .6983096677427892
0 19 2 1161081609 1 1.2484862303698723
0 52 1 1161100186 1  1.132632779992362
0 32 2 1162007910 1  .8057594850798915
0 22 1 1162031781 1 1.0268404506051219
0 70 1 1161059260 1  .7368901382541844
0 37 1 1162006184 1 1.3166472330499424
0 75 2 1161095732 1  .6879737359910212
0 66 2 1161046735 1   .587013330572969
0 23 1 1162006125 1 1.2571268568986322
0 78 2 1161060771 1   .587013330572969
1 20 2 1162059678 1 1.0245126213845517
0 47 2 1161120951 1  .7507734412762663
0 59 1 1161024647 1  .8021522085857914
0 62 1 1161046306 1  .8021522085857914
0 75 1 1161059886 1  .6500096082222752
0 27 1 1162036637 1  .8349585799663535
0 24 1 1162006709 1  .8445792690361905
0 54 1 1161055922 1  1.145200237820082
0 59 1 1161063746 1 1.0385648684953017
0 47 1 1161063975 1  .7022708889159323
0 59 2 1161064264 1  .8802975053755105
0 33 1 1162012339 1 1.2627191940190121
0 48 2 1162006059 1  1.125240159605462
1 44 1 1162013863 1  .9704133617121607
0 64 1 1161065419 1 1.0385648684953017
0 47 2 1161065543 1  .9325321525170057
0 62 1 1161019877 1  .9697046350404888
1  . 2 1161094703 1 1.2059634655517522
0 56 1 1161067427 1 1.0376199539857618
0 55 1 1161068033 1  .8821668227450665
0 64 2 1161026034 1  .8440575312073275
0 39 2 1162014577 1  1.138006958241562
1 36 1 1162005370 1  .9704133617121607
0 46 2 1161054121 1 1.0103009612501517
0 52 2 1161069997 1  .9325321525170057
0 63 2 1161070648 1  1.172785127969912
0 20 1 1162060317 1 1.0412415591977418
0 27 1 1162013078 1  1.168272772334822
0 53 2 1161071079 1  1.164631073417762
0  . 2 1162057781 1 1.0103009612501517
0 62 1 1161102167 1  .8179595225715345
1 49 1 1161043409 1 1.0513856127205818
end
label values teayyopp LABG
label def LABG 0 "No", modify
label def LABG 1 "Yes", modify
label values age AGE
label values gender GENDER
label def GENDER 1 "Male", modify
label def GENDER 2 "Female", modify
label values country COUNTRY
label def COUNTRY 1 "United States", modify

I have used svyset in 2 different ways:
#1)

Code:

svyset setid [pweight= weight_a ], strata(country)
svy: gsem (teayyopp <- age gender R[country] , family(bernoulli) link(probit))

This leads to the error message below:

Code:

(running gsem on estimation sample)
survey final weights not allowed with multilevel models;
    a final weight variable was svyset using the [pw=exp] syntax, but multilevel models require that each stage-level weight variable is svyset using the stage's corresponding weight() option
an error occurred when svy executed gsem

#2) I also tried:

Code:

gen x = 1
svyset country, weight(x) || setid, weight(weight_a)
svy: gsem (teayyopp <- age gender R[country] , family(bernoulli) link(probit))

This seems to work:

Code:

(running gsem on estimation sample)

Survey: Generalized structural equation model

Number of strata   =         1                  Number of obs     =    192,528
Number of PSUs     =        65                  Population size   = 192,421.69
                                                Design df         =         64
Response       : teayyopp
Family         : Bernoulli
Link           : probit

 ( 1)  [teayyopp]R[country] = 1
---------------------------------------------------------------------------------
                |             Linearized
                |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
----------------+----------------------------------------------------------------
teayyopp        |
            age |  -.0097973   .0007836   -12.50   0.000    -.0113627    -.008232
         gender |   -.235161   .0238639    -9.85   0.000    -.2828346   -.1874874
                |
     R[country] |          1  (constrained)
                |
          _cons |  -.6977571   .0668339   -10.44   0.000    -.8312732    -.564241
----------------+----------------------------------------------------------------
 var(R[country])|   .0781716   .0139937                      .0546685    .1117792
---------------------------------------------------------------------------------

However, I don't know whether the svyset command in the last instance is correct, given that country is not a randomly selected PSU but rather a cluster fixed ahead of time: which countries are surveyed is determined by administrative issues, not by random sampling. So my question is whether the svyset command below is correct:

Code:

gen x = 1
svyset country, weight(x) || setid, weight(weight_a)
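
One way to inspect what this declaration implies, without settling whether it is the right design, is to display the settings and compare weight totals by country. A minimal sketch, assuming the data are in memory and the two commands just above have been run. In the gsem output above, the reported population size (192,421.69) is close to the number of observations (192,528), which suggests the weights average close to 1; the tabulation below shows whether the same holds country by country. How to interpret that is left open.

Code:

* Show the survey settings currently declared
svyset
* Summarize strata and sampling units under those settings
svydescribe
* Compare the number of respondents with the sum of weights in each country
tabstat weight_a, by(country) statistics(n mean sum)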

Last edited by Arkangel Cordero; 03 Mar 2021, 19:41.
