Declaring survey design

Loe Franssen

Join Date: Jul 2016
Posts: 6

18 Apr 2018, 04:47

Dear all,

I would like to declare my survey design to Stata (version 13) using svyset.

I have a sample of 495 firms that was collected in 3 industries across 3 countries. These firms were picked based on their Size and Exporting status.

The table below shows an overview of the sampling design for one country*industry combination. It also shows how I calculated the sampling weights and the finite population correction, the latter following the formula ((N-n)/(N-1))^(1/2).

Country	Sector	Size	Export	Universe (N)	Sample (n)	universe rep	sample rep	survey weight	FPC
A	1	Small	No	40	9	0.42	0.35	1.20	0.86
A	1	Small	Yes	40	4	0.42	0.15	2.71	0.86
A	1	Med	No	5	2	0.05	0.08	0.68	0.86
A	1	Med	Yes	2	2	0.02	0.08	0.27	0.86
A	1	Large	No	5	5	0.05	0.19	0.27	0.86
A	1	Large	Yes	4	4	0.04	0.15	0.27	0.86

I constructed weights to make my sample more representative of the universe of firms per country-sector. That’s also how I calculated the finite population correction (fpc): per country-sector.
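
Written out for the first row of the table (small, non-exporting firms in country A, sector 1), the calculation is sketched below. The symbols are notation introduced here for illustration: N_h and n_h are the universe and sample counts in a size-export cell, and N = 96 and n = 26 are the country-sector totals obtained by summing the two count columns of the table.

```latex
w_h = \frac{N_h / N}{n_h / n} = \frac{40/96}{9/26} \approx \frac{0.42}{0.35} \approx 1.20,
\qquad
\mathrm{FPC} = \sqrt{\frac{N - n}{N - 1}} = \sqrt{\frac{96 - 26}{96 - 1}} \approx 0.86
```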

I want to tell Stata that my data is structured as such. After reading about svyset, I believe that my Primary Sampling Unit (PSU), or my cluster, is country*sector, while I believe my strata to be Size*Export. Furthermore, from reading this page, I believe my data is best classified as a one-stage clustered design with stratification.

In Stata, I write:

Code:

egen cluster = group(country sector)
egen strata = group(size expdummy)
svyset cluster [pweight=surveyweight], vce(linearized) strata(strata) singleunit(missing) fpc(fpc)
svydes

Stratum    #Units     #Obs
--------  --------  --------
       1         9       258
       2         9        82
       3         9        43
       4         7        46
       5         5        25
       6         5        41
--------  --------  --------
       6        44       495

But I do not think this is in line with my sampling design. I believe #units refers to the number of PSUs, or clusters, of which I only have 9, not 44. However, svyset seems to want to assign each cluster to a unique stratum (i.e. each country*Sector to a group of, say Small exporting firms), but that’s not how my data is organized, since I selected per cluster/psu a number of firms based on their size and exporting status. In addition, if I type

Code:

svy: mean sales

Stata tells me "fpc for all observations within a stratum must be the same", r(461). However, in my case the fpc of course depends not only on the stratum but also on the cluster (country*sector ID) that we are looking at.

What am I doing wrong? Is my understanding of PSU / cluster and strata wrong? Should I add a second stage to svyset, should I set my strata and clusters differently (e.g. should I identify country and sector as strata as well?), or should I change my sampling design altogether, e.g. to a mere cluster sample or to a stratified random sampling? Currently, I am indeed leaning towards identifying country and sector as strata as well, and having no psu, but that does not seem to be in line with my conceptual understanding of PSU.

Any advice would be much appreciated, of course.

Many thanks in advance,
Loe

Steve Samuels

Join Date: Mar 2014

Posts: 1786
#2

19 Apr 2018, 19:53

As I don't fully understand your design, I'm not going to try for a full svyset.

"PSU" stands for "Primary Sampling Unit", meaning the first or highest-level unit that is selected by random sampling. In your study, that unit is the firm. As you surmised, your strata are combinations of country, sector, size, and export status: apparently a new sample of firms was taken in each combination.

Code:

egen stratum = group(country sector size expdummy)

With the new stratum variable, the fpc problem will go away, because sampling fractions will be identical within strata. Note that the fpc is appropriate only for estimating descriptive statistics. To do regression modeling, you'll have to issue a second svyset without the fpc option. (See, e.g. this post.)
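
A minimal sketch of the two declarations, on made-up data: the variable names popsize, y and x and all the values below are hypothetical and only for illustration, not taken from your survey. It assumes fpc() is given the stratum population count, which Stata accepts in place of a sampling rate.

```stata
* Toy data (hypothetical): two strata, three sampled firms in each
clear
input stratum firmid popsize y x
1 1 10 3 1
1 2 10 4 2
1 3 10 5 2
2 4  6 2 1
2 5  6 6 3
2 6  6 5 2
end

* Sampling weight = stratum population count / number sampled in the stratum
bysort stratum: gen sampwt = popsize / _N

* Declaration 1: with fpc, for descriptive statistics
svyset firmid [pweight = sampwt], strata(stratum) fpc(popsize)
svy: mean y

* Declaration 2: same design without fpc, for regression modeling
svyset firmid [pweight = sampwt], strata(stratum)
svy: regress y x
```

The second svyset replaces the first, so only the fpc() option differs between the two analyses.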

You mention, but do not describe, "observations" within firms. What are these, and how do they play into your design? Was there a second stage of sampling within firms?

To post-stratify to country-sector totals, the best way is to specify the sampling weight for each "observation", then add the poststrata() and postweight() options of svyset. In Stata 15, you can also post-stratify on multiple dimensions with rake and calibrate options in svyset. There are contributed programs that will do the same in earlier versions.

Last edited by Steve Samuels; 19 Apr 2018, 19:57.

Steve Samuels
Statistical Consulting

Stata 14.2

Loe Franssen

Join Date: Jul 2016
Posts: 6

20 Apr 2018, 08:44

Hello Steven,

Thank you very much for your response. I will implement your advice on the PSU, generation of strata and FPC.

Regarding the sampling design: The country-sector combinations were predetermined, so they were not sampled from a larger population. For (=within) those 9 country*sector combinations, firms were selected based on the strata Size and Export. I think this is what’s partly confusing (me), the fact that the “clusters” country*sector are not actually clusters in terms of cluster sampling. Perhaps it’s better to think of my data as stacking 9 separate datasets on top of each other, with Size and Export as strata(?)

In any case, I have added an extract of the database, using dataex. Hopefully, this makes things somewhat clearer.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(country sector size expdummy firmid) byte yearsoperation int(universe_csfe universe_cs)
1 1 1 0 511 4 33 94
1 1 1 0 510 4 33 94
1 1 1 0 500 4 33 94
1 1 1 0 517 4 33 94
1 1 1 0 503 3 33 94
1 1 1 0 508 3 33 94
1 1 1 0 514 3 33 94
1 1 1 0 519 4 33 94
1 1 1 0 521 5 33 94
1 1 1 0 494 4 33 94
1 1 1 0 509 4 33 94
1 1 2 0 520 5 23 94
1 1 1 0 516 2 33 94
1 1 1 0 523 . 33 94
1 1 1 0 496 2 33 94
1 1 1 0 518 1 33 94
1 1 1 0 512 3 33 94
1 1 1 0 498 4 33 94
1 1 3 0 495 5 12 94
1 1 1 0 504 3 33 94
1 1 1 0 502 3 33 94
1 1 2 0 490 5 23 94
1 1 1 0 515 3 33 94
1 1 1 0 506 2 33 94
1 1 2 1 485 5  5 94
1 1 1 0 501 3 33 94
1 1 2 1 480 5  5 94
1 1 3 1 483 5 11 94
1 1 2 0 484 5 23 94
1 1 3 1 493 4 11 94
1 1 3 1 482 2 11 94
1 1 2 1 477 5  5 94
1 1 1 0 497 4 33 94
1 1 1 1 513 3 10 94
1 1 3 1 479 . 11 94
1 1 3 1 492 2 11 94
1 1 2 1 488 2  5 94
1 1 3 1 475 4 11 94
1 1 1 1 505 4 10 94
1 1 3 1 507 2 11 94
1 1 2 1 481 5  5 94
1 1 1 1 491 2 10 94
1 1 1 1 489 3 10 94
1 1 3 1 476 4 11 94
1 1 3 1 486 4 11 94
1 1 3 1 487 4 11 94
1 1 3 1 478 2 11 94
1 1 1 1 499 2 10 94
1 1 1 1 522 4 10 94
1 3 . 0 473 4  .  .
end
label values country country
label def country 1 "Kenya", modify
label values sector sector
label def sector 1 "CTA", modify
label def sector 3 "Pulses", modify
label values size firmsampletype
label def firmsampletype 1 "Domestic SME", modify
label def firmsampletype 2 "Domestic Large", modify
label def firmsampletype 3 "Foreign owned", modify
label values yearsoperation years
label def years 1 "Less than 1 year", modify
label def years 2 "Between 1 and 5 years", modify
label def years 3 "Between 6 and 10 years", modify
label def years 4 "Between 11 and 20 years", modify
label def years 5 "More than 20 years", modify

From here, I construct sample totals and survey weights as follows:

Code:

* Generate sample totals
bys country sector size expdummy: gen sample_csfe = _N
bys country sector: gen sample_cs = _N
 
label var sample_cs "Number of sampled firms in this country-sector"
label var universe_cs "Total number of firms in this country-sector"
label var sample_csfe "Number of sampled firms in this country-sector-size-export category"
label var universe_csfe "Total number of firms within this country-sector-size-export category"
 
* Generate surveyweight
g surveyweight = (universe_csfe / universe_cs) / (sample_csfe/sample_cs)
 
* Generate fpc
g fpc = ((universe_cs - sample_cs ) / (universe_cs-1))^0.5
 
  
egen strata = group(country sector size expdummy),m

From your comments, it seems I now have the following 2 options to set my data as survey data:

Code:

*Option1: 
svyset firmid [pweight=surveyweight], strata(strata)

Or:

Code:

*Option2
svyset , poststrata(strata) postweight(universe_csfe)

In both cases, I am leaving fpc out of the equation because my main objective is to do regression analysis.

In choosing between the two options, I notice that when I type:

Code:

svy: mean years

I see that using poststrata indeed provides a correct estimate of the population size, whereas the first option does not (even when I include fpc). Would you therefore suggest I go with the latter option? Or is there something else to keep in mind?

Thank you so much again for your great help!
Loe

Steve Samuels

Join Date: Mar 2014
Posts: 1786

21 Apr 2018, 12:33

Thank you for presenting the data for one country with dataex. Unfortunately, this is a bad example, as there is a sector with only one observation (firm 473), and that observation is missing values for your universe variables and size. I fill in guessed values below. Strata with only one observation will prevent Stata from computing standard errors. To fix that, either use one of the singleunit() options or merge firm 473 into a neighboring stratum.

Your formulas for sampling weight and fpc were also incorrect. For the latter, read the manual. I was mistaken about the need to use poststrata options here; they are unnecessary, because 1) your sampling strata are subsets of country and sector and 2) your sampling weights take into account the country-sector population counts. However, omitting the sampling weight in your svyset led to wrong standard errors.

Code:

/* fixes for firm 473 */
replace size=0 if firmid==473
replace universe_csfe = 1 if firmid==473
replace universe_cs = 1    if firmid==473


egen strata = group(country sector size expdummy)
gen sampwt =  (universe_csfe/sample_csfe)

svyset firmid [pw = sampwt], strata(strata) ///
singleunit(centered) fpc(universe_csfe)

svydes
svy: tab sector, count
svy: mean yearsoperation

egen cs = group(country sector)

/* Your poststrata statement: */
svyset firmid , strata(strata) poststrata(cs) postweight(universe_cs) fpc(universe_csfe) singleunit(centered)
svy: mean yearsoperation  /* wrong standard error*/

with results

Code:

. svydes

Survey: Describing stage 1 sampling units

      pweight: sampwt
          VCE: linearized
  Single unit: centered
     Strata 1: strata
         SU 1: firmid
        FPC 1: universe_csfe

                                      #Obs per Unit
                              ----------------------------
Stratum    #Units     #Obs      min       mean      max   
--------  --------  --------  --------  --------  --------
       1        23        23         1       1.0         1
       2         6         6         1       1.0         1
       3         3         3         1       1.0         1
       4         5         5         1       1.0         1
       5         1*        1         1       1.0         1
       6        11        11         1       1.0         1
       7         1*        1         1       1.0         1
--------  --------  --------  --------  --------  --------
       7        50        50         1       1.0         1

. svy: tab sector, count
(running tabulate on estimation sample)

Number of strata   =         7                  Number of obs     =         50
Number of PSUs     =        50                  Population size   = 94.9999996
                                                Design df         =         43

----------------------
   sector |      count
----------+-----------
      CTA |         94
   Pulses |          1
          | 
    Total |         95
----------------------
  Key:  count     =  weighted count

. svy: mean yearsoperation
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =       7        Number of obs   =         48
Number of PSUs   =      48        Population size =  92.565217
                                  Design df       =         41

----------------------------------------------------------------
               |             Linearized
               |       Mean   Std. Err.     [95% Conf. Interval]
---------------+------------------------------------------------
yearsoperation |    3.96806   .1367889      3.691809    4.244311
----------------------------------------------------------------
Note: Strata with single sampling unit centered at overall mean.

. 
. egen cs = group(country sector)

. svyset firmid , strata(strata) poststrata(cs) postweight(universe_cs) fpc(un
> iverse_csfe) singleunit(centered)

      pweight: <none>
          VCE: linearized
   Poststrata: cs
   Postweight: universe_cs
  Single unit: centered
     Strata 1: strata
         SU 1: firmid
        FPC 1: universe_csfe

. svy: tab sector, count
(running tabulate on estimation sample)

Number of strata   =         7                  Number of obs     =         50
Number of PSUs     =        50                  Population size   =         95
N. of poststrata   =         2                  Design df         =         43

----------------------
   sector |      count
----------+-----------
      CTA |         94
   Pulses |          1
          | 
    Total |         95
----------------------
  Key:  count     =  count

. svy: mean yearsoperation  /* wrong standard error */
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =       7        Number of obs   =         48
Number of PSUs   =      48        Population size =         95
N. of poststrata =       2        Design df       =         41

----------------------------------------------------------------
               |             Linearized
               |       Mean   Std. Err.     [95% Conf. Interval]
---------------+------------------------------------------------
yearsoperation |   3.515789   .0716664      3.371056    3.660523
----------------------------------------------------------------
Note: Strata with single sampling unit centered at overall mean.
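
As a rough check on the output above, the corrected weight and the fpc factor can be written out. This is a sketch under stated assumptions: h indexes the seven strata, N_h stands for universe_csfe and n_h for sample_csfe (notation introduced here), and the factor (1 - n_h/N_h) is how Stata's documentation describes the correction applied to the stratum variance when fpc() holds population counts.

```latex
w_h = \frac{N_h}{n_h}, \qquad \text{e.g. } w_1 = \frac{33}{23} \approx 1.43

\sum_h n_h w_h = \sum_h N_h = 33 + 10 + 23 + 5 + 12 + 11 + 1 = 95

\text{fpc factor on the stratum-}h\text{ variance: } 1 - \frac{n_h}{N_h}
```

The total of 95 agrees with the population size reported by svy: tab (94.9999996, up to float rounding). The smaller population size in svy: mean (92.565217) reflects the two firms with missing yearsoperation, whose weights of 33/23 and 1 drop out.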

Last edited by Steve Samuels; 21 Apr 2018, 12:55. Reason: added poststrat result to show wrong standard error

Steve Samuels
Statistical Consulting

Stata 14.2

Loe Franssen

Join Date: Jul 2016

Posts: 6
#5

23 Apr 2018, 05:04

Thank you very much Steven. All is clear now!