
  • Declaring survey design

    Dear all,

    I would like to declare my survey design to Stata (version 13) using svyset.

    I have a sample of 495 firms that was collected in 3 industries across 3 countries. These firms were picked based on their Size and Exporting status.

    The table below gives an overview of the sampling design for one country*industry combination. It also shows how I calculated the sampling weights and the finite population correction (FPC), the latter following the formula ((N-n)/(N-1))^(1/2).

    Country  Sector  Size   Export  Universe (N)  Sample (n)  Universe rep.  Sample rep.  Survey weight  FPC
    A        1       Small  No                40           9           0.42         0.35           1.20  0.86
    A        1       Small  Yes               40           4           0.42         0.15           2.71  0.86
    A        1       Med    No                 5           2           0.05         0.08           0.68  0.86
    A        1       Med    Yes                2           2           0.02         0.08           0.27  0.86
    A        1       Large  No                 5           5           0.05         0.19           0.27  0.86
    A        1       Large  Yes                4           4           0.04         0.15           0.27  0.86

    I constructed weights to make my sample more representative of the universe of firms per country-sector. That’s also how I calculated the finite population correction (fpc): per country-sector.
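
    As a minimal sketch of the arithmetic behind the first row of the table, assuming the country-sector totals are the column sums above (N = 96 firms in the universe, n = 26 in the sample):

    Code:
    * Survey weight = universe share / sample share for the Small, non-exporting cell
    display (40/96) / (9/26)                // about 1.20
    * FPC for the whole country-sector, sqrt((N-n)/(N-1))
    display sqrt((96 - 26) / (96 - 1))      // about 0.86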

    I want to tell Stata that my data is structured as such. After reading about svyset, I believe that my Primary Sampling Unit (PSU), or my cluster, is country*sector, while I believe my strata to be Size*Export. Furthermore, from reading this page, I believe my data is best classified as a one-stage clustered design with stratification.

    In Stata, I write:
    Code:
    egen cluster = group(country sector)
    egen strata = group(size expdummy)
    svyset cluster [pweight=surveyweight], vce(linearized) strata(strata) singleunit(missing) fpc(fpc)
    svydes
    Stratum    #Units     #Obs
    --------  --------  --------
           1         9       258
           2         9        82
           3         9        43
           4         7        46
           5         5        25
           6         5        41
    --------  --------  --------
           6        44       495


    But I do not think this is in line with my sampling design. I believe #units refers to the number of PSUs, or clusters, of which I only have 9, not 44. However, svyset seems to want to assign each cluster to a unique stratum (i.e. each country*Sector to a group of, say Small exporting firms), but that’s not how my data is organized, since I selected per cluster/psu a number of firms based on their size and exporting status. In addition, if I type

    Code:
    svy: mean sales
    Stata tells me "fpc for all observations within a stratum must be the same", r(461). However, in my case the fpc of course depends not only on the stratum but also on the cluster (country*sector ID) that we are looking at.
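
    One way to see where the error comes from is to check whether fpc is constant within each stratum. This is only a sketch, assuming the strata and fpc variables created above; fpc_varies is a new, hypothetical helper variable:

    Code:
    * Sort fpc within each stratum and compare its smallest and largest value
    bysort strata (fpc): generate byte fpc_varies = (fpc[1] != fpc[_N])
    * Strata with fpc_varies == 1 are the ones that trigger the error
    tabulate strata fpc_varies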



    What am I doing wrong? Is my understanding of PSU / cluster and strata wrong? Should I add a second stage to svyset, should I set my strata and clusters differently (e.g. should I identify country and sector as strata as well?), or should I change my sampling design altogether, e.g. to a mere cluster sample or to a stratified random sampling? Currently, I am indeed leaning towards identifying country and sector as strata as well, and having no psu, but that does not seem to be in line with my conceptual understanding of PSU.

    Any advice would be much appreciated, of course.

    Many thanks in advance,
    Loe




  • #2
    As I don't fully understand your design, I'm not going to try for a full svyset.

    "PSU" stands for "Primary Sampling Unit", meaning the first or highest-level unit that is selected by random sampling. In your study, that unit is the firm. As you surmised, your strata are combinations of country, sector, size, and export status: apparently a new sample of firms was taken in each combination.
    Code:
    egen stratum = group(country sector size expdummy)
    With the new stratum variable, the fpc problem will go away, because sampling fractions will be identical within strata. Note that the fpc is appropriate only for estimating descriptive statistics. To do regression modeling, you'll have to issue a second svyset without the fpc option. (See, e.g. this post.)
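
    A minimal sketch of the two declarations, under stated assumptions: N_stratum is a hypothetical variable holding the number of firms in the universe for each stratum, surveyweight and sales are the variables from the first post, and the regression specification is purely illustrative:

    Code:
    * Descriptive statistics: declare the design with the fpc
    svyset firmid [pweight = surveyweight], strata(stratum) fpc(N_stratum)
    svy: mean sales

    * Regression modeling: declare the design again, this time without the fpc
    svyset firmid [pweight = surveyweight], strata(stratum)
    svy: regress sales i.size i.expdummy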

    You mention, but do not describe, "observations" within firms. What are these, and how do they play into your design? Was there a second stage of sampling within firms?

    To post-stratify to country-sector totals, the best way is to specify the sampling weight for each "observation", then add the poststrata() and postweight() options of svyset. In Stata 15, you can also post-stratify on multiple dimensions with rake and calibrate options in svyset. There are contributed programs that will do the same in earlier versions.
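
    A rough sketch of such a call, with hypothetical variable names: sampwt for the sampling weight, cs for the country-sector group, and universe_cs for the number of firms in each country-sector:

    Code:
    * Country-sector groups serve as the poststrata
    egen cs = group(country sector)
    * Sampling weight plus poststratification to the country-sector totals
    svyset firmid [pweight = sampwt], strata(stratum) poststrata(cs) postweight(universe_cs)
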
    Last edited by Steve Samuels; 19 Apr 2018, 19:57.
    Steve Samuels
    Statistical Consulting

    Stata 14.2



    • #3
      Hello Steven,

      Thank you very much for your response. I will implement your advice on the PSU, generation of strata and FPC.

      Regarding the sampling design: The country-sector combinations were predetermined, so they were not sampled from a larger population. For (=within) those 9 country*sector combinations, firms were selected based on the strata Size and Export. I think this is what’s partly confusing (me), the fact that the “clusters” country*sector are not actually clusters in terms of cluster sampling. Perhaps it’s better to think of my data as stacking 9 separate datasets on top of each other, with Size and Export as strata(?)

      In any case, I have added an extract of the database, using dataex. Hopefully, this makes things somewhat clearer.


      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input float(country sector size expdummy firmid) byte yearsoperation int(universe_csfe universe_cs)
      1 1 1 0 511 4 33 94
      1 1 1 0 510 4 33 94
      1 1 1 0 500 4 33 94
      1 1 1 0 517 4 33 94
      1 1 1 0 503 3 33 94
      1 1 1 0 508 3 33 94
      1 1 1 0 514 3 33 94
      1 1 1 0 519 4 33 94
      1 1 1 0 521 5 33 94
      1 1 1 0 494 4 33 94
      1 1 1 0 509 4 33 94
      1 1 2 0 520 5 23 94
      1 1 1 0 516 2 33 94
      1 1 1 0 523 . 33 94
      1 1 1 0 496 2 33 94
      1 1 1 0 518 1 33 94
      1 1 1 0 512 3 33 94
      1 1 1 0 498 4 33 94
      1 1 3 0 495 5 12 94
      1 1 1 0 504 3 33 94
      1 1 1 0 502 3 33 94
      1 1 2 0 490 5 23 94
      1 1 1 0 515 3 33 94
      1 1 1 0 506 2 33 94
      1 1 2 1 485 5  5 94
      1 1 1 0 501 3 33 94
      1 1 2 1 480 5  5 94
      1 1 3 1 483 5 11 94
      1 1 2 0 484 5 23 94
      1 1 3 1 493 4 11 94
      1 1 3 1 482 2 11 94
      1 1 2 1 477 5  5 94
      1 1 1 0 497 4 33 94
      1 1 1 1 513 3 10 94
      1 1 3 1 479 . 11 94
      1 1 3 1 492 2 11 94
      1 1 2 1 488 2  5 94
      1 1 3 1 475 4 11 94
      1 1 1 1 505 4 10 94
      1 1 3 1 507 2 11 94
      1 1 2 1 481 5  5 94
      1 1 1 1 491 2 10 94
      1 1 1 1 489 3 10 94
      1 1 3 1 476 4 11 94
      1 1 3 1 486 4 11 94
      1 1 3 1 487 4 11 94
      1 1 3 1 478 2 11 94
      1 1 1 1 499 2 10 94
      1 1 1 1 522 4 10 94
      1 3 . 0 473 4  .  .
      end
      label values country country
      label def country 1 "Kenya", modify
      label values sector sector
      label def sector 1 "CTA", modify
      label def sector 3 "Pulses", modify
      label values size firmsampletype
      label def firmsampletype 1 "Domestic SME", modify
      label def firmsampletype 2 "Domestic Large", modify
      label def firmsampletype 3 "Foreign owned", modify
      label values yearsoperation years
      label def years 1 "Less than 1 year", modify
      label def years 2 "Between 1 and 5 years", modify
      label def years 3 "Between 6 and 10 years", modify
      label def years 4 "Between 11 and 20 years", modify
      label def years 5 "More than 20 years", modify

      From here, I construct sample totals and survey weights as follows:

      Code:
      * Generate sample totals
      bys country sector size expdummy: gen sample_csfe = _N
      bys country sector: gen sample_cs = _N
       
      label var sample_cs "Number of sampled firms in this country-sector"
      label var universe_cs "Total number of firms in this country-sector"
      label var sample_csfe "Number of sampled firms in this country-sector-size-export category"
      label var universe_csfe "Total number of firms within this country-sector-size-export category"
       
      * Generate surveyweight
      g surveyweight = (universe_csfe / universe_cs) / (sample_csfe/sample_cs)
       
      * Generate finite population correction (fpc)
      g fpc = ((universe_cs - sample_cs ) / (universe_cs-1))^0.5
       
        
      egen strata = group(country sector size expdummy), missing
      From your comments, it seems I now have the following 2 options to set my data as survey data:

      Code:
      *Option1: 
      svyset firmid [pweight=surveyweight], strata(strata)
      Or:

      Code:
      *Option2
      svyset , poststrata(strata) postweight(universe_csfe)
      In both cases, I am leaving fpc out of the equation because my main objective is to do regression analysis.

      In choosing between the two options, I notice that when I type:

      Code:
      svy: mean years
      I see that using poststrata indeed provides a correct estimate of the population size, whereas the first option does not (even when I include fpc). Would you therefore suggest I go with the latter option? Or is there something else to keep in mind?


      Thank you so much again for your great help!
      Loe



      • #4

        Thank you for presenting the data for one country with dataex. Unfortunately, this is a bad example, as there is a sector with only one observation (firm 473), and that observation is missing values for your universe variables and size. I guess values below. Strata with only one observation will prevent Stata from computing standard errors. To fix that, either use one of the singleunit() options or merge firm 473 into a neighboring stratum. Your formulas for sampling weight and fpc were also incorrect. For the latter, read the manual. I was mistaken about the need to use poststrata options here; they are unnecessary, because 1) your sampling strata are subsets of country and sector and 2) your sampling weights take into account the country-sector population counts. However, omitting the sampling weight in your svyset led to wrong standard errors.
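
        As a quick check before choosing between those fixes, single-unit strata can be listed directly. This is a sketch that assumes the strata variable from the previous post already exists; n_stratum is a hypothetical helper variable:

        Code:
        * Count sampled firms per stratum and list any stratum with a single firm
        bysort strata: generate n_stratum = _N
        list country sector size expdummy firmid if n_stratum == 1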


        Code:
        /* fixes for firm 473 */
        replace size=0 if firmid==473
        replace universe_csfe = 1 if firmid==473
        replace universe_cs = 1    if firmid==473
        
        
        egen strata = group(country sector size expdummy)
        gen sampwt =  (universe_csfe/sample_csfe)
        
        svyset firmid [pw = sampwt], strata(strata) ///
        singleunit(centered) fpc(universe_csfe)
        
        svydes
        svy: tab sector, count
        svy: mean yearsoperation
        
        egen cs = group(country sector)
        
        /* Your poststrata statement: */
        svyset firmid , strata(strata) poststrata(cs) postweight(universe_cs) fpc(universe_csfe) singleunit(centered)
        svy: mean yearsoperation  /* wrong standard error*/
        with results
        Code:
        . svydes
        
        Survey: Describing stage 1 sampling units
        
              pweight: sampwt
                  VCE: linearized
          Single unit: centered
             Strata 1: strata
                 SU 1: firmid
                FPC 1: universe_csfe
        
                                              #Obs per Unit
                                      ----------------------------
        Stratum    #Units     #Obs      min       mean      max   
        --------  --------  --------  --------  --------  --------
               1        23        23         1       1.0         1
               2         6         6         1       1.0         1
               3         3         3         1       1.0         1
               4         5         5         1       1.0         1
               5         1*        1         1       1.0         1
               6        11        11         1       1.0         1
               7         1*        1         1       1.0         1
        --------  --------  --------  --------  --------  --------
               7        50        50         1       1.0         1
        
        . svy: tab sector, count
        (running tabulate on estimation sample)
        
        Number of strata   =         7                  Number of obs     =         50
        Number of PSUs     =        50                  Population size   = 94.9999996
                                                        Design df         =         43
        
        ----------------------
           sector |      count
        ----------+-----------
              CTA |         94
           Pulses |          1
                  | 
            Total |         95
        ----------------------
          Key:  count     =  weighted count
        
        . svy: mean yearsoperation
        (running mean on estimation sample)
        
        Survey: Mean estimation
        
        Number of strata =       7        Number of obs   =         48
        Number of PSUs   =      48        Population size =  92.565217
                                          Design df       =         41
        
        ----------------------------------------------------------------
                       |             Linearized
                       |       Mean   Std. Err.     [95% Conf. Interval]
        ---------------+------------------------------------------------
        yearsoperation |    3.96806   .1367889      3.691809    4.244311
        ----------------------------------------------------------------
        Note: Strata with single sampling unit centered at overall mean.
        
        . 
        . egen cs = group(country sector)
        
        . svyset firmid , strata(strata) poststrata(cs) postweight(universe_cs) fpc(un
        > iverse_csfe) singleunit(centered)
        
              pweight: <none>
                  VCE: linearized
           Poststrata: cs
           Postweight: universe_cs
          Single unit: centered
             Strata 1: strata
                 SU 1: firmid
                FPC 1: universe_csfe
        
        . svy: tab sector, count
        (running tabulate on estimation sample)
        
        Number of strata   =         7                  Number of obs     =         50
        Number of PSUs     =        50                  Population size   =         95
        N. of poststrata   =         2                  Design df         =         43
        
        ----------------------
           sector |      count
        ----------+-----------
              CTA |         94
           Pulses |          1
                  | 
            Total |         95
        ----------------------
          Key:  count     =  count
        
        . svy: mean yearsoperation  /* wrong standard error */
        (running mean on estimation sample)
        
        Survey: Mean estimation
        
        Number of strata =       7        Number of obs   =         48
        Number of PSUs   =      48        Population size =         95
        N. of poststrata =       2        Design df       =         41
        
        ----------------------------------------------------------------
                       |             Linearized
                       |       Mean   Std. Err.     [95% Conf. Interval]
        ---------------+------------------------------------------------
        yearsoperation |   3.515789   .0716664      3.371056    3.660523
        ----------------------------------------------------------------
        Note: Strata with single sampling unit centered at overall mean.
        Last edited by Steve Samuels; 21 Apr 2018, 12:55. Reason: added poststrat result to show wrong standard error
        Steve Samuels
        Statistical Consulting

        Stata 14.2



        • #5
          Thank you very much Steven. All is clear now!

