  • Help - sub-sampling a random representative subset from a large data set (survey data)

    Hello,

    I have a data set of 1500 respondents from a selected area. Because of restrictions related to the topic, the respondents were selected by non-probability quota sampling, with quotas by county (5 counties), urban/rural residence, and 3 age categories. The sample is still biased, and I would like to correct that if possible. Instead of post-stratification, I'd like to solve for probability too (yeah, I know...).

    Question: How can I draw a random subset (of, say, 800-900 cases) from my data set, using the available census population data (population by county, by urban/rural residence, and by age category) for stratification?

    Thank you!
    Cristian

  • #2
    The following presumes you have a variable called "stratum," coded from 1 to Whatever, and that you record the desired N for each stratum in a list of locals.
    Code:
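    // desired number for each stratum
    local N1 = 15
    local N2 = 19
    ...
    ...
    local NWhatever =
    // draw each stratum's sample separately and save it to a temporary file
    forval i = 1/Whatever  {
       preserve
       // keep only stratum `i' first: -sample ... if- leaves observations
       // that fail the -if- condition in the data, unsampled
       keep if stratum == `i'
       sample `N`i'', count
       tempfile sample`i'
       save `sample`i''
       restore
    }
    // stack the stratum samples into one data set
    clear
    forval i = 1/Whatever {
       append using `sample`i''
    }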
    I can appreciate the ease of using a sample like this, but wouldn't you get equivalent results, with more precision, by keeping the whole original data set and weighting it? I'd guess the gain in precision from keeping the original would be on the order of sqrt(1500/900), which is modest, but it still might be attractive.
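
    For a rough sense of scale, the ratio can be evaluated directly. This is only a sketch that assumes standard errors shrink with 1/sqrt(n), as under simple random sampling; unequal weights would eat into some of the gain.

    ```stata
    // ratio of standard errors, n = 900 versus n = 1500 (about 1.29)
    display sqrt(1500/900)
    // proportional reduction in the standard error from keeping all 1500 cases
    display 1 - sqrt(900/1500)
    ```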

    Last edited by Mike Lacy; 10 Sep 2019, 12:07.



    • #3
      Thank you, Mike, for the suggestions. I agree that post-stratification / weighting is a good way to deal with my data, but I wanted to try this other approach.
      Regarding your code, the desired number for each stratum would have to be computed by hand. That's OK, but as I read it, the code handles only a single stratification variable with N strata, whereas I need to control for three different levels (population in the 5 counties, rural-urban population, and 3 age categories). Or maybe I didn't get this...
      Cheers!
      K.



      • #4
        I was thinking that your population data were available for each combination of county, rural-urban, and age group. That is, 5 counties * 2 rural-urban * 3 age categories would give you 30 strata, and I'm presuming you would know the population fraction in each one. The sample N for each stratum would just be its population fraction * 900, giving you a sample that reproduces the population fractions. If this isn't on target, go ahead and post again and explain.
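
        A minimal sketch of that idea follows. The variable names are hypothetical: county, urban, and agecat are assumed to be in the respondent data, and popfrac is assumed to hold the census population fraction of each county x rural-urban x age cell, merged onto the respondents beforehand. The seed value is arbitrary.

        ```stata
        // Hypothetical variables: county, urban, agecat (respondent data) and
        // popfrac (census share of each cell, assumed merged on beforehand)
        set seed 12345                              // arbitrary, for reproducibility
        egen stratum = group(county urban agecat)   // one code per cell, up to 30
        gen n_target = round(popfrac * 900)         // desired sample N per stratum
        gen double u = runiform()                   // random sort key
        // flag the first n_target cases in each stratum, in random order
        bysort stratum (u): gen byte pick = _n <= n_target
        keep if pick
        tabulate stratum                            // check the achieved Ns
        ```

        Two caveats with this sketch: the rounded targets may not sum to exactly 900, and a stratum with fewer respondents than its target simply contributes all of its cases, so the subsample would fall short there.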
