
  • Comparing sample to population

    I am using Stata 14.2. I have a survey and I want to see if the proportions within the crosstab of two ordinal variables are significantly different from the population. My two variables are age categories (25-29, 30-34, 35-39, and 40-44) and number of children (0, 1, 2, 3, 4, 5&6, 7 plus). I also want to incorporate the survey weights into the analysis.

    My first search brought me to the csgof command, which allows me to specify my expected values, but it does not do the test on a crosstab of two variables, just proportions of one variable.

    I ended up running the svy: tab command for the age categories and number of kids, which allowed me to see the weighted frequencies within each cell. Then I used the chisqi command to compare the population frequencies with the weighted sample frequencies. However, I had to do this separately for each age category. I know that running a separate test for each age group is less satisfactory than a single test of whether the entire distribution differs significantly from the population. Is there a simpler way to do this?

    Here is the code that I used:
    svy: tab AGECAT BIOKIDCAT, row obs
    *create locals for each age group that hold the frequency of each no. kids category
    local NSFBpopfreq2529 4561 2290 1940 850 280 70 10
    local NSFBpopfreq3034 2537 2103 2826 1475 511 183 28
    local NSFBpopfreq3539 1975 1839 3699 1933 679 261 52
    local NSFBpopfreq4044 2291 1898 3864 2078 719 337 56
    local NSFBsamfreq2529 384 263 233 109 21 10 1
    local NSFBsamfreq3034 287 266 337 185 60 25 4
    local NSFBsamfreq3539 247 234 352 205 89 34 5
    local NSFBsamfreq4044 265 215 427 199 81 42 7
    *create local for the labels
    local chisqlabels Zero One Two Three Four FiveSix SevenPlus
    *do a separate chi square test for each age group to compare the distribution of the sample to the population
    chisqi `NSFBpopfreq2529' \ `NSFBsamfreq2529', labels(`chisqlabels') nst(25-29)
    chisqi `NSFBpopfreq3034' \ `NSFBsamfreq3034', labels(`chisqlabels') nst(30-34)
    chisqi `NSFBpopfreq3539' \ `NSFBsamfreq3539', labels(`chisqlabels') nst(35-39)
    chisqi `NSFBpopfreq4044' \ `NSFBsamfreq4044', labels(`chisqlabels') nst(40-44)

  • #2
    From the point of view of the Chi-Square GOF test, you are testing the distribution of a single variable with 4 X 7 = 28 categories. You could easily enough create such a variable in your data set (e.g., you could use something like -egen age_kids = group(age nchilds)- ). However, looking at the description of -csgof- (which is a community-contributed command and not a built-in part of Stata), I don't see any indication that it will perform a test that appropriately adjusts for the survey nature of your sample. It might, or it might not.
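
    A minimal sketch of that suggestion, assuming the variable names from the svy: tab call in #1 (AGECAT and BIOKIDCAT) and that the data have already been svyset; age_kids is only an illustrative name. Note that group() numbers only the combinations that actually occur in the data, so an empty cell would leave fewer than 28 codes.

    ```stata
    * Combine the two categorical variables into one variable with up to 28 levels.
    * The label option builds value labels from those of AGECAT and BIOKIDCAT.
    egen age_kids = group(AGECAT BIOKIDCAT), label

    * Quick check of the unweighted cell sizes, including any missing values.
    tabulate age_kids, missing

    * Weighted counts for the combined variable (requires a prior svyset).
    svy: tabulate age_kids, count format(%12.0f)
    ```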


    • #3
      Ah, that makes sense! I knew there was a simple solution that I just wasn't thinking of--thank you Mike!

      So maybe I can:
      • Create a new variable that indicates what category out of 28 each individual is in (as you suggested)
      • Then use svy: tab to get those weighted frequencies
      • Then type in the values for chisqi and compare to the population as I did before
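
      As a rough illustration of the last step, the single 28-cell Pearson goodness-of-fit statistic can be computed directly in Mata from the counts posted in #1. This is only a sketch: it treats the posted sample counts as if they came from a simple random sample, so it makes no adjustment for the survey design, and the matrix names are illustrative.

      ```stata
      mata:
      // Counts from #1: rows are age groups 25-29, 30-34, 35-39, 40-44;
      // columns are 0, 1, 2, 3, 4, 5-6, 7+ children.
      pop2529 = (4561, 2290, 1940,  850, 280,  70, 10)
      pop3034 = (2537, 2103, 2826, 1475, 511, 183, 28)
      pop3539 = (1975, 1839, 3699, 1933, 679, 261, 52)
      pop4044 = (2291, 1898, 3864, 2078, 719, 337, 56)
      sam2529 = (384, 263, 233, 109, 21, 10, 1)
      sam3034 = (287, 266, 337, 185, 60, 25, 4)
      sam3539 = (247, 234, 352, 205, 89, 34, 5)
      sam4044 = (265, 215, 427, 199, 81, 42, 7)
      pop = pop2529 \ pop3034 \ pop3539 \ pop4044
      sam = sam2529 \ sam3034 \ sam3539 \ sam4044

      // Expected sample counts if the sample followed the population distribution.
      expected = pop :* (sum(sam) / sum(pop))

      // Pearson chi-square over all 28 cells, with 28 - 1 degrees of freedom.
      X2 = sum((sam - expected):^2 :/ expected)
      df = rows(pop) * cols(pop) - 1
      p  = chi2tail(df, X2)
      X2, df, p
      end
      ```

      Some expected counts in the 7+ column are small, which is one more reason to read the p-value with caution.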


      • #4
        That will give you decent frequency estimates, but whether those will give a valid result if put into the Chi2 GOF test "as is," I don't know. In any event, whether or not your sample is different from the population according to some hypothesis test is (though popular) not necessarily the issue: I'd worry about whether the differences are big enough to *matter,* which is not what the test shows. If your sample is large enough, any difference will show a small p-value whether or not such differences will matter.
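
        One way to look at the size of the differences rather than their significance is sketched below in Mata, using the counts posted in #1. It compares the within-age-group distributions in percentage points and summarizes each age group with the index of dissimilarity (half the sum of absolute differences in proportions). The choice of summary and the matrix names are assumptions made for illustration, not something prescribed in this thread.

        ```stata
        mata:
        // Counts from #1: rows are age groups 25-29, 30-34, 35-39, 40-44;
        // columns are 0, 1, 2, 3, 4, 5-6, 7+ children.
        pop2529 = (4561, 2290, 1940,  850, 280,  70, 10)
        pop3034 = (2537, 2103, 2826, 1475, 511, 183, 28)
        pop3539 = (1975, 1839, 3699, 1933, 679, 261, 52)
        pop4044 = (2291, 1898, 3864, 2078, 719, 337, 56)
        sam2529 = (384, 263, 233, 109, 21, 10, 1)
        sam3034 = (287, 266, 337, 185, 60, 25, 4)
        sam3539 = (247, 234, 352, 205, 89, 34, 5)
        sam4044 = (265, 215, 427, 199, 81, 42, 7)
        pop = pop2529 \ pop3034 \ pop3539 \ pop4044
        sam = sam2529 \ sam3034 \ sam3539 \ sam4044

        // Row proportions: distribution of number of children within each age group.
        ppop = pop :/ rowsum(pop)
        psam = sam :/ rowsum(sam)

        // Sample minus population, in percentage points, for every cell.
        diff = 100 :* (psam - ppop)
        round(diff, 0.1)

        // Index of dissimilarity per age group, in percentage points: the share of
        // the sample that would have to change category to match the population.
        D = 0.5 :* rowsum(abs(diff))
        D
        end
        ```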


        • #5
          Stepping back: why do you want to do this? In these situations, standard survey practice is to post-stratify the initial sampling weights so that the estimated and actual population counts match for selected characteristics. In your svyset statement, use the poststrata() and postweight() options for the age-by-number-of-children table. Or, if the counts in the two-way table are too small, use the rake() option of svyset in Stata 15 to fit both sets of marginal counts. With an earlier version of Stata, use survwgt rake in Nick Winter's survwgt package (SSC). You may have to combine categories that have low sample counts with neighboring categories.
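
          A minimal sketch of the poststrata()/postweight() setup, under stated assumptions: psu, stratum, and wt are hypothetical stand-ins for the survey's actual design variables, and popcounts.dta is an assumed file with one row per AGECAT-by-BIOKIDCAT cell and a variable popcount holding that cell's population count.

          ```stata
          * Attach the population count of each age-by-children cell to every respondent.
          * popcounts.dta (hypothetical) contains AGECAT, BIOKIDCAT, and popcount.
          merge m:1 AGECAT BIOKIDCAT using popcounts, keep(master match) nogenerate

          * One poststratum per age-by-children cell.
          egen age_kids = group(AGECAT BIOKIDCAT), label

          * Declare the design with post-stratification (design variable names assumed).
          svyset psu [pweight = wt], strata(stratum) poststrata(age_kids) postweight(popcount)

          * Estimates now use the post-stratified weights.
          svy: tabulate AGECAT BIOKIDCAT, row
          ```

          With these options the weighted cell totals are adjusted to match popcount, so sparse cells may need to be merged with neighbors before this step.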

          Good luck!
          Steve Samuels
          Statistical Consulting

          Stata 14.2


          • #6
            Thanks to both Mike and Steve for the considerations! I did not collect this data; I'm just doing a small analysis project for a class, actually with two surveys. As part of the project, I wanted to see whether the surveys were representative in terms of number of children, which was not used for sampling or weighting in either survey. However, you make a good point, Mike, that with large sample sizes, statistical significance isn't necessarily what's of interest. It definitely would make more sense to focus on whether the differences are big enough to matter, as you suggested. Thanks again! I'll take both of these thoughts into consideration moving forward.