
  • Proportion difference testing for complex survey designs with empty cells.

    Hi
    I have complex survey data and wish to test if there is a difference in proportions between two sub-populations.
    In the survey, people were asked whether or not they agreed with an opinion, and very few agreed. The problem is that for sub-population B the proportion disagreeing is 1, while for sub-population A it is 0.9934. The two sub-population sizes are n = 11443 and n = 417.

    I looked in the Stata survey manual (https://www.stata.com/manuals13/svy.pdf), which seems to suggest that Rao and Scott's (1984) second-order corrected Pearson statistic is the best test for sparse tables; it recommends using this statistic in all situations. The Pearson test found no significant difference between groups. I also requested the Wald test, which found a significant difference (the Stata help page mentions that the Wald test can give erratic results for sparse tables). However, because there is a cell with a zero proportion/count, the assumptions for a chi-squared test are not met, and I am not confident in the conclusion from the Pearson test. Is the Pearson test the right way to go for these data?


    Code:

    svy : tab row column,col obs pearson wald

    /*

    Number of strata = 16 Number of obs = 11,860
    Number of PSUs = 11,860 Population size = 3,189,540
    N. of poststrata = 40 Design df = 11,844

    -------------------------------------
    | Column
    Row | A B Total
    ----------+--------------------------
    0 | .9934 1 .9936
    | 1.1e+04 417 1.2e+04
    |
    1 | .0066 0 .0064
    | 69 0 69
    |
    Total | 1 1 1
    | 1.1e+04 417 1.2e+04
    -------------------------------------
    Key: column proportion
    number of observations

    Pearson:
    Uncorrected chi2(1) = 2.6635
    Design-based F(1, 11844) = 2.7560 P = 0.0969

    Wald (Pearson):
    Unadjusted chi2(1) = 66.6589
    Adjusted F(1, 11844) = 66.6589 P = 0.0000
    */


    I also googled this problem generally but didn't find any relevant websites.

    If this is not an appropriate approach other suggestions would be appreciated.

    Thank you very much for your help.

  • #2
    Welcome to Statalist, Sarah!

    Before I try to answer your core question, I do have a couple of concerns:

    1. You say that you have a "complex survey". That term usually refers to a stratified, multi-stage sample. However, in your output the number of PSUs is identical to the number of observations. In other words, the implied design is a stratified sample with no clustering. Is this the case? If not, please describe the actual design; write (and show) a svyset statement that specifies the PSU; then rerun svy tab.

    2. Thank you for trying a monospaced font. Unfortunately, the result was a mashed table. Before posting again, read FAQ 12 (as well as the other FAQs) and follow the instructions for placing code and results between CODE delimiters.
    Last edited by Steve Samuels; 02 Sep 2017, 21:10.
    Steve Samuels
    Statistical Consulting

    Stata 14.2



    • #3
      Hi

      Thank you for responding Steve. Sorry about the errors.

      I have survey data and wish to test if there is a difference in proportions between two sub-populations.
      The survey used random digit dialling and was stratified by region. It did not use multi-stage sampling, and post-stratification weights were applied to benchmark the sample to the population.

      Code:
       svyset
      
            pweight: <none>
                VCE: linearized
         Poststrata: bmark_grps
         Postweight: pop
        Single unit: scaled
           Strata 1: region
               SU 1: <observations>
              FPC 1: <zero>
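
      For reference, settings like those displayed above would typically come from a command along the following lines. This is only a sketch reconstructed from the displayed settings, not necessarily the command that was actually run; it assumes that `_n` was used to make each observation its own sampling unit, and the variable names are the ones shown in the display.

      ```stata
      * Sketch: stratified design with no clustering, plus poststratification
      * _n makes each observation its own sampling unit (shown as <observations>)
      * region, bmark_grps and pop are the names shown in the svyset display
      svyset _n, strata(region) poststrata(bmark_grps) postweight(pop) singleunit(scaled)

      * Typing svyset with no arguments redisplays the current settings
      svyset
      ```
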
      In the survey, people were asked whether or not they agreed with an opinion, and very few agreed. The problem is that for sub-population B the proportion disagreeing is 1, while for sub-population A it is 0.9934. The two sub-population sizes are n = 11443 for sub-population A and n = 417 for sub-population B.

      I looked in the Stata survey manual (https://www.stata.com/manuals13/svy.pdf), which seems to suggest that Rao and Scott's (1984) second-order corrected Pearson statistic is the best test for sparse tables; it recommends using this statistic in all situations. The Pearson test found no significant difference between groups. I also requested the Wald test, which found a significant difference (the Stata help page mentions that the Wald test can give erratic results for sparse tables). However, because there is a cell with a zero proportion/count, the assumptions for a chi-squared test are not met, and I am not confident in the conclusion from the Pearson test. Is the Pearson test the right way to go for these data?


      Code:
      . svy: tab row column,col obs pearson wald
      (running tabulate on estimation sample)
      
      Number of strata   =        16                  Number of obs     =     11,860
      Number of PSUs     =    11,860                  Population size   =  3,189,540
      N. of poststrata   =        40                  Design df         =     11,844
      
      -------------------------------------
                |          Column          
            Row |       A        B    Total
      ----------+--------------------------
              0 |   .9934        1    .9936
                | 1.1e+04      417  1.2e+04
                |
              1 |   .0066        0    .0064
                |      69        0       69
                |
          Total |       1        1        1
                | 1.1e+04      417  1.2e+04
      -------------------------------------
       
        Key:  column proportion
              number of observations
      
      
        Pearson:
          Uncorrected   chi2(1)         =    2.6635
          Design-based  F(1, 11844)     =    2.7560     P = 0.0969
      
        Wald (Pearson):
          Unadjusted    chi2(1)         =   66.6589
          Adjusted      F(1, 11844)     =   66.6589     P = 0.0000

      I also googled this problem generally but didn't find any relevant websites.

      If this is not an appropriate approach other suggestions would be appreciated.

      Thank you very much for your help.
      Last edited by Sarah Rendall; 03 Sep 2017, 18:33.



      • #4
        Thanks for the explanation, Sarah. In recommending the corrected F statistic, the manual refers to simulations by Sribney (1998). To quote just one statement from Sribney (p. 46):
        The default-corrected Pearson has a rejection rate of 0.04–0.06 for all tables in the large variance degrees of freedom simulation of sparse tables.
        Power for the default corrected Pearson statistic is also better than that of the alternative.
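
        In Stata terms, the default-corrected Pearson statistic is the test that svy: tabulate reports when the pearson option is given (it is also the default when no test is requested). A minimal sketch of requesting only that test, assuming the data are already svyset and using the variable names from the output earlier in the thread:

        ```stata
        * Sketch only: row and column are the variable names shown in the posted output
        * pearson requests the Pearson test with the default (Rao-Scott) correction
        svy: tabulate row column, column obs pearson

        * Display the stored results, which include the design-based F and its p-value
        ereturn list
        ```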

        One comment: Random digit dialing is, in fact, a form of cluster sampling, with the clusters being banks of 100 phone numbers that share the same eight leading digits. See, for example, the section on RDD in this Pew methodology page:
        http://www.pewresearch.org/methodolo...arch/sampling/

        Reference:

        Sribney, W. M. 1998. svy7: Two-way contingency tables for survey or clustered data. Stata Technical Bulletin 45: 33–49
        This can be found at: http://www.stata-press.com/journals/...ents/stb45.pdf
        Last edited by Steve Samuels; 05 Sep 2017, 15:59.


        • #5
          Mmh... upon rereading the Pew document, I find nothing to indicate that banks of numbers are randomly selected. So I was wrong: the banks are not sampled clusters. They are, I think, more accurately characterized as sub-strata of telephone numbers.
          Last edited by Steve Samuels; 05 Sep 2017, 17:00.