Testing significant differences between proportions in contingency tables, with complex survey data

Josh Budlender

Join Date: Dec 2015

Posts: 15
#1


07 Jan 2016, 11:51

Hi

Apologies if this is a very obvious question - it seems to me that this is something people must do quite frequently, but I've done quite a lot of searching online and haven't been able to figure out how it's done.

I am working with complex survey panel data from South Africa, and performing an explanatory analysis which looks at how living in different locations may be related to mobility in certain labour market outcomes. By locations I mean "formal urban area", "informal urban area" and "rural area".

One part of the analysis relies on mobility matrices, which look at how labour market outcomes in a specific type of location have changed from wave1 to wave3 of the panel. This basically means creating cross-tabs of the same variable, but from different waves. So for example, I use:

Code:

svy, subpop(panel_empl_informal): tab w1_empl_stat w3_empl_stat, row ci percent

to see how employment status has changed between wave1 and wave3 for people living in informal urban areas. The three employment status outcomes used in this example are "Employed", "Unemployed" and "NEA" (not economically active).

Similarly, I use:

Code:

svy, subpop(panel_empl_rural): tab w1_empl_stat w3_empl_stat, row ci percent

to see how employment status has changed for people living in rural areas.

Say I am interested in whether the proportion of people moving from "Employed" in Wave1 to "Unemployed" in Wave3 differs significantly between informal urban and rural areas. One quick check I can do myself is to compare the reported confidence intervals of that proportion in the two tables. If the confidence intervals do not overlap, I know that the difference is significant at the 5% level. If one confidence interval completely contains the other, I know that there is not sufficient evidence at the 5% level to conclude that there is a difference. However, I am unsure what to do in Stata when the confidence intervals only partially overlap.

Outside of Stata, my understanding is that in this case the difference may still be significant at the 5% level, but that I need to perform a Chi-2 test to check this. The issue is that I'm not sure how to operationalise this in Stata in this context.

Any help on how to do this would be greatly appreciated, as would advice on a better way to test whether there is a significant difference in specific proportions between the urban formal / urban informal / rural tables.

Thanks very much,

Josh
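
One possible way to test a single transition across two areas is to estimate both groups in one command, so that the design-based standard error of the difference comes out directly. The sketch below is only an illustration under stated assumptions: the data are already -svyset-, panel_empl_informal and panel_empl_rural are 0/1 indicators for mutually exclusive groups, and the employment codes are 1 = Unemployed and 2 = Employed (as in the value labels shown in the tables later in this thread). The names rural_vs_inf, emp_to_unemp and sp_emp are hypothetical helper variables, not part of the original data.

```stata
* Assumes the data are in memory and already -svyset-.
* Assumes panel_empl_informal and panel_empl_rural are 0/1 and mutually exclusive.

* 1 = rural, 0 = otherwise (inside the subpopulation below, 0 means informal urban)
generate byte rural_vs_inf = (panel_empl_rural == 1)

* Outcome: unemployed in wave 3 (code 1 assumed to mean Unemployed)
generate byte emp_to_unemp = (w3_empl_stat == 1)

* Subpopulation: employed in wave 1 (code 2 assumed), observed in wave 3,
* and living in an informal urban or rural area
generate byte sp_emp = (panel_empl_informal == 1 | panel_empl_rural == 1) ///
    & w1_empl_stat == 2 & !missing(w3_empl_stat)

* Linear probability model: the coefficient on 1.rural_vs_inf is the
* rural minus informal-urban difference in the Employed -> Unemployed proportion
svy, subpop(sp_emp): regress emp_to_unemp i.rural_vs_inf

* The same difference with its design-based confidence interval and test
lincom 1.rural_vs_inf
```

With a single binary regressor, the regression coefficient equals the difference between the two group proportions, so a confidence interval for the coefficient that excludes zero indicates a significant difference at the corresponding level.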
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#2

07 Jan 2016, 14:54

Please show us the output of the svy: tab commands, as FAQ 12 requests, and point out the proportions that you want to compare.

12. What should I say about the commands and data I use?

Say exactly what you typed and exactly what Stata typed (or did) in response. N.B. exactly!

Steve Samuels
Statistical Consulting
[email protected]

Stata 14.2

Josh Budlender

Join Date: Dec 2015
Posts: 15

08 Jan 2016, 08:41

NOTE: This is a slightly amended re-post of the previous comment I made. The main amendment is to exclude the ugly screenshots showing my output, and instead to use CODE formatting (which I only just realised works nicely with Stata output as well). Apologies for the re-post - I am having trouble editing in the general forum, so I just deleted the previous ugly comment and have replaced it with this.

I have also highlighted the key part of my question in red, since in trying to be complete I have made this post quite long.
...

Thanks very much for your reply, Steve. Two tables are below, with all of the Stata output. I am running a number of other variations - these are just for the purposes of an example.

The table immediately below (Table 1) shows the change in employment status between Wave 1 and Wave 3, for people living in formal urban areas in both waves:

Code:

svy, subpop(panel_empl_formal): tab w1_empl_stat w3_empl_stat, row ci percent

Table 1

Code:

Number of strata   =        47                 Number of obs     =      19,644
Number of PSUs     =       361                 Population size   =  43,354,143
                                               Subpop. no. obs   =       4,093
                                               Subpop. size      =  11,067,085
                                               Design df         =         314

----------------------------------------------------------------------
w1 crude  |
employmen |                 w3 crude employment status                
t status  |        0, NEA       1, Unemp       2, Emplo          Total
----------+-----------------------------------------------------------
   0, NEA |         53.97          18.59          27.43            100
          | [49.04,58.83]  [15.43,22.24]  [23.69,31.53]              
          |
 1, Unemp |         24.53          33.88          41.59            100
          | [20.27,29.36]  [28.82,39.34]  [36.04,47.35]              
          |
 2, Emplo |         14.16          9.345          76.49            100
          | [11.68,17.07]   [7.422,11.7]   [72.95,79.7]              
          |
    Total |          26.7           16.5           56.8            100
          | [24.29,29.26]  [14.21,19.08]   [53.33,60.2]              
----------------------------------------------------------------------
  Key:  row percentage
        [95% confidence interval for row percentage]

  Pearson:
    Uncorrected   chi2(4)         = 4792.1113
    Design-based  F(3.63, 1141.28)=  112.8996     P = 0.0000

Note: 6 strata omitted because they contain no subpopulation members.

This second table (Table 2) shows the same, but for people living in rural areas for both waves:

Code:

svy, subpop(panel_empl_rural): tab w1_empl_stat w3_empl_stat, row ci percent

Table 2

Code:

Number of strata   =        43                 Number of obs     =      18,706
Number of PSUs     =       316                 Population size   =  32,135,318
                                               Subpop. no. obs   =       5,096
                                               Subpop. size      = 8,682,608.9
                                               Design df         =         273

----------------------------------------------------------------------
w1 crude  |
employmen |                 w3 crude employment status                
t status  |        0, NEA       1, Unemp       2, Emplo          Total
----------+-----------------------------------------------------------
   0, NEA |         57.21          23.83          18.96            100
          | [53.91,60.45]  [21.23,26.65]  [16.87,21.24]              
          |
 1, Unemp |            34          33.09          32.92            100
          | [30.09,38.13]  [29.45,36.94]  [28.83,37.27]              
          |
 2, Emplo |         28.61          14.83          56.56            100
          | [25.24,32.24]  [12.21,17.89]  [52.28,60.74]              
          |
    Total |         42.37          22.46          35.17            100
          | [39.87,44.91]  [20.42,24.64]   [32.9,37.51]              
----------------------------------------------------------------------
  Key:  row percentage
        [95% confidence interval for row percentage]

  Pearson:
    Uncorrected   chi2(4)         = 2653.3823
    Design-based  F(3.55, 967.91) =   88.5104     P = 0.0000

Note: 10 strata omitted because they contain no subpopulation members.

Distinct confidence intervals
The top right cell in Table 1 suggests that 27.43% of those who were Not Economically Active (NEA) in Wave 1 were employed in Wave 3, when looking at the sample of people who lived in formal urban areas for both waves. Equivalently in rural areas, from Table 2, only 18.96% of the NEA moved to Employed. The confidence intervals of these two estimates do not overlap, suggesting that the difference between the results is significant at the 5% level.

Completely overlapping confidence intervals
If one were to look at the proportion of people who remain Unemployed from wave 1 to wave 3 in each sample (Cell 2:2), you would see that the rural (Table 2) confidence interval falls inside the formal urban (Table 1) equivalent, suggesting that there is not a statistically significant difference between the estimates.

Partially overlapping confidence intervals - the problem
My question concerns a movement such as from NEA in Wave 1 to Unemployed in Wave 3 (Cell 1:2, highlighted in red in the original post). For formal urban areas, Table 1 suggests that 18.59% (ci: 15.43, 22.24) of the NEA in Wave 1 move to Unemployed in Wave 3. For rural areas, Table 2 suggests that 23.83% (ci: 21.23, 26.65) of the Wave 1 NEA move to Unemployed in Wave 3.

The issue here is that the confidence intervals overlap, so I cannot tell if the difference between the areas is statistically significant for this movement. My understanding is that I should implement a Chi-2 test, but can't figure out how to do this in this context. I could also calculate the difference between the estimates and see if the confidence interval of the difference contains 0, but (after various efforts) I'm not sure how to do that either.
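
As a rough back-of-envelope check, the difference and an approximate interval for it can be computed from the published numbers alone. The sketch below is not the design-based test. It assumes the two estimates are independent, treats the reported intervals (which -svy: tabulate- builds on the logit scale, so they are slightly asymmetric) as symmetric normal intervals, and uses the normal rather than the t distribution. All scalar names are hypothetical.

```stata
* Rough check only: assumes independent estimates and symmetric normal intervals
scalar crit = invnormal(0.975)

* Approximate standard errors recovered from the interval widths (percentage points)
scalar se_formal = (22.24 - 15.43) / (2 * crit)
scalar se_rural  = (26.65 - 21.23) / (2 * crit)

* Difference in the NEA -> Unemployed percentages, rural minus formal urban
scalar diff    = 23.83 - 18.59
scalar se_diff = sqrt(se_formal^2 + se_rural^2)

* z statistic, two-sided p-value and approximate 95% interval for the difference
scalar z  = diff / se_diff
scalar p  = 2 * normal(-abs(z))
scalar lb = diff - crit * se_diff
scalar ub = diff + crit * se_diff
display "diff = " diff "  z = " z "  p = " p "  95% CI = [" lb ", " ub "]"
```

Under these assumptions the z statistic is roughly 2.4 and the approximate interval for the difference lies above zero, which illustrates that partially overlapping intervals do not rule out a significant difference. A proper answer still needs an estimate that accounts for the survey design.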

I hope this makes it a bit clearer. Thanks very much for your time, and if you have any further questions/points of clarity please do raise them.

Best,

Josh Budlender

Last edited by Josh Budlender; 08 Jan 2016, 08:50.


Josh Budlender

Join Date: Dec 2015
Posts: 15

08 Jan 2016, 14:35

I've finally found that the way to test specific proportions against each other after estimation, within one table, is actually quite obvious. This brings me close to resolving the problem, but not quite there.

If I run the command for Table 1 again:

Code:

svy, subpop(panel_empl_formal): tab w1_empl_stat w3_empl_stat, row ci percent

and get (from the above)

Code:

----------------------------------------------------------------------
w1 crude  |
employmen |                 w3 crude employment status                
t status  |        0, NEA       1, Unemp       2, Emplo          Total
----------+-----------------------------------------------------------
   0, NEA |         53.97          18.59          27.43            100
          | [49.04,58.83]  [15.43,22.24]  [23.69,31.53]              
          |
 1, Unemp |         24.53          33.88          41.59            100
          | [20.27,29.36]  [28.82,39.34]  [36.04,47.35]              
          |
 2, Emplo |         14.16          9.345          76.49            100
          | [11.68,17.07]   [7.422,11.7]   [72.95,79.7]              
          |
    Total |          26.7           16.5           56.8            100
          | [24.29,29.26]  [14.21,19.08]   [53.33,60.2]              
----------------------------------------------------------------------
  Key:  row percentage
        [95% confidence interval for row percentage]

I can quite easily test whether there is a statistically significant difference between estimates within the table, by running, for example,

Code:

test _b[p21] = _b[p22]

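which tells me whether there is a significant difference between the two estimates highlighted in blue in the original post (row 2 of the table: Unemployed to NEA, 24.53, and Unemployed to Unemployed, 33.88). The output is below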

Code:

Adjusted Wald test

 ( 1)  p21 - p22 = 0

       F(  1,   314) =    5.22
            Prob > F =    0.0230
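
To see which stored estimates the names p21 and p22 refer to, and to get the size of the difference rather than only a test, something like the following may help. It is a sketch that assumes -lincom- and -nlcom- accept the results left behind by -svy: tabulate- in the same way -test- does, which I have not verified. Note also that, as far as I know, e(b) holds cell proportions (shares of the whole subpopulation) even when the row option is used, so the row-percentage difference has to be rebuilt from the cells.

```stata
* Run immediately after the -svy: tabulate- command, while its results are in memory

* List the stored estimates; the names p11, p12, ... index row and column
matrix list e(b)

* Difference in row percentages within row 2 (assumption: e(b) holds cell
* proportions, so each cell is divided by the row 2 total)
nlcom 100 * (_b[p21] - _b[p22]) / (_b[p21] + _b[p22] + _b[p23])

* Difference between the two cell proportions, with standard error and interval
lincom _b[p21] - _b[p22]
```

Within a single row the two cells share the same row total, so testing equality of the cell proportions is equivalent to testing equality of the row percentages; only the scale of the reported difference changes.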

My question now is how do I save the estimates run in Table 1 above, and then perform -test- to compare the estimates from one table with another table, run on a different sample? The stored estimates used above will otherwise be replaced as soon as I run another -svy, tab- command.

My understanding is that I should be using -estimates store name- and then -suest-. However this does not work for me. If I do the below:

Code:

svy, subpop(panel_empl_formal): tab w1_empl_stat w3_empl_stat, row ci percent
estimates store FORMAL
svy, subpop(panel_empl_rural): tab w1_empl_stat w3_empl_stat, row ci percent
estimates store INFORMAL

suest FORMAL INFORMAL

I get the following error message:

Code:

impossible to retrieve e(b) and e(V) in FORMAL
r(198);

Relatedly, if I do

Code:

estimates replay FORMAL

I get

Code:

--------------------------------------------------------------------------------------------------------
Model FORMAL
--------------------------------------------------------------------------------------------------------
varlist required
r(100);

Can I not use -estimates store- after the tabulate command? And if not, is there another way for me to save results from one table, that will allow me to -test- whether coefficients are statistically significantly different from estimates in another table, from another sample?
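
One alternative that avoids -estimates store- and -suest- altogether is to estimate both areas in a single command, so that every cross-area comparison is a linear combination of coefficients from one fit. The sketch below is only illustrative. It assumes the data are -svyset-, that panel_empl_formal and panel_empl_rural are 0/1 indicators for mutually exclusive groups, and that the codes are 0 = NEA, 1 = Unemployed, 2 = Employed as in the tables above. The names rural_vs_formal, to_unemp and sp_both are hypothetical helper variables.

```stata
* Assumes the data are in memory and already -svyset-.
* Assumes panel_empl_formal and panel_empl_rural are 0/1 and mutually exclusive.
generate byte rural_vs_formal = (panel_empl_rural == 1)
generate byte to_unemp        = (w3_empl_stat == 1)
generate byte sp_both = (panel_empl_formal == 1 | panel_empl_rural == 1) ///
    & !missing(w1_empl_stat, w3_empl_stat)

* Saturated linear model: one mean of to_unemp per (area, wave 1 status) cell
svy, subpop(sp_both): regress to_unemp i.rural_vs_formal##i.w1_empl_stat

* Rural minus formal urban difference in the share unemployed in wave 3,
* separately by wave 1 status (0 = NEA is the base category)
lincom 1.rural_vs_formal                                        // wave 1 NEA
lincom 1.rural_vs_formal + 1.rural_vs_formal#1.w1_empl_stat     // wave 1 Unemployed
lincom 1.rural_vs_formal + 1.rural_vs_formal#2.w1_empl_stat     // wave 1 Employed
```

The same pattern with a different outcome indicator (for example, employed in wave 3) would cover the other columns of the mobility matrix. Because both areas enter one estimation, the standard error of each difference reflects the survey design without needing to combine separately stored results.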

I am going to cross-post this question into another topic on the forum, as the question has changed a bit from what I originally posted. Additionally, this thread is now quite long and messy. Cross-post: http://www.statalist.org/forums/foru...another-sample

Last edited by Josh Budlender; 08 Jan 2016, 15:12.
