Storing estimates after -svy, tab-, to compare with estimates from another sample

Josh Budlender

Join Date: Dec 2015

Posts: 15
#1

08 Jan 2016, 15:03

Hi

My data are panel data with 3 waves. They have also been split into 3 samples: those who lived in "formal urban" areas in Waves 1 and 3, those who lived in "informal urban" areas in Waves 1 and 3, and those who lived in "rural" areas in Waves 1 and 3. It is complex survey data.

If I execute

Code:

svy, subpop(panel_empl_formal): tab w1_empl_stat w3_empl_stat, row ci percent

I get the change in labour market outcomes between waves for the "formal urban" sample ("Table 1"):

Code:

----------------------------------------------------------------------
w1 crude  |
employmen |                 w3 crude employment status                
t status  |        0, NEA       1, Unemp       2, Emplo          Total
----------+-----------------------------------------------------------
   0, NEA |         53.97          18.59          27.43            100
          | [49.04,58.83]  [15.43,22.24]  [23.69,31.53]               
          | 
 1, Unemp |         24.53          33.88          41.59            100
          | [20.27,29.36]  [28.82,39.34]  [36.04,47.35]               
          | 
 2, Emplo |         14.16          9.345          76.49            100
          | [11.68,17.07]   [7.422,11.7]   [72.95,79.7]               
          | 
    Total |          26.7           16.5           56.8            100
          | [24.29,29.26]  [14.21,19.08]   [53.33,60.2]               
----------------------------------------------------------------------
  Key:  row percentage
        [95% confidence interval for row percentage]

The estimates in cells 2:1 and 2:2 (row "1, Unemp", columns "0, NEA" and "1, Unemp": 24.53 and 33.88) have overlapping confidence intervals, so at first glance I cannot tell whether the difference between them is statistically significant at the 95% level. So I execute

Code:

test _b[p21] = _b[p22]

which gives me the output

Code:

Adjusted Wald test

 ( 1)  p21 - p22 = 0

       F(  1,   314) =    5.22
            Prob > F =    0.0230

And I therefore know that the estimates can be said to be different at the 95% level. This is all fine.

My question now is about how to compare estimates from the "formal urban" sample to another sample, such as the "informal urban" sample. As soon as I run another -svy, tab- command, the stored estimates for "Table 1" are replaced. So I need a way to save the estimates, and then bring them back for comparison. The commands I thought would work for this are -estimates store name- and -suest-.

However when I tried the following

Code:

svy, subpop(panel_empl_for): tab w1_empl_stat_crude w3_empl_stat_crude, row ci percent
estimates store FORMAL
svy, subpop(panel_empl_rur): tab w1_empl_stat_crude w3_empl_stat_crude, row ci percent
estimates store INFORMAL
suest FORMAL INFORMAL

I get

Code:

impossible to retrieve e(b) and e(V) in FORMAL
r(198);

Probably related to this, when I run -estimates replay FORMAL- I get

Code:

--------------------------------------------------------------------------------------------------------
Model FORMAL
--------------------------------------------------------------------------------------------------------
varlist required
r(100);

Can I not use -estimates store- after -svy, tab-? Or am I doing something wrong? And if I cannot use -estimates store-, is there another way around my problem?

Thanks very much for any help - and any questions or points of clarity are very welcome.

NOTE: this is a cross-post from http://www.statalist.org/forums/foru...ex-survey-data, as that thread took a very long-winded route before getting to this point, and is very long and messy to follow. Most importantly, I think the question has changed somewhat from what I asked in that topic. However if you want some more background on this question, that thread may be informative.

Josh Budlender

Join Date: Dec 2015
Posts: 15

08 Jan 2016, 15:10

To make clearer what I mean about comparing estimates from different samples:

If I execute (as above)

Code:

svy, subpop(panel_empl_formal): tab w1_empl_stat w3_empl_stat, row ci percent

I get (Table 1):

Code:

----------------------------------------------------------------------
w1 crude  |
employmen |                 w3 crude employment status                
t status  |        0, NEA       1, Unemp       2, Emplo          Total
----------+-----------------------------------------------------------
   0, NEA |         53.97          18.59          27.43            100
          | [49.04,58.83]  [15.43,22.24]  [23.69,31.53]               
          | 
 1, Unemp |         24.53          33.88          41.59            100
          | [20.27,29.36]  [28.82,39.34]  [36.04,47.35]               
          | 
 2, Emplo |         14.16          9.345          76.49            100
          | [11.68,17.07]   [7.422,11.7]   [72.95,79.7]               
          | 
    Total |          26.7           16.5           56.8            100
          | [24.29,29.26]  [14.21,19.08]   [53.33,60.2]               
----------------------------------------------------------------------
  Key:  row percentage
        [95% confidence interval for row percentage]

Table 1 shows labour market mobility for people in formal urban areas. Say I now want to compare this to mobility in informal urban areas:

Then I execute

Code:

svy, subpop(panel_empl_informal): tab w1_empl_stat w3_empl_stat, row ci percent

and get (Table 2):

Code:

----------------------------------------------------------------------
w1 crude  |
employmen |                 w3 crude employment status                
t status  |        0, NEA       1, Unemp       2, Emplo          Total
----------+-----------------------------------------------------------
   0, NEA |         50.55           31.8          17.65            100
          | [44.64,56.44]   [23.39,41.6]  [10.46,28.23]               
          | 
 1, Unemp |          28.9          32.23          38.87            100
          | [22.87,35.78]  [23.55,42.33]  [31.14,47.21]               
          | 
 2, Emplo |         26.11          13.06          60.82            100
          | [18.39,35.67]  [9.133,18.34]  [49.28,71.27]               
          | 
    Total |         33.84          23.86          42.29            100
          | [27.47,40.86]  [18.28,30.52]   [33.4,51.72]               
----------------------------------------------------------------------
  Key:  row percentage
        [95% confidence interval for row percentage]

I want to know if there is a statistically significant difference between "formal urban" and "informal urban" areas in the move from Unemployed in Wave 1 to NEA in Wave 3. These are the estimates in row "1, Unemp", column "0, NEA": 24.53 in Table 1 and 28.9 in Table 2. However, the confidence intervals overlap, so I need to perform a test. This is what I'm asking how to do.

Thanks again,

Josh Budlender

Steve Samuels

Join Date: Mar 2014
Posts: 1786

09 Jan 2016, 21:57

You can't do this with -svy: tab-; it is not an estimation command that leaves behind e(b) and e(V). But you can do it with -proportion- followed by -lincom-. You have a 2 x 3 x 3 data structure. I don't have an example with such a structure, so I'll show the solution for a 2 x 2 x 2 structure. Here's the code.

The urban/rural setting will be indicated by the variable "setting". The wave 1 and wave 3 variables are typ_wav1 and typ_wav3. The proportions contrasted in each table are those in the second column.

Code:

/* Set up data set */
use http://www.stata-press.com/data/r14/byssin, clear
drop if workplace ==3  // reduce to 2 x 2 x 2
label drop _all
gen setting = race
gen typ_wav1 = workplace
gen typ_wav3 = smokes
keep setting typ* pop

/* Do survey analyses */
svyset _n [pw = pop]
svy, subpop(if setting==1): tab typ_wav1 typ_wav3, row se
svy, subpop(if setting==2): tab typ_wav1 typ_wav3, row se
svy: prop typ_wav3, over(setting typ_wav1) coeflegend

/* Get difference of differences */
lincom  ///
   _b[_prop_2:_subpop_1] -_b[_prop_2:_subpop_2]  ///
 - (_b[_prop_2:_subpop_3] - _b[_prop_2:_subpop_4])

Results (abbreviated):

Code:

. svy, subpop(if setting==1): tab typ_wav1 typ_wav3, row se
-------------------------------------
          |         typ_wav3        
 typ_wav1 |       1        2    Total
----------+--------------------------
        1 |   .4403    .5597        1
          | (.2339)  (.2339)        
          |
        2 |   .4494    .5506        1
          | (.2591)  (.2591)        
          |
    Total |    .443     .557        1
          | (.1816)  (.1816)        
-------------------------------------

. svy, subpop(if setting==2): tab typ_wav1 typ_wav3, row se

-------------------------------------
          |         typ_wav3        
 typ_wav1 |       1        2    Total
----------+--------------------------
        1 |   .4098    .5902        1
          | (.1614)  (.1614)        
          |
        2 |   .4281    .5719        1
          | (.1781)  (.1781)        
          |
    Total |   .4146    .5854        1
          | (.1281)  (.1281)        
-------------------------------------
. svy: prop typ_wav3, over(setting typ_wav1) coeflegend
      _prop_1: typ_wav3 = 1
      _prop_2: typ_wav3 = 2

         Over: setting typ_wav1
    _subpop_1: 1 1
    _subpop_2: 1 2
    _subpop_3: 2 1
    _subpop_4: 2 2
------------------------------------------------------------------------------
        Over | Proportion  Legend
-------------+----------------------------------------------------------------
_prop_1      |
   _subpop_1 |   .4403409  _b[_prop_1:_subpop_1]
   _subpop_2 |   .4494382  _b[_prop_1:_subpop_2]
   _subpop_3 |   .4097744  _b[_prop_1:_subpop_3]
   _subpop_4 |   .4280702  _b[_prop_1:_subpop_4]
-------------+----------------------------------------------------------------
_prop_2      |
   _subpop_1 |   .5596591  _b[_prop_2:_subpop_1]
   _subpop_2 |   .5505618  _b[_prop_2:_subpop_2]
   _subpop_3 |   .5902256  _b[_prop_2:_subpop_3]
   _subpop_4 |   .5719298  _b[_prop_2:_subpop_4]
----------------------------------------------------
 
 /* Get difference of differences */
. lincom  ///
>    _b[_prop_2:_subpop_1] -_b[_prop_2:_subpop_2]  ///
>  - (_b[_prop_2:_subpop_3] - _b[_prop_2:_subpop_4])

 ( 1)  [_prop_2]_subpop_1 - [_prop_2]_subpop_2 - [_prop_2]_subpop_3 + [_prop_2]_subpop_4 =
       0

------------------------------------------------------------------------------
  Proportion |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |  -.0091984   .4237844    -0.02   0.983    -.8617426    .8433457
------------------------------------------------------------------------------

How did this work? Well, the four proportions in the second column are contained in the _prop_2 section of the -proportion- results. In the first table the proportions are 0.5597 and 0.5506, with difference 0.0091. In the second table the proportions are 0.5902 and 0.5719, with difference 0.0183.
The difference of these differences (first table - second) is 0.0091 - 0.0183 = -0.0092.

With two 3 x 3 tables, as in your problem, you'll have to be careful to match each proportion to its coefficient name, as listed by the coeflegend option. The -lincom- result shows that you can shorten the coefficient names. So the following would have also worked:

Code:

lincom  ///
   [_prop_2]_subpop_1 -[_prop_2]_subpop_2  ///
 - ([_prop_2]_subpop_3 - [_prop_2]_subpop_4)
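
The same -proportion- results also give the simpler contrast asked about in the question: one cell (same wave 1 row, same wave 3 column) compared between the two settings. What follows is only a sketch on the example data above. The coefficient names are the ones in the Stata 14 output listed earlier, and the `version 14.2` line is an assumption meant to keep that naming; other releases may label the coefficients differently, so check the coeflegend output before writing the -lincom- line.

```stata
* Sketch only: coefficient names follow the Stata 14 output shown above.
version 14.2
use http://www.stata-press.com/data/r14/byssin, clear
drop if workplace == 3          // reduce to 2 x 2 x 2, as above
label drop _all
gen setting  = race
gen typ_wav1 = workplace
gen typ_wav3 = smokes
svyset _n [pw = pop]

* One estimation command holds the proportions for both settings
svy: proportion typ_wav3, over(setting typ_wav1) coeflegend

* Row typ_wav1 == 1, column typ_wav3 == 2: setting 1 minus setting 2
lincom _b[_prop_2:_subpop_1] - _b[_prop_2:_subpop_3]
```

From the proportions listed above, the point estimate of this contrast is .5596591 - .5902256 = -.0305665; the interval reported by -lincom- then indicates how large the difference between settings might be.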

Notice that I say nothing about testing. It is apparent from your analysis that you are doing a descriptive investigation--computing proportions and their difference for a particular finite population at specific times. In such a case hypothesis testing is not appropriate. Why? Because an exact null hypothesis of zero difference between two parameters will never be true in the entire finite population (Cochran, 1977; Deming, 1966). Think, for example, of comparing the average weights of residents of two cities. If you weighed everyone, there would always be some difference, however small, in the average weights.

The only question you can legitimately ask is how big a difference is, and that question is answered by confidence intervals. See also Tom Lumley's answer at:
http://stats.stackexchange.com/quest...pc/84044#84044

References:
Cochran, W. G. (1977). Sampling techniques (3rd ed.). New York: Wiley, p. 39.

Deming, W. E. (1966). Some theory of sampling. New York: Dover Publications. Chapter 7, p. 247, "Distinction between enumerative and analytic studies".

Steve Samuels
Statistical Consulting

Stata 14.2

Josh Budlender

Join Date: Dec 2015

Posts: 15
#4

10 Jan 2016, 03:12

Dear Steve

Thank you very much for this detailed and clear response. Your point about hypothesis testing is also understood, thank you.

I realised a short while ago that I could work around my problem by using -svy: ratio- to construct ratios equivalent to the proportions I am interested in, and then use -lincom- to examine the difference between the ratios, similarly to what you do above. There is a useful example of using lincom in this way (using means) in the Stata manual on Survey data, under the post-estimation section (example 1).
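
A rough sketch of that -svy: ratio- workaround, written against the example data from the post above rather than my own. The indicator variables (num1, den1, num2, den2) and the ratio names (r1, r2) are made up for illustration, and I am assuming each ratio is stored under its own name so that -lincom- can refer to it as _b[r1] and _b[r2].

```stata
* Sketch only: indicator variables and ratio names (r1, r2) are illustrative.
use http://www.stata-press.com/data/r14/byssin, clear
drop if workplace == 3
gen setting  = race
gen typ_wav1 = workplace
gen typ_wav3 = smokes
svyset _n [pw = pop]

* Denominators: everyone in row typ_wav1 == 1, separately by setting
gen byte den1 = (setting == 1 & typ_wav1 == 1)
gen byte den2 = (setting == 2 & typ_wav1 == 1)

* Numerators: the same people who are also in column typ_wav3 == 2
gen byte num1 = (den1 == 1 & typ_wav3 == 2)
gen byte num2 = (den2 == 1 & typ_wav3 == 2)

* Each ratio reproduces one row proportion; both are estimated together,
* so their covariance is available to -lincom-
svy: ratio (r1: num1/den1) (r2: num2/den2)
lincom _b[r1] - _b[r2]
```

Because both ratios come from a single estimation command, nothing has to be stored and restored between tables.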

Your method is more efficient than using ratios like this, however.

Thank you very much again,

Josh Budlender