
  • Storing estimates after -svy, tab-, to compare with estimates from another sample

    Hi

    My data is panel data, with 3 different waves. Additionally, it has been split into 3 different samples: those who lived in "formal urban" areas in Wave 1 and Wave 3, those who lived in "informal urban" areas in Waves 1 and 3, and those who lived in "rural" areas in Wave 1 and 3. It is complex survey data.

    If I execute
    Code:
    svy, subpop(panel_empl_formal): tab w1_empl_stat w3_empl_stat, row ci percent
    I get the change in labour market outcomes between waves for the "formal urban" sample ("Table 1"):
    Code:
    ----------------------------------------------------------------------
    w1 crude  |
    employmen |                 w3 crude employment status                
    t status  |        0, NEA       1, Unemp       2, Emplo          Total
    ----------+-----------------------------------------------------------
       0, NEA |         53.97          18.59          27.43            100
              | [49.04,58.83]  [15.43,22.24]  [23.69,31.53]               
              | 
     1, Unemp |         24.53          33.88          41.59            100
              | [20.27,29.36]  [28.82,39.34]  [36.04,47.35]               
              | 
     2, Emplo |         14.16          9.345          76.49            100
              | [11.68,17.07]   [7.422,11.7]   [72.95,79.7]               
              | 
        Total |          26.7           16.5           56.8            100
              | [24.29,29.26]  [14.21,19.08]   [53.33,60.2]               
    ----------------------------------------------------------------------
      Key:  row percentage
            [95% confidence interval for row percentage]
    The estimates in cells 2:1 and 2:2 (24.53 and 33.88, shown in blue in the original post) have overlapping confidence intervals, so at first glance I cannot tell whether the difference between them is statistically significant at the 95% level. So I execute
    Code:
    test _b[p21] = _b[p22]
    which gives me the output
    Code:
    Adjusted Wald test
    
     ( 1)  p21 - p22 = 0
    
           F(  1,   314) =    5.22
                Prob > F =    0.0230
    And I therefore know that the estimates can be said to be different at the 95% level. This is all fine.

    My question now is about how to compare estimates from the "formal urban" sample to another sample, such as the "informal urban" sample. As soon as I run another -svy, tab- command, the stored estimates for "Table 1" are replaced. So I need a way to save the estimates, and then bring them back for comparison. The commands I thought would work for this are -estimates store name- and -suest-.

    However when I tried the following
    Code:
    svy, subpop(panel_empl_for): tab w1_empl_stat_crude w3_empl_stat_crude, row ci percent
    estimates store FORMAL
    svy, subpop(panel_empl_rur): tab w1_empl_stat_crude w3_empl_stat_crude, row ci percent
    estimates store INFORMAL
    
    suest FORMAL INFORMAL
    I get
    Code:
    impossible to retrieve e(b) and e(V) in FORMAL  
     r(198); 
    Probably related to this, when I run -estimates replay FORMAL- I get
    Code:
    --------------------------------------------------------------------------------------------------------
    Model FORMAL
    --------------------------------------------------------------------------------------------------------
    varlist required
    r(100);

    Can I not use -estimates store- after -svy, tab-? Or am I doing something wrong? And if I cannot use -estimates store-, is there another way around my problem?

    Thanks very much for any help - and any questions or points of clarity are very welcome.


    NOTE: this is a cross-post from http://www.statalist.org/forums/foru...ex-survey-data, as that thread took a very long-winded route before getting to this point, and is very long and messy to follow. Most importantly, I think the question has changed somewhat from what I asked in that topic. However if you want some more background on this question, that thread may be informative.


  • #2
    To make clearer what I mean about comparing estimates from different samples:

    If I execute (as above)
    Code:
    svy, subpop(panel_empl_formal): tab w1_empl_stat w3_empl_stat, row ci percent
    I get (Table 1):

    Code:
    ----------------------------------------------------------------------
    w1 crude  |
    employmen |                 w3 crude employment status                
    t status  |        0, NEA       1, Unemp       2, Emplo          Total
    ----------+-----------------------------------------------------------
       0, NEA |         53.97          18.59          27.43            100
              | [49.04,58.83]  [15.43,22.24]  [23.69,31.53]               
              | 
     1, Unemp |         24.53          33.88          41.59            100
              | [20.27,29.36]  [28.82,39.34]  [36.04,47.35]               
              | 
     2, Emplo |         14.16          9.345          76.49            100
              | [11.68,17.07]   [7.422,11.7]   [72.95,79.7]               
              | 
        Total |          26.7           16.5           56.8            100
              | [24.29,29.26]  [14.21,19.08]   [53.33,60.2]               
    ----------------------------------------------------------------------
      Key:  row percentage
            [95% confidence interval for row percentage]
    Table 1 shows labour market mobility for people in formal urban areas. Say I now want to compare this to mobility in informal urban areas:

    Then I execute
    Code:
    svy, subpop(panel_empl_informal): tab w1_empl_stat w3_empl_stat, row ci percent
    and get (Table 2):

    Code:
    ----------------------------------------------------------------------
    w1 crude  |
    employmen |                 w3 crude employment status                
    t status  |        0, NEA       1, Unemp       2, Emplo          Total
    ----------+-----------------------------------------------------------
       0, NEA |         50.55           31.8          17.65            100
              | [44.64,56.44]   [23.39,41.6]  [10.46,28.23]               
              | 
     1, Unemp |          28.9          32.23          38.87            100
              | [22.87,35.78]  [23.55,42.33]  [31.14,47.21]               
              | 
     2, Emplo |         26.11          13.06          60.82            100
              | [18.39,35.67]  [9.133,18.34]  [49.28,71.27]               
              | 
        Total |         33.84          23.86          42.29            100
              | [27.47,40.86]  [18.28,30.52]   [33.4,51.72]               
    ----------------------------------------------------------------------
      Key:  row percentage
            [95% confidence interval for row percentage]
    I want to know whether the move from Unemployed in Wave 1 to NEA in Wave 3 differs significantly between "formal urban" and "informal urban" areas. These are the estimates 24.53 (Table 1) and 28.9 (Table 2), shown in blue in the original post. Their confidence intervals overlap, so I need to perform a test. This is what I'm asking how to do.

    Thanks again,

    Josh Budlender



    • #3
      You can't do this with -svy: tab-: as your error messages show, its results cannot be restored by -estimates store- or combined by -suest-. But you can do it with -proportion- followed by -lincom-. You have a 2 x 3 x 3 data structure. I don't have an example with such a structure, so I'll show the solution for a 2 x 2 x 2 structure. Here's the code.

      The urban/rural setting will be indicated by the variable "setting". The wave 1 and wave 3 variables are typ_wav1 and typ_wav3. The proportions contrasted in each table are those in the second column.
      Code:
      /* Set up data set */
      use http://www.stata-press.com/data/r14/byssin, clear
      drop if workplace ==3  // reduce to 2 x 2 x 2
      label drop _all
      gen setting = race
      gen typ_wav1 = workplace
      gen typ_wav3 = smokes
      keep setting typ* pop
      
      /* Do survey analyses */
      svyset _n [pw = pop]
      svy, subpop(if setting==1): tab typ_wav1 typ_wav3, row se
      svy, subpop(if setting==2): tab typ_wav1 typ_wav3, row se
      svy: prop typ_wav3, over(setting typ_wav1) coeflegend
      
      /* Get difference of differences */
      lincom  ///
         _b[_prop_2:_subpop_1] -_b[_prop_2:_subpop_2]  ///
       - (_b[_prop_2:_subpop_3] - _b[_prop_2:_subpop_4])
      Results (abbreviated):
      Code:
      . svy, subpop(if setting==1): tab typ_wav1 typ_wav3, row se
      -------------------------------------
                |         typ_wav3        
       typ_wav1 |       1        2    Total
      ----------+--------------------------
              1 |   .4403    .5597        1
                | (.2339)  (.2339)        
                |
              2 |   .4494    .5506        1
                | (.2591)  (.2591)        
                |
          Total |    .443     .557        1
                | (.1816)  (.1816)        
      -------------------------------------
      
      . svy, subpop(if setting==2): tab typ_wav1 typ_wav3, row se
      
      -------------------------------------
                |         typ_wav3        
       typ_wav1 |       1        2    Total
      ----------+--------------------------
              1 |   .4098    .5902        1
                | (.1614)  (.1614)        
                |
              2 |   .4281    .5719        1
                | (.1781)  (.1781)        
                |
          Total |   .4146    .5854        1
                | (.1281)  (.1281)        
      -------------------------------------
      . svy: prop typ_wav3, over(setting typ_wav1) coeflegend
            _prop_1: typ_wav3 = 1
            _prop_2: typ_wav3 = 2
      
               Over: setting typ_wav1
          _subpop_1: 1 1
          _subpop_2: 1 2
          _subpop_3: 2 1
          _subpop_4: 2 2
      ------------------------------------------------------------------------------
              Over | Proportion  Legend
      -------------+----------------------------------------------------------------
      _prop_1      |
         _subpop_1 |   .4403409  _b[_prop_1:_subpop_1]
         _subpop_2 |   .4494382  _b[_prop_1:_subpop_2]
         _subpop_3 |   .4097744  _b[_prop_1:_subpop_3]
         _subpop_4 |   .4280702  _b[_prop_1:_subpop_4]
      -------------+----------------------------------------------------------------
      _prop_2      |
         _subpop_1 |   .5596591  _b[_prop_2:_subpop_1]
         _subpop_2 |   .5505618  _b[_prop_2:_subpop_2]
         _subpop_3 |   .5902256  _b[_prop_2:_subpop_3]
         _subpop_4 |   .5719298  _b[_prop_2:_subpop_4]
      ----------------------------------------------------
       
       /* Get difference of differences */
      . lincom  ///
      >    _b[_prop_2:_subpop_1] -_b[_prop_2:_subpop_2]  ///
      >  - (_b[_prop_2:_subpop_3] - _b[_prop_2:_subpop_4])
      
       ( 1)  [_prop_2]_subpop_1 - [_prop_2]_subpop_2 - [_prop_2]_subpop_3 + [_prop_2]_subpop_4 =
             0
      
      ------------------------------------------------------------------------------
        Proportion |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      -------------+----------------------------------------------------------------
               (1) |  -.0091984   .4237844    -0.02   0.983    -.8617426    .8433457
      ------------------------------------------------------------------------------
      How did this work? Well, the four proportions in the second column are contained in the _prop_2 section of the -proportion- results. In the first table the proportions are 0.5597 and 0.5506, with difference .0091. In the second table the proportions are 0.5902 and 0.5719, with difference 0.0183.
      The difference of these differences (first table - second) is 0.0091 - 0.0183 = -0.0092.

      With two 3 x 3 tables, as in your problem, you'll have to be careful to match each proportion to its coefficient name. The -lincom- result shows that you can shorten the coefficient names. So the following would have also worked:
      Code:
      lincom  ///
         [_prop_2]_subpop_1 -[_prop_2]_subpop_2  ///
       - ([_prop_2]_subpop_3 - [_prop_2]_subpop_4)
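      For the single comparison asked about in #2 (one cell compared across two settings, rather than a difference of differences), a minimal sketch on the same toy data might look like the following. It assumes the Stata 14 coefficient names shown in the legend above; newer releases may label -proportion- results differently, so check the -coeflegend- output before copying any names.

```stata
* Sketch only: contrast one cell between setting 1 and setting 2
version 14.2   // assumption: keeps the _prop_#/_subpop_# names shown in the legend

* Same toy data set-up as in the example above
use http://www.stata-press.com/data/r14/byssin, clear
drop if workplace == 3
label drop _all
gen setting  = race
gen typ_wav1 = workplace
gen typ_wav3 = smokes
keep setting typ* pop
svyset _n [pw = pop]

* One estimation run holds the proportions for every setting x wave-1 group
svy: proportion typ_wav3, over(setting typ_wav1) coeflegend

* typ_wav1 == 1 is _subpop_1 in setting 1 and _subpop_3 in setting 2,
* so this contrasts the same cell (second column, first row) across settings
lincom _b[_prop_2:_subpop_1] - _b[_prop_2:_subpop_3]
```

      Going by the legend values, the point estimate should be .5596591 - .5902256 = -.0305665; the confidence interval that -lincom- reports alongside it shows how large the difference could plausibly be.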
      Notice that I say nothing about testing. It is apparent from your analysis that you are doing a descriptive investigation: computing proportions and their differences for a particular finite population at specific times. In such a case hypothesis testing is not appropriate. Why? Because an exact null hypothesis of zero difference between two parameters will never be true in the entire finite population (Cochran, 1977; Deming, 1966). Think, for example, of comparing the average weights of residents of two cities. If you weighed everyone, there would always be some difference, however small, in the average weights.

      The only question you can legitimately ask is "how big is the difference?", and that question is answered by confidence intervals.
      See also Tom Lumley's answer at:
      http://stats.stackexchange.com/quest...pc/84044#84044



      References:
      Cochran, W. G. (1977). Sampling techniques (3rd ed.). New York: Wiley, p. 39.

      Deming, W. E. (1966). Some theory of sampling. New York: Dover Publications. Chapter 7, p. 247, "Distinction between enumerative and analytic studies".


      Steve Samuels
      Statistical Consulting

      Stata 14.2



      • #4
        Dear Steve

        Thank you very much for this detailed and clear response. Your point about hypothesis testing is also understood, thank you.

        I realised a short while ago that I could work around my problem by using -svy: ratio- to construct ratios equivalent to the proportions I am interested in, and then use -lincom- to examine the difference between the ratios, similarly to what you do above. There is a useful example of using lincom in this way (using means) in the Stata manual on Survey data, under the post-estimation section (example 1).

        Your method is more efficient than using ratios like this, however.
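
        For anyone who wants to see the ratio route, here is a rough sketch on the toy data from #3. It is an illustration under assumptions, not the exact code I ran: the indicator variables (num_s1, den_s1, ...) and ratio names (p_s1, p_s2) are made up, and I am assuming that named ratios are referenced as _b[name], which -coeflegend- should confirm.

```stata
* Sketch only: rebuild two cell proportions as ratios, then contrast them
version 14.2   // assumption: matches the Stata release used in this thread

* Same toy data set-up as in #3
use http://www.stata-press.com/data/r14/byssin, clear
drop if workplace == 3
label drop _all
gen setting  = race
gen typ_wav1 = workplace
gen typ_wav3 = smokes
keep setting typ* pop
svyset _n [pw = pop]

* Denominators flag the wave-1 group within each setting;
* numerators flag the members of that group who end up in typ_wav3 == 2
gen byte den_s1 = (setting == 1 & typ_wav1 == 1)
gen byte num_s1 = (den_s1 & typ_wav3 == 2)
gen byte den_s2 = (setting == 2 & typ_wav1 == 1)
gen byte num_s2 = (den_s2 & typ_wav3 == 2)

* Each weighted ratio num/den equals the corresponding row proportion
svy: ratio (p_s1: num_s1/den_s1) (p_s2: num_s2/den_s2), coeflegend

* Difference between settings, with its standard error and confidence interval
lincom _b[p_s1] - _b[p_s2]
```

        The point estimate should match the -proportion- contrast of the same two cells (.5596591 - .5902256 going by the legend in #3), but it takes four extra variables to get there.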

        Thank you very much again,

        Josh Budlender
