  • Independent (unpaired) ttest using weights

    Hi there,

    I am wanting to test that unemployment rates by race are statistically different from each other. The data is from a weighted labour force survey. The Stata Manual suggests: "For the equivalent of a two-sample t test with sampling weights (pweights), use the svy: mean command with the over() option, and then use lincom." I am using svy:prop with the over option but it does not allow me to stipulate that the samples are independent/unpaired?

    I am concerned because my confidence intervals are very large and overlap almost across the entire lower and upper bounds respectively, however the p-values are still 0.05.

    Thanks for the help,
    Justin
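
    A minimal sketch of the manual's suggestion, assuming a 0/1 indicator unemp and a grouping variable race (both names are hypothetical) and a design already declared with svyset. There is no paired/unpaired option to set: lincom takes the variances and covariances of the group estimates from the design-based variance matrix left behind by the estimation command.

    Code:
    * unemp (0/1) and race are hypothetical names, not from the survey file
    svy: mean unemp, over(race) coeflegend
    * coeflegend lists the coefficient names that lincom expects
    * Stata 15 and earlier, with groups shown as 1 and 2:
    lincom [unemp]2 - [unemp]1
    * Stata 16 and later name the same coefficients differently:
    * lincom _b[c.unemp@2.race] - _b[c.unemp@1.race]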

  • #2
    Hello Justin,

    Welcome to the Stata Forum!

    With regard to the wide CIs, you may get more helpful comments if you present details of the model, commands and output. Please present them under CODE delimiters, as also suggested in the FAQ.

    Concerning t tests under survey data analysis, you may wish to take a look at this text: http://www.ats.ucla.edu/stat/stata/faq/svyttest.htm

    Best,

    Marcos

    • #3
      Thanks Marcos.

      I am new to the forum so I appreciate the advice. My example is as follows:

      Code:
      svy:prop empl_status, over(wave)
      I then run a t-test to see if the 'wave1' unemployment rate is the same as the 'wave 2' unemployment rate.
      Code:
      lincom [Unemployed]wave1 - [Unemployed]wave2
      My question is then simply: how would I specify that the samples are dependent (paired)? The lincom command does not give you an option for this?

      If I have made more sense, I'd appreciate the input.

      Regards,
      Justin


      • #4
        Well, the command -svy:prop empl_status, over(wave)- would be inappropriate in the first place if you have paired samples. No, -lincom- doesn't have a paired or unpaired option: it is designed to compare parameter estimates from a Stata estimation command, so no paired/unpaired distinction arises there. Dealing with -paired- data requires appropriate steps in the estimation command itself.

        Do you actually have paired data? Or are you just thinking you want to try a different test because you don't like the looks of what the original approach produced? The best advice is what Marcos gave you in #2. Do that and you can get more specific guidance.


        • #5
          My data comes from a rotating panel. Hence my understanding here is that the observations are not independent and I need to do a paired test (but evidently I am missing something)? Here is my example:

          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input long STRATUM byte Status double Weight float myprovince str8 psu float wave
          171102 2  807.9929231 1 "17100826" 1
          210102 2  972.5970822 2 "20400082" 1
          212402 2  423.0392164 2 "21300240" 1
          543401 1  380.3713473 2 "23700161" 1
          522102 2  765.8896067 5 "51100613" 1
          638102 2  584.7641746 6 "61000058" 1
          774111 1 1464.5488124 7 "77403308" 1
          830103 1  231.8946935 8 "80200026" 1
          832409 2  260.5379986 8 "81700534" 1
          934403 2  284.0616959 9 "90700489" 1
          104102 1 1234.1264185 1 "11800165" 2
          210102 1   769.294491 2 "20900015" 2
          212201 1 1306.5666823 2 "21401020" 2
          419103 2  197.0015112 4 "41200145" 2
          420104 1   357.607638 4 "41700045" 2
          637401 1  333.7739748 6 "60200219" 2
          748102 1  745.1995248 7 "70200159" 2
          742103 1  708.3278536 7 "70400518" 2
          773203 2 2083.8181014 7 "77302577" 2
          935404 2  658.5212166 9 "91100059" 2
          end
          label values Status Status_exp
          label def Status_exp 1 "Employed", modify
          label def Status_exp 2 "Unemployed", modify
          label values myprovince myprovince
          label def myprovince 1 "Western Cape", modify
          label def myprovince 2 "Eastern Cape", modify
          label def myprovince 4 "Free State", modify
          label def myprovince 5 "KZN", modify
          label def myprovince 6 "North West", modify
          label def myprovince 7 "Gauteng", modify
          label def myprovince 8 "Mpumalanga", modify
          label def myprovince 9 "Limpopo", modify
          I first weight my data:

          Code:
          . svyset psu [pweight= Weight],    strata(STRATUM)    singleunit(scaled)
          
          pweight: Weight
          VCE: linearized
          Single unit: scaled
          Strata 1: STRATUM
          SU 1: psu
          FPC 1: <zero>
          ...and then I run my commands to estimate whether the change over time is significant:

          Code:
          . svy: prop Status, over(wave)
          (running proportion on estimation sample)
          
          Survey: Proportion estimation
          
          Number of strata =      19          Number of obs    =      20
          Number of PSUs   =      20          Population size  = 14569.9
                                              Design df        =       1
          
               Employed: Status = Employed
             Unemployed: Status = Unemployed
          
                      1: wave = 1
                      2: wave = 2
          
          --------------------------------------------------------------
                       |             Linearized
                  Over | Proportion   Std. Err.     [95% Conf. Interval]
          -------------+------------------------------------------------
          Employed     |
                     1 |   .3362883    .230853     -2.596977    3.269554
                     2 |   .6498383   .1398804     -1.127511    2.427187
          -------------+------------------------------------------------
          Unemployed   |
                     1 |   .6637117    .230853     -2.269554    3.596977
                     2 |   .3501617   .1398804     -1.427187    2.127511
          --------------------------------------------------------------
          Note: variance scaled to handle strata with a single sampling
                unit.
          
          . lincom [Unemployed]2 - [Unemployed]1
          
           ( 1)  - [Unemployed]1 + [Unemployed]2 = 0
          
          ------------------------------------------------------------------------------
            Proportion |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
          -------------+----------------------------------------------------------------
                   (1) |    -.31355   .0909726    -3.45   0.180    -1.469466    .8423664
          ------------------------------------------------------------------------------
          However my test example above has the assumption that the samples are independent - how do I run this test given that the samples are actually dependent (paired)? Or how do I estimate this in a different manner (but allowing me to use weights)?

          Thanks,
          Justin
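
          A worked check of where the very wide intervals in the output above come from, using only numbers printed there. With 20 PSUs in 19 strata the design has 1 degree of freedom, so the t critical value is about 12.71 rather than 1.96. This is arithmetic on the posted output, not a new analysis.

          Code:
          * Design df = number of PSUs - number of strata
          display 20 - 19
          * t critical value for a 95% interval with 1 df
          display invttail(1, 0.025)
          * Should reproduce the interval reported for Employed, wave 1
          display .3362883 - invttail(1, 0.025)*.230853
          display .3362883 + invttail(1, 0.025)*.230853
          * Two-sided p-value for the lincom t statistic of -3.45 on 1 df
          display 2*ttail(1, 3.45)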


          • #6
            Hello Justin,

            At first, you didn't mention that you have a longitudinal design. That is an important aspect. Going directly to the point, the issue with the strategy is exactly the one Clyde underlined, i.e., using "over" for the waves.

            That said, please make sure you have no missing data for the waves. Otherwise, use a subpopulation of respondents observed in all waves, such as -svy, subpop(twowaves):-. You may create a variable to tag the observations with all waves.
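
            A minimal sketch of such a tag, assuming the data hold one row per person per wave and that a variable ID (hypothetical at this point in the thread) identifies the same person across waves:

            Code:
            * ID is an assumed person identifier; twowaves is the tag named above
            bysort ID: generate byte twowaves = (_N == 2)
            * Restrict estimation to people seen in both waves, keeping the full design
            svy, subpop(twowaves): proportion Status, over(wave)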

            Hopefully that helps.

            Best,

            Marcos
            Last edited by Marcos Almeida; 29 Sep 2016, 06:18.

            • #7
              On second thoughts, after checking the commands and output of your model:

              a) You have just 20 observations, a tiny sample size even for a simple t-test with 2 groups of 10 each, let alone for a survey data analysis. Maybe you just selected a fraction to exemplify...

              b) "Status" being a binary variable, you may get the proportion with any of the commands -svy: prop-, -svy: mean- and -svy: tab-.
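
              As a rough illustration of point (b), assuming Status is coded 1 = Employed and 2 = Unemployed as in the posted example (the indicator name unemployed is made up here):

              Code:
              * 0/1 indicator whose mean is the unemployment proportion
              generate byte unemployed = (Status == 2) if !missing(Status)
              svy: mean unemployed, over(wave)
              * The same proportions as column percentages of a two-way table
              svy: tabulate Status wave, column se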


              By the way, the book Applied Survey Data Analysis (Steven G. Heeringa, Brady T. West, Patricia A. Berglund), on page 143, chapter 5, section 5.6.2, entitled "Comparing Means over Time", suggests the use of "svy: mean namevar, over(wave)", followed by -lincom-, for a case similar to yours ( https://books.google.com.br/books?id...ing%20&f=false ).

              Hopefully that helps!

              Best,

              Marcos
              Last edited by Marcos Almeida; 29 Sep 2016, 07:14. Reason: Edited for inclusion of a referencial book

              • #8
                So if this is longitudinal data, you need to have a variable which identifies individual respondents consistently across the waves of the survey. You don't show such a variable in your example, so it isn't possible to proceed.


                • #9
                  Sorry, I was giving an example here (there are 30,000 individuals) and realise that I selected 10 individuals from the first wave and a different 10 individuals from the second wave. But you can understand what I am getting at here - i.e. how do you set up the test in a way which assumes that the samples are dependent not independent?


                  • #10
                    Sorry, Justin, but I really don't understand what has happened. At first, you asked for an unpaired t test (in #1), and a solution was given. Then the theme changed to a paired t test and, on seeing the waves in your data set, a recommendation for such a test between 2 waves was given, along with a link to a reference book. It seems the data you showed are not exactly the data you wish to work with. Now you say there is no wave in your study, and yet you wish to perform a paired t test. How that could be, given the information provided here, I fail to see.

                    Hopefully you will get more insightful advice.

                    Kind regards,

                    Marcos

                    • #11
                      Originally posted by Justin Visagie
                      Sorry, I was giving an example here (there are 30,000 individuals) and realise that I selected 10 individuals from the first wave and a different 10 individuals from the second wave. But you can understand what I am getting at here - i.e. how do you set up the test in a way which assumes that the samples are dependent not independent?
                      But how can you even tell that this is the case from what you posted? You don't show any variable that identifies the individuals. Do you have such a variable? If so, show an example which includes it. If you don't have such a variable, then you cannot do a paired-analysis. To do a paired analysis you need to be able to identify which observation is paired with which other observation.
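
                      A quick way to check whether the data can support a paired analysis, sketched under the assumption that a person identifier exists (called ID here purely for illustration):

                      Code:
                      * ID is a hypothetical person identifier; wave is the survey wave
                      * isid stops with an error unless ID and wave uniquely identify rows
                      isid ID wave
                      * Number of waves in which each person appears
                      bysort ID: generate byte nwaves = _N
                      egen byte oneper = tag(ID)
                      tabulate nwaves if oneper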

                      Comment


                      • #12
                        I see that I am being as clear as mud - so I am going to give it one last go (thanks for the patience!): ("UQNO" and "PERSONNO" in the dataset are the individual identifiers - I have created "ID" to make it easier)
                        Code:
                        * Example generated by -dataex-. To install: ssc install dataex
                        clear
                        input str18 UQNO byte PERSONNO long STRATUM byte Status double Weight float myprovince str8 psu float(wave ID)
                        "110000660000020101" 1 102102 1  555.7417463 1 "11000066" 1  1
                        "110000660000020101" 1 102102 1  599.2113306 1 "11000066" 2  1
                        "171013980000020201" 1 171112 1  621.9730882 1 "17101398" 1  2
                        "171013980000020201" 1 171112 1  671.5553543 1 "17101398" 2  2
                        "171023320000005801" 3 171120 1  312.6357523 1 "17102332" 1  3
                        "171023320000005801" 3 171120 1  353.1812439 1 "17102332" 2  3
                        "171044010000001301" 3 171112 2   373.244352 1 "17104401" 1  4
                        "171044010000001301" 3 171112 2  374.5688729 1 "17104401" 2  4
                        "214005510000002501" 2 212404 1  456.0182191 2 "21400551" 1  5
                        "214005510000002501" 2 212404 1  444.1893343 2 "21400551" 2  5
                        "405001500000004601" 2 417101 2  668.1298938 4 "40500150" 1  6
                        "405001500000004601" 2 417101 2  738.7751504 4 "40500150" 2  6
                        "572019860000010201" 5 572108 1  997.0189523 5 "57201986" 1  7
                        "572019860000010201" 5 572108 1  714.7814299 5 "57201986" 2  7
                        "681000840000002301" 1 345401 1  623.8988706 6 "68100084" 1  8
                        "681000840000002301" 1 345401 1  799.9441265 6 "68100084" 2  9
                        "706000300000014201" 1 742501 1 3936.1005884 7 "70600030" 1  9
                        "706000300000014201" 1 742501 2 3273.5501963 7 "70600030" 2  9
                        "802001130000002901" 4 830103 2  699.4802876 8 "80200113" 1 10
                        "802001130000002901" 4 830103 2   531.064149 8 "80200113" 2 10
                        "813000480000011801" 1 831402 1    572.36369 8 "81300048" 1 11
                        "813000480000011801" 1 831402 1  586.4560197 8 "81300048" 2 11
                        "817002680000011901" 1 832408 1  393.8432799 8 "81700268" 1 12
                        "817002680000011901" 1 832408 1  421.9556498 8 "81700268" 2 12
                        "817002680000013501" 2 832408 1  583.9800323 8 "81700268" 1 13
                        "817002680000013501" 2 832408 1  653.4718418 8 "81700268" 2 13
                        "908002790000008801" 1 934401 2  829.6563427 9 "90800279" 1 14
                        "908002790000008801" 1 934401 2  883.3003076 9 "90800279" 2 14
                        "910001580000011601" 3 935401 1 1186.3311452 9 "91000158" 1 15
                        "910001580000011601" 3 935401 1  1228.646648 9 "91000158" 2 15
                        end
                        label values Status Status_exp
                        label def Status_exp 1 "Employed", modify
                        label def Status_exp 2 "Unemployed", modify
                        label values myprovince myprovince
                        label def myprovince 1 "Western Cape", modify
                        label def myprovince 2 "Eastern Cape", modify
                        label def myprovince 4 "Free State", modify
                        label def myprovince 5 "KZN", modify
                        label def myprovince 6 "North West", modify
                        label def myprovince 7 "Gauteng", modify
                        label def myprovince 8 "Mpumalanga", modify
                        label def myprovince 9 "Limpopo", modify

                        1) If I am looking at testing that the unemployment rate is different across districts (independent samples) then the correct estimation here will be: "svy: prop Status, over(myprovince)" followed up by "lincom". Thanks Marcos, I am happy that this is solved! See below:

                        Code:
                        . svy, subpop(if Status!=4): prop Status, over(myprovince)
                        (running proportion on estimation sample)
                        
                        Survey: Proportion estimation
                        
                        Number of strata =      13          Number of obs    =      30
                        Number of PSUs   =      14          Population size  = 25085.1
                                                            Subpop. no. obs  =      30
                                                            Subpop. size     = 25085.1
                                                            Design df        =       1
                        
                             Employed: Status = Employed
                           Unemployed: Status = Unemployed
                        
                            _subpop_1: myprovince = Western Cape
                            _subpop_2: myprovince = Eastern Cape
                            _subpop_3: myprovince = Free State
                                  KZN: myprovince = KZN
                            _subpop_5: myprovince = North West
                              Gauteng: myprovince = Gauteng
                           Mpumalanga: myprovince = Mpumalanga
                              Limpopo: myprovince = Limpopo
                        
                        --------------------------------------------------------------
                                     |             Linearized
                                Over | Proportion   Std. Err.     [95% Conf. Interval]
                        -------------+------------------------------------------------
                        Employed     |
                           _subpop_1 |   .8063719   .7967823     -9.317707    10.93045
                           _subpop_2 |          1          .             .           .
                           _subpop_3 |          .  (no observations)
                                 KZN |          1          .             .           .
                           _subpop_5 |          1          .             .           .
                             Gauteng |   .5459489          .             .           .
                          Mpumalanga |   .7230135          .             .           .
                             Limpopo |    .585033          .             .           .
                        -------------+------------------------------------------------
                        Unemployed   |
                           _subpop_1 |   .1936281   .7967823     -9.930451    10.31771
                           _subpop_2 |          .  (no observations)
                           _subpop_3 |          1          .             .           .
                                 KZN |          .  (no observations)
                           _subpop_5 |          .  (no observations)
                             Gauteng |   .4540511          .             .           .
                          Mpumalanga |   .2769865          .             .           .
                             Limpopo |    .414967          .             .           .
                        --------------------------------------------------------------
                        Note: variance scaled to handle strata with a single sampling
                              unit.
                        
                        . lincom [Unemployed]_subpop_1 - [Unemployed]Mpumalanga
                        
                         ( 1)  [Unemployed]_subpop_1 - [Unemployed]Mpumalanga = 0
                        
                        ------------------------------------------------------------------------------
                          Proportion |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                        -------------+----------------------------------------------------------------
                                 (1) |  -.0833585   .7967823    -0.10   0.934    -10.20744    10.04072
                        ------------------------------------------------------------------------------
                        2) If I am looking at testing the unemployment rate across waves I could follow the same approach as (1) above:

                        Code:
                        . svy, subpop(if Status!=4): prop Status, over(wave)
                        (running proportion on estimation sample)
                        
                        Survey: Proportion estimation
                        
                        Number of strata =      13          Number of obs    =      30
                        Number of PSUs   =      14          Population size  = 25085.1
                                                            Subpop. no. obs  =      30
                                                            Subpop. size     = 25085.1
                                                            Design df        =       1
                        
                             Employed: Status = Employed
                           Unemployed: Status = Unemployed
                        
                                    1: wave = 1
                                    2: wave = 2
                        
                        --------------------------------------------------------------
                                     |             Linearized
                                Over | Proportion   Std. Err.     [95% Conf. Interval]
                        -------------+------------------------------------------------
                        Employed     |
                                   1 |   .7993421   .1190986     -.7139489    2.312633
                                   2 |    .527379   .1512556     -1.394506    2.449264
                        -------------+------------------------------------------------
                        Unemployed   |
                                   1 |   .2006579   .1190986     -1.312633    1.713949
                                   2 |    .472621   .1512556     -1.449264    2.394506
                        --------------------------------------------------------------
                        Note: variance scaled to handle strata with a single sampling
                              unit.
                        
                        . lincom [Unemployed]2 - [Unemployed]1
                        
                         ( 1)  - [Unemployed]1 + [Unemployed]2 = 0
                        
                        ------------------------------------------------------------------------------
                          Proportion |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                        -------------+----------------------------------------------------------------
                                 (1) |   .2719632    .032157     8.46   0.075    -.1366305    .6805568
                        ------------------------------------------------------------------------------
                        But using this method would be assuming the samples are independent (unpaired) when they are actually panel data? This is my concern.
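
                        One way to see whether the two wave estimates are being treated as independent is to back out the covariance implied by the output above, using Var(difference) = V1 + V2 - 2*Cov. If the waves were treated as independent, the standard error of the difference would be about .19 rather than the .032 that lincom reports. This is only arithmetic on the printed numbers from a toy sample with 1 design df, so it is a sketch rather than a general result.

                        Code:
                        * Standard error of the difference if the covariance were zero
                        display sqrt(.1190986^2 + .1512556^2)
                        * Covariance implied by the lincom standard error of .032157
                        display (.1190986^2 + .1512556^2 - .032157^2)/2
                        * Implied correlation between the wave 1 and wave 2 estimates
                        display (.1190986^2 + .1512556^2 - .032157^2)/(2*.1190986*.1512556)
                        * After the real estimation, the full matrix can be inspected with:
                        * estat vce, correlation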

                        I really apologise if I am still not clear - thank you for bearing with me.

                        Best,
                        Justin


                        • #13
                          Sorry - also note that in example 1 (independent samples- unemployment by district) I would restrict the sample to one particular year: "svy, subpop(if wave==1): prop Status, over(myprovince)" which I forgot to do. But I am happy that I am doing the right estimation for example type 1 - it's example type 2 that I would like the input.


                          • #14
                            This gets somewhat complicated. You can't do this with -proportion-, because -proportion- is not designed to handle paired observations, repeated measures, etc. In principle you can do this by running a 2-level logistic model with Status as the outcome and i.wave as the predictor. The test of significance for the coefficient of wave will be what you want. There are a few technical obstacles to this solution, some of which are easily overcome, and one, perhaps not so easily.

                            1. Your variable Status is coded as 1/2. To serve as an outcome variable in a logistic model it needs to be coded 0/1. That's easy to fix.

                            2. You need a new single variable that identifies individuals (referred to as person in the code below). That's also easy: see -help egen-, with reference to the -group()- function.

                            3. To use -melogit- with svy:, you need to re-do the -svyset- on your data specifying pweights at both the observation and the person level. If you have the information available to do that, then you're in good shape. I do not have the expertise in survey data analysis to advise you on how you might derive these weights from whatever information is in your data set.
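
                            Steps 1 and 2 might look like the following sketch, assuming the variable names from the example in #12 (Status coded 1/2, UQNO and PERSONNO as the identifiers). Step 3 is left out because it depends on weights that are not in the posted data.

                            Code:
                            * 1. Recode the 1/2 outcome to 0/1 (Employed = 1); other codes become missing
                            generate byte employed = (Status == 1) if inlist(Status, 1, 2)
                            * 2. One identifier per person, built from household and person numbers
                            egen long person = group(UQNO PERSONNO)
                            * Check the recode against the original coding
                            tabulate Status employed, missing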

                            If you overcome these obstacles, then I think you can just go ahead with:
                            Code:
                            svy, subpop(if Status != 4): melogit employed i.wave || person:
                            margins wave
                            By the way, I don't understand why you are specifying (if Status != 4) when Status only takes on values 1 and 2 (or 0 and 1 after you recode it.)
                            Last edited by Clyde Schechter; 30 Sep 2016, 09:37. Reason: Fix typos.


                            • #15
                              Dear Clyde and Marcos,

                              Thank you very much for the help!!!! I think I have what I need!

                              PS - Status!=4 is for those individuals who shift into the Not Economically Active category, but I see that there are none of these cases in my example, so it is not necessary.
