Independent (unpaired) ttest using weights

Justin Visagie

Join Date: Sep 2016

Posts: 9
#1

28 Sep 2016, 04:41

Hi there,

I want to test whether unemployment rates by race are statistically different from each other. The data come from a weighted labour force survey. The Stata Manual suggests: "For the equivalent of a two-sample t test with sampling weights (pweights), use the svy: mean command with the over() option, and then use lincom." I am using svy:prop with the over() option, but it does not seem to allow me to stipulate that the samples are independent/unpaired?

I am concerned because my confidence intervals are very wide and overlap across almost their entire range, yet the p-values are still below 0.05.

Thanks for the help,
Justin
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#2

28 Sep 2016, 05:02

Hello Justin,

Welcome to the Stata Forum!

Regarding the wide CIs, you are more likely to get helpful comments if you present details of the model, commands and output. Please present them within CODE delimiters, as suggested in the FAQ.

As for performing t tests with survey data, you may wish to take a look at this text: http://www.ats.ucla.edu/stat/stata/faq/svyttest.htm

Best,

Marcos

Best regards,

Marcos
Justin Visagie

Join Date: Sep 2016

Posts: 9
#3

28 Sep 2016, 12:38

Thanks Marcos.

I am new to the forum so I appreciate the advice. My example is as follows:

Code:

svy:prop empl_status, over(wave)

I then run a t-test to see if the 'wave1' unemployment rate is the same as the 'wave 2' unemployment rate.

Code:

lincom [Unemployed]wave1 - [Unemployed]wave2

My question is then simply: how would I specify that the samples are dependent (paired)? The lincom command does not give you an option for this?

If I have made more sense, I'd appreciate the input.

Regards,
Justin
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#4

28 Sep 2016, 12:59

Well, the command -svy:prop empl_status, over(wave)- would be inappropriate in the first place if you have paired samples. No, -lincom- doesn't have a paired or unpaired option: it is designed to compare parameter estimates from a Stata estimation command, so no paired/unpaired distinction arises there. Dealing with -paired- data requires appropriate steps in the estimation command itself.

Do you actually have paired data? Or are you just thinking you want to try a different test because you don't like the look of the results from the original approach? The best advice is what Marcos gave you in #2. Do that and you can get more specific guidance.

Justin Visagie

Join Date: Sep 2016

Posts: 9
#5

29 Sep 2016, 02:49

My data comes from a rotating panel. Hence my understanding here is that the observations are not independent and I need to do a paired test (but evidently I am missing something)? Here is my example:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long STRATUM byte Status double Weight float myprovince str8 psu float wave
171102 2  807.9929231 1 "17100826" 1
210102 2  972.5970822 2 "20400082" 1
212402 2  423.0392164 2 "21300240" 1
543401 1  380.3713473 2 "23700161" 1
522102 2  765.8896067 5 "51100613" 1
638102 2  584.7641746 6 "61000058" 1
774111 1 1464.5488124 7 "77403308" 1
830103 1  231.8946935 8 "80200026" 1
832409 2  260.5379986 8 "81700534" 1
934403 2  284.0616959 9 "90700489" 1
104102 1 1234.1264185 1 "11800165" 2
210102 1   769.294491 2 "20900015" 2
212201 1 1306.5666823 2 "21401020" 2
419103 2  197.0015112 4 "41200145" 2
420104 1   357.607638 4 "41700045" 2
637401 1  333.7739748 6 "60200219" 2
748102 1  745.1995248 7 "70200159" 2
742103 1  708.3278536 7 "70400518" 2
773203 2 2083.8181014 7 "77302577" 2
935404 2  658.5212166 9 "91100059" 2
end
label values Status Status_exp
label def Status_exp 1 "Employed", modify
label def Status_exp 2 "Unemployed", modify
label values myprovince myprovince
label def myprovince 1 "Western Cape", modify
label def myprovince 2 "Eastern Cape", modify
label def myprovince 4 "Free State", modify
label def myprovince 5 "KZN", modify
label def myprovince 6 "North West", modify
label def myprovince 7 "Gauteng", modify
label def myprovince 8 "Mpumalanga", modify
label def myprovince 9 "Limpopo", modify

I first weight my data:

Code:

. svyset psu [pweight= Weight],    strata(STRATUM)    singleunit(scaled)

pweight: Weight
VCE: linearized
Single unit: scaled
Strata 1: STRATUM
SU 1: psu
FPC 1: <zero>

...and then I run my commands to estimate whether the change over time is significant:

Code:

. svy: prop Status, over(wave)
(running proportion on estimation sample)

Survey: Proportion estimation

Number of strata =      19          Number of obs    =      20
Number of PSUs   =      20          Population size  = 14569.9
                                    Design df        =       1

     Employed: Status = Employed
   Unemployed: Status = Unemployed

            1: wave = 1
            2: wave = 2

--------------------------------------------------------------
             |             Linearized
        Over | Proportion   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
Employed     |
           1 |   .3362883    .230853     -2.596977    3.269554
           2 |   .6498383   .1398804     -1.127511    2.427187
-------------+------------------------------------------------
Unemployed   |
           1 |   .6637117    .230853     -2.269554    3.596977
           2 |   .3501617   .1398804     -1.427187    2.127511
--------------------------------------------------------------
Note: variance scaled to handle strata with a single sampling
      unit.

. lincom [Unemployed]2 - [Unemployed]1

 ( 1)  - [Unemployed]1 + [Unemployed]2 = 0

------------------------------------------------------------------------------
  Proportion |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |    -.31355   .0909726    -3.45   0.180    -1.469466    .8423664
------------------------------------------------------------------------------

However my test example above has the assumption that the samples are independent - how do I run this test given that the samples are actually dependent (paired)? Or how do I estimate this in a different manner (but allowing me to use weights)?

Thanks,
Justin
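A side note on the output in this post, offered as a sketch and not as part of the original replies: the very wide intervals appear to follow from the design degrees of freedom. With 20 PSUs in 19 strata the header reports Design df = 1, and the t critical value on 1 degree of freedom is about 12.7. The lines below redo the -lincom- arithmetic by hand using only numbers copied from the output above; the scalar names are made up for illustration.

```stata
* Numbers copied from the -lincom- output in this post
scalar diff_b    = -.31355
scalar diff_se   = .0909726
* Design df = number of PSUs - number of strata = 20 - 19
scalar design_df = 1
* Two-sided 95% critical value of the t distribution on design_df
scalar t_crit    = invttail(design_df, 0.025)
display "t critical value: " t_crit
display "95% CI: " (diff_b - t_crit*diff_se) " to " (diff_b + t_crit*diff_se)
display "two-sided p-value: " 2*ttail(design_df, abs(diff_b/diff_se))
```

If the arithmetic is right, this should roughly reproduce the interval and p-value that -lincom- printed, which suggests that the 20-observation extract, not the test itself, is driving the width.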


Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#6

29 Sep 2016, 06:11

Hello Justin,

At first, you didn't mention that you have a longitudinal design. That is an important aspect. Going directly to the point, the problem with the strategy is exactly the one Clyde underlined, i.e., using "over" for the waves.

That said, please make sure you have no missing data for the waves. Otherwise, use a subpopulation of respondents observed in all waves, such as -svy, subpop(twowaves):-. You may create a variable to tag the observations that have all waves.
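One way the tagging variable could be built is sketched below. This is only an illustration: person_id is a hypothetical respondent identifier assumed to be constant across waves, wave is assumed to take the values 1 and 2, and the data are assumed to be -svyset- already.

```stata
* Sketch only: person_id is a hypothetical identifier, not a variable shown so far.
* Tag respondents who have exactly one record in wave 1 and one in wave 2.
bysort person_id (wave): generate byte twowaves = (_N == 2 & wave[1] == 1 & wave[2] == 2)
* Estimate within the tagged subpopulation instead of dropping observations,
* so that the full design is still used for the variance estimates.
svy, subpop(twowaves): proportion Status, over(wave)
```

Using subpop() instead of an if qualifier or -drop- is the usual recommendation for survey data, since dropping observations can distort the design-based standard errors.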

Hopefully that helps.

Best,

Marcos

Last edited by Marcos Almeida; 29 Sep 2016, 06:18.

Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#7

29 Sep 2016, 06:32

On second thoughts, after checking the commands and output of your model:

a) You have just 20 observations, which is a tiny sample size even for a simple t-test with 2 groups of 10 each, let alone for a survey data analysis. Maybe you just selected a fraction of the data as an example...

b) Since "Status" is a binary variable, you can get the proportion with any of the commands -svy: prop-, -svy: mean- and -svy: tab-.

By the way, the book Applied Survey Data Analysis (Steven G. Heeringa, Brady T. West, Patricia A. Berglund), on page 143, chapter 5, section 5.6.2, entitled "Comparing Means over Time", suggests the use of "svy: mean namevar, over(wave)", followed by -lincom-, for a case similar to yours (https://books.google.com.br/books?id...ing%20&f=false).
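A minimal sketch of point (b) and of the approach mentioned for the book follows. It assumes the data are already -svyset- and that Status is coded 1 = Employed, 2 = Unemployed; unemployed is a hypothetical variable name. Because the coefficient names that -lincom- expects after -svy: mean, over()- differ between Stata versions, the sketch obtains the wave difference from a survey regression instead, which is a swapped-in but closely related way of getting the same contrast.

```stata
* Sketch only: code unemployment as 0/1 so that its mean is a proportion.
generate byte unemployed = (Status == 2) if !missing(Status)
* Proportion unemployed in each wave
svy: mean unemployed, over(wave)
* The coefficient on 2.wave is the wave 2 minus wave 1 difference in proportions.
svy: regress unemployed i.wave
```

The point estimate on 2.wave should equal the difference between the two means reported by -svy: mean-.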

Hopefully that helps!

Best,

Marcos

Last edited by Marcos Almeida; 29 Sep 2016, 07:14. Reason: Edited for inclusion of a reference book

Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#8

29 Sep 2016, 10:07

So if this is longitudinal data, you need to have a variable which identifies individual respondents consistently across the waves of the survey. You don't show such a variable in your example, so it isn't possible to proceed.
Justin Visagie

Join Date: Sep 2016

Posts: 9
#9

29 Sep 2016, 11:33

Sorry, I was giving an example here (there are 30,000 individuals) and realise that I selected 10 individuals from the first wave and a different 10 individuals from the second wave. But you can understand what I am getting at here - i.e. how do you set up the test in a way which assumes that the samples are dependent not independent?
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#10

29 Sep 2016, 12:15

Sorry, Justin, but I really don't understand what has happened. At first, you asked for an unpaired t test (in #1), and a solution was given. Then the topic changed to a paired t test and, on seeing the waves in your data set, a recommendation for such a test between 2 waves was given, along with a link to a reference book. It seems the data you showed are not exactly the data you wish to work with. Now you say there is no wave in your study, and yet you wish to perform a paired t test. How that could be, given the information provided here, I fail to see.

Hopefully you will get more insightful advice.

Kind regards,

Marcos

Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#11

29 Sep 2016, 12:23

Originally posted by Justin Visagie

Sorry, I was giving an example here (there are 30,000 individuals) and realise that I selected 10 individuals from the first wave and a different 10 individuals from the second wave. But you can understand what I am getting at here - i.e. how do you set up the test in a way which assumes that the samples are dependent not independent?

But how can you even tell that this is the case from what you posted? You don't show any variable that identifies the individuals. Do you have such a variable? If so, show an example which includes it. If you don't have such a variable, then you cannot do a paired analysis. To do a paired analysis you need to be able to identify which observation is paired with which other observation.

Justin Visagie

Join Date: Sep 2016
Posts: 9

#12

30 Sep 2016, 07:57

I see that I am being as clear as mud - so I am going to give it one last go (thanks for the patience!): ("UQNO" and "PERSONNO" in the dataset are the individual identifiers - I have created "ID" to make it easier)

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str18 UQNO byte PERSONNO long STRATUM byte Status double Weight float myprovince str8 psu float(wave ID)
"110000660000020101" 1 102102 1  555.7417463 1 "11000066" 1  1
"110000660000020101" 1 102102 1  599.2113306 1 "11000066" 2  1
"171013980000020201" 1 171112 1  621.9730882 1 "17101398" 1  2
"171013980000020201" 1 171112 1  671.5553543 1 "17101398" 2  2
"171023320000005801" 3 171120 1  312.6357523 1 "17102332" 1  3
"171023320000005801" 3 171120 1  353.1812439 1 "17102332" 2  3
"171044010000001301" 3 171112 2   373.244352 1 "17104401" 1  4
"171044010000001301" 3 171112 2  374.5688729 1 "17104401" 2  4
"214005510000002501" 2 212404 1  456.0182191 2 "21400551" 1  5
"214005510000002501" 2 212404 1  444.1893343 2 "21400551" 2  5
"405001500000004601" 2 417101 2  668.1298938 4 "40500150" 1  6
"405001500000004601" 2 417101 2  738.7751504 4 "40500150" 2  6
"572019860000010201" 5 572108 1  997.0189523 5 "57201986" 1  7
"572019860000010201" 5 572108 1  714.7814299 5 "57201986" 2  7
"681000840000002301" 1 345401 1  623.8988706 6 "68100084" 1  8
"681000840000002301" 1 345401 1  799.9441265 6 "68100084" 2  8
"706000300000014201" 1 742501 1 3936.1005884 7 "70600030" 1  9
"706000300000014201" 1 742501 2 3273.5501963 7 "70600030" 2  9
"802001130000002901" 4 830103 2  699.4802876 8 "80200113" 1 10
"802001130000002901" 4 830103 2   531.064149 8 "80200113" 2 10
"813000480000011801" 1 831402 1    572.36369 8 "81300048" 1 11
"813000480000011801" 1 831402 1  586.4560197 8 "81300048" 2 11
"817002680000011901" 1 832408 1  393.8432799 8 "81700268" 1 12
"817002680000011901" 1 832408 1  421.9556498 8 "81700268" 2 12
"817002680000013501" 2 832408 1  583.9800323 8 "81700268" 1 13
"817002680000013501" 2 832408 1  653.4718418 8 "81700268" 2 13
"908002790000008801" 1 934401 2  829.6563427 9 "90800279" 1 14
"908002790000008801" 1 934401 2  883.3003076 9 "90800279" 2 14
"910001580000011601" 3 935401 1 1186.3311452 9 "91000158" 1 15
"910001580000011601" 3 935401 1  1228.646648 9 "91000158" 2 15
end
label values Status Status_exp
label def Status_exp 1 "Employed", modify
label def Status_exp 2 "Unemployed", modify
label values myprovince myprovince
label def myprovince 1 "Western Cape", modify
label def myprovince 2 "Eastern Cape", modify
label def myprovince 4 "Free State", modify
label def myprovince 5 "KZN", modify
label def myprovince 6 "North West", modify
label def myprovince 7 "Gauteng", modify
label def myprovince 8 "Mpumalanga", modify
label def myprovince 9 "Limpopo", modify

1) If I am looking at testing that the unemployment rate is different across districts (independent samples) then the correct estimation here will be: "svy: prop Status, over(myprovince)" followed up by "lincom". Thanks Marcos, I am happy that this is solved! See below:

Code:

. svy, subpop(if Status!=4): prop Status, over(myprovince)
(running proportion on estimation sample)

Survey: Proportion estimation

Number of strata =      13          Number of obs    =      30
Number of PSUs   =      14          Population size  = 25085.1
                                    Subpop. no. obs  =      30
                                    Subpop. size     = 25085.1
                                    Design df        =       1

     Employed: Status = Employed
   Unemployed: Status = Unemployed

    _subpop_1: myprovince = Western Cape
    _subpop_2: myprovince = Eastern Cape
    _subpop_3: myprovince = Free State
          KZN: myprovince = KZN
    _subpop_5: myprovince = North West
      Gauteng: myprovince = Gauteng
   Mpumalanga: myprovince = Mpumalanga
      Limpopo: myprovince = Limpopo

--------------------------------------------------------------
             |             Linearized
        Over | Proportion   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
Employed     |
   _subpop_1 |   .8063719   .7967823     -9.317707    10.93045
   _subpop_2 |          1          .             .           .
   _subpop_3 |          .  (no observations)
         KZN |          1          .             .           .
   _subpop_5 |          1          .             .           .
     Gauteng |   .5459489          .             .           .
  Mpumalanga |   .7230135          .             .           .
     Limpopo |    .585033          .             .           .
-------------+------------------------------------------------
Unemployed   |
   _subpop_1 |   .1936281   .7967823     -9.930451    10.31771
   _subpop_2 |          .  (no observations)
   _subpop_3 |          1          .             .           .
         KZN |          .  (no observations)
   _subpop_5 |          .  (no observations)
     Gauteng |   .4540511          .             .           .
  Mpumalanga |   .2769865          .             .           .
     Limpopo |    .414967          .             .           .
--------------------------------------------------------------
Note: variance scaled to handle strata with a single sampling
      unit.

. lincom [Unemployed]_subpop_1 - [Unemployed]Mpumalanga

 ( 1)  [Unemployed]_subpop_1 - [Unemployed]Mpumalanga = 0

------------------------------------------------------------------------------
  Proportion |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |  -.0833585   .7967823    -0.10   0.934    -10.20744    10.04072
------------------------------------------------------------------------------

2) If I am looking at testing the unemployment rate across waves I could follow the same approach as (1) above:

Code:

. svy, subpop(if Status!=4): prop Status, over(wave)
(running proportion on estimation sample)

Survey: Proportion estimation

Number of strata =      13          Number of obs    =      30
Number of PSUs   =      14          Population size  = 25085.1
                                    Subpop. no. obs  =      30
                                    Subpop. size     = 25085.1
                                    Design df        =       1

     Employed: Status = Employed
   Unemployed: Status = Unemployed

            1: wave = 1
            2: wave = 2

--------------------------------------------------------------
             |             Linearized
        Over | Proportion   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
Employed     |
           1 |   .7993421   .1190986     -.7139489    2.312633
           2 |    .527379   .1512556     -1.394506    2.449264
-------------+------------------------------------------------
Unemployed   |
           1 |   .2006579   .1190986     -1.312633    1.713949
           2 |    .472621   .1512556     -1.449264    2.394506
--------------------------------------------------------------
Note: variance scaled to handle strata with a single sampling
      unit.

. lincom [Unemployed]2 - [Unemployed]1

 ( 1)  - [Unemployed]1 + [Unemployed]2 = 0

------------------------------------------------------------------------------
  Proportion |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |   .2719632    .032157     8.46   0.075    -.1366305    .6805568
------------------------------------------------------------------------------

But using this method would be assuming the samples are independent (unpaired) when they are actually panel data? This is my concern.

I really apologise if I am still not clear - thank you for bearing with me.

Best,
Justin


Justin Visagie

Join Date: Sep 2016

Posts: 9
#13

30 Sep 2016, 08:02

Sorry - also note that in example 1 (independent samples - unemployment by district) I would restrict the sample to one particular year: "svy, subpop(if wave==1): prop Status, over(myprovince)", which I forgot to do. But I am happy that I am doing the right estimation for example type 1 - it's example type 2 on which I would like input.
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#14

30 Sep 2016, 09:34

This gets somewhat complicated. You can't do this with -proportion-, because -proportion- is not designed to handle paired observations, repeated measures, etc. In principle you can do this by running a 2-level logistic model with Status as the outcome and i.wave as the predictor. The test of significance for the coefficient of wave will be what you want. There are a few technical obstacles to this solution, some of which are easily overcome, and one, perhaps not so easily.

1. Your variable Status is coded as 1/2. To serve as an outcome variable in a logistic model it needs to be coded 0/1. That's easy to fix.

2. You need a new single variable that identifies individuals (referred to as person in the code below). That's also easy: see -help egen-, with reference to the -group()- function.

3. To use -melogit- with svy:, you need to re-do the -svyset- on your data specifying pweights at both the observation and the person level. If you have the information available to do that, then you're in good shape. I do not have the expertise in survey data analysis to advise you on how you might derive these weights from whatever information is in your data set.
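Steps 1 and 2 might look like the sketch below. It assumes Status is coded 1 = Employed, 2 = Unemployed as in the posted example, and that UQNO and PERSONNO together identify a person, as stated in #12. The names employed and person match those used in the model command in this post. Step 3 is not sketched because it depends on weights that are not shown in the thread.

```stata
* Step 1: recode Status (1 = Employed, 2 = Unemployed) into a 0/1 outcome.
generate byte employed = (Status == 1) if !missing(Status)
* Step 2: one identifier per individual, built from household and person numbers.
egen long person = group(UQNO PERSONNO)
```

Any observations with other Status codes would be set to 0 by this recode, so they would need to be excluded, for example through the subpop() condition, before the outcome is interpreted.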

If you overcome these obstacles, then I think you can just go ahead with:

Code:

svy, subpop(if Status != 4): melogit employed i.wave || person:
margins wave

By the way, I don't understand why you are specifying (if Status != 4) when Status only takes on values 1 and 2 (or 0 and 1 after you recode it.)

Last edited by Clyde Schechter; 30 Sep 2016, 09:37. Reason: Fix typos.
Justin Visagie

Join Date: Sep 2016

Posts: 9
#15

01 Oct 2016, 05:26

Dear Clyde and Marcos,

Thank you very much for the help!!!! I think I have what I need!

PS - Status!=4 is for those individuals who shift into the Not Economically Active category, but I see that there are no such cases in my example, so it is not necessary.