  • Comparing Samples with Unpaired and Weighted T-Tests

    Hello,

    So I have searched the Stata forums and the internet for this issue and haven't found anything that fits precisely what I want to do, so I thought I would describe it here and see if anyone has any advice.

    I have a set of analyses that I am running with a couple of different dependent variables on the GSS. Unfortunately, these are survey questions that were asked on completely different ballots. Since I can't use the same sample for both outcomes, I want to demonstrate that the samples are similar on my controls and predictors. For example, I want to make sure that my samples have similar education levels, similar racial breakdowns, and so on. I thought about using an unpaired t-test, but the problem is that -ttest- does not allow weights. I am using the GSS-recommended weight, wtssall.

    I have seen previous examples where people use -svy- commands to incorporate weights and then run a regression command, but the issue is that this isn't an unpaired test. I need an unpaired t-test, since my samples are completely different sets of respondents.

    My hope is that since these ballots are randomly assigned, the samples should be very similar, but I want to demonstrate this. Am I on the right track in thinking that I could run some sort of unpaired t-test with weights to show this? I was going to run a t-test for each of my variables and see whether the mean values differ between my samples.

    If this is correct, how can I do a t-test (or something equivalent) that is both unpaired and allows weights?

    Thanks,
    George

  • #2
    George:
    welcome to the list.
    As you have different covariates/predictors, I would go with -svy: regress- and -pweight-.
    I fail to get why repeating a series of unpaired -ttest-s would outperform OLS.
    I would have also considered running -mvreg-, but I do not know whether it is supported by -svy-.
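    A minimal sketch of what I mean (educ and ballot are placeholder names; the test of the -ballot- coefficient is the weighted analogue of the unpaired t-test you describe):
    Code:
    * declare the survey design with the GSS probability weight
    svyset [pweight=wtssall]
    * placeholder names: -educ- is one of your controls, -ballot- is a
    * 0/1 indicator for which sample each respondent belongs to
    svy: regress educ i.ballot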
    Kind regards,
    Carlo
    (Stata 19.0)

    • #3
      Carlo,
      Thanks for your reply! The issue with using any sort of regression-based command is that, due to the nature of the function, the regression will drop the sample down so that the two samples are paired. That would get me the same result as running a paired t-test, correct?

      I probably wasn't being particularly clear, so here is an example of the code that attempts to compare my samples on my education dummy variables:

      Code:
      ttest lessthanhighschool1 == lessthanhighschool2, unpaired
      ttest highschool1 == highschool2, unpaired
      ttest somecollege1 == somecollege2, unpaired
      ttest college1 == college2, unpaired

      Sample 1 is one subset of respondents from the GSS and Sample 2 is another subset. There is some overlap between the subsets, but not much. So, for example, if I run the second t-test command listed above, I get these results:



      Code:
      Two-sample t test with equal variances
      ------------------------------------------------------------------------------
      Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
      ---------+--------------------------------------------------------------------
      highsc~1 |   14132    .2836824    .0037921    .4508005    .2762494    .2911155
      highsc~2 |   17048    .2836696    .0034525    .4507916    .2769023     .290437
      ---------+--------------------------------------------------------------------
      combined |   31180    .2836754    .0025529    .4507884    .2786716    .2886792
      ---------+--------------------------------------------------------------------
          diff |            .0000128    .0051284                -.010039    .0100646
      ------------------------------------------------------------------------------
          diff = mean(highschool1) - mean(highschool2)                  t =   0.0025
      Ho: diff = 0                                     degrees of freedom =    31178

          Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
       Pr(T < t) = 0.5010         Pr(|T| > |t|) = 0.9980          Pr(T > t) = 0.4990


      As you can see, I have a different number of observations for highschool1 and highschool2. These values for high school should be equivalent between the two samples, and this t-test does demonstrate that there is no statistically significant difference between the means of highschool in samples 1 and 2. I can do this for each variable, but the results are not weighted, and that is where I am running into problems.
      A regression fixes the weighting issue but creates another one. I can weight this with a regression command as you suggest, but since it is a regression, Stata matches the observations. So the sample drops to about 7,000 observations (the observations that overlap between the subsets). I get a result that the mean values of these variables are the same, but that is because it has forced me to compare the data to itself.
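      To make this concrete, what I ran was something along these lines (a rough sketch from memory, using the same wide-format dummies as my ttest commands above):
      Code:
      * -regress- drops any observation that is missing on either variable,
      * so only the respondents who appear in both subsets survive, which
      * is what pairs the samples
      svyset [pweight=wtssall]
      svy: regress highschool1 highschool2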

      Does what I'm saying make sense? I'm not even sure I'm thinking about this correctly, but there must be some way to demonstrate that these different sets of respondents from the GSS sample are, for all intents and purposes, the same in regard to education, income, race, etc.

      Thanks,
      George


      • #4
        George:
        I'm probably missing something, but I fail to get why -regress- should
        ...drop the sample down so that the two samples are paired.
        Setting aside -svy- and -weights- issues for a while, you can see that you can run both -regress- and -ttest- on samples of different sizes:
        Code:
        . sysuse auto.dta
        (1978 Automobile Data)
        
        . ttest price, by(foreign) unpaired
        
        Two-sample t test with equal variances
        ------------------------------------------------------------------------------
           Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
        ---------+--------------------------------------------------------------------
        Domestic |      52    6072.423    429.4911    3097.104    5210.184    6934.662
         Foreign |      22    6384.682    558.9942    2621.915     5222.19    7547.174
        ---------+--------------------------------------------------------------------
        combined |      74    6165.257    342.8719    2949.496    5481.914      6848.6
        ---------+--------------------------------------------------------------------
            diff |           -312.2587    754.4488               -1816.225    1191.708
        ------------------------------------------------------------------------------
            diff = mean(Domestic) - mean(Foreign)                         t =  -0.4139
        Ho: diff = 0                                     degrees of freedom =       72
        
            Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
         Pr(T < t) = 0.3401         Pr(|T| > |t|) = 0.6802          Pr(T > t) = 0.6599
        
        . regress price  i.foreign
        
              Source |       SS           df       MS      Number of obs   =        74
        -------------+----------------------------------   F(1, 72)        =      0.17
               Model |  1507382.66         1  1507382.66   Prob > F        =    0.6802
            Residual |   633558013        72  8799416.85   R-squared       =    0.0024
        -------------+----------------------------------   Adj R-squared   =   -0.0115
               Total |   635065396        73  8699525.97   Root MSE        =    2966.4
        
        ------------------------------------------------------------------------------
               price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
             foreign |
            Foreign  |   312.2587   754.4488     0.41   0.680    -1191.708    1816.225
               _cons |   6072.423    411.363    14.76   0.000     5252.386     6892.46
        ------------------------------------------------------------------------------
        Obviously, any regression procedure has the advantage that coefficients are adjusted for the remaining predictors.
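        If weights are the sticking point, the same setup runs under -svy- as well (a sketch on the toy data: I invent a weight here for illustration, whereas you would -svyset- wtssall):
        Code:
        * made-up probability weight, for illustration only
        generate double w = 1 + runiform()
        svyset [pweight=w]
        * the coefficient on i.foreign is now the weighted difference in means
        svy: regress price i.foreign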
        Kind regards,
        Carlo
        (Stata 19.0)
