
  • Gauß- / Z-Test

    Hello,

    Could someone please tell me which command performs a mean comparison of two samples using the Gauß test / z-test instead of a t-test?

    Thank you!

  • #2
    A z-test is more of a pedagogical tool, a stepping stone to the t-test. I cannot think of a single real application where it is appropriate. It would be interesting to find such an application; it would help with teaching. So, could you tell us a bit more about your problem?
    ---------------------------------
    Maarten L. Buis
    University of Konstanz
    Department of history and sociology
    box 40
    78457 Konstanz
    Germany
    http://www.maartenbuis.nl
    ---------------------------------



    • #3
      I agree with Maarten. In essence, if the sample size is large enough, the z test will in practice give essentially the same result as the t test. If that's not true, then you shouldn't use a z test. I think these are good reasons why it is not separately available.
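      To see this numerically, here is a quick sketch (using the auto data purely for illustration) that compares the p-value reported by -ttest- with the p-value obtained by referring the same statistic to a standard normal, i.e. the z test:

      Code:
      * sketch: with n = 74, the t-based and z-based p-values nearly coincide
      sysuse auto, clear
      ttest price, by(foreign)
      * two-sided p-value as reported by -ttest- (t reference distribution)
      di as txt "t-test p = " as result r(p)
      * z-test p-value: same statistic, standard normal reference distribution
      di as txt "z-test p = " as result 2*normal(-abs(r(t)))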



      • #4
        Ah, OK, so it is not available in Stata. Well, I was just looking for a test that could back up my results from the t-test, and I thought the z-test could be an option.

        As an alternative I would run the Wilcoxon-Mann-Whitney test, which should deliver significant results in all cases where the t-tests did so.



        • #5
          As before, the z test cannot in any sense "back up" results from t tests. If the results are the same, the z test is not giving independent evidence as it is essentially the same test. If the results differ, your sample size is too small for the z test to apply.

          WMW answers a different question. If t tests don't give the result you expect, it's best to wonder why, not just to move on to another test.



          • #6
            The reason why I want to back up the t tests' results is that my data do not follow a normal distribution, which in theory violates one of the preconditions of the t test. The WMW test does not require a normal distribution and can tell me whether my samples differ significantly in their medians. So it should at least point in the same direction as my t tests. Am I wrong about this?
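            For reference, the Wilcoxon-Mann-Whitney test is available in Stata as -ranksum-. A minimal sketch, again using the auto data only as a stand-in for your own grouping variable:

            Code:
            * sketch: rank-sum (Wilcoxon-Mann-Whitney) test, two independent samples
            sysuse auto, clear
            ranksum price, by(foreign)
            * for paired samples the analogue would be -signrank-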



            • #7
              A t-test can easily be expressed in a simple linear regression framework, in which a normal distribution of the variables is not assumed. The assumption is that the error term in this model is normal, but even this assumption can be relaxed with a reasonably large sample size (some textbooks state N > 50).


              Edit:

              What might be more relevant than a normal distribution is the assumption of equal variances in the groups, which is called homoscedasticity in the regression framework. The ttest command has an unequal option to account for violation of this assumption. With regress you would specify the vce(robust) option.
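              A minimal sketch of the two equivalent approaches, with the auto data standing in for your own variables:

              Code:
              sysuse auto, clear
              * t test allowing unequal group variances (Satterthwaite d.f.)
              ttest price, by(foreign) unequal
              * the regression analogue, with heteroskedasticity-robust SEs
              regress price foreign, vce(robust)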

              Best
              Daniel
              Last edited by daniel klein; 06 Feb 2015, 07:11.



              • #8
                WMW is not really a test of different medians. Textbooks differ on what wording is suitable for the masses.

                But that aside, your objective is not clear here.

                If you want to compare means, then OK, and there are lots of ways to do it. Details can alter conclusions, but here is one counter-example to underline that what is assumed (or better, postulated) about the marginal distributions need not be that crucial. Note that 74 is not an especially large sample, but for this purpose it is large enough that "being a small sample" does not bite. We get P-values around 0.68 regardless of entertaining rather different models for the data. Of course, you have to try it for your case. If the reason for non-normality is a massive outlier, results will be sensitive to assumptions.

                Code:
                 
                . sysuse auto, clear 
                (1978 Automobile Data)
                
                . ttest price, by(foreign)
                
                Two-sample t test with equal variances
                ------------------------------------------------------------------------------
                   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
                ---------+--------------------------------------------------------------------
                Domestic |      52    6072.423    429.4911    3097.104    5210.184    6934.662
                 Foreign |      22    6384.682    558.9942    2621.915     5222.19    7547.174
                ---------+--------------------------------------------------------------------
                combined |      74    6165.257    342.8719    2949.496    5481.914      6848.6
                ---------+--------------------------------------------------------------------
                    diff |           -312.2587    754.4488               -1816.225    1191.708
                ------------------------------------------------------------------------------
                    diff = mean(Domestic) - mean(Foreign)                         t =  -0.4139
                Ho: diff = 0                                     degrees of freedom =       72
                
                    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
                 Pr(T < t) = 0.3401         Pr(|T| > |t|) = 0.6802          Pr(T > t) = 0.6599
                
                . glm price foreign
                
                Iteration 0:   log likelihood = -695.62494  
                
                Generalized linear models                          No. of obs      =        74
                Optimization     : ML                              Residual df     =        72
                                                                   Scale parameter =   8799417
                Deviance         =  633558013.5                    (1/df) Deviance =   8799417
                Pearson          =  633558013.5                    (1/df) Pearson  =   8799417
                
                Variance function: V(u) = 1                        [Gaussian]
                Link function    : g(u) = u                        [Identity]
                
                                                                   AIC             =  18.85473
                Log likelihood   = -695.6249418                    BIC             =  6.34e+08
                
                ------------------------------------------------------------------------------
                             |                 OIM
                       price |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
                -------------+----------------------------------------------------------------
                     foreign |   312.2587   754.4488     0.41   0.679    -1166.434    1790.951
                       _cons |   6072.423    411.363    14.76   0.000     5266.166     6878.68
                ------------------------------------------------------------------------------
                
                . glm price foreign, link(log)
                
                Iteration 0:   log likelihood = -699.23223  
                Iteration 1:   log likelihood = -695.81557  
                Iteration 2:   log likelihood = -695.62496  
                Iteration 3:   log likelihood = -695.62494  
                
                Generalized linear models                          No. of obs      =        74
                Optimization     : ML                              Residual df     =        72
                                                                   Scale parameter =   8799417
                Deviance         =  633558013.5                    (1/df) Deviance =   8799417
                Pearson          =  633558013.5                    (1/df) Pearson  =   8799417
                
                Variance function: V(u) = 1                        [Gaussian]
                Link function    : g(u) = ln(u)                    [Log]
                
                                                                   AIC             =  18.85473
                Log likelihood   = -695.6249418                    BIC             =  6.34e+08
                
                ------------------------------------------------------------------------------
                             |                 OIM
                       price |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
                -------------+----------------------------------------------------------------
                     foreign |   .0501438   .1200041     0.42   0.676    -.1850599    .2853475
                       _cons |   8.711513   .0677428   128.60   0.000      8.57874    8.844287
                ------------------------------------------------------------------------------
                
                . glm price foreign, f(gamma)
                
                Iteration 0:   log likelihood = -719.86823  
                Iteration 1:   log likelihood = -719.75548  
                Iteration 2:   log likelihood = -719.75513  
                
                Generalized linear models                          No. of obs      =        74
                Optimization     : ML                              Residual df     =        72
                                                                   Scale parameter =  .2334392
                Deviance         =  12.69664531                    (1/df) Deviance =  .1763423
                Pearson          =  16.80762473                    (1/df) Pearson  =  .2334392
                
                Variance function: V(u) = u^2                      [Gamma]
                Link function    : g(u) = 1/u                      [Reciprocal]
                
                                                                   AIC             =   19.5069
                Log likelihood   = -719.7551282                    BIC             =  -297.196
                
                ------------------------------------------------------------------------------
                             |                 OIM
                       price |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
                -------------+----------------------------------------------------------------
                     foreign |  -8.05e-06   .0000195    -0.41   0.679    -.0000462    .0000301
                       _cons |   .0001647    .000011    14.97   0.000     .0001431    .0001862
                ------------------------------------------------------------------------------
                
                . glm price foreign, f(gamma) link(log)
                
                Iteration 0:   log likelihood = -719.92833  
                Iteration 1:   log likelihood = -719.75532  
                Iteration 2:   log likelihood = -719.75513  
                Iteration 3:   log likelihood = -719.75513  
                
                Generalized linear models                          No. of obs      =        74
                Optimization     : ML                              Residual df     =        72
                                                                   Scale parameter =   .233444
                Deviance         =   12.6966453                    (1/df) Deviance =  .1763423
                Pearson          =  16.80796837                    (1/df) Pearson  =   .233444
                
                Variance function: V(u) = u^2                      [Gamma]
                Link function    : g(u) = ln(u)                    [Log]
                
                                                                   AIC             =   19.5069
                Log likelihood   = -719.7551282                    BIC             =  -297.196
                
                ------------------------------------------------------------------------------
                             |                 OIM
                       price |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
                -------------+----------------------------------------------------------------
                     foreign |   .0501439    .122884     0.41   0.683    -.1907043    .2909922
                       _cons |   8.711513   .0670025   130.02   0.000     8.580191    8.842835
                ------------------------------------------------------------------------------



                • #9
                  Christopher:
                  as an aside to the previous excellent replies, you may want to compare the results of -ttest- to the ones obtained via a bootstrapped t test. This procedure is covered under Example 3 of the -bootstrap- entry in the Stata 13.1 PDF manual.
                  Kind regards,
                  Carlo
                  (Stata 19.0)



                  • #10
                    Dear Mr. Lazzaro,
                    I am having the same problem as Christopher. I am currently reading the example you mentioned, but again it tests one variable per group (only two groups allowed). I was wondering how I could apply this bootstrap method to a t test of two variables?



                    • #11
                      That is possible; below is an example. As an estimator for the p-value I like to use (#(t > t_obs) + 1)/(B + 1) rather than #(t > t_obs)/B, where #(t > t_obs) is the number of replications in which t is larger than t_obs (or more extreme, if we think of t and t_obs as absolute values), and B is the number of replications. See Chapter 4 of A.C. Davison and D.V. Hinkley (1997), Bootstrap Methods and their Application, Cambridge: Cambridge University Press. This is a bit pedantic, as you typically need a large B to get a reliable estimate, and at large B the difference between the two quickly becomes very small. In addition, this is a bootstrap test, so there is some randomness in the estimate: if we were to run this example again (without setting the seed) we would get a (slightly) different estimate of the p-value. That uncertainty can be quantified by a Monte Carlo confidence interval; if that interval is too large for your taste, you need to increase the number of replications.

                      Code:
                      clear all
                      webuse fuel
                      ttest mpg1 = mpg2
                      tempname tobs m1 m2 m
                      scalar `tobs' = r(t)
                      scalar `m1' = r(mu_1)
                      scalar `m2' = r(mu_2)
                      scalar `m' = (`m1' + `m2')/2
                      
                      * impose the null hypothesis: shift both variables to a common mean
                      replace mpg1 = mpg1 - `m1' + `m'
                      replace mpg2 = mpg2 - `m2' + `m'
                      
                      * bootstrap the t statistic under the null
                      tempfile bs
                      bootstrap t = r(t) , reps(20000) saving(`bs') : ttest mpg1 = mpg2
                      
                      use `bs', clear
                      
                      * p-value estimate (#(t >= t_obs) + 1)/(B + 1), plus a Monte Carlo CI
                      qui count if abs(t) >= abs(`tobs')
                      local a = r(N) + 1
                      local b = _N + 1 - r(N)
                      local alph = (100 - c(level))/200
                      local lb = invibeta(`a', `b', `alph')
                      local ub = invibetatail(`a', `b', `alph')
                      di as txt "achieved significance level: " as result (r(N)+1)/(_N+1)
                      di as txt "MC CI                      : [" as result `lb' as txt ", " as result `ub' as txt "]"
                      ---------------------------------
                      Maarten L. Buis
                      University of Konstanz
                      Department of history and sociology
                      box 40
                      78457 Konstanz
                      Germany
                      http://www.maartenbuis.nl
                      ---------------------------------

