T-Test with Dummy Variables

Rudia Jung

Join Date: Dec 2017

Posts: 1
#1

08 Dec 2017, 05:10

Stata Users,

I'm new to Stata and am trying to figure out how to run a t-test of my continuous dependent variable (mntlhlth) across the groups of my categorical independent variable (class), with the output showing each group. The variable class records which social class people identify themselves as, coded 1-4 (1 = Lower Class, 2 = Working Class, 3 = Middle Class, 4 = Upper Class).
I created dummy variables for the different social classes using gen lowerclass = 1 if class == 1, but I am not sure how to proceed to conduct a t-test after this.
I have tried ttest mntlhlth, by(lowerclass) and it says "1 group found, 2 required".
How do I go about creating a t-test that shows all 4 of the social classes?
Thank you in advance.
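
A note on the error message: gen lowerclass = 1 if class == 1 sets lowerclass to 1 for the lower class and leaves it missing for everyone else, so -ttest- finds only one group. A minimal sketch of a 0/1 indicator follows; the toy values are invented for illustration and are not the poster's data.

Code:

* toy data, invented for illustration only
clear
input byte class float mntlhlth
1 5
1 7
2 3
2 4
3 2
3 1
4 0
4 2
end

* 1 for lower class, 0 for every other non-missing class
generate byte lowerclass = (class == 1) if !missing(class)
tabulate class lowerclass, missing

* two groups now exist, so -ttest- runs
ttest mntlhlth, by(lowerclass)

Note that this compares the lower class with the other three classes pooled; a two-sample t-test cannot display four groups at once.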
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17712
#2

08 Dec 2017, 07:05

Rudia.
welcome to this forum.
Why don't you go:

Code:

regress mntlhlth i.class

instead?
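
Why a regression can stand in for the t-test: with a two-level factor, the t statistic on the indicator in -regress- matches the equal-variance two-sample t-test (up to sign), and with more levels the overall F-test matches one-way ANOVA. Each coefficient on i.class is then the difference in means between that class and the base class. A sketch using the auto dataset shipped with Stata as a stand-in (price, foreign and rep78 stand in for mntlhlth, a two-group dummy and class):

Code:

sysuse auto, clear

* two groups: same |t| and two-sided p-value from both commands
ttest price, by(foreign)
regress price i.foreign

* more than two groups: same F statistic and p-value from both commands
oneway price rep78, tabulate
regress price i.rep78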

Kind regards,
Carlo
(Stata 19.0)
Phil Bromiley

Join Date: Apr 2014

Posts: 4348
#3

11 Dec 2017, 13:35

After you run the regression, see the documentation for -test- to learn how to do the tests.
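
As a sketch of what that can look like after a factor-variable regression, again with the auto dataset as a stand-in (rep78 plays the role of class): -testparm- tests all levels jointly, -test- compares two chosen levels, and -pwcompare- lists every pairwise difference with a multiplicity adjustment.

Code:

sysuse auto, clear
regress price i.rep78

* joint test that all rep78 levels have the same mean price
testparm i.rep78

* are the means of levels 3 and 4 equal?
test 3.rep78 = 4.rep78

* all pairwise differences in means, Bonferroni-adjusted
pwcompare rep78, effects mcompare(bonferroni)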
Rodrigo Abed

Join Date: Sep 2018

Posts: 13
#4

02 May 2019, 08:15

Originally posted by Carlo Lazzaro:

Rudia.
welcome to this forum.
Why don't you go:

Code:

regress mntlhlth i.class

instead?

Hi Carlo,

I wanted to follow up on this answer. Rudia was regressing a continuous variable (mntlhlth) on a categorical variable (social class), and the command you provided was useful for getting output similar to a ttest.

My case is a bit different. I am interested in finding out whether the means of several categorical variables differ across a dependent variable that is coded as a dummy. Let me give a few more details. I want to know whether there is a difference in the mean of an independent variable "invest", the investment made by an enterprise relative to the previous year (coded 1 = did not invest, 2 = less than last year, 3 = same as last year, 4 = more than last year), between women-owned and non-women-owned enterprises. The dependent variable is "womenbuss", coded 0 if an enterprise is not owned by a woman and 1 if it is.

Can I still use the same command? And would the interpretation be the same?

Thanks for your help,
Rodrigo

Carlo Lazzaro

Join Date: Apr 2014
Posts: 17712
#5

02 May 2019, 08:27

Rodrigo:
if I got your query right, you can try something like:

Code:

. use http://www.stata-press.com/data/r15/lbw.dta
(Hosmer & Lemeshow data)

. logistic low i.race

Logistic regression                             Number of obs     =        189
                                                LR chi2(2)        =       5.01
                                                Prob > chi2       =     0.0817
Log likelihood = -114.83082                     Pseudo R2         =     0.0214

------------------------------------------------------------------------------
         low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        race |
      black  |   2.327536   1.078613     1.82   0.068     .9385072    5.772385
      other  |   1.889234   .6571342     1.83   0.067     .9554577    3.735597
             |
       _cons |   .3150685   .0753382    -4.83   0.000     .1971825     .503433
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.

. test [low]2.race=[low]3.race

 ( 1)  [low]2.race - [low]3.race = 0

           chi2(  1) =    0.20
         Prob > chi2 =    0.6575

.
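
Reading the output above: each odds ratio compares a race category with the base category (white), and the -test- result (chi2 = 0.20, p = 0.6575) gives no evidence that the black and other coefficients differ from each other. A sketch of two related commands, assuming the same lbw data: -testparm- for a joint Wald test of the race indicators, and -lincom- for the black-versus-other odds ratio with its confidence interval.

Code:

use http://www.stata-press.com/data/r15/lbw.dta, clear
logistic low i.race

* joint Wald test of both race indicators
testparm i.race

* black vs other, reported as an odds ratio
lincom 2.race - 3.race, or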

Kind regards,
Carlo
(Stata 19.0)


Rodrigo Abed

Join Date: Sep 2018

Posts: 13
#6

02 May 2019, 11:02

Hi Carlo,

Thanks for your prompt response, I greatly appreciate it. You are correct, since the dependent variable is a dummy it should be a logit regression. Thanks for pointing that out.

If I understood the output correctly, then for a given level of birth weight both black and "other race" are different from white at a 10% significance level. Then you proceeded to test if black was different from "other race" and it indeed is, also at a 10% significance level.

How would I go about testing whether the mean of the white population is different between low=0 and low=1? This is where I was going with my initial question when, for example, trying to test whether the mean of "invest=4" (people who invested more than in the previous year) was different between women-owned and non-women-owned businesses ("womenbuss=1" and "womenbuss=0", respectively).
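
One way to address this question directly, offered as an alternative sketch rather than what was suggested in this thread: build a 0/1 indicator for the category of interest and compare its proportion across the two groups with -prtest-, or cross-tabulate the two variables. The lbw data from #5 serve as the example (white is a variable name made up here); with the poster's data the indicator would be built from invest and the grouping variable would be womenbuss.

Code:

use http://www.stata-press.com/data/r15/lbw.dta, clear

* 1 if white, 0 otherwise (race is coded 1 = white in these data)
generate byte white = (race == 1) if !missing(race)

* share of white mothers among low = 0 versus low = 1
prtest white, by(low)

* full distribution of race within each level of low, with a chi-squared test
tabulate race low, column chi2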

Thanks,
Rodrigo

Carlo Lazzaro

Join Date: Apr 2014
Posts: 17712
#7

02 May 2019, 11:15

Rodrigo:
do you mean something like:

Code:

. test [low]_cons=[low]3.race=[low]2.race, mtest(bonferroni)

 ( 1)  - [low]3.race + [low]_cons = 0
 ( 2)  - [low]2.race + [low]_cons = 0

---------------------------------------
       |        chi2     df       p
-------+-------------------------------
  (1)  |       10.97      1     0.0019 #
  (2)  |       10.35      1     0.0026 #
-------+-------------------------------
  all  |       12.70      2     0.0017
---------------------------------------
         # Bonferroni-adjusted p-values

Kind regards,
Carlo
(Stata 19.0)


Rodrigo Abed

Join Date: Sep 2018

Posts: 13
#8

02 May 2019, 11:40

Thanks Carlo. I have never performed a Bonferroni test, but I understand that you used it as a correction because multiple hypotheses are being tested. I might be confused in my interpretation, but how is this result different from the one provided by the regression you gave in #5?

I have the impression that in both outputs you are keeping the variable "low" constant and comparing the mean of low between races. What I am interested in is whether, keeping a certain race constant (e.g. black), there are significant differences between low=0 and low=1.

Please correct me if I am wrong in my interpretation. Once again, thanks for your help!
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17712
#9

02 May 2019, 11:59

Rodrigo:
in #5 the _cons (ie, white race) was not included in -test-.
The variable low is simply the name of the regressand of the toy-example; it does not affect the -test- outcome.
That said, there's something I fail to get in your query:
you are dealing with a simple logistic regression, ie, with one predictor only (investment made by an enterprise with respect to the previous year), the regressand being business ownership (women-owned: yes or no).
You can test whether the levels of the predictor differ in terms of the variation they cause in the regressand, but you cannot have the same variable being both a regressand and a predictor.
Sharing an example/excerpt of your data via -dataex- would help make things clearer. Thanks.
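
For readers unfamiliar with -dataex- (installable with ssc install dataex): it writes out a data excerpt as Stata code that others can copy and paste. A minimal sketch with the auto dataset; the variable list and the range 1/10 are arbitrary choices for illustration.

Code:

sysuse auto, clear

* write the first 10 observations of four variables as an -input- block
dataex price mpg rep78 foreign in 1/10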

Kind regards,
Carlo
(Stata 19.0)

Rodrigo Abed

Join Date: Sep 2018
Posts: 13

#10

02 May 2019, 12:37

Carlo,

That is a great idea, here is a sample of the database:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte(wom_ownbuss inversion shocks)
0 3 1
0 3 2
1 4 2
0 1 2
0 2 1
1 1 3
1 4 2
0 3 1
1 1 1
end

"shocks" is a continuous variable and, as I mentioned in #4, "invest" is a categorical variable. Since shocks is continuous, I can run a ttest directly to compare its mean by type of business ownership:

Code:

. ttest shocks, by( wom_ownbuss)

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       0 |     498    1.937751    .0584324    1.303972    1.822946    2.052556
       1 |     117    1.811966    .1078496    1.166572    1.598356    2.025576
---------+--------------------------------------------------------------------
combined |     615    1.913821    .0515749    1.279017    1.812536    2.015106
---------+--------------------------------------------------------------------
    diff |            .1257852    .1314122               -.1322876     .383858
------------------------------------------------------------------------------
    diff = mean(0) - mean(1)                                      t =   0.9572
Ho: diff = 0                                     degrees of freedom =      613

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.8306         Pr(|T| > |t|) = 0.3389          Pr(T > t) = 0.1694

This output tells me that the mean of shocks is 1.94 in businesses not owned by women and 1.81 in women-owned businesses. The difference is not significantly different from 0 (two-sided p = 0.3389).

However, because of the nature of the "invest" variable (inversion in Spanish), I cannot conduct the ttest directly. That is why I tried to use the example you gave with the regress (in this case, the logit) command. What I would like to do is test whether there is any difference in the mean of a specific category of the variable invest between the 2 groups of businesses. I hope this makes clearer what I am trying to do.

Thanks,
Rodrigo


Carlo Lazzaro

Join Date: Apr 2014

Posts: 17712
#11

02 May 2019, 23:44

Rodrigo:
it seems that this is the first time you mention the dependent variable you're really interested in: -shocks- (which is different from

...a dependent variable which is coded as a dummy [as reported in your #]

).
Hence we have gone back and forth without any gain in the previous posts, wasting our time: please read the FAQ about posting-related topics.
That said, you may want something along the following lines:

Code:

regress shocks i.wom_ownbuss##i.inversion

Last edited by Carlo Lazzaro; 02 May 2019, 23:46.

Kind regards,
Carlo
(Stata 19.0)

Rodrigo Abed

Join Date: Sep 2018
Posts: 13

#12

03 May 2019, 02:31

Hi Carlo,

Thanks for your help in figuring this out. I apologize for the inconvenience: I confused the dependent and independent variables without realizing it until your last post. I guess that using the -dataex- command from the beginning would have brought clarity much earlier and avoided wasting your time. My apologies for this; it is definitely a lesson learned for my next posts.

Just one final remark, to make sure I understood the output of your suggestion:

Code:

 regress shocks i.wom_ownbuss##i.inversion

      Source |       SS       df       MS              Number of obs =     603
-------------+------------------------------           F(  7,   595) =    1.19
       Model |  13.6558372     7  1.95083388           Prob > F      =  0.3057
    Residual |  974.523267   595  1.63785423           R-squared     =  0.0138
-------------+------------------------------           Adj R-squared =  0.0022
       Total |  988.179104   602  1.64149353           Root MSE      =  1.2798

---------------------------------------------------------------------------------------
               shocks |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
----------------------+----------------------------------------------------------------
        1.wom_ownbuss |   .1464552    .356106     0.41   0.681    -.5529223    .8458328
                      |
            inversion |
                   2  |   .4241451    .212551     2.00   0.046     .0067037    .8415865
                   3  |   -.014821   .1894706    -0.08   0.938    -.3869335    .3572915
                   4  |   .2089552   .1806512     1.16   0.248    -.1458363    .5637468
                      |
wom_ownbuss#inversion |
                 1 2  |  -.4727562   .4884012    -0.97   0.333    -1.431956    .4864437
                 1 3  |  -.1873849   .4317837    -0.43   0.664     -1.03539    .6606206
                 1 4  |  -.2986291   .4130451    -0.72   0.470    -1.109833    .5125745
                      |
                _cons |   1.791045   .1563508    11.46   0.000     1.483978    2.098111
---------------------------------------------------------------------------------------

The last section, -wom_ownbuss#inversion-, compares the mean of the -invest- variable per type of business ownership, with the constant being the base value for -invest-. Is this correct?

Finally, I have also tried to get a similar result by using the following command:

Code:

. mlogit inversion wom_ownbuss, base(1)

Iteration 0:   log likelihood = -780.57905  
Iteration 1:   log likelihood = -780.56602  
Iteration 2:   log likelihood = -780.56602  

Multinomial logistic regression                   Number of obs   =        604
                                                  LR chi2(3)      =       0.03
                                                  Prob > chi2     =     0.9989
Log likelihood = -780.56602                       Pseudo R2       =     0.0000

------------------------------------------------------------------------------
   inversion |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
1            |  (base outcome)
-------------+----------------------------------------------------------------
2            |
 wom_ownbuss |  -.0469722    .381627    -0.12   0.902    -.7949474    .7010029
       _cons |   .1647552   .1660831     0.99   0.321    -.1607617    .4902722
-------------+----------------------------------------------------------------
3            |
 wom_ownbuss |  -.0113489   .3373153    -0.03   0.973    -.6724746    .6497769
       _cons |   .7651207   .1478845     5.17   0.000     .4752724    1.054969
-------------+----------------------------------------------------------------
4            |
 wom_ownbuss |  -.0375721   .3227453    -0.12   0.907    -.6701412     .594997
       _cons |   1.093625   .1411573     7.75   0.000     .8169616    1.370288
------------------------------------------------------------------------------

Here, the coefficients change but are still not significantly different from 0. Would you agree with this as an alternative?

Thanks,
Rodrigo


Carlo Lazzaro

Join Date: Apr 2014

Posts: 17712
#13

03 May 2019, 03:35

Rodrigo:
1) If you type:

Code:

regress shocks i.wom_ownbuss##i.inversion, allbase

you will have a clearer picture of what's going on with your interaction.
That said, I would say that, in your case, _cons refers to the situation in which both predictors are at their reference categories: 0 for -wom_ownbuss- and 1 for -invest-.

2) No, I do not support your second code as an alternative to 1), as you're indirectly comparing two really different regression models.
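
To make the interaction output in #12 concrete: _cons (1.791) is the mean of shocks for non-women-owned businesses with inversion = 1, and every other cell mean is a sum of coefficients. For example, women-owned businesses with inversion = 2 have mean 1.791045 + .1464552 + .4241451 - .4727562, about 1.89. A sketch of commands that report such cell means and the ownership difference within each level directly, using the nlsw88 dataset shipped with Stata as a stand-in (wage, union and race stand in for shocks, wom_ownbuss and inversion):

Code:

sysuse nlsw88, clear
regress wage i.union##i.race, allbase

* predicted mean of wage in every union-by-race cell
margins union#race

* union minus non-union difference in mean wage, within each race level
margins race, dydx(union)

* the same simple effects as contrasts against the reference level
contrast r.union@race, effects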

Kind regards,
Carlo
(Stata 19.0)
Rodrigo Abed

Join Date: Sep 2018

Posts: 13
#14

03 May 2019, 04:01

Thanks Carlo, I greatly appreciate your support!
Rodrigo Abed

Join Date: Sep 2018

Posts: 13
#15

08 May 2019, 02:23

Hi Carlo,

I would like to reopen this thread to ask a follow-up question. I want to know if there is a significant difference in pricing power between women-owned and non-women-owned businesses. Therefore, I am planning to regress -pric_pow- on -wom_ownbuss-. The dependent variable -pric_pow- is a categorical variable with 4 levels, while the independent variable is a dummy (1 = women-owned business). Considering the nature of the dependent variable, I decided to break it into several dummies (i.e. dum1=1 if -pric_pow- = 1 and 0 otherwise, dum2=1 if -pric_pow- = 2 and 0 otherwise, and so on). Here is an excerpt of the dataset with the variables I have just mentioned.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte(wom_ownbuss pric_pow dum1 dum2 dum3 dum4)
1 1 1 0 0 0
0 4 0 0 0 1
1 3 0 0 1 0
0 2 0 1 0 0
0 2 0 1 0 0
end

I then run a logit regression to determine if there is any significant difference, as shown below.

Code:

. logit dum1 wom_ownbuss

Iteration 0:   log likelihood =  -385.9273
Iteration 1:   log likelihood = -385.90021
Iteration 2:   log likelihood = -385.90021

Logistic regression                               Number of obs   =        560
                                                  LR chi2(1)      =       0.05
                                                  Prob > chi2     =     0.8159
Log likelihood = -385.90021                       Pseudo R2       =     0.0001

------------------------------------------------------------------------------
        dum1 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
 wom_ownbuss |   .0497615   .2139026     0.23   0.816    -.3694799     .469003
       _cons |   .1692921   .0946189     1.79   0.074    -.0161575    .3547416
------------------------------------------------------------------------------

The coefficient on -wom_ownbuss- is not significantly different from 0. I have taken the constant as the value for non-women-owned businesses, which is significantly different from 0 at the 10% significance level.

My questions are:
a) Is the interpretation of the output correct?
b) Would you recommend any other commands to run the regression without having to break the variable -pric_pow- into dummies?

Thanks for your help!
Rodrigo
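
On question a): in a logit, _cons is the log-odds of dum1 = 1 for the base group (non-women-owned businesses), so its p-value tests whether that log-odds is 0 (a probability of 0.5), not whether the two groups differ; the group difference is the coefficient on wom_ownbuss. Here invlogit(.1692921) is about 0.54. On question b), a hedged sketch of two ways to keep a multi-category outcome in one piece, a cross-tabulation with a chi-squared test and -mlogit-, using the nlsw88 dataset shipped with Stata as a stand-in (race and union stand in for pric_pow and wom_ownbuss). If the categories are ordered, -ologit- would be another candidate.

Code:

sysuse nlsw88, clear

* distribution of the categorical outcome within each group, plus a test of association
tabulate race union, column chi2

* multinomial logit of the categorical outcome on the dummy
mlogit race i.union, baseoutcome(1)

* predicted probability of outcome category 1 in each group
margins union, predict(outcome(1))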