
  • Correlation between binary variable and independent nominal variable

    Hi, which test should I use in Stata for a correlation between a binary variable (0 = no and 1 = yes) and a nominal variable (e.g. city, with 7 categories)?

  • #2
    Ravina (I assume):
    you may want to consider -ktau- and, just to have a more complete picture, -logit- (which investigates something different from correlation, though):
    Code:
    . use "C:\Program Files\Stata17\ado\base\a\auto.dta"
    (1978 automobile data)
    
    . ktau foreign rep78, stats(taua taub obs p)
    
      Number of obs =      69
    Kendall's tau-a =       0.3095
    Kendall's tau-b =       0.5589
    Kendall's score =     726
        SE of score =     145.056   (corrected for ties)
    
    Test of H0: foreign and rep78 are independent
         Prob > |z| =       0.0000  (continuity corrected)
    
    . logit foreign i.rep78
    
    note: 1.rep78 != 0 predicts failure perfectly;
          1.rep78 omitted and 2 obs not used.
    
    note: 2.rep78 != 0 predicts failure perfectly;
          2.rep78 omitted and 8 obs not used.
    
    note: 5.rep78 omitted because of collinearity.
    Iteration 0:   log likelihood = -38.411464  
    Iteration 1:   log likelihood = -27.676628  
    Iteration 2:   log likelihood = -27.446054  
    Iteration 3:   log likelihood = -27.444671  
    Iteration 4:   log likelihood = -27.444671  
    
    Logistic regression                                     Number of obs =     59
                                                            LR chi2(2)    =  21.93
                                                            Prob > chi2   = 0.0000
    Log likelihood = -27.444671                             Pseudo R2     = 0.2855
    
    ------------------------------------------------------------------------------
         foreign | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
    -------------+----------------------------------------------------------------
           rep78 |
              1  |          0  (empty)
              2  |          0  (empty)
              3  |  -3.701302   .9906975    -3.74   0.000    -5.643033   -1.759571
              4  |  -1.504077   .9128709    -1.65   0.099    -3.293271    .2851168
              5  |          0  (omitted)
                 |
           _cons |   1.504077    .781736     1.92   0.054    -.0280969    3.036252
    ------------------------------------------------------------------------------
    
    .
    Kind regards,
    Carlo
    (Stata 19.0)
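    As an aside for non-Stata readers, what -ktau- computes can be sketched by hand. Here is a hypothetical pure-Python version of tau-a and tau-b on made-up toy data (not auto.dta); the function name and data are mine, not part of the Stata output above:

```python
from math import sqrt

def kendall_taus(x, y):
    """Kendall's tau-a and tau-b by brute-force pair counting.

    tau-a = S / n0, where S = concordant - discordant pairs and
    n0 = n(n-1)/2; tau-b additionally corrects for ties:
    tau-b = S / sqrt((n0 - t_x) * (n0 - t_y)).
    """
    n = len(x)
    concordant = discordant = ties_x = ties_y = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0:
                ties_x += 1          # pair tied on x
            if dy == 0:
                ties_y += 1          # pair tied on y
            if dx != 0 and dy != 0:
                if dx * dy > 0:
                    concordant += 1
                else:
                    discordant += 1
    n0 = n * (n - 1) // 2
    s = concordant - discordant
    tau_a = s / n0
    tau_b = s / sqrt((n0 - ties_x) * (n0 - ties_y))
    return tau_a, tau_b

# Toy example: a binary variable against an ordered variable with ties.
tau_a, tau_b = kendall_taus([0, 0, 1, 1], [1, 2, 2, 3])
print(tau_a, tau_b)  # tau-b exceeds tau-a because ties shrink its denominator
```

    This also shows why tau-b is the more useful of the two when either variable has many ties, as a binary variable always does.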



    • #3
      -ktau- is there for ordinal variables, but Ravina stated she has a nominal variable. In the case of a binary by nominal variable you can just use tabulate with the chi2 option. So here is an example of what that looks like:

      Code:
      . sysuse auto
      (1978 automobile data)
      
      . tab foreign rep78, chi2
      
                 |                   Repair record 1978
      Car origin |         1          2          3          4          5 |     Total
      -----------+-------------------------------------------------------+----------
        Domestic |         2          8         27          9          2 |        48
         Foreign |         0          0          3          9          9 |        21
      -----------+-------------------------------------------------------+----------
           Total |         2          8         30         18         11 |        69
      
                Pearson chi2(4) =  27.2640   Pr = 0.000
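      For readers who want to see where that chi2 statistic comes from, it can be reproduced by hand outside Stata. A quick Python cross-check of the table above (the closed-form tail probability works here because df = 4 is even):

```python
from math import exp

# The foreign-by-rep78 table from the tabulate output above.
table = [
    [2, 8, 27, 9, 2],  # Domestic
    [0, 0, 3, 9, 9],   # Foreign
]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
n = sum(row_totals)

# Pearson chi2: sum over all cells of (observed - expected)^2 / expected,
# with expected = row total * column total / grand total.
chi2 = sum(
    (obs - row_totals[i] * col_totals[j] / n) ** 2
    / (row_totals[i] * col_totals[j] / n)
    for i, row in enumerate(table)
    for j, obs in enumerate(row)
)

# df = (rows - 1)(cols - 1) = 4 here; for even df = 2m the chi2 survival
# function has the closed form exp(-x/2) * sum_{k<m} (x/2)^k / k!.
half = chi2 / 2
p = exp(-half) * (1 + half)

print(round(chi2, 4), p)  # matches Stata's 27.2640, Pr = 0.000
```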


      Here is a bit of statistical nerding you can easily ignore.

      There is also the exact option in tabulate for a Fischer's exact test, but that name sounds much better than the test actually is: Agresti, A., & Coull, B. A. (1998). Approximate Is Better than “Exact” for Interval Estimation of Binomial Proportions. The American Statistician, 52(2), 119–126. https://doi.org/10.2307/2685469
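      To spell out what the exact test actually does for a 2x2 table: with all margins fixed, the first cell follows a hypergeometric distribution, and the two-sided p-value sums the probabilities of every table (with those margins) that is no more likely than the observed one. A hypothetical Python sketch on toy numbers (the function and data are mine, not from auto.dta):

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    r1, r2 = a + b, c + d          # row totals
    c1, n = a + c, a + b + c + d   # first column total, grand total
    denom = comb(n, c1)

    def prob(k):
        # Hypergeometric P(first cell = k) given the margins.
        return comb(r1, k) * comb(r2, c1 - k) / denom

    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    # A tiny tolerance guards against float noise when comparing probabilities.
    return sum(prob(k) for k in range(lo, hi + 1)
               if prob(k) <= p_obs * (1 + 1e-9))

print(fisher_exact_2x2(1, 3, 3, 1))  # 34/70, about 0.486, for this toy table
```

      Because the p-value is a sum over a discrete set of tables, it can only take a handful of values, which is one reason it ends up conservative.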

      You can also look at a simulation (this one requires simpplot from SSC):

      Code:
      clear all
      
      program define sim
          drop _all
          set obs 4
          gen row = ceil(_n/2)
          gen col = mod(_n,2) + 1
          gen freq = rpoisson(10)
          tab row col [fw=freq], chi2 exact nofreq
      end
      
      simulate p = r(p) exact=r(p_exact), reps(40000): sim
      simpplot p exact, overall reps(5000)
      [Graph.png: simpplot of nominal vs. simulated p-values for the chi2 and exact tests]

      I repeatedly created data where the null hypothesis is true, performed the chi2 and exact tests on each of these datasets, and stored the p-values. A p-value measures the probability of drawing a dataset, from a population where the null hypothesis is true, that deviates as much as or more from the null hypothesis as that dataset. We know in this case that the null hypothesis is true, since I created the data. So if we find a p-value of 0.05, then 5% of our replications should have a smaller p-value, and if we find a p-value of 0.10, then 10% of our replications should have a smaller p-value.

      So the simulation provides, for each nominal p-value, an estimate of what the p-value should have been: the proportion of replications with a p-value less than that. The graph shows, for each replication, the nominal significance (the p-value the table spit out for that dataset) and the deviation between that nominal p-value and the p-value we computed based on the simulation (the proportion of replications less than that).

      We can see that the "exact" test is not as exact as the name suggests: the exact p-value is better thought of as an upper bound. This is nothing new; it is the point made by Agresti and Coull in the reference above.
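      The same experiment can be rerun outside Stata. Here is a hypothetical pure-Python analogue of the -simulate- program above (Poisson(10) cells in a 2x2 table, fewer replications to keep it quick); the Knuth Poisson sampler and the erfc form of the df = 1 chi2 tail are my choices, not part of the Stata code:

```python
import random
from math import comb, erfc, exp, sqrt

def rpoisson(lam):
    """One Poisson(lam) draw via Knuth's multiplication algorithm."""
    threshold = exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= random.random()
        if p <= threshold:
            return k
        k += 1

def chi2_p(a, b, c, d):
    """Pearson chi2 p-value for a 2x2 table; with df = 1 the survival
    function reduces to erfc(sqrt(x/2))."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return erfc(sqrt(stat / 2))

def exact_p(a, b, c, d):
    """Two-sided Fisher exact p-value by hypergeometric enumeration."""
    r1, r2, c1, n = a + b, c + d, a + c, a + b + c + d
    denom = comb(n, c1)
    prob = lambda k: comb(r1, k) * comb(r2, c1 - k) / denom
    p_obs = prob(a)
    return sum(prob(k) for k in range(max(0, c1 - r2), min(r1, c1) + 1)
               if prob(k) <= p_obs * (1 + 1e-9))

random.seed(12345)
reps, alpha = 2000, 0.05
reject_chi2 = reject_exact = 0
for _ in range(reps):
    # Four independent Poisson(10) cells: the independence null holds.
    a, b, c, d = [rpoisson(10) for _ in range(4)]
    if min(a + b, c + d, a + c, b + d) == 0:
        continue  # degenerate margin: both tests are undefined, skip
    reject_chi2 += chi2_p(a, b, c, d) <= alpha
    reject_exact += exact_p(a, b, c, d) <= alpha

# The exact test rejects less often than its nominal 5% level, i.e. its
# p-values are conservative, in line with the graph above.
print(reject_chi2 / reps, reject_exact / reps)
```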
      ---------------------------------
      Maarten L. Buis
      University of Konstanz
      Department of history and sociology
      box 40
      78457 Konstanz
      Germany
      http://www.maartenbuis.nl
      ---------------------------------



      • #4
        For Fischer read Fisher. Otherwise I agree with Maarten Buis. The way forward is to see how the outcome (whichever it is) can be predicted from the other variable using an appropriate model fit.
