
  • Correlation between binary variable and independent nominal variable

    Hi, which test should I use in Stata for a correlation between a binary variable (0 = no and 1 = yes) and a nominal variable (e.g. city, with 7 categories)?

  • #2
    Ravina (I assume):
    you may want to consider -ktau- and, just to have a more complete picture, -logit- (which investigates something different from correlation, though):
    Code:
    . use "C:\Program Files\Stata17\ado\base\a\auto.dta"
    (1978 automobile data)
    
    . ktau foreign rep78, stats(taua taub obs p)
    
      Number of obs =      69
    Kendall's tau-a =       0.3095
    Kendall's tau-b =       0.5589
    Kendall's score =     726
        SE of score =     145.056   (corrected for ties)
    
    Test of H0: foreign and rep78 are independent
         Prob > |z| =       0.0000  (continuity corrected)
    
    . logit foreign i.rep78
    
    note: 1.rep78 != 0 predicts failure perfectly;
          1.rep78 omitted and 2 obs not used.
    
    note: 2.rep78 != 0 predicts failure perfectly;
          2.rep78 omitted and 8 obs not used.
    
    note: 5.rep78 omitted because of collinearity.
    Iteration 0:   log likelihood = -38.411464  
    Iteration 1:   log likelihood = -27.676628  
    Iteration 2:   log likelihood = -27.446054  
    Iteration 3:   log likelihood = -27.444671  
    Iteration 4:   log likelihood = -27.444671  
    
    Logistic regression                                     Number of obs =     59
                                                            LR chi2(2)    =  21.93
                                                            Prob > chi2   = 0.0000
    Log likelihood = -27.444671                             Pseudo R2     = 0.2855
    
    ------------------------------------------------------------------------------
         foreign | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
    -------------+----------------------------------------------------------------
           rep78 |
              1  |          0  (empty)
              2  |          0  (empty)
              3  |  -3.701302   .9906975    -3.74   0.000    -5.643033   -1.759571
              4  |  -1.504077   .9128709    -1.65   0.099    -3.293271    .2851168
              5  |          0  (omitted)
                 |
           _cons |   1.504077    .781736     1.92   0.054    -.0280969    3.036252
    ------------------------------------------------------------------------------
    
    .
    Kind regards,
    Carlo
    (Stata 19.0)
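    As an aside for non-Stata readers, what -ktau- computes can be sketched by hand. Here is a hypothetical pure-Python version of tau-a and tau-b on made-up toy data (not auto.dta); the function name and data are mine, not part of the Stata output above:

```python
from math import sqrt

def kendall_taus(x, y):
    """Kendall's tau-a and tau-b by brute-force pair counting.

    tau-a = S / n0, where S = concordant - discordant pairs and
    n0 = n(n-1)/2; tau-b additionally corrects for ties:
    tau-b = S / sqrt((n0 - t_x) * (n0 - t_y)).
    """
    n = len(x)
    concordant = discordant = ties_x = ties_y = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0:
                ties_x += 1          # pair tied on x
            if dy == 0:
                ties_y += 1          # pair tied on y
            if dx != 0 and dy != 0:
                if dx * dy > 0:
                    concordant += 1
                else:
                    discordant += 1
    n0 = n * (n - 1) // 2
    s = concordant - discordant
    tau_a = s / n0
    tau_b = s / sqrt((n0 - ties_x) * (n0 - ties_y))
    return tau_a, tau_b

# Toy example: a binary variable against an ordered variable with ties.
tau_a, tau_b = kendall_taus([0, 0, 1, 1], [1, 2, 2, 3])
print(tau_a, tau_b)  # tau-b exceeds tau-a because ties shrink its denominator
```

    This also shows why tau-b is the more useful of the two when either variable has many ties, as a binary variable always does.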



    • #3
      -ktau- is there for ordinal variables, but Ravina stated she has a nominal variable. In the case of a binary by nominal variable you can just use tabulate with the chi2 option. So here is an example of what that looks like:

      Code:
      . sysuse auto
      (1978 automobile data)
      
      . tab foreign rep78, chi2
      
                 |                   Repair record 1978
      Car origin |         1          2          3          4          5 |     Total
      -----------+-------------------------------------------------------+----------
        Domestic |         2          8         27          9          2 |        48
         Foreign |         0          0          3          9          9 |        21
      -----------+-------------------------------------------------------+----------
           Total |         2          8         30         18         11 |        69
      
                Pearson chi2(4) =  27.2640   Pr = 0.000
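      For readers who want to see where that chi2 statistic comes from, it can be reproduced by hand outside Stata. A quick Python cross-check of the table above (the closed-form tail probability works here because df = 4 is even):

```python
from math import exp

# The foreign-by-rep78 table from the tabulate output above.
table = [
    [2, 8, 27, 9, 2],  # Domestic
    [0, 0, 3, 9, 9],   # Foreign
]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
n = sum(row_totals)

# Pearson chi2: sum over all cells of (observed - expected)^2 / expected,
# with expected = row total * column total / grand total.
chi2 = sum(
    (obs - row_totals[i] * col_totals[j] / n) ** 2
    / (row_totals[i] * col_totals[j] / n)
    for i, row in enumerate(table)
    for j, obs in enumerate(row)
)

# df = (rows - 1)(cols - 1) = 4 here; for even df = 2m the chi2 survival
# function has the closed form exp(-x/2) * sum_{k<m} (x/2)^k / k!.
half = chi2 / 2
p = exp(-half) * (1 + half)

print(round(chi2, 4), p)  # matches Stata's 27.2640, Pr = 0.000
```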


      Here is a bit of statistical nerding you can easily ignore.

      There is also the exact option in tabulate for a Fischer's exact test, but that name sounds much better than the test actually is: Agresti, A., & Coull, B. A. (1998). Approximate Is Better than “Exact” for Interval Estimation of Binomial Proportions. The American Statistician, 52(2), 119–126. https://doi.org/10.2307/2685469
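      To spell out what the exact test actually does for a 2x2 table: with all margins fixed, the first cell follows a hypergeometric distribution, and the two-sided p-value sums the probabilities of every table (with those margins) that is no more likely than the observed one. A hypothetical Python sketch on toy numbers (the function and data are mine, not from auto.dta):

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    r1, r2 = a + b, c + d          # row totals
    c1, n = a + c, a + b + c + d   # first column total, grand total
    denom = comb(n, c1)

    def prob(k):
        # Hypergeometric P(first cell = k) given the margins.
        return comb(r1, k) * comb(r2, c1 - k) / denom

    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    # A tiny tolerance guards against float noise when comparing probabilities.
    return sum(prob(k) for k in range(lo, hi + 1)
               if prob(k) <= p_obs * (1 + 1e-9))

print(fisher_exact_2x2(1, 3, 3, 1))  # 34/70, about 0.486, for this toy table
```

      Because the p-value is a sum over a discrete set of tables, it can only take a handful of values, which is one reason it ends up conservative.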

      You can also look at a simulation (this one requires simpplot from SSC):

      Code:
      clear all
      
      program define sim
          drop _all
          set obs 4
          gen row = ceil(_n/2)
          gen col = mod(_n,2) + 1
          gen freq = rpoisson(10)
          tab row col [fw=freq], chi2 exact nofreq
      end
      
      simulate p = r(p) exact=r(p_exact), reps(40000): sim
      simpplot p exact, overall reps(5000)
      [Graph.png: simpplot of nominal vs. simulated p-values for the chi2 and exact tests]

      I repeatedly created data where the null hypothesis is true, performed the chi2 and exact tests on each of these datasets, and stored the p-values. A p-value measures the probability of drawing a dataset, from a population where the null hypothesis is true, that deviates as much as or more from the null hypothesis as that dataset. We know in this case that the null hypothesis is true, since I created the data. So if we find a p-value of 0.05, then 5% of our replications should have a smaller p-value, and if we find a p-value of 0.10, then 10% of our replications should have a smaller p-value.

      So the simulation provides, for each nominal p-value, an estimate of what the p-value should have been: the proportion of replications with a p-value less than that. The graph shows, for each replication, the nominal significance (the p-value the table spit out for that dataset) and the deviation between that nominal p-value and the p-value we computed based on the simulation (the proportion of replications less than that).

      We can see that the "exact" test is not as exact as the name suggests: the exact p-value is better thought of as an upper bound. This is nothing new; it is the point made by Agresti and Coull in the reference above.
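      The same experiment can be rerun outside Stata. Here is a hypothetical pure-Python analogue of the -simulate- program above (Poisson(10) cells in a 2x2 table, fewer replications to keep it quick); the Knuth Poisson sampler and the erfc form of the df = 1 chi2 tail are my choices, not part of the Stata code:

```python
import random
from math import comb, erfc, exp, sqrt

def rpoisson(lam):
    """One Poisson(lam) draw via Knuth's multiplication algorithm."""
    threshold = exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= random.random()
        if p <= threshold:
            return k
        k += 1

def chi2_p(a, b, c, d):
    """Pearson chi2 p-value for a 2x2 table; with df = 1 the survival
    function reduces to erfc(sqrt(x/2))."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return erfc(sqrt(stat / 2))

def exact_p(a, b, c, d):
    """Two-sided Fisher exact p-value by hypergeometric enumeration."""
    r1, r2, c1, n = a + b, c + d, a + c, a + b + c + d
    denom = comb(n, c1)
    prob = lambda k: comb(r1, k) * comb(r2, c1 - k) / denom
    p_obs = prob(a)
    return sum(prob(k) for k in range(max(0, c1 - r2), min(r1, c1) + 1)
               if prob(k) <= p_obs * (1 + 1e-9))

random.seed(12345)
reps, alpha = 2000, 0.05
reject_chi2 = reject_exact = 0
for _ in range(reps):
    # Four independent Poisson(10) cells: the independence null holds.
    a, b, c, d = [rpoisson(10) for _ in range(4)]
    if min(a + b, c + d, a + c, b + d) == 0:
        continue  # degenerate margin: both tests are undefined, skip
    reject_chi2 += chi2_p(a, b, c, d) <= alpha
    reject_exact += exact_p(a, b, c, d) <= alpha

# The exact test rejects less often than its nominal 5% level, i.e. its
# p-values are conservative, in line with the graph above.
print(reject_chi2 / reps, reject_exact / reps)
```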
      ---------------------------------
      Maarten L. Buis
      University of Konstanz
      Department of history and sociology
      box 40
      78457 Konstanz
      Germany
      http://www.maartenbuis.nl
      ---------------------------------



      • #4
        For Fischer read Fisher. Otherwise I agree with Maarten Buis. The way forward is to see how the outcome (whichever it is) can be predicted from the other variable using an appropriate model fit.
