  • Necessary co-variate omitted due to collinearity

    Hi everyone,

    I have a quick question. I am new to Stata, so please forgive me if it seems silly. I am running a regression using NHIS data to see whether using different alternative medicine therapies is associated with increased odds of developing eczema. In this example, I will use massage as the alternative medicine therapy. I added a covariate to control for those who may use massage to treat eczema, represented by the variable "massageczema" (it is binary: 0 = did not use massage to treat eczema, 1 = did use massage to treat eczema). However, when I run the regression, I get a message stating, "0.massageczema omitted because of collinearity."

    Is there any way to work around this problem? I do need to control for those who use the therapy for eczema, as this would be a confounding variable.

    I tried to look on the FAQ on how to post code, but the help dataex was not very clear. I have copied and pasted my output below, although I know it may be hard to read.

    Code:
    svy: logistic eczema i.massage i.massageczema i.sex i.race i.education age i.houseincome
    (running logistic on estimation sample)
    
    note: 0.massageczema omitted because of collinearity
    
    Survey: Logistic regression
    
    Number of strata   =       295                  Number of obs     =      3,509
    Number of PSUs     =       569                  Population size   =  9,464,725
                                                    Design df         =        274
                                                    F(  11,    264)   =       2.95
                                                    Prob > F          =     0.0010
    
    --------------------------------------------------------------------------------
                   |             Linearized
            eczema | Odds Ratio   Std. Err.      t    P>|t|     [95% Conf. Interval]
    ---------------+----------------------------------------------------------------
         1.massage |   1.236095   .1598816     1.64   0.102     .9582202     1.59455
    0.massageczema |          1  (omitted)
                   |
               sex |
           Female  |   1.408807    .155003     3.12   0.002     1.134445    1.749524
                   |
              race |
                1  |   1.145142   .2472934     0.63   0.531       .74856    1.751829
                2  |   1.241144    .562923     0.48   0.634     .5082126    3.031091
                3  |   1.306405   .4124466     0.85   0.398     .7017001    2.432225
                4  |   1.726738   .4454404     2.12   0.035     1.039133    2.869341
                   |
         education |
                1  |   .7790274   .1379485    -1.41   0.160     .5497371    1.103953
                2  |   .8614225   .1350667    -0.95   0.342     .6326449    1.172931
                   |
               age |   .9909397   .0027883    -3.23   0.001     .9854657    .9964442
                   |
       houseincome |
                1  |   .8864313   .1281452    -0.83   0.405     .6668787    1.178266
                2  |   .7667886   .1397184    -1.46   0.146     .5356586    1.097648
                   |
             _cons |   .2079644   .0387932    -8.42   0.000     .1440467    .3002444
    --------------------------------------------------------------------------------
    Note: _cons estimates baseline odds.
    Note: 1 stratum omitted because it contains no population members.
    Note: Strata with single sampling unit treated as certainty units.

  • #2
    Sabrina Khan Hey Sabrina. Welcome to Stata.

    I see two issues here. Look at the middle of the three notes at the bottom of your output, where it says
    Code:
     Note: 1 stratum omitted because it contains no population members.
    So the issue appears to be that nobody is coded as 0 for this category. Check whether this is the case (it likely is). I can't see your data, but it's quite possible you have data entry errors to attend to.

    As a matter of fact, now that I look at your variables again: can you please tell me the difference between i.massage and i.massageczema? You didn't mention it, but if they are the same thing coded twice by mistake, Stata will naturally drop one of them because they are perfectly collinear.
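
    A minimal sketch of that check, using the variable names from the command in #1. The second table assumes the svy: logistic model was the last estimation command run, so that e(sample) marks the observations Stata actually used; treat it as a starting point rather than a full diagnosis.

    Code:
    * Cross-tabulate the two variables, showing missing values as their own category
    tabulate massage massageczema, missing

    * Same table, restricted to the observations used in the last model
    tabulate massage massageczema if e(sample), missing

    If massageczema takes only one value in the second table, there is nothing for Stata to estimate for it in that sample.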


    • #3
      Hi Jared!

      Thanks for responding. "massage" refers to those who have ever had a massage (0 = no, 1 = yes), whereas "massageczema" refers to those who are using massage to treat eczema (0 = no, 1 = yes). However, it is important to note that only people who responded "yes" to ever having used massage (the variable "massage") were asked the question, "Have you used massage to treat eczema?" I also looked into the issue further, and it turns out that no one responded "yes" to using massage for eczema. In that case, should I leave massageczema out of the model?

      I am having this issue on several models (I just chose one at random here). For my one about acupuncture, it gives a similar error although there are entries present. Please see below:

      Code:
      . svy: logistic eczema i.acupuncture i.acupunctureczema i.sex i.race i.education age i.houseincome
      (running logistic on estimation sample)
      
      note: 0.acupunctureczema omitted because of collinearity
      
      Survey: Logistic regression
      
      Number of strata   =       295                  Number of obs     =      3,642
      Number of PSUs     =       570                  Population size   =  9,893,940
                                                      Design df         =        275
                                                      F(  11,    265)   =       3.46
                                                      Prob > F          =     0.0002
      
      ------------------------------------------------------------------------------------
                         |             Linearized
                  eczema | Odds Ratio   Std. Err.      t    P>|t|     [95% Conf. Interval]
      -------------------+----------------------------------------------------------------
           1.acupuncture |   1.345992    .221008     1.81   0.071     .9742265    1.859624
      0.acupunctureczema |          1  (omitted)
                         |
                     sex |
                 Female  |    1.43801   .1528953     3.42   0.001      1.16643    1.772823
                         |
                    race |
                      1  |   1.179529   .2473648     0.79   0.432      .780566    1.782409
                      2  |   1.216751   .5337545     0.45   0.655     .5130423    2.885696
                      3  |   1.264825   .3845439     0.77   0.440     .6951784    2.301254
                      4  |   1.691544   .4246336     2.09   0.037      1.03195     2.77273
                         |
               education |
                      1  |    .795289   .1371071    -1.33   0.185     .5664087    1.116658
                      2  |    .893773   .1373369    -0.73   0.465      .660471    1.209486
                         |
                     age |    .989417   .0027969    -3.76   0.000     .9839262    .9949385
                         |
             houseincome |
                      1  |   .9007233   .1295652    -0.73   0.468     .6785915    1.195568
                      2  |   .7307557    .130565    -1.76   0.080     .5140611    1.038794
                         |
                   _cons |   .2247554   .0407266    -8.24   0.000     .1573212    .3210946
      ------------------------------------------------------------------------------------
      Note: _cons estimates baseline odds.
      Note: 1 stratum omitted because it contains no population members.
      Note: Strata with single sampling unit treated as certainty units.

      I am just concerned because I want to make sure I am accounting for confounding, but it seems as though this covariate is making things worse. Any suggestions?


      • #4
        Sabrina Khan It isn't that you "can't" include massageczema; it's that it is mathematically impossible to compute a coefficient for it, because it has no variation in the dataset. So, in a word, yes: drop that variable from your regression.

        In fact, now that I think about it, it looks like you have one variable for whether you've ever had acupuncture and a second for whether you've had acupuncture for eczema. The second coefficient cannot be estimated: everyone who responded to the eczema question is already in the first group (ever acupuncture), so there is no variation in this group. The only variable you should need is "Have you used x before?"
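
        One way to look at this in the data, sketched with the variable names from #3 and assuming the acupuncture model was the last estimation command run (so e(sample) refers to it). Whether dropping the follow-up variable is right for the analysis is a judgment call; the last line only shows what that model would look like.

        Code:
        * How the two acupuncture variables line up among the observations used in the model
        tabulate acupuncture acupunctureczema if e(sample), missing

        * How the follow-up variable lines up with the outcome in the same sample
        tabulate eczema acupunctureczema if e(sample), missing

        * The same model without the follow-up variable
        svy: logistic eczema i.acupuncture i.sex i.race i.education age i.houseincome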

        Either way: I strongly suspect you've got data entry errors or data management errors you must sort out. Honestly, this sounds like a good topic to discuss with whoever your advisor is. It would be one thing if it was a random occurrence, but the fact that this is a recurrent problem tells me there are other issues in your data that we can't see, unless you could give us an example dataset so I can look further.
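
        A small sketch of how such an example could be produced with dataex, using the variable names from #1. The choice of the first 20 observations is arbitrary, and the survey design variables are not included because the thread does not name them.

        Code:
        * Install dataex from SSC if this copy of Stata does not already have it
        ssc install dataex

        * List the model variables for the first 20 observations in a form others can paste into Stata
        dataex eczema massage massageczema sex race education age houseincome in 1/20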


        • #5
          Hi Jared,

          I think the latter part of what you said is probably correct. The way the NHIS survey data is collected (for 2012; it is on the CDC website for anyone to look at), it asks respondents, "Do you have eczema or a skin allergy?" A separate question asks the respondent to name the top three complementary and alternative medicine (CAM) therapies they have used, from a list of 16. Depending on the response, the respondent is then asked a series of further questions. For example, if you had answered "acupuncture" as your top therapy, the survey would then ask you further questions about acupuncture. One of those questions asks, "Have you used acupuncture for eczema?"

          I want to study the association between CAM use and eczema prevalence using a logistic regression, but I feel that I need to control for people that are using a specific therapy for eczema. Do you know if there is a good way to go about this? I am also trying to get in touch with my advisor about next steps but I want to be as prepared as possible and come up with any possible solutions prior to our meeting.

          Thank you for your help so far.
