Issue with an Independent variable consisting of Non Mutually Exclusive Categories, whereby one category perfectly predicts the outcome.

Mike Hall

Join Date: Apr 2019

Posts: 2
#1


09 Apr 2019, 10:04

Hi Statalist,

I'm having a number of problems with my probit model. I am attempting to model predictors of successful smoking cessation attempts, so my dependent variable takes a value of one for ex-smokers (who have quit) and a value of zero for current smokers with a previous quit attempt.

One particular explanatory variable I want to examine is the reason given for a quit attempt. I want to see if the motive behind a quit attempt is associated with the probability of success. The survey allows respondents to select multiple reasons for a quit attempt, so I could not use a single categorical variable. Instead I have a separate binary variable (Yes/No) for each quit reason, and I run a separate regression for each one.

The independent 'Reason' variables therefore take a value of 1 if stated, and a value of 0 if not stated as a reason for trying to quit smoking.

However, I am having trouble due to the design of the survey. Current smokers and ex-smokers were asked separate questions about their reasons for quitting, and ex-smokers were given more options to choose from (Pregnancy, "Own Motivation" and "Cannot Remember"). The problem is this "Own Motivation" variable. It supposedly refers to individuals who quit simply because they felt like it, and for no specific reason. However, it was an option available only to ex-smokers, so it perfectly predicts the dependent variable being equal to one and has no natural interpretation.
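The perfect-prediction problem described above can be checked with a simple cross-tabulation of each reason against the outcome. The following is a minimal sketch in Python rather than Stata, using made-up toy data (not the survey data); the names `quit_success`, `own_motivation`, `financial`, `crosstab` and `perfectly_predicts_success` are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Made-up toy data, NOT from the survey.
# quit_success = 1 for ex-smokers, 0 for current smokers with a past attempt.
quit_success = [1] * 10 + [0] * 10
# "Own Motivation" was offered only to ex-smokers, so it is never 1
# when quit_success == 0.
own_motivation = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0] + [0] * 10
# A reason offered to both groups shows up on both sides of the outcome.
financial = [1, 1, 0, 0, 1, 0, 0, 1, 0, 0] + [1, 0, 1, 0, 0, 1, 0, 0, 1, 0]


def crosstab(reason, outcome):
    """Count respondents in each (reason, outcome) cell."""
    return Counter(zip(reason, outcome))


def perfectly_predicts_success(reason, outcome):
    """True if every respondent who stated the reason has outcome == 1."""
    cells = crosstab(reason, outcome)
    return cells[(1, 1)] > 0 and cells[(1, 0)] == 0


for name, reason in [("own_motivation", own_motivation), ("financial", financial)]:
    cells = dict(crosstab(reason, quit_success))
    print(name, cells, perfectly_predicts_success(reason, quit_success))
```

An empty (reason = 1, outcome = 0) cell is the signature of a reason that was offered to only one group.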

For some reason, over 30% of ex-smokers selected "Own Motivation". When this variable is included in my analysis, the coefficients for all other reasons (Financial, Health, Family Pressure, Effect on Others) become extremely negative. When I exclude individuals who selected only "Own Motivation", the coefficients become far more reasonable.

I assume this is because the effect of "Own Motivation" is contained within the zero values of each independent variable. But it makes the interpretation of my results unclear. Essentially, the results tell me that people who try to quit for, e.g., financial reasons are far more likely to relapse than those who do not.

Clearly something is very wrong with this approach, but I do not know how else to go about it. The only thing I can think of is to drop individuals who stated "Own Motivation" and nothing else, but this risks severe selection bias.
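The mechanism suspected above can be illustrated with made-up counts. This is a minimal sketch in Python rather than Stata, and it assumes a probit with a constant and a single binary regressor and no other covariates; in that case the fitted slope equals the difference between the inverse normal CDF of the success proportion among those who stated the reason and that among those who did not. All counts below are invented for illustration, and `probit_slope` is a hypothetical helper name.

```python
from statistics import NormalDist


def probit_slope(successes_1, n_1, successes_0, n_0):
    """Slope of a probit with a constant and one binary regressor.

    For this saturated model the fitted probabilities equal the observed
    proportions, so the slope is invnormal(p1) - invnormal(p0).
    Both proportions must lie strictly between 0 and 1.
    """
    z = NormalDist().inv_cdf
    return z(successes_1 / n_1) - z(successes_0 / n_0)


# Invented counts, NOT from the survey.
EX_TOTAL = 1000       # ex-smokers (outcome = 1)
EX_OWN_ONLY = 300     # ex-smokers who stated only "Own Motivation"
EX_FINANCIAL = 280    # ex-smokers who stated a financial reason
CUR_TOTAL = 1000      # current smokers (outcome = 0)
CUR_FINANCIAL = 400   # current smokers who stated a financial reason

# Respondents who stated a financial reason (same in both samples).
fin_n = EX_FINANCIAL + CUR_FINANCIAL
cur_not_financial = CUR_TOTAL - CUR_FINANCIAL

# Full sample: the "own motivation only" ex-smokers sit in the financial = 0
# group and raise its success rate.
ref_success_all = EX_TOTAL - EX_FINANCIAL
slope_all = probit_slope(
    EX_FINANCIAL, fin_n, ref_success_all, ref_success_all + cur_not_financial
)

# Same data with the "own motivation only" ex-smokers excluded.
ref_success_excl = ref_success_all - EX_OWN_ONLY
slope_excl = probit_slope(
    EX_FINANCIAL, fin_n, ref_success_excl, ref_success_excl + cur_not_financial
)

print("slope, full sample:       ", slope_all)
print("slope, own-only excluded: ", slope_excl)
```

With these invented counts the slope is negative in the full sample and essentially zero once the "Own Motivation only" respondents are excluded, even though nothing about the financial-reason respondents has changed. The shift comes entirely from who ends up in the zero category.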

Does anybody have any suggestions about how to go about this?

Thanks in Advance
Phil Bromiley

Join Date: Apr 2014

Posts: 4348
#2

10 Apr 2019, 11:01

You didn't get a quick answer. You'll increase your chances of a useful answer by following the FAQ on asking questions: provide Stata code in code delimiters, readable Stata output (a fixed-width font helps), and sample data using dataex.

If you're trying to differentiate smokers from ex-smokers, then it is hard to see how you could use a variable that is only available for one of the two groups. Another take is that because this option was only offered to people who quit, its value depends on whether they quit, which is a reverse causality/endogeneity problem. I didn't notice if respondents could pick more than one reason.

While there might be a sophisticated way to model this (maybe in gsem?), I don't see any way to handle it except to drop variables not collected in both groups.