-regress- with a heavily skewed DV.

David Speed

Join Date: May 2015

Posts: 98
#1


07 Feb 2017, 06:41

Hello,

I do a lot of health research with large data sets (e.g., Canadian Community Health Survey, General Social Survey). One of the outcomes I am interested in is depression, specifically persons' scores on a short-form depression scale. However, this DV is extremely skewed because most people do not suffer from depression.

Code:

 Depr. scale - |
    short form |
   score - (D) |      Freq.     Percent        Cum.
---------------+-----------------------------------
             0 |     21,432       91.47       91.47
             1 |         33        0.14       91.61
             2 |        100        0.43       92.04
             3 |        200        0.85       92.89
             4 |        307        1.31       94.20
             5 |        381        1.63       95.83
             6 |        510        2.18       98.01
             7 |        347        1.48       99.49
             8 |        120        0.51      100.00
---------------+-----------------------------------
         Total |     23,430      100.00

I was planning to analyze the data with -regress-, but given the extreme skew of the outcome variable I wondered whether this is actually appropriate. I have read a few postings on this topic, but they focus on the assumption of homoscedasticity, which seems less relevant here because estimation with probability weights uses robust (HC1) standard errors anyway.

Any insight would be helpful!

Cheers,

David.
Tags: big data, regress, skew
daniel klein

Join Date: Mar 2014

Posts: 3911
#2

07 Feb 2017, 06:46

Have a look at poisson or nbreg. Both can be expressed as a generalized linear model (glm) and might fit the data better than a standard linear model. An ordered logit (ologit) or probit (oprobit) might also be an alternative.
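
A minimal sketch of how these alternatives could be fit. The data are simulated and the variable names (depr, age, female) are hypothetical stand-ins, not the original survey; with survey data one would add the appropriate weights.

```stata
* Simulated stand-in data (hypothetical; not the poster's survey):
* a 0-8 score with a large pile of zeros
clear
set seed 2017
set obs 5000
generate age = runiform(18, 80)
generate byte female = runiform() < 0.5
generate depr = min(max(ceil(-5 + 0.02*age + 0.5*female + rnormal(0, 3)), 0), 8)

* Count models
poisson depr age female, vce(robust)
nbreg   depr age female

* The Poisson model written as a generalized linear model
glm depr age female, family(poisson) link(log) vce(robust)

* Ordered alternatives
ologit  depr age female
oprobit depr age female
```

The poisson and glm calls fit the same model, so their coefficients should agree; the ordered models treat the nine score values only as ranked categories.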

Best
Daniel
Nick Cox

Join Date: Mar 2014

Posts: 36054
#3

07 Feb 2017, 07:03

I'd consider some kind of logit model here: either ordered logit or, if you were really confident that your response was (a good approximation to) a measured scale, a logit on that scale divided by 8.

The biggest deal for me would not be worrying about skewness of response or heteroscedasticity but ensuring that predicted values made sense. There would be a high chance, on this evidence, of negative predictions for the response if you went ahead with plain regression.
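
One way to check this after plain regression is to look at the range of the fitted values. The sketch below uses simulated data with hypothetical variable names (depr, age, female), so whether any fitted values fall outside 0 to 8 here says nothing about the real survey.

```stata
* Simulated stand-in data (hypothetical; not the poster's survey)
clear
set seed 2017
set obs 5000
generate age = runiform(18, 80)
generate byte female = runiform() < 0.5
generate depr = min(max(ceil(-5 + 0.02*age + 0.5*female + rnormal(0, 3)), 0), 8)

* Plain linear regression with robust standard errors
regress depr age female, vce(robust)

* Fitted values: inspect their range and count any outside 0-8
predict double yhat, xb
summarize yhat
count if yhat < 0 | yhat > 8
```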
Weiwen Ng

Join Date: Jun 2015

Posts: 1241
#4

07 Feb 2017, 07:46

My vote would also be for ordered logit or probit, as these are the most natural fit to the data in a conceptual sense. You can look at it this way, as I was taught by my methods professor: there is a latent (unobserved) continuous variable, say our actual level of depression, and as that gets higher, the probability of making a higher response on that scale increases.

One of the count models will also produce usable results, but in principle, a count model can produce predicted values from 0 to infinity. I've seen one or two posters using count models for scale scores, though, so it appears to be something that some people do in practice. Personally, I would vote against a count model in this context, but if you could demonstrate that it fit the data better than an ordered logit/probit, then I would likely change my mind.

Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

When presenting code or results, please use the code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.
1 like
daniel klein

Join Date: Mar 2014

Posts: 3911
#5

07 Feb 2017, 08:02

In my experience the parallel odds assumption underlying the ordered models is usually violated in real life data. I am not sure whether this is better in any sense than a possible few predictions that fall outside the range of observed values.
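
An informal way to look at this assumption with official commands only: under parallel odds, binary logits at different cutpoints of the score should give similar slopes. This is a rough sketch on simulated data with hypothetical variable names and arbitrarily chosen cutpoints (1 and 4), not a formal test.

```stata
* Simulated stand-in data (hypothetical; not the poster's survey)
clear
set seed 2017
set obs 5000
generate age = runiform(18, 80)
generate byte female = runiform() < 0.5
generate depr = min(max(ceil(-5 + 0.02*age + 0.5*female + rnormal(0, 3)), 0), 8)

* Ordered logit: one set of slopes for all cutpoints
ologit depr age female

* Binary logits at two cutpoints; compare the slopes with each other
* and with the ologit slopes
generate byte ge1 = depr >= 1
generate byte ge4 = depr >= 4
logit ge1 age female
logit ge4 age female
```

Large differences between the cutpoint-specific slopes would suggest the assumption is strained.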

Best
Daniel
1 like
Euslaner

Join Date: Apr 2014

Posts: 219
#6

07 Feb 2017, 11:07

What you want is Gary King's relogit (rare events logit): http://gking.harvard.edu/relogit
Nick Cox

Join Date: Mar 2014

Posts: 36054
#7

07 Feb 2017, 11:25

I can't see how rare events logit (which despite the twist in its name is just for binary responses) maps onto modelling an ordinal response.
Clyde Schechter

Join Date: Apr 2014

Posts: 30355
#8

07 Feb 2017, 11:40

Just to add to the confusion, another possibility here is -tobit- regression. You can imagine that there is a more or less normally distributed latent depression variable, but the measurement instrument is censored from below at 0 and above at 8 and your population has a few people in the right tail but lies mostly to the left.
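
A minimal sketch of such a two-limit tobit, using simulated data and hypothetical variable names (depr, age, female):

```stata
* Simulated stand-in data (hypothetical; not the poster's survey)
clear
set seed 2017
set obs 5000
generate age = runiform(18, 80)
generate byte female = runiform() < 0.5
generate depr = min(max(ceil(-5 + 0.02*age + 0.5*female + rnormal(0, 3)), 0), 8)

* Tobit censored from below at 0 and from above at 8
tobit depr age female, ll(0) ul(8)

* Linear prediction for the latent variable (can lie outside 0-8)
predict double xb_latent, xb

* Expected value of the censored outcome (stays within 0-8)
predict double ey_cens, ystar(0, 8)
summarize xb_latent ey_cens
```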

I think just from the plethora of distinct suggestions here it is clear that there is no one ideal approach to modeling this kind of variable. I think Nick's point is important: do the model predictions you get make sense? (And being somewhat out of range does not qualify as not making sense.) It may be appropriate here to try several approaches and look at their predictions before deciding. And if you're lucky, the different approaches will support substantially the same conclusions so that your findings are somewhat robust to the particular model selection issue.

Please do not interpret this as a recommendation to keep trying models until you get a p < 0.05 that you were hoping for. I'm saying try several models and see how well the model predictions compare to the observed data. In fact, if you are likely to be tempted to p-hack, before running any of the models you can -set pformat %1.0f-. That will cause your p-values to be displayed as either 0 or 1 (depending on whether they are above or below 0.5), which will hide from you any siren calls from "statistical significance" (because nobody is interested in p < 0.5). It's the next best thing to suppressing the p-values altogether.

By the way, just out of curiosity is this the PHQ8 we're talking about here? I'm a bit intrigued because I have used it often in my research. But I have never used it in healthy populations, so I usually find that the PHQ8 distribution is very easy to handle, usually centering around 3 or 4 and reasonably close to bell shaped.
1 like
Euslaner

Join Date: Apr 2014

Posts: 219
#9

07 Feb 2017, 11:45

I am not sure that an ordered probit or logit would give meaningful results since the % of cases in each response is so tiny. A dichotomous relogit model would likely give better estimates.
Nick Cox

Join Date: Mar 2014

Posts: 36054
#10

07 Feb 2017, 11:54

If there were indeed a compelling research or practical reason to reduce the data to binary, then logit is a preferred choice. Otherwise throwing away much of the detail in the data won't appeal to all.
1 like
Weiwen Ng

Join Date: Jun 2015

Posts: 1241
#11

07 Feb 2017, 12:05

Originally posted by Euslaner:

I am not sure that an ordered probit or logit would give meaningful results since the % of cases in each response is so tiny. A dichotomous relogit model would likely give better estimates.

But, consider that the ordered probability models don't analyze individual levels. They calculate the odds of responding at a higher level, given covariates (and then of course you can predict the probability that x=4, for example). About 200 cases per level and a sample of over 20k is quite large. From a very brief skim of the rare events logit model, it appears to apply a small sample correction to the regular logit model to deal with rare cases in relatively small denominators, correct? I am not sure that the numbers given in the original post would make a strong case for that particular model.

Moreover, and perhaps more importantly, by dichotomizing the outcome variable, the original poster could be discarding valuable information on severity of depression. Of course, it's true that people dichotomize all the time, and presumably there are clinically meaningful cutoffs for this scale (e.g. 3 or more is consistent with severe depression, and at this prevalence of severe depression, maybe the rare events model might be appropriate).

Last, of course an ordered probability model will give meaningful results in this case, assuming it converges and doesn't violate the proportional odds assumption too badly. The model does theoretically match the process that we think probably generates the data. The first cutoff is going to be very high, and everything else will be a lot more closely spaced (which is potentially consistent with the instrument having a floor effect if you assume a normal distribution of depression, and you think that the instrument can't detect subclinical cases at all, which in turn would be a potential argument for a Tobit model like Clyde put forth). We estimate odds ratios and probabilities for all sorts of pretty rare events all the time. I am not sure how the results would not be meaningful (assuming theoretical assumptions met and convergence achieved).

2 likes
John Mullahy

Join Date: Dec 2016

Posts: 772
#12

07 Feb 2017, 15:17

It seems to me that a first-order issue is whether you are trying to estimate (a) the conditional mean of the outcome, E[D|x], or (b) the entire probability structure of the data, Pr(D=d|x), d=0,..,8, or (c) a coarsened version of the probability model, Pr(D=0|x) vs. Pr(D>0|x).

For the conditional mean, my guess is that most of these approaches will result in similar estimated marginal effects w.r.t. the covariates. Other possibilities for conditional mean estimation would be fracreg (after normalizing your LHS variable by dividing by 8) and glm (using family(binomial 8)). Both of these would guarantee that predictions of the conditional means are in the (0,8) interval, as would also be true for ordered logit/probit but not necessarily for Tobit (which could predict > 8 unless you use a doubly censored Tobit) or for linear regression (as pointed out by Nick Cox).
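
A minimal sketch of these two conditional-mean estimators, on simulated data with hypothetical variable names (depr, age, female). fracreg needs Stata 14 or later.

```stata
* Simulated stand-in data (hypothetical; not the poster's survey)
clear
set seed 2017
set obs 5000
generate age = runiform(18, 80)
generate byte female = runiform() < 0.5
generate depr = min(max(ceil(-5 + 0.02*age + 0.5*female + rnormal(0, 3)), 0), 8)

* Fractional logit on the score rescaled to [0, 1]
generate double depr01 = depr/8
fracreg logit depr01 age female
predict double m_frac                  // conditional mean of depr01
generate double mean_frac = 8*m_frac   // back on the 0-8 scale

* Binomial GLM with 8 "trials" per person
glm depr age female, family(binomial 8) link(logit) vce(robust)
predict double mean_glm, mu            // expected score

* Both sets of predicted means lie strictly between 0 and 8
summarize mean_frac mean_glm
```

After either model, margins, dydx(*) gives average marginal effects that can be compared with those from the other approaches.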

If your goal instead is to estimate the full conditional probability model, then fracreg and linear regression won't help and instead a full conditional probability model like ordered probit/logit, binomial, Tobit, etc. would be needed. One other point: Note that the marginal distribution of your outcome has two modes, at zero and six. If the conditional distributions had a similar structure--i.e. if the modes weren't just created by the distribution of your covariates--then some conditional probability models would be able to accommodate this multimodality (e.g. ordered probit/logit) but others wouldn't (e.g. a binomial).
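
A rough illustration of the binomial's limitation, using only the pooled frequencies from the table in post #1 (so it speaks to the marginal distribution, not the conditional one). It fits a single binomial(8, p) by matching the sample mean and compares the implied probabilities with the observed shares. The scalar names are made up for this sketch.

```stata
* Pooled frequencies from the table in post #1
scalar n_obs = 23430
scalar tot   = 1*33 + 2*100 + 3*200 + 4*307 + 5*381 + 6*510 + 7*347 + 8*120
scalar ybar  = scalar(tot)/scalar(n_obs)

* Intercept-only binomial(8, p): the ML estimate of p is mean/8
scalar p_hat = scalar(ybar)/8

display "mean score          = " %6.4f scalar(ybar)
display "binomial Pr(D = 0)  = " %6.4f binomialp(8, 0, scalar(p_hat))
display "observed Pr(D = 0)  = " %6.4f 21432/scalar(n_obs)
display "binomial Pr(D = 6)  = " %9.7f binomialp(8, 6, scalar(p_hat))
display "observed Pr(D = 6)  = " %6.4f 510/scalar(n_obs)
```

The fitted binomial puts roughly 0.63 of the mass at zero against about 0.91 observed, and next to nothing at 6, so a single binomial cannot reproduce both the spike at zero and the second mode.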

If the goal is estimation of the coarsened version, Pr(D=0|x) vs. Pr(D>0|x), then -- yes -- binary logit/probit seems sensible.
3 likes