  • GLM and zeros: raw data or log+1 transformation


    Dear all,
    I'm comparing 3 treatments over time; each treatment has a different number of cases (houses). The outcome variable is the reduction in the number of insects across treatments through time.

    I have a lot of zeros in the dataset, meaning that the evaluations were made but no insects were found in the houses (so, from my understanding, it is not a zero-inflated case).

    As the baseline evaluation shows a different number of insects for each house in each treatment, I'm generating an offset from the baseline value of the outcome variable.

    My doubt is whether the GLM accounts for these zeros as negative results, or whether I should tell the model about them. I've seen some references where authors transform the outcome variable with log(x + 1) before running the model, but it is not clear to me whether I should transform the data before fitting the GLM...

    Q1: Is it sensible to use the raw data with plenty of zeros, or should I transform the outcome with log(x + 1) before running the model?
    Q2: If I do need to transform the outcome variable (log(x + 1)), how should I deal with the offset (raw or transformed)?

    The code follows:

    **** generating the offset from the value obtained at the baseline evaluation
    * sorting on followup inside bysort guarantees that [1] picks the first
    * (baseline) record within each house
    bysort houseid (followup): gen offset1 = num_insects[1]

    *** model: interaction of treatment and days post-IRS
    * factor-variable syntax i.treat##c.daysfromirs gives main effects plus the interaction
    glm num_insects i.treat##c.daysfromirs i.empty i.season i.presence if irsround==1, offset(offset1) family(nbinomial) link(log)


    Thank you so much for your time,

    Regards,

    Raq

  • #2
    First of all, watch out for dangerous territory. Across statistical science, GLM sometimes means "general linear model" and sometimes "generalized linear model", and although they overlap, they certainly are not synonyms.

    However, what is certain is that the Stata command glm means generalized linear model.

    If we back up, then I gather that num_insects is your response or outcome variable and sometimes contains zeros.

    That being so, you can proceed to try glm with a log link directly, because the leading assumption with such a link is just that the mean response is positive, not that all values are positive. If you apply this method, you do not transform the outcome first with log(outcome + 1) as a fix for zeros, any more than you would transform it first with log(outcome) if there were no zeros. Transformation of the response is not needed: the log-link machinery takes care of estimation on a transformed scale followed by back-transformation.
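
    A minimal sketch of that point, reusing the variable names from #1 (the covariates shown are placeholders, not a recommendation): model the raw counts, zeros included, through the log link rather than transforming to log(num_insects + 1) first.

    * sketch only: raw counts, zeros included, with a log link
    glm num_insects i.treat##c.daysfromirs, family(nbinomial ml) link(log)

    * predict with the mu option returns fitted means on the original count scale,
    * so back-transformation is handled for you
    predict mu_hat, mu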

    That said, we necessarily know less about your data than you do and are in no position to confirm that you don't need a zero-inflated model. Perhaps you do!

    I have no idea what treating zero as negative results means. Are the zeros treated as if they were negative values? No. Are they treated as if they were missing? No. Do you mean something else?

    I don't understand the offsets here.



    • #3
      Hi Nick,
      Thank you so much for your reply.

      Yes, I meant whether the GLM was treating zeros as missing values, so you have already answered that; thanks. Using the raw data makes sense, then.

      About the offset: at baseline the outcome variable is very different for each case.
      Because I want to estimate the overall reduction in the outcome variable over time, across treatments, I'm assuming that I need to give the model some sort of balance (asking it to consider the reduction over time relative to the first evaluation for each case).
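
      A sketch only, not something stated in this thread: with a log link, the variable given to offset() enters the linear predictor directly, so scaling the counts relative to the baseline would usually mean supplying the log of the baseline count, or using exposure(), which takes the log internally. This assumes offset1 holds the baseline count and is strictly positive.

      * sketch: log of the baseline count as the offset for a log-link model
      gen ln_base = ln(offset1)
      glm num_insects i.treat##c.daysfromirs, family(nbinomial ml) link(log) offset(ln_base)

      * equivalent shortcut: exposure() applies the log for you
      glm num_insects i.treat##c.daysfromirs, family(nbinomial ml) link(log) exposure(offset1)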

      Thanks for your note about zero inflation. I'll do some reading to make sure whether that is the case. I was assuming it was not, because the zeros in my dataset mean that the house was assessed but no insects were found, but extra reading will give me more confidence about the model choice.

      Many thanks once again
      Raq



      • #4
        My understanding of zero inflation is that there are uncomfortably more zeros than your tacit model allows, where uncomfortable means that your estimates and predictions are based on false premises and so are not trustworthy. EDIT: For example, a Poisson certainly allows zeros, their probability depending on the mean, but your real data may show more than that.
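
        A rough sketch of that comparison (the covariates are placeholders): fit a Poisson, take the probability of a zero implied by each fitted mean, and set the average of those probabilities against the observed share of zeros.

        * sketch: observed zeros versus zeros implied by a Poisson fit
        poisson num_insects i.treat c.daysfromirs
        predict mu_pois, n                    // fitted means
        gen p0_pois = exp(-mu_pois)           // Pr(Y = 0) under Poisson at each fitted mean
        summarize p0_pois                     // average implied probability of a zero
        count if num_insects == 0
        display "observed share of zeros = " r(N)/_N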

        I wasn't aware that you could establish the absence of zero inflation in advance as a matter of how data are produced or, if you prefer, collected.

        But there are many people here who know more about these models than I do. Either way, beware of assumption in the sense of presuming something is so because the contrary would be inconvenient.
        Last edited by Nick Cox; 09 Apr 2018, 08:14.



        • #5
          If I am not mistaken, zero-inflated models are mainly called for in situations where a different model explains "zero/non-zero" than explains "zero/one/two/...". A classic example is asking people how many fish they caught while visiting the park. If they actually went fishing, the count might depend on water conditions and skill. If they simply didn't fish, those conditions may or may not be very relevant.

          As a result, having many zeroes does not by itself mean that you need a zero-inflated model, and the reverse also holds: few zeroes do not rule one out (although if only 2% of your observations are zero, you can probably take the risk).
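
          For concreteness, a sketch of that two-process setup in Stata is zip (or zinb), where inflate() carries the covariates for the "never fished" part; the variables below are placeholders taken from #1, not a suggestion for these data.

          * sketch of a zero-inflated count model: one equation for the counts,
          * another (inflate) for the excess-zero process
          zip num_insects i.treat c.daysfromirs, inflate(i.empty i.season) irr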



          • #6
            Yes, you are right. I think that is the first step, then.
            I will look into this right now.
            Thank you so much!



            • #7
              Originally posted by Nick Cox:
              My understanding of zero inflation is that there are uncomfortably more zeros than your tacit model allows, where uncomfortable means that your estimations and predictions are based on false premises and so not trustworthy.

              I wasn't aware that you could establish the absence of zero inflation in advance as a matter of how data are produced, or if you prefer collected.
              There was a lively discussion on zero-inflated models between Paul Allison and William Greene. Indeed, the motivation behind a zero-inflated model appears to be the assumption of two data-generating processes that create the zeros in the data, hence leading to a proportion of zeros that would not be expected under a non-inflated count model. Example 1 for the zero-inflated Poisson model in the Stata manual is a nice starter.

              On a more technical note, Raquel should consider using the specialized nbreg command rather than glm.
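
              A sketch of that suggestion, keeping the variable names from #1 and assuming the baseline count in offset1 is strictly positive (exposure() takes its log internally):

              * sketch only: specialized negative binomial with the baseline count as exposure
              nbreg num_insects i.treat##c.daysfromirs i.empty i.season i.presence if irsround==1, exposure(offset1) irr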

              Edit:
              Crossed with Jesse's answer above, which picks up the fishing example I was pointing to.

              Best
              Daniel
              Last edited by daniel klein; 09 Apr 2018, 08:16.



              • #8
                As always there can be a small tension between what the data say and what we imagine to be the generating processes based on other ideas. Thus, the distribution of posts per contributor (including people registered but yet to contribute) on Statalist is unlikely to be Poisson. Or perhaps it is!



                • #9
                  Thank you so much, Daniel and Nick!

