Should I log my variable?

Ally Riddle

Join Date: Aug 2015

Posts: 9
#1

18 Aug 2015, 06:52

I am analysing the effect of intervention (0-no, 1-intervention) on terrorism, per country/year.

It has been suggested that I keep the terrorism variable as a count variable. However, it is heavily skewed, and there is also heteroskedasticity when I run the model. I therefore took the natural logarithm of the variable: gen llta = log(ta+0.00001). I know that zeros can be an issue here. The new variable, however, now has a normal distribution, and the R2 increases significantly.

I am unsure whether it is suitable to log the terrorism variable. When I use different independent variables for intervention (hu1, hu2, hu3), the results sometimes vary. For instance, hu3 increases terrorism when the response is logged, but reduces it significantly when it is not. I'm not sure why this is.

I have attached boxplot.docx to show the box plot of terrorist attacks (showing skewness).
I have also attached forum_ta.txt which shows the regression with and without logging, as well as the hettest.
Attached Files

boxplot.docx (14.3 KB, 1 view)

forum_ta.txt (3.6 KB, 1 view)
Dick Campbell

Join Date: Apr 2014

Posts: 279
#2

18 Aug 2015, 07:33

You may want to read this blog entry by Bill Gould. http://blog.stata.com/2011/08/22/use...tell-a-friend/

Richard T. Campbell
Emeritus Professor of Biostatistics and Sociology
University of Illinois at Chicago
Nick Cox

Join Date: Mar 2014

Posts: 35698
#3

18 Aug 2015, 07:39

Please don't post MS Word attachments. This is an explicit request in the FAQ, which you were asked to read before posting. Many members here don't even use MS Word. In any case graphs can and should be posted directly, e.g. as .png attachments. This also is explained in the FAQ.

Similarly, asking people to open an attachment is an unnecessary expectation: the FAQ Advice explains how to post results as CODE.

My question is why it seems important to you that terrorism (not explained here, but evidently a count variable) be normally distributed. Marginal normality of the response is not even an assumption of linear regression in the strict sense. Moreover, it would make no sense to use linear regression with such a variable as response, as linear regression will not respect the non-negative nature of the response. That will bite very hard, as you evidently have zeros in your data. These are standard points, but see

W. Gould, "Use poisson rather than regress; tell a friend", The Stata Blog, 22 August 2011
http://blog.stata.com/2011/08/22/use-poisson-rather-than-regress-tell-a-friend/

or any book by Jeff Wooldridge or any text on categorical data modelling.

The transformation you have used is utterly arbitrary. I very much doubt that it has produced a normal distribution, as it can do no more than map a set of spikes to another set of spikes. Let's focus on what happens at the lower end, using Mata as a sandbox:

Code:

. mata:
------------------------------------------------- mata (type end to exit) -----
: y = (0,1,2,3)'

: y, ln(y :+ 0.00001)
                  1              2
    +-------------------------------+
  1 |             0   -11.51292546  |
  2 |             1    9.99995e-06  |
  3 |             2    .6931521805  |
  4 |             3    1.098615622  |
    +-------------------------------+

All zeros are mapped to -11.5 or so and so all zeros become massive outliers!

Also on your new scale there is a bigger difference between ln(0 + 0.00001) and ln(1 + 0.00001) than between ln(1 + 0.00001) and ln(50000 + 0.00001).
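
To put numbers on that comparison, here is a minimal sketch using display as a calculator. The scalar name k is made up for illustration; the counts 0, 1 and 50000 and the constant are the ones quoted above.

```stata
* k is a hypothetical name for the constant added before logging
scalar k = 0.00001

* gap between a count of 0 and a count of 1 on the shifted log scale
display ln(1 + k) - ln(0 + k)

* gap between a count of 1 and a count of 50000 on the same scale
display ln(50000 + k) - ln(1 + k)
```

The first gap is about 11.51 and the second about 10.82, so one extra attack starting from zero moves the response further than 49,999 extra attacks starting from one.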

The key point is that adding a small constant before logging is not at all a nudge or conservative change; it is a massive change, and might fairly be called a malformation, not a transformation. This follows from the curvature of ln y as a function of y.

Now such a transformation may make some substantive sense; absence of terrorism incidents is an important qualitative fact; but the transformation does not march with either linear regression, or an aim of marginal normality. Nevertheless, absent expertise in political science on my part, it is hard to see a case for 0.00001.
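
How much the choice of constant matters can be checked directly. A small sketch, in which the alternative constants are arbitrary picks for illustration and not values anyone in this thread proposed:

```stata
* Where a zero count lands on the log scale for several arbitrary constants
foreach k in 1 0.5 0.001 0.00001 {
    display "constant = `k'" _col(22) "ln(0 + constant) = " %9.4f ln(0 + `k')
}
```

The zeros land at 0, about -0.69, about -6.91 and about -11.51 respectively, while ln(1 + constant) stays close to 0 for the two smallest constants. The constant therefore largely decides how extreme the zeros look relative to everything else.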

There is a positive and much simpler way forward already implied, namely Poisson regression.
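
As a minimal sketch of what that could look like in Stata: the data below are simulated so that the block runs on its own, and only the names ta and hu1 are borrowed from the original post. The coefficient values in the simulation are invented, and vce(robust) is an assumption added here, not something stated in this post.

```stata
* Simulated stand-in data: names ta and hu1 come from the thread, values are invented
clear
set seed 2015
set obs 1000
generate byte hu1 = runiform() < 0.5
generate ta = rpoisson(exp(0.5 - 0.4*hu1))

* Poisson regression on the raw counts; zeros need no special treatment
poisson ta hu1, vce(robust)

* Redisplay the same fit as incidence-rate ratios
poisson, irr
```

The coefficient on hu1 is a difference in log expected counts, so its exponential is the ratio of expected attacks with and without the intervention.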
Nick Cox

Join Date: Mar 2014

Posts: 35698
#4

18 Aug 2015, 07:58

Similar comments in http://www.statalist.org/forums/foru...vel-regression
Ally Riddle

Join Date: Aug 2015

Posts: 9
#5

20 Aug 2015, 04:33

I ran the data with the Poisson model (with and without robust), and when I run estat gof it suggests that Poisson is not a good choice, as I have overdispersion. I have tried nbreg and glm negative binomial instead. Since I have issues of heteroskedasticity and autocorrelation, is glm a better-fitting model? Is there a way of knowing whether nbreg or glm negative binomial is better?
Joao Santos Silva

Join Date: Apr 2014

Posts: 3011
#6

20 Aug 2015, 14:44

Dear Ally,

If I understand correctly what you are doing, overdispersion is not a big problem. You can possibly just use the basic Poisson regression with robust standard errors. You can also try the NB regression (also with robust standard errors) but the results are likely to change little.

Joao
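
A hedged sketch of the comparison suggested above, again on simulated data so that it runs on its own. Only the names ta and hu1 come from the thread; the gamma heterogeneity term nu and all parameter values are assumptions chosen to produce overdispersed counts.

```stata
* Simulated overdispersed counts; nu is a hypothetical gamma noise term with mean 1
clear
set seed 2015
set obs 1000
generate byte hu1 = runiform() < 0.5
generate double nu = rgamma(2, 0.5)
generate ta = rpoisson(exp(0.5 - 0.4*hu1) * nu)

* Poisson with robust standard errors
poisson ta hu1, vce(robust)
estimates store pois

* Negative binomial with robust standard errors
nbreg ta hu1, vce(robust)
estimates store nb

* Coefficients and standard errors side by side
estimates table pois nb, b(%9.4f) se(%9.4f)
```

On the nbreg versus glm question raised in #5: glm with family(nbinomial ml) link(log) fits the same mean-dispersion negative binomial model as nbreg, so the two should agree closely on the coefficients.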
