Confusions when dealing with skewed data

David Lu

Join Date: May 2016

Posts: 105
#1

06 May 2016, 06:20

Hi all,

I have a cross-sectional dataset containing firms' annual sales. I'm interested in a regression model to test the effect of R&D spending on a firm's sales. As is usual for income data, sales are positively skewed, so I want to log-transform them before running the regression.

I read a post (http://www.stata.com/statalist/archi.../msg00553.html) which suggests not transforming the data to solve the skewness problem; instead, glm may be a better choice. I then checked the Stata manual entry for glm, but I am not sure which family and link function fit my data best. Because my data are annual sales, they are not count data, so I wonder whether it is still proper to use poisson or nbreg. Also, the outcome is not a dummy, a ratio, or a rate, so logit and probit won't be suitable. Finally, I think gamma or inverse Gaussian might be suitable, but I am still not sure whether I am choosing the right regression.

Is there any guideline I can follow to choose a regression command based on the distribution of my data when it is skewed? Also, since the model is no longer simple OLS, how can I use Stata to graph the results of a GLM regression? I have read some of the examples provided in Stata, but most of them use count, ratio, or rate data, and few deal with continuous data like annual sales. It would therefore be very helpful if someone could provide some examples to improve my understanding of dealing with skewed data.

Thank you for your attention and patience to this matter.

Best,
David

Last edited by David Lu; 06 May 2016, 06:36.
Tags: continuous data, data, regression, skew, Suggestion
Guillaume Geri

Join Date: Sep 2014

Posts: 55
#2

06 May 2016, 07:15

Hi David
There are some tests that could help you decide:
- for the distribution: the modified Park test
- for the link: Pregibon's link test

There is a glmdiagnostic package built by a U of Penn team (http://www.uphs.upenn.edu/dgimhsr/eeinct_multiv.htm) which could help you implement those tests.
Hope this helps
Nick Cox

Join Date: Mar 2014

Posts: 35762
#3

06 May 2016, 07:21

See http://blog.stata.com/2011/08/22/use...tell-a-friend/ for a good start in this territory.
David Lu

Join Date: May 2016

Posts: 105
#4

06 May 2016, 08:41

Originally posted by Nick Cox

See http://blog.stata.com/2011/08/22/use...tell-a-friend/ for a good start in this territory.

Hi Nick,

Thx for the excellent post. It helps me a lot. So, based on the post along with its reference (I quote them below), does it mean that Poisson may be a safe and convenient choice for estimating a model with a non-count variable like income or sales when we are not sure which family to choose (say, Poisson, negative binomial, gamma, inverse Gaussian, etc.)?

"...At the recent Stata Conference in Chicago, I asked a group of knowledgeable researchers a loaded question, to which the right answer was Poisson regression with option vce(robust), but they mostly got it wrong. I said to them, “I have a process for which it is perfectly reasonable to assume that the mean of yj is given by exp(b0 + Xjb), but I have no reason to believe that E(yj) = Var(yj), which is to say, no reason to suspect that the process is Poisson. How would you suggest I estimate the model?” Certainly not using Poisson, they replied. Social scientists suggested I use log regression. Biostatisticians and health researchers suggested I use negative binomial regression even when I objected that the process was not the gamma mixture of Poissons that negative binomial regression assumes. “What else can you do?” they said and shrugged their collective shoulders. And of course, they just assumed over dispersion..."

"...Note: If you decide on a log link, you may want to call your model "GLM with a log link," rather than a "Poisson" QMLE; some older reviewers believe Poisson regression is only for counts...."

Thx in advance,
David
David Lu

Join Date: May 2016

Posts: 105
#5

06 May 2016, 08:59

Originally posted by Guillaume Geri

Hi David
There are some tests that could help you decide:
- for the distribution: the modified Park test
- for the link: Pregibon's link test

There is a glmdiagnostic package built by a U of Penn team (http://www.uphs.upenn.edu/dgimhsr/eeinct_multiv.htm) which could help you implement those tests.
Hope this helps

Hi Guillaume,

Thx for the hints on glmdiagnostic. However, the package seems a bit complicated for me to follow. Most importantly, the context of its example is far from mine: my data is continuous, not a count variable or a ratio, which is far from the context of QALYs. More specifically, I don't understand what the passage below means:

"glmdiagnostic.do: Contains the program glmdiag. "Doing" glmdiagnostic does not run any diagnostics. Instead, it loads glmdiag so that it can be called by STATA. glmdiag performs the modified Park test (for the GLM family) and the Pearson correlation test, the Pregibon link test, and the modified Hosmer and Lemeshow test (for the GLM link)"

What does it mean by "it loads glmdiag so that it can be called by STATA"?

I am very new to this field. So, is there a more elementary example for a beginner to follow?
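
As a minimal sketch of what "loading" a program means: a do-file that contains a program define ... end block does not execute that program when the do-file is run; it only stores the definition in memory, after which the program name can be typed like any other command. The program name hello below is made up for illustration and is not part of the glmdiagnostic package.

```stata
* Hypothetical example: define a small program, as a do-file might
capture program drop hello          // avoid an "already defined" error on rerun
program define hello
    display "hello is now loaded and can be called by name"
end

* Nothing has been displayed so far; the program is only stored in memory.
* Typing its name is what actually runs it:
hello
```

By the same logic, the package description quoted above implies that doing glmdiagnostic.do makes a command named glmdiag available, which is then typed after fitting a glm model.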

Thanks in advance,
David

Last edited by David Lu; 06 May 2016, 09:05.
Nick Cox

Join Date: Mar 2014

Posts: 35762
#6

06 May 2016, 08:59

Nothing is safe, but there's plenty of evidence that Poisson works well across a range of situations in which (mean) outcomes are positive.
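
A minimal sketch of how this might look in practice, assuming hypothetical variable names sales (annual sales, nonnegative) and rd (R&D spending); neither name comes from the original data. It fits a Poisson model with robust standard errors, shows the same model in glm form, and plots the fitted mean against the raw data.

```stata
* Hypothetical variable names: sales (outcome, >= 0) and rd (regressor)

* Poisson regression with robust standard errors; the outcome need not be a count
poisson sales rd, vce(robust)

* The same mean model written as a GLM: log link, Poisson family
glm sales rd, family(poisson) link(log) vce(robust)

* Fitted mean of sales on the original scale
predict double sales_hat, mu

* Observed data with the fitted curve overlaid
twoway (scatter sales rd) (line sales_hat rd, sort), ///
    ytitle("Annual sales") xtitle("R&D spending")
```

With more than one regressor, the fitted values no longer trace a single curve against rd, so a plot built from margins and marginsplot would be a more suitable choice than the overlay above.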
Guillaume Geri

Join Date: Sep 2014

Posts: 55
#7

06 May 2016, 09:07

Originally posted by David Lu

Hi Guillaume,

Thx for the hints on glmdiagnostic. However, the package seems a bit complicated for me to follow. Most importantly, the context of its example is far from mine: my data is continuous, not a count variable, not a ratio, and not QALYs. Is there a simpler example for a beginner to follow?

Thanks in advance,
David

Hi David
I am not a very experienced Stata user, but I found this command very easy to use after running a glm model. I used it after fitting a glm model with Poisson family and log link, chosen because of the right-skewed distribution of the cost variable I had.

The easiest way to use it is
- 1) to run the glmdiagnostic do-file once before running your glm model
- 2) then just type glmdiagnostic, and it will give you the results of the tests, which could help you in your choices.

By the way, the results of the tests are of course not the only answer to your difficult question but could help to justify your approach.
Let me know if I can help you in any way
David Lu

Join Date: May 2016

Posts: 105
#8

06 May 2016, 09:17

Originally posted by Guillaume Geri

Hi David
I am not a very experienced Stata user, but I found this command very easy to use after running a glm model. I used it after fitting a glm model with Poisson family and log link, chosen because of the right-skewed distribution of the cost variable I had.

The easiest way to use it is
- 1) to run the glmdiagnostic do-file once before running your glm model
- 2) then just type glmdiagnostic, and it will give you the results of the tests, which could help you in your choices.

By the way, the results of the tests are of course not the only answer to your difficult question but could help to justify your approach.
Let me know if I can help you in any way

Hi Guillaume,

I ran the glmdiagnostic do-file once before running my glm model and then typed glmdiagnostic, but Stata reported an error:

"
. glmdiagnostic
unrecognized command: glmdiagnostic
r(199);
"

Is there something wrong or missing? Do they also provide an ado-file?

Best,
David
David Lu

Join Date: May 2016

Posts: 105
#9

06 May 2016, 09:20

Originally posted by Nick Cox

Nothing is safe, but there's plenty of evidence that Poisson works well across a range of situations in which (mean) outcomes are positive.

Hi Nick,

Thank you for the introduction. It encourages me to explore Poisson further and also helps me understand why an increasing number of scholars are using Poisson instead of a log transformation.

All the best with your research,
David
Guillaume Geri

Join Date: Sep 2014

Posts: 55
#10

06 May 2016, 09:25

Originally posted by David Lu

Hi Guillaume,

I ran the glmdiagnostic do-file once before running my glm model and then typed glmdiagnostic, but Stata reported an error:

"
. glmdiagnostic
unrecognized command: glmdiagnostic
r(199);
"

Is there something wrong or missing? Do they also provide an ado-file?

Best,
David

Hi David

Please find enclosed the file I've stored in my ~/Applications/Stata/ado/personal (I work on MacOSX), which I renamed glmdiag.ado.

After running your glm model, type glmdiag and it should work.

Attached Files

glmdiag.ado (7.4 KB, 1 view)
David Lu

Join Date: May 2016

Posts: 105
#11

06 May 2016, 09:38

Originally posted by Guillaume Geri

Hi David

Please find enclosed the file I've stored in my ~/Applications/Stata/ado/personal (I work on MacOSX), which I renamed glmdiag.ado.

After running your glm model, type glmdiag and it should work.

Hi Guillaume,

Thx very much, it works now. For your reference, I have pasted the result below. Could you tell me how to interpret it? Any helpful link on the interpretation would be great.

Thx,
David

glmdiag

FITTED MODEL: Link = Log ; Family = Poisson

Results, Modified Park Test (for Family)

Coefficient: 1.07331

Family, Chi2, and p-value in descending order of likelihood

Family                         Chi2      P-value

Poisson:                      0.5749     0.4483
Gamma:                       91.8688     0.0000
Gaussian NLLS:              123.2393     0.0000
Inverse Gaussian or Wald:   397.1210     0.0000

Results of tests of GLM Log link

Pearson Correlation Test: 0.0000
Pregibon Link Test: 0.0025
Modified Hosmer and Lemeshow: 0.0059
Guillaume Geri

Join Date: Sep 2014

Posts: 55
#12

06 May 2016, 09:59

Hi David
To interpret these tests correctly, I can only suggest that you carefully read the very well-done tutorials on GLM diagnostics on the website we discussed previously. It seems that the Poisson distribution is a good choice compared with the others, as is the log link. However, interpreting such tests also requires a more global view of your data.
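
For orientation, here is a rough sketch of the logic usually described for the modified Park test, written by hand. It is not the glmdiag source code, and the variable names sales and rd are hypothetical. The idea is to regress the squared raw residuals on the log of the fitted mean; a slope near 0 points to a Gaussian family, near 1 to Poisson, near 2 to gamma, and near 3 to inverse Gaussian.

```stata
* Hypothetical variable names: sales (outcome) and rd (regressor)

* Step 1: fit a preliminary GLM with a log link and get the fitted mean
glm sales rd, family(gamma) link(log)
predict double mu_hat, mu

* Step 2: squared raw residuals and log of the fitted mean
generate double resid2 = (sales - mu_hat)^2
generate double ln_mu  = ln(mu_hat)

* Step 3: model the variance as a power of the mean
glm resid2 ln_mu, family(gamma) link(log) vce(robust)

* Step 4: test the slope against the value implied by each family
test ln_mu = 0      // Gaussian: constant variance
test ln_mu = 1      // Poisson: variance proportional to the mean
test ln_mu = 2      // gamma: variance proportional to the mean squared
test ln_mu = 3      // inverse Gaussian: variance proportional to the mean cubed
```

In the output pasted above, the coefficient of 1.07331 and the Poisson row (chi2 0.5749, p = 0.4483) fit this reading: a slope of 1 is not rejected, while the values for the other families are. The three entries under the link tests are p-values, and values close to zero are usually taken as a sign that the fitted specification deserves a closer look.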

Happy to help
Nikos Korompos

Join Date: Jan 2017

Posts: 66
#13

20 Jan 2017, 08:15

Hi,

I tried to use this ado-file, but I am not very familiar with the concept. I used -mkdir- to create a personal folder for ado-files. Could you please let me know what the process is after this? I copied the ado-file to the new folder (manually), then ran the GLM and the -glmdiag- command, but Stata reported this:
glmdiag
==0 invalid name
r(198);

What do you suggest?

Many thanks.

Nikos
Guillaume Geri

Join Date: Sep 2014

Posts: 55
#14

22 Jan 2017, 13:04

I used -mkdir- to create a personal folder for the ado files. Could you please let me know what is the process after this?

Hi Nikos
In my opinion, the easiest way is
- 1) run the ado-file glmdiag
- 2) run your glm model
- and 3) run the glmdiag command.
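
One possible way to carry out these steps is sketched below. It assumes that glmdiag.ado has been saved in the current working directory and uses hypothetical variable names sales and rd. Instead of running the ado-file by hand, it copies the file into the PERSONAL directory so that Stata finds it automatically.

```stata
* List the directories Stata searches for ado-files, including PERSONAL
sysdir

* Create PERSONAL if it does not exist yet (fails silently if it already
* exists or if its parent folder is missing)
capture mkdir "`c(sysdir_personal)'"

* Copy the downloaded file into PERSONAL; assumes glmdiag.ado is in the
* current working directory (check with -pwd-)
copy "glmdiag.ado" "`c(sysdir_personal)'glmdiag.ado", replace

* Drop cached program definitions, then confirm that Stata finds the file
discard
which glmdiag

* Fit the model first, then call the diagnostic command
glm sales rd, family(poisson) link(log)
glmdiag
```

If which glmdiag reports that the command is not found, the file is not in a directory on the search path shown by sysdir.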
Let me know if it's helpful
Nikos Korompos

Join Date: Jan 2017

Posts: 66
#15

13 Jul 2017, 09:39

It is not working. Could you please tell me the steps for installing this ado file?

Many thanks.

Best regards,

Nikos
