  • Confusions when dealing with skewed data

    Hi all,

    I have a cross-sectional dataset containing firms' annual sales. I'm interested in a regression model to test the effect of R&D spending on a firm's sales. As is usual for income data, sales are positively skewed, so I want to log-transform them before running the regression.

    I read the post (http://www.stata.com/statalist/archi.../msg00553.html), which suggests not transforming the data to solve the skewness problem; instead, glm may be a better choice. I then checked the Stata manual entry for glm, but I am not sure which family and link function fit my data best. Because my data are annual sales, they are not count data, so I wonder whether it is still proper to use poisson or nbreg. They are also not a dummy, a ratio, or a rate, so logit and probit won't be suitable. Finally, I think gamma or inverse Gaussian might be suitable, but I am still not sure that I am choosing the right regression.

    Is there any guideline I can follow to choose a regression command based on the distribution of my data when it is skewed? Also, since the model is no longer a simple OLS, how can I use Stata to graph the results of a GLM regression? I have read some of the examples provided in Stata, but most of them use count, ratio, or rate data and few cover continuous data like annual sales. Some examples would therefore be very helpful for improving my understanding of how to deal with skewed data.

    Thank you for your attention and patience to this matter.

    Best,
    David
    Last edited by David Lu; 06 May 2016, 06:36.

  • #2
    Hi David
    There are some tests that could help you decide:
    - for the distribution: the modified Park test
    - for the link: Pregibon's link test

    There is a glmdiagnostic package, built by a U of Penn team (http://www.uphs.upenn.edu/dgimhsr/eeinct_multiv.htm), that could help you implement those tests.
    Hope this helps



    • #3
      See http://blog.stata.com/2011/08/22/use...tell-a-friend/ for a good start in this territory.



      • #4
        In reply to Nick Cox (#3):
        Hi Nick,

        Thanks for the excellent post; it helps me a lot. Based on the post and its reference (I quote them below), does it mean that Poisson may be a safe and convenient choice when we are not sure which family to choose (say, Poisson, negative binomial, gamma, inverse Gaussian, etc.) to estimate a model with a non-count outcome like income or sales?

        "...At the recent Stata Conference in Chicago, I asked a group of knowledgeable researchers a loaded question, to which the right answer was Poisson regression with option vce(robust), but they mostly got it wrong. I said to them, “I have a process for which it is perfectly reasonable to assume that the mean of yj is given by exp(b0 + Xjb), but I have no reason to believe that E(yj) = Var(yj), which is to say, no reason to suspect that the process is Poisson. How would you suggest I estimate the model?” Certainly not using Poisson, they replied. Social scientists suggested I use log regression. Biostatisticians and health researchers suggested I use negative binomial regression even when I objected that the process was not the gamma mixture of Poissons that negative binomial regression assumes. “What else can you do?” they said and shrugged their collective shoulders. And of course, they just assumed over dispersion..."

        "...Note: If you decide on a log link, you may want to call your model 'GLM with a log link,' rather than a 'Poisson' QMLE - some older reviewers believe Poisson regression is only for counts...."


        Thx in advance,
        David



        • #5
          In reply to Guillaume Geri (#2):
          Hi Guillaume,

          Thanks for the hint about glmdiagnostic. However, the package seems a bit complicated for me to follow. Most importantly, the context of its example is far from mine: my data are continuous, not a count variable and not a ratio, which is far from the context of QALYs. More specifically, I don't understand the passage below:

          "glmdiagnostic.do: Contains the program glmdiag. "Doing" glmdiagnostic does not run any diagnostics. Instead, it loads glmdiag so that it can be called by STATA. glmdiag performs the modified Park test (for the GLM family) and the Pearson correlation test, the Pregibon link test, and the modified Hosmer and Lemeshow test (for the GLM link)"

          What does "it loads glmdiag so that it can be called by STATA" mean?

          I am very new to this field. Is there a more elementary example for a beginner to follow?

          Thanks in advance,
          David
          Last edited by David Lu; 06 May 2016, 09:05.



          • #6
            Nothing is safe, but there's plenty of evidence that Poisson works well across a range of situations in which (mean) outcomes are positive.
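
As a rough illustration of the Poisson-with-robust-errors approach discussed above, the following is a minimal sketch, not a recommended specification. The variable names sales and rd and the grid of rd values are assumptions made up for the example; nothing here comes from the original poster's data. It is meant to be run from a do-file.

```stata
* hypothetical variable names: sales (annual sales), rd (R&D spending)
* Poisson quasi-likelihood fit with robust standard errors;
* Stata prints a note when the outcome is not a count but still fits the model
poisson sales rd, vce(robust)

* the same mean model written as a GLM with a log link
glm sales rd, family(poisson) link(log) vce(robust)

* one way to graph the result: predicted mean sales over a grid of rd values
* (the grid 0(10)100 is invented; use values that span your own data)
margins, at(rd = (0(10)100))
marginsplot
```

Both fitting commands should give the same coefficient estimates; margins followed by marginsplot plots the predictions on the original sales scale, which may be one answer to the graphing question in the first post.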



            • #7
              In reply to David Lu (#5):
              Hi David
              I am not a very experienced Stata user, but I found this command very easy to use after running a glm model. Moreover, I used it after fitting a glm model with a Poisson distribution and log link, because the cost variable I had was right-skewed.

              The easiest way to use it is:
              - 1) run the glmdiagnostic do-file once before running your glm model
              - 2) then just type glmdiagnostic and it will give you the results of the tests, which could help you in your choices.

              By the way, the results of the tests are of course not the only answer to your difficult question, but they could help to justify your approach.
              Let me know if I can help you in any way
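
The workflow described in this post can be sketched as follows. This is only a sketch under assumptions: it supposes that glmdiagnostic.do sits in the current working directory, and the variable names sales and rd are hypothetical. According to the package description quoted in #5, the do-file defines a program named glmdiag, so that is the name typed at the end rather than glmdiagnostic.

```stata
* assumes glmdiagnostic.do is in the current working directory
* running it defines the program glmdiag in memory; it runs no tests by itself
do glmdiagnostic.do

* fit the model first (hypothetical variable names)
glm sales rd, family(poisson) link(log)

* then call the diagnostics on the model just fitted
glmdiag
```

A program defined this way lives only in the current Stata session, so the do-file would need to be run again after restarting Stata.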



              • #8
                In reply to Guillaume Geri (#7):
                Hi Guillaume,

                I ran the glmdiagnostic do-file once before running my glm model and then typed glmdiagnostic, but Stata reported an error:

                "
                . glmdiagnostic
                unrecognized command: glmdiagnostic
                r(199);
                "

                Is there something wrong or missing, and do they also provide an ado-file?

                Best,
                David



                • #9
                  In reply to Nick Cox (#6):
                  Hi Nick,

                  Thank you for the introduction. That encourages me to explore Poisson further and also helps me understand why an increasing number of scholars are beginning to use Poisson instead of a log transformation.

                  All the best with your research,
                  David



                  • #10
                    In reply to David Lu (#8):
                    Hi David

                    Please find attached the file I've stored in my ~/Applications/Stata/ado/personal (I work on Mac OS X), which I renamed glmdiag.ado.

                    After running your glm model, type glmdiag and it should work.


                    • #11
                      In reply to Guillaume Geri (#10):
                      Hi Guillaume,

                      Thanks very much, it works now. For your reference, I have pasted the result below. Could you tell me how to interpret it? Any helpful link on the interpretation would be great.

                      Thx,
                      David

                      glmdiag


                      FITTED MODEL: Link = Log ; Family = Poisson

                      Results, Modified Park Test (for Family)

                      Coefficient: 1.07331

                      Family, Chi2, and p-value in descending order of likelihood

                      Family                         Chi2    P-value

                      Poisson:                     0.5749    0.4483
                      Gamma:                      91.8688    0.0000
                      Gaussian NLLS:             123.2393    0.0000
                      Inverse Gaussian or Wald:  397.1210    0.0000

                      Results of tests of GLM Log link

                      Pearson Correlation Test: 0.0000
                      Pregibon Link Test: 0.0025
                      Modified Hosmer and Lemeshow: 0.0059



                      • #12
                        Hi David
                        To interpret these tests correctly, I can only suggest that you carefully read the very well-done tutorials on GLM diagnostics on the website we discussed previously. It seems that the Poisson distribution, as well as the log link, is a good choice compared with the others. But interpreting such tests also requires a more global view of your data.

                        Happy to help
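
For readers wondering where the "Coefficient" and the per-family chi-squared values in the pasted output come from, here is a minimal sketch of the modified Park test as it is usually described. It is not the code inside glmdiag, and the variable names sales, rd, xbhat, and res2 are hypothetical. The idea is to regress the squared raw residuals on the log of the fitted mean and compare the slope with the value each family implies: 0 for Gaussian, 1 for Poisson, 2 for gamma, 3 for inverse Gaussian.

```stata
* hypothetical variable names: sales (outcome), rd (R&D spending)
glm sales rd, family(poisson) link(log) vce(robust)

* linear predictor, which equals ln(fitted mean) under a log link
predict double xbhat, xb

* squared raw residual on the original scale
generate double res2 = (sales - exp(xbhat))^2

* regress the squared residuals on the log of the fitted mean
glm res2 xbhat, family(gamma) link(log) vce(robust)

* test the slope against the value implied by each family
test xbhat = 0      // Gaussian
test xbhat = 1      // Poisson
test xbhat = 2      // gamma
test xbhat = 3      // inverse Gaussian
```

Read this way, a coefficient of 1.07331 is close to 1, and Poisson is the only family in the output above whose test is not rejected (p = 0.4483). The three link tests are usually read in the opposite direction, with a small p-value signalling a possible problem with the fitted link, so the values reported above may deserve a closer look against the tutorials mentioned in this post.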



                        • #13
                          Hi,

                          I tried to use this ado-file, but I am not very familiar with the concept. I used -mkdir- to create a personal folder for ado-files. Could you please let me know what the process is after this? I copied the ado-file to the new folder manually, then ran the GLM and the -glmdiag- command, but Stata reported this:
                          glmdiag
                          ==0 invalid name
                          r(198);

                          What do you suggest?

                          Many thanks.

                          Nikos



                          • #14
                            Hi Nikos
                            In my opinion, the easiest way is to:
                            - 1) run the ado-file glmdiag
                            - 2) run your glm model
                            - 3) run the glmdiag command.
                            Let me know if it's helpful
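
The steps above, together with the alternative of placing the file where Stata looks for it, can be sketched as follows. This assumes the attached file has been saved as glmdiag.ado; the variable names sales and rd are hypothetical. The sketch is meant to be run from a do-file and does not diagnose the r(198) error reported in #13.

```stata
* Option A: load the program for this session only
* (assumes glmdiag.ado is in the current working directory)
run glmdiag.ado

* Option B: let Stata find the file automatically
* sysdir lists Stata's system directories; copy glmdiag.ado into
* the folder shown next to PERSONAL, then continue below
sysdir
discard                 // clear programs already loaded in memory
which glmdiag           // reports the file's location if Stata can find it

* with either option, fit the model and then call the diagnostics
glm sales rd, family(poisson) link(log)
glmdiag
```

If which glmdiag reports that the command is not found, the file is probably not in one of the directories listed by sysdir or adopath.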



                            • #15
                              It is not working. Could you please tell me the steps for installing this ado file?

                              Many thanks.

                              Best regards,

                              Nikos
