Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Randomized quantile residuals (Dunn-Smyth residuals) for GLM diagnostics

    I am a student undertaking my Master's in Biostatistics and am working through GLM diagnostics and want to compute randomised quantile residuals (Dunn-Smyth residuals) after fitting GLMs in Stata.

    My understanding is that randomised quantile residuals are considered the gold standard diagnostic tool for GLMs - they are approximately standard normal if the model is correct, regardless of the outcome distribution, which makes them much more useful than deviance or Pearson residuals for non-normal outcomes.

    In R, these are straightforwardly available via qresid() from the statmod package. However, I prefer to work in Stata and am not aware of an official or SSC-validated command that computes these residuals after glm.

    I found a GitHub package (https://github.com/psotob91/qresid) by Percy Soto-Becerra, which appears to implement randomised quantile residuals for normal linear regression, Gaussian GLMs, binomial/Bernoulli models (via both glm and logit/logistic commands), Poisson models (via both glm and poisson commands), and negative binomial models (via both glm and nbreg commands). Since I'm not expert enough to know if code is correct, and the package is pre-release and has not been submitted to SSC, can anyone point me to a validated Stata command for computing randomised quantile residuals after glm?

  • #2
    I suspect you have searched tireless for such programs, and have found but 1 (coded 4 years ago).

    It looks like the qresid includes a Monte Carlo simulation.

    You could use the MC or some other data to estimate in R and Stata and compare the results.

    The ado file does not look that complicated, so you could probably check it against the theory.

    Comment


    • #3
      Thanks for the suggestion, George. Given my limited knowledge and time constraints, I'm wondering if I'd be better off not being so precious about my preference for Stata and learn to code in R. Thanks again.

      Comment


      • #4
        I've given that a try; did not work out.

        I've had some success having Claude rewrite R code to run in Stata. It takes a little tweaking sometimes, but works remarkably well.

        Or, you could drop the ado file into the Claude/ChatGPT and get an explanation and check for comparability across the Stata and R versions.

        Comment


        • #5
          Thanks again, George. I'll give your suggestions a try.

          Comment


          • #6
            There were a few bugs in the Stata version. Here's a fix. No guarantee but matches R's qresid. Models must be estimated by GLM (Gaussian, Bernoulli/Binomial, Poisson, or Negative Binomial supported).

            Here's some MC to see it work.

            Code:
             qresid.ado

            Comment


            • #7
              Code:
              *-------------------------------------------------------------------------------
              **# Bernoulli example simulation:
              clear all
              set seed 123
              set obs 100
              gen y = rbinomial(1, 0.3)
              gen x = _n
              
              **## command: glm
              glm y x, family(binomial) link(logit)
              predict dev, dev
              
              qresid res1
              qresid res2, nqres(1)
              qresid res3, standardized nqres(1)
              
              qnorm dev, name(qq1, replace) nodraw
              qnorm res11, name(qq2, replace) nodraw
              qnorm res21, name(qq3, replace) nodraw
              qnorm res3_std1, name(qq4, replace) nodraw
              graph combine qq1 qq2 qq3 qq4
              
              *-------------------------------------------------------------------------------
              **# Binomial example simulation:
              clear all
              set seed 123
              set obs 30
              *gen n = runiformint(100, 200)
              gen n = 20
              gen y = rbinomial(n, 0.05)
              gen x = _n
              
              **## command: glm
              glm y x, family(binomial n) link(logit)
              predict dev, dev
              
              qresid res1
              qresid res2, nqres(1)
              qresid res3, standardized nqres(1)
              
              qnorm dev, name(qq1, replace) nodraw
              qnorm res11, name(qq2, replace) nodraw
              qnorm res21, name(qq3, replace) nodraw
              qnorm res3_std1, name(qq4, replace) nodraw
              graph combine qq1 qq2 qq3 qq4
              
              *-------------------------------------------------------------------------------
              **# Poisson example simulation:
              **## Simulation 
              clear all 
              set seed 123
              set obs 50
              gen y = rpoisson(1)
              gen x = _n
              
              **## command: glm
              glm y x, family(poisson)
              predict dev, dev
              
              qresid res1
              qresid res2, nqres(1)
              qresid res3, standardized nqres(1)
              
              qnorm dev, name(qq1, replace) nodraw
              qnorm res11, name(qq2, replace) nodraw
              qnorm res21, name(qq3, replace) nodraw
              qnorm res3_std1, name(qq4, replace) nodraw
              graph combine qq1 qq2 qq3 qq4
              
              *------------------------------------------------------------------------------
              **# Negative Binomial example simulation:
              **## Simulation 
              clear all 
              set seed 123
              set obs 50
              gen y = rnbinomial(2,0.5)
              gen x = _n
              
              **## command: glm
              glm y x, family(nbinomial ml)
              predict dev, dev
              
              qresid res1
              qresid res2, nqres(1)
              qresid res3, standardized nqres(1)
              
              qnorm dev, name(qq1, replace) nodraw
              qnorm res11, name(qq2, replace) nodraw
              qnorm res21, name(qq3, replace) nodraw
              qnorm res3_std1, name(qq4, replace) nodraw
              graph combine qq1 qq2 qq3 qq4

              Comment


              • #8
                Thank you very much for your interest in the package, and especially to Lili for raising the question that motivated me to revisit and resume the project after this discussion.

                Due to personal circumstances — and also because during the last few years I have been working mainly with R — development of the package remained unfinished for some time. However, this thread motivated me to prioritize the project again.

                I am currently updating and restructuring the codebase to prepare a first official public release before the end of this month. The initial version will focus on the original independent-response regression models described in the Dunn & Smyth framework, including continuous and count GLMs.

                After that, I am planning a second development stage extending support to more complex count-data settings such as zero-inflated, hurdle, and related mixture models. A later major version is planned for multilevel and correlated-data models.

                The repository will remain fully open-source, and contributions, suggestions, validation efforts, and collaborations are very welcome.

                Thanks again for the interest and discussion.

                Comment


                • #9
                  Thanks again for the interest and for the earlier discussion around randomized quantile residuals.

                  I have now released the first public version of qresid, and the package has been submitted to SSC for review.

                  GitHub repository:
                  https://github.com/psotob91/qresid

                  The current release focuses on supported regression models for non-correlated data. You will find interesting implementations in count data models. Additional extensions for correlated-data modls are planned for future versions.

                  Feedback and suggestions are very welcome.

                  Comment


                  • #10
                    I was interested to read about Dunn-Smyth residuals, as I hadn't heard of them before this thread. Perhaps this is a difference across fields. I do wonder how they're applied. For example, if Y is a binary response I know it's distribution is Bernoulli. Therefore, any diagnostic is necessarily a functional form test for the response probability, P(Y = 1|X). I think using the D-S residuals must be inferior to other direct methods, such as a RESET-type test that puts in higher powers of X_i*b^. Or, just include squares and cross products directly.

                    In other cases, such as when Y is a count variable and I use, say, Poisson regression, using D-S gives an omnibus test that can reject for many reasons. But different misspecifications have different levels of importance. For estimating mean effects, Poisson regression is very well suited because the estimator is completely robust for the mean parameters regardless of the actual distribution. If I conclude, using D-S, that there's some sort of model misspecification, what should I do? It could be that the standardized fourth moment in my data doesn't match up with the Poisson distribution, something I wouldn't care much about. If the conditional mean is misspecified then using D-S to move to, say, a negative binomial distribution is counterproductive because it could be that the Poisson distribution is correct but the mean is incorrect.

                    When applied to roughly continuous outcomes after OLS estimation, again, the D-S residuals are sensitive to a misspecified mean, heteroskedasticity, and nonnormality. But, again, these "misspecifications" have different levels of importance. OLS works very well if the mean is correctly specified, and fully robust inference is available provided N is not too small. Why should we reject the OLS estimation if, say, the distribution is not normal? One reason would be that if we need a full distribution to make statements about probabilities, and I wonder if that's the main motivation for using D-S residuals.

                    Comment


                    • #11
                      Closing the loop on this thread and wanting to very much thank both George and Percy.

                      George - thank you for the practical suggestions and for taking the time to identify and fix the bugs in the original code. That was a generous contribution.

                      Percy - I'm delighted you found the time to revisit the qresid package and submit it to SSC.

                      For everyone's benefit, it is now available to install directly in Stata with: ssc install qresid

                      The current release supports randomised quantile residuals for continuous and count GLMs including Gaussian, Bernoulli/Binomial, Poisson, and Negative Binomial models.

                      Thanks again to everyone who contributed to this thread - the Statalist community in action is a beautiful thing!

                      Lili

                      Comment

                      Working...
                      X