
  • Which family and link for GLM?

    Hello!

    I am trying to find the right model for the distribution of my data. I have cross-sectional data and want to explain life satisfaction (0 = not satisfied at all to 10 = completely satisfied) by social relationships (marital status, social network size). My dependent variable is negatively skewed (-1.16). After running my OLS regression (including control variables), I find that the residuals are also not normally distributed, and the residuals-versus-fitted plot indicates heteroskedasticity. Both the Breusch-Pagan / Cook-Weisberg test and White's test for heteroskedasticity are highly significant (Prob > chi2 = 0.0000). I also tried transforming my dependent variable, but neither a log transformation nor a Box-Cox transformation brings an improvement.

    Rather than forcing the data to fit the model, I am now trying to find the right family and link for a generalized linear model. I think family(gamma) with link(log) could fit my data, but so far I have no clear idea how to work that out. Can you give me a hint?
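
    For reference, these are roughly the commands I used (zufri stands for my life-satisfaction variable; the other names are simplified placeholders for my covariates and controls):

    Code:
    regress zufri i.marital netsize age female
    estat hettest            // Breusch-Pagan / Cook-Weisberg test
    estat imtest, white      // White's test
    predict res, residuals
    rvfplot, yline(0)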

    Thank you very much!

    [Attached images: tab1.png, tab2.png, Graph1.png, Graph3.png]


  • #2
    Maja Schmit I would start by plotting the sample distribution with 'hist zufri'. If you share your data using dataex, you might get a more helpful response. Here is a walk-through I find useful for answering your question: https://statisticsbyjim.com/hypothes...ribution-data/
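
    For a 0 to 10 outcome, something along these lines shows the whole distribution (zufri as above):

    Code:
    * treat each integer score as its own bin
    histogram zufri, discrete percent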



    • #3
      Thank you for the helpful website! The data I am using are confidential, so unfortunately I can't share them via dataex.

      [Attached image: Graph5.png]



      • #4
        It looks like there is not much variation across the scale -- most people are satisfied to some extent. I would consider dichotomizing around the median, but someone else might have a better idea.
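
        A minimal sketch of that idea, assuming the response variable is zufri:

        Code:
        summarize zufri, detail
        * 1 if strictly above the sample median, 0 otherwise
        generate byte above_med = zufri > r(p50) if !missing(zufri)
        logit above_med x1 ... xk, vce(robust)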



        • #5
          I would use an ordered logit model. If you want to stay in the GLM family, use a binomial with an upper bound of 10 and a logit link.

          Code:
          glm y x1 ... xk, fam(bin 10) link(logit) vce(robust)
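
          The ordered logit alternative, in the same placeholder notation, would be:

          Code:
          ologit y x1 ... xk, vce(robust)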



          • #6
            As a footnote to #5 from Jeff Wooldridge, I note that a logarithmic transformation seems unlikely to help, as you have zero values -- and negative skewness. Skewness of the response is not the primary issue, however: it's getting a functional form that matches the behaviour of the data.



            • #7
              Many thanks for your useful suggestions and further clarifications! As suggested, I used glm with fam(bin 10) and link(logit). As can be seen from the graph of the Pearson residuals against the linear predictor, the model fit has improved, even though it is still not very good (as far as I can tell). Nevertheless, you have already helped me a lot. Thanks!

              Code:
              predict eta, eta
              predict pearson, pearson
              label var pearson "Pearson residuals"
              label var eta "Linear predictor"
              twoway scatter pearson eta, yline(0)

              [Attached image: Graphbin.png (Pearson residuals vs. linear predictor)]



              • #8
                It may not be possible to get a very good fit. What question are you asking? Is it just to predict, or do you have a causal question in mind?

                Many influential studies have been published with low R-squareds. Under randomized assignment you could get unbiased estimation of causal effects with a very small R-squared. Have you tried putting in squares and interactions of your covariates? We’re a bit in the dark.
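
                For example, squares and interactions can be added with factor-variable notation (x1 and x2 continuous, x3 categorical, all placeholder names as in #5):

                Code:
                glm y c.x1##c.x1 c.x1##c.x2 i.x3##c.x2, fam(bin 10) link(logit) vce(robust)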



                • #9
                  Again a sidenote: elsewhere I have seen people puzzled by the pattern of stripes in residual plots like these.

                  In a plain (vanilla) regression, the residuals = observed MINUS predicted lie on a distinct line for each distinct observed value. Here the observed response is one of 0, 1, ..., 9, 10, so the residuals in #1 for observed response 7 (say) lie on the line residual = 7 MINUS predicted, which has slope -1.

                  In #7 the pattern is curved, given the link function and the flavour of residuals, but once you know what to look for you can identify 11 stripes (the histogram also confirms that all possible response values are indeed observed in the sample).
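
                  One way to see this is to split the residuals from #7 by the observed response (here y is a placeholder for the 0 to 10 outcome):

                  Code:
                  * one scatter layer (and colour) per observed value of y
                  separate pearson, by(y) veryshortlabel
                  scatter pearson0-pearson10 eta, yline(0)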
