
  • Which family and link for GLM?

    Hello!

    I am trying to find the right model for the distribution of my data. I have cross-sectional data and want to explain life satisfaction (0 = not satisfied at all to 10 = completely satisfied) by social relationships (marital status, social network size). My dependent variable is negatively skewed (-1.16). After running my OLS regression (including control variables), I find that the residuals are also not normally distributed, and the residuals-versus-fitted plot indicates heteroskedasticity. Both the Breusch-Pagan / Cook-Weisberg test and White's test for heteroskedasticity are highly significant (Prob > chi2 = 0.0000). I also tried transforming my dependent variable, but neither a log transformation nor a Box-Cox transformation brings an improvement.

    Rather than forcing the data to fit the model, I am now trying to find the right family and link for a generalized linear model. I think family(gamma) with link(log) could fit my data, but so far I have no clear idea how to work that out. Can you give me a hint?
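
    For reference, these are roughly the commands I used (zufri stands for my life-satisfaction variable; the other names are simplified placeholders for my covariates and controls):

    Code:
    regress zufri i.marital netsize age female
    estat hettest            // Breusch-Pagan / Cook-Weisberg test
    estat imtest, white      // White's test
    predict res, residuals
    rvfplot, yline(0)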

    Thank you very much!

    [Attached images: tab1.png, tab2.png, Graph1.png, Graph3.png]


  • #2
    Maja Schmit I would start by plotting the sample distribution with 'hist zufri'. If you share your data using dataex, you might get a more helpful response. Here is a walk-through I find useful for answering your question: https://statisticsbyjim.com/hypothes...ribution-data/
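
    For a 0 to 10 outcome, something along these lines shows the whole distribution (zufri as above):

    Code:
    * treat each integer score as its own bin
    histogram zufri, discrete percent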



    • #3
      Thank you for the helpful website! The data I am using are confidential, so unfortunately I can't share them via dataex.

      [Attached image: Graph5.png]



      • #4
        It looks like there is not much variation across the scale -- most people are satisfied to some extent. I would consider dichotomizing around the median, but someone else might have a better idea.
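
        A minimal sketch of that idea, assuming the response variable is zufri:

        Code:
        summarize zufri, detail
        * 1 if strictly above the sample median, 0 otherwise
        generate byte above_med = zufri > r(p50) if !missing(zufri)
        logit above_med x1 ... xk, vce(robust)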



        • #5
          I would use an ordered logit model. If you want to stay in the GLM family, use a binomial with an upper bound of 10 and a logit link.

          Code:
          glm y x1 ... xk, fam(bin 10) link(logit) vce(robust)
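
          The ordered logit alternative, in the same placeholder notation, would be:

          Code:
          ologit y x1 ... xk, vce(robust)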



          • #6
            As a footnote to #5 from Jeff Wooldridge, I note that a logarithmic transformation seems unlikely to help, as you have zero values -- and negative skewness. Skewness of the response is not the primary issue, however: it's getting a functional form that matches the behaviour of the data.



            • #7
              Many thanks for your useful suggestions and further clarifications! As suggested, I used glm with fam(bin 10) and link(logit). As can be seen from the graph of the Pearson residuals against the linear predictor, the model fit has improved, even though it is still not very good (as far as I can tell). Nevertheless, you have already helped me a lot. Thanks!

              Code:
              predict eta, eta
              predict pearson, pearson
              label var pearson "Pearson residuals"
              label var eta "Linear predictor"
              twoway scatter pearson eta, yline(0)

              [Attached image: Graphbin.png (Pearson residuals vs. linear predictor)]



              • #8
                It may not be possible to get a very good fit. What question are you asking? Is it just to predict, or do you have a causal question in mind?

                Many influential studies have been published with low R-squareds. Under randomized assignment you could get unbiased estimation of causal effects with a very small R-squared. Have you tried putting in squares and interactions of your covariates? We’re a bit in the dark.
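
                For example, squares and interactions can be added with factor-variable notation (x1 and x2 continuous, x3 categorical, all placeholder names as in #5):

                Code:
                glm y c.x1##c.x1 c.x1##c.x2 i.x3##c.x2, fam(bin 10) link(logit) vce(robust)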



                • #9
                  Again a sidenote: elsewhere I have seen people puzzled by the pattern of stripes in residual plots like these.

                  In a plain (vanilla) regression, the residuals = observed MINUS predicted lie on a distinct line for each distinct observed value. Here the observed response is one of 0, 1, ..., 9, 10, so the residuals in #1 for observed response 7 (say) lie on the line residual = 7 MINUS predicted, which has slope -1.

                  In #7 the pattern is curved, given the link function and the flavour of residuals, but once you know what to look for you can identify 11 stripes (the histogram also confirms that all possible response values are indeed observed in the sample).
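
                  One way to see this is to split the residuals from #7 by the observed response (here y is a placeholder for the 0 to 10 outcome):

                  Code:
                  * one scatter layer (and colour) per observed value of y
                  separate pearson, by(y) veryshortlabel
                  scatter pearson0-pearson10 eta, yline(0)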
