
  • Variables – transformation and interpretation

    I'm looking into satisfaction with GP practices in England and I’m using NHS payments to practices as one of the independent variables in the regression model. I have two questions regarding transformation and interpretation of results.

    Question 1:
    The payments in the dataset are a) positively skewed (sktest 2.81) and b) have a range of -23,000 to over 1,000,000, including some practices that received £0.

    From what I’ve read before, it’s standard to transform variables such as income before running a regression if they are skewed, but I also read that I cannot log this variable because it contains negative and zero values. What kind of transformation should I undertake to correct for skewness: a quadratic or cubic transformation, maybe?

    Question 2:
    The dependent variable is the practice satisfaction score on a scale of 1-5. The variable is negatively skewed, as clustering of satisfaction scores at the higher end seems to be common for these kinds of surveys. To make the distribution more normal I first reflected and then logged the variable.

    The independent variable for payments is expressed in £ and pence (e.g. £25,022.60). If I don’t transform it, I would just like to check what my interpretation of the results will be: a one-unit change in the IV is associated with a percent change in the DV of 100 times the coefficient. Am I correct in saying that one unit here is £0.01?
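
On the units point, a minimal sketch in Python with made-up data (the names `pounds` and `log_reflected_score` are hypothetical). If payments are stored as pounds with pence as decimals, as in £25,022.60, then one unit is £1, not £0.01. Rescaling the variable only rescales the coefficient. With a logged dependent variable, 100 times the coefficient is approximately the percent change per unit, and here it would refer to the reflected score rather than the original one.

```python
import random
import statistics

def ols_slope(x, y):
    """Slope from a simple least-squares regression of y on x."""
    mean_x = statistics.fmean(x)
    mean_y = statistics.fmean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

random.seed(0)
# Hypothetical payments in pounds, with pence as the decimal part.
pounds = [round(random.uniform(0, 100_000), 2) for _ in range(500)]
# Hypothetical logged (reflected) satisfaction score.
log_reflected_score = [0.5 + 2e-6 * p + random.gauss(0, 0.05) for p in pounds]

slope_per_pound = ols_slope(pounds, log_reflected_score)
slope_per_penny = ols_slope([p * 100 for p in pounds], log_reflected_score)
slope_per_thousand = ols_slope([p / 1000 for p in pounds], log_reflected_score)

print("per £1:    ", slope_per_pound)
print("per £0.01: ", slope_per_penny)      # 100 times smaller
print("per £1,000:", slope_per_thousand)   # 1,000 times larger
```

Expressing payments in thousands of pounds may give a coefficient that is easier to read, without changing the fit.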

  • #2
    Perhaps you could work with these variables in another form, rather than forcing them into "normality" by a transformation that would make the results harder to interpret. For example: quantiles, standardized values, ranks, categories, etc. Also, depending on your study design, you could use "robust" options or generalized models to account for a skewed distribution.
    Best regards,

    Marcos
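
As a rough sketch of the alternatives listed in #2, here is how standardized values, ranks and quartile groups could be computed in Python. The payment figures are invented for illustration, and the quartile grouping is only one of several possible categorizations.

```python
import bisect
import statistics

# Hypothetical payments (in pounds), including a negative value and a zero.
payments = [-23000.0, 0.0, 12000.0, 25022.60, 48000.0, 90000.0, 250000.0, 1000000.0]

# Standardized values: subtract the mean, divide by the standard deviation.
mean = statistics.fmean(payments)
sd = statistics.pstdev(payments)
z_scores = [(p - mean) / sd for p in payments]

# Ranks: 1 for the smallest payment (no ties in this toy data).
order = sorted(range(len(payments)), key=lambda i: payments[i])
ranks = [0] * len(payments)
for position, index in enumerate(order, start=1):
    ranks[index] = position

# Quartile groups: 0 for the lowest quarter, 3 for the highest.
cuts = statistics.quantiles(payments, n=4)
quartile_group = [bisect.bisect_right(cuts, p) for p in payments]

print(z_scores)
print(ranks)
print(quartile_group)
```

Ranks and quartile groups are unaffected by how extreme the largest payments are, at the cost of discarding the size of the gaps between practices.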



    • #3
      Spoiler alert: Opinionated commentary follows.

      Q2: A variable on a 5-point graded scale can't be normally distributed and (although it's contentious in detail) I would argue that **no** transformation makes sense here (apart from trivial linear recoding, which won't affect skewness). This is the age-old debate about what different measurement scales allow. Although treating ordinal (here graded) as if it were metric is sometimes done (as when we average academic grades), the idea (or ideal) that such a scale could be transformed to normal is not standard. (Some would have harsher words.) I think you would have a hard time publishing the reflect-and-log analysis in a reputable journal. I would use ordinal logit as model of first choice. In any event, marginal normality is unlikely to be key; it's not even an assumption behind linear regression.

      Q1: Payments to practices is an **independent** variable (i.e. a covariate or predictor). A quadratic transformation would make no sense as it treats negative and positive values alike and a cubic transformation would in practice make the skewness much, much worse. A transformation that could make more sense is cube root. Your reason for transformation is not so much lack of normality (as above) as probable nonlinearity and/or sensitivity to outliers.
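
The Q1 argument can be checked on simulated data. This is a minimal sketch in Python; the simulated `payments` are an assumption made for illustration (mostly positive and right-skewed, with a few negatives and exact zeros), not the real dataset.

```python
import math
import random
import statistics

def skewness(values):
    """Moment-based skewness: third central moment over variance to the power 1.5."""
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / len(values)
    m3 = sum((v - mean) ** 3 for v in values) / len(values)
    return m3 / m2 ** 1.5

def cube_root(v):
    """Sign-preserving cube root, defined for negative, zero and positive values."""
    return math.copysign(abs(v) ** (1.0 / 3.0), v)

random.seed(1)
# Hypothetical payments: right-skewed positives plus some negatives and zeros.
payments = [random.lognormvariate(11, 1) for _ in range(2000)]
payments += [-random.uniform(0, 23000) for _ in range(20)]
payments += [0.0] * 20

raw_skew = skewness(payments)
cube_skew = skewness([p ** 3 for p in payments])
root_skew = skewness([cube_root(p) for p in payments])

# Squaring loses the sign: -100 and +100 map to the same value.
print((-100.0) ** 2 == 100.0 ** 2)
print("raw:", raw_skew, "cube:", cube_skew, "cube root:", root_skew)
```

The cube root keeps the ordering and the sign of every payment, which is why it remains usable where a logarithm is not.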

      If payments were a dependent variable (outcome or response), then a generalised linear model with logarithmic link postulates only that mean responses are positive and can thus indulge a small fraction of zero or negative values. If so, don't transform at all; just use an appropriate link function. (I think this is what Marcos means by "generalized models".)
      Last edited by Nick Cox; 14 Aug 2015, 10:12.
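
To illustrate the log-link idea, here is a minimal sketch in Python of fitting log(E[y]) = b0 + b1*x directly, without taking log(y). It assumes a Poisson-type variance function, which is a choice made for this sketch (the post does not name a family), and the data are simulated. The point is that some responses can be zero or negative, as long as the fitted means are positive.

```python
import math
import random

def fit_log_link(x, y, iterations=25):
    """Fit log(E[y]) = b0 + b1*x by Newton steps on Poisson-type score equations."""
    b0 = math.log(sum(y) / len(y))  # start at the log of the (positive) mean response
    b1 = 0.0
    for _ in range(iterations):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector: residuals (y - mu) summed against the constant and x.
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for xi, yi, mi in zip(x, y, mu))
        # Information matrix with weights equal to the fitted means.
        h00 = sum(mu)
        h01 = sum(mi * xi for xi, mi in zip(x, mu))
        h11 = sum(mi * xi * xi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        # Newton update: solve the 2x2 system explicitly.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

random.seed(2)
x = [random.uniform(-1, 1) for _ in range(2000)]
# Simulated responses: positive mean exp(1 + 0.5*x), but noisy enough that some are negative.
y = [math.exp(1.0 + 0.5 * xi) + random.gauss(0, 1.5) for xi in x]

b0, b1 = fit_log_link(x, y)
print("negative responses:", sum(1 for yi in y if yi < 0))
print("estimates:", b0, b1)
```

Taking log(y) would fail on the negative and zero responses here, whereas the log-link fit only requires the mean to be positive.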



      • #4
        Indeed, Nick. "Words failed me", but #3 is exactly what I wish I had said.
        Best regards,

        Marcos



        • #5
          Thank you very much for your suggestions. As I'm quite a novice in this (I'm working on my dissertation, so this is not intended for publication), I found them very helpful.

          As for Q2, I'm afraid my explanation of the variable was incorrect. What I should have said is that each practice was evaluated on a 5-point graded scale, but as the unit of analysis is a practice, a mean score for each practice was calculated. This means there are 7,400 unique observations: a practice would, for instance, have a satisfaction score of 4.3456. In this case I can’t use ordinal logit, and I guess only a linear regression would do, but that presents me with the problem outlined above. Or maybe I should re-evaluate my model in the first place.
          Last edited by Luka Crnjakovic; 15 Aug 2015, 10:54.
