
  • Standardizing with Z scores

    Dear forum,
    I'm just finishing a project and would appreciate your advice. I want to standardize my regression coefficients. Following advice, I'm not doing anything to binary variables, but am turning continuous variables into Z-scores with the code

    egen z_variable = std(variable)

    Is this a reasonable approach? I think I read somewhere that, when standardizing, one should divide continuous explanatory variables by two times their standard deviation? However, I gather this is different from creating Z-scores?

    I'd be immensely grateful for any advice,
    Tom

  • #2
    The correct way to do z-standardization is:

    Code:
        sum var
        gen var_z = (var - r(mean)) / r(sd)

    Or report beta weights instead:

    Code:
        reg depvar var1 var2, beta
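
    For what it's worth, the egen approach in the original post gives the same result as the manual version. A quick sketch to check this, using Stata's bundled auto data (the variable price is just an example):

    ```stata
    sysuse auto, clear
    egen z1 = std(price)                      // egen's built-in standardization
    quietly summarize price
    generate z2 = (price - r(mean)) / r(sd)   // manual z-score
    assert reldif(z1, z2) < 1e-6              // the two agree up to rounding
    ```
    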
    Best wishes




    • #3
      Z-scores, by definition, are the deviation from the mean divided by the standard deviation. If you divide by two standard deviations, you are doing something else.

      Now, I would say as between the Z-score and the score you get by dividing by two standard deviations, the former is more commonly used and will be recognized by more people. Other than that, they are equally good--or, rather, I really would say they are equally bad.

      The most important question, though, is not how to standardize but why standardize in the first place. The purpose of data analysis is to provide insight and understanding about data. If you are working with variables whose units of measurement are commonly understood by your target audience, you should leave them in their natural units. Telling somebody about the effect of, say, height measured in inches or centimeters is clear and helpful. Telling somebody about the effect of standardized height is just obfuscatory. They probably don't know what the standard deviation of height in your data set is, so it's meaningless. But even if you provide that standard deviation, you are then forcing them to do mental arithmetic to make any sense of the results.

      The one situation where standardizing variables (whether by z-scores, rescaling to 0-100, dividing by two standard deviations, or whatever) can be helpful is when the variables are measured in arbitrary or not very meaningful units, such as the number of items endorsed on some scale developed by the researchers. In that situation, the audience will not have any intuition about the meaning of a unit difference in the variable given in its natural units. While one could argue that they don't have any better intuition than when it's put on a standard scale, z-scores do at least provide some information about which responses are typical or atypical, and which differences are small or large in population terms (though not necessarily in individual terms). And not every scale in arbitrary units benefits from being standardized. For scales that are widely familiar to your audience, as, for example, the PHQ-9 or the Beck Depression Inventory would be to psychologists, "everybody" already knows what the scores correspond to in psychological terms even though they are reported in arbitrary units, so standardizing or rescaling them would be another example of obfuscation.

      If your variables are of the type that really would be more understandable when standardized, I would recommend using the z-score as it is, at least for most audiences, the most familiar and least likely to require explanation or provoke puzzlement.



      • #4
        Thanks for such a comprehensive answer, Clyde - it's really helpful and very kind of you. Tom



        • #5
          As far as I know, Gelman (2008) [reference below] was the first to recommend standardizing regression inputs (not only predictors) by dividing (quasi-)continuous predictors (those with arbitrary units, as mentioned by Clyde Schechter) by two standard deviations. This makes their coefficients (effect sizes) more directly comparable to the unstandardized coefficients of untransformed binary (dichotomous) predictors, because a dichotomous variable has a standard deviation of approximately 0.5 (provided its mean does not deviate too much from 0.5). You can achieve this by generating z-scores and then dividing them by 2. To me this can be useful if you know what you are doing (and if you are able to convey the meaning of the coefficients of the transformed variables to the reader). In 2009, Gelman updated his advice (see here).
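
          A minimal sketch of this in Stata, using the bundled auto data (the variables here are purely illustrative): divide the z-score of the continuous predictor by 2, and leave the binary predictor untouched:

          ```stata
          sysuse auto, clear
          egen z_weight = std(weight)       // ordinary z-score
          generate g_weight = z_weight / 2  // Gelman's two-SD standardization
          regress price g_weight foreign    // foreign is 0/1, left as-is
          ```
          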

          Reference: Gelman, A. (2008). Scaling regression inputs by dividing by two standard deviations. Statistics in Medicine, 27(15), 2865–2873. https://doi.org/10.1002/sim.3107



          • #6
            Thanks Dirk, that's very useful

