
  • Why is the constant not zero when standardizing both Y and X in regression? And other inconsistencies.

    I have a multiple regression, and before running it I standardize my dependent variable Y and my predictors X1, X2, and X3 with the commands:

    Code:
    egen zY = std(Y)
    egen zX1 = std(X1)
    egen zX2 = std(X2)
    egen zX3 = std(X3)
    Now if I run the multiple regression, three questions arise:

    First, why do I get a constant that is different from zero (0.8 in my case) when I run the command below, regressing the standardized dependent variable on the standardized predictors?

    Code:
    reg zY zX1 zX2 zX3
    Second, why do the betas from the command below differ from the coefficients in the regression above?

    Code:
    reg Y X1 X2 X3, beta
    Third, why do I get slightly different coefficients when I compare the first regression model from command A with the bStdX column that listcoef reports after command B?

    Code:
    // command A
    reg Y zX1 zX2 zX3

    // command B
    reg Y X1 X2 X3
    listcoef
    I'm using Stata 13.
    Thanks a lot!

  • #2
    I guess you have missing values on some of your variables: you standardize X1 over everyone observed on X1, but not all of these observations are actually used in the regression. Instead, you need to standardize within the sample that will be used in the model.

    Code:
    // open example data
    sysuse auto, clear
    
    gen byte touse = !missing(mpg, price, rep78)
    egen double zmpg   = std(mpg)   if touse
    egen double zprice = std(price) if touse
    egen double zrep78 = std(rep78) if touse
    reg zprice zmpg zrep78
    reg price mpg rep78, beta
    (For more on examples I sent to the Statalist see: http://www.maartenbuis.nl/example_faq )
    ---------------------------------
    Maarten L. Buis
    University of Konstanz
    Department of history and sociology
    box 40
    78457 Konstanz
    Germany
    http://www.maartenbuis.nl
    ---------------------------------



    • #3
      All of the reported issues would have benefited from a reproducible example or at least the output you got.

      I guess all of these arise because of differences in precision and/or differences in the samples being used. egen, by default, uses all observations of a given variable, while regress will, by default, be based on the sub-sample with no missing values on any of the variables included in the model. Try

      Code:
      // mark the estimation sample
      quietly regress Y X1 X2 X3
      generate byte mysample = e(sample)
      
      // standardize the variables with double precision
      foreach x of varlist Y X1 X2 X3 {
          quietly summarize `x' if mysample
          generate double z`x' = (`x' - r(mean))/r(sd)
      }
      
      // run the regression models
      regress zY zX1 zX2 zX3
      regress Y X1 X2 X3 , beta
      regress Y zX1 zX2 zX3
      listcoef
      Also note that listcoef is user-written; per the Statalist FAQ, you are asked to explain where user-written commands come from.
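
      A minimal sketch of how to check where a user-written command such as listcoef comes from, using only built-in Stata tools:

      Code:
      * show the path and version line of the installed ado-file
      which listcoef
      * search official help and user-written packages for the command
      findit listcoef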

      Best
      Daniel



      • #4
        Great, thanks! Removing the missing data solved all the issues.



        • #5
          By the way, do you think it is better to report bStdX or bStdXY?



          • #6
            Originally posted by Andrea Arancio:
            By the way, do you think it is better to report bStdX or bStdXY?

            Better in what sense?

            Best
            Daniel



            • #7
              With the aim of seeing the relative importance of the predictors.
              I know this is not 100% correct, though.



              • #8
                While I am not crazy about standardized variables in general, x standardization alone is usually sufficient for assessing the relative importance of predictors.
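
                Since bStdX is simply the raw coefficient multiplied by the standard deviation of x, taken over the estimation sample, you can sketch it by hand without listcoef. A minimal example with the auto data (variable names here are just for illustration):

                Code:
                sysuse auto, clear
                regress price mpg weight
                * x-standardized coefficient for mpg: b * SD(mpg),
                * with the SD computed over the estimation sample
                quietly summarize mpg if e(sample)
                display _b[mpg]*r(sd)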
                -------------------------------------------
                Richard Williams, Notre Dame Dept of Sociology
                StataNow Version: 19.5 MP (2 processor)

                EMAIL: [email protected]
                WWW: https://www3.nd.edu/~rwilliam



                • #9
                  I would state that neither is better in that sense, as relative importance cannot be judged from statistical analysis in a meaningful way without substantial context.

                  First, I think that people have a limited understanding of what a standard deviation means. Perhaps this is why so many want to report them. Take me as an example: I cannot tell you what a one-standard-deviation change in age means substantively. Only after you tell me that in your data one standard deviation of age corresponds to 10 years can I make sense of it. The point is, I would try to report on scales whose interpretation is as natural and intuitive as possible. A standard deviation does not qualify, in my view.

                  Secondly, suppose you are a political decision maker and want to reduce crime rates. Suppose a political researcher tells you that per prison built [substitute "one SD of prisons built" here, to make interpretation even more complicated], you reduce crime rates by a factor of 5 [again, substitute "by 5 SDs", to make interpretation even more complicated]. The researcher then tells you that having a cop walk up and down the streets every night [substitute "one SD of cops", ... I think you get the point] reduces crime rates by a factor of 2.5. What do you make of this as a politician? Should you build a new prison, or should you have cops walk up and down the street? Aside from the fact that the causal mechanism underlying the former association is questionable, the answer might very well depend on how much each intervention would cost. If the cost of one prison is more than 100 times that of a cop on the streets, then you might want to invest your money in cops, despite the smaller (standardized) coefficient reported by the researcher.

                  Best
                  Daniel
                  Last edited by daniel klein; 11 Nov 2015, 07:41. Reason: typos



                  • #10
                    Here is my own handout on the evils of standardization. Just skip to the last page if you don't want to wade through the examples and math.

                    http://www3.nd.edu/~rwilliam/stats2/l71.pdf

                    You can also see this for a brief discussion of standardized coefficients in logistic regression.
                    -------------------------------------------
                    Richard Williams, Notre Dame Dept of Sociology
                    StataNow Version: 19.5 MP (2 processor)

                    EMAIL: [email protected]
                    WWW: https://www3.nd.edu/~rwilliam

