
  • Does centering variables make sense?

    I've seen papers published in other fields apply centering to focal variables first (before constructing the interaction terms). However, after reading the following post online
    Julien
    Does it really make sense to use that technique in an econometric context?

    To me, the square of a mean-centered variable has a different interpretation than the square of the original variable. Imagine your X is the number of years of education and you look for a squared effect on income: say the higher X, the higher the marginal impact on income. So you want to link the squared value of X to income. If X goes from 2 to 4, the impact on income is supposed to be smaller than when X goes from 6 to 8, e.g. When capturing it with a squared value, we account for this non-linearity by giving more weight to higher values: a move of X from 2 to 4 becomes a move from 4 to 16 (+12), while a move from 6 to 8 becomes a move from 36 to 64 (+28). If we center (say the mean is 5.9), a move of X from 2 to 4 becomes a move from 15.21 to 3.61 (-11.60), while a move from 6 to 8 becomes a move from 0.01 to 4.41 (+4.40). So moves at higher values of education now produce smaller changes in the squared term, and thus carry less weight in the effect, if my reasoning is right. It seems to me that we capture other things when centering.
    I wonder what your recommendation is. Is there a command in Stata that can do the mean-centering automatically, without doing it by hand? Thanks

  • #2

    I think you should give a source for that quotation.

    I see why some people do this. I don't think it makes it especially easier to compare different studies, as means will differ.

    A related example: with calendar years, using year squared doesn't usually help, but it would be nice if people could agree on some standard base, such as 2000. That's the rub, both ways.

    Code:
    . ssc desc center
    
    -------------------------------------------------------------------------------
    package center from http://fmwww.bc.edu/repec/bocode/c
    -------------------------------------------------------------------------------
    
    TITLE
          'CENTER': module to center (or standardize) variables
    
    DESCRIPTION/AUTHOR(S)
          
           center centers variables to have zero sample mean  (and,
          optionally, unit sample variance). center is byable and may also
          be used for quasi-demeaning.
          
          KW: centering
          KW: demeaning
          KW: z-score
          
          Requires: Stata version 7.0
          
          Distribution-Date: 20170413
          
          Author: Ben Jann, University of Bern
          Support: email [email protected]
          
    
    INSTALLATION FILES                               (type net install center)
          center.ado
          center.hlp
    
    ANCILLARY FILES                                  (type net get center)
          center.zip
    -------------------------------------------------------------------------------
    (type ssc install center to install)



    • #3
      Usually, when you include a squared term in a model, you want to include the lower-order term, too. If you do, centering (i.e., shifting the mean value) does not matter at all. Run the following:

      Code:
      sysuse auto , clear
      center mpg
      regress price c.mpg##c.mpg
      summarize mpg
      margins , at(mpg = (`=r(min)'(1)`= r(max)'))
      marginsplot , name(simple , replace) nodraw
      regress price c.c_mpg##c.c_mpg
      summarize c_mpg
      margins , at(c_mpg = (`=r(min)'(1)`= r(max)'))
      marginsplot , name(squared , replace) nodraw
      graph combine simple squared
      Best
      Daniel
      Last edited by daniel klein; 22 Oct 2018, 12:29.
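      The shift-invariance daniel demonstrates can also be checked outside Stata. A minimal Python/NumPy sketch (with simulated data standing in for the auto dataset — the coefficients are made up for illustration) fits the quadratic in the raw and the mean-centered predictor and confirms the fitted values coincide:

```python
import numpy as np

# hypothetical data standing in for price and mpg (not Stata's auto data)
rng = np.random.default_rng(42)
x = rng.uniform(12, 41, size=74)
y = 12000 - 600 * x + 8 * x**2 + rng.normal(0, 500, size=74)

xc = x - x.mean()  # mean-centered copy, like center's c_mpg

# design matrices: intercept, linear, and squared terms
X_raw = np.column_stack([np.ones_like(x), x, x**2])
X_cen = np.column_stack([np.ones_like(x), xc, xc**2])

b_raw, *_ = np.linalg.lstsq(X_raw, y, rcond=None)
b_cen, *_ = np.linalg.lstsq(X_cen, y, rcond=None)

# the coefficients differ, but both designs span the same column
# space, so the fitted curves (and hence the margins) are identical
print(np.allclose(X_raw @ b_raw, X_cen @ b_cen))  # True
```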



      • #4
        I agree with daniel klein, but I would make the point even more strongly than just "usually." I would say that you always want to include the linear term along with the quadratic unless there is a compelling scientific reason not to. For example, if you were fitting a curve of displacement of a particle under constant acceleration over time with initial velocity 0, that is known to be a pure quadratic by the laws of physics and a little calculus. But in real life, one seldom encounters that, and absent such real-world constraints as to guarantee that there is no linear term involved, a model that has a quadratic term but no linear term is just mis-specified and should be discarded.



        • #5
          As this handout notes, centering can be an aid to interpretation. It is usually not essential, though.

          https://www3.nd.edu/~rwilliam/stats2/l53.pdf

          As Nick notes, centering can make comparisons harder because means differ. But, you don't have to center about the mean. For example, in the US, you might subtract 12 from years of education so that a score of 0 = high school graduate. Or, as Nick also suggests, subtract 2000 from year so that year squared is not some super-high number. Besides being harder to interpret, super-high numbers can create computational problems.
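          The base-shifting idea is easy to verify: subtracting a base simply moves the intercept, which becomes the predicted value at the chosen base. A small Python/NumPy sketch with simulated data (variable names and coefficients are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
educ = rng.integers(8, 21, size=200).astype(float)   # years of education
income = 15 + 2.5 * educ + rng.normal(0, 4, size=200)

def ols(X, y):
    """Least-squares coefficients for design matrix X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_raw = ols(np.column_stack([np.ones(200), educ]), income)
b_hs  = ols(np.column_stack([np.ones(200), educ - 12]), income)

# the slope is unchanged; the new intercept is the predicted
# income of a high-school graduate (educ = 12)
print(np.isclose(b_hs[1], b_raw[1]))                  # True
print(np.isclose(b_hs[0], b_raw[0] + 12 * b_raw[1]))  # True
```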
          -------------------------------------------
          Richard Williams, Notre Dame Dept of Sociology
          Stata Version: 17.0 MP (2 processor)

          EMAIL: [email protected]
          WWW: https://www3.nd.edu/~rwilliam



          • #6
            Another reason to center predictors before constructing polynomial terms from them is to reduce collinearity.
            Code:
            sysuse auto
            regress turn c.displacement##c.displacement
            estat vif
            summarize displacement, meanonly
            generate double c_displacement = displacement - r(mean)
            regress turn c.c_displacement##c.c_displacement
            estat vif
            It won't matter much with least-squares linear regression, because most modern software uses numerically robust algorithms. But my understanding is that estimation commands that use iterative algorithms (e.g., generalized linear models and nonlinear least-squares regression) are more sensitive to collinearity in the predictors.

            So, in addition to the gain in numerical stability from avoiding the extremely large numbers that Richard mentions, the gain in numerical stability from reduced collinearity is another consideration for those estimation commands that are potentially sensitive to it.
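            The VIF reduction can be seen in miniature: for a positive predictor, x and x^2 are nearly collinear, while centering makes them (here, with symmetric data, exactly) uncorrelated. A Python/NumPy sketch with made-up data:

```python
import numpy as np

x = np.arange(1.0, 11.0)   # positive predictor, 1..10
c = x - x.mean()           # centered copy, symmetric about zero

def corr(a, b):
    """Pearson correlation of two vectors."""
    return np.corrcoef(a, b)[0, 1]

print(round(corr(x, x**2), 3))    # 0.975: x and x^2 nearly collinear
print(abs(corr(c, c**2)) < 1e-9)  # True: symmetry removes the correlation
```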



            • #7
              As everyone here has noted, there really is no reason to mean-center your variables. If you need something to cite, here are three papers that argue that mean-centering doesn't help.

              Echambadi, R., J.D. Hess. 2007. Mean-centering does not alleviate collinearity problems in moderated multiple regression models. Marketing Science 26(3) 438-445. https://pubsonline.informs.org/doi/a...mksc.1060.0263

              “Specifically, we demonstrate that (1) in contrast to Aiken and West’s (1991) suggestion, mean-centering does not improve the accuracy of numerical computation of statistical parameters, (2) it does not change the sampling accuracy of main effects, simple effects, and/or interaction effects (point estimates and standard errors are identical with or without mean-centering), and (3) it does not change overall measures of fit such as R2 and adjusted-R2. It does not hurt, but it does not help...”
              --Quote from p. 439


              Kromrey, J.D., L. Foster-Johnson. 1998. Mean centering in moderated multiple regression: Much ado about nothing. Educational and Psychological Measurement 58(1) 42. http://journals.sagepub.com/doi/10.1177/0013164498058001005

              Abstract
              Centering variables prior to the analysis of moderated multiple regression equations has been advocated for reasons both statistical (reduction of multicollinearity) and substantive (improved interpretation of the resulting regression equations). This article provides a comparison of centered and raw score analyses in least squares regression. The two methods are demonstrated to be equivalent, yielding identical hypothesis tests associated with the moderation effect and regression equations that are functionally equivalent.

              Kam, C.D., R.J. Franzese. 2007. Modeling and interpreting interactive hypotheses in regression analysis. University of Michigan Press, Ann Arbor.
              (see Chap. 4)
              Last edited by David Benson; 23 Oct 2018, 17:03.



              • #8
                I was just pointed to this page via something I wrote on Twitter. There are absolutely good reasons to center before creating squares and interactions. Of course one can use margins to obtain the average partial effects and get the same thing, but the original equation looks ugly. The APEs are obtained directly by mean-centering, too, and the standard errors after mean-centering are often well below those without it.

                It's easy to understand why: if we write y = b0 + b1*x1 + b2*x2 + b3*x1*x2, then b1 is the partial effect at x2 = 0, and this may be a crazy value for x2. If we center x1 and x2 about their means, we ensure the new coefficients on x1 and x2, say a1 and a2, are the average partial effects. If x1 is binary then x1 = 0 makes sense, so we often don't center x1. Centering x2 then forces a1 to be the average treatment effect -- something I think we can agree is interesting. And as I point out in my tweet, for 2SLS and LASSO there are good reasons to center.
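                The algebra in #8 can be checked numerically. In this Python/NumPy sketch (simulated data, made-up coefficients), the coefficient on x1 after centering x2 equals b1 + b3*mean(x2), i.e. the average partial effect of the binary x1:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x1 = rng.binomial(1, 0.5, n).astype(float)  # binary, left uncentered
x2 = rng.normal(10.0, 2.0, n)               # continuous moderator
y = 1 + 2 * x1 + 0.5 * x2 + 0.3 * x1 * x2 + rng.normal(0, 1, n)

def ols(X, y):
    """Least-squares coefficients for design matrix X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# raw model: the coefficient on x1 is the effect at x2 = 0
b = ols(np.column_stack([np.ones(n), x1, x2, x1 * x2]), y)

# centered x2: the coefficient on x1 is the effect at x2 = mean(x2)
x2c = x2 - x2.mean()
a = ols(np.column_stack([np.ones(n), x1, x2c, x1 * x2c]), y)

# reparameterization identity: a1 = b1 + b3 * mean(x2)
print(np.isclose(a[1], b[1] + b[3] * x2.mean()))  # True
```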



                • #9
                  Originally posted by Clyde Schechter View Post
                  I agree with daniel klein, but I would make the point even more strongly than just "usually." I would say that you always want to include the linear term along with the quadratic unless there is a compelling scientific reason not to. For example, if you were fitting a curve of displacement of a particle under constant acceleration over time with initial velocity 0, that is known to be a pure quadratic by the laws of physics and a little calculus. But in real life, one seldom encounters that, and absent such real-world constraints as to guarantee that there is no linear term involved, a model that has a quadratic term but no linear term is just mis-specified and should be discarded.
                  How about the situation where we are not interested in interpreting coefficient estimates, but just want to control for a variable ranging from 0 to 10, where, moreover, a value of "0" makes sense? I'm thinking of the command "fp", where centering applies to the already-generated polynomial variables (variables used to build polynomial powers need to be positive, or otherwise one must specify an option that sets non-positive values to zero in the polynomial transformations), and you may end up with a model including a quadratic or cubic (or even quartic, if you specify it in "powers") term without the linear one.

                  I understand that, for the variable "calendar year", the distance from year 0 affects the results if we omit the linear term, which is disturbing because we would get different results by counting years differently. But what if, instead, "0" has a precise meaning? In my case, for example, instead of using the logarithm of the population (varying between 0 and 6.98) as an offset in a negative binomial regression, since it is measured loosely (and including it in the regression as an offset, or linearly with a freely estimated coefficient, leads to an overwhelming rejection of the link test), I believe that its effect may differ from the linear one, and that it may also affect overdispersion. Thus "fp" may allow me to control for it without making any assumption about the functional form (at least, being as non-parametric in spirit as reasonably possible).

                  This is an example of my command:

                  Code:
                  fp <lnpoptot>, powers(0.5 1 1.5 2 2.5 3 3.5 4) zero replace: nbreg var0 var1 var2 var3 var4 var5 var6 var7 var8 var9 <lnpoptot>, vce(cluster census_block)
                  The algorithm chose precisely the quadratic transformation.
                  Is such an approach wrong?
                  Last edited by Federico Tedeschi; 28 Oct 2022, 05:56.

