
  • Does centering variables make sense?

    I've seen papers published in other fields apply centering to focal variables first (before constructing the interaction terms). However, after reading the following post online
    Julien
    Does it really make sense to use that technique in an econometric context?

    To me, the square of a mean-centered variable has a different interpretation than the square of the original variable. Imagine your X is the number of years of education and you look for a squared effect on income: say the higher X, the higher the marginal impact on income. So you want to link the squared value of X to income. If X goes from 2 to 4, the impact on income is supposed to be smaller than when X goes from 6 to 8, e.g. When capturing it with a squared value, we account for this non-linearity by giving more weight to higher values: a move of X from 2 to 4 becomes a move from 4 to 16 (+12), while a move from 6 to 8 becomes a move from 36 to 64 (+28). If we center (say the mean is 5.9), a move of X from 2 to 4 becomes a move from 15.21 to 3.61 (-11.60), while a move from 6 to 8 becomes a move from 0.01 to 4.41 (+4.40). So moves at higher values of education now produce smaller changes in the squared term, and thus carry less weight in the effect, if my reasoning is right. It seems to me that we capture other things when centering.
    I wonder what your recommendation is. Is there a command in Stata that can do the mean-centering automatically, without doing it by hand? Thanks

  • #2

    I think you should give a source for that quotation.

    I see why some people do this. I don't think it makes it especially easier to compare different studies, as means will differ.

    A related example: with calendar years, using year squared doesn't usually help, but it would be nice if people could agree on some standard base, such as 2000. That's the rub, both ways.

    Code:
    . ssc desc center
    
    -------------------------------------------------------------------------------
    package center from http://fmwww.bc.edu/repec/bocode/c
    -------------------------------------------------------------------------------
    
    TITLE
          'CENTER': module to center (or standardize) variables
    
    DESCRIPTION/AUTHOR(S)
          
           center centers variables to have zero sample mean  (and,
          optionally, unit sample variance). center is byable and may also
          be used for quasi-demeaning.
          
          KW: centering
          KW: demeaning
          KW: z-score
          
          Requires: Stata version 7.0
          
          Distribution-Date: 20170413
          
          Author: Ben Jann, University of Bern
          Support: email [email protected]
          
    
    INSTALLATION FILES                               (type net install center)
          center.ado
          center.hlp
    
    ANCILLARY FILES                                  (type net get center)
          center.zip
    -------------------------------------------------------------------------------
    (type ssc install center to install)



    • #3
      Usually, when you include a squared term in a model, you want to include the lower-order term, too. If you do, centering (i.e., shifting the mean value) does not matter at all. Run the following:

      Code:
      sysuse auto , clear
      center mpg
      regress price c.mpg##c.mpg
      summarize mpg
      margins , at(mpg = (`=r(min)'(1)`= r(max)'))
      marginsplot , name(simple , replace) nodraw
      regress price c.c_mpg##c.c_mpg
      summarize c_mpg
      margins , at(c_mpg = (`=r(min)'(1)`= r(max)'))
      marginsplot , name(squared , replace) nodraw
      graph combine simple squared
      Best
      Daniel
      Last edited by daniel klein; 22 Oct 2018, 12:29.
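      The shift-invariance daniel demonstrates can also be checked outside Stata. A minimal Python/NumPy sketch (with simulated data standing in for the auto dataset — the coefficients are made up for illustration) fits the quadratic in the raw and the mean-centered predictor and confirms the fitted values coincide:

```python
import numpy as np

# hypothetical data standing in for price and mpg (not Stata's auto data)
rng = np.random.default_rng(42)
x = rng.uniform(12, 41, size=74)
y = 12000 - 600 * x + 8 * x**2 + rng.normal(0, 500, size=74)

xc = x - x.mean()  # mean-centered copy, like center's c_mpg

# design matrices: intercept, linear, and squared terms
X_raw = np.column_stack([np.ones_like(x), x, x**2])
X_cen = np.column_stack([np.ones_like(x), xc, xc**2])

b_raw, *_ = np.linalg.lstsq(X_raw, y, rcond=None)
b_cen, *_ = np.linalg.lstsq(X_cen, y, rcond=None)

# the coefficients differ, but both designs span the same column
# space, so the fitted curves (and hence the margins) are identical
print(np.allclose(X_raw @ b_raw, X_cen @ b_cen))  # True
```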



      • #4
        I agree with daniel klein, but I would make the point even more strongly than just "usually." I would say that you always want to include the linear term along with the quadratic unless there is a compelling scientific reason not to. For example, if you were fitting a curve of displacement of a particle under constant acceleration over time with initial velocity 0, that is known to be a pure quadratic by the laws of physics and a little calculus. But in real life, one seldom encounters that, and absent such real-world constraints as to guarantee that there is no linear term involved, a model that has a quadratic term but no linear term is just mis-specified and should be discarded.



        • #5
          As this handout notes, centering can be an aid to interpretation. It is usually not essential, though.

          https://www3.nd.edu/~rwilliam/stats2/l53.pdf

          As Nick notes, centering can make comparisons harder because means differ. But, you don't have to center about the mean. For example, in the US, you might subtract 12 from years of education so that a score of 0 = high school graduate. Or, as Nick also suggests, subtract 2000 from year so that year squared is not some super-high number. Besides being harder to interpret, super-high numbers can create computational problems.
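          The base-shifting idea is easy to verify: subtracting a base simply moves the intercept, which becomes the predicted value at the chosen base. A small Python/NumPy sketch with simulated data (variable names and coefficients are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
educ = rng.integers(8, 21, size=200).astype(float)   # years of education
income = 15 + 2.5 * educ + rng.normal(0, 4, size=200)

def ols(X, y):
    """Least-squares coefficients for design matrix X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_raw = ols(np.column_stack([np.ones(200), educ]), income)
b_hs  = ols(np.column_stack([np.ones(200), educ - 12]), income)

# the slope is unchanged; the new intercept is the predicted
# income of a high-school graduate (educ = 12)
print(np.isclose(b_hs[1], b_raw[1]))                  # True
print(np.isclose(b_hs[0], b_raw[0] + 12 * b_raw[1]))  # True
```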
          -------------------------------------------
          Richard Williams, Notre Dame Dept of Sociology
          Stata Version: 17.0 MP (2 processor)

          EMAIL: [email protected]
          WWW: https://www3.nd.edu/~rwilliam



          • #6
            Another reason to center predictors before constructing polynomial terms from them is to reduce collinearity.
            Code:
            sysuse auto
            regress turn c.displacement##c.displacement
            estat vif
            summarize displacement, meanonly
            generate double c_displacement = displacement - r(mean)
            regress turn c.c_displacement##c.c_displacement
            estat vif
            It won't matter much with least-squares linear regression, because most modern software uses numerically robust algorithms. But my understanding is that estimation commands that use iterative algorithms (e.g., generalized linear models and nonlinear least-squares regression) are more sensitive to collinearity in the predictors.

            So, in addition to the gain in numerical stability from avoiding the extremely large numbers that Richard mentions, the gain in numerical stability from reduced collinearity is another consideration for those estimation commands that are potentially sensitive to it.
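            The VIF reduction can be seen in miniature: for a positive predictor, x and x^2 are nearly collinear, while centering makes them (here, with symmetric data, exactly) uncorrelated. A Python/NumPy sketch with made-up data:

```python
import numpy as np

x = np.arange(1.0, 11.0)   # positive predictor, 1..10
c = x - x.mean()           # centered copy, symmetric about zero

def corr(a, b):
    """Pearson correlation of two vectors."""
    return np.corrcoef(a, b)[0, 1]

print(round(corr(x, x**2), 3))    # 0.975: x and x^2 nearly collinear
print(abs(corr(c, c**2)) < 1e-9)  # True: symmetry removes the correlation
```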



            • #7
              As everyone here has noted, there really is no reason to mean-center your variables. If you need something to cite, here are three papers that argue that mean-centering doesn't help.

              Echambadi, R., J.D. Hess. 2007. Mean-centering does not alleviate collinearity problems in moderated multiple regression models. Marketing Science 26(3) 438-445. https://pubsonline.informs.org/doi/a...mksc.1060.0263

              “Specifically, we demonstrate that (1) in contrast to Aiken and West’s (1991) suggestion, mean-centering does not improve the accuracy of numerical computation of statistical parameters, (2) it does not change the sampling accuracy of main effects, simple effects, and/or interaction effects (point estimates and standard errors are identical with or without mean-centering), and (3) it does not change overall measures of fit such as R2 and adjusted-R2. It does not hurt, but it does not help...”
              --Quote from p. 439


              Kromrey, J.D., L. Foster-Johnson. 1998. Mean centering in moderated multiple regression: Much ado about nothing. Educational and Psychological Measurement 58(1) 42. http://journals.sagepub.com/doi/10.1177/0013164498058001005

              Abstract
              Centering variables prior to the analysis of moderated multiple regression equations has been advocated for reasons both statistical (reduction of multicollinearity) and substantive (improved interpretation of the resulting regression equations). This article provides a comparison of centered and raw score analyses in least squares regression. The two methods are demonstrated to be equivalent, yielding identical hypothesis tests associated with the moderation effect and regression equations that are functionally equivalent.

              Kam, C.D., R.J. Franzese. 2007. Modeling and interpreting interactive hypotheses in regression analysis. University of Michigan Press, Ann Arbor.
              (see Chap. 4)
              Last edited by David Benson; 23 Oct 2018, 17:03.



              • #8
                I was just pointed to this page via something I wrote on Twitter. There are absolutely good reasons to center before creating squares and interactions. Of course one can use margins to obtain the average partial effects and get the same thing, but the original equation looks ugly. The APEs are obtained directly by mean-centering, too, and the standard errors after mean-centering are often well below those without it.

                It's easy to understand why: if we write y = b0 + b1*x1 + b2*x2 + b3*x1*x2, then b1 is the partial effect at x2 = 0, and this may be a crazy value for x2. If we center x1 and x2 about their means, we ensure the new coefficients on x1 and x2, say a1 and a2, are the average partial effects. If x1 is binary then x1 = 0 makes sense, so we often don't center x1. Centering x2 then forces a1 to be the average treatment effect -- something I think we can agree is interesting. And as I point out in my tweet, for 2SLS and LASSO there are good reasons to center.
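                The algebra in #8 can be checked numerically. In this Python/NumPy sketch (simulated data, made-up coefficients), the coefficient on x1 after centering x2 equals b1 + b3*mean(x2), i.e. the average partial effect of the binary x1:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x1 = rng.binomial(1, 0.5, n).astype(float)  # binary, left uncentered
x2 = rng.normal(10.0, 2.0, n)               # continuous moderator
y = 1 + 2 * x1 + 0.5 * x2 + 0.3 * x1 * x2 + rng.normal(0, 1, n)

def ols(X, y):
    """Least-squares coefficients for design matrix X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# raw model: the coefficient on x1 is the effect at x2 = 0
b = ols(np.column_stack([np.ones(n), x1, x2, x1 * x2]), y)

# centered x2: the coefficient on x1 is the effect at x2 = mean(x2)
x2c = x2 - x2.mean()
a = ols(np.column_stack([np.ones(n), x1, x2c, x1 * x2c]), y)

# reparameterization identity: a1 = b1 + b3 * mean(x2)
print(np.isclose(a[1], b[1] + b[3] * x2.mean()))  # True
```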



                • #9
                  Originally posted by Clyde Schechter View Post
                  I agree with daniel klein, but I would make the point even more strongly than just "usually." I would say that you always want to include the linear term along with the quadratic unless there is a compelling scientific reason not to. For example, if you were fitting a curve of displacement of a particle under constant acceleration over time with initial velocity 0, that is known to be a pure quadratic by the laws of physics and a little calculus. But in real life, one seldom encounters that, and absent such real-world constraints as to guarantee that there is no linear term involved, a model that has a quadratic term but no linear term is just mis-specified and should be discarded.
                  How about the situation where we are not interested in interpreting coefficient estimates, but just want to control for a variable ranging from 0 to 10, where, moreover, a value of "0" makes sense? I'm thinking of the command "fp", where centering applies to the already-generated polynomial variables (variables used to build polynomial powers need to be positive, or otherwise one must specify an option that sets non-positive values to zero in the polynomial transformations), and you may end up with a model including a quadratic or cubic (or even quartic, if you specify it in "powers") term without the linear one.

                  I understand that, for the variable "calendar year", the distance from year 0 affects the results if we omit the linear term, which is disturbing because we would get different results by counting years differently. But what if, instead, "0" has a precise meaning? In my case, for example, instead of using the logarithm of the population (varying between 0 and 6.98) as an offset in a negative binomial regression, since it is measured loosely (and including it in the regression as an offset, or linearly with a freely estimated coefficient, leads to an overwhelming rejection of the link test), I believe that its effect may differ from the linear one, and that it may also affect overdispersion. Thus "fp" may allow me to control for it without making any assumption about the functional form (at least, being as non-parametric in spirit as reasonably possible).

                  This is an example of my command:

                  Code:
                  fp <lnpoptot>, powers(0.5 1 1.5 2 2.5 3 3.5 4) zero replace: nbreg var0 var1 var2 var3 var4 var5 var6 var7 var8 var9 <lnpoptot>, vce(cluster census_block)
                  The algorithm chose precisely the quadratic transformation.
                  Is such an approach wrong?
                  Last edited by Federico Tedeschi; 28 Oct 2022, 05:56.

