
  • understanding a regression with quadratic terms

    Hello all,

    I am writing to ask about some results that are worrying me. I am running a panel regression with a random-effects estimator, including a quadratic term. The model is basically the following:

    y_it = α_i + β1*X_it + β2*X_it^2 + β3*Z_it + ε_it


    My first question is whether it is advisable to center the X variable and then compute its square from the centered values. If so, how should I interpret the resulting coefficients, and how can I find the value of X at the turning point of the non-linear relationship (would a straightforward derivative give that)?


    Secondly, when I include the quadratic term in the regression, both the linear and quadratic terms enter significantly and indicate a concave relationship between X and Y (β2 < 0). After taking derivatives and identifying the value of X at the possible turning point, I tried dropping the observations above that threshold, which in this model means dropping 5 countries from the sample. The result is a dataset with the remaining 15 countries, in which the vast majority of the values of X lie below that threshold. In a new regression on this subset, the quadratic coefficient unsurprisingly becomes insignificant, and only the linear relationship remains significant. My question is about the interpretation of these results. Could those 5 countries be biasing the results when I run the regression on the whole set of countries, wrongly suggesting an inverted-U relationship when in fact they simply follow the opposite trend to the rest of the countries? The countries do not seem to behave in the same way, and I wonder whether a random-effects model is a good estimator in this case.

    I apologise in advance if my limited knowledge of statistics has made my questions unclear. Any suggestions are welcome.

    Thank you


  • #2
    The fact that you are including a quadratic term has little influence on your decision about centering the X variables. Sometimes, when you are especially interested in the quadratic nature of the relationship, and would like to report a result like "the expected value of the outcome Y is proportional to the square of the difference between X and #", where # is the value of X at the vertex of the parabola, then it is especially convenient to center X around #, so that the linear term drops out of the model. This might or might not apply to your situation.

    Another aspect of a quadratic model is that in most circumstances, X and X^2 will be correlated, sometimes quite highly so. Occasionally this correlation will be so high that you get extremely wide standard errors for both the linear and quadratic coefficients. When that happens, it is better to center the variable around some measure of central tendency: the centered variables will generally exhibit low correlations. But it doesn't sound like that is a problem here.
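As a sketch of the centering approach in Stata (the variable name x here is hypothetical, standing in for your regressor):

```stata
* Center x at its sample mean, then square the centered variable.
* (x is a placeholder variable name, not from the poster's data.)
summarize x, meanonly
generate double x_c  = x - r(mean)
generate double x_c2 = x_c^2

* The centered pair is typically far less correlated than x and x^2.
correlate x_c x_c2
```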

    Sometimes, if you are dealing with very large values of X, the much larger values X^2 create a scaling problem that makes it difficult for the model estimators to converge. In that case the best solution is rescaling, although centering often accomplishes the same thing. In any case, that clearly didn't happen to you.

    There is no need to resort to differential calculus to find the vertex of a parabola. A little algebra (completing the square) directly gives you the result that the vertex lies at X = -β1/(2β2). You can use -nlcom- to calculate this in Stata.
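For instance, after fitting a model like the one in the original post (y, x, x2, and z are placeholder names for the outcome, linear term, squared term, and control), -nlcom- gives the vertex along with a delta-method standard error:

```stata
* Fit the random-effects model, then estimate the vertex -b1/(2*b2).
* Variable names are placeholders for the poster's actual variables.
xtreg y x x2 z, re
nlcom vertex: -_b[x] / (2*_b[x2])
```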

    The remainder of your question is not really a statistical question, it is a question about the scientific content of your work. Since you haven't even hinted at what domain of knowledge you are working in (except that it is one that studies countries, which could be almost anything) nobody will be able to advise you concretely. When the data fit a quadratic model well, it is possible that in reality the entities on either side of the axis of symmetry are distinct populations that ought to be analyzed and thought about separately. But it is also possible that an inverted U shaped relationship is real. (It is rare that a quadratic relationship turns out to be a useful law of science, but they are often excellent approximations to U- or inverse U-shaped relationships when the range of the data is reasonably restricted.) It is also possible that the entities on one side of the axis of symmetry represent data errors. But there is no way to statistically distinguish these possibilities: for that you must look to the scientific theory underlying your data set.

    In any case, if an inverse-U shaped relationship makes sense scientifically and fits the data reasonably well, that would not be relevant to deciding whether a random effects model is appropriate or not. The appropriateness of a random effects model is an altogether separate issue.



    • #3
      Thank you very much for your answer.

      Regarding the correlation between x and x^2 that you mention: it is indeed high here, although I believe the standard errors are not wide.

      Code:
      . xtreg GiniCEPALNat PrivCred PrivCred2 AvgSchooling InflationCPI Trade GDPGrowth , re
      
      Random-effects GLS regression                   Number of obs      =        94
      Group variable: CountryNum                      Number of groups   =        14
      
      R-sq:  within  = 0.5322                         Obs per group: min =         5
             between = 0.2489                                        avg =       6.7
             overall = 0.3598                                        max =         8
      
                                                      Wald chi2(6)       =     75.12
      corr(u_i, X)   = 0 (assumed)                    Prob > chi2        =    0.0000
      
      ------------------------------------------------------------------------------
      GiniCEPALNat |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
      -------------+----------------------------------------------------------------
          PrivCred |   .2178837   .0577958     3.77   0.000     .1046061    .3311613
         PrivCred2 |  -.1574209   .0513551    -3.07   0.002    -.2580749   -.0567668
      AvgSchooling |  -.0232057   .0034132    -6.80   0.000    -.0298956   -.0165159
              Wages |  -.0011811   .0013687    -0.86   0.388    -.0038637    .0015015
              Minim |  -.0370494   .0196608    -1.88   0.060    -.0755839    .0014852
                GDP |  -.0056184   .1204881    -0.05   0.963    -.2417708     .230534
             _cons |   .6930317   .0275917    25.12   0.000     .6389528    .7471105
      -------------+----------------------------------------------------------------
           sigma_u |  .02485974
           sigma_e |  .02051442
               rho |  .59489607   (fraction of variance due to u_i)
      ------------------------------------------------------------------------------
      After centering the variable at its mean, the correlation remains high (0.7), and only the coefficient on the linear term (PrivCred) differs from the values shown above when I run the regression with the centered variables. Is there any other way to solve this high-correlation problem?




      • #4
        You have a high correlation, but you don't have a high correlation problem. Your standard errors for PrivCred and PrivCred2 are both quite reasonable. That is all that matters. The correlation between these variables has no adverse impact on the rest of the results. So just forget about it. Evaluating multicollinearity probably wastes more people's time than any other statistical bugbear.

        By the way, if at some point you want to calculate or graph predicted values of GiniCEPALNat for various values of your predictors, or calculate and graph marginal effects of any of your predictors*, it will be much easier to do that with -margins- than by hand. But in order to use the -margins- command, you have to run your regression using factor-variable notation (-help fvvarlist-). This is particularly critical for the PrivCred variable: calculating the marginal effects from a quadratic relationship by hand is a lot of work and error prone. -margins- will get it right for you effortlessly if you set up the regression properly.

        Code:
        xtreg GiniCEPALNat c.PrivCred##c.PrivCred AvgSchooling InflationCPI Trade GDPGrowth , re
        This will automatically generate the quadratic term and include it in the model (you don't need, and, in fact, should not have, a separate PrivCred2 variable). And when you run -margins-, it will account for the fact that the quadratic term also changes when the linear one does.

        *And I strongly recommend that you do some graphs of the Gini vs PrivCred relationship and the marginal effects of PrivCred on Gini at various levels of PrivCred. Most of your audience, unless they are statistical professionals, will find the results of a quadratic model difficult to understand in the abstract. A picture is worth thousands of words here.



        • #5
          Clyde, your comments and tips prove very helpful.

          Further to your suggestion, would you recommend any source from which I could learn how to produce these graphs of the marginal effects?





          • #6
            So, I don't know what the interesting range of values of PrivCred is. Just for the sake of illustration, let's say it is 10, 20, 30, 40, 50. Then, after you run your -xtreg- command with factor-variable notation, you can do this:

            Code:
            // PREDICTED VALUES OF GiniCEPALNat
            // AT INTERESTING VALUES OF PrivCred
            // ADJUSTED FOR OTHER MODEL VARIABLES
            margins, at(PrivCred = (10 20 30 40 50))
            marginsplot
            
            // MARGINAL EFFECTS ON GiniCEPALNat
            // OF PrivCred AT INTERESTING VALUES
            // ADJUSTED FOR OTHER MODEL VARIABLES
            margins, dydx(PrivCred) at(PrivCred = (10 20 30 40 50))
            marginsplot
            You should read the manual sections on -margins- and -marginsplot-. The former, in particular, is lengthy and a bit of a heavy read. An easier introduction to -margins- that you might want to read first, and that will get you started on basic uses, is http://www.stata-journal.com/article...article=st0260, though the manual has a greater level of detail that you will eventually want to learn.

            The -marginsplot- command accepts most of the options available in -graph twoway-, so you can customize the appearance of your graphs however you like. Also remember that if you run these commands one after another without interruption as they stand, the second graph will overwrite the first. So you might want to save each graph, or -name()- it so that it gets its own tab in the graph window and you can look at both of them.
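For example, giving each graph its own name keeps both available in the graph window (the names "predicted" and "effects" here are arbitrary):

```stata
* Predicted values, saved under its own graph name
margins, at(PrivCred = (10 20 30 40 50))
marginsplot, name(predicted, replace)

* Marginal effects, saved under a second name so it does not
* overwrite the first graph
margins, dydx(PrivCred) at(PrivCred = (10 20 30 40 50))
marginsplot, name(effects, replace)
```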



            • #7
              Great, thank you! I started doing some graphs and new questions arose.

              I have panel data on 15 countries over 8 time periods, and I am trying to figure out whether credit issued to the private sector may be influential in determining income inequality. To test a possible inverted-U pattern, I add a quadratic term to the regression, and the results seem to suggest a concave relationship. However, when I look at the graph where both variables are plotted, some questions arise:

              - The vertex seems to occur at values of private credit above 68%. However, the graph shows that very few observations lie above that threshold. How is it possible, then, that the regression gives me such results? Is there anything that may be biasing them?

              - A high concentration of observations occurs at values of Private Credit below 40%. If I keep only the countries that report such values, the inverted-U pattern loses strength and a positive linear relation emerges. Could it be the case that, by having some countries with a positive relation between the variables and others with a negative one, the regression with squared terms of Private Credit understands this as an inverted U (when in fact there may be two groups of countries acting differently)?



              [Attached image: Picture1.jpg]



              • #8
                Could it be the case that, by having some countries with a positive relation between the variables and others with a negative one, the regression with squared terms of Private Credit understands this as an inverted U (when in fact there may be two groups of countries acting differently)?
                Yes, it could.

                Any time you see any data that appear to follow a quadratic relationship, if you restrict your attention to the cases on one side of the axis of symmetry, you will find a monotone relationship, and when restricting attention to the cases on the other side you will find a relationship in the opposite direction. It is not possible to mathematically distinguish the operation of two separate relationships in separate populations from a single U or inverse-U relationship. So the question is not answerable in statistical terms.

                To resolve this issue, if it can be resolved at all, requires turning to the underlying theory. This is not my discipline, so I have no idea what theory predicts about the relationship between private credit and Gini. For advice on that, you will need to turn to a colleague in your field.

                If there is no theory about this, then your question becomes, actually, metaphysical. From that perspective, you might think about it this way. An inverse U-shaped relationship approximately described by a quadratic would be a more parsimonious model than a model with two separate populations, one having a positive relationship and the other a negative one. So by Occam's Razor, you would prefer the quadratic. A key here is that the two populations are distinguished only by the value of the Private Credit variable. When that is high we see a negative relationship to Gini, and when it is low we see a positive one. Is it plausible that this is just coincidence? All else equal, it is not plausible, so on metaphysical grounds the quadratic model would be better.

                But if there is some other variable, or set of variables, (perhaps measured in your data, or perhaps not yet measured, or perhaps not even thought about yet) that distinguishes the population with the increasing relationship from the population with the decreasing relationship, and which also accounts for one having high private credit and the other low, then you would be in a different situation. In that case, explaining the two phenomena (high vs low private credit and decreasing vs increasing relationship to Gini) and their co-occurrence with a single variable would have greater "explanatory power" and that would be preferred as the most parsimonious model. Is there such a variable or set of variables? What might it be?

                That's how I would think about this if there isn't something to answer the question within your discipline's theory.



                • #9
                  Following Julian's great question, I want to return to the first part of the post, which concerns whether to center a variable before calculating its quadratic form, or instead to center the quadratic term itself and use that. In our research the nonlinear relationship needs careful treatment, so I am curious whether to use the square of the centered variable or the centered quadratic term in our model.
                  When there is a nonlinear relationship between X and Y, I worry that centering and then calculating the quadratic wipes out the nonlinear effect. One of my co-authors referred me to papers such as Dalal & Zickar (2011), which state that substantive variables should be centered first and the quadratic term calculated afterwards: Y = b0 + b1(X − X̄) + b2(X − X̄)^2 + e (here X̄ denotes the mean of X). I am hesitant about whether this approach makes sense. I would have thought we should treat the quadratic term as a separate variable, center it, and include that in the equation: Y = b0 + b1(X − X̄) + b2(X^2 − X̄^2) + e. Which approach is correct?


                  By the way, here is a table of simulated data that shows how these two approaches differ:

                  Y: Dependent
                  X: Independent


                  X2c: Centered form of the quadratic original X
                  Xc2: Quadratic form of the centered X
                     Y      X     X2c     Xc2
                  1.80   1.00  -14.67    7.11
                  1.50   2.00  -11.67    2.78
                  4.50   3.00   -6.67    0.44
                  4.70   4.00    0.33    0.11
                  6.00   5.00    9.33    1.78
                  8.30   6.00   20.33    5.44
                  5.60   7.00   33.33   11.11
                  5.00   8.00   48.33   18.78
                  2.50   9.00   65.33   28.44



                  • #10
                    In theory, you can do this either way: the models are equivalent and the results from either can be transformed to give the results from the other. They are just different ways of parameterizing the same model. But in practice, you will be much better off centering X and then using the square of centered X. The results will be more easily interpreted, and calculating predicted values and marginal effects will be an order of magnitude easier. If you do it the other way you will become embroiled in messy coding for a lot of unnecessary algebra.

                    The basic difficulty posed by first centering X^2 and then using X^2 minus that value is that the linear and quadratic terms are then centered in different places. Consequently, any calculation using these two terms requires you to "translate" any value of X − X̄ into the corresponding value of the centered X^2 term. For example, -margins- will not understand that this translation is needed, so you would be unable to use it to calculate predicted margins or marginal effects; instead you would have to write lengthy code to do those things. If you stick with centering X and then using the square of centered X in your model, and you do it with factor-variable notation (-help fvvarlist- if you are not familiar with this), then it is simple: -margins- will know that one term is the square of the other and will handle it all correctly.
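A minimal sketch of the recommended approach, with hypothetical variable names Y, X, and Z standing in for your outcome, predictor, and control:

```stata
* Center X, then let factor-variable notation create the square,
* so that -margins- knows the two terms are linked.
* (Y, X, Z and the at() values are illustrative placeholders.)
summarize X, meanonly
generate double Xc = X - r(mean)
xtreg Y c.Xc##c.Xc Z, re
margins, dydx(Xc) at(Xc = (-10 0 10))
marginsplot
```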
                    Last edited by Clyde Schechter; 28 Nov 2022, 07:56.

