
  • Rose Simmons
    started a topic Testing whether to include a squared term


    Hi,

    I am using a panel dataset.
    vote is my dependent variable: 1 if the respondent voted in an annual leadership election, and 0 otherwise (so I am using nonlinear methods).
    My independent variables include marital status, gender, age, etc.

    I then run my regression with only age and age^2 as control variables:

    Code:
    xtprobit vote c.age c.age#c.age, re vce(robust)
    I then conduct the test to see whether age^2 should be included, because I suspect there may be a U-shaped or inverse U-shaped relationship with voting (e.g. very young and very old people may be more or less likely to vote than middle-aged people, in a non-linear relationship).

    Code:
    test age c.age#c.age
    
     ( 1)  [vote]age= 0
     ( 2)  [vote]c.age#c.age= 0
    
               chi2(  2) =    4.34
             Prob > chi2 =    0.1141
    Given this result, does it suggest that age^2 is insignificant, and that perhaps I should include only age?

    I believe this is the appropriate test of the significance of the squared term, but please advise me if I'm mistaken.
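    As a sanity check on the Wald test output above, the reported p-value can be reproduced by hand: for a chi-squared distribution with 2 degrees of freedom, the survival function has the closed form exp(-x/2). A short Python sketch (outside Stata, purely illustrative):

```python
import math

# Joint Wald test of H0: age = 0 and age^2 = 0 reported chi2(2) = 4.34.
# For 2 degrees of freedom, P(X > x) = exp(-x/2), so the p-value is easy
# to reproduce by hand.
chi2_stat = 4.34
p_value = math.exp(-chi2_stat / 2)
print(round(p_value, 4))  # ~0.1142, matching the reported 0.1141 up to rounding
```

    Note that -test age c.age#c.age- is a joint test of both age terms, so a non-rejection speaks to dropping age and age^2 together, not the squared term alone.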

    Thank you
    Last edited by Rose Simmons; 05 Mar 2017, 15:15.

  • Jovana Ju
    replied
    Yes, you are right. Thank you so much.



  • Clyde Schechter
    replied
    No, I did ask for the coefficients of X1 in the linear (i.e. non-quadratic) models, which, as I understand it, is what you provided in #53.



  • Jovana Ju
    replied
    Clyde, I am so sorry. I think I didn't give the right answer: you asked me about the coefficients in front of the linear term, but I provided information for the linear model instead. So, the coefficients in front of the linear term are -0.0054 (95% confidence interval: -0.0095492 to -0.0013747) when the mentioned observation is included, and -0.0046 (95% confidence interval: -0.0151464 to 0.005874) when it is excluded.



  • Jovana Ju
    replied
    Thank you so much for your time, patience, and help!



  • Clyde Schechter
    replied
    That's a pretty large shift in the coefficient based on the inclusion of that one observation. So, in this case, I think you have to keep the quadratic model. But I do think that when you report your results you need to make explicit the caveat that the results are heavily influenced by that one outlier observation.



  • Jovana Ju
    replied
    Thank you, Clyde. In the first case (with this observation), the coefficient is -.0010967 (95% confidence interval -.0022103 to .000017), and in the second case, it is -.0039147 (95% confidence interval -.0066853 to -.0011442).



  • Clyde Schechter
    replied
    Let's not focus on significance so much; it's a poor guide to model selection. How different are the coefficients in the linear (no quadratic term) models with and without that one observation? And what are the 95% confidence intervals around those coefficients?



  • Jovana Ju
    replied
    Clyde, when the model is fit only to the observations to the right of the parabola's vertex, its shape changes and the linear model becomes significant. Initially, I didn't consider this a problem, since this is the downward-sloping part of the parabola and I expected the negative sign. Am I wrong? As for that observation, I am certain it is not a data error, but it could be described as an extreme value.



  • Clyde Schechter
    replied
    So, looking again at the graph of the quadratic, you can see that it actually gets pretty steep on the right side of the plotted range. You state that your values of X are grouped in a low and narrow range, and if the linear version of the model comes out with a coefficient that is close to zero, that range must be fairly close to the apex of the parabola. In a situation like that, you can definitely end up with an estimate of the linear slope that is close to zero, because the upward slope to the left of the apex and the downward slope to the right cancel each other out. So in that situation, I would retain the quadratic model.

    But let me throw in one other caution here. You mentioned that there is only one observation in the data set with an X value to the left of the parabola's vertex. Given the upslope/downslope cancellation I referred to in the previous paragraph, this makes me wonder whether that one observation is a point of high leverage in the regression, since the rest of the data lies on the downsloping side of the curve. What happens to the quadratic model if you exclude that one observation? If the model changes substantially, then your results hinge critically on a single data point. In that case, you should double- and triple-check that the data point isn't just an error. And if you can confirm that it is, in fact, valid, then when reporting your results you should mention that this one high-leverage observation strongly influences the findings.
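    The cancellation Clyde describes can be seen in a toy example (pure-Python sketch with made-up data, not the poster's): fitting a straight line to points spread symmetrically around a parabola's apex yields an OLS slope of exactly zero, despite strong curvature.

```python
# Toy data (not the actual data set): an inverted parabola sampled
# symmetrically around its apex at x = 0.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [-(x ** 2) for x in xs]

# Closed-form OLS slope: cov(x, y) / var(x).
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
print(slope)  # 0.0: the upslope on the left and downslope on the right cancel
```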



  • Jovana Ju
    replied
    Clyde, thank you very much for your suggestion. The values of X are grouped within a low and narrow range; I have already raised the issue of data variability elsewhere. If we assume the range is not wide enough to model quadratic effects, a linear approximation naturally emerges as an alternative, but it is not statistically significant; that is, all tests indicate that the model with quadratic effects is better.

    Regarding your last sentence: I came closest to describing the relationship as nonlinear, without emphasizing the inverted U-shape, for the reasons mentioned.



  • Clyde Schechter
    replied
    It depends.

    If you run -graph twoway function y= -.0085499 -.0054619*x -.0001648*x*x, range(-20 20)- you will see what the quadratic function of X1 in your model looks like over a range from a bit to the left of the vertex to far to the right of it. You can see that it does not really represent an inverted-U shape in any meaningful sense of the term. Rather, the quadratic term is picking up curvilinearity in the relationship. Notice that, as with any quadratic, if you focus on a narrow range of X1 values, the relationship is very close to linear. So the real question is whether the range of X values in your data is wide enough that the non-linearity is large enough to be worth modeling. I would do something like plot a histogram of the X1 values in your data and then see whether they are concentrated in a narrow enough area that a linear approximation to the model is satisfactory, or spread broadly enough that you really need to express the non-linearity in your model.

    But I will say that if you do end up retaining the quadratic representation, I would not refer to it as showing an inverse-U relationship: clearly the left half of that supposed inverse-U is effectively missing.
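    Clyde's point that any quadratic is nearly linear over a narrow window can be checked numerically with the posted coefficients (a Python sketch; the window [0, 5] is hypothetical, since the actual range of X1 isn't shown):

```python
# Fitted quadratic from the posted model (constant, X1, and X1sq coefficients).
def quad(x):
    return -0.0085499 - 0.0054619 * x - 0.0001648 * x * x

lo, hi = 0.0, 5.0  # hypothetical narrow window of X1 values

# Chord (straight line) through the endpoints of the window.
slope = (quad(hi) - quad(lo)) / (hi - lo)

def chord(x):
    return quad(lo) + slope * (x - lo)

# Largest departure of the quadratic from the straight line over the window.
max_gap = max(abs(quad(i / 10) - chord(i / 10)) for i in range(51))
print(round(max_gap, 5))  # ~0.00103: essentially linear over this window
```

    Whether a departure of about 0.001 in Y is negligible depends on the scale of Y, which is exactly why the histogram check matters.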



  • Jovana Ju
    replied
    Hello everyone! I previously wrote regarding the issue of whether to include quadratic effects in my model, but I am still facing the same problem and I kindly ask for your help based on the output I have from Stata. I am working with panel data, where T is 6 and N is 98.


    Code:
    xtreg Y X1 X1sq X2 X2sq X3 X4 X5 X6 X7, fe robust
    
    Fixed-effects (within) regression               Number of obs     =        582
    Group variable: id                              Number of groups  =         98
    
    R-squared:                                      Obs per group:
         Within  = 0.1182                                         min =          1
         Between = 0.1050                                         avg =        5.9
         Overall = 0.0921                                         max =          6
    
                                                    F(9, 97)          =  913273.12
    corr(u_i, Xb) = 0.0783                          Prob > F          =     0.0000
    
                                        (Std. err. adjusted for 98 clusters in id)
    ------------------------------------------------------------------------------
                 |               Robust
               Y | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
              X1 |  -.0054619   .0020594    -2.65   0.009    -.0095492   -.0013747
            X1sq |  -.0001648   .0000562    -2.93   0.004    -.0002765   -.0000532
              X2 |  -.0056495   .0002735   -20.66   0.000    -.0061923   -.0051068
            X2sq |  -.0000362   1.56e-06   -23.20   0.000    -.0000393   -.0000331
              X3 |   .0027445   .0185407     0.15   0.883    -.0340537    .0395427
              X4 |   .1119635   .0398111     2.81   0.006     .0329495    .1909775
              X5 |   .0025336   .0021159     1.20   0.234    -.0016659    .0067332
              X6 |  -.0825474   .0398469    -2.07   0.041    -.1616326   -.0034623
              X7 |   .0006929   .0004353     1.59   0.115    -.0001712    .0015569
           _cons |  -.0085499   .4149904    -0.02   0.984     -.832191    .8150912
    -------------+----------------------------------------------------------------
         sigma_u |  .06728514
         sigma_e |  .06179892
             rho |  .54242449   (fraction of variance due to u_i)
    ------------------------------------------------------------------------------
    The estimated model shows an inverted U-shaped relationship between X1 and Y, with the maximum of Y reached at X1 = -16.56 (I used -utest- in Stata). What concerns me is that only one value of X1 in my sample lies to the left of this turning point. Should I keep the quadratic term?
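    The turning point is just the vertex of the fitted quadratic, -b/(2a), and can be verified directly from the coefficient table (a Python sketch, illustrative only):

```python
# Y = ... + b*X1 + a*X1^2 peaks where dY/dX1 = b + 2*a*X1 = 0.
b = -0.0054619  # coefficient on X1
a = -0.0001648  # coefficient on X1sq
vertex = -b / (2 * a)
print(round(vertex, 2))  # -16.57, close to the reported -16.56 (table values are rounded)
```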



  • Jovana Ju
    replied
    Thank you very much for your comments and suggestions!

    Sincerely,

    Jovana



  • Nick Cox
    replied
    In ecology it's common that abundance is highest where organisms are happiest (evidently a term of art in gardening), corresponding to ideal temperature, moisture, salinity, nutrient supply, or whatever, although competition, predation and other effects may be at work too.

    A standard model for this phenomenon is the so-called Gaussian logit, a combination of a quadratic in one predictor and a logit link. Here is a simple graph to give a flavour:

    Code:
    twoway function invlogit(0.01 * (x - 5) - (x - 5)^2), ra(0 10)
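    The same curve can be sketched outside Stata; this pure-Python version (illustrative only) locates the peak of the bell, which sits just right of x = 5 because of the small linear term:

```python
import math

def invlogit(z):
    return 1.0 / (1.0 + math.exp(-z))

def gaussian_logit(x):
    # Same linear predictor as the -twoway function- call above.
    return invlogit(0.01 * (x - 5) - (x - 5) ** 2)

# Locate the peak on a fine grid over [0, 10].
grid = [i / 1000 for i in range(10001)]
peak_x = max(grid, key=gaussian_logit)
print(peak_x)  # 5.005, where the quadratic 0.01*(x - 5) - (x - 5)^2 is maximised
```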
    [Graph: Gaussian_logit.png]



    As Jeff Wooldridge implies, this model doesn't require that the entire shape be present. Indeed, for organisms that thrive at some environmental extreme, only one limb of the bell is needed.

    I'd be interested to know how far this model is used outside ecology, in epidemiology, economics, eschatology, campanology, or anywhere else.

