Testing whether to include a squared term

Rose Simmons

Join Date: Feb 2017

Posts: 114
#1

Testing whether to include a squared term

05 Mar 2017, 14:28

Hi,

I am using a panel dataset.
vote is my dependent variable: 1 if the respondent voted in an annual leadership election, and 0 otherwise (so I am using nonlinear methods).
My independent variables include marital status, gender, age etc.

I then run my regression with only age and age^2 as control variables:

Code:

xtprobit vote c.age c.age#c.age, re vce(robust)

I then conduct the test to see whether age^2 should be included, because I suspect there may be a U-shaped or inverse U-shaped relationship with voting (e.g. very young and very old people may be more or less likely to vote than middle-aged people, in a non-linear relationship).

Code:

test age c.age#c.age

 ( 1)  [vote]age = 0
 ( 2)  [vote]c.age#c.age = 0

           chi2(  2) =    4.34
         Prob > chi2 =    0.1141

With this result, does this suggest that including age^2 is insignificant, and that perhaps I should only include age?

I believe this is the appropriate test of the significance of the squared term, although please could you advise me if I'm mistaken?

Thank you

Last edited by Rose Simmons; 05 Mar 2017, 15:15.
Clyde Schechter

Join Date: Apr 2014

Posts: 30096
#2

05 Mar 2017, 15:26

Well, the test you did is actually a test of the joint significance of logage and logage squared. The result, were you to rely on it, suggests that both terms together fail to make a "statistically significant" contribution to the model. It does not provide a test of the inclusion of the quadratic term per se. If you believe in this kind of testing, you would drop both logage and logage squared from your model. That would leave you with a model that has no representation of age at all--which strikes me as not credible at all.

But, in any case, I am never one to endorse selecting model variables on the basis of significance tests. I think that's just about the worst way to do it. You never explained why you log-transformed age in the first place. I'll assume you had a good reason, take it as a given, and won't discuss that. Your reason for trying a quadratic model is that you thought there might be a U- or inverted U-shaped relationship. Determining whether that is true really should not be done based on tests of statistical significance. The way to do that is to look at the coefficients of logage and c.logage#c.logage. In particular, the axis of symmetry of the parabola will be at logage = -_b[logage]/(2*_b[c.logage#c.logage]). So you should calculate that value, using -nlcom-. If that value falls squarely within the range of logage that occurs in your data, then you do indeed have a relationship that is U- or inverted U-shaped (depending on the sign of the quadratic coefficient). This is true whether or not either of these coefficients is "significant" and whether or not there is joint "significance." And in this circumstance you should keep the quadratic term in the model.
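As a hedged aside on where that formula comes from (the symbols \beta_0, \beta_1, \beta_2 and a are notation introduced here for illustration, with a standing for logage): the linear index of the probit model is quadratic in a, and setting its derivative to zero gives the axis of symmetry.

\[
\eta(a) = \beta_0 + \beta_1 a + \beta_2 a^2, \qquad
\frac{d\eta}{da} = \beta_1 + 2\beta_2 a = 0
\;\Longrightarrow\;
a^{*} = -\frac{\beta_1}{2\beta_2}
\]

Because the probit probability \Phi(\eta) is an increasing function of \eta, the predicted probability (holding the random effect fixed) turns around at the same a^{*}: a peak if \beta_2 < 0, a trough if \beta_2 > 0.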

If, however, the axis of symmetry lies clearly outside the range of logage in your data, then within the range of your data the "parabola" is a curvilinear function, but it does not reach a peak or nadir and turn around. So you have a curve with a declining marginal effect, or a curve with an increasing marginal effect--but no vertex within the data. You might want to explore graphically* to see if the departure from linearity is large enough to matter from a real-world substantive perspective. If the effect of the quadratic term is to just add something like rounding error to the predicted probabilities, you'd probably just want to drop it.

The hard case is where the axis of symmetry lies near the boundary of one side of the range of your data. In that case you "sort of" have a vertex, but given measurement error and other issues, it may not be of practical importance. Deciding what to do there requires a lot of judgment. But again, the judgment should be based on how much the quadratic term contributes to the predicted probabilities, not some kind of significance test. And again, a graph will be quite informative.

*To see what's going on here I recommend selecting a bunch of values of logage that span the range in your data. For the sake of illustration, I'll say that the ages in your data range from 20 through 80, so log(age) runs from about 3 to 4.4. So I'd do this:

Code:

margins, at(logage = (3(0.2)4.4))
marginsplot
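
To put a number on how much the quadratic term moves the predicted probabilities, something along these lines might help. It is only a sketch: p_quad, p_lin and p_diff are variable names made up for the illustration, and -predict, pu0- gives the probability with the random effect set to zero, which is just one of several possible predictions after -xtprobit-.

Code:

* quadratic model
xtprobit vote c.logage c.logage#c.logage, re vce(robust)
predict double p_quad, pu0
* linear-only model for comparison
xtprobit vote c.logage, re vce(robust)
predict double p_lin, pu0
* how far apart are the two sets of predictions?
generate double p_diff = p_quad - p_lin
summarize p_diff, detail
twoway (line p_quad logage, sort) (line p_lin logage, sort)

If p_diff stays negligible across the whole range of logage, the quadratic term is adding little of substantive importance; if the two curves visibly part ways, it is earning its place.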
Rose Simmons

Join Date: Feb 2017

Posts: 114
#3

05 Mar 2017, 16:02

Dear Clyde,

Thank you very much for your response. You have helped me to understand why the test command that I used was not appropriate.

I initially took logs of age, however, I then plotted logage and realised that it is actually better as just age. The age variable was already normally distributed, and in fact taking logs skewed it. Apologies for this, I have edited my original post to now read as "age" and not "logage".

My age variable ranges from 24 to 91.

Plotting vote against age gives me the following:

Image 1 (linear line fitted):

Image 2 (quadratic line fitted):

I have run my regression in Stata:

Code:

xtprobit vote c.age c.age#c.age, re vce(robust)

I then calculated the nlcom value. Sorry, what did you mean by:

If that value falls squarely within the range of logage that occurs in your data, then you do indeed have a relationship that is U- or inverted U- shaped

The nlcom value is 42.7. This is within the range of age values, so does this mean that I potentially have a U- or inverted U-shaped relationship?

Code:

. nlcom -_b[age]/(2*_b[c.age#c.age])

       _nl_1:  -_b[age]/(2*_b[c.age#c.age])

------------------------------------------------------------------------------
      saving |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _nl_1 |   42.66536    16.5475     2.58   0.010     10.23285    75.09786
------------------------------------------------------------------------------
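
A quick way to set that value against the data, sketched on the assumption that the -xtprobit- results are still the active estimates in e():

Code:

* observed range of age in the estimation sample
quietly summarize age if e(sample)
display "age runs from " r(min) " to " r(max)
* point estimate of the turning point and sign of the quadratic coefficient
display "turning point at age " (-_b[age]/(2*_b[c.age#c.age]))
display "coefficient on age squared: " (_b[c.age#c.age])

A negative coefficient on the squared term would point to an inverted U, a positive one to a U.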

As per your recommendation, I tried:

Code:

margins, at(age = (24(2)91))
marginsplot

The result was the attached picture, which I believe appears to be quadratic - would you agree?

Image 3 (marginsplot):

Thank you

Last edited by Rose Simmons; 05 Mar 2017, 16:07.
Clyde Schechter

Join Date: Apr 2014

Posts: 30096
#4

05 Mar 2017, 16:18

Yes, all of the above looks like a quadratic model is a good choice.
Rose Simmons

Join Date: Feb 2017

Posts: 114
#5

06 Mar 2017, 08:38

Hi Clyde,

Thank you for your help yesterday.

Today, I installed the latest Stata update (I am now using Stata/SE 14.2).
When I run the same code that I did yesterday, I am now getting error messages:

Code:

. nlcom -_b[age]/(2*_b[c.age#c.age])
[age] not found
r(111);

. margins, at(age = (24(2)91))
variable 'age' not found in list of covariates
r(322);

.
. marginsplot
previous command was not margins
r(301);

Page 14 of the link in the Stata Manual alludes to this problem: http://www.stata.com/manuals13/rmargins.pdf

However, I am unsure of how to resolve this, because I thought the specification in the second line of code would have helped to avoid the error.

Thank you
Clyde Schechter

Join Date: Apr 2014

Posts: 30096
#6

06 Mar 2017, 11:50

Did you run your -xtprobit vote c.age c.age#c.age, re vce(robust)- command before running these other commands? -nlcom- and -margins- are postestimation commands and can only be run after a regression command has run and posted its estimates in e(). And -marginsplot- runs only immediately after a successful run of -margins-. Did you perhaps use logage instead of age in the -xtprobit- command? If so, you need to rerun it with -age- (or change these other commands to refer to logage instead of age.) Or perhaps you ran some command between the -xtprobit- command and these that has overwritten the -xtprobit- results in e()?
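
If something between the -xtprobit- command and the postestimation commands might overwrite e(), one way to guard against it is to store the estimates and restore them before continuing. A minimal sketch, where the name quad is just an arbitrary label chosen for illustration:

Code:

xtprobit vote c.age c.age#c.age, re vce(robust)
estimates store quad
* ... any other estimation commands run here will replace the contents of e() ...
estimates restore quad
nlcom -_b[age]/(2*_b[c.age#c.age])
margins, at(age = (24(2)91))
marginsplot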
Rose Simmons

Join Date: Feb 2017

Posts: 114
#7

06 Mar 2017, 12:00

Clyde Schechter - I accidentally ran my entire regression instead of just -xtprobit vote c.age c.age#c.age, re vce(robust)- and this caused the error. Thanks very much for helping me to identify this, and apologies for this basic error.

I also wanted to ask you, in Image 3 (marginsplot) which I posted in #3, how can I interpret the lengths of the lines/ranges - for example, for the youngest and oldest respondents, the lines appear considerably longer than the middle-aged respondents. Does this have a meaning?

Also, the nlcom value 42.7 is within the range of age values [24,91], but it is not "squarely" in the middle of this range. Is it still correct to interpret this as a U- or inverted U- shaped relationship?

Many thanks

Last edited by Rose Simmons; 06 Mar 2017, 12:03.
Clyde Schechter

Join Date: Apr 2014

Posts: 30096
#8

06 Mar 2017, 12:15

I also wanted to ask you, in Image 3 (marginsplot) which I posted in #3, how can I interpret the lengths of the lines/ranges - for example, for the youngest and oldest respondents, the lines appear considerably longer than the middle-aged respondents. Does this have a meaning?

Yes. The probability of voting is estimated by this model with less certainty for the youngest and oldest respondents.
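
To see this more directly, a sketch along these lines may help (it assumes the -xtprobit- results are still in e(), and the plotting options are only one choice of presentation). The histogram shows how the observations are spread over age, which is worth checking because sparse data at the extremes would be one possible reason for wider intervals there.

Code:

margins, at(age = (24(2)91))
* draw the confidence intervals as a shaded band around the fitted curve
marginsplot, recast(line) recastci(rarea)
* distribution of age in the estimation sample
histogram age if e(sample), frequency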

Also, the nlcom value 42.7 is within the range of age values [24,91], but it is not "squarely" in the middle of this range. Is it still correct to interpret this as a U- or inverted U- shaped relationship?

Maybe we just use the term "squarely in the middle" differently. To me, 42.7 is squarely in the middle of a 20-80 range.
Rose Simmons

Join Date: Feb 2017

Posts: 114
#9

06 Mar 2017, 12:36

Yes. The probability of voting is estimated by this model with less certainty for the youngest and oldest respondents.

Thank you for this interpretation, I understand this better now as it is related to the confidence intervals.

Maybe we just use the term "squarely in the middle" differently. To me, 42.7 is squarely in the middle of a 20-80 range.

I interpreted "squarely in the middle" to mean exactly in the centre - do you interpret it as roughly/approximately in the centre?
Simon Schillebeeckx

Join Date: Jun 2017

Posts: 26
#10

10 Apr 2018, 04:56

In addition to everything Clyde said, there is an easy test that you can download, developed by Lind & Mehlum (2010), "With or Without U".
You can find it in Stata by simply using

Code:

findit utest

It does pretty much exactly what Clyde suggested (i.e. it determines whether there is a turning point within the data range), and it gives some additional information about "how sure" you can be that there is indeed a U-shape.
Note, however, that this test does not let you use the factor-variable interaction notation commonly used in Stata.
So your code would be (after installation)

Code:

gen age2 = age*age
xtprobit vote age age2, re vce(robust)
utest age age2, prefix(vote)

Of course, if you want to use margins afterward you have to go back to other notation for interactions, else Stata will not "know" that age and age2 are based on the same variable.
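
For example, a sketch of that second step, refitting with factor-variable notation before calling -margins- (the range 24 to 91 is the age range reported in #3):

Code:

* refit with factor-variable notation so that margins treats age and age squared as one variable
xtprobit vote c.age c.age#c.age, re vce(robust)
margins, at(age = (24(2)91))
marginsplot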
Aye Aye Khaine

Join Date: Jan 2019

Posts: 41
#11

03 Apr 2019, 16:23

Hello all,

This is my regression:

regress zlen i.hlth_hyg_nut2 c.agemo c.agemo#c.agemo i.caregiver_edu3 i.female i.improveddrinksource ///
i.fourincomecat i.maternitycash2 i.otherintervention2 c.totalhhwithoutchild2 i.ecologicalzone ///
if zlen_flag==0 & agemo>5.99

I tried the following and got an error message. Can you please help me see what went wrong?

nlcom -_b[agemo]/(2*_b[c.agemo#c.agemo])//

_nl_1: -_b[agemo]/(2*_b[c.agemo#c.agemo])//

. nlcom -_b[agemo]/(2*_b[c.agemo#c.agemo])//

invalid syntax
r(198);
Aye Aye Khaine

Join Date: Jan 2019

Posts: 41
#12

03 Apr 2019, 16:24

Hello,

How can I do a test if my dependent variable is a quantitative (integer-valued, continuous) variable rather than a categorical one?

I got an error message.
Clyde Schechter

Join Date: Apr 2014

Posts: 30096
#13

03 Apr 2019, 16:38

Why do you have // at the end of the command? The purpose of // is to set off the rest of the line as a comment--but you don't have any comment there. So get rid of that. If you really do plan to put a comment there, then fine, but in that case you must put a blank space between the ) and the //.
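
For illustration, a version with an actual comment would look something like the following in a do-file (the comment text itself is just an example; as far as I know, // comments are meant for do-files rather than interactive use):

Code:

nlcom -_b[agemo]/(2*_b[c.agemo#c.agemo]) // turning point of the quadratic in agemo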
Clyde Schechter

Join Date: Apr 2014

Posts: 30096
#14

03 Apr 2019, 16:41

Re #12: This question is unrelated to the content of this thread. Please repost as a New Topic.

It is important to keep threads on topic: people sometimes search for specific topics or keywords in the title. By including off-topic posts, you waste the time of people who come looking for that topic. And, if somebody in the future has a question similar to yours and tries to search for it, they won't find it.
Aye Aye Khaine

Join Date: Jan 2019

Posts: 41
#15

03 Apr 2019, 16:41

Hi Clyde,
Thank you. I keep making such basic mistakes.
Thanks for catching that!
