  • Testing whether to include a squared term

    Hi,

    I am using a panel dataset.
    vote is my dependent variable: 1 if the respondent voted in an annual leadership election, and 0 otherwise (so I am using nonlinear methods).
    My independent variables include marital status, gender, age etc.

    I then run my regression with only age and age^2 as control variables:

    Code:
    xtprobit vote c.age c.age#c.age, re vce(robust)
    I then conduct the test to see whether age^2 should be included, because I suspect there may be a U-shaped or inverse U-shaped relationship with voting (e.g. very young and very old people may be more or less likely to vote than middle-aged people, in a non-linear relationship).

    Code:
    test age c.age#c.age
    
     ( 1)  [vote]age= 0
     ( 2)  [vote]c.age#c.age= 0
    
               chi2(  2) =    4.34
             Prob > chi2 =    0.1141
    Does this result suggest that age^2 is insignificant, and that perhaps I should only include age?

    I believe this is the appropriate test of the significance of the squared term, but please advise me if I'm mistaken.

    Thank you
    Last edited by Rose Simmons; 05 Mar 2017, 15:15.

  • #2
    Well, the test you did is actually a test of the joint significance of logage and logage squared. The result, were you to rely on it, suggests that both terms together fail to make a "statistically significant" contribution to the model. It does not provide a test of the inclusion of the quadratic term per se. If you believe in this kind of testing, you would drop both logage and logage squared from your model. That would leave you with a model that has no representation of age at all--which strikes me as not credible at all.
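
    To make the distinction concrete, here is a minimal sketch on simulated data. The data, the seed, and the coefficient values are all made up for illustration; only the variable names vote and age follow post #1. It only shows what each -test- command tests and is not a recommendation to select variables this way.

```stata
* Simulated panel, for illustration only (not the poster's data)
clear
set seed 12345
set obs 200
gen id = _n
gen u = rnormal(0, 0.5)               // person-level random effect
gen age0 = runiformint(24, 87)        // age at the first wave
expand 5                              // 5 waves per person
bysort id: gen wave = _n
gen age = age0 + wave - 1             // age runs from 24 to at most 91
gen vote = (-2 + 0.09*age - 0.001*age^2 + u + rnormal()) > 0
xtset id wave

xtprobit vote c.age c.age#c.age, re vce(robust)

* Joint test: H0 is that BOTH coefficients are zero (2 degrees of freedom)
test age c.age#c.age

* Test of the quadratic term alone (1 degree of freedom)
test c.age#c.age
```

    The first -test- reports chi2(2) and the second chi2(1), which is the quickest way to see which hypothesis was actually tested.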

    But, in any case, I am never one to endorse selecting model variables on the basis of significance tests. I think that's just about the worst way to do it. You never explained why you log-transformed age in the first place. I'll assume you had a good reason, take it as a given, and won't discuss that. Your reason for trying a quadratic model is that you thought there might be a U- or inverted U-shaped relationship. Determining whether that is true really should not be done based on tests of statistical significance. The way to do that is to look at the coefficients of logage and logage#logage. In particular, the axis of symmetry of the parabola will be at logage = -_b[logage]/(2*_b[c.logage#c.logage]). So you should calculate that value, using -nlcom-. If that value falls squarely within the range of logage that occurs in your data, then you do indeed have a relationship that is U- or inverted U- shaped (depending on the sign of the quadratic coefficient). This is true whether or not either of these coefficients is "significant" and whether or not there is joint "significance." And in this circumstance you should keep the quadratic term in the model.
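
    For readers who want to see where that expression comes from, here is a short derivation. It is a sketch that assumes the only terms involving x (logage here) are the linear and quadratic ones, and it sets aside the random effect and any other covariates. Because the normal CDF is strictly increasing, the predicted probability turns around exactly where the linear index does.

```latex
\begin{align*}
\Pr(\text{vote}=1 \mid x) &= \Phi\!\left(\beta_0 + \beta_1 x + \beta_2 x^2\right), \\
\frac{d}{dx}\left(\beta_0 + \beta_1 x + \beta_2 x^2\right) &= \beta_1 + 2\beta_2 x = 0
\quad\Longrightarrow\quad x^{*} = -\frac{\beta_1}{2\beta_2}.
\end{align*}
```

    The turning point is a peak (inverted U) when \beta_2 < 0 and a nadir (U) when \beta_2 > 0.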

    If, however, the axis of symmetry lies clearly outside the range of logage in your data, then within the range of your data the "parabola" is a curvilinear function, but it does not reach a peak or nadir and turn around. So you have a curve with a declining marginal effect, or a curve with an increasing marginal effect, but no vertex within the data. You might want to explore graphically* to see if the departure from linearity is large enough to matter from a real-world substantive perspective. If the effect of the quadratic term is to just add something like rounding error to the predicted probabilities, you'd probably just want to drop it.

    The hard case is where the axis of symmetry lies near the boundary of one side of the range of your data. In that case you "sort of" have a vertex, but given measurement error and other issues, it may not be of practical importance. Deciding what to do there requires a lot of judgment. But again, the judgment should be based on how much the quadratic term contributes to the predicted probabilities, not some kind of significance test. And again, a graph will be quite informative.
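
    One way to check which of these three cases applies is to put the estimated turning point next to the observed range of the variable. The sketch below does this on the auto dataset that ships with Stata, using a linear regression of mpg on weight as a stand-in for the model in this thread; the names V and vertex are just illustrative choices.

```stata
* Illustration on Stata's shipped auto dataset (not the thread's data)
sysuse auto, clear
regress mpg c.weight c.weight#c.weight

* Estimated axis of symmetry, with a delta-method confidence interval
nlcom (vertex: -_b[weight]/(2*_b[c.weight#c.weight]))
matrix V = r(b)
scalar vertex = V[1,1]

* Observed range of the regressor in the estimation sample
summarize weight if e(sample), meanonly
display "turning point = " vertex ", observed range = [" r(min) ", " r(max) "]"
display cond(inrange(vertex, r(min), r(max)), ///
    "inside the observed range", "outside the observed range")
```

    The inside/outside message is only a first cut: when the point lies near an edge of the range, the graph and the size of the change in predicted values still have to drive the judgment.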

    *To see what's going on here I recommend selecting a bunch of values of logage that span the range in your data. For the sake of illustration, I'll say that the ages in your data range from 20 through 80, so log(age) runs from about 3 to 4.4. So I'd do this:
    Code:
    margins, at(logage = (3(0.2)4.4))
    marginsplot


    • #3
      Dear Clyde,

      Thank you very much for your response. You have helped me to understand why the test command that I used was not appropriate.

      I initially took logs of age, however, I then plotted logage and realised that it is actually better as just age. The age variable was already normally distributed, and in fact taking logs skewed it. Apologies for this, I have edited my original post to now read as "age" and not "logage".

      My age variable ranges from 24 to 91.

      Plotting vote against age gives me the following:

      Image 1 (linear line fitted):
      [Image: graph 1.png]

      Image 2 (quadratic line fitted):
      [Image: graph 2.png]

      I have run my regression in Stata:
      Code:
      xtprobit vote c.age c.age#c.age, re vce(robust)
      I then calculated the nlcom value. Sorry, what did you mean by:
      If that value falls squarely within the range of logage that occurs in your data, then you do indeed have a relationship that is U- or inverted U- shaped
      The nlcom value is 42.7. This is within the range of age values, so does this mean that I potentially have a U- or inverted U- shaped relationship?

      Code:
      . nlcom -_b[age]/(2*_b[c.age#c.age])
      
             _nl_1:  -_b[age]/(2*_b[c.age#c.age])
      
      ------------------------------------------------------------------------------
            saving |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
      -------------+----------------------------------------------------------------
             _nl_1 |   42.66536    16.5475     2.58   0.010     10.23285    75.09786
      ------------------------------------------------------------------------------
      As per your recommendation, I tried:
      Code:
      margins, at(age = (24(2)91))
      marginsplot
      The result was the attached picture, which I believe appears to be quadratic - would you agree?

      Image 3 (marginsplot):
      [Image: vote age.png]

      Thank you
      Last edited by Rose Simmons; 05 Mar 2017, 16:07.


      • #4
        Yes, all of the above looks like a quadratic model is a good choice.


        • #5
          Hi Clyde,

          Thank you for your help yesterday.

          Today, I installed the latest Stata update (I am now using Stata/SE 14.2).
          When I run the same code that I did yesterday, I am now getting error messages:

          Code:
          . nlcom -_b[age]/(2*_b[c.age#c.age])
          
          [age] not found
          r(111);
          
          . margins, at(age = (24(2)91))
          variable 'age' not found in list of covariates
          r(322);
          
          . 
          . marginsplot
          previous command was not margins
          r(301);

          Page 14 of the link in the Stata Manual alludes to this problem: http://www.stata.com/manuals13/rmargins.pdf

          However, I am unsure of how to resolve this, because I thought the specification in the second line of code would have helped to avoid the error.

          Thank you


          • #6
            Did you run your -xtprobit vote c.age c.age#c.age, re vce(robust)- command before running these other commands? -nlcom- and -margins- are postestimation commands and can only be run after a regression command has run and posted its estimates in e(). And -marginsplot- runs only immediately after a successful run of -margins-. Did you perhaps use logage instead of age in the -xtprobit- command? If so, you need to rerun it with -age- (or change these other commands to refer to logage instead of age.) Or perhaps you ran some command between the -xtprobit- command and these that has overwritten the -xtprobit- results in e()?
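
            A minimal sketch of the overwriting problem, and one way to guard against it with -estimates store- and -estimates restore-. It uses the auto dataset that ships with Stata rather than the data in this thread, and the name quad is just an illustrative label.

```stata
* Illustration on Stata's shipped auto dataset (not the thread's data)
sysuse auto, clear
regress mpg c.weight c.weight#c.weight
estimates store quad                  // keep a named copy of these results

logit foreign mpg                     // any later estimation command overwrites e()
display "`e(cmd)'"                    // the active results are now from logit

* weight is not in the active model, so this fails with an error
capture noisily nlcom -_b[weight]/(2*_b[c.weight#c.weight])

estimates restore quad                // make the quadratic model active again
display "`e(cmd)'"                    // the active results are from regress again
nlcom -_b[weight]/(2*_b[c.weight#c.weight])
```

            Displaying e(cmd) or e(cmdline) just before a postestimation command is a quick way to confirm which model it will act on.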


            • #7
              Clyde Schechter - I accidentally ran my entire regression instead of just -xtprobit vote c.age c.age#c.age, re vce(robust)- and this caused the error. Thanks very much for helping me to identify this, and apologies for this basic error.

              I also wanted to ask you, in Image 3 (marginsplot) which I posted in #3, how can I interpret the lengths of the lines/ranges - for example, for the youngest and oldest respondents, the lines appear considerably longer than the middle-aged respondents. Does this have a meaning?

              Also, the nlcom value 42.7 is within the range of age values [24,91], but it is not "squarely" in the middle of this range. Is it still correct to interpret this as a U- or inverted U- shaped relationship?

              Many thanks
              Last edited by Rose Simmons; 06 Mar 2017, 12:03.


              • #8
                I also wanted to ask you, in Image 3 (marginsplot) which I posted in #3, how can I interpret the lengths of the lines/ranges - for example, for the youngest and oldest respondents, the lines appear considerably longer than the middle-aged respondents. Does this have a meaning?
                Yes. The probability of voting is estimated by this model with less certainty for the youngest and oldest respondents.

                Also, the nlcom value 42.7 is within the range of age values [24,91], but it is not "squarely" in the middle of this range. Is it still correct to interpret this as a U- or inverted U- shaped relationship?
                Maybe we just use the term "squarely in the middle" differently. To me, 42.7 is squarely in the middle of a 20-80 range.


                • #9
                  Yes. The probability of voting is estimated by this model with less certainty for the youngest and oldest respondents.
                  Thank you for this interpretation, I understand this better now as it is related to the confidence intervals.

                  Maybe we just use the term "squarely in the middle" differently. To me, 42.7 is squarely in the middle of a 20-80 range.
                  I interpreted "squarely in the middle" to mean exactly in the centre - do you interpret it as roughly/approximately in the centre?


                  • #10
                    In addition to everything Clyde said, there is an easy test that you can download, developed by Lind & Mehlum (2010), "With or Without U".
                    You can find it in Stata by simply using
                    Code:
                    findit utest
                    It does pretty much exactly what Clyde suggested (i.e., it determines whether there is a turning point within the data range), and it gives some additional information about "how sure" you can be that there is indeed a U-shape.
                    Note, however, that this test does not work with Stata's factor-variable interaction notation (such as c.age#c.age).
                    So your code would be (after installation)

                    Code:
                    gen age2 = age*age
                    xtprobit vote age age2, re vce(robust)
                    utest age age2, prefix(vote)
                    Of course, if you want to use margins afterward, you have to go back to the factor-variable notation (c.age#c.age) and rerun the model; otherwise Stata will not "know" that age and age2 are based on the same variable.


                    • #11
                      Hello all,

                      This is my regression:

                      regress zlen i.hlth_hyg_nut2 c.agemo c.agemo#c.agemo i.caregiver_edu3 i.female i.improveddrinksource ///
                      i.fourincomecat i.maternitycash2 i.otherintervention2 c.totalhhwithoutchild2 i.ecologicalzone ///
                      if zlen_flag==0 & agemo>5.99

                      I tried the following and got an error message. Can you please help me see what went wrong?

                      nlcom -_b[agemo]/(2*_b[c.agemo#c.agemo])//

                      _nl_1: -_b[agemo]/(2*_b[c.agemo#c.agemo])//


                      . nlcom -_b[agemo]/(2*_b[c.agemo#c.agemo])//

                      invalid syntax
                      r(198);


                      • #12
                        Hello,

                        How can I do this test if my dependent variable is a quantitative (integer-valued) variable rather than a categorical one?

                        I got an error message.


                        • #13
                          Why do you have // at the end of the command? The purpose of // is to set off the rest of the line as a comment--but you don't have any comment there. So get rid of that. If you really do plan to put a comment there, then fine, but in that case you must put a blank space between the ) and the //.
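
                          A short do-file sketch of the two forms that work. It uses the auto dataset that ships with Stata as a stand-in for the poster's data, and it assumes the commands are run from a do-file.

```stata
* Illustration on Stata's shipped auto dataset (not the poster's data)
sysuse auto, clear
regress mpg c.weight c.weight#c.weight

* Fine: no trailing comment at all
nlcom -_b[weight]/(2*_b[c.weight#c.weight])

* Fine: a blank space separates the command from the // comment
nlcom -_b[weight]/(2*_b[c.weight#c.weight]) // turning point of mpg in weight
matrix B = r(b)
```

                          Writing the // directly against the closing parenthesis, as in #11, makes it part of the expression that -nlcom- tries to parse, hence the "invalid syntax" error.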


                          • #14
                            Re #12: This question is unrelated to the content of this thread. Please repost as a New Topic.

                            It is important to keep threads on topic: people sometimes search for specific topics or keywords in the title. By including off-topic posts, you waste the time of people who come looking for that topic. And, if somebody in the future has a question similar to yours and tries to search for it, they won't find it.


                            • #15
                              Hi Clyde,
                              Thank you. I keep making such basic mistakes.
                              Thanks for catching that!
