
  • Rose Simmons
    started a topic Testing whether to include a squared term


    Hi,

    I am using a panel dataset.
    vote is my dependent variable: 1 if the respondent voted in an annual leadership election, and 0 otherwise (so I am using nonlinear methods).
    My independent variables include marital status, gender, age, etc.

    I then run my regression with only age and age^2 as control variables:

    Code:
    xtprobit vote c.age c.age#c.age, re vce(robust)
    I then conduct the test to see whether age^2 should be included, because I suspect there may be a U-shaped or inverse U-shaped relationship with voting (e.g. very young and very old people may be more or less likely to vote than middle-aged people, in a non-linear relationship).

    Code:
    test age c.age#c.age
    
     ( 1)  [vote]age= 0
     ( 2)  [vote]c.age#c.age= 0
    
               chi2(  2) =    4.34
             Prob > chi2 =    0.1141
    Given this result, does it suggest that age^2 is insignificant, and that perhaps I should include only age?

    I believe this is the appropriate test of the significance of the squared term, but please advise me if I'm mistaken.
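    As a sanity check on the Wald test output above, the reported p-value can be reproduced by hand: for a chi-squared distribution with 2 degrees of freedom, the survival function has the closed form exp(-x/2). A short Python sketch (outside Stata, purely illustrative):

```python
import math

# Joint Wald test of H0: age = 0 and age^2 = 0 reported chi2(2) = 4.34.
# For 2 degrees of freedom, P(X > x) = exp(-x/2), so the p-value is easy
# to reproduce by hand.
chi2_stat = 4.34
p_value = math.exp(-chi2_stat / 2)
print(round(p_value, 4))  # ~0.1142, matching the reported 0.1141 up to rounding
```

    Note that -test age c.age#c.age- is a joint test of both age terms, so a non-rejection speaks to dropping age and age^2 together, not the squared term alone.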

    Thank you
    Last edited by Rose Simmons; 05 Mar 2017, 15:15.

  • Jovana Ju
    replied
    Yes, you are right. Thank you so much.



  • Clyde Schechter
    replied
    No, I did ask for the coefficients of X1 in the linear (i.e. non-quadratic) models, which, as I understand it, is what you provided in #53.



  • Jovana Ju
    replied
    Clyde, I am so sorry. I think I didn't give the right answer: you asked me about the coefficients in front of the linear term, but I provided information for the linear model instead. So, the coefficients in front of the linear term are -0.0054 (95% confidence interval: -0.0095492 to -0.0013747) when the mentioned observation is included, and -0.0046 (95% confidence interval: -0.0151464 to 0.005874) when it is excluded.



  • Jovana Ju
    replied
    Thank you so much for your time, patience, and help!



  • Clyde Schechter
    replied
    That's a pretty large shift in the coefficient based on the inclusion of that one observation. So, in this case, I think you have to keep the quadratic model. But I do think that when you report your results you need to make explicit the caveat that the results are heavily influenced by that one outlier observation.



  • Jovana Ju
    replied
    Thank you, Clyde. In the first case (with this observation), the coefficient is -.0010967 (95% confidence interval -.0022103 to .000017), and in the second case, it is -.0039147 (95% confidence interval -.0066853 to -.0011442).



  • Clyde Schechter
    replied
    Let's not focus on significance so much; it's a poor guide to model selection. How different are the coefficients in the linear (no quadratic term) models with and without that one observation? And what are the 95% confidence intervals around those coefficients?



  • Jovana Ju
    replied
    Clyde, when the model is fit only to the observations to the right of the parabola's vertex, its shape changes and the linear model becomes significant. Initially, I didn't consider this a problem, since this is the downward-sloping part of the parabola and I expected the negative sign. Am I wrong? As for that observation, I am certain it is not a data error, but it could be described as an extreme value.



  • Clyde Schechter
    replied
    So, looking again at the graph of the quadratic, you can see that it actually gets pretty steep on the right side of the plotted range. You state that your values of X are grouped in a low and narrow range, and if the linear version of the model comes out with a coefficient that is close to zero, that range must be fairly close to the apex of the parabola. In a situation like that, you can definitely end up with an estimate of the linear slope that is close to zero, because the upward slope to the left of the apex and the downward slope to the right cancel each other out. So in that situation, I would retain the quadratic model.

    But let me throw in one other caution here. You mentioned that there is only one observation in the data set with an X value to the left of the parabola's vertex. Given the upslope/downslope cancellation I referred to in the previous paragraph, this makes me wonder whether that one observation is a point of high leverage in the regression, since the rest of the data lies on the downsloping side of the curve. What happens to the quadratic model if you exclude that one observation? If the model changes substantially, then your results hinge critically on a single data point. In that case, you should double- and triple-check that the data point isn't just an error. And if you can confirm that it is, in fact, valid, then when reporting your results you should mention that this one high-leverage observation strongly influences the findings.
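    The cancellation Clyde describes can be seen in a toy example (pure-Python sketch with made-up data, not the poster's): fitting a straight line to points spread symmetrically around a parabola's apex yields an OLS slope of exactly zero, despite strong curvature.

```python
# Toy data (not the actual data set): an inverted parabola sampled
# symmetrically around its apex at x = 0.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [-(x ** 2) for x in xs]

# Closed-form OLS slope: cov(x, y) / var(x).
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
print(slope)  # 0.0: the upslope on the left and downslope on the right cancel
```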



  • Jovana Ju
    replied
    Clyde, thank you very much for your suggestion. The values of X are grouped within a low and narrow range; I have already raised the issue of data variability elsewhere. If we assume the range is not wide enough to model quadratic effects, a linear approximation naturally emerges as an alternative, but it is not statistically significant; that is, all tests indicate that the model with quadratic effects is better.

    Regarding your last sentence: I came closest to describing the relationship as nonlinear, without emphasizing the inverted U-shape, for the reasons mentioned.



  • Clyde Schechter
    replied
    It depends.

    If you run -graph twoway function y= -.0085499 -.0054619*x -.0001648*x*x, range(-20 20)- you will see what the quadratic function of X1 in your model looks like over a range from a bit to the left of the vertex to far to the right of it. You can see that it does not really represent an inverted-U shape in any meaningful sense of the term. Rather, the quadratic term is picking up curvilinearity in the relationship. Notice that, as with any quadratic, if you focus on a narrow range of X1 values, the relationship is very close to linear. So the real question is whether the range of X values in your data is wide enough that the non-linearity is large enough to be worth modeling. I would do something like plot a histogram of the X1 values in your data and then see whether they are concentrated in a narrow enough area that a linear approximation to the model is satisfactory, or spread broadly enough that you really need to express the non-linearity in your model.

    But I will say that if you do end up retaining the quadratic representation, I would not refer to it as showing an inverse-U relationship: clearly the left half of that supposed inverse-U is effectively missing.
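    Clyde's point that any quadratic is nearly linear over a narrow window can be checked numerically with the posted coefficients (a Python sketch; the window [0, 5] is hypothetical, since the actual range of X1 isn't shown):

```python
# Fitted quadratic from the posted model (constant, X1, and X1sq coefficients).
def quad(x):
    return -0.0085499 - 0.0054619 * x - 0.0001648 * x * x

lo, hi = 0.0, 5.0  # hypothetical narrow window of X1 values

# Chord (straight line) through the endpoints of the window.
slope = (quad(hi) - quad(lo)) / (hi - lo)

def chord(x):
    return quad(lo) + slope * (x - lo)

# Largest departure of the quadratic from the straight line over the window.
max_gap = max(abs(quad(i / 10) - chord(i / 10)) for i in range(51))
print(round(max_gap, 5))  # ~0.00103: essentially linear over this window
```

    Whether a departure of about 0.001 in Y is negligible depends on the scale of Y, which is exactly why the histogram check matters.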



  • Jovana Ju
    replied
    Hello everyone! I previously wrote regarding the issue of whether to include quadratic effects in my model, but I am still facing the same problem and I kindly ask for your help based on the output I have from Stata. I am working with panel data, where T is 6 and N is 98.


    Code:
    xtreg Y X1 X1sq X2 X2sq X3 X4 X5 X6 X7, fe robust
    
    Fixed-effects (within) regression               Number of obs     =        582
    Group variable: id                              Number of groups  =         98
    
    R-squared:                                      Obs per group:
         Within  = 0.1182                                         min =          1
         Between = 0.1050                                         avg =        5.9
         Overall = 0.0921                                         max =          6
    
                                                    F(9, 97)          =  913273.12
    corr(u_i, Xb) = 0.0783                          Prob > F          =     0.0000
    
                                        (Std. err. adjusted for 98 clusters in id)
    ------------------------------------------------------------------------------
                 |               Robust
               Y | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
              X1 |  -.0054619   .0020594    -2.65   0.009    -.0095492   -.0013747
            X1sq |  -.0001648   .0000562    -2.93   0.004    -.0002765   -.0000532
              X2 |  -.0056495   .0002735   -20.66   0.000    -.0061923   -.0051068
            X2sq |  -.0000362   1.56e-06   -23.20   0.000    -.0000393   -.0000331
              X3 |   .0027445   .0185407     0.15   0.883    -.0340537    .0395427
              X4 |   .1119635   .0398111     2.81   0.006     .0329495    .1909775
              X5 |   .0025336   .0021159     1.20   0.234    -.0016659    .0067332
              X6 |  -.0825474   .0398469    -2.07   0.041    -.1616326   -.0034623
              X7 |   .0006929   .0004353     1.59   0.115    -.0001712    .0015569
           _cons |  -.0085499   .4149904    -0.02   0.984     -.832191    .8150912
    -------------+----------------------------------------------------------------
         sigma_u |  .06728514
         sigma_e |  .06179892
             rho |  .54242449   (fraction of variance due to u_i)
    ------------------------------------------------------------------------------
    The estimated model shows an inverted U-shaped relationship between X1 and Y, with the maximum of Y reached at X1 = -16.56 (I used -utest- in Stata). What concerns me is that only one value of X1 in my sample lies to the left of this turning point. Should I keep the quadratic term?
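    The turning point is just the vertex of the fitted quadratic, -b/(2a), and can be verified directly from the coefficient table (a Python sketch, illustrative only):

```python
# Y = ... + b*X1 + a*X1^2 peaks where dY/dX1 = b + 2*a*X1 = 0.
b = -0.0054619  # coefficient on X1
a = -0.0001648  # coefficient on X1sq
vertex = -b / (2 * a)
print(round(vertex, 2))  # -16.57, close to the reported -16.56 (table values are rounded)
```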



  • Jovana Ju
    replied
    Thank you very much for your comments and suggestions!

    Sincerely,

    Jovana



  • Nick Cox
    replied
    In ecology it's common that abundance is highest where organisms are happiest (evidently a term of art in gardening), corresponding to ideal temperature, moisture, salinity, nutrient supply, or whatever, although competition, predation and other effects may be at work too.

    A standard model for this phenomenon is the so-called Gaussian logit, a combination of a quadratic in one predictor and a logit link. Here is a simple graph to give a flavour:

    Code:
    twoway function invlogit(0.01 * (x - 5) - (x - 5)^2), ra(0 10)
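    The same curve can be sketched outside Stata; this pure-Python version (illustrative only) locates the peak of the bell, which sits just right of x = 5 because of the small linear term:

```python
import math

def invlogit(z):
    return 1.0 / (1.0 + math.exp(-z))

def gaussian_logit(x):
    # Same linear predictor as the -twoway function- call above.
    return invlogit(0.01 * (x - 5) - (x - 5) ** 2)

# Locate the peak on a fine grid over [0, 10].
grid = [i / 1000 for i in range(10001)]
peak_x = max(grid, key=gaussian_logit)
print(peak_x)  # 5.005, where the quadratic 0.01*(x - 5) - (x - 5)^2 is maximised
```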
    [Graph: Gaussian_logit.png]



    As Jeff Wooldridge implies, this model doesn't require that the entire shape be present. Indeed, for organisms that thrive at some environmental extreme, only one limb of the bell is needed.

    I'd be interested to know how far this model is used outside ecology, in epidemiology, economics, eschatology, campanology, or anywhere else.

