Issues interpreting statistical significance of models with a variable and models with the quadratic form of the same variable

Rui Agostinho

Join Date: Apr 2019
Posts: 24

25 Feb 2022, 10:41

Hello everyone,

I have a project in which I am trying to understand how the probability of someone becoming an entrepreneur in an industry is related to a series of variables. To do so, I am using a fixed effects regression, and have constructed a few models to be able to interpret the results.

One of the variables which I am interested in analyzing is the median age of the industries. I have models that include the median age and a collection of other variables, and models that include the median age and its squared term, and the same collection of other variables.

The code itself is as follows:

Model1:
xtreg change_to_employer age_median log_nemp_median gender numb_firms higher_education i.year high_tech low_tech KIS Other, fe cluster(caem2)

and

Model2:
xtreg change_to_employer c.age_median##c.age_median log_nemp_median gender numb_firms higher_education i.year high_tech low_tech KIS Other , fe cluster(caem2)

Model 1:

Code:

xtreg change_to_empregador age_median   log_nemp_median gender numb_firms_div1000 higher_education vn_per_employee_median i.year high_tech low_tech KIS Other, fe cluster(caem2)

Model 2:

Code:

xtreg change_to_empregador c.age_median##c.age_median   log_nemp_median gender  numb_firms_div1000  higher_education vn_per_employee_median i.year high_tech low_tech KIS Other, fe cluster(caem2)

The issue I am having is that the coefficient for age_median in Model 1 is not significant, whereas the coefficients for both age_median and c.age_median#c.age_median are significant in Model 2, as shown here:

Model1:

Code:

Fixed-effects (within) regression               Number of obs     =        889
Group variable: caem2                           Number of groups  =         77

R-sq:                                           Obs per group:
     within  = 0.1832                                         min =          3
     between = 0.0859                                         avg =       11.5
     overall = 0.0853                                         max =         12

                                                F(24,76)          =       5.26
corr(u_i, Xb)  = -0.2660                        Prob > F          =     0.0000

                                             (Std. Err. adjusted for 77 clusters in caem2)
------------------------------------------------------------------------------------------
                         |               Robust
Change_to_empregador_f~e |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------------------+----------------------------------------------------------------
              age_median |   -.001125   .0028954    -0.39   0.699    -.0068918    .0046417
         log_nemp_median |  -.0015375   .0168405    -0.09   0.927    -.0350782    .0320033
                  gender |   .0010117   .0016409     0.62   0.539    -.0022565    .0042798
      numb_firms_div1000 |  -.0086784   .0050413    -1.72   0.089     -.018719    .0013621
        higher_education |   .0009512   .0012174     0.78   0.437    -.0014735    .0033758
  vn_per_employee_median |  -.0005548    .000214    -2.59   0.011     -.000981   -.0001286

Model2:

Code:

Fixed-effects (within) regression               Number of obs     =        889
Group variable: caem2                           Number of groups  =         77

R-sq:                                           Obs per group:
     within  = 0.3109                                         min =          3
     between = 0.1165                                         avg =       11.5
     overall = 0.1408                                         max =         12

                                                F(27,76)          =       7.08
corr(u_i, Xb)  = -0.1859                        Prob > F          =     0.0000

                                                                          (Std. Err. adjusted for 77 clusters in caem2)
-----------------------------------------------------------------------------------------------------------------------
                                                      |               Robust
                                 change_to_empregador |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
------------------------------------------------------+----------------------------------------------------------------
                                           age_median |    .136472   .0677635     2.01   0.048     .0015094    .2714347
                                                      |
                            c.age_median#c.age_median |   -.001687   .0008374    -2.01   0.047    -.0033548   -.0000191
                                                      |
                                      log_nemp_median |  -.0891614   .0408428    -2.18   0.032    -.1705069   -.0078159
                                               gender |  -.0019755   .0036136    -0.55   0.586    -.0091726    .0052216
                                   numb_firms_div1000 |  -.0235788   .0107627    -2.19   0.032    -.0450145    -.002143
                                     higher_education |   .0012232   .0029379     0.42   0.678    -.0046282    .0070745
                               vn_per_employee_median |  -.0013087   .0004205    -3.11   0.003    -.0021462   -.0004713
                                                      |
                                                 year |

How is it possible that a variable is not significant by itself, but then becomes significant when it is included together with its quadratic term? Can I then say that the median age of the industries has a significant impact on the probability of transition into entrepreneurship?

Thank you very much,
Rui

Last edited by Rui Agostinho; 25 Feb 2022, 11:30.

Carlo Lazzaro

Join Date: Apr 2014

Posts: 17854
#2

25 Feb 2022, 11:06

Rui:
different models give different results: no wonder about that.
As you forgot to follow the FAQ (which recommends sharing what you typed and what Stata gave you back via CODE delimiters), interested listers can only give general advice.
In your case, as per your description (and with the caveat that words are not numbers), -age_median- has a quadratic relationship with the regressand.

Kind regards,
Carlo
(Stata 19.0)
Rui Agostinho

Join Date: Apr 2019

Posts: 24
#3

25 Feb 2022, 11:35

Carlo:

Thank you very much for your reply. I have applied the changes you proposed to my question. Hopefully it is clearer now.

My issue still remains though: what is the interpretation of the coefficient of age_median in Model 1 not being significant, but then, when looking at Model 2, both age_median and its quadratic term being significant? Is it as simple as age not having a linear relationship with the regressand, but having a quadratic relationship instead?

Thank you,
Rui
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17854
#4

25 Feb 2022, 11:49

Rui:
1) the second regression code gives a higher within R-sq, which suggests that it is better specified than the first one;
2) in regression code #2 other coefficients change in terms of statistical significance, too;
3) in regression code #2 both the linear and squared terms for -age_median- are only barely statistically significant.
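
Because the linear and squared terms describe a single curve, their individual p-values are hard to read one at a time. A minimal sketch of three common follow-ups (a joint test, slopes at chosen ages, and the estimated turning point) is given below. It runs on simulated data, not on Rui's data: the variable names id, year, age, y and u and all numeric values are assumptions made up for illustration.

```stata
* Simulated stand-in for an industry panel (all names and values are hypothetical)
clear
set seed 2022
set obs 77                          // 77 groups, mirroring the cluster count in the thread
gen id = _n
gen u = rnormal()                   // group-level fixed effect
expand 12                           // 12 yearly observations per group
bysort id: gen year = 2009 + _n
gen age = 30 + 20*runiform()        // "median age" between 30 and 50
gen y = 0.14*age - 0.0017*age^2 + u + rnormal(0, 0.05)

xtset id year
xtreg y c.age##c.age i.year, fe cluster(id)

* Joint test of the linear and squared terms together
test age c.age#c.age

* Slope of y with respect to age at several ages, with standard errors
margins, dydx(age) at(age = (30(5)50))

* Estimated turning point -b1/(2*b2), with a delta-method confidence interval
nlcom -_b[age] / (2*_b[c.age#c.age])
```

After Model 2 the analogous commands would presumably use age_median in place of age, for example -test age_median c.age_median#c.age_median-.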

Kind regards,
Carlo
(Stata 19.0)
Clyde Schechter

Join Date: Apr 2014

Posts: 30357
#5

25 Feb 2022, 12:05

First, even for people who take the concept of statistical significance seriously (in fact, especially for them) it is important to bear in mind that the difference between statistically significant and not statistically significant is, itself, not statistically significant, nor even meaningful in any way. You should never draw any conclusions from one thing being statistically significant and another not.

That said, the quadratic relationship you are talking about is precisely the kind of situation where what you describe can and should happen. Run this code:

Code:

clear*
set obs 101
set seed 1234
gen x = _n-1
gen y = 2*(_n-50)^2 + rnormal()
graph twoway scatter y x || lfit y x
regress y x
regress y c.x##c.x

Look at the graph before you read the regression outputs. You can see that, due to the U-shaped relationship between y and x, the best fitting straight line is more or less horizontal. Correspondingly, in the regression without the quadratic term, the coefficient of x is small and imprecisely estimated. By contrast, the quadratic regression effectively captures the U-shaped relationship. The coefficient of the quadratic term is a measure of the width (or narrowness) of the U-shape, and the linear term is basically a (scaled) indication of where the vertex of the parabola lies. Since the parabola is fairly steep, it has a relatively large quadratic coefficient (relative to the scaling of x², which here ranges from 0 to 10,000).
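
The remark that the linear term locates the vertex can be made concrete with a little arithmetic. The worked equations below are a sketch only: the symbols b_0, b_1, b_2 and x* are notation introduced here, and the Model 2 numbers are simply the point estimates reported in #1, used without regard to their uncertainty.

```latex
% General quadratic fit, its slope, and its turning point
\[
\hat{y} = b_0 + b_1 x + b_2 x^2,
\qquad
\frac{d\hat{y}}{dx} = b_1 + 2 b_2 x,
\qquad
x^{*} = -\frac{b_1}{2 b_2}
\]
% Simulated example: x = _n - 1, so (_n - 50) = (x - 49)
\[
y = 2(x - 49)^2 = 2x^2 - 196x + 4802
\quad\Longrightarrow\quad
x^{*} = \frac{196}{2 \cdot 2} = 49
\]
% Model 2 point estimates from #1: b_1 = 0.136472, b_2 = -0.001687
\[
x^{*} = -\frac{0.136472}{2 \times (-0.001687)} \approx 40.4,
\qquad
\left.\frac{d\hat{y}}{dx}\right|_{x = 35} \approx 0.018,
\qquad
\left.\frac{d\hat{y}}{dx}\right|_{x = 45} \approx -0.015
\]
```

Because the squared term in Model 2 is negative, the fitted curve is an inverted U: the estimated slope is positive below a median age of roughly 40 and negative above it. Whether that turning point lies inside the observed range of age_median is not shown in the thread, so this reading is tentative.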

Added: crossed with #4.