Multivariate polynomial regression modelling

Duccio Milani

Join Date: Oct 2021

Posts: 23
#1

13 Nov 2021, 10:42

Hello everyone,
I wanted to ask what the most appropriate method is for finding the best polynomial model for multivariate regression.
So far, I have been working with single covariate regressions and used the following code to generate a table showing the R2, AIC and BIC.

Code:

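foreach i of numlist 1/5 {
    * add the next power of x (x itself is the first-order term)
    if `i' != 1 {
        gen x`i' = x^`i'
    }
    * regress y on x and all powers created so far
    qui reg y x*
    estimates store M_`i'
    * information criteria: columns 5 and 6 of r(S) hold AIC and BIC
    qui estat ic
    matrix M_IC_`i' = r(S)
    scalar aic_`i' = M_IC_`i'[1, 5]
    scalar bic_`i' = M_IC_`i'[1, 6]
}
estout M_*, cells(b(star fmt(3)) se(par fmt(3))) stats(r2 r2_a aic bic df_r, labels(R^2 aR^2 AIC BIC df))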

Below is the regression I am working on, including a dummy treatment variable (T) and two interaction variables.
For reference, I am using the heterogeneous local average treatment effect (HLATE) proposed by Becker et al. (2012) in the context of RDD.

Code:

reg AR GDP_dev* ceqi* T TxGDP* Txceqi*

I, therefore, have two questions to ask.
The first is more theoretical and probably very elementary: do the polynomial degrees of the covariates all have to be the same, or can I have a regression with x₁² and x₂³?

The second question concerns the use of Stata. Is there a way to test which model is the best, using code similar to the one posted earlier in this message, or should I use another approach altogether?

I apologize if the question is not precise or I used imprecise terms. I hope I have made myself clear.

edit: I can't change the title, but most likely I am referring to multivariable regression modelling, not multivariate.

Last edited by Duccio Milani; 13 Nov 2021, 11:01. Reason: wrong title
Fei Wang

Join Date: Oct 2021

Posts: 726
#2

13 Nov 2021, 11:24

Duccio, if you are attempting to determine the order of the polynomial for the forcing variable in an RDD, then the answer depends on the bandwidth. With a narrow bandwidth, a linear form is usually sufficient in practice, while quadratic or cubic terms could be added with a wider bandwidth. There is no clear formula, and people often check the robustness of their results with different polynomial orders. Some papers suggest avoiding very high orders (like more than three) as they may cause estimation biases.
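As a rough illustration of that kind of robustness check, here is a minimal sketch on simulated data. Everything in it is an assumption made for the example: the variable names (x, T, y), the cut-off at 0, the three bandwidths, and the data-generating process. It re-estimates the jump at the cut-off for each combination of bandwidth and polynomial order, letting the polynomial differ on either side of the cut-off.

```stata
* Simulated data; all names and numbers here are hypothetical
clear
set seed 12345
set obs 2000
generate x = runiform(-50, 50)          // forcing variable, cut-off at 0
generate byte T = (x >= 0)              // treatment dummy
generate y = 1 + 0.5*T + 0.02*x + rnormal()

* Re-estimate the jump for several bandwidths (h) and orders (p)
foreach h of numlist 10 25 50 {
    forvalues p = 1/3 {
        * start from T fully interacted with x, then add higher powers
        local spec i.T##c.x
        forvalues k = 2/`p' {
            local spec `spec'##c.x
        }
        quietly regress y `spec' if abs(x) <= `h', vce(robust)
        display "h = `h'  p = `p'  jump = " %6.3f _b[1.T] "  se = " %6.3f _se[1.T]
    }
}
```

Because x is centred on the cut-off, the coefficient on 1.T is the estimated jump at the cut-off in every specification, so the displayed numbers are directly comparable across rows.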

If your question is about polynomial orders in general regressions aiming to identify the effect of x on y (i.e., not for the purpose of fitting and predicting, as with time-series data), I would suggest examining graphs of the relation between y and x before determining the orders. Again, you may always try different model specifications as a robustness check.
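One possible way to look at the relation before choosing an order is to overlay the raw scatter with linear, quadratic, and lowess fits. The sketch below uses Stata's shipped auto dataset purely as a stand-in; the variables weight and length are placeholders for your own y and x.

```stata
sysuse auto, clear
* raw data plus three candidate shapes for the y-x relation
twoway (scatter weight length)  ///
       (lfit weight length)     ///
       (qfit weight length)     ///
       (lowess weight length),  ///
       legend(order(1 "data" 2 "linear" 3 "quadratic" 4 "lowess"))
```

If the linear and lowess curves are hard to tell apart over the range of the data, that is some informal evidence that higher-order terms add little.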
William Lisowski

Join Date: Dec 2014

Posts: 10150
#3

13 Nov 2021, 11:47

Another approach to writing the code is

Code:

sysuse auto, clear
local powers c.length
forvalues i = 1/3 {
    display "regress weight `powers'"
    regress weight `powers'
    estimates store M_`i'
    local powers `powers'##c.length
}

By using Stata's factor variable notation to include powers, postestimation commands like margins will correctly understand and take account of the implicit relationship between the values of length, length², and length³.
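A small sketch of what that buys you, again on the auto dataset. The choice of covariates and degrees is only an illustration: it also shows that nothing in the estimation requires the covariates to share the same degree, here quadratic in length and cubic in headroom.

```stata
sysuse auto, clear
* quadratic in length, cubic in headroom: the degrees need not match
regress weight c.length##c.length c.headroom##c.headroom##c.headroom

* average marginal effect of length, using both the linear and squared terms
margins, dydx(length)

* predicted weight over a grid of length values
margins, at(length = (150(25)225))
```

Had the squared term been created with generate instead, margins would treat it as an unrelated variable and the marginal effect of length would ignore it.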
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#4

13 Nov 2021, 15:19

Some papers suggest avoiding very high orders (like more than three) as they may cause estimation biases.

I would like to amplify, to a full-throated scream, the above gentle warning from Fei Wang. In fact, I would take it further. In the absence of compelling, independent evidence that a polynomial model is truly appropriate and a linear model would be misleading, I recommend very strongly against including any higher-order terms at all. The estimation bias that can result is pervasive and very serious, often leading to spurious conclusions that are obviously wrong if one just looks at a scatterplot of the data. A particularly clear example is cited in https://statmodeling.stat.columbia.e...learn-from-it/. Gelman has posted some other examples of this, and I think they should be required reading for everybody who uses regression discontinuity designs.

I would additionally discourage you from choosing your model based on AIC or any other such statistic. It will not protect you from this kind of bias, and will just give you a false sense of security.
Fei Wang

Join Date: Oct 2021

Posts: 726
#5

13 Nov 2021, 19:30

Thanks, Clyde Schechter, for the source and instructions. I totally agree that a scatterplot is necessary before specifying models. I would add a little footnote: for RDD, the bandwidth needs to be narrow, in which case a linear form of the forcing variable is usually sufficient. A wide bandwidth may already have invalidated an RDD, which requires analysis in an area sufficiently close to the cut-off point, before high-order polynomials get the chance to cause disasters. That said, data without enough observations around the discontinuity should not be analysed with RDD in the first place, which settles the polynomial issue from the very beginning.
Duccio Milani

Join Date: Oct 2021

Posts: 23
#6

14 Nov 2021, 15:12

Thank you all for your comments and suggestions.

To Fei and Clyde: I intend to determine the order for the forcing variable. I have already checked that a linear relationship between the forcing variable and the dependent variable gives satisfactory and credible results, and that these are robust to variations in bandwidth. However, some papers with a similar setting and topic use polynomials of degree 3, 4 and 5, so I was questioning my results.
So, if I have understood correctly, the best strategy is to carry out a deeper graphical analysis and then report robustness tests that vary the bandwidth and the functional form.

I wonder, though: since I want to include another covariate in my analysis in addition to the forcing variable and the treatment dummy, does the method remain the same?

I also tried a local-linear regression using the rdrobust Stata package (Calonico, Cattaneo & Titiunik, 2014), which confirms the results obtained so far with the parametric method. Again, I run into difficulties when I want to add a covariate, but I guess that is material for another post.

Thanks to both of you for the excellent advice and further sources.

To William: thanks for the code, I will try it as soon as I get a chance.
Fei Wang

Join Date: Oct 2021

Posts: 726
#7

14 Nov 2021, 16:27

I wonder, though: since I want to include another covariate in my analysis in addition to the forcing variable and the treatment dummy, does the method remain the same?

I also tried a local-linear regression using the rdrobust Stata package (Calonico, Cattaneo & Titiunik, 2014), which confirms the results obtained so far with the parametric method. Again, I run into difficulties when I want to add a covariate, but I guess that is material for another post.

Yes, the procedure remains the same after adding exogenous covariates, which you can do with the -covs()- option of -rdrobust-.
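A minimal sketch of that call, assuming the community-contributed rdrobust package is already installed. The data are simulated and the names (y, x, z) and the cut-off at 0 are hypothetical.

```stata
* Simulated example; y, x and z are hypothetical names
clear
set seed 2021
set obs 1000
generate x = runiform(-1, 1)            // forcing variable, cut-off at 0
generate z = rnormal()                  // exogenous covariate
generate y = 0.5*(x >= 0) + x + 0.3*z + rnormal()

* local-linear RD estimate without covariates
rdrobust y x, c(0) p(1)

* same estimate, adjusting for the covariate through covs()
rdrobust y x, c(0) p(1) covs(z)
```

With a covariate that is balanced at the cut-off, the two point estimates should be close, and the main effect of covs() is usually on precision rather than on the estimate itself.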
Duccio Milani

Join Date: Oct 2021

Posts: 23
#8

16 Nov 2021, 09:05

I would like to clarify a doubt that has arisen as a result of this discussion.

Fei Wang, when you talk about bandwidth, you are referring to the range of observations to consider to the left and to the right of the cut-off, correct? I ask because Lee & Lemieux (2010), in the section on parametric models, stress the need to bin observations with different bandwidths, but they do not mention choosing an observation bandwidth for a parametric model.
So far, I have conducted my analysis using all observations, without binning them, both for transparency of the results, especially in graphical representations, and because I do not have so many observations that the plots become noisy.

I am now wondering whether I am conducting my research incorrectly.
Fei Wang

Join Date: Oct 2021

Posts: 726
#9

16 Nov 2021, 09:13

Using all observations could be a start for RDD. But eventually, RDD requires selecting observations in a sufficiently narrow range around the cut-off. If the sample size is not large enough, there will be a tough trade-off between bias and variance, and you will not be able to contain both.