  • Multivariate polynomial regression modelling

    Hello everyone,
    I wanted to ask the most appropriate method for finding the best polynomial model for multivariate regression.
    So far, I have been working with single covariate regressions and used the following code to generate a table showing the R2, AIC and BIC.

    Code:
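    foreach i of numlist 1/5 {

        * Add the next power of x; x itself is the first-order term
        if `i' != 1 {
            gen x`i' = x^`i'
        }

        * x* matches x and every power generated so far
        qui reg y x*
        estimates store M_`i'

        * r(S) holds N, ll0, ll, df, AIC, BIC in columns 1-6
        qui estat ic
        matrix M_IC_`i' = r(S)
        scalar aic_`i'  = M_IC_`i'[1, 5]
        scalar bic_`i'  = M_IC_`i'[1, 6]

    }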
    
    estout M_*, cells(b(star fmt(3)) se(par fmt(3))) stats(r2 r2_a aic bic df_r, labels(R^2 aR^2 AIC BIC df))
    Below is the regression I am working on, including a dummy treatment variable (T) and two interaction variables.
    For reference, I am using the heterogeneous local average treatment effect (HLATE) proposed by Becker et al. (2012) in the context of RDD.

    Code:
    reg AR GDP_dev* ceqi* T TxGDP* Txceqi*
    I, therefore, have two questions to ask.
    The first is more theoretical and probably very elementary: do the polynomial degrees of the covariates all have to be the same, or can I have a regression with x1^2 and x2^3?

    The second question concerns the use of STATA. Is there a way to test which model is the best, using code similar to the one posted earlier in the message, or should I use another approach altogether?

    I apologize if the question is not precise or I used imprecise terms. I hope I have made myself clear.

    edit: I can't change the title, but most likely I am referring to multivariable regression modeling, not multivariate.
    Last edited by Duccio Milani; 13 Nov 2021, 11:01. Reason: wrong title

  • #2
    Duccio, if you are attempting to determine the order of the polynomial for the forcing variable in an RDD, then the answer depends on the bandwidth. With a narrow bandwidth, a linear form is usually sufficient in practice, while quadratic or cubic terms could be added with a wider bandwidth. There is no clear formula, and people often check the robustness of results with different orders of polynomials. Some papers suggest avoiding very high orders (like more than three) as they may cause estimation biases.

    If your question is about polynomial orders in general regressions aiming to identify the effect of x on y (i.e., not for the purpose of fitting and predicting as with time-series data), I would suggest examining graphs of the relationship between y and x before determining the orders. Again, you can always try different model specifications as a robustness check.
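
    The graphical check suggested above could be sketched as follows. This is only an illustration on simulated data: the variable names x, T and y, the cut-off at 0, and the jump of 2 are all assumptions made for the example, not taken from the original poster's data.

    ```stata
    * Simulated data (hypothetical): running variable x, cut-off at 0, true jump of 2
    clear
    set seed 2021
    set obs 500
    generate x = runiform(-1, 1)
    generate T = x >= 0
    generate y = 1 + 2*T + 0.5*x + rnormal()

    * Scatterplot with separate linear (solid) and quadratic (dashed) fits on each side
    twoway (scatter y x, msize(small) mcolor(gs10))          ///
           (lfit y x if x < 0) (lfit y x if x >= 0)          ///
           (qfit y x if x < 0, lpattern(dash))               ///
           (qfit y x if x >= 0, lpattern(dash)),             ///
           xline(0) legend(off)
    ```

    If the linear and quadratic fits nearly coincide close to the cut-off, that is informal evidence that higher-order terms add little there.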



    • #3
      Another approach to writing the code is
      Code:
      sysuse auto, clear
      local powers c.length
      forvalues i = 1/3  {
          display "regress weight `powers'"
          regress weight `powers'
          estimates store M_`i'
          local powers `powers'##c.length
      }
      By using Stata's factor variable notation to include powers, postestimation commands like margins will correctly understand and take account of the implicit relationship between the values of length, length^2, and length^3.
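
      As a minimal sketch of that point, the lines below fit the quadratic specification and then call margins. The evaluation points in at() are chosen arbitrarily for illustration.

      ```stata
      sysuse auto, clear
      regress weight c.length##c.length

      * Average marginal effect of length; the squared term is accounted for automatically
      margins, dydx(length)

      * Predicted weight at a few lengths (values chosen arbitrarily)
      margins, at(length = (150(25)225))
      ```

      Had the squared term been created with generate instead, margins would treat it as an unrelated variable and the marginal effect of length would omit its contribution.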



      • #4
        Some papers suggest avoiding very high orders (like more than three) as they may cause estimation biases.
        I would like to amplify, to a full-throated scream, the above gentle warning from Fei Wang. In fact, I would take it further. In the absence of compelling, independent evidence that a polynomial model is truly appropriate and a linear model would be misleading, I recommend very strongly against including any higher-order terms at all. The estimation bias that can result is pervasive and very serious--often leading to spurious conclusions that are obviously wrong if one just looks at a scatterplot of the data. A particularly clear example is cited in https://statmodeling.stat.columbia.e...learn-from-it/. Gelman has posted some other examples of this, and I think they should be required reading for everybody who uses regression discontinuity designs.

        I would additionally discourage you from choosing your model based on AIC or any other such statistic. It will not protect you from this kind of bias, and will just give you a false sense of security.



        • #5
          Thanks, Clyde Schechter, for the source and instructions. I totally agree that a scatterplot is necessary before specifying models. I would add a small footnote: for RDD, the bandwidth needs to be narrow, and in that case a linear form of the forcing variable is mostly sufficient. RDD requires analysis in an area sufficiently close to the cut-off point, so a wide bandwidth may already have invalidated the design before high-order polynomials get the chance to cause a disaster. That said, data without enough observations around the discontinuity should not be analyzed with RDD at all, which settles the polynomial question from the very beginning.



          • #6
            Thank you all for your comments and suggestions.

            To Fei and Clyde: I intend to determine the order for the forcing variable. I have already tested that a linear relationship between the forcing and the dependent variable gives satisfactory and credible results and that these are robust to variations in bandwidth. However, in some papers with a similar setting and topic, polynomials of degree 3, 4 and 5 are used, so I was questioning my results.
            So, if I got this correctly, the best strategy is to do a more thorough graphical analysis and then report robustness checks that vary the bandwidth and the functional form.

            I wonder though since I want to include in my analysis another covariate in addition to the forcing and the treatment dummy, does the method remain the same?

            I also tried a local-linear regression using the rdrobust STATA package (Calonico, Cattaneo & Titiunik, 2014), which confirms the results obtained so far with the parametric method. Again I run into difficulties when I want to add covariate but I guess that is material for another post.

            Thanks to both of you for the excellent advice and further sources.

            To William: thanks for the code, I will try it as soon as I get a chance



            • #7
              I wonder though since I want to include in my analysis another covariate in addition to the forcing and the treatment dummy, does the method remain the same?

              I also tried a local-linear regression using the rdrobust STATA package (Calonico, Cattaneo & Titiunik, 2014), which confirms the results obtained so far with the parametric method. Again I run into difficulties when I want to add covariate but I guess that is material for another post.
              Yes, the procedure remains the same after adding exogenous covariates, using the -covs()- option of -rdrobust-.
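
              A minimal sketch of that option on simulated data follows. It assumes the community-contributed rdrobust package is already installed; the variables x, z, T and y, the cut-off at 0, and the data-generating process are hypothetical and stand in for the poster's own data.

              ```stata
              * Requires the community-contributed rdrobust package to be installed
              * Simulated data (hypothetical): running variable x, covariate z, cut-off at 0
              clear
              set seed 12345
              set obs 1000
              generate x = runiform(-1, 1)
              generate z = rnormal()
              generate T = x >= 0
              generate y = 1 + 2*T + 0.5*x + 0.8*z + rnormal()

              * Local-linear RD estimate without, then with, the additional covariate
              rdrobust y x, c(0) p(1)
              rdrobust y x, c(0) p(1) covs(z)
              ```

              Comparing the two sets of output shows whether adding the covariate changes the point estimate or mainly tightens the confidence interval.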



              • #8
                To clarify a doubt that has arisen as a result of this discussion.

                Fei Wang, when you talk about bandwidth, you refer to the range of observations to consider to the left and to the right of the cut-off, correct? I ask because Lee & Lemieux (2010), in the section on parametric models, stress the need for binning observations with different bandwidths, but they do not mention choosing an observation bandwidth for a parametric model.
                So far, I have conducted my analysis using all observations, without binning them, both for transparency of results, especially in graphical representations, and because I do not have so many observations that the plots become noisy.

                I am now wondering whether I am conducting my research incorrectly.



                • #9
                  Using all observations could be a start for RDD. But eventually, RDD requires selecting observations in a sufficiently narrow range around the cut-off. If the sample size is not large enough, there will be a tough tradeoff between bias and variance, and you will not be able to contain both.
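
                  The variance side of that tradeoff can be seen in a small sketch on simulated data. Everything here is an assumption made for illustration: the variable names, the cut-off at 0, the bandwidths, and a data-generating process that is linear by construction, so the bias side does not show up.

                  ```stata
                  * Simulated data (hypothetical): running variable x, cut-off at 0, true jump of 2
                  clear
                  set seed 12345
                  set obs 400
                  generate x = runiform(-1, 1)
                  generate T = x >= 0
                  generate y = 1 + 2*T + 0.5*x + rnormal()

                  * Same linear specification, progressively narrower windows around the cut-off
                  foreach h of numlist 1 0.5 0.25 0.1 {
                      quietly regress y i.T##c.x if abs(x) <= `h', vce(robust)
                      display "h = `h'   N = " e(N) "   jump = " %6.3f _b[1.T] "   se = " %6.3f _se[1.T]
                  }
                  ```

                  As the window shrinks, fewer observations remain and the standard error of the estimated jump tends to grow, which is why a small sample makes a narrow bandwidth costly.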
