  • Using dummy variables/interactions in a regression and possible problem of overfitting

    I have three types of categorical dummy variables (hours, weekdays, months). Using those dummy variables as interactions in the form
    Code:
    months#weekdays#hours
    creates around 2000 variables (although the p-values for most of them are significant).

    I am worried about overfitting; what other approach could I use? If I use
    Code:
    months#weekdays weekdays#hours months#hours
    I get fewer variables, but also a lower adjusted R2 and RMSE.
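
    A sketch of the two candidate specifications in factor-variable notation (the dependent variable -load- and the regressor names are assumptions based on the thread):
    Code:
    * saturated three-way interaction: one coefficient per month-weekday-hour cell
    regress load i.months##i.weekdays##i.hours

    * two-way interactions only: far fewer parameters
    regress load i.months##i.weekdays i.weekdays##i.hours i.months##i.hours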

  • #2
    Boris:
    as per -fvvarlist-, you should add an -i.- before each categorical variable and a -c.- before each continuous variable included in the interaction.
    Personally, I find three-way interaction results difficult to disseminate, and two-way interactions quite good for most of my research goals (obviously, things might be different for other researchers).
    Besides, it is rare that the conditional main effect of each predictor included in an interaction is left unreported.
    In the end, you should model your regression on the grounds of what others did in the past when presented with the same research topic, rather than on your data alone.
    Otherwise, as you stated, you run the risk of ending up with a model that fits your data perfectly but may hardly be that convincing when applied to other samples.
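    For presentation, predictive margins over the interaction cells are usually easier to read than the raw interaction coefficients; a sketch, with variable names assumed from the thread:
    Code:
    regress load i.months##i.weekdays
    margins months#weekdays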
    Kind regards,
    Carlo
    (Stata 19.0)

    • #3
      Thank you, Carlo, for the fast response.

      By default Stata assumes -i.-, so I have not written it in this case. If I use the two-way interactions, the third interaction, months#hours, has a lot of very high p-values. If I use the three-way interaction, I do not get that many high p-values. The interactions in this case are meant to account for hourly/weekly/monthly seasonalities (for electricity load forecasting).

      • #4
        Boris:
        as far as categorical variables are concerned, -i.- makes no difference when it comes to a binary variable.
        This remark does not hold when the categorical variable is polytomous, as you can see in the following toy example:
        Code:
        . sysuse auto.dta
        (1978 Automobile Data)
        
        
        . reg price i.foreign i.rep78
        
              Source |       SS           df       MS      Number of obs   =        69
        -------------+----------------------------------   F(5, 63)        =      0.19
               Model |  8372481.37         5  1674496.27   Prob > F        =    0.9670
            Residual |   568424478        63  9022610.75   R-squared       =    0.0145
        -------------+----------------------------------   Adj R-squared   =   -0.0637
               Total |   576796959        68  8482308.22   Root MSE        =    3003.8
        
        ------------------------------------------------------------------------------
               price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
             foreign |
            Foreign  |    36.7572   1010.484     0.04   0.971    -1982.533    2056.048
                     |
               rep78 |
                  2  |   1403.125   2374.686     0.59   0.557    -3342.306    6148.556
                  3  |   1861.058   2195.967     0.85   0.400    -2527.232    6249.347
                  4  |   1488.621   2295.176     0.65   0.519    -3097.921    6075.164
                  5  |   1318.426   2452.565     0.54   0.593    -3582.634    6219.485
                     |
               _cons |     4564.5   2123.983     2.15   0.035     320.0579    8808.942
        ------------------------------------------------------------------------------
        
        . reg price i.foreign rep78
        
              Source |       SS           df       MS      Number of obs   =        69
        -------------+----------------------------------   F(2, 66)        =      0.02
               Model |  425748.824         2  212874.412   Prob > F        =    0.9759
            Residual |   576371210        66  8732897.12   R-squared       =    0.0007
        -------------+----------------------------------   Adj R-squared   =   -0.0295
               Total |   576796959        68  8482308.22   Root MSE        =    2955.1
        
        ------------------------------------------------------------------------------
               price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
             foreign |
            Foreign  |  -205.6112   959.5456    -0.21   0.831    -2121.406    1710.183
               rep78 |   76.29497   449.2741     0.17   0.866    -820.7098    973.2997
               _cons |   5948.776   1422.631     4.18   0.000     3108.401     8789.15
        ------------------------------------------------------------------------------
        
        .
        In the second regression model, if you do not impose -i.-, Stata treats -rep78- as a continuous variable.
        Kind regards,
        Carlo
        (Stata 19.0)

        • #5
          Carlo,
          I understand what you mean.

          Thank you

          • #6
            Do you really want hours, days, and months as separate effects? When you interact all of them, I think you are basically putting in a dummy for every hour of every day of every month, which is why you end up with around 2000 dummies. If that is what you want, you can go directly to it without all the interactions by using time functions. But do you really want a separate intercept for every hour of every day of every month?
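
            One way to "go directly" to a separate intercept per cell is to index the cells with a single categorical variable; a sketch, with variable names assumed from the thread:
            Code:
            egen cell = group(months weekdays hours)
            regress load i.cell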

            • #7
              The answer to that would be yes, since my time series has three seasonalities (daily, weekly, and monthly), which can differ for each month. To capture those seasonalities I would need either the 2000 interaction terms or some "cheat" with two-way interactions.

              The out-of-sample forecast is a bit more precise (in MAPE terms) by a few decimals of a percentage point (which is important). But the question is: if I use two-way interactions, why do most of the terms of the third interaction, which in this case would be i.month#i.hour, have such high p-values?
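
              Rather than reading individual p-values, a joint Wald test of the whole month-by-hour block may be more informative here; a sketch, with variable names assumed from the thread:
              Code:
              regress load i.months##i.weekdays i.weekdays##i.hours i.months##i.hours
              testparm i.months#i.hours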
