  • Using dummy variables/interactions in a regression and possible problem of overfitting

    I have three types of categorical dummy variables (hours, weekdays, months). Using those dummy variables as interactions in the form
    Code:
    months#weekdays#hours
    creates around 2000 variables (although the p-values for most of them are significant).

    I am worried about overfitting; what other approach could I use? If I use
    Code:
    months#weekdays weekdays#hours months#hours
    I get fewer variables, but also a lower adjusted R2 and RMSE.
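
    A sketch of the two candidate specifications in factor-variable notation (the dependent variable -load- and the regressor names are assumptions based on the thread):
    Code:
    * saturated three-way interaction: one coefficient per month-weekday-hour cell
    regress load i.months##i.weekdays##i.hours

    * two-way interactions only: far fewer parameters
    regress load i.months##i.weekdays i.weekdays##i.hours i.months##i.hours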

  • #2
    Boris:
    as per -fvvarlist-, you should add an -i.- before each categorical variable and a -c.- before each continuous variable included in the interaction.
    Personally, I find three-way interaction results difficult to disseminate, and two-way interactions quite good for most of my research goals (obviously, things might be different for other researchers).
    Besides, it is rare that the conditional main effect of each predictor included in an interaction is left unreported.
    In the end, you should model your regression on the grounds of what others did in the past when presented with the same research topic, rather than on your data alone.
    Otherwise, as you stated, you run the risk of ending up with a model that fits your data perfectly but may hardly be that convincing when applied to other samples.
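    For presentation, predictive margins over the interaction cells are usually easier to read than the raw interaction coefficients; a sketch, with variable names assumed from the thread:
    Code:
    regress load i.months##i.weekdays
    margins months#weekdays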
    Kind regards,
    Carlo
    (Stata 19.0)

    • #3
      Thank you, Carlo, for the fast response.

      By default Stata assumes -i.-, so I have not written it in this case. If I use the two-way interactions, the third interaction, months#hours, has a lot of very high p-values. If I use the three-way interaction, I do not get that many high p-values. The interactions in this case are meant to account for hourly/weekly/monthly seasonalities (for electricity load forecasting).

      • #4
        Boris:
        as far as categorical variables are concerned, -i.- makes no difference when it comes to a binary variable.
        This remark does not hold when the categorical variable is polytomous, as you can see in the following toy example:
        Code:
        . sysuse auto.dta
        (1978 Automobile Data)
        
        
        . reg price i.foreign i.rep78
        
              Source |       SS           df       MS      Number of obs   =        69
        -------------+----------------------------------   F(5, 63)        =      0.19
               Model |  8372481.37         5  1674496.27   Prob > F        =    0.9670
            Residual |   568424478        63  9022610.75   R-squared       =    0.0145
        -------------+----------------------------------   Adj R-squared   =   -0.0637
               Total |   576796959        68  8482308.22   Root MSE        =    3003.8
        
        ------------------------------------------------------------------------------
               price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
             foreign |
            Foreign  |    36.7572   1010.484     0.04   0.971    -1982.533    2056.048
                     |
               rep78 |
                  2  |   1403.125   2374.686     0.59   0.557    -3342.306    6148.556
                  3  |   1861.058   2195.967     0.85   0.400    -2527.232    6249.347
                  4  |   1488.621   2295.176     0.65   0.519    -3097.921    6075.164
                  5  |   1318.426   2452.565     0.54   0.593    -3582.634    6219.485
                     |
               _cons |     4564.5   2123.983     2.15   0.035     320.0579    8808.942
        ------------------------------------------------------------------------------
        
        . reg price i.foreign rep78
        
              Source |       SS           df       MS      Number of obs   =        69
        -------------+----------------------------------   F(2, 66)        =      0.02
               Model |  425748.824         2  212874.412   Prob > F        =    0.9759
            Residual |   576371210        66  8732897.12   R-squared       =    0.0007
        -------------+----------------------------------   Adj R-squared   =   -0.0295
               Total |   576796959        68  8482308.22   Root MSE        =    2955.1
        
        ------------------------------------------------------------------------------
               price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
             foreign |
            Foreign  |  -205.6112   959.5456    -0.21   0.831    -2121.406    1710.183
               rep78 |   76.29497   449.2741     0.17   0.866    -820.7098    973.2997
               _cons |   5948.776   1422.631     4.18   0.000     3108.401     8789.15
        ------------------------------------------------------------------------------
        
        .
        In the second regression model, if you do not impose -i.-, Stata treats -rep78- as a continuous variable.
        Kind regards,
        Carlo
        (Stata 19.0)

        • #5
          Carlo,
          I understand what you mean.

          Thank you

          • #6
            Do you really want hours, days, and months as separate effects? When you interact all of them, I think you are basically putting in a dummy for every hour of every day of every month, which is why you end up with around 2000 dummies. If that is what you want, you can go directly to it without all the interactions by using time functions. But do you really want a separate intercept for every hour of every day of every month?
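
            One way to "go directly" to a separate intercept per cell is to index the cells with a single categorical variable; a sketch, with variable names assumed from the thread:
            Code:
            egen cell = group(months weekdays hours)
            regress load i.cell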

            • #7
              The answer to that would be yes, since my time series has three seasonalities (daily, weekly, and monthly), which can differ for each month. To capture those seasonalities I would need either the 2000 interaction terms or some "cheat" with two-way interactions.

              The out-of-sample forecast is a bit more precise (in MAPE terms) by a few decimals of a percentage point (which is important). But the question is: if I use two-way interactions, why do most of the terms of the third interaction, which in this case would be i.month#i.hour, have such high p-values?
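
              Rather than reading individual p-values, a joint Wald test of the whole month-by-hour block may be more informative here; a sketch, with variable names assumed from the thread:
              Code:
              regress load i.months##i.weekdays i.weekdays##i.hours i.months##i.hours
              testparm i.months#i.hours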
