  • Overfitting

    Dear Statalist users, I have a question with regard to model overfitting.
    I ran nbreg for a set of 30 countries, observed over 14 years, with two different types of events per country-year (30x14x2=840 observations). I use 5 independent variables, plus country dummies and year dummies as fixed effects.
    I am thinking of using two more dummies with triple interactions. My command looks like this:

    nbreg outcome ftb##event##(c.mkt_size_country c.mkt_satur_country c.labor c.skills c.invest) i.country i.year, vce(cluster countryid)

    My first question is whether overfitting looks like a problem at first glance, given the number of variables involved (especially after so many interactions).
    My second question is more fundamental about overfitting. I read that the real problem is that using too many predictors gives you correlations that exist in the sample but not in the general population. In my case, my sample represents more than 90% of the population for the given time period. Would it be safe to ignore overfitting?
    Thank you in advance.
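
    A rough way to gauge the overfitting risk is to count the parameters the command above asks nbreg to estimate. The sketch below is only an illustration and assumes that ftb and event are both binary and that all 30 countries and 14 years enter as dummies; the local macro names are made up for this example.

    ```stata
    * Rough parameter count for the nbreg command above.
    * Assumption: ftb and event are binary; 30 countries, 14 years.
    local k_factors  = 3                     // ftb, event, ftb#event
    local k_cont     = 5                     // main effects of the five continuous covariates
    local k_interact = 5 * 3                 // each covariate times ftb, event, and ftb#event
    local k_fe       = (30 - 1) + (14 - 1)   // country and year dummies
    local k_other    = 2                     // constant and the overdispersion parameter
    local k = `k_factors' + `k_cont' + `k_interact' + `k_fe' + `k_other'
    display "Parameters: `k'"
    display "Observations per parameter: " %4.1f 840 / `k'
    ```

    Under these assumptions the count is 67 parameters for 840 observations, or roughly 12.5 observations per parameter. If ftb or event has more than two levels, the count grows accordingly.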

  • #2
    The inclusion of so many interactions should be put to a test, that is, by checking whether the model improves significantly or not. To avoid overfitting, sticking to the principle of parsimony may be a good strategy.
    Best regards,

    Marcos


    • #3
      Dear Marcos,
      Thank you for your response. The interaction terms are essential for studying two critical questions, namely how the coefficients on the main variables (mkt_size_country, mkt_satur_country, labor, skills, invest) change depending on the type of ftb and on the type of event.
      When you say "whether the model improves significantly", how would you support this? With the AIC and BIC criteria?
      Best regards,
      Ioannis
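
      One possible way to check whether the three-way terms improve the model is a joint Wald test with testparm, alongside information criteria from estimates stats. This is a sketch rather than a recommendation: the stored-estimate names full and additive are hypothetical, the additive model is just one possible comparison, and AIC/BIC based on a clustered fit should be read with caution. As far as I know, lrtest does not accept models fitted with vce(cluster) unless forced, which is why a Wald test is used here.

      ```stata
      * Full model, as in the original post
      nbreg outcome ftb##event##(c.mkt_size_country c.mkt_satur_country c.labor c.skills c.invest) ///
          i.country i.year, vce(cluster countryid)
      estimates store full

      * Joint Wald test that all three-way interaction terms are zero
      testparm i.ftb#i.event#(c.mkt_size_country c.mkt_satur_country c.labor c.skills c.invest)

      * A simpler comparison model without any interactions
      nbreg outcome i.ftb i.event mkt_size_country mkt_satur_country labor skills invest ///
          i.country i.year, vce(cluster countryid)
      estimates store additive

      * AIC and BIC for both models side by side
      estimates stats full additive
      ```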


      • #4
        When you have a population, the coefficients are correct for the population. (Statistical inference with populations, including standard errors, is subject to some debate in the literatures I follow.) Whether the coefficients can be interpreted as supporting a particular theory is a different issue. I'm also not sure I understand your sample size: are you including multiple observations per country-year because you have multiple events? This seems questionable. My bias would be to have one observation per country-year and to include both kinds of events in the same model (to control for omitted variable bias if the events are at all correlated).

        However, three-way interactions are often exceedingly hard to interpret, particularly in your case if ftb and/or event take on more than two values each. You'll also essentially be estimating parameters from the observations within each ftb-by-event category, which can leave some combinations with very few observations relative to the number of variables.
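
        A quick way to see how thin the cells are is to tabulate the data by ftb and event before fitting anything. The sketch below uses the variable names from the original command and assumes outcome is the event count.

        ```stata
        * Number of observations in each ftb-by-event cell
        tabulate ftb event

        * Mean, standard deviation, and frequency of the outcome within each cell
        tabulate ftb event, summarize(outcome)

        * Number of observations with at least one event in each cell
        tabulate ftb event if outcome > 0
        ```

        Cells with only a handful of nonzero outcomes are the ones where the cell-specific slopes will be least reliable.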


        • #5
          Dear Phil,
          Many thanks for your response.
          I have two observations (one for each type of event) per country, per year. Since many years have zero events, I believe that the limiting number of observations is the total number of events, which is around 1/3 of the total observations.
          I do include everything in one model, as I describe in my original post. This model specification would enable me to get coefficients for my independent variables (the interaction terms would be a different story) for four different cases:
          ftb=1 and event=1
          ftb=2 and event=1
          ftb=1 and event=2
          ftb=2 and event=2.
          However, as you say, the number of observations is relatively small, so I don't know whether I can trust these estimates.

          Thank you again.
          Best regards,
          Ioannis
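
          For reference, the coefficient of one covariate in each of the four cases can be recovered from the single model by summing the relevant terms with lincom. This is a sketch to be run after fitting the nbreg model from the original post; it assumes ftb and event are coded 1/2 with 1 as the base level, and uses labor as the example covariate.

          ```stata
          * Coefficient on labor in each ftb-by-event case (base levels: ftb=1, event=1)
          lincom labor                                    // ftb=1, event=1
          lincom labor + 2.ftb#c.labor                    // ftb=2, event=1
          lincom labor + 2.event#c.labor                  // ftb=1, event=2
          lincom labor + 2.ftb#c.labor + 2.event#c.labor + 2.ftb#2.event#c.labor   // ftb=2, event=2

          * The same four slopes on the linear-predictor scale in one table
          margins ftb#event, dydx(labor) predict(xb)
          ```

          Each lincom line also reports a standard error and confidence interval for the summed coefficient, which shows how precisely each case is estimated.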
