  • Aggregating years in time fixed effects



    Dear Statalists,

    Hi, I've just joined the forum so apologies in advance if I accidentally break any of the rules.

    I have a general question about including time fixed effects in a logistic regression analysis. A problem with my data set is that the sample size is small while the time range is long. Specifically, I have around 600 observations for each model, of which 100-200 have y=1, while the observation period ranges from 1985 to 2016.

    I intended to include annual time dummies at first, but I found that including 30+ time dummies in my model would result in overfitting.
    The code I used was:
    Code:
     logit y   x1   x2   x3   . . . x9   i.year


    The next thing I tried was creating 5-year interval dummies, that is,

    Code:
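     * 5-year period index: 1 = 1985-1989, 2 = 1990-1994, ..., 7 = 2015-2019
     * (start from 1 only where year is known, so missing years stay missing)
     gen yr5_dummy=1 if !missing(year)
        forval i=2/7 {
        replace yr5_dummy=`i' if year>=1985+5*(`i'-1) & year<=1989+5*(`i'-1) & !missing(year)
        }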
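
    The loop above assigns period 1 to 1985-1989, period 2 to 1990-1994, and so on. As a minimal sketch, the same index can probably be built in one line with floor(), assuming year is a numeric calendar year no earlier than 1985. The toy data set and the variable name yr5_alt are illustrative assumptions, not part of the original analysis.

```stata
* Toy data: one observation per entrance year, 1985-2016 (illustrative only)
clear
set obs 32
gen year = 1984 + _n

* Loop-based index as in the post, leaving missing years missing
gen yr5_dummy = 1 if !missing(year)
forvalues i = 2/7 {
    replace yr5_dummy = `i' if inrange(year, 1985 + 5*(`i'-1), 1989 + 5*(`i'-1))
}

* One-line alternative (yr5_alt is a hypothetical name):
* offset from 1985, divided by 5 and rounded down, plus 1
gen yr5_alt = floor((year - 1985)/5) + 1

* The two constructions should agree on every non-missing year
assert yr5_alt == yr5_dummy if !missing(year)
tabulate yr5_alt
```

    Because floor() of a missing value is missing, observations with a missing year are left out of every period rather than falling into period 1.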

    and then running the regression with clustered standard errors:
    Code:
     logit y   x1   x2   x3   . . . x9   i.yr5_dummy, vce(cluster yr5_dummy)

    In sum, because of the overfitting problem, I created 5-year interval time dummies instead of annual time dummies, and then clustered the standard errors on those intervals.
    Would there be any issues with handling the problem this way?

    Any hints or references to the literature would be appreciated. Thank you.
    Last edited by Hyewon Kim; 29 Aug 2018, 00:43.

  • #2
    Hyewon:
    welcome to this forum.
    Panel data regression works pretty well with large N, small T datasets (that is, the opposite of yours).
    That said, some comments about your query:
    - I'm not clear about what you mean by overfitting (and without seeing what Stata gave you back, it is even more difficult to judge whether an overfitting issue is actually present in your analysis);
    - your first code (by the way, I would also take a look at -xtlogit- and its conditional fixed-effects feature, although it was conceived for large N, small T datasets) seemingly forgets that you're dealing with a panel dataset, in that you should have clustered your standard errors on the panel identifier, as your observations are not independent;
    - I fail to see what you gain by grouping -i.year- into five-year blocks, especially if the other variables are expressed in annual terms;
    - in your second code, clustering should be on the panel identifier, not on years.
    Kind regards,
    Carlo
    (Stata 18.0 SE)



    • #3
      Dear Carlo,

      Thank you for your reply.

      I think I misused the term "fixed effect" and caused a miscommunication. I apologize for the misunderstanding. I am currently conducting a cross-sectional data analysis, not a panel-data analysis.

      To be more specific, my research topic is individual choices of university major, and I have a cross-sectional data set on individual major choice and university entrance year. Since individuals entered university in different years, I wanted to control for potential yearly shocks that might have affected their educational choices (e.g., changes in educational policy or in national economic conditions).

      Below is the result of my first regression (I use Stata 14.1), where I included time dummies for each year of university entrance. y takes the value 1 if an individual enrolled in a certain group of majors. I found that some observations are dropped due to lack of variation in y within some years, so I concluded that the model has an overfitting problem.
      [Image: result_with_i.year.jpg]

      Additional evidence that made me think my model might suffer from overfitting is that, once the data are broken down by year, few observations in my sample have y=1 in each year (I include the two-way table below). So I assumed that if I included the 5-year time dummies instead of annual time dummies in my regression, I would get more stable results.



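      . ta year y if sample==1

                 |           y
            year |         0          1 |     Total
      -----------+----------------------+----------
            1990 |         2          2 |         4
            1991 |         6          4 |        10
            1992 |         8          2 |        10
            1993 |        21          4 |        25
            1994 |        20          2 |        22
            1996 |        21          6 |        27
            1997 |        31          2 |        33
            1998 |        32          3 |        35
            1999 |        22          4 |        26
            2000 |        41          5 |        46
            2001 |        39          2 |        41
            2002 |        33         13 |        46
            2003 |        36          7 |        43
            2004 |        40          7 |        47
            2005 |        40          6 |        46
            2006 |        44          7 |        51
            2007 |        20          5 |        25
            2008 |        30          3 |        33
            2009 |        26          3 |        29
            2010 |        35          9 |        44
            2011 |        34          4 |        38
            2012 |        10          1 |        11
      -----------+----------------------+----------
           Total |       591        101 |       692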
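
      To see how the y=1 counts look once years are pooled, the table above can be re-entered and collapsed into 5-year periods. This is only a sketch: the counts are copied from the table, the binning rule mirrors the yr5_dummy definition (period 1 = 1985-1989), and the names n0, n1, yr5 and share1 are made up for illustration.

```stata
* Counts copied from the year-by-y table (n0 = obs with y==0, n1 = obs with y==1)
clear
input year n0 n1
1990  2  2
1991  6  4
1992  8  2
1993 21  4
1994 20  2
1996 21  6
1997 31  2
1998 32  3
1999 22  4
2000 41  5
2001 39  2
2002 33 13
2003 36  7
2004 40  7
2005 40  6
2006 44  7
2007 20  5
2008 30  3
2009 26  3
2010 35  9
2011 34  4
2012 10  1
end

* Same 5-year rule as yr5_dummy: 2 = 1990-1994, 3 = 1995-1999, ...
gen yr5 = floor((year - 1985)/5) + 1

* Sum the cell counts within each 5-year period and show the y==1 share
collapse (sum) n0 n1, by(yr5)
gen share1 = n1/(n0 + n1)
list, noobs
```

      With these counts, every 5-year period contains at least 14 observations with y=1, compared with as few as 1 or 2 in some single years.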




      Please let me know if my explanation was insufficient or if my understanding is falling short.

      Regards,
      Hyewon
      Last edited by Hyewon Kim; 29 Aug 2018, 02:19.



      • #4
        Hyewon:
        thanks for clarifying.
        I do not think that your model suffers from overfitting: most of your predictors are not statistically significant and the pseudo R2 is quite low.
        Moreover, I would test whether the -i.year- dummies are jointly significant via
        Code:
        testparm i.year
        The perfect prediction is simply a feature of your sample that you cannot fix.
        If you create five-year blocks, you should deploy a strategy to make the other predictors consistent with this new time variable.
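
        One possible way to compare the two specifications is sketched below, on simulated toy data rather than the poster's data: a joint Wald test of the annual dummies, followed by a likelihood-ratio test of the 5-year dummies against the annual ones. Since each 5-year dummy is a sum of annual dummies, the 5-year model is nested in the annual model as long as both are fitted on the same observations. The data-generating values, the two regressors x1 and x2, and the stored names annual and fiveyr are all assumptions for illustration.

```stata
* Simulated toy data (NOT the poster's data): 600 obs, entrance years 1985-2016
clear
set seed 12345
set obs 600
gen year = 1985 + floor(32*runiform())
gen x1 = rnormal()
gen x2 = rnormal()
gen y  = runiform() < invlogit(-1.5 + 0.5*x1)
gen yr5_dummy = floor((year - 1985)/5) + 1

* Annual dummies: joint Wald test, then store the fit
logit y x1 x2 i.year
testparm i.year
estimates store annual

* 5-year dummies on the same estimation sample, so that any years dropped
* for perfect prediction are excluded from both fits
logit y x1 x2 i.yr5_dummy if e(sample)
estimates store fiveyr

* Likelihood-ratio test of the restricted (5-year) against the annual model
lrtest annual fiveyr
```

        A large p-value from lrtest would suggest that the annual dummies add little beyond the 5-year blocks in that sample. Note that lrtest is not available after fits with robust or clustered standard errors, so both models here use the default VCE.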
        Kind regards,
        Carlo
        (Stata 18.0 SE)
