  • Multiple Fixed Effects

    Hello,

    I basically have a large unbalanced panel dataset of students (stud_id) across all grades from multiple schools (school_id) for 6 years (year). My dependent variable (DV) is a flag that shows whether the student was suspended (susp_tag) in each year. For now, let's ignore the fact that the DV is binary; I am still running OLS and I am fine with that. I have many independent variables (IVs): some are student-level and time-invariant, some are student-level and time-varying (past suspension history), and some are school-level, all of which are time-varying. I want to try out multiple specifications of fixed effects and I am getting stuck on the various options. I have read through Statalist and the Stata manual on fixed effects, but I am still confused by certain parameterizations. So, I would really appreciate your help with this question.

    Since some students move across schools within a year, I created a new panel ID (panelid = group(stud_id school_id)), so a student-year within a school is my unit of analysis.

    1. First, I want to use only school fixed effects to control for all time-invariant school-level characteristics and cluster SEs at the school level. So I use xtreg DV IV's, fe i(school_id) vce(cluster school_id) nonest or areg DV IV, absorb(school_id) vce (cluster school_id). Both give me the same coefficients with minor differences in the SEs, which I believe is just a slight difference in the degrees-of-freedom calculations. Right?

    2. Second, I want to use student fixed effects to predict change within student over time, with SEs again clustered at the school level. So I use xtreg DV IV (only time-varying ones included), fe vce (cluster school_id) or areg DV IV, absorb (panelid) vce (cluster school_id). The main variable of interest here is a student-level time-varying IV (hence the identification relies on the sample of students for whom that status changes over time). Again, there are minor differences in the SEs. Is this right?

    3. Lastly, I want to include student fixed effects and school by grade by year fixed effects (to control for variation in school quality across grades and years). So I create a variable school_grade_year = group(school_id grade year).

    If I use xtreg DV IV i.school_grade_year, fe vce(cluster school_id), I run into the "too many variables" issue. Similarly, if I use areg DV IV i.school_grade_year, absorb(panelid), I run into the same issue. If I use xtreg DV IV, fe i(school_grade_year) vce(), that includes only the school_grade_year fixed effects, right? How do I include both the student and the school-by-grade-by-year FE?

    Thanks in advance!

    Best,
    Maithreyi



  • #2
    Try -reghdfe- from SSC. It works like areg but allows multiple variables in absorb(), including x#y expressions, so you don't have to do egen group.
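
    As a minimal sketch of what that could look like for the third specification in #1: susp_tag, panelid and school_id are taken from that post, while x1 and x2 are hypothetical stand-ins for the time-varying IVs, and grade and year are assumed to be the names of the grade and year variables.

    ```stata
    * Install reghdfe from SSC (one time only).
    * Assumption: recent versions may also need the ftools package from SSC.
    ssc install reghdfe

    * Student-within-school FE (panelid) plus school-by-grade-by-year FE,
    * with standard errors clustered by school.
    * x1 and x2 are hypothetical placeholders for the time-varying IVs.
    reghdfe susp_tag x1 x2, absorb(panelid school_id#grade#year) vce(cluster school_id)
    ```

    Time-invariant student-level IVs are left out here because they would be collinear with the panelid fixed effects.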



    • #3
      Thanks a lot, Sergio. Is there any way to speed up the analysis, though? I have 7.2 million observations and the reghdfe code has been running for the last 12 hours. I am running it on a server with decent capacity (not sure of the exact configuration) on Stata 14. Any ideas? Thanks.

      Best,
      Maithreyi



      • #4
        Can you show me the command line? And how many variables are there? I can do 100m obs in half an hour tops on a 3-year-old computer.

        Maybe what's going on is that the fixed effects are in the variable list instead of in absorb()?



        • #5
          Hi Sergio,

          Thanks for responding. That's interesting. My code is: reghdfe DV IV (I have 12 IVs) , absorb (panelid school_grade_year) vce (cluster school_id)
          The second variable in the absorb () list is a combination of 3 other variables. The dataset overall has 7.2 million observations and 151 variables.

          When I included just my IV of interest and removed the others, the analysis completed in about 5 hours.

          Best,
          Maithreyi



          • #6
            From what I understand, your regression only has 13 variables (besides the fixed effects), which shouldn't be much. Also having 151 variables in total shouldn't matter much because reghdfe will drop all unneeded variables after preserving the dataset.

            Here are a few things you can try that may help, although I am still a bit puzzled about what could be slowing things down:
            • Add the fast option
            • Add the verbose(3) option to see a log of every step, and see where the slowdown is
            • If you have a lot of memory, use the pool(#) option for # being larger than the default of 5. On the other hand, if you are using virtual memory (see "help memory"), that would slow things down a lot so you may be better off with a smaller pool(#)
            • Instead of "school_grade_year", write down the actual interactions (e.g. school#grade#year, if that's the name of your variables). This will speed up some degrees-of-freedom computations.
            • Maybe change the order of the fixed effects. The general rule is to have the fixed effect with more distinct values first. So instead of doing absorb(panelid school_grade_year), do absorb(school_grade_year panelid).
            • Lastly, using a weaker tolerance, such as tolerance(1e-6) might help, although I wouldn't use it for any final result.
            Let me know if any of these help! (And probably run them on a much smaller sample when measuring the speed, so you don't waste 5 hours.)
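
            One way to time the tips above on a smaller sample is sketched here. It keeps a random 5% of schools so that whole clusters stay intact. x1 and x2 are hypothetical placeholders for the IVs, the 5% share and the seed are arbitrary, and the verbose(3) option is the one mentioned in the list and depends on the installed reghdfe version.

            ```stata
            * Timing run on a subsample only; do not use it for final results.
            preserve
            set seed 12345

            * Draw one random number per school and copy it to all of its rows.
            tempvar u
            bysort school_id: gen double `u' = runiform() if _n == 1
            bysort school_id (`u'): replace `u' = `u'[1]

            * Keep roughly 5% of schools.
            keep if `u' < 0.05

            * Time the regression with the suggested ordering of the fixed effects.
            timer clear 1
            timer on 1
            reghdfe susp_tag x1 x2, absorb(school_grade_year panelid) vce(cluster school_id) verbose(3)
            timer off 1
            timer list 1
            restore
            ```

            Changing one option at a time between runs (for example the order inside absorb()) and comparing the timer output should show which tip matters.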

            Sergio

