Dear Statalist,
I am working with zip-code-level annual crime count data for 2000-2008. A regulatory change took effect in one state in the middle of the panel, say in 2004, and I suspect it affected crime rates. A Diff-in-Diff (DD) or Diff-in-Diff-Diff (DDD) design therefore seems natural. I have three questions, and any help will be deeply appreciated.
Because the zip-code-level crime data are counts with a large number of zeros, I would like to use a fixed-effects Poisson model with zip code fixed effects. Since unobserved factors in crime rates might be correlated within counties, I want to cluster at the county level rather than the zip code level. However, as far as I can tell, none of the existing Stata commands, such as xtpoisson and xtnbreg, allows clustering at a level other than the panel variable (zip code).
A user-written command, ppml, based on Santos Silva and Tenreyro (2006), may be a good choice in this case. The command has the standard weighting options but lacks a fixed-effects option such as ", fe". I would therefore have to include more than 3,000 zip code dummies explicitly, as in eqn (1):
ppml crime law i.county*time i.year i.zip [aw=pop2000], cluster(county) --- (1)
where crime is the number of crimes that occurred, i.zip are zip code dummies, law==1 for zip codes in the area affected by the law in the post-change period, and law==0 for the pre-change period or for unaffected zip codes. Also, time runs from 1 for year==2000 through 9 for year==2008.
This regression takes too much time. To cut down the computing burden, I would like to remove the zip code variation from each variable in eqn (1) before running the regression:
areg crime, absorb(zip) --- (2)
predict crimeR, res
areg law, absorb(zip) --- (3)
predict lawR, res
and finally run:
ppml crimeR lawR i.county*time i.year [aw=pop2000], cluster(county) --- (4)
-----------------------------------------------------------------------------------------------------------------------------------------------------------
My Question 1 now is: Should I also remove the zip code variation from "i.county*time" and "i.year" in eqn (1)? If so, would I do it like this:
xi year --- (5)
areg _Iyear_2000, absorb(zip) --- (6)
all the way through
areg _Iyear_2008, absorb(zip) --- (7)
and similarly for i.county*time? And would I then run the regression on the completely partialled-out variables?
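To make steps (2)-(4) concrete, here is a minimal sketch on simulated data. Everything in it is an assumption for illustration: the toy panel, the variable names treated and alpha, and the coefficient values. It uses only built-in commands and leaves out the weights, year dummies, and county trends. It shows two things: the residualized outcome crimeR is no longer a count because it takes negative values, and in a linear regression the slope from the residual regression matches the slope from areg. That equivalence is a property of linear least squares (Frisch-Waugh-Lovell), so the sketch does not show that it carries over to ppml.

```stata
* Toy balanced panel: 50 hypothetical zip codes, 2000-2008 (illustrative only)
clear
set seed 12345
set obs 50
gen zip = _n
gen treated = zip <= 25                 // hypothetical treated area
gen alpha = rnormal()                   // zip-specific effect
expand 9
bysort zip: gen year = 1999 + _n
gen law = (treated == 1) & (year >= 2004)
gen crime = rpoisson(exp(0.2 + 0.5*alpha - 0.3*law))

* Steps (2)-(3): sweep out zip means and keep the residuals
areg crime, absorb(zip)
predict double crimeR, residuals
areg law, absorb(zip)
predict double lawR, residuals

* The residualized outcome is no longer a count: it takes negative values
summarize crimeR
scalar minR = r(min)

* Linear benchmark: the two slopes on law should coincide
areg crime law, absorb(zip)
scalar b_areg = _b[law]
regress crimeR lawR
scalar b_fwl = _b[lawR]
display "areg: " b_areg "   residual regression: " b_fwl "   min(crimeR): " minR
```

In the linear case the match relies on every regressor kept in the final regression being residualized on the zip dummies in the same way.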
Question 2:
I have reason to believe that only juveniles are affected by the law change and adults are unaffected, so DDD seems a good strategy. Suppose my crime data now have two observations for each zip code and year: one for juveniles and one for adults. When I run a regression with panel fixed effects as in (8), several variables are automatically omitted because they do not vary over time.
gen TxPxJ = (treatedZip*post*juveniles)
gen TxP = (treatedZip*post)
gen TxJ = (treatedZip*juveniles)
gen PxJ = (post*juveniles)
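where treatedZip = 1 if the zip code is in the area affected by the 2004 law change and zero otherwise, and juveniles = 1 for juveniles and zero for adults, so neither varies over time. post = 1 in the post-change period and zero otherwise, and T, P, and J in (8) and (9) stand for treatedZip, post, and juveniles.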
xtreg crime TxPxJ TxP TxJ PxJ T P J i.county*time i.year i.zip [aw=pop2000], fe cluster(county) --- (8)
How do I perform a DDD with zip code fixed effects? Should I drop the zip code fixed effects for the DDD regressions and run a pooled OLS regression as in (9)?
reg crime TxPxJ TxP TxJ PxJ T P J i.county*time i.year [aw=pop2000], cluster(county) --- (9)
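For concreteness, here is a minimal sketch of the linear DDD setup with zip code fixed effects on simulated data. The toy data, the county assignment, and the coefficient values are all hypothetical, and the weights and county trends are left out for brevity. It assumes Stata 11 or later, because it uses factor-variable notation (##) instead of hand-built interactions, and it uses areg so that the zip effects are absorbed rather than entered as dummies. Since treatedZip is constant within a zip code and post is a sum of year dummies, those terms should be flagged as omitted. juveniles varies within each zip code and year, so the terms involving it, including the triple interaction, should remain estimable.

```stata
* Toy DDD data: 40 hypothetical zips in 10 counties, 9 years, 2 age groups
clear
set seed 2004
set obs 40
gen zip = _n
gen county = ceil(zip/4)
gen treatedZip = county <= 5            // hypothetical treated state
gen alpha = rnormal()                   // zip-specific effect
expand 9
bysort zip: gen year = 1999 + _n
expand 2
bysort zip year: gen juveniles = _n - 1 // 0 = adults, 1 = juveniles
gen post = year >= 2004
gen crime = rpoisson(exp(0.5 + 0.4*alpha + 0.3*juveniles - 0.4*treatedZip*post*juveniles))

* Zip fixed effects absorbed; ## expands to all main effects and interactions
areg crime i.treatedZip##i.post##i.juveniles i.year, absorb(zip) vce(cluster county)
scalar b_ddd = _b[1.treatedZip#1.post#1.juveniles]
scalar se_ddd = _se[1.treatedZip#1.post#1.juveniles]
display "DDD coefficient: " b_ddd "  (se " se_ddd ")"
```

With only 10 toy counties the clustered standard error is purely illustrative.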
Question 3:
Is it also valid to use ppml for DDD estimators?
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Thank you!