I am examining the effect of adding month fixed effects when estimating yearly wasting prevalence (i.e., the share of children who are too thin) in datasets where survey timing is not comparable across years. This is commonly done, but I have realized that in some settings it helps, and in others it does nothing or even worsens the bias! Odd. This is a rabbit hole, but I want to understand properly what is happening, so I decided to work through a set of simulated examples.
In my simulation, I create a dataset of 500 observations per year-month cell for all months from Jan 2013 - Dec 2019, and create variation in average wasting rates by month (following the pattern we see in Senegal; not important, just for the curious), with some year-to-year variation in those monthly patterns, as is realistic. I create no "true" variation in year-to-year average wasting rates. Then in part two, I show (within a preserve/restore block) what happens if you keep only a randomly chosen 3-month sample period per year (visualized in a scatterplot for quick intuition).
My goal is to compare (a) "true" wasting rates over the years (reg wasted i.year in the full simulated sample), (b) "apparent" wasting rates (reg wasted i.year in the non-comparable 3-month survey sample), and (c) "adjusted" wasting rates (reg wasted i.year i.month in the same non-comparable survey sample).
Problem: In some simulated survey draws, certain month dummies are dropped from the i.year i.month model -- I presume because they are multicollinear. That seems fine in itself: the first sampled month (usually January) is always omitted as the "base" month to which the others are compared, so conceptually the additionally omitted months simply join the first sampled month / January as the "base" month, for which wasting is predicted via _cons plus the year-specific parameter. I assume, for instance, that this is what predict does. But margins reports "not estimable" whenever a month FE is omitted from the model. Why? Adding force does not help. Am I missing something conceptual?
Two resulting questions, one coding and one conceptual:
(1) With margins out of action, I can recover year-specific predictions via predict, as shown in the simulation. But what I'd really like is to save the predictions in sim_adj. Is there some way to fix the margins problem, or otherwise save these predictions to sim_adj for subsequent graphing?
(2) Conceptually, is there some way to know in advance which 3-month sampling arrangements will lead to multicollinearity and dropped month FEs? If you re-run the preserve/restore code a dozen or so times, you'll notice that perhaps 1 in 4 or 1 in 5 arrangements does NOT require a dropped month FE. But looking at the scatterplot by eye, I can't tell which arrangements do or don't require a dropped month FE (and thus result in un-estimable margins). When month dummies are dropped, they always belong to months appearing in one year only. But not all months appearing in one year only are dropped, and the arrangements with no dropped month FE still contain months appearing in one year only. In sum, I can't work out the sampling pattern that leads to multicollinear months. Thoughts?
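For question (2), one way to probe a given sampling arrangement numerically is to build the dummy design matrix for i.year i.month and check its column rank: an extra month FE must be dropped exactly when the matrix is rank-deficient beyond the two usual base categories. Here is a small sketch of that check in Python/NumPy rather than Stata, purely so it runs standalone; the helper names (months_sampled, needs_extra_drop) are made up for illustration, and the second example arrangement is hypothetical.

```python
import numpy as np

def months_sampled(start):
    # Three consecutive calendar months starting at `start`, wrapping Dec -> Jan
    return [((start - 1 + k) % 12) + 1 for k in range(3)]

def needs_extra_drop(starts, years=tuple(range(2013, 2020))):
    # Build the design matrix for: constant + i.year + i.month, with one base
    # year and one base month already omitted. Returns True if the matrix is
    # still rank-deficient, i.e. an additional month FE would have to be
    # dropped for this sampling arrangement.
    cells = [(y, m) for y, s in zip(years, starts) for m in months_sampled(s)]
    yrs = sorted({y for y, _ in cells})
    mos = sorted({m for _, m in cells})
    X = []
    for y, m in cells:
        row = [1.0]
        row += [1.0 if y == yy else 0.0 for yy in yrs[1:]]  # year dummies
        row += [1.0 if m == mm else 0.0 for mm in mos[1:]]  # month dummies
        X.append(row)
    X = np.array(X)
    return bool(np.linalg.matrix_rank(X) < X.shape[1])

# Same Jan-Mar window every year: no extra month FE needs to be dropped
print(needs_extra_drop([1] * 7))                # False
# 2013-2018 sample Jan-Mar but 2019 samples Jul-Sep: 2019's months overlap
# with no other year's months, and an extra month FE must be dropped
print(needs_extra_drop([1, 1, 1, 1, 1, 1, 7]))  # True
```

This only detects the problem for a given draw; it does not by itself explain the visual pattern, but re-running it over many random arrangements may make the pattern easier to see than the scatterplot does.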
Code:
********************************************************************************
*** Create full simulated wasting dataset (500 obs per month-year)
********************************************************************************
clear all
set seed 1234
set obs 7
gen year = _n + 2012
expand 12
bysort year: gen month = _n
expand 500
sort year month
gen wasted = 0
local i = 1
forval y = 2013/2019 {
    gen x = runiform(0,1)
    gen e = runiform(-.005,.005)
    replace wasted = 100 if month==2  & x<=(.0354394+e) & year==`y'
    replace wasted = 100 if month==3  & x<=(.0322906+e) & year==`y'
    replace wasted = 100 if month==4  & x<=(.0291154+e) & year==`y'
    replace wasted = 100 if month==5  & x<=(.0344103+e) & year==`y'
    replace wasted = 100 if month==6  & x<=(.0417116+e) & year==`y'
    replace wasted = 100 if month==7  & x<=(.0561161+e) & year==`y'
    replace wasted = 100 if month==8  & x<=(.0418273+e) & year==`y'
    replace wasted = 100 if month==9  & x<=(.0463994+e) & year==`y'
    replace wasted = 100 if month==10 & x<=(.0596474+e) & year==`y'
    replace wasted = 100 if month==11 & x<=(.0385433+e) & year==`y'
    replace wasted = 100 if month==12 & x<=(.0344972+e) & year==`y'
    drop x e
}

* Similar seasonal wasting patterns in each year
twoway (lpoly wasted month if year==2013) (lpoly wasted month if year==2014) ///
    (lpoly wasted month if year==2015) (lpoly wasted month if year==2016) ///
    (lpoly wasted month if year==2017) (lpoly wasted month if year==2018) ///
    (lpoly wasted month if year==2019)

* But no change in average wasting over the years
twoway lpolyci wasted year

********************************************************************************
* "True" wasting rates by year in full sample
********************************************************************************
reg wasted i.year
margins year, saving(sim_original, replace)

********************************************************************************
** Randomly chosen survey periods by year
********************************************************************************
preserve
bysort year: gen m1 = round(runiform(1,12)) if _n==1
bysort year: egen M1 = max(m1)
drop m1
gen M2 = M1 + 1
replace M2 = 1 if M2==13
gen M3 = M2 + 1
replace M3 = 1 if M3==13
replace M3 = 2 if M3==14
gen SAMP = month==M1 | month==M2 | month==M3
keep if SAMP==1

* Visual of the sampled months per year
twoway scatter year month, ylabel(2013(1)2019) xlabel(1(1)12) msize(large) ///
    name(adj_setup`i', replace)

* How well do the months explain the years
reg year i.month
local R = e(r2)

* "Apparent" wasting rates under 3-month uneven sampling
reg wasted i.year
margins year, saving(sim_year, replace)

* "Adjusted" wasting rates from adding month FE under 3-month uneven sampling
reg wasted i.year i.month
margins year, saving(sim_adj, replace) force

* Notice that the "adjusted" prediction works, so why not margins??
* I assume prediction is using _cons for pooled set of omitted months
predict what
bysort year: sum what
restore