
  • Difference-in-differences with a time-varying treatment variable

    Hello

    I am trying to estimate the local impact of living close to a mine using a difference-in-differences strategy.

    I use pooled household survey data for 6 years and data at the mine level. I combined the two data sources using geographic information at the cluster and mine levels.


    My model is the following: Y_icdt = β0 + β1*Active_t + β2*near_c + β3*(Active_t*near_c) + α_dt + λ*X_idt + ε_icdt

    Where Y_icdt is the outcome variable of individual i in cluster c, district d, and survey year t. It is a function of the dummy variable "near" (near=1 if the cluster is located within a close distance of the mine and near=0 if it is farther away), the dummy variable "Active" (Active=1 if the nearest mine to the cluster was active in that survey year and 0 if it was not), and the interaction of these two dummies.

    So my control group is the individuals who live farther away from the mine, and my treatment is the mine being active in a given year. However, mines are active in different years, and some mines may be active at first and inactive in a subsequent year.


    I am not sure what the best approach is to conduct this regression. I tried running a simple OLS regression, and this came up.




    . reg ha5 Active mine_5km near_active
    note: Active omitted because of collinearity
    note: near_active omitted because of collinearity

          Source |       SS           df       MS      Number of obs   =    23,622
    -------------+----------------------------------   F(1, 23620)     =     27.79
           Model |   155486631         1   155486631   Prob > F        =    0.0000
        Residual |  1.3218e+11    23,620  5595987.29   R-squared       =    0.0012
    -------------+----------------------------------   Adj R-squared   =    0.0011
           Total |  1.3233e+11    23,621  5602332.94   Root MSE        =    2365.6

    ------------------------------------------------------------------------------
             ha5 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          Active |          0  (omitted)
        mine_5km |   531.4403   100.8199     5.27   0.000     333.8268    729.0538
     near_active |          0  (omitted)
           _cons |   368.5508   15.57857    23.66   0.000     338.0158    399.0858
    ------------------------------------------------------------------------------

    Should I set the data as panel beforehand? And is it okay to have the variable Active as a dummy, or is it better to create a categorical variable that takes the value 1 if the nearest mine was active in 2004, 2 if active in 2005, etc.?

    Please let me know what you think is the best way to estimate the model, considering that the treatment occurs in different years for different mines.

    Thanks a lot!





  • #2
    So, this data is not amenable to a classical difference in differences analysis, because the treatment is intermittent and its onset is not synchronized across units. Instead, you have to use a generalized difference in differences analysis. For a good explanation of the approach, take a look at

    https://www.ipr.northwestern.edu/wor.../Day%204.2.pdf

    Your actual treatment variable here is neither active nor near but their conjunction. So I would set this up as follows:

    Code:
    gen under_treatment = (active == 1) & (near == 1) if !missing(active, near)
    xtset variable_identifying_individual
    xtreg outcome_variable i.under_treatment i.year perhaps_other_covariates, fe
    The coefficient of under_treatment is then your generalized DID estimate of the effect of living near a mine when it is active.

    The perhaps_other_covariates part of the model should include other variables that are relevant to the outcome and are not simply unchanging fixed attributes of the individual. In particular, if there are variables describing properties of the mines themselves which are relevant to the outcome, I recommend including those. (However, if an individual is always living near the same mine and if the mine's attributes do not change over time, then these will also be constant within individual and will be omitted due to collinearity.)
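
    As a toy illustration of the generalized DID structure (the data below are entirely hypothetical), note how the treatment indicator can switch on and off within a unit as the nearby mine opens and closes:

    Code:
    * hypothetical example: cluster 1 is near a mine active only in 2005;
    * cluster 2 is far from any mine and is never under treatment
    clear
    input long id int year byte(active near)
    1 2004 0 1
    1 2005 1 1
    1 2006 0 1
    2 2004 0 0
    2 2005 1 0
    2 2006 0 0
    end
    gen under_treatment = (active == 1) & (near == 1)
    list, sepby(id)
    It is this within-unit variation in under_treatment, combined with unit and year fixed effects, that identifies the treatment effect.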



    • #3
      Hi Clyde

      Thanks a lot for your help.

      I went through the slides and attempted your code, including as other covariates a dummy for urban/rural district. However, for my outcome variable, which is weight in kilograms, I got the following results. Do you know why the 2008 and 2009 year dummies and the treatment variable were omitted? Thanks in advance!



      . xtreg hc2 i.under_treatment i.year ur ,fe
      note: 0.under_treatment omitted because of collinearity
      note: 2008.year omitted because of collinearity
      note: 2009.year omitted because of collinearity
      note: ur omitted because of collinearity

      Fixed-effects (within) regression               Number of obs     =      9,799
      Group variable: newid                           Number of groups  =      9,799

      R-sq:                                           Obs per group:
           within  = .                                              min =          1
           between = .                                              avg =        1.0
           overall = .                                              max =          1

                                                      F(0,0)            =       0.00
      corr(u_i, Xb) = .                               Prob > F          =          .

      -----------------------------------------------------------------------------------
                    hc2 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      ------------------+----------------------------------------------------------------
      0.under_treatment |          0  (omitted)
                        |
                   year |
                   2008 |          0  (omitted)
                   2009 |          0  (omitted)
                        |
                     ur |          0  (omitted)
                  _cons |     120.11          .        .       .            .           .
      ------------------+----------------------------------------------------------------
                sigma_u |  36.787551
                sigma_e |          .
                    rho |          .   (fraction of variance due to u_i)
      -----------------------------------------------------------------------------------
      F test that all u_i=0: F(9798, 0) = .           Prob > F = .



      • #4
        Your data are not at all as I had imagined them to be, which may be a serious problem with your data, or it may be that I have grossly misunderstood your description in post #1.

        The outputs shown in #3 indicate that you have only a single observation per newid in your data set. This appears to contradict your statement that you have six years' worth of survey data. But perhaps different people were surveyed in each year and I just assumed the data were longitudinal rather than a series of cross-sections. In any case, with just one observation per id, and with -xtset id-, everything will always be collinear with the fixed effect. So to fix this, I need to see an example of your data. Either your data will require some additional management, or a different analytic approach is needed, or both. Also, the appearance of only the 2008 and 2009 years in the output suggests that only 3 distinct years (2008, 2009, and one other) actually appear in your data (or, again, those are the only years that occur in observations that have non-missing values for all model variables). While an analysis based on just three years is possible, you said you have 6 years' worth of data, so I fear something is wrong.


        Please use the -dataex- command to show an example of your data. If you are running version 15.1 or a fully updated version 14.2, it is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

        Even so, something else is amiss. Your under_treatment variable is apparently always zero, or at least is always zero in those observations that have complete data on all model variables. Do you not have any observations where the person lives near a mine and the mine is active? Without that, there is no hope of approaching your research questions. If you do have such people, then something has gone wrong in creating the under_treatment variable, so please show, in addition to the example data, the actual exact code you used to create that variable.
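
        A couple of quick diagnostics along these lines (a sketch, using the variable names from earlier in the thread):

        Code:
        * how many observations per person? fixed effects need more than one
        bysort newid: gen n_per_id = _N
        tab n_per_id

        * does the treatment indicator ever equal 1, and in which years?
        tab year under_treatment, missing
        count if under_treatment == 1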



        • #5
          Thank you for the quick reply Clyde.

          You are right, different people were surveyed in each year, so it's pooled cross-sectional data for the years 2004-2009. I used the -tab- command and noticed that 2007 is the only year where the under_treatment observations are all equal to 0; the rest of the years have a mix of 1's and 0's. There are 43,631 observations out of 292,511 that have no geographic information, and these 43,631 observations with missing values belong to the year 2008.

          Please see below a sample of my data. Although the year shows 2003, the organization that collects the data considers it part of the 2004 survey, because the information was collected in late December 2003.





          My code is the following:

          *Create new identifier using household id and number of respondent of each household
          egen newid = group(hhid hvidx)

          xtset newid year


          *Create dummy for clusters within 5 km distance of a mine
          gen mine_5km=0
          replace mine_5km=1 if distance<=5000
          label var mine_5km near


          *Create dummy for mine being active at a given year
          gen Active=0
          replace Active=1 if active_2004==1 & inlist(year, 2003, 2004)
          replace Active=1 if active_2005==1 & year==2005
          replace Active=1 if active_2006==1 & year==2006
          replace Active=1 if active_2007==1 & year==2007
          replace Active=1 if active_2008==1 & year==2008
          replace Active=1 if active_2009==1 & year==2009


          gen under_treatment= (Active==1) & (mine_5km==1) if !missing(Active, mine_15km)

          xtreg hc2 i.under_treatment i.year, fe


          I also checked for households that live near a mine that is also active

          [attached screenshot: Capture.PNG, a cross-tabulation of Active and mine_15km]


          Thanks a lot again for your help. Looking forward to your response



          • #6
            Thank you for the additional information. It appears you attempted to post a screen shot of your data. That is not useful in this forum. As is so often the case, the screen shot is not readable, at least on my computer and presumably on those of many others. Even if it were readable, it would not contain metadata that might be important, and even if that were not a problem, there is no way to import data from a screen shot into Stata to work with it. In #4, I explicitly asked you to use the -dataex- command to show your data example, and provided instructions on how to get -dataex- if you don't already have it.

            Please post back with a usable data example, by using -dataex-.

            Even before seeing that, there is clearly a problem with your code and data. You seem to have two different variables, mine_5km and mine_15km, and you are not being consistent in your choice of one or the other.

            Code:
            gen under_treatment= (Active==1) & (mine_5km==1) if !missing(Active, mine_15km)
            is not going to produce the intended results.

            Similarly, your cross-tabulation of Active and mine_15km is not going to provide the information needed about under_treatment when the latter is calculated from mine_5km.

            With serial cross-sections rather than longitudinal data, it will not be appropriate to -xtset newid-. It will be necessary to select a higher-level unit for -xtset-, probably a variable identifying the particular mine for use as a fixed effect instead.
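
            A sketch of that setup (mine_name is hypothetical here; use whatever variable uniquely identifies each household's nearest mine, and note that clustering standard errors at the mine level is one reasonable choice, not the only one):

            Code:
            * identify the nearest mine as the higher-level unit
            egen mine_id = group(mine_name)

            * pooled OLS on the repeated cross-sections, with mine and year fixed effects
            reg outcome_variable i.under_treatment i.year i.mine_id, vce(cluster mine_id)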



            • #7
              Sorry about the screenshots. Please see below an example of my data



              . dataex newid year anemia_child Active mine_5km under_treatment in 1/20

              ----------------------- copy starting from the next line -----------------------
              Code:
              * Example generated by -dataex-. To install: ssc install dataex
              clear
              input float newid int year float(anemia_child Active mine_5km under_treatment)
               1 2005 . 0 0 0
               1 2005 . 0 0 0
               2 2005 . 0 0 0
               2 2005 . 0 0 0
               3 2005 . 0 0 0
               3 2005 . 0 0 0
               4 2005 . 0 0 0
               4 2005 . 0 0 0
               5 2005 . 0 0 0
               5 2005 . 0 0 0
               6 2005 . 0 0 0
               6 2005 . 0 0 0
               7 2005 . 0 0 0
               7 2005 . 0 0 0
               8 2005 . 0 0 0
               8 2005 . 0 0 0
               9 2005 . 0 0 0
               9 2005 . 0 0 0
              10 2005 . 0 0 0
              10 2005 . 0 0 0
              end
              format %ty year
              ------------------ copy up to and including the previous line ------------------

              Listed 20 out of 479797 observations





              I corrected my code and used mine_5km, but I still get similar results with the collinearity problem.

              When I combined the household-level data and the mine data, some households' nearest mine coincides, so a mine appears in my data more than once across households. I am not sure how I could use it in -xtset- in this case.

              Thanks again for your help!



              • #8
                Thank you for posting your example. I'm now very confused about what you are doing. This data is completely incompatible with the code you show in #5. Just for starters:

                Code:
                . xtset newid year
                repeated time values within panel
                r(451);
                Which also means that you cannot run any -xtreg- commands, because they require that the data be -xtset-.

                Your -xtreg- command, assuming that you somehow did get it to run, includes variables that do not appear in your example data. (Presumably they appear elsewhere in your data?)

                Next, you show the following output:
                Code:
                Group variable: newid                     Number of groups  =      9,799

                R-sq:                                     Obs per group:
                     within  = .                                        min =          1
                     between = .                                        avg =        1.0
                     overall = .                                        max =          1
                which is a clear statement that there is exactly one observation in the estimation sample per newid. But your example data clearly shows that there are several observations per newid. The only way I can reconcile these is if somehow the pattern of missing values on your regression variables is such that you are left with only one observation per newid that has non-missing values for every variable in the regression command. If that is the case, it is either an extraordinary coincidence, or, more likely, it reflects incorrect data management. In that connection, I also note that your example has a variable anemia_child that has only missing values. That variable is not mentioned in your -xtreg- command, but if your outcome variable hc2 is also like this, then it's a pretty big problem. (The fact that your -xtreg- output says 9,799 observations, but your -dataex- output says you have 479,797 observations also tells me that you have an extreme missing data problem.)

                And in the data example you show, it is always the case that both Active = 0 and mine_5km = 0, so you have no observations that are actually "under treatment." I would normally conjecture that your real data are not like that, but the fact that your original regression output contains only 0.under_treatment listed, and even that one is omitted due to collinearity, suggests to me that, in fact, this condition prevails in your real data as well. That seems to be inconsistent with the tabulation you showed in #5, but, then again, that used a different variable, mine_15km.

                So, I think in order to figure out what's going on, we need to start from scratch. Try posting a new data example that includes all of the variables that will be involved in your regression, and try to make it representative of your real data. Also show the commands that you want to run and the results that they give you. Then we can try again to piece it all together.
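
                Two checks that may help pin this down (a sketch; variable names as used earlier in the thread):

                Code:
                * are there unintended duplicate observations?
                duplicates report
                duplicates report newid year

                * how much missingness is there in the regression variables?
                misstable summarize hc2 under_treatment year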



                • #9
                  Hi Clyde

                  I am very sorry for the confusion. I made a silly mistake by appending two data sets that contained the same observations, so I had duplicates for some individuals in the household. When I created newid using egen newid = group(hhid hvidx), the new variable made the data look as if there were several observations per newid, which is not the case, since I have serial cross-sections. I guess this implies that setting the data as a panel using -xtset- is not appropriate, and I should just work with OLS regressions. What do you think?

                  I also took a look at the missing values issue and went through a document that describes how the data were collected and the sample sizes for each variable.
                  Some of my outcome variables are health measures such as height, weight and anemia, which are only available for the years 2005, 2007, 2008 and 2009. However, for other outcome variables, such as a wealth index and employment status, the data are available for 2004-2009.

                  Please find below the code I've used after combining all data sets. I created different measures of distance from the mine as I've seen other authors doing the same in the literature.



                  *Year
                  rename hv007 year

                  *Cluster within 5 km distance of a mine
                  ren near_dist distance
                  gen mine_5km=0
                  replace mine_5km=1 if distance<=5000
                  label var mine_5km near


                  *Cluster within 10 km distance of a mine
                  gen mine_10km=0
                  replace mine_10km=1 if distance<=10000
                  *Cluster within 15 km distance of a mine
                  gen mine_15km=0
                  replace mine_15km=1 if distance<=15000

                  *Cluster Withing 20 km distance of a mine
                  gen mine_20km=0
                  replace mine_20km=1 if distance<=20000


                  *Mine Activity
                  gen Active=0
                  replace Active=1 if active_2004==1 & inlist(year, 2003, 2004)
                  replace Active=1 if active_2005==1 & year==2005
                  replace Active=1 if active_2006==1 & year==2006
                  replace Active=1 if active_2007==1 & year==2007
                  replace Active=1 if active_2008==1 & year==2008
                  replace Active=1 if active_2009==1 & year==2009



                  *generate controls
                  gen ur=hv025
                  gen education_level=hv106
                  gen schooling=hv108
                  gen region=hv024
                  gen marital_status=hv115
                  gen gender=hv104
                  gen age=hv105
                  gen member_perhousehold=hv009
                  gen month=hv006
                  gen piped_water=0 if hv201!=.
                  replace piped_water=1 if hv201<=13

                  capture drop woman_worked
                  gen woman_worked=0 if v731!=.
                  replace woman_worked=1 if v731>1 & v731!=.
                  label var woman_worked "woman worked in the last 12 months"


                  *anemia child
                  capture drop anemia_child
                  gen anemia_child=0 if hc57!=. & hc57!=9
                  replace anemia_child=1 if hc57>=1 & hc57<=2
                  label var anemia_child "Child has moderate to severe anemia"


                  *calculated BMI for children

                  gen bmi_child_est=(hc2/10)/(hc3/1000)^2


                  *insurance
                  gen insurance=0
                  replace insurance=1 if sh11z==0 | sh09a==1




                  *Treatment
                  gen under_treatment_5km= (Active==1) & (mine_5km==1) if !missing(Active, mine_5km)
                  gen under_treatment_10km= (Active==1) & (mine_10km==1) if !missing(Active, mine_10km)
                  gen under_treatment_15km= (Active==1) & (mine_15km==1) if !missing(Active, mine_15km)
                  gen under_treatment_20km= (Active==1) & (mine_20km==1) if !missing(Active, mine_20km)



                  *Output
                  reg anemia_child i.under_treatment_5km i.year
                  reg hv270 i.under_treatment_5km i.year


                  save dhs.dta, replace


                  Here is an example of my data, which includes all the variables that will be involved in my regression except for region, education_level, ur, hv024, schooling, member_perhousehold and month, which I tried including, but the following message came up: input statement exceeds linesize limit. Try specifying fewer variables r(1000);



                  . dataex anemia_child ha2 hc2 ha3 hc3 hc73 bmi_child_est hv270 woman_worked under_treatment_5km under_treatment_10km under_treatment_
                  > 15km under_treatment_20km year insurance piped_water marital_status age gender

                  ----------------------- copy starting from the next line -----------------------
                  Code:
                  * Example generated by -dataex-. To install: ssc install dataex
                  clear
                  input float anemia_child int(ha2 hc2 ha3 hc3 hc73) float bmi_child_est byte hv270 float(woman_worked under_treatment_5km under_treatment_10km under_treatment_15km under_treatment_20km) int year float(insurance piped_water marital_status age gender)
                  . . . . . . . 1 1 0 0 0 0 2003 0 . . 32 2
                  . . . . . . . 3 . 0 0 0 0 2003 1 . . 20 1
                  . . . . . . . 4 1 0 0 0 0 2003 1 . . 25 2
                  . . . . . . . 3 . 0 0 0 0 2003 0 . .  5 2
                  . . . . . . . 5 . 0 0 1 1 2003 0 . . 29 1
                  . . . . . . . 3 . 0 0 0 0 2003 0 . . 45 1
                  . . . . . . . 4 1 0 0 0 1 2003 0 . . 48 2
                  . . . . . . . 4 . 0 0 0 1 2003 1 . .  8 2
                  . . . . . . . 5 1 0 0 0 0 2003 1 . . 44 2
                  . . . . . . . 4 1 0 0 0 1 2003 0 . . 22 2
                  . . . . . . . 3 . 1 1 1 1 2003 0 . . 59 2
                  . . . . . . . 5 0 0 0 0 0 2003 1 . . 16 2
                  . . . . . . . 4 . 0 0 0 0 2003 1 . . 16 2
                  . . . . . . . 4 0 0 0 0 0 2003 0 . . 16 2
                  . . . . . . . 5 . 0 0 0 0 2003 1 . . 55 1
                  . . . . . . . 4 . 0 0 0 0 2003 0 . . 21 1
                  . . . . . . . 5 0 0 0 0 0 2003 1 . . 37 2
                  . . . . . . . 4 1 0 0 0 0 2003 0 . . 43 2
                  . . . . . . . 4 1 0 0 0 1 2003 1 . . 17 2
                  . . . . . . . 2 . 0 0 0 1 2003 0 . . 82 2
                  . . . . . . . 4 . 0 0 0 0 2003 1 . . 90 2
                  . . . . . . . 5 . 0 0 0 1 2003 1 . . 32 1
                  . . . . . . . 2 . 0 0 0 1 2003 0 . . 13 2
                  . . . . . . . 3 . 1 1 1 1 2003 1 . .  9 2
                  . . . . . . . 3 . 0 0 0 0 2003 0 . .  3 1
                  . . . . . . . 4 . 0 0 0 1 2003 0 . .  9 1
                  . . . . . . . 3 0 0 0 0 1 2003 0 . . 40 2
                  . . . . . . . 3 . 0 0 1 1 2003 0 . . 30 1
                  . . . . . . . 3 . 0 0 0 0 2003 0 . . 36 1
                  . . . . . . . 4 . 0 0 0 0 2003 1 . . 76 2
                  . . . . . . . 5 1 0 0 0 0 2003 0 . . 38 2
                  . . . . . . . 5 1 0 0 0 0 2003 0 . . 33 2
                  . . . . . . . 4 . 0 0 0 0 2003 1 . .  1 1
                  . . . . . . . 3 . 1 1 1 1 2003 1 . . 52 1
                  . . . . . . . 2 1 0 0 0 0 2003 1 . . 33 2
                  . . . . . . . 2 . 0 0 0 0 2003 0 . . 16 1
                  . . . . . . . 4 1 0 0 0 0 2003 0 . . 19 2
                  . . . . . . . 3 . 1 1 1 1 2003 1 . .  6 2
                  . . . . . . . 2 . 0 0 0 0 2003 0 . . 23 1
                  . . . . . . . 3 1 0 0 1 1 2003 1 . . 32 2
                  . . . . . . . 2 . 0 0 0 0 2003 0 . .  4 2
                  . . . . . . . 5 0 0 0 0 0 2003 1 . . 15 2
                  . . . . . . . 3 1 0 0 0 0 2003 0 . . 25 2
                  . . . . . . . 2 . 0 0 0 0 2003 0 . . 78 2
                  . . . . . . . 3 . 0 0 0 0 2003 0 . . 73 1
                  . . . . . . . 3 . 0 0 1 1 2003 1 . .  0 1
                  . . . . . . . 3 . 0 0 0 1 2003 0 . . 80 2
                  . . . . . . . 4 . 0 0 0 0 2003 1 . . 59 1
                  . . . . . . . 3 1 1 1 1 1 2003 1 . . 48 2
                  . . . . . . . 1 . 0 0 0 0 2003 0 . . 29 1
                  . . . . . . . 3 . 0 0 0 0 2003 0 . .  3 2
                  . . . . . . . 3 1 0 0 0 1 2003 0 . . 45 2
                  . . . . . . . 4 . 0 0 0 0 2003 0 . . 48 1
                  . . . . . . . 3 . 0 0 0 0 2003 0 . . 16 1
                  . . . . . . . 4 . 0 0 0 0 2003 0 . . 25 1
                  . . . . . . . 3 . 0 0 0 1 2003 0 . . 42 1
                  . . . . . . . 2 0 0 0 0 1 2003 0 . . 19 2
                  . . . . . . . 4 . 0 0 0 1 2003 1 . . 58 1
                  . . . . . . . 2 . 0 0 0 0 2003 0 . . 20 1
                  . . . . . . . 2 . 0 0 0 0 2003 0 . .  4 1
                  . . . . . . . 2 . 0 0 0 0 2003 1 . . 10 1
                  . . . . . . . 2 . 0 0 0 0 2003 0 . . 74 1
                  . . . . . . . 2 . 0 0 0 0 2003 0 . . 17 1
                  . . . . . . . 2 . 0 0 0 1 2003 0 . . 54 2
                  . . . . . . . 4 . 0 0 0 0 2003 1 . . 35 1
                  . . . . . . . 2 . 0 0 0 1 2003 1 . . 62 1
                  . . . . . . . 3 . 0 0 0 0 2003 1 . . 12 1
                  . . . . . . . 3 0 0 0 0 0 2003 1 . . 17 2
                  . . . . . . . 4 . 0 0 0 0 2003 1 . . 60 1
                  . . . . . . . 3 . 0 0 1 1 2003 1 . .  1 1
                  . . . . . . . 3 1 1 1 1 1 2003 0 . . 43 2
                  . . . . . . . 4 1 0 0 0 0 2003 0 . . 45 2
                  . . . . . . . 2 1 0 0 0 0 2003 0 . . 26 2
                  . . . . . . . 3 . 1 1 1 1 2003 0 . . 22 1
                  . . . . . . . 5 1 0 0 0 0 2003 0 . . 33 2
                  . . . . . . . 4 . 0 0 0 0 2003 1 . . 42 1
                  . . . . . . . 4 . 0 0 0 0 2003 0 . . 47 1
                  . . . . . . . 2 0 0 0 0 0 2003 0 . . 23 2
                  . . . . . . . 4 1 0 0 0 0 2003 0 . . 42 2
                  . . . . . . . 3 . 0 0 0 0 2003 0 . . 22 1
                  . . . . . . . 3 . 1 1 1 1 2003 0 . . 53 1
                  . . . . . . . 5 . 0 0 0 0 2003 0 . . 52 2
                  . . . . . . . 5 . 0 0 0 0 2003 0 . . 29 2
                  . . . . . . . 4 . 0 0 0 0 2003 0 . . 81 2
                  . . . . . . . 2 0 0 0 0 0 2003 0 . . 26 2
                  . . . . . . . 2 . 0 0 0 1 2003 0 . . 78 2
                  . . . . . . . 3 0 0 0 0 1 2003 0 . . 16 2
                  . . . . . . . 5 0 0 0 0 0 2003 1 . . 36 2
                  . . . . . . . 1 . 0 0 0 0 2003 0 . . 22 1
                  . . . . . . . 3 . 0 0 0 1 2003 0 . .  5 2
                  . . . . . . . 2 . 0 0 0 0 2003 1 . .  3 1
                  . . . . . . . 4 . 1 1 1 1 2003 1 . . 59 1
                  . . . . . . . 4 . 0 0 0 0 2003 1 . . 12 2
                  . . . . . . . 3 . 0 0 0 1 2003 0 . .  2 2
                  . . . . . . . 5 . 0 0 1 1 2003 0 . . 22 2
                  . . . . . . . 4 . 0 0 0 0 2003 0 . . 45 1
                  . . . . . . . 4 1 0 0 0 0 2003 1 . . 23 2
                  . . . . . . . 4 1 0 0 0 0 2003 0 . . 48 2
                  . . . . . . . 2 . 0 0 0 0 2003 0 . . 41 1
                  . . . . . . . 2 . 0 0 0 0 2003 0 . . 30 1
                  end
                  label values hc73 HC73
                  label values hv270 HV270
                  label def HV270 1 "Poorest", modify
                  label def HV270 2 "Poorer", modify
                  label def HV270 3 "Middle", modify
                  label def HV270 4 "Richer", modify
                  label def HV270 5 "Richest", modify
                  ------------------ copy up to and including the previous line ------------------

                  Listed 100 out of 292511 observations
                  Use the count() option to list more





                  I appreciate a lot your help. If I conduct OLS regressions, should I also include a dummy for Active?


                  Please let me know what you think. Thanks



                  • #10
                    Well, this is much clearer, thank you.

                    And as you clearly have repeated cross sections rather than panel data, you should not -xtset- your newid.

                    I think what you need to add here is not an indicator ("dummy") for Active, but a variable that designates the mine. Perhaps this is the variable you call region. Does each region correspond to a single mine? If so, I would add i.region to your -regress- commands. If not, you should find such a variable in your data set.

                    There is also a question in my mind whether you really want to use -regress- with dichotomous outcomes like anemia. While it is perfectly legal to do so, linear probability models can be problematic when the actual probabilities are very close to 0 or 1. So often people prefer to use logistic or probit regressions for these. On the other hand, the linear probability model you get from -regress- is in some ways simpler to work with.
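
                    For the dichotomous outcomes, the two alternatives might look like this (a sketch; the cluster-robust standard errors, and the cluster_id variable, are assumptions, not part of the code shown earlier):

                    Code:
                    * linear probability model
                    reg anemia_child i.under_treatment_5km i.year, vce(cluster cluster_id)

                    * logistic alternative, reporting odds ratios
                    logit anemia_child i.under_treatment_5km i.year, vce(cluster cluster_id) or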

                    Finally, although your data setup code looks correct, let me point out that it can be greatly shortened and simplified:

                    Code:
                    *Cluster within distance of a mine
                    ren near_dist distance
                    
                    forvalues d = 5(5)20 {
                        gen mine_`d'km = (distance <= `d'*1000)
                    }
                    
                    
                    *Mine Activity
                    gen Active = 0
                    forvalues y = 2004/2009 {
                        replace Active = 1 if active_`y' == 1 & year == `y'
                    }
                    replace Active = 1 if active_2004 == 1 & year == 2003
                    
                    
                    
                    *generate controls
                    //    I WON'T WRITE THEM ALL OUT, BUT THESE CAN BE DONE WITH -rename-
                    //    SO YOU DON'T GENERATE UNNEEDED EXTRA VARIABLES AND WASTE MEMORY
                    gen ur=hv025
                    gen education_level=hv106
                    gen schooling=hv108
                    gen region=hv024
                    gen marital_status=hv115
                    gen gender=hv104
                    gen age=hv105
                    gen member_perhousehold=hv009
                    gen month=hv006
                    gen piped_water=0 if hv201!=.
                    replace piped_water=1 if hv201<=13
                    
                    capture drop woman_worked
                    gen woman_worked = (v731 > 1) if !missing(v731)
                    label var woman_worked "woman worked in the last 12 months"
                    
                    
                    *anemia child
                    capture drop anemia_child
                    gen anemia_child = inrange(hc57, 1, 2) if !missing(hc57) & hc57 != 9
                    label var anemia_child "Child has moderate to severe anemia"
                    
                    
                    *calculated BMI for children (assumes DHS coding: hc2 = weight in kg*10, hc3 = height in cm*10)
                    gen bmi_child_est=(hc2/10)/(hc3/1000)^2
                    
                    
                    *insurance
                    gen insurance = (sh11z==0 | sh09a==1)
                    
                    *Treatment
                    forvalues d = 5(5)20 {
                        gen under_treatment_`d'km = (Active == 1) & (mine_`d'km == 1) ///
                            if !missing(Active, mine_`d'km)
                    }
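                    With those variables in hand, a sketch of the regression itself might look like the following (this assumes variables named district and cluster_id exist; the district-by-year interaction plays the role of the αdt term in your model):

                    Code:
                    regress ha5 i.Active i.mine_5km i.under_treatment_5km ///
                        i.district#i.year, vce(cluster cluster_id)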



                    • #11
                      Dear Clyde,

                      Thank you for helping out people who ask questions about time-varying DID analyses.

                      I have a general question about the time-varying DID setup.

                      In a standard DID, we have one event date, say 1999, so it is easy to define post=1 if year>1999. We identify control firms by propensity score matching: for example, we know some firms had a regulation change (treated), and we need to find the firms that best match them, constructing the control group from several matching covariates. (my case A)


                      In Sandra's case, or with data similar to hers, I suppose you did not do propensity score matching because she knows which ones are control vs. treated.


                      Is it correct for me to say:
                      1. You suggest coding the various dummies capturing the difference between treated vs. controls?

                      2. If it is a case like the one I described (my case A), then propensity score matching is needed; with time-varying events, it is harder for me to code. If you have seen examples that help with this, please let me know.

                      Thanks,

                      rochelle



                      • #12
                        In a standard DID, we have one event date, say 1999, so it is easy to define post=1 if year>1999. We identify control firms by propensity score matching: for example, we know some firms had a regulation change (treated), and we need to find the firms that best match them, constructing the control group from several matching covariates. (my case A)
                        This is one situation. But there can be other situations: sometimes the regulation changes are the result of a state imposing it across the board on all firms within its jurisdiction; in that case the control group can be firms in other jurisdictions, preferably jurisdictions in which conditions are otherwise similar. Sometimes a state will impose a regulation on certain types of firms, and then you can sometimes choose the control group to just be firms that were not of that type: of course this means that firm type is confounded with the intervention, which puts a very high importance on parallel trends before the intervention, and even then weakens causal inference. There are many possibilities.

                        1. You suggest coding the various dummies capturing the difference between treated vs. controls?
                        In the simple, classical DID analysis, yes, you need one indicator (dummy) variable that captures treated vs control, another that captures pre-intervention vs post-intervention, and then their interaction. In the generalized DID analysis, these variables are necessarily replaced by fixed effects for firm and time, respectively.
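                        A minimal sketch of the two versions in Stata (all variable names here are hypothetical):

                        Code:
                        * classical 2x2 DID: one treated group, one pre/post break
                        regress y i.treat##i.post, vce(cluster firmid)

                        * generalized DID: firm and year fixed effects, plus an indicator
                        * that turns on only in firm-years actually under treatment
                        xtset firmid year
                        xtreg y i.treated_now i.year, fe vce(cluster firmid)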

                        In Sandra's case, or with data similar to hers, I suppose you did not do propensity score matching because she knows which ones are control vs. treated.
                        Her case is a bit more complicated than this because the "treatment," which is living near an active mine, is actually intermittent. There is no single start date. Different mines are active at different times, and a mine can go from inactive to active to inactive again and then reactivate later, etc. In any case, there is no real "treatment group" or "control group" in her design: each person in her study lives within some distance of a mine, and those mines are active at some times and inactive at others. So each person is both "treatment" and "control" depending on which time you are referring to. This is a very complicated design, and it really stretches even generalized DID to its limits, especially because she does not have longitudinal (panel) data on the same people over time.

                        2. If it is a case like the one I described (my case A), then propensity score matching is needed; with time-varying events, it is harder for me to code. If you have seen examples that help with this, please let me know.
                        Having matched controls introduces yet another level of complexity, because now we have both repeated observations per firm over time and nesting of firms within matched pairs. This means that you really have a 3-level model to analyze. That's a problem in finance, where there is something of an aversion to random-effects analysis, but there is no way to do a 3-level fixed-effects analysis. So sometimes, particularly in a DID design where the effect of interest is a within-firm effect, we make compromises like -xtset-ing on the firm and then using -vce(cluster matched_pair)-. Or sometimes we go with a three-level random-effects design and do what we can to minimize the potential for inconsistent estimates. The whole subject is complicated, and I think it is best handled on a case-by-case basis.
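                        A sketch of that compromise (treated_now and matched_pair are hypothetical names; matched_pair identifies each treated/control pair):

                        Code:
                        xtset firmid year
                        xtreg y i.treated_now i.year, fe vce(cluster matched_pair)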

                        And of course, before you even get to that there is the issue of calculating propensity scores and creating matched pairs (something that has not arisen much in my recent work, so I am not really up to date on the latest Stata commands for this.)
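                        That said, Stata's built-in -teffects psmatch- is one place to start for the matching step itself (a sketch only; x1 and x2 stand in for whatever matching covariates apply):

                        Code:
                        * nearest-neighbor matching on the propensity score,
                        * estimating the average treatment effect on the treated
                        teffects psmatch (y) (treat x1 x2), atet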




                        • #13
                          Thank you Clyde very much !

                          I find the construction of control groups via propensity score matching challenging when events are time-varying.



                          • #14
                            Dear Clyde,
                            I read the pdf you posted in #2 https://www.ipr.northwestern.edu/wor.../Day%204.2.pdf


                            Suppose I do not need to use propensity score matching to construct control groups (#11), and I only need to code DID for time-varying events, not as complex as Sandra's case. The linked pdf above seems to suggest using two fixed effects (page 31).

                            Say we have firms in New York vs. New Jersey, and the former had legislation changes in multiple years. Firms in New Jersey are the control group.

                            How would the DID regression look? Would you do

                            xtreg y post post#treat i.year i.firmid ?






                            • #15
                              Thanks a lot for your help and time Clyde!

                              Each region doesn't correspond to a single mine necessarily but to a specific cluster. Should I add i.cluster instead then?

                              I will take a look at the logistic and probit regressions for the dichotomous outcomes, thanks for the advice!





