Hello
I am trying to estimate the local impact of living close to a mine using a difference-in-differences strategy.
I use pooled household survey data covering 6 years, together with mine-level data. I combined the two sources using geographic information at the cluster and mine level.
My model is the following:
Y_{icdt} = β0 + β1 Active_{ct} + β2 near_c + β3 (Active_{ct} × near_c) + α_d + γ_t + λ X_{idt} + ε_{icdt}
Here Y_{icdt} is the outcome for individual i in cluster c, district d, and survey year t. The dummy "near" equals 1 if the cluster is located within a close distance of the mine and 0 if it is further away. The dummy "Active" equals 1 if the mine nearest to the cluster was active in that survey year and 0 if it was not. The third regressor is the interaction of these two dummies. α_d and γ_t are district and survey-year fixed effects, and X_{idt} is a vector of controls.
So my control group consists of individuals who live further away from the mine, and my treatment is the mine being active in a given year. However, mines are active in different years, and some mines are active at first and then inactive in a later year.
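A minimal sketch of how the specification above might be written in Stata, using factor-variable notation so that the interaction is built by Stata rather than by hand. The names district, year and cluster_id are hypothetical placeholders for the district, survey-year and cluster identifiers, and the individual controls X are left out for brevity.

```stata
* Sketch only: district, year and cluster_id are hypothetical variable names.
* i.Active##i.mine_5km expands to both main effects plus their interaction.
* i.district and i.year add the district and survey-year fixed effects.
* vce(cluster cluster_id) clusters the standard errors at the cluster level.
regress ha5 i.Active##i.mine_5km i.district i.year, vce(cluster cluster_id)
```

Clustering at the cluster level is only one possible choice here; clustering at the mine or district level may be more appropriate depending on how treatment is assigned.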
I am not sure what the best approach is for running this regression. I tried a simple OLS regression and got the following:
. reg ha5 Active mine_5km near_active
note: Active omitted because of collinearity
note: near_active omitted because of collinearity

      Source |       SS           df       MS      Number of obs   =    23,622
-------------+----------------------------------   F(1, 23620)     =     27.79
       Model |   155486631         1   155486631   Prob > F        =    0.0000
    Residual |  1.3218e+11    23,620  5595987.29   R-squared       =    0.0012
-------------+----------------------------------   Adj R-squared   =    0.0011
       Total |  1.3233e+11    23,621  5602332.94   Root MSE        =    2365.6

------------------------------------------------------------------------------
         ha5 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      Active |          0  (omitted)
    mine_5km |   531.4403   100.8199     5.27   0.000     333.8268    729.0538
 near_active |          0  (omitted)
       _cons |   368.5508   15.57857    23.66   0.000     338.0158    399.0858
------------------------------------------------------------------------------
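The two "omitted because of collinearity" notes suggest that Active and near_active carry no independent variation in the estimation sample. One possible explanation, which is an assumption and not something the output confirms, is that Active equals 1 for every observation used, which would make near_active identical to mine_5km. A quick check might look like this:

```stata
* Cross-tabulate the two dummies, keeping missing values visible
tabulate Active mine_5km, missing

* Count observations where the hand-built interaction differs from mine_5km;
* a count of 0 would mean the two variables are identical
count if near_active != mine_5km

* Look at the range of each variable (min equal to max means no variation)
summarize Active near_active
```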
Should I declare the data as panel data beforehand? And is it okay to keep Active as a dummy, or would it be better to create a categorical variable that takes the value 1 if the nearest mine was active in 2004, 2 if it was active in 2005, and so on?
Please let me know what you think is the best way to estimate the model, given that treatment occurs in different years for different mines.
Thanks a lot!