Hi all,
I have an econometric question.
Some of you already know the American Community Survey (ACS) data. For those who don't, it's a household survey that, for privacy reasons, doesn't identify a household's county. The smallest geographic unit it identifies is the Public Use Microdata Area (PUMA), which, depending on population, can be part of a county, a single county, or a group of counties.
For my research, I am trying to run a regression in which the outcome variable is at the household level (with the PUMA identified) and the treatment variable is at the county level.
Since it's cumbersome and noisy to aggregate my treatment variable to the PUMA level, I have decided, following Autor and Dorn (American Economic Review, 2013), to link the outcome and treatment variables at the commuting zone (CZ) level. CZs are usually larger than PUMAs. Some CZs consist of a single county, whereas most consist of multiple counties. I can easily aggregate the treatment variable from county to CZ. In most cases, I can also identify the CZ of a household in the ACS data.
The issue is that some PUMAs are split across multiple CZs. For such cases, Autor and Dorn (2013) suggest treating the household's CZ as probabilistic: a household in a given PUMA is assigned to each CZ that the PUMA overlaps, with a weight equal to the share of that PUMA's population living in the CZ. David Dorn has the PUMA-to-CZ crosswalk files here (section E): https://www.ddorn.net/data.htm.
Specifically, he suggests using Stata's "joinby" command on the ACS dataset, matching on PUMA, so that each household is paired with every CZ its PUMA overlaps, each with its own weight. A single household-year combination therefore becomes multiple observations, each with a different CZ. As I stated, each row carries a CZ weight based on the population distribution.
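To make the expansion concrete, here is a minimal plain-Python sketch of what a joinby-style match on PUMA does. All PUMA codes, CZ codes, and weights are invented for illustration, and the field names (hh_id, puma, cz, afactor, cz_weight) are my assumptions, not necessarily the names used in the ACS extract or in the crosswalk files.

```python
# Toy household records keyed by PUMA (all values invented for illustration).
households = [
    {"hh_id": 1, "year": 2010, "puma": 100},
    {"hh_id": 2, "year": 2010, "puma": 200},
]

# Toy PUMA-to-CZ crosswalk: one row per (PUMA, CZ) pair. "afactor" is a
# hypothetical name for the share of the PUMA's population living in that CZ,
# so the shares sum to 1 within each PUMA.
crosswalk = [
    {"puma": 100, "cz": 10, "afactor": 1.0},
    {"puma": 200, "cz": 10, "afactor": 0.3},
    {"puma": 200, "cz": 20, "afactor": 0.7},
]


def joinby_puma(households, crosswalk):
    """Pair every household with every crosswalk row that shares its PUMA."""
    by_puma = {}
    for row in crosswalk:
        by_puma.setdefault(row["puma"], []).append(row)

    expanded = []
    for hh in households:
        # Households whose PUMA is missing from the crosswalk are dropped.
        for match in by_puma.get(hh["puma"], []):
            expanded.append(
                {**hh, "cz": match["cz"], "cz_weight": match["afactor"]}
            )
    return expanded


expanded = joinby_puma(households, crosswalk)
for row in expanded:
    print(row)
```

Household 1 sits in a PUMA that falls entirely inside one CZ, so it keeps a single row. Household 2 sits in a split PUMA, so it gets one row per CZ, and its CZ weights sum to 1.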
It's easy to weight my treatment variable using these weights. The issue I am having is with CZ fixed effects. Since a single household can lie in any one of several CZs, how do I deal with this situation?
I thought of doing one of these three:
1. Create a single row per household-year combination by collapsing the treatment variable using the given CZ weights. Then assign that observation the fixed effect of the CZ with the largest weight.
2. Create a single row per household-year combination by collapsing the treatment variable using the given CZ weights. Then assign that observation multiple CZ fixed effects. For instance, if a household's PUMA is in two CZs, the dummy variables for both of those CZs are set to 1.
3. Let the household-year combination keep multiple rows of observations, and let each row use its own CZ for the fixed effect. But this will create a lot of noise.
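For anyone who finds it easier to read the three options as data, here is a rough plain-Python sketch of what each one produces for a toy household in a split PUMA. The CZ codes, weights, and treatment values are invented, and the field names (cz_weight, treat, main_cz, cz_dummies) are hypothetical. It only builds the data layout for each option and does not estimate anything.

```python
from collections import defaultdict

# Expanded rows after the PUMA-to-CZ match (toy values, invented).
expanded = [
    {"hh_id": 1, "year": 2010, "cz": 10, "cz_weight": 1.0},
    {"hh_id": 2, "year": 2010, "cz": 10, "cz_weight": 0.3},
    {"hh_id": 2, "year": 2010, "cz": 20, "cz_weight": 0.7},
]

# Hypothetical CZ-level treatment values (already aggregated from counties).
treatment = {10: 2.0, 20: 4.0}

# Group the expanded rows by household-year.
groups = defaultdict(list)
for row in expanded:
    groups[(row["hh_id"], row["year"])].append(row)

collapsed = []
for (hh_id, year), rows in groups.items():
    # Options 1 and 2: CZ-weighted average of the treatment variable.
    treat = sum(r["cz_weight"] * treatment[r["cz"]] for r in rows)
    # Option 1: one fixed effect, taken from the CZ with the largest weight.
    main_cz = max(rows, key=lambda r: r["cz_weight"])["cz"]
    # Option 2: the dummy of every CZ the PUMA overlaps is set to 1.
    cz_dummies = {r["cz"]: 1 for r in rows}
    collapsed.append(
        {"hh_id": hh_id, "year": year, "treat": treat,
         "main_cz": main_cz, "cz_dummies": cz_dummies}
    )

# Option 3: keep the expanded rows. Each row has its own CZ (and so its own
# fixed effect) and its own CZ-level treatment. cz_weight is carried along so
# it could serve as a regression weight (my assumption about the estimation).
option3 = [{**r, "treat": treatment[r["cz"]]} for r in expanded]

for row in collapsed:
    print(row)
for row in option3:
    print(row)
```

In the toy data, household 2 ends up with one collapsed row under options 1 and 2 (differing only in which CZ dummies are switched on) and two rows under option 3.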
I know each has its limitations, and I'm not sure if there's a better way to deal with this.
I know it's long, and I may not have explained it well. But I'd appreciate your input on this, and I'm willing to clarify more if needed. Thank you!