  • Understanding Abadie, Athey, Imbens, and Wooldridge (2017) using a long-difference example

    There are 100 counties, indexed by j. There are many people in each county. People do not move across counties from Jan 1 to Dec 31. No time subscript is needed in this example; it's a long difference.
    Y_{i} = 1 if person i got cancer by Dec 31, and 0 otherwise.

    X_{j(i)} = Amount of pollutant that spilled into county j (in which person i lives) from Jan 1 to Dec 31
    Before reading Abadie et al. (2017), I had been thinking I needed to cluster at the state level because there are state-level health-related policies.

    But Abadie et al. (2017) say:
    "The researcher should assess whether the sampling process is clustered or not, and whether the assignment mechanism is clustered. If the answer to both is no, one should not adjust the standard errors for clustering, irrespective of whether such an adjustment would change the standard errors."
    In this example, in what situation would the "sampling process" and the "assignment mechanism" be considered clustered?

    Are Abadie et al. (2017) basically saying that clustering at the state level is too conservative an approach?

    So in this example, do Abadie et al. (2017) recommend clustering at the county level?

  • #2
    Here is another, similar example.

    Consider all NAICS 6-digit industries j. There are many workers in each industry. No time subscript is needed in this example. It's a long difference.

    Y_{i} = 1 if worker i got cancer by Dec 31, and 0 otherwise.

    X_{j(i)} = Increase in lead contamination in NAICS 6-digit industry j from Jan 1 to Dec 31, where j is the industry that worker i works in.

    In this example, in what situation would the "sampling process" and the "assignment mechanism" be considered clustered?

    In this case, are Abadie, Athey, Imbens, and Wooldridge (2017) saying that clustering at the NAICS 5-digit industry level is too conservative an approach?

    So are Abadie et al. (2017) saying that I need to cluster at the NAICS 6-digit level?

    • #3
      James: It is unlikely that your data were obtained by cluster sampling. It is more likely the data are from a random sample (or maybe a stratified sample). Suppose that it's a random sample. Then our suggestion is to use the potential outcomes nature of the outcomes, and to think of randomness in the assignment. In your example, it seems clear that the assignment is at the county level. Therefore, you are correct that you should cluster at the county level. That there might still be state-level differences is a red herring. All that matters is the level of assignment.
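
To make "cluster at the county level" concrete, here is a minimal sketch of what a county-clustered standard error for the slope computes in a regression of Y_{i} on an intercept and X_{j(i)}. The regression itself, the function names, and the toy data are assumptions for illustration and are not from the thread. No small-sample correction is applied, so the numbers will differ slightly from what standard software reports.

```python
import math
from collections import defaultdict


def slope_and_residuals(x, y):
    """OLS of y on an intercept and x: slope, residuals, centred x, and Sxx."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    xc = [xi - x_bar for xi in x]
    sxx = sum(v * v for v in xc)
    slope = sum(v * (yi - y_bar) for v, yi in zip(xc, y)) / sxx
    resid = [(yi - y_bar) - slope * v for v, yi in zip(xc, y)]
    return slope, resid, xc, sxx


def cluster_robust_se(x, y, cluster_ids):
    """Sandwich SE of the slope with scores summed within each cluster.

    Passing a distinct id for every observation reduces this to the
    heteroskedasticity-robust (HC0) standard error.
    """
    _, resid, xc, sxx = slope_and_residuals(x, y)
    scores = defaultdict(float)
    for v, e, g in zip(xc, resid, cluster_ids):
        scores[g] += v * e  # sum of (centred x) * residual within cluster g
    meat = sum(s * s for s in scores.values())
    return math.sqrt(meat) / sxx


# Toy data (made up): 4 counties, 3 people each; X varies only by county.
pollutant = [1.0] * 3 + [2.0] * 3 + [3.0] * 3 + [4.0] * 3
cancer = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1]
county = [1] * 3 + [2] * 3 + [3] * 3 + [4] * 3

slope = slope_and_residuals(pollutant, cancer)[0]
se_county = cluster_robust_se(pollutant, cancer, county)
se_hc0 = cluster_robust_se(pollutant, cancer, list(range(len(cancer))))
print("slope:", slope)
print("SE clustered by county:", se_county)
print("SE heteroskedasticity-robust only:", se_hc0)
```

The only difference between the two standard errors is whether the score contributions are summed within county before being squared, which is where the level of assignment enters the calculation.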

      If the data were cluster sampled at the county level or lower (say, city or zip code), then you would still cluster at the county level. If the data were cluster sampled by state -- very unlikely, since all states likely appear in the sample -- then you'd cluster at the state level.
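
One way to read the rule in the two paragraphs above is "cluster at the coarser of the assignment level and the cluster-sampling level, and do not cluster if neither is clustered". The sketch below encodes that reading. The function name, the `LEVELS` ordering, and the use of "individual" to stand for a random sample are assumptions made for illustration, not something the paper or this post defines.

```python
# Hypothetical ordering of levels, finest to coarsest (an assumption).
LEVELS = ["individual", "zip code", "county", "state"]


def clustering_level(assignment_level, sampling_level, levels=LEVELS):
    """Return the coarser of the two levels, or None if neither is clustered.

    sampling_level == "individual" stands for a random sample of people;
    assignment_level == "individual" stands for person-level treatment.
    """
    rank = {name: i for i, name in enumerate(levels)}
    chosen = max(assignment_level, sampling_level, key=rank.__getitem__)
    return None if chosen == "individual" else chosen


# County-level assignment under three sampling schemes from the post.
print(clustering_level("county", "individual"))  # random sample
print(clustering_level("county", "zip code"))    # cluster sampled below county
print(clustering_level("county", "state"))       # cluster sampled by state
```

Note that state-level policies never enter the function: under this reading, only the level of assignment and the sampling design matter.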

      Note that clustering at the county level gives you conservative inference if you observe the entire population, using the potential outcomes framework. But it sounds like that is not the case for you.

      I hope this helps.

      JW

      • #4
        Replying to Jeff Wooldridge (post #3):

        Thank you very much! Yes, that's right: the data were not cluster sampled by state, nor by NAICS 5-digit industry in the industry-level example.

        As a corollary, in my industry example, I can see I need to cluster at NAICS 6-digit, not 5-digit.

        Also as a corollary, if Y is also county-level (and X is still county-level), as below, then I think I shouldn’t cluster at any level.
        Y_{j} = Fraction of people having cancer by Dec 31 in county j
        X_{j} = Amount of pollutant that spilled into county j from Jan 1 to Dec 31

        Also, you wrote “clustering at the county level gives you conservative inference if you observe the entire population”.

        Are you referring to a situation where, for example, the treatment is at the individual level rather than the county level, such as “amount of pollutant that individual person i has inhaled during the year”? In that case, I can understand that even county-level clustering is too conservative, and I should not cluster the standard errors at any level.

        Or, are you referring to a situation like this?
        Y_{i} = Same as original example. 1 if person i got cancer by Dec 31.
        X_{j(i)} = Same as original example. Amount of pollutant that spilled into county j (in which person i lives) from Jan 1 to Dec 31
        Sample of i = Entire 300 million US adults

        • #5
          Correct, James. If Y and X are both county level, no need to cluster. And in your example, the entire population would mean all 300 million. If Y and X are at the county level, it means you see all counties. In that case, the usual heteroskedasticity-robust inference is conservative. (That was in the earlier AAIW paper on sampling versus design uncertainty.)
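
As a rough sketch of the county-level case, the code below collapses person-level data to one row per county (Y_{j} as the fraction with cancer, X_{j} as the pollutant amount) and then computes a heteroskedasticity-robust (HC0) standard error for the slope with no clustering. The function names and the toy data are assumptions for illustration, and no small-sample correction is applied.

```python
import math
from collections import defaultdict


def collapse_to_county(y, x, county_ids):
    """Return (fractions with y == 1, x per county), ordered by county id."""
    counts = defaultdict(int)
    cases = defaultdict(int)
    x_by_county = {}
    for yi, xi, j in zip(y, x, county_ids):
        counts[j] += 1
        cases[j] += yi
        x_by_county[j] = xi  # X is constant within a county by construction
    counties = sorted(counts)
    fractions = [cases[j] / counts[j] for j in counties]
    return fractions, [x_by_county[j] for j in counties]


def hc0_se(x, y):
    """Heteroskedasticity-robust (HC0) SE of the slope of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    xc = [xi - x_bar for xi in x]
    sxx = sum(v * v for v in xc)
    slope = sum(v * (yi - y_bar) for v, yi in zip(xc, y)) / sxx
    resid = [(yi - y_bar) - slope * v for v, yi in zip(xc, y)]
    meat = sum((v * e) ** 2 for v, e in zip(xc, resid))
    return math.sqrt(meat) / sxx


# Toy person-level data (made up): 4 counties, 3 people each.
cancer = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1]
pollutant = [1.0] * 3 + [2.0] * 3 + [3.0] * 3 + [4.0] * 3
county = [1] * 3 + [2] * 3 + [3] * 3 + [4] * 3

y_j, x_j = collapse_to_county(cancer, pollutant, county)
print("county-level fractions:", y_j)
print("HC0 SE of the slope, one row per county:", hc0_se(x_j, y_j))
```

After the collapse each county contributes a single observation, so there is nothing left to cluster on at the county level.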
