
  • Problem with matching individual-level data with group-level data

    I have two datasets in the US:
    • Dataset A has 10 million observations of mortgage borrowers with individual features (income, FICO, loan amount, interest rate, etc).
    • Dataset B has county-level environmental information (air pollution, water pollution, etc.).
    I matched A and B on the county variable. Now, I want to regress the mortgage interest rate (individual-level) on air pollution (county-level). This requires a very big assumption: everyone in a county is exposed to the same level of pollution! From what I understand, if the variation among individuals within a county is large, my estimate will be biased. But other than that, I don't know what else I should address or be aware of.
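The matching step described above is a many-to-one merge: many loans per county, one pollution value per county. A minimal sketch in Python with pandas, using made-up toy data (the column names `loan_id`, `county`, `interest_rate`, and `air_pollution` are hypothetical, not taken from the actual datasets), shows the merge and why the pollution variable ends up with no within-county variation:

```python
import pandas as pd

# Hypothetical toy data; names and values are illustrative only.
loans = pd.DataFrame({
    "loan_id": [1, 2, 3, 4, 5],
    "county": ["A", "A", "B", "B", "C"],
    "interest_rate": [3.1, 3.4, 4.0, 3.8, 3.5],
})
counties = pd.DataFrame({
    "county": ["A", "B", "C"],
    "air_pollution": [10.0, 25.0, 15.0],
})

# validate="m:1" raises an error if any county appears more than once
# in the county-level table; indicator=True adds a "_merge" column.
merged = loans.merge(
    counties, on="county", how="left", validate="m:1", indicator=True
)

# Count loans that found no matching county record.
unmatched = int((merged["_merge"] != "both").sum())

# Number of distinct pollution values per county: always 1 after the merge,
# so the regressor varies only between counties.
values_per_county = merged.groupby("county")["air_pollution"].nunique()
print(unmatched, int(values_per_county.max()))
```

Checking `unmatched` before running any regression is a cheap way to catch county codes that did not line up between the two files.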

    I'd love to hear some insights on this, or some recommended reading materials.
    Last edited by Kien Hoang; 30 Jul 2023, 13:45.

  • #2
    If your goal is to predict/understand/explain a person's mortgage rate as a function of that person's exposure to air pollution, you cannot do that with county-level air pollution data. There are several aspects to this. One is that the county-level data will have less variation than the individual-level data, so all else being the same, the regression coefficient of a county-level air pollution variable will be larger in magnitude than that of the individual-level variable would be. But there are other, perhaps deeper problems with this approach. County-level air pollution is not necessarily a good indicator of an individual's personal exposure to anything. People often work or study in different counties from where they live, or have other reasons to spend substantial amounts of time out of their residential county.

    All of that said, it might be perfectly reasonable to study the relationship of county-level air pollution measures to personal mortgage rates, so long as you do not delude yourself into thinking that the results would tell you anything useful about the relationship of individual-level air pollution measures to personal mortgage rates. Higher air pollution levels in a county may very well lead to lower residential property values. That in turn may mean that people with lower incomes are attracted to buy homes in such counties. As lower-income people may be at greater risk of defaulting on a mortgage, they may be charged higher mortgage interest rates. So there is a reasonably plausible argument that the county-level air pollution measure, even though it is not a reasonable proxy for individual air pollution exposure, might affect an individual's mortgage rate. But just understand that this is completely separate from the effects of individual-level air pollution.


    • #3
      Ideally, I think, you'd aggregate to the level of the most aggregated data. Any coefficient based on data below that is prone to be biased. But, that may not work well in your case.
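Aggregating to the county level, as suggested above, could look roughly like the following sketch. The data are simulated, and every number in it (200 counties, 50 loans each, a slope of 0.02) is an assumption chosen purely for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_counties, n_per_county = 200, 50
true_slope = 0.02  # assumed effect of county pollution on the rate

# One pollution value per county (hypothetical units).
pollution = rng.normal(20.0, 5.0, n_counties)

# Individual loans: each inherits its county's pollution value.
county = np.repeat(np.arange(n_counties), n_per_county)
rate = 3.0 + true_slope * pollution[county] + rng.normal(0.0, 0.5, county.size)
loans = pd.DataFrame(
    {"county": county, "rate": rate, "pollution": pollution[county]}
)

# Collapse to one row per county: mean rate, the county's pollution value,
# and the number of loans behind each mean.
county_level = loans.groupby("county", as_index=False).agg(
    mean_rate=("rate", "mean"),
    pollution=("pollution", "first"),
    n_loans=("rate", "size"),
)

# Simple least-squares line fitted to the county-level rows.
slope, intercept = np.polyfit(
    county_level["pollution"].to_numpy(),
    county_level["mean_rate"].to_numpy(),
    1,
)
print(len(county_level), slope)
```

The regression then runs on one observation per county, so the sample size drops from the number of loans to the number of counties, and individual-level covariates enter only as county averages.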


      • #4
        Hi all,

        Thank you so much for your responses.

        Regarding Clyde's comment, I believe what you mentioned in the latter paragraph is the confounding effect of individuals' income on their mortgage rate (outcome) and their exposure to pollution (treatment). The answer is that I have information on that variable, along with several other confounders, all of which I have controlled for.

        As you said, the big problem is that the county-level data will have less variation than the individual-level data. But I'd like to ask more about your opinion on this, since you said that county-level pollution cannot be used to explain a person's mortgage rate. From what I understand, what is missing here is the within-county variation in pollution. However, I think that the causal effect of pollution can still be estimated from the variation between counties, given that the number of counties is large enough.


        • #5
          "However, I think that the causal effect of pollution can still be estimated from the variation between counties, given that the number of counties is large enough."
          I disagree. The number of counties has no bearing on the matter. It's a measurement issue. The problem is that the county measure is not a valid proxy for the individual-level measure because many people spend much of their time in counties other than their county of residence. In fact, because of this, the more between-county variation in pollution there is, the worse the county measure is as a proxy for the individual measure. The county measure of pollution simply does not serve as a valid measure of individual pollution exposure.
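The point that the number of counties does not fix a measurement problem can be illustrated with a rough simulation. It rests on strong, made-up assumptions: each person spends a fixed share of time (`share_away`) in one randomly chosen county, the outcome depends linearly on true exposure with slope `beta`, and the function name and all parameter values are hypothetical.

```python
import numpy as np

def slope_on_county_measure(n_counties, share_away, n_per_county=100,
                            beta=1.0, seed=0):
    """Regress an outcome on the home-county pollution measure when true
    exposure is a mix of home-county and away-county pollution."""
    rng = np.random.default_rng(seed)
    pollution = rng.normal(0.0, 1.0, n_counties)

    # County of residence for each person.
    home = np.repeat(np.arange(n_counties), n_per_county)
    # Assumed: the "away" county is drawn uniformly at random.
    away = rng.integers(0, n_counties, home.size)

    # True personal exposure mixes the two counties' pollution levels.
    exposure = (1 - share_away) * pollution[home] + share_away * pollution[away]
    y = beta * exposure + rng.normal(0.0, 1.0, home.size)

    # The analyst only observes the home-county measure.
    x = pollution[home]
    return np.polyfit(x, y, 1)[0]

few = slope_on_county_measure(n_counties=50, share_away=0.4)
many = slope_on_county_measure(n_counties=2000, share_away=0.4)
print(few, many)
```

Under these assumptions the fitted slope sits near `beta * (1 - share_away)` rather than `beta`, and going from 50 to 2000 counties makes the estimate more precise without moving it toward `beta`.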
          Last edited by Clyde Schechter; 31 Jul 2023, 15:24.


          • #6
            You might be able to use a similar approach.

            https://heep.hks.harvard.edu/files/heep/files/dp69_sullivan.pdf


            • #7
              Actually, what I'm concerned about is the property's exposure to pollution, rather than the individual's, as the outcome variable is the mortgage rate. Sorry for not specifying this.

              But I understand your argument, and I agree that more spatially precise data would be much more desirable. However, in case I cannot obtain pollution data at a finer geographic level, do you recommend anything to get a more precise causal estimate?


              • #8
                Thank you, George, for your material!


                • #9
                  This is the ticket.

                  https://www.census.gov/content/dam/Census/library/working-papers/2017/adrm/carra-wp-2017-04.pdf


                  • #10
                    If your interest is in the property's mortgage rate, not the mortgage holder's, then the county pollution measure isn't so bad. It's just a matter of being rather coarse-grained in this context. The modeling method that George Ford recommended in #6 would be a state-of-the-art approach if you are able to implement it.


                    • #11
                      I've used satellite data through ArcGIS before. You can get fine location data from a surface. I had someone who's an expert do it, however, so I can't help you with that (other than to give you a contact). It wasn't terribly expensive (a couple hundred dollars, if I recall).
