  • How to fix years and counties being omitted because of collinearity

    Hi everyone,

    For a school project I'm writing a paper about the correlation between public transit usage and accident rates, using New Jersey as a case study. I have the number of total crashes, crashes that caused injuries, and crashes that resulted in fatalities for every county and for the state as a whole from 2005 to 2017. This means I only have 22 observations per year--I know that's a problem, but I'm not sure how much of a problem. My professor is okay with the lack of internal validity and values the experience of having created the project more than reaching statistically valid conclusions.

    I've got a bunch of variables for controls, including economic conditions state-wide (not by county), county size, median age, median income, and minority populations:
    njrgdp_ njur_ countysizesqmi_ asianAlone_ blackAlone_ hispanic_ medianAge_ medianIncome_ otherAlone_ population_ totalPopForRaceCalculations_ twoOrMoreRaces_ whiteAlone_

    I used xtset to make panel data with my counties and years and ran the following regression:

    xtreg l_tcrash l_pt l_cs l_pop l_white njur_ njrgdp_, fe
    Just to be sure: this takes my dependent variable (the log of total crashes per worker in the county) and my independent variables (the log of the share of workers who take public transit, log of county size, log of county population, log of the white share of the county, the NJ unemployment rate, and NJ real GDP), and estimates that with fixed effects per county and year?

    When I do that, I get the table that I've attached below.

    Honestly I'm totally lost as to how to proceed.
    1) I have no idea how to interpret a good amount of this, especially the coefficients on the logs.
    2) Why would the log of county size be omitted because of collinearity?
    3) Does anyone have any other tips on how to beef up the statistical validity of my project or is it hopeless?

    Thank you so much for any help.
    Attached Files
    Last edited by Gilbert Orbea; 20 Apr 2019, 13:51.

  • #2
    It isn't broken, so you shouldn't try to fix it.

    Or, if it is broken, the problem is in your data, not your modeling. I read this output as saying that county size is a time-invariant attribute of the county. That is why it is collinear with the county fixed effects; attributes of a panel unit that do not change over time are always omitted. That is a necessary consequence of linear algebra and there is no getting around it. But it is also not a problem. Unless you are specifically interested in the effect of county size on your outcome, it doesn't matter at all. Typically the purpose of including a variable like this is to adjust for its confounding effect on the outcome variable. But, precisely because it is collinear with the county fixed effects, the fixed effects themselves automatically adjust for it (and for everything, measured or not, that is an unchanging property of the county). That is, in fact, one of the great advantages of fixed-effects regression!
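    The mechanics are easy to see in the within transformation that fixed-effects estimation applies: each variable is demeaned within each county, and a time-invariant variable demeans to exactly zero. A minimal Python sketch with made-up county sizes (the names and numbers are illustrative, not from the attached data):

    ```python
    # Fixed-effects (within) estimation demeans every variable within each
    # county. A time-invariant variable becomes identically zero after
    # demeaning, so it carries no information beyond the county dummies
    # and must be dropped as collinear. County sizes here are hypothetical.
    panel = {
        "CountyA": [247, 247, 247],  # same sq-mi value every year
        "CountyB": [126, 126, 126],
    }
    for county, sizes in panel.items():
        mean = sum(sizes) / len(sizes)
        demeaned = [s - mean for s in sizes]
        print(county, demeaned)  # -> [0.0, 0.0, 0.0] for both counties
    ```

    A variable with any within-county variation over time would survive the transformation, which is why year-by-year county estimates (discussed below) would not be dropped.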

    So the question is whether this county size variable is the right data. There was only one decennial census during the period of your study. If you are using the 2010 census counts in all of the years of your study, that gives you a time-invariant variable. But, of course, in the real world, county size does change from year to year. So if county size is a really important variable whose effect you really want to estimate, you would need to improve the data by getting year-by-year county estimates. You would not find those in the regular census reports, but the Census Bureau does publish intercensal population estimates. I don't recall at what intervals those come out, nor do I know whether they are carried down to the county level, but if it is important to estimate county size effects, you could look into that.

    The interpretation of coefficients in log-log models is fairly straightforward, though many people find it confusing. It comes up often on this Forum. The key principle is that in a log-log model, a given proportional change in a predictor is associated with a corresponding proportional change in the outcome, the correspondence being determined by the coefficient. So it is typically phrased in those terms. Imagine that pt itself increases by 1%. Then new pt = 1.01 * old pt. Taking logs, log(new pt) = log(1.01) + log(old pt) = 0.00995 + log(old pt). So log pt has increased by 0.00995. Now, since the regression model itself is linear (in the logged variables), the expected value of log tcrashraw will increase by (coefficient of l_pt) * 0.00995 = 0.0307673 * 0.00995 = 0.00031. So log(new tcrashraw) = log(old tcrashraw) + 0.00031. Exponentiating both sides of this equation, new tcrashraw = old tcrashraw * exp(0.00031) = old tcrashraw * 1.00031. So when pt goes up by 1%, tcrashraw goes up by a factor of 1.00031, which could also be said as: tcrashraw increases by about 0.031 percent.
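    The arithmetic above can be checked in a few lines of Python (the coefficient 0.0307673 is the one used in the calculation above):

    ```python
    import math

    b = 0.0307673             # coefficient on l_pt from the regression output
    dlog_pt = math.log(1.01)  # a 1% increase in pt raises log(pt) by ~0.00995
    dlog_y = b * dlog_pt      # implied change in log(tcrashraw)
    factor = math.exp(dlog_y) # multiplicative change in tcrashraw itself

    print(round(dlog_pt, 5))  # 0.00995
    print(round(dlog_y, 5))   # 0.00031
    print(round(factor, 5))   # 1.00031
    ```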

    Now, you probably noticed that this 0.031 percent increase in tcrashraw looks like just a repeat of the coefficient of l_pt itself (0.0307673 ≈ 0.031). That is no coincidence. If we carried the calculation out to more decimal places there would be a small difference, but when a coefficient is close to zero we often speak somewhat loosely and say that the coefficient itself gives the percent change in the outcome associated with a 1% change in the predictor. The approximation breaks down when the coefficient is large, but in this range it works very well.
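    You can see how the shorthand degrades with coefficient size by comparing the exact percent change, 100 * (1.01^b - 1), against the coefficient itself at a few magnitudes (the values of b here are illustrative, not from the attached output):

    ```python
    # Exact % change in the outcome for a 1% change in the predictor is
    # 100 * (1.01**b - 1); the shorthand just reads off b itself.
    for b in (0.03, 0.5, 2.0):
        exact = 100 * (1.01 ** b - 1)
        print(b, round(exact, 4))  # near b for small b, drifts as b grows
    ```

    For b = 0.03 the exact answer is about 0.0299 percent versus the shorthand's 0.03; for b = 2 it is 2.01 percent versus 2, so the gap is still small but visibly growing.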
