
  • Choosing panel data model with count IV and interaction term

    Hello,

    I need help choosing a model for regional-level panel data (EU regions, 2016-2021).
    I use the European Commission Regional Innovation Scoreboard dataset.

    My DV, patent applications, is continuous and has no negative values.

    My main IV, the number of innovation hubs, is a count variable with many zeros (about 30%). So I have data on how many hubs emerge in a region over time.

    I want to fit a model that explains patent applications by interacting the number of innovation hubs (which varies over years) with regional specialization (a categorical variable that measures how specialized a region is in a certain technology).

    I wonder whether I need tobit regression (given that the DV has no negative values) and whether I should use a random-effects or fixed-effects model for the panel data.

    I would appreciate advice on which model makes sense for such data.

    Thanks a lot in advance

  • #2
    There is no reason to use -tobit- here. The -tobit- model is used where the outcome variable is censored. What you describe does not involve censoring. Censoring would mean that some values, if known correctly, actually would be negative, but they are recorded as zero anyway because we cannot observe the negative values (for some reason). But your variable is a count of patent applications. It is literally impossible for the number of patent applications to be negative, right? So there is no censoring going on: there are no negative values that are reported as zero. There just aren't any negative values at all. So no censoring is involved and -tobit- is not appropriate.

    I will give you my opinion on fixed vs random effects here, but be aware that some in your field would disagree. In my view, it depends on a clearer statement of your research question. Are you trying to determine whether, within any region, changes over time in specialization and number of research hubs are associated with corresponding changes in number of patent applications in that region? If so, you want a fixed-effects model. (I am here glossing over the possibility that a Hausman or similar test might say that you can use random effects anyway--if so that is just affirming that the effect estimates would be essentially the same either way, and the random effects analysis is more statistically efficient. My point is that you should think of this as a fixed-effects model problem, and only use random effects under circumstances where the effect estimates would be essentially the same anyway.)
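
    A minimal sketch of the within-region setup described above. The variable names (region_id, year, patents, hubs, spec) are hypothetical and do not come from the original post, and the year indicators are an added assumption rather than something recommended in this thread:

    ```stata
    * Hypothetical variable names; substitute your own
    xtset region_id year

    * Fixed-effects (within-region) model with the hubs-by-specialization
    * interaction; i.year (year indicators) is an assumption, not from the thread
    xtreg patents c.hubs##i.spec i.year, fe vce(cluster region_id)

    * Slope of patents on hubs at each level of specialization
    margins spec, dydx(hubs)

    * Optional Hausman comparison; -hausman- does not accept robust or
    * clustered VCEs, so both models are refit with default standard errors
    xtreg patents c.hubs##i.spec i.year, fe
    estimates store fe
    xtreg patents c.hubs##i.spec i.year, re
    estimates store re
    hausman fe re
    ```

    If spec does not vary within a region, its main effect is absorbed by the fixed effects and only the hubs slope and its interaction with spec remain estimable.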

    On the other hand, are you trying to determine whether differences among regions in their specialization and number of research hubs are associated with different numbers of patent applications originating in those regions? If so, then you will not succeed in answering that question with a fixed-effects model and should use random effects, or perhaps a pooled model or a purely between-panels model (-xtreg, be-). (Here, if you try a Hausman or similar test and it tells you to use fixed effects, I would emphatically state that you should ignore that. A fixed-effects model cannot estimate between-panel effects. In my view, you should avoid doing any such tests in the first place when you are trying to estimate between-panel effects.)
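
    For illustration, the three between-region options named above might be set up as follows. This is only a sketch, and the variable names (region_id, year, patents, hubs, spec) are hypothetical, not taken from the original post:

    ```stata
    * Hypothetical variable names; substitute your own
    xtset region_id year

    * Random-effects model with standard errors clustered on region
    xtreg patents c.hubs##i.spec, re vce(cluster region_id)

    * Purely between-panels model: a regression on the region-level means
    xtreg patents c.hubs##i.spec, be

    * Pooled OLS, clustering on region to allow for within-region correlation
    regress patents c.hubs##i.spec, vce(cluster region_id)
    ```

    The between estimator is run with its default standard errors here because -xtreg, be- does not offer a cluster-robust VCE.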



    • #3
      In reply to Clyde Schechter (#2):
      Thanks a lot Clyde for a detailed response!
      My key independent variable, "hubs", is also a count and has many zeros. I know zero-inflated models are used when the DV has many zeros, but what do you think in this case?



      • #4
        Well, you have longitudinal (panel) data, and official Stata has no zero-inflated panel estimator for count data. I'm not aware of any user-written commands to do that either. I believe that the estimates provided without the zero-inflation component are pretty robust regardless, although I do not have a reference to cite to support that. If your research goals involve specifically identifying a mixture model between a count data generating process and an always zero generating process, then just using -xtpoisson- (or -poisson- in the absence of panel data) will not achieve the goal. In that case there is no perfect solution that I know of, and I would use -zip- with robust variance estimation as the least bad way of ignoring the non-independence of observations in panels. But if explicitly identifying the zero component is not important, then I would just use -xtpoisson- with robust variance estimation and go with that.
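
        A rough sketch of the two alternatives mentioned above, for a generic count outcome. The names y, x1, x2 and region_id are hypothetical placeholders, and the choice of x1 and x2 as predictors of the always-zero component is an arbitrary assumption made only so that the command is complete:

        ```stata
        * Hypothetical variable names; substitute your own
        xtset region_id year

        * Panel Poisson (fixed effects) with robust variance estimation,
        * ignoring any separate zero-generating process
        xtpoisson y x1 x2, fe vce(robust)

        * Zero-inflated Poisson that ignores the panel structure in the model
        * itself; clustering on region only adjusts the standard errors
        zip y x1 x2, inflate(x1 x2) vce(cluster region_id)
        ```

        -zip- requires the inflate() option, so some specification of the always-zero component has to be chosen even when it is not of direct interest.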

        As I say, I know of no ideal solution to this. Perhaps one exists that I'm unaware of. If others following along have thoughts on this, I would be happy to learn of them.
