  • Selection effects - looking for the right command

    Hi all -

    This is my first post here, so I hope I'm in the right place. I'm finishing up an article on my data about patent litigation. I've got all the cases associated with a set of patents - those that led to an invalidity judgment, as well as those that didn't (usually because they settle early). This is relatively novel, as most folks discard all cases that don't reach a judgment on the merits, which leaves them with relatively few cases. So, I have a bunch of cases coded by patent number, and whether the patent was invalidated, plus other information.

    What I want to do is estimate the likelihood that a patent will be held invalid in a case based on independent variables relating to a) the patent (e.g. how many times it is cited) and b) the parties/litigation (e.g. number of defendants). I've got a reasonable straight logistic regression.

    But the issue I'm thinking about should be clear - there's a sort of selection effect going on: which cases select into getting a ruling? And once selected, is there anything about the patent that tells us which will be invalid? This is important because some plaintiffs may get selected to have rulings more often (that is, the defendants fight), OR it may be that their patents are worse once they get rulings. A single logistic coefficient is ambiguous for some variables.

    The question is, can I test the selection separately? I thought about Heckman selection, but everything I've read (including other patent studies) says no, because that's where you've got unobserved dependent variables - I don't have that. I've got all the observations, and within those, some selected to push for a merits ruling and some did not.

    And here's the Stata question: I've read a couple of articles on two-stage regression, but I can't figure out how to make it happen in Stata - any thoughts are appreciated, including whether I'm overthinking this and can just show what I want by dropping variables to show that the effects are selection versus quality based (I've done that already).

    Any input is much appreciated.
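
    For readers with the same question, here is a minimal sketch of how the single-equation and two-equation setups described above might be written in Stata. Every variable name (invalid, ruled, cites, n_defs, n_claims, patent_id) is hypothetical and stands in for the coded data described in the post. Putting n_defs in the selection equation only is purely an illustrative assumption, not a recommendation about which exclusion restriction is defensible.

    ```stata
    * Hypothetical variables (not from the original post):
    *   invalid   = 1 if the patent was held invalid in this case
    *   ruled     = 1 if the case reached a merits ruling on validity
    *   cites, n_claims = patent-level covariates
    *   n_defs    = case-level covariate (number of defendants)
    *   patent_id = patent identifier, used to cluster standard errors

    * Baseline: one logit over all case-patent pairs
    logit invalid cites n_claims n_defs, vce(cluster patent_id)

    * Treat the outcome as observed only where a ruling occurred
    generate byte invalid_obs = invalid if ruled == 1

    * Probit with sample selection: outcome and selection equations
    * are estimated jointly, allowing their errors to be correlated
    heckprobit invalid_obs cites n_claims, ///
        select(ruled = cites n_claims n_defs) vce(cluster patent_id)
    ```

    heckprobit is used here rather than heckman because the outcome is binary. Its output includes a test of rho = 0, that is, of whether the selection and outcome equations can be treated as independent.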

  • #2
    I don't think it's true that you don't have unobserved observations for your dependent variable.
    It sounds largely analogous to the wage example that is often used as a teaching example for the Heckman selection model. In the case of wages, you can observe wages for your entire sample. They're just zero for those not in the labor force. What you don't observe is what the wages of those not in the labor force would be if they were working, and that's the thing you're interested in. If I understand correctly, the same is true here. You don't know what the ruling would have been had there been a ruling, right?


    • #3
      Thanks for the reply. It turns out for this that we do know what the ruling would be - at least in theory. An invalid patent is an invalid patent regardless of the case, and so if we don't see an invalidation, it is for one of three reasons: 1) there was no challenge; 2) the challenge was on the wrong basis; or 3) error by the court.

      Now, there may be a bunch of patents in the set that were never invalidated even once, and perhaps if they were challenged they might be invalidated (and of those invalidated, if they were challenged properly earlier, they would have been invalidated). There are studies that ask that question with selection from the population of all patents.

      What I'm trying to get at is a slightly different question: given that I have every case in which these patents were litigated, why were they not challenged and/or invalidated? For example, do the odds of seeing an invalidation go up each time you assert a patent? Does suing more defendants mean a higher likelihood of invalidity? If the patent has more claims, is it more likely to be invalidated? Maybe that means the whole affair is a selection question, but the third question is tough, since the number of claims is invariant between cases and thus can only be used to compare against the odds of another patent being invalidated rather than this particular patent. Is that a Heckman problem?


      • #4
        I'm not sure I entirely understand the processes in play here. Your outcome seems to be an indicator where 1 means the patent was invalidated in the case, but I'm not sure I understand what the zeros are. I assumed initially that these were all cases where the validity of the patent was being challenged and the patent was not ruled invalid, but your three reasons for not seeing an invalidation suggest that this may not be the case. If some of the cases involve no challenge to the validity of the patent, that seems like a selection problem. That is, if you have cases in your data that could not possibly have resulted in a ruling of invalidity (not because the patent isn't invalid but because it was never challenged), that's probably something you could model with a Heckman model. Like I said, though, I don't think I understand what you have in your data and what process you're trying to model well enough to comment coherently.


        • #5
          Thanks - you're understanding it. I do have cases where there was no invalidity challenge. That's what distinguishes this data from other studies. Most studies have 150 or so validity findings, and that's it. I've got 5000 observations (case-patent pairs), and about 250 or so case-patents with an invalidity finding (not every patent in every case is invalidated). So the zeros are in cases with some patents that were invalidated in other cases as well as in cases where no patents are invalidated. It's surely selection, but I'm not sure of what. I'm not trying to test what would have happened if patents were selected to be challenged. I'm trying to get a handle on whether the fact of challenge (and result) in the first place is measurable.


          • #6
            BTW - I ran a Heckman estimation per your suggestion - the results work pretty well. I'm going to consider other solutions but if I don't find one, this may work. Thanks!


            • #7
              Final note for the post: turns out rho is not significant - not even close. I'm going to use a simpler method to test for selection.
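
              For reference, a sketch of what rho measures, assuming the estimation in #6 was a standard Heckman-type selection model (the thread does not say which command was run). The symbols below are notation introduced here, not taken from the posts:

              ```latex
              \begin{aligned}
              s_i &= \mathbf{1}\left[\, z_i'\gamma + u_i > 0 \,\right]
                  && \text{(selection: the case reaches a validity ruling)} \\
              y_i &= x_i'\beta + \varepsilon_i
                  && \text{(outcome, observed only when } s_i = 1\text{)} \\
              \rho &= \operatorname{corr}(u_i, \varepsilon_i),
                  \qquad u_i \sim N(0, 1), \quad \varepsilon_i \sim N(0, \sigma_\varepsilon^2) \\
              E\left[\, y_i \mid x_i, z_i, s_i = 1 \,\right]
                  &= x_i'\beta + \rho\,\sigma_\varepsilon\,\lambda(z_i'\gamma),
                  \qquad \lambda(c) = \frac{\phi(c)}{\Phi(c)}
              \end{aligned}
              ```

              When rho = 0 the correction term drops out, so under these assumptions the outcome equation fitted on the selected cases alone is not biased by selection on unobservables. A rho estimate that is not significantly different from zero is consistent with that, though it does not prove it.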