
  • Matching firm algorithm

    I am fairly new to Stata and, for my PhD, would like to reproduce previous research.

    I have two datasets. Dataset 1 contains 500 IPOs from 1988 to 1997. Dataset 2 contains firm data for over 4000 firms on three specific dates (31/12/1988, 31/12/1993 and 31/12/1997).




    All IPO firms from Dataset 1 need to be matched to firms from dataset 2, according to the following criteria:

    - IPO firms from 1988-92 need to be matched to the firm with the same SIC code and the closest market value on 31-Dec-1988.

    - IPO firms from 1993-95 need to be matched to the firm with the same SIC code and the closest market value on 31-Dec-1993.

    - IPO firms from 1996-97 need to be matched to the firm with the same SIC code and the closest market value on 31-Dec-1997.

    - A firm from dataset 2 can be matched only once every 3 years.

    - If no matching firm in the same industry (based on SIC code) is available, a small firm from another industry has to be chosen.




    Any help with the matching code would be much appreciated, as I have already spent tens of hours on this and still cannot find a solution.




    Kind regards,
    Last edited by MIchael Jefferson; 11 Feb 2019, 08:23. Reason: Matching

  • #2
    First, I'd have one question: Is it OK in your situation to have one of your control firms (non-IPO) be matched to more than one IPO case? If not, you won't likely be able to get the "closest" match for each firm since two firms might share the same nearest neighbor. Here are two possibilities, described schematically, possibly with bugs in the code, since I didn't want to create example data for testing.

    1) Pair each IPO firm with its closest match, possibly shared with another firm.
    Code:
    use dataset2
    rename id id2
    keep id2 sic market88
    rename market88 market88_2
    save controls.dta
    //
    clear
    use dataset1
    keep if inrange(ipoyear, 1988, 1992)
    rename id id1
    rename market88 market88_1
    // Make a data set in which each IPO firm is paired with all
    // controls within the same sic code
    joinby sic using controls.dta
    //
    //  Within each collection of pairs for a particular IPO firm
    // keep the one with the smallest difference in 1988 market values.
    gen diff = abs(market88_1 - market88_2)
    bysort id1 (diff): keep if (_n == 1)
    2) Use the user-contributed command -calipmatch-, which will match each IPO firm to one or more control firms that fall within some specified range of closeness (caliper width) on the market88 value, but not necessarily the closest. The match will be "greedy," with controls taken without replacement. See -ssc describe calipmatch-.
    Code:
    use dataset2
    keep id sic market88
    gen case = 0
    save controls.dta
    //
    clear
    use dataset1
    gen case = 1
    append using controls.dta
    // Match market value within 1000, for example.  Only one control per case, but you might want more.
    // Controls have no ipoyear after the append, so keep them in the sample explicitly.
    calipmatch if (case == 0) | inrange(ipoyear, 1988, 1992), gen(pairid) casevar(case) maxmatches(1) ///
      calipermatch(market88) caliperwidth(1000) exactmatch(sic)
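    Approach 1 handles only the 1988-92 IPOs. A minimal sketch of one way to cover all three periods in a single pass follows, using toy data. It assumes dataset 2 also holds market93 and market97, and that each IPO firm has one market value to compare against; the names snap, mv1 and mv2 are made up for illustration. Controls can still be shared between IPO firms, and the fallback to another industry is not handled.

    ```stata
    // Sketch only: toy data, assumed variable names.
    clear all
    // Controls: one row per firm, market value at each of the three dates
    input id2 sic market88 market93 market97
    1 10 100 150 200
    2 10 300 310 320
    3 20 500 400 450
    end
    // One row per control firm and snapshot year (88, 93, 97)
    reshape long market, i(id2) j(snap)
    rename market mv2
    tempfile controls
    save `controls'

    clear
    // IPO firms: mv1 is the (assumed) market value used for matching
    input id1 sic ipoyear mv1
    101 10 1989 120
    102 10 1994 305
    103 20 1997 430
    end
    // Map each IPO year to the snapshot it must be matched on
    gen snap = cond(ipoyear <= 1992, 88, cond(ipoyear <= 1995, 93, 97))
    // Pair each IPO firm with every control in the same SIC code and snapshot
    joinby sic snap using `controls'
    // Keep the closest control; id2 only breaks ties reproducibly
    gen diff = abs(mv1 - mv2)
    bysort id1 (diff id2): keep if _n == 1
    list id1 id2 snap diff
    ```

    IPO firms whose SIC code has no control are silently dropped by -joinby- here, so they would need to be picked up separately for the other-industry rule.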
    Last edited by Mike Lacy; 11 Feb 2019, 10:29.



    • #3
      Thank you for your response!

      Excuse me for not being clear enough. It is not okay to use firms from dataset 2 (Non-IPO firms) twice within 3 years.

      Example: After matching an IPO firm of 1-Jan-1988 (Dataset1) to a non-IPO firm of 31-dec-1988 (Dataset 2), this non-IPO firm cannot be matched again with an IPO firm of 31-Dec-1990. However, this non-IPO firm can be matched with an IPO firm from 1-Jan-1991 onwards.

      I think this could be handled with a loop, but I do not know how to write one.



      • #4
        I'm not sure I understand exactly your rules for re-using controls, but as near as I understand them, I can't think of an easy way to implement them, although I'd presume some reasonable method exists. I think there was a thread on Statalist a few years ago, in which I participated, about how to do matching without replacement, so you might try searching for that. Keywords might be "cases, controls, without replacement, match." If I recall correctly, I found a solution in which, after picking a control for a case, the program deleted that control from the pairs pertaining to all other cases, which is a brute-force approach.
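        A rough sketch of that brute-force idea, on toy data, might look like the following. It assumes a file of candidate pairs (for example, the output of -joinby- before the closest pair is kept) with hypothetical variables id1 (case), id2 (control) and diff (absolute difference in market value). Cases are processed in id order, which is an arbitrary choice made only for illustration.

        ```stata
        // Sketch only: greedy 1:1 matching without replacement on a pairs file.
        clear all
        input id1 id2 diff
        101 1  20
        101 2 180
        102 1  30
        102 2  40
        103 1   5
        end
        sort id1 diff id2
        quietly levelsof id1, local(cases)
        foreach c of local cases {
            // Smallest remaining difference for this case, if any pair is left
            quietly summarize diff if id1 == `c', meanonly
            if r(N) == 0 {
                continue
            }
            scalar s_best = r(min)
            // Control(s) at that distance; take the first one on ties
            quietly levelsof id2 if id1 == `c' & diff == s_best, local(cands)
            local pick : word 1 of `cands'
            // Keep only the chosen pair for this case ...
            quietly drop if id1 == `c' & id2 != `pick'
            // ... and remove the chosen control from every other case
            quietly drop if id2 == `pick' & id1 != `c'
        }
        list, sepby(id1)
        ```

        In this toy example case 103 ends up unmatched, because its only candidate was taken by case 101 first. The result of a greedy pass depends on the order in which cases are processed, so that order is worth choosing deliberately.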

        Matching without replacement is generally difficult. Although I personally find "without replacement" methods more intuitive, I believe (?) that the matching estimators in the built-in command -teffects- use matching *with* replacement, so I'd wonder if that might be preferable for your ultimate analytic goals. My impression is that -teffects- implements quite up-to-date methods.



        • #5
          Thank you for your response!

          Basically, I cannot re-use the controls for 3 years after using them.

          I looked for the thread you are mentioning. Unfortunately, I could not find it.

          Anyone else with suggestions?



          • #6
            It looks to me like the statalist post Mike Lacy mentioned is this one: Question on matching in a nested case control study

            For other Statalist posts on matching firms without replacement (usually (a) have to be in same SIC, then (b) find closest in size), see here, here, and here
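            For the three-year re-use rule described earlier in this thread, the same greedy idea could be adapted along the lines of the sketch below. It is only an illustration on toy data: the pairs file and the names id1, id2, ipodate, diff, seq, done and s_best are assumptions, IPOs are processed in date order, and a control is treated as blocked from the IPO date it is matched on until three calendar years later.

            ```stata
            // Sketch only: greedy matching where a used control is blocked for 3 years.
            clear all
            input id1 str9 ipo_s id2 diff
            101 "01jan1988" 1 20
            102 "31dec1990" 1 10
            102 "31dec1990" 2 50
            103 "01jan1991" 1 15
            end
            gen ipodate = date(ipo_s, "DMY")
            format ipodate %td
            drop ipo_s
            // Number the cases in chronological order (ties broken by id1)
            egen long seq = group(ipodate id1)
            gen byte done = 0
            quietly summarize seq, meanonly
            local ncases = r(max)
            forvalues s = 1/`ncases' {
                quietly summarize diff if seq == `s', meanonly
                if r(N) == 0 {
                    continue
                }
                scalar s_best = r(min)
                quietly levelsof id2 if seq == `s' & diff == s_best, local(cands)
                local pick : word 1 of `cands'
                // Date on which the chosen control becomes available again
                quietly summarize ipodate if seq == `s', meanonly
                local d = r(min)
                local release = mdy(month(`d'), day(`d'), year(`d') + 3)
                if missing(`release') {
                    // 29 Feb has no counterpart three years later; use 1 Mar
                    local release = mdy(3, 1, year(`d') + 3)
                }
                quietly drop if seq == `s' & id2 != `pick'
                quietly replace done = 1 if seq == `s'
                // Block the control for unprocessed IPOs inside the 3-year window
                quietly drop if id2 == `pick' & done == 0 & ipodate < `release'
            }
            list id1 ipodate id2 diff, sepby(id1)
            ```

            With these toy rows, control 1 is matched to the 1-Jan-1988 IPO, is unavailable to the 31-Dec-1990 IPO (which falls back to control 2), and is available again for the 1-Jan-1991 IPO, which mirrors the example given in #3.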


            Originally posted by Mike Lacy View Post
            Setting aside considerations of whether sampling with or without replacement is preferable, I have some code that I think does incidence density sampling without replacement. The overall strategy is to start with a file of both the cases and controls, and use it to make a file of all possible pairs of cases and controls. Then, only the pairs with a case that matches on the covariates are retained, and then the pairs that don't meet the risk set condition are dropped. At this point, using a loop over all sets of case-control pairs, a sort of greedy sampling is performed: The controls for each case are picked, then any other pairs involving those controls are deleted. I'm not certain that what I have done is right, or that it is the most efficient approach, but I think it's close and fast enough.

            Code:
            // Matched case-control sampling using incidence density sampling, with no replacement.
            //
            // Create example files of cases and controls to work with.
            // Example conditions
            clear
            set seed 33245
            local matchvars = "x y z"
            local maxtime = 50
            local pcase = 0.05 // proportion of case events among all observations
            local ControlsPerCase = 3
            set obs 100000 // total number of persons, cases and potential controls
            //
            //
            gen long id = _n // long, because an int cannot hold ids above 32,740
            gen int evtime = ceil(runiform() * `maxtime') // time of disease event
            replace evtime = . if runiform() > `pcase' // Many persons never have the disease event
            // Create variables beside event time on which cases and controls would be matched.
            foreach v of local matchvars {
                gen `v' = ceil(3*runiform()) // 3 values for each match variable
            }
            // End of preparing example data
            // *********************************************
            //
            // Within this file of cases and controls, everyone is a potential control to start with,
            // so save everyone in this file as a source of controls.
            compress // important to save memory
            tempfile filecase filectl
            rename id idctl
            rename evtime evtimectl
            // randomize the order of the controls
            gen rand = runiform()
            sort rand, stable
            drop rand
            save `filectl' // file of controls
            rename idctl idcase
            rename evtimectl evtimecase
            //
            //
            // Strip the current file down to just the cases.
            drop if missing(evtimecase)
            qui count
            di r(N) " event cases in file"
            //
            // Pair up each case with each of the potential controls that match on the matching variables.
            // We will worry about time of event, risk set, etc. later.
            joinby `matchvars' using `filectl'
            //
            //
            // Drop impossible pairs
            drop if (idcase == idctl) // self pairs
            drop if (evtimecase >= evtimectl) // control member is not in risk set
            //
            // A few details before we start incidence sampling
            gen rand = runiform()
            sort idcase rand, stable // randomize the order of the cc pairs
            drop rand evtime* x y z // don't need these anymore
            by idcase: gen byte first = (_n ==1) // just to count cases
            qui count if first ==1
            di r(N) " event cases that have a potential match after considering event time"
            //
            //
            // Keep the desired number of controls for each case. For each case,
            // remove her/his controls from all other case-control pairs so
            // as to give sampling w/o replacement.
            // I use a loop here over all case-control pairs , generally not Stata-ish,
            // but it seems like a good approach here.
            by idcase: gen int seqnum = _n // sequence number of the c-c pair within each case
            qui levelsof idcase, local(caselist) // a list of all the cases
            gen byte casedone = 0 // to mark each case as we process it.
            foreach c of local caselist {
                // Keep desired number of c-c pairs for this case
                qui drop if (idcase == `c') & (seqnum > `ControlsPerCase')
                qui replace casedone = 1 if (idcase == `c')
                //
                // Make a list of the controls just used. I used a clumsy approach with preserve/restore.
                preserve
                qui keep if (idcase == `c') // current case/control pairs
                local used "" // will hold the list of controls just used
                forval i = 1/`ControlsPerCase' {
                    local used = "`used' " + string(idctl[`i'])
                }
                restore
                //
                // Drop all remaining unexamined pairs that involve the controls just used
                local used = subinstr(ltrim("`used'"), " " , ",", .)
                qui drop if (casedone == 0 ) & inlist(idctl, `used')
            }
            // Report on number of cases and controls matched.
            by idcase: gen NCtl = _N
            tab NCtl if first, missing
            Regards, Mike
            Last edited by David Benson; 17 Feb 2019, 22:28.



            • #7
              Or, Mike could've been thinking of this post: Matching participants per cases and controls

              For a few more posts on matching without replacement (usually firms, but sometimes people), see here, here, here, and here



              • #8
                Thanks for your response David! I will look at the mentioned links.
