Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Match sample of firms with a control sample

    Dear Statalist,

    I have a group of firms ( a panel). They include my treatment group and some other firms of my population.
    I need to match the treatment group with rest of the firms in a way within same industry (variable sic)
    and within 10% (plus or minus) of size. So each firm can have one two three or none matching firms.
    After that I have to remove rest of the firms that are neither matching nor treatment.

    How can I do it ?

  • #2
    The code for matching is moderately complicated and it depends on the details of the data you are starting with. No description in words can provide adequate information. You must show example data, and you must use the -datatex- command to do that in order to get a useful answer. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

    When you post back with example data from -dataex-, please also indicate whether you want matching with or without replacement. In matching with replacement, the same firm can be a control form for more than one treatment firm. In matching without replacement any control firm can only serve as the control for a single treatment firm. Replacement without matching usually results in more cases having no available match, and there are no statistical advantages to replacement without matching. Some people, however, prefer it for aesthetic reasons that I have never understood.

    Also, since you are talking about panel data, can the same treatment firm have different controls at different times? Or does each firm just get a single control that is paired with it for the entire duration.

    Comment


    • #3
      Dear Clyde,

      Please find herewith the example.
      In this case Firms X,Y,Z or W are control firms and Firms A to H are the firms that needed to be matched with them based on size and industry.
      1) There can be new control firms enter or leave in any given year.
      2) There can be non-control firms enter or leave in any given year.
      3) There can be more than one non-control firms matched with control firms (e.g in 1991 both firms A and D are matched with firm Z)
      4) "
      can the same treatment firm have different controls at different times?
      " Here I made it possible for same treatment firm have different controls at different times. Yes, that's fine.

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input int year byte id str6 firm int(sic size) byte(controlfirm matches_sic) int(lowerbound upperbound) byte(wihin_10pct match matchid)
      1991  1 "Firm X" 5812 1000 1 .  900 1100 . . .
      1991  2 "Firm Y" 5400 2000 1 . 1800 2200 . . .
      1991  3 "Firm Z" 5990 3000 1 . 2700 3300 . . .
      1991  4 "Firm A" 5990 3200 0 1    .    . 1 1 3
      1991  5 "Firm B" 6000 3200 0 0    .    . 0 0 0
      1991  6 "Firm C" 5400 2100 0 1    .    . 1 1 2
      1991  7 "Firm D" 5400 2050 0 1    .    . 1 1 3
      1991  8 "Firm E" 5990 3400 0 1    .    . 0 0 0
      1992  1 "Firm X" 5812 1000 1 .  900 1100 . . .
      1992  2 "Firm Y" 5400 2000 1 . 1800 2200 . . .
      1992  3 "Firm Z" 5990 3000 1 . 2700 3300 . . .
      1992  4 "Firm A" 5990 3200 0 1    .    . 1 1 3
      1992  5 "Firm B" 6000 3200 0 0    .    . 0 0 0
      1992  6 "Firm C" 5400 2100 0 1    .    . 1 1 2
      1992  7 "Firm D" 5400 2050 0 1    .    . 1 1 3
      1992  9 "Firm H" 5812 1100 0 1    .    . 1 1 1
      1993 10 "Firm W" 6200 1500 1 . 1350 1650 . . .
      1993  2 "Firm Y" 5400 2000 1 . 1800 2200 . . .
      1993  3 "Firm Z" 5990 3000 1 . 2700 3300 . . .
      1993  4 "Firm A" 5990 3200 0 1    .    . 1 1 3
      1993  5 "Firm B" 6000 3200 0 0    .    . 0 0 0
      1993  6 "Firm C" 5400 2100 0 1    .    . 1 1 2
      1993  7 "Firm D" 5400 2050 0 1    .    . 1 1 3
      1993  9 "Firm H" 5812 1100 0 0    .    . 0 0 0
      end

      Last edited by krishantha Ainsworth; 26 Apr 2022, 22:01.

      Comment


      • #4
        Code:
        //  SEPARATE THE CASES AND CONTROLS
        preserve
        keep if controlfirm == 1
        drop controlfirm
        tempfile controls
        save `controls'
        
        restore
        keep if controlfirm == 0
        drop controlfirm
        
        //  MATCH THEM
        gen lower = 0.9*size
        gen upper = 1.1*size
        rangejoin size lower upper using `controls', by(sic year) suffix(0)
        
        //  ELIMINATE THE UNMATCHED
        drop if missing(id0)
        
        //  KEEP AT MOST THREE MATCHES
        by year firm, sort: keep if _n < = 3
        
        //  RE-ORGANIZE TO LONG
        gen year0 = year
        gen sic0 = sic
        drop upper lower
        rename (year-size) =1
        gen match_id = id1
        reshape long year id firm sic size, i(id1 year1) j(controlfirm)
        drop id1 year1
        sort id year

        Comment


        • #5
          Dear Clyde,

          Thank you so much for the reply. Here I have a few issues with the code.
          1) Where do you compare the non-control firs with the control firms ( whether they are in the same industry and within lower and upper bounds)
          2) When I run this code for the full data set it give me an error

          Code:
          reshape long year id firm sic size, i(companyid1 fyear1) j(franch)
          (note: j = 0 1)
          variable id does not uniquely identify the observations
              Your data are currently wide.  You are performing a reshape long.  You specified
              i(companyid1 fyear1) and j(franch).  In the current wide form, variable  id1
              year1 should uniquely identify the observations.  Remember this picture:...
          Then what I did was removed the duplicates
          Code:
          duplicates drop id1 year1, force
          duplicates drop id0 year0, force
          After that when I run the reshape code it gives me an output (50/50) matched but it seems some of the control firms are also missing in that output.

          For example when I run

          Code:
          tabulate sic if controlfirm == 1
          I can see a drop in control firms at the very beginning and after the last line of the code.
          However when I run only
          Code:
          duplicates drop id1 year1, force
          it allows me to run the reshape code but this time number of control firms are higher than the initial.


          Ideally what I expect is to have the same number of observations at the beginning and the end and allowing me to recognize matched firms though the variable "match".
          Last edited by krishantha Ainsworth; 27 Apr 2022, 08:57.

          Comment


          • #6
            The code in #4 ran without error messages in the example data you provided. In that data, the combination of id and year uniquely identified observations, and the code relied on that. Evidently in your real data, that does not hold. The following code removes that reliance so you won't have to remove anything. That said, based on what you show in #5, it seems that you have some instances where the same firm in the same year appears more than once in the data set and with conflicting information about the other variables. That might mean you have bad data. I suggest you run -duplicates tag id year, gen(flag)- and then -browse if flag- to see these observations and figure out what is going on. If these observations represent data errors, you should remove the incorrect ones. But you should also investigate the data management that led up to the creation of the data set and fix those problems, along with any others you find in the process.

            Code:
            * Example generated by -dataex-. For more info, type help dataex
            clear
            input int year byte id str6 firm int(sic size) byte controlfirm
            1991  1 "Firm X" 5812 1000 1
            1991  2 "Firm Y" 5400 2000 1
            1991  3 "Firm Z" 5990 3000 1
            1991  4 "Firm A" 5990 3200 0
            1991  5 "Firm B" 6000 3200 0
            1991  6 "Firm C" 5400 2100 0
            1991  7 "Firm D" 5400 2050 0
            1991  8 "Firm E" 5990 3400 0
            1992  1 "Firm X" 5812 1000 1
            1992  2 "Firm Y" 5400 2000 1
            1992  3 "Firm Z" 5990 3000 1
            1992  4 "Firm A" 5990 3200 0
            1992  5 "Firm B" 6000 3200 0
            1992  6 "Firm C" 5400 2100 0
            1992  7 "Firm D" 5400 2050 0
            1992  9 "Firm H" 5812 1100 0
            1993 10 "Firm W" 6200 1500 1
            1993  2 "Firm Y" 5400 2000 1
            1993  3 "Firm Z" 5990 3000 1
            1993  4 "Firm A" 5990 3200 0
            1993  5 "Firm B" 6000 3200 0
            1993  6 "Firm C" 5400 2100 0
            1993  7 "Firm D" 5400 2050 0
            1993  9 "Firm H" 5812 1100 0
            end
            
            //  SEPARATE THE CASES AND CONTROLS
            preserve
            keep if controlfirm == 1
            drop controlfirm
            tempfile controls
            save `controls'
            
            restore
            keep if controlfirm == 0
            drop controlfirm
            gen tuple_id = _n
            
            //  MATCH THEM
            gen lower = 0.9*size
            gen upper = 1.1*size
            rangejoin size lower upper using `controls', by(sic year) suffix(0)
            
            //  ELIMINATE THE UNMATCHED
            drop if missing(id0)
            
            //  KEEP AT MOST THREE MATCHES
            by tuple_id, sort: keep if _n < = 3
            
            //  RE-ORGANIZE TO LONG
            drop upper lower
            rename (id firm size) =1
            reshape long id firm size, i(tuple_id) j(controlfirm)
            sort id year
            Where do you compare the non-control firs with the control firms ( whether they are in the same industry and within lower and upper bounds)
            Look at the -rangejoin- command. That's where that happens.

            By the way, I forgot to mention that -rangejoin- is written by Robert Picard and is available from SSC. To use it you also need the -rangestat- command, by Robert Picard, Nick Cox, and Roberto Ferrer. It is also available from SSC.



            Comment


            • #7
              Dear Clyde,

              I had a look into my actual data. When I check for duplicates -duplicates tag id year, gen(flag)- at beginning there are no id year duplicates.
              however when I reach to the code -reshape long id firm size, i(tuple_id) j(controlfirm)- somehow it creates duplicates in terms of
              year and id1 that I found through -duplicates tag id year, gen(flag)-. They are created while running the code up to reshape.

              So the reshape again generates an error like this.

              Code:
              . reshape long id firm size, i(tuple_id) j(controlfirm)
              (note: j = 0 1)
              variable id does not uniquely identify the observations
                  Your data are currently wide.  You are performing a reshape long.  You specified
                  i(tuple_id) and j(controlfirm).  In the current wide form, variable tuple_id should
                  uniquely identify the observations.  Remember this picture:
              
                       long                                wide
                      +---------------+                   +------------------+
                      | i   j   a   b |                   | i   a1 a2  b1 b2 |
                      |---------------| <--- reshape ---> |------------------|
                      | 1   1   1   2 |                   | 1   1   3   2  4 |
                      | 1   2   3   4 |                   | 2   5   7   6  8 |
                      | 2   1   5   6 |                   +------------------+
                      | 2   2   7   8 |
                      +---------------+
                  Type reshape error for a list of the problem observations.
              I can remove them by -duplicates drop year id, force- and run reshape but it also reduces contolfirms.

              The differences in the actual data than the example are:
              1) There can be unmatched control firms in the actual with non-controls as there is no suitable firm to match
              3) The number of observations each year could be different.

              I'm not sure if above contribute to forming duplicates when reach to the command reshape from the original data.

              Comment


              • #8
                Sorry, this is an error in my code. The part I wrote about re-organizing the data to long would be correct if we were keeping only a single match to each case. And in your example data, that is what happened. But it breaks with multiple matching. The duplicates you are seeing are actually supposed to be there: they are the multiple matches to the same case. So, the code below will properly overcome this, not be removing them, but by distinguishing them for -reshape-:

                Code:
                //  SEPARATE THE CASES AND CONTROLS
                preserve
                keep if controlfirm == 1
                drop controlfirm
                tempfile controls
                save `controls'
                
                restore
                keep if controlfirm == 0
                drop controlfirm
                gen tuple_id = _n
                
                //  MATCH THEM
                gen lower = 0.9*size
                gen upper = 1.1*size
                rangejoin size lower upper using `controls', by(sic year) suffix(0)
                
                //  ELIMINATE THE UNMATCHED
                drop if missing(id0)
                
                //  KEEP AT MOST THREE MATCHES
                by tuple_id, sort: keep if _n < = 3
                
                //  RE-ORGANIZE TO LONG
                drop upper lower
                rename (id firm size) =1
                gen _j = id0
                reshape long id firm size, i(tuple_id _j) j(controlfirm)
                drop _j
                sort id year
                Last edited by Clyde Schechter; 27 Apr 2022, 22:59.

                Comment


                • #9
                  That's fine Clayed.

                  But now when I run this no error pops up.
                  I get a final data sample exactly double the size of controls and each industry has same number of control and non-control firms.
                  This is strange ( I'm not sure how I exactly got somethig like that as for some firms there are multiple matches and for some firms there are no matches).

                  Next, most firms I marked as control in the original sample (i.e controlfirm = 1) ends up as
                  non-control firms (i.e controlfirm = 0) after processing data .

                  I also don't get a column "match" which says which non-control firm matches to which control firm.
                  Last edited by krishantha Ainsworth; 27 Apr 2022, 23:32.

                  Comment


                  • #10
                    I get a final data sample exactly double the size of controls and each industry has same number of control and non-control firms.
                    This is strange ( I'm not sure how I exactly got somethig like that as for some firms there are multiple matches and for some firms there are no matches).
                    I think this is likely just a coincidence.

                    Next, most firms I marked as control in the original sample (i.e controlfirm = 1) ends up as
                    non-control firms (i.e controlfirm = 0) after processing data .
                    Ah, you are right. I reversed the coding of the controlfirm variable. I think I did that because usually when I do these matching operations I have a variable for case = 1 control = 0, whereas you are working the other way around. Anyway, I think the simplest solution is to simply
                    Code:
                    replace controlfirm = !controlfirm
                    at the end of the code, rather than making the changes at several places along the way.

                    I also don't get a column "match" which says which non-control firm matches to which control firm.
                    Well, I didn't call it match. But the variable tuple_id gives this information. If you sort the data by tuple_id, you will see that each group of observations with the same tuple_id contains one case and up to three controls that are matched to it. If you want a variable that presents this same information differently, as giving the id number of the case to which each control matches, you can do this (after you run the command above to fix the controlfirm variable):

                    Code:
                    by tuple_id, sort: egen matching_case = max(cond(controlfirm == 0, id, .)
                    In most analyses of matched data, however, you will not find this variable useful. Most matched-pair analysis commands (all that I am personally familiar with) identify a group of matches by a tuple_id, not by a variable identifying the case. But perhaps you have some other use for a case-identifier variable.

                    Comment


                    • #11



                      Originally posted by Clyde Schechter View Post
                      I think this is likely just a coincidence.


                      Ah, you are right. I reversed the coding of the controlfirm variable. I think I did that because usually when I do these matching operations I have a variable for case = 1 control = 0, whereas you are working the other way around. Anyway, I think the simplest solution is to simply
                      Code:
                      replace controlfirm = !controlfirm
                      at the end of the code, rather than making the changes at several places along the way.


                      Well, I didn't call it match. But the variable tuple_id gives this information. If you sort the data by tuple_id, you will see that each group of observations with the same tuple_id contains one case and up to three controls that are matched to it. If you want a variable that presents this same information differently, as giving the id number of the case to which each control matches, you can do this (after you run the command above to fix the controlfirm variable):

                      Code:
                      by tuple_id, sort: egen matching_case = max(cond(controlfirm == 0, id, .)
                      In most analyses of matched data, however, you will not find this variable useful. Most matched-pair analysis commands (all that I am personally familiar with) identify a group of matches by a tuple_id, not by a variable identifying the case. But perhaps you have some other use for a case-identifier variable.
                      It seems however that's not quite so the issue. Some original controls (i.e. controlfirms == 1) remain the same and some have changed. Another thing to note is that, despite the fact that no errors occur, duplicates are created in this sample after running the code. I think that what makes them equal at the end. There are repeated id year combinations in this sample after running this code. So I believe it has also removed some unmatched control firms.
                      A sample is here. Here Firm D has four entries for 2000. They match with different firms and two of them are below.
                      Code:
                       
                      tuple_id controlfirm year firm size id sic matching_case
                      8810 1 2000 Firm D 85.929 2163 5812 28511
                      9664 1 2000 Firm D 85.929 2163 5812 30818
                      4093 1 2000 Firm D 85.929 2163 5812 14292
                      13928 1 2000 Firm D 85.929 2163 5812 65434
                      tuple_id controlfirm year firm size id sic matching_case
                      8810 0 2000 Firm AB 89.575 28511 5812 28511
                      tuple_id controlfirm year firm size id sic matching_case
                      9664 0 2000 Firm AC 79.37 30818 5812 30818
                      If this panel full matching is prone to errors is there a way I can match year by year without this rangejoin command ?.
                      Last edited by krishantha Ainsworth; 28 Apr 2022, 01:36.

                      Comment


                      • #12
                        Some original controls (i.e. controlfirms == 1) remain the same and some have changed.
                        That does not happen in the example data you showed, and I do not see where the code would have introduced that error. Nevertheless, in the code at bottom, I have revised the code so that it does not change the controlfirm variable at all during processing. This should eliminate this problem. If it does not, please post back with a new data example which exhibits the problem so I can troubleshoot it.

                        What you show about firm D are not errors. Firm D is a control firm. It is matched to multiple cases. This is what matching with replacement does. I asked you in #2 whether you wanted to match with or without replacement. You didn't respond to that question. Since I prefer matching with replacement (because it is easier to code, leaves fewer cases unmatched, and because it has no statistical disadvantages compared to matching without replacement) I chose to go that route.

                        That said, it does seem that sometimes control firms are being reused unnecessarily and are "crowding out" some others that "never get a chance" even though they are suitable matches to some case. The code shown at the bottom here mitigates that problem by randomizing which controls are kept for each case. I should have done this in the first place.

                        So I believe it has also removed some unmatched control firms.
                        Yes, any unmatched control firms, and any unmatched case firms are removed from the data by the code. I understood this to be what you wanted. And I have not changed it.

                        If this panel full matching is prone to errors is there a way I can match year by year without this rangejoin command ?
                        Well, except for the issue with controlfirms being wrong, these are not, as I have just explained, errors. And in any case they have nothing to do with full sample vs year-by-year. In fact, the matching already is year-by-year because I included year in the -by()- option of -rangejoin-. You can verify for yourself in the results that all observations in any tuple have the same value of year. (I had assumed you wanted year by year matching, although looking back over the post I see now that you did not actually say that.)

                        Code:
                        * Example generated by -dataex-. For more info, type help dataex
                        clear
                        input int year byte id str6 firm int(sic size) byte controlfirm
                        1991  1 "Firm X" 5812 1000 1
                        1991  2 "Firm Y" 5400 2000 1
                        1991  3 "Firm Z" 5990 3000 1
                        1991  4 "Firm A" 5990 3200 0
                        1991  5 "Firm B" 6000 3200 0
                        1991  6 "Firm C" 5400 2100 0
                        1991  7 "Firm D" 5400 2050 0
                        1991  8 "Firm E" 5990 3400 0
                        1992  1 "Firm X" 5812 1000 1
                        1992  2 "Firm Y" 5400 2000 1
                        1992  3 "Firm Z" 5990 3000 1
                        1992  4 "Firm A" 5990 3200 0
                        1992  5 "Firm B" 6000 3200 0
                        1992  6 "Firm C" 5400 2100 0
                        1992  7 "Firm D" 5400 2050 0
                        1992  9 "Firm H" 5812 1100 0
                        1993 10 "Firm W" 6200 1500 1
                        1993  2 "Firm Y" 5400 2000 1
                        1993  3 "Firm Z" 5990 3000 1
                        1993  4 "Firm A" 5990 3200 0
                        1993  5 "Firm B" 6000 3200 0
                        1993  6 "Firm C" 5400 2100 0
                        1993  7 "Firm D" 5400 2050 0
                        1993  9 "Firm H" 5812 1100 0
                        end
                        
                        //  SEPARATE THE CASES AND CONTROLS
                        preserve
                        keep if controlfirm == 1
                        tempfile controls
                        save `controls'
                        
                        restore
                        keep if controlfirm == 0
                        gen tuple_id = _n
                        
                        //  MATCH THEM
                        gen lower = 0.9*size
                        gen upper = 1.1*size
                        rangejoin size lower upper using `controls', by(sic year) suffix(1)
                        
                        //  ELIMINATE THE UNMATCHED
                        drop if missing(id1)
                        
                        //  KEEP AT MOST THREE MATCHES SELECTED AT RANDOM
                        set seed 1234 // OR YOUR FAVORITE RANDOM NUMBER SEED
                        gen double shuffle = runiform()
                        by tuple_id (shuffle), sort: keep if _n < = 3
                        
                        //  RE-ORGANIZE TO LONG
                        drop upper lower
                        rename (id firm size controlfirm) =0
                        gen matching_case_id = id0
                        assert controlfirm0 == 0 & controlfirm1 == 1
                        reshape long id firm size controlfirm, i(tuple_id matching_case_id)
                        drop _j shuffle
                        Note: The revisions from previous code here are fairly extensive, so I have not put them in bold face. Please use this code as a whole, rather than editing earlier code.

                        Comment


                        • #13
                          I am working with an almost similar data set for stunting on 189 cases and 380 controls which i have been trying to match without replacement on cases and gender unsuccessfully. Could you please offer any commands to fit since each time I do i end up with only cases while the controls disappeared? I was working after modifying this command for my situation but it did not work thanks.
                          local totalcases = 100 // for example local totalcontrols = 10000 local nctl = 4 // I'm choosing 4 controls per case // mock control data set seed 8846 set obs `totalcontrols' gen int id_ctl = _n gen byte case = 0 gen byte age = 20 +ceil(65*runiform()) // broad age range assumed gen int sex = runiform() > 0.5 tempfile controls save `controls' clear // mock case data set obs `totalcases' gen int id = _n gen byte case = 1 gen byte age = 20 +ceil(65*runiform()) gen sex = runiform() > 0.5 tempfile cases save `cases' // ************************************************** ***************************** // Actual solution starts here use `controls' gen rand = runiform() sort rand // random order for controls drop rand save `controls', replace // use `cases' // exact match on sex, within +/- 3 years for age compress rangejoin age -3 3 using `controls', by(sex) drop *_U // clean up rename (id age sex) (id_case age_case sex_case) // clean up // Sample `nctl' controls w/o replacement bysort id_ctl: keep if _n ==1 // use each control only once bysort id_case: keep if _n <= `nctl' // keep up to `nctl' controls for each case // // put the control data onto the file and check it out merge 1:1 id_ctl using `controls', keep(match) rename (age sex) (age_ctl sex_ctl) // Check how many controls were found for every case bysort id_case: gen byte numcontrols = _N if _n ==1 tab numcontrols

                          Comment

                          Working...
                          X