Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Matching sample of treatment firms to subset of non-treatment firms

    Hello statalist,

    I have a group of treatment firms and a group of non-treatment firms. I want to match each treatment firm to 1 non-treatment firm. As this is a step i want to do before a diff-in-diff model, I'm not interested in using a teffects approach.

    For the variables country_iso (country of operation), bvd_sector (sector of operation) i want to find exact matches. For the variable last_avail_year (last year of available data) I'd like to allow +-1 year for matches.

    When the treatment firms are matched to potential control firms on abovestanding criterias, i want to find the nearest neighbors on the following variables: incorporation_year (year firm was incorporated), revenue_last (last available revenue data) and pre2005_appln (patent applications before 2005).

    Treatment firms have ETS == 1, and non-treatment have ETS == 0.

    To be able to evaluate the quality of the matches, I would like to generate some variable calculating the percentage difference on the variables used for nearest neighbour matching.

    After having found the control firms, I'd like to remove the non-treated firms that are not matched.

    How should I go about this?

    Kind regards, Lorens





  • #2
    Forgot to add dataex, here it is:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str2 country_iso str56 bvd_sector float last_avail_year int incorporation_year double revenue_last float(pre2005_appln ETS)
    "DE" "Wood, Furniture & Paper Manufacturing"                         2021 1992      264737.15918 0 1
    "DE" "Utilities"                                                     2021 1992       26191.54223 0 1
    "AT" "Metals & Metal Products"                                       2021 1912      455662.13337 0 1
    "BE" "Leather, Stone, Clay & Glass products"                         2021 1989         80631.344 0 1
    "CZ" "Utilities"                                                2010.3334 1992 157756.0815473803 0 1
    "DE" "Metals & Metal Products"                                       2021 1995      706859.18635 1 1
    "DE" "Chemicals, Petroleum, Rubber & Plastic"                        2021 1985      330023.62171 0 1
    "DE" "Utilities"                                                   2010.5 1906          384968.5 0 1
    "BG" "Public Administration, Education, Health Social Services"      2020 2000  8.18048788306965 0 0
    "CZ" "Business Services"                                             2021 2004                 0 0 0
    "BG" "Metals & Metal Products"                                       2000 1996  1.53388362365669 0 0
    "DE" "Business Services"                                             2021 2000        5920.74095 0 0
    "CZ" "Public Administration, Education, Health Social Services"      2016 1994  12.9504641570166 0 0
    "CZ" "Business Services"                                             2016 1996  3.70013261629046 0 0
    "BG" "Retail"                                                        2019 2003  15.3386848414343 0 0
    "CZ" "Business Services"                                             2013 1994  3.64487074379563 0 0
    "BG" "Media & Broadcasting"                                          2012 1996  1.53259701973152 0 0
    "BG" "Retail"                                                        2021 2003  966893.510843924 0 0
    "CZ" "Food & Tobacco Manufacturing"                                  2016 2002  740.026523258092 0 0
    "DE" "Industrial, Electric & Electronic Machinery"                   2020 1989              1500 0 0
    "AT" "Industrial, Electric & Electronic Machinery"                   2021 1989              1309 0 0
    end

    Comment


    • #3
      As stated, your problem has, in general, no solution. The difficulty is that you are asking for nearest neighbor match on two different variables. But usually the nearest neighbor on one will not be the nearest on the other and you provide no criterion for selecting which is more important or weighting their distances in some way. If you post back with clarification on that (and with a data example that includes some matching cases and controls) I can extend the code shown below.

      The first part, identical matching on country and sector with last_avail_year within 1 is done as follows:
      Code:
      //    SPLIT DATA INTO TREATMENT AND CONTROL SUBSETS
      preserve
      keep if ETS == 0
      tempfile controls
      save `controls'
      
      restore
      keep if ETS == 1
      
      //    PUT THEM TOGETHER
      rangejoin last_avail_year -1 1 using `controls', by(country_iso bvd_sector) suffix(_ctrl)
      As it happens, in the example you showed, there are no satisfactory matches available, so the result is empty. But presumably in your full data set matches are there to be found.

      -rangejoin- is by Robert Picard and is available from SSC. To use it you must also install -rangestat-, by Robert Picard, Nick Cox, and Roberto Ferrer, also available from SSC.

      Comment


      • #4
        Hi Clyde,

        Thank you for your response, it is greatly appreciated.

        What you're saying with regards to a nearest neighbour approach not being able to chose the best match, when different observations are the nearest neighbour on different variables, makes a lot of sense. I will read a bit about weighting to consider how I might want to go about it, and i will get back to you.

        I don't quite understand what you mean with 'a data example that includes some matching cases and controls' - by matching cases, do you mean i should list an example of data that is able to be matched?

        Once again, thank you very much for your time.

        Comment


        • #5
          do you mean i should list an example of data that is able to be matched?
          Yes. Or more precisely, it would be helpful to have an example of data where some of the cases (ETS == 1) can be matched and some others cannot.

          Added: If you sort your data by country sector last year available and ETS it shouldn't be hard to pick out pairs that can match.
          Last edited by Clyde Schechter; 09 May 2023, 08:31.

          Comment


          • #6
            Hi again Clyde,

            Assume for now that I'd prioritise the three variables as follows: 1) revenue_last, 2) pre2005_appln 3) incorporation_year

            With 1) being the most important.


            Here a dataex, where first obs. can be matched but the last can't.
            Code:
            * Example generated by -dataex-. For more info, type help dataex
            clear
            input str2 country_iso str56 bvd_sector float last_avail_year int incorporation_year double revenue_last float(pre2005_appln ETS)
            "AT" "Wholesale"                               2021 1896               3100 0 1
            "AT" "Wholesale"                               2021 1979              11945 0 0
            "AT" "Retail"                                  2001 1989            145.346 0 0
            "AT" "Travel, Personal & Leisure"              2015 2003                 72 0 0
            "AT" "Retail"                                  2018 1999                500 0 0
            "AT" "Business Services"                       2021 2002               2474 0 0
            "AT" "Construction"                            2021 1952        68813.10668 0 0
            "AT" "Banking, Insurance & Financial Services" 2016 1910       17978.916035 0 0
            "AT" "Wholesale"                               2021 1993         3856.78106 0 0
            "AT" "Travel, Personal & Leisure"              2023 1900                148 0 0
            "AT" "Transport, Freight & Storage"            2021 1996              10000 0 0
            "AT" "Construction"                            2001 1999             65.406 0 0
            "AT" "Banking, Insurance & Financial Services" 2014 1901 55659.379824999996 0 0
            "AT" "Business Services"                       2022 1972       216964.36346 0 0
            "AT" "Retail"                                  2010 1900                250 0 0
            "AT" "Business Services"                       2021 1997         3972.35831 0 0
            "AT" "Wholesale"                               2019 1995               4500 0 0
            "AT" "Computer Software"                       2022 1997                654 0 0
            "AT" "Construction"                            2021 1996               3121 0 0
            "AT" "Banking, Insurance & Financial Services" 2014 1916        64804.26751 0 0
            "AT" "Retail"                                  2022 2000                140 0 0
            "AT" "Construction"                            2021 1994           683.2778 0 0
            "AT" "Chemicals, Petroleum, Rubber & Plastic"  2021 1994       195480.27224 0 1
            end


            Comment


            • #7
              Thanks.
              Code:
              //    SPLIT DATA INTO TREATMENT AND CONTROL SUBSETS
              preserve
              keep if ETS == 0
              tempfile controls
              save `controls'
              
              restore
              keep if ETS == 1
              gen `c(obs_t)' pair_num = _n
              
              //    PUT THEM TOGETHER
              rangejoin last_avail_year -1 1 using `controls', by(country_iso bvd_sector) suffix(_ctrl)
              drop if missing(ETS_ctrl)
              
              //    FROM POTENTIAL MATCHES CHOOSE CLOSEST ON REVENUE_LAST, PRE2005_APPLN, INCORPORATION_YEAR
              foreach v of varlist revenue_last pre2005_appln incorporation_year {
                  gen d_`v' = abs(`v' - `v'_ctrl)
              }
              by pair_num (d_revenue_last d_pre2005_appln d_incorporation_year), sort: keep if _n == 1
              
              //    RESTORE LONG DATA LAYOUT
              ds *_ctrl
              local vbles `r(varlist)'
              local vbles: subinstr local vbles "_ctrl" "", all
              rename (`vbles') =_case
              reshape long `vbles', i(pair_num) j(case_ctrl) string
              drop case_ctrl

              Comment


              • #8
                This worked well, thank you once again.

                I'm sorry to burden you with one last question, I do hope it's okay.

                Say that I'm instead interested in selecting a nearest neighbour based on propensity scores, rather than prioritising the importance of variables. I.e., within the group of potential control firms for each treated firm (following the rangejoin command from your code), matching a given treated firm to the untreated firm closest in propensity score. Specifically, I'd like to calculate pscores on characteristics (revenue_last, incoporation_year & pre2005_appln) only for firms within the same group (equal on country, sector and last_avail_year +-1). Here's a dataex of a treatment firm, 3B-FIBREGLASS, and a list of untreated firms equal on country_iso, bvd_sector and last_avail_year +-1.

                Code:
                * Example generated by -dataex-. For more info, type help dataex
                clear
                input str486 company_case int incorporation_year_case double revenue_last_case float(pre2005_appln_case ETS_case) int pair_num str486 company_ctrl int incorporation_year_ctrl double revenue_last_ctrl float(pre2005_appln_ctrl ETS_ctrl)
                "3B-FIBREGLASS" 1999 172449.148 0 1 187 "NERVA"                            1963 10313.694 0 0
                "3B-FIBREGLASS" 1999 172449.148 0 1 187 "COLSON"                           1988   937.638 0 0
                "3B-FIBREGLASS" 1999 172449.148 0 1 187 "ALPHA BETON"                      1993  9849.227 0 0
                "3B-FIBREGLASS" 1999 172449.148 0 1 187 "V.V.M."                           1990 73574.041 0 0
                "3B-FIBREGLASS" 1999 172449.148 0 1 187 "VESUVIUS BELGIUM"                 1973 38312.231 0 0
                "3B-FIBREGLASS" 1999 172449.148 0 1 187 "STONE IMMO"                       1989   620.305 0 0
                "3B-FIBREGLASS" 1999 172449.148 0 1 187 "ALKERN VOR BETON"                 1989 16567.072 0 0
                "3B-FIBREGLASS" 1999 172449.148 0 1 187 "WIND - MOLDERS BETON"             1995 11049.093 0 0
                "3B-FIBREGLASS" 1999 172449.148 0 1 187 "VAN LOOVEREN"                     1984 16401.687 0 0
                "3B-FIBREGLASS" 1999 172449.148 0 1 187 "SIKA BELGIUM"                     1946 71404.109 0 0
                "3B-FIBREGLASS" 1999 172449.148 0 1 187 "E.B.R. (EBEMA BETON RIJKEVORSEL)" 1989 19934.685 0 0
                "3B-FIBREGLASS" 1999 172449.148 0 1 187 "C-CONCRETE"                       1997 40585.328 0 0
                "3B-FIBREGLASS" 1999 172449.148 0 1 187 "ACAP"                             1981 26366.433 0 0
                "3B-FIBREGLASS" 1999 172449.148 0 1 187 "SEVETON"                          1997 18143.717 0 0
                "3B-FIBREGLASS" 1999 172449.148 0 1 187 "TITAN CEMENT INTERNATIONAL S.A."  1902   2290265 0 0
                end
                --------------


                To be able to calculate pscores i need variation in the treatment variable, ETS - in other words, i need to restore the data to a long layout. However, since I haven't yet selected which untreated units should be used as controls, the treated firms appear numeral times meaning pair_num doesn't uniquely identify the observations.

                Do you have any suggestions as to how I should go about this?




                Comment


                • #9
                  Specifically, I'd like to calculate pscores on characteristics (revenue_last, incoporation_year & pre2005_appln) only for firms within the same group (equal on country, sector and last_avail_year +-1).
                  But that is not how propensity scores work. You can just go ahead and calculate propensity scores in the full data set, and then do an exact match on country and sector, with a +/- 1 year caliper match on last_avail_year, and then select from among those potential matches the ones with the closest corresponding propensity scores if you like. But propensity score matching itself doesn't support constraint to exact matching or calipers on individual variables.

                  Comment


                  • #10
                    Okay. Thank you for your time Clyde

                    Comment

                    Working...
                    X