Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • closest relative match based on multiple variables after exact matching

    Dear Statalisters,

    I have the following problem:
    I am trying to match two survey waves (let's call 2012 and 2014).
    I have information on the country(nuts), sector(nace_b), turnover in 2012 and employees in 2012. (so the 2012 figures are asked in the 2014 survey as well, regarding turnover and employees).
    So in theory, exact matching should not be much of a problem.
    Nevertheless, there might be a small deviance on the turnover and employee figures. (they might be answered by a different person, might be rounded slightly different in the two waves...).
    Therefore, and exact match might not return an appropriate number of matches.
    Let's say we match 2012 (master) to the 2014 data.

    From this problem, I formulate the following requirements for my match:

    -exact matching on nuts and nace_b (find for every 2012 observation all 2014 observations in the same country and industry)
    -keep only the 2014 observations on turnover and employee figures that deviate maximum a certain % from the 2012 figures (let's say with maximum 5% difference).
    -within these potential subgroups that fulfill the previous requirements, take the closest 2014 observation (to keep it simple: that has the least % of total deviance in employee and turnover data, combined).
    -As a last requirement, every 2014 observation should be matched to it's closest 2012 counterpart as well... (this is the part that confuses me the most).

    I have looked into the following commands:
    psmatch2
    teffects
    calipmatch
    cem
    ...

    But I cannot find how to use them appropriately to match my data as required.

    Herewith an example of the data generated by datatex:


    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str2 nuts str5 nace_b double turn12 float(emp12 year)
    "HU" "31"    16979712.259999998   5659 2014
    "HU" "23"              39522475  13174 2012
    "NO" "58"              11520000   3840 2014
    "ES" "52"              28719661   9573 2012
    "ES" "46"              56909291  18969 2012
    "DE" "22"              13906200   4635 2014
    "ES" "59-60"         1224437.48    408 2014
    "ES" "10-12"           16174637   5391 2012
    "DE" "28"               5540000   1846 2012
    "NO" "26"                209000     69 2014
    "ES" "73"               8702138   2900 2012
    "HR" "10-12"          232659000  77553 2014
    "PT" "49"             1700603.8    566 2014
    "ES" "62-63"            1163809    387 2012
    "DE" "26"              94100000  31366 2014
    "ES" "62-63"            7486122   2495 2014
    "ES" "23"    47452171.339999996  15817 2014
    "ES" "46"             178943368  59647 2014
    "BG" "46"                729113    243 2012
    "BG" "14-15"           17036.32      5 2014
    "DE" "25"               3650000   1216 2014
    "CZ" "13"              30904529  10301 2012
    "CZ" "31"             931757.54    310 2014
    "EE" "23"               1690129    563 2012
    "PT" "13"               6110728   2036 2012
    "ES" "62-63"            1627808    542 2012
    "CZ" "46"              15372063   5124 2014
    "DE" "22"              14190000   4730 2012
    "CZ" "24"             165548690  55182 2012
    "DE" "37-39"            2583280    861 2014
    "BG" "16"    315175.83999999997    105 2014
    "DE" "79"              10645000   3548 2012
    "NO" "62-63"            5090000   1696 2012
    "DE" "53"                341040    113 2014
    "DE" "64"            1.8332e+09 611066 2014
    "DE" "71"                663000    221 2012
    "DE" "27"              52000000  17333 2012
    "ES" "28"           17050399.66   5683 2014
    "CZ" "17"              30007714  10002 2012
    "BG" "25"                432560    144 2012
    "ES" "46"              20227923   6742 2012
    "BG" "14-15"            4449330   1483 2014
    "ES" "18"               1735012    578 2012
    "DE" "37-39"            1862000    620 2014
    "DE" "29"               1086000    362 2012
    "ES" "71"                277066     92 2014
    "CZ" "62-63"            2137898    712 2014
    "LT" "69-75"              44732     14 2012
    "ES" "19-20"            6101640   2033 2014
    "BG" "71"                736783    245 2012
    "DE" "29"               1064280    354 2014
    "ES" "42"              71051622  23683 2014
    "NO" "30"              29731000   9910 2012
    "ES" "21"              39389172  13129 2012
    "HU" "49"               6668902   2222 2014
    "ES" "31"            9495946.18   3165 2014
    "ES" "5-9"              2012564    670 2012
    "HU" "52"               3192252   1064 2012
    "CZ" "58"            3606504.86   1202 2014
    "DE" "81"              26698000   8899 2012
    "ES" "72+75"             152397     50 2014
    "ES" "10-12"           12804166   4268 2014
    "CZ" "26"               1946519    648 2012
    "PT" "73"               5490983   1830 2012
    "DE" "17"              25382000   8460 2014
    "CZ" "64"             366761899 122253 2012
    "ES" "23"           11951247.98   3983 2014
    "DE" "61"               1176000    392 2014
    "ES" "62-63"         1595251.84    531 2014
    "DE" "19-20"            9310000   3103 2014
    "DE" "25"              33781000  11260 2012
    "PT" "62-63"        61362084.56  20454 2014
    "DE" "62-63"             650000    216 2012
    "LT" "58-63"             513704    171 2012
    "DE" "37-39"             900000    300 2014
    "DE" "22"               2100000    700 2014
    "NO" "26"                209000     69 2012
    "ES" "10-12"            2454043    818 2012
    "LT" "13-15"            2075714    691 2014
    "EE" "10-12"        39682392.26  13227 2014
    "DE" "25"               1000000    333 2014
    "BG" "14-15"            1866244    622 2012
    "ES" "28"              11253027   3751 2012
    "CZ" "46"               8739672   2913 2012
    "BG" "46"               6931690   2310 2012
    "HU" "28"          415090643.38 138363 2014
    "LT" "24-25"          406238.42    135 2014
    "LT" "31-32"           39420181  13140 2012
    "CZ" "17"              30007714  10002 2014
    "NO" "71"              20237980   6745 2014
    "DE" "64"            1.8332e+09 611066 2012
    "NO" "33"               5890780   1963 2014
    "DE" "23"               1645420    548 2014
    "ES" "72+75"          3817565.5   1272 2014
    "ES" "74"               7394730   2464 2012
    "ES" "71"               8414352   2804 2012
    "HU" "62-63"           32595865  10865 2014
    "ES" "87"               1603969    534 2012
    "DE" "28"               5429200   1809 2014
    "ES" "71"               2360006    786 2012
    end

    Does anyone know how to properly execute this matching?
    Thanks in advance!
    Maarten

  • #2
    The following uses -joinby- to create a file identifying paired observations that are 1) the same on nuts and nace_b; 2) different on year; 3) defined as matched if close on both emp and turn. You then could pick out the 2014 or 2012 observations that you want to keep.

    Code:
    // An id variable is needed, but you didn't mention one
    gen int id = _n
    // A file with just year 2014
    preserve
    keep if year == 2014
    rename (turn12 emp12 id) (turn14 emp14 id14)
    tempfile temp14
    save `temp14'
    restore
    //
    // Just year 2012
    keep if year == 2012
    rename id id12
    //
    joinby nuts nace_b using `temp14'
    //
    local closepct = 5
    gen turnpct = 100 * abs(turn12-turn14)/turn12
    gen emppct = 100 * abs(emp12-emp14)/emp12
    gen match = (turnpct < `closepct') & (emppct < `closepct')

    Comment


    • #3
      Thank you for the fast reply, Mike!

      This piece of code adequately performs the first requirements of my matching very well, as it ends up with the pairwise combinations of all 2014 that can be matched to the 2012 observations (if fulfilling the X% deviation rules, as well as exact match on nace_b and nuts).

      Nevertheless the last matching requirement still triggers me as a beginning Stata user:

      1. Within all these potential matches for a 2012 observation (within X% of turn12 emp12, exact match on nace_b and nuts), we only want to retain the closest 2014 observation (let's say, least total % deviation compared to the 2012 observation on emp12 and turn12).
      2. If after this step, a 2014 observation is the closest match for multiple 2012 observations, only retain the match where the 2014 and 2012 observation are closest (let's say, also by comparing the least total % of deviation compared to the 2012 observation on emp12 and turn12).
      3. For all other 2012 observations where this 2014 observation was also the closest (but not as close as the 2012 observation that fulfilled this previous step), take the second closest match that fulfilled the previously stated requirements (that fulfilled the X%, nace_b, and nuts criterions).

      As such we end up with a matching procedure, where we ensure that every 2014 observation is only matched once to a 2012 observation (which might not be that important in case - control matching), but in our case is, given that we try to match the independents with the outcome variable 2 years later, for the same firm.

      Regards,
      Maarten

      Comment


      • #4
        Decide how you want to calculate a "deviation" index based on both emp12 and turn12. Then, 1. could be solved as:
        Code:
        bysort id12 (deviation): keep if _n == 1
        I'm not sure about 2., since it would seem to eliminate matches for 2012 observations. You might try something as simple as:
        Code:
        bysort id14 (deviation): keep if _n == 1
        but this would seem like it might eliminate matches you want to keep from 2012.

        Comment


        • #5
          Mike,
          Thanks again for your answer again.

          After going somewhat deeper on my last requirements of:
          **taking the closest 2014 observation for each 2012 observation
          **make sure every 2014 observation is only matched once to a 2012 observation (best match)

          I came up with the following loop to end the do file (based on a similar problem on statalist https://www.statalist.org/forums/for...ut-replacement). I have not tested this on my final dataset (due to confidentiality reasons I have no access yet), but at first sight it seems to do the job.
          (note: I call the relevant values in the 14 set: _control)

          Code:
          // An id variable is needed
          gen int ID = _n
          //
          // A file with just year 2014
          preserve
          keep if year == 2014
          rename (turn12 emp12 turnmar_1 ID) (turn12_control emp12_control ID_control turnmar_1_control)
          keep(turn12_control emp12_control ID_control turnmar_1_control nuts nace_b)
          tempfile temp14
          save `temp14'
          restore
          //
          // Just year 2012
          keep if year == 2012
          rename ID ID_12
          //
          joinby nuts nace_b using `temp14'
          //
          local closepct = 5
          gen turnpct = 100 * abs(turn12-turn12_control)/turn12
          gen emppct = 100 * abs(emp12-emp12_control)/emp12
          gen match = (turnpct < `closepct') & (emppct < `closepct')
          drop if match==0
          //
          //Compute relative difference taking into account both non-exact matched variables
          gen absdiffemp = abs(emp12-emp12_control)
          gen reldiffemp = (absdiffemp/emp12)*100
          drop absdiffemp
          //
          gen absdiffturn = abs(turn12-turn12_control)
          gen reldiffturn =(absdiffturn/turn12)*100
          drop absdiffturn
          //
          gen reldiff = reldiffemp + reldiffturn
          //
          //Give priority to observations with the best matches on emp12 and turn12 together (small score on reldiff)
          //Sort according to relative difference (reldiff), take the first occurrence of a 2012 firm ('treatment firm'), then delete all other matches for (a) that treatment firm (b) that matched firm
          //This satisfies the requirements:
           
          *Exact match on nuts and nace_b.
          *Keep only 2014 observations with a max X % deviation on emp12 and turn12.
          *Take closest 2014 observation.
          *And make sure every 2014 observations is matched only once (no replacement) to a 2012 observation.
          
          sort sizediff
          
          local N = _N
          forvalues i = 1/`N' {
              local N2 = _N
              local j = `i'+1
              while `j' <= `N2' {
                  if ID_12[`j'] == ID_12[`i'] {
                      drop in `j'
                      local j = `j' - 1
                  }
                  else if ID_control[`j'] == ID_control[`i'] {
                      drop in `j'
                      local j = `j' - 1
                  }
                  local j = `j'+1
              }
          }
          Regards,
          Maarten
          Last edited by Maarten Cerpentier; 27 Aug 2019, 07:59.

          Comment

          Working...
          X