Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merging datasets with one to many mapping

    Hi All,

    My first dataset is an input-output matrix, that resembles the following:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str6(commodity industry) float coefficient
    "1111A0" "1111A0"   .06733
    "1111A0" "112100"   .00023
    "1111A0" "112A00"   .00022
    "1111A0" "311221"   .01702
    "1111A0" "311222"   .67772
    "1111A0" "311223"   .29505
    "1111A0" "311225"   .10594
    "1111A0" "311911"   .02347
    "1111A0" "325110"   .00357
    "1111A0" "325130"   .00014
    "1111A0" "325190"    .0016
    "1111A0" "325400"   .00173
    "1111B0" "1111B0"   .08212
    "1111B0" "111200"    .0022
    "1111B0" "1119B0"   .00014
    "1111B0" "112100"    .0841
    "1111B0" "112300"    .0461
    "1111B0" "112A00"    .1343
    "1111B0" "311111"   .04019
    "1111B0" "311119"   .17807
    "1111B0" "311211"   .45758
    "1111B0" "311212"   .45063
    "1111B0" "311213"   .46031
    "1111B0" "311221"   .38701
    "1111B0" "311230"   .17574
    "1111B0" "311410"   .00313
    "1111B0" "311420"   .00385
    "1111B0" "31181A"   .01323
    "1111B0" "311822"   .04506
    "1111B0" "311830"   .01813
    "1111B0" "311919"   .06364
    "1111B0" "311990"    .0102
    "1111B0" "312120"   .00063
    "1111B0" "312140"   .00637
    "1111B0" "325190"   .00496
    "1111B0" "339994"   .00395
    "1111B0" "622000" 1.00e-05
    "1111B0" "623000"   .00004
    "1111B0" "624A00"    .0001
    "1111B0" "721A00" 1.00e-05
    "1111B0" "722000"   .00415
    "1111B0" "813B00" 1.00e-05
    "111200" "111200"   .02997
    "111200" "1113A0"   .00211
    "111200" "112100"   .00037
    "111200" "112A00"    .0003
    "111200" "311410"   .05497
    "111200" "311420"   .04328
    "111200" "31181A"   .00031
    "111200" "311919"   .05312
    "111200" "311941"   .00425
    "111200" "311990"     .021
    "111200" "325190"   .00028
    "111200" "483000"   .00005
    "111200" "48A000" 1.00e-05
    "111200" "622000"   .00029
    "111200" "623000"   .00071
    "111200" "624A00"   .00073
    "111200" "713A00"   .00002
    "111200" "7211A0"        0
    "111200" "721A00"   .00014
    "111200" "722000"   .00479
    "111200" "813B00"   .00008
    "111200" "S00102"   .00128
    "111335" "111335"    .0085
    "111335" "311221"   .00006
    "111335" "311222"   .00006
    "111335" "311225"   .00006
    "111335" "311230"   .00424
    "111335" "311320"   .00252
    "111335" "311330"   .01665
    "111335" "311340"   .00298
    "111335" "311822"   .00168
    "111335" "311911"   .23132
    "1113A0" "1113A0"   .02051
    "1113A0" "311320"   .08949
    "1113A0" "311330"   .00612
    "1113A0" "311410"     .041
    "1113A0" "311420"   .05512
    "1113A0" "311911"   .00261
    "1113A0" "311920"   .34353
    "1113A0" "311930"   .01193
    "1113A0" "311941"    .0009
    "1113A0" "311942"   .00445
    "1113A0" "311990"   .07084
    "1113A0" "312130"   .13881
    "1113A0" "4A0000"   .00008
    "1113A0" "622000"   .00017
    "1113A0" "623000"   .00045
    "1113A0" "624A00"   .00134
    "1113A0" "713A00"        0
    "1113A0" "721A00"   .00005
    "1113A0" "722000"   .00063
    "1113A0" "813B00"        0
    "1113A0" "S00102"   .00022
    "111400" "1111A0"   .00084
    "111400" "1111B0"   .00046
    "111400" "111200"   .00081
    "111400" "111335"   .00082
    "111400" "1113A0"   .00079
    end

    Here, "commodity" and "industry" share the same identifiers. The coefficient measures the amount of commodity used to produce 1 dollar of final output in industry. This dataset however, uses its own identifiers, that I wish to convert to a separate set of identifiers, namely the NAICS industry classification codes. The conversion table is given as:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str6 industry long(naics1 naics2 naics3 naics4 naics5 nacs6 naics7 naics8)
    "1111A0"  11111  11112      .      .      .      .      .      .
    "1111B0"  11113  11114  11115  11116  11119      .      .      .
    "111200"   1112      .      .      .      .      .      .      .
    "1113A0"  11131  11132 111331 111332 111333 111334 111336 111339
    "111335" 111335      .      .      .      .      .      .      .
    "111400"   1114      .      .      .      .      .      .      .
    "111910"  11191      .      .      .      .      .      .      .
    "111920"  11192      .      .      .      .      .      .      .
    "1119A0"  11193 111991      .      .      .      .      .      .
    "1119B0"  11194 111992 111998      .      .      .      .      .
    "112100"   1121      .      .      .      .      .      .      .
    "112300"   1123      .      .      .      .      .      .      .
    "112A00"   1122   1124   1125   1129      .      .      .      .
    "113A00"   1131   1132      .      .      .      .      .      .
    "113300"   1133      .      .      .      .      .      .      .
    "114100"   1141      .      .      .      .      .      .      .
    "114200"   1142      .      .      .      .      .      .      .
    "115000"    115      .      .      .      .      .      .      .
    "211000"    211      .      .      .      .      .      .      .
    "212100"   2121      .      .      .      .      .      .      .
    "212210"  21221      .      .      .      .      .      .      .
    "212230"  21223      .      .      .      .      .      .      .
    "2122A0"  21222  21229      .      .      .      .      .      .
    "212310"  21231      .      .      .      .      .      .      .
    "212320"  21232      .      .      .      .      .      .      .
    "212390"  21239      .      .      .      .      .      .      .
    "213111" 213111      .      .      .      .      .      .      .
    "213112" 213112      .      .      .      .      .      .      .
    "21311A" 213113 213114 213115      .      .      .      .      .
    "221100"   2211      .      .      .      .      .      .      .
    "221200"   2212      .      .      .      .      .      .      .
    "221300"   2213      .      .      .      .      .      .      .
    "311111" 311111      .      .      .      .      .      .      .
    "311119" 311119      .      .      .      .      .      .      .
    "311211" 311211      .      .      .      .      .      .      .
    "311212" 311212      .      .      .      .      .      .      .
    "311213" 311213      .      .      .      .      .      .      .
    "311221" 311221      .      .      .      .      .      .      .
    "311222" 311222      .      .      .      .      .      .      .
    "311223" 311223      .      .      .      .      .      .      .
    "311225" 311225      .      .      .      .      .      .      .
    "311230"  31123      .      .      .      .      .      .      .
    "311310"  31131      .      .      .      .      .      .      .
    "311320"  31132      .      .      .      .      .      .      .
    "311330"  31133      .      .      .      .      .      .      .
    "311340"  31134      .      .      .      .      .      .      .
    "311410"  31141      .      .      .      .      .      .      .
    "311420"  31142      .      .      .      .      .      .      .
    "311511" 311511      .      .      .      .      .      .      .
    "311512" 311512      .      .      .      .      .      .      .
    "311513" 311513      .      .      .      .      .      .      .
    "311514" 311514      .      .      .      .      .      .      .
    "311520"  31152      .      .      .      .      .      .      .
    "311611" 311611      .      .      .      .      .      .      .
    "311612" 311612      .      .      .      .      .      .      .
    "311613" 311613      .      .      .      .      .      .      .
    "311615" 311615      .      .      .      .      .      .      .
    "311700"   3117      .      .      .      .      .      .      .
    "31181A" 311811 311812      .      .      .      .      .      .
    "311813" 311813      .      .      .      .      .      .      .
    "311821" 311821      .      .      .      .      .      .      .
    "311822" 311822      .      .      .      .      .      .      .
    "311823" 311823      .      .      .      .      .      .      .
    "311830"  31183      .      .      .      .      .      .      .
    "311911" 311911      .      .      .      .      .      .      .
    "311919" 311919      .      .      .      .      .      .      .
    "311920"  31192      .      .      .      .      .      .      .
    "311930"  31193      .      .      .      .      .      .      .
    "311941" 311941      .      .      .      .      .      .      .
    "311942" 311942      .      .      .      .      .      .      .
    "311990"  31199      .      .      .      .      .      .      .
    "312110"  31211      .      .      .      .      .      .      .
    "312120"  31212      .      .      .      .      .      .      .
    "312130"  31213      .      .      .      .      .      .      .
    "312140"  31214      .      .      .      .      .      .      .
    "312210"  31221      .      .      .      .      .      .      .
    "312221" 312221      .      .      .      .      .      .      .
    "312229" 312229      .      .      .      .      .      .      .
    "313100"   3131      .      .      .      .      .      .      .
    "313210"  31321      .      .      .      .      .      .      .
    "313220"  31322      .      .      .      .      .      .      .
    "313230"  31323      .      .      .      .      .      .      .
    "313240"  31324      .      .      .      .      .      .      .
    "313310"  31331      .      .      .      .      .      .      .
    "313320"  31332      .      .      .      .      .      .      .
    "314110"  31411      .      .      .      .      .      .      .
    "314120"  31412      .      .      .      .      .      .      .
    "314910"  31491      .      .      .      .      .      .      .
    "314992" 314992      .      .      .      .      .      .      .
    "31499A" 314991 314999      .      .      .      .      .      .
    "315111" 315111      .      .      .      .      .      .      .
    "315119" 315119      .      .      .      .      .      .      .
    "315190"  31519      .      .      .      .      .      .      .
    "315200"   3152      .      .      .      .      .      .      .
    "315900"   3159      .      .      .      .      .      .      .
    "316100"   3161      .      .      .      .      .      .      .
    "316200"   3162      .      .      .      .      .      .      .
    "316900"   3169      .      .      .      .      .      .      .
    "321113" 321113      .      .      .      .      .      .      .
    "321114" 321114      .      .      .      .      .      .      .
    end
    As can be seen, each industry/commodity corresponds in some cases to multiple NAICS codes (naics1-8). What I would like to do is to produce a final input-output table, from the original classification to the NAICS classification, bearing in mind these multiple potential matches. Is there any straightforward way to do this?


    Thanks,
    CS

  • #2
    Note that in your conversion table you have accidentally named your 6th NAICS code as "nacs6". In the example below I correct that, and I use a much smaller subset of the data just so that the example can be followed easily.
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str6 industry long(naics1 naics2 naics3 naics4 naics5 naics6 naics7 naics8)
    "1111A0"  11111  11112      .      .      .      .      .      .
    "112100"   1121      .      .      .      .      .      .      .
    "112A00"   1122   1124   1125   1129      .      .      .      .
    end
    reshape long naics, i(industry) j(num)
    drop if naics==.
    list, noobs sepby(industry) abbreviate(12)
    tempfile naics
    save `naics'
    
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str6(commodity industry) float coefficient
    "1111A0" "1111A0"   .06733
    "1111A0" "112100"   .00023
    "1111A0" "112A00"   .00022
    end
    
    merge 1:m industry using `naics'
    sort commodity industry naics
    drop num
    list, noobs sepby(commodity industry) abbreviate(12)
    Code:
    . * Example generated by -dataex-. To install: ssc install dataex
    . clear
    
    . input str6 industry long(naics1 naics2 naics3 naics4 naics5 naics6 naics7 naics8)
    
          industry        naics1        naics2        naics3        naics4        naics5        naic
    > s6        naics7        naics8
      1. "1111A0"  11111  11112      .      .      .      .      .      .
      2. "112100"   1121      .      .      .      .      .      .      .
      3. "112A00"   1122   1124   1125   1129      .      .      .      .
      4. end
    
    . reshape long naics, i(industry) j(num)
    (note: j = 1 2 3 4 5 6 7 8)
    
    Data                               wide   ->   long
    -----------------------------------------------------------------------------
    Number of obs.                        3   ->      24
    Number of variables                   9   ->       3
    j variable (8 values)                     ->   num
    xij variables:
                   naics1 naics2 ... naics8   ->   naics
    -----------------------------------------------------------------------------
    
    . drop if naics==.
    (17 observations deleted)
    
    . list, noobs sepby(industry) abbreviate(12)
    
      +------------------------+
      | industry   num   naics |
      |------------------------|
      |   1111A0     1   11111 |
      |   1111A0     2   11112 |
      |------------------------|
      |   112100     1    1121 |
      |------------------------|
      |   112A00     1    1122 |
      |   112A00     2    1124 |
      |   112A00     3    1125 |
      |   112A00     4    1129 |
      +------------------------+
    
    . tempfile naics
    
    . save `naics'
    file /var/folders/xr/lm5ccr996k7dspxs35yqzyt80000gp/T//S_16125.000001 saved
    
    . 
    . * Example generated by -dataex-. To install: ssc install dataex
    . clear
    
    . input str6(commodity industry) float coefficient
    
         commodity   industry  coeffic~t
      1. "1111A0" "1111A0"   .06733
      2. "1111A0" "112100"   .00023
      3. "1111A0" "112A00"   .00022
      4. end
    
    . 
    . merge 1:m industry using `naics'
    
        Result                           # of obs.
        -----------------------------------------
        not matched                             0
        matched                                 7  (_merge==3)
        -----------------------------------------
    
    . sort commodity industry naics
    
    . drop num
    
    . list, noobs sepby(commodity industry) abbreviate(12)
    
      +----------------------------------------------------------+
      | commodity   industry   coefficient   naics        _merge |
      |----------------------------------------------------------|
      |    1111A0     1111A0        .06733   11111   matched (3) |
      |    1111A0     1111A0        .06733   11112   matched (3) |
      |----------------------------------------------------------|
      |    1111A0     112100        .00023    1121   matched (3) |
      |----------------------------------------------------------|
      |    1111A0     112A00        .00022    1122   matched (3) |
      |    1111A0     112A00        .00022    1124   matched (3) |
      |    1111A0     112A00        .00022    1125   matched (3) |
      |    1111A0     112A00        .00022    1129   matched (3) |
      +----------------------------------------------------------+
    
    .

    Comment


    • #3
      Thanks a lot! William Lisowski

      Comment


      • #4
        Basically, in the first dataset, each industry corresponds to multiple NAICS codes. In the second dataset, many commodities are the input for one industry. So, when we merge the first dataset with the second, each commodity should be matched to the multiple definitions in the NAICS classification for the industry that it is an input to. So, for instance, if on average there were 4 NAICS codes per industry, the final dataset would be 4 times the size of the original input-output matrix.

        Comment


        • #5
          In this case, it seems that joinby might be more appropriate.

          Comment


          • #6
            You are correct, I oversimplified my example. The merge 1:m industry will not work because industries appear multiple times in the master dataset. Instead joinby industry should do the job.

            Comment

            Working...
            X