Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Removing duplicates with certain criteria

    Hello Everyone,

    I have a dataset that has variables like company identifier (ID), year, exchange rate and some financial variables. Ideally, one company for a given year should have corresponding financial variables (currency, assets, profit, sales, etc); however, in my dataset, there are several observations for the same company in a given year, with 2 differences - 1. in some observations not all financial variables are reported, and 2. in other observations, the financial variables are reported in both US dollars and in their national currencies. My goal is to remove those "duplicate" observations that have the least reported financial variables, as well as remove those observations that are in currencies other than US dollars & for a given year have equal financial reports (if they reported more financial variables, but in national currencies, I would prefer keeping them). Seems a bit complicated in terms of coming up with the right command, so I would highly appreciate any advice you would give.


    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input long ID double year str4 currency_code double(investments debt employees sales revenue income donations)
    1013 2001 "USD"    2499.7         3 12.042   -1287.7    2402.8    2402.8    278.6
    1013 2002 "USD"    1144.2      10.8    7.6     -1145    1047.7    1047.7    193.3
    1013 2003 "USD"    1296.9       400    5.7     -76.7     773.2     773.2    108.6
    1013 2004 "USD"    1428.1       400    7.5      16.4     784.3     784.3     59.1
    1013 2005 "USD"      1535       400    8.2     110.7    1169.2    1169.2     71.6
    1013 2006 "USD"    1611.4       400    8.6      65.7    1281.9    1281.9     72.4
    1013 2007 "USD"    1764.8     200.6   9.05     106.3    1322.2    1322.2     69.6
    1013 2008 "USD"      1921     650.7   10.6     -41.9    1456.4    1456.4     83.5
    1013 2009 "USD"    1343.6       651   9.05    -474.3     996.7     996.7     65.3
    1013 2010 "USD"    1474.5     650.8    9.3        62    1156.6    1156.6     69.7
    1045 2001 "USD"     32841      9834      .     -1762     18963     18963        .
    1045 2002 "USD"     30267     12310  109.6     -3511     17299     17299        .
    1045 2003 "USD"     29330     13126   96.4     -1228     17440     17440        .
    1045 2004 "USD"     28773     13524   92.1      -761     18645     18645        .
    1045 2005 "USD"     29495     13456   88.4      -861     20712     20712        .
    1045 2006 "USD"     29145     12041   86.6       231     22563     22563        .
    1045 2007 "USD"     28571     10093   85.5       504     22896     22896        .
    1045 2008 "USD"     25175      9001   84.1     -2071     23766     23766        .
    1045 2009 "USD"     25438     10583   78.9     -1468     19917     19917        .
    1045 2010 "USD"     25088      9253  78.25      -471     22170     22170        .
    1045 2011 "USD"     23848      6702   80.1     -1979     24022     24022        .
    1045 2012 "USD"     23510      7116  77.75     -1876     24855     24855        .
    1045 2013 "USD"     42278     15353  110.4     -1834     26712     26712        .
    1045 2014 "USD"     43771     16196  113.3      2882     42650     42650        .
    1045 2015 "USD"     48415     18330  118.5      7610     40990     40990        .
    1045 2016 "USD"     51274     22489  122.3      2676     40180     40180        .
    1045 2017 "USD"     51396     22511  126.6      1919     42207     42207        .
    1045 2018 "USD"     60580     29081  128.9      1412     44541     44541        .
    1045 2019 "USD"     59995     28875  133.7      1686     45768     45768        .
    1078 2001 "USD" 23296.423  4335.493 71.426   1550.39 16285.246 16285.246 2907.952
    1078 2002 "USD" 24259.102  4273.973 71.819  2793.703 17684.663 17684.663 1669.492
    1078 2003 "USD" 26715.342  3452.329 72.181  2753.233 19680.561 19680.561 1833.712
    1078 2004 "USD" 28767.494  4787.934   60.6  3235.851 19680.016 19680.016 1975.759
    1078 2005 "USD" 29141.203  4571.504 59.735  3372.065 22287.808 22287.808 1838.306
    1078 2006 "USD" 36178.172  7009.664 66.663  1716.755 22476.322 22476.322 4269.271
    1078 2007 "USD" 39713.924  9487.789     68  3606.314 25914.238 25914.238 2505.649
    1078 2008 "USD" 42419.204  8713.327     69  4880.719 29527.552 29527.552 2786.067
    1078 2009 "USD" 52416.623 11266.294     73  5745.838 30764.707 30764.707 2913.733
    1078 2010 "USD" 59462.266 12523.517     90  4626.172 35166.721 35166.721 4037.624
    1078 2011 "USD" 60276.893 12039.822     91  4728.449 38851.259 38851.259 4801.914
    1078 2012 "USD" 67234.944 18085.302     91   5962.92  39873.91  39873.91 4610.182
    1078 2013 "USD"     42953      3388     69      2576     21848     21848     1452
    1078 2014 "USD"     41275      3408     77      2284     20247     20247     1345
    1078 2015 "USD"     41247      5871     74      4423     20405     20405     1405
    1078 2016 "USD"     52666     20681     75      1400     20853     20853     1422
    1078 2017 "USD"     76250     27210     99       477     27390     27390     2235
    1078 2018 "USD"     67173     19359    103      2368     30578     30578     2300
    1078 2019 "USD"     67887     17416    107      3687     31904     31904     2440
    1161 2001 "USD"  5647.242   672.945 14.415   -60.581  3891.754  3891.754   650.93
    1161 2002 "USD"  5619.181  1779.837 12.146 -1303.012  2697.029  2697.029  816.114
    1161 2003 "USD"  7094.345  1899.674   14.3   -274.49  3519.168  3519.168  852.075
    1161 2004 "USD"   7844.21  1628.268   15.9    91.156  5001.435  5001.435  934.574
    1161 2005 "USD"  7287.779  1327.065   9.86   165.483  5847.577  5847.577 1144.025
    1161 2006 "USD"     13147      3672   16.5      -166      5649      5649     1621
    1161 2007 "USD"     11550      5031  16.42     -3379      6013      6013     1847
    1161 2008 "USD"      7675      4702   14.7     -3098      5808      5808     1848
    1161 2009 "USD"      9078      4252   13.4       376      5403      5403     1721
    1161 2010 "USD"      4964      2188   11.1       471      6494      6494     1405
    1161 2011 "USD"      4954      1527 11.093       491      6568      6568     1453
    1161 2012 "USD"      4000      2037  10.34     -1183      5422      5422     1354
    1161 2013 "HKD"      4337      1998 10.671       -83      5299      5299     1201
    1161 2013 "USD"      3767      2035  9.687      -403      5506      5506     1072
    1161 2015 "USD"      3109      2032  9.139      -660      3991      3991      947
    1161 2016 "USD"      3321      1435    8.2      -497      4272      4272     1008
    1161 2017 "USD"      3540      1325    8.9        43      5329      5329     1160
    1161 2018 "USD"      4556      1114   10.1       337      6475      6475     1434
    1161 2019 "USD"      6028       685   11.4       341      6731      6731     1547
    1246 2001 "USD"  6290.992  1965.716   37.6    15.205  5273.479  5273.479        .
    1246 2002 "USD"  6472.618  2090.178   33.2   150.334  4827.502  4827.502        .
    1246 2003 "USD"  6639.507  1982.558  30.25   116.017  4710.912  4710.912        .
    1246 2004 "USD"  4518.413  1164.725   29.4    83.694  4613.551  4613.551        .
    1246 2005 "USD"  3831.819   953.463   25.7    60.666  4377.305  4377.305        .
    1246 2006 "USD"  3231.699   657.583     25   106.202  4228.249  4228.249        .
    1246 2007 "USD"  3278.081   757.533     25   114.487  4168.344  4168.344        .
    1246 2008 "USD"  2966.334   737.622      .    91.335  4166.553  4166.553        .
    1246 2008 "CAD"     24226      4731    115       -99     23652     23652      832
    1300 2002 "USD"     27559      4719    108      -220     22274     22274      757
    1300 2003 "USD"     29344      4961    108      1324     23103     23103      751
    1300 2004 "USD"     31062      4069    109      1281     25601     25601      917
    1300 2005 "USD"     32294      3082    116      1655     27653     27653     1072
    1300 2006 "USD"     30941      3909    118      2083     31367     31367     1411
    1300 2007 "USD"     33805      5419    122      2444     34589     34589     1459
    1300 2008 "USD"     35490      5865    128      2792     36556     36556     1543
    1300 2009 "USD"     36004      6246    122      2153     30908     30908     1330
    1300 2010 "USD"     37834      5755    130      2022     33370     33370     1466
    1300 2010 "ZAR"     39808      6881    132      2067     36529     36529     1799
    1300 2012 "USD"     41853      6395    132      2926     37665     37665     1847
    1300 2013 "USD"     45435      6801    131      3924     39055     39055     1804
    1300 2014 "USD"     45451      6046    127      4239     40306     40306     1892
    1300 2015 "USD"     49316      5554    129      4768     38581     38581     1856
    1300 2016 "USD"     54146     12182    131      4809     39302     39302     2143
    1300 2017 "USD"     59387     12573    131      1655     40534     40534     1835
    1300 2018 "USD"     57773      9756    114      6765     41802     41802     1809
    1300 2019 "USD"     58679     11644    113      6143     36709     36709     1556
    1327 2001 "USD"   316.119      .106   .935   -18.286   126.502   126.502   41.578
    1327 2002 "USD"  1346.912   180.039      4  -236.064   457.769   457.769  198.103
    1327 2003 "USD"  1090.668       275    3.8  -451.416   617.789   617.789  151.762
    1327 2004 "USD"  1168.806       230    4.2    22.412   784.023   784.023  152.633
    1327 2005 "USD"  1187.843       230      4    25.611   792.371   792.371  152.215
    1327 2006 "USD"  1090.496   179.335      4   -88.152   778.788   778.788  164.106
    end


    Thank you in advance,
    Nick

  • #2
    Code:
    egen int amt_of_info = rownonmiss(currency_code-donations), strok
    gen byte in_usd = currency_code == "USD"
    
    by ID year (amt_of_info in_usd), sort: keep if _n == _N
    Added: You did not say what you want to do if there are two observations of the same ID and year that are tied for the largest number of non-missing financial variables and both are in USD, or both are not in USD. By the criteria you spelled out, these would be "tied" and this code will break those ties at random and irreproducibly. If there is some criterion you would apply to break those ties systematically instead, post back and I will try to incorporate it.
    Last edited by Clyde Schechter; 03 Apr 2023, 16:31.

    Comment


    • #3
      Dear Mr Schechter, thank you very much for the command; it works fine! Regarding the duplicates that have the same amount of variables and in the same currency, I would definitely like to remove all such cases as there is no theory behind why I need to keep one and not the other(s). Does the command need to be modified in that case? Again, thank you very much for your help!

      Comment


      • #4
        Code:
        egen int amt_of_info = rownonmiss(currency_code-donations), strok
        gen byte in_usd = currency_code == "USD"
        
        by ID year (amt_of_info in_usd), sort: gen priority = _n
        by ID year amt_of_info in_usd: replace priority = 0 if _N > 1
        by ID year (priority), sort: keep if _n == _N & priority > 0
        In this code, if there is a tie for the largest amount of information and same currency in some ID year combination, then the "best" observation that is not tied with any other will be used, or if all observations are tied, the entire ID year group of observations will be dropped.
        Last edited by Clyde Schechter; 05 Apr 2023, 18:15.

        Comment

        Working...
        X