
  • Panel Duplicates by Different Firm ID

    I have a panel database of firms. There are some cases where the parent and their subsidiary both report the exact same values for things like sales, assets, etc. I was able to see these by using the duplicates command, specifically

    Code:
     duplicates tag sales assets year, generate(dups)

    If I sort by sales (or assets) and list the data, I can clearly see there are many pairs of firms that have the same values in the same year. I get something like this:

    Company ID   Sales   Assets   Year   dups
             1     6        25    1999      1
             2     6        25    1999      1
             3    10       100    1999      1
             4    10       100    1999      1
             1     3.5      45    2000      1
             2     3.5      45    2000      1
             3     1        50    2000      1
             4     1        50    2000      1

    To correct for double counting, I'd like to keep one firm and drop the other. I've seen code for creating an alternating dummy value each year, which would work for the fictional list I've provided. In my actual (more complicated) data, this would not work: I would end up dropping some years for Company A and some years for Company B, when it would be better to drop one or the other entirely. That is, I need to create a dummy variable that equals 0 for one of the firms with duplicate data and 1 for the other. Or, in terms of the fictional data, I want to tell Stata that firm 1 has the same values as firm 2 in all years, and that firm 3 likewise has the same values as firm 4 in all years.

    I hope this is clear. Any help anyone has is appreciated.
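    For reference, the -duplicates tag- step described above can be sketched in pandas on the fictional data (an illustrative analogue of the tagging logic, not Stata's implementation):

    ```python
    import pandas as pd

    # Fictional panel from the post above
    df = pd.DataFrame({
        "co":     [1, 2, 3, 4, 1, 2, 3, 4],
        "sales":  [6.0, 6.0, 10.0, 10.0, 3.5, 3.5, 1.0, 1.0],
        "assets": [25, 25, 100, 100, 45, 45, 50, 50],
        "year":   [1999, 1999, 1999, 1999, 2000, 2000, 2000, 2000],
    })

    # -duplicates tag sales assets year, generate(dups)- counts, for each
    # row, how many OTHER rows share the same (sales, assets, year)
    df["dups"] = df.groupby(["sales", "assets", "year"])["co"].transform("size") - 1
    ```

    Every row here has exactly one partner with identical values, so dups is 1 throughout, matching the listing above.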
    Last edited by Matt Soener; 18 Feb 2018, 14:35.

  • #2
    Interesting problem. There's probably a fancy way to do this, but I like to use the basics (in other words, brute force). Here's an approach that may get you going in a useful direction. Note especially companies 5 and 6, which only match in a single year, so neither is flagged as being the same as the other.
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte co float sales byte assets int year
    1   6  25 1999
    2   6  25 1999
    3  10 100 1999
    4  10 100 1999
    5   5  55 1999
    6   6  66 1999
    1 3.5  45 2000
    2 3.5  45 2000
    3   1  50 2000
    4   1  50 2000
    5   1  11 2000
    6   1  11 2000
    end
    bysort sales assets year (co): generate comatch = co[1]
    bysort co (comatch): generate cosame = comatch[1]==comatch[_N] & comatch[1]!=co
    sort year co
    list, clean
    Code:
    . list, clean
    
           co   sales   assets   year   comatch   cosame  
      1.    1       6       25   1999         1        0  
      2.    2       6       25   1999         1        1  
      3.    3      10      100   1999         3        0  
      4.    4      10      100   1999         3        1  
      5.    5       5       55   1999         5        0  
      6.    6       6       66   1999         6        0  
      7.    1     3.5       45   2000         1        0  
      8.    2     3.5       45   2000         1        1  
      9.    3       1       50   2000         3        0  
     10.    4       1       50   2000         3        1  
     11.    5       1       11   2000         5        0  
     12.    6       1       11   2000         5        0
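    For comparison, the two bysort lines above can be sketched in pandas on the same toy data (an illustrative analogue; nunique stands in for the comatch[1]==comatch[_N] check):

    ```python
    import pandas as pd

    df = pd.DataFrame({
        "co":     [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6],
        "sales":  [6.0, 6.0, 10.0, 10.0, 5.0, 6.0, 3.5, 3.5, 1.0, 1.0, 1.0, 1.0],
        "assets": [25, 25, 100, 100, 55, 66, 45, 45, 50, 50, 11, 11],
        "year":   [1999] * 6 + [2000] * 6,
    })

    # comatch: the lowest company id among rows sharing (sales, assets, year)
    df["comatch"] = df.groupby(["sales", "assets", "year"])["co"].transform("min")

    # cosame: 1 when a company's comatch is the same in every year it
    # appears AND that comatch is a different company
    stable = df.groupby("co")["comatch"].transform(lambda s: s.nunique() == 1)
    df["cosame"] = (stable & (df["comatch"] != df["co"])).astype(int)
    ```

    As in the Stata output, companies 2 and 4 get cosame = 1, while companies 5 and 6, which only match each other in 2000, stay at 0.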
    Added as an afterthought: if, say, company 666 matched company 42 in 2000 but didn't appear at all in 1999, this code may not work.

    Added later: crossed with Clyde's post, which lists all the things I should have thought of before going for brute force.
    Last edited by William Lisowski; 18 Feb 2018, 15:02.



    • #3
      I have some ideas about how one might solve this, but the details would depend on specific aspects of how the original data is organized--just seeing the list of duplicates doesn't nail it down. So please post back with a good, representative example of the original data. Make sure the example includes some duplicates, and some that are not duplicates, and, if there are such in your data, some examples of firms that coincidentally report the same sales and assets in some years, but differ in other years (and hence are actually different firms). Also make the range of years shown representative of the data as a whole, and if there are missing values among the responses, show those as well. Make sure that it is clear from the example whether all firms are observed in the same years, or for the same number of years, or different firms for different numbers of years.

      Be sure to use the -dataex- command to post this example data. If you are not familiar with -dataex-, read FAQ #12 and follow the advice there.



      • #4
        Thanks for the reply. Below you can see the results from firms I've tagged as duplicates (and the duplicate dummy is shown).

        Since I'm posting my code with my own variable names: gvkey = firm ID, at = total assets, and conm = company name.

        For the companies UGI Corp, WGL Holdings Inc, and Integrys Holding Inc, you can't see the firms they match with (it is worth noting that the first two have just one year of data; the dataset is unbalanced). For Unilever PLC and Unilever NV, though, you can see that the observations in each year are the same. These are exactly the kind of cases I'd like to address: I need to keep only one of each such pair.


        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input long gvkey str88 conm double(year sale at) int dups
        10801 "UGI CORP"             1991 1232.5744373092414 1754.8907111270296 1
        10845 "UNILEVER PLC"         1995  71618.04688311594                  0 1
        10845 "UNILEVER PLC"         1996  73316.11128976692  49021.94546094461 1
        10845 "UNILEVER PLC"         1997  68745.65258691588  48722.15340241122 1
        10845 "UNILEVER PLC"         1998 65199.346458415945  51786.00191131287 1
        10845 "UNILEVER PLC"         1999 62049.401972052256  39849.16768865295 1
        10845 "UNILEVER PLC"         2000 60408.406260770185  73177.68351205904 1
        10845 "UNILEVER PLC"         2001 61660.206862709005  63389.81432702189 1
        10845 "UNILEVER PLC"         2002  60053.72214399573 55485.309719865785 1
        10845 "UNILEVER PLC"         2003  62123.58934890652  55961.14115148261 1
        10845 "UNILEVER PLC"         2004 62612.034342098625  53512.25449970345 1
        10845 "UNILEVER PLC"         2005 59819.155885099965  59372.83429450737 1
        10845 "UNILEVER PLC"         2006  58441.65908805863 54652.872854863774 1
        10845 "UNILEVER PLC"         2007  62886.38129644088  58371.80668175872 1
        10845 "UNILEVER PLC"         2008  65523.35778719963  58439.53303420203 1
        10845 "UNILEVER PLC"         2009   61292.0444215007  56971.75793652587 1
        10845 "UNILEVER PLC"         2010  63682.07887674662 59229.138789910714 1
        10845 "UNILEVER PLC"         2011  68072.36572946975  69603.25049042473 1
        10845 "UNILEVER PLC"         2012  67979.60998530743  61147.74130195821 1
        10845 "UNILEVER PLC"         2013  67215.37457079558 61432.884367343795 1
        10845 "UNILEVER PLC"         2014  64347.27526562526 63803.918349619795 1
        10846 "UNILEVER NV"          1995  71618.04688311594                  0 1
        10846 "UNILEVER NV"          1996  73316.11128976692  49021.94546094461 1
        10846 "UNILEVER NV"          1997  68745.65258691588  48722.15340241122 1
        10846 "UNILEVER NV"          1998 65199.346458415945  51786.00191131287 1
        10846 "UNILEVER NV"          1999 62049.401972052256  39849.16768865295 1
        10846 "UNILEVER NV"          2000 60408.406260770185  73177.68351205904 1
        10846 "UNILEVER NV"          2001 61660.206862709005  63389.81432702189 1
        10846 "UNILEVER NV"          2002  60053.72214399573 55485.309719865785 1
        10846 "UNILEVER NV"          2003  62123.58934890652  55961.14115148261 1
        10846 "UNILEVER NV"          2004 62612.034342098625  53512.25449970345 1
        10846 "UNILEVER NV"          2005 59819.155885099965  59372.83429450737 1
        10846 "UNILEVER NV"          2006  58441.65908805863 54652.872854863774 1
        10846 "UNILEVER NV"          2007  62886.38129644088  58371.80668175872 1
        10846 "UNILEVER NV"          2008  65523.35778719963  58439.53303420203 1
        10846 "UNILEVER NV"          2009   61292.0444215007  56971.75793652587 1
        10846 "UNILEVER NV"          2010  63682.07887674662 59229.138789910714 1
        10846 "UNILEVER NV"          2011  68072.36572946975  69603.25049042473 1
        10846 "UNILEVER NV"          2012  67979.60998530743  61147.74130195821 1
        10846 "UNILEVER NV"          2013  67215.37457079558 61432.884367343795 1
        10846 "UNILEVER NV"          2014  64347.27526562526 63803.918349619795 1
        10984 "SPRINT CORP"          1997 21935.527289719623  26818.33121495327 1
        11296 "WGL HOLDINGS INC"     2000 1417.3203125744387  2666.435169206239 1
        11555 "INTEGRYS HOLDING INC" 1991 1083.5698722965699 1865.6843876183327 1
        11555 "INTEGRYS HOLDING INC" 1992 1070.9738427059672  1932.656301511055 1
        end
        format %ty year
        This may also be helpful. Here is one firm that has no duplicates (Armco Inc.), followed by a second firm (Armstrong Holdings Inc.) that has duplicates in 2004 and 2005. I looked up the firm it matches with (Armstrong World Industries), and it came into existence in 2004, so they overlapped in those two years.

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input long gvkey str88 conm double(year sale at) int dups
        1755 "ARMCO INC"              1991  2772.448740534817 3209.1793670604857 0
        1755 "ARMCO INC"              1992 3498.3685625361823  3305.870562446783 0
        1755 "ARMCO INC"              1993  2725.735640138408   3120.01723183391 0
        1755 "ARMCO INC"              1994  2296.085877240877 3090.3565413699034 0
        1755 "ARMCO INC"              1995  2422.758167895947 2945.7036612805005 0
        1755 "ARMCO INC"              1996  2600.833753181832  2817.771046515676 0
        1755 "ARMCO INC"              1997  2697.790093457944 2774.4779439252334 0
        1755 "ARMCO INC"              1998 2478.0892638036808  2750.076441717791 0
        1762 "ARMSTRONG HOLDINGS INC" 1991  4239.224103796515   3736.28003966389 0
        1762 "ARMSTRONG HOLDINGS INC" 1992 4301.7651238207745 3390.7316439936435 0
        1762 "ARMSTRONG HOLDINGS INC" 1993  4136.762491349481 3160.3135640138407 0
        1762 "ARMSTRONG HOLDINGS INC" 1994  4396.518916444743 3565.6731503479814 0
        1762 "ARMSTRONG HOLDINGS INC" 1995  3238.161743859388  3338.961157345155 0
        1762 "ARMSTRONG HOLDINGS INC" 1996 3253.1542374485516 3221.7752687326683 0
        1762 "ARMSTRONG HOLDINGS INC" 1997 3242.5687850467284  3503.307476635514 0
        1762 "ARMSTRONG HOLDINGS INC" 1998  3987.886748466257  6205.315582822085 0
        1762 "ARMSTRONG HOLDINGS INC" 1999 4892.8417175057875  5916.789399080333 0
        1762 "ARMSTRONG HOLDINGS INC" 2000  4128.916797912045  5325.887550832282 0
        1762 "ARMSTRONG HOLDINGS INC" 2001  4190.565524691236  5392.108679216151 0
        1762 "ARMSTRONG HOLDINGS INC" 2002  4173.893471236376  5927.105037110496 0
        1762 "ARMSTRONG HOLDINGS INC" 2003  4192.420108695652   5978.99054347826 0
        1762 "ARMSTRONG HOLDINGS INC" 2004  4382.270708032068  5775.780917165532 1
        1762 "ARMSTRONG HOLDINGS INC" 2005  4312.715139982785  5582.387009543814 1
        1762 "ARMSTRONG HOLDINGS INC" 2006                  0 30.996427633000096 0
        end
        format %ty year
        There is missing data for both sales and assets in this data. If you'd like, I'd be happy to post an example showing this, but I thought it might be sufficient to just say this.



        • #5
          I think the following will do it. Note that because the duplicates of some of the firms in the example are not visible in it, this code treats those firms as not being duplicates. The code also treats two firms that show matching data for some, but not all, years as not being duplicates. This avoids inadvertently treating as duplicates two firms that, by coincidence, happen to have the same values of sale and at in a given year.

          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input int gvkey str22 conm int year double(sale at)
          10801 "UGI CORP"               1991 1232.5744373092414 1754.8907111270296
          10845 "UNILEVER PLC"           1995  71618.04688311594                  0
          10845 "UNILEVER PLC"           1996  73316.11128976692  49021.94546094461
          10845 "UNILEVER PLC"           1997  68745.65258691588  48722.15340241122
          10845 "UNILEVER PLC"           1998 65199.346458415945  51786.00191131287
          10845 "UNILEVER PLC"           1999 62049.401972052256  39849.16768865295
          10845 "UNILEVER PLC"           2000 60408.406260770185  73177.68351205904
          10845 "UNILEVER PLC"           2001 61660.206862709005  63389.81432702189
          10845 "UNILEVER PLC"           2002  60053.72214399573 55485.309719865785
          10845 "UNILEVER PLC"           2003  62123.58934890652  55961.14115148261
          10845 "UNILEVER PLC"           2004 62612.034342098625  53512.25449970345
          10845 "UNILEVER PLC"           2005 59819.155885099965  59372.83429450737
          10845 "UNILEVER PLC"           2006  58441.65908805863 54652.872854863774
          10845 "UNILEVER PLC"           2007  62886.38129644088  58371.80668175872
          10845 "UNILEVER PLC"           2008  65523.35778719963  58439.53303420203
          10845 "UNILEVER PLC"           2009   61292.0444215007  56971.75793652587
          10845 "UNILEVER PLC"           2010  63682.07887674662 59229.138789910714
          10845 "UNILEVER PLC"           2011  68072.36572946975  69603.25049042473
          10845 "UNILEVER PLC"           2012  67979.60998530743  61147.74130195821
          10845 "UNILEVER PLC"           2013  67215.37457079558 61432.884367343795
          10845 "UNILEVER PLC"           2014  64347.27526562526 63803.918349619795
          10846 "UNILEVER NV"            1995  71618.04688311594                  0
          10846 "UNILEVER NV"            1996  73316.11128976692  49021.94546094461
          10846 "UNILEVER NV"            1997  68745.65258691588  48722.15340241122
          10846 "UNILEVER NV"            1998 65199.346458415945  51786.00191131287
          10846 "UNILEVER NV"            1999 62049.401972052256  39849.16768865295
          10846 "UNILEVER NV"            2000 60408.406260770185  73177.68351205904
          10846 "UNILEVER NV"            2001 61660.206862709005  63389.81432702189
          10846 "UNILEVER NV"            2002  60053.72214399573 55485.309719865785
          10846 "UNILEVER NV"            2003  62123.58934890652  55961.14115148261
          10846 "UNILEVER NV"            2004 62612.034342098625  53512.25449970345
          10846 "UNILEVER NV"            2005 59819.155885099965  59372.83429450737
          10846 "UNILEVER NV"            2006  58441.65908805863 54652.872854863774
          10846 "UNILEVER NV"            2007  62886.38129644088  58371.80668175872
          10846 "UNILEVER NV"            2008  65523.35778719963  58439.53303420203
          10846 "UNILEVER NV"            2009   61292.0444215007  56971.75793652587
          10846 "UNILEVER NV"            2010  63682.07887674662 59229.138789910714
          10846 "UNILEVER NV"            2011  68072.36572946975  69603.25049042473
          10846 "UNILEVER NV"            2012  67979.60998530743  61147.74130195821
          10846 "UNILEVER NV"            2013  67215.37457079558 61432.884367343795
          10846 "UNILEVER NV"            2014  64347.27526562526 63803.918349619795
          10984 "SPRINT CORP"            1997 21935.527289719623  26818.33121495327
          11296 "WGL HOLDINGS INC"       2000 1417.3203125744387  2666.435169206239
          11555 "INTEGRYS HOLDING INC"   1991 1083.5698722965699 1865.6843876183327
          11555 "INTEGRYS HOLDING INC"   1992 1070.9738427059672  1932.656301511055
           1755 "ARMCO INC"              1991  2772.448740534817 3209.1793670604857
           1755 "ARMCO INC"              1992 3498.3685625361823  3305.870562446783
           1755 "ARMCO INC"              1993  2725.735640138408   3120.01723183391
           1755 "ARMCO INC"              1994  2296.085877240877 3090.3565413699034
           1755 "ARMCO INC"              1995  2422.758167895947 2945.7036612805005
           1755 "ARMCO INC"              1996  2600.833753181832  2817.771046515676
           1755 "ARMCO INC"              1997  2697.790093457944 2774.4779439252334
           1755 "ARMCO INC"              1998 2478.0892638036808  2750.076441717791
           1762 "ARMSTRONG HOLDINGS INC" 1991  4239.224103796515   3736.28003966389
           1762 "ARMSTRONG HOLDINGS INC" 1992 4301.7651238207745 3390.7316439936435
           1762 "ARMSTRONG HOLDINGS INC" 1993  4136.762491349481 3160.3135640138407
           1762 "ARMSTRONG HOLDINGS INC" 1994  4396.518916444743 3565.6731503479814
           1762 "ARMSTRONG HOLDINGS INC" 1995  3238.161743859388  3338.961157345155
           1762 "ARMSTRONG HOLDINGS INC" 1996 3253.1542374485516 3221.7752687326683
           1762 "ARMSTRONG HOLDINGS INC" 1997 3242.5687850467284  3503.307476635514
           1762 "ARMSTRONG HOLDINGS INC" 1998  3987.886748466257  6205.315582822085
           1762 "ARMSTRONG HOLDINGS INC" 1999 4892.8417175057875  5916.789399080333
           1762 "ARMSTRONG HOLDINGS INC" 2000  4128.916797912045  5325.887550832282
           1762 "ARMSTRONG HOLDINGS INC" 2001  4190.565524691236  5392.108679216151
           1762 "ARMSTRONG HOLDINGS INC" 2002  4173.893471236376  5927.105037110496
           1762 "ARMSTRONG HOLDINGS INC" 2003  4192.420108695652   5978.99054347826
           1762 "ARMSTRONG HOLDINGS INC" 2004  4382.270708032068  5775.780917165532
           1762 "ARMSTRONG HOLDINGS INC" 2005  4312.715139982785  5582.387009543814
           1762 "ARMSTRONG HOLDINGS INC" 2006                  0 30.996427633000096
          end
          
          format %ty year
          
          preserve
          keep gvkey year
          tempfile years
          save `years'
          restore
          
          //    IDENTIFY FIRMS WITH IDENTICAL DATA
          egen long pair = group(year sale at)
          //    RESHAPE THE COMPANY IDENTIFIERS WIDE
          by pair (gvkey), sort: gen _j = _n
          reshape wide gvkey conm, i(pair) j(_j)
          
          //    TWO "COMPANIES" ARE REALLY THE SAME IF
          //    THEY HAVE THE SAME DATA EVERY YEAR THEY APPEAR
          clonevar gvkey = gvkey1
          merge 1:1 gvkey year using `years', keep(match master)
          replace gvkey = gvkey2
          merge m:1 gvkey year using `years', keep(match master) gen(_merge2)
          drop gvkey
          by gvkey1 gvkey2 (year), sort: egen byte is_duplicate = min(3._merge & 3._merge2)
          //    BUT DISREGARD "PAIRS" WITH NO SECOND GVKEY
          replace is_duplicate = is_duplicate & !missing(gvkey2)
          
          //    GO BACK TO LONG LAYOUT NOW
          reshape long
          drop if missing(gvkey)
          isid pair year _j, sort
          
          //    REMOVE ONE OF THE TWO DUPLICATE FIRMS (THE ONE WITH HIGHER GVKEY IS DROPPED)
          drop if is_duplicate & _j == 2
          isid conm year, sort
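          The core idea, matching on a firm's full history rather than year by year, can also be sketched in pandas (an illustrative analogue with made-up toy rows, not a translation of the -reshape--based code above):

          ```python
          import pandas as pd

          def true_duplicate_pairs(df):
              """Pairs of gvkeys whose entire (year, sale, at) histories are
              identical; coincidental single-year matches are not flagged."""
              history = {g: frozenset(zip(sub["year"], sub["sale"], sub["at"]))
                         for g, sub in df.groupby("gvkey")}
              ids = sorted(history)
              return [(a, b) for i, a in enumerate(ids) for b in ids[i + 1:]
                      if history[a] == history[b]]

          # Toy data: 10845/10846 match in every year; 1762 matches 10845 in
          # one year only, so it is not flagged
          df = pd.DataFrame(
              [(10845, 1995, 71618.0,     0.0),
               (10845, 1996, 73316.1, 49021.9),
               (10846, 1995, 71618.0,     0.0),
               (10846, 1996, 73316.1, 49021.9),
               (1762,  1995,   100.0,   200.0),
               (1762,  1996, 73316.1, 49021.9)],
              columns=["gvkey", "year", "sale", "at"])

          pairs = true_duplicate_pairs(df)
          # keep the lower gvkey of each duplicate pair, as the Stata code does
          drop = {max(a, b) for a, b in pairs}
          deduped = df[~df["gvkey"].isin(drop)]
          ```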




          • #6
            I appreciate your help. There were a few snags with the code. When I tried to reshape the data, Stata returned an error saying _j takes on too many values. I haven't used reshape in a while, but perhaps this is because my data is quite large. I have over 600,000 observations and 224 variables.

            That said, I wonder if this is one of those things where it is "easier" to do by hand. There are roughly 200 firms in my data that are duplicated in this fashion. If there were a way to return the firms that have duplicates in some years but not all (as you correctly tried to do, Clyde), I could simply go through and keep one and not the other. This would be easy to do in an Excel sheet and then merge back afterward.

            It would be time consuming, but not that time consuming, and I think I could produce something like this. The trick is to present the data in panel format so I can see each firm and its duplicate right after it, much like the way the Unilever data looks in the example I posted. Any other suggestions you have would be welcome.



            • #7
              Um, I have never seen Stata give an error message that _j contains too many values from a -reshape- command, and that is particularly unexpected since as I set up the code, my expectation is that _j would never be more than 2! So there is something very different about your data than I understood from your example. Also, there are two -reshape- commands in my code: one long and one wide. Which one provoked the error message? Can you post an example of your data that reproduces this problem?

              Removing the surplus observations, once they are identified, is the easy part of this problem. It would make no sense at all to identify them, and then turn to Excel to remove them. And, in any case, unless you are just doing this for fun and games, you should never do data management in Excel. It leaves no audit trail of what you do. For my part, I would never lend any credibility to, or accept, results from data that were managed in a spreadsheet. If you are doing work that you or others plan to rely on in some way, spreadsheets should be used only for visual display of final results, if even for that.



              • #8
                To Clyde's excellent advice I would add that almost without fail, whenever I do something manually in the way you describe in post #6, it guarantees that I wind up having to repeat the analysis on slightly different data or slightly expanded data, and I then get to do the manual process again, only this time it's complicated by needing to make sure that whatever I do, and whatever oversights and mistakes I make, are consistent with what I did the first time.

                And similarly, every time it takes me longer than expected to automate a process and I'm feeling guilty, I wind up having to repeat the process for some reason or another, and feel thankful that I spent the time to get it right and automated.



                • #9
                  OK, fair enough. I imagined that doing things by hand would not be a popular suggestion; I just wanted to get things moving.

                  Regarding this:

                  Originally posted by Clyde Schechter View Post
                  Um, I have never seen Stata give an error message that _j contains too many values from a -reshape- command, and that is particularly unexpected since as I set up the code, my expectation is that _j would never be more than 2! So there is something very different about your data than I understood from your example. Also, there are two -reshape- commands in my code: one long and one wide. Which one provoked the error message? Can you post an example of your data that reproduces this problem?
                  I used the code you provided:

                  Code:
                  preserve
                  keep gvkey year
                  tempfile years
                  save `years'
                  restore
                  //    IDENTIFY FIRMS WITH IDENTICAL DATA
                  egen long pair = group(year sale at)
                  //    RESHAPE THE COMPANY IDENTIFIERS WIDE
                  by pair (gvkey), sort: gen _j = _n
                  reshape wide gvkey conm, i(pair) j(_j)
                  It was this last line where I received the message "variable _j takes on too many values", so it was the first -reshape-, going wide.

                  I was able to generate _j and pair. _j seemed to have flagged some duplicates of this sort too (i.e. values higher than 1). Perhaps there are other workarounds along these lines?

                  Regarding information about my data, I'm not sure what else would be useful for you to know. Please let me know. One thing that may be important is that I have quite a few variables other than sales and assets. It's just that those would be the kind of values that would be duplicated (and are available for almost every firm). Other firm-level variables have varying levels of coverage, ranging from pretty high to almost all missing. There are a few variables on which firms would have similar values, such as industry, country-level characteristics that apply to many firms in a country, or other basic dummy variables.



                  • #10
                    those would be the kind of values that would be duplicated (and are available for almost every firm)
                    Hmm, so they are missing sometimes? In any given year (I'm assuming year is never missing), every observation with missing values for both sale and at will be "paired" with every other such observation. Something like
                    Code:
                    tab year if sale==. & at==.
                    might show the scope of the potential problem.
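                    In pandas terms, that check looks like this (a sketch with made-up numbers, not the poster's data):

                    ```python
                    import numpy as np
                    import pandas as pd

                    # Made-up rows: the two 1999 observations missing both sale
                    # and at would all "pair" with one another under a
                    # (year, sale, at) grouping
                    df = pd.DataFrame({
                        "gvkey": [1, 2, 3, 4],
                        "year":  [1999, 1999, 2000, 2000],
                        "sale":  [np.nan, np.nan, 5.0, np.nan],
                        "at":    [np.nan, np.nan, 7.0, np.nan],
                    })

                    # analogue of -tab year if sale==. & at==.-
                    both_missing = df[df["sale"].isna() & df["at"].isna()]
                    counts = both_missing.groupby("year").size()
                    ```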
