Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merging 2 databases on company identifier and years is incorrect

    Dear statalist reader,

    Currently I am merging 2 databases(compustat and execomp) on company ID (Cusip) and years (fyear).

    Sequence of events:
    1. The first problem was the difference in cusip digits between the databases. I sloved this by reducing the number of Cusip digits to 6.
    2. The second problem is the merging.

    The codes I use thus far:
    Code:
    use "$dir\CompustatFinancials_data", clear
    drop if missing(cusip)
    
    *Creating 6 digit cusip in stead of 9
    gen cusip6=substr(cusip,1,6)
    rename cusip CUSIP9
    rename cusip6 cusip
    
    *Deleting duplicates
    sort cusip fyear
    quietly by cusip fyear: gen dup = cond(_N==1, 0,_n)
    tab dup
    drop if dup>0
    drop dup
    encode cusip, gen(Cusip)
    
    *Setting panel data
    xtset Cusip fyear
    
    *Format
    format Cusip %6.0f
    format fyear %4.0f
    
    *Save to file
    save CompustatFinancials_data_1.dta, replace
    
    
    use "$dir\Execcomp_data", clear
    drop if missing(cusip)
    rename year fyear
    
    *Creating 6 digit cusip in stead of 8
    gen cusip6=substr(cusip,1,6)
    rename cusip CUSIP8
    rename cusip6 cusip
    
    *Deleting duplicates
    sort cusip fyear
    quietly by cusip fyear: gen dup = cond(_N==1, 0,_n)
    tab dup
    drop if dup>0
    drop dup
    encode cusip, gen(Cusip)
    
    *Setting panel data
    xtset Cusip fyear
    
    *Save to file
    save Execcomp_data_1.dta, replace
    
    *Merging 2 datasets
    use "$dir\CompustatFinancials_data_1"
    merge 1:1 Cusip fyear using Execcomp_data_1.dta

    The problem I encounter is that stata does not merge the new 6 digit cusips in a correct way.
    The match is wrong.

    I cannot find a mistake in the code above. Does anyone know how to solve this problem?

    Kind Regards!





  • #2
    What is the error it outputs?

    Comment


    • #3
      There is no error.

      What happens is that the merge goes through but when I check it I see the following:

      for example:
      CUSIP9 CUSIP8 cusip Cusip
      000255109 00036110 000255 000255

      Where CUSIP9 is from the compustat database, CUSIP8 from the execomp database and cusip and Cusip the 6 digit Cusips of compustat.


      This means that the merge is wrong, because CUSIP9 is not equal to CUSIP8 which means 2 different companies have been merged in stead of the same companies...

      I totally do not get why this happens.

      Comment


      • #4
        .

        Comment


        • #5
          can you use dataex to give an example of how the two datasets look like?

          Comment


          • #6
            My datasets are too large for data ex but the table will show the same:

            Database 1
            fyear CUSIP9 cusip Cusip
            2002 000255109 000255 000255
            2003 000255109 000255 000255
            2004 000255109 000255 000255
            2003 000360206 000360 000360
            2004 000360206 000360 000360
            2005 000360206 000360 000360
            2006 000360206 000360 000360
            2002 000361105 000361 000361
            2003 000361105 000361 000361
            2004 000361105 000361 000361
            2005 000361105 000361 000361
            2006 000361105 000361 000361
            2007 000361105 000361 000361
            2008 000361105 000361 000361
            Database 2
            fyear CUSIP8 cusip Cusip
            2002 00036110 000361 000361
            2003 00036110 000361 000361
            2004 00036110 000361 000361
            2003 00088630 000886 000886
            2004 00088630 000886 000886
            2005 00088630 000886 000886
            2006 00088630 000886 000886
            2002 00095710 000957 000957
            2003 00095710 000957 000957
            2004 00095710 000957 000957
            2005 00095710 000957 000957
            2006 00095710 000957 000957
            2007 00095710 000957 000957
            2008 00095710 000957 000957

            You can see here that the merge is wrong
            fyear CUSIP9 CUSIP8 cusip Cusip
            2002 000255109 00036110 000255 000255
            2003 000255109 00036110 000255 000255
            2004 000255109 00036110 000255 000255
            2003 000360206 00088630 000360 000360
            2004 000360206 00088630 000360 000360
            2005 000360206 00088630 000360 000360
            2006 000360206 00088630 000360 000360
            2002 000361105 00095710 000361 000361
            2003 000361105 00095710 000361 000361
            2004 000361105 00095710 000361 000361
            2005 000361105 00095710 000361 000361
            2006 000361105 00095710 000361 000361
            2007 000361105 00095710 000361 000361
            2008 000361105 00095710 000361 000361
            Because CUSIP9 and CUSIP8 are NOT equal to eachother
            Last edited by Nicole Zander; 23 May 2016, 06:31.

            Comment


            • #7
              The encode command you have applied to your two datasets will not necessarily produce the same coding each time, and this appears to have been the case in your example. You should merge using cusip (the un-encoded 6-character string) and then do your encode and xtset on the merged dataset. Alternatively, if the cusip codes are entirely numeric, you could use destring rather than encode.

              See help encode and help destring to better understand how they differ.
              Last edited by William Lisowski; 23 May 2016, 06:30.

              Comment


              • #8
                I have just discovered something interesting.
                I have added a variable to check the 6 digit cusips:

                Code:
                *CHECKING for good merge variable
                gen cusip_check=Cusip
                What happens is the following
                Cusip cusip_check
                000255 2
                000360 3
                I would have though the cusip_check variable would be equal to Cusip but it is NOT.

                So, I think there is a mistake in the encoding of the cusips maybe?

                Comment


                • #9
                  Ah, I see you did indeed think that!

                  Comment


                  • #10
                    Thank you all for your help. The problem was indeed using an encoded version of Cusip. Now the merge worked correctly!

                    Comment


                    • #11
                      Hi all,

                      I am coping with the same problem of trying to merge two datasets without success. The Compustat masterfile contain the variables cusip and fyear, with cusip as string and fyear as numerical. Being unable to transform cusip into numerical because of cusip containing non-numericals, I transformed fyear to string and generated cusipfyear=cusip+fyear. The KLD (MSCI) database used to merge underwent exactly the same process. When trying to merge, the error code r(111) variable cusipfyear not found appears.

                      Any help is welcome here.

                      Btw duplicates were removed, and the merge option 1:1 is used. Further, the KLD cusip is 12str and the Compustat cusip is 13str (that is cusipfyear).

                      Comment


                      • #12
                        Your first problem is that your KLD cusipyear is12 characters and your Compustat cusipyear is 13 characters. Since the year is 4 characters, that means that the KLD data uses an 8-character cusip and the Compustat data has a 9-character cusip. Wikipedia tells us the 9th character in the CUSIP is a check digit created from the first 8 characters and has no meaning. The KLD dataset apparently omits the check digit.

                        Perhaps the following example will start you in a useful direction.
                        Code:
                        // set up example Compustat data with 9-character cusip
                        * Example generated by -dataex-. For more info, type help dataex
                        clear
                        input str9 cusip int fyear
                        "000255109" 2002
                        "000255109" 2003
                        "000255109" 2004
                        "000360206" 2004
                        "000360206" 2005
                        "000360206" 2006
                        "000361105" 2002
                        "000361105" 2003
                        "000361105" 2004
                        "000361105" 2005
                        "000361105" 2006
                        "000361105" 2007
                        "000361105" 2008
                        end
                        save cs, replace
                        
                        // set up example KLD data with 8-character cusip
                        * Example generated by -dataex-. For more info, type help dataex
                        clear
                        input str8 cusip int fyear
                        "00025510" 2002
                        "00025510" 2003
                        "00036020" 2003
                        "00036020" 2004
                        "00036020" 2005
                        "00036020" 2006
                        "00036110" 2002
                        "00036110" 2003
                        "00036110" 2004
                        "00036110" 2005
                        "00036110" 2006
                        "00036110" 2007
                        "00036110" 2008
                        end
                        save kld, replace
                        
                        // shorten Compustat cusip
                        use cs, clear
                        replace cusip = substr(cusip,1,8)
                        
                        // merge the KLD data
                        merge 1:1 cusip fyear using kld
                        
                        // how did we do?
                        sort cusip fyear
                        list, sepby(cusip)
                        Code:
                        . // merge the KLD data
                        . merge 1:1 cusip fyear using kld
                        
                            Result                      Number of obs
                            -----------------------------------------
                            Not matched                             2
                                from master                         1  (_merge==1)
                                from using                          1  (_merge==2)
                        
                            Matched                                12  (_merge==3)
                            -----------------------------------------
                        
                        . 
                        . // how did we do?
                        . sort cusip fyear
                        
                        . list, sepby(cusip)
                        
                             +------------------------------------+
                             |    cusip   fyear            _merge |
                             |------------------------------------|
                          1. | 00025510    2002       Matched (3) |
                          2. | 00025510    2003       Matched (3) |
                          3. | 00025510    2004   Master only (1) |
                             |------------------------------------|
                          4. | 00036020    2003    Using only (2) |
                          5. | 00036020    2004       Matched (3) |
                          6. | 00036020    2005       Matched (3) |
                          7. | 00036020    2006       Matched (3) |
                             |------------------------------------|
                          8. | 00036110    2002       Matched (3) |
                          9. | 00036110    2003       Matched (3) |
                         10. | 00036110    2004       Matched (3) |
                         11. | 00036110    2005       Matched (3) |
                         12. | 00036110    2006       Matched (3) |
                         13. | 00036110    2007       Matched (3) |
                         14. | 00036110    2008       Matched (3) |
                             +------------------------------------+

                        Comment

                        Working...
                        X