Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merging data based on CUSIP

    Hello,

    I am new to Stata and this forum has been helpful to me. I have two databases that I am trying to merge based on CUSIP, which is KLD and Compustat. The problem arises when I look up into these CUSIP codes. For example, the database of KLD has shorter numbers than the database of Compustat. CUSIP in KLD database is 2444107 and the CUSIP in Compustat is 002444107. But I also see that this can be longer in the end: CUSIP KLD 00206R10 and Compustat 00206R102.

    I would like to merge these data based on the CUSIP. How can I do this? Any suggestions?

    Kind regards,

    John

  • #2
    Hi John,

    Welcome to Statalist! Your question is more about the various flavors of CUSIPs than about Stata. This page at the Princeton University library should get you started: https://libguides.princeton.edu/MatchFinancial. You should also check out the CUSIP FAQs .

    Devra
    Devra Golbe
    Professor Emerita, Dept. of Economics
    Hunter College, CUNY

    Comment


    • #3
      Hi Devra,

      Thank you for your response! I have looked at it and found some interesting content about the CUSIP converter. I want to convert the 8 CUSIP code of KLD to 9 CUSIP code of COMPUSTAT, since most variables come out of COMPUSTAT. Now I need to upload the output from Stata as txt format in order to put it in the converter of WRDS. If I do this only for the CUSIP numbers (so 1 column), it works. However, I also want to put the variables that are linked with the CUSIP in the output. How can I do this, such that I get an output with 9 CUSIP with the variables linked?

      Kind regards,

      John

      Comment


      • #4
        Hi John,

        As you have seen, the CUSIP converter accepts only a single list of CUSIPs and produces only a single list of CUSIPs of different length. To merge two datasets indexed by CUSIPs of different length (call them CUSIP8 and CUSIP9), you will need to match CUSIP8 to CUSIP9. This will be easiest if you hold these two variables as strings. Use Stata's string functions to extract the relevant 8 characters of CUSIP9 into an 8-digit string (CUSIP8a). Then merge the two datasets by matching on CUSIP8 and CUSIP8a.

        Devra
        Devra Golbe
        Professor Emerita, Dept. of Economics
        Hunter College, CUNY

        Comment


        • #5
          Hi Devra,

          So you advice to convert the CUSIP9 into CUSIP8? If I look into the CUSIP9 it has the type str10 and the CUSIP8 has str16. Will this be a problem?

          Kind regards,

          John

          Comment


          • #6
            Hi John,


            Sounds like your data need some initial cleaning. Stata's documentation is excellent. Read up on string variables. I'd start with 12.4 in the User manual on Strings, and follow up with String Functions. You will want to trim all leading and trailing spaces, and possibly leading zeros. Once your data are clean, the variables should be of proper length. (There's a string function for checking that.)

            Best,
            Devra
            Devra Golbe
            Professor Emerita, Dept. of Economics
            Hunter College, CUNY

            Comment


            • #7
              Hi John, a 9-digit CUSIP has 3 parts:

              1. Cusip issuer number – 6 digits (i.e. the company)
              2. Cusip issue number – 2 digits (the particular stock or bond offering)
              3. Cusip check digit – 1 digit

              So the 1st 6 digits should match between the KLD and Compustat databases. (BTW, ff KLD has PERMNO or GVKEY, those are easier to match on). The challenge is that CUSIP's can have leading zeroes, and those are often dropped if you imported the data into Excel. However, if you download both of them as strings, you should be able to create a 6-digit CUSIP as:
              Code:
               gen cusip_6 = substr(cusip_9, 1, 6)

              Comment


              • #8
                Hi all,

                I am coping with the same problem of trying to merge two datasets without success. The Compustat masterfile contain the variables cusip and fyear, with cusip as string and fyear as numerical. Being unable to transform cusip into numerical because of cusip containing non-numericals, I transformed fyear to string and generated cusipfyear=cusip+fyear. The KLD (MSCI) database used to merge underwent exactly the same process. When trying to merge, the error code r(111) variable cusipfyear not found appears.

                Any help is welcome here.

                Btw duplicates were removed, and the merge option 1:1 is used. Further, the KLD cusip is 12str and the Compustat cusip is 13str.

                Comment

                Working...
                X