Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merge two data sets by a common variable and a date range


    Dear statalister,
    I have the following two dataset that I need to merge:

    (I am only showing a few records to illustrate data structure)

    My goal is to add cusip information from Cusip_permno to my permno dataset .

    Permno is a company ‘s permanent identifier – never changes. For each permno, it could correspond to more than one cusip if a company issued different types of stocks and received newly assigned cusip . the begin date and end date suggests during which time period and cusip applies. Hence, my merge is to add cusip to permno data based on the effective period cusip applies.

    Data Cusip_permno
    Permno Cusip Begindate Enddate
    10000 002b0102 19900102 19931231
    10001 0039b012 19900102 19920102
    10001 0029b011 19920103 19941203
    Permno year xvar
    10000 1990 1
    10000 1991 2
    10001 1990 19
    10001 1992 92
    10001 1993 13
    I am stuck with how to use the date to match with the year and permno.

    Thanks for considering my question,

    Rochelle

  • #2
    We need to clarify something here. For Permno 10001, Cusip 0029b011 covers most, but not all of 1992, all of 1993, and most, but not all of 1994. The beginning of 1992 is covered by Cusip 0039b012. So when you match up Permno 10001 and 1992, which Cusip do you want to match to? Or do you want to have two records, one for each Cusip? Please provide a general answer to the issue of multiple Cusips in a single year, not just this particular instance.
    Last edited by Clyde Schechter; 04 Oct 2014, 08:18.

    Comment


    • #3
      It is not fully clear to me what you want to do.
      I don't understand why in the example of your permno dataset above there are multiple years for each company, but these do not cover all years.

      Would you like your dataset to look like
      A. with Cusip connected to the start year
      10000 1990 1 002b0102
      10000 1991 2
      10001 1990 19
      0039b012
      10001 1992 92
      0029b011
      10001 1993 13

      B. Multiplying the observations in your permno database so that there is an observation for each year a cusip runs?
      10000 1990 1 002b0102
      10000 1991 2 002b0102
      10000 1992 . 002b0102
      10000 1993 . 002b0102
      10001 1990 19 0039b012
      10001 1991 . 0039b012
      10001 1992 92 0039b012
      10001 1992 92 0029b011
      10001 1993 13 0029b011
      10001 1994 . 0029b011
      From your description I'm guessing it's option B.

      You need to check what the format of your year variable in cusip is.
      For whatever reason I couldn't extract year using the year function from a variable formatted %tdCCYY-NN-DD so I decided to turn it to string and extract the year from that.

      Code:
      use "cusip.dta", clear
      tostring begindate, replace
      tostring enddate, replace
      gen beginyear=substr(begindate,1,4)
      gen endyear=substr(enddate,1,4)
      destring beginyear endyear, replace
      gen duration=endyear-beginyear
      expand duration+1
      bysort cusip: gen year=beginyear+_n-1
      keep cusip year
      save "cusip expanded.dta"
      
      use "permno.dta", clear
      joinby Permno year using "C:\Users\Evelyn\Downloads\cusip expanded.dta", unmatched(both)

      Comment


      • #4
        @ Clyde: Thank you ! I looked into the raw data. Yes, I do need to decide what to do when the year overlaps. Majority of my sample does not have this overlap issue. suppose I isolate these observations that have year overlaps and not merge them at this point (waiting for further investigation on what year to merge).

        could you please comment on merging the two data sets below (I edit my data to show there is no overlaps in year)
        Permno Cusip Begindate Enddate
        10000 002b0102 19900102 19931231
        10002 0032b016 19900101 19941231
        10002 0032b016 19950102 19991231
        Permno year xvar
        10000 1990 1
        10000 1991 2
        10002 1990 2
        10002 1995 9
        10002 1996 3

        Comment


        • #5
          @Evelyn, thank you for your detailed response .

          Re: your first question. permno dataset has years that are not continuous (gaps inbetween ). this could happen when the data collection was incomplete. I am using secondary data from some commercial databases.

          Re: my raw datasets are both sas datasets. the date format should work after importing into stata. ​your suggested solution should be useful if I have those overlapping years as Clyde mentioned in his post. Yes, your option B is better.

          thanks again,

          Comment


          • #6
            So, here's what I would do:

            Code:
            use cusip_data_set, clear
            isid Permno Cusip, sort // IF THIS FAILS, CODE BELOW WOULDN'T WORK
            
            // GENERATE YEAR VARIABLES AND EXPAND IT TO FILL IN
            // PERIODS THAT COVER MORE THAN ONE YEAR
            gen long first_date = date(Begindate, "YMD")
            gen long last_date = date(Enddate, "YMD")
            gen int first_year = year(first_date)
            gen int last_year = year(last_date)
            expand last_year-first_year+1
            by Permno Cusip, sort: gen int year = first_year + _n - 1
            drop first_date last_date first_year last_year
            
            // WE NOW HAVE ONE OBSERVATION FOR EACH CUSIP
            // FOR EACH YEAR IT WAS IN EFFECT
            tempfile cusips
            save `cusips'
            
            // NOW MERGE WITH THE PERMNO DATASET
            use permno_dataset, clear
            isid Permno year, sort // IF THIS FAILS, CODE BELOW WOULDN'T WORK
            merge 1:m Permno year using `cusips', keepusing(master match)
            The data now are matched up. If there are multiple Cusips that apply to a given Permno in a given year, they are all there and you will have to decide which one(s) to keep. If there are data in the Permno data set with no corresponding Cusips, they are characterized by _merge == 1, and you will also have to decide what to do about those.

            Comment


            • #7
              Thanks so much

              Comment


              • #8
                Hi, I have a similar question. What would happen if the year variable were a date in instead of what is shown here? ie. I wish to see if a certain date for each permo/year observation is within the beginning or end date of the first dataset?

                Comment

                Working...
                X