Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • merging compustat annual and crsp daily to calculate annual average bid-ask spread ---again still in trouble

    Dear Statalist members,

    For my thesis, I need to merge CRSP daily and Compustat annual to calculate annual average bid-ask spread. I know that this topic about merging compustat annual and crsp daily has been dealt with in the past, but I have tried several ways but till now I haven't been able to come up with the results. Hopefully can somebody help me?

    I need to calculate annual average bid-ask spread in CRSP per firm on a fiscal-year basis. My sample starts from 1/1/2003 to 31/12/2015. In the end, I want my results to look something like this:
    cusip8 datadate annual_spread other compustat data
    12345678 31/12/2003 0.01 ...
    12345678 ... ... ...
    12345678 31/12/2015 ... ...
    98765432 31/3/2003 0.02 ...
    98765432 ... ... ...
    98765432 31/3/2015 ... ...
    The problem is CRSP shows closing bid and closing ask on a calendar-day basis running almost everyday from 1/1/2003 till 31/12/2015 while Compustat shows data on a yearly-basis.The firms in Compustat also have different year-end date like 31/12 or 31/3.

    1---I tried to generate the year by gen fyr=year(datadate) to just get the year and then merge compustat with crsp by cusip and fyr. However, it only works on company with fiscal year running from 1/1 to 31/12.

    2---Then, I tried to generate the beginning of the year and the end of the year so that the individual dates of crsp daily will be trapped in the beginning date and the end date. Thus, I have followed several posts, including Sarah Edgington's post (http://www.stata.com/statalist/archi.../msg00242.html) as follows:
    *generate the beginning date of fiscal year
    gen beginning=mdy(month(datadate-364),1,year(datadate-364))
    *use joinby to create pairwise combination between two datasets
    use crsp.dta
    joinby cusip8 using compustat.dta, unmatched(master)
    keep if date>=beginning & date<=datadate
    (with 'date' here is the individual date of the closing bid and closing ask)

    I tried to run the commands like this on a smaller trial dataset, and it works beautifully. However, for my too large datasets, it does not work, and I also run into the memory issues.

    I also run the codes provided by Phil, which follows Sarah's posts, but I don't know why it does not work. Stata says 'variables not defined' or 'too much [' or something like that.

    There is also a very similar post by Fabian Schneider (http://www.statalist.org/forums/foru...year-end-dates) , but then it is directed to another post by Rochelle Zhang, but I won't calculate any standard deviation by using the data from another year, so i don't think it will help me.

    Can you help me please? Otherwise I will have to drop all the firms with fiscal year ending other than 31/12, which is very bad for me.

    Kind regards,
    An

  • #2
    Regarding my question, I also read a post about finding the fiscal year for each observations (http://www.stata.com/statalist/archi.../msg01052.html). Actually I don't understand this command:
    gen mdate=mofd(date)
    but anyway, it works with smaller datasets and each dataset has equal number of observations. For my two datasets from Compustat annual and Crsp daily, there is a huge difference between the number of observations. Thus, I tried it and it won't work. and when I merge it by m:m, Stata says nothing is matched (there is no _merge==3) and if it is 1:m, then it only works for fiscal year ending after June. Thus for fiscal year ending like 31/3 or 31/5, it does not work.

    Please help me!!! I am really new to Stata.

    Kind regards,
    An

    Comment


    • #3
      You can do this using rangejoin (from SSC). To install it, type in Stata's command window:
      Code:
      ssc install rangejoin
      With rangejoin, you pair each annual Compustat observation with observations from CRSP that are within the same fiscal year. The start of the fiscal year is simply the next day, but with the previous calendar year (e.g. "31/03/2005" starts on "01/04/2004").

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input str8 cusip8 str10 datadate float(crsp1 crsp2)
      "12345678" "31/12/2003" 1.1 1.2
      "12345678" "31/01/2004" 2.1 2.2
      "12345678" "28/02/2004" 3.1 3.2
      "12345678" "31/03/2004" 4.1 4.4
      "12345678" "30/04/2004" 5.1 5.4
      "12345678" "28/02/2005" 6.1 6.2
      "22233344" "01/11/2003" 7.1 7.2
      end
      gen ndate = daily(datadate,"DMY")
      format %td ndate
      save "temp_crsp.dta", replace
      
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input str8 cusip8 str10 cdate float compustat1
      "12345678" "31/03/2004" 11
      "12345678" "31/03/2005" 22
      "22233344" "31/12/2003" 33
      end
      gen ndate = daily(cdate,"DMY")
      format %td ndate
      save "temp_compustat.dta", replace
      
      * ---------- using rangestat (from SSC) -----------------------------
      
      * define lower bound as the next day but using last year
      gen low = mdy(month(ndate + 1), day(ndate + 1), year(ndate + 1) - 1)
      format %td low
      
      rangejoin ndate low 0 using "temp_crsp.dta", by(cusip8)
      list, sepby(cusip8 cdate) noobs

      Comment


      • #4
        I can't comment on your overall question as I have no idea what these data are about. I will just make a couple of general Stata points that, if not directly helpful, will at least keep you from going down some bind alleys.
        when I merge it by m:m, Stata says nothing is matched (there is no _merge==3)
        You dodged a bullet here! -merge m:m- produces nothing but garbage in nearly all applications. There are very rare circumstances where it is appropriate, but even with my limited understanding of what you are doing here, I can assure you this is not one of them. Had you obtained more matches you might have been misled into thinking you had accomplished something, when all you would really have done is create a mangled set of mismatched data. Whenever you are tempted to use -merge m:m-, think twice, three times, four times....a thousand times. It is almost always the wrong thing to do. See a more extensive discussion of this issue at #6 in http://www.statalist.org/forums/foru...r-loop-problem.

        gen mdate=mofd(date)
        Stata has numerous types of internal date and time variables. Stata date variables are a count of the number of days after (before, if negative) 1 Jan 1960. Stata also has monthly dates, which count the number of months after (resp. before) January 1960. The -mofd()- function calculates the monthly date from a daily date. Thus it maps, for example, 23 June 2016 to June 2016 (or, actually, the corresponding underlying counts of days and months.) In almost every discipline that uses statistics, we often work with data that has been collected on different time scales or intervals, and it is often necessary to convert between different date-time schemes. Stata offers a large number of such functions, of which -mofd()- is just one. If you will be continuing to work with Stata you should read carefully the material in the datetime chapter of the [D] manual (which comes as part of your Stata installation since version 11.) It is a lot of material and it takes a long time to get comfortable with. Even experts often find themselves referring to the help files frequently to check details. But the effort made to assimilate it will be amply repaid over time.

        joinby cusip8 using compustat.dta, unmatched(master)
        keep if date>=beginning & date<=datadate...

        I tried to run the commands like this on a smaller trial dataset, and it works beautifully. However, for my too large datasets, it does not work, and I also run into the memory issues.
        Yes, this is a frequent problem with large data sets. Recently, Robert Picard has written a new program -rangejoin- that solves this problem. You can download it from SSC. The syntax that would be the equivalent to these two commands is:
        Code:
        rangejoin date beginning datadate using compustat, by(cusip)
        The nice thing about -rangejoin- is that it does not just mash the two data sets together and then leave you to -keep- the small subset of observations that you need. It uses a very clever, and surprisingly fast, algorithm to identify the appropriate records to pair up. It is not only much faster than the -joinby- approach, it does not encounter the memory limitations you ran into because it never creates a bloated everything-paired-with-everything data set in the first place.

        Added: Crossed in cyberspace with #7, where Robert Picard himself deals with the last of my points using a slightly different approach.

        Comment


        • #5
          Dear Mr. Robert Picard and Mr. Clyde Schechter

          Thank you so much for your post. I just have a question as follow. What would happen if the date in temp_crsp is not in the end of the month? Since in my sample, the date in crsp could be anyday of teh month, since it's the date for closing bid and closing ask.

          To illustrate, my two datasets are as follows (I am very sorry, I should have explained myself better in my first posts)
          I have crsp database that look like this: crsp.dta
          cusip8 date spread
          12345678 2Jan2003 0.01
          12345678 3Jan2003 0.02
          12345678 15Jan2003 0.03
          12345678 4Mar2003 0.04
          12345678 12Dec2003 0.05
          12345678 6Aug2004 0.06
          12345678 9May2005 0.07
          22233344 3Aug2003 0.08
          22233344 18Sep2003 0.09
          22233344 14Jul2004 0.01
          22233344 15Oct2005 0.02
          33344455 4Apr2003 0.03
          33344455 12Jan2004 0.04

          And I have compustat that look like this: compustat.dta
          cusip8 datadate beginning
          12345678 31Dec2003 1Jan2003
          12345678 31Dec2004 1Jan2004
          12345678 31Dec2005 1Jan2005
          22233344 30Jun2004 1Jul2003
          22233344 30Jun2005 1Jul2004
          22233344 30Jun2006 1Jul2005
          33344455
          33344455
          31Mar2004
          31Mar2005
          1Apr2003
          1Apr2004

          So when I merge the two datasets, I was hoping to have something like this: merge.dta
          cusip8 date spread datadate beginning
          12345678 2Jan2003 0.01 31Dec2003 1Jan2003
          12345678 3Jan2003 0.02 31Dec2003 1Jan2003
          12345678 15Jan2003 0.03 31Dec2003 1Jan2003
          12345678 4Mar2003 0.04 31Dec2003 1Jan2003
          12345678 12Dec2003 0.05 31Dec2003 1Jan2003
          12345678 6Aug2004 0.06 31Dec2004 1Jan2004
          12345678 9May2005 0.07 31Dec2005 1Jan2005
          22233344 3Aug2003 0.08 30Jun2004 1Jul2003
          22233344 18Sep2003 0.09 30Jun2004 1Jul2003
          22233344 14Jul2004 0.01 30Jun2005 1Jul2004
          22233344 15Oct2005 0.02 30Jun2006 1Jul2005
          33344455 4Apr2003 0.03 31Mar2004 1Apr2003
          33344455 12Jan2004 0.04 31Mar2004 1Apr2003
          I really want to use Mr. Picard's commands, but I don't have a common variable between the two set.

          For example:
          *generate ndate in crsp
          gen ndate=daily(date,"DMY")
          format %td ndate

          the result is that 2Jan2003 is still 2Jan2003

          *similarly generate ndate in compustat
          gen ndate=daily(datadate,"DMY")
          format %td ndate

          the result is that 31Dec2003 is still 31Dec2003

          Therefore, even if I use -raingejoin-, I couldn't have ndate as a common variable.

          I tried Mr. Clyde's commands:
          rangejoin date beginning datadate using compustat.dta, by(cusip8)

          but Stata says: 'was expecting a numeric variable or a number for the interval low: beginning'
          I am not sure why this happen. I check that my variable 'beginning'. It is the date variable, not a string variable. How come Stata won't merge it?

          Please help me!!!

          Kind regards,
          An







          Last edited by An Le; 24 Jun 2016, 16:05.

          Comment


          • #6
            Please take some time to read the help file on Stata dates and date formats (help date time and help datetime_display_formats) and the various date functions, in particular the ones used in my code example, that is daily(), mdy(), day(), year(). The executive summary is that numeric daily dates store the date in terms of the number of days since 01jan1960:
            Code:
            . dis %td 0
            01jan1960
            Today's date is, in terms of days since 01jan1960:
            Code:
            . dis daily("25jun2016","DMY")
            20630
            With rangejoin, you match each observation in the data in memory with as many observations from the using dataset that fall within the specified range. In order to determine if an observation in the using dataset falls within the range, you need to specify a key variable to look up. With your data, that would be the CRSP date variable. This needs to be a Stata daily date. The range is specified using a lower and upper bound, in your case the start and end of the fiscal year for the current observation in the Compustat data (in memory). Once the data is properly set up, all you need is

            Code:
            rangejoin crsp_date fystart fyend using "temp_crsp.dta", by(cusip8)
            Here's a replay of the example in #3 using your new data. I left in the code to calculate the start of the fiscal year and I check that the variable values match those in your beginning variable:

            Code:
            * Example generated by -dataex-. To install: ssc install dataex
            clear
            input str8 cusip8 str10 date float spread
            "12345678" "2Jan2003"  .01
            "12345678" "3Jan2003"  .02
            "12345678" "15Jan2003" .03
            "12345678" "4Mar2003"  .04
            "12345678" "12Dec2003" .05
            "12345678" "6Aug2004"  .06
            "12345678" "9May2005"  .07
            "22233344" "3Aug2003"  .08
            "22233344" "18Sep2003" .09
            "22233344" "14Jul2004" .01
            "22233344" "15Oct2005" .02
            "33344455" "4Apr2003"  .03
            "33344455" "12Jan2004" .04
            end
            gen crsp_date = daily(date,"DMY")
            format %td crsp_date
            save "temp_crsp.dta", replace
            
            * Example generated by -dataex-. To install: ssc install dataex
            clear
            input str8 cusip8 str10(datadate beginning)
            "12345678" "31Dec2003" "1Jan2003"
            "12345678" "31Dec2004" "1Jan2004"
            "12345678" "31Dec2005" "1Jan2005"
            "22233344" "30Jun2004" "1Jul2003"
            "22233344" "30Jun2005" "1Jul2004"
            "22233344" "30Jun2006" "1Jul2005"
            "33344455" "31Mar2004" "1Apr2003"
            "33344455" "31Mar2005" "1Apr2004"
            end
            gen fyend = daily(datadate,"DMY")
            gen fystart = daily(beginning,"DMY")
            format %td fystart fyend
            save "temp_compustat.dta", replace
            
            * calculate the start of the fiscal year
            gen fystart2 = mdy(month(fyend + 1), day(fyend + 1), year(fyend + 1) - 1)
            assert fystart2 == fystart
            
            rangejoin crsp_date fystart fyend using "temp_crsp.dta", by(cusip8)
            list, sepby(cusip8 datadate) noobs
            And the results
            Code:
            . list, sepby(cusip8 datadate) noobs
            
              +-----------------------------------------------------------------------------------------------------+
              |   cusip8    datadate   beginn~g       fyend     fystart   fystart2        date   spread   crsp_date |
              |-----------------------------------------------------------------------------------------------------|
              | 12345678   31Dec2003   1Jan2003   31dec2003   01jan2003      15706    2Jan2003      .01   02jan2003 |
              | 12345678   31Dec2003   1Jan2003   31dec2003   01jan2003      15706    3Jan2003      .02   03jan2003 |
              | 12345678   31Dec2003   1Jan2003   31dec2003   01jan2003      15706   15Jan2003      .03   15jan2003 |
              | 12345678   31Dec2003   1Jan2003   31dec2003   01jan2003      15706    4Mar2003      .04   04mar2003 |
              | 12345678   31Dec2003   1Jan2003   31dec2003   01jan2003      15706   12Dec2003      .05   12dec2003 |
              |-----------------------------------------------------------------------------------------------------|
              | 12345678   31Dec2004   1Jan2004   31dec2004   01jan2004      16071    6Aug2004      .06   06aug2004 |
              |-----------------------------------------------------------------------------------------------------|
              | 12345678   31Dec2005   1Jan2005   31dec2005   01jan2005      16437    9May2005      .07   09may2005 |
              |-----------------------------------------------------------------------------------------------------|
              | 22233344   30Jun2004   1Jul2003   30jun2004   01jul2003      15887    3Aug2003      .08   03aug2003 |
              | 22233344   30Jun2004   1Jul2003   30jun2004   01jul2003      15887   18Sep2003      .09   18sep2003 |
              |-----------------------------------------------------------------------------------------------------|
              | 22233344   30Jun2005   1Jul2004   30jun2005   01jul2004      16253   14Jul2004      .01   14jul2004 |
              |-----------------------------------------------------------------------------------------------------|
              | 22233344   30Jun2006   1Jul2005   30jun2006   01jul2005      16618   15Oct2005      .02   15oct2005 |
              |-----------------------------------------------------------------------------------------------------|
              | 33344455   31Mar2004   1Apr2003   31mar2004   01apr2003      15796    4Apr2003      .03   04apr2003 |
              | 33344455   31Mar2004   1Apr2003   31mar2004   01apr2003      15796   12Jan2004      .04   12jan2004 |
              |-----------------------------------------------------------------------------------------------------|
              | 33344455   31Mar2005   1Apr2004   31mar2005   01apr2004      16162                    .           . |
              +-----------------------------------------------------------------------------------------------------+

            Comment


            • #7
              Dear Mr. Robert Picard,

              Thank you so much. It works!!! I have merged my two datasets thanks to your codes.

              Kind regards,
              An

              Comment

              Working...
              X