Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merging Compustat annual data and CRSP daily data, taking into consideration different fiscal year start and ends

    Hi Statalisters,

    I would appreciate your help with another challenge I am trying to tackle before I can (hopefully) finalize my data and start doing the 'real work' (creating variables and regressions) required for my thesis. Again, I found a post: https://www.statalist.org/forums/for...ill-in-trouble that kind of resembles my problem. At very least that I think that the code provided can be used in my case too but I do not really have an idea on how I could apply it to my data (-dataex- below).

    I have two sets of data, one from CRSP with daily stock return data and the other with Compustat yearly (fiscal) data that I want to merge. I have unique identifiers and date variables that make the merge possible but my one issues is that in Compustat the firms in my dataset start and end their fiscal years at different points (e.g. from December till December, from March till March etc.).

    I cannot ignore this characteristic of my dataset because I calculate a number of variables using the CRSP data, over differing windows of historical data. So not being careful would bias my results. My goal is to have CRSP daily data matched to the correct fiscal years when I merge it with my Compustat yearly data.

    I would merge the two databases on unique combinations of gvkey and date, bdate is a Stata recognized business calendar date variable but obviously trading does not always start on the first and last calendar day of the month.

    I realize that even if I think I am being perfectly clear it might not be the case, so I am happy to provide any additional elaboration. In the Compustat data I added gender and at (total assets) to show that I have many other accounting variables ( based fiscal yearly obs) in that dataset. fyr = represents the month in which the firm's fiscal year ends. This also determines how fyear corresponds to the year of the datadate in the Compustat dataset, because if fyr is <= 5, fyear = year t-1 and if fyr is >5, year = t.

    CRSP Data:
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input long(gvkey lpermco lpermno) float year str8 date float(bdate annualized_sd annualized_sd_resid)
    1690 7 14593 2002 "20020102"  0 48.58944 .
    1690 7 14593 2002 "20020103"  1 48.58944 .
    1690 7 14593 2002 "20020104"  2 48.58944 .
    1690 7 14593 2002 "20020107"  3 48.58944 .
    1690 7 14593 2002 "20020108"  4 48.58944 .
    1690 7 14593 2002 "20020109"  5 48.58944 .
    1690 7 14593 2002 "20020110"  6 48.58944 .
    1690 7 14593 2002 "20020111"  7 48.58944 .
    1690 7 14593 2002 "20020114"  8 48.58944 .
    1690 7 14593 2002 "20020115"  9 48.58944 .
    1690 7 14593 2002 "20020116" 10 48.58944 .
    1690 7 14593 2002 "20020117" 11 48.58944 .
    1690 7 14593 2002 "20020118" 12 48.58944 .
    1690 7 14593 2002 "20020122" 13 48.58944 .
    1690 7 14593 2002 "20020123" 14 48.58944 .
    1690 7 14593 2002 "20020124" 15 48.58944 .
    1690 7 14593 2002 "20020125" 16 48.58944 .
    1690 7 14593 2002 "20020128" 17 48.58944 .
    1690 7 14593 2002 "20020129" 18 48.58944 .
    1690 7 14593 2002 "20020130" 19 48.58944 .
    1690 7 14593 2002 "20020131" 20 48.58944 .
    1690 7 14593 2002 "20020201" 21 48.58944 .
    1690 7 14593 2002 "20020204" 22 48.58944 .
    1690 7 14593 2002 "20020205" 23 48.58944 .
    1690 7 14593 2002 "20020206" 24 48.58944 .
    1690 7 14593 2002 "20020207" 25 48.58944 .
    1690 7 14593 2002 "20020208" 26 48.58944 .
    1690 7 14593 2002 "20020211" 27 48.58944 .
    1690 7 14593 2002 "20020212" 28 48.58944 .
    1690 7 14593 2002 "20020213" 29 48.58944 .
    1690 7 14593 2002 "20020214" 30 48.58944 .
    1690 7 14593 2002 "20020215" 31 48.58944 .
    1690 7 14593 2002 "20020219" 32 48.58944 .
    1690 7 14593 2002 "20020220" 33 48.58944 .
    1690 7 14593 2002 "20020221" 34 48.58944 .
    1690 7 14593 2002 "20020222" 35 48.58944 .
    1690 7 14593 2002 "20020225" 36 48.58944 .
    1690 7 14593 2002 "20020226" 37 48.58944 .
    1690 7 14593 2002 "20020227" 38 48.58944 .
    1690 7 14593 2002 "20020228" 39 48.58944 .
    1690 7 14593 2002 "20020301" 40 48.58944 .
    1690 7 14593 2002 "20020304" 41 48.58944 .
    1690 7 14593 2002 "20020305" 42 48.58944 .
    1690 7 14593 2002 "20020306" 43 48.58944 .
    1690 7 14593 2002 "20020307" 44 48.58944 .
    1690 7 14593 2002 "20020308" 45 48.58944 .
    1690 7 14593 2002 "20020311" 46 48.58944 .
    1690 7 14593 2002 "20020312" 47 48.58944 .
    1690 7 14593 2002 "20020313" 48 48.58944 .
    1690 7 14593 2002 "20020314" 49 48.58944 .
    end
    format %tbtrade_day2 bdate
    Compustat Data:
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input long datadate int fyear byte fyr long gvkey float at str6 gender
    20021231 2002 12 100004  2378.857 ""    
    20031231 2003 12 100004   3189.38 ""    
    20041231 2004 12 100004  3653.951 ""    
    20051231 2005 12 100004  4301.948 ""    
    20061231 2006 12 100004  8367.235 ""    
    20071231 2007 12 100004  2518.692 ""    
    20081231 2008 12 100004  2435.242 ""    
    20091231 2009 12 100004  2447.668 ""    
    20011231 2001 12  10000   509.429 ""    
    20021231 2002 12  10000   490.758 ""    
    20031231 2003 12  10000   694.525 ""    
    20041231 2004 12  10000   656.569 ""    
    20051231 2005 12  10000   653.044 ""    
    20061231 2006 12  10000   640.092 "MALE"
    20071231 2007 12  10000   678.092 "MALE"
    20081231 2008 12  10000   575.027 "MALE"
    20091231 2009 12  10000   484.459 "MALE"
    20101231 2010 12  10000   492.801 "MALE"
    20111231 2011 12  10000   550.722 "MALE"
    20121231 2012 12  10000   576.594 "MALE"
    20131231 2013 12  10000   615.523 "MALE"
    20141231 2014 12  10000   673.551 "MALE"
    20151231 2015 12  10000   681.064 "MALE"
    20161231 2016 12  10000   768.697 "MALE"
    20171231 2017 12  10000   787.567 "MALE"
    20010831 2001  8 100011  5595.056 ""    
    20020831 2002  8 100011   7628.12 ""    
    20030831 2003  8 100011  7998.488 ""    
    20040831 2004  8 100011  7856.107 ""    
    20011231 2001 12  10002  1366.301 ""    
    20021231 2002 12  10002  1792.126 ""    
    20031231 2003 12  10002  2460.703 ""    
    20041231 2004 12  10002  3013.233 ""    
    20051231 2005 12  10002  4280.842 ""    
    20061231 2006 12  10002  4502.941 "MALE"
    20071231 2007 12  10002  3400.726 "MALE"
    20081231 2008 12  10002  2249.854 "MALE"
    20091231 2009 12  10002  1861.011 "MALE"
    20101231 2010 12  10002  2133.123 "MALE"
    20111231 2011 12  10002  2200.383 "MALE"
    20121231 2012 12  10002  3113.074 "MALE"
    20131231 2013 12  10002  3662.105 "MALE"
    20141231 2014 12  10002   4174.42 "MALE"
    20151231 2015 12  10002  8345.505 "MALE"
    20161231 2016 12  10002  8709.044 "MALE"
    20010331 2000  3 100039  23752.21 ""    
    20020331 2001  3 100039 20674.504 ""    
    20030331 2002  3 100039  22871.05 ""    
    20040331 2003  3 100039 24939.877 ""    
    20050331 2004  3 100039  24305.39 ""    
    end






  • #2
    I can probably help you with this if you clarify the problem a bit.

    So I think you are looking to match up each observation in the first data set with observations in the second data set for the same gvkey value and where the date in the first data set falls in the fiscal year of the observation in the second data set.

    But I do not know how to interpret the variables fyear and fyr in the second data set. I understand that fyr = 12, for example, means that the fiscal year begins in December. (Correct? Or does it end in December?) But, looking at the first observation in the second data set, where fyear = 2002 and fyr = 12, does that mean that this observation is eligible to match with observations in the first data set having date between 12/1/2001 and 11/30/2002 or with observations between 12/1/2002 and 11/30/2003. Or does it mean something altogether different? I imagine that this is standard stuff in finance, but I know nothing about it: epidemiology does not deal in fiscal years.

    In any case, your data examples aren't really helpful for working on this problem because I can tell you at a glance that nothing in your first data set matches anything in the second one since the values of the gvkey variable don't overlap at all. So when you post back, please give data examples where there are some matches between the two data sets. The solution to your problem is probably not difficult, but I will want to test the code before responding, and testing it on the current examples would not really give any indication whether it is correct or not; it would only rule out typos that lead to syntax errors.

    Comment


    • #3
      Hello Clyde,

      I will do my best to clarify below:

      So I think you are looking to match up each observation in the first data set with observations in the second data set for the same gvkey value and where the date in the first data set falls in the fiscal year of the observation in the second data set.
      Correct. I am looking for the same result as that shown by Robert in #6 of the link I posted.

      But I do not know how to interpret the variables fyear and fyr in the second data set. I understand that fyr = 12, for example, means that the fiscal year begins in December. (Correct? Or does it end in December?) But, looking at the first observation in the second data set, where fyear = 2002 and fyr = 12, does that mean that this observation is eligible to match with observations in the first data set having date between 12/1/2001 and 11/30/2002 or with observations between 12/1/2002 and 11/30/2003.
      If fyr = 12, it means that the firm's fiscal year ends on the last day of December (e.g. for the first observation in the second data set, fyr = 12 and datadate = 12/31/2002).
      Thus, using again the first observation of the second data set as an example: having fyear = 2002 and fyr = 12 indicates that that firm's fiscal year started on 01/01/2002 and ended on 12/31/2002.

      Perhaps looking at the last 5 observations in the second data set (gvkey = 100039) gives a clearer example of the relationship between datadate, fyear and fyr.
      In the first observation of this set, datadate = 03/31/2001 and fyr = 3. Meaning that that this firm's fiscal year ended on the last day of March (03/31/2001) and thus started on 04/01/2000.
      Because fyr <= 5, the reasoning is that because most of the firm's operations for that observation occurred in 2000 (9 months) and just three months in 2001, that firm-year observation is assigned fyear = 2000 instead of fyear = 2001, as one would expect based on the datadate.

      In any case, your data examples aren't really helpful for working on this problem because I can tell you at a glance that nothing in your first data set matches anything in the second one since the values of the gvkey variable don't overlap at all. So when you post back, please give data examples where there are some matches between the two data sets.
      Understandable. I have the same problem as the link I posted where my first dataset due to it having daily data, has a significantly higher amount of observations (approx. 13 million vs. 100K) than my second dataset. I try to provide a more workable -dataex- below:

      CRSP data:
      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input long(gvkey lpermco lpermno) float(year bdate) str8 date float(annualized_sd annualized_sd_resid)
      1690 7 14593 2002  0 "20020102" 48.58944 .
      1690 7 14593 2002  1 "20020103" 48.58944 .
      1690 7 14593 2002  2 "20020104" 48.58944 .
      1690 7 14593 2002  3 "20020107" 48.58944 .
      1690 7 14593 2002  4 "20020108" 48.58944 .
      1690 7 14593 2002  5 "20020109" 48.58944 .
      1690 7 14593 2002  6 "20020110" 48.58944 .
      1690 7 14593 2002  7 "20020111" 48.58944 .
      1690 7 14593 2002  8 "20020114" 48.58944 .
      1690 7 14593 2002  9 "20020115" 48.58944 .
      1690 7 14593 2002 10 "20020116" 48.58944 .
      1690 7 14593 2002 11 "20020117" 48.58944 .
      1690 7 14593 2002 12 "20020118" 48.58944 .
      1690 7 14593 2002 13 "20020122" 48.58944 .
      1690 7 14593 2002 14 "20020123" 48.58944 .
      1690 7 14593 2002 15 "20020124" 48.58944 .
      1690 7 14593 2002 16 "20020125" 48.58944 .
      1690 7 14593 2002 17 "20020128" 48.58944 .
      1690 7 14593 2002 18 "20020129" 48.58944 .
      1690 7 14593 2002 19 "20020130" 48.58944 .
      1004 20000 54594 2002  0 "20020102" 74.91421 .
      1004 20000 54594 2002  1 "20020103" 74.91421 .
      1004 20000 54594 2002  2 "20020104" 74.91421 .
      1004 20000 54594 2002  3 "20020107" 74.91421 .
      1004 20000 54594 2002  4 "20020108" 74.91421 .
      1004 20000 54594 2002  5 "20020109" 74.91421 .
      1004 20000 54594 2002  6 "20020110" 74.91421 .
      1004 20000 54594 2002  7 "20020111" 74.91421 .
      1004 20000 54594 2002  8 "20020114" 74.91421 .
      1004 20000 54594 2002  9 "20020115" 74.91421 .
      1004 20000 54594 2002 10 "20020116" 74.91421 .
      1004 20000 54594 2002 11 "20020117" 74.91421 .
      1004 20000 54594 2002 12 "20020118" 74.91421 .
      1004 20000 54594 2002 13 "20020122" 74.91421 .
      1004 20000 54594 2002 14 "20020123" 74.91421 .
      1004 20000 54594 2002 15 "20020124" 74.91421 .
      1004 20000 54594 2002 16 "20020125" 74.91421 .
      1004 20000 54594 2002 17 "20020128" 74.91421 .
      1004 20000 54594 2002 18 "20020129" 74.91421 .
      1004 20000 54594 2002 19 "20020130" 74.91421 .
      1004 20000 54594 2002 20 "20020131" 74.91421 .
      4503 20678 11850 2002  0 "20020102" 32.22919 .
      4503 20678 11850 2002  1 "20020103" 32.22919 .
      4503 20678 11850 2002  2 "20020104" 32.22919 .
      4503 20678 11850 2002  3 "20020107" 32.22919 .
      4503 20678 11850 2002  4 "20020108" 32.22919 .
      4503 20678 11850 2002  5 "20020109" 32.22919 .
      4503 20678 11850 2002  6 "20020110" 32.22919 .
      4503 20678 11850 2002  7 "20020111" 32.22919 .
      4503 20678 11850 2002  8 "20020114" 32.22919 .
      4503 20678 11850 2002  9 "20020115" 32.22919 .
      4503 20678 11850 2002 10 "20020116" 32.22919 .
      4503 20678 11850 2002 11 "20020117" 32.22919 .
      4503 20678 11850 2002 12 "20020118" 32.22919 .
      4503 20678 11850 2002 13 "20020122" 32.22919 .
      4503 20678 11850 2002 14 "20020123" 32.22919 .
      4503 20678 11850 2002 15 "20020124" 32.22919 .
      4503 20678 11850 2002 16 "20020125" 32.22919 .
      4503 20678 11850 2002 17 "20020128" 32.22919 .
      4503 20678 11850 2002 18 "20020129" 32.22919 .
      4503 20678 11850 2002 19 "20020130" 32.22919 .
      4503 20678 11850 2002 20 "20020131" 32.22919 .
      end
      format %tbtrade_day2 bdate

      Compustat data:
      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input long datadate int fyear byte fyr long gvkey float at str6 gender
      20010531 2000  5 1004  701.854 ""    
      20020531 2001  5 1004  710.199 ""    
      20030531 2002  5 1004  686.621 ""    
      20040531 2003  5 1004  709.292 ""    
      20050531 2004  5 1004   732.23 ""    
      20060531 2005  5 1004  978.819 ""    
      20070531 2006  5 1004 1067.633 "MALE"
      20080531 2007  5 1004  1362.01 "MALE"
      20090531 2008  5 1004 1377.511 "MALE"
      20100531 2009  5 1004 1501.042 "MALE"
      20110531 2010  5 1004 1703.727 "MALE"
      20120531 2011  5 1004 2195.653 "MALE"
      20130531 2012  5 1004   2136.9 "MALE"
      20140531 2013  5 1004   2199.5 "MALE"
      20150531 2014  5 1004     1515 "MALE"
      20160531 2015  5 1004   1442.1 "MALE"
      20170531 2016  5 1004   1504.1 "MALE"
      20180531 2017  5 1004   1524.7 ""    
      20010930 2001  9 1690     6021 ""    
      20020930 2002  9 1690     6298 ""    
      20030930 2003  9 1690     6815 ""    
      20040930 2004  9 1690     8050 ""    
      20050930 2005  9 1690    11551 ""    
      20060930 2006  9 1690    17205 "MALE"
      20070930 2007  9 1690    25347 "MALE"
      20080930 2008  9 1690    39572 "MALE"
      20090930 2009  9 1690    47501 "MALE"
      20100930 2010  9 1690    75183 "MALE"
      20110930 2011  9 1690   116371 "MALE"
      20120930 2012  9 1690   176064 "MALE"
      20130930 2013  9 1690   207000 "MALE"
      20140930 2014  9 1690   231839 "MALE"
      20150930 2015  9 1690   290479 "MALE"
      20160930 2016  9 1690   321686 "MALE"
      20170930 2017  9 1690   375319 "MALE"
      20011231 2001 12 4503   143174 ""    
      20021231 2002 12 4503   152644 ""    
      20031231 2003 12 4503   174278 ""    
      20041231 2004 12 4503   195256 ""    
      20051231 2005 12 4503   208335 ""    
      20061231 2006 12 4503   219015 "MALE"
      20071231 2007 12 4503   242082 "MALE"
      20081231 2008 12 4503   228052 "MALE"
      20091231 2009 12 4503   233323 "MALE"
      20101231 2010 12 4503   302510 "MALE"
      20111231 2011 12 4503   331052 "MALE"
      20121231 2012 12 4503   333795 "MALE"
      20131231 2013 12 4503   346808 "MALE"
      20141231 2014 12 4503   349493 "MALE"
      20151231 2015 12 4503   336758 "MALE"
      end

      Comment


      • #4
        Thanks for clearing that up and the better example.

        There is one slight wrinkle. I don't have the definition of your business calendar, so I can't work with your bdate variable. So in the code below, in your CRSP data I have created a Stata internal format date variable, cal_date, from your string variable date, and used that to pair up observations between CRSP and COMPUSTAT. You can either do it this way yourself, or you can modify the code that generates the variables fyear_start and fyear_end in the COMPUSTAT data to create business calendar dates there, and then use your original CRSP data bdate to do the pairing.

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input long(gvkey lpermco lpermno) float(year bdate) str8 date float(annualized_sd annualized_sd_resid)
        1690 7 14593 2002  0 "20020102" 48.58944 .
        1690 7 14593 2002  1 "20020103" 48.58944 .
        1690 7 14593 2002  2 "20020104" 48.58944 .
        1690 7 14593 2002  3 "20020107" 48.58944 .
        1690 7 14593 2002  4 "20020108" 48.58944 .
        1690 7 14593 2002  5 "20020109" 48.58944 .
        1690 7 14593 2002  6 "20020110" 48.58944 .
        1690 7 14593 2002  7 "20020111" 48.58944 .
        1690 7 14593 2002  8 "20020114" 48.58944 .
        1690 7 14593 2002  9 "20020115" 48.58944 .
        1690 7 14593 2002 10 "20020116" 48.58944 .
        1690 7 14593 2002 11 "20020117" 48.58944 .
        1690 7 14593 2002 12 "20020118" 48.58944 .
        1690 7 14593 2002 13 "20020122" 48.58944 .
        1690 7 14593 2002 14 "20020123" 48.58944 .
        1690 7 14593 2002 15 "20020124" 48.58944 .
        1690 7 14593 2002 16 "20020125" 48.58944 .
        1690 7 14593 2002 17 "20020128" 48.58944 .
        1690 7 14593 2002 18 "20020129" 48.58944 .
        1690 7 14593 2002 19 "20020130" 48.58944 .
        1004 20000 54594 2002  0 "20020102" 74.91421 .
        1004 20000 54594 2002  1 "20020103" 74.91421 .
        1004 20000 54594 2002  2 "20020104" 74.91421 .
        1004 20000 54594 2002  3 "20020107" 74.91421 .
        1004 20000 54594 2002  4 "20020108" 74.91421 .
        1004 20000 54594 2002  5 "20020109" 74.91421 .
        1004 20000 54594 2002  6 "20020110" 74.91421 .
        1004 20000 54594 2002  7 "20020111" 74.91421 .
        1004 20000 54594 2002  8 "20020114" 74.91421 .
        1004 20000 54594 2002  9 "20020115" 74.91421 .
        1004 20000 54594 2002 10 "20020116" 74.91421 .
        1004 20000 54594 2002 11 "20020117" 74.91421 .
        1004 20000 54594 2002 12 "20020118" 74.91421 .
        1004 20000 54594 2002 13 "20020122" 74.91421 .
        1004 20000 54594 2002 14 "20020123" 74.91421 .
        1004 20000 54594 2002 15 "20020124" 74.91421 .
        1004 20000 54594 2002 16 "20020125" 74.91421 .
        1004 20000 54594 2002 17 "20020128" 74.91421 .
        1004 20000 54594 2002 18 "20020129" 74.91421 .
        1004 20000 54594 2002 19 "20020130" 74.91421 .
        1004 20000 54594 2002 20 "20020131" 74.91421 .
        4503 20678 11850 2002  0 "20020102" 32.22919 .
        4503 20678 11850 2002  1 "20020103" 32.22919 .
        4503 20678 11850 2002  2 "20020104" 32.22919 .
        4503 20678 11850 2002  3 "20020107" 32.22919 .
        4503 20678 11850 2002  4 "20020108" 32.22919 .
        4503 20678 11850 2002  5 "20020109" 32.22919 .
        4503 20678 11850 2002  6 "20020110" 32.22919 .
        4503 20678 11850 2002  7 "20020111" 32.22919 .
        4503 20678 11850 2002  8 "20020114" 32.22919 .
        4503 20678 11850 2002  9 "20020115" 32.22919 .
        4503 20678 11850 2002 10 "20020116" 32.22919 .
        4503 20678 11850 2002 11 "20020117" 32.22919 .
        4503 20678 11850 2002 12 "20020118" 32.22919 .
        4503 20678 11850 2002 13 "20020122" 32.22919 .
        4503 20678 11850 2002 14 "20020123" 32.22919 .
        4503 20678 11850 2002 15 "20020124" 32.22919 .
        4503 20678 11850 2002 16 "20020125" 32.22919 .
        4503 20678 11850 2002 17 "20020128" 32.22919 .
        4503 20678 11850 2002 18 "20020129" 32.22919 .
        4503 20678 11850 2002 19 "20020130" 32.22919 .
        4503 20678 11850 2002 20 "20020131" 32.22919 .
        end
        format %tbtrade_day2 bdate
        gen cal_date = daily(date, "YMD")
        format cal_date %td
        tempfile crsp
        save `crsp'
        
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input long datadate int fyear byte fyr long gvkey float at str6 gender
        20010531 2000  5 1004  701.854 ""    
        20020531 2001  5 1004  710.199 ""    
        20030531 2002  5 1004  686.621 ""    
        20040531 2003  5 1004  709.292 ""    
        20050531 2004  5 1004   732.23 ""    
        20060531 2005  5 1004  978.819 ""    
        20070531 2006  5 1004 1067.633 "MALE"
        20080531 2007  5 1004  1362.01 "MALE"
        20090531 2008  5 1004 1377.511 "MALE"
        20100531 2009  5 1004 1501.042 "MALE"
        20110531 2010  5 1004 1703.727 "MALE"
        20120531 2011  5 1004 2195.653 "MALE"
        20130531 2012  5 1004   2136.9 "MALE"
        20140531 2013  5 1004   2199.5 "MALE"
        20150531 2014  5 1004     1515 "MALE"
        20160531 2015  5 1004   1442.1 "MALE"
        20170531 2016  5 1004   1504.1 "MALE"
        20180531 2017  5 1004   1524.7 ""    
        20010930 2001  9 1690     6021 ""    
        20020930 2002  9 1690     6298 ""    
        20030930 2003  9 1690     6815 ""    
        20040930 2004  9 1690     8050 ""    
        20050930 2005  9 1690    11551 ""    
        20060930 2006  9 1690    17205 "MALE"
        20070930 2007  9 1690    25347 "MALE"
        20080930 2008  9 1690    39572 "MALE"
        20090930 2009  9 1690    47501 "MALE"
        20100930 2010  9 1690    75183 "MALE"
        20110930 2011  9 1690   116371 "MALE"
        20120930 2012  9 1690   176064 "MALE"
        20130930 2013  9 1690   207000 "MALE"
        20140930 2014  9 1690   231839 "MALE"
        20150930 2015  9 1690   290479 "MALE"
        20160930 2016  9 1690   321686 "MALE"
        20170930 2017  9 1690   375319 "MALE"
        20011231 2001 12 4503   143174 ""    
        20021231 2002 12 4503   152644 ""    
        20031231 2003 12 4503   174278 ""    
        20041231 2004 12 4503   195256 ""    
        20051231 2005 12 4503   208335 ""    
        20061231 2006 12 4503   219015 "MALE"
        20071231 2007 12 4503   242082 "MALE"
        20081231 2008 12 4503   228052 "MALE"
        20091231 2009 12 4503   233323 "MALE"
        20101231 2010 12 4503   302510 "MALE"
        20111231 2011 12 4503   331052 "MALE"
        20121231 2012 12 4503   333795 "MALE"
        20131231 2013 12 4503   346808 "MALE"
        20141231 2014 12 4503   349493 "MALE"
        20151231 2015 12 4503   336758 "MALE"
        end
        tempfile compustat
        save `compustat'
        
        use `compustat', clear
        //    IDENTIFY START AND END OF FISCAL YEAR
        gen fyear_end = dofm(ym(fyear, fyr)+1)-1
        gen fyear_start = dofm(mofd(fyear_end)-11)
        format fyear_start fyear_end %td
        tempfile to_join
        
        rangejoin cal_date fyear_start fyear_end using `crsp', by(gvkey)
        drop if missing(cal_date)
        To use this code, you must have the program -rangejoin-, written by Robert Picard and available from SSC. To use -rangejoin-, you must also have -rangestat-, by Robert Picard, Roberto Ferrer, and Nick Cox, also available from SSC.

        Comment


        • #5
          Hi Clyde,

          Thank you for your help with the code. I am keen to try it.
          In the meanwhile, I have tried to adapt the code posted by Robert as follows, I am wondering if it also checks out.
          I am concerned however, because Stata has already been working for more than an hour and I do not know if this is normal/to be expected.

          Compustat (yearly fiscal):
          Code:
          gen crsp_date = daily(date, "YMD")
          format %td crsp_date
          save "temp_crsp.dta"
          CRSP (daily):
          Code:
          tostring datadate, replace
          gen fyend = daily(datadate, "YMD")
          format %td fyend
          
          gen fystart = mdy(month(fyend + 1), day(fyend + 1), year(fyend + 1) -1) <-- I think I already messed up here because my other date variables are "YMD"
          format %td fystart
          save "temp_compustat.dta"
          
          ssc install rangejoin
          rangejoin crsp_date fystart fyend using "temp_crsp.dta", by(gvkey)

          Comment


          • #6
            The code you show in #5 appears to be the same, at least in concept, as mine in #4. It differs in that it calculates the fystart and fyend dates from the variable datadate instead of from fyear and fyr. But if datadate is, in fact, always the final date of the fiscal year, then they really are just notational variants of each other.

            As for how long this will take to run, it depends on the size of the data sets, the speed of your computer, and the amount of memory your computer allocates to Stata.

            Comment


            • #7
              Originally posted by Clyde Schechter View Post
              The code you show in #5 appears to be the same, at least in concept, as mine in #4. It differs in that it calculates the fystart and fyend dates from the variable datadate instead of from fyear and fyr. But if datadate is, in fact, always the final date of the fiscal year, then they really are just notational variants of each other.

              As for how long this will take to run, it depends on the size of the data sets, the speed of your computer, and the amount of memory your computer allocates to Stata.
              Hi Clyde,

              I am trying to run the code you gave me but it is not completely working. The fyear_start and fyear_end variables are generated correctly for firms where fyr > 5 but when fyr < 5,
              the code understates the year by 1 (see -dataex-). This is not directly a problem, because I used the other code I have from Robert which seems to work for all fyr's,

              However, when I try to run rangejoin using the tempfile I get the error:
              . rangejoin cal_date fyear_start fyear_end using `crsp', by(gvkey)
              invalid file specification
              r(198);

              end of do-file

              r(198);

              Perhaps this is because I do not understand necessarily how tempfile works? Maybe there are some parts of the code that only you had to run because you do not have my full data?

              Here is the -dataex- fyear_start and fyear_end are calculated with Robert's code and fyear_start1 and fyear_end1 are calculated with your code.

              Code:
              * Example generated by -dataex-. To install: ssc install dataex
              clear
              input str8 datadate long gvkey float(fyear_start fyear_end fyear_start1 fyear_end1)
              "20010331" 100039 14701 15065 14335 14700
              "20020331" 100039 15066 15430 14701 15065
              "20030331" 100039 15431 15795 15066 15430
              "20040331" 100039 15796 16161 15431 15795
              "20050331" 100039 16162 16526 15796 16161
              end
              format %td fyear_start
              format %td fyear_end
              format %td fyear_start1
              format %td fyear_end1
              Code:
              //clyde
              tempfile compustat
              save `compustat'
              
              use `compustat', clear
              //    IDENTIFY START AND END OF FISCAL YEAR
              gen fyear_end1 = dofm(ym(fyear, fyr)+1)-1
              gen fyear_start1 = dofm(mofd(fyear_end1)-11)
              format fyear_start1 fyear_end1 %td
              order fyear_start1, b(fyear_end1)
              tempfile to_join
              
              //robert
              tostring datadate, replace
              gen fyear_end = daily(datadate, "YMD")
              gen fyear_start = mdy(month(fyear_end + 1), day(fyear_end + 1), year(fyear_end + 1) -1) 
              format fyear_start fyear_end %td
              order fyear_start, b(fyear_end)
              tempfile to_join
              
              ssc install rangejoin
              rangejoin cal_date fyear_start fyear_end using `crsp', by(gvkey)
              drop if missing(cal_date)

              Comment


              • #8
                Code:
                rangejoin cal_date fyear_start fyear_end using `crsp', by(gvkey)
                invalid file specification
                r(198);
                The code I showed in #4 ran without any error messages on my computer. If you tried to run the code on yours and got this message, the most likely explanation is that you did not run the whole thing at once. Probably you ran pieces of it separately. That does not work when the code has local macros in it. If you run a bunch of code with a local macro, once that code finishes, the local macro goes out of existence. So if you ran the code that created tempfile `crsp' separately from the code that tried to use it in the rangejoin command, then when you got to the rangejoin command, Stata would no longer know what `crsp' refers to. If you run that code all together, in one fell swoop, you won't get that message.

                As for the calculation of the fiscal year, I guess I still don't understand what that's about. As far as I understand it, my code does what you described and Robert's doesn't. But you're the expert on this. Since you have Robert's code working, it probably isn't worth your time or mine to figure out where the misunderstanding arose at this point.

                That said, I do now notice that there is, in fact an error in the code in #4. The -using `crsp'- actually should have been -using `tojoin'- in the -rangejoin- command. But that is beside the point now. I'm glad you were able to solve your problem with Robert's code.

                Comment


                • #9
                  Originally posted by Clyde Schechter View Post
                  Code:
                  rangejoin cal_date fyear_start fyear_end using `crsp', by(gvkey)
                  invalid file specification
                  r(198);
                  The code I showed in #4 ran without any error messages on my computer. If you tried to run the code on yours and got this message, the most likely explanation is that you did not run the whole thing at once. Probably you ran pieces of it separately. That does not work when the code has local macros in it. If you run a bunch of code with a local macro, once that code finishes, the local macro goes out of existence. So if you ran the code that created tempfile `crsp' separately from the code that tried to use it in the rangejoin command, then when you got to the rangejoin command, Stata would no longer know what `crsp' refers to. If you run that code all together, in one fell swoop, you won't get that message.

                  As for the calculation of the fiscal year, I guess I still don't understand what that's about. As far as I understand it, my code does what you described and Robert's doesn't. But you're the expert on this. Since you have Robert's code working, it probably isn't worth your time or mine to figure out where the misunderstanding arose at this point.

                  That said, I do now notice that there is, in fact an error in the code in #4. The -using `crsp'- actually should have been -using `tojoin'- in the -rangejoin- command. But that is beside the point now. I'm glad you were able to solve your problem with Robert's code.
                  Dear Clyde,

                  Thank you for your explanation and patience so far. Unfortunately, I am not out of the woods yet.
                  I seem to be running into memory issues trying to run the rangejoin command. The command seems to be running for the first few minutes after I enter it and then gets stuck... I was not home for 12 hours since yesterday and left it to do it's work and it is still not done. The Stata windows intermittently reports a 'not responding' error and memory is at >88% with hard drive at 100% to whole time.

                  So far, I've ran rangejoin using my .dta files, would using the tempfiles make a difference? Otherwise, is there any other way to perform the same task that uses less resources?

                  Comment


                  • #10
                    Using tempfiles will not reduce the burden. I think the best approach her is to break up each of the data sets into smaller data sets containing ranges of gvkey. Run the code separately for each of those pairs of subsets, and then append all the results together. I have occasionally had to do similar things in my own work.

                    Comment


                    • #11
                      Hi Shaquille,

                      It might be easier to break this down into a couple of sub-tasks:
                      1. For firms where their fiscal year ends 12/31, then fiscal_year = year across both datasets. In this case it is easy to merge the 2 datasets.
                      2. For the remainder, it might be easier to create a variable in both datasets called match_year. In the COMPUSTAT data, match_year = fiscal_year. In the CRSP stock data, you would adjust match_year to be year, year-1, etc. So, if a firm has a 31 MARCH year end, and you have stock market data for 1 MARCH 2002 (so 20020301), do you want that to match to revenue in fiscal year 2001 or 2002? It's not clear from your original post which way you want to go. But assume that you want to match it to fiscal_year == 2001 (since revenue for March 2002 falls in the 2001 fiscal year for that firm). Then match_year = calendar_year - 1.
                      Once that is done, you could just merge it normally using something like:

                      Code:
                       merge 1:1 gvkey match_year using "Compustat_firm_year.dta"

                      Comment

                      Working...
                      X