Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merging dataset with the first available date

    Dear all,
    I have two datasets:
    One data contains exchange rates of different currencies as follow

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input long ANNDATS str4 CURR double EXRAT str28 CURNAM
    18994 "USD" 1 "U.S. Dollar"
    18995 "USD" 1 "U.S. Dollar"
    18996 "USD" 1 "U.S. Dollar"
    18997 "USD" 1 "U.S. Dollar"
    18998 "USD" 1 "U.S. Dollar"
    19001 "USD" 1 "U.S. Dollar"
    19002 "USD" 1 "U.S. Dollar"
    19003 "USD" 1 "U.S. Dollar"
    19004 "USD" 1 "U.S. Dollar"
    19005 "USD" 1 "U.S. Dollar"
    19008 "USD" 1 "U.S. Dollar"
    19009 "USD" 1 "U.S. Dollar"
    19010 "USD" 1 "U.S. Dollar"
    19011 "USD" 1 "U.S. Dollar"
    19012 "USD" 1 "U.S. Dollar"
    19015 "USD" 1 "U.S. Dollar"
    19016 "USD" 1 "U.S. Dollar"
    19017 "USD" 1 "U.S. Dollar"
    19018 "USD" 1 "U.S. Dollar"
    19019 "USD" 1 "U.S. Dollar"
    19022 "USD" 1 "U.S. Dollar"
    19023 "USD" 1 "U.S. Dollar"
    19024 "USD" 1 "U.S. Dollar"
    19025 "USD" 1 "U.S. Dollar"
    19026 "USD" 1 "U.S. Dollar"
    19029 "USD" 1 "U.S. Dollar"
    19030 "USD" 1 "U.S. Dollar"
    19031 "USD" 1 "U.S. Dollar"
    19032 "USD" 1 "U.S. Dollar"
    19033 "USD" 1 "U.S. Dollar"
    19036 "USD" 1 "U.S. Dollar"
    19037 "USD" 1 "U.S. Dollar"
    19038 "USD" 1 "U.S. Dollar"
    19039 "USD" 1 "U.S. Dollar"
    19040 "USD" 1 "U.S. Dollar"
    19043 "USD" 1 "U.S. Dollar"
    19044 "USD" 1 "U.S. Dollar"
    19045 "USD" 1 "U.S. Dollar"
    19046 "USD" 1 "U.S. Dollar"
    19047 "USD" 1 "U.S. Dollar"
    19050 "USD" 1 "U.S. Dollar"
    19051 "USD" 1 "U.S. Dollar"
    19052 "USD" 1 "U.S. Dollar"
    19053 "USD" 1 "U.S. Dollar"
    19054 "USD" 1 "U.S. Dollar"
    19057 "USD" 1 "U.S. Dollar"
    19058 "USD" 1 "U.S. Dollar"
    19059 "USD" 1 "U.S. Dollar"
    19060 "USD" 1 "U.S. Dollar"
    19061 "USD" 1 "U.S. Dollar"
    19064 "USD" 1 "U.S. Dollar"
    19065 "USD" 1 "U.S. Dollar"
    19066 "USD" 1 "U.S. Dollar"
    19067 "USD" 1 "U.S. Dollar"
    19068 "USD" 1 "U.S. Dollar"
    19071 "USD" 1 "U.S. Dollar"
    19072 "USD" 1 "U.S. Dollar"
    19073 "USD" 1 "U.S. Dollar"
    19074 "USD" 1 "U.S. Dollar"
    19075 "USD" 1 "U.S. Dollar"
    19078 "USD" 1 "U.S. Dollar"
    19079 "USD" 1 "U.S. Dollar"
    19080 "USD" 1 "U.S. Dollar"
    19081 "USD" 1 "U.S. Dollar"
    19082 "USD" 1 "U.S. Dollar"
    19085 "USD" 1 "U.S. Dollar"
    19086 "USD" 1 "U.S. Dollar"
    19087 "USD" 1 "U.S. Dollar"
    19088 "USD" 1 "U.S. Dollar"
    19089 "USD" 1 "U.S. Dollar"
    19092 "USD" 1 "U.S. Dollar"
    19093 "USD" 1 "U.S. Dollar"
    19094 "USD" 1 "U.S. Dollar"
    19095 "USD" 1 "U.S. Dollar"
    19096 "USD" 1 "U.S. Dollar"
    19099 "USD" 1 "U.S. Dollar"
    19100 "USD" 1 "U.S. Dollar"
    19101 "USD" 1 "U.S. Dollar"
    19102 "USD" 1 "U.S. Dollar"
    19103 "USD" 1 "U.S. Dollar"
    19106 "USD" 1 "U.S. Dollar"
    19107 "USD" 1 "U.S. Dollar"
    19108 "USD" 1 "U.S. Dollar"
    19109 "USD" 1 "U.S. Dollar"
    19110 "USD" 1 "U.S. Dollar"
    19113 "USD" 1 "U.S. Dollar"
    19114 "USD" 1 "U.S. Dollar"
    19115 "USD" 1 "U.S. Dollar"
    19116 "USD" 1 "U.S. Dollar"
    19117 "USD" 1 "U.S. Dollar"
    19120 "USD" 1 "U.S. Dollar"
    19121 "USD" 1 "U.S. Dollar"
    19122 "USD" 1 "U.S. Dollar"
    19123 "USD" 1 "U.S. Dollar"
    19124 "USD" 1 "U.S. Dollar"
    19127 "USD" 1 "U.S. Dollar"
    19128 "USD" 1 "U.S. Dollar"
    19129 "USD" 1 "U.S. Dollar"
    19130 "USD" 1 "U.S. Dollar"
    19131 "USD" 1 "U.S. Dollar"
    end
    format %td ANNDATS
    ANNDATS - the variable is on a daily trading basis. Which means that there are not no-trading days showed in the dataset(e.g. Saturday, Sunday, bank holidays..)

    Another dataset one contains information on firms on a quarterly basis

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input long gvkey str10 datadate
    1004 "2013-08-31"
    1004 "2013-11-30"
    1004 "2014-02-28"
    1004 "2014-05-31"
    1004 "2014-08-31"
    1004 "2014-11-30"
    1004 "2015-02-28"
    1004 "2015-05-31"
    1004 "2015-08-31"
    1004 "2015-11-30"
    1004 "2016-02-29"
    1004 "2016-05-31"
    1004 "2016-08-31"
    1004 "2016-11-30"
    1004 "2017-02-28"
    1004 "2017-05-31"
    1004 "2017-08-31"
    1004 "2017-11-30"
    1004 "2018-02-28"
    1004 "2018-05-31"
    1004 "2018-08-31"
    1004 "2018-11-30"
    1004 "2019-02-28"
    1004 "2019-05-31"
    1004 "2019-08-31"
    1004 "2019-11-30"
    1004 "2020-02-29"
    1004 "2020-05-31"
    1004 "2020-08-31"
    1004 "2020-11-30"
    1004 "2021-02-28"
    1004 "2021-05-31"
    1004 "2021-08-31"
    1004 "2021-11-30"
    1004 "2022-02-28"
    1045 "2013-12-31"
    1045 "2014-03-31"
    1045 "2014-06-30"
    1045 "2014-09-30"
    1045 "2014-12-31"
    1045 "2015-03-31"
    1045 "2015-06-30"
    1045 "2015-09-30"
    1045 "2015-12-31"
    1045 "2016-03-31"
    1045 "2016-06-30"
    1045 "2016-09-30"
    1045 "2016-12-31"
    1045 "2017-03-31"
    1045 "2017-06-30"
    1045 "2017-09-30"
    1045 "2017-12-31"
    1045 "2018-03-31"
    1045 "2018-06-30"
    1045 "2018-09-30"
    1045 "2018-12-31"
    1045 "2019-03-31"
    1045 "2019-06-30"
    1045 "2019-09-30"
    1045 "2019-12-31"
    1045 "2020-03-31"
    1045 "2020-06-30"
    1045 "2020-09-30"
    1045 "2020-12-31"
    1045 "2021-03-31"
    1045 "2021-06-30"
    1045 "2021-09-30"
    1045 "2021-12-31"
    1045 "2022-03-31"
    1050 "2013-03-31"
    1050 "2013-06-30"
    1050 "2013-09-30"
    1050 "2013-12-31"
    1050 "2014-03-31"
    1050 "2014-06-30"
    1050 "2014-09-30"
    1050 "2014-12-31"
    1050 "2015-03-31"
    1050 "2015-06-30"
    1050 "2015-09-30"
    1050 "2015-12-31"
    1050 "2016-03-31"
    1050 "2016-06-30"
    1050 "2016-09-30"
    1050 "2016-12-31"
    1050 "2017-03-31"
    1050 "2017-06-30"
    1050 "2017-09-30"
    1050 "2017-12-31"
    1050 "2018-03-31"
    1050 "2018-06-30"
    1050 "2018-09-30"
    1050 "2018-12-31"
    1050 "2019-03-31"
    1050 "2019-06-30"
    1050 "2019-09-30"
    1050 "2019-12-31"
    1050 "2020-03-31"
    1050 "2020-06-30"
    1050 "2020-09-30"
    end
    I see that some dates from firm dataset (datadate) are not available in the exchange rate dataset. Because the latter only contains trading days.

    Therefore, how can I merge the two dataset (ideally the merge is "one to many" - because I have different firms in the firm dataset) by taking the first available date before “ANNDATS” in case there is not exact match between datadate(firm dataset) and ANNDATS (exchange rate dataset)?

    For example: 31 March 2013 is a Sunday. The firm dataset has record on that date. The exchange rate dataset not.
    In this case I need to take the first available date in the exchange rate which is 29 March 2013.


    Hope to have expressed my concern properly.

    Thanks for your help!
    Last edited by Marco Errico; 05 May 2022, 06:26.

  • #2
    Well, your example data is poorly chosen because the dates in the currency data are all from 2012, and the earliest date in the firm data is in 2013, and most are much more recent than that. If I literally do what you ask here, the code will assign the currency record as of 18 May 2012 to every firm:

    Code:
    gen _datadate = daily(datadate, "YMD")
    assert missing(_datadate) == missing(datadate)
    format _datadate %td
    drop datadate
    rename _datadate datadate
    
    isid gvkey datadate, sort
    
    rangejoin ANNDATS . datadate using `currency'
    by gvkey datadate (ANNDATS), sort: keep if _n == _N
    Now, presumably your real currency data set is more timely and won't suffer this problem. However, if your data sets are large you may run into memory problems with this code, since it works by matching each firm with every currency date up to the firm datadate and then choosing the one that is most current. That match may be unworkably large. In this case I think there is a relatively simple solution. If I can make the assumption that there are never more than 14 consecutive non-trading days, the following code will not be nearly as memory taxing and will accomplish the same thing:

    Code:
    gen _datadate = daily(datadate, "YMD")
    assert missing(_datadate) == missing(datadate)
    format _datadate %td
    drop datadate
    rename _datadate datadate
    
    isid gvkey datadate, sort
    gen lower = datadate-14
    
    rangejoin ANNDATS lower datadate using `currency'
    by gvkey datadate (ANNDATS), sort: keep if _n == _N
    Of course, in your example data this produces no matches at all because you have several years of "non-trading days." But presumably it would work well in the real data. By the way, if my assumption that there can never be more than 14 non-trading days in a row is too optimistic, you can replace 14 by some larger number. Just remember that the larger the number, the more memory will be required. So use the smallest number that reflects reality. (And if, in fact, the most consecutive non-trading days there can be is, say, 5, change 14 to 5).

    Added: -rangejoin- is written by Robert Picardand is available from SSC. To use it, you must also install -rangestat-, by Robert Picard, Nick Cox, and Roberto Ferrer, also available from SSC.

    Comment

    Working...
    X