Merging two databases. With different dates. Best method?

Sierk Bulens

Join Date: Jan 2019

Posts: 2
#1

Merging two databases. With different dates. Best method?

30 Jan 2019, 07:03

Dear members,

These days I am working on my (master) thesis and I am working with Stata in order to process the data for my thesis. I have not yet used it before so I am learning while doing. I learned a lot but now I am stuck. I am writing about convertible bonds. I have one dataset which contains (amongst other things) different dates of issuing. For example 29-01-2004 a bond issue of Company A. Another bond issue on the 20^th of July of Company A and different dates for Company B. Then, I have another dataset which contains monthly dates of the amount of shares sold short. This is monthly data so usually on the 15^th of every month. I am interested in the amount of shares short(var: nshort~t )during, before and after the period of the issue. Now I need to compare different information from database2(every month data) with database 1. I can’t merge these two files apparently. I tried merge pairwise but I will get a strange datafile. Is there someone who could help me? I have converted all the variables already into strings.

IssuerP is the identification number.
New_offering date is time, days from the year 1960 I believe.
Database 1 output(1/10)

| issuer~p new_of~d new_sha~g cusip_name indust~p in_ban~y sic_code |
|-----------------------------------------------------------------------------------------------|
1. | 000361 16099 32046000 AAR CORP 1 N 3720 |
2. | 000361 16272 37720000 AAR CORP 1 N 3720 |
3. | 000361 16827 36349000 AAR CORP 1 N 3720 |
4. | 000361 16922 39940227 AAR CORP 1 N 3720 |
5. | 000361 17567 37917936 AAR CORP 1 N 3720 |
|-----------------------------------------------------------------------------------------------|
6. | 000361 17567 40283131 AAR CORP 1 N 3720 |
7. | 000361 19071 40167070 AAR CORP 1 N 3720 |
8. | 000361 19071 40167070 AAR CORP 1 N 3720 |
9. | 000361 19403 39694849 AAR CORP 1 N 3720 |
10. | 000886 15855 801000000 ADC TELECOMMUNICATIONS INC 1 N 3661 |
+-----------------------------------------------------------------------------------------------+

Database 2
+-----------------------------------------------------------------+
| nshort~t nsplit~e tic cusip sic issuer~p ntime |
|-----------------------------------------------------------------|
1. | 518299 14640 AIR 000361105 5080 000361 14623 |
2. | 438542 14669 AIR 000361105 5080 000361 14655 |
3. | 438412 14700 AIR 000361105 5080 000361 14684 |
4. | 386472 14730 AIR 000361105 5080 000361 14714 |
5. | 363889 14761 AIR 000361105 5080 000361 14745 |
|-----------------------------------------------------------------|
6. | 452221 14791 AIR 000361105 5080 000361 14776 |
7. | 492179 14822 AIR 000361105 5080 000361 14805 |
8. | 506868 14853 AIR 000361105 5080 000361 14837 |
9. | 436580 14883 AIR 000361105 5080 000361 14868 |
10. | 660439 14914 AIR 000361105 5080 000361 14896 |

I have come to a datafile which I can probably work with but I am not sure how because the first date relates to the issue of the convertible and the datadate variable relates to the short interest date(database2). The short interests dates are given every month. Usually around middle of the month.

issuer~p offerin~e shares~g time shortint shorti~j datadate splitad~e time2 differ~e

1. 000361 14feb2013 39694849 19403 518299 518299 14jan2000 31jan2000 14623 -4780
2. 000361 14feb2013 39694849 19403 438542 438542 15feb2000 29feb2000 14655 -4748
3. 000361 14feb2013 39694849 19403 438412 438412 15mar2000 31mar2000 14684 -4719
4. 000361 14feb2013 39694849 19403 386472 386472 14apr2000 30apr2000 14714 -4689
5. 000361 14feb2013 39694849 19403 363889 363889 15may2000 31may2000 14745 -4658

6. 000361 14feb2013 39694849 19403 452221 452221 15jun2000 30jun2000 14776 -4627
7. 000361 14feb2013 39694849 19403 492179 492179 14jul2000 31jul2000 14805 -4598
8. 000361 14feb2013 39694849 19403 506868 506868 15aug2000 31aug2000 14837 -4566
9. 000361 14feb2013 39694849 19403 436580 436580 15sep2000 30sep2000 14868 -4535
10. 000361 14feb2013 39694849 19403 660439 660439 13oct2000 31oct2000 14896 -4507

11. 000361 14feb2013 39694849 19403 604438 604438 15nov2000 30nov2000 14929 -4474
12. 000361 19mar2012 40167070 19071 518299 518299 14jan2000 31jan2000 14623 -4448
13. 000361 19mar2012 40167070 19071 518299 518299 14jan2000 31jan2000 14623 -4448
14. 000361 14feb2013 39694849 19403 499438 499438 15dec2000 31dec2000 14959 -4444
15. 000361 19mar2012 40167070 19071 438542 438542 15feb2000 29feb2000 14655 -4416

So for example I have an issue(database1) of Company x on the first of januari 2003. Then i want to know the difference in the amount of shares sold short(database 2) between december 2002 and januari 2003 for that company. So i want to know the influence of the (announced) issue). I find it hard to explain my question but I am hoping for some feedback.

Kind regards,

Sierk

Last edited by Sierk Bulens; 30 Jan 2019, 07:55.
Tags: None

David Benson

Join Date: Oct 2018
Posts: 489

30 Jan 2019, 20:08

Hi Sierk, and welcome to Statalist!

So a couple of things:
1) It would be *super* helpful if you shared a sample of your data using Stata's dataex command, rather than list (it will give us the full variable names as well as the data type (string, numeric, date, etc). I created a YouTube video on using dataex here. (I made it too long, feel free to watch at 2x speed, and the 1st six minutes are probably all you need

)

2) So you have 2 different datasets: (1) the monthly data on short sellers (and probably other variables), and (2) data on convertible bond transactions. It's not clear to me from your research question which dataset is going to be your master dataset. But you'll want to create a variable in both datasets called Stata_month, and then you can merge based on firm_id (I assume CUSIP) and Stata_month.

Code:

* I created some toy data (this would be your monthly data on short sellers.)
* I called the variable "sales" instead, but the principle is the same
dataex firm date_string sales  // Data shared via by -dataex-. To install: ssc install dataex
clear
input byte firm str9 date_string byte sales
1 "8/1/2015"  58
1 "9/1/2015"  40
1 "10/1/2015" 67
1 "11/1/2015" 82
1 "12/1/2015" 70
1 "1/1/2016"  76
1 "2/1/2016"  60
1 "3/1/2016"  30
1 "4/1/2016"  47
2 "9/1/2015"  48
2 "10/1/2015" 56
2 "11/1/2015" 85
2 "12/1/2015" 29
2 "1/1/2016"  82
2 "2/1/2016"  61
2 "3/1/2016"  64
2 "4/1/2016"  43
end

Code:

* Create the date variables (assuming your date variable is a string)
* See "help datetime translation"
gen date = daily( date_string, "MDY", 2020)  // daily() and date() do the same thing, but daily() is preferred because it is more explicit about what you are doing
format date %td
sort firm date

. list, sepby(firm) abbrev(12) noobs

  +----------------------------------------+
  | firm   date_string        date   sales |
  |----------------------------------------|
  |    1      8/1/2015   01aug2015      58 |
  |    1      9/1/2015   01sep2015      40 |
  |    1     10/1/2015   01oct2015      67 |
  |    1     11/1/2015   01nov2015      82 |
  |    1     12/1/2015   01dec2015      70 |
  |    1      1/1/2016   01jan2016      76 |
  |    1      2/1/2016   01feb2016      60 |
  |    1      3/1/2016   01mar2016      30 |
  |    1      4/1/2016   01apr2016      47 |
  |----------------------------------------|
  |    2      9/1/2015   01sep2015      48 |
  |    2     10/1/2015   01oct2015      56 |
  |    2     11/1/2015   01nov2015      85 |
  |    2     12/1/2015   01dec2015      29 |
  |    2      1/1/2016   01jan2016      82 |
  |    2      2/1/2016   01feb2016      61 |
  |    2      3/1/2016   01mar2016      64 |
  |    2      4/1/2016   01apr2016      43 |
  +----------------------------------------+

* Creating lagged monthly sales (or short seller) using subscripts
* I use p for "prior month"
bysort firm (date): gen sales_p1 = sales[_n-1]
bysort firm (date): gen sales_p2 = sales[_n-2]

* Creating lagged monthly using xtset and Stata lead and lags terminology
gen Stata_month = mofd(date)  // note that date here needs to be a Stata daily date for this to work 
format Stata_month %tm
order Stata_month, after(date)
gen year = year( date)
gen s_month = mofd(date)
order year s_month, before(Stata_month)  // s_month & Stata_month are the same. Stata_month is just formatted as %tm
// Stata counts months after jan1960. So Aug2015 is 667 months after jan1960

xtset firm Stata_month
/*
       panel variable:  firm (unbalanced)
        time variable:  Stata_month, 2015m8 to 2016m4
                delta:  1 month
*/
gen lag_sales1 = l.sales   // This is lowercase letter "L"
gen lag_sales2 = l2.sales  // Can use f.sales for 1 month in future, f2.sales for 2months in future, etc
gen lag_sales3 = l3.sales

list firm date- lag_sales3 , sepby(firm) abbrev(14) noobs

  +----------------------------------------------------------------------------------------------------------------------+
  | firm        date   year   s_month   Stata_month   sales   sales_p1   sales_p2   lag_sales1   lag_sales2   lag_sales3 |
  |----------------------------------------------------------------------------------------------------------------------|
  |    1   01aug2015   2015       667        2015m8      58          .          .            .            .            . |
  |    1   01sep2015   2015       668        2015m9      40         58          .           58            .            . |
  |    1   01oct2015   2015       669       2015m10      67         40         58           40           58            . |
  |    1   01nov2015   2015       670       2015m11      82         67         40           67           40           58 |
  |    1   01dec2015   2015       671       2015m12      70         82         67           82           67           40 |
  |    1   01jan2016   2016       672        2016m1      76         70         82           70           82           67 |
  |    1   01feb2016   2016       673        2016m2      60         76         70           76           70           82 |
  |    1   01mar2016   2016       674        2016m3      30         60         76           60           76           70 |
  |    1   01apr2016   2016       675        2016m4      47         30         60           30           60           76 |
  |----------------------------------------------------------------------------------------------------------------------|
  |    2   01sep2015   2015       668        2015m9      48          .          .            .            .            . |
  |    2   01oct2015   2015       669       2015m10      56         48          .           48            .            . |
  |    2   01nov2015   2015       670       2015m11      85         56         48           56           48            . |
  |    2   01dec2015   2015       671       2015m12      29         85         56           85           56           48 |
  |    2   01jan2016   2016       672        2016m1      82         29         85           29           85           56 |
  |    2   01feb2016   2016       673        2016m2      61         82         29           82           29           85 |
  |    2   01mar2016   2016       674        2016m3      64         61         82           61           82           29 |
  |    2   01apr2016   2016       675        2016m4      43         64         61           64           61           82 |
  +----------------------------------------------------------------------------------------------------------------------+

* Having Stata_month in both datasets means you can merge on firm_id and Stata_month.
* Having the prior short seller info on the same line means you can include them in your regressions if the transaction dataset is your master dataset

Announcement

Merging two databases. With different dates. Best method?

Comment