Merging two datasets

Nienke vHoof

Join Date: Jun 2014

Posts: 18
#1

Merging two datasets

29 Jun 2014, 15:28

Hi all,

In order to investigate the possible relation between the appearance of news and share price volatility, I am in the possession of two datasets:
Dataset A: This dataset contains the daily stock returns, for one year, of 384 different (companies) stocks, in addition to the traded volume of each of the stocks on that specific day.
Dataset B: This data set contains the news articles that mention the companies that are chosen in dataset A. Variables are for example: Ticker of the company, journal in which article was posted, title of article etc.

I now want to examine whether or not a relation exists between stock price volatility of company X and the appearance of news covering that same company X.
In order to do this, I obviously need a way to merge these two datasets. I have been looking at this issue for days now, and can't seem to find an answer. Part of the problem is the 'date' issue I believe.

I was thinking about taking the % of total articles on a certain company that appeared in a certain week of the year, so that if all the news was distributed evenly, each week would have 1/52 of the total news, and then combining this with the share price volatility in that same week of the year. However, I again would have to first be able to merge the datasets...

Many thanks,
Tags: None
ben earnhart

Join Date: May 2014

Posts: 1027
#2

29 Jun 2014, 16:38

Is dataset B in long form (one line per article)? And is dataset A in long form (one obs per day) or wide form (one obs per company, with different variables for different days)? Knowing this would help to establish a good way to approach the problem.
Comment
Roberto Ferrer

Join Date: Apr 2014

Posts: 449
#3

29 Jun 2014, 20:39

It's difficult to assess the problem if you don't show the structure of your two databases. A minimal, representative example of both databases would help.

Is the company ticker not present in both data sets?

You should:

1. Read the FAQ carefully.

2. "Say exactly what you typed and exactly what Stata typed (or did) in response. N.B. exactly!"

3. Describe your dataset. Use list to list data when you are doing so. Use input to type in your own dataset fragment that others can experiment with.

4. Use the advanced editing options to appropriately format quotes, data, code and Stata output. The advanced options can be toggled on/off using the A button in the top right corner of the text editor.
Comment
Nienke vHoof

Join Date: Jun 2014

Posts: 18
#4

30 Jun 2014, 02:14

I'm sorry for being unclear.
Dataset A has the structure as shown in the photo, it has one observartion (the share price and volume) per company per day. So each company has roughly 250 observartions.

Dataset B has one article per line, as can be seen in the second photo. Per article, one line in stata is used, showing the date of publication, the ticker of the company, the journal of publication and some sentiment data.

I've tried using commands such as merge and joinby but either I get an error because of time series (merge) or the result of combining the datasets seems off... (joinby)

Thanks for the effort already

2 Photos
Comment
bsc.j.j.w

Join Date: May 2014

Posts: 50
#5

30 Jun 2014, 02:26

Try the following:

Code:

use "Dataset A.dta", clear joinby Ticker using "Dataset B.dta"

It should work and if I am correct this should get you what you want and if not, please check:

Code:

help merge

and tell us how exactly you want to combine your dataset. If this is too complicated, you could use the GUI to combine datasets, which is located Data > Combine Datasets.
Comment
Nienke vHoof

Join Date: Jun 2014

Posts: 18
#6

30 Jun 2014, 06:59

I've tried both commands before. Using merge constantly gives me errors, and by using joinby I end up with over 3 million observations... (whereas dataset A has just under 100,000 observations, and dataset B has around 11,000)

Basically, what I want to achieve is to see the variables from Dataset B added to Dataset A, so that the line of variables looks like: NamesDate - Ticker - PriceorBid - Volume - Returns - Newspaper - verb - swear

At first I thought it was logical to just use merge by using the common variable date in both (I've renamed NamesDate), but I constantly end up with errors... (Which I believe come from the fact that every company has the same 250 dates from which share price data is taken)

Thanks again,
Comment
Robert Picard

Join Date: Mar 2014

Posts: 1536
#7

30 Jun 2014, 09:26

You could simply rename the date variable in dataset B to match that of the date variable in dataset A and then use append to combine both datasets. You can then sort by Ticker and date and decide what to do from there.
Comment
Nienke vHoof

Join Date: Jun 2014

Posts: 18
#8

03 Jul 2014, 07:03

Thanks all for the help already!

I've made some changes to dataset B, so that on a given day for a certain company, only one line of sentiment data exists (using the collapse function), making a 1:1 merge possible. The title/author etc. is not really of much importance and therefore left out. For example: Whereas AAPL had 5 articles on 2/1/2013, I now took the average to get one line of sentiment data for 1/2/2013.

I now want to merge this 1:1 to Dataset A, which has variables:
Date (renamed from NamesDate) Ticker Return etc.
(After merging, I can simply drop the dates for which no articles exist.)

However, when using

[CODE] merge 1:1 Ticker Date using "`root'/dtafiles/datasetA.dta" [CODE]

The following error appears:
variables Ticker Date do not uniquely identify observations in the using data
r(459);

I do not get why Ticker and Date do not uniquely identify the observations in dataset A... For example, for AA on 1/2/2013, only one return and share price exists right?

Thanks again,
Comment
Rich Goldstein

Join Date: Mar 2014

Posts: 4458
#9

03 Jul 2014, 07:20

you seem to be asking us questions about your data - but only you have access to that data; why not check it out using, e.g., the duplicates command (see the help file) or using -egen- with the group function (if a value of your new variable appears more than once you then know you have a problem and what Ticker and Date combination are the problem)?
1 like
Comment

Announcement

Merging two datasets

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment