Merging Data sets

Yapa Bandara

Join Date: Aug 2015
Posts: 34

20 Aug 2015, 01:26

I have a dataset which contains monthly share transactions data for about 470 companies. Each company trades about 50 - 60 shares a month and a sample of dataset looks likes this.

Company Name	Share Name	Date	Qty
ABC	delta	01/01/2003	500
ABC	delta	01/02/2003	200
ABC	delta	01/03/2003	100
ABC	gamma	01/01/2003	20
ABC	gamma	01/02/2003	15
AAA	gamma	01/01/2004	1000
AAA	gamma	01/05/2004	120
AAA	zigma	01/01/2006	500
AAA	zigma	01/02/2006	65

I want to merge this data set with another dataset which contains codes for each stock and it looks like this

Share Name	Code
delta	dt
gamma	gm
zigma	zg

It doesn't allow me to merge using m:1 as it gave me an error message once. When I tried to merge them again, now it constantly says the code file can't be found even if it's in the same folder. Please help.
Thank you

Last edited by Yapa Bandara; 20 Aug 2015, 01:34.

Tags: None

Yapa Bandara

Join Date: Aug 2015

Posts: 34
#2

20 Aug 2015, 06:48

The message I get is,
variable share_name does not uniquely identify observations in the using data
r(459);
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#3

20 Aug 2015, 07:20

It would really help if we saw the command you issued, and understood which is the main data (in memory) and which is the using data.

With that said, if your first dataset is the one in memory, and your second dataset the using dataset, then you need merge m:1 share_name using ... . If its the other way around, then you need merge 1:m share_name using ... . The m:1 or 1:m needs to be in the same order as the datasets with multiple or 1 observation per share_name.

The error message you quote in your second message tells you that your using dataset has multiple observations with the same value of share_name. That is wrong if you are doing an m:1 merge. Either (a) your second dataset has at least one share_name that appears more than once or (b) your m:1 or 1:m should be reversed on your merge commend.

I cannot at all comment on the problem about your files not being found. We can better help you if we know what commands you have tried and what Stata told you to indicate that there was a problem. Please review the Statalist FAQ linked to from the top of the page, especially sections 9-12 on how to best pose your question. It's particularly helpful to copy commands and output from your Stata log window and paste them into your Statalist post using CODE delimiters, as described in section 12 of the FAQ.
Comment

Yapa Bandara

Join Date: Aug 2015
Posts: 34

20 Aug 2015, 17:34

I'm sorry for being less informative. The one in the memory is the first table named "holdings" and the second table is the using data set and it is named "share_code". The common field as you can see is the "share name". However, in the main file as I explained before, one company (shown by company name) trades 50 - 60 shares and these shares has been sorted first by share name and then by dates. So share_name contains multiple observations.

However, the using data set has only one observation for one share name with the relevant code. I need to assign codes for the main file and it should look like this after merging.

Company Name	share_name	Date	Qty	Code
ABC	delta	01/01/2003	500	dt
ABC	delta	01/02/2003	200	dt
ABC	delta	01/03/2003	100	dt
ABC	gamma	01/02/2003	20	gm
ABC	gamma	01/01/2004	15	gm
AAA	gamma	01/01/2004	1000	gm
AAA	gamma	01/05/2004	120	gm
AAA	zigma	01/01/2006	500	zg
AAA	zigma	01/02/2006	65	zg

I did
merge m:1 share_name using share_code

variable share_name does not uniquely identify observations in the using data
r(459);

Thank You

Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

20 Aug 2015, 18:08

The error message is telling you that despite your assertion to the contrary, there is at least one share_name that appears more than once in the share_code dataset.

See help duplicates for instructions on using the duplicates command to identify the duplicated observations in your share_code dataset. Here is an example of its use on a modified version of your sample data.

Code:

. list, clean abbreviate(12)

       share_name   code  
  1.        delta     dt  
  2.        gamma     gm  
  3.        zigma     zg  
  4.        gamma     gg  
  5.        delta     dt  

. duplicates report share_name

Duplicates in terms of share_name

--------------------------------------
   copies | observations       surplus
----------+---------------------------
        1 |            1             0
        2 |            4             2
--------------------------------------

. duplicates list share_name, abbreviate(12)

Duplicates in terms of share_name

  +----------------------------+
  | group:   obs:   share_name |
  |----------------------------|
  |      1      1        delta |
  |      1      5        delta |
  |      2      2        gamma |
  |      2      4        gamma |
  +----------------------------+

.

Last edited by William Lisowski; 20 Aug 2015, 18:10.

Comment

Yapa Bandara

Join Date: Aug 2015

Posts: 34
#6

21 Aug 2015, 00:55

I figured it out and merge was done. Thank you William. By the way, can you please let me know what is meant by abbreviate(12) in your command? I ran duplicate list without using it.
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#7

21 Aug 2015, 09:09

The abbreviate option on the duplicates command has the same function as it does on the list command: it sets the length at which variable names are shown as abbreviated in the log output. Compare the output below to my previous sample.

You should read the output of the help duplicates command for details, and to learn more about the duplicates command.

Code:

. duplicates list share_name Duplicates in terms of share_name +--------------------------+ | group: obs: share_~e | |--------------------------| | 1 1 delta | | 1 5 delta | | 2 2 gamma | | 2 4 gamma | +--------------------------+ .
Comment

Announcement