
  • How to merge with variable which is not uniquely identified

    My question may be a general one about the -merge- command, but I was wondering if anyone could help with this issue.

    My command is:

    Code:
    merge m:1 kiscode using yyy.dta, keepusing(mkt_dummy type_dummy)
    and I got the following message:
    variable kiscode does not uniquely identify obs in the using data
    So I checked the using data and found that several kiscodes appear more than once, which I think is the source of the error message.
    In this case, is there any common way of resolving the issue?

    Since the variables from the using data (mkt_dummy, type_dummy) are simply dummy variables, my idea is to make all the duplicated kiscodes in the using data appear only once and then merge, but this does not sound right to me. I suspect there should be a better solution than this.
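
    For reference, one way to see which codes repeat, together with their dummy values, is something like this (just a sketch, assuming yyy.dta is the using file):

    Code:
    * load the using file and see how often each kiscode repeats
    use yyy.dta, clear
    duplicates report kiscode
    * flag the repeated codes and list them alongside the dummies
    duplicates tag kiscode, gen(dup_kiscode)
    sort kiscode
    list kiscode mkt_dummy type_dummy if dup_kiscode > 0, sepby(kiscode)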

    The following is how the using data looks:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input long kiscode byte(mkt_dummy type_dummy)
    814059 0 1
    514039 0 1
    501417 0 1
    136088 0 1
    370037 1 1
    359203 0 1
    370037 1 0
    360120 0 0
    370037 1 0
    210226 0 0
    370037 1 1
    612049 0 1
    840033 1 1
    850390 1 1
     99120 0 0
    320013 0 0
    340391 0 1
    821705 0 1
    330051 0 0
    305736 0 0
    324183 0 1
    327930 0 1
    351164 1 0
    220035 1 0
    351164 1 1
    220035 1 1
    293658 1 0
    350516 1 0
    148440 0 1
    251569 0 1
    361321 0 1
    353353 0 1
    810053 0 0
    630306 0 0
    620084 0 0
    630306 0 0
    620084 1 0
    630306 0 0
    620084 1 0
    630306 0 0
    870048 1 0
    880990 0 0
    381110 0 1
    371475 0 1
    381110 0 1
    620190 0 1
    500720 1 0
     44292 0 0
    451762 0 1
    158488 0 1
    751260 0 1
    end
    ------------------ copy up to and including the previous line ------------------

    Thank you.
    Last edited by Chul-Kyoo Jung; 08 May 2021, 17:02. Reason: merge not uniquely identified variables

  • #2
    No, you should not just delete extra observations having the same value of kiscode, because some of them have different values of mkt_dummy and type_dummy, so you would be losing information. Only delete them if, on review, you find they are data errors.

    So I can see two paths forward, depending on what the master data (the data that was in memory before you issued the -merge- command) are like.

    If kiscode uniquely identifies observations in the master data, then you just have the code slightly backwards: it should be:
    Code:
    merge 1:m kiscode using yyy.dta
    If, however, kiscode does not uniquely identify observations in the master data, then these two data sets are incompatible for merging. Let's take a concrete example.

    kiscode 370037 appears 4 times in your using data, and even taking into account the values of mkt_dummy and type_dummy, not all of those observations are the same. Now, you don't show any example of the master data. Let me suppose that kiscode 370037 appears more than once there as well; let's say it appears twice. Suppose you had the two data sets printed out side by side on a sheet of paper and you had to draw lines connecting each observation in the master data with the corresponding observation in the using data that you want it linked with. How would you decide which of the 4 370037 observations in using goes with which of the 2 370037 observations in master? If there is information in other variables that appear in both data sets that would enable you to make that choice, then those other variables would need to be built into the -merge- command accordingly. If there is no such information, then it is not possible to -merge- the data sets.
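
    For example (a sketch only; year here is a hypothetical variable, and whatever variable you use must actually distinguish the repeated kiscodes in both files):
    Code:
    * hypothetical: if a variable such as year separates the repeated kiscodes,
    * build it into the merge key; whether the merge is 1:1, 1:m, or m:1
    * depends on which file the extended key uniquely identifies
    merge 1:1 kiscode year using yyy.dta, keepusing(mkt_dummy type_dummy)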

    There is, however, a different way to link them, which may (or may not) be suitable for your purpose. You might connect each of the two master 370037 observations with all four of the using 370037 observations, resulting in a total of 8 observations in the linked data set. If that's what you want, you do not use -merge-. Instead you use:
    Code:
    joinby kiscode using yyy.dta
    If this is what you need, read -help joinby- for more information about the command.



    • #3
      Thanks Clyde, and sorry that I wasn't clear enough. I should've attached both data sets.
      So let me try again to clarify the purpose with the two data sets.
      The master data looks like:

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input long kiscode int year double(sales liability equity earnings)
      108 1980            .           .           .           .
      108 1981            .           .           .           .
      108 1982            .           .           .           .
      108 1983            .           .           .           .
      108 1984            .           .           .           .
      108 1985            .           .           .           .
      108 1986            .           .           .           .
      108 1987            .           .           .           .
      108 1988   9730461000  2586199000   966707000    25052000
      108 1989  10069818000  4012581000   584720000   -16471000
      108 1990  12192565000  5165254000   410577000  -154150000
      108 1991  13279253000  6729581000    88444000  -253764000
      108 1992  13352551000  5701120000   509566000   181797000
      108 1993  11025693000  6385232000   299109000  -184938000
      108 1994  14806314000  8526309000   591566000    90533000
      108 1995  14077964000 10379895000   186777000  -421140000
      108 1996  13895713000 10530379000  2603862000  -236126000
      108 1997  15173625000  9162140000  3017197000   438687000
      108 1998  18112878000  9156668000  3537940000   520743000
      108 1999  25454702000  8137447000  4322462000  1029522000
      108 2000  25967273000  6911196000  5022832000   700370000
      108 2001  25604441000 10281398000  5760426000  1006638000
      108 2002  30339629000  8810634000  7059694000  1545632000
      108 2003  35842681000 12779524000  8160275000  1132227000
      108 2004  24562965000 11871589000  7100664000 -3467644000
      108 2005  20802863000 10575441000  8005390000 -1786703000
      140 1980            .           .           .           .
      140 1981            .           .           .           .
      140 1982            .           .           .           .
      140 1983            .           .           .           .
      140 1984            .           .           .           .
      140 1985            .           .           .           .
      140 1986   4085263000  2511592000   361520000   108645000
      140 1987   4486320000  2423020000   491431000   132331000
      140 1988   5831976000  2308130000   669012000   181532000
      140 1989   6425697000  5576639000   890058000   101045000
      140 1990  10829780000  6870017000  1037127000   147069000
      140 1991  14312460000  8913098000  2199024000   171897000
      140 1992  16795171000 11523123000  2991444000   -17579000
      140 1993  19690683000 12708261000  3316380000  -365065000
      140 1994  18247793000 13505302000  2381453000 -1457927000
      140 1995  18654794000 13378259000  2484629000   103175000
      140 1996  19649123000 15702746000   959578000 -1525050000
      140 1997  20816083000 15809838000   125656000  -997190000
      140 1998  22831458000 12852111000  6602560000   668265000
      140 1999  23102324000 10961899000  6697967000   116612000
      140 2000  23246074000 12787114000  6789438000    91471000
      140 2001  31569301000 14880064000  6979065000   189627000
      140 2002  35506232000 16704344000  7317515000   338450000
      140 2003  38762897000 18233781000  7555408000   237893000
      140 2004  38408846000 17115833000  6933621000  -621787000
      140 2005  33277936000 17631646000  6558978000  -374643000
      442 1980            .           .           .           .
      442 1981            .           .           .           .
      442 1982            .           .           .           .
      442 1983            .           .           .           .
      442 1984            .           .           .           .
      442 1985            .           .           .           .
      442 1986            .           .           .           .
      442 1987            .           .           .           .
      442 1988            .           .           .           .
      442 1989   5635311000  4554891000   599033000   100990000
      442 1990   5973640000  6357334000   671748000    72715000
      442 1991   7.1959e+09  8629613000  1435067000   263319000
      442 1992  11221108000 10235087000  1891531000   456464000
      442 1993  14539483000 10786417000  2824791000   472261000
      442 1994  17146850000 12511221000  3498167000   673375000
      442 1995  25031517000 15970744000  5350880000   352713000
      442 1996  35287548000 26786602000  6900954000  1052759000
      442 1997  36400471000 31388983000  9833024000   932069000
      442 1998  38533216000 3.05183e+10 16953140000  1102495000
      442 1999  42780112000 36474080000 17396396000  2023058000
      442 2000  56246491000 41163410000 22876442000  1290926000
      442 2001  82896479000 48099197000 21018066000 -1548798000
      442 2002 118200066000 53928456000 20446431000  1256737000
      442 2003 126625824000 54997551000 21947723000  4054215000
      442 2004 108997354000 82411204000 25473028000  3127922000
      442 2005 125135024000 88857327000 34018746000  3016280000
      485 1980            .           .           .           .
      485 1981            .           .           .           .
      485 1982            .           .           .           .
      485 1983            .           .           .           .
      485 1984            .           .           .           .
      485 1985            .           .           .           .
      485 1986   2380506000  1742072000   498416000    51442000
      485 1987            .           .           .           .
      485 1988            .           .           .           .
      485 1989            .           .           .           .
      485 1990            .           .           .           .
      485 1991            .           .           .           .
      485 1992            .           .           .           .
      485 1993            .           .           .           .
      485 1994   4194081000  2574689000   918453000    89682000
      485 1995   4557786000  3220120000   987258000    68805000
      485 1996   5876373000  3242154000  1115076000   139003000
      485 1997   5893297000  2772360000  1131522000    16446000
      485 1998   5339126000  2971313000   981754000  -149768000
      485 1999   5192212000  3705664000   887058000  -104224000
      485 2000   5424433000  5069321000   937056000    49999000
      485 2001   6195599000  4626465000   949548000    48299000
      end

      And a more complete format of the using data is:
      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input long kiscode int year byte(under30_dummy chaebol30_dummy chaebol05_dummy mkt_dummy type_dummy)
      814059 1996 1 0 0 0 1
      514039 1996 1 0 0 0 1
      501417 1986 0 1 0 0 1
      136088 1986 0 0 0 0 1
      370037 1996 1 0 0 1 1
      359203 1996 0 0 0 0 1
      370037 1991 1 0 0 1 0
      360120 1991 1 0 0 0 0
      370037 1995 1 0 0 1 0
      210226 1995 1 0 0 0 0
      370037 1999 0 1 0 1 1
      612049 1999 0 1 0 0 1
      840033 1999 0 1 1 1 1
      850390 1999 0 1 1 1 1
       99120 1979 1 . . 0 0
      320013 1979 1 . . 0 0
      340391 1995 0 1 0 0 1
      821705 1995 0 1 0 0 1
      330051 1987 0 0 0 0 0
      305736 1987 0 0 0 0 0
      324183 1992 1 0 0 0 1
      327930 1992 0 0 0 0 1
      351164 1994 1 0 0 1 0
      220035 1994 0 0 0 1 0
      351164 1998 0 1 0 1 1
      220035 1998 0 0 0 1 1
      293658 1996 1 0 0 1 0
      350516 1996 0 0 0 1 0
      148440 1995 0 0 0 0 1
      251569 1995 0 0 0 0 1
      361321 1995 1 0 0 0 1
      353353 1995 1 0 0 0 1
      810053 1983 0 1 1 0 0
      630306 1983 0 1 1 0 0
      620084 1982 0 1 1 0 0
      630306 1982 0 1 1 0 0
      620084 1992 0 1 1 1 0
      630306 1992 0 1 1 0 0
      620084 1995 1 0 0 1 0
      630306 1995 0 0 0 0 0
      870048 1997 0 0 0 1 0
      880990 1997 0 0 0 0 0
      381110 1996 0 1 0 0 1
      371475 1996 0 0 0 0 1
      381110 1996 0 0 0 0 1
      620190 1996 0 1 0 0 1
      500720 1995 0 1 1 1 0
       44292 1995 0 1 1 0 0
      451762 1991 0 0 0 0 1
      158488 1991 0 0 0 0 1
      751260 1997 0 0 0 0 1
           . 1997 0 0 0 0 1
      110213 1996 0 0 0 0 1
      312886 1996 0 0 0 0 1
      370258 1994 0 1 0 1 0
      389145 1994 0 1 0 0 0
      350893 1995 0 1 0 1 0
      389145 1995 0 1 0 0 0
      451916 1993 0 0 0 0 1
      397210 1993 0 0 0 0 1
      381420 1988 0 0 0 0 1
      384461 1988 0 0 0 0 1
      389765 1998 0 0 0 0 1
      167932 1998 0 0 0 0 1
      380636 1991 0 1 1 1 0
      352845 1991 0 0 0 0 0
      810177 1998 1 0 0 0 1
       41272 1998 1 0 0 0 1
      637190 1999 1 0 0 0 1
       58357 1999 1 0 0 0 1
      155306 1989 0 0 0 0 1
           . 1989 0 0 0 . 1
      910147 1989 0 0 0 0 1
      501964 1989 0 0 0 0 1
      910147 1989 0 0 0 0 1
      390470 1989 0 0 0 0 1
      910147 1989 0 0 0 0 1
      990272 1989 0 1 0 0 1
      910147 1998 0 1 0 0 0
      710229 1998 0 1 0 0 0
      910147 1998 0 1 0 0 1
      710229 1998 0 1 0 0 1
      360244 1985 1 0 0 1 0
      381055 1985 0 0 0 0 0
      360244 1997 1 0 0 1 1
      381055 1997 1 0 0 0 1
      500844 1990 1 0 0 1 1
           . 1990 1 0 0 0 1
           . 1995 1 0 0 0 1
       24466 1995 1 0 0 0 1
      310328 1976 0 . . 0 0
      310794 1976 0 . . 0 0
      500607 1999 0 1 0 1 1
      382485 1999 0 1 0 0 1
       99562 1991 0 0 0 0 1
      245490 1991 0 0 0 0 1
      860344 1995 1 0 0 0 1
      820202 1995 1 0 0 0 1
      370380 1995 1 0 0 1 0
       58231 1995 0 0 0 0 0
      end
      And my goal is to combine the following variables (under30_dummy chaebol30_dummy chaebol05_dummy mkt_dummy type_dummy) into the master data based on the corresponding kiscode and year.
      As you point out, the two data sets might not be compatible. I think the tricky part is that I expected all these dummy variables to be time-invariant, and yet some kiscode + year combinations are uniquely identified while others are not.
      I tried
      Code:
      merge 1:m kiscode year using yyy.dta
      but got the same error message about unique identification, and I also tried
      Code:
      merge m:1 kiscode year using yyy.dta
      but it failed with the same issue. I am still wondering if there's a way to bring those dummies into my master data.

      Thanks.

      P.S. Both data sets are pretty big, so overlapping kiscodes occur in the omitted parts of the data.
      Last edited by Chul-Kyoo Jung; 09 May 2021, 09:51.



      • #4
        First, your data in the using data set are not as you describe. It is simply not true that the dummy variables are time-invariant. All of them exhibit variation in different years within the same kiscode. They also all vary across kiscodes within the same year. The using data set also contains some observations that are complete duplicates, that is, they are the same on all variables. Usually that indicates that something went wrong with the data management that created that data set--normally you don't introduce exact duplicate observations. There are also observations that have the same value of kiscode and year but have different values for the dummy variables--so the data are internally inconsistent. This using data set does not seem to me to be usable for any purpose at all: it contains both redundant observations and contradictory information. I think you need to review how this data set came into existence and get it fixed.
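
        If it helps, here is a rough sketch of how the two problems can be told apart (it assumes the using data set is in memory; the variable names it creates are just placeholders):
        Code:
        * observations that are exact copies on every variable
        duplicates report
        * observations that repeat their kiscode-year combination but are not
        * exact copies of any other observation: candidates for contradictory information
        duplicates tag kiscode year, gen(dup_ky)
        duplicates tag, gen(dup_all)
        sort kiscode year
        list if dup_ky > 0 & dup_all == 0, sepby(kiscode year)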

        Nevertheless, it is possible to merge these two data sets because in the master data, kiscode and year uniquely identify the observations (at least this is true in the example you show.) So
        Code:
        use master_data
        merge 1:m kiscode year using using_data
        will run without error messages.

        That said, as already noted, I think that there are serious anomalies in your using data set, so I think that after the -merge- you will be left with a mess of both redundant and contradictory information that will not be usable. So, you can merge these data sets, but I really don't think you should. I think the using data set is wrong and needs to be fixed before you use it for anything.



        • #5
          Thanks Clyde,

          True, they are time-variant (my bad), and I think the data consistency issue is more complicated than I expected.
          If I may add two things about what's going on in the using_data:

          1) The variation within the same kiscode and year comes from changes that occurred within that year. For example, a certain company merged with another company more than once in the same year, generating different dummy values within the same year. I think this is a really tough issue for me to resolve. Maybe I could use the most recent dummy value within the year, but I don't know if there is a handy way to do that.
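
          To make that idea concrete, something like the following could keep only the latest record within each kiscode and year, but it assumes there is a variable (here hypothetically called event_date) that orders the within-year changes; no such variable appears in my example:
          Code:
          * hypothetical: keep the most recent record within each kiscode-year,
          * assuming event_date records when each change took effect
          bysort kiscode year (event_date): keep if _n == _N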

          2) The duplicates report shows the following (in terms of all variables in the using data):

          Code:
          --------------------------------------
             copies | observations       surplus
          ----------+---------------------------
                  1 |         5284             0
                  2 |          196            98
                  3 |           36            24
                  4 |           16            12
                  5 |            5             4
                  8 |            8             7
          --------------------------------------
          And maybe I can drop all the duplicates without losing much data (if this is the only way).

          Here is the result after I just merged following your suggestion:

          Code:
              Result                           # of obs.
              -----------------------------------------
              not matched                       214,606
                  from master                   210,865  (_merge==1)
                  from using                      3,741  (_merge==2)
              matched                             1,804  (_merge==3)
          This does not look pretty to me. Do you have any tips for these issues?
          Thanks again,



          • #6
            Well, yes, it seems you have a very difficult dilemma to solve concerning companies that, by virtue of mergers, etc., changed during a year. How best to handle that is really a substantive question in your discipline, not a statistics or Stata question, and not one I can advise you on, as I have no expertise there. Suffice it to say that you really need to solve this before you proceed.

            Yes, you can drop duplicate observations with no loss of information. And if the duplicates resulted from firms undergoing mergers, etc., but the variables did not change when that happened, then I would say that is a very reasonable way to deal with those. I would just strongly caution you to make sure that that's the only reason you have duplicate observations. Usually when a data set has completely duplicate observations it is due to an error in the data management that created it, and where there is one mistake recognized there are often others lurking unrecognized. So whenever I encounter duplicate observations in a data set, I investigate the data management to find out how they happened and check carefully for other errors.
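
            As a minimal sketch of that step (assuming the using data set is in memory), -duplicates examples- can support the kind of investigation I describe, and -duplicates drop- then removes the exact copies:
            Code:
            * inspect one example from each group of duplicated observations
            duplicates examples
            * remove exact copies, keeping one observation from each group
            duplicates drop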

            The match results are, I agree, not pretty. The vast majority of the observations in your master data set have no corresponding observation in the using data set. There is no workaround for this: the data are what they are. If this result surprises you, as I guess it does since you're raising this question, then the solution is to be found, again, in how the data sets were created: something is wrong with the data, and it could be in either data set. Or perhaps there are two different systems of kiscode being used in the data sets? Certainly in the examples you show, the kiscodes are only 3-digit numbers in the master data but 5- or 6-digit numbers in the using data. I don't know what a kiscode is, but I have observed on Statalist that certain types of codes used in business and finance, such as CUSIP codes or SIC codes, come in long and short versions--perhaps you have different versions in the two data sets and need to harmonize them first. Or perhaps the problem is with the time periods of the data sets. In the examples, the using data is all from before 2000, but the data in the master is, in part, more recent. Whatever is going on, these two data sets do not seem able to work together in their present form.
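
            If you want to check those two possibilities, a quick diagnostic along the following lines might help (the file names are placeholders):
            Code:
            * compare code lengths and year coverage in the two files
            use master_data, clear
            gen code_len = strlen(string(kiscode))
            tab code_len
            summarize year
            use using_data, clear
            gen code_len = strlen(string(kiscode))
            tab code_len
            summarize year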



            • #7
              Thanks for your comments. They sound very fair to me.



              • #8
                Originally posted by Clyde Schechter

                Nevertheless, it is possible to merge these two data sets because in the master data, kiscode and year uniquely identify the observations (at least this is true in the example you show.) So
                Code:
                use master_data
                merge 1:m kiscode year using using_data
                will run without error messages.
                Looking through old threads to find an answer to my question, this one seems closest. Please forgive me if I'm missing a thread that is more direct.

                I would like to merge two data sets using an agency ID number ("ID") for each one of hundreds of agencies, but there is more than one row of data for each ID in the using_data.

                There is more than one row for each ID in the using_data because the using_data has a row of data for each ID for each year over the past 20+ years.

                I only need the data from the "using" file for the year 2022.

                I can think of two clunky ways of using the m:1 merge to bring only the 2022 using_data into the master_data, but is there an efficient, maybe the term is "elegant," way to achieve that end?

                The most elegant thing I can come up with is to gen a new variable in the master_data called "year" with the value of 2022, and then merge with something like this:
                Code:
                use master_data
                gen year = 2022  // add the constant year so the keys line up
                merge m:1 ID year using using_data
                It feels clunky because I've had to add a variable. But maybe that's as elegant as it gets?

                Thank you!
                Last edited by James Voss; 22 Aug 2023, 10:20.



                • #9
                  I would first purge the using file of non-year 2022 data and then do the -merge-.

                  Code:
                  use if year == 2022 using using_dataset, clear
                  merge 1:1 ID using the_other_data_set
                  Elegance is in the eye of the beholder. As for efficiency, I think this is the most efficient way you can do it.



                  • #10
                    Thanks Clyde Schechter. That is definitely more efficient.

