Problems with 1:1 merging two panel data sets

Ralph Klijn

Join Date: Apr 2018
Posts: 41

Problems with 1:1 merging two panel data sets

08 May 2018, 11:45

Dear users,

I am trying to 1:1 merge two panel data sets on the variables id and year.

When looking at the data I see several points which are uniquely identified by id and year. However, they do not merge. Below I have posted examples from my dataset. The dataset above is the using file and the data set below is the master file after merging. As shown, both are identified by the same values for id and year, however, Stata indicated master file only after merging.

Does someone know the issue? Thank you in advance.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float year long id float panelid long change float(lag_IO lag_IO_USD lag_IO_PROP lag_HHI)
2002 87 59506   -2516  .6018923  5709159424 .0011180569  .04789028
2003 87 59670  -71680  .6130163  4918107136  .001185605  .04927485
2004 87 58456  160155  .6398947  6301260800 .0011565884  .04430657
2005 87 59739    -474  .5909891  8636153856  .001347798 .026627507
2006 87 59956 -211954 .59058523  9510181888 .0012887628  .04242186
2007 87 59829 -233665  .5738135 12048774144 .0013304997  .04610071
end
label values id id2
label def id2 87 "ADM", modify

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float year long id float(esg panelid) long change float(lag_IO lag_IO_USD lag_IO_PROP lag_HHI) byte _merge
2010 2 78.91 2861     -739  .6656588 10451429376   .0014352862 .027291656 3
2011 2 87.57 4106     1285  .6152239  9676599296   .0012215164  .03233925 3
2012 2 86.87 3073      126  .5457439  5022808064   .0007129693 .035534024 3
2013 2 78.31 3853 -3206283  .5536094  5127286272    .000667337   .0428944 3
2004 3  31.8    .        .         .           .             .          . 1
2005 3 26.87 4611    14435  .1929813   152088576  .00004504978  .08928618 3
2006 3  58.5 4618    10395  .3637863   302313728 .000064802254  .04092287 3
2007 3 66.54 4812  -330112  .4739601   279020288  .00005940335  .12103135 3
2008 3 61.96 4741   -19284 .52113575   168175728 .000033705313 .071338974 3
end
label values id id2
label def id2 2 "ABS", modify
label def id2 3 "ADM", modify
label values _merge _merge
label def _merge 1 "master only (1)", modify
label def _merge 3 "matched (3)", modify

Tags: None

Clyde Schechter

Join Date: Apr 2014

Posts: 30065
#2

08 May 2018, 14:49

The data examples you have shown are not helpful here. The examples from the using set have no corresponding match at all in the examples from the merged file.

It seems that Stata is saying that your using data set does not contain any observation for ID = 3 and year = 2004. Have you in fact checked the using data set? I am confident you will find there is no such observation there. I have never known Stata to be wrong about this kind of thing. So then the question becomes: is your expectation that everything should match ill-founded, or is there a problem with your using data set that needs to be fixed?
Comment
Ralph Klijn

Join Date: Apr 2018

Posts: 41
#3

09 May 2018, 00:05

Hi Clyde,

Thank you for the response. There is, however, a match in id and year.

Somehow, this happens when I use -dataex-, but where id shows id=87 in upper sample and id=3 in the lower sample, the actual id values are both "ADM" (which is the value I see when I look at the data in Stata). I don't know why it is shown this way when I use -dataex-. The only place where is shown in the sample is by the following:

In the upper sample:

Code:

label def id2 87 "ADM", modify

In the lower sample:

Code:

label def id2 3 "ADM", modify

My apologies if I should have mentioned this. I hope this helps.

I am looking forward to your reaction.

Last edited by Ralph Klijn; 09 May 2018, 00:08.
Comment
Jorrit Gosens

Join Date: Jan 2015

Posts: 1019
#4

09 May 2018, 05:13

The "ADM" is the value label, which helps you browse/read your data more easily. It's not the 'actual value', at least not in the way Stata feels about it.
Stata looks at the id values, not their labels, for identifying matching IDs.
To work towards succesful merging, you could create new variables that contain the labels of 'id2' as 'actaul values', by doing:

Code:

decode id, gen(idstring)

in both datasets, and then merge on the newly created idstring variable.

In addition, I would say its not a good idea to have different coding for the same variable in different datasets, in particular when they are combined afterwards. Why would ADM be id 3 in one, and 87 in another set? This is just confusing.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30065
#5

09 May 2018, 05:20

Somehow, this happens when I use -dataex-, but where id shows id=87 in upper sample and id=3 in the lower sample, the actual id values are both "ADM" (which is the value I see when I look at the data in Stata). I don't know why it is shown this way when I use -dataex-. The only place where is shown in the sample is by the following:

This happens because your variable id is a numeric variable with value labels attached. -dataex- shows the real data (which is why we insist on it here) whereas when you -list- or -browse- your data Stata applies the labels instead of showing you the real data. I take it you are not familiar with labeled variables. Read the -label- chapter in the [D] volume of the PDF manuals that are installed with your Stata for more information.

The problem now is very clear, and I'm sorry I didn't notice it before. These data sets have different value labels attached to id and that is causing massive confusion here. In fact, if you were to dig into it, I think you will find that many of the apparent matches are in fact mismatched. When you are merging to data sets on a merge key variable (id) that has a value label, it is the numeric values that have to match properly, not the labels. So Stata isn't even trying to match ADM in one data set with ADM in the other. It's trying to match 3 with 3 and 87 with 87. But 3 means different things in the two data sets and 87 means different things in the two data sets.

Ultimately, you need to fix this by getting a single value labeling in place. The simplest way to do this is as follows. In both the master and using data sets, run:

Code:

decode id, gen(string_id)

and save them.

This will create a new ID variable that is actually a string, so that where id = 3 in the using data set string_id will equal "ADM", and where id = 87 in the master data, you will have string_id equal to "ADM". Now merge the data sets using string_id and year instead of id and year. The matches will now be correct. And if you find some observations that go unmatched, at least that will really represent a gap in the data, not an artifact of mislabeling.

Once you have merged the two data sets successfully, I would

Code:

drop id encode string_id, gen(id)

Now you will have a new variable, id, numeric with a coherent value label in this data set. (It will not agree with the value labels in either of the original data sets, so you should not try to -merge- it again with either of those, nor with any other data sets in your project that might have been previously labeled. I would keep string_id around in case you need to do more merges with other datasets going forward. string_id will merge correctly across data sets, id will not.

Ultimately the error here lies with the creation of these data sets. When you create a series of data sets that you plan to ultimately put together, it is important to either not use value labels at all, or to do so with extreme care to specify the value labels consistently across data sets. Just using -encode- with no direct specification of the value label itself to create value labels separately in each data set leads to the kind of confusion you are facing here.

Again, sorry I didn't spot this in your original post.
Comment
Ralph Klijn

Join Date: Apr 2018

Posts: 41
#6

09 May 2018, 07:57

Thank you very much. Knowing this has saved me a lot of problems.
Comment

Announcement

Problems with 1:1 merging two panel data sets

Comment

Comment

Comment

Comment

Comment