Merging Data Sets with matching Language pairs

Marvin Haas

Join Date: Jun 2022

Posts: 11
#1

Merging Data Sets with matching Language pairs

24 Jun 2022, 10:54

Dear stata-community,
in the context of a research project using the gravity model of trade I am having difficulties to merge two datassets.

this is an example of a line from each dataset:

Dataset 1 (https://vi.unctad.org/tpa/web/vol2/vol2home.html) Chapter1Application1
tradeflows among countries 1986-2006
(ARG DEU and DEU ARG are having the same pair_id btw)

Dataset 2 (https://www.usitc.gov/data/dicl.htm)
index measuring linguistic similarity

how can i merge both cells into one under the condition that they are the same pair of countries
ARG-DEU and DEU-ARG should have the same linguistic similarity values added in the cells ?
Dataset 2 also offers many country pairs that do not occur in dataset 1 and should therefore also be deleted when merging.

i hope you can help me merge the data since I did not get any further using the stata mask above either.

regards Marvin
Attached Files
Tags: None

Fei Wang

Join Date: Oct 2021
Posts: 726

24 Jun 2022, 11:24

Marvin, it would be helpful to display your data using command -dataex- rather than posting pictures.

Below I made up another line in Data 1 where DEU is the exporter while ARG is the importer.

Code:

* Data 1
    clear
    input str3(exporter importer) float(pair_id year trade)
    "ARG" "DEU" 30 1986 302.6463
    "DEU" "ARG" 30 1986 123.4567
    end

    save data1, replace
    list, sep(0)

     +-------------------------------------------------+
     | exporter   importer   pair_id   year      trade |
     |-------------------------------------------------|
  1. |      ARG        DEU        30   1986   302.6463 |
  2. |      DEU        ARG        30   1986   123.4567 |
     +-------------------------------------------------+


* Data 2
    clear
    input str3(exporter importer) float(col cnl lp cl)
    "ARG" "DEU" 0 .0092 .1358 .0483
    end

    save data2, replace
    list, sep(0)

     +---------------------------------------------------+
     | exporter   importer   col     cnl      lp      cl |
     |---------------------------------------------------|
  1. |      ARG        DEU     0   .0092   .1358   .0483 |
     +---------------------------------------------------+


* START HERE
    use data2, clear
    gen joint_id = cond(exporter<importer, exporter+importer, importer+exporter)
    save data2_m, replace
    
    use data1, clear
    gen joint_id = cond(exporter<importer, exporter+importer, importer+exporter)
    merge m:1 joint_id using data2_m, keep(3) nogen
    
    drop joint_id    
    list, sep(0)

     +-------------------------------------------------------------------------------+
     | exporter   importer   pair_id   year      trade   col     cnl      lp      cl |
     |-------------------------------------------------------------------------------|
  1. |      ARG        DEU        30   1986   302.6463     0   .0092   .1358   .0483 |
  2. |      DEU        ARG        30   1986   123.4567     0   .0092   .1358   .0483 |
     +-------------------------------------------------------------------------------+

Comment

Marvin Haas

Join Date: Jun 2022

Posts: 11
#3

24 Jun 2022, 11:42

Thank you for your very detailed answer Fei Wang. I will immediately try to merge the data using your method and will give you a feedback.
thanks
Marvin
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30050
#4

24 Jun 2022, 11:56

I might make a small modification to the code shown in #2. It does eliminate the pairs from data set 2 that have nothing corresponding in data set 1, as requested in #1. But it also eliminates any observations in data set 1 that do not appear in data set 2. Now, depending on what O.P. wants to do from this point forward, that may be fine. But if he needs to retain all the original observations in data set 1, even if there is no information for them in the language data set (data set 2), then the -keep()- option of the -merge- command should be -keep(1 3)-, or, as I would write it more understandably, -keep(master match)-.
1 like
Comment

Marvin Haas

Join Date: Jun 2022
Posts: 11

26 Jun 2022, 03:52

Dear Community, Mr. Wang, Mr. Schlechter,

creating the joint_id works great.
Unfortunately the variables col, cnl,lp and cl are not included in the cell.
as it looks like:

(sorry for the picture, seemed necessary to me again)

Click image for larger version

Name: merged.PNG
Views: 1
Size: 55.0 KB
ID: 1670945

Code:

 . use "C:\Users\marvin.haas\Desktop\STATA\ch1\langua ge variables.dta", clear

. gen joint_id = cond(exporter<importer, exporter+importer, importer+exporter)

. save "C:\Users\marvin.haas\Desktop\STATA\ch1\langua ge variables.dta", replace
file C:\Users\marvin.haas\Desktop\STATA\ch1\language variables.dta saved

. use "C:\Users\marvin.haas\Desktop\STATA\ch1\Chapte r1Ap plication1.dta", clear

. gen joint_id = cond(exporter<importer, exporter+importer, importer+exporter)

. merge m:1 joint_id using "C:\Users\marvin.haas\Desktop\STATA\ch1\langua ge variables.dta", keep(3) nogen
variable joint_id does not uniquely identify observations in the using data
r(459);

. describe

Contains data from C:\Users\marvin.haas\Desktop\STATA\ch1\Chapter1App lication1.dta
Observations: 99,981
Variables: 10 25 Oct 2016 07:57
(_dta has notes)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Variable Storage Display Value
name type format label Variable label
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
exporter str3 %9s Exporter ISO country code
importer str3 %9s Importer ISO country code
pair_id float %9.0g Symmetric Pair ID
year float %9.0g Year
trade double %10.0g Nominal trade flows in current US dollars
DIST double %9.0g Population-weighted bilateral distance between i and j, in kilometers
CNTG byte %8.0g Indicator. Equal to 1 if i and j share a common border
LANG byte %8.0g Indicator. Equal to 1 if i and j speak the same official language
CLNY byte %8.0g Indicator. Equal to 1 if i and j share a colonial relationship
joint_id str6 %9s
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Sorted by: joint_id
Note: Dataset has changed since last saved.

Last edited by Marvin Haas; 26 Jun 2022, 04:09.

Comment

Andrew Musau

Join Date: Oct 2014

Posts: 10168
#6

26 Jun 2022, 04:41

. merge m:1 joint_id using "C:\Users\marvin.haas\Desktop\STATA\ch1\langua ge variables.dta", keep(3) nogen variable joint_id does not uniquely identify observations in the using data r(459);

This error implies that the merge was not successful. You have duplicate entries of joint_id in your using dataset, so you need to investigate why this is the case.

Code:

use "C:\Users\marvin.haas\Desktop\STATA\ch1\langua ge variables.dta", clear bys joint_id: g tag= _N>1 list if tag, sepby(joint_id)

If these are simply duplicates (repeated entries), you can

Code:

bys joint_id: keep if _n==1

then proceed with the merge.
Comment

Marvin Haas

Join Date: Jun 2022
Posts: 11

26 Jun 2022, 07:25

Stata forum is when a german gets a solution presented by a chinese professor with additions from an american professor and crucial modifications from a norwegian professor.

Thanks to all!

I attached the entire code:

Code:

use "C:\Users\marvin.haas\Desktop\STATA\ch1\language variables.dta", clear

gen joint_id = cond(exporter<importer, exporter+importer, importer+exporter)

bys joint_id: g tag= _N>1

bys joint_id: keep if _n==1

save "C:\Users\marvin.haas\Desktop\STATA\ch1\language variables.dta", replace

use "C:\Users\marvin.haas\Desktop\STATA\ch1\Chapter1Application1.dta", clear

gen joint_id = cond(exporter<importer, exporter+importer, importer+exporter)

merge m:1 joint_id using "C:\Users\marvin.haas\Desktop\STATA\ch1\language variables.dta", keep(3) nogen

drop joint_id

save "C:\Users\marvin.haas\Desktop\STATA\ch1\merged_dataset.dta"

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30050
#8

26 Jun 2022, 11:04

The first problem is the one that Andrew Musau responds to in #6: you do not have unique observations corresponding to the joint_id in the language data set. You must fix that before you can make any progress. You need to identify the surplus observations using the way shown in #6 and then figure out 1) which observation is the correct one to retain in each case, and 2) why there are extra observations in that data set in the first place--this usually means that the data management that created that data set contains errors, and where one error is found others may lurk. So you should thoroughly review the steps that led up to the creation of the language data set and correct any errors you find. It's important to get the data right!

There may be a second problem as well. I'm not sure what the screenshot you show in #5 represents. But if it represents the appearance of the data after the attempted, but failed, -merge- command, then we do, indeed, have another problem. The problem is that the variables CNTG, LANG, and CLNY are in the data set. Since the -merge- did not actually run, these must have come from the Chapter1Application data set. The reason this is a problem is that the -merge- command, as written, even after you fix the duplicate observations problem, will not change the pre-existing values of these variables. You have two choices for fixing this. One is to drop these variables before you run the -merge-. Alternatively, you can change the -merge- command to read:

Code:

merge m:1 joint_id using "C:\Users\marvin.haas\Desktop\STATA\ch1\language variables.dta", update replace keep(3 4 5) nogen

These additions to the -merge- command will allow it to overwrite existing values in the Chapter1Application data set with the values found in the language variables data set. Be cautious in using this approach, however, because it will cause -merge- to also overwrite any other variables that appear in both data sets. Now, maybe there are no other variables common to both, or perhaps you want them overwritten in any case. That's fine if that's the case. But just be sure you're not inadvertently changing things that shouldn't be changed.

Added: My mistake, please disregard what I said about the second problem. Seeing a variable named LANG, I leapt to the conclusion that it was supposed to come from the language data set, and I imagined CNTG, and CLNY did as well. But looking back to #1, I see that these are not language variables, and that the variables to be brought in from the language data set have different (though somewhat similar names): col, cnl, lp, and cl. As these are different variables, there is no problem with the simpler version of -merge- you were using before. I apologize for any confusion I caused.

Last edited by Clyde Schechter; 26 Jun 2022, 11:12.
Comment

Announcement