Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merging Data Sets with matching Language pairs

    Dear stata-community,
    in the context of a research project using the gravity model of trade I am having difficulties to merge two datassets.

    this is an example of a line from each dataset:

    Dataset 1 (https://vi.unctad.org/tpa/web/vol2/vol2home.html) Chapter1Application1
    tradeflows among countries 1986-2006
    (ARG DEU and DEU ARG are having the same pair_id btw)
    Click image for larger version

Name:	dataset 2.PNG
Views:	1
Size:	19.3 KB
ID:	1670812


    Dataset 2 (https://www.usitc.gov/data/dicl.htm)
    index measuring linguistic similarity
    Click image for larger version

Name:	dataset1.PNG
Views:	2
Size:	11.6 KB
ID:	1670813


    how can i merge both cells into one under the condition that they are the same pair of countries
    ARG-DEU and DEU-ARG should have the same linguistic similarity values added in the cells ?
    Dataset 2 also offers many country pairs that do not occur in dataset 1 and should therefore also be deleted when merging.

    Click image for larger version

Name:	merge options.PNG
Views:	2
Size:	82.0 KB
ID:	1670815

    i hope you can help me merge the data since I did not get any further using the stata mask above either.

    regards Marvin
    Attached Files

  • #2
    Marvin, it would be helpful to display your data using command -dataex- rather than posting pictures.

    Below I made up another line in Data 1 where DEU is the exporter while ARG is the importer.
    Code:
    * Data 1
        clear
        input str3(exporter importer) float(pair_id year trade)
        "ARG" "DEU" 30 1986 302.6463
        "DEU" "ARG" 30 1986 123.4567
        end
    
        save data1, replace
        list, sep(0)
    
         +-------------------------------------------------+
         | exporter   importer   pair_id   year      trade |
         |-------------------------------------------------|
      1. |      ARG        DEU        30   1986   302.6463 |
      2. |      DEU        ARG        30   1986   123.4567 |
         +-------------------------------------------------+
    
    
    * Data 2
        clear
        input str3(exporter importer) float(col cnl lp cl)
        "ARG" "DEU" 0 .0092 .1358 .0483
        end
    
        save data2, replace
        list, sep(0)
    
         +---------------------------------------------------+
         | exporter   importer   col     cnl      lp      cl |
         |---------------------------------------------------|
      1. |      ARG        DEU     0   .0092   .1358   .0483 |
         +---------------------------------------------------+
    
    
    * START HERE
        use data2, clear
        gen joint_id = cond(exporter<importer, exporter+importer, importer+exporter)
        save data2_m, replace
        
        use data1, clear
        gen joint_id = cond(exporter<importer, exporter+importer, importer+exporter)
        merge m:1 joint_id using data2_m, keep(3) nogen
        
        drop joint_id    
        list, sep(0)
    
         +-------------------------------------------------------------------------------+
         | exporter   importer   pair_id   year      trade   col     cnl      lp      cl |
         |-------------------------------------------------------------------------------|
      1. |      ARG        DEU        30   1986   302.6463     0   .0092   .1358   .0483 |
      2. |      DEU        ARG        30   1986   123.4567     0   .0092   .1358   .0483 |
         +-------------------------------------------------------------------------------+

    Comment


    • #3
      Thank you for your very detailed answer Fei Wang. I will immediately try to merge the data using your method and will give you a feedback.
      thanks
      Marvin

      Comment


      • #4
        I might make a small modification to the code shown in #2. It does eliminate the pairs from data set 2 that have nothing corresponding in data set 1, as requested in #1. But it also eliminates any observations in data set 1 that do not appear in data set 2. Now, depending on what O.P. wants to do from this point forward, that may be fine. But if he needs to retain all the original observations in data set 1, even if there is no information for them in the language data set (data set 2), then the -keep()- option of the -merge- command should be -keep(1 3)-, or, as I would write it more understandably, -keep(master match)-.

        Comment


        • #5
          Dear Community, Mr. Wang, Mr. Schlechter,

          creating the joint_id works great.
          Unfortunately the variables col, cnl,lp and cl are not included in the cell.
          as it looks like:

          (sorry for the picture, seemed necessary to me again)
          Click image for larger version

Name:	merged.PNG
Views:	1
Size:	55.0 KB
ID:	1670945



          Code:
           . use "C:\Users\marvin.haas\Desktop\STATA\ch1\langua ge variables.dta", clear
          
          . gen joint_id = cond(exporter<importer, exporter+importer, importer+exporter)
          
          . save "C:\Users\marvin.haas\Desktop\STATA\ch1\langua ge variables.dta", replace
          file C:\Users\marvin.haas\Desktop\STATA\ch1\language variables.dta saved
          
          . use "C:\Users\marvin.haas\Desktop\STATA\ch1\Chapte r1Ap plication1.dta", clear
          
          . gen joint_id = cond(exporter<importer, exporter+importer, importer+exporter)
          
          . merge m:1 joint_id using "C:\Users\marvin.haas\Desktop\STATA\ch1\langua ge variables.dta", keep(3) nogen
          variable joint_id does not uniquely identify observations in the using data
          r(459);
          
          . describe
          
          Contains data from C:\Users\marvin.haas\Desktop\STATA\ch1\Chapter1App lication1.dta
          Observations: 99,981
          Variables: 10 25 Oct 2016 07:57
          (_dta has notes)
          ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
          Variable Storage Display Value
          name type format label Variable label
          ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
          exporter str3 %9s Exporter ISO country code
          importer str3 %9s Importer ISO country code
          pair_id float %9.0g Symmetric Pair ID
          year float %9.0g Year
          trade double %10.0g Nominal trade flows in current US dollars
          DIST double %9.0g Population-weighted bilateral distance between i and j, in kilometers
          CNTG byte %8.0g Indicator. Equal to 1 if i and j share a common border
          LANG byte %8.0g Indicator. Equal to 1 if i and j speak the same official language
          CLNY byte %8.0g Indicator. Equal to 1 if i and j share a colonial relationship
          joint_id str6 %9s
          ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
          Sorted by: joint_id
          Note: Dataset has changed since last saved.
          Last edited by Marvin Haas; 26 Jun 2022, 04:09.

          Comment


          • #6
            . merge m:1 joint_id using "C:\Users\marvin.haas\Desktop\STATA\ch1\langua ge variables.dta", keep(3) nogen variable joint_id does not uniquely identify observations in the using data r(459);
            This error implies that the merge was not successful. You have duplicate entries of joint_id in your using dataset, so you need to investigate why this is the case.

            Code:
            use "C:\Users\marvin.haas\Desktop\STATA\ch1\langua ge variables.dta", clear
            bys joint_id: g tag= _N>1
            list if tag, sepby(joint_id)
            If these are simply duplicates (repeated entries), you can

            Code:
            bys joint_id: keep if _n==1
            then proceed with the merge.

            Comment


            • #7
              Stata forum is when a german gets a solution presented by a chinese professor with additions from an american professor and crucial modifications from a norwegian professor.

              Thanks to all!

              I attached the entire code:

              Code:
              use "C:\Users\marvin.haas\Desktop\STATA\ch1\language variables.dta", clear
              
              gen joint_id = cond(exporter<importer, exporter+importer, importer+exporter)
              
              bys joint_id: g tag= _N>1
              
              bys joint_id: keep if _n==1
              
              save "C:\Users\marvin.haas\Desktop\STATA\ch1\language variables.dta", replace
              
              use "C:\Users\marvin.haas\Desktop\STATA\ch1\Chapter1Application1.dta", clear
              
              gen joint_id = cond(exporter<importer, exporter+importer, importer+exporter)
              
              merge m:1 joint_id using "C:\Users\marvin.haas\Desktop\STATA\ch1\language variables.dta", keep(3) nogen
              
              drop joint_id
              
              save "C:\Users\marvin.haas\Desktop\STATA\ch1\merged_dataset.dta"

              Comment


              • #8
                The first problem is the one that Andrew Musau responds to in #6: you do not have unique observations corresponding to the joint_id in the language data set. You must fix that before you can make any progress. You need to identify the surplus observations using the way shown in #6 and then figure out 1) which observation is the correct one to retain in each case, and 2) why there are extra observations in that data set in the first place--this usually means that the data management that created that data set contains errors, and where one error is found others may lurk. So you should thoroughly review the steps that led up to the creation of the language data set and correct any errors you find. It's important to get the data right!

                There may be a second problem as well. I'm not sure what the screenshot you show in #5 represents. But if it represents the appearance of the data after the attempted, but failed, -merge- command, then we do, indeed, have another problem. The problem is that the variables CNTG, LANG, and CLNY are in the data set. Since the -merge- did not actually run, these must have come from the Chapter1Application data set. The reason this is a problem is that the -merge- command, as written, even after you fix the duplicate observations problem, will not change the pre-existing values of these variables. You have two choices for fixing this. One is to drop these variables before you run the -merge-. Alternatively, you can change the -merge- command to read:
                Code:
                merge m:1 joint_id using "C:\Users\marvin.haas\Desktop\STATA\ch1\language variables.dta", update replace keep(3 4 5) nogen
                These additions to the -merge- command will allow it to overwrite existing values in the Chapter1Application data set with the values found in the language variables data set. Be cautious in using this approach, however, because it will cause -merge- to also overwrite any other variables that appear in both data sets. Now, maybe there are no other variables common to both, or perhaps you want them overwritten in any case. That's fine if that's the case. But just be sure you're not inadvertently changing things that shouldn't be changed.

                Added: My mistake, please disregard what I said about the second problem. Seeing a variable named LANG, I leapt to the conclusion that it was supposed to come from the language data set, and I imagined CNTG, and CLNY did as well. But looking back to #1, I see that these are not language variables, and that the variables to be brought in from the language data set have different (though somewhat similar names): col, cnl, lp, and cl. As these are different variables, there is no problem with the simpler version of -merge- you were using before. I apologize for any confusion I caused.
                Last edited by Clyde Schechter; 26 Jun 2022, 11:12.

                Comment

                Working...
                X