Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merge observations by pairs

    Hello,

    I'm cleaning a huge dataset and I'm stucked in one point. I've find similar problems and nice answers in here but the structures are slightly different from mine and I can not apply this to my issue.

    I have data on trade for lots of countries and years. My data for now looks like this (numbers are fake):
    country_1 country_2 year imp exp "other variables"
    1 greece united_kingdom 1857 124353.66 . XX
    2 greece united_kingdom 1857 . 4659756.67 XX
    3 greece united_kingdom 1858 574974.86 . XX
    4 greece united_kingdom 1858 . 658796.34 XX
    ...... .............. ..................... ..... ................. ............... .............
    99 greece belgium 1857 56746.67 . XX
    100 greece belgium 1857 . 45675.67 XX

    And I want to merge/combine all the pairs of variables that have the same contry 1, country 2 and year, which have the same values for all the variables but "imp" and "exp", which each ones take . (missing value) in one case. So, I want to have something like this:
    country_1 country_2 year imp exp "other variables"
    1 greece united_kingdom 1857 124353.66 4659756.67 XX
    2 greece united_kingdom 1858 574974.86 658796.34 XX
    ..... .............. ..................... ..... ................. ............... XX
    50 greece belgium 1857 56746.67 45675.67 XX

    And therefore, have the half of the observations than before.

    Many thanks for your help!!

    Last edited by Gin Ayra; 06 May 2022, 05:49.

  • #2
    Code:
    collapse (firstnm) imp exp other_variables, by(country_1 country_2 year)

    Comment


    • #3
      It worked! Many thanks

      Comment


      • #4
        This isn't a merge in the sense of Stata's command merge. It could be a collapse. But for data as you describe. there is a simple trick. For paired observations under by: observation 3 - _n is observation 2 if we are looking at observation 1, and vice versa. So that's generic syntax for "look in the twin observation".

        Thus consider this example:

        Code:
        clear
        input str9 country_1 str9 country_2 year import export x y 
        "Freedonia" "Sylvania"  1856    1  .   42  666
        "Freedonia" "Sylvania"  1856    .  2   42  666
        "Sylvania" "Freedonia"  1857    3  .    7   11
        "Sylvania" "Freedonia"  1857    .  4    7   11
        end 
        
        bysort country_1 country_2 year : replace import = import[3 - _n] if missing(import)
        bysort country_1 country_2 year : replace export = export[3 - _n] if missing(export)
        
        list 
        
        duplicates drop 
        
        list 
        
        
            +-----------------------------------------------------------+
             | country_1   country_2   year   import   export    x     y |
             |-----------------------------------------------------------|
          1. | Freedonia    Sylvania   1856        1        2   42   666 |
          2. |  Sylvania   Freedonia   1857        3        4    7    11 |
             +-----------------------------------------------------------+
        I didn't use your data example -- which is clear but still needs too much editing as you didn't use dataex to show an example. See https://www.statalist.org/forums/help#stata for the longstanding request to use that command.

        See https://www.stata-journal.com/articl...article=dm0043 for the main idea here.

        For a paranoid check run this first

        Code:
        bysort country_1 country_2 year : assert _N == 2 
        If that fails, you should do this

        Code:
         
        bysort country_1 country_2 year : gen nobs = _N 
        edit if nobs != 2

        Comment

        Working...
        X