Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Expand data to included variable containing all possible country pairs

    Hello all,

    I am working with a dataset containing an id variable "id" and a country identifier "nat". I would like to expand this dataset to add another variable which includes all possible pairs of nationality by id to be merged with another dataset including cross-country data. I know that this will enlarge the dataset significantly. The new variable would add n observations to each unique id and nat pair which would contain a string of the concatenated country abbreviations.

    An example can be considered with the id 8229 for which 4 pairs would be generated "USUS", "USGB", "USIN" and "GBIN". These pairs may also be in whatever order with either country preceding the other. It might look something like this:

    id nat pair
    8229 "US" "USUS"
    8229 "US" "USGB"
    8229 "US" "USIN"
    8229 "US" "GBIN"
    8229 "GB" "USGB"
    8229 "GB" "GBIN"
    8229 "IN" "USIN"
    8229 "IN" "GBIN"

    I've given it my best shot using egen and some other approached. I also tried the tuples command, but the dataset contains around a million observations so this approach seems really time-inefficient considering the command takes some time to run.

    Thanks in advance for any help.


    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input long id str4 nat
    6757 "US"
    6931 "DE"
    7011 "US"
    7025 "US"
    7027 "US"
    7031 "US"
    7031 "US"
    7085 "DE"
    7843 "US"
    7843 "US"
    7843 "US"
    7843 "US"
    7843 "US"
    7843 "US"
    7843 "US"
    7851 "DZ"
    7853 "US"
    7877 "US"
    8075 "DE"
    8085 "FR"
    8089 "US"
    8135 "DE"
    8159 "US"
    8171 "US"
    8209 "DE"
    8225 "US"
    8225 "GB"
    8225 "US"
    8227 "GB"
    8227 "US"
    8227 "US"
    8229 "IN"
    8229 "US"
    8229 "US"
    8229 "US"
    8229 "GB"
    8249 "US"
    8263 "US"
    8263 "US"
    8263 "US"
    8265 "IN"
    8273 "CA"
    8303 "US"
    8303 "US"
    8303 "US"
    8303 "US"
    8313 "US"
    8313 "US"
    8325 "US"
    8325 "IQ"
    8325 "US"
    8325 "IN"
    8325 "US"
    8325 "US"
    8327 "GB"
    8327 "DE"
    8385 "GB"
    8391 "CN"
    8391 "US"
    8391 "CN"
    8391 "CN"
    8391 "US"
    8391 "US"
    8413 "CH"
    8431 "FR"
    8543 "US"
    8559 "US"
    8695 "FI"
    8695 "NO"
    8699 "US"
    8699 "US"
    8717 "US"
    8727 "US"
    8727 "US"
    8727 "US"
    8727 "US"
    8727 "US"
    8727 "US"
    8727 "US"
    8731 "US"
    8739 "US"
    8757 "CN"
    8757 "US"
    8757 "CN"
    8759 "US"
    8759 "US"
    8759 "GB"
    8817 "US"
    8817 "RU"
    8817 "CA"
    8817 "US"
    8817 "US"
    8817 "US"
    8821 "US"
    8821 "CA"
    8821 "US"
    8825 "US"
    8825 "US"
    8825 "US"
    end

  • #2
    I don't understand how you arrived as the example results you show:
    id nat pair
    8229 "US" "USUS"
    8229 "US" "USGB"
    8229 "US" "USIN"
    8229 "US" "GBIN"
    8229 "GB" "USGB"
    8229 "GB" "GBIN"
    8229 "IN" "USIN"
    8229 "IN" "GBIN"
    On the one hand "US" ends up paired not only with "USUS", "USGB", and "USIN", in which US itself appears, but also with "GBIN" in which it does not. That's fine, but then why does "GB" appear only paired with "USGB" and "GBIN" but not with "USIN" or "USUS" or "IN"IN"? Please clarify the rule deciding which pairs appear with each value of nat.

    By the way, let me set some expectations for you. When you have a data set with 1,000,000 observations and you start to generate pairs, the size of the data set is going to blow up. The process of creating this larger data set will involve repeated calls to the operating system for more memory, and then Stata's updating the enlarged data set. This process is inherently slow. There is no magical code that will run this in seconds or minutes unless you have access to a supercomputer. While there are faster and slower ways to do this, you should expect this process to take a long time to complete even with the most efficient code. Be patient: get something else to keep yourself occupied while it runs. Or run it overnight.

    Comment


    • #3
      joinby will give you all pairwise combinations within groups, but as Clyde points out, your description is not clear.

      Code:
      clear
      input long id str4 nat
      6757 "US"
      6931 "DE"
      7011 "US"
      7025 "US"
      7027 "US"
      7031 "US"
      7031 "US"
      7085 "DE"
      7843 "US"
      7843 "US"
      7843 "US"
      7843 "US"
      7843 "US"
      7843 "US"
      7843 "US"
      7851 "DZ"
      7853 "US"
      7877 "US"
      8075 "DE"
      8085 "FR"
      8089 "US"
      8135 "DE"
      8159 "US"
      8171 "US"
      8209 "DE"
      8225 "US"
      8225 "GB"
      8225 "US"
      8227 "GB"
      8227 "US"
      8227 "US"
      8229 "IN"
      8229 "US"
      8229 "US"
      8229 "US"
      8229 "GB"
      8249 "US"
      8263 "US"
      8263 "US"
      8263 "US"
      8265 "IN"
      8273 "CA"
      8303 "US"
      8303 "US"
      8303 "US"
      8303 "US"
      8313 "US"
      8313 "US"
      8325 "US"
      8325 "IQ"
      8325 "US"
      8325 "IN"
      8325 "US"
      8325 "US"
      8327 "GB"
      8327 "DE"
      8385 "GB"
      8391 "CN"
      8391 "US"
      8391 "CN"
      8391 "CN"
      8391 "US"
      8391 "US"
      8413 "CH"
      8431 "FR"
      8543 "US"
      8559 "US"
      8695 "FI"
      8695 "NO"
      8699 "US"
      8699 "US"
      8717 "US"
      8727 "US"
      8727 "US"
      8727 "US"
      8727 "US"
      8727 "US"
      8727 "US"
      8727 "US"
      8731 "US"
      8739 "US"
      8757 "CN"
      8757 "US"
      8757 "CN"
      8759 "US"
      8759 "US"
      8759 "GB"
      8817 "US"
      8817 "RU"
      8817 "CA"
      8817 "US"
      8817 "US"
      8817 "US"
      8821 "US"
      8821 "CA"
      8821 "US"
      8825 "US"
      8825 "US"
      8825 "US"
      end
      
      preserve
      contract id nat
      rename nat nat2
      tempfile tojoin
      save `tojoin'
      restore
      joinby id using `tojoin'
      bys id (nat): gen wanted=cond(_N==1, "", nat+nat2)
      drop nat2 _freq
      Res.:

      Code:
      . l, sepby(id)
      
           +---------------------+
           |   id   nat   wanted |
           |---------------------|
        1. | 6757    US          |
           |---------------------|
        2. | 6931    DE          |
           |---------------------|
        3. | 7011    US          |
           |---------------------|
        4. | 7025    US          |
           |---------------------|
        5. | 7027    US          |
           |---------------------|
        6. | 7031    US     USUS |
        7. | 7031    US     USUS |
           |---------------------|
        8. | 7085    DE          |
           |---------------------|
        9. | 7843    US     USUS |
       10. | 7843    US     USUS |
       11. | 7843    US     USUS |
       12. | 7843    US     USUS |
       13. | 7843    US     USUS |
       14. | 7843    US     USUS |
       15. | 7843    US     USUS |
           |---------------------|
       16. | 7851    DZ          |
           |---------------------|
       17. | 7853    US          |
           |---------------------|
       18. | 7877    US          |
           |---------------------|
       19. | 8075    DE          |
           |---------------------|
       20. | 8085    FR          |
           |---------------------|
       21. | 8089    US          |
           |---------------------|
       22. | 8135    DE          |
           |---------------------|
       23. | 8159    US          |
           |---------------------|
       24. | 8171    US          |
           |---------------------|
       25. | 8209    DE          |
           |---------------------|
       26. | 8225    GB     GBUS |
       27. | 8225    GB     GBGB |
       28. | 8225    US     USGB |
       29. | 8225    US     USGB |
       30. | 8225    US     USUS |
       31. | 8225    US     USUS |
           |---------------------|
       32. | 8227    GB     GBUS |
       33. | 8227    GB     GBGB |
       34. | 8227    US     USUS |
       35. | 8227    US     USGB |
       36. | 8227    US     USGB |
       37. | 8227    US     USUS |
           |---------------------|
       38. | 8229    GB     GBIN |
       39. | 8229    GB     GBGB |
       40. | 8229    GB     GBUS |
       41. | 8229    IN     ININ |
       42. | 8229    IN     INGB |
       43. | 8229    IN     INUS |
       44. | 8229    US     USUS |
       45. | 8229    US     USUS |
       46. | 8229    US     USGB |
       47. | 8229    US     USGB |
       48. | 8229    US     USUS |
       49. | 8229    US     USIN |
       50. | 8229    US     USIN |
       51. | 8229    US     USGB |
       52. | 8229    US     USIN |
           |---------------------|
       53. | 8249    US          |
           |---------------------|
       54. | 8263    US     USUS |
       55. | 8263    US     USUS |
       56. | 8263    US     USUS |
           |---------------------|
       57. | 8265    IN          |
           |---------------------|
       58. | 8273    CA          |
           |---------------------|
       59. | 8303    US     USUS |
       60. | 8303    US     USUS |
       61. | 8303    US     USUS |
       62. | 8303    US     USUS |
           |---------------------|
       63. | 8313    US     USUS |
       64. | 8313    US     USUS |
           |---------------------|
       65. | 8325    IN     INUS |
       66. | 8325    IN     ININ |
       67. | 8325    IN     INIQ |
       68. | 8325    IQ     IQIN |
       69. | 8325    IQ     IQUS |
       70. | 8325    IQ     IQIQ |
       71. | 8325    US     USIN |
       72. | 8325    US     USIN |
       73. | 8325    US     USIQ |
       74. | 8325    US     USUS |
       75. | 8325    US     USUS |
       76. | 8325    US     USIQ |
       77. | 8325    US     USUS |
       78. | 8325    US     USIQ |
       79. | 8325    US     USIQ |
       80. | 8325    US     USIN |
       81. | 8325    US     USUS |
       82. | 8325    US     USIN |
           |---------------------|
       83. | 8327    DE     DEGB |
       84. | 8327    DE     DEDE |
       85. | 8327    GB     GBGB |
       86. | 8327    GB     GBDE |
           |---------------------|
       87. | 8385    GB          |
           |---------------------|
       88. | 8391    CN     CNCN |
       89. | 8391    CN     CNUS |
       90. | 8391    CN     CNUS |
       91. | 8391    CN     CNUS |
       92. | 8391    CN     CNCN |
       93. | 8391    CN     CNCN |
       94. | 8391    US     USCN |
       95. | 8391    US     USUS |
       96. | 8391    US     USUS |
       97. | 8391    US     USCN |
       98. | 8391    US     USCN |
       99. | 8391    US     USUS |
           |---------------------|
      100. | 8413    CH          |
           |---------------------|
      101. | 8431    FR          |
           |---------------------|
      102. | 8543    US          |
           |---------------------|
      103. | 8559    US          |
           |---------------------|
      104. | 8695    FI     FINO |
      105. | 8695    FI     FIFI |
      106. | 8695    NO     NOFI |
      107. | 8695    NO     NONO |
           |---------------------|
      108. | 8699    US     USUS |
      109. | 8699    US     USUS |
           |---------------------|
      110. | 8717    US          |
           |---------------------|
      111. | 8727    US     USUS |
      112. | 8727    US     USUS |
      113. | 8727    US     USUS |
      114. | 8727    US     USUS |
      115. | 8727    US     USUS |
      116. | 8727    US     USUS |
      117. | 8727    US     USUS |
           |---------------------|
      118. | 8731    US          |
           |---------------------|
      119. | 8739    US          |
           |---------------------|
      120. | 8757    CN     CNCN |
      121. | 8757    CN     CNCN |
      122. | 8757    CN     CNUS |
      123. | 8757    CN     CNUS |
      124. | 8757    US     USUS |
      125. | 8757    US     USCN |
           |---------------------|
      126. | 8759    GB     GBUS |
      127. | 8759    GB     GBGB |
      128. | 8759    US     USGB |
      129. | 8759    US     USUS |
      130. | 8759    US     USUS |
      131. | 8759    US     USGB |
           |---------------------|
      132. | 8817    CA     CAUS |
      133. | 8817    CA     CARU |
      134. | 8817    CA     CACA |
      135. | 8817    RU     RUCA |
      136. | 8817    RU     RUUS |
      137. | 8817    RU     RURU |
      138. | 8817    US     USUS |
      139. | 8817    US     USCA |
      140. | 8817    US     USUS |
      141. | 8817    US     USCA |
      142. | 8817    US     USRU |
      143. | 8817    US     USUS |
      144. | 8817    US     USRU |
      145. | 8817    US     USRU |
      146. | 8817    US     USUS |
      147. | 8817    US     USCA |
      148. | 8817    US     USCA |
      149. | 8817    US     USRU |
           |---------------------|
      150. | 8821    CA     CAUS |
      151. | 8821    CA     CACA |
      152. | 8821    US     USUS |
      153. | 8821    US     USCA |
      154. | 8821    US     USCA |
      155. | 8821    US     USUS |
           |---------------------|
      156. | 8825    US     USUS |
      157. | 8825    US     USUS |
      158. | 8825    US     USUS |
           +---------------------+
      Last edited by Andrew Musau; 26 Nov 2023, 11:29.

      Comment


      • #4
        Dear Clyde and Andrew,

        Thank you for bringing that error to my attention and for the potential solution. The corrected example would be:

        id nat pair
        8229 "US" "USUS"
        8229 "US" "USGB"
        8229 "US" "USIN"
        8229 "GB" "USGB"
        8229 "GB" "GBIN"
        8229 "IN" "USIN"
        8229 "IN" "GBIN"

        That is, the associated pair should always contain the nat variable itself. A same country pair should only occur when there are two or more individuals of the same country within the same id. So, in the solution suggested by Andrew, the id 8229 should not contain the wanted pairs "GBGB" or "ININ" since there was only one individual from each of those countries.

        Comment


        • #5
          EDITED: This will give you distinct pairs, which appears to be what you want.

          Code:
          clear
          input long id str4 nat
          6757 "US"
          6931 "DE"
          7011 "US"
          7025 "US"
          7027 "US"
          7031 "US"
          7031 "US"
          7085 "DE"
          7843 "US"
          7843 "US"
          7843 "US"
          7843 "US"
          7843 "US"
          7843 "US"
          7843 "US"
          7851 "DZ"
          7853 "US"
          7877 "US"
          8075 "DE"
          8085 "FR"
          8089 "US"
          8135 "DE"
          8159 "US"
          8171 "US"
          8209 "DE"
          8225 "US"
          8225 "GB"
          8225 "US"
          8227 "GB"
          8227 "US"
          8227 "US"
          8229 "IN"
          8229 "US"
          8229 "US"
          8229 "US"
          8229 "GB"
          8249 "US"
          8263 "US"
          8263 "US"
          8263 "US"
          8265 "IN"
          8273 "CA"
          8303 "US"
          8303 "US"
          8303 "US"
          8303 "US"
          8313 "US"
          8313 "US"
          8325 "US"
          8325 "IQ"
          8325 "US"
          8325 "IN"
          8325 "US"
          8325 "US"
          8327 "GB"
          8327 "DE"
          8385 "GB"
          8391 "CN"
          8391 "US"
          8391 "CN"
          8391 "CN"
          8391 "US"
          8391 "US"
          8413 "CH"
          8431 "FR"
          8543 "US"
          8559 "US"
          8695 "FI"
          8695 "NO"
          8699 "US"
          8699 "US"
          8717 "US"
          8727 "US"
          8727 "US"
          8727 "US"
          8727 "US"
          8727 "US"
          8727 "US"
          8727 "US"
          8731 "US"
          8739 "US"
          8757 "CN"
          8757 "US"
          8757 "CN"
          8759 "US"
          8759 "US"
          8759 "GB"
          8817 "US"
          8817 "RU"
          8817 "CA"
          8817 "US"
          8817 "US"
          8817 "US"
          8821 "US"
          8821 "CA"
          8821 "US"
          8825 "US"
          8825 "US"
          8825 "US"
          end
          
          preserve
          contract id nat
          rename nat nat2
          tempfile tojoin
          save `tojoin'
          restore
          joinby id using `tojoin'
          bys id (nat): gen wanted=cond(_N==1, "", cond(nat<nat2, nat, nat2)  + cond(nat<nat2, nat2, nat))
          replace wanted="" if nat==nat2 & _freq<2
          contract id nat wanted, freq(count)
          drop count
          bys id: drop if missing(wanted) & _N>1
          Res.:

          Code:
          
          . l, sepby(id)
          
               +---------------------+
               |   id   nat   wanted |
               |---------------------|
            1. | 6757    US          |
               |---------------------|
            2. | 6931    DE          |
               |---------------------|
            3. | 7011    US          |
               |---------------------|
            4. | 7025    US          |
               |---------------------|
            5. | 7027    US          |
               |---------------------|
            6. | 7031    US     USUS |
               |---------------------|
            7. | 7085    DE          |
               |---------------------|
            8. | 7843    US     USUS |
               |---------------------|
            9. | 7851    DZ          |
               |---------------------|
           10. | 7853    US          |
               |---------------------|
           11. | 7877    US          |
               |---------------------|
           12. | 8075    DE          |
               |---------------------|
           13. | 8085    FR          |
               |---------------------|
           14. | 8089    US          |
               |---------------------|
           15. | 8135    DE          |
               |---------------------|
           16. | 8159    US          |
               |---------------------|
           17. | 8171    US          |
               |---------------------|
           18. | 8209    DE          |
               |---------------------|
           19. | 8225    GB     GBUS |
           20. | 8225    US     GBUS |
           21. | 8225    US     USUS |
               |---------------------|
           22. | 8227    GB     GBUS |
           23. | 8227    US     GBUS |
           24. | 8227    US     USUS |
               |---------------------|
           25. | 8229    GB     GBIN |
           26. | 8229    GB     GBUS |
           27. | 8229    IN     GBIN |
           28. | 8229    IN     INUS |
           29. | 8229    US     GBUS |
           30. | 8229    US     INUS |
           31. | 8229    US     USUS |
               |---------------------|
           32. | 8249    US          |
               |---------------------|
           33. | 8263    US     USUS |
               |---------------------|
           34. | 8265    IN          |
               |---------------------|
           35. | 8273    CA          |
               |---------------------|
           36. | 8303    US     USUS |
               |---------------------|
           37. | 8313    US     USUS |
               |---------------------|
           38. | 8325    IN     INIQ |
           39. | 8325    IN     INUS |
           40. | 8325    IQ     INIQ |
           41. | 8325    IQ     IQUS |
           42. | 8325    US     INUS |
           43. | 8325    US     IQUS |
           44. | 8325    US     USUS |
               |---------------------|
           45. | 8327    DE     DEGB |
           46. | 8327    GB     DEGB |
               |---------------------|
           47. | 8385    GB          |
               |---------------------|
           48. | 8391    CN     CNCN |
           49. | 8391    CN     CNUS |
           50. | 8391    US     CNUS |
           51. | 8391    US     USUS |
               |---------------------|
           52. | 8413    CH          |
               |---------------------|
           53. | 8431    FR          |
               |---------------------|
           54. | 8543    US          |
               |---------------------|
           55. | 8559    US          |
               |---------------------|
           56. | 8695    FI     FINO |
           57. | 8695    NO     FINO |
               |---------------------|
           58. | 8699    US     USUS |
               |---------------------|
           59. | 8717    US          |
               |---------------------|
           60. | 8727    US     USUS |
               |---------------------|
           61. | 8731    US          |
               |---------------------|
           62. | 8739    US          |
               |---------------------|
           63. | 8757    CN     CNCN |
           64. | 8757    CN     CNUS |
           65. | 8757    US     CNUS |
               |---------------------|
           66. | 8759    GB     GBUS |
           67. | 8759    US     GBUS |
           68. | 8759    US     USUS |
               |---------------------|
           69. | 8817    CA     CARU |
           70. | 8817    CA     CAUS |
           71. | 8817    RU     CARU |
           72. | 8817    RU     RUUS |
           73. | 8817    US     CAUS |
           74. | 8817    US     RUUS |
           75. | 8817    US     USUS |
               |---------------------|
           76. | 8821    CA     CAUS |
           77. | 8821    US     CAUS |
           78. | 8821    US     USUS |
               |---------------------|
           79. | 8825    US     USUS |
               +---------------------+
          Last edited by Andrew Musau; 27 Nov 2023, 05:19.

          Comment

          Working...
          X