Expand data to included variable containing all possible country pairs

Salvatore Viola

Join Date: Aug 2022

Posts: 20
#1

Expand data to included variable containing all possible country pairs

26 Nov 2023, 09:31

Hello all,

I am working with a dataset containing an id variable "id" and a country identifier "nat". I would like to expand this dataset to add another variable which includes all possible pairs of nationality by id to be merged with another dataset including cross-country data. I know that this will enlarge the dataset significantly. The new variable would add n observations to each unique id and nat pair which would contain a string of the concatenated country abbreviations.

An example can be considered with the id 8229 for which 4 pairs would be generated "USUS", "USGB", "USIN" and "GBIN". These pairs may also be in whatever order with either country preceding the other. It might look something like this:

id nat pair
8229 "US" "USUS"
8229 "US" "USGB"
8229 "US" "USIN"
8229 "US" "GBIN"
8229 "GB" "USGB"
8229 "GB" "GBIN"
8229 "IN" "USIN"
8229 "IN" "GBIN"

I've given it my best shot using egen and some other approached. I also tried the tuples command, but the dataset contains around a million observations so this approach seems really time-inefficient considering the command takes some time to run.

Thanks in advance for any help.

Code:

* Example generated by -dataex-. For more info, type help dataex clear input long id str4 nat 6757 "US" 6931 "DE" 7011 "US" 7025 "US" 7027 "US" 7031 "US" 7031 "US" 7085 "DE" 7843 "US" 7843 "US" 7843 "US" 7843 "US" 7843 "US" 7843 "US" 7843 "US" 7851 "DZ" 7853 "US" 7877 "US" 8075 "DE" 8085 "FR" 8089 "US" 8135 "DE" 8159 "US" 8171 "US" 8209 "DE" 8225 "US" 8225 "GB" 8225 "US" 8227 "GB" 8227 "US" 8227 "US" 8229 "IN" 8229 "US" 8229 "US" 8229 "US" 8229 "GB" 8249 "US" 8263 "US" 8263 "US" 8263 "US" 8265 "IN" 8273 "CA" 8303 "US" 8303 "US" 8303 "US" 8303 "US" 8313 "US" 8313 "US" 8325 "US" 8325 "IQ" 8325 "US" 8325 "IN" 8325 "US" 8325 "US" 8327 "GB" 8327 "DE" 8385 "GB" 8391 "CN" 8391 "US" 8391 "CN" 8391 "CN" 8391 "US" 8391 "US" 8413 "CH" 8431 "FR" 8543 "US" 8559 "US" 8695 "FI" 8695 "NO" 8699 "US" 8699 "US" 8717 "US" 8727 "US" 8727 "US" 8727 "US" 8727 "US" 8727 "US" 8727 "US" 8727 "US" 8731 "US" 8739 "US" 8757 "CN" 8757 "US" 8757 "CN" 8759 "US" 8759 "US" 8759 "GB" 8817 "US" 8817 "RU" 8817 "CA" 8817 "US" 8817 "US" 8817 "US" 8821 "US" 8821 "CA" 8821 "US" 8825 "US" 8825 "US" 8825 "US" end
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30091
#2

26 Nov 2023, 11:06

I don't understand how you arrived as the example results you show:

id nat pair
8229 "US" "USUS"
8229 "US" "USGB"
8229 "US" "USIN"
8229 "US" "GBIN"
8229 "GB" "USGB"
8229 "GB" "GBIN"
8229 "IN" "USIN"
8229 "IN" "GBIN"

On the one hand "US" ends up paired not only with "USUS", "USGB", and "USIN", in which US itself appears, but also with "GBIN" in which it does not. That's fine, but then why does "GB" appear only paired with "USGB" and "GBIN" but not with "USIN" or "USUS" or "IN"IN"? Please clarify the rule deciding which pairs appear with each value of nat.

By the way, let me set some expectations for you. When you have a data set with 1,000,000 observations and you start to generate pairs, the size of the data set is going to blow up. The process of creating this larger data set will involve repeated calls to the operating system for more memory, and then Stata's updating the enlarged data set. This process is inherently slow. There is no magical code that will run this in seconds or minutes unless you have access to a supercomputer. While there are faster and slower ways to do this, you should expect this process to take a long time to complete even with the most efficient code. Be patient: get something else to keep yourself occupied while it runs. Or run it overnight.
1 like
Comment

Andrew Musau

Join Date: Oct 2014
Posts: 10190

26 Nov 2023, 11:26

joinby will give you all pairwise combinations within groups, but as Clyde points out, your description is not clear.

Code:

clear
input long id str4 nat
6757 "US"
6931 "DE"
7011 "US"
7025 "US"
7027 "US"
7031 "US"
7031 "US"
7085 "DE"
7843 "US"
7843 "US"
7843 "US"
7843 "US"
7843 "US"
7843 "US"
7843 "US"
7851 "DZ"
7853 "US"
7877 "US"
8075 "DE"
8085 "FR"
8089 "US"
8135 "DE"
8159 "US"
8171 "US"
8209 "DE"
8225 "US"
8225 "GB"
8225 "US"
8227 "GB"
8227 "US"
8227 "US"
8229 "IN"
8229 "US"
8229 "US"
8229 "US"
8229 "GB"
8249 "US"
8263 "US"
8263 "US"
8263 "US"
8265 "IN"
8273 "CA"
8303 "US"
8303 "US"
8303 "US"
8303 "US"
8313 "US"
8313 "US"
8325 "US"
8325 "IQ"
8325 "US"
8325 "IN"
8325 "US"
8325 "US"
8327 "GB"
8327 "DE"
8385 "GB"
8391 "CN"
8391 "US"
8391 "CN"
8391 "CN"
8391 "US"
8391 "US"
8413 "CH"
8431 "FR"
8543 "US"
8559 "US"
8695 "FI"
8695 "NO"
8699 "US"
8699 "US"
8717 "US"
8727 "US"
8727 "US"
8727 "US"
8727 "US"
8727 "US"
8727 "US"
8727 "US"
8731 "US"
8739 "US"
8757 "CN"
8757 "US"
8757 "CN"
8759 "US"
8759 "US"
8759 "GB"
8817 "US"
8817 "RU"
8817 "CA"
8817 "US"
8817 "US"
8817 "US"
8821 "US"
8821 "CA"
8821 "US"
8825 "US"
8825 "US"
8825 "US"
end

preserve
contract id nat
rename nat nat2
tempfile tojoin
save `tojoin'
restore
joinby id using `tojoin'
bys id (nat): gen wanted=cond(_N==1, "", nat+nat2)
drop nat2 _freq

Res.:

Code:

. l, sepby(id)

     +---------------------+
     |   id   nat   wanted |
     |---------------------|
  1. | 6757    US          |
     |---------------------|
  2. | 6931    DE          |
     |---------------------|
  3. | 7011    US          |
     |---------------------|
  4. | 7025    US          |
     |---------------------|
  5. | 7027    US          |
     |---------------------|
  6. | 7031    US     USUS |
  7. | 7031    US     USUS |
     |---------------------|
  8. | 7085    DE          |
     |---------------------|
  9. | 7843    US     USUS |
 10. | 7843    US     USUS |
 11. | 7843    US     USUS |
 12. | 7843    US     USUS |
 13. | 7843    US     USUS |
 14. | 7843    US     USUS |
 15. | 7843    US     USUS |
     |---------------------|
 16. | 7851    DZ          |
     |---------------------|
 17. | 7853    US          |
     |---------------------|
 18. | 7877    US          |
     |---------------------|
 19. | 8075    DE          |
     |---------------------|
 20. | 8085    FR          |
     |---------------------|
 21. | 8089    US          |
     |---------------------|
 22. | 8135    DE          |
     |---------------------|
 23. | 8159    US          |
     |---------------------|
 24. | 8171    US          |
     |---------------------|
 25. | 8209    DE          |
     |---------------------|
 26. | 8225    GB     GBUS |
 27. | 8225    GB     GBGB |
 28. | 8225    US     USGB |
 29. | 8225    US     USGB |
 30. | 8225    US     USUS |
 31. | 8225    US     USUS |
     |---------------------|
 32. | 8227    GB     GBUS |
 33. | 8227    GB     GBGB |
 34. | 8227    US     USUS |
 35. | 8227    US     USGB |
 36. | 8227    US     USGB |
 37. | 8227    US     USUS |
     |---------------------|
 38. | 8229    GB     GBIN |
 39. | 8229    GB     GBGB |
 40. | 8229    GB     GBUS |
 41. | 8229    IN     ININ |
 42. | 8229    IN     INGB |
 43. | 8229    IN     INUS |
 44. | 8229    US     USUS |
 45. | 8229    US     USUS |
 46. | 8229    US     USGB |
 47. | 8229    US     USGB |
 48. | 8229    US     USUS |
 49. | 8229    US     USIN |
 50. | 8229    US     USIN |
 51. | 8229    US     USGB |
 52. | 8229    US     USIN |
     |---------------------|
 53. | 8249    US          |
     |---------------------|
 54. | 8263    US     USUS |
 55. | 8263    US     USUS |
 56. | 8263    US     USUS |
     |---------------------|
 57. | 8265    IN          |
     |---------------------|
 58. | 8273    CA          |
     |---------------------|
 59. | 8303    US     USUS |
 60. | 8303    US     USUS |
 61. | 8303    US     USUS |
 62. | 8303    US     USUS |
     |---------------------|
 63. | 8313    US     USUS |
 64. | 8313    US     USUS |
     |---------------------|
 65. | 8325    IN     INUS |
 66. | 8325    IN     ININ |
 67. | 8325    IN     INIQ |
 68. | 8325    IQ     IQIN |
 69. | 8325    IQ     IQUS |
 70. | 8325    IQ     IQIQ |
 71. | 8325    US     USIN |
 72. | 8325    US     USIN |
 73. | 8325    US     USIQ |
 74. | 8325    US     USUS |
 75. | 8325    US     USUS |
 76. | 8325    US     USIQ |
 77. | 8325    US     USUS |
 78. | 8325    US     USIQ |
 79. | 8325    US     USIQ |
 80. | 8325    US     USIN |
 81. | 8325    US     USUS |
 82. | 8325    US     USIN |
     |---------------------|
 83. | 8327    DE     DEGB |
 84. | 8327    DE     DEDE |
 85. | 8327    GB     GBGB |
 86. | 8327    GB     GBDE |
     |---------------------|
 87. | 8385    GB          |
     |---------------------|
 88. | 8391    CN     CNCN |
 89. | 8391    CN     CNUS |
 90. | 8391    CN     CNUS |
 91. | 8391    CN     CNUS |
 92. | 8391    CN     CNCN |
 93. | 8391    CN     CNCN |
 94. | 8391    US     USCN |
 95. | 8391    US     USUS |
 96. | 8391    US     USUS |
 97. | 8391    US     USCN |
 98. | 8391    US     USCN |
 99. | 8391    US     USUS |
     |---------------------|
100. | 8413    CH          |
     |---------------------|
101. | 8431    FR          |
     |---------------------|
102. | 8543    US          |
     |---------------------|
103. | 8559    US          |
     |---------------------|
104. | 8695    FI     FINO |
105. | 8695    FI     FIFI |
106. | 8695    NO     NOFI |
107. | 8695    NO     NONO |
     |---------------------|
108. | 8699    US     USUS |
109. | 8699    US     USUS |
     |---------------------|
110. | 8717    US          |
     |---------------------|
111. | 8727    US     USUS |
112. | 8727    US     USUS |
113. | 8727    US     USUS |
114. | 8727    US     USUS |
115. | 8727    US     USUS |
116. | 8727    US     USUS |
117. | 8727    US     USUS |
     |---------------------|
118. | 8731    US          |
     |---------------------|
119. | 8739    US          |
     |---------------------|
120. | 8757    CN     CNCN |
121. | 8757    CN     CNCN |
122. | 8757    CN     CNUS |
123. | 8757    CN     CNUS |
124. | 8757    US     USUS |
125. | 8757    US     USCN |
     |---------------------|
126. | 8759    GB     GBUS |
127. | 8759    GB     GBGB |
128. | 8759    US     USGB |
129. | 8759    US     USUS |
130. | 8759    US     USUS |
131. | 8759    US     USGB |
     |---------------------|
132. | 8817    CA     CAUS |
133. | 8817    CA     CARU |
134. | 8817    CA     CACA |
135. | 8817    RU     RUCA |
136. | 8817    RU     RUUS |
137. | 8817    RU     RURU |
138. | 8817    US     USUS |
139. | 8817    US     USCA |
140. | 8817    US     USUS |
141. | 8817    US     USCA |
142. | 8817    US     USRU |
143. | 8817    US     USUS |
144. | 8817    US     USRU |
145. | 8817    US     USRU |
146. | 8817    US     USUS |
147. | 8817    US     USCA |
148. | 8817    US     USCA |
149. | 8817    US     USRU |
     |---------------------|
150. | 8821    CA     CAUS |
151. | 8821    CA     CACA |
152. | 8821    US     USUS |
153. | 8821    US     USCA |
154. | 8821    US     USCA |
155. | 8821    US     USUS |
     |---------------------|
156. | 8825    US     USUS |
157. | 8825    US     USUS |
158. | 8825    US     USUS |
     +---------------------+

Last edited by Andrew Musau; 26 Nov 2023, 11:29.

Comment

Salvatore Viola

Join Date: Aug 2022

Posts: 20
#4

27 Nov 2023, 04:04

Dear Clyde and Andrew,

Thank you for bringing that error to my attention and for the potential solution. The corrected example would be:

id nat pair
8229 "US" "USUS"
8229 "US" "USGB"
8229 "US" "USIN"
8229 "GB" "USGB"
8229 "GB" "GBIN"
8229 "IN" "USIN"
8229 "IN" "GBIN"

That is, the associated pair should always contain the nat variable itself. A same country pair should only occur when there are two or more individuals of the same country within the same id. So, in the solution suggested by Andrew, the id 8229 should not contain the wanted pairs "GBGB" or "ININ" since there was only one individual from each of those countries.
Comment

Andrew Musau

Join Date: Oct 2014
Posts: 10190

27 Nov 2023, 04:28

EDITED: This will give you distinct pairs, which appears to be what you want.

Code:

clear
input long id str4 nat
6757 "US"
6931 "DE"
7011 "US"
7025 "US"
7027 "US"
7031 "US"
7031 "US"
7085 "DE"
7843 "US"
7843 "US"
7843 "US"
7843 "US"
7843 "US"
7843 "US"
7843 "US"
7851 "DZ"
7853 "US"
7877 "US"
8075 "DE"
8085 "FR"
8089 "US"
8135 "DE"
8159 "US"
8171 "US"
8209 "DE"
8225 "US"
8225 "GB"
8225 "US"
8227 "GB"
8227 "US"
8227 "US"
8229 "IN"
8229 "US"
8229 "US"
8229 "US"
8229 "GB"
8249 "US"
8263 "US"
8263 "US"
8263 "US"
8265 "IN"
8273 "CA"
8303 "US"
8303 "US"
8303 "US"
8303 "US"
8313 "US"
8313 "US"
8325 "US"
8325 "IQ"
8325 "US"
8325 "IN"
8325 "US"
8325 "US"
8327 "GB"
8327 "DE"
8385 "GB"
8391 "CN"
8391 "US"
8391 "CN"
8391 "CN"
8391 "US"
8391 "US"
8413 "CH"
8431 "FR"
8543 "US"
8559 "US"
8695 "FI"
8695 "NO"
8699 "US"
8699 "US"
8717 "US"
8727 "US"
8727 "US"
8727 "US"
8727 "US"
8727 "US"
8727 "US"
8727 "US"
8731 "US"
8739 "US"
8757 "CN"
8757 "US"
8757 "CN"
8759 "US"
8759 "US"
8759 "GB"
8817 "US"
8817 "RU"
8817 "CA"
8817 "US"
8817 "US"
8817 "US"
8821 "US"
8821 "CA"
8821 "US"
8825 "US"
8825 "US"
8825 "US"
end

preserve
contract id nat
rename nat nat2
tempfile tojoin
save `tojoin'
restore
joinby id using `tojoin'
bys id (nat): gen wanted=cond(_N==1, "", cond(nat<nat2, nat, nat2)  + cond(nat<nat2, nat2, nat))
replace wanted="" if nat==nat2 & _freq<2
contract id nat wanted, freq(count)
drop count
bys id: drop if missing(wanted) & _N>1

Res.:

Code:


. l, sepby(id)

     +---------------------+
     |   id   nat   wanted |
     |---------------------|
  1. | 6757    US          |
     |---------------------|
  2. | 6931    DE          |
     |---------------------|
  3. | 7011    US          |
     |---------------------|
  4. | 7025    US          |
     |---------------------|
  5. | 7027    US          |
     |---------------------|
  6. | 7031    US     USUS |
     |---------------------|
  7. | 7085    DE          |
     |---------------------|
  8. | 7843    US     USUS |
     |---------------------|
  9. | 7851    DZ          |
     |---------------------|
 10. | 7853    US          |
     |---------------------|
 11. | 7877    US          |
     |---------------------|
 12. | 8075    DE          |
     |---------------------|
 13. | 8085    FR          |
     |---------------------|
 14. | 8089    US          |
     |---------------------|
 15. | 8135    DE          |
     |---------------------|
 16. | 8159    US          |
     |---------------------|
 17. | 8171    US          |
     |---------------------|
 18. | 8209    DE          |
     |---------------------|
 19. | 8225    GB     GBUS |
 20. | 8225    US     GBUS |
 21. | 8225    US     USUS |
     |---------------------|
 22. | 8227    GB     GBUS |
 23. | 8227    US     GBUS |
 24. | 8227    US     USUS |
     |---------------------|
 25. | 8229    GB     GBIN |
 26. | 8229    GB     GBUS |
 27. | 8229    IN     GBIN |
 28. | 8229    IN     INUS |
 29. | 8229    US     GBUS |
 30. | 8229    US     INUS |
 31. | 8229    US     USUS |
     |---------------------|
 32. | 8249    US          |
     |---------------------|
 33. | 8263    US     USUS |
     |---------------------|
 34. | 8265    IN          |
     |---------------------|
 35. | 8273    CA          |
     |---------------------|
 36. | 8303    US     USUS |
     |---------------------|
 37. | 8313    US     USUS |
     |---------------------|
 38. | 8325    IN     INIQ |
 39. | 8325    IN     INUS |
 40. | 8325    IQ     INIQ |
 41. | 8325    IQ     IQUS |
 42. | 8325    US     INUS |
 43. | 8325    US     IQUS |
 44. | 8325    US     USUS |
     |---------------------|
 45. | 8327    DE     DEGB |
 46. | 8327    GB     DEGB |
     |---------------------|
 47. | 8385    GB          |
     |---------------------|
 48. | 8391    CN     CNCN |
 49. | 8391    CN     CNUS |
 50. | 8391    US     CNUS |
 51. | 8391    US     USUS |
     |---------------------|
 52. | 8413    CH          |
     |---------------------|
 53. | 8431    FR          |
     |---------------------|
 54. | 8543    US          |
     |---------------------|
 55. | 8559    US          |
     |---------------------|
 56. | 8695    FI     FINO |
 57. | 8695    NO     FINO |
     |---------------------|
 58. | 8699    US     USUS |
     |---------------------|
 59. | 8717    US          |
     |---------------------|
 60. | 8727    US     USUS |
     |---------------------|
 61. | 8731    US          |
     |---------------------|
 62. | 8739    US          |
     |---------------------|
 63. | 8757    CN     CNCN |
 64. | 8757    CN     CNUS |
 65. | 8757    US     CNUS |
     |---------------------|
 66. | 8759    GB     GBUS |
 67. | 8759    US     GBUS |
 68. | 8759    US     USUS |
     |---------------------|
 69. | 8817    CA     CARU |
 70. | 8817    CA     CAUS |
 71. | 8817    RU     CARU |
 72. | 8817    RU     RUUS |
 73. | 8817    US     CAUS |
 74. | 8817    US     RUUS |
 75. | 8817    US     USUS |
     |---------------------|
 76. | 8821    CA     CAUS |
 77. | 8821    US     CAUS |
 78. | 8821    US     USUS |
     |---------------------|
 79. | 8825    US     USUS |
     +---------------------+

Last edited by Andrew Musau; 27 Nov 2023, 05:19.

Announcement

Expand data to included variable containing all possible country pairs

Comment

Comment

Comment

Comment