Hello all,
I am working with a dataset containing an id variable "id" and a country identifier "nat". I would like to expand this dataset to add another variable which includes all possible pairs of nationality by id to be merged with another dataset including cross-country data. I know that this will enlarge the dataset significantly. The new variable would add n observations to each unique id and nat pair which would contain a string of the concatenated country abbreviations.
An example can be considered with the id 8229 for which 4 pairs would be generated "USUS", "USGB", "USIN" and "GBIN". These pairs may also be in whatever order with either country preceding the other. It might look something like this:
id nat pair
8229 "US" "USUS"
8229 "US" "USGB"
8229 "US" "USIN"
8229 "US" "GBIN"
8229 "GB" "USGB"
8229 "GB" "GBIN"
8229 "IN" "USIN"
8229 "IN" "GBIN"
I've given it my best shot using egen and some other approached. I also tried the tuples command, but the dataset contains around a million observations so this approach seems really time-inefficient considering the command takes some time to run.
Thanks in advance for any help.
I am working with a dataset containing an id variable "id" and a country identifier "nat". I would like to expand this dataset to add another variable which includes all possible pairs of nationality by id to be merged with another dataset including cross-country data. I know that this will enlarge the dataset significantly. The new variable would add n observations to each unique id and nat pair which would contain a string of the concatenated country abbreviations.
An example can be considered with the id 8229 for which 4 pairs would be generated "USUS", "USGB", "USIN" and "GBIN". These pairs may also be in whatever order with either country preceding the other. It might look something like this:
id nat pair
8229 "US" "USUS"
8229 "US" "USGB"
8229 "US" "USIN"
8229 "US" "GBIN"
8229 "GB" "USGB"
8229 "GB" "GBIN"
8229 "IN" "USIN"
8229 "IN" "GBIN"
I've given it my best shot using egen and some other approached. I also tried the tuples command, but the dataset contains around a million observations so this approach seems really time-inefficient considering the command takes some time to run.
Thanks in advance for any help.
Code:
* Example generated by -dataex-. For more info, type help dataex clear input long id str4 nat 6757 "US" 6931 "DE" 7011 "US" 7025 "US" 7027 "US" 7031 "US" 7031 "US" 7085 "DE" 7843 "US" 7843 "US" 7843 "US" 7843 "US" 7843 "US" 7843 "US" 7843 "US" 7851 "DZ" 7853 "US" 7877 "US" 8075 "DE" 8085 "FR" 8089 "US" 8135 "DE" 8159 "US" 8171 "US" 8209 "DE" 8225 "US" 8225 "GB" 8225 "US" 8227 "GB" 8227 "US" 8227 "US" 8229 "IN" 8229 "US" 8229 "US" 8229 "US" 8229 "GB" 8249 "US" 8263 "US" 8263 "US" 8263 "US" 8265 "IN" 8273 "CA" 8303 "US" 8303 "US" 8303 "US" 8303 "US" 8313 "US" 8313 "US" 8325 "US" 8325 "IQ" 8325 "US" 8325 "IN" 8325 "US" 8325 "US" 8327 "GB" 8327 "DE" 8385 "GB" 8391 "CN" 8391 "US" 8391 "CN" 8391 "CN" 8391 "US" 8391 "US" 8413 "CH" 8431 "FR" 8543 "US" 8559 "US" 8695 "FI" 8695 "NO" 8699 "US" 8699 "US" 8717 "US" 8727 "US" 8727 "US" 8727 "US" 8727 "US" 8727 "US" 8727 "US" 8727 "US" 8731 "US" 8739 "US" 8757 "CN" 8757 "US" 8757 "CN" 8759 "US" 8759 "US" 8759 "GB" 8817 "US" 8817 "RU" 8817 "CA" 8817 "US" 8817 "US" 8817 "US" 8821 "US" 8821 "CA" 8821 "US" 8825 "US" 8825 "US" 8825 "US" end
Comment