Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Create Unique Values/IDs for a "Region" Variable in a Multi-Country Dataset

    Hello, I am working on a multi-country dataset, with a consecutive numerical code representing the regions in each country. However, each country is represented by a unique non-consecutive code, as the table below shows.

    I am trying to create a pooled dataset of about 27 countries. I want a way of recreating the region code in such a way that it preserves the uniqueness of the region codes in the appended dataset. Given that each country has a unique code and label, I used the "egen group" function with the labels option to ensure that the labels are preserved. The result - in column 5 - is a string variable that I cannot use in my multilevel models to account for regional variations. Moreover, when When I append the countries to each other, the regional codes get messed up, as column 6 in my example shows.

    One solution could be to combine the unique "country code" with the "region code", as shown in the last column. I believe this would be a unique indicator for the regions within each country. If so, how do I create this last column? Here is my multi-step process which gets me part of the way:

    Code:
    * 1. group original country code and region together - numeric variable
    egen region3 = group(v000n v024), label
    fre region3
    
    * 2. Concat 3 letter country code and region, preserve value labels
    egen region4 = concat(v000 v024), decode
    fre region4
    
    * 3. convert the numeric country (v000n) and region(region3n) to string
    tostring v000n region3, gen(v000ns region3s)
    
    * 4. Create a newstring variable that combines both country number and region
    gen region8 = v000ns + region3s
    fre region8
    
    * 5. Convert newly created string variable to numeric. This is unique. 
    destring region8, gen(region9)
    fre region9
    Though tortuous, I still need to:
    (a) figure out a way to attach value labels to the newly generated region8 variable and attach value labels from the region variable (v024 or region4) to it, and (b) generate the new region code (with its value labels) and execute it for each of the 27 countries (the variable names are the same from country to country).

    I am very grateful for some assistance with these tasks. A sample data is provided below.

    Thanks very much in advance - cY


    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str3 v000 byte v024 float v000n str1 v000ns float region3 str2 region3s str17 region4 str3 region8 int region9
    "AO7" 9 1 "1" 9 "9" "AO7benguela" "19" 19
    "AO7" 9 1 "1" 9 "9" "AO7benguela" "19" 19
    "AO7" 9 1 "1" 9 "9" "AO7benguela" "19" 19
    "AO7" 9 1 "1" 9 "9" "AO7benguela" "19" 19
    "AO7" 9 1 "1" 9 "9" "AO7benguela" "19" 19
    end
    label values v024 V024
    label def V024 9 "benguela", modify
    label values v000n v000_1
    label def v000_1 1 "AO7", modify
    label values region3 region3
    label def region3 9 "AO7 benguela", modify








    Country Name CountryCode RegionCode RegionLabel Region from Egen group Incorrect Appended Data: Code for Region What I Desire /Want
    BurkinaFasso 6 1 BF6 boucle de mouhoun 1 BF6 boucle de mouhoun 1 BF6 boucle de mouhoun 61
    BurkinaFasso 6 2 BF6 cascades 2 BF6 cascades 2 BF6 cascades 62
    BurkinaFasso 6 3 BF6 centre 3 BF6 centre 3 BF6 centre 63
    BurkinaFasso 6 4 BF6 centre-est 4 BF6 centre-est 4 BF6 centre-est 64
    BurkinaFasso 6 5 BF6 centre-nord 5 BF6 centre-nord 5 BF6 centre-nord 65
    BurkinaFasso 6 6 BF6 centre-ouest 6 BF6 centre-ouest 6 BF6 centre-ouest 66
    BurkinaFasso 6 7 BF6 centre-sud 7 BF6 centre-sud 7 BF6 centre-sud 67
    BurkinaFasso 6 8 BF6 est 8 BF6 est 8 BF6 est 68
    BurkinaFasso 6 9 BF6 hauts basins 9 BF6 hauts basins 9 BF6 hauts basins 69
    BurkinaFasso 6 10 BF6 nord 10 BF6 nord 10 BF6 nord 610
    BurkinaFasso 6 11 BF6 plateau central 11 BF6 plateau central 11 BF6 plateau central 611
    BurkinaFasso 6 12 BF6 sahel 12 BF6 sahel 12 BF6 sahel 612
    BurkinaFasso 6 13 BF6 sud-ouest 13 BF6 sud-ouest 13 BF6 sud-ouest 613
    14
    15
    Angola 9 1 AO7 cabinda 1 AO7 cabinda 16 91
    Angola 9 2 AO7 zaire 2 AO7 zaire 17 92
    Angola 9 3 AO7 u�ge 3 AO7 u�ge 18 93
    Angola 9 4 AO7 luanda 4 AO7 luanda Total 94
    Angola 9 5 5O7 cuanza norte 5 AO7 cuanza norte 95
    Angola 9 6 AO7 cuanza sul 6 AO7 cuanza sul 96
    Angola 9 7 AO7 malanje 7 AO7 malanje 97
    Angola 9 8 AO7 lunda norte 8 AO7 lunda norte 98
    Angola 9 9 AO7 benguela 9 AO7 benguela 99
    Angola 9 10 AO7 huambo 10 AO7 huambo 910
    Angola 9 11 AO7 bi� 11 AO7 bi� 911
    Angola 9 12 AO7 moxico 12 AO7 moxico 912
    Angola 9 13 AO7 cuando cubango 13 AO7 cuando cubango 913
    Angola 9 14 AO7 namibe 14 AO7 namibe 914
    Angola 9 15 AO7 hu�la 15 AO7 hu�la 915
    Angola 9 16 AO7 cunene 16 AO7 cunene 916
    Angola 9 17 AO7 lunda sul 17 AO7 lunda sul 917
    Angola 9 18 AO7 bengo 18 AO7 bengo 918

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str3 v000 byte v024 float v000n str1 v000ns float region3 str2 region3s str17 region4 str3 region8 int region9
    "AO7" 9 1 "1" 9 "9" "AO7benguela" "19" 19
    "AO7" 9 1 "1" 9 "9" "AO7benguela" "19" 19
    "AO7" 9 1 "1" 9 "9" "AO7benguela" "19" 19
    "AO7" 9 1 "1" 9 "9" "AO7benguela" "19" 19
    "AO7" 9 1 "1" 9 "9" "AO7benguela" "19" 19
    end
    label values v024 V024
    label def V024 9 "benguela", modify
    label values v000n v000_1
    label def v000_1 1 "AO7", modify
    label values region3 region3
    label def region3 9 "AO7 benguela", modify
    Last edited by Yawo Kokuvi; 16 Aug 2023, 23:17. Reason: Provided dataex example, and partially successful code

  • #2
    I do not follow this. If you have the country names/ country codes and the region names in the combined dataset, doesn't the following give you a unique region identifier?

    Code:
    egen wanted= group(country region), label

    Comment


    • #3
      I can't follow this easily either, but egen, group() can't know about any other application of the same code. So, the time to create a composite identifier is after any append, not before.

      Comment


      • #4
        I am sorry for the lack of clarity. Let me summarize the problem again. I have a 27 country dataset, each of which has the following:

        v000: a 3-letter country code
        v024: a numeric variable that identifies regions within each country, the number of which varies from country to country

        When I append all these 27 datasets, the new v024 gets messed up, and doesn't retain the unique value labels of each country after the append.

        I want a way of recreating the region code in such a way that it preserves the uniqueness of the region codes in the appended dataset. The variable needs to be numeric so I can use it for modeling. After reviewing the forum overnight, I understand that this is a typical behavior of some variables after append and that creating any composite variable has to have append before, and not after the append. So, here is a code I am working with now to try to resolve the issue: the main goal is to create a string version of both the country and region variable codes, append them, then after appending encode them back to numeric variable, and apply any sort of value labels I have:

        Code:
        A: FOR EACH COUNTRY DATASET, DO THE FOLLOWING, AND SAVE AS NEW DATASET
        
        * 1. Create a numeric version of v000
        gen v000n=2 /* remember to change this value for each country */
        label values v000n v000_1 
        
        * 2. Convert v000n to string
        tostring v000n, gen(v000s)
        
        * 3. Convert v024 to string
        tostring v024, gen(v024s)
        
        * 4. Concat 3 letter country code and region, preserve value labels
        egen countryregion = concat(v000 v024), decode
        fre countryregion
        
        * 5. Create new region variable by merging country number and region number
        egen region_new_string = concat(v000s v024s)
        fre region_new_string
        
        
        B: REPEAT 1 TO 5 FOR EACH OF THE OTHER 26 COUNTRIES: a loop will help here
        
        
        C: APPEND ALL DATASETS
        
        * 6. Visually check to ensure that relevant variables (countryregion and region_new_string)
        have country codes/identifies as their first digit (so that those values are consistent within each country)
        
        * 7.  After append, encode the region_new_string to become numeric
        encode region_new_string, gen(region_new)
        
        * 8.  Use labmask utility to associate values from countryregion labels to region_new_string
        labmask region_new, values(countryregion)
        fre region_new
        So, this is where I've got to. It has worked for 2 countries so far. I am looking for some assistance re the sanity of these codes, and some help with looping through these changes over the 27 countries.

        Thanks again, and with much appreciation, cY

        Comment


        • #5
          In the appended dataset, do the following:
          Code:
          label drop v024
          * check the duplicates of v000 and v024
          duplicates report v000 v024 // should not have duplicates
          
          * generate wanted region id
          gen region_new_string = v000 + strofreal(v024)
          encode region_new_string, gen(region_new)
          Hope this help

          Comment

          Working...
          X