  • Why is encoding a string ID variable generating weird value labels?

    Please consider the following example data:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte shdist str4 districtid str27 District long districtid_num
    19 "0219" "ADILABAD"    19
    12 "0612" "AHMADABAD"   98
    12 "0612" "AHMADABAD"   98
    12 "0612" "AHMADABAD"   98
    12 "0612" "AHMADABAD"   98
    12 "0612" "AHMADABAD"   98
    12 "0612" "AHMADABAD"   98
    12 "0612" "AHMADABAD"   98
    12 "0612" "AHMADABAD"   98
    12 "0612" "AHMADABAD"   98
    12 "0612" "AHMADABAD"   98
    12 "0612" "AHMADABAD"   98
    12 "0612" "AHMADABAD"   98
    12 "0612" "AHMADABAD"   98
    12 "0612" "AHMADABAD"   98
    12 "0612" "AHMADABAD"   98
     9 "1309" "AHMADNAGAR" 228
     9 "1309" "AHMADNAGAR" 228
     9 "1309" "AHMADNAGAR" 228
     9 "1309" "AHMADNAGAR" 228
     9 "1309" "AHMADNAGAR" 228
     9 "1309" "AHMADNAGAR" 228
     9 "1309" "AHMADNAGAR" 228
     1 "1601" "AIZAWL"     261
     1 "1601" "AIZAWL"     261
     1 "1601" "AIZAWL"     261
     1 "1601" "AIZAWL"     261
     1 "1601" "AIZAWL"     261
     1 "1601" "AIZAWL"     261
     1 "1601" "AIZAWL"     261
     1 "1601" "AIZAWL"     261
     1 "1601" "AIZAWL"     261
     1 "1601" "AIZAWL"     261
     1 "1601" "AIZAWL"     261
     1 "1601" "AIZAWL"     261
     1 "1601" "AIZAWL"     261
     1 "1601" "AIZAWL"     261
     1 "1601" "AIZAWL"     261
     1 "1601" "AIZAWL"     261
     1 "1601" "AIZAWL"     261
     1 "1601" "AIZAWL"     261
     1 "1601" "AIZAWL"     261
     1 "1601" "AIZAWL"     261
     1 "1601" "AIZAWL"     261
     1 "1601" "AIZAWL"     261
     1 "1601" "AIZAWL"     261
     1 "1601" "AIZAWL"     261
     1 "1601" "AIZAWL"     261
     1 "1601" "AIZAWL"     261
     1 "1601" "AIZAWL"     261
     1 "1601" "AIZAWL"     261
     1 "1601" "AIZAWL"     261
     1 "1601" "AIZAWL"     261
     1 "1601" "AIZAWL"     261
     1 "1601" "AIZAWL"     261
     1 "1601" "AIZAWL"     261
     1 "1601" "AIZAWL"     261
     1 "1601" "AIZAWL"     261
     1 "1601" "AIZAWL"     261
     1 "1601" "AIZAWL"     261
     1 "1601" "AIZAWL"     261
     1 "1601" "AIZAWL"     261
     1 "1601" "AIZAWL"     261
    11 "2011" "AJMER"      306
    11 "2011" "AJMER"      306
    11 "2011" "AJMER"      306
    11 "2011" "AJMER"      306
    11 "2011" "AJMER"      306
    11 "2011" "AJMER"      306
    11 "2011" "AJMER"      306
    11 "2011" "AJMER"      306
    23 "1323" "AKOLA"      241
    23 "1323" "AKOLA"      241
    23 "1323" "AKOLA"      241
     5 "2005" "ALWAR"      300
     5 "2005" "ALWAR"      300
     5 "2005" "ALWAR"      300
     5 "2005" "ALWAR"      300
     5 "2005" "ALWAR"      300
     5 "2005" "ALWAR"      300
     5 "2005" "ALWAR"      300
     5 "2005" "ALWAR"      300
     5 "2005" "ALWAR"      300
     5 "2005" "ALWAR"      300
     5 "2005" "ALWAR"      300
     5 "2005" "ALWAR"      300
     5 "2005" "ALWAR"      300
     5 "2005" "ALWAR"      300
     1 "0701" "AMBALA"     106
     1 "0701" "AMBALA"     106
     1 "0701" "AMBALA"     106
     1 "0701" "AMBALA"     106
     1 "0701" "AMBALA"     106
     1 "0701" "AMBALA"     106
     1 "0701" "AMBALA"     106
    24 "1324" "AMRAVATI"   242
    24 "1324" "AMRAVATI"   242
    24 "1324" "AMRAVATI"   242
     5 "0605" "AMRELI"      91
     5 "0605" "AMRELI"      91
    end
    label values districtid_num districtid_num
    label def districtid_num 19 "0219", modify
    label def districtid_num 91 "0605", modify
    label def districtid_num 98 "0612", modify
    label def districtid_num 106 "0701", modify
    label def districtid_num 228 "1309", modify
    label def districtid_num 241 "1323", modify
    label def districtid_num 242 "1324", modify
    label def districtid_num 261 "1601", modify
    label def districtid_num 300 "2005", modify
    label def districtid_num 306 "2011", modify
    districtid is a string, which I have encoded to generate districtid_num
    Code:
    encode districtid,gen(districtid_num)
    But this is assigning value labels to districtid_num. Is there any way to encode that doesn't generate labels?

    I also tried
    Code:
     destring districtid,gen(districtid_num_2)
    but that is generating integers excluding the leading 0.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte shdist str4 districtid int districtid_num_2 str27 District long districtid_num
    19 "0219"  219 "ADILABAD"    19
    12 "0612"  612 "AHMADABAD"   98
    12 "0612"  612 "AHMADABAD"   98
    12 "0612"  612 "AHMADABAD"   98
    12 "0612"  612 "AHMADABAD"   98
    12 "0612"  612 "AHMADABAD"   98
    12 "0612"  612 "AHMADABAD"   98
    12 "0612"  612 "AHMADABAD"   98
    12 "0612"  612 "AHMADABAD"   98
    12 "0612"  612 "AHMADABAD"   98
    12 "0612"  612 "AHMADABAD"   98
    12 "0612"  612 "AHMADABAD"   98
    12 "0612"  612 "AHMADABAD"   98
    12 "0612"  612 "AHMADABAD"   98
    12 "0612"  612 "AHMADABAD"   98
    12 "0612"  612 "AHMADABAD"   98
     9 "1309" 1309 "AHMADNAGAR" 228
     9 "1309" 1309 "AHMADNAGAR" 228
     9 "1309" 1309 "AHMADNAGAR" 228
     9 "1309" 1309 "AHMADNAGAR" 228
     9 "1309" 1309 "AHMADNAGAR" 228
     9 "1309" 1309 "AHMADNAGAR" 228
     9 "1309" 1309 "AHMADNAGAR" 228
     1 "1601" 1601 "AIZAWL"     261
     1 "1601" 1601 "AIZAWL"     261
     1 "1601" 1601 "AIZAWL"     261
     1 "1601" 1601 "AIZAWL"     261
     1 "1601" 1601 "AIZAWL"     261
     1 "1601" 1601 "AIZAWL"     261
     1 "1601" 1601 "AIZAWL"     261
     1 "1601" 1601 "AIZAWL"     261
     1 "1601" 1601 "AIZAWL"     261
     1 "1601" 1601 "AIZAWL"     261
     1 "1601" 1601 "AIZAWL"     261
     1 "1601" 1601 "AIZAWL"     261
     1 "1601" 1601 "AIZAWL"     261
     1 "1601" 1601 "AIZAWL"     261
     1 "1601" 1601 "AIZAWL"     261
     1 "1601" 1601 "AIZAWL"     261
     1 "1601" 1601 "AIZAWL"     261
     1 "1601" 1601 "AIZAWL"     261
     1 "1601" 1601 "AIZAWL"     261
     1 "1601" 1601 "AIZAWL"     261
     1 "1601" 1601 "AIZAWL"     261
     1 "1601" 1601 "AIZAWL"     261
     1 "1601" 1601 "AIZAWL"     261
     1 "1601" 1601 "AIZAWL"     261
     1 "1601" 1601 "AIZAWL"     261
     1 "1601" 1601 "AIZAWL"     261
     1 "1601" 1601 "AIZAWL"     261
     1 "1601" 1601 "AIZAWL"     261
     1 "1601" 1601 "AIZAWL"     261
     1 "1601" 1601 "AIZAWL"     261
     1 "1601" 1601 "AIZAWL"     261
     1 "1601" 1601 "AIZAWL"     261
     1 "1601" 1601 "AIZAWL"     261
     1 "1601" 1601 "AIZAWL"     261
     1 "1601" 1601 "AIZAWL"     261
     1 "1601" 1601 "AIZAWL"     261
     1 "1601" 1601 "AIZAWL"     261
     1 "1601" 1601 "AIZAWL"     261
     1 "1601" 1601 "AIZAWL"     261
     1 "1601" 1601 "AIZAWL"     261
    11 "2011" 2011 "AJMER"      306
    11 "2011" 2011 "AJMER"      306
    11 "2011" 2011 "AJMER"      306
    11 "2011" 2011 "AJMER"      306
    11 "2011" 2011 "AJMER"      306
    11 "2011" 2011 "AJMER"      306
    11 "2011" 2011 "AJMER"      306
    11 "2011" 2011 "AJMER"      306
    23 "1323" 1323 "AKOLA"      241
    23 "1323" 1323 "AKOLA"      241
    23 "1323" 1323 "AKOLA"      241
     5 "2005" 2005 "ALWAR"      300
     5 "2005" 2005 "ALWAR"      300
     5 "2005" 2005 "ALWAR"      300
     5 "2005" 2005 "ALWAR"      300
     5 "2005" 2005 "ALWAR"      300
     5 "2005" 2005 "ALWAR"      300
     5 "2005" 2005 "ALWAR"      300
     5 "2005" 2005 "ALWAR"      300
     5 "2005" 2005 "ALWAR"      300
     5 "2005" 2005 "ALWAR"      300
     5 "2005" 2005 "ALWAR"      300
     5 "2005" 2005 "ALWAR"      300
     5 "2005" 2005 "ALWAR"      300
     5 "2005" 2005 "ALWAR"      300
     1 "0701"  701 "AMBALA"     106
     1 "0701"  701 "AMBALA"     106
     1 "0701"  701 "AMBALA"     106
     1 "0701"  701 "AMBALA"     106
     1 "0701"  701 "AMBALA"     106
     1 "0701"  701 "AMBALA"     106
     1 "0701"  701 "AMBALA"     106
    24 "1324" 1324 "AMRAVATI"   242
    24 "1324" 1324 "AMRAVATI"   242
    24 "1324" 1324 "AMRAVATI"   242
     5 "0605"  605 "AMRELI"      91
     5 "0605"  605 "AMRELI"      91
    end
    label values districtid_num districtid_num
    label def districtid_num 19 "0219", modify
    label def districtid_num 91 "0605", modify
    label def districtid_num 98 "0612", modify
    label def districtid_num 106 "0701", modify
    label def districtid_num 228 "1309", modify
    label def districtid_num 241 "1323", modify
    label def districtid_num 242 "1324", modify
    label def districtid_num 261 "1601", modify
    label def districtid_num 300 "2005", modify
    label def districtid_num 306 "2011", modify


    It is important that I generate the numerical districtid with 4 characters and without labels.

    Would appreciate any help.

    Thanks

  • #2
    The whole purpose of -encode- is to create a variable that is consecutive integers with value labels attached. If you prefer not to see the labels, you can detach them with -label values districtid_num-. But I don't see why this matters for anything you will do with this variable.

    Actually, I don't understand why you want to do anything to districtid at all. It is an identifier variable, so you will not be performing any calculations with it. And the leading zeroes are important to you (and perhaps meaningful in distinguishing identified entities). So I suggest you just leave it as a string variable.

    Perhaps I am missing something relevant in your context, so, if you wish, you can
    Code:
    destring districtid, gen(n_district_id)
    format n_district_id %04.0f
    This will give you a numeric variable that is display-formatted to show leading zeroes. The actual internal value will, however, just be a numeric variable that is no different from the variable districtid_num_2 that you show because from the perspective of numeric operations, leading zeroes are meaningless. So you can do this, if you want, but I don't see what you gain from it, and would be interested to learn why you would want this if, in fact, it is what you want.

    Added: Perhaps you want a numeric variable so that you can do something like -xtset district_id-, or perhaps use district_id as a level in a multi-level model. That would make sense, but if that is your purpose, then what -encode- does is perfectly satisfactory and there is no need to worry about the labels, or what they look like, or whether leading zeroes are shown: the internal numbers will actually be 1, 2, 3,... and they will serve this purpose just fine. Stata never really cares about what the value labels are: they are just a convenience for displaying the variable in a way that a human eye and brain can more easily understand. Any computations done with value-labeled variables are handled in exactly the same way as they would be if the value labels were not there. Value labels are purely cosmetic and have no functional consequences in Stata.
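To make the point about labels being cosmetic concrete, here is a minimal sketch on a four-row toy dataset (the rows are taken from the example data, but the dataset itself is an assumption for illustration). With only three distinct codes, -encode- stores 1, 2, 3; in the full data the same rule produces values such as 19 and 98.

```stata
* Toy data: three distinct district codes, one of them repeated
clear
input str4 districtid str27 District
"0219" "ADILABAD"
"0612" "AHMADABAD"
"0612" "AHMADABAD"
"1309" "AHMADNAGAR"
end

* -encode- stores consecutive integers (in sort order of the strings)
* and attaches a value label that shows the original text
encode districtid, gen(districtid_num)
list districtid districtid_num            // shows the labels
list districtid districtid_num, nolabel   // shows the stored integers

* Detach the value label; the stored integers are not changed
label values districtid_num
list districtid districtid_num
```

After the label is detached, both listings show the same integers, which is all that commands such as -xtset- ever see.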
    Last edited by Clyde Schechter; 26 Aug 2022, 12:28.



    • #3
      Originally posted by Clyde Schechter
      I don't understand why you want to do anything to this variable at all. It is an identifier variable, so you will not be performing any calculations with it. And the leading zeroes are important to you (and perhaps meaningful in distinguishing identified entities). So I suggest you just leave it as a string variable.

      Perhaps I am missing something relevant in your context, so, if you wish, you can
      Code:
      destring districtid, gen(n_district_id)
      format n_district_id %04.0f
      This will give you a numeric variable that is display-formatted to show leading zeroes. The actual internal value will, however, just be a numeric variable that is no different from the variable districtid_num_2 that you show because from the perspective of numeric operations, leading zeroes are meaningless. So you can do this, if you want, but I don't see what you gain from it, and would be interested to learn why you would want this if, in fact, it is what you want.
      Thanks Clyde for the solution. I want numerical district id because I want to use it in a further operation with labmask
      Code:
      labmask n_district_id ,values(District)
      Which would attach the district ids to corresponding district names.

      But as you said, the internal value excludes leading zeroes, so the value attached to each district name also lacks the leading zero. For example, "ADILABAD" is attached to the value 219, but I would ideally want it attached to 0219. Is there any way to achieve that?



      • #4
        I am not very familiar with -labmask-, but I don't see why
        Code:
        egen n_district_id = group(districtid)
        labmask n_district_id, values(District)
        wouldn't work. The -egen, group()- command does something similar to -encode-, but it does not, by default, add value labels to the result. Not that that really matters: -labmask- would just overwrite the value labels that -encode- would create anyway. But it does seem wasteful to have -encode- create labels that you are just going to discard anyway.

        Of course, that still does not leave you with any variable that has a value label with a leading zero, but it is still unclear to me what use any such variable would have.

        And, in fact, why do you need to resort to -labmask- anyway? Why not just
        Code:
        encode District, gen(n_district_id)
        to get the same result in a single command.
        Last edited by Clyde Schechter; 26 Aug 2022, 12:55.



        • #5
          Titir Bhattacharya I am also confused here about what you are trying to achieve overall.

          I understand that we often want to treat identifiers as sacrosanct (we may want to later merge them with other datasets that use the same identifiers, for instance), so you may not want to generate arbitrary numbers, which is one problem with both the solutions offered in #4. If I am not mistaken, you already have such arbitrary numbers sitting in districtid_num.

          But if you want a number that is pretty much the original number that was coded as a string, it obviously cannot have a leading zero. It can visually be made to look like it has leading zeros, with the appropriate display format. This was already suggested in #2.
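A minimal sketch of that distinction, on assumed toy data: the display format changes only how the number is shown, and the padded string can always be rebuilt from the number. The variable name districtid_back is hypothetical.

```stata
* Toy data: three codes from the example, two with a leading zero
clear
input str4 districtid
"0219"
"0612"
"1309"
end

* Numeric copy; the stored values are 219, 612, 1309
destring districtid, gen(n_district_id)

* Display with leading zeroes; the stored values are unchanged
format n_district_id %04.0f
list

* Rebuild the zero-padded string from the number when it is needed
gen str4 districtid_back = string(n_district_id, "%04.0f")
assert districtid_back == districtid
```

So nothing is lost by storing the number without the zero, as long as every code is known to be four characters wide.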

          An intermediate solution between this and having arbitrary numbers might be to add 1000 to the number. Given that India has fewer than 1000 districts, this should still give you only four-digit numeric codes. However, you seem to have district codes in the thousands, so you will need to check whether anything is coded at 9000 or above, because adding 1000 would then take it into five-figure territory.

          I also don't follow why you are creating the label districtid_num and then changing around its values, and then doing this entire route. I would probably:
          1. create a copy of the original districtid_num to be safe (or just work directly with the variable)
          2. use the -replace- or -recode- command to change numbers as necessary
          3. use -labmask- to give this new variable a value label based on District
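The three steps might be sketched as follows. This is only an illustration under stated assumptions: the toy rows come from the example data, the numeric copy is made with -destring- from districtid rather than from districtid_num, and the name district_code is hypothetical. -labmask- is community-contributed; it is part of the labutil package on SSC.

```stata
* Requires -labmask-: ssc install labutil
clear
input str4 districtid str27 District
"0219" "ADILABAD"
"0612" "AHMADABAD"
"0612" "AHMADABAD"
"0701" "AMBALA"
end

* Step 1: work on a numeric copy so the original string stays untouched
destring districtid, gen(district_code)

* Step 2: change numbers only if necessary (nothing to change here).
* Before any shift such as adding a constant, check the range of the codes.
summarize district_code, meanonly
display "largest code: " r(max)

* Step 3: attach the district names as value labels
labmask district_code, values(District)
list districtid district_code
list districtid district_code, nolabel
```

With this layout the stored value 219 displays as ADILABAD, while districtid still holds "0219" for merging with other datasets.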
          Last edited by Hemanshu Kumar; 26 Aug 2022, 21:13.



          • #6
            I also do not follow what problem the OP perceives here. But regarding Clyde's comment in #4: -egen, group()- has the option label, and with this option it does exactly what -encode- does:

            Code:
            . encode districtid,gen(districtid_encode)
            
            . egen districtid_group = group(districtid), label
            
            . des
            
            Contains data
             Observations:           100                  
                Variables:             7                  
            --------------------------------------------------------------------------------------------------
            Variable      Storage   Display    Value
                name         type    format    label      Variable label
            --------------------------------------------------------------------------------------------------
            shdist          byte    %8.0g                 
            districtid      str4    %9s                   
            districtid_nu~2 int     %8.0g                 
            District        str27   %27s                  
            districtid_num  long    %12.0g     districtid_num
                                                          
            districtid_en~e long    %8.0g      districtid_encode
                                                          
            districtid_gr~p float   %9.0g      districtid_group
                                                          group(districtid)
            --------------------------------------------------------------------------------------------------
            Sorted by: 
                 Note: Dataset has changed since last saved.
            
            . list in 1/10, nol
            
                 +---------------------------------------------------------------------------+
                 | shdist   distri~d   distri~2    District   distri~m   distri~e   distri~p |
                 |---------------------------------------------------------------------------|
              1. |     19       0219        219    ADILABAD         19          1          1 |
              2. |     12       0612        612   AHMADABAD         98          3          3 |
              3. |     12       0612        612   AHMADABAD         98          3          3 |
              4. |     12       0612        612   AHMADABAD         98          3          3 |
              5. |     12       0612        612   AHMADABAD         98          3          3 |
                 |---------------------------------------------------------------------------|
              6. |     12       0612        612   AHMADABAD         98          3          3 |
              7. |     12       0612        612   AHMADABAD         98          3          3 |
              8. |     12       0612        612   AHMADABAD         98          3          3 |
              9. |     12       0612        612   AHMADABAD         98          3          3 |
             10. |     12       0612        612   AHMADABAD         98          3          3 |
                 +---------------------------------------------------------------------------+



            • #7
              Thanks Clyde, Hemanshu and Joro. Very helpful solutions.
