  • Encode error

    Hi everyone,
    I have a problem with the encode command. My dataset includes two string variables (reachange2 and reachange3) that contain both numbers and text. However, when I use encode, the numeric values of reachange3 turn into different numbers. I also posted the example in the .doc file: the value '4' of reachange3 becomes 3 in lydochange3 after encode. (After encoding, two new variables are generated, lydochange2 and lydochange3.)
    command: encode reachange3, gen(lydochange3)
    province distric commune hoso reachange2 reachange3 lydochange2 lydochange3
    38 388 15046 107
    38 388 15046 107
    38 388 15046 107 4 4 4 4
    38 388 15046 107 4 4
    38 388 15046 107 4 4 4 3
    Please help me find where the error is and how I can resolve it. I would really appreciate it. Thank you very much.
    PS: Nick, may I tag you here? Sorry for any inconvenience.

    Regards
    Linh
    Last edited by Linh mt; 10 Jun 2019, 22:05.

  • #2
    No error here. encode doesn't look inside the strings it encodes to see if there is a number present. It is sufficient that the value label carries the information.

    If all your string values are numbers, you should be reaching for destring, not encode.

    NB: your tag was not sufficient to reach me. In general, it is not an especially good idea to tag individuals unless you are quoting their posts.



    • #3
      Adding a footnote to Nick's advice, when problems arise with a command a good place to start is with the online help for that command. In this example, type help encode into Stata's Command window and review the output. In that output you will find

      Do not use encode if varname contains numbers that merely happen to be stored as strings; instead, use generate newvar = real(varname) or destring
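      A minimal sketch of the difference, using a made-up three-row dataset (the variable names coded and asnumber are hypothetical, chosen only for illustration):

      ```stata
      clear
      input str3 reachange3
      "1"
      "2"
      "4"
      end

      * encode sorts the distinct strings and numbers them 1, 2, 3, ...
      * the original text is kept only as a value label
      encode reachange3, generate(coded)

      * real() returns the number that each string spells out
      generate asnumber = real(reachange3)

      * nolabel shows the stored numbers rather than the value labels
      list reachange3 coded asnumber, nolabel
      ```

      Here coded holds 1, 2, 3 while asnumber holds 1, 2, 4, which is the kind of shift described in #1.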



      • #4
        Also, if you really wish a given number for each category,
        you can fiddle with - encode - plus the option - label - like in the toy example below.

        Code:
        . sysuse auto
        (1978 Automobile Data)
        
        . codebook foreign
        
        ----------------------------------------------------------------------------------------------------------
        foreign                                                                                           Car type
        ----------------------------------------------------------------------------------------------------------
        
                          type:  numeric (byte)
                         label:  origin
        
                         range:  [0,1]                        units:  1
                 unique values:  2                        missing .:  0/74
        
                    tabulation:  Freq.   Numeric  Label
                                    52         0  Domestic
                                    22         1  Foreign
        
        . decode foreign, gen(for2)
        
        . codebook for2
        
        ----------------------------------------------------------------------------------------------------------
        for2                                                                                              Car type
        ----------------------------------------------------------------------------------------------------------
        
                          type:  string (str8)
        
                 unique values:  2                        missing "":  0/74
        
                    tabulation:  Freq.  Value
                                    52  "Domestic"
                                    22  "Foreign"
        
        . label define mylab 11 "Foreign" 33 "Domestic"
        
        . encode for2, gen(for3) label(mylab)
        
        . codebook for3
        
        ----------------------------------------------------------------------------------------------------------
        for3                                                                                              Car type
        ----------------------------------------------------------------------------------------------------------
        
                          type:  numeric (long)
                         label:  mylab
        
                         range:  [11,33]                      units:  1
                 unique values:  2                        missing .:  0/74
        
                    tabulation:  Freq.   Numeric  Label
                                    22        11  Foreign
                                    52        33  Domestic
        Hopefully that helps.
        Best regards,

        Marcos



        • #5
          Hi Nick, William and Marcos,

          Thank you for your replies. The string variable reachange3 includes both numbers and text, as in the example below:
          province distric commune hoso reachange2 reachange3 lydochange2 lydochange3
          38 388 15046 107
          38 388 15046 107
          38 388 15046 107 4 4 4 4
          38 388 15046 107 4 4
          38 388 15046 107 4 4 4 3
          38 388 15046 107 change job move house 5 5
          38 388 15046 107 suitable move house 6 5
          38 388 15046 107 change job retired 5 6
          However, I found where the mistake is. The reason is that reachange3 contains non-consecutive numeric values, say 1, 2, 4, ... Therefore, when I encode reachange3, the value 4 is converted to 3.

          @William,
          I think that in this case the destring command does not work, does it, because the variable includes both numeric and text values?


          Thank you all very much
          Regards
          Linh



          • #6
            With regards to - destring -, if you type - help destring - in the command window, you'll find this information:

            destring converts variables in varlist from string to numeric. If varlist
            is not specified, destring will attempt to convert all variables in the
            dataset from string to numeric. Characters listed in ignore() are
            removed. Variables in varlist that are already numeric will not be
            changed.
            [...]
            Either generate() or replace must be specified. With either option, if
            any string variable contains nonnumeric characters not specified with
            ignore(), then no corresponding variable will be generated, nor will that
            variable be replaced (unless force is specified).
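
            As a rough illustration of that passage, here is a sketch on a made-up variable that mixes numbers and text (the names num_plain and num_forced are hypothetical):

            ```stata
            clear
            input str10 reachange3
            "1"
            "4"
            "retired"
            end

            * Without force, destring sees the nonnumeric value "retired"
            * and does not create num_plain
            capture noisily destring reachange3, generate(num_plain)

            * With force, "retired" becomes missing (.) and "1", "4" become 1, 4
            destring reachange3, generate(num_forced) force

            list
            ```

            Note that force discards the text categories, so it is only a reasonable choice if those values can be treated as missing.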

            If I understood correctly, what you (wrongly) took to be an error in - encode - was that the same level received different numeric codes in the two encoded variables.

            The way to ensure these strings will have the same code in both (encoded) variables is shown in #4.

            Even if a few levels are absent from one variable or the other, you can this way guarantee that, say, "4" means the same thing in both variables.
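
            Applied to variables like the ones in this thread, the approach from #4 might look like the sketch below. The data, the label name reason and the code 5 for "retired" are assumptions made up for illustration:

            ```stata
            clear
            input str12 reachange2 str12 reachange3
            "1"       "1"
            "2"       "4"
            "4"       "4"
            "retired" "retired"
            end

            * One shared value label, with the codes chosen by hand
            label define reason 1 "1" 2 "2" 4 "4" 5 "retired"

            * Both variables are encoded against the same label;
            * noextend makes encode stop if it meets a string not in the label
            encode reachange2, generate(lydochange2) label(reason) noextend
            encode reachange3, generate(lydochange3) label(reason) noextend

            list, nolabel
            ```

            In this sketch reachange3 has no "2", yet "4" is still stored as 4 in both new variables because the codes come from the shared label rather than from the sort order within each variable.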
            Last edited by Marcos Almeida; 12 Jun 2019, 04:25.
            Best regards,

            Marcos



            • #7
              Hi @Marcos Almeida,

              Thank you for your reply. Actually, there is no problem with encode. What I want to emphasize here is that I wanted to get the same values in the string variable before and after ENCODE. If the string variable is missing some levels, then the new variable after ENCODE should be missing the corresponding values. Say the string variable is as below:
              reachange3
              1
              2
              4
              yes

              After ENCODE, the newvar is numeric:
              newvar
              1
              2
              4
              yes

              but the actual values of newvar are
              1
              2
              3
              4
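
              A small sketch of how the displayed and stored values can be compared, using the four example values above (by default encode names the value label after the new variable, here newvar):

              ```stata
              clear
              input str3 reachange3
              "1"
              "2"
              "4"
              "yes"
              end

              encode reachange3, generate(newvar)

              * Shows the value labels: 1, 2, 4, yes
              list newvar

              * Shows the stored codes: 1, 2, 3, 4
              list newvar, nolabel

              * Shows which code carries which label
              label list newvar
              ```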

              Hopefully my problem is clearly expressed. It is already solved now. Thank you again for helping me understand more about ENCODE.

              Kind regards
              linh



              • #8
                As underlined in #6, the way to ensure these strings will have the same code in both (encoded) variables is shown in #4.
                Best regards,

                Marcos



                • #9
                  Yes, I got it. Thank you so much, Marcos
                  Best regards
                  Linh

