  • recode string as numeric?

    Suppose I have the following data:
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str5 t str1 gz
    "C    " "Y"
    "A++  " "N"
    "B    " "N"
    "A+   " "N"
    "C    " "Y"
    "A    " "N"
    "A-   " "N"
    "C    " "N"
    "A    " "N"
    "-    " "Y"
    "C    " "N"
    "B    " "N"
    "C-   " "N"
    end
    1. Variable t has 7 categories: A++ (1), A+ (2), A (3), A- (4), B (5), C (6), and C- (7). Missing data are denoted by "-". I want to recode the string variable t into a numeric variable, say t1, taking the values given in the corresponding parentheses, with "-" replaced by the standard missing value ".".
    2. Similarly, for the gz variable, I want to construct a numeric variable gz1 with "N"/"Y" replaced by 0/1, respectively.
    Any suggestions?
    Ho-Chuan (River) Huang
    Stata 17.0, MP(4)

  • #2
    Code:
    gen gz1 = 0
    replace gz1 = 1 if gz=="Y"
    Same for t, only with more replace commands.
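    For t, the same pattern might look like the sketch below. It assumes the strings may carry trailing blanks, as in the -dataex- listing in #1, so each comparison is made on strtrim(t). The codes are the ones River lists in #1.

    ```stata
    * Start with all missing; "-" matches no condition below, so it stays missing (.)
    gen byte t1 = .
    replace t1 = 1 if strtrim(t) == "A++"
    replace t1 = 2 if strtrim(t) == "A+"
    replace t1 = 3 if strtrim(t) == "A"
    replace t1 = 4 if strtrim(t) == "A-"
    replace t1 = 5 if strtrim(t) == "B"
    replace t1 = 6 if strtrim(t) == "C"
    replace t1 = 7 if strtrim(t) == "C-"
    ```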


    • #3
      Or,
      Code:
      . encode t,gen(t_code)
      
      . replace t_code=. if strtrim(t)=="-"


      • #4
        Originally posted by Charlie Joyez View Post
        Code:
        . encode t,gen(t_code)
        . replace t_code=. if strtrim(t)=="-"
        These commands don't yield the values that River needs.
        Code:
        . lab list t_code
        t_code:
                   1 -    
                   2 A    
                   3 A+   
                   4 A++  
                   5 A-   
                   6 B    
                   7 C    
                   8 C-


        • #5
          I suggest the following commands:

          Code:
          . encode t, gen(t1)
          
          . replace t1 = . if t1 ==1
          (1 real change made, 1 to missing)
          
          . codebook t1
          
          ----------------------------------------------------------------------------------------------------------------------------------------------
          t1                                                                                                                                 (unlabeled)
          ----------------------------------------------------------------------------------------------------------------------------------------------
          
                            type:  numeric (long)
                           label:  t1
          
                           range:  [2,8]                        units:  1
                   unique values:  7                        missing .:  1/13
          
                      tabulation:  Freq.   Numeric  Label
                                       2         2  A
                                       1         3  A+
                                       1         4  A++
                                       1         5  A-
                                       2         6  B
                                       4         7  C
                                       1         8  C-
                                       1         .  
          
          . list
          
               +------------------+
               |     t   gz    t1 |
               |------------------|
            1. | C        Y     C |
            2. | A++      N   A++ |
            3. | B        N     B |
            4. | A+       N    A+ |
            5. | C        Y     C |
               |------------------|
            6. | A        N     A |
            7. | A-       N    A- |
            8. | C        N     C |
            9. | A        N     A |
           10. | -        Y     . |
               |------------------|
           11. | C        N     C |
           12. | B        N     B |
           13. | C-       N    C- |
               +------------------+
          With regard to the second query, this can be done in one line as well:

          Code:
          . gen gz1 = gz =="Y"
          
          . list
          
               +------------------------+
               |     t   gz    t1   gz1 |
               |------------------------|
            1. | C        Y     C     1 |
            2. | A++      N   A++     0 |
            3. | B        N     B     0 |
            4. | A+       N    A+     0 |
            5. | C        Y     C     1 |
               |------------------------|
            6. | A        N     A     0 |
            7. | A-       N    A-     0 |
            8. | C        N     C     0 |
            9. | A        N     A     0 |
           10. | -        Y     .     1 |
               |------------------------|
           11. | C        N     C     0 |
           12. | B        N     B     0 |
           13. | C-       N    C-     0 |
               +------------------------+
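          One caveat with the one-line form: the expression gz == "Y" evaluates to 0 for anything that is not "Y", including an empty string. A minimal sketch of a more defensive variant follows. The variable name gz1_safe is only illustrative, and the check assumes "Y", "N", and empty are the only legitimate values.

          ```stata
          * Stop with an error if gz holds anything other than "Y", "N", or empty
          assert inlist(gz, "Y", "N", "")
          * Leave the result missing where gz is empty, rather than coding it 0
          gen byte gz1_safe = (gz == "Y") if !missing(gz)
          ```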
          Last edited by Marcos Almeida; 28 Apr 2017, 04:49.
          Best regards,

          Marcos


          • #6
            In addition to Marcos Almeida's excellent advice, you should have a look at -encode-'s option -label()-: If you want to have control over the numeric codes used by encode, you can (1) define a value label containing the desired codes and (2) let encode use this value label.

            Code:
            * Example generated by -dataex-. To install: ssc install dataex
            clear
            input str5 t str1 gz
            "C    " "Y"
            "A++  " "N"
            "B    " "N"
            "A+   " "N"
            "C    " "Y"
            "A    " "N"
            "A-   " "N"
            "C    " "N"
            "A    " "N"
            "-    " "Y"
            "C    " "N"
            "B    " "N"
            "C-   " "N"
            end
            
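            * Define the desired codes first; .a labels the "-" (missing) category
            label define ratings 1 "A++" 2 "A+" 3 "A" 4 "A-" 5 "B" 6 "C" 7 "C-" .a "-"
            * Strip blanks so the strings match the label text exactly
            replace t=trim(itrim(t))
            encode t , generate(t_code) label(ratings) noextend
            
            * 0/1 coding for gz, as requested in #1
            label define yesno 0 "N" 1 "Y"
            replace gz=trim(itrim(gz))
            encode gz , generate(gz_code) label(yesno) noextend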
            Note that, for the sake of keeping everything a little easier, my code starts with trimming blank characters off the string values of your original variables. The option -noextend- makes -encode- stop execution in case any values not defined in the given value label are encountered in the source string variable.
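            After running the commands above, it may be worth confirming that the codes came out as intended. The following is a sketch of such a check, assuming t has already been trimmed and t_code was created with the ratings label as above.

            ```stata
            * Show the string-to-code mapping stored in the value label
            label list ratings
            * Cross-tabulate the strings against the numeric codes (not the labels)
            tabulate t t_code, missing nolabel
            * Spot checks: -assert- stops with an error if the mapping is not as intended
            assert t_code == 1 if t == "A++"
            assert t_code == 7 if t == "C-"
            assert missing(t_code) if t == "-"
            ```

            Note that "-" ends up as the extended missing value .a rather than ".", because a value label cannot be attached to "." itself; missing(t_code) is true for both.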

            Regards
            Bela
            Last edited by Daniel Bela; 28 Apr 2017, 05:24. Reason: Added info about the -noextend- option to -encode-.


            • #7
              Originally posted by Daniel Bela View Post
              In addition to Marcos Almeida's excellent advice, you should have a look at -encode-'s option -label()-: If you want to have control over the numeric codes used by encode, you can (1) define a value label containing the desired codes and (2) let encode use this value label.
              Thank you all for the suggestions. @Friedrich: I knew the method you suggested, but was looking for a more concise approach. @Charlie, @Friedrich and @Marcos: I also knew that the "encode" command can do something like this, but the results are NOT what I had in mind (say, I wanted A++ to be 1 rather than 4). @Daniel's answer is exactly what I need. Thank you, Daniel. Thank you all.
              Ho-Chuan (River) Huang
              Stata 17.0, MP(4)


              • #8
                Thank you all for the suggestions. @Friedrich: I knew the method you suggested, but was looking for a more concise approach. @Charlie, @Friedrich and @Marcos: I also knew that the "encode" command can do something like this, but the results are NOT what I had in mind (say, I wanted A++ to be 1 rather than 4).
                I presume you also knew how "to construct a numeric variable gz1 with "N"/"Y" replaced by 0/1, respectively."

                No doubt, Daniel Bela shared excellent commands. What is more, he anticipated issues which could well bite, and gave an excellent strategy for using the -noextend- option. This was a good lesson I learned today. Thanks, Daniel.

                That said, as you know, what matters most when categorizing a variable is providing the correct label. However much we further change the codes, the results will stay the same, provided the label is correct.

                That notwithstanding, to get the exact end result you wanted, -recode-, as you may also know, will do fine:

                Code:
                 . gen t2 = t1
                (1 missing value generated)
                
                . recode t2 (2=1 "A") (3=2 "A+") (4=3 "A++") (5=4 "A-") (6=5 "B") (7=6 "C") (8=7 "C-") (miss=.), gen(t3)
                (12 differences between t2 and t3)
                
                . codebook t3
                
                ----------------------------------------------------------------------------------------------------------------------------------------------
                t3                                                                                                                                RECODE of t2
                ----------------------------------------------------------------------------------------------------------------------------------------------
                
                                  type:  numeric (float)
                                 label:  t3
                
                                 range:  [1,7]                        units:  1
                         unique values:  7                        missing .:  1/13
                
                            tabulation:  Freq.   Numeric  Label
                                             2         1  A
                                             1         2  A+
                                             1         3  A++
                                             1         4  A-
                                             2         5  B
                                             4         6  C
                                             1         7  C-
                                             1         .

                As said before, the results, whatever they are, won't change after the second labeling:

                Code:
                . tab t1
                
                         t1 |      Freq.     Percent        Cum.
                ------------+-----------------------------------
                      A     |          2       16.67       16.67
                      A+    |          1        8.33       25.00
                      A++   |          1        8.33       33.33
                      A-    |          1        8.33       41.67
                      B     |          2       16.67       58.33
                      C     |          4       33.33       91.67
                      C-    |          1        8.33      100.00
                ------------+-----------------------------------
                      Total |         12      100.00
                
                . tab t3
                
                  RECODE of |
                         t2 |      Freq.     Percent        Cum.
                ------------+-----------------------------------
                          A |          2       16.67       16.67
                         A+ |          1        8.33       25.00
                        A++ |          1        8.33       33.33
                         A- |          1        8.33       41.67
                          B |          2       16.67       58.33
                          C |          4       33.33       91.67
                         C- |          1        8.33      100.00
                ------------+-----------------------------------
                      Total |         12      100.00
                
                . tab t1 t3
                
                           |                                 RECODE of t2
                        t1 |         A         A+        A++         A-          B          C         C- |     Total
                -----------+-----------------------------------------------------------------------------+----------
                     A     |         2          0          0          0          0          0          0 |         2
                     A+    |         0          1          0          0          0          0          0 |         1
                     A++   |         0          0          1          0          0          0          0 |         1
                     A-    |         0          0          0          1          0          0          0 |         1
                     B     |         0          0          0          0          2          0          0 |         2
                     C     |         0          0          0          0          0          4          0 |         4
                     C-    |         0          0          0          0          0          0          1 |         1
                -----------+-----------------------------------------------------------------------------+----------
                     Total |         2          1          1          1          2          4          1 |        12
                Last edited by Marcos Almeida; 29 Apr 2017, 07:30.
                Best regards,

                Marcos


                • #9
                  Originally posted by Marcos Almeida View Post

                  I presume you also knew how "to construct a numeric variable gz1 with "N"/"Y" replaced by 0/1, respectively."

                  No doubt, Daniel Bela shared excellent commands. What is more, he anticipated issues which could well bite, and gave an excellent strategy for using the -noextend- option. This was a good lesson I learned today. Thanks, Daniel.

                  That said, as you know, what matters most when categorizing a variable is providing the correct label. However much we further change the codes, the results will stay the same, provided the label is correct.

                  That notwithstanding, to get the exact end result you wanted, -recode-, as you may also know, will do fine:
                  Many thanks again, Marcos. I suppose that you meant
                  Code:
                  recode t2 (2=3 "A") (3=2 "A+") (4=1 "A++") (5=4 "A-") (6=5 "B") (7=6 "C") (8=7 "C-") (miss=.), gen(t3)
                  which is what I needed. Your suggestion of using "recode" in this way is even more intuitive. I like it. Thanks.
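                  A quick way to confirm that the corrected mapping matches the codes requested in #1 might look like the sketch below. It assumes t2 holds the -encode- codes 2 to 8 from #5 and #8 and that t is the original string variable; the name t3_check is only illustrative.

                  ```stata
                  * Same rules as above, written to a separate variable for checking
                  recode t2 (2=3 "A") (3=2 "A+") (4=1 "A++") (5=4 "A-") (6=5 "B") (7=6 "C") (8=7 "C-") (missing=.), gen(t3_check)
                  * -recode- stores the rule labels in a value label named after the new variable
                  label list t3_check
                  * Spot checks against the original strings (trimmed, in case of trailing blanks)
                  assert t3_check == 1 if strtrim(t) == "A++"
                  assert t3_check == 3 if strtrim(t) == "A"
                  assert missing(t3_check) if strtrim(t) == "-"
                  ```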
                  Ho-Chuan (River) Huang
                  Stata 17.0, MP(4)
