  • How can I get consistent encoding across variables?

    I am using encode to convert string variables into numeric, e.g. "chemistry" might get encoded to 1, "physics" to 2, etc. The problem is that there are 17 variables that use the same string codes but not all of them contain all 100 categories, e.g. for var2, if no one was in chemistry then "physics" could get encoded as 1 instead of 2.

    Is there any easy way to get consistent encoding across variables? I can think of harder ways, e.g. a recode command where I recode 100 values, but I wonder if there isn't something simpler.

    It will be even harder, of course, if the vars sometimes have different categories, e.g. "physics" appears in var2 but not var1. So, I suppose you would want an encoding based on all the categories in all 17 vars. I guess I could get all the categories in a file, encode it, and then merge, but this too seems tedious. I think I would have to repeat the process 17 times.

    This seems like a common enough problem that someone would have written a routine for it. But maybe not.
    -------------------------------------------
    Richard Williams, Notre Dame Dept of Sociology
    Stata Version: 17.0 MP (2 processor)

    EMAIL: [email protected]
    WWW: https://www3.nd.edu/~rwilliam

  • #2
    Try defining the value labels first, and then use the label() option of your -encode- command.



    • #3
      You're a genius. This seems to work:

      Code:
      * encode the 17 keywords
      forval j = 1/17 {
          encode keyword`j', gen(key`j') label(keywords)
      }
      If I understand this correctly, if "physics" appeared in keyword2 but not keyword1, it would be added to the label as a new category. That would break the alphabetical ordering of the codes, but I'm not so concerned about that. I just want consistent coding.

      EDIT: And, if I am really concerned about alphabetical order, I suppose I could create the label myself. Or edit the label created by the above and rerun the encoding.
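
      A minimal, untested sketch of creating the label by hand, in alphabetical order across all 17 variables, might look like this. It assumes keyword1-keyword17 are string variables and that neither a value label named keywords nor the variables key1-key17 exist yet; the local macro names (all, lv, i) are only illustrative.

      Code:
      * collect the distinct strings found in any of the 17 variables
      local all
      forval j = 1/17 {
          quietly levelsof keyword`j', local(lv)
          local all : list all | lv
      }
      * sort the combined list alphabetically, then number the categories
      local all : list sort all
      local i = 0
      foreach s of local all {
          local ++i
          label define keywords `i' `"`s'"', add
      }
      * encode every variable against the same, complete label
      forval j = 1/17 {
          encode keyword`j', gen(key`j') label(keywords)
      }
      Because the label already contains every category before the first -encode- runs, each string should get the same code in all 17 variables, whichever variable it first appears in.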
      Last edited by Richard Williams; 26 Jul 2017, 05:41.



      • #4
        Yes, I (almost) always create the label myself before using -encode-. Sorry not to have been clearer.



        • #5
          See also -multencode- from SSC.



          • #6
            When using your own value label in the -encode- command, add the -noextend- option. Stata will then throw an error if the variable contains categories that you did not know existed or forgot to include in your value label. Without the option, Stata will happily extend the label automatically with codes for the additional categories, and those codes might then differ from dataset to dataset.

            The -noextend- option appeared at some point in the Stata 14 years.
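
            As a rough sketch of how this could be used with the loop from #3, assuming the value label keywords has already been defined with every expected category and that key1-key17 do not exist yet:

            Code:
            * stop -encode- from silently adding categories to the label
            forval j = 1/17 {
                capture noisily encode keyword`j', gen(key`j') label(keywords) noextend
                if _rc {
                    display as error "encode failed for keyword`j' (return code " _rc ")"
                }
            }
            With -capture noisily- the loop carries on after a failure and still shows Stata's own error message, so all the variables with unexpected categories can be found in one pass.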



            • #7
              This is great! multencode would be very handy if you have multiple vars and don't know all the codes beforehand. noextend can help avoid errors or inconsistencies across data sets. And defining the labels beforehand can get you the coding you want, e.g. for gender

              Code:
              * gender variable
              label define female 0 "M" 1 "F"
              encode Gender, gen(female) label(female)
              Without the label option, the encoded variable would be 1 = female, 2 = male, because -encode- assigns codes in alphabetical order of the strings. It would be even worse for something like "high", "medium", and "low", because the default encoding would be 1 = high, 2 = low, 3 = medium. I thought I would have to recode after encoding, but now I see that probably isn't necessary if I first define the label.
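
              For the "high", "medium", "low" case, a sketch along the same lines might be the following. The variable name level_str and the label name level are made up for illustration, and the strings in the label have to match the data exactly, including case.

              Code:
              * ordered categories: define the codes in the order wanted
              label define level 1 "low" 2 "medium" 3 "high"
              encode level_str, gen(level) label(level) noextend
              With noextend, as suggested in #6, any misspelled or unexpected value of level_str should produce an error rather than a fourth code.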

              Thanks much.
