How can I get consistent encoding across variables?

Richard Williams

Join Date: Apr 2014

Posts: 4984
#1

26 Jul 2017, 05:17

I am using encode to convert string variables into numeric, e.g. "chemistry" might get encoded to 1, "physics" to 2, etc. The problem is that there are 17 variables that use the same string codes but not all of them contain all 100 categories, e.g. for var2, if no one was in chemistry then "physics" could get encoded as 1 instead of 2.

Is there any easy way to get consistent encoding across variables? I can think of harder ways, e.g. a recode command where I recode 100 values, but I wonder if there isn't something simpler.

It will be even harder, of course, if the vars sometimes have different categories, e.g. "physics" appears in var2 but not var1. So, I suppose you would want an encoding based on all the categories in all 17 vars. I guess I could get all the categories in a file, encode it, and then merge, but this too seems tedious. I think I would have to repeat the process 17 times.

This seems like a common enough problem that someone would have written a routine for it. But maybe not.

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
EMAIL: [email protected]
WWW: https://www3.nd.edu/~rwilliam
Rich Goldstein

Join Date: Mar 2014

Posts: 4461
#2

26 Jul 2017, 05:18

Try defining the value labels first and then using the label() option on your -encode- command.
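A minimal sketch of that suggestion, assuming two hypothetical string variables (var1, var2) and a hypothetical value label named subject. The toy data and all names are invented for illustration only:

```stata
* Hypothetical toy data: two string variables sharing the same codes
clear
input str10 var1 str10 var2
"chemistry" "physics"
"physics"   "physics"
"biology"   "biology"
end

* Define the mapping once, before encoding anything
label define subject 1 "biology" 2 "chemistry" 3 "physics"

* Reuse the same value label for every variable
encode var1, generate(subj1) label(subject)
encode var2, generate(subj2) label(subject)

* Show the underlying numeric codes for both new variables
list subj1 subj2, nolabel
```

Because both calls read from the same predefined label, "physics" should receive the same code in subj1 and subj2 even though var2 contains no "chemistry".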
Richard Williams

Join Date: Apr 2014

Posts: 4984
#3

26 Jul 2017, 05:35

You're a genius. This seems to work:

Code:

* encode the 17 keywords
forval j = 1/17 {
    encode keyword`j', gen(key`j') label(keywords)
}

If I understand this correctly, if "physics" was in keyword2 but not keyword1, it would get added as a new category. That would screw up the alphabetical ordering, but I'm not so concerned about that. I just want consistent coding.

EDIT: And, if I am really concerned about alphabetical order, I suppose I could create the label myself. Or edit the label created by the above and rerun the encoding.
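One way to build the label yourself in alphabetical order might look like the sketch below. It assumes the string variables keyword1-keyword17 from this thread sit next to each other in the dataset and that no value label called keywords exists yet. The names id, slot, all, and kw1-kw17 are hypothetical and chosen only for this example:

```stata
* Collect every distinct category across all 17 variables
preserve
keep keyword1-keyword17
generate long id = _n
reshape long keyword, i(id) j(slot)
* -levelsof- returns the distinct nonmissing values in sorted order
quietly levelsof keyword, local(all)
restore

* Build the value label from the sorted list, numbering from 1
local i = 0
foreach s of local all {
    local ++i
    label define keywords `i' `"`s'"', add
}

* Encode every variable against the same, complete label
forvalues j = 1/17 {
    encode keyword`j', generate(kw`j') label(keywords)
}
```

Since the label already covers every category found in any of the variables, the codes stay alphabetical and identical across kw1-kw17.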

Last edited by Richard Williams; 26 Jul 2017, 05:41.

Rich Goldstein

Join Date: Mar 2014

Posts: 4461
#4

26 Jul 2017, 06:49

Yes, I (almost) always create the label myself prior to using -encode-; sorry not to have been clearer.
Nick Cox

Join Date: Mar 2014

Posts: 35676
#5

26 Jul 2017, 07:02

See also -multencode- from SSC.
Bill Rising (StataCorp)

StataCorp Employee

Join Date: Apr 2014

Posts: 28
#6

26 Jul 2017, 09:12

When using your own value label in the -encode- command, use the -noextend- option. This causes Stata to throw an error if the data contain additional categories that you did not know existed or forgot to include in your initial value label. Without the option, Stata will happily extend the label automatically to create values for the additional categories, and those values might then differ from dataset to dataset.

The -noextend- option appeared at some point in the Stata 14 years.
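A small sketch of how -noextend- behaves, using hypothetical toy data in which one category ("geology") is deliberately left out of the label. The variable and label names are invented for this example:

```stata
* Hypothetical toy data with a category missing from the label
clear
input str10 subject
"chemistry"
"physics"
"geology"
end

* The label knows only two of the three categories
label define subjects 1 "chemistry" 2 "physics"

* With noextend, -encode- should stop with an error instead of
* silently assigning a new code to "geology"
capture noisily encode subject, generate(subj_code) label(subjects) noextend
display "return code: " _rc

* Without noextend, the label is extended and "geology" gets a new code
encode subject, generate(subj_code2) label(subjects)
label list subjects
```

The -capture noisily- prefix is there only so the example keeps running after the expected error; in a real do-file you would probably want the error to halt execution.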
Richard Williams

Join Date: Apr 2014

Posts: 4984
#7

26 Jul 2017, 09:36

This is great! multencode would be very handy if you have multiple vars and don't know all the codes beforehand. noextend can help avoid errors or inconsistencies across data sets. And defining the labels beforehand can get you the coding you want, e.g. for gender

Code:

* gender variable
label define female 0 "M" 1 "F"
encode Gender, gen(female) label(female)

Without the label option, the encoded variable would be 1 = female, 2 = male. It would be even worse for something like "high", "medium", and "low" because the default encoding would be 1 = high, 2 = low, 3 = medium. I thought I would have to recode after encoding but now I see that probably isn't necessary if I can first define the label.
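The same idea for an ordered variable, as a rough sketch. The data, the variable level, and the label level_lbl are hypothetical:

```stata
* Hypothetical toy data for an ordered category
clear
input str6 level
"high"
"low"
"medium"
"low"
end

* Define the label in the substantive order rather than alphabetical
label define level_lbl 1 "low" 2 "medium" 3 "high"
encode level, generate(level_num) label(level_lbl)

* Compare the labeled and the underlying numeric codes
tabulate level_num
tabulate level_num, nolabel
```

With the label defined first, the numeric codes follow low < medium < high, so no recode step should be needed afterwards.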

Thanks much.
