Labeling Data with ranges of values

Peter Vaughn

Join Date: Mar 2015

Posts: 21
#1


26 Mar 2015, 06:21

Hey everybody. I need to label/categorize a variable that shows the highest completed grade of school for about 10,000 people. The range is from 0 to 20 years of school. I need to categorize these observations into "low" for those with 0 to 11 years of schooling, "medium" for those who completed high school (= 12 years) and "high" for those with 13 to 20 years of schooling. How can I generate a new variable "educ_categorized" containing these three labels? Stata help and Google did not solve this... Many thanks to all of you.

Best,

Peter

Maarten Buis

Join Date: Mar 2014
Posts: 3459

26 Mar 2015, 06:31

Code:

// open some example data
sysuse nlsw88, clear

// create the categorized variable
gen byte edcat = cond(grade  < 12, 1,     ///
                 cond(grade == 12, 2, 3)) ///
                 if !missing(grade)
                
// add some labels                
label variable edcat "education categorized"
label define edlevs 1 "less than highschool" ///
                    2 "highschool"           ///
                    3 "more than highschool"
label value edcat edlevs

// admire the result
tab grade edcat

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------


Nick Cox

Join Date: Mar 2014

Posts: 35726
#3

26 Mar 2015, 06:38

help egen and help functions both contain relevant suggestions. I favour an explicit definition such as

Code:

gen myvar = cond(schooling <= 11, 1, cond(schooling == 12, 2, cond(schooling <= 20, 3, 4))) if schooling < .
label def myvar 1 "<= 11" 2 "12" 3 "<= 20" 4 "error?"
label val myvar myvar

The value label stuff is standard and well documented and needs no more comment.

That generate syntax using cond() may look horrible, but it is nicer than it looks. You are saying

Code:

cond(schooling <= 11, 1, cond(schooling == 12, 2, cond(schooling <= 20, 3, 4 ) ) ) if schooling < .

So, using more words,

if schooling <= 11, assign 1;

else if schooling == 12, assign 2;

else if schooling <= 20, assign 3;

else assign 4;

all so long as the variable is not missing.

I know you said nothing about schooling more than 20 years, but if there are freak values in your dataset you don't know about they will show up. (Suppose somewhere there is a 21 that should be 12, etc.) A similar comment applies to missings. Naturally you may well be checking in any case.
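
A minimal sketch of how such freak values would surface, assuming the variable is called schooling as above. The seven toy observations are made up for illustration, including one deliberate 21 and one missing value:

```stata
* toy data, hypothetical: includes an out-of-range 21 and a missing value
clear
input schooling
0
11
12
13
20
21
.
end

gen byte myvar = cond(schooling <= 11, 1,          ///
                 cond(schooling == 12, 2,          ///
                 cond(schooling <= 20, 3, 4))) if schooling < .

* cross-tabulate, missings included, to see where every value ended up
tabulate schooling myvar, missing

* list whatever fell through to the catch-all category
list schooling if myvar == 4

* or insist that nothing lies outside 0-20; capture noisily shows the
* failure without stopping a do-file
capture noisily assert inrange(schooling, 0, 20) if schooling < .
```

On real data the tabulate step is usually enough: any count in the column for category 4 points to values that need a closer look.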

As far as the syntax for cond() is concerned, clearly cond() calls can be nested and parentheses work as in elementary algebra: a left parenthesis ( amounts to a promise to write down its match later.

I greatly favour this way of writing down definitions:

1. What happens at class intervals is totally explicit. This isn't true of e.g. egen's cut() function.

2. By the same token, your code has a record of exact definitions.
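
To illustrate the first point, here is a hedged comparison on the nlsw88 example data used earlier in the thread. The variable names edcat_cut and edcat_cond are made up for this sketch. With at(), egen's cut() treats each group as closed on the left and open on the right; that is documented, but nothing in the code itself says so:

```stata
sysuse nlsw88, clear

* cut() with at(): groups are [0,12), [12,13), [13,21);
* icodes numbers the groups 0, 1, 2 instead of using the lower limits
egen edcat_cut = cut(grade), at(0, 12, 13, 21) icodes

* the explicit version, for comparison
gen byte edcat_cond = cond(grade  < 12, 1,     ///
                      cond(grade == 12, 2, 3)) ///
                      if !missing(grade)

* the two codings should differ only by the offset of 1
assert edcat_cut + 1 == edcat_cond if !missing(grade)
tabulate grade edcat_cut, missing
```

Both versions give the same grouping here; the difference is only that the cond() version spells out what happens at 12 and 13.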

But there is taste here too. Some very experienced Stata users hate cond() with a passion matched only by my pet peeves.

Last edited by Nick Cox; 26 Mar 2015, 06:55.

Patrick Abi Nader

Join Date: Jun 2014
Posts: 174

26 Mar 2015, 07:04

Hi Peter,

The following code should do what you are asking. I am calling your original education variable yrs_education.

Code:

gen educ_categorized=yrs_education
recode educ_categorized (0/11=1) (12=2) (13/20=3)
label define educ_cat 1 "Low" 2 "Medium" 3 "High"
label values educ_categorized educ_cat

. list

     +---------------------+
     | yrs_ed~n   educ_c~d |
     |---------------------|
  1. |        0        Low |
  2. |        1        Low |
  3. |        1        Low |
  4. |        2        Low |
  5. |        3        Low |
     |---------------------|
  6. |        4        Low |
  7. |        5        Low |
  8. |        7        Low |
  9. |        4        Low |
 10. |        5        Low |
     |---------------------|
 11. |       18       High |
 12. |       20       High |
 13. |       13       High |
 14. |       12     Medium |
 15. |       14       High |
     |---------------------|
 16. |       15       High |
 17. |       12     Medium |
 18. |        9        Low |
 19. |       12     Medium |
 20. |       15       High |
     |---------------------|
 21. |        8        Low |
 22. |        9        Low |
     +---------------------+


Svend Juul

Join Date: Apr 2014

Posts: 515
#5

26 Mar 2015, 07:27

I don't hate cond(), but I dislike it because of its complexity. I like recode for such tasks; I find it more transparent. Maarten's example would look like this:

Code:

recode grade (min/11=1)(12=2)(13/max=3) , generate(edcat)

But in this case we assumed that grade takes only integer values 0-20, and the following may be safer:

Code:

recode grade (13/20=3)(12/13=2)(0/12=1)(missing=.)(*=4) , generate(edcat)

Here, the intervals touch, so no non-integer values drop between bins. The rule is that if two bins overlap, the bin specified first wins. (*=4) collects any values not specified by the previous rules.
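
A small sketch of the first-bin-wins rule, using made-up non-integer values in a hypothetical variable x. The catch-all rule is left out here so that only the overlapping bins are in play:

```stata
* toy data, hypothetical: values on and between the bin limits
clear
input x
11.5
12
12.5
13
end

* 12 and 13 each fall into two bins; the bin listed first is applied
recode x (13/20=3) (12/13=2) (0/12=1), generate(xcat)
list x xcat
```

With this ordering, 12 and 12.5 should land in bin 2 and 13 in bin 3, while 11.5 is caught by the last bin.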

The recode command can also be used to specify value labels.

A major problem with recode is that it may tempt you to omit the generate() option:

Code:

recode grade (min/11=1)(12=2)(13/max=3) // Don't do that

in which case the original grade variable will be destroyed. Actually recode ought to require either a generate() or a replace option.

Sergiy Radyakin

Join Date: Apr 2014
Posts: 1867

26 Mar 2015, 07:47

Two comments:
1) Svend's syntax doesn't quite work, as the keyword missing cannot be combined with the keyword * (star), which means "everything else":

Code:

. recode grade (13/20=3) (12/13=2) (0/12=1) (missing=.) (*=4) , generate(edcat)
keywords else/* and missing/nonmissing may not be combined
r(198);

Keyword nonmissing should be used here to denote "other nonmissing values not falling into any of the prescribed bins".
But otherwise recode is just the right tool for this kind of task.

2) Since the original question was about labels, the recode syntax can be modified to prescribe the labels in the same step:

Code:

sysuse auto, clear
replace mpg=mpg-10
rename mpg grade
recode grade (13/20=3 "high") (12/13=2 "medium") (0/12=1 "low") (missing=.) ///
                 (nonmissing=4 "unknown or miscoded") , generate(edcat)
tabulate edcat

Produces:

Code:

    RECODE of grade |
    (Mileage (mpg)) |      Freq.     Percent        Cum.
--------------------+-----------------------------------
                low |         43       58.11       58.11
             medium |          5        6.76       64.86
               high |         21       28.38       93.24
unknown or miscoded |          5        6.76      100.00
--------------------+-----------------------------------
              Total |         74      100.00

I wish we could prescribe the variable label as well in the same syntax, as the default "RECODE of ..." is very annoying (imho).
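
Until then, one workaround is to overwrite the default variable label right after recode. A minimal sketch on the nlsw88 example data, borrowing the names educ_categorized and educ_cat from earlier in the thread; the label text is an arbitrary choice:

```stata
sysuse nlsw88, clear

* label() names the value label that recode builds from the rules
recode grade (0/11=1 "low") (12=2 "medium") (13/max=3 "high"), ///
    generate(educ_categorized) label(educ_cat)

* replace the default "RECODE of grade ..." variable label
label variable educ_categorized "Education, categorized"
describe educ_categorized
```

This costs one extra line, and the code still records that the variable came from recode.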

Best, Sergiy Radyakin


Nick Cox

Join Date: Mar 2014

Posts: 35726
#7

26 Mar 2015, 08:00

The default variable label, I suspect, is totally deliberate: a flag that is a very gentle version of SOME PREVIOUS USER, PERHAPS EVEN YOU, CHANGED THESE DATA, SO WATCH OUT.
Peter Vaughn

Join Date: Mar 2015

Posts: 21
#8

26 Mar 2015, 08:21

Thanks so much! I used a combination of Svend's and Sergiy's code. Many thanks, have a great day.

Best, Peter Vaughn
