Categorise continuous variable using ranges

Adam Mitchell

Join Date: Oct 2021

Posts: 56
#1

08 Aug 2022, 10:33

I have a continuous variable ranging from 0 to 1 and I would like to split it into 4 categories, each covering a range of values. For example, I want the categories to be:

1 = <0.2
2 = 0.2 - 0.39
3 = 0.4 - 0.59
4 = >= 0.60

Does anyone know how to code this effectively?

Many thanks in advance.
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

08 Aug 2022, 10:46

The following code assumes that the variable you have is named var and that it is a true numeric variable (not a string variable, and not a value-labeled integer variable--these can look like true numeric variables but they aren't.)

Code:

gen byte wanted = 1 if var < 0.2
replace wanted = 2 if var < 0.4 & missing(wanted)
replace wanted = 3 if var < 0.6 & missing(wanted)
replace wanted = 4 if missing(wanted) & !missing(var)

Note: You do not say how you want to handle the situation where var is missing. The code above leaves wanted as a missing value in that case. That is the most common way to handle it, and it is also the best for most purposes.

A shorter way to code this, given that your variable ranges between 0 and 1 and the cutpoints are equally spaced, is:

Code:

gen byte wanted = min(floor(5*var) + 1, 4) if !missing(var)

I do not recommend this second approach here, however, because it is far from transparent. The slight amount of extra time it will take for you to type, and for Stata to execute, the longer code given at the top of this reply is insignificant compared to the time you will spend puzzling out how the one-line version works, and even more insignificant compared to the time you will spend trying to remember what it's about when you come back to this code several months later or have to explain it to somebody else. The only reason I even offer the approach is that if we were dealing with, say, 20 categories instead of 4, it would make sense to use this kind of approach.
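
To see how the one-line version works: 5*var stretches the range 0 to 1 out to 0 to 5, floor() drops the fractional part to give 0, 1, 2, 3 or 4, adding 1 shifts that to 1 through 5, and min(..., 4) folds everything from 0.6 upwards into category 4. A minimal sketch with a few illustrative values (picked for this example, deliberately away from the cutpoints):

Code:

* 5*0.15 = 0.75, floor gives 0, plus 1 gives category 1
display min(floor(5*0.15) + 1, 4)
* 5*0.35 = 1.75, floor gives 1, plus 1 gives category 2
display min(floor(5*0.35) + 1, 4)
* 5*0.55 = 2.75, floor gives 2, plus 1 gives category 3
display min(floor(5*0.55) + 1, 4)
* 5*0.95 = 4.75, floor gives 4, plus 1 gives 5, which min() caps at 4
display min(floor(5*0.95) + 1, 4)
* min() ignores missing arguments, so this displays 4 rather than missing
display min(., 4)

The last line is the reason to restrict the one-line version with if !missing(var): without that qualifier, a missing var would be coded 4 rather than left missing.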
Nick Cox

Join Date: Mar 2014

Posts: 35697
#3

08 Aug 2022, 10:50

A continuous variable like this can be coded with say

Code:

gen wanted = ceil(5 * given)

with values 1(1)5 for upper limits 0.2(0.2)1 — in contrast to your scheme. See e.g. a paper on rounding and binning fairly recently in the Stata Journal. If you want or need irregular bins, then see the thread started recently by
Nick Cox

Join Date: Mar 2014

Posts: 35697
#4

08 Aug 2022, 10:52

Alberto Siviero …

I don’t agree completely with Clyde here. People not knowing what floor and ceiling functions are miss out on simple and widely useful functions, which allow consistent and concise and translatable rules for binning. But I thumped the table on this point in my binning paper, so won’t repeat the advocacy at length.

Last edited by Nick Cox; 08 Aug 2022, 10:59.
Nick Cox

Join Date: Mar 2014

Posts: 35697
#5

09 Aug 2022, 03:54

Here are the references for #3 and #4.

https://www.stata-journal.com/articl...article=dm0095

https://www.statalist.org/forums/for...g-foreach-loop esp. #4 for repeated use of cond().
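
As a sketch of the repeated cond() approach applied to the question in #1, assuming the variable is named var and the result wanted, as in #2:

Code:

* each cond() tests one upper limit in turn; the final 4 catches everything from 0.6 up
gen byte wanted = cond(var < 0.2, 1, cond(var < 0.4, 2, cond(var < 0.6, 3, 4))) if !missing(var)

The if qualifier leaves wanted missing whenever var is missing.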
Adam Mitchell

Join Date: Oct 2021

Posts: 56
#6

10 Aug 2022, 03:42

Thank you Clyde and Nick for your help and advice! Really appreciated
Max Rakotonirinalalao

Join Date: Aug 2022

Posts: 8
#7

22 Aug 2022, 03:17

Hello,
Stata newbie here.
I have a similar question. My variable (age in years) ranges from 18 to 70 and I would like to categorize it into 5 categories.
1) 18-28 (years old)
2) 29-39
3) 40-50
4) 51-61
5) 62+
I need to describe the age distribution for a study.
How can I code this? Thanks in advance

Kind regards

Nick Cox

Join Date: Mar 2014
Posts: 35697
#8

22 Aug 2022, 03:36

#7 is really the same question as #1, as only your variable names and bin limits differ.

Code:

gen wanted = 1 if age <= 28 
replace wanted = 2 if age <= 39 & missing(wanted) 
replace wanted = 3 if age <= 50 & missing(wanted) 
replace wanted = 4 if age <= 61 & missing(wanted) 
replace wanted = 5 if age < . & missing(wanted)

or, as a single statement using nested cond():

Code:

gen wanted = cond(age <= 28, 1, cond(age <= 39, 2, cond(age <= 50, 3, cond(age <= 61, 4, 5))))  if age < .

or check out recode

In each case also define and link value labels

Code:

label def wanted 1 "18-28" 2 "29-39" 3 "40-50" 4 "51-61" 5 "62+" 
label val wanted wanted

where you should use the variable name you have for age if not age and you should use something fitting your goal for wanted.
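
A quick way to check the result, sketched here with the variable names age and wanted used above, is to look at the range of age within each category:

Code:

* minimum, maximum and count of age within each category of wanted
tabstat age, by(wanted) statistics(min max n)
* confirm that wanted is missing exactly when age is missing
assert missing(wanted) == missing(age)

Each category's minimum and maximum should fall inside its intended bin.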


Noah Mkasanga

Join Date: Nov 2020

Posts: 31
#9

22 Aug 2022, 03:44

recode age (18/28=1 "18-28") ///
(29/39=2 "29-39") ///
(40/50=3 "40-50") ///
(51/61=4 "51-61") ///
(62/70=5 "62+") ///
,gen(agegrp)
label var agegrp "Age Category"
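
The rule 62/70 relies on 70 being the largest age in the data. If older ages could occur, one variant (a sketch only, keeping the same variable names) is to use recode's max keyword for the open-ended last bin:

Code:

recode age (18/28=1 "18-28") ///
(29/39=2 "29-39") ///
(40/50=3 "40-50") ///
(51/61=4 "51-61") ///
(62/max=5 "62+") ///
, gen(agegrp)

Values that match none of the rules are copied unchanged into agegrp, so running tabulate agegrp afterwards is a useful check.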
