  • Categorise continuous variable using ranges

    I have a continuous variable ranging from 0 to 1 and I would like to categorize it into 4 categories. Each category should cover a range of values; for example, I want the categories to be:

    1 = <0.2
    2 = 0.2 - 0.39
    3 = 0.4 - 0.59
    4 = >= 0.60

    Does anyone know how to code this effectively?

    Many thanks in advance.

  • #2
    The following code assumes that the variable you have is named var and that it is a true numeric variable (not a string variable, and not a value-labeled integer variable--these can look like true numeric variables but they aren't.)

    Code:
    gen byte wanted = 1 if var < 0.2
    replace wanted = 2 if var < 0.4 & missing(wanted)
    replace wanted = 3 if var < 0.6 & missing(wanted)
    replace wanted = 4 if missing(wanted) & !missing(var)
    Note: You do not say how you want to handle the situation where var is missing value. The code above leaves wanted as a missing value in that case. That is the most common way to handle it, and it is also the best for most purposes.

    A shorter way to code this, given that your variable ranges between 0 and 1 and the cutpoints are equally spaced, is:

    Code:
    gen byte wanted = min(floor(5*var) + 1, 4) if !missing(var)
    The if qualifier matters because min() ignores missing arguments, so without it a missing value of var would be coded as 4. I do not recommend this second approach here, however, because it is far from transparent. The slight amount of extra time it will take for you to type, and for Stata to execute, the longer code given at the top of this reply is insignificant compared to the time you will spend puzzling out how the one-line version works, and even more insignificant compared to the time you will spend trying to remember what it's about when you come back to this code several months later or have to explain it to somebody else. The only reason I even offer the approach is that if we were dealing with, say, 20 categories instead of 4, it would make sense to use this kind of approach.
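
    One way to build confidence in the one-line version is to run both versions side by side on made-up data and let Stata compare them. The following is only a sketch: the data set, the variable names long_way and short_way, and the grid of test values are invented for illustration, and the one-liner carries an explicit guard for missing values.

    ```stata
    * Hypothetical test data: 0, 0.01, ..., 1 plus one missing value
    clear
    set obs 102
    gen double var = (_n - 1) / 100
    replace var = . in 102

    * Step-by-step version
    gen byte long_way = 1 if var < 0.2
    replace long_way = 2 if var < 0.4 & missing(long_way)
    replace long_way = 3 if var < 0.6 & missing(long_way)
    replace long_way = 4 if missing(long_way) & !missing(var)

    * One-line version, guarded so that missing var stays missing
    gen byte short_way = min(floor(5*var) + 1, 4) if !missing(var)

    * Stops with an error if the two versions disagree anywhere
    assert long_way == short_way
    tabulate long_way short_way, missing
    ```

    The cross-tabulation makes any disagreement visible, including at the cutpoints 0.2, 0.4 and 0.6, where floating-point precision is most likely to matter. With real data it is worth repeating the check on the actual variable rather than relying on the simulated grid.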


    • #3
      A continuous variable like this can be coded with, say,

      Code:
      gen wanted = ceil(5 * given)
      with values 1(1)5 for upper limits 0.2(0.2)1 — in contrast to your scheme. See e.g. a paper on rounding and binning fairly recently in the Stata Journal. If you want or need irregular bins, then see the thread started recently by


      • #4
        Alberto Siviero

        I don’t agree completely with Clyde here. People not knowing what floor and ceiling functions are miss out on simple and widely useful functions, which allow consistent and concise and translatable rules for binning. But I thumped the table on this point in my binning paper, so won’t repeat the advocacy at length.
        Last edited by Nick Cox; 08 Aug 2022, 10:59.


        • #5
          Here are the references for #3 and #4.

          https://www.stata-journal.com/articl...article=dm0095

          https://www.statalist.org/forums/for...g-foreach-loop esp. #4 for repeated use of cond().


          • #6
            Thank you Clyde and Nick for your help and advice! Really appreciated


            • #7
              Hello,
              Stata newbie here.
              I have a similar question. My variable (age in years) ranges from 18 to 70 and I would like to categorize it into 5 categories:
              1) 18-28 (years old)
              2) 29-39
              3) 40-50
              4) 51-61
              5) 62+
              I need to describe the age distribution for a study.
              How can I code this? Thanks in advance

              Kind regards


              • #8
                #7 is really the same question as #1, as only your variable names and bin limits differ.

                Code:
                gen wanted = 1 if age <= 28 
                replace wanted = 2 if age <= 39 & missing(wanted) 
                replace wanted = 3 if age <= 50 & missing(wanted) 
                replace wanted = 4 if age <= 61 & missing(wanted) 
                replace wanted = 5 if age < . & missing(wanted)
                or

                Code:
                gen wanted = cond(age <= 28, 1, cond(age <= 39, 2, cond(age <= 50, 3, cond(age <= 61, 4, 5))))  if age < .
                or check out recode

                In each case, also define and attach value labels:

                Code:
                label def wanted 1 "18-28" 2 "29-39" 3 "40-50" 4 "51-61" 5 "62+" 
                label val wanted wanted
                where you should use the variable name you have for age if not age and you should use something fitting your goal for wanted.
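
                Whichever route is taken, it helps to check afterwards that each category really covers the intended ages. A minimal sketch, assuming integer ages and using simulated data (the seed, sample size and generated age values are invented for illustration):

                ```stata
                * Hypothetical data: 200 integer ages between 18 and 70
                clear
                set seed 12345
                set obs 200
                gen age = runiformint(18, 70)

                * Nested cond() version from this post
                gen wanted = cond(age <= 28, 1, cond(age <= 39, 2, cond(age <= 50, 3, cond(age <= 61, 4, 5)))) if age < .

                * Define and attach value labels
                label def wanted 1 "18-28" 2 "29-39" 3 "40-50" 4 "51-61" 5 "62+"
                label val wanted wanted

                * Minimum and maximum age within each category should match the labels
                tabstat age, by(wanted) statistics(min max n)
                tabulate wanted, missing
                ```

                The tabstat table shows the smallest and largest age in each category, so a misplaced cutpoint shows up as a range that does not match its label.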


                • #9
                  Code:
                  recode age (18/28=1 "18-28") ///
                  (29/39=2 "29-39") ///
                  (40/50=3 "40-50") ///
                  (51/61=4 "51-61") ///
                  (62/max=5 "62+") ///
                  , gen(agegrp)
                  label var agegrp "Age Category"
