
  • Creating Categorical variable based on ranges of another variable using loops

    Hello,

    I created a categorical variable "agecat" from ranges of the variable "hhh_age". I used the following commands in Stata 12:

    Code:
    gen agecat=.
    replace agecat=1 if  hhh_age<=30
    replace agecat=2 if  hhh_age>30 & hhh_age<=40
    replace agecat=3 if  hhh_age>40 & hhh_age<=50
    replace agecat=4 if  hhh_age>50 & hhh_age<=60
    replace agecat=5 if  hhh_age>60 & hhh_age!=.
    
    label define age 1 "upto 30" 2 "30-40" 3 "40-50" 4 "50-60" 5 "Above 60"
    label value agecat age
    These are the first 50 observations:

    Code:
     list id hhh_age agecat in 1/50,    noobs
    
        +----------------------------+
        id   hhh_age     agecat
        ----------------------------
        61        45      40-50
        62        25    upto 30
        63        24    upto 30
        68        60      50-60
        73        63   Above 60
        ----------------------------
        77        35      30-40
        81        55      50-60
        84        60      50-60
        86        33      30-40
        87        26    upto 30
        ----------------------------
        91        38      30-40
        93        40      30-40
        94        45      40-50
        95        39      30-40
        96        42      40-50
        ----------------------------
        97        40      30-40
        98        32      30-40
        99        27    upto 30
        102        48      40-50
        103        55      50-60
        ----------------------------
        104        30    upto 30
        105        28    upto 30
        108        55      50-60
        109        38      30-40
        110        28    upto 30
        ----------------------------
        113        72   Above 60
        124        70   Above 60
        156        55      50-60
        157        28    upto 30
        160        33      30-40
        ----------------------------
        161        42      40-50
        164        43      40-50
        170        45      40-50
        176        68   Above 60
        178        32      30-40
        ----------------------------
        184        60      50-60
        187        29    upto 30
        190        42      40-50
        193        52      50-60
        194        45      40-50
        ----------------------------
        196        32      30-40
        199        28    upto 30
        200        49      40-50
        202        64   Above 60
        206        50      40-50
        ----------------------------
        210        55      50-60
        211        50      40-50
        213        50      40-50
        214        28    upto 30
        216        34      30-40
        +----------------------------+

    Now I would like to create a categorical variable with 5-year intervals up to 90 and then a single 90+ category, but that would require repeating a lot of lines if I follow the procedure above. For example:

    Code:
    gen age_int=.
    replace age_int=1 if  hhh_age>0  & hhh_age<5
    replace age_int=2 if  hhh_age>=5 & hhh_age<10
    replace age_int=3 if  hhh_age>=10 & hhh_age<15
    replace age_int=4 if  hhh_age>=15 & hhh_age<20
    replace age_int=5 if  hhh_age>=20 & hhh_age<25
    replace age_int=19 if  hhh_age>=90 & hhh_age!=.
    I am wondering whether I can use a loop to write less code. I tried the code below, but I think it is more suitable for creating dummies, and it did not work.

    Code:
    forvalues hhh_age=1(5)103{
       local val=`val'+1
       local top=`hhh_age' + 4
       gen age_int=`val'+1 if hhh_age>=`age' & hhh_age<`top'
       }
    The summary of hhh_age is given below:
    Code:
    su    hhh_age
    
        Variable |   Obs       Mean   Std. Dev.   Min   Max
        ---------+------------------------------------------
         hhh_age |  8311    47.1859    13.00567     3   103
    I would appreciate any advice on the matter.

    Thanks in advance.
    Pablo
    Last edited by Pablo Miah; 08 Aug 2018, 19:51. Reason: Added Summary of hhh_age variable.

  • #2
    You don't need any loops at all. In fact, it's just two lines of code (and one of those is needed only because you decided to put all ages from 90 up into a single category, not just 90 through 94.)

    Code:
    //    CREATE DEMONSTRATION DATA
    clear*
    set obs 101
    gen hhh_age = _n-1
    
    //    CALCULATE AGE GROUP
    gen age_int = floor(hhh_age/5) + 1
    //    min() ignores missing arguments, so leave missing ages missing
    replace age_int = min(age_int, 19) if !missing(age_int)
    
    //    SHOW THAT IT CAME OUT CORRECT
    tabstat hhh_age, by(age_int) statistics(min max)
    Added: In general, one of the interesting features of the Stata language is that many things that would be done with loops in other languages can be done without them. Yes, there are some things that must be done with loops in Stata, but when you find yourself starting to write a loop, it is always worth taking a moment to ponder whether there might be another way, relying on functions, or using -by:-.
    Last edited by Clyde Schechter; 08 Aug 2018, 20:06.
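
    A loop can still be handy for attaching value labels to the groups produced by the approach in #2. The following is a minimal sketch, not part of the original reply: the label name age_int_lbl is hypothetical, and the label text assumes ages are whole numbers. Because min() ignores missing arguments, the sketch also restricts the calculation to nonmissing ages so that a missing age is not placed in group 19.

    ```stata
    // Demonstration data: ages 0 through 100 (assumed, as in #2)
    clear
    set obs 101
    gen hhh_age = _n - 1

    // Groups 1-18 cover 0-4, 5-9, ..., 85-89; group 19 is 90 and above
    gen age_int = min(floor(hhh_age/5) + 1, 19) if !missing(hhh_age)

    // Build the value label in a loop (age_int_lbl is a hypothetical name)
    forvalues k = 1/18 {
        local lo = 5 * (`k' - 1)
        local hi = `lo' + 4
        label define age_int_lbl `k' "`lo'-`hi'", add
    }
    label define age_int_lbl 19 "90+", add
    label values age_int age_int_lbl

    tabulate age_int
    ```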



    • #3
      Thank you so much! I will keep the advice in mind.



      • #4
        I'd add that there's a simple advantage to categorisations like 5 * floor(age/5) or 5 * ceil(age/5): they are self-explanatory and easy to explain to others too. If you are binning, there's no very strong reason beyond convention that bin identifiers are successive integers or even that they start at 1. Naturally schemes like those above can be explained with value labels, but equally schemes like those above need to be explained by value labels.
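
        As an illustration of the point above, here is a minimal sketch on assumed demonstration data in which each bin is identified by its own lower or upper limit. The variable names age_lo and age_hi are hypothetical, and pooling everything from 90 upward follows the original question rather than anything this scheme requires.

        ```stata
        // Demonstration data: ages 0 through 100 (assumed)
        clear
        set obs 101
        gen hhh_age = _n - 1

        // Bin identified by its lower limit: 0, 5, 10, ...
        gen age_lo = 5 * floor(hhh_age/5)

        // Bin identified by its upper limit: 0, 5, 10, ... (age 5 falls in bin 5)
        gen age_hi = 5 * ceil(hhh_age/5)

        // Pool ages 90 and above into the bin that starts at 90
        replace age_lo = 90 if age_lo > 90 & !missing(age_lo)

        tabstat hhh_age, by(age_lo) statistics(min max)
        ```

        With age_lo, a value of 40 means ages 40 up to but not including 45, so the codes can be read without consulting a value label.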
