  • Labeling Data with ranges of values

    Hey everybody. I need to label/categorize a variable that shows the highest completed grade of school for about 10,000 people. The range is from 0 to 20 years of school. I need to categorize these observations as "low" for those with 0 to 11 years of schooling, "medium" for those who completed high school (= 12 years), and "high" for those with 13 to 20 years of schooling. How can I generate a new variable "educ_categorized" containing these three labels? Stata help and Google did not solve this. Many thanks to all of you.

    Best,

    Peter

  • #2
    Code:
    // open some example data
    sysuse nlsw88, clear
    
    // create the categorized variable
    gen byte edcat = cond(grade  < 12, 1,     ///
                     cond(grade == 12, 2, 3)) ///
                     if !missing(grade)
                    
    // add some labels                
    label variable edcat "education categorized"
    label define edlevs 1 "less than highschool" ///
                        2 "highschool"           ///
                        3 "more than highschool"
    label value edcat edlevs
    
    // admire the result
    tab grade edcat
    ---------------------------------
    Maarten L. Buis
    University of Konstanz
    Department of history and sociology
    box 40
    78457 Konstanz
    Germany
    http://www.maartenbuis.nl
    ---------------------------------


    • #3
      help egen and help functions both contain relevant suggestions. I favour an explicit definition such as

      Code:
      gen myvar = cond(schooling <= 11, 1, cond(schooling == 12, 2, cond(schooling <= 20, 3, 4))) if schooling < .
      
      label def myvar 1 "<= 11" 2 "12" 3 "<= 20" 4 "error?"
      label val myvar myvar
      The value label stuff is standard and well documented and needs no more comment.

      That generate syntax using cond() may look horrible, but it is nicer than it looks. You are saying

      Code:
      cond(schooling <= 11, 1,
      cond(schooling == 12, 2,
      cond(schooling <= 20, 3, 4
      )
      )
      )
      if schooling < .
      So, using more words,

      if schooling <= 11, assign 1;

      else if schooling == 12, assign 2;

      else if schooling <= 20, assign 3;

      else assign 4;

      all so long as the variable is not missing.

      I know you said nothing about schooling more than 20 years, but if there are freak values in your dataset you don't know about they will show up. (Suppose somewhere there is a 21 that should be 12, etc.) A similar comment applies to missings. Naturally you may well be checking in any case.

      As far as the syntax for cond() is concerned, clearly cond() calls can be nested and parentheses work as in elementary algebra: a left parenthesis ( amounts to a promise to write down its match later.

      I greatly favour this way of writing down definitions:

      1. What happens at class intervals is totally explicit. This isn't true of e.g. egen's cut() function.

      2. By the same token, your code has a record of exact definitions.

      But there is taste here too. Some very experienced Stata users hate cond() with a passion matched only by my pet peeves.
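
As a hedged illustration of the point about explicit class boundaries, the sketch below applies the nested cond() definition to the nlsw88 example data shipped with Stata and then checks each class with assert. Here grade stands in for the schooling variable, and the names myvar and mylbl are illustrative assumptions, not part of the original question.

```stata
* Sketch: check a nested cond() classification at the class boundaries.
* "grade" stands in for schooling; "myvar" and "mylbl" are illustrative names.
sysuse nlsw88, clear

gen byte myvar = cond(grade <= 11, 1, cond(grade == 12, 2, cond(grade <= 20, 3, 4))) if grade < .

label define mylbl 1 "<= 11" 2 "12" 3 "13-20" 4 "error?"
label values myvar mylbl

* each class should contain only the intended values;
* the third assert would fail if a non-integer such as 11.5 slipped into class 3
assert grade <= 11            if myvar == 1
assert grade == 12            if myvar == 2
assert inrange(grade, 13, 20) if myvar == 3
assert missing(myvar)         if missing(grade)

* any observation coded 4 lies above 20 and deserves a closer look
count if myvar == 4
tabulate grade myvar, missing
```

If any assert fails, Stata stops with an error, which is the desired behaviour for a check of this kind.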


      Last edited by Nick Cox; 26 Mar 2015, 06:55.


      • #4
        Hi Peter,

        The following code should do what you are asking. I am calling your original education variable yrs_education.

        Code:
        gen educ_categorized=yrs_education
        recode educ_categorized (0/11=1) (12=2) (13/20=3)
        label define educ_cat 1"Low" 2"Medium" 3"High"
        label values educ_categorized educ_cat
        
        . list
        
             +---------------------+
             | yrs_ed~n   educ_c~d |
             |---------------------|
          1. |        0        Low |
          2. |        1        Low |
          3. |        1        Low |
          4. |        2        Low |
          5. |        3        Low |
             |---------------------|
          6. |        4        Low |
          7. |        5        Low |
          8. |        7        Low |
          9. |        4        Low |
         10. |        5        Low |
             |---------------------|
         11. |       18       High |
         12. |       20       High |
         13. |       13       High |
         14. |       12     Medium |
         15. |       14       High |
             |---------------------|
         16. |       15       High |
         17. |       12     Medium |
         18. |        9        Low |
         19. |       12     Medium |
         20. |       15       High |
             |---------------------|
         21. |        8        Low |
         22. |        9        Low |
             +---------------------+


        • #5
          I don't hate cond(), but I dislike it because of its complexity. I like recode for such tasks; I find it more transparent. Maarten's example would look like this:
          Code:
          recode grade (min/11=1)(12=2)(13/max=3) , generate(edcat)
          But this assumes that grade takes only integer values from 0 to 20, and the following may be safer:
          Code:
          recode grade (13/20=3)(12/13=2)(0/12=1)(missing=.)(*=4) , generate(edcat)
          Here, the intervals touch, so no non-integer values drop between bins. The rule is that if two bins overlap, the bin specified first wins. (*=4) collects any values not specified by the previous rules.
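
The overlap rule can be tried out on a small made-up dataset. The sketch below is only an illustration: the toy values and the name edcat are assumptions, and it uses just the three touching intervals without a catch-all rule.

```stata
* Sketch: which bin wins when recode rules overlap at a boundary?
* The toy variable "grade" and the name "edcat" are illustrative.
clear
input float grade
11
11.5
12
12.5
13
20
end

recode grade (13/20=3) (12/13=2) (0/12=1), generate(edcat)

* 12 matches both (12/13) and (0/12); the rule listed first applies
assert edcat == 2 if grade == 12
* 13 matches both (13/20) and (12/13); again the first rule applies
assert edcat == 3 if grade == 13
* non-integer values land in a bin instead of slipping through
assert edcat == 1 if grade == 11.5
assert edcat == 2 if grade == 12.5

list, noobs
```

Reordering the rules changes where the boundary values 12 and 13 end up, so the order is part of the definition.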

          The recode command can also be used to specify value labels.

          A major problem with recode is that it may tempt you to omit the generate() option:
          Code:
          recode grade (min/11=1)(12=2)(13/max=3)           // Don't do that
          in which case the original grade variable will be destroyed. Actually recode ought to require either a generate() or a replace option.


          • #6
            Two comments:
            1) Svend's syntax doesn't quite work, as the option missing is incompatible with option * (star), which means "everything else":

            Code:
            . recode grade (13/20=3) (12/13=2) (0/12=1) (missing=.) (*=4) , generate(edcat)
            keywords else/* and missing/nonmissing may not be combined
            r(198);
            The keyword nonmissing should be used here to denote "other nonmissing values not falling into any of the prescribed bins".
            But otherwise recode is just the right tool for this kind of task.

            2) Since the original question was about labels, the recode syntax can be modified to prescribe the labels in the same step:
            Code:
            sysuse auto, clear
            replace mpg=mpg-10
            rename mpg grade
            recode grade (13/20=3 "high") (12/13=2 "medium") (0/12=1 "low") (missing=.) ///
                             (nonmissing=4 "unknown or miscoded") , generate(edcat)
            tabulate edcat
            Produces:
            Code:
                RECODE of grade |
                (Mileage (mpg)) |      Freq.     Percent        Cum.
            --------------------+-----------------------------------
                            low |         43       58.11       58.11
                         medium |          5        6.76       64.86
                           high |         21       28.38       93.24
            unknown or miscoded |          5        6.76      100.00
            --------------------+-----------------------------------
                          Total |         74      100.00
            I wish we could prescribe the variable label in the same syntax as well, as the default "RECODE of ..." is very annoying (imho).

            Best, Sergiy Radyakin


            • #7
              The default variable label, I suspect, is totally deliberate: a flag that is a very gentle version of SOME PREVIOUS USER, PERHAPS EVEN YOU, CHANGED THESE DATA, SO WATCH OUT.
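
For anyone who wants a different variable label while still keeping a record that the variable was derived, one possible approach is to overwrite the label after recode and store the definition in a note. This is a minimal sketch on the nlsw88 example data; the label text, the note text, and the name edcat are illustrative assumptions.

```stata
* Sketch: replace the default "RECODE of ..." label but keep a provenance note.
sysuse nlsw88, clear

recode grade (0/11=1 "low") (12=2 "medium") (13/20=3 "high"), generate(edcat)

* inspect the default variable label that recode assigned
describe edcat

* overwrite it with a label of your own choosing
label variable edcat "Education, categorized"

* keep the gentle warning in a note attached to the variable
notes edcat: recoded from grade (0-11 low, 12 medium, 13-20 high)

notes list edcat
tabulate edcat
```

The note travels with the dataset when it is saved, so the warning survives even though the label no longer carries it.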


              • #8
                Thanks so much! I used a combination of Svend's and Sergiy's code. Many thanks, have a great day.

                Best, Peter Vaughn
