  • gencat: Generating Dummies and Categorical Variables

    I'd appreciate any comments you have on this code: gencat.

    The purpose is to generate dummies and categorical variables in one line of code while avoiding the missing-value pitfall with generate. The included dataset has sex and race variables, in which I introduced missing values coded as either . or -99.
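
    For anyone who has not met the missing issue, here is a minimal sketch in official Stata. It uses the shipped auto data rather than the attached gencat_data (an assumption, so that it runs anywhere), and the variable names excellent_naive and excellent are made up for illustration.

    ```stata
    sysuse auto, clear

    * Naive indicator: a logical expression is never missing,
    * so cars with missing rep78 are coded 0
    generate byte excellent_naive = (rep78 == 5)

    * Guarded indicator: cars with missing rep78 stay missing
    generate byte excellent = (rep78 == 5) if !missing(rep78)

    * Compare the two, showing missing values
    tabulate excellent_naive excellent, missing
    ```

    The naive version silently puts the cars with missing rep78 into the 0 group, which is the trap a one-line command would need to avoid.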

    An error is issued if the new variable already exists. For either dummies or categorical variables, you can create individual dummies (for example, male and female) and add a prefix to avoid clashing with an existing variable.

    target chooses which value you want to be coded 1 in a 0/1 situation.

    zero codes the dummy as 0/1 in cases where the source variable is coded 1/2 or similar.

    Code:
    clear all
    use gencat_data , clear
    
    ** Sex
    tab SEX , missing
    gencat female = SEX , values(1 male 2 female) zero target(female) dummies
    
    ** Race
    tab RACE , missing
    gencat racecat = RACE , values(1 white 2 black 3 hispanic 4 asian 5 other) dummies
    
    ** Race with prefix
    gencat racecat2 = RACE , values(1 white 2 black 3 hispanic 4 asian 5 other) dummies prefix(race)

    Last edited by George Ford; 03 May 2026, 10:09.

  • #2
    Hi George,

    I think this is a good idea to have, as it blends functionality of both -recode- and -tab, gen()-. Here are some comments after just testing it for a few minutes with the example dataset.
    • Using the -dummies- option, new variables are named using those specified labels. I think this behaviour could be optionally switched to instead use numeric codes if combined with prefix, similar to how -tab, gen()- enumerates a list of new indicator variables.
    • Labels are constrained to a single word (or potentially valid Stata name). This to me seems like a limitation as labels are often not fully described with a single word. It would be nice if the limit could be relaxed and also applied as labels to the newly created indicator variables.
    • The example with the variable RACE highlights an important potential point of confusion. The default value for non-target indicator variables is given a label of "Other", while "Other" is also a valid recoded value. Maybe let the user optionally control this label value, and change the default to be "Not xxxx".
    • Minor nitpick, but some people don't like the terminology of dummy-coded/dummies. Maybe allow -indicator- as a synonym.
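
    For reference, the -tab, gen()- enumeration mentioned in the first point looks like this on the auto data. The stub rep is just an illustrative choice.

    ```stata
    sysuse auto, clear

    * One indicator per observed level of rep78: rep1, rep2, ..., rep5
    tabulate rep78, generate(rep)

    * The new indicators are missing wherever rep78 is missing
    describe rep1-rep5
    count if missing(rep1)
    ```

    The indicators are numbered by the order of the observed levels, not named after value labels, which is the behaviour suggested above as an option.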


    • #3
      Good thoughts, some of which I've already started to include. I've also added if and in. I think a label option to label the variable is a useful add-on. I've been using it, and it makes life a lot easier.


      • #4
        Here's an update. Give it a shot. It supports if/in. There's a label option to label the variable. There's a tabstyle option that creates EDUC1, EDUC2, .... A "bug" in the numbering of the categorical variable was also fixed. dummies is now indicator, so as not to offend.


        • #5
          Speaking for myself, and starting at the bottom, I am not fond of the term dummy variable and generally prefer indicator variable.

          There are several grounds for that -- and several other terms you could use instead, including binary, dichotomous, quantal, Boolean, logical and one-hot. Often I will use binary. This is all social, not logical: I tend to follow what seems common practice: a 0-1 variable may be called a binary outcome, but an indicator predictor.

          The objections to dummy variable include possible serious embarrassment (or worse). I have heard several stories with this flavour: a researcher was casual in a presentation about the term as a familiar technicality to them, but what was encoded was important, and the term was wildly misconstrued by audience members as implying that what was encoded was regarded as trivial or unimportant, or even that the researcher was being offensive, accidentally or on purpose.

          There is lengthier discussion within

          SJ-19-1 dm0099 . . . . . . How best to generate indicator or dummy variables
          . . . . . . . . . . . . . . . . . . . . N. J. Cox and C. B. Schechter
          Q1/19 SJ 19(1):246--259 (no commands)
          discusses how to best generate indicator or dummy variables

          SJ-16-1 dm0087 . . . Speaking Stata: Truth, falsity, indication, and negation
          . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . N. J. Cox
          Q1/16 SJ 16(1):229--236 (no commands)
          looks at the following concepts: indicator variables, by: for
          groupwise calculations, and control of sort order to enable
          exactly what you want

          Now to my main reaction here. I usually want an indicator variable, by that or any other name, that is ONE of

          1. Zero or one

          2. Zero or one or missing

          and I often want value labels attached.

          That is all typically three lines of official Stata. The problems with any community-contributed command -- and I include commands written by me singly or jointly in the past, still on SSC, but not used by me any more -- are

          a. whether I really save much effort

          b. the fact that often I really need to check what a command is doing exactly (in particular, the difference between #1 and #2 above is often crucial)

          c. I have been there, but now suggest that one shouldn't need a different command for such a task -- or oblige your readers to find out about it.
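
          As a sketch of what those three lines of official Stata could look like, using rep78 in the auto data: pick one of the two generate lines, depending on whether #1 or #2 is wanted, then add the two label lines. The variable names good01 and good and the label name yesno are illustrative assumptions.

          ```stata
          sysuse auto, clear

          * Type 1: zero or one only (cars with missing rep78 are coded 0)
          generate byte good01 = inlist(rep78, 4, 5)

          * Type 2: zero, one, or missing (cars with missing rep78 stay missing)
          generate byte good = inlist(rep78, 4, 5) if !missing(rep78)

          * Value labels, defined once and attached to both variables
          label define yesno 0 "No" 1 "Yes"
          label values good01 good yesno

          tabulate good01 good, missing
          ```

          The only difference between the two generate lines is the if qualifier, which is exactly the point that is easy to lose sight of inside a wrapper command.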


          • #6
            I added some functionality to gencat. You can reverse the order of categories and add the source variable to the label. tabstyle is an option.
            No residual "dummy" parlance remains.

            As Nick asks, does it save much effort? Perhaps there will be some disagreement, but I'm using it and like the setup. It will certainly avoid the missing issue, which is a trap that less experienced (and sometimes experienced) users fall into.

            Does it save many lines of code? In some cases, I find yes; in others, not so much. For me, in several cases, it replaces several lines of code, especially with labels, combining groups, and reverse ordering. There's also a clarity to it when you see the code.

            thoughts, bugs, and odd use cases are appreciated.


            Code:
            sysuse auto, clear
            
            * Categorical variable for repair record (rep78 has values 1-5):
            gencat rep = rep78, values(1 "Poor" 2 "Fair" 3 "Average" 4 "Good" 5 "Excellent") label("Repair Record")
            
            * Same with indicators (creates Poor, Fair, Average, Good, Excellent):
            gencat rep2 = rep78, values(1 "Poor" 2 "Fair" 3 "Average" 4 "Good" 5 "Excellent") label("Repair Record") indicator
            
            * Reverse-coded repair record (5=Poor, 4=Fair, ...):
            gencat rep_r = rep78, values(1 "Poor" 2 "Fair" 3 "Average" 4 "Good" 5 "Excellent") reverse label("Repair Record (reversed)")
            
            * Combine groups:
            gencat rep_com = rep78, values(1 2 "Poor/Fair" 3 "Average" 4 "Good" 5 "Excellent") label("Repair Record (combo)")
            
            * Indicator for domestic cars (foreign==0), with source variable in label:
            gencat domestic = foreign, values(0 domestic 1 foreign) zero target(domestic) source
            
            * Tabstyle — auto-creates rep1, rep2, ... from observed levels of rep78 and includes source variable in label:
            gencat rep_t = rep78, tabstyle label("Repair Record") source
            
            * Subset with if — domestic cars only:
            gencat rep_for = rep78 if !foreign , values(1 "Poor" 2 "Fair" 3 "Average" 4 "Good" 5 "Excellent") label("Repair Record")
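
            For comparison, a rough official-Stata counterpart of the combine-groups and reverse examples above, using recode. This is only a sketch: it does not assume anything about how gencat numbers the combined categories, and the new codes 1 to 4 are chosen here for illustration.

            ```stata
            sysuse auto, clear

            * Combine groups: 1 and 2 become one category; labels attach to the new codes
            recode rep78 (1 2 = 1 "Poor/Fair") (3 = 2 "Average") ///
                (4 = 3 "Good") (5 = 4 "Excellent"), generate(rep_com)
            label variable rep_com "Repair Record (combo)"

            * Reverse coding: 5 = Poor, 4 = Fair, ..., 1 = Excellent
            recode rep78 (1 = 5 "Poor") (2 = 4 "Fair") (3 = 3 "Average") ///
                (4 = 2 "Good") (5 = 1 "Excellent"), generate(rep_r)
            label variable rep_r "Repair Record (reversed)"

            * Missing rep78 matches no rule, so it stays missing in both new variables
            tabulate rep_com, missing
            tabulate rep_r, missing
            ```

            That is two commands per variable, so whether gencat saves effort here depends mostly on how often the labels and options are reused.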


            • #7
              I recoded it, but it works the same. Before I was parsing a bunch of text, but this uses traditional syntax.

              New examples provided in the help file, all based on auto.

              Thanks to Leonardo and Nick for the comments, which I believe I have included in the code.


              • #8
                Found a bug in testing.
