  • Composite Categorical Variables

    I am trying to create a single composite categorical variable from several (7) dummy variables. I exported the data from survey software that created dummies out of a "check any that apply" question. I have tried the gen and egen commands, and every time I try to recode/replace the value it changes the original value from the dummy variable.

    For example:

    gen sector=.
    replace sector=1 if gbcon==1
    replace sector=2 if hhc==1
    replace sector=3 if cm==1
    replace sector=4 if engin==1
    replace sector=5 if arch==1
    replace sector=6 if spec==1
    replace sector=7 if other_sector==1

    The problem is that when I originally replace sector=1 if gbcon==1, the count is correct; however, once I add the command replace sector=2 if hhc==1, the count of gbcon changes from its original count to an inaccurate count.

    After 'replace sector=1 if gbcon==1' the count is 25.
    After I add the code for 'replace sector=2 if hhc==1' the count for gbcon (or sector=1) changes to 19.

    I cannot figure out a) why this is happening as gbcon and hhc are not coming from the same variable, and b) how to stop the count from changing when adding additional dummy variables to this composite variable

  • #2
    At the moment you are just replacing everything, both missing and relevant information.
    If you just want to fill 'sector' up with the following answers from 'hhc' etc then you can do:

    replace sector=1 if gbcon==1
    replace sector=2 if hhc==1 & sector==.
    replace sector=3 if cm==1 & sector==.
    If you want to collect all the information in 'sector' then perhaps:

    * start from 0: adding to a missing value would just stay missing
    replace sector=0
    replace sector=sector + 1 if gbcon==1
    replace sector=sector + 10 if hhc==1
    replace sector=sector + 100 if cm==1


    • #3
      Well, for a):
      tabulate gbcon hhc, missing
      and you'll see why.

      For b): you can't, so long as there are observations where both indicator variables are equal to one.
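
      As a rough check of how much overlap there is, something like the following might help. It only assumes the seven indicator names from #1; nchecked is a made-up variable name for illustration.

      ```stata
      * observations where both boxes were checked
      count if gbcon == 1 & hhc == 1

      * nchecked (hypothetical name): number of sectors checked per respondent
      egen nchecked = rowtotal(gbcon hhc cm engin arch spec other_sector)
      tabulate nchecked, missing
      ```

      Given the counts reported in #1 (25 dropping to 19), the first command would be expected to return 6. Any respondent with nchecked above 1 cannot be represented by a single 1-7 code.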


      • #4
        The problem is that the original data is "check any that apply." So if somebody checks both gbcon and hhc, then sector will first be set to 1, but then changed to 2. (And, depending on what else is checked, perhaps changed again before we are done.)

        You simply can't do what you're trying to do. When the original data is check all that apply, you have to leave it as separate dichotomous indicators for each response: you cannot make a single variable out of them in this way.

        If you really have a reason to make a single variable, you have to do it differently. You have to have a variable that accommodates all possible combinations of sectors being checked. In this case that means 2^7 = 128 possibilities, so the variable can range from 1 through at most 128. -egen, group()- will do this for you.
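
        A minimal sketch of that approach, assuming the seven indicators from #1 and using sector_combo as a placeholder name so it does not clash with the existing sector:

        ```stata
        * one integer code per combination of the seven indicators that occurs
        egen sector_combo = group(gbcon hhc cm engin arch spec other_sector), label
        tabulate sector_combo
        ```

        Note that group() only numbers combinations that are actually observed, and by default it leaves the result missing for observations where any of the indicators is missing (the missing option changes that).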


        • #5
          Another possibility is just to concatenate the variables, as in

          egen sector = concat(gbcon hhc cm engin arch spec other_sector)
          The resulting values won't be especially easy to interpret but the coding does keep the information.

          A little more work is something like this.

          gen sector = "" 
          foreach v in gbcon hhc cm engin arch spec other_sector { 
                  replace sector = sector + " `v'" if `v' == 1 
          }
          replace sector = trim(sector)
          That way you'll end up with string values like "engin spec" or "hhc". If your data are like data I know, only some of the combinations will occur and most people will check just a small number of categories. So, summary tables and graphs are not far away.
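
          With the string version in place, a first summary could look something like this (just a sketch; sector_id is a hypothetical name):

          ```stata
          * most frequent combinations first
          tabulate sector, sort

          * numeric, value-labeled copy for commands that need a numeric variable
          encode sector, generate(sector_id)
          ```

          Observations where nothing was checked have an empty string, which tabulate leaves out unless the missing option is added.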


          • #6
            Thanks all for the help!