  • Grouping Categorical Data

    Hi there,
    I am trying to create four groups from a categorical variable. I tried adapting code meant for continuous data, e.g. recode A (var1,var2,var3=1) (var4,var5=2) gen A_group, etc., but had no luck. When I ran it I got the error message "unknown el Var1 in rule", r(198).

    There are between 2 and 3 different answers in each category. Any advice would be appreciated.
    Last edited by Mike ZCollins; 06 Aug 2019, 13:28.

  • #2
    1. You can work on a varlist, but not this way; there appear to be other syntax errors as well. See
    Code:
    help recode
    2. Since you did not supply a data example, it is not clear what you want to do here; see the FAQ for the use of -dataex-.
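
    A data example of the kind requested can be produced roughly as follows. This is only a sketch: it assumes the dataset is already in memory, that the variable of interest is called money (the name used later in the thread), and that dataex is available (on older Stata versions it may first need to be installed from SSC).

    ```stata
    * uncomment the next line if dataex is not yet installed (assumption: older Stata)
    * ssc install dataex

    * write out up to 20 observations of money in a form that can be pasted into a post
    dataex money, count(20)
    ```

    The output can be copied into a forum post as is, so that others can recreate the data with a single paste.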


    • #3
      Hi Mike,

      Error r(198) usually indicates invalid syntax. Consider posting your exact syntax so others can help you with it. The FAQ (section 12) explains what to post and how to post it. Please consider reading it.

      Recode is a fairly straightforward command.

      Code:
      recode item (1 2 = 1) (3 = 2) (4/7 = 3)
      This command recodes the contents of a variable called item, changing 1 and 2 to 1, 3 to 2, and any value from 4 to 7 to 3. Note that recode works only on numeric variables.
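
      A minimal sketch of the same rules applied to toy data, with the result written to a new variable instead of overwriting item. The data, the new variable name item3, and the labels "Low", "Middle", and "High" are made up for illustration.

      ```stata
      clear
      * toy data: one observation for each value from 1 to 7
      input byte item
      1
      2
      3
      4
      5
      6
      7
      end

      * keep item unchanged; store the recoded values, with value labels, in item3
      recode item (1 2 = 1 "Low") (3 = 2 "Middle") (4/7 = 3 "High"), generate(item3)

      * cross-tabulate old against new values to check the rules
      tabulate item item3
      ```

      Using generate() keeps the original variable intact, which makes it easy to check the recoding afterwards.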

      You can always try the generate/replace approach, which is longer than recode but usually gets you where you want. Suppose a variable (original_variable) is continuous between 0 and 1, and you want to categorize it according to certain cutoffs:

      Code:
      gen categorical_variable = .
      replace categorical_variable = 1 if original_variable <= 0.25 & categorical_variable==.
      replace categorical_variable = 2 if original_variable <= 0.50 & categorical_variable==.
      replace categorical_variable = 3 if original_variable <= 0.75 & categorical_variable==.
      replace categorical_variable = 4 if original_variable <= 1 & categorical_variable==.
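
      The same cutoffs can also be applied in a single line with the irecode() function. This is a sketch on simulated data; the seed and the number of observations are arbitrary choices, not something from the thread.

      ```stata
      clear
      set obs 100
      set seed 12345
      * hypothetical data: uniform draws between 0 and 1
      generate original_variable = runiform()

      * irecode() returns 0 if x <= 0.25, 1 if 0.25 < x <= 0.50, 2 if 0.50 < x <= 0.75,
      * and 3 if x > 0.75, so adding 1 numbers the groups 1 to 4;
      * observations with a missing original_variable stay missing
      generate byte categorical_variable = 1 + irecode(original_variable, 0.25, 0.50, 0.75)

      tabulate categorical_variable
      ```

      One difference from the replace chain: any value above 1 would land in group 4 here, whereas the replace chain would leave it missing.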


      • #4
        Thank you Igor and Rich for your insights,

        recode money(HighA HighB HighC = 1) (HighD HighE HighF=2) (LowA LowB LowC=3) (LowD LowE=4),gen(money_group)

        This was the original code I used, after basing it on similar posts relating to continuous variables.


        • #5
          Thanks for giving your code.

          Code:
          recode money(HighA HighB HighC = 1) (HighD HighE HighF=2) (LowA LowB LowC=3) (LowD LowE=4),gen(money_group)
          But it's illegal. I guess you got an error message objecting to HighA as part of a rule for recoding. Stata bailed out at the first problem.

          Beyond that we're guessing because you are not giving a data example and not explaining your rules clearly enough for anyone to give a precise answer.

          Perhaps HighA to HighF and LowA to LowE are indicators (1 or 0 or possibly 1 and missing) and you want some classification such as

          if HighA or HighB or HighC is 1 then return 1
          otherwise if HighD or HighE or HighF is 1 then return 2
          otherwise if LowA or LowB or LowC is 1 then return 3
          otherwise if LowD or LowE is 1 return 4

          But this is just guessing. And where does money fit in?
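
          If that guess were right, the rule could be sketched with nested cond() calls. Everything here is an assumption: the toy data, the 0/1 coding of the indicators, and the new variable name group. The /// line continuations work in a do-file, not when typed interactively.

          ```stata
          clear
          * hypothetical 0/1 indicators, one row per person
          input byte(HighA HighB HighC HighD HighE HighF LowA LowB LowC LowD LowE)
          1 0 0 0 0 0 0 0 0 0 0
          0 0 0 0 1 0 0 0 0 0 0
          0 0 0 0 0 0 0 1 0 0 0
          0 0 0 0 0 0 0 0 0 0 1
          end

          * the first condition that holds decides the group; missing if none holds
          generate byte group = cond(HighA == 1 | HighB == 1 | HighC == 1, 1, ///
              cond(HighD == 1 | HighE == 1 | HighF == 1, 2, ///
              cond(LowA == 1 | LowB == 1 | LowC == 1, 3, ///
              cond(LowD == 1 | LowE == 1, 4, .))))

          list group
          ```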

          Really, the onus is on you to say more, as otherwise you're hoping that we can read minds.

          https://www.statalist.org/forums/help#stata applies as always.


          • #6
            Apologies, I am new to STATAlist and will try to be more precise in future.

            Money is a person's level of income. It is not continuous: there are 11 possible answers that indicate the person's level of income (HighA, HighB, HighC, HighD, HighE, HighF, LowA, LowB, LowC, LowD, LowE). What I would like to do is create 4 groups out of the 11 answers, e.g. Very high, High, Low, Very low.

            Very High = HighA,HighB,HighC

            High = HighD,HighE,HighF

            Low=LowA,LowB,LowC

            Very Low=LowD,LowE


            • #7
              Future precision would be very good, but you're just ignoring my request now for a data example using dataex. In fact, we have asked three times now for you to do that (Rich in #2, Igor in #3, me in #5).

              I don't know why you think that's not needed, because it really would help.

              You haven't even read the FAQ Advice carefully, as every new message prompt requests, if you think that the forum is called STATAlist.

              Is money a string variable with string values like HighA? If so, recode is quite wrong, as the help explains:


              recode changes the values of numeric variables according to the rules specified.
              Your problem is surely soluble, but guessing exactly what it is is not entertaining.
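
              One way to check whether money is a string variable is sketched below. The two-row dataset is a stand-in for the real data, which was never posted, and the messages are purely illustrative.

              ```stata
              clear
              * stand-in data; the real dataset was not shown in the thread
              input str5 money
              "HighA"
              "LowE"
              end

              * report the storage type; a type such as str5 means money is a string
              describe money

              * confirm exits with an error unless money is a string variable;
              * capture suppresses that error and leaves the return code in _rc
              capture confirm string variable money
              if _rc == 0 {
                  display "money is a string variable, so recode will not accept it"
              }
              else {
                  display "money is numeric; HighA and so on may be value labels"
              }
              ```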
              Last edited by Nick Cox; 07 Aug 2019, 05:34.


              • #8
                As Nick pointed out, it's guesswork, because I'm not sure how your money variable is recorded. In any case, it seems to be a string (text) variable, and you want to reclassify the observations by combining certain categories. Try the following.

                Code:
                clear
                input str5 money
                "HighA"
                "HighB"
                "HighC"
                "HighD"
                "HighE"
                "HighF"
                "LowA"
                "LowB"
                "LowC"
                "LowD"
                "LowE"
                end
                
                gen test = ""
                replace test="1" if (money=="HighA" | money=="HighB" | money=="HighC")
                replace test="2" if (money=="HighD" | money=="HighE" | money=="HighF")
                replace test="3" if (money=="LowA" | money=="LowB" | money=="LowC")
                replace test="4" if (money=="LowD" | money=="LowE")
                
                *the values of test may look odd because they increase as the financial condition decreases, but this follows what you posted in #4
                
                *to convert test into a numeric variable (it was created as a string above)
                destring test, replace
                
                *to add labels
                label define test_label 1 "Very rich" 2 "Rich" 3 "Poor" 4 "Very poor"
                label values test test_label
                If this doesn't work and help is still needed, please post a snippet of your data using dataex. More information on this can be found in the forum FAQ, linked in prior replies.
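
                A shorter variant of the same idea is sketched below. It builds the numeric group directly with inlist(), so no destring step is needed. It still assumes money is a string variable holding exactly these values; the label name money_group_lbl is made up, and the label text follows the group names given in #6.

                ```stata
                clear
                * a few example values; the real data were not posted
                input str5 money
                "HighA"
                "HighD"
                "LowB"
                "LowE"
                end

                * inlist() is true if money equals any of the listed values
                generate byte money_group = .
                replace money_group = 1 if inlist(money, "HighA", "HighB", "HighC")
                replace money_group = 2 if inlist(money, "HighD", "HighE", "HighF")
                replace money_group = 3 if inlist(money, "LowA", "LowB", "LowC")
                replace money_group = 4 if inlist(money, "LowD", "LowE")

                * attach value labels to the four groups
                label define money_group_lbl 1 "Very high" 2 "High" 3 "Low" 4 "Very low"
                label values money_group money_group_lbl

                * check that each original answer landed in the intended group
                tabulate money money_group
                ```

                Any value of money not listed in one of the inlist() calls stays missing in money_group, which makes misspelled or unexpected answers easy to spot in the table.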
