
  • Creating a categorical variable with several conditions

    Dear all,
    I am a Stata beginner and am currently struggling to create a categorical variable that depends on three conditions.
    My goal is the following:
    I want to create a variable that indicates whether the observation (firm data) is part of a group based on 3 conditions - the country group, the industry, and the size of the firm.
    CountryGroup is a variable indicating 1 to 4, Size is a variable indicating 1 to 4, and Industry is a variable that holds the industry code (27 different industries).
    For now, the only way I have found to create a variable that indicates each possible combination is as follows:

    >generate Category=1 if Industry==15 & Size==1 & CountryGroup==1
    >replace Category=2 if Industry==15 & Size==1 & CountryGroup==2
    >replace Category=3 if Industry==15 & Size==1 & CountryGroup==3
    >replace Category=4 if Industry==15 & Size==1 & CountryGroup==4
    >replace Category=4 if Industry==15 & Size==2 & CountryGroup==1
    ....


    Is there a way to code this in a more efficient way than to write (4x4x27) lines of code? So far, I've only found solutions for categorical variables that indicate 0 / 1 or categorical variables that only depend on one condition.

    Thanks a lot!
    Dana

    Last edited by Dana Fuhs; 22 Sep 2020, 03:00. Reason: categorical variable

  • #2
    The problem is that you don't tell us your definitions. Why for example are these quite different combinations

    Code:
    replace Category=4 if Industry==15 & Size==1 & CountryGroup==4
    replace Category=4 if Industry==15 & Size==2 & CountryGroup==1
    both mapped to 4? On the other hand, perhaps you didn't mean what you typed and wanted

    Code:
    replace Category=5 if Industry==15 & Size==2 & CountryGroup==1
    It's possible that

    Code:
    egen Category = group(Industry Size CountryGroup)
    is what you are seeking, but that is going to be hard to interpret unless you insist on value labels

    Code:
    egen Category = group(Industry Size CountryGroup), label
    and even then why do you think such a variable will be more useful than the existing variables Industry Size CountryGroup?
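As a minimal sketch of what the labelled version produces, the following builds a tiny made-up dataset (the five rows are purely illustrative and not from the original post) and applies the same egen call. Groups are numbered in sorted order of the three variables, and with the label option each value of Category carries a label built from the underlying values.

```stata
* Toy data: values are invented for illustration only
clear
input Industry Size CountryGroup
15 1 1
15 1 2
15 2 1
20 1 1
15 1 1
end

* One integer code per distinct combination that occurs in the data
egen Category = group(Industry Size CountryGroup), label

* The two rows with Industry 15, Size 1, CountryGroup 1 share one code
list
```

Observations with a missing value in any of the three variables get a missing Category unless the missing option of group() is added.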



    • #3
      If you want a categorical variable with one category for each combination of the three variables that occurs in your data, there is a much simpler way to do this:

      Code:
      egen newvariable = group(Industry Size CountryGroup)
      Please read the FAQ before your next post. You should use code delimiters when you post examples of your code.


      • #4
        Thank you! Apparently, I didn't understand how the "egen group()" function would work for my problem but now my problem is solved!

        For explanatory reasons (replying to Nick Cox): I need to calculate a variable that represents a firm's performance within one group and I wanted to do that in two steps, first creating the category and then calculating the relative performance using that categorical variable. And sorry, yes, the 5 was a mistake!

        Thanks again!
        Last edited by Dana Fuhs; 22 Sep 2020, 03:57.


        • #5
          Good that you found the answers helpful, but the explanation
          I need to calculate a variable that represents a firm's performance within one group and I wanted to do that in two steps, first creating the category and then calculating the relative performance using that categorical variable.
          doesn't answer my question in #2.


          • #6
            I wanted to create that variable in order to have a better overview of the data composition for each of the groups created by that categorical variable. Maybe this is not the best way to proceed with my task but so far I am quite new to Stata and this appeared to be my best option.


            • #7
              It is important to know where this task is going, because as Nick pointed out, most of the time you do not need to compress your three variables into one categorical variable.

              E.g., if firm performance is average return,

              Code:
              egen meanret = mean(ret), by(Industry Size CountryGroup)
              would give you the average return by the group defined by the three variables, and the intermediate step of creating a group variable is not needed.
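A possible sketch of the two-step idea from #4 without the intermediate group variable, assuming that "relative performance" means the deviation of a firm's return from its group mean (that definition, the variable name relret, and the toy values are assumptions, not something stated in the thread):

```stata
* Toy data: ret values are invented for illustration only
clear
input Industry Size CountryGroup ret
15 1 1 0.10
15 1 1 0.30
15 1 2 0.05
15 1 2 0.15
end

* Group mean of ret for each Industry x Size x CountryGroup combination
egen meanret = mean(ret), by(Industry Size CountryGroup)

* Assumed definition: firm return minus the mean of its group
generate relret = ret - meanret

list
```

A ratio or a rank within group would follow the same pattern, with only the generate line changing.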


              • #8
                #1 explains that you have 432 cross-combinations. Combining them in a single categorical variable will make some tabulations and listings easier -- although if that is what you want, do ask, as there are alternatives -- but I defy anyone to make easy sense of a table with 432 cells. So, it would seem that making sense of performance would depend on some kind of model, simple though it might need to be.

                You don't have to defend yourself or apologise for being new to Stata, but in turn questions from the more grizzled on why you want to do this sometimes reveal that someone is looking in unfruitful directions.


                • #9
                  Thank you very much, both answers are very helpful! I definitely have to take a further look at "egen" to understand how it can help me create the variables I need!
                  Creating the category-variable already helped me to understand that I need to find a better way to group the observations (as some cross-combinations lack enough observations to continue)!
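One way to see which cross-combinations are thin, sketched on made-up data: count the observations in each cell and list the cells that fall below a cutoff. The cutoff min_obs and the variable names n_cell and first are hypothetical choices for illustration.

```stata
* Toy data: values are invented for illustration only
clear
input Industry Size CountryGroup
15 1 1
15 1 1
15 1 1
15 1 2
20 2 4
end

* Number of observations in each Industry x Size x CountryGroup cell
bysort Industry Size CountryGroup: generate n_cell = _N

* Flag one observation per cell so each combination is listed once
egen byte first = tag(Industry Size CountryGroup)

* min_obs is a hypothetical threshold; choose one that suits the analysis
local min_obs 3
list Industry Size CountryGroup n_cell if first & n_cell < `min_obs'
```

If the original data are not needed afterwards, contract Industry Size CountryGroup gives the same counts as a dataset of frequencies.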


                  • #10
                    Indeed you would need to set this for every dataset. I think of setting a variable's display format as akin to giving it a variable label and value labels, so I don't find it onerous to include in the coding/cleaning do-file, but obviously you can disagree :-)


                    • #11
                      Looking at 432 combinations is a tough call but a start could be e.g. mean of all + means for each category of single predictors. In your case that would be 1 + 4 + 4 + 27 = 36 means. For means, read any other standard summary.

                      Consider for example:

                      Code:
                      . webuse nlswork, clear
                      (National Longitudinal Survey.  Young Women 14-26 years of age in 1968)
                      
                      . designplot ln_wage  south race year, max(1) exclude0
                      
                      . search designplot, sj
                      
                      Search of official help files, FAQs, Examples, and Stata Journals
                      
                      SJ-19-3 gr0061_3  . . . . . . . . . . . . . . . Software update for designplot
                              (help designplot if installed)  . . . . . . . . . . . . . .  N. J. Cox
                              Q3/19   SJ 19(3):748--751
                              any attempt to use the missing option of graph dot,
                              graph hbar, or graph bar is now ignored and advice on
                              what to do instead is shown
                      
                      SJ-17-3 gr0061_2  . . . . . . . . . . . . . . . Software update for designplot
                              (help designplot if installed)  . . . . . . . . . . . . . .  N. J. Cox
                              Q3/17   SJ 17(3):779
                              help file updated
                      
                      SJ-15-2 gr0061_1  . . . . . . . . . . . . . . . Software update for designplot
                              (help designplot if installed)  . . . . . . . . . . . . . .  N. J. Cox
                              Q2/15   SJ 15(2):605--606
                              bug fixed for Stata 14
                      
                      SJ-14-4 gr0061  Design plots for graphical summary of a response given factors
                              (help designplot if installed)  . . . . . . . . . . . . . .  N. J. Cox
                              Q4/14   SJ 14(4):975--990
                              produces a graphical summary of a numeric response variable
                              given one or more factors
                      [Image: designplot.png]
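For readers without designplot installed, the overall mean plus the one-way means it summarises can be approximated as tables with official commands. This is only a rough sketch on the same example data; it is not what designplot does internally, and webuse needs an internet connection.

```stata
* Same example dataset as in the post above
webuse nlswork, clear

* Overall mean of the response
summarize ln_wage

* One-way means for each category of each single predictor
tabstat ln_wage, by(south) statistics(mean n)
tabstat ln_wage, by(race) statistics(mean n)
tabstat ln_wage, by(year) statistics(mean n)
```

Swapping mean for another statistic such as median in statistics() gives the other standard summaries mentioned above.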


                      • #12
                        That could really help me with my problem! As I am working with the BEEPS data (EBRD/World Bank data), I am facing a highly diverse data set (35+ countries, different firm sizes, and industries) and this appears to be a great way to tackle this challenge.
                        Thanks a lot!
