  • Cluster analysis STATA

    Hello Statalist,

    I have a model with 5 binary independent variables and one dependent variable (company profit).
    The 5 binary variables indicate attributes of an organization: each takes the value 1 if the attribute is present and 0 otherwise.

    I would like to investigate the impact of every possible combination of the 5 binary independent variables on company profit.
    Is this done via a cluster analysis? Could you help me do this in Stata?

    Thank you so much!

  • #2
    Does anyone have any ideas?



    • #3
      If you have 5 binary independent variables and want to look at all combinations, that gives a total of 2^5 = 32 combinations to test. Provided that your dataset is large enough and each combination actually occurs, you can even do this "manually", which might be a lot cleaner.
      Code:
      egen groups = group(var1 var2 var3 var4 var5)
      tab groups
      tabstat var1 var2 var3 var4 var5, by(groups)
      reg profit i.groups
      Best wishes

      Stata 18.0 MP | ORCID | Google Scholar



      • #4
        Thank you!

        I am pretty sure that not all of the combinations exist; for instance, there may be no cases with var1=1 and var2=1.
        Would it still be possible to calculate interaction terms with this data?



        • #5
          The code will work as is. The main difference is that it will create fewer than 32 groups. I would also exclude all groups with a very small number of cases manually, since their results are probably unstable.
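          A minimal sketch of that manual step, assuming the variable names from #3 (var1 to var5, profit) and an arbitrary cutoff of 10 cases per group. The data are simulated only so that the example runs on its own:

          ```stata
          * Simulated data for illustration only (not the poster's data)
          clear
          set seed 2024
          set obs 500
          forvalues j = 1/5 {
              generate byte var`j' = runiform() < 0.5
          }
          generate double profit = rnormal()

          * One group per observed combination, as in #3
          egen groups = group(var1 var2 var3 var4 var5)

          * Number of cases in each group
          bysort groups: generate long n_group = _N

          * How many observations sit in small groups (the cutoff of 10 is an assumption)
          count if n_group < 10

          * Fit the model on the sufficiently large groups only
          regress profit i.groups if n_group >= 10
          ```

          The cutoff is a judgment call; it is worth checking how sensitive the estimates are to it.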
          Best wishes



          • #6
            Thank you very much!
            And the interaction terms will work as well, right?


            One more question with regard to cluster analysis:
            I have a dataset with two columns. The first column contains my independent variable: characteristics of a company, separated by commas. So for instance "A,C, D" in the first row and "C, A" in the second row. The second column contains my dependent variable, which is the revenue of a company.

            Is there a way in Stata to compute a regression that shows which of the characteristics (for instance "A") has what kind of impact on the revenue? Is there also a way to see which combinations work best (for instance "A" with "C")?



            • #7
              Technically, the interactions are already accounted for when the groups are created, so you do not have to specify any interactions in your regression model.
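              As a rough illustration of why: when all five indicators are fully interacted, the factorial model has one parameter per observed combination, so it should reproduce the fit of i.groups. The sketch below uses simulated data and the variable names from #3; both are assumptions made for the example.

              ```stata
              * Simulated data for illustration only (not the poster's data)
              clear
              set seed 2024
              set obs 500
              forvalues j = 1/5 {
                  generate byte var`j' = runiform() < 0.5
              }
              generate double profit = var1 + 0.5*var2*var3 + rnormal()

              * Model 1: one dummy per observed combination
              egen groups = group(var1 var2 var3 var4 var5)
              regress profit i.groups
              scalar r2_groups = e(r2)

              * Model 2: full factorial of the five indicators
              regress profit i.var1##i.var2##i.var3##i.var4##i.var5
              scalar r2_factorial = e(r2)

              * The two R-squared values should agree up to rounding
              display r2_groups "  " r2_factorial
              ```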
              The second question is a bit unclear to me and you might want to post an example dataset here. In any case, you need to encode the data correctly and make sure that you separate all the variables. Stata cannot compute regressions with strings or data separated by commas.
              Best wishes



              • #8
                Thank you.

                Regarding the first question about the interaction: what I meant was adding an additional interaction variable, for instance "firm size". So would it be possible to add the interaction between firm size and the characteristics (reg profit c.firmsize##i.groups)?

                Regarding the second question:
                An exemplary dataset would be:
                Characteristic Profit
                A,C,D 34
                C,A 32
                S 12
                A, S, D, C 1
                C,A 2
                D 43
                A, C, D 43
                S 53



                • #9
                  Regarding the interactions: yes, this is possible, provided again that there are enough cases.
                  For the second question, you first need to break up the string variable, for example with split. See https://wlm.userweb.mwn.de/Stata/wstavart.htm and the help page for this command.
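                  One possible way to proceed, sketched with the example data from #8: split the string on commas, then build one 0/1 indicator per characteristic. The prefixes part and has_ and the hand-typed list A C D S are assumptions for this illustration, not anything Stata requires.

                  ```stata
                  * Example data as posted in #8
                  clear
                  input str12 Characteristic Profit
                  "A,C,D"      34
                  "C,A"        32
                  "S"          12
                  "A, S, D, C"  1
                  "C,A"         2
                  "D"          43
                  "A, C, D"    43
                  "S"          53
                  end

                  * Split on commas into part1, part2, ... (one piece per variable)
                  split Characteristic, parse(",") generate(part)

                  * One 0/1 indicator per characteristic
                  foreach c in A C D S {
                      generate byte has_`c' = 0
                      foreach v of varlist part* {
                          * strtrim() removes stray blanks, as in "A, S, D, C"
                          replace has_`c' = 1 if strtrim(`v') == "`c'"
                      }
                  }

                  * Effect of each characteristic. In this toy data A and C always
                  * occur together, so Stata will omit one of them as collinear.
                  regress Profit has_A has_C has_D has_S

                  * A pairwise combination, here A with D
                  regress Profit i.has_A##i.has_D
                  ```

                  With only 8 rows the estimates mean very little; the point is the data layout, which carries over to a real dataset.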
                  Best wishes



                  • #10
                    Thanks!!
                    How would you proceed after the splitting?

                    Is fsQCA an option that should be used here?



                    • #11
                      I like the idea at #3 but recommend

                      Code:
                      egen groups = group(var1 var2 var3 var4 var5), label



                      • #12
                        Thank you Nick!
                        Do you also have an idea how to solve the issue with the characteristics in #8?



                        • #13
                          Sorry, but I don't understand what that problem is.



                          • #14
                            Thanks Nick. Does the following clarify the problem?

                            I have a dataset with two columns.
                            The first column contains my independent variable. The first column contains string characteristics of a company that are separated by a comma. So for instance "A,C, D" in the first row and "C, A" in the second row.
                            The second column contains my dependent variable which is the revenue of a company (numeric variable).

                            Is there a way in Stata to compute a regression that shows which of the characteristics (for instance "A") has what kind of impact on the revenue? Is there also a way to see which combinations work best (for instance "A" with "C")?

                            An exemplary dataset would be the following:
                            Characteristic Revenue
                            A,C,D 34
                            C,A 32
                            S 12
                            A,S,D,C 1
                            C,A 2
                            D 43
                            A,D,D 43
                            S 53



                            • #15
                              With just these two variables, the regression boils down to fitting separate means any way you prefer, say

                              Code:
                              tabstat Revenue , by(Characteristic)
                              although you can get the usual machinery of P-values and confidence intervals by

                              Code:
                              encode Characteristic, gen(Which)
                              and then

                              Code:
                              regress Revenue i.Which
                              I guess you just made up your data example, but "C,A" appears twice and "A, D, D" perhaps means "A, D".

                              Which combination works best perhaps means which produces the highest mean revenue, but the usual qualifiers may apply:

                              1. A mean may be pulled up by one or more very high values, so consider other summary statistics too.

                              2. Sometimes a variable like Revenue should be analysed on a logarithmic scale.
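                              Both qualifiers can be sketched as follows, using the example data exactly as posted in #14 (including the "A,D,D" row). The variable name log_revenue is an assumption for this illustration.

                              ```stata
                              * Example data as posted in #14
                              clear
                              input str12 Characteristic Revenue
                              "A,C,D"    34
                              "C,A"      32
                              "S"        12
                              "A,S,D,C"   1
                              "C,A"       2
                              "D"        43
                              "A,D,D"    43
                              "S"        53
                              end

                              * Means alongside medians and extremes for each combination
                              tabstat Revenue, by(Characteristic) statistics(n mean median min max)

                              * Log scale; ln() is only defined for positive revenue
                              generate double log_revenue = ln(Revenue) if Revenue > 0

                              * Same regression as above, now on the log scale
                              encode Characteristic, generate(Which)
                              regress log_revenue i.Which
                              ```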


