Cluster analysis STATA

OLMABA JALA

Join Date: Jan 2021

Posts: 68
#1

Cluster analysis STATA

10 May 2021, 11:50

Hello Statalist,

I have a model with 5 binary independent variables and one dependent variable (company profit).
The 5 binary independent variables indicate attributes of an organization. If an attribute is fulfilled, the variable takes the value "1", otherwise "0".

What I would like to investigate is the impact of every possible combination of the 5 binary independent variables on the company profit.
Is this done via a cluster analysis? Could you help me do this in Stata?

Thank you so much!
Tags: cluster, regression
OLMABA JALA

Join Date: Jan 2021

Posts: 68
#2

10 May 2021, 23:48

Does anyone have any ideas?
Comment
Felix Bittmann

Join Date: Aug 2018

Posts: 694
#3

11 May 2021, 00:11

If you have 5 binary independent variables and you want to look at all combinations, this gives you a total of 2^5=32 combinations to test. Provided that your dataset is large enough and each combination actually exists, you can do this even "manually", which might be a lot cleaner.

Code:

egen groups = group(var1 var2 var3 var4 var5)
tab groups
tabstat var1 var2 var3 var4 var5, by(groups)
reg profit i.groups

Best wishes

Stata 18.0 MP | ORCID | Google Scholar
1 like
Comment
OLMABA JALA

Join Date: Jan 2021

Posts: 68
#4

11 May 2021, 00:19

Thank you!

I am pretty sure that not all of the combinations exist, meaning that, for instance, there may be no observation with var1=1 and var2=1.
Would it still be possible to calculate interaction terms with this data?
Comment
Felix Bittmann

Join Date: Aug 2018

Posts: 694
#5

11 May 2021, 00:22

The code will work as is. The main difference is that you will create fewer than 32 groups. I would also exclude all groups with a very small number of cases manually, since their results are probably unstable.
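A minimal sketch of excluding small groups, assuming the variable groups from #3 already exists. The cutoff of 30 cases and the name n_group are illustrative assumptions, not something stated in this thread; choose a cutoff that suits your data.

```stata
* count the number of cases in each combination
bysort groups: generate long n_group = _N

* inspect which combinations fall below the (assumed) cutoff
tabulate groups if n_group < 30

* fit the model on the sufficiently large groups only
regress profit i.groups if n_group >= 30
```

Restricting with an if qualifier instead of dropping observations keeps the full dataset available for later checks.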

Best wishes

Stata 18.0 MP | ORCID | Google Scholar
1 like
Comment
OLMABA JALA

Join Date: Jan 2021

Posts: 68
#6

11 May 2021, 00:35

Thank you very much!
And the interaction terms will work as well, right?

One more question with regard to cluster analysis:
I have a dataset with two columns. The first column contains my independent variable: characteristics of a company, separated by commas. So for instance "A,C, D" in the first row and "C, A" in the second row. The second column contains my dependent variable, which is the revenue of a company.

Is there a way in Stata to compute a regression to find out which of the characteristics (for instance "A") has what kind of impact on the revenue? Is there also a way to see what kind of combinations work best (for instance "A" with "C")?
Comment
Felix Bittmann

Join Date: Aug 2018

Posts: 694
#7

11 May 2021, 00:45

Technically, the interactions are already accounted for by creating the groups, so you do not have to specify any interactions in your regression model.
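To illustrate this point, here is a sketch comparing the group model with a fully interacted model of the five binary variables (names as in #3). Both are saturated in the observed combinations, so they should yield the same fitted values and R-squared; only the parameterization of the coefficients differs. Stata omits terms for combinations that do not occur in the data.

```stata
* one indicator per observed combination
regress profit i.groups

* the same model written as a full factorial of the five binary variables
regress profit i.var1##i.var2##i.var3##i.var4##i.var5
```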
The second question is a bit unclear to me and you might want to post an example dataset here. In any case, you need to encode the data correctly and make sure that you separate all the variables. Stata cannot compute regressions with strings or data separated by commas.

Best wishes

Stata 18.0 MP | ORCID | Google Scholar
1 like
Comment
OLMABA JALA

Join Date: Jan 2021

Posts: 68
#8

11 May 2021, 01:03

Thank you.

Regarding the first question about the interaction: What I meant was adding an additional interaction variable, for instance "firm size". So would it be possible to add the interaction between firm size and the characteristics (reg profit c.firmsize##i.groups)?

Regarding the second question:
An exemplary dataset would be:

Characteristic    Profit
A,C,D             34
C,A               32
S                 12
A, S, D, C        1
C,A               2
D                 43
A, C, D           43
S                 53
Comment
Felix Bittmann

Join Date: Aug 2018

Posts: 694
#9

11 May 2021, 01:07

Regarding the interactions, yes, this is possible, given again that there are enough cases.
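A short sketch of that model, using the variable names from #8 (firmsize is assumed to be a continuous numeric variable). The margins call is an optional addition that reports the slope of firm size separately for each group.

```stata
* main effects of firm size and groups plus their interaction
regress profit c.firmsize##i.groups

* slope of firm size within each group
margins groups, dydx(firmsize)
```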
For the second question, you need to encode the string variable using, for example, split. See https://wlm.userweb.mwn.de/Stata/wstavart.htm and the help page for this command.
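One possible way to proceed with split, sketched under the assumption that the variables are named Characteristic and Profit as in the example in #8 and that the only characteristics are A, C, D and S. The names part and has_A to has_S are hypothetical.

```stata
* remove blanks so that "A, C" and "A,C" are treated alike
replace Characteristic = subinstr(Characteristic, " ", "", .)

* split on commas into part1, part2, ...
split Characteristic, parse(",") generate(part)

* build one 0/1 indicator per characteristic
foreach c in A C D S {
    generate byte has_`c' = 0
    foreach v of varlist part* {
        replace has_`c' = 1 if `v' == "`c'"
    }
}

* effect of each characteristic, holding the others fixed
regress Profit i.has_A i.has_C i.has_D i.has_S
```

With indicators like these, the order of the letters in the string no longer matters, and a specific pairing such as "A" with "C" could be examined with a term like i.has_A##i.has_C, given enough cases.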

Best wishes

Stata 18.0 MP | ORCID | Google Scholar
1 like
Comment
OLMABA JALA

Join Date: Jan 2021

Posts: 68
#10

11 May 2021, 01:19

Thanks!!
How would you proceed after the splitting?

Is fsQCA an option that should be used here?
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35698
#11

11 May 2021, 01:58

I like the idea at #3 but recommend

Code:

egen groups = group(var1 var2 var3 var4 var5), label
1 like
Comment
OLMABA JALA

Join Date: Jan 2021

Posts: 68
#12

11 May 2021, 02:08

Thank you Nick!
Do you also have an idea how to solve the issue with the characteristics in #8?
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35698
#13

11 May 2021, 02:17

Sorry, but I don't understand what that problem is.
1 like
Comment
OLMABA JALA

Join Date: Jan 2021

Posts: 68
#14

11 May 2021, 02:25

Thanks Nick. Does the following clarify the problem?

I have a dataset with two columns.
The first column contains my independent variable. The first column contains string characteristics of a company that are separated by a comma. So for instance "A,C, D" in the first row and "C, A" in the second row.
The second column contains my dependent variable which is the revenue of a company (numeric variable).

Is there a way in Stata to compute a regression to find out which of the characteristics (for instance "A") has what kind of impact on the revenue? Is there also a way to see what kind of combinations work best (for instance "A" with "C")?

An exemplary dataset would be the following:

Characteristic    Revenue
A,C,D             34
C,A               32
S                 12
A,S,D,C           1
C,A               2
D                 43
A,D,D             43
S                 53
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35698
#15

11 May 2021, 02:40

With just these two variables, the regression boils down to fitting separate means any way you prefer, say

Code:

tabstat Revenue , by(Characteristic)

although you can get the usual machinery of P-values and confidence intervals by

Code:

encode Characteristic, gen(Which)

and then

Code:

regress Revenue i.Which

I guess you just made up your data example, but "C,A" appears twice and "A, D, D" perhaps means "A, D".

Which combination works best perhaps means which produces the highest mean revenue, but the usual qualifiers may apply:

1. A mean may be pulled up by one or more very high values, so consider other summary statistics too.

2. Sometimes a variable like Revenue should be analysed on a logarithmic scale.
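The two qualifiers above could be checked along these lines. This is a sketch that assumes the variable Which created by encode earlier in this post; the name log_revenue is illustrative. Note that ln() returns missing for zero or negative revenue, so those observations would drop out of the second regression.

```stata
* compare means with medians and group sizes
tabstat Revenue, by(Characteristic) statistics(n mean median min max)

* the same regression on a logarithmic scale
generate double log_revenue = ln(Revenue)
regress log_revenue i.Which
```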
Comment
