Group overlap

giorgioconti

Join Date: Oct 2014

Posts: 13
#1

14 Nov 2016, 14:26

Dear Stata users,

I am sorry in advance if this question is not directly related to Stata's functionality.

I have data on companies nested within clusters over several years. Unfortunately, the clustering algorithm numbered the clusters arbitrarily in each year, and now I need to make the cluster numbering consistent across years.

I have the following data now (this is just an example, the actual dataset is significantly bigger with thousands of firms nested within clusters over a decade):

firm_id cluster_id year
Firm1 1 2001
Firm2 1 2001
Firm3 1 2001
Firm4 2 2001
Firm5 2 2001
Firm6 2 2001
Firm7 2 2001
Firm4 1 2002
Firm5 1 2002
Firm8 1 2002
Firm1 2 2002
Firm2 2 2002
Firm9 2 2002

What can be seen here is that Firm1 and Firm2 are in Cluster2 in 2002. Apparently this is Cluster1 from 2001. The percentage of member overlap between Cluster1 in 2001 and Cluster2 in 2002 is thus 50% (2 firms shared between the two clusters out of 4 firms across both communities). What I would like to do is rename Cluster2 in 2002 as Cluster1 whenever the member overlap reaches a certain percentage (say, 50%).

I would be grateful for your help.

Best,
Giorgio
Charlie Joyez

Join Date: Dec 2014

Posts: 421
#2

15 Nov 2016, 01:38

Giorgio, this seems rather hazardous. Why not create a new cluster variable that directly fits your goal?
What's the clustering logic here? See whether egen's group() function could help you (help egen).

What you're trying to do is quite complicated, since you want Stata to go through all the firm_id values in one cluster (each year?) and check how many of them were already in another cluster in another year (or just the previous year?). We could find a way to do that, but I'm not convinced of the relevance of the result.

And a small Stata note: you don't want to "rename" Cluster2 as Cluster1, but to replace cluster==2 with cluster==1. In Stata, rename applies to variable names and replace to their values.
Also, please consider using dataex (ssc install dataex) to post an example of your data.

Best,
Charlie
giorgioconti

Join Date: Oct 2014

Posts: 13
#3

15 Nov 2016, 03:19

Dear Charlie,

Thank you for your reply.

I used a clustering algorithm (e.g., hierarchical clustering) to identify the (topological) clusters of firms in a given year. It is like having the friendship networks of individuals and identifying which social groups they belong to from year to year. With several years of data, the clustering algorithm has to be run separately for each year. The problem is that the clusters will not be numbered consistently from year to year (e.g., if three individuals were in a cluster numbered 1 in year 2000, this cluster may be numbered 2 in year 2001, even though it is the same cluster), and the membership overlap will vary (the network is dynamic). What I want to do is number such clusters consistently. The -egen group- command will not help here because it does not guarantee such consistency.

You are right, I don't want to "rename" Cluster2 as Cluster1, but to replace cluster==2 with cluster==1.

I agree that this is not easy to do and Stata may not be the most suitable software here.

Best,
Giorgio
Charlie Joyez

Join Date: Dec 2014

Posts: 421
#4

15 Nov 2016, 03:35

Giorgio,
Thanks for the explanations, it makes more sense now. I know a little about network analysis and topological structure, and it is true that Stata might not be the best tool for it, but it can be done (although using Mata might sometimes be useful as well).

If you want the consistency to be perfect, simply run the clustering algorithm for the first year and then keep those clusters in every later year.

Code:

clear
input str10 firm_id cluster_id year
Firm1 1 2001
Firm2 1 2001
Firm3 1 2001
Firm4 2 2001
Firm5 2 2001
Firm6 2 2001
Firm7 2 2001
Firm4 1 2002
Firm5 1 2002
Firm8 1 2002
Firm1 2 2002
Firm2 2 2002
Firm9 2 2002
end

tab cluster if firm=="Firm1"
bysort firm (year) : replace cluster=cluster[1]
tab cluster if firm=="Firm1"

I'm still not really convinced by the intermediate solution you want: dynamic clusters, but with arbitrary intervention to improve consistency across years. It seems to me that you want to reconcile two opposite methods. But if you're sure that's what you need, I'll come back later today with some code (no time to test it now), unless someone else suggests something in the meantime.

Best,
Charlie
giorgioconti

Join Date: Oct 2014

Posts: 13
#5

15 Nov 2016, 04:54

Charlie, I very much appreciate your effort and help!

I can provide a quote from the original paper whose procedure I am trying to replicate (the quote is the next paragraph).

To trace the dynamics of the identified network communities over time, we matched them over contiguous years on the basis of the extent to which they consisted of the same firms. Formally, we defined the overlap between two communities as $(C_{i,t} \cap C_{j,t+1}) / (C_{i,t} \cup C_{j,t+1})$, where $C_{i,t} \cap C_{j,t+1}$ was the number of unique community members shared by both communities from year t to t+1 and $C_{i,t} \cup C_{j,t+1}$ was the number of all community members present in both communities. A value of 0 indicated that communities did not share any members, and 1, that they shared all members. Using this rule, we considered $C_{i,t}$ and $C_{j,t+1}$ as a single dynamic community if the overlap between them was at least 30 percent and no other match provided a greater degree of overlap. Failing to satisfy the 30 percent requirement meant that the community in year t would be considered dissolved and the community in t+1 would be considered new.

By this logic, Cluster2 from 2002 in my example above should be recoded as Cluster1, because the degree of overlap is 50% (2 firms shared between the two clusters out of 4 firms across both communities). Easier said than done, unfortunately.
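
Written out with set sizes, the quoted rule and its application to the example data look roughly as follows. The cardinality bars and the name "overlap" are notation added here for clarity; they are not in the quote.

```latex
\[
\mathrm{overlap}(C_{i,t}, C_{j,t+1})
  = \frac{|C_{i,t} \cap C_{j,t+1}|}{|C_{i,t} \cup C_{j,t+1}|}
\]

% Cluster1 in 2001 = {Firm1, Firm2, Firm3}, Cluster2 in 2002 = {Firm1, Firm2, Firm9}
\[
\frac{|\{\text{Firm1}, \text{Firm2}\}|}
     {|\{\text{Firm1}, \text{Firm2}, \text{Firm3}, \text{Firm9}\}|}
  = \frac{2}{4} = 0.5
\]

% Cluster2 in 2001 = {Firm4, Firm5, Firm6, Firm7}, Cluster1 in 2002 = {Firm4, Firm5, Firm8}
\[
\frac{|\{\text{Firm4}, \text{Firm5}\}|}
     {|\{\text{Firm4}, \text{Firm5}, \text{Firm6}, \text{Firm7}, \text{Firm8}\}|}
  = \frac{2}{5} = 0.4
\]
```

Both pairs clear the 30 percent threshold from the quote, and in this small example neither cluster has a competing match with a larger overlap.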
Charlie Joyez

Join Date: Dec 2014

Posts: 421
#6

15 Nov 2016, 06:27

Ok, I think I've found a way to do what you want.
However I have some questions before that.
In your initial post, you say:

Originally posted by giorgioconti

The percentage of member overlap between Cluster1 in 2001 and Cluster2 in 2002 is thus 50% (2 firms shared between the two clusters out of 4 firms across both communities).

However, I would rather put the overlap at 67%: 2 firms shared (Firm1 and Firm2) out of the 3 in cluster 2 in 2002 (Firm1, Firm2, and Firm9). The same percentage could be computed over the 3 firms in cluster 1 in 2001 (Firm1, Firm2, and Firm3). Which cluster would you like to use as the reference for the percentages? In this case the result is the same, but it could differ. I would tend to compute it on the cluster to be recoded, that is, the later one (cluster 2 in 2002).

Also, in your example, following your logic, cluster 1 in 2002 should be recoded as 2, since Firm4 and Firm5 (so 2/3 of cluster 1 in 2002) belonged to cluster 2 in 2001.
However, you didn't mention that. Are you OK with this change?

Lastly, what should we do with firms that had no previous cluster (e.g. Firm8 in 2002)? Do we pretend they don't change cluster over time?

If you agree with these three premises (67% overlap, not 50%; cluster2 in 2002 should be changed; and first-time firms count among the firms that don't change clusters), I have code for you, but for now it is not very pretty. I'll try to improve it a little while waiting for your answer.

Best,
Charlie
giorgioconti

Join Date: Oct 2014

Posts: 13
#7

15 Nov 2016, 07:56

Charlie, I don't know why the authors of the original paper used the method that produces 50% rather than 67%. I computed the percentage following the logic in the quote from the original source. Indeed, 67% is another plausible way to measure cluster consistency.

I would appreciate it if you could provide syntax based on your three premises. Thank you!

Best,
Giorgio

Charlie Joyez

Join Date: Dec 2014
Posts: 421
#8

15 Nov 2016, 08:45

Ok, here's the code:

Code:

clear
input str10 firm_id  cluster_id year
Firm1 1 2001
Firm2 1 2001
Firm3 1 2001
Firm4 2 2001
Firm5 2 2001
Firm6 2 2001
Firm7 2 2001
Firm4 1 2002
Firm5 1 2002
Firm8 1 2002
Firm1 2 2002
Firm2 2 2002
Firm9 2 2002
end

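* Numeric firm identifier, required by xtset
sort firm_id
encode firm_id,gen(firm_code)

* Previous year's cluster for each firm (missing in the firm's first year)
xtset firm_code year
bysort firm_code (year) : gen L1cluster=L1.cluster

* Number of distinct cluster labels (the loop assumes they run from 1 to nclust)
distinct cluster /*ssc install distinct*/
local nclust=r(ndistinct)

gen final_cluster=cluster
forvalues i=1/`nclust'{

* Flag firms that were in cluster i last year and carry a different label now
gen diff_cluster`i'=(L1cluster!=cluster) if L1cluster==`i' &  year>2001

* Share of the current cluster's members that were in cluster i last year under a different label
bysort cluster year : egen nb_L1cluster`i'=total(diff_cluster`i')
bysort cluster year :  gen sh_L1cluster`i'=nb_L1cluster`i'/_N

* Recode the current cluster to i when that share exceeds 50%
replace final_cluster=`i' if sh_L1cluster`i'>0.5

}

drop diff_cluster* nb_L1cluster* sh_L1cluster*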

I'm sure the code could be improved.
Don't hesitate to ask questions.
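
For comparison, here is a minimal sketch of the union-based overlap from the paper quoted in #5, computed for every pair of clusters in consecutive years. It is not the code above and it only computes the overlap; it does not recode anything. The variable names size_now, size_next, cluster_next, one, shared and jaccard are made up for the illustration.

```stata
* Sketch only: union-based (Jaccard) overlap between clusters in years t and t+1
clear
input str10 firm_id cluster_id year
Firm1 1 2001
Firm2 1 2001
Firm3 1 2001
Firm4 2 2001
Firm5 2 2001
Firm6 2 2001
Firm7 2 2001
Firm4 1 2002
Firm5 1 2002
Firm8 1 2002
Firm1 2 2002
Firm2 2 2002
Firm9 2 2002
end

* Number of members of each cluster in each year
bysort year cluster_id: gen size_now = _N

* Save a copy shifted back one year, so that year t+1 lines up with year t
preserve
rename cluster_id cluster_next
rename size_now size_next
replace year = year - 1
tempfile next
save `next'
restore

* Keep one row per firm that is in some cluster in year t and in year t+1
joinby firm_id year using `next'

* Count shared members for each (cluster in t, cluster in t+1) pair
gen byte one = 1
collapse (sum) shared = one (first) size_now size_next, by(year cluster_id cluster_next)

* Union size = size in t + size in t+1 - shared members
gen jaccard = shared / (size_now + size_next - shared)
list, sepby(year)
```

On the example data this gives an overlap of 0.5 between cluster 1 in 2001 and cluster 2 in 2002, and 0.4 between cluster 2 in 2001 and cluster 1 in 2002. Applying the 30 percent rule would then mean keeping, for each cluster, the pair with the largest overlap and recoding only if it reaches the threshold.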

Best,
Charlie

Last edited by Charlie Joyez; 15 Nov 2016, 08:47.


giorgioconti

Join Date: Oct 2014

Posts: 13
#9

15 Nov 2016, 09:31

Thank you very much, Charlie! This is exactly what I had in mind! I deeply appreciate your help!

Best,
Giorgio