
  • Group overlap

    Dear Stata users,

    I am sorry in advance if this question is not directly related to Stata's functionality.

    I have data on companies nested within clusters over years. Unfortunately, the algorithm numbered the clusters arbitrarily within each year, so I now need to make the cluster labels consistent across years.

    I have the following data now (this is just an example, the actual dataset is significantly bigger with thousands of firms nested within clusters over a decade):

    firm_id cluster_id year
    Firm1 1 2001
    Firm2 1 2001
    Firm3 1 2001
    Firm4 2 2001
    Firm5 2 2001
    Firm6 2 2001
    Firm7 2 2001
    Firm4 1 2002
    Firm5 1 2002
    Firm8 1 2002
    Firm1 2 2002
    Firm2 2 2002
    Firm9 2 2002

    What can be seen here is that Firm1 and Firm2 are in Cluster2 in 2002, which is apparently Cluster1 from 2001. Percentage of member overlap between Cluster1 in 2001 and Cluster2 in 2002 is thus 50% (2 firms shared between two clusters over 4 firms in both communities). What I would like is to rename Cluster2 in 2002 to Cluster1 whenever member overlap reaches a given threshold (say, 50%).

    I would be grateful for your help.

    Best,
    Giorgio

  • #2
    Giorgio, this seems rather hazardous. Why not create a new cluster variable that directly fits your goal?
    What is the clustering logic here? See whether the egen group() function could help you (help egen).


    What you're trying to do is quite complicated, since you want Stata to go through every firm_id in one cluster (in each year?) and check how many of them were already assigned to another cluster in another year (or just the previous year?). We could find a way to do that, but I'm not convinced the result would be meaningful.

    A small Stata note: you don't want to "rename" Cluster2 as Cluster1, but to replace cluster==2 with cluster==1. In Stata, rename applies to variable names and replace to their values.
    Also, please consider using dataex (ssc install dataex) to post an example of your data.
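
    To make the distinction concrete, here is a minimal sketch on three rows of the example data. The new variable name cluster is chosen only for illustration, and the dataex line assumes the command is installed as described above.

    ```stata
    clear
    input str10 firm_id cluster_id year
    Firm1 2 2002
    Firm2 2 2002
    Firm9 2 2002
    end

    * rename changes the NAME of a variable; its values are untouched
    rename cluster_id cluster

    * replace changes the VALUES stored in a variable
    replace cluster = 1 if cluster == 2 & year == 2002

    * dataex prints the data as ready-to-paste input code for a forum post
    dataex firm_id cluster year
    ```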

    Best,
    Charlie



    • #3
      Dear Charlie,

      Thank you for your reply.

      I used a clustering algorithm (e.g., hierarchical clustering) to identify the (topological) clusters of firms in a given year. It is like having the friendship networks of individuals and identifying which social groups they belong to from year to year. With several years of data, the clustering algorithm has to be run separately for each year. The problem is that the clusters will not be numbered consistently from year to year (e.g., if three individuals were in a cluster numbered 1 in 2000, that cluster may be numbered 2 in 2001 even though it is the same cluster), and the membership overlap will vary because the network is dynamic. What I want is to make the numbering of such clusters consistent across years. The -egen group- command will not help here because it does not guarantee such consistency.

      You are right: I don't want to "rename" Cluster2 as Cluster1, but to replace cluster==2 with cluster==1.

      I agree that this is not easy to do and Stata may not be the most suitable software here.

      Best,
      Giorgio



      • #4
        Giorgio,
        Thanks for the explanations, it makes more sense now. I know a little about network analysis and topological structure, and it is true that Stata might not be the best tool for this, but it can be done (Mata may sometimes be useful as well).

        If you want the consistency to be perfect, simply run the clustering algorithm for the first year and then keep those clusters in every later year.
        Code:
        clear
        input str10 firm_id  cluster_id year
        Firm1 1 2001
        Firm2 1 2001
        Firm3 1 2001
        Firm4 2 2001
        Firm5 2 2001
        Firm6 2 2001
        Firm7 2 2001
        Firm4 1 2002
        Firm5 1 2002
        Firm8 1 2002
        Firm1 2 2002
        Firm2 2 2002
        Firm9 2 2002
        end
        * Labels of Firm1 before harmonising
        tab cluster_id if firm_id=="Firm1"
        * Give each firm the cluster it had in its first observed year
        bysort firm_id (year) : replace cluster_id=cluster_id[1]
        tab cluster_id if firm_id=="Firm1"
        I'm still not really convinced by the intermediate solution you want: dynamic clusters, but with an arbitrary intervention to improve consistency across years. It seems to me that you want to reconcile two opposite methods. But if you're sure that's what you need, I'll come back later today with some code (no time to test it now), unless someone else suggests something in the meantime.


        Best,
        Charlie



        • #5
          Charlie, I very much appreciate your effort and help!

          I can provide a quote from the original paper whose procedure I am trying to replicate (quoted below).

          "To trace the dynamics of the identified network communities over time, we matched them over contiguous years on the basis of the extent to which they consisted of the same firms. Formally, we defined the overlap between two communities as (C_{i,t} ∩ C_{j,t+1}) / (C_{i,t} ∪ C_{j,t+1}), where C_{i,t} ∩ C_{j,t+1} was the number of unique community members shared by both communities from year t to t+1 and C_{i,t} ∪ C_{j,t+1} was the number of all community members present in both communities. A value of 0 indicated that communities did not share any members, and 1, that they shared all members. Using this rule, we considered C_{i,t} and C_{j,t+1} as a single dynamic community if the overlap between them was at least 30 percent and no other match provided a greater degree of overlap. Failing to satisfy the 30 percent requirement meant that the community in year t would be considered dissolved and the community in t+1 would be considered new."

          Following this logic, Cluster2 from 2002 in my example above would be relabelled as Cluster1 because the degree of overlap is 50% (2 shared firms out of the 4 distinct firms across the two communities). Easier said than done, unfortunately.
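
          The overlap measure in the quote can be sketched in Stata as follows. This is only one possible reading of the rule, not the paper's own code: the names size, next_id, next_size, shared, overlap and best are made up for illustration, each firm is assumed to appear at most once per year, years are assumed to be consecutive integers, and "best match" is checked only from the side of the year-t cluster, with ties broken arbitrarily.

          ```stata
          clear
          input str10 firm_id cluster_id year
          Firm1 1 2001
          Firm2 1 2001
          Firm3 1 2001
          Firm4 2 2001
          Firm5 2 2001
          Firm6 2 2001
          Firm7 2 2001
          Firm4 1 2002
          Firm5 1 2002
          Firm8 1 2002
          Firm1 2 2002
          Firm2 2 2002
          Firm9 2 2002
          end

          * Number of members of each cluster in each year
          bysort year cluster_id: gen size = _N

          * Copy of the data shifted back one year, so year t lines up with t+1
          preserve
          replace year = year - 1
          rename (cluster_id size) (next_id next_size)
          tempfile next
          save `next'
          restore

          * Keep firms observed in both years, with both cluster labels
          merge 1:1 firm_id year using `next', keep(match) nogenerate

          * One row per (cluster in t, cluster in t+1) pair; shared = common firms
          contract year cluster_id size next_id next_size, freq(shared)

          * Intersection over union: |A and B| / (|A| + |B| - |A and B|)
          gen overlap = shared / (size + next_size - shared)

          * Flag the highest-overlap successor of each cluster if it reaches 30%
          bysort year cluster_id (overlap): gen byte best = (_n == _N) & overlap >= 0.30

          list year cluster_id next_id shared overlap best, sepby(year)
          ```

          For the example data this leaves two candidate pairs: Cluster1 in 2001 with Cluster2 in 2002 (overlap 2/4) and Cluster2 in 2001 with Cluster1 in 2002 (overlap 2/5). Pairs with no firm in common never appear in the result, which corresponds to an overlap of 0.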



          • #6
            Ok, I think I've found a way to do what you want.
            However I have some questions before that.
            In your initial post, you say:
            Originally posted by giorgioconti:

            Percentage of member overlap between Cluster1 in 2001 and Cluster2 in 2002 is thus 50% (2 firms shared between two clusters over 4 firms in both communities).
            However, I get 67% instead: 2 shared firms (Firm1 and Firm2) out of the 3 firms in cluster 2 in 2002 (Firm1, Firm2 and Firm9). The same percentage results if it is computed over the 3 firms in cluster 1 in 2001 (Firm1, Firm2 and Firm3). Which cluster should serve as the reference for computing the percentage? Here it makes no difference, but in general it could. I would tend to compute it on the cluster to be recoded, that is, the later one (cluster 2 in 2002).
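
            The two figures differ only in the denominator. Written out for the example, where the symbols A (cluster 1 in 2001) and B (cluster 2 in 2002) are introduced here purely for illustration:

            ```latex
            \[
            A = \{\text{Firm1}, \text{Firm2}, \text{Firm3}\}, \qquad
            B = \{\text{Firm1}, \text{Firm2}, \text{Firm9}\}
            \]
            \[
            \frac{|A \cap B|}{|A \cup B|} = \frac{2}{4} = 0.50
            \qquad \text{versus} \qquad
            \frac{|A \cap B|}{|B|} = \frac{2}{3} \approx 0.67
            \]
            ```

            Because A and B both have three members here, dividing by |A| also gives 2/3. With clusters of unequal size the three figures would generally all differ.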

            Also, following your logic, cluster 1 in 2002 should be relabelled as 2, since Firm4 and Firm5 (2/3 of cluster 1 in 2002) belonged to cluster 2 in 2001.
            You didn't mention that, though. Are you OK with this change?

            Finally, what should be done with firms that had no previous cluster (e.g. Firm8 in 2002)? Do we treat them as if they had not changed cluster over time?


            If you agree with these three premises (67% overlap, not 50%; cluster 1 in 2002 should also be changed; first-time firms are counted among the firms that do not change cluster), I have some code for you. For now it is not very pretty, so I'll try to improve it a little while waiting for your answer.

            Best,
            Charlie




            • #7
              Charlie, I don't know why the authors of the original paper used the method that gives 50% rather than 67%. I computed the percentage following the logic in the quote from the original source. Indeed, 67% is another plausible way to measure cluster consistency.

              I would appreciate it if you could provide the syntax based on your three premises. Thank you!

              Best,
              Giorgio



              • #8
                Ok, here's the code:
                Code:
                clear
                input str10 firm_id  cluster_id year
                Firm1 1 2001
                Firm2 1 2001
                Firm3 1 2001
                Firm4 2 2001
                Firm5 2 2001
                Firm6 2 2001
                Firm7 2 2001
                Firm4 1 2002
                Firm5 1 2002
                Firm8 1 2002
                Firm1 2 2002
                Firm2 2 2002
                Firm9 2 2002
                end
                
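                * Numeric firm identifier, required by xtset
                encode firm_id, gen(firm_code)
                xtset firm_code year

                * Cluster the firm belonged to in the previous year (missing if none)
                gen L1cluster = L1.cluster_id

                * Number of distinct labels (the loop assumes they run 1, 2, ..., nclust)
                distinct cluster_id /* ssc install distinct */
                local nclust = r(ndistinct)

                gen final_cluster = cluster_id
                forvalues i = 1/`nclust' {
                    * 1 if the firm was in cluster i last year and now has another label
                    gen diff_cluster`i' = (L1cluster != cluster_id) if L1cluster == `i' & year > 2001

                    * Share of each current cluster-year made up of such firms
                    bysort cluster_id year : egen nb_L1cluster`i' = total(diff_cluster`i')
                    bysort cluster_id year : gen sh_L1cluster`i' = nb_L1cluster`i' / _N

                    * Relabel when more than half of the members came from cluster i
                    replace final_cluster = `i' if sh_L1cluster`i' > 0.5
                }

                drop diff_cluster* nb_L1cluster* sh_L1cluster*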
                I'm sure the code could be improved.
                Don't hesitate to ask questions.

                Best,
                Charlie
                Last edited by Charlie Joyez; 15 Nov 2016, 08:47.



                • #9
                  Thank you very much, Charlie! This is how I had it in mind! I deeply appreciate your help!

                  Best,
                  Giorgio

