  • The dissimilarity index

    Folks,

    I'm utilising Nick Cox's excellent command to calculate cultural team diversity.


    Code:
    . ineq country , by(team)
    
    ----------------------------------------------------------------------------------------------------
        group |             Team              freq           Simpson           entropy           dissim.
    ----------+-----------------------------------------------------------------------------------------
            1 |     ADO Den Haag                26             0.042             3.199             0.115
            2 |         AFC Ajax                27             0.041             3.225             0.142
            3 |       AZ Alkmaar                29             0.038             3.298             0.096
            4 |    De Graafschap                29             0.037             3.313             0.071
            5 |     FC Groningen                25             0.043             3.171             0.086
            6 |        FC Twente                29             0.039             3.286             0.162
            7 |       FC Utrecht                32             0.033             3.440             0.069
            8 |        Feyenoord                27             0.039             3.265             0.057
            9 |  Heracles Almelo                27             0.040             3.245             0.091
           10 |     NEC Nijmegen                26             0.050             3.064             0.224
           11 |       PEC Zwolle                38             0.029             3.558             0.123
           12 |    PSV Eindhoven                27             0.040             3.239             0.094
           13 | Roda JC Kerkrade                27             0.049             3.082             0.231
           14 |    SBV Excelsior                25             0.043             3.167             0.068
           15 |       SC Cambuur                30             0.036             3.346             0.098
           16 |          Vitesse                27             0.044             3.200             0.170
           17 |        Willem II                26             0.048             3.085             0.222
           18 |    sc Heerenveen                30             0.036             3.355             0.078
    ----------------------------------------------------------------------------------------------------
    I understand how the Simpson and entropy measures are calculated, but could someone please explain dissim. to me, or point me to any literature that discusses it?

    From what I have read about it, it would appear that the values calculated for dissim. are independent of the relative size of the groups used. So would larger squads be treated the same as smaller ones?

    Any information would be great.

  • #2
    Thanks for the endorsement.

    ineq is from SSC, as you are asked to explain. This measure is more generally

    (1/2) SUM | p - q |

    where p and q are paired proportions, each set summing to 1. The measure is at its minimum of 0 when the two sets are identical and at its maximum of 1 when one p is 1, the q for a different category is 1, and all other proportions are 0. The non-zero differences are then 1 and -1 in those two categories, so the measure is (1/2)(1 + 1) = 1. One instance of that is proportions p = 1, 0, 0, 0 and q = 0, 0, 0, 1.

    For one set of proportions, the reference is equal proportions in each of several categories, and the measure compares the observed set of proportions with that reference case.

    Whether this makes sense substantively for your problem is your call, but equal proportions is the tacit reference case for entropy and Simpson's (Gini's/Turing's/Hirschman's/Good's/Herfindahl's) measure too.
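
    As a minimal sketch of the formula above, written in Python rather than Stata and not taken from the ineq source (the helper name dissimilarity is hypothetical), the two limiting cases and the equal-proportions reference can be checked directly:

```python
def dissimilarity(p, q=None):
    """Half the sum of absolute differences between paired proportions.

    If q is omitted, the reference is equal proportions across the
    categories of p (the one-set case described above).
    """
    if q is None:
        q = [1 / len(p)] * len(p)
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of categories")
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


# Identical sets give the minimum, 0.
print(dissimilarity([0.5, 0.3, 0.2], [0.5, 0.3, 0.2]))  # 0.0
# All mass in two different categories gives the maximum, 1.
print(dissimilarity([1, 0, 0, 0], [0, 0, 0, 1]))        # 1.0
# One set compared with equal proportions (0.25 each).
print(dissimilarity([1, 0, 0, 0]))                      # 0.75
```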



    • #3
      Hi Nick,

      Thank you for this. For my own clarity could I just ask you something in relation to table in post #1?

      If we were to rank the dissim values from highest to lowest, could we say that the team/firm with the lowest value is the least culturally heterogeneous and the one with the highest value is the most?

      Code:
      group |             Team              freq           Simpson           entropy           dissim.
         10 |     NEC Nijmegen                26             0.050             3.064             0.224

      From reading online - http://www.censusscope.org/us/s40/p7...imilarity.html

      If a city's white-black dissimilarity index were 65, that would mean that 65% of white people would need to move to another neighborhood to make whites and blacks evenly distributed across all neighborhoods.
      So how would one literally interpret a value of 0.224 as noted here? While the example I quoted utilises 2 different groups, in a team/firm there could be n different nationalities.

      • #4
        It's the same interpretation. It's the fraction that would need to move to reproduce the reference situation.
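
        A small sketch of that reading, in Python and with made-up counts (the function names and the counts 6, 3, 2, 1 are hypothetical, not the thread's data): when the reference is equal proportions, the share of individuals held by over-represented categories beyond their equal share is the same number as the dissimilarity index.

```python
def fraction_to_move(counts):
    """Smallest share of individuals that would have to change category
    for all categories to hold equal shares (fractions allowed)."""
    total = sum(counts)
    equal_share = total / len(counts)
    surplus = sum(c - equal_share for c in counts if c > equal_share)
    return surplus / total


def dissimilarity_from_counts(counts):
    """(1/2) * sum |p - 1/k| with p = counts / total and k categories."""
    total = sum(counts)
    k = len(counts)
    return 0.5 * sum(abs(c / total - 1 / k) for c in counts)


counts = [6, 3, 2, 1]  # hypothetical: 12 individuals in 4 categories
print(fraction_to_move(counts))                      # 0.25
print(round(dissimilarity_from_counts(counts), 12))  # 0.25
```

        Here the equal share is 3 per category, so 3 of the 12 individuals (those in excess in the first category) would have to be reassigned.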

        • #5
          Apologies for the continued query, but I am having difficulty wrapping my head around how the ineq command produces the dissim score quoted in #3.

          For ease I add the data used to calculate.

          Here team is the name of the team, nation is a nationality within that team, totalsquad is the total number of individuals who make up the group, and i is the number of individuals of that nationality within the group.

          All the examples I've seen online tend to refer to, say, a region that is encompassed within a larger region, as in the quote in #3.

          Since the individuals from the Netherlands make up the largest reference category, would the 0.224 score indicate that circa 22% of Dutch individuals would need to move to a different team in order to make all other nationalities evenly distributed within a team?

          Any help to clear up my query would be most welcome.

          Code:
          team nation totalsquad i
          NEC Nijmegen Aruba 23 1
          NEC Nijmegen Australia 23 1
          NEC Nijmegen Austria 23 1
          NEC Nijmegen Belgium 23 2
          NEC Nijmegen Denmark 23 1
          NEC Nijmegen England 23 1
          NEC Nijmegen Germany 23 1
          NEC Nijmegen Netherlands 23 9
          NEC Nijmegen Poland 23 1
          NEC Nijmegen Portugal 23 1
          NEC Nijmegen Romania 23 1
          NEC Nijmegen Sweden 23 2
          NEC Nijmegen Venezuela 23 1

          • #6
            Please use dataex (SSC) to show examples.

            What ineq call are you using here?

            • #7
              Apologies,


              Disregard the data in #5.

              For reference, this is my data:

              Code:
              * Example generated by -dataex-. To install: ssc install dataex
              clear
              input str16 team str18 nat long nation
              "ADO Den Haag" "Denmark"     16
              "ADO Den Haag" "Netherlands" 36
              "ADO Den Haag" "Netherlands" 36
              "ADO Den Haag" "Netherlands" 36
              "ADO Den Haag" "Ivory Coast" 29
              "ADO Den Haag" "Netherlands" 36
              "ADO Den Haag" "Netherlands" 36
              "ADO Den Haag" "Netherlands" 36
              "ADO Den Haag" "Netherlands" 36
              "ADO Den Haag" "Netherlands" 36
              "ADO Den Haag" "Belgium"      4
              "ADO Den Haag" "Netherlands" 36
              "ADO Den Haag" "Netherlands" 36
              "ADO Den Haag" "Netherlands" 36
              "ADO Den Haag" "Denmark"     16
              "ADO Den Haag" "France"      20
              "ADO Den Haag" "Japan"       30
              "ADO Den Haag" "Netherlands" 36
              "ADO Den Haag" "Suriname"    51
              "ADO Den Haag" "Netherlands" 36
              "ADO Den Haag" "Netherlands" 36
              end
              label values nation nation
              label def nation 4 "Belgium", modify
              label def nation 16 "Denmark", modify
              label def nation 20 "France", modify
              label def nation 29 "Ivory Coast", modify
              label def nation 30 "Japan", modify
              label def nation 36 "Netherlands", modify
              label def nation 51 "Suriname", modify
              ------------------ copy up to and including the previous line ------------------

              And by using the following command I get the following output.

              Code:
              . ineq nation, by(team)
              
              ----------------------------------------------------------------------
                  group |       Team        freq     Simpson     entropy     dissim.
              ----------+-----------------------------------------------------------
                      1 | ADO Den Ha          21       0.052       2.984       0.114
              ----------------------------------------------------------------------
              How do I literally interpret this dissim of 0.114 when there are multiple nationalities in a group? Does it mean that 11% of the sample would need to move to another group to produce an evenly distributed group?

              • #8
                Thanks very much for the code and example. I am not surprised that you are puzzled by results here, as they are meaningless.

                The help for ineq (SSC) starts like this with reference to minimal syntax

                ineq varname


                ineq treats varname as an additive variable -- that is, assumes totals make sense and that no negative values are present.
                But nation clearly is not an additive variable at all: it's just an arbitrary integer code. The total of nation is a meaningless number and values of nation relative to other values are meaningless too. Netherlands is not 9 times bigger than Belgium just because its code is. (It looks as if the codes come out of some alphabetic reduction, say an encode.)
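
                To see where the 0.114 in #7 comes from, here is a hedged Python sketch. It is not the ineq source; it assumes only that each observation's value is turned into a share of the total, as the help text quoted above implies. Feeding it the 21 integer codes reproduces the figures reported in #7, so those figures summarise the arbitrary codes, not the nationalities.

```python
import math

# The 21 nation codes for ADO Den Haag, in the order listed in #7.
codes = [16, 36, 36, 36, 29, 36, 36, 36, 36, 36, 4,
         36, 36, 36, 16, 20, 30, 36, 51, 36, 36]

total = sum(codes)                    # 670, a meaningless total of codes
shares = [c / total for c in codes]   # each code treated as a "share"
n = len(shares)                       # 21 observations

simpson = sum(s ** 2 for s in shares)
entropy = sum(s * math.log(1 / s) for s in shares)
dissim = 0.5 * sum(abs(s - 1 / n) for s in shares)

print(round(simpson, 3))  # 0.052
print(round(entropy, 3))  # 2.984
print(round(dissim, 3))   # 0.114
```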

                The appropriate reduction here for ineq to make sense would be to contract to frequencies first.

                Code:
                clear
                input str16 team str18 nat long nation
                "ADO Den Haag" "Denmark"     16
                "ADO Den Haag" "Netherlands" 36
                "ADO Den Haag" "Netherlands" 36
                "ADO Den Haag" "Netherlands" 36
                "ADO Den Haag" "Ivory Coast" 29
                "ADO Den Haag" "Netherlands" 36
                "ADO Den Haag" "Netherlands" 36
                "ADO Den Haag" "Netherlands" 36
                "ADO Den Haag" "Netherlands" 36
                "ADO Den Haag" "Netherlands" 36
                "ADO Den Haag" "Belgium"      4
                "ADO Den Haag" "Netherlands" 36
                "ADO Den Haag" "Netherlands" 36
                "ADO Den Haag" "Netherlands" 36
                "ADO Den Haag" "Denmark"     16
                "ADO Den Haag" "France"      20
                "ADO Den Haag" "Japan"       30
                "ADO Den Haag" "Netherlands" 36
                "ADO Den Haag" "Suriname"    51
                "ADO Den Haag" "Netherlands" 36
                "ADO Den Haag" "Netherlands" 36
                end
                
                contract team nation
                list, sep(0)
                
                     +-------------------------------+
                     |         team   nation   _freq |
                     |-------------------------------|
                  1. | ADO Den Haag        4       1 |
                  2. | ADO Den Haag       16       2 |
                  3. | ADO Den Haag       20       1 |
                  4. | ADO Den Haag       29       1 |
                  5. | ADO Den Haag       30       1 |
                  6. | ADO Den Haag       36      14 |
                  7. | ADO Den Haag       51       1 |
                     +-------------------------------+
                
                ineq _freq, by(team)
                
                -------------------------------------------------------------
                        team |       freq     Simpson     entropy     dissim.
                -------------+-----------------------------------------------
                ADO Den Haag |          7       0.465       1.219       0.524
                -------------------------------------------------------------
                The internals are simply (in this case) frequency / total frequency = proportion or probability, and then the measures are calculated from a vector of probabilities.

                • #9
                  To make clear the methods used, let's use Mata calculator style on the vector of frequencies:

                  Code:
                  . mata
                  
                  : f = (1, 2, 1, 1, 1, 14, 1)
                  
                  : f / sum(f)
                                   1             2             3             4             5
                      +-----------------------------------------------------------------------
                    1 |  .0476190476   .0952380952   .0476190476   .0476190476   .0476190476
                      +-----------------------------------------------------------------------
                                   6             7
                       -----------------------------+
                    1    .6666666667   .0476190476  |
                       -----------------------------+
                  
                  : p = f / sum(f)
                  
                  : p
                                   1             2             3             4             5
                      +-----------------------------------------------------------------------
                    1 |  .0476190476   .0952380952   .0476190476   .0476190476   .0476190476
                      +-----------------------------------------------------------------------
                                   6             7
                       -----------------------------+
                    1    .6666666667   .0476190476  |
                       -----------------------------+
                  
                  : sum(p :* ln(1 :/ p))
                    1.219136867
                  
                  : 0.5 * sum(abs(p :- (1/7)))
                    .5238095238
                  
                  : sum(p:^2)
                    .4648526077
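
                  The same three numbers can be cross-checked outside Stata. This Python sketch (an illustration only, not part of ineq) counts the nationality labels from #7 first and then applies the formulas used in the Mata session above.

```python
import math
from collections import Counter

# Nationalities of the 21 ADO Den Haag players from #7.
nat = (["Netherlands"] * 14 + ["Denmark"] * 2 +
       ["Belgium", "France", "Ivory Coast", "Japan", "Suriname"])

freq = Counter(nat)                     # category -> count
total = sum(freq.values())              # 21 players
p = [f / total for f in freq.values()]  # proportions summing to 1
k = len(p)                              # 7 categories

entropy = sum(pi * math.log(1 / pi) for pi in p)
dissim = 0.5 * sum(abs(pi - 1 / k) for pi in p)
simpson = sum(pi ** 2 for pi in p)

print(round(entropy, 3), round(dissim, 3), round(simpson, 3))
# 1.219 0.524 0.465
```

                  With 21 players and 7 nationalities the equal share is 3 players each, so the 11 Dutch players in excess of that share, out of 21, give the 0.524.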

                  • #10
                    Nick,

                    Thank you very much for this. Much appreciated.

                    • #11
                      I looked at the code, which was written in 1998. There is scope for a count option, which would say: count these categories first, then do the calculations. So that goes on the to-do list.

                      • #12
                        On thinking about it further, I realised that about 3 years ago I had started writing a different program, so I have picked that project up and finished the job. In this case, results are

                        Code:
                        . entropyetc nation 
                        
                        ----------------------------------------------------------------------
                            Group |  Shannon H      exp(H)     Simpson   1/Simpson     dissim.
                        ----------+-----------------------------------------------------------
                              all |      1.219       3.384       0.465       2.151       0.524
                        ----------------------------------------------------------------------
                        .
                        I'll post in a new thread when the code and help are publicly accessible.
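
                        In the meantime, the two extra columns can be checked by hand. This Python sketch is not the entropyetc source; it assumes, going by the column headings, that exp(H) is the exponential of Shannon entropy and 1/Simpson is the reciprocal of the Simpson measure.

```python
import math

f = [1, 2, 1, 1, 1, 14, 1]           # frequencies from #9
p = [x / sum(f) for x in f]          # proportions

H = sum(pi * math.log(1 / pi) for pi in p)   # Shannon entropy
simpson = sum(pi ** 2 for pi in p)           # Simpson measure

print(round(math.exp(H), 3))   # 3.384
print(round(1 / simpson, 3))   # 2.151
```

                        Both transformed values can be read as an equivalent number of equally common categories, here roughly 2 to 3 rather than the 7 nationalities observed.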

                        • #13
                          Hi Nick,

                          I have a panel dataset similar to the one described above. I would like to calculate team cultural (nation) inequality (per year) based on the following formula, which is slightly different than the one used in ineq:


                          [Attached image: image_8785.png; the formula is not reproduced in the text]


                          Could you please help?

                          Thanks in advance,
                          C



                          • #14
                            On the contrary: I can't see any real similarity here with anything calculated by ineq. If I understand this correctly, there is a different value for each individual in a group depending on values for the other members of a group. That's not something covered by ineq at all, even by analogy. (Naturally, it is puzzling that an equation is presented with no definitions whatsoever.)

                            More positively, how to do this is already discussed in https://www.statalist.org/forums/for...-group-members You've contributed to that thread, so you should push forward there.

                            • #15
                              Hi Nick,

                              You interpreted the equation correctly. I didn't understand the ineq formula well, I guess.

                              Thanks a lot for the prompt reply!
