The dissimilarity index

Sean O'Connor

Join Date: Jun 2014
Posts: 119


08 Nov 2016, 08:15

Folks,

I'm utilising Nick Cox's excellent command to calculate cultural team diversity.

Code:

. ineq country , by(team)

----------------------------------------------------------------------------------------------------
    group |             Team              freq           Simpson           entropy           dissim.
----------+-----------------------------------------------------------------------------------------
        1 |     ADO Den Haag                26             0.042             3.199             0.115
        2 |         AFC Ajax                27             0.041             3.225             0.142
        3 |       AZ Alkmaar                29             0.038             3.298             0.096
        4 |    De Graafschap                29             0.037             3.313             0.071
        5 |     FC Groningen                25             0.043             3.171             0.086
        6 |        FC Twente                29             0.039             3.286             0.162
        7 |       FC Utrecht                32             0.033             3.440             0.069
        8 |        Feyenoord                27             0.039             3.265             0.057
        9 |  Heracles Almelo                27             0.040             3.245             0.091
       10 |     NEC Nijmegen                26             0.050             3.064             0.224
       11 |       PEC Zwolle                38             0.029             3.558             0.123
       12 |    PSV Eindhoven                27             0.040             3.239             0.094
       13 | Roda JC Kerkrade                27             0.049             3.082             0.231
       14 |    SBV Excelsior                25             0.043             3.167             0.068
       15 |       SC Cambuur                30             0.036             3.346             0.098
       16 |          Vitesse                27             0.044             3.200             0.170
       17 |        Willem II                26             0.048             3.085             0.222
       18 |    sc Heerenveen                30             0.036             3.355             0.078
----------------------------------------------------------------------------------------------------

I understand how the Simpson and entropy measures are calculated, but could someone explain dissim to me, please, or point me to any literature which discusses it?

From my reading about it, it would appear that the values calculated for dissim are independent of the relative sizes of the groups used? So larger squads would be treated the same as smaller ones?

Any information would be great.


Nick Cox

Join Date: Mar 2014

Posts: 35754
#2

09 Nov 2016, 12:34

Thanks for the endorsement.

ineq is from SSC, as you are asked to explain. This measure is more generally

(1/2) SUM | p - q |

where p and q are both paired proportions that separately sum to 1. That measure can be minimally 0 when the two sets are identical and maximally 1 if one p is 1 and another q is 1 and all other proportions are 0. For then the non-zero differences are -1 and 1 in those two categories and the measure reduces to 1. So, one instance of that is proportions p = 1, 0, 0, 0 and q = 0, 0, 0, 1.

For one set of proportions, the reference is equal proportions in each of several categories, and the measure compares the observed set of proportions with that reference case.
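For illustration only, the formula above can be sketched in Mata. The function name dissim_pq() is hypothetical and is not part of ineq; it assumes two row vectors of proportions of equal length, each summing to 1.

```stata
mata:
// hypothetical helper, not part of ineq: (1/2) SUM | p - q |
real scalar dissim_pq(real rowvector p, real rowvector q)
{
    return(0.5 * sum(abs(p :- q)))
}

p = (1, 0, 0, 0)
q = (0, 0, 0, 1)
dissim_pq(p, q)     // the maximal case described above: 1
dissim_pq(p, p)     // identical sets: 0

// one set of proportions compared with equal proportions in each category
r = (0.5, 0.25, 0.25)
dissim_pq(r, J(1, cols(r), 1 / cols(r)))
end
```

The last call builds the equal-proportions reference with J(), so the same helper covers both the two-set and the one-set case.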

Whether this makes sense substantively for your problem is your call, but equal proportions is the tacit reference case for entropy and Simpson's (Gini's/Turing's/Hirschman's/Good's/Herfindahl's) measure too.
Sean O'Connor

Join Date: Jun 2014

Posts: 119
#3

15 Nov 2016, 07:41

Hi Nick,

Thank you for this. For my own clarity, could I just ask you something in relation to the table in post #1?

If we were to rank the dissim values from highest to lowest, could we say that the team/firm with the lowest value is the least culturally heterogeneous, while the one with the highest value is the most?

Code:

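----------------------------------------------------------------------
    group |          Team     freq     Simpson     entropy     dissim.
----------+-----------------------------------------------------------
       10 |  NEC Nijmegen       26       0.050       3.064       0.224
----------------------------------------------------------------------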

From reading online - http://www.censusscope.org/us/s40/p7...imilarity.html

If a city's white-black dissimilarity index were 65, that would mean that 65% of white people would need to move to another neighborhood to make whites and blacks evenly distributed across all neighborhoods.

So how would one literally interpret a value of 0.224 as noted here? While the example I quoted utilises 2 different groups, in a team/firm there could be n different nationalities.
Nick Cox

Join Date: Mar 2014

Posts: 35754
#4

15 Nov 2016, 08:33

It's the same interpretation. It's the fraction that would need to move to reproduce the reference situation.
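A small Mata sketch of that reading, using made-up counts (not data from this thread) and assuming the equal-proportions reference. Because the deviations from the reference sum to zero, half the sum of absolute deviations equals the total surplus in the over-represented categories, which is the share that would have to move.

```stata
mata:
// hypothetical counts in three categories (illustration only)
f = (6, 2, 1)
p = f / sum(f)
ref = J(1, cols(f), 1 / cols(f))    // equal-proportions reference

// dissimilarity index
0.5 * sum(abs(p :- ref))

// surplus share held by categories above the reference
sum((p :- ref) :* (p :> ref))
end
```

Both expressions give 1/3 here: moving 3 of the 9 members out of the first category would produce 3, 3, 3.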
Sean O'Connor

Join Date: Jun 2014

Posts: 119
#5

16 Nov 2016, 05:23

Apologies for the continued query, as I am having difficulty wrapping my head around how the ineq command produces the dissim score in #5.

For ease I add the data used to calculate.

Here team is the name of the team, nation is the nationality within that team, totalsquad is the total number of individuals who make up the group, and i is the number of individuals of each nationality within the group.

All the examples I've seen online tend to make reference to say a region, which is encompassed within a larger region, as seen by the quote above - #5.

Since individuals from the Netherlands make up the largest reference category, would the 0.224 score indicate that circa 22% of Dutch individuals would need to move to a different team in order to make all other nationalities evenly distributed within a team?

Any help to clear up my query would be most welcome.

Code:

team          nation       totalsquad  i
NEC Nijmegen  Aruba                23  1
NEC Nijmegen  Australia            23  1
NEC Nijmegen  Austria              23  1
NEC Nijmegen  Belgium              23  2
NEC Nijmegen  Denmark              23  1
NEC Nijmegen  England              23  1
NEC Nijmegen  Germany              23  1
NEC Nijmegen  Netherlands          23  9
NEC Nijmegen  Poland               23  1
NEC Nijmegen  Portugal             23  1
NEC Nijmegen  Romania              23  1
NEC Nijmegen  Sweden               23  2
NEC Nijmegen  Venezuela            23  1
Nick Cox

Join Date: Mar 2014

Posts: 35754
#6

16 Nov 2016, 06:28

Please use dataex (SSC) to show examples.

What ineq call are you using here?

Sean O'Connor

Join Date: Jun 2014
Posts: 119

16 Nov 2016, 07:01

Apologies,

Disregard the data in #5.

For reference, this is my data:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str16 team str18 nat long nation
"ADO Den Haag" "Denmark"     16
"ADO Den Haag" "Netherlands" 36
"ADO Den Haag" "Netherlands" 36
"ADO Den Haag" "Netherlands" 36
"ADO Den Haag" "Ivory Coast" 29
"ADO Den Haag" "Netherlands" 36
"ADO Den Haag" "Netherlands" 36
"ADO Den Haag" "Netherlands" 36
"ADO Den Haag" "Netherlands" 36
"ADO Den Haag" "Netherlands" 36
"ADO Den Haag" "Belgium"      4
"ADO Den Haag" "Netherlands" 36
"ADO Den Haag" "Netherlands" 36
"ADO Den Haag" "Netherlands" 36
"ADO Den Haag" "Denmark"     16
"ADO Den Haag" "France"      20
"ADO Den Haag" "Japan"       30
"ADO Den Haag" "Netherlands" 36
"ADO Den Haag" "Suriname"    51
"ADO Den Haag" "Netherlands" 36
"ADO Den Haag" "Netherlands" 36
end
label values nation nation
label def nation 4 "Belgium", modify
label def nation 16 "Denmark", modify
label def nation 20 "France", modify
label def nation 29 "Ivory Coast", modify
label def nation 30 "Japan", modify
label def nation 36 "Netherlands", modify
label def nation 51 "Suriname", modify


And by using the following command I get the following output.

Code:

. ineq nation, by(team)

----------------------------------------------------------------------
    group |       Team        freq     Simpson     entropy     dissim.
----------+-----------------------------------------------------------
        1 | ADO Den Ha          21       0.052       2.984       0.114
----------------------------------------------------------------------

How do I literally interpret this dissim of 0.114 when there are multiple nationalities in a group? Is it that 11% of the sample would need to move in order to produce an evenly distributed group?


Nick Cox

Join Date: Mar 2014
Posts: 35754

16 Nov 2016, 09:55

Thanks very much for the code and example. I am not surprised that you are puzzled by results here, as they are meaningless.

The help for ineq (SSC) starts like this with reference to minimal syntax

ineq varname

ineq treats varname as an additive variable -- that is, assumes totals make sense and that no negative values are present.

But nation clearly is not an additive variable at all: it's just an arbitrary integer code. The total of nation is a meaningless number and values of nation relative to other values are meaningless too. Netherlands is not 9 times bigger than Belgium just because its code is. (It looks as if the codes come out of some alphabetic reduction, say an encode.)

The appropriate reduction here for ineq to make sense would be to contract to frequencies first.

Code:

clear
input str16 team str18 nat long nation
"ADO Den Haag" "Denmark"     16
"ADO Den Haag" "Netherlands" 36
"ADO Den Haag" "Netherlands" 36
"ADO Den Haag" "Netherlands" 36
"ADO Den Haag" "Ivory Coast" 29
"ADO Den Haag" "Netherlands" 36
"ADO Den Haag" "Netherlands" 36
"ADO Den Haag" "Netherlands" 36
"ADO Den Haag" "Netherlands" 36
"ADO Den Haag" "Netherlands" 36
"ADO Den Haag" "Belgium"      4
"ADO Den Haag" "Netherlands" 36
"ADO Den Haag" "Netherlands" 36
"ADO Den Haag" "Netherlands" 36
"ADO Den Haag" "Denmark"     16
"ADO Den Haag" "France"      20
"ADO Den Haag" "Japan"       30
"ADO Den Haag" "Netherlands" 36
"ADO Den Haag" "Suriname"    51
"ADO Den Haag" "Netherlands" 36
"ADO Den Haag" "Netherlands" 36
end

contract team nation
list, sep(0)

     +-------------------------------+
     |         team   nation   _freq |
     |-------------------------------|
  1. | ADO Den Haag        4       1 |
  2. | ADO Den Haag       16       2 |
  3. | ADO Den Haag       20       1 |
  4. | ADO Den Haag       29       1 |
  5. | ADO Den Haag       30       1 |
  6. | ADO Den Haag       36      14 |
  7. | ADO Den Haag       51       1 |
     +-------------------------------+

ineq _freq, by(team)

-------------------------------------------------------------
        team |       freq     Simpson     entropy     dissim.
-------------+-----------------------------------------------
ADO Den Haag |          7       0.465       1.219       0.524
-------------------------------------------------------------

The internals are simply (in this case) frequency / total frequency = proportion or probability, and then the measures are calculated from a vector of probabilities.
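As a rough cross-check that does not rely on ineq, the same dissim value can be reproduced by hand from contracted frequencies. This is only a sketch: the variable names freq, n, k, p, absdev, dissim and pickone are chosen here for illustration, and the reference is assumed to be equal proportions across the nationalities present in each team.

```stata
* sketch: dissim by team from contracted frequencies, without ineq
clear
input str16 team long nation freq
"ADO Den Haag"  4  1
"ADO Den Haag" 16  2
"ADO Den Haag" 20  1
"ADO Den Haag" 29  1
"ADO Den Haag" 30  1
"ADO Den Haag" 36 14
"ADO Den Haag" 51  1
end

bysort team: egen n = total(freq)           // squad size per team
bysort team: gen k = _N                     // nationalities present per team
gen double p = freq / n                     // proportion of each nationality
gen double absdev = abs(p - 1/k)            // deviation from equal proportions
bysort team: egen double dissim = total(absdev)
replace dissim = dissim / 2                 // (1/2) SUM | p - 1/k |

egen pickone = tag(team)                    // one row per team for listing
list team k n dissim if pickone, noobs
```

With more teams in the data, the by-group steps give one value per team, which is what the by(team) option reports.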


Nick Cox

Join Date: Mar 2014
Posts: 35754

16 Nov 2016, 12:34

To make clear the methods used, let's use Mata calculator style on the vector of frequencies:

Code:

. mata

: f = (1, 2, 1, 1, 1, 14, 1)

: f / sum(f)
                 1             2             3             4             5
    +-----------------------------------------------------------------------
  1 |  .0476190476   .0952380952   .0476190476   .0476190476   .0476190476
    +-----------------------------------------------------------------------
                 6             7
     -----------------------------+
  1    .6666666667   .0476190476  |
     -----------------------------+

: p = f / sum(f)

: p
                 1             2             3             4             5
    +-----------------------------------------------------------------------
  1 |  .0476190476   .0952380952   .0476190476   .0476190476   .0476190476
    +-----------------------------------------------------------------------
                 6             7
     -----------------------------+
  1    .6666666667   .0476190476  |
     -----------------------------+

: sum(p :* ln(1 :/ p))
  1.219136867

: 0.5 * sum(abs(p :- (1/7)))
  .5238095238

: sum(p:^2)
  .4648526077


Sean O'Connor

Join Date: Jun 2014

Posts: 119
#10

17 Nov 2016, 02:03

Nick,

Thank you very much for this. Much appreciated.
Nick Cox

Join Date: Mar 2014

Posts: 35754
#11

17 Nov 2016, 02:18

I looked at the code, which was written in 1998. There is scope for a count option, which says count these categories first!, then do the calculations, so that goes on the to-do list.

Nick Cox

Join Date: Mar 2014
Posts: 35754

#12

20 Nov 2016, 12:37

On thinking about it further, I realised that about 3 years ago I had started writing a different program, so I have picked that project up and finished the job. In this case, results are

Code:

. entropyetc nation 

----------------------------------------------------------------------
    Group |  Shannon H      exp(H)     Simpson   1/Simpson     dissim.
----------+-----------------------------------------------------------
      all |      1.219       3.384       0.465       2.151       0.524
----------------------------------------------------------------------

.
I'll post in a new thread when the code and help are publicly accessible.
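Judging by the column labels, the two extra columns look like simple transformations of the earlier results. A Mata sketch under that assumption, taking exp(H) to be the exponential of the Shannon entropy and 1/Simpson to be the reciprocal of the sum of squared proportions (the internals of entropyetc are not shown in this thread):

```stata
mata:
// frequencies of the seven nationalities, as in the contracted data
f = (1, 2, 1, 1, 1, 14, 1)
p = f / sum(f)

H = sum(p :* ln(1 :/ p))    // Shannon entropy
exp(H)                      // assumed meaning of the exp(H) column
1 / sum(p:^2)               // assumed meaning of the 1/Simpson column
end
```

These round to the 3.384 and 2.151 shown in the table.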


Constantinos Mammassis

Join Date: Feb 2018

Posts: 7
#13

27 Feb 2018, 07:09

Hi Nick,

I have a panel dataset similar to the one described above. I would like to calculate team cultural (nation) inequality (per year) based on the following formula, which is slightly different than the one used in ineq:

Could you please help?

Thanks in advance,
C
Nick Cox

Join Date: Mar 2014

Posts: 35754
#14

27 Feb 2018, 07:17

On the contrary: I can't see any real similarity here with anything calculated by ineq. If I understand this correctly, there is a different value for each individual in a group depending on values for the other members of a group. That's not something covered by ineq at all, even by analogy. (Naturally, it is puzzling that an equation is presented with no definitions whatsoever.)

More positively, how to do this is already discussed in https://www.statalist.org/forums/for...-group-members You've contributed to that thread, so you should push forward there.
Constantinos Mammassis

Join Date: Feb 2018

Posts: 7
#15

27 Feb 2018, 07:21

Hi Nick,

You interpreted the equation correctly. I guess I didn't understand the ineq formula well.

Thanks a lot for the prompt reply!
