  • Calculating Blau index

    Dear Statalist users,

    I have a question that I cannot seem to figure out myself. I have looked at the forums, and although I have found many helpful topics (like: https://www.statalist.org/forums/for...kill-diversity), the situation does not apply to me. I have also figured out that the Blau index appears to go by many names, such as Simpson's, Herfindahl's, and Herfindahl-Hirschman's index. In this post, I will explain as clearly as I can what I am trying to achieve.

    I have the dataset below, which consists of directors per firm per year, along with the number of awards they have received. I want to capture the heterogeneity in the number of awards in a board year. However, since the Blau index requires categories, and Awards is a count variable, I need to create the categories myself. Therefore, I want to use the mean of the sample, which in this case is 53. My categories for the Blau index thus consist of: 1. above 53, and 2. below 53.

    My question is thus: how could I calculate the Blau-Index by CompanyID and Year, with the category stated above (above/below the mean)?

    Code:
    clear
    input int(CompanyID Year DirectorID) str1 Gender int Awards
    1111 2008 4854 "M"  45
    1111 2008 2938 "F"  14
    1111 2008 4927 "F" 120
    1111 2008 9068 "M"  76
    1111 2009 4854 "M"  45
    1111 2009 2938 "F"  76
    1111 2010 4854 "M"  46
    2222 2008 4275 "F"  54
    2222 2009 5827 "M"  65
    2222 2009 5283 "M"  34
    2222 2010 6912 "M"  12
    2222 2010 4917 "F"  43
    2222 2010 4854 "M"  59
    end
    Note, I have tried the following:

    Code:
    gen AwardsDummy = 0
    replace  AwardsDummy = 1 if Awards>53
    bysort CompanyID Year: divcat AwardsDummy , gv gen_gv(H_AwardsDummy)
    And although this works for the data snippet above, it does not seem to work for my whole dataset. I get the error 'too many values', which may be expected, since my dataset is large.

    Thank you for your time and efforts. If something is unclear, please let me know.
    Last edited by Ferry Doeza; 03 May 2021, 09:11.

  • #2
    I wonder whether a dichotomous variable will capture the heterogeneity sufficiently. But here is an example of how to calculate the Blau index (GV in the output) -- note that "above the mean" and "below the mean" are ambiguous (it is not stated where a value exactly equal to the mean should go):

    Code:
    cap which divcat
    if _rc ssc install divcat  // install -divcat- if necessary
    
    clear
    input int(CompanyID Year DirectorID) str1 Gender int Awards
    1111 2008 4854 "M"  45
    1111 2008 2938 "F"  14
    1111 2008 4927 "F" 120
    1111 2008 9068 "M"  76
    1111 2009 4854 "M"  45
    1111 2009 2938 "F"  76
    1111 2010 4854 "M"  46
    2222 2008 4275 "F"  54
    2222 2009 5827 "M"  65
    2222 2009 5283 "M"  34
    2222 2010 6912 "M"  12
    2222 2010 4917 "F"  43
    2222 2010 4854 "M"  59
    end
    
    sum Awards, meanonly
    recode Awards (min/`r(mean)' = 0 "<= `r(mean)'") (`r(mean)'/max = 1 "> `r(mean)'"), gen(Awards_2)
    tab1 Awards_2
    
    bys CompanyID Year: divcat Awards_2
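For reference, the Blau index (GV) that divcat reports is 1 minus the sum of squared category proportions. A minimal language-agnostic sketch in Python (illustrative only; the function name is mine, not part of divcat):

```python
from collections import Counter

def blau_index(categories):
    """Blau index (generalized variance): 1 - sum of squared
    category proportions. 0 for a homogeneous group; maximal
    when the categories are evenly represented."""
    n = len(categories)
    counts = Counter(categories)
    return 1 - sum((c / n) ** 2 for c in counts.values())

# Board of CompanyID 1111 in 2008, dichotomized at the mean (53):
# Awards 45, 14, 120, 76 become 0, 0, 1, 1
print(blau_index([0, 0, 1, 1]))  # two equal categories -> 0.5
```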

    Comment


    • #3
      Sorry, I did not read your post closely enough. Following your example, can you try:
      Code:
      set tracedepth 2
      set trace on
      bysort CompanyID Year: divcat AwardsDummy , gv gen_gv(H_AwardsDummy)
      set trace off

      Comment


      • #4
        Dirk Enzmann

        Dear Dirk,

        Thank you very much for sharing your knowledge. Unfortunately, even with your commands, I still get the 'too many values' error message. Do you have any other tips on how this can be solved?

        I agree with you that a dummy variable might not capture heterogeneity sufficiently. However, with a count variable like this, I do not know a better way to calculate the Blau-index. Do you, by any chance, have a better approach for this?

        Ultimately, I want to capture the female awards, so I can make a distinction between male/female awards. I do not know if this is important to know.

        Thank you for your time and efforts
        Last edited by Ferry Doeza; 03 May 2021, 09:59.

        Comment


        • #5
          • As to the problem of "too many values": The suggestion to use -set trace on- was meant to find out where exactly the problem of too many values occurs. Because you are looking for a measure of heterogeneity by groups (of cases), you can circumvent the problem by calculating the Blau index (GV) for subsets of your data.
          • As to the problem of a better measure of heterogeneity: you should look for measures that take into account the nature of your data (here: count data). I am no expert in this field; certainly there are others who know more and can answer this question.

          Comment


          • #6
            Dirk Enzmann

            I see, thank you very much for your help. It now works for the Awards variable. I have a similar variable, and I have also used divcat on a dummy that I created to measure heterogeneity. However, the resulting Blau index values are negative. How can this be? It may be important to note that almost all observations are coded '0'; just a few are coded '1' (to show categories). Could this be the reason for the negative Blau index?

            Best regards

            Comment


            • #7
              The Blau index (generalized variance: GV) should not be negative (for details see: Budescu, D. V. & Budescu, M. (2012). How to measure diversity when you must. Psychological Methods, 17(2), 215-227). That could only happen if the sum of the proportions over the categories of your variable were greater than 1 (which is impossible). Can you create a reproducible example?

              Comment


              • #8
                To be more precise: that the sum of the proportions over the categories of your variable is greater than 1 is only a necessary condition for the GV being negative; the sufficient condition is that the sum of the squared proportions over the categories of your variable is greater than 1.
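A small numeric illustration of the two conditions (a sketch in Python; with valid proportions summing to 1, the GV cannot go negative):

```python
def gv(proportions):
    """Generalized variance (Blau index) from category proportions."""
    return 1 - sum(p ** 2 for p in proportions)

# Valid proportions (sum to 1): GV is never negative.
print(gv([0.5, 0.5]))   # 0.5
print(gv([1.0]))        # 0.0

# Sum of proportions > 1 but sum of squares still <= 1:
# GV stays positive, so the first condition alone is not sufficient.
print(gv([0.6, 0.6]))   # positive (about 0.28)

# Sum of *squared* proportions > 1 (e.g. a double-counted category):
# only then does GV go negative.
print(gv([0.8, 0.8]))   # negative (about -0.28)
```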

                Comment


                • #9
                  Dirk Enzmann

                  Dear Dirk,

                  Thank you for your response and for sharing your knowledge. I tried to export a piece of my data to post here, but every time I did, I no longer got negative values. I thus suspected that it had something to do with my large dataset and with my manual calculation of the Blau index. My previous message was unclear, I see, since it suggests that I got the negative values by using divcat. This is not the case, since I used my manual calculation below.

                  Code:
                   * count of each Awards value, stored once per category (0 elsewhere)
                   bys CompanyID Year Awards: gen a = _N*(_n==1)
                   * each observation adds (_N - a^2)/_N^2, summing to 1 - sum of squared proportions
                   bys CompanyID Year: egen AwardsDistribution = total((_N-a^2)/(_N^2))
                   drop a
                   I split up my sample so I would not get the 'too many values' message, and I ran divcat instead. I get exactly the same positive results, but the negative results that I obtained with my manual method are now 0 with divcat. I thus think that it may have to do with rounding (since the negative values I first obtained were very small). Either way, divcat seems to work perfectly for this.
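To see whether rounding is a plausible culprit, the manual Stata formula can be mirrored and compared against the direct definition. A sketch in Python (function names are mine, for illustration):

```python
from collections import Counter

def blau_direct(values):
    # direct definition: 1 - sum of squared category proportions
    n = len(values)
    return 1 - sum((c / n) ** 2 for c in Counter(values).values())

def blau_manual(values):
    # mirrors the Stata snippet: each observation contributes
    # (_N - a^2)/_N^2, where a is the category count at the first
    # observation of each category and 0 elsewhere
    n = len(values)
    counts = Counter(values)
    seen = set()
    total = 0.0
    for v in values:
        a = counts[v] if v not in seen else 0
        seen.add(v)
        total += (n - a ** 2) / n ** 2
    return total

awards = [45, 14, 120, 76]  # CompanyID 1111, year 2008
print(blau_direct(awards), blau_manual(awards))  # both 0.75
```

The two are algebraically identical, so any discrepancy over a large dataset can only come from floating-point accumulation, which is consistent with the very small negative values observed.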

                  Thank you very much for your time and efforts.

                  Comment


                  • #10
                    Your code doesn't pay any attention to whether there are missing values. I don't know if that is biting.

                    Comment


                    • #11
                      Nick Cox

                      Dear Nick,

                      Thank you for your input. Although my code indeed does not pay attention to missing values, there are none in my CompanyID, Year and Awards variables. Although my issue is now fixed by using divcat, I am not sure why my code didn't work optimally. I suspect rounding, but I am not sure.

                      Comment


                      • #12
                        A more serious issue is that you are categorizing (dichotomizing) your data to calculate a measure of heterogeneity, which entails a serious loss of information. Instead you should use a measure of heterogeneity that takes into account the nature of your data (counts), but I am no expert in this field.

                        Comment
