  • Calculating Blau index

    Dear Statalist users,

    I have a question that I cannot seem to figure out myself. I have looked at the forums, and although I have found many helpful topics (like: https://www.statalist.org/forums/for...kill-diversity), the situation does not apply to me. I have also figured out that the Blau index appears to go by many names, such as Simpson's, Herfindahl's, and Herfindahl-Hirschman's index. In this post, I will explain as clearly as I can what I am trying to achieve.

    I have the dataset below, which consists of directors per firm per year, along with the number of awards they have received. I want to capture the heterogeneity in the number of awards in a board year. However, since the Blau index requires categories, and Awards is a count variable, I need to create the categories myself. Therefore, I want to use the mean of the sample, which in this case is 53. My categories for the Blau index thus consist of: 1. above 53, and 2. below 53.

    My question is thus: how could I calculate the Blau-Index by CompanyID and Year, with the category stated above (above/below the mean)?

    Code:
    clear
    input int(CompanyID Year DirectorID) str1 Gender int Awards
    1111 2008 4854 "M"  45
    1111 2008 2938 "F"  14
    1111 2008 4927 "F" 120
    1111 2008 9068 "M"  76
    1111 2009 4854 "M"  45
    1111 2009 2938 "F"  76
    1111 2010 4854 "M"  46
    2222 2008 4275 "F"  54
    2222 2009 5827 "M"  65
    2222 2009 5283 "M"  34
    2222 2010 6912 "M"  12
    2222 2010 4917 "F"  43
    2222 2010 4854 "M"  59
    end
    Note, I have tried the following:

    Code:
    gen AwardsDummy = 0
    replace  AwardsDummy = 1 if Awards>53
    bysort CompanyID Year: divcat AwardsDummy , gv gen_gv(H_AwardsDummy)
    And although this works for the data snippet above, it does not seem to work for my whole dataset. I get the error 'too many values', which may be expected, since my dataset is large.

    Thank you for your time and efforts. If something is unclear, please let me know.
    Last edited by Ferry Doeza; 03 May 2021, 09:11.

  • #2
    I wonder whether a dichotomous variable will capture the heterogeneity sufficiently. But here is an example of how to calculate the Blau index (GV in the output) -- note that "above the mean" and "below the mean" are ambiguous (it is not stated where a value exactly equal to the mean should go):

    Code:
    cap which divcat
    if _rc ssc install divcat  // install -divcat- if necessary
    
    clear
    input int(CompanyID Year DirectorID) str1 Gender int Awards
    1111 2008 4854 "M"  45
    1111 2008 2938 "F"  14
    1111 2008 4927 "F" 120
    1111 2008 9068 "M"  76
    1111 2009 4854 "M"  45
    1111 2009 2938 "F"  76
    1111 2010 4854 "M"  46
    2222 2008 4275 "F"  54
    2222 2009 5827 "M"  65
    2222 2009 5283 "M"  34
    2222 2010 6912 "M"  12
    2222 2010 4917 "F"  43
    2222 2010 4854 "M"  59
    end
    
    sum Awards, meanonly
    recode Awards (min/`r(mean)' = 0 "<= `r(mean)'") (`r(mean)'/max = 1 "> `r(mean)'"), gen(Awards_2)
    tab1 Awards_2
    
    bys CompanyID Year: divcat Awards_2
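For reference, the Blau index (GV) that divcat reports is 1 minus the sum of squared category proportions. A minimal language-agnostic sketch in Python (illustrative only; the function name is mine, not part of divcat):

```python
from collections import Counter

def blau_index(categories):
    """Blau index (generalized variance): 1 - sum of squared
    category proportions. 0 for a homogeneous group; maximal
    when the categories are evenly represented."""
    n = len(categories)
    counts = Counter(categories)
    return 1 - sum((c / n) ** 2 for c in counts.values())

# Board of CompanyID 1111 in 2008, dichotomized at the mean (53):
# Awards 45, 14, 120, 76 become 0, 0, 1, 1
print(blau_index([0, 0, 1, 1]))  # two equal categories -> 0.5
```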

    Comment


    • #3
      Sorry, I did not read your post closely enough. Following your example, can you try:
      Code:
      set tracedepth 2
      set trace on
      bysort CompanyID Year: divcat AwardsDummy , gv gen_gv(H_AwardsDummy)
      set trace off

      Comment


      • #4
        Dirk Enzmann

        Dear Dirk,

        Thank you very much for sharing your knowledge. Unfortunately, even with your commands, I still get the 'too many values' error message. Do you have any other tips on how this can be solved?

        I agree with you that a dummy variable might not capture heterogeneity sufficiently. However, with a count variable like this, I do not know a better way to calculate the Blau-index. Do you, by any chance, have a better approach for this?

        Ultimately, I want to capture the female awards, so I can make a distinction between male/female awards. I do not know if this is important to know.

        Thank you for your time and efforts
        Last edited by Ferry Doeza; 03 May 2021, 09:59.

        Comment


        • #5
          • As to the problem of "too many values": The suggestion to use -set trace on- was meant to find out where exactly the problem of too many values occurs. Because you are looking for a measure of heterogeneity by groups (of cases), you can circumvent the problem by calculating the Blau index (GV) for subsets of your data.
          • As to the problem of a better measure of heterogeneity: you should look for measures that take into account the nature of your data (here: count data). I am no expert in this field; certainly there are others who know more and can answer this question.

          Comment


          • #6
            Dirk Enzmann

            I see, thank you very much for your help. It now works for the Awards variable. I have a similar variable, and I have also used divcat on a dummy that I created to measure heterogeneity. However, the resulting Blau index values are negative. How can this be? It may be important to note that almost all observations are coded '0'; just a few are coded '1' (to show categories). Could this be the reason for the negative Blau index?

            Best regards

            Comment


            • #7
              The Blau index (generalized variance: GV) should not be negative (for details see: Budescu, D. V. & Budescu, M. (2012). How to measure diversity when you must. Psychological Methods, 17(2), 215-227). That could only happen if the sum of the proportions over the categories of your variable were greater than 1 (which is impossible). Can you create a reproducible example?

              Comment


              • #8
                To be more precise: that the sum of the proportions over the categories of your variable is greater than 1 is only a necessary condition for the GV being negative; the sufficient condition is that the sum of the squared proportions over the categories of your variable is greater than 1.
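A small numeric illustration of the two conditions (a sketch in Python; with valid proportions summing to 1, the GV cannot go negative):

```python
def gv(proportions):
    """Generalized variance (Blau index) from category proportions."""
    return 1 - sum(p ** 2 for p in proportions)

# Valid proportions (sum to 1): GV is never negative.
print(gv([0.5, 0.5]))   # 0.5
print(gv([1.0]))        # 0.0

# Sum of proportions > 1 but sum of squares still <= 1:
# GV stays positive, so the first condition alone is not sufficient.
print(gv([0.6, 0.6]))   # positive (about 0.28)

# Sum of *squared* proportions > 1 (e.g. a double-counted category):
# only then does GV go negative.
print(gv([0.8, 0.8]))   # negative (about -0.28)
```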

                Comment


                • #9
                  Dirk Enzmann

                  Dear Dirk,

                  Thank you for your response and for sharing your knowledge. I tried to export a piece of my data to post here, but every time I did, I no longer got negative values. I thus suspected that it had something to do with my large dataset and with my manual calculation of the Blau index. My previous message was unclear, I see, since it suggests that I got the negative values by using divcat. This is not the case, since I used my manual calculation below.

                  Code:
                   * count of each Awards value, stored once per category (0 elsewhere)
                   bys CompanyID Year Awards: gen a = _N*(_n==1)
                   * each observation adds (_N - a^2)/_N^2, summing to 1 - sum of squared proportions
                   bys CompanyID Year: egen AwardsDistribution = total((_N-a^2)/(_N^2))
                   drop a
                   I split up my sample so I would not get the 'too many values' message, and I ran divcat instead. I get exactly the same positive results, but the negative results that I obtained with my manual method are now 0 with divcat. I thus think that it may have to do with rounding (since the negative values I first obtained were very small). Either way, divcat seems to work perfectly for this.
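To see whether rounding is a plausible culprit, the manual Stata formula can be mirrored and compared against the direct definition. A sketch in Python (function names are mine, for illustration):

```python
from collections import Counter

def blau_direct(values):
    # direct definition: 1 - sum of squared category proportions
    n = len(values)
    return 1 - sum((c / n) ** 2 for c in Counter(values).values())

def blau_manual(values):
    # mirrors the Stata snippet: each observation contributes
    # (_N - a^2)/_N^2, where a is the category count at the first
    # observation of each category and 0 elsewhere
    n = len(values)
    counts = Counter(values)
    seen = set()
    total = 0.0
    for v in values:
        a = counts[v] if v not in seen else 0
        seen.add(v)
        total += (n - a ** 2) / n ** 2
    return total

awards = [45, 14, 120, 76]  # CompanyID 1111, year 2008
print(blau_direct(awards), blau_manual(awards))  # both 0.75
```

The two are algebraically identical, so any discrepancy over a large dataset can only come from floating-point accumulation, which is consistent with the very small negative values observed.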

                  Thank you very much for your time and efforts.

                  Comment


                  • #10
                    Your code doesn't pay any attention to whether there are missing values. I don't know if that is biting.

                    Comment


                    • #11
                      Nick Cox

                      Dear Nick,

                      Thank you for your input. Although my code indeed does not pay attention to missing values, there are none in my CompanyID, Year and Awards variables. Although my issue is now fixed by using divcat, I am not sure why my code didn't work optimally. I suspect rounding, but I am not sure.

                      Comment


                      • #12
                        A more serious issue is that you are categorizing (dichotomizing) your data to calculate a measure of heterogeneity, which entails a serious loss of information. Instead you should use a measure of heterogeneity that takes into account the nature of your data (counts), but I am no expert in this field.

                        Comment
