
  • Creating Gini Coefficients from Categorical Income Data in 2000 US Census

    Hey everyone,

    This is my first post to Statalist. Please let me know if there is anything missing in terms of posting protocol.

    I need a little help creating Gini coefficients for census tracts using categorical income data.

    I’m building a longitudinal census tract-level dataset that looks at the impact of segregation and inequality on housing markets across the country. I'm still building it and having trouble constructing Gini coefficients for each census tract. I have block group data nested in census tracts, with data from the 2000 decennial census as well as the 2008-2012 and 2013-2017 ACS 5-year estimates, so three time points. I have household income data, which is the count of the individuals in each income bracket within that geographical area, which I have collapsed into 6 brackets: inc1, inc2, inc3, inc4, inc5, inc6.

    While the ACS provides Gini coefficients, the Census does not provide them for the 2000 decennial data or earlier. So, I was planning to construct my own Gini coefficients from the income brackets for each census tract, so that they are comparable to what the US Census gives you with ACS data. Following Fan et al. (2017), since the brackets are ordinal-categorical data instead of individual-level data, my coefficients will be underestimated. Therefore, I’ll need to estimate them for all 3 years to be consistent. Since I have the tract-level Gini coefficients produced by the US Census, I can compare my estimates with those.

    My problem is how exactly to compute them. I’ve emailed Fan et al. and am still waiting to hear back about how they structured their data and wrote their syntax. They use Whitehead’s “relsgini” command, but there is very little information in the Stata ado file, and “relsgini” only accepts one variable, which makes me think that I’ll need to convert the income brackets from wide to long. But when I do that, it only gives me an overall Gini coefficient:

    . relsgini inc [fw=pop] 
    Donaldson-Weymark relative S-Gini inequality measures of inc
    delta = 2                              .55093967
    Also, there’s very little documentation on how to format the distributional sensitivity parameters, and it’s unclear to me how to specify that I want the Gini coefficients by tract.

    Does anyone have any ideas? I’ve also been trying to calculate them with a number of different Stata user-written programs, i.e., Reardon’s seg, inequal7, and ineqdeco, but haven't had any luck. Reardon’s seg command says there are too many values:
    . seg inc1 inc2 inc3 inc4 inc5, g by(tractidn) u(blkgrpidn) gen(g gini i index)
    Note: Some by-groups have fewer units than groups. Multigroup indices for these by-groups should be interpreted with caution.
    Group Variables:   inc1 inc2 inc3 inc4 inc5
    Total Counts and Diversity Measures
    too many values

    With inequal7 it seems like I'm getting closer, but it doesn’t return any new variables with coefficients (it only gives me one value), it won't let me enter more than 5 brackets, and it’s unclear how to specify that I want the Gini coefficients for census tracts.
    . inequal7 inc1 inc2 inc3 inc4 inc5 [fw=pop] ,returnscalars 
    Warning: inc1 has 2408 values == 0 *used* in calculations
        (except for SD logs, GE(-1), GE(0) (Mean log-deviation) and GE(1) (Theil)).
    Warning: inc2 has 1671 values == 0 *used* in calculations
        (except for SD logs, GE(-1), GE(0) (Mean log-deviation) and GE(1) (Theil)).
    Warning: inc3 has 3557 values == 0 *used* in calculations
        (except for SD logs, GE(-1), GE(0) (Mean log-deviation) and GE(1) (Theil)).
    Warning: inc4 has 14690 values == 0 *used* in calculations
        (except for SD logs, GE(-1), GE(0) (Mean log-deviation) and GE(1) (Theil)).
    Warning: inc5 has 43386 values == 0 *used* in calculations
        (except for SD logs, GE(-1), GE(0) (Mean log-deviation) and GE(1) (Theil)).
                         Inequality measures |    inc1    inc2    inc3    inc4    inc5
                     Relative mean deviation | 0.30234 0.27124 0.30575 0.37661 0.45320
                    Coefficient of variation | 0.91858 0.86620 0.95843 1.19158 1.46555
                  Standard deviation of logs | 0.87052 0.72200 0.83326 0.99858 1.06416
                            Gini coefficient | 0.42406 0.38228 0.42749 0.51864 0.61085
                              Mehran measure | 0.56660 0.50508 0.56279 0.67280 0.77520
                              Piesch measure | 0.35279 0.32088 0.35984 0.44156 0.52867
                             Kakwani measure | 0.15556 0.12848 0.15750 0.22505 0.30583
                  Theil index (GE(a), a = 1) | 0.30843 0.26063 0.31756 0.43934 0.52786
           Mean Log Deviation (GE(a), a = 0) | 0.33585 0.25528 0.32601 0.46683 0.55369
               Entropy index (GE(a), a = -1) | 0.60970 0.37482 0.54366 0.84968 1.00210
    Half (Coeff.Var. squared) (GE(a), a = 2) | 0.42189 0.37515 0.45929 0.70993 1.07392
    Finally, when I use ineqdeco, it says that I have too many values and won't calculate anything:

     . ineqdeco inc [fw=pop], by (tractidn)
    too many values

    While I saw this post response from Stephen P. Jenkins, I can't figure out how to fit this loop to my data structure: ://
    Any help or if you could point me in the right direction, I’d very much appreciate it.

    I am using Stata 14.1.

    Thanks. Best,

  • #2
    The problem with using my ineqdeco or ineqdec0 and Van Kerm's inequal7 is that they are designed for unit record data. (Ditto relsgini, but that is obsolete code; I advise against using it.) You have grouped (banded) data, as you say. Methods for estimation of inequality indices in that context are various, typically depending on how much information is available, e.g. whether one has quantile group share data, or frequencies within intervals (with upper and lower interval bounds given), and whether one also knows the mean within the interval; having an open-ended top interval can raise further issues. There is a large literature on this, some of which fits functional forms to the data. For a recent Stata implementation, see the rpme package by von Hippel and Powers on SSC (and the references in the articles cited there). My recollection is that they have data of a similar form to yours. Good luck.
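
    To make the grouped-data point concrete, here is a minimal sketch of the simplest approach for frequencies within intervals: give every household its bracket midpoint and compute the Gini from the resulting piecewise-linear Lorenz curve, separately for each tract. This is not the rpme estimator; it ignores inequality within brackets, so it understates the Gini. It assumes one row per tract with counts inc1-inc6 and an identifier tractidn. The midpoints, including the value for the open-ended top bracket, are made-up placeholders, and the names n, mid, y, N, Y, L, Lprev, piece, S and gini_mid are illustrative.

```stata
* Sketch only: midpoint-based Gini per tract from bracket counts.
* Assumes one row per tract with counts inc1-inc6 and an ID tractidn.
* The midpoints are HYPOTHETICAL placeholders; substitute the real bracket
* bounds, and choose the top-bracket value deliberately (it is open-ended).
reshape long inc, i(tractidn) j(bracket)
rename inc n

generate double mid = .
replace mid =   7500 if bracket == 1
replace mid =  22500 if bracket == 2
replace mid =  40000 if bracket == 3
replace mid =  62500 if bracket == 4
replace mid = 112500 if bracket == 5
replace mid = 200000 if bracket == 6   // assumed value for the open top bracket

* Tract totals of households and of (approximate) income
generate double y = n * mid
bysort tractidn: egen double N = total(n)
bysort tractidn: egen double Y = total(y)

* Cumulative income share (Lorenz ordinate) at each bracket and the one before
bysort tractidn (bracket): generate double L = sum(y) / Y
bysort tractidn (bracket): generate double Lprev = cond(_n == 1, 0, L[_n-1])

* Gini = 1 - sum over brackets of (population share) * (Lprev + L)
generate double piece = (n / N) * (L + Lprev)
bysort tractidn: egen double S = total(piece)
generate double gini_mid = 1 - S if Y > 0 & !missing(Y)

* Back to one row per tract
keep tractidn gini_mid
duplicates drop
```

    Because everything runs under bysort, there is no limit on the number of tracts, unlike the by() options that returned "too many values". Tracts with no households get a missing gini_mid.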


    • #3
      Hi Stephen,

      Thanks so much for your suggestions, they were very helpful and I really appreciate them. I was able to create estimates using Von Hippel’s “rpme” package, but I wanted to follow up about how accurate my estimates are.

      In Von Hippel’s paper, "Better Estimates from Binned Income Data: Interpolated CDFs and Mean-Matching", he says that at the county level, the estimates should fall within 2% to 7% of the true values. About 5% of my estimates fall outside of that range; that is, of the roughly 74,000 census tracts in my data, about 95% of my estimates are within 2% to 7% of the Ginis the Census provides.

      I’m attaching a twoway scatter plot that visualizes the relationship between the gini estimates I created and what the census provides.
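
      A minimal sketch of how a comparison like this could be set up, assuming the estimates have already been merged with the Census-published values into one row per tract. The variable names gini_est and gini_acs are hypothetical, and the 7% cutoff is simply the upper end of the range quoted above. The /// line continuations need a do-file or the do-file editor.

```stata
* gini_est and gini_acs are HYPOTHETICAL variable names for the binned-data
* estimate and the Census-published tract Gini, one row per tract.
generate double pct_diff = 100 * abs(gini_est - gini_acs) / gini_acs ///
    if !missing(gini_est, gini_acs) & gini_acs > 0

* Distribution of the percentage differences, and how many exceed 7%
summarize pct_diff, detail
count if pct_diff > 7 & !missing(pct_diff)

* Scatter of estimate against published value, with a 45-degree reference line
twoway (scatter gini_est gini_acs, msize(tiny)) ///
       (function y = x, range(gini_acs)), ///
       xtitle("Census-published Gini") ytitle("Estimated Gini") ///
       legend(order(1 "Tracts" 2 "45-degree line"))
```

      Points below the 45-degree line are tracts where the binned estimate is lower than the published Gini.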

      Here is what I used to get my estimates:
      • I used all 16 bins for the household income variable in the census, the average household income as the grand mean, and specified harmonic for pareto statistic (I got the same results whether I specified arithmetic or didn’t specify anything).
       rpme inc bin_min bin_max, by(tractid) grand_mean(ahinc) pareto_stat(harmonic) saving(htract_ests)
      While I know that the Gini coefficients that I create will be slightly underestimated, I wanted to check whether there was something else I was missing. Should I be using family income or aggregated family income instead of household income?

      I guess I’m just curious if anyone thinks I should worry about that level of estimation difference? They seem pretty good to me given that they are estimated without continuous data.

      Any thoughts would be greatly appreciated.

      [Attached image: hgini_drop.png (scatter plot of my Gini estimates against the Census-provided values)]


      • #4
        I'm not in a position to comment on the veracity of the estimates (I have no experience with these sorts of data), but you might try running them by Paul von Hippel. (He's an occasional contributor to this forum.) BTW The article of his that you downloaded from ResearchGate is available, published open access, at BTW2 I've just seen another related article -- by Jargowsky & Wheeler on "Estimating Income Statistics from Grouped Data: Mean-constrained Integration over Brackets", Sociological Methodology ( They claim greater accuracy than the RPME method. But (from my first glance) I saw no reference to software to implement their estimator.


        • #5
          Hi Stephen,

          I reached out to Paul and he was helpful. The Jargowsky article is a good read and sounds like it might be an improvement, but there doesn't seem to be a software implementation of their estimator yet. I'm going to reach out to them directly.

          Thanks again for all your help. I really appreciate it.



          • #6
            Hi all,
            Can anyone help me with some Stata commands for computing (1) the mean of age and (2) the coefficient of variation of age, for a dataset that looks like this?
            (I would appreciate any additional material on Stata commands and their use, as I am a beginner.)
            Company Year Age
            SCOM 2008 62
            SCOM 2009 63
            SCOM 2010 52
            SCOM 2011 53
            SCOM 2012 54
            SCOM 2013 55
            SCOM 2014 56
            SCOM 2015 57
            SCOM 2016 58
            SCOM 2017 59
            Last edited by julian mwanana; 08 May 2019, 09:42.
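
            For data laid out like the listing above, a minimal sketch of one way to get both statistics. It assumes lower-case variable names company, year and age; the new variable names mean_age, sd_age and cv_age are illustrative. The input block only recreates the ten rows shown, so skip it if the data are already in memory.

```stata
* Recreate the example data (skip this block if the data are already loaded)
clear
input str4 company int year byte age
"SCOM" 2008 62
"SCOM" 2009 63
"SCOM" 2010 52
"SCOM" 2011 53
"SCOM" 2012 54
"SCOM" 2013 55
"SCOM" 2014 56
"SCOM" 2015 57
"SCOM" 2016 58
"SCOM" 2017 59
end

* Display the mean, SD and coefficient of variation of age for each company
tabstat age, by(company) statistics(mean sd cv)

* Or store them as variables: CV = standard deviation / mean
egen double mean_age = mean(age), by(company)
egen double sd_age   = sd(age), by(company)
generate double cv_age = sd_age / mean_age
```

            tabstat only displays the statistics; the egen lines attach them to every observation of the company, which is handy if they are needed in later calculations.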