Creating Gini Coefficients from Categorical Income Data in 2000 US Census

Kasey Zapatka

Join Date: Feb 2019

Posts: 12
#1

11 Feb 2019, 20:52

Hey everyone,

This is my first post to Statalist. Please let me know if there is anything missing in terms of posting protocol.

I need a little help creating Gini coefficients for census tracts using categorical income data.

I’m building a longitudinal census tract-level dataset that looks at the impact of segregation and inequality on housing markets across the country. I'm still building it and having trouble constructing Gini coefficients for each census tract. I have block group data nested in census tracts, drawn from the 2000 decennial census and from the 2008-2012 and 2013-2017 ACS 5-year estimates, so three time points. My household income data are counts of individuals in each income bracket within each geographic area, which I have collapsed into 6 brackets: inc1, inc2, inc3, inc4, inc5, inc6.

While the ACS provides Gini coefficients, the 2000 decennial census and earlier censuses do not. So I was planning to use the income brackets to calculate a Gini coefficient for each census tract that is comparable to what the US Census gives you with ACS data. Following Fan et al. (2017) [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4684591/], since the brackets are ordinal-categorical data instead of individual-level data, my coefficients will be underestimated. Therefore, I’ll need to estimate them for all 3 time points to be consistent. Since I have the tract-level Gini coefficients produced by the US Census for the ACS periods, I can compare my estimates with those.

My problem is how exactly to compute them. I’ve emailed Fan et al. and am still waiting to hear back about how they structured their data and wrote their syntax. They use Whitehead’s “relsgini” command, but there is very little information in the Stata ado file, and “relsgini” only accepts one variable, which makes me think that I’ll need to convert the income brackets from wide to long. But when I do that, it only gives me a single overall Gini coefficient:

Code:

. relsgini inc [fw=pop]

Donaldson-Weymark relative S-Gini inequality measures of inc
------------------------------------------------------------------------------
delta = 2    .55093967

Also, there’s very little documentation on how to format the distributional sensitivity parameters and it’s unclear to me how I specify that I want the Gini Coefficients by tracts.
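For readers following along, here is a minimal sketch of one way to get a grouped-data Gini per tract from bracket counts without any user-written command. It reshapes wide counts to long, assigns each bracket a representative income, and applies the trapezoid rule to the Lorenz curve within each tract. The toy data, the variable names `hh` and `mid`, and the bracket midpoints are all assumptions for illustration, not values from this thread. Because it treats everyone in a bracket as having the same income, it ignores within-bracket inequality and so gives the kind of underestimate described above.

```stata
* Sketch: grouped-data Gini by tract from bracket counts (toy data).
* The midpoints below are hypothetical; substitute values for the real brackets.
clear
input tractidn inc1 inc2 inc3
1 10 10 10
2  5 20  5
3 30  0  0
end

* Wide to long: one row per tract-bracket; the count lands in "inc"
reshape long inc, i(tractidn) j(bracket)
rename inc hh

* Assign an assumed representative income to each bracket
generate double mid = .
replace mid = 10000 if bracket == 1
replace mid = 30000 if bracket == 2
replace mid = 60000 if bracket == 3

* Cumulative population and income within tract, ordered by income
bysort tractidn (mid): generate double cumhh = sum(hh)
by tractidn: generate double cuminc = sum(hh * mid)

* Population share p and Lorenz ordinates L (missing if a tract is empty)
by tractidn: generate double p = hh / cumhh[_N]
by tractidn: generate double L = cuminc / cuminc[_N]
by tractidn: generate double Lprev = cond(_n == 1, 0, L[_n - 1])

* Trapezoid rule: G = 1 - sum over brackets of p * (Lprev + L)
by tractidn: egen double area = total(p * (Lprev + L))
generate double gini = 1 - area

* Keep one row per tract
bysort tractidn: keep if _n == 1
list tractidn gini
```

If the rows are block groups rather than tracts, something like `collapse (sum) inc1-inc6, by(tractidn)` would come before the reshape.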

Does anyone have any ideas? I’ve also been trying to calculate them with a number of different user-written Stata programs, i.e., Reardon’s seg, inequal7, and ineqdeco, but haven't had any luck. Reardon’s seg command says there are too many values:

Code:

. seg inc1 inc2 inc3 inc4 inc5, g by(tractidn) u(blkgrpidn) gen(g gini i index)

Code:

Note: Some by-groups have fewer units than groups. Multigroup indices for these by-groups should be interpreted with caution.

Group Variables: inc1 inc2 inc3 inc4 inc5

Total Counts and Diversity Measures
too many values
r(134);

With inequal7 it seems like I'm getting closer, but it doesn’t return any new variables with coefficients (it only gives me one value per bracket), won't let me enter more than 5 brackets, and it’s unclear how to specify that I want the Gini coefficients for census tracts.

Code:

. inequal7 inc1 inc2 inc3 inc4 inc5 [fw=pop], returnscalars

Warning: inc1 has 2408 values == 0 *used* in calculations (except for SD logs, GE(-1), GE(0) (Mean log-deviation) and GE(1) (Theil)).
Warning: inc2 has 1671 values == 0 *used* in calculations (except for SD logs, GE(-1), GE(0) (Mean log-deviation) and GE(1) (Theil)).
Warning: inc3 has 3557 values == 0 *used* in calculations (except for SD logs, GE(-1), GE(0) (Mean log-deviation) and GE(1) (Theil)).
Warning: inc4 has 14690 values == 0 *used* in calculations (except for SD logs, GE(-1), GE(0) (Mean log-deviation) and GE(1) (Theil)).
Warning: inc5 has 43386 values == 0 *used* in calculations (except for SD logs, GE(-1), GE(0) (Mean log-deviation) and GE(1) (Theil)).

----------------------------------------------------------------------------------------
Inequality measures                      |     inc1     inc2     inc3     inc4     inc5
-----------------------------------------+----------------------------------------------
Relative mean deviation                  |  0.30234  0.27124  0.30575  0.37661  0.45320
Coefficient of variation                 |  0.91858  0.86620  0.95843  1.19158  1.46555
Standard deviation of logs               |  0.87052  0.72200  0.83326  0.99858  1.06416
Gini coefficient                         |  0.42406  0.38228  0.42749  0.51864  0.61085
Mehran measure                           |  0.56660  0.50508  0.56279  0.67280  0.77520
Piesch measure                           |  0.35279  0.32088  0.35984  0.44156  0.52867
Kakwani measure                          |  0.15556  0.12848  0.15750  0.22505  0.30583
Theil index (GE(a), a = 1)               |  0.30843  0.26063  0.31756  0.43934  0.52786
Mean Log Deviation (GE(a), a = 0)        |  0.33585  0.25528  0.32601  0.46683  0.55369
Entropy index (GE(a), a = -1)            |  0.60970  0.37482  0.54366  0.84968  1.00210
Half (Coeff.Var. squared) (GE(a), a = 2) |  0.42189  0.37515  0.45929  0.70993  1.07392
----------------------------------------------------------------------------------------

Finally, when I use ineqdeco, it says that I have too many values and won't calculate anything:

Code:

. ineqdeco inc [fw=pop], by (tractidn)
too many values

I saw this post response from Stephen P. Jenkins, but I can't figure out how to fit the loop to my data structure: www.stata.com/statalist/archive/2004-03/msg00287.html
I’d very much appreciate any help or a pointer in the right direction.

I am using Stata 14.1.

Thanks. Best,
Kasey
Tags: census tract, gini coefficients, inequality
Stephen Jenkins

Join Date: Apr 2014

Posts: 1435
#2

12 Feb 2019, 01:45

The problem with using my ineqdeco or ineqdec0 and Van Kerm's inequal7 is that they are designed for unit record data. (Ditto relsgini, but that is obsolete code; I advise against using it.) You have grouped (banded) data, as you say. Methods for estimation of inequality indices in that context are various, typically depending on how much information is available, e.g. whether one has quantile group share data, or frequencies within intervals (with upper and lower interval bounds given), and whether one also knows the mean within the interval; having an open-ended top interval can raise further issues. There is a large literature on this, some of which fits functional forms to the data. For a recent Stata implementation, see the rpme package by von Hippel and Powers on SSC (and the references in the articles cited there). My recollection is that they have data of a similar form to yours. Good luck.
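As a rough illustration of the "frequencies within intervals, with upper and lower bounds" layout mentioned above, the sketch below reshapes wide bracket counts into one row per tract and bin and attaches bounds. The toy counts, the dollar bounds, and the choice to leave the open-ended top bin's upper bound missing are assumptions made for illustration; the package help file is the place to confirm what rpme actually expects.

```stata
* Sketch: reshape wide bracket counts to one row per tract-bin with bounds.
* Counts and bounds are hypothetical, not taken from the census tables.
clear
input tractid inc1 inc2 inc3
101 12 30 8
102  4 10 6
end

* One row per tract-bin; "inc" now holds the count in that bin
reshape long inc, i(tractid) j(bin)

* Lower and upper bounds per bin. The open-ended top bin keeps a missing
* upper bound (an assumption about the expected layout; see -help rpme-).
generate double bin_min = .
generate double bin_max = .
replace bin_min = 0     if bin == 1
replace bin_max = 24999 if bin == 1
replace bin_min = 25000 if bin == 2
replace bin_max = 74999 if bin == 2
replace bin_min = 75000 if bin == 3

list, sepby(tractid)

* With the package installed (ssc install rpme), the call quoted later in
* this thread has the form:
*   rpme inc bin_min bin_max, by(tractid) saving(htract_ests)
```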
Kasey Zapatka

Join Date: Feb 2019

Posts: 12
#3

12 Feb 2019, 15:10

Hi Stephen,

Thanks so much for your suggestions, they were very helpful and I really appreciate them. I was able to create estimates using Von Hippel’s “rpme” package, but I wanted to follow up about how accurate my estimates are.

In Von Hippel’s paper, "Better Estimates from Binned Income Data: Interpolated CDFs and Mean-Matching", he says that at the county level, the estimates should fall within 2% to 7% of the true values. About 5% of my estimates fall outside that range: of the roughly 74,000 census tracts in my data, about 95% of my estimates are within 2%-7% of the real Ginis.

I’m attaching a twoway scatter plot that visualizes the relationship between the gini estimates I created and what the census provides.

Here is what I used to get my estimates:
I used all 16 bins for the household income variable in the census, the average household income as the grand mean, and specified harmonic for the Pareto statistic (I got the same results whether I specified arithmetic or didn’t specify anything).

Code:

rpme inc bin_min bin_max, by(tractid) grand_mean(ahinc) pareto_stat(harmonic) saving(htract_ests)

While I know that the Gini coefficients that I create will be slightly underestimated, I wanted to check whether there was something else I was missing. Should I be using family income or aggregated family income instead of household income?

I guess I’m just curious if anyone thinks I should worry about that level of estimation difference? They seem pretty good to me given that they are estimated without continuous data.
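One way to put numbers on that comparison is sketched below: compute the percent error of each estimate against the published Gini, flag tracts beyond a chosen threshold, and plot the two against a 45-degree line. The four toy rows and the variable names `gini_est`, `gini_pub`, `pct_err`, `abs_err`, and `off7` are hypothetical, and the 7% cutoff is just the upper figure quoted above, not a recommended standard.

```stata
* Sketch: compare estimated and published Ginis (toy numbers, hypothetical names)
clear
input tractid gini_est gini_pub
1 0.40 0.42
2 0.35 0.36
3 0.50 0.45
4 0.30 0.31
end

* Signed and absolute percent error relative to the published value
generate double pct_err = 100 * (gini_est - gini_pub) / gini_pub
generate double abs_err = abs(pct_err)

* Flag tracts whose absolute error exceeds 7 percent
generate byte off7 = abs_err > 7 if !missing(abs_err)

* Distribution of errors and share of flagged tracts
summarize pct_err abs_err, detail
tabulate off7

* Estimates against published values, with a 45-degree reference line
twoway (scatter gini_est gini_pub) (function y = x, range(0.2 0.6)), ///
    legend(off) xtitle("Published Gini") ytitle("Estimated Gini")
```

A negative mean of `pct_err` would be consistent with the expected underestimation; a wide spread in `abs_err` would point to tract-specific problems rather than a uniform bias.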

Any thoughts would be greatly appreciated.
Stephen Jenkins

Join Date: Apr 2014

Posts: 1435
#4

13 Feb 2019, 02:29

I'm not in a position to comment on the veracity of the estimates (I have no experience with these sorts of data), but you might try running them by Paul von Hippel. (He's an occasional contributor to this forum.) BTW The article of his that you downloaded from ResearchGate is available, published open access, at https://www.sociologicalscience.com/articles-v4-26-641/. BTW2 I've just seen another related article -- by Jargowsky & Wheeler on "Estimating Income Statistics from Grouped Data: Mean-constrained Integration over Brackets", Sociological Methodology (https://journals.sagepub.com/doi/ful...81175018782579). They claim greater accuracy than the RPME method. But (from my first glance) I saw no reference to software to implement their estimator.
Kasey Zapatka

Join Date: Feb 2019

Posts: 12
#5

18 Feb 2019, 11:49

Hi Stephen,

I reached out to Paul and he was helpful. The Jargowsky article is a good read and sounds like it might be an improvement, but there doesn't seem to be a software implementation of their estimator yet. I'm going to reach out to them directly.

Thanks again for all your help. I really appreciate it.

Best,
Kasey
julian mwanana

Join Date: Apr 2019

Posts: 17
#6

08 May 2019, 08:29

Hi all,
Can anyone help me with some Stata commands for generating or computing 1) the mean of age and 2) the coefficient of variation of age, for a dataset that looks like this?
(I would appreciate any additional knowledge or material on Stata commands and their use for a beginner like me.)

Company  Year  Age
SCOM     2008  62
SCOM     2009  63
SCOM     2010  52
SCOM     2011  53
SCOM     2012  54
SCOM     2013  55
SCOM     2014  56
SCOM     2015  57
SCOM     2016  58
SCOM     2017  59

Last edited by julian mwanana; 08 May 2019, 08:42.
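A minimal sketch of one way to do this, using the ten rows listed in the post: `summarize` leaves the mean and standard deviation in `r()`, and the coefficient of variation is their ratio. The per-company variables `mean_age`, `sd_age`, and `cv_age` are names chosen here for illustration; they only matter once the data hold more than one company.

```stata
* Sketch: mean and coefficient of variation of Age (data from the post)
clear
input str4 Company Year Age
SCOM 2008 62
SCOM 2009 63
SCOM 2010 52
SCOM 2011 53
SCOM 2012 54
SCOM 2013 55
SCOM 2014 56
SCOM 2015 57
SCOM 2016 58
SCOM 2017 59
end

* Overall mean and SD; summarize stores them in r(mean) and r(sd)
summarize Age
display "Mean of Age = " r(mean)
display "CV of Age   = " r(sd) / r(mean)

* Same quantities by company, saved as new variables
bysort Company: egen double mean_age = mean(Age)
bysort Company: egen double sd_age = sd(Age)
generate double cv_age = sd_age / mean_age
```

`help summarize` and `help egen` describe the stored results and functions used here.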