Maximizing between-group variance for a given distribution

Imed Limam

Join Date: May 2014

Posts: 39
#1

04 Oct 2014, 11:24

Dear StataListers,

I am trying to re-allocate income (x) across the members of J sub-groups, classified according to some criterion (such as sex, region or parents' educational background), so as to maximize the between-group variance, or inequality, of the distribution of x. This implies that, in the new counterfactual distribution, the sub-groups' incomes occupy non-overlapping intervals. More specifically, if the J groups are ordered from 1 to J (g1, g2, ..., gJ), with group j having size nj (sum(nj) = n, the sample size), I need to re-assign incomes (not individuals) so as to maximize the between-group variance (or inequality) of income while preserving the number of sub-groups, their rank ordering and their relative sizes. The procedure suggested in the literature consists of allocating the n1 lowest incomes to g1, the next n2 lowest to g2, and so on.

I use the following example for illustration:

/* simple example */
input ind x str1 grp
1 2000 R
2 4300 U
3 5200 R
4 8500 U
5 8800 U
6 11000 R
7 12500 U
end
sort grp

/* How to obtain the rank ordering and the new distribution that look like this:

ngrp xnew
R    2000
R    4300
R    5200
U    8500
U    8800
U    11000
U    12500
*/

I would appreciate any suggestions for obtaining this outcome, especially for a larger number of groups of different sizes.

Thank you in advance.

Imed.
Clyde Schechter

Join Date: Apr 2014

Posts: 30118
#2

04 Oct 2014, 13:31

So, I'm going to assume that you have the group sizes n1, ..., nJ in local macros of those same names. I'm also going to assume that the number of groups is in a local macro named J. Here is the basic approach:

Code:

sort x
local from = 1
gen ngrp = .
forvalues i = 1/`J' {
    local to = `from' + `n`i'' - 1
    replace ngrp = `i' in `from'/`to'
    local `from' = `to'+1
}

A couple of notes: this will leave a variable ngrp that identifies each group. You seem to want some kind of letters attached to those, though it isn't at all clear what the number/letter correspondence should be (the pattern underlying R U is not obvious): you can define the corresponding label and apply it to this variable.

If there are observations with tied values of x, this code will sort them in random order, and if the break between groups happens to fall within those tied values, the decision as to which observations go in which of those groups will be random. More generally, you need a plan for dealing with tied values of x unless you are sure that there aren't any. Also, this code will put all observations with missing values of x in the final group (because Stata sorts missing as larger than all numbers). So, you may want to drop such observations first, or figure some other way to classify them.
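One possible way to act on the two cautions above is sketched here. It is only a sketch: it assumes that the identifier ind from the example in #1 uniquely identifies observations and that breaking ties by ind is acceptable, which is an arbitrary choice. The example data are re-entered so that the block runs on its own.

```stata
* Sketch only: reproducible handling of ties and missing values of x.
clear
input ind x str1 grp
1 2000 R
2 4300 U
3 5200 R
4 8500 U
5 8800 U
6 11000 R
7 12500 U
end

drop if missing(x)      // keeps missing incomes out of the highest group
isid ind                // stops with an error unless ind uniquely identifies observations
duplicates report x     // reports how many tied values of x there are
sort x ind              // ties in x are broken by ind (assumed rule), so the order is reproducible
```

With this example nothing is dropped and there are no ties, so the result is the same as a plain sort x. The difference only shows up in data where x is tied or missing.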
Imed Limam

Join Date: May 2014

Posts: 39
#3

04 Oct 2014, 16:02

Thank you, Clyde, for your prompt reply. In the hypothetical example of income distribution between rural and urban dwellers, my concern is to have a counterfactual distribution that respects the pecking order of the sub-groups, i.e., a reshuffling of income across sub-groups that maximizes between-group variation while preserving both the rank ordering (the rural population has a lower mean income than the urban population) and the group sizes (3 are rural and 4 are urban). The suggested procedure is to rank the groups by mean income (in this case Rural would be group 1 and Urban group 2), then allocate the lowest incomes to the members of g1, the next lowest to the members of g2, and so on. In the general case of J groups, respecting the pecking order and the sizes of the groups brings the number of possible orderings of the groups down from J! to one.
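The procedure described above can also be sketched so that each individual keeps their original group and receives a new income. This is not the code posted elsewhere in the thread; it is a hedged alternative that uses Mata's st_data(), sort() and st_store(). The variable names gmean and xnew are introduced here for illustration only, and the sketch assumes x has no missing values.

```stata
* Sketch only: rank groups by mean income, then hand out the sorted incomes in that order.
clear
input ind x str1 grp
1 2000 R
2 4300 U
3 5200 R
4 8500 U
5 8800 U
6 11000 R
7 12500 U
end

egen double gmean = mean(x), by(grp)   // group mean income defines the pecking order
sort gmean grp ind                      // lowest-mean group first; ind makes the order reproducible
gen double xnew = .                     // counterfactual income, filled in by Mata below
mata: st_store(., "xnew", sort(st_data(., "x"), 1))   // k-th observation gets the k-th smallest x
list grp x xnew, noobs clean
```

Because the data are ordered by group mean before the sorted incomes are stored, the first n1 observations (the lowest-mean group) receive the n1 lowest incomes, and so on, whatever the number of groups.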

Between-group variation is maximized when the groups' income intervals do not overlap. In the example I suggested, the interval of rural incomes is [2000-11000] while that of urban incomes is [4300-12500]. Under the new hypothetical distribution, rural incomes would lie within [2000-5200] and urban incomes within [8500-12500]. These intervals do not overlap, and the new distribution maximizes between-group variation. It is not important to know who gets what, as long as the mean income of the rural population remains below that of the urban population and the group sizes do not change. The income allocation scheme under the hypothetical distribution may be conceived in terms of the order of observations (_n) within each group, i.e., after sorting the observations by grp (rural first, then urban), the first observation (individual) receives the lowest income, the second receives the next lowest, and so on. I hope this helps. Thank you for your time.
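The claim for this example can be checked numerically with a short sketch. It takes between-group variance to be the population variance of the group means, with each group weighted by its size; that definition, and the names gmean, nmean, vb_old and vb_new, are assumptions made for illustration. The counterfactual groups are built directly from the sizes 3 and 4 given above.

```stata
* Sketch only: between-group variance before and after the re-allocation.
clear
input ind x str1 grp
1 2000 R
2 4300 U
3 5200 R
4 8500 U
5 8800 U
6 11000 R
7 12500 U
end

* Original distribution: variance of group means across individuals
egen double gmean = mean(x), by(grp)
quietly summarize gmean
scalar vb_old = r(Var) * (r(N) - 1) / r(N)   // rescale from N-1 to N in the denominator

* Counterfactual: the 3 lowest incomes go to group 1 (R), the other 4 to group 2 (U)
sort x
gen byte ngrp = cond(_n <= 3, 1, 2)
egen double nmean = mean(x), by(ngrp)
quietly summarize nmean
scalar vb_new = r(Var) * (r(N) - 1) / r(N)

display "between-group variance, original:       " %14.0f vb_old
display "between-group variance, counterfactual: " %14.0f vb_new
```

The total variance of x is the same in both cases, since only the assignment of incomes to groups changes. A larger between-group component therefore means a smaller within-group component.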
Clyde Schechter

Join Date: Apr 2014

Posts: 30118
#4

04 Oct 2014, 18:54

So, the code I gave you earlier will work for this. To apply it to your example, where there are just two groups, R and U, you have to do the following before that code:

Code:

local J 2
local n1 3
local n2 4

Then you run the code from my earlier response, with one small error corrected:

Code:

sort x
local from = 1
gen ngrp = .
forvalues i = 1/`J' {
    local to = `from' + `n`i'' - 1
    replace ngrp = `i' in `from'/`to'
    local from = `to'+1 // REMOVED ERRONEOUS QUOTES AROUND from
}

Then you can finish off with:

Code:

label define ngrp 1 "R" 2 "U"
label values ngrp ngrp

and you're done.

Last edited by Clyde Schechter; 04 Oct 2014, 19:00.

Clyde Schechter

Join Date: Apr 2014
Posts: 30118

04 Oct 2014, 19:18

It dawns on me that you may not have J, n1, and n2 fixed in advance and you may want to extract them from the data itself. In that case, the first and last blocks of code in my latest response should be modified to do that. The whole sequence would then be as follows:

Code:

// IDENTIFY THE EXISTING GROUPS' SIZES AND MEAN X
preserve
collapse (mean) x (count) n=x, by(grp)
sort x // ORDER GROUPS LOWEST x TO HIGHEST
count
local J = r(N) // NUMBER OF GROUPS
forvalues j = 1/`J' {
    local n`j' = n[`j'] // NUMBER OF ORIGINAL OBS IN GROUP
}
// BUILD A LABEL FOR ngrp, NEEDED LATER
local ngrp_label
forvalues j = 1/`J' {
    local ngrp_label `ngrp_label' `j' "`=grp[`j']'"
}
restore // DONE WITH THAT; BRING BACK ORIGINAL DATA
local from = 1
gen ngrp = .
sort x
forvalues i = 1/`J' {
    local to = `from' + `n`i'' - 1
    replace ngrp = `i' in `from'/`to'
    local from = `to'+1 // REMOVED ERRONEOUS QUOTES AROUND from
}
label define ngrp `ngrp_label'
label values ngrp ngrp
list, noobs clean

Last edited by Clyde Schechter; 04 Oct 2014, 19:21.


Imed Limam

Join Date: May 2014

Posts: 39
#6

05 Oct 2014, 00:33

Both versions work just fine. Thank you very much.
