Rank Variable for Income, then further for region

Em Arham

Join Date: Jul 2019

Posts: 11
#1

27 Jul 2019, 08:57

Hi,
I am fairly new to this forum so please excuse me if I do not follow some guidelines I am not aware of.

My question is for a dataset that contains variables such as Reported Income, Region etc.

I want to create a rank variable between 0 and 1. Essentially I would like to rank all the incomes into a reference group (lowest to highest so that 1 will be the highest income) and divide that number by the total number in the reference group.
My reference group is region, so for example all in London, all in Wales etc.
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

27 Jul 2019, 12:56

I want to create a rank variable between 0 and 1. Essentially I would like to rank all the incomes into a reference group (lowest to highest so that 1 will be the highest income) and divide that number by the total number in the reference group.

I'm completely confused by this. A rank is, by definition, a positive integer, so it can not be between 0 and 1. What do you mean by a reference group? Do you mean that you want to compare one region to all the others? If so, how will you pick which one?

I suggest that you post back. When you do, use the -dataex- command to show example data.* Also, show the results you would want to see for this "rank variable" in your example, and, explain how you got it. Then perhaps somebody will be able to show you how to code it.

I am fairly new to this forum so please excuse me if I do not follow some guidelines I am not aware of.

The only guidelines here are those in the FAQ, which every member is reminded to read before they do their first post. So everybody, even the newest members, should be familiar with them before posting. It's never too late. Read the FAQ now: I think you will find them very helpful.

* If you are running version 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

Em Arham

Join Date: Jul 2019
Posts: 11
#3

28 Jul 2019, 08:55

Hi Clyde,

Apologies for this. Allow me to try and explain my issue in more detail;

I am trying to understand the way reference groups of income impact life satisfaction. One of these reference groups is income of individuals within the same area/region as you.

I have a dataset containing a variable for 1) Annual Income, I have used dataex to provide example data;

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input double br_fiyr
      32175.484375
   8668.6259765625
    59120.83203125
1449.7247314453125
   31282.189453125
    4375.841796875
 3983.781494140625
   9922.8271484375
     7580.80859375
     19815.2421875
    39318.24609375
 1425.958740234375
                 0
                 .
     4614.52734375
   12654.552734375
             12852
   3995.3935546875
             19968
  2943.89892578125
  13048.4619140625
   25731.712890625
 2611.257568359375
   20837.154296875
       4417.453125
  5704.38818359375
       7631.109375
  10668.1630859375
   30694.435546875
                 0
   24654.998046875
   19128.951171875
                 0
         51210.625
               230
    17652.31640625
   9881.4501953125
    9485.244140625
                .a
   21238.130859375
    11643.91796875
   12502.705078125
      5211.5546875
   10042.333984375
    6005.908203125
   19565.193359375
   9237.5322265625
   11515.060546875
         8374.5625
     24187.6796875
    6140.146484375
   15037.044921875
   2866.7470703125
   8289.6845703125
             35720
             77840
    23000.07421875
   8539.4619140625
    5346.060546875
  15006.9287109375
  5814.23095703125
             14575
  4916.06884765625
   4197.7373046875
   18756.724609375
 2764.363525390625
   21693.060546875
      12682.828125
       20595.84375
   7103.3291015625
                .a
    13369.19140625
   2372.4501953125
    34736.72265625
   12052.337890625
    8727.677734375
   5652.3017578125
    6143.029296875
   2971.1884765625
   21808.083984375
   30255.162109375
     16570.8203125
    8113.212890625
        18772.4375
    19199.98046875
    26519.19140625
  4150.27099609375
   5607.2080078125
  2149.67626953125
     21614.2265625
    16984.31640625
                .a
                 0
       26042.09375
    20101.02734375
    105341.5703125
        22256.5625
               100
  11415.8896484375
        9299.15625
end
label values br_fiyr br_fiyr

------------------ copy up to and including the previous line ------------------

Listed 100 out of 14419 observations
Use the count() option to list more

Secondly, I have a variable Region, also example data provided via dataex command;

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte br_region
 4
 4
 3
 3
 2
 2
 2
 2
 2
 3
 3
 3
 3
 3
 3
 3
 3
 3
 3
 3
 3
 3
 3
 3
 3
 3
 3
 3
 3
 3
 3
 4
 4
 4
 4
 4
 4
 4
 4
 4
 4
 4
 4
 5
 5
 3
 5
 5
 5
 7
 7
 7
 7
 7
 3
 3
 7
 8
 8
 8
 8
 8
 8
 8
 8
 8
 8
 9
 9
 9
 8
10
10
10
10
11
 6
 6
12
14
15
15
15
16
17
17
17
17
17
17
17
18
 2
 7
 8
17
17
17
17
17
end
label values br_region br_region
label def br_region 2 "Outer London", modify
label def br_region 3 "R. of South East", modify
label def br_region 4 "South West", modify
label def br_region 5 "East Anglia", modify
label def br_region 6 "East Midlands", modify
label def br_region 7 "West Midlands Conurbation", modify
label def br_region 8 "R. of West Midlands", modify
label def br_region 9 "Greater Manchester", modify
label def br_region 10 "Merseyside", modify
label def br_region 11 "R. of North West", modify
label def br_region 12 "South Yorkshire", modify
label def br_region 14 "R. of Yorks & Humberside", modify
label def br_region 15 "Tyne & Wear", modify
label def br_region 16 "R. of North", modify
label def br_region 17 "Wales", modify
label def br_region 18 "Scotland", modify


If you would like to view these variables as explained by the dataset website;
Here are the URL links;
Annual Income Variable
Region Variable

My dependent variable is life satisfaction.

I have come across a paper that has done similar research, however, the authors highlight how using a rank variable for income results in higher R-squared and generally more accurate results.
The paper can be found here.

Essentially, this is what the authors have done, in their own words;

" We test a simple rank-based model according to which the individual compares themself to a sample of other people in their reference group and assesses whether each sampled individual earns more or less than themselves (Stewart et al., 2006). Those assigned “worse than” (i-1) are compared to the total number within the reference group (n-1). The ratio gives the individual a relative rank (Ri) normalized between 0 and 1:
(1) $R_i = \frac{i-1}{n-1}$ ...
We use Ri to predict life satisfaction in a multiple regression...
Next, we compared the rank and reference income hypotheses. To do this we constructed various reference groups to explore the possibility that people compare their income to others in the same geographical region (of which there were 19 in the BHPS) "

I am trying to code my income variable similar to their 'rank' income, thus I can follow with using the reference group income (Regional income) as a comparison tool for life satisfaction.


Clyde Schechter

Join Date: Apr 2014
Posts: 30100
#4

28 Jul 2019, 09:41

I think I understand now. The complicating factor in your data is that you have observations where the value of the income variable is missing. The text you cite does not state how this situation was handled. In the code below I assume that when a person's income is missing they are not given a "rank" nor are they counted among the people others compare themselves to when they get their rank. It is as if they are not in the data set (though I do not go so far as to drop those observations.)

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte br_region double br_fiyr
 4       32175.484375
 4    8668.6259765625
 3     59120.83203125
 3 1449.7247314453125
 2    31282.189453125
 2     4375.841796875
 2  3983.781494140625
 2    9922.8271484375
 2      7580.80859375
 3      19815.2421875
 3     39318.24609375
 3  1425.958740234375
 3                  0
 3                  .
 3      4614.52734375
 3    12654.552734375
 3              12852
 3    3995.3935546875
 3              19968
 3   2943.89892578125
 3   13048.4619140625
 3    25731.712890625
 3  2611.257568359375
 3    20837.154296875
 3        4417.453125
 3   5704.38818359375
 3        7631.109375
 3   10668.1630859375
 3    30694.435546875
 3                  0
 3    24654.998046875
 4    19128.951171875
 4                  0
 4          51210.625
 4                230
 4     17652.31640625
 4    9881.4501953125
 4     9485.244140625
 4                 .a
 4    21238.130859375
 4     11643.91796875
 4    12502.705078125
 4       5211.5546875
 5    10042.333984375
 5     6005.908203125
 3    19565.193359375
 5    9237.5322265625
 5    11515.060546875
 5          8374.5625
 7      24187.6796875
 7     6140.146484375
 7    15037.044921875
 7    2866.7470703125
 7    8289.6845703125
 3              35720
 3              77840
 7     23000.07421875
 8    8539.4619140625
 8     5346.060546875
 8   15006.9287109375
 8   5814.23095703125
 8              14575
 8   4916.06884765625
 8    4197.7373046875
 8    18756.724609375
 8  2764.363525390625
 8    21693.060546875
 9       12682.828125
 9        20595.84375
 9    7103.3291015625
 8                 .a
10     13369.19140625
10    2372.4501953125
10     34736.72265625
10    12052.337890625
11     8727.677734375
 6    5652.3017578125
 6     6143.029296875
12    2971.1884765625
14    21808.083984375
15    30255.162109375
15      16570.8203125
15     8113.212890625
16         18772.4375
17     19199.98046875
17     26519.19140625
17   4150.27099609375
17    5607.2080078125
17   2149.67626953125
17      21614.2265625
17     16984.31640625
18                 .a
 2                  0
 7        26042.09375
 8     20101.02734375
17     105341.5703125
17         22256.5625
17                100
17   11415.8896484375
17         9299.15625
end
label values br_region br_region
label def br_region 2 "Outer London", modify
label def br_region 3 "R. of South East", modify
label def br_region 4 "South West", modify
label def br_region 5 "East Anglia", modify
label def br_region 6 "East Midlands", modify
label def br_region 7 "West Midlands Conurbation", modify
label def br_region 8 "R. of West Midlands", modify
label def br_region 9 "Greater Manchester", modify
label def br_region 10 "Merseyside", modify
label def br_region 11 "R. of North West", modify
label def br_region 12 "South Yorkshire", modify
label def br_region 14 "R. of Yorks & Humberside", modify
label def br_region 15 "Tyne & Wear", modify
label def br_region 16 "R. of North", modify
label def br_region 17 "Wales", modify
label def br_region 18 "Scotland", modify
label values br_fiyr br_fiyr

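* Flag observations with missing income so they form their own by-group
gen missing_income = missing(br_fiyr)
* Within each region, sort non-missing incomes ascending and scale position to [0, 1]
by br_region missing_income (br_fiyr), sort: gen rel_rank = (_n-1)/(_N-1) ///
    if !missing_income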

By the way, in the future, when posting data examples, don't do a separate -dataex- for each variable. Post one that contains all the relevant variables if possible. (If the number of relevant variables is too large, then splitting it up is OK.)
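
As a hand check (a sketch added for illustration, not part of the original reply), the five East Anglia (region 5) incomes and the single R. of North West (region 11) income from the example above can be run through the same two commands. With five non-missing incomes the relative ranks should come out as 0, .25, .5, .75 and 1. A region with only one non-missing income gets a missing rank, because (_n-1)/(_N-1) is 0/0 there.

```stata
* Subset of the example data above: region 5 (five incomes) and region 11 (one income)
clear
input byte br_region double br_fiyr
 5 10042.333984375
 5  6005.908203125
 5 9237.5322265625
 5 11515.060546875
 5       8374.5625
11  8727.677734375
end

* Same commands as in the reply above
gen missing_income = missing(br_fiyr)
by br_region missing_income (br_fiyr), sort: gen rel_rank = (_n-1)/(_N-1) ///
    if !missing_income

* Data are now sorted by region and income, so the ranks can be read off directly
list, sepby(br_region)
```

How singleton groups should be treated is a design decision; the code simply leaves them missing.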


Em Arham

Join Date: Jul 2019

Posts: 11
#5

28 Jul 2019, 10:07

Yes, there are values in the dataset for the income variable where the individuals did not respond or perhaps were unemployed during that time.
As you can see, these values have been treated as "missing" as denoted by .a or .b (It is my understanding that this is common practice in such cases).
You are also right that these cases are not meant to be given a rank, nor included in the group of comparison.

If the way I have treated missing values is wrong, please correct me and guide me towards a better solution.

Thank you for your help.
Duly noted about splitting the dataex text.
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#6

28 Jul 2019, 10:22

If the way I have treated missing values is wrong, please correct me and guide me towards a better solution.

It is one of my axioms that all ways of treating missing values (except finding the actual values--which is usually not possible) are wrong. Some are less wrong than others. I selected the method I did (and which, it turns out, you intended) because it is simple to implement and probably less wrong than many other approaches.

That said, some people might handle this by imputing values for income in the offending observations (preferably a multiple imputation process). If the missing data are missing at random (in the technical sense of the term missing at random) then this might be a better way to proceed, but it is more complicated to implement. Moreover, missingness at random is always an unverifiable assumption, and with a "sensitive" variable like income, it is likely to be false unless your data contains several other variables that are highly predictive of income.

So, there is no other approach to this that I would endorse generically, though in particular circumstances there could be.
Em Arham

Join Date: Jul 2019

Posts: 11
#7

28 Jul 2019, 11:41

Thank you for the insight, I do believe the method you assumed and I have implemented is sufficient to proceed forward.

As for the rank variable for income, according to the paper I quoted, would you be able to help with the coding of this?
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#8

28 Jul 2019, 12:15

The code is there in #4 at the bottom of the code block. Did you try it?
Nick Cox

Join Date: Mar 2014

Posts: 35698
#9

29 Jul 2019, 00:28

See also https://www.stata.com/support/faqs/s...ing-positions/ (which also explains about tied values).
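
One way tied incomes could be handled is sketched below, assuming that tied people should share the average of their positions. The data are made up for illustration, and the variable names (region, income, inc_rank, n_valid) are hypothetical. It relies on -egen-'s rank() and count() functions rather than on _n and _N.

```stata
* Toy data (made up): two people tie at 200 in region 1; one income is missing in region 2
clear
input byte region double income
1 100
1 200
1 200
1 300
2  50
2  .a
2 150
end

* rank() gives tied incomes the average of their positions and leaves missing incomes unranked
egen inc_rank = rank(income), by(region)

* count() counts only the non-missing incomes in each region
egen n_valid = count(income), by(region)

* Same scaling as (i-1)/(n-1), but with tie-adjusted ranks
gen rel_rank = (inc_rank - 1)/(n_valid - 1)
list, sepby(region)
```

With this approach the two tied incomes both get a relative rank of .5, whereas a plain (_n-1)/(_N-1) would give them different values depending on an arbitrary sort order.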
Em Arham

Join Date: Jul 2019

Posts: 11
#10

01 Aug 2019, 06:49

Hi,
I really appreciate the help I received from Mr Schechter and Mr Cox,

I had a meeting today with a supervisor (as I am an MSc student writing my dissertation), and I showed my supervisor this particular conversation on the forum.
He pointed out that the data I have is panel data and a better approach would be the following code:

sort year income
bysort year: gen n=_n
bysort year: gen N=_N
gen rank=n/N

He also suggested to regress different cities as reference groups by ;

if region=="London"

Would anyone be able to explain how this will work for my 'rank' or 'reference group' approach and affect my results?
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#11

01 Aug 2019, 16:46

Well, what we have been talking about up to now and what is proposed in #10 are quite different things. Up to now, the discussion has been about calculating a rank separately within each region. That is what I understood you to want when you said "My reference group is region, so for example all in London, all in Wales etc." in #1.

The code shown in #10 will create a single ranking across all regions. Also, the code in #10 does not have the person exclude himself/herself from the ranking with others.
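
The difference can be seen in a small sketch. The data are made up, and the variable names year, region and income follow #10 rather than the actual dataset. The first ranking follows the #10 approach (one ranking per year, scaled as n/N); the second ranks within region and year and uses (i-1)/(n-1).

```stata
* Toy panel slice (made up): one year, two regions, no ties and no missing incomes
clear
input int year byte region double income
2001 1 100
2001 1 300
2001 2 200
2001 2 400
end

* As in #10: a single ranking per year across all regions, scaled as n/N
sort year income
by year: gen rank10 = _n/_N

* Within region and year, excluding the person from both counts: (i-1)/(n-1)
bysort year region (income): gen rank_reg = (_n-1)/(_N-1)

list, sepby(region)
```

Here the person earning 300 is at .75 in the overall ranking but at 1 within region 1. Note also that Stata sorts missing values last, so with missing incomes present the #10 approach would hand them the highest ranks unless they are excluded first.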

So apples and oranges. The question is which of these fruits is appropriate for your research goals. The latter have not been seriously discussed in this thread.

He also suggested to regress different cities as reference groups by ; if region=="London"

The only prior mention of regression in this thread is in #3, "We use Ri to predict life satisfaction in a multiple regression..." While it may be putting too much weight on the literal interpretation of a single word, that sentence seems to indicate that a single regression will be done, not a separate one for each city.

So all I can tell you is that you need to get clarity on what your research question is, and then figure out which of these proposed methodologies will answer it. The source you quote may have been asking and answering a different question from that of your thesis, and perhaps those sources were recommended to you only as being generally suggestive of an approach. The responses you have gotten in this thread have taken that source as precisely what you are aiming for and showing you ways to implement that.
Nick Cox

Join Date: Mar 2014

Posts: 35698
#12

01 Aug 2019, 23:43

The code suggested in #10 is very crude as it ignores ties on income and could even be quite wrong as it won’t handle any missing values correctly. For what it does, a better approach is explained in the link given in #9.

Clyde Schechter’s comments apply otherwise.
Em Arham

Join Date: Jul 2019

Posts: 11
#13

02 Aug 2019, 04:01

Clyde Schechter I essentially want to do something similar to the quoted paper; however, as you mentioned, the paper performs a single regression for region.
I would like to conduct separate ones, thus allowing me to report results for different cities.
These would be results for the hypothesis that life satisfaction from income is dependent not on absolute income but on reference income. Reference income is what I am struggling to code.

Here is the section of the paper that perhaps explains what I would need;

" In each case we computed the relative rank of each individual’s income within the reference group and also the mean income of all individuals within the reference group. We then predicted each individual’s life satisfaction from (a) their relative rank within the reference group, (b) their absolute income (logarithmically transformed), and (c) mean reference group income (logarithmically transformed). "

So I would like to have a regression for each region (e.g. London, Scotland) with (a) their relative rank within region, (b) their absolute income (logarithmically transformed), and (c) mean region income (logarithmically transformed).

From what I understand, the code in #10 creates a single ranking within all regions, and thus by using if region=="London", I can give results for each city separately.

However, I feel this code does not do well to capture the mean Income and furthermore, the mean reference group income (logarithmically transformed) as the paper quoted above does.
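
A minimal sketch of how the three predictors described in the quoted passage might be built, assuming the within-region ranking from #4. The data are made-up numbers included purely to make the syntax runnable, and lifesat, region and income are hypothetical variable names, not those of the actual dataset.

```stata
* Toy data (made up): two regions, four people each, no ties or missing incomes
clear
input byte region double income byte lifesat
1 10000 4
1 20000 5
1 30000 5
1 50000 6
2  8000 3
2 12000 5
2 18000 4
2 40000 6
end

* (a) relative rank within region, as in #4
bysort region (income): gen rel_rank = (_n-1)/(_N-1)

* (b) log of own income; ln() returns missing for incomes of zero
gen ln_income = ln(income)

* (c) log of mean income in the person's region
egen mean_region = mean(income), by(region)
gen ln_mean_region = ln(mean_region)

* One pooled regression using all three predictors
regress lifesat rel_rank ln_income ln_mean_region
```

Two cautions apply if the model is instead run one region at a time. Within a single region and year, mean region income is a constant, so Stata would drop it. And in the example data br_region is a numeric variable with value labels, so a condition such as if region=="London" would fail with a type mismatch; the numeric code would be needed, e.g. if br_region == 17 for Wales.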
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#14

02 Aug 2019, 18:34

From what I understand, the code in #10 creates a single ranking within all regions, and thus by using if region=="London", I can give results for each city separately.

Yes. But it means, for example, that the ranking of a Londoner will be based on the Londoner's comparison of his/her income to not just other Londoners but to those in other regions as well. Is that what you want? I have no expertise in this content area, but just from a "common sense" perspective it seems wrong to me. I would think that to the extent relative income influences life satisfaction it would be income relative to those around you in a somewhat immediate sense. But maybe in our highly connected world that is not true. Anyway, you need to be certain as to whether that is what you want.

I should also point out that putting in relative ranking and absolute income together is problematic: one is a monotone transform of the other (either overall, or within region, depending on which relative ranking approach is used). Because that transform is not exactly linear, neither variable would be omitted for perfect collinearity, and log-transforming the absolute income, as the authors chose to do, changes the shape of the relationship further. But that doesn't get around the fact that you are, in effect, entering the same variable twice. The two variables should have very high correlation (somewhat less so if the ranking is done separately for each region, which is another reason why doing each region's ranking separately strikes me as better), and that in turn means that their effects will be measured with low precision. So here, from a purely statistical perspective, the logic of the approach evades me. So, again, I ask, are you sure this is what you want to do? Whatever your decisions on these design issues, I'm happy to help you implement them in code. But I just want to be sure you understand what you are asking for before you get it.
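
How closely rank and log income move together in a given sample can be checked directly before fitting any model. The sketch below uses made-up incomes and hypothetical variable names; it is only meant to show the commands, not to suggest what the correlation will be in the real data.

```stata
* Toy data (made up): two regions, four people each
clear
input byte region double income
1 10000
1 20000
1 30000
1 50000
2  8000
2 12000
2 18000
2 40000
end

* Within-region relative rank and log income
bysort region (income): gen rel_rank = (_n-1)/(_N-1)
gen ln_income = ln(income)

* Pearson correlation between the two candidate regressors, pooled over regions
correlate rel_rank ln_income

* Within one region the rank is a monotone function of income,
* so the Spearman (rank) correlation there is 1
spearman rel_rank income if region == 1
```

After fitting the regression, -estat vif- is another way to see how much the overlap inflates the standard errors.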
Em Arham

Join Date: Jul 2019

Posts: 11
#15

03 Aug 2019, 04:29

Thank you, the way you explained this conundrum has made me think of what exactly I would need for my results.
Apologies if it has taken too much of your time working about this.

I believe my thesis aims to evaluate life satisfaction with relative income (relative income of those around you). So as you said, the income an individual compares to would most likely be with those around him. So a Londoner is likely to compare his annual income with a fellow Londoner when determining his own income as low or high. So essentially the code would create a variable that takes the income of an individual (from London) into account as a ratio of the mean income of London as a region. Then, the results interpreted at each city level.
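
The ratio described here could be sketched as follows. The data are made up and the variable names (region, income, mean_region, rel_income) are hypothetical. Note that this is a different measure from the relative rank discussed earlier in the thread, and that the regional mean includes the person's own income.

```stata
* Toy data (made up): one income in region 2 is missing
clear
input byte region double income
1 10000
1 30000
2  8000
2    .a
2 12000
end

* mean() ignores missing incomes, so the .a row does not affect the regional mean
egen mean_region = mean(income), by(region)

* Own income as a ratio of mean income in the same region (missing if income is missing)
gen rel_income = income/mean_region
list, sepby(region)
```

A value above 1 means the person earns more than the regional mean; for panel data, by(region year) would give a separate mean for each year.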