
  • Rank Variable for Income, then further for region

    Hi,
    I am fairly new to this forum so please excuse me if I do not follow some guidelines I am not aware of.

    My question is for a dataset that contains variables such as Reported Income, Region etc.

    I want to create a rank variable between 0 and 1. Essentially I would like to rank all the incomes into a reference group (lowest to highest so that 1 will be the highest income) and divide that number by the total number in the reference group.
    My reference group is region, so for example all in London, all in Wales etc.

  • #2
    I want to create a rank variable between 0 and 1. Essentially I would like to rank all the incomes into a reference group (lowest to highest so that 1 will be the highest income) and divide that number by the total number in the reference group.
    I'm completely confused by this. A rank is, by definition, a positive integer, so it cannot be between 0 and 1. What do you mean by a reference group? Do you mean that you want to compare one region to all the others? If so, how will you pick which one?

    I suggest that you post back. When you do, use the -dataex- command to show example data.* Also, show the results you would want to see for this "rank variable" in your example, and, explain how you got it. Then perhaps somebody will be able to show you how to code it.

    I am fairly new to this forum so please excuse me if I do not follow some guidelines I am not aware of.
    The only guidelines here are those in the FAQ, which every member is reminded to read before they do their first post. So everybody, even the newest members, should be familiar with them before posting. It's never too late. Read the FAQ now: I think you will find them very helpful.

    * If you are running version 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.






    • #3
      Hi Clyde,

      Apologies for this. Allow me to try and explain my issue in more detail;

      I am trying to understand the way reference groups of income impact life satisfaction. One of these reference groups is income of individuals within the same area/region as you.

      My dataset contains, first, a variable for annual income. I have used dataex to provide example data:

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input double br_fiyr
            32175.484375
         8668.6259765625
          59120.83203125
      1449.7247314453125
         31282.189453125
          4375.841796875
       3983.781494140625
         9922.8271484375
           7580.80859375
           19815.2421875
          39318.24609375
       1425.958740234375
                       0
                       .
           4614.52734375
         12654.552734375
                   12852
         3995.3935546875
                   19968
        2943.89892578125
        13048.4619140625
         25731.712890625
       2611.257568359375
         20837.154296875
             4417.453125
        5704.38818359375
             7631.109375
        10668.1630859375
         30694.435546875
                       0
         24654.998046875
         19128.951171875
                       0
               51210.625
                     230
          17652.31640625
         9881.4501953125
          9485.244140625
                      .a
         21238.130859375
          11643.91796875
         12502.705078125
            5211.5546875
         10042.333984375
          6005.908203125
         19565.193359375
         9237.5322265625
         11515.060546875
               8374.5625
           24187.6796875
          6140.146484375
         15037.044921875
         2866.7470703125
         8289.6845703125
                   35720
                   77840
          23000.07421875
         8539.4619140625
          5346.060546875
        15006.9287109375
        5814.23095703125
                   14575
        4916.06884765625
         4197.7373046875
         18756.724609375
       2764.363525390625
         21693.060546875
            12682.828125
             20595.84375
         7103.3291015625
                      .a
          13369.19140625
         2372.4501953125
          34736.72265625
         12052.337890625
          8727.677734375
         5652.3017578125
          6143.029296875
         2971.1884765625
         21808.083984375
         30255.162109375
           16570.8203125
          8113.212890625
              18772.4375
          19199.98046875
          26519.19140625
        4150.27099609375
         5607.2080078125
        2149.67626953125
           21614.2265625
          16984.31640625
                      .a
                       0
             26042.09375
          20101.02734375
          105341.5703125
              22256.5625
                     100
        11415.8896484375
              9299.15625
      end
      label values br_fiyr br_fiyr
      ------------------ copy up to and including the previous line ------------------

      Listed 100 out of 14419 observations
      Use the count() option to list more

      Secondly, I have a variable for region; example data are again provided via the dataex command:

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input byte br_region
       4
       4
       3
       3
       2
       2
       2
       2
       2
       3
       3
       3
       3
       3
       3
       3
       3
       3
       3
       3
       3
       3
       3
       3
       3
       3
       3
       3
       3
       3
       3
       4
       4
       4
       4
       4
       4
       4
       4
       4
       4
       4
       4
       5
       5
       3
       5
       5
       5
       7
       7
       7
       7
       7
       3
       3
       7
       8
       8
       8
       8
       8
       8
       8
       8
       8
       8
       9
       9
       9
       8
      10
      10
      10
      10
      11
       6
       6
      12
      14
      15
      15
      15
      16
      17
      17
      17
      17
      17
      17
      17
      18
       2
       7
       8
      17
      17
      17
      17
      17
      end
      label values br_region br_region
      label def br_region 2 "Outer London", modify
      label def br_region 3 "R. of South East", modify
      label def br_region 4 "South West", modify
      label def br_region 5 "East Anglia", modify
      label def br_region 6 "East Midlands", modify
      label def br_region 7 "West Midlands Conurbation", modify
      label def br_region 8 "R. of West Midlands", modify
      label def br_region 9 "Greater Manchester", modify
      label def br_region 10 "Merseyside", modify
      label def br_region 11 "R. of North West", modify
      label def br_region 12 "South Yorkshire", modify
      label def br_region 14 "R. of Yorks & Humberside", modify
      label def br_region 15 "Tyne & Wear", modify
      label def br_region 16 "R. of North", modify
      label def br_region 17 "Wales", modify
      label def br_region 18 "Scotland", modify
      ------------------ copy up to and including the previous line ------------------

      Listed 100 out of 14419 observations
      Use the count() option to list more

      If you would like to view these variables as explained by the dataset website;
      Here are the URL links;
      Annual Income Variable
      Region Variable


      My dependent variable is life satisfaction.

      I have come across a paper that has done similar research, however, the authors highlight how using a rank variable for income results in higher R-squared and generally more accurate results.
      The paper can be found here.


      Essentially, this is what the authors have done, in their own words;

      " We test a simple rank-based model according to which the individual compares themself to a sample of other people in their reference group and assesses whether each sampled individual earns more or less than themselves (Stewart et al., 2006). Those assigned “worse than” (i-1) are compared to the total number within the reference group (n-1). The ratio gives the individual a relative rank (Ri) normalized between 0 and 1:
      (1) R_i = (i − 1) / (n − 1) ...
      We use Ri to predict life satisfaction in a multiple regression...
      Next, we compared the rank and reference income hypotheses. To do this we constructed various reference groups to explore the possibility that people compare their income to others in the same geographical region (of which there were 19 in the BHPS)
      "

      I am trying to code my income variable in the same way as their 'rank' income, so that I can then use the reference group income (regional income) as a comparison tool for life satisfaction.



      • #4
        I think I understand now. The complicating factor in your data is that you have observations where the value of the income variable is missing. The text you cite does not state how this situation was handled. In the code below I assume that when a person's income is missing they are not given a "rank" nor are they counted among the people others compare themselves to when they get their rank. It is as if they are not in the data set (though I do not go so far as to drop those observations.)

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input byte br_region double br_fiyr
         4       32175.484375
         4    8668.6259765625
         3     59120.83203125
         3 1449.7247314453125
         2    31282.189453125
         2     4375.841796875
         2  3983.781494140625
         2    9922.8271484375
         2      7580.80859375
         3      19815.2421875
         3     39318.24609375
         3  1425.958740234375
         3                  0
         3                  .
         3      4614.52734375
         3    12654.552734375
         3              12852
         3    3995.3935546875
         3              19968
         3   2943.89892578125
         3   13048.4619140625
         3    25731.712890625
         3  2611.257568359375
         3    20837.154296875
         3        4417.453125
         3   5704.38818359375
         3        7631.109375
         3   10668.1630859375
         3    30694.435546875
         3                  0
         3    24654.998046875
         4    19128.951171875
         4                  0
         4          51210.625
         4                230
         4     17652.31640625
         4    9881.4501953125
         4     9485.244140625
         4                 .a
         4    21238.130859375
         4     11643.91796875
         4    12502.705078125
         4       5211.5546875
         5    10042.333984375
         5     6005.908203125
         3    19565.193359375
         5    9237.5322265625
         5    11515.060546875
         5          8374.5625
         7      24187.6796875
         7     6140.146484375
         7    15037.044921875
         7    2866.7470703125
         7    8289.6845703125
         3              35720
         3              77840
         7     23000.07421875
         8    8539.4619140625
         8     5346.060546875
         8   15006.9287109375
         8   5814.23095703125
         8              14575
         8   4916.06884765625
         8    4197.7373046875
         8    18756.724609375
         8  2764.363525390625
         8    21693.060546875
         9       12682.828125
         9        20595.84375
         9    7103.3291015625
         8                 .a
        10     13369.19140625
        10    2372.4501953125
        10     34736.72265625
        10    12052.337890625
        11     8727.677734375
         6    5652.3017578125
         6     6143.029296875
        12    2971.1884765625
        14    21808.083984375
        15    30255.162109375
        15      16570.8203125
        15     8113.212890625
        16         18772.4375
        17     19199.98046875
        17     26519.19140625
        17   4150.27099609375
        17    5607.2080078125
        17   2149.67626953125
        17      21614.2265625
        17     16984.31640625
        18                 .a
         2                  0
         7        26042.09375
         8     20101.02734375
        17     105341.5703125
        17         22256.5625
        17                100
        17   11415.8896484375
        17         9299.15625
        end
        label values br_region br_region
        label def br_region 2 "Outer London", modify
        label def br_region 3 "R. of South East", modify
        label def br_region 4 "South West", modify
        label def br_region 5 "East Anglia", modify
        label def br_region 6 "East Midlands", modify
        label def br_region 7 "West Midlands Conurbation", modify
        label def br_region 8 "R. of West Midlands", modify
        label def br_region 9 "Greater Manchester", modify
        label def br_region 10 "Merseyside", modify
        label def br_region 11 "R. of North West", modify
        label def br_region 12 "South Yorkshire", modify
        label def br_region 14 "R. of Yorks & Humberside", modify
        label def br_region 15 "Tyne & Wear", modify
        label def br_region 16 "R. of North", modify
        label def br_region 17 "Wales", modify
        label def br_region 18 "Scotland", modify
        label values br_fiyr br_fiyr
        
        gen missing_income = missing(br_fiyr)
        by br_region missing_income (br_fiyr), sort: gen rel_rank = (_n-1)/(_N-1) ///
            if !missing_income
        By the way, in the future, when posting data examples, don't do a separate -dataex- for each variable. Post one that contains all the relevant variables if possible. (If the number of relevant variables is too large, then splitting it up is OK.)



        • #5
          Yes, there are values in the dataset for the income variable where the individuals did not respond or perhaps were unemployed during that time.
          As you can see, these values have been treated as "missing", denoted by .a or .b (it is my understanding that this is common practice in such cases).
          You are also right that these cases are not meant to be given a rank, nor included in the group of comparison.

          If the way I have treated missing values is wrong, please correct me and guide me towards a better solution.

          Thank you for your help.
          Duly noted about splitting the dataex text.



          • #6
            If the way I have treated missing values is wrong, please correct me and guide me towards a better solution.
            It is one of my axioms that all ways of treating missing values (except finding the actual values--which is usually not possible) are wrong. Some are less wrong than others. I selected the method I did (and which, it turns out, you intended) because it is simple to implement and probably less wrong than many other approaches.

            That said, some people might handle this by imputing values for income in the offending observations (preferably a multiple imputation process). If the missing data are missing at random (in the technical sense of the term missing at random) then this might be a better way to proceed, but it is more complicated to implement. Moreover, missingness at random is always an unverifiable assumption, and with a "sensitive" variable like income, it is likely to be false unless your data contains several other variables that are highly predictive of income.

            So, there is no other approach to this that I would endorse generically, though in particular circumstances there could be.



            • #7
              Thank you for the insight. I do believe the method you assumed, and which I have implemented, is sufficient to proceed.

              As for the rank variable for income, according to the paper I quoted, would you be able to help with the coding of this?



              • #8
                The code is there in #4 at the bottom of the code block. Did you try it?



                • #9
                  See also https://www.stata.com/support/faqs/s...ing-positions/ (which also explains about tied values).
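A minimal sketch of one way to handle tied incomes, assuming the goal is the share of the reference group earning strictly less, i.e. (i − 1)/(n − 1). The toy data and the variable names region, income, n_region, rank_region and rel_rank are illustrative assumptions, not taken from the BHPS extract or from the linked FAQ.

```stata
clear
input byte region double income
1 100
1 200
1 200
1 .
1 400
2 50
2 75
end

* number of non-missing incomes in each region
bysort region: egen n_region = count(income)

* "track" gives tied incomes the same (lowest) rank: 1, 2, 2, 4
* observations with missing income get a missing rank
bysort region: egen rank_region = rank(income), track

* share of the reference group earning strictly less; left missing when
* income is missing or the region has only one usable observation
gen rel_rank = (rank_region - 1)/(n_region - 1) if n_region > 1 & !missing(income)

list, sepby(region)
```

With this choice the two people on 200 both get 1/3, whereas the (_n-1)/(_N-1) code in #4 would give them different values depending on sort order. Whether tied people should share the lowest rank or an average rank (the default of rank() when track is omitted) is a design decision that the quoted paper does not settle.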



                  • #10
                    Hi,
                    I really appreciate the help I received from Mr Schechter and Mr Cox,

                    I had a meeting today with a supervisor (as I am a MSc student writing my dissertation), and I showed my supervisor this particular conversation on the forum.
                    He pointed out that the data I have is panel data and a better approach would be the following code ;

                    sort year income
                    bysort year: gen n=_n
                    bysort year: gen N=_N
                    gen rank=n/N

                    He also suggested to regress different cities as reference groups by ;

                    if region=="London"


                    Would anyone be able to explain how this will work for my 'rank' or 'reference group' approach and affect my results?



                    • #11
                      Well, what we have been talking about up to now and what is proposed in #10 are quite different things. Up to now, the discussion has been about calculating a rank separately within each region. That is what I understood you to want when you said "My reference group is region, so for example all in London, all in Wales etc." in #1.

                      The code shown in #10 will create a single ranking across all regions. Also, the code in #10 does not have the person exclude himself/herself from the ranking with others.

                      So apples and oranges. The question is which of these fruits is appropriate for your research goals. The latter have not been seriously discussed in this thread.

                      He also suggested to regress different cities as reference groups by ; if region=="London"
                      The only prior mention of regression in this thread is in #3, "We use Ri to predict life satisfaction in a multiple regression..." While it may be putting too much weight on the literal interpretation of a single word, that sentence seems to indicate that a single regression will be done, not a separate one for each city.

                      So all I can tell you is that you need to get clarity on what your research question is, and then figure out which of these proposed methodologies will answer it. The source you quote may have been asking and answering a different question from that of your thesis, and perhaps those sources were recommended to you only as being generally suggestive of an approach. The responses you have gotten in this thread have taken that source as precisely what you are aiming for and showing you ways to implement that.



                      • #12
                        The code suggested in #10 is very crude as it ignores ties on income and could even be quite wrong as it won’t handle any missing values correctly. For what it does, a better approach is explained in the link given in #9.

                        Clyde Schechter’s comments apply otherwise.
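To make the two problems concrete, here is a small sketch that runs the code from #10 on made-up data. The four observations and the variable names year and income are illustrative assumptions only.

```stata
clear
input int year double income
2000 100
2000 200
2000 200
2000 .
end

* the code from #10, unchanged
sort year income
bysort year: gen n = _n
bysort year: gen N = _N
gen rank = n/N

* missing values sort last, so the person with missing income is counted
* in N and receives the top rank of 1; the two people on 200 receive
* different ranks even though their incomes are identical
list
```

On these data the missing observation ends up with rank 1 and N is 4 rather than 3, which is the sense in which missing values are not handled correctly.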



                        • #13
                          Clyde Schechter I essentially want to do something similar to the quoted paper; however, as you mentioned, the paper performs a single regression for region.
                          I would like to conduct separate ones, thus allowing me to report results for different cities.
                          The results would test the hypothesis that life satisfaction from income depends not on absolute income but on reference income. Reference income is what I am struggling to code.

                          Here is the section of the paper that perhaps explains what I would need;

                          " In each case we computed the relative rank of each individual’s income within the reference group and also the mean income of all individuals within the reference group. We then predicted each individual’s life satisfaction from (a) their relative rank within the reference group, (b) their absolute income (logarithmically transformed), and (c) mean reference group income (logarithmically transformed). "

                          So I would like to have a regression for each region, e.g. London, Scotland etc., with (a) their relative rank within region, (b) their absolute income (logarithmically transformed), and (c) mean region income (logarithmically transformed).

                          From what I understand, the code in #10 creates a single ranking across all regions and thus, by using if region=="London", I can give results for each city separately.

                          However, I feel this code does not capture mean income or, furthermore, the mean reference group income (logarithmically transformed), as the paper quoted above does.
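A minimal sketch of how predictors (b) and (c) from the quoted passage might be built, assuming region is the reference group. The toy data and the names region, income, mean_region_inc, ln_income and ln_mean_region_inc are illustrative assumptions; the paper's own code is not shown in this thread.

```stata
clear
input byte region double income
1 100
1 300
1 0
2 50
2 150
2 .
end

* mean income of the reference group; mean() ignores missing incomes
bysort region: egen mean_region_inc = mean(income)

* ln() returns missing for zero, so people reporting zero income would
* drop out of any model that uses ln_income
gen ln_income = ln(income)
gen ln_mean_region_inc = ln(mean_region_inc)

list, sepby(region)
```

Two points to check before using this. The example data in #3 contain incomes of exactly 0, so the log transform needs a decision about those cases. And if the regression is run separately for one region, a regional mean computed this way is constant within that regression and would be dropped, unless it is computed by region and survey wave so that it varies over time.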





                          • #14
                            From what I understand, the code in #10 creates a single ranking across all regions and thus, by using if region=="London", I can give results for each city separately.
                            Yes. But it means, for example, that the ranking of a Londoner will be based on the Londoner's comparison of his/her income to not just other Londoners but to those in other regions as well. Is that what you want? I have no expertise in this content area, but just from a "common sense" perspective it seems wrong to me. I would think that to the extent relative income influences life satisfaction it would be income relative to those around you in a somewhat immediate sense. But maybe in our highly connected world that is not true. Anyway, you need to be certain as to whether that is what you want.

                            I should also point out that putting in relative ranking and absolute income directly would not be possible: one is a linear transform of the other (either overall, or within region, depending on which relative ranking approach is used) and so one of them would be omitted. This is, I suppose, why the authors chose to log-transform the absolute income--it breaks the collinearity. But that doesn't get around the fact that you are, in effect, entering the same variable twice. The two variables should have very high correlation (somewhat less so if the ranking is done separately for each region--which is another reason why doing each region's ranking separately strikes me as better), and that in turn means that their effects will be measured with low precision. So here, from a purely statistical perspective, the logic of the approach evades me. So, again, I ask, are you sure this is what you want to do? Whatever your decisions on these design issues, I'm happy to help you implement them in code. But I just want to be sure you understand what you are asking for before you get it.



                            • #15
                              Thank you, the way you explained this conundrum has made me think of what exactly I would need for my results.
                              Apologies if it has taken too much of your time working on this.

                              I believe my thesis aims to evaluate how life satisfaction depends on relative income (income relative to that of those around you). So, as you said, an individual would most likely compare his income with that of those around him. A Londoner is likely to compare his annual income with a fellow Londoner's when judging his own income as low or high. So essentially the code would create a variable that expresses the income of an individual (from London) as a ratio of the mean income of London as a region. The results would then be interpreted at the level of each city.
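A minimal sketch of the ratio described here, assuming region is the reference group and that people with missing income are left out of the regional mean. The toy data and the names region, income, mean_region_inc and rel_income are illustrative assumptions.

```stata
clear
input byte region double income
2 20000
2 40000
2 .
17 10000
17 30000
end

* regional mean income, ignoring missing incomes
bysort region: egen mean_region_inc = mean(income)

* income relative to the regional mean: 1 means exactly average,
* and the ratio is missing whenever income is missing
gen rel_income = income/mean_region_inc

list, sepby(region)
```

Note that this ratio is a different quantity from the relative rank in #4: it is not bounded by 1 and it is sensitive to very high incomes in the region. Also, br_region in the example data is a numeric variable with value labels, so a per-region model would be restricted with a numeric condition such as if br_region == 17 (Wales) rather than a string comparison like region=="London".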

