Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Generate variables with ratios of subgroup values

    Hello! I got stuck solving a problem which must be very simple to address, apologies if I am overlooking something very basic!

    I am trying to generate variables that present the ratio of subgroup means for different years.

    The dataset is structured as follows: I have observations for several years, and five income groups - income quintiles (incq) - for each year. For each income group, I have created the mean annual expenditure (meanex) for different items, e.g. electricity, this is the same value for each observation in that income group already. Now I'd like to generate a new variable that contains the ratio of spending which divides mean spending of the first income quintile by the mean spending of the fifth income quintile per year (meanex if incq==1/meanex if incq==5). My ultimate aim is to be able to plot that ratio by year to see how inequality in expenditure changes over time.

    Here is a simplified example of my dataset:

    Code:
    clear
    
    input id year incq meanex
    1 1 1 2.4
    2 1 1 2.4
    3 1 2 3.1
    4 1 2 3.1
    5 1 3 4.2
    6 1 3 4.2
    7 1 4 4.8
    8 1 4 4.8
    9 1 5 5.1
    10 1 5 5.1
    11 2 1 1.1
    12 2 1 1.1
    13 2 2 1.7
    14 2 2 1.7
    15 2 3 2.1
    16 2 3 2.1
    17 2 4 2.5
    18 2 4 2.5
    19 2 5 3.6
    20 2 5 3.6
    21 3 1 1.6
    22 3 1 1.6
    23 3 2 2.3
    24 3 2 2.3
    25 3 3 2.8
    26 3 3 2.8
    27 3 4 3.7
    28 3 4 3.7
    29 3 5 6.8
    30 3 5 6.8
    end
    Many thanks in advance for your help!

  • #2
    Code:
    by year (incq), sort: assert incq[1] == 1 & incq[_N] == 5
    by year (incq): gen wanted = meanex[1]/meanex[_N]
    Note: The first command verifies that you really have quantiles 1 through 5 in each year. You may have tried to create 5 quantiles in each year, but sometimes the data are clumped in such a way that you can't actually get them. So this code checks that the data really are what you think they are. Assuming it passes the first assert, the calculation is in the next command.

    Comment


    • #3
      Hi Clyde, many thanks, that's very useful, the assert request works for my dataset, and it has done the calculation correctly for some years. However, I have a couple of questions for which your help would be greatly appreciated.

      1) In my actual dataset, there are some missings in the meanex variable because I created it for a specific region, excluding other observations. If the first observation in a year is missing (indicated by meanex[1] in the second command), the whole year then is missing. I guess I could use preserve and restore, dropping the missing variables, and then merge again, but is there an easier way round this problem?

      2) If I understand correctly, meanex[1] is the first observation in each year and [_N] is the total number of income groups, in this case five, correct? If I wanted to reverse the calculation, i.e. divide the average of the 5th income decile by the 1st, or use other income deciles for this calculation, let's say the 2nd and the 4th, how would I write this? I have tried to find it in the functions and expressions manual but couldn't find it - if you could point me to documentation about this that would be great, thank you!
      Sam

      Comment


      • #4
        In the code in #2, _N does not refer to the number of groups, it refers to the number of observations within a group.

        I did not think to test for missing data (which also breaks the code, as you have noticed) because there wasn't any in your example and based on your description of the data I didn't anticipate that problem would arise. Moral of the story: the presence or absence of missing values in a variable almost always affects the code required. When posting example data, if any of the variables needed for your calculations have missing values, post an example that contains some of those. Since working around missing values often involves extra code, you may end up with simplified code that doesn't work in your real data if you don't do that.

        The code in #2 was specifically exploiting the fact that you wanted the ratio for quintiles 1 and 5. That code cannot, as far as I can see, be adapted to use any of the other quintiles because it relied on the fact that the 1st quintile sorts at the top and the last quintile sorts at the bottom. A different approach is required. That different approach will also resolve the missing values problem.

        Code:
        forvalues i = 1/5 {
            by year, sort: egen q`i'_meanex = max(cond(incq == `i', meanex, .))
        }
        gen ratio_5_1 = q5_meanex/q1_meanex
        gen ratio_4_2 = q4_meanex/q2_meanex

        Comment


        • #5
          Thanks so much for your reply Clyde, and apologies for the slow response! This is working now! Apologies that I forgot about the missings in the first step, I only got reminded of that myself when I tried the first suggestion.

          May I ask something for clarification so I can learn for the future as I haven't used the "cond" function before? In the help file it says the cond function is structured like this: cond(x,a,b[,c]) where a happens if x is true and nonmissing, b if x is false, and c if x is missing. Is it right that in your code x is "incq == `i'", a is "max(meanex)" and b is ., is that correct? Which would mean the q`i' variable would be set to "missing" if the quintile is not "i" - but this is not what happens, they all get filled in with the value for the specified quintile - which is good because otherwise the ratio could not be calculated in the second step. But perhaps I am reading the code in the wrong way?

          Many thanks!

          Comment


          • #6
            You have understand the -cond()- function correctly, but you have misunderstood the -max()- function.

            Within each year's observations, cond(incq == `i', meanex, .)) will evaluate to either be meanex, for those observations in the i'th quintile, and missing value for the others. But now we take the maximum value of those results within the year. Now, your original variable meanex, is constant within any year's quintile by the way you created it. So within a given year we will find two kinds of observations: those with in the i'th quintile of income, where the value of the -cond()- expression is the i'th quintiles mean expenditure, and those in the other quintiles, where the value of the -cond()- expression is missing. The -max()- function returns the maximum of all the non-missing values. In this case, the non-missing values are all the same. So the end result of -max(cond(…))- is the (unique) value of meanex in the i'th quintile of that year, and this is then stored in q`i'_meanex. Rinse and repeat for i = 1 through 5.

            Comment


            • #7
              Great thank you! Yes, I did realise that the max function returns a constant, I just wondered what causes it to override the request in the command
              HTML Code:
                max(cond(incq == `i', meanex, .))
              to set values to missing ( ".") if x is not true, i.e. for missing values where cases belong to another income quintile, are you saying the "max" function overrides this automatically?

              Comment


              • #8
                Sorry, I meant "non-missing values" for cases that belong to another income quintile (even though that's interesting, too, it also fills in in the max value for missing cases within the ith income quintile that belong to another region than previously specified when I created meanex, but that makes sense because x is still true for those).

                Comment


                • #9
                  See also https://www.stata-journal.com/articl...article=dm0055 Sections 9 and 10 for a fairly systematic exposition of :@Clyde Schechter's approach -- and alternatives -- in context.

                  The point to #7 is that max() ignores missings to the extent possible. Just like say summarize given 1 2 3 42 and missing, max() won't return missing; it will return 42. So, the code is saying "use the values for one bin (by ignoring the others)".

                  Comment


                  • #10
                    Thank you Nick, that's very helpful!

                    Comment

                    Working...
                    X