Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Looping running sum and total for changing range within grouped variables.

    Hi everyone,

    I'm relatively new to Stata so please excuse if the explanation of my problem sounds a bit all over the place. I will try my best:

    Here is what my data looks like:
    Code:
       
    rank percent weight
    1 1 2000
    2 1 3000
    3 2 7000
    4 2 1200
    5 3 3000
    6 3 4500
    end
    What I want to do is build a loop that for the first group, percent = 1
    - generates a variable as the cumulative sum of weight of each observation
    - generates a variable as the total sum of weight in that group

    And for the second group, percent = 2
    - generates a variable as the cumulative sum of each observation from percent = 1 and percent = 2 by the cumulative total of the weights from Perc 1 and 2,
    - generates a variable as the total sum of weight from percent = 1 and percent = 2 and so on

    In my data the values for percent range from 1-100. However, the loop should only address the first 10 groups, i.e. the first 10 percentile thresholds.

    Here is what I tried in Stata, but the results doesn't seem right:
    Code:
    local k = 1
    foreach var of varlist percent_* {
        foreach x in percent_`k' {
            bysort percent_`k' (rank`k'): gen ni_`k' = sum(weight) if `x' == `x'[_n+1]
            bysort percent_`k': egen nall_`k' = total(weight) if `x' == `x'[_n+1]
        }
        local k = `k' + 1
    }
    end
    To give an example of what I'm looking for, eventually it should look like this:
    Code:
     
    percent weight Ni 1 Nall 1 Ni 2 Nall 2 Ni 3 Nall 3
    1 2000 2000 5000 2000 13200 2000 20700
    1 3000 5000 5000 5000 13200 5000 20700
    2 7000 12000 13200 12000 20700
    2 1200 13200 13200 13200 20700
    3 3000 16200 20700
    3 4500 20700 20700
    end
    Any help is much appreciated.

    Thank you
    Moritz

    Last edited by Moritz Huth; 08 Mar 2023, 13:27. Reason: Spelling

  • #2
    The results you show are clearer than your explanation in words, so thank you for that. But they are also a very un-Stata-ish way of organizing information. I wonder what you are planning to do with these Ni and Nall variables. Perhaps if your ultimate purpose were clear, a better approach could be suggested. I say that because the arrangement with N1, Nall1, N2, Nall2, etc. is highly redundant. There is very little in the way of further data management or calculations that could be done with this in Stata, as far as I can see.

    So, the following code simply generates two variables: running_sum, and group_total. All of the numbers you wanted calculated are there, although not arranged in the way you showed. It is not that hard to create the other variables from these, but as I can't see how it will be helpful to do so, I have not written that part of the code.
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input int(rank percent) float weight
    1 1 2000
    2 1 3000
    3 2 7000
    4 2 1200
    5 3 3000
    6 3 4500
    end
    
    gen running_sum = sum(weight)
    by percent (rank), sort: gen group_total = running_sum[_N]
    In the future, when showing data examples, please use the -dataex- command to do so, as I have done here. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

    Comment


    • #3
      Thank you for your response, Clyde. I will keep your suggestions towards -dataex- in mind.

      Here is some context:
      I have an imputed panel dataset on wealth distribution on a household level for which I want to fit a Pareto tail to. To do so, in a first step I need to estimate the distribution function based on my data using OLS, to identify the shape parameter alpha. The regression equation would look like this: ln(CCDF) = ß0 - ß1 ln(wealth)

      The CCDF is calculated by the running sum of weights (ni) and the total sum of weights (nall) = ni/nall. So in my original post I wanted to calculate the CCDF for the first 10 percentiles of the distribution using a loop. But not for each percentile individually but by accumulating them for every new variable produced by the loop.

      So like in my example above the ni, nall and CCDF should be:
      Percentile 1: {2000, 3000} -> ni1/nall1 -> {2000/5000, 5000/5000} = CCDF1 = {0.4, 1}
      Percentile 2 : {2000, 3000, 7000, 1200} -> ni2/nall2 -> {2000/13200, 5000/13200, 12000/13200, 13200/13200} = CCDF2 = {0.15, 0.38, 0.9, 1}, and so on until the 10th percentile.

      In an earlier attempt I tried the following, but this loop only calculates the running sum and total sum for each percentile individually:

      Code:
      local i = 1
      foreach var of varlist *_net_wealth {
      sort `var'
      egen ranknw_`i' = rank(-`var'), unique
      xtile percent_`i' = -`var' [weight = weight], nquantiles(100)
      gen log_nw_`i' = log(`var')
      local i = `i' + 1
      }
      
      local j = 1
      foreach var of varlist percent_* {
      foreach x in percent_`j' {
      bysort percent_`j' (ranknw_`j'): gen ni_`j' = sum(weight) if `x' <= 10
      bysort percent_`j' (ranknw_`j'): egen nall_`j' = total(weight) if `x' <= 10
      gen ccdf_`j' = ni_`j'/nall_`j'
      gen log_ccdf_`j' = log(ccdf_`j')
      }
      local j = `j' + 1
      }
      
      forvalues p = 1/5 {
      bysort pct_percent_`p': eststo: quietly reg log_ccdf_`p' log_nw_`p' if pct_percent_`p' <= 10
      }

      I hope this helps to better understand my problem and doesn't cause even more confusion haha

      Thank you
      Moritz
      Last edited by Moritz Huth; 09 Mar 2023, 05:14.

      Comment


      • #4
        For creating ni and nall for the cumulated percentiles, here ist what the code would look like without the loop:

        Code:
        sort rank
        gen ni_1 = sum(weight) if percent == 1
        egen nall_1 = total(weight) if percent == 1
        
        gen ni_2 = sum(weight) if percent <= 2
        egen nall_2 = total(weight) if percent <= 2
        
        ...
        
        gen ni_10 = sum(weight) if percent <= 10
        egen nall_10 = total(weight) if percent <= 10
        But unfortunately I'm not able to create the loop. Any ideas?

        Thank you
        Moritz

        Comment


        • #5
          OK, I get what you are trying to do.

          Code:
          * Example generated by -dataex-. For more info, type help dataex
          clear
          input int(rank percent) float weight
          1 1 2000
          2 1 3000
          3 2 7000
          4 2 1200
          5 3 3000
          6 3 4500
          end
          
          gen running_sum = sum(weight)
          by percent (rank), sort: gen group_total = running_sum[_N]
          
          forvalues i = 1/3 {
              gen ni_`i' = running_sum if percent <= `i'
              egen n_all_`i' = min(cond(percent == `i', group_total, .))
          }

          Comment


          • #6
            This worked out perfectly, thank you!

            Comment

            Working...
            X