  • How to create a variable that is dependent on values of other observations in the same group.

    Hi Statalists,

    I am trying to write code in Stata that solves the following problem:

    observations   id_1   id_2   year   rate   result     how I reached the result in Excel
     1              1     A      2009   0.25   -0.0274    (E2-E6)/(1-E6)
     2              1     B      2017   0.1    -0.35455   SUM((E3-E9)/(1-E9);(E3-E15)/(1-E15);(E3-E17)/(1-E17))
     3              2     A      2012   0.15
     4              3     C      2013   0.3
     5              4     A      2009   0.27    0.026667  (E6-E2)/(1-E2)
     6              4     D      2014   0.12
     7              5     D      2009   0.22    0.178947  (E8-E13)/(1-E13)
     8              6     B      2017   0.23    0.274444  SUM((E9-E3)/(1-E3);(E9-E15)/(1-E15);(E9-E17)/(1-E17))
     9              7     A      2015   0.21
    10              8     C      2009   0.09
    11              9     A      2011   0.07
    12             10     D      2009   0.05   -0.21795   (E13-E8)/(1-E8)
    13             10     C      2016   0
    14             11     B      2017   0      -0.83838   SUM((E15-E3)/(1-E3);(E15-E9)/(1-E9);(E15-E17)/(1-E17))
    15             11     C      2010   0.02
    16             12     B      2017   0.3     0.613131  SUM((E17-E3)/(1-E3);(E17-E9)/(1-E9);(E17-E15)/(1-E15))


    This is an example of my dataset in Excel. id_1 is the identifier for a firm, id_2 is the corresponding owner of id_1 in that specific year, and rate is the corresponding rate by firm (id_1) and year. Each observation is uniquely identified by id_1 and year, but the owner (id_2) can be the same across observations.

    An example of what I want to calculate: id_2 and year are the same for observations 1 and 5 (id_2 = A and year = 2009). For observation 1 I want to calculate [0.25 (its own rate) - 0.27 (observation 5's rate)] / [1 - 0.27 (observation 5's rate)]. You can see the calculations in the column "how I reached the result in Excel", where E is the column holding the rate. For observation 5 I want to calculate [0.27 (its own rate) - 0.25 (observation 1's rate)] / [1 - 0.25 (observation 1's rate)]. This is the case of exactly two firms (id_1) under a specific owner (id_2) in a specific year. If there is only one id_1 under a specific owner in a specific year, the result should be missing.

    If there are many id_1 under a specific id_2 in a specific year, take observations 2, 8, 14 and 16, where id_2 = B and year = 2017. For observation 2 I want to calculate [(0.1-0.23)/(1-0.23)] + [(0.1-0)/(1-0)] + [(0.1-0.3)/(1-0.3)], where 0.1 is the rate for observation 2, 0.23 the rate for observation 8, 0 the rate for observation 14, and 0.3 the rate for observation 16. For observation 8 it is [(0.23-0.1)/(1-0.1)] + [(0.23-0)/(1-0)] + [(0.23-0.3)/(1-0.3)]; for observation 14 it is [(0-0.1)/(1-0.1)] + [(0-0.23)/(1-0.23)] + [(0-0.3)/(1-0.3)]; and so on. Of course this is a sample dataset, and I could have even 100 id_1 under a specific id_2 in a specific year.

    In general, for each observation i the expression under a specific id_2 in a specific year is Σ (rate_i - rate_j)/(1 - rate_j), summed over all other observations j sharing that id_2 and year, where rate_i is the observation's own rate and rate_j is the rate of each related observation.

    Many thanks for your help and this great forum.

  • #2
    Should be doable with a joinby followed by collapse:

    Code:
    clear
    input observations     id_1     str5 id_2     year     rate     result
    1     1     A     2009     0.25     -0.0274
    2     1     B     2017     0.1     -0.35455
    3     2     A     2012     0.15         .
    4     3     C     2013     0.3         .
    5     4     A     2009     0.27     0.026667
    6     4     D     2014     0.12         .
    7     5     D     2009     0.22     0.178947
    8     6     B     2017     0.23     0.274444
    9     7     A     2015     0.21         .
    10     8     C     2009     0.09         .
    11     9     A     2011     0.07         .
    12     10     D     2009     0.05     -0.21795
    13     10     C     2016     0         .
    14     11     B     2017     0     -0.83838
    15     11     C     2010     0.02         .
    16     12     B     2017     0.3     0.613131
    end
    
    * the posted result column is recomputed below
    drop result
    
    * save a renamed copy of the data to pair against itself
    preserve
    rename id_1 did_1
    rename rate drate
    save temp01, replace
    restore
    
    * form all pairs of observations sharing id_2 and year, then drop self-pairs
    joinby id_2 year using temp01
    drop if did_1 == id_1
    
    * compute each pairwise term and total it within id_1-id_2-year
    gen ind_rate = (rate-drate)/(1-drate)
    collapse (sum) ind_rate, by(id_1 id_2 year)
    Results are listed below. This can then be merged back to the original data:
    Code:
         +--------------------------------+
         | id_1   id_2   year    ind_rate |
         |--------------------------------|
      1. |    1      A   2009   -.0273973 |
      2. |    1      B   2017   -.3545455 |
      3. |    4      A   2009    .0266667 |
      4. |    5      D   2009    .1789474 |
      5. |    6      B   2017    .2744444 |
      6. |   10      D   2009   -.2179487 |
      7. |   11      B   2017   -.8383839 |
      8. |   12      B   2017    .6131314 |
         +--------------------------------+
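    A minimal sketch of that merge-back, assuming the original data were saved beforehand under the hypothetical name original.dta (both file names here are illustrative):
    Code:
    * results01 and original are illustrative file names
    save results01, replace
    use original, clear
    merge 1:1 id_1 id_2 year using results01, keep(master match) nogenerate
    * unmatched observations (a single id_1 under an id_2-year) are left with
    * ind_rate == missing, as required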
    On a side note, please consider two things: 1) use the -dataex- command to post data rather than a tabulation; 2) if possible, break the explanation into shorter paragraphs; the single long paragraph was quite difficult to read.



    • #3
      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input byte(observations id_1) str2 id_2 int year float rate
       1  1 "A " 2009 .25
       2  1 "B " 2017  .1
       3  2 "A " 2012 .15
       4  3 "C " 2013  .3
       5  4 "A " 2009 .27
       6  4 "D " 2014 .12
       7  5 "D " 2009 .22
       8  6 "B " 2017 .23
       9  7 "A " 2015 .21
      10  8 "C " 2009 .09
      11  9 "A " 2011 .07
      12 10 "D " 2009 .05
      13 10 "C " 2016   0
      14 11 "B " 2017   0
      15 11 "C " 2010 .02
      16 12 "B " 2017  .3
      end
      
      * save a renamed copy of the data to pair against itself
      preserve
      rename (id_1 observations rate) alt_=
      tempfile copy
      save `copy'
      restore
      
      * form all pairs of observations sharing id_2 and year (self-pairs included)
      joinby id_2 year using `copy'
      
      * pairwise term; it is 0 for the self-pair, so totaling it is harmless
      gen step1 = (rate-alt_rate)/(1-alt_rate)
      by id_1 id_2 year (observations), sort: egen wanted ///
          = total(step1)
      * remove each row's own term (0 on the self-pair rows kept below)
      replace wanted = wanted - step1
      * keep one row per original observation (its self-pair)
      keep if observations == alt_observations
      * a singleton group totals to 0 (only its self-pair), so set it to missing
      replace wanted = . if wanted == 0
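      As a quick sanity check, the result can be compared against the hand-computed values in #1 (a sketch; the tolerance is arbitrary):
      Code:
      * spot-check two hand-computed values and the singleton groups
      assert reldif(wanted, -.0273973) < 1e-5 if observations == 1
      assert reldif(wanted, .6131314) < 1e-5 if observations == 16
      assert missing(wanted) if inlist(observations, 3, 4, 6, 9, 10, 11, 13, 15)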
      Added: Crossed with #2.
      Last edited by Clyde Schechter; 06 Apr 2023, 09:31.



      • #4
        Many thanks for your answer, Ken. I will follow your suggestions in my next post.



        • #5
          Many thanks to you both for your answers. joinby does the job, but it creates very large datasets (a billion or two observations), since a specific owner (id_2) in a specific year in my sample can have 200,000 related affiliates. I then have to work around memory errors in the collapse command, for which I use the fast option, and I still have to wait a long time for joinby to run. Do you think there is a more efficient way to solve this problem? Many thanks in advance.
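          One avenue that might avoid the join entirely is purely algebraic (an untested sketch; it assumes every rate is strictly below 1, as in the sample, so 1 - rate is never 0). Because (rate_i - rate_j)/(1 - rate_j) = (rate_i - 1)/(1 - rate_j) + 1, the sum over a group reduces to a function of one group total and the group size, computable in a single pass with no joinby:
          Code:
          * untested sketch: O(N) per id_2-year group instead of O(N^2) pairs
          * wanted_i = (rate_i - 1)*(S - 1/(1 - rate_i)) + (_N - 1),
          * where S is the id_2-year group total of 1/(1 - rate)
          bysort id_2 year: egen double S = total(1/(1 - rate))
          by id_2 year: gen double wanted = cond(_N > 1, ///
              (rate - 1)*(S - 1/(1 - rate)) + (_N - 1), .)
          drop S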



          • #6
            I'm not sure if this will be faster or not. Try replacing
            Code:
            joinby id_2 year using `copy'
            
            // WITH
            
            rangejoin using `copy', by(id_2 year) interval(rate . .)
            -rangejoin- is written by Robert Picard and is available from SSC. To use it, you must also install -rangestat-, by Robert Picard, Nick Cox, and Roberto Ferrer, also available from SSC.
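            For reference, the installation is one line each (both packages are on SSC, as noted above):
            Code:
            ssc install rangestat
            ssc install rangejoin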

            Ordinarily I would try timing this on my setup to see if it results in time saving or not. But at the moment (and for the rest of the day) I have a large, computation intensive simulation running in the background, which would make my timing results, at best, difficult to interpret. (And it would also really slow down my simulation.) So I'll leave it to you to experiment with this yourself.

            Added: Correction to the syntax for -rangejoin- :
            Code:
            rangejoin rate . . using `copy', by(id_2 year)
            Another thought: adding the -fast- option to -collapse- doesn't really improve its performance that much. You can obtain a much bigger speedup by instead using -gcollapse-. -gcollapse- is part of Mauricio Caceres Bravo's -gtools- package (which also contains the very useful -greshape- command, a huge speedup over StataCorp's -reshape-). You can get -gtools- from SSC.
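            For concreteness, a minimal sketch of that substitution in the #2 code, assuming -gtools- is installed:
            Code:
            * gcollapse takes the same syntax as collapse here
            gcollapse (sum) ind_rate, by(id_1 id_2 year)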
            Last edited by Clyde Schechter; 24 Apr 2023, 12:15.

