  • How to create a variable that is dependent on values of other observations in the same group.

    Hi Statalists,

    I am trying to write code in Stata that solves the following problem:

    observations   id_1   id_2   year   rate   result     how I reached the result in Excel
     1              1     A      2009   0.25   -0.0274    (E2-E6)/(1-E6)
     2              1     B      2017   0.1    -0.35455   SUM((E3-E9)/(1-E9);(E3-E15)/(1-E15);(E3-E17)/(1-E17))
     3              2     A      2012   0.15
     4              3     C      2013   0.3
     5              4     A      2009   0.27    0.026667  (E6-E2)/(1-E2)
     6              4     D      2014   0.12
     7              5     D      2009   0.22    0.178947  (E8-E13)/(1-E13)
     8              6     B      2017   0.23    0.274444  SUM((E9-E3)/(1-E3);(E9-E15)/(1-E15);(E9-E17)/(1-E17))
     9              7     A      2015   0.21
    10              8     C      2009   0.09
    11              9     A      2011   0.07
    12             10     D      2009   0.05   -0.21795   (E13-E8)/(1-E8)
    13             10     C      2016   0
    14             11     B      2017   0      -0.83838   SUM((E15-E3)/(1-E3);(E15-E9)/(1-E9);(E15-E17)/(1-E17))
    15             11     C      2010   0.02
    16             12     B      2017   0.3     0.613131  SUM((E17-E3)/(1-E3);(E17-E9)/(1-E9);(E17-E15)/(1-E15))


    This is an example of my dataset in Excel. id_1 is the identifier for a firm, id_2 is the corresponding owner of id_1 in that specific year, and rate is the corresponding rate by firm (id_1) and year. Each observation is uniquely identified by id_1 and year, but the owner (id_2) can be the same across observations.

    An example of what I want to calculate: id_2 and year are the same for observations 1 and 5 (id_2 = A and year = 2009). For observation 1 I want to calculate [0.25 (its own rate) - 0.27 (observation 5's rate)] / [1 - 0.27 (observation 5's rate)]. You can see the calculations in the column "how I reached the result in Excel", where E is the column holding the rate. For observation 5 I want to calculate [0.27 (its own rate) - 0.25 (observation 1's rate)] / [1 - 0.25 (observation 1's rate)]. This is the case of exactly two firms (id_1) under a specific owner (id_2) in a specific year. If there is only one id_1 under a specific owner in a specific year, the result should be missing.

    If there are many id_1 under a specific id_2 in a specific year, take observations 2, 8, 14 and 16, where id_2 = B and year = 2017. For observation 2 I want to calculate [(0.1-0.23)/(1-0.23)] + [(0.1-0)/(1-0)] + [(0.1-0.3)/(1-0.3)], where 0.1 is the rate for observation 2, 0.23 the rate for observation 8, 0 the rate for observation 14, and 0.3 the rate for observation 16. For observation 8 it is [(0.23-0.1)/(1-0.1)] + [(0.23-0)/(1-0)] + [(0.23-0.3)/(1-0.3)]; for observation 14 it is [(0-0.1)/(1-0.1)] + [(0-0.23)/(1-0.23)] + [(0-0.3)/(1-0.3)]; and so on. Of course this is a sample dataset, and I could have even 100 id_1 under a specific id_2 in a specific year.

    In general, for each observation i the expression under a specific id_2 in a specific year is Σ (rate_i - rate_j)/(1 - rate_j), summed over all other observations j sharing that id_2 and year, where rate_i is the observation's own rate and rate_j is the rate of each related observation.

    Many thanks for your help and this great forum.

  • #2
    Should be doable with a joinby followed by collapse:

    Code:
    clear
    input observations     id_1     str5 id_2     year     rate     result
    1     1     A     2009     0.25     -0.0274
    2     1     B     2017     0.1     -0.35455
    3     2     A     2012     0.15         .
    4     3     C     2013     0.3         .
    5     4     A     2009     0.27     0.026667
    6     4     D     2014     0.12         .
    7     5     D     2009     0.22     0.178947
    8     6     B     2017     0.23     0.274444
    9     7     A     2015     0.21         .
    10     8     C     2009     0.09         .
    11     9     A     2011     0.07         .
    12     10     D     2009     0.05     -0.21795
    13     10     C     2016     0         .
    14     11     B     2017     0     -0.83838
    15     11     C     2010     0.02         .
    16     12     B     2017     0.3     0.613131
    end
    
    * the posted result column is recomputed below
    drop result
    
    * save a renamed copy of the data to pair against itself
    preserve
    rename id_1 did_1
    rename rate drate
    save temp01, replace
    restore
    
    * form all pairs of observations sharing id_2 and year, then drop self-pairs
    joinby id_2 year using temp01
    drop if did_1 == id_1
    
    * compute each pairwise term and total it within id_1-id_2-year
    gen ind_rate = (rate-drate)/(1-drate)
    collapse (sum) ind_rate, by(id_1 id_2 year)
    Results are listed below. This can then be merged back to the original data:
    Code:
         +--------------------------------+
         | id_1   id_2   year    ind_rate |
         |--------------------------------|
      1. |    1      A   2009   -.0273973 |
      2. |    1      B   2017   -.3545455 |
      3. |    4      A   2009    .0266667 |
      4. |    5      D   2009    .1789474 |
      5. |    6      B   2017    .2744444 |
      6. |   10      D   2009   -.2179487 |
      7. |   11      B   2017   -.8383839 |
      8. |   12      B   2017    .6131314 |
         +--------------------------------+
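    A minimal sketch of that merge-back, assuming the original data were saved beforehand under the hypothetical name original.dta (both file names here are illustrative):
    Code:
    * results01 and original are illustrative file names
    save results01, replace
    use original, clear
    merge 1:1 id_1 id_2 year using results01, keep(master match) nogenerate
    * unmatched observations (a single id_1 under an id_2-year) are left with
    * ind_rate == missing, as required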
    On a side note, please consider two things: 1) use the -dataex- command to post data rather than a tabulation; 2) if possible, break the explanation into shorter paragraphs; the single long paragraph was quite difficult to read.



    • #3
      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input byte(observations id_1) str2 id_2 int year float rate
       1  1 "A " 2009 .25
       2  1 "B " 2017  .1
       3  2 "A " 2012 .15
       4  3 "C " 2013  .3
       5  4 "A " 2009 .27
       6  4 "D " 2014 .12
       7  5 "D " 2009 .22
       8  6 "B " 2017 .23
       9  7 "A " 2015 .21
      10  8 "C " 2009 .09
      11  9 "A " 2011 .07
      12 10 "D " 2009 .05
      13 10 "C " 2016   0
      14 11 "B " 2017   0
      15 11 "C " 2010 .02
      16 12 "B " 2017  .3
      end
      
      * save a renamed copy of the data to pair against itself
      preserve
      rename (id_1 observations rate) alt_=
      tempfile copy
      save `copy'
      restore
      
      * form all pairs of observations sharing id_2 and year (self-pairs included)
      joinby id_2 year using `copy'
      
      * pairwise term; it is 0 for the self-pair, so totaling it is harmless
      gen step1 = (rate-alt_rate)/(1-alt_rate)
      by id_1 id_2 year (observations), sort: egen wanted ///
          = total(step1)
      * remove each row's own term (0 on the self-pair rows kept below)
      replace wanted = wanted - step1
      * keep one row per original observation (its self-pair)
      keep if observations == alt_observations
      * a singleton group totals to 0 (only its self-pair), so set it to missing
      replace wanted = . if wanted == 0
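      As a quick sanity check, the result can be compared against the hand-computed values in #1 (a sketch; the tolerance is arbitrary):
      Code:
      * spot-check two hand-computed values and the singleton groups
      assert reldif(wanted, -.0273973) < 1e-5 if observations == 1
      assert reldif(wanted, .6131314) < 1e-5 if observations == 16
      assert missing(wanted) if inlist(observations, 3, 4, 6, 9, 10, 11, 13, 15)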
      Added: Crossed with #2.
      Last edited by Clyde Schechter; 06 Apr 2023, 09:31.



      • #4
        Many thanks for your answer, Ken. I will follow your suggestions in my next post.



        • #5
          Many thanks to you both for your answers. joinby does the job, but it creates very large datasets (a billion or two observations), since a specific owner (id_2) in a specific year in my sample can have 200,000 related affiliates. I then have to work around memory errors in the collapse command, for which I use the fast option, and I still have to wait a long time for joinby to run. Do you think there is a more efficient way to solve this problem? Many thanks in advance.
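          One avenue that might avoid the join entirely is purely algebraic (an untested sketch; it assumes every rate is strictly below 1, as in the sample, so 1 - rate is never 0). Because (rate_i - rate_j)/(1 - rate_j) = (rate_i - 1)/(1 - rate_j) + 1, the sum over a group reduces to a function of one group total and the group size, computable in a single pass with no joinby:
          Code:
          * untested sketch: O(N) per id_2-year group instead of O(N^2) pairs
          * wanted_i = (rate_i - 1)*(S - 1/(1 - rate_i)) + (_N - 1),
          * where S is the id_2-year group total of 1/(1 - rate)
          bysort id_2 year: egen double S = total(1/(1 - rate))
          by id_2 year: gen double wanted = cond(_N > 1, ///
              (rate - 1)*(S - 1/(1 - rate)) + (_N - 1), .)
          drop S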



          • #6
            I'm not sure if this will be faster or not. Try replacing
            Code:
            joinby id_2 year using `copy'
            
            // WITH
            
            rangejoin using `copy', by(id_2 year) interval(rate . .)
            -rangejoin- is written by Robert Picard and is available from SSC. To use it, you must also install -rangestat-, by Robert Picard, Nick Cox, and Roberto Ferrer, also available from SSC.
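            For reference, the installation is one line each (both packages are on SSC, as noted above):
            Code:
            ssc install rangestat
            ssc install rangejoin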

            Ordinarily I would try timing this on my setup to see if it results in time saving or not. But at the moment (and for the rest of the day) I have a large, computation intensive simulation running in the background, which would make my timing results, at best, difficult to interpret. (And it would also really slow down my simulation.) So I'll leave it to you to experiment with this yourself.

            Added: Correction to the syntax for -rangejoin- :
            Code:
            rangejoin rate . . using `copy', by(id_2 year)
            Another thought: adding the -fast- option to -collapse- doesn't really improve its performance that much. You can obtain a much bigger speedup by instead using -gcollapse-. -gcollapse- is part of Mauricio Caceres Bravo's -gtools- package (which also contains the very useful -greshape- command, a huge speedup over StataCorp's -reshape-). You can get -gtools- from SSC.
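            For concreteness, a minimal sketch of that substitution in the #2 code, assuming -gtools- is installed:
            Code:
            * gcollapse takes the same syntax as collapse here
            gcollapse (sum) ind_rate, by(id_1 id_2 year)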
            Last edited by Clyde Schechter; 24 Apr 2023, 12:15.

