Stata how to calculate all combinations of differences among observations

ashkan faramarzi

Join Date: Sep 2016

Posts: 42
#1

Stata how to calculate all combinations of differences among observations

22 Sep 2021, 07:43

Hello,

I am trying to calculate the difference between the values of different rows for the same variable and sum the differences. Here is how my data looks like:

no | sic | year | advertising ($) | sum of the differences
1 | 100 | 2000 | 10 | a = abs(10-11) + abs(10-12) + abs(10-13) + abs(10-14)
2 | 100 | 2000 |11 | b = abs(11-10) + abs(11-12) + abs(11-13) + abs(11-14)
3 | 100 | 2000 |12 | c = abs(12-10) + abs(12-11) + abs(12-13) + abs(12-14)
4 | 100 | 2000 |13 | d = abs(13-10) + abs(13-11) + abs(13-12) + abs(13-14)
5 | 100 | 2000 |14 | e = abs(14-10) + abs(14-11) + abs(14-12) + abs(14-13)

What I like to do is to calculate a,b,c,d, and e the way I showed. I need this for a big dataset. How can I code this? Thanks for helping me out.

Last edited by ashkan faramarzi; 22 Sep 2021, 07:48.
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30168
#2

22 Sep 2021, 09:48

Although you do not say as much, I am guessing that you want to do this separately for each combination of sic and year, and that you are not looking to include differences in advertising across different firms or for the same firm in different years. If I have that wrong, please post back explaining exactly what you want and I can modify the code accordingly.

Code:

* Example generated by -dataex-. For more info, type help dataex clear input byte(no sic) int year byte advertising 1 100 2000 10 2 100 2000 11 3 100 2000 12 4 100 2000 13 5 100 2000 14 end isid no sic year, sort preserve rename (no advertising) =_U tempfile matches save `matches' restore joinby sic year using `matches' drop if no == no_U gen delta = abs(advertising-advertising_U) by no sic year, sort: egen wanted = total(delta) by no sic year: keep if _n == 1 drop *_U delta

In the future, when showing data examples, please use the -dataex- command to do so, as I have here. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

When asking for help with code, always show example data. When showing example data, always use -dataex-.
Comment
ashkan faramarzi

Join Date: Sep 2016

Posts: 42
#3

24 Sep 2021, 11:22

Thanks Clyde. Sorry about the dataex, I will do it for future posts.
I want to do this separately for each combination of "no" as I showed in a, b, c,d,e, like below (to include differences in advertising across different firms and then sum it up for each row):
a = abs(10-11) + abs(10-12) + abs(10-13) + abs(10-14)
b = abs(11-10) + abs(11-12) + abs(11-13) + abs(11-14)
c = abs(12-10) + abs(12-11) + abs(12-13) + abs(12-14)
d = abs(13-10) + abs(13-11) + abs(13-12) + abs(13-14)
e = abs(14-10) + abs(14-11) + abs(14-12) + abs(14-13)

The numbers are advertising expenses. you can forget about sic and year.
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30168

24 Sep 2021, 14:26

OK, then it's

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input byte(no sic) int year byte advertising
1 100 2000 10
2 100 2000 11
3 100 2000 12
4 100 2000 13
5 100 2000 14
end

isid no sic year, sort
preserve
rename (no sic year advertising) =_U
tempfile matches
save `matches'

restore
cross using `matches'
drop if no == no_U & year == year_U & sic == sic_U
gen delta = abs(advertising-advertising_U)
by no, sort: egen wanted = total(delta)
by no: keep if _n == 1
drop *_U delta

I do have one concern. You say you have a "big" dataset. That means different things to different people. In particular, if it's really, really big, then the -cross- command will break your memory limits. If you run into that problem, post back and I'll show you a different, more complicated, but memory-conserving way.

Comment

Nick Cox

Join Date: Mar 2014

Posts: 35780
#5

24 Sep 2021, 14:38

If you’re after Gini’s mean difference there are other ways to get it.
Comment

ashkan faramarzi

Join Date: Sep 2016
Posts: 42

25 Sep 2021, 02:39

Thanks a lot, Clyde, very helpful. But as you predicted, I am having problems with my memory as I have 216000 observations and 9000 groups (based on sic and year). I appreciate it If you could suggest a solution.
Using dataex, here how my data looks like:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long gvkey double fyear int sic float(ada group)
  2812 2000 100           0 1
 30443 2000 100           0 1
 10884 2000 100           0 1
 62341 2000 100           0 1
 65158 2000 100           0 1
 10390 2000 100           0 1
 27301 2000 100           0 1
111868 2000 100 .0018963456 1
 20302 2000 100           0 1
  1266 2000 100           0 1
121722 2000 100           0 1
 29667 2000 100           0 1
140760 2000 100  .008783899 1
 28524 2000 100           0 1
 14002 2000 100   .04970201 1
 13155 2000 100           0 1
 13550 2000 100           0 1
 23080 2000 100           0 1
140760 2001 100  .008399685 2
 30443 2001 100           0 2
 13550 2001 100           0 2
  1266 2001 100           0 2
 20302 2001 100           0 2
144343 2001 100           0 2
 29667 2001 100           0 2
 14002 2001 100  .016328929 2
 27301 2001 100           0 2
 23080 2001 100           0 2
 10884 2001 100           0 2
  2812 2001 100           0 2
111868 2001 100 .0017602402 2
 13155 2001 100           0 2
121722 2001 100           0 2
 28524 2001 100           0 2
 62341 2001 100           0 2
 10390 2001 100           0 2
 65158 2001 100           0 2
 10390 2002 100           0 3
 10884 2002 100           0 3
 28524 2002 100           0 3
 20302 2002 100           0 3
 23080 2002 100           0 3
 30443 2002 100           0 3
 13550 2002 100           0 3
  2812 2002 100           0 3
147855 2002 100           0 3
 13155 2002 100           0 3
  1266 2002 100           0 3
 27301 2002 100           0 3
 29667 2002 100           0 3
 14002 2002 100   .07437443 3
144343 2002 100           0 3
140760 2002 100  .007874016 3
121722 2002 100           0 3
 65158 2002 100           0 3
 62341 2002 100           0 3
111868 2002 100    .0025258 3
end
format %ty fyear

And this is my coding using your suggestion:
egen group=group(sic fyear)

preserve
rename (gvkey sic fyear ada group) =_U
tempfile matches
save `matches', replace
restore

gen wanted=.
su group, meanonly
qui forval i = `r(min)'/`r(max)' {
cross using `matches'
drop if gvkey == gvkey_U & fyear == fyear_U & sic == sic_U
drop if group != group_U
gen delta = abs(ada-ada_U)
by gvkey fyear, sort: egen wanted1 = total(delta)
replace wanted=wanted1 if group==`i'
bysort gvkey fyear: keep if _n == 1
drop *_U delta wanted1
}

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30168
#7

25 Sep 2021, 11:26

In #2 I wrote code that confined the calculation of the total absolute differences to observations with the same industry and year. In #3 you responded that you did not want that restriction: you said that you wanted the differences between all different observations in the entire data set. Your code in #6 looks like an attempt to impose precisely the restriction I imposed in #2: look only at observations sharing the same sic and fyear. It does it in a kludgy way, but I cannot actually imagine what other intent you might have there--the variable group that you create cannot serve any useful purpose otherwise. If this is what you want, try the code in #2. It is not nearly as memory-intensive as the code in #4, nor yours in #6. If the code in #2 also breaks your memory limits, post back.
Comment
Bradley Tompkins

Join Date: Jun 2023

Posts: 3
#8

16 Jun 2023, 10:29

Hi All,

I have a similar issue that I am trying to solve. I have tried to apply the examples provided above but not had any success.

I need to compare each observation of na_result to one another for a given bundle_id via subtraction. I also need to compare na_order_dttm for each observation (this is a date/time variable), again via simple subtraction.

So I do not need to compare na_result observations across bundle_id 100076_3 & 103619_1, etc. The comparisons only need to be made within each bundle_id. The data set contains just over 6,000 observations from over 400 different bundle_id's.

Any help in applying the methods described above would be greatly appreciated. Apologies in advance for any mistakes made in posting (long time observer, first time poster). Many thanks.

input str9 bundle_id float na_order_dttm int na_result float order
"100076_3" 1.881838e+12 122 1
"100076_3" 1.8818635e+12 122 2
"100076_3" 1.8818793e+12 122 3
"100076_3" 1.881951e+12 125 4
"100076_3" 1.8819657e+12 126 5
"100076_3" 1.882077e+12 128 6
"103619_1" 1.8505593e+12 122 1
"103619_1" 1.850604e+12 126 2
"103619_1" 1.8506375e+12 128 3
"103619_1" 1.850652e+12 128 4
"103619_1" 1.8506808e+12 128 5
"103619_1" 1.8506904e+12 129 6
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30168
#9

16 Jun 2023, 10:54

Code:

clear input str9 bundle_id float na_order_dttm int na_result float order "100076_3" 1.881838e+12 122 1 "100076_3" 1.8818635e+12 122 2 "100076_3" 1.8818793e+12 122 3 "100076_3" 1.881951e+12 125 4 "100076_3" 1.8819657e+12 126 5 "100076_3" 1.882077e+12 128 6 "103619_1" 1.8505593e+12 122 1 "103619_1" 1.850604e+12 126 2 "103619_1" 1.8506375e+12 128 3 "103619_1" 1.850652e+12 128 4 "103619_1" 1.8506808e+12 128 5 "103619_1" 1.8506904e+12 129 6 end format na_order_dttm %tc tempfile copy save `copy' rangejoin na_order_dttm . -1 using `copy', by(bundle_id) drop if missing(order_U) gen time_diff = clockdiff_frac(na_order_dttm_U, na_order_dttm, "h") gen na_diff = na_result - na_result_U

I have assumed that you do not need to calculate both the difference between order X and order Y as well as the difference between order Y and order X, nor the difference between any order and itself. So this code gives you the differences between any given order and all preceding orders in the bundle.

Calculating the time difference just by subtraction is probably not a good idea here, because the result will be measured in milliseconds. Given the relatively long intervals between orders, those numbers will just be confusing. So I have instead rescaled the time difference to units of hours, which I think will be more easily grasped. The difference in Na results is given as the result in the later order minus the result in the earlier order. So a positive number denotes an increase over time, and a negative one a decrease.

Added: -rangejoin- is written by Robert Picard and is available from SSC. To use it, you must also install -rangestat-, written by Robert Picard, Nick Cox, and Roberto Ferrer, also available from SSC.

Last edited by Clyde Schechter; 16 Jun 2023, 10:58.
Comment

Announcement