Count number of distinct values within a subset of a group

Arthur Carvalho Brito

Join Date: Jan 2021
Posts: 45

Count number of distinct values within a subset of a group

01 Sep 2023, 13:12

Hi, my question is a slight twist on counting distinct occurrences in a group.

Here is some example code:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input long firmid str6 product str4 j float year double value
5221303 "121190" "DNK" 2003                  0
5221303 "40900"  "USA" 2003           191556.5
5221303 "71029"  "ESP" 2003                  0
5221303 "71080"  "DEU" 2003           80694.19
5221303 "71080"  "ESP" 2003            66672.8
5221303 "71080"  "FRA" 2003 386993.80
5221303 "71080"  "ITA" 2003                  0
5221303 "71151"  "DEU" 2003          125894.75
5221303 "71151"  "ESP" 2003             119350
5221303 "71159"  "DEU" 2003                  0
5221303 "71159"  "ESP" 2003                  0
5221303 "71159"  "FIN" 2003                  0
5221303 "71159"  "SWE" 2003                  0
5221303 "71190"  "DEU" 2003                  0
5221303 "71190"  "ESP" 2003                  0
5221303 "71190"  "FRA" 2003                  0
5221303 "71230"  "ESP" 2003                  0
5221303 "71230"  "FRA" 2003                  0
5221303 "71239"  "ESP" 2003            35667.8
5221303 "71239"  "FRA" 2003           49720.72
5221303 "80240"  "FRA" 2003                  0
5221303 "80240"  "ITA" 2003           204699.9
5221303 "81120"  "AUS" 2003           125420.1
5221303 "81120"  "BEL" 2003                  0
5221303 "81120"  "CAN" 2003              25200
5221303 "81120"  "CHE" 2003                  0
5221303 "81120"  "DEU" 2003              19493
5221303 "81120"  "DNK" 2003                  0
5221303 "81120"  "FRA" 2003          719039.84
5221303 "81120"  "GBR" 2003              26525
5221303 "81120"  "ITA" 2003          148453.38
5221303 "81120"  "NLD" 2003            92302.6
5221303 "81120"  "NZL" 2003              19426
5221303 "81120"  "USA" 2003           74740.14
5221303 "81190"  "DEU" 2003            15364.5
5221303 "81190"  "FRA" 2003                  0
end

An observation in my dataset consists of a firmid-year-j-product combination.

I want to calculate the number of distinct products each firm sold to a non-USA country in 2003. This variable, numprod_nonUS_2003, should be added as a new column. So even observations not corresponding to 2003 or to the USA should have the value of the corresponding firm (meaning a collapse won't actually work).

Note also that if a firm doesn't sell, it is still recorded in the dataset with value = 0, so we need to count only observations with value > 0.

I managed to do this if I conditioned instead on USA sales, which is easier, since a firmid-year-product combination is unique if you fix the country. I did it by running:

(2003 is the second year in my data)

Code:

bysort firmid: egen numprod_US_2003 = total(j == "USA" & year == 2003 & value > 0)
bysort firmid product j (year): replace numprod_US_2003 = numprod_US_2003[2]

I don't know however how to adapt this to the non-USA case.

Last edited by Arthur Carvalho Brito; 01 Sep 2023, 14:05.

Tags: bysort, count

Daniel Schaefer

Join Date: Mar 2020
Posts: 822

01 Sep 2023, 13:54

Seems like you're looking for this:

Code:

bysort firmid: egen numprod_nonUS_2003 = total(j != "USA" & year == 2003 & value > 0)
bysort firmid product j (year): replace numprod_nonUS_2003 = numprod_nonUS_2003[2]

Comment

Arthur Carvalho Brito

Join Date: Jan 2021

Posts: 45
#3

01 Sep 2023, 14:01

Daniel Schaefer thank you for the suggestion, but this won't work - a firm may sell the same product to multiple countries other than the US. This suggestion would count them multiple times
Comment
Daniel Schaefer

Join Date: Mar 2020

Posts: 822
#4

01 Sep 2023, 14:03

Okay, can you provide a data example that covers that case?
Comment
Arthur Carvalho Brito

Join Date: Jan 2021

Posts: 45
#5

01 Sep 2023, 14:06

Daniel Schaefer I just edited the data with a new example, focusing on a single firm (and removing the USA, for simplicity).

Your solution would give an answer of 17. If I counted things (by hand) correctly, the answer should be 6 instead
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30164
#6

01 Sep 2023, 14:28

If I understand what is being sought, this is a situation where what would otherwise be a straightforward calculation becomes complicated because there are many observations that are excluded from the analysis that are physically in the way. The use of frames provides a simple approach. You move only the relevant observations into the frame, do the calculation there, and then bring the results back.

Code:

frame put firmid product if value > 0 & j != "USA" & year == 2003, into(possibilities) frame possibilities { by firmid (product), sort: gen wanted = sum(product != product[_n-1]) by firmid (product): keep if _n == _N list, noobs clean } frlink m:1 firmid, frame(possibilities) frget wanted, from(possibilities)
Comment
Daniel Schaefer

Join Date: Mar 2020

Posts: 822
#7

01 Sep 2023, 14:28

Thank you. What about this? Seems to work correctly with the example data.

Code:

sort firmid product value bysort firmid: egen numprod_nonUS_2003 = total(j != "USA" & year == 2003 & value > 0 & product[_n] != product[_n + 1] & firmid[_n] == firmid[_n + 1])

Edit:

I believe there is an edge case that will overcount the total if I don't include & firmid[_n] == firmid[_n + 1]in this pattern. Changed the above to reflect that.

Last edited by Daniel Schaefer; 01 Sep 2023, 14:35.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30164
#8

01 Sep 2023, 14:34

The code in #7 is simpler than that in #6. But I do not think it is correct. It seems that if it happens that the first observation for a particular product after the -sort- command happens to be one for which j == "USA", then that product will fail to be counted.
1 like
Comment
Arthur Carvalho Brito

Join Date: Jan 2021

Posts: 45
#9

01 Sep 2023, 14:44

Daniel Schaefer Not quite there yet... for some reason it gives the wrong answer. It shows 4, instead of 6. Before your edit, it would show 5

Last edited by Arthur Carvalho Brito; 01 Sep 2023, 14:49.
Comment
Arthur Carvalho Brito

Join Date: Jan 2021

Posts: 45
#10

01 Sep 2023, 14:51

Clyde Schechter your solution seems nice, thank you. I am not used to using frames, but I could give it a try. Would be similar to preserving and then merging back, right?
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30164
#11

01 Sep 2023, 15:01

Yes, this could also be accomplished by preserving the original data, weeding out the extraneous observations, calculating the number of distinct products per firm, and then merging back. If your data set is small, there would be no noticeable performance difference between these two ways of doing it. In a larger data set, because it avoids the disk, I would expect frames to be faster, though I have never actually tested that in practice.
Comment
Daniel Schaefer

Join Date: Mar 2020

Posts: 822
#12

01 Sep 2023, 15:04

Thank you Clyde, you are quite right. #7 is not correct.
Comment

Announcement

Count number of distinct values within a subset of a group

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment