Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Count number of distinct values within a subset of a group

    Hi, my question is a slight twist on counting distinct occurrences in a group.

    Here is some example code:


    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input long firmid str6 product str4 j float year double value
    5221303 "121190" "DNK" 2003                  0
    5221303 "40900"  "USA" 2003           191556.5
    5221303 "71029"  "ESP" 2003                  0
    5221303 "71080"  "DEU" 2003           80694.19
    5221303 "71080"  "ESP" 2003            66672.8
    5221303 "71080"  "FRA" 2003 386993.80
    5221303 "71080"  "ITA" 2003                  0
    5221303 "71151"  "DEU" 2003          125894.75
    5221303 "71151"  "ESP" 2003             119350
    5221303 "71159"  "DEU" 2003                  0
    5221303 "71159"  "ESP" 2003                  0
    5221303 "71159"  "FIN" 2003                  0
    5221303 "71159"  "SWE" 2003                  0
    5221303 "71190"  "DEU" 2003                  0
    5221303 "71190"  "ESP" 2003                  0
    5221303 "71190"  "FRA" 2003                  0
    5221303 "71230"  "ESP" 2003                  0
    5221303 "71230"  "FRA" 2003                  0
    5221303 "71239"  "ESP" 2003            35667.8
    5221303 "71239"  "FRA" 2003           49720.72
    5221303 "80240"  "FRA" 2003                  0
    5221303 "80240"  "ITA" 2003           204699.9
    5221303 "81120"  "AUS" 2003           125420.1
    5221303 "81120"  "BEL" 2003                  0
    5221303 "81120"  "CAN" 2003              25200
    5221303 "81120"  "CHE" 2003                  0
    5221303 "81120"  "DEU" 2003              19493
    5221303 "81120"  "DNK" 2003                  0
    5221303 "81120"  "FRA" 2003          719039.84
    5221303 "81120"  "GBR" 2003              26525
    5221303 "81120"  "ITA" 2003          148453.38
    5221303 "81120"  "NLD" 2003            92302.6
    5221303 "81120"  "NZL" 2003              19426
    5221303 "81120"  "USA" 2003           74740.14
    5221303 "81190"  "DEU" 2003            15364.5
    5221303 "81190"  "FRA" 2003                  0
    end


    An observation in my dataset consists of a firmid-year-j-product combination.

    I want to calculate the number of distinct products each firm sold to a non-USA country in 2003. This variable, numprod_nonUS_2003, should be added as a new column. So even observations not corresponding to 2003 or to the USA should have the value of the corresponding firm (meaning a collapse won't actually work).

    Note also that if a firm doesn't sell, it is still recorded in the dataset with value = 0, so we need to count only observations with value > 0.

    I managed to do this if I conditioned instead on USA sales, which is easier, since a firmid-year-product combination is unique if you fix the country. I did it by running:

    (2003 is the second year in my data)

    Code:
    bysort firmid: egen numprod_US_2003 = total(j == "USA" & year == 2003 & value > 0)
    bysort firmid product j (year): replace numprod_US_2003 = numprod_US_2003[2]
    I don't know however how to adapt this to the non-USA case.
    Last edited by Arthur Carvalho Brito; 01 Sep 2023, 14:05.

  • #2
    Seems like you're looking for this:

    Code:
    bysort firmid: egen numprod_nonUS_2003 = total(j != "USA" & year == 2003 & value > 0)
    bysort firmid product j (year): replace numprod_nonUS_2003 = numprod_nonUS_2003[2]

    Comment


    • #3
      Daniel Schaefer thank you for the suggestion, but this won't work - a firm may sell the same product to multiple countries other than the US. This suggestion would count them multiple times

      Comment


      • #4
        Okay, can you provide a data example that covers that case?

        Comment


        • #5
          Daniel Schaefer I just edited the data with a new example, focusing on a single firm (and removing the USA, for simplicity).

          Your solution would give an answer of 17. If I counted things (by hand) correctly, the answer should be 6 instead

          Comment


          • #6
            If I understand what is being sought, this is a situation where what would otherwise be a straightforward calculation becomes complicated because there are many observations that are excluded from the analysis that are physically in the way. The use of frames provides a simple approach. You move only the relevant observations into the frame, do the calculation there, and then bring the results back.
            Code:
            frame put firmid product if value > 0 & j != "USA" & year == 2003, into(possibilities)
            frame possibilities {
                by firmid (product), sort: gen wanted = sum(product != product[_n-1])
                by firmid (product): keep if _n == _N
                list, noobs clean
            }
            frlink m:1 firmid, frame(possibilities)
            frget wanted, from(possibilities)

            Comment


            • #7
              Thank you. What about this? Seems to work correctly with the example data.

              Code:
              sort firmid product value
              bysort firmid: egen numprod_nonUS_2003 = total(j != "USA" & year == 2003 & value > 0 & product[_n] != product[_n + 1] & firmid[_n] == firmid[_n + 1])
              Edit:

              I believe there is an edge case that will overcount the total if I don't include & firmid[_n] == firmid[_n + 1]in this pattern. Changed the above to reflect that.
              Last edited by Daniel Schaefer; 01 Sep 2023, 14:35.

              Comment


              • #8
                The code in #7 is simpler than that in #6. But I do not think it is correct. It seems that if it happens that the first observation for a particular product after the -sort- command happens to be one for which j == "USA", then that product will fail to be counted.

                Comment


                • #9
                  Daniel Schaefer Not quite there yet... for some reason it gives the wrong answer. It shows 4, instead of 6. Before your edit, it would show 5
                  Last edited by Arthur Carvalho Brito; 01 Sep 2023, 14:49.

                  Comment


                  • #10
                    Clyde Schechter your solution seems nice, thank you. I am not used to using frames, but I could give it a try. Would be similar to preserving and then merging back, right?

                    Comment


                    • #11
                      Yes, this could also be accomplished by preserving the original data, weeding out the extraneous observations, calculating the number of distinct products per firm, and then merging back. If your data set is small, there would be no noticeable performance difference between these two ways of doing it. In a larger data set, because it avoids the disk, I would expect frames to be faster, though I have never actually tested that in practice.

                      Comment


                      • #12
                        Thank you Clyde, you are quite right. #7 is not correct.

                        Comment

                        Working...
                        X