Flag most frequent values in a subgroup & apply flag to same values in other subgroups

Chrissa Tsatsaronis

Join Date: Jul 2020

Posts: 3
#1

Flag most frequent values in a subgroup & apply flag to same values in other subgroups

15 Jul 2020, 05:59

Hi all,

I have a dataset with a variable surg_type (“surgery type”) that can have the values 0, 1 or 2 and a string variable ops_ko3 (“ops code”) with 20 possible values (see below for an example of the dataset). I want to find the top 5 ops codes that occur most frequently for surg_type == 2 and then find out how frequently those 5 codes occur for all 3 surgery types (i.e. I want to find that ops_ko “501” is the most frequent ops code for surgery type 2 and find that it occurs “X” times for surgery type 2, “Y” times for surgery type 1 and “Z” times for surgery type 0).

Notes:
Unfortunately, I cannot use any community-contributed Stata commands (so nmodes is not an option), as I will need to send the code to run elsewhere and cannot request any installations

In the example the values for ops_ko3 all happen to be numerical, however, they can also be alphanumerical

I found a similar question on a previous thread and tried to write a code based on the solution by Jeph found here: https://www.stata.com/statalist/arch.../msg00918.html.

bysort surg_type ops_ko3 : gen num=_N
bysort surg_type ops_ko3 : gen tmp= num if _n == 1
gsort -surg_type -tmp
gen top5num = tmp[5]
gen byte top5ops=num>=top10num if surg_type == 2
drop tmp top5num
tab ops_ko3 if top5ops == 1, sort

By doing this I find the top 5 values of ops_ko for surgery type 2. However, I have not been able to write a code to find how frequently those 5 ops codes occur for surgery types 0 and 1. Is there a way to flag all ops codes that are flagged for surg_type == 2 ?

I would appreciate any help you can offer! Thank you.
Chrissa

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input float surg_type str3 ops_ko3 0 "529" 0 "565" 0 "555" 1 "589" 2 "542" 0 "501" 1 "523" 1 "578" 1 "589" 0 "578" 0 "542" 0 "542" 1 "589" 0 "542" 1 "589" 0 "578" 0 "542" 0 "565" 0 "578" 0 "542" 0 "578" 0 "555" 2 "529" 0 "542" 0 "578" 0 "542" 0 "506" 2 "523" 0 "501" 0 "518" 1 "589" 0 "542" 0 "532" 0 "542" 0 "555" 0 "535" 0 "532" 0 "535" 0 "555" 0 "542" 0 "578" 2 "578" 1 "578" 0 "523" 0 "542" 0 "542" 0 "542" 0 "523" 0 "555" 0 "578" 2 "501" 0 "560" 0 "535" 1 "589" 0 "578" 0 "542" 0 "560" 0 "535" 0 "578" 2 "565" 0 "560" 1 "589" 1 "589" 0 "542" 1 "589" 0 "542" 0 "565" 0 "542" 1 "589" 2 "501" 0 "542" 0 "542" 0 "560" 0 "542" 0 "578" 1 "589" 0 "578" 0 "542" 2 "501" 0 "560" 0 "578" 0 "578" 0 "560" 0 "555" 0 "542" 0 "560" 0 "542" 0 "542" 0 "542" 2 "578" 0 "578" 0 "578" 0 "578" 0 "578" 1 "589" 0 "542" 2 "508" 0 "560" 0 "542" 0 "535" end
Tags: None

Andrew Musau

Join Date: Oct 2014
Posts: 10213

15 Jul 2020, 06:45

Thanks for the data example. You should get there using contract and merge (or frlink m:1)

Code:

frame put surg_type ops_ko3, into(vars)
frame vars{
    contract surg_type ops_ko3
    frame put ops_ko3 if surg_type==2, into(two)
    frlink m:1 ops_ko3, frame(two)
    list if inlist(surg_type, 0,1), sepby(surg_type)
}

Res.:

Code:

.     list if inlist(surg_type, 0,1), sepby(surg_type)

     +----------------------------------+
     | surg_t~e   ops_ko3   _freq   two |
     |----------------------------------|
  1. |        0       501       2     1 |
  2. |        0       506       1     . |
  3. |        0       518       1     . |
  4. |        0       523       2     3 |
  5. |        0       529       1     4 |
  6. |        0       532       2     . |
  7. |        0       535       5     . |
  8. |        0       542      27     5 |
  9. |        0       555       6     . |
 10. |        0       560       8     . |
 11. |        0       565       3     6 |
 12. |        0       578      17     7 |
     |----------------------------------|
 13. |        1       523       1     3 |
 14. |        1       578       2     7 |
 15. |        1       589      12     . |
     +----------------------------------+

gsort using -_freq to get top 5 values within surg_type.

Last edited by Andrew Musau; 15 Jul 2020, 06:48.

Comment

Nick Cox

Join Date: Mar 2014
Posts: 35719

15 Jul 2020, 06:59

You don't need any community-contributed commands here. You just need to calculate the group frequencies and then fool around a bit with some code, sorting on the frequencies and so forth.

Warning: the 5 most common categories might not be uniquely defined, as in your data example, where several categories have frequency 1.

Code:

. * most common types visible in a table 
. tab ops_ko3 if surg_type == 2, sort  

    ops_ko3 |      Freq.     Percent        Cum.
------------+-----------------------------------
        501 |          3       30.00       30.00
        578 |          2       20.00       50.00
        508 |          1       10.00       60.00
        523 |          1       10.00       70.00
        529 |          1       10.00       80.00
        542 |          1       10.00       90.00
        565 |          1       10.00      100.00
------------+-----------------------------------
      Total |         10      100.00

. 
. * to get them into variables and then a local macro 
. bysort surg_type  ops_ko3 : gen freq = _N 

. egen tag = tag(surg_type ops_ko3) if surg_type == 2 

. gsort -tag -freq 

. list in 1/5 

     +---------------------------------+
     | surg_t~e   ops_ko3   freq   tag |
     |---------------------------------|
  1. |        2       501      3     1 |
  2. |        2       578      2     1 |
  3. |        2       565      1     1 |
  4. |        2       542      1     1 |
  5. |        2       529      1     1 |
     +---------------------------------+

. 
. levelsof ops_ko3 in 1/5, local(modes) sep(,)
`"501"',`"529"',`"542"',`"565"',`"578"'

. 
. tab ops_ko3 surg_type if inlist(ops_ko3, `modes') 

           |            surg_type
   ops_ko3 |         0          1          2 |     Total
-----------+---------------------------------+----------
       501 |         2          0          3 |         5 
       529 |         1          0          1 |         2 
       542 |        27          0          1 |        28 
       565 |         3          0          1 |         4 
       578 |        17          2          2 |        21 
-----------+---------------------------------+----------
     Total |        50          2          8 |        60

Comment

Chrissa Tsatsaronis

Join Date: Jul 2020

Posts: 3
#4

15 Jul 2020, 11:47

Hi Andrew,

Your suggestion to use a frame and frlink was very useful, thank you! I had never used them before.
Comment
Chrissa Tsatsaronis

Join Date: Jul 2020

Posts: 3
#5

15 Jul 2020, 11:58

Hi Nick,

Thanks for your response. It helped and I solved my problem!
Comment
Sam Ciaranca

Join Date: Jun 2023

Posts: 2
#6

20 Jun 2023, 20:50

Hi Nick,

I have found your guidance to be helpful in the past and was hopeful that I could ask a follow-up question. I have a similar situation to the original poster wherein I am trying to determine the top 3 or 5 most frequently occurring NAICS codes (industrial designations) in a dataset per each state. I attempted to use the process you described above by adjusting the first entry to read as "tab naics if state == Wyoming, sort".

This returned an error, indicating "Wyoming not found." I thought this may be due to its nature as a string variable, so I attempted to encode the states as numerical entries using "encode state, gen(statenum)" and then utilizing the new "statenum" variable, but this had the same result.

Is there any way that you can see that I could make this code work for my situation? I am relatively new to Stata, but tried to exhaust all avenues before reaching out. Thank you in advance for any reply.

Sam
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30116
#7

20 Jun 2023, 20:55

If state is a string variable the code should be

Code:

tab naics if state == "Wyoming", sort

Without the quotes, Stata thinks you want to compare the variable state to a variable named Wyoming. As you have no such variable, Stata informs you that it could not find it. Whenever you want to refer to an actual string value, it must be enclosed in quotes.
2 likes
Comment
Sam Ciaranca

Join Date: Jun 2023

Posts: 2
#8

21 Jun 2023, 09:14

Thank you very much, Clyde! Hoping to learn more and become an active member of this community down the road.
Comment

Announcement