Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Flag most frequent values in a subgroup & apply flag to same values in other subgroups

    Hi all,

    I have a dataset with a variable surg_type (“surgery type”) that can have the values 0, 1 or 2 and a string variable ops_ko3 (“ops code”) with 20 possible values (see below for an example of the dataset). I want to find the top 5 ops codes that occur most frequently for surg_type == 2 and then find out how frequently those 5 codes occur for all 3 surgery types (i.e. I want to find that ops_ko “501” is the most frequent ops code for surgery type 2 and find that it occurs “X” times for surgery type 2, “Y” times for surgery type 1 and “Z” times for surgery type 0).

    Notes:
    • Unfortunately, I cannot use any community-contributed Stata commands (so nmodes is not an option), as I will need to send the code to run elsewhere and cannot request any installations
    • In the example the values for ops_ko3 all happen to be numerical, however, they can also be alphanumerical
    I found a similar question on a previous thread and tried to write a code based on the solution by Jeph found here: https://www.stata.com/statalist/arch.../msg00918.html.

    bysort surg_type ops_ko3 : gen num=_N
    bysort surg_type ops_ko3 : gen tmp= num if _n == 1
    gsort -surg_type -tmp
    gen top5num = tmp[5]
    gen byte top5ops=num>=top10num if surg_type == 2
    drop tmp top5num
    tab ops_ko3 if top5ops == 1, sort

    By doing this I find the top 5 values of ops_ko for surgery type 2. However, I have not been able to write a code to find how frequently those 5 ops codes occur for surgery types 0 and 1. Is there a way to flag all ops codes that are flagged for surg_type == 2 ?

    I would appreciate any help you can offer! Thank you.
    Chrissa


    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float surg_type str3 ops_ko3
    0 "529"
    0 "565"
    0 "555"
    1 "589"
    2 "542"
    0 "501"
    1 "523"
    1 "578"
    1 "589"
    0 "578"
    0 "542"
    0 "542"
    1 "589"
    0 "542"
    1 "589"
    0 "578"
    0 "542"
    0 "565"
    0 "578"
    0 "542"
    0 "578"
    0 "555"
    2 "529"
    0 "542"
    0 "578"
    0 "542"
    0 "506"
    2 "523"
    0 "501"
    0 "518"
    1 "589"
    0 "542"
    0 "532"
    0 "542"
    0 "555"
    0 "535"
    0 "532"
    0 "535"
    0 "555"
    0 "542"
    0 "578"
    2 "578"
    1 "578"
    0 "523"
    0 "542"
    0 "542"
    0 "542"
    0 "523"
    0 "555"
    0 "578"
    2 "501"
    0 "560"
    0 "535"
    1 "589"
    0 "578"
    0 "542"
    0 "560"
    0 "535"
    0 "578"
    2 "565"
    0 "560"
    1 "589"
    1 "589"
    0 "542"
    1 "589"
    0 "542"
    0 "565"
    0 "542"
    1 "589"
    2 "501"
    0 "542"
    0 "542"
    0 "560"
    0 "542"
    0 "578"
    1 "589"
    0 "578"
    0 "542"
    2 "501"
    0 "560"
    0 "578"
    0 "578"
    0 "560"
    0 "555"
    0 "542"
    0 "560"
    0 "542"
    0 "542"
    0 "542"
    2 "578"
    0 "578"
    0 "578"
    0 "578"
    0 "578"
    1 "589"
    0 "542"
    2 "508"
    0 "560"
    0 "542"
    0 "535"
    end

  • #2
    Thanks for the data example. You should get there using contract and merge (or frlink m:1)

    Code:
    frame put surg_type ops_ko3, into(vars)
    frame vars{
        contract surg_type ops_ko3
        frame put ops_ko3 if surg_type==2, into(two)
        frlink m:1 ops_ko3, frame(two)
        list if inlist(surg_type, 0,1), sepby(surg_type)
    }
    Res.:

    Code:
    .     list if inlist(surg_type, 0,1), sepby(surg_type)
    
         +----------------------------------+
         | surg_t~e   ops_ko3   _freq   two |
         |----------------------------------|
      1. |        0       501       2     1 |
      2. |        0       506       1     . |
      3. |        0       518       1     . |
      4. |        0       523       2     3 |
      5. |        0       529       1     4 |
      6. |        0       532       2     . |
      7. |        0       535       5     . |
      8. |        0       542      27     5 |
      9. |        0       555       6     . |
     10. |        0       560       8     . |
     11. |        0       565       3     6 |
     12. |        0       578      17     7 |
         |----------------------------------|
     13. |        1       523       1     3 |
     14. |        1       578       2     7 |
     15. |        1       589      12     . |
         +----------------------------------+
    gsort using -_freq to get top 5 values within surg_type.
    Last edited by Andrew Musau; 15 Jul 2020, 06:48.

    Comment


    • #3
      You don't need any community-contributed commands here. You just need to calculate the group frequencies and then fool around a bit with some code, sorting on the frequencies and so forth.

      Warning: the 5 most common categories might not be uniquely defined, as in your data example, where several categories have frequency 1.

      Code:
      . * most common types visible in a table 
      . tab ops_ko3 if surg_type == 2, sort  
      
          ops_ko3 |      Freq.     Percent        Cum.
      ------------+-----------------------------------
              501 |          3       30.00       30.00
              578 |          2       20.00       50.00
              508 |          1       10.00       60.00
              523 |          1       10.00       70.00
              529 |          1       10.00       80.00
              542 |          1       10.00       90.00
              565 |          1       10.00      100.00
      ------------+-----------------------------------
            Total |         10      100.00
      
      . 
      . * to get them into variables and then a local macro 
      . bysort surg_type  ops_ko3 : gen freq = _N 
      
      . egen tag = tag(surg_type ops_ko3) if surg_type == 2 
      
      . gsort -tag -freq 
      
      . list in 1/5 
      
           +---------------------------------+
           | surg_t~e   ops_ko3   freq   tag |
           |---------------------------------|
        1. |        2       501      3     1 |
        2. |        2       578      2     1 |
        3. |        2       565      1     1 |
        4. |        2       542      1     1 |
        5. |        2       529      1     1 |
           +---------------------------------+
      
      . 
      . levelsof ops_ko3 in 1/5, local(modes) sep(,)
      `"501"',`"529"',`"542"',`"565"',`"578"'
      
      . 
      . tab ops_ko3 surg_type if inlist(ops_ko3, `modes') 
      
                 |            surg_type
         ops_ko3 |         0          1          2 |     Total
      -----------+---------------------------------+----------
             501 |         2          0          3 |         5 
             529 |         1          0          1 |         2 
             542 |        27          0          1 |        28 
             565 |         3          0          1 |         4 
             578 |        17          2          2 |        21 
      -----------+---------------------------------+----------
           Total |        50          2          8 |        60

      Comment


      • #4
        Hi Andrew,

        Your suggestion to use a frame and frlink was very useful, thank you! I had never used them before.

        Comment


        • #5
          Hi Nick,

          Thanks for your response. It helped and I solved my problem!

          Comment


          • #6
            Hi Nick,

            I have found your guidance to be helpful in the past and was hopeful that I could ask a follow-up question. I have a similar situation to the original poster wherein I am trying to determine the top 3 or 5 most frequently occurring NAICS codes (industrial designations) in a dataset per each state. I attempted to use the process you described above by adjusting the first entry to read as "tab naics if state == Wyoming, sort".

            This returned an error, indicating "Wyoming not found." I thought this may be due to its nature as a string variable, so I attempted to encode the states as numerical entries using "encode state, gen(statenum)" and then utilizing the new "statenum" variable, but this had the same result.

            Is there any way that you can see that I could make this code work for my situation? I am relatively new to Stata, but tried to exhaust all avenues before reaching out. Thank you in advance for any reply.

            Sam

            Comment


            • #7
              If state is a string variable the code should be
              Code:
              tab naics if state == "Wyoming", sort
              Without the quotes, Stata thinks you want to compare the variable state to a variable named Wyoming. As you have no such variable, Stata informs you that it could not find it. Whenever you want to refer to an actual string value, it must be enclosed in quotes.

              Comment


              • #8
                Thank you very much, Clyde! Hoping to learn more and become an active member of this community down the road.

                Comment

                Working...
                X