Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Generating a row total sum for each time a strpos from the list is found

    Hi, I am currently working with ICD10 codes and getting stuck on a question I feel like probably has a simple answer! Currently, I have a wide dataset, with diagnosis codes listed as dx_. encounter_id is the unique identifier and diagnosis codes range from dx1-dx45. Some of these diagnosis codes represent fractures. For each encounter_id, I want to sum up how many fractures they have. I have a list of relevant string prefixes that correspond to fractures in unique locations. For example, "S02", "S22", "S82", "S72".

    How can I generate a variable, fracture_number, that counts how many times an individual has a dx = any of the strings listed above? Some people may have 4 fractures in different locations and I want to capture their total number.

    thank you so much! Data example below

    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input int encounter_id str8(dx1 dx2 dx3 dx4 dx5 dx6)
    1 "S22.22XB" "S51.802A" "W32.0XXA" "S27.2XXA" "S27.321A" "S21.301A"
    2 "J96.90" "W33.01XA" "R40.2112" "R40.2212" "T14.90XA" "R40.2312"
    3 "D64.9" "M79.7" "M86.18" "M86.68" "G82.20" "N31.9"
    4 "S01.93XA" "Z66" "S01.93XA" "R40.2212" "R00.1" "X72.XXXA"
    5 "Z51.5" "S02.40DB" "I82.621" "X73.0XXA" "R40.2363" "J98.11"
    6 "Z68.37" "M89.78" "S02.40DS" "E66.9" "D64.9" "J98.11"
    7 "S36.81XA" "Z78.1" "T81.4XXA" "K65.1" "S39.093A" "K56.7"
    8 "S32.491B" "X94.0XXA" "D62" "B19.20" "S32.491B" "S32.391B"
    9 "S32.89XB" "M21.371" "S39.81XA" "T79.7XXA" "S36.892A" "D72.829"
    10 "S01.522A" "S01.502A" "R06.09" "S02.66XB" "Z78.1" "S01.542A"
    11 "S21.242A" "S27.322A" "S24.111A" "S22.029A" "R40.2242" "T79.4XXA"
    12 "S75.002A" "I82.432" "I10" "W32.0XXA" "D64.9" "R40.2142"
    13 "S31.109A" "S36.114A" "S31.813A" "Z88.0" "S32.301B" "S31.635A"
    14 "S72.434B" "Z18.89" "W32.0XXA" "M86.8X6" "M25.561" "S81.041A"
    15 "F11.90" "R06.00" "O99.513" "R42" "O99.89" "G54.0"

  • #2
    Easier done in long layout.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input int encounter_id str8(dx1 dx2 dx3 dx4 dx5 dx6)
     1 "S22.22XB" "S51.802A" "W32.0XXA" "S27.2XXA" "S27.321A" "S21.301A"
     2 "J96.90"   "W33.01XA" "R40.2112" "R40.2212" "T14.90XA" "R40.2312"
     3 "D64.9"    "M79.7"    "M86.18"   "M86.68"   "G82.20"   "N31.9"   
     4 "S01.93XA" "Z66"      "S01.93XA" "R40.2212" "R00.1"    "X72.XXXA"
     5 "Z51.5"    "S02.40DB" "I82.621"  "X73.0XXA" "R40.2363" "J98.11"  
     6 "Z68.37"   "M89.78"   "S02.40DS" "E66.9"    "D64.9"    "J98.11"  
     7 "S36.81XA" "Z78.1"    "T81.4XXA" "K65.1"    "S39.093A" "K56.7"   
     8 "S32.491B" "X94.0XXA" "D62"      "B19.20"   "S32.491B" "S32.391B"
     9 "S32.89XB" "M21.371"  "S39.81XA" "T79.7XXA" "S36.892A" "D72.829" 
    10 "S01.522A" "S01.502A" "R06.09"   "S02.66XB" "Z78.1"    "S01.542A"
    11 "S21.242A" "S27.322A" "S24.111A" "S22.029A" "R40.2242" "T79.4XXA"
    12 "S75.002A" "I82.432"  "I10"      "W32.0XXA" "D64.9"    "R40.2142"
    13 "S31.109A" "S36.114A" "S31.813A" "Z88.0"    "S32.301B" "S31.635A"
    14 "S72.434B" "Z18.89"   "W32.0XXA" "M86.8X6"  "M25.561"  "S81.041A"
    15 "F11.90"   "R06.00"   "O99.513"  "R42"      "O99.89"   "G54.0"   
    end
    
    reshape long dx, i(encounter_id) j(condition)
    bys encounter_id: gen total= sum(regexm(dx, "S02|S22|S82|S72"))
    by encounter_id : replace total= total[_N]
    *RESHAPE WIDE ONLY IF YOU HAVE TO
    reshape wide dx, i(encounter_id) j(condition)

    Res.:

    Code:
    . l, sep(0)
    
         +------------------------------------------------------------------------------------+
         | encoun~d        dx1        dx2        dx3        dx4        dx5        dx6   total |
         |------------------------------------------------------------------------------------|
      1. |        1   S22.22XB   S51.802A   W32.0XXA   S27.2XXA   S27.321A   S21.301A       1 |
      2. |        2     J96.90   W33.01XA   R40.2112   R40.2212   T14.90XA   R40.2312       0 |
      3. |        3      D64.9      M79.7     M86.18     M86.68     G82.20      N31.9       0 |
      4. |        4   S01.93XA        Z66   S01.93XA   R40.2212      R00.1   X72.XXXA       0 |
      5. |        5      Z51.5   S02.40DB    I82.621   X73.0XXA   R40.2363     J98.11       1 |
      6. |        6     Z68.37     M89.78   S02.40DS      E66.9      D64.9     J98.11       1 |
      7. |        7   S36.81XA      Z78.1   T81.4XXA      K65.1   S39.093A      K56.7       0 |
      8. |        8   S32.491B   X94.0XXA        D62     B19.20   S32.491B   S32.391B       0 |
      9. |        9   S32.89XB    M21.371   S39.81XA   T79.7XXA   S36.892A    D72.829       0 |
     10. |       10   S01.522A   S01.502A     R06.09   S02.66XB      Z78.1   S01.542A       1 |
     11. |       11   S21.242A   S27.322A   S24.111A   S22.029A   R40.2242   T79.4XXA       1 |
     12. |       12   S75.002A    I82.432        I10   W32.0XXA      D64.9   R40.2142       0 |
     13. |       13   S31.109A   S36.114A   S31.813A      Z88.0   S32.301B   S31.635A       0 |
     14. |       14   S72.434B     Z18.89   W32.0XXA    M86.8X6    M25.561   S81.041A       1 |
     15. |       15     F11.90     R06.00    O99.513        R42     O99.89      G54.0       0 |
         +------------------------------------------------------------------------------------+

    Comment


    • #3
      Here's another way to do it:


      Code:
      gen wanted = 0 
      
      quietly forval j = 1/6 { 
          foreach s in S02 S22 S82 S72 { 
              replace wanted = wanted + strpos(dx`j', "`s'") == 1 
          }
      }
      I agree with Andrew Musau's implication. For most problems in Stata, you are better off with a long data layout (often called long format or long structure). Otherwise many problems will need to be tackled with a temporary reshape long, as in #2, or a loop over variables, as here.

      Comment


      • #4
        Or directly in wide format like this:

        1. Install -egenmore- by typing from within Stata "find egenmore" and following the installation instructions.

        2. Then

        Code:
        egen fractures = rcount(dx*), cond(substr(@,1,3)=="S02" | substr(@,1,3)=="S22" | substr(@,1,3)=="S82" | substr(@,1,3)=="S72")
        
        . list, sep(0)
        
             +---------------------------------------------------------------------------------------+
             | encoun~d        dx1        dx2        dx3        dx4        dx5        dx6   fractu~s |
             |---------------------------------------------------------------------------------------|
          1. |        1   S22.22XB   S51.802A   W32.0XXA   S27.2XXA   S27.321A   S21.301A          1 |
          2. |        2     J96.90   W33.01XA   R40.2112   R40.2212   T14.90XA   R40.2312          0 |
          3. |        3      D64.9      M79.7     M86.18     M86.68     G82.20      N31.9          0 |
          4. |        4   S01.93XA        Z66   S01.93XA   R40.2212      R00.1   X72.XXXA          0 |
          5. |        5      Z51.5   S02.40DB    I82.621   X73.0XXA   R40.2363     J98.11          1 |
          6. |        6     Z68.37     M89.78   S02.40DS      E66.9      D64.9     J98.11          1 |
          7. |        7   S36.81XA      Z78.1   T81.4XXA      K65.1   S39.093A      K56.7          0 |
          8. |        8   S32.491B   X94.0XXA        D62     B19.20   S32.491B   S32.391B          0 |
          9. |        9   S32.89XB    M21.371   S39.81XA   T79.7XXA   S36.892A    D72.829          0 |
         10. |       10   S01.522A   S01.502A     R06.09   S02.66XB      Z78.1   S01.542A          1 |
         11. |       11   S21.242A   S27.322A   S24.111A   S22.029A   R40.2242   T79.4XXA          1 |
         12. |       12   S75.002A    I82.432        I10   W32.0XXA      D64.9   R40.2142          0 |
         13. |       13   S31.109A   S36.114A   S31.813A      Z88.0   S32.301B   S31.635A          0 |
         14. |       14   S72.434B     Z18.89   W32.0XXA    M86.8X6    M25.561   S81.041A          1 |
         15. |       15     F11.90     R06.00    O99.513        R42     O99.89      G54.0          0 |
             +---------------------------------------------------------------------------------------+

        Comment


        • #5
          Or like this, more or less Nick's approach in #3, but the idea more to reduce the problem to a situation where -egen- rowfunction can be applied:

          Code:
          forvalues i = 1/6 {
          gen yesdx`i' = substr(dx`i',1,3)=="S02" | substr(dx`i',1,3)=="S22" | substr(dx`i',1,3)=="S82" | substr(dx`i',1,3)=="S72"
          }
          
          egen fractures = rowtotal(yesdx*)

          Comment


          • #6
            Hi! Sorry for the late reply, I realized I didn't have notifications set for this post.

            Thank you all SO much! I ended up going with Joro's suggestion, as I need this data set in the wide format for the rest of the analysis. It worked brilliantly! Really appreciate your wisdom.

            Nick, I was wondering- what is the function of the "quietly" code? I've seen it in other code suggestions you've written but I'm unsure of when to specify "quietly" when writing a loop.

            Comment


            • #7
              -egenmore- with -rcount- is a beautiful contribution among so many great things done by Nick Cox. It is indeed so helpful to save time for coding in Stata. This issue is a good example of that with the code of Joro Kolev in #4.

              By the way, a small contribution: utilizing -inlist- saves some typings in Joro's solution.
              Code:
              egen fractures = rcount(dx*), c(inlist(substr(@,1,3),"S02","S22","S82","S72"))

              Comment


              • #8
                #6 Julia Silver quietly suppresses output to the main window; here it could be put on the replace or either foreach loop with the same result, of suppressing messages on how many values have been changed.

                I suppress output I don't care about, but conversely you shouldn't use it when you want to see output -- or, as can easily be true, you are not completely confident of your code and want a record of what happens while debugging. Confidence can easily be misplaced. I can think I know what I am doing and use quietly, find my code must be wrong, remove the quietly and find and fix the dopey errors, and then reinstate quietly.

                #7 Romalpa Akzo Thanks for the appreciation. Romalpa is correct in that the rcount() function for egen can be found in egenmore on SSC, as already mentioned by Joro Kolev in #4. What to say? I wrote it in 2001, but I don't use it any more. It's, in the first instance, just a personal style choice. Often in Stataland there is a choice between

                1. A direct community-contributed command that is designed for the purpose.

                2. A combination of official commands that will get you there in a few steps without installing anything.

                Naturally, all commands of type 1 are written using official commands, indirectly if not directly.

                Experienced Stata users like Joro and Romalpa can encounter a community-contributed command, like it and use it, and that's good: that's why such commands were written.

                Other experienced Stata users (@Clyde Schechter is one) make relatively little use of community-contributed commands and prefer generally to use standard commands. Their experience is built on the number of problems where they can see immediately "You need to do this, and then that, and then this other thing, and so forth" and write it out step by step.

                It's about more than style choices:

                1. The more users know standard techniques such as initialising a variable and then looping to modify its contents, the better placed they will be when there is no community-contributed command (that anyone remembers) to solve their next problem.

                2. Some users are behind firewalls so thick that they can't install community-contributed commands. easily or at all.

                3. Some users are in a network of collaborators, including perhaps inexperienced users of Stata. It's a lot simpler if scripts that are shared minimise the number of community-contributed commands that must be installed.

                #2 and #3 are often combined when users are working with confidential data, e.g. medical data on people.

                Last edited by Nick Cox; 22 Jun 2021, 01:23.

                Comment

                Working...
                X