Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Counting number of specific observations in a group of variables by unique id

    Hi all- first post. I'm using Stata 15.1 on a Mac. I'm working with health claims data with the goal to produce a patient's multi-morbidity count using 1 or 2 generated variables. I've attached toy data below to help.

    claim = claim number
    id = patient id
    dx1 - dx4 = medical diagnosis

    For each patient id, I want Stata to scan variables dx1 dx2 dx3 dx4 across all of the patient's claims and return a count of the number of specific diagnoses that appear at least once.

    For example, let's say diagnoses A, B, and C are of interest. Patient 1 has been given an A, B, and C diagnosis at least once, so I would want the command to return the value of "3" for patient 1. Patient 2 has been given the diagnosis of A and B at least once, but not C, so I would want the command to return the value of "2".

    I've looked through a number of egen possibilities, including tag, but none of them seem to help. Any suggestions?


    Code:
    clear
    input str2 claim float id str1(dx1 dx2 dx3 dx4)
    "1"  1 "C" "T" "E" "B"
    "2"  1 "C" "U" "E" "B"
    "3"  1 "C" "F" "E" ""
    "4"  1 "B" "A" "E" ""
    "5"  1 "B" "A" ""  ""
    "6"  1 "C" "E" ""  ""
    "7"  1 "D" ""  ""  ""
    "8"  1 "D" ""  ""  ""
    "9"  2 "D" "B" "T" "E"
    "10" 2 "D" "B" "T" "E"
    "11" 2 "D" "T" "E" ""
    "12" 2 "A" "T" ""  ""
    "13" 2 "B" "T" ""  ""
    "14" 2 "B" ""  ""  ""
    "15" 2 "B" ""  ""  ""
    end

  • #2
    Since the diagnosis columns are not sorted in any order, I think the easiest would be to do this:

    Code:
    /* claim level */
    
    generate _temp_a=1 if dx1=="A" | dx2=="A" | dx3=="A" | dx4=="A"
    generate _temp_b=1 if dx1=="B" | dx2=="B" | dx3=="B" | dx4=="B"
    generate _temp_c=1 if dx1=="C" | dx2=="C" | dx3=="C" | dx4=="C"
    
    /* patient level */
    
    foreach var in _temp_a _temp_b _temp_c {
    egen `var'_patient=sum(`var'), by(id)
    replace `var'_patient=1 if `var'_patient>1 & `var'_patient!=.
    }
    
    /* final categories */
    
    generate dxcategory=3 if _temp_a_patient==1 & _temp_b_patient==1 & _temp_c_patient==1
    replace dxcategory=2 if _temp_a_patient==1 & _temp_b_patient==1 & _temp_c_patient!=1
    Last edited by Jenny Williams; 10 Oct 2019, 12:55.

    Comment


    • #3
      Thanks Jenny - that works. Since I have 28 different diagnoses that can be considered in the multi-morbidity total, I wonder if your final line of code could simplify to:
      Code:
      egen dxcategory = rowtotal(_has_a_patient _has_b_patient _has_c_patient)
      It seems to give the same result.

      Comment


      • #4
        Absolutely.

        One side caution is that using "egen rowtotal" will allow missing values in the count, whereas using "dxcategory=_temp_a_patient + _temp_b_patient + _temp_c_patient" will only count rows where the all of the variables are non-missing.

        You can also save some space in the first line of code (if the codes for each condition are simple letters or coded numeric values):
        Code:
        foreach condition in A B C D E F Z {
        generate _temp_`condition'=1 if dx1=="`condition'" | dx2=="`condition'" | dx3=="`condition'" | dx4=="`condition'" | dx5=="`condition'" | dx6=="`condition'" /*...repeating all the way to dx28*/
        }
        Last edited by Jenny Williams; 10 Oct 2019, 14:03.

        Comment


        • #5
          There is a small error in Jenny's code.
          Code:
          replace `var'_patient=1 if `var'_patient>1 & `var'_patient!=.
          should be
          Code:
          replace `var'_patient=1 if `var'_patient>0 & `var'_patient!=.
          You could also do it like this. It's based on the same idea but the code is a bit simpler.
          Code:
          gen dx_all = dx1 + dx2 + dx3 + dx4
          
          gen count = 0
          foreach n in A B C {
              bysort id: egen `n' = max(strpos(dx_all, "`n'")>0)
              replace count = count + `n'
              drop `n'
              }

          Comment


          • #6
            There is a dedicated egen function to do this in egenmore (SSC). Written up in my rowwise column in the Stata Journal.

            Comment


            • #7
              Jenny, Wouter, and Nick: Thank you all for your input, I think I have enough to move forward with my work and I'm very pleased wit hthe result of my first Statalist post. A few last comments if I may:

              Wouter: I think Jenny had it right to begin with using:
              Code:
               
               replace `var'_patient=1 if `var'_patient>1 & `var'_patient!=.
              This is because the preceding line of code creates variable that return either a 1, an integer >1, or a 0.
              Since, in this context, integers >1 have the same meaning as true 1's, the code only needs to replace numbers >1.

              Jenny:
              Code:
               
               egen dxcategory = rowtotal(_temp_a_patient _temp_b_patient _temp_c_patient)
              Rowtotal works because the "patient level" code that generated the column totals for each patient returns a "0" when it totals a column containing all missing values. In other words, temp_a_patient _temp_b_patient _temp_c_patient do not contain missing values, even if _temp_a _temp_b _temp_c do. At least that's how it's executing on my end. Nick: Thanks for the reference!

              Comment


              • #8
                Wouter: I think Jenny had it right to begin with
                This is true indeed, I stand corrected.

                Glad you found a solution.

                Comment


                • #9
                  Sorry: #6 is not pertinent. I misread the problem badly,

                  To compensate, here is another way to think about it. You might want to pepper the code with list commands if it's not clear what is happening.


                  Code:
                  clear
                  input str2 claim float id str1(dx1 dx2 dx3 dx4)
                  "1"  1 "C" "T" "E" "B"
                  "2"  1 "C" "U" "E" "B"
                  "3"  1 "C" "F" "E" ""
                  "4"  1 "B" "A" "E" ""
                  "5"  1 "B" "A" ""  ""
                  "6"  1 "C" "E" ""  ""
                  "7"  1 "D" ""  ""  ""
                  "8"  1 "D" ""  ""  ""
                  "9"  2 "D" "B" "T" "E"
                  "10" 2 "D" "B" "T" "E"
                  "11" 2 "D" "T" "E" ""
                  "12" 2 "A" "T" ""  ""
                  "13" 2 "B" "T" ""  ""
                  "14" 2 "B" ""  ""  ""
                  "15" 2 "B" ""  ""  ""
                  end
                  
                  save dx 
                  
                  reshape long dx, i(id claim) j(which)
                  drop if !inlist(dx, "A", "B", "C")
                  bysort id dx : drop if _n > 1 
                  by id: gen wanted = _N 
                  collapse wanted, by(id)
                  
                  
                  merge 1:m id  using dx 
                  
                  tabdisp id, c(wanted)

                  Comment


                  • #10
                    Thanks Nick -

                    Also, I also found a good solution at, of all places, a Stata FAQ:
                    HTML Code:
                    https://www.stata.com/support/faqs/data-management/check-for-range-of-icd-codes/
                    Looks like this is the exact code I need, with a few minor modifications, and looks a lot like Jenny's. The icd9 command in Stata is very helpful for this kind of work!


                    Comment


                    • #11
                      I wouldn't have thought of that!

                      Comment

                      Working...
                      X