Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • arrays in stata?

    hi all,

    I m wondering if there is a way in stata to use arrays ( like in sas) to search through multiple icd codes.

    the data i have is in this format: (each line is a hospitalization)
    id date diag1 diag2 diag3 diag4 diag5
    1 20jan2018 t40.0
    1 16oct2018 t40.6
    2 04apr2018 t40.3
    each hospitalization (each line) has the date of the event and 5 diag codes( icd codes)

    i would like to identify all of the diagnosis codes that are t40.0-t40.6 such that if present, then i can create a variable pat_diagnosis=1, if absent then pat_diagnosis=0

    with sas we set up an array , but with stata, i can only think of using for loops like this:

    gen pat_diagnosis=0
    forv i=1/5{
    replace pat_diagnosis`i'=1 if diag`i'>= "t40.0" & diag`i'<="T40.6"
    }

    any thoughts on a better way than this?

    thanks in advance
    vishal

  • #2
    https://www.stata.com/support/faqs/d...nt-sas-arrays/

    In this particular case, your loop is confused. I think you mean

    Code:
    gen pat_diagnosis = 0
    forv i=1/5 {
        replace pat_diagnosis = 1  if diag`i'>= "t40.0" & diag`i'<="t40.6"
    }
    Last edited by Nick Cox; 02 Apr 2019, 11:10.

    Comment


    • #3
      Vishal,

      Quick note that Stata has two specialized commands to clean ICD-9 and ICD-10 codes, or to generate a flag if a variable contains a code in a certain range. The range feature is a nice convenience command for your purpose. You could simply do:

      Code:
      forv i=1/5{
      icd10 generate pat`v' = diag`v', range(T400/T406)
      }
      egen pat_flag = rowmax(pat?)
      drop pat?
      I don't believe there's a direct equivalent of SAS arrays. If you needed to speed the above up, it might run faster if you reshaped the data to long (Stata command is reshape long, SAS equivalent is PROC TRANSPOSE, I believe).
      Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

      When presenting code or results, please use the code delimiters format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.

      Comment


      • #4
        The original loop as modified by Nick is fine and looks like the most efficient solution.
        If you are looking for something different, you can think along the lines of:
        Code:
        egen diag = concat(diag*), punct("|")
        generate flag = strpos(diag,"t40.")>0
        Clearly you can adjust the expression for flag to focus only on some diagnoses.

        Best, Sergiy
        Last edited by Sergiy Radyakin; 02 Apr 2019, 12:10.

        Comment


        • #5
          Originally posted by Nick Cox View Post
          https://www.stata.com/support/faqs/d...nt-sas-arrays/

          In this particular case, your loop is confused. I think you mean

          Code:
          gen pat_diagnosis = 0
          forv i=1/5 {
          replace pat_diagnosis = 1 if diag`i'>= "t40.0" & diag`i'<="t40.6"
          }
          What of you you have 3 or more diagnosis codes, like D45, D473 , D474. How will you create a loop to extract.
          Thank you very much!

          Comment


          • #6
            I have had similar issue and Nick's code was extremely helpful, but I got stuck beyond having 3 or more codes.
            Any help is extremely appreciated.

            Comment


            • #7
              This exmple gives me error:

              gen pat_diag = 0
              forv i=1/30 {
              replace pat_diag = 1 if I10_DX`i'<="D45" & I10_DX`i'>= "D473"& I10_DX`i'<="D474"
              }

              Comment


              • #8
                I can't see a syntax error in #7. I note that

                Code:
                DX`i' <= "D45"
                is generously inclusive, so that anything beginning "A" "B" "C" "D1" "D2" "D3" and some things beginning "D4" would fit. What was the error message?

                Comment


                • #9
                  Thanks Nick.
                  Sorry, I was not specific enough. I meant, when I ran above code, and tabulate results, I get
                  pat_diagnosis | Freq. Percent Cum.
                  ------------+-----------------------------------
                  0 | 7,135,090 100.00 100.00
                  ------------+-----------------------------------
                  Total | 7,135,090 100.00

                  Comment


                  • #10
                    unlike Nick, I do see a problem with the code in #7 - you need parentheses after the ampersand (and at the end)

                    Comment


                    • #11
                      Sorry; I missed something more fundamental -- as did my friend Rich Goldstein, It's impossible for a string to be BOTH earlier than or equal to D45 in alphanumeric sort order AND between D473 and D474 . Hence the syntax is legal, but no observations will satisfy the mutually contradictory criteria, as you found, You need some different mix of operators.

                      Comment


                      • #12
                        Indeed, I am interested in adding D45, D473 and D474. I modify the code as below and seems to work

                        gen pat_diag = 0
                        forv i=1/30 {
                        replace pat_diag = 1 if I10_DX`i' =="D45" |I10_DX`i'=="D473" | I10_DX`i'=="D474"


                        pat_diag | Freq. Percent Cum.
                        ------------+-----------------------------------
                        0 | 7,104,464 99.57 99.57
                        1 | 30,626 0.43 100.00
                        ------------+-----------------------------------
                        Total | 7,135,090 100.00

                        Based on Nick's code, I am confident this result is correct.

                        My question now is, how to get separate group of patients with D45, D473, and D474.

                        This what I did:
                        gen pv=0

                        . forv i=1/30 {
                        2. replace pv = 1 if I10_DX`i' =="D45"

                        3. }

                        ta pv

                        pv | Freq. Percent Cum.
                        ------------+-----------------------------------
                        0 | 7,130,991 99.94 99.94
                        1 | 4,099 0.06 100.00
                        ------------+-----------------------------------
                        Total | 7,135,090 100.00



                        gen et=0

                        . forv i=1/30 {
                        2. replace et = 1 if I10_DX`i' =="D473"
                        3. }



                        . ta et

                        et | Freq. Percent Cum.
                        ------------+-----------------------------------
                        0 | 7,108,341 99.63 99.63
                        1 | 26,749 0.37 100.00
                        ------------+-----------------------------------
                        Total | 7,135,090 100.00



                        gen mf=0

                        . forv i=1/30 {
                        2. replace mf = 1 if I10_DX`i' =="D474"
                        3. }

                        ta mf

                        mf | Freq. Percent Cum.
                        ------------+-----------------------------------
                        0 | 7,134,964 100.00 100.00
                        1 | 126 0.00 100.00
                        ------------+-----------------------------------
                        Total | 7,135,090 100.00


                        Now I am intersted in percentages of each diagnosis:



                        I am not 100% confident this method I used to eventually get these percents is correct.

                        [ just to remind that, I looking for these 3 diagnosis: polycythemia, essential thrombocythemia, and myelofibrosis, from Diagnosis( I10_DX1 to I10_DX30). I want to capture all those diagnosis in the database. I then want to separate them into individual diagnoss with their frequencies]

                        I will need some reassurance or otherwise.

                        Thank you.



                        Comment


                        • #13
                          i meant *frequncies

                          Comment


                          • #14
                            I am wondering if someone can help me to sort out this array, where code is CD10, dates means date of this code retrieved from central registry, there are 70 codes var, and corresponding dates, in each code var there are hundreds of CD10 codes
                            code1 code2 code3 date1 date2 date3
                            c25 01/01/2015
                            c25 01/02/2016
                            c25 02/02/2017
                            Thanks
                            zaed

                            Comment


                            • #15
                              thanks all!

                              Comment

                              Working...
                              X