Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to match diagnoses with diagnosis dates when there are several of each variable?

    Hello there,

    This is my first post but I am a long time reader of this forum, so thank you for all the help.

    I have a dataset with several hundred thousand observations (participants) with in-patient diagnoses variables (ICD codes) and diagnoses dates. For example, there are 243 diagnoses variables (s_41270_0_0 to s_41270_0_242), so each participant can have up to 243 diagnoses. There are also 243 diagnoses dates that correspond to the 243 diagnoses (ts_41262_0_0-ts_41262_0_242).

    I would like to create a variable for each participant that identifies if they have had a certain diagnosis within a certain range of dates. The diagnosis is COVID-19, which has two ICD codes U071 or U072. I have been able to create a variable that identifies whether or not they have had a COVID-19 diagnosis:

    Code:
    generate COVID19hosp1=0
    
    foreach var of varlist s_41270_0_0-s_41270_0_242 {
        replace COVID19hosp1=1 if `var'=="U071" | `var'=="U072" 
    }
    tab COVID19hosp1
    However, my problem is I don't know how to match the dates with each diagnosis to specify if their COVID-19 diagnosis was within a certain date range. I hope this makes sense. I cannot share the data because it is confidential but I am happy to provide more info for clairfication.

    Thank you,
    Andrew

  • #2
    My answer presumes each of your ts* dates is stored as a Stata date variable. If not, you'll need to convert them. I'd note that, even though your data is confidential, you could have offered a small amount of fake data as I did below, which I wouldn't think would violate your confidentiality restrictions, and which is what the FAQ recommends one do with confidential data.

    I've guessed about your data layout, so the following might not work for you:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str4(s_41270_0_0 s_41270_0_1 s_41270_0_2) float(ts_41270_0_0 ts_41270_0_1 ts_41270_0_2)
    "UO72" "A484" "B447" 22455 22226 22507
    "D123" "A503" "B836" 22374 22335 22100
    ""     "U071" "F999" 22196 21598 21663
    end
    format t* %td
    //
    local date1 = daily("11/12/2018", "MDY")  // first date of interest to you
    local date2 = daily("5/30/2019", "MDY")  // last date of interest to you
    gen byte wanted = 0
    forval i = 0/2 {   //
      quiet replace wanted = inlist(s_41270_0_`i', "U071", "U072") & ///
                             inrange(ts_41270_0_`i', `date1', `date2') ///
                             if (wanted == 0)
    }

    Comment


    • #3
      Mike Lacy's solution in #2 is good. But I have a suggestion. If this is a data set that you will be using often for queries like this, the wide layout of the data is a cumbersome impediment to your work. If you -reshape- the data to long layout, you can then do this kind of thing with simpler code, and it will also execute more efficiently. You will pay a one-time price in that -reshape-ing a data set this size is slow. But once you do that and save the data set that way, you will gain considerable advantage, both for this kind of operation and for most data management and analysis problems.

      The -reshape-ing is:
      Code:
      gen long obs_no = _n
      reshape long s_41270_0_ ts_41270_0, i(obs_no) j(seq)
      Do that once, and save the resulting data set to use in further work. Note that the first command is not needed if, as I suspect, your data set contains a patient ID variable. If so, skip that command, and replace obs_no by the name of the patient ID variable in the -i()- option of the -reshape- command.

      Once you have the data in long format, the code for problems like the one in #1 is similar to that in #2, but a little more compact:
      Code:
      local date1 = daily("11/12/2018", "MDY")  // first date of interest to you
      local date2 = daily("5/30/2019", "MDY")  // last date of interest to you
      by obs_no (seq), sort: egen wanted = max(inlist(s_41270_0_, "U071", "U072") ///
                                      & inrange(ts_41270_0, `date1', `date2'))
      By the way, you can get the -reshape- to long done faster if you use one of the community-contributed programs that have been written to address the slowness of StataCorp's -reshape- command. There is -tolong-, by Rafal Raciborski, available from SSC. And there is -greshape-, which is part of Mauricio Caceres Bravo's -gtools- package, which can be obtained at github.com/mcaceresb/stata-gtools.

      Comment


      • #4
        Mike,

        Thank you very much for the response, it makes sense and is clear. I'll try running that code tomorrow. You were correct in presuming that each of my ts* dates is stored as a Stata date variable and about the layout of the data. I'll be sure to give an example using fake data in a future post.

        Clyde,

        Thanks to you too. I'll also try your solution of reshaping the data to long as I will have to do different (but similar) operations on other hospitalization and death variables. It is a fairly large dataset and my computer is not the quickest, so the community-contributed programs might come in handy.

        Thanks again, this is very helpful!
        Andrew

        Comment

        Working...
        X