Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Dynamically generate variable names in forvals loop

    I have a dataset of patient stays in hospital rooms. The data contains each patient's id, room location, and start and end datetimes for their stay in those rooms. I am trying to build a for loop to identify periods of time where two patients were in the same room at the same time. I have built a for loop that I was hoping would dynamically generate 3 new columns for a patient with roommates:

    Code:
    sort location start_datetime end_datetime
    
    forvalues x = 1/3 {
                  by location: gen double r_start_datetime_`x' = start_datetime[_n-`x'] if id != id[_n-`x']
                  by location: gen double r_end_datetime_`x' = end_datetime[_n-`x']
                  by location: gen long r_id_`x' = id[_n-`_x'] if start_datetime <= r_end_datetime_`x' & r_start_datetime_`x' >= start_datetime
    }
    For now I am just using x = 1/3 to test out the loop on smaller groups of rows. Running this for loop generates r_start_datetime_1 and r_end_datetime_1, but they are just the start_ and end_datetimes of the previous rows. No r_ids are generated and I get an error that reads: "_n- invalid name". I assume this error comes from the _`x' syntax of naming the variables, but I'm not sure.

    And here is a sample of the dataset

    id location start_datetime end_datetime
    42989001 106^0615 14jan2009 21:15:03 16jan2009 09:53:00
    42935002 106^0615 04feb2009 17:49:31 09feb2009 10:02:53
    43463003 106^0615 23jun2009 18:23:22 25jun2009 07:26:12
    43941004 106^0615 13nov2009 16:23:00 18nov2009 11:21:59
    44203005 106^0615 27mar2010 09:05:39 28mar2010 14:59:31
    44486006 106^0615 26apr2010 18:08:21 27apr2010 14:58:19
    44443007 106^0615 04may2010 18:21:26 06may2010 09:12:10
    44569008 106^0615 25may2010 18:01:47 27may2010 09:32:38
    44540009 106^0615 18jun2010 17:20:35 24jun2010 17:09:59
    44462010 106^0615 29jun2010 19:19:43 01jul2010 14:29:12
    Last edited by Julien Weinstein; 31 May 2023, 10:56.

  • #2
    I recommend a different approach. There is no need for any loops here. You pair up each observation with all other observations mentioning the same location. Then eliminate those that match an original observation with itself, and then those where the second observation starts before the first one. Then you just have to identify which of the remaining pairs have time overlaps and keep those:
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input long id str8 location double(start_datetime end_datetime)
    42989001 "106^0615" 1547586903000 1547718780000
    42935002 "106^0615" 1549388971000 1549792973000
    43463003 "106^0615" 1561400602000 1561533972000
    43941004 "106^0615" 1573748580000 1574162519000
    44203005 "106^0615" 1585299939000 1585407571000
    44486006 "106^0615" 1587924501000 1587999499000
    44443007 "106^0615" 1588616486000 1588756330000
    44569008 "106^0615" 1590429707000 1590571958000
    44540009 "106^0615" 1592500835000 1593018599000
    44462010 "106^0615" 1593458383000 1593613752000
    end
    format %tc start_datetime
    format %tc end_datetime
    
    //    CREATE A DATA SET PAIRING ALL OBSERVATIONS PERTAINING TO THE SAME LOCATION
    preserve
    ds location, not
    local other_vars `r(varlist)'
    rename (`other_vars') =_2
    tempfile copy
    save `copy'
    
    restore
    rename (`other_vars') =_1
    joinby location using `copy'
    drop if id_1 == id_2 & start_datetime_1 == start_datetime_2 // EXCLUDE SELF PAIRS
    drop if start_datetime_1 > start_datetime_2 // KEEP PAIRS WITH ADMIT 1 BEFORE ADMIT 2
    
    //    KEEP THOSE OBSERVATIONS WHERE THE OCCUPANCY PERIODS OVERLAP
    keep if min(end_datetime_1, end_datetime_2) >= start_datetime_2
    Note: The above code assumes that your start and end datetime variables are actually Stata internal format clock variables. If they are not, you will have to convert them to that first so that chronological ordering is possible.

    I also note that in your example data there aren't any roommates. It would have been better to show an example that contained some, so that I could have verified that the code picks those up.

    In the future, when showing data examples, please use the -dataex- command to do so, as I have done in this response. If you are running version 18, 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

    Added: If your original data set is very large, the -joinby- command may fail because the resulting data set of pairs is too large to be accommodated in memory. In that case, I would recommend breaking up your original data set into chunks, each chunk consisting of all observations pertaining to some subset of the values of the location variables, and each chunk small enough that all pairs of same-location observations will fit in memory at one time. Then find the roommates in the chunk and save the results. Finally, append the saved results together.
    Last edited by Clyde Schechter; 31 May 2023, 11:24.

    Comment


    • #3
      Thank you, Clyde. When I get to rename(`other_vars') =_2, I get a syntax error for rename. I should also note that my full data set has about 420,000 rows, so I'm not sure if that would be an issue for the joinby.

      Additionally, is this possible to accomplish with for loops or any other method? I am fairly new to stata and so it would be nice if I could use code that is at a sort of beginner level. For instance, I'm not very familiar with local macros, and how that is being leveraged here. And does this account for the possibility of one patient having multiple roommates throughout their stays in a room?

      In natural language, the code you provided is:
      1. Saving the initial data
      2. Subsetting the data on id, start_datetime, and end_datetime.
      3. Generating a local macro to hold these variables
      4. Adding an "_2" to the ends of these variable names?
      5. Storing a copy of this updated data

      6. Restoring the initial data
      7. Adding an "_1" to id, start_datetime, and end_datetime.
      8. Joining the initial data with the updated data with _2's on the varlist based on the "location" column
      9. Dropping all rows where id_1 matches id_2 and start_datetime_1 matches start_datetime_2
      10. Dropping rows where start_datetime_1 is after start_datetime_2

      11. Keeping overlap periods

      Is this correct? Also, with the "_2", doesn't that only allow us to account for up to two roommates per patient? What if a patient has more than two roommates? Also, how does this code compare rows to other rows? Would that be with the joinby on location?
      Last edited by Julien Weinstein; 31 May 2023, 12:04.

      Comment


      • #4
        When I get to rename(`other_vars') =_2, I get a syntax error for rename.
        There is nothing wrong with that syntax, and it runs on my setup. I can think of two possible reasons you are encountering a problem:
        1. You are using an archaic version of Stata that does not support renameing groups of variables in one command. If so, please indicate what version of Stata you are using.
        2. You are running the code one line at a time. Because the code contains local macros, which you say you are unfamiliar with, you cannot do that. Local macros, by their very nature, only survive in one program. And if you run the code one line at a time, each line is understood by Stata as a program. So the local macro gets define and then immediately disappears before you can use it. You must run the code in one fell swoop from beginning to end. (Well, from the place where a local macro is defined until its final use.)
        In natural language, the code you provided is:
        1. Saving the initial data
        2. Subsetting the data on id, start_datetime, and end_datetime.
        3. Generating a local macro to hold these variables
        4. Adding an "_2" to the ends of these variable names?
        5. Storing a copy of this updated data

        6. Restoring the initial data
        7. Adding an "_1" to id, start_datetime, and end_datetime.
        8. Joining the initial data with the updated data with _2's on the varlist based on the "location" column
        9. Dropping all rows where id_1 matches id_2 and start_datetime_1 matches start_datetime_2
        10. Dropping rows where start_datetime_1 is after start_datetime_2

        11. Keeping overlap periods

        Is this correct?
        1 is correct. 2 is not; there is no subsetting of the data involved in this code. The -ds- command and the -local- command following it just create a list of the names of variables other than location. This list is subsequently used to support renaming those variables into a _1 series and a _2 series. 4-11 are correct.

        Also, with the "_2", doesn't that only allow us to account for up to two roommates per patient? What if a patient has more than two roommates?
        If there are more than two roommates, they will appear in separate observations. The end-result of this code will be a data set in which each observation contains information about a pair of people who were in the same location at the same time. If a particular person on a particular admission had, say, 5 roommates over the course of the admission, there will be 5 observations in the data set to correspond to it. Note that the code does not produce separate observations for the id1-id2 pairing and the id2-id1 pairing. The pairs it produces are ordered, with the id1 person being the one who occupies the location first. So in some of the 5 observations, the person will appear as id1 (the ones where he/she is the prior occupant) and in others as id2 (the ones where he/she is the subsequent occupant).

        This is known as long data layout. For most purposes in Stata long data layout works better, usually much better, than wide data layout (one observation per patient with separate variables for each roommate). If you are going to be doing something in Stata that actually does work better with wide layout, there is a -reshape- command that can transform the data to wide. But since so little in Stata works well with wide data, I am leaving the data long.

        Also, how does this code compare rows to other rows? Would that be with the joinby on location?
        Yes.

        I should also note that my full data set has about 420,000 rows, so I'm not sure if that would be an issue for the joinby.
        Well, it may be a problem. But it also depends on how many locations there are. I suggest you try it. If -joinby- runs out of memory, post back and I'll show you modified code that will work around this problem. (Or, you can just break the data set into chunks and process them separately, just as I recommended in #2.)

        Additionally, is this possible to accomplish with for loops or any other method?
        Probably it can be done this way. But the code would be so complicated, opaque, and unmaintainable that it would be programming malpractice to do it that way. It would probably also take me several hours to develop, and I don't have that much time available today.

        Added: I can certainly understand the inclination to try to use familiar approaches that you have perhaps mastered in other programming languages. But the Stata programming language, notwithstanding a superficial resemblance to C, is really quite different and based on a different conceptual model of data management. While the language does include looping commands, their usefulness in Stata is primarily limited to loops that count the number of iterations, loops that iterate over names of variables, and, sometimes, loops that iterate over the values of a variable or set of variables. The last of those is often unwieldy and slow and sometimes better alternatives are available. Looping over observations is hardly ever the right approach in Stata; better alternatives are almost always available.

        To become an effective, proficient Stata user, you need to embrace some new ways of thinking about data and programming.

        Added: Regarding your data set having 420,000 observations, the number of locations would have to be very large, probably unrealistically large for hospital-type data, in order for the code in #2 not to blow-up memory at the -joinby-. So here's a very simple modification that automates the process of chunking the data: it processes a single location at a time and appends the results together. It does this by wrapping the code from #2 (with a minor modification, replacing -joinby- with -cross- since there is only one location in play at any time now) in a program and iterating the code over the levels of the location variable.

        Code:
        //    CREATE A DATA SET PAIRING ALL OBSERVATIONS PERTAINING TO THE SAME LOCATION
        capture program drop one_location
        program define one_location
            preserve
            ds location, not
            local other_vars `r(varlist)'
            rename (`other_vars') =_2
            tempfile copy
            save `copy'
        
            restore
            rename (`other_vars') =_1
            cross using `copy'
            drop if id_1 == id_2 & start_datetime_1 == start_datetime_2 // EXCLUDE SELF PAIRS
            drop if start_datetime_1 > start_datetime_2 // KEEP PAIRS WITH ADMIT 1 BEFORE ADMIT 2
        
            //    KEEP THOSE OBSERVATIONS WHERE THE OCCUPANCY PERIODS OVERLAP
            keep if min(end_datetime_1, end_datetime_2) >= start_datetime_2
           exit
        end
        
        runby one_location, by(location)
        -runby is written by Robert Picard and me; it is available from SSC.
        Last edited by Clyde Schechter; 31 May 2023, 12:49.

        Comment


        • #5
          Clyde, thanks again for your detailed reply. I really appreciate it.

          please indicate what version of Stata you are using
          I am running Stata/MP 17.0.

          You must run the code in one fell swoop from beginning to end. (Well, from the place where a local macro is defined until its final use.)
          I tried running the whole thing in one go and got through the rename (`other_vars')=_2 without errors, but once it got to rename (`others_vars') =_1 it threw the same rename syntax error. I should mention I am running a bunch of code to clean up the original data set prior to running the code you provided, so I'm not sure if this could cause any issues with the local macro. If you'd like I can share that code here.

          The end-result of this code will be a data set in which each observation contains information about a pair of people who were in the same location at the same time.
          Ah, okay. I understand. So the _1 and _2 are used to compare pairs of observations of different patients in the same room during overlapping time periods, and if this is the case then a new row is generated for each patient that meets this criteria (long format), as opposed to new columns (wide format) for each roommate?

          Comment


          • #6
            I tried running the whole thing in one go and got through the rename (`other_vars')=_2 without errors, but once it got to rename (`others_vars') =_1 it threw the same rename syntax error. I should mention I am running a bunch of code to clean up the original data set prior to running the code you provided, so I'm not sure if this could cause any issues with the local macro. If you'd like I can share that code here.
            If the "clean up" code falls between the definition of local macro other_vars and its subsequent use, then, yes, you should show it; it may be interfering somehow. If not, then I think you should scrutinize just the -rename- command itself to make sure it is typographically exactly correct, because that syntax is correct. I have run this code on my own setup and it does not throw error messages.

            Comment


            • #7
              Ah, you were right. It was a typographical error. I had a space between = and _1 in my .do file. Thank you for updating your solution with a chunking implementation. I did indeed run into the following error in testing out the original code at the joinby:
              Code:
              op. sys. refuses to provide memory
              I installed runby and will see how it goes. Thanks again for your help today!

              Comment

              Working...
              X