Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Loops, if and date

    Hi!

    Im currently working on a big data set (UK Biobank) with roughly 500 000 individuals. I have made a training set with 1000 individuals that I'm currently writing the code in. I want to know if one variable at baseline is associated with a specific disease over time. That disease in question has different codes, e.g I48, and I481-489. Linked to every diagnose is a column with a date (date for the diagnose). The problem is that there is 242 columns, whereof most has no data in it.

    Im only interested in one of the diagnose and columns above, the one with the earliest date (since it is time to disease that are interesting).

    The diagnose is in one of these columns: s_41270_0_0 - s_41270_0_242 and the date in this ts_41280_0_0 - ts_41280_0_242. They are linked, so the diagnose in s_41270_0_0 was given on the date that are shown in ts_41280_0_0 etc.

    I have made a loop for s_41270_0_0, so that everyone that has either of those diagnose codes mentioned above are coded 1 for the diagnose (afib):

    foreach var of varlist s_41270_0_0 - s_41270_0_242 {
    recode afib (0=1) if (`var'=="I48" | `var'=="I480" | `var'=="I481" ///
    | `var'=="I482" | `var'=="I483" | `var'=="I484" | `var'=="I485" ///
    | `var'=="I486" | `var'=="I487" | `var'=="I488" | `var'=="I489")

    }

    Due to the large dataset I need a loop. What I need is to drop those who got the diagnose (41270-column) with afib=1 before a certain date (41280-column). What I think is tricky is that there are several diagnoses and also dates. And only in the column with the earliest date with one of those diagnoses are interesting. I hope I explained so you could understand. English isn't my first launguage.

    Is it possible to make a loop for this in STATA?

  • #2
    You are fortunate in the naming convention used for your variables. We can loop over the suffix 0-242 rather than over variable names. Perhaps this untested code starts you in a useful direction. I don't fully understand what you seek, so this just finds any patient who received an afib diagnosis as an initial diagnosis on the earliest date.
    Code:
    generate afib = 0
    generate earliest = rowmin(ts_41280_0_*)
    forvalues i=0/242 {
        local var =   s_41270_0_`i'
        local time = ts_41280_0_`i'
        replace afib==1 if `time'==earliest & ///
             (`var'=="I48"  | `var'=="I480" | `var'=="I481" ///
            | `var'=="I482" | `var'=="I483" | `var'=="I484" | `var'=="I485" ///
            | `var'=="I486" | `var'=="I487" | `var'=="I488" | `var'=="I489")
    }

    Comment


    • #3
      I note that the condition

      Code:
      substr(`var', 1, 3) == "I48"
      matches all the cases specified, but whether it matches anything you don't want I can't tell.

      Comment


      • #4
        I would do this differently in several ways. This wide layout of the data makes what is actually a pretty simple problem complicated. And it is likely to remain that way with everything else you do it. So the first thing I would do is change this to a long layout. Then I would rename the variables to something that imposes less cognitive burden. Finally I would use the wonderful -icd10- command to do the heavy lifting on the matches. You didn't show example data, so this code is just an outline of how to proceed and you will need to adapt it. Notice, by the way, that when done this way, the code is completely transparent, and no loops are required.

        Code:
        reshape long s_41270_0_ ts_41280_0_, i(person_id) j(seq)
        rename s_41270_0_ dx_code
        rename ts_41280_0_ dx_date
        
        icd10 generate afib = dx_code, range(I48*)
        
        by person_id (dx_date), sort: egen earliest_afib = min(cond(afib, dx_date, .))
        
        drop if earliest_afib < mdy(1, 31, 2010) // OR WHATEVER YOUR "CERTAIN DATE" IS
        Evidently, replace -mdy(1,31,2010)- by the actual "certain date" you have in mind.

        I recommend saving your data in this long layout and using this, in preference to the original. It will simplify nearly anything you do in Stata. There are only a small number of Stata commands that work better with wide data layouts.

        Note: If your data set is very large, the -reshape- may be quite slow. You can speed that up by using -greshape- or -tolong-, which are user-written commands that are much faster than official Stata's -reshape-. -tolong- is available from SSC. -greshape- is part of the -gtools- suite at github.com/mcaceresb/stata-gtools.


        .
        Last edited by Clyde Schechter; 15 Feb 2023, 14:48.

        Comment


        • #5
          Re #3. Yes, as it happens, I48* matches all and only the ICD10 codes for atrial fibrillation.

          Comment


          • #6
            Thank you all for the swift replies. Clyde, I tried with the reshape and tolong and the first I had to break and the other got much more observations. I don't know if I did correct. But I think I going with the data as it is, if possible.

            William, I tried the solution with rowmin but that doesn't work for me, see below (highlight in red). Maybe you can see where the obvious mistake is if using this method forward.

            greetings Andreas

            gen afib=0

            foreach var of varlist s_41270_0_* {
            replace `var'="I48" if (`var'=="I480" | `var'=="I481" ///
            | `var'=="I482" | `var'=="I483" | `var'=="I484" | `var'=="I485" ///
            | `var'=="I486" | `var'=="I487" | `var'=="I488" | `var'=="I489")

            }

            foreach var of varlist s_41270_0_0-* {
            recode afib (0=1) if `var'=="I48"

            }


            So far so good. The problem is the next step:

            generate earliest = rowmin(ts_41280_0_*)
            forvalues i=0/242 {
            local var = s_41270_0_`i'
            local time = ts_41280_0_`i'
            replace afib==1 if `time'==earliest & `var'=="I48"
            }

            . generate earliest = rowmin(ts_41280_0_*)
            unknown function rowmin()
            r(133);

            end of do-file

            Comment


            • #7
              rowmin() is an egen function and only works with egen. So replace generate with egen.

              Also my guess is that you really need

              Code:
              local var s_41270_0_`i'
              local time ts_41280_0_`i'
              replace afib==1 if `time'==earliest & substr(`var', 1, 3) =="I48"

              Comment


              • #8
                Originally posted by Nick Cox View Post
                rowmin() is an egen function and only works with egen. So replace generate with egen.

                Also my guess is that you really need

                Code:
                local var s_41270_0_`i'
                local time ts_41280_0_`i'
                replace afib==1 if `time'==earliest & substr(`var', 1, 3) =="I48"
                foreach var of varlist s_41270_0_* {
                replace `var'="I48" if (`var'=="I480" | `var'=="I481" ///
                | `var'=="I482" | `var'=="I483" | `var'=="I484" | `var'=="I485" ///
                | `var'=="I486" | `var'=="I487" | `var'=="I488" | `var'=="I489")

                }

                foreach var of varlist s_41270_0_* {
                recode afib (0=1) if `var'=="I48"

                }

                egen earliest = rowmin(ts_41280_0_*)
                forvalues i=0/242 {
                local var = s_41270_0_`i'
                local time = ts_41280_0_`i'
                replace afib=1 if (`time'==earliest & substr(`var', 1, 3)=="I48")

                . egen earliest = rowmin(ts_41280_0_*)
                (129 missing values generated)

                . forvalues i=0/242 {
                2. local var = s_41270_0_`i'
                3. local time = ts_41280_0_`i'
                4. replace afib=1 if (`time'==earliest & substr(`var', 1, 3)=="I48")
                5. }
                C820 not found
                r(111);

                end of do-file

                Egen worked better, but in the output I get a different number every time I run the analysis (highlighted in red). Also the number in earliest is no date, so I have to get MDY in there somewhere I think. We are going to exclude those with afib-diagnose at baseline, i.e. on examination date (ts_53_0_0).

                Comment


                • #9
                  Code:
                  local var = s_41270_0_`i'
                  local time = ts_41280_0_`i'
                  The code suggested in post #7 does NOT include equal signs.

                  Please, you have in several cases taken suggested code and, rather than copy it as given, made changes (replace egen with generate, add equal signs) that cause the code to not work. If the code doesn't work as given, report that fact and what the problem was, rather than change the code and the report a problem with code that was not suggested.
                  Last edited by William Lisowski; 16 Feb 2023, 08:01.

                  Comment


                  • #10
                    Originally posted by William Lisowski View Post
                    Code:
                    local var = s_41270_0_`i'
                    local time = ts_41280_0_`i'
                    The code suggested in post #7 does NOT include equal signs.

                    Please, you have in several cases taken suggested code and, rather than copy it as given, made changes (replace egen with generate, add equal signs) that cause the code to not work. If the code doesn't work as given, report that fact and what the problem was, rather than change the code and the report a problem with code that was not suggested.
                    William, as you can see in post #8, the analysis was with equal signs after "local var" and "local time".

                    Comment


                    • #11
                      Originally posted by Andreas Rydell View Post

                      William, as you can see in post #8, the analysis was with equal signs after "local var" and "local time".
                      And that is a mistake, the analysis in post 8 should not have equal signs because the code in post 7 did have equal signs.

                      Comment


                      • #12
                        Originally posted by William Lisowski View Post

                        And that is a mistake, the analysis in post 8 should not have equal signs because the code in post 7 did have equal signs.
                        Ok. I used the one you wrote #2.

                        This is the current output. Generate rowmin didnt work, so I tried egen instead (tip from Nick #7), do-file code in green, output in red (excluding all the changes made from replace):


                        foreach var of varlist s_41270_0_* {
                        replace `var'="I48" if (`var'=="I480" | `var'=="I481" ///
                        | `var'=="I482" | `var'=="I483" | `var'=="I484" | `var'=="I485" ///
                        | `var'=="I486" | `var'=="I487" | `var'=="I488" | `var'=="I489")

                        }
                        gen afib=0
                        egen earliest = rowmin(ts_41280_0_*)
                        forvalues i=0/242 {
                        local var s_41270_0_`i'
                        local time ts_41280_0_`i'
                        replace afib==1 if `time'==earliest & substr(`var', 1, 3) =="I48"
                        }



                        . egen earliest = rowmin(ts_41280_0_*)
                        (129 missing values generated)

                        . forvalues i=0/242 {
                        2. local var s_41270_0_`i'
                        3. local time ts_41280_0_`i'
                        4. replace afib==1 if `time'==earliest & substr(`var', 1, 3) =="I48"
                        5. }
                        == invalid name
                        r(198);

                        end of do-file

                        r(198);


                        We actually want to exclude those individuals that has I48 diagnose (in s_41270_0_*) with a date (ts_41280_0_*) that is before examination date ts_53_0_0.


                        EDIT: this didnt get an output error:
                        gen afib=0
                        foreach var of varlist s_41270_0_* {
                        replace `var'="I48" if (`var'=="I480" | `var'=="I481" ///
                        | `var'=="I482" | `var'=="I483" | `var'=="I484" | `var'=="I485" ///
                        | `var'=="I486" | `var'=="I487" | `var'=="I488" | `var'=="I489")

                        }

                        egen earliest = rowmin(ts_41280_0_*)
                        forvalues i=0/242 {
                        local var s_41270_0_`i'
                        local time ts_41280_0_`i'
                        recode afib (0=1) if `time'==earliest & substr(`var', 1, 3) =="I48"
                        }


                        What I'm in dire need of now is following:

                        earliest has to be in the same date-format as examdate1 (first examination date in UKBB). Then only those with I48 diagnose and earliest <=examination date are to be excluded.

                        Examination date and diagnose date has following format:
                        02jun2008

                        earliest looks like this: 14416, 19586 etc.

                        Last edited by Andreas Rydell; 16 Feb 2023, 10:14.

                        Comment


                        • #13
                          Aha! This was not in post #1 and as a result I was very confused about what you wanted to obtain.
                          We actually want to exclude those individuals that has I48 diagnose (in s_41270_0_*) with a date (ts_41280_0_*) that is before examination date ts_53_0_0
                          Thank you. That clears things up in my mind.

                          The incorrect code in post #2, for which I apologize, should be replaced with
                          Code:
                          generate exclude = 0
                          forvalues i=0/242 {
                              local var  s_41270_0_`i'
                              local time ts_41280_0_`i'
                              replace exclude=1 if substr(`var', 1, 3) =="I48" & `time'<=ts_53_0_0
                          }
                          Then, any individual who had an AFIB diagnosis whose corresponding exam data was before the date in ts_53_0_0 will have exclude==1, all others will have exclude==0.

                          Comment


                          • #14
                            Thank you William. I going to try that next. I did a data check now and unfortunately it didnt get correct. Im going to try to explain. The code was:

                            gen afib=0
                            foreach var of varlist s_41270_0_* {
                            replace `var'="I48" if (`var'=="I480" | `var'=="I481" ///
                            | `var'=="I482" | `var'=="I483" | `var'=="I484" | `var'=="I485" ///
                            | `var'=="I486" | `var'=="I487" | `var'=="I488" | `var'=="I489")

                            }
                            gen flimmer=0

                            foreach var of varlist s_41270_0_* {
                            recode flimmer (0=1) if `var'=="I48"

                            }
                            sum flimmer

                            egen earliest = rowmin(ts_41280_0_*)
                            format earliest %td
                            forvalues i=0/242 {
                            local var s_41270_0_`i'
                            local time ts_41280_0_`i'
                            recode afib (0=1) if `time'==earliest & substr(`var', 1, 3) =="I48"
                            }

                            "flimmer" (as we call fibrillation") was made to control the output. 65 individuals had flimmer=1, e.g I48 on some of the visits. I looked at one of these individuals and he had I48 diagnos in the first visit (s_41270_0_0) and z121 in s_41270_0_1. The diagnose date for I48 was 05nov2018 and for z121 03dec2014. In the earliest it was 03dec2014.

                            Hence afib is only coded when i48 had the earliest date, not earliest among I48 only.
                            Last edited by Andreas Rydell; 16 Feb 2023, 11:35.

                            Comment


                            • #15
                              Update.

                              It seems to work as I mentioned now using this code. Thank everyone that helped to steer me right.

                              foreach var of varlist s_41270_0_* {
                              replace `var'="I48" if (`var'=="I480" | `var'=="I481" ///
                              | `var'=="I482" | `var'=="I483" | `var'=="I484" | `var'=="I485" ///
                              | `var'=="I486" | `var'=="I487" | `var'=="I488" | `var'=="I489")

                              }
                              gen afib=0
                              foreach var of varlist s_41270_0_* {
                              recode afib (0=1) if `var'=="I48"

                              }
                              generate exclude = 0
                              forvalues i=0/242 {
                              local var s_41270_0_`i'
                              local time ts_41280_0_`i'
                              replace exclude=1 if substr(`var', 1, 3) =="I48" & `time'<=examdate1
                              }
                              tab exclude
                              tab afib

                              exclude | Freq. Percent Cum.
                              ------------+-----------------------------------
                              0 | 1,041 99.05 99.05
                              1 | 10 0.95 100.00
                              ------------+-----------------------------------
                              Total | 1,051 100.00

                              .
                              end of do-file

                              . do "/var/folders/0k/xtg98ydd4lx911m7_cq_7f980000gq/T//SD05053.000000"

                              . tab afib

                              afib | Freq. Percent Cum.
                              ------------+-----------------------------------
                              0 | 986 93.82 93.82
                              1 | 65 6.18 100.00
                              ------------+-----------------------------------
                              Total | 1,051 100.00


                              Interpretation: 10 out of 65 in this training dataset (only 1000 individuals) had a diagnose prior to baseline examination and should be excluded. I double checked all 10 and some more from the other 55 and it was 100% correct.

                              Comment

                              Working...
                              X