Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Repeat command issues: foreach

    I have data with a laboratory value on each line, and each STUDYID has with multiple laboratory values. I am trying to match the closest height (in time) to each individual laboratory value's date (each line). Each individual has between 2-200 heights and are in wide format.
    i.e. the data looks like this:
    STUDYID LBDTM ContactDate1 Height1 ContactDate2 Height2 ContactDate3 Height3... and so on.

    I am trying to write a loop command to get the difference between each ContactDate and the LabDateTime (LBDTM)

    I am writing a code like this:

    foreach var of varlist `ContactDate*' {
    gen HEIGHTDIFFERENCE= abs(LBDTM- `ContactDate*')
    }

    What am I doing wrong? I'm getting multiple errors, usually syntax.
    Ideally, I would have multiple variables with the absolute value between each contact date and the laboratory date, then be able to identify the minimum value.
    Thanks in advance for any help.

  • #2
    First line should be

    Code:
    foreach var of varlist ContactDate* {
    You don't need the compound quotes. Second line should be:

    Code:
    gen hd_`var' = abs(LBDTM - `var')
    Two things: `var' is the local you need to reference at each iteration of the loop, not `ContactDate*'. Second, the variable name you are generating needs to change on each iteration of the loop. Otherwise you are generating a variable that already exists, and you'll get an error. Putting it all together:

    Code:
    foreach var of varlist ContactDate* {
        gen hd_`var' = abs(LBDTM - `var')
    }
    This should be much easier in long format by the way. No loop required.

    Comment


    • #3
      It worked beautifully! Thank you so very much for your help.

      Comment


      • #4
        Hi Amy,

        I'm glad #2 worked, but it occurs to me you might actually want your HEIGHTDIFFERENCE variables to have the same postfix as your other variables. Let's say that you have 10 ContactDate columns. You might prefer something like this:

        Code:
        forv i = 1/10{
            gen HEIGHTDIFFERENCE`i' = abs(LBDTM - ContactDate`i')
        }
        Although, I just want to reiterate, this and many other operations are much easier in long format. Stata almost always expects data like this in long format. See

        Code:
        help reshape long
        Last edited by Daniel Schaefer; 01 Nov 2023, 11:55.

        Comment


        • #5
          That's great, thank you. It does appear that the new hd_ variable does have the same postfix variable using your first method as well - but I will explore the second as an alternative too.
          I have never had to use data like this in the wide format before, which is indeed why I am having issues. Based on the repeated observations in two ways, I think I have to have one set of data in wide in order to merge them?

          I am now trying to generate a variable to find the lowest value within the heighdifference variables, to match to the closest for each lab value line and drop the remainder of the observations.

          gen bestheight=min(hd_ContactDate*) doesn't work for this?

          Comment


          • #6
            It does appear that the new hd_ variable does have the same postfix variable using your first method as well - but I will explore the second as an alternative too.
            Oh, your right, thanks, Amy, for pointing that out. I guess I meant "just the postfix, not the entire variable name" - but I see #2 works just fine too.

            Based on the repeated observations in two ways
            I'm not sure what you mean by "repeated observations in two ways". Do you mean that Height and ContactDate are both in wide format? That shouldn't be a problem. You don't provide example data, so I can't observe the structure of your data or test my code before sending it to you, but it seems like you just need something like this:

            Code:
            reshape long ContactDate Height, i(STUDYID) j(postfix)
            gen bestheight=min(hd_ContactDate*) doesn't work for this?
            Nope. Two things: min() is a weird function. It actually expects a list of arguments, not a variable. So you could say:

            Code:
            gen bestheight=min(1, 2, 3, 4, -5)
            to get a single column filled with -5, but you wont get the minimum of a variable with that function. Notice as well, you don't give min a single variable as an argument, you give it a list of variables - but then you expect it to -generate- a single variable. That is syntactically ambiguous, and would probably lead to an error even if min (e.g.,) found the minimum of a variable. What do you expect as output? Do you want a single column at the end of all of this that has the minimum value for each row across these variables? Do you want a single column as output that has the overall minimum for every row and column across these variables? Or do you want a separate column with a single minimum value that corresponds to the given column of hd_ContactDate*?

            Comment


            • #7
              Thanks so much for your help with this. Maybe I am thinking about it incorrectly.
              Here is a sample of the data to give you a better sense:
              STUDYID LBDTM LBORRES AgeinDays1 ContactDate1 Height1 AgeinDays2 ContactDate2 Height2 AgeinDays3 ContactDate3 Height3
              PAKI-0001 21-Jul-16 0.8 1515 10-Jun-02 3' 1.24" 1830 ######## 3' 3.92" 2187 ######## 3' 6.32"
              PAKI-0001 23-Jul-16 0.9 1515 10-Jun-02 3' 1.24" 1830 ######## 3' 3.92" 2187 ######## 3' 6.32"
              PAKI-0001 17-Oct-16 0.8 1515 10-Jun-02 3' 1.24" 1830 ######## 3' 3.92" 2187 ######## 3' 6.32"
              PAKI-0001 ######## 0.9 1515 10-Jun-02 3' 1.24" 1830 ######## 3' 3.92" 2187 ######## 3' 6.32"
              PAKI-0001 ######## 0.8 1515 10-Jun-02 3' 1.24" 1830 ######## 3' 3.92" 2187 ######## 3' 6.32"
              PAKI-0001 9-Jul-20 0.7 1515 10-Jun-02 3' 1.24" 1830 ######## 3' 3.92" 2187 ######## 3' 6.32"
              PAKI-0001 22-Jul-21 0.9 1515 10-Jun-02 3' 1.24" 1830 ######## 3' 3.92" 2187 ######## 3' 6.32"
              PAKI-0001 8-Nov-21 1.1 1515 10-Jun-02 3' 1.24" 1830 ######## 3' 3.92" 2187 ######## 3' 6.32"
              PAKI-0001 28-Jul-22 0.9 1515 10-Jun-02 3' 1.24" 1830 ######## 3' 3.92" 2187 ######## 3' 6.32"
              PAKI-0001 ######## 1.05 1515 10-Jun-02 3' 1.24" 1830 ######## 3' 3.92" 2187 ######## 3' 6.32"
              PAKI-0001 3-Aug-23 0.93 1515 10-Jun-02 3' 1.24" 1830 ######## 3' 3.92" 2187 ######## 3' 6.32"
              PAKI-0003 5-Aug-13 0.2 82 ######## 1' 10.992" 86 ######## 1' 10.992" 89 ######## 1' 10.992"
              PAKI-0003 ######## 0.2 82 ######## 1' 10.992" 86 ######## 1' 10.992" 89 ######## 1' 10.992"
              PAKI-0003 ######## 0.2 82 ######## 1' 10.992" 86 ######## 1' 10.992" 89 ######## 1' 10.992"
              PAKI-0003 ######## 0.3 82 ######## 1' 10.992" 86 ######## 1' 10.992" 89 ######## 1' 10.992"
              PAKI-0003 ######## 0.2 82 ######## 1' 10.992" 86 ######## 1' 10.992" 89 ######## 1' 10.992"
              As you can see, there are multiple lab results (LBRORRES) each at different times (LBDTM) for each STUDYID. Each STUDYID has many heights over time. The height data was originally long, but given I need to compare each available height to each lab value (I think) one of them needs to be wide, so I reshaped it to merge it into the lab data.

              The goal is to find the height (based on contact date) obtained closest to each individual LBORRES in order to interpret the result.
              My plan was to calculate the absolute value of time between the lab result on each line and each available height. You helped me with this above.
              Next, I need to identify which of these is the lowest (i.e. the closest in time) so that I can drop the rest of the heights for that STUDYID/LBORRES and hopefully that will leave me with just one Contact Date (n) Age in Days (n) and Height (n) which is best for that row.
              Each STUDYID has up to 200 heights, making it difficult to type out the individual variables, and all STUDYIDs have a different number of heights and lab results.
              Maybe this makes my problem clearer?

              Thanks again so much for taking the time to help me

              Comment


              • #8
                Thank you! I know you are new here so I just want to briefly point out that you should always give data examples generated using the -dataex- command. What you have here is missing some important meta data that -dataex- would provide.

                However, this does clarify a few things. So you want to find the height measurement taken closest to the LBORRES date? Okay.

                (I think) one of them needs to be wide
                Not necessarily, but you've gotten this far: Let's continue along with your original train of thought. You want to find the minimum of the difference we just calculated across rows, then get the height from the corresponding height column and summarize all of that in a new column? Okay, we can do that with a linear search algorithm:

                Code:
                * just setting up some random data here, assuming 10 of each variable again.
                clear
                set obs 100
                forv i = 1/10 {
                    gen height`i' = runiform()
                    gen hd_ContactDate`i' = runiform()
                }
                
                * here's the algorithm.
                gen minimum = hd_ContactDate1
                gen nearest_height = height1
                forv i = 2/10 {
                    replace nearest_height = height`i' if hd_ContactDate`i' < minimum
                    replace minimum = hd_ContactDate`i' if hd_ContactDate`i' < minimum
                }
                
                * just double checking the final set of minimums against egen rowmin.
                egen rowmin = rowmin(hd_ContactDate*)
                assert rowmin == minimum
                Suppose two hd_ContactDate variables have a match for the minimum time. I don't know if that's possible in your dataset but suppose it is: In that case, we use the first corresponding height variable in the order given by the postfix.

                Comment


                • #9
                  I thank Daniel Schaefer gave you excellent advice when he said that this should be done in long layout, not wide. The tableau of data shown in #7 requires a lot of surgery to import into Stata, so this code is untested and may contain errors, but here's how to do it with long data:
                  Code:
                  reshape long ContactDate Height, i(STUDYID LBDTM) j(_j)
                  gen delta = abs(ContactDate-LBDTM)
                  by STUDYID LBDTM (delta ContactDate), sort: keep if _n == 1
                  This will pair up each lab result with the height and contact date that is closest in time to the lab date, whether preceding or following. In the event of a tie for closest (say, one 3 days before and another 3 days after the lab date), the earlier one is used.

                  While the code bears some similarities to that spelled out for the wide data, it is more compact, and more readable, largely because it does not require any explicit loops at all. And, if your data set is very large, this will run much faster.

                  Comment


                  • #10
                    Thanks Clyde. Consider this randomly generated test case:

                    Code:
                    clear
                    set obs 100
                    set seed 4759345
                    egen STUDYID = seq(), f(1) t(10) b(10)
                    bysort STUDYID: gen LBDTM = _n
                    
                    forv i = 1/10 {
                        gen Height`i' = runiformint(1, 10)
                        gen ContactDate`i' = runiform(1, 10)
                    }
                    This is the solution Amy and I worked through together:

                    Code:
                    foreach var of varlist ContactDate* {
                        gen hd_`var' = abs(LBDTM - `var')
                    }
                    
                    gen minimum = hd_ContactDate1
                    gen nearest_height = Height1
                    forv i = 2/10 {
                        replace nearest_height = Height`i' if hd_ContactDate`i' < minimum
                        replace minimum = hd_ContactDate`i' if hd_ContactDate`i' < minimum
                    }
                    If we run that, then execute Clyde's solution:

                    Code:
                    reshape long ContactDate Height, i(STUDYID LBDTM) j(_j)
                    gen delta = abs(ContactDate-LBDTM)
                    by STUDYID LBDTM (delta ContactDate), sort: keep if _n == 1
                    
                    assert Height == nearest_height
                    We see that the two solutions produce equivalent results - at least on this test case. They are, however, not "the same" for all of the reasons Clyde gives at the end of #9.
                    Last edited by Daniel Schaefer; 01 Nov 2023, 15:10.

                    Comment


                    • #11
                      That works perfectly, thank you!
                      Yes, I am new to the forum and stata in general so I appreciate the advice.

                      Finishing this up, I need to lastly bring in the AgeinDays as well which I assumed would be easy but I'm having some trouble with this:

                      gen nearest_age=AgeinDays1
                      replace nearest_age = AgeinDays`i' if hd_ContactDate`i'== minimum

                      Seems that I am missing something simple as to why this isn't working.
                      Again, so appreciative of your time and help today.

                      Comment


                      • #12
                        Sorry, but why not:

                        Code:
                        reshape long AgeinDays1 ContactDate Height, i(STUDYID LBDTM) j(_j)
                        gen delta = abs(ContactDate-LBDTM)
                        by STUDYID LBDTM (delta ContactDate), sort: keep if _n == 1
                        Or am I missing something? I guess if you want to continue on with these loops you can finish up by modifying the code so far like this:

                        Code:
                        foreach var of varlist ContactDate* {
                            gen hd_`var' = abs(LBDTM - `var')
                        }
                        
                        gen minimum = hd_ContactDate1
                        gen nearest_height = Height1
                        gen corresponding_AgeinDays = AgeinDays1
                        forv i = 2/10 {
                            replace nearest_height = Height`i' if hd_ContactDate`i' < minimum
                            replace corresponding_AgeinDays = AgeinDays`i' if hd_ContactDate`i' < minimum
                            replace minimum = hd_ContactDate`i' if hd_ContactDate`i' < minimum
                        }
                        After which there is no need to reshape. You can just drop all of the redundant data.

                        Code:
                        drop ContactDate* hd_ContactDate* AgeinDays* Height*
                        I might also -rename- corresponding_AgeinDays AgeinDays as well. I use the long variable name within the loop here so that its a little clearer to the reader what is going on.

                        Only you know your data and where you are at in the process, but just consider, it might be simplest to reload the data from the file system while it is still long format, then rerun the last two lines from #9.

                        Comment


                        • #13
                          I just want to add, next time, if you get stuck, you might just want to post your overall goal along with a data example generated with -dataex-. You started with a very specific question about a loop in #1, so you got a very specific answer about a loop in #2. I'm happy to think this through with you, but most of the time, you probably wanted the solution Clyde gave in #9 quickly. The best way to get clear, accurate answers quickly is to be clear about what your goal is, provide a data example generated with -dataex-, the code you've tried so far, and any error messages the code produced. Just follow that format and you'll likely get excellent, high quality answers quickly most of the time.
                          Last edited by Daniel Schaefer; 01 Nov 2023, 16:33.

                          Comment


                          • #14
                            Thank you so much! Appreciate the advice and all the help on this. Will take it to heart for next time.

                            Comment

                            Working...
                            X