  • Individual matching for two exposure variables in a follow up cohort

    Dear Stata users,
    I have a large follow-up data set of more than 4 million cancer screening visits made by about 1 million women. A woman can have at most 10 visits. The data set contains two symptom variables, and a symptom can occur at any of visits 1-10. A woman can also have more than one symptom. In the example data below, woman w1 had a symptom at her 6th visit. I want to find a similar woman who had no symptoms at the same visit (i.e., the 6th visit), matched on 4 background variables (not shown below). If a woman had symptoms at more than one visit during visits 1-10, I consider only the first visit with symptoms. How do I find a similar non-symptomatic visit for a given symptomatic visit, matched on 4 variables? Is it possible to match on two exposure variables (visit with symptoms) against the unexposed (visit without symptoms)?

    My exposure group is visits with symptoms and my comparison group is visits without symptoms. The matching ratio is 1:1, and I have no difficulty finding non-symptomatic visits using the 4 matching variables. Follow-up starts at the date of the visit with symptoms and ends at the exact date of death (from cancer or another cause) or at the last visit date/loss to follow-up.
    Woman  Visit number  Year of visit  Symptom 1  Symptom 2  Death
    w1     1             1992           0          0          0
           2             1994           0          0          0
           3             1996           0          0          0
           4             1998           0          0          0
           5             2000           0          0          0
           6             2002           1          0          0
           7             2004           0          0          0
           8             2006           0          0          0
           9             2008           0          0          1
           10            2010           .          .          .
    w2     1             1996           0          0          0
           2             1998           1          0          0
           3             2000           0          0          0
           4             2002           0          0          0
           5             2004           0          0          0
           6             2006           0          0          0
           7             2008           0          1          0
           8             2010           0          0          0
           9             2012           0          0          0
           10            2014           0          0          0
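
    The "first visit with symptoms" rule described above could be sketched in Stata roughly as follows. This is only an illustration on toy data: the variable names (id, visit, symptom1, symptom2, anysymp, index_visit, is_index) are hypothetical, and treating a missing symptom value as "no symptom" is an assumption, not something stated in the post.

    ```stata
    * Toy data in long format (hypothetical variable names)
    clear
    input id visit symptom1 symptom2
    1 1 0 0
    1 2 1 0
    1 3 0 1
    2 1 0 0
    2 2 0 0
    2 3 0 0
    end

    * Any symptom at this visit; missing values count as "no symptom" (assumption)
    generate byte anysymp = (symptom1 == 1 | symptom2 == 1)

    * Visit number of the first symptomatic visit; stays missing if never symptomatic
    bysort id (visit): egen index_visit = min(cond(anysymp, visit, .))

    * Flag the index visit itself (always 0 for women who were never symptomatic)
    generate byte is_index = (visit == index_visit)

    list, sepby(id)
    ```

    Follow-up time would then start at the visit date of the row where is_index equals 1.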
    Thank you.

    kind regards,
    Deependra

  • #2
    You could reshape the data to wide format and concatenate the symptom variables into one string variable, e.g., "000001000." for symptom 1 of w1. You could then use grep-style regular expressions to search for patterns in those string variables.
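
    A minimal sketch of the wide-format idea, using three visits of toy data. The variable names (id, visit, symptom, s1-s3, sympstr) are made up for illustration, and the number of visits is hard-coded as an assumption.

    ```stata
    * Toy data in long format: woman id, visit number, symptom indicator
    clear
    input id visit symptom
    1 1 0
    1 2 1
    1 3 0
    2 1 0
    2 2 0
    2 3 .
    end

    * One row per woman, one symptom column per visit
    reshape wide symptom, i(id) j(visit)

    * Convert each visit's value to a one-character string ("." for missing)
    forvalues v = 1/3 {
        tostring symptom`v', generate(s`v')
    }

    * Join the per-visit strings into one history string per woman
    egen sympstr = concat(s1 s2 s3)

    list id sympstr, clean
    ```

    With 10 visits and two symptom variables, the same steps would be repeated per symptom variable, or the two variables could first be combined into one indicator.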


    • #3
      Given the size of the dataset, note that it is possible to concatenate the strings in long format without reshaping.

      Code:
      clear
      input id symptom
      1 0
      1 0
      1 0
      1 1
      1 0
      1 .
      2 0
      2 0
      2 0
      2 0
      2 0
      2 0
      end
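      * Convert the numeric indicator to a string ("." for missing)
      tostring symptom, generate(sympstr)
      * Keep the original visit order within each id
      sort id, stable
      * Build the running history: first visit, then append each later visit
      by id: generate concat = sympstr if _n == 1
      by id: replace concat = concat[_n-1] + sympstr if _n > 1
      * Copy the complete history (last row) to every row of the same id
      by id: replace concat = concat[_N]
      list, sepby(id)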


      • #4
        Hi Dave,
        Sorry for the delay; I had no internet access for a few days.

        Thank you very much for the code. It works perfectly in long format.

        I now have a new symptom variable that indicates the round in which a symptom occurred and how often.

        There are 40,000 visits with a symptom, and the remaining 3... million visits are without symptoms. I would like to match every symptomatic visit to an asymptomatic visit at a 1:1 ratio, and also by visit order. For example, if a symptom was reported at a woman's 4th visit, I want to find an asymptomatic woman and match only her 4th visit, provided that she had no symptoms reported at her first three visits. The new variable generated above takes many different values, depending on the visit at which the symptom was reported (or at which the visit was missing). How do I pick the right nth symptomatic visit and match it to the asymptomatic visit with the same visit number? I have not used the 'grep' command that you mentioned above.

        After that, I can easily calculate the follow-up time from the visit date variable and the exact date of death.
        I look forward to your kind help.

        kind regards,
        Deependra


        • #5
          You can google for grep help if you don't have any books on the topic. For example:

          https://www.stata.com/support/faqs/d...r-expressions/
          https://stats.idre.ucla.edu/stata/fa...r-expressions/
          https://www.stata.com/meeting/wcsug0...ros_reg_ex.pdf


          Here is a toy example matching on the first id using your requirements.

          Code:
          clear
          
          input id str5 sympstr
          1 "01001"
          2 "00010"
          3 "0000."
          4 "01000"
          end
          
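          * Flag women whose first two visits are both "0"; later visits may be 0, 1, or missing.
          * The leading ^ anchors the pattern to the start of the string.
          generate match_id_1 = regexm(sympstr, "^00[0-1\.][0-1\.][0-1\.]")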
          
          . list, clean
          
                 id   sympstr   match_~1  
            1.    1     01001          0  
            2.    2     00010          1  
            3.    3     0000.          1  
            4.    4     01000          0


          • #6
            Hi Dave,
            Thank you for the reply.
            I still do not understand the matching step. Below is an example of a woman who had 5 visits and reported a symptom (symp_str) at her 2nd visit. After concatenating the symptom variable, the values run "0", "01", "010", and so on. I need a control for symp_str at the index visit, that is, the visit at which the symptom was reported. In the newly generated variable, the case should therefore have "1" at her second visit, with her other visits left missing, and the matching should find a woman with no symptom ("0") in her whole visit history.
            Using the code you posted above, I can create a new matching variable, but every visit is coded "0". The idea, however, is to pick another woman at random whose visit history contains no symptoms, mark her second visit as "0", and leave her other visits missing. In this way, each woman with a symptom at a given visit is paired with another woman without a symptom at the same visit number.

            n_obs  true_order_inv  symp_str  concat  match_id_symp
            883    1               0         0       0
            883    2               1         01      0
            883    3               0         010     0
            883    4               0         0100    0
            883    5               0         01000   0


            kind regards,
            Deependra


            • #7

              In my toy example, the first woman had a symptom at the second position, i.e., her string begins with "01". So I searched the remaining women with the grep command above, requiring each match to begin with "00" followed by any combination of 0, 1, or . in the remaining positions. You would need to modify the grep command for each possible search position.
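
              One way to avoid writing a separate pattern by hand for every position is to build the pattern in a loop. The sketch below reuses the toy data from the earlier post; the variable names eligible_1 to eligible_5 and the local macro zeros are hypothetical, and it assumes that "eligible as a control at visit k" means "0" at every visit from 1 through k.

              ```stata
              clear
              input id str5 sympstr
              1 "01001"
              2 "00010"
              3 "0000."
              4 "01000"
              end

              * For each visit position k, flag women with "0" at visits 1..k.
              * The local macro grows by one "0" per iteration: "0", "00", "000", ...
              local zeros ""
              forvalues k = 1/5 {
                  local zeros "`zeros'0"
                  * ^ anchors the run of zeros to the start of the history string
                  generate byte eligible_`k' = regexm(sympstr, "^`zeros'")
              }

              list, clean
              ```

              Here eligible_2 flags the same women as match_id_1 in the toy example, because every string has exactly five characters. A woman whose index visit is k could then be paired with a woman for whom eligible_k is 1, subject to the other matching variables.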
