Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Data cleaning

    Hello everyone,
    I have a very large dataset, in which each unique id has multiple columns (different codes) and different dates. I'm trying to figure out a way to extract ids with desired codes and their oldest dates, and then calculate the time difference between codes for each id, and I'm lost any suggestions on how should I do it? I'd really appreciate your help.
    a small sample of my dataset:
    id code date
    1 00190 20160204
    1 00300 20180110
    1 00670 20160114
    1 01420 20160115
    1 01480 20160419
    1 13132 20160108
    1 13132 20180110
    1 13133 20160108
    2 13133 20180110
    2 00190 20160204
    2 00300 20160419
    2 00670 20160114
    2 01420 20160114
    2 01480 20160114
    2 13132 20160114
    2 13132 20160114
    2 00670 20180112
    Attached Files
    Last edited by Sama Noroozi; 11 Apr 2023, 00:26.

  • #2
    Originally posted by Sama Noroozi View Post
    . . . a way to extract ids with desired codes
    If you have few codes
    Code:
    tempfile dataset
    quietly save `dataset'
    keep if inlist(code, <comma separated list of desired codes>)
    contract id, freq(discard)
    merge 1:m id using `dataset', assert(match using) keep(match)
    If you have many codes, then you'd probably want to do this with a separate dataset that contains the desired codes (one observation per desired code) and, starting from that dataset in memory
    Code:
    merge 1:m code using `dataset', keep(match) // temporary file created as shown above
    contract id, freq(discard)
    merge 1:m id using `dataset', keep(match)

    and their oldest dates
    Code:
    input byte id str5 code str10 date
    . . .
    end
    
    generate int dt1 = date(date, "YMD")
    format dt1 %tdCY-N-D
    quietly bysort id code (dt1): keep if _n == 1
    Note: because your dates' format complies with the international standard (ISO-8601), you don't actually need to convert your dates to Stata's internal format in order to do this first step. But, because you'll need to convert them for the next step (below), I went ahead and did the conversion upfront.



    . . . and then calculate the time difference between codes for each id
    Code:
    quietly bysort id (dt1): generate int del = dt - dt[_n-1]

    Comment


    • #3
      Note that that's the oldest date for each distinct code within id, and the time differences between distinct codes within id. I assume that that's what you were seeking.

      Comment


      • #4
        thank you so much for your quick response. I'm sorry, I had to explain my data better. My initial inclusion criteria were ICD codes for a specific disease, and my goal is to look at the incidence of another disease in this population and calculate the time difference between the oldest encounters of these two diagnoses. there are 63,373 unique patient IDs with 37,176,440 observations for 10 variables (of which I only will use date and ICD code). each unique id had multiple encounters with the date and their ICD codes (that can be the same ICD codes with different dates). additionally, each of these two diagnoses has multiple ICD codes (meaning different patients might have different codes, but have the same diagnosis). I also have another big file with the same patient IDs for surgeries (with their own specific codes), and I'm trying to figure out how to merge these files as well, cause I also want to know the time difference between the oldest dates of surgery of the first disease and diagnosis of the second disease. I'm a novice in STATA, but trying to find my way! Thank you so much for your time and your huge help. let me know if I need to provide more information. thanks

        Comment


        • #5
          attached is the do file I have prepared so far.
          Attached Files
          Last edited by Sama Noroozi; 16 Apr 2023, 03:09.

          Comment


          • #6
            Originally posted by Sama Noroozi View Post
            My initial inclusion criteria were ICD codes for a specific disease, and my goal is to look at the incidence of another disease in this population and calculate the time difference between the oldest encounters of these two diagnoses. there are 63,373 unique patient IDs with 37,176,440 observations for 10 variables (of which I only will use date and ICD code). each unique id had multiple encounters with the date and their ICD codes (that can be the same ICD codes with different dates). additionally, each of these two diagnoses has multiple ICD codes (meaning different patients might have different codes, but have the same diagnosis). I also have another big file with the same patient IDs for surgeries (with their own specific codes), and I'm trying to figure out how to merge these files as well, cause I also want to know the time difference between the oldest dates of surgery of the first disease and diagnosis of the second disease.
            I'm afraid that I don't follow your description and so I doubt that I can be of much practical help.

            The description strikes me, though, as trying to accomplish too much at once and as a result the endeavor has becomed jumbled. So, about the only suggestion I could make here is to divide and conquer, breaking the thing into smaller, more cognitively manageable tasks, and tackling each one by one.

            You can always save the intermediate datasets that result from these component tasks, and have things come together when companion-task components are ready. For now, you can save the intermediate datasets as permanent datasets, and when you have everything worked out, convert them into temporary datasets (using tempfile) during your final code refactoring in order to leave a cleaner working environment behind when your production code completes.

            Comment

            Working...
            X