Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Need help with dataset with multiple dates

    Hi I am working with a dataset which ahs multiple dates for the same diagnosis for a patient and I only want to keep the observations with the earliest date and drop the rest. I am going to explain a bit how the dataset is so that it can be understandable.

    ID DIAGNOSIS DATE
    1 CKD 2/9/2000
    1 CKD 4/7/2008
    1 CHF 2/5/2005
    1 CHF 5/5/2009

    What I want it to like is to have a column of diagnosis and only the first date.
    Please help me out. Thanks

    Ushma

  • #2
    With a tableau like you show, which does not include metadata that is critical to answering your specific question, it is impossible to know whether your variable DATE is, in fact, a Stata internal format date variable, or if it is some string that, to human eyes, looks like a date, or if it is some other integer with value labels that look like dates. If it is anything other than a Stata internal format date variable, the first thing you need to do is convert it to that. Once that is done:

    Code:
    by ID (DATE), sort: keep if _n == 1
    In the future, it is best not to make people who want to help you guess about your data. The helpful way to show example data and avoid questions like this, is to use the -dataex- command. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

    Comment


    • #3
      Hi
      Thank you so much. I wanted to clarify one more thing. Can I also do it for each diagnosis separately? Like can I use the same code with a condition like if CKD =1 ?

      Comment


      • #4
        If you want it separately for each diagnosis it's:
        Code:
        by ID DIAGNOSIS (DATE), sort: keep if _n == 1
        Last edited by Clyde Schechter; 08 May 2021, 15:01. Reason: correct error in code

        Comment


        • #5
          I am sorry if its a bit confusing. What I tried to do is have a separate column for each diagnosis and I want to keep only the first date for each one of them

          Comment


          • #6
            That requires some additional steps beyond what is given in #4.

            Code:
            by ID, sort: gen seq = _n
            reshape wide DIAGNOSIS DATE, i(id) j(seq)
            That said, be careful what you wish for. I don't know what you plan to do with the data, but most things in Stata are easier (or only possible) with the data in the long layout that you have after the code in #4. Putting the diagnoses and dates "side by side" makes the data easier for human eyes to grasp, but harder for Stata to work with. There are only a few things in Stata that work best with this kind of data laid out wide. So, I've shown you how to do what you ask, but I suggest you reconsider whether it is wise to do so.

            Comment


            • #7
              Clyde Schechter
              The data looks like this right now after using the code:
              clear
              input long mrn str11 diagnosis1 long date_n1 str11 diagnosis2 long date_n2 str11 diagnosis3 long date_n3
              1558 "CKD" 18883 "ESKD" 18898 "HLD" 21193
              6394 "CAD" 20217 "CHF" 20108 "CKD" 19606
              22243 "CKD" 17098 "ESKD" 19901 "HLD" 20395
              30383 "CKD" 13696 "ESKD" 15888 "HTN" 14828
              30490 "CAD" 18430 "CHF" 19667 "HTN" 13818
              33638 "CKD" 15505 "ESKD" 17065 "HLD" 15979
              44044 "CKD" 16218 "ESKD" 16771 "HLD" 21094
              52431 "CAD" 14004 "CHF" 14661 "CKD" 13892


              My problem is that diagnosis1 diagnosis2 are not the same for each mrn I wanted it to be the same.

              Comment


              • #8
                Well, unless everybody in the data set has all exactly the same diagnoses, you can't have that. I'm not talking about any limitation of Stata: if you think it through, there is no possible organization of the data that satisfies those criteria. But you can come close. Start out with your original data, as it was in #1. Then do this:


                Code:
                by id diagnosis (date), sort: keep if _n == 1
                You will now have a data set that looks like the -dataex- in the code below, to which you can apply:

                Code:
                * Example generated by -dataex-. For more info, type help dataex
                clear
                input long mrn str11 diagnosis long date
                 1558 "CKD"  18883
                 1558 "ESKD" 18898
                 1558 "HLD"  21193
                 6394 "CAD"  20217
                 6394 "CHF"  20108
                 6394 "CKD"  19606
                22243 "CKD"  17098
                22243 "ESKD" 19901
                22243 "HLD"  20395
                30383 "CKD"  13696
                30383 "ESKD" 15888
                30383 "HTN"  14828
                30490 "CAD"  18430
                30490 "CHF"  19667
                30490 "HTN"  13818
                33638 "CKD"  15505
                33638 "ESKD" 17065
                33638 "HLD"  15979
                44044 "CKD"  16218
                44044 "ESKD" 16771
                44044 "HLD"  21094
                52431 "CAD"  14004
                52431 "CHF"  14661
                52431 "CKD"  13892
                end
                format %td date
                
                replace diagnosis = strtoname(diagnosis)
                
                rename date _date
                reshape wide @_date, i(mrn) j(diagnosis) string
                Note: This code runs fairly smoothly if all the values of the diagnosis variable are well-behaved. I have taken the precaution of applying the strtoname() function to deal with the possibility that some of them have embedded spaces or non-alphanumeric characters. The problem this doesn't solve is if any of the diagnosis variable values is longer than 27 characters. If that happens, when -reshape- tries to create a new variable to correspond to the date of that diagnosis, it will break because of the 32 character limit on variable names. If it's just off by a few characters you could replace _date by _dt, or just dt, or just d as the case may be in the last two lines of the code. If any of the entries in diagnosis are already more than 32 characters, then you will have to change that outright.

                Again, I will emphasize that you shouldn't go for this layout unless you have a clear understanding why you need it. It will likely prove an albatross around your neck if you do further data management or analysis in Stata.

                Comment

                Working...
                X