Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Removing earlier dates all across my dataset

    Hello everyone,
    I have a fairly large dataset (~35000 observations) that is arranged as such:

    Patient ID Date Variable X
    1 1/12/2000 00:00:00 100
    1 1/10/1992 00:00:00 120
    2 2/15/2003 00:00:00 98
    2 1/8/2005 00:00:00 79

    My question is, if I want to keep only the EARLIEST date for every patient and drop all later dates, how can I go about doing that efficiently?

    Thank you,
    MQR

  • #2
    Code:
    bysort ID (Date) : keep if _n == 1

    Comment


    • #3
      Thank you for getting back to me Nick! This didn't work, unfortunately - but it did set me off on the right foot. Here's a sample of the data:
      Name ID Date
      Joe 1 5/30/1995 00:00:00
      Joe 1 6/17/1994 00:00:00
      Al 2 6/3/1997 00:00:00
      Al 2 5/8/1997 00:00:00

      Before -keep- I actually decided to do this:
      Code:
      bysort ID Date: gen x = _n
      This actually led to 'x' being 1 for all. Probably because they were all unique values?

      Any other advice/input would be highly appreciated. Have been struggling for the greater part of my day with this.

      Comment


      • #4
        On the flip-side,
        Code:
        sort ID Date
        followed by
        Code:
        bysort mrn: gen x = _n
        seems to have given me what i need!
        Awesome.

        Comment


        • #5
          Mohammad:

          I have to say that I am at a loss to know why my code "did not work". You don't explain why not.

          It is possible I interpreted your post #1 incorrectly. Your title says "removing earlier dates"; your body of text says "keep only the EARLIEST date"; one of these must be wrong. But I took your emphasis to be decisive.

          Nor can I make sense of #4. This refers to an unexplained variable mrn and drops or keeps nothing.

          This example is modelled on your data example in #1 and my code in #2 and will be seen to keep only the EARLIEST date.

          Code:
          clear 
          input ID str9 SDate    
          1    "1/12/2000" 
          1    "1/10/1992" 
          2    "2/15/2003" 
          2    "1/8/2005" 
          end 
          gen Date = daily(SDate, "MDY") 
          format Date %td 
          bysort ID (Date) : keep if _n == 1
          
          list 
          
               +----------------------------+
               | ID       SDate        Date |
               |----------------------------|
            1. |  1   1/10/1992   10jan1992 |
            2. |  2   2/15/2003   15feb2003 |
               +----------------------------+

          Comment


          • #6
            I apologise, let me clear a few things from the previous posts:
            - When you said "bysort ID (Date): keep if _n == 1" , i thought "(Date)" meant my date variable, without the parenthesis. That was causing the issue.
            - 'mrn' was actually Patient ID, I was a bit excited that it worked and didn't pay attention while posting. I eventually did keep the earlier dates.

            However I'm still struggling with the next step which is similar in nature. If you're willing to help, I can post it here? Otherwise I can post it in the general forum.

            Thanks again, you've been very helpful.

            Comment


            • #7
              The parentheses have a vital syntactic role. See the help on by.

              This is the General forum and naturally you may post other questions if you have them. Who answers is contingent!

              Comment


              • #8
                Just looked up the parenthesis in help -by-. Very cool! Learn something new everyday.
                Here's my current problem. I have two columns containing dates. edate contains the earlier dates that I managed to get through your help and merged into a new document. zdate contains those dates as well as additional dates. I want to drop all observations that don't match with edate. Please see example below:

                Name ID zdate edate
                Allen 1 9/11/2002 9/11/2002
                Allen 1 9/11/2002 9/11/2002
                Allen 1 9/11/2002 9/11/2002
                Allen 1 5/5/2003 9/11/2002
                Allen 1 5/5/2003 9/11/2002
                Joan 2 6/17/1994 6/17/1994
                Joan 2 5/30/1995 6/17/1994
                Joan 2 5/30/1995 6/17/1994
                Joan 2 5/30/1995 6/17/1994


                So in the above example, I would want to keep all observations for Allen that coincide with edate (which would be the first three), and then keep just the first observation for Joan which coincides with edate, and so on for the rest of the data.

                Thank you all for your help!

                Comment


                • #9
                  Code:
                  drop if zdate != edate 
                  
                  * or 
                  
                  keep if zdate == edate

                  Comment


                  • #10
                    Brilliant. worked like a charm.
                    Thank you very much Nick!

                    Comment

                    Working...
                    X