  • How to discard all but the lowest date in an already sorted list of patients.

    Dear Statalist.

    Firstly I am relatively new to Stata.
    I have a list with patient IDs, dates and lots of other data.
    I have sorted the list by patient ID and then by date of observation, using 'sort ID Date'.
    I need to be able to drop ALL BUT the earliest date in my dataset for each patient ID. How is this achieved?

    eg.

    ID Date
    1 01/01/2000 <keep
    1 10/06/2011
    1 22/03/2017
    2 30/09/2003 <keep
    2 12/10/2010
    3 06/02/2001 <keep
    3 28/08/2005
    3 11/02/2009
    3 15/05/2015

    to this

    ID Date
    1 01/01/2000
    2 30/09/2003
    3 06/02/2001



    Furthermore, if need be, is it possible to tell STATA to keep only a specific date for a single patient and discard the rest?
    To use the example above:
    for patient ID 1 keep date nr 2 (10/06/2011)
    for ID 2 keep date nr 2 (12/10/2010)
    for ID 3 keep the last date, nr 4 (15/05/2015).

    Thanks
    Regards
    Puriya

  • #2
    Assuming that Date is really a numeric variable with a date display format:

    Code:
    bysort ID (Date) : keep if _n == 1 
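
    If Date is instead a string such as "01/01/2000", sorting on it is alphabetical rather than chronological, so it needs converting first. A minimal sketch, assuming day/month/year strings; the variable name Date_str and the toy values are hypothetical:

    ```stata
    clear
    input byte ID str10 Date_str
    1 "10/06/2011"
    1 "01/01/2000"
    2 "30/09/2003"
    2 "12/10/2010"
    end

    * daily() turns a string into a numeric daily date; "DMY" gives the order of the parts
    generate Date = daily(Date_str, "DMY")
    format %td Date

    * with a numeric date, the sort within each ID is chronological
    bysort ID (Date): keep if _n == 1
    list, noobs
    ```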

    • #3
      Originally posted by Nick Cox
      Assuming that Date is really a numeric variable with a date display format:

      Code:
      bysort ID (Date) : keep if _n == 1 

      I have to apologise for not being precise.
      I had found this previously; however, the issue is that it keeps only the first observation for each patient. I have many observations with the same date, and I would like to keep all of those with the first date.
      For example, I have 5 different observations under the date 01/01/2000 for patient ID 1.

      Thanks for your reply.

      • #4
        Try:

        Code:
        bysort ID (Date): keep if Date == Date[1]
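
        An equivalent route that does not rely on the sort order inside each group is to compute each patient's earliest date with -egen- and compare against it. This is only a sketch of that alternative, not the command above; Date_str, visit, first_date and the toy values are hypothetical:

        ```stata
        clear
        input byte ID str10 Date_str byte visit
        1 "01/01/2000" 1
        1 "01/01/2000" 2
        1 "10/06/2011" 3
        2 "30/09/2003" 1
        2 "12/10/2010" 2
        end
        generate Date = daily(Date_str, "DMY")
        format %td Date

        * first_date holds each patient's earliest date on every observation of that patient
        bysort ID: egen first_date = min(Date)

        * every observation that shares the earliest date is kept, including ties
        keep if Date == first_date
        drop first_date
        list ID Date visit, noobs sepby(ID)
        ```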

        • #5
          Originally posted by Clyde Schechter
          Try:

          Code:
          bysort ID (Date): keep if Date == Date[1]

          Thanks! Much appreciated.
          That did the trick. Do you know if I can use the line for a specific ID instead of all ID's at once?
          Say I want to only keep a specific date for patient ID 3 and not drop dates for the rest of the IDs.

          • #6
            Well, you could modify that code, but I think for an idiosyncratic approach like this, a different way is better:

            Code:
            drop if ID == 3 & Date != specific_date
            If what you want to do is keep only the first date for ID 3, and leave everything else alone:

            Code:
            by ID (Date), sort: keep if (Date == Date[1]) | (ID != 3)

            • #7
              Originally posted by Clyde Schechter
              Well, you could modify that code, but I think for an idiosyncratic approach like this, a different way is better:

              Code:
              drop if ID == 3 & Date != specific_date
              If what you want to do is keep only the first date for ID 3, and leave everything else alone:

              Code:
              by ID (Date), sort: keep if (Date == Date[1]) | (ID != 3)

              When I use either command, it constantly tells me "0 observations deleted". I am unsure what is incorrect. Something tells me that I am not entering dates in the correct format. How would specific_date be filled in? Isn't it just 14/04/2008?

              • #8
                Isn't it just 14/04/2008?
                No, it isn't. See what you got:

                Code:
                . display 14/04/2008
                .00174303
                I can pretty much guarantee you that there is no date in your data set that is equal to .00174303.

                Stata's handling of dates and times is complicated, and you are going to have to bite the bullet and learn about it to make progress in Stata. Read -help datetime- first and then read the entire associated manual section. It's a long read. And nobody remembers all the details: we all have to refer back to it for details when we actually write code. But at least you will know what the fundamental approach is and you'll remember most, if not all, of the functions you'll need to use on a regular basis. So what you want is:

                Code:
                drop if ID == 3 & Date != td(14apr2008)
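
                The difference can be checked directly with -display-. A small sketch using standard date functions and the date from the question:

                ```stata
                * typed bare, the slashes are division: 14 divided by 4 divided by 2008
                display 14/04/2008

                * three ways of writing 14 April 2008 as a daily date; all give the same number
                display td(14apr2008)
                display mdy(4, 14, 2008)
                display daily("14/04/2008", "DMY")

                * the same number shown with a date display format
                display %td td(14apr2008)
                ```

                Any of the three forms can stand in for specific_date; td() is the most compact when the date is typed by hand.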

                But all of that said, I cannot comprehend how you are getting 0 observations deleted here. By my reasoning, unless you have no observations with ID == 3, your first try would result in dropping all observations with ID = 3, because Date will never equal 0.00174303. And the second should work correctly because it does not require you to actually specify the date: it should keep the first date for ID == 3 and everything with ID != 3. So I think you should show a) an example of your data, and b) the exact code you ran and the exact output you got from Stata. Please remember to follow the advice in FAQ #12, using -dataex- for the example data, and placing the exact code and output directly between code delimiters, pasting what you copy from the Results window or your log file. (If you don't have -dataex- installed, run -ssc install dataex- and then run -help dataex- to read the instructions for using it.)

                • #9
                  Originally posted by Clyde Schechter
                  But all of that said, I cannot comprehend how you are getting 0 observations deleted here. By my reasoning, unless you have no observations with ID == 3, your first try would result in dropping all observations with ID = 3, because Date will never equal 0.00174303. And the second should work correctly because it does not require you to actually specify the date: it should keep the first date for ID == 3 and everything with ID != 3. So I think you should show a) an example of your data, and b) the exact code you ran and the exact output you got from Stata. Please remember to follow the advice in FAQ #12, using -dataex- for the example data, and placing the exact code and output directly between code delimiters, pasting what you copy from the Results window or your log file. (If you don't have -dataex- installed, run -ssc install dataex- and then run -help dataex- to read the instructions for using it.)

                  The second line of code you wrote actually drops everything with patient ID 3. It does not even leave the first date. Which is odd.
                  Code:
                   
                   by ID (Date), sort: keep if (Date == Date[1]) | (ID != 3)
                  I will get back to you with the rest. I don't know if that helps. Also, doesn't the drop function only work on columns? That may be why I get 0 observations dropped. I need to delete rows, not columns.

                  • #10
                    First terminology: Stata data sets do not have rows and columns. Spreadsheets do. The sooner you stop thinking about Stata datasets as if they were spreadsheets, the better you will become at using them. While there are certain analogies between rows and observations, and columns and variables, they only correspond loosely and, for the most part, your spreadsheet-honed instincts will not help. Indeed, they will often lead you astray.

                    -drop- works with both variables and observations. If you specify -drop varlist- then those variables are removed from the data set. If you specify -drop condition-, then observations meeting that condition will be removed from the data set. Similar comments apply to -keep-.
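
                    A minimal sketch of the two forms on a toy data set; the variable score and the values are hypothetical, and Date is entered directly as daily date numbers:

                    ```stata
                    clear
                    input byte ID int Date byte score
                    1 14610 5
                    1 18788 7
                    2 15978 4
                    end
                    format %td Date

                    * drop followed by a variable list removes variables
                    drop score

                    * drop followed by an if condition removes observations
                    drop if ID == 2

                    * keep mirrors drop: only observations meeting the condition survive
                    keep if Date == td(01jan2000)

                    describe
                    list, noobs
                    ```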

                    Now, I've generated a toy data set here to illustrate how the code works. Note that both ID = 3 and ID = 4 here have a repeat on the first date. So this should be a fair test of the code. It should keep only those two observations for ID = 3 and should leave ID 4 (and all the others) alone.
                    Code:
                    * Example generated by -dataex-. To install: ssc install dataex
                    clear
                    input float(ID Date)
                    1 18409
                    1 18476
                    1 18550
                    1 18777
                    1 19123
                    2 18343
                    2 18595
                    2 19023
                    2 19086
                    2 19485
                    3 18326
                    3 18326
                    3 19110
                    3 20457
                    3 20617
                    4 18722
                    4 18722
                    4 19724
                    4 19900
                    4 20636
                    5 18477
                    5 19593
                    5 19606
                    5 19701
                    5 20232
                    end
                    format %td Date
                    
                    set more off
                    
                    list, noobs sepby(ID)
                    
                    by ID (Date), sort: keep if (Date == Date[1]) | ID != 3
                    
                    list, noobs sepby(ID)
                    As you can see, it retains the first date from ID 3, and leaves all other ID's untouched, as advertised.

                    So either there is something rather odd about your data set or you are doing something that is somehow different from this command. Please show an example of your data that replicates your problem and show the exact code and Stata response including a listing of the resulting data so I can see what's going on and try to help.

                    • #11
                      Originally posted by Clyde Schechter
                      As you can see, it retains the first date from ID 3, and leaves all other ID's untouched, as advertised.

                      So either there is something rather odd about your data set or you are doing something that is somehow different from this command. Please show an example of your data that replicates your problem and show the exact code and Stata response including a listing of the resulting data so I can see what's going on and try to help.

                      Okay, here I have linked two screenshots of the data. I have removed anything that is directly patient related (CPR).

                      https://www.dropbox.com/s/mrraru7f9w...55.36.png?dl=0

                      https://www.dropbox.com/s/24vlry0im2...09.19.png?dl=0

                      I would like to be able to use CPR as a variable, but as you see STATA does not accept this (0 data dropped). It was first in red text (String?) then I made it into blue text, but it doesn't want to become black like the rest. Therefore I have used PT_ID. Anyways that is a whole other matter.

                      The problem is that the code drops all observations with the specific variable PT_ID when used (eg 77177). Both codes do. It does not leave out the dates I want it to.
                      Also it is good to know that I should not think about STATA as having rows and columns.
                      Last edited by Puriya Daniel Yazdanfard; 11 Apr 2017, 16:11.

                      • #12
                        I'm afraid screen shots are useless for this purpose. I can't import the screen shots into Stata to try running the code with your data. In #8 I explained how to post data examples usefully--with -dataex- --. I can't help you if you don't help me.

                        However, in this case, I think I see what is wrong without having to try running anything.

                        Code:
                        // MY CODE
                        by ID (Date), sort: keep if (Date == Date[1]) | (ID != 3)
                        
                        // YOUR CODE
                        by PT_ID (Aud_Dato), sort: keep if (Aud_Dato == [1]) | (PT_ID != 77177)
                        
                        // NOTICE THE CRUCIAL DIFFERENCE! ([1] IN YOUR CODE HAS NO VARIABLE NAME IN FRONT OF IT; IT SHOULD BE Aud_Dato[1])

                        The problem with the other command relates to your red/blue/black mixup. As already mentioned, a screen shot of the data is not really helpful here. But I have a sense of what is probably going on. Try running -count if CPR == 110-. My bet is that you will get 0 as the response. Am I right? How can that be? I'll bet if you scroll through your data in the browser you'll see some blue 110's in that CPR variable. Am I right again?

                        That's because -encode- is the wrong way to create this variable. You started out with strings that looked like "110" or "1102" or "1806". And you wanted to end up with a numeric variable with values like 110, 1102, and 1806, right? But that isn't what -encode- does. To do that, you need -destring-. What -encode- creates is a variable whose actual values are 1, 2, 3, ... up to the number of distinct values in the original CPR variable. Then it labels those numbers 1, 2, 3, ... with the string representation that the original variable had: these are called value labels. The Browser shows those in blue. When you see blue in the Browser you know you are looking not at the real data but at the value labels attached to those data. In particular, if those blue values look like numbers, you are almost certainly in trouble. Because you might expect that calculations performed on those blue numbers (including testing equality to other numbers) would work properly--but they don't because the blue numbers are just for display purposes: the real numbers aren't shown there. If, by chance, you want to see what the real numbers to which those labels are attached are, you can run -browse, nolabel-. Try it with what you have and I'm confident you will see that the actual numbers in your CPR variable are not what you intended and thought they were.

                        So to salvage CPR you have to go back to your original variable and -destring- it. Then I think you will find that my code works as promised. Your data simply weren't what you thought they were and what you represented them to be when you first posted.
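
                        The contrast between the two commands can be seen on a toy variable. This is a sketch only; CPR_enc and CPR_num are hypothetical names, and the three strings are the ones mentioned above:

                        ```stata
                        clear
                        input str4 CPR
                        "1806"
                        "110"
                        "1102"
                        end

                        * encode numbers the distinct strings 1, 2, 3 in sorted order
                        * and attaches the original text as value labels
                        encode CPR, generate(CPR_enc)

                        * destring converts the digits themselves into numbers
                        destring CPR, generate(CPR_num)

                        * the first listing shows the labels, the second the underlying values
                        list, noobs
                        list, noobs nolabel

                        * only the destrung variable matches the number 110
                        count if CPR_enc == 110
                        count if CPR_num == 110
                        ```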

                        I strongly urge you to take a break from working on this project and get familiar with the basics of Stata data management. The time you invest in reading the Getting Started [GS] and User's Guide [U] parts of the manual will be amply repaid with faster progress in your work and less time spent posting questions in this forum and waiting for someone to respond. It's a big read; but well worth the effort. And, no, you won't be able to remember it all. But you will become acquainted with the commands that are used all the time by all Stata users. When you see a problem like this, you'll at least remember some commands that might be helpful; and then you can look up the details of the syntax in the help file. You'll understand things like the difference between -encode- and -destring-, and know what each is best used for.


                        • #13
                          Originally posted by Clyde Schechter
                          I'm afraid screen shots are useless for this purpose. I can't import the screen shots into Stata to try running the code with your data. In #8 I explained how to post data examples usefully--with -dataex- --. I can't help you if you don't help me.

                          However, in this case, I think I see what is wrong without having to try running anything.

                          I strongly urge you to take a break from working on this project and get familiar with the basics of Stata data management. The time you invest in reading the Getting Started [GS] and User's Guide [U] parts of the manual will be amply repaid with faster progress in your work and less time spent posting questions in this forum and waiting for someone to respond. It's a big read; but well worth the effort. And, no, you won't be able to remember it all. But you will become acquainted with the commands that are used all the time by all Stata users. When you see a problem like this, you'll at least remember some commands that might be helpful; and then you can look up the details of the syntax in the help file. You'll understand things like the difference between -encode- and -destring-, and know what each is best used for.

                          Thank you very much for your help Sir. It is much appreciated and I hope I haven't been too much of a hassle. I will get dataex for the future.
                          I figured out why the CPR data was input as string data. Apparently one of the values incorrectly contained a letter. I managed to fix this and now it works correctly. I figured this out after trying:
                          Code:
                          destring CPR, generate(CPR1)
                          Which gave me an error code saying some of the data was not numeric.
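
                          One way to locate such values before running -destring- is to test each string with real(), which returns missing when the text cannot be read as a number. A sketch on made-up data; the value "18O6" (with a letter O) is hypothetical:

                          ```stata
                          clear
                          input str5 CPR
                          "110"
                          "1102"
                          "18O6"
                          end

                          * list the values that cannot be read as numbers
                          list CPR if missing(real(CPR))

                          * after correcting the offending value, destring succeeds
                          replace CPR = "1806" if CPR == "18O6"
                          destring CPR, generate(CPR1)
                          list, noobs
                          ```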

                          Funny the way STATA stores data. I tried what you told me with -browse, nolabel- and you are correct: the CPR data was stored differently.

                          Thank you for the recommendations on what to read. As I said I am new to STATA and would like to get better at it for use in the world of medicine. Excel had its limits. If you have anything else you recommend for new starters then do tell me so.
