Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to Keep Duplicated Variables based on the Value of another Column

    Hello all!

    I am working on cleaning up a dataset, and I do not know how to do so. Here are some of the relevant parts of the data set that has approximately 400,000 observations, with duplicates based on an ID number. What I want to do is keep the case IDs that correspond to the highest outcome. So, for the following:
    ID Number Date Outcome
    3 2/2/22 4
    3 2/2/22 3
    3 2/2/22 3
    3 2/2/22 2
    I want to keep only the first row because it has the highest code. Some IDs have 5 corresponding values for outcomes; some have 2; I think one even has 10.

    I have tried to google this, but got very confused by duplicates and dups. I'd really appreciate any suggestions anyone has.


  • #2
    Jaylin:
    tweaking your example a bit:
    Code:
    . input ID_Number Date Outcome
    
         ID_Number       Date    Outcome
      1. 3 2022 4
      2. 3 2022 3
      3. 3 2022 3
      4. 3 2022 2
      5. end
    
    . bysort ID_Number: egen wanted=max( Outcome)
    
    . list
    
         +------------------------------------+
         | ID_Num~r   Date   Outcome   wanted |
         |------------------------------------|
      1. |        3   2022         4        4 |
      2. |        3   2022         3        4 |
      3. |        3   2022         3        4 |
      4. |        3   2022         2        4 |
         +------------------------------------+
    
    
    . keep if Outcome==wanted
    
    
    . list
    
         +------------------------------------+
         | ID_Num~r   Date   Outcome   wanted |
         |------------------------------------|
      1. |        3   2022         4        4 |
         +------------------------------------+
    
    .
    Kind regards,
    Carlo
    (Stata 19.0)

    Comment


    • #3
      What's unclear in #1 includes whether you want to do this separately by date. If you do, then code in the spirit of @Carlo Lazzaro's suggestion might be

      Code:
      bysort ID_number Date (Outcome) : keep if _n == _N
      There is no need to calculate the maximum in a new variable -- unless you have missing values on Outcome. If you do, then

      Code:
      gen OK = !missing(Outcome)
      bysort OK ID_number Date (Outcome) : keep if OK & _n == _N
      Leave Date out of the code if it is immaterial.

      Note that the problem here is not one of duplicates, as the problem hinges on selecting observations with different values.
      Last edited by Nick Cox; 08 May 2022, 04:27.

      Comment


      • #4
        Thank you both!!! I had some missing values, so the combination of your suggestions worked perfectly. I am so grateful for your help. Thank you!

        Comment

        Working...
        X