  • Keeping most recent variable based on different years

    Hello,

    I have the data below and I am trying to keep only the most recent year's observation for each ID. I have tried several approaches, none of which worked. Can someone please help? Thanks!
    ID Year Attribute
    A564 2018 Dog
    A564 2019 Dog
    A564 2020 Dog
    A447 2018 Cat
    A447 2019 Cat
    A125 2019 Bird
    A125 2020 Bird
    A125 2021 Bird
    A478 2020 Mouse
    My goal is to keep only the most recent observation for each ID (i.e., 2020 for Dog, 2019 for Cat, 2021 for Bird, and 2020 for Mouse). I am trying the code below, but it is not working:
    Code:
     by ID, gsort -Year: gen keep = _n == 1
    Stata doesn't allow me to use gsort with by. I also tried running gsort on the preceding line, and then the error was that by cannot run because the data are not sorted. I even tried creating an increasing variable (order) that takes the value 1 for 2021, 2 for 2022 and so on, and that didn't work either.

    Thanks!

  • #2
    No need for -gsort- here anyway:
    Code:
    by ID (Year), sort: keep if _n == _N
    There are situations where you need to use -gsort- and in those cases you have to run -gsort- first as a separate command. Then in your -by- command, you must terminate your list of sort variables before you reach the first variable that is reverse sorted, and not call for -sort- in the -by-. So, for example, to do this one with -gsort- you could do it as:
    Code:
    gsort ID -Year
    by ID: keep if _n == 1 // NOTE: No mention of Year, nor of -sort-
    That said, this approach is less transparent and should be reserved for situations that cannot be done in the simpler way shown above.
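
    For completeness, one more way to get a descending order inside -by- without -gsort- is to sort on a negated copy of the key. This is only a sketch: negyear is a hypothetical helper variable, not something from the original post, and it assumes Year is numeric.

    ```stata
    * Sketch only: negyear is a hypothetical helper variable
    gen negyear = -Year                    // largest Year becomes the smallest negyear
    bysort ID (negyear): keep if _n == 1   // first row per ID is now the most recent year
    drop negyear                           // remove the helper once it has done its job
    ```

    This keeps the same observations as the -_n == _N- version above, so it is mainly useful when some sort keys need to ascend and others descend.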



    • #3

      Code:
      bysort ID (Year) : keep if _n == _N
      is likely to be closer to what you want.



      • #4
        Well your code doesn't work because it's not legal syntax (and smells like ChatGPT).

        Here's a reproducible data example from your post.

        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input str4 id int year str5 attribute
        "A564" 2018 "Dog"  
        "A564" 2019 "Dog"  
        "A564" 2020 "Dog"  
        "A447" 2018 "Cat"  
        "A447" 2019 "Cat"  
        "A125" 2019 "Bird"
        "A125" 2020 "Bird"
        "A125" 2021 "Bird"
        "A478" 2020 "Mouse"
        end
        
        bysort id (year) : gen keep = _n == _N
        Result

        Code:
        . list, sepby(id)
        
             +-------------------------------+
             |   id   year   attrib~e   keep |
             |-------------------------------|
          1. | A125   2019       Bird      0 |
          2. | A125   2020       Bird      0 |
          3. | A125   2021       Bird      1 |
             |-------------------------------|
          4. | A447   2018        Cat      0 |
          5. | A447   2019        Cat      1 |
             |-------------------------------|
          6. | A478   2020      Mouse      1 |
             |-------------------------------|
          7. | A564   2018        Dog      0 |
          8. | A564   2019        Dog      0 |
          9. | A564   2020        Dog      1 |
             +-------------------------------+
        When sorting is done through the -by- prefix (by ..., sort: or bysort), Stata only knows how to sort in one direction, namely ascending order. You can read more by looking at the documentation for by, bysort, sort and system variables (which explains the meaning of _N, among others).
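
        To make the role of _n and _N concrete, here is a small sketch that continues from the example above. It assumes the -dataex- block and the -gen keep- line have already been run. The names seq and nobs are hypothetical, chosen only for illustration.

        ```stata
        * Assumes id, year and keep exist from the example above
        bysort id (year) : gen seq = _n    // position within each id, 1 = earliest year
        by id : gen nobs = _N              // number of observations for that id
        isid id year                       // stops with an error if any id-year pair repeats
        keep if keep == 1                  // retain only the latest year for each id
        ```

        Within a -by- group, _n is the running observation number and _N is the group size, so _n == _N picks out the last row of each group. The -isid- check matters because, with duplicate id-year pairs, which of the tied rows ends up last is not determined.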



        • #5
          Thank you all
