Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to only keep data value if it has all associated values in a certain range

    Dear Statalist user,

    I have a problem regarding my survey panel data that should be fairly easy to solve but I am unable to accomplish it. I found similar threads but was unable to find a fitting solution.
    (e.g.https://www.statalist.org/forums/for...-in-panel-data)

    My data is structured as follows; I was unable to present a data example straight from my set since my school computer will not let me install datax:

    PID_____year
    101_____2000
    101_____2001
    101_____2002
    101_____2003
    201_____2001
    201_____2002

    I have multiple person ID's (PID) and I want to know how to isolate only the PIDs that have data from 2000 to 2003 consistently.

    For example, here it would keep PID 101, but would drop PID 201 since it does not have all these year values.

    So far, I have only achieved results such as a range, but then that includes all PIDs that have years 2000 to 2003 regardless of whether it is consistent for all years in between.

    I have used keep commands so that all the years I wanted were there, as well as bysort PID year: drop if _N<=3, but these either just keep a range of values that don't have consistent yearly observations, or they delete all my observations.

    Please if you know a solution to keep my PID's that only have observations for all the years from 2000 to 2003, I would greatly appreciate it.

    Thanks,

    Josh

  • #2
    Does this help? The count should be 4 for you to want the panel.

    Code:
    clear
    input PID     year
    101     2000
    101     2001
    101     2002
    101     2003
    201     2001
    201     2002
    end
    
    egen count = total(inrange(year, 2000, 2003)), by(PID)
    
    list, sepby(PID)
    
    
         +--------------------+
         | PID   year   count |
         |--------------------|
      1. | 101   2000       4 |
      2. | 101   2001       4 |
      3. | 101   2002       4 |
      4. | 101   2003       4 |
         |--------------------|
      5. | 201   2001       2 |
      6. | 201   2002       2 |
         +--------------------+
    Note that

    Code:
    bysort PID year: drop if _N <= 3
    will indeed drop everything as each PID, year combination occurs at most once.

    Code:
    bysort PID: drop if _N < 4
    would be fine if there are no years before 2000 or after 2003.


    Comment


    • #3
      Hi Nick,

      Thanks for your answer.

      I see using the

      egen count = total(inrange(year, 2000, 2003)), by(PID)

      and

      bysort PID: drop if _N < 4

      commands, I use able to get only the PIDs that had responses for all fours years (2000-2003). I just notice now one difference with my data that is remaining, which is that it seems to no longer be ordered chronologically.

      For example, instead of
      PID_____year
      101_____2000
      101_____2001
      101_____2002

      it is

      PID_____year
      101_____2001
      101_____2000
      101_____2002

      I'm new to stata and I don't know if this is an issue, but for the sake of ease, I was just wondering if there's a way to change it back to the former chronological format. Thanks again.

      Comment


      • #4
        You may not be able to download dataex but you should be able to use CODE identifiers.

        Code:
        sort PID year
        will resolve this.

        Comment

        Working...
        X