Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Subset Data by ID

    Dear all,

    suppose an unbalanced panel data set reaching from 2010 to 2019 with around 390k observations (long format) and only three variables: "Person ID" (PID), "Income" and "year of survey" (syear). I wish to subset the data in a way that only the oversvations of individuals who gave an answer in 2010 and 2011 are kept in the dataframe.

    I tried out different approaches, for example using inlist() combined with the unqiue PID's of 2010 and 2011 but ended up with several errors.

    If anyone can come up with an easy solution I would be highly thankful!

    Best regards,

    Neil

  • #2
    Something like the following untested code may help.

    Code:
    by PID, sort: egen nmatch = total( inlist(year,2010,2011) & !missing(income) ) 
    by PID: keep if nmatch==2

    Comment


    • #3
      Hello Neil,

      try this

      Code:
      sort pid year
      isid pid year
      tempvar x z
      by pid: egen `x' = count(income) if inlist(year, 2010, 2011)
      by pid: egen `z' = max(`x')
      keep if `z'==2
      drop `x' `z'
      list, sepby(pid)
      Best, Sergiy

      Comment


      • #4
        Thank you both for taking your time to help me with this issue.

        William Lisowski Output produces exactly what I want and seems highly efficient.

        Sergiy Radyakin The method is similar to William Lisowski's, however it produced one error:

        . sort PID year
        . isid PID year
        variables persnr _year do not uniquely identify the observations





        Comment


        • #5
          If you read the output of help isid you will understand that the isid command is reporting an error in your panel data. To be a panel dataset, each combination of PID and year must be distinct. The isid command is telling you that that is not the case - at least one combination of persnr and _year are duplicated in your data. Your data is not a panel, unbalanced or otherwise.

          If there is a PID with multiple observations for 2010 or multiple observations for 2011, the results produced by the code in post #2 may be incorrect. Sergiy's code included the isid command to check your assertion that your data was indeed panel data, an assumption on which both his code and mine rested.
          Last edited by William Lisowski; 30 Apr 2021, 05:23.

          Comment


          • #6
            Thanks again for the excellent clarification! I am a little surprised myself to find duplicates in the dataset and will now go back to the data cleansing with greater care.

            Also, by no means did I want to judge which code works better - my current understanding is truly not sufficient for that.

            I am glad that the community is so helpful, even for newcomers to Stata.

            Best regards Neil

            Comment

            Working...
            X