Subset Data by ID

Neil Murray

Join Date: Apr 2021

Posts: 8
#1

Subset Data by ID

29 Apr 2021, 08:46

Dear all,

suppose an unbalanced panel data set reaching from 2010 to 2019 with around 390k observations (long format) and only three variables: "Person ID" (PID), "Income" and "year of survey" (syear). I wish to subset the data in a way that only the oversvations of individuals who gave an answer in 2010 and 2011 are kept in the dataframe.

I tried out different approaches, for example using inlist() combined with the unqiue PID's of 2010 and 2011 but ended up with several errors.

If anyone can come up with an easy solution I would be highly thankful!

Best regards,

Neil
Tags: data, panel, subset
William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

29 Apr 2021, 09:15

Something like the following untested code may help.

Code:

by PID, sort: egen nmatch = total( inlist(year,2010,2011) & !missing(income) ) by PID: keep if nmatch==2
2 likes
Comment

Sergiy Radyakin

Join Date: Apr 2014
Posts: 1867

29 Apr 2021, 09:33

Hello Neil,

try this

Code:

sort pid year
isid pid year
tempvar x z
by pid: egen `x' = count(income) if inlist(year, 2010, 2011)
by pid: egen `z' = max(`x')
keep if `z'==2
drop `x' `z'
list, sepby(pid)

Best, Sergiy

Comment

Neil Murray

Join Date: Apr 2021

Posts: 8
#4

30 Apr 2021, 03:10

Thank you both for taking your time to help me with this issue.

William Lisowski Output produces exactly what I want and seems highly efficient.

Sergiy Radyakin The method is similar to William Lisowski's, however it produced one error:

. sort PID year
. isid PID year
variables persnr _year do not uniquely identify the observations
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#5

30 Apr 2021, 05:20

If you read the output of help isid you will understand that the isid command is reporting an error in your panel data. To be a panel dataset, each combination of PID and year must be distinct. The isid command is telling you that that is not the case - at least one combination of persnr and _year are duplicated in your data. Your data is not a panel, unbalanced or otherwise.

If there is a PID with multiple observations for 2010 or multiple observations for 2011, the results produced by the code in post #2 may be incorrect. Sergiy's code included the isid command to check your assertion that your data was indeed panel data, an assumption on which both his code and mine rested.

Last edited by William Lisowski; 30 Apr 2021, 05:23.
2 likes
Comment
Neil Murray

Join Date: Apr 2021

Posts: 8
#6

30 Apr 2021, 05:50

Thanks again for the excellent clarification! I am a little surprised myself to find duplicates in the dataset and will now go back to the data cleansing with greater care.

Also, by no means did I want to judge which code works better - my current understanding is truly not sufficient for that.

I am glad that the community is so helpful, even for newcomers to Stata.

Best regards Neil
Comment

Announcement

Comment

Comment

Comment

Comment

Comment