Identifying same respondents in many rounds of a survey

Mikhail Balaev

Join Date: Aug 2015
Posts: 13

14 Oct 2021, 10:35

Hello,
I have some 20 rounds of a survey where each respondent has a unique ID. Some IDs reappear in the following surveys, some do not. The following surveys have new respondents who were not present in previous survey rounds. I want to filter only those IDs that were present in all survey stages. I combined all ID variables in a single data set with 20 variables, but I'm not sure how to check all of them at the same time for the values that appear in all variables. Here's an example of the data:

S1	S2	S3
V012630001S55310123330013	V019740001S15410815060012	V030010006S10160105900112
V012630001S55310666920012	V019740001S16410444180013	V030010006S10160107710122
V012630001S55310737830011	V019740001S30410254100013	V030010006S10160207700122
V012630001S55310757820012	V019740001S35410998670013	V030010006S10160459420112

So, for example, I need to see if the value of "V012630001S55310123330013" appears in S2 AND S3 AND ... S20.

Thank you!

George Ford

Join Date: Aug 2014

Posts: 3146
#2

14 Oct 2021, 10:42

Code:

bys id: g count = _n
bys id: egen appear = max(count)
summ appear
keep if appear == r(max)

If you are merging separate files, use -joinby- and it will drop any unmatched observations. At the end, you'll have only the IDs that appear in all surveys.
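As a rough sketch of dropping unmatched IDs while combining the files, the example below uses -merge 1:1- with keep(match) in place of -joinby-. It assumes each round is stored in its own file with a single variable named id that is unique within the round. The tempfiles round1 to round3 and the toy IDs are made up for illustration; the real data would have 20 files.

```stata
* Hypothetical toy data: three rounds, each saved as its own file
clear
tempfile round1 round2 round3

input str1 id
"A"
"B"
"C"
end
save "`round1'"

clear
input str1 id
"B"
"C"
"D"
end
save "`round2'"

clear
input str1 id
"C"
"B"
"E"
end
save "`round3'"

* Start from round 1 and keep only IDs that are matched in each later round
use "`round1'", clear
forvalues r = 2/3 {
    merge 1:1 id using "`round`r''", keep(match) nogenerate
}

sort id
list id
```

With the toy data only B and C remain, because they are the only IDs present in all three files.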
Clyde Schechter

Join Date: Apr 2014

Posts: 30095
#3

14 Oct 2021, 10:46

In your example, there are no IDs that appear in all three stages. But I think the following code does what you need:

Code:

gen long obs_no = _n
reshape long s, i(obs_no)
by s (_j), sort: keep if _N == 20
drop _j obs_no
duplicates drop

Note: this code assumes there are no variables s21 or higher in your data set. It also assumes that no ID appears more than once in any of the s1 through s20 variables separately.

At the end of this code, the data set contains all and only those IDs that appear in all of s1 through s20.
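If the second assumption is in doubt, one possible variation is to count the distinct rounds in which each ID occurs rather than the number of rows. The sketch below is only an illustration on made-up data with three variables s1 to s3 (so the threshold is 3, not 20); the names round, first and n_rounds are hypothetical.

```stata
* Hypothetical toy data in wide layout; "C" appears twice in s1
clear
input str1 s1 str1 s2 str1 s3
"A" "B" "C"
"B" "C" "B"
"C" "D" "E"
"C" "A" "B"
end

gen long obs_no = _n
reshape long s, i(obs_no) j(round)

* Flag one observation per distinct (ID, round) pair, then count rounds per ID
egen byte first = tag(s round)
egen n_rounds = total(first), by(s)

keep if n_rounds == 3      // would be 20 for the full data
keep s
duplicates drop
sort s
list
```

Here B and C are kept. A plain row count would give C a count of 4 because of its duplicate in s1, so a test of _N == 3 would drop it.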

Added: Crossed with #2. I don't understand the approach there, which uses a variable called id that is not instantiated in the example data. Nevertheless, I agree with his suggestion that while building the merged data set it is probably best to remove "deficient" IDs during the -merge- process rather than weed them out at the end.

Last edited by Clyde Schechter; 14 Oct 2021, 10:50.
Mikhail Balaev

Join Date: Aug 2015

Posts: 13
#4

14 Oct 2021, 15:18

Thank you - both suggestions are great - either merging or reshaping will solve the problem.
George Ford

Join Date: Aug 2014

Posts: 3146
#5

15 Oct 2021, 07:34

If the datasets are identical in layout (have the same variables), you can use -append- to stack them (creating a long dataset as in #3). Then use #2.
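A minimal sketch of that route, assuming every file holds one variable named id and that no ID repeats within a round. The tempfiles and toy IDs are made up, and the count is compared with the known number of rounds (3 here, 20 in the real data) rather than with r(max) as in #2.

```stata
* Hypothetical toy data: three rounds with the same layout
clear
tempfile round1 round2 round3

input str1 id
"A"
"B"
"C"
end
save "`round1'"

clear
input str1 id
"B"
"C"
"D"
end
save "`round2'"

clear
input str1 id
"C"
"B"
"E"
end
save "`round3'"

* Stack the rounds into one long data set
use "`round1'", clear
append using "`round2'" "`round3'"

* Count appearances per ID and keep IDs seen in every round
bysort id: gen appear = _N
keep if appear == 3
bysort id: keep if _n == 1    // one row per surviving ID
list id
```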
Mikhail Balaev

Join Date: Aug 2015

Posts: 13
#6

17 Oct 2021, 08:29

thanks!

Last edited by Mikhail Balaev; 17 Oct 2021, 09:03. Reason: figured out