  • Identifying same respondents in many rounds of a survey

    Hello,
    I have some 20 rounds of a survey where each respondent has a unique ID. Some IDs reappear in the following surveys, some do not. The following surveys have new respondents who were not present in previous survey rounds. I want to filter only those IDs that were present in all survey stages. I combined all ID variables in a single data set with 20 variables, but I'm not sure how to check all of them at the same time for the values that appear in all variables. Here's an example of the data:
    S1 S2 S3
    V012630001S55310123330013 V019740001S15410815060012 V030010006S10160105900112
    V012630001S55310666920012 V019740001S16410444180013 V030010006S10160107710122
    V012630001S55310737830011 V019740001S30410254100013 V030010006S10160207700122
    V012630001S55310757820012 V019740001S35410998670013 V030010006S10160459420112
    So, for example, I need to see if the value of "V012630001S55310123330013" appears in S2 AND S3 AND ... S20.

    Thank you!

  • #2
    Code:
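    * assumes long data: one row per respondent per round, with the ID in a variable named id
    bysort id: gen appear = _N
    summarize appear, meanonly
    * keeps the IDs with the most appearances (that is all rounds only if some ID is in every round)
    keep if appear == r(max)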
    If you are merging separate files, use -joinby-; by default it drops any unmatched observations, so at the end you'll have only the IDs that appear in all surveys.
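
    As a rough sketch of the -joinby- route, assuming each round is stored as its own file with one row per respondent and the ID in a variable named id (the tempfile names and the one-letter IDs below are made up for illustration, with three rounds standing in for 20):

    ```stata
    * hypothetical round 1
    clear
    input str1 id
    "A"
    "B"
    "C"
    end
    tempfile round1
    save "`round1'"

    * hypothetical round 2
    clear
    input str1 id
    "B"
    "A"
    "D"
    end
    tempfile round2
    save "`round2'"

    * hypothetical round 3 stays in memory
    clear
    input str1 id
    "A"
    "C"
    "B"
    end

    * each -joinby- keeps only IDs found in both data sets (default: unmatched(none))
    joinby id using "`round1'"
    joinby id using "`round2'"
    list
    ```

    Because -joinby- forms all pairwise combinations within matching IDs, this relies on each ID appearing at most once per file; if that holds, -merge 1:1 id using ..., keep(match)- gives the same result.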



    • #3
      In your example, there are no IDs that appear in all three stages. But I think the following code does what you need:

      Code:
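      * tag each row so -reshape- has a unique identifier
      gen long obs_no = _n
      * stack s1-s20 into a single variable s, with _j recording the round
      reshape long s, i(obs_no)
      * an ID present in every round occurs exactly 20 times
      by s (_j), sort: keep if _N == 20
      * reduce to one row per surviving ID
      drop _j obs_no
      duplicates drop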
      Note: this code assumes there are no variables s21 or higher in your data set. It also assumes that no ID appears more than once in any of the s1 through s20 variables separately.

      At the end of this code, the data set contains all and only those IDs that appear in all of s1 through s20.
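
      To see the logic on a small case, here is a sketch with three rounds instead of 20, so the threshold becomes _N == 3. The one-letter IDs are made up for illustration, and the same assumption applies that no ID repeats within a single round:

      ```stata
      clear
      input str1 s1 str1 s2 str1 s3
      "A" "B" "A"
      "B" "A" "C"
      "C" "D" "B"
      end

      gen long obs_no = _n
      reshape long s, i(obs_no)
      * A and B occur in all three rounds; C occurs in two and D in one
      by s (_j), sort: keep if _N == 3
      drop _j obs_no
      duplicates drop
      list
      ```

      Only A and B should remain, one row each.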

      Added: Crossed with #2. I don't understand the approach there, which uses a variable called id that is not instantiated in the example data. Nevertheless, I agree with his suggestion that while building the merged data set it is probably best to remove "deficient" IDs during the -merge- process rather than weed them out at the end.
      Last edited by Clyde Schechter; 14 Oct 2021, 10:50.



      • #4
        Thank you - both suggestions are great - either merging or reshaping will solve the problem.



        • #5
          If the datasets are identical in layout (have the same variables), you can use -append- to stack them (creating a long dataset as in #3). Then use #2.
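
          A minimal sketch of that combination, assuming each round is a separate file with the ID in a variable named id. The variable names round, first_in_round and n_rounds, the tempfile names, and the one-letter IDs are all made up for illustration, and three rounds stand in for 20. Instead of counting rows as in #2, it counts distinct rounds per ID, which also tolerates an ID repeated within a round:

          ```stata
          * hypothetical round 1
          clear
          input str1 id
          "A"
          "B"
          "C"
          end
          gen byte round = 1
          tempfile r1
          save "`r1'"

          * hypothetical round 2
          clear
          input str1 id
          "B"
          "A"
          "D"
          end
          gen byte round = 2
          tempfile r2
          save "`r2'"

          * hypothetical round 3 stays in memory
          clear
          input str1 id
          "A"
          "C"
          "B"
          end
          gen byte round = 3

          * stack the three rounds into one long data set
          append using "`r1'" "`r2'"

          * flag the first row of each id-round pair, then count rounds per ID
          egen first_in_round = tag(id round)
          bysort id: egen n_rounds = total(first_in_round)

          * keep one row for each ID seen in all 3 rounds
          keep if n_rounds == 3
          bysort id: keep if _n == 1
          keep id
          list
          ```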



          • #6
            thanks!
            Last edited by Mikhail Balaev; 17 Oct 2021, 09:03. Reason: figured out

