Hi all,
I'm putting this out here because it was a conundrum that took me hours to resolve, but I think I figured it out. Hoping Nick Cox, who wrote the duplicates command, and/or others with experience will weigh in with thoughts/suggestions for tackling this in the future.
I had a situation where a previous analyst created "cleaned" datasets from a survey that other analysts used in their analyses, but we only became aware that some people had filled out the survey more than once after the analyses were done. (That part's a long but interesting story having to do with the panel recruiter’s sampling approach + SurveyMonkey's metadata naming + us not wanting to collect individual identifiers; I’ll write it up at some point so others can learn from our mistake.)
The previous analyst didn't document all the cleaning and we didn't have the resources (e.g., money) for me to spend hours recreating their results from various undocumented cleaning files, so my approach was:
(1) download the source datasets from SurveyMonkey
(2) show that there were duplicates on the individual identifier panelistid (not the survey response identifier, which SurveyMonkey for some reason labels respondentid), using the duplicates command:
- duplicates report panelistid
- cap drop dupe
- duplicates tag panelistid, gen(dupe)
- codebook dupe
- cap drop ndupe
- bysort panelistid: gen ndupe = _n
- codebook ndupe
- list panelistid respondentid if ndupe==2
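(Side note in case it's useful to anyone doing something similar: duplicates also has an examples subcommand, which lists one example observation per group of duplicates and would have been another quick way to eyeball these.)
- duplicates examples panelistid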
I was pretty happy with myself for coming up with this hack, but when I actually ran it, I got an unpleasant surprise. (This is the part that took me hours to figure out but probably should have been obvious.) It turned out the SurveyMonkey export was in reverse chronological order. So for anyone who filled out the survey more than once, the 2nd observation within their panelistid, the one that would be dropped (the technical documentation says “duplicates drop drops all but the first occurrence of each group of duplicated observations.”), was actually the first survey that person filled out, not the second.
The implications of 2% of panel respondents filling out more than one survey notwithstanding, I finally figured out that I needed to sort the dataset by panelistid and startdate before creating the ndupe variable:
- sort panelistid startdate
- cap drop ndupe
- by panelistid: gen ndupe=_n
- codebook ndupe
- list panelistid respondentid startdate ndupe if dupe > 0, sepby(panelistid)
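For completeness, once the data are sorted this way, I think the drop step itself would look something like this (duplicates drop with a varlist requires the force option, and it keeps the first observation in the current sort order, which after the sort above is the earliest startdate):
- duplicates drop panelistid, force // keeps each panelist's chronologically first survey
- duplicates report panelistid // confirm no surplus observations remain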
My suggestions are:
Analysts:
- if you’re using duplicates drop in a situation where individual respondents completed more than one survey when they shouldn’t have, and
- you decide that you want to drop the 2nd survey they filled out,
- then first sort on the individual identifier and then on the survey date, and check which observations will actually be dropped before you drop them (see the sketch after this list).
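Something like this sketch, reusing the variable names from my example above; dropping on the within-person counter directly makes it explicit which row goes, instead of relying on duplicates drop keeping the first occurrence:
- sort panelistid startdate
- by panelistid: gen ndupe = _n // 1 = the first survey that person completed
- list panelistid respondentid startdate if ndupe > 1, sepby(panelistid) // the rows that would be dropped
- drop if ndupe > 1 // explicitly drop everything after each person's first survey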
Stata:
It would help to clarify this in the technical documentation by adding something like this:
duplicates drop drops all but the first occurrence of each group of duplicated observations in the dataset and makes no assumptions about the way observations are ordered. So, for example, if you want to drop the second survey an individual respondent completed, you would first have to sort on the individual identifier and then on the date they completed the survey. It might be good to explore this a bit more first using the sort command and generating a variable that tags the 2nd observation, so you can be clear about which observations will be dropped.
Nick and/or others, happy to hear your thoughts on this so I can make sure I’m analyzing and documenting this correctly.
Thanks in advance,
Michelle