
  • How duplicates drop works with surveys imported from SurveyMonkey by sort

    Hi all,

    I'm putting this out here because it was a conundrum that took me hours to resolve, but I think I figured it out. Hoping Nick Cox, who wrote the duplicates command, and/or others with experience will weigh in with thoughts/suggestions for tackling this in the future.

    I had a situation where a previous analyst created "cleaned" datasets from a survey that other analysts used in their analyses, but we only became aware after the analyses were done that some people had filled out the survey more than once. (That part's a long but interesting story having to do with the panel recruiter's sampling approach + SurveyMonkey's metadata naming + us not wanting to collect individual identifiers; I'll write it up at some point so others can learn from our mistake.)

    The previous analyst didn't document all the cleaning, and we didn't have the resources (e.g., money) for me to spend hours recreating their results from various undocumented cleaning files, so my approach was:

    (1) download the source datasets from SurveyMonkey

    (2) show that there were duplicates by individual identifier panelistid (not survey response identifier, which is for some reason labeled respondentid by SurveyMonkey) using the duplicates command
    • duplicates report panelistid
    (3) tag the duplicates
    • cap drop dupe
    • duplicates tag panelistid, gen(dupe)
    • codebook dupe
    (4) create a new variable, ndupe, that numbers each panelist's observations in the order they appear in the dataset, so that ndupe==2 marks the 2nd occurrence of a panelist's response
    • cap drop ndupe
    • bysort panelistid: gen ndupe=_n
    • codebook ndupe
    (5) list the panelistid and the respondentid for the 2nd occurrence of an individual’s response in the dataset
    • list panelistid respondentid if ndupe==2
    (6) document the heck out of everything and tell analysts to use the list I made to drop the specific respondentid (i.e., observation) in whatever analysis dataset they were using. (A consolidated sketch of steps 2 through 5 appears below.)
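    For reference, here is a minimal consolidated sketch of steps (2) through (5) as one runnable block. The filename surveymonkey_export.dta is just a placeholder; panelistid and respondentid are the variables described above.

    Code:
    * placeholder filename for the downloaded SurveyMonkey export
    use surveymonkey_export.dta, clear

    * step 2: report groups of observations sharing a panelistid
    duplicates report panelistid

    * step 3: dupe = number of other observations with the same panelistid
    capture drop dupe
    duplicates tag panelistid, gen(dupe)

    * step 4: number each panelist's observations in whatever order they
    * currently appear within panelistid (this is where sort order matters)
    capture drop ndupe
    bysort panelistid: gen ndupe = _n

    * step 5: list the responses flagged as a panelist's 2nd occurrence
    list panelistid respondentid if ndupe == 2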


    I was pretty happy with myself for coming up with this hack, but when I tried to actually do it myself, I got an unpleasant surprise. (This is the part that took me hours to figure out but probably should have been obvious.) It turned out the SurveyMonkey export was in reverse chronological order. So in the exported data, the observation flagged as the 2nd occurrence and dropped (the technical documentation says "duplicates drop drops all but the first occurrence of each group of duplicated observations.") was actually the first survey a given person filled out, if they filled it out more than once.

    The implications of 2% of panel respondents filling out more than one survey notwithstanding, I finally figured out that I needed to sort the dataset by panelistid and startdate before making the ndupe variable:
    • sort panelistid startdate
    • cap drop ndupe
    • by panelistid: gen ndupe=_n
    • codebook ndupe
    • list panelistid respondentid startdate ndupe if dupe==1, sepby(panelistid)
    Now it was clear that, after the sort, the 2nd observation for a panelist in the dataset was actually the second survey that the panelist filled out.
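    One additional check that may be worth doing at this point: if two of a panelist's responses share the same startdate, the sort cannot order them deterministically, so it helps to list any such ties before deciding which respondentid to drop. A sketch, assuming the dupe variable from step (3) is still in memory:

    Code:
    * flag duplicate responses that share the same startdate within a panelist,
    * where "first" and "second" are still ambiguous after the sort
    bysort panelistid (startdate): gen byte tied = dupe > 0 & _n > 1 & startdate == startdate[_n - 1]
    list panelistid respondentid startdate if tied, sepby(panelistid)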


    My suggestions are:

    Analysts:
    1. if you're using duplicates drop in a situation where individual respondents have completed more than one survey when they shouldn't have, and
    2. you decide that you want to drop the 2nd survey they filled out,
    then make sure to sort your dataset first by the individual identifier and then by the date they completed the survey (a minimal sketch follows below).
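    A minimal sketch of that suggestion, using this thread's variable names (duplicates drop requires the force option whenever a varlist is specified, because the surplus observations differ on other variables):

    Code:
    * put each panelist's responses in chronological order
    sort panelistid startdate

    * keep only the first (earliest) response per panelist; force is required
    * because the surplus observations are not duplicates on all variables
    duplicates drop panelistid, force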

    Stata:

    It would help to clarify this in the technical documentation by adding something like this:
    duplicates drop drops all but the first occurrence of each group of duplicated observations in the dataset and makes no assumptions about the order of the observations. So, for example, if you want to drop the second survey an individual respondent completed, you would first have to sort on the individual identifier and then on the date they completed the survey. It might be good to explore this a bit first using the sort command and generating a variable that tags the 2nd observation, so you can be clear about which observations will be dropped.
    Nick and/or others, happy to hear your thoughts on this so I can make sure I’m analyzing and documenting this correctly.

    Thanks in advance,

    Michelle

  • #2
    I've spent a lot of time working with SurveyMonkey output (and hated the export format of that platform), so I can empathize with the challenges of having to take over a previous analyst's code and understand what has been or must be done to move forwards.

    I don't think the documentation needs improvement.
    -duplicates- (as the name implies) regards exactly the information contained in -varlist- as the only relevant information, and duplicate copies are literally redundant. In this sense, if you specify only the respondent's identifier as the -varlist-, and there happen to be multiple entries by one or more respondents, then you are in effect asking -duplicates- to disregard all but one observation for each unique value of that identifier. Said another way, -duplicates- is not supposed to prioritize one observation over others, because such a priority would imply that the observations are not interchangeable copies after all.

    My advice is this: it is up to the programmer or analyst to know whether there are additional constraints (such as sort order) that need to be considered and, if so, what the most appropriate approach is. If the planned approach doesn't apply, or if you are unsure which of multiple approaches to use, conduct a tiny experiment on some simulated or representative data. In this particular example of keeping the chronologically first response, the following would have been more appropriate than -duplicates-.

    Code:
    * using this thread's variable names: keep each panelist's chronologically first response
    bysort panelistid (startdate) : keep if _n==1
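    For instance, a tiny experiment along those lines might look like this (the values below are made up; panelist 1 has two responses entered newest-first, mimicking the export order):

    Code:
    * toy dataset: panelist 1 responded twice, listed newest first
    clear
    input panelistid respondentid str10 sdate
    1 101 "2022-06-20"
    1 102 "2022-06-10"
    2 103 "2022-06-15"
    end
    gen startdate = date(sdate, "YMD")
    format startdate %td

    * keep each panelist's chronologically first response
    bysort panelistid (startdate) : keep if _n==1
    list panelistid respondentid startdate
    Here the kept observations should be respondentid 102 and 103, that is, the earlier of panelist 1's two responses, regardless of the order the rows arrived in.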



    • #3
      My problem was that when I saw the duplicates command, my mind interpreted "drops all but the first occurrence" as "first chronologically" for no valid reason. I just assumed that the exported dataset would be in chronological order. Now my eyes are opened.


      Originally posted by Leonardo Guizzetti
      If the planned approach doesn't apply, or if you are unsure which of multiple approaches to use, conduct a tiny experiment on some simulated or representative data. In this particular example of keeping the chronologically first response, the following would have been more appropriate than -duplicates-.

      Code:
      * using this thread's variable names: keep each panelist's chronologically first response
      bysort panelistid (startdate) : keep if _n==1
      This is always a great thing to remember. I'll be sure to take that step in the future, and I hope this thread will help someone else, at least.

      Michelle



      • #4
        In fact, this problem arose in part because the O.P. did not actually use the -duplicates drop- command. Rather, -duplicates tag- was used to identify where surplus observations per panelist are found, and then, not understanding the organization of the data, the O.P. used incorrect code to create a drop list.

        Had O.P. attempted to remove surplus observations with the -duplicates drop- command, Stata would have dropped zero observations, thereby conserving the data, and alerting her to the fact that the surplus observations are not in fact duplicates, but contain different information. Hopefully, that would have prompted a closer look at the data, with which she would have discovered the unanticipated ordering of the observations.

        More generally, in Stata it is never wise to make assumptions about the sort order of data unless you, yourself, have just sorted them and are sure that nothing done since then might scramble the sort order, or you have just consulted the extended macro function -sortedby-.
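        For example, a quick way to check what the data are currently sorted by (an empty result means Stata does not consider the data sorted at all):

        Code:
        * show which variables, if any, the dataset is currently sorted by
        local sortvars : sortedby
        display "sorted by: `sortvars'"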

        Added: I will add my own sympathies for having to work with this kind of data. My experience has been that it is not only Survey Monkey. Rather, survey sites whose primary customers are businesses generally produce datasets that are not easily adapted to research-type analysis. Research-friendly data is not what their customers are looking for. Their customers are typically looking for something they can throw into a spreadsheet to do simple summaries, not in-depth analysis.
        Last edited by Clyde Schechter; 26 Jun 2022, 14:22.



        • #5
          Originally posted by Clyde Schechter
          In fact, this problem arose in part because the O.P. did not actually use the -duplicates drop- command. Rather, -duplicates tag- was used to identify where surplus observations per panelist are found, and then, not understanding the organization of the data, the O.P. used incorrect code to create a drop list.

          Had O.P. attempted to remove surplus observations with the -duplicates drop- command, Stata would have dropped zero observations, thereby conserving the data, and alerting her to the fact that the surplus observations are not in fact duplicates, but contain different information. Hopefully, that would have prompted a closer look at the data, with which she would have discovered the unanticipated ordering of the observations.
          Just to be clear, I did have success using the duplicates drop command in the same way I used duplicates tag, by including the variable with duplicates:

          duplicates drop panelistid, force

          That's how I discovered that it didn't drop the observations I expected because the dataset was not sorted in chronological order (my incorrect assumption).

          Originally posted by Clyde Schechter
          More generally, in Stata it is never wise to make assumptions about the sort order of data unless you, yourself, have just sorted them and are sure that nothing done since then might scramble the sort order, or you have just consulted the extended macro function -sortedby-.

          Added: I will add my own sympathies for having to work with this kind of data. My experience has been that it is not only Survey Monkey. Rather, survey sites whose primary customers are businesses generally produce datasets that are not easily adapted to research-type analysis. Research-friendly data is not what their customers are looking for. Their customers are typically looking for something they can throw into a spreadsheet to do simple summaries, not in-depth analysis.
          Thanks again for your help. The importance of understanding the way data is sorted is probably one of the biggest lessons I've learned about Stata. The other big lesson is about working with a commercial company that collects data, rather than collecting it myself or using a recognized dataset from an established epidemiological study.

          Michelle

