Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Detecting duplicate observations in data

    I am working with a survey data set of over 300 variables and near 10,000 observations. There is a risk that some data are fabricated. I need to detect observations that are over 90% similar. Do you think this is possible with Stata? I know command duplicate that detects duplications, but only 100% duplications can be detected.

  • #2
    Masood:
    did you take a look at the -group- option under -help egen-?
    Kind regards,
    Carlo
    (Stata 19.0)

    Comment


    • #3
      "over 90% similar": what's your definition of that?

      Comment


      • #4
        Carlo, option group will not help as I have over 300 variables, and then it will make millions of groups... It will be out of control.

        Prof. Cox, so the data set is the responses to a questionnaire of 300 questions. I expect no two respondents to have over 90% similarities in answer the 300 questions. In other words, I want to tag observations that are over 90% similar to any other observation in responses to 300 questions/variables.

        Comment


        • #5
          I don't think that's a specification precise enough to allow easy coding (or indeed any coding).

          Comment


          • #6
            Thanks Prof. Cox. Someone told me of a windows software called Hunter that can detect similar observations with a cut-off of any %. I couldn't find Hunter, so I was hoping that Stata could do such thing - maybe in the future?

            Thanks

            Comment


            • #7
              I can only speak for myself. I am credited with writing duplicates in Stata and stop short completely of all fuzzy, probabilistic or partial definitions of duplicates.

              Comment


              • #8
                Prof. Cox, you will do the world of survey analysts / Stata users a great favor if you could develop such program that could check if any two observation are let's say 90% similar to each other. Irrespective of that, I always admire your work.

                Comment


                • #9
                  Thanks for the compliment, but it really won't happen. The problem is quite literally far too fuzzy to have any appeal to me.

                  Comment


                  • #10
                    If the responses are multiple choice (single character responses), one approach would be to create a single string of 300 characters containing the responses, and then apply -reclink- (type -findit reclink-) to match the file to itself, generating a match score. Then you can see which ones have score greater than, eg, 90%.

                    Comment


                    • #11
                      Here's a brute-force approach. I'm defining the duplication proportion, psame`i', as the proportion of the nvar variables for which the current observation is the same as the ith observation. Despite the loop over observations, the speed of the following seemed OK, as the run time scaled at less than _N^2.

                      The result is a variable called "dupelist" that contains a list of observation numbers that duplicate the current case at greater than the specified criterion.

                      Code:
                      clear
                      local criterion = 0.7 // duplication proportion, just for illustration
                      // Make example data
                      set obs 100
                      gen id = string(_n)
                      local nvar = 8
                      local nval = 4
                      forval i = 1/`nvar' {
                         gen byte x`i' = ceil(runiform() * `nval')
                      }
                      // Compare each observation to each other observation.
                      forval i = 1/`=_N' {
                         gen psame`i' = 0
                         forval j = 1/`nvar' {
                            qui replace psame`i' = psame`i' + 1 if (x`j' ==  x`j'[`i'])
                         }
                         qui replace psame`i' = psame`i'/`nvar'
                      }
                      // Make a list of near-duplicates for each observation
                      gen str dupelist = ""  
                      forval i = 1/`=_N' {
                         qui replace dupelist = dupelist + " " + string(`i') ///
                           if (`i' != _n) & (psame`i' > `criterion')
                      }
                      // Inspect the near-duplicates and decide what you want to do.
                      list dupelist if !missing(dupelist)
                      Does this fit the problem as you are thinking of it?

                      Regards, Mike

                      Comment


                      • #12
                        Thanks Mike. I won't have a chance to work this on Stata until this coming werkend. I am not sure however how I can I include all the variables, let's say 300 using this approach. Anyway, I will let you know how it goes.

                        Comment


                        • #13
                          Hey,
                          I have a similar but slightly different Problem.
                          In my case its just about one Variable (Company Name), where I would like to detect Close duplicates.
                          The reason why I want to do so, is that I´m suspecting subsidiaries in the Company list.
                          So the question is, is it possible to detect them with a similar Close. I know that I will have to check the results by Hand, but at least I would have an indicator.
                          BMW Germany
                          BMW USA
                          BMW Canda
                          Bosch Thailand
                          Bosch gmbh

                          Comment


                          • #14
                            Hi Jay,
                            Your problem, fortunately is not as complicated as mine.
                            I think in your case you can use strpos, a string function. How it works? So if you look at the following example, I created a variable and assigned value 1 if BMW is mentioned in your string variable, assigned 2 if Bosch is mentioned, 3 if Toyota is mentioned, and so on.

                            gen Made=1 if strpos(variable_name, "BMW")
                            replace Made=2 if strpos(variable_name, "Bosch")
                            replace Made=3 if strpos(variable_name, "Toyota")
                            replace Made=4 .
                            ..

                            the create labels not to mistake the codes:

                            lab def made 1 BMW 2 Bosch 3 Toyota 4 ...
                            lab val Made made


                            This works well if:
                            1. You know all the names
                            2. You know that there is no overlap of names

                            Comment


                            • #15
                              I had posed this question long time ago, and since then I have found a solution.

                              My question was to flag almost similar observations (not exactly similar, which duplicate does). For instance, finding observations that are similar 90% and above.

                              Noble Kuriakose have developed percentmatch, which does the job and is available on SSC.

                              Comment

                              Working...
                              X