Detecting duplicate observations in data

Masood Sadat

Join Date: Aug 2014

Posts: 38
#1

Detecting duplicate observations in data

20 Jul 2015, 09:20

I am working with a survey data set of over 300 variables and near 10,000 observations. There is a risk that some data are fabricated. I need to detect observations that are over 90% similar. Do you think this is possible with Stata? I know command duplicate that detects duplications, but only 100% duplications can be detected.
Tags: None
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17850
#2

20 Jul 2015, 09:23

Masood:
did you take a look at the -group- option under -help egen-?

Kind regards,
Carlo
(Stata 19.0)
Comment
Nick Cox

Join Date: Mar 2014

Posts: 36053
#3

20 Jul 2015, 09:39

"over 90% similar": what's your definition of that?
Comment
Masood Sadat

Join Date: Aug 2014

Posts: 38
#4

20 Jul 2015, 11:24

Carlo, option group will not help as I have over 300 variables, and then it will make millions of groups... It will be out of control.

Prof. Cox, so the data set is the responses to a questionnaire of 300 questions. I expect no two respondents to have over 90% similarities in answer the 300 questions. In other words, I want to tag observations that are over 90% similar to any other observation in responses to 300 questions/variables.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 36053
#5

20 Jul 2015, 11:30

I don't think that's a specification precise enough to allow easy coding (or indeed any coding).
1 like
Comment
Masood Sadat

Join Date: Aug 2014

Posts: 38
#6

20 Jul 2015, 11:35

Thanks Prof. Cox. Someone told me of a windows software called Hunter that can detect similar observations with a cut-off of any %. I couldn't find Hunter, so I was hoping that Stata could do such thing - maybe in the future?

Thanks
Comment
Nick Cox

Join Date: Mar 2014

Posts: 36053
#7

20 Jul 2015, 11:53

I can only speak for myself. I am credited with writing duplicates in Stata and stop short completely of all fuzzy, probabilistic or partial definitions of duplicates.
Comment
Masood Sadat

Join Date: Aug 2014

Posts: 38
#8

20 Jul 2015, 12:04

Prof. Cox, you will do the world of survey analysts / Stata users a great favor if you could develop such program that could check if any two observation are let's say 90% similar to each other. Irrespective of that, I always admire your work.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 36053
#9

20 Jul 2015, 12:16

Thanks for the compliment, but it really won't happen. The problem is quite literally far too fuzzy to have any appeal to me.
Comment
Jeph Herrin

Join Date: Apr 2014

Posts: 335
#10

20 Jul 2015, 12:27

If the responses are multiple choice (single character responses), one approach would be to create a single string of 300 characters containing the responses, and then apply -reclink- (type -findit reclink-) to match the file to itself, generating a match score. Then you can see which ones have score greater than, eg, 90%.
Comment

Mike Lacy

Join Date: Apr 2014
Posts: 2449

#11

21 Jul 2015, 10:25

Here's a brute-force approach. I'm defining the duplication proportion, psame`i', as the proportion of the nvar variables for which the current observation is the same as the ith observation. Despite the loop over observations, the speed of the following seemed OK, as the run time scaled at less than _N^2.

The result is a variable called "dupelist" that contains a list of observation numbers that duplicate the current case at greater than the specified criterion.

Code:

clear
local criterion = 0.7 // duplication proportion, just for illustration
// Make example data
set obs 100
gen id = string(_n)
local nvar = 8
local nval = 4
forval i = 1/`nvar' {
   gen byte x`i' = ceil(runiform() * `nval')
}
// Compare each observation to each other observation.
forval i = 1/`=_N' {
   gen psame`i' = 0
   forval j = 1/`nvar' {
      qui replace psame`i' = psame`i' + 1 if (x`j' ==  x`j'[`i'])
   }
   qui replace psame`i' = psame`i'/`nvar'
}
// Make a list of near-duplicates for each observation
gen str dupelist = ""  
forval i = 1/`=_N' {
   qui replace dupelist = dupelist + " " + string(`i') ///
     if (`i' != _n) &amp; (psame`i' &gt; `criterion')
}
// Inspect the near-duplicates and decide what you want to do.
list dupelist if !missing(dupelist)

Does this fit the problem as you are thinking of it?

Regards, Mike

Comment

Masood Sadat

Join Date: Aug 2014

Posts: 38
#12

24 Jul 2015, 06:35

Thanks Mike. I won't have a chance to work this on Stata until this coming werkend. I am not sure however how I can I include all the variables, let's say 300 using this approach. Anyway, I will let you know how it goes.
Comment
Jay Rege

Join Date: Sep 2014

Posts: 135
#13

03 Aug 2015, 02:47

Hey,
I have a similar but slightly different Problem.
In my case its just about one Variable (Company Name), where I would like to detect Close duplicates.
The reason why I want to do so, is that I´m suspecting subsidiaries in the Company list.
So the question is, is it possible to detect them with a similar Close. I know that I will have to check the results by Hand, but at least I would have an indicator.
BMW Germany

BMW USA

BMW Canda

Bosch Thailand

Bosch gmbh
Comment
Masood Sadat

Join Date: Aug 2014

Posts: 38
#14

11 Aug 2015, 06:22

Hi Jay,
Your problem, fortunately is not as complicated as mine.
I think in your case you can use strpos, a string function. How it works? So if you look at the following example, I created a variable and assigned value 1 if BMW is mentioned in your string variable, assigned 2 if Bosch is mentioned, 3 if Toyota is mentioned, and so on.

gen Made=1 if strpos(variable_name, "BMW")
replace Made=2 if strpos(variable_name, "Bosch")
replace Made=3 if strpos(variable_name, "Toyota")
replace Made=4 ...

the create labels not to mistake the codes:

lab def made 1 BMW 2 Bosch 3 Toyota 4 ...
lab val Made made

This works well if:
1. You know all the names
2. You know that there is no overlap of names
Comment
Masood Sadat

Join Date: Aug 2014

Posts: 38
#15

09 Jun 2016, 05:49

I had posed this question long time ago, and since then I have found a solution.

My question was to flag almost similar observations (not exactly similar, which duplicate does). For instance, finding observations that are similar 90% and above.

Noble Kuriakose have developed percentmatch, which does the job and is available on SSC.
Comment

Announcement