Finding duplicate variables

Alejandro Torres

Join Date: Jan 2018

Posts: 152
#1

Finding duplicate variables

29 Aug 2020, 20:50

Dear Statalisters,

I am writing you because I would like to find out if I have duplicate variables in my database.
I have been looking in previous posts, but I only found a recommendation about doing a pwcorr, that I think is an interesting possibility, but considering that I have almost 400 variables is going to be hard considering the large output.
If you could advice me, I really appreciate it.

Thank you so much !!
Alejandro
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30122
#2

29 Aug 2020, 21:35

Well, there is the brute force approach of just looping over all your pairs of variables with -assert var1 == var2- and noticing the ones where the assertion is false. But with 400 variables you have around 80,000 such pairs, which is probably going to take too long. And even if it's relatively fast, it's obviously very inefficient.

The code below will do something a little more sophisticated. It will only work with all numeric variables. (And it can be modified to work with only string variables.) Since a string variable can never be a duplicate of a numeric variable, if you don't already know whether the possible duplicates are string or numeric, worst comes to worst, you do this twice, modifying it to do strings the second time through. The idea is that if two variables are equal, then they certainly agree on the first non missing observation. So we begin by collapsing down to the first non-missing observation for each variable, flip the data on its side and sort the values, keeping only values that are themselves duplicates. So this gives us a list of possible equal variables that is probably a lot smaller than the whole set. Then we apply the brute force method of comparing all possible pairs of those. Unless you have a large number of variables that all have the same first non-missing observation, this should work fairly efficiently.

Code:

preserve quietly ds, has(type numeric) local all_vars `r(varlist)' gen one = 1 collapse (firstnm) `all_vars', by(one) rename (`all_vars') _= reshape long _, i(one) j(varname) string sort _ keep if inlist(_, _[_n-1], _[_n+1]) levelsof varname, local(candidates) clean restore keep `candidates' display `"`candidates'"' local n_candidates: word count `candidates' local duplicate_variables forvalues i = 1/`n_candidates' { local var1: word `i' of `candidates' forvalues j = `=`i'+1'/`n_candidates' { local var2: word `j' of `candidates' capture assert `var1' == `var2' if c(rc) == 0 { local duplicate_variables `duplicate_variables' `var1' == `var2'; } } } display `"`duplicate_variables'"'

The final output of the program is a listing of all pairs of duplicate variables. If there aren't any such pairs, it will produce no output.

I developed and tested this on a toy data set with 400 variables having 1000 observations. The values were random numbers, and two pairs of variables were set equal to each other. The code found the pairs correctly with an execution time of 3.005 seconds. The -reshape- command is the only slow part.

Last edited by Clyde Schechter; 29 Aug 2020, 21:43.
Comment

Joro Kolev

Join Date: Aug 2018
Posts: 3050

29 Aug 2020, 21:40

I think this would do it:

Code:

. foreach v of varlist * {
  2. foreach w of varlist * {
  3. capture assert `v'==`w'
  4. if _rc==0 & "`v'"!="`w'" drop `w'
  5. }
  6. }

Comment

Nick Cox

Join Date: Mar 2014
Posts: 35734

30 Aug 2020, 03:45

One special case that can be quite common is that you have duplicate variables that are equal to each other because they are all missing -- in turn because they were imported from a source with blank columns. Typically such variables are useless. You can get rid of them all at once.

findname (Stata Journal) will find them, and will handle numeric and string variables in the same call. After that, you can apply drop directly;

Alternatively, missings (Stata Journal) has a hook to get rid of them all.

Code:

. sysuse auto, clear
(1978 Automobile Data)

. gen garbage1 = ""
(74 missing values generated)

. gen garbage2 = .
(74 missing values generated)

. findname, all(missing(@))
garbage1  garbage2

. missings dropvars, force

Checking missings in make price mpg rep78 headroom trunk weight length turn displacement gear_ratio
    foreign garbage1 garbage2:
74 observations with missing values

note: garbage1 garbage2 dropped

. search findname, sj

Search of official help files, FAQs, Examples, and Stata Journals

SJ-20-2 dm0048_4  . . . . . . . . . . . . . . . . Software update for findname
        (help findname if installed)  . . . . . . . . . . . . . . .  N. J. Cox
        Q2/20   SJ 20(2):504
        new options include columns()

SJ-15-2 dm0048_3  . . . . . . . . . . . . . . . . Software update for findname
        (help findname if installed)  . . . . . . . . . . . . . . .  N. J. Cox
        Q2/15   SJ 15(2):605--606
        updated to be able to find strL variables

SJ-12-1 dm0048_2  . . . . . . . . . . . . . . . . Software update for findname
        (help findname if installed)  . . . . . . . . . . . . . . .  N. J. Cox
        Q1/12   SJ 12(1):167
        correction for handling embedded double quote characters

SJ-10-4 dm0048_1  . . . . . . . . . . . . . . . . Software update for findname
        (help findname if installed)  . . . . . . . . . . . . . . .  N. J. Cox
        Q4/10   SJ 10(4):691
        update for not option

SJ-10-2 dm0048  . . . . . . . . . . . . . .  Speaking Stata: Finding variables
        (help findname if installed)  . . . . . . . . . . . . . . .  N. J. Cox
        Q2/10   SJ 10(2):281--296
        produces a list of variable names showing which variables
        have specific properties, such as being of string type, or
        having value labels attached, or having a date format

Code:


. search dm0085, entry

Search of official help files, FAQs, Examples, and Stata Journals

SJ-17-3 dm0085_1  . . . . . . . . . . . . . . . . Software update for missings
        (help missings if installed)  . . . . . . . . . . . . . . .  N. J. Cox
        Q3/17   SJ 17(3):779
        identify() and sort options have been added

SJ-15-4 dm0085  Speaking Stata: A set of utilities for managing missing values
        (help missings if installed)  . . . . . . . . . . . . . . .  N. J. Cox
        Q4/15   SJ 15(4):1174--1185
        provides command, missings, as a replacement for, and extension
        of, previous commands nmissing and dropmiss

In each case (1) the command was announced in a paper in the Stata Journal (2) if the command seems useful, you should install from the latest update site. In the case of missings, the command name is a common keyword, so dm0085 is revealed above as a specific search term. So a search for that reveals the latest update as I write and if you are reading this some time after 2020 there may be a further update.

Last edited by Nick Cox; 30 Aug 2020, 04:00.

Comment

Alejandro Torres

Join Date: Jan 2018

Posts: 152
#5

31 Aug 2020, 10:18

Dear Clyde, Joro, and Nick,

What a pleasure see to see you in the post, I really thank you the time answering and even more with those complete answers.
I am going to work on that now, so I really appreciate your help.

Thank you so much again,

Alejandro
Comment

Announcement