Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Finding duplicate variables

    Dear Statalisters,

    I am writing you because I would like to find out if I have duplicate variables in my database.
    I have been looking in previous posts, but I only found a recommendation about doing a pwcorr, that I think is an interesting possibility, but considering that I have almost 400 variables is going to be hard considering the large output.
    If you could advice me, I really appreciate it.

    Thank you so much !!
    Alejandro

  • #2
    Well, there is the brute force approach of just looping over all your pairs of variables with -assert var1 == var2- and noticing the ones where the assertion is false. But with 400 variables you have around 80,000 such pairs, which is probably going to take too long. And even if it's relatively fast, it's obviously very inefficient.

    The code below will do something a little more sophisticated. It will only work with all numeric variables. (And it can be modified to work with only string variables.) Since a string variable can never be a duplicate of a numeric variable, if you don't already know whether the possible duplicates are string or numeric, worst comes to worst, you do this twice, modifying it to do strings the second time through. The idea is that if two variables are equal, then they certainly agree on the first non missing observation. So we begin by collapsing down to the first non-missing observation for each variable, flip the data on its side and sort the values, keeping only values that are themselves duplicates. So this gives us a list of possible equal variables that is probably a lot smaller than the whole set. Then we apply the brute force method of comparing all possible pairs of those. Unless you have a large number of variables that all have the same first non-missing observation, this should work fairly efficiently.

    Code:
    preserve
    quietly ds, has(type numeric)
    local all_vars `r(varlist)'
    gen one = 1
    collapse (firstnm) `all_vars', by(one)
    rename (`all_vars') _=
    reshape long _, i(one) j(varname) string
    sort _
    keep if inlist(_, _[_n-1], _[_n+1])
    levelsof varname, local(candidates) clean
    
    restore
    keep `candidates'
    display `"`candidates'"'
    local n_candidates: word count `candidates'
    local duplicate_variables
    forvalues i = 1/`n_candidates' {
        local var1: word `i' of `candidates'
        forvalues j = `=`i'+1'/`n_candidates' {
            local var2: word `j' of `candidates'
            capture assert `var1' == `var2'
            if c(rc) == 0 {
                local duplicate_variables `duplicate_variables' `var1' == `var2';
            }
        }
    }
    
    display `"`duplicate_variables'"'
    The final output of the program is a listing of all pairs of duplicate variables. If there aren't any such pairs, it will produce no output.

    I developed and tested this on a toy data set with 400 variables having 1000 observations. The values were random numbers, and two pairs of variables were set equal to each other. The code found the pairs correctly with an execution time of 3.005 seconds. The -reshape- command is the only slow part.
    Last edited by Clyde Schechter; 29 Aug 2020, 21:43.

    Comment


    • #3
      I think this would do it:

      Code:
      . foreach v of varlist * {
        2. foreach w of varlist * {
        3. capture assert `v'==`w'
        4. if _rc==0 & "`v'"!="`w'" drop `w'
        5. }
        6. }

      Comment


      • #4
        One special case that can be quite common is that you have duplicate variables that are equal to each other because they are all missing -- in turn because they were imported from a source with blank columns. Typically such variables are useless. You can get rid of them all at once.

        findname (Stata Journal) will find them, and will handle numeric and string variables in the same call. After that, you can apply drop directly;

        Alternatively, missings (Stata Journal) has a hook to get rid of them all.

        Code:
        . sysuse auto, clear
        (1978 Automobile Data)
        
        . gen garbage1 = ""
        (74 missing values generated)
        
        . gen garbage2 = .
        (74 missing values generated)
        
        . findname, all(missing(@))
        garbage1  garbage2
        
        . missings dropvars, force
        
        Checking missings in make price mpg rep78 headroom trunk weight length turn displacement gear_ratio
            foreign garbage1 garbage2:
        74 observations with missing values
        
        note: garbage1 garbage2 dropped
        
        . search findname, sj
        
        Search of official help files, FAQs, Examples, and Stata Journals
        
        SJ-20-2 dm0048_4  . . . . . . . . . . . . . . . . Software update for findname
                (help findname if installed)  . . . . . . . . . . . . . . .  N. J. Cox
                Q2/20   SJ 20(2):504
                new options include columns()
        
        SJ-15-2 dm0048_3  . . . . . . . . . . . . . . . . Software update for findname
                (help findname if installed)  . . . . . . . . . . . . . . .  N. J. Cox
                Q2/15   SJ 15(2):605--606
                updated to be able to find strL variables
        
        SJ-12-1 dm0048_2  . . . . . . . . . . . . . . . . Software update for findname
                (help findname if installed)  . . . . . . . . . . . . . . .  N. J. Cox
                Q1/12   SJ 12(1):167
                correction for handling embedded double quote characters
        
        SJ-10-4 dm0048_1  . . . . . . . . . . . . . . . . Software update for findname
                (help findname if installed)  . . . . . . . . . . . . . . .  N. J. Cox
                Q4/10   SJ 10(4):691
                update for not option
        
        SJ-10-2 dm0048  . . . . . . . . . . . . . .  Speaking Stata: Finding variables
                (help findname if installed)  . . . . . . . . . . . . . . .  N. J. Cox
                Q2/10   SJ 10(2):281--296
                produces a list of variable names showing which variables
                have specific properties, such as being of string type, or
                having value labels attached, or having a date format
        Code:
        
        . search dm0085, entry
        
        Search of official help files, FAQs, Examples, and Stata Journals
        
        SJ-17-3 dm0085_1  . . . . . . . . . . . . . . . . Software update for missings
                (help missings if installed)  . . . . . . . . . . . . . . .  N. J. Cox
                Q3/17   SJ 17(3):779
                identify() and sort options have been added
        
        SJ-15-4 dm0085  Speaking Stata: A set of utilities for managing missing values
                (help missings if installed)  . . . . . . . . . . . . . . .  N. J. Cox
                Q4/15   SJ 15(4):1174--1185
                provides command, missings, as a replacement for, and extension
                of, previous commands nmissing and dropmiss
        In each case (1) the command was announced in a paper in the Stata Journal (2) if the command seems useful, you should install from the latest update site. In the case of missings, the command name is a common keyword, so dm0085 is revealed above as a specific search term. So a search for that reveals the latest update as I write and if you are reading this some time after 2020 there may be a further update.
        Last edited by Nick Cox; 30 Aug 2020, 04:00.

        Comment


        • #5
          Dear Clyde, Joro, and Nick,

          What a pleasure see to see you in the post, I really thank you the time answering and even more with those complete answers.
          I am going to work on that now, so I really appreciate your help.

          Thank you so much again,

          Alejandro

          Comment

          Working...
          X