2 identical values between any two columns

Esteban Jara

Join Date: Jun 2020

Posts: 91
#1

21 Feb 2023, 12:26

Hello.
I have a file with about 30 columns of integers. I need to create a dummy variable that indicates whether, within each row, at least two of the columns hold the same value.
For example
12 13 12 14: dummy = 1
12 13 14 12: dummy = 1
12 13 14 15: dummy = 0

How can I do it? Thank you

Mike Lacy

Join Date: Apr 2014
Posts: 2449

21 Feb 2023, 13:17

This might be easier in long format, but here's what I'd do presuming you have wide format:

Code:

// Create some example data, which might or might not resemble what Esteban actually has.
clear
set seed 4523
set obs 20
forval i = 1/5 {
   gen byte x`i' = runiformint(1, 10)
}
// end example data
//
// Get variable list.
quiet ds x*
local vlist `r(varlist)'
local nvar: word count `vlist'
// Compare all possible pairs of variables to detect any matches.
gen byte atleast1match = 0
forval i = 1/`=`nvar'-1' {
   forval j = `=`i'+ 1'/`nvar' {
      local v1: word `i' of `vlist'
      local v2: word `j' of `vlist'
      qui replace atleast1match = atleast1match + (`v1' == `v2') ///
         if (atleast1match == 0)
   }
}

One could do this somewhat more efficiently, but it was fast enough when I tried it with 30 variables and 1000 observations.
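
One possible reading of "somewhat more efficiently", offered only as a sketch and not as Mike's own code: look up the first variable name once per outer pass, and stop as soon as no unmatched rows remain. The data are the three example rows from the question; the names `x1`-`x4` and `anymatch` are made up for illustration. Whether the extra `count` pays for itself depends on the data. As in the code above, two missing values in the same row would count as a match.

```stata
* Illustrative data: the three example rows from the question.
clear
input byte(x1 x2 x3 x4)
12 13 12 14
12 13 14 12
12 13 14 15
end

unab vlist : x*
local nvar : word count `vlist'

gen byte anymatch = 0
forvalues i = 1/`=`nvar' - 1' {
    * Look up the first variable of the pair once per outer pass.
    local v1 : word `i' of `vlist'
    forvalues j = `=`i' + 1'/`nvar' {
        local v2 : word `j' of `vlist'
        * Only rows that are still unmatched need to be checked.
        quietly replace anymatch = 1 if `v1' == `v2' & anymatch == 0
    }
    * Stop early once every row has a match.
    quietly count if anymatch == 0
    if r(N) == 0 continue, break
}
list
```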


Clyde Schechter

Join Date: Apr 2014

Posts: 30357
#3

21 Feb 2023, 15:13

As Mike Lacy points out, it is simpler to do this in long layout:

Code:

gen `c(obs_t)' obs_no = _n
reshape long var, i(obs_no) j(seq)
by obs_no (var), sort: egen byte wanted = max(var == var[_n-1])

// AND IF YOU WANT TO GO BACK TO WIDE LAYOUT:
reshape wide

Notice how much shorter and clearer the code is this way. It can't be said enough: in Stata, long data sets are almost always easier to work with than wide ones. Unless you know you will be doing one of the few things that Stata does more readily with wide data, you should prefer to work with a long data set. Indeed, you will probably be better off keeping the data in the long layout and skipping that final -reshape wide-, unless you know that what you are going to do next is better done with a wide layout.
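
For a self-contained illustration of the long-layout logic, here is a sketch on the three example rows from the question. It assumes the columns are named `var1`-`var4`, since the stub `var` in the code above implies names of that form; the actual names are not given in the thread. It splits the comparison into a `generate` step and an `egen` step, because the `egen` documentation advises against explicit subscripts such as `[_n-1]` inside `egen`. The helper variable `match` is an illustrative name.

```stata
* Illustrative data: the three example rows from the question.
clear
input int(var1 var2 var3 var4)
12 13 12 14
12 13 14 12
12 13 14 15
end

gen long obs_no = _n
reshape long var, i(obs_no) j(seq)

* Within each original row, sort the values so that duplicates are adjacent,
* then flag any value equal to its predecessor.
bysort obs_no (var): gen byte match = var == var[_n-1]

* Spread the flag to every observation of a row that has at least one duplicate.
by obs_no: egen byte wanted = max(match)

* match varies within obs_no, so drop it before going back to wide layout.
drop match
reshape wide var, i(obs_no) j(seq)
list
```

Sorting each row's values first means only neighbouring values need to be compared, so the work per row does not grow with the number of column pairs.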

In the future, when showing data examples, please use the -dataex- command to do so. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
Esteban Jara

Join Date: Jun 2020

Posts: 91
#4

22 Feb 2023, 09:47

Thank you both very much for the answers. I understand your point about working in long layout rather than wide. The problem is that, for various reasons, I need to keep a wide structure. One reason is that my data set has almost 30 million rows, in addition to several other variables, so I think I am close to what my computer can handle. I'll try something like Mike's proposal and hope it works.

Again, thank you very much