  • 2 identical values between any two columns

    Hello.
    I have a file with about 30 columns of integers. I need to create a dummy variable that equals 1 when at least two columns in the same row hold identical values, and 0 otherwise.
    For example
    12 13 12 14: dummy = 1
    12 13 14 12: dummy = 1
    12 13 14 15: dummy = 0

    How can I do it? Thank you

  • #2
    This might be easier in long format, but here's what I'd do presuming you have wide format:
    Code:
    // Create some example data, which might or might not resemble what Esteban actually has.
    clear
    set seed 4523
    set obs 20
    forval i = 1/5 {
       gen byte x`i' = runiformint(1, 10)
    }
    // end example data
    //
    // Get variable list.
    quiet ds x*
    local vlist `r(varlist)'
    local nvar: word count `vlist'
    // Compare all possible pairs of variables to detect any matches.
    gen byte atleast1match = 0
    forval i = 1/`=`nvar'-1' {
       forval j = `=`i'+ 1'/`nvar' {
          local v1: word `i' of `vlist'
          local v2: word `j' of `vlist'
          qui replace atleast1match = atleast1match + (`v1' == `v2') ///
             if (atleast1match == 0)
       }
    }
    One could do this somewhat more efficiently, but it was fast enough when I tried it with 30 variables and 1000 observations.
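
    A sketch of one way to trim the work, assuming the columns match x* as in the example data above: stop looping as soon as every row has been flagged. The name anymatch is illustrative, and the three test rows are the ones from the original question. Note that Stata treats two missing values as equal, so a row with two or more missings would also be flagged.

    ```stata
    // Toy data: the three example rows from the question (names x1-x4 are assumptions).
    clear
    input x1 x2 x3 x4
    12 13 12 14
    12 13 14 12
    12 13 14 15
    end

    unab vlist : x*
    local nvar : word count `vlist'

    gen byte anymatch = 0
    forvalues i = 1/`=`nvar' - 1' {
        local v1 : word `i' of `vlist'
        forvalues j = `=`i' + 1'/`nvar' {
            local v2 : word `j' of `vlist'
            // Only rows not yet flagged are touched.
            quietly replace anymatch = 1 if anymatch == 0 & `v1' == `v2'
        }
        // Leave the outer loop once no unflagged rows remain.
        quietly count if anymatch == 0
        if r(N) == 0 continue, break
    }
    list
    ```

    Whether the early exit ever triggers depends on the data: it only helps if every row ends up with a match before the last pair is compared.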

    • #3
      As Mike Lacy points out, it is simpler to do this in long layout:
      Code:
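      // Assumes the columns are named var1, var2, ...; change the stub -var- to match your data.
      gen `c(obs_t)' obs_no = _n
      reshape long var, i(obs_no) j(seq)
      // After sorting within row, any duplicated value sits next to its twin.
      by obs_no (var), sort: egen byte wanted = max(var == var[_n-1])
      // And if you want to go back to wide layout:
      reshape wide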
      Notice how much shorter and clearer the code is this way. It can't be said enough: in Stata, long data sets are almost always easier to work with than wide ones. Unless you know you will be doing one of the few things that Stata does more readily with wide data, you should prefer to work with a long data set. Indeed, you will probably be better off keeping the data in the long layout and skipping that final -reshape wide-, unless you know that what you are going to do next is better done with a wide layout.
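
      As I read it, the help file for -egen- cautions against explicit subscripts such as [_n-1] inside -egen- expressions. A more cautious variant, sketched here on toy data with columns assumed to be named x1-x4, builds the adjacent-duplicate flag with -generate- first and only then takes the row maximum. The names dup and wanted are illustrative.

      ```stata
      // Toy data: the three example rows from the question.
      clear
      input x1 x2 x3 x4
      12 13 12 14
      12 13 14 12
      12 13 14 15
      end

      gen long obs_no = _n
      reshape long x, i(obs_no) j(seq)

      // Within each original row, sort the values and flag any value equal to its predecessor.
      bysort obs_no (x): gen byte dup = (x == x[_n-1]) & (_n > 1)

      // A row is flagged if any of its values was flagged.
      by obs_no: egen byte wanted = max(dup)
      drop dup

      reshape wide
      list
      ```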

      In the future, when showing data examples, please use the -dataex- command to do so. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
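
      A minimal illustration of the call, using the auto dataset shipped with Stata as a stand-in for your own data. The variable list and the observation range are placeholders.

      ```stata
      sysuse auto, clear
      // Show the first 10 observations of three variables in a paste-ready form.
      dataex price mpg rep78 in 1/10
      ```

      Copy the whole output, including the [CODE] and [/CODE] delimiters, into your post.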

      • #4
        Thank you both very much for the answers. I understand your point about working in long rather than wide layout. The problem is that, for various reasons, I need to keep the data in a wide structure. One reason is that my dataset has almost 30 million rows, in addition to several other variables, and I think I'm close to what my computer can handle. I'll try something like Mike's proposal; hopefully it works.

        Again, thank you very much
