Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • keeping only unique values using egen concat

    Hello,

    I am trying to generate an indicator called mo_all using egen and concat that records all the unique values across a set of variables mo#. Specifically, my code is egen mo_all = concat(mo1_1 mo2_1 mo3_1 mo4_1 mo5_1 mo6_1 mo7_1 mo8_1 mo9_1 mo10_1 mo11_1 mo12_1 mo13_1 mo14_1 mo15_1 mo16_1), punct(" ")

    In this new indicator, I'd like to record only unique values across the variables. For example, in the table below, mo_all in row 1 would equal "1 2 3 4 5", in row 2 it would equal "6 7 8 9", and in row 3 it would equal "10 11".

    My current code records every value, so row 2 equals "6 7 7 8 9"
    mo1 mo2 mo3 mo4 mo5
    1 2 3 4 5
    6 7 7 8 9
    10 10 10 11 11
    12 13 14 15 16
    Could anyone help me with the code to remove all duplicate values within the string variable mo_all?

    Thank you for your time and help!

    Tom


  • #2
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str2(mo1 mo2 mo3 mo4 mo5)
    "1"  "2"  "3"  "4"  "5"
    "6"  "7"  "7"  "8"  "9"
    "10" "10" "10" "11" "11"
    "12" "13" "14" "15" "16"
    end
    
    g n=_n
    
    reshape long mo, i(n)
    
    by n mo, sort: keep if _n==1
    
    reshape wide
    
    egen mo_all = concat(mo*), punct(" ")
    
    replace mo_all = stritrim(mo_all)
    
    list mo_all
    
         +----------------+
         |         mo_all |
         |----------------|
      1. |      1 2 3 4 5 |
      2. |        6 7 8 9 |
      3. |          10 11 |
      4. | 12 13 14 15 16 |
         +----------------+
    Last edited by Øyvind Snilsberg; 12 Feb 2022, 15:05.

    Comment


    • #3
      Here's another approach, a key feature of which is to concatenate after eliminating duplicates.
      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input float(mo1 mo2 mo3 mo4 mo5)
       1  2  3  4  5
       6  7  7  8  9
      10 10 10 11 11
      12 13 14 15 16
      end
      list
      //
      local max = 5
      forval i = 1/`=`max' -1' {
         forval j = `=`i'+1'/`max' {
            qui replace mo`j' = . if mo`j' == mo`i'
         }
      }
      egen mo_all = concat(mo*), punct(" ")
      replace mo_all = stritrim(subinstr(mo_all, ".", "", .))
      list

      Comment


      • #4
        While I would, in general, prefer Oyvind Snilsberg's solution in #2, here is an alternative that you might use if you have a good reason to prefer keeping the data in wide layout:

        Code:
        gen wanted = ""
        foreach v of varlist mo* {
            replace wanted = wanted + `v' + " " if !strpos(wanted, `v')
        }
        replace wanted = trim(wanted)
        
        list, noobs  clean
        Added: Crossed with #3, which shows yet another approach.

        Comment


        • #5
          Thank you all very much for your responses and for the different options. It's nice to see all of the different ways to do this. I appreciate your time.

          Comment


          • #6
            Clyde Schechter one thing I noticed with your solution is that there are duplicate combinations if values are in different orders in mo*. See rows 2 and 3 in the picture below. Is there a way to modify your code to combine those combinations (e.g., "12 89" would have a frequency of 376 (194+182))?

            Click image for larger version

Name:	Capture.PNG
Views:	1
Size:	7.6 KB
ID:	1649785


            Thanks,
            Tom

            Comment


            • #7
              Code:
              * Example generated by -dataex-. For more info, type help dataex
              clear
              input str2(mo1 mo2 mo3 mo4 mo5)
              "1"  "2"  "3"  "4"  "5"
              "6"  "7"  "7"  "8"  "9"
              "10" "10" "10" "11" "11"
              "12" "13" "14" "15" "16"
              "5"  "4"  "3"  "2"  "1"
              end
              
              
              quietly ds mo*
              local nvars: word count `r(varlist)'
              rowsort mo*, gen(mo_sorted1-mo_sorted`nvars')
              gen wanted = ""
              foreach v of varlist mo_sorted* {
                  replace wanted = wanted + `v' + " " if !strpos(wanted, `v')
              }
              replace wanted = trim(wanted)
              drop mo_sorted*
              
              list, noobs  clean
              Note: -rowsort- is written by Nick Cox and is available from SSC.

              Comment


              • #8
                The version of rowsort on SSC was written for Stata 8 and requires numeric variables. The version released with

                https://www.stata-journal.com/articl...article=pr0046

                was written for Stata 9 up, allows string variables too, and remains the most versatile version available.

                Comment


                • #9
                  Yet another solution:
                  Code:
                  mata :
                  
                      newvarname = "mo_all"
                      newvarformat = "str14" // NB mata does not promote: set sufficient width   
                      vars = "mo?" 
                  
                      st_addvar(newvarformat, newvarname) 
                  
                      for (i=1; i<=st_nobs(); i++) {    
                  
                          st_sstore(i, newvarname, strtrim(invtokens(uniqrows(st_sdata(i, vars)')')))
                      }
                  
                  end

                  Comment

                  Working...
                  X