Collapsing concatenated distinct values by group

Manuel PenaH

Join Date: Aug 2021
Posts: 12

Collapsing concatenated distinct values by group

02 Aug 2021, 23:24

I would like to collapse string variables after obtaining a concatenated list of all distinct values by group. For example, my data looks like this:

HTML Code:

clear
input str8 item str8 color
"Shirt" "Green"
"Shirt" "Green"
"Shirt" "Gray"
"Jeans" "Gray"
"Jeans" "Gray"
"Jeans" "Blue"
"Jeans" "Black"
end

list
     +---------------+
     |  item   color |
     |---------------|
  1. | Shirt   Green |
  2. | Shirt   Green |
  3. | Shirt    Gray |
  4. | Jeans    Gray |
  5. | Jeans    Gray |
     |---------------|
  6. | Jeans    Blue |
  7. | Jeans   Black |
     +---------------+
preserve

And I would like to collapse by item while not losing any of the distinct values of color. I would like these distinct values in alphabetical order as a comma-separated list. So far I have managed to accomplish it this way:

HTML Code:

egen group_id = group(item)
gen color_concat = ""
forvalues val=1/2 {
    levelsof color if group_id == `val', clean separate(", ")
    replace color_concat = r(levels) if group_id == `val'
}

collapse (first) color_concat, by(item)
list
     +---------------------------+
     |  item        color_concat |
     |---------------------------|
  1. | Jeans   Black, Blue, Gray |
  2. | Shirt         Gray, Green |
     +---------------------------+

However, I wonder if this is a poor use of levelsof and if there is a faster, more elegant solution. I am trying to apply this to datasets with over several hundred thousand observations and this gets very, very slow. I have tried this solution with the code below, but unfortunately I still obtain repeated levels rather than distinct. I would appreciate any help with this.

HTML Code:

restore
sort item color, stable
by item : gen allcolors = color[1]
by item : replace allcolors = allcolors[_n-1] + ", " + color if _n > 1
by item : replace allcolors = allcolors[_N]

collapse (first) allcolors, by(item)
list
     +---------------------------------+
     |  item                 allcolors |
     |---------------------------------|
  1. | Jeans   Black, Blue, Gray, Gray |
  2. | Shirt        Gray, Green, Green |
     +---------------------------------+

Tags: None

Nick Cox

Join Date: Mar 2014

Posts: 36053
#2

03 Aug 2021, 02:34

First go

Code:

duplicates drop item color
1 like
Comment
Manuel PenaH

Join Date: Aug 2021

Posts: 12
#3

03 Aug 2021, 07:46

Great, thank you, Nick!
Comment

Announcement

Collapsing concatenated distinct values by group

Comment

Comment