I would like to collapse string variables after obtaining a concatenated list of all distinct values by group. For example, my data looks like this:
And I would like to collapse by item while not losing any of the distinct values of color. I would like these distinct values in alphabetical order as a comma-separated list. So far I have managed to accomplish it this way:
However, I wonder if this is a poor use of levelsof and if there is a faster, more elegant solution. I am trying to apply this to datasets with over several hundred thousand observations and this gets very, very slow. I have tried this solution with the code below, but unfortunately I still obtain repeated levels rather than distinct. I would appreciate any help with this.
HTML Code:
clear
input str8 item str8 color
"Shirt" "Green"
"Shirt" "Green"
"Shirt" "Gray"
"Jeans" "Gray"
"Jeans" "Gray"
"Jeans" "Blue"
"Jeans" "Black"
end
list
+---------------+
| item color |
|---------------|
1. | Shirt Green |
2. | Shirt Green |
3. | Shirt Gray |
4. | Jeans Gray |
5. | Jeans Gray |
|---------------|
6. | Jeans Blue |
7. | Jeans Black |
+---------------+
preserve
HTML Code:
egen group_id = group(item)
gen color_concat = ""
forvalues val=1/2 {
levelsof color if group_id == `val', clean separate(", ")
replace color_concat = r(levels) if group_id == `val'
}
collapse (first) color_concat, by(item)
list
+---------------------------+
| item color_concat |
|---------------------------|
1. | Jeans Black, Blue, Gray |
2. | Shirt Gray, Green |
+---------------------------+
HTML Code:
restore
sort item color, stable
by item : gen allcolors = color[1]
by item : replace allcolors = allcolors[_n-1] + ", " + color if _n > 1
by item : replace allcolors = allcolors[_N]
collapse (first) allcolors, by(item)
list
+---------------------------------+
| item allcolors |
|---------------------------------|
1. | Jeans Black, Blue, Gray, Gray |
2. | Shirt Gray, Green, Green |
+---------------------------------+

Comment