Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Collapsing concatenated distinct values by group

    I would like to collapse string variables after obtaining a concatenated list of all distinct values by group. For example, my data looks like this:
    HTML Code:
    clear
    input str8 item str8 color
    "Shirt" "Green"
    "Shirt" "Green"
    "Shirt" "Gray"
    "Jeans" "Gray"
    "Jeans" "Gray"
    "Jeans" "Blue"
    "Jeans" "Black"
    end
    
    list
         +---------------+
         |  item   color |
         |---------------|
      1. | Shirt   Green |
      2. | Shirt   Green |
      3. | Shirt    Gray |
      4. | Jeans    Gray |
      5. | Jeans    Gray |
         |---------------|
      6. | Jeans    Blue |
      7. | Jeans   Black |
         +---------------+
    preserve
    And I would like to collapse by item while not losing any of the distinct values of color. I would like these distinct values in alphabetical order as a comma-separated list. So far I have managed to accomplish it this way:
    HTML Code:
    egen group_id = group(item)
    gen color_concat = ""
    forvalues val=1/2 {
        levelsof color if group_id == `val', clean separate(", ")
        replace color_concat = r(levels) if group_id == `val'
    }
    
    collapse (first) color_concat, by(item)
    list
         +---------------------------+
         |  item        color_concat |
         |---------------------------|
      1. | Jeans   Black, Blue, Gray |
      2. | Shirt         Gray, Green |
         +---------------------------+
    However, I wonder if this is a poor use of levelsof and if there is a faster, more elegant solution. I am trying to apply this to datasets with over several hundred thousand observations and this gets very, very slow. I have tried this solution with the code below, but unfortunately I still obtain repeated levels rather than distinct. I would appreciate any help with this.
    HTML Code:
    restore
    sort item color, stable
    by item : gen allcolors = color[1]
    by item : replace allcolors = allcolors[_n-1] + ", " + color if _n > 1
    by item : replace allcolors = allcolors[_N]
    
    collapse (first) allcolors, by(item)
    list
         +---------------------------------+
         |  item                 allcolors |
         |---------------------------------|
      1. | Jeans   Black, Blue, Gray, Gray |
      2. | Shirt        Gray, Green, Green |
         +---------------------------------+

  • #2
    First go

    Code:
    duplicates drop item color

    Comment


    • #3
      Great, thank you, Nick!

      Comment

      Working...
      X