  • How to tell if all items in a group are equal?

    The data below is a simplified example to illustrate the problem I am working on. Basically, I am trying to figure out the number of families where all members have the same favorite color. So in this example, I would want a list of two families (Smith and Silver). For the purpose of this example, we can assume that the last name is unique and there are not multiple Smith families.

    I thought about creating a binary variable for each color and then taking the average by family. If the average isn't 0 or 1, then there must be some differences within the family. I'd end up with a lot of extra variables and this doesn't seem like the most efficient way to solve the problem. Any other suggestions?

    Thanks in advance!
    Family (last name)   First Name   Favorite Color
    Smith                Sarah        Blue
    Smith                Sally        Blue
    Smith                Sue          Blue
    Smith                John         Blue
    Doe                  Megan        Red
    Doe                  Jack         Purple
    Johnson              Michael      Red
    Johnson              Mary         Orange
    Johnson              Tom          Green
    Johnson              Richard      Blue
    Johnson              Joe          Red
    Johnson              Jimmy        Blue
    Silver               Susan        Purple
    Silver               James        Purple
    Silver               Josh         Purple

  • #2
    This FAQ may help:


    • #3
      Thanks Nick. That's much more straightforward than what I had in mind. I was thinking of using egen to compute the SD within families, and then (by family) setting variable same = SD==0. That approach works, but is far clunkier than necessary. My colleague in a discussion forum for another stats package would describe it as Rubish (in honor of Rube Goldberg).

      Madison, here's an example using the data you posted.

      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input str12(last first favcol)
      "Smith"   "Sarah"   "Blue"
      "Smith"   "Sally"   "Blue"
      "Smith"   "Sue"     "Blue"
      "Smith"   "John"    "Blue"
      "Doe"     "Megan"   "Red"
      "Doe"     "Jack"    "Purple"
      "Johnson" "Michael" "Red"
      "Johnson" "Mary"    "Orange"
      "Johnson" "Tom"     "Green"
      "Johnson" "Richard" "Blue"
      "Johnson" "Joe"     "Red"
      "Johnson" "Jimmy"   "Blue"
      "Silver"  "Susan"   "Purple"
      "Silver"  "James"   "Purple"
      "Silver"  "Josh"    "Purple"
      end

      generate order1 = _n // preserve original order of observations
      * Sort colors within family; all are equal exactly when first == last
      by last (favcol), sort: gen same = favcol[1] == favcol[_N]
      sort order1 // restore original order of observations
      list last first favcol same, sepby(last)
      Output from the -list- command:

      . list last first favcol same, sepby(last)
           |    last     first   favcol   same |
        1. |   Smith     Sarah     Blue      1 |
        2. |   Smith     Sally     Blue      1 |
        3. |   Smith       Sue     Blue      1 |
        4. |   Smith      John     Blue      1 |
        5. |     Doe     Megan      Red      0 |
        6. |     Doe      Jack   Purple      0 |
        7. | Johnson   Michael      Red      0 |
        8. | Johnson      Mary   Orange      0 |
        9. | Johnson       Tom    Green      0 |
       10. | Johnson   Richard     Blue      0 |
       11. | Johnson       Joe      Red      0 |
       12. | Johnson     Jimmy     Blue      0 |
       13. |  Silver     Susan   Purple      1 |
       14. |  Silver     James   Purple      1 |
       15. |  Silver      Josh   Purple      1 |
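
      The original question asked for the number (and a list) of families in which everyone agrees. One possible way to get there from the same indicator is sketched below. It assumes the example data and the variables last and same created above are still in memory; the variable name famtag is only illustrative.

      ```stata
      * Flag exactly one observation per family (famtag is a hypothetical name)
      egen byte famtag = tag(last)

      * Number of families in which all members share a favorite color
      count if famtag & same

      * List those families, one line each
      list last if famtag & same, noobs
      ```

      With the example data this should pick out the Smith and Silver families. Tagging one observation per family avoids counting a family once per member.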

      PS- Re the Rubish approach described above, I forgot to say that one would have to convert the string variable for color into a numeric variable before computing the SD. E.g.,

      // Generate numeric version of favorite color variable
      encode favcol, generate(color)
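
      For completeness, the rest of the SD-based approach might look like the sketch below. It assumes the numeric variable color from the encode line above and the family variable last from the example data; sd_color and same_sd are illustrative names.

      ```stata
      * Within-family standard deviation of the numeric color codes
      egen double sd_color = sd(color), by(last)

      * SD of zero means every member has the same color code
      generate byte same_sd = (sd_color == 0)

      list last first favcol same_sd, sepby(last)
      ```

      One caveat with this sketch: a family with a single member has a missing SD, so same_sd would be 0 for it even though its one member trivially agrees with itself. The sort-based approach above does not have that problem.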
      Last edited by Bruce Weaver; 14 Jun 2019, 15:58. Reason: Added the postscript.
      Bruce Weaver
      Stata version: 16.1 IC (Windows)