Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Identify all groups that have a common element

    I am working with real estate appraisal data. Each appraisal (appr_id) has multiple comps (comp_id) associated with it. Comps can be used across multiple appraisals. My data contains 27 million observations, and looks like:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float unique_id str1(appr_id comp_id)
     1 "1" "h"
     2 "1" "c"
     3 "1" "s"
     4 "2" "a"
     5 "2" "h"
     6 "2" "m"
     7 "3" "a"
     8 "3" "y"
     9 "3" "x"
    10 "4" "p"
    11 "4" "q"
    12 "5" "z"
    13 "5" "p"
    14 "6" "w"
    15 "6" "t"
    end
    Appraisal 1 has a comp in common with appraisal 2 (h), and appraisal 3 has a comp in common with appraisal 2 (a), so appraisals 1-3 should have the same value in the new variable. Appraisal 4 and appraisal 5 both use p as a comp, so they should have the same value in the new variable. Appraisal 6 does not share a comp with any of the other appraisals, so it should have its own value in the new variable. The final dataset should look like:


    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float unique_id str1(appr_id comp_id) float newvar
     1 "1" "h" 1
     2 "1" "c" 1
     3 "1" "s" 1
     4 "2" "a" 1
     5 "2" "h" 1
     6 "2" "m" 1
     7 "3" "a" 1
     8 "3" "y" 1
     9 "3" "x" 1
    10 "4" "p" 2
    11 "4" "q" 2
    12 "5" "z" 2
    13 "5" "p" 2
    14 "6" "w" 3
    15 "6" "t" 3
    end

    I'm guessing there's an easy solution, but I've hit the wall. Any help would be greatly appreciated. Thanks!

  • #2
    It's not clear how your logic relates the example data to the expected solution. At the very least, the set of rules of incomplete. For example, why would appraisals 1, 2 and 3 all get the same value of 1, when they share two different comps ? What happens if any two appraisals share more than one comp?

    Comment


    • #3
      Sorry that was my explanation was incomplete. Within an appraisal, all of the comps are essentially identical properties. Roughly speaking,

      h = s = c (all in appraisal 1).

      a = h = m (all in appraisal 2)

      h is the same comp, but just used in different appraisals. Because h shows up in both appraisals, those 5 properties are basically the same ( h = s = c = a = m.).

      The three comps in appraisal 3 are basically all the same property (a=y=x), and because a is also in appraisal 2, then

      h = s = c = a = m = y = x.

      All of those properties (comp_id) get the same value of newvar, 1.


      From appraisal 4 (p = q) and 5 (z = p), we get:

      p = q = z

      All three of those comps get the same value of newvar, 2.


      From appraisal 6, w = t. w and t do not appear in any of the other appraisals, so get the same value of newvar, 3.

      I hope this makes more sense. Sorry for the confusion..





      Comment


      • #4
        Not quite sure if it works for more complicated cases.

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input float unique_id str1(appr_id comp_id)
         1 "1" "h"
         2 "1" "c"
         3 "1" "s"
         4 "2" "a"
         5 "2" "h"
         6 "2" "m"
         7 "3" "a"
         8 "3" "y"
         9 "3" "x"
        10 "4" "p"
        11 "4" "q"
        12 "5" "z"
        13 "5" "p"
        14 "6" "w"
        15 "6" "t"
        end
        
        sort comp_id
        gen n = sum(comp_id != comp_id[_n-1])
        bys n: replace n = . if _N == 1
        local a = 1
        local c = 1
        while `a' > 0 | `c' > 0 {
            bys appr_id (n): replace n = n[_n-1] if n[_n-1] != .
            bys comp_id (n): egen c = sd(n)
            quietly sum c
            local c = r(mean)
            
            bys comp_id (n): replace n = n[_n-1] if n[_n-1] != .
            bys appr_id (n): egen a = sd(n)
            quietly sum a
            local a = r(mean)
            
            drop a c
        }
        
        sort n appr_id
        egen g = group(n)
        replace g = g[_n-1] + (appr_id != appr_id[_n-1] & g == .) if g == .
        drop n
        sort unique_id
        Code:
        . list
        
             +----------------------------------+
             | unique~d   appr_id   comp_id   g |
             |----------------------------------|
          1. |        1         1         h   1 |
          2. |        2         1         c   1 |
          3. |        3         1         s   1 |
          4. |        4         2         a   1 |
          5. |        5         2         h   1 |
             |----------------------------------|
          6. |        6         2         m   1 |
          7. |        7         3         a   1 |
          8. |        8         3         y   1 |
          9. |        9         3         x   1 |
         10. |       10         4         p   2 |
             |----------------------------------|
         11. |       11         4         q   2 |
         12. |       12         5         z   2 |
         13. |       13         5         p   2 |
         14. |       14         6         w   3 |
         15. |       15         6         t   3 |
             +----------------------------------+

        Comment


        • #5
          This is great. Thank you!

          Comment


          • #6
            You can use group_id (from SSC) to group appr_id identifiers that share one or more comp_id identifiers:

            Code:
            * create a numeric identifier for appraisals
            egen long newid = group(appr_id)
            
            * group appraisals if they share comps
            group_id newid, match(comp_id)
            
            list, sepby(newid)
            Code:
            . list, sepby(newid)
            
                 +--------------------------------------+
                 | unique~d   appr_id   comp_id   newid |
                 |--------------------------------------|
              1. |        1         1         h       1 |
              2. |        2         1         c       1 |
              3. |        3         1         s       1 |
              4. |        4         2         a       1 |
              5. |        5         2         h       1 |
              6. |        6         2         m       1 |
              7. |        7         3         a       1 |
              8. |        8         3         y       1 |
              9. |        9         3         x       1 |
                 |--------------------------------------|
             10. |       10         4         p       4 |
             11. |       11         4         q       4 |
             12. |       12         5         z       4 |
             13. |       13         5         p       4 |
                 |--------------------------------------|
             14. |       14         6         w       6 |
             15. |       15         6         t       6 |
                 +--------------------------------------+
            
            .

            Comment

            Working...
            X