  • generating an index

    Hi, I am trying to generate an index "z" in Stata so that z has the same value for all x's and y's that are connected to each other. For example:

    x y z
    1 1 1
    1 2 1
    1 3 1
    2 4 2
    2 5 3
    2 6 3
    4 7 4
    5 2 1
    6 3 1
    7 8 5

    Any help would be much appreciated. Thanks!

  • #2
    The rules there aren't clear (at least to me).


    • #3
      For example, the set x = 1 intersects the sets y = 1, 2 and 3, but the set y = 2 also intersects the set x = 5. Therefore I want to cluster these sets: I want to create an index z with the same value for x = 1, 5, 6 and y = 1, 2, 3.


      • #4
        Either your description is still not sufficient or your example is not correct. However, I think that this is what you want.

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input float(x y z)
        1 1 1
        1 2 1
        1 3 1
        2 4 2
        2 5 3
        2 6 3
        4 7 4
        5 2 1
        6 3 1
        7 8 5
        end
        
        bys x (y): gen tag= y if _n==_N
        bys x: egen lastob= max(tag)
        sort y
        replace lastob= lastob[_n-1] if lastob<lastob[_n-1] & _n!=1
        egen Z= group(lastob)
        Result:

        Code:
        . sort Z y x
        
        . list, sepby(Z)
        
             +------------------------------+
             | x   y   z   tag   lastob   Z |
             |------------------------------|
          1. | 1   1   1     .        3   1 |
          2. | 1   2   1     .        3   1 |
          3. | 5   2   1     2        3   1 |
          4. | 1   3   1     3        3   1 |
          5. | 6   3   1     3        3   1 |
             |------------------------------|
          6. | 2   4   2     .        6   2 |
          7. | 2   5   3     .        6   2 |
          8. | 2   6   3     6        6   2 |
             |------------------------------|
          9. | 4   7   4     7        7   3 |
             |------------------------------|
         10. | 7   8   5     8        8   4 |
             +------------------------------+


        • #5
          Thanks Andrew, but it's not quite what I need. I'm not sure how to explain it. All x’s that have common y’s get the same index z (z is the index I created manually in the example above, but I would like to create it using code). For instance, if x = 1 and x = 5 have a common y in any observation, say y = 2, then both x = 1 and x = 5 get the same index. Likewise, if x = 1 and x = 6 have any common y, say y = 3, then these two x's should also get the same index, and hence x = 1, x = 5, and x = 6 all get a common index. I added a few more examples. I hope this makes my question clearer.

              +-------------+
              |  x    y   z |
              |-------------|
           1. |  1    1   1 |
           2. |  1    2   1 |
           3. |  1    3   1 |
           4. |  2    4   2 |
           5. |  2    5   3 |
              |-------------|
           6. |  2    6   3 |
           7. |  4    7   4 |
           8. |  5    2   1 |
           9. |  6    3   1 |
          10. |  7    8   5 |
              |-------------|
          11. |  8    9   6 |
          12. |  9    5   3 |
          13. |  9   10   3 |
          14. | 11   11   7 |
          15. | 12    8   5 |
              |-------------|
          16. | 13    9   6 |
          17. | 13   12   6 |
          18. | 13   13   6 |
          19. | 14   14   8 |
          20. | 15   14   8 |
              +-------------+


          • #6
            Hi Gabi,

            I tried sorting by z (and previously by y) to see if I could figure out your decision rule, but I confess I can't figure it out.

            1) It would be really helpful if you used Stata's dataex command to share your data. (It makes it a *ton* easier for others to help you.) Since most Stata users haven't used it before coming to Statalist, I created a tutorial on it on YouTube here. (Watch at 2x speed :-)

            2) I think it would also be helpful if you could give us some context (i.e. what do x, y, and z stand for?) For example, as I read your example in post #5, I wondered, are you asking "If person1 and person5 both know person #2 (y=2), create a marker for this." Or, create a listing for all of the relationships that two individuals have in common. If that's the case, Stata has a lot of tools for social ties / network analysis.

            Code:
            * Example shared via -dataex-. To install: ssc install dataex
            clear
            input byte(z y x)
            1  1  1
            1  2  1
            1  2  5
            1  3  1
            1  3  6
            2  4  2
            3  5  2
            3  5  9
            3  6  2
            3 10  9
            4  7  4
            5  8  7
            5  8 12
            6  9  8
            6  9 13
            6 12 13
            6 13 13
            7 11 11
            8 14 14
            8 14 15
            end
            
            sort z y x
            . list, sepby(z)
            
                 +-------------+
                 | z    y    x |
                 |-------------|
              1. | 1    1    1 |
              2. | 1    2    1 |
              3. | 1    2    5 |
              4. | 1    3    1 |
              5. | 1    3    6 |
                 |-------------|
              6. | 2    4    2 |
                 |-------------|
              7. | 3    5    2 |
              8. | 3    5    9 |
              9. | 3    6    2 |
             10. | 3   10    9 |
                 |-------------|
             11. | 4    7    4 |
                 |-------------|
             12. | 5    8    7 |
             13. | 5    8   12 |
                 |-------------|
             14. | 6    9    8 |
             15. | 6    9   13 |
             16. | 6   12   13 |
             17. | 6   13   13 |
                 |-------------|
             18. | 7   11   11 |
                 |-------------|
             19. | 8   14   14 |
             20. | 8   14   15 |
                 +-------------+
            Also, in your data, it's not clear why z==2 isn't lumped with z==3, because x==2 intersects both y==4 and y==5.
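
            A quick check of this kind can show where a hand-made index disagrees with the stated rule. This is only a sketch: the variable names split_x and split_y are made up for illustration, and the check only tests that z is constant within each x and within each y, not that the groups are as fine as they could be.

            ```stata
            * Data from the -dataex- example above
            clear
            input byte(z y x)
            1  1  1
            1  2  1
            1  2  5
            1  3  1
            1  3  6
            2  4  2
            3  5  2
            3  5  9
            3  6  2
            3 10  9
            4  7  4
            5  8  7
            5  8 12
            6  9  8
            6  9 13
            6 12 13
            6 13 13
            7 11 11
            8 14 14
            8 14 15
            end

            * Flag rows whose x (or y) carries more than one value of z
            bysort x (z): gen byte split_x = z[1] != z[_N]
            bysort y (z): gen byte split_y = z[1] != z[_N]
            list x y z if split_x | split_y, sepby(x)
            ```

            On these data only the three rows with x == 2 are flagged, since x == 2 carries both z == 2 and z == 3.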


            • #7
              Quoting #5: "Thanks Andrew, but it's not quite what I need. I'm not sure how to explain it. All x’s that have common y’s get the same index z (z is the index I created manually in the example above, but I would like to create it using code). For instance, if x = 1 and x = 5 have a common y in any observation, say y = 2, then both x = 1 and x = 5 get the same index. Likewise, if x = 1 and x = 6 have any common y, say y = 3, then these two x's should also get the same index, and hence x = 1, x = 5, and x = 6 all get a common index. I added a few more examples. I hope this makes my question clearer."

              Thanks for the additional explanation. You are correct that my code in #4 would fail in a large data set; I had not tested how robust it was. Here is one way, although note that the error in your second group remains.

              Code:
              * Example shared via -dataex-. To install: ssc install dataex
              clear
              input byte(z y x)
              1  1  1
              1  2  1
              1  2  5
              1  3  1
              1  3  6
              2  4  2
              3  5  2
              3  5  9
              3  6  2
              3 10  9
              4  7  4
              5  8  7
              5  8 12
              6  9  8
              6  9 13
              6 12 13
              6 13 13
              7 11 11
              8 14 14
              8 14 15
              end
              
              preserve
              duplicates tag y, gen(dup)
              keep if dup==1
              keep y x
              bys y: egen x2= min(x)
              keep x x2
              contract x x2
              drop _freq
              tempfile x
              save `x'
              restore
              merge m:1 x using `x'
              replace x2 = x if missing(x2)
              egen Z= group(x2)
              sort x2

              Result:

              Code:
              . l z y x Z, sepby(Z)
              
                   +-----------------+
                   | z    y    x   Z |
                   |-----------------|
                1. | 1    2    1   1 |
                2. | 1    1    1   1 |
                3. | 1    3    6   1 |
                4. | 1    3    1   1 |
                5. | 1    2    5   1 |
                   |-----------------|
                6. | 3    5    9   2 |
                7. | 3    5    2   2 |
                8. | 3    6    2   2 |
                9. | 3   10    9   2 |
               10. | 2    4    2   2 |
                   |-----------------|
               11. | 4    7    4   3 |
                   |-----------------|
               12. | 5    8   12   4 |
               13. | 5    8    7   4 |
                   |-----------------|
               14. | 6    9    8   5 |
               15. | 6   13   13   5 |
               16. | 6   12   13   5 |
               17. | 6    9   13   5 |
                   |-----------------|
               18. | 7   11   11   6 |
                   |-----------------|
               19. | 8   14   14   7 |
               20. | 8   14   15   7 |
                   +-----------------+


              • #8
                Thanks, it worked!


                • #9
                  1. The code of Andrew Musau in #7 does not work when two or more "bridges" are needed to establish the link in this two-way connection between x and y. The sample below illustrates this: x = 13 has no direct connection with x = 2, but should still receive the same index (z) through two bridges: x = 13 and x = 9 both have y = 10, while x = 9 connects with x = 2 because both share y = 5.
                  Code:
                  clear
                  input byte(x y) float z
                   1  1 1
                   1  2 1
                   1  3 1
                   2  4 2
                   2  5 2
                   2  6 2
                   4  7 3
                   5  2 1
                   6  3 1
                   7  8 4
                   8  9 5
                   9  5 2
                   9 10 2
                  11 11 6
                  12  8 4
                  13 10 2
                  13 12 2
                  13 13 2
                  14 14 7
                  15 14 7
                  end
                  2. An untested fix of my own is under construction, leaving aside an existing community-contributed package that serves this purpose specifically. I would be willing to share it later if anyone might find it useful. For now, I am curious about any extension or amendment by Andrew, or anyone else, toward a better solution. I very much appreciate learning through this interesting discussion.


                  • #10
                    Thanks Romalpa Akzo for looking into this. I am always impressed by the quality and thoroughness of your code. Applying my code in #7 to your example data gives me advance warning of a problem at the merge stage.

                    Code:
                    . merge m:1 x using `x'
                    variable x does not uniquely identify observations in the using data
                    r(459);
                    After running the contract command, what is apparent is that there are still duplicate observations in terms of the variable x, in this case, x=9.

                    Code:
                    . l, clean
                            x   x2    
                      1.    1    1         
                      2.    2    2        
                      3.    5    1         
                      4.    6    1         
                      5.    7    7        
                      6.    9    2    
                      7.    9    9        
                      8.   12    7        
                      9.   13    9       
                     10.   14   14        
                     11.   15   14
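
                    One way to catch this before the merge is to test whether x uniquely identifies the rows of the file about to be saved. The following is a minimal sketch that types in the x/x2 pairs listed above by hand; the variable name ndup is made up for illustration.

                    ```stata
                    * The x/x2 pairs left after -contract-, as listed above
                    clear
                    input byte(x x2)
                     1  1
                     2  2
                     5  1
                     6  1
                     7  7
                     9  2
                     9  9
                    12  7
                    13  9
                    14 14
                    15 14
                    end

                    * isid exits with an error if x does not uniquely identify the observations
                    capture isid x
                    if _rc {
                        display as error "x is not unique, so merge m:1 x would fail"
                        * ndup > 0 marks the observations whose x value is repeated
                        duplicates tag x, gen(ndup)
                        list x x2 if ndup > 0
                    }
                    ```

                    Here the two observations with x == 9 are listed, which is the duplicate that trips the merge.
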
                    An amendment to my code needs to focus on how to deal with duplicates at this stage. I can provide a quick fix for the case of a single duplicate, as in your example data, but I need to think a little more about a general approach. I will post back if and when I figure this out.

                    Code:
                    clear
                    input byte(x y) float z
                     1  1 1
                     1  2 1
                     1  3 1
                     2  4 2
                     2  5 2
                     2  6 2
                     4  7 3
                     5  2 1
                     6  3 1
                     7  8 4
                     8  9 5
                     9  5 2
                     9 10 2
                    11 11 6
                    12  8 4
                    13 10 2
                    13 12 2
                    13 13 2
                    14 14 7
                    15 14 7
                    end
                    
                    preserve
                    duplicates tag y, gen(dup)
                    keep if dup==1
                    keep y x
                    bys y: egen x2= min(x)
                    keep x x2
                    contract x x2
                    drop _freq
                    duplicates tag x, gen(dup)
                    bys x2: egen dup2= max(dup)
                    bys dup2: egen x3= min(x2) if dup2
                    replace x2= x3 if !missing(x3)
                    keep x x2
                    contract x x2
                    drop _freq
                    tempfile x
                    save `x'
                    restore
                    merge m:1 x using `x'
                    replace x2= x if missing(x2)
                    egen Z= group(x2)
                    sort x2

                    Result:
                    Code:
                    . list x y z Z, sepby(Z)
                    
                         +-----------------+
                         |  x    y   z   Z |
                         |-----------------|
                      1. |  1    2   1   1 |
                      2. |  5    2   1   1 |
                      3. |  6    3   1   1 |
                      4. |  1    3   1   1 |
                      5. |  1    1   1   1 |
                         |-----------------|
                      6. |  9   10   2   2 |
                      7. |  9    5   2   2 |
                      8. | 13   10   2   2 |
                      9. |  2    6   2   2 |
                     10. | 13   12   2   2 |
                     11. | 13   13   2   2 |
                     12. |  2    5   2   2 |
                     13. |  2    4   2   2 |
                         |-----------------|
                     14. |  4    7   3   3 |
                         |-----------------|
                     15. | 12    8   4   4 |
                     16. |  7    8   4   4 |
                         |-----------------|
                     17. |  8    9   5   5 |
                         |-----------------|
                     18. | 11   11   6   6 |
                         |-----------------|
                     19. | 15   14   7   7 |
                     20. | 14   14   7   7 |
                         +-----------------+


                    • #11
                      Thanks Andrew Musau for your kind discussion. Please check your code in #10 with the more "complicated" example below. Notice that x = 6 needs up to 4 bridges to connect with x =2.
                      Code:
                      clear
                      input float x byte y float z
                      1  1 1
                      1  2 1
                      1  3 1
                      2  4 2
                      2  5 2
                      2  6 2
                      4  6 2
                      4  7 2
                      4  8 2
                      5  8 2
                      5  9 2
                      5 10 2
                      6 10 2
                      6 11 2
                      6 12 2
                      7  3 1
                      7 13 1
                      7 14 1
                      8 14 1
                      8 15 1
                      8 16 1
                      end


                      • #12
                        As the connections become more complicated, you will need to reproduce something close to the code of group_twoway (from SSC) to deal with the duplicates issue. Here is an application that calls the program on your revised data. I will try out a different approach and post back.

                        Code:
                        clear
                        input float x byte y float z
                        1  1 1
                        1  2 1
                        1  3 1
                        2  4 2
                        2  5 2
                        2  6 2
                        4  6 2
                        4  7 2
                        4  8 2
                        5  8 2
                        5  9 2
                        5 10 2
                        6 10 2
                        6 11 2
                        6 12 2
                        7  3 1
                        7 13 1
                        7 14 1
                        8 14 1
                        8 15 1
                        8 16 1
                        end
                        
                        preserve
                        duplicates tag y, gen(dup)
                        keep if dup==1
                        keep y x
                        bys y: egen x2= min(x)
                        keep x x2
                        contract x x2
                        drop _freq
                        * To install: ssc install group_twoway
                        group_twoway x x2, gen(gr)
                        keep x gr
                        contract x gr
                        drop _freq
                        tempfile x
                        save `x'
                        restore
                        merge m:1 x using `x'
                        replace gr= x*0.1 if missing(gr)
                        egen Z= group(gr)

                        Result:

                        Code:
                        . list x y z Z, sepby(Z)
                        
                             +----------------+
                             | x    y   z   Z |
                             |----------------|
                          1. | 1    1   1   1 |
                          2. | 1    2   1   1 |
                          3. | 1    3   1   1 |
                             |----------------|
                          4. | 2    4   2   2 |
                          5. | 2    5   2   2 |
                          6. | 2    6   2   2 |
                          7. | 4    6   2   2 |
                          8. | 4    7   2   2 |
                          9. | 4    8   2   2 |
                         10. | 5    8   2   2 |
                         11. | 5    9   2   2 |
                         12. | 5   10   2   2 |
                         13. | 6   10   2   2 |
                         14. | 6   11   2   2 |
                         15. | 6   12   2   2 |
                             |----------------|
                         16. | 7    3   1   1 |
                         17. | 7   13   1   1 |
                         18. | 7   14   1   1 |
                         19. | 8   14   1   1 |
                         20. | 8   15   1   1 |
                         21. | 8   16   1   1 |
                             +----------------+


                        • #13
                          Andrew Musau, thanks for your sharing.

                          1. Actually, it seems to me that group_twoway was written for exactly this kind of problem. See the code below, which should be all that is needed here.
                          Code:
                          gen y1 = y + 100
                          group_twoway x y1, gen(Z)
                          2. However, as you might have noticed, I avoided mentioning this package in order to keep the discussion on reproducing the logic of group_twoway outside its “black box”. In this regard, your discussion is highly appreciated, even though your direction and mine seem quite different. Below is my approach.
                          Code:
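                          * a0: one label per distinct x; a1: smallest x-label among rows sharing a y
                          egen a0 = group(x)
                          egen a1 = min(a0), by(y)
                          
                          * Keep taking the minimum label within the previous grouping
                          * until two successive label variables agree in every observation
                          local l=1
                          while r(N) <_N{
                              egen a`=`l'+1' = min(a`l'), by(a`=`l'-1')
                              count if a`l++' == a`l'
                          }
                          
                          egen Z = group(a`l')
                          drop a*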
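
                          For readers who find the macro arithmetic above hard to follow, the same idea of passing the smallest label back and forth between x and y can be written more verbosely. This is a sketch, not a replacement for the code above: the variable names lab, lab_y and lab_x and the local changed are made up for illustration, and the convergence test is kept in an explicit local macro. It uses the example data from #11.

                          ```stata
                          clear
                          input float x byte y float z
                          1  1 1
                          1  2 1
                          1  3 1
                          2  4 2
                          2  5 2
                          2  6 2
                          4  6 2
                          4  7 2
                          4  8 2
                          5  8 2
                          5  9 2
                          5 10 2
                          6 10 2
                          6 11 2
                          6 12 2
                          7  3 1
                          7 13 1
                          7 14 1
                          8 14 1
                          8 15 1
                          8 16 1
                          end

                          * Start with one label per distinct x
                          egen long lab = group(x)

                          * Push the smallest label across shared y, then across shared x,
                          * and repeat until no observation changes its label
                          local changed = 1
                          while `changed' {
                              egen long lab_y = min(lab), by(y)
                              egen long lab_x = min(lab_y), by(x)
                              quietly count if lab_x != lab
                              local changed = r(N)
                              quietly replace lab = lab_x
                              drop lab_y lab_x
                          }

                          * Renumber the surviving labels as 1, 2, ...
                          egen Z = group(lab)
                          drop lab
                          ```

                          On this example the loop reproduces the hand-assigned index: Z is 1 for x = 1, 7, 8 and 2 for x = 2, 4, 5, 6.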


                          • #14
                            This is very concise, thank you.
