generating an index

Gabi Chioran

Join Date: Jan 2019

Posts: 4
#1

09 Jan 2019, 08:39

Hi, I am trying to generate the index "z" in Stata, so that Z has the same value for all Xs and Ys. For example:

x y z
1 1 1
1 2 1
1 3 1
2 4 2
2 5 3
2 6 3
4 7 4
5 2 1
6 3 1
7 8 5

Any help would be much appreciated. Thanks!
Nick Cox

Join Date: Mar 2014

Posts: 35708
#2

09 Jan 2019, 08:43

The rules there aren't clear (at least to me).
Gabi Chioran

Join Date: Jan 2019

Posts: 4
#3

09 Jan 2019, 08:58

For example the set x=1 intersects the sets y= 1, 2 and 3. But set y=2 also intersects the set x= 5. Therefore I want to cluster these sets. I want to create an index Z with the same value for x = 1, 5,6 and y=1,2,3.

Andrew Musau

Join Date: Oct 2014
Posts: 10213

#4

09 Jan 2019, 11:12

Either your description is still not sufficient or your example is not correct. However, I think that this is what you want.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(x y z)
1 1 1
1 2 1
1 3 1
2 4 2
2 5 3
2 6 3
4 7 4
5 2 1
6 3 1
7 8 5
end

bys x (y): gen tag= y if _n==_N
bys x: egen lastob= max(tag)
sort y
replace lastob= lastob[_n-1] if lastob<lastob[_n-1] & _n!=1
egen Z= group(lastob)

Result:

Code:

. sort Z y x

. list, sepby(Z)

     +------------------------------+
     | x   y   z   tag   lastob   Z |
     |------------------------------|
  1. | 1   1   1     .        3   1 |
  2. | 1   2   1     .        3   1 |
  3. | 5   2   1     2        3   1 |
  4. | 1   3   1     3        3   1 |
  5. | 6   3   1     3        3   1 |
     |------------------------------|
  6. | 2   4   2     .        6   2 |
  7. | 2   5   3     .        6   2 |
  8. | 2   6   3     6        6   2 |
     |------------------------------|
  9. | 4   7   4     7        7   3 |
     |------------------------------|
 10. | 7   8   5     8        8   4 |
     +------------------------------+


Gabi Chioran

Join Date: Jan 2019

Posts: 4
#5

09 Jan 2019, 14:23

Thanks Andrew, but it's not quite what I need. Not sure how to explain... All x’s that have common y’s get the same index z (z is the index I created manually in the example above, but I would like to create it using code). For instance, if x = 1 and x = 5 have any instance where they have a common y, say y = 2, then both x = 1 and x = 5 will get the same index. Again, if x=1 and x = 6 have any common y, say y = 3, then these two x should also get the same index, and hence, x= 1, x = 5, and x= 6 will all get the common index. I added a few more examples. I hope it makes my question clearer.

     +-------------+
     |  x    y   z |
     |-------------|
  1. |  1    1   1 |
  2. |  1    2   1 |
  3. |  1    3   1 |
  4. |  2    4   2 |
  5. |  2    5   3 |
     |-------------|
  6. |  2    6   3 |
  7. |  4    7   4 |
  8. |  5    2   1 |
  9. |  6    3   1 |
 10. |  7    8   5 |
     |-------------|
 11. |  8    9   6 |
 12. |  9    5   3 |
 13. |  9   10   3 |
 14. | 11   11   7 |
 15. | 12    8   5 |
     |-------------|
 16. | 13    9   6 |
 17. | 13   12   6 |
 18. | 13   13   6 |
 19. | 14   14   8 |
 20. | 15   14   8 |
     +-------------+
David Benson

Join Date: Oct 2018

Posts: 489
#6

09 Jan 2019, 16:25

Hi Gabi,

I tried sorting by z (and, before that, by y) to see if I could work out your decision rule, but I confess I can't figure it out.

1) It would be really helpful if you used Stata's dataex command to share your data. (It makes it a *ton* easier for others to help you). Since most Stata users haven't used it before coming to Statalist, I created a tutorial on it on Youtube here. (Watch at 2x speed :-)

2) I think it would also be helpful if you could give us some context (i.e. what do x, y, and z stand for?) For example, as I read your example in post #5, I wondered, are you asking "If person1 and person5 both know person #2 (y=2), create a marker for this." Or, create a listing for all of the relationships that two individuals have in common. If that's the case, Stata has a lot of tools for social ties / network analysis.

Code:

* Example shared via -dataex-. To install: ssc install dataex
clear
input byte(z y x)
1  1  1
1  2  1
1  2  5
1  3  1
1  3  6
2  4  2
3  5  2
3  5  9
3  6  2
3 10  9
4  7  4
5  8  7
5  8 12
6  9  8
6  9 13
6 12 13
6 13 13
7 11 11
8 14 14
8 14 15
end

sort z y x

. list, sepby(z)

     +-------------+
     | z    y    x |
     |-------------|
  1. | 1    1    1 |
  2. | 1    2    1 |
  3. | 1    2    5 |
  4. | 1    3    1 |
  5. | 1    3    6 |
     |-------------|
  6. | 2    4    2 |
     |-------------|
  7. | 3    5    2 |
  8. | 3    5    9 |
  9. | 3    6    2 |
 10. | 3   10    9 |
     |-------------|
 11. | 4    7    4 |
     |-------------|
 12. | 5    8    7 |
 13. | 5    8   12 |
     |-------------|
 14. | 6    9    8 |
 15. | 6    9   13 |
 16. | 6   12   13 |
 17. | 6   13   13 |
     |-------------|
 18. | 7   11   11 |
     |-------------|
 19. | 8   14   14 |
 20. | 8   14   15 |
     +-------------+

Also, in your data, it's not clear why z==2 isn't lumped with z==3. (Because x==2 intersects with y==4 and with y==5.)
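One way to spot this kind of inconsistency mechanically is to flag every x whose rows carry more than one hand-assigned z. The snippet below is only a sketch: it retypes the six rows for x==1 and x==2 from the example, and the variable name mixed is made up for illustration.

```stata
* Rows for x==1 and x==2 copied from the example data
clear
input byte(z y x)
1 1 1
1 2 1
1 3 1
2 4 2
3 5 2
3 6 2
end

* Within each x, sort by z and compare the smallest and largest z:
* if they differ, that x has been given more than one index
bysort x (z): gen byte mixed = z[1] != z[_N]

* Show the offending rows
list x y z if mixed, sepby(x)
```

If the rule is that all rows of one x share an index, the listing should come back empty on consistent data.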

Andrew Musau

Join Date: Oct 2014
Posts: 10213

#7

10 Jan 2019, 00:34

Thanks for the additional explanation. You are correct that my code in #4 would fail in a large data set; I had not tested how robust it was. Here is one way, although note that the inconsistency in your second group remains: x = 2 carries both z = 2 and z = 3.

Code:

* Example shared via -dataex-. To install: ssc install dataex
clear
input byte(z y x)
1  1  1
1  2  1
1  2  5
1  3  1
1  3  6
2  4  2
3  5  2
3  5  9
3  6  2
3 10  9
4  7  4
5  8  7
5  8 12
6  9  8
6  9 13
6 12 13
6 13 13
7 11 11
8 14 14
8 14 15
end

preserve
duplicates tag y, gen(dup)
keep if dup==1
keep y x
bys y: egen x2= min(x)
keep x x2
contract x x2
drop _freq
tempfile x
save `x'
restore
merge m:1 x using `x'
replace x2 = x if missing(x2)
egen Z= group(x2)
sort x2

Result:

Code:

. l z y x Z, sepby(Z)

     +-----------------+
     | z    y    x   Z |
     |-----------------|
  1. | 1    2    1   1 |
  2. | 1    1    1   1 |
  3. | 1    3    6   1 |
  4. | 1    3    1   1 |
  5. | 1    2    5   1 |
     |-----------------|
  6. | 3    5    9   2 |
  7. | 3    5    2   2 |
  8. | 3    6    2   2 |
  9. | 3   10    9   2 |
 10. | 2    4    2   2 |
     |-----------------|
 11. | 4    7    4   3 |
     |-----------------|
 12. | 5    8   12   4 |
 13. | 5    8    7   4 |
     |-----------------|
 14. | 6    9    8   5 |
 15. | 6   13   13   5 |
 16. | 6   12   13   5 |
 17. | 6    9   13   5 |
     |-----------------|
 18. | 7   11   11   6 |
     |-----------------|
 19. | 8   14   14   7 |
 20. | 8   14   15   7 |
     +-----------------+


Gabi Chioran

Join Date: Jan 2019

Posts: 4
#8

10 Jan 2019, 05:16

Thanks, it worked!
Romalpa Akzo

Join Date: Oct 2017

Posts: 369
#9

10 Jan 2019, 09:46

1. Andrew Musau's code in #7 does not work when two or more "bridges" are needed to establish a link in this two-way connection of x and y. The sample below illustrates this: x = 13 has no direct connection with x = 2, yet both are expected to share the same index (z) through two bridges. x = 13 and x = 9 both have y = 10, while x = 9 connects with x = 2 because both have y = 5.

Code:

clear
input byte(x y) float z
 1  1 1
 1  2 1
 1  3 1
 2  4 2
 2  5 2
 2  6 2
 4  7 3
 5  2 1
 6  3 1
 7  8 4
 8  9 5
 9  5 2
 9 10 2
11 11 6
12  8 4
13 10 2
13 12 2
13 13 2
14 14 7
15 14 7
end

2. I am working on a fix of my own, still untested, and I am deliberately leaving aside a community-contributed package that addresses exactly this problem. I am happy to share it later if anyone finds it useful. For now, I am curious whether Andrew, or anyone else, can extend or amend the code into a better solution. I very much appreciate learning from this interesting discussion.

Andrew Musau

Join Date: Oct 2014
Posts: 10213

#10

11 Jan 2019, 00:15

Thanks Romalpa Akzo for looking into this. I am always impressed by the quality and thoroughness of your code. Applying my code in #7 to your example data gives advance warning of a problem at the merge stage.

Code:

. merge m:1 x using `x'
variable x does not uniquely identify observations in the using data
r(459);

After running the contract command, what is apparent is that there are still duplicate observations in terms of the variable x, in this case, x=9.

Code:

. l, clean
        x   x2    
  1.    1    1         
  2.    2    2        
  3.    5    1         
  4.    6    1         
  5.    7    7        
  6.    9    2    
  7.    9    9        
  8.   12    7        
  9.   13    9       
 10.   14   14        
 11.   15   14
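A quick way to pin down which x values are duplicated at this point is sketched below. It retypes the listing above as data so that it runs on its own; the variable name ndup is arbitrary.

```stata
* The x/x2 pairs from the listing above
clear
input byte(x x2)
 1  1
 2  2
 5  1
 6  1
 7  7
 9  2
 9  9
12  7
13  9
14 14
15 14
end

* ndup counts how many extra copies of each x exist (0 means x is unique)
duplicates tag x, gen(ndup)
list x x2 if ndup, sepby(x)

* isid exits with an error when x does not uniquely identify observations;
* capture noisily shows the message without stopping the do-file
capture noisily isid x
```

Any x listed here is one that maps to more than one x2, which is what makes merge m:1 refuse the using data.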

An amendment to my code needs to focus on how to deal with duplicates at this stage. I can provide a quick fix for the case of a single duplicate, as in your example data, but I need to think a little more about a general approach. I will post back if and when I figure this out.

Code:

clear
input byte(x y) float z
 1  1 1
 1  2 1
 1  3 1
 2  4 2
 2  5 2
 2  6 2
 4  7 3
 5  2 1
 6  3 1
 7  8 4
 8  9 5
 9  5 2
 9 10 2
11 11 6
12  8 4
13 10 2
13 12 2
13 13 2
14 14 7
15 14 7
end

preserve
duplicates tag y, gen(dup)
keep if dup==1
keep y x
bys y: egen x2= min(x)
keep x x2
contract x x2
drop _freq
duplicates tag x, gen(dup)
bys x2: egen dup2= max(dup)
bys dup2: egen x3= min(x2) if dup2
replace x2= x3 if !missing(x3)
keep x x2
contract x x2
drop _freq
tempfile x
save `x'
restore
merge m:1 x using `x'
replace x2= x if missing(x2)
egen Z= group(x2)
sort x2

Result:

Code:

. list x y z Z, sepby(Z)

     +-----------------+
     |  x    y   z   Z |
     |-----------------|
  1. |  1    2   1   1 |
  2. |  5    2   1   1 |
  3. |  6    3   1   1 |
  4. |  1    3   1   1 |
  5. |  1    1   1   1 |
     |-----------------|
  6. |  9   10   2   2 |
  7. |  9    5   2   2 |
  8. | 13   10   2   2 |
  9. |  2    6   2   2 |
 10. | 13   12   2   2 |
 11. | 13   13   2   2 |
 12. |  2    5   2   2 |
 13. |  2    4   2   2 |
     |-----------------|
 14. |  4    7   3   3 |
     |-----------------|
 15. | 12    8   4   4 |
 16. |  7    8   4   4 |
     |-----------------|
 17. |  8    9   5   5 |
     |-----------------|
 18. | 11   11   6   6 |
     |-----------------|
 19. | 15   14   7   7 |
 20. | 14   14   7   7 |
     +-----------------+


Romalpa Akzo

Join Date: Oct 2017

Posts: 369
#11

11 Jan 2019, 02:20

Thanks Andrew Musau for your kind discussion. Please check your code in #10 against the more "complicated" example below. Notice that x = 6 needs up to 4 bridges to connect with x = 2.

Code:

clear
input float x byte y float z
1  1 1
1  2 1
1  3 1
2  4 2
2  5 2
2  6 2
4  6 2
4  7 2
4  8 2
5  8 2
5  9 2
5 10 2
6 10 2
6 11 2
6 12 2
7  3 1
7 13 1
7 14 1
8 14 1
8 15 1
8 16 1
end

Andrew Musau

Join Date: Oct 2014
Posts: 10213

#12

11 Jan 2019, 04:23

As the connections become more complicated, you will need to reproduce something close to the code of group_twoway (from SSC) to deal with the duplicates issue. Here is an application calling the program and using your revised data. I will try out a different approach and post.

Code:

clear
input float x byte y float z
1  1 1
1  2 1
1  3 1
2  4 2
2  5 2
2  6 2
4  6 2
4  7 2
4  8 2
5  8 2
5  9 2
5 10 2
6 10 2
6 11 2
6 12 2
7  3 1
7 13 1
7 14 1
8 14 1
8 15 1
8 16 1
end

preserve
duplicates tag y, gen(dup)
keep if dup==1
keep y x
bys y: egen x2= min(x)
keep x x2
contract x x2
drop _freq
* To install: ssc install group_twoway
group_twoway x x2, gen(gr)
keep x gr
contract x gr
drop _freq
tempfile x
save `x'
restore
merge m:1 x using `x'
replace gr= x*0.1 if missing(gr)
egen Z= group(gr)

Result:

Code:

. list x y z Z, sepby(Z)

     +----------------+
     | x    y   z   Z |
     |----------------|
  1. | 1    1   1   1 |
  2. | 1    2   1   1 |
  3. | 1    3   1   1 |
     |----------------|
  4. | 2    4   2   2 |
  5. | 2    5   2   2 |
  6. | 2    6   2   2 |
  7. | 4    6   2   2 |
  8. | 4    7   2   2 |
  9. | 4    8   2   2 |
 10. | 5    8   2   2 |
 11. | 5    9   2   2 |
 12. | 5   10   2   2 |
 13. | 6   10   2   2 |
 14. | 6   11   2   2 |
 15. | 6   12   2   2 |
     |----------------|
 16. | 7    3   1   1 |
 17. | 7   13   1   1 |
 18. | 7   14   1   1 |
 19. | 8   14   1   1 |
 20. | 8   15   1   1 |
 21. | 8   16   1   1 |
     +----------------+


Romalpa Akzo

Join Date: Oct 2017

Posts: 369
#13

12 Jan 2019, 06:46

Andrew Musau, thanks for sharing.

1. In my view, group_twoway was made for exactly this problem. The code below should be enough to solve it.

Code:

gen y1 = y + 100
group_twoway x y1, gen(Z)

2. However, as you might have noticed, I avoided mentioning this package earlier in order to keep the discussion on reproducing the logic of group_twoway outside its "black box". In this regard, your contribution is highly appreciated, although your direction and mine seem quite different. Below is my approach.

Code:

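* a0: one label per distinct x; a1: smallest such label among rows sharing y
egen a0 = group(x)
egen a1 = min(a0), by(y)
local l=1
* Keep propagating the smallest label, each time grouping by the labels
* from two steps back, until the last two label variables agree everywhere
while r(N) <_N{
    egen a`=`l'+1' = min(a`l'), by(a`=`l'-1')
    count if a`l++' == a`l'
}
* Renumber the final labels consecutively and remove the working variables
egen Z = group(a`l')
drop a*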
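For readers who want the propagation spelled out step by step, here is a minimal sketch of the same general idea. It is neither the code above nor a claim about what group_twoway does internally. It assumes the example data from #11, and the names lab, m, and changed are made up for illustration.

```stata
* Data from the example in #11 (x, y, and the hand-coded index z)
clear
input float x byte y float z
1  1 1
1  2 1
1  3 1
2  4 2
2  5 2
2  6 2
4  6 2
4  7 2
4  8 2
5  8 2
5  9 2
5 10 2
6 10 2
6 11 2
6 12 2
7  3 1
7 13 1
7 14 1
8 14 1
8 15 1
8 16 1
end

* Start with one label per distinct x
egen long lab = group(x)

* Alternate two passes until a full round changes nothing
local changed = 1
while `changed' {
    * Pass 1: each row takes the smallest label among rows with the same y
    bysort y: egen long m = min(lab)
    quietly count if m != lab
    local changed = r(N)
    quietly replace lab = m
    drop m

    * Pass 2: each row takes the smallest label among rows with the same x
    bysort x: egen long m = min(lab)
    quietly count if m != lab
    local changed = `changed' + r(N)
    quietly replace lab = m
    drop m
}

* Renumber the surviving labels consecutively
egen Z = group(lab)

* Sanity check: Z and the hand-coded z should define the same partition
bysort Z (z): assert z[1] == z[_N]
bysort z (Z): assert Z[1] == Z[_N]
```

Labels only ever decrease, so the loop must stop. Once it stops, lab is constant within every x and within every y, which is the property the index needs. The two assert lines are just a check against the hand-coded z and can be dropped on real data.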
Andrew Musau

Join Date: Oct 2014

Posts: 10213
#14

13 Jan 2019, 00:33

This is very concise, thank you.
