I want to calculate how many friends a user has, but I don't know how to

Xu Ji

Join Date: Sep 2022

Posts: 15
#1

03 Oct 2022, 09:20

(Cross-posted on https://stackoverflow.com/questions/...nt-know-how-to)

Apologies for the title; I don't know how else to phrase it!

I have a dataset from Steam that includes steamid (an individual user on Steam) and steamid_b (another user who is a friend of that user). Now I want to calculate how many friends each steamid has.

Here is a sample from my dataset:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str17(steamid steamid_b)
"76561197960265729" "76561197967144365"
"76561197960265730" "76561197960265733"
"76561197960265730" "76561197960265733"
"76561197960265730" "76561197960265733"
"76561197960265733" "76561197964770089"
"76561197960265733" "76561197964770089"
"76561197960265733" "76561197964770089"
"76561197960265733" "76561197964770089"
"76561197960265733" "76561197964770089"
"76561197960265738" "76561198010062752"
end

At first glance it looks as if every user has only one friend, but a steamid sometimes also appears as steamid_b, which means that user actually has 2 friends. For example, 76561197960265733 has the friend 76561197964770089 but also appears as a friend of 76561197960265730, so 76561197960265733 actually has 2 friends. Which command can I use to calculate such relationships? I don't know if it is relevant, but the dataset has over 32 million observations.

Last edited by Xu Ji; 03 Oct 2022, 09:45.
George Ford

Join Date: Aug 2014

Posts: 3152
#2

03 Oct 2022, 09:37

You need to use dataex to post sample data.
Xu Ji

Join Date: Sep 2022

Posts: 15
#3

03 Oct 2022, 09:47

Thank you, George, for the reminder; I have edited the post.

George Ford

Join Date: Aug 2014
Posts: 3152
#4

03 Oct 2022, 09:58

There is probably a better way, especially with so many observations, but this might work.

Code:

g friends = .
levelsof steamid, local(levels)
  foreach s of local levels {
     qui gunique steamid_b if steamid==`s'
    local m1 = r(J)
    qui gunique steamid if steamid==`s'
    local m2 = r(J)
    qui replace friends = `m1'+`m2' if steamid==`s'
    di %10.0f `s' _col(20) r(J)
}

Last edited by George Ford; 03 Oct 2022, 10:16.


Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#5

03 Oct 2022, 11:15

I think George Ford meant to say:

Code:

g friends = .
levelsof steamid, local(levels)
foreach s of local levels {
    qui gunique steamid_b if steamid==`s'
    local m1 = r(J)
    qui gunique steamid if steamid_b==`s'
    local m2 = r(J)
    qui replace friends = `m1'+`m2' if steamid==`s'
    di %10.0f `s' _col(20) r(J)
}

FWIW, I would also precede this code with -isid steamid steamid_b- to verify that there are no duplicate observations in the data set that would lead to double-counting, and also -assert steamid != steamid_b-, so that we don't count anybody as his/her own friend.

Finally, -gunique- is a user-written command, part of the highly useful -gtools- suite, which is available from SSC.

Added: I would caution O.P. to be patient. Even though -gunique- is fast, all of those -if- conditions are going to add up to a lot of time in a data set with 32,000,000 observations. This is going to be slow. It is also one of the few real life situations I've seen where there is no clear way to improve on this with -runby-.

Added: Also, on reflection I think this code will fail if there are any people who appear only in steamid_b but never in steamid, because their friends never get recorded in friends. Is O.P. sure that this never happens?
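
For readers who want to run the checks described in this post, the following is a minimal sketch and not part of the original reply. It assumes the string variables steamid and steamid_b from the dataex example are in memory and that the code is run from a do-file; the tempfile name bside is made up for illustration. It reports repeated pairs instead of stopping on them, and tabulates which ids occur on only one side.

Code:

* Sketch only; assumes steamid and steamid_b are in memory
assert steamid != steamid_b            // stops with an error if anyone is listed as their own friend
duplicates report steamid steamid_b    // shows how many pairs occur more than once

* Collect the distinct ids on the steamid_b side (bside is a hypothetical tempfile name)
preserve
keep steamid_b
duplicates drop
rename steamid_b steamid
tempfile bside
save `"`bside'"'
restore

* Compare them with the distinct ids on the steamid side
preserve
keep steamid
duplicates drop
merge 1:1 steamid using `"`bside'"'
tabulate _merge                        // _merge==2 marks ids seen only in steamid_b
restore

With 32 million observations each preserve/restore pair copies the data, so this is probably best tried on a subset first.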

Last edited by Clyde Schechter; 03 Oct 2022, 11:24.

William Lisowski

Join Date: Dec 2014
Posts: 10150
#6

03 Oct 2022, 11:16

I think this does what you want.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str17(steamid steamid_b)
"76561197960265729" "76561197967144365"
"76561197960265730" "76561197960265733"
"76561197960265730" "76561197960265733"
"76561197960265730" "76561197960265733"
"76561197960265733" "76561197964770089"
"76561197960265733" "76561197964770089"
"76561197960265733" "76561197964770089"
"76561197960265733" "76561197964770089"
"76561197960265733" "76561197964770089"
"76561197960265738" "76561198010062752"
end
tempfile original
save `"`original'"'
rename (steamid steamid_b) (steamid_b steamid)
order steamid steamid_b
append using `"`original'"'
sort steamid steamid_b
duplicates drop steamid steamid_b, force
by steamid: generate friends = _N
list, sepby(steamid)

Code:

. list, sepby(steamid)

     +-------------------------------------------------+
     |           steamid           steamid_b   friends |
     |-------------------------------------------------|
  1. | 76561197960265729   76561197967144365         1 |
     |-------------------------------------------------|
  2. | 76561197960265730   76561197960265733         1 |
     |-------------------------------------------------|
  3. | 76561197960265733   76561197960265730         2 |
  4. | 76561197960265733   76561197964770089         2 |
     |-------------------------------------------------|
  5. | 76561197960265738   76561198010062752         1 |
     |-------------------------------------------------|
  6. | 76561197964770089   76561197960265733         1 |
     |-------------------------------------------------|
  7. | 76561197967144365   76561197960265729         1 |
     |-------------------------------------------------|
  8. | 76561198010062752   76561197960265738         1 |
     +-------------------------------------------------+


Daniel Schaefer

Join Date: Mar 2020

Posts: 814
#7

03 Oct 2022, 11:35

I noticed that the comment on Stack Overflow mentioned an adjacency list. I just want to point out that you appear to already have an adjacency list: it should be given by the columns steamid and steamid_b. If so, every (steamid, steamid_b) pair should represent a unique relationship. If that were the case, then the number of appearances of a unique ID in steamid should be equal to the number of friends steamid has.

However, several relationships are repeated in the example data. In fact, if you take the set of unique relationships, every steamid except one has exactly one neighbor in the example data you provide. Can you explain why these relationships are not unique?

If the relationships were unique, this would be sufficient:

Code:

sort steamid
unique steamid_b, by(steamid) gen(numfriends)
sort steamid numfriends
replace numfriends = numfriends[_n-1] if numfriends == .

Last edited by Daniel Schaefer; 03 Oct 2022, 11:45. Reason: Clyde already addressed my comment with respect to #4
Daniel Schaefer

Join Date: Mar 2020

Posts: 814
#8

03 Oct 2022, 12:06

Looking at #6, I guess I've misunderstood, and mistakenly assumed that every single ego is represented in steamid. I suppose in such a large dataset, you wouldn't want to list A -> B and B -> A as separate entries. I still don't understand why some entries appear to be repeated, but I suppose if OP is happy with #6, then I don't require an answer.
Xu Ji

Join Date: Sep 2022

Posts: 15
#9

03 Oct 2022, 12:48

I forgot to mention that steamid values repeat because each row stands for one game that the user owns; e.g., if "76561197960265733" appears 5 times, that user has bought 5 games on Steam. So I had better not drop those "duplicates", because I need them.
Also, an id will never appear only in steamid_b or only in steamid (unless he/she has no friends at all).
I will try the code now, but any of it takes a very long time to run; I will report back on whether it works. Thank you very much for your support! I really appreciate it!

Last edited by Xu Ji; 03 Oct 2022, 12:54.

Xu Ji

Join Date: Sep 2022
Posts: 15

#10

03 Oct 2022, 13:12

Originally posted by George Ford

Probably a better way, especially with so many observations, but this might work.

Code:

g friends = .
levelsof steamid, local(levels)
foreach s of local levels {
qui gunique steamid_b if steamid==`s'
local m1 = r(J)
qui gunique steamid if steamid==`s'
local m2 = r(J)
qui replace friends = `m1'+`m2' if steamid==`s'
di %10.0f `s' _col(20) r(J)
}

So I tried this, but it shows

Code:

. levelsof steamid, local(levels)
macro substitution results in line that is too long
r(920);


Xu Ji

Join Date: Sep 2022
Posts: 15

#11

03 Oct 2022, 13:28

GUYS, I have a big problem here.

I was trying the code from George:

Code:

. use "H:\Bachlorarbeit\steam_25.09.dta"

. g friends = .
(32,227,518 missing values generated)

. 
. levelsof steamid, local(levels)
macro substitution results in line that is too long
r(920);

. 
.   foreach s of local levels {
  2. 
.      qui gunique steamid_b if steamid==`s'
  3. 
.     local m1 = r(J)
  4. 
.     qui gunique steamid if steamid_b==`s'
  5. 
.     local m2 = r(J)
  6. 
.     qui replace friends = `m1'+`m2' if steamid==`s'
  7. 
.     di %10.0f `s' _col(20) r(J)
  8.   9 10 11descr friends
 12. exit, clear
 13. exit, clear
 14. exit, clear
 15. clear
 16. use "H:\Bachlorarbeit\steam_25.09.dta", clear
 17.  18 19 20 21 22 23 24 25 26 27 28save "H:\Bachlorarbeit\steam_25.09.dta", replace
 29. exit
 30.  31clear
 32. save "H:\Bachlorarbeit\steam_25.09.dta", replace
 33. pause

No matter what I do, Stata doesn't react; it does not execute any command at all. I also cannot really close Stata, because I am running it on the university's server; I closed the application, and when I open it again it is in the same state. What can I do now?

Last edited by Xu Ji; 03 Oct 2022, 13:31.


Julia Simon

Join Date: Apr 2022

Posts: 37
#12

03 Oct 2022, 14:18

Correct me if I'm wrong, but I think Stata "doesn't react" because it still believes you are building a loop, so it will execute all the commands you just typed once you close the loop with the bracket "}".
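
A small illustration of this reading, as a sketch that is not part of the original reply. It assumes that the local levels is empty, as it presumably is after levelsof failed with r(920); the thread does not confirm this. Once the closing brace is entered on its own line, Stata leaves block-entry mode, and a foreach over an empty list runs its body zero times.

Code:

local levels
foreach s of local levels {
    display "`s'"
}

If levels were not empty, every command typed inside the open block would run once per element, so it is worth reading what has been entered before closing the brace.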

Last edited by Julia Simon; 03 Oct 2022, 14:21.

William Lisowski

Join Date: Dec 2014
Posts: 10150

#13

03 Oct 2022, 14:28

Perhaps the following will do what you want, now that you have made it clear that you want the counts provided by the code in post #6 to be merged back into your original data.

And a hint: don't test your code on 32 million observations. Create a subset, and when you get your code working on it, then try the entire dataset.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str17(steamid steamid_b)
"76561197960265729" "76561197967144365"
"76561197960265730" "76561197960265733"
"76561197960265730" "76561197960265733"
"76561197960265730" "76561197960265733"
"76561197960265733" "76561197964770089"
"76561197960265733" "76561197964770089"
"76561197960265733" "76561197964770089"
"76561197960265733" "76561197964770089"
"76561197960265733" "76561197964770089"
"76561197960265738" "76561198010062752"
end
tempfile original
save `"`original'"'

keep steamid steamid_b
tempfile idlist
save `"`idlist'"'
rename (steamid steamid_b) (steamid_b steamid)
order steamid steamid_b
append using `"`idlist'"'
sort steamid steamid_b
duplicates drop steamid steamid_b, force
by steamid: generate friends = _N
drop steamid_b
duplicates drop
list, clean
save `"`idlist'"', replace

use `"`original'"', clear
merge m:1 steamid using `"`idlist'"', keep(match)
drop _merge
sort steamid steamid_b
list, clean

Code:

. list, clean

                 steamid           steamid_b   friends  
  1.   76561197960265729   76561197967144365         1  
  2.   76561197960265730   76561197960265733         1  
  3.   76561197960265730   76561197960265733         1  
  4.   76561197960265730   76561197960265733         1  
  5.   76561197960265733   76561197964770089         2  
  6.   76561197960265733   76561197964770089         2  
  7.   76561197960265733   76561197964770089         2  
  8.   76561197960265733   76561197964770089         2  
  9.   76561197960265733   76561197964770089         2  
 10.   76561197960265738   76561198010062752         1
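
One way to follow the hint about testing on a subset first is sketched below. The file name steam_full.dta and the cutoff of 100,000 rows are placeholders, not values from the thread. A row-based subset can cut friendships in half, so counts computed on it will not match the full data; it is only meant for checking that the code runs.

Code:

* Sketch only: load the first 100,000 observations instead of the whole file
* (steam_full.dta is a hypothetical file name)
use in 1/100000 using "steam_full.dta", clear
describe, short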


George Ford

Join Date: Aug 2014

Posts: 3152
#14

03 Oct 2022, 14:57

#13 looks proper.

and when you run on the full data set, go have dinner.
Xu Ji

Join Date: Sep 2022

Posts: 15
#15

04 Oct 2022, 03:29

Thank you William, it has worked perfectly! Thank you so much!!