  • I want to calculate how many friends a user has, but I don't know how to

    (Cross-posted on https://stackoverflow.com/questions/...nt-know-how-to)

    Apologies for the title; I don't know how else to phrase it.

    I have a dataset from Steam that includes steamid (an individual user on Steam) and steamid_b (another user who is a friend of that user). I want to calculate how many friends each steamid has.

    Here is a sample from my dataset:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str17(steamid steamid_b)
    "76561197960265729" "76561197967144365"
    "76561197960265730" "76561197960265733"
    "76561197960265730" "76561197960265733"
    "76561197960265730" "76561197960265733"
    "76561197960265733" "76561197964770089"
    "76561197960265733" "76561197964770089"
    "76561197960265733" "76561197964770089"
    "76561197960265733" "76561197964770089"
    "76561197960265733" "76561197964770089"
    "76561197960265738" "76561198010062752"
    end

    At first glance it looks as if every user has only one friend, but a steamid sometimes also appears as a steamid_b, which means that user actually has 2 friends. For example, 76561197960265733 has the friend 76561197964770089, but he/she also appears as a friend of 76561197960265730, so 76561197960265733 actually has 2 friends. Which command can I use to calculate such relationships? I don't know if it is relevant, but the dataset has over 32 million observations.
    Last edited by Xu Ji; 03 Oct 2022, 09:45.

  • #2
    You need to use -dataex- to provide sample data.


    • #3
      Thank you for the reminder, George. I have edited the post.


      • #4
        There is probably a better way, especially with so many observations, but this might work.
        Code:
        g friends = .
        levelsof steamid, local(levels)
        foreach s of local levels {
            qui gunique steamid_b if steamid==`s'
            local m1 = r(J)
            qui gunique steamid if steamid==`s'
            local m2 = r(J)
            qui replace friends = `m1'+`m2' if steamid==`s'
            di %10.0f `s' _col(20) r(J)
        }
        Last edited by George Ford; 03 Oct 2022, 10:16.


        • #5
          I think George Ford meant to say:
          Code:
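          g friends = .
          levelsof steamid, local(levels)
          foreach s of local levels {
              * steamid is a string (str17) in the -dataex- sample, so compare in quotes
              * distinct friends listed for this user
              qui gunique steamid_b if steamid=="`s'"
              local m1 = r(J)
              * distinct users who list this user as a friend
              qui gunique steamid if steamid_b=="`s'"
              local m2 = r(J)
              qui replace friends = `m1'+`m2' if steamid=="`s'"
              di "`s'" _col(20) r(J)
          }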
          FWIW, I would also precede this code with -isid steamid steamid_b- to verify that there are no duplicate observations in the data set that would lead to double-counting, and also -assert steamid != steamid_b-, so that we don't count anybody as his/her own friend.

          Finally, -gunique- is a user-written command, part of the highly useful -gtools- suite, which is available from SSC.

          Added: I would caution O.P. to be patient. Even though -gunique- is fast, all of those -if- conditions are going to add up to a lot of time in a data set with 32,000,000 observations. This is going to be slow. It is also one of the few real life situations I've seen where there is no clear way to improve on this with -runby-.

          Added: Also, on reflection I think this code will fail if there are any people who only appear in steamid_b but never appear in steamid, because their friends never get recorded in friends. Is O.P. sure that this never happens?
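
          A minimal sketch of the checks mentioned above, run on a few rows taken from the sample in #1. Because pairs repeat in that sample, -isid- is expected to fail here, so it is wrapped in -capture-. The tempfile name egos and the locals rc and only_b are illustrative names only, not part of anyone's posted code.

          ```stata
          clear
          input str17(steamid steamid_b)
          "76561197960265729" "76561197967144365"
          "76561197960265730" "76561197960265733"
          "76561197960265730" "76561197960265733"
          "76561197960265733" "76561197964770089"
          end

          * -isid- exits with an error when (steamid, steamid_b) pairs repeat;
          * -capture- suppresses the error and leaves the return code in _rc
          capture isid steamid steamid_b
          local rc = _rc
          display "isid return code: `rc'"

          * Nobody should be recorded as his/her own friend
          assert steamid != steamid_b

          * Count IDs that occur in steamid_b but never in steamid
          preserve
          keep steamid
          duplicates drop
          tempfile egos
          save `"`egos'"'
          restore, preserve
          keep steamid_b
          duplicates drop
          rename steamid_b steamid
          merge 1:1 steamid using `"`egos'"'
          count if _merge == 1
          local only_b = r(N)
          restore
          display "IDs seen only in steamid_b: `only_b'"
          ```

          A nonzero return code from -isid- means some pairs are repeated, and a nonzero count at the end means some people would never receive a value in friends from the loop.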
          Last edited by Clyde Schechter; 03 Oct 2022, 11:24.


          • #6
            I think this does what you want.
            Code:
            * Example generated by -dataex-. For more info, type help dataex
            clear
            input str17(steamid steamid_b)
            "76561197960265729" "76561197967144365"
            "76561197960265730" "76561197960265733"
            "76561197960265730" "76561197960265733"
            "76561197960265730" "76561197960265733"
            "76561197960265733" "76561197964770089"
            "76561197960265733" "76561197964770089"
            "76561197960265733" "76561197964770089"
            "76561197960265733" "76561197964770089"
            "76561197960265733" "76561197964770089"
            "76561197960265738" "76561198010062752"
            end
            tempfile original
            save `"`original'"'
            rename (steamid steamid_b) (steamid_b steamid)
            order steamid steamid_b
            append using `"`original'"'
            sort steamid steamid_b
            duplicates drop steamid steamid_b, force
            by steamid: generate friends = _N
            list, sepby(steamid)
            Code:
            . list, sepby(steamid)
            
                 +-------------------------------------------------+
                 |           steamid           steamid_b   friends |
                 |-------------------------------------------------|
              1. | 76561197960265729   76561197967144365         1 |
                 |-------------------------------------------------|
              2. | 76561197960265730   76561197960265733         1 |
                 |-------------------------------------------------|
              3. | 76561197960265733   76561197960265730         2 |
              4. | 76561197960265733   76561197964770089         2 |
                 |-------------------------------------------------|
              5. | 76561197960265738   76561198010062752         1 |
                 |-------------------------------------------------|
              6. | 76561197964770089   76561197960265733         1 |
                 |-------------------------------------------------|
              7. | 76561197967144365   76561197960265729         1 |
                 |-------------------------------------------------|
              8. | 76561198010062752   76561197960265738         1 |
                 +-------------------------------------------------+


            • #7
              I noticed that the comment on Stack Overflow mentioned an adjacency list. I just want to point out that you appear to already have an adjacency list, given by the columns steamid and steamid_b. If so, every (steamid, steamid_b) pair should represent a unique relationship. If that were the case, the number of appearances of a given ID in steamid would equal the number of friends that steamid has.

              However, several relationships are repeated in the example data. In fact, if you take the set of unique relationships, every steamid except one has exactly one neighbor in the example data you provide. Can you explain why these relationships are not unique?

              If the relationships were unique, this would be sufficient:

              Code:
              sort steamid
              unique steamid_b, by(steamid) gen(numfriends)
              sort steamid numfriends
              replace numfriends = numfriends[_n-1] if numfriends == .
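
              As far as I know, -unique- is a user-written command. A rough equivalent with built-in commands only might look like the sketch below; the variable name firstpair is an assumption made for illustration. Like the code above, it counts distinct steamid_b values per steamid in one direction only, and it tolerates repeated pairs.

              ```stata
              clear
              input str17(steamid steamid_b)
              "76561197960265729" "76561197967144365"
              "76561197960265730" "76561197960265733"
              "76561197960265730" "76561197960265733"
              "76561197960265733" "76561197964770089"
              "76561197960265733" "76561197964770089"
              "76561197960265738" "76561198010062752"
              end

              * Flag the first row of each (steamid, steamid_b) pair
              bysort steamid steamid_b: generate byte firstpair = (_n == 1)

              * Summing the flags within steamid gives the number of distinct steamid_b values
              bysort steamid: egen numfriends = total(firstpair)
              drop firstpair

              list, sepby(steamid)
              ```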
              Last edited by Daniel Schaefer; 03 Oct 2022, 11:45. Reason: Clyde already addressed my comment with respect to #4


              • #8
                Looking at #6, I guess I've misunderstood, and mistakenly assumed that every single ego is represented in steamid. I suppose in such a large dataset, you wouldn't want to list A -> B and B -> A as separate entries. I still don't understand why some entries appear to be repeated, but I suppose if OP is happy with #6, then I don't require an answer.


                • #9
                  I forgot to mention that the steamid values repeat because each row stands for one game that the user owns. So, e.g., if "76561197960265733" appears 5 times, that user has bought 5 games on Steam. I had better not drop those "duplicates", because I need them.
                  Also, an ID will never appear only in steamid_b or only in steamid (unless he/she has no friends at all).
                  I will try the code now, but any of it takes a very long time to run. I will report back on whether it works. Thank you very much for your support, I really appreciate it!
                  Last edited by Xu Ji; 03 Oct 2022, 12:54.


                  • #10
                    Originally posted by George Ford
                    There is probably a better way, especially with so many observations, but this might work.
                    Code:
                    g friends = .
                    levelsof steamid, local(levels)
                    foreach s of local levels {
                        qui gunique steamid_b if steamid==`s'
                        local m1 = r(J)
                        qui gunique steamid if steamid==`s'
                        local m2 = r(J)
                        qui replace friends = `m1'+`m2' if steamid==`s'
                        di %10.0f `s' _col(20) r(J)
                    }
                    So I tried this, but it returns the following error:
                    Code:
                    . levelsof steamid, local(levels)
                    macro substitution results in line that is too long
                    r(920);


                    • #11
                      GUYS, I have a big problem here.

                      I was trying the code from George:

                      Code:
                      . use "H:\Bachlorarbeit\steam_25.09.dta"
                      
                      . g friends = .
                      (32,227,518 missing values generated)
                      
                      . 
                      . levelsof steamid, local(levels)
                      macro substitution results in line that is too long
                      r(920);
                      
                      . 
                      .   foreach s of local levels {
                        2. 
                      .      qui gunique steamid_b if steamid==`s'
                        3. 
                      .     local m1 = r(J)
                        4. 
                      .     qui gunique steamid if steamid_b==`s'
                        5. 
                      .     local m2 = r(J)
                        6. 
                      .     qui replace friends = `m1'+`m2' if steamid==`s'
                        7. 
                      .     di %10.0f `s' _col(20) r(J)
                        8.   9 10 11descr friends
                       12. exit, clear
                       13. exit, clear
                       14. exit, clear
                       15. clear
                       16. use "H:\Bachlorarbeit\steam_25.09.dta", clear
                       17.  18 19 20 21 22 23 24 25 26 27 28save "H:\Bachlorarbeit\steam_25.09.dta", replace
                       29. exit
                       30.  31clear
                       32. save "H:\Bachlorarbeit\steam_25.09.dta", replace
                       33. pause
                      No matter what I do, Stata doesn't react; it does not execute any command at all. I also cannot really close Stata, because I am running it on the university's server. I closed the "software", but when I open it again, it is in the same state. What can I do now?
                      Last edited by Xu Ji; 03 Oct 2022, 13:31.


                      • #12
                        Correct me if I'm wrong, but I think Stata "doesn't react" because it still believes you are building a loop. It will only execute the commands you typed once you close the loop with the closing brace "}".
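
                        A small sketch of that behaviour, under the assumption that the local levels is empty (as it may well be after a failed -levelsof-, though that is not confirmed for the session in #11). The local ran is a made-up name used only to show whether the body was executed.

                        ```stata
                        * Clear the macro so that the loop has nothing to iterate over
                        local levels
                        local ran = 0

                        * Nothing inside the braces is executed until the closing brace is read,
                        * and with an empty list the body is skipped entirely
                        foreach s of local levels {
                            local ran = 1
                            display "processing `s'"
                        }

                        display "loop body executed: `ran'"
                        ```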
                        Last edited by Julia Simon; 03 Oct 2022, 14:21.


                        • #13
                          Perhaps the following will do what you want, now that you have made it clear that you want the counts provided by the code in post #6 to be merged back into your original data.

                          And a hint: don't test your code on 32 million observations. Create a subset, and when you get your code working on it, then try the entire dataset.
                          Code:
                          * Example generated by -dataex-. For more info, type help dataex
                          clear
                          input str17(steamid steamid_b)
                          "76561197960265729" "76561197967144365"
                          "76561197960265730" "76561197960265733"
                          "76561197960265730" "76561197960265733"
                          "76561197960265730" "76561197960265733"
                          "76561197960265733" "76561197964770089"
                          "76561197960265733" "76561197964770089"
                          "76561197960265733" "76561197964770089"
                          "76561197960265733" "76561197964770089"
                          "76561197960265733" "76561197964770089"
                          "76561197960265738" "76561198010062752"
                          end
                          tempfile original
                          save `"`original'"'
                          
                          keep steamid steamid_b
                          tempfile idlist
                          save `"`idlist'"'
                          rename (steamid steamid_b) (steamid_b steamid)
                          order steamid steamid_b
                          append using `"`idlist'"'
                          sort steamid steamid_b
                          duplicates drop steamid steamid_b, force
                          by steamid: generate friends = _N
                          drop steamid_b
                          duplicates drop
                          list, clean
                          save `"`idlist'"', replace
                          
                          use `"`original'"', clear
                          merge m:1 steamid using `"`idlist'"', keep(match)
                          drop _merge
                          sort steamid steamid_b
                          list, clean
                          Code:
                          . list, clean
                          
                                           steamid           steamid_b   friends  
                            1.   76561197960265729   76561197967144365         1  
                            2.   76561197960265730   76561197960265733         1  
                            3.   76561197960265730   76561197960265733         1  
                            4.   76561197960265730   76561197960265733         1  
                            5.   76561197960265733   76561197964770089         2  
                            6.   76561197960265733   76561197964770089         2  
                            7.   76561197960265733   76561197964770089         2  
                            8.   76561197960265733   76561197964770089         2  
                            9.   76561197960265733   76561197964770089         2  
                           10.   76561197960265738   76561198010062752         1
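
                          The hint about testing on a subset could be sketched as follows. -use- accepts an -in- range, so only the requested observations are read from disk. The tempfile full merely stands in for the real file, and 5 rows is an arbitrary choice for illustration.

                          ```stata
                          clear
                          input str17(steamid steamid_b)
                          "76561197960265729" "76561197967144365"
                          "76561197960265730" "76561197960265733"
                          "76561197960265730" "76561197960265733"
                          "76561197960265730" "76561197960265733"
                          "76561197960265733" "76561197964770089"
                          "76561197960265733" "76561197964770089"
                          "76561197960265738" "76561198010062752"
                          end

                          * Stand-in for the large file on disk
                          tempfile full
                          save `"`full'"'

                          * Load only observations 1 through 5 instead of the whole file
                          use in 1/5 using `"`full'"', clear
                          count
                          ```

                          A subset taken this way will usually miss some friendships, so the counts it yields are only useful for checking that the code runs, not as final results.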


                          • #14
                            #13 looks proper.

                            And when you run it on the full data set, go have dinner.


                            • #15
                              Thank you William, it worked perfectly! Thank you so much!!
