  • Counting number of group members who make a choice, at the person-choice level

    I have data on individuals in groups choosing items, structured at the individual-by-possible-item level, and I want to count, for each individual and each possible item, the number of other group members who chose that item. For example, consider data on students in class groups who are choosing books:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float(id class) str1(possible_book book) float num_classmates
     1 10 "A" "A" 2
     1 10 "B" "A" 2
     1 10 "C" "A" 2
     2 10 "A" "A" 2
     2 10 "B" "A" 2
     2 10 "C" "A" 2
     3 10 "A" "A" 2
     3 10 "B" "A" 2
     3 10 "C" "A" 2
     4 10 "A" "B" 0
     4 10 "B" "B" 0
     4 10 "C" "B" 0
     5 20 "A" "B" 1
     5 20 "B" "B" 1
     5 20 "C" "B" 1
     6 20 "A" "C" 3
     6 20 "B" "C" 3
     6 20 "C" "C" 3
     7 20 "A" "C" 3
     7 20 "B" "C" 3
     7 20 "C" "C" 3
     8 20 "A" "C" 3
     8 20 "B" "C" 3
     8 20 "C" "C" 3
     9 20 "A" "C" 3
     9 20 "B" "C" 3
     9 20 "C" "C" 3
    10 20 "A" "B" 1
    10 20 "B" "B" 1
    10 20 "C" "B" 1
    end
    Ideally, num_classmates would contain, for each id and possible_book, the number of id's classmates who chose that book.

    I'm not sure how to do this without something very inefficient like looping through each observation. My attempt, which produced the num_classmates values shown above, was:

    Code:
    bysort class possible_book book: egen num = count(id)
    replace num = num - 1 // Don't count self as a classmate
    which seems to be correct for the choice actually made (i.e., when possible_book == book), but incorrect for the other possible choices. In the example above, person 1 chose book A and has two classmates (persons 2 and 3) who also chose A, but person 1 has one classmate who chose B (person 4) and no classmates who chose C, so those rows should show 1 and 0 rather than 2. Ideally, this could also be extended to condition on other attributes of the person, such as the number of classmates who are male.
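
    For concreteness, the brute-force observation-by-observation loop I am hoping to avoid would look something like the sketch below (num_loop is just an illustrative name for the result):
    Code:
    gen long num_loop = .
    quietly forvalues i = 1/`=_N' {
        * each classmate appears exactly once on the row where possible_book == book,
        * so counting those rows counts each classmate exactly once
        count if class == class[`i'] & id != id[`i'] ///
            & possible_book == book & book == possible_book[`i']
        replace num_loop = r(N) in `i'
    }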

    I found it difficult to know what to search for this question, so any guidance would be appreciated!
    Last edited by Thomas Connelly; 03 Sep 2024, 19:17.

  • #2
    Thank you for using -dataex- on your very first post!

    The following will do what you ask:
    Code:
    tempname working
    * copy the id/class/book variables into a separate working frame
    frame put id class book, into(`working')
    frame `working' {
        * reduce to one row per person
        duplicates drop
        * count others in the same class (excluding self) who chose the same book
        rangestat (count) wanted = id, by(book) excludeself interval(class 0 0)
    }
    * link the counts back to the original data
    frlink m:1 id class book, frame(`working')
    frget wanted, from(`working')
    -rangestat- is written by Robert Picard, Nick Cox, and Roberto Ferrer; it is available from SSC.
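
    If -rangestat- is not already installed, it can be obtained with:
    Code:
    ssc install rangestat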

    This is adaptable to some kinds of restrictions. If you wanted to restrict to the number of classmates of the same sex, and assuming you had a variable sex coded 0/1, you could modify it to:

    Code:
    tempname working
    frame put id class book sex, into(`working')
    frame `working' {
        duplicates drop
        rangestat (count) wanted = id, by(book sex) excludeself interval(class 0 0)
        list, noobs clean
    }
    frlink m:1 id class book, frame(`working')
    frget wanted, from(`working')
    This clearly generalizes to a restriction that the pair be the same on any attribute, whether categorical or numeric. But counting only the male classmates who purchased the book is a bit trickier. Assuming you have a variable male coded 0 (female) / 1 (male):
    Code:
    tempname working
    frame put id class book male, into(`working')
    frame `working' {
        duplicates drop
        gen id2 = id if male
        rangestat (count) wanted = id2, by(book) excludeself interval(class 0 0)
        list, noobs clean
    }
    frlink m:1 id class book, frame(`working')
    frget wanted, from(`working')
    This could be easily generalized to any categorical attribute, and you could impose multiple simultaneous or alternative such constraints by modifying the -if- condition in the -gen id2 = ...- command.
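
    For example (a purely illustrative sketch; honors is a hypothetical 0/1 variable that is not in your data), to count only classmates who are both male and honors students, the -gen id2 = ...- line in the block above could become:
    Code:
    * count only classmates who are both male and honors students
    * (honors is a hypothetical 0/1 indicator, used purely for illustration)
    gen id2 = id if male & honors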

    Where this won't work is if you want to restrict the count to, say, people within 0.5 on their GPA. Something like that requires a slightly different overall approach--post back if you need it.

    Added: I take back the last paragraph. If you wanted to restrict the count to people within 0.5 on the GPA you could do it this way:
    Code:
    tempname working
    frame put id class book gpa, into(`working')
    frame `working' {
        duplicates drop
        rangestat (count) wanted = id, by(book class) excludeself interval(gpa -0.5 0.5)
        list, noobs clean
    }
    frlink m:1 id class book, frame(`working')
    frget wanted, from(`working')
    However, you cannot extend this to multiple interval constraints on continuous variables. That does require a different approach; if you need it, please post back with suitable example data.
    Last edited by Clyde Schechter; 03 Sep 2024, 20:06.

    • #3
      Clyde, thanks for the quick response!

      I did not know about rangestat, so thank you for bringing it to my attention.

      When I run your code, I get the same values in "wanted" as "num_classmates" in my example above. But I would like, for instance, the value of num_classmates for the second row to be the number of classmates of person 1 who purchased book B, which is 1 (just person 4), and the value for the third row to be the number of classmates of person 1 who purchased book C, which is 0. Instead, for all of the observations containing person 1, I get the number of classmates of person 1 who purchased book A (2 people, persons 2 and 3).

      rangestat looks like an elegant way to correctly produce the number of classmates of each person who purchased the same book as that person did, but when we merge with the original data I don't see how it could give me the number of classmates who purchased any different book.

      • #4
        Sorry, I completely misunderstood the original request. I thought you had calculated num_classmates by hand and were looking for code that would match it.

        What you actually want is, in fact, simpler, and can be easily done entirely with native Stata commands:

        Code:
        preserve
        * build a file with one row per person: who is in which class and what they chose
        keep id class book
        duplicates drop
        rename (id book) =_U
        tempfile holding
        save `holding'

        restore
        * pair every person-possible_book row with every person in the same class
        joinby class using `holding'
        * flag pairs where the other person (not oneself) chose this possible book
        gen byte wanted = (possible_book == book_U & id != id_U)
        * sum the flags within person and possible book
        collapse (first) book (sum) wanted, by(id class possible_book)
        You can add additional constraints by just tacking on additional logical terms to the -gen byte wanted = ...- command.
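
        For instance, here is a sketch (not tested) of the same approach restricted to male classmates, assuming the 0/1 variable male from #2; note that male also has to be carried into the holding file:
        Code:
        preserve
        keep id class book male
        duplicates drop
        rename (id book male) =_U
        tempfile holding
        save `holding'

        restore
        joinby class using `holding'
        * count only pairings where the classmate is male and chose this possible book
        gen byte wanted = (possible_book == book_U & id != id_U & male_U == 1)
        collapse (first) book (sum) wanted, by(id class possible_book)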

        • #5
          Clyde, thanks so much! This does seem to do exactly what I want.

          If I'm understanding correctly, the joinby step expands the data to include every pairing of individuals within a class (N^2 for a class of N individuals), multiplied by the number of possible choices, and repeated for each class. While it ultimately gets collapsed, this intermediate step seems like it could get prohibitively large when the data contain many classes, many individuals per class, and/or many possible books.

          For example, 1000 classes with 100 individuals in each class and 1000 possible books would require 100^2 * 1000 * 1000 = 10 billion observations, even though the final collapsed data would contain only 100 individuals by 1000 books per class (100,000 observations per class, or 100 million in total), which is much more manageable.

          Might there be an alternative that does not require such an expansion of the data first, or does this seem to be the only way to do it?

          • #6
            You are quite right about the possibility of a combinatorial explosion. I had even considered that possibility, but I imagined it would not arise in this situation. What school has 1000 classes with 100 individuals in each, and 1000 possible books? It didn't seem plausible, though I suppose that only proves that the bounds of my imagination are too limited.

            Here we can exploit the fact that the information required at any point is limited to a single class. So we can break the data into single-class subsets, process each separately, and then put the final results together. That reduces the size of the largest intermediate data set by a factor of 1,000, and 10,000,000 observations is quite workable. It isn't necessary to physically split the data into single-class subsets: the -runby- program, written by Robert Picard and me and available from SSC, automates this process for you.

            Code:
            preserve
            keep id class book
            duplicates drop
            rename (id book) =_U
            save holding, replace
            
            restore
            
            capture program drop one_class
            program define one_class
                joinby class using holding
                gen byte wanted = (possible_book == book_U & id != id_U)
                collapse (first) book (sum) wanted, by(id possible_book)
                exit
            end
            
            runby one_class, by(class) verbose
            Note that unlike the earlier code, -holding- is a real file, not a -tempfile-. This is important: -runby- has no way to pass the name of a -tempfile- down into -program one_class-, so a -tempfile- cannot be used here.
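
            If you do not want to keep holding.dta around afterward, you can remove it once -runby- has finished:
            Code:
            * optional cleanup of the holding file after -runby- completes
            erase holding.dta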

            In the (unlikely?) case that even this does not reduce the memory burden enough for your use case, you can make it even smaller by putting the code from -preserve- through -restore- inside -program one_class- at the beginning. That way you will never have an intermediate data set larger than the square of the number of observations in the largest class. But this will run more slowly because there is twice as much disk thrashing involved, as the holding file gets created anew for each class.
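
            For concreteness, a minimal sketch of that slower variant (same logic as above, but with the holding file rebuilt inside the program for each class):
            Code:
            capture program drop one_class
            program define one_class
                * build this class's holding file from the data -runby- passes in
                preserve
                keep id class book
                duplicates drop
                rename (id book) =_U
                save holding, replace
                restore

                joinby class using holding
                gen byte wanted = (possible_book == book_U & id != id_U)
                collapse (first) book (sum) wanted, by(id possible_book)
                exit
            end

            runby one_class, by(class) verbose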
            Last edited by Clyde Schechter; 05 Sep 2024, 09:47. Reason: To mention that -runby- is available from SSC.
