Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Counting distinct values across two variables

    I have a dataset of tournament chess games. I want to calculate the total number of players in each tournament ('event'), as well as the number of female players. eid is the event id, whitepid is the ID of the player playing with white, and blackpid is the ID of the player playing with black. The complication is that in a tournament, a player may play zero, one or more than one game with each colour. I am not interested in the number of distinct players who played with a given colour in the tournament, but rather the total number. But adding the number of distinct players by colour will double-count. Can anyone help? I have attached some sample data from the first event (eid == 1).

    (PS, First post, sorry for etiquette violations.)

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input long(eid whitepid blackpid) float(whitefemale blackfemale)
    1 2253  505 0 0
    1 2352 2224 0 0
    1 2325 1380 1 0
    1 2199 2247 0 0
    1 2361 2319 0 0
    1  460  536 0 0
    1 2304 2198 0 0
    1 2342 2240 0 0
    1 2328 2198 0 0
    1 2276 2199 0 0
    1 2285 2324 1 0
    1 2240 2216 0 0
    1 2374 2253 0 0
    1 2323 2328 0 0
    1 1271 2276 0 0
    1 2346 2374 0 0
    1 2331 2209 0 0
    1  888 2216 0 0
    1 2321  536 0 0
    1 2355 2346 0 0
    1 2198 2326 0 0
    1 2215 2346 0 0
    1 2358 2240 0 0
    1 2351 2355 0 0
    1 2247 2327 0 0
    1 2361 2306 0 0
    1 2320 2374 0 0
    1 2093 1380 0 0
    1 2319 2304 0 0
    1 2216 2093 0 0
    1 2336 2204 0 0
    1 2215 2093 0 0
    1 2351  460 0 0
    1 2336 2331 0 0
    1 1271 2358 0 0
    1 2355 1271 0 0
    1 2321 2204 0 0
    1 2338 2299 0 0
    1 2240 2224 0 0
    1 2360  739 1 0
    1 2276 2322 0 0
    1 2197 1271 0 0
    1 2198  460 0 0
    1 2328 2327 0 0
    1 2352 2324 0 0
    1 2198 2240 0 0
    1 2306 2360 0 1
    1 2245 2304 0 0
    1 2320 2304 0 0
    1 2323 2351 0 0
    1 2247 2342 0 0
    1 2245 2276 0 0
    1  508 2330 0 0
    1 2304  536 0 0
    1 1380  505 0 0
    1 2224 2199 0 0
    1 2240  449 0 0
    1 2245  444 0 0
    1 2253 2360 0 1
    1 2299 2374 0 0
    1 2323  888 0 0
    1 2321 2320 0 0
    1 2326 2197 0 0
    1  739  536 0 0
    1 2247 2332 0 1
    1 2332 2093 1 0
    1 2354 2347 0 0
    1 2199 2304 0 0
    1 2199 2285 0 1
    1 2355 2245 0 0
    1 2327 2323 0 0
    1 2299 2220 0 0
    1 2224 2321 0 0
    1 2354  536 0 0
    1 2325 2215 1 0
    1 2199 2355 0 0
    1 2199 2322 0 0
    1  739 2328 0 0
    1 2220 2352 0 0
    1  505  888 0 0
    1 1380 2320 0 0
    1 2220 2198 0 0
    1 2220 2346 0 0
    1 2331 2323 0 0
    1 2352 1380 0 0
    1 2209 2328 0 0
    1 2374 2276 0 0
    1 2215 2209 0 0
    1  460 2247 0 0
    1 2216 2285 0 1
    1 2328 2336 0 0
    1 2093 2326 0 0
    1 2209 1380 0 0
    1  460 2320 0 0
    1  460 2361 0 0
    1 2347 2306 0 0
    1 2204 2347 0 0
    1 2247 2354 0 0
    1 2197 2215 0 0
    1 2326 2347 0 0
    end

  • #2
    Code:
    levelsof whitepid, local(pid1)
    levelsof blackpid, local(pid2)
    
    local pid : list pid1 | pid2 
    local count : list sizeof pid 
    
    levelsof whitepid if whitefemale, local(pid1)
    levelsof blackpid if blackfemale, local(pid2)
    
    local female : list pid1 | pid2
    local cfemale : list sizeof female 
    
    di `count' " players, " `cfemale' " female, " `count' - `cfemale' " male"

    Comment


    • #3
      (I forgot to mention: the total dataset is large, 8mil+ observations)

      Comment


      • #4
        Code:
        mata
        st_view(data = ., ., ("blackpid", "whitepid"))
        rows(uniqrows(uniqrows(data[,1]) \ uniqrows(data[,2])))
        end

        Comment


        • #5
          Thank you. Referring to #2: I want the count of female and male players for each event in the dataset. There are 94,000 events. Is there any faster way than looping the code in #2 over all events?

          Regarding #4: Is this in addition to #2, or an alternate method?

          Comment


          • #6
            I doubt you want to save 94000 pairs of local macros. So for the full problem you may well be better off with at least one and possibly two different versions of the dataset. Continuing your datat example I get


            Code:
            stack eid blackpid blackfemale eid whitepid whitefemale, into(eid pid female) clear 
            rename _stack colour
            label def colour 1 black 2 white
            label val colour colour
            save version2 
            
            contract eid female

            Comment

            Working...
            X