Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Creating a variable counting prior observations of the same kind

    Hi Statalist folks,

    I'm having trouble creating a variable, and I'd like some help. I’ve figured out one way to do it, but I think it is prohibitively computationally demanding.

    Essentially, I'm looking at individual competitors and trying to figure out, for a given competition, their average win rate in prior competitions of the same kind. The relevant variables are:

    competitor_id (this identifies the competitor)
    competition_id (this is type of competition)
    win (a dummy variable, 1 if win and 0 if loss)
    date (formatted %tcDD/NN/CCYY)

    Example:
    Competitor id Competition id Date Win Desired: Prior competitions of that kind Desired: Prior wins in those competitions Desired: Prior win rate
    1A 123 1/1/2016 1 0 0 .
    1A 456 1/2/2016 0 0 0 .
    1A 123 1/3/2016 1 1 1 1
    1A 123 1/4/2016 0 2 2 0.667
    1A 456 1/5/2016 0 1 0 0
    1A 456 1/6/2016 1 2 0 0
    2A 123 1/1/2016 0 0 0 .
    2A 456 1/2/2016 0 0 0 .
    2A 123 1/3/2016 1 1 0 0
    2A 123 1/4/2016 1 2 1 0.5
    2A 456 1/5/2016 1 1 0 0
    2A 456 1/6/2016 1 2 1 0.5
    So, for the last observation I’m asking the question: What’s competitor 2A’s win rate at competitions 456 that occurred before the date of the observation?

    One way that works is as follows. For each competitor_id, I simply count the number of prior observations that have the same competition_id. Then, I count the number of those observations where the competitor also wins. Then, I divide the two to get the prior win rate:

    gen priorcompetitions=.
    gen priorwins=.

    * Generate local macros for each competitor’s competition dates and competition_ids
    quietly levelsof competitor_id, local(competitor)
    foreach nm of local competitor {
    quietly levelsof datetime if competitor_id=="`nm'", local(dates`nm')
    quietly levelsof competition_id if competitor_id=="`nm'", local(competition`nm')
    }

    * Use a triple loop (competitor, dates, competition) to create the variables
    foreach nm of local competitor {
    foreach ad of local competition`nm'{
    foreach dt of local dates`nm'{
    count if (competitor_id=="`nm'" & competition_id == `ad' & datetime < `dt')
    replace priorcompetitions = r(N) if (competitor_id=="`nm'" & competition_id == `ad' & datetime == `dt')
    count if (competitor_id=="`nm'" & competition_id == `ad' & datetime < `dt' & win==1)
    replace priorwins = r(N) if (competitor_id=="`nm'" & competition_id == `ad' & datetime == `dt')
    }
    }
    }
    gen priorwinrate=priorwins/priorcompetitions

    Is there an easier (less computationally demanding) way of doing this, perhaps using by x y z: egen (total) or a similar command? I have 250k+ observations, and it's been running for over 5 hours... (I tested a smaller sample of 300 and it works fine).

    Thanks,
    Daniel

  • #2
    Your data table is helpful but using dataex (SSC) as requested in the FAQ Advice would have been even more helpful. See http://www.statalist.org/forums/help#stata

    I can't agree that 2/2 = 0.667. Otherwise I believe you can avoid loops completely.

    Code:
    . * Example generated by -dataex-. To install: ssc install dataex
    . clear
    
    . input str2 competitorid int competitionid str8 sdate byte(win Prior_competitions Prior_wins) float Prior_win_rate
    
         compe~rid  comp~nid      sdate       win  Prio~ons  Prio~ins  Prior_w~e
      1. "1A" 123 "1/1/2016" 1 0 0    .
      2. "1A" 456 "1/2/2016" 0 0 0    .
      3. "1A" 123 "1/3/2016" 1 1 1    1
      4. "1A" 123 "1/4/2016" 0 2 2 .667
      5. "1A" 456 "1/5/2016" 0 1 0    0
      6. "1A" 456 "1/6/2016" 1 2 0    0
      7. "2A" 123 "1/1/2016" 0 0 0    .
      8. "2A" 456 "1/2/2016" 0 0 0    .
      9. "2A" 123 "1/3/2016" 1 1 0    0
     10. "2A" 123 "1/4/2016" 1 2 1   .5
     11. "2A" 456 "1/5/2016" 1 1 0    0
     12. "2A" 456 "1/6/2016" 1 2 1   .5
     13. end
    
    . gen ndate = daily(sdate, "DMY")
    
    . format ndate %td
    
    . bysort  competitorid competitionid ( ndate) : gen prior_c = _n - 1
    
    . bysort  competitorid competitionid ( ndate) : gen prior_w = sum(win[_n-1])
    
    . gen prior_wr = prior_w/prior_c
    (4 missing values generated)
    
    . list *id ndate win prior*, sepby(*id)
    
         +----------------------------------------------------------------------+
         | comp~rid   comp~nid       ndate   win   prior_c   prior_w   prior_wr |
         |----------------------------------------------------------------------|
      1. |       1A        123   01jan2016     1         0         0          . |
      2. |       1A        123   01mar2016     1         1         1          1 |
      3. |       1A        123   01apr2016     0         2         2          1 |
         |----------------------------------------------------------------------|
      4. |       1A        456   01feb2016     0         0         0          . |
      5. |       1A        456   01may2016     0         1         0          0 |
      6. |       1A        456   01jun2016     1         2         0          0 |
         |----------------------------------------------------------------------|
      7. |       2A        123   01jan2016     0         0         0          . |
      8. |       2A        123   01mar2016     1         1         0          0 |
      9. |       2A        123   01apr2016     1         2         1         .5 |
         |----------------------------------------------------------------------|
     10. |       2A        456   01feb2016     0         0         0          . |
     11. |       2A        456   01may2016     1         1         0          0 |
     12. |       2A        456   01jun2016     1         2         1         .5 |
         +----------------------------------------------------------------------+

    Comment


    • #3
      Thanks so much, Nick -- and apologies for the 2/2 error. I'll use dataex (SSC) next time.

      Comment

      Working...
      X