Hi Statalist folks,
I'm having trouble creating a variable, and I'd like some help. I’ve figured out one way to do it, but I think it is prohibitively computationally demanding.
Essentially, I'm looking at individual competitors and trying to figure out, for a given competition, their average win rate in prior competitions of the same kind. The relevant variables are:
competitor_id (this identifies the competitor)
competition_id (this is type of competition)
win (a dummy variable, 1 if win and 0 if loss)
date (formatted %tcDD/NN/CCYY)
Example:
So, for the last observation I’m asking the question: What’s competitor 2A’s win rate at competitions 456 that occurred before the date of the observation?
One way that works is as follows. For each competitor_id, I simply count the number of prior observations that have the same competition_id. Then, I count the number of those observations where the competitor also wins. Then, I divide the two to get the prior win rate:
gen priorcompetitions=.
gen priorwins=.
* Generate local macros for each competitor’s competition dates and competition_ids
quietly levelsof competitor_id, local(competitor)
foreach nm of local competitor {
quietly levelsof datetime if competitor_id=="`nm'", local(dates`nm')
quietly levelsof competition_id if competitor_id=="`nm'", local(competition`nm')
}
* Use a triple loop (competitor, dates, competition) to create the variables
foreach nm of local competitor {
foreach ad of local competition`nm'{
foreach dt of local dates`nm'{
count if (competitor_id=="`nm'" & competition_id == `ad' & datetime < `dt')
replace priorcompetitions = r(N) if (competitor_id=="`nm'" & competition_id == `ad' & datetime == `dt')
count if (competitor_id=="`nm'" & competition_id == `ad' & datetime < `dt' & win==1)
replace priorwins = r(N) if (competitor_id=="`nm'" & competition_id == `ad' & datetime == `dt')
}
}
}
gen priorwinrate=priorwins/priorcompetitions
Is there an easier (less computationally demanding) way of doing this, perhaps using by x y z: egen (total) or a similar command? I have 250k+ observations, and it's been running for over 5 hours... (I tested a smaller sample of 300 and it works fine).
Thanks,
Daniel
I'm having trouble creating a variable, and I'd like some help. I’ve figured out one way to do it, but I think it is prohibitively computationally demanding.
Essentially, I'm looking at individual competitors and trying to figure out, for a given competition, their average win rate in prior competitions of the same kind. The relevant variables are:
competitor_id (this identifies the competitor)
competition_id (this is type of competition)
win (a dummy variable, 1 if win and 0 if loss)
date (formatted %tcDD/NN/CCYY)
Example:
Competitor id | Competition id | Date | Win | Desired: Prior competitions of that kind | Desired: Prior wins in those competitions | Desired: Prior win rate |
1A | 123 | 1/1/2016 | 1 | 0 | 0 | . |
1A | 456 | 1/2/2016 | 0 | 0 | 0 | . |
1A | 123 | 1/3/2016 | 1 | 1 | 1 | 1 |
1A | 123 | 1/4/2016 | 0 | 2 | 2 | 0.667 |
1A | 456 | 1/5/2016 | 0 | 1 | 0 | 0 |
1A | 456 | 1/6/2016 | 1 | 2 | 0 | 0 |
2A | 123 | 1/1/2016 | 0 | 0 | 0 | . |
2A | 456 | 1/2/2016 | 0 | 0 | 0 | . |
2A | 123 | 1/3/2016 | 1 | 1 | 0 | 0 |
2A | 123 | 1/4/2016 | 1 | 2 | 1 | 0.5 |
2A | 456 | 1/5/2016 | 1 | 1 | 0 | 0 |
2A | 456 | 1/6/2016 | 1 | 2 | 1 | 0.5 |
One way that works is as follows. For each competitor_id, I simply count the number of prior observations that have the same competition_id. Then, I count the number of those observations where the competitor also wins. Then, I divide the two to get the prior win rate:
gen priorcompetitions=.
gen priorwins=.
* Generate local macros for each competitor’s competition dates and competition_ids
quietly levelsof competitor_id, local(competitor)
foreach nm of local competitor {
quietly levelsof datetime if competitor_id=="`nm'", local(dates`nm')
quietly levelsof competition_id if competitor_id=="`nm'", local(competition`nm')
}
* Use a triple loop (competitor, dates, competition) to create the variables
foreach nm of local competitor {
foreach ad of local competition`nm'{
foreach dt of local dates`nm'{
count if (competitor_id=="`nm'" & competition_id == `ad' & datetime < `dt')
replace priorcompetitions = r(N) if (competitor_id=="`nm'" & competition_id == `ad' & datetime == `dt')
count if (competitor_id=="`nm'" & competition_id == `ad' & datetime < `dt' & win==1)
replace priorwins = r(N) if (competitor_id=="`nm'" & competition_id == `ad' & datetime == `dt')
}
}
}
gen priorwinrate=priorwins/priorcompetitions
Is there an easier (less computationally demanding) way of doing this, perhaps using by x y z: egen (total) or a similar command? I have 250k+ observations, and it's been running for over 5 hours... (I tested a smaller sample of 300 and it works fine).
Thanks,
Daniel
Comment