Creating a variable counting prior observations of the same kind

Daniel Tartakovsky

Join Date: Jan 2017
Posts: 2

Creating a variable counting prior observations of the same kind

29 Jan 2017, 16:31

Hi Statalist folks,

I'm having trouble creating a variable, and I'd like some help. I’ve figured out one way to do it, but I think it is prohibitively computationally demanding.

Essentially, I'm looking at individual competitors and trying to figure out, for a given competition, their average win rate in prior competitions of the same kind. The relevant variables are:

competitor_id (this identifies the competitor)
competition_id (this is type of competition)
win (a dummy variable, 1 if win and 0 if loss)
date (formatted %tcDD/NN/CCYY)

Example:

Competitor id	Competition id	Date	Win	Desired: Prior competitions of that kind	Desired: Prior wins in those competitions	Desired: Prior win rate
1A	123	1/1/2016	1	0	0	.
1A	456	1/2/2016	0	0	0	.
1A	123	1/3/2016	1	1	1	1
1A	123	1/4/2016	0	2	2	0.667
1A	456	1/5/2016	0	1	0	0
1A	456	1/6/2016	1	2	0	0
2A	123	1/1/2016	0	0	0	.
2A	456	1/2/2016	0	0	0	.
2A	123	1/3/2016	1	1	0	0
2A	123	1/4/2016	1	2	1	0.5
2A	456	1/5/2016	1	1	0	0
2A	456	1/6/2016	1	2	1	0.5

So, for the last observation I’m asking the question: What’s competitor 2A’s win rate at competitions 456 that occurred before the date of the observation?

One way that works is as follows. For each competitor_id, I simply count the number of prior observations that have the same competition_id. Then, I count the number of those observations where the competitor also wins. Then, I divide the two to get the prior win rate:

gen priorcompetitions=.
gen priorwins=.

* Generate local macros for each competitor’s competition dates and competition_ids
quietly levelsof competitor_id, local(competitor)
foreach nm of local competitor {
quietly levelsof datetime if competitor_id=="`nm'", local(dates`nm')
quietly levelsof competition_id if competitor_id=="`nm'", local(competition`nm')
}

* Use a triple loop (competitor, dates, competition) to create the variables
foreach nm of local competitor {
foreach ad of local competition`nm'{
foreach dt of local dates`nm'{
count if (competitor_id=="`nm'" & competition_id == `ad' & datetime < `dt')
replace priorcompetitions = r(N) if (competitor_id=="`nm'" & competition_id == `ad' & datetime == `dt')
count if (competitor_id=="`nm'" & competition_id == `ad' & datetime < `dt' & win==1)
replace priorwins = r(N) if (competitor_id=="`nm'" & competition_id == `ad' & datetime == `dt')
}
}
}
gen priorwinrate=priorwins/priorcompetitions

Is there an easier (less computationally demanding) way of doing this, perhaps using by x y z: egen (total) or a similar command? I have 250k+ observations, and it's been running for over 5 hours... (I tested a smaller sample of 300 and it works fine).

Thanks,
Daniel

Tags: None

Nick Cox

Join Date: Mar 2014
Posts: 35754

29 Jan 2017, 16:56

Your data table is helpful but using dataex (SSC) as requested in the FAQ Advice would have been even more helpful. See http://www.statalist.org/forums/help#stata

I can't agree that 2/2 = 0.667. Otherwise I believe you can avoid loops completely.

Code:

. * Example generated by -dataex-. To install: ssc install dataex
. clear

. input str2 competitorid int competitionid str8 sdate byte(win Prior_competitions Prior_wins) float Prior_win_rate

     compe~rid  comp~nid      sdate       win  Prio~ons  Prio~ins  Prior_w~e
  1. "1A" 123 "1/1/2016" 1 0 0    .
  2. "1A" 456 "1/2/2016" 0 0 0    .
  3. "1A" 123 "1/3/2016" 1 1 1    1
  4. "1A" 123 "1/4/2016" 0 2 2 .667
  5. "1A" 456 "1/5/2016" 0 1 0    0
  6. "1A" 456 "1/6/2016" 1 2 0    0
  7. "2A" 123 "1/1/2016" 0 0 0    .
  8. "2A" 456 "1/2/2016" 0 0 0    .
  9. "2A" 123 "1/3/2016" 1 1 0    0
 10. "2A" 123 "1/4/2016" 1 2 1   .5
 11. "2A" 456 "1/5/2016" 1 1 0    0
 12. "2A" 456 "1/6/2016" 1 2 1   .5
 13. end

. gen ndate = daily(sdate, "DMY")

. format ndate %td

. bysort  competitorid competitionid ( ndate) : gen prior_c = _n - 1

. bysort  competitorid competitionid ( ndate) : gen prior_w = sum(win[_n-1])

. gen prior_wr = prior_w/prior_c
(4 missing values generated)

. list *id ndate win prior*, sepby(*id)

     +----------------------------------------------------------------------+
     | comp~rid   comp~nid       ndate   win   prior_c   prior_w   prior_wr |
     |----------------------------------------------------------------------|
  1. |       1A        123   01jan2016     1         0         0          . |
  2. |       1A        123   01mar2016     1         1         1          1 |
  3. |       1A        123   01apr2016     0         2         2          1 |
     |----------------------------------------------------------------------|
  4. |       1A        456   01feb2016     0         0         0          . |
  5. |       1A        456   01may2016     0         1         0          0 |
  6. |       1A        456   01jun2016     1         2         0          0 |
     |----------------------------------------------------------------------|
  7. |       2A        123   01jan2016     0         0         0          . |
  8. |       2A        123   01mar2016     1         1         0          0 |
  9. |       2A        123   01apr2016     1         2         1         .5 |
     |----------------------------------------------------------------------|
 10. |       2A        456   01feb2016     0         0         0          . |
 11. |       2A        456   01may2016     1         1         0          0 |
 12. |       2A        456   01jun2016     1         2         1         .5 |
     +----------------------------------------------------------------------+

Comment

Daniel Tartakovsky

Join Date: Jan 2017

Posts: 2
#3

29 Jan 2017, 17:42

Thanks so much, Nick -- and apologies for the 2/2 error. I'll use dataex (SSC) next time.
Comment

Announcement

Creating a variable counting prior observations of the same kind

Comment

Comment