Recursive Percentile by increasing data

Ali Ahmed Awan

Join Date: Aug 2024
Posts: 14

Recursive Percentile by increasing data

03 Aug 2024, 17:31

Hi,

I am trying to find recursive percentiles of data.

I have voting data on proposals. Each proposal has a start date, end date and multiple votes. Each vote has a voting power, which is represented by the personal variable. I want to calculate the 90th percentile of the voting power of votes cast in all previous proposals before the start of the focal proposal. For example, for proposal 136, I want to find the 90th percentile of the voting power of all previously cast votes and place it in front of the proposal 136 votes. For proposal after 136, we should also include all previous votes and including votes cast in proposal 136 and then find the percentile.

Here is my data:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float(propid voterid votedate startday endday personalvp)
 147 3710 22425 22424 22431    178001
 136  504 22430 22424 22431 4589.7886
 147  647 22424 22424 22431      6000
1668 3521 22425 22424 22431   1414000
 136 4363 22424 22424 22431  121889.9
 147 4064 22424 22424 22431       159
 136  524 22425 22424 22431    110000
 136 3052 22426 22424 22431        69
1359  881 22425 22424 22431 16005.003
 136 3896 22426 22424 22431 1104002.5
1668 4089 22425 22424 22431  8433.903
1828  881 22424 22424 22431 16005.003
  31 3395 22424 22424 22431 4269.1855
 147 4252 22424 22424 22431  17034.73
 136 3208 22426 22424 22431 2005.8096
 227  881 22425 22424 22431 16005.003
  93 4350 22424 22424 22431  200.0035
1359 3791 22427 22424 22431     40172
  67 2409 22426 22424 22431      6000
  93 3791 22427 22424 22431     40172
1668 3565 22426 22424 22431 4182.7217
1668 1661 22425 22424 22431 14682.105
 136 3521 22425 22424 22431   1414000
 147  546 22425 22424 22431  241.7764
1668  647 22426 22424 22431      6000
  67 3551 22425 22424 22431 20190.824
  31  472 22431 22424 22431 251.49693
 147 2660 22430 22424 22431     36000
 147  940 22424 22424 22431    356310
 609 3395 22424 22424 22431 4269.1855
end
format %td votedate
format %td startday
format %td endday

Last edited by Ali Ahmed Awan; 03 Aug 2024, 17:34.

Tags: None

Clyde Schechter

Join Date: Apr 2014
Posts: 30063

03 Aug 2024, 18:35

Code:

preserve
keep votedate personalvp
sort votedate
tempfile holding
save `holding'

restore
gen `c(obs_t)' obs_no = _n
rangejoin votedate . -1 using `holding'

collapse (p90) wanted = personalvp_U (first) propid votedate startday endday personalvp, ///
    by(obs_no)
    
sort votedate propid
drop obs_no

-rangejoin- is by Robert Picard and is available from SSC. To use it, you must also install -rangestat-, by Robert Picard, Nick Cox, and Roberto Ferrer, also available from SSC.

Comment

Ali Ahmed Awan

Join Date: Aug 2024

Posts: 14
#3

03 Aug 2024, 22:16

Thank you, Clyde, for the code.

When I ran this, I got "op. sys. refuses to provide memory."

I have 80k observations and only 7 variables.

. query memory
--------------------------------------------------------------------
Memory settings
set maxvar 5000 2048-120000; max. vars allowed
set niceness 5 0-10
set min_memory 0 0-1600g
set max_memory . 32m-1600g or .
set segmentsize 32m 1m-32g
set adosize 1000 kilobytes
set max_preservemem 1g 0-1600g

I am running Windows 11 Enterprise and Stata MP 4-core 18.5. I believe I have enough RAM:

Installed Physical Memory (RAM) 32.0 GB
Total Physical Memory 31.7 GB
Available Physical Memory 21.6 GB
Total Virtual Memory 36.4 GB
Available Virtual Memory 26.6 GB
Page File Space 4.75 GB

Is there any solution for this?
Thanks!
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30063
#4

03 Aug 2024, 23:21

Try this instead. I didn't realize you had such a large data set--it's too big for this code.

Code:

capture program drop one_date program define one_date centile personalvp, centile(90) gen wanted = r(c_1) display wanted[1] exit end by votedate, sort: gen lower = cond(_n == 1, ., votedate) by votedate: gen upper = votedate-1 rangerun one_date, interval(votedate lower upper) use(votedate personalvp) by votedate (wanted), sort: replace wanted = wanted[1] sort votedate propid

-rangerun- is written by Robert Picard and available from SSC. It, too, requires that you have -rangestat- installed, but I'm inferring you've already done that. This will not only be much less memory intensive, it will be much faster than the original code would have been had it run.
1 like
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30063
#5

04 Aug 2024, 08:24

It occurred to me that I may have misunderstood your question. I have no idea what the variable personalvp is supposed to mean, so I don't know what ways of using it are appropriate and what ways are not. So it dawns on me that perhaps you want this running 90th percentile calculation done separately for each propid, whereas the code shown in #4 does it separately for each date and includes the observations for all propids that were voted on prior to its date in the calculation.

If you want it done separately by propid, the code is only a little different:

Code:

capture program drop one_run program define one_run centile personalvp, centile(90) gen wanted = r(c_1) display wanted[1] exit end by propid votedate, sort: gen lower = cond(_n == 1, ., votedate) by propid votedate: gen upper = votedate-1 rangerun one_run, interval(votedate lower upper) use(votedate personalvp) by(propid) by propid votedate (wanted), sort: replace wanted = wanted[1]
1 like
Comment
Ali Ahmed Awan

Join Date: Aug 2024

Posts: 14
#6

04 Aug 2024, 15:36

Hi Clyde,
Thank you again for your help.
I think this is not what I am looking for. I want a variable that calculates the 90th percentile of personalvp of all the previous proposals votes. Startdate of each proposal is shown and we can see if those casted votes are before the start date of the focal proposal. Persoanlvp is the weight of the vote casted by someone. Once I have the 90th percentile based on the previous votes, my final goal is to identify the high power voters in the focal proposal. Both of these codes are not exactly doing this.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30063
#7

04 Aug 2024, 15:55

Let's see if I understand what you want, because I'm not finding what you write all that clear.

The unit of analysis is actually the proposal (propid). You want to identify all votes, on any proposal at all, that chronologically precede the startday of the proposition and calculate the 90th percentile of the personalvp's among those votes.

If that is what you want, the following code, a minor variant of that in #4, should do that:

Code:

capture program drop one_date program define one_date centile personalvp, centile(90) gen wanted = r(c_1) display wanted[1] exit end by startday, sort: gen lower = cond(_n == 1, ., startday) by startday: gen upper = startday-1 rangerun one_date, interval(votedate lower upper) use(votedate personalvp) by startday (wanted), sort: replace wanted = wanted[1] sort propid startday votedate

Now, this code does not work with the example data you showed, because in that data there is only one value of startday, for all propositions, and that startday is 24 May 2021, whereas all votedays are that date or later, so there are no votes preceding any proposition start days in the data and the code will fail accordingly.

If this code does not do what you need, when posting back please do the following:
Post a new data example that has a configuration of all the date variables so that at least some of the propositions will have votes that precede them.

Try explaining again in words what you want, but, in addition, please also hand-calculate what the results for your data example should look like and show those results. With the shown results, even if I can't quite follow the explanation in words, I may be able to better perceive what is wanted.
Comment

Ali Ahmed Awan

Join Date: Aug 2024
Posts: 14

04 Aug 2024, 18:20

Hi Clyde,
I'm sorry for not being very clear. I tried this new code, but I think it's not giving the output I am looking for.

My data is proposal i,vote j, and lets say the proposals are created from i = 1, then i = 2…n, and for example I am looking at proposal 10. I want to use the personal vp (votingpower_ij) from i = 1 to i = 9 for all j to find out the distribution and then 90th percentile for it and use it for proposal 10. for proposal 11, I want to find the 90th percentile of votes for proposal i=1 to i=10 and place it in front of proposal 11.

I hope this helps.

Here is the more data.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input int(propid voterid votedate startday endday) float personalvp
1  260 23005 23002 23009 434.77335
1  117 23003 23002 23009 222361.45
1 1985 23003 23002 23009  706.6568
1 4256 23004 23002 23009 28144.137
1  289 23003 23002 23009 119.81952
1 1560 23003 23002 23009 1013.3813
1 1296 23006 23002 23009   487.677
1 2614 23005 23002 23009  2865.476
1 3016 23006 23002 23009  3322.486
1  259 23009 23002 23009  5.046278
1 3462 23009 23002 23009 188.42117
1 2525 23006 23002 23009  119081.6
1 3626 23007 23002 23009 18.193323
1  695 23003 23002 23009  332.5309
1 1827 23005 23002 23009  3176.301
1 4036 23003 23002 23009        25
1 4013 23005 23002 23009  1159.202
1 1615 23009 23002 23009  11273.88
1 4474 23005 23002 23009         0
1 1337 23003 23002 23009 112.31062
1 4257 23006 23002 23009 247.64377
1  137 23004 23002 23009 2304.1035
1  103 23003 23002 23009 35024.332
1   16 23007 23002 23009 15.191555
1 1050 23006 23002 23009  4655.418
1 3085 23009 23002 23009  908.3974
1 1540 23006 23002 23009  92886.45
1 2529 23004 23002 23009  9.086147
1  815 23003 23002 23009  14.10034
1 1120 23004 23002 23009     13000
1  691 23008 23002 23009  18330.22
1  255 23003 23002 23009  197.9965
1 2442 23003 23002 23009  6.184289
1 1500 23003 23002 23009        15
1 1980 23008 23002 23009  6.169023
1 2211 23005 23002 23009  31.79943
1  657 23003 23002 23009  201.0495
1 3494 23004 23002 23009 27.519766
1 1580 23006 23002 23009 100.00812
1 2679 23007 23002 23009    53.761
1  719 23003 23002 23009  1639.149
1  928 23003 23002 23009   157.432
1 2065 23007 23002 23009   925.282
1 4057 23008 23002 23009       100
1 2400 23005 23002 23009 101.94601
1 1661 23003 23002 23009   9299.51
1  693 23007 23002 23009 257.34985
1 2116 23005 23002 23009       705
1 2047 23003 23002 23009  10.47606
2 4003 23252 23251 23265  2652.393
2 3644 23251 23251 23265       101
2 3862 23251 23251 23265     46.42
2 1540 23259 23251 23265  100493.1
2  509 23251 23251 23265 219.92685
2 2533 23251 23251 23265  259.7312
2 1623 23264 23251 23265  992.9409
2  389 23251 23251 23265 10.986145
2 1617 23255 23251 23265  768.2908
2  657 23251 23251 23265  300.0884
2 1827 23251 23251 23265   3345.41
2   26 23252 23251 23265      60.1
2 2829 23259 23251 23265 1074.2432
2 2804 23251 23251 23265 10.758668
2 2871 23257 23251 23265  8515.141
2 3253 23255 23251 23265  5138.463
2 1183 23259 23251 23265       100
2 3622 23254 23251 23265  861.5488
2 4176 23256 23251 23265  138.4365
2 3526 23262 23251 23265 1365.0107
2 1296 23251 23251 23265 1026.5492
2  769 23255 23251 23265  537.2064
2  414 23253 23251 23265       100
2 1497 23251 23251 23265  102.4375
2 1325 23258 23251 23265 1466.0007
2  693 23251 23251 23265  762.8785
2 3828 23259 23251 23265 210.62195
2  567 23251 23251 23265  442.5756
2 3385 23252 23251 23265   100.999
2 2403 23252 23251 23265 1499.8783
2 1713 23251 23251 23265  676.9896
2  117 23251 23251 23265 282235.28
2  958 23256 23251 23265  2587.324
2 3896 23252 23251 23265  848121.2
2 4256 23251 23251 23265  30102.39
2  103 23251 23251 23265  36782.19
2 1658 23251 23251 23265  4073.821
2  137 23251 23251 23265  3803.511
2  695 23257 23251 23265  551.9114
2  937 23251 23251 23265  503.3798
2 4474 23252 23251 23265  6412.084
2 1192 23251 23251 23265   25461.6
2 4237 23258 23251 23265 1023.9673
2 1286 23253 23251 23265  3231.102
2 3969 23255 23251 23265  30.36248
4 1627 22836 22832 22837   201.221
4 2056 22833 22832 22837 1511.5885
4 2001 22835 22832 22837   450.127
4 3820 22835 22832 22837 36.988346
4  535 22833 22832 22837  2171.743
4  103 22832 22832 22837  31842.68
end
format %td votedate
format %td startday
format %td endday

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30063
#9

04 Aug 2024, 19:04

OK, but let's double check I understand this. This is actually radically different from anything you said before. It is the propid that drives this now. In fact, for all observations having the same propid, we will have the same value of the wanted variable. That will be the 90th percentile of all personalvp values associated with any smaller propid--it has nothing to do with dates at all. (And that's important, because in the example data, propid 4 starts earlier than either propid 1 or propid 2.) In fact, we should be able to do this from a data set that has only propid and personalvp in it. Is that right? If so:

Code:

capture program drop one_propid program define one_propid centile personalvp, centile(90) gen wanted = r(c_1) exit end sort propid, stable by propid: gen lower = cond(_n == 1, ., propid) gen upper = propid-1 rangerun one_propid, interval(propid lower upper) use(propid personalvp) by propid (wanted), sort: replace wanted = wanted[1]

I am not sure what you mean when you say you want the result placed "in front of" proposition j. I'm assuming you want it to be in the observations with propid == j. If you care about the left to right order in the data set, you can just -order wanted, before(propid)- at the end.
Comment
Ali Ahmed Awan

Join Date: Aug 2024

Posts: 14
#10

04 Aug 2024, 23:03

Well, not exactly.
This part is correct that all observations having the same propid, we will have the same value of the wanted variable. Your understanding that the "90th percentile of all personalvp values associated with any smaller propid--it has nothing to do with dates at all" is not correct. propid is a random ID generated for each proposal and only used for the identification of a proposal. Order of propid does not matter. Start date of the propid sets the order of the proposal and should be taken into consideration.

Please see the image below. Once I have the wanted values for each proposal, I can compare each vote's personalvp (personal voting power) in that proposal to see whether they are greater than the wanted value or not. The final step is to identify high voting power voters in a proposal based on the previously cast votes in the 90th percentile. I didn't mention this step earlier because its too basic.

I hope this helps. Let me know if you have any questions.
Comment
Ali Ahmed Awan

Join Date: Aug 2024

Posts: 14
#11

04 Aug 2024, 23:10

Also, we will have missing want values for the first group of proposals with the same start date.
Comment

Hemanshu Kumar

Join Date: Mar 2015
Posts: 1371

#12

05 Aug 2024, 02:03

I don't know if I'm following this correctly, but is this all you need?

Code:

levelsof startday, local(startdates)

gen wanted = .

foreach stdt of local startdates {
    centile personalvp if startday < `stdt', centile(90)
    replace wanted = r(c_1) if startday == `stdt'
}

which produces:

Code:

. egen byte tagged = tag(propid)
. list propid startday endday wanted if tagged, noobs

  +-------------------------------------------+
  | propid    startday      endday     wanted |
  |-------------------------------------------|
  |      1   23dec2022   30dec2022   31842.68 |
  |      2   29aug2023   12sep2023   29623.55 |
  |      4   06jul2022   11jul2022          . |
  +-------------------------------------------+

Comment

Ali Ahmed Awan

Join Date: Aug 2024

Posts: 14
#13

05 Aug 2024, 06:23

Hi Hemanshu,
Thanks for the code. But wanted is all empty.

Last edited by Ali Ahmed Awan; 05 Aug 2024, 06:39.
Comment
Hemanshu Kumar

Join Date: Mar 2015

Posts: 1371
#14

05 Aug 2024, 07:36

As you can see, that is not the case for the example data you gave. So it is hard for me to diagnose what is happening in your actual data. Are you sure no typos have been made in copying my code?
Comment
Ali Ahmed Awan

Join Date: Aug 2024

Posts: 14
#15

05 Aug 2024, 09:28

Hi Hemanshu,
I am sorry for my earlier post. Your code is right on the spot. It does what I need. thank you!
Comment

Announcement

Recursive Percentile by increasing data

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment